Attentional Bottleneck: Towards an Interpretable Deep Driving Network

Jinkyu Kim and Mayank Bansal
Waymo Research
{jinkyukim, mayban}@waymo.com

Abstract. Deep neural networks are a key component of behavior prediction and motion generation for self-driving cars. One of their main drawbacks is a lack of transparency: ideally, they should provide easy-to-interpret rationales for what triggers certain behaviors. We propose an architecture called Attentional Bottleneck with the goal of improving transparency. Our key idea is to combine visual attention, which identifies what aspects of the input the model is using, with an information bottleneck that enables the model to use only the aspects of the input which are important. This not only provides sparse and interpretable attention maps (e.g. focusing only on specific vehicles in the scene), but it adds this transparency at no cost to model accuracy. In fact, we find slight improvements in accuracy when applying Attentional Bottleneck to the ChauffeurNet model, whereas accuracy deteriorates with a traditional visual attention model.

Keywords: Self-driving vehicles, eXplainable AI, Motion generation

1 Introduction

Deep neural networks are powerful function estimators and have been a key component in self-driving software systems [2, 30]. Such networks are, however, notoriously cryptic – their hidden layer activations may have no obvious relation to the function being estimated by the network. Interpretable models that make deep models more transparent are important for a number of reasons: (1) user acceptance: neural network autonomous control is a radical technology that requires a very high level of user trust; (2) extrapolation: users should be able to anticipate what the vehicle will do in most scenarios by understanding the causal behavior of the model; and (3) human-vehicle communication: communication can be grounded in the vehicle's internal state.

Fig. 1: An overview of our interpretable driving model. Our model takes a top-down input representation I and outputs the future agent poses Y along with an attention map. An Attentional Bottleneck encodes the inputs I to a latent vector z while also producing an interpretable attention heat map. The motion generator operates in a partially observable environment using only the dense scene context S ⊂ I along with z to predict poses Y.

One way of making models transparent is via visual attention [28, 17, 14]. Visual attention finds spatially varying scalar attention weights α(x, y) ∈ [0, 1], typically by learning a multi-layer perceptron over a set of input features F = {f(x, y)}. Attended features A = {a(x, y)}, obtained as a(x, y) = α(x, y) f(x, y), are then used by the model instead of the original features F. The model is trained end-to-end, leading the attention weights to link the network's output to its input – visualizing the weights as a 2D heatmap thus provides insight into the areas of the input image that the network attends to. Furthermore, to be easily interpretable, attention needs to be sparse (i.e. low entropy), while ideally also enhancing the performance of the original model.
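To make the weighting above concrete, the following is a minimal PyTorch sketch of computing a(x, y) = α(x, y) f(x, y) with an MLP-predicted weight per location. The shapes, the sigmoid squashing, and the hidden width are our own illustrative choices, not the paper's implementation; the baseline described later in Sec. 3.2 instead normalizes the weights with a spatial softmax.

```python
import torch
import torch.nn as nn

class AttendedFeatures(nn.Module):
    """a(x, y) = alpha(x, y) * f(x, y), with alpha predicted per location by a small MLP."""

    def __init__(self, d: int):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(d, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, F: torch.Tensor):
        # F: (batch, h, w, d) grid of feature vectors f(x, y)
        alpha = torch.sigmoid(self.mlp(F))   # scalar weights in [0, 1] per location
        return alpha * F, alpha.squeeze(-1)  # attended features and a 2D heatmap

# Example: a 50 x 50 grid of 128-dimensional features.
attended, heatmap = AttendedFeatures(128)(torch.randn(1, 50, 50, 128))
```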
Unfortunately, given the complexity of the driving task, we find that a straightforward integration of attention maps tends to find all potentially salient image areas, resulting in limited interpretability (e.g. Fig. 2).

In this work, we achieve sparse and salient attention maps together with good final model performance by attaching attention to a bottlenecked latent representation of the input. However, given the information loss in the bottleneck, we need to provide the model direct access to a subset of dense inputs (e.g. road lane geometry and connectivity information) that are harder to compress. This frees up the bottleneck branch to focus on selecting the most relevant parts of the dynamic input (e.g. nearby objects), while retaining the model performance.

End-to-end driving models that directly process a camera image as input have several scene elements confounded into nearby pixels, making a separation into dense and sparse input subsets infeasible. Therefore, we focus on improving the interpretability of a driving model that uses a mid-level input representation. This means that instead of directly using low-level sensor data, the model uses higher-level semantic information like objects detected by a perception system. As a proxy for such a network, we work with the recently published ChauffeurNet [2] model, although the ideas presented are more generally applicable. The inputs I to this network consist of information about the roadmap, traffic lights, dynamic objects, etc., rendered in separate channels in a common top-down view coordinate system around the agent. The model predicts future agent poses Y in the same top-down view (see Fig. 3).

To generate sparser and more interpretable attention maps, we propose an architecture called Attentional Bottleneck (Fig. 1) that combines visual attention with the information bottleneck approach [23] of training deep models through supervised learning [1, 6, 10]. We define z as a bottleneck latent representation of an attention-weighted feature encoding A_I = α_I · F_I of the input features F_I. We leverage the mid-level input representation to separate the subset of dense inputs into a set S ⊂ I. Conditioned on z and S, the motion generator finally predicts the target Y. Our goal is to learn both the attention weighting function α_I and an encoding z that is maximally informative about the target Y. To prevent z from being the identity encoding of the inputs and to focus the network on specific areas of causality, we impose an information bottleneck constraint on the complexity of z via a pooling operation. We preserve spatial information in the attention map by incorporating a positional encoding step, and encode non-local information by using Atrous convolutions.

We evaluate our approach on the large-scale (≈ 60 days of continuous driving) dataset from [2] and show quantitative and qualitative results illustrating that our generated attention maps yield a much sparser (and thus more interpretable) visualization of the internal states than a baseline visual attention model. We also show that our approach improves motion generation accuracy, in contrast to a traditional visual attention model, which decreases accuracy.

2 Related Work

2.1 Deep Driving Models

Recently there has been growing interest in end-to-end driving models that process raw sensor data to directly output driving controls.
Most of these approaches learn a driving policy through supervised regression over observation-action pairs from human drivers. ALVINN (Autonomous Land Vehicle In a Neural Network) [19] was the first attempt to train a neural network for the navigational task of road following. Bojarski et al. [4] trained a 5-layer ConvNet to predict steering controls only from a dashcam image, while Xu et al. [27] utilized a dilated ConvNet combined with an LSTM to predict the vehicle's discretized future motions. Hecker et al. [12] explored an extended model that takes a surround-view multi-camera system, a route planner, and a CAN bus reader as inputs. Codevilla et al. [9] explored a conditional end-to-end driving model that takes a high-level command input (i.e. left, straight, right) at test time. These models show good performance in simple driving scenarios (i.e. lane following). Their behavior, however, is opaque, and learning to drive in urban areas remains challenging.

To reduce the complexity and for better interpretability, there is growing interest in end-to-mid and mid-to-mid driving models that produce a mid-level output representation in the form of a drivable trajectory by consuming either raw sensor data or an intermediate scene representation as input. Zeng et al. [30] introduced an end-to-mid neural motion generator, which takes Lidar data and an HD map as inputs and outputs a future trajectory. This model also detects 3D bounding boxes as an intermediate representation. Bansal et al. [2] introduced ChauffeurNet, a mid-to-mid model that takes advantage of separate perception and control components. Using a top-down representation of the environment and intended route as input, the model outputs a driving trajectory that is consumed by a controller, which then translates it to steering and acceleration. Recent works [5, 25] also suggest that such a top-down scene representation can successfully be used to learn high-level semantic information. In this work, we focus on improving the explainability of such a model.

2.2 Visual Explanations

Explainability of deep neural networks has seen growing interest in computer vision and machine learning [11]. In landmark work, Zeiler et al. [29] utilized deconvolution layers to visualize the internal representation of a ConvNet. Bojarski et al. [3] developed a richer notion of the contribution of a pixel to the output, while other approaches [31, 21] have explored synthesizing an image that causes high neuron activations. However, a difficulty with deconvolution-based approaches is the lack of a formal notion of the contribution of spatially-extended features (rather than pixels).

Attention-based approaches [28, 17] have been increasingly employed for improving a model's ability to explain itself by providing spatial attention maps that highlight areas of the image that the network attends to. Kim et al. [14] utilize an attention model followed by additional salience filtering to show regions that causally affect the output. To reduce the complexity of explanations, Wang et al. [26] introduce an instance-level attention model that finds objects (i.e. cars and pedestrians) that the network needs to pay attention to. Such attention may be more intuitive and interpretable for users to understand the model's behavior. However, the model needs to take the whole input context as an additional input, which may compromise the causality of the attention – explanations may not represent causal relationships between the system's input and its behavior.
To preserve the causality, we use a top-down representation of the environment as an input, which consists of information around the agent rendered in separable channels.

Fig. 2: (left) Attentional Bottleneck design compared with a baseline visual attention model applied to ChauffeurNet. (right) Comparison of attention maps from our model against those from a baseline visual attention model. Note that our heatmaps are much sparser and thus more interpretable.

Another notable approach is the work by Chen et al. [7], which defined human-interpretable intermediate features such as the curvature, the deviation to neighboring lanes, and the distances to vehicles ahead. A CNN is trained to produce these features from an image, and a simple controller maps them to a steering angle. Similarly, Sauer et al. [20] proposed a conditional affordance learning approach that maps visual inputs to intermediate representations conditioned on a high-level command input. However, such intermediate feature descriptors provide a limited and ad-hoc vocabulary for explanations. Zeng et al. [30] co-trained a perception model that provides bounding boxes of dynamic objects, which are then used as an intermediate and interpretable feature. Here, we instead take full advantage of existing well-established perception systems with different sensor sources (i.e. Lidar, Radar, and Camera) via a mid-level input representation. This reduces the complexity of the driving network and employs more reliable perception outputs.

There is also a growing effort on textual explanations that justify the decisions that were made and explain the "why" in natural language [13, 17, 15]. However, textual explanations are often rationalizations – explanations that justify the system's behavior in a post-hoc manner – and are less helpful with understanding the causal behavior of the model. In this work, we focus on improving the attention-based mechanism, to provide introspective explanations that are based on the system's internal state and represent causal relationships between the system's input and its behavior.

3 Mid-to-mid Driving Model with Visual Attention

3.1 ChauffeurNet

Bansal et al. [2] introduced a mid-to-mid driving network called ChauffeurNet that recurrently predicts future poses of the agent by processing a top-down representation of the environment as input. For completeness, we summarize some of the key details of the paper here.

Fig. 3: Top-down rendered inputs I (left) and outputs Y (right) for the ChauffeurNet model. The subset of dense scene context inputs S is shown in the top row.

The input to the neural network consists of a set of images I of size W × H pixels rendered into a top-down view coordinate system that is fixed relative to the current location of the agent. As shown in Fig. 3, the set I contains:
(a) Roadmap: a 3-channel image with a rendering of color-coded lanes, stop signs, cross-walks, etc.
(b) Speed limit: a gray-scale image with lane centers color coded in proportion to their known speed limit.
(c) Past agent poses: the ego-vehicle's past poses rendered as a trail of points.
(d) Current agent box: the current agent represented by a full bounding box.
(e) The intended route.
(f) Traffic lights: a gray-scale image where each lane center is color coded to reflect different traffic light states (red light: brightest gray level, green light: darkest gray level).
(g) Dynamic objects: a gray-scale image that renders all the potential dynamic objects (vehicles, cyclists, pedestrians) as oriented boxes.
Both (f) and (g) are a sequence of 5 images reflecting the environment state over the past 5 timesteps (see the sketch of the resulting tensor shape below).
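To make the shape of this mid-level representation concrete, the following sketch assembles a hypothetical multi-channel top-down tensor from pre-rendered channel images. The channel stubs and shapes are illustrative placeholders (matching the 400 × 400 resolution reported in Sec. 5.1), not ChauffeurNet's actual rendering code.

```python
import numpy as np

# Hypothetical pre-rendered top-down channels, each H x W. In the real system these
# come from the perception stack and the prior map, not from this stub.
H, W = 400, 400  # 80 m x 80 m field of view (Sec. 5.1)

def blank(channels: int = 1) -> np.ndarray:
    return np.zeros((H, W, channels), dtype=np.float32)

roadmap        = blank(3)   # (a) color-coded lanes, stop signs, crosswalks, ...
speed_limit    = blank(1)   # (b) lane centers shaded by known speed limit
past_poses     = blank(1)   # (c) ego-vehicle past poses as a trail of points
current_box    = blank(1)   # (d) current agent bounding box
route          = blank(1)   # (e) intended route
traffic_lights = blank(5)   # (f) one frame per past timestep, shaded by light state
dynamic_objs   = blank(5)   # (g) oriented boxes for vehicles, cyclists, pedestrians

# Stack everything into a single W x H x C tensor I; the dense scene context S
# (roadmap, speed limits, route, ...) is a subset of these channels.
I = np.concatenate(
    [roadmap, speed_limit, past_poses, current_box, route, traffic_lights, dynamic_objs],
    axis=-1,
)
print(I.shape)  # (400, 400, 17)
```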
Visual Encoder (FeatureNet). In the ChauffeurNet model, the rendered inputs I are fed to a large-receptive-field convolutional FeatureNet with skip connections, which outputs features F that capture the environmental context and the intent. This feature map F (of size w × h × d) contains a set of d-dimensional latent vectors over the spatial dimension, i.e. F = {f_1, f_2, ..., f_l}, where f_i ∈ R^d and l (= w × h) is the spatial dimension of the extracted features. Selecting a subset of these feature slices allows the attention model to selectively attend to different parts of the input images.

Motion Generator (AgentRNN). The feature encoding F is fed to a recurrent neural network (AgentRNN) which predicts the outputs Y, consisting of the next point p_k on the driving trajectory and the agent bounding box heatmap B_k, conditioned on the features F, the iteration number k ∈ {1, ..., N}, the memory M_{k-1} of past predictions from AgentRNN, and the agent bounding box heatmap B_{k-1} predicted in the previous iteration:

p_k, B_k = AgentRNN(k, F, M_{k-1}, B_{k-1})    (1)

3.2 ChauffeurNet with Visual Attention

One way of making models interpretable is via visual attention [17, 14]. These models provide introspective (visual) explanations by filtering out non-salient image regions – the remaining (attended) regions potentially have a causal effect on the output. The goal of visual attention is to find an attended feature A = {a_1, a_2, ..., a_l}, where a_i ∈ R^d, from the original feature F. These models utilize a deterministic soft attention mechanism that is trainable by standard back-propagation methods, which thus has advantages over a hard stochastic attention mechanism that requires reinforcement learning. As discussed by several works [28, 14], the attended features can be computed as a_i = π(α_i, f_i) = α_i f_i for i = {1, 2, ..., l}, where α_i are scalar attention weights in [0, 1] satisfying Σ_i α_i = 1. These weights are estimated from the input features F, typically by a multi-layer perceptron, i.e. α_i = f_MLP(f_i), where the parameters of f_MLP are learned as part of training the entire model end-to-end. Since the attention weights vary spatially and depend on the input (via the features F), they can be visualized as an attention heatmap aligned with the input image, with brighter regions reflecting areas salient for the task.

To allow us to explain the driving decisions made by ChauffeurNet, we apply this vanilla visual attention approach by replacing the original features F in Eq. 1 with the attended features A, as shown in Fig. 2 (left).
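As a rough sketch of how this baseline wires together, the PyTorch-style snippet below computes softmax-normalized attention weights with a per-location MLP (α_i = f_MLP(f_i)) and feeds the attended features into a recurrent rollout following the interface of Eq. 1. The AgentRNN cell is a stub and all shapes are our own simplifications, not the ChauffeurNet implementation; in particular, we let the stub return its updated memory explicitly for clarity.

```python
import torch
import torch.nn as nn

class VanillaVisualAttention(nn.Module):
    """Baseline soft attention: alpha_i = softmax_i(f_MLP(f_i)), a_i = alpha_i * f_i."""

    def __init__(self, d: int, hidden: int = 128):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(d, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, F: torch.Tensor):
        # F: (batch, l, d), where l = w * h flattened spatial locations.
        logits = self.mlp(F)                      # (batch, l, 1)
        alpha = torch.softmax(logits, dim=1)      # weights sum to 1 over spatial locations
        A = alpha * F                             # attended features a_i = alpha_i * f_i
        return A, alpha.squeeze(-1)               # alpha reshaped to (w, h) gives the heatmap

def rollout(agent_rnn_cell, A: torch.Tensor, num_iterations: int):
    """Recurrent motion generation following the interface of Eq. 1 (cell is a stub)."""
    M, B = None, None                             # memory and previous box heatmap
    waypoints = []
    for k in range(1, num_iterations + 1):
        # p_k, B_k = AgentRNN(k, F, M_{k-1}, B_{k-1}); here F is replaced by A.
        p_k, B, M = agent_rnn_cell(k, A, M, B)
        waypoints.append(p_k)
    return waypoints
```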
As shown in Fig. 2 (right), this approach generates vague and verbose attention maps which do not add to the interpretability of the model. We therefore use this approach as a baseline for our Attentional Bottleneck approach.

4 Attentional Bottleneck

We propose a novel architecture called Attentional Bottleneck with a focus on generating sparse and fine-grained visual explanations. We encode the environment I through an information bottleneck that serves to restrict information in the input to only its most relevant parts, and thus allows the driving model to focus on specific features in the environment. We tie this feature selection to the spatial distribution of features by employing a spatial attention mechanism before the bottleneck. While the driving task involves focusing on specific objects and entities in the scene for the immediate driving decisions, humans also employ a holistic understanding of some elements of the environment. For example, humans are aware of the overall map of the environment through visual scanning or through looking at a navigation app. We find that compressing this kind of dense information through the bottleneck either leads to dense attention maps or degrades the model performance. Therefore, we leverage the mid-level separable input representation and provide the model full access to a subset of inputs S ⊂ I containing the dense context about the environment, through a separate branch. This frees up the bottleneck branch to focus on specific parts of the input (e.g. specific objects), making the attention map sparser and more interpretable. As shown in Fig. 2, our modified ChauffeurNet model consists of a dense input encoder branch and an attentional bottleneck branch, both providing encoded input features to the AgentRNN.

Fig. 4: Atrous Spatial Attention Block. We apply three parallel Atrous convolutions with different atrous rates. The resulting features from all three branches are then concatenated and fed into 1×1 convolution and softmax layers to generate the attention weights α.

Grounding Attentional Bottleneck into AgentRNN. Like the baseline model, the inputs I are first encoded into features F_I by the FeatureNet network. To capture non-local information, we propose an Atrous Spatial Attention layer that computes the attention weights α_I and outputs the attended features A_I. The attended features are depth-concatenated with a positional encoding V, followed by a multi-layer perceptron g_MLP and a pooling layer, to generate the final bottleneck representation z:

z = Σ_{i=1}^{l} g_MLP([a_i; v_i])    (2)

The dense scene context inputs S are similarly encoded into features F_S using another FeatureNet network with an identical architecture. We modify AgentRNN to incorporate the bottleneck vector by concatenating it with each of the features f_i ∈ F_S:

p_k, B_k = AgentRNN(k, F_S, z, M_{k-1}, B_{k-1})    (3)

We discuss the Atrous Spatial Attention and the positional encoding stages in the following paragraphs and present ablation results for these blocks in the experiments.
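The following sketch shows one plausible reading of Eqs. 2 and 3: attended features are concatenated with a positional basis, passed through g_MLP, pooled over space into z, and z is then broadcast-concatenated with the dense scene features before the motion generator. This is a hedged PyTorch illustration; the module sizes and the broadcast scheme are assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class AttentionalBottleneck(nn.Module):
    """z = sum_i g_MLP([a_i ; v_i])  (Eq. 2), with a_i attended features, v_i positional basis."""

    def __init__(self, d: int, z_dim: int = 128):
        super().__init__()
        self.g_mlp = nn.Sequential(nn.Linear(2 * d, z_dim), nn.ReLU(), nn.Linear(z_dim, z_dim))

    def forward(self, attended: torch.Tensor, pos: torch.Tensor) -> torch.Tensor:
        # attended, pos: (batch, l, d) with l = w * h spatial locations.
        z = self.g_mlp(torch.cat([attended, pos], dim=-1)).sum(dim=1)  # pool over space
        return z                                                       # (batch, z_dim)

def condition_dense_features(F_S: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
    """Concatenate z with every dense scene feature f_i in F_S before AgentRNN (Eq. 3)."""
    batch, l, _ = F_S.shape
    z_tiled = z.unsqueeze(1).expand(batch, l, z.shape[-1])
    return torch.cat([F_S, z_tiled], dim=-1)       # (batch, l, d + z_dim)
```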
Atrous Spatial Attention. Attention models are typically applied to features generated by the last layer of a convolutional encoder. Attention weights for each spatial location are usually computed independently, allowing them to capture local information around the corresponding spatial location (e.g. "there is a pedestrian running"). However, we argue that the attention model also needs to capture non-local information, especially for the driving task (e.g. "there is a pedestrian running towards the crosswalk ahead"). Seo et al. [22] explored using a 3 × 3 convolution to consider local context in generating attention maps. Here, we advocate using Atrous convolution (also known as dilated convolution), which has been shown to be effective for accurately capturing semantic information at an arbitrary scale [8]. As shown in Fig. 4, we apply three parallel Atrous convolutions with different rates on top of the feature map F_I. For the implementation, we closely follow the work by Chen et al. [8]. Specifically, our atrous convolution layers include a 1 × 1 convolution and two 3 × 3 convolutions with rates 2 and 4, each with d filters and batch normalization. The resulting features from all the branches are then concatenated and fed into another 1 × 1 convolution to generate the attention logits. A spatial softmax yields normalized attention weights α.

Positional Encoding. As shown in Equation 2, to obtain the latent bottleneck vector z, we use spatial summation over the attended features A_I, which removes the positional information. To preserve this information, we append a spatial basis to the features A_I. Following Vaswani et al. [24] and Parmar et al. [18], we generate a spatial basis V (of the same dimension as F) that contains d-dimensional vectors V = {v_1, v_2, ..., v_l}, where v_i ∈ R^d. Each vector v_i encodes positional information about the spatial location (x_i, y_i) using four types of Fourier basis functions, viz. sin(x_i/f_u), cos(x_i/f_u), sin(y_i/f_u), and cos(y_i/f_u), where f_u = 1000^u is the spatial wavelength with channel index u ∈ {0, 4/d, 8/d, ..., d/d}. Each positional encoding feature v_i is then concatenated with the corresponding attended feature a_i ∈ A_I as shown in Equation 2.
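Below is a hedged PyTorch sketch of these two components: an atrous spatial attention block with the three parallel branches described above, and a Fourier positional basis in the spirit of [24, 18]. The filter counts, the wavelength schedule, and the normalization details are assumptions rather than the published configuration.

```python
import torch
import torch.nn as nn

class AtrousSpatialAttention(nn.Module):
    """Three parallel branches (1x1; 3x3 rate 2; 3x3 rate 4) -> concat -> 1x1 logits -> spatial softmax."""

    def __init__(self, d: int):
        super().__init__()
        def branch(kernel: int, rate: int) -> nn.Module:
            pad = rate * (kernel // 2)
            return nn.Sequential(
                nn.Conv2d(d, d, kernel, padding=pad, dilation=rate),
                nn.BatchNorm2d(d), nn.ReLU())
        self.b1, self.b2, self.b3 = branch(1, 1), branch(3, 2), branch(3, 4)
        self.to_logits = nn.Conv2d(3 * d, 1, kernel_size=1)

    def forward(self, F: torch.Tensor):
        # F: (batch, d, h, w)
        x = torch.cat([self.b1(F), self.b2(F), self.b3(F)], dim=1)
        logits = self.to_logits(x)                                        # (batch, 1, h, w)
        alpha = torch.softmax(logits.flatten(2), dim=-1).view_as(logits)  # spatial softmax
        return alpha * F, alpha                                           # attended features A_I, heatmap

def fourier_positional_basis(h: int, w: int, d: int) -> torch.Tensor:
    """Spatial basis V with sin/cos of x and y at geometrically spaced wavelengths (assumed form).

    Requires d to be a multiple of 4 (four basis types per wavelength)."""
    ys, xs = torch.meshgrid(torch.arange(h).float(), torch.arange(w).float(), indexing="ij")
    feats = []
    for u in torch.linspace(0.0, 1.0, steps=d // 4):
        f_u = 1000.0 ** u
        feats += [torch.sin(xs / f_u), torch.cos(xs / f_u),
                  torch.sin(ys / f_u), torch.cos(ys / f_u)]
    return torch.stack(feats, dim=0)                                      # (d, h, w)
```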
5 Experiment

5.1 Dataset

We use the large-scale dataset from [2] that contains over 26 million expert driving examples, amounting to about 60 days of continuous driving. Data has been collected by a vehicle instrumented with multiple sensors (i.e. cameras, lidar, radar). A reliable perception system provides accurate environmental descriptions including dynamic objects (i.e. vehicles and pedestrians) and traffic light states. Along with perception, the dataset also provides: (i) the prior map of the environment (i.e. roadmap), (ii) vehicle pose information, and (iii) the speed limits. The input field of view is 80 m × 80 m (a resolution of 400 × 400 pixels in image coordinates) and the effective forward sensing range of the ego-vehicle is 64 meters.

5.2 Training and Evaluation Details

We trained our models end-to-end with Adam optimization [16] using exponentially decaying learning rates and random initialization (i.e. no pre-trained weights), following ChauffeurNet's default losses and training strategy [2]. The losses consist of pure imitation losses (i.e. agent position, heading, box, and meta-prediction losses) as well as environment losses that provide better generalization (i.e. collision loss, on-road loss, geometry loss, and auxiliary losses). Our FeatureNet creates features F with dimensions 50 × 50 × 128, which lead to 50 × 50 attention maps that are upsampled to the input resolution of 400 × 400 by a pyramid expansion step.

Fig. 5: Typical examples of attention heat maps in diverse driving scenarios (slowing down, stopping, driving on a curvy road, avoiding obstacles, and lane following). Our model attends to driving-related visual cues, e.g. highlighting stop/yield signs, crosswalks or cars ahead that cause braking, road contours on curved roads, or multiple pinch points from parked cars on narrow roads.

For quantitative evaluation, we use the following metrics:

ADE and FDE. To quantitatively evaluate motion generation performance, we use two widely-used (Euclidean distance-based) metrics: (i) the average displacement error ADE = (1/K) Σ_{k=0}^{K} ||p̂_k − p^gt_k||_2, and (ii) the final displacement error FDE = ||p̂_K − p^gt_K||_2, where K = 10 is the total number of predicted waypoints and the superscript gt denotes the ground-truth values.

Collision. We also measure a collision rate via the potential overlap of the predicted agent box with the ground-truth boxes of all other objects in the scene, i.e. (1/K) Σ_{k=0}^{K} Σ_{i,j} B^gt_{k−1}(i, j) B_k(i, j).

Entropy S(α). To measure the sparseness of the generated attention maps, we compute the entropy of the generated attention heat map α, i.e. S(α) = − Σ_{i=1}^{l} α_i log α_i.
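The metrics above are straightforward to compute from the predicted and ground-truth waypoints and box heatmaps. The NumPy sketch below is an illustrative implementation under our own simplified conventions (we average over the K predicted waypoints and ignore the exact index offsets in the paper's formulas); it is not the evaluation code used in the paper.

```python
import numpy as np

def ade_fde(pred: np.ndarray, gt: np.ndarray):
    """pred, gt: (K, 2) arrays of predicted / ground-truth waypoints in pixels."""
    dists = np.linalg.norm(pred - gt, axis=-1)
    return dists.mean(), dists[-1]                 # ADE, FDE

def collision_rate(pred_boxes: np.ndarray, gt_boxes: np.ndarray) -> float:
    """Mean overlap of predicted agent-box heatmaps B_k with rasterized ground-truth boxes.

    pred_boxes, gt_boxes: (K, H, W). The paper pairs B_k with boxes indexed k-1;
    we pair same-index frames here for simplicity."""
    K = pred_boxes.shape[0]
    return float(sum((gt_boxes[k] * pred_boxes[k]).sum() for k in range(K)) / K)

def attention_entropy(alpha: np.ndarray, eps: float = 1e-12) -> float:
    """S(alpha) = -sum_i alpha_i log alpha_i for a normalized attention map."""
    a = alpha.flatten()
    return float(-(a * np.log(a + eps)).sum())

# Example usage with random placeholder trajectories.
pred = np.cumsum(np.random.rand(10, 2), axis=0)
ade, fde = ade_fde(pred, pred + 0.5)
```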
5.3 Quantitative Analysis

Fig. 6: Comparison of motion generation performance (ADE and FDE in pixels, collision rate) and attention map sparsity (entropy) between the baseline ChauffeurNet, visual attention, and Attentional Bottleneck ablation designs. A: ChauffeurNet; B: A + Visual Attention; C: A + Attentional Bottleneck; D: C + PerceptionRNN; E: D + Atrous Spatial Attention; F: E + Positional Encoding.

Fig. 7: Qualitative comparison of attention maps between the Attentional Bottleneck ablation designs C–F (model labels as in Fig. 6).

We start by quantitatively comparing our attentional bottleneck model (model F) with the baseline ChauffeurNet [2] (model A) and the ChauffeurNet with visual attention (model B) models (see Fig. 2). We observe in Fig. 6 that incorporating visual attention to improve the interpretability of the baseline model degrades its performance, as measured by the larger ADE and FDE numbers. This is not the case with our attentional bottleneck model, where we observe improved ADE and FDE numbers – possibly due to improved focus by the model on specific causal factors. Examples in Fig. 2 (right) compare our attention maps to those from the visual attention model, and confirm that the latter generates verbose attention heat maps – finding all potentially salient objects. In contrast, our model provides much sparser attention heat maps which are easier to associate with specific objects or rendered features and are thus easier to interpret. This is evident by comparing their distributions, where the attention weights from our model are mostly concentrated around zero probability values (see supplemental figures).

Effect of incorporating Behavior Prediction. As detailed in [2], the prediction of potential future trajectories of dynamic objects in the scene helps the network learn better features for the motion generation task. However, to accomplish this the network would need to attend to all the objects. This renders the attentional bottleneck useless for the objective of exposing only the objects relevant for the primary goal of agent motion generation. In this paper, our focus is on the main motion generator branch, but for completeness, we present one particular architecture choice that allows us to enable PerceptionRNN [2] while preserving the interpretability of the attentional bottleneck. We add an additional FeatureNet and Atrous Spatial Attention branch that encodes only the dynamic object channels O into attended object features A_O, which are fed into PerceptionRNN. To allow the PerceptionRNN losses to influence the motion generation task, we also inject A_O into the AgentRNN bottleneck branch by modifying the features f_i ∈ F_I to f'_i = [f_i; a_i], where ; denotes depth concatenation and a_i ∈ A_O. In Fig. 6, we compare the metrics from a baseline Attentional Bottleneck model against one that is co-trained with PerceptionRNN in the above setup and observe that this co-training indeed provides improvements in all metrics. As shown in Fig. 7, the incorporation of PerceptionRNN (model D) improves situation-specific dependence on salient dynamic objects in the scene, i.e. vehicles ahead and pedestrians crossing the crosswalk.

Effect of Atrous Spatial Attention. We illustrate the effect of capturing more non-local information using Atrous convolutions in Fig. 6. Relative to the model using only a local receptive field, this model achieves better ADE, FDE, and Collision numbers while also improving the sparsity of the attention map. Qualitatively, this model tends to attend to inter-object image regions (compare D vs. E in Fig. 7).

Effect of Positional Encoding. Quantitatively, the motion generation regression performance degrades slightly (but is still better than the original model) as we concatenate a spatial basis to obtain the latent bottleneck vector. The attention maps become sparser and we find them to have better spatial alignment with the causal objects (compare E and F in Fig. 7). Thus there is some tension between sparse maps and motion generation performance.

5.4 Qualitative Analysis

Fig. 5 shows several examples covering common driving scenarios that involve slowing down, stopping, avoiding obstacles, etc. We visualize a flattened view of the input (odd rows) and the corresponding attention maps (even rows). The attention maps are overlaid on the input images and include contour lines for easier viewing.
Note that our attention maps are quite sparse and provide plausible visual evidence of what triggered a particular behavior, e.g. highlighting stop/yield signs or vehicles ahead causing deceleration, road contours when on curved roads, or multiple pinch points from parked vehicles when on narrow roads.

Fig. 8: (A-C) Attention maps over time, sampled every 5 timesteps, illustrate the smooth variation of attention. For better visualization, attention maps are overlaid on the satellite map image. We provide six snapshots on the bottom. Our model appears to attend to important cues, e.g. (a) multiple pinch points, (b) oncoming lanes at a T-junction, (c, e) a stop sign, (d, e) a crosswalk, and (f) vehicles. We also provide a video as supplemental material. Map data © Google 2020.

Attention Maps over Time. In Fig. 8, we illustrate attention maps over time for typical driving scenarios, i.e. avoiding multiple pinch points, stopping at an intersection with 4-way stop signs, etc. We also provide snapshots in the right column where our model attends to important driving-related cues. Our supplemental video also demonstrates the smooth variation of the attention maps across several temporal sequences, illustrating the changing causality with each driving decision.

Fig. 9: (A) Examples of counterfactual outcomes where the driving model appears to attend to alternative important cues, e.g. (top) vehicles ahead → a stop sign, (bottom) red lights → vehicles ahead. (B) Examples where the driving model appears to under-attend to important cues, e.g. ignoring oncoming lanes at intersections.

Counterfactual Experiments. Fig. 9 (A) shows examples where the attention maps change in counterfactual driving scenarios created by removing a subset of the environmental descriptions, i.e. dynamic objects (with vs. without vehicles) and traffic light states (red vs. green). These indicate that our model successfully focuses on the correct cues in specific situations. We provide more diverse examples as supplemental materials.

Failure Cases. Fig. 9 (B) shows examples where the attention maps fail to capture the correct causal behavior. This indicates either that the original model failed to focus on the correct cues in specific situations (e.g. "look both ways before making a right turn"), or that the attention model is not able to correctly explain the situation.

6 Conclusions

We described an approach for improving the interpretability of a mid-to-mid deep driving model by augmenting a visual attention model with an attentional bottleneck layer. Our results highlight sparse attention maps which are easy to interpret and do not degrade model performance. We see opportunity in taking this further to generate instance-level attention maps and to also use these maps as a guide for improving the performance of the baseline driving model.

Acknowledgements

We thank Dragomir Anguelov, Anca Dragan, and Alexander Gorban at Waymo Research, and John Canny, Trevor Darrell, Anna Rohrbach, and Yang Gao at UC Berkeley for their helpful comments.

Bibliography

[1] Alemi, A.A., Fischer, I., Dillon, J.V., Murphy, K.: Deep variational information bottleneck. ICLR (2017)
[2] Bansal, M., Krizhevsky, A., Ogale, A.: ChauffeurNet: Learning to drive by imitating the best and synthesizing the worst. RSS (2019)
[3] Bojarski, M., Choromanska, A., Choromanski, K., Firner, B., Jackel, L., Muller, U., Zieba, K.: VisualBackProp: Visualizing CNNs for autonomous driving. arXiv preprint (2016)
[4] Bojarski, M., Del Testa, D., Dworakowski, D., Firner, B., Flepp, B., Goyal, P., Jackel, L.D., Monfort, M., Muller, U., Zhang, J., et al.: End to end learning for self-driving cars. CoRR abs/1604.07316 (2016)
[5] Chai, Y., Sapp, B., Bansal, M., Anguelov, D.: MultiPath: Multiple probabilistic anchor trajectory hypotheses for behavior prediction. CoRL (2019)
[6] Chalk, M., Marre, O., Tkacik, G.: Relevant sparse codes with variational information bottleneck. In: NeurIPS. pp. 1957–1965 (2016)
[7] Chen, C., Seff, A., Kornhauser, A., Xiao, J.: DeepDriving: Learning affordance for direct perception in autonomous driving. In: CVPR. pp. 2722–2730 (2015)
[8] Chen, L.C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. TPAMI 40(4), 834–848 (2017)
[9] Codevilla, F., Müller, M., López, A., Koltun, V., Dosovitskiy, A.: End-to-end driving via conditional imitation learning. In: 2018 IEEE International Conference on Robotics and Automation (ICRA). pp. 1–9. IEEE (2018)
[10] Goyal, A., Islam, R., Strouse, D., Ahmed, Z., Botvinick, M., Larochelle, H., Levine, S., Bengio, Y.: InfoBot: Transfer and exploration via the information bottleneck. ICLR (2019)
[11] Gunning, D.: Explainable artificial intelligence (XAI). Defense Advanced Research Projects Agency (DARPA) (2017)
[12] Hecker, S., Dai, D., Van Gool, L.: End-to-end learning of driving models with surround-view cameras and route planners. In: ECCV (2018)
[13] Hendricks, L.A., Hu, R., Darrell, T., Akata, Z.: Grounding visual explanations. In: ECCV (2018)
[14] Kim, J., Canny, J.: Interpretable learning for self-driving cars by visualizing causal attention. ICCV (2017)
[15] Kim, J., Rohrbach, A., Darrell, T., Canny, J., Akata, Z.: Textual explanations for self-driving vehicles. In: ECCV (2018)
[16] Kingma, D., Ba, J.: Adam: A method for stochastic optimization. ICLR (2015)
[17] Park, D.H., Hendricks, L.A., Akata, Z., Schiele, B., Darrell, T., Rohrbach, M.: Multimodal explanations: Justifying decisions and pointing to the evidence. In: CVPR (2018)
[18] Parmar, N., Vaswani, A., Uszkoreit, J., Kaiser, L., Shazeer, N., Ku, A., Tran, D.: Image transformer. ICML (2018)
[19] Pomerleau, D.A.: ALVINN: An autonomous land vehicle in a neural network. In: NeurIPS. pp. 305–313 (1989)
[20] Sauer, A., Savinov, N., Geiger, A.: Conditional affordance learning for driving in urban environments. CoRL (2018)
[21] Selvaraju, R.R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., Batra, D.: Grad-CAM: Visual explanations from deep networks via gradient-based localization. In: ICCV. pp. 618–626 (2017)
[22] Seo, P.H., Lin, Z., Cohen, S., Shen, X., Han, B.: Progressive attention networks for visual attribute prediction. BMVC (2018)
[23] Tishby, N., Pereira, F.C., Bialek, W.: The information bottleneck method. In: Proceedings of the 37th Allerton Conference on Communication, Control, and Computing. pp. 368–377 (1999)
[24] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NeurIPS (2017)
[25] Wang, D., Devin, C., Cai, Q.Z., Krähenbühl, P., Darrell, T.: Monocular plan view networks for autonomous driving. IROS (2019)
[26] Wang, D., Devin, C., Cai, Q.Z., Yu, F., Darrell, T.: Deep object centric policies for autonomous driving. ICRA (2019)
[27] Xu, H., Gao, Y., Yu, F., Darrell, T.: End-to-end learning of driving models from large-scale video datasets. In: CVPR (2017)
[28] Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhutdinov, R., Zemel, R., Bengio, Y.: Show, attend and tell: Neural image caption generation with visual attention. In: ICML. pp. 2048–2057 (2015)
[29] Zeiler, M.D., Fergus, R.: Visualizing and understanding convolutional networks. In: ECCV. pp. 818–833. Springer (2014)
[30] Zeng, W., Luo, W., Suo, S., Sadat, A., Yang, B., Casas, S., Urtasun, R.: End-to-end interpretable neural motion planner. In: CVPR. pp. 8660–8669 (2019)
[31] Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., Torralba, A.: Learning deep features for discriminative localization. In: CVPR. pp. 2921–2929 (2016)