Attentional Bottleneck: Towards an Interpretable Deep Driving Network

Jinkyu Kim and Mayank Bansal
Waymo Research
{jinkyukim, mayban}@waymo.com

Abstract. Deep neural networks are a key component of behavior prediction and motion generation for self-driving cars. One of their main drawbacks is a lack of transparency: they should provide easy-to-interpret rationales for what triggers certain behaviors. We propose an architecture called Attentional Bottleneck with the goal of improving transparency. Our key idea is to combine visual attention, which identifies what aspects of the input the model is using, with an information bottleneck that enables the model to only use aspects of the input which are important. This not only provides sparse and interpretable attention maps (e.g. focusing only on specific vehicles in the scene), but it adds this transparency at no cost to model accuracy. In fact, we find slight improvements in accuracy when applying Attentional Bottleneck to the ChauffeurNet model, whereas we find that the accuracy deteriorates with a traditional visual attention model.

Keywords: Self-driving vehicles, eXplainable AI, Motion generation

1 Introduction

Deep neural networks are powerful function estimators and have been a key component in self-driving software systems [2, 30]. Such networks are, however, notoriously cryptic – their hidden layer activations may have no obvious relation to the function being estimated by the network. Interpretable models that make deep models more transparent are important for a number of reasons: (1) user-acceptance: neural network autonomous control is a radical technology that requires a very high level of user trust, (2) extrapolation: users should be able to anticipate what the vehicle will do in most scenarios by understanding the causal behavior of the model, and (3) human-vehicle communication: communication can be grounded in the vehicle's internal state.

Fig. 1: An overview of our interpretable driving model. Our model takes a top-down input representation I and outputs the future agent poses Y along with an attention map. An Attentional Bottleneck encodes the inputs I to a latent vector z while also producing an interpretable attention heat map. The motion generator operates in a partially observable environment using only the dense scene context S ⊂ I along with z to predict poses Y.

One way of making models transparent is via visual attention [28, 17, 14]. Visual attention finds spatially varying scalar attention weights α(x,y) ∈ [0,1], typically by learning a multi-layer perceptron from a set of input features F = {f(x,y)}. Attended features A = {a(x,y)}, obtained as a(x,y) = α(x,y)f(x,y), are then used by the model instead of the original features F. The model is trained end-to-end, leading the attention weights to link the network's output to its input – visualizing the weights as a 2D heatmap thus provides insight into the areas of the input image that the network attends to. Furthermore, to be easily interpretable, attention needs to be sparse (i.e. low entropy), while ideally also enhancing the performance of the original model.
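As a concrete illustration of this mechanism, the following is a minimal NumPy sketch of soft visual attention over a feature map. The two-layer perceptron, its hidden size, and the use of a spatial softmax to normalize α are assumptions made for the example, not details of any particular published driving model.

```python
import numpy as np

def spatial_softmax(logits):
    """Normalize per-location logits into attention weights that sum to 1."""
    e = np.exp(logits - logits.max())
    return e / e.sum()

def soft_attention(F, W1, b1, W2, b2):
    """Soft visual attention over a feature map F of shape (h, w, d).
    A small MLP maps each d-dim feature f(x, y) to a scalar logit; a spatial
    softmax turns the logits into weights alpha(x, y) in [0, 1]; the attended
    features are a(x, y) = alpha(x, y) * f(x, y)."""
    h, w, d = F.shape
    flat = F.reshape(-1, d)                      # one feature vector per location
    hidden = np.tanh(flat @ W1 + b1)             # (h*w, hidden)
    logits = (hidden @ W2 + b2).ravel()          # (h*w,)
    alpha = spatial_softmax(logits).reshape(h, w, 1)
    return alpha * F, alpha                      # attended features A and heatmap alpha

# Toy usage with random features and (normally learned) weights.
rng = np.random.default_rng(0)
F = rng.standard_normal((50, 50, 128))
W1, b1 = 0.01 * rng.standard_normal((128, 32)), np.zeros(32)
W2, b2 = 0.01 * rng.standard_normal((32, 1)), np.zeros(1)
A, alpha = soft_attention(F, W1, b1, W2, b2)     # alpha can be rendered as a heatmap
```

In a trained model the perceptron parameters are learned end-to-end with the rest of the network, so the heatmap reflects what the model actually relies on rather than a fixed saliency measure.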
Unfortunately, given the complexity of the driving task, we find that a straightforward integration of attention maps tends to find all potentially salient image areas, resulting in limited interpretability (e.g. Fig. 2).

In this work, we manage to achieve sparse and salient attention maps and good final model performance by attaching attention to a bottlenecked latent representation of the input. However, given the information loss in the bottleneck, we need to provide the model direct access to a subset of dense inputs (e.g. road lane geometry and connectivity information) that are harder to compress. This frees up the bottleneck branch to focus on selecting the most relevant parts of the dynamic input (e.g. nearby objects), while retaining the model performance.

End-to-end driving models that directly process a camera image as input have several scene elements confounded into nearby pixels, thus making a separation into dense and sparse input subsets infeasible. Therefore, we focus on improving the interpretability of a driving model that uses a mid-level input representation. This means that instead of directly using low-level sensor data, the model uses higher-level semantic information like objects detected by a perception system. As a proxy for such a network, we work with the recently published ChauffeurNet [2] model, although the ideas presented are more generally applicable. The inputs I to this network consist of information about the roadmap, traffic lights, dynamic objects, etc., rendered in separate channels in a common top-down view coordinate system around the agent. The model predicts future agent poses Y in the same top-down view (see Fig. 3).

To generate sparser and more interpretable attention maps, we propose an architecture called Attentional Bottleneck (Fig. 1) that combines visual attention with the information bottleneck approach [23] of training deep models through supervised learning [1, 6, 10]. We define z as a bottleneck latent representation of an attention-weighted feature encoding A_I = α_I · F_I of the input features F_I. We leverage the mid-level input representation to separate the subset of dense inputs into a set S ⊂ I. Conditioned on z and S, the motion generator finally predicts the target Y. Our goal is to learn both the attention weighting function α_I and an encoding z that is maximally informative about the target Y. To prevent z from being the identity encoding of the inputs and to focus the network on specific areas of causality, we impose an information bottleneck constraint on the complexity of z by a pooling operation. We preserve spatial information in the attention map by incorporating a positional encoding step, and encode non-local information by using Atrous convolutions.

We evaluate our approach on the large-scale (≈ 60 days of continuous driving) dataset from [2] and show quantitative and qualitative results illustrating that our generated attention maps result in a much sparser (and thus more interpretable) visualization of the internal states than a baseline visual attention model. We also show that our approach improves the motion generation accuracy, in contrast to a traditional visual attention model that results in decreased accuracy.

2 Related Work

2.1 Deep Driving Models

Recently, there has been growing interest in end-to-end driving models that process raw sensor data to directly output driving controls. Most of these approaches learn a driving policy through supervised regression over observation-action pairs from human drivers.
ALVINN (Autonomous Land Vehicle In a Neural Network) [19] was the first attempt to train a neural network for the navigational task of road following. Bojarski et al. [4] trained a 5-layer ConvNet to predict steering controls only from a dashcam image, while Xu et al. [27] utilized a dilated ConvNet combined with an LSTM to predict the vehicle's discretized future motions. Hecker et al. [12] explored an extended model that takes input from a surround-view multi-camera system, a route planner, and a CAN bus reader. Codevilla et al. [9] explored a conditional end-to-end driving model that takes a high-level command input (i.e. left, straight, right) at test time. These models show good performance in simple driving scenarios (i.e. lane following). Their behavior, however, is opaque, and learning to drive in urban areas remains challenging.

To reduce the complexity and for better interpretability, there is growing interest in end-to-mid and mid-to-mid driving models that produce a mid-level output representation in the form of a drivable trajectory by consuming either raw sensor data or an intermediate scene representation as input. Zeng et al. [30] introduced an end-to-mid neural motion generator, which takes Lidar data and an HD map as inputs and outputs a future trajectory. This model also detects 3D bounding boxes as an intermediate representation. Bansal et al. [2] introduced ChauffeurNet, a mid-to-mid model that takes advantage of separate perception and control components. Using a top-down representation of the environment and intended route as input, the model outputs a driving trajectory that is consumed by a controller, which then translates it to steering and acceleration. Recent works [5, 25] also suggest that such a top-down scene representation can successfully be used to learn high-level semantic information. In this work, we focus on improving the explainability of such a model.

2.2 Visual Explanations

Explainability of deep neural networks has seen growing interest in computer vision and machine learning [11]. In landmark work, Zeiler et al. [29] utilized deconvolution layers to visualize the internal representation of a ConvNet. Bojarski et al. [3] developed a richer notion of the contribution of a pixel to the output, while other approaches [31, 21] have explored synthesizing an image causing high neuron activations. However, a difficulty with deconvolution-based approaches is the lack of a formal notion of contribution of spatially-extended features (rather than pixels).

Attention-based approaches [28, 17] have been increasingly employed for improving a model's ability to explain itself by providing spatial attention maps that highlight areas of the image that the network attends to. Kim et al. [14] utilize an attention model followed by additional salience filtering to show regions that causally affect the output. To reduce the complexity of explanations, Wang et al. [26] introduce an instance-level attention model that finds objects (i.e. cars and pedestrians) that the network needs to pay attention to. Such attention may be more intuitive and interpretable for users to understand the model's behavior. However, the model needs to take the whole input context as an additional input, which may compromise the causality of the attention – explanations may not represent causal relationships between the system's input and its behavior. To preserve the causality, we use a top-down representation of the environment as an input, which consists of information around the agent rendered in separable channels. Another notable approach is the work by Chen et al.
[7], which defined human-interpretable intermediate features such as the curvature, the deviation to neighboring lanes, and the distances to vehicles ahead. A CNN is trained to produce these features from an image, and a simple controller maps them to a steering angle. Similarly, Sauer et al. [20] proposed a conditional affordance learning approach that maps visual inputs to intermediate representations conditioned on a high-level command input. However, the intermediate feature descriptors provide a limited and ad-hoc vocabulary for explanations. Zeng et al. [30] co-trained a perception model that provides bounding boxes of dynamic objects, which are then used as an intermediate and interpretable feature. Here, we instead take full advantage of existing well-established perception systems with different sensor sources (i.e. Lidar, Radar, and Camera) via a mid-level input representation. This reduces the complexity of the driving network and employs more reliable perception outputs.

Fig. 2: (left) Attentional Bottleneck design compared with a baseline visual attention model applied to ChauffeurNet. (right) Comparison of attention maps from our model against those from a baseline visual attention model. Note that our heatmaps are much sparser and thus more interpretable.

There is also a growing effort on textual explanations that justify the decisions that were made and explain the "why" in natural language [13, 17, 15]. However, textual explanations are often rationalizations – explanations that justify the system's behavior in a post-hoc manner – and are less helpful with understanding the causal behavior of the model. In this work, we focus on improving the attention-based mechanism – to provide introspective explanations that are based on the system's internal state and represent causal relationships between the system's input and its behavior.

3 Mid-to-mid Driving Model with Visual Attention

3.1 ChauffeurNet

Bansal et al. [2] introduced a mid-to-mid driving network called ChauffeurNet that recurrently predicts future poses of the agent by processing a top-down representation of the environment as an input. For completeness, we summarize some of the key details of the paper here. The input to the neural network consists of a set of images I of size W×H pixels rendered into a top-down view coordinate system that is fixed relative to the current location of the agent. As shown in Fig. 3, the set I contains: (a) Roadmap: a 3-channel image with a rendering of color-coded lanes, stop signs, cross-walks, etc. (b) Speed limit: a gray-scale image with lane centers color coded in proportion to their known speed limit. (c) Past agent poses: the ego-vehicle's past poses rendered as a trail of points.
(d) Current agent box: the current agent represented by a full bounding box. (e) Route: the intended route. (f) Traffic lights: a gray-scale image where each lane center is color coded to reflect different traffic light states (red light: brightest gray level, green light: darkest gray level). (g) Dynamic objects: a gray-scale image that renders all the potential dynamic objects (vehicles, cyclists, pedestrians) as oriented boxes. Both (f) and (g) are a sequence of 5 images reflecting the environment state over the past 5 timesteps.

Fig. 3: Top-down rendered inputs I (left) and outputs Y (right) for the ChauffeurNet model. The subset of dense scene context inputs S are shown in the top row.

Visual Encoder (FeatureNet). In the ChauffeurNet model, the rendered inputs I are fed to a large-receptive-field convolutional FeatureNet with skip connections, which outputs features F that capture the environmental context and the intent. This feature F (of size w×h×d) contains a set of d-dimensional latent vectors over the spatial dimension, i.e. F = {f_1, f_2, ..., f_l}, where f_i ∈ R^d and l (= w×h) is the spatial dimension of the extracted features. Selecting a subset of these feature slices will allow the attention model to selectively attend to different parts of the input images.

Motion Generator (AgentRNN). The feature encoding F is fed to a recurrent neural network (AgentRNN) which predicts the outputs Y consisting of the next point p_k on the driving trajectory and the agent bounding box heatmap B_k, conditioned on the features F, the iteration number k ∈ {1,...,N}, the memory M_{k−1} of past predictions from AgentRNN, and the agent bounding box heatmap B_{k−1} predicted in the previous iteration:

p_k, B_k = AgentRNN(k, F, M_{k−1}, B_{k−1})    (1)

3.2 ChauffeurNet with Visual Attention

One way of making models interpretable is via visual attention [17, 14]. These models provide introspective (visual) explanations by filtering out non-salient image regions – the remaining (attended) regions potentially have a causal effect on the output. The goal of visual attention is to find an attended feature A = {a_1, a_2, ..., a_l}, where a_i ∈ R^d, from the original feature F. These models utilize a deterministic soft attention mechanism that is trainable by standard back-propagation methods, which thus has advantages over a hard stochastic attention mechanism that requires reinforcement learning. As discussed in several works [28, 14], the attended features can be computed as a_i = π(α_i, f_i) = α_i f_i for i = {1,2,...,l}, where the α_i are scalar attention weights in [0,1] satisfying Σ_i α_i = 1. These weights are estimated from the input features F, typically by a multi-layer perceptron, i.e. α_i = f_MLP(f_i), where the parameters of f_MLP are learned as part of training the entire model end-to-end. Since the attention weights vary spatially and depend on the input (via the features F), they can be visualized as an attention heatmap aligned with the input image, with brighter regions reflecting areas salient for the task.

To allow us to explain the driving decisions made by ChauffeurNet, we apply this vanilla visual attention approach by replacing the original features F in Eq. 1 with the attended features A, as shown in Fig. 2 (left). As shown in Fig. 2 (right), this approach generates vague and verbose attention maps which do not add to the interpretability of the model. Therefore, we use this approach as a baseline for our "Attentional Bottleneck" approach.
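To make the substitution of A for F in Eq. 1 concrete, the sketch below rolls out the AgentRNN recurrence in Python. The `agent_rnn` callable and the `memory` object are placeholders for ChauffeurNet internals that are not specified here; only the call structure (iteration number, features, memory of past predictions, previous box heatmap) follows Eq. 1.

```python
def rollout(agent_rnn, attended_features, initial_box, memory, num_iterations=10):
    """Schematic AgentRNN rollout: predict waypoints p_k and box heatmaps B_k.
    Each step is conditioned on the iteration number k, the (attended) features,
    the memory M_{k-1} of past predictions, and the previous box heatmap B_{k-1},
    mirroring Eq. 1 with the attended features A substituted for F."""
    poses, box_prev = [], initial_box
    for k in range(1, num_iterations + 1):
        p_k, box_k = agent_rnn(k, attended_features, memory.read(), box_prev)
        memory.write(box_k)      # update the memory M_k with the latest prediction
        poses.append(p_k)
        box_prev = box_k
    return poses
```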
4 Attentional Bottleneck

We propose a novel architecture called Attentional Bottleneck with a focus on generating sparse and fine-grained visual explanations. We encode the environment I through an information bottleneck that serves to restrict information in the input to only its most relevant parts, and thus allows the driving model to focus on specific features in the environment. We tie this feature selection to the spatial distribution of features by employing a spatial attention mechanism before the bottleneck. While the driving task involves focusing on specific objects and entities in the scene for the immediate driving decisions, humans also employ a holistic understanding of some elements of the environment. For example, humans are aware of the overall map of the environment through visual scanning or through looking at a navigation app. We find that compressing this kind of dense information through the bottleneck either leads to dense attention maps or degrades the model performance. Therefore, we leverage the mid-level separable input representation and provide the model full access to a subset of inputs S ⊂ I containing the dense context about the environment, through a separate branch. This frees up the bottleneck branch to focus on specific parts of the input (e.g. specific objects), making the attention map sparser and more interpretable. As shown in Fig. 2, our modified ChauffeurNet model consists of a dense input encoder branch and an attentional bottleneck branch providing encoded input features to the AgentRNN.

Grounding Attentional Bottleneck into AgentRNN. Like the baseline model, the inputs I are first encoded into features F_I by the FeatureNet network. To capture non-local information, we propose an Atrous Spatial Attention layer that computes the attention weights α_I and outputs the attended features A_I. The attended features are depth-concatenated with a positional encoding V, followed by a multi-layer perceptron g_MLP and an average pooling layer, to generate the final bottleneck representation z:

z = Σ_{i=1}^{l} g_MLP([a_i; v_i])    (2)

The dense scene context inputs S are similarly encoded into features F_S using another FeatureNet network with identical architecture. We modify AgentRNN to incorporate the bottleneck vector by concatenating it with each of the features f_i ∈ F_S:

p_k, B_k = AgentRNN(k, F_S, z, M_{k−1}, B_{k−1})    (3)

We discuss the Atrous Spatial Attention and the positional encoding stages in the following paragraphs and present ablation results for these blocks in the experiments.

Fig. 4: Atrous Spatial Attention Block. We apply three parallel Atrous convolutions with different atrous rates. The resulting features from all three branches are then concatenated and fed into 1×1 convolution and softmax layers to generate the attention weights α.

Atrous Spatial Attention. Attention models are typically applied to features generated by the last layer of a convolutional encoder. Attention weights for each spatial location are usually computed independently, allowing them to capture local information around the corresponding specific spatial location (e.g. "there is a pedestrian running"). However, we argue that the attention model also needs to capture non-local information, especially for the driving task (e.g. "there is a pedestrian running towards the crosswalk ahead"). Seo et al. [22] explored using 3×3 convolution to consider local context in generating attention maps. Here, we advocate using Atrous convolution (also known as dilated convolution), which has been shown to be effective for accurately capturing semantic information at an arbitrary scale [8].

As shown in Fig. 4, we apply three parallel Atrous convolutions with different rates on top of the feature map F_I. For the implementation, we closely follow the work by Chen et al. [8]. Specifically, our atrous convolution layers include a 1×1 convolution and two 3×3 convolutions with rates 2 & 4, each with d filters and batch normalization. The resulting features from all the branches are then concatenated and fed into another 1×1 convolution to generate the attention logits. A spatial softmax yields the normalized attention weights α.

Positional Encoding. As shown in Equation 2, to obtain the latent bottleneck vector z, we use spatial summation with the attended features A_I, which removes the positional information. To preserve this information, we append a spatial basis to the feature A_I. Following Vaswani et al. [24] and Parmar et al. [18], we generate a spatial basis V (of the same dimension as F) that contains d-dimensional vectors V = {v_1, v_2, ..., v_l}, where v_i ∈ R^d. Each vector v_i encodes positional information about the spatial location (x_i, y_i) using four types of Fourier basis functions, viz. sin(x_i/f_u), cos(x_i/f_u), sin(y_i/f_u), and cos(y_i/f_u), where f_u = 1000^u is the spatial wavelength with channel index u = {0, 4/d, 8/d, ..., d/d}. Each positional encoding feature v_i is then concatenated with the corresponding attended feature a_i ∈ A_I as shown in Equation 2.
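The PyTorch sketch below puts the pieces of this section together: the Atrous Spatial Attention block of Fig. 4, the Fourier positional basis, and the pooled bottleneck vector of Eq. 2. It is a minimal re-implementation under stated assumptions – the original implementation is not public, so the MLP hidden size, activations, padding, and the interleaving of the four positional basis functions across channels are our choices for illustration only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AtrousSpatialAttention(nn.Module):
    """Fig. 4: three parallel branches (a 1x1 conv and 3x3 atrous convs with
    rates 2 and 4), channel concatenation, a 1x1 conv producing one logit per
    location, and a spatial softmax giving weights alpha that sum to 1."""

    def __init__(self, d):
        super().__init__()
        self.branch1 = nn.Sequential(nn.Conv2d(d, d, 1), nn.BatchNorm2d(d), nn.ReLU())
        self.branch2 = nn.Sequential(nn.Conv2d(d, d, 3, padding=2, dilation=2),
                                     nn.BatchNorm2d(d), nn.ReLU())
        self.branch3 = nn.Sequential(nn.Conv2d(d, d, 3, padding=4, dilation=4),
                                     nn.BatchNorm2d(d), nn.ReLU())
        self.to_logits = nn.Conv2d(3 * d, 1, 1)

    def forward(self, feats):                          # feats: (B, d, h, w)
        mixed = torch.cat([self.branch1(feats),
                           self.branch2(feats),
                           self.branch3(feats)], dim=1)
        logits = self.to_logits(mixed)                 # (B, 1, h, w)
        b, _, h, w = logits.shape
        return F.softmax(logits.view(b, -1), dim=1).view(b, 1, h, w)


def fourier_positions(h, w, d):
    """Spatial basis V with sin/cos of x and y at wavelengths 1000**u
    (Section 4); assumes d is divisible by 4 (d = 128 in the paper), and the
    exact channel interleaving is an assumption."""
    v = torch.zeros(d, h, w)
    ys, xs = torch.meshgrid(torch.arange(h, dtype=torch.float32),
                            torch.arange(w, dtype=torch.float32), indexing="ij")
    for i in range(0, d, 4):
        f_u = 1000.0 ** (i / d)
        v[i], v[i + 1] = torch.sin(xs / f_u), torch.cos(xs / f_u)
        v[i + 2], v[i + 3] = torch.sin(ys / f_u), torch.cos(ys / f_u)
    return v


class AttentionalBottleneck(nn.Module):
    """Eq. 2: z = sum_i g_MLP([a_i ; v_i]) with a_i = alpha_i * f_i."""

    def __init__(self, d, hidden=256):
        super().__init__()
        self.attention = AtrousSpatialAttention(d)
        self.g_mlp = nn.Sequential(nn.Linear(2 * d, hidden), nn.ReLU(),
                                   nn.Linear(hidden, d))

    def forward(self, feats):                          # feats: (B, d, h, w)
        b, d, h, w = feats.shape
        alpha = self.attention(feats)
        attended = alpha * feats                       # A_I = alpha_I . F_I
        pos = fourier_positions(h, w, d).to(feats.device)
        pos = pos.unsqueeze(0).expand(b, -1, -1, -1)
        tokens = torch.cat([attended, pos], dim=1)     # depth-concatenate [a_i ; v_i]
        tokens = tokens.flatten(2).transpose(1, 2)     # (B, h*w, 2d)
        z = self.g_mlp(tokens).sum(dim=1)              # pool over spatial locations (Eq. 2)
        return z, alpha
```

In the full model, z would then be combined with every feature slice of the dense scene context encoding F_S before being consumed by AgentRNN (Eq. 3); that wiring is omitted here.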
5 Experiment

5.1 Dataset

We use the large-scale dataset from [2] that contains over 26 million expert driving examples amounting to about 60 days of continuous driving. Data has been collected by a vehicle instrumented with multiple sensors (i.e. cameras, lidar, radar). A reliable perception system provides accurate environmental descriptions including dynamic objects (i.e. vehicles and pedestrians) and traffic light states. Along with perception, the dataset also provides: (i) the prior map of the environment (i.e. roadmap), (ii) vehicle pose information, and (iii) the speed limits. The input field of view is 80m×80m (a resolution of 400×400 pixels in image coordinates) and the effective forward sensing range of the ego-vehicle is 64 meters.

5.2 Training and Evaluation Details

We trained our models end-to-end with Adam optimization [16] using exponentially decaying learning rates and random initialization (i.e. no pre-trained weights), with ChauffeurNet's default losses. Our FeatureNet creates features F with dimensions 50×50×128, which lead to 50×50 attention maps that are upsampled to the input resolution of 400×400 by a pyramid expansion step.

Fig. 5: We provide typical examples of attention heat maps in diverse driving scenarios (slowing down, stopping, driving on a curvy road, avoiding obstacles, lane following). Our model attends to driving-related visual cues like highlighting stop/yield signs, crosswalks or cars ahead that cause braking, road contours on curved roads, or multiple pinch points from parked cars on narrow roads.
Note that we use ChauffeurNet's default losses and training strategy [2] to train our model end-to-end. The losses consist of pure imitation losses (i.e. agent position, heading, box, and meta-prediction losses) as well as environment losses to provide better generalization (i.e. collision loss, on-road loss, geometry loss, and auxiliary losses). For quantitative evaluation, we use the following metrics:

ADE and FDE. To quantitatively evaluate motion generation performance, we use two widely-used (Euclidean distance-based) metrics: (i) the average displacement error (ADE), (1/K) Σ_{k=0}^{K} ||p̂_k − p^gt_k||_2, and (ii) the final displacement error (FDE), ||p̂_K − p^gt_K||_2, where K = 10 is the total number of predicted waypoints, and the superscript gt denotes the ground-truth values.

Collision. We also use the collision rate, measured as the potential overlap of the predicted agent box with the ground-truth boxes of all other objects in the scene, i.e. (1/K) Σ_{k=0}^{K} Σ_{i,j} B^gt_{k−1}(i,j) B_k(i,j).

Entropy S(α). To measure the sparseness of the generated attention maps, we measure the entropy of the generated attention heat map α, i.e. S(α) = −Σ_{i=1}^{l} α_i log α_i.
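For reference, these metrics are straightforward to compute. The NumPy sketch below shows one hedged interpretation: it averages displacements over the predicted waypoints, treats the box heatmaps as (K, H, W) arrays of per-pixel occupancy for the overlap measure, and adds a small epsilon inside the logarithm; exact normalization details may differ from the authors' evaluation code.

```python
import numpy as np

def ade_fde(pred, gt):
    """Average and final displacement errors over K predicted waypoints.
    pred, gt: arrays of shape (K, 2) in top-down pixel coordinates."""
    dists = np.linalg.norm(pred - gt, axis=1)
    return dists.mean(), dists[-1]

def collision_rate(pred_boxes, gt_obj_boxes):
    """Mean overlap between predicted agent-box heatmaps and ground-truth
    object boxes; both arrays have shape (K, H, W) with values in [0, 1]."""
    return float(np.mean(np.sum(pred_boxes * gt_obj_boxes, axis=(1, 2))))

def attention_entropy(alpha, eps=1e-12):
    """Entropy S(alpha) of a normalized attention map; lower means sparser."""
    a = np.asarray(alpha).ravel()
    return float(-np.sum(a * np.log(a + eps)))
```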
Fig. 6: Comparison of motion generation performance (ADE, FDE, Collision) and attention map sparsity (Entropy) between the baseline ChauffeurNet, visual attention, and Attentional Bottleneck ablation designs. Models: A: ChauffeurNet; B: A + Visual Attention; C: A + Attentional Bottleneck; D: C + PerceptionRNN; E: D + Atrous Spatial Attention; F: E + Positional Encoding.

5.3 Quantitative Analysis

We start by quantitatively comparing our attentional bottleneck model (model F) with the baseline ChauffeurNet [2] (model A) and ChauffeurNet with visual attention (model B) models (see Fig. 2). We observe in Fig. 6 that the incorporation of visual attention for improving the interpretability of the baseline model degrades its performance, as measured by the larger ADE and FDE numbers. This is not the case with our attentional bottleneck model, where we observe improved ADE and FDE numbers – possibly due to improved focus by the model on specific causal factors. Examples in Fig. 2 (right) compare our attention maps to those from the visual attention model, and confirm that the latter generates verbose attention heat maps – finding all potentially salient objects. In contrast, our model provides much sparser attention heat maps which are easier to associate with specific objects or rendered features and are thus easier to interpret. This is evident by comparing their distributions, where the attention weights from our model are mostly concentrated around zero probability values (see supplemental figures).

Fig. 7: Qualitative comparison of attention maps between Attentional Bottleneck ablation designs (C: A + Attentional Bottleneck; D: C + PerceptionRNN; E: D + Atrous Spatial Attention; F: E + Positional Encoding).

Effect of incorporating Behavior Prediction. As detailed in [2], the prediction of potential future trajectories of dynamic objects in the scene helps the network learn better features for the motion generation task. However, to accomplish this the network would need to attend to all the objects. This renders the attentional bottleneck useless for the objective of exposing only the objects relevant for the primary goal of agent motion generation. In this paper, our focus is on the main motion generator branch, but for completeness, we present one particular architecture choice that allows us to enable PerceptionRNN [2] while preserving the interpretability of the attentional bottleneck. We add an additional FeatureNet and Atrous Spatial Attention branch that encodes only the dynamic object channels O into attended object features A_O, which are fed into PerceptionRNN. To allow the PerceptionRNN losses to influence the motion generation task, we also inject A_O into the AgentRNN bottleneck branch by modifying the features f_i ∈ F_I to f′_i = [f_i; a_i], where ; denotes depth concatenation and a_i ∈ A_O. In Fig. 6, we compare the metrics from a baseline Attentional Bottleneck model against one that is co-trained with PerceptionRNN in the above setup and observe that this co-training indeed provides improvements in all metrics. As shown in Fig. 7, the incorporation of PerceptionRNN (model D) improves situation-specific dependence on salient dynamic objects in the scene, i.e. vehicles ahead and pedestrians crossing the crosswalk.

Effect of Atrous Spatial Attention. We illustrate the effect of capturing more non-local information using Atrous convolutions in Fig. 6. Relative to the model using only a local receptive field, this model achieves better ADE, FDE, and Collision numbers while also improving the sparsity of the attention map. Qualitatively, this model tends to attend to inter-object image regions (compare D vs. E in Fig. 7).

Fig. 8: (A–C) Attention maps over time, sampled every 5 timesteps, illustrate the smooth variation of attention. For better visualization, attention maps are overlaid and shown over the satellite map image. We provide six snapshots on the bottom. Our model appears to attend to important cues, e.g. (a) multiple pinch points, (b) oncoming lanes at a T-junction, (c, e) a stop sign, (d, e) a crosswalk, and (f) vehicles. We also provide a video as supplemental material. Map data © Google 2020.

Effect of Positional Encoding. Quantitatively, the motion generation regression performance slightly degrades (but is still better than the original model) as we concatenate a spatial basis to obtain the latent bottleneck vector. The attention maps become more sparse, and we find them to have better spatial alignment with the causal objects (compare E and F in Fig. 7). Thus there is some tension between sparse maps and motion generation performance.

5.4 Qualitative Analysis

Fig. 5 shows several examples covering common driving scenarios that involve slowing down, stopping, avoiding obstacles, etc. We visualize a flattened view of the input (odd rows) and the corresponding attention maps (even rows). The attention maps are overlaid on the input images and include contour lines for easier viewing. Note that our attention maps are quite sparse and provide plausible visual evidence of what triggered a particular behavior, e.g. highlighting stop/yield signs or vehicles ahead causing deceleration, road contours when on curved roads, or multiple pinch points from parked vehicles when on narrow roads.
Fig. 9: (A) Examples of counterfactual outcomes where the driving model appears to attend to alternative important cues, e.g. (top) vehicles ahead → a stop sign, (bottom) red lights → vehicles ahead. (B) Examples where the driving model appears to under-attend to important cues, e.g. ignoring oncoming lanes at intersections.

Attention Maps over Time. In Fig. 8, we illustrate attention maps over time for typical driving scenarios, i.e. avoiding multiple pinch points, stopping at an intersection with 4-way stop signs, etc. We also provide snapshots in the right column where our model attends to important driving-related cues. Our supplemental video also demonstrates the smooth variation of the attention maps across several temporal sequences, illustrating the changing causality with each driving decision.

Counterfactual Experiments. Fig. 9 (A) shows examples where the attention maps change in counterfactual driving scenarios created by removing a subset of the environmental descriptions, i.e. dynamic objects (i.e. with vs. without vehicles) and traffic light states (i.e. red vs. green). These indicate that our model successfully focuses on the correct cues in specific situations. We provide more diverse examples as supplemental materials.

Failure Cases. Fig. 9 (B) shows examples where the attention maps fail to capture the correct causal behavior. This indicates either that the original model failed to focus on the correct cues in specific situations (e.g. "look both ways before making a right turn"), or that the attention model is not able to correctly explain the situation.

6 Conclusions

We described an approach for improving the interpretability of a mid-to-mid deep driving model by augmenting a visual attention model with an attentional bottleneck layer. Our results highlight sparse attention maps which are easy to interpret and do not degrade model performance. We see opportunity in taking this further to generate instance-level attention maps and to also use these maps as a guide to improving the performance of the baseline driving model.

Acknowledgements

We thank Dragomir Anguelov, Anca Dragan, and Alexander Gorban at Waymo Research, and John Canny, Trevor Darrell, Anna Rohrbach, and Yang Gao at UC Berkeley for their helpful comments.

Bibliography

[1] Alemi, A.A., Fischer, I., Dillon, J.V., Murphy, K.: Deep variational information bottleneck. ICLR (2017)
[2] Bansal, M., Krizhevsky, A., Ogale, A.: ChauffeurNet: Learning to drive by imitating the best and synthesizing the worst. RSS (2019)
[3] Bojarski, M., Choromanska, A., Choromanski, K., Firner, B., Jackel, L., Muller, U., Zieba, K.: VisualBackProp: Visualizing CNNs for autonomous driving. arXiv preprint (2016)
[4] Bojarski, M., Del Testa, D., Dworakowski, D., Firner, B., Flepp, B., Goyal, P., Jackel, L.D., Monfort, M., Muller, U., Zhang, J., et al.: End to end learning for self-driving cars. CoRR abs/1604.07316 (2016)
[5] Chai, Y., Sapp, B., Bansal, M., Anguelov, D.: MultiPath: Multiple probabilistic anchor trajectory hypotheses for behavior prediction. CoRL (2019)
[6] Chalk, M., Marre, O., Tkacik, G.: Relevant sparse codes with variational information bottleneck. In: NeurIPS. pp. 1957–1965 (2016)
[7] Chen, C., Seff, A., Kornhauser, A., Xiao, J.: DeepDriving: Learning affordance for direct perception in autonomous driving. In: CVPR. pp. 2722–2730 (2015)
[8] Chen, L.C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. TPAMI 40(4), 834–848 (2017)
[9] Codevilla, F., Müller, M., López, A., Koltun, V., Dosovitskiy, A.: End-to-end driving via conditional imitation learning. In: ICRA. pp. 1–9 (2018)
[10] Goyal, A., Islam, R., Strouse, D., Ahmed, Z., Botvinick, M., Larochelle, H., Levine, S., Bengio, Y.: InfoBot: Transfer and exploration via the information bottleneck. ICLR (2019)
[11] Gunning, D.: Explainable artificial intelligence (XAI). Defense Advanced Research Projects Agency (DARPA) (2017)
[12] Hecker, S., Dai, D., Van Gool, L.: End-to-end learning of driving models with surround-view cameras and route planners. In: ECCV (2018)
[13] Hendricks, L.A., Hu, R., Darrell, T., Akata, Z.: Grounding visual explanations. In: ECCV (2018)
[14] Kim, J., Canny, J.: Interpretable learning for self-driving cars by visualizing causal attention. ICCV (2017)
[15] Kim, J., Rohrbach, A., Darrell, T., Canny, J., Akata, Z.: Textual explanations for self-driving vehicles. In: ECCV (2018)
[16] Kingma, D., Ba, J.: Adam: A method for stochastic optimization. ICLR (2015)
[17] Park, D.H., Hendricks, L.A., Akata, Z., Schiele, B., Darrell, T., Rohrbach, M.: Multimodal explanations: Justifying decisions and pointing to the evidence. In: CVPR (2018)
[18] Parmar, N., Vaswani, A., Uszkoreit, J., Kaiser, Ł., Shazeer, N., Ku, A., Tran, D.: Image transformer. ICML (2018)
[19] Pomerleau, D.A.: ALVINN: An autonomous land vehicle in a neural network. In: NeurIPS. pp. 305–313 (1989)
[20] Sauer, A., Savinov, N., Geiger, A.: Conditional affordance learning for driving in urban environments. CoRL (2018)
[21] Selvaraju, R.R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., Batra, D.: Grad-CAM: Visual explanations from deep networks via gradient-based localization. In: ICCV. pp. 618–626 (2017)
[22] Seo, P.H., Lin, Z., Cohen, S., Shen, X., Han, B.: Progressive attention networks for visual attribute prediction. BMVC (2018)
[23] Tishby, N., Pereira, F.C., Bialek, W.: The information bottleneck method. In: Proceedings of the 37th Allerton Conference on Communication, Control, and Computing. pp. 368–377 (1999)
[24] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: NeurIPS (2017)
[25] Wang, D., Devin, C., Cai, Q.Z., Krähenbühl, P., Darrell, T.: Monocular plan view networks for autonomous driving. IROS (2019)
[26] Wang, D., Devin, C., Cai, Q.Z., Yu, F., Darrell, T.: Deep object-centric policies for autonomous driving. ICRA (2019)
[27] Xu, H., Gao, Y., Yu, F., Darrell, T.: End-to-end learning of driving models from large-scale video datasets. In: CVPR (2017)
[28] Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhutdinov, R., Zemel, R., Bengio, Y.: Show, attend and tell: Neural image caption generation with visual attention. In: ICML. pp. 2048–2057 (2015)
[29] Zeiler, M.D., Fergus, R.: Visualizing and understanding convolutional networks. In: ECCV. pp. 818–833. Springer (2014)
[30] Zeng, W., Luo, W., Suo, S., Sadat, A., Yang, B., Casas, S., Urtasun, R.: End-to-end interpretable neural motion planner. In: CVPR. pp. 8660–8669 (2019)
[31] Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., Torralba, A.: Learning deep features for discriminative localization. In: CVPR. pp. 2921–2929 (2016)