Attentional Bottleneck: Towards an
Interpretable Deep Driving Network
Jinkyu Kim and Mayank Bansal
Waymo Research
{jinkyukim, mayban}@waymo.com
Abstract. Deep neural networks are a key component of behavior pre-
diction and motion generation for self-driving cars. One of their main
drawbacks is a lack of transparency: ideally, they should provide easy-to-
interpret rationales for what triggers certain behaviors. We propose an
architecture called Attentional Bottleneck with the goal of improving
transparency. Our key idea is to combine visual attention, which iden-
tifies what aspects of the input the model is using, with an information
bottleneck that enables the model to only use aspects of the input which
are important. This not only provides sparse and interpretable attention
maps (e.g. focusing only on specific vehicles in the scene), but it adds
this transparency at no cost to model accuracy. In fact, we find slight
improvements in accuracy when applying Attentional Bottleneck to the
ChauffeurNet model, whereas we find that the accuracy deteriorates with
a traditional visual attention model.
Keywords: Self-driving vehicles, eXplainable AI, Motion generation
1 Introduction
Deep neural networks are powerful function estimators and have been a key
component in self-driving software systems [2, 30]. Such networks are, however,
notoriously cryptic – their hidden layer activations may have no obvious rela-
tion to the function being estimated by the network. Interpretable models that
make deep models more transparent are important for a number of reasons: (1)
user-acceptance: neural network autonomous control is a radical technology that
requires a very high level of user trust, (2) extrapolation: users should be able to
anticipate what the vehicle will do in most scenarios by understanding the causal
behavior of the model, and (3) human-vehicle communication: communication
can be grounded in the vehicle’s internal state.
One way of making models transparent is via visual attention [28, 17, 14].
Visual attention finds spatially varying scalar attention weights α(x, y) ∈[0, 1]
typically by learning a multi-layer perceptron from a set of input features F =
{f(x, y)}. Attended features A = {a(x, y)} obtained as a(x, y) = α(x, y)f(x, y)
are then used by the model instead of the original features F. The model is
trained end-to-end leading the attention weights to link the network’s output to
its input – visualizing the weights as a 2D heatmap thus provides insight into
Fig. 1: An overview of our interpretable driving model. Our model takes a top-
down input representation I and outputs the future agent poses Y along with
an attention map. An Attentional Bottleneck encodes the inputs I to a latent
vector z while also producing an interpretable attention heat map. The motion
generator operates in a partially observable environment using only the dense
scene context S ⊂I along with z to predict poses Y.
the areas of the input image that the network attends to. Furthermore, to be
easily interpretable, attention needs to be sparse (i.e. low entropy), while ideally
also enhancing the performance of the original model. Unfortunately, given the
complexity of the driving task, we find that a straightforward integration of
attention maps tends to find all potentially salient image areas, resulting in
limited interpretability (e.g. Fig. 2).
In this work, we achieve sparse and salient attention maps without sacrificing
final model performance by attaching attention to a bottlenecked latent
representation of the input. However, given the information loss in the bottle-
neck, we need to provide the model direct access to a subset of dense inputs (e.g.
road lane geometry and connectivity information) that are harder to compress.
This frees up the bottleneck branch to focus on selecting the most relevant parts
of the dynamic input (e.g. nearby objects), while retaining the model perfor-
mance.
In end-to-end driving models that directly process a camera image as input,
several scene elements are confounded into nearby pixels, making a separation
into dense and sparse input subsets infeasible. Therefore, we focus on improving
the interpretability of a driving model that uses a mid-level input representation.
This means that instead of directly using low-level sensor data, the model uses
higher-level semantic information like objects detected by a perception system.
As a proxy for such a network, we work with the recently published Chauffeur-
Net [2] model, although the ideas presented are more generally applicable. The
inputs I to this network consist of information about the roadmap, traffic lights,
dynamic objects, etc. rendered as separate channels in a common top-down view
coordinate system around the agent. The model predicts future agent poses Y
in the same top-down view (see Fig. 3).
To generate sparser and more interpretable attention maps, we propose an
architecture called Attentional Bottleneck (Fig. 1) that combines visual attention
with the information bottleneck approach [23] of training deep models through
supervised learning [1, 6, 10]. We define z as a bottleneck latent representation
of an attention weighted feature encoding AI = αI · FI of the input features
FI. We leverage the mid-level input representation to separate the subset of
dense inputs into a set S ⊂I. Conditioned on z and S, the motion generator
finally predicts the target Y. Our goal is to learn both the attention weighting
function αI and an encoding z that is maximally informative about the target
Y. To prevent z from being the identity encoding of the inputs and to focus
the network on specific areas of causality, we impose an information bottleneck
constraint on the complexity of z by a pooling operation. We preserve spatial
information in the attention map by incorporating a positional encoding step,
and encode non-local information by using Atrous convolutions.
We evaluate our approach on the large-scale (≈60 days of continuous driv-
ing) dataset from [2] and show quantitative and qualitative results illustrating
that our generated attention maps result in much sparser (and thus more in-
terpretable) visualization of the internal states than a baseline visual attention
model. We also show that our approach improves the motion generation accu-
racy in contrast to a traditional visual attention model that results in decreased
accuracy.
2 Related Work
2.1 Deep Driving Models
Recently, there has been growing interest in end-to-end driving models that process raw
sensor data to directly output driving controls. Most of these approaches learn a
driving policy through supervised regression over observation-action pairs from
human drivers. ALVINN (Autonomous Land Vehicle In a Neural Network) [19]
was the first attempt to train a neural network for the navigational task of road
following. Bojarski et al. [4] trained a 5-layer ConvNet to predict steering con-
trols only from a dashcam image, while Xu et al. [27] utilized a dilated ConvNet
combined with an LSTM to predict the vehicle's discretized future motions.
Hecker et al. [12] explored an extended model that takes a surround-view multi-
camera system, a route planner, and a CAN bus reader. Codevilla et al. [9] ex-
plored a conditional end-to-end driving model that takes high-level command
input (i.e. left, straight, right) at test time. These models show good perfor-
mance in simple driving scenarios (e.g. lane following). Their behavior, however,
is opaque and learning to drive in urban areas remains challenging.
To reduce complexity and improve interpretability, there is growing
interest in end-to-mid and mid-to-mid driving models that produce a mid-level
output representation in the form of a drivable trajectory by consuming either
raw sensor data or an intermediate scene representation as input. Zeng et al. [30]
introduced an end-to-mid neural motion generator, which takes Lidar data and
an HD map as inputs and outputs a future trajectory. This model also detects 3D
bounding boxes as an intermediate representation. Bansal et al. [2] introduced
ChauffeurNet, a mid-to-mid model that takes advantage of separate perception
and control components. Using a top-down representation of the environment
and intended route as input, the model outputs a driving trajectory that is
consumed by a controller, which then translates it to steering and acceleration.
Recent works [5, 25] also suggest that such a top-down scene representation can
successfully be used to learn high-level semantic information. In this work, we
focus on improving the explainability of such a model.
2.2 Visual Explanations
Explainability of deep neural networks has seen growing interest in computer
vision and machine learning [11]. In landmark work, Zeiler et al. [29] utilized
deconvolution layers to visualize the internal representation of a ConvNet. Bo-
jarski et al. [3] developed a richer notion of contribution of a pixel to the output,
while other approaches [31, 21] have explored synthesizing an image causing high
neuron activations. However, a difficulty with deconvolution-based approaches is
the lack of a formal notion of contribution of spatially-extended features (rather
than pixels).
Attention-based approaches [28, 17] have been increasingly employed for im-
proving a model’s ability to explain by providing spatial attention maps that
highlight areas of the image that the network attends to. Kim et al. [14] utilize
an attention model followed by additional salience filtering to show regions that
causally affect the output. To reduce the complexity of explanations, Wang et
al. [26] introduce an instance-level attention model that finds objects (i.e. cars
and pedestrians) that the network needs to pay attention to. Such attention may
be more intuitive and interpretable for users to understand the model’s behavior.
However, the model needs to take the whole input context as an additional input,
which may compromise the causality of the attention – explanations may not
represent causal relationships between the system’s input and its behavior. To
preserve the causality, we use a top-down representation of the environment as
an input, which consists of information around the agent rendered in separable
channels.
Another notable approach is the work by Chen et al. [7], which defined
human-interpretable intermediate features such as the curvature, deviation from
neighboring lanes, and distances to the vehicles ahead. A CNN is
trained to produce these features from an image, and a simple controller maps
Fig. 2: (left) Attentional Bottleneck design compared with a baseline visual at-
tention model applied to ChauffeurNet. (right) Comparison of attention maps
from our model against those from a baseline visual attention model. Note that
our heatmaps are much sparser and thus more interpretable.
them to a steering angle. Similarly, Sauer et al. [20] proposed a conditional affor-
dance learning approach that maps visual inputs to intermediate representations
conditioned on high-level command input. However, the intermediate feature
descriptors provide a limited and ad-hoc vocabulary for explanations. Zeng et
al. [30] co-trained a perception model that provides bounding boxes of dynamic
objects, which are then used as an intermediate and interpretable feature. Here,
we instead take full advantage of existing well-established perception systems
with different sensor sources (i.e. Lidar, Radar, and Camera) via a mid-level
input representation. This reduces the complexity of the driving network and
employs more reliable perception outputs.
There is also a growing effort on textual explanations that justify the deci-
sions that were made and explain the “why” in natural language [13, 17, 15].
However, textual explanations are often rationalizations – explanations that jus-
tify the system’s behavior in a post-hoc manner – and are less helpful with
understanding the causal behavior of the model. In this work, we focus on im-
proving attention-based mechanisms to provide introspective explanations that
are based on the system’s internal state and represent causal relationships be-
tween the system’s input and its behavior.
3 Mid-to-mid Driving Model with Visual Attention
3.1 ChauffeurNet
Bansal et al. [2] introduced a mid-to-mid driving network called ChauffeurNet
that recurrently predicts future poses of the agent by processing a top-down
[Figure panels – Inputs: (a) Roadmap, (b) Speed Limits, (c) Past Agent Poses, (d) Current Agent Box, (e) Route, (f) Traffic Lights, (g) Dynamic Objects. Outputs: (h) Future Agent Boxes, (i) Future Agent Poses.]
Fig. 3: Top-down rendered inputs I (left) and outputs Y (right) for the Chauf-
feurNet model. The subset of dense scene context inputs S are shown in the top
row.
representation of the environment as an input. For completeness, we summarize
some of the key details of the paper here. The input to the neural network consists
of a set of images I of size W ×H pixels rendered into a top-down view coordinate
system that is fixed relative to the current location of the agent. As shown in
Fig. 3, the set I contains: (a) Roadmap: a 3-channel image with a rendering
of color coded lanes, stop signs, cross-walks, etc. (b) Speed limit: a gray-scale
image with lane centers color coded in proportion to their known speed limit.
(c) Past agent poses: the ego-vehicle’s past poses rendered as a trail of points.
(d) Current agent box: the current agent represented by a full bounding box. (e)
The intended route. (f) Traffic lights: a gray-scale image where each lane center
is color coded to reflect different traffic light states (red light: brightest gray
level, green light: darkest gray level). (g) Dynamic objects: a gray-scale image
that renders all the potential dynamic objects (vehicles, cyclists, pedestrians)
as oriented boxes. Both (f) and (g) are a sequence of 5 images reflecting the
environment state over the past 5 timesteps.
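To make this input format concrete, the following sketch assembles a mid-level input as a channel-stacked top-down tensor. The channel names mirror the list above, but the zero-filled rasters, the exact channel counts, and the render_inputs/to_tensor helpers are illustrative assumptions, not ChauffeurNet's actual rendering code.

```python
import numpy as np

W, H = 400, 400  # 80m x 80m field of view at 400 x 400 pixels (0.2 m/pixel)

def render_inputs():
    """Hypothetical stand-in for the renderer: each entry is a top-down raster.

    Real rasters would be drawn from perception and map data; here we use
    zero-filled placeholders with plausible channel counts.
    """
    return {
        "roadmap":         np.zeros((H, W, 3), np.float32),  # (a) color-coded lanes, stop signs, ...
        "speed_limit":     np.zeros((H, W, 1), np.float32),  # (b) gray-scale lane centers
        "past_poses":      np.zeros((H, W, 1), np.float32),  # (c) trail of past ego poses
        "current_box":     np.zeros((H, W, 1), np.float32),  # (d) ego bounding box
        "route":           np.zeros((H, W, 1), np.float32),  # (e) intended route
        "traffic_lights":  np.zeros((H, W, 5), np.float32),  # (f) past 5 timesteps
        "dynamic_objects": np.zeros((H, W, 5), np.float32),  # (g) past 5 timesteps
    }

def to_tensor(I):
    # Depth-concatenate all channels into a single H x W x C input tensor.
    return np.concatenate(list(I.values()), axis=-1)

x = to_tensor(render_inputs())
print(x.shape)  # (400, 400, 17)
```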
Visual Encoder (FeatureNet). In the ChauffeurNet model, the rendered in-
puts I are fed to a large-receptive field convolutional FeatureNet with skip con-
nections, which outputs features F that capture the environmental context and
the intent. This feature F (of size w×h×d) contains a set of d-dimensional la-
tent vectors over the spatial dimension, i.e. F = {f1, f2, . . ., fl}, where fi ∈Rd
and l (= w × h) is the spatial dimension of the extracted features. Selecting a
subset of these feature slices will allow the attention model to selectively attend
to different parts of input images.
Motion Generator (AgentRNN). The feature encoding F is fed to a recur-
rent neural network (AgentRNN) which predicts the outputs Y consisting of the
next point pk on the driving trajectory, and the agent bounding box heatmap
Bk, conditioned on the features F, the iteration number k ∈{1, . . . , N}, the
memory Mk−1 of past predictions from AgentRNN, and the agent bounding box
heatmap Bk−1 predicted in the previous iteration.
pk, Bk = AgentRNN(k, F, Mk−1, Bk−1)    (1)
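The recurrence in Eq. 1 can be pictured as a simple rollout loop. The sketch below assumes a hypothetical agent_rnn_step callable for one AgentRNN iteration and a max-accumulation rule for the memory M of past predictions; both are illustrative stand-ins rather than the paper's implementation.

```python
import numpy as np

def rollout(agent_rnn_step, F, B0, K=10):
    """Unrolls the recurrence of Eq. 1 for K iterations (K = 10 waypoints in the paper).

    agent_rnn_step(k, F, M, B_prev) is assumed to return the next waypoint p_k,
    the agent box heatmap B_k, and a spatial rendering of that prediction.
    """
    h, w = B0.shape
    M = np.zeros((h, w), np.float32)    # memory of past predictions
    B_prev = B0                         # box heatmap from the previous iteration
    waypoints, boxes = [], []
    for k in range(1, K + 1):
        p_k, B_k, rendered = agent_rnn_step(k, F, M, B_prev)
        M = np.maximum(M, rendered)     # assumed accumulation of past predictions
        B_prev = B_k
        waypoints.append(p_k)
        boxes.append(B_k)
    return waypoints, boxes
```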
3.2 ChauffeurNet with Visual Attention
One way of making models interpretable is via visual attention [17, 14]. These
models provide introspective (visual) explanations by filtering out non-salient
image regions – the remaining (attended) regions potentially have a causal ef-
fect on the output. The goal of visual attention is to find an attended feature
A = {a1, a2, . . . , al}, where ai ∈ Rd, from the original feature F. These models utilize
a deterministic soft attention mechanism that is trainable by standard back-
propagation methods, which thus has advantages over a hard stochastic atten-
tion mechanism that requires reinforcement learning. As discussed by several
works [28, 14], the attended features can be computed as ai = π(αi, fi) = αifi
for i = {1, 2, . . . , l}, where αi are scalar attention weights in [0, 1] satisfying
∑i αi = 1. These weights are estimated from the input features F, typically
by a multi-layer perceptron, i.e. αi = fMLP(fi), where the parameters of fMLP
are learned as part of training the entire model end-to-end. Since the attention
weights vary spatially and depend on the input (via the features F), they can be
visualized as an attention heatmap aligned with the input image, with brighter
regions reflecting areas salient for the task.
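As a concrete reference, below is a minimal PyTorch-style sketch of this soft attention mechanism: a per-location MLP (realized with 1×1 convolutions) produces one logit per spatial location, a spatial softmax yields weights that sum to one, and the features are re-weighted. The module and variable names, and details such as the MLP width, are assumptions of this sketch rather than the baseline's actual implementation.

```python
import torch
import torch.nn as nn

class SoftSpatialAttention(nn.Module):
    """Per-location soft attention: alpha_i = MLP(f_i), normalized with a
    spatial softmax, then a_i = alpha_i * f_i."""

    def __init__(self, d, hidden=64):
        super().__init__()
        # 1x1 convolutions act as a per-location MLP over the d-dim feature f_i.
        self.mlp = nn.Sequential(
            nn.Conv2d(d, hidden, kernel_size=1), nn.ReLU(),
            nn.Conv2d(hidden, 1, kernel_size=1),
        )

    def forward(self, F):                      # F: (batch, d, h, w)
        b, d, h, w = F.shape
        logits = self.mlp(F).view(b, h * w)    # one logit per spatial location
        alpha = torch.softmax(logits, dim=1)   # weights sum to 1 over locations
        alpha = alpha.view(b, 1, h, w)
        A = alpha * F                          # attended features a_i = alpha_i * f_i
        return A, alpha
```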
To allow us to explain the driving decisions made by ChauffeurNet, we apply
this vanilla visual attention approach by replacing the original features F in
Eq. 1 with the attended features A as shown in Fig. 2 (left). As shown in Fig. 2
(right), this approach generates vague and verbose attention maps which do not
add to the interpretability of the model. Therefore, we use this approach as a
baseline for our “Attentional Bottleneck” approach.
4 Attentional Bottleneck
We propose a novel architecture called Attentional Bottleneck with a focus on
generating sparse and fine-grained visual explanations. We encode the environ-
ment I through an information bottleneck that restricts the input information to
only its most relevant parts, and thus allows the driving model to focus on specific
features in the environment. We tie this feature
selection to the spatial distribution of features by employing a spatial attention
mechanism before the bottleneck. While the driving task involves focusing on
specific objects and entities in the scene for the immediate driving decisions, hu-
mans also employ a holistic understanding of some elements of the environment.
For example, humans are aware of the overall map of the environment through
visual scanning or through looking at a navigation app. We find that compress-
ing this kind of dense information through the bottleneck either leads to dense
attention maps or degrades the model performance. Therefore, we leverage the
mid-level separable input representation and provide the model full access to a
Fig. 4: Atrous Spatial Attention Block. We apply three parallel Atrous convolu-
tions with different atrous rates. The resulting features from all three branches
are then concatenated and fed into 1x1 convolution and softmax layers to gen-
erate the attention weights α.
subset of inputs S ⊂I containing the dense context about the environment,
through a separate branch. This frees up the bottleneck branch to focus on spe-
cific parts of the input (e.g. specific objects) making the attention map sparser
and more interpretable.
As shown in Fig. 2, our modified ChauffeurNet model consists of a dense
input encoder branch and an attentional bottleneck branch providing encoded
input features to the AgentRNN.
Grounding Attentional Bottleneck into AgentRNN. Like the baseline
model, the inputs I are first encoded into features FI by the FeatureNet net-
work. To capture non-local information, we propose an Atrous Spatial Attention
layer that computes the attention weights αI and outputs the attended features
AI. The attended features are depth-concatenated with a positional encoding
V, then passed through a multi-layer perceptron gMLP and an average pooling
layer to generate the final bottleneck representation z.
z = ∑_{i=1}^{l} gMLP([ai; vi])    (2)
The dense scene context inputs S are similarly encoded into features FS using
another FeatureNet network with identical architecture. We modify AgentRNN
to incorporate the bottleneck vector by concatenating it with each of the features
fi ∈FS:
pk, Bk = AgentRNN(k, FS, z, Mk−1, Bk−1)    (3)
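Eq. 3 amounts to broadcasting the bottleneck vector z over the spatial grid of the dense scene-context features and depth-concatenating it with each fi ∈ FS before AgentRNN consumes them. A small sketch of this conditioning step, with assumed tensor shapes, is:

```python
import torch

def condition_on_bottleneck(F_S, z):
    """Concatenate the bottleneck vector z with every spatial feature of F_S.

    F_S: (batch, d, h, w) dense scene-context features.
    z:   (batch, d_z) bottleneck vector from the attentional branch.
    Returns a (batch, d + d_z, h, w) tensor fed to AgentRNN in place of F_S.
    """
    b, _, h, w = F_S.shape
    z_map = z.view(b, -1, 1, 1).expand(b, z.shape[1], h, w)  # broadcast over space
    return torch.cat([F_S, z_map], dim=1)
```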
We discuss the Atrous Spatial Attention and the positional encoding stages
in the following paragraphs and present ablation results for these blocks in the
experiments.
Atrous Spatial Attention. Attention models are typically applied to features
generated by the last layer of a convolutional encoder. Attention weights for each
spatial location are usually computed independently, allowing them to capture
local information around the corresponding specific spatial location (e.g. “there
is a pedestrian running”). However, we argue that the attention model also needs
to capture non-local information especially for the driving task (e.g. “there is a
pedestrian running towards the crosswalk ahead”). Seo et al. [22] explored using
3×3 convolution to consider local context in generating attention maps. Here, we
advocate using Atrous convolution (also known as dilated convolution), which
has been shown to be effective for accurately capturing semantic information at
an arbitrary scale [8].
As shown in Fig. 4, we apply three parallel Atrous convolutions with different
rates on top of the feature map FI. For implementation, we closely follow the
work by Chen et al. [8]. Specifically, our atrous convolution layers include a
1×1 convolution, and two 3×3 convolutions with rates 2 & 4 with d filters
and batch normalization. The resulting features from all the branches are then
concatenated and fed into another 1×1 convolution to generate the attention
logits. A spatial softmax yields normalized attention weights α.
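A PyTorch-style sketch of this block, following Fig. 4, is given below: a 1×1 convolution and two 3×3 atrous convolutions with rates 2 and 4 (each with d filters and batch normalization) run in parallel, their outputs are concatenated, projected to a single logit per location, and normalized with a spatial softmax. This is a reimplementation sketch under those stated assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

class AtrousSpatialAttention(nn.Module):
    """Attention weights from multi-rate context: three parallel branches
    (1x1 conv; 3x3 convs with dilation 2 and 4), concatenated, projected to
    one logit per location, then a spatial softmax."""

    def __init__(self, d):
        super().__init__()
        def branch(kernel, dilation):
            padding = dilation * (kernel - 1) // 2   # keep spatial size
            return nn.Sequential(
                nn.Conv2d(d, d, kernel, padding=padding, dilation=dilation),
                nn.BatchNorm2d(d), nn.ReLU())
        self.branches = nn.ModuleList([branch(1, 1), branch(3, 2), branch(3, 4)])
        self.to_logits = nn.Conv2d(3 * d, 1, kernel_size=1)

    def forward(self, F):                         # F: (batch, d, h, w)
        b, _, h, w = F.shape
        ctx = torch.cat([f(F) for f in self.branches], dim=1)
        logits = self.to_logits(ctx).view(b, h * w)
        alpha = torch.softmax(logits, dim=1).view(b, 1, h, w)
        return alpha * F, alpha                   # attended features A_I and weights
```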
Positional Encoding. As shown in Equation 2, to obtain the latent bottle-
neck vector z, we use a spatial summation over the attended features AI, which
removes the positional information. To preserve this information, we append a
spatial basis to the feature AI. Following Vaswani et al. [24] and Parmar et
al. [18], we generate a spatial basis V (of the same dimension as F) that con-
tains d-dimensional vectors V = {v1, v2, . . . , vl}, where vi ∈Rd. Each vec-
tor vi encodes positional information about the spatial location (xi, yi) using
four types of Fourier basis functions viz. sin(xi/fu), cos(xi/fu), sin(yi/fu), and
cos(yi/fu), where fu = 1000^u is the spatial wavelength with channel index
u = {0, 4/d, 8/d, . . . , d/d}. Each positional encoding feature vi is then concate-
nated with the corresponding attended feature ai ∈AI as shown in Equation 2.
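Putting the positional encoding together with Eq. 2, a sketch of the bottleneck head might look as follows. The Fourier spatial basis follows the description above (with fu = 1000^u); the gMLP width and the use of a mean pool over locations (the text mentions average pooling, while Eq. 2 writes a sum) are assumptions of this sketch.

```python
import torch
import torch.nn as nn

def fourier_spatial_basis(h, w, d):
    """Spatial basis V with d channels per location, built from sin/cos of
    x and y at geometrically spaced wavelengths f_u = 1000**u."""
    assert d % 4 == 0
    u = torch.arange(0, d // 4, dtype=torch.float32) * (4.0 / d)  # u = 0, 4/d, 8/d, ...
    wavelength = 1000.0 ** u                                       # (d/4,)
    x = torch.arange(w, dtype=torch.float32).view(1, 1, w)         # column index
    y = torch.arange(h, dtype=torch.float32).view(1, h, 1)         # row index
    inv = 1.0 / wavelength.view(-1, 1, 1)                          # (d/4, 1, 1)
    V = torch.cat([(x * inv).sin().expand(-1, h, w),
                   (x * inv).cos().expand(-1, h, w),
                   (y * inv).sin().expand(-1, h, w),
                   (y * inv).cos().expand(-1, h, w)], dim=0)       # (d, h, w)
    return V

class AttentionalBottleneckHead(nn.Module):
    """Realizes Eq. 2: z pooled from gMLP([a_i; v_i]) over all locations."""

    def __init__(self, d, d_z):
        super().__init__()
        self.g_mlp = nn.Sequential(nn.Linear(2 * d, d_z), nn.ReLU(),
                                   nn.Linear(d_z, d_z))

    def forward(self, A):                                # A: (batch, d, h, w)
        b, d, h, w = A.shape
        V = fourier_spatial_basis(h, w, d).to(A)         # (d, h, w) positional basis
        V = V.unsqueeze(0).expand(b, -1, -1, -1)         # broadcast over the batch
        x = torch.cat([A, V], dim=1)                     # [a_i; v_i] at each location
        x = x.flatten(2).transpose(1, 2)                 # (batch, h*w, 2d)
        return self.g_mlp(x).mean(dim=1)                 # pool over locations -> z
```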
5 Experiments
5.1 Dataset
We use the large-scale dataset from [2] that contains over 26 million expert driv-
ing examples amounting to about 60 days of continuous driving. Data has been
collected by a vehicle instrumented with multiple sensors (i.e. cameras, lidar,
radar). A reliable perception system provides accurate environmental descrip-
tions including dynamic objects (i.e. vehicles and pedestrians) and traffic light
states. Along with perception, the dataset also provides: (i) the prior map of the
environment (i.e. roadmap), (ii) vehicle pose information, and (iii) the speed-
limits. The input field of view is 80m × 80m (a resolution of 400 × 400 pixels in
image coordinates) and the effective forward sensing range of the ego-vehicle is
64 meters.
5.2 Training and Evaluation Details
We trained our models end-to-end with Adam optimization [16] using exponentially
decaying learning rates and random initialization (i.e. no pre-trained
[Figure panels, each pairing rendered input images with the corresponding attention heat maps (color scale 0–25 ×10−3): slowing down, stopping, driving a curvy road, avoiding obstacles, and lane following.]
Fig. 5: We provide typical examples of attention heat maps in diverse driving
scenarios. Our model attends to driving-related visual cues like highlighting
stop/yield signs, crosswalks or cars ahead that cause braking, road contours
on curved roads, or multiple pinch points from parked cars on narrow roads.
weights), with ChauffeurNet’s default losses. Our FeatureNet creates features F
with dimensions 50×50×128 which lead to 50×50 attention maps that are up-
sampled to the input resolution of 400 × 400 by a pyramid expansion step. Note
that we use ChauffeurNet’s default losses and training strategy [2] to train our
model end-to-end. The losses consist of pure imitation losses (i.e. agent position,
[Plots of ADE (pixels), FDE (pixels), Collision rate, and attention Entropy for models A–F. A: ChauffeurNet; B: A + Visual Attention; C: A + Attentional Bottleneck; D: C + PerceptionRNN; E: D + Atrous Spatial Attention; F: E + Positional Encoding.]
Fig. 6: Comparison of motion generation performance and attention map spar-
sity between baseline ChauffeurNet, visual attention and Attentional Bottleneck
ablation designs.
heading, box, meta prediction loss) as well as environment losses to provide bet-
ter generalization (i.e. collision loss, on-road loss, geometry loss, and auxiliary
losses). For quantitative evaluation, we use the following metrics:
ADE and FDE. To quantitatively evaluate motion generation performance, we
use two widely-used (Euclidean distance-based) metrics: (i) the average displacement
error (ADE) (1/K) ∑_{k=0}^{K} ||p̂_k − p_k^gt||_2, and (ii) the final displacement error
(FDE) ||p̂_K − p_K^gt||_2, where K = 10 is the total number of predicted waypoints,
and the superscript gt denotes the ground-truth values.
Collision. We also use the collision rate by measuring the potential overlap of
the predicted agent box with the ground-truth boxes of all other objects in the
scene, i.e. (1/K) ∑_{k=0}^{K} ∑_{i,j} B_{k−1}^gt(i, j) B_k(i, j).
Entropy S(α). To measure the sparseness of the generated attention maps, we
measure the entropy of the generated attention heat map α, i.e.
S(α) = −∑_{i=1}^{l} α_i log α_i.
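For reference, these metrics could be computed roughly as in the sketch below. The array shapes, the eps stabilizer in the entropy, and the simplified timestep pairing in the collision term are assumptions of this sketch rather than the paper's exact evaluation code.

```python
import numpy as np

def ade_fde(pred, gt):
    """pred, gt: (K+1, 2) arrays of waypoints; returns (ADE, FDE) in pixels."""
    dists = np.linalg.norm(pred - gt, axis=1)
    K = len(pred) - 1
    return dists.sum() / K, dists[-1]

def collision_rate(pred_boxes, gt_obj_boxes):
    """pred_boxes, gt_obj_boxes: (K+1, h, w) box heatmaps; mean per-step overlap."""
    K = len(pred_boxes) - 1
    return (pred_boxes * gt_obj_boxes).sum(axis=(1, 2)).sum() / K

def attention_entropy(alpha, eps=1e-12):
    """alpha: attention map normalized to sum to 1; lower entropy = sparser map."""
    a = alpha.reshape(-1)
    return float(-(a * np.log(a + eps)).sum())
```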
5.3 Quantitative Analysis
We start by quantitatively comparing our attentional bottleneck model (model
F) with the baseline ChauffeurNet [2] (model A) and ChauffeurNet with visual
attention (model B) models (see Fig. 2). We observe in Fig. 6 that the incorpora-
tion of visual attention for improving the interpretability of the baseline model
degrades its performance as measured by the larger ADE and FDE numbers.
This is not the case with our attentional bottleneck model where we observe im-
proved ADE and FDE numbers – possibly due to improved focus by the model
on specific causal factors. Examples in Fig. 2 (right) compare our attention maps
to those from the visual attention model, and confirm that the latter generates
verbose attention heat maps – finding all potentially salient objects. In con-
trast, our model provides much sparser attention heat maps which are easier
to associate with specific objects or rendered features and are thus easier to
interpret. This is evident by comparing their distributions where the attention
Fig. 7: Qualitative comparison of attention maps between Attentional Bottleneck
ablation designs.
weights from our model are mostly concentrated around zero probability values
(see supplemental figures).
Effect of incorporating Behavior Prediction. As detailed in [2], the pre-
diction of potential future trajectories of dynamic objects in the scene helps the
network learn better features for the motion generation task. However, to ac-
complish this the network would need to attend to all the objects. This renders
the attentional bottleneck useless for the objective of exposing only the objects
relevant for the primary goal of agent motion generation. In this paper, our fo-
cus is on the main motion generator branch, but for completeness, we present
one particular architecture choice that allows us to enable PerceptionRNN [2]
while preserving the interpretability of the attentional bottleneck. We add an
additional FeatureNet and Atrous Spatial Attention branch that encodes only
the dynamic object channels O into attended object features AO which are fed
into PerceptionRNN. To allow the PerceptionRNN losses to influence the mo-
tion generation task, we also inject AO into the AgentRNN bottleneck branch by
modifying features fi ∈ FI to f′i = [fi; ai], where ; denotes depth concatenation
and ai ∈AO. In Fig. 6, we compare the metrics from a baseline Attentional
Bottleneck model against one that is co-trained with PerceptionRNN in the
above setup and observe that this co-training indeed provides improvements in
all metrics. As shown in Fig. 7, incorporation of PerceptionRNN (model D) im-
proves situation-specific dependence on salient dynamic objects in the scene, i.e.
vehicles ahead and pedestrians crossing the crosswalk.
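The wiring described above can be sketched as follows, reusing the hypothetical FeatureNet-style encoder and the AtrousSpatialAttention sketch from earlier; PerceptionRNN itself is left abstract, and the function name and shapes are assumptions of this sketch.

```python
import torch

def build_perception_branch(feature_net_obj, atrous_attention, I_objects, F_I):
    """Sketch of the PerceptionRNN co-training path.

    feature_net_obj and atrous_attention are assumed to be separate instances
    of the encoder and attention modules; I_objects holds only the dynamic
    object channels O.
    """
    F_O = feature_net_obj(I_objects)              # encode object channels only
    A_O, alpha_O = atrous_attention(F_O)          # attended object features A_O
    # A_O is fed to PerceptionRNN (not shown) and also injected into the
    # bottleneck branch by depth-concatenating it with the features F_I:
    F_I_prime = torch.cat([F_I, A_O], dim=1)      # f'_i = [f_i; a_i]
    return F_I_prime, A_O
```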
Effect of Atrous Spatial Attention. We illustrate the effect of capturing
more non-local information using Atrous convolutions in Fig. 6. Relative to the
model using only a local receptive field, this model achieves better ADE, FDE,
Fig. 8: (A-C) Attention maps over time, sampled at every 5 timesteps, illustrate
the smooth variation of attention. For better visualization, attention maps are
overlaid and shown over the satellite map image. We provide six snapshots on
the bottom. Our model appears to attend to important cues, e.g. (a) multiple
pinch points, (b) oncoming lanes on a T-junction, (c, e) a stop sign, (d, e) a
crosswalk, and (f) vehicles. We also provide a video as supplemental materials.
Map data © Google 2020.
and Collision numbers while also improving the sparsity of the attention map.
Qualitatively, this model tends to attend to inter-object image regions (compare
D vs. E in Fig. 7).
Effect of Positional Encoding. Quantitatively, the motion generation regres-
sion performance slightly degrades (but remains better than that of the original model)
when we concatenate a spatial basis to obtain the latent bottleneck vector. The attention
maps become sparser and we find them to have better spatial alignment
with the causal objects (compare E and F in Fig. 7). Thus there is some tension
between sparse maps and motion generation performance.
5.4 Qualitative Analysis
Fig. 5 shows several examples covering common driving scenarios that involve
slowing down, stopping, avoiding obstacles etc. We visualize a flattened view
of the input (odd rows) and the corresponding attention maps (even rows).
The attention maps are overlaid on the input images and include contour lines
for easier viewing. Note that our attention maps are quite sparse and provide
plausible visual evidence of what triggered a particular behavior, e.g. highlighting
stop/yield signs or vehicles ahead causing deceleration, road contours when on
curved roads, or multiple pinch points from parked vehicles when on narrow
roads.
Fig. 9: (A) Examples of counterfactual outcomes where the driving model ap-
pears to attend to alternative important cues, e.g. (top) vehicles ahead → a stop
sign, (bottom) red lights → vehicles ahead. (B) Examples where the driving
model appears to under-attend to important cues, e.g. ignoring oncoming lanes
at intersections.
Attention Maps over Time. In Fig. 8, we illustrate attention maps over
time for typical driving scenarios, i.e. avoiding multiple pinch points, stopping
at an intersection with 4-way stop signs, etc. We also provide snapshots on the
right column where our model attends to important driving-related cues. Our
supplemental video also demonstrates the smooth variation of the attention maps
across several temporal sequences illustrating the changing causality with each
driving decision.
Counterfactual Experiments. Fig. 9 (A) shows examples where the attention
maps change in counterfactual driving scenarios created by removing a subset
of environmental descriptions, e.g. dynamic objects (with vs. without vehicles)
and traffic light states (red vs. green). These indicate that our model
successfully focuses on the correct cues in specific situations. We provide more
diverse examples as supplemental materials.
Failure Cases. Fig. 9 (B) shows examples where the attention maps fail to
capture the correct causal behavior. This indicates either that the original model
failed to focus on the correct cues in specific situations (e.g. “look both ways
before making a right turn”), or that the attention model is not able to correctly
explain this situation.
6 Conclusions
We described an approach for improving the interpretability of a mid-to-mid deep
driving model by augmenting a visual attention model with an attentional bot-
tleneck layer. Our results highlight sparse attention maps which are easy to
interpret and do not degrade model performance. We see opportunity in taking
this further to generate instance-level attention maps and to also use these maps
as a guide to improving the performance of the baseline driving model.
Acknowledgements
We thank Dragomir Anguelov, Anca Dragan, and Alexander Gorban at Waymo
Research, John Canny, Trevor Darrell, Anna Rohrbach, and Yang Gao at UC
Berkeley for their helpful comments.
Bibliography
[1] Alemi, A.A., Fischer, I., Dillon, J.V., Murphy, K.: Deep variational infor-
mation bottleneck. ICLR (2017)
[2] Bansal, M., Krizhevsky, A., Ogale, A.: Chauffeurnet: Learning to drive by
imitating the best and synthesizing the worst. RSS (2019)
[3] Bojarski, M., Choromanska, A., Choromanski, K., Firner, B., Jackel, L.,
Muller, U., Zieba, K.: Visualbackprop: visualizing cnns for autonomous driv-
ing. arXiv preprint (2016)
[4] Bojarski, M., Del Testa, D., Dworakowski, D., Firner, B., Flepp, B., Goyal,
P., Jackel, L.D., Monfort, M., Muller, U., Zhang, J., et al.: End to end
learning for self-driving cars. CoRR abs/1604.07316 (2016)
[5] Chai, Y., Sapp, B., Bansal, M., Anguelov, D.: Multipath: Multiple proba-
bilistic anchor trajectory hypotheses for behavior prediction. CoRL (2019)
[6] Chalk, M., Marre, O., Tkacik, G.: Relevant sparse codes with variational
information bottleneck. In: NeurIPS. pp. 1957–1965 (2016)
[7] Chen, C., Seff, A., Kornhauser, A., Xiao, J.: Deepdriving: Learning affor-
dance for direct perception in autonomous driving. In: CVPR. pp. 2722–
2730 (2015)
[8] Chen, L.C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.:
Deeplab: Semantic image segmentation with deep convolutional nets, atrous
convolution, and fully connected crfs. TPAMI 40(4), 834–848 (2017)
[9] Codevilla, F., Müller, M., López, A., Koltun, V., Dosovitskiy, A.: End-to-
end driving via conditional imitation learning. In: 2018 IEEE International
Conference on Robotics and Automation (ICRA). pp. 1–9. IEEE (2018)
[10] Goyal, A., Islam, R., Strouse, D., Ahmed, Z., Botvinick, M., Larochelle, H.,
Levine, S., Bengio, Y.: Infobot: Transfer and exploration via the information
bottleneck. ICLR (2019)
[11] Gunning, D.: Explainable artificial intelligence (XAI). Defense Advanced
Research Projects Agency (DARPA) (2017)
[12] Hecker, S., Dai, D., Van Gool, L.: End-to-end learning of driving models
with surround-view cameras and route planners. In: ECCV (2018)
[13] Hendricks, L.A., Hu, R., Darrell, T., Akata, Z.: Grounding visual explana-
tions. In: ECCV (2018)
[14] Kim, J., Canny, J.: Interpretable learning for self-driving cars by visualizing
causal attention. ICCV (2017)
[15] Kim, J., Rohrbach, A., Darrell, T., Canny, J., Akata, Z.: Textual explana-
tions for self-driving vehicles. In: ECCV (2018)
[16] Kingma, D., Ba, J.: Adam: A method for stochastic optimization. ICLR
(2015)
[17] Park, D.H., Hendricks, L.A., Akata, Z., Schiele, B., Darrell, T., Rohrbach,
M.: Multimodal explanations: Justifying decisions and pointing to the evi-
dence. In: CVPR (2018)
[18] Parmar, N., Vaswani, A., Uszkoreit, J., Kaiser, Ł., Shazeer, N., Ku, A.,
Tran, D.: Image transformer. ICML (2018)
[19] Pomerleau, D.A.: Alvinn: An autonomous land vehicle in a neural network.
In: NeurIPS. pp. 305–313 (1989)
[20] Sauer, A., Savinov, N., Geiger, A.: Conditional affordance learning for driv-
ing in urban environments. CoRL (2018)
[21] Selvaraju, R.R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., Batra,
D.: Grad-cam: Visual explanations from deep networks via gradient-based
localization. In: ICCV. pp. 618–626 (2017)
[22] Seo, P.H., Lin, Z., Cohen, S., Shen, X., Han, B.: Progressive attention net-
works for visual attribute prediction. BMVC (2018)
[23] Tishby, N., Pereira, F.C., Bialek, W.: The information bottleneck method.
Proceedings of The 37th Allerton Conference on Communication, Control,
and Computing pp. 368–377 (1999)
[24] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N.,
Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: NeurIPS (2017)
[25] Wang, D., Devin, C., Cai, Q.Z., Krähenbühl, P., Darrell, T.: Monocular
plan view networks for autonomous driving. IROS (2019)
[26] Wang, D., Devin, C., Cai, Q.Z., Yu, F., Darrell, T.: Deep object centric
policies for autonomous driving. ICRA (2019)
[27] Xu, H., Gao, Y., Yu, F., Darrell, T.: End-to-end learning of driving models
from large-scale video datasets. In: CVPR (2017)
[28] Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhudinov, R., Zemel,
R., Bengio, Y.: Show, attend and tell: Neural image caption generation with
visual attention. In: ICML. pp. 2048–2057 (2015)
[29] Zeiler, M.D., Fergus, R.: Visualizing and understanding convolutional net-
works. In: ECCV. pp. 818–833. Springer (2014)
[30] Zeng, W., Luo, W., Suo, S., Sadat, A., Yang, B., Casas, S., Urtasun, R.:
End-to-end interpretable neural motion planner. In: CVPR. pp. 8660–8669
(2019)
[31] Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., Torralba, A.: Learning deep
features for discriminative localization. In: CVPR. pp. 2921–2929 (2016)