Attentional Bottleneck: Towards an
Interpretable Deep Driving Network
Jinkyu Kim and Mayank Bansal
Waymo Research
{jinkyukim, mayban}@waymo.com
Abstract. Deep neural networks are a key component of behavior pre-
diction and motion generation for self-driving cars. One of their main
drawbacks is a lack of transparency: they should provide easy to in-
terpret rationales for what triggers certain behaviors. We propose an
architecture called Attentional Bottleneck with the goal of improving
transparency. Our key idea is to combine visual attention, which iden-
tifies what aspects of the input the model is using, with an information
bottleneck that enables the model to only use aspects of the input which
are important. This not only provides sparse and interpretable attention
maps (e.g. focusing only on specific vehicles in the scene), but it adds
this transparency at no cost to model accuracy. In fact, we find slight
improvements in accuracy when applying Attentional Bottleneck to the
ChauffeurNet model, whereas we find that the accuracy deteriorates with
a traditional visual attention model.
Keywords: Self-driving vehicles, eXplainable AI, Motion generation
1 Introduction
Deep neural networks are powerful function estimators and have been a key
component in self-driving software systems [2, 30]. Such networks are, however,
notoriously cryptic – their hidden layer activations may have no obvious rela-
tion to the function being estimated by the network. Interpretable models that
make deep models more transparent are important for a number of reasons: (1)
user-acceptance:neuralnetworkautonomouscontrolisaradicaltechnologythat
requiresaveryhigh-levelofuser-trust,(2)extrapolation:usersshouldbeableto
anticipatewhatthevehiclewilldoinmostscenariosbyunderstandingthecausal
behavior of the model, and (3) human-vehicle communication: communication
can be grounded in the vehicle’s internal state.
One way of making models transparent is via visual attention [28, 17, 14].
Visual attention finds spatially varying scalar attention weights α(x,y) ∈ [0,1]
typically by learning a multi-layer perceptron from a set of input features F =
{f(x,y)}. Attended features A = {a(x,y)} obtained as a(x,y) = α(x,y)f(x,y)
are then used by the model instead of the original features F. The model is
trained end-to-end leading the attention weights to link the network’s output to
its input – visualizing the weights as a 2D heatmap thus provides insight into
Fig.1: An overview of our interpretable driving model. Our model takes a top-
down input representation I and outputs the future agent poses Y along with
an attention map. An Attentional Bottleneck encodes the inputs I to a latent
vector z while also producing an interpretable attention heat map. The motion
generator operates in a partially observable environment using only the dense
scene context S ⊂ I along with z to predict poses Y.
the areas of the input image that the network attends to. Furthermore, to be
easily interpretable, attention needs to be sparse (i.e. low entropy), while ideally
also enhancing the performance of the original model. Unfortunately, given the
complexity of the driving task, we find that a straightforward integration of
attention maps tends to find all potentially salient image areas, resulting in
limited interpretability (e.g. Fig. 2).
In this work, we manage to achieve sparse and salient attention maps and
good final model performance, by attaching attention to a bottlenecked latent
representation of the input. However, given the information loss in the bottle-
neck, we need to provide the model direct access to a subset of dense inputs (e.g.
road lane geometry and connectivity information) that are harder to compress.
This frees up the bottleneck branch to focus on selecting the most relevant parts
of the dynamic input (e.g. nearby objects), while retaining the model performance.
End-to-end driving models that directly process a camera image as input have
several scene elements confounded into nearby pixels, thus making a separation
into dense and sparse input subsets infeasible. Therefore, we focus on improving
the interpretability of a driving model that uses a mid-level input representation.
This means that instead of directly using low-level sensor data, the model uses
higher-level semantic information like objects detected by a perception system.
As a proxy for such a network, we work with the recently published Chauffeur-
Net [2] model, although the ideas presented are more generally applicable. The
inputs I to this network consist of information about the roadmap, traffic lights,
dynamic objects, etc. rendered in separate channels in a common top-down view
coordinate system around the agent. The model predicts future agent poses Y
in the same top-down view (see Fig. 3).
To generate sparser and more interpretable attention maps, we propose an
architecture called Attentional Bottleneck (Fig. 1) that combines visual attention
with the information bottleneck approach [23] of training deep models through
supervised learning [1, 6, 10]. We define z as a bottleneck latent representation
of an attention-weighted feature encoding A_I = α_I · F_I of the input features
F_I. We leverage the mid-level input representation to separate the subset of
dense inputs into a set S ⊂ I. Conditioned on z and S, the motion generator
finally predicts the target Y. Our goal is to learn both the attention weighting
function α_I and an encoding z that is maximally informative about the target
Y. To prevent z from being the identity encoding of the inputs and to focus
the network on specific areas of causality, we impose an information bottleneck
constraint on the complexity of z by a pooling operation. We preserve spatial
information in the attention map by incorporating a positional encoding step,
and encode non-local information by using Atrous convolutions.
We evaluate our approach on the large-scale (≈ 60 days of continuous driv-
ing) dataset from [2] and show quantitative and qualitative results illustrating
that our generated attention maps result in much sparser (and thus more in-
terpretable) visualization of the internal states than a baseline visual attention
model. We also show that our approach improves the motion generation accu-
racy in contrast to a traditional visual attention model that results in decreased
accuracy.
2 Related Work
2.1 Deep Driving Models
Recently, there has been growing interest in end-to-end driving models that process raw
sensor data to directly output driving controls. Most of these approaches learn a
driving policy through supervised regression over observation-action pairs from
human drivers. ALVINN (Autonomous Land Vehicle In a Neural Network) [19]
was the first attempt to train a neural network for the navigational task of road
following. Bojarski et al. [4] trained a 5-layer ConvNet to predict steering con-
trols only from a dashcam image, while Xu et al. [27] utilized a dilated ConvNet
combined with an LSTM to predict the vehicle's discretized future motions.
Hecker et al. [12] explored an extended model that takes a surround-view multi-
camera system, a route planner, and a CAN bus reader. Codevilla et al. [9] ex-
plored a conditional end-to-end driving model that takes high-level command
input (i.e. left, straight, right) at test time. These models show good perfor-
mance in simple driving scenarios (i.e. lane following). Their behavior, however,
is opaque and learning to drive in urban areas remains challenging.
To reduce the complexity and for better interpretability, there is growing
interest in end-to-mid and mid-to-mid driving models that produce a mid-level
output representation in the form of a drivable trajectory by consuming either
raw sensor data or an intermediate scene representation as input. Zeng et al. [30]
introduced an end-to-mid neural motion generator, which takes Lidar data and
an HD map as inputs and outputs a future trajectory. This model also detects 3D
bounding boxes as an intermediate representation. Bansal et al. [2] introduced
ChauffeurNet, a mid-to-mid model that takes advantage of separate perception
and control components. Using a top-down representation of the environment
and intended route as input, the model outputs a driving trajectory that is
consumed by a controller, which then translates it to steering and acceleration.
Recent works [5, 25] also suggest that such a top-down scene representation can
successfully be used to learn high-level semantic information. In this work, we
focus on improving the explainability of such a model.
2.2 Visual Explanations
Explainability of deep neural networks has seen growing interest in computer
vision and machine learning [11]. In landmark work, Zeiler et al. [29] utilized
deconvolution layers to visualize the internal representation of a ConvNet. Bo-
jarski et al. [3] developed a richer notion of contribution of a pixel to the output,
while other approaches [31, 21] have explored synthesizing an image causing high
neuron activations. However, a difficulty with de-convolution based approaches is
the lack of a formal notion of contribution of spatially-extended features (rather
than pixels).
Attention-based approaches [28, 17] have been increasingly employed for im-
proving a model’s ability to explain by providing spatial attention maps that
highlight areas of the image that the network attends to. Kim et al. [14] utilize
an attention model followed by additional salience filtering to show regions that
causally affect the output. To reduce the complexity of explanations, Wang et
al. [26] introduce an instance-level attention model that finds objects (i.e. cars
and pedestrians) that the network needs to pay attention to. Such attention may
be more intuitive and interpretable for users to understand the model's behavior.
However, the model needs to take the whole input context as an additional input,
which may compromise the causality of the attention – explanations may not
represent causal relationships between the system’s input and its behavior. To
preserve the causality, we use a top-down representation of the environment as
an input, which consists of information around the agent rendered in separable
channels.
Another notable approach is the work by Chen et al. [7], which defined
human-interpretable intermediate features such as the curvature, deviation to
neighboring lanes, and distances from the front-located vehicles. A CNN is
trained to produce these features from an image, and a simple controller maps
Fig.2: (left) Attentional Bottleneck design compared with a baseline visual at-
tention model applied to ChauffeurNet. (right) Comparison of attention maps
from our model against those from a baseline visual attention model. Note that
our heatmaps are much sparser and thus more interpretable.
them to a steering angle. Similarly, Sauer et al. [20] proposed a conditional affor-
dance learning approach that maps visual inputs to intermediate representations
conditioned on high-level command input. However, the intermediate feature
descriptors provide a limited and ad-hoc vocabulary for explanations. Zeng et
al. [30] co-trained a perception model that provides bounding boxes of dynamic
objects, which are then used as an intermediate and interpretable feature. Here,
we instead take full advantage of existing well-established perception systems
with different sensor sources (i.e. Lidar, Radar, and Camera) via a mid-level
input representation. This reduces the complexity of the driving network and
employs more reliable perception outputs.
There is also a growing effort on textual explanations that justify the deci-
sions that were made and explain the “why” in a natural language [13, 17, 15].
However, textual explanations are often rationalizations – explanations that jus-
tify the system’s behavior in a post-hoc manner – and are less helpful with
understanding the causal behavior of the model. In this work, we focus on im-
proving attention-based mechanisms – to provide introspective explanations that
are based on the system’s internal state and represent causal relationships be-
tween the system’s input and its behavior.
3 Mid-to-mid Driving Model with Visual Attention
3.1 ChauffeurNet
Bansal et al. [2] introduced a mid-to-mid driving network called ChauffeurNet
that recurrently predicts future poses of the agent by processing a top-down
Fig.3: Top-down rendered inputs I (left) and outputs Y (right) for the Chauf-
feurNet model. Inputs: (a) Roadmap, (b) Speed Limits, (c) Past Agent Poses,
(d) Current Agent Box, (e) Route, (f) Traffic Lights, (g) Dynamic Objects.
Outputs: (h) Future Agent Boxes, (i) Future Agent Poses. The subset of dense
scene context inputs S is shown in the top row.
representation of the environment as an input. For completeness, we summarize
some of the key details of the paper here. The input to the neural network consists
of a set of images I of size W×H pixels rendered into a top-down view coordinate
system that is fixed relative to the current location of the agent. As shown in
Fig. 3, the set I contains: (a) Roadmap: a 3-channel image with a rendering
of color coded lanes, stop signs, cross-walks, etc. (b) Speed limit: a gray-scale
image with lane centers color coded in proportion to their known speed limit.
(c) Past agent poses: the ego-vehicle’s past poses rendered as a trail of points.
(d) Current agent box: the current agent represented by a full bounding box.
(e) Route: the intended route. (f) Traffic lights: a gray-scale image where each lane center
is color coded to reflect different traffic light states (red light: brightest gray
level, green light: darkest gray level). (g) Dynamic objects: a gray-scale image
that renders all the potential dynamic objects (vehicles, cyclists, pedestrians)
as oriented boxes. Both (f) and (g) are a sequence of 5 images reflecting the
environment state over the past 5 timesteps.
Visual Encoder (FeatureNet). In the ChauffeurNet model, the rendered in-
puts I are fed to a large-receptive field convolutional FeatureNet with skip con-
nections, which outputs features F that capture the environmental context and
the intent. This feature F (of size w×h×d) contains a set of d-dimensional la-
tent vectors over the spatial dimension, i.e. F = {f_1, f_2, ..., f_l}, where f_i ∈ R^d
and l (= w×h) is the spatial dimension of the extracted features. Selecting a
subset of these feature slices will allow the attention model to selectively attend
to different parts of input images.
Motion Generator (AgentRNN). The feature encoding F is fed to a recur-
rent neural network (AgentRNN) which predicts the outputs Y consisting of the
next point p_k on the driving trajectory, and the agent bounding box heatmap
B_k, conditioned on the features F, the iteration number k ∈ {1,...,N}, the
memory M_{k−1} of past predictions from AgentRNN, and the agent bounding box
heatmap B_{k−1} predicted in the previous iteration.

p_k, B_k = AgentRNN(k, F, M_{k−1}, B_{k−1})    (1)
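To make the recurrence in Eq. 1 concrete, the following is a minimal Python sketch of the rollout loop only; the AgentRNN internals (a convolutional recurrent network trained end-to-end in [2]) are replaced by a placeholder, and the tensor shapes are illustrative assumptions rather than the paper's exact configuration.

import numpy as np

w, h, d, N = 50, 50, 128, 10       # illustrative sizes: 50x50x128 features, 10 iterations

def agent_rnn_step(k, F, memory, prev_box):
    # Placeholder for one AgentRNN iteration of Eq. 1; the real network is a
    # convolutional RNN trained end-to-end [2].
    next_pose = np.zeros(2)        # p_k: next waypoint in top-down coordinates
    next_box = np.zeros((w, h))    # B_k: agent bounding box heatmap
    return next_pose, next_box

# Recurrent rollout: each step is conditioned on the features F, the step index k,
# the memory of past predictions M_{k-1}, and the previous box heatmap B_{k-1}.
F = np.zeros((w, h, d))
memory = np.zeros((w, h, N))       # M: memory of past box predictions
box = np.zeros((w, h))             # B_0
poses = []
for k in range(1, N + 1):
    pose, box = agent_rnn_step(k, F, memory, box)
    memory[..., k - 1] = box       # append the new prediction to the memory
    poses.append(pose)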
3.2 ChauffeurNet with Visual Attention
One way of making models interpretable is via visual attention [17, 14]. These
models provide introspective (visual) explanations by filtering out non-salient
image regions – the remaining (attended) regions potentially have a causal ef-
fect on the output. The goal of visual attention is to find an attended feature
A = {a_1, a_2, ..., a_l}, where a_i ∈ R^d, from the original feature F. They utilize
a deterministic soft attention mechanism that is trainable by standard back-
propagation methods, which thus has advantages over a hard stochastic atten-
tion mechanism that requires reinforcement learning. As discussed by several
works [28, 14], the attended features can be computed as a_i = π(α_i, f_i) = α_i f_i
for i = {1,2,...,l}, where α_i are scalar attention weights in [0,1] satisfying
Σ_i α_i = 1. These weights are estimated from the input features F typically
by a multi-layer perceptron, i.e. α_i = f_MLP(f_i), where the parameters of f_MLP
are learned as part of training the entire model end-to-end. Since the attention
weights vary spatially and depend on the input (via the features F), they can be
visualized as an attention heatmap aligned with the input image, with brighter
regions reflecting areas salient for the task.
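As a concrete reference, here is a minimal numpy sketch of this deterministic soft attention (a_i = α_i f_i with α normalized by a softmax over spatial locations); the random linear map standing in for the learned f_MLP and the feature sizes are assumptions for illustration only.

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def soft_attention(F, mlp):
    # Baseline spatial soft attention: alpha_i = softmax(MLP(f_i)), a_i = alpha_i * f_i.
    # F has shape (l, d) with l = w*h flattened spatial locations.
    logits = np.array([mlp(f) for f in F])   # one scalar logit per spatial location
    alpha = softmax(logits)                   # attention weights sum to 1 over space
    A = alpha[:, None] * F                    # attended features, same shape as F
    return A, alpha

# Toy usage with a random linear stand-in for the learned f_MLP.
rng = np.random.default_rng(0)
W = rng.normal(size=128)
F = rng.normal(size=(50 * 50, 128))
A, alpha = soft_attention(F, lambda f: float(W @ f))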
To allow us to explain the driving decisions made by ChauffeurNet, we apply
this vanilla visual attention approach by replacing the original features F in
Eq. 1 with the attended features A as shown in Fig. 2 (left). As shown in Fig. 2
(right), this approach generates vague and verbose attention maps which do not
add to the interpretability of the model. Therefore, we use this approach as a
baseline for our “Attentional Bottleneck” approach.
4 Attentional Bottleneck
We propose a novel architecture called Attentional Bottleneck with a focus on
generating sparse and fine-grained visual explanations. We encode the environ-
ment I through an information bottleneck that serves to restrict information in
the input to only the most relevant parts of the input, and thus allows the driv-
ing model to focus on specific features in the environment. We tie this feature
selection to the spatial distribution of features by employing a spatial attention
mechanism before the bottleneck. While the driving task involves focusing on
specific objects and entities in the scene for the immediate driving decisions, hu-
mans also employ a holistic understanding of some elements of the environment.
For example, humans are aware of the overall map of the environment through
visual scanning or through looking at a navigation app. We find that compress-
ing this kind of dense information through the bottleneck either leads to dense
attention maps or degrades the model performance. Therefore, we leverage the
mid-level separable input representation and provide the model full access to a
Fig.4: Atrous Spatial Attention Block. We apply three parallel Atrous convolu-
tions with different atrous rates. The resulting features from all three branches
are then concatenated and fed into 1x1 convolution and softmax layers to gen-
erate the attention weights α.
subset of inputs S ⊂ I containing the dense context about the environment,
through a separate branch. This frees up the bottleneck branch to focus on spe-
cific parts of the input (e.g. specific objects) making the attention map sparser
and more interpretable.
As shown in Fig. 2, our modified ChauffeurNet model consists of a dense
input encoder branch and an attentional bottleneck branch providing encoded
input features to the AgentRNN.
Grounding Attentional Bottleneck into AgentRNN. Like the baseline
model, the inputs I are first encoded into features F_I by the FeatureNet net-
work. To capture non-local information, we propose an Atrous Spatial Attention
layer that computes the attention weights α_I and outputs the attended features
A_I. The attended features are depth-concatenated with a positional encoding
V, followed by a multi-layer perceptron g_MLP and an average pooling layer, to
generate the final bottleneck representation z.
z = Σ_{i=1}^{l} g_MLP([a_i; v_i])    (2)
The dense scene context inputs S are similarly encoded into features F_S using
another FeatureNet network with identical architecture. We modify AgentRNN
to incorporate the bottleneck vector by concatenating it with each of the features
f_i ∈ F_S:

p_k, B_k = AgentRNN(k, F_S, z, M_{k−1}, B_{k−1})    (3)
We discuss the Atrous Spatial Attention and the positional encoding stages
in the following paragraphs and present ablation results for these blocks in the
experiments.
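A minimal numpy sketch of Eqs. 2 and 3 as we read them: the bottleneck vector z is a spatial sum of g_MLP applied to each attended feature concatenated with its positional encoding, and z is then concatenated with every dense scene feature before AgentRNN; the linear stand-in for g_MLP and all dimensions are illustrative assumptions, not the trained model.

import numpy as np

def bottleneck_vector(A, V, g_mlp):
    # Eq. 2: z = sum_i g_MLP([a_i ; v_i]) over all spatial locations.
    # A: attended features (l, d); V: positional encodings (l, d).
    concat = np.concatenate([A, V], axis=-1)       # [a_i ; v_i], shape (l, 2d)
    return np.sum(np.stack([g_mlp(x) for x in concat]), axis=0)

def inject_bottleneck(F_S, z):
    # Input preparation for Eq. 3: concatenate z with every dense scene feature
    # f_i in F_S before feeding AgentRNN.
    l = F_S.shape[0]
    return np.concatenate([F_S, np.tile(z, (l, 1))], axis=-1)

rng = np.random.default_rng(0)
l, d = 2500, 128                                   # 50x50 spatial locations, d channels
W = rng.normal(size=(d, 2 * d)) * 0.01             # stand-in for the learned g_MLP
A, V, F_S = rng.normal(size=(l, d)), rng.normal(size=(l, d)), rng.normal(size=(l, d))
z = bottleneck_vector(A, V, lambda x: W @ x)
F_aug = inject_bottleneck(F_S, z)                  # shape (l, 2d), consumed by AgentRNN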
Atrous Spatial Attention. Attention models are typically applied to features
generated by the last layer of a convolutional encoder. Attention weights for each
spatial location are usually computed independently, allowing them to capture
local information around the corresponding specific spatial location (e.g. “there
isapedestrianrunning”).However,wearguethattheattentionmodelalsoneeds
to capture non-local information especially for the driving task (e.g. “there is a
pedestrian running towards the crosswalk ahead"). Seo et al. [22] explored using
3×3 convolution to consider local context in generating attention maps. Here, we
advocate using Atrous convolution (also known as dilated convolution), which
has been shown to be effective for accurately capturing semantic information at
an arbitrary scale [8].
As shown in Fig. 4, we apply three parallel Atrous convolutions with different
rates on top of the feature map F_I. For implementation, we closely follow the
work by Chen et al. [8]. Specifically, our atrous convolution layers include a
1×1 convolution, and two 3×3 convolutions with rates 2 & 4 with d filters
and batch normalization. The resulting features from all the branches are then
concatenated and fed into another 1×1 convolution to generate the attention
logits. A spatial softmax yields normalized attention weights α.
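One way to realize this block, sketched in PyTorch (not the original implementation), is shown below; the layer ordering, the absence of extra non-linearities, and the tensor layout are assumptions based on the description above and Fig. 4.

import torch
import torch.nn as nn
import torch.nn.functional as F

class AtrousSpatialAttention(nn.Module):
    # Sketch of Fig. 4: 1x1 conv, 3x3 conv (rate 2), and 3x3 conv (rate 4) in parallel,
    # concatenation, a 1x1 conv to one logit per location, and a spatial softmax.
    def __init__(self, d):
        super().__init__()
        self.branch1 = nn.Sequential(nn.Conv2d(d, d, 1), nn.BatchNorm2d(d))
        self.branch2 = nn.Sequential(nn.Conv2d(d, d, 3, padding=2, dilation=2), nn.BatchNorm2d(d))
        self.branch3 = nn.Sequential(nn.Conv2d(d, d, 3, padding=4, dilation=4), nn.BatchNorm2d(d))
        self.to_logits = nn.Conv2d(3 * d, 1, 1)

    def forward(self, feat):                      # feat: (batch, d, h, w)
        b, _, h, w = feat.shape
        x = torch.cat([self.branch1(feat), self.branch2(feat), self.branch3(feat)], dim=1)
        alpha = F.softmax(self.to_logits(x).view(b, -1), dim=1).view(b, 1, h, w)
        return alpha * feat, alpha                # attended features A_I and weights alpha_I

attn = AtrousSpatialAttention(d=128)
attended, alpha = attn(torch.randn(1, 128, 50, 50))   # features at the paper's 50x50 resolution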
Positional Encoding. As shown in Equation 2, to obtain the latent bottle-
neck vector z, we use spatial summation with the attended features A_I, which
removes the positional information. To preserve this information, we append a
spatial basis to the feature A_I. Following Vaswani et al. [24] and Parmar et
al. [18], we generate a spatial basis V (of the same dimension as F) that con-
tains d-dimensional vectors V = {v_1, v_2, ..., v_l}, where v_i ∈ R^d. Each vec-
tor v_i encodes positional information about the spatial location (x_i, y_i) using
four types of Fourier basis functions, viz. sin(x_i/f_u), cos(x_i/f_u), sin(y_i/f_u), and
cos(y_i/f_u), where f_u = 1000^u is the spatial wavelength with channel index
u = {0, 4/d, 8/d, ..., d/d}. Each positional encoding feature v_i is then concate-
nated with the corresponding attended feature a_i ∈ A_I as shown in Equation 2.
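A short numpy sketch of how such a Fourier spatial basis could be generated follows; the exact channel ordering, the use of raw pixel coordinates for (x_i, y_i), and the allocation of d/4 wavelengths per basis type are assumptions for illustration.

import numpy as np

def positional_basis(w, h, d):
    # Spatial basis V: for each location (x, y) and channel index u, the four
    # Fourier features sin(x/f_u), cos(x/f_u), sin(y/f_u), cos(y/f_u) with
    # spatial wavelength f_u = 1000**u.
    n = d // 4                                     # assumed: d/4 wavelengths per basis type
    u = np.arange(n) * 4.0 / d                     # u = 0, 4/d, 8/d, ...
    f = 1000.0 ** u                                # spatial wavelengths f_u
    ys, xs = np.mgrid[0:h, 0:w].astype(float)
    xs, ys = xs[..., None] / f, ys[..., None] / f  # shape (h, w, n)
    V = np.concatenate([np.sin(xs), np.cos(xs), np.sin(ys), np.cos(ys)], axis=-1)
    return V.reshape(-1, 4 * n)                    # (l, d) with l = w*h

V = positional_basis(50, 50, 128)                  # same spatial layout as the features F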
5 Experiment
5.1 Dataset
We use the large-scale dataset from [2] that contains over 26 million expert driv-
ing examples amounting to about 60 days of continuous driving. Data has been
collected by a vehicle instrumented with multiple sensors (i.e. cameras, lidar,
radar). A reliable perception system provides accurate environmental descrip-
tions including dynamic objects (i.e. vehicles and pedestrians) and traffic light
states. Along with perception, the dataset also provides: (i) the prior map of the
environment (i.e. roadmap), (ii) vehicle pose information, and (iii) the speed-
limits. The input field of view is 80m×80m (a resolution of 400×400 pixels in
image coordinates) and the effective forward sensing range of the ego-vehicle is
64 meters.
5.2 Training and Evaluation Details
We trained our models end-to-end with Adam optimization [16] using exponentially
decaying learning rates and random initialization (i.e. no pre-trained
Fig.5: We provide typical examples of attention heat maps in diverse driving
scenarios: slowing down, stopping, driving a curvy road, avoiding obstacles, and
lane following. Our model attends to driving-related visual cues like highlighting
stop/yield signs, crosswalks or cars ahead that cause braking, road contours
on curved roads, or multiple pinch points from parked cars on narrow roads.
weights), with ChauffeurNet’s default losses. Our FeatureNet creates features F
with dimensions 50×50×128 which lead to 50×50 attention maps that are up-
sampled to the input resolution of 400×400 by a pyramid expansion step. Note
that we use ChauffeurNet’s default losses and training strategy [2] to train our
model end-to-end. The losses consist of pure imitation losses (i.e. agent position,
Fig.6: Comparison of motion generation performance and attention map spar-
sity between baseline ChauffeurNet, visual attention, and Attentional Bottleneck
ablation designs. Metrics shown are ADE (pixels), FDE (pixels), Collision, and
attention Entropy. Models: A: ChauffeurNet; B: A + Visual Attention; C: A +
Attentional Bottleneck; D: C + PerceptionRNN; E: D + Atrous Spatial Attention;
F: E + Positional Encoding.
heading, box, meta prediction loss) as well as environment losses to provide bet-
ter generalization (i.e. collision loss, on-road loss, geometry loss, and auxiliary
losses). For quantitative evaluation, we use the following metrics:
ADE and FDE. To quantitatively evaluate motion generation performance,
we use two widely-used (Euclidean distance-based) metrics: (i) the average dis-
placement error (ADE) (1/K) Σ_{k=0}^{K} ||p̂_k − p_k^gt||_2, and (ii) the final displacement error
(FDE) ||p̂_K − p_K^gt||_2, where K = 10 is the total number of predicted waypoints,
and the superscript gt denotes the ground-truth values.
Collision. We also use the collision rate by measuring the potential overlap of
the predicted agent box with the ground-truth boxes of all other objects in the
scene, i.e. (1/K) Σ_{k=0}^{K} Σ_{i,j} B_{k−1}^gt(i,j) B_k(i,j).
Entropy S(α). To measure the sparseness of the generated attention maps,
we measure the entropy of the generated attention heat map α, i.e. S(α) =
−Σ_{i=1}^{l} α_i log α_i.
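For reference, the ADE, FDE, and entropy metrics can be computed as in the following numpy sketch (the collision metric additionally requires the predicted and ground-truth box heatmaps and is omitted); shapes and values are illustrative only.

import numpy as np

def ade(pred, gt):
    # Average displacement error over the K predicted waypoints (pixels).
    return np.mean(np.linalg.norm(pred - gt, axis=-1))

def fde(pred, gt):
    # Final displacement error: distance at the last predicted waypoint.
    return np.linalg.norm(pred[-1] - gt[-1])

def attention_entropy(alpha, eps=1e-12):
    # Entropy S(alpha) of a normalized attention map; lower means sparser.
    alpha = alpha.ravel()
    return float(-np.sum(alpha * np.log(alpha + eps)))

# Toy usage with K = 10 waypoints in top-down pixel coordinates.
rng = np.random.default_rng(0)
pred, gt = rng.normal(size=(10, 2)), rng.normal(size=(10, 2))
alpha = rng.random((50, 50)); alpha /= alpha.sum()
print(ade(pred, gt), fde(pred, gt), attention_entropy(alpha))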
5.3 Quantitative Analysis
We start by quantitatively comparing our attentional bottleneck model (model
F) with the baseline ChauffeurNet [2] (model A) and ChauffeurNet with visual
attention (model B) models (see Fig. 2). We observe in Fig. 6 that the incorpora-
tion of visual attention for improving the interpretability of the baseline model
degrades its performance as measured by the larger ADE and FDE numbers.
This is not the case with our attentional bottleneck model where we observe im-
proved ADE and FDE numbers – possibly due to improved focus by the model
on specific causal factors. Examples in Fig. 2 (right) compare our attention maps
to those from the visual attention model, and confirm that the latter generates
verbose attention heat maps – finding all potentially salient objects. In con-
trast, our model provides much sparser attention heat maps which are easier
to associate with specific objects or rendered features and are thus easier to
interpret. This is evident by comparing their distributions where the attention12 J. Kim and M. Bansal
Fig.7: Qualitative comparison of attention maps (alongside the rendered input
images) between Attentional Bottleneck ablation designs. Models: A: ChauffeurNet;
C: A + Attentional Bottleneck; D: C + PerceptionRNN; E: D + Atrous Spatial
Attention; F: E + Positional Encoding.
weights from our model are mostly concentrated around zero probability values
(see supplemental figures).
Effect of incorporating Behavior Prediction. As detailed in [2], the pre-
diction of potential future trajectories of dynamic objects in the scene helps the
network learn better features for the motion generation task. However, to ac-
complish this the network would need to attend to all the objects. This renders
the attentional bottleneck useless for the objective of exposing only the objects
relevant for the primary goal of agent motion generation. In this paper, our fo-
cus is on the main motion generator branch, but for completeness, we present
one particular architecture choice that allows us to enable PerceptionRNN [2]
while preserving the interpretability of the attentional bottleneck. We add an
additional FeatureNet and Atrous Spatial Attention branch that encodes only
the dynamic object channels O into attended object features A_O which are fed
into PerceptionRNN. To allow the PerceptionRNN losses to influence the mo-
tion generation task, we also inject A_O into the AgentRNN bottleneck branch by
modifying features f_i ∈ F_I to f_i' = [f_i; a_i], where ; denotes depth concatenation
and a_i ∈ A_O. In Fig. 6, we compare the metrics from a baseline Attentional
Bottleneck model against one that is co-trained with PerceptionRNN in the
above setup and observe that this co-training indeed provides improvements in
all metrics. As shown in Fig. 7, incorporation of PerceptionRNN (model D) im-
proves situation-specific dependence on salient dynamic objects in the scene, i.e.
vehicles ahead and pedestrians crossing the crosswalk.
Effect of Atrous Spatial Attention. We illustrate the effect of capturing
more non-local information using Atrous convolutions in Fig. 6. Relative to the
model using only a local receptive field, this model achieves better ADE, FDE,
Fig.8: (A-C) Attention maps over time, sampled at every 5 timesteps, illustrate
the smooth variation of attention. For better visualization, attention maps are
overlaid and shown over the satellite map image. We provide six snapshots on
the bottom. Our model appears to attend to important cues, e.g. (a) multiple
pinch points, (b) oncoming lanes on a T-junction, (c, e) a stop sign, (d, e) a
crosswalk, and (f) vehicles. We also provide a video as supplemental materials.
Map data © Google 2020.
and Collision numbers while also improving the sparsity of the attention map.
Qualitatively, this model tends to attend to inter-object image regions (compare
D vs. E in Fig. 7).
Effect of Positional Encoding. Quantitatively, the motion generation regression
performance slightly degrades (but is still better than the original model) as
we concatenate a spatial basis to obtain the latent bottleneck vector. The attention
maps become more sparse and we find them to have better spatial alignment
with the causal objects (compare E and F in Fig. 7). Thus there is some tension
between sparse maps and motion generation performance.
5.4 Qualitative Analysis
Fig. 5 shows several examples covering common driving scenarios that involve
slowing down, stopping, avoiding obstacles etc. We visualize a flattened view
of the input (odd rows) and the corresponding attention maps (even rows).
The attention maps are overlaid on the input images and include contour lines
for easier viewing. Note that our attention maps are quite sparse and provide
plausible visual evidence of what triggered a particular behavior, e.g. highlighting
stop/yield signs or vehicles ahead causing deceleration, road contours when on
curved roads, or multiple pinch points from parked vehicles when on narrow
roads.
Fig.9: (A) Examples of counter-factual outcomes where the driving model ap-
pears to attend to alternative important cues, e.g. (top) vehicles ahead → a stop
sign, (bottom) red lights → vehicles ahead. (B) Examples where the driving
model appears to under-attend to important cues, e.g. ignoring oncoming lanes
on intersections.
Attention Maps over Time. In Fig. 8, we illustrate attention maps over
time for typical driving scenarios, i.e. avoiding multiple pinch points, stopping
at an intersection with 4-way stop signs, etc. We also provide snapshots on the
right column where our model attends to important driving-related cues. Our
supplementalvideoalsodemonstratesthesmoothvariationoftheattentionmaps
across several temporal sequences illustrating the changing causality with each
driving decision.
Counterfactual Experiments. Fig. 9 (A) shows examples where the attention
maps change in counterfactual driving scenarios obtained by removing a subset
of the environmental descriptions, i.e. dynamic objects (i.e. with vs. without vehi-
cles) and traffic light states (i.e. red vs. green). These indicate that our model
successfully focuses on the correct cues in specific situations. We provide more
diverse examples as supplemental materials.
Failure Cases. Fig. 9 (B) shows examples where the attention maps fail to
capture the correct causal behavior. This indicates either that the original model
failed to focus on the correct cues in specific situations (e.g. “look both ways
before making a right turn"), or that the attention model is not able to correctly
explain this situation.
6 Conclusions
We described an approach for improving the interpretability of a mid-to-mid deep
driving model by augmenting a visual attention model with an attentional bot-
tleneck layer. Our results highlight sparse attention maps which are easy to
interpret and do not degrade model performance. We see opportunity in taking
this further to generate instance-level attention maps and to also use these maps
as a guide to improving the performance of the baseline driving model.
Acknowledgements
We thank Dragomir Anguelov, Anca Dragan, and Alexander Gorban at Waymo
Research, John Canny, Trevor Darrell, Anna Rohrbach, and Yang Gao at UC
Berkeley for their helpful comments.

Bibliography
[1] Alemi, A.A., Fischer, I., Dillon, J.V., Murphy, K.: Deep variational infor-
mation bottleneck. ICLR (2017)
[2] Bansal, M., Krizhevsky, A., Ogale, A.: Chauffeurnet: Learning to drive by
imitating the best and synthesizing the worst. RSS (2019)
[3] Bojarski, M., Choromanska, A., Choromanski, K., Firner, B., Jackel, L.,
Muller, U., Zieba, K.: VisualBackProp: visualizing CNNs for autonomous driv-
ing. arXiv preprint (2016)
[4] Bojarski, M., Del Testa, D., Dworakowski, D., Firner, B., Flepp, B., Goyal,
P., Jackel, L.D., Monfort, M., Muller, U., Zhang, J., et al.: End to end
learning for self-driving cars. CoRR abs/1604.07316 (2016)
[5] Chai, Y., Sapp, B., Bansal, M., Anguelov, D.: Multipath: Multiple proba-
bilistic anchor trajectory hypotheses for behavior prediction. CoRL (2019)
[6] Chalk, M., Marre, O., Tkacik, G.: Relevant sparse codes with variational
information bottleneck. In: NeurIPS. pp. 1957–1965 (2016)
[7] Chen, C., Seff, A., Kornhauser, A., Xiao, J.: Deepdriving: Learning affor-
dance for direct perception in autonomous driving. In: CVPR. pp. 2722–
2730 (2015)
[8] Chen, L.C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.:
DeepLab: Semantic image segmentation with deep convolutional nets, atrous
convolution, and fully connected CRFs. TPAMI 40(4), 834–848 (2017)
[9] Codevilla, F., Müller, M., López, A., Koltun, V., Dosovitskiy, A.: End-to-
end driving via conditional imitation learning. In: 2018 IEEE International
Conference on Robotics and Automation (ICRA). pp. 1–9. IEEE (2018)
[10] Goyal, A., Islam, R., Strouse, D., Ahmed, Z., Botvinick, M., Larochelle, H.,
Levine, S., Bengio, Y.: InfoBot: Transfer and exploration via the information
bottleneck. ICLR (2019)
[11] Gunning, D.: Explainable artificial intelligence (XAI). Defense Advanced
Research Projects Agency (DARPA) (2017)
[12] Hecker, S., Dai, D., Van Gool, L.: End-to-end learning of driving models
with surround-view cameras and route planners. In: ECCV (2018)
[13] Hendricks, L.A., Hu, R., Darrell, T., Akata, Z.: Grounding visual explana-
tions. In: ECCV (2018)
[14] Kim, J., Canny, J.: Interpretable learning for self-driving cars by visualizing
causal attention. ICCV (2017)
[15] Kim, J., Rohrbach, A., Darrell, T., Canny, J., Akata, Z.: Textual explana-
tions for self-driving vehicles. In: ECCV (2018)
[16] Kingma, D., Ba, J.: Adam: A method for stochastic optimization. ICLR
(2015)
[17] Park, D.H., Hendricks, L.A., Akata, Z., Schiele, B., Darrell, T., Rohrbach,
M.: Multimodal explanations: Justifying decisions and pointing to the evi-
dence. In: CVPR (2018)
[18] Parmar, N., Vaswani, A., Uszkoreit, J., Kaiser, Ł., Shazeer, N., Ku, A.,
Tran, D.: Image transformer. ICML (2018)
[19] Pomerleau, D.A.: ALVINN: An autonomous land vehicle in a neural network.
In: NeurIPS. pp. 305–313 (1989)
[20] Sauer, A., Savinov, N., Geiger, A.: Conditional affordance learning for driv-
ing in urban environments. CoRL (2018)
[21] Selvaraju, R.R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., Batra,
D.: Grad-cam: Visual explanations from deep networks via gradient-based
localization. In: ICCV. pp. 618–626 (2017)
[22] Seo, P.H., Lin, Z., Cohen, S., Shen, X., Han, B.: Progressive attention net-
works for visual attribute prediction. BMVC (2018)
[23] Tishby, N., Pereira, F.C., Bialek, W.: The information bottleneck method.
Proceedings of The 37th Allerton Conference on Communication, Control,
and Computing pp. 368–377 (1999)
[24] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N.,
Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: NeurIPS (2017)
[25] Wang, D., Devin, C., Cai, Q.Z., Krähenbühl, P., Darrell, T.: Monocular
plan view networks for autonomous driving. IROS (2019)
[26] Wang, D., Devin, C., Cai, Q.Z., Yu, F., Darrell, T.: Deep object centric
policies for autonomous driving. ICRA (2019)
[27] Xu, H., Gao, Y., Yu, F., Darrell, T.: End-to-end learning of driving models
from large-scale video datasets. In: CVPR (2017)
[28] Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhudinov, R., Zemel,
R.,Bengio,Y.:Show,attendandtell:Neuralimagecaptiongenerationwith
visual attention. In: ICML. pp. 2048–2057 (2015)
[29] Zeiler, M.D., Fergus, R.: Visualizing and understanding convolutional net-
works. In: ECCV. pp. 818–833. Springer (2014)
[30] Zeng, W., Luo, W., Suo, S., Sadat, A., Yang, B., Casas, S., Urtasun, R.:
End-to-end interpretable neural motion planner. In: CVPR. pp. 8660–8669
(2019)
[31] Zhou, B., Khosla,A., Lapedriza, A., Oliva, A., Torralba, A.: Learning deep
features for discriminative localization. In: CVPR. pp. 2921–2929 (2016)