Identifying Driver Interactions via Conditional Behavior Prediction
Ekaterina Tolstaya1, Reza Mahjourian2, Carlton Downey2,
Balakrishnan Varadarajan2, Benjamin Sapp2, Dragomir Anguelov2
Abstract— Interactive driving scenarios, such as lane changes,
merges and unprotected turns, are some of the most challenging
situations for autonomous driving. Planning in interactive
scenarios requires accurately modeling the reactions of other
agents to different future actions of the ego agent. We develop
end-to-end models for conditional behavior prediction (CBP)
that take as input a query future trajectory for an ego-
agent, and predict distributions over future trajectories for
other agents conditioned on the query. Leveraging such a model,
we develop a general-purpose agent interactivity score derived
from probabilistic first principles. The interactivity score allows
us to find interesting interactive scenarios for training and
evaluating behavior prediction models. We further demonstrate
that the proposed score is effective for agent prioritization under
computational budget constraints.
I. INTRODUCTION
Behavior prediction is a core component of real-world
systems involving human-robot interaction. This task is
particularly challenging due to the high degree of uncertainty
in the future—the intent of human actors is unobserved, and
multiple interacting agents may continually influence one
another.
We are particularly interested in the high-impact application
of Autonomous Vehicles (AV), in which a robot may wish
to pose behavior prediction queries of the form “If I take
action X, what will agent B do?”, as shown in Figure 1. We
assert that this type of conditional inference is important and
fundamental for making planning decisions in an interactive
driving environment. In this paper, we focus on probabilistic
models of future behavior that can condition on possible
future action sequences (i.e., trajectories) of other agents. We
call this task conditional behavior prediction (CBP).
In the literature, there is a family of behavior prediction
models for which the conditioning capability comes naturally:
those that employ step-wise, iterative sampling (“roll-outs”)
for multiple agents in a scene, e.g. [1], [2], [3], [4]. In such
models, it is possible to control the action sequences for a
subset of agents, so that the roll-out of others will take them
into account. While flexible, these sample-based models have
significant disadvantages for real-world applications: sample-
based inference is risky to employ in a safety-critical system,
iterative errors can compound [5], it is difficult to control
sample diversity [6], [7], and attempting to jointly model all
agents is often intractable computationally. Some past work
has focused on tightly-coupled robot-human interaction in
1 General Robotics Automation Sensing and Perception (GRASP)
Laboratory at the University of Pennsylvania, eig@seas.upenn.edu
2 Waymo, rezama@waymo.com
The first author acknowledges support from the NSF Graduate Research
Fellowships Program.
Fig. 1. A conditional behavior prediction model describes how one
agent’s predicted future trajectory can shift due to the actions of
other agents.
limited driving game environments: [4] iteratively conditions
on generated human actions in a CVAE framework; [8]
formulates the interaction problem as a 2-player game with
human reward learned via inverse reinforcement learning.
As an alternative to sample-based models, there is a long
line of work on single-shot, passive behavior prediction [9],
[10], [11], [12], [13], [14], [15]. These models are compelling
due to tractability and practical parametric output distributions,
and have become the popular choice in AV systems and
associated benchmarks [15]. However, these models ignore
the fact that the AV ego-agent will take actions in the future,
which may cause a critical reaction by another agent. Using
such models makes decision-making challenging: because
the models do not condition on any explicit ego actions, they
must implicitly account for all possible ego-actions (or ignore
interactions altogether). In practice, interaction modeling has
been handled via aggregating neighboring agents’ observed
states via max-pooling, transformer layers, or graph neural
network architectures [1], [16], [17], [18].
In this paper, we propose a single-shot, conditional behavior
prediction model. Our CBP model is a powerful, end-to-end
trained deep neural network, which takes into account static
and dynamic scene elements—road lanes, agent state histories
(vehicle, pedestrian and cyclist), traffic light information, etc.
From these inputs, we predict a diverse set of future outcomes,
represented as Gaussian Mixture distributions, where each
mixture component corresponds to a future state sequence
(i.e., a trajectory with uncertainty). We train these models
to be capable of conditional inference by selectively adding
future trajectory information for some agents as additional
inputs to the model. We use large datasets of logged driving
data and train models via maximum likelihood to output
either conditional or passive (marginal) predictions for any
subset of agents in a scene. The recently proposed WIMP
model [19] is also a single-shot conditional inference model;
ours differs in that we condition on generic trajectories for
any subset of agents in a fully probabilistic framework.
The notion of interactivity is a key concept for this problem,
and a key contribution of this paper is to formalize the notion
and obtain a simple and practical interactivity score. Now that
we are equipped with a probabilistic model for conditional
future distributions, we can quantify a notion of interactivity
as follows. We quantify the degree of influence one agent
has on another as the KL-divergence between (a) the agent’s
future distribution conditioned on the other’s future and (b)
its marginal distribution. We then take an expectation over all
possible conditioned futures for the other agent to get a final
interactivity score. This results in a simple, agent-symmetric
computation in the form of mutual information between the
two agents' futures. In contrast, past work has hand-designed
models of surprise or discomfort for motion planning [20],
[21], [22], [23]. Entropy and mutual information have been
previously used in AV applications as a measure of uncertainty
to predict collisions [24].
In real-world driving, the interactivity score can be used to
anticipate driver interactions. When processing data offline,
we demonstrate the use of the interactivity score for mining
interactive scenarios that are potentially unsafe, since the
target agent’s expectations are being violated. Furthermore,
we demonstrate the benefits of the interactivity score for
prioritizing agents for behavior prediction and planning. In
contrast to previous work that built a special-purpose model
trained directly for the task of prioritization [23], which
was derived from an implicitly-defined side-channel output
of a blackbox planner, we provide a formulation that is
independent of a specific planner definition and, consequently,
more generally applicable.
Our contributions can be summarized as follows: (1) We
provide a novel, principled information-theoretic definition
of interactivity, which applies to any multi-agent interaction
application, (2) we develop a first-of-its-kind, single-shot,
deep neural network for probabilistic conditional behavior
prediction and (3) we show our interactivity score improves
state-of-the-art model performance in several settings.
II. DEFINING AGENT INTERACTIVITY
We define an agent trajectory S as a fixed-length, time-
discretized sequence of agent states up to a finite time horizon.
All quantities in this work consider a pair of agents A and
B. Without loss of generality, we consider A to be the query
agent whose plan for the future can potentially affect B,
the target agent. The future trajectories of A and B are random variables $S^A$ and $S^B$. The marginal probability of a particular realization of agent B's trajectory $s^B$ is given by $p(S^B = s^B)$, also indicated by the shorthand $p(s^B)$. The conditional distribution of agent B's future trajectory given a realization of agent A's trajectory $s^A$ is given by $p(S^B = s^B \mid S^A = s^A)$, indicated by the shorthand $p(s^B \mid s^A)$.
Even in highly interactive scenarios, agents may behave as
expected by other agents and not exert any influence on one
another. We define a surprising interaction as one in which the target agent experiences a change in its behavior due to the
query agent’s observed trajectory. When we have access to
ground-truth future trajectories, we can quantify interactions
by estimating the change in log likelihood of the target’s
ground-truth future sB:
$$\Delta LL := \log p(s^B \mid s^A) - \log p(s^B) \tag{1}$$
A large change in the log-likelihood indicates a situation in which the likelihood of the target agent's trajectory changes significantly as a result of the query agent's action. If the target's trajectory $s^B$ becomes more likely given the query agent's trajectory $s^A$, then $\Delta LL$ will be positive. If it becomes less likely, then $\Delta LL$ will be negative. If there is no change, then $\Delta LL$ will be zero.
A query agent may need to estimate the impact of a planned future trajectory $s^A$ on the target agent B. Since we don't have access to the ground-truth future for the target agent, we can quantify the potential for a surprising interaction by estimating the shift in the distribution of the target agent's trajectory. More specifically, we use the KL-divergence between the conditional and marginal distributions for the target's predicted future trajectory $S^B$ to quantify the degree of influence exerted on B by a trajectory $s^A$:
$$D_{\mathrm{KL}}\big(p(S^B \mid s^A)\,\big\|\,p(S^B)\big) = \int p(s^B \mid s^A)\,\log \frac{p(s^B \mid s^A)}{p(s^B)}\,ds^B \tag{2}$$
For example, in Fig. 1, if the query agent decides to change
lanes in front of the target agent, the target agent will have
to slow down. In this case, the KL-divergence will reflect
a significant change in the target agent’s expected behavior
as a result of the query agent’s planned lane change. In the
absence of a particular plan for the query or target agent,
we can consider the set of all possible actions for the query
agent and compute the expectation of the degree of influence
over all those possible actions. This expectation is defined
as the mutual information between the two agents’ future
trajectories $S^A$ and $S^B$, and is computed as:
$$I(S^A, S^B) = \int p(s^A)\, D_{\mathrm{KL}}\big(p(S^B \mid s^A)\,\big\|\,p(S^B)\big)\, ds^A \tag{3}$$
Mutual information expresses the dependence between two
random variables. It is non-negative, $I(S^A, S^B) \ge 0$, and symmetric, $I(S^A, S^B) = I(S^B, S^A)$ [25]. We use this
quantity as the interactivity score between agents A and
B. For example, if the target agent is driving closely behind
the query agent, we expect their interactivity score to be high
because the target agent is likely to respond immediately to
any actions, such as deceleration or acceleration, from the
query agent.
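To make the computation above concrete, the following is a minimal Monte Carlo sketch of the interactivity score, assuming access to hypothetical samplers and log-density evaluators for the two agents' futures; all function names here are placeholders, not the paper's implementation.

```python
import numpy as np

def interactivity_score(sample_sA, sample_sB_given, log_p_cond, log_p_marg,
                        n_outer=16, n_inner=64):
    """Monte Carlo estimate of I(S_A, S_B) = E_{s_A}[ KL(p(S_B|s_A) || p(S_B)) ]."""
    kl_terms = []
    for _ in range(n_outer):
        s_a = sample_sA()  # one plausible future for the query agent A
        # Inner expectation of Eq. (2): sampling s_B from the conditional
        # turns the KL integral into an average of log-density ratios.
        inner = [log_p_cond(s_b, s_a) - log_p_marg(s_b)
                 for s_b in (sample_sB_given(s_a) for _ in range(n_inner))]
        kl_terms.append(np.mean(inner))
    return max(0.0, float(np.mean(kl_terms)))  # mutual information is non-negative
```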
III. METHOD
In the previous section, we developed a measure of
interactivity between a pair of agents. In this section, we
discuss training a conditional behavior prediction model that
can estimate the distributions $p(s^B)$ and $p(s^B \mid s^A)$. We discuss
the internals of this model and losses. We then discuss the
process for computing the interactivity score by sampling
from the predicted distributions.
Let x denote observations from the scene, including past
trajectories of all agents, and context information such as
lane semantics. Let t denote a discrete time step, and let $s_t$ denote the state of an agent at time t. The realization of the future trajectory $s = \{s_1, \ldots, s_T\}$ is a sequence of states for $t \in \{1, \ldots, T\}$, a fixed horizon.
A CBP model predicts $p(S^B \mid S^A = s^A, x)$, the distribution of future trajectories for B conditioned on $s^A$. The CBP model receives as input a realization of the future trajectory of the query agent, $s^A = [s^A_1, \ldots, s^A_T]$, which we refer to as agent A's plan, or the conditional query. Following the approach of MultiPath [9], the model predicts a set of K trajectories for agent B, $\mu^B = \{\mu^{Bk}\}_{k=1}^{K}$, where each trajectory is a sequence of states $\mu^{Bk} = \{\mu^{Bk}_1, \ldots, \mu^{Bk}_T\}$, capturing K potentially-different intents for agent B. The model predicts uncertainty over the K intents as a softmax distribution $\pi^{Bk}(x, s^A)$. The model also predicts Gaussian uncertainty over the positions of the trajectory waypoints as:
$$\phi^{Bk}(s^B_t \mid x, s^A) = \mathcal{N}\big(s^B_t \mid \mu^{Bk}_t(x, s^A),\, \Sigma^{Bk}_t(x, s^A)\big). \tag{4}$$
This yields the full conditional distribution $p(\hat{S}^B \mid s^A, x)$ as a Gaussian Mixture Model (GMM) with mixture weights fixed over all time steps of the same trajectory:
$$p(\hat{S}^B = s^B \mid x, s^A) = \sum_{k=1}^{K} \pi^{Bk}(x, s^A) \prod_{t=1}^{T} \phi^{Bk}(s^B_t \mid x, s^A). \tag{5}$$
The Gaussian parameters $\mu^{Bk}_t$ and $\Sigma^{Bk}_t$ are directly predicted by a deep neural network (DNN). The softmax distribution is computed as $\pi^{Bk}(x, s^A) = \frac{\exp f^B_k(x, s^A)}{\sum_i \exp f^B_i(x, s^A)}$, where $f^B_k$ are logit values also output by the DNN.
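As a concrete reference, the trajectory likelihood in Eq. (5) can be evaluated as follows; this is a minimal numpy sketch that assumes diagonal covariances for simplicity, which the paper does not specify.

```python
import numpy as np

def gmm_traj_log_prob(s_b, pi, mu, var):
    """log p(S_B = s_b | x, s_A) per Eq. (5).
    s_b: (T, 2) trajectory; pi: (K,) mode probabilities;
    mu: (K, T, 2) waypoint means; var: (K, T, 2) diagonal variances (assumed)."""
    # Per-waypoint Gaussian log-densities, summed over time and dimensions.
    log_phi = -0.5 * (((s_b[None] - mu) ** 2) / var
                      + np.log(2.0 * np.pi * var)).sum(axis=(1, 2))  # (K,)
    # Mixture over modes via log-sum-exp for numerical stability.
    z = np.log(pi) + log_phi
    return float(z.max() + np.log(np.exp(z - z.max()).sum()))
```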
The computation of the interactivity score also requires the estimation of marginal distributions, $p(S^B \mid x)$, which are not conditioned on any future plan for A. We train a single model which can produce both marginal and conditional predictions, in order to have comparable quantities without any uncertainty due to model variance. Marginal predictions, $p(\tilde{S}^B \mid x)$, are provided by turning off inputs from the conditional query encoder in the model. We adopt the shorthands $\pi^{Bk}(x, \emptyset)$, $\phi^{Bk}(x, \emptyset)$ to describe this operation, which gives us the marginal distribution as
$$p(\tilde{S}^B = s^B \mid x) = \sum_{k=1}^{K} \pi^{Bk}(x, \emptyset) \prod_{t=1}^{T} \phi^{Bk}(s^B_t \mid x, \emptyset). \tag{6}$$
Given the conditional and marginal predictions of the CBP
model, we can now compute the mutual information. Directly
computing the mutual information between the future states
of two agents via Eq. (3) is intractable for the GMM distributions (Eqs. (5) and (6)). We estimate the outer expectation via importance sampling. Rather than drawing N samples from the marginal distribution, we use the most likely 6 modes of the marginal distribution's GMM in Eq. (6), as in standard motion forecasting metrics [15], with $s^A_k \in \{\mu^{Ak}(x)\}_{k=1}^{6}$:
$$I(S^A, S^B \mid x) \approx \frac{1}{M} \sum_{k=1}^{6} \sum_{m} p(s^A_k \mid x) \log \frac{p(\hat{S}^B = s^B_m \mid s^A_k, x)}{p(\tilde{S}^B = s^B_m \mid x)} \tag{7}$$
where the conditional and marginal probabilities are evaluated via Eqs. (5) and (6), respectively. The use of other, more efficient approaches for estimating the KL divergence is left to future work [26].
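A sketch of this estimator is given below, under the assumption that the inner sum runs over M samples $s^B_m$ drawn from the conditional model; the paper does not spell out the sampling scheme, so the sampler here is a placeholder.

```python
import numpy as np

def mi_top6(modes_a, probs_a, sample_sB_given, log_p_cond, log_p_marg, M=32):
    """Importance-sampled estimate of Eq. (7): the outer expectation uses the
    6 most likely query modes instead of samples from p(S_A | x)."""
    top = np.argsort(probs_a)[::-1][:6]
    total = 0.0
    for k in top:
        s_a = modes_a[k]
        acc = 0.0
        for _ in range(M):
            s_b = sample_sB_given(s_a)  # hypothetical conditional sampler
            acc += log_p_cond(s_b, s_a) - log_p_marg(s_b)
        total += probs_a[k] * acc / M
    return max(0.0, float(total))
```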
To train the model for conditional prediction, we set the conditional query/plan input to agent A's ground-truth future trajectory from the training dataset. We learn to predict the distribution parameters $f^B_k(x, s^A)$, $\mu^{Bk}_t(x, s^A)$, and $\Sigma^{Bk}_t(x, s^A)$ via supervised learning with the negative log-likelihood loss,
$$\mathcal{L}(\theta) = \sum_{m=1}^{M} \sum_{k=1}^{K} \mathbb{1}(k = k^B_m) \Big[ \log \pi^{Bk}(x_m, s^A_m; \theta) + \sum_{t=1}^{T} \log \mathcal{N}\big(s^{Bk}_t \mid \mu^{Bk}_t, \Sigma^{Bk}_t; x_m, s^A_m; \theta\big) \Big], \tag{8}$$
where $k^B_m$ is the index of the mode of the distribution that has the closest endpoint to the given ground-truth trajectory, $k^B_m = \arg\min_k \sum_{t=1}^{T} \| s^{Bk}_t - \mu^{Bk}_t \|_2$.
Above, we describe how to produce predictions for a
single agent B. However, for increased efficiency, our model
produces predictions for multiple agents in parallel. To
encourage the model to maintain the fundamental physical
property that agents cannot occupy the same future location
in space-time, we include an additional loss function:
$$\mathcal{L}_O(\theta) = \sum_i \sum_j \pi^{Ai} \pi^{Bj} \max_t \exp\big(-\|\mu^{Ai}_t - \mu^{Bj}_t\|_2^2 / \alpha\big), \tag{9}$$
where $\{(\pi^{Ai}, \mu^{Ai})\}_{i=1}^{K}$ and $\{(\pi^{Bj}, \mu^{Bj})\}_{j=1}^{K}$ are the modes and probabilities of the future trajectory distributions for agents A and B.
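A direct numpy transcription of Eq. (9) is given below as a sketch; the mode tensors and the scale $\alpha$ are illustrative.

```python
import numpy as np

def overlap_loss(pi_a, mu_a, pi_b, mu_b, alpha=1.0):
    """Pairwise overlap penalty of Eq. (9) between two agents' predicted modes.
    pi_a, pi_b: (K,) mode probabilities; mu_a, mu_b: (K, T, 2) mode waypoints."""
    # Squared distance between every pair of modes at every time step: (K, K, T).
    d2 = ((mu_a[:, None] - mu_b[None]) ** 2).sum(axis=-1)
    # The max over time picks the step of closest approach for each mode pair.
    overlap = np.exp(-d2 / alpha).max(axis=-1)  # (K, K)
    return float((pi_a[:, None] * pi_b[None] * overlap).sum())
```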
IV. EXPERIMENTS
A. Data
We collected a large, in-house dataset of real-world driving
from urban and suburban environments, using a vehicle
equipped with an industry-grade sensor and perception stack,
which provides us with tracked objects. In total, the training
set has 1.9 billion vehicle agents that we learn to model,
from 19 million unique scenarios, comprising 18 years of
continuous driving data. The models receive 2 seconds of
history and predict 15 seconds of future behavior for all
agents in the scene, including the AV. The states of the agents are recorded at 5 Hz. Features describing the past states of the
agents include (x, y, z) position, velocity vector, acceleration
vector, orientation θ, and angular velocity. There are also
binary attributes indicating whether the vehicle is signaling to
turn left or right, and whether it is parked. The lane markings
and boundaries are represented by 500 points sampled around
the current location of each predicted vehicle to balance
memory requirements.
At training time, we select one agent uniformly at random
from the vehicles in the scene to be the query agent. For 95%
of the samples, the query agent’s future ground-truth is fed to
the model as the conditional query input. For the other 5%, no
conditional query is provided, leading to marginal behavior
prediction, with the split chosen through cross-validation.
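A minimal sketch of this query-construction step is shown below, with illustrative names that are not the paper's actual pipeline.

```python
import numpy as np

def make_training_query(agent_ids, futures, p_drop=0.05, seed=None):
    """Pick one vehicle uniformly at random as the query agent; with 5%
    probability withhold its future, yielding a marginal-prediction sample."""
    rng = np.random.default_rng(seed)
    query = rng.choice(agent_ids)          # uniform over vehicles in the scene
    if rng.random() < p_drop:
        return query, None                 # no conditional query: marginal BP
    return query, futures[query]           # ground-truth future as the query
```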
Fig. 2. The architecture of the conditional behavior prediction model.
For every scene, the model predicts future behaviors for up
to the 20 closest vehicles. Other agents (vehicles, pedestrians,
and cyclists) are still used in the agent state feature encoder,
but the model doesn’t predict futures for them. We observed
that prediction performance beyond 20 agents degrades rapidly
due to sensor limitations.
B. Model Architecture
The architecture is composed of an input encoder stage,
a trajectory decoder stage, and a GNN-based trajectory
refinement stage, shown in Fig. 2. The encoder stage is
composed of a road lane encoder which uses an architecture
similar to VectorNet [14], and a track history encoder which
uses a 64-dimensional LSTM applied to 5 time steps of past
state observations comprising 1 second of history. The results of the above two encoders are concatenated and passed into a decoder which outputs a sequence of (x, y) points via
predicted polynomial coefficients for K = 287 trajectory
modes [9]. In our experiments, we use a tenth-degree
polynomial. The resulting trajectories are further refined
using a GNN [27], [17]. The GNN uses an attention-based
aggregation function that combines relative agent positions
as edge features to form messages passed to each node
[28]. We apply one message update, which passes trajectory
information between neighboring agents, and then re-apply
trajectory decoding. This process can refine the agents’
trajectory distributions with awareness of their neighbors’
distributions. Further details of this state-of-the-art model
architecture are currently under anonymous review.
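To illustrate the polynomial trajectory decoding, here is a sketch of how predicted coefficients could be evaluated into waypoints; the shapes, time normalization, and T = 75 steps (15 s at 5 Hz) are assumptions consistent with the data description, not the exact implementation.

```python
import numpy as np

def decode_polynomial_trajectories(coeffs, T=75):
    """Evaluate per-mode polynomial coefficients into (x, y) waypoints.
    coeffs: (K, 2, 11) coefficients for K modes, 2 dims, degree-10 polynomial."""
    t = np.linspace(0.0, 1.0, T)                   # normalized prediction times
    powers = t[None, :] ** np.arange(11)[:, None]  # (11, T) monomial basis
    # Contract coefficients with the basis: (K, 2, 11) x (11, T) -> (K, T, 2).
    return np.einsum('kdc,ct->ktd', coeffs, powers)
```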
V. RESULTS
A. Metrics
Given a labeled example $(x, s^A, s^B)$, the weighted Average Distance Error (wADE) over the most likely 6 modes of the conditional prediction of agent B's future trajectory given the query agent A's future trajectory is:
$$\text{wADE}^6_{\text{CBP}}(B) = \frac{1}{T} \sum_t \sum_{k=1}^{6} \pi^{Bk}(s^A, x)\, \| s^B_t - \mu^{Bk}_t \|_2, \tag{10}$$
where $\mu^{Bk}_t = \mu^{Bk}_t(s^A, x)$ is the kth mode for the predicted position of agent B at time t, with respective probability $\pi^{Bk}(s^A, x)$. Likewise, we can compute the $\text{wADE}_{\text{BP}}$ metric using $\mu^{Bk}_t = \mu^{Bk}_t(x, \emptyset)$ and $\pi^{Bk}(x, \emptyset)$. Computing their difference, $\Delta\text{wADE}(B) = \text{wADE}_{\text{BP}}(B) - \text{wADE}_{\text{CBP}}(B)$, quantifies the reduction in B's error due to conditioning on A.

TABLE I: Comparison of CBP models on an evaluation dataset containing over 8 million agent pairs. Metrics are computed and averaged over all (query agent, target agent) pairs possible in every scene. The mean error is computed only over predictions for the target agent and does not include predictions for the query agent. The standard error of the mean is also reported.

Method                   $\text{wADE}^6_{\text{CBP}}(B)$ ↓   $\text{minADE}^6_{\text{CBP}}(B)$ ↓
Non-conditional          3.486 ± 0.0017                      1.207 ± 0.00062
Early fusion (encoder)   3.142 ± 0.0016                      1.170 ± 0.00061
Late fusion (GNN)        3.469 ± 0.0017                      1.209 ± 0.00063
Early and late fusion    3.160 ± 0.0016                      1.172 ± 0.00067

Another established metric for behavior prediction is the minimum Average Distance Error (minADE), defined for conditional models over the most likely 6 modes as:
$$\text{minADE}^6_{\text{CBP}}(B) = \min_{1 \le k \le 6} \frac{1}{T} \sum_t \| s^B_t - \mu^{Bk}_t(s^A, x) \|_2. \tag{11}$$
To obtain a low minADE value, the model needs to
accurately predict the ground-truth future as one of its
predicted intents. On the other hand, the wADE metric is
more suitable for evaluating multi-modal distributions and can reflect shifts in the distribution of intent probabilities. Therefore,
we use wADE as the main metric in the following results.
Furthermore, ∆wADE is closely related to the definition of
∆LL in Eq. (1), but since it is weighted by the distance error,
it is less sensitive to prediction errors for nearly-stationary
vehicles.
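For reference, both metrics and their difference can be sketched in a few lines of numpy, with top-6 selection by predicted mode probability and shapes as in Eqs. (10) and (11).

```python
import numpy as np

def wade6(s_b, pi, mu):
    """Eq. (10): probability-weighted ADE over the 6 most likely modes.
    s_b: (T, 2) ground truth; pi: (K,) probabilities; mu: (K, T, 2) modes."""
    top = np.argsort(pi)[::-1][:6]
    dists = np.linalg.norm(mu[top] - s_b[None], axis=-1)  # (6, T)
    return float((pi[top, None] * dists).mean(axis=1).sum())

def min_ade6(s_b, pi, mu):
    """Eq. (11): ADE of the best of the 6 most likely modes."""
    top = np.argsort(pi)[::-1][:6]
    return float(np.linalg.norm(mu[top] - s_b[None], axis=-1).mean(axis=1).min())

def delta_wade(s_b, pi_bp, mu_bp, pi_cbp, mu_cbp):
    """∆wADE(B) = wADE_BP(B) - wADE_CBP(B): error reduction from conditioning."""
    return wade6(s_b, pi_bp, mu_bp) - wade6(s_b, pi_cbp, mu_cbp)
```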
B. Conditional Behavior Prediction
Comparing accuracy between marginal and conditional
predictions from the trained model shows a 10% improvement
for conditional predictions, as seen in Table I. This is clear
confirmation that our model is leveraging future information
to improve predictive power, as expected. The early fusion
conditional encoder receives the conditional query at an earlier
stage in the model, whereas the late fusion setup feeds the
query to the GNN only at the final prediction stage. As the
results show, the early-fusion variant significantly outperforms
late fusion.
C. Evaluation on Argoverse
Our model is competitive with the state of the art on the
popular Argoverse benchmark dataset [15]. On the validation
dataset, we achieve a minADE6 of 0.7488, which is near state
of the art in recent work: 0.71 by Liang et al. [29], 0.728 by
TNT [30], and 0.75 by WIMP [19]. By conditioning on the
sensor vehicle, the CBP model reduces minADE by 0.8% to
0.7409, consistent with our more exhaustive studies on the
internal dataset.
D. Distribution of Interactivity Scores
Figure 3 shows the histogram of interactivity scores between all agent pairs in the evaluation dataset. The incidence of interactions in most datasets is rare, so the interactivity score may be a good tool to automatically mine a dataset for
interactive examples.
Fig. 3. Histogram of interactivity score (mutual information) between
8,919,306 pairs of agents in the validation dataset.
E. Interactivity Score Predicts Surprise
The mutual information score allows us to discover
scenarios with a potential for surprising interactions, where
the ground-truth future of the query agent causes a target
agent to change its behavior. Using the ground-truth future
trajectories of agents A and B, $s^A$ and $s^B$, we can quantify how query
agent A affected target agent B in reality by comparing the
prediction error between the conditional and marginal (non-
conditional) models. A large, positive ∆wADE indicates that
providing the query agent’s future significantly improves the
prediction accuracy for the target agent.
Figure 4a shows that there is a strong correlation between
high values of mutual information and high values of
∆wADE. In other words, agents with high interactivity scores
are more likely to exert influence on one another. Note that
the interactivity score does not use any future information,
while ∆wADE does.
Also, in percentiles with high mutual information, there
is a high occurrence of examples where the conditional
prediction errors are much lower than marginal prediction
errors. These are scenes where the behavior of the query agent
has significantly affected the target agent. Such examples are
not present in the lower mutual information percentiles.
On the other hand, Figure 4b shows a decrease in average ∆wADE for the percentiles with the highest mutual
information. Upon inspection of a portion of scenes in the
top percentiles, we observe many examples where the agents in a pair are positioned very close to each other and can exert influence on one another. However, since they are almost stationary and ∆wADE is sensitive to distance, the impact of this influence on ∆wADE is small. We also observe that a high KL
divergence for the target agent given the ground-truth query
trajectory strongly correlates with high values of ∆wADE.
Given the future trajectory of the query agent, we can predict
surprising interactions even more accurately than without
future trajectories for either agent.
Figure 7 shows two examples of pairs of interacting agents
discovered in the evaluation set by filtering by high mutual
information and high ∆wADE. In the first example, one vehicle
yields to another in a turn. While in the marginal prediction
there is a high probability for the target agent to cross the
intersection, the conditional prediction shows the target agent
yielding. In the second example, the target agent slows down
behind a query agent which is braking.
Fig. 4. (a) The mutual information between the target and query
agents and (b) the KL divergence for the target agent given the
ground-truth query trajectory both correlate with the incidence
of surprising interactions, as measured by the target’s ∆wADE(B).
Shaded regions are between the 10 & 90th, 20 & 80th, 30 & 70th,
and 40 & 60th percentiles.
F. Selecting Salient Agents
This section demonstrates using the interactivity score
to predict which vehicles are salient for planning for the
autonomous vehicle. We predict the trajectory of the AV
both in the original scene, and in a modified scene where
some agents have been removed. We show that agents with high interactivity scores with the AV are more likely to affect its behavior than agents that are merely closer to it. In the dataset, we
typically have 10 to 32 cars in a scene, but in practice, very
few of these cars are actually relevant for planning for the
AV, so they could potentially be excluded from high-fidelity
behavior predictions on-board the vehicle.
In the first experiment, we compute the mutual information
between the autonomous vehicle and every other agent. We
choose the top N agents with the largest mutual information
values. Then, we remove all others from the scene, and use only the top N agents' states to predict the trajectory of the
AV. We compare this approach to selecting the top N agents
closest in distance to the AV in the scene. This is a common
heuristic used for identifying relevant vehicles in the scene.
Figure 5a shows that mutual information can identify
more relevant agents for planning up to N = 4. For larger
numbers of agents, the distance heuristic outperforms mutual
information as an agent selection mechanism. In practice, for
agent prioritization onboard an AV, mutual information could
be combined with other heuristics, such as distance.
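The two selection strategies compared in this experiment each reduce to a one-line ranking; the sketch below uses illustrative inputs (per-agent positions and precomputed interactivity scores with the AV).

```python
import numpy as np

def select_top_n(av_pos, agent_pos, mi_to_av, n, by="mi"):
    """Keep the top-N agents by interactivity score with the AV ('mi'),
    or the N agents closest to the AV ('distance')."""
    if by == "mi":
        order = np.argsort(mi_to_av)[::-1]  # most interactive first
    else:
        order = np.argsort(np.linalg.norm(agent_pos - av_pos, axis=-1))
    return order[:n]  # indices of the agents to keep
```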
In the second experiment, we do not remove the pruned
agents from the scene, but remove them from the set of agents
whose behaviors are to be predicted by the model. In this
case, the pruned agents are visible to the model as scene
context. Figure 5b compares using interactivity score vs. a
distance heuristic in this task. As the results show, moving less interactive agents to scene context actually improves predictions for the autonomous vehicle, as long as at least the 3 most interactive agents are kept in the prediction set.
One potential explanation for this result is that reducing the
prediction set of the model provides an attention mechanism
for the prediction of the autonomous vehicle that emphasizes
the potential future trajectories of certain agents over others.
In particular, the message-passing mechanism in the GNN
can focus only on the relevant neighbors for the AV.
Figure 6 visualizes pruning agents by mutual information
vs. pruning by distance in the same scene. We see that the
mutual information selects vehicles that are behind and ahead
of the AV in the same lane, in addition to a few vehicles
further ahead in neighboring lanes. The distance metric, on
the other hand, selects vehicles that are multiple lanes away
and are not likely to interact with the AV.
(a) Other agents are removed from the scene.
(b) Other agents are used only as context; their behavior is not predicted.
Fig. 5. The interactivity score allows pruning agents that are not
relevant for planning for the AV. The bars show the average BP error
over the pruned scene minus the BP error over the original scene.
Note that there are no conditional predictions in this experiment.
The error bars indicate the standard error of the mean.
Fig. 6. The inset shows the non-pruned scene with the AV (pink)
and other cars (blue). On the left, agents with a low interactivity
score with the AV are pruned. On the right, agents are pruned based
on distance to the AV.
(a) Query agent turns left and target agent yields.
(b) Target agent slows down behind the query agent.
Fig. 7. Two examples of interacting agents found by sorting examples by mutual information and ∆wADE. The marginal (left) and conditional predictions (right) are shown with the query in solid green, and predictions in dashed cyan lines.

Fig. 8. An example in which the query and target agents slow down in parallel lanes as a result of a traffic light change. The marginal (left) and conditional predictions (right) are shown with the query in solid green.

G. Challenges and Future Work
Fig. 8 shows an example where our metrics have selected a pair of vehicles slowing down in parallel lanes at an intersection. These agents are reacting to a change in traffic
light state, rather than to one another. The CBP model cannot differentiate between correlation and causation in two agents' trajectories. Before using a trajectory as a query, one can compute the marginal likelihood of the query, $p(s^A \mid x)$, to determine whether $s^A$ is a likely query for which the model can accurately provide counterfactual predictions.
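Such a gate could be sketched as follows, reusing the gmm_traj_log_prob sketch from Section III to score the query under the query agent's own marginal GMM (Eq. (6)); the threshold value is an assumption.

```python
def query_is_plausible(s_a, pi_a, mu_a, var_a, log_thresh=-50.0):
    """Accept a counterfactual query s_a only if its marginal likelihood under
    the query agent's own predicted GMM (Eq. (6)) exceeds a chosen threshold."""
    return gmm_traj_log_prob(s_a, pi_a, mu_a, var_a) > log_thresh
```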
The interactivity score can be evaluated very efficiently
by pre-computing the embedding of the roadgraph, which is
the most expensive part of the architecture in practice, and
batching the different queries to evaluate them in parallel. We
could also consider using our interactivity score as a reward
signal in cooperative multi-agent reinforcement learning,
similar to the notion of influence introduced in [31].
REFERENCES
[1] A. Alahi, K. Goel, V. Ramanathan, A. Robicquet, L. Fei-Fei, and
S. Savarese, “Social LSTM: Human Trajectory Prediction in Crowded
Spaces,” in CVPR, 2016.
[2] C. Tang and R. R. Salakhutdinov, “Multiple futures prediction,” in
NeurIPS, 2019.
[3] N. Rhinehart, R. McAllister, K. Kitani, and S. Levine, “PRECOG:
Prediction conditioned on goals in visual multi-agent settings,” in Intl.
Conf. on Computer Vision, 2019.
[4] E. Schmerling, K. Leung, W. Vollprecht, and M. Pavone, “Multimodal
probabilistic model-based planning for human-robot interaction,” in
IEEE Intl. Conf. on Robotics and Automation.
IEEE, 2018, pp. 1–9.
[5] S. Ross, G. Gordon, and D. Bagnell, “A reduction of imitation learning
and structured prediction to no-regret online learning,” in ICML, 2011.
[6] N. Rhinehart, K. Kitani, and P. Vernaza, “R2P2: A reparameterized
pushforward policy for diverse, precise generative path forecasting,” in
ECCV, 2018.
[7] Y. Yuan and K. Kitani, “Diverse trajectory forecasting with determinantal point processes,” ICLR, 2020.
[8] D. Sadigh, S. Sastry, S. A. Seshia, and A. D. Dragan, “Planning for
autonomous cars that leverage effects on human actions.” in Robotics:
Science and Systems conference, 2016.
[9] Y. Chai, B. Sapp, M. Bansal, and D. Anguelov, “Multipath: Multiple
probabilistic anchor trajectory hypotheses for behavior prediction,” in
Conf. on Robot Learning, 2019.
[10] N. Lee, W. Choi, P. Vernaza, C. B. Choy, P. H. S. Torr, and
M. Chandraker, “DESIRE: Distant future prediction in dynamic scenes
with interacting agents,” in CVPR, 2017.
[11] W. Zeng, W. Luo, S. Suo, A. Sadat, B. Yang, S. Casas, and R. Urtasun,
“End-to-end interpretable neural motion planner,” in CVPR, 2019.
[12] S. Casas, W. Luo, and R. Urtasun, “Intentnet: Learning to predict
intention from raw sensor data,” in Conf. on Robot Learning, 2018.
[13] A. Gupta, J. Johnson, L. Fei-Fei, S. Savarese, and A. Alahi, “Social
GAN: Socially acceptable trajectories with generative adversarial
networks,” in CVPR, 2018.
[14] J. Gao, C. Sun, H. Zhao, Y. Shen, D. Anguelov, C. Li, and C. Schmid,
“VectorNet: Encoding hd maps and agent dynamics from vectorized
representation,” in CVPR, 2020.
[15] M.-F. Chang, J. Lambert, P. Sangkloy, J. Singh, S. Bak, A. Hartnett,
D. Wang, P. Carr, S. Lucey, D. Ramanan et al., “Argoverse: 3d tracking
and forecasting with rich maps,” in CVPR, 2019.
[16] J. Mercat, T. Gilles, N. Zoghby, G. Sandou, D. Beauvois, and G. Gil,
“Multi-head attention for joint multi-modal vehicle motion forecasting,”
in IEEE Intl. Conf. on Robotics and Automation, 2020.
[17] S. Casas, C. Gulino, R. Liao, and R. Urtasun, “Spagnn: Spatially-aware
graph neural networks for relational behavior forecasting from sensor
data,” in IEEE Intl. Conf. on Robotics and Automation.
IEEE, 2020.
[18] K. Mangalam, H. Girase, S. Agarwal, K.-H. Lee, E. Adeli, J. Malik,
and A. Gaidon, “It is not the journey but the destination: Endpoint
conditioned trajectory prediction,” arXiv:2004.02025, 2020.
[19] S. Khandelwal, W. Qi, J. Singh, A. Hartnett, and D. Ramanan, “What-if
motion prediction for autonomous driving,” ArXiv, 2020.
[20] A. K. Pandey and R. Alami, “A framework towards a socially aware
mobile robot motion in human-centered dynamic environment,” in
IEEE/RSJ Intl. Conf. on Intelligent Robots and Systems, 2010.
[21] L. Scandolo and T. Fraichard, “An anthropomorphic navigation scheme
for dynamic scenarios,” in IEEE Intl. Conf. on Robotics and Automation,
2011.
[22] E. A. Sisbot, L. F. Marin-Urias, R. Alami, and T. Simeon, “A human
aware mobile robot motion planner,” IEEE Transactions on Robotics,
2007.
[23] K. S. Refaat, K. Ding, N. Ponomareva, and S. Ross, “Agent prioritization for autonomous navigation,” in IEEE/RSJ Intl. Conf. on Intelligent Robots and Systems, 2019.
[24] R. Michelmore, M. Kwiatkowska, and Y. Gal, “Evaluating uncertainty
quantification in end-to-end autonomous driving control,” arXiv preprint
arXiv:1811.06817, 2018.
[25] C. E. Shannon, “A mathematical theory of communication,” The Bell
system technical journal, 1948.
[26] J. R. Hershey and P. A. Olsen, “Approximating the kullback leibler
divergence between gaussian mixture models,” in Intl. Conf. on
Acoustics, Speech and Signal Proc., 2007.
[27] P. W. Battaglia, J. B. Hamrick, V. Bapst, A. Sanchez-Gonzalez,
V. Zambaldi, M. Malinowski, A. Tacchetti, D. Raposo, A. Santoro,
R. Faulkner, C. Gulcehre, F. Song, A. Ballard, J. Gilmer, G. Dahl,
A. Vaswani, K. Allen, C. Nash, V. Langston, C. Dyer, N. Heess,
D. Wierstra, P. Kohli, M. Botvinick, O. Vinyals, Y. Li, and R. Pascanu,
“Relational inductive biases, deep learning, and graph networks,” arXiv
preprint arXiv:1806.01261, 2018.
[28] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N.
Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in
NeurIPS, 2017.
[29] M. Liang, B. Yang, R. Hu, Y. Chen, R. Liao, S. Feng, and R. Urtasun,
“Learning lane graph representations for motion forecasting,” arXiv
preprint arXiv:2007.13732, 2020.
[30] H. Zhao, J. Gao, T. Lan, C. Sun, B. Sapp, B. Varadarajan, Y. Shen,
Y. Shen, Y. Chai, C. Schmid et al., “Tnt: Target-driven trajectory
prediction,” arXiv preprint arXiv:2008.08294, 2020.
[31] N. Jaques, A. Lazaridou, E. Hughes, C. Gulcehre, P. Ortega, D. Strouse,
J. Z. Leibo, and N. De Freitas, “Social influence as intrinsic motivation
for multi-agent deep reinforcement learning,” in ICML, 2019.