Autonomous Robots manuscript No.
(will be inserted by the editor)
ALAN: Adaptive Learning for Multi-Agent Navigation
Julio Godoy · Tiannan Chen · Stephen J. Guy · Ioannis Karamouzas ·
Maria Gini
Received: date / Accepted: date
Abstract In multi-agent navigation, agents need to
move towards their goal locations while avoiding colli-
sions with other agents and static obstacles, often with-
out communication with each other. Existing methods
compute motions that are optimal locally but do not ac-
count for the aggregated motions of all agents, produc-
ing inefficient global behavior especially when agents
move in a crowded space. In this work, we develop
methods to allow agents to dynamically adapt their
behavior to their local conditions. We accomplish this
by formulating the multi-agent navigation problem as
an action-selection problem, and propose an approach,
ALAN, that allows agents to compute time-efficient and
collision-free motions. ALAN is highly scalable because
each agent makes its own decisions on how to move
using a set of velocities optimized for a variety of navi-
gation tasks. Experimental results show that the agents
using ALAN, in general, reach their destinations faster
than using ORCA, a state-of-the-art collision avoidance
framework, the Social Forces model for pedestrian nav-
igation, and a Predictive collision avoidance model.
Keywords Multi-agent navigation · Online learning ·
Action selection · Multi-agent coordination
Julio Godoy
Department of Computer Science, Universidad de Concep-
cion
Edmundo Larenas 219, Concepcion, Chile
E-mail: juliogodoy@gmail.com
Tiannan Chen, Stephen J. Guy and Maria Gini
Department of Computer Science and Engineering, Univer-
sity of Minnesota
200 Union Street SE, Minneapolis, MN 55455, USA
Ioannis Karamouzas
School of Computing, Clemson University
100 McAdams Hall, Clemson, SC 29634, USA
1 Introduction
Real-time goal-directed navigation of multiple agents
is required in many domains, such as swarm robotics,
pedestrian navigation, planning for evacuation, and traf-
fic engineering. Conflicting constraints and the need to
operate in real time make this problem challenging.
Agents need to move towards their goals in a timely
manner, but also need to avoid collisions with each
other and the environment. In addition, due to the num-
ber of agents and the real-time constraints, each agent
needs to compute its own motion without any commu-
nication with the other agents.
While decentralization is essential for scalability and
robustness, achieving globally efficient motions is crit-
ical, especially in applications such as search and res-
cue, aerial surveillance, and evacuation planning, where
time is critical. Over the past twenty years, many decen-
tralized techniques for real-time multi-agent navigation
have been proposed, with approaches such as Optimal
Reciprocal Collision Avoidance (ORCA) [5] being able
to provide guarantees about collision-free motion for
the agents. Although such techniques generate locally
efficient motions for each agent, the overall flow and
global behavior of the agents can be far from efficient;
agents plan only for themselves and do not consider
how their motions affect the other agents. This leads to
inefficient motions, congestion, and even deadlocks.
In this paper, we are interested in situations where
agents have to minimize their overall travel time. We
assume each agent has a preferred velocity indicating its
desired direction of motion (typically oriented towards
its goal) and speed. An agent runs a continuous cycle of
sensing and acting. In each cycle, it has to choose a new
velocity that avoids obstacles but is as close as possible
to its preferred velocity. We show that by intelligently
selecting preferred velocities that account for the global
state of the multi-agent system, the time efficiency of
the entire crowd can be significantly improved.
In our setting, agents learn how to select their veloc-
ities in an online fashion without communicating with
each other. To do so, we adapt a multi-armed ban-
dit formulation to the velocity selection problem and
present ALAN (Adaptive Learning Approach for Multi-
Agent Navigation). With ALAN, agents choose intel-
ligently from a set of actions, one at each timestep,
based on both their goal and on how their motion will
affect other agents. We show how critical the set of
available actions is to performance, and we present a
Markov Chain Monte Carlo learning method to learn
an optimized action space for navigation in a variety
of environments. Together with a scheme that guaran-
tees collision-free motions, these features allow ALAN
agents to minimize their overall travel time. 1
Main Results. This paper presents four main con-
tributions. First, we formulate the multi-agent naviga-
tion problem in a multi-armed bandit setting. This en-
ables each agent to decide its motions independently
of the other agents. The other agents affect indirectly
how an agent moves, because they affect the reward the
agent receives. The independence of the choices made
by each agent makes the approach highly scalable. Sec-
ond, we propose an online action selection method in-
spired by the Softmax action selection technique [50],
which balances the exploration-exploitation tradeoff.
Third, we propose a Markov Chain Monte Carlo method
to learn offline an optimized action set for specific navi-
gation environments, as well as an action set optimized
for multiple navigation scenarios. Last, we show experi-
mentally that our approach leads to more time efficient
motions in a variety of scenarios, reducing the travel
time of all agents as compared to ORCA, the Social
Forces model for simulating pedestrian dynamics [20]
and the Predictive collision avoidance model [28].
This work is an extended version of [13], which in-
troduced a multi-armed bandit formulation for multi-
agent navigation problems. Compared to [13], here we
reduce ALAN’s dependency on parameters, present an
offline approach to learn an optimized action set, and
perform an extended experimental analysis of ALAN.
The rest of the paper is organized as follows. In Sec-
tion 2, we review relevant related work. In Section 3,
we provide background on collision avoidance methods,
especially on ORCA which is used in ALAN. In Sec-
tion 4, we present our problem formulation for multi-
agent navigation. ALAN and its components are de-
scribed in Section 5, while our experimental setup is
described in Section 6, where we also present our performance metric, the scenarios we use to evaluate our approach, and experimental results. Section 7 presents our Markov Chain Monte Carlo method for learning action spaces for different navigation environments. We perform a thorough experimental analysis of the performance of ALAN in Section 8, where we also discuss its applicability in multi-robot systems. Finally, we conclude and present future research plans in Section 9.

1 Videos highlighting our work can be found at http://motion.cs.umn.edu/r/ActionSelection
2 Related Work
Extensive research in the areas of multi-agent navi-
gation and learning has been conducted over the last
decade. In this section, we present an overview of prior
work most closely related to our approach. For a more
comprehensive discussion on multi-agent navigation and
learning we refer the reader to the surveys of Pelechano
et al. [40] and Buşoniu et al. [8], respectively.
2.1 Multi-Agent Navigation
Numerous models have been proposed to simulate in-
dividuals and groups of interacting agents. The sem-
inal work of Reynolds on boids has been influential
on this field [45]. Reynolds used simple local rules to
create visually compelling flocks of birds and schools
of fish. Later, he extended his model to include au-
tonomous agent behavior [44]. Since Reynolds’s orig-
inal work, many interesting crowd simulation models
have been introduced that account for groups [4], cog-
nitive and behavioral rules [11,46], biomechanical prin-
ciples [16] and sociological or psychological factors [39,
15,42]. Recent work models the contagion of psycholog-
ical states in a crowd of agents, for example, in evacu-
ation simulations [52]. Our approach, in contrast, does
not make assumptions about the psychological states of
the agents, and is therefore more generally applicable.
An extensive literature also exists on modeling the
local dynamics of the agents and computing collision-
free motions. Many different agent-based techniques for
collision avoidance have been proposed in control the-
ory [37], traffic simulation [7], animation [14,28] and
robotics [30]. In Section 3 we provide a more detailed
description of the collision-avoidance technique we use.
We focus on minimizing the travel time of the agents,
but other metrics have been studied in the literature.
For example, the work in [48,56,27] addresses the prob-
lem of minimizing the total length of the path of the
agents, formulating the path planning problem as a
mixed integer linear program. Coordinating the motion
of a set of pebbles in a graph to minimize the number
of moves was studied in [33].
2.2 Reinforcement Learning
Many learning approaches used for robots and agents
derive from the reinforcement learning literature [8].
Reinforcement Learning (RL) addresses how autonomous
agents can learn by interacting with the environment to
achieve their desired goal [49]. An RL agent performs
actions that affect its state and environment, and re-
ceives a reward value which indicates the quality of the
performed action. This reward is used as feedback for
the agent to improve its future decisions. Different ap-
proaches have been proposed to incorporate RL when
multiple agents share the environment (see [8,32,53] for
an extensive overview).
In multi-agent RL algorithms, agents typically need
to collect information on how other agents behave and
find a policy that maximizes their reward. This is ex-
pensive when the state space is large and requires a
significant degree of exploration to create an accurate
model for each agent. Hence, approaches that model
the entire environment are focused on small problems
and/or a small number of agents. To reduce complexity,
some approaches focus on the local interaction neigh-
borhood of each agent [57,58]. By considering a local
neighborhood, the state space of each agent is reduced.
To completely avoid the state space complexity, the
learning problem can be formulated as a multi-armed
bandit problem [49], where the agents use the reward
of each action to make future decisions. In multi-armed
bandit problems, the relation between exploiting the
current best action and exploring potentially better ac-
tions is critical [2,34].
2.2.1 Action Selection Techniques
A variety of approaches aim at balancing exploration
and exploitation, which is critical for online learning
problems such as ours.
A simple approach is ϵ-greedy, which works by se-
lecting the highest valued action with probability 1-ϵ,
and a random action with probability ϵ, 0 ≤ϵ ≤1. The
value of ϵ indicates the degree of exploration that the
agent performs [50]. Because of its probabilistic nature,
ϵ-greedy can find the optimal action, without being sen-
sitive to the difference between the values of the actions.
This means that ϵ-greedy does the same amount of ex-
ploration regardless of how much better the best known
action is, compared to the other actions.
Another widely used action-selection technique is
the upper confidence bounds (UCB) algorithm [3]. UCB
is a deterministic method that samples the actions pro-
portionally to the upper-bound of the estimated value
of their rewards (based on their current average reward)
and their confidence interval (computed using a rela-
tion between the number of times each action was se-
lected and the total number of actions taken so far by
the agent). Unlike ϵ-greedy, UCB considers the value of
all actions when deciding which one to choose. However,
it does unnecessary exploration when the reward distri-
bution is static (i.e., the best action does not change).
A method that combines the probabilistic nature of ϵ-greedy with sensitivity to the changing reward structure is the Softmax action selection strategy. Softmax biases the action choice based on the actions' relative reward values: it increases exploration when all actions have similar values, and reduces it when some action (or one in particular) is significantly better than the rest. Because of these properties, the action selection method we use is based on the Softmax strategy.
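To make the trade-offs above concrete, the following C++ sketch contrasts the two selection rules discussed so far; it is our own illustration (the value estimates, visit counts and the exploration constant in the UCB bonus are generic placeholders, not tied to any particular implementation).

#include <algorithm>
#include <cmath>
#include <cstdlib>
#include <vector>

// epsilon-greedy: exploit the current best estimate with probability 1 - epsilon,
// otherwise explore by picking a uniformly random action.
int epsilonGreedy(const std::vector<double>& value, double epsilon) {
    if (static_cast<double>(std::rand()) / RAND_MAX < epsilon)
        return static_cast<int>(std::rand() % value.size());
    int best = 0;
    for (std::size_t a = 1; a < value.size(); ++a)
        if (value[a] > value[best]) best = static_cast<int>(a);
    return best;
}

// UCB: deterministically pick the action with the highest upper confidence
// bound, i.e., average reward plus a bonus that shrinks with the visit count.
int ucb(const std::vector<double>& value, const std::vector<int>& count, int total) {
    int best = 0;
    double bestScore = -1e300;
    for (std::size_t a = 0; a < value.size(); ++a) {
        double bonus = std::sqrt(2.0 * std::log(std::max(1, total)) /
                                 std::max(1, count[a]));
        double score = value[a] + bonus;
        if (score > bestScore) { bestScore = score; best = static_cast<int>(a); }
    }
    return best;
}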
2.3 Learning in Multi-Agent Navigation
Extensive work has also been done on learning and
adapting motion behavior of agents in crowded environ-
ments. Depending on the nature of the learning process,
the work can be classified in two main categories: offline
and online learning. In offline learning, agents repeat-
edly explore the environment and try to learn the op-
timal policy given an objective function. Examples of
desired learned behaviors include collision avoidance,
shortest path to destination, and specific group for-
mations. As an example, the work in [23] uses inverse
reinforcement learning for agents to learn paths from
recorded training data. Similarly, the approach in [51]
applies Q-learning to plan paths for agents in crowds.
In this approach, agents learn in a series of episodes
the best path to their destination. A SARSA-based [50]
learning algorithm has also been used in [35] for of-
fline learning of behaviors in crowd simulations. The
approach in [9] analyzes different strategies for shar-
ing policies between agents to speed up the learning
process in crowd simulations. In the area of swarm in-
telligence, the work in [24] uses evolutionary algorithms
for robotics, learning offline the parameters of the fit-
ness function and sharing the learned rules in unknown
environments.
Offline learning has significant limitations, which
arise from the need to train the agents before the en-
vironment is known. In contrast, the main part of our
work is an online learning approach. In online approaches,
agents are given only partial knowledge of their environ-
ment, and are expected to adapt their strategies as they
discover more of the environment. Our approach allows
agents to adapt online to unknown environments, and
it does not require explicit communication between the
agents.
3 Background
In this section, we first describe different techniques
that agents can employ to avoid collisions, and then focus on the technique we use in our work.
3.1 Collision Avoidance
Methods that have been proposed to prevent collisions
during navigation can be classified as reactive and an-
ticipatory.
In reactive collision avoidance, agents adapt their
motion to other agents and obstacles along their paths.
Many reactive methods [45,44,19,30,43] use artificial
repulsive forces to avoid collisions. However, these tech-
niques do not anticipate collisions; agents react to avoid a collision only when they are sufficiently close to each other. This can lead to oscillations and local minima. Another limitation of these methods is that the forces must be tuned separately for each scenario, limiting their robustness.
In anticipatory collision avoidance, agents predict
and avoid potential upcoming collisions by linearly ex-
trapolating their current velocities. In this line, geo-
metrically based algorithms compute collision-free ve-
locities for the agents using either sampling [54,41,29,
38] or optimization techniques [5,14].
3.2 ORCA
The Optimal Reciprocal Collision Avoidance (ORCA) framework is an anticipatory collision avoidance approach that builds on the concept of Velocity Obstacles [10], where agents detect and avoid potential collisions by linearly extrapolating their current velocities. Given two agents, A_i and A_j, the velocity obstacle VO_{A_i|A_j} represents the set of all relative velocities between i and j that will result in a collision at some future moment. Using the VO formulation, we can guarantee collision avoidance by choosing a relative velocity that lies outside the set VO_{A_i|A_j}. Let u denote the minimum change in the relative velocity of i and j needed to avoid the collision. ORCA assumes that the two agents will share the responsibility of avoiding it and requires each agent to change its current velocity by at least ½·u. Then, the set of feasible velocities for i induced by j is the half-plane of velocities given by:

ORCA_{A_i|A_j} = {v | (v − (v_i + ½·u)) · û ≥ 0},

where û is the normalized vector u (see Fig. 1). A similar formulation can be derived for determining A_i's permitted velocities with respect to a static obstacle O_k. We denote this set as ORCA_{A_i|O_k}.
The overall approach works as follows. At each time step of the simulation, each agent i uses its goal-oriented velocity v_i^goal and computes a new collision-free velocity by taking into account its neighboring agents and static obstacles. First, agent i infers its set of feasible velocities, FV_{A_i}, from the intersection of all permitted half-planes ORCA_{A_i|A_j} and ORCA_{A_i|O_k} induced by each neighboring agent j and obstacle O_k, respectively. Having computed FV_{A_i}, the agent selects a new velocity v_i^new for itself that is closest to its preferred velocity v_i^pref and lies inside the region of feasible velocities:

v_i^new = arg min_{v ∈ FV_{A_i}} ||v − v_i^pref||.    (1)

The optimization problem in (1) can be efficiently solved using linear programming, since FV_{A_i} is a convex region bounded by linear constraints. Finally, agent i updates its position based on the newly computed velocity. As ORCA is a decentralized approach, each agent computes its velocity independently.

In addition, each agent typically uses its goal-oriented velocity v_i^goal as the input preferred velocity to ORCA in (1).
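As a concrete illustration of the half-plane constraints and of Eq. (1), the C++ sketch below checks whether a candidate velocity satisfies one ORCA constraint and approximates the minimization in (1) by sampling; the vector type and function names are ours, and ORCA itself solves (1) exactly with a low-dimensional linear program.

#include <cmath>
#include <limits>
#include <vector>

struct Vec2 { double x, y; };
Vec2 operator+(Vec2 a, Vec2 b) { return {a.x + b.x, a.y + b.y}; }
Vec2 operator-(Vec2 a, Vec2 b) { return {a.x - b.x, a.y - b.y}; }
Vec2 operator*(double s, Vec2 a) { return {s * a.x, s * a.y}; }
double dot(Vec2 a, Vec2 b) { return a.x * b.x + a.y * b.y; }
double norm(Vec2 a) { return std::sqrt(dot(a, a)); }

// One ORCA half-plane: a velocity v is permitted with respect to a neighbor
// if (v - (vi + u/2)) . u_hat >= 0, where u is the smallest change in relative
// velocity that resolves the predicted collision.
bool satisfiesHalfPlane(Vec2 v, Vec2 vi, Vec2 u) {
    Vec2 uHat = (1.0 / norm(u)) * u;
    return dot(v - (vi + 0.5 * u), uHat) >= 0.0;
}

// Sampling-based stand-in for Eq. (1): among candidate velocities, return the
// one closest to vPref that satisfies every half-plane constraint (one u per neighbor).
Vec2 closestFeasibleVelocity(const std::vector<Vec2>& candidates, Vec2 vPref,
                             Vec2 vi, const std::vector<Vec2>& neighborU) {
    Vec2 best = vi;  // fall back to the current velocity if nothing is feasible
    double bestDist = std::numeric_limits<double>::max();
    for (Vec2 v : candidates) {
        bool feasible = true;
        for (Vec2 u : neighborU) feasible = feasible && satisfiesHalfPlane(v, vi, u);
        if (feasible && norm(v - vPref) < bestDist) { bestDist = norm(v - vPref); best = v; }
    }
    return best;
}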
3.3 Limitations of ORCA
Although ORCA guarantees collision-free motions and
provides a locally optimal behavior for each agent, the
lack of coordination between agents can lead to globally
inefficient motions. For an example, see Fig. 2. Here,
three agents start from the initial positions (Fig. 2(a)) and must reach the final positions (Fig. 2(b)). Because
the agents follow only their goal-oriented preferred ve-
locity, they will get stuck in a local minimum which
generates the trajectories shown in Fig. 2(c). If instead
the agents behaved differently, for instance, by selecting
a different vpref for a short period of time, they might
find a larger region of feasible velocities. This might in-
directly help avoid/solve the overall congestion, benefit-
ing all agents. Our proposed approach, ALAN, directly
addresses this limitation, allowing agents to adapt their
preferred velocity online, hence improving their motion
efficiency. An example of the trajectories generated by
our approach can be seen in Fig. 2(d).
Fig. 1: (a) Two agents, A_i and A_j, moving at velocities v_i and v_j towards a potential collision. (b) The set of allowed velocities for agent i induced by agent j, in the velocity space, is the half-plane delimited by the line perpendicular to û through the point v_i + ½·u, where u is the vector from v_i − v_j to the closest point on the boundary of VO_{i|j}.

Fig. 2: Three agents cross paths. (a) Initial positions of the agents. (b) Goal positions of the agents. (c) When navigating with ORCA, the agents run into and push each other, resulting in inefficient paths. (d) When using ALAN the agents select different preferred velocities which avoid local minima, resulting in more efficient paths.
4 Problem Formulation
In our problem setting, given an environment and a
set A of agents, each with a start and a goal position,
our goal is to enable the agents to reach their goals as
soon as possible and without collisions. We also require
that the agents move independently and without explic-
itly communicating with each other. For simplicity, we
model each agent as a disc which moves on a 2D plane
that may also contain a set of k static obstacles O (ap-
proximated by line segments in all our experiments).
Given n agents, let agent A_i have radius r_i, goal position g_i, and maximum speed υ_i^max. Let also p_i^t and v_i^t denote the agent's position and velocity, respectively, at time t. Furthermore, agent A_i has a preferred velocity v_i^pref at which it prefers to move. Let v_i^goal be the preferred velocity directed towards the agent's goal g_i with a magnitude equal to υ_i^max. The main objective of our work is to minimize the travel time of the set of agents A to their goals, while guaranteeing collision-free motions. To measure this global travel time, we could consider the travel time of the last agent that reaches its goal. However, this value does not provide any information about the travel time of all the other agents. Instead, we measure this travel time, TTime(A), as a linear combination of the average travel time of all the agents in A and its spread. Formally:

TTime(A) = μ(TimeToGoal(A)) + 3·σ(TimeToGoal(A))    (2)
where TimeToGoal(A) is the set of travel times of all
agents in A from their start positions to their goals, and
µ(·) and σ(·) are the average and the standard deviation
(using the unbiased estimator) of the set TimeToGoal(A),
respectively. If the times to goals of the agents follow a normal distribution, then TTime(A) represents the upper bound of TimeToGoal(A) for approximately 99.7% of the agents. Even if the distribution is not normal, at least 89% of the times will fall within three standard deviations (Chebyshev's inequality). Our objective can be formalized as follows:

minimize  TTime(A)
s.t.  ||p_i^t − p_j^t|| > r_i + r_j,   ∀ i, j ∈ [1, n], i ≠ j
      dist(p_i^t, O_j) > r_i,          ∀ i ∈ [1, n], j ∈ [1, k]
      ||v_i^t|| ≤ υ_i^max,             ∀ i ∈ [1, n]
(3)
where dist(·) denotes the shortest distance between two
positions. To simplify the notation, in the rest of the
paper we omit the index of the specific agent being
referred to, unless it is needed for clarity.
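For concreteness, the following C++ helper (our own sketch, not the paper's implementation) evaluates Eq. 2 from a list of per-agent travel times, using the unbiased estimator of the standard deviation mentioned above; the same computation evaluates MinTTime(A) in Section 6.2 when given the unconstrained shortest-path times.

#include <cmath>
#include <vector>

// TTime(A) = mean(TimeToGoal(A)) + 3 * stddev(TimeToGoal(A)), Eq. (2).
double travelTime(const std::vector<double>& timeToGoal) {
    const double n = static_cast<double>(timeToGoal.size());
    double mean = 0.0;
    for (double t : timeToGoal) mean += t;
    mean /= n;
    double sumSq = 0.0;
    for (double t : timeToGoal) sumSq += (t - mean) * (t - mean);
    const double stddev = (n > 1.0) ? std::sqrt(sumSq / (n - 1.0)) : 0.0;  // unbiased estimator
    return mean + 3.0 * stddev;
}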
Minimizing Eq. 3 for a large number of agents using
a centralized planner with complete information is in-
tractable (PSPACE-hard [25]), given the combinatorial
nature of the optimization problem and the continu-
ous space of movement for the agents. Since we require
that the agents navigate independently and without ex-
plicit communication with each other, Eq. 3 has to be
minimized in a decentralized manner. As the agents do
not know in advance which trajectories are feasible, the
problem for each agent becomes deciding how to move at each timestep, given its perception of the local environment. This is the question addressed by our online
learning approach, ALAN, which is described next.
5 ALAN
ALAN is composed of an action selection framework,
which provides a set of preferred velocities an agent
can choose from, and a reward function the agent uses
to evaluate the velocities and select the velocity to be
used next. ALAN keeps an updated reward value for
each action using a moving time window of the recently
obtained rewards. If information about the set of nav-
igation environments is available, ALAN can take ad-
vantage of an action learning approach to compute, in
an offline manner, an action set that is optimized for
one or a set of scenarios (see Section 7).
In ALAN, each agent runs a continuous cycle of
sensing and action until it reaches its destination. To
guarantee real-time behavior, we impose a hard time
constraint of 50 ms per cycle. We assume that the radii,
positions and velocities of nearby agents and obstacles
can be obtained by sensing. At each cycle the agent
senses and computes its new collision-free velocity which
is used until the next cycle. The velocity has to respect
the agent's geometric and kinematic constraints while
ensuring progress towards its goal.
To achieve this, ALAN follows a two-step process.
First, the agent selects a preferred velocity vpref (as
described later in Section 5.3). Next, this vpref is passed
to ORCA which produces a collision-free velocity vnew,
which is the velocity the agent will use during the next
timestep.
Algorithm 1 shows an overview of ALAN. This al-
gorithm is executed at every cycle. If an action is to be
selected in the current cycle (line 3, on average every 0.2 s), the Softmax action selection method (presented
later in Section 5.3) returns a vpref (line 4), which is
passed to ORCA. After computing potential collisions,
ORCA returns a new collision-free velocity vnew (line
6), and the getAction method returns the ID of the
action a that corresponds to the vpref selected (line 7).
This action a is executed (line 8), which moves the agent
with the collision-free velocity vnew for the duration
of the cycle, before updating the agent’s position for
the next simulation step (line 9). The agent determines
the quality of the action a (lines 10-12) by computing
its reward value (see Section 5.1). This value becomes
available to the action selection mechanism, which will
select a new vpref in the next cycle. This cycle repeats
until the agent reaches its goal.
Algorithm 1: The ALAN algorithm for an agent
1:  initialize simulation
2:  while not at the goal do
3:    if UpdateAction(t) then
4:      v^pref ← Softmax(Act)
5:    end if
6:    v^new ← ORCA(v^pref)
7:    a ← getAction(v^pref)
8:    Execute(a)
9:    p^t ← p^(t−1) + v^new · ∆t
10:   R_a^goal ← GoalReward(a_{t−1}) (cf. Eq. 5)
11:   R_a^polite ← PoliteReward(a_{t−1}) (cf. Eq. 6)
12:   R_a ← (1 − γ) · R_a^goal + γ · R_a^polite
13: end while
The main issue is how an agent should choose its
preferred velocity. Typically, an agent would prefer a ve-
locity that drives it closer to its goal, but different veloc-
ities may help the entire set of agents to reach their des-
tinations faster (consider, for example, an agent moving
backwards to alleviate congestion). Therefore, we allow
the agents to use different actions, which correspond
to different preferred velocities (throughout the rest of
this paper, we will use the terms preferred velocities
and actions interchangeably). In principle, finding the
best motion would require each agent to make a choice
at every step in a continuous 2D space, the space of all
possible speeds and directions. This is not practical in
real-time domains. Instead, agents plan their motions
over a discretized set of a small number of preferred
velocities, the set Act. An example set of 8 actions uni-
formly distributed in the space of directions is shown
in Figure 3. We call this set Sample set.
Different action sets affect the performance of the
agents. We analyze this later (Section 7), where we
present an offline learning method to find an optimal
set of actions.
Fig. 3: Example set of actions with the corresponding action ID. The eight actions correspond to moving at 1.5 m/s at different angles with respect to the goal: 0°, 45°, 90°, 135°, −45°, −90°, −135° and 180°.
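The Sample set can be generated directly from the agent's goal direction; the C++ sketch below is a minimal construction of the eight preferred velocities in Fig. 3 (the vector type and function are ours, written only for illustration).

#include <cmath>
#include <vector>

struct Dir2 { double x, y; };

// Build the Sample action set of Fig. 3: eight preferred velocities of magnitude
// 1.5 m/s at 0, 45, 90, 135, -45, -90, -135 and 180 degrees relative to the
// unit vector pointing from the agent to its goal.
std::vector<Dir2> sampleActionSet(Dir2 toGoalUnit, double speed = 1.5) {
    const double kPi = 3.14159265358979323846;
    const double anglesDeg[8] = {0.0, 45.0, 90.0, 135.0, -45.0, -90.0, -135.0, 180.0};
    std::vector<Dir2> actions;
    for (double deg : anglesDeg) {
        const double a = deg * kPi / 180.0;
        // Rotate the goal direction by the action angle and scale to the preferred speed.
        actions.push_back({speed * (toGoalUnit.x * std::cos(a) - toGoalUnit.y * std::sin(a)),
                           speed * (toGoalUnit.x * std::sin(a) + toGoalUnit.y * std::cos(a))});
    }
    return actions;
}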
5.1 Reward Function
The quality of an agent’s selected action vpref is eval-
uated based on two criteria: how much it moves the
agent to its goal, and its effect on the motion of nearby
agents. The first criterion allows agents to reach their
goals, finding non-direct goal paths when facing con-
gestion or static obstacles. The second criterion en-
courages actions that do not slow down the motion of
other agents. To do this, agents take advantage of the
reciprocity assumption of ORCA: when a collision is
predicted, both potentially colliding agents will devi-
ate to avoid each other. Hence, if a collision-free vnew
computed by ORCA is significantly different from the
selected preferred velocity vpref, it represents a devia-
tion of the same magnitude for another agent. There-
fore, to minimize the negative impact of its decisions
on the nearby agents, i.e., to be polite towards them,
each agent should choose actions whose vnew is similar
to the vpref that produced it. This duality of goal ori-
ented and “socially aware” behaviors in humans has recently been studied in [47].
Observe the example navigation task in Figure 4
where two agents must travel to the other side of a small
corridor. In Figure 4(c), agents use only goal progress as
a criterion for evaluating their actions. In Figure 4(d),
they use both goal progress and the effect in the motion
of other agents as criteria for evaluating their actions.
Using the Sample action set defined in Fig. 3, agents
first move to the center of the corridor. If agents only
optimize their goal progress, eventually one of them will
start pushing the other out of the corridor as they ex-
plore their actions (Figure 4(c)). This pushing behavior
continues slowly until the agents exit the corridor and
find space to avoid each other and eventually reach their
goals. If agents also consider the effect of their actions
on others, then one of them will eventually find that
the backwards action does not add constraints to the
motion of the other agent, hence can help that agent to
move to its goal. The agent then willingly moves back-
wards (Figure 4(d)), exiting the corridor faster and re-
ducing the travel time for both agents. We show that
considering both criteria in the evaluation of each ac-
tion reduces the travel time of the agents overall.
Specifically, we define the reward R_a for an agent performing action a to be a convex combination of a goal-oriented component and a politeness component:

R_a = (1 − γ)·R_a^goal + γ·R_a^polite,    (4)

where the parameter γ, called the coordination factor, controls the influence of each component in the total reward (0 ≤ γ < 1).
The goal-oriented component R_a^goal computes the scalar product of the collision-free velocity v^new of the agent with the normalized vector pointing from the position p of the agent to its goal g. This component promotes preferred velocities that lead the agent as quickly as possible to its goal. More formally:

R_a^goal = v^new · (g − p) / ||g − p||    (5)
The politeness component R_a^polite compares the executed preferred velocity with the resulting collision-free velocity. These two velocities will be similar when the preferred velocity does not conflict with other agents' motions, and will be different when it leads to potential collisions. Hence, the similarity between v^new and v^pref indicates how polite the corresponding action is with respect to the motion of the other agents. Polite actions reduce the constraints on other agents' motions, allowing them to move and therefore advancing the global simulation state. Formally:

R_a^polite = v^new · v^pref    (6)
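A minimal C++ sketch of the two components and their combination (Eqs. 4-6) follows; the 2D vector helpers and function names are ours and purely illustrative, not the paper's API.

#include <cmath>

struct V2 { double x, y; };
double dotp(V2 a, V2 b) { return a.x * b.x + a.y * b.y; }

// Goal-oriented component, Eq. (5): progress of the collision-free velocity
// along the normalized direction from the agent's position p to its goal g.
double goalReward(V2 vNew, V2 p, V2 g) {
    V2 d = {g.x - p.x, g.y - p.y};
    double len = std::sqrt(dotp(d, d));
    return dotp(vNew, V2{d.x / len, d.y / len});
}

// Politeness component, Eq. (6): similarity between the preferred velocity
// and the collision-free velocity returned by ORCA.
double politeReward(V2 vNew, V2 vPref) { return dotp(vNew, vPref); }

// Combined reward, Eq. (4), with coordination factor gamma in [0, 1).
double reward(V2 vNew, V2 vPref, V2 p, V2 g, double gamma = 0.4) {
    return (1.0 - gamma) * goalReward(vNew, p, g) + gamma * politeReward(vNew, vPref);
}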
Fig. 4: Two agents moving to their goals on opposite sides of the corridor. Different behaviors are produced by optimizing different metrics. (b) When meeting in the middle of the corridor, agents cannot continue their goal motions without colliding. (c) Considering only goal progress when choosing actions results in one agent pushing the other out of the corridor. (d) Considering both goal progress and the effect of the action on other agents results in one agent moving backwards to help the other move to its goal.

If an agent maximizes R_a^goal, it would not consider the effects of its actions on the other agents. On the other hand, if the agent tries to maximize R_a^polite, it
has no incentive to move towards its goal, which means
it might never reach it. Therefore, an agent should aim
at maximizing a combination of both components. Dif-
ferent behaviors may be obtained with different values
of γ. In Section 6.7, we analyze how sensitive the per-
formance of ALAN is to different values of γ. Overall,
we found that γ = 0.4 provides an appropriate balance
between these two extremes.
Fig. 5: Example of reward values for different actions under clear and congested local conditions. The reward R_a of each action a is shown as a pair of goal-oriented and politeness components (R_a^goal, R_a^polite); the pairs shown are (1, 1), (0.5, 1), (0.2, 0.1) and (−1, 1).
Figure 5 shows an example of conditions an agent
may encounter. Here, there is congestion on one side of
the agent, which results in low reward values for the left
angled motion. The other actions are not constrained,
and consequently their reward value is higher. In this
case, the agent will choose the straight goal-oriented
action, as it maximizes Ra.
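As a quick arithmetic check (added here for illustration), combining the (R_a^goal, R_a^polite) pairs of Figure 5 with Eq. 4 and the value γ = 0.4 used in our experiments gives R = 0.6·1 + 0.4·1 = 1.0, 0.6·0.5 + 0.4·1 = 0.7, 0.6·0.2 + 0.4·0.1 = 0.16 and 0.6·(−1) + 0.4·1 = −0.2, so the unconstrained goal-oriented action with pair (1, 1) indeed obtains the highest combined reward.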
5.2 Multi-armed Bandit Formulation
As the number of navigation states is very large, we
adopt a stateless representation. Each agent can select
one action at a time, hence the question is which one
should the agent execute at a given time. In ALAN,
agents learn the reward value of each action through its
execution, in an online manner, and keep the recently
obtained rewards (using a moving time window of the
rewards) to decide how to act. We allow a chosen action
to be executed for a number of cycles, and perform an a-
posteriori evaluation to account for bad decisions. This
way, the problem of deciding how to move becomes a
resource allocation problem, where agents have a set of
alternative strategies and have to learn their estimated values via sampling, choosing one at a time in an online manner until reaching their goals.
Online learning problems with a discrete set of ac-
tions and stateless representation can be well formu-
lated as multi-armed bandit problems. In a multi-armed
bandit problem, an agent makes sequential decisions
on a set of actions to maximize its expected reward.
This formulation is well-suited for stationary problems,
as existing algorithms guarantee a logarithmic bound
on the regret. Although our problem is non-stationary
in a global sense, as the joint local conditions of the
agents are highly dynamic, individual agents often un-
dergo long periods of stationary reward distributions.
An example of a solution for a navigation task, where
we can distinguish periods of stationary reward distri-
butions, is shown in Figure 6.

Fig. 6: Distinguishable periods of stationary reward distribution for the agent on the left. (a) The agent must reach its goal on the other side of a group of agents moving in the opposite direction. The optimal action in each period changes between (b) the goal-oriented motion, (c) the sideways motion to avoid the incoming group, and (d) the goal-oriented motion again, once the agent has avoided the group.

Here, a single agent on the left must travel to the other side of an incoming group of agents (Fig. 6(a)). Initially, the optimal action for the single agent is to directly move towards its goal, as it has not yet sensed the incoming group (Fig. 6(b)). The optimal action does not change until
the agent sees the group of agents in its direct goal
path, at which point the best action is to move side-
ways towards the goal (Fig. 6(c)) to avoid the group.
This sideways motion is the locally optimal action un-
til the agent has reached the clear area. Finally, once
the agent has avoided the group, the locally optimal
action is again to move straight towards the goal (Fig.
6(d)). Hence, in this example we see three periods of
stationary reward distribution.
Therefore, by learning the action that maximizes a
local reward function (Eq. 4) in each of these stationary
periods, agents can adapt to the local conditions.
5.3 Action Selection
We now describe how ALAN selects, at each action decision step, one of the available actions based on their computed reward values, using a probabilistic action-selection strategy, Softmax, which is described next.
5.3.1 Softmax
Softmax is a general action selection method that bal-
ances exploration and exploitation in a probabilistic
manner [50,59,55]. This method biases the action selec-
tion towards actions that have higher value (or reward,
in our terminology), by making the probability of select-
ing an action dependent on its current estimated value.
The most popular Softmax method uses the Boltzmann
distribution to select among the actions. Assuming that
Ra is the reward value of action a, the probability of
choosing a is given by the following equation:
Softmax(a) = exp(R_a/τ) / Σ_{a'=1}^{|Act|} exp(R_{a'}/τ)    (7)
The degree of exploration performed by a Boltzmann-
based Softmax method is controlled by the parameter
τ, also called the temperature. With values of τ close
to zero the highest-valued actions are more likely to be
chosen, while high values of τ make the probability of
choosing each action similar. We use a value of τ=0.2, as
we found that it shows enough differentiation between
different action values without being too greedy.
Another critical design issue of our action selection
method is the duration of the time window used. Keep-
ing old samples with low values might make a good
action look bad, but discarding them too quickly will
ignore the past. Because of this, we use a moving time
window of the most recently obtained rewards, and
compute the estimated value of each action based only
on the rewards in that time window, using the last sam-
pled reward for each. If an action has not been sampled
recently, it is assumed to have a neutral (zero) value,
which represents the uncertainty of the agent with re-
spect to the real value of the action. Actions with a neu-
tral value have a low probability of being selected if the
currently chosen action has a “good” value (>0), and
have a high probability of being selected if the currently
chosen action has a “bad” value (<0). When making an
action decision, an agent retrieves the last sampled re-
ward value for each action in the time window, or zero
if the action has not been sampled recently. These val-
ues are then used by Softmax (Eq. 7) to determine the
probability of each action being chosen.
In Section 6.6 we analyze the effect of different sizes
of time window on the performance of ALAN.
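The following C++ sketch summarizes this selection step under the choices stated above (τ = 0.2, a moving time window, and a neutral zero value for actions whose last sample has expired); the data layout and names are ours, not the paper's implementation.

#include <cmath>
#include <random>
#include <vector>

struct RewardSample { double value; double time; bool valid; };

// Boltzmann (Softmax) selection over the actions' most recent rewards, Eq. (7).
// Actions not sampled within the time window are given a neutral value of zero.
int softmaxSelect(const std::vector<RewardSample>& lastSample,
                  double now, double window, double tau, std::mt19937& rng) {
    std::vector<double> weight(lastSample.size());
    for (std::size_t a = 0; a < lastSample.size(); ++a) {
        double r = 0.0;  // neutral value for unsampled or expired actions
        if (lastSample[a].valid && now - lastSample[a].time <= window)
            r = lastSample[a].value;
        weight[a] = std::exp(r / tau);
    }
    std::discrete_distribution<int> pick(weight.begin(), weight.end());
    return pick(rng);  // index of the chosen action (its preferred velocity)
}
// Typical use in ALAN: window = 2.0 s, tau = 0.2, one decision every ~0.2 s.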
Fig. 7: Screen shots of three states (Initial, Middle, End) of a navigation problem. (a) Initially, the black agent can move unconstrained towards the goal. (b) During its interaction with other agents, the black agent moves sideways since this increases its reward. (c) Finally, when its goal path is free, the black agent moves again towards the goal.
Simulation state            Action ID:   0        1        2        3        4        5        6        7
Initial          reward              0.997      0        0      0.147      0      0.145      0        0
                 prob                94.1%    0.64%    0.64%    1.34%    0.64%    1.33%    0.64%    0.64%
Middle           reward              -0.05    -0.42    -0.54      0      0.001   -0.192    0.456      0
                 prob                 5.4%    0.83%    0.46%     7.1%     7.1%     2.7%    69.3%     7.1%
End              reward               0.63     0.47       0      0.48       0        0     0.177      0
                 prob                56.7%      25%     2.4%       3%     2.4%     2.4%     5.8%     2.4%

Table 1: Reward values and probability for each action of being chosen by the black agent using ALAN in the three different states shown in Figure 7.
5.3.2 Evolution of rewards during simulation
As agents move to their goals, their evaluation of the
available actions affects the probability of choosing each
action. Figure 7 shows three simulation states of a nav-
igation task while Table 1 shows, for each action of the
black agent, the computed rewards and probability of
being chosen as the next action. The goal of this eval-
uation is to empirically show how the estimated value
of each action changes as the agent faces different con-
ditions, and how these estimates affect the probability
of the action being chosen.
In the Initial state (Fig. 7(a)), the black agent can
move unconstrained towards the goal, which is reflected
in the high reward and corresponding probability of the
goal oriented action (ID 0). In the Middle state (Fig.
7(b)), the black agent is facing congestion, which trans-
lates into a low reward for the goal oriented action. In-
stead, it determines that the action with the highest
value is the one moving left (ID 6), which also has the
highest probability of being chosen. Finally, in the End
state (Fig. 7(c)), the goal path of the black agent is free.
Through exploration, the black agent determines that
the goal oriented motion (ID 0) is again the one with
the best value, though with lower reward value than in
the beginning, as the wall prevents the agent from mov-
ing at full speed. With a 56.7% probability, the agent
selects the goal oriented motion and eventually reaches
its goal. Note that the actions not sampled during the
duration of the time window used in this experiment
(2s) are assigned the neutral zero value.
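These probabilities follow directly from Eq. 7 with τ = 0.2; as a check we add for illustration, in the Initial state the goal-oriented action yields exp(0.997/0.2) ≈ 146.2, the two weakly rewarded side actions yield exp(0.147/0.2) ≈ 2.08 and exp(0.145/0.2) ≈ 2.06, and each neutral action contributes exp(0) = 1, so dividing by the total of approximately 155.4 gives the 94.1%, 1.34%, 1.33% and 0.64% entries reported in the first rows of Table 1.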
6 Evaluation
We now present the experimental setup, performance
metrics, and scenarios used to compare the performance
of ALAN to other navigation approaches (Section 6.4).
We also evaluate the design choices of ALAN, such as
the action selection method (Section 6.5), the time win-
dow length (Section 6.6), and the balance between goal
progress and politeness, controlled by the coordination
factor γ (Section 6.7) in the reward function. Additional
results are presented later, after we extend the action
selection method to include learning the action space.
6.1 Experimental Setup
We implemented ALAN in C++. Results were gathered
on an Intel Core i7 at 3.5 GHz. Each experimental result
is the average over 30 simulations. In all our runs, we
updated the positions of the agents every ∆t = 50 ms
and set the maximum speed υmax of each agent to
1.5 m/s and its radius r to 0.5 m. Agents could sense
other agents within a 15 m radius, and obstacles within
1 m. To avoid synchronization artifacts, agents are given
a small random delay in how frequently they can up-
date their vpref (with new vpref decisions computed ev-
ery 0.2 s on average). This delay also gives ORCA a few
timesteps to incorporate sudden velocity changes before
the actions are evaluated. Small random perturbations
were added to the preferred velocities of the agents to
prevent symmetry problems.
6.2 Performance Metric
To evaluate the performance of ALAN, we measure the
time that the agents take to reach their goals compared
to the upper bound of their theoretical minimum travel
time. We call this metric interaction overhead.
Definition: Interaction Overhead. The interaction over-
head is the difference between the travel time of the set
of agents, as measured by Eq. 2, and the upper bound
of their travel time if all the agents could follow their
shortest paths to their goals at maximum speed without
interacting with each other, i.e.:
Interaction Overhead = TTime(A) − MinTTime(A)

where MinTTime(A) is the upper bound of the theoretical minimum travel time of the set of agents A, computed similarly to Eq. 2, and evaluated as follows:

MinTTime(A) = μ(MinimumGoalTime(A)) + 3·σ(MinimumGoalTime(A))    (8)
where MinimumGoalTime(A) is the set of travel times
for all agents in A, if they could follow their shortest
route to their goals, unconstrained, at maximum speed.
The interaction overhead metric allows us to evalu-
ate the performance of ALAN from a theoretical stand-
point in each of the navigation scenarios.
An interaction overhead of zero means that the agents reached the lower bound of their travel time; this is the best result that any optimal centralized approach could potentially achieve.
6.3 Scenarios
To evaluate ALAN we used a variety of scenarios, with
different numbers of agents and, in some cases, with
static obstacles. Figure 8 shows the different simulation
scenarios. These include: (a) Congested: 32 agents
are placed very close to the narrow exit of an open
hallway and must escape the hallway through this exit
(Fig. 8(a)); (b) Deadlock: Ten agents start at oppo-
site sides of a long, narrow corridor. Only one agent
can fit in the narrow space (Fig. 8(b)); (c) Incoming:
A single agent interacts with a group of 15 agents mov-
ing in the opposite direction (Fig. 8(c)); (d) Blocks:
Five agents must avoid a set of block-shaped obstacles
to reach their goals (Fig. 8(d)); (e) Bidirectional:
two groups of 9 agents each move in opposite directions
inside a corridor (Fig. 8(e)); (f) Circle: 80 agents walk
to their antipodal points on a circle (Fig. 8(f)); (g) Intersection: 80 agents in four perpendicular streams meet in an intersection (Fig. 8(g)); (h) Crowd: 400 randomly placed agents must reach their randomly assigned goal positions, while moving inside a square room (Fig. 8(h)).
6.4 Comparison of ALAN to Other Navigation
Approaches
To better quantify the effect of ALAN in reducing the
travel time of the agents, we compare its interaction
overhead values with existing navigation algorithms:
ORCA, the Social Forces model proposed by Helbing et
al. [20], that has been extensively used to simulate the
navigation of pedestrians [19,26,18,21], and the Predic-
tive collision avoidance model proposed in [28].
Results from this comparison can be observed in
Figure 9. Overall, ALAN outperforms the other ap-
proaches in most cases, and gets agents to their goals
even when the other three approaches fail to do so.
In scenarios with obstacles, ALAN is able to move the
agents to their goals, while some (sometimes all) other
evaluated approaches cannot. Here, the good performance of ALAN can be explained by the diversity of available motions and the behavior encouraged by the reward function, which allows agents to find alternative goal paths while avoiding obstacles. When such paths do not exist (as in the Deadlock scenario), ALAN allows agents to adopt a polite behavior and “get out of the way” of other agents, backtracking and allowing them to move to their goals.
Fig. 8: Simulated scenarios: (a) Congested, (b) Deadlock, (c) Incoming, (d) Blocks, (e) Bidirectional, (f) Circle, (g) Intersection and (h) Crowd.

In obstacle-free scenarios (Circle and Incoming), agents have more space to maneuver while moving to their goals. Hence, finding an implicitly coordinated motion is not as critical as in the previous case. In a large scenario such as Circle, the exploratory behavior of ALAN before and after congestion (where agents can move to their goals unconstrained) prevents it from outperforming ORCA and the Social Forces model. In
a small scenario like Incoming, the overhead of ex-
ploration does not affect ALAN as much as in Cir-
cle, allowing it to outperform both ORCA and the So-
cial Force model. However, with the Predictive model,
agents in the group make space for the single agent to
move directly to its goal, reaching it faster than with
ALAN.
From this evaluation, we can observe that ALAN
works especially well when agents are highly constrained
by both other agents and static obstacles, and its per-
formance advantage is more moderate when agents go
through long periods of unconstrained motion.

Fig. 9: Interaction overhead of ORCA, the Social Forces model, the Predictive model, and ALAN in all scenarios. N/A indicates cases where the corresponding method was unable to get agents to their goals.

6.5 Evaluation of Action Selection Method

A key component of ALAN is its Softmax-inspired action selection method. Here, we validate this design
choice by comparing the interaction overhead of dif-
ferent action selection methods, namely, ϵ-greedy [50]
(with an ϵ value of 0.1) and UCB [3], within the con-
text of ALAN. This evaluation is done using the Sample
action set (Fig. 3).
Fig. 10: Interaction overhead of ALAN, using different action selection methods (Softmax, ϵ-greedy, and UCB) with the Sample action set (Fig. 3) in all scenarios.
Results (Fig. 10) indicate that the Softmax action
selection helps ALAN achieve the best results. This can
be explained by the combination of Softmax’s proba-
bilistic nature and its non-uniform randomized explo-
ration. Unlike ϵ-greedy, Softmax adapts its exploration to the action values, exploring lower-valued actions with lower probability. Unlike UCB, the action choice is probabilistic, and it does not depend on the frequency with which each action has been chosen, which is important as that frequency is not necessarily related to the optimal action.
6.6 Effect of time window size
To determine what the time window size in ALAN should be, we show in Figure 11 a summary of the interaction overhead results obtained by varying the size of the time window (up to 20 s).
Figure 11 shows that agents perform best when us-
ing a time window of approximately 1-5 seconds, which
corresponds to approximately 5−25 action decisions (as
a decision is made on average every 0.2 secs). In gen-
eral, keeping the estimated values for too long or too lit-
tle time hurts performance. Discarding action estimates
too quickly (which turns their value into zero) makes
the agent “forget” the previously chosen actions. This
means that, while exploring, an agent does not have an
intuition of which actions can provide a better or worse
reward value, as all have the same probability of being
chosen. On the other hand, keeping action estimates for
too long perpetuates possibly outdated values, and re-
duces the probability of choosing an action that might
have recently increased its quality.

Fig. 11: Interaction overhead of ALAN in all scenarios, for different sizes of the time window used for computing the estimated value of each action.

Results show that a
time window of 1-5 seconds provides a good balance:
it provides agents with some recent information, use-
ful for biasing the exploration towards recently tried
“good” actions and away from “bad” actions, while also
preventing an outdated reward value from introducing
noise in the action decision of the agent. Unless other-
wise noted, we use a time window of 2 seconds through-
out all our experiments.
6.7 Coordination Factor γ
The coordination factor γ controls how goal oriented or polite the agents in ALAN are, based on the reward function (Eq. 4). Figure 12 shows how the value of γ
affects the performance of ALAN. We varied the value
of γ between 0 and 0.9, where γ=0 means that agents
optimize their actions only based on their goal progress,
while γ=0.9 implies that agents optimize their actions
based mostly on their politeness, and barely take into
account their goal progress. With γ=1 agents make no
progress towards their goal.
Fig. 12: Interaction overhead of ALAN in all eight scenarios, for a range of values of the coordination factor parameter γ.

A first observation, based on Figure 12, is that a high weight on the politeness component (a high value of γ) increases the interaction overhead in all scenarios. This is most noticeable with values of γ > 0.6. Here, the agents are too deferent towards each other, which ends up slowing down their progress. On the other hand, a high weight on the goal-oriented component (low values of γ) seems to have a significant negative effect only on the Deadlock scenario, and a slight negative effect on the Intersection scenario. In the Deadlock scenario, agents in one group are forced to move backwards to exit the narrow corridor. Selfishly maximizing their goal progress prevents agents (of one group) from quickly backtracking and clearing the way for agents in
the opposite group. In this case, a balance between goal
oriented and polite behavior (γ values between 0.3 and
0.6) allows agents to more quickly switch between both
types of behavior. In other scenarios, ALAN is robust
to a wide variety of γ values, minimizing the interac-
tion overhead values when γ < 0.5. In these cases, op-
timizing the action selection based mostly on the goal
progress allows agents to find alternative goal paths,
using the open space to avoid congestion.
Overall, giving slightly more weight to the goal ori-
ented component than the politeness component allows
agents to alternate between goal oriented and polite behaviors based on their local conditions, showing deference to other agents in order to avoid (or resolve)
congestion but also moving to the goal when the path
is clear. For these reasons, we used a γ value of 0.4 in
all ALAN experiments.
7 Action Space Learning
The success of the motion scheme in ALAN depends
strongly on the action space, i.e. the action set speci-
fying the preferred velocities. The action selection we
have shown is limited to the pre-defined sample set of
actions (Fig. 3). However, depending on the environ-
ment, different sets of actions might produce motions that improve navigation.
We propose an offline learning approach based on a
Markov Chain Monte Carlo (MCMC) method [17,36]
with simulated annealing [31] to determine, for a given
environment (or set of environments), the set of actions
that minimizes the travel time.
Although MCMC is typically used as a sampling method, we use it as an optimization method, sampling with a bias towards regions of better performance.
We chose MCMC over other methods because of the na-
ture of the problem, i.e. the effectiveness of any subset
of actions depends on the others. First, greedy methods
like gradient descent would not be successful given the
local minima. Second, the bandit formulation for choos-
ing actions within an action set does not apply because
the optimization cannot be decomposed to each action.
Third, evolutionary methods only work well when bet-
ter solutions to subproblems (subsets of the actions) are
likely to provide a better solution to the whole problem,
which is not our case.
Our method is summarized in Algorithm 2. It starts
from a set composed of two actions, one action along
the goal direction, the other action in a random direc-
tion. The MCMC process searches through the action
space with biased exploration towards action sets that
promote more time-efficient interactions. The explored
action set with the highest performance is regarded as
the result at the end of the process. Below we describe
each step in more detail.
Algorithm 2: The MCMC action space learning
1:  Act ← {GoalDir, RandomDir}, Act_opt ← Act
2:  F ← Evaluate(Act), F_opt ← F
3:  T ← T_init, dT ← (T_final − T_init)/(N − 1)
4:  for i = 1 to N do
5:    M ← SelectModification(Act, i)
6:    Act′ ← ApplyModification(Act, M)
7:    F′ ← Evaluate(Act′, i)
8:    if F′ < F_opt then
9:      F_opt ← F′, Act_opt ← Act′
10:   end if
11:   if Rand(0, 1) < q(Act, Act′) · exp((F − F′)/T) then
12:     F ← F′, Act ← Act′
13:   end if
14:   T ← T − dT
15: end for
16: return Act_opt
Action Set Modification.
In each iteration, we
perform one of the following types of modifications:
– Modify an action within an interval around its cur-
rent direction, symmetric on both sides.
– Remove an action that is not the initial goal-directing
one.
– Add an action within the modification interval of
an existing action.
The first type of modification is explored with higher
weight (i.e. performed more often), because we consider
the quality of the actions to be more important than the
number of actions. Following the simulated-annealing
scheme, the modification range decreases over iterations
as the simulation moves from global exploration to local
refinement. The modification ranges are determined by
short learning processes.
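A minimal C++ sketch of one proposal step is shown below, with actions represented simply as angles relative to the goal direction; the branch weights and the perturbation range are placeholders standing in for the weights and annealed modification ranges described above.

#include <random>
#include <vector>

// Propose a modified action set: usually perturb one action's direction within
// +/- range; otherwise remove a non-goal action or add a new action near an
// existing one. angles[0] is the goal-directed action and is never removed.
std::vector<double> proposeActionSet(std::vector<double> angles, double range,
                                     std::mt19937& rng) {
    std::uniform_real_distribution<double> u01(0.0, 1.0);
    std::uniform_real_distribution<double> perturb(-range, range);
    const double p = u01(rng);
    if (p < 0.6 || angles.size() <= 2) {                // modify (preferred, higher weight)
        std::size_t a = 1 + rng() % (angles.size() - 1);
        angles[a] += perturb(rng);
    } else if (p < 0.8) {                               // remove a non-goal action
        std::size_t a = 1 + rng() % (angles.size() - 1);
        angles.erase(angles.begin() + a);
    } else {                                            // add near an existing action
        std::size_t a = rng() % angles.size();
        angles.push_back(angles[a] + perturb(rng));
    }
    return angles;
}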
Action Set Evaluation. The performance of each
new action set is evaluated via ALAN simulation runs.
Eq. 2 is used to estimate the travel time of the set of
agents (here the set of agents is made implicit while the
action set is an explicit input to the simulation). We
evaluate an action set Act with the function F, whose
definition is equivalent to the definition of TTime in
Eq. 2 (Section 4) but with action set as the explicit
argument rather than the set of agents.
The simulation is repeated multiple times and the
average evaluation from all repeated runs is used to
evaluate the action set. Following the simulated-annealing
scheme, the number of simulation runs increases over iterations, so that the later local refinement has less uncertainty.
Action Set Update. We use a common variant of
MCMC, the Metropolis-Hastings scheme [17], to reject
some of the attempted modifications and thus explore
better action sets efficiently. The probabil-
ity of keeping a change is related to how it changes the
evaluation F, which is the key to biasing towards ac-
tion sets with lower evaluation values. The probability
to accept a new action set Act′ over a previous action
set Act is
min(1, q(Act, Act′) exp((F − F′)/T)),                              (9)
where F and F ′ are the evaluation with action set Act
and Act′ respectively, q(Act, Act′) is a factor that ac-
counts for the asymmetric proposal probabilities of tran-
sitioning between Act and Act′, and T is a parameter
within the simulated-annealing scheme. The parameter
T decreases over iterations, making the probability of
accepting unfavorable changes decrease, which moves
the optimization from global exploration towards local
refinement.
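In isolation, the acceptance test of Eq. 9 amounts to the following check (a sketch that defaults to a symmetric-proposal simplification, i.e. q = 1, unless a different factor is supplied):

import math
import random

def accept_new_set(f_old, f_new, temperature, q=1.0):
    # Accept with probability min(1, q * exp((F - F') / T)): improving sets are
    # essentially always accepted, while worse sets are accepted less and less
    # often as the temperature decreases. The exponent is clipped only to avoid
    # numerical overflow for very large improvements.
    return random.random() < min(1.0, q * math.exp(min((f_old - f_new) / temperature, 50.0)))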
After a predefined number of iterations of the MCMC
process, the action set Act with the lowest travel time
is returned. In our domain, agents have no prior knowl-
edge of the environment, which means that they cannot
determine which actions are available beforehand. How-
ever, this MCMC approach allows us to perform a quali-
tative analysis of which behaviors are most effective in
each type of environment, as we will see in the next
section.
7.1 Optimized Action Sets
To find an optimized set of actions for the scenarios,
shown in Figure 8, we first learned an optimal action
set for each individual scenario. Then, we used MCMC
again to learn an action set that would work well across
different scenarios, even ones not considered in the learn-
ing process.
7.1.1 Action Sets Optimized for Each Scenario
Figures 13 and 14 show the set of actions computed by
MCMC for the agents in the scenarios: Congested,
Deadlock, Blocks, Bidirectional, Intersection
and Crowd.
As a general observation, the action set learned for
all these scenarios contains at least one action that
moves the agent, to some degree, backwards from its
goal. This backtracking helps in reducing congestion,
allowing agents coming from behind, or in the opposite
direction, to move to their goals. In the Congested
and Deadlock scenarios, our MCMC approach found
that a set of just 3 actions is enough to minimize the ar-
rival time of the agents. The obstacles that characterize
these two environments (in the form of a doorway or a
narrow corridor) significantly reduce the flow of agents
moving to their goals. In these two scenarios, the back-
wards actions are almost symmetrical with respect to
the goal of the agent. This allows the agent to use the
space available outside of the narrow area to spread
evenly, left and right, while clearing the path to the
goal of the other agents. This behavior avoids conges-
tion on any given side of the narrow area, minimizing
the travel time of the agents.
A set of 3 actions was also found for the Crowd
scenario, where the action that moves the agent side-
ways helps it to better avoid agents coming in different
directions, while the backwards action enables agents
to backtrack when there is no space to avoid the con-
gestion that quickly develops in this scenario.
The action set found in the Blocks scenario is
larger (9 versus 3 actions) and highly asymmetrical as
compared to the Congested and Deadlock cases.
Most actions in this scenario move the agents closer to
their goals, unlike the dominant backtracking motions
of the previous cases. This indicates that the agents
need more fine-grained control to quickly move around
the obstacles and reach their goals.
In the Bidirectional scenario, the 6 actions found
mostly bias the motion of the agents to their right,
similar to the Blocks scenario. Given the symmetri-
cal nature of this scenario, this bias allows agents to
create lanes on each side of the corridor, increasing the
efficiency of their own motions and creating space for
agents coming in the opposite direction.
In the Intersection scenario, our MCMC method
found that a small set of two actions optimizes the
travel time of the agents. The backtracking behavior
enabled by the backwards action helps agents resolve
congestion in the intersection of the four corridors,
which creates space for other agents to move to their
goals, reducing their overall travel time.
Fig. 13: Optimized set of actions found by the MCMC method for the (a) Congested, (b) Deadlock and (c) Blocks scenarios.
Fig. 14: Optimized set of actions found by the MCMC method for the (a) Bidirectional, (b) Intersection and (c) Crowd scenarios.
Figures 15(a) and 15(b) show the optimized set of
actions for the Incoming and Circle scenarios. A com-
mon pattern found by MCMC for these environments
is that the actions are heavily biased towards one of the
sides of the agents. This bias, along with the absence
of obstacles, allows agents to move around other agents
using the available space. In the Incoming scenario,
the sideway actions help the single agent moving to the
right (see Figure 8(c)) to avoid getting trapped by the
incoming group. In the Circle scenario, the optimized
actions allow the agents to create a vortex-shaped for-
mation when reaching the center of the environment
(see Figure 8(f)), which avoids congestion and helps
the agents reach their goals faster. Note that, in both
scenarios, the two sideways actions are very similar to
each other. This gives agents more fine-grained control
of their avoidance behavior, minimizing the detour
from their goal-oriented motion.
7.1.2 Multi-scenario Optimized Action Set
To learn a multi-scenario action set, first we trained
MCMC on a set of five scenarios, leaving out the Bidi-
rectional, Intersection and Crowd scenarios as
test examples.
Fig. 15: Optimized set of actions (velocities) found by the MCMC method for the (a) Incoming and (b) Circle scenarios.
We chose to leave out these scenarios because, without
being identical to the other scenarios, they
share some features with the training set: they have ob-
stacles which constrain the motion of the agents, and
also require agents to interact with each other. Then,
we evaluated the resulting multi-scenario optimized ac-
tion set in the entire set of eight scenarios.
The multi-scenario optimized action set can be seen
in Figure 16. We can observe two main features of this
action set. First, note the asymmetry of the actions,
which is helpful in obstacle-free environments to im-
plicitly coordinate the motion of agents and avoid con-
gestion. Second, we can see that half of the actions move
the agents backwards from their goals, which is useful
in very constrained scenarios. Again, the apparently re-
dundant actions, both backwards as well as towards the
goal, give agents better control of their behaviors.
Fig. 16: Optimized set of actions (velocities) found by the MCMC method when trained on five of the eight scenarios in Fig. 8.
Fig. 17: Interaction overhead (s) of ALAN, using the sample action set, the multi-scenario optimized action set, and the per-scenario optimized action set.
7.2 Comparison of Performance between Action Sets
We compared the interaction overhead results of using
ALAN with different action sets: the sample set (see
Figure 3), the per-scenario optimized set and the multi-
scenario optimized set, computed using our MCMC ap-
proach.
Figure 17 shows that our MCMC approach learns
optimized action sets that outperform the sample ac-
tions. When computed on a per-scenario basis, this op-
timized action set outperforms the sample action set in
all scenarios (in all pairwise t-tests, p < 0.05). When
computed using five of the eight evaluated scenarios, it
still outperforms the sample set in most scenarios (p <
0.05) while performing similarly in the Bidirectional
scenario. Note that while the Bidirectional, Inter-
section and Crowd scenarios were not included in the
training process of the multi-scenario optimized action
set, this set still outperforms the Sample action set in
the Intersection and Crowd scenarios. This indicates that the
multi-scenario action set generalizes well to previously
unseen environments.
On the other hand, the interaction overhead results
of the per-scenario optimized action sets are better than
those of the multi-scenario action set, with pairwise differences
being statistically significant (p < 0.05) in most scenar-
ios. The only exceptions correspond to the Congested
and the Incoming scenarios, where the performance
difference is not significant.
We can observe that agents using the multi-scenario
optimized action set display behaviors typically attributed
to social conventions in human crowds, where pedes-
trians defer to others to improve the flow and avoid
deadlocks. An example of these behaviors can be seen
in the Deadlock scenario (backtracking to defer to
incoming agents). These behaviors enable agents to re-
duce their travel time in a wide range of environments
without the need for specific (and often unavailable)
domain knowledge. Further, the per-scenario optimized
action set enables agents to showcase human-like be-
haviors in the Bidirectional scenario (each group of
agents avoids incoming agents moving to their right)
and implicitly coordinated motion in the Circle sce-
nario (agents forming a vortex in the middle of the sce-
nario to avoid congestion).
8 Analysis of ALAN
In this section, we analyze different aspects of ALAN,
such as its runtime, how its performance scales with
the number of agents, and its robustness to failures
in the agents' actuators. We also com-
pare the performance of ALAN with a strategy where
the preferred velocity of each agent is randomized at dif-
ferent time intervals, and show that ALAN outperforms
this strategy in all but one scenario. Unless otherwise
noted, results labeled with ALAN are obtained with the
multi-scenario optimized set of actions (Fig. 16).
8.1 Runtime Complexity
During each simulation cycle, each agent performs two
main operations: it first chooses a preferred velocity us-
ing its online action-selection algorithm and then maps
this velocity to a collision-free one using ORCA. In
practice, since the number of actions that need to be
evaluated is small, selecting a new action has a negligi-
ble runtime, while ORCA dominates the overall runtime
performance. Consequently, similar to ORCA, ALAN
runs in O(n) time per agent, where n is the number
of neighboring agents and obstacles used to compute
the non-colliding velocity of the agent. In time units,
ORCA takes approx. 1.5 × 10−5 seconds to compute a
new collision-free velocity, while ALAN takes approx.
3 × 10−6 seconds to select a new preferred velocity. In total,
ALAN takes approx. 1.8 × 10−5 seconds of processing time for
each agent. This corresponds to less than a thousandth
of a simulation timestep, which allows us to simulate
hundreds of agents in real-time.
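These timings allow a quick back-of-the-envelope check of how many agents fit in one simulation cycle on a single core; the timestep value below is only an assumed example for illustration, not the value used in our simulations.

def agents_per_cycle(timestep_s, orca_s=1.5e-5, action_select_s=3e-6):
    # Rough single-core estimate: how many agents can be updated within one
    # simulation timestep, ignoring growth in neighbor counts and other overheads.
    per_agent_s = orca_s + action_select_s        # ~1.8e-5 s per agent per cycle
    return int(timestep_s / per_agent_s)

# With an assumed 50 ms timestep this gives roughly 2800 agents per core:
print(agents_per_cycle(0.05))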
Method                           Congested  Deadlock  Incoming  Blocks  Bidirectional  Circle  Intersection  Crowd
ORCA w/random action every 1s      238.2      337.5      12.7     48.2       37.7       49.4       164        151.6
ORCA w/random action every 2s      290.3      553.2      10.6    106.6       34.4       31.8       166.2      138.5
ORCA w/random action every 3s      263.4      768.9      10.9    151.9       36.4       28.6       129.6      137.9
ORCA                               299.7        N/A      19.8      N/A       94.9       27.4       178.2      144.6
ALAN                               149.5       74.4       3.9     15.7       33.9       33.4       115.6      107.9
Table 2: Comparison of interaction overhead (s) of ORCA, ALAN, and a sample action chosen randomly every few
seconds. Bold numbers indicate best results, which may be more than one if there is not a statistically significant
difference between them.
8.2 ALAN vs random velocity perturbation
Table 2 compares the interaction overhead performance
of ALAN, ORCA, and a random action selection, where
agents select a random action (from the Sample action
set) every 1, 2 or 3 seconds. Results indicate that, in
most scenarios, randomizing the selection of the pre-
ferred velocity helps prevent (or resolve) congestion, which
results in lower travel times than ORCA in many cases,
or even allows agents to reach their goals when ORCA
alone cannot. Specifically, selecting a random action
with some frequency allows agents to reach their goals
in the Blocks environment (unlike vanilla ORCA): by
randomly moving sideways, agents can escape the local
minima behind the obstacles. We can also observe that
in the Incoming and Bidirectional scenarios, the
performance is better than just following goal directed
motion (ORCA). In the Incoming scenario, specifi-
cally, the selection of random actions creates some space
between the incoming agents, space that is used by the
single agent to (slowly) move to its goal.
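For reference, the random-perturbation baseline of Table 2 can be sketched as follows; the agent fields and the switching period are illustrative placeholders, and ORCA is still responsible for mapping the chosen preferred velocity to a collision-free one.

import random

def random_action_baseline(agent, sample_actions, t_now, period_s=2.0):
    # Every period_s seconds, replace the agent's preferred velocity with an
    # action drawn uniformly from the sample action set (field names are
    # placeholders, not part of the ALAN/ORCA implementation).
    if t_now - agent.last_switch_time >= period_s:
        agent.preferred_velocity = random.choice(sample_actions)
        agent.last_switch_time = t_now
    return agent.preferred_velocity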
With respect to the performance obtained by ALAN,
we can observe that, in general, random perturbations
to the preferred velocity of ORCA perform worse than
ALAN, as this does not allow agents to adapt to the
changes in the local navigation conditions.
8.3 Scalability
To analyze how the performance of ALAN scales when
there are more agents, we varied the number of agents
in the Intersection and Crowd scenarios (Figure 8),
and evaluated the interaction overhead time. Results,
shown in Figure 18, indicate that ALAN scales better
than ORCA in both scenarios. In the Crowd environ-
ment, the performance of ALAN and ORCA is similar
with 350 agents. As we increase the number of agents,
the difference in interaction overhead is more notice-
able. In the Intersection scenario, the difference in
performance between ORCA and ALAN is noticeable
starting at 40 agents, and grows as the number of
agents increases.
Fig. 18: Interaction overhead as a function of the number of agents in the (a) Crowd and (b) Intersection scenarios.
8.4 Limitations of ALAN
Although ALAN successfully reduces the travel time of
the agents compared to existing navigation approaches,
it is not free of limitations. One such limitation re-
lates to the probabilistic nature of its action selection.
Specifically, there is no guarantee that agents will al-
ways choose the optimal action. This can have a nega-
tive impact on other agents' motions. Also, agents eval-
uate their actions based only on their past observations,
without considering their long-term consequences. This
might prevent an agent from reaching its goal, for exam-
ple, when large obstacles block its goal-oriented paths.
8.4.1 Applicability of ALAN to multi-robot systems
To use ALAN in multi-robot systems, some assump-
tions would need to be changed. Since ALAN uses ORCA
for computing collision-free velocities, it makes the same
assumption as ORCA of holonomic, disc-shaped robots.
ORCA would need to be adapted for computing veloc-
ity obstacles for other robot shapes. Currently, we do
not assume bounds on the acceleration of the agents
and do not consider rotations in the time to take an
action. Robots with non-holonomic constraints would
need to account for rotations and other kinematic con-
straints, which could be done, for example, using recent
extensions to ORCA [6,12,1]. Even without bounds on
the acceleration, the motions produced by ALAN look
realistic in many of the scenarios, except for ORCA’s as-
sumption of a 360-degree sensing range, which is more
than double the field of view used by humans. Hence,
agents might react to other agents coming from behind
to avoid collisions, in a way that might not mimic how
humans move.
ORCA assumes that agents have perfect sensing ca-
pabilities of the positions and velocities of other agents,
which is not necessarily true in decentralized multi-robot
systems. Fortunately, this problem has been tackled
previously by other researchers (for example, [22], where
the authors deal with the problem of uncertainty in the
localization of agents). Hence, we can use existing solu-
tions to reduce the gap between simulation and real-world
execution of ALAN.
Imperfect Actuators. We always include a small
amount of noise in the computed preferred velocities
to avoid symmetry issues. This noise can reflect some
level of inaccuracy of the actuators. Since in the real
world actuators can fail, we evaluated the performance
of ALAN when each chosen action has some probability
of not being executed due to actuator failure. Re-
sults, shown in Figure 19, indicate that ALAN is robust
to failure in the actuators. Performance degrades grace-
fully as the probability of actions not being executed
increases. Specifically, the rate at which the interaction
overhead values increase depends on the frequency of
change of the locally optimal action. In the Incoming
scenario, for example, the locally optimal action for the
single agent only changes a couple of times (to avoid
the group and to resume goal oriented motion), hence
the performance degradation is not very noticeable un-
til the probability of actuator failure is over 70%. On
the other hand, in the Congested scenario the perfor-
mance degradation is visible at around 20% of proba-
bility of actuator failure. Overall, this result indicates
that ALAN still performs well under these conditions.
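The failure model used in this experiment can be summarized by the following sketch; how a real platform behaves when a command is not executed (e.g. keeping the previous command versus stopping) is an assumption here.

import random

def apply_actuator_failure(chosen_velocity, previous_velocity, failure_prob):
    # With probability failure_prob the newly selected preferred velocity is not
    # executed and the agent keeps its previous one.
    if random.random() < failure_prob:
        return previous_velocity
    return chosen_velocity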
Fig. 19: Interaction overhead of ALAN in the eight scenarios shown in Figure 8, when actions have a probability of not being executed.
9 Conclusions and Future Work
In this paper, we addressed the problem of computing
time-efficient motions in multi-agent navigation tasks,
where there is no communication or prior coordination
between the agents. We proposed ALAN, an adaptive
learning approach for multi-agent navigation. We for-
mulated the multi-agent navigation problem as an ac-
tion selection problem in a multi-armed bandit setting,
and proposed an action selection algorithm to reduce
the travel time of the agents.
In ALAN, the agents use recent observations to de-
cide which action to execute. ALAN uses principles of
the Softmax action selection strategy and a limited time
window of rewards to dynamically adapt the motion of
the agents to their local conditions. We also introduced
an offline Markov Chain Monte Carlo method that al-
lows agents to learn an optimized action space in each
individual environment, and in a larger set of scenar-
ios. This enables agents to reach their goals faster than
using a predefined set of actions.
Experimental results in a variety of scenarios and
with different numbers of agents show that, in general,
agents using ALAN make more time-efficient motions
than using ORCA, the Social Forces model, and a Pre-
dictive model for pedestrian navigation. ALAN’s low
computational complexity and completely distributed
nature make it an ideal choice for multi-robot systems
that have to operate in real-time, often with limited
processing resources.
There are many avenues for future research. We
plan to investigate the applicability of ALAN to hetero-
geneous environments, for example, by letting ALAN
agents learn the types of the other agents present in the
environment and their intended goals. This would allow
an agent to more accurately account for the behavior of
nearby agents during action selection. Finally, we would
also like to port our approach to real robots and test
it in real-world environments, such as for search and
rescue operations or evacuation planning.
References
1. Alonso-Mora, J., Breitenmoser, A., Rufli, M., Beardsley,
P., Siegwart, R.: Optimal reciprocal collision avoidance
for multiple non-holonomic robots. In: Distributed Au-
tonomous Robotic Systems, pp. 203–216. Springer (2013)
2. Audibert, J.Y., Munos, R., Szepesvári, C.: Exploration–
exploitation tradeoff using variance estimates in multi-
armed bandits. Theoretical Computer Science 410(19),
1876–1902 (2009)
3. Auer, P., Cesa-Bianchi, N., Fischer, P.: Finite-time analy-
sis of the multiarmed bandit problem. Machine Learning
47(2-3), 235–256 (2002)
4. Bayazit, O., Lien, J.M., Amato, N.: Better group behav-
iors in complex environments using global roadmaps. In:
8th International Conference on Artificial life, pp. 362–
370 (2003)
5. van den Berg, J., Guy, S.J., Lin, M., Manocha, D.: Recip-
rocal n-body collision avoidance. In: Proc. International
Symposium of Robotics Research, pp. 3–19. Springer
(2011)
6. van den Berg, J., Snape, J., Guy, S.J., Manocha, D.: Re-
ciprocal collision avoidance with acceleration-velocity ob-
stacles. In: IEEE International Conference on Robotics
and Automation, pp. 3475–3482 (2011)
7. Bham, G.H., Benekohal, R.F.: A high fidelity traffic
simulation model based on cellular automata and car-
following concepts.
Transportation Research Part C:
Emerging Technologies 12(1), 1–32 (2004)
8. Buşoniu, L., Babuška, R., De Schutter, B.: A comprehen-
sive survey of multi-agent reinforcement learning. IEEE
Trans. Syst., Man, Cybern. C, Appl. Rev 38(2), 156–172
(2008)
9. Cunningham, B., Cao, Y.: Levels of realism for coopera-
tive multi-agent reinforcement learning. In: Advances in
Swarm Intelligence, pp. 573–582. Springer (2012)
10. Fiorini, P., Shiller, Z.: Motion planning in dynamic en-
vironments using Velocity Obstacles.
The Int. J. of
Robotics Research 17, 760–772 (1998)
11. Funge, J., Tu, X., Terzopoulos, D.: Cognitive modeling:
knowledge, reasoning and planning for intelligent charac-
ters. In: 26th Annual Conference on Computer Graphics
and Interactive Techniques, pp. 29–38 (1999)
12. Giese, A., Latypov, D., Amato, N.M.: Reciprocally-
rotating velocity obstacles. In: IEEE International Con-
ference on Robotics and Automation, pp. 3234–3241
(2014)
13. Godoy, J., Karamouzas, I., Guy, S.J., Gini, M.: Adaptive
learning for multi-agent navigation. In: Proc. Int. Conf.
on Autonomous Agents and Multi-Agent Systems, pp.
1577–1585 (2015)
14. Guy, S., Chhugani, J., Kim, C., Satish, N., Lin, M.,
Manocha, D., Dubey, P.: Clearpath: highly parallel colli-
sion avoidance for multi-agent simulation. In: ACM SIG-
GRAPH/Eurographics Symposium on Computer Anima-
tion, pp. 177–187 (2009)
15. Guy, S., Kim, S., Lin, M., Manocha, D.: Simulating het-
erogeneous crowd behaviors using personality trait the-
ory. In: Proc. ACM SIGGRAPH/Eurographics Sympo-
sium on Computer Animation, pp. 43–52 (2011)
16. Guy, S.J., Chhugani, J., Curtis, S., Pradeep, D., Lin, M.,
Manocha, D.: PLEdestrians: A least-effort approach to
crowd simulation. In: ACM SIGGRAPH/Eurographics
Symposium on Computer Animation, pp. 119–128 (2010)
17. Hastings, W.K.: Monte carlo sampling methods using
markov chains and their applications. Biometrika 57(1),
97–109 (1970)
18. Helbing, D., Buzna, L., Werner, T.: Self-organized pedes-
trian crowd dynamics and design solutions. Traffic Forum
12 (2003)
19. Helbing, D., Farkas, I., Vicsek, T.: Simulating dynami-
cal features of escape panic. Nature 407(6803), 487–490
(2000)
20. Helbing, D., Molnar, P.: Social force model for pedestrian
dynamics. Physical review E 51(5), 4282 (1995)
21. Helbing, D., Molnar, P., Farkas, I.J., Bolay, K.: Self-
organizing pedestrian movement. Environment and Plan-
ning B: Planning and Design 28(3), 361–384 (2001)
22. Hennes, D., Claes, D., Meeussen, W., Tuyls, K.: Multi-
robot collision avoidance with localization uncertainty.
In: Proc. Int. Conf. on Autonomous Agents and Multi-
Agent Systems, pp. 147–154 (2012)
23. Henry, P., Vollmer, C., Ferris, B., Fox, D.: Learning to
navigate through crowded environments. In: Proc. IEEE
Int. Conf. on Robotics and Automation, pp. 981–986
(2010)
24. Hettiarachchi, S.: An evolutionary approach to swarm
adaptation in dense environments. In: IEEE Int’l Conf.
on Control Automation and Systems, pp. 962–966 (2010)
25. Hopcroft, J.E., Schwartz, J.T., Sharir, M.: On the com-
plexity of motion planning for multiple independent ob-
jects; PSPACE-hardness of the "warehouseman's problem".
The Int. J. of Robotics Research 3(4), 76–88 (1984)
26. Johansson, A., Helbing, D., Shukla, P.K.: Specification
of the social force pedestrian model by evolutionary ad-
justment to video tracking data. Advances in Complex
Systems 10, 271–288 (2007)
27. Karamouzas, I., Geraerts, R., van der Stappen, A.F.:
Space-time group motion planning.
In: Algorithmic
Foundations of Robotics X, pp. 227–243. Springer (2013)
28. Karamouzas, I., Heil, P., van Beek, P., Overmars, M.: A
predictive collision avoidance model for pedestrian simu-
lation. In: Motion in Games, LNCS, vol. 5884, pp. 41–52.
Springer (2009)
29. Karamouzas, I., Overmars, M.: Simulating and evaluat-
ing the local behavior of small pedestrian groups. IEEE
Trans. Vis. Comput. Graphics 18(3), 394–406 (2012)
30. Khatib, O.: Real-time obstacle avoidance for manipula-
tors and mobile robots. Int. J. Robotics Research 5(1),
90–98 (1986)
31. Kirkpatrick, S., Gelatt, C.D., Vecchi, M.P., et al.: Opti-
mization by simulated annealing. Science 220(4598),
671–680 (1983)
32. Kober, J., Bagnell, J.A., Peters, J.: Reinforcement learn-
ing in robotics: A survey. The International Journal of
Robotics Research 32(11), 1238–1274 (2013)
33. Kornhauser, D.M., Miller, G.L., Spirakis, P.G.: Coordi-
nating pebble motion on graphs, the diameter of permu-
tation groups, and applications. Master's thesis, M.I.T.,
Dept. of Electrical Engineering and Computer Science (1984)
34. Macready, W.G., Wolpert, D.H.: Bandit problems and
the exploration/exploitation tradeoff. IEEE Trans. Evol.
Comput. 2(1), 2–22 (1998)
35. Martinez-Gil, F., Lozano, M., Fernández, F.: Multi-agent
reinforcement learning for simulating pedestrian navi-
gation. In: Adaptive and Learning Agents, pp. 54–69.
Springer (2012)
36. Metropolis, N., Rosenbluth, A.W., Rosenbluth, M.N.,
Teller, A.H., Teller, E.: Equation of state calculations by
fast computing machines. The journal of chemical physics
21(6), 1087–1092 (1953)
37. Olfati-Saber, R.: Flocking for multi-agent dynamic sys-
tems: Algorithms and theory. IEEE Transactions on au-
tomatic control 51(3), 401–420 (2006)
38. Ondřej, J., Pettré, J., Olivier, A.H., Donikian, S.: A
synthetic-vision based steering approach for crowd simu-
lation. ACM Trans. Graphics 29(4), 123 (2010)
39. Pelechano, N., Allbeck, J., Badler, N.: Controlling indi-
vidual agents in high-density crowd simulation. In: Proc.
ACM SIGGRAPH/Eurographics Symposium on Com-
puter Animation, pp. 99–108 (2007)
40. Pelechano, N., Allbeck, J.M., Badler, N.I.: Virtual
crowds: Methods, simulation, and control. Synthesis Lec-
tures on Computer Graphics and Animation 3(1), 1–176
(2008)
41. Pettré, J., Ondřej, J., Olivier, A.H., Crétual, A.,
Donikian, S.: Experiment-based modeling, simulation
and validation of interactions between virtual walkers. In:
ACM SIGGRAPH/Eurographics Symposium on Com-
puter Animation, pp. 189–198 (2009)
42. Popelová, M., Bída, M., Brom, C., Gemrot, J., Tomek,
J.: When a couple goes together: walk along steering. In:
Motion in Games, LNCS, vol. 7060, pp. 278–289. Springer
(2011)
43. Ratering, S., Gini, M.: Robot navigation in a known en-
vironment with unknown moving obstacles. Autonomous
Robots 1(2), 149–165 (1995)
44. Reynolds, C.: Steering behaviors for autonomous char-
acters.
In: Game Developers Conference, pp. 763–782
(1999)
45. Reynolds, C.W.: Flocks, herds, and schools: A distributed
behavioral model.
Computer Graphics 21(4), 24–34
(1987)
46. Shao, W., Terzopoulos, D.: Autonomous pedestrians.
Graphical Models 69(5-6), 246–274 (2007)
47. Sieben, A., Schumann, J., Seyfried, A.: Collective phe-
nomena in crowds where pedestrian dynamics need social
psychology. PLoS ONE 12(6) (2017)
48. Solovey, K., Yu, J., Zamir, O., Halperin, D.: Motion plan-
ning for unlabeled discs with optimality guarantees. In:
Robotics: Science and Systems (2015)
49. Sutton, R.S.: Learning to predict by the methods of tem-
poral differences. Machine Learning 3(1), 9–44 (1988)
50. Sutton, R.S., Barto, A.G.: Reinforcement Learning: An
Introduction. MIT Press (1998)
51. Torrey, L.: Crowd simulation via multi-agent reinforce-
ment learning. In: Proc. Artificial Intelligence and Inter-
active Digital Entertainment, pp. 89–94 (2010)
52. Tsai, J., Bowring, E., Marsella, S., Tambe, M.: Empiri-
cal evaluation of computational fear contagion models in
crowd dispersions. Autonomous Agents and Multi-Agent
Systems pp. 1–18 (2013)
53. Uther, W., Veloso, M.: Adversarial reinforcement learn-
ing. Tech. rep., Carnegie Mellon University (1997)
54. van den Berg, J., Lin, M., Manocha, D.: Reciprocal ve-
locity obstacles for real-time multi-agent navigation. In:
Proc. IEEE Int. Conf. on Robotics and Automation, pp.
1928–1935 (2008)
55. Whiteson, S., Taylor, M.E., Stone, P.: Empirical studies
in action selection with reinforcement learning. Adaptive
Behavior 15(1), 33–50 (2007)
56. Yu, J., LaValle, S.M.: Planning optimal paths for multiple
robots on graphs. In: Proc. IEEE Int. Conf. on Robotics
and Automation, pp. 3612–3617. IEEE (2013)
57. Zhang, C., Lesser, V.: Coordinated multi-agent learning
for decentralized POMDPs.
In: 7th Annual Workshop
on Multiagent Sequential Decision-Making Under Uncer-
tainty (MSDM) at AAMAS, pp. 72–78 (2012)
58. Zhang, C., Lesser, V.: Coordinating multi-agent rein-
forcement learning with limited communication. In: Proc.
Int. Conf. on Autonomous Agents and Multi-Agent Sys-
tems, pp. 1101–1108 (2013)
59. Ziebart, B.D., Ratliff, N., Gallagher, G., Mertz, C., Peter-
son, K., Bagnell, J.A., Hebert, M., Dey, A.K., Srinivasa,
S.: Planning-based prediction for pedestrians. In: Proc.
IEEE/RSJ Int. Conf. on Intelligent Robots and Systems,
pp. 3931–3936 (2009)