Depth Control of Model-Free AUVs via
Reinforcement Learning
Hui Wu, Shiji Song, Member, IEEE, Keyou You, Member, IEEE, and Cheng Wu
Abstract—In this paper, we consider depth control problems of an autonomous underwater vehicle (AUV) for tracking desired depth trajectories. Because the dynamical model of the AUV is unknown, these problems cannot be solved by most model-based controllers. To this end, we formulate the depth control problems of the AUV as continuous-state, continuous-action Markov decision processes (MDPs) with unknown transition probabilities. Based on the deterministic policy gradient (DPG) and neural network approximation, we propose a model-free reinforcement learning (RL) algorithm that learns a state-feedback controller from sampled trajectories of the AUV. To improve the performance of the RL algorithm, we further propose a batch-learning scheme that replays prioritized previous trajectories. Simulations show that our model-free method is comparable to model-based controllers such as LQI and NMPC. Moreover, we validate the effectiveness of the proposed RL algorithm on a seafloor data set sampled from the South China Sea.
Index Terms—AUV, Depth control, Reinforcement learning, Deterministic policy gradient, Neural network, Prioritized experience replay
I. INTRODUCTION
AUTONOMOUS underwater vehicles (AUVs) are a type of self-controlled submarine whose flexibility, autonomy and size diversity make them advantageous in many applications, including seabed mapping [1], chemical plume tracing [2], resource gathering, contaminant source localization [3], operation in dangerous environments, maritime rescue, etc.
Therefore, the control of AUVs has drawn great attention from the control community. Among the many control problems of AUVs, depth control is crucial in many applications. For example, when performing seabed mapping, an AUV is required to keep a constant distance from the seafloor.
There are many difficulties in the depth control problems of AUVs. The nonlinear dynamics of AUVs degrades the performance of many linear controllers such as linear quadratic integral (LQI) and fixed proportional-integral-derivative (PID) controllers. Even for nonlinear controllers, it is often hard to obtain an exact dynamical model of the AUV in practice. Moreover, the complicated undersea environment brings various disturbances, e.g., sea currents, waves and model uncertainty, all of which increase the difficulty of the depth control.
Most conventional control approaches focus on solving the control problems of the AUV based on an exact dynamical model.
This work was supported by National Science Foundation of China
(41427806).
The authors are with the Department of Automation and TNList, Ts-
inghua University, Beijing, 100084, China (e-mail: wuhui115199@163.com;
shijis@tsinghua.edu.cn; youky@tsinghua.edu.cn; wuc@tsinghua.edu.cn).
An extended PID controller including an acceleration feedback is proposed for the dynamic positioning systems of marine surface vessels [4]. The controller compensates for the disturbances of the slowly varying forces by introducing a measured acceleration feedback. A self-adaptive PID controller tuned by Mamdani fuzzy rules is used to control a nonlinear AUV system in tracking the heading and depth with a stabilized speed, and outperforms classically tuned PID controllers [5].
Other model-based controllers of AUVs include backstepping [6], [7], sliding mode [8], [9], model predictive control [10], [11], etc. An adaptive backstepping controller is designed for the tracking problem of ships and mechanical systems which guarantees uniform global asymptotic stability (UGAS) of the closed-loop tracking error [12]. Combined with line-of-sight guidance, two sliding mode controllers are respectively designed for the sway-yaw-roll control of a ship [8]. Model predictive control (MPC) is a control strategy where the control input at each sampling time is computed based on predictions over a certain horizon [13]. In [14], a controller is obtained by solving a particular MPC problem and controls a constrained nonlinear submarine to track the sea floor.
However, the performance of model-based controllers degrades seriously under an inaccurate dynamical model. In practical applications, accurate dynamical models of AUVs are difficult to obtain due to the complex underwater environment. In such cases, a model-free controller is required, which is learned by reinforcement learning (RL) in this work.
RL is a framework, rooted in dynamic programming, for solving Markov decision processes (MDPs) without knowledge of the transition model. It has been applied successfully to robotic control problems, including path planning for a single mobile robot [15] or multiple robots [16], robot soccer [17], biped robots [18], unmanned aerial vehicles (UAVs) [19], etc.
In this paper, we propose an RL framework for the depth control problems of the AUV based on the deterministic policy gradient (DPG) theorem and neural network approximation. We consider three depth control problems, namely the constant depth control, the curved depth tracking and the seafloor tracking, which differ in the form of and the information available about the target trajectories.
The key to applying RL is how to model the depth control problems as MDPs. An MDP describes a process in which an agent at some state takes an action and transitions to the next state with a one-step cost. In our problems, the definitions of the 'state' and 'one-step cost' are significant for the performance of RL. Usually the motions of an AUV are described by six coordinates and their derivatives. It is straightforward to regard these coordinates directly as the states of the MDPs. However, we find this choice unsuitable since the desired depth trajectories are not included and some of the coordinates are periodic, and thus we design a better state.
Most RL algorithms approximate a value function and a policy function, which are used to evaluate and generate a policy, respectively. The forms of the approximators are determined according to the transition model of the MDP. However, the stochastic nonlinear dynamics and the constrained control inputs in the depth control problems of the AUV result in serious approximation difficulties for RL. Therefore, we design neural network approximators considering their powerful representation abilities, where it is crucial to design suitable structures of the networks for our problems.
After designing the networks, we train them based on the sampled trajectories of the AUV, which compensates for the lack of a model. However, the depth control problems have demanding real-time requirements because of the large amount of energy consumed when controlling the AUV, which means the sampled data may not be sufficient for the online demand. To improve the data efficiency, we propose a batch-learning scheme that replays previous experiences.
The main contributions of our paper are summarized as
follows:
1) We formulate three depth control problems of AUVs as MDPs with suitable forms of the states and the cost functions.
2) We design two neural networks with specific structures and rectified linear unit (ReLU) activation functions.
3) We propose a batch-gradient updating scheme called prioritized experience replay and incorporate it into our RL framework.
The remainder of the paper is organized as follows. In Section II, we describe the motions of AUVs and the three depth control problems. In Section III, the depth control problems are modeled as MDPs with appropriate forms of the states and one-step cost functions. In Section IV, an RL algorithm based on the DPG is applied to solve the MDPs. In Section V, we highlight several innovative techniques that are proposed to improve the performance of the RL algorithm. In Section VI, simulations are performed on a classic REMUS AUV, and the performances of two model-based controllers are compared with that of our algorithm to validate its effectiveness. In addition, an experiment on a real seafloor data set is performed to show the practicability of our RL framework.
II. PROBLEM FORMULATION
In this section, we describe the coordinate frames of AUVs
and the depth control problems.
A. Coordinate Frames of AUVs
The motions of an AUV have six degrees of freedom
(DOFs) including surge, sway, heave, which refer to the longi-
tudinal, sideways, vertical displacements, and yaw, roll, pitch,
which describe the rotations around the vertical, longitudinal,
(a)
(b)
Fig. 1.
The coordinate description of the AUV. (a)The six DOFs of the
motions. (b)The two coordinate frames for describing the AUV motions.
transverse axises. Fig. 1(a) illustrates the details of the six
DOFs.
Correspondingly, there are six independent coordinates determining the position and orientation of the AUV. An earth-fixed coordinate frame {I} is defined, in which the position and orientation of the AUV are denoted by η = [x, y, z, φ, θ, ψ]^T. The earth-fixed frame is assumed to be inertial by ignoring the effect of the earth's rotation. The linear and angular velocities, denoted by
ν = [u, v, w, p, q, r]T are described in a body-fixed coordinate
frame {B}, which is a moving coordinate frame whose origin
is fixed to the AUV. Fig. 1(b) shows the two coordinate frames
and six coordinates.
B. Depth Control Problems
For simplicity, we just consider the depth control problems
of AUVs on the x-z plane in this work, all of which can
be easily extended to three-dimensional cases. Thus we only
examine the motions on the x-z plane and drop the terms out
of the plane. Furthermore, the surge speed u is assumed to be
constant. The remaining coordinates are denoted by the vector
χ = [z, θ, w, q]T , including heave position z, heave velocity
w, pitch orientation θ and pitch angular velocity q.
The dynamical equation of the AUV is defined as follows:

\dot{\chi} = f(\chi, u, \xi)    (1)

where u denotes the control vector and ξ denotes the possible disturbance.
The purpose of the depth control is to control the AUV to track a desired depth with minimum energy consumption, where the desired depth trajectory z_r is given by

z_r = g(x).    (2)
We are concerned with the depth control under three types of situations according to the form of and the information available about z_r:
1) Constant depth control: The constant depth control is to control the AUV to operate at a constant depth. The desired depth is constant, i.e., \dot{z}_r = 0. To prevent the AUV from oscillating around z_r, the heading direction is required to stay consistent with the x axis, i.e., the pitch angle θ = 0.
2) Curved depth tracking: The curved depth tracking is to control the AUV to track a given curved depth trajectory. The desired depth trajectory is a curve, and its derivative \dot{z}_r, which describes the tendency of the curve, is known. Besides minimizing |z − z_r|, the pitch angle is required to stay consistent with the slope angle of the curve.
3) Seafloor tracking: Seafloor tracking is to control the AUV to track the seafloor and keep a constant safe distance simultaneously. When tracking the seafloor, sonar-like devices on the AUV can only measure the relative depth z − z_r from itself to the seafloor, so the slope angle of the seafloor curve is unknown. The missing information makes the control problem more difficult.
Fig. 2. The evolution of the MDP.
III. MDP MODELING
In this section, we model the above three depth control
problems as MDPs with unknown transition probabilities due
to the unknown dynamics of the AUV.
A. Markov Decision Process
A MDP is a stochastic process satisfying the Markov
property. It consists four components: a state space S, an
action space A, an one-step cost function c(s, a) : S ×
A →R and a stationary one-step transition probability
p(st|s1, a1, . . . , st−1, at−1). The Markov property means that
current state is only dependent on the last state and action, i.e.,
p(st|s1, a1, . . . , st−1, at−1) = p(st|st−1, at−1).
(3)
The MDP describes how an agent interacts with the environment: at some time step t, the agent in state s_t takes an action a_t and transitions to the next state s_{t+1} according to the transition probability, with an observed one-step cost c_t = c(s_t, a_t). Fig. 2 illustrates the evolution of the MDP.
The MDP problem is to find a policy to minimize the long-
term cumulative cost function. The policy is a map from the
state space S to the action space A, and can be defined as a
function form π : S →A or a distribution π : S ×A →[0, 1].
Therefore, the optimization problem is formulated as
min
π∈P J(π) = min
π∈P E
" K
X
k=1
γk−1ck|π
#
(4)
where P denotes the policy space and γ is a discounted
factor with 0 < γ < 1. The superscript K of the summation
represents the horizon of the problem.
The definitions of the four components of the MDPs are essential for the performance of RL algorithms. For the depth control problems of the AUV, the unknown dynamics means an unknown transition probability, and the action corresponds to the control variable, i.e., a = u. Therefore, the key is how to design the states and the one-step cost functions for the control problems.
B. MDP for Constant Depth Control
The purpose of the constant depth control problem is to
control an AUV to operate at a constant depth zr. This task
is simple but basic for more complicated cases. We design an
one-step cost function of the form
c(χ, u) = ρ1(z −zr)2 + ρ2θ2 + ρ3w2 + ρ4q2 + uT Ru (5)
where ρ1(z −zr)2 is for minimizing the relative depth, ρ2θ2
for keeping the pitch angle along the x axis and the last three
terms for minimizing the consumed energy. The cost function
provides a trade-off between different controlling objectives
through the coefficients ρ1, ρ2, ρ3, ρ4.
It is intuitive to choose χ, which describes the motions of the AUV, as the state, but this choice is actually not advisable. Firstly, the pitch angle θ is an angular variable which cannot be added to the state directly due to its periodicity. A state with θ = 0 and one with θ = 2π are different but actually equivalent to each other. Hence we split the pitch angle into the two trigonometric components [cos(θ), sin(θ)]^T to eliminate the periodicity.
The second drawback is the absolute depth z in the state. Imagine that we have controlled an AUV at a specific depth z_r, and now the target depth changes to a new depth z'_r that the AUV has never visited. In this case, the control policy for the old z_r does not work for the new depth z'_r because the new states have not been visited before. Thus the relative depth ∆z ≜ z − z_r is a better choice.
To overcome the above drawbacks, we design the state of the constant depth control as follows:

s = [\Delta z, \cos(\theta), \sin(\theta), w, q]^T.    (6)
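For concreteness, the following Python sketch assembles the state (6) and evaluates the one-step cost (5) from the AUV coordinates χ = [z, θ, w, q]^T. The weights ρ_i and R below are illustrative placeholders, since the exact values are not listed in the text above.

```python
import numpy as np

# Illustrative weights; the paper does not report the exact values used.
RHO = np.array([1.0, 0.5, 0.1, 0.1])      # rho_1 .. rho_4
R = 0.01 * np.eye(2)                       # control-effort weight for u = [tau_1, tau_2]

def constant_depth_state(chi, z_r):
    """State (6): s = [dz, cos(theta), sin(theta), w, q]^T with dz = z - z_r."""
    z, theta, w, q = chi
    return np.array([z - z_r, np.cos(theta), np.sin(theta), w, q])

def one_step_cost(chi, u, z_r):
    """One-step cost (5): rho_1*dz^2 + rho_2*theta^2 + rho_3*w^2 + rho_4*q^2 + u^T R u."""
    z, theta, w, q = chi
    quad = RHO @ np.array([(z - z_r) ** 2, theta ** 2, w ** 2, q ** 2])
    return quad + u @ R @ u

# Example: AUV at depth 2 m with a target depth of 8 m and zero control input.
chi = np.array([2.0, 0.0, 0.0, 0.0])
print(constant_depth_state(chi, 8.0), one_step_cost(chi, np.zeros(2), 8.0))
```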
C. MDP for Curved Depth Control
The curved depth control is to control the AUV to track a given curved depth trajectory z_r = g(x). The state (6) is not sufficient for the curved depth control because it does not include the slope angle of the curve and its derivative, denoted by θ_c and \dot{θ}_c. The slope angle θ_c determines whether the AUV goes uphill or downhill, and \dot{θ}_c represents the rate of change of θ_c.
Fig. 3 shows two situations where AUVs in the same state defined according to (6) are tracking two curves with different \dot{θ}_c. Obviously they cannot be controlled to track the two curves under the same policy. The failure results from the fact that the AUV cannot anticipate the 'tendency' of the curve since the state does not contain \dot{θ}_c.
In order to add the information about \dot{θ}_c to the state, we consider the form

\dot{\theta}_c = \frac{g''(x)}{1 + g'(x)^2}\,\dot{x} = \frac{g''(x)}{1 + g'(x)^2}\,(u_0 \cos\theta + w \sin\theta)    (7)

where u_0 denotes the constant surge speed of the AUV, and g'(x), g''(x) denote the first and second derivatives of g(x) with respect to x.
Fig. 3. Two different situations with different rates of change of θ_c, where the AUVs in the same state cannot be controlled to track the two curves under the same policy.
Fig. 4. The seafloor tracking problem.
Then we consider the derivative of the relative depth z_Δ ≜ z − z_r as follows:

\dot{z}_\Delta = \dot{z} - \dot{z}_r
             = w\cos\theta - u_0\sin\theta + (u_0\cos\theta + w\sin\theta)\tan\theta_c
             = \frac{w\cos(\theta-\theta_c) - u_0\sin(\theta-\theta_c)}{\cos\theta_c}
             = \frac{w\cos\theta_\Delta - u_0\sin\theta_\Delta}{\cos\theta_c}    (8)

where θ_Δ ≜ θ − θ_c denotes the relative angle between the pitch angle of the AUV and the slope angle of the target depth curve.
Inspired by the form of \dot{z}_Δ, we design the following one-step cost function

c(z_\Delta, \theta_\Delta, w, q, u) = \rho_1 z_\Delta^2 + \rho_2 \theta_\Delta^2 + \rho_3 w^2 + \rho_4 q^2 + u^T R u    (9)

where ρ_1 z_Δ^2 and ρ_2 θ_Δ^2 minimize the relative depth and the relative heading direction, respectively. It means that the AUV is controlled to track the depth curve and its tendency simultaneously.
The state for the curved depth control is defined as

s = [z_\Delta, \cos\theta_\Delta, \sin\theta_\Delta, \cos\theta_c, \sin\theta_c, \dot{\theta}_\Delta, w, q]^T.    (10)
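A sketch of how the state (10) can be assembled from the depth curve is given below. The sign convention θ_c = arctan(g'(x)) and the surge speed value in the example are assumptions made for illustration only.

```python
import numpy as np

def curved_depth_state(chi, x, u0, g, g_p, g_pp):
    """Assemble state (10) for curved depth tracking.

    chi = [z, theta, w, q]; g, g_p, g_pp are callables for z_r = g(x), g'(x), g''(x).
    theta_c = arctan(g'(x)) is an illustrative sign convention.
    """
    z, theta, w, q = chi
    theta_c = np.arctan(g_p(x))                          # slope angle of the depth curve
    x_dot = u0 * np.cos(theta) + w * np.sin(theta)       # horizontal speed of the AUV
    theta_c_dot = g_pp(x) / (1.0 + g_p(x) ** 2) * x_dot  # equation (7)
    theta_d = theta - theta_c                            # theta_Delta = theta - theta_c
    theta_d_dot = q - theta_c_dot                        # since theta_dot = q by (24d)
    z_d = z - g(x)                                       # z_Delta = z - z_r
    return np.array([z_d, np.cos(theta_d), np.sin(theta_d),
                     np.cos(theta_c), np.sin(theta_c), theta_d_dot, w, q])

# Example with the sinusoidal trajectory used later in the paper: z_r = 10 - sin(pi/50 * x).
g    = lambda x: 10.0 - np.sin(np.pi / 50.0 * x)
g_p  = lambda x: -np.pi / 50.0 * np.cos(np.pi / 50.0 * x)
g_pp = lambda x: (np.pi / 50.0) ** 2 * np.sin(np.pi / 50.0 * x)
print(curved_depth_state(np.array([9.0, 0.0, 0.0, 0.0]),
                         x=5.0, u0=2.0, g=g, g_p=g_p, g_pp=g_pp))
```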
D. MDP for Seafloor Tracking
Illustrated in Fig. 4, seafloor tracking is to control the AUV to track the seafloor while keeping a constant relative depth. In this case the AUV can only measure the relative vertical distance ∆z = z − z_r from the seafloor by a sonar-like device, but cannot obtain the slope angle of the seafloor curve or its derivative.
Therefore, the state (10) is not feasible in this case due to the
missing observations of θc and ˙θc. If we adopt the state defined
in (6), the problem illustrated in Fig. 3 still exists. Actually,
this problem is also called “perceptual aliasing” [20] which
means that different parts of the environment appear similar
to the sensor system of the AUV. The reason is that the state
(6) is just a partial observation of the environment. Thus we
consider expanding the state to contain more information.
Though not observed by the AUV, the tendency of the seafloor curve can still be estimated from the most recently measured sequence of relative vertical distances, i.e., [∆z_{t−N+1}, . . . , ∆z_{t−1}, ∆z_t], where N denotes the window size of the preceding history.
With the same one-step cost function (5), we define the expanded state for the seafloor tracking problem as

s = [\Delta z_{t-N+1}, \ldots, \Delta z_{t-1}, \Delta z_t, \cos\theta, \sin\theta, w, q]^T.    (11)

The value of the window size N is significant for the performance of the state, and we determine the best setting for N in the simulation.
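The windowed part of the state (11) can be maintained with a simple fixed-length buffer, as in the sketch below. Padding the window with the first measurement until N samples are available is an implementation choice of this sketch, not something specified in the paper.

```python
import numpy as np
from collections import deque

class SeafloorStateBuilder:
    """Build the expanded state (11) from the N most recent relative-depth measurements."""

    def __init__(self, window_size):
        self.window = deque(maxlen=window_size)

    def reset(self, dz0):
        # Before N measurements are available, pad the window with the first one.
        self.window.clear()
        self.window.extend([dz0] * self.window.maxlen)

    def build(self, dz, theta, w, q):
        """dz = z - z_r measured by the sonar-like device at the current step."""
        self.window.append(dz)
        return np.array(list(self.window) + [np.cos(theta), np.sin(theta), w, q])

builder = SeafloorStateBuilder(window_size=3)
builder.reset(dz0=-1.5)
print(builder.build(dz=-1.4, theta=0.05, w=0.0, q=0.0))
```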
IV. SOLVING MDPS OF DEPTH CONTROL VIA RL
In this section, we adopt a reinforcement learning algorithm to solve the MDPs formulated in Section III for the depth control problems of the AUV.
A. Dynamic Programming
Here we introduce dynamic programming, a classic routine for solving MDPs, as the basis of our RL algorithm for the depth control.
We first define two types of functions that evaluate the performance of a policy. The value function is a long-term cost function defined by

V^\pi(s) = \mathbb{E}\left[ \sum_{k=1}^{K} \gamma^{k-1} c_k \,\middle|\, s_1 = s, \pi \right]    (12)

with a starting state s_1 under a specific policy π. The action-value function (also called the Q-value function) is the value function

Q^\pi(s, a) = \mathbb{E}\left[ \sum_{k=1}^{K} \gamma^{k-1} c_k \,\middle|\, s_1 = s, a_1 = a, \pi \right]    (13)

with a chosen starting action a_1.
Note that the relationship between the long-term cost function (4) and the value function (12) is given by

J(\pi) = \int_{s} p_1(s) V^\pi(s)\, ds    (14)

where p_1(s) denotes the initial state probability. So the minimization (4) is equivalent to the Bellman optimality equation

V^*(s) = \min_{\pi \in \mathcal{P}} V^\pi(s) = \min_{\pi \in \mathcal{P}} \int_{a} \pi(a|s) \left[ c(s, a) + \int_{s'} p(s'|s, a) V^*(s')\, ds' \right] da.    (15)
The Bellman optimality equation determines the basic routine for solving the MDP problem, comprising two phases known as policy evaluation and policy improvement. The policy evaluation estimates the value function of a policy π as the evaluation of its performance by iteratively applying the Bellman equation

V^\pi_{j+1}(s) = \int_{a} \pi(a|s) \left[ c(s, a) + \int_{s'} p(s'|s, a) V^\pi_j(s')\, ds' \right] da    (16)

with an initially assumed value function V^\pi_0(s). The iteration can be performed until convergence (Policy Iteration), for a fixed number of steps (Generalized Policy Iteration), or even for a single step (Value Iteration) [21].
After the policy evaluation, the policy improvement takes place to obtain an improved policy based on the estimated value function by a greedy minimization

\pi'(s) = \arg\min_{a \in \mathcal{A}} \left[ c(s, a) + \int_{s'} p(s'|s, a) V^\pi(s')\, ds' \right].    (17)

Both phases are iterated alternately until the policy converges.
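Although the depth control MDPs have continuous states and actions, the evaluation-improvement cycle of (16)-(17) is easiest to see on a small finite MDP with a known transition model. The Python sketch below runs the combined Bellman backup (value iteration) on a hypothetical 3-state, 2-action MDP whose numbers are arbitrary.

```python
import numpy as np

# A hypothetical finite MDP: 3 states, 2 actions, known transitions and one-step costs.
P = np.array([[[0.9, 0.1, 0.0], [0.2, 0.8, 0.0]],
              [[0.0, 0.6, 0.4], [0.0, 0.1, 0.9]],
              [[0.0, 0.0, 1.0], [0.5, 0.0, 0.5]]])   # P[s, a, s']
C = np.array([[1.0, 2.0], [0.5, 0.2], [0.0, 1.0]])    # C[s, a]
gamma = 0.95

V = np.zeros(3)
for _ in range(500):
    # One-step Bellman backup: combines the evaluation (16) and the greedy improvement (17).
    Q = C + gamma * P @ V            # Q[s, a] = c(s, a) + gamma * sum_s' P[s, a, s'] V[s']
    V_new = Q.min(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-8:
        V = V_new
        break
    V = V_new

policy = Q.argmin(axis=1)            # greedy policy w.r.t. the converged value function
print(V, policy)
```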
B. Temporal Difference and Value Function Approximation
The dynamic programming only fits to the MDPs with
finite state and finite action spaces under a known transition
probability. For the MDPs constructed in Section III, the tran-
sition probabilities p(s′|s, a) are unknown due to the unknown
dynamics of the AUV, thus the value function updating (16)
cannot be executed. Moreover, the depth control of the AUV is
an online control problem which means that the AUV collects
sampled trajectories during the tracking process. Thus we
adopt a new rule known as temporal difference (TD) to update
the value function online using newly sampled data.
Suppose that we obtain a transition pair (s_k, u_k, s_{k+1}) observed by the AUV at time k; then TD gives the update of the Q-value function in the form [21]

Q(s_k, u_k) \leftarrow Q(s_k, u_k) + \alpha \left[ c(s_k, u_k) + \gamma \min_{u} Q(s_{k+1}, u) - Q(s_k, u_k) \right]    (18)

where α > 0 is a learning rate.
The TD algorithm updates the map from state-action pairs to their Q-values and stores it as a lookup table. However, for the depth control problem, the state consists of the motion vector χ of the AUV and the desired depth z_r, while the actions are usually forces and torques of the propellers. All these continuous variables lead to continuous state and action spaces, so a lookup table is obviously not sufficient.
We represent the map by a parameterized function Q(s, u|ω) and instead update the parameter ω as follows:

\delta_k = c(s_k, u_k) + \gamma Q(s_{k+1}, \mu(s_{k+1}|\theta)\,|\,\omega) - Q(s_k, u_k|\omega), \qquad \omega_{k+1} = \omega_k - \alpha \delta_k \nabla_\omega Q(s_k, u_k|\omega)    (19)

where µ(s_{k+1}|θ) is a policy function defined in the next subsection.
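As an illustration of the update (19) without a neural network, the sketch below uses a hypothetical linear-in-features approximator Q(s, u|ω) = ω^T φ(s, u), for which ∇_ω Q is simply φ(s, u). The feature map and learning rate are arbitrary, and the sign convention follows (19) as written above.

```python
import numpy as np

def features(s, u):
    """A simple hand-crafted feature map phi(s, u); the paper uses a neural network instead."""
    return np.concatenate([s, u, np.outer(s, u).ravel(), [1.0]])

def td_update(omega, s, u, cost, s_next, u_next, gamma=0.99, alpha=1e-3):
    """One step of the TD rule (19) for a linear Q(s, u | omega) = omega^T phi(s, u).

    u_next plays the role of mu(s_{k+1} | theta), the action proposed by the policy.
    """
    phi = features(s, u)
    q = omega @ phi
    q_next = omega @ features(s_next, u_next)
    delta = cost + gamma * q_next - q           # TD error
    return omega - alpha * delta * phi          # gradient of Q w.r.t. omega is phi

s = np.array([6.0, 1.0, 0.0, 0.0, 0.0]); u = np.array([0.5, -0.1])
omega = np.zeros(features(s, u).size)
omega = td_update(omega, s, u, cost=3.2, s_next=s * 0.98, u_next=u)
```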
C. Deterministic Policy Gradient
The continuous control inputs for the depth control of the AUV result in continuous actions, so the minimization (17) over the continuous action space is time-consuming if executed at every iteration.
Instead, we implement the policy improvement phase by the DPG algorithm, which updates the policy along the negative gradient of the value function. The DPG algorithm assumes a deterministic parameterized policy function µ(s|θ) and updates the parameter θ along the negative gradient of the long-term cost function of the depth control of the AUV:

\theta_{k+1} = \theta_k - \alpha \widehat{\nabla_\theta J}(\theta)    (20)

where \widehat{\nabla_\theta J}(\theta) is a stochastic approximation of the true gradient. The DPG algorithm derives the form of \widehat{\nabla_\theta J}(\theta) [22] as follows:

\widehat{\nabla_\theta J}(\theta) = \frac{1}{M} \sum_{i=1}^{M} \nabla_\theta \mu(s_i|\theta) \nabla_{u_i} Q^{\mu}(s_i, u_i)    (21)

where Q^µ denotes the Q-value function under the policy µ(s|θ), and the approximate gradient is calculated from a transition sequence [s_1, u_1, . . . , s_M, u_M].
In the last subsection, the Q-value function was approximated by a parameterized approximator Q(s, u|ω), so we replace \nabla_{u_i} Q^{\mu}(s_i, u_i) with \nabla_{u_i} Q(s_i, u_i|\omega). The approximate gradient is given by

\widehat{\nabla_\theta J}(\theta) \approx \frac{1}{M} \sum_{i=1}^{M} \nabla_\theta \mu(s_i|\theta) \nabla_{u_i} Q(s_i, u_i|\omega).    (22)

Note that the approximate gradient in (22) is calculated from a transition sequence, which seems inconsistent with the online character of the depth control problems. We can still obtain an online updating rule by setting M = 1, but the deviation of the approximation may be amplified. Actually, a batch updating scheme can be performed by sliding the sequence along the trajectory of the AUV.
We have defined the two function approximators Q(s, u|ω) and µ(s|θ) in the TD and DPG algorithms, but have not yet given their detailed forms. Considering the nonlinear and complicated dynamics of the AUV, we construct two neural network approximators, the evaluation network Q(s, u|ω) and the policy network µ(s|θ), with ω and θ as the weights.
To illustrate the updating of the two networks through the TD and DPG algorithms, we present the structure diagram shown in Fig. 5. The ultimate goal of the RL algorithm is to learn the state-feedback controller represented by the policy network. There are two back-propagation paths in our algorithm. The evaluation network is back-propagated by the error between the current Q-value \hat{Q}(s_k, u_k) and that of the successor state-action pair \hat{Q}(s_{k+1}, u_{k+1}) plus the one-step cost, which is the idea of the TD algorithm. The output of the evaluation network is then passed to the "gradient module" to generate the gradient \nabla_{u_k} \hat{Q}(s_k, u_k), which is propagated back to update the policy network through the DPG algorithm. The two back-propagation paths correspond respectively to the policy evaluation and policy improvement phases in dynamic programming. Note that the module "state convertor" transforms the AUV's coordinates χ_k and the reference depth signal z_ref into the state s_k, which denotes the process of state selection illustrated in Section III.
Fig. 5. The structure diagram of NNDPG.
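To make the update (22) concrete without a full network implementation, the sketch below uses a hypothetical linear deterministic policy µ(s|θ) = θ^T s, for which ∇_θ µ can be written in closed form; in NNDPG itself the same gradient is obtained by back-propagating through the policy network.

```python
import numpy as np

def dpg_update(theta, states, grad_q_u, alpha=1e-3):
    """Batch policy update (22) for a linear policy u = mu(s | theta) = theta.T @ s.

    theta    : (dim_s, dim_u) parameter matrix.
    states   : (M, dim_s) minibatch of states s_i.
    grad_q_u : (M, dim_u) gradients dQ(s_i, u_i | omega)/du_i supplied by the evaluation network.
    For this linear policy, applying d mu / d theta to dQ/du_i gives the outer product s_i (dQ/du_i)^T.
    """
    grad_J = np.zeros_like(theta)
    for s, g in zip(states, grad_q_u):
        grad_J += np.outer(s, g)
    grad_J /= len(states)
    return theta - alpha * grad_J        # descend the approximate policy gradient (20)

theta = np.zeros((5, 2))
states = np.random.randn(8, 5)
grad_q_u = np.random.randn(8, 2)
theta = dpg_update(theta, states, grad_q_u)
```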
V. IMPROVED STRATEGIES
In the last section, we have illustrated an RL framework for the depth control of the AUV, which updates two neural network approximators by iterating the TD and DPG algorithms. Combining the characteristics of the control problems, we further propose improved strategies from two aspects. Firstly, we design suitable structures for the neural networks according to the physical constraints when controlling the AUV, where a new type of activation function is adopted. Then we propose a batch-learning scheme to improve the data efficiency in view of the online features of the depth control problems.
A. Neural Network Approximators
In Section IV, we defined the two function approximators Q(s, u|ω) and µ(s|θ) to approximate the Q-value function and the policy function, respectively, but did not assume detailed forms of the approximators. Since the dynamics of the AUV is usually nonlinear and complicated, we construct two neural network approximators, the evaluation network Q(s, u|ω) and the policy network µ(s|θ), with ω and θ as the weights.
The evaluation network has four layers with the state s and the control variable u as the inputs, where u is not included until the second layer, following [23]. The output layer is a linear unit that generates a scalar Q-value.
The policy network is designed with three layers and generates the control variable u from the input state s. Considering the limited power of the propellers of the AUV, the output u must be constrained to a given range. Therefore, we adopt tanh units as the activation functions of the output layer. The output of the tanh function in [−1, 1] is scaled to the given interval.
Fig. 6. The curves of the ReLU, sigmoid and tanh functions, where the ReLU only inhibits changes along the negative x axis and the other two inhibit changes in both directions.
Fig. 7. The structures of the two neural networks: (a) evaluation network structure; (b) policy network structure. The evaluation network has 4 layers with the control input included at the second layer. The policy network has 3 layers with a tanh function as the output activation unit.
Besides the structures of the networks, we also adopt a newer type of activation function, the rectified linear unit (ReLU), for the hidden layers. In conventional neural network controllers, the activation functions are usually sigmoid or tanh functions, which are sensitive to changes near zero but insensitive to changes at large magnitudes. This saturation property brings a 'vanishing gradient' problem where the gradient of the unit reduces to zero under a large input, which means the large input does not help the training of the network. The ReLU function avoids this problem as it only inhibits changes in one direction, as illustrated in Fig. 6. Moreover, the simpler form of the ReLU can accelerate the training of the networks, which fits the online property of the depth control for the AUV.
Finally, we illustrate the full structures of the two networks in Fig. 7.
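To make the two architectures concrete, the following plain NumPy forward pass mirrors the structure described above: ReLU hidden layers, the control input entering the evaluation network at its second layer, and a tanh output scaled to the actuator range. The layer widths and the limit U_MAX are illustrative choices, not values reported in the paper.

```python
import numpy as np

relu = lambda x: np.maximum(x, 0.0)

def init(shape, scale=0.1):
    return scale * np.random.randn(*shape)

DIM_S, DIM_U, H1, H2, U_MAX = 5, 2, 64, 64, 20.0   # widths and actuator limit are illustrative

# Evaluation network Q(s, u | omega): four layers, action injected at the second layer.
W1, b1 = init((H1, DIM_S)), np.zeros(H1)
W2s, W2u, b2 = init((H2, H1)), init((H2, DIM_U)), np.zeros(H2)
W3, b3 = init((1, H2)), np.zeros(1)

def q_value(s, u):
    h1 = relu(W1 @ s + b1)
    h2 = relu(W2s @ h1 + W2u @ u + b2)   # control input u enters here
    return (W3 @ h2 + b3)[0]             # linear output unit -> scalar Q-value

# Policy network mu(s | theta): three layers, tanh output scaled to the actuator range.
V1, c1 = init((H1, DIM_S)), np.zeros(H1)
V2, c2 = init((DIM_U, H1)), np.zeros(DIM_U)

def policy(s):
    h1 = relu(V1 @ s + c1)
    return U_MAX * np.tanh(V2 @ h1 + c2)  # constrained control input

s = np.array([6.0, 1.0, 0.0, 0.0, 0.0])
u = policy(s)
print(u, q_value(s, u))
```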
B. Prioritized Experience Replay
In this subsection, we consider how to improve the data efficiency for the depth control problems of the AUV. In practice, it consumes large amounts of resources to control the AUV in a real undersea environment, so each record along the trajectory is 'expensive'. As mentioned, the RL algorithm learns the optimal control policy based on the sampled trajectory data of the AUV. The learning process is necessary but wastes time and resources since the AUV may take suboptimal or even bad actions in order to interact fully with the environment. For the depth control of the AUV, we need to shorten the learning process by improving the data efficiency.
We propose a batch-learning scheme called prioritized experience replay, which is an improved version of the experience replay proposed by Lin [24]. Imagine controlling the AUV with the RL algorithm. At each time step, the AUV in state s executes a control input u and observes the subsequent state s' and a one-step cost c. We call the tuple (s, u, s', c) an 'experience'. Instead of updating the evaluation network and the policy network with only the newly sampled experience, prioritized experience replay uses a cache to store all the visited experiences and samples a batch of previous experiences from the cache to update the two networks. The replay mechanism reuses previous experiences as if they were newly visited and thus greatly improves the data efficiency.
Actually, not all experiences should be treated equally. If an experience brings only a minor change to the weights of the networks, it does not deserve to be replayed, since the networks have already learned the implicit pattern contained in the experience, whereas a 'wrong' experience should be replayed frequently. Inspired by equation (19), prioritized experience replay adopts the 'error' of the TD algorithm as the priority of an experience, which is given by

\mathrm{PRI}_k = \left| c(s_k, u_k) + \gamma Q(s_{k+1}, \mu(s_{k+1}|\theta)\,|\,\omega) - Q(s_k, u_k|\omega) \right|.    (23)

The priority measures how probable it is that the experience is sampled from the cache. Therefore, an experience with higher priority makes a larger difference to the weights of the evaluation network and is replayed with greater probability.
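For concreteness, a minimal proportional sampling cache of this kind can be sketched in Python as follows. The capacity and the small constant added to each priority (so that no experience gets zero sampling probability) are implementation choices of this sketch rather than settings reported in the paper.

```python
import numpy as np

class PrioritizedReplay:
    """A minimal proportional prioritized replay cache: the priority of a transition is
    its absolute TD error (23), and the sampling probability is proportional to it."""

    def __init__(self, capacity=10000, eps=1e-3):
        self.capacity, self.eps = capacity, eps
        self.data, self.priorities = [], []

    def push(self, transition, td_error):
        if len(self.data) >= self.capacity:      # drop the oldest experience
            self.data.pop(0)
            self.priorities.pop(0)
        self.data.append(transition)
        self.priorities.append(abs(td_error) + self.eps)

    def sample(self, batch_size):
        p = np.asarray(self.priorities)
        p = p / p.sum()
        idx = np.random.choice(len(self.data), size=batch_size, p=p, replace=True)
        return idx, [self.data[i] for i in idx]

    def update_priorities(self, idx, td_errors):
        for i, e in zip(idx, td_errors):
            self.priorities[i] = abs(e) + self.eps

cache = PrioritizedReplay(capacity=100)
for k in range(5):
    cache.push(("s%d" % k, "u", 0.1, "s%d'" % k), td_error=np.random.randn())
idx, batch = cache.sample(batch_size=3)
```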
To sum up, we propose an algorithm called neural-network-based deterministic policy gradient (NNDPG) by combining the above techniques. A more detailed procedure of NNDPG is given in Algorithm 1.
At the end of this section, it is necessary to discuss the advantages of NNDPG. Firstly, NNDPG does not require any knowledge about the AUV model but can still find control policies whose performance is competitive with that of controllers designed with the exact dynamics of the system. What is more, it greatly improves the data efficiency and the performance through prioritized experience replay, which is used for the first time in an RL controller for the control problems of the AUV.
Algorithm 1 NNDPG for the depth control of the AUV
Input: Target depth z_r, number of episodes M, number of steps for each episode T, batch size N, learning rates for the evaluation and policy networks α_ω and α_θ.
Output: Depth control policy u = µ(s|θ).
1: Initialize the evaluation network Q(s, u|ω) and the policy network µ(s|θ). Initialize the experience replay cache R.
2: for i = 1 to M do
3:   Reset the initial state s_0.
4:   for t = 0 to T do
5:     Generate a control input u_t = µ(s_t|θ) + ∆u_t with the current policy network and exploration noise ∆u_t.
6:     Execute the control u_t and obtain s_{t+1}. Calculate the one-step cost c_t according to (5) or (9).
7:     Push the transition tuple (s_t, u_t, c_{t+1}, s_{t+1}) into R and calculate its priority according to (23).
8:     Sample a minibatch of N transitions {(s_i, u_i, c_{i+1}, s_{i+1}) | 1 ≤ i ≤ N} from R according to their priorities.
9:     Update the evaluation network and refresh the priorities of the samples:
         y_i = c_{i+1} + γ Q(s_{i+1}, µ(s_{i+1}|θ)|ω)
         δ_i = y_i − Q(s_i, u_i|ω)
         ω ← ω − (1/N) α_ω Σ_{i=1}^{N} δ_i ∇_ω Q(s_i, u_i|ω)
10:    Compute ∇_{u_i} Q(s_i, u_i|ω) for each sample.
11:    Update the policy network by
         θ ← θ − (1/N) α_θ Σ_{i=1}^{N} ∇_θ µ(s_i|θ) ∇_{u_i} Q(s_i, u_i|ω)
12:   end for
13: end for

VI. PERFORMANCE STUDY
A. AUV Dynamics for Simulation
In this subsection we present a set of explicit dynamical equations of a classical six-DOF REMUS AUV model [25], which is utilized to validate our algorithm. However, the experiments can be extended easily to other AUV models since our algorithm is a model-free method.
As mentioned, we only consider the motions of the AUV in the x-z plane and assume a constant surge speed u_0. The simplified dynamical equations are given by

m(\dot{w} - uq - x_G\dot{q} - z_G q^2) = Z_{\dot{q}}\dot{q} + Z_{\dot{w}}\dot{w} + Z_{uq}uq + Z_{uw}uw + Z_{ww}w|w| + Z_{qq}q|q| + (W - B)\cos\theta + \tau_1 + \Delta\tau_1    (24a)

I_{yy}\dot{q} + m[x_G(uq - \dot{w}) + z_G wq] = M_{\dot{q}}\dot{q} + M_{\dot{w}}\dot{w} + M_{uq}uq + M_{uw}uw + M_{ww}w|w| + M_{qq}q|q| - (x_G W - x_B B)\cos\theta - (z_G W - z_B B)\sin\theta + \tau_2 + \Delta\tau_2    (24b)

\dot{z} = w\cos\theta - u\sin\theta    (24c)

\dot{\theta} = q    (24d)
where [x_G, y_G, z_G]^T and [x_B, y_B, z_B]^T are respectively the centers of gravity and buoyancy; Z_q̇, Z_ẇ, M_q̇ and M_ẇ denote the added masses; Z_uq, Z_uw, M_uq, M_uw denote the body lift force and moment coefficients; Z_ww, Z_qq, M_ww, M_qq the cross-flow drag coefficients; and W and B represent the vehicle's weight and buoyancy. The bounded control inputs τ_1 and τ_2 are the propeller thrusts and torques, with disturbances ∆τ_1 and ∆τ_2 caused by the unstable underwater environment. The values of the hydrodynamic coefficients are presented in Table I.

TABLE I
THE HYDRODYNAMIC PARAMETERS OF THE REMUS AUV

Coefficient   Value     Coefficient   Value
m             30.51     I_yy          3.45
M_q̇           -4.88     M_ww          3.18
M_ẇ           -1.93     M_qq          -188
M_uq          -2        M_uw          24
Z_q̇           -1.93     Z_qq          -0.632
Z_ẇ           -35.5     Z_ww          -131
Z_uq          -28.6     Z_wq          -5.22
B. Linear Quadratic Gaussian Integral Control
We compare two model-based controllers with the state-feedback controller learned by NNDPG. The first is the linear quadratic Gaussian integral (LQI) controller [26] derived from a linearized AUV model.
The nonlinear model of the AUV (24a)-(24d) can be linearized through the SIMULINK linearization mode at a steady state [5]:

w = w_0 + w', \quad q = q_0 + q', \quad z = z_0 + z', \quad \theta = \theta_0 + \theta'    (25)

where [w', q', z', θ']^T denote the small linearization errors and the steady-state point is set to [0, 0, 2.0, 0]^T. The linearized AUV model is derived directly as

\dot{\chi} = A\chi + Bu + \xi, \qquad y = C\chi    (26)

where the coefficient matrices A, B, C are given by
A = \begin{bmatrix} -1.0421 & 0.7856 & 0 & 0.0207 \\ 6.0038 & -0.6624 & 0 & -0.7083 \\ 1.0000 & 0 & 0 & -2.0000 \\ 0 & 1.0000 & 0 & 0 \end{bmatrix}, \quad
B = \begin{bmatrix} 0.0153 & 0.0035 \\ -0.0035 & 0.1209 \\ 0 & 0 \\ 0 & 0 \end{bmatrix}, \quad
C = \begin{bmatrix} 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}    (27)

and the output y = [z, θ]^T.
Fig. 8. The structure diagram of LQI.
Since the state and output variables are all measurable, an LQI controller is designed to solve the depth control problem for the linearized AUV model, as shown in Fig. 8. A feedback controller is designed as

u = K[\chi, \epsilon]^T = K_\chi \chi + K_\epsilon \epsilon    (28)

where ε is the output of the integrator

\epsilon(t) = \int_0^t (y_r - y)\, dt.    (29)

The gain matrix K is obtained by solving an algebraic Riccati equation, which is derived from the minimization of the following cost function:

J(u) = \int_0^t \left( \begin{bmatrix} \chi \\ \epsilon \end{bmatrix}^T Q \begin{bmatrix} \chi \\ \epsilon \end{bmatrix} + u^T R u \right) dt.    (30)
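As an illustration of how a gain of the form (28) can be computed, the sketch below augments the linearized model (26)-(27) with the integrator state (29) and solves the continuous-time algebraic Riccati equation with SciPy. The weighting matrices Q and R are placeholders, and the construction is a standard one rather than the exact setup of the Matlab toolbox used in the paper.

```python
import numpy as np
from scipy.linalg import solve_continuous_are

# Linearized AUV model (26)-(27), state chi = [w, q, z, theta].
A = np.array([[-1.0421,  0.7856, 0.0,  0.0207],
              [ 6.0038, -0.6624, 0.0, -0.7083],
              [ 1.0000,  0.0,    0.0, -2.0000],
              [ 0.0,     1.0000, 0.0,  0.0   ]])
B = np.array([[ 0.0153, 0.0035],
              [-0.0035, 0.1209],
              [ 0.0,    0.0   ],
              [ 0.0,    0.0   ]])
C = np.array([[0.0, 0.0, 1.0, 0.0],
              [0.0, 0.0, 0.0, 1.0]])

# Augment with the integrator state: eps_dot = y_r - y = y_r - C chi, see (29).
n, m, p = A.shape[0], B.shape[1], C.shape[0]
A_aug = np.block([[A,  np.zeros((n, p))],
                  [-C, np.zeros((p, p))]])
B_aug = np.vstack([B, np.zeros((p, m))])

Q = np.diag([1.0, 1.0, 10.0, 10.0, 5.0, 5.0])   # illustrative state/integrator weights
R = 0.1 * np.eye(m)                              # illustrative control weight

P = solve_continuous_are(A_aug, B_aug, Q, R)
K = np.linalg.solve(R, B_aug.T @ P)              # feedback gain, up to the sign convention of (28)
print(K)
```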
C. Nonlinear Model Predictive Control
The LQI controller is designed based on a linearized AUV model, which is an approximation of the original nonlinear model. Therefore, we also adopt a nonlinear controller derived from nonlinear model predictive control (NMPC) under the exact nonlinear AUV dynamics. NMPC designs an N-step cumulative cost function [14]

J_k = \frac{1}{2} x_{k+N}^T P_0 x_{k+N} + \frac{1}{2} \sum_{i=0}^{N-1} \left( x_{k+i}^T Q x_{k+i} + u_{k+i}^T R u_{k+i} \right).    (31)

At each time step k, NMPC computes an optimal N-step control sequence {u_k, u_{k+1}, . . . , u_{k+N−1}} by solving the optimization problem

\min_{\{u_k, u_{k+1}, \ldots, u_{k+N-1}\}} J_k \quad \text{s.t.} \quad x_{i+1} = f(x_i, u_i), \quad i = k, \ldots, k+N-1,    (32)

where the number of predicted steps N is also called the prediction horizon. NMPC solves the N-step optimization problem by iterating two processes alternately. The forward process executes the system equation with a candidate control sequence to find the predicted state sequence {x_{k+i}}. The backward process finds the Lagrange multipliers that eliminate the partial derivatives of J_k with respect to the state sequence, and then updates the control sequence along the gradient direction. The two processes are repeated until the desired accuracy is reached.
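The sketch below illustrates the receding-horizon structure with the cost (31) on a toy model. For brevity it estimates the gradient of J_k by finite differences instead of the adjoint (Lagrange-multiplier) backward pass described above, so it is only a simplified stand-in for the NMPC solver used in the paper.

```python
import numpy as np

def nmpc_control(f, x0, N, Q, R, P0, iters=50, lr=1e-2, fd_eps=1e-5):
    """A simplified shooting NMPC for cost (31): rolls the model forward from x0 under a
    candidate control sequence and improves the sequence by gradient descent.  A numerical
    finite-difference gradient replaces the adjoint backward pass, purely for illustration."""
    dim_u = R.shape[0]
    U = np.zeros((N, dim_u))                       # candidate control sequence

    def rollout_cost(U):
        x, J = x0, 0.0
        for u in U:                                # forward process: predict the states
            J += 0.5 * (x @ Q @ x + u @ R @ u)
            x = f(x, u)
        return J + 0.5 * x @ P0 @ x                # terminal cost

    for _ in range(iters):
        grad = np.zeros_like(U)
        base = rollout_cost(U)
        for i in range(N):                         # finite-difference gradient w.r.t. each u_i
            for j in range(dim_u):
                U_p = U.copy(); U_p[i, j] += fd_eps
                grad[i, j] = (rollout_cost(U_p) - base) / fd_eps
        U -= lr * grad                             # update the control sequence
    return U[0]                                    # apply only the first control (receding horizon)

# Toy usage with a hypothetical discrete-time model x_{k+1} = f(x_k, u_k).
f = lambda x, u: x + 0.1 * np.array([x[1], -0.5 * x[1] + u[0]])
u0 = nmpc_control(f, x0=np.array([1.0, 0.0]), N=10,
                  Q=np.eye(2), R=0.1 * np.eye(1), P0=np.eye(2))
print(u0)
```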
D. Experiment Settings
In this subsection, we introduce the experiment settings for the simulations.
The LQI and NMPC controllers are implemented on the Matlab R2017a platform using the control system and model predictive control toolboxes. As mentioned, the AUV model is linearized by SIMULINK from the S-function of the AUV dynamics (24a)-(24d). NNDPG is implemented in Python 2.7 on a Linux system using Google's open-source module TensorFlow.
It should be noted that all the controllers and models are implemented as discrete-time versions with sample time dt = 0.1 s, although some of them are described in continuous time in the previous sections. For example, the general AUV dynamics (1) is discretized using the forward Euler formula

\chi_{k+1} = \chi_k + dt \cdot f(\chi_k, u_k).    (33)

The sample horizon is set to T = 100 s.
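A minimal sketch of this discretization in Python is given below; the dynamics f, the controller, and the zero disturbance passed to f are placeholders standing in for the REMUS model (24a)-(24d) and the learned policy.

```python
import numpy as np

DT = 0.1           # sample time used in the simulations (s)
T_HORIZON = 100.0  # sample horizon (s)

def euler_step(f, chi, u, xi, dt=DT):
    """Forward Euler discretization (33) of the continuous dynamics chi_dot = f(chi, u, xi)."""
    return chi + dt * f(chi, u, xi)

def simulate(f, controller, chi0, steps=int(T_HORIZON / DT)):
    """Roll out a closed-loop trajectory; `controller` maps the coordinates to a control input.
    Here the disturbance is set to zero for simplicity."""
    traj, chi = [chi0], chi0
    for _ in range(steps):
        u = controller(chi)
        chi = euler_step(f, chi, u, xi=np.zeros_like(chi))
        traj.append(chi)
    return np.array(traj)
```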
The disturbance term ξ is generated by an Ornstein-Uhlenbeck process [27]

\xi_{k+1} = \beta(\mu - \xi_k) + \sigma\varepsilon    (34)

where ε is a noise term following the standard normal distribution, and the other parameters are set as µ = 0, β = 0.15, σ = 0.3. Note that this is a temporally correlated random process.
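The recursion (34), with the parameter values above, can be generated as in the following sketch, which implements the update exactly as written.

```python
import numpy as np

def ou_noise(steps, mu=0.0, beta=0.15, sigma=0.3, xi0=0.0):
    """Generate the disturbance sequence with recursion (34):
    xi_{k+1} = beta * (mu - xi_k) + sigma * eps_k, with eps_k ~ N(0, 1)."""
    xi = np.empty(steps)
    xi[0] = xi0
    for k in range(steps - 1):
        xi[k + 1] = beta * (mu - xi[k]) + sigma * np.random.randn()
    return xi

disturbance = ou_noise(steps=1000)   # one temporally correlated noise trace per run
```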
E. Simulation Results for Constant Depth Control
Firstly, we compare the performance of NNDPG with those of LQI and NMPC on the constant depth control problem. Fig. 9 shows the tracking behaviors of the three controllers from an initial depth z_0 = 2.0 m to a target depth z_r = 8.0 m. We use three indices, steady-state error (SSE), overshoot and response time (RT), to evaluate the performance of a controller; their exact values are presented in Table II.
We find that LQI behaves the worst among the three controllers. This result illustrates that the performance of a model-based controller deteriorates under an inexact model.
In addition, the simulation shows that the performance of NNDPG is comparable with that of NMPC based on a perfect nonlinear AUV model, and it even beats the latter with a faster convergence speed and a smaller overshoot (bold values in Table II). This illustrates the effectiveness of the proposed algorithm under an unknown AUV model.
Fig. 10 shows the control sequences of the three controllers. The control policy learned by NNDPG changes more sharply than those of the other controllers. We attribute this phenomenon to the approximation error of the neural network. LQI and NMPC can obtain a smoother control law since they have access to the dynamical equations of the AUV, whereas in NNDPG a neural network (the policy network) is used to generate the control sequence. Therefore, the fluctuation can be regarded as a compensation for the unknown dynamical model.
Fig. 9. The tracking behaviors of the three controllers for the constant depth control problem, where z_r = 8.0 m and θ_r = 0.0 rad. Figures (a) and (b) illustrate respectively the tracking trajectories of the depth z and the pitch angle θ.

TABLE II
THE PERFORMANCE INDEXES OF THE THREE CONTROLLERS FOR THE CONSTANT DEPTH CONTROL

Controllers      LQI       NMPC     NNDPG
SSE(z)           0.0436    0.0094   0.0191
Overshoot(z)     3.0849    0.6772   0.1945
RT(z)            42.5      7.8      7.3
SSE(θ)           0.0158    0.0065   0.0108
RT(θ)            46.5      12.2     11.0

To validate the improved data efficiency of the prioritized experience replay, we compare its performance with that of the original experience replay through the convergence of the total reward, shown in Fig. 11. We find that NNDPG with prioritized experience replay takes fewer steps to converge than the version with the original experience replay, since the former replays previous experiences in a more efficient way.
Fig. 10. The control sequences of the three controllers for the constant depth control problem. Figures (a) and (b) illustrate respectively the trajectories of τ_1 and τ_2.
Fig. 11. Comparison of NNDPG with prioritized experience replay and with the original experience replay.

F. Simulation Results for Curved Depth Control
In this subsection, the AUV is controlled to track a curved depth trajectory. We set the tracked trajectory as the sinusoidal function z_r = z_{r0} − sin(π/50 · x), where z_{r0} = 10 m. At first, we assume that NNDPG has the tendency information about the trajectory, including the slope angle and its derivative, as in the curved depth control setting studied in Section III. Then we validate the algorithm in the situation where the slope angle is not measurable. Instead, a preceding historical sequence of the measured relative depths [∆z_{t−N}, ∆z_{t−N+1}, . . . , ∆z_{t−1}, ∆z_t] is provided, where the length of the sequence is called the window size. The tracking errors and trajectories are shown in Fig. 12 and Fig. 13, where NNDPG-PI denotes the NNDPG algorithm with the information about the slope angle, and NNDPG-WIN-X denotes NNDPG provided with the X most recently measured relative depths.
Fig. 12. The tracking errors of the curved depth control for LQI and different versions of NNDPG.
Fig. 13. The tracking trajectories of the curved depth control for LQI and different versions of NNDPG.
We can observe that the performance of NNDPG-PI is the best, while NNDPG-WIN-1, which includes only the current relative depth, performs much worse. This validates that the state designed for the constant depth control problem becomes partially observable in the curved depth setting. However, when we add the two most recently measured relative depths to the state (NNDPG-WIN-3), the tracking error is much reduced. The improvement can be explained by the implicit tendency information contained in the latest measurements.
Fig. 14. The boxplot of the long-term cost for different window sizes.
To determine the best choice of the window size, we list and compare the performances of NNDPG with different window sizes from 1 to 9. The performances are evaluated according to the long-term cost of one experiment

J = \sum_{k=1}^{T} \gamma^{k-1} c(s_k, u_k).    (35)

Because there are disturbances in the AUV dynamics, we run ten experiments for each window size. The results are presented in Fig. 14 as a boxplot to show the distributions of the performances.
It is clear that supplementing the state with past measured relative depths does compensate for the missing tendency information of the desired depth trajectory. However, this does not mean that more is better. Actually, records from too many steps back are shown to deteriorate the performance because they may disturb the learned policy. From the plot we find that the best value of the window size is 3, which corresponds to the minimal mean and variance.
G. Simulation Results of Realistic Seafloor Tracking
Finally, we test the proposed RL framework for tracking a real seafloor. The data set, sampled from the seafloor of the South China Sea at (23°06′N, 120°07′E), is provided by the Shenyang Institute of Automation, Chinese Academy of Sciences. We sample the depths along a preset path and obtain a 2D distance-depth seafloor curve shown in Fig. 15. Our goal is to control the AUV to track the seafloor curve and keep a constant safe distance simultaneously. The tracking trajectory obtained by the control policy learned by NNDPG-WIN-3 is shown in Fig. 16. It illustrates that the AUV is controlled to track the rapidly changing seafloor curve with satisfactory performance and tracking error.
Fig. 15. The process of sampling the depth data along a preset path.
Fig. 16. The tracking trajectory of NNDPG-WIN-3 and the realistic seafloor.

VII. CONCLUSION
This paper has proposed a model-free reinforcement learning framework for the depth control problems of AUVs in discrete time. Three different depth control problems were studied and modeled as Markov decision processes with appropriate forms of the state and one-step cost function. A reinforcement learning algorithm, NNDPG, was proposed to learn a state-feedback controller represented by a neural network called the policy network. The weights of the policy network were updated by an approximate policy gradient calculated based on the deterministic policy gradient theorem, while another evaluation network was used to estimate the state-action value function and was updated by the temporal difference algorithm. The alternating updates of the two networks composed one iteration step of NNDPG. To improve the convergence, prioritized experience replay was proposed to replay previous experiences to train the networks.
We tested the proposed model-free RL framework on a classical REMUS AUV model and compared its performance with those of two model-based controllers. The results showed that the performance of the policy found by NNDPG can compete with those of the controllers designed with the exact dynamics of the AUV. In addition, we found that the observability of the chosen state influenced the performance and that recent history could be added to improve the performance.
In the future, we will verify the proposed model-free RL framework on a real underwater vehicle, a deep-sea controllable visual sampler (DCVS) operating at 6000 meters below sea level.
REFERENCES
[1] A. Kenny, I. Cato, M. Desprez, G. Fader, R. Schuttenhelm, and J. Side, "An overview of seabed-mapping technologies in the context of marine habitat classification," ICES Journal of Marine Science: Journal du Conseil, vol. 60, no. 2, pp. 411–418, 2003.
[2] J. A. Farrell, S. Pang, and W. Li, "Chemical plume tracing via an autonomous underwater vehicle," IEEE Journal of Oceanic Engineering, vol. 30, no. 2, pp. 428–442, 2005.
[3] Z. Liu, P. Smith, T. Park, A. A. Trindade, and Q. Hui, "Automated contaminant source localization in spatio-temporal fields: A response surface and experimental design approach," IEEE Transactions on Systems, Man, and Cybernetics: Systems, vol. 47, no. 3, pp. 569–583, 2017.
[4] K.-P. Lindegaard, "Acceleration feedback in dynamic positioning," Ph.D. dissertation, Dept. Eng. Cybern., Norwegian Univ. Sci. Technol., Trondheim, Norway, 2003.
[5] M. H. Khodayari and S. Balochian, "Modeling and control of autonomous underwater vehicle (AUV) in heading and depth attitude via self-adaptive fuzzy PID controller," Journal of Marine Science and Technology, vol. 20, no. 3, pp. 559–578, 2015.
[6] H. J. Wang, Z. Y. Chen, H. M. Jia, and X. H. Chen, "NN-backstepping for diving control of an underactuated AUV," in Oceans, 2011, pp. 1–6.
[7] D. Zhu, Y. Zhao, and M. Yan, "A bio-inspired neurodynamics-based backstepping path-following control of an AUV with ocean current," International Journal of Robotics and Automation, vol. 27, no. 3, p. 298, 2012.
[8] M. C. Fang and J. H. Luo, "On the track keeping and roll reduction of the ship in random waves using different sliding mode controllers," Ocean Engineering, vol. 34, no. 3, pp. 479–488, 2007.
[9] S. Wen, M. Z. Chen, Z. Zeng, X. Yu, and T. Huang, "Fuzzy control for uncertain vehicle active suspension systems via dynamic sliding-mode approach," IEEE Transactions on Systems, Man, and Cybernetics: Systems, vol. 47, no. 1, pp. 24–32, 2017.
[10] X. Ai, K. You, and S. Song, "A source-seeking strategy for an autonomous underwater vehicle via on-line field estimation," in Control, Automation, Robotics and Vision (ICARCV), 2016 14th International Conference on. IEEE, 2016, pp. 1–6.
[11] Z. Li, J. Deng, R. Lu, Y. Xu, J. Bai, and C.-Y. Su, "Trajectory-tracking control of mobile robot systems incorporating neural-dynamic optimized model predictive approach," IEEE Transactions on Systems, Man, and Cybernetics: Systems, vol. 46, no. 6, pp. 740–749, 2016.
[12] T. Fossen, A. Loria, A. Teel et al., "A theorem for UGAS and ULES of (passive) nonautonomous systems: Robust control of mechanical systems and ships," International Journal of Robust and Nonlinear Control, vol. 11, no. 2, pp. 95–108, 2001.
[13] A. Budiyono, "Model predictive control for autonomous underwater vehicle," Indian Journal of Geo-Marine Sciences, vol. 40, no. 2, pp. 191–199, 2010.
[14] G. J. Sutton and R. R. Bitmead, "Performance and computational implementation of nonlinear model predictive control on a submarine," Nonlinear Model Predictive Control, pp. 461–472, 2000.
[15] A. Konar, I. G. Chakraborty, S. J. Singh, L. C. Jain, and A. K. Nagar, "A deterministic improved Q-learning for path planning of a mobile robot," IEEE Transactions on Systems, Man, and Cybernetics: Systems, vol. 43, no. 5, pp. 1141–1153, 2013.
[16] P. Rakshit, A. Konar, P. Bhowmik, I. Goswami, S. Das, L. C. Jain, and A. K. Nagar, "Realization of an adaptive memetic algorithm using differential evolution and Q-learning: A case study in multirobot path planning," IEEE Transactions on Systems, Man, and Cybernetics: Systems, vol. 43, no. 4, pp. 814–831, 2013.
[17] M. Riedmiller, T. Gabel, R. Hafner, and S. Lange, "Reinforcement learning for robot soccer," Autonomous Robots, vol. 27, no. 1, pp. 55–73, 2009.
[18] C. Zhou and Q. Meng, "Dynamic balance of a biped robot using fuzzy reinforcement learning agents," Fuzzy Sets and Systems, vol. 134, no. 1, pp. 169–187, 2003.
[19] B. Schölkopf, J. Platt, and T. Hofmann, "An application of reinforcement learning to aerobatic helicopter flight," in Advances in Neural Information Processing Systems 19, Proceedings of the Twentieth Conference on Neural Information Processing Systems, Vancouver, British Columbia, Canada, December 2006, pp. 1–8.
[20] A. Nowé, P. Vrancx, and Y. De Hauwere, "Reinforcement learning: State-of-the-art," 2012.
[21] R. S. Sutton and A. G. Barto, "Reinforcement learning: An introduction," IEEE Transactions on Neural Networks, vol. 9, no. 5, p. 1054, 1998.
[22] D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller, "Deterministic policy gradient algorithms," in International Conference on Machine Learning, 2014, pp. 387–395.
[23] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, "Continuous control with deep reinforcement learning," Computer Science, vol. 8, no. 6, p. A187, 2015.
[24] L. J. Lin, "Self-improving reactive agents based on reinforcement learning, planning and teaching," Machine Learning, vol. 8, no. 3, pp. 293–321, 1992.
[25] H. Pan and M. Xin, "Depth control of autonomous underwater vehicles using indirect robust control method," in American Control Conference, 2012, pp. 98–113.
[26] P. C. Young and J. Willems, "An approach to the linear multivariable servomechanism problem," International Journal of Control, vol. 15, no. 5, pp. 961–979, 1972.
[27] P. Mazur, "On the theory of Brownian motion," Physica, vol. 25, no. 16, pp. 149–162, 1959.