Online Approximate Optimal Station Keeping of an Autonomous Underwater Vehicle

Patrick Walters and Warren E. Dixon

Patrick Walters and Warren E. Dixon are with the Department of Mechanical and Aerospace Engineering, University of Florida, Gainesville, FL, USA. Email: {walters8, wdixon}@ufl.edu. This research is supported in part by NSF award numbers 0901491, 1161260, 1217908, ONR grant number N00014-13-1-0151, and a contract with the AFRL Mathematical Modeling and Optimization Institute. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the sponsoring agency.

Abstract—Online approximation of an optimal station keeping strategy for a fully actuated six degrees-of-freedom autonomous underwater vehicle is considered. The developed controller is an approximation of the solution to a two player zero-sum game in which the controller is the minimizing player and an external disturbance is the maximizing player. The solution is approximated using a reinforcement learning-based actor-critic framework. The result guarantees uniformly ultimately bounded (UUB) convergence of the states and UUB convergence of the approximated policies to the optimal policies without the requirement of persistence of excitation.

I. INTRODUCTION

Autonomous underwater vehicles (AUVs) play an increasingly important role in commercial and military objectives. The operational tasks of AUVs vary, including inspection, monitoring, exploration, and surveillance [1]. During a mission, an AUV may be required to remain on station for an extended period of time, e.g., as a communication link for multiple vehicles, or for persistent environmental monitoring of a specific area. The success of the mission could rely on the vehicle's ability to hold a precise station (e.g., station keeping near underwater structures and features) while maximizing its time on station. Energy expended for propulsion is tightly coupled to the endurance of AUVs [2], especially when station keeping in extreme environments with strong currents or high seas. Therefore, by reducing the energy expended for extended station keeping, the time on station can be maximized.

Precise station keeping of an AUV is challenging because of the six degree-of-freedom (DOF) nonlinear dynamics of the vehicle and unmodeled environmental disturbances, such as surface effects and ocean currents. Common approaches to the control of an underwater vehicle include robust and adaptive control methods [3]–[6]. These methods provide robustness to disturbances or model uncertainty; however, they do not explicitly attempt to reduce the energy expended by propulsion. This motivates the use of control methods in which an optimal control policy can be selected to satisfy a performance criterion, such as the infinite-horizon quadratic performance criterion used to develop optimal policies that minimize the square of the total control effort (energy expended) and state error (accuracy) [7]. Because of the difficulties associated with finding closed-form analytical solutions to optimal control problems for nonlinear systems, previous results in the literature have linearized the AUV model, developing $H_\infty$ control policies [8], [9] and a model-based predictive control policy [10].
Considering the nonlinear AUV dynamics, [11] numerically approximates the solution to the Hamilton-Jacobi-Bellman equation using an iterative application of Galerkin's method, and [12] develops a PID-based $H_\infty$ control strategy.

Reinforcement learning-based (RL-based) methods have recently been used to approximate solutions to optimal control problems [13]–[16]. The online RL-based actor-critic frameworks in [15] and [16] approximate the value function and optimal policies of an optimal control problem using the so-called Bellman error. The actor-critic approximation provides uniformly ultimately bounded (UUB) convergence of the state to the origin and UUB convergence of the approximate policy to the optimal policy. These methods require persistence of excitation (PE) to ensure convergence to the optimal policy, which is undesirable for an operational underwater vehicle.

In this result, a two player zero-sum differential game is developed in which the controller is the minimizing player and an external disturbance is the maximizing player. The performance criterion for the two player game is taken from the $H_\infty$ control problem. This performance criterion captures the desire to develop an optimal policy similar to the infinite-horizon quadratic performance criterion, while still attenuating the unmodeled disturbances. The developed controller differs from results such as [14] and [15] in that it removes the PE requirement through the addition of concurrent learning to the value function's adaptive update law. As outlined in [17], concurrent learning uses the additional knowledge of recorded data to remove the PE requirement. Due to the unique structure of the actor-critic framework, the recorded data is replaced with sampled data points at the current time instant.

This paper presents a novel approach to station keeping of a fully actuated 6 DOF AUV, which is robust to unmodeled environmental disturbances, using the actor-critic framework to approximate a solution to the two player zero-sum game without the need for PE. A Lyapunov-based stability analysis is presented which guarantees UUB convergence of the states and UUB convergence of the approximated policies to the optimal policies.

II. VEHICLE MODEL

Consider the nonlinear equations of motion for an underwater vehicle with the addition of an unknown additive disturbance given by [18]

$$\dot{\eta} = J(\eta)\nu, \qquad (1)$$
$$M\dot{\nu} + C(\nu)\nu + D(\nu)\nu + g(\eta) = \tau_b + \tau_d, \qquad (2)$$

where $\nu \in \mathbb{R}^6$ is the body-fixed translational and angular velocity vector, $\eta \in \mathbb{R}^6$ is the earth-fixed position and orientation vector, $J : \mathbb{R}^6 \to \mathbb{R}^{6\times 6}$ is the coordinate transformation between the body-fixed and earth-fixed coordinates, $M \in \mathbb{R}^{6\times 6}$ is the inertia matrix including added mass, $C : \mathbb{R}^6 \to \mathbb{R}^{6\times 6}$ is the centripetal and Coriolis matrix, $D : \mathbb{R}^6 \to \mathbb{R}^{6\times 6}$ is the hydrodynamic damping and friction matrix, $g : \mathbb{R}^6 \to \mathbb{R}^6$ is the gravitational and buoyancy force and moment vector, $\tau_d \in \mathbb{R}^6$ is the unknown disturbance (e.g., ocean currents and surface effects), and $\tau_b \in \mathbb{R}^6$ is the body-fixed force and moment control input. The state vectors in (1) are further defined as

$$\eta = [\,x \;\; y \;\; z \;\; \phi \;\; \theta \;\; \psi\,]^T, \qquad \nu = [\,u \;\; v \;\; w \;\; p \;\; q \;\; r\,]^T,$$

where $x, y, z \in \mathbb{R}$ are the earth-fixed position vector components of the center of mass, $\phi, \theta, \psi \in \mathbb{R}$ represent the roll, pitch, and yaw angles, respectively, $u, v, w \in \mathbb{R}$ are the body-fixed translational velocities, and $p, q, r \in \mathbb{R}$ are the body-fixed angular velocities.
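To make the structure of (1)-(2) concrete, the following minimal Python sketch computes the body-fixed acceleration $\dot{\nu} = M^{-1}(\tau_b + \tau_d - C(\nu)\nu - D(\nu)\nu - g(\eta))$ for a notional vehicle. The diagonal inertia and damping matrices and the restoring-force model are illustrative placeholders, not parameters taken from the paper.

```python
import numpy as np

# Illustrative (not from the paper) rigid-body + added-mass inertia and damping.
M = np.diag([215.0, 265.0, 265.0, 40.0, 80.0, 80.0])      # kg, kg*m^2
D_lin = np.diag([70.0, 100.0, 100.0, 30.0, 50.0, 50.0])   # linear damping

def coriolis(nu):
    """Placeholder centripetal/Coriolis matrix C(nu); zero for this sketch."""
    return np.zeros((6, 6))

def restoring(eta):
    """Placeholder gravity/buoyancy vector g(eta) for a neutrally buoyant vehicle."""
    W_B = 15.0  # illustrative restoring-moment scale
    phi, theta = eta[3], eta[4]
    return np.array([0.0, 0.0, 0.0,
                     W_B * np.sin(phi), W_B * np.sin(theta), 0.0])

def nu_dot(eta, nu, tau_b, tau_d):
    """Body-fixed acceleration from (2): M*nu_dot + C*nu + D*nu + g = tau_b + tau_d."""
    rhs = tau_b + tau_d - coriolis(nu) @ nu - D_lin @ nu - restoring(eta)
    return np.linalg.solve(M, rhs)
```

In a simulation, this acceleration would be integrated together with the kinematics (1) to propagate the vehicle state.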
The coordinate transformation $J$ is defined as

$$J \triangleq \begin{bmatrix} J_1(\eta) & 0_{3\times 3} \\ 0_{3\times 3} & J_2(\eta) \end{bmatrix},$$

where $J_1 : \mathbb{R}^6 \to \mathbb{R}^{3\times 3}$ is the coordinate transformation between body-fixed and earth-fixed translational velocities, represented by the Z-Y-X Euler angle rotation matrix as

$$J_1 \triangleq \begin{bmatrix} c\psi c\theta & -s\psi c\phi + c\psi s\theta s\phi & s\psi s\phi + c\psi s\theta c\phi \\ s\psi c\theta & c\psi c\phi + s\psi s\theta s\phi & -c\psi s\phi + s\psi s\theta c\phi \\ -s\theta & c\theta s\phi & c\theta c\phi \end{bmatrix},$$

and $J_2 : \mathbb{R}^6 \to \mathbb{R}^{3\times 3}$ represents the coordinate transformation between body-fixed and earth-fixed angular velocities, defined as

$$J_2 \triangleq \begin{bmatrix} 1 & s\phi t\theta & c\phi t\theta \\ 0 & c\phi & -s\phi \\ 0 & s\phi/c\theta & c\phi/c\theta \end{bmatrix},$$

where $s\cdot$, $c\cdot$, $t\cdot$ denote $\sin(\cdot)$, $\cos(\cdot)$, $\tan(\cdot)$, respectively.

Assumption 1. A pitch of $\pm\frac{\pi}{2}$ rad is avoided.

For a vehicle with metacentric stability, Assumption 1 is easily satisfied [18]; therefore, the coordinate transformation $J$ and its inverse exist and are bounded.

By applying the kinematic relationship in (1) to (2), the vehicle dynamics can be expressed in the earth-fixed frame as [18]

$$\bar{M}(\eta)\ddot{\eta} + \bar{C}(\eta, \dot{\eta}, \nu)\dot{\eta} + \bar{D}(\eta, \nu)\dot{\eta} + \bar{g}(\eta) = \bar{\tau}_b + \bar{\tau}_d,$$

where $\bar{M} : \mathbb{R}^6 \to \mathbb{R}^{6\times 6}$, $\bar{C} : \mathbb{R}^6 \times \mathbb{R}^6 \times \mathbb{R}^6 \to \mathbb{R}^{6\times 6}$, $\bar{D} : \mathbb{R}^6 \times \mathbb{R}^6 \to \mathbb{R}^{6\times 6}$, and $\bar{g} : \mathbb{R}^6 \to \mathbb{R}^6$ are defined as $\bar{M} \triangleq J^{-T} M J^{-1}$, $\bar{C} \triangleq J^{-T}\left[C - M J^{-1}\dot{J}\right] J^{-1}$, $\bar{D} \triangleq J^{-T} D J^{-1}$, $\bar{g} \triangleq J^{-T} g$, $\bar{\tau}_b \triangleq J^{-T}\tau_b$, and $\bar{\tau}_d \triangleq J^{-T}\tau_d$.

Property 1. The transformed inertia matrix $\bar{M}$ is symmetric, positive definite [18], and satisfies

$$\underline{m}\|\xi_m\|^2 \le \xi_m^T \bar{M}(\eta)\xi_m \le \overline{m}(\eta)\|\xi_m\|^2, \quad \forall \xi_m, \eta \in \mathbb{R}^6,$$

where $\underline{m} \in \mathbb{R}$ is a positive known constant, and $\overline{m} : \mathbb{R}^6 \to [0, \infty)$ is a positive known function. The inverse $\bar{M}^{-1}$ satisfies

$$\frac{1}{\overline{m}(\eta)}\|\xi_m\|^2 \le \xi_m^T \bar{M}^{-1}(\eta)\xi_m \le \frac{1}{\underline{m}}\|\xi_m\|^2, \quad \forall \xi_m, \eta \in \mathbb{R}^6.$$

For the subsequent development, $\eta$ and $\nu$ are assumed to be measurable by sensors commonly used for underwater navigation, such as the ones found in [19]. The matrices $J$, $\bar{M}$, $\bar{C}$, $\bar{D}$, and $\bar{g}$ are assumed to be known, while the additive disturbance $\bar{\tau}_d$ is assumed to be unknown. The dynamics can be rewritten in the following control affine form:

$$\dot{\zeta} = f(\zeta) + g(\zeta)(u_1 + u_2),$$

where $\zeta \triangleq [\,\eta^T \;\; \dot{\eta}^T\,]^T \in \mathbb{R}^{12}$ is the state vector, $u_1 \triangleq \bar{\tau}_b$ and $u_2 \triangleq \bar{\tau}_d \in \mathbb{R}^6$ are the control vectors, and the functions $f : \mathbb{R}^{12} \to \mathbb{R}^{12}$ and $g : \mathbb{R}^{12} \to \mathbb{R}^{12\times 6}$ are locally Lipschitz and defined as

$$f \triangleq \begin{bmatrix} \dot{\eta} \\ -\bar{M}^{-1}\bar{C}\dot{\eta} - \bar{M}^{-1}\bar{D}\dot{\eta} - \bar{M}^{-1}\bar{g} \end{bmatrix}, \qquad g \triangleq \begin{bmatrix} 0 \\ \bar{M}^{-1} \end{bmatrix}. \qquad (3)$$

Assumption 2. The underwater vehicle is neutrally buoyant and the center of gravity is located vertically below the center of buoyancy on the z axis; hence, $f(0) = 0$ due to the form of $\bar{g}$ [18].

III. FORMULATION OF TWO PLAYER ZERO-SUM DIFFERENTIAL GAME

The performance index for the $H_\infty$ control problem is [20]

$$J_c(\zeta, u_1, u_2) = \int_t^{\infty} r(\zeta(\tau), u_1(\tau), u_2(\tau))\, d\tau, \qquad (4)$$

where $r : \mathbb{R}^{12} \times \mathbb{R}^6 \times \mathbb{R}^6 \to \mathbb{R}$ is the local cost defined as

$$r(\zeta, u_1, u_2) \triangleq \zeta^T Q \zeta + u_1^T R u_1 - \gamma^2 u_2^T u_2. \qquad (5)$$

In (5), $Q \in \mathbb{R}^{12\times 12}$ is a positive definite matrix, $R \in \mathbb{R}^{6\times 6}$ is a symmetric positive definite matrix, $u_1$ is the controller and the minimizing player, $u_2$ is the disturbance and the maximizing player, and $\gamma \ge \gamma^* > 0$, where $\gamma^*$ is the smallest value of $\gamma$ for which the system is stabilized [21]. The matrix $Q$ has the property $\underline{q}\|\xi_q\|^2 \le \xi_q^T Q \xi_q \le \overline{q}\|\xi_q\|^2, \; \forall \xi_q \in \mathbb{R}^{12}$, where $\underline{q}$ and $\overline{q}$ are positive constants.
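The kinematic transformation and the control-affine maps in (3) can be assembled numerically as sketched below. The helper names and the omission of the $\dot{J}$ term in $\bar{C}$ are simplifications for illustration; $M$, $C$, $D$, and $g$ are assumed to be supplied as already-evaluated matrices and vectors.

```python
import numpy as np

def J_matrix(eta):
    """Block-diagonal kinematic transformation J(eta) built from J1 (Z-Y-X Euler
    rotation) and J2 (Euler-rate map); assumes cos(theta) != 0 (Assumption 1)."""
    phi, theta, psi = eta[3], eta[4], eta[5]
    cphi, sphi = np.cos(phi), np.sin(phi)
    cth, sth, tth = np.cos(theta), np.sin(theta), np.tan(theta)
    cpsi, spsi = np.cos(psi), np.sin(psi)
    J1 = np.array([
        [cpsi*cth, -spsi*cphi + cpsi*sth*sphi,  spsi*sphi + cpsi*sth*cphi],
        [spsi*cth,  cpsi*cphi + spsi*sth*sphi, -cpsi*sphi + spsi*sth*cphi],
        [-sth,      cth*sphi,                   cth*cphi]])
    J2 = np.array([
        [1.0, sphi*tth,  cphi*tth],
        [0.0, cphi,     -sphi],
        [0.0, sphi/cth,  cphi/cth]])
    J = np.zeros((6, 6))
    J[:3, :3], J[3:, 3:] = J1, J2
    return J

def earth_frame_matrices(eta, M, C, D, g_vec):
    """Transformed terms Mbar, Cbar, Dbar, gbar; the -M J^{-1} Jdot contribution
    to Cbar is omitted in this sketch for brevity."""
    J = J_matrix(eta)
    J_inv = np.linalg.inv(J)
    Mbar = J_inv.T @ M @ J_inv
    Cbar = J_inv.T @ C @ J_inv
    Dbar = J_inv.T @ D @ J_inv
    gbar = J_inv.T @ g_vec
    return Mbar, Cbar, Dbar, gbar

def f_and_g(zeta, M, C, D, g_vec):
    """Control-affine drift f(zeta) and input map g(zeta) from (3)."""
    eta, eta_dot = zeta[:6], zeta[6:]
    Mbar, Cbar, Dbar, gbar = earth_frame_matrices(eta, M, C, D, g_vec)
    Mbar_inv = np.linalg.inv(Mbar)
    f = np.concatenate([eta_dot,
                        -Mbar_inv @ (Cbar @ eta_dot + Dbar @ eta_dot + gbar)])
    g = np.vstack([np.zeros((6, 6)), Mbar_inv])
    return f, g
```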
The infinite-time scalar value functional $V : \mathbb{R}^{12} \to [0, \infty)$ for the two player zero-sum game is written as

$$V = \min_{u_1} \max_{u_2} \int_t^{\infty} r(\zeta(\tau), u_1(\tau), u_2(\tau))\, d\tau.$$

A unique solution exists to the differential game if the Nash condition holds:

$$\min_{u_1} \max_{u_2} J_c(\zeta(0), u_1, u_2) = \max_{u_2} \min_{u_1} J_c(\zeta(0), u_1, u_2).$$

The objective of the optimal control problem is to find the optimal policies $u_1^*$ and $u_2^*$ that minimize the performance index (4) subject to the dynamic constraints in (3). Assuming that a minimizing policy exists and the value function is continuously differentiable, the Hamiltonian is defined as

$$H \triangleq r(\zeta, u_1^*, u_2^*) + \frac{\partial V}{\partial \zeta}\left(f + g(u_1^* + u_2^*)\right). \qquad (6)$$

The Hamilton-Jacobi-Isaacs (HJI) equation is given as [22]

$$0 = \frac{\partial V}{\partial t} + H, \qquad (7)$$

where $\frac{\partial V}{\partial t} \equiv 0$ since the value function is not an explicit function of time. After substituting (5) into (7), the optimal policies are given by [7]

$$u_1^* = -\frac{1}{2} R^{-1} g^T \left(\frac{\partial V}{\partial \zeta}\right)^T, \qquad (8)$$
$$u_2^* = \frac{1}{2\gamma^2} g^T \left(\frac{\partial V}{\partial \zeta}\right)^T. \qquad (9)$$

The analytical expressions for the optimal controllers in (8) and (9) require knowledge of the value function, which is the solution to the HJI equation in (7). A closed-form analytical solution to the HJI equation is generally infeasible; hence, the subsequent development seeks an approximate solution.

IV. APPROXIMATE SOLUTION

While various function approximation methods could be used, the subsequent development is based on the use of neural networks (NNs) to approximate the value function and optimal policies. The subsequent development is also based on a temporary assumption that the state lies on a compact set, i.e., $\zeta(t) \in \chi \subset \mathbb{R}^{12}, \; \forall t \in [0, \infty)$. This assumption is common in NN literature (cf. [23], [24]), and is relieved by the subsequent stability analysis (Remark 1). Specifically, a semi-global analysis indicates that if the initial state is bounded, then the entire state trajectory remains on a compact set.

Assumption 3. The value function can be represented by a single-layer NN with $m$ neurons as

$$V = W^T \sigma + \epsilon, \qquad (10)$$

where $W \in \mathbb{R}^m$ is the ideal weight vector bounded above by a known positive constant, $\sigma : \mathbb{R}^{12} \to \mathbb{R}^m$ is a bounded, continuously differentiable activation function, and $\epsilon : \mathbb{R}^{12} \to \mathbb{R}$ is the bounded, continuously differentiable function reconstruction error.

Using (8)-(10), the optimal policies can be represented as

$$u_1^* = -\frac{1}{2} R^{-1} g^T \left(\sigma'^T W + \epsilon'^T\right), \qquad (11)$$
$$u_2^* = \frac{1}{2\gamma^2} g^T \left(\sigma'^T W + \epsilon'^T\right). \qquad (12)$$

Based on (10)-(12), NN approximations of the value function and the optimal policies are defined as

$$\hat{V} = \hat{W}_c^T \sigma, \qquad (13)$$
$$\hat{u}_1 = -\frac{1}{2} R^{-1} g^T \sigma'^T \hat{W}_{a1}, \qquad (14)$$
$$\hat{u}_2 = \frac{1}{2\gamma^2} g^T \sigma'^T \hat{W}_{a2}, \qquad (15)$$

where $\hat{W}_c, \hat{W}_{a1}, \hat{W}_{a2} \in \mathbb{R}^m$ are estimates of the constant ideal weight vector $W$. The weight estimation errors are defined as $\tilde{W}_c \triangleq W - \hat{W}_c$, $\tilde{W}_{a1} \triangleq W - \hat{W}_{a1}$, and $\tilde{W}_{a2} \triangleq W - \hat{W}_{a2}$. Substituting (13)-(15) into (6), the approximate Hamiltonian is given by

$$\hat{H} = r(\zeta, \hat{u}_1, \hat{u}_2) + \frac{\partial \hat{V}}{\partial \zeta}\left(f + g(\hat{u}_1 + \hat{u}_2)\right). \qquad (16)$$

The error between the optimal and approximate Hamiltonian is called the Bellman error $\delta \in \mathbb{R}$, defined as

$$\delta \triangleq \hat{H} - H, \qquad (17)$$

where $H \equiv 0$. Therefore, the Bellman error can be written in a measurable form as

$$\delta = r(\zeta, \hat{u}_1, \hat{u}_2) + \hat{W}_c^T \omega,$$

where $\omega \triangleq \sigma'\left(f + g(\hat{u}_1 + \hat{u}_2)\right) \in \mathbb{R}^m$.
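The approximation step can be sketched as follows: given a chosen basis $\sigma(\zeta)$ and the current weight estimates, the snippet evaluates the approximate policies (14)-(15), the local cost (5), the regressor $\omega$, and the measurable Bellman error $\delta = r(\zeta, \hat{u}_1, \hat{u}_2) + \hat{W}_c^T \omega$. The quadratic basis, the numerical Jacobian, and the gains are illustrative assumptions, not choices made in the paper; $f$ and $g$ would be evaluated from the model, e.g., via the hypothetical f_and_g helper sketched above.

```python
import numpy as np

def sigma(zeta):
    """Illustrative polynomial basis: upper-triangular quadratic terms of zeta."""
    n = zeta.size
    return np.array([zeta[i] * zeta[j] for i in range(n) for j in range(i, n)])

def sigma_prime(zeta, eps=1e-6):
    """Numerical Jacobian of sigma (m x 12); a hand-coded Jacobian would be used in practice."""
    m = sigma(zeta).size
    jac = np.zeros((m, zeta.size))
    for k in range(zeta.size):
        dz = np.zeros_like(zeta)
        dz[k] = eps
        jac[:, k] = (sigma(zeta + dz) - sigma(zeta - dz)) / (2 * eps)
    return jac

def policies_and_bellman_error(zeta, W_c, W_a1, W_a2, f, g, Q, R, gamma):
    """Approximate policies (14)-(15) and measurable Bellman error from (16)-(17)."""
    sp = sigma_prime(zeta)
    u1 = -0.5 * np.linalg.solve(R, g.T @ sp.T @ W_a1)           # (14)
    u2 = (0.5 / gamma**2) * (g.T @ sp.T @ W_a2)                 # (15)
    r = zeta @ Q @ zeta + u1 @ R @ u1 - gamma**2 * (u2 @ u2)    # local cost (5)
    omega = sp @ (f + g @ (u1 + u2))                            # regressor
    delta = r + W_c @ omega                                     # Bellman error
    return u1, u2, delta, omega
```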
Assumption 4. There exists a set of sampled data points $\{\zeta_j \in \chi \mid j = 1, 2, \ldots, N\}$ such that, $\forall t \in [0, \infty)$,

$$\operatorname{rank}\left(\sum_{j=1}^{N} \frac{\omega_j \omega_j^T}{p_j}\right) = m, \qquad (18)$$

where $p_j \triangleq \sqrt{1 + \omega_j^T \omega_j}$ is the normalization constant and $\omega_j$ is the regressor evaluated at the specified data point $\zeta_j$.

The rank condition in (18) cannot be guaranteed to hold a priori. However, heuristically, the condition can be met by sampling redundant data, i.e., $N \gg m$. Based on Assumption 4, it can be shown that $\sum_{j=1}^{N} \frac{\omega_j \omega_j^T}{p_j} > 0$, such that

$$\underline{c}\|\xi_c\|^2 \le \xi_c^T \left(\sum_{j=1}^{N} \frac{\omega_j \omega_j^T}{p_j}\right) \xi_c \le \overline{c}\|\xi_c\|^2, \quad \forall \xi_c \in \mathbb{R}^m,$$

even in the absence of persistent excitation [25], [26].

The value function update law is based on concurrent learning gradient descent of the Bellman error, given by [17]

$$\dot{\hat{W}}_c = -\eta_c \frac{1}{p} \frac{\partial \delta}{\partial \hat{W}_c} \delta - \eta_c \sum_{j=1}^{N} \frac{1}{p_j} \frac{\partial \delta_j}{\partial \hat{W}_c} \delta_j, \qquad (19)$$

where $\eta_c \in \mathbb{R}$ is a positive adaptation gain, $\frac{\partial \delta}{\partial \hat{W}_c} = \omega$ is the regressor, and $p \triangleq \sqrt{1 + \omega^T \omega}$ is a normalization constant. The policy NN update laws are given by

$$\dot{\hat{W}}_{a1} = \operatorname{proj}\left\{-\eta_{a1}(\hat{W}_{a1} - \hat{W}_c)\right\}, \qquad (20)$$
$$\dot{\hat{W}}_{a2} = \operatorname{proj}\left\{-\eta_{a2}(\hat{W}_{a2} - \hat{W}_c)\right\}, \qquad (21)$$

where $\eta_{a1}, \eta_{a2} \in \mathbb{R}$ are positive gains, and $\operatorname{proj}\{\cdot\}$ is a smooth projection operator used to bound the weight estimates [27]. Using Assumption 3 and properties of the projection operator, the policy NN weight estimation errors can be bounded above by positive constants.

V. STABILITY ANALYSIS

An unmeasurable form of the Bellman error can be written, using (6), (16) and (17), as

$$\delta = -\tilde{W}_c^T \omega - \epsilon' f + \frac{1}{4}\epsilon' G_1 \epsilon'^T - \frac{1}{4}\epsilon' G_2 \epsilon'^T + \frac{1}{2}\epsilon' G_1 \sigma'^T W - \frac{1}{2}\epsilon' G_2 \sigma'^T W + \frac{1}{4}\tilde{W}_{a1}^T G_{\sigma 1} \tilde{W}_{a1} - \frac{1}{4}\tilde{W}_{a2}^T G_{\sigma 2} \tilde{W}_{a2}, \qquad (22)$$

where $G_1 \triangleq g R^{-1} g^T \in \mathbb{R}^{12\times 12}$, $G_2 \triangleq g\gamma^{-2} g^T \in \mathbb{R}^{12\times 12}$, $G_{\sigma 1} \triangleq \sigma' G_1 \sigma'^T \in \mathbb{R}^{m\times m}$, and $G_{\sigma 2} \triangleq \sigma' G_2 \sigma'^T \in \mathbb{R}^{m\times m}$ are symmetric, positive semi-definite matrices. Similarly, the Bellman error at the sampled data points can be written as

$$\delta_j = -\tilde{W}_c^T \omega_j + \frac{1}{4}\tilde{W}_{a1}^T G_{\sigma 1 j} \tilde{W}_{a1} - \frac{1}{4}\tilde{W}_{a2}^T G_{\sigma 2 j} \tilde{W}_{a2} + E_j, \qquad (23)$$

where $E_j \triangleq \frac{1}{2}\epsilon'_j (G_1 - G_2)\sigma'^T_j W + \frac{1}{4}\epsilon'_j (G_1 - G_2)\epsilon'^T_j - \epsilon'_j f_j \in \mathbb{R}$ is a constant at each data point.

For the subsequent analysis, the function $f$ on the compact set $\chi$ is Lipschitz continuous and can be bounded by

$$\|f(\zeta)\| \le L_f \|\zeta\|, \quad \forall \zeta \in \chi,$$

where $L_f$ is a positive constant, and the normalized regressor in (19) can be upper bounded by $\left\|\frac{\omega}{p}\right\| \le 1$.

Theorem 1. If Assumptions 1-4 hold and the following sufficient conditions are satisfied

$$\underline{q} > \frac{\eta_c L_f \|\epsilon'\| \varepsilon}{2}, \qquad (24)$$
$$\underline{c} > \frac{L_f \|\epsilon'\|}{2\varepsilon} + \frac{\eta_{a1} + \eta_{a2}}{2\eta_c}, \qquad (25)$$
$$\lambda_{\min}(R) \ge \gamma^2, \qquad (26)$$

where $\|\cdot\| \triangleq \sup_{\zeta}\|\cdot\|$, $\lambda_{\min}(\cdot)$ represents the minimum eigenvalue, and $Z \triangleq [\,\zeta^T \;\; \tilde{W}_c^T \;\; \tilde{W}_{a1}^T \;\; \tilde{W}_{a2}^T\,]^T \in \mathbb{R}^{12+3m}$, then the policies in (14) and (15) with the NN update laws in (19)-(21) guarantee UUB regulation of the state $\zeta(t)$ and UUB convergence of the approximated policies $\hat{u}_1$ and $\hat{u}_2$ to the optimal policies $u_1^*$ and $u_2^*$.

Proof: Consider the continuously differentiable, positive definite candidate Lyapunov function

$$V_L = V + \frac{1}{2}\tilde{W}_c^T \tilde{W}_c + \frac{1}{2}\tilde{W}_{a1}^T \tilde{W}_{a1} + \frac{1}{2}\tilde{W}_{a2}^T \tilde{W}_{a2},$$

where $V$ is positive definite when the sufficient condition in (26) is satisfied. Since $V$ is continuously differentiable and positive definite, from Lemma 4.3 of [28] there exist two class $\mathcal{K}$ functions such that

$$\alpha_1(\|\zeta\|) \le V(\zeta) \le \alpha_2(\|\zeta\|). \qquad (27)$$

Using (27), $V_L$ can be bounded by

$$\alpha_3(\|Z\|) \le V_L(Z) \le \alpha_4(\|Z\|), \qquad (28)$$

where $\alpha_3$ and $\alpha_4$ are class $\mathcal{K}$ functions.
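The update laws (19)-(21) can be sketched as a simple Euler-discretized step. The ball projection below is a crude stand-in for the smooth projection operator of [27], the sampled-data Bellman errors $\delta_j$ and regressors $\omega_j$ are assumed to be produced by evaluating the hypothetical policies_and_bellman_error helper above at each sampled point $\zeta_j$, and all gains are illustrative.

```python
import numpy as np

def project_to_ball(w, radius):
    """Crude stand-in for the smooth projection operator proj{.} of [27]:
    rescales the weight estimate so that ||w|| <= radius."""
    norm = np.linalg.norm(w)
    return w if norm <= radius else w * (radius / norm)

def critic_actor_step(W_c, W_a1, W_a2, delta, omega, deltas_j, omegas_j,
                      eta_c=1.0, eta_a1=1.0, eta_a2=1.0, W_bar=10.0, dt=0.01):
    """One Euler step of the concurrent-learning critic update (19) and the
    projected actor updates (20)-(21); all gain values are placeholders."""
    p = np.sqrt(1.0 + omega @ omega)
    W_c_dot = -eta_c * (omega / p) * delta
    for delta_j, omega_j in zip(deltas_j, omegas_j):   # concurrent-learning sum over sampled points
        p_j = np.sqrt(1.0 + omega_j @ omega_j)
        W_c_dot += -eta_c * (omega_j / p_j) * delta_j
    W_c = W_c + dt * W_c_dot
    W_a1 = project_to_ball(W_a1 + dt * (-eta_a1 * (W_a1 - W_c)), W_bar)
    W_a2 = project_to_ball(W_a2 + dt * (-eta_a2 * (W_a2 - W_c)), W_bar)
    return W_c, W_a1, W_a2
```

The sampled-data sum is what injects richness from the data set of Assumption 4, which is how the PE requirement is avoided.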
The time derivative of the candidate Lyapunov function is

$$\dot{V}_L = \frac{\partial V}{\partial \zeta} f + \frac{\partial V}{\partial \zeta} g(\hat{u}_1 + \hat{u}_2) - \tilde{W}_c^T \dot{\hat{W}}_c - \tilde{W}_{a1}^T \dot{\hat{W}}_{a1} - \tilde{W}_{a2}^T \dot{\hat{W}}_{a2}. \qquad (29)$$

Using (7), $\frac{\partial V}{\partial \zeta} f = -\frac{\partial V}{\partial \zeta} g(u_1^* + u_2^*) - r(\zeta, u_1^*, u_2^*)$. Then,

$$\dot{V}_L = \frac{\partial V}{\partial \zeta} g(\hat{u}_1 + \hat{u}_2) - \frac{\partial V}{\partial \zeta} g(u_1^* + u_2^*) - r(\zeta, u_1^*, u_2^*) - \tilde{W}_c^T \dot{\hat{W}}_c - \tilde{W}_{a1}^T \dot{\hat{W}}_{a1} - \tilde{W}_{a2}^T \dot{\hat{W}}_{a2}.$$

Substituting (19)-(21) for $\dot{\hat{W}}_c$, $\dot{\hat{W}}_{a1}$, and $\dot{\hat{W}}_{a2}$, respectively, yields

$$\dot{V}_L = -\zeta^T Q \zeta - u_1^{*T} R u_1^* + \gamma^2 u_2^{*T} u_2^* + \frac{\partial V}{\partial \zeta} g(\hat{u}_1 + \hat{u}_2) - \frac{\partial V}{\partial \zeta} g(u_1^* + u_2^*) + \tilde{W}_c^T \left(\eta_c \frac{\omega}{p}\delta + \eta_c \sum_{j=1}^{N} \frac{\omega_j}{p_j}\delta_j\right) + \tilde{W}_{a1}^T \eta_{a1}(\hat{W}_{a1} - \hat{W}_c) + \tilde{W}_{a2}^T \eta_{a2}(\hat{W}_{a2} - \hat{W}_c).$$

Using Young's inequality, (10)-(12), (14), (15), (22), (23), and (26), the Lyapunov derivative can be upper bounded as

$$\dot{V}_L \le -\varphi_\zeta\|\zeta\|^2 - \varphi_c\left\|\tilde{W}_c\right\|^2 - \varphi_{a1}\left\|\tilde{W}_{a1}\right\|^2 - \varphi_{a2}\left\|\tilde{W}_{a2}\right\|^2 + \kappa_{a1}\left\|\tilde{W}_{a1}\right\| + \kappa_{a2}\left\|\tilde{W}_{a2}\right\| + \kappa_c\left\|\tilde{W}_c\right\| + \kappa,$$

where

$$\varphi_\zeta = \underline{q} - \frac{\eta_c L_f\|\epsilon'\|\varepsilon}{2}, \quad \varphi_c = \eta_c\left(\underline{c} - \frac{L_f\|\epsilon'\|}{2\varepsilon} - \frac{\eta_{a1} + \eta_{a2}}{2\eta_c}\right), \quad \varphi_{a1} = \frac{\eta_{a1}}{2}, \quad \varphi_{a2} = \frac{\eta_{a2}}{2},$$

$$\kappa_c = \sup_{\zeta\in\chi}\left\|\frac{\eta_c}{4}\tilde{W}_{a1}^T G_{\sigma 1}\tilde{W}_{a1} + \frac{\eta_c}{4}\sum_{j=1}^{N}\tilde{W}_{a1}^T G_{\sigma 1 j}\tilde{W}_{a1} - \frac{\eta_c}{4}\tilde{W}_{a2}^T G_{\sigma 2}\tilde{W}_{a2} - \frac{\eta_c}{4}\sum_{j=1}^{N}\tilde{W}_{a2}^T G_{\sigma 2 j}\tilde{W}_{a2} + \frac{\eta_c}{2}\epsilon'(G_1 - G_2)\sigma'^T W + \frac{\eta_c}{4}\epsilon'(G_1 - G_2)\epsilon'^T + \eta_c\sum_{j=1}^{N} E_j\right\|,$$

$$\kappa_{a1} = \sup_{\zeta\in\chi}\left\|\frac{1}{2}W^T G_{\sigma 1} + \frac{1}{2}\epsilon' G_1\sigma'^T\right\|, \quad \kappa_{a2} = \sup_{\zeta\in\chi}\left\|-\frac{1}{2}W^T G_{\sigma 2} - \frac{1}{2}\epsilon' G_2\sigma'^T\right\|, \quad \kappa = \sup_{\zeta\in\chi}\left\|\frac{1}{4}\epsilon' G\epsilon'^T\right\|.$$

The constants $\varphi_\zeta$, $\varphi_c$, $\varphi_{a1}$, and $\varphi_{a2}$ are positive if the inequalities

$$\underline{q} > \frac{\eta_c L_f\|\epsilon'\|\varepsilon}{2}, \qquad \underline{c} > \frac{L_f\|\epsilon'\|}{2\varepsilon} + \frac{\eta_{a1} + \eta_{a2}}{2\eta_c} \qquad (30)$$

are satisfied. Completing the squares, the upper bound on the Lyapunov derivative can be written as

$$\dot{V}_L \le -\varphi_\zeta\|\zeta\|^2 - \frac{\varphi_c}{2}\left\|\tilde{W}_c\right\|^2 - \frac{\varphi_{a1}}{2}\left\|\tilde{W}_{a1}\right\|^2 - \frac{\varphi_{a2}}{2}\left\|\tilde{W}_{a2}\right\|^2 + \frac{\kappa_c^2}{2\varphi_c} + \frac{\kappa_{a1}^2}{2\varphi_{a1}} + \frac{\kappa_{a2}^2}{2\varphi_{a2}} + \kappa,$$

which can be further upper bounded as

$$\dot{V}_L \le -\alpha_5\|Z\|, \quad \forall \|Z\| \ge K > 0, \qquad (31)$$

where $K \triangleq \sqrt{\frac{\kappa_c^2}{2\alpha_5\varphi_c} + \frac{\kappa_{a1}^2}{2\alpha_5\varphi_{a1}} + \frac{\kappa_{a2}^2}{2\alpha_5\varphi_{a2}} + \frac{\kappa}{\alpha_5}}$ and $\alpha_5$ is a positive constant. Invoking Theorem 4.18 in [28], $Z$ is UUB.

Based on the definition of $Z$ and the inequalities in (28) and (31), $\zeta, \tilde{W}_c, \tilde{W}_{a1}, \tilde{W}_{a2} \in \mathcal{L}_\infty$. From the definition of $W$ and the NN weight estimation errors, $\hat{W}_c, \hat{W}_{a1}, \hat{W}_{a2} \in \mathcal{L}_\infty$. Using the policy update laws, $\dot{\hat{W}}_{a1}, \dot{\hat{W}}_{a2} \in \mathcal{L}_\infty$. It follows that $\hat{V}, \hat{u}_1, \hat{u}_2 \in \mathcal{L}_\infty$. From the dynamics in (3), $\dot{\zeta} \in \mathcal{L}_\infty$. By the definition in (17), $\delta \in \mathcal{L}_\infty$. By the definition of the normalized value function update law, $\dot{\hat{W}}_c \in \mathcal{L}_\infty$.

Remark 1. If $\|Z(0)\| \ge K$, then $\dot{V}_L(Z(0)) < 0$. There exists an $\varepsilon_1 \in [0, \infty)$ such that $V_L(Z(\varepsilon_1)) < V_L(Z(0))$. Using (28), $\alpha_3(\|Z(\varepsilon_1)\|) \le V_L(\varepsilon_1) < \alpha_4(\|Z(0)\|)$. Rearranging terms, $\|Z(\varepsilon_1)\| < \alpha_3^{-1}(\alpha_4(\|Z(0)\|))$. Hence, $Z(\varepsilon_1) \in \mathcal{L}_\infty$. It can be shown by induction that $Z(t) \in \mathcal{L}_\infty, \; \forall t \in [0, \infty)$ when $\|Z(0)\| \ge K$. Using a similar argument when $\|Z(0)\| < K$, $\|Z(t)\| < \alpha_3^{-1}(\alpha_4(K))$. Therefore, $Z(t) \in \mathcal{L}_\infty, \; \forall t \in [0, \infty)$ when $\|Z(0)\| < K$. Since $Z(t) \in \mathcal{L}_\infty, \; \forall t \in [0, \infty)$, the state $\zeta$ is shown to lie on the compact set $\chi$, where $\chi \triangleq \left\{\zeta \in \mathbb{R}^{12} \mid \|\zeta\| \le \alpha_3^{-1}(\alpha_4(\max(\|Z(0)\|, K)))\right\}$.
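Purely as an illustration of how the sufficient conditions would inform gain selection, the sketch below checks (24)-(26)/(30) and evaluates the ultimate bound $K$ from (31) for user-supplied estimates of the analysis constants; none of these quantities are computed in the paper, so every argument here is a hypothetical placeholder.

```python
import numpy as np

def check_gain_conditions(q_lb, c_lb, eta_c, eta_a1, eta_a2,
                          L_f, eps_prime_bound, eps_young, R, gamma):
    """Check the sufficient conditions (24)-(26) for assumed bounds, e.g.,
    q_lb = lambda_min(Q), c_lb from the concurrent-learning data in (18),
    eps_prime_bound = sup ||eps'||, eps_young the Young's inequality constant."""
    cond_24 = q_lb > eta_c * L_f * eps_prime_bound * eps_young / 2.0
    cond_25 = c_lb > (L_f * eps_prime_bound / (2.0 * eps_young)
                      + (eta_a1 + eta_a2) / (2.0 * eta_c))
    cond_26 = np.min(np.linalg.eigvalsh(R)) >= gamma**2
    return cond_24, cond_25, cond_26

def ultimate_bound(kappa_c, kappa_a1, kappa_a2, kappa,
                   phi_c, phi_a1, phi_a2, alpha_5):
    """Ultimate bound K from (31) for assumed values of the analysis constants."""
    return np.sqrt(kappa_c**2 / (2 * alpha_5 * phi_c)
                   + kappa_a1**2 / (2 * alpha_5 * phi_a1)
                   + kappa_a2**2 / (2 * alpha_5 * phi_a2)
                   + kappa / alpha_5)
```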
VI. CONCLUSION

An online approximation of a robust optimal control strategy is developed to enable station keeping by an AUV. Using the RL-based actor-critic framework, the solution to the HJI equation is approximated. A gradient descent adaptive update law with concurrent learning approximates the value function. A Lyapunov-based stability analysis concludes UUB convergence of the states and UUB convergence of the approximated policies to the optimal policies without the requirement of PE.

Future work includes numerical simulation of the developed controller and comparison to a numerical offline optimal solution to evaluate performance. Since a model for an arbitrary AUV is often difficult to determine, extending the result to include a state estimator could relax the requirement of exact model knowledge. Also, the extension of the developed technique to underactuated AUVs would open the result to a much broader class of vehicles.

REFERENCES

[1] G. Griffiths, Technology and Applications of Autonomous Underwater Vehicles, G. Griffiths, Ed. CRC Press, 2003.
[2] J. Bellingham, Y. Zhang, J. Kerwin, J. Erikson, B. Hobson, B. Kieft, M. Godin, R. McEwen, T. Hoover, J. Paul, A. Hamilton, J. Franklin, and A. Banka, "Efficient propulsion for the Tethys long-range autonomous underwater vehicle," in Auton. Underw. Veh. (AUV), 2010 IEEE/OES, 2010, pp. 1–7.
[3] D. Yoerger and J. Slotine, "Robust trajectory control of underwater vehicles," IEEE J. Oceanic Eng., vol. 10, no. 4, pp. 462–470, 1985.
[4] A. Healey and D. Lienard, "Multivariable sliding mode control for autonomous diving and steering of unmanned underwater vehicles," IEEE J. Oceanic Eng., vol. 18, pp. 327–339, 1993.
[5] L. Lapierre and B. Jouvencel, "Robust nonlinear path-following control of an AUV," IEEE J. Oceanic Eng., vol. 33, no. 2, pp. 89–102, 2008.
[6] E. Sebastian and M. A. Sotelo, "Adaptive fuzzy sliding mode controller for the kinematic variables of an underwater vehicle," J. Intell. Robot. Syst., vol. 49, no. 2, pp. 189–215, 2007.
[7] D. Kirk, Optimal Control Theory: An Introduction. Dover, 2004.
[8] I. Kaminer, A. Pascoal, C. Silvestre, and P. Khargonekar, "Control of an underwater vehicle using H∞ synthesis," in Proc. IEEE Conf. Decis. Control, 1991.
[9] J. Petrich and D. J. Stilwell, "Robust control for an autonomous underwater vehicle that suppresses pitch and yaw coupling," Ocean Eng., vol. 38, pp. 197–204, 2011.
[10] J. S. Riedel and A. J. Healey, "Model based predictive control of AUVs for station keeping in a shallow water wave environment," Naval Postgraduate School, Tech. Rep., 2005.
[11] T. McLain and R. Beard, "Successive Galerkin approximations to the nonlinear optimal control of an underwater robotic vehicle," in Proc. IEEE Int. Conf. Robot. Autom., 1998.
[12] J. Park, W. Chung, and J. Yuh, "Nonlinear H∞ optimal PID control of autonomous underwater vehicles," in Proc. 2000 Int. Symp. Underw. Technol., 2000, pp. 193–198.
[13] K. Vamvoudakis and F. Lewis, "Online actor-critic algorithm to solve the continuous-time infinite horizon optimal control problem," Automatica, vol. 46, pp. 878–888, 2010.
[14] ——, "Online neural network solution of nonlinear two-player zero-sum games using synchronous policy iteration," in Proc. IEEE Conf. Decis. Control, 2010.
[15] S. Bhasin, R. Kamalapurkar, M. Johnson, K. Vamvoudakis, F. L. Lewis, and W. Dixon, "A novel actor-critic-identifier architecture for approximate optimal control of uncertain nonlinear systems," Automatica, vol. 49, no. 1, pp. 89–92, 2013.
[16] M. Johnson, S. Bhasin, and W. E. Dixon, "Nonlinear two-player zero-sum game approximate solution using a policy iteration algorithm," in Proc. IEEE Conf. Decis. Control, 2011, pp. 142–147.
[17] G. Chowdhary and E. Johnson, "Concurrent learning for convergence in adaptive control without persistency of excitation," in Proc. IEEE Conf. Decis. Control, 2010, pp. 3674–3679.
[18] T. I. Fossen, Handbook of Marine Craft Hydrodynamics and Motion Control. Wiley, 2011.
[19] J. C. Kinsey, R. M. Eustice, and L. L. Whitcomb, "A survey of underwater vehicle navigation: Recent advances and new challenges," in Proc. IFAC Conf. on Manoeuvring and Control of Mar. Craft, Lisbon, Portugal, 2006.
[20] F. L. Lewis, Optimal Control. John Wiley & Sons, 1986.
[21] A. van der Schaft, "L2-gain analysis of nonlinear systems and nonlinear H∞ control," IEEE Trans. Autom. Control, vol. 37, no. 6, pp. 770–784, 1992.
[22] T. Basar and G. J. Olsder, Dynamic Noncooperative Game Theory. SIAM, 1999.
[23] K. Hornik, M. Stinchcombe, and H. White, "Universal approximation of an unknown mapping and its derivatives using multilayer feedforward networks," Neural Netw., vol. 3, no. 5, pp. 551–560, 1990.
[24] F. L. Lewis, J. Campos, and R. Selmic, Neuro-Fuzzy Control of Industrial Systems with Actuator Nonlinearities. SIAM, 2002.
[25] G. V. Chowdhary and E. N. Johnson, "Theory and flight-test validation of a concurrent-learning adaptive controller," J. Guid. Control Dyn., vol. 34, no. 2, pp. 592–607, 2011.
[26] G. Chowdhary, T. Yucelen, M. Mühlegg, and E. N. Johnson, "Concurrent learning adaptive control of linear systems with exponentially convergent bounds," Int. J. Adapt. Control Signal Process., vol. 27, pp. 280–301, 2012.
[27] W. E. Dixon, A. Behal, D. M. Dawson, and S. Nagarkatti, Nonlinear Control of Engineering Systems: A Lyapunov-Based Approach. Birkhäuser: Boston, 2003.
[28] H. K. Khalil, Nonlinear Systems, 3rd ed. Prentice Hall, 2002.