A Separation-Based Design to Data-Driven Control for Large-Scale Partially Observed Systems

Dan Yu¹, Mohammadhussein Rafieisakhaei² and Suman Chakravorty¹

Abstract—This paper studies the partially observed stochastic optimal control problem for systems whose state dynamics are governed by Partial Differential Equations (PDEs), which leads to an extremely large-scale problem. First, an open-loop deterministic trajectory optimization problem is solved using a black-box simulation model of the dynamical system. Next, a Linear Quadratic Gaussian (LQG) controller is designed for the nominal-trajectory-dependent linearized system, which is identified using input-output experimental data consisting of the impulse responses of the optimized nominal system. A computational nonlinear heat example illustrates the performance of the approach.

*This material is based upon work partially supported by NSF under Contract No. CNS-1646449 and Science & Technology Center Grant CCF-0939370, the U.S. Army Research Office under Contract No. W911NF-15-1-0279, NPRP grant NPRP 8-1531-2-651 from the Qatar National Research Fund (a member of Qatar Foundation), AFOSR Dynamic Data Driven Application Systems (DDDAS) contract FA9550-17-1-0068, and NSF NRI project ECCS-1637889.

¹D. Yu and S. Chakravorty are with the Department of Aerospace Engineering, and ²M. Rafieisakhaei is with the Department of Electrical and Computer Engineering, Texas A&M University, College Station, Texas 77840, USA. {yudan198811@hotmail.com, mrafieis, schakrav@tamu.edu}

I. INTRODUCTION

In this paper, we consider the stochastic control of partially observed nonlinear dynamical systems that are governed by Partial Differential Equations (PDEs). In particular, we propose a novel data-based approach to the solution of very large Partially Observed Markov Decision Processes (POMDPs) in which the underlying state space is obtained from the discretization of a PDE; problems whose solution has hitherto not been attempted using approximate MDP-based techniques.

It is well known that the globally optimal solution of an MDP can be found by solving the Hamilton-Jacobi-Bellman (HJB) equation. Solution techniques can be divided into model-based and model-free methods, according to whether the methodology uses an analytical model of the system or only a black-box simulation model or actual experiments. Reinforcement Learning (RL) techniques [1, 2] based on the Differential Dynamic Programming (DDP) or iLQG approach [3, 4] have shown the potential of RL algorithms to scale to higher-dimensional continuous state- and control-space problems, such as high-dimensional robotic task planning and learning.

Fundamentally, rather than solving the derived "dynamic programming" problem, which, as in the majority of the approaches above, requires optimization over feedback laws, our approach is to solve the original stochastic optimization problem directly, in a "separated open-loop/closed-loop" fashion: 1) we solve an open-loop deterministic optimization problem to obtain an optimal nominal trajectory in a model-free fashion, and then 2) we design a closed-loop controller for the resulting linearized time-varying system around the optimal nominal trajectory, again in a model-free fashion. Although this "divide and conquer" strategy forgoes a full dynamic programming solution, it can be shown to be near-optimal [5, 6].
The primary contributions of the proposed approach are: 1) We specify a detailed set of experiments that accomplishes the closed-loop controller design for any unknown nonlinear system, no matter how high-dimensional. This series of experiments consists of a sequence of input perturbations that collect the impulse responses of the system, first to find an optimized nominal trajectory, and then to recover the Linear Time-Varying (LTV) system corresponding to perturbations of the nominal system, for which an LQG controller is designed. 2) In general, for large-scale systems with partially observed states, system identification algorithms such as the time-varying Eigensystem Realization Algorithm (ERA) [7] automatically construct a reduced-order model of the LTV system, which results in a reduced-order estimator and controller. Therefore, even for large-scale systems, such as partially observed systems with dynamics governed by PDEs, the computation of the feedback policy remains computationally tractable. For instance, in the partially observed nonlinear heat control problem considered in this paper, the complexity is reduced by O(10^5) when compared to DDP-based RL techniques. 3) We provide a unification of traditional linear and nonlinear optimal control techniques with Adaptive Dynamic Programming (ADP) [8] and RL techniques in the context of stochastic dynamic programming problems.

II. PROBLEM SETUP

Consider a discrete-time nonlinear dynamical system:

$$x_{k+1} = f(x_k, u_k, w_k), \quad y_k = h(x_k, v_k), \qquad (1)$$

where $x \in \mathbb{R}^{n_x}$, $y \in \mathbb{R}^{n_y}$, and $u \in \mathbb{R}^{n_u}$ are the state, measurement, and control vectors, respectively; the system and measurement functions $f(\cdot)$ and $h(\cdot)$ are nonlinear; and $\{w_k, v_k,\ k \geq 0\}$ are zero-mean, uncorrelated Gaussian white noise sequences with covariances $W$ and $V$, respectively. For PDEs, the dynamics are discretized using Finite Difference (FD) or Finite Element (FE) methods, which can lead to a state space consisting of, e.g., millions of states.

The belief $b(x_k)$ is the conditional distribution of the state $x_k$ given all past data. In this paper, we consider Gaussian beliefs, denoted $b_k := (\mu_k, \Sigma_k)$, where $\mu_k$ and $\Sigma_k$ are the mean and covariance (whose size is $O(n_x^2)$, which for a PDE with large $n_x$ is extremely large). We denote the belief dynamics by $b_{k+1} = \tau(b_k, u_k, y_{k+1})$.

Stochastic Control Problem: Given $b_0$ and a finite time horizon $N > 0$, for unknown nonlinear $f(\cdot)$ and $h(\cdot)$, find the control policy $\pi = \{\pi_0, \pi_1, \cdots, \pi_{N-1}\}$, where $\pi_k$ is the control policy at time $k$ and $u_k = \pi_k(b_k)$, such that

$$J^{\pi} = \mathbb{E}\Big[\sum_{k=0}^{N-1} c_k(b_k, u_k) + c_N(b_N)\Big], \qquad (2)$$

is minimized, where $\{c_k(\cdot,\cdot)\}_{k=0}^{N-1}$ denotes the immediate cost functions and $c_N(\cdot)$ the terminal cost.
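To make the belief dynamics $b_{k+1} = \tau(b_k, u_k, y_{k+1})$ concrete, the following is a minimal sketch of one Ensemble Kalman Filter (EnKF) update step of the kind used later in Step 1 of Section III, where the belief is represented by an ensemble of state samples so that the $O(n_x^2)$ covariance is never formed explicitly. The callables `simulate_step` and `measure` are hypothetical stand-ins for the black-box $f(\cdot)$ and $h(\cdot)$; all names here are assumptions of this sketch, not part of the paper.

```python
import numpy as np

def enkf_belief_update(ensemble, u, y, simulate_step, measure, W, V, rng):
    """One step of the belief dynamics b_{k+1} = tau(b_k, u_k, y_{k+1}).

    ensemble: (n_x, M) array of state samples representing the belief b_k.
    simulate_step, measure: hypothetical black-box stand-ins for f and h.
    W, V: process and measurement noise covariances.
    """
    n_x, M = ensemble.shape
    n_y = y.shape[0]
    # Forecast: push every sample through the black-box dynamics.
    forecast = np.column_stack(
        [simulate_step(ensemble[:, i], u) for i in range(M)]
    ) + rng.multivariate_normal(np.zeros(n_x), W, size=M).T
    # Predicted measurement for every sample.
    Y = np.column_stack([measure(forecast[:, i]) for i in range(M)])
    Xd = forecast - forecast.mean(axis=1, keepdims=True)
    Yd = Y - Y.mean(axis=1, keepdims=True)
    # Sample cross-covariance and innovation covariance; only low-rank
    # factors are stored, never the full n_x x n_x covariance.
    K = (Xd @ Yd.T / (M - 1)) @ np.linalg.inv(Yd @ Yd.T / (M - 1) + V)
    # Analysis: assimilate the measurement with perturbed observations.
    y_pert = y[:, None] + rng.multivariate_normal(np.zeros(n_y), V, size=M).T
    return forecast + K @ (y_pert - Y)
```

The mean and covariance of the updated ensemble then furnish $b_{k+1} = (\mu_{k+1}, \Sigma_{k+1})$; in the open-loop stage described next, the measurement $y$ is replaced by the nominal observation $\bar y_{k+1}$.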
III. SEPARATION-BASED FEEDBACK CONTROL DESIGN

Let $\{\bar u_k\}_{k=0}^{N-1}$, $\{\bar\mu_k\}_{k=0}^{N}$, $\{\bar y_k\}_{k=0}^{N}$, and $\{\bar b_k\}_{k=0}^{N}$ denote the nominal control, state, observation, and belief trajectories of the system, respectively, where given $\bar u_k = \pi_k(\bar b_k)$ we have:

$$\bar\mu_{k+1} = f(\bar\mu_k, \bar u_k, 0), \quad \bar y_k = h(\bar\mu_k, 0), \quad \bar b_{k+1} = \tau(\bar b_k, \bar u_k, \bar y_{k+1}),$$

with initial conditions $\bar b_0 = b_0$ and $\bar\mu_0 = \mathbb{E}[b_0]$. The nominal cost and the first-order expansion of the cost are [5, 9]:

$$\bar J := \sum_{k=0}^{N-1} c_k(\bar b_k, \bar u_k) + c_N(\bar b_N),$$

$$J \approx \bar J + \underbrace{\sum_{k=0}^{N-1}\big(C^b_k(b_k - \bar b_k) + C^u_k(u_k - \bar u_k)\big) + C^b_N(b_N - \bar b_N)}_{=:\,\delta J}.$$

Theorem 1 (Cost Function Linearization Error): The expected first-order linearization error of the cost function is zero, i.e., $\mathbb{E}(\delta J) = 0$.

Theorem 1 shows that, to first order, the stochastic cost function is dominated by the nominal cost: it depends only on the nominal trajectories of the system and is independent of the feedback gain. Therefore, the design of the optimal feedback gain can be separated from the design of the optimal nominal trajectory. As a result, the stochastic optimal control problem can be divided into two separate problems: first, a deterministic problem for designing the open-loop optimal control sequence, and hence the optimal nominal trajectory of the system; and second, the design of an optimal linear feedback law to track this nominal trajectory (which is the optimal belief-state trajectory, unlike typical trajectory-optimization-based RL methods designed for fully observed problems, such as [1, 2]). We propose the following three-step framework to solve the stochastic feedback control problem.

Step 1. Open-Loop Trajectory Optimization in Belief Space. Solve the open-loop optimization problem given $b_0$:

$$\{u^*_k\}_{k=0}^{N-1} = \arg\min_{\{u_k\}} \bar J\big(\{b_k\}_{k=0}^{N}, \{u_k\}_{k=0}^{N-1}\big), \quad b_{k+1} = \tau(b_k, u_k, \bar y_{k+1}), \qquad (3)$$

where the nominal observations $\bar y_k$ are generated by $x_{k+1} = f(x_k, u_k, 0)$, $\bar y_k = h(x_k, 0)$ with $x_0 = \mu_0$. Given the nominal observations $\bar y_k$, the belief evolution is deterministic, and the above is a deterministic optimization problem [10]. The open-loop optimization problem is solved using a gradient descent approach [11, 12] utilizing an Ensemble Kalman Filter (EnKF) [13]. Denote the initial guess of the control sequence by $U^{(0)} = \{u^{(0)}_k\}_{k=0}^{N-1}$ and the corresponding belief states estimated by the EnKF by $B^{(0)} = \{b^{(0)}_k\}_{k=0}^{N}$. The control sequence is updated iteratively via

$$U^{(n+1)} = U^{(n)} - \alpha \nabla_U \bar J(B^{(n)}, U^{(n)}), \qquad (4)$$

until a convergence criterion is met, where $U^{(n)} = \{u^{(n)}_k\}_{k=0}^{N-1}$ is the control sequence at the $n$th iteration, $B^{(n)} = \{b^{(n)}_k\}_{k=0}^{N}$ denotes the corresponding belief, and $\alpha$ is the step-size parameter. Finally, denote the resulting nominal belief trajectory by $\{\bar\mu_k, \bar\Sigma_k\}_{k=0}^{N}$.

Step 2. Linear Time-Varying System Identification. We linearize the system (1) around the nominal mean trajectory $\{\bar\mu_k\}$. For simplicity, assume that the control and the disturbance enter through the same channels and that the noise is purely additive:

$$\delta x_{k+1} = A_k \delta x_k + B_k(\delta u_k + w_k), \quad \delta y_k = C_k \delta x_k + v_k, \qquad (5)$$

where $\delta x_k = x_k - \bar\mu_k$, $\delta u_k = u_k - \bar u_k$, and $\delta y_k = y_k - h(\bar\mu_k, 0)$ describe the state, control, and measurement deviations from the nominal trajectory, respectively, and

$$A_k = \frac{\partial f(x,u,w)}{\partial x}\Big|_{\bar\mu_k, \bar u_k, 0}, \quad B_k = \frac{\partial f(x,u,w)}{\partial u}\Big|_{\bar\mu_k, \bar u_k, 0}, \quad C_k = \frac{\partial h(x,v)}{\partial x}\Big|_{\bar\mu_k, 0}. \qquad (6)$$

We identify the system (5) from the impulse responses of the system via the time-varying ERA [7], as sketched below. Denote the identified deviation system by

$$\delta a_{k+1} = \hat A_k \delta a_k + \hat B_k(\delta u_k + w_k), \quad \delta y_k = \hat C_k \delta a_k + v_k, \qquad (7)$$

where $\delta a_k \in \mathbb{R}^{n_r}$ denotes the reduced-order model (ROM) deviation state and $n_r \ll n_x$, thereby automatically providing a compact parametrization of the problem.
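The heart of Step 2 is the construction of a reduced-order LTV realization from impulse-response (Markov) parameters. The sketch below illustrates the idea via SVDs of generalized Hankel matrices, in the spirit of the time-varying ERA of [7]; it is schematic rather than a faithful transcription of [7] (whose Hankel bookkeeping and boundary handling differ in detail), and the accessor `markov(p, q)`, assumed to return the $n_y \times n_u$ response at time $p$ to a unit impulse applied at the input at time $q$, is an assumption of this sketch.

```python
import numpy as np

def tv_era(markov, N, n_r, alpha, beta, n_y, n_u):
    """Schematic time-varying ERA (in the spirit of [7]).

    markov(p, q): assumed accessor for the n_y x n_u impulse response at
    time p to a unit impulse applied at time q.
    Returns dicts of reduced-order A_hat, B_hat, C_hat keyed by time step.
    """
    def hankel(k, shift=0):
        # Generalized Hankel matrix H_k = O_{k+shift} R_{k-1}, assembled
        # from alpha x beta blocks of Markov parameters.
        return np.block([[markov(k + shift + i, k - 1 - j)
                          for j in range(beta)] for i in range(alpha)])

    A_hat, B_hat, C_hat = {}, {}, {}
    S_prev, V_prev = None, None
    for k in range(beta, N - alpha):   # boundary steps omitted for brevity
        # Rank-n_r truncated SVD gives balanced factors
        # O_k ~ U s^(1/2) and R_{k-1} ~ s^(1/2) V'.
        U, s, Vt = np.linalg.svd(hankel(k), full_matrices=False)
        U, sq, Vt = U[:, :n_r], np.sqrt(s[:n_r]), Vt[:n_r, :]
        C_hat[k] = (U * sq)[:n_y, :]                 # first block row of O_k
        B_hat[k - 1] = (sq[:, None] * Vt)[:, :n_u]   # first block col of R_{k-1}
        if V_prev is not None:
            # A_{k-1} links consecutive realizations:
            # A_hat[k-1] = O_k^+ @ H_{k-1}(shifted) @ R_{k-2}^+.
            A_hat[k - 1] = (U / sq).T @ hankel(k - 1, shift=1) \
                           @ (V_prev.T / S_prev)
        S_prev, V_prev = sq, Vt
    return A_hat, B_hat, C_hat
```

In practice, the Markov parameters are collected by perturbing each input channel around the optimized nominal trajectory, and the boundary time steps require suitably truncated Hankel matrices, which are omitted here.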
Step 3. Closed-Loop Controller Design. Given the identified system (7), we design a closed-loop controller to track the optimal nominal trajectory, i.e., to minimize the cost function

$$J_f = \sum_{k=0}^{N-1}\big(\delta\hat a_k' Q_k\, \delta\hat a_k + \delta u_k' R_k\, \delta u_k\big) + \delta\hat a_N' Q_N\, \delta\hat a_N, \qquad (8)$$

where $\delta\hat a_k$ denotes the estimate of the deviation state $\delta a_k$, the $Q_k, Q_N$ are positive semi-definite, and the $R_k$ are positive definite. For the linear system (7), the "separation principle" of linear control theory applies [14], and the design of the optimal linear stochastic controller decouples into the design of a Kalman filter and a fully observed optimal LQR controller; a minimal sketch follows. A flow chart of the separation-based nonlinear stochastic control design is shown in Fig. 2.

[Fig. 2. Separation-Based Stochastic Feedback Control Algorithm (flow chart).]
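As a concrete reading of Step 3, the following is a minimal sketch of the decoupled LQG design for the identified ROM (7): a backward Riccati recursion yields the LQR feedback gains and a forward Riccati recursion yields the time-varying Kalman filter gains. For brevity it assumes time-invariant weights `Q`, `R` and noise covariances `W`, `V` (the time-varying case is identical with indexed matrices); the function and variable names are assumptions of this sketch.

```python
import numpy as np

def lqg_gains(A, B, C, Q, QN, R, W, V):
    """Decoupled LQG design for the identified LTV ROM (7).

    A, B, C: length-N lists of the identified A_hat_k, B_hat_k, C_hat_k.
    Returns LQR gains L_k (delta u_k = -L_k @ delta a_hat_k) and Kalman
    gains K_k.
    """
    N = len(A)
    n_r = A[0].shape[0]
    # Backward Riccati recursion for the LQR gains (fully observed design).
    P, L = QN, [None] * N
    for k in reversed(range(N)):
        L[k] = np.linalg.solve(R + B[k].T @ P @ B[k], B[k].T @ P @ A[k])
        P = Q + A[k].T @ P @ (A[k] - B[k] @ L[k])
    # Forward Riccati recursion for the Kalman filter gains; the noise
    # enters through B_k as in (5), so the process noise covariance
    # is B W B'.  Zero initial estimate covariance is assumed here.
    S, K = np.zeros((n_r, n_r)), [None] * N
    for k in range(N):
        S_pred = A[k] @ S @ A[k].T + B[k] @ W @ B[k].T
        K[k] = np.linalg.solve(C[k] @ S_pred @ C[k].T + V, C[k] @ S_pred).T
        S = (np.eye(n_r) - K[k] @ C[k]) @ S_pred
    return L, K
```

Online, the filter is run forward on the measured deviations $\delta y_k$ and the control $u_k = \bar u_k - L_k \delta\hat a_k$ is applied; for the heat example below, this means propagating $20 \times 20$ matrices rather than $10100 \times 10100$ ones.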
IV. EXPERIMENTS

We test the method on a one-dimensional nonlinear heat transfer problem. Let $T(x,t)$ be the temperature distribution at location $x$ and time $t$, $K(x,T)$ the thermal diffusivity, $\eta$ the convective heat transfer coefficient, $u(t)$ the external heat source, and $L$ the length of the slab. The heat transfer PDE along the slab, with its initial and boundary conditions, is:

$$\frac{\partial T}{\partial t} = K(x,T)\,\frac{\partial^2 T}{\partial x^2} - \eta T + u(t), \qquad (9)$$

$$T(x,0) = 100\,^{\circ}\mathrm{F}, \quad \frac{\partial T}{\partial x}\Big|_{x=0} = 0, \quad T(L,t) = 150\,^{\circ}\mathrm{F}. \qquad (10)$$

The system is discretized using the finite difference method with 100 equally spaced grid points. Five point heat sources are evenly located in $[0.1L, 0.9L]$, and the sensors are placed at the same locations. The total simulation time is 62.5 s with a time step of 0.25 s. The control objective is to bring the entire field to the target temperature $T_f = (150 \pm 3)\,^{\circ}\mathrm{F}$ within 37.5 s, and to keep the temperature at $(150 \pm 3)\,^{\circ}\mathrm{F}$ during $[37.5, 62.5]$ s.

The open-loop optimal nominal (belief mean) trajectory and the optimal control are shown in Fig. 3. For the identified reduced-order system, $\hat A_k \in \mathbb{R}^{20 \times 20}$, so the feedback design decouples into the solution of two $20 \times 20$ Riccati equations, one for the controller and one for the Kalman filter. Note that if we were to use an iLQG-based design, the size of the belief state space would be 10100, and the policy evaluation step would require the solution of a $10100 \times 10100$ Riccati equation.

[Fig. 3. Open-Loop Optimization Solution: (a) nominal belief trajectory; (b) optimal control (five heat source controllers).]

With the identified linearized system, we design the closed-loop controller. We run 1000 independent simulations with process noise $w_k \sim N(0, I)$ and measurement noise $v_k \sim N(0, I)$. In Fig. 1(a), we compare the averaged closed-loop trajectory with the nominal trajectory at times $t = 37.5$ s and $t = 62.5$ s. In Figs. 1(b)-(c), we pick two positions at random and show the errors between the actual and optimal trajectories, with $2\sigma$ bounds, for one simulation; for comparison, the open-loop error is also shown.

[Fig. 1. Performance of the Proposed Approach: (a) comparison of belief trajectories (nominal vs. closed loop at t = 37.5 s and t = 62.5 s); (b) estimation error at x = 0.4L; (c) estimation error at x = 0.9L, with 2σ bounds.]

The state estimates averaged over the 1000 Monte Carlo runs are close to the open-loop optimal trajectory, which implies that the control objective of minimizing the expected cost function is achieved by the proposed approach. In this partially observed problem, the computational complexity of designing the online estimator and controller using the identified ROM is reduced by $O(n_x^4/n_r^2) = O(10^5)$, and for a general three-dimensional problem this reduction would be even more significant.

V. CONCLUSION

In this paper, we proposed a separation-based design for the stochastic optimal control of systems with unknown nonlinear dynamics and partially observed states. The open-loop optimization and the system identification are implemented efficiently offline using the impulse responses of the system, and an LQG controller based on the ROM is implemented online, which is computationally fast. We demonstrated the performance of the proposed approach on a one-dimensional nonlinear heat transfer problem.

BIBLIOGRAPHY

[1] R. Akrour, A. Abdolmaleki, H. Abdulsamad, and G. Neumann, "Model-Free Trajectory Optimization for Reinforcement Learning," in Proceedings of the International Conference on Machine Learning, 2016.
[2] E. Todorov and Y. Tassa, "Iterative Local Dynamic Programming," in Proc. of the IEEE Int. Symposium on ADP and RL, 2009.
[3] S. Levine and P. Abbeel, "Learning Neural Network Policies with Guided Policy Search under Unknown Dynamics," in Advances in Neural Information Processing Systems, 2014.
[4] S. Levine and V. Koltun, "Learning Complex Neural Network Policies with Trajectory Optimization," in Proceedings of the International Conference on Machine Learning, 2014.
[5] M. Rafieisakhaei, S. Chakravorty, and P. Kumar, "Near-Optimal Belief Space Planning via T-LQG," arXiv preprint arXiv:1705.09415, 2017.
[6] M. Rafieisakhaei, S. Chakravorty, and P. Kumar, "A near-optimal separation principle for nonlinear stochastic systems arising in robotic path planning and control," arXiv preprint arXiv:1705.08566, 2017.
[7] M. Majji, J.-N. Juang, and J. L. Junkins, "Time-Varying Eigensystem Realization Algorithm," Journal of Guidance, Control, and Dynamics, vol. 33, no. 1, pp. 13-28, 2010.
[8] R. R. Bitmead, M. Gevers, and V. Wertz, Adaptive Optimal Control: The Thinking Man's G.P.C. Prentice Hall Professional Technical Reference, 1991.
[9] M. Rafieisakhaei, S. Chakravorty, and P. Kumar, "Belief Space Planning Simplified: Trajectory-Optimized LQG (T-LQG)," arXiv preprint arXiv:1608.03013, 2016.
[10] R. Platt, R. Tedrake, L. Kaelbling, and T. Lozano-Perez, "Belief space planning assuming maximum likelihood observations," in Proceedings of Robotics: Science and Systems (RSS), June 2010.
[11] A. E. Bryson and W. Denham, "A steepest-ascent method for solving optimum programming problems," Journal of Applied Mechanics, vol. 29, no. 2, 1962.
[12] A. Gosavi, Simulation-Based Optimization: Parametric Optimization Techniques and Reinforcement Learning. Norwell, MA, USA: Kluwer Academic Publishers, 2003.
[13] S. Gillijns et al., "What Is the Ensemble Kalman Filter and How Well Does It Work?" in Proceedings of the 2006 American Control Conference, 2006, pp. 4448-4453.
[14] D. P. Bertsekas, Dynamic Programming and Optimal Control, Two Volume Set, 2nd ed. Athena Scientific, 1995.