arXiv:1709.06166v1 [cs.AI] 18 Sep 2017

DropoutDAgger: A Bayesian Approach to Safe Imitation Learning

Kunal Menda, Katherine Driggs-Campbell, and Mykel J. Kochenderfer

Footnote: This material is based upon work supported by SAIC Motor. K. Menda, K. Driggs-Campbell, and M. J. Kochenderfer are with the Aeronautics and Astronautics Department at Stanford University, Stanford, CA USA (e-mail: {kmenda,krdc,mykel}@stanford.edu).

Abstract— While imitation learning is becoming common practice in robotics, this approach often suffers from data mismatch and compounding errors. DAgger is an iterative algorithm that addresses these issues by continually aggregating training data from both the expert and novice policies, but does not consider the impact of safety. We present a probabilistic extension to DAgger, which uses the distribution over actions provided by the novice policy for a given observation. Our method, which we call DropoutDAgger, uses dropout to train the novice as a Bayesian neural network that provides insight into its confidence. Using the distribution over the novice's actions, we estimate a probabilistic measure of safety with respect to the expert action, tuned to balance exploration and exploitation. The utility of this approach is evaluated on the MuJoCo HalfCheetah and in a simple driving experiment, demonstrating improved performance and safety compared to other DAgger variants and classic imitation learning.

I. INTRODUCTION

Recently, there have been many advances in robotics driven by breakthroughs in deep imitation learning [1], [2]. Yet, to be truly intelligent, such systems must have the ability to explore their state-space in a safe way [3]. One method to guide exploration is to learn from expert demonstrations [4], [5]. In contrast with reinforcement learning, where an explicit reward function must be defined, imitation learning guides exploration through expert supervision, allowing our robot policy to effectively learn directly from experiences [2].

However, such supervised approaches are often suboptimal or fail when the policy that is being trained (referred to as the novice policy) encounters new situations or enters a state that is poorly represented in the dataset provided by the expert [6], [7]. While failures may be insignificant in simulation, safe learning is of the utmost importance when acting in the real world [3].

Methods for guided policy search in imitation learning settings have been developed [8]. An example of these approaches is DAGGER, which improves the data represented in the training dataset by continually aggregating new data from both the expert and novice policies. DAGGER has many desirable properties, including online functionality and theoretical guarantees. This approach, however, does not provide safety guarantees. Recent work extended DAGGER to address some inherent drawbacks [9], [10]. In particular, SAFEDAGGER augments DAGGER with a decision rule policy to provide safe exploration with minimal influence from the expert [11].

This paper augments DAGGER by extending the approach to a probabilistic domain. We build upon the SAFEDAGGER idea of safety by considering the distribution over the novice's actions. This approach allows us to glean some insight into the deep policy through a notion of confidence, in addition to safety bounds. We demonstrate how our method outperforms existing algorithms in classical imitation learning settings.
This paper presents three key contributions:
1) We develop a probabilistic notion of safety to balance exploration and exploitation;
2) We present DROPOUTDAGGER, a Bayesian extension to DAGGER; and
3) We demonstrate the utility of this approach with improved performance and safety in imitation learning case studies.

This paper is organized as follows. Section II provides a brief overview of the underlying principles to be employed in this work. The methodology behind DROPOUTDAGGER is presented in Section III. The two experimental settings used to validate our approach are described in Section IV. Section V discusses our findings and outlines future work.

II. BACKGROUND

This section presents a brief technical overview of DAGGER, SAFEDAGGER, and dropout as applied to Bayesian neural networks.

A. DAgger and SafeDAgger

The DAGGER framework extends traditional supervised learning approaches by simultaneously running both an expert policy that we wish to clone and a novice policy we wish to train [12]. By constantly aggregating new data samples from the expert policy, the underlying model and reward structure are uncovered. Given some initial training set D0 generated by the expert policy πexp, an initial novice policy πnov,0 is trained. Using this initialization, DAgger iteratively collects additional training examples over episodes from a mixture of the expert and novice policies. During a given episode, the combined expert-and-novice system interacts with the environment under the supervision of a decision rule. The decision rule decides at every time-step whether the novice's or the expert's choice of action is used to interact with the environment (Figure 1).

Fig. 1: Flowchart for action selection for DAgger and DAgger variants, where the Decision Rule differs between approaches. The expert πexp(ot) and novice πnov(ot) each propose an action (aexp,t and anov,t); the Decision Rule selects the action at applied to the Environment, which returns the next observation ot.

Algorithm 1 DAGGER
1: procedure DAGGER(DR(·))
2:   Initialize D ← ∅
3:   Initialize πnov,1
4:   for epoch i = 1 : K do
5:     Sample T-step trajectories with at = DR(ot)
6:     Get Di = {(s, πexp(s))} of states visited
7:     Aggregate datasets: D ← D ∪ Di
8:     Train πnov,i+1 on D

The observations received during the episodes of an epoch and the expert's choice of corresponding actions make up a new dataset called Di. The new dataset of training examples is combined with the previous sets, D ← D ∪ Di, and the novice policy is then re-trained on D. The DAGGER algorithm is presented in Algorithm 1.

By allowing the novice to act, the combined system explores parts of the state-space further from the nominal trajectories of the expert. In querying the expert in these parts of the state-space, the novice is able to learn a more robust policy. However, allowing the novice to always act risks the possibility of encountering an unsafe state, which can be costly in real-world experiments. The vanilla DAGGER algorithm and SAFEDAGGER balance this trade-off by their choice of decision rules.
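To make the interaction of Figure 1 and the loop of Algorithm 1 concrete, the following is a minimal Python sketch of a DAgger-style training loop with a pluggable decision rule. It assumes a Gym-style environment interface and hypothetical `expert`, `train_policy`, and `decision_rule` callables; these names are illustrative and do not come from the paper.

```python
def dagger(env, expert, train_policy, decision_rule, init_dataset,
           n_epochs=10, horizon=100):
    """Generic DAgger loop: a decision rule picks who acts; the expert always labels."""
    dataset = list(init_dataset)          # D0, demonstrations collected by the expert
    novice = train_policy(dataset)        # initial novice policy pi_nov trained on D0
    for epoch in range(n_epochs):
        obs = env.reset()
        for t in range(horizon):
            a_exp = expert(obs)                              # expert label for o_t
            a_nov = novice(obs)                              # novice suggestion for o_t
            dataset.append((obs, a_exp))                     # aggregate: D <- D U {(o_t, a_exp,t)}
            action = decision_rule(obs, a_nov, a_exp, epoch) # who acts this time-step (Fig. 1)
            obs, reward, done, info = env.step(action)
            if done:
                obs = env.reset()
        novice = train_policy(dataset)                       # retrain pi_nov,i+1 on D
    return novice
```

The decision rule is the only piece that changes between the DAgger variants discussed next.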
Under the vanilla DAGGER decision rule (Algorithm 2), the expert's action is chosen with probability βi ∈ [0, 1], where i denotes the DAgger epoch. If βi = λβi−1 for some λ ∈ (0, 1), then the novice takes increasingly more actions each epoch. As the novice is given more training labels from previous epochs, it is allowed greater autonomy in exploring the state-space. The vanilla DAGGER decision rule does not take into account any similarity measure between the novice's and the expert's choice of action. Hence, even if the novice suggests a highly unsafe action, vanilla DAGGER allows the novice to act with probability (1 − βi).

Algorithm 2 VANILLADAGGER Decision Rule
1: procedure DR(ot, i, β0, λ)
2:   anov,t ← πnov,i(ot)
3:   aexp,t ← πexp(ot)
4:   βi ← λ^i β0
5:   z ∼ Uniform(0, 1)
6:   if z ≤ βi then
7:     return aexp,t
8:   else
9:     return anov,t

Algorithm 3 SAFEDAGGER* Decision Rule
1: procedure DR(ot, τ)
2:   anov,t ← πnov,i(ot)
3:   aexp,t ← πexp(ot)
4:   if ∥anov,t − aexp,t∥ ≤ τ then
5:     return anov,t
6:   else
7:     return aexp,t

The decision rule employed by SAFEDAGGER, presented in Algorithm 3 and referred to as SAFEDAGGER*, allows the novice to act if the distance between the actions is less than some chosen threshold τ [11].¹

¹ To reduce the number of expert queries, SAFEDAGGER approximates the SAFEDAGGER* decision rule via a deep policy that determines whether or not the novice policy is likely to deviate from the reference policy. Unlike SAFEDAGGER, we are not concerned with minimizing expert queries. Hence, we compare to the SAFEDAGGER* decision rule directly, as opposed to the approximation.

An ideal decision rule would allow the novice to act if there is a sufficiently low probability that the system can transition to an unsafe state. If the combined system is currently near an unsafe state, the tolerable perturbation from the expert's choice of action is smaller than when the system is far from unsafe states. Hence, the single threshold τ employed in SAFEDAGGER* is either too conservative when the system is far from unsafe states or too relaxed when near them. To approximate the ideal decision rule in a model-free manner, we propose considering the distance between the novice's and expert's actions as well as the entropy in the novice policy. To estimate the uncertainty of the novice policy, we utilize Bayesian deep learning.

B. Bayesian Approximation via Dropout

To overcome the fact that deep learning lacks the ability to reason about model uncertainty [13], Gal et al. have worked towards approximating Bayesian models with neural networks through dropout [14]. By incorporating dropout at every weight layer in a network, an approximation of a Gaussian process is obtained. Given a policy trained with dropout and an input observation, we query the network N times to obtain a distribution over actions, using randomly sampled dropout masks. For more details, we guide the reader to [14], [15].
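As a concrete illustration of this procedure, the NumPy sketch below performs N stochastic forward passes through a small fully connected policy with dropout kept active at query time, yielding an empirical distribution over actions for a single observation. The toy layer sizes, ReLU activations, and random placeholder weights are assumptions for illustration only, not the networks used in the paper.

```python
import numpy as np

def mc_dropout_actions(obs, weights, biases, n_samples=50, drop_prob=0.05, rng=None):
    """Query a dropout-trained MLP N times with fresh dropout masks to obtain an
    empirical action distribution for one observation (Monte Carlo dropout)."""
    rng = np.random.default_rng() if rng is None else rng
    samples = []
    for _ in range(n_samples):
        h = obs
        for i, (W, b) in enumerate(zip(weights, biases)):
            h = h @ W + b
            if i < len(weights) - 1:                       # hidden layers only
                h = np.maximum(h, 0.0)                     # ReLU
                mask = rng.random(h.shape) >= drop_prob    # fresh random dropout mask
                h = h * mask / (1.0 - drop_prob)           # inverted-dropout scaling
        samples.append(h)                                   # final layer output is the action
    return np.stack(samples)                                # shape: (n_samples, action_dim)

# Toy usage with placeholder weights (20-dim observation, 6-dim action):
rng = np.random.default_rng(0)
W = [rng.normal(size=(20, 64)), rng.normal(size=(64, 6))]
b = [np.zeros(64), np.zeros(6)]
actions = mc_dropout_actions(rng.normal(size=20), W, b, n_samples=50)
print(actions.mean(axis=0), actions.std(axis=0))  # mean action and per-dimension spread
```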
By invoking dropout, our novice policy approximates a Gaussian process that will produce a low-entropy distribution over actions, centered around the expert's action, if the input observation is well represented in D. Further, the novice policy will produce high-entropy distributions over actions if the input observation is unlike what has been labeled by the expert in D.

III. DROPOUTDAGGER

We present the DROPOUTDAGGER decision rule, in which we choose the mean action of the novice only if its distribution over actions has sufficient probability mass around the action suggested by the expert. The algorithm, described in Algorithm 4, is parameterized by τ, which specifies the size of a ball around the expert's action, and p, which is a threshold for the probability mass we desire to be inside this ball if the novice is allowed to act. An example computation of this decision rule is shown in Figure 2. We approximate this distribution over actions by first requiring that the neural-network policy is trained on dataset D using dropout, and then querying the network multiple times with the current observation and random dropout masks.

Algorithm 4 DROPOUTDAGGER Decision Rule
1: procedure DR(ot, τ, p, N)
2:   anov,t,j ← πnov,i(ot) for j ∈ {1, ..., N}
3:   aexp,t ← πexp(ot)
4:   p̂ ← (1/N) Σ_{j=1}^{N} 1{∥aexp,t − anov,t,j∥ ≤ τ}
5:   if p̂ ≥ p then
6:     return (1/N) Σ_{j=1}^{N} anov,t,j
7:   else
8:     return aexp,t

Fig. 2: Example computation of the DROPOUTDAGGER decision rule governing whether the expert's or novice's action is chosen at a given time-step. Each panel plots the novice action samples, the mean novice action, and the expert action in a two-dimensional action space a = [a1, a2]. (a) Well-represented state. (b) Poorly-represented state. The novice policy is queried N times to estimate the probability p̂ that the novice's action falls within a ball of radius τ centered at the expert action. If p̂ ≥ p, for chosen threshold p, then we choose the mean novice action. Otherwise, we choose the expert's action. Figure 2a shows an example state that is well-represented in D. Because we have many expert labels for this state, the novice policy is low-entropy and centered near the expert's action. Hence, both the DROPOUTDAGGER and SAFEDAGGER* decision rules would allow the novice to act. However, in Figure 2b, the state is poorly-represented in D, and the novice policy is consequently high-entropy. The DROPOUTDAGGER decision rule would not allow the novice to act if p is appropriately chosen, but the SAFEDAGGER* decision rule still would. Vanilla DAGGER would choose between the two with a weighted coin-flip, with no regard to the similarity of the actions.

As previously stated, an ideal decision rule chooses the novice's action in 'low-risk states,' and chooses the expert's action in 'high-risk states.' By using the DROPOUTDAGGER decision rule, we allow the novice to act in familiar states that are well represented in D, but hand control back to the expert when the combined system enters an unfamiliar region of the state-space.

A comparison between the vanilla DAGGER, SAFEDAGGER*, and DROPOUTDAGGER decision rules can be seen in Figures 2a and 2b. Vanilla DAGGER leaves the choice of action up to a weighted coin-flip, with no regard to the current state of the system. SAFEDAGGER* must be overly restrictive in safer, familiar regions of the state-space in order to sufficiently guarantee safety in unsafe regions. DROPOUTDAGGER is able to utilize the additional information provided by the distribution over the novice's actions in order to allow the novice to control the system in familiar parts of the state-space, and to hand control back to the expert in unfamiliar parts of the state-space.

By appropriately choosing the hyper-parameters p and τ, we satisfy the dual objectives of allowing the novice to act only if its distribution over actions is both sufficiently low-entropy and sufficiently close to the expert's.
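The following is a minimal sketch of the decision rule in Algorithm 4, assuming a `sample_novice_actions` helper such as the Monte Carlo dropout sampler sketched in Section II-B; the function and argument names are illustrative, not from the paper.

```python
import numpy as np

def dropout_dagger_rule(obs, expert_policy, sample_novice_actions,
                        tau=0.3, p_min=0.6, n_samples=50):
    """Algorithm 4: act with the mean novice action only if enough probability
    mass of the novice's action distribution lies within tau of the expert action."""
    a_exp = expert_policy(obs)                          # expert's suggested action
    samples = sample_novice_actions(obs, n_samples)     # array of shape (n_samples, action_dim)
    dists = np.linalg.norm(samples - a_exp, axis=1)     # ||a_exp - a_nov,j|| for each sample
    p_hat = np.mean(dists <= tau)                       # estimated mass inside the tau-ball
    if p_hat >= p_min:
        return samples.mean(axis=0)                     # novice acts with its mean action
    return a_exp                                        # otherwise hand control to the expert
```

Note that with `p_min = 0` the rule always lets the novice act, and with `tau = 0` it effectively always defers to the expert, matching the limiting cases discussed next.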
The dropout probability d should be chosen to reflect the epistemic uncertainty, arising from finite demonstration data, and the aleatoric uncertainty, arising from the stochastic environment. The probability d should be selected either by grid search, to minimize loss on test demonstration data, or optimized using 'Concrete Dropout,' which uses a continuous relaxation of discrete dropout masks [16]. If we set d to zero, DROPOUTDAGGER effectively reduces to SAFEDAGGER*. If we set τ to zero, the algorithm reduces to behavior cloning. If we set p to zero, the algorithm reduces to one in which the expert merely labels the data, but does not ever influence the system during an episode.

IV. EXPERIMENTS

We demonstrate that DROPOUTDAGGER is able to achieve expert-level performance, while maintaining safety during training, in two experimental domains. For each episode, we use the average total reward of the combined expert-novice system as a measure of 'safety performance,' and the average total reward of the novice alone as a measure of 'learning performance.' An algorithm is 'safe' if it demonstrates safety performance on par with that of BEHAVIORCLONING, in which only the expert acts. Learning performance is assessed by the rate at which the novice achieves expert-level performance.

A. MuJoCo HalfCheetah

The MuJoCo HalfCheetah-v1 domain is an OpenAI Gym environment with observations in R20 and actions in R6 [17]. The environment provides reward proportional to the horizontal distance traveled. An optimal policy propels the half-cheetah robot into a steady run, going as far forward as possible in the time it has. First, we train a multilayer perceptron (MLP) policy to act as the expert, and then compare the performance of DROPOUTDAGGER to other DAGGER variants.

We compare two scenarios. First, the novice is given the same observation as the expert. Second, the novice sees the observation corrupted by diagonal-Gaussian noise with σ = 0.1, representing settings where the expert and novice see different observations. The added noise increases aleatoric uncertainty, which degrades the performance of naive imitation learning approaches.

The optimal policy is trained using the TRPO hyperparameters summarized in Table I [18], using the rllab implementation of TRPO [19], [20].

TABLE I: Hyperparameters used to train the expert policy for the HalfCheetah domain.
  Parameter                 Value      Unit
  Algorithm                 TRPO
  MLP Hidden Layer Sizes    (64, 64)   neurons
  γ                         0.99
  λ                         0.97
  TRPO Max Step             0.01
  Batch Size                25000      timesteps/epoch
  Max. Episode Length       100        timesteps
  Environment Seed          1

The MLP representing the novice policy has two hidden layers with 64 hidden units each, followed by a hidden layer with 32 hidden units. When training the neural network on a given dataset D, an ADAM optimizer is used with a learning rate of 10^-3, β1 = 0.9, and β2 = 0.999. Weights are l2-regularized with regularization weight 10^-5. The DROPOUTDAGGER policy is trained with a dropout probability d of 0.05. The network is trained for 100 epochs, with a mini-batch size of 32 (a sketch of this training setup is given below).

We compare DROPOUTDAGGER to BEHAVIORCLONING, in which the decision rule always chooses the expert's action, EXPERTLABELSONLY, in which the decision rule always chooses the novice's action, vanilla DAGGER, and SAFEDAGGER*. When testing vanilla DAGGER, we use β0 = 1 and βi = 0.63βi−1, which brings β down to 0.01 by the tenth DAgger epoch. Hyperparameters of τ = 0.3 and p = 0.6 are used for DROPOUTDAGGER, and τ = 0.6 is used for SAFEDAGGER*, chosen by grid search.

The performance of each policy on the environment is averaged over 50 episodes to estimate safety and learning performance. Figures 3 and 4 show that DROPOUTDAGGER does not compromise the safety of the combined expert-novice system, while being able to train a well-performing novice policy at a rate comparable to other variants of DAgger.
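For illustration, here is a PyTorch-style sketch of the novice network and training configuration described earlier in this subsection (hidden layers of 64, 64, and 32 units, dropout d = 0.05, Adam with learning rate 10^-3, and an l2 penalty of 10^-5). The use of `nn.Sequential`, dropout after every hidden layer, Adam's `weight_decay` as the l2 regularizer, and the placeholder dataset tensors are assumptions rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn
from torch.utils.data import TensorDataset, DataLoader

obs_dim, act_dim, drop_p = 20, 6, 0.05
novice = nn.Sequential(
    nn.Linear(obs_dim, 64), nn.ReLU(), nn.Dropout(drop_p),
    nn.Linear(64, 64), nn.ReLU(), nn.Dropout(drop_p),
    nn.Linear(64, 32), nn.ReLU(), nn.Dropout(drop_p),
    nn.Linear(32, act_dim),
)
optimizer = torch.optim.Adam(novice.parameters(), lr=1e-3,
                             betas=(0.9, 0.999), weight_decay=1e-5)
loss_fn = nn.MSELoss()

def fit(obs, expert_actions, epochs=100, batch_size=32):
    """Regress expert actions on the aggregated dataset D (behavior-cloning loss)."""
    loader = DataLoader(TensorDataset(obs, expert_actions),
                        batch_size=batch_size, shuffle=True)
    novice.train()                      # dropout active during training
    for _ in range(epochs):
        for o, a in loader:
            optimizer.zero_grad()
            loss = loss_fn(novice(o), a)
            loss.backward()
            optimizer.step()
```

Keeping the network in `train()` mode at query time leaves dropout active, which is what the Monte Carlo sampling of Section II-B requires.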
Since BEHAVIORCLONING never chooses the novice action, it is, unsurprisingly, perfectly safe. However, we see that all other algorithms except DROPOUTDAGGER compromise safety. Since the dataset generated by BEHAVIORCLONING contains autocorrelated samples drawn only from nominal expert trajectories, it has poor learning performance. DROPOUTDAGGER both maintains safety performance and achieves a learning performance comparable to all other DAGGER variants.

We see in Figure 4 that adding observation noise adversely affects the learning performance of all algorithms. This consequently adversely affects the safety performance of vanilla DAGGER, SAFEDAGGER*, and of course EXPERTLABELSONLY, but does not compromise the safety performance of DROPOUTDAGGER.

Fig. 3: Performance of variants of the DAGGER algorithm on the vanilla MuJoCo HalfCheetah-v1 environment, averaged over 50 episodes (average total return versus DAgger epoch; left panel: safety performance, right panel: learning performance). As we can see, DROPOUTDAGGER makes no compromise to the safety of the combined system, while the novice appears to learn a well-performing policy as quickly as with other algorithms.

Fig. 4: Performance of variants of the DAGGER algorithm on a variant of the MuJoCo HalfCheetah-v1 environment with noisy observations, averaged over 50 episodes. Increased uncertainty compromises the safety of all variants of DAGGER except DROPOUTDAGGER and BEHAVIORCLONING. While DROPOUTDAGGER continues to provide safety, its learning performance remains comparable to other algorithms.

B. Dubins Car Lidar

In the Dubins Car Lidar domain, we demonstrate that DROPOUTDAGGER can safely learn a policy in spite of high aleatoric uncertainty. This environment, depicted in Figure 5, consists of a simple Dubins car that navigates out of a room. A Dubins path is represented by a circular arc of radius R, a straight path, and a second circular arc of radius R. Given any two poses (x, y, θ) sufficiently far apart with some achievable turning radius, a Dubins path can take a Dubins car from the first pose to the second.

Fig. 5: The Dubins Car Lidar environment.

A finite-state controller acts as the expert, following a Dubins path at a fixed velocity from the initial state to a goal state pointing out of the room through the exit.

TABLE II: Dubins Car Lidar Parameters
  Parameter               Value   Unit
  Room Height/Width       100     m
  Exit Width              20      m
  Lidar Resolution        100     rays
  Lidar Max. Range        100     m
  σ1                      10      m
  σ2                      10      m/m
  Timestep                0.1     s
  Max. Angular Velocity   ±1.0    rad/s

The expert policy is given access to its exact pose, but the novice policy only has access to noisy 'lidar' measurements. These are range measurements to the nearest obstacle along 100 equally spaced rays propagating from the center of the Dubins car (Figure 5). The following noise model gives the corrupted measurement x̂:

x̃ = z1 + (1 + z2) x          (1)
x̂ = max(min(x̃, L), 0)        (2)

where x is the original measurement, zi ∼ N(0, σi) for i ∈ {1, 2}, and L is the maximum lidar range.
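For concreteness, here is a direct NumPy transcription of the noise model in Equations (1) and (2), with default parameter values taken from Table II; the function name and vectorized interface are illustrative.

```python
import numpy as np

def corrupt_lidar(x, sigma1=10.0, sigma2=10.0, max_range=100.0, rng=None):
    """Apply the noise model of Eqs. (1)-(2) to lidar ranges x:
    additive noise z1, multiplicative noise z2, then clipping to [0, L]."""
    rng = np.random.default_rng() if rng is None else rng
    z1 = rng.normal(0.0, sigma1, size=np.shape(x))
    z2 = rng.normal(0.0, sigma2, size=np.shape(x))
    x_tilde = z1 + (1.0 + z2) * x             # Eq. (1)
    return np.clip(x_tilde, 0.0, max_range)   # Eq. (2): max(min(x_tilde, L), 0)
```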
We train an MLP to map the lidar measurements to the car's angular velocity. The environment parameters are summarized in Table II. The algorithms and optimizer parameters are identical to those in Section IV-A.

In Figure 6, we see that DROPOUTDAGGER maintains perfect safety and good learning performance. Under high aleatoric uncertainty, the BEHAVIORCLONING learning performance deteriorates, thus highlighting the importance of exploration for robustness.

Fig. 6: Performance of variants of the DAGGER algorithm on the Dubins Car Lidar environment, averaged over 50 episodes. Even with high aleatoric uncertainty, DROPOUTDAGGER has the best learning and safety performance. BEHAVIORCLONING and SAFEDAGGER* do not compromise safety, but exhibit worse learning performance.

Figure 7 shows the learning performance of various choices of hyperparameters for DROPOUTDAGGER. Though all variants enjoy perfect safety performance, we observe that reducing τ or increasing p makes the algorithm more conservative and reduces learning performance to that of BEHAVIORCLONING, as expected. It is interesting to note that increasing the dropout probability d from 0.05 to 0.1 appears to reduce the sensitivity of the learning performance to the choice of τ and p.

Fig. 7: Comparison of the learning performance of various hyperparameters used for the DROPOUTDAGGER algorithm, with (τ, p) ∈ {(0.1, 0.3), (0.1, 0.6), (0.3, 0.3), (0.3, 0.6), (0.6, 0.6)} for each of d = 0.05 and d = 0.1.

V. DISCUSSION

Naive algorithms like BEHAVIORCLONING rely on a large set of demonstrations to provide a good dataset for learning. DROPOUTDAGGER extends naive imitation learning, adding the ability to safely explore the state-space. Using the novice's action distribution, the DROPOUTDAGGER decision rule allows the novice to act when in familiar regions of the state-space, but returns control to the expert when entering unfamiliar regions. Our experiments show that DROPOUTDAGGER allows the combined system to safely gather data and explore poorly represented states. DROPOUTDAGGER demonstrates no compromise to safety and learning performance comparable to other algorithms.

Though DROPOUTDAGGER exhibits a good mix of safety and learning performance, the algorithm still depends on three hyperparameters. In particular, we observe that the choice of dropout probability d can affect the sensitivity of learning performance to these parameters. Future work includes exploring the optimization of the dropout mask using Concrete Dropout at every epoch [16]. Additionally, since all the methods described here only apply to continuous actions, we hope to extend the presented tools to discrete and hybrid action spaces.

REFERENCES

[1] J. Kober and J. Peters, "Imitation and reinforcement learning," IEEE Robotics & Automation Magazine, vol. 17, no. 2, pp. 55–62, June 2010.
[2] B. D. Argall, S. Chernova, M. Veloso, and B. Browning, "A survey of robot learning from demonstration," Robotics and Autonomous Systems, vol. 57, no. 5, pp. 469–483, 2009.
[3] D. Amodei, C. Olah, J. Steinhardt, P. Christiano, J. Schulman, and D. Mané, "Concrete problems in AI safety," Available on arXiv:1606.06565, 2016.
[4] B. Price and C. Boutilier, "Accelerating reinforcement learning through implicit imitation," Journal of Artificial Intelligence Research, vol. 19, pp. 569–629, 2003.
[5] S. Schaal, "Learning from demonstration," in Advances in Neural Information Processing Systems (NIPS), 1997, pp. 1040–1046.
[6] H. Daumé, J. Langford, and D. Marcu, "Search-based structured prediction," Machine Learning, vol. 75, no. 3, pp. 297–325, 2009.
[7] S. Ross and D. Bagnell, "Efficient reductions for imitation learning," in International Conference on Artificial Intelligence and Statistics, 2010, pp. 661–668.
[8] S. Levine and V. Koltun, "Guided policy search," in International Conference on Machine Learning (ICML), 2013, pp. 1–9.
[9] B. Kim and J. Pineau, "Maximum mean discrepancy imitation learning," in Robotics: Science and Systems, 2013.
[10] M. Laskey, S. Staszak, W. Y.-S. Hsieh, J. Mahler, et al., "SHIV: Reducing supervisor burden in DAgger using support vectors for efficient learning from demonstrations in high dimensional state spaces," in IEEE International Conference on Robotics and Automation (ICRA), 2016, pp. 462–469.
[11] J. Zhang and K. Cho, "Query-efficient imitation learning for end-to-end autonomous driving," Available on arXiv:1605.06450, 2016.
[12] S. Ross, G. J. Gordon, and D. Bagnell, "A reduction of imitation learning and structured prediction to no-regret online learning," in International Conference on Artificial Intelligence and Statistics, 2011, pp. 627–635.
[13] R. Oliveira, P. Tabacof, and E. Valle, "Known unknowns: Uncertainty quality in Bayesian neural networks," Available on arXiv:1612.01251, 2016.
[14] Y. Gal and Z. Ghahramani, "Dropout as a Bayesian approximation: Representing model uncertainty in deep learning," Available on arXiv:1506.02142, 2015.
[15] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: A simple way to prevent neural networks from overfitting," Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
[16] Y. Gal, J. Hron, and A. Kendall, "Concrete Dropout," 2017.
[17] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba, "OpenAI Gym," Available on arXiv:1606.01540, 2016.
[18] R. Islam, P. Henderson, M. Gomrokchi, and D. Precup, "Reproducibility of benchmarked deep reinforcement learning tasks for continuous control," Available on arXiv:1708.04133, 2017.
[19] Y. Duan, X. Chen, R. Houthooft, J. Schulman, and P. Abbeel, "Benchmarking deep reinforcement learning for continuous control," in International Conference on Machine Learning (ICML), 2016, pp. 1329–1338.
[20] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz, "Trust region policy optimization," in International Conference on Machine Learning (ICML), 2015, pp. 1889–1897.