Data-efficient Deep Reinforcement Learning for Dexterous Manipulation

Ivaylo Popov, Nicolas Heess, Timothy Lillicrap, Roland Hafner, Gabriel Barth-Maron, Matej Vecerik, Thomas Lampe, Yuval Tassa, Tom Erez, Martin Riedmiller
DeepMind

Abstract—Deep learning and reinforcement learning methods have recently been used to solve a variety of problems in continuous control domains. An obvious application of these techniques is dexterous manipulation tasks in robotics, which are difficult to solve using traditional control theory or hand-engineered approaches. One example of such a task is to grasp an object and precisely stack it on another. Solving this difficult and practically relevant problem in the real world is an important long-term goal for the field of robotics. Here we take a step towards this goal by examining the problem in simulation and providing models and techniques aimed at solving it. We introduce two extensions to the Deep Deterministic Policy Gradient algorithm (DDPG), a model-free Q-learning based method, which make it significantly more data-efficient and scalable. Our results show that by making extensive use of off-policy data and replay, it is possible to find control policies that robustly grasp objects and stack them. Further, our results hint that it may soon be feasible to train successful stacking policies by collecting interactions on real robots.

I. INTRODUCTION

Dexterous manipulation is a fundamental challenge in robotics. Researchers have long been seeking a way to enable robots to robustly and flexibly interact with fixed and free objects of different shapes, materials, and surface properties in the context of a broad range of tasks and environmental conditions. Such flexibility is very difficult to achieve with manually designed controllers. The recent resurgence of neural networks and "deep learning" has inspired hope that these methods will be as effective in the control domain as they are for perception. And indeed, in simulation, recent work has used neural networks to learn solutions to a variety of control problems from scratch (e.g. [7, 20, 32, 31, 11, 17]).

While the flexibility and generality of learning approaches is promising for robotics, these methods typically require a large amount of data that grows with the complexity of the task. What is feasible on a simulated system, where hundreds of millions of control steps are possible [23], does not necessarily transfer to real robot applications due to unrealistic learning times. One solution to this problem is to restrict the generality of the controller by incorporating task-specific knowledge, e.g. in the form of dynamic movement primitives [30], or in the form of strong teaching signals, e.g. kinesthetic teaching of trajectories [24]. Recent works have had some success learning flexible neural network policies directly on real robots (e.g. [18, 5, 39]), but tasks as complex as grasping-and-stacking remain daunting.

An important issue for the application of learning methods in robotics is to understand how to make the best use of collected data, which can be expensive to obtain, both in terms of time and money. To keep learning times reasonably low even in complex scenarios, it is crucial to find a practical compromise between the generality of the controller and the necessary restrictions of the task setup. This is the gap that we aim to fill in this paper: exploring the potential of a learning approach that keeps prior assumptions low while keeping data consumption within reasonable bounds.
Simultaneously, we are interested in approaches that are broadly applicable, robust, and practical.

In this paper we provide a simulation study that investigates the possibility of learning complex manipulation skills end-to-end with a general purpose model-free deep reinforcement learning algorithm. The express goal of this work is to assess the feasibility of performing analogous end-to-end learning experiments on real robotics hardware and to provide guidance with respect to the choice of learning algorithm, the experimental setup, and the performance that we can hope to achieve.

The task which we consider to this end is that of picking up a Lego brick from the table and stacking it onto a second nearby brick using a robotic arm with 9 degrees of freedom (DoF), six in the arm and three for the fingers in the gripper. In addition to having a high-dimensional state and action space, the task exemplifies several of the challenges that are encountered in real-world manipulation problems. Firstly, it involves contact-rich interactions between the robotic arm and two freely moving objects. Secondly, it requires mastering several sub-skills (reaching, grasping, and stacking). Each of these sub-skills is challenging in its own right, requiring both precision (for instance, successful stacking requires accurate alignment of the two bricks) and robust generalization over a large state space (e.g. different initial positions of the bricks and the initial configuration of the arm). Finally, there exist non-trivial and long-ranging dependencies between the solutions for different subtasks: for instance, the ability to successfully stack the brick in the later part of the task depends critically on having picked up the brick in a sensible way beforehand.

Fig. 1: Simulation rendering of the Lego task in different completion stages (also corresponding to different subtasks): (a) starting state, (b) reaching, (c) grasping (also the StackInHand starting state), and (d) stacking.

On the algorithm side we build on the Deep Deterministic Policy Gradient (DDPG; [20]), a general purpose model-free reinforcement learning algorithm for continuous action spaces, and extend it in two ways (section V): firstly, we improve the data efficiency of the algorithm by scheduling updates of the network parameters independently of interactions with the environment. Secondly, we overcome the computational and experimental bottlenecks of single-machine single-robot learning by introducing a distributed version of DDPG which allows data collection and network training to be spread out over multiple computers and robots.

We further propose two broadly applicable strategies that allow us to inject prior knowledge into the learning process in order to help reliably find solutions to complex tasks and further reduce the amount of environmental interaction. The first of these strategies is a recipe for designing effective shaping rewards for compositional tasks (section VI), while the second (section VII) uses a suitable bias in the distribution of initial states to achieve an effect akin to a curriculum or a form of apprenticeship learning.

In combination these contributions allow us to reliably learn robust policies for the full task from scratch in less than 10 million environment transitions. This corresponds to less than 10 hours of interaction time on 16 robots, thus entering a regime that no longer seems unrealistic with modern experimental setups.
In addition, when states from successful trajectories are used as the start states for learning trials, the full task can be learned with 1 million transitions (i.e. less than 1 hour of interaction on 16 robots). To our knowledge our results provide the first demonstration of solving complex manipulation problems involving multiple freely moving objects. They are also encouraging as a sensible lower bound for real-world experiments, suggesting that it may indeed be possible to learn such non-trivial manipulation skills directly on real robots.

II. RELATED WORK

Reinforcement learning approaches solve tasks through repeated interactions with the environment guided by a reward signal that indicates the success or failure of a trial. A wide variety of techniques have been developed that exploit this idea [34], with a broad distinction often made between value-based and policy search methods. While the former estimate and improve a value function, policy search methods directly optimize the parameters of a policy to maximize cumulative reward. The latter have been routinely applied in robotics, in part because they straightforwardly handle continuous and high-dimensional action spaces [3]; applications include manipulation [26, 13, 25, 37, 18, 5, 39, 8], locomotion, e.g. [16, 21], and a range of other challenges such as helicopter flight [1].

One limitation that has hampered policy search methods is that they can scale poorly with the number of parameters that need to be estimated. This limitation, together with other constraints of working with real robotics hardware, has led research to focus on manually engineered and restrictive features and movement representations, particularly trajectory-based ones such as spline-based dynamic movement primitives. Simplifying the policy space can make learning on real hardware tractable, but it also limits the kinds of problems that can be solved. In order to solve a problem such as picking up and manipulating an object, more expressive function classes are likely to be needed.

The use of rich and flexible function approximators such as neural networks in RL dates back many years, e.g. [38, 35, 12, 10]. In the last few years there has been a resurgence of interest in end-to-end training of neural networks for challenging control problems, and several algorithms, both value- and policy-focused, have been developed and applied to challenging problems including continuous control, e.g. [22, 23, 6, 7, 20, 32, 31, 11, 17]. These methods work well with large neural networks and can learn directly from raw visual input streams. With few exceptions, e.g. [10, 5, 18, 39], they have been considered too data-inefficient for robotics applications.

One exception is the family of guided policy search (GPS) methods [18, 39]. These have recently been applied to several manipulation problems and employ a teacher algorithm to locally optimize trajectories which are then summarized by a neural network policy. GPS algorithms gain data-efficiency by employing aggressive local policy updates and by performing extensive training of their neural network policy before collecting more real-world data. The teacher can use model-based [18] or model-free [39] trajectory optimization. The former can struggle in situations with strong discontinuities in the dynamics, and both rely on access to a well defined and fully observed state space.

Model-free value function approaches offer an alternative way to address the issue of data-efficiency in robotics.
Such approaches enable effective reuse of data and do not require full access to the state space or to a model of the environment. One recent work [5], closely related to the ideas followed in this paper, provides a proof-of-concept demonstration that value-based methods using neural network approximators can be used for robotic manipulation in the real world. This work applied a Q-learning approach [7] to a door opening task in which a robotic arm fitted with an unactuated hook needed to reach a handle and pull a door to a given angle. The starting state of the arm and door were fixed across trials and the reward structure was smooth and structured, with one term expressing the distance from the hook to the handle and a second term expressing the distance of the door to the desired angle. This task was learned in approximately 2 hours across 2 robots pooling their experience into a shared replay buffer.

This work thus made use of a complementary solution to the need for large amounts of interaction data: the use of experimental rigs that allow large scale data collection, e.g. [27], including the use of several robots from which experience is gathered in parallel [19, 5, 39]. This can be combined with single-machine or distributed training depending on whether the bottleneck is primarily one of data collection or also one of network training [23].

Finally, the use of demonstration data has played an important role in robot learning, both as a means to obtain suitable cost functions [2, 14, 4, 8] and as a way to bootstrap and thus speed up learning. For the latter, kinesthetic teaching is widely used [26, 13, 25, 39]. It integrates naturally with trajectory-based movement representations, but the need for a human operator to be able to guide the robot through the full movement can be limiting. Furthermore, when the policy representation is not trajectory-based (e.g. direct torque control with neural networks) the use of human demonstration trajectories may be less straightforward (e.g. since the associated controls are not available).

III. BACKGROUND

In this section we briefly formalize the learning problem, summarize the DDPG algorithm, and explain its relationship to several other Q-function based reinforcement learning (RL) algorithms.

The RL problem consists of an agent interacting with an environment in a sequential manner to maximize the expected sum of rewards. At time $t$ the agent observes the state $x_t$ of the system and produces a control $u_t = \pi(x_t; \theta)$ according to policy $\pi$ with parameters $\theta$. This leads the environment to transition to a new state $x_{t+1}$ according to the dynamics $x_{t+1} \sim p(\cdot \mid x_t, u_t)$, and the agent receives a reward $r_t = r(x_t, u_t)$. The goal is to maximize the expected sum of discounted rewards

$$J(\theta) = \mathbb{E}_{\tau \sim \rho_\theta}\!\left[\sum_t \gamma^{t-1} r(x_t, u_t)\right],$$

where $\rho_\theta$ is the distribution over trajectories $\tau = (x_0, u_0, x_1, u_1, \dots)$ induced by the current policy:

$$\rho_\theta(\tau) = p(x_0) \prod_{t>0} p(x_t \mid x_{t-1}, \pi(x_{t-1}; \theta)).$$

DPG [33] is a policy gradient algorithm for continuous action spaces that improves the deterministic policy function $\pi$ via backpropagation of the action-value gradient from a learned approximation to the Q-function. Specifically, DPG maintains a parametric approximation $Q(x_t, u_t; \phi)$ to the action-value function $Q^\pi(x_t, u_t)$ associated with $\pi$, and $\phi$ is chosen to minimize

$$\mathbb{E}_{(x_t, u_t, x_{t+1}) \sim \bar\rho}\left[\left(Q(x_t, u_t; \phi) - y_t\right)^2\right], \qquad (1)$$

where $y_t = r(x_t, u_t) + \gamma Q(x_{t+1}, \pi(x_{t+1}))$. $\bar\rho$ is usually close to the marginal transition distribution induced by $\pi$, but often not identical.
For instance, during learning $u_t$ may be chosen to be a noisy version of $\pi(x_t; \theta)$, e.g. $u_t = \pi(x_t; \theta) + \epsilon$ with $\epsilon \sim \mathcal{N}(0, \sigma^2)$, in which case $\bar\rho$ is the transition distribution induced by this noisy policy. The policy parameters $\theta$ are then updated according to

$$\Delta\theta \propto \mathbb{E}_{(x,u) \sim \bar\rho}\!\left[\frac{\partial}{\partial u} Q(x, u; \phi)\, \frac{\partial}{\partial \theta} \pi(x; \theta)\right]. \qquad (2)$$

DDPG [20] is an improvement of the original DPG algorithm adding experience replay and target networks: experience is collected into a buffer and updates to $\theta$ and $\phi$ (eqs. 1, 2) are computed using mini-batch updates with random samples from this buffer. Furthermore, a second set of "target networks" is maintained with parameters $\theta'$ and $\phi'$. These are used to compute $y_t$ in eqn. (1) and their parameters are slowly updated towards the current parameters $\theta$, $\phi$. Both measures significantly improve the stability of DDPG.

DDPG bears a relation to several other recent model-free RL algorithms. The NAF algorithm [7], which has recently been applied to a real-world robotics problem [5], can be viewed as a DDPG variant where the Q-function is quadratic in the action, so that the optimal action can be recovered directly from the Q-function, making a separate representation of the policy unnecessary. DDPG and especially NAF are the continuous-action counterparts of DQN [22], a Q-learning algorithm that recently re-popularized the use of experience replay and target networks to stabilize learning with powerful function approximators such as neural networks.

DDPG, NAF, and DQN all interleave mini-batch updates of the Q-function (and the policy for DDPG) with data collection via interaction with the environment. These mini-batch based updates set DDPG and DQN apart from the otherwise closely related NFQ and NFQCA algorithms for discrete and continuous actions, respectively. NFQ [29] and NFQCA [9] employ the same basic update as DDPG and DQN; however, they are batch algorithms that perform updates less frequently, fully re-fitting the Q-function and the policy network after every episode with several hundred iterations of gradient descent using Rprop [28] and full-batch updates over the entire replay buffer. The aggressive training makes NFQCA data-efficient, but the full-batch updates can become impractical with large networks, large observation spaces, or when the number of training episodes is large. Finally, DPG can be seen as the deterministic limit of a particular instance of the stochastic value gradients (SVG) family [11], which also computes policy gradients via back-propagation of value gradients, but optimizes stochastic policies. The relationship between these algorithms can be summarized as follows:

                                              Discrete   Continuous
  Mini-batch learning, target networks        DQN        DDPG, NAF
  Full-batch learning with Rprop,
  parameter resetting                         NFQ        NFQCA

One appealing property of the above family of algorithms is that the use of a Q-function facilitates off-policy learning. This allows decoupling the collection of experience data from the updates of the policy and value networks, a desirable property given that experience is expensive to collect in a robotics setup. In this context, because neural network training is often slow, decoupling allows us to make many parameter update steps per step in the environment, ensuring that the networks are well fit to the data that is currently available.
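To make the updates above concrete, the following is a minimal PyTorch sketch of one DDPG update with replay samples and slowly tracking target networks. The network sizes, learning rates, and soft-update coefficient are illustrative assumptions, not the settings used in the paper.

```python
import torch
import torch.nn as nn

# Minimal sketch of one DDPG update (eqs. 1-2) with target networks.
# Dimensions, architectures and hyper-parameters are illustrative only.
obs_dim, act_dim, gamma, tau = 10, 3, 0.99, 0.001

def mlp(in_dim, out_dim):
    return nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(), nn.Linear(64, out_dim))

actor, critic = mlp(obs_dim, act_dim), mlp(obs_dim + act_dim, 1)
target_actor, target_critic = mlp(obs_dim, act_dim), mlp(obs_dim + act_dim, 1)
target_actor.load_state_dict(actor.state_dict())
target_critic.load_state_dict(critic.state_dict())
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

def ddpg_update(batch):
    # x: (B, obs_dim), u: (B, act_dim), r: (B, 1), x_next: (B, obs_dim)
    x, u, r, x_next = batch
    with torch.no_grad():                         # y_t from the slowly-moving targets
        q_next = target_critic(torch.cat([x_next, target_actor(x_next)], dim=-1))
        y = r + gamma * q_next
    critic_loss = ((critic(torch.cat([x, u], dim=-1)) - y) ** 2).mean()   # eq. (1)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    actor_loss = -critic(torch.cat([x, actor(x)], dim=-1)).mean()         # eq. (2) via backprop
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    for net, target in ((actor, target_actor), (critic, target_critic)):  # slow target tracking
        for p, p_t in zip(net.parameters(), target.parameters()):
            p_t.data.mul_(1 - tau).add_(tau * p.data)
```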
IV. TASK AND EXPERIMENTAL SETUP

The full task that we consider in this paper is to use the arm to pick up one Lego Duplo brick from the table and stack it onto the remaining brick. This "composite" task can be decomposed into several subtasks, including grasping and stacking. In our experiments we consider the full task as well as the two sub-tasks in isolation, as shown in the table below:

  Task         Starting state          Reward
  Grasp        Both bricks on table    Brick 1 above table
  StackInHand  Brick 1 in gripper      Bricks stacked
  Stack        Both bricks on table    Bricks stacked

In every episode the arm starts in a random configuration with the positioning of gripper and brick appropriate for the task of interest. We implement the experiments in a physically plausible simulation in MuJoCo [36], with the simulated arm closely matched to a real-world Jaco arm setup in our lab (Jaco is a robotic arm developed by Kinova Robotics). Episodes are terminated after 150 steps, with each step corresponding to 50 ms of physical simulation time. This means that the agent has 7.5 seconds to perform the task. Unless otherwise noted we give a reward of one upon successful completion of the task and zero otherwise.

The observation vector provided to the agent contains information about the angles and angular velocities of the 6 joints of the arm and 3 fingers of the gripper. In addition, we provide information about the position and orientation of the two bricks and the relative distances of the two bricks to the pinch position of the gripper, i.e. roughly the position where the fingertips would meet if the fingers are closed. The 9-dimensional continuous action directly sets the velocities of the arm and finger joints. In experiments not reported in this paper we tried using an observation vector containing only the raw state of the brick in addition to the arm configuration (i.e. without the vector between the end-effector and brick) and found that this increased the number of environment interactions needed roughly by a factor of two to three.

The only hyper-parameter that we optimize for each experimental condition is the learning rate. For each condition we train and measure the performance of 10 agents with different random initial network parameters. After every 30 training episodes the agent is evaluated for 10 episodes, and we use the mean performance at each evaluation phase as the performance measure presented in all plots. We found empirically that 10 episodes of evaluation gave a reasonable proxy for performance in the studied tasks. In the plots the line shows the mean performance for the set and the shaded regions correspond to the range between the worst and best performing agent in the set. In all plots the x-axis represents the number of environment transitions seen so far at an evaluation point (in millions) and the y-axis represents episode return. A video of the full setup and examples of policies solving the component and full tasks can be found here: https://www.youtube.com/watch?v=8QnD8ZM0YCo.
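For concreteness, the sketch below illustrates one way the observation vector and episode structure described above could be assembled. The field names, exact ordering, and the env/policy interfaces are hypothetical, not taken from the paper.

```python
import numpy as np

# Illustrative layout of the observation and the episode loop described above;
# names and ordering are assumptions, not the authors' implementation.
STEPS_PER_EPISODE = 150          # 150 steps x 50 ms = 7.5 s of simulated time
CONTROL_DT = 0.05

def make_observation(arm_pos, arm_vel, brick1_pose, brick2_pose, pinch_pos):
    """Concatenate proprioception and object features into a flat vector."""
    return np.concatenate([
        arm_pos,                      # 9 joint angles (6 arm + 3 fingers)
        arm_vel,                      # 9 joint velocities
        brick1_pose, brick2_pose,     # position + orientation of each brick
        brick1_pose[:3] - pinch_pos,  # relative vector brick 1 -> pinch site
        brick2_pose[:3] - pinch_pos,  # relative vector brick 2 -> pinch site
    ])

def run_episode(env, policy):
    """Roll out one 7.5-second episode with a sparse 0/1 reward."""
    obs, episode_return = env.reset(), 0.0
    for _ in range(STEPS_PER_EPISODE):
        action = policy(obs)          # 9-D joint-velocity command
        obs, reward, done = env.step(action)
        episode_return += reward
        if done:
            break
    return episode_return
```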
V. ASYNCHRONOUS DPG WITH VARIABLE REPLAY STEPS

In this section we study two methods for extending the DDPG algorithm and find that they can have a significant effect on data and computation efficiency, in some cases making the difference between finding a solution to a task or not.

a) Multiple mini-batch replay steps: Deep neural networks can require many steps of gradient descent to converge. In a supervised learning setting this cost is purely computational. In reinforcement learning, however, neural network training is interleaved with the acquisition of interaction experience, and the nature of the latter is affected by the state of the former (and vice versa), so the situation is more complicated. To gain a better understanding of this interaction we modified the original DDPG algorithm as described in [20] to perform a fixed but configurable number of mini-batch updates per step in the environment. In [20] one update was performed after each new interaction step. We refer to DDPG with a configurable number of update steps as DPG-R and tested the impact of this modification on the two primitive tasks Grasp and StackInHand.

The results are shown in Fig. 2. It is evident that the number of update steps has a dramatic effect on the amount of experience data required for learning successful policies. After one million interactions the original version of DDPG with a single update step (blue traces) appears to have made no progress towards a successful policy for stacking, and only a small number of controllers have learned to grasp. Increasing the number of updates per interaction to 5 greatly improves the results (green traces), and with 40 updates (purple) the first successful policies for stacking and grasping are obtained after 200,000 and 300,000 interactions respectively (corresponding to 1,300 and 2,000 episodes). It is notable that although the improvement is task dependent and the dependence between update steps and convergence is clearly not linear, in both cases we continue to see a reduction in total environment interaction up to 40 update steps, the maximum used in the experiment.

Fig. 2: Mean episode return as a function of the number of transitions seen (in millions) for DPG-R (single worker) on the Grasp (left) and StackInHand (right) tasks with 1 (blue), 5 (green), 10 (red), 20 (yellow) and 40 (purple) mini-batch updates per environment step.

One may speculate as to why changing the number of updates per environment step has such a pronounced effect. One hypothesis is that, loosely speaking and drawing an analogy to supervised learning, insufficient training leads to underfitting of the policy and value networks with respect to the already collected training data. Unlike in supervised learning, however, where the dataset is typically fixed, the quality of the policy directly feeds back into the data acquisition process, since the policy network is used for exploration, thus affecting the quality of the data used in future iterations of network training.

We have observed in various experiments (not listed here) that other aspects of the network architecture and training process can have a similar effect on the extent of underfitting. Some examples include the type of non-linearities used in the network layers, the size of the layers, and the learning rate. It is important to note that one cannot replicate the effect of multiple replay steps simply by increasing the learning rate. In practice we find that attempts to do so make training unstable.
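A compact sketch of the DPG-R schedule described above: the only change relative to DDPG is the inner loop performing a configurable number of replay updates per environment step. The environment, noise, replay-buffer, and update interfaces are assumptions; setting replay_steps=1 recovers the original one-update-per-step schedule.

```python
def dpg_r(env, policy, noise_fn, update_fn, replay_buffer,
          num_steps, replay_steps=40, batch_size=64):
    """Run DPG-R: replay_steps mini-batch updates per environment step.

    update_fn is assumed to perform one DDPG critic/actor update on a
    sampled mini-batch (e.g. the ddpg_update sketch in section III).
    """
    obs = env.reset()
    for _ in range(num_steps):
        action = policy(obs) + noise_fn()            # noisy exploration policy
        next_obs, reward, done = env.step(action)
        replay_buffer.add(obs, action, reward, next_obs)
        obs = env.reset() if done else next_obs
        for _ in range(replay_steps):                # R replay steps per interaction
            update_fn(replay_buffer.sample(batch_size))
```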
b) Asynchronous DPG: While increasing the number of update steps relative to the number of environment interactions greatly improves the data efficiency of the algorithm, it can also strongly increase the computation time. In the extreme case, in simulation, when the overall run time is dominated by the network updates, it may scale linearly with the number of replay steps. In this setting it is desirable to be able to parallelize the update computations. In a real robotics setup the overall run time is typically dominated by the collection of robot interactions. In this case it is desirable to be able to collect experience from multiple robots simultaneously (e.g. as in [39, 5]).

We therefore develop an asynchronous version of DPG that allows parallelization of training and environment interaction by combining multiple instances of a DPG-R actor and critic that each share their network parameters and can be configured to either share or have independent experience replay buffers. This is inspired by the A3C algorithm proposed in [23], and also analogous to [5, 39]. We found that this strategy is also an effective way to share parameters for DPG. That is, we employ asynchronous updates whereby each worker has its own copy of the parameters and uses it for computing gradients which are then applied to a shared parameter instance without any synchronization. We use the Adam optimizer [15] with local, non-shared first-order statistics and a single shared instance of second-order statistics. The pseudo-code of asynchronous DPG-R is shown in Algorithm 1.

Algorithm 1: (A)DPG-R

Initialize global shared critic and actor network parameters θ^Q'' and θ^μ''.
Pseudo-code for each learner thread:
  Initialize critic network Q(s, a | θ^Q) and actor μ(s | θ^μ) with weights θ^Q and θ^μ
  Initialize target networks Q' and μ' with weights θ^Q' ← θ^Q, θ^μ' ← θ^μ
  Initialize replay buffer R
  for episode = 1, M do
    Receive initial observation state s_1
    for t = 1, T do
      Select action a_t = μ(s_t | θ^μ) + N_t according to the current policy and exploration noise
      Perform action a_t, observe reward r_t and new state s_{t+1}
      Store transition (s_t, a_t, r_t, s_{t+1}) in R
      for update = 1, R do
        Sample a random minibatch of N transitions (s_i, a_i, r_i, s_{i+1}) from R
        Set y_i = r_i + γ Q'(s_{i+1}, μ'(s_{i+1} | θ^μ') | θ^Q')
        Perform an asynchronous update of the shared critic parameters by minimizing the loss
          L = (1/N) Σ_i (y_i − Q(s_i, a_i | θ^Q))^2
        Perform an asynchronous update of the shared actor parameters using the sampled gradient
          ∇_{θ^μ''} J ≈ (1/N) Σ_i ∇_a Q(s, a | θ^Q)|_{s=s_i, a=μ(s_i)} ∇_{θ^μ} μ(s | θ^μ)|_{s=s_i}
        Copy the shared parameters to the local ones: θ^Q ← θ^Q'', θ^μ ← θ^μ''
        Every S update steps, update the target networks: θ^Q' ← θ^Q, θ^μ' ← θ^μ
      end for
    end for
  end for
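The following is a highly simplified Python sketch of the parameter-sharing scheme in Algorithm 1: each worker computes gradients against a local copy of the parameters, applies them to a shared instance, and then refreshes its copy. For brevity it uses plain SGD and a lock, whereas the implementation described above applies updates without synchronization and uses Adam with shared second-order statistics; the gradient function here is a stand-in for a worker's DPG-R mini-batch update.

```python
import threading
import numpy as np

class SharedParams:
    """Shared parameter instance that all workers update asynchronously."""
    def __init__(self, size, lr=1e-3):
        self.values = np.zeros(size)
        self.lr = lr
        self.lock = threading.Lock()

    def apply_gradient(self, grad):
        with self.lock:                      # simplified: one lock-protected SGD step
            self.values -= self.lr * grad
            return self.values.copy()        # worker refreshes its local copy

def worker(shared, compute_gradient, num_updates):
    local = shared.values.copy()
    for _ in range(num_updates):
        grad = compute_gradient(local)       # stand-in for a DPG-R mini-batch gradient
        local = shared.apply_gradient(grad)  # asynchronous update + parameter refresh

# Hypothetical usage with 16 workers and a toy gradient function:
shared = SharedParams(size=4)
threads = [threading.Thread(target=worker, args=(shared, lambda p: 2 * p + 1, 100))
           for _ in range(16)]
for t in threads: t.start()
for t in threads: t.join()
```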
Figure 3 compares the performance of ADPG-R for different numbers of update steps with 16 workers (all workers performing both data collection and computing updates). Similar to Fig. 2 we find that increasing the ratio of update steps per environment step improves data efficiency, although the effect appears to be somewhat less pronounced than for DPG-R.

Fig. 3: Mean episode return as a function of the number of transitions seen (in millions) for ADPG-R (16 workers) on the Grasp (left) and StackInHand (right) tasks. Different colored traces indicate the number of replay steps as in Fig. 2.

Figure 4 (top row) directly compares the single-worker and asynchronous versions of DPG-R. In both cases we choose the best performing number of replay steps and learning rate. As we can see, the use of multiple workers does not affect overall data efficiency for StackInHand, but data requirements are reduced roughly by half for Grasp, with the caveat that the single worker has not yet fully converged.

Fig. 4: Grasp (a) and StackInHand (b): 16 workers vs. a single worker in data (total over all workers) and "wall-clock" (per-worker) time, in millions of transitions, with the best replay-step and learning-rate selection.

Figure 4 (bottom row) plots the same data but as a function of environment steps per worker. This measure corresponds to the optimal wall-clock efficiency that we can achieve, under the assumption that communication time between workers is negligible compared to environment interaction and gradient computation (this usually holds up to a certain degree of parallelization). This theoretical wall-clock time for running an experiment with 16 workers is about 16x lower for StackInHand and roughly 8x lower for Grasp.

Overall these results show that distributing neural network training and data collection across multiple computers and robots can be an extremely effective way of reducing the overall run time of experiments and thus making it feasible to run more challenging experiments. We make extensive use of asynchronous DPG for the remaining experiments.

VI. COMPOSITE SHAPING REWARDS

In the previous section we discussed how the ability of DDPG to exploit information that is available in the acquired interaction data affects learning speed. One important factor that determines what information is available from this data is the nature of the reward function. The reward function in the previous section was a "sparse" or "pure" reward, where a reward of 1 was given for states that correspond to successful task completion (brick lifted above 3 cm for Grasp; bricks stacked for Stack) and 0 otherwise. For this reward to be useful for learning it is of course necessary that the agent is able to enter this goal region in state space with whatever exploration strategy is chosen. This was indeed the case for the two subtasks in isolation, but it is highly unlikely for the full task: without further guidance, naive random exploration is very unlikely to lead to a successful grasp and stack, as we also experimentally verify in Fig. 5.

One commonly used solution to this problem is to provide informative shaping rewards that allow a learning signal to be obtained even with simple exploration strategies, e.g. by embedding information about the value function in the reward function for every transition acquired from the environment. For instance, for a simple reaching problem with a robotic arm we could define a shaping reward that takes into account the distance between the end-effector and the target. While this is a convenient way of embedding prior knowledge about the solution, and is a widely and successfully used approach for simple problems, it comes with several caveats, especially for complex sequential or compositional tasks such as the one we are interested in here.

Firstly, while a suitable shaping reward may be easy to construct for simple problems, for more complex composite tasks such as the one considered in this paper a suitable reward function is often non-obvious and may require considerable effort and experimentation. Secondly, and related to the previous point, the use of a shaping reward typically alters the solution to the optimization problem. The effect of this can be benign, but especially when it comes to complex tasks a small mistake may lead to complete failure of learning, as we will demonstrate below. Thirdly, in a robotics setup not all information that would be desirable to define a good shaping reward may be easily available. For instance, in the manipulation problem considered in this paper determining the position of the Lego bricks requires extra instrumentation of the experimental setup.

In this section we propose and analyze several possible reward functions for our full Stack task, aiming to provide a recipe that can be applied to other tasks with similar compositional structure. Shaping rewards are typically defined based on some notion of distance from or progress towards a goal state.
We attempt to transfer this idea to our compositional setup via what we call composite (shaping) rewards. These reward functions return an increasing reward as the agent completes components of the full task. They are either piecewise constant or smoothly varying across different regions of the state space that correspond to completed subtasks. In the case of Stack we use the reward components described in Table I.

TABLE I: Composite reward function

Sparse reward components (subtask: description; reward)
  Reach Brick 1: the hypothetical pinch site position of the fingers is in a box around the first brick position; reward 0.125
  Grasp Brick 1: the first brick is located at least 3 cm above the table surface, which is only possible if the arm is holding the brick; reward 0.25
  Stack Brick 1: bricks stacked; reward 1.00

Smoothly varying reward components
  Reaching to brick 1: distance of the pinch site to the first brick (non-linear, bounded); reward in [0, 0.125]
  Reaching to stack: while grasped, distance of the first brick to the stacking site of the second brick (non-linear, bounded); reward in [0.25, 0.5]

These reward components can be combined in different ways. We consider three different composite rewards in addition to the original sparse task reward:

Grasp shaping: Grasp brick 1 and Stack brick 1, i.e. the agent receives a reward of 0.25 when brick 1 has been grasped and a reward of 1.0 after completion of the full task.

Reach and grasp shaping: Reach brick 1, Grasp brick 1 and Stack brick 1, i.e. the agent receives a reward of 0.125 when being close to brick 1, a reward of 0.25 when brick 1 has been grasped, and a reward of 1.0 after completion of the full task.

Full composite shaping: the sparse reward components as before, in combination with the distance-based smoothly varying components.

Figure 5 shows the results of learning with the above reward functions (blue traces). The figure makes clear that learning with the sparse reward only does not succeed for the full task. Introducing an intermediate reward for grasping allows the agent to learn to grasp, but learning is very slow. The time to successful grasping can be substantially reduced by giving a distance-based reward component for reaching to the first brick, but learning does not progress beyond grasping. Only with an additional intermediate reward component, as in the full composite shaping (continuous reach, grasp, stack), can the full task be solved.

Although the above reward functions are specific to the particular task, we expect that the idea of a composite reward function can be applied to many other tasks, thus allowing learning to succeed even for challenging problems. Nevertheless, great care must be taken when defining the reward function. We encountered several unexpected failure cases while designing the reward function components: e.g. reach and grasp components leading to a grasp unsuitable for stacking; the agent not stacking the bricks because it would stop receiving the grasping reward before it received the reward for stacking; and the agent flipping the brick because it received a grasping reward calculated with the wrong reference point on the brick. We show examples of these in the video: https://www.youtube.com/watch?v=8QnD8ZM0YCo.
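As an illustration of the piecewise-constant schemes above, here is a minimal sketch of the "Reach and grasp shaping" reward; the reach/grasp/stack predicates are assumed to be computed from the instrumented state, as detailed in the appendix.

```python
def reach_and_grasp_shaping(reached: bool, grasped: bool, stacked: bool) -> float:
    """Piecewise-constant composite reward: 0.125 for reaching, 0.25 for
    grasping, 1.0 for a completed stack, and 0 otherwise."""
    if stacked:
        return 1.0
    if grasped:
        return 0.25
    if reached:
        return 0.125
    return 0.0
```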
VII. LEARNING FROM INSTRUCTIVE STATES

In the previous section we described a strategy for designing effective reward functions for complex compositional tasks which alleviates the burden of exploration. We have also pointed out, however, that designing shaping rewards can be error prone and may rely on privileged information. In this section we describe a different strategy for embedding prior knowledge into the training process and improving exploration that reduces the reliance on carefully designed reward functions.

Specifically, we propose to let the distribution of states at which the learning agent is initialized at the beginning of an episode reflect the compositional nature of the task: in our case, instead of always initializing the agent at the beginning of the full task with both bricks on the table, we can, for instance, choose to occasionally initialize the agent with the brick already in its hand, and thus prepared for stacking, in the same way as when learning the subtask StackInHand in section V. Trajectories of policies solving the task will have to visit this region of space before stacking the bricks, and we can thus think of this initialization strategy as initializing the agent closer to the goal.

More generally, we can choose to initialize episodes with states taken from anywhere along, or close to, successful trajectories. Suitable states can be either manually defined (as in section V), or they can be obtained from a human demonstrator or from a previously trained agent that can partially solve the task. This can be seen as a form of apprenticeship learning in which we provide teacher information by influencing the state visitation distribution.

We perform experiments with two alternative methods for generating the starting states. The first one uses manually defined initial states and amounts to the possibility discussed above: we initialize the learning agent either in the original starting states with both bricks located on the table, or in states where the first brick is already in the gripper, as if the agent had just performed a successful grasp and lifted the brick. These two sets of start states correspond to those used in section V.

The second method for generating instructive starting states can also be used on a real robot, provided a human demonstrator or a pre-trained policy is available. It aims at initializing the learning agent along solution trajectory states in a more fine-grained fashion. We sample a random number of steps for each episode, between one and the expected number of steps required to solve the task from the original starting states, and then run the demonstrator for this number of steps. The final state of this process is then used as the starting state for the learning agent, which then acts in the environment for the remainder of the episode.
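A sketch of this second start-state generation scheme: roll a demonstrator (a human teleoperator or a pre-trained policy) forward for a random prefix of the solution and hand the resulting state to the learner. The environment interface, in particular get_state/set_state, is an assumption for illustration.

```python
import random

def instructive_start_state(env, demonstrator, max_prefix_steps):
    """Run the demonstrator for a random number of steps and return the
    resulting simulator state as the learner's initial state."""
    obs = env.reset()                      # original start: both bricks on the table
    prefix = random.randint(1, max_prefix_steps)
    for _ in range(prefix):
        obs, _, done = env.step(demonstrator(obs))
        if done:                           # demonstrator finished early; stop here
            break
    return env.get_state()                 # the learner resumes from this state

# Hypothetical usage: the learner then acts for the remainder of the episode.
# start = instructive_start_state(env, pretrained_policy, max_prefix_steps=150)
# env.set_state(start); run_learning_episode(env, learner)
```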
The results of these experiments are shown in Figure 5, which compares the four reward functions considered in the previous section when combined with the simple augmented start state distribution. While there is still no learning for the basic sparse reward case, results obtained with all other reward functions are improved. In particular, even for the second simplest reward function (Grasp shaping) we now obtain some controllers that can solve the full task. Learning with the full composite shaping reward is faster and more robust than without the use of instructive states.

Fig. 5: Four panels with (a) no progress without extra shaping and (b, c, d) different shaping strategies for the composite task, with starting states with both bricks on the table (blue), manually defined initial states (green), and initial states sampled continuously from solution trajectories (red). On all plots, the x-axis is millions of transitions of total experience and the y-axis is mean episode return. Policies with a mean return over 100 robustly perform the full Stack from different starting states.

The top left plot of Figure 5 (red trace) shows results for the case where the episode is initialized anywhere along trajectories from a pre-trained controller. We use this start state distribution in combination with the basic sparse reward for the overall task (Stack without shaping). Episodes were configured to be 50 steps, shorter than in the previous experiments, to be better suited to this setup with assisted exploration. During testing we still used episodes of 150 steps as before (so the traces are comparable). We see a large improvement in performance in comparison to the two-state variant, even in the absence of any shaping rewards. We can learn a robust policy for all seeds within a total of 1 million environment transitions. This corresponds to less than 1 hour of interaction time on 16 simulated robots.

Overall these results suggest that an appropriate start state distribution not only greatly speeds up learning, it also allows simpler reward functions to be used. In our final experiment the simplest reward function, indicating only overall experimental success, was sufficient to solve the task. Considering the difficulties that can be associated with designing good shaping rewards, this is an encouraging result. The robustness of the trained policies to variation in the starting state is also quite encouraging. Table II lists the success rate by task from 1000 trials. A video with trained policies performing the Grasp, StackInHand and Stack tasks from different initial states can be found in the supplementary material.

TABLE II: Robustness of learned policies (success rate over 1000 random starts).

  Grasp        99.2%
  StackInHand  98.2%
  Stack        95.5%

VIII. CONCLUSION

We have introduced two extensions to the DDPG algorithm which make it a powerful method for learning robust policies for complex continuous control tasks. Specifically, we have shown that by decoupling the frequency of network updates from the environment interaction we can substantially improve data-efficiency, to a level that in some cases makes the difference between finding a solution or not. The asynchronous version of DDPG, which allows data collection and network training to be distributed over several computers and (simulated) robots, has provided us with a close to linear speed-up in wall-clock time for 16 parallel workers.

In addition, we presented two methods that help to guide the learning process towards good solutions and thus reduce the pressure on exploration strategies and speed up learning. The first, composite rewards, is a recipe for constructing effective reward functions for tasks that consist of a sequence of subtasks. The second, instructive starting states, can be seen as a lightweight form of apprenticeship learning that facilitates learning of long-horizon tasks even with sparse rewards, a property of many real-world problems. Taken together, the algorithmic changes and exploration-shaping strategies have allowed us to learn robust policies for the Stack task within a number of transitions that is feasible to collect in a real-robot system within a few days, or in significantly less time if multiple robots were used for training.

It is of course a challenge to judge the transfer of results in simulation to the real world.
We have taken care to design a physically realistic simulation, and in initial experiments, which we have performed both in simulation and on the physical robot, we generally find a good correspondence of performance and learning speed between simulation and real world. This makes us optimistic that our performance numbers also hold when going to the real world. A second caveat of our simulated setup is that it currently uses information about the state of the environment, which, although not impossible to obtain on a real robot, may require additional instrumentation of the experimental setup, e.g. to determine the position of the two bricks in the work space. To address this second issue we are currently focusing on end-to-end learning directly from raw visual information. Here, we have some first results showing the feasibility of learning policies for grasping with a success rate of about 80% across different starting conditions. We view the algorithms and techniques presented here as an important step towards applying versatile deep reinforcement learning methods for real-robot dexterous manipulation with perception.

REFERENCES

[1] J Andrew Bagnell and Jeff G Schneider. Autonomous helicopter control using reinforcement learning policy search methods. In Robotics and Automation, 2001. Proceedings 2001 ICRA. IEEE International Conference on, volume 2, pages 1615–1620. IEEE, 2001.
[2] A. Boularias, J. Kober, and J. Peters. Relative entropy inverse reinforcement learning. In JMLR Workshop and Conference Proceedings Volume 15: AISTATS 2011, pages 182–189, Cambridge, MA, USA, April 2011. MIT Press.
[3] Marc Peter Deisenroth, Gerhard Neumann, Jan Peters, et al. A survey on policy search for robotics. Foundations and Trends in Robotics, 2(1-2):1–142, 2013.
[4] Chelsea Finn, Sergey Levine, and Pieter Abbeel. Guided cost learning: Deep inverse optimal control via policy optimization. In Proceedings of the 33rd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, pages 49–58, 2016. URL http://jmlr.org/proceedings/papers/v48/finn16.html.
[5] Shixiang Gu, Ethan Holly, Timothy Lillicrap, and Sergey Levine. Deep reinforcement learning for robotic manipulation. arXiv preprint arXiv:1610.00633, 2016.
[6] Shixiang Gu, Sergey Levine, Ilya Sutskever, and Andriy Mnih. MuProp: Unbiased backpropagation for stochastic neural networks. International Conference on Learning Representations (ICLR), 2016.
[7] Shixiang Gu, Tim Lillicrap, Ilya Sutskever, and Sergey Levine. Continuous deep Q-learning with model-based acceleration. In International Conference on Machine Learning (ICML), 2016.
[8] Abhishek Gupta, Clemens Eppner, Sergey Levine, and Pieter Abbeel. Learning dexterous manipulation for a soft robotic hand from human demonstrations. In 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS 2016, Daejeon, South Korea, October 9-14, 2016, pages 3786–3793, 2016.
[9] Roland Hafner and Martin Riedmiller. Reinforcement learning in feedback control. Machine Learning, 84(1-2):137–169, 2011.
[10] Roland Hafner and Martin A. Riedmiller. Neural reinforcement learning controllers for a real robot application. In 2007 IEEE International Conference on Robotics and Automation, ICRA 2007, 10-14 April 2007, Roma, Italy, pages 2098–2103, 2007.
[11] Nicolas Heess, Gregory Wayne, David Silver, Tim Lillicrap, Tom Erez, and Yuval Tassa. Learning continuous control policies by stochastic value gradients.
In Advances in Neural Information Processing Systems (NIPS), pages 2926–2934, 2015.
[12] K. J. Hunt, D. Sbarbaro, R. Żbikowski, and P. J. Gawthrop. Neural networks for control systems: A survey. Automatica, 28(6):1083–1112, November 1992. ISSN 0005-1098.
[13] M. Kalakrishnan, L. Righetti, P. Pastor, and S. Schaal. Learning force control policies for compliant manipulation. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2011), Sept. 25-30, San Francisco, CA, 2011. URL http://www-clmc.usc.edu/publications/K/kalakrishnan-IROS2011.
[14] M. Kalakrishnan, P. Pastor, L. Righetti, and S. Schaal. Learning objective functions for manipulation. In IEEE International Conference on Robotics and Automation, 2013.
[15] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[16] Nate Kohl and Peter Stone. Policy gradient reinforcement learning for fast quadrupedal locomotion. In Proceedings of the IEEE International Conference on Robotics and Automation, May 2004.
[17] Sergey Levine and Pieter Abbeel. Learning neural network policies with guided policy search under unknown dynamics. In Advances in Neural Information Processing Systems (NIPS), pages 1071–1079, 2014.
[18] Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-end training of deep visuomotor policies. arXiv preprint arXiv:1504.00702, 2015.
[19] Sergey Levine, Peter Pastor, Alex Krizhevsky, and Deirdre Quillen. Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection. CoRR, abs/1603.02199, 2016.
[20] Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. International Conference on Learning Representations (ICLR), 2016.
[21] Takamitsu Matsubara, Jun Morimoto, Jun Nakanishi, Masa-aki Sato, and Kenji Doya. Learning CPG-based biped locomotion with a policy gradient method. Robotics and Autonomous Systems, 54(11):911–920, 2006.
[22] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
[23] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy P Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning (ICML), 2016.
[24] K. Muelling, J. Kober, O. Kroemer, and J. Peters. Learning to select and generalize striking movements in robot table tennis. International Journal of Robotics Research, (3):263–279, 2013. URL http://www.ias.informatik.tu-darmstadt.de/uploads/Publications/Muelling_IJRR_2013.pdf.
[25] P. Pastor, M. Kalakrishnan, S. Chitta, E. Theodorou, and S. Schaal. Skill learning and task outcome prediction for manipulation. In IEEE International Conference on Robotics and Automation (ICRA), Shanghai, China, May 9-13, 2011.
[26] Jan Peters and Stefan Schaal. Policy gradient methods for robotics. In International Conference on Intelligent Robots and Systems (IROS), pages 2219–2225. IEEE, 2006.
[27] Lerrel Pinto and Abhinav Gupta. Supersizing self-supervision: Learning to grasp from 50k tries and 700 robot hours. CoRR, abs/1509.06825, 2015. URL http://arxiv.org/abs/1509.06825.
[28] M. Riedmiller and H. Braun.
A direct adaptive method for faster backpropagation learning: The RPROP algorithm. In H. Ruspini, editor, Proceedings of the IEEE International Conference on Neural Networks (ICNN), pages 586–591, San Francisco, 1993.
[29] Martin A. Riedmiller. Neural fitted Q iteration - first experiences with a data efficient neural reinforcement learning method. In Machine Learning: ECML 2005, 16th European Conference on Machine Learning, Porto, Portugal, October 3-7, 2005, Proceedings, pages 317–328, 2005.
[30] Stefan Schaal. Dynamic Movement Primitives - A Framework for Motor Control in Humans and Humanoid Robotics, pages 261–280. Springer Tokyo, Tokyo, 2006. ISBN 978-4-431-31381-6. doi: 10.1007/4-431-31381-8_23. URL http://dx.doi.org/10.1007/4-431-31381-8_23.
[31] John Schulman, Sergey Levine, Pieter Abbeel, Michael I. Jordan, and Philipp Moritz. Trust region policy optimization. In International Conference on Machine Learning (ICML), pages 1889–1897, 2015.
[32] John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation. International Conference on Learning Representations (ICLR), 2016.
[33] David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, and Martin Riedmiller. Deterministic policy gradient algorithms. In International Conference on Machine Learning (ICML), 2014.
[34] Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction, volume 1. MIT Press, Cambridge, 1998.
[35] Gerald Tesauro. Temporal difference learning and TD-Gammon. Commun. ACM, 38(3):58–68, 1995.
[36] Emanuel Todorov, Tom Erez, and Yuval Tassa. MuJoCo: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5026–5033. IEEE, 2012.
[37] Herke van Hoof, Tucker Hermans, Gerhard Neumann, and Jan Peters. Learning robot in-hand manipulation with tactile features. In 15th IEEE-RAS International Conference on Humanoid Robots, Humanoids 2015, Seoul, South Korea, November 3-5, 2015, pages 121–127, 2015.
[38] Paul J. Werbos. Neural networks for control, chapter A Menu of Designs for Reinforcement Learning over Time, pages 67–95. 1990. ISBN 0-262-13261-3.
[39] Ali Yahya, Adrian Li, Mrinal Kalakrishnan, Yevgen Chebotar, and Sergey Levine. Collective robot reinforcement learning with distributed asynchronous guided policy search. CoRR, abs/1610.00673, 2016. URL http://arxiv.org/abs/1610.00673.

APPENDIX

A. Reward function

In this section we provide further details regarding the reward functions described in section VI. For our experiments we derived these from the state vector of the simulation, but they could also be obtained through instrumentation in hardware. The reward functions are defined in terms of the following quantities:

• $b^{(1)}_z$: height of brick 1 above the table
• $s^{B1}_{\{x,y,z\}}$: x, y, z position of a site located roughly at the center of brick 1
• $s^{B2}_{\{x,y,z\}}$: x, y, z position of a site located just above brick 2, at the position where $s^{B1}$ will be located when brick 1 is stacked on top of brick 2
• $s^{P}_{\{x,y,z\}}$: x, y, z position of the pinch site of the hand, i.e. roughly the position where the fingertips would meet if the fingers are closed

1) Sparse reward components: Using the above we can define the following conditions for the successful completion of subtasks:

a) Reach Brick 1: The pinch site of the fingers is within a virtual box around the first brick position:
$$\mathrm{reach} = \left(|s^{B1}_x - s^{P}_x| < \Delta^{\mathrm{reach}}_x\right) \wedge \left(|s^{B1}_y - s^{P}_y| < \Delta^{\mathrm{reach}}_y\right) \wedge \left(|s^{B1}_z - s^{P}_z| < \Delta^{\mathrm{reach}}_z\right),$$

where $\Delta^{\mathrm{reach}}_{\{x,y,z\}}$ denote the half-lengths of the sides of the virtual box for reaching.

b) Grasp Brick 1: Brick 1 is located above the table surface by a threshold $\theta$, which is only possible if the arm has lifted the brick:

$$\mathrm{grasp} = b^{(1)}_z > \theta.$$

c) Stack: Brick 1 is stacked on brick 2. This is expressed as a box constraint on the displacement between brick 1 and brick 2, measured in the coordinate system of brick 2:

$$\mathrm{stack} = \left(|C^{(2)}_x(s^{B1} - s^{B2})| < \Delta^{\mathrm{stack}}_x\right) \wedge \left(|C^{(2)}_y(s^{B1} - s^{B2})| < \Delta^{\mathrm{stack}}_y\right) \wedge \left(|C^{(2)}_z(s^{B1} - s^{B2})| < \Delta^{\mathrm{stack}}_z\right),$$

where $\Delta^{\mathrm{stack}}_{\{x,y,z\}}$ denote the half-lengths of the sides of the virtual box for stacking, and $C^{(2)}$ is the rotation matrix that projects a vector into the coordinate system of brick 2. This projection into the coordinate system of brick 2 is necessary since brick 2 is allowed to move freely. It ensures that the box constraint is considered relative to the pose of brick 2. While this criterion for a successful stack is quite complicated to express in terms of sites, it could easily be implemented in hardware, e.g. via a contact sensor attached to brick 2.

2) Shaping components: The full composite reward also includes two distance-based shaping components that guide the hand to brick 1 and then brick 1 to brick 2. These could be approximate and would be relatively simple to implement with a hardware vision system that can only roughly identify the centroid of an object. The shaping components of the reward are given as follows:

a) Reaching to brick 1:

$$r_{S1}(s^{B1}, s^{P}) = 1 - \tanh^2\!\left(w_1 \|s^{B1} - s^{P}\|_2\right)$$

b) Reaching to brick 2 for stacking:

$$r_{S2}(s^{B1}, s^{B2}) = 1 - \tanh^2\!\left(w_2 \|s^{B1} - s^{B2}\|_2\right).$$

3) Full reward: Using the above components, the reward functions from section VI (Stack, Grasp shaping, Reach and grasp shaping, and Full composite shaping) can be expressed as in equations (3)-(6) below. They use the predicates above, with arguments omitted for brevity, to determine which subtasks have been completed and return a reward accordingly.

$$r(b^{(1)}_z, s^{P}, s^{B1}, s^{B2}) = \begin{cases} 1 & \text{if } \mathrm{stack} \\ 0 & \text{otherwise} \end{cases} \qquad (3)$$

$$r(b^{(1)}_z, s^{P}, s^{B1}, s^{B2}) = \begin{cases} 1 & \text{if } \mathrm{stack} \\ 0.25 & \text{if } \neg\mathrm{stack} \wedge \mathrm{grasp} \\ 0 & \text{otherwise} \end{cases} \qquad (4)$$

$$r(b^{(1)}_z, s^{P}, s^{B1}, s^{B2}) = \begin{cases} 1 & \text{if } \mathrm{stack} \\ 0.25 & \text{if } \neg\mathrm{stack} \wedge \mathrm{grasp} \\ 0.125 & \text{if } \neg(\mathrm{stack} \vee \mathrm{grasp}) \wedge \mathrm{reach} \\ 0 & \text{otherwise} \end{cases} \qquad (5)$$

$$r(b^{(1)}_z, s^{P}, s^{B1}, s^{B2}) = \begin{cases} 1 & \text{if } \mathrm{stack} \\ 0.25 + 0.25\, r_{S2}(s^{B1}, s^{B2}) & \text{if } \neg\mathrm{stack} \wedge \mathrm{grasp} \\ 0.125 & \text{if } \neg(\mathrm{stack} \vee \mathrm{grasp}) \wedge \mathrm{reach} \\ 0.125\, r_{S1}(s^{B1}, s^{P}) & \text{otherwise} \end{cases} \qquad (6)$$
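To tie the appendix together, the sketch below implements the predicates and the full composite reward of equation (6) directly from the site quantities defined above. The box half-lengths and the weights w1, w2 are placeholders; the paper does not report their values, and only the 3 cm grasp threshold is taken from section VI.

```python
import numpy as np

# Placeholder constants: the paper does not report the box sizes or the weights.
REACH_BOX = np.array([0.03, 0.03, 0.03])   # half-lengths of the reach box (assumed)
STACK_BOX = np.array([0.01, 0.01, 0.01])   # half-lengths of the stack box (assumed)
GRASP_HEIGHT = 0.03                        # 3 cm lift threshold from section VI
W1, W2 = 5.0, 5.0                          # shaping weights w1, w2 (assumed)

def reach(s_b1, s_p):
    return bool(np.all(np.abs(s_b1 - s_p) < REACH_BOX))

def grasp(b1_z):
    return b1_z > GRASP_HEIGHT

def stack(s_b1, s_b2, rot_b2):
    # Displacement expressed in brick 2's coordinate frame (rot_b2 = C^(2)).
    d = rot_b2 @ (s_b1 - s_b2)
    return bool(np.all(np.abs(d) < STACK_BOX))

def r_s1(s_b1, s_p):
    return 1.0 - np.tanh(W1 * np.linalg.norm(s_b1 - s_p)) ** 2

def r_s2(s_b1, s_b2):
    return 1.0 - np.tanh(W2 * np.linalg.norm(s_b1 - s_b2)) ** 2

def full_composite_reward(b1_z, s_p, s_b1, s_b2, rot_b2):
    """Equation (6): sparse stage rewards plus the two smooth shaping terms."""
    if stack(s_b1, s_b2, rot_b2):
        return 1.0
    if grasp(b1_z):
        return 0.25 + 0.25 * r_s2(s_b1, s_b2)
    if reach(s_b1, s_p):
        return 0.125
    return 0.125 * r_s1(s_b1, s_p)
```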
Solving this difficult and practically relevant problem in the real world is an important long-term goal for the field of robotics. Here we take a step towards this goal by examining the problem in simulation and providing models and techniques aimed at solving it. We introduce two extensions to the Deep Deterministic Policy Gradient algorithm (DDPG), a model-free Q-learning based method, which make it significantly more data-efficient and scalable. Our results show that by making extensive use of off-policy data and replay, it is possible to find control policies that robustly grasp objects and stack them. Further, our results hint that it may soon be feasible to train successful stacking policies by collecting interactions on real robots. I. INTRODUCTION Dexterous manipulation is a fundamental challenge in robotics. Researchers have long been seeking a way to enable robots to robustly and flexibly interact with fixed and free objects of different shapes, materials, and surface properties in the context of a broad range of tasks and environmental conditions. Such flexibility is very difficult to achieve with manually designed controllers. The recent resurgence of neural networks and “deep learning” has inspired hope that these methods will be as effective in the control domain as they are for perception. And indeed, in simulation, recent work has used neural networks to learn solutions to a variety of control problems from scratch (e.g. [7, 20, 32, 31, 11, 17]). While the flexibility and generality of learning approaches is promising for robotics, these methods typically require a large amount of data that grows with the complexity of the task. What is feasible on a simulated system, where hundreds of millions of control steps are possible [23], does not necessarily transfer to real robot applications due to unrealistic learning times. One solution to this problem is to restrict the generality of the controller by incorporating task specific knowledge, e.g. in the form of dynamic movement primitives [30], or in the form of strong teaching signals, e.g. kinesthetic teaching of trajectories [24]. Recent works have had some success learning flexible neural network policies directly on real robots (e.g. [18, 5, 39]), but tasks as complex as grasping-and-stacking remain daunting. An important issue for the application of learning methods in robotics is to understand how to make the best use of collected data, which can be expensive to obtain, both in terms of time and money. To keep learning times reasonably low even in complex scenarios, it is crucial to find a practical compromise between the generality of the controller and the necessary restrictions of the task setup. This is the gap that we aim to fill in this paper: exploring the potential of a learning approach that keeps prior assumptions low while keeping data consumption in reasonable bounds. Simultaneously, we are interested in approaches that are broadly applicable, robust, and practical. In this paper we provide a simulation study that investigates the possibility of learning complex manipulation skills end- to-end with a general purpose model-free deep reinforcement learning algorithm. The express goal of this work is to assess the feasibility of performing analogous end-to-end learning experiments on real robotics hardware and to provide guidance with respect to the choice of learning algorithm and experi- mental setup and the performance that we can hope to achieve. 
In addition, when states from successful trajectories are used as the start states for learning trials, the full task can be learned with 1 million transitions (i.e. less than 1 hour of interaction on 16 robots). To our knowledge our results provide the first demonstration of solving complex manipulation problems involving multiple freely moving objects. They are also encouraging as a sensible lower bound for real-world experiments, suggesting that it may indeed be possible to learn such non-trivial manipulation skills directly on real robots.
II. RELATED WORK

Reinforcement learning approaches solve tasks through repeated interactions with the environment guided by a reward signal that indicates the success or failure of a trial. A wide variety of techniques have been developed that exploit this idea [34], with a broad distinction often made between value-based and policy search methods. While the former estimate and improve a value function, policy search methods directly optimize the parameters of a policy to maximize cumulative reward. The latter have been routinely applied in robotics, in part because they straightforwardly handle continuous and high-dimensional action spaces [3], and applications include manipulation [26, 13, 25, 37, 18, 5, 39, 8], locomotion, e.g. [16, 21], and a range of other challenges such as helicopter flight [1]. One limitation that has hampered policy search methods is that they can scale poorly with the number of parameters that need to be estimated. This limitation, and other constraints when working with real robotics hardware, has led research to focus on the use of manually engineered and restrictive features and movement representations, particularly trajectory-based ones such as spline-based dynamic movement primitives. Simplifying the policy space can make learning on real hardware tractable, but it also limits the kinds of problems that can be solved. In order to solve a problem such as picking up and manipulating an object, more expressive function classes are likely to be needed. The use of rich and flexible function approximators such as neural networks in RL dates back many years, e.g. [38, 35, 12, 10]. In the last few years there has been a resurgence of interest in end-to-end training of neural networks for challenging control problems, and several algorithms, both value and policy focused, have been developed and applied to challenging problems including continuous control, e.g. [22, 23, 6, 7, 20, 32, 31, 11, 17]. These methods work well with large neural networks and can learn directly from raw visual input streams. With few exceptions, e.g. [10, 5, 18, 39], they have been considered too data-inefficient for robotics applications. One exception is the family of guided policy search (GPS) methods [18, 39]. These have recently been applied to several manipulation problems and employ a teacher algorithm to locally optimize trajectories which are then summarized by a neural network policy. GPS algorithms gain data-efficiency by employing aggressive local policy updates and by performing extensive training of their neural network policy before collecting more real-world data. The teacher can use model-based [18] or model-free [39] trajectory optimization. The former can struggle in situations with strong discontinuities in the dynamics, and both rely on access to a well defined and fully observed state space. Model-free value function approaches offer an alternative way to handle the issue of data-efficiency in robotics. Such approaches enable effective reuse of data and do not require full access to the state space or to a model of the environment. One recent work [5], closely related to the ideas followed in this paper, provides a proof-of-concept demonstration that value-based methods using neural network approximators can be used for robotic manipulation in the real world. This work applied a Q-learning approach [7] to a door opening task in which a robotic arm fitted with an unactuated hook needed to reach to a handle and pull a door to a given angle.
The starting state of the arm and door were fixed across trials and the reward structure was smooth and structured, with one term expressing the distance from the hook to the handle and a second term expressing the distance of the door to the desired angle. This task was learned in approximately 2 hours across 2 robots pooling their experience into a shared replay buffer. This work thus made use of a complementary solution to the need for large amounts of interaction data: the use of experimental rigs that allow large scale data collection, e.g. [27], including the use of several robots from which experience is gathered in parallel [19, 5, 39]. This can be combined with single machine or distributed training depending on whether the bottleneck is primarily one of data collection or also one of network training [23]. Finally, the use of demonstration data has played an important role in robot learning, both as a means to obtain suitable cost functions [2, 14, 4, 8] and as a way to bootstrap and thus speed up learning. For the latter, kinesthetic teaching is widely used [26, 13, 25, 39]. It integrates naturally with trajectory-based movement representations, but the need for a human operator to be able to guide the robot through the full movement can be limiting. Furthermore, when the policy representation is not trajectory based (e.g. direct torque control with neural networks) the use of human demonstration trajectories may be less straightforward (e.g. since the associated controls are not available).

III. BACKGROUND

In this section we briefly formalize the learning problem, summarize the DDPG algorithm, and explain its relationship to several other Q-function based reinforcement learning (RL) algorithms. The RL problem consists of an agent interacting with an environment in a sequential manner to maximize the expected sum of rewards. At time $t$ the agent observes the state $x_t$ of the system and produces a control $u_t = \pi(x_t; \theta)$ according to policy $\pi$ with parameters $\theta$. This leads the environment to transition to a new state $x_{t+1}$ according to the dynamics $x_{t+1} \sim p(\cdot \mid x_t, u_t)$, and the agent receives a reward $r_t = r(x_t, u_t)$. The goal is to maximize the expected sum of discounted rewards $J(\theta) = \mathbb{E}_{\tau \sim \rho_\theta}\left[\sum_t \gamma^{t-1} r(x_t, u_t)\right]$, where $\rho_\theta$ is the distribution over trajectories $\tau = (x_0, u_0, x_1, u_1, \dots)$ induced by the current policy: $\rho_\theta(\tau) = p(x_0) \prod_{t>0} p(x_t \mid x_{t-1}, \pi(x_{t-1}; \theta))$.

DPG [33] is a policy gradient algorithm for continuous action spaces that improves the deterministic policy function $\pi$ via backpropagation of the action-value gradient from a learned approximation to the Q-function. Specifically, DPG maintains a parametric approximation $Q(x_t, u_t; \phi)$ to the action value function $Q^\pi(x_t, u_t)$ associated with $\pi$, and $\phi$ is chosen to minimize

$$\mathbb{E}_{(x_t, u_t, x_{t+1}) \sim \bar\rho}\left[\left(Q(x_t, u_t; \phi) - y_t\right)^2\right] \quad (1)$$

where $y_t = r(x_t, u_t) + \gamma Q(x_{t+1}, \pi(x_{t+1}))$. $\bar\rho$ is usually close to the marginal transition distribution induced by $\pi$ but often not identical. For instance, during learning $u_t$ may be chosen to be a noisy version of $\pi(x_t; \theta)$, e.g. $u_t = \pi(x_t; \theta) + \epsilon$ where $\epsilon \sim \mathcal{N}(0, \sigma^2)$, and $\bar\rho$ is then the transition distribution induced by this noisy policy. The policy parameters $\theta$ are then updated according to

$$\Delta\theta \propto \mathbb{E}_{(x,u) \sim \bar\rho}\left[\frac{\partial}{\partial u} Q(x, u; \phi)\, \frac{\partial}{\partial \theta} \pi(x; \theta)\right]. \quad (2)$$

DDPG [20] is an improvement of the original DPG algorithm adding experience replay and target networks: experience is collected into a buffer, and updates to $\theta$ and $\phi$ (eqs. 1, 2) are computed using mini-batch updates with random samples from this buffer.
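To make the update equations concrete, below is a minimal sketch of one replay-based DPG update (eqs. 1 and 2) in PyTorch. The network architectures, dimensions, learning rates and names are illustrative assumptions rather than the configuration used in the paper, and the target networks discussed next are omitted for brevity.

```python
import torch
import torch.nn as nn

# Illustrative actor and critic; sizes are assumptions, not the paper's setup.
OBS_DIM, ACT_DIM = 25, 9
actor = nn.Sequential(nn.Linear(OBS_DIM, 64), nn.ReLU(),
                      nn.Linear(64, ACT_DIM), nn.Tanh())
critic = nn.Sequential(nn.Linear(OBS_DIM + ACT_DIM, 64), nn.ReLU(),
                       nn.Linear(64, 1))
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)
gamma = 0.99

def dpg_update(batch):
    """One mini-batch update following eqs. (1) and (2).

    batch: tensors (x, u, r, x_next) sampled uniformly from the replay buffer;
    r is shaped (N, 1) so it broadcasts against the critic output.
    """
    x, u, r, x_next = batch

    # Critic regression target y_t = r_t + gamma * Q(x_{t+1}, pi(x_{t+1})).
    with torch.no_grad():
        y = r + gamma * critic(torch.cat([x_next, actor(x_next)], dim=-1))

    # Eq. (1): minimize the squared Bellman error of Q(x_t, u_t; phi).
    q = critic(torch.cat([x, u], dim=-1))
    critic_loss = ((q - y) ** 2).mean()
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Eq. (2): follow dQ/du * dpi/dtheta, i.e. maximize Q(x, pi(x; theta)).
    actor_loss = -critic(torch.cat([x, actor(x)], dim=-1)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
```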
Furthermore, a second set of "target networks" is maintained with parameters $\theta'$ and $\phi'$. These are used to compute $y_t$ in eqn. (1) and their parameters are slowly updated towards the current parameters $\theta, \phi$. Both measures significantly improve the stability of DDPG. DDPG bears a relation to several other recent model-free RL algorithms: the NAF algorithm [7], which has recently been applied to a real-world robotics problem [5], can be viewed as a DDPG variant where the Q-function is quadratic in the action so that the optimal action can be easily recovered directly from the Q-function, making a separate representation of the policy unnecessary. DDPG and especially NAF are the continuous action counterparts of DQN [22], a Q-learning algorithm that recently re-popularized the use of experience replay and target networks to stabilize learning with powerful function approximators such as neural networks. DDPG, NAF, and DQN all interleave mini-batch updates of the Q-function (and the policy for DDPG) with data collection via interaction with the environment. These mini-batch based updates set DDPG and DQN apart from the otherwise closely related NFQ and NFQCA algorithms for discrete and continuous actions respectively. NFQ [29] and NFQCA [9] employ the same basic update as DDPG and DQN; however, they are batch algorithms that perform updates less frequently and fully re-fit the Q-function and the policy network after every episode with several hundred iterations of gradient descent with Rprop [28], using full-batch updates with the entire replay buffer. The aggressive training makes NFQCA data efficient, but the full batch updates can become impractical with large networks, large observation spaces, or when the number of training episodes is large. Finally, DPG can be seen as the deterministic limit of a particular instance of the stochastic value gradients (SVG) family [11], which also computes policy gradients via back-propagation of value gradients, but optimizes stochastic policies.

                                                   Discrete   Continuous
  Mini-batch learning, target networks             DQN        DDPG, NAF
  Full-batch learning with Rprop,
  parameter resetting                              NFQ        NFQCA

One appealing property of the above family of algorithms is that the use of a Q-function facilitates off-policy learning. This allows decoupling the collection of experience data from the updates of the policy and value networks, a desirable property given that experience is expensive to collect in a robotics setup. In this context, because neural network training is often slow, decoupling allows us to make many parameter update steps per step in the environment, ensuring that the networks are well fit to the data that is currently available.

IV. TASK AND EXPERIMENTAL SETUP

The full task that we consider in this paper is to use the arm to pick up one Lego Duplo brick from the table and stack it onto the remaining brick. This "composite" task can be decomposed into several subtasks, including grasping and stacking. In our experiments we consider the full task as well as the two sub-tasks in isolation, as shown in the table below:

               Starting state          Reward
  Grasp        Both bricks on table    Brick 1 above table
  StackInHand  Brick 1 in gripper      Bricks stacked
  Stack        Both bricks on table    Bricks stacked

In every episode the arm starts in a random configuration with the positioning of gripper and brick appropriate for the task of interest. We implement the experiments in a physically plausible simulation in MuJoCo [36], with the simulated arm being closely matched to a real-world Jaco arm setup in our lab (Jaco is a robotic arm developed by Kinova Robotics).
Episodes are terminated after 150 steps, with each step corresponding to 50 ms of physical simulation time. This means that the agent has 7.5 seconds to perform the task. Unless otherwise noted we give a reward of one upon successful completion of the task and zero otherwise. The observation vector provided to the agent contains information about the angles and angular velocities of the 6 joints of the arm and 3 fingers of the gripper. In addition, we provide information about the position and orientation of the two bricks and the relative distances of the two bricks to the pinch position of the gripper, i.e. roughly the position where the fingertips would meet if the fingers are closed. The 9-dimensional continuous action directly sets the velocities of the arm and finger joints. In experiments not reported in this paper we have tried using an observation vector containing only the raw state of the brick in addition to the arm configuration (i.e. without the vector between the end-effector and brick) and found that this increased the number of environment interactions needed roughly by a factor of two to three.

The only hyper-parameter that we optimize for each experimental condition is the learning rate. For each condition we train and measure the performance of 10 agents with different random initial network parameters. After every 30 training episodes the agent is evaluated for 10 episodes. We use the mean performance at each evaluation phase as the performance measure presented in all plots. We found empirically that 10 episodes of evaluation gave a reasonable proxy for performance in the studied tasks. In the plots the line shows the mean performance for the set and the shaded regions correspond to the range between the worst and best performing agent in the set. In all plots the x-axis represents the number of environment transitions seen so far at an evaluation point (in millions) and the y-axis represents episode return. A video of the full setup and examples of policies solving the component and full tasks can be found here: https://www.youtube.com/watch?v=8QnD8ZM0YCo.

V. ASYNCHRONOUS DPG WITH VARIABLE REPLAY STEPS

In this section we study two methods for extending the DDPG algorithm and find that they can have a significant effect on data and computation efficiency, in some cases making the difference between finding a solution to a task or not.

a) Multiple mini-batch replay steps: Deep neural networks can require many steps of gradient descent to converge. In a supervised learning setting this affects only computation time. In reinforcement learning, however, neural network training is interleaved with the acquisition of interaction experience, and the nature of the latter is affected by the state of the former (and vice versa), so the situation is more complicated. To gain a better understanding of this interaction we modified the original DDPG algorithm as described in [20] to perform a fixed but configurable number of mini-batch updates per step in the environment. In [20] one update was performed after each new interaction step. We refer to DDPG with a configurable number of update steps as DPG-R and tested the impact of this modification on the two primitive tasks Grasp and StackInHand. The results are shown in Fig. 2. It is evident that the number of update steps has a dramatic effect on the amount of experience data required for learning successful policies.
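Concretely, the replay-step modification amounts to the training loop sketched below. This is a minimal sketch under assumed interfaces: `env` follows a reset/step convention, `policy` maps an observation to an exploratory action, and `dpg_update` applies one mini-batch update along the lines of the earlier sketch; none of these names come from the paper's implementation.

```python
import collections
import random

class ReplayBuffer:
    """Minimal uniform replay buffer (illustrative)."""
    def __init__(self, capacity=1_000_000):
        self.storage = collections.deque(maxlen=capacity)

    def add(self, transition):
        self.storage.append(transition)

    def sample(self, n):
        return random.sample(list(self.storage), n)

def run_dpg_r(env, policy, dpg_update, replay_steps=40, batch_size=64,
              num_episodes=10_000, episode_len=150):
    """DPG-R: perform `replay_steps` mini-batch updates per environment step.

    replay_steps=1 recovers the original DDPG schedule; larger values trade
    computation for data efficiency, as studied in Fig. 2.
    """
    buffer = ReplayBuffer()
    for _ in range(num_episodes):
        x = env.reset()
        for _ in range(episode_len):
            u = policy(x)                      # assumed to include exploration noise
            x_next, r, done = env.step(u)
            buffer.add((x, u, r, x_next))
            if len(buffer.storage) >= batch_size:
                # The DPG-R change: R mini-batch updates per single environment step.
                for _ in range(replay_steps):
                    dpg_update(buffer.sample(batch_size))
            x = x_next
            if done:
                break
```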
After one million interactions the original version of DDPG with a single update step (blue traces) appears to have made no progress towards a successful policy for stacking, and only a small number of controllers have learned to grasp. Increasing the number of updates per interaction to 5 greatly improves the results (green traces), and with 40 updates (purple) the first successful policies for stacking and grasping are obtained after 200,000 and 300,000 interactions respectively (corresponding to 1,300 and 2,000 episodes). It is notable that although the improvement is task dependent and the dependence between update steps and convergence is clearly not linear, in both cases we continue to see a reduction in total environment interaction up to 40 update steps, the maximum used in the experiment.

One may speculate as to why changing the number of updates per environment step has such a pronounced effect. One hypothesis is that, loosely speaking and drawing an analogy to supervised learning, insufficient training leads to underfitting of the policy and value networks with respect to the already collected training data. Unlike in supervised learning, however, where the dataset is typically fixed, the quality of the policy directly feeds back into the data acquisition process since the policy network is used for exploration, thus affecting the quality of the data used in future iterations of network training. We have observed in various experiments (not listed here) that other aspects of the network architecture and training process can have a similar effect on the extent of underfitting. Some examples include the type of non-linearities used in the network layers, the size of layers and the learning rate. It is important to note that one cannot replicate the effect of multiple replay steps simply by increasing the learning rate. In practice we find that attempts to do so make training unstable.

Fig. 2: Mean episode return as a function of the number of transitions seen (in millions) of DPG-R (single worker) on the Grasp (left) and StackInHand (right) task with 1 (blue), 5 (green), 10 (red), 20 (yellow) and 40 (purple) mini-batch updates per environment step.

b) Asynchronous DPG: While increasing the number of update steps relative to the number of environment interactions greatly improves the data efficiency of the algorithm, it can also strongly increase the computation time. In the extreme case, in simulation, when the overall run time is dominated by the network updates, it may scale linearly with the number of replay steps. In this setting it is desirable to be able to parallelize the update computations. In a real robotics setup the overall run time is typically dominated by the collection of robot interactions. In this case it is desirable to be able to collect experience from multiple robots simultaneously (e.g. as in [39, 5]). We therefore develop an asynchronous version of DPG that allows parallelization of training and environment interaction by combining multiple instances of a DPG-R actor and critic that each share their network parameters and can be configured to either share or have independent experience replay buffers. This is inspired by the A3C algorithm proposed in [23], and also analogous to [5, 39]. We found that this strategy is also an effective way to share parameters for DPG.
That is, we employ asynchronous updates whereby each worker has its own copy of the parameters and uses it for computing gradients, which are then applied to a shared parameter instance without any synchronization. We use the Adam optimizer [15] with local non-shared first-order statistics and a single shared instance of second-order statistics. The pseudo code of the asynchronous DPG-R is shown in algorithm box 1.

Algorithm 1: (A)DPG-R

Initialize global shared critic and actor network parameters $\theta^{Q''}$ and $\theta^{\mu''}$.
Pseudo code for each learner thread:
  Initialize critic network $Q(s, a \mid \theta^Q)$ and actor $\mu(s \mid \theta^\mu)$ with weights $\theta^Q$ and $\theta^\mu$.
  Initialize target networks $Q'$ and $\mu'$ with weights $\theta^{Q'} \leftarrow \theta^Q$, $\theta^{\mu'} \leftarrow \theta^\mu$.
  Initialize replay buffer $R$.
  for episode = 1, M do
    Receive initial observation state $s_1$.
    for t = 1, T do
      Select action $a_t = \mu(s_t \mid \theta^\mu) + \mathcal{N}_t$ according to the current policy and exploration noise.
      Perform action $a_t$, observe reward $r_t$ and new state $s_{t+1}$.
      Store transition $(s_t, a_t, r_t, s_{t+1})$ in $R$.
      for update = 1, R do
        Sample a random minibatch of $N$ transitions $(s_i, a_i, r_i, s_{i+1})$ from $R$.
        Set $y_i = r_i + \gamma Q'(s_{i+1}, \mu'(s_{i+1} \mid \theta^{\mu'}) \mid \theta^{Q'})$.
        Perform an asynchronous update of the shared critic parameters by minimizing the loss
          $L = \frac{1}{N} \sum_i \left(y_i - Q(s_i, a_i \mid \theta^Q)\right)^2$.
        Perform an asynchronous update of the shared actor parameters using the sampled policy gradient
          $\nabla_{\theta^{\mu''}} \mu \approx \frac{1}{N} \sum_i \nabla_a Q(s, a \mid \theta^Q)\big|_{s=s_i, a=\mu(s_i)}\, \nabla_{\theta^\mu} \mu(s \mid \theta^\mu)\big|_{s_i}$.
        Copy the shared parameters to the local ones: $\theta^Q \leftarrow \theta^{Q''}$, $\theta^\mu \leftarrow \theta^{\mu''}$.
        Every S update steps, update the target networks: $\theta^{Q'} \leftarrow \theta^Q$, $\theta^{\mu'} \leftarrow \theta^\mu$.
      end for
    end for
  end for

Figure 3 compares the performance of ADPG-R for different numbers of update steps and 16 workers (all workers performing both data collection and computing updates). Similar to Fig. 2 we find that increasing the ratio of update steps per environment step improves data efficiency, although the effect appears to be somewhat less pronounced than for DPG-R.

Fig. 3: Mean episode return as a function of the number of transitions seen (in millions) of ADPG-R (16 workers) on the Grasp (left) and StackInHand (right) task. Different colored traces indicate numbers of replay steps as in Fig. 2.

Figure 4 (top row) directly compares the single-worker and asynchronous versions of DPG-R. In both cases we choose the best performing number of replay steps and learning rate. As we can see, the use of multiple workers does not affect overall data efficiency for StackInHand, but it is reduced roughly by half for Grasp, with the note that the single worker has not quite converged. Figure 4 (bottom row) plots the same data but as a function of environment steps per worker. This measure corresponds to the optimal wall clock efficiency that we can achieve, under the assumption that communication time between workers is negligible compared to environment interaction and gradient computation (this usually holds up to a certain degree of parallelization). This theoretical wall clock time for running an experiment with 16 workers is about 16x lower for StackInHand and roughly 8x lower for Grasp. Overall these results show that distributing neural network training and data collection across multiple computers and robots can be an extremely effective way of reducing the overall run time of experiments and thus making it feasible to run more challenging experiments. We make extensive use of asynchronous DPG for the remaining experiments.
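For completeness, here is a simplified sketch of the parameter-sharing mechanics in PyTorch, in the spirit of Hogwild-style training: the actor and critic parameters live in shared memory and each worker process applies its own optimizer steps to them without locks. It uses random placeholder batches instead of replay samples and does not reproduce the paper's specific Adam variant with local first-order and shared second-order statistics; all sizes and rates are assumptions.

```python
import torch
import torch.multiprocessing as mp
import torch.nn as nn

def make_nets(obs_dim=25, act_dim=9):
    """Illustrative actor/critic; architectures are assumptions, not the paper's."""
    actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                          nn.Linear(64, act_dim), nn.Tanh())
    critic = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(),
                           nn.Linear(64, 1))
    return actor, critic

def worker(rank, actor, critic, num_updates=1000, batch_size=64,
           obs_dim=25, act_dim=9, gamma=0.99):
    """One learner process: gradients are computed locally and applied to the
    shared parameter tensors without synchronization."""
    torch.manual_seed(rank)
    critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)
    actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
    for _ in range(num_updates):
        # Placeholder batch; in ADPG-R this would come from the worker's replay buffer.
        x = torch.randn(batch_size, obs_dim)
        u = torch.rand(batch_size, act_dim) * 2 - 1
        r = torch.randn(batch_size, 1)
        x_next = torch.randn(batch_size, obs_dim)

        with torch.no_grad():
            y = r + gamma * critic(torch.cat([x_next, actor(x_next)], dim=-1))
        critic_loss = ((critic(torch.cat([x, u], dim=-1)) - y) ** 2).mean()
        critic_opt.zero_grad()
        critic_loss.backward()
        critic_opt.step()

        actor_loss = -critic(torch.cat([x, actor(x)], dim=-1)).mean()
        actor_opt.zero_grad()
        actor_loss.backward()
        actor_opt.step()

if __name__ == "__main__":
    actor, critic = make_nets()
    actor.share_memory()     # place parameters in shared memory
    critic.share_memory()
    workers = [mp.Process(target=worker, args=(i, actor, critic)) for i in range(16)]
    for p in workers:
        p.start()
    for p in workers:
        p.join()
```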
Fig. 4: Two panels: (a) Grasp; (b) StackInHand. 16 workers vs. a single worker in data (total for all workers) and "wall-clock" (per-worker) time, in millions of transitions, with the best replay step and learning rate selection.

VI. COMPOSITE SHAPING REWARDS

In the previous section we discussed how the ability of DDPG to exploit information that is available in the acquired interaction data affects learning speed. One important factor that determines what information is available from this data is the nature of the reward function. The reward function in the previous section was a "sparse" or "pure" reward, where a reward of 1 was given for states that correspond to successful task completion (brick 1 lifted above 3 cm for Grasp; bricks stacked for Stack) and 0 otherwise. For this reward to be useful for learning it is of course necessary that the agent is able to enter this goal region in state space with whatever exploration strategy is chosen. This was indeed the case for the two subtasks in isolation, but it is highly unlikely for the full task: without further guidance naïve random exploration is very unlikely to lead to a successful grasp and stack, as we also experimentally verify in Fig. 5.

One commonly used solution to this problem is to provide informative shaping rewards that allow a learning signal to be obtained even with simple exploration strategies, e.g. by embedding information about the value function in the reward function for every transition acquired from the environment. For instance, for a simple reaching problem with a robotic arm we could define a shaping reward that takes into account the distance between the end-effector and the target. While this is a convenient way of embedding prior knowledge about the solution and is a widely and successfully used approach for simple problems, it comes with several caveats, especially for complex sequential or compositional tasks such as the one we are interested in here. Firstly, while a suitable shaping reward may be easy to construct for simple problems, for more complex composite tasks, such as the one considered in this paper, a suitable reward function is often non-obvious and may require considerable effort and experimentation. Secondly, and related to the previous point, the use of a shaping reward typically alters the solution to the optimization problem. The effect of this can be benign, but especially when it comes to complex tasks a small mistake may lead to complete failure of learning, as we will demonstrate below. Thirdly, in a robotics setup not all information that would be desirable to define a good shaping reward may be easily available. For instance, in the manipulation problem considered in this paper determining the position of the Lego bricks requires extra instrumentation of the experimental setup.

In this section we propose and analyze several possible reward functions for our full Stack task, aiming to provide a recipe that can be applied to other tasks with similar compositional structure. Shaping rewards are typically defined based on some notion of distance from or progress towards a goal state. We attempt to transfer this idea to our compositional setup via what we call composite (shaping) rewards. These reward functions return an increasing reward as the agent completes components of the full task.
They are either piecewise constant or smoothly varying across different regions of the state space that correspond to completed subtasks. In the case of Stack we use the reward components described in Table I.

 Sparse reward components
 Subtask          Description                                                       Reward
 Reach Brick 1    hypothetical pinch site position of the fingers is in a box
                  around the first brick position                                   0.125
 Grasp Brick 1    the first brick is located at least 3 cm above the table
                  surface, which is only possible if the arm is holding the brick   0.25
 Stack Brick 1    bricks stacked                                                    1.00

 Smoothly varying reward components
 Reaching to brick 1    distance of the pinch site to the first brick
                        (non-linear, bounded)                                       [0, 0.125]
 Reaching to stack      while grasped: distance of the first brick to the
                        stacking site of the second brick (non-linear, bounded)     [0.25, 0.5]

 TABLE I: Composite reward function

These reward components can be combined in different ways. We consider three different composite rewards in addition to the original sparse task reward:

Grasp shaping: Grasp brick 1 and Stack brick 1, i.e. the agent receives a reward of 0.25 when brick 1 has been grasped and a reward of 1.0 after completion of the full task.

Reach and grasp shaping: Reach brick 1, Grasp brick 1 and Stack brick 1, i.e. the agent receives a reward of 0.125 when being close to brick 1, a reward of 0.25 when brick 1 has been grasped, and a reward of 1.0 after completion of the full task.

Full composite shaping: the sparse reward components as before in combination with the distance-based smoothly varying components.

Figure 5 shows the results of learning with the above reward functions (blue traces). The figure makes clear that learning with the sparse reward only does not succeed for the full task. Introducing an intermediate reward for grasping allows the agent to learn to grasp, but learning is very slow. The time to successful grasping can be substantially reduced by giving a distance based reward component for reaching to the first brick, but learning does not progress beyond grasping. Only with an additional intermediate reward component, as in the continuous reach, grasp, stack variant (Full composite shaping), can the full task be solved.

Although the above reward functions are specific to the particular task, we expect that the idea of a composite reward function can be applied to many other tasks, thus allowing learning to succeed even for challenging problems. Nevertheless, great care must be taken when defining the reward function. We encountered several unexpected failure cases while designing the reward function components: e.g. the reach and grasp components leading to a grasp unsuitable for stacking; the agent not stacking the bricks because it would stop receiving the grasping reward before it received the reward for stacking; and the agent flipping the brick because it received a grasping reward calculated with the wrong reference point on the brick. We show examples of these in the video: https://www.youtube.com/watch?v=8QnD8ZM0YCo.

VII. LEARNING FROM INSTRUCTIVE STATES

In the previous section we have described a strategy for designing effective reward functions for complex compositional tasks which alleviates the burden of exploration. We have also pointed out, however, that designing shaping rewards can be error prone and may rely on privileged information. In this section we describe a different strategy for embedding prior knowledge into the training process and improving exploration that reduces the reliance on carefully designed reward functions.
Specifically, we propose to let the distribution of states at which the learning agent is initialized at the beginning of an episode reflect the compositional nature of the task: in our case, instead of always initializing the agent at the beginning of the full task with both bricks on the table, we can, for instance, choose to initialize the agent occasionally with the brick already in its hand and thus prepared for stacking, in the same way as when learning the subtask StackInHand in section V. Trajectories of policies solving the task will have to visit this region of space before stacking the bricks, and we can thus think of this initialization strategy as initializing the agent closer to the goal.

More generally, we can choose to initialize episodes with states taken from anywhere along or close to successful trajectories. Suitable states can be either manually defined (as in section V), or they can be obtained from a human demonstrator or a previously trained agent that can partially solve the task. This can be seen as a form of apprenticeship learning in which we provide teacher information by influencing the state visitation distribution.

We perform experiments with two alternative methods for generating the starting states. The first one uses manually defined initial states and amounts to the possibility discussed above: we initialize the learning agent either in the original starting states with both bricks located on the table, or in states where the first brick is already in the gripper, as if the agent had just performed a successful grasp and lifted the brick. These two sets of start states correspond to those used in section V.

The second method for generating instructive starting states can also be used on a real robot, provided a human demonstrator or a pre-trained policy is available. It aims at initializing the learning agent along solution trajectory states in a more fine-grained fashion. For each episode we sample a random number of steps between one and the expected number of steps required to solve the task from the original starting states, and then run the demonstrator for this number of steps. The final state of this process is then used as the starting state for the learning agent, which then acts in the environment for the remainder of the episode.

Fig. 5: Four panels with (a) no progress without extra shaping; (b, c, d) different shaping strategies for the composite task, with starting states with both bricks on the table (blue), manually defined initial states (green), and initial states continuously on solution trajectories (red). On all plots, the x-axis is millions of transitions of total experience and the y-axis is mean episode return. Policies with mean return over 100 robustly perform the full Stack from different starting states.

The results of these experiments are shown in Figure 5. It shows results for the four reward functions considered in the previous section when combined with the simple augmented start state distribution. While there is still no learning for the basic sparse reward case, results obtained with all other reward functions are improved. In particular, even for the second simplest reward function (Grasp shaping) we now obtain some controllers that can solve the full task. Learning with the full composite shaping reward is faster and more robust than without the use of instructive states.
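The fine-grained variant of this scheme fits in a few lines of code. The sketch below assumes a generic environment interface (reset/step) and a `demonstrator` callable (a pre-trained policy or a replayed demonstration); the names and the interface are illustrative, not the paper's implementation.

```python
import random

def instructive_reset(env, demonstrator, expected_solve_steps=150):
    """Start an episode part-way along a demonstrator trajectory.

    Assumed interfaces: env.reset() -> state, env.step(u) -> (state, reward, done),
    demonstrator(state) -> action.
    """
    x = env.reset()                        # ordinary start: both bricks on the table
    n_steps = random.randint(1, expected_solve_steps)
    for _ in range(n_steps):
        u = demonstrator(x)                # pre-trained policy or replayed demonstration
        x, _, done = env.step(u)
        if done:
            break                          # stop early if the demonstrator's episode ends
    return x                               # the learner now acts from this state
```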
The top left plot of Figure 5 (red trace) shows results for the case where the episode is initialized anywhere along trajectories from a pre-trained controller. We use this start state distribution in combination with the basic sparse reward for the overall task (Stack without shaping). Episodes were configured to be 50 steps, shorter than in the previous experiments, to be better suited to this setup with assisted exploration. During testing we still used episodes with 150 steps as before (so the traces are comparable). We can see a large improvement in performance in comparison to the two-start-state variant, even in the absence of any shaping rewards. We can learn a robust policy for all seeds within a total of 1 million environment transitions. This corresponds to less than 1 hour of interaction time on 16 simulated robots.

Overall these results suggest that an appropriate start state distribution not only greatly speeds up learning, it also allows simpler reward functions to be used. In our final experiment the simplest reward function, only indicating overall experimental success, was sufficient to solve the task. Considering the difficulties that can be associated with designing good shaping rewards, this is an encouraging result. The robustness of the learned policies to starting state variation is also quite encouraging. Table II lists the success rate by task from 1000 trials. A video with trained policies performing the Grasp, StackInHand and Stack tasks from different initial states can be found in the supplementary material.

              Success rate (1000 random starts)
 Grasp              99.2%
 StackInHand        98.2%
 Stack              95.5%

 TABLE II: Robustness of learned policies.

VIII. CONCLUSION

We have introduced two extensions to the DDPG algorithm which make it a powerful method for learning robust policies for complex continuous control tasks. Specifically, we have shown that by decoupling the frequency of network updates from the environment interaction we can substantially improve data-efficiency, to a level that in some cases makes the difference between finding a solution or not. The asynchronous version of DDPG, which allows data collection and network training to be distributed over several computers and (simulated) robots, has provided us with a close to linear speed up in wall-clock time for 16 parallel workers.

In addition, we presented two methods that help to guide the learning process towards good solutions and thus reduce the pressure on exploration strategies and speed up learning. The first, composite rewards, is a recipe for constructing effective reward functions for tasks that consist of a sequence of sub-tasks. The second, instructive starting states, can be seen as a lightweight form of apprenticeship learning that facilitates learning of long horizon tasks even with sparse rewards, a property of many real-world problems. Taken together, the algorithmic changes and exploration shaping strategies have allowed us to learn robust policies for the Stack task within a number of transitions that is feasible to collect in a real-robot system within a few days, or in significantly less time if multiple robots were used for training. It is of course a challenge to judge the transfer of results in simulation to the real world.
We have taken care to design a physically realistic simulation, and in initial experiments, which we have performed both in simulation and on the physical robot, we generally find a good correspondence of performance and learning speed between simulation and real world. This makes us optimistic that our performance numbers also hold when going to the real world. A second caveat of our simulated setup is that it currently uses information about the state of the environment, which although not impossible to obtain on a real robot, may require additional instrumentation of the experimental setup, e.g. to determine the position of the two bricks in the work space. To address this second issue we are currently focusing on end-to-end learning directly from raw visual information. Here, we have some first results showing the feasibility of learning policies for grasping with a success rate of about 80% across different starting conditions. We view the algorithms and techniques presented here as an important step towards applying versatile deep reinforcement learning methods for real-robot dexterous manipulation with perception. REFERENCES [1] J Andrew Bagnell and Jeff G Schneider. Autonomous helicopter control using reinforcement learning policy search methods. In Robotics and Automation, 2001. Proceedings 2001 ICRA. IEEE International Conference on, volume 2, pages 1615–1620. IEEE, 2001. [2] A. Boularias, J. Kober, and J. Peters. Relative entropy inverse reinforcement learning. In JMLR Workshop and Conference Proceedings Volume 15: AISTATS 2011, pages 182–189, Cambridge, MA, USA, April 2011. MIT Press. [3] Marc Peter Deisenroth, Gerhard Neumann, Jan Peters, et al. A survey on policy search for robotics. Foundations and Trends in Robotics, 2(1-2):1–142, 2013. [4] Chelsea Finn, Sergey Levine, and Pieter Abbeel. Guided cost learning: Deep inverse optimal control via policy optimization. In Proceedings of the 33nd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, pages 49–58, 2016. URL http://jmlr.org/proceedings/papers/v48/finn16.html. [5] Shixiang Gu, Ethan Holly, Timothy Lillicrap, and Sergey Levine. Deep reinforcement learning for robotic manip- ulation. arXiv preprint arXiv:1610.00633, 2016. [6] Shixiang Gu, Sergey Levine, Ilya Sutskever, and Andriy Mnih. Muprop: Unbiased backpropagation for stochastic neural networks. International Conference on Learning Representations (ICLR), 2016. [7] Shixiang Gu, Tim Lillicrap, Ilya Sutskever, and Sergey Levine. Continuous deep q-learning with model-based acceleration. In International Conference on Machine Learning (ICML), 2016. [8] Abhishek Gupta, Clemens Eppner, Sergey Levine, and Pieter Abbeel. Learning dexterous manipulation for a soft robotic hand from human demonstrations. In 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS 2016, Daejeon, South Korea, October 9-14, 2016, pages 3786–3793, 2016. [9] Roland Hafner and Martin Riedmiller. Reinforcement learning in feedback control. Machine learning, 84(1-2): 137–169, 2011. [10] Roland Hafner and Martin A. Riedmiller. Neural rein- forcement learning controllers for a real robot applica- tion. In 2007 IEEE International Conference on Robotics and Automation, ICRA 2007, 10-14 April 2007, Roma, Italy, pages 2098–2103, 2007. [11] Nicolas Heess, Gregory Wayne, David Silver, Tim Lill- icrap, Tom Erez, and Yuval Tassa. Learning continuous control policies by stochastic value gradients. 
In Ad- vances in Neural Information Processing Systems (NIPS), pages 2926–2934, 2015. [12] K. J. Hunt, D. Sbarbaro, R. ˙Zbikowski, and P. J. Gawthrop. Neural networks for control systems: A survey. Automatica, 28(6):1083–1112, November 1992. ISSN 0005-1098. [13] M. Kalakrishnan, L. Righetti, P. Pastor, and S. Schaal. Learning force control policies for compliant manipula- tion. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2011), Sept. 25-30, San Francisco, CA, 2011. URL http://www-clmc.usc.edu/ publications/K/kalakrishnan-IROS2011. [14] M. Kalakrishnan, P. Pastor, L. Righetti, and S. Schaal. Learning objective functions for manipulation. In IEEE International Conference on Robotics and Automation, 2013. [15] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014. [16] Nate Kohl and Peter Stone. Policy gradient reinforcement learning for fast quadrupedal locomotion. In Proceedings of the IEEE International Conference on Robotics and Automation, May 2004. [17] Sergey Levine and Pieter Abbeel. Learning neural net- work policies with guided policy search under unknown dynamics. In Advances in Neural Information Processing Systems (NIPS), pages 1071–1079, 2014. [18] Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-end training of deep visuomotor policies. arXiv preprint arXiv:1504.00702, 2015. [19] Sergey Levine, Peter Pastor, Alex Krizhevsky, and Deirdre Quillen. Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection. CoRR, abs/1603.02199, 2016. [20] Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforce- ment learning. International Conference on Learning Representations (ICLR), 2016. [21] Takamitsu Matsubara, Jun Morimoto, Jun Nakanishi, Masa-aki Sato, and Kenji Doya. Learning cpg- based biped locomotion with a policy gradient method. Robotics and Autonomous Systems, 54(11):911–920, 2006. [22] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep rein- forcement learning. Nature, 518(7540):529–533, 2015. [23] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy P Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In Interna- tional Conference on Machine Learning (ICML), 2016. [24] K. Muelling, J. Kober, O. Kroemer, and J. Pe- ters. Learning to select and generalize striking move- ments in robot table tennis. (3):263–279, 2013. URL http://www.ias.informatik.tu-darmstadt.de/uploads/ Publications/Muelling IJRR 2013.pdf. [25] P. Pastor, M. Kalakrishnan, S. Chitta, E. Theodorou, and S. Schaal. Skill learning and task outcome prediction for manipulation. In IEEE International Conference on Robotics and Automation (ICRA), Shanghai, China, May 9-13, 2011. [26] Jan Peters and Stefan Schaal. Policy gradient methods for robotics. In International Conference on Intelligent Robots and Systems (IROS), pages 2219–2225. IEEE, 2006. [27] Lerrel Pinto and Abhinav Gupta. Supersizing self- supervision: Learning to grasp from 50k tries and 700 robot hours. CoRR, abs/1509.06825, 2015. URL http: //arxiv.org/abs/1509.06825. [28] M. Riedmiller and H. Braun. 
A direct adaptive method for faster backpropagation learning: The RPROP algo- rithm. In H. Ruspini, editor, Proceedings of the IEEE International Conference on Neural Networks (ICNN), pages 586 – 591, San Francisco, 1993. [29] Martin A. Riedmiller. Neural fitted Q iteration - first experiences with a data efficient neural reinforcement learning method. In Machine Learning: ECML 2005, 16th European Conference on Machine Learning, Porto, Portugal, October 3-7, 2005, Proceedings, pages 317– 328, 2005. [30] Stefan Schaal. Dynamic Movement Primitives -A Frame- work for Motor Control in Humans and Humanoid Robotics, pages 261–280. Springer Tokyo, Tokyo, 2006. ISBN 978-4-431-31381-6. doi: 10.1007/4-431-31381-8 23. URL http://dx.doi.org/10.1007/4-431-31381-8 23. [31] John Schulman, Sergey Levine, Pieter Abbeel, Michael I. Jordan, and Philipp Moritz. Trust region policy optimiza- tion. In International Conference on Machine Learning (ICML), pages 1889–1897, 2015. [32] John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation. Interna- tional Conference on Learning Representations (ICLR), 2016. [33] David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, and Martin Riedmiller. Deterministic policy gradient algorithms. In International Conference on Machine Learning (ICML), 2014. [34] Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction, volume 1. MIT press Cam- bridge, 1998. [35] Gerald Tesauro. Temporal difference learning and td- gammon. Commun. ACM, 38(3):58–68, 1995. [36] Emanuel Todorov, Tom Erez, and Yuval Tassa. Mujoco: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5026–5033. IEEE, 2012. [37] Herke van Hoof, Tucker Hermans, Gerhard Neumann, and Jan Peters. Learning robot in-hand manipulation with tactile features. In 15th IEEE-RAS International Con- ference on Humanoid Robots, Humanoids 2015, Seoul, South Korea, November 3-5, 2015, pages 121–127, 2015. [38] Paul J. Webros. Neural networks for control. chapter A Menu of Designs for Reinforcement Learning over Time, pages 67–95. 1990. ISBN 0-262-13261-3. [39] Ali Yahya, Adrian Li, Mrinal Kalakrishnan, Yevgen Chebotar, and Sergey Levine. Collective robot rein- forcement learning with distributed asynchronous guided policy search. CoRR, abs/1610.00673, 2016. URL http://arxiv.org/abs/1610.00673. APPENDIX A. Reward function In this section we provide further details regarding the reward functions described in section VI. For our experiments we derived these from the state vector of the simulation, but they could also be obtained through instrumentation in hardware. The reward functions are defined in terms of the following quantities: • b(1) z : height of brick 1 above table • sB1 {x,y,z}: x,y,z positions of site located roughly in the center of brick 1 • sB2 {x,y,z}: x,y,z positions of site located just above brick 2, at the position where sB1 will be located when brick 1 is stacked on top of brick 2. • sP {x,y,z}: x,y,z positions of the pinch site of the hand – roughly the position where the fingertips would meet if the fingers are closed.. 1) Sparse reward components: Using the above we can define the following conditions for the successful completion of subtasks: a) Reach Brick 1: The pinch site of the fingers is within a virtual box around the first brick position. 
$$\mathrm{reach} = \left(|s^{B1}_x - s^P_x| < \Delta^{\mathrm{reach}}_x\right) \wedge \left(|s^{B1}_y - s^P_y| < \Delta^{\mathrm{reach}}_y\right) \wedge \left(|s^{B1}_z - s^P_z| < \Delta^{\mathrm{reach}}_z\right),$$

where $\Delta^{\mathrm{reach}}_{\{x,y,z\}}$ denote the half-lengths of the sides of the virtual box for reaching.

b) Grasp Brick 1: Brick 1 is located above the table surface by a threshold $\theta$, which is possible only if the arm has lifted the brick:

$$\mathrm{grasp} = b^{(1)}_z > \theta.$$

c) Stack: Brick 1 is stacked on brick 2. This is expressed as a box constraint on the displacement between brick 1 and brick 2 measured in the coordinate system of brick 2:

$$\mathrm{stack} = \left(|C^{(2)}_x (s^{B1} - s^{B2})| < \Delta^{\mathrm{stack}}_x\right) \wedge \left(|C^{(2)}_y (s^{B1} - s^{B2})| < \Delta^{\mathrm{stack}}_y\right) \wedge \left(|C^{(2)}_z (s^{B1} - s^{B2})| < \Delta^{\mathrm{stack}}_z\right),$$

where $\Delta^{\mathrm{stack}}_{\{x,y,z\}}$ denote the half-lengths of the sides of the virtual box for stacking, and $C^{(2)}$ is the rotation matrix that projects a vector into the coordinate system of brick 2. This projection into the coordinate system of brick 2 is necessary since brick 2 is allowed to move freely. It ensures that the box constraint is considered relative to the pose of brick 2. While this criterion for a successful stack is quite complicated to express in terms of sites, it could be easily implemented in hardware, e.g. via a contact sensor attached to brick 2.

2) Shaping components: The full composite reward also includes two distance-based shaping components that guide the hand to brick 1 and then brick 1 to brick 2. These could be approximate and would be relatively simple to implement with a hardware vision system that can only roughly identify the centroid of an object. The shaping components of the reward are given as follows:

a) Reaching to brick 1:

$$r_{S1}(s^{B1}, s^P) = 1 - \tanh^2\left(w_1 \|s^{B1} - s^P\|_2\right)$$

b) Reaching to brick 2 for stacking:

$$r_{S2}(s^{B1}, s^{B2}) = 1 - \tanh^2\left(w_2 \|s^{B1} - s^{B2}\|_2\right).$$

3) Full reward: Using the above components, the reward functions from section VI (Stack, Grasp shaping, Reach and grasp shaping, and Full composite shaping) can be expressed as in equations (3, 4, 5, 6) below. These make use of the predicates above to determine which subtasks have been completed and return a reward accordingly. In the following, the reward and all predicates take the arguments $(b^{(1)}_z, s^P, s^{B1}, s^{B2})$, which we omit where unambiguous for readability.

$$r = \begin{cases} 1 & \text{if } \mathrm{stack} \\ 0 & \text{otherwise} \end{cases} \quad (3)$$

$$r = \begin{cases} 1 & \text{if } \mathrm{stack} \\ 0.25 & \text{if } \neg\mathrm{stack} \wedge \mathrm{grasp} \\ 0 & \text{otherwise} \end{cases} \quad (4)$$

$$r = \begin{cases} 1 & \text{if } \mathrm{stack} \\ 0.25 & \text{if } \neg\mathrm{stack} \wedge \mathrm{grasp} \\ 0.125 & \text{if } \neg(\mathrm{stack} \vee \mathrm{grasp}) \wedge \mathrm{reach} \\ 0 & \text{otherwise} \end{cases} \quad (5)$$

$$r = \begin{cases} 1 & \text{if } \mathrm{stack} \\ 0.25 + 0.25\, r_{S2}(s^{B1}, s^{B2}) & \text{if } \neg\mathrm{stack} \wedge \mathrm{grasp} \\ 0.125 & \text{if } \neg(\mathrm{stack} \vee \mathrm{grasp}) \wedge \mathrm{reach} \\ 0.125\, r_{S1}(s^{B1}, s^P) & \text{otherwise} \end{cases} \quad (6)$$
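For readers who prefer code to case notation, the following sketch implements the predicates and the full composite shaping reward of eq. (6) in Python/NumPy. The 3 cm lift threshold follows Table I, while the box half-lengths and the weights w1, w2 are illustrative placeholders rather than the constants used in the paper.

```python
import numpy as np

# Illustrative constants; only the 3 cm lift threshold is stated in the paper.
DELTA_REACH = np.array([0.02, 0.02, 0.02])   # half-lengths of the reach box (m), assumed
DELTA_STACK = np.array([0.01, 0.01, 0.01])   # half-lengths of the stack box (m), assumed
THETA = 0.03                                 # lift threshold for grasp (m), 3 cm
W1, W2 = 10.0, 10.0                          # shaping weights, assumed

def reach(s_b1, s_p):
    """Pinch site inside a virtual box around brick 1."""
    return bool(np.all(np.abs(s_b1 - s_p) < DELTA_REACH))

def grasp(b1_z):
    """Brick 1 lifted above the table by more than THETA."""
    return b1_z > THETA

def stack(s_b1, s_b2, c2):
    """Brick 1 inside a virtual box around the stacking site, expressed in
    brick 2's frame; c2 is the rotation matrix into brick 2's coordinates."""
    return bool(np.all(np.abs(c2 @ (s_b1 - s_b2)) < DELTA_STACK))

def r_s1(s_b1, s_p):
    """Smooth shaping term guiding the hand to brick 1, in [0, 1]."""
    return 1.0 - np.tanh(W1 * np.linalg.norm(s_b1 - s_p)) ** 2

def r_s2(s_b1, s_b2):
    """Smooth shaping term guiding brick 1 to the stacking site, in [0, 1]."""
    return 1.0 - np.tanh(W2 * np.linalg.norm(s_b1 - s_b2)) ** 2

def full_composite_reward(b1_z, s_p, s_b1, s_b2, c2):
    """Full composite shaping reward, eq. (6): cases checked in order."""
    if stack(s_b1, s_b2, c2):
        return 1.0
    if grasp(b1_z):
        return 0.25 + 0.25 * r_s2(s_b1, s_b2)
    if reach(s_b1, s_p):
        return 0.125
    return 0.125 * r_s1(s_b1, s_p)
```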