Sim-to-Real Robot Learning from Pixels with Progressive Nets

Andrei A. Rusu, Mel Večerík, Thomas Rothörl, Nicolas Heess, Razvan Pascanu, Raia Hadsell
DeepMind, London, UK
{andreirusu, matejvecerik, tcr, heess, razp, raia}@google.com

Abstract: Applying end-to-end learning to solve complex, interactive, pixel-driven control tasks on a robot is an unsolved problem. Deep Reinforcement Learning algorithms are too slow to achieve performance on a real robot, but their potential has been demonstrated in simulated environments. We propose using progressive networks to bridge the reality gap and transfer learned policies from simulation to the real world. The progressive net approach is a general framework that enables reuse of everything from low-level visual features to high-level policies for transfer to new tasks, enabling a compositional, yet simple, approach to building complex skills. We present an early demonstration of this approach with a number of experiments in the domain of robot manipulation that focus on bridging the reality gap. Unlike other proposed approaches, our real-world experiments demonstrate successful task learning from raw visual input on a fully actuated robot manipulator. Moreover, rather than relying on model-based trajectory optimisation, the task learning is accomplished using only deep reinforcement learning and sparse rewards.

Keywords: Robot learning, transfer, progressive networks, sim-to-real, CoRL.

1 Introduction

Deep Reinforcement Learning offers new promise for achieving human-level control in robotics domains, especially for pixel-to-action scenarios where state estimation is from high-dimensional sensors and environment interaction and feedback are critical.
With deep RL, a new set of algorithms has emerged that can attain sophisticated, precise control on challenging tasks, but these accomplishments have been demonstrated primarily in simulation, rather than on actual robot platforms. While recent advances in simulation-driven deep RL are impressive [1, 2, 3, 4, 5, 6, 7], demonstrating learning capabilities on real robots remains the bar by which we must measure the practical applicability of these methods. However, this poses a significant challenge, given the "data-hungry" training regime required for current pixel-based deep RL methods, and the relative frailty of research robots and their human handlers. One solution is to use transfer learning methods to bridge the reality gap that separates simulation from real world domains. In this paper, we use progressive networks, a deep learning architecture that has recently been proposed for transfer learning, to demonstrate such an approach, thus providing a proof-of-concept pathway by which deep RL can be used to effect fast policy learning on a real robot.

Progressive nets have been shown to produce positive transfer between disparate tasks such as Atari games by utilizing lateral connections to previously learnt models [8]. The addition of new capacity for each new task allows specialized input features to be learned, an important advantage for deep RL algorithms which are improved by sharply-tuned perceptual features. An advantage of progressive nets compared with other methods for transfer learning or domain adaptation is that multiple tasks may be learned sequentially, without needing to specify source and target tasks. This paper presents an approach for transfer from simulation to the real robot that is proven using real-world, sparse-reward tasks.

1st Conference on Robot Learning (CoRL 2017), Mountain View, United States. arXiv:1610.04286v2 [cs.RO] 22 May 2018
The tasks are learned using end-to-end deep RL, with RGB inputs and joint velocity output actions. First, an actor-critic network is trained in simulation using multiple asynchronous workers [6]. The network has a convolutional encoder followed by an LSTM. From the LSTM state, using a linear layer, we compute a set of discrete action outputs that control the different degrees of freedom of the simulated robot, as well as the value function. After training, a new network is initialized with lateral, nonlinear connections to each convolutional and recurrent layer of the simulation-trained network. The new network is trained on a similar task on the real robot. Our initial findings show that the inductive bias imparted by the features and encoded policy of the simulation net is enough to give a dramatic learning speed-up on the real robot.

2 Transfer Learning from Simulation to Real

Our approach relies on the progressive nets architecture, which enables transfer learning through lateral connections which connect each layer of previously learnt network columns to each new column, thus supporting rich compositionality of features. We first summarize progressive nets, and then we discuss their application for transfer in robot domains.

2.1 Progressive Networks

Progressive networks are ideal for simulation-to-real transfer of policies in robot control domains, for multiple reasons. First, features learnt for one task may be transferred to many new tasks without destruction from fine-tuning. Second, the columns may be heterogeneous, which may be important for solving different tasks, including different input modalities, or simply to improve learning speed when transferring to the real robot. Third, progressive nets add new capacity, including new input connections, when transferring to new tasks. This is advantageous for bridging the reality gap, to accommodate dissimilar inputs between simulation and real sensors.
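The network output described in Section 1 (one set of discrete action logits per degree of freedom, plus a scalar value estimate, all computed linearly from the shared LSTM state) can be illustrated with a minimal pure-Python sketch. The matrix shapes, head counts, and function names here are illustrative assumptions, not the paper's implementation:

```python
import math

def linear(W, b, x):
    """Affine map W x + b, with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi
            for row, bi in zip(W, b)]

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def actor_critic_heads(lstm_state, action_heads, value_head):
    """From a shared recurrent state, compute one discrete action
    distribution per degree of freedom, plus a value estimate.
    action_heads: list of (W, b) pairs, one per DOF.
    value_head:   (W, b) with a single output row."""
    policies = [softmax(linear(W, b, lstm_state)) for W, b in action_heads]
    Wv, bv = value_head
    value = linear(Wv, bv, lstm_state)[0]
    return policies, value
```

In the asynchronous actor-critic setting, each worker would sample one action per degree of freedom from its distribution, while the value estimate serves as the critic's baseline.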
A progressive network starts with a single column: a deep neural network having $L$ layers with hidden activations $h_i^{(1)} \in \mathbb{R}^{n_i}$, with $n_i$ the number of units at layer $i \le L$, and parameters $\Theta^{(1)}$ trained to convergence. When switching to a second task, the parameters $\Theta^{(1)}$ are "frozen" and a new column with parameters $\Theta^{(2)}$ is instantiated (with random initialization), where layer $h_i^{(2)}$ receives input from both $h_{i-1}^{(2)}$ and $h_{i-1}^{(1)}$ via lateral connections. Progressive networks can be generalized in a straightforward manner to have arbitrary network width per column/layer, to accommodate varying degrees of task difficulty, or to compile lateral connections from multiple, independent networks in an ensemble setting.

$$h_i^{(k)} = f\left( W_i^{(k)} h_{i-1}^{(k)} + \sum_{j<k} U_i^{(k:j)} h_{i-1}^{(j)} \right)$$
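The layer update of a new column — its own weights applied to its own previous layer, plus lateral terms from the frozen earlier columns — can be sketched in pure Python. This is a minimal illustration under assumed shapes (the lateral adapters in practice may be nonlinear, as noted above for the robot experiments):

```python
def matvec(W, x):
    """Matrix-vector product, with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def progressive_layer(W_own, h_own, laterals, f=lambda v: max(0.0, v)):
    """One layer of column k:
        h_i^(k) = f( W_i^(k) h_{i-1}^(k) + sum_{j<k} U_i^(k:j) h_{i-1}^(j) )
    W_own:    this column's weights W_i^(k)
    h_own:    this column's previous-layer activations h_{i-1}^(k)
    laterals: list of (U, h_prev) pairs from frozen earlier columns j < k;
              only W_own and the U matrices are trained for column k.
    """
    z = matvec(W_own, h_own)
    for U, h_j in laterals:
        u = matvec(U, h_j)
        z = [a + b for a, b in zip(z, u)]
    return [f(v) for v in z]
```

Because the earlier columns' activations enter only through the lateral matrices, gradients for the new task never modify the frozen parameters, which is what prevents the destructive fine-tuning mentioned above.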