SE3-Pose-Nets: Structured Deep Dynamics Models for Visuomotor Planning and Control

Arunkumar Byravan, Felix Leeb, Franziska Meier and Dieter Fox
Department of Computer Science & Engineering, University of Washington, Seattle

Abstract — In this work, we present an approach to deep visuomotor control using structured deep dynamics models. Our deep dynamics model, a variant of SE3-Nets, learns a low-dimensional pose embedding for visuomotor control via an encoder-decoder structure. Unlike prior work, our dynamics model is structured: given an input scene, our network explicitly learns to segment salient parts and predict their pose embedding along with their motion, modeled as a change in the pose space due to the applied actions. We train our model using a pair of point clouds separated by an action and show that, given supervision only in the form of point-wise data associations between the frames, our network is able to learn a meaningful segmentation of the scene along with consistent poses. We further show that our model can be used for closed-loop control directly in the learned low-dimensional pose space, where the actions are computed by minimizing error in the pose space using gradient-based methods, similar to traditional model-based control. We present results on controlling a Baxter robot from raw depth data in simulation and in the real world and compare against two baseline deep networks. Our method runs in real time, achieves good prediction of scene dynamics and outperforms the baseline methods on multiple control runs. Video results can be found at: https://rse-lab.cs.washington.edu/se3-structured-deep-ctrl/

Fig. 1: An example scenario showing the initial (left) and target point cloud (right). SE3-Pose-Nets can be used to control the robot to reach the target state based only on raw depth data. Depth images are colorized for display purposes only.

I. INTRODUCTION

Imagine we are receiving observations of a scene from a camera and we would like to control our robot to reach a target scene. Traditional approaches to visual servoing [1] decompose this problem into two parts: data-associating the current scene to the target (usually through the use of features) and modeling the effect of applied actions on the scene, combining these in a tight loop to servo to the target. Recent work on deep learning has looked at learning similar predictive models directly in the space of observations, relating changes in pixels or 3D points directly to the applied actions [2]–[4]. Given a target scene, we can use such a predictive model to generate suitable controls to visually servo to the target using model-predictive control [5]. Unfortunately, for this pipeline to work, we need an external system (such as [6], [7]) capable of providing long-range data associations to measure progress. As we showed in prior work [4], instead of reasoning about raw pixels, we can predict scene dynamics by decomposing the scene into objects and predicting object dynamics instead. While this significantly improves prediction results, it still does not provide a clear solution to the data-association problem that we encounter during control - we still lack the capability to explicitly associate objects/parts across scenes.
We observe three key points: 1) we can data-associate across scenes by learning to predict the poses of detected objects/parts in the scene (the pose implicitly provides tracking), 2) we can model the dynamics of an object directly in the predicted low-dimensional pose space, and 3) we can predict scene dynamics by combining the dynamics predictions of each detected part. We combine these ideas in this work to propose SE3-Pose-Nets, a deep network architecture for efficient visuomotor control that jointly learns to data-associate across long-term sequences. We make the following contributions:
• We show how it is possible to learn predictive models that detect parts of the scene and jointly learn a consistent pose space for these parts with minimal supervision.
• We demonstrate how a deep predictive model can be used for reactive visuomotor control using simple gradient backpropagation as well as a more sophisticated Gauss-Newton optimization, reminiscent of approaches in inverse kinematics [8].
• We present results on real-time reactive control of a Baxter arm using raw depth images and velocity control, both in simulation and on real data.
Fig. 1 shows an example scenario where our proposed method can be applied to control the robot to reach the target state (right) from the initial state (left).

II. RELATED WORK

Modeling scenes and dynamics: Our work builds on top of prior work on learning structured models of scene dynamics [4]. Unlike SE3-Nets, we now explicitly model data associations through a low-dimensional pose embedding that we train to be consistent across long sequences. Similar to Boots et al. [2], our model learns to predict point clouds based on applied actions, but through a more structured intermediate representation that reasons about objects and their motions. Unlike Finn et al. [3], we operate on depth data and reason about motion in 3D using masks and SE(3) transforms, while training our networks in a supervised fashion given point-wise data associations across pairs of frames.

Visuomotor control: Recently, there has been a lot of work on visuomotor control, primarily through the use of deep networks [5], [9]–[13]. These methods either directly regress to controls from visual data [10], [11], or generate controls by planning on learned forward dynamics models [5], [9], through inverse dynamics models [12], or via reinforcement learning [13]. Similar to some of these methods, we generate controls by planning with a learned dynamics model, albeit in a learned low-dimensional latent space. Specifically, work by Finn et al. [5] is closely related, but differs in two main ways: unlike their approach, which controls in the observation space through sampled actions (at ≈ 5 Hz), our controller runs gradient-based optimization on a learned low-dimensional pose embedding in real time (> 30 Hz). Also, their approach requires an external tracker to measure progress, while we explicitly learn to data-associate across large motions. Our work borrows several ideas from prior work by Watter et al. [9], which learns a latent low-dimensional embedding for fast reactive control from pairs of images related by an action. Unlike their work though, we use a structured latent representation (object poses), predict object masks and use a physically grounded 3D loss that only models change in observations, as opposed to a restrictive image reconstruction loss.
Last, our losses are physically motivated, similar to those proposed for training position-velocity encoders [13], but our learned pose embedding is significantly more structured and we train our networks end-to-end directly for control.

Data association: Related work in the computer vision literature has looked at tackling the data association problem, primarily by matching visual descriptors, either hand-tuned [14] or, more recently, learned using deep networks [15], [16]. In prior work, Schmidt et al. [15] learn robust visual descriptors for long-range associations using correspondences over short training sequences. Unlike this work, we only use correspondences between pairs of frames to learn a consistent pose space that lets us data-associate across long sequences.

Visual servoing: Finally, there have been multiple approaches to visual servoing over the years [1], [17], [18], including some newer methods that use deep learned features and reinforcement learning [19]. While these methods depend on an external system for data association or on pre-specified features, our system is trained end-to-end and can control directly from raw depth data.

III. SE3-POSE-NETS

Our deep dynamics model SE3-Pose-Nets decomposes the problem of modeling scene dynamics into three sub-problems: a) modeling scene structure by identifying parts of the scene that move distinctly and by encoding their latent state as a 6D pose, b) modeling the dynamics of individual parts under the effect of the applied actions as a change in the latent pose space (parameterized as an SE(3) transform), and finally c) combining these local pose changes to model the dynamics of the entire scene. Each sub-problem is modeled by a separate component of the SE3-Pose-Net (bold fonts denote collections of items):
• Modeling scene structure: An encoder (h_enc) that decomposes the input point cloud (x) into a set of K rigid parts, predicting per part a 6D pose (p_k, k = 1 . . . K) and a dense segmentation mask (m_k) that highlights points belonging to that part.
• Modeling part dynamics: A pose transition network (h_trans) that models dynamics in the pose space, taking in the current poses (p_t) and action (u_t) to predict the change in poses (∆p_t).
• Predicting scene dynamics: A transform layer (h_tfm) that predicts the next point cloud (x̂_{t+1}) given the current point cloud (x_t), the predicted object masks (m_t) and the predicted pose deltas (∆p_t) by explicitly applying 3D rigid body SE(3) transforms to the input point cloud.
Fig. 2 shows the network architecture of the SE3-Pose-Net. Next, we present the details of the three sub-components and outline a procedure for training the SE3-Pose-Net end-to-end with minimal supervision.

A. Modeling scene structure

Given a 3D point cloud x from a depth sensor (represented as a 3-channel image, 3 x H x W), the encoder (blue block in Fig. 2) segments the scene into distinctly moving parts (m) and predicts a 6D pose per segmented part (p):

(p, m) = h_enc(x)   (1)

The encoder has three parts. The first is a convolutional network that generates a latent representation of the input point cloud (x). This network has five convolutional layers, each followed by a max-pooling layer. The latent representation is further used as input for the mask and pose predictions.

Object masks: We use a de-convolutional network to predict a dense pixel-wise segmentation of the scene into its constituent parts (m).
Similar to prior work [4], we use a fully-convolutional architecture with five de-convolutional layers and skip-add connections to improve the sharpness of the predicted segmentation. The masks predicted by this network are at full resolution with K channels (K x H x W), where K is a pre-specified hyper-parameter that is greater than or equal to the number of moving parts in the scene (including the background). The predicted segmentation mask learns to attend to parts of the scene that move together, representing areas of the scene that can move independently as different parts. As in prior work [4], we formalize mask prediction as a soft-classification problem where the network outputs a K-length probability distribution per pixel, which we sharpen to push towards a binary segmentation mask.

Object poses: Given the encoded latent representation, we use a three-layer fully-connected network to predict the 6D pose p_k of each of the K segmented parts. We represent each pose by 6 numbers: a 3D position (y ∈ R^3) and an orientation (R ∈ SO(3)), represented as a 3-parameter axis-angle vector. As we show later, our pose network learns to predict consistent poses which can be used to data-associate observations over long sequences of motions.

Fig. 2: Top: SE3-Pose-Net architecture consisting of three components: the encoder (h_enc, shown in blue) that predicts dense segmentation masks (m) and 6D poses (p), a pose transition net (h_trans) that models the change in the pose space (∆p) as an effect of the applied action (u), and the transform layer that applies these pose changes to the current point cloud to generate a predicted point cloud (x̂). Bottom left: Graph showing the procedure for training the SE3-Pose-Net along with two loss functions: a 3D loss on the predicted point cloud (L_x) and a pose consistency loss (L_p) relating the "next" poses predicted by the transition network (p̂_{t+1}) and the encoder (p_{t+1}). Bottom right: Control using the SE3-Pose-Net. Given a target point cloud (x_T) encoded as poses (p_T) through the learned encoder, we use the learned transition model (h_trans) to plan a sequence of actions u_0, u_1, ..., u_T by minimizing error (E) directly in the pose space, starting from the initial poses p_0.

At a high level, the encoder implicitly learns the structure of observed scenes by persistently identifying parts and predicting a consistent pose for each part across multiple scenes.
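To make the encoder structure concrete, the following is a minimal PyTorch sketch of a pose-and-mask encoder with the layer counts described above (five convolution + max-pooling blocks, a de-convolutional mask decoder with skip-add connections, and a three-layer fully-connected pose head). The channel widths, the input resolution and the use of a plain softmax instead of the weight-sharpened mask are assumptions made for illustration, not the exact configuration used in the paper.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SE3PoseEncoder(nn.Module):
    def __init__(self, K=8, height=224, width=320):
        super().__init__()
        self.K = K
        chans = [3, 8, 16, 32, 64, 128]          # assumed channel progression
        # Five conv + max-pool blocks producing the latent representation.
        self.conv = nn.ModuleList([
            nn.Sequential(nn.Conv2d(chans[i], chans[i + 1], 3, padding=1),
                          nn.BatchNorm2d(chans[i + 1]), nn.PReLU(),
                          nn.MaxPool2d(2))
            for i in range(5)
        ])
        # Mask decoder: deconv layers back to full resolution, K output channels.
        self.deconv = nn.ModuleList([
            nn.Sequential(nn.ConvTranspose2d(chans[5 - i], chans[4 - i], 4,
                                             stride=2, padding=1),
                          nn.BatchNorm2d(chans[4 - i]), nn.PReLU())
            for i in range(4)
        ])
        self.mask_out = nn.ConvTranspose2d(chans[1], K, 4, stride=2, padding=1)
        # Pose head: three fully-connected layers -> K x 6 (position + axis-angle).
        feat = chans[5] * (height // 32) * (width // 32)
        self.pose_fc = nn.Sequential(nn.Linear(feat, 256), nn.PReLU(),
                                     nn.Linear(256, 128), nn.PReLU(),
                                     nn.Linear(128, K * 6))

    def forward(self, x):                        # x: B x 3 x H x W point cloud image
        skips, h = [], x
        for layer in self.conv:
            h = layer(h)
            skips.append(h)
        # Pose prediction from the flattened latent representation.
        poses = self.pose_fc(h.flatten(1)).view(-1, self.K, 6)
        # Mask prediction with skip-add connections; a softmax over the K
        # channels stands in for the sharpened soft-classification here.
        m = h
        for i, layer in enumerate(self.deconv):
            m = layer(m) + skips[3 - i]
        masks = F.softmax(self.mask_out(m), dim=1)
        return poses, masks

A forward pass such as poses, masks = SE3PoseEncoder()(torch.rand(2, 3, 224, 320)) returns a B x K x 6 pose tensor and a B x K x H x W mask tensor, matching the (p, m) outputs of Eqn. (1).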
B. Modeling part dynamics

Once we have identified the constituent parts of the scene and their poses, we can reason about the effect of applied actions on these parts. We model this notion of "part dynamics" through a fully-connected pose transition network that takes the predicted poses from the encoder (p) and the applied actions (u) as input to predict the change in pose (∆p) for all K segmented parts:

∆p = h_trans(p, u)   (2)

where ∆p = [R, T] is represented as an SE(3) transform per part, with a rotation R_k ∈ SO(3) (parameterized as an axis-angle vector) and a translation vector T_k ∈ R^3. The transition network first applies two fully-connected layers to each of its inputs, concatenates their outputs, and applies two final fully-connected layers to predict the pose deltas. As we show later in Sec. IV, we rely on good predictions of pose deltas from the pose transition network for efficient control.

C. Predicting scene dynamics

Finally, given the predicted scene segmentation (m_t) and the change in poses (∆p_t), we can model the dynamics of the input scene (x_t) under the effect of the applied action (u_t). We do this through the transform layer (h_tfm), which applies the predicted rigid rotations (R_t) and translations (T_t) to the input point cloud, weighted by the predicted mask probabilities (m_t). We predict the transformed point cloud (x̂_{t+1}) as:

x̂^j_{t+1} = ∑_{k=1}^{K} m^{kj}_t (R^k_t x^j_t + T^k_t)   (3)

where x̂^j_{t+1} is the 3D output point corresponding to the input point x^j_t. In effect, we apply the k-th rotation and translation (∆p^k = [R^k, T^k]) to all points x^j that belong to the corresponding object, as indicated by the k-th mask channel m^k (assuming that the mask is binary after weight sharpening), to predict the transformed points x̂^j belonging to that object. Repeating this for all objects gives us the transformed output point cloud (x̂). Note that this part has no trainable parameters. For more details, please refer to prior work [4].
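Since the transform layer has no trainable parameters, it can be written as a short tensor operation. Below is a minimal PyTorch sketch of Eqn. (3); the rotations are assumed to already be 3x3 matrices (the paper parameterizes ∆p with axis-angle vectors, so a conversion step would precede this in practice), and the tensor shapes are illustrative.

import torch

def transform_layer(points, masks, rot, trans):
    """points: B x 3 x H x W   input point cloud (one 3D point per pixel)
       masks:  B x K x H x W   per-part mask probabilities (sum to 1 over K)
       rot:    B x K x 3 x 3   per-part rotation matrices
       trans:  B x K x 3       per-part translations
       returns B x 3 x H x W   predicted next point cloud."""
    B, _, H, W = points.shape
    K = masks.shape[1]
    pts = points.view(B, 1, 3, H * W)                        # broadcast over parts
    # Rigidly transform every point with every part's SE(3): R_k x + T_k.
    moved = torch.matmul(rot, pts) + trans.view(B, K, 3, 1)  # B x K x 3 x HW
    # Blend the K candidate motions per point with the mask probabilities.
    blended = (masks.view(B, K, 1, H * W) * moved).sum(dim=1)
    return blended.view(B, 3, H, W)

When the masks are sharpened towards binary values, each point is effectively moved by exactly one of the K rigid transforms, which is what the discussion of Eqn. (3) above assumes.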
D. Training

We now outline a procedure to train the SE3-Pose-Net end-to-end, using supervision in the form of point-wise data associations across a pair of point clouds (x_t, x_{t+1}) related by an action (u_t), i.e. for each input point (x^i_t), we know its corresponding point in the next frame if it is visible (x^i_{t+1}). No other supervision is given for learning the masks, the poses, and the change in poses. Fig. 2 (bottom left) shows a schematic of this procedure. Given two point clouds x_t, x_{t+1}, we use the encoder to predict the corresponding masks and poses:

p_t, m_t = h_enc(x_t)
p_{t+1}, m_{t+1} = h_enc(x_{t+1})   (4)

Next, the predicted poses (p_t) and the control (u_t) at time t are used as input to the pose transition network to predict the change in pose from t to t + 1:

∆p_t = h_trans(p_t, u_t)   (5)

Finally, we use the transform layer (3) to predict the next point cloud:

x̂_{t+1} = h_tfm(x_t, m_t, ∆p_t)   (6)

The predicted mask (m_{t+1}) at time t + 1 is discarded. We use two losses to train the entire pipeline end to end:
• A 3D loss (L_x) that penalizes the error between the predicted point cloud (x̂_{t+1}) and the data-associated target point cloud (x̃_{t+1}). We use a normalized version of the mean-squared error (MSE) that measures the negative log-likelihood under a Gaussian centered around the target, with a standard deviation dependent on the target magnitude:

L_x = (1/N) ∑_{i=1}^{HW} (x̂^i_{t+1} − x̃^i_{t+1})² / (α f̃^i + β)   (7)

where f̃^i = x̃^i_{t+1} − x^i_t denotes the ground-truth motion of point i relative to the input point cloud x_t, HW is the number of points in the point cloud, N is the number of points that actually move between t and t + 1, and α and β are hyper-parameters (α = 0.5, β = 1e−3 in all our experiments). This loss aims to tackle two main issues with a standard MSE loss: a) by normalizing the loss with a separate scalar per dimension (f̃^i) that depends on the target magnitude, we make the loss scale-invariant, allowing us to weight parts that move little (such as the end-effector when only the wrist rotates) equally to those that have large motion (e.g. the elbow), and b) by dividing the total error by the number of points (N) that move in the scene, we weight scenes where very few points move equally to those where large parts of the scene move.
• A pose consistency loss (L_p) that encourages consistency between the poses predicted by the encoder (p_t, p_{t+1}) and the change in pose predicted by the pose transition network (∆p_t):

p̂_{t+1} = p_t ⊕ ∆p_t
L_p = (1/I) ∑_{i=1}^{I} (p̂^i_{t+1} − p^i_{t+1})²   (8)

where ⊕ refers to composition in the SE(3) pose space, p̂_{t+1} is the expected pose at t + 1 obtained by composing the current pose (p_t) with the predicted pose change from the transition model (∆p_t), and I is the cardinality of p_t. In essence, this loss constrains the encoder to predict poses that are consistent with the pose deltas predicted by the transition model. It encourages global consistency in the pose space by enforcing local consistency over pairs of frames and is crucial for learning a pose space that is consistent across long-term motions.

The total loss for training (L) is a sum of the two losses: L = L_x + γ L_p, where γ controls the relative strength of the two losses. We set γ = 10 in all our experiments. A key point to note is that we do not provide any explicit supervision to learn the pose space. While the consistency loss ensures that the poses are more or less globally consistent, it does not anchor them to a specific 3D position or orientation. As such, the poses learned by the network need not correspond directly to the canonical 6D pose of the parts - the predicted part position y_k does not need to correspond to the part's center and the orientation need not be aligned to the part's principal axes. Providing more constraints to regularize and physically ground the pose space is an interesting area for future work.
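The two losses are straightforward to implement. The sketch below gives one possible PyTorch version of Eqns. (7) and (8); the SE(3) composition is passed in as a hypothetical compose_se3 helper (composing two 6D position/axis-angle poses), and the 1 mm threshold used to decide which points "move" is an assumption borrowed from how flow errors are reported later in the paper, not a value stated for the loss itself.

import torch

def normalized_3d_loss(pred, target, inp, alpha=0.5, beta=1e-3):
    """Eqn. (7): squared point error scaled by the ground-truth motion and
       averaged over the N points that actually move.
       pred, target, inp: B x 3 x H x W point clouds."""
    gt_motion = (target - inp).abs()                       # per-dimension motion f~
    err = (pred - target) ** 2 / (alpha * gt_motion + beta)
    # A point counts as "moving" if its ground-truth motion exceeds ~1 mm
    # (assumed threshold; the text only states that N counts moving points).
    moving = gt_motion.norm(dim=1) > 1e-3                  # B x H x W
    n_moving = moving.float().sum().clamp(min=1.0)
    return err.sum() / n_moving

def pose_consistency_loss(pose_t, delta_t, pose_tp1, compose_se3):
    """Eqn. (8): MSE between the composed pose p_t (+) dp_t and the pose the
       encoder predicts at t+1. compose_se3 is an assumed helper implementing
       SE(3) composition of (position, axis-angle) 6D pose vectors."""
    pred_tp1 = compose_se3(pose_t, delta_t)                # B x K x 6
    return torch.mean((pred_tp1 - pose_tp1) ** 2)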
IV. CLOSED-LOOP VISUOMOTOR CONTROL USING SE3-POSE-NETS

We now show how an SE3-Pose-Net can be used for closed-loop visuomotor control to reach a target specified as a target depth image, essentially performing visual servoing [1]. A crucial component of every visual servoing system is to perform data association between the current image and the target image, which can then be used to generate controls that reduce the corresponding offsets. SE3-Pose-Nets solve this problem by making use of the learned, low-dimensional latent pose space. By enforcing frame-to-frame consistency in the pose space through the consistency loss (Eqn. 8), the pose space becomes consistent, that is, our encoder network learns to data-associate observations to unique poses which are consistent under the effect of actions. Importantly, these data associations are generated at the mask, or object, level, resulting in an ability akin to object detection in computer vision. Unlike prior work [4], [20], which is restricted to operate in the observation space of 3D points and requires data associations between current and target points to be provided externally, we can now directly minimize the error between the poses p_0 and p_T automatically extracted from the initial and the target depth image to recover the sequence of actions that takes the robot from p_0 to p_T. Additionally, unlike prior work [20], we do not need an external tracking system to measure progress toward the goal, as our learned encoder implicitly tracks in the pose space.

A. Reactive control

Algorithm 1 presents a simple algorithm for reactive control using SE3-Pose-Nets that efficiently computes a closed-loop sequence of controls taking the robot from any initial state x_0 to the specified target x_T (the corresponding network structure is given in the lower right panel of Fig. 2).

Algorithm 1: Reactive visuomotor control
  Given: Target point cloud (x_T)
  Given: Pre-trained encoder (h_enc) and transition model (h_trans)
  Given: Maximum control magnitude u_max
  Compute target pose: p_T = h_enc(x_T)
  while not converged do
    Receive current observation (x_t)
    Predict current pose: p_t = h_enc(x_t)
    Initialize control to all zeros: u_t = 0
    Predict change in pose: ∆p_t = h_trans(p_t, u_t)
    Predict next pose: p̂_{t+1} = p_t ⊕ ∆p_t
    Compute pose error: E = (1/I) ∑_{i=1}^{I} (p̂^i_{t+1} − p^i_T)²
    Compute gradient of error w.r.t. control: g = dE/du_t
    Compute control: u_t = −u_max · g/‖g‖
    Execute control u_t on the robot

Given a target point cloud x_T, the algorithm uses the learned encoder to predict the poses of the constituent parts, p_T = h_enc(x_T). This becomes the target for the controller. At every time step, the algorithm computes the pose embedding p_t of the current observation x_t. We would like to find controls that move these poses closer to the target poses. To do this, the algorithm makes a prediction through the learned pose transition model using the current poses (p_t) and an initial guess for the controls (here we use u_t = 0), resulting in a predicted change in poses (∆p_t) and the corresponding predicted next pose (p̂_{t+1}). (Even when using a zero control initialization, this forward pass through the network is necessary to get the correct gradients for the backward pass.) To move these poses towards the targets, we formulate an error function E based on the mean-squared error between the predicted poses and the target poses. The algorithm then computes the gradient of this error with respect to the control inputs, which it uses to generate the next controls. We propose two ways of computing the gradient:
• Backpropagation: A simple approach to compute this gradient update is to backpropagate the gradients of the pose error E through the pose transition model. Unlike backpropagation during training, where we compute gradients w.r.t. the network weights, here we fix the weights and compute gradients w.r.t. the input controls. The resulting control scheme is analogous to the Jacobian Transpose method from inverse kinematics [8], where backprop provides the gradient of the transition model.
• Gauss-Newton: A better approach is to compute the Gauss-Newton update:

g = (J^T J + λI)^{−1} J^T g_P   (9)

where J is the Jacobian of the transition model and g_P is the gradient of the pose error (E). However, instead of computing g via backpropagation, we condition the pose error gradient (g_P) with the Jacobian's damped pseudo-inverse, where λ controls the strength of the conditioning (set to 1e-4 in all our experiments). In practice, this leads to significantly faster convergence with little to no additional overhead in computation compared to the backpropagation method, as the Jacobian can be computed efficiently through finite differencing. We do this by running a single forward propagation with perturbed control inputs (perturbation set to 1e-3) stacked along the batch dimension to take advantage of GPU parallelism. Eqn. 9 is also analogous to the Damped Least Squares technique from inverse kinematics [8].

Finally, the algorithm computes the unit vector in the direction of the computed update and scales it by a pre-specified control magnitude u_max (1 radian in all our experiments) to get the next control u_t. We execute this control on the robot and repeat in a closed loop until convergence, measured either by reaching a small error in the pose space (E < ε) or a maximum number of iterations, whichever comes first. A sketch of this control loop is given below.
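As a rough illustration of how one control step could be computed, the sketch below combines Algorithm 1 with the Gauss-Newton update of Eqn. (9). The transition model and the SE(3) composition are passed in as callables following the hypothetical interfaces sketched earlier; u_max, λ and the perturbation size follow the values quoted in the text, and a simple loop over perturbations stands in for the batched forward pass described above.

import torch

def gauss_newton_step(transition, compose_se3, pose_t, pose_target, u_dim,
                      u_max=1.0, lam=1e-4, eps=1e-3):
    """One velocity command that moves the predicted next poses toward the
       target poses (Algorithm 1 with the update of Eqn. (9))."""
    u0 = torch.zeros(u_dim)

    def predict(u):
        # p_hat_{t+1} = p_t (+) h_trans(p_t, u), flattened to a vector.
        return compose_se3(pose_t, transition(pose_t, u)).reshape(-1)

    with torch.no_grad():
        base = predict(u0)
        # Finite-difference Jacobian of the predicted poses w.r.t. the controls
        # (the paper batches the perturbed controls into one forward pass; a
        # plain loop is used here for clarity).
        J = torch.stack([(predict(u0 + eps * torch.eye(u_dim)[i]) - base) / eps
                         for i in range(u_dim)], dim=1)          # |p| x u_dim
        # Gradient of the mean-squared pose error E w.r.t. the predicted poses.
        g_p = 2.0 * (base - pose_target.reshape(-1)) / base.numel()
        # Damped Gauss-Newton update, Eqn. (9).
        g = torch.linalg.solve(J.T @ J + lam * torch.eye(u_dim), J.T @ g_p)
    # Scale the unit vector of the update by the maximum control magnitude.
    return -u_max * g / g.norm()

The backpropagation variant described above would simply use the raw chain-rule gradient J^T g_P in place of the conditioned update, which corresponds to the Jacobian Transpose scheme.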
V. EVALUATION

We first evaluate SE3-Pose-Nets on predicting the dynamics of a scene where a Baxter robot moves its right arm in front of the depth camera, both in simulation and in the real world. We also present results on control performance, where the task is to control the joints of the Baxter's right arm to reach a specified target observation.

A. Task and Data collection

We first provide details on the task setting in simulation. Our simulator uses OpenGL to render depth images from a camera pointed towards the robot (see Fig. 1) and is kinematic, with little to no dynamics in the motion and no depth noise. We use this as a test bed to assess the effectiveness of the proposed algorithm and compare it to various baselines. We collected around 8 hours of training data in the simulator where the robot moves all joints on its right arm. Around half of the examples are whole-arm motions where the robot plans a trajectory to reach a target end-effector position sampled randomly in the workspace in front of the robot. The rest of the motions are perturbations of individual joints of the robot from various initial configurations sampled to be within the viewpoint of the camera. These additional motions help de-correlate the kinematic chain dependencies during training, improving performance especially on joints lower down the kinematic chain. Overall, this dataset has around 800,000 training images collected from a single fixed viewpoint.

Similar to the simulated setting, we collect data from the real robot, where the Baxter moves its right arm in front of an ASUS Xtion Pro camera placed around 2.5 meters from the robot. Data associations, ground truth masks, and ground truth flows are determined via the DART tracker [6] on the real data. We collected around 4.5 hours of training data on the real robot, with a 2:1 mix of whole-arm motions and single-joint motions. As before, the motions were generated through a planner that tries to get the end-effector to randomly sampled targets in the workspace. Unlike the simulated data, the depth data in the real world is quite noisy and there are significant physical and dynamics effects. For both the simulated and real-world settings, our controls are joint velocities (u).

Fig. 3: Masks generated by different networks on simulated (top) and real data (bottom). From left to right: ground truth depth, ground truth masks, masks predicted by the SE3-Pose-Net, SE3-Pose-Net with joint angles, SE3-Net and SE3-Net with joint angles.

TABLE I: Average per-point flow MSE (cm) across tasks and networks, normalized by the number of points M that move in the ground truth data (motion magnitude > 1 mm). Our network achieves results slightly worse than the baseline networks on both simulated and real data; however, it is also solving additional tasks necessary for control.

Setting   | SE3-Pose-Nets | SE3-Pose-Nets + Joint Angles | SE3-Nets | SE3-Nets + Joint Angles | Flow  | Flow + Joint Angles
Simulated | 0.044         | 0.038                        | 0.030    | 0.024                   | 0.035 | 0.030
Real      | 0.234         | 0.224                        | 0.221    | 0.212                   | 0.228 | 0.218

B. Baselines

We compare the performance of our algorithm against five different baselines:
• SE3-Pose-Nets + Joint Angles: Our proposed network with the joint angles of the robot given as an additional input to the encoder. We use this network as a strong baseline that uses significant additional information to inform the pose prediction.
• SE3-Nets: Prior work from [4] where the network directly predicts masks and changes in poses given input point clouds and the control. There is no explicit pose space in this network, so we do control in the full point cloud observation space for this network.
• SE3-Nets + Joint Angles: SE3-Nets that additionally take in joint angles as input.
• Flow Net: Baseline flow model from prior work [4]. This network directly regresses to a per-point 3D flow without any explicit SE(3) transforms or masks.
• Flow Net + Joint Angles: Baseline flow network that additionally takes in joint angles as input.
All baseline networks are trained on the same data as the SE3-Pose-Nets using the normalized 3D loss (L_x).

C. Training details

We implemented our networks in PyTorch, using the Adam optimizer for training with a learning rate of 1e-4. All our networks use Batch Normalization [21] and the PReLU non-linearity [22]. We set the maximum number of moving objects to K = 8 for all our experiments (7 joints + background). We train each network for 100,000 iterations in simulation and 75,000 iterations on the real data, and use the network that achieves the least validation loss across all training iterations for all our results.
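To tie the pieces together, the following sketch shows one possible end-to-end training step corresponding to the graph in Fig. 2 (bottom left), reusing the hypothetical encoder, transition network, transform layer and loss functions sketched earlier; to_rt (converting 6D pose deltas to rotation matrices and translations) and compose_se3 are assumed helpers, and γ and the Adam settings follow the values given in the text.

import torch

def train_step(encoder, transition, transform_layer, to_rt, compose_se3,
               loss_3d, loss_pose, optimizer, x_t, x_tp1, u_t, gamma=10.0):
    """One end-to-end update on a data-associated point cloud pair (x_t, x_tp1)
       related by the action u_t."""
    pose_t, mask_t = encoder(x_t)                       # Eqn. (4)
    pose_tp1, _ = encoder(x_tp1)                        # mask at t+1 is discarded
    delta_t = transition(pose_t, u_t)                   # Eqn. (5)
    rot, trans = to_rt(delta_t)                         # axis-angle -> (R, T)
    x_pred = transform_layer(x_t, mask_t, rot, trans)   # Eqn. (6)
    # Total loss L = L_x + gamma * L_p with gamma = 10.
    loss = loss_3d(x_pred, x_tp1, x_t) \
         + gamma * loss_pose(pose_t, delta_t, pose_tp1, compose_se3)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Optimizer configuration from the text: Adam with a learning rate of 1e-4.
# optimizer = torch.optim.Adam(list(encoder.parameters()) +
#                              list(transition.parameters()), lr=1e-4)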
D. Results on modeling scene dynamics

First, we present results on the prediction task used for training all the networks. Table I shows the average per-point flow MSE (cm) across all networks on both simulated and real data. SE3-Nets achieve the best results on both the simulated and real datasets, while the baseline flow network performs slightly worse. Unsurprisingly, networks that have access to the joint angles do better than those that do not, as they have strictly more information that is highly correlated with the sensor data. Somewhat surprisingly, the SE3-Pose-Nets have the largest prediction errors of all the models. However, this makes sense given the following considerations: a) SE3-Pose-Nets are trained to explicitly embed the observations in a pose space from which they predict the scene dynamics, rather than using the input point cloud directly. While this provides more structure within the network and is necessary for the control task, it also forces the prediction to go through an information bottleneck, which generally makes the training problem harder. b) SE3-Pose-Nets additionally have to optimize for the consistency loss, which enforces constraints that are different from those of the prediction problem evaluated in this experiment.

Fig. 3 visualizes the masks predicted by SE3-Pose-Nets and the baseline SE3-Net on one example each from the simulated and real data, along with the ground truth masks. Even without any segmentation supervision, SE3-Pose-Nets and SE3-Nets learn a detailed segmentation of the arm into multiple salient parts, most of which are consistent with the ground truth segments on both the simulated and real data.

E. Control performance

Next, we test the performance of the different networks on controlling the Baxter's right arm to reach a target configuration, specified as a point cloud x_T. We test both of the control algorithms presented in Sec. IV using our SE3-Pose-Net model and the baseline models by comparing their performance on a set of 11 distinct servoing tasks (each with an average initial error of ~30 degrees per joint). We first detail a few specifics, followed by an analysis of the results.

Control with baseline models: While SE3-Pose-Nets learn a pose space that can be used for long-term data associations and control, the baseline models operate directly in the space of observations and thus require external data associations in the observation space to be able to do any control at all. For the simulation experiments, we provide these baseline algorithms with ground-truth associations and use the procedure outlined in Alg. 1, using the MSE between the predicted point cloud x̂_{t+1} and the target x_T as the error to be minimized for generating controls. It is important to keep in mind that the baseline models have an advantage over SE3-Pose-Nets for the control task, as they get strictly more information in the form of ground-truth data associations.

Fig. 4: Convergence of the joint angle error in simulated Baxter control tasks. Left: without joint angles; middle: without joint angles and with the detected failure case removed (for all methods); right: with joint angles. SE3-Pose-Nets perform as well as or better than the baseline methods even though the baseline models have additional information in the form of ground-truth associations.

Fig. 5: Convergence of the joint angle error on real Baxter control tasks: (left) without joint angles, (right) with joint angles (averaged across joints 0, 1, 2, 3).

Metric and Task specification: We use the mean absolute error in the joint angles as the metric for measuring control performance. We run all models to convergence (based on the pose error for SE3-Pose-Nets and the 3D point/flow error for the baseline models) or for a maximum of 200 iterations. Additionally, for SE3-Pose-Nets we terminate if the pose error increases for 10 consecutive iterations. We integrate joint velocities forward to generate position commands for the robot, both in simulation and in the real world.
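For concreteness, a small sketch of the control metric and termination checks just described is given below; the convergence threshold ε and the control period used for velocity integration are assumed values, not numbers stated in the paper.

import torch

def joint_angle_error(q, q_target):
    """Mean absolute joint-angle error (radians) used as the control metric."""
    return (q - q_target).abs().mean().item()

def should_terminate(pose_errors, eps=1e-3, max_iters=200, patience=10):
    """pose_errors: list of pose-space errors E recorded so far."""
    if len(pose_errors) >= max_iters or pose_errors[-1] < eps:
        return True
    # Stop early if the pose error has increased for `patience` consecutive steps.
    if len(pose_errors) > patience:
        recent = pose_errors[-(patience + 1):]
        if all(b > a for a, b in zip(recent[:-1], recent[1:])):
            return True
    return False

# Joint velocities are integrated forward to produce position commands:
# q_cmd = q_cmd + u_t * dt    (dt: control period, e.g. ~1/30 s at 30 Hz)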
Simulation results: Fig. 4 plots the error in joint angles as a function of the number of control iterations. The plots on the left and in the middle show results for networks that use only raw depth as input - we control the first six joints of the robot using these networks. The right plot shows results for networks that additionally use joint angles as input - we control all 7 joints of the robot with these networks. In general, SE3-Pose-Nets achieve excellent performance compared to the baseline models, converging quickly to almost zero error even in the absence of any external data associations. The flow model performs comparably to the SE3-Pose-Nets, while SE3-Nets converge far slower. We highlight a few key results:
1) For all methods, Gauss-Newton-based optimization (GN) leads to faster convergence than Backprop. This is to be expected, as Gauss-Newton conditions the gradient based on pseudo second-order information.
2) Baseline models perform worse given joint angles than without. This is due to an issue of credit assignment during gradient computation - the networks learn erroneous causations (when there are only correlations) between the input joint angles and the predicted flows, which diminishes the control's contribution to the prediction problem and subsequently affects the gradient.
3) All models struggle to model the motion of the final wrist joint due to increasing correlations along the kinematic chain, which result in a small contribution of the joint's own motion to the full movement of the wrist. SE3-Pose-Nets can overcome this problem given input joint angles (Fig. 4, right), which provides encouraging evidence that adding the joint state supplements information that is hard to parse directly from the visual state.
4) SE3-Nets converge slowly due to a lack of good control initializations, which are needed to ensure that the network starts off with a meaningful segmentation - given zero controls, the SE3-Net can choose not to segment the arm at all.
5) The good performance of SE3-Pose-Nets indicates that the learned pose space is consistent across large motions and can be used for fast reactive control, albeit not quite as robustly as the baseline methods given data associations.
SE3-Pose-Nets fail to minimize the pose error on one of the tested configurations, leading to an increasing error in Fig. 4, left. Our termination check, which looks for increasing pose errors, correctly identifies this case, and we succeed on all the other examples (Fig. 4, middle). We discuss ways to further improve the robustness of our approach in Sec. VI.

Real robot results: We further test the control performance using SE3-Pose-Nets on a few real-world examples. We do not compare to any baselines, as they need an explicit external data association system to be feasible. On the real robot, we restrict ourselves to controlling the first four joints of the right arm using the SE3-Pose-Net, and we control the first six joints using the model that additionally takes in joint angles as input. Fig. 5 shows the errors as a function of the iteration count. Both models converge very quickly, which indicates that our network is able to control robustly even in the presence of sensor noise and unmodeled dynamics. Surprisingly, there is very little difference between the GN and Backprop algorithms on the real data. A video showing real-time control results on the Baxter can be found at the project website linked in the abstract.

Speed: SE3-Pose-Nets optimize errors directly in the low-dimensional pose space for control. This leads to significant speedups compared to the baselines: while both the flow and SE3-Nets can operate at around 10 Hz (excluding the data-association pipeline), SE3-Pose-Nets run in real time (30 Hz) including the pose detection step.
VI. DISCUSSION

This paper presents SE3-Pose-Nets, a framework for learning predictive models that enable control of objects in a scene. In the context of a robot manipulator, we showed how they solve this problem by learning a predictive model for the individual parts of the manipulator, as in prior work [4]. Additionally, SE3-Pose-Nets learn a consistent pose space for these parts, essentially learning to detect the 6D poses of manipulator parts in raw depth images. This detection capability enables SE3-Pose-Nets to solve the data association problem that is crucial for relating the current observation of the manipulator to a desired target observation. The difference between these poses can be used to generate control signals that move the manipulator to its target pose, similar to visual servoing applied to an image of the manipulator. We also showed how the learned network can be used to determine the gradients needed for the control signals. Our experiments show that SE3-Pose-Nets generate controls superior to representations learned by previous techniques, even when these are provided with external data associations. Furthermore, in addition to providing data associations, SE3-Pose-Nets allow us to compute controls directly in the low-dimensional pose space, enabling far more efficient control than techniques that operate in the raw perception space. Crucially, all these abilities are learned in a single framework based on raw data traces annotated solely with frame-to-frame point cloud correspondences.

Overall, the control performance shown by our SE3-Pose-Nets is extremely encouraging and provides a strong proof of concept that such networks can learn a consistent pose space that provides long-range correspondences and fast reactive control. While this provides reason to rejoice, there are multiple areas for improvement: 1) As shown in the real robot results, SE3-Pose-Nets (and the baseline networks) have difficulties handling joints further down the kinematic chain (joints 4, 5, 6 for the Baxter), whose motions are significantly correlated with the motions of the joints above; additionally, the end-effector has poor visibility in depth images. Adding state information in the form of joint encoder data significantly alleviates this issue but does not fully solve it. There are potentially multiple ways to improve the model to tackle this problem, including curriculum and active learning, along with better regularization and physical grounding of the pose space to remove inconsistencies. 2) A key area for future work is extending our system to interact with and manipulate external objects. Here, a consistent pose space for objects in the scene will enable the robot to plan its motion toward the objects, enabling smooth interactions. 3) Finally, while we have shown that SE3-Pose-Nets can be used for single-step reactive control, we would like to do long-term planning using model-based techniques such as iterative LQG [23] to leverage the full strength of the latent pose space, i.e., fast real-time rollouts directly in the pose space.

ACKNOWLEDGMENTS

This work was funded in part by the National Science Foundation under contract number NSF-NRI-1637479 and STTR number 1622958 awarded to LULA robotics. We would also like to thank NVIDIA for generously providing a DGX used for this research via the UW NVIDIA AI Lab (NVAIL).

REFERENCES

[1] S. Hutchinson, G. D. Hager, and P. I. Corke, “A tutorial on visual servo control,” IEEE Transactions on Robotics and Automation, vol. 12, no.
5, pp. 651–670, 1996. [2] B. Boots, A. Byravan, and D. Fox, “Learning predictive models of a depth camera & manipulator from raw execution traces,” in ICRA . IEEE, 2014, pp. 4021–4028. [3] C. Finn, I. Goodfellow, and S. Levine, “Unsupervised learning for physical interaction through video prediction,” in NIPS , 2016. [4] A. Byravan and D. Fox, “Se3-nets: Learning rigid body motion using deep neural networks,” in Robotics and Automation (ICRA), 2017 IEEE International Conference on . IEEE, 2017, pp. 173–180. [5] C. Finn and S. Levine, “Deep visual foresight for planning robot motion,” in Robotics and Automation (ICRA), 2017 IEEE International Conference on . IEEE, 2017, pp. 2786–2793. [6] T. Schmidt, R. A. Newcombe, and D. Fox, “Dart: Dense articulated real-time tracking.” in RSS , 2014. [7] R. Anderson, D. Gallup, J. T. Barron, J. Kontkanen, N. Snavely, C. Hernández, S. Agarwal, and S. M. Seitz, “Jump: virtual reality video,” ACM Transactions on Graphics (TOG) , 2016. [8] S. R. Buss, “Introduction to inverse kinematics with jacobian transpose, pseudoinverse and damped least squares methods.” [9] M. Watter, J. Springenberg, J. Boedecker, and M. Riedmiller, “Embed to control: A locally linear latent dynamics model for control from raw images,” in NIPS , 2015, pp. 2728–2736. [10] N. Wahlström, T. B. Schön, and M. P. Deisenroth, “From pixels to torques: Policy learning with deep dynamical models,” arXiv preprint arXiv:1502.02251 , 2015. [11] S. Levine, C. Finn, T. Darrell, and P. Abbeel, “End-to-end training of deep visuomotor policies,” JMLR , vol. 17, no. 39, pp. 1–40, 2016. [12] P. Agrawal, A. Nair, P. Abbeel, J. Malik, and S. Levine, “Learning to poke by poking: Experiential learning of intuitive physics,” arXiv preprint arXiv:1606.07419 , 2016. [13] R. Jonschkowski, R. Hafner, J. Scholz, and M. Riedmiller, “Pves: Position-velocity encoders for unsupervised learning of structured state representations,” arXiv preprint arXiv:1705.09805 , 2017. [14] D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” IJCV , vol. 60, no. 2, pp. 91–110, 2004. [15] T. Schmidt, R. Newcombe, and D. Fox, “Self-supervised visual descriptor learning for dense correspondence,” IEEE Robotics and Automation Letters , vol. 2, no. 2, pp. 420–427, 2017. [16] X. Wang and A. Gupta, “Unsupervised learning of visual representations using videos,” in Proceedings of the IEEE International Conference on Computer Vision , 2015, pp. 2794–2802. [17] B. Espiau, F. Chaumette, and P. Rives, “A new approach to visual servoing in robotics,” Geometric reasoning for perception and action , pp. 106–136, 1993. [18] F. Chaumette, S. Hutchinson, and P. Corke, “Visual servoing,” in Springer Handbook of Robotics , 2016, pp. 841–866. [19] A. X. Lee, S. Levine, and P. Abbeel, “Learning visual servoing with deep features and fitted q-iteration,” arXiv preprint arXiv:1703.11000 , 2017. [20] C. Finn and S. Levine, “Deep visual foresight for planning robot motion,” arXiv preprint arXiv:1610.00696 , 2016. [21] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” arXiv preprint arXiv:1502.03167 , 2015. [22] K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: Surpassing human-level performance on imagenet classification,” in ICCV , 2015, pp. 1026–1034. [23] E. Todorov and W. Li, “A generalized iterative lqg method for locally- optimal feedback control of constrained nonlinear stochastic systems,” in ACC . IEEE, 2005, pp. 300–306.