Manifold Relevance Determination: Learning the Latent Space of Robotics

Pete Trautman

Abstract

In this article we present the basics of manifold relevance determination (MRD) as introduced in [Damianou et al., 2012], along with some applications where the technology might be of particular use. Section 1 acts as a short tutorial of the ideas developed in [Damianou et al., 2012], while Section 2 presents possible applications in sensor fusion, multi-agent SLAM, and "human-appropriate" robot movement (e.g., legibility and predictability [Dragan et al., 2013]). In particular, we show how MRD can be used to construct the underlying models in a data-driven manner, rather than directly leveraging first-principles theories (e.g., physics, psychology), as is commonly the case for sensor fusion, SLAM, and human-robot interaction. We note that [Bekiroglu et al., 2016] leveraged MRD to correct unstable robot grasps into stable robot grasps.

1 What is MRD?

In this section, we explain how MRD has its origins in PCA. Indeed:

• If we generalize PCA to be nonlinear and probabilistic, we arrive at the Gaussian process latent variable model (GPLVM).
• If GPLVMs are generalized to the case of multiple views of the data, we arrive at shared GPLVMs.
• If we introduce private spaces to shared GPLVMs, then we recover a factorization of the latent space that encodes variance specific to the views.
• Finally, if we approximately marginalize the latent space (instead of optimizing it), we can automatically (i.e., in a data-driven manner) determine the dimensionality and factorization of the latent space.

Automatic determination of the dimensionality and factorization of the nonlinear latent space from multiple views is manifold relevance determination. See Figure 1 for the graphical model corresponding to this evolution.

Figure 1: Illustration copied from [Damianou et al., 2012]. Graphical model evolution of MRD, from left to right: first, we see the GPLVM (alternatively, dual probabilistic PCA), which is then generalized to multiple views of the data over a single latent space. In the third frame, we see the introduction of shared and private latent spaces to explain variance due to a specific view. Finally, MRD introduces an approximate latent space marginalization, which in turn allows automatic relevance determination priors to be used, thus enabling soft boundaries between shared and private latent subspaces and a data-driven factorization of the shared subspace.

1.1 Generalizing PCA

To begin, consider PCA: we are given a collection of centered observational data $Y = [y_1, \ldots, y_N] \in \mathbb{R}^{N \times D}$, and we wish to relate this data to a latent space $X = [x_1, \ldots, x_N] \in \mathbb{R}^{N \times q}$ (where hopefully $q \ll D$) via the linear embedding $W \in \mathbb{R}^{D \times q}$:

$$y_n = W x_n. \qquad (1.1)$$

For this formulation, the solution is both exact and efficient: compute the eigendecomposition of the covariance matrix $Y^\top Y$, and then construct $W$ by choosing the $q$ eigenvectors with the largest eigenvalues. PCA is thus interpreted as a linear projection of the data onto the subspace that most efficiently captures the variance in the data.
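As a concrete illustration of this eigendecomposition view, the following is a minimal NumPy sketch; the data are synthetic and purely illustrative, and the variable names simply mirror $Y$, $W$, $X$, and $q$ above.

```python
import numpy as np

# Synthetic, centered data: N samples of dimension D (illustrative only).
rng = np.random.default_rng(0)
N, D, q = 200, 10, 2
Y = rng.standard_normal((N, q)) @ rng.standard_normal((q, D))
Y -= Y.mean(axis=0)

# Eigendecomposition of the D x D covariance matrix.
C = Y.T @ Y / N
eigvals, eigvecs = np.linalg.eigh(C)      # eigh returns eigenvalues in ascending order
top_q = np.argsort(eigvals)[::-1][:q]     # indices of the q largest eigenvalues
W = eigvecs[:, top_q]                     # D x q linear embedding, as in Equation 1.1

# Project the centered data onto the latent space: X is N x q.
X = Y @ W
```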
Imagine now that we wish to formulate a probabilistic version of PCA; we can add a noise term to Equation 1.1:

$$y_n = W x_n + \eta_n, \qquad (1.2)$$

where we assume

$$p(\eta_n) = \mathcal{N}(\eta_n \mid 0, \beta^{-1} I), \qquad (1.3)$$

which allows us to formulate a likelihood over the data,

$$p(y_n \mid x_n, W, \beta) = \mathcal{N}(y_n \mid W x_n, \beta^{-1} I).$$

At this stage, we have a choice to make—we can either

1. marginalize the latent variables $X$ and optimize the parameters $W$, or
2. marginalize the parameters $W$ and optimize the latent variables $X$.

As it turns out, these two approaches are dual to one another (see [Lawrence, 2005]). The first approach, called probabilistic PCA [Tipping and Bishop, 1999], reduces to PCA if one chooses the parameters $W$ that maximize the marginal likelihood

$$W^* = \arg\max_W p(Y \mid W, \beta).$$

Alternatively, if we marginalize the parameters (by choosing a Gaussian prior over the parameters, $p(W) = \prod_i \mathcal{N}(w_i \mid 0, I)$), we arrive at a likelihood conditioned on the latent space:

$$p(Y \mid X, \beta) = \prod_{d=1}^{D} p(y_d \mid X, \beta) = \prod_{d=1}^{D} \mathcal{N}(y_d \mid 0, X X^\top + \beta^{-1} I). \qquad (1.4)$$

Optimization of the latent variables results in a matrix decomposition problem equivalent to finding the largest $q$ eigenvectors of $Y Y^\top$—once again, we recover PCA in its traditional form. However, if instead we pause at Equation 1.4 and recall the "inner product" kernel function of Gaussian processes (see [Rasmussen and Williams, 2006]),

$$k(x_i, x_j) = x_i^\top x_j + \beta^{-1} \delta_{ij}, \qquad (1.5)$$

and note that this kernel function $k(x_i, x_j)$ encodes linear mappings from the latent space to the data, we immediately see how the decomposition $\mathcal{N}(y_d \mid 0, X X^\top + \beta^{-1} I)$ suggests a novel generalization of probabilistic PCA: by replacing the inner product kernel with a covariance function that allows for nonlinear functions (and thus nonlinear embedding functions), we recover a nonlinear, probabilistic version of PCA. This approach is called the Gaussian process latent variable model.

1.2 The rest of the story

Building on the narrative of Section 1.1, we now formulate the MRD model.

• Suppose that we are given two "views" of a dataset, $Y \in \mathbb{R}^{N \times D_Y}$ and $Z \in \mathbb{R}^{N \times D_Z}$ (the dimensionalities $D_Y$ and $D_Z$ may differ, but the $N$ samples are aligned across the two views).

• We assume the existence of a single latent variable $X \in \mathbb{R}^{N \times q}$ that provides a low-dimensional representation of the data through the nonlinear mappings

$$\{f_d^Y\}_{d=1}^{D_Y} : X \mapsto Y \qquad (1.6)$$

and $\{f_d^Z\}_{d=1}^{D_Z} : X \mapsto Z$. Further, the embeddings are corrupted by additive Gaussian noise, so we have that

$$y_{nd} = f_d^Y(x_n) + \epsilon_{nd}^Y, \qquad z_{nd} = f_d^Z(x_n) + \epsilon_{nd}^Z,$$

where $\epsilon^{\{Y,Z\}} \sim \mathcal{N}(0, \sigma^2_{\{Y,Z\}})$ and $y_{nd}$, $z_{nd}$ denote dimension $d$ of point $n$.

• Similar to Equations 1.2 and 1.4, the above bullets lead to a likelihood function of the form

$$p(Y, Z \mid X, \theta), \qquad (1.7)$$

where $\theta = \{\theta^Y, \theta^Z\}$ denotes the parameters of the mapping functions $f_d^Y$ and $f_d^Z$ and the noise variances $\sigma^2_{\{Y,Z\}}$.

Just as in Section 1.1, we are forced to make a choice at this point—that is, we need to compute the likelihood 1.7, but we cannot marginalize over both the latent space and the parameters. Thus, we make some modeling choices:

• As with GPLVMs, we place a Gaussian process prior over the embedding mappings $f_d^Y$ and $f_d^Z$ ($K^Y$ is the kernel matrix introduced in Equation 1.5, but evaluated on $X$):

$$p(\{f_d^Y\}_{d=1}^{D_Y} \mid X, \theta^Y) = \prod_{d=1}^{D_Y} \mathcal{N}(f_d^Y \mid 0, K^Y), \qquad (1.8)$$

and similarly for $\{f_d^Z\}_{d=1}^{D_Z}$. This allows us to model the embedding mappings nonparametrically and, further, allows us to analytically marginalize out the parameters (or mappings):

$$p(Y, Z \mid X, \theta) = \int p(Y \mid \{f_d^Y\})\, p(\{f_d^Y\} \mid X, \theta^Y)\, p(Z \mid \{f_d^Z\})\, p(\{f_d^Z\} \mid X, \theta^Z)\, d\{f_d^Y\}\, d\{f_d^Z\}, \qquad (1.9)$$

where $\{f_d^Y\}, \{f_d^Z\}$ is shorthand for $\{f_d^Y\}_{d=1}^{D_Y}, \{f_d^Z\}_{d=1}^{D_Z}$.

• Unfortunately, while we can marginalize the mappings using GPs, we cannot analytically marginalize the latent space. This is a critical point—not marginalizing the latent space forces us to choose the dimensionality of the latent space, and forces us to choose how the latent space factorizes over shared and private subspaces.

• One of the key contributions of MRD is approximately marginalizing the latent space using variational methods. This enables the use of "automatic relevance determination" (ARD) GP prior kernels; ARD kernels, in turn, enable the dimensionality and the factorization of the latent space to be determined from the data (a small numerical illustration of the ARD mechanism follows this list).
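To make the ARD mechanism concrete, here is a small NumPy illustration. The latent points and the per-view ARD weights below are assumed for the example rather than learned (in MRD they come out of the variational optimization); the point is only that a weight near zero switches a latent dimension off for that view, which is how the shared/private factorization emerges from the data.

```python
import numpy as np

def ard_rbf(X, weights, variance=1.0):
    """ARD squared-exponential kernel:
    K_ij = variance * exp(-0.5 * sum_q w_q * (x_iq - x_jq)^2).
    A weight w_q near zero removes latent dimension q from this view's covariance."""
    diff = X[:, None, :] - X[None, :, :]                 # N x N x q pairwise differences
    return variance * np.exp(-0.5 * np.einsum('ijq,q->ij', diff**2, weights))

rng = np.random.default_rng(0)
N, q = 5, 4
X = rng.standard_normal((N, q))                          # a common latent space

# Assumed ARD weights for two views (latent dimensions are 0-indexed):
w_Y = np.array([1.0, 1.0, 0.8, 1e-6])                    # view Y switches off dimension 3
w_Z = np.array([1.0, 1e-6, 1e-6, 0.9])                   # view Z switches off dimensions 1 and 2

K_Y = ard_rbf(X, w_Y)    # covariance over X as seen by view Y, cf. Equation 1.8
K_Z = ard_rbf(X, w_Z)    # covariance over X as seen by view Z

# A dimension with nonzero weight in both views is shared; here only dimension 0 is.
shared = (w_Y > 1e-3) & (w_Z > 1e-3)
print("shared latent dimensions:", np.where(shared)[0])  # -> [0]
```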
1.3 Advantages of MRD

1. The method models nonlinear embeddings into a factorized latent space. The prior over the embeddings is a GP, which is very expressive and can thus capture a wide variety of embedding functions. The dimensionality of the latent space is driven by the data (rather than by a heuristic), and the factorization of the latent space is also driven by the data (rather than by a heuristically imposed boundary on the latent space).

2. It is an unsupervised approach. For the purposes of multi-agent SLAM, one could imagine implementing it as a "featureless" approach to map building. The latent space would conceivably correspond to the 6-DOF pose of the vehicle, similar to the interpretation presented in the Yale Faces experiment of [Damianou et al., 2012].

3. The method is fully probabilistic. Thus, when one regresses across the latent space (perhaps in between views or away from the available training data), the answer is a distribution, rather than just a match or not a match.

4. Similarly, when one projects back out into viewpoint space, the answer is again a distribution—so the question "how does this novel view correspond to the training views?" is answered with a distribution.

2 Applications of MRD

Inspired by the work in [Bekiroglu et al., 2016], which applied MRD to the problem of transferring between stable and unstable robot grasps, we suggest the following potential applications.

2.1 MRD for Sensor Fusion

MRD naturally captures the concept of sensor fusion; imagine that one has $\ell$ views of some target, $\{Y^1, \ldots, Y^\ell\}$. The traditional sequential Bayesian formulation of sensor fusion requires that we know a number of things in advance:

1. The state of the target.
2. The $\ell$ different sensor models.
3. What part of the state each sensor model captures.
4. The hard part: how each sensor interacts with each other sensor during observation.

Notably, MRD learns all of these things directly from the $\ell$ views of the target (an illustrative training sketch follows this list):

1. The latent space $X$ that is learned during training corresponds to the target under observation.
2. The sensor models are learned using Gaussian process regression, as in Equation 1.8.
3. The private subspace $X_p^i$ captures what part of the target is being uniquely observed by sensor $Y^i$.
4. The hard part: the subspaces $X_s^{ij}$ shared between views $i$ and $j$ correspond to when sensor modalities are being fused. See the discussion of the Yale Faces experiment in [Damianou et al., 2012].
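As a rough sketch of what this recipe looks like in code, the following assumes the GPy library's MRD implementation (GPy.models.MRD); the two views are synthetic stand-ins for sensor data, and the constructor arguments shown are assumptions that may differ across GPy versions.

```python
import numpy as np
import GPy  # assumes GPy is installed; GPy.models.MRD implements [Damianou et al., 2012]

rng = np.random.default_rng(0)
N, Q = 100, 6                                  # N aligned samples, Q candidate latent dimensions
t = rng.uniform(0.0, 2.0 * np.pi, (N, 1))      # a simple 1-D "target state" driving both sensors

# Two synthetic views of the same target, with different dimensionalities and private noise.
Y1 = np.hstack([np.sin(t), np.cos(t), 0.1 * rng.standard_normal((N, 3))])
Y2 = np.hstack([np.sin(2.0 * t), 0.1 * rng.standard_normal((N, 4))])

# An ARD kernel lets the model prune latent dimensions and factorize them per view.
kern = GPy.kern.RBF(Q, ARD=True)
m = GPy.models.MRD([Y1, Y2], input_dim=Q, num_inducing=20, kernel=kern)
m.optimize(messages=False, max_iters=500)

# The learned per-view ARD weights indicate which latent dimensions are shared between
# the two sensors (the fused target state) and which are private to a single sensor.
print(m)
```

In this sketch, the learned latent space plays the role of the target state (item 1), the per-view GP mappings are the sensor models (item 2), and inspecting which latent dimensions each view weights strongly recovers the private and shared subspaces (items 3 and 4).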
2.2 MRD for Multi-Agent SLAM

Similar to sensor fusion, multi-agent SLAM is a natural application of MRD. In particular, imagine that we have two sensors $Y$ and $Z$ observing a scene (see Figure 2). The two agents collect their datasets of images (let's use images for convenience's sake), and MRD is run. Presumably, the latent space should correspond to the 6-DOF pose of either of the cameras—in other words, the localization of the sensor. Now imagine that a third sensor enters the scene and observes $Y^*$. A sensible question in the context of MA-SLAM is "what was the trajectory of the third sensor?" This question is answered in the following way:

• We construct the distribution over possible poses of the camera, $X^* \sim p(X \mid Y, Y^*)$.

• In particular, we can query the distribution for the most likely pose of the new images $Y^*$:

$$X_{\max} = \arg\max_X p(X \mid Y, Y^*).$$

• $X_{\max}$ corresponds to the most likely trajectory that generated the new set of sensor images $Y^*$. Perhaps more importantly, $p(X \mid Y, Y^*)$ is actually the tracking density of the third sensor!

We could additionally answer the following interesting question: given the new view $Y^*$, what would this portion of the scene look like to platform $Z$? In other words, given the view $Y^*$, we can reconstruct what the scene would have looked like from sensor $Z$'s perspective.

• Start with $p(X \mid Y, Y^*)$.

• We next search the latent space for the nearest neighbors of $X^*$:

$$X_{nn} = \{x \in X \mid \|x - X^*\| < \delta, \; x \in X_S\}.$$

• We finally reconstruct the distribution $Z^* \sim p(Z \mid X_{nn})$; this distribution encodes what the portion of the scene observed by $Y^*$ would look like to sensor $Z$.

Figure 2: Two agents ($Y$, $Z$) performing multi-agent SLAM via MRD. The blue circle represents the scene of interest; the dashed colored lines represent the observations by either platform (red is agent $Y$, green is agent $Z$). The latent space is represented by $X$, which is partitioned into two "private" subspaces ($X_P^Y$ and $X_P^Z$) and a "shared" latent subspace $X_S$. The shared portion of the latent space allows us to transfer between the two observations $Y$ and $Z$.

2.3 MRD for "human friendly" movement

We finish by discussing a trajectory generation problem. The setting could be the following: imagine a robot arm trying to work collaboratively with a human in a grasping scenario. Perhaps it is as simple as tasking the robot and the human with simultaneously grasping two objects (human-friendly trajectories would be those that are both legible and predictable to the human, for instance). How do we accomplish this using MRD? The idea is the following:

2.3.1 Regressing over the latent space

• Collect samples $Y_{LP}$ and $Y_{\neg L, \neg P}$ that correspond, respectively, to human-friendly and non-human-friendly trajectories.

• Use the regression capability of the underlying GPs to "flesh out" the latent space (see Figure 3). This is valuable because collecting examples of human-friendly trajectories can be very costly—it might even involve a human demonstrator. The underlying GP formulation provides a robust regression capability that can optimally utilize the data provided. Furthermore, the probabilistic formulation informs the user when the system is uncertain about whether a new sample is human friendly or not.

2.3.2 Transfer from random trajectory to desired trajectory

Once we have collected the training samples, we hope to generate human-friendly trajectories in an efficient and cheap manner. We suggest the following approach (see Figure ??):

• Use a COTS trajectory planner (RRT, PRM, CHOMP, etc.) to generate a "human ignorant" path $\bar{f}^R$.

• Then:

  – Compute $X^* \sim p(X \mid Y_{\neg L, \neg P}, \bar{f}^R)$.
  – Compute the nearest neighbor set $X_{nn} = \{\tilde{x} \in X_s \mid \|\tilde{x} - X^*\| < \delta\}$.
  – Compute the "human friendly" distribution $p(Y_{LP} \mid X_{nn})$.
  – Choose

$$f^R_{\max} = \arg\max_{Y_{LP}} p(Y_{LP} \mid X_{nn})$$

    as our human-friendly trajectory.

The benefit of this approach is that it 1) leverages the ability of MRD to regress across small training sets (i.e., when samples are expensive), and 2) transfers between generalized modes of operation, such as human-friendly and non-human-friendly; a minimal sketch of the nearest-neighbor transfer step follows.
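To make the transfer step concrete, here is a minimal NumPy sketch of the nearest-neighbor selection described above (it mirrors the cross-view reconstruction of Section 2.2 as well). The arrays are assumed to come from an already-trained MRD model: X_shared holds the shared-subspace coordinates of the training samples, X_star is the latent projection of the planner's "human ignorant" path, and Y_LP holds the aligned human-friendly training trajectories. Taking the closest neighbor's trajectory stands in for the arg max over $p(Y_{LP} \mid X_{nn})$, which in the full model would come from the GP predictive distribution; the helper is a hypothetical illustration, not part of any MRD library.

```python
import numpy as np

def transfer_to_friendly(X_shared, Y_LP, X_star, delta):
    """Form X_nn = {x in X_s : ||x - X*|| < delta} and return a human-friendly trajectory.

    X_shared : (N, q_s) shared-subspace coordinates of the N training samples
               (assumed to come from a trained MRD model).
    Y_LP     : (N, T) human-friendly training trajectories aligned with X_shared.
    X_star   : (q_s,) latent projection of the new "human ignorant" path.
    delta    : neighborhood radius.
    """
    dists = np.linalg.norm(X_shared - X_star, axis=1)
    nn_mask = dists < delta
    if not np.any(nn_mask):                    # fall back to the single nearest neighbor
        nn_mask = dists == dists.min()
    # Stand-in for arg-maxing p(Y_LP | X_nn): pick the closest neighbor's trajectory.
    closest = np.argmin(np.where(nn_mask, dists, np.inf))
    return Y_LP[closest], np.where(nn_mask)[0]

# Illustrative dummy arrays in place of a trained model's outputs.
rng = np.random.default_rng(0)
X_shared = rng.standard_normal((50, 3))        # shared latent coordinates of training data
Y_LP = rng.standard_normal((50, 20))           # 50 friendly trajectories, 20 waypoints each
X_star = rng.standard_normal(3)                # latent projection of a planner-generated path
f_R_max, X_nn_idx = transfer_to_friendly(X_shared, Y_LP, X_star, delta=1.5)
```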
One can also imagine applying this method to other modes of operation, such as safe versus unsafe trajectories.

Figure 3: Regressing across small sample sets using MRD. Human-friendly examples $Y_{LP}$ and non-friendly examples $Y_{\neg L, \neg P}$ are embedded in a common latent space $X$; MRD is used to "flesh out" the missing portions of $X$, enabling regression from non-friendly moves to friendly moves.

References

[Bekiroglu et al., 2016] Bekiroglu, Y., Damianou, A., Detry, R., Stork, J., Kragic, D., and Ek, C. (2016). Probabilistic consolidation of grasp experience. In ICRA.

[Damianou et al., 2012] Damianou, A., Ek, C., Titsias, M., and Lawrence, N. (2012). Manifold relevance determination. In Proceedings of the 29th International Conference on Machine Learning.

[Dragan et al., 2013] Dragan, A., Lee, K., and Srinivasa, S. (2013). Legibility and predictability of robot motion. In International Conference on Human-Robot Interaction.

[Lawrence, 2005] Lawrence, N. (2005). Probabilistic non-linear principal component analysis with Gaussian process latent variable models. The Journal of Machine Learning Research.

[Rasmussen and Williams, 2006] Rasmussen, C. E. and Williams, C. (2006). Gaussian Processes for Machine Learning. MIT Press.

[Tipping and Bishop, 1999] Tipping, M. E. and Bishop, C. M. (1999). Probabilistic principal component analysis. Journal of the Royal Statistical Society, Series B.