Published in Transactions on Machine Learning Research (01/2023)

VN-Transformer: Rotation-Equivariant Attention for Vector Neurons

Serge Assaad* serge.assaad@duke.edu
Duke University

Carlton Downey cmdowney@waymo.com
Waymo LLC

Rami Al-Rfou rmyeid@waymo.com
Waymo LLC

Nigamaa Nayakanti nigamaa@waymo.com
Waymo LLC

Ben Sapp bensapp@waymo.com
Waymo LLC

Reviewed on OpenReview: https://openreview.net/forum?id=EiX2L4sDPG

Abstract

Rotation equivariance is a desirable property in many practical applications such as motion forecasting and 3D perception, where it can offer benefits like sample efficiency, better generalization, and robustness to input perturbations. Vector Neurons (VN) is a recently developed framework offering a simple yet effective approach for deriving rotation-equivariant analogs of standard machine learning operations by extending one-dimensional scalar neurons to three-dimensional "vector neurons." We introduce a novel "VN-Transformer" architecture to address several shortcomings of the current VN models. Our contributions are: (i) we derive a rotation-equivariant attention mechanism which eliminates the need for the heavy feature preprocessing required by the original Vector Neurons models; (ii) we extend the VN framework to support non-spatial attributes, expanding the applicability of these models to real-world datasets; (iii) we derive a rotation-equivariant mechanism for multi-scale reduction of point-cloud resolution, greatly speeding up inference and training; (iv) we show that small tradeoffs in equivariance (ε-approximate equivariance) can be used to obtain large improvements in numerical stability and training robustness on accelerated hardware, and we bound the propagation of equivariance violations in our models. Finally, we apply our VN-Transformer to 3D shape classification and motion forecasting with compelling results.

*Work done during an internship at Waymo LLC.

1 Introduction

A chair – seen from the front, the back, the top, or the side – is still a chair. When driving a car, our driving behavior is independent of our direction of travel. These simple examples demonstrate how humans excel at using rotation invariance and equivariance to understand the world in context (see figure on the right). Unfortunately, typical machine learning models struggle to preserve equivariance/invariance when appropriate – it is indeed challenging to equip neural networks with the right inductive biases to represent 3D objects in an equivariant manner.

Modeling spatial data is a core component in many domains such as CAD, AR/VR, and medical imaging applications. In assistive robotics and autonomous vehicle applications, 3D object detection, tracking, and motion forecasting form the basis for how a robot interacts with humans in the real world. Preserving rotation invariance/equivariance can improve training time, reduce model size, and provide crucial guarantees about model performance in the presence of noise.

Spatial data is often represented as a point-cloud data structure. These point-clouds require both permutation invariance and rotation equivariance to be modeled sufficiently well. Approaches addressing permutation invariance include Zaheer et al. (2018); Lee et al. (2019); Qi et al. (2017).
Recently, approaches jointly addressing rotation invariance and equivariance are gaining momentum. These approaches can roughly be categorized into modeling invariance or equivariance by: (i) data augmentation, (ii) canonical pose estimation, and (iii) model construction. Approaches (i) and (ii) do not guarantee exact equivariance since they rely on model parameters to learn the right inductive biases (Qi et al., 2017; Esteves et al., 2018; Jaderberg et al., 2015; Chang et al., 2015). Further, data augmentation makes training more costly, and errors in pose estimation propagate to downstream tasks, degrading model performance. Moreover, pose estimation requires labeling objects with their observed poses. In contrast, proposals in (iii) do provide equivariance guarantees, including the Tensor Field Networks of Thomas et al. (2018) and the SE(3)-Transformer of Fuchs et al. (2020) (see Section 2). However, their formulations require complex mathematical machinery or are limited to specific network architectures.

Most recently, Deng et al. (2021) proposed a simple and generalizable framework, dubbed Vector Neurons (VNs), that can be used to replace traditional building blocks of neural networks with rotation-equivariant analogs. The basis of the framework is lifting scalar neurons to 3-dimensional vectors, which admit simple mappings of SO(3) actions to latent spaces.

While Deng et al. (2021) developed a framework and basic layers, many issues required for practical deployment on real-world applications remain unaddressed. A summary of our contributions and their motivations is as follows:

VN-Transformer. The use of Transformers in deep learning has exploded in popularity in recent years as the de facto standard mechanism for learned soft attention over input and latent representations. They have enjoyed many successes in image and natural language understanding (Khan et al., 2022; Vaswani et al., 2017), and they have become an essential modeling component in most domains. The primary contribution of this paper is to develop a VN formulation of soft attention by generalizing scalar inner-product based attention to matrix inner products (i.e., the Frobenius inner product). Thanks to Transformers' ability to model functions on sets (since they are permutation-equivariant), they are a natural fit to model functions on point-clouds. Our VN-Transformer possesses all the appealing properties that have made the original Transformer so successful, with rotation equivariance as an added benefit.

Direct point-set input. The original VN paper relied on edge convolution as a pre-processing step to capture local point-cloud structure. Such feature engineering is not data-driven and requires human involvement in designing and tuning.
Moreover, the sparsity of these computations makes them slow to run on accelerated hardware. Our proposed rotation-equivariant attention mechanism learns higher-level features directly from single points for arbitrary point-clouds (see Section 4).

Handling points augmented with non-spatial attributes. Real-world point-cloud datasets have complicated feature sets: the [x, y, z] spatial dimensions are typically augmented with crucial non-spatial attributes, i.e., [[x, y, z]; [a]] where a can be high-dimensional. For example, Lidar point-clouds have intensity & elongation values associated with each point, multi-sensor point-clouds have modality types, and point-clouds with semantic type have semantic attributes. The VN framework restricted its scope to spatial point-cloud data, limiting the applicability of these models to real-world point-clouds with attributes.
We investigate two mechanisms to integrate attributes into equivariant models while preserving rotation equivariance (see Section 5).

Figure 1: VN-Transformer ("early fusion") models. (a) Rotation-invariant classification model. (b) Rotation-equivariant trajectory forecasting model. Legend: SO(3)-equivariant features; SO(3)-invariant features; non-spatial features.

Equivariant multi-scale feature reduction. Practical data structures such as Lidar point-clouds are extremely large, consisting of hundreds of objects each with potentially millions of points. To handle such computationally challenging situations, we design a rotation-equivariant mechanism for multi-scale reduction of point-cloud resolution. This mechanism learns how to pool the point set in a context-sensitive manner, leading to a significant reduction in training and inference latency (see Section 6).

ε-approximate equivariance. When attempting to scale up VN models for distributed accelerated hardware, we observed significant numerical stability issues. We determined that these stemmed from a fundamental limitation of the original VN framework, where bias values could not be included in linear layers while preserving equivariance. We introduce the notion of ε-approximate equivariance, and use it to show that small tradeoffs in equivariance can be controlled to obtain large improvements in numerical stability via the addition of small biases, improving robustness of training on accelerated hardware. Additionally, we theoretically bound the propagation of rotation equivariance violations in VN networks (see Section 7).

Empirical analysis. Finally, we evaluate our VN-Transformer on (i) the ModelNet40 shape classification task, (ii) a modified ModelNet40 which includes per-point non-spatial attributes, and (iii) a modified version of the Waymo Open Motion Dataset trajectory forecasting task (see Section 8).

2 Related work

The machine learning community has long been interested in building models that achieve equivariance to certain transformations, e.g., permutations, translations, and rotations. For a thorough review, see Bronstein et al. (2021).
Learned approximate transformation invariance. A very common approach is to learn robustness to input transforms via data augmentation (Zhou & Tuzel, 2018; Qi et al., 2018; Krizhevsky et al., 2012; Lang et al., 2019; Yang et al., 2018) or by explicitly predicting transforms to canonicalize pose (Jaderberg et al., 2015; Hinton et al., 2018; Esteves et al., 2018; Chang et al., 2015).

Rotation-equivariant CNNs. Recently, there has been specific interest in designing rotation-equivariant image models for 2D perception tasks (Cohen & Welling, 2016; Worrall et al., 2017; Marcos et al., 2017; Chidester et al., 2018). Worrall & Brostow (2018) extended this work to 3D perception, and Veeling et al. (2018) demonstrated the promise of rotation-equivariant models for medical images.

Equivariant point-cloud models. Thomas et al. (2018) proposed Tensor Field Networks (TFNs), which use tensor representations of point-cloud data, Clebsch-Gordan coefficients, and spherical harmonic filters to build rotation-equivariant models. Fuchs et al. (2020) propose an "SE(3)-Transformer" by adding an attention mechanism for TFNs. One of the key ideas behind this body of work is to create highly restricted weight matrices that commute with rotation operations by construction (i.e., WR = RW). In contrast, we propose a simpler alternative: a "VN-Transformer" which guarantees equivariance for arbitrary weight matrices, removing the need for the complex mathematical machinery of the SE(3)-Transformer. For a detailed comparison with Fuchs et al. (2020), see Appendix A.

Controllable approximate equivariance. Finzi et al. (2021) proposed equivariant priors on model weight matrices to achieve approximate equivariance, and Wang et al. (2022) proposed a relaxed steerable 2D convolution along with soft equivariance regularization. In this work, we introduce the related notion of "ε-approximate equivariance," achieved by adding biases with small and controllable norms. We theoretically bound the equivariance violation introduced by this bias, and we also bound how such violations propagate through deep VN networks.

Non-spatial attributes. TFNs and the SE(3)-Transformer account for non-spatial attributes associated with each point (e.g., color, intensity), which they refer to as "type-0" features. In this work, we investigate two mechanisms (early & late fusion) to incorporate non-spatial data into the VN framework.

Attention-based architectures. Since the introduction of Transformers by Vaswani et al. (2017), self-attention and cross-attention mechanisms have provided powerful and versatile components which are propelling the field of natural language processing forward (Devlin et al., 2019; Liu et al., 2019; Lan et al., 2020; Yang et al., 2020). Lately, so-called "Vision Transformers" (Dosovitskiy et al., 2021; Khan et al., 2022) have had a similar impact on the field of computer vision, providing a compelling alternative to convolutional networks.

3 Background

3.1 Notation & preliminaries

Dataset. Suppose we have a dataset D ≜ {X_p, Y_p}_{p=1}^P, where p ∈ {1, ..., P} is an index into a point-cloud/label pair {X_p, Y_p} – we omit the subscript p whenever it is unambiguous to do so. X_p ∈ 𝒳 ⊂ R^{N×3} is a single 3D point-cloud with N points. In a classification problem, Y ∈ 𝒴 ⊂ {1, ..., κ}, where κ is the number of classes.
In a regression problem, we might have Y ∈ 𝒴 ⊂ R^{N_out × S_out}, where N_out is the number of output points and S_out is the dimension of each output point (with N_out = S_out = 1 corresponding to univariate regression).

Index notation. We use "numpy-like" indexing of tensors. Assuming we have a tensor Z ∈ R^{A×B×C}, we present some examples of this indexing scheme: Z^{(a)} ∈ R^{B×C}, Z^{(:,:,c)} ∈ R^{A×B}, Z^{(a_lo:a_hi)} ∈ R^{(a_hi − a_lo + 1)×B×C}.

Rotations & weights. Suppose we have a tensor V ∈ R^{N×C×3} and a rotation matrix R ∈ SO(3), where SO(3) is the three-dimensional rotation group. We denote the "rotation" of the tensor by VR ∈ R^{N×C×3}, defined as (VR)^{(n)} ≜ V^{(n)}R, ∀n ∈ {1, ..., N} – in other words, the rotated tensor VR is simply the concatenation of the N individually rotated matrices V^{(n)}R ∈ R^{C×3}. Additionally, if we have a matrix of weights W ∈ R^{C′×C}, we define the product WV ∈ R^{N×C′×3} by (WV)^{(n)} ≜ WV^{(n)}.

Invariance and equivariance.

Definition 1 (Rotation Invariance). f : 𝒳 → 𝒴 is rotation-invariant if ∀R ∈ SO(3), X ∈ 𝒳, f(XR) = f(X).

Definition 2 (Rotation Equivariance). f : 𝒳 → 𝒴 is rotation-equivariant if ∀R ∈ SO(3), X ∈ 𝒳, f(XR) = f(X)R.

For simplicity, we defined invariance/equivariance as above instead of the more general f(X ρ_X(g)) = f(X) ρ_Y(g), which requires background on group theory and representation theory.

Proofs. We defer proofs to Appendix C.

3.2 The Vector Neuron (VN) framework

In the Vector Neuron framework (Deng et al., 2021), the authors represent a single point (e.g., in a hidden layer of a neural network) as a matrix V^{(n)} ∈ R^{C×3} (see the inset figure, which contrasts classical neurons with vector neurons), where V ∈ R^{N×C×3} can be thought of as a tensor representation of the entire point-cloud. This representation allows for the design of SO(3)-equivariant analogs of standard neural network operations.

VN-Linear layer. As an illustrative example, the VN-Linear layer is a function VN-Linear(·; W) : R^{C×3} → R^{C′×3}, defined by VN-Linear(V^{(n)}; W) ≜ WV^{(n)}, where W ∈ R^{C′×C} is a matrix of learnable weights. This operation is rotation-equivariant:

VN-Linear(V^{(n)}R; W) = WV^{(n)}R = (WV^{(n)})R = VN-Linear(V^{(n)}; W)R.

Deng et al. (2021) also develop VN analogs of common deep network layers: ReLU, MLP, BatchNorm, and Pool. For further definitions and proofs of equivariance, see Appendix C; for further details, we point the reader to Deng et al. (2021).

4 The VN-Transformer

In this section, we extend the ideas presented in Deng et al. (2021) to design a "VN-Transformer" that enjoys the rotation equivariance property.

4.1 Rotation-invariant inner product

The notion of an inner product between tokens is central to the attention operation from the original Transformer (Vaswani et al., 2017). Consider the Frobenius inner product between two VN representations, defined below.

Definition 3 (Frobenius inner product). The Frobenius inner product between two matrices V^{(n)}, V^{(n′)} ∈ R^{C×3} is defined by

⟨V^{(n)}, V^{(n′)}⟩_F ≜ Σ_{c=1}^{C} Σ_{s=1}^{3} V^{(n,c,s)} V^{(n′,c,s)} = Σ_{c=1}^{C} V^{(n,c)} (V^{(n′,c)})^⊤.
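To make these operations concrete, the following NumPy sketch (our own illustration rather than code from the paper; the helper names random_rotation, vn_linear, and frobenius_ip are hypothetical) builds a random VN feature V ∈ R^{N×C×3}, numerically checks the VN-Linear equivariance identity from Section 3.2, and verifies the rotation invariance of the Frobenius inner product of Definition 3 (formalized as Proposition 1 below).

```python
import numpy as np

def random_rotation(rng):
    """Sample a random 3x3 rotation matrix (an element of SO(3)) via QR decomposition."""
    A = rng.normal(size=(3, 3))
    Q, upper = np.linalg.qr(A)
    Q = Q @ np.diag(np.sign(np.diag(upper)))  # fix signs so the factorization is unique
    if np.linalg.det(Q) < 0:                  # ensure det = +1 (proper rotation, not a reflection)
        Q[:, 0] *= -1
    return Q

def vn_linear(V, W):
    """VN-Linear: apply W (C' x C) to every V^(n) (C x 3). Input (N, C, 3) -> output (N, C', 3)."""
    return np.einsum('dc,ncs->nds', W, V)

def frobenius_ip(Va, Vb):
    """Frobenius inner product between two C x 3 vector-neuron features (Definition 3)."""
    return np.sum(Va * Vb)

rng = np.random.default_rng(0)
N, C, C_out = 5, 8, 16
V = rng.normal(size=(N, C, 3))   # VN representation of a point-cloud
W = rng.normal(size=(C_out, C))  # arbitrary, unconstrained weight matrix
R = random_rotation(rng)

VR = V @ R                       # rotate every token: (VR)^(n) = V^(n) R

# Equivariance of VN-Linear: VN-Linear(VR; W) == VN-Linear(V; W) R
print(np.allclose(vn_linear(VR, W), vn_linear(V, W) @ R))                # True

# Invariance of the Frobenius inner product: <V^(0) R, V^(1) R>_F == <V^(0), V^(1)>_F
print(np.isclose(frobenius_ip(VR[0], VR[1]), frobenius_ip(V[0], V[1])))  # True
```

Note that W is an arbitrary dense matrix: no structural constraint on the weights is needed for equivariance, which is precisely the appeal of the VN construction compared to approaches that restrict weight matrices to commute with rotations.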
This choice of inner product is convenient because of its rotation invariance property, stated below.

Proposition 1. The Frobenius inner product between Vector Neuron representations V^{(n)}, V^{(n′)} ∈ R^{C×3} is rotation-invariant, i.e., ⟨V^{(n)}R, V^{(n′)}R⟩_F = ⟨V^{(n)}, V^{(n′)}⟩_F, ∀R ∈ SO(3).

Proof.

⟨V^{(n)}R, V^{(n′)}R⟩_F = Σ_{c=1}^{C} (V^{(n,c)}R)(V^{(n′,c)}R)^⊤ = Σ_{c=1}^{C} V^{(n,c)} R R^⊤ (V^{(n′,c)})^⊤ = Σ_{c=1}^{C} V^{(n,c)} (V^{(n′,c)})^⊤ = ⟨V^{(n)}, V^{(n′)}⟩_F.    (1)

This rotation-invariant inner product between VN representations allows us to construct a rotation-equivariant attention operation, detailed in the next section.

4.2 Rotation-equivariant attention

Consider two tensors Q ∈ R^{M×C×3} and K ∈ R^{N×C×3}, which can be thought of as sets of M (resp. N) tokens, each a C×3 matrix. Using the Frobenius inner product, we can define an attention matrix A(Q, K) ∈ R^{M×N} between the two sets as follows:

A(Q, K)^{(m)} ≜ softmax( (1/√(3C)) [ ⟨Q^{(m)}, K^{(n)}⟩_F ]_{n=1}^{N} ).    (2)

Following Vaswani et al. (2017), we divide the inner products by √(3C) since Q^{(m)}, K^{(n)} ∈ R^{C×3}. This attention matrix is rotation-invariant w.r.t. simultaneous rotation of all inputs: from Proposition 1, A(QR, KR) = A(Q, K), ∀R ∈ SO(3).

Finally, we define the operation VN-Attn : R^{M×C×3} × R^{N×C×3} × R^{N×C′×3} → R^{M×C′×3} as:

VN-Attn(Q, K, Z)^{(m)} ≜ Σ_{n=1}^{N} A(Q, K)^{(m,n)} Z^{(n)}.    (3)

Proposition 2. VN-Attn(QR, KR, ZR) = VN-Attn(Q, K, Z)R.

Proof.

VN-Attn(QR, KR, ZR)^{(m)} = Σ_{n=1}^{N} A(QR, KR)^{(m,n)} Z^{(n)} R    (4)

(*) = [ Σ_{n=1}^{N} A(Q, K)^{(m,n)} Z^{(n)} ] R = VN-Attn(Q, K, Z)^{(m)} R,    (5)

where (*) holds since A(QR, KR) = A(Q, K) (which follows straightforwardly from Proposition 1 and equation 2).

This is extendable to multi-head attention with H heads, VN-MultiHeadAttn : R^{M×C×3} × R^{N×C×3} × R^{N×C′×3} → R^{M×C′×3}:

VN-MultiHeadAttn(Q, K, Z) ≜ W^O [ VN-Attn(W^Q_h Q, W^K_h K, W^Z_h Z) ]_{h=1}^{H},    (6)

where W^Q_h, W^K_h ∈ R^{P×C} and W^Z_h ∈ R^{P×C′} are the query, key, and value projection matrices of the h-th head (respectively), and W^O ∈ R^{C′×HP} is an output projection matrix (Vaswani et al., 2017).¹ VN-MultiHeadAttn is also rotation-equivariant, and is the key building block of our rotation-equivariant VN-Transformer.

¹ In practice, we set H and P such that HP = C′.

Figure 2: VN-Transformer encoder block architecture. VN-MultiHeadAttn and VN-LayerNorm are defined in equation 6 and equation 7, respectively. VN-MLP is a composition of VN-Linear, VN-BatchNorm, and VN-ReLU layers from Deng et al. (2021).
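As a concrete companion to equations 2 and 3, here is a minimal NumPy sketch (again our own illustration with hypothetical helper names such as vn_attention_matrix and vn_attn, not the authors' code) that computes the Frobenius-inner-product attention matrix and the VN-Attn output, then numerically verifies that A(QR, KR) = A(Q, K) and that VN-Attn(QR, KR, ZR) = VN-Attn(Q, K, Z)R, as in Proposition 2. Only the single-head case is shown; equation 6 wraps this operation with per-head projections as in the standard Transformer.

```python
import numpy as np

def rotation_z(theta):
    """Rotation by angle theta about the z-axis (an element of SO(3))."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def vn_attention_matrix(Q, K):
    """Equation 2: A(Q, K)[m, n] = softmax over n of <Q^(m), K^(n)>_F / sqrt(3C)."""
    C = Q.shape[1]
    logits = np.einsum('mcs,ncs->mn', Q, K) / np.sqrt(3 * C)  # Frobenius inner products
    return softmax(logits, axis=-1)

def vn_attn(Q, K, Z):
    """Equation 3: VN-Attn(Q, K, Z)^(m) = sum_n A(Q, K)[m, n] * Z^(n)."""
    return np.einsum('mn,ncs->mcs', vn_attention_matrix(Q, K), Z)

rng = np.random.default_rng(0)
M, N, C, C_out = 4, 6, 8, 8
Q = rng.normal(size=(M, C, 3))
K = rng.normal(size=(N, C, 3))
Z = rng.normal(size=(N, C_out, 3))
R = rotation_z(0.7)

# Attention weights are invariant to simultaneous rotation of Q and K.
print(np.allclose(vn_attention_matrix(Q @ R, K @ R), vn_attention_matrix(Q, K)))  # True

# The attention output is rotation-equivariant (Proposition 2).
print(np.allclose(vn_attn(Q @ R, K @ R, Z @ R), vn_attn(Q, K, Z) @ R))            # True
```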
We note that a similar idea was proposed in Fuchs et al. (2020) – namely, they use inner products between equivariant representations (obtained from the TFN framework of Thomas et al. (2018)) to create a rotation-invariant attention matrix and a rotation-equivariant attention mechanism. Our attention mechanism can be thought of as the same treatment applied to the VN framework. For a more detailed comparison between this work and the proposal of Fuchs et al. (2020), see Appendix A.

4.3 Rotation-equivariant layer normalization

Deng et al. (2021) allude to a rotation-equivariant version of the well-known layer normalization operation (Ba et al., 2016), but do not explicitly provide it – we do so here for completeness (see Figure 3):

VN-LayerNorm(V^{(n)}) ≜ [ V^{(n,c)} / ||V^{(n,c)}||_2 ]_{c=1}^{C} ⊙ ( LayerNorm( [ ||V^{(n,c)}||_2 ]_{c=1}^{C} ) 1_{1×3} ),    (7)

where ⊙ is an elementwise product, LayerNorm : R^C → R^C is the layer normalization operation of Ba et al. (2016), and 1_{1×3} is a row-vector of ones.

Figure 3: VN-LayerNorm: rotation-equivariant layer normalization. "LayerNorm" is the standard layer normalization operation of Ba et al. (2016). ⊙ and ⊘ are row-wise multiplication and division.

4.4 Encoder architecture

Figure 2 details the architecture of our proposed rotation-equivariant VN-Transformer encoder. The encoder is structurally identical to the original Transformer encoder of Vaswani et al. (2017), with each operation replaced by its rotation-equivariant VN analog.
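To make equation 7 concrete, the following NumPy sketch (our own, with hypothetical names; it uses a stripped-down LayerNorm without the learnable gain and bias) implements VN-LayerNorm for a single feature V^{(n)} ∈ R^{C×3} and checks rotation equivariance: the per-channel norms are unchanged by a rotation, so the LayerNorm factors are invariant and only the unit directions rotate.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Stripped-down LayerNorm (Ba et al., 2016) over a vector of channel norms (no gain/bias)."""
    return (x - x.mean()) / np.sqrt(x.var() + eps)

def vn_layer_norm(Vn, eps=1e-12):
    """Equation 7: per-channel unit directions, rescaled by LayerNorm of the channel norms.

    Vn has shape (C, 3); row c is the vector neuron V^(n, c).
    """
    norms = np.linalg.norm(Vn, axis=1, keepdims=True)  # (C, 1); unchanged by right-multiplying R
    directions = Vn / (norms + eps)                    # (C, 3); rows are unit vectors
    scales = layer_norm(norms[:, 0])[:, None]          # (C, 1); broadcast across the 3 coordinates
    return directions * scales                         # elementwise product, as in equation 7

def rotation_z(theta):
    """Rotation by angle theta about the z-axis (an element of SO(3))."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

rng = np.random.default_rng(0)
Vn = rng.normal(size=(8, 3))  # a single vector-neuron feature V^(n) in R^{C x 3}, C = 8
R = rotation_z(1.3)

# Equivariance check: VN-LayerNorm(V^(n) R) == VN-LayerNorm(V^(n)) R
print(np.allclose(vn_layer_norm(Vn @ R), vn_layer_norm(Vn) @ R))  # True
```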
5 Non-spatial attributes

"Real-world" point-clouds are typically augmented with crucial meta-data such as intensity & elongation for Lidar point-clouds and sensor type for multi-sensor point-clouds. Handling such point-clouds while still satisfying equivariance/invariance w.r.t. spatial inputs would be useful for many applications.

We investigate two strategies (late fusion and early fusion) to handle non-spatial attributes while maintaining rotation equivariance/invariance w.r.t. the spatial dimensions:

Late fusion. In this approach, we propose to incorporate non-spatial attributes into the model at a later stage, where we have already processed the spatial inputs in a rotation-equivariant fashion – our "late fusion" models for classification and trajectory prediction are shown in Figure 4.

Early fusion. Early fusion is a simple yet powerful way to process non-spatial attributes (Jaegle et al., 2021).
In this approach, we do not treat non-spatial attributes differently (see Figure 1) – we simply concatenate the spatial & non-spatial inputs before feeding them into the VN-Transformer. The VN representations obtained are C × (3 + d_A) matrices (instead of C × 3).

Figure 4: VN-Transformer ("late fusion") models. (a) Rotation-invariant classification model. (b) Rotation-equivariant trajectory forecasting model. Legend: SO(3)-equivariant features; SO(3)-invariant features; non-spatial features.

6 Rotation-equivariant multi-scale feature aggregation

Jaegle et al. (2021) recently proposed an attention-based architecture, PerceiverIO, which reduces the computational complexity of the vanilla Transformer by reducing the number of tokens (and their dimension) in the intermediate representations of the network. They achieve this reduction by learning a set Z ∈ R^{M×C′} of "latent features," which they use to perform QKV attention with the original input tokens X ∈ R^{N×C} (with M ≪ N and C′ < C).

Figure 5: Rotation-equivariant latent ...