Published in Transactions on Machine Learning Research (01/2023) VN-Transformer: Rotation-Equivariant Attention for Vector Neurons Serge Assaad ∗ serge.assaad@duke.edu Duke University Carlton Downey cmdowney@waymo.com Waymo LLC Rami Al-Rfou rmyeid@waymo.com Waymo LLC Nigamaa Nayakanti nigamaa@waymo.com Waymo LLC Ben Sapp bensapp@waymo.com Waymo LLC Reviewed on OpenReview: https: // openreview. net/ forum? id= EiX2L4sDPG Abstract Rotation equivariance is a desirable property in many practical applications such as motion forecasting and 3D perception, where it can offer benefits like sample efficiency, better generalization, and robustness to input perturbations. Vector Neurons (VN) is a recently developed framework offering a simple yet effective approach for deriving rotation-equivariant analogs of standard machine learning operations by extending one-dimensional scalar neurons to three-dimensional “vector neurons.” We introduce a novel “VN-Transformer” architecture to address several shortcomings of the current VN models. Our contributions are: ( i ) we derive a rotation-equivariant attention mechanism which eliminates the need for the heavy feature preprocessing required by the original Vector Neurons models; ( ii ) we extend the VN framework to support non-spatial attributes, expanding the applicability of these models to real-world datasets; ( iii ) we derive a rotation-equivariant mechanism for multi-scale reduction of point-cloud resolution, greatly speeding up inference and training; ( iv ) we show that small tradeoffs in equivariance (  -approximate equivariance) can be used to obtain large improvements in numerical stability and training robustness on accelerated hardware, and we bound the propagation of equivariance violations in our models. Finally, we apply our VN-Transformer to 3D shape classification and motion forecasting with compelling results. ∗ Work done during an internship at Waymo LLC. 1 arXiv:2206.04176v3 [cs.CV] 24 Jan 2023 Published in Transactions on Machine Learning Research (01/2023) 1 Introduction A chair – seen from the front, the back, the top, or the side – is still a chair. When driving a car, our driving behavior is independent of our direction of travel. These simple examples demonstrate how humans excel at using rotation invariance and equivariance to understand the world in context (see figure on the right). Unfortunately, typical machine learning models struggle to preserve equivariance/invariance when appropriate – it is indeed challenging to equip neural networks with the right inductive biases to represent 3D objects in an equivariant manner. Modeling spatial data is a core component in many domains such as CAD, AR/VR, and medical imaging applications. In assistive robotics and autonomous vehicle applications, 3D object detection, tracking, and motion forecasting form the basis for how a robot interacts with ďĉůÑÕĉĶñ±ăѕӃѕĨīďĨīñÕĶ±īŕ ~^вάг ~^вάг humans in the real world. Preserving rotation invariance/equivariance can improve training time, reduce model size, and provide crucial guarantees about model performance in the presence of noise. Spatial data is often represented as a point-cloud data structure. These point-clouds require both permutation invariance and rotation equivariance to be modeled sufficiently well. Approaches addressing permutation invariance include Zaheer et al. (2018); Lee et al. (2019); Qi et al. (2017). Recently, approaches jointly addressing rotation invariance and equivariance are gaining momentum. 
These can roughly be categorized into modeling invariance or equivariance by: (i) data augmentation, (ii) canonical pose estimation, and (iii) model construction. Approaches ( i ) and ( ii ) do not guarantee exact equivariance since they rely on model parameters to learn the right inductive biases (Qi et al., 2017; Esteves et al., 2018; Jaderberg et al., 2015; Chang et al., 2015). Further, data augmentation makes training more costly and errors in pose estimation propagate to downstream tasks, degrading model performance. Moreover, pose estimation requires labeling objects with their observed poses. In contrast, proposals in ( iii ) do provide equivariance guarantees, including the Tensor Field Networks of Thomas et al. (2018) and the SE(3)-Transformer of Fuchs et al. (2020) (see Section 2). However, their formulations require complex mathematical machinery or are limited to specific network architectures. Most recently, Deng et al. (2021) proposed a simple and generalizable framework, dubbed Vector Neurons (VNs), that can be used to replace traditional building blocks of neural networks with rotation-equivariant analogs. The basis of the framework is lifting scalar neurons to 3-dimensional vectors, which admit simple mappings of SO (3) actions to latent spaces. While Deng et al. (2021) developed a framework and basic layers, many issues required for practical deployment on real-world applications remain unaddressed. A summary of our contributions and their motivations are as follows: VN-Transformer. The use of Transformers in deep learning has exploded in popularity in recent years as the de facto standard mechanism for learned soft attention over input and latent representations. They have enjoyed many successes in image and natural language understanding (Khan et al., 2022; Vaswani et al., 2017), and they have become an essential modeling component in most domains. Our primary contribution of this paper is to develop a VN formulation of soft attention by generalizing scalar inner-product based attention to matrix inner-products ( i.e. , the Frobenius inner product). Thanks to Transformers’ ability to model functions on sets (since they are permutation-equivariant), they are a natural fit to model functions on point-clouds. Our VN-Transformer possesses all the appealing properties that have made the original Transformer so successful, as well as rotation equivariance as an added benefit. Direct point-set input. The original VN paper relied on edge convolution as a pre-processing step to capture local point-cloud structure. Such feature engineering is not data-driven and requires human involvement in designing and tuning. Moreover, the sparsity of these computations makes them slow to run 2 Published in Transactions on Machine Learning Research (01/2023) on accelerated hardware. Our proposed rotation- equivariant attention mechanism learns higher-level fea- tures directly from single points for arbitrary point-clouds (see Section 4). Handling points augmented with non-spatial at- tributes. Real-world point-cloud datasets have compli- cated features sets – the [ x, y, z ] spatial dimensions are typically augmented with crucial non-spatial attributes [[ x, y, z ]; [ a ]] where a can be high-dimensional. For exam- ple, Lidar point-clouds have intensity & elongation values associated with each point, multi-sensor point-clouds have modality types, and point-clouds with semantic type have semantic attributes. 
The VN framework restricted the scope of their work to spatial point-cloud data, limiting the applicability of their models for real-world point-clouds with attributes. We investigate two mechanisms to inte- grate attributes into equivariant models while preserving rotation equivariance (see Section 5). Equivariant multi-scale feature reduction. Practical data structures such as Lidar point-clouds are extremely large, consisting of hundreds of objects each with poten- tially millions of points. To handle such computationally ďĉůÑÕĉĶñ±ăѕӃѕĨīďĨīñÕĶ±īŕ œXиWQw œXи„ī±ĉįèďīĈÕīѕ &ĉËďÑÕī œXиAĉŎ±īñ±ĉĶ wďďăѕӃѕWQw œXиWQw œXи„ī±ĉįèďīĈÕīѕ &ĉËďÑÕī ÕñéîĶÕÑѕwďď㠜XиWQw AĉĨĻĶѕĨďñĉĶиËăďĻÑ AĉĨĻĶѕĶī±ýÕËĶďīŕ A ĉŎ±īñ±ĉĶ ѕËă±įįѕĨīÕÑД &ĪĻñŎ±īñ±ĉĶ ѕĶī±ýÕËĶďīŕ (a) Rotation-invariant clas- sification model. ďĉůÑÕĉĶñ±ăѕӃѕĨīďĨīñÕĶ±īŕ œXиWQw œXи„ī±ĉįèďīĈÕīѕ &ĉËďÑÕī œXиAĉŎ±īñ±ĉĶ wďďăѕӃѕWQw œXиWQw œXи„ī±ĉįèďīĈÕīѕ &ĉËďÑÕī ÕñéîĶÕÑѕwďď㠜XиWQw AĉĨĻĶѕĨďñĉĶиËăďĻÑ AĉĨĻĶѕĶī±ýÕËĶďīŕ A ĉŎ±īñ±ĉĶ ѕËă±įįѕĨīÕÑД &ĪĻñŎ±īñ±ĉĶ ѕĶī±ýÕËĶďīŕ (b) Rotation-equivariant trajectory forecasting model. Figure 1: VN-Transformer (“early fusion”) mod- els. Legend: ( ) SO(3)-equivariant features; ( ) SO(3)-invariant features; ( ) Non-spatial features. challenging situations, we design a rotation-equivariant mechanism for multi-scale reduction of point-cloud resolution. This mechanism learns how to pool the point set in a context-sensitive manner leading to a significant reduction in training and inference latency (see Section 6).  -approximate equivariance. When attempting to scale up VN models for distributed accelerated hardware we observed significant numerical stability issues. We determined that these stemmed from a fundamental limitation of the original VN framework, where bias values could not be included in linear layers while preserving equivariance. We introduce the notion of  -approximate equivariance, and use it to show that small tradeoffs in equivariance can be controlled to obtain large improvements in numerical stability via the addition of small biases, improving robustness of training on accelerated hardware. Additionally, we theoretically bound the propagation of rotation equivariance violations in VN networks (see Section 7). Empirical analysis. Finally, we evaluate our VN-Transformer on ( i ) the ModelNet40 shape classification task, ( ii ) a modified ModelNet40 which includes per-point non-spatial attributes, and ( iii ) a modified version of the Waymo Open Motion Dataset trajectory forecasting task (see Section 8). 2 Related work The machine learning community has long been interested in building models that achieve equivariance to certain transformations, e.g. , permutations, translations, and rotations. For a thorough review, see Bronstein et al. (2021). Learned approximate transformation invariance. A very common approach is to learn robustness to input transforms via data augmentation (Zhou & Tuzel, 2018; Qi et al., 2018; Krizhevsky et al., 2012; Lang et al., 2019; Yang et al., 2018) or by explicitly predicting transforms to canonicalize pose (Jaderberg et al., 2015; Hinton et al., 2018; Esteves et al., 2018; Chang et al., 2015). Rotation-equivariant CNNs. Recently, there has been specific interest in designing rotation-equivariant image models for 2D perception tasks (Cohen & Welling, 2016; Worrall et al., 2017; Marcos et al., 2017; Chidester et al., 2018). Worrall & Brostow (2018) extended this work to 3D perception, and Veeling et al. (2018) demonstrated the promise of rotation-equivariant models for medical images. 
3 Published in Transactions on Machine Learning Research (01/2023) Equivariant point-cloud models. Thomas et al. (2018) proposed Tensor Field Networks (TFNs), which use tensor representations of point-cloud data, Clebsch-Gordan coefficients, and spherical harmonic filters to build rotation-equivariant models. Fuchs et al. (2020) propose an “SE(3)-Transformer” by adding an attention mechanism for TFNs. One of the key ideas behind this body of work is to create highly restricted weight matrices that commute with rotation operations by construction ( i.e. , W R = RW ). In contrast, we propose a simpler alternative: a “VN-Transformer” which guarantees equivariance for arbitrary weight matrices, removing the need for the complex mathematical machinery of the SE(3)-Transformer. For a detailed comparison with Fuchs et al. (2020), see Appendix A. Controllable approximate equivariance. Finzi et al. (2021) proposed equivariant priors on model weight matrices to achieve approximate equivariance, and Wang et al. (2022) proposed a relaxed steerable 2D convolution along with soft equivariance regularization. In this work, we introduce the related notion of “  -approximate equivariance,” achieved by adding biases with small and controllable norms. We theoretically bound the equivariance violation introduced by this bias, and we also bound how such violations propagate through deep VN networks. Non-spatial attributes. TFNs and the SE(3)-Transformer account for non-spatial attributes associated with each point ( e.g. , color, intensity), which they refer to as “type- 0 ” features. In this work, we investigate two mechanisms (early & late fusion) to incorporate non-spatial data into the VN framework. Attention-based architectures. Since the introduction of Transformers by Vaswani et al. (2017), self- attention and cross-attention mechanisms have provided powerful and versatile components which are propelling the field of natural language processing forward (Devlin et al., 2019; Liu et al., 2019; Lan et al., 2020; Yang et al., 2020). Lately, so-called “Vision Transformers” (Dosovitskiy et al., 2021; Khan et al., 2022) have had a similar impact on the field of computer vision, providing a compelling alternative to convolutional networks. 3 Background 3.1 Notation & preliminaries Dataset. Suppose we have a dataset D , { X p , Y p } P p =1 , where p ∈ { 1 , . . . , P } is an index into a point- cloud/label pair { X p , Y p } – we omit the subscript p whenever it is unambiguous to do so. X ∈ X ⊂ R N × 3 is a single 3D point-cloud with N points. In a classification problem, Y ∈ Y ⊂ { 1 , . . . , κ } , where κ is the number of classes. In a regression problem, we might have Y ∈ Y ⊂ R N out × S out where N out is the number of output points, and S out is the dimension of each output point (with N out = S out = 1 corresponding to univariate regression). Index notation. We use “numpy-like” indexing of tensors. Assuming we have a tensor Z ∈ R A × B × C , we present some examples of this indexing scheme: Z ( a ) ∈ R B × C , Z (: , : ,c ) ∈ R A × B , Z ( a lo : a hi ) ∈ R ( a hi − a lo +1) × B × C . Rotations & weights. Suppose we have a tensor V ∈ R N × C × 3 and a rotation matrix R ∈ SO (3) , where SO (3) is the three-dimensional rotation group. We denote the “rotation” of the tensor by V R ∈ R N × C × 3 , defined as: ( V R ) ( n ) , V ( n ) R, ∀ n ∈ { 1 , . . . , N } – in other words, the rotated tensor V R is simply the concatenation of the N individually rotated matrices V ( n ) R ∈ R C × 3 . 
Additionally, if we have a matrix of weights W ∈ R C ′ × C , we define the product W V ∈ R N × C ′ × 3 by ( W V ) ( n ) , W V ( n ) . Invariance and equivariance. Definition 1 (Rotation Invariance) . f : X → Y is rotation-invariant if ∀ R ∈ SO (3) , X ∈ X , f ( XR ) = f ( X ) . Definition 2 (Rotation Equivariance) . f : X → Y is rotation-equivariant if ∀ R ∈ SO (3) , X ∈ X , f ( XR ) = f ( X ) R . For simplicity, we defined invariance/equivariance as above instead of the more general f ( Xρ X ( g )) = f ( X ) ρ Y ( g ) , which requires background on group theory and representation theory. 4 Published in Transactions on Machine Learning Research (01/2023) Proofs. We defer proofs to Appendix C. 3.2 The Vector Neuron (VN) framework In the Vector Neuron framework (Deng et al., 2021), the au- thors represent a single point ( e.g. , in a hidden layer of a neu- ral network) as a matrix V ( n ) ∈ R C × 3 (see inset figure), where V ∈ R N × C × 3 can be thought of as a tensor representation of the entire point-cloud. This representation allows for the design of SO(3)-equivariant analogs of standard neural network operations. ďĉůÑÕĉĶñ±ăѕӃѕĨīďĨīñÕĶ±īŕ ă±įįñ˱ăѕĉÕĻīďĉį œÕËĶďīѕXÕĻīďĉį VN-Linear layer. As an illustrative example, the VN-Linear layer is a function VN-Linear ( · ; W ) : R C × 3 → R C ′ × 3 , defined by VN-Linear ( V ( n ) ; W ) , W V ( n ) , where W ∈ R C ′ × C is a matrix of learnable weights. This operation is rotation-equivariant: VN-Linear ( V ( n ) R ; W ) = W V ( n ) R = ( W V ( n ) ) R = VN-Linear ( V ( n ) ; W ) R . Deng et al. (2021) also develop VN analogs of common deep network layers ReLU, MLP, BatchNorm, and Pool. For further definitions and proofs of equivariance, see Appendix C. For further details, we point the reader to Deng et al. (2021). 4 The VN-Transformer In this section, we extend the ideas presented in Deng et al. (2021) to design a “VN-Transformer” that enjoys the rotation equivariance property. 4.1 Rotation-invariant inner product The notion of an inner product between tokens is central to the attention operation from the original Transformer (Vaswani et al., 2017). Consider the Frobenius inner product between two VN representations, defined below. Definition 3 (Frobenius inner product) . The Frobenius inner product between two matrices V ( n ) , V ( n ′ ) ∈ R C × 3 is defined by 〈 V ( n ) , V ( n ′ ) 〉 F , ∑ C c =1 ∑ 3 s =1 V ( n,c,s ) V ( n ′ ,c,s ) = ∑ C c =1 V ( n,c ) V ( n ′ ,c ) ᵀ . This choice of inner product is convenient because of its rotation invariance property, stated below. Proposition 1. The Frobenius inner product between Vector Neuron representations V ( n ) , V ( n ′ ) ∈ R C × 3 is rotation-invariant, i.e. 〈 V ( n ) R, V ( n ′ ) R 〉 F = 〈 V ( n ) , V ( n ′ ) 〉 F , ∀ R ∈ SO (3) . Proof. 〈 V ( n ) R, V ( n ′ ) R 〉 F = C ∑ c =1 ( V ( n,c ) R )( V ( n ′ ,c ) R ) ᵀ = C ∑ c =1 V ( n,c ) RR ᵀ V ( n ′ ,c ) ᵀ = C ∑ c =1 V ( n,c ) V ( n ′ ,c ) ᵀ = 〈 V ( n ) , V ( n ′ ) 〉 F (1) This rotation-invariant inner product between VN representations allows us to construct a rotation-equivariant attention operation, detailed in the next section. 4.2 Rotation-equivariant attention Consider two tensors Q ∈ R M × C × 3 and K ∈ R N × C × 3 , which can be thought of as sets of M (resp. N ) tokens, each a C × 3 matrix. 
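Before building attention on top of this inner product, both Proposition 1 and the equivariance of VN-Linear are easy to check numerically. The following is a minimal NumPy sketch (not the authors' implementation; shapes, seeds, and variable names are illustrative):

```python
import numpy as np

def random_rotation(rng):
    """Sample a rotation matrix in SO(3) via QR decomposition of a Gaussian matrix."""
    Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
    return Q if np.linalg.det(Q) > 0 else -Q   # -Q flips the determinant to +1 in odd dimensions

def frobenius_inner(Va, Vb):
    """<V(a), V(b)>_F = sum over channels and spatial dims of elementwise products (Definition 3)."""
    return np.sum(Va * Vb)

rng = np.random.default_rng(0)
C, C_out = 8, 16
R = random_rotation(rng)
Va = rng.normal(size=(C, 3))      # one VN token: a C x 3 matrix
Vb = rng.normal(size=(C, 3))
W = rng.normal(size=(C_out, C))   # arbitrary VN-Linear weight matrix

# Proposition 1: the Frobenius inner product is invariant to simultaneous rotation.
assert np.allclose(frobenius_inner(Va @ R, Vb @ R), frobenius_inner(Va, Vb))

# VN-Linear equivariance: W (V R) == (W V) R, for any choice of W.
assert np.allclose(W @ (Va @ R), (W @ Va) @ R)
```

The second check holds for an arbitrary W precisely because the weights act on the channel dimension while the rotation acts on the spatial dimension (see also Appendix A).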
Using the Frobenius inner product, we can define an attention matrix $A(Q,K) \in \mathbb{R}^{M \times N}$ between the two sets as follows:

$$A(Q,K)^{(m)} \triangleq \mathrm{softmax}\!\left(\tfrac{1}{\sqrt{3C}}\left[\langle Q^{(m)}, K^{(n)} \rangle_F\right]_{n=1}^{N}\right). \quad (2)$$

Following Vaswani et al. (2017), we divide the inner products by $\sqrt{3C}$ since $Q^{(m)}, K^{(n)} \in \mathbb{R}^{C \times 3}$. This attention matrix is rotation-invariant w.r.t. simultaneous rotation of all inputs: from Proposition 1, $A(QR, KR) = A(Q,K)$ for all $R \in SO(3)$. Finally, we define the operation $\text{VN-Attn}: \mathbb{R}^{M \times C \times 3} \times \mathbb{R}^{N \times C \times 3} \times \mathbb{R}^{N \times C' \times 3} \to \mathbb{R}^{M \times C' \times 3}$ as:

$$\text{VN-Attn}(Q,K,Z)^{(m)} \triangleq \sum_{n=1}^{N} A(Q,K)^{(m,n)} Z^{(n)}. \quad (3)$$

Proposition 2. $\text{VN-Attn}(QR, KR, ZR) = \text{VN-Attn}(Q,K,Z)\,R$.

Proof.

$$\text{VN-Attn}(QR, KR, ZR)^{(m)} = \sum_{n=1}^{N} A(QR,KR)^{(m,n)} Z^{(n)} R \quad (4)$$

$$\overset{(*)}{=} \left[\sum_{n=1}^{N} A(Q,K)^{(m,n)} Z^{(n)}\right] R = \text{VN-Attn}(Q,K,Z)^{(m)} R, \quad (5)$$

where $(*)$ holds since $A(QR,KR) = A(Q,K)$ (which follows straightforwardly from Proposition 1 and equation 2).

This is extendable to multi-head attention with $H$ heads, $\text{VN-MultiHeadAttn}: \mathbb{R}^{M \times C \times 3} \times \mathbb{R}^{N \times C \times 3} \times \mathbb{R}^{N \times C' \times 3} \to \mathbb{R}^{M \times C' \times 3}$:

$$\text{VN-MultiHeadAttn}(Q,K,Z) \triangleq W^{O}\left[\text{VN-Attn}\!\left(W^{Q}_{h} Q,\; W^{K}_{h} K,\; W^{Z}_{h} Z\right)\right]_{h=1}^{H}, \quad (6)$$

where $W^{Q}_{h}, W^{K}_{h} \in \mathbb{R}^{P \times C}$, $W^{Z}_{h} \in \mathbb{R}^{P \times C'}$ are feature, key, and value projection matrices of the $h$-th head (respectively), and $W^{O} \in \mathbb{R}^{C' \times HP}$ is an output projection matrix (Vaswani et al., 2017).¹ VN-MultiHeadAttn is also rotation-equivariant, and is the key building block of our rotation-equivariant VN-Transformer.

Figure 2: VN-Transformer encoder block architecture. VN-MultiHeadAttn and VN-LayerNorm are defined in equation 6 and equation 7, respectively. VN-MLP is a composition of VN-Linear, VN-BatchNorm, and VN-ReLU layers from Deng et al. (2021).

We note that a similar idea was proposed in Fuchs et al. (2020) – namely, they use inner products between equivariant representations (obtained from the TFN framework of Thomas et al. (2018)) to create a rotation-invariant attention matrix and a rotation-equivariant attention mechanism. Our attention mechanism can be thought of as the same treatment applied to the VN framework. For a more detailed comparison between this work and the proposal of Fuchs et al. (2020), see Appendix A.

4.3 Rotation-equivariant layer normalization

Deng et al. (2021) allude to a rotation-equivariant version of the well-known layer normalization operation (Ba et al., 2016), but do not explicitly provide it – we do so here for completeness (see Figure 3):

$$\text{VN-LayerNorm}(V^{(n)}) \triangleq \left[\frac{V^{(n,c)}}{\|V^{(n,c)}\|_2}\right]_{c=1}^{C} \odot \left(\mathrm{LayerNorm}\!\left(\left[\|V^{(n,c)}\|_2\right]_{c=1}^{C}\right) \mathbf{1}_{1 \times 3}\right), \quad (7)$$

where $\odot$ is an elementwise product, $\mathrm{LayerNorm}: \mathbb{R}^C \to \mathbb{R}^C$ is the layer normalization operation of Ba et al. (2016), and $\mathbf{1}_{1 \times 3}$ is a row-vector of ones.

Figure 3: VN-LayerNorm: rotation-equivariant layer normalization. "Layer Norm" is the standard layer normalization operation of Ba et al. (2016); ⊙ and ⊘ denote row-wise multiplication and division.

¹ In practice, we set $H$ and $P$ such that $HP = C'$.

4.4 Encoder architecture

Figure 2 details the architecture of our proposed rotation-equivariant VN-Transformer encoder. The encoder is structurally identical to the original Transformer encoder of Vaswani et al.
(2017), with each operation replaced by its rotation-equivariant VN analog. 5 Non-spatial attributes “Real-world” point-clouds are typically augmented with cru- cial meta-data such as intensity & elongation for Lidar point-clouds and & sensor type for multi-sensor point-clouds. Handling such point-clouds while still satisfying equivari- ance/invariance w.r.t. spatial inputs would be useful for many applications. We investigate two strategies (late fusion and early fusion) to handle non-spatial attributes while maintaining rotation equivariance/invariance w.r.t. spatial dimensions: Late fusion. In this approach, we propose to incorporate non-spatial attributes into the model at a later stage, where we have already processed the spatial inputs in a rotation- equivariant fashion – our “late fusion” models for classifica- tion and trajectory prediction are shown in Figure 4. Early fusion. Early fusion is a simple yet powerful way to process non-spatial attributes (Jaegle et al., 2021). In this approach, we do not treat non-spatial attributes differently (see Figure 1) – we simply concatenate spatial & non-spatial inputs before feeding them into the VN-Transformer. The VN representations obtained are C × (3 + d A ) matrices (instead of C × 3 ). ĨДѕΪΫ ďĉůÑÕĉĶñ±ăѕӃѕĨīďĨīñÕĶ±īŕ Ëďĉ˱Ķѕ AĉĨĻĶѕĶī±ýÕËĶďīŕ œXиWQw œXи„ī±ĉįèďīĈÕīѕ &ĉËďÑÕī œXиAĉŎ±īñ±ĉĶ 8ă±ŸÕĉ wďďă WQw œXиWQw œXи„ī±ĉįèďīĈÕīѕ &ĉËďÑÕī ÕñéîĶÕÑѕwďď㠜XиWQw WQw Ëďĉ˱Ķѕ œXи„ī±ĉįèďīĈÕīѕ &ĉËďÑÕī AĉĨĻĶѕĨďñĉĶиËăďĻÑѕ 8ñéĻīÕѕήÊ (a) Rotation-invariant classification model. ĨДѕΪΫ ďĉůÑÕĉĶñ±ăѕӃѕĨīďĨīñÕĶ±īŕ Ëďĉ˱Ķѕ AĉĨĻĶѕĶī±ýÕËĶďīŕ œXиWQw œXи„ī±ĉįèďīĈÕīѕ &ĉËďÑÕī œXиAĉŎ±īñ±ĉĶ 8ă±ŸÕĉ wďďă WQw œXиWQw œXи„ī±ĉįèďīĈÕīѕ &ĉËďÑÕī ÕñéîĶÕÑѕwďď㠜XиWQw WQw Ëďĉ˱Ķѕ œXи„ī±ĉįèďīĈÕīѕ &ĉËďÑÕī AĉĨĻĶѕĨďñĉĶиËăďĻÑѕ 8ñéĻīÕѕήÊ (b) Rotation-equivariant trajectory forecasting model. Figure 4: VN-Transformer (“late fusion”) models. Legend: ( ) SO(3)-equivariant features; ( ) SO(3)-invariant features; ( ) Non-spatial features. 6 Rotation-equivariant multi-scale feature aggregation Jaegle et al. (2021) recently proposed an attention-based architec- ture, PerceiverIO, which reduces the computational complexity of the vanilla Transformer by reducing the number of tokens (and their dimension) in the intermediate representations of the net- work. They achieve this reduction by learning a set Z ∈ R M × C ′ of “latent features,” which they use to perform QKV attention with the original input tokens X ∈ R N × C (with M << N and C ′ << C ). Finally, they perform self-attention operations on the resulting M × C ′ array, leading to a O ( M 2 C ′ ) runtime instead of ±ŕĈďѕӍѕďĉůÑÕĉĶñ±ăѕӃѕwīďĨīñÕĶ±īŕ œXиWÕ±ĉwīďýÕËĶ œXи WĻăĶñ>ձџĉ WĻăĶñ>ձџĉ Figure 5: Rotation-equivariant latent features for Vector Neurons. O ( N 2 C ) for each encoder self-attention operation, greatly improving time complexity during training and inference – a boon for time-critical applications such as real-time motion forecasting. However, such learnable latent features would violate equivariance in our case, since these learnable features would have no information about the original input’s orientation. To remedy this, we instead propose to a learn a transformation from the inputs to the latent features (where the number of latent features is much smaller than the number of original inputs). Specifically, we propose to use a mean projection function VN-MeanProject ( V ) ( m ) , W ( m ) [ 1 N ∑ N n =1 V ( n ) ] , where W ∈ R M × C ′ × C is a learnable tensor. VN-MeanProject is both rotation- equivariant and permutation-invariant. 
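To make this concrete, the following is a compact NumPy sketch of the single-head attention of Section 4.2 (equations 2 and 3) together with the VN-MeanProject reduction just described. The shapes follow the text; the weights, token counts, and function names are illustrative placeholders rather than the authors' implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def vn_attention(Q, K, Z):
    """Single-head VN attention (equations 2-3).
    Q: (M, C, 3), K: (N, C, 3), Z: (N, C', 3) -> output: (M, C', 3)."""
    C = Q.shape[1]
    # Pairwise Frobenius inner products between query and key tokens, scaled by 1/sqrt(3C).
    logits = np.einsum('mcs,ncs->mn', Q, K) / np.sqrt(3 * C)
    A = softmax(logits, axis=-1)               # rotation-invariant (Proposition 1)
    return np.einsum('mn,ncs->mcs', A, Z)      # rotation-equivariant (Proposition 2)

def vn_mean_project(V, W):
    """VN-MeanProject: V: (N, C, 3), W: (M, C', C) -> latent tokens: (M, C', 3)."""
    mean = V.mean(axis=0)                      # (C, 3); permutation-invariant
    return np.einsum('mdc,cs->mds', W, mean)   # rotation-equivariant

rng = np.random.default_rng(0)
N, M, C = 1024, 32, 8
V = rng.normal(size=(N, C, 3))                 # input point-cloud tokens
W_lat = rng.normal(size=(M, C, C))             # learnable projection tensor

latents = vn_mean_project(V, W_lat)            # M << N latent tokens
reduced = vn_attention(latents, V, V)          # (M, C, 3) reduced representation

# Equivariance of the whole reduction: rotating the inputs rotates the outputs.
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
R = Q if np.linalg.det(Q) > 0 else -Q          # random rotation in SO(3)
reduced_rot = vn_attention(vn_mean_project(V @ R, W_lat), V @ R, V @ R)
assert np.allclose(reduced_rot, reduced @ R)
```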
We then perform VN-MultiHeadAttention between the resulting latent features and the original inputs V to get a smaller set of VN representations. The architecture diagram for our proposed rotation-equivariant “latent feature” mechanism is shown in Figure 5. 7 Published in Transactions on Machine Learning Research (01/2023) 7  -approximate equivariance We noticed that distributed training on accelerated hardware is numerically unstable for points with small norms. This is unique to VN models – the VN-Linear layer does not include a bias vector, which leads to frequent underflow issues on distributed accelerated hardware. We found that introducing small and controllable additive biases fixes these issues – we modify the VN-Linear layer by adding a bias with controllable norm: VN-LinearWithBias ( V ( n ) ; W, U,  ) , W V ( n ) + U, U ( c ) , B ( c ) / || B ( c ) || 2 , (8) where  ≥ 0 is a hyperparameter controlling the bias norm, and B ∈ R C ′ × 3 is a learnable matrix. This leads to significant improvements in training stability and model quality. In principle, VN-LinearWithBias is not equivariant, but its violation of equivariance can be bounded. Work on equivariance by construction typically treats rotation equivariance as a binary idea – a model is either equivariant, or it is not. This can be relaxed by asking: how large is the violation of equivariance? We quantify this with the equivariance violation metric, defined by: ∆( f, X, R ) , || f ( XR ) − f ( X ) R || F . (9) If ∆( f, X, R ) ≤  , we say f is  - approximately equivariant . We bound the equivariance violation of VN-LinearWithBias ( · ; W, U,  ) : R C × 3 → R C ′ × 3 below: Proposition 3. VN-LinearWithBias is (2  √ C ′ ) -approximately equivariant (tight when R = − I ) . A natural next question is: how do such equivariance violations propagate through a deep model? Proposition 4. Suppose we have K functions f k : X k → X k +1 (with X k ⊂ R C k × 3 , X k +1 ⊂ R C k +1 × 3 , k ∈ { 1 , . . . , K } ) s.t. 1. f k is  k -approximately equivariant for all k ∈ { 1 , . . . , K } 2. f k is L k -Lipschitz (w.r.t. || · || F ) for all k ∈ { 2 , . . . , K } . Then, the composition f K ◦ · · · ◦ f 1 is  1 ...K -approximately equivariant, where  1 ...K , L K ( · · · ( L 3 ( L 2  1 +  2 ) +  3 ) + · · · ) +  K . (10) Intuitively, each layer f k “stretches” the equivariance violation error of the previous layers by its Lipschitz constant L k , and adds its own violation  k to the total error. 8 Experiments 8.1 Rotation-invariant classification Figure 1a shows our proposed VN-Transformer architecture for classification. It consists of rotation-equivariant operations (VN-MLP and VN-Transformer Encoder blocks), followed by an invariant operation (VN-Invariant), and fi- nally standard Flatten/Pool/MLP operations to get class predictions. The resulting log- its/class predictions are rotation-invariant. Table 1: ModelNet40 test accuracy. Top block shows SO(3)-invariant baselines taken from Deng et al. (2021), included here for convenience. Model Acc. # Params TFN (Thomas et al., 2018) 88.5% – RI-Conv (Zhang et al., 2019) 86.5% – GC-Conv (Zhang et al., 2020) 89.0% – VN-PointNet (Deng et al., 2021) 77.2% 2.20M VN-DGCNN (Deng et al., 2021) 90.0% 2.00M VN-Transformer (ours) 90.8% 0.04M We evaluate our VN-Transformer classifier on the commonly used ModelNet40 dataset (Wu et al., 2015), a 40-class point-cloud classification problem. In Table 1, we compare our model with recent rotation-invariant models. 
The VN-Transformer outperforms the baseline VN models with orders of magnitude fewer parameters. Furthermore, we dispense with the computationally expensive edge-convolution used as a preprocessing step 8 Published in Transactions on Machine Learning Research (01/2023) ĨДѕΪΰ Figure 6: Example point-clouds from the ModelNet40 Polka-dot dataset. Cyan points correspond to a i = 0 , and pink points correspond to a i = 1 . Note that the “airplane” class has a narrower polka-dot radius than the “table” class, since we make the polka-dot radius dependent on the object class. 0 5 10 15 20 25 30 35 40 # Nearest Neighbors 0.86 0.88 0.90 0.92 0.94 Accuracy 0.0 0.2 0.4 0.6 0.8 1.0 1.2 Relative Speedup Accuracy Speedup Figure 7: VN-Transformer ModelNet40 accuracy and relative training speed vs. number of nearest neigh- bors used in edge-convolution preprocessing. Speed is computed relative to zero neighbors ( i.e. , no edge- convolution). Edge-convolution slows training speed by ∼ 5x and has little effect on model performance. in the models of Deng et al. (2021) and find that the VN-Transformer’s performance is relatively unaffected (see Figure 7). 32 64 128 256 512 # Latent Features 0.84 0.86 0.88 0.90 0.92 0.94 0.96 0.98 1.00 Accuracy 1.0 1.2 1.4 1.6 1.8 2.0 2.2 Relative Speedup Accuracy Speedup Figure 8: Test set accuracy on ModelNet40 vs. number of latent features (described in Figure 5). Latent features provide a ∼ 2x speedup with minimal acc. degradation (vs. row 3 of Table 2). Table 2: Test set accuracy on ModelNet40 “Polka- Dot” dataset. Top block shows ModelNet40 re- sults ( i.e. , only spatial inputs). Bottom block shows ModelNet40 Polka-dot results.  is the bias norm in VNLinearWithBias of eq. equation 8. Model  Fusion Features Acc. VN-PointNet 0 – [ x, y, z ] 77.2% VN-Transformer 0 – [ x, y, z ] 88.5% VN-Transformer 10 − 6 – [ x, y, z ] 90.8% VN-PointNet 0 Early [ x, y, z, a ] 82.0% VN-Transformer 0 Early [ x, y, z, a ] 91.1% VN-Transformer 10 − 6 Early [ x, y, z, a ] 95.4 % VN-Transformer 10 − 6 Late [ x, y, z, a ] 91.0% 8.2 Classification with non-spatial attributes To evaluate our model’s ability to handle non-spatial attributes, we design a modified version of the ModelNet40 dataset, called ModelNet40 “Polka-dot,” which we construct (from ModelNet40) as follows: to each point [ x i , y i , z i ] , we append a i ∈ { 0 , 1 } . Within a radius r of a randomly chosen center, we randomly select 30 points and set a i = 1 . We make the “polka-dot” radius r depend on the label y ∈ { 1 , . . . , 40 } via r ( y ) , r lo + 1 39 ( y − 1)( r hi − r lo ) , where r lo = 0 . 3 , r hi = 1 . Figure 6 shows example point-clouds from the ModelNet40 Polka-dot dataset. By generating ModelNet40 Polka-dot in this way, we directly embed class information into the non-spatial attributes. In order to perform well on this task, models need to effectively fuse spatial and non-spatial information (there is no useful information in the non-spatial attributes alone since all point-clouds have ∑ N i =1 a i = 30 ). The results on ModelNet40 Polka-dot are shown in Table 2. Both VN-PointNet and VN-Transformer benefit significantly from the binary polka-dots, suggesting they are able to effectively combine spatial and non-spatial information. 
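For reproducibility, the Polka-dot construction can be sketched as follows. This is our reading of the description above, with hypothetical function and variable names; the choice of center and the handling of sparse neighborhoods are assumptions, not the authors' exact procedure:

```python
import numpy as np

def add_polka_dots(points, label, rng, n_dots=30, r_lo=0.3, r_hi=1.0, n_classes=40):
    """Append a binary attribute a to each point of one ModelNet40 cloud.

    points: (N, 3) array; label: int in {1, ..., n_classes}.
    Returns an (N, 4) array of [x, y, z, a] rows, with n_dots points set to a = 1
    inside a ball of class-dependent radius around a randomly chosen center.
    """
    # r(y) = r_lo + (y - 1) * (r_hi - r_lo) / 39, as in Section 8.2.
    r = r_lo + (label - 1) * (r_hi - r_lo) / (n_classes - 1)

    center = points[rng.integers(len(points))]   # assumption: center is a random point of the cloud
    in_ball = np.flatnonzero(np.linalg.norm(points - center, axis=1) <= r)

    a = np.zeros(len(points))
    # Assumption: if the ball contains fewer than n_dots points, mark all of them.
    chosen = rng.choice(in_ball, size=min(n_dots, len(in_ball)), replace=False)
    a[chosen] = 1.0
    return np.concatenate([points, a[:, None]], axis=1)

rng = np.random.default_rng(0)
cloud = rng.normal(size=(1024, 3))                     # stand-in for a ModelNet40 point-cloud
augmented = add_polka_dots(cloud, label=17, rng=rng)   # shape (1024, 4)
```

Applying this to every cloud yields the [[x, y, z]; [a]] inputs used either for early fusion (concatenated before the VN-Transformer) or late fusion (attributes merged after the equivariant trunk).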
9 Published in Transactions on Machine Learning Research (01/2023) 8.3 Latent features Figure 8 shows our results on ModelNet40 when we reduce the number of tokens from 1024 (the number of points in the original point-cloud) to 32 using the latent feature mechanism presented in Figure 5. Using latent features provides a ∼ 2x latency improvement (in training steps/sec) with minimal ( ∼ 1.7%) accuracy degradation (compared with row 3 of Table 2). This suggests a real benefit of the latent feature mechanism in time-sensitive applications such as autonomous driving. 8.4 Rotation-equivariant motion forecasting Figure 1b shows our proposed rotation-equivariant ar- chitecture for motion forecasting. In motion forecasting the goal is to predict the [ x, y, z ] locations of an agent for a sequence of future timesteps, given as input the past locations of the agent. We evaluate the model on a simplified version of the Waymo Open Motion Dataset (WOMD; Ettinger et al., 2021): • We select 4904 trajectories (3915 for training, 979 for testing). • Each trajectory consists of 91 [ x, y, z ] points for a single vehicle sampled at 5 Hz. • We use the first 11 points (the past) as input and we predict the remaining 80 points (the future). Table 3: Average Distance Error on WOMD. ↓ Lower is better.  z = random z -axis rotations used as data augmentation at training time.  is the bias in the VNLinearWithBias layers (equa- tion equation 8). The bottom block uses the speed a as an added input feature (via early fusion). Model  Features ADE ( ↓ ) Transformer – [ x, y, z ] 5.01 Transformer +  z – [ x, y, z ] 4.51 VN-Transformer 0 [ x, y, z ] 4.91 VN-Transformer 10 − 6 [ x, y, z ] 3.95 VN-Transformer 0 [ x, y, z, a ] 5.01 VN-Transformer 10 − 6 [ x, y, z, a ] 3.67 We evaluate the quality of our trajectory forecast- ing models using the Average Distance Error (ADE): ADE ( Y i , ˆ Y i ) , 1 T ∑ T t =1 || Y ( t ) i − ˆ Y ( t ) i || 2 , where T is the number of time-steps in the output trajectory and Y i , ˆ Y i ∈ R T × 3 are the ground-truth trajectory and the predicted trajectory, respectively. Results are shown in Table 3. Adding training-time random rotations about the z -axis yields improves the performance of the vanilla Transformer, and the VN-Transformer outperforms the vanilla Transformer (without the need for train-time ro- tation augmentations, thanks to equivariance). Figure 9 shows example predictions on WOMD. Equivariance violations of the vanilla Transformer models (columns (a) and (b)) are clearly demonstrated here, in contrast with the equivariant VN-Transformer (column (c)). ďĉůÑÕĉĶñ±ăѕӃѕĨīďĨīñÕĶ±īŕ в±гѕ вÊгѕ вËгѕ вΫгѕ вΪгѕ Figure 9: Example predictions on WOMD. Leg- end: Rectangle = current car position, Red points = input trajectory, Colored streaks = predicted trajectories. Columns are different trajectory models, with (a) Transformer, (b) Transformer +  z augmentations, and (c) VN-Transformer. Row (1) is one example in the dataset. Row (2) is a 45 ◦ rotation of the input points in Row (1). 9 Conclusion In this paper, we introduced the VN-Transformer, a rotation-equivariant Transformer model based on the Vector Neurons framework. VN-Transformer is a significant step towards building powerful, modular, and easy-to-use models that have appealing equivariance properties for point-cloud data. Limitations of our work include: ( i ) similar to previous work (Deng et al., 2021; Qi et al., 2017) we assume input data has been mean-centered. 
This is sensitive to outliers, and prevents us from making single-pass predictions for multi-object problems (we have to independently mean-center each agent first). Similarly, we have not addressed other types of invariance/equivariance ( e.g. , scale invariance) in this work; ( ii ) Proposition 4 shows an error bound on the total equivariance violation of the network with L k -Lipschitz layers. We know the Lipschitz constants of VN-Linear and VN-LinearWithBias (see Appendix C), but we have not yet determined them for other layers ( e.g. , VN-ReLU, VN-MultiHeadAttn). We will address these gaps in future 10 Published in Transactions on Machine Learning Research (01/2023) work, and we will also leverage VN-Transformers to obtain state-of-the-art performance on a number of key benchmarks such as the full Waymo Open Motion Dataset and ScanObjectNN. References Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer normalization, 2016. (Cited on Page 6) Michael M. Bronstein, Joan Bruna, Taco Cohen, and Petar Veličković. Geometric deep learning: Grids, groups, graphs, geodesics, and gauges. ArXiv , abs/2104.13478, 2021. (Cited on Page 3) Angel X Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, et al. Shapenet: An information-rich 3d model repository. arXiv preprint arXiv:1512.03012 , 2015. (Cited on Page 2, 3) Benjamin Chidester, Minh N. Do, and Jian Ma. Rotation equivariance and invariance in convolutional neural networks, 2018. (Cited on Page 3) Taco Cohen and Max Welling. Group equivariant convolutional networks. In Maria Florina Balcan and Kilian Q. Weinberger (eds.), Proceedings of The 33rd International Conference on Machine Learning , volume 48 of Proceedings of Machine Learning Research , pp. 2990–2999, New York, New York, USA, 20–22 Jun 2016. PMLR. URL https://proceedings.mlr.press/v48/cohenc16.html . (Cited on Page 3) Congyue Deng, Or Litany, Yueqi Duan, Adrien Poulenard, Andrea Tagliasacchi, and Leonidas Guibas. Vector neurons: A general framework for so(3)-equivariant networks, 2021. (Cited on Page 2, 5, 6, 8, 9, 10, 14, 15, 20, 21) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding, 2019. (Cited on Page 4) Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale, 2021. (Cited on Page 4) Carlos Esteves, Christine Allen-Blanchette, Xiaowei Zhou, and Kostas Daniilidis. Polar transformer networks. In International Conference on Learning Representations , 2018. (Cited on Page 2, 3) Scott Ettinger, Shuyang Cheng, Benjamin Caine, Chenxi Liu, Hang Zhao, Sabeek Pradhan, Yuning Chai, Ben Sapp, Charles R. Qi, Yin Zhou, Zoey Yang, Aurélien Chouard, Pei Sun, Jiquan Ngiam, Vijay Vasudevan, Alexander McCauley, Jonathon Shlens, and Dragomir Anguelov. Large scale interactive motion forecasting for autonomous driving: The waymo open motion dataset. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) , pp. 9710–9719, October 2021. (Cited on Page 10, 15) Marc Anton Finzi, Gregory Benton, and Andrew Gordon Wilson. Residual pathway priors for soft equivariance constraints. In A. Beygelzimer, Y. Dauphin, P. Liang, and J. 
Wortman Vaughan (eds.), Advances in Neural Information Processing Systems , 2021. URL https://openreview.net/forum?id=k505ekjMzww . (Cited on Page 4) Fabian B. Fuchs, Daniel E. Worrall, Volker Fischer, and Max Welling. Se(3)-transformers: 3d roto-translation equivariant attention networks, 2020. (Cited on Page 2, 4, 6, 14) Geoffrey E Hinton, Sara Sabour, and Nicholas Frosst. Matrix capsules with em routing. In International conference on learning representations , 2018. (Cited on Page 3) Max Jaderberg, Karen Simonyan, Andrew Zisserman, et al. Spatial transformer networks. Advances in neural information processing systems , 28:2017–2025, 2015. (Cited on Page 2, 3) Andrew Jaegle, Sebastian Borgeaud, Jean-Baptiste Alayrac, Carl Doersch, Catalin Ionescu, David Ding, Skanda Koppula, Daniel Zoran, Andrew Brock, Evan Shelhamer, Olivier Hénaff, Matthew M. Botvinick, Andrew Zisserman, Oriol Vinyals, and João Carreira. Perceiver io: A general architecture for structured inputs & outputs, 2021. (Cited on Page 7) 11 Published in Transactions on Machine Learning Research (01/2023) Salman Khan, Muzammal Naseer, Munawar Hayat, Syed Waqas Zamir, Fahad Shahbaz Khan, and Mubarak Shah. Transformers in vision: A survey. ACM Computing Surveys , Jan 2022. ISSN 1557-7341. doi: 10.1145/3505244. URL http://dx.doi.org/10.1145/3505244 . (Cited on Page 2, 4) Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems , 25:1097–1105, 2012. (Cited on Page 3) Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. Albert: A lite bert for self-supervised learning of language representations, 2020. (Cited on Page 4) Alex H Lang, Sourabh Vora, Holger Caesar, Lubing Zhou, Jiong Yang, and Oscar Beijbom. Pointpillars: Fast encoders for object detection from point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pp. 12697–12705, 2019. (Cited on Page 3) Juho Lee, Yoonho Lee, Jungtaek Kim, Adam R. Kosiorek, Seungjin Choi, and Yee Whye Teh. Set transformer: A framework for attention-based permutation-invariant neural networks, 2019. (Cited on Page 2) Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized bert pretraining approach, 2019. (Cited on Page 4) Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization, 2019. (Cited on Page 15) Diego Marcos, Michele Volpi, Nikos Komodakis, and Devis Tuia. Rotation equivariant vector field networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV) , Oct 2017. (Cited on Page 3) Charles R. Qi, Hao Su, Kaichun Mo, and Leonidas J. Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation, 2017. (Cited on Page 2, 10) Charles R Qi, Wei Liu, Chenxia Wu, Hao Su, and Leonidas J Guibas. Frustum pointnets for 3d object detection from rgb-d data. In Proceedings of the IEEE conference on computer vision and pattern recognition , pp. 918–927, 2018. (Cited on Page 3) Nathaniel Thomas, Tess Smidt, Steven Kearnes, Lusann Yang, Li Li, Kai Kohlhoff, and Patrick Riley. Tensor field networks: Rotation- and translation-equivariant neural networks for 3d point clouds, 2018. (Cited on Page 2, 4, 6, 8) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 
Attention is all you need, 2017. (Cited on Page 2, 4, 5, 6, 7, 15) Bastiaan S. Veeling, Jasper Linmans, Jim Winkens, Taco Cohen, and Max Welling. Rotation equivariant cnns for digital pathology, 2018. (Cited on Page 3) Rui Wang, Robin Walters, and Rose Yu. Approximately equivariant networks for imperfectly symmetric dynamics, 2022. URL https://arxiv.org/abs/2201.11969 . (Cited on Page 4) Daniel Worrall and Gabriel Brostow. Cubenet: Equivariance to 3d rotation and translation. In Proceedings of the European Conference on Computer Vision (ECCV) , September 2018. (Cited on Page 3) Daniel E. Worrall, Stephan J. Garbin, Daniyar Turmukhambetov, and Gabriel J. Brostow. Harmonic networks: Deep translation and rotation equivariance. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) , July 2017. (Cited on Page 3) Zhirong Wu, Shuran Song, Aditya Khosla, Fisher Yu, Linguang Zhang, Xiaoou Tang, and Jianxiong Xiao. 3d shapenets: A deep representation for volumetric shapes, 2015. (Cited on Page 8, 15) Bin Yang, Wenjie Luo, and Raquel Urtasun. Pixor: Real-time 3d object detection from point clouds. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition , pp. 7652–7660, 2018. (Cited on Page 3) 12 Published in Transactions on Machine Learning Research (01/2023) Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. Xlnet: Generalized autoregressive pretraining for language understanding, 2020. (Cited on Page 4) Manzil Zaheer, Satwik Kottur, Siamak Ravanbakhsh, Barnabas Poczos, Ruslan Salakhutdinov, and Alexander Smola. Deep sets, 2018. (Cited on Page 2) Zhiyuan Zhang, Binh-Son Hua, David W. Rosen, and Sai-Kit Yeung. Rotation invariant convolutions for 3d point clouds deep learning, 2019. URL https://arxiv.org/abs/1908.06297 . (Cited on Page 8) Zhiyuan Zhang, Binh-Son Hua, Wei Chen, Yibin Tian, and Sai-Kit Yeung. Global context aware convolutions for 3d point cloud understanding. In 2020 International Conference on 3D Vision (3DV) , pp. 210–219, 2020. doi: 10.1109/3DV50981.2020.00031. (Cited on Page 8) Yin Zhou and Oncel Tuzel. Voxelnet: End-to-end learning for point cloud based 3d object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition , pp. 4490–4499, 2018. (Cited on Page 3) 13 Published in Transactions on Machine Learning Research (01/2023) VN-Transformer (ours) SE(3)-Transformer (with ` = 1 ) (Fuchs et al., 2020) Weight shape W Q ∈ R C ′ × C ( C, C ′ = # of channels) W 11 Q ∈ R 3 × 3 ( ∈ R (2 ` ′ +1) × (2 ` +1) in the general case) Action on input Left-multiply: W Q V ( n ) ( V ( n ) ∈ R C × 3 ) Right-multiply: f 1 i W 11 Q ( f 1 i ∈ R 1 × 3 ) Requirement for equiv- ariance None, W Q ∈ R C ′ × C is arbitrary W 11 Q = ∑ 2 J =0 ∑ J m = − J φ 11 J ( || x i || ) Y Jm ( x i / || x i || ) Q 11 Jm , ( φ = radial neural net., Y Jm = spherical harmonics, Q 11 Jm = Clebsch-Gordan coefficients) Table 4: Attention query computation in VN-Transformer vs. SE(3)-Transformer. W Q ∈ R C ′ × C is the VN-Transformer weight matrix used to compute the query ( C and C ′ are the number of input and query channels, respectively). A Detailed comparison with SE(3)-Transformer (Fuchs et al., 2020) A.1 Attention computation There is a rich literature on equivariant models using steerable kernels, and the SE(3)-Transformer is the closest development in this field to our work. 
Here, we make a detailed comparison between our work and the SE(3)-Transformer (and related steerable kernel-based models). For simplicity, we will compare the VN-Transformer with only spatial features vs. the SE(3)-Transformer with only type-1 features (i.e., spatial features). The key difference is in the way the weight matrices are defined and how they interact with the input . Specifically, Fuchs et al. use 3 × 3 weight matrices W 11 Q , W 11 K , W 11 V ∈ R 3 × 3 that act on the spatial/representation dimension of the input points ( e.g. , via f 1 i W 11 Q where f 1 i ∈ R 1 × 3 ). 2 As a result, in order to guarantee equivariance they need to design these matrices W 11 Q , W 11 K , W 11 V such that they each commute with a rotation operation (in general ( f 1 i R ) W 11 Q 6 = ( f 1 i W 11 Q ) R – this depends on the choice of W 11 Q ), hence the need for the machinery of Clebsch-Gordan coefficients, spherical harmonics, and radial neural nets to construct the weights. In contrast, in our proposed attention mechanism, the matrices W Q , W K , W Z ∈ R C ′ × C act on the channel dimension of the input ( e.g ., via W Q V ( n ) , where V ( n ) ∈ R C × 3 ) and not the spatial dimension. As a result, the operations W Q V ( n ) , W K V ( n ) , W Z V ( n ) are equivariant no matter the choice of W Q , W K , W Z , since W ( V ( n ) R ) = ( W V ( n ) ) R . This results in a significantly simpler construction of rotation-equivariant attention that is ( i ) accessible to a wider audience ( i.e. , it does not require an understanding of group theory, representation theory, spherical harmonics, Clebsch-Gordan coefficients, etc.) and ( ii ) much easier to implement. For a side-by-side comparison of both attention query computations, see Table 4 above. A.2 VN-Linear vs. SE(3)-Transformer “self-interaction” There is a relationship between the VN-Linear operation of Deng et al. (2021), and the “linear self-interaction” layers of Fuchs et al. (2020). Comparing equation (12) of Fuchs et al. (2020), repeated here for convenience: f ` out ,i,c ′ = ∑ c w `` c ′ c f ` in ,i,c , (11) 2 W `` ′ Q , W `` ′ K , W `` ′ V ∈ R (2 ` +1) × (2 ` ′ +1) in the general case, where `, ` ′ ∈ { 0 , 1 , 2 } are feature types 14 Published in Transactions on Machine Learning Research (01/2023) with the VN-Linear operation of Deng et al. (2021): V ( n ) out = W V ( n ) , W ∈ R C ′ × C , V ( n ) ∈ R C × 3 , (12) we see that these operations are identical. However, our proposed VN-MultiHeadAttention is different and significantly simpler than the attention mechanism of the SE(3)-Transformer (see Section A.1), as it relies only on ( i ) the rotation-invariant Frobenius inner product and ( ii ) straightforward multiplication by an arbitary weight matrix to compute the keys, queries, and values (equivalent to VN-Linear/linear self-interaction). In that sense, it is closer in spirit to the original Transformer of Vaswani et al. (replacing vector inner product with Frobenius inner product). Further, as explained previously, it does not require the special construction of weight matrices using Clebsch-Gordan coefficients, spherical harmonics, and radial neural networks as in the SE(3)-Transformer. B Experimental details B.1 Datasets ModelNet40 The ModelNet40 dataset (Wu et al., 2015) is publicly available at https://modelnet.cs.princeton.edu , with the following comment under “Copyright”: “All CAD models are downloaded from the Internet and the original authors hold the copyright of the CAD models. 
The label of the data was obtained by us via Amazon Mechanical Turk service and it is provided freely. This dataset is provided for the convenience of academic research only.” Waymo Open Motion Dataset The Waymo Open Motion Dataset (Ettinger et al., 2021) is publicly available at https://waymo.com/open/data/motion/ under a non-commercial use license agreement. Full license details can be found here: https://waymo.com/open/terms/ . B.2 Hyperparameter tuning Table 5 shows the hyperparameters we swept over for all our experiments on ModelNet40, ModelNet40 Polka-dot, and the Waymo Open Motion Dataset. Hyperparameter Value/Range Feature dimension of VN-Transformer {32, 64, 128, 256, 512, 1024} Number of attention heads {4, 8, 16, 32, 64, 128} Hidden layer dimension in encoder’s VN-MLP {32, 64, 128, 256, 512} Learning rate 10 − 3 Learning rate schedule Linear decay Optimizer AdamW (Loshchilov & Hutter, 2019) Epochs 4000  of VN-LinearWithBias {0, 10 − 6 } Table 5: Model hyperparameter ranges for ModelNet40, ModelNet40 Polka-dot, and Waymo Open Motion Dataset. B.3 Compute infrastructure We trained our models on TPU-v3 devices. which are accessible through Google Cloud. Our longest training jobs ran for less than 3 hours on 32 TPU cores. 15 Published in Transactions on Machine Learning Research (01/2023) C Proofs In this section, for convenience we will treat all 3D vectors as row-vectors: e.g. , x ∈ R 1 × 3 . We also note that, while all our proofs of invariance/equivariance use matrices V ( n ) ∈ R C × 3 , they can all be trivially generalized to V ( n ) ∈ R C × S . C.1 Partial invariance & equivariance Assuming the input point-cloud consists of spatial inputs X ∈ X ⊂ R N × 3 and associated non-spatial attributes A ∈ A ⊂ R N × d A ( d A is the number of non-spatial attributes associated with each point), we would like our model f : X × A → Y to satisfy the following property: Definition 4 (Partial rotation invariance) . A model f : X × A → Y satisfies partial rotation invariance if ∀ X ∈ X , A ∈ A , R ∈ SO (3) , f ( XR, A ) = f ( X, A ) . Definition 5 (Partial rotation equivariance) . A model f : X × A → Y (where Y ⊂ R N out × (3+ d A ) ) satisfies partial rotation equivariance if ∀ X ∈ X , A ∈ A , R ∈ SO (3) , f ( XR, A ) (: , :3) = f ( X, A ) (: , :3) R. We show here that the models in Figure 1 satisfy partial rotation invariance and equivariance (respectively). Consider two rotation matrices R d 1 × d 1 ∈ SO ( d 1 ) and R d 2 × d 2 ∈ SO ( d 2 ) . Lemma 1. The matrix R ( d 1 + d 2 ) × ( d 1 + d 2 ) , [ R d 1 × d 1 0 d 1 × d 2 0 d 2 × d 1 R d 2 × d 2 ] is a valid rotation matrix in SO ( d 1 + d 2 ) . Proof. We begin by showing that R ᵀ ( d 1 + d 2 ) × ( d 1 + d 2 ) = R − 1 ( d 1 + d 2 ) × ( d 1 + d 2 ) . We first compute R ᵀ ( d 1 + d 2 ) × ( d 1 + d 2 ) : R ᵀ ( d 1 + d 2 ) × ( d 1 + d 2 ) = [ R d 1 × d 1 0 d 1 × d 2 0 d 2 × d 1 R d 2 × d 2 ] ᵀ = [ R ᵀ d 1 × d 1 0 ᵀ d 2 × d 1 0 ᵀ d 1 × d 2 R ᵀ d 2 × d 2 ] ( ∗ ) = [ R − 1 d 1 × d 1 0 d 1 × d 2 0 d 2 × d 1 R − 1 d 2 × d 2 ] , (13) where ( ∗ ) holds since R d 1 × d 1 ∈ SO ( d 1 ) and R d 2 × d 2 ∈ SO ( d 2 ) by assumption. Now, we compute R − 1 ( d 1 + d 2 ) × ( d 1 + d 2 ) : R − 1 ( d 1 + d 2 ) × ( d 1 + d 2 ) = [ R d 1 × d 1 0 d 1 × d 2 0 d 2 × d 1 R d 2 × d 2 ] − 1 (14) = [[ R d 1 × d 1 − 0 d 1 × d 2 R − 1 d 2 × d 2 0 d 2 × d 1 ] − 1 0 d 1 × d 2 0 d 2 × d 1 [ R d 2 × d 2 − 0 d 2 × d 1 R − 1 d 1 × d 1 0 d 1 × d 2 ] − 1 ] (15) = [ R − 1 d 1 × d 1 0 d 1 × d 2 0 d 2 × d 1 R − 1 d 2 × d 2 ] . 
(16) Hence, we have that R − 1 ( d 1 + d 2 ) × ( d 1 + d 2 ) = R ᵀ ( d 1 + d 2 ) × ( d 1 + d 2 ) . Finally, we show that det( R ( d 1 + d 2 ) × ( d 1 + d 2 ) ) = 1 : det( R ( d 1 + d 2 ) × ( d 1 + d 2 ) ) = det ([ R d 1 × d 1 0 d 1 × d 2 0 d 2 × d 1 R d 2 × d 2 ]) = det( R d 1 × d 1 ) det( R d 2 × d 2 ) ( ∗ ) = 1 · 1 = 1 , (17) where ( ∗ ) holds since R d 1 × d 1 ∈ SO ( d 1 ) and R d 2 × d 2 ∈ SO ( d 2 ) by assumption. Proposition 5. The VN-Transformer model f : X × A → Y (where Y ⊂ R κ ) shown in Figure 1a satisfies partial rotation invariance. Proof. • For convenience, we reparametrize the model f as f concat : R N × (3+ d A ) → R κ (with κ the number of object classes) where f concat ([ X, A ]) = f ( X, A ) . It then suffices to show that f concat ([ XR, A ]) = f concat ([ X, A ]) R . 16 Published in Transactions on Machine Learning Research (01/2023) • First, note that f concat is SO (3 + d A ) -invariant, since it is composed of SO (3 + d A ) -equivariant operations followed by a SO (3 + d A ) -invariant operation. • Consider the matrix R (3+ d A ) × (3+ d A ) , [ R 0 3 × d A 0 d A × 3 I d A × d A ] , where R ∈ SO (3) is an arbitrary 3- dimensional rotation. From Lemma 1, R (3+ d A ) × (3+ d A ) ∈ SO (3 + d A ) . f concat ([ X, A ] R (3+ d A ) × (3+ d A ) ) ( ∗ ) = f concat ([ X, A ]) (18) ⇒ f concat ([ XR + A 0 d A × 3 , X 0 3 × d A + AI d A × d A ]) = f concat ([ X, A ]) (19) ⇒ f concat ([ XR, A ]) = f concat ([ X, A ]) , (20) where ( ∗ ) holds from SO (3 + d A ) -invariance of f concat . Proposition 6. The VN-Transformer model f : X × A → Y (where Y ⊂ R N out × (3+ d A ) ) shown in Figure 1b satisfies partial rotation equivariance. Proof. • For convenience, we reparametrize the model f as f concat : R N × (3+ d A ) → R N out × (3+ d A ) where f concat ([ X, A ]) = f ( X, A ) . It then suffices to show that f concat ([ XR, A ]) (: , :3) = f concat ([ X, A ]) (: , :3) R . • First, note that f concat is SO (3 + d A ) -equivariant, since it is composed of SO (3 + d A ) -equivariant operations. • Consider the matrix R (3+ d A ) × (3+ d A ) , [ R 0 3 × d A 0 d A × 3 I d A × d A ] , where R ∈ SO (3) is an arbitrary 3- dimensional rotation. From Lemma 1, R (3+ d A ) × (3+ d A ) ∈ SO (3 + d A ) . f concat ([ X, A ] R (3+ d A ) × (3+ d A ) ) ( ∗ ) = f concat ([ X, A ]) R (3+ d A ) × (3+ d A ) (21) ⇒ f concat ([ XR + A 0 d A × 3 , X 0 3 × d A + AI d A × d A ]) = [ f concat ([ X, A ]) (: , :3) R + f concat ([ X, A ]) (: , 4:) 0 d A × 3 , f concat ([ X, A ]) (: , :3) 0 3 × d A + f concat ([ X, A ]) (: , 4:) I d A × d A ] (22) ⇒ f concat ([ XR, A ]) = [ f concat ([ X, A ]) (: , :3) R, f concat ([ X, A ]) (: , 4:) ] (23) ⇒ f concat ([ XR, A ]) (: , :3) = f concat ([ X, A ]) (: , :3) R, (24) where ( ∗ ) holds from SO (3 + d A ) -equivariance of f concat . C.2  -approximate equivariance Proposition 3 (Restated) . VN-LinearWithBias ( · ; W, U,  ) is (2  √ C ′ ) -approximately equivariant. This bound is tight when R = − I 3 × 3 . 17 Published in Transactions on Machine Learning Research (01/2023) Proof. 
Set f , VN-LinearWithBias ( · ; W, U,  ) : f ( XR ) − f ( X ) R = ( W XR + U ) − ( W X + U ) R (25) = U − U R (26) ⇒ ∆( f, X, R ) 2 = || f ( XR ) − f ( X ) R || 2 F (27) = C ′ ∑ c =1 ||  ( U ( c ) − U ( c ) R ) || 2 2 (28) =  2 C ′ ∑ c =1 || U ( c ) − U ( c ) R || 2 2 (29) =  2 C ′ ∑ c =1 || U ( c ) || 2 2 + || U ( c ) R || 2 2 − 2 U ( c ) R ᵀ U ( c ) ᵀ (30) ≤  2 C ′ ∑ c =1 || U ( c ) || 2 2 + || U ( c ) R || 2 2 + 2 U ( c ) U ( c ) ᵀ (31) =  2 C ′ ∑ c =1 4 || U ( c ) || 2 2 (32) = 4  2 C ′ (33) ⇒ ∆( f, X, R ) ≤ 2  √ C ′ . (34) Lemma 2. Suppose 1. f : X 1 → X 2 (with X 1 ⊂ R C 1 × 3 , X 2 ⊂ R C 2 × 3 ) is  f -approximately equivariant. 2. g : X 2 → X 3 (with X 3 ⊂ R C 3 × 3 ) is  g -approximately equivariant and L g -Lipschitz (w.r.t. the Frobenius norm). Then the composition g ◦ f : X 1 → X 3 is ( L g  f +  g ) -approximately equivariant. Proof. ∆( g ◦ f, X, R ) = || g ( f ( XR )) − g ( f ( X )) R || F (35) = || g ( f ( XR )) − g ( f ( X ) R ) + g ( f ( X ) R ) − g ( f ( X )) R || F (36) ≤ || g ( f ( XR )) − g ( f ( X ) R ) || F + || g ( f ( X ) R ) − g ( f ( X )) R || F (37) ( ∗ ) ≤ || g ( f ( XR )) − g ( f ( X ) R ) || F +  g (38) ( ∗∗ ) ≤ L g || f ( XR ) − f ( X ) R || F +  g (39) = L g ∆( f, X, R ) +  g (40) ( ∗∗∗ ) ≤ L g  f +  g , (41) where ( ∗ ) holds from  g -approximate equivariance of g , ( ∗∗ ) holds because g is L g -Lipschitz, and ( ∗ ∗ ∗ ) holds from  f -approximate equivariance of f . Proposition 4 (Restated) . Suppose we have K functions f k : X k → X k +1 (with X k ⊂ R C k × 3 , X k +1 ⊂ R C k +1 × 3 ) for k ∈ { 1 , . . . , K } , satisfying the following: 1. f k is  k -approximately equivariant for all k ∈ { 1 , . . . , K } . 2. f k is L k -Lipschitz (w.r.t. || · || F ) for all k ∈ { 2 , . . . , K } . 18 Published in Transactions on Machine Learning Research (01/2023) Figure 10: Violations of equivariance in a neural network with L k -Lipschitz and  k -approximately equivariant layers. Then, the composition f K ◦ · · · ◦ f 1 is  1 ...K -approximately equivariant, where:  1 ...K , L K ( · · · ( L 3 ( L 2  1 +  2 ) +  3 ) + · · · ) +  K (42) Proof. ∆( f K ◦ · · · ◦ f 1 , X, R ) = ∆( f K ◦ ( f K − 1 ◦ · · · ◦ f 1 ) , X, R ) (43) ( K ) ≤ L K ∆( f K − 1 ◦ · · · ◦ f 1 , X, R ) +  K (44) ( K − 1) ≤ L K ( L K − 1 (∆( f K − 2 ◦ · · · ◦ f 1 , X, R ) +  K − 1 ) +  K (45) . . . (46) (1) ≤ L K ( · · · ( L 3 ( L 2 ∆( f 1 , X, R ) +  2 ) +  3 ) + · · · ) +  K (47) ≤ L K ( · · · ( L 3 ( L 2  1 +  2 ) +  3 ) + · · · ) +  K (48) where ( K ) − (1) hold from applying inequality equation 40 from Lemma 2 K times (setting g , f k and f , f k − 1 ◦ . . . f 1 at each step). Figure 10 illustrates the propagation of equivariance violations through a composition of 3 functions. Proposition 7. The VN-Linear ( · ; W ) : R C × S → R C ′ × S layer is σ ( W ) -Lipschitz w.r.t. the Frobenius norm, where σ ( W ) is the spectral norm of W . The same holds for VN-LinearWithBias ( · ; W, U,  ) . Proof. Consider X 1 , X 2 ∈ R C × S . We can write: || W X 1 − W X 2 || 2 F = S ∑ s =1 || W X (: ,s ) 1 − W X (: ,s ) 2 || 2 2 (49) = S ∑ s =1 || W ( X (: ,s ) 1 − X (: ,s ) 2 ) || 2 2 (50) ≤ S ∑ s =1 σ 2 ( W ) || X (: ,s ) 1 − X (: ,s ) 2 || 2 2 (51) = σ ( W ) 2 S ∑ s =1 || X (: ,s ) 1 − X (: ,s ) 2 || 2 2 (52) = σ ( W ) 2 || X 1 − X 2 || 2 F (53) ⇒ || W X 1 − W X 2 || F ≤ σ ( W ) || X 1 − X 2 || F (54) To see this for VN-LinearWithBias, note that || ( W X 1 + U ) − ( W X 2 + U ) || 2 F = || W X 1 − W X 2 || 2 F – we can then show the same result using the above proof. 
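The results of this subsection are straightforward to verify numerically. Below is a short NumPy sketch (illustrative, not the authors' code) of VN-LinearWithBias with the bias scaled to norm ε per row, the violation metric Δ of equation 9, and checks of the 2ε√C′ bound from Proposition 3:

```python
import numpy as np

def vn_linear_with_bias(V, W, B, eps):
    """VN-LinearWithBias (equation 8): W V + eps * U, where U has unit-norm rows."""
    U = B / np.linalg.norm(B, axis=1, keepdims=True)
    return W @ V + eps * U

def equivariance_violation(f, X, R):
    """Delta(f, X, R) = || f(X R) - f(X) R ||_F  (equation 9)."""
    return np.linalg.norm(f(X @ R) - f(X) @ R)

rng = np.random.default_rng(0)
C, C_out, eps = 8, 16, 1e-6
W = rng.normal(size=(C_out, C))
B = rng.normal(size=(C_out, 3))            # learnable bias directions
X = rng.normal(size=(C, 3))
f = lambda V: vn_linear_with_bias(V, W, B, eps)

bound = 2 * eps * np.sqrt(C_out)           # Proposition 3: 2 * eps * sqrt(C')
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
R = Q if np.linalg.det(Q) > 0 else -Q      # random rotation in SO(3)
assert equivariance_violation(f, X, R) <= bound

# The bound is attained at R = -I, where the difference f(XR) - f(X)R equals 2 * eps * U.
assert np.isclose(equivariance_violation(f, X, -np.eye(3)), bound)
```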
19 Published in Transactions on Machine Learning Research (01/2023) C.3 Equivariance of VN-LayerNorm We define the VN analog of the layer normalization operation as follows: VN-LayerNorm ( V ( n ) ) , [ V ( n,c ) || V ( n,c ) || 2 ] C c =1 LayerNorm ([ || V ( n,c ) || 2 ] C c =1 ) 1 1 × 3 (55) Proposition 8. VN-LayerNorm : R C × 3 → R C × 3 is rotation-equivariant. Proof. VN-LayerNorm ( V ( n ) R ) ( c ) = V ( n,c ) R || V ( n,c ) R || 2 LayerNorm ([ || V ( n,c ) R || 2 ] C c ′ =1 ) ( c ) (56) ( ∗ ) = V ( n,c ) R || V ( n,c ) || 2 LayerNorm ([ || V ( n,c ) || 2 ] C c ′ =1 ) ( c ) (57) = [ V ( n,c ) || V ( n,c ) || 2 LayerNorm ([ || V ( n,c ) || 2 ] C c ′ =1 ) ( c ) ] R (58) = VN-LayerNorm ( V ( n ) ) ( c ) R (59) = [ VN-LayerNorm ( V ( n ) ) R ] ( c ) (60) ⇒ VN-LayerNorm ( V ( n ) R ) = VN-LayerNorm ( V ( n ) ) R, (61) where ( ∗ ) holds from invariance of vector norms to rotations. C.4 Definitions of VN layers from Deng et al. (2021) VN-ReLU layer The VN-ReLU layer is constructed as follows: from a given representation V ( n ) ∈ R C × 3 , we compute a feature set q ∈ R C × 3 : q , W V ( n ) , W ∈ R C × C . (62) Then, we compute a set of C “learnable directions” k ∈ R C × 3 : k , U V ( n ) , U ∈ R C × C . (63) Note that W, U are learnable square matrices. Finally, we compute the output of the VN-ReLU operation VN-ReLU ( · ; W, U ) : R C × 3 → R C × 3 as follows: VN-ReLU ( V ( n ) ) ( c ) , { q ( c ) if 〈 q ( c ) , k ( c ) 〉 ≥ 0 q ( c ) − 〈 q ( c ) , k ( c ) || k ( c ) || 〉 k ( c ) || k ( c ) || o.w. (64) Otherwise stated: if the inner product between the feature q ( c ) and the learnable direction k ( c ) is positive, return q ( c ) , else return the projection of q ( c ) onto the plane defined by the direction k ( c ) . It can be readily shown that VN-ReLU is rotation-equivariant (for a proof, see Appendix C.5). VN-Invariant layer VN-Invariant ( · ; W ) : R C × 3 → R C × 3 is defined as: VN-Invariant ( V ( n ) ; W ) , V ( n ) VN-MLP ( V ( n ) ; W ) ᵀ , (65) where VN-MLP ( · ; W ) : R C × 3 → R 3 × 3 is a composition of VN-Linear and VN-ReLU layers, and W is the set of all learnable parameters in VN-MLP . It can be easily shown that VN-Invariant is rotation-invariant (see Appendix C.5 for a proof). VN-Batch Norm, VN-Pool For rotation-equivariant analogs of the standard batch norm and pooling operations, we point the reader to Deng et al. (2021). 20 Published in Transactions on Machine Learning Research (01/2023) C.5 Invariance & equivariance of VN layers of Deng et al. (2021) Proposition 9. (Deng et al., 2021) VN-Linear ( · ; W ) : R C × 3 → R C ′ × 3 is rotation-equivariant. Proof. VN-Linear ( V ( n ) R ; W ) , W V ( n ) R = ( W V ( n ) ) R = VN-Linear ( V ( n ) ; W ) R (66) Proposition 10. (Deng et al., 2021) VN-ReLU : R C × 3 → R C × 3 is rotation-equivariant. Proof. VN-ReLU ( V ( n ) R ) ( c ) ( ∗ ) = { q ( c ) R if 〈 q ( c ) R, k ( c ) R 〉 ≥ 0 q ( c ) R − 〈 q ( c ) R, k ( c ) R || k ( c ) R || 2 〉 k ( c ) R || k ( c ) R || 2 o.w. (67) ( ∗∗ ) = { q ( c ) R if 〈 q ( c ) , k ( c ) 〉 ≥ 0 q ( c ) R − 〈 q ( c ) , k ( c ) || k ( c ) || 2 〉 k ( c ) R || k ( c ) || 2 o.w. (68) = [{ q ( c ) if 〈 q ( c ) , k ( c ) 〉 ≥ 0 q ( c ) − 〈 q ( c ) , k ( c ) || k ( c ) || 〉 k ( c ) || k ( c ) || 2 o.w. ] R (69) = VN-ReLU ( V ( n ) ) ( c ) R (70) = [ VN-ReLU ( V ( n ) ) R ] ( c ) (71) ⇒ VN-ReLU ( V ( n ) R ) = VN-ReLU ( V ( n ) ) R, (72) where ( ∗ ) holds because q and k are rotation-equivariant w.r.t. V ( n ) and ( ∗∗ ) holds because vector inner products are rotation-invariant. Proposition 11. 
(Deng et al., 2021) VN-Invariant : R C × 3 → R C × 3 is rotation-invariant. Proof. VN-Invariant ( V ( n ) R ; W ) = ( V ( n ) R ) VN-MLP ( V ( n ) R ; W ) ᵀ (73) ( ∗ ) = V ( n ) R [ VN-MLP ( V ( n ) ; W ) R ] ᵀ (74) = V ( n ) RR ᵀ VN-MLP ( V ( n ) ; W ) ᵀ (75) = V ( n ) VN-MLP ( V ( n ) ; W ) ᵀ (76) = VN-Invariant ( V ( n ) ; W ) , (77) where ( ∗ ) holds by equivariance of VN-MLP. 21
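For completeness, the following NumPy sketch implements VN-ReLU and VN-Invariant as defined in equations 64 and 65 and numerically checks Propositions 10 and 11. It is a sketch under the definitions above; the weight initialization and the particular two-layer VN-MLP inside VN-Invariant are illustrative choices:

```python
import numpy as np

def vn_relu(V, W, U):
    """VN-ReLU (equation 64): q = W V, k = U V; when <q(c), k(c)> < 0, project q(c)
    onto the plane orthogonal to the learned direction k(c)."""
    q, k = W @ V, U @ V                               # both (C, 3)
    k_hat = k / np.linalg.norm(k, axis=1, keepdims=True)
    dots = np.sum(q * k_hat, axis=1, keepdims=True)   # <q(c), k(c)/||k(c)||>
    return np.where(dots >= 0, q, q - dots * k_hat)

def vn_invariant(V, W1, U1, W2):
    """VN-Invariant (equation 65): V @ VN-MLP(V)^T, with an illustrative two-layer
    VN-MLP (VN-ReLU followed by a VN-Linear mapping C channels down to 3)."""
    directions = W2 @ vn_relu(V, W1, U1)              # (3, 3); rotation-equivariant
    return V @ directions.T                           # (C, 3); rotation-invariant

rng = np.random.default_rng(0)
C = 8
V = rng.normal(size=(C, 3))
W1, U1 = rng.normal(size=(C, C)), rng.normal(size=(C, C))
W2 = rng.normal(size=(3, C))
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
R = Q if np.linalg.det(Q) > 0 else -Q                 # random rotation in SO(3)

# Proposition 10: VN-ReLU is rotation-equivariant.
assert np.allclose(vn_relu(V @ R, W1, U1), vn_relu(V, W1, U1) @ R)
# Proposition 11: VN-Invariant is rotation-invariant.
assert np.allclose(vn_invariant(V @ R, W1, U1, W2), vn_invariant(V, W1, U1, W2))
```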