GradTail: Learning Long-Tailed Data Using Gradient-based Sample Weighting

Zhao Chen 1, Vincent Casser 1, Henrik Kretzschmar 1, Dragomir Anguelov 1

1 Waymo LLC, Mountain View, California, USA. Correspondence to: Zhao Chen <zhaoch@waymo.com>.

Preprint version.

Abstract

We propose GradTail, an algorithm that uses gradients to improve model performance on the fly in the face of long-tailed training data distributions. Unlike conventional long-tail classifiers, which operate on converged (and possibly overfit) models, we demonstrate that an approach based on gradient dot product agreement can isolate long-tailed data early during model training and improve performance by dynamically assigning higher sample weights to that data. We show that such upweighting leads to model improvements for both classification and regression models, the latter of which are relatively unexplored in the long-tail literature, and that the long-tail examples found by gradient alignment are consistent with our semantic expectations.

1. Introduction

Although modern deep learning and machine learning techniques have achieved impressive performance across a diverse set of tasks, most models still struggle when faced with uneven data distributions containing very rare, or long-tail, examples. Such rare examples are frequently present in real-world data (Bengio, 2015), and handling them well is crucial in safety-critical applications like autonomous driving (e.g., Philion, 2019). Conventional methods to mitigate the effects of long-tailed training distributions often require identifying long-tailed data and then labeling additional data of the same type (i.e., active learning), as in (Yang et al., 2015), which in effect increases the probability density of such data and moves it out of the long tail. However, such data-driven approaches are expensive and may never definitively solve the problem, as squashing one set of long-tail examples can often lead to the formation of others.

We focus within this work not on collecting more data, but rather on dynamically upweighting long-tailed examples during training. At the core of this line of research is the assumption that the long-tailedness of a particular example is not only a function of the data distribution, but also of the state of the model itself. Examples that are at one point long-tailed may be learned properly through dynamic upweighting and become more in-distribution later in training.

Such a setting is relatively unexplored within the long-tail context; to wit, the majority of long-tail classification techniques, such as entropy (Louizos & Welling, 2017) or ensembling (Vyas et al., 2018), prove unsuitable for dynamic upweighting, as they rely on model convergence to derive meaningful uncertainty signals. In contrast, our work rests on the claim that rich information is already available well before a model converges, and that tapping into that information allows us to mitigate long-tail effects on the fly.

Using model dynamics as our long-tail probe allows us to tackle another fundamental issue in long-tail learning: how do we differentiate examples that are properly in the long tail from examples that are merely difficult? More formally, long-tail examples have high reducible, epistemic uncertainty, as a model can in principle learn them but struggles to (in this case, because they are rare).
In contrast, examples that exhibit high irreducible, aleatoric uncertainty are hard but not long-tail, as their difficulty derives from more fundamental sources of noise within the data rather than pure rarity (Kendall & Gal, 2017). Visual occlusions are potentially good examples of the latter category, as fully or mostly occluded objects cannot be detected by many vision systems regardless of how many examples of them exist in the training dataset. For the rest of this work, we will refer to examples of high aleatoric uncertainty as “hard,” and examples of high epistemic uncertainty as “rare.” We use quote marks here to emphasize that although these labels are theoretically grounded, they deviate somewhat from what “rare” generally means in the literature. A discussion of this discrepancy is given in Section 4.1, with further discussion in Appendix A.

Figure 1. Schematic description of the proposed GradTail technique. Gradients (with respect to some trainable weights) are taken from an unweighted loss function and compared to an average gradient vector. The resultant metric is normalized, and a distance to the hyperparameter pivot p is computed. This distance is then converted to a sample weight via a monotonic function, which produces the final weighted loss. A choice of p near zero ensures that examples upweighted by GradTail are long-tail (i.e., contain high epistemic uncertainty).

As we are focusing on system dynamics to mitigate long-tailed data, it is natural for us to home in on gradients as the core entities with which to perform our calculations. Given two loss gradients $\nabla_w L(x_0; w)$ and $\nabla_w L(x_1; w)$ for loss $L$, training examples $x_0, x_1$, and trainable weights $w$, the simple dot product $\lambda \nabla_w L(x_0) \cdot \nabla_w L(x_1)$ tells us the change in $L(x_0)$ should we follow the gradient $\nabla_w L(x_1)$ for a step size of $\lambda$ (or, symmetrically, the change in $L(x_1)$ upon an update in direction $\lambda \nabla_w L(x_0)$). Crucially, we can calculate the average gradient vector $\mathbb{E}_x[\nabla_w L(x)]$, and the dot product $\lambda \nabla_w L(x_0) \cdot \mathbb{E}_x[\nabla_w L(x)]$ tells us how any particular example is affected by an update for the mean example within the distribution (a minimal code sketch of this quantity follows the list below). We note that this gradient quantity has a number of desirable properties that make it especially suited for our purposes:

1. Dynamic: Gradients can be calculated on the fly, and in fact are already calculated as part of the standard backpropagation loop.

2. Efficient: Because gradients are already calculated during normal training, our method has low compute overhead.

3. Explicit comparison: The gradient dot product is an explicit comparison between each example and the mean of a distribution, as opposed to the implicit comparison that many uncertainty methods use.

4. Separability of uncertainty: Gradients can separate whether a problematic example is truly rare (and thus still learnable) or just hard. Examples that backpropagate nearly orthogonal gradients, for instance, are in principle learnable by the model, because application of that gradient will not obstruct learning of the average example. Examples that backpropagate opposing gradients may indicate overly noisy data or incompatibility with the model, and thus may just be hard.

5. Nondependence on convergence: Gradients are discriminative early in the training process and do not require a converged model to serve as a useful signal.

6. Dependence on labels: Gradients are explicitly dependent on labels. Thus, not only are they sensitive to out-of-distribution inputs, they are also sensitive to corrupted labels and other unwanted label behaviors. In contrast, many standard methods for uncertainty estimation are only explicitly dependent on the input.
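To make this core quantity concrete, the following is a minimal NumPy sketch (our own illustration, not the paper's implementation; the linear model and all names are assumptions) that computes the normalized dot product between each example's loss gradient and the batch-mean gradient:

```python
import numpy as np

def per_example_gradients(w, X, y):
    """Per-example gradients of the squared error L_i = 0.5 * (w.x_i - y_i)^2
    for a linear model; returns an array of shape (batch, dim)."""
    residuals = X @ w - y                 # (batch,)
    return residuals[:, None] * X         # dL_i/dw = (w.x_i - y_i) * x_i

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 4))               # toy batch of 8 examples
y = rng.normal(size=8)
w = rng.normal(size=4)                    # toy trainable weights

grads = per_example_gradients(w, X, y)    # (8, 4)
mean_grad = grads.mean(axis=0)            # average gradient vector E_x[grad L(x)]

# Normalized dot product in [-1, 1]: near 1 for in-distribution examples,
# near 0 for orthogonal ("rare") examples, negative for conflicting ("hard") ones.
eps = 1e-8
theta = (grads @ mean_grad) / (
    np.linalg.norm(grads, axis=1) * np.linalg.norm(mean_grad) + eps)
print(theta)
```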
In general, the separability property posited in (4) allows us to isolate examples that are difficult but still learnable by the model. We hypothesize that by emphasizing these examples, we allow the model to explore parts of parameter space that are benign but which would otherwise be left unexplored.

Our main contributions are as follows:

• We introduce GradTail, an algorithm that can be applied on the fly during model training to improve overall performance, with emphasis on data in the long tail.

• We show through experiments with controlled synthetic data how gradient dot products produce a natural continuous scale that separates difficult (aleatoric) data from rare (epistemic) data.

• We demonstrate that GradTail works well in traditional sample weighting settings such as classification, as well as in dense regression settings, which are often overlooked by the sample weighting literature.

2. Related Work

Uncertainty estimation is of keen interest to deep learning practitioners, as it provides useful information about failure modes and potential pathways towards improvement for a given model. Much of the work within uncertainty estimation is related to work on Bayesian deep learning (Hernández-Lobato & Adams, 2015; Depeweg et al., 2018; Maddox et al., 2019), in which distributions placed over network weights or inputs provide explicit measures of predictive uncertainty. Extending model training to explicitly model uncertainties can also lead to more robust training outputs (Kendall & Gal, 2017). Modeling uncertainties is closely related to active learning (Yang et al., 2015; Yoo & Kweon, 2019), where uncertainty measurements can be used to collect new labeled data that will maximally benefit the model.

Example-level weighting methods have been well studied (He & Garcia, 2009) and can even be applied on the fly (Lin et al., 2017). Reweighting has also become popular in multitask learning (Chen et al., 2018; Kendall et al., 2018), where different tasks must be balanced against each other for optimal training. Multitask learning has also popularized gradient comparison techniques (Yu et al., 2020; Chen et al., 2020), which we leverage heavily within this current work.

Out-of-distribution and long-tail detection provide important tools to classify and mitigate the effect of data that lie far from the main data manifold. Entropy (Louizos & Welling, 2017; Shi et al., 2020) and ensemble (Vyas et al., 2018; Malinin et al., 2019) methods can select for the tails of a data distribution, and various learning methods exist to mitigate their effect on model performance (Liu et al., 2019; Tan et al., 2020). However, these methods usually focus on classification and are explicitly coupled to semantically defined long tails within curated datasets (e.g., Van Horn et al., 2018). We instead seek a method that can operate without any long-tail information provided in the training data.

3. Methodology

We present the main algorithm loop for GradTail dynamic upweighting in Algorithm 1.
We note that the actual algorithm is very simple and involves only standard dot product and arithmetic operations. A schematic of the GradTail method is shown in Figure 1.

Algorithm 1 Gradient Long-Tail Dynamic Upweighting (GradTail)

  Note: all dot products denoted by $\cdot$ are normalized and therefore lie in the range $[-1, 1]$. All operations keep the batch dimension intact unless explicitly noted.
  Choose a monotonically decreasing, positive activation function $f : [0, \infty) \to [1, \infty)$.
  Choose a subset of trainable weights $w \subset W$ from model $\Phi$.
  Choose a dataset $(X, Y)$ with minibatches $x \subset X$, $y \subset Y$.
  Choose a loss function $L$ with which to calculate gradients.
  Choose a pivot $p$ and decay $\lambda$.
  Initialize $\tilde{w}$, a tensor of the same shape as $w$ that will hold the average gradient, to zeros.
  Initialize the average variance variable $\sigma$ to zero.
  function GetLossForMinibatch($x$, $y$)
    Calculate $\nabla(x) := \nabla_w L(\Phi(x), y)$
    Calculate $\theta(x) = \nabla(x) \cdot \tilde{w}$
    Calculate $\sigma_x = \mathbb{E}_x[|\theta(x)|]$
    Set $\sigma = \lambda\sigma + (1 - \lambda)\sigma_x$
    Set $\tilde{w} = \lambda\tilde{w} + (1 - \lambda)\mathbb{E}_x[\nabla(x)]$
    Calculate loss weights $q = f(|\theta(x)/\sigma - p|)$
    return $\sum q\, L(\Phi(x), y)$
  end function

The main hyperparameter is the pivot p, which gives us a lever to tune the tradeoff between exploration and sensitivity to aleatoric uncertainty. A more negative pivot means that we upweight examples that are less in agreement with the mean sample gradient; mildly negative values allow the model's trainable parameters to explore in relatively benign directions. Being able to control this tradeoff is a key feature of GradTail, as it allows us to filter out examples that are outliers and incompatible with the bulk of the data distribution. In general, selecting a pivot of 0 provides a good baseline.

Another important hyperparameter is the activation function f, a monotonically decreasing function that produces outputs ≥ 1. f is monotonically decreasing because it takes as input the absolute distance of the gradient dot product from the pivot point: the greater this distance, the less we want to upweight the sample. In our experiments within this work, we pick a sigmoid activation function $f(x) = 1.0 + \frac{A}{1 + e^{Bx}}$, where A and B are additional hyperparameters. We pick this function because it presents relatively mild slopes but has its highest variance near x = 0, which allows it to be especially peaked around gradient dot products near the pivot point.
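To make Algorithm 1 concrete, the sketch below implements the weighting step in plain NumPy. It is a minimal illustration under stated assumptions, not the paper's implementation: per-example gradients are supplied by the caller, the class name and defaults are our own, and the activation uses the decreasing sigmoid form described above (with f(0) = 1 + A/2 acting as the maximum weight).

```python
import numpy as np

class GradTailWeighter:
    """Minimal sketch of Algorithm 1 (illustrative, not the authors' code).
    The caller supplies per-example loss gradients w.r.t. a chosen weight subset."""

    def __init__(self, pivot=0.0, decay=0.99, A=4.0, B=10.0):
        self.p, self.lam, self.A, self.B = pivot, decay, A, B
        self.avg_grad = None   # running average gradient w~ (EMA)
        self.sigma = 0.0       # running average of |theta| (zero-centered variance)

    def _f(self, x):
        # Monotonically decreasing sigmoid-like activation with outputs >= 1.
        return 1.0 + self.A / (1.0 + np.exp(self.B * x))

    def weights(self, per_example_grads):
        g = per_example_grads                           # shape (batch, dim)
        if self.avg_grad is None:
            self.avg_grad = np.zeros(g.shape[1])
        eps = 1e-8
        # Normalized dot product with the average gradient, in [-1, 1].
        denom = (np.linalg.norm(g, axis=1) + eps) * (np.linalg.norm(self.avg_grad) + eps)
        theta = (g @ self.avg_grad) / denom
        # Update the running statistics sigma and w~ with decay lambda.
        self.sigma = self.lam * self.sigma + (1 - self.lam) * np.abs(theta).mean()
        self.avg_grad = self.lam * self.avg_grad + (1 - self.lam) * g.mean(axis=0)
        # Upweight examples whose normalized dot product lies near the pivot.
        return self._f(np.abs(theta / (self.sigma + eps) - self.p))

# Usage: loss = (weighter.weights(per_example_grads) * per_example_losses).sum()
```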
4. Experiments

We present results within this section on a number of experimental settings. We begin with an in-depth discussion of how gradients are discriminative towards the long tails of the distribution on a simple 2-dimensional synthetic example. We then move to real datasets and show strong results on both ImageNet2012 (Deng et al., 2009) and the monocular depth prediction task on the Waymo Open Dataset (Sun et al., 2020). Exact details for the models used within these experiments are provided in the Appendix.

4.1. A Simple Toy Example

Before we test our methodology on large-scale settings, in which the long tail is extremely high-dimensional and often semantically ambiguous, we find it instructive to see how gradients can help identify the long tail of a data distribution within a well-understood low-dimensional setting. For our toy example, we use a 2-class classifier for data in 2 dimensions. The data we use for much of our toy example analysis is displayed in Figure 2. As seen, the data consists of 10,000 common-class points (the green cross data) and 400 uncommon-class points (the purple circle data). The common class is distributed i.i.d. according to the standard normal $N(\mathbf{0}, I)$, while the uncommon class is shifted up and to the right, to $N([2.2, 2.2], 0.5I)$, so that it is well-separated enough for the classification problem to be well-defined, while still close enough to the origin to create a difficult decision boundary. Our model is a simple 2-layer MLP with one hidden layer of five neurons.

Figure 2. Toy example results for GradTail dynamic weighting. (a) Ground truth: distribution of data within a 2-class toy example classifier. The common (green cross) class consists of 10,000 data points and follows a $N(\mathbf{0}, I)$ distribution. The rare (purple circle) class consists of 400 data points and follows a $N([2.2, 2.2], 0.5I)$ distribution. The dotted line shows the analytically correct decision boundary, calculated for where the two data-generating distributions have equal probability. (b) Base model: result when a baseline MLP classifier is trained on the data. The result shown here represents a better-than-median outcome, as most baseline models converge to the majority classifier. (c) Result of an MLP classifier when GradTail is applied during training.

As expected, such a simple model often converges to the majority classifier of the common class. Even in the best-case scenario, the result of a naive MLP classifier looks something like the predictions in Figure 2(b), with the decision boundary pushed deep into the purple uncommon class's distribution density. Such a result is expected due to the lopsidedness of the class frequencies. However, as shown in Figure 2(c), training with GradTail results in a much more accurate decision boundary, which comes quite close to the ground truth decision boundary shown by the dotted line in Figure 2(a).

To dive deeper into why GradTail performs better in this scenario, we show the precise examples GradTail is upweighting in Figure 3. Here we separate the dataset into “common,” “rare,” and “hard” by associating each of these categories with a specific range of normalized gradient dot product values. Because we chose the pivot value for upweighting in this setting to be 0, we associate rare examples with a small range around 0, namely [-0.07, 0.07]. “Hard” examples are any examples that fall below this range, and “common” examples are those above it. As seen in Figure 3(a), most examples in the common class fall under “common,” while most examples in the uncommon class fall under “hard.” However, examples from both classes are caught in the “rare” range, and they form a fairly symmetric distribution around the true decision boundary. This result emphasizes that the definition of “rare” or “long-tail” that is most useful to model training can be at odds with the common semantic definition found in the literature; rather than just labeling infrequent classes as “rare,” we see “rare” examples as high-impact examples that most help the model make correct predictions. These “rare” examples can come from either the infrequent or the frequent class distribution, as long as they benefit predictive accuracy on more troublesome data.
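The toy distribution described above is easy to reproduce; the following is a minimal sketch of the data generation (the seed is our own choice and is not specified in the paper):

```python
import numpy as np

rng = np.random.default_rng(0)  # illustrative seed

# Common class: 10,000 points drawn i.i.d. from N(0, I) in 2D.
common = rng.normal(loc=0.0, scale=1.0, size=(10_000, 2))

# Uncommon class: 400 points drawn from N([2.2, 2.2], 0.5 I).
# With covariance 0.5 * I, the per-axis standard deviation is sqrt(0.5).
uncommon = rng.normal(loc=2.2, scale=np.sqrt(0.5), size=(400, 2))

X = np.concatenate([common, uncommon])
y = np.concatenate([np.zeros(len(common)), np.ones(len(uncommon))])
```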
Figure 3. Toy dataset splits based on GradTail and other methods. We show how the toy dataset of Figure 2 can be discretely clustered based on various uncertainty metrics. These classifications are based on aggregated statistics for each data point across the entire training run. (a) For GradTail, because our pivot value is around 0, we use a small closed interval around zero, [−0.07, 0.07], as the range that we associate with the rare examples we aim to upweight. Examples with dot products below that range are labeled as “hard” (red), while examples with dot products above that range are labeled as “common” (green). Many examples in the low-frequency class are observed to be “hard,” while both classes have examples that are labeled as “rare” (yellow). (b) Only the “rare” examples found by GradTail, along with the true decision boundary as reference. The “rare” examples are distributed evenly around the decision boundary, and thus upweighting these examples will lead to better delineation of the decision boundary. (c) High-entropy and low-entropy points, labeled in red and green, respectively. Because of the lopsided nature of the dataset, entropy proves to be a poor, incoherent predictor of data uncertainty, and the results look random.

As further illustration of these concepts, in Figure 4 we also investigate the interesting case where the uncommon-class data is pulled far into the common-class distribution, to the point where the common-class frequency dominates the uncommon-class frequency at all locations. In this setting, GradTail no longer outputs any data in the “rare” category; all of the uncommon-class data is now labeled as “hard.” Because adding more uncommon data will not appreciably change the quality of the classifier in this case (unless we add a huge amount to bring the class frequencies more into balance), the uncommon-class data is purely “hard” and will remain unaltered by GradTail.

We note that in both data distributions analyzed in Figures 3 and 4, plotting points of high entropy leads to an incoherent result. Entropy is a common metric for example uncertainty within classification settings, and is often used as a rarity classifier. Meanwhile, our GradTail methodology produces meaningful results even within the more difficult setting.

It is pertinent to note that for our toy examples, although GradTail produces better class-normalized mean accuracy due to a significantly improved decision boundary, it lowers the total accuracy. GradTail moves the decision boundary but ends up misclassifying substantially more members of the common class as a result. This regression stems from the fact that our toy example is very low-dimensional (deliberately so, for ease of visualization), to the point where movement of the decision boundary is an explicit tradeoff between the accuracy of one class and the other. We will show in the following sections that in very high-dimensional settings GradTail can improve overall accuracy as well. We hypothesize that the higher dimensionality reduces the possibility of an explicit accuracy tradeoff between classes, as there will always exist a perfectly separating hyperplane between different classes.
Figure 4. The same “hard” vs. “rare” vs. “common” analysis as described in Figure 3, but on a different distribution, shown in (a). This harder distribution corresponds to the case where the common class has higher frequency than the uncommon class at every point in the input space. As shown in (c), entropy still produces incoherent results, and in (b) we observe that GradTail no longer outputs any examples within the “rare” category. This result is sensible, as there are no high-leverage examples for which collecting similar examples will help us continuously improve the inferred decision boundary.

4.2. Classification

We describe within this section experiments on the ImageNet Large Scale Visual Recognition Challenge 2012 dataset, consisting of 1000 classes of various objects and other entities. For our experiments we use a ResNet-50 (He et al., 2016) as our base network. Our GradTail model has a maximum loss weight of 3 and a pivot value of 0.0. The GradTail gradient dot products are calculated on the last two layers, on both weights and biases.

The main results are shown in Table 1 for a baseline model, the GradTail model, and a model trained with focal loss (Lin et al., 2017).

Table 1. ImageNet classification with GradTail. Higher numbers are better. All standard errors are within 0.05%.

Method       Accuracy (Top 1)   Accuracy (Top 5)
Baseline     76.8               92.9
Focal Loss   76.2               93.0
GradTail     77.2               93.1

We see that the top-5 accuracies are fairly clustered, but there is significant improvement from the GradTail model on top-1 accuracy. A closer look at the driving forces behind this improvement is encapsulated in Figure 5. Here we tabulate the top-1 accuracies for each quartile of the normalized gradient dot product. The first-quartile accuracies represent the accuracies for the subpopulation of data that lies in the lowest 25% of normalized gradient dot product values.

Figure 5. ImageNet accuracies broken down by gradient dot product quartile. The lowest quartile corresponds to the lowest gradient dot products. Correlation coefficients between accuracy and gradient dot product value are also provided for each model.

As expected, for the baseline we see that accuracy tends to improve as the gradient dot product increases. However, this trend is reversed to a shocking degree when focal loss is used. Focal loss greedily upweights examples that produce higher loss, and the result is a complete decoupling between example accuracy and gradient dot product. Such a phenomenon is mathematically counterintuitive, as examples that disagree with the main gradient direction now completely drive the training.

In contrast, GradTail tracks closely with the performance of the baseline on each of the quartiles, but beats the baseline handily in the middle two quartiles. Interestingly, the baseline performs better than GradTail in the first quartile, where hard examples of the highest aleatoric uncertainty lie. This effect may be driven by the baseline seeing higher loss, and thus higher gradients, for members of the first quartile during training. GradTail sacrifices some performance within the hardest quartile in favor of focusing on the more learnable middle quartiles; the overall effect is beneficial to the model. We also present in Figure 5 the correlation between gradient dot product band and performance, showing that GradTail produces the largest correlation. This increased correlation shows that GradTail leads to better calibration between gradient direction and performance, which we hypothesize is a desirable quality for a trained model.
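The quartile breakdown of Figure 5 is straightforward to compute. The sketch below is our own illustration, assuming arrays of per-example normalized dot products and top-1 correctness flags are available:

```python
import numpy as np

def accuracy_by_quartile(dot_products, correct):
    """Top-1 accuracy within each quartile of the normalized gradient
    dot product (quartile 0 holds the lowest dot products)."""
    edges = np.quantile(dot_products, [0.25, 0.5, 0.75])
    bucket = np.digitize(dot_products, edges)   # quartile index 0..3
    return [correct[bucket == q].mean() for q in range(4)]

# Synthetic demonstration (illustrative only): accuracy rises with dot product.
rng = np.random.default_rng(0)
dots = rng.uniform(-1, 1, size=10_000)
correct = rng.random(10_000) < (0.6 + 0.2 * dots)
print(accuracy_by_quartile(dots, correct))
```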
The astute reader might observe that the GradTail dynamic weight profiles will differ across model training runs, and therefore the quartiles in Figure 5 correspond to different examples for different baselines. While this observation is correct, we deliberately used the quartile metric to emphasize our firm belief that being in the long tail is not just a property of the data but also of the model. Given that many rare examples are borderline learnable by baseline models, and therefore might happen to be learned well on certain runs, having model dependence as part of our long-tail definition affords us additional flexibility in identifying the true problem points that we can try to tune our models to learn.

Despite our confidence in the metrics, it would still be reassuring to know that the gradient dot product we calculate as part of GradTail produces semantically meaningful results (as we saw for our toy example in Section 4.1). In Figure 6 we visualize examples in three randomly chosen ImageNet classes that produce the highest and lowest gradient dot product values. We observe that for all three classes, high dot product examples are semantically similar, while low dot product examples exhibit odd features. For volcanos, mid-eruption volcanos usually produce low gradient dot products, which is sensible as most volcanos within the dataset are dormant. Swans that exhibit low gradient dot products are often on land or in murky water, which hints that the environment is salient to the model for swan classification. Refrigerators with low gradient dot products often have additional agents in the scene (e.g., pets or humans), or are shot at odd angles or distances. In general, all of these visualizations improve our confidence that GradTail homes in on the appropriate examples.

It is crucial to note that the visualizations in Figure 6 were all generated by models only around 10% of the way through the training regimen. Unlike conventional long-tail work in the literature, our GradTail algorithm is fairly discriminative early in training, which demonstrates that model convergence does not have to be a requirement for identifying and mitigating rare examples within the dataset.

4.3. Dense Regression (Depth Estimation)

For our regression problem, we choose camera-based monocular depth estimation on the Waymo Open Dataset. The dataset consists of camera images downsampled to 192 × 480, paired with per-pixel ground-truth depth measurements from LiDAR. Our model uses a 5-block MobileNet (Howard et al., 2017) encoder with a 5-layer decoder that also concatenates encoder block outputs, in a manner similar to U-Net (Ronneberger et al., 2015). The filter layouts in the encoder network were searched via NAS (Bender et al., 2020) in the baseline setting to maximize the potential performance of the model, and were kept fixed for all experiments.

Long-tail mitigation on dense regression data is a particularly exciting topic, as long-tail work has largely been restricted to single-output classification problems. However, there is every reason to believe that regression and dense prediction problems are likewise hurt by issues in the long tail.

Figure 6. Gradient dot product visualization on the ImageNet dataset. Gradient dot products are averaged over 12 saved models from around 10% of the way through the total training regimen. The examples within the test set that exhibit the highest (a, easy examples) and lowest (b, hard examples) gradient dot products for the classes black swan (top row), volcano (middle row), and refrigerator (bottom row) are visualized. A high gradient dot product implies that an example is generally aligned with the average training direction and thus in-distribution, while a low gradient dot product points to a difficult example that is out-of-distribution. In all three classes, the hard examples are semantically reasonable and exhibit image features that make them more difficult (e.g., murky water for swans, eruptions for volcanos, and odd lighting or other animals/humans present for refrigerators).
As gradients are universal quantities within deep learning training dynamics, our proposed methodology is well-suited to tackle dense regression with minimal modifications from the classification setting.

The density of the output for our chosen problem is one potential source of inefficiency. Since we would have to calculate a gradient dot product for each output, and our outputs now form a grid of over 92,000 pixels, performing this calculation for every pixel is impractical. Even if it were compute-efficient to do so, there is significant spatial correlation between adjacent pixels, so it would be a waste of compute to consider long-tail upweighting at every pixel location. Instead, we take six random patches of randomly sampled sizes between 20 × 20 and 100 × 100, at randomly sampled locations within the full image space. We also take a seventh patch consisting of any pixel that was not included in any proposal, to ensure that we backpropagate some error signal at every location. We then average the loss within each patch and concatenate the patches for an effective batch size of 7× the original batch size.
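The patch-sampling scheme just described can be sketched as follows (our own illustration; the patch count and size range follow the text, while everything else, including names, is assumed):

```python
import numpy as np

def sample_patch_losses(loss_map, n_patches=6, min_size=20, max_size=100, rng=None):
    """Average a dense per-pixel loss map over randomly sized and located
    patches, plus a seventh 'patch' of all uncovered pixels, so that some
    error signal is backpropagated at every location."""
    rng = rng or np.random.default_rng()
    H, W = loss_map.shape
    covered = np.zeros((H, W), dtype=bool)
    losses = []
    for _ in range(n_patches):
        h = int(rng.integers(min_size, max_size + 1))
        w = int(rng.integers(min_size, max_size + 1))
        top = int(rng.integers(0, H - h + 1))
        left = int(rng.integers(0, W - w + 1))
        covered[top:top + h, left:left + w] = True
        losses.append(loss_map[top:top + h, left:left + w].mean())
    remainder = loss_map[~covered]
    losses.append(remainder.mean() if remainder.size else loss_map.mean())
    return np.stack(losses)   # treated as an effective batch of 7 losses

loss_map = np.abs(np.random.randn(192, 480))   # stand-in per-pixel loss
print(sample_patch_losses(loss_map))
```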
The results of our experiments on depth estimation are shown in Table 2. We see that the overall error rate improves by a marginal (though statistically significant) amount, but the picture becomes clearer when we tabulate the error rate for different depth ranges. The error rate improves significantly for points within the 40-60m range, while the improvement is minimal for close points in the 0-20m range. Physically, we know that points farther in the distance are necessarily less common within the dataset (due to parallax). Unlike the ImageNet case (Section 4.2), where example rarity is semantically informed and thus complicated, the dependence of rarity on distance within depth estimation datasets is a physical property of the sensors and thus a reasonable proxy for long-tailedness. These results are therefore a reassuring signal that although the overall improvement is small, we substantially improve performance in areas where we lack data and which are generally problematic for standard models.

Table 2. Waymo Open Dataset monocular depth estimation with GradTail. Lower is better for all metrics within this table. Total MRE has standard error within 0.03%, while all other results have standard error within 0.05%.

Method                 MRE <20m (%)   MRE 20-40m (%)   MRE 40-60m (%)   MRE >60m (%)   Total MRE (%)
Baseline               10.1           11.9             14.2             16.8           11.0
GradTail, pivot −2.5   10.5           11.9             14.2             16.8           11.2
GradTail, pivot +0.5   10.5           11.9             13.9             17.2           11.3
GradTail, pivot −0.5   10.1           11.7             13.7             16.5           10.8

The pivot parameter is also clearly important in this scenario; a pivot close to zero but slightly negative provided the best results. Positive pivots, which upweight in-distribution examples, and large negative pivots, which upweight only hard examples, both degraded model quality. The optimal pivot being close to zero further supports our emphasis on orthogonal gradients as meaningful.

5. Discussion

Our goal in introducing GradTail is twofold: first, we want to demonstrate a general method that works to mitigate performance issues on long-tail parts of the data distribution. We showed through substantial analysis on a low-dimensional toy example that GradTail works well within the context of highly lopsided data, producing a sensible decision boundary even when one data distribution has 25× the frequency of the other. We also showed improvements in performance on rare segments of the data in both a classification (ImageNet) and a dense regression (depth estimation on the Waymo Open Dataset) context. The latter is especially exciting, as long-tail methods have largely been reserved for classification settings due to the complexity of moving to a continuous output space. However, because gradients are well-suited to continuous problems and are universal to neural network training, they are powerfully general entities to use for regression long-tail mitigation.

Our second goal is to challenge pre-existing notions of what long-tailedness and rarity within a dataset mean. Identifying long-tailedness has largely been a matter of manual labor in the literature, with humans semantically labeling objects as rare and triaging ad hoc when new rare classes crop up. Datasets like iNaturalist (Van Horn et al., 2018) boast long lists of long-tailed classes, where long-tailedness is a property of a class rather than of an example. While such analysis has been crucial in robustifying neural networks thus far, we hope that our work provides a fresh look at how we can improve model training by reasoning about long-tailedness at a learned, granular level, using model dynamics as our fundamental signal.

5.1. A Note on Compute/Memory Costs

Throughout this work, we noted that GradTail requires a gradient dot product to be calculated for each example. Although this may seem expensive, example-level gradient computation is already performed in the vast majority of models, which allows us to reuse calculations to implement GradTail. The overhead then becomes minimal, with the most expensive added step being a single dot product.

However, we have found that setting up this reused computation is difficult in most modern deep learning frameworks, as such frameworks commonly hide the backward pass deep within the system backend. Intercepting the computation right before the per-example gradient signal is summed across the batch dimension can then become challenging. Up until now, such design decisions may have reflected the limited demand for intercepting this computation. However, we hope that our work on GradTail and follow-up research will clarify the potential benefit that access to per-example gradients can offer, and that we can encourage deep learning framework designs that allow better interfacing with the gradient computation backend.
6. Conclusions

We presented GradTail, a gradient-based dynamic weighting algorithm that upweights examples during training based on their rarity. We showed how GradTail produces a reasonable decision boundary on an extremely lopsided low-dimensional classification problem, and how it mitigates poor performance on rare examples within a higher-dimensional classification problem (ImageNet). Crucially, we showed that GradTail generalizes to dense regression settings as well, which have hitherto been relatively inaccessible to long-tail methods. Ultimately, we see GradTail as an important tool within the toolkit of any deep learning practitioner, but also as significant evidence that there is much to be learned about any complex data distribution from the information-rich but oft-ignored dynamics of training a model. It may be that to truly robustify our models, we need to focus not on where our models converge, but rather on the myriad twisting paths by which they get there.

References

Bender, G., Liu, H., Chen, B., Chu, G., Cheng, S., Kindermans, P.-J., and Le, Q. V. Can weight sharing outperform random architecture search? An investigation with TuNAS. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14323–14332, 2020.

Bengio, S. The battle against the long tail. In Talk on Workshop on Big Data and Statistical Machine Learning, volume 1, 2015.

Chen, Z., Badrinarayanan, V., Lee, C.-Y., and Rabinovich, A. GradNorm: Gradient normalization for adaptive loss balancing in deep multitask networks. In International Conference on Machine Learning, pp. 794–803. PMLR, 2018.

Chen, Z., Ngiam, J., Huang, Y., Luong, T., Kretzschmar, H., Chai, Y., and Anguelov, D. Just pick a sign: Optimizing deep multitask models with gradient sign dropout. arXiv preprint arXiv:2010.06808, 2020.

Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255. IEEE, 2009.

Depeweg, S., Hernández-Lobato, J.-M., Doshi-Velez, F., and Udluft, S. Decomposition of uncertainty in Bayesian deep learning for efficient and risk-sensitive learning. In International Conference on Machine Learning, pp. 1184–1193. PMLR, 2018.

He, H. and Garcia, E. A. Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering, 21(9):1263–1284, 2009.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.

Hernández-Lobato, J. M. and Adams, R. Probabilistic backpropagation for scalable learning of Bayesian neural networks. In International Conference on Machine Learning, pp. 1861–1869. PMLR, 2015.

Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., and Adam, H. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.

Kendall, A. and Gal, Y. What uncertainties do we need in Bayesian deep learning for computer vision? arXiv preprint arXiv:1703.04977, 2017.

Kendall, A., Gal, Y., and Cipolla, R. Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7482–7491, 2018.
Lin, T.-Y., Goyal, P., Girshick, R., He, K., and Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2980–2988, 2017.

Liu, Z., Miao, Z., Zhan, X., Wang, J., Gong, B., and Yu, S. X. Large-scale long-tailed recognition in an open world. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2537–2546, 2019.

Loshchilov, I. and Hutter, F. SGDR: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983, 2016.

Louizos, C. and Welling, M. Multiplicative normalizing flows for variational Bayesian neural networks. In International Conference on Machine Learning, pp. 2218–2227. PMLR, 2017.

Maddox, W. J., Izmailov, P., Garipov, T., Vetrov, D. P., and Wilson, A. G. A simple baseline for Bayesian uncertainty in deep learning. Advances in Neural Information Processing Systems, 32:13153–13164, 2019.

Malinin, A., Mlodozeniec, B., and Gales, M. Ensemble distribution distillation. arXiv preprint arXiv:1905.00076, 2019.

Philion, J. FastDraw: Addressing the long tail of lane detection by adapting a sequential prediction network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11582–11591, 2019.

Ronneberger, O., Fischer, P., and Brox, T. U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 234–241. Springer, 2015.

Shi, W., Zhao, X., Chen, F., and Yu, Q. Multifaceted uncertainty estimation for label-efficient deep learning. Advances in Neural Information Processing Systems, 33:17247–17257, 2020.

Sun, P., Kretzschmar, H., Dotiwalla, X., Chouard, A., Patnaik, V., Tsui, P., Guo, J., Zhou, Y., Chai, Y., Caine, B., et al. Scalability in perception for autonomous driving: Waymo Open Dataset. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2446–2454, 2020.

Sutskever, I., Martens, J., Dahl, G., and Hinton, G. On the importance of initialization and momentum in deep learning. In International Conference on Machine Learning, pp. 1139–1147. PMLR, 2013.

Tan, J., Wang, C., Li, B., Li, Q., Ouyang, W., Yin, C., and Yan, J. Equalization loss for long-tailed object recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.

Van Horn, G., Mac Aodha, O., Song, Y., Cui, Y., Sun, C., Shepard, A., Adam, H., Perona, P., and Belongie, S. The iNaturalist species classification and detection dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8769–8778, 2018.

Vyas, A., Jammalamadaka, N., Zhu, X., Das, D., Kaul, B., and Willke, T. L. Out-of-distribution detection using an ensemble of self supervised leave-out classifiers. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 550–564, 2018.

Yang, Y., Ma, Z., Nie, F., Chang, X., and Hauptmann, A. G. Multi-class active learning by uncertainty sampling with diversity maximization. International Journal of Computer Vision, 113(2):113–127, 2015.

Yoo, D. and Kweon, I. S. Learning loss for active learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 93–102, 2019.

Yu, T., Kumar, S., Gupta, A., Levine, S., Hausman, K., and Finn, C. Gradient surgery for multi-task learning. arXiv preprint arXiv:2001.06782, 2020.
A. Discussion on Aleatoric vs. Epistemic Uncertainty and Gradient Dot Products

Within the main body of this work, we repeatedly make the claim that orthogonal (or close-to-orthogonal) gradients with respect to the average gradient indicate examples with high epistemic uncertainty, while gradients with highly negative gradient dot products indicate examples with high aleatoric uncertainty. We would like to take the opportunity to further discuss the connection between our work and these classical concepts in uncertainty estimation.

Traditionally, epistemic uncertainty is also known as model uncertainty, and reflects our model's lack of knowledge about a certain piece of data. Although there are various ways of modeling such uncertainty (e.g., placing a prior on the model weights, as done in Bayesian deep learning (Kendall & Gal, 2017)), a key property of epistemic uncertainty is that it can always be alleviated by collecting more data. That key property immediately creates a fundamental link between epistemic uncertainty and long-tailedness. In contrast, aleatoric uncertainty describes irreducible noise within the data that will be present regardless of how much data we collect.

We thus ask ourselves what kinds of data a sufficient model will learn better given more examples of that data type. We argue that this is the point where re-evaluating the problem in the context of model gradients becomes especially helpful. Namely, from basic calculus there is at least a local guarantee that for a forward-pass model $M$, should an example $x$ with label $y$ produce a gradient $\nabla_w L(M(x); y)$ whose dot product with the average gradient is zero or greater, then applying this gradient update will:

1. Reduce the loss $L(M(x); y)$.

2. Not degrade the performance of the average example within the dataset.

Thus, we note that collecting more data with a gradient similar to that of $x$ (and therefore a similar dot product) will necessarily reduce the uncertainty of $x$ within the model $M$ trained on the full dataset, while not reducing the performance of the model otherwise. This argument tells us that the uncertainty of data point $x$ is largely epistemic. In contrast, if $\nabla_w L(M(x); y)$ is antiparallel to the average gradient or produces a negative dot product with it, then collecting more examples with a similar gradient profile will hurt the performance of the average example for that model. Thus, as we train on the full dataset, although we may overfit to the example $x$ and reduce its uncertainty, the overall model performance will degrade and become even more susceptible to the noise of misbehaving examples such as $x$. Such behavior does not fulfill the key property of epistemic uncertainty, namely that it is always reducible with more data collection, and so we associate this kind of poor performance with high aleatoric uncertainty instead.

We note that much of this argument involves contextualizing sample uncertainty within a larger ecosystem of model training. In our view, the uncertainty of any given example is only meaningful in the context of a model that is making inference on all of the data, as we never observe training examples by themselves in a vacuum.
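To spell out the local guarantee invoked above, a first-order Taylor expansion (our own restatement of the calculus argument, not an equation from the paper) shows that an SGD step of size $\lambda$ along the gradient of another example $x'$ changes the loss on $x$ as

$$L\big(M(x);\, w - \lambda \nabla_w L(M(x'); y')\big) \approx L\big(M(x); w\big) - \lambda\, \nabla_w L(M(x); y) \cdot \nabla_w L(M(x'); y').$$

The sign of the dot product therefore determines, to first order, whether an update that helps $x'$ also helps, leaves unchanged, or hurts example $x$; taking $x'$ to be the average example recovers the two properties above.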
Such considerations are a key driving force behind our proposed GradTail algorithm, as the dynamic quality of the algorithm ensures that the computed uncertainty (via gradient dot product) of a given example will often change with time and depends on the state of the model. We find that reasoning about uncertainty tied to a model state is crucial for determining not only examples of differing uncertainty types, but also examples that will be of optimal practical use to model training. Empirically, this claim is well supported, especially by our visualizations in Figure 3, where we showed that long-tail examples cluster around the decision boundary, and in Table 2, where we showed that performance degrades for our algorithm at large negative pivots.

B. Training Details

Unless otherwise noted, all models were trained on TPU cores within the TensorFlow framework. The toy example and depth estimation experiments were performed on TensorFlow 1, while the ImageNet experiments were performed on TensorFlow 2.

B.1. Toy Example

The toy example network is a simple MLP model with one hidden layer of 5 neurons. The network is barebones and does not employ standard tools like batch normalization or dropout. Training for all models is performed for 10,000 steps with an initial learning rate of $1 \times 10^{-4}$. The learning rate is not decayed. We optimize using a look-ahead momentum optimizer (Sutskever et al., 2013) with momentum 0.9. Our GradTail layer uses all weights (both layer weights and biases) within the MLP for comparison, and applies a maximum upweighting factor of 15.

We also compare GradTail with inverse class-frequency weighting in Figure A.1. By inverse class-frequency weighting, we mean that during training, examples within the common class are assigned weight 1 and examples within the uncommon class are assigned weight w ≥ 1. This weighting is a common technique used in practice to deal with known class imbalance within the training dataset. We compare this to the GradTail model, for which w is the maximum weight assigned to a detected long-tail example. The baseline w = 1 case is the one shown in Figure 2(b).

Figure A.1. GradTail classification versus inverse frequency weighting classification on an imbalanced toy dataset. The top row corresponds to inverse frequency weighting results, while the bottom row shows GradTail results. Each column corresponds to a different weighting w. For inverse frequency weighting, this is the weight assigned to training examples of the uncommon class. For GradTail, this is the maximum upweight factor for an example that is within the long tail.

We see that as w increases, inverse frequency weighting forces the decision boundary further down and to the left, reflecting the increased leverage that the uncommon-class examples now have on the training dynamics. If we set the frequency weighting to the exact frequency ratio between the common and uncommon classes (in this case, that ratio would be 25), the resulting decision boundary is rather aggressive and encroaches very far into common-class territory. Finding the right balance becomes a hyperparameter search problem, which is exponentially exacerbated when there are more than two classes. In contrast, GradTail is consistent throughout, regardless of what we set as the maximum upweighting factor. We attribute this robustness to the dynamic nature of the method; GradTail cannot deviate from a good balance point between the classes, because once one class starts to dominate, the other class's examples are more likely to be detected as long-tail. GradTail thus provides a restoring force to the system, leading to a stable convergence that would be elusive for manual methods like inverse frequency weighting, where the same weight is applied throughout the training run regardless of how the model trains.
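For reference, the inverse class-frequency baseline used in this comparison can be sketched as below (a minimal illustration; the paper sweeps a single scalar weight w on the uncommon class, which this normalization reproduces for two classes):

```python
import numpy as np

def inverse_frequency_weights(labels):
    """Per-example weights proportional to inverse class frequency,
    normalized so the most common class has weight 1."""
    classes, counts = np.unique(labels, return_counts=True)
    per_class = counts.max() / counts            # e.g. common -> 1.0, uncommon -> 25.0
    lookup = dict(zip(classes, per_class))
    return np.array([lookup[c] for c in labels])

labels = np.array([0] * 10_000 + [1] * 400)      # toy dataset labels
weights = inverse_frequency_weights(labels)
print(weights.min(), weights.max())              # 1.0 25.0
```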
We further note that although inverse frequency weighting is of some utility in some scenarios, one of our core beliefs within this work is that we need to move away from a manual, semantic definition of the long tail. An example is not necessarily in the long tail just because it belongs to a less common class, as the class definitions themselves may come from semi-arbitrary semantic labels and highly overlapping generating distributions. Overfitting our methodologies to a semantic, class-based definition of the long tail may lead to subpar performance, as demonstrated here.

B.2. ImageNet

ImageNet inputs are downsampled to 160 × 160 before being put through a standard ResNet v1 architecture (He et al., 2016). Models are trained for 1.7 million steps with an initial learning rate of 0.025 and a cosine learning rate decay profile (Loshchilov & Hutter, 2016). We use the same optimizer as in the toy example (look-ahead momentum with momentum parameter 0.9), and also use label smoothing of 0.1. We set regularization at $6 \times 10^{-5}$ and set the batch size to 256.

Our GradTail layer uses gradients from the final two layers (both convolutional weights and biases), and the maximum GradTail weight is set to 3. We use a pivot value of 0 for all our ImageNet experiments, which is consistent with our interpretation that orthogonal gradients belong to long-tail examples.

B.3. Waymo Open Dataset

We perform monocular depth estimation on the camera images of the Waymo Open Dataset by training a regression model to produce metric per-pixel depth predictions from camera images only. The depth ground truth was obtained by projecting synchronized LiDAR points into the respective images. The model input and output resolution is set to 192 × 480, and images are fed through a 5-block MobileNet (Howard et al., 2017) encoder with a corresponding decoder and skip connections. The training loss is a simple $\ell_1$ loss applied only to pixels that have a ground-truth depth value assigned to them. The learning rate is kept constant at 0.0002 throughout training, and the batch size is set to 10. Our GradTail layer uses just the biases in the first two upsampling layers within the decoder, which allows our method to be especially efficient in this setting. The maximum GradTail weight is set to 15.
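The masked $\ell_1$ training loss described above is simple to express; a sketch (our own illustration, assuming depth maps as arrays and a validity mask marking pixels with projected LiDAR ground truth):

```python
import numpy as np

def masked_l1_loss(pred_depth, gt_depth, valid_mask):
    """L1 loss applied only to pixels that have a ground-truth depth
    value assigned (valid_mask is True where LiDAR projects)."""
    return np.abs(pred_depth - gt_depth)[valid_mask].mean()

pred = np.random.rand(192, 480) * 80.0    # illustrative predicted depths (meters)
gt = np.random.rand(192, 480) * 80.0
mask = np.random.rand(192, 480) < 0.05    # illustrative LiDAR coverage (~5% of pixels)
print(masked_l1_loss(pred, gt, mask))
```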
C. More Insight on Methodology Hyperparameters

Our proposed GradTail algorithm is relatively simple and straightforward, but it does come with a few hyperparameters. In our experience, GradTail is fairly robust (see, for example, the frequency weighting experiments in Section B.1), but we include many of these hyperparameters as guards against edge cases and specific scenarios where dynamic drift might be severe enough to throw off GradTail without further mitigation. In this section we go through the hyperparameters and offer a few additional insights on each of them.

Pivot parameter p. The pivot parameter p is the main hyperparameter and the one with the largest influence on training (see, for example, the dense depth estimation results in Section 4.3). The pivot parameter represents our belief about which examples count as “long-tail” and thus must be upweighted. Without any additional information, we recommend setting the pivot to zero for initial experiments, because a pivot of zero means we upweight examples that backpropagate orthogonal gradients, which represent examples that are learnable but still difficult due to the model's lack of exploration of the parameter space. However, we leave the pivot as a hyperparameter because small swings in the negative and positive directions can offer more (or less) regularization, with more negative pivots resulting in more regularization. Because we ultimately want to optimize our models on a test set, such regularization potential can be useful.

Activation function f. The activation function transforms a normalized gradient dot product into a sample weight, which should be at least 1.0. We chose a sigmoid-like activation function as it provides a sharp peak close to the desired pivot, which allows us to be more selective about which samples to upweight.

Decay rate λ. This is the decay rate by which we assign new average gradient values to the exponential moving average and the variance σ. We set this decay rate in our experiments to 0.99, which means that our pool of average gradients reflects, on average, the gradients of the last 100 batches.

Variance σ. Although not a hyperparameter, the zero-centered variance σ is worth discussing, as it is an additional normalization term that may seem mysterious at first. We add σ to the main algorithm because our algorithm is a dynamic one that deals with moving statistics within a network that is training. In many cases the drift of σ through training was only mild, so it is plausible to use GradTail without σ, but as a safety measure it is still recommended.

D. Full ImageNet2012 Visualization for 2 Classes

In the main paper, we showed the highest and lowest dot product images for some ImageNet classes in Figure 6. For completeness, we include all the images within the test set (50 each) for two ImageNet classes, volcano and refrigerator, in Figure A.2. For each class, the gradient dot products become higher from left to right in each row, and then from top to bottom. In other words, the top-left image in each class produced the lowest gradient dot product, while the bottom-right image produced the highest. In each case, the gradient dot product crosses zero close to the beginning of the third row.

We see that as gradient dot products become higher, there is a clear semantic trend for images to standardize into fairly similar images. For refrigerators we begin to see images of full refrigerators (with doors open or closed) without much debris or extraneous elements in the frame. For volcanos we see many images of clear skies and snow-capped peaks. At the lower end of the gradient dot product scale, we begin to see odd features like animals or humans in front of a refrigerator, or volcanos that are mid-eruption. We emphasize, however, that although we include these visualizations for completeness, we do not fully recommend the practice of trying to semantically reason about why any given data input may or may not be difficult for a given model.
A major intended contribution of our work is to demonstrate that it is reasonable to define long-tailedness purely as a function of the data and the model, without reference to human-defined classes or class hierarchies (which are always susceptible to at least a bit of arbitrariness). Although it is a useful sanity check to see that there is a clear semantic difference in ImageNet data as we progress along the dot product scale, we hope to emphasize that it is more useful to think about long-tailedness in terms of parameter-space exploration and learnability, both of which our gradient framework addresses explicitly.

Figure A.2. ImageNet visualization ordered by gradient dot product for two classes, (a) refrigerator and (b) volcano, within the ImageNet2012 test set. Images produce higher gradient dot products with the mean gradient as they go from left to right, and then from top to bottom. Images on the bottom right of each category have the highest gradient dot product, while images on the top left have the lowest.