GradTail: Learning Long-Tailed Data Using Gradient-based Sample Weighting

Zhao Chen¹  Vincent Casser¹  Henrik Kretzschmar¹  Dragomir Anguelov¹

¹Waymo LLC, Mountain View, California, USA. Correspondence to: Zhao Chen. Preprint version.

Abstract

We propose GradTail, an algorithm that uses gradients to improve model performance on the fly in the face of long-tailed training data distributions. Unlike conventional long-tail classifiers, which operate on converged, and possibly overfit, models, we demonstrate that an approach based on gradient dot product agreement can isolate long-tailed data early on during model training and improve performance by dynamically picking higher sample weights for that data. We show that such upweighting leads to model improvements for both classification and regression models, the latter of which are relatively unexplored in the long-tail literature, and that the long-tail examples found by gradient alignment are consistent with our semantic expectations.

1. Introduction

Although modern deep learning and machine learning techniques have achieved impressive performance across a diverse set of tasks, most models still struggle when faced with uneven data distributions containing very rare or long-tail examples. Such rare examples are frequently present in real world data (Bengio, 2015), and handling them well is crucial in safety-critical applications like autonomous driving (e.g. (Philion, 2019)).

Conventional methods to mitigate the effects of long-tailed training distributions often require identifying long-tailed data and then labeling additional data of the same type (i.e. active learning), like in (Yang et al., 2015), which in effect increases the probability density of such data and moves it out of the long-tail. However, such data-driven approaches are expensive and may not ever definitively solve the problem, as squashing one set of long-tail examples can often lead to the formation of others.

We focus within this work not on collecting more data, but rather on dynamically upweighting long-tailed examples during training. At the core of this line of research is the assumption that the long-tailedness of a particular example is not only a function of the data distribution, but also the state of the model itself. Examples that are at one point long-tailed may be learned properly through dynamic upweighting and become more in-distribution later in training.

Such a setting is relatively unexplored within the long-tail context; to wit, the majority of long-tail classification techniques, such as entropy (Louizos & Welling, 2017) or ensembling (Vyas et al., 2018), prove unsuitable for dynamic upweighting as they rely on model convergence to derive meaningful uncertainty signals. In contrast, our work rests on the claim that there is already rich information available well before a model converges, and that tapping into that information allows us to mitigate long-tail effects on the fly.

Using model dynamics as our long-tail probe allows us to tackle another fundamental issue in long-tail learning: how do we differentiate examples that are properly in the long-tail versus examples that are purely difficult? More formally, long-tail examples have high reducible, epistemic uncertainty, as a model can in principle learn them but struggles to (in this case, because they are rare). In contrast, examples that exhibit high irreducible, aleatoric uncertainty are hard but not long-tail, as their difficulty derives from more fundamental sources of noise within the data rather than pure rarity (Kendall & Gal, 2017). Visual occlusions are potentially good examples of the latter category, as fully or mostly occluded objects cannot be detected by many vision systems regardless of how many examples of them exist in the training dataset. For the rest of this work, we will refer to examples of high aleatoric uncertainty as "hard," and examples of high epistemic uncertainty as "rare." We use quote-marks here to emphasize that although these labels are theoretically grounded, they deviate somewhat from what "rare" generally means in the literature. A discussion of this discrepancy will be given in Section 4.1, and more discussion is added in Appendix A.

As we are focusing on system dynamics to mitigate long-tailed data, it is natural for us to hone in on gradients as the core entities with which to perform our calculations. Given two loss gradients ∇_w L(x_0; w), ∇_w L(x_1; w) for loss L, trainable examples x_0, x_1 and trainable weight w, the simple dot product λ ∇_w L(x_0) · ∇_w L(x_1) tells us the change in L(x_0) should we follow the gradient ∇_w L(x_1) for a step size of λ (or, symmetrically, the change in L(x_1) upon an update in direction λ ∇_w L(x_0)). Crucially, we can calculate the average gradient vector E_x[∇_w L(x)], and the dot product λ ∇_w L(x_0) · E_x[∇_w L(x)] tells us how any particular example is affected by an update for the mean example within a distribution.

Figure 1. Schematic description of the proposed GradTail technique. Gradients (with respect to some trainable weights) are taken from an unweighted loss function and compared to an average gradient vector. The resultant metric is normalized and a distance to hyperparameter pivot p is computed. This distance is then converted to a sample weight via a monotonic function which produces the final weighted loss. A choice of p near zero ensures that examples upweighted by GradTail are long-tail (i.e. contain high epistemic uncertainty).

We note that this gradient quantity has a number of desirable properties that make it especially suited for our purposes:

1. Dynamic: Gradients can be calculated on the fly, and in fact are already calculated as part of the standard backpropagation loop.

2. Efficient: Because gradients are already calculated during normal training, our method has low compute overhead.

3. Explicit comparison: The gradient dot product is an explicit comparison between each example and the mean of a distribution, as opposed to the implicit comparison that many uncertainty methods use.

4. Separability of uncertainty: Gradients can separate whether a problematic example is truly rare (and thus still learnable) or just hard. Examples that backpropagate nearly orthogonal gradients, for instance, are in principle learnable by the model because application of that gradient will not obstruct learning of the average example. Examples that backpropagate opposing gradients may indicate overly noisy data or incompatibility with the model, and thus may just be hard.

5. Non-dependence on convergence: Gradients are discriminative early on in the training process, and do not require a converged model to be taken as a useful signal.

6. Dependence on labels: Gradients are explicitly dependent on labels. Thus, not only are they sensitive to out-of-distribution inputs, they are also sensitive to corrupted labels and other unwanted label behaviors. In contrast, many standard methods for uncertainty estimation are only explicitly dependent on the input.

In general, the separability property posited in (4) allows us to isolate examples that are difficult but still learnable by the model. We hypothesize that by emphasizing these examples, we allow the model to explore parts of parameter space that are benign but which would be left untouched otherwise by the model.

Our main contributions are as follows:

• We introduce GradTail, an algorithm that can be implemented on the fly during model training to improve overall performance with emphasis on data in the long-tail.

• We show through experiments with controlled synthetic data how gradient dot products produce a natural continuous scale to separate difficult (aleatoric) data from rare (epistemic) data.

• We demonstrate that GradTail works well in traditional sample weighting settings such as classification as well as in dense regression settings, which are often overlooked by the sample weighting literature.

2. Related Work

Uncertainty estimation is of keen interest to deep learning practitioners, as it provides useful information about failure modes and potential pathways towards improvement for a given model. Much of the work within uncertainty estimation is related to work on Bayesian deep learning (Hernández-Lobato & Adams, 2015; Depeweg et al., 2018; Maddox et al., 2019), in which distributions placed over network weights or inputs provide explicit measures of predictive uncertainty. Extending model training to explicitly model uncertainties can also lead to more robust training outputs (Kendall & Gal, 2017). Modeling uncertainties is closely related to active learning (Yang et al., 2015; Yoo & Kweon, 2019), where uncertainty measurements can be used to collect new labeled data that will maximally benefit the model.

Example-level weighting methods have been well-studied (He & Garcia, 2009), and can even be done on the fly (Lin et al., 2017). Reweighting has also become popular in multitask learning (Chen et al., 2018; Kendall et al., 2018), where different tasks must be balanced with each other for optimal training. Multitask learning has also popularized gradient comparison techniques (Yu et al., 2020; Chen et al., 2020), which we leverage heavily within this current work.

Out-of-distribution and long-tail detection provide important tools to classify and mitigate the effect of data that lie far from the main data manifold. Entropy (Louizos & Welling, 2017; Shi et al., 2020) and ensemble (Vyas et al., 2018; Malinin et al., 2019) methods can select for the tails of our data distribution, and various learning methods exist to mitigate their effect on model performance (Liu et al., 2019; Tan et al., 2020).

Algorithm 1 Gradient Long-Tail Dynamic Upweighting (GradTail)
  note that all dot products denoted by · are normalized and therefore lie in the range [−1, 1].
  note that all operations keep the batch dimension intact unless explicitly noted.
  choose monotonically decreasing and positive activation function f : [0, ∞] → [1, ∞]
  choose subset of trainable weights w ⊂ W from model Φ.
  choose dataset (X, Y) with minibatches x ⊂ X, y ⊂ Y.
  choose loss function L to calculate gradients.
  choose pivot p and decay λ.
  initialize to zeroes w̃, a tensor the same shape as w that will hold the average gradient.
  initialize to zero average variance variable σ.

  function GetLossForMinibatch(x, y)
    calculate ∇(x) := ∇_w L(Φ(x), y)
    calculate θ(x) = ∇(x) · w̃
    calculate σ_x = E_x[|θ(x)|]
    set σ = λσ + (1 − λ)σ_x
    set w̃ = λw̃ + (1 − λ)E_x[∇(x)]
    calculate loss weights q = f(|θ(x)/σ − p|)
    return Σ q L(Φ(x), y)
  end function

The main hyperparameter is the pivot p, which presents us a lever to tune the tradeoff between exploration and sensitivity to aleatoric uncertainty. A more negative pivot means that we upweight examples that are less in agreement with the mean sample gradient, but milder negative values can allow the model's trainable parameter space to explore in relatively benign directions. Being able to control this tradeoff is a key feature of GradTail, as it allows us to filter out examples that are outliers and incompatible with the bulk of the data distribution. In general, selecting a pivot of 0 provides a good baseline.
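For concreteness, the listing below is a minimal NumPy sketch of the weighting loop in Algorithm 1. It uses a logistic-regression model so that per-example gradients are available analytically; the particular activation function, the max_weight cap, and all variable names are illustrative choices rather than a reference implementation, and in a deep network the per-example gradients would instead come from the backward pass over the chosen weight subset w.

# Minimal NumPy sketch of Algorithm 1 (GradTail) on a logistic-regression model.
# Per-example gradients are computed analytically so the example stays self-contained;
# in a deep network they would come from the per-example backward pass instead.
import numpy as np

def per_example_grads(w, x, y):
    """Gradients of the per-example logistic loss w.r.t. w, shape [batch, dim]."""
    logits = x @ w
    probs = 1.0 / (1.0 + np.exp(-logits))
    return (probs - y)[:, None] * x          # dL_i/dw for each example i

def activation(d, max_weight=3.0, sharpness=5.0):
    """Monotonically decreasing f: distance-to-pivot -> sample weight in (1, max_weight].
    The exact functional form here is an illustrative choice, not the paper's exact f."""
    return 1.0 + (max_weight - 1.0) * np.exp(-sharpness * d)

class GradTailWeighter:
    def __init__(self, dim, pivot=0.0, decay=0.99):
        self.pivot, self.decay = pivot, decay
        self.avg_grad = np.zeros(dim)   # w~ in Algorithm 1
        self.sigma = 1e-6               # running scale of the dot products

    def weights(self, grads):
        # Normalized dot product between each example gradient and the running average.
        g_norm = grads / (np.linalg.norm(grads, axis=1, keepdims=True) + 1e-12)
        a_norm = self.avg_grad / (np.linalg.norm(self.avg_grad) + 1e-12)
        theta = g_norm @ a_norm                               # in [-1, 1]
        # Update running statistics (exponential moving averages).
        self.sigma = self.decay * self.sigma + (1 - self.decay) * np.mean(np.abs(theta))
        self.avg_grad = self.decay * self.avg_grad + (1 - self.decay) * grads.mean(axis=0)
        # Distance to the pivot, converted to a per-example loss weight.
        return activation(np.abs(theta / self.sigma - self.pivot))

# Usage: reweight the per-example losses of one minibatch.
rng = np.random.default_rng(0)
w = rng.normal(size=2)
x, y = rng.normal(size=(8, 2)), rng.integers(0, 2, size=8).astype(float)
weighter = GradTailWeighter(dim=2)
q = weighter.weights(per_example_grads(w, x, y))
print("sample weights:", np.round(q, 2))

Because the running average gradient starts at zero, the first minibatches receive essentially identical weights; the discriminative signal appears only as the moving statistics warm up, which is consistent with the dynamic character of the method.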
However,thesemethodsusuallyfocuson Anotherimportanthyperparameteristheactivationfunction classificationandareexplicitlycoupledwithsemantically f.f isamonotonicallydecreasingfunction,whichproduces definedlong-tailswithincurateddatasets(e.g. (VanHorn positiveoutputsthatare≥1. f ismonotonicallydecreasing etal.,2018)). Weinsteadseekamethodthatcanoperate becauseittakesasinputtheabsolutedistanceofthegradient withoutanylong-tailinformationprovidedinthetraining dotproductfromsomepivotpoint. Thefurthertheabsolute data. distancefromthispivotpoint,thelesswewanttoupweight thisparticularsample. 3.Methodology In our experiments within this work, we pick a sigmoid WepresentthemainalgorithmloopforGradTaildynamic activationfunctionf(x)=1.0+ A ,whereA,Bare 1−e−Bx upweightinginAlgorithm1. Wenotethattheactualalgo- additional hyperparameters. We pick this function as it rithmisverysimpleandonlyinvolvesstandarddotproduct presentsrelativelymildslopesbutalsohashighestvariance and arithmetic operations. A schematic of the GradTail nearx=0,whichallowsittobeespeciallypeakedaround methodisshowninFigure1. gradientdotproductsnearthepivotpoint.GradTail:LearningLong-TailedDataUsingGradient-basedSampleWeighting 4.Experiments Wepresentresultswithinthissectiononanumberofexper- imentalsettings. Webeginwithin-depthdiscussiononhow gradients are discriminative towards the long-tails of the distributiononasimple2-dimensionalsyntheticexample. Wethenmovetorealdatasetsandshowstrongresultson bothImageNet2012(Dengetal.,2009)andthemonocular depth prediction task on the Waymo Open Dataset (Sun et al., 2020). Exact details for models used within these experimentsareprovidedwithintheAppendix. (a) GroundTruth 4.1.ASimpleToyExample Beforewetestourmethodologyonlarge-scalesettings,in whichthelong-tailisextremelyhigh-dimensionalandoften semanticallyambiguous,wefinditinstructivetoseehow gradientscanhelpidentifythelong-tailofadatadistribution withinawell-understoodlowdimensionalsetting. (b) BaseModel (c) GradTail Forourtoyexample,weusea2-classclassifierfordatain2 dimensions. Thedataweuseformuchofourtoyexample Figure2.ToyexampleresultsforGradTaildynamicweighting.(a) analysis is displayed in Figure 2. As seen, the data con- Distributionofdatawithina2-classtoyexampleclassifier. The sistsof10,000common-classpoints(thegreencrossdata) common(greencross)classconsistsof10,000datapointsandfol- and 400 uncommon-class points (the purple circle data). lowsaN(0,I)distribution.Therare(purplecircle)classconsists ThecommonclassisdistributedI.I.D.asthestandardnor- of400datapointsandfollowsaN([2.2,2.2],0.5I)distribution. mal,whiletheuncommonclassisright-upwardsshiftedto Thedottedlineshowstheanalyticallycorrectdecisionboundary, calculated for when the two data generating distributions have N([2.2,2.2],0.5I)sothatitiswell-separatedenoughforthe equalprobability. (b)ResultwhenabaselineMLPclassifieris classificationproblemtobewell-defined,whilestillclose trainedonthedata.Theresultshownhererepresentsabetter-than- enoughtotheorigintocreateadifficultdecisionboundary. medianoutcome,asmostbaselinemodelsconvergetothemajority Ourmodelisasimple2-layerMLPwithonehiddenlayer classifier.(c)ResultofanMLPclassifierwhenGradTailisapplied offiveneurons. duringtraining. Asexpected,suchasimplemodeloftenconvergestothema- jorityclassifierofthecommonclass. 
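As a point of reference, the toy distribution of Section 4.1 can be generated with a few lines of NumPy. The function name and the shuffling step below are illustrative, but the class counts, means, and covariances match the description above.

# Sketch of the Section 4.1 toy data: a common class (N(0, I), 10,000 points) and a
# rare class (N([2.2, 2.2], 0.5 I), 400 points), giving a 25:1 class imbalance.
import numpy as np

def make_toy_dataset(n_common=10_000, n_rare=400, seed=0):
    rng = np.random.default_rng(seed)
    x_common = rng.multivariate_normal(mean=[0.0, 0.0], cov=np.eye(2), size=n_common)
    x_rare = rng.multivariate_normal(mean=[2.2, 2.2], cov=0.5 * np.eye(2), size=n_rare)
    x = np.vstack([x_common, x_rare])
    y = np.concatenate([np.zeros(n_common), np.ones(n_rare)])
    perm = rng.permutation(len(y))            # shuffle so minibatches mix both classes
    return x[perm], y[perm]

x, y = make_toy_dataset()
print(x.shape, y.mean())   # (10400, 2), ~0.038 -> heavily imbalanced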
Eveninthebestcase amplesintheuncommonclassfallunder“hard.”However, scenario the result of a naive MLP classifier looks some- examplesfrombothclassesarecaughtinthe“rare”range, thinglikethepredictionsinFigure2(b),withthedecision and they form a fairly symmetric distribution around the boundary pushed deep into the purple uncommon class’s truedecisionboundary. Thisresultemphasizesthatthedefi- distribution density. Such a result is expected due to the nitionof“rare”or“long-tail”thatismostusefultomodel lopsidednessoftheclassfrequencies. However,asshownin trainingcanbeatoddswiththecommonsemanticdefinition Figure2(c),trainingwithGradTailresultsinamuchmore foundintheliterature;ratherthanjustlabelinginfrequent accuratedecisionboundary,whichcomesquiteclosetothe classes as “rare,” we see “rare” examples as high-impact groundtruthdecisionboundaryshownbythedottedlinein examplesthatmosthelpthemodelmakecorrectpredictions. Figure2(a). These“rare”examplescancomefromeithertheinfrequent TodivedeeperintowhyGradTailperformsbetterinthissce- or the frequent class distribution, as long as they benefit nario,weshowthepreciseexamplesGradTailisupweight- predictiveaccuracyonmoretroublesomedata. inginFigure3.Hereweseparatethedatasetinto“common,” Asfurtherillustrationoftheseconcepts,inFigure4wealso “rare,”and“hard”byassociatingeachofthesecategories investigatetheinterestingcasewhentheuncommonclass with a specific range of normalized gradient dot product data is pulled far into the common class distribution, to values. Becausewechosethepivotvalueforupweighting thepointwherethecommonclassfrequencydominatesthe in this setting to be 0, we associate rare examples with a uncommonclassfrequencyatalllocations. Inthissetting, small range around 0, or [-0.07, 0.07]. “Hard” examples GradTailnolongeroutputsanydatainthe“rare”category, areanyexamplesthatfallbelowthisrange,and“common” butalloftheuncommonclassdataisnowlabeledas“hard.” examplesareabove. AsseeninFigure3(a),mostexamples Becauseaddingmoreuncommondatawillnotappreciably inthecommonclassfallunder“common,”whilemostex- changethequalityoftheclassifierinthiscase(unlessweGradTail:LearningLong-TailedDataUsingGradient-basedSampleWeighting (a) GradTail (b) GradTail(rare) (c) Entropy (a) HardDistribution (b) GradTail (c) Entropy GroundTruth Figure3.ToydatasetsplitsbasedonGradTailandothermethods. WeshowhowthetoydatasetshowninFigure2canbediscretely Figure4.The same “hard” vs “rare” vs “common” analysis as clusteredbasedonvariousuncertaintymetrics. Theseclassifica- describedinFigure3butonadifferentdistribution,asshownin tionsarebasedonaggregatedstatisticsforeachdatapointacross (a). Thisharderdistributioncorrespondstothecasewherethe theentiretrainingrun.(a)ForGradTail,becauseourpivotvalue common class has higher frequency than the uncommon class isaround0,weuseasmallclosed-intervalrangearoundzeroof ateverypointintheinputspace. Asshownin(c),entropystill [−0.07,0.07]astherangethatweassociatewithrareexamples producesincoherentresults,andin(b)weobservethatGradTail thatweaimtoupweight.Exampleswithdotproductsunderthat nolongeroutputsanyexampleswithinthe“rare”category.This rangearelabeledas“hard”(red)examples,whileexampleswith resultissensible,astherearenohigh-leverageexamplesforwhich dotproductsabovethatrangearelabeledas“common”(green). collectingsimilarexampleswillhelpuscontinuouslyimprovethe Many examples in the low-frequency class are observed to be inferreddecisionboundary. 
“hard,”whilebothclasseshaveexamplesthatarelabeledas“rare” (yellow).(b)Onlythe“rare”examplesfoundbyGradTailalong withthetruedecisionboundaryasreference. The“rare”exam- Table1.ImageNetClassificationwithGradTail.Highernumbers plesaredistributedevenlyaroundthedecisionboundary,andthus arebetter.Allstandarderrorsarewithin0.05%. upweightingtheseexampleswillleadtobetterdelineationofthe decisionboundary.(c)Highentropyandlowentropypointsarela- METHOD ACCURACY(TOP1) ACCURACY(TOP5) beledinredandgreen,respectively.Becauseofthelopsidednature BASELINE 76.8 92.9 ofthedataset,entropyprovestobeapoor,incoherentpredictorof FOCALLOSS 76.2 93.0 datauncertainty,andtheresultslookrandom. GRADTAIL 77.2 93.1 addahugeamounttoputtheclassfrequenciesmoreinto 4.2.Classification balance), the uncommon class data is purely “hard,” and willremainunalteredbyGradTail. WedescribewithinthissectionexperimentsontheImagenet Large Scale Visual Recognition Challenge 2012 dataset, WenotethatinbothdatadistributionsanalyzedinFigures3 consistingof1000classesofvariousobjectsandotheren- and4,plottingpointsofhighentropyleadstoanincoherent tities. ForourexperimentsweuseaResNet-50(Heetal., result. Entropy is a common metric within classification 2016)asourbasenetwork. OurGradTailmodelhasamax settingsforexampleuncertainty,andisoftenusedasararity loss weight of 3 and a pivot value of 0.0. The GradTail classifier. Meanwhile,ourGradTailmethodologyproduces gradientdotproductsarecalculatedonthelasttwolayers, meaningfulresultsevenwithinthemoredifficultsetting. onbothweightsandbiases. Itispertinenttonotethatforourtoyexamples, although ThemainresultsareshowninTable1forabaselinemodel, GradTailproducesbetterclass-normalizedmeanaccuracy theGradTailmodel,andamodelwithfocalloss(Linetal., duetoasignificantlyimproveddecisionboundary,itlowers 2017)usedduringtraining. Weseethatthetop5accuracies thetotalaccuracy. GradTailmovesthedecisionboundary are fairly clustered, but there is significant improvement butwillendupmisclassifyingsubstantiallymoremembers withintheGradTailmodelontop1accuracy. Acloserlook of the common class as a result. This regression results atthedrivingforcesbehindthisimprovementareencapsu- fromthefactthatourtoyexampleisverylow-dimensional latedinFigure4.2.Herewetabulatethetop-1accuraciesfor (deliberatelysoforeaseofvisualization),tothepointwhere eachquartileforthenormalizedgradientdotproduct. The movementofthedecisionboundaryisanexplicittradeoff first quartile accuracies represent the accuracies amongst betweenaccuracyofoneclassovertheother. Wewillshow thesubpopulationofdatathatlieinthelowest25%ofthe inthefollowingsectionsthatwhenweareinaveryhigh- normalizedgradientdotproductvalue. dimensionalsettingGradTailcanimproveoverallaccuracy as well. We hypothesize that the higher dimensionality Asexpected,forthebaselineweseethattheaccuracytends reducesthepossibilityofhavinganexplicitaccuracytrade- toimproveasthegradientdotproductincreases. However, off between classes as there will always exist a perfectly thistrendisreversedtoashockingdegreewhenfocalloss separatinghyperplanebetweendifferentclasses. isused. Focallossgreedilyupweightsexamplesthatpro-GradTail:LearningLong-TailedDataUsingGradient-basedSampleWeighting Figure5.ImageNetaccuraciesbrokendownbygradientdotproductquartile. Lowestquartilecorrespondstothelowestgradientdot products.Correlationcoefficientsbetweenaccuracyandgradientdotproductvaluearealsoprovidedforeachmodel. 
ducehigherloss, andthe resultisacompletedecoupling classes that produce the highest and lowest gradient dot betweenexampleaccuracyandgradientdotproduct. Such productvalues. Weobservethatforallthreeclasses,high a phenomenon is mathematically counterintuitive, as ex- dotproductexamplesaresemanticallysimilarwhilelowdot amplesthatdisagreewiththemaingradientdirectionnow productexamplesexhibitoddfeatures. Forvolcanos,mid- completelydrivethetraining. eruptionvolcanosusuallyproducelowgradientdotproduct, whichissensibleasmostvolcanoswithinthedatasetaredor- Incontrast,GradTailtrackscloselywiththeperformance mant. Swansthatexhibitlowgradientdotproductareoften ofthebaselineoneachofthequartiles,butbeatsthebase- onlandorinmurkywater,whichhintsthattheenvironment linehandilyinthemiddletwoquartiles. Interestingly,the issalienttothemodelforswanclassification. Refrigerators baselineperformsbetterthanGradTailinthefirstquartile, withlowgradientdotproductoftenhaveadditionalagents where hard examples of the highest aleatoric uncertainty inthescene(e.g. petsorhumans),orareshotatoddangles lie. Thiseffectmaybedrivenbythebaselineseeinghigher ordistances. Ingeneral,allofthesevisualizationsimprove lossandthushighergradientsformembersofthefirstquar- our confidence that GradTail hones in on the appropriate tileduringtraining. GradTailsacrificessomeperformance examples. withinthehardestquartileinfavoroffocusingonthemore learnablemiddlequartiles. Theoveralleffectisbeneficial ItiscrucialtonotethatthevisualizationsinFigure6were tothemodel. WealsopresentinFigure4.2thecorrelation allgeneratedbymodelsataround10%throughthetraining valuebetweengradientdotproductbandandperformance, regimen.Unlikeconventionallong-tailworkintheliterature, showingthatGradTailproducesthelargestcorrelation. This ourGradTailalgorithmisfairlydiscriminativeearlyonin increasedcorrelationshowstahtGradTailleadstobettercal- thetraining, whichdemonstratesthatmodelconvergence ibrationbetweengradientdirectionandperformance,which doeshavetobearequirementinidentifyingandmitigating wehypothesizeisadesirablequalityforatrainedmodel. rareexampleswithinthedataset. TheastutereadermightobservethattheGradTaildynamic 4.3.DenseRegression(DepthEstimation) weightprofileswillbedifferentfordifferentmodeltraining runs,andthereforethequartilesinFigure4.2arecoupling For our regression problem, we choose camera-based to different examples for different baselines. While such monocular depth estimation on the the Waymo Open analysisiscorrect,wedeliberatelyusedthequartilemetric Dataset. Thedatasetconsistsofcameraimagesdownsam- toemphasizeourfirmbeliefthatbeinginthelong-tailis pledto192×480pairedwithgroundtruththatisper-pixel notjustapropertyofthedatabutalsoofthemodelaswell. depthmeasurementsfromLiDAR.Ourmodelusesa5-block Giventhatmanyrareexamplesareborderlinelearnableby MobileNet(Howardetal.,2017)encoderwitha5-layerde- baselinemodelsandthereforemighthappentobelearned coder that also concatenates encoder block outputs in a welloncertainruns,havingmodeldependenceaspartof similarwayasU-Net(Ronnebergeretal.,2015). Thefil- our long-tail definition affords us additional flexibility in terlayoutsintheencodernetworkweresearchedviaNAS identifyingthetrueproblempointsthatwecantrytotune (Bender et al., 2020) in the baseline setting to maximize ourmodelstolearn. potentialperformanceofthemodel,andkeptfixedforall experiments. 
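The quartile breakdown of Figure 5 can be reproduced from per-example statistics with a short routine such as the sketch below. The synthetic inputs at the bottom are only a stand-in to make the snippet runnable, and the binning-by-quantile choice mirrors the description above rather than our exact evaluation code.

# Sketch of the Figure 5 analysis: bin evaluation examples into quartiles of their
# normalized gradient dot product and report top-1 accuracy per quartile, plus the
# correlation between dot product and correctness.
import numpy as np

def quartile_accuracies(dot_products, correct):
    """dot_products: [N] normalized dot products; correct: [N] 0/1 top-1 correctness."""
    d, c = np.asarray(dot_products), np.asarray(correct, dtype=float)
    edges = np.quantile(d, [0.0, 0.25, 0.5, 0.75, 1.0])
    accs = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (d >= lo) & (d <= hi)
        accs.append(c[mask].mean())
    corr = np.corrcoef(d, c)[0, 1]
    return accs, corr

rng = np.random.default_rng(0)
d = rng.uniform(-1, 1, size=1000)
c = (rng.uniform(size=1000) < 0.5 + 0.3 * d).astype(int)  # accuracy rises with dot product
print(quartile_accuracies(d, c))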
Despiteourconfidenceinthemetrics,itstillwouldbereas- suringtoknowthatthegradientdotproductwecalculateas Long-tail mitigation on dense regression data is a partic- partofGradTailproducessemanticallymeaningfulresults ularly exciting topic, as long-tail work has largely been (aswesawforourtoyexampleinSection4.1). InFigure6 restrictedtosingle-outputclassificationproblems. However, wevisualizeexamplesinthreerandomlychosenImageNet there is every reason to believe that regression problemsGradTail:LearningLong-TailedDataUsingGradient-basedSampleWeighting (a) EasyExamples(highgradientdotproduct) (b) HardExamples(lowgradientdotproduct) Figure6.GradientdotproductvisualizationontheImageNetdataset.Gradientdotproductsareaveragedover12savedmodelsaround 10%throughthetotaltrainingregimen. Theexampleswithinthetestsetthatexhibitboththehighest(a)andlowest(b)gradientdot productfortheclassesofblackswan(toprow),volcano(middlerow),andrefrigerator(bottomrow)arevisualized.Highgradientdot productimpliesthatanexampleisgenerallyalignedwiththeaveragetrainingdirectionandthusisin-distribution,whilelowgradientdot productpointstoadifficultexamplethatisout-of-distribution.Inallthreeclasscases,thehardexamplesaresemanticallyreasonable andexhibitimagefeaturesthatmakethemmoredifficult(e.g.murkywaterforswans,eruptionsforvolcanos,andoddlightingorother animal/humanspresentforrefrigerators). anddensepredictionproblemsareallhurtbyissuesinthe originalbatchsize. long-tail. Asgradientsareuniversalquantitieswithindeep The results of our experiments on depth estimation are learningtrainingdynamics,ourproposedmethodologyis showninTable2.Weseethattheoverallerrorrateimproves well-suitedtotackledenseregressionwithminimalmodifi- byamarginal(thoughstatisticallysignificant)amount,but cationsfromtheclassificationsetting. the picture is clearer when we tabulate the error rate for The density of the output for our chosen problem can be differentdepthranges. Theerrorrateimprovessignificantly onepotentialsourceofinefficiency. Aswehavetocalculate forpointswithinthe40-60mrange,whilebeingminimalfor agradientdotproductforeachoutput,andouroutputsare closepointsinthe0-20mrange. Physically,weknowthat nowinagridofover92,000pixels,itisnotrecommended pointsfurtherinthedistancearenecessarilylesscommon toperformthiscalculationforeachpixel. Evenifitwere withinthedataset(duetoparallax). UnliketheImageNet compute-efficienttodoso,thereissignificantspatialcorre- case(Section4.2),whereexamplerarityissemanticallyin- lationbetweenadjacentpixelsandsoitwouldbeawasteof formedandthuscomplicated,raritydependenceondistance computetoconsiderlong-tailedupweightingateverypixel withindepthestimationdatasetsisaphysicalpropertyof location. Instead,wetakesixrandompatchesofrandomly thesensorsandthusareasonableproxyforlong-tailedness. sampledsizesbetween20×20and100×100andrandomly Theseresultsarethereforeareassuringsignalthatalthough sampledlocationswithinthefullimagespace. Wealsotake theoverallimprovementissmall,wesubstantiallyimprove aseventhpatchconsistingofanypixelthatwasnotincluded performanceinareaswherewelackdataandwhichgener- inanyproposaltoensurethatwebackpropagatesomeerror allyareproblematicforstandardmodels. signal at every location. 
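A minimal sketch of this patch-based pooling, assuming a per-pixel loss map is already available, is given below. The patch count, size range, and remainder mask follow the sampling scheme just described, while the function name and the stand-in loss map are illustrative.

# Sketch of the patch pooling for dense depth regression: six random patches
# (sizes 20x20 to 100x100, random locations) plus a seventh "remainder" patch of
# all uncovered pixels; the per-pixel loss is averaged inside each patch so each
# image contributes 7 pooled loss terms.
import numpy as np

def pooled_patch_losses(pixel_loss, n_patches=6, min_size=20, max_size=100, seed=0):
    """pixel_loss: [H, W] per-pixel loss map. Returns a list of 7 scalar losses."""
    rng = np.random.default_rng(seed)
    h, w = pixel_loss.shape
    covered = np.zeros((h, w), dtype=bool)
    losses = []
    for _ in range(n_patches):
        ph = rng.integers(min_size, max_size + 1)
        pw = rng.integers(min_size, max_size + 1)
        top = rng.integers(0, h - ph + 1)
        left = rng.integers(0, w - pw + 1)
        patch = pixel_loss[top:top + ph, left:left + pw]
        covered[top:top + ph, left:left + pw] = True
        losses.append(patch.mean())
    remainder = pixel_loss[~covered]               # every pixel still gets a signal
    losses.append(remainder.mean() if remainder.size else 0.0)
    return losses

loss_map = np.abs(np.random.default_rng(1).normal(size=(192, 480)))  # stand-in loss map
print(np.round(pooled_patch_losses(loss_map), 3))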
We then mean the loss in each The pivot parameter is also clearly important in this sce- patchandconcatenateforaneffectivebatchsizeof7×theGradTail:LearningLong-TailedDataUsingGradient-basedSampleWeighting Table2.WaymoOpenDatasetMonocularDepthEstimationwithGradTail.Thelowerthebetterforallmetricswithinthistable.Total MREhasstandarderrorwithin0.03%,whileallotherresultshavestandarderrorwithin0.05%. METHOD MRE<20M(%) MRE20-40M(%) MRE40-60M(%) MRE>60M(%) TOTALMRE(%) BASELINE 10.1 11.9 14.2 16.8 11.0 GRADTAIL,PIVOT−2.5 10.5 11.9 14.2 16.8 11.2 GRADTAIL,PIVOT+0.5 10.5 11.9 13.9 17.2 11.3 GRADTAIL,PIVOT−0.5 10.1 11.7 13.7 16.5 10.8 nario; apivotclosetozerobutslightlynegativeprovided though this may seem expensive, example-level gradient the best results. Positive pivots, which would upweight computationisalreadydoneinthevastmajorityofmodels, in-distributionexamples,andhighnegativepivots,which whichallowsustoreusecalculationstoimplementGrad- wouldupweightonlyhardexamples,bothdegradedmodel Tail. Theoverheadthenbecomesminimal,withthemost quality. Theoptimalpivotbeingclosetozerofurthersup- expensivestepbecomingthetakingofasingledotproduct. portsouremphasisonorthogonalgradientsasmeaningful. However,wehavefoundthatsettingupthisreusedcompu- tationisdifficultinmostmoderndeeplearningframeworks, 5.Discussion asitiscommonwithinsuchframeworkstohidetheback- wardspassdeepwithinthesystembackend.Interceptingthe OurgoalforintroducingGradTailistwofold: first,wewant computationrightbeforetheper-examplegradientsignal todemonstrateitasageneralmethodthatworkstomitigate is summed across the batch dimension can then become performanceissuesonlong-tailpartsofthedatadistribu- challenging. Upuntilnow,suchdesigndecisionsbythese tion. We showed through substantial analysis on a low- frameworksmayhavebeenduetothelimiteduseofinter- dimensionaltoyexamplethatGradTailworkswellwithin cepting such a computation. However, we hope that our thecontextofhighlylopsideddata,andproducesasensible work on GradTail and followup research will clarify the decisionboundaryevenwhenonedatadistributionis25× potential benefit that access to the per-example gradients thefrequencyoftheother. Wealsoshowedimprovements canoffer,andthatwecanencouragedesignsofdeeplearn- inperformanceonraresegmentsofthedatainbothaclassi- ingframeworkthatallowbetterinterfacingtothegradient fication(ImageNet)anddenseregression(depthestimation computationbackend. onWaymoOpenDataset)context. Thelatterisespecially exciting, as long-tail methods have largely been reserved 6.Conclusions forclassificationsettingsduetothecomplexityofmoving toacontinuousoutputspace. However,becausegradients WepresentedGradTail,agradient-baseddynamicweighting arewell-suitedtocontinuousproblemsandareuniversalto algorithmthatupweightsexamplesduringtrainingbased neuralnetworktraining,theyarepowerfullygeneralentities ontheirrarity. WeshowedhowGradTailworkstoproduce touseforregressionlong-tailmitigation. areasonabledecisionboundaryonanextremelylopsided Oursecondgoalistochallengepre-existingnotionsofwhat low-dimensionalclassificationproblem,aswellasworking long-tailedness and rarity within a dataset mean. Long- to mitigate poor performance on rare examples within a tailednesshaslargelybeenaworkofmanuallaborinthe higher-dimensionalclassificationproblem(ImageNet). Cru- literature,withhumanssemanticallylabelingobjectstobe cially,weshowthatGradTailgeneralizestodenseregression rare and ad-hoc triaging when new rare classes crop up. 
settingsaswell, whichhavehithertobeenrelativelyinac- DatasetslikeiNaturalist(VanHornetal.,2018)existthat cessibletolong-tailmethods. Ultimately,weseeGradTail boastlonglistsoflong-tailedclasses,wherelong-tailedness asanimportanttoolwithinthetoolkitofanydeeplearning isapropertyofaclassratherthananexample. Whilesuch practitioner,butalsoseeitassignificantevidencethatthere analysis has been crucial in robustifying neural network ismuchtobelearnedaboutanycomplexdatadistribution thus far, we hope that our work provides a fresh look at fromtheinformation-richbutoft-ignoreddynamicsoftrain- how we can improve model training by reasoning about ingamodel. Itmaybethattotrulyrobustifyourmodels, long-tailednessatalearned,granularlevelandusingmodel we need to focus not on where our models converge, but dynamicsasourfundamentalsignal. ratheronthemyriadtwistingpathsbywhichtheygetthere. 5.1.ANoteonCompute/MemoryCosts Throughout this work, we noted that Gradtail requires a gradientdotproducttobecalculatedforeachexample. Al-GradTail:LearningLong-TailedDataUsingGradient-basedSampleWeighting References Kendall, A., Gal, Y., andCipolla, R. Multi-tasklearning usinguncertaintytoweighlossesforscenegeometryand Bender,G.,Liu,H.,Chen,B.,Chu,G.,Cheng,S.,Kinder- semantics. In Proceedings of the IEEE conference on mans,P.-J.,andLe,Q.V. Canweightsharingoutperform computervisionandpatternrecognition,pp.7482–7491, randomarchitecturesearch? aninvestigationwithtunas. 2018. 3 In Proceedings of the IEEE/CVF Conference on Com- puterVisionandPatternRecognition,pp.14323–14332, Lin, T.-Y., Goyal, P., Girshick, R., He, K., and Dolla´r, P. 2020. 6 Focallossfordenseobjectdetection. InProceedingsof the IEEE international conference on computer vision, Bengio, S. The battle against the long tail. In Talk on pp.2980–2988,2017. 3,5 WorkshoponBigDataandStatisticalMachineLearning, volume1,2015. 1 Liu,Z.,Miao,Z.,Zhan,X.,Wang,J.,Gong,B.,andYu,S.X. Large-scalelong-tailedrecognitioninanopenworld. In Chen,Z.,Badrinarayanan,V.,Lee,C.-Y.,andRabinovich, ProceedingsoftheIEEE/CVFConferenceonComputer A. Gradnorm: Gradientnormalizationforadaptiveloss VisionandPatternRecognition,pp.2537–2546,2019. 3 balancingindeepmultitasknetworks. InInternational ConferenceonMachineLearning,pp.794–803.PMLR, Loshchilov, I. and Hutter, F. Sgdr: Stochastic gra- 2018. 3 dient descent with warm restarts. arXiv preprint arXiv:1608.03983,2016. 13 Chen,Z.,Ngiam,J.,Huang,Y.,Luong,T.,Kretzschmar,H., Chai,Y.,andAnguelov,D. Justpickasign: Optimizing Louizos, C. and Welling, M. Multiplicative normalizing deepmultitaskmodelswithgradientsigndropout. arXiv flowsforvariationalbayesianneuralnetworks.InInterna- preprintarXiv:2010.06808,2020. 3 tionalConferenceonMachineLearning,pp.2218–2227. PMLR,2017. 1,3 Deng,J.,Dong,W.,Socher,R.,Li,L.-J.,Li,K.,andFei-Fei, L. Imagenet: Alarge-scalehierarchicalimagedatabase. Maddox,W.J.,Izmailov,P.,Garipov,T.,Vetrov,D.P.,and In2009IEEEconferenceoncomputervisionandpattern Wilson, A. G. A simple baseline for bayesian uncer- recognition,pp.248–255.Ieee,2009. 4 taintyindeeplearning. AdvancesinNeuralInformation Depeweg, S., Hernandez-Lobato, J.-M., Doshi-Velez, F., ProcessingSystems,32:13153–13164,2019. 3 andUdluft,S. Decompositionofuncertaintyinbayesian Malinin, A., Mlodozeniec, B., and Gales, M. Ensemble deeplearningforefficientandrisk-sensitivelearning. In distributiondistillation.arXivpreprintarXiv:1905.00076, InternationalConferenceonMachineLearning,pp.1184– 2019. 3 1193.PMLR,2018. 3 Philion, J. 
Fastdraw: Addressingthelongtailoflanede- He,H.andGarcia,E.A. Learningfromimbalanceddata. tectionbyadaptingasequentialpredictionnetwork. In IEEETransactionsonknowledgeanddataengineering, ProceedingsoftheIEEE/CVFConferenceonComputer 21(9):1263–1284,2009. 3 VisionandPatternRecognition,pp.11582–11591,2019. He,K.,Zhang,X.,Ren,S.,andSun,J. Deepresiduallearn- 1 ingforimagerecognition. InProceedingsoftheIEEE Ronneberger,O.,Fischer,P.,andBrox,T. U-net: Convolu- conferenceoncomputervisionandpatternrecognition, tionalnetworksforbiomedicalimagesegmentation.InIn- pp.770–778,2016. 5,13 ternationalConferenceonMedicalimagecomputingand Herna´ndez-Lobato,J.M.andAdams,R. Probabilisticback- computer-assistedintervention, pp.234–241.Springer, propagationforscalablelearningofbayesianneuralnet- 2015. 6 works. InInternationalconferenceonmachinelearning, Shi, W., Zhao, X., Chen, F., andYu, Q. Multifacetedun- pp.1861–1869.PMLR,2015. 3 certaintyestimationforlabel-efficientdeeplearning. Ad- Howard,A.G.,Zhu,M.,Chen,B.,Kalenichenko,D.,Wang, vances in Neural Information Processing Systems, 33: W.,Weyand,T.,Andreetto,M.,andAdam,H.Mobilenets: 17247–17257,2020. 3 Efficientconvolutionalneuralnetworksformobilevision Sun,P.,Kretzschmar,H.,Dotiwalla,X.,Chouard,A.,Pat- applications. arXivpreprintarXiv:1704.04861,2017. 6, naik,V.,Tsui,P.,Guo,J.,Zhou,Y.,Chai,Y.,Caine,B., 13 etal. Scalabilityinperceptionforautonomousdriving: Kendall, A. and Gal, Y. What uncertainties do we need Waymoopendataset. InProceedingsoftheIEEE/CVF in bayesian deep learning for computer vision? arXiv ConferenceonComputerVisionandPatternRecognition, preprintarXiv:1703.04977,2017. 1,3,11 pp.2446–2454,2020. 4GradTail:LearningLong-TailedDataUsingGradient-basedSampleWeighting Sutskever,I.,Martens,J.,Dahl,G.,andHinton,G. Onthe importanceofinitializationandmomentumindeeplearn- ing. InInternationalconferenceonmachinelearning,pp. 1139–1147.PMLR,2013. 11 Tan,J.,Wang,C.,Li,B.,Li,Q.,Ouyang,W.,Yin,C.,and Yan,J. Equalizationlossforlong-tailedobjectrecogni- tion. In Proceedings of the IEEE/CVF Conference on ComputerVisionandPatternRecognition(CVPR),June 2020. 3 VanHorn,G.,MacAodha,O.,Song,Y.,Cui,Y.,Sun,C., Shepard,A.,Adam,H.,Perona,P.,andBelongie,S. The inaturalistspeciesclassificationanddetectiondataset. In ProceedingsoftheIEEEconferenceoncomputervision andpatternrecognition,pp.8769–8778,2018. 3,8 Vyas, A., Jammalamadaka, N., Zhu, X., Das, D., Kaul, B.,andWillke,T.L. Out-of-distributiondetectionusing anensembleofselfsupervisedleave-outclassifiers. In ProceedingsoftheEuropeanConferenceonComputer Vision(ECCV),pp.550–564,2018. 1,3 Yang,Y.,Ma,Z.,Nie,F.,Chang,X.,andHauptmann,A.G. Multi-classactivelearningbyuncertaintysamplingwith diversitymaximization. InternationalJournalofCom- puterVision,113(2):113–127,2015. 1,3 Yoo,D.andKweon,I.S. Learninglossforactivelearning. In Proceedings of the IEEE/CVF Conference on Com- puterVisionandPatternRecognition,pp.93–102,2019. 3 Yu,T.,Kumar,S.,Gupta,A.,Levine,S.,Hausman,K.,and Finn,C. Gradientsurgeryformulti-tasklearning. arXiv preprintarXiv:2001.06782,2020. 3GradTail:LearningLong-TailedDataUsingGradient-basedSampleWeighting A.DiscussiononAleatoricvsEpistemicUncertaintyandGradientDotProducts Withinthemainbodyofthiswork,werepeatedlymaketheclaimthatorthogonal(orclose-to-orthogonal)gradientswith respect to the average gradient indicate examples with high epistemic uncertainty, while gradients with high negative gradientdotproductsindicateexampleswithhighaleatoricuncertainty. 
Wewouldliketotaketheopportunitytofurther discusstheconnectionbetweenourworkandtheseclassicalconceptsinuncertaintyestimation. Traditionally,epistemicuncertaintyisalsoknownasmodeluncertainty,andreflectsourmodel’slackofknowledgetolearn acertainpieceofdata. Althoughtherearevariouswaysofmodelingsuchuncertainty(e.g. puttingaprioronthemodel weightsasdoneinBayesiandeeplearning(Kendall&Gal,2017)),akeypropertyofepistemicuncertaintyisthatitcan alwaysbealleviatedbycollectingmoredata. Thatkeypropertyimmediatelycreatesafundamentallinkbetweenepistemic uncertaintyandlong-tailedness. Incontrast,aleatoricuncertaintydetailsirreduciblenoisewithinthedatathatwillbepresent regardlessofhowmuchdatawecollect. Wethusaskourselveswhatkindsofdataasufficientmodelwilllearnbettergivenmoreexamplesofthatdatatype. We arguethatthisisthepointwherere-evaluatingtheprobleminthecontextofmodelgradientsbecomesespeciallyhelpful. Namely,frombasiccalculusthereisatleastalocalguaranteethatforaforward-passmodelM,shouldanexamplexwith labely producegradients∇ L(M(x);y)thatproducedotproductsthatarezeroorgreaterwithrespecttotheaverage w gradient,thenapplyingthesegradientupdateswill: 1. ReducethelossL(M(x);y). 2. Notdegradetheperformanceoftheaverageexamplewithinthedataset. Thus,wenotethatcollectingmoredatawithasimilargradientasxandthereforeasimilardotproductwillnecessarily reducetheuncertaintyofxwithinthemodelM trainedonthefulldataset,whilenotreducingtheperformanceofthemodel otherwise. Thisargumenttellsusthattheuncertaintyofdatapointxislargelyepistemic. Incontrast,if∇ L(M(x);y)isantiparallelorproducesnegativedotproductwhendottedwiththeaveragegradient,then w collectingmoreexampleswithasimilargradientprofilewillhurttheperformanceoftheaverageexampleforthatmodel. Thus,aswetrainonthefulldataset,althoughwemayoverfittotheexamplexandreduceitsuncertainty,theoverallmodel performancewilldegradeandbecomeevenmoresusceptibletothenoiseofmisbehavingexamplessuchasx. Suchbehavior doesnotfulfillthekeypropertyofepistemicuncertaintyofbeingalwaysreduciblewithmoredatacollection,andsowe associateitspoorperformancemorewithhighaleatoricuncertainty. Wenotethatmuchofthisargumentinvolvescontextualizingsampleuncertaintywithinalargerecosystemofmodeltraining. Inourview,theuncertaintyofanygivenexampleisonlymeaningfulinthecontextofamodelthatismakinginferenceonall thatdata,asweneverobservetrainingexamplesbythemselvesinavacuum. Suchconsiderationsareakeydrivingforceof ourproposedGradTailalgorithm,asthedynamicqualityofthealgorithmensuresthatthecomputeduncertainty(viagradient dotproduct)ofagivenexamplewilloftenchangewithtimeanddependsonthestateofthemodel. Wefindthatreasoning aboutuncertaintywhiletiedtoamodelstateiscrucialfordeterminingnotonlyexamplesofdifferinguncertaintytypes,but examplesthatwillbeofoptimalpracticalusetomodeltraining. Empirically,thisclaimiswellsupported,especiallybyour visualizationsinFigure3whereweshowedthatlong-tailexamplesclusteraroundthedecisionboundary,andinTable2 whereweshowedthatperformancedegradesforouralgorithmforlargenegativepivots. B.TrainingDetails Unlessotherwisenoted,alltrainedmodelsweretrainedonTPUcoreswithintheTensorFlowframework. TheToyExample anddepthestimationexperimentswereperformedonTensorFlow1,whiletheImageNetexperimentswereperformedon TensorFlow2. B.1.ToyExample ThetoyexamplenetworkisasimpleMLPmodelwithonehiddenlayerof5neurons. Thenetworkisbarebonesanddoesnot employanyofthestandardtoolslikebatchnormalizationordropout. Trainingforallmodelsisperformedfor10,000steps withaninitiallearningrateof1×10−4 Thelearningrateisnotdecayed. 
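To make the first-order argument of Appendix A above concrete, the following small NumPy check uses an analytically differentiable squared-error model: the change in an example's loss after a small step along the average gradient is approximately −λ (g_x · ḡ), so aligned, orthogonal, and opposing examples behave exactly as claimed. The model and the three constructed examples are illustrative, not drawn from any of our experiments.

# Numerical check of the Appendix A argument: after a small step along the average
# gradient g_avg, the change in an example's loss is approximately -step * (g_i . g_avg),
# so the sign of the dot product predicts whether the example benefits from, is
# unaffected by, or is hurt by the average update.
import numpy as np

def loss(w, x, y):
    return 0.5 * (x @ w - y) ** 2            # simple per-example squared error

def grad(w, x, y):
    return (x @ w - y) * x                   # analytic gradient of the loss above

rng = np.random.default_rng(0)
w = rng.normal(size=3)
x_avg, y_avg = rng.normal(size=3), 0.0       # stand-in for the "average" example
step = 1e-3
g_avg = grad(w, x_avg, y_avg)

examples = [
    ("aligned",    x_avg, 0.0),                                   # gradient parallel to g_avg
    ("orthogonal", np.cross(x_avg, rng.normal(size=3)), 1.0),     # gradient orthogonal to g_avg
    ("opposing",   x_avg, 2.0 * (x_avg @ w)),                     # residual flipped, opposes g_avg
]
for name, x_i, y_i in examples:
    g_i = grad(w, x_i, y_i)
    predicted = -step * (g_i @ g_avg)                             # first-order prediction
    actual = loss(w - step * g_avg, x_i, y_i) - loss(w, x_i, y_i)
    print(f"{name:10s} dot={g_i @ g_avg:+.3f}  predicted dL={predicted:+.6f}  actual dL={actual:+.6f}")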
Weoptimizeusingalook-aheadmomentum optimizer(Sutskeveretal.,2013)withmomentum0.9GradTail:LearningLong-TailedDataUsingGradient-basedSampleWeighting OurGradTaillayerusesallweights(bothconvolutionalweightsandbiases)withintheMLPforcomparison,andproducesa maximumupsamplingweightof15 WealsocompareGradTailwithinverseclass-frequencyweightinginFigureB.1. Byinverseclass-frequencyweighting,we meanthatduringtraining,exampleswithinthecommonclassareassignedweight1andexampleswithintheuncommon classareassignedweightw ≥1. Thisweightingisacommontechniqueusedinpracticetodealwithknownclassimbalance withinthetrainingdataset. WecomparethistotheGradTailmodel,forwhichw isthemaximumweightassignedtoa detectedlong-tailexample. Thebaselinew =1caseistheoneshowninFigure2b. FigureA.1.GradTail classification versus Inverse Frequency Weighting classification on an imbalanced toy dataset. The top row correspondstoinversefequencyweightingresultswhilethebottomareGradTailresults.Eachcolumncorrespondstoadifferentweighting. Forinversefrequencyweightingthisistheweightassignedtotrainingexamplesoftheuncommonclass.ForGradTailthisisthemaximum upweightfactorforanexamplethatiswithinthelong-tail. Weseethataswincrease,inversefrequencyweightingforcesthedecisionboundaryfurtherdownandtotheleft,reflecting theincreasedleveragethattheuncommon-classexamplesnowhaveonthetrainingdynamics. Ifwesetthefrequency weightingtotheexactfrequencyratiobetweenthecommonanduncommonclasses(inthiscasethatratiowouldbe25), theresultingdecisionboundaryisratheraggressiveandencroachesveryfarintothecommonclassterritory. Findingthe rightbalancebecomesahyperparametersearchproblem,whichisexponentiallyexacerbatedwhentherearemorethantwo classes. Incontrast,GradTailisconsistentthroughoutregardlessofwhatwesetasthemaximumupweightingfactor. Weattribute thisrobustnesstothedynamicnessofthemethod;GradTailcannotdeviatefromagoodbalancepointbetweentheclasses becauseonceoneclassstartstodominate,theotherclass’sexamplesaremorelikelytobedetectedaslong-tail. GradTail thusprovidesarestoringforcetothesystem,leadingtoastableconvergencethatwouldbeelusivetomanualmethodslike inversefrequencyweightingwherethesameweightisappliedthroughoutthetrainingrunregardlessofhowthemodel trains. Wefurthernotethatalthoughinversefrequencyweightingisofsomeutilityinsomescenarios,oneofourcorebeliefswithin thisworkisthatweneedtomoveawayfromamanualsemanticdefinitionoflong-tail. Anexampleisnotnecessarilyinthe long-tailjustbecauseitbelongstoalesscommonclass,astheclassdefinitionsthemselvesmaycomefromsemi-arbitrary semanticlabelsandhighlyoverlappinggeneratingdistributions. Overfittingourmethodologiestoasemanticclass-basedGradTail:LearningLong-TailedDataUsingGradient-basedSampleWeighting definitionoflong-tailmayleadtosubparperformance,asdemonstratedhere. B.2.ImageNet ImageNetinputsaredownsampledto160x160inputsbeforebeingputthroughastandardResNetv1architecture(Heetal., 2016). Modelsaretrainedfor1.7millionstepswithaninitiallearningrateof0.025andacosinelearningratedecayprofile (Loshchilov&Hutter,2016). Weusethesameoptimizerasinthetoyexample(lookaheadmomentumwithmomentum parameter0.9),andalsousealabelsmoothingof0.1. Wesetregularizationat6e-5andsetbatchsizeto256. OurGradTaillayerusesgradientsfromthefinaltwolayers(bothconvolutionalweightsandbiases),andthemaximum GradTail weight is set to 3. We use a pivot value of 0 for all our ImageNet experiments, which is consistent with our interpretationthatorthogonalgradientsbelongtolong-tailexamples. 
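For reference, one simple form of the inverse class-frequency weighting used as a comparison in Appendix B.1 is sketched below. In our experiments the uncommon-class weight w is a swept hyperparameter rather than necessarily the exact frequency ratio, so the capped-ratio rule here is only an illustrative default.

# Sketch of inverse class-frequency weighting: each example's weight is the (capped)
# ratio of the most frequent class count to its own class count, so the common class
# gets weight 1 and the rare class gets weight w >= 1 for the entire training run.
import numpy as np

def inverse_frequency_weights(labels, max_weight=25.0):
    labels = np.asarray(labels)
    classes, counts = np.unique(labels, return_counts=True)
    ratio = counts.max() / counts                     # e.g. 25 for a 10,000 : 400 split
    per_class = np.minimum(ratio, max_weight)
    lookup = dict(zip(classes, per_class))
    return np.array([lookup[label] for label in labels])

y = np.array([0] * 10_000 + [1] * 400)
print(np.unique(inverse_frequency_weights(y)))        # [ 1. 25.]

Unlike GradTail, these weights are fixed for the whole run regardless of how the model trains, which is the static behavior that the comparison in Figure A.1 is designed to probe.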
B.3.WaymoOpenDataset WeperformmonoculardepthestimationonthecameraimagesoftheWaymoOpenDatasetbytrainingaregressionmodel toproducemetricper-pixeldepthpredictionsfromcameraimageonly. Thedepthgroundtruthwasobtainedbyprojecting synchronizedLiDARpointsintotherespectiveimages.Themodelinputandoutputresolutionissetto192×480,andimages arefedthrougha5-blockMobileNet(Howardetal.,2017)encoderwithacorrespondingdecoderandskipconnections. The traininglossisasimple(cid:96) lossappliedonlytopixelsthathaveagroundtruthdepthvalueassignedtothem. Thelearning 1 rateiskeptconstantatlr=0.0002throughoutthetraining,andthebatchsizeissetto10. OurGradTaillayerusesjustthebiasesinthefirsttwoupsamplinglayerswithinthedecoder,whichallowsourmethodtobe especiallyefficientwithinthissetting. ThemaximumGradTailweightissetto15. C.MoreInsightonMethodologyHyperparameters OurproposedGradTailalgorithmisrelativelysimpleandstraightforward,butdoescomewithafewhyperparameters. Inour experience,GradTailisfairlyrobust(seeforexamplethefrequencyweightingexperimentsinSectionB.1,butweinclude manyofthesehyperparametersasguardsagainstedgecasesandspecificscenarioswheredynamicdriftmightbesevere enoughtothrowoffGradTailwithoutfurthermitigation. Inthissectionwegothroughthehyperparametersandoffereafew additionalinsightsoneachofthem. Pivotparameterp. Thepivotparameterpisthemainhyperparameterandtheonethathasthelargestinfluenceontraining (seeforexamplethedensedepthestimationresultsinSection4.3). Thepivotparameterrepresentsourbeliefforwhat examplescountas”long-tail”andthusmustbeupweighted. Withoutanyadditionalinformation,werecommendalways settingthepivottozeroforinitialexperiments. Thissettingisbecauseapivotofzeromeansweupweightexamplesthat backpropagateorthogonalgradients,whichrepresentexamplesthatarelearnablebutstilldifficultduetolackofmodel explorationoftheparameterspace. However,weleavethepivotasahyperparameterbecausesmallswingsinthenegative andpositivedirectioncanofferadditional(orless)regularization,withmorenegativepivotsresultinginmoreregularization. Becauseweultimatelywanttooptimizeourmodelsonatestset,suchregularizationpotentialcanbeuseful. Activationfunctionf. Theactivationfunctiontransformsanormalizedgradientdotproductintoasampleweight,which shouldbeatleast1.0. Wechoseasigmoid-likeactivationfunctionasitprovidesuswithasharppeakclosetothedesired pivot,whichallowsustobemoreselectivewithwhatsamplestoupweight. Decayrateλ. Thisisthedecayratebywhichweassignnewaveragegradientvaluestotheexponentialmovingaverage andvarianceσ. Wesetthisdecayrateinourexperimentsto0.99,whichmeansthatourpoolofaveragegradientsreflectson averagethegradientsofthelast100batches. Variance σ. Although not a hyperparameter, the zero-centered variance σ is worth discussing as it is an additional normalizationtermthatmayseemmysteriousatfirst. Weaddσintothemainalgorithmbecauseouralgorithmisadynamic onewhichdealswithmovingstatisticswithinanetworkthatistraining. Inmanycasesthedriftofσthroughtrainingwas onlymild,soitisplausibletouseGradTailwithoutσ,butasasafetymeasureitisstillrecommended.GradTail:LearningLong-TailedDataUsingGradient-basedSampleWeighting D.FullImageNet2012Visualizationfor2Classes In the main paper, we showed the highest and lowest dot product images for some ImageNet classes in Figure 6. For completeness,weincludealltheimageswithinthetestset(50each)fortwoImageNetclasses,volcanoandrefrigerator,in FigureA.2. Foreachclass,thegradientdotproductsforeachimagebecomehigherfromlefttorightineachrow,andthenfromtopto bottom. 
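Returning to the decay rate discussed in Appendix C above, the following short check makes the "last 100 batches" intuition for λ = 0.99 explicit; the numbers are a property of the exponential moving average itself and not of any particular model.

# Quick check of the Appendix C claim that a decay rate of 0.99 makes the moving average
# reflect roughly the last 100 batches: the EMA weight on a batch t steps in the past is
# (1 - decay) * decay**t, and the effective averaging window is 1 / (1 - decay).
decay = 0.99
print("effective window:", 1.0 / (1.0 - decay))                            # 100 batches
weights = [(1 - decay) * decay ** t for t in range(300)]
print("weight mass in last 100 batches:", round(sum(weights[:100]), 3))    # ~0.634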
Inotherwords,thetopleftimageineachclassproducedthelowestgradientdotproduct,whilethebottomright imageineachclassproducedthehighest. Ineachcase,zerogradientsoccurclosetothebeginningofthethirdrow. Weseethatasgradientdotproductsbecomehigherandhigher,thereisaclearsemantictrendforimagestostandardizeinto fairlysimilarimages. Forrefrigeratorswebegintoseeimagesoffullrefrigerators(withdoorsopenorclosed)withoutmuch debrisorextraneouselementsintheframe. Forvolcanosweseemanyimagesofclearskiesandsnow-cappedpeaks. Atthe lowerendofthegradientdotproductscale,webegintoseeoddfeatureslikeanimalsorhumansinfrontofarefrigeratoror volcanosthataremid-eruption. Weemphasize,however,thatalthoughweincludethesevisualizationsforcompleteness,wedonotfullyrecommendthe practiceoftryingtosemanticallyreasonwhyanygivendatainputmayormaynotbedifficultforagivenmodel. One ofourmajorintendedcontributionsofourworkwastodemonstratethatitisreasonabletodefinelong-tailednesspurely asafunctionofthedataandmodel,withoutreferencetohumandefinedclassesorclasshierarchies(whicharealways susceptibletoatleastabitofarbitrariness).Althoughitisausefulsanitychecktoseethatthereisaclearsemanticdifference inImageNetdataasweprogressalongthedotproductscale,wehopetoemphasizethatitismoreusefultothinkabout long-tailednessintermsofparameterspaceexplorationandlearnability,bothofwhichourgradientframeworkaddresses explicitly.GradTail:LearningLong-TailedDataUsingGradient-basedSampleWeighting (a) Refrigerator (b) Volcano FigureA.2.ImageNetvisualizationorderedbygradientdotproductfortwoclasseswithintheImageNet2012testset.Imagesproduce highergradientdotproductwiththemeangradientastheimagesgofromlefttoright,andthentoptobottom.Imagesonthebottomright ofeachcategoryhavethehighestgradientdotproduct,whileimagesonthetopleftofeachcategoryhavethelowest.