Just Pick a Sign: Optimizing Deep Multitask Models with Gradient Sign Dropout

Zhao Chen (Waymo LLC), Jiquan Ngiam (Google Research), Yanping Huang (Google Research), Thang Luong (Google Research), Henrik Kretzschmar (Waymo LLC), Yuning Chai (Waymo LLC), Dragomir Anguelov (Waymo LLC)
Mountain View, CA 94043
zhaoch@waymo.com, jngiam@google.com, huangyp@google.com, thangluong@google.com, kretzschmar@waymo.com, chaiy@waymo.com, dragomir@waymo.com

34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.

Abstract

The vast majority of deep models use multiple gradient signals, typically corresponding to a sum of multiple loss terms, to update a shared set of trainable weights. However, these multiple updates can impede optimal training by pulling the model in conflicting directions. We present Gradient Sign Dropout (GradDrop), a probabilistic masking procedure which samples gradients at an activation layer based on their level of consistency. GradDrop is implemented as a simple deep layer that can be used in any deep net and synergizes with other gradient balancing approaches. We show that GradDrop outperforms the state-of-the-art multiloss methods within traditional multitask and transfer learning settings, and we discuss how GradDrop reveals links between optimal multiloss training and gradient stochasticity.

1 Introduction

Deep neural networks have fueled many recent advances in the state-of-the-art for high-dimensional nonlinear problems. However, when distilled down to its most basic elements, deep learning relies on the humble gradient as the optimization signal which drives its complex algorithmic machinery. Indeed, the desire to properly leverage gradients has spurred a wealth of research into optimization strategies which has led to faster, more stable model training [36].

However, the literature has habitually glossed over an increasingly crucial detail: most gradient signals are sums of many smaller gradient signals, often corresponding to multiple losses. A broad array of models fall under this category, including ones not traditionally considered multitask; for example, multiclass classifiers can be split into a loss per class, and object detectors conventionally break down their predictions along various bounding box dimensions. It is uncertain, and in fact unlikely, that a naïve sum of these individual signals would produce the best solution.

Deep learning theory tells us that the local minima found in single-task models through simple gradient updates are generally of high quality [4]. However, such a claim should be reevaluated in the context of multitask loss surfaces, where minima of each constituent loss may exist at different network weight settings, which results in many poor minima of the sum loss. Such undesirable minima are avoided if we encourage the network to seek out critical points that are joint minima, i.e. critical points that lie near a local minimum of all the constituent loss functions.

To generally address such issues, deep multitask learning studies properties of models with multiple outputs and has given birth to methods to balance relative gradient magnitudes [3, 17] or tune the full gradient tensor [38]. Still, methods that explicitly tackle joint loss optimization are rare. Works such as [37, 47] do so by finding a common gradient descent direction for all losses, but such methods operate by removing suboptimal gradient components. Such reductive processes are still susceptible to local minima and discourage inter-task competition, competition which evidence suggests can be beneficial [6, 46]. Our proposed method not only provides theoretical guarantees of joint loss minima but also allows gradients to compete, and thus avoids the same pitfalls as reductive gradient algorithms.
To the best of our knowledge, our method is the first with this set of desirable properties.

We motivate our method, Gradient Sign Dropout (GradDrop), by noting that when multiple gradient values try to update the same scalar within a deep network, conflicts arise through differences in sign between the gradient values. Following these gradients blindly leads to gradient tug-of-wars and to critical points where constituent gradients can still be large (and thus some tasks perform poorly).

To alleviate this issue, we demand that all gradient updates are pure in sign at every update position. Given a list of (possibly) conflicting gradient values, we algorithmically select one sign (positive or negative) based on the distribution of gradient values, and mask out all gradient values of the opposite sign. A basic schematic of the method is presented in Figure 1.

Figure 1: GradDrop schematic for two losses and one scalar. In both cases, we calculate P (from Equation 1), which tells us the probability of keeping ∇s with positive signs. On the left, P = 0.5 · (1 + (3 + 1)/(|3| + |1|)) = 1.0, so we keep positive ∇s with 100% probability. On the right, P = 0.5 · (1 + (7 − 3)/(|7| + |−3|)) = 0.7, so we keep positive ∇s with 70% probability.

The motivation behind GradDrop parallels the well-known relationship between gradient stochasticity and model robustness [18, 39, 40]. When a network finds a narrow, low-quality minimum, the inherent noise within the batched gradient updates serves to kick the model into broader, more robust minima. Similarly, GradDrop assigns a quality score to each gradient update based on its sign consistency, and adds stochasticity along axes where gradients tend to conflict more. An important consequence of this logic is that GradDrop continues triggering until the model finds a minimum that is a joint minimum for all losses (see Section 4.1 for proof).

Our primary contributions are as follows:
1. We present Gradient Sign Dropout (GradDrop), a modular layer that works in any network with multiple gradient signals and incurs no additional compute at inference.
2. We show theoretically and in simulation that GradDrop leads to more stable convergence points than naïve gradient descent algorithms.
3. We demonstrate the efficacy of GradDrop on multitask learning, transfer learning, and complex single-task models like 3D object detectors, for a variety of network architectures.

2 Related Work

Optimization via gradient descent is one of the key pillars of deep learning. Apart from the traditional optimization methods [8, 19, 32, 33, 49], there has been a research thrust on developing different ways to apply gradients to deep networks [2, 7, 10, 15, 36, 45, 50]. The success of such methods comes in part because optimization in single-task models generally converges to high-quality minima [4]. Also important is the relationship between stochasticity and model robustness; as with GradDrop, noisy gradients help repel poor local minima in favor of wider, more robust critical points [18, 39, 40]. These insights are crucial and worth revisiting for multitask environments.

Multitask learning presents a challenging problem for optimization, as the loss surface now consists of many smaller loss surfaces. As a subject of study, multitask learning predates deep learning [1, 6], but its power in helping model generalization and transferring information between correlated tasks [30, 48] makes it especially relevant in the deep learning era. Although a large part of multitask research focuses on developing new network architectures [16, 20, 23, 25, 28, 29, 31] or new loss functions [17], we focus on methods that explicitly interact with the gradients, which tend to be more lightweight and modular. GradNorm [3] modifies gradient magnitudes to ensure that tasks train at approximately the same rate. MGDA, the Multiple Gradient Descent Algorithm [6, 37], finds a linear combination of gradients that reduces every loss function simultaneously.
PCGrad [47] projects conflicting gradients onto each other, which achieves a similar simultaneous descent effect as MGDA.

Many other applications which are not traditionally considered multitask can benefit from this work. Vision applications such as object detection [24, 34, 35, 51] and instance segmentation [11] explicitly construct multiple losses to arrive at one consolidated result. Language models that employ seq2seq predictions [44] make multiple predictions and create multiple gradient conflicts when backpropagating through time. Domain adaptation and transfer learning [9, 12, 43], topics in which many powerful specialized techniques have been developed, still often rely on multiple losses and thus can benefit from general multitask approaches. Our approach here, although wrapped in the language of multitask learning, has a much wider range of applicability to deep models in general.

3 Gradient Dropout

3.1 Basic Concepts

Gradient Sign Dropout is applied as a layer in any standard network forward pass, usually on the final layer before the prediction head to save on compute overhead and maximize benefits during backpropagation. In this section, we develop the GradDrop formalism. Throughout, ◦ denotes elementwise multiplication after any necessary tiling operations are completed.

To implement GradDrop, we first define the Gradient Positive Sign Purity, P, as

    P = (1/2) ( 1 + (Σ_i ∇L_i) / (Σ_i |∇L_i|) ).        (1)

P is bounded by [0, 1]. For multiple gradient values ∇_a L_i at some scalar a, we see that P = 0 if ∇_a L_i < 0 ∀i, while P = 1 if ∇_a L_i > 0 ∀i. Thus, P is a measure of how many positive gradients are present at any given value. We then form a mask for each gradient M_i as follows:

    M_i = I[f(P) > U] ◦ I[∇L_i > 0] + I[f(P) < U] ◦ I[∇L_i < 0],        (2)

where U is drawn uniformly from [0, 1] and f is a monotonically increasing function (the identity by default; see Proposition 3 for the general form f(p) = k(p − 0.5) + 0.5). The masked gradients are then summed to form the GradDrop update signal,

    ∇^(GD) = Σ_i M_i ◦ ∇L_i.        (3)

When the gradient signals are batch-separated, that is, when different losses act on different examples within the batch (as in transfer learning), the same mask is instead formed from batch-marginalized signals G_i in place of ∇L_i:

    M_i = I[f(P) > U] ◦ I[G_i > 0] + I[f(P) < U] ◦ I[G_i < 0].

Finally, GradDrop admits a per-loss leak parameter ℓ_i ∈ [0, 1]: the gradient returned for loss i is ℓ_i ∇L_i + (1 − ℓ_i) ∇L_i^(graddrop), where ∇L_i^(graddrop) = M_i ◦ ∇L_i is the masked gradient. Setting ℓ_i > 0 allows some original gradient to leak through, which is useful when losses have different priorities; for example, in transfer learning, we prioritize performance on the transfer set. For more details see Section 4.3.

¹ The initialization of the virtual layer is not only meant to keep the forward logic trivial. It is relevant also in the derivation of Equation 3, as it gives us that ∇_A L_i = W(A) ◦ ∇_{W(A)◦A} L_i = ∇_{W(A)◦A} L_i.
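To make the procedure above concrete, here is a minimal NumPy sketch of one GradDrop step at a single activation layer, assuming the per-loss gradients with respect to that layer have already been computed (for example by one backward pass per loss) and collapsed over the batch dimension. The function name, argument names, and the optional `leak` argument (implementing the leak parameter ℓ_i described above) are illustrative choices, not the paper's reference implementation.

```python
import numpy as np

def graddrop(grads, k=1.0, leak=None, rng=None, eps=1e-12):
    """Sketch of one GradDrop step at a single activation layer (Eqs. 1-3).

    grads: array [num_losses, dim] of per-loss gradients at the layer,
           already collapsed over the batch dimension.
    k:     slope of the activation f(p) = k*(p - 0.5) + 0.5; k = 1 is the
           default, k = 0 recovers Random GradDrop.
    leak:  optional array [num_losses] of leak parameters l_i in [0, 1];
           l_i = 1 passes that loss's original gradient straight through.
    Returns the masked per-loss gradients; summing over axis 0 gives the
    GradDrop update signal.
    """
    rng = np.random.default_rng() if rng is None else rng
    grads = np.asarray(grads, dtype=np.float64)

    # Gradient Positive Sign Purity (Eq. 1), one value per position.
    P = 0.5 * (1.0 + grads.sum(axis=0) / (np.abs(grads).sum(axis=0) + eps))

    # Activation function and one uniform draw per position.
    f_P = k * (P - 0.5) + 0.5
    keep_positive = f_P > rng.uniform(size=P.shape)

    # Eq. 2: keep positive-signed entries where f(P) > U, negative otherwise.
    mask = np.where(keep_positive, grads > 0, grads < 0)
    dropped = mask * grads

    if leak is not None:
        l = np.asarray(leak, dtype=np.float64)[:, None]
        dropped = l * grads + (1.0 - l) * dropped
    return dropped

# Two conflicting gradients at one scalar, as in the right panel of Figure 1:
g = np.array([[7.0], [-3.0]])
update = graddrop(g).sum(axis=0)   # [7.] with prob. 0.7, [-3.] with prob. 0.3
```

Note that for k = 0 the activation is constant at 0.5, so a sign is chosen uniformly at random at every position, which matches the Random GradDrop variant discussed later.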
3.4 GradDrop Theoretical Properties

We now present and prove the main theoretical properties for our proposed GradDrop algorithm.

Proposition 1 (GradDrop stable points are joint minima): Given loss functions L_1, ..., L_n and any collection of scalars W for which ∇_w L_1, ..., ∇_w L_n are well-defined, the GradDrop update signal ∇^(GD) at any position w ∈ W is always zero if and only if ∇_w L_i = 0, ∀i.

Proof: Consider n loss functions, indexed L_1, ..., L_n, and their gradients ∇_w L_i for w ∈ W. Clearly, if ∇_w L_i = 0, ∀i, then that w is trivially a critical point for the sum loss Σ_i L_i. However, the converse is also true under GradDrop updates. Namely, if there exists some j for which ∇_w L_j ≠ 0, without loss of generality assume that ∇_w L_j > 0. According to Equation 1, P > 0 at w. Thus f(P) > 0 (as it is monotonically increasing), so there is a nonzero (f(P)) chance that we keep all positive signed gradients, and thus a nonzero chance that ∇_w^(GD) ≥ ∇_w L_j > 0. □

Proposition 2 (GradDrop ∇ norms sensitive to every loss): Given continuous component loss functions L_i(w) with local minima w^(i) and a GradDrop update ∇^(GD), then to second order around each w^(i), E[|∇^(GD)L|_2] is monotonically increasing w.r.t. |w − w^(i)|, ∀i.

Proof: Set δ := d δ_0 for |δ_0| = 1. To second order around a minimum value w^(i), a loss function has the form

    L_i(w^(i) + δ) ≈ L_i(w^(i)) + (1/2) δᵀ H^(L_i)(w^(i)) δ = L_i(w^(i)) + (1/2) d² δ_0ᵀ H^(L_i)(w^(i)) δ_0

for positive definite Hessian H^(L_i). Because δ_0ᵀ H^(L_i)(w^(i)) δ_0 > 0, ∇L_i at w^(i) + δ is proportional to d. As d increases, so will the magnitude of each ∇L_i component, which then immediately increases the total expected gradient magnitude induced by GradDrop. □

From Proposition 1, we see that GradDrop will result in a zero gradient update only when the system finds a perfect joint minimum between all component losses. Not only that, but Proposition 2 implies that GradDrop induces proportionally larger gradient updates with distance from any component loss function minimum, regardless of the value of the total loss. The error signals induced by GradDrop are thus sensitive to every task, rather than relying only on a sum signal. This sensitivity also increases monotonically with distance from any close local minimum for any component task. Thus, GradDrop optimization will seek out joint minima, but even when such minima do not strictly exist, Proposition 2 shows GradDrop will seek out system states that are at least close to joint minima. For a clear illustration of this effect in one dimension, please refer to Section 4.1.

A potential concern could be that by being sensitive to every loss function, GradDrop updates are too noisy and the overall system trains more slowly. However, that is not the case, as GradDrop updates are on expectation equivalent to standard SGD updates.

Proposition 3 (Statistical Properties): Suppose for a 1D loss function L = Σ_i L_i(w) an SGD gradient update with learning rate λ changes the total loss by the linear estimate ΔL^(SGD) = −λ|∇L|² ≤ 0. For GradDrop with activation function (see Eq. 2) f(p) = k(p − 0.5) + 0.5 for k ∈ [0, 1] (with default setting k = 1), we have:
1. For k = 1, ΔL^(SGD) = E[ΔL^(GD)].
2. E[ΔL^(GD)] ≤ 0 and has magnitude monotonically increasing with k.
3. Var[ΔL^(GD)] is monotonically decreasing with respect to k.

We present the proof of this proposition in Appendix A.1, along with a generalization to arbitrary activation functions. □

Importantly, even though GradDrop has a stochastic element, it provides the same expected movement in the total loss function as vanilla SGD. Also important is the hyperparameter k, which controls the tradeoff between how much the GradDrop update follows the overall gradient and how much noise GradDrop induces for inconsistent gradients. A smaller value of k implies a larger penalty/noise scale, and a value of k = 0 means we randomly choose a sign for every gradient value. We call the k = 0 case Random GradDrop and show it generally compares unfavorably to k > 0, but our evidence does not preclude a situation where the higher noise in the k = 0 case may be desirable. Indeed, in most of our experiments the k = 0 Random GradDrop setting still outperforms the baseline.

4 Experiments with GradDrop

In this section we present the main experimental results related to GradDrop. All experiments are run on NVIDIA V100 GPU hardware. We provide relevant hyperparameters within the main text, but relegate a complete listing of hyperparameters to the Appendix. We also rely exclusively on standard public datasets, and thus move discussion of most dataset properties to the Appendices.

All multitask baselines (including PCGrad) and the GradDrop layer are applied to the final layer before the prediction heads to keep compute overhead tractable. We primarily compare to other state-of-the-art multitask methods, which include GradNorm [3], MGDA [37], and PCGrad [47]. Descriptions of all these methods were given in Section 2.

For completeness, we also compare to Gradient Clipping (e.g. [50]) and Gradient Penalty [10]. Although not strictly multitask methods, these gradient-based methods enjoy wide popularity and will provide evidence that principled single-task methods are not enough to optimize a true multitask model.

4.1 A Simple One-Dimensional Example

Figure 2: GradDrop toy example. (a) Sum of sinusoids loss function: a synthetic 1D loss function composed of five sines. (b) Loss curves for one random run of GradDrop and baselines, given a random initialization of the trainable weight. (c) Summary results for 200 runs: box plot of final converged loss values when the methods in (b) are run 200 times.

We illustrate GradDrop in one dimension; a minimal simulation sketch of this setup is given below.
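The following is a small, self-contained simulation sketch of the one-dimensional setup: five sine losses of the form sin(ax + b) + 1 with the (a, b) pairs listed in Appendix A.3, optimized either by plain gradient descent on the sum loss or by the GradDrop rule with k = 1. The step count and learning-rate schedule follow Appendix A.3; the initialization range and helper names are assumptions made for illustration.

```python
import numpy as np

# (a, b) parameters of the five sine losses L_i(x) = sin(a*x + b) + 1,
# as listed in Appendix A.3.
PARAMS = [(1.0, 0.0), (1.5, 0.2), (2.0, 0.4), (2.5, 0.6), (5.0, 0.8)]

def loss(x):
    return sum(np.sin(a * x + b) + 1.0 for a, b in PARAMS)

def component_grads(x):
    # dL_i/dx = a * cos(a*x + b)
    return np.array([a * np.cos(a * x + b) for a, b in PARAMS])

def run(use_graddrop, x0, steps=10000, lr0=0.2, rng=None):
    """Gradient descent on the toy sum loss; 1D GradDrop when use_graddrop=True."""
    rng = np.random.default_rng() if rng is None else rng
    x, lr = x0, lr0
    for step in range(steps):
        g = component_grads(x)
        if use_graddrop:
            # Eq. (1) in one dimension, then keep a single sign (k = 1, f(P) = P).
            P = 0.5 * (1.0 + g.sum() / (np.abs(g).sum() + 1e-12))
            update = g[g > 0].sum() if rng.uniform() < P else g[g < 0].sum()
        else:
            update = g.sum()
        x -= lr * update
        if (step + 1) % 1000 == 0:   # decay ratio 0.5 every 1k steps (Appendix A.3)
            lr *= 0.5
    return x

x0 = np.random.default_rng(0).uniform(-5.0, 5.0)  # assumed range; paper only says "random"
for name, flag in [("plain GD", False), ("GradDrop", True)]:
    xf = run(flag, x0, rng=np.random.default_rng(1))
    print(f"{name}: final x = {xf:.3f}, final loss = {loss(xf):.3f}")
```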
In Figure 2 we present results on a simple toy system, with a loss function that is the sum of five sines of the form L(x; a, b) = sin(ax + b) + 1. The final loss is shown in Figure 2(a). Note that although each L_i has identical periodic local minima, the sum loss has a wide distribution of local minima of variable quality.

We now initialize the one weight w to a random value and run various optimization techniques for 10000 steps. In Figure 2(b) we plot the loss curves for one example trial. We note that PCGrad [47] does not train in this low-dimensional setting, as any sign conflict would result in PCGrad zeroing the gradients. For fairness, we include a slight modification of PCGrad called Iterative PCGrad which still works in low dimensions (for details see Appendix). We also include Random GradDrop, which is a weak version of GradDrop where f(P) is set to 0.5 everywhere. We see that GradDrop has the best performance of all methods tested. Such a conclusion is further reinforced when we run this experiment 200 times and plot the statistics of the final results, which are shown in Figure 2(c). Multiple algorithms (GradDrop, Random GradDrop, and Iterative PCGrad) tend to find the deepest minimum, but GradDrop still performs better. We attribute this to the success of our sign purity measure P at properly emphasizing gradient directions with higher levels of consistency.

4.2 Multitask Learning on Celeb-A

We first test GradDrop on the multitask learning dataset CelebA [26], which provides 40 binary attributes based on celebrity facial photos. CelebA allows us to test GradDrop in a truly archetypal multitask setting. We also use a standard shallow convolutional network to perform this task. Our network consists only of common layers (Conv, Pool, Batchnorm, FC layers) and contains 9 total layers along with 40 predictive heads. The results of our experiments are summarized in Figure 3 and Table 1.

We see that GradDrop outperforms all other methods. Although the improvements may seem mild in Table 1, they are substantial for this dataset and Figure 3(a) reveals a visually significant effect. Figure 3(b) also shows an ablation study of performance when we choose to marginalize our gradient signal across the batch dimension, as suggested by Section 3.2. Although our gradient signal for CelebA is not batch-separated and thus we are not strictly required to sum the GradDrop signal across our batches, this operation improves GradDrop's memory and compute efficiency, and also can clearly improve model performance. As there are thus few disadvantages from using the sum-over-batch strategy, all further GradDrop runs in this paper will use sum-over-batch.

Figure 3: Experiments with GradDrop on CelebA. (a) CelebA maximum F1 scores. (b) GradDrop batch marginalization. (c) Gradient consistency over time.

Table 1: Multitask Learning on CelebA. We repeat training runs and report standard deviations of ≤ 0.04% for F1 Score and ≤ 0.02% for accuracy.

Method | Error Rate (%) ↓ | Max F1 Score ↑ | Speed Compared to Baseline ↑
Baseline | 8.71 | 29.35 | 1.00
Gradient Clipping [50] | 8.70 | 29.34 | 1.00
Gradient Penalty [10] | 8.63 | 29.43 | 0.35
MGDA [37] | 10.82 | 26.00 | 0.25
PCGrad [47] | 8.72 | 29.25 | 0.20
GradNorm [3] | 8.68 | 29.32 | 0.41
Random GradDrop | 8.60 | 29.42 | 0.45
GradDrop (ours) | 8.52 | 29.57 | 0.45

Furthermore, Figure 3(c) plots the percentage of gradients passed by the GradDrop layer, for both a GradDrop model and a baseline model.² This percentage correlates to the degree of sign consistency of gradients at the GradDrop layer. This metric does not improve at all when training the baseline, but improves appreciably when GradDrop is enabled, suggesting that the critical points found by GradDrop have more consistent gradients and thus higher probability of being a joint minimum.

It is interesting to note that GradDrop also overfits less. We posit that GradDrop is a good regularizer due to its tendency to reject weak loss minima that may overfit. The only stronger regularizer may be GradNorm [3], but GradNorm explicitly curtails overfitting with its α hyperparameter.
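As an aside on the sum-over-batch strategy adopted above: the exact marginalization rule of Section 3.2 is only partially recoverable from this text, but following the intuition in Appendix A.2 (compare gradients after premultiplying by the sign of the corresponding activation, then sum across the batch), one plausible reading is sketched below. The function name and shapes are assumptions for illustration.

```python
import numpy as np

def batch_marginalized_signal(acts, grads):
    """Collapse batch-separated per-example gradients into one signal per loss.

    acts:  activations at the GradDrop layer, shape [batch, dim].
    grads: per-loss, per-example gradients at the same layer,
           shape [num_losses, batch, dim].
    Returns an array [num_losses, dim] whose sign at each position reflects
    whether each loss pushes the activation toward or away from zero, which
    is the quantity GradDrop compares across losses.
    """
    sign_a = np.sign(acts)[None, :, :]      # [1, batch, dim]
    return (sign_a * grads).sum(axis=1)     # [num_losses, dim]

# The purity P and the masks of Eqs. 1-2 are then computed from this
# marginalized signal (e.g. by passing it to the graddrop() sketch above),
# and the resulting per-loss masks gate the corresponding gradients.
```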
CelebA with its T = 40 tasks also presents us with an excellent opportunity to test method speed. Looking at the last column of Table 1, we see that GradDrop is the fastest of the multitask methods tried (not counting gradient clipping, which is a general single-task method), possibly because it only requires a simple O(T) calculation at each tensor position rather than multiple iterative steps like MGDA or O(T²) orthogonal projections like PCGrad.

4.3 Transfer Learning on CIFAR-100

We now use GradDrop in a transfer learning setting, which is a batch-separated setting (see Section 3.2). We transfer ImageNet2012 [5] to CIFAR-100 [21] by using input batches consisting of half CIFAR-100 and half ImageNet2012 examples. Each dataset has its own predictive head and loss.

We use a more complex network based on DenseNet-100 [13], both to increase performance and to test GradDrop with more complex network topologies. Our results are shown in Table 2 and Figure 4, where we present the best accuracy achieved by each method and the corresponding loss³; we include the loss as it is generally smoother.

We see that the best model uses a combination of GradDrop and GradNorm [3], although the GradDrop-only model also performs well. As in the CelebA experiments presented in Section 4.2, the performance gap is larger when the baseline models overfit later in training. The general synergy between GradDrop and other multitask methods such as GradNorm is important, as it suggests GradNorm can add to complex models which already employ an array of pre-existing deep learning tools. We explore this synergy further in Section 4.5.

² For the baseline model, this statistic is hypothetical and no gradients are actually masked.
³ This is the loss that corresponds to the highest accuracy model, not the model with the lowest loss. However, reporting the latter would not change the trend.

Figure 4: Accuracy and loss curves for CIFAR-100 transfer learning experiments: (a) CIFAR-100 accuracy; (b) CIFAR-100 loss. In all cases Gradient Dropout outperforms all other methods tried.

Table 2: Transfer Learning from ImageNet2012 to CIFAR-100. We repeat training runs and observe standard deviations of ≤ 0.2% accuracy and ≤ 0.01 loss.

Method | Top-1 Error (%) ↓ | Test Loss ↓
Train on CIFAR-100 Only | 33.6 | 1.52
Mixed Batch (MB) | 29.8 | 1.22
MB + Gradient Clipping [50] | 29.4 | 1.22
MB + Gradient Penalty [10] | 30.6 | 1.28
MB + MGDA [37] | 29.7 | 1.17
MB + GradNorm [3] | 29.4 | 1.11
MB + GradDrop (ours) | 29.1 | 1.08
MB + GradNorm [3] + Random GradDrop | 29.8 | 1.04
MB + GradNorm [3] + GradDrop (ours) | 28.9 | 1.01

For our final GradDrop model we use a leak parameter ℓ_i set to 1.0 for the source set. In this setting, source set gradients are allowed to flow unimpeded but transfer set gradients are masked. This setting is optimal as the source dataset is usually larger and the masking effectively curtails overfitting on the transfer dataset. For more experiments related to the leak parameter, see Section A.5.

4.4 3D Point Cloud Detection on Waymo Open Dataset

We now present results on a much more complex problem: 3D vehicle detection from point clouds on the Waymo Open Dataset [42]. For this task we use a PointPillar model [22], a complex and competitive 3D detection architecture that voxelizes a point cloud and uses standard 2D convolutions to derive deep predictive features. We also note that object detection is traditionally considered a single-task problem, but it still has multiple losses: 3 for the coordinates of the box centers, 3 for the dimensions of the box, 1 on box orientation, and (in our formulation) 2 classifiers for box motion direction and box class. Our results thus show that GradDrop is applicable in a much wider context than the traditional explicit interpretation of "multitask learning" might imply.

Our main results are shown in Table 3, where we show Average Precision (AP) and Average Precision with Heading (APH) scores (for training curves see Appendix). APH is a metric introduced in [42], which penalizes boxes for being 180° mis-oriented.
All runs include gradient clipping at norm 1.0, and we are unable to compare to gradient penalty due to memory restrictions. GradDrop results in marked improvements, especially in the APH metrics. We also note that like the gradient norm methods (which focus on the overall magnitude of gradients rather than their high-dimensional content), GradDrop provides a moderate boost in 2D performance. However, GradDrop does not suffer from the same substantial regressions in 3D performance, and instead improves all metrics across the board.

Table 3: Object Detection from Point Clouds on the Waymo Open Dataset. We report standard deviations of ≤ 0.3% on AP values and ≤ 0.5% on APH values.

Method | 2D AP (%) ↑ | 2D APH (%) ↑ | 3D AP (%) ↑ | 3D APH (%) ↑
Baseline | 76.2 | 69.9 | 57.1 | 53.0
Gradient Norm Methods:
MGDA [37] | 76.8 | 69.5 | 20.0 | 18.3
GradNorm [3] | 76.9 | 71.7 | 51.0 | 48.2
Full Gradient Tensor Methods:
PCGrad [47] | 76.2 | 70.2 | 58.4 | 54.4
Random GradDrop | 76.4 | 66.6 | 57.6 | 50.5
GradDrop (ours) | 76.8 | 72.4 | 58.8 | 56.0

Table 4: Synergy Between GradDrop and GradNorm

Method | CelebA: Err Rate (%) ↓ | CelebA: F1_max ↑ | Waymo OD: 3D AP (%) ↑ | Waymo OD: 3D APH (%) ↑
GradNorm Only | 8.68 | 29.32 | 51.0 | 48.2
GradNorm + GradDrop (ours) | 8.57 | 29.50 | 55.1 | 51.5

4.5 Synergy with Gradient Normalization and Other Methods

One important property of GradDrop is that it primarily modifies the gradient tensor direction, which is then largely left alone by other deep learning techniques. In principle, GradDrop can thus be applied in parallel with other multitask methods. In this section, we demonstrate positive interactions between GradDrop and GradNorm [3], evidence that GradDrop can be considered a modular part of a diverse toolset which can be applied in a wide array of applications.

Our main results regarding synergy between GradDrop and GradNorm are summarized in Table 4. Along with the CIFAR-100 results in Section 4.3, we find GradDrop often leads to significant improvements when applied with GradNorm. This is especially true where GradNorm performs poorly; for example, although GradNorm tends to regress in the 3D AP metrics compared to baseline, GradDrop + GradNorm recovers much of that performance while still performing well in the 2D AP metrics (see Appendix for 2D AP numbers). We also experimented with GradDrop + MGDA, but with limited success. We hypothesize that MGDA works best when input tensors have explicitly conflicting signs, while GradDrop's final gradient tensors have the same sign (or zero) at all positions.

From an efficiency standpoint, applying GradDrop on top of GradNorm or MGDA comes essentially for free; both GradNorm and MGDA already require us to calculate ∇_W L_i, ∀i, which is the most expensive step in GradDrop. And because we know GradDrop is faster than the other methods described (see Table 1), the additional compute to add GradDrop is small.

5 Conclusions

We have presented Gradient Sign Dropout (GradDrop), a method that turns additive gradient signals into a sum signal that is pure in sign and encourages the network to seek out joint minima. From a theoretical standpoint, GradDrop provides superior behavior in the face of suboptimal local minima, and also works for a wide array of network architectures and multitask learning settings.

Apart from our concrete contributions, we also hope that GradDrop will invigorate discussion regarding how best to optimize the complex loss surfaces induced by multitask learning. Our results suggest that the traditional faith in standard gradient descent methods may not describe the full picture, and a realignment of our understanding of optimization robustness to include multitask concepts and gradient stochasticity is prudent as models become ever more complex. We present GradDrop as a crucial early piece of this increasingly important puzzle.

6 Broader Impacts

In this paper we presented GradDrop, a general algorithm that can be used as a modular addition to multitask models.
Atitscore,ourcontributionisthedevelopmentofageneralmachinelearning algorithmwithoutanyassumptionsofspecificapplications,sothepotentialbroaderimpactsofour workisdependentontheapplicationarea. However,itisalsotruethatmultitasklearningoperatesbyattemptingtoleveragemultiplesources of potentially disparate information and making joint predictions based on those sources. When appliedcorrectly,multitaskmodelscanbelesspronetobias/unfairnessastheyhaveaccesstoalarger, morediversesourceofinformation. However,whenappliedincorrectly,multitaskmodelsmayend upreinforcingthesamebiasesthatwewanttoeliminate;imagine,forexample,multitaskmodels whichmakepredictionsseparatelyfordifferentsubpopulationsoftheinputdatasetandduetolack ofpropertrainingdynamicsendupoverfittingtoeachinturn. Ourproposedalgorithmmayhave beneficialeffectsincombatingsuchoverfitting,asouralgorithmiseffectiveatfindingjointsolutions that consistently take all available information into account. As such, we believe that GradDrop willhaveapositivebroaderimpactonmachinelearningworkbyprovidingwaystoarriveatbetter regularizedsolutionsthataremorereflectiveofreality. References [1] R.Caruana. Multitasklearning. Machinelearning,28(1):41–75,1997. [2] J.ChenandQ.Gu. Closingthegeneralizationgapofadaptivegradientmethodsintrainingdeepneural networks. arXivpreprintarXiv:1806.06763,2018. [3] Z.Chen,V.Badrinarayanan,C.-Y.Lee,andA.Rabinovich.Gradnorm:Gradientnormalizationforadaptive lossbalancingindeepmultitasknetworks. InInternationalConferenceonMachineLearning, pages 794–803,2018. [4] A.Choromanska,M.Henaff,M.Mathieu,G.B.Arous,andY.LeCun. Thelosssurfacesofmultilayer networks. InArtificialintelligenceandstatistics,pages192–204,2015. [5] J.Deng,W.Dong,R.Socher,L.-J.Li,K.Li,andL.Fei-Fei. Imagenet:Alarge-scalehierarchicalimage database. In2009IEEEconferenceoncomputervisionandpatternrecognition,pages248–255.Ieee, 2009. [6] J.-A.Désidéri. Multiple-gradientdescentalgorithm(mgda)formultiobjectiveoptimization. Comptes RendusMathematique,350(5-6):313–318,2012. [7] T.Dozat. Incorporatingnesterovmomentumintoadam. 2016. [8] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journalofmachinelearningresearch,12(Jul):2121–2159,2011. [9] Y. Ganin and V. Lempitsky. Unsupervised domain adaptation by backpropagation. arXiv preprint arXiv:1409.7495,2014. [10] I.Gulrajani,F.Ahmed,M.Arjovsky,V.Dumoulin,andA.C.Courville. Improvedtrainingofwasserstein gans. InAdvancesinneuralinformationprocessingsystems,pages5767–5777,2017. [11] K.He,G.Gkioxari,P.Dollár,andR.Girshick. Maskr-cnn. InProceedingsoftheIEEEinternational conferenceoncomputervision,pages2961–2969,2017. [12] J. Hoffman, E. Tzeng, T. Park, J.-Y. Zhu, P. Isola, K. Saenko, A. A. Efros, and T. Darrell. Cycada: Cycle-consistentadversarialdomainadaptation. arXivpreprintarXiv:1711.03213,2017. [13] G.Huang,Z.Liu,L.VanDerMaaten,andK.Q.Weinberger. Denselyconnectedconvolutionalnetworks. InProceedingsoftheIEEEconferenceoncomputervisionandpatternrecognition,pages4700–4708, 2017. [14] S.IoffeandC.Szegedy. Batchnormalization:Acceleratingdeepnetworktrainingbyreducinginternal covariateshift. arXivpreprintarXiv:1502.03167,2015. [15] M.Jaderberg, W.M.Czarnecki, S.Osindero, O.Vinyals, A.Graves, D.Silver, andK.Kavukcuoglu. Decoupledneuralinterfacesusingsyntheticgradients.InProceedingsofthe34thInternationalConference onMachineLearning-Volume70,pages1627–1635.JMLR.org,2017. [16] L.Kaiser,A.N.Gomez,N.Shazeer,A.Vaswani,N.Parmar,L.Jones,andJ.Uszkoreit. Onemodelto learnthemall. 
arXivpreprintarXiv:1706.05137,2017. [17] A.Kendall,Y.Gal,andR.Cipolla.Multi-tasklearningusinguncertaintytoweighlossesforscenegeometry andsemantics. InProceedingsoftheIEEEconferenceoncomputervisionandpatternrecognition,pages 7482–7491,2018. [18] N.S.Keskar,D.Mudigere,J.Nocedal,M.Smelyanskiy,andP.T.P.Tang. Onlarge-batchtrainingfor deeplearning:Generalizationgapandsharpminima. arXivpreprintarXiv:1609.04836,2016. [19] D.P.KingmaandJ.Ba. Adam:Amethodforstochasticoptimization. arXivpreprintarXiv:1412.6980, 2014. 10[20] I.Kokkinos. Ubernet:Trainingauniversalconvolutionalneuralnetworkforlow-,mid-,andhigh-level visionusingdiversedatasetsandlimitedmemory. InProceedingsoftheIEEEConferenceonComputer VisionandPatternRecognition,pages6129–6138,2017. [21] A.Krizhevsky,G.Hinton,etal. Learningmultiplelayersoffeaturesfromtinyimages. 2009. [22] A.H.Lang,S.Vora,H.Caesar,L.Zhou,J.Yang,andO.Beijbom. Pointpillars:Fastencodersforobject detectionfrompointclouds. InCVPR,pages12697–12705,2019. [23] S.Liu,E.Johns,andA.J.Davison. End-to-endmulti-tasklearningwithattention. InProceedingsofthe IEEEConferenceonComputerVisionandPatternRecognition,pages1871–1880,2019. [24] W.Liu,D.Anguelov,D.Erhan,C.Szegedy,S.Reed,C.-Y.Fu,andA.C.Berg. Ssd:Singleshotmultibox detector. InEuropeanconferenceoncomputervision,pages21–37.Springer,2016. [25] X.Liu,P.He,W.Chen,andJ.Gao. Multi-taskdeepneuralnetworksfornaturallanguageunderstanding. arXivpreprintarXiv:1901.11504,2019. [26] Z.Liu,P.Luo,X.Wang,andX.Tang. Large-scalecelebfacesattributes(celeba)dataset. RetrievedAugust, 15:2018,2018. [27] I. Loshchilov and F. Hutter. Sgdr: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983,2016. [28] M.-T.Luong,Q.V.Le,I.Sutskever,O.Vinyals,andL.Kaiser. Multi-tasksequencetosequencelearning. arXivpreprintarXiv:1511.06114,2015. [29] J.Ma,Z.Zhao,X.Yi,J.Chen,L.Hong,andE.H.Chi. Modelingtaskrelationshipsinmulti-tasklearning withmulti-gatemixture-of-experts. InProceedingsofthe24thACMSIGKDDInternationalConferenceon KnowledgeDiscovery&DataMining,pages1930–1939,2018. [30] E.MeyersonandR.Miikkulainen. Pseudo-taskaugmentation:Fromdeepmultitasklearningtointratask sharing—andback. arXivpreprintarXiv:1803.04062,2018. [31] I.Misra,A.Shrivastava,A.Gupta, andM.Hebert. Cross-stitchnetworksformulti-tasklearning. In ProceedingsoftheIEEEConferenceonComputerVisionandPatternRecognition,pages3994–4003, 2016. [32] Y.Nesterov. Amethodforunconstrainedconvexminimizationproblemwiththerateofconvergenceo (1/kˆ2). InDokladyanussr,volume269,pages543–547,1983. [33] N.Qian.Onthemomentumtermingradientdescentlearningalgorithms.Neuralnetworks,12(1):145–151, 1999. [34] J.Redmon,S.Divvala,R.Girshick,andA.Farhadi.Youonlylookonce:Unified,real-timeobjectdetection. InProceedingsoftheIEEEconferenceoncomputervisionandpatternrecognition,pages779–788,2016. [35] S.Ren,K.He,R.Girshick,andJ.Sun. Fasterr-cnn: Towardsreal-timeobjectdetectionwithregion proposalnetworks. InAdvancesinneuralinformationprocessingsystems,pages91–99,2015. [36] S.Ruder. Anoverviewofgradientdescentoptimizationalgorithms. arXivpreprintarXiv:1609.04747, 2016. [37] O.SenerandV.Koltun. Multi-tasklearningasmulti-objectiveoptimization. InAdvancesinNeural InformationProcessingSystems,pages527–538,2018. [38] A.Sinha,Z.Chen,V.Badrinarayanan,andA.Rabinovich.Gradientadversarialtrainingofneuralnetworks. arXivpreprintarXiv:1806.08028,2018. [39] S.L.Smith,P.-J.Kindermans,C.Ying,andQ.V.Le. Don’tdecaythelearningrate,increasethebatchsize. arXivpreprintarXiv:1711.00489,2017. [40] S.L.SmithandQ.V.Le. 
Abayesianperspectiveongeneralizationandstochasticgradientdescent. arXiv preprintarXiv:1710.06451,2017. [41] N.Srivastava,G.Hinton,A.Krizhevsky,I.Sutskever,andR.Salakhutdinov. Dropout:asimplewayto preventneuralnetworksfromoverfitting. Thejournalofmachinelearningresearch,15(1):1929–1958, 2014. [42] P.Sun,H.Kretzschmar,X.Dotiwalla,A.Chouard,V.Patnaik,P.Tsui,J.Guo,Y.Zhou,Y.Chai,B.Caine, V.Vasudevan, W.Han,J.Ngiam,H.Zhao,A.Timofeev,S.Ettinger, M.Krivokon,A.Gao,A.Joshi, Y.Zhang,J.Shlens,Z.Chen,andD.Anguelov. Scalabilityinperceptionforautonomousdriving:Waymo opendataset. InProceedingsoftheIEEEConferenceonComputerVisionandPatternRecognition,2020. [43] Y.Sun,E.Tzeng,T.Darrell,andA.A.Efros. Unsuperviseddomainadaptationthroughself-supervision. arXivpreprintarXiv:1909.11825,2019. [44] I.Sutskever,O.Vinyals,andQ.V.Le. Sequencetosequencelearningwithneuralnetworks. InAdvances inneuralinformationprocessingsystems,pages3104–3112,2014. [45] H.-Y.Tseng,Y.-W.Chen,Y.-H.Tsai,S.Liu,Y.-Y.Lin,andM.-H.Yang. Regularizingmeta-learningvia gradientdropout. arXivpreprintarXiv:2004.05859,2020. [46] S.Vandenhende,S.Georgoulis,M.Proesmans,D.Dai,andL.VanGool. Revisitingmulti-tasklearningin thedeeplearningera. arXivpreprintarXiv:2004.13379,2020. [47] T.Yu,S.Kumar,A.Gupta,S.Levine,K.Hausman,andC.Finn. Gradientsurgeryformulti-tasklearning. arXivpreprintarXiv:2001.06782,2020. 11[48] A.R.Zamir,A.Sax,W.Shen,L.J.Guibas,J.Malik,andS.Savarese. Taskonomy:Disentanglingtask transferlearning. InProceedingsoftheIEEEConferenceonComputerVisionandPatternRecognition, pages3712–3722,2018. [49] M.D.Zeiler. Adadelta:anadaptivelearningratemethod. arXivpreprintarXiv:1212.5701,2012. [50] J.Zhang, T.He, S.Sra, andA.Jadbabaie. Whygradientclippingacceleratestraining: Atheoretical justificationforadaptivity. arXivpreprintarXiv:1905.11881,2019. [51] Y. Zhou and O. Tuzel. Voxelnet: End-to-end learning for point cloud based 3d object detection. In ProceedingsoftheIEEEConferenceonComputerVisionandPatternRecognition,pages4490–4499, 2018. 12A Appendix Themajorityoftheappendixisdevotedtoafaithfullistingofhyperparameters,datasets,andtraining settingsforallofourexperiments. However,wealsoexpandonsomeintuitionsbehindourtreatment of batch-separated gradients in Section A.2 and present some more experiments on CIFAR-100 transferlearninginSectionA.5andontheWaymoOpenDatasetinSectionA.6. A.1 AddendumonProposition3andChoiceofActivationFunction WebeginwithaproofofProposition3,whichwerewritehereforconvenience: (cid:80) Proposition3: Supposefor1DlossfunctionL= L (w)anSGDgradientupdatewithlearning i i rate λ changes total loss by the linear estimate ∆L(SGD) = −λ|∇L|2 ≤ 0. For GradDrop with activationfunctionf(p)=k(p−0.5)+0.5fork ∈[0,1](withdefaultsettingisk =1),wehave: 1. Fork =1,∆L(SGD) =E[∆L(GD)] 2. E[∆L(GD)]≤0andhasmagnitudemonotonicallyincreasingwithk. 3. Var[∆L(GD)]ismonotonicallydecreasingwithrespecttok. Proof: For simplicity of notation and without loss of generality, let us assume a learning rate of (cid:80) (cid:80) λ = 1. Definep := |∇ |andn := |∇ |tobethetotalabsolutevalueofpositive ∇i≥0 i ∇i<0 i andnegativegradients,respectively. FromthedefinitionofP asinEq. 1,wecaneasilyderivethat P =p/(p+n). 
We then calculate

    f(P) = k(P - 0.5) + 0.5 = 0.5\,\frac{p-n}{p+n}\,k + 0.5        (4)

    1 - f(P) = -0.5\,\frac{p-n}{p+n}\,k + 0.5        (5)

We then note that with total gradient p - n, the value ΔL under GradDrop is precisely

    E[\Delta L^{(GD)}] = -(p-n)\left( f(P)\,p + (1-f(P))(-n) \right)        (6)

    = -(p-n)\left( 0.5\,\frac{p-n}{p+n}\,kp + 0.5\,\frac{p-n}{p+n}\,kn + 0.5p - 0.5n \right)        (7)

    = -0.5(p-n)\left( \frac{k}{p+n}\big( (p-n)p + (p-n)n \big) + (p-n) \right)        (8)

    = -0.5(p-n)\big( k(p-n) + (p-n) \big)        (9)

    = -0.5(k+1)(p-n)^2        (10)

We note that for k = 1, this reduces to -(p-n)², which is precisely ΔL^(SGD), proving the first claim. We also note that the magnitude of this expression is monotonically increasing with k, but it is always negative assuming k ≥ 0, thus proving the second claim.

As for the variance claim, it is straightforward to calculate:

    \mathrm{Var}[\Delta L^{(GD)}] = E[(\Delta L^{(GD)})^2] - (E[\Delta L^{(GD)}])^2        (11)

    = \big( f(P)p^2 + (1-f(P))n^2 \big)(p-n)^2 - \big( 0.5(k+1)(p-n)^2 \big)^2        (12)

    = 0.5(p-n)^2\left( \frac{p-n}{p+n}\,p^2 k - \frac{p-n}{p+n}\,n^2 k + p^2 + n^2 \right) - \big( 0.5(k+1)(p-n)^2 \big)^2        (13)

    = 0.5(p-n)^2\left( \frac{p-n}{p+n}\,(p^2 - n^2)k + p^2 + n^2 - 0.5(k+1)^2(p-n)^2 \right)        (14)

    = 0.5(p-n)^2\big( (p-n)^2 k + p^2 + n^2 - 0.5(k+1)^2(p-n)^2 \big)        (15)

    = 0.5(p-n)^2\big( (p-n)^2(k - 0.5(k+1)^2) + p^2 + n^2 \big)        (16)

    = 0.25(p-n)^2\big( (p-n)^2(-k^2 - 1) + 2p^2 + 2n^2 \big)        (17)

Although not as simple as our expression for the expected value, the variance expression treated as a function of k looks like A(-k² - 1) + B, with A, B ≥ 0, and is thus clearly a monotonically decreasing function of k for k ∈ [0, 1]. The third claim is proven. □

Although Proposition 3 was proven for a specific family of activation functions (i.e. f(p) = k(p - 0.5) + 0.5), it easily extends to the result that any choice of f that is (1) odd around the point (0.5, 0.5), (2) monotonically increasing, and (3) bounded by 0.0 ≤ f(p) ≤ 1.0 will have similar characteristics. Namely, the steeper (formal definition to follow) that f is, the higher its corresponding magnitude of E[ΔL^(GD)] and the lower its variance. Namely,

Corollary 3.1: Take the family of real-valued continuous activation functions F such that f ∈ F if f is defined on the domain [0, 1], odd around (0.5, 0.5), monotonically increasing, and has output bounded by 0 ≤ f(p) ≤ 1 on its domain. We say f ∈ F is steeper than g if f(p) ≥ g(p) when p ≥ 0.5 and f(p) ≤ g(p) otherwise. For f, g ∈ F, if f is steeper than g, call the corresponding expected loss changes E[ΔL^(f)] and E[ΔL^(g)]. Then the following must be true:

1. E[ΔL^(f)] ≤ E[ΔL^(g)] ≤ 0.
2. Var[ΔL^(f)] ≤ Var[ΔL^(g)].

Proof: It is important to note that the proof for Proposition 3 is true for all values of p ≥ 0 and n ≥ 0. That is, the proof for Proposition 3 immediately implies that given any triplet of values (p, n, P), the claims of the proposition are true as a function of k. For P = 1, tuning the value of k allows us to sweep the value of f(1) smoothly from 0.5 to 1, and the corresponding value of f(0) smoothly from 0.5 to 0. Thus, at these two special points, we have access to the full range of possible outcomes. And so if we limit ourselves to the P = 1 and P = 0 cases, we immediately conclude the following:

Given any value of (p, n) and the resultant value of P, if f(P) ≥ g(P) and P ≥ 0.5, or if f(P) ≤ g(P) and P ≤ 0.5, then E[ΔL^(f)] ≤ E[ΔL^(g)] and Var[ΔL^(f)] ≤ Var[ΔL^(g)] as a special case of Proposition 3.

Because the conditions so listed cover every value for every possible valid activation function f and g, the corollary is proven. We also note that E[ΔL^(f)] ≤ 0 for any f ∈ F because the "least steep" activation function is f(p) = 0.5, which we showed in Proposition 3 has an expected ΔL^(f) value of ≤ 0.
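As a quick numerical sanity check of the closed forms just derived, the sketch below estimates E[ΔL^(GD)] and Var[ΔL^(GD)] by simulation for an arbitrary set of 1D gradient components and compares them to Equations (10) and (17) (learning rate λ = 1, as in the proof). The sampling setup and function name are illustrative assumptions, not part of the paper.

```python
import numpy as np

def graddrop_stats(grads, k, trials=200_000, seed=0):
    """Monte Carlo estimate of E[dL] and Var[dL] for a 1D GradDrop update."""
    g = np.asarray(grads, dtype=np.float64)
    p = g[g > 0].sum()           # total positive gradient mass
    n = -g[g < 0].sum()          # total negative gradient mass (>= 0)
    P = p / (p + n)
    f_P = k * (P - 0.5) + 0.5

    rng = np.random.default_rng(seed)
    keep_pos = rng.uniform(size=trials) < f_P
    applied = np.where(keep_pos, p, -n)      # the update actually taken
    dL = -(p - n) * applied                  # linear estimate of the loss change

    exp_closed = -0.5 * (k + 1.0) * (p - n) ** 2                      # Eq. (10)
    var_closed = 0.25 * (p - n) ** 2 * (
        (p - n) ** 2 * (-k ** 2 - 1.0) + 2 * p ** 2 + 2 * n ** 2)     # Eq. (17)
    return dL.mean(), exp_closed, dL.var(), var_closed

for k in (0.0, 0.5, 1.0):
    mc_mean, th_mean, mc_var, th_var = graddrop_stats([2.0, -0.5, 1.5, -1.0], k)
    print(f"k={k}: E[dL] ~ {mc_mean:.3f} (theory {th_mean:.3f}), "
          f"Var[dL] ~ {mc_var:.3f} (theory {th_var:.3f})")
```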
(cid:3) We note that the variance claims in both Proposition 3 and Corollary 3.1 are relatively simple extensionsoftheintuitiveresultthatthevarianceofarandomvariablethatcantakeononlytwo values is maximized when the two values each have a probability weight of 50%. We also note thatbecauseofthecorrollary,theresultsinProposition3areinfactvalidfortheextendedclassof activationfunctionsf(p)=clip(k(p−0.5)+0.5,0.0,1.0)forallk ≥0. A.2 MoreIntuitionRegardingBatch-SeparatedGradients PerhapsoneofthemostsubtlecomponentsoftheproposedGradDropmethodisitstreatmentof batch-separatedgradients. Althoughthetreatmentinthemainpaperismoremathematical,wewould liketousethissectiontodevelopsomemoreintuitionforourproposedmethodology. AsdescribedinSection3.2,itisnecessarytodevelopaversionofGradDropthatoperatesnontrivially when gradients are incident on orthogonal sub-batches, like in our transfer learning experiments inSection4.3. Theissueweneedtoresolveisthatthesegradientsaredependentontheirbatch’s inputvalues,sojustsumminggradientsacrossthebatchdimensionisnotanoption. Forexample, 14agradientvalueof4.0whentheinputvalueis1.0isnotingeneralthesamescenarioasagradient valueof4.0whentheinputvalueis-1.0. Animportantinsightisthatmostoperationsinastandarddeepnetworkaremultiplicativeinnature. Althoughadditionsofabiasarealsostandardinneuralnetworks,theyarevastlyoutnumberedbythe amountofmultiplicativeoperationsandoftenareleftoutentirelyofthenetwork. However,ifour basicbuildingblockwithinadeepnetworkismultiplication,thismeansthattheimportantquantityis notthepurevalueofagradient,butwhetherthatgradientpullsaninputvaluefurtherorcloserto zero. Thus,theimportantvaluewhencomparinggradientsis(input)×(grad),ratherthanthenaked gradient. However,anadditionalcomplicationarisesbecausetheinputisoftenhighvariance,andtakingthis productasourkeymetriccanproduceunstableresults. Anadditionalmodificationcanbemade basedonthereasoningthatGradDropoperatesmainlybyreasoningaboutthesigncontentofthe gradients. Thereasonwhypre-multiplicationbytheinputvalueisusefulisonlybecauseitensures wedonotmakeasignerrorwhensummingmultiplegradientstogether. Inthatsense,itissufficient topremultiplybythesignoftheinput,asthisallowsustocorrectourgradientsignalforanypotential signerrorswithoutbeingsusceptibletotheaddedvarianceoftheinputs. InthemainpaperSection3.2,wederivedtheproposedrulebyassumingavirtuallayerthatwas simpleelement-wisemultiplicationateachactivationposition. Inprinciple,therearealsootherlayers withtrainableweights(e.g. denselayers,convlayers)forwhichwecouldconsideravirtuallayerand derivearuleformarginalizationofthegradientsignalacrossbatches. Itisapotentialdirectionof futureworktoseeifanyoftheseotherlayersresultinmorerobustrulesforgradientcomparison. A.3 ASimpleOne-DimensionalExample: Addendum BecausetheexperimentinSection4.1usesamodelwithonlyonetrainableweight,thereisn’tmuch tolistintermsofhyperparameters. Wetrainallrunswithaninitiallearningrateof0.2andadecay ratioof0.5appliedevery1ksteps. Everyrunis10kstepsintotal. WeuseastandardSGDoptimizer. Thesinecurvesweusetogeneratethefinallossareoftheformsin(ax+b)+1.0. The1.0affine factorisonlytheresothatalllossvaluesarenonnegative,whichispurelycosmetic. Thefivesine functionshavethefollowingparametersfor(a,b): (1.0,0.0) (1.5,0.2) (2.0,0.4) (2.5,0.6) (5.0,0.8) Thesinefunctionperiodsareselectedpurelyatrandomandarenotchosentonecessarilyemphasize anyparticularbehavior. We note that many methods, such as MGDA [37] and PCGrad [47] do not operate well in the low-dimensionalregime. 
AlthoughitisdifficulttoadaptMGDAtolowerdimensions,wewereable tomodifyPCGradtoexhibitnontrivialbehaviorinlowdimensions. Namely,PCGradfirstmakes astaticcopyoftheoriginalgradientsandthenorthogonallyprojectsgradienttensorstoeachother withreferencetothestaticcopy. Instead,wedonotmakeacopyoftheoriginalgradientvectorand instead update the input gradients in-place. Such a replacement strategy, which we call Iterative PCGrad,addsnoisetothePCGradmethodbutallowsforreasonableoperationinlowdimensions. WealsohavetriedIterativePCGradonsomeoftheotherexperimentalsettingswithinthisworkand itgenerallyseemstoperformsimilarlytoPCGradproper. A.4 MultitaskLearningonCeleb-A:Addendum For these experiments we use the Celeb-A dataset in its standard setting. We use the standard 160k/20kdatasetsplitandtreateachattributeasaseparatetaskthatistrainedwithastandardbinary sigmoidclassificationloss. Our network is a shallow convolutional network with nine layers (not counting the maxpool layers or predictive head). With the notation CONV-F-C for a convolu- tional layer of filter size F and number of channels C, MAX denoting a maxpool 15layer of filter size and stride 2, and DENSE-H a dense layer with H outputs, the layerstackis[CONV-3-64][MAX][CONV-3-128][CONV-3-128][MAX][CONV-3-256][CONV-3- 256][MAX][CONV-3-512][CONV-3-512][DENSE-512][DENSE-512][DENSE-40]. GradDropand other baselines are applied after the final CONV layer. All layers use Batch Normalization [14] exceptforthefinalpredictivehead. We use an Adam optimizer with (β ,β ) = (0.9, 0.999). Our batch size is 8 and we start with a 1 2 learning rate of 1e-3, with an annealing rate of 0.96 applied every 2400 steps. All baselines are trainedwiththissetofhyperparameters,withtheexceptionofMGDA[37]forwhichwehadtolower thelearningratebyafactorof100x(otherwisetheperformanceofMGDAwasverypoor). Wetrain allnetworkspastconvergence,butreportresultsfromtheperformancepeak. Wedothissowealso canseethebehaviorofthesystemwhenthemodeldegradationfromoverfittingismostpronounced. A.5 TransferLearningonCIFAR-100: Addendum We use CIFAR-100 in its standard setting with a 40k/10k data split. All images (including the ImageNet2012images)areresizedto32x32beforebeinginputintothenetworktomatchtheCIFAR- 100imageresolution. Imagevaluesaredividedby256.0inpreprocessingsothatvaluesinputtothe networkliebetween0and1. Thisinitialnormalizationimprovestrainingstabilityespeciallyatthe beginningoftraining. OurnetworkisbasedonaDenseNet-100-BC[13]modelwithk =12. Themodelhas100layersin total. Wedonotusedataaugmentationtoreducethevarianceofourtrainingresults,andwesearch forhyperparametersthatperformoptimallyonthetransferlearningbaselinebeforeapplyingother baselineswiththesamesetofhyperparameters. WedonotuseDropout[41]inournetworkasit appearstodegradeperformance. Weusebatchsize8eachforCIFAR-100andImageNet2012inputs (foratotalbatchsizeof16), withanAdamoptimizerwith(β ,β ) = (0.9,0.999)andaninitial 1 2 learningrateof0.001. Thelearningratestaysconstantuntilstep100k,atwhichpointitdecaysby afactorof0.94every2000steps. Wetrainuntilconvergence,whichoccursataround250k(≈50 epochs). OurbestresultsusetheGradDropactivationf(p)=0.25(p−0.5)+0.5,showingthatthis particularsystembenefitsfromahighernoisepenalty(i.e. seetheoreticalresultsinSectionA.1). Both ImageNet2012 and CIFAR-100 inputs share the vast majority of the network, but they are givenseparateBatchNormtrainableparameterstohelpalleviatenegativeeffectsofthedomainshift betweenthetwodatasets. 
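To illustrate the BatchNorm parameter separation described above, here is a minimal PyTorch-style sketch of a block that shares its convolutional weights across both datasets but keeps a separate set of BatchNorm statistics and affine parameters per input domain. The class name, domain indexing, and layer sizes are illustrative assumptions, not the configuration used in the paper.

```python
import torch
import torch.nn as nn

class DomainSeparatedBatchNorm(nn.Module):
    """Shared conv trunk, but one BatchNorm per input domain (source/transfer)."""

    def __init__(self, channels, num_domains=2):
        super().__init__()
        self.bns = nn.ModuleList(nn.BatchNorm2d(channels) for _ in range(num_domains))

    def forward(self, x, domain):
        # `domain` selects which BatchNorm statistics/affine parameters to use,
        # e.g. 0 for ImageNet2012 (source) and 1 for CIFAR-100 (transfer).
        return self.bns[domain](x)

# Usage inside a shared block: conv weights are shared, normalization is not.
conv = nn.Conv2d(3, 12, kernel_size=3, padding=1)
bn = DomainSeparatedBatchNorm(12)
x_src, x_tgt = torch.randn(8, 3, 32, 32), torch.randn(8, 3, 32, 32)
y_src = torch.relu(bn(conv(x_src), domain=0))
y_tgt = torch.relu(bn(conv(x_tgt), domain=1))
```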
Wefoundthattrainingisveryunstableinthistransferlearningsetting withoutthisBatchNormparameterseparation. Unlikeourotherexperiments,wedonottryPCGrad[47]. ThisisprimarilybecausePCGradisthe trivialtransformationwhenallgradientshavenonnegativepairwisedotproducts. However,because transferlearningproducesbatch-separatedgradientsignals(seeSection3.2),allthegradientsare alreadypairwiseorthogonalbeforeanyadditionalprocessing. ThusPCGradwouldreturnthetrivial transformationandwouldperformidenticallytothebaseline. (a)CIFAR-100transferlearningwithvarious (b)FinalCIFAR-100transferlearningloss leakparametersettings. plottedagainstleakparametersettings. Figure5: ExperimentswithleakparametersontheCIFAR-100transferlearningsetting. 16Table 5: Transfer Learning from ImageNet2012 to CIFAR-100 with different leak parameters. Standarddeviationvaluesare≤0.2%foraccuracyand≤0.01forloss. (cid:96) (cid:96) Top-1Error(%)↓ TestLoss↓ source transfer 0.0 1.0 30.5 1.23 0.25 0.75 30.0 1.18 0.5 0.5 29.0 1.10 0.75 0.25 29.1 1.06 1.0 0.0 28.9 1.01 Inthemainpaper,wealsonotedthatGradDropallowsforflexibilityinleakparameters(cid:96) ∈[0.0,1.0], i suchthatthefinalgradientreturnedis(cid:96) ∇+(1−(cid:96) )∇(graddrop)foragiventaski. Wemadetheclaim i i thatforatransferlearningsetting,havingthestandard(cid:96) =0,∀ienvironmentissuboptimalaswe i caremoreaboutperformanceonthetransfertask. Wefurtherclaimedthatsettingaleakparameterof (cid:96)=1.0forthesourcedatasetwhilekeeping(cid:96)=0.0forthetransferdatasetwastheoptimalsetting fortransferlearning. We present here experiments that empirically justify the above statement. In Table 5 and Figure 5weshowresultsofapanelofexperimentsconductedwithdifferentleakparameters(cid:96) and source (cid:96) . WeruntheCIFAR-100experimentwithfivedifferentsettingsof(cid:96) and(cid:96) , transfer source transfer althoughforeaseofinterpretationwekeepthesum(cid:96) +(cid:96) ataconstantvalueof1.0. source transfer Wenotethatthereisacleardependenceofperformanceonthevalue(cid:96) −(cid:96) . Theerror source transfer valuesstaygenerallythesamefor(cid:96) closeto1.0,butthenriseprecipitously,althoughthesame source trendmanifestsasastronglineardependencyinthelossvalues. Tosome,thisresultmaybecounterintuitive;ifwecaremoreaboutthetransferset,thenitseems reasonablethat(cid:96) shouldbehigherandnot(cid:96) ,toensurethatmoretransfergradients transfer source aretransmittedbackthroughthenetwork. However,wefindtheseresultsfullyconsistentwithour understanding of GradDrop; as GradDrop primarily filters for consistent gradients, it is optimal toallowtheunimportantsourcesettofullyoverfitwhilethetransfersetismaximallyfilteredand regularized. WefindthatthissetofexperimentsstronglysuggeststhattheeffectofGradDropis beneficial. A.6 3DPointCloudDetectiononWaymoOpenDataset: Addendum WeusetheWaymoOpenDatasetfor3DVehicleObjectDetectionalsoinitsstandardsetting,witha total1000segmentsof20s10Hzvideos. Wesplitthe1000segmentsintotheoriginal798/202split. Ourre-implementationofPointpillar[22]isfaithfultothetopologicalandthresholdhyperparameters ofthatpaper,sowereferthereadertotheoriginalworkfordetails. Weuse8GPUsandatotalbatch sizeof16,withanAdamoptimizerwith(β ,β )=(0.9,0.999). Ourinitiallearningrateis0.0015 1 2 witharampupperiodof1000steps. Weuseacosineannealingscheduleasdescribedin[27]fora totaltrainingregimeof1.28millionsteps. 
[22] describes eight losses for our bounding boxes: three losses for (x,y,z) localization of the boxcenter,threelossesfor(h,(cid:96),w)regressionoftheboxdimensions,onelossfortherotational orientationofthebox,andonelossforthebinaryboxclass(i.e. vehicleornot). Inadditiontothose losses,weaddaninthlossintheformofadirectionalclassifier;weuseastandardcross-entropyloss topredictwhetheraboxfacesforwardsorbackwardswithinthedatasetcoordinatesystem. Weuse thislossasthereisanintrinsicambiguityintherotationlossthatdoesnotpenalizeapredictedbox forbeingexactly180orotatedwithrespecttothegroundtruth. Wefindthathavingthisadditional directionalclassifierimprovesperformancedramaticallyontheAPHmetrics(whichpenalizeheavily forincorrectlyorientedboxes). Forsakeofcompleteness,weshowtheaccuracycurvesforour3DdetectionexperimentsinFigure 6. WeseethatGradDropproducesbetteraccuracyinthe3Ddetectionmetricsandthisbenefitis presentthroughoutmostoftraining. PCGrad[47]alsoperformswell,butfallsshortoftheGradDrop performance. WeattributethisdifferentialtotheabilityofGradDroptomoreeffectivelychoose consistentgradientdirections,aconclusionalsosupportedbyourtoyexperimentsinSection4.1. 17(a)3DAP (b)3DAPH Figure6: 3DAPand3DAPHmetricsforWaymoOpenDataset. AswithotherexperimentstheMGDAbaselineseemstoperformrelativelypoorly,especiallyinthe 3Dmetrics. WenotethatbecauseMGDAseeksthelinearcombinationofgradientsthatresultsin thesmallestnorm,taskswhichtendtobackpropagatehighergradientswillbecomeattenuatedby MGDA.GradNormhasasimilareffect,butbecauseGradNorm’sreferencepointisthemeannorm ofallgradientsratherthantheminimumnormofallpossiblelinearcombinationsofgradients,the effectismuchlessacute. BecauseGradNormalsotendstoregressinthe3Dmetrics,weconclude thatthe3D-relevantlosses(zlocalizationandboxheightregression)tendtobackpropagatehigher gradients,whichthenhasaslightnegativeinteractionwithGradNormandamoreseverenegative interactionwithMGDA. WealsopresentamoreextensivesetofresultsinTable6forourexperimentswithsynergybetween GradDropandothermultitasklearningmethods,suchasGradNorm[3]andMGDA[37]. Inthemain papertextweonlypresented3Dmetricresultsastheywerewherewesawthemostprominenteffect, butherewetabulate2DmetricsaswellandalsopresentresultsforMGDA+GradDrop. Asdescribed inSection4.5,theeffectofapplyingGradDropatopMGDAismurky;weseearegressioninthe 2Dmetricsbutanimprovementin3D.AsalsodiscussedinSection4.5,weattributethiseffectto MGDAnotbehavingproperlywhenitsinputgradientshavethesamesignateveryposition. However, whenappliedwithGradNormweseethatGradDropsignificantlyimprovesthe3Dmetricswhile itdoesapproximatelyaswellifnotslightlybetterinthe2Dmetrics. Thisresultisimportant, as GradNormgenerallyprovidesamoderateboostalreadyinthe2Dmetricsoverthebaselinemodel. TheabilityofGradDroptoimprovethe3DmetricswhilemaintainingtheGradNormadvantagein 2Disencouraging. Table6: ObjectDetectionfromPointCloudsontheWaymoOpenDataset-SynergyWithOther MTLMethods Method 2DAP(%)↑ 2DAPH(%)↑ 3DAP(%)↑ 3DAPH(%)↑ MGDA[37] 76.8 69.5 20.0 18.3 MGDA[37]+GradDrop 73.7 65.0 32.3 28.6 GradNorm[3] 76.9 71.7 51.0 48.2 GradNorm+GradDrop[3] 77.3 71.6 55.1 51.5 18