Published as a conference paper at ICLR 2022

POLYLOSS: A POLYNOMIAL EXPANSION PERSPECTIVE OF CLASSIFICATION LOSS FUNCTIONS

Zhaoqi Leng¹, Mingxing Tan¹, Chenxi Liu¹, Ekin Dogus Cubuk², Xiaojie Shi², Shuyang Cheng¹, Dragomir Anguelov¹
¹Waymo LLC  ²Google LLC
{lengzhaoqi, tanmingxing, cxliu, shuyangcheng, dragomir}@waymo.com
{cubuk, xiaojies}@google.com

ABSTRACT

Cross-entropy loss and focal loss are the most common choices when training deep neural networks for classification problems. Generally speaking, however, a good loss function can take on much more flexible forms, and should be tailored for different tasks and datasets. Motivated by how functions can be approximated via Taylor expansion, we propose a simple framework, named PolyLoss, to view and design loss functions as a linear combination of polynomial functions. Our PolyLoss allows the importance of different polynomial bases to be easily adjusted depending on the target tasks and datasets, while naturally subsuming the aforementioned cross-entropy loss and focal loss as special cases. Extensive experimental results show that the optimal choice within the PolyLoss family is indeed dependent on the task and dataset. Simply by introducing one extra hyperparameter and adding one line of code, our Poly-1 formulation outperforms the cross-entropy loss and focal loss on 2D image classification, instance segmentation, object detection, and 3D object detection tasks, sometimes by a large margin.

| Task | Default loss | Model | Baseline | PolyLoss |
|---|---|---|---|---|
| ImageNet classification | Cross-entropy | ENetV2-L (21K) | 45.8 | 46.4 (+0.6) |
| | | ENetV2-L (1K) | 86.8 | 87.2 (+0.4) |
| COCO det. and seg. | Cross-entropy | Mask R-CNN (box AR) | 47.2 | 49.7 (+2.5) |
| | | Mask R-CNN (mask AR) | 42.3 | 44.4 (+2.1) |
| Waymo Open Dataset 3D detection | Focal loss | PointPillars Car | 63.3 | 63.7 (+0.4) |
| | | PointPillars Ped | 68.9 | 69.6 (+0.7) |
| | | RSN Car | 78.4 | 78.9 (+0.5) |
| | | RSN Ped | 79.4 | 80.2 (+0.8) |

Table 1: PolyLoss outperforms cross-entropy and focal loss on various models and tasks. Results are for the simplest Poly-1, which has only a single hyperparameter. On ImageNet (Deng et al., 2009), our PolyLoss improves both pretraining and finetuning for the recent EfficientNetV2 (Tan & Le, 2021); on COCO (Lin et al., 2014), PolyLoss improves both 2D detection and segmentation AR for Mask R-CNN (He et al., 2017); on Waymo Open Dataset (WOD) (Sun et al., 2020), PolyLoss improves 3D detection AP for the widely used PointPillars (Lang et al., 2019) and the very recent Range Sparse Net (RSN) (Sun et al., 2021). Details are in Tables 4, 5, and 7.

1 INTRODUCTION

Loss functions are important in training neural networks. In principle, a loss function could be any (differentiable) function that maps predictions and labels to a scalar. Therefore, designing a good loss function is generally challenging due to its large design space, and designing a universal loss function that works across different tasks and datasets is even more challenging: for example, L1/L2 losses are commonly used for regression tasks, but they are rarely used for classification tasks; focal loss is often used to alleviate the overfitting issue of cross-entropy loss on imbalanced object detection datasets (Lin et al., 2017), but it has not been shown to consistently help other tasks. Many recent works have also explored new loss functions via meta-learning, ensembling, or compositing different losses (Hajiabadi et al., 2017; Xu et al., 2018; Gonzalez & Miikkulainen, 2020b;a; Li et al., 2019).

In this paper, we propose PolyLoss: a novel framework for understanding and designing loss functions. Our key insight is to decompose commonly used classification loss functions, such as cross-entropy loss and focal loss, into a series of weighted polynomial bases. They are decomposed in the form of $\sum_{j=1}^{\infty} \alpha_j (1-P_t)^j$, where $\alpha_j \in \mathbb{R}^+$ is the polynomial coefficient and $P_t$ is the prediction probability of the target class label.
Each polynomial base $(1-P_t)^j$ is weighted by a corresponding polynomial coefficient $\alpha_j$, which enables us to easily adjust the importance of different bases for different applications. When $\alpha_j = 1/j$ for all $j$, our PolyLoss becomes equivalent to the commonly used cross-entropy loss, but this coefficient assignment may not be optimal.

Our study shows that, in order to achieve better results, it is necessary to adjust the polynomial coefficients $\alpha_j$ for different tasks and datasets. Since it is impossible to adjust an infinite number of $\alpha_j$, we explore various strategies with a small degree of freedom. Perhaps surprisingly, we observe that simply adjusting the single coefficient of the leading polynomial, a formulation we denote $L_{\text{Poly-1}}$, is sufficient to achieve significant improvements over the commonly used cross-entropy loss and focal loss. Overall, our contributions can be summarized as:

• Insights on common losses: We propose a unified framework, named PolyLoss, to rethink and redesign loss functions. This framework helps to explain cross-entropy loss and focal loss as two special cases of the PolyLoss family (obtained by horizontally shifting polynomial coefficients), which was not recognized before. This new finding motivates us to investigate new loss functions that vertically adjust polynomial coefficients, shown in Figure 1.

• New loss formulation: We evaluate different ways of vertically manipulating polynomial coefficients to simplify the hyperparameter search space. We propose a simple and effective Poly-1 loss formulation which only introduces one hyperparameter and one line of code.

• New findings: We identify that focal loss, though effective for many detection tasks, is suboptimal for the imbalanced ImageNet-21K. We find the leading polynomial contributes a large portion of the gradient during training, and its coefficient correlates with the prediction confidence $P_t$. In addition, we provide an intuitive explanation of how to leverage this correlation to design good PolyLosses tailored to imbalanced datasets.

• Extensive experiments: We evaluate our PolyLoss on different tasks, models, and datasets. Results show PolyLoss consistently improves the performance on all fronts, summarized in Table 1, which includes the state-of-the-art classifier EfficientNetV2 and detector RSN.

2 RELATED WORK

Cross-entropy loss is used in popular and current state-of-the-art models for perception tasks such as classification, detection, and semantic segmentation (Tan & Le, 2021; He et al., 2017; Zoph et al., 2020; Tao et al., 2020). Various losses have been proposed to improve cross-entropy loss (Lin et al., 2017; Law & Deng, 2018; Cui et al., 2019; Zhao et al., 2021). Unlike prior works, the goal of this paper is to provide a unified framework for systematically designing a better classification loss function.

Loss for class imbalance Training detection models, especially single-stage detectors, is difficult due to class imbalance. Common approaches such as hard example mining and reweighting have been developed to address the class imbalance issue (Sung, 1996; Viola & Jones, 2001; Felzenszwalb et al., 2010; Shrivastava et al., 2016; Liu et al., 2016; Bulo et al., 2017). As one of these approaches, focal loss is designed to mitigate the class imbalance issue by focusing on the hard examples, and is used to train state-of-the-art 2D and 3D detectors (Lin et al., 2017; Tan et al., 2020; Du et al., 2020; Shi et al., 2020; Sun et al., 2021). In our work, we found that focal loss is suboptimal for the imbalanced ImageNet-21K. Using the PolyLoss framework, we discover a better loss function, which performs the opposite role of focal loss.
We further provide an intuitive understanding, via the PolyLoss framework, of why it is important to design different loss functions tailored to different imbalanced datasets.

Robust loss to label noise Another direction of research is to design loss functions that are robust to label noise (Ghosh et al., 2015; 2017; Zhang & Sabuncu, 2018; Wang et al., 2019; Oksuz et al., 2020; Menon et al., 2019). A commonly used approach is to incorporate a noise-robust loss function such as Mean Absolute Error (MAE) into cross-entropy loss. In particular, Taylor cross-entropy loss was proposed to unify MAE and cross-entropy loss by expanding the cross-entropy loss in $(1-P_t)^j$ polynomial bases (Feng et al., 2020). By truncating the higher-order polynomials, they show the truncated cross-entropy loss function is closer to MAE, which is more robust to label noise on datasets with synthetic label noise. In contrast, our PolyLoss provides a more general framework to design loss functions for different datasets by manipulating polynomial coefficients, which includes dropping higher-order polynomials as proposed in Feng et al. (2020). Our experiments in subsection 4.1 show the loss proposed in Feng et al. (2020) performs worse than cross-entropy loss on the clean ImageNet dataset.

Learned loss functions Several recent works demonstrate learning the loss function during training via gradient descent or meta-learning (Hajiabadi et al., 2017; Xu et al., 2018; Gonzalez & Miikkulainen, 2020a; Li et al., 2019; 2020). Notably, TaylorGLO utilizes CMA-ES to optimize a multivariate Taylor parameterization of a loss function and learning rate schedule during training (Hansen & Ostermeier, 1996; Gonzalez & Miikkulainen, 2020b). Because the search space scales with the order of the polynomials, the paper demonstrates, using a third-order parameterization (8 parameters), that the learned loss function schedule outperforms cross-entropy loss on 10-class classification problems. Our paper (Figure 2a), on the other hand, shows that for 1000-class classification tasks, hundreds of polynomials are needed. This results in a prohibitively large search space. Our proposed Poly-1 formulation mitigates the challenge of the large search space and does not rely on advanced black-box optimization algorithms. Instead, we show a simple grid search over one hyperparameter can lead to significant improvement on all tasks that we investigate.

Figure 1: Unified view of cross-entropy loss, focal loss, and PolyLoss. PolyLoss $\sum_{j=1}^{\infty} \alpha_j (1-P_t)^j$ is a more general framework, where $P_t$ stands for the prediction probability of the target class. Left: PolyLoss is more flexible: it can be steeper (deep red) than cross-entropy loss (black) or flatter (light red) than focal loss (green). Right: Polynomial coefficients of different loss functions in the bases of $(1-P_t)^j$, where $j \in \mathbb{Z}^+$. Black dashed lines are drawn to show the trend of the polynomial coefficients. In the PolyLoss framework, focal loss can only shift the polynomial coefficients horizontally (green arrow), see Equation 2, whereas the proposed PolyLoss framework is more general, which also allows vertical adjustment (red arrows) of the polynomial coefficient for each polynomial term.

3 POLYLOSS

PolyLoss provides a framework for understanding and improving the commonly used cross-entropy loss and focal loss, visualized in Figure 1. It is inspired by the Taylor expansion of cross-entropy loss (Equation 1) and focal loss (Equation 2) in the bases of $(1-P_t)^j$:

$$L_{\text{CE}} = -\log(P_t) = \sum_{j=1}^{\infty} \tfrac{1}{j}(1-P_t)^j = (1-P_t) + \tfrac{1}{2}(1-P_t)^2 + \dots \tag{1}$$

$$L_{\text{FL}} = -(1-P_t)^{\gamma}\log(P_t) = \sum_{j=1}^{\infty} \tfrac{1}{j}(1-P_t)^{j+\gamma} = (1-P_t)^{1+\gamma} + \tfrac{1}{2}(1-P_t)^{2+\gamma} + \dots \tag{2}$$

where $P_t$ is the model's prediction probability of the target ground-truth class.
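As a quick sanity check on these expansions, the truncated partial sums can be compared numerically against the closed forms. The following sketch (ours, for illustration only) verifies Equations 1 and 2 in NumPy.

import numpy as np

def truncated_poly_ce(pt, n_terms):
    # Partial sum of Equation 1: sum_{j=1}^{n} (1/j) * (1 - pt)**j.
    j = np.arange(1, n_terms + 1)
    return np.sum((1.0 - pt) ** j / j)

def truncated_poly_focal(pt, gamma, n_terms):
    # Partial sum of Equation 2: sum_{j=1}^{n} (1/j) * (1 - pt)**(j + gamma).
    j = np.arange(1, n_terms + 1)
    return np.sum((1.0 - pt) ** (j + gamma) / j)

pt = 0.7
print(truncated_poly_ce(pt, 100), -np.log(pt))              # both ~0.3567
print(truncated_poly_focal(pt, 2.0, 100),
      -(1.0 - pt) ** 2.0 * np.log(pt))                      # both ~0.0321

With 100 terms and a moderately confident prediction, the partial sums already agree with $-\log(P_t)$ and the focal loss to several decimal places; as shown in Section 4.1, far more terms are needed when $P_t$ is close to zero.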
Cross-entropy loss as PolyLoss Using the gradient descent method to optimize the cross-entropy loss requires taking the gradient with respect to $P_t$. In the PolyLoss framework, an interesting observation is that the coefficients $1/j$ exactly cancel the $j$th power of the polynomial bases, see Equation 1. Thus, the gradient of cross-entropy loss is simply the sum of polynomials $(1-P_t)^j$, shown in Equation 3:

$$-\frac{dL_{\text{CE}}}{dP_t} = \sum_{j=1}^{\infty} (1-P_t)^{j-1} = 1 + (1-P_t) + (1-P_t)^2 + \dots \tag{3}$$

The polynomial terms in the gradient expansion capture different sensitivities with respect to $P_t$. The leading gradient term is 1, which provides a constant gradient regardless of the value of $P_t$. On the contrary, when $j \gg 1$, the $j$th gradient term is strongly suppressed as $P_t$ gets closer to 1.

Focal loss as PolyLoss In the PolyLoss framework, Equation 2, it is apparent that the focal loss simply shifts the power $j$ by the power of the modulating factor $\gamma$. This is equivalent to horizontally shifting all the polynomial coefficients by $\gamma$, as shown in Figure 1. To understand the focal loss from a gradient perspective, we take the gradient of the focal loss (Equation 2) with respect to $P_t$:

$$-\frac{dL_{\text{FL}}}{dP_t} = \sum_{j=1}^{\infty} (1+\gamma/j)(1-P_t)^{j+\gamma-1} = (1+\gamma)(1-P_t)^{\gamma} + (1+\gamma/2)(1-P_t)^{1+\gamma} + \dots \tag{4}$$

For a positive $\gamma$, the gradient of focal loss drops the constant leading gradient term, 1, of the cross-entropy loss, see Equation 3. As discussed in the previous paragraph, this constant gradient term causes the model to emphasize the majority class, since its gradient is simply the total number of examples for each class. By shifting the power of all the polynomial terms by $\gamma$, the first term becomes $(1-P_t)^{\gamma}$, which is suppressed by the power of $\gamma$ to avoid overfitting to the already confident (meaning $P_t$ close to 1) majority class. More details are shown in Section 12.

| Loss | Polynomial expansion in the basis of $(1-P_t)$ | Formula |
|---|---|---|
| Cross-entropy loss | $(1-P_t) + \frac{1}{2}(1-P_t)^2 + \dots + \frac{1}{N}(1-P_t)^N + \frac{1}{N+1}(1-P_t)^{N+1} + \dots$ | $L_{\text{CE}} = -\log(P_t)$ |
| Drop poly. (Sec 4.1) | $(1-P_t) + \frac{1}{2}(1-P_t)^2 + \dots + \frac{1}{N}(1-P_t)^N$ (drop the remaining terms) | $L_{\text{Drop}} = L_{\text{CE}} - \sum_{j=N+1}^{\infty} \frac{1}{j}(1-P_t)^j$ |
| Poly-N (Sec 4.2) | $(\epsilon_1 + 1)(1-P_t) + \dots + (\epsilon_N + \frac{1}{N})(1-P_t)^N + \frac{1}{N+1}(1-P_t)^{N+1} + \dots$ | $L_{\text{Poly-N}} = L_{\text{CE}} + \sum_{j=1}^{N} \epsilon_j(1-P_t)^j$ |
| Poly-1 (Sec 4.3) | $(\epsilon_1 + 1)(1-P_t) + \frac{1}{2}(1-P_t)^2 + \dots + \frac{1}{N}(1-P_t)^N + \frac{1}{N+1}(1-P_t)^{N+1} + \dots$ | $L_{\text{Poly-1}} = L_{\text{CE}} + \epsilon_1(1-P_t)$ |

Table 2: Comparing different losses in the PolyLoss framework. Dropping higher-order polynomials, proposed in prior works, truncates all higher-order ($N+1 \to \infty$) polynomial terms. We propose Poly-N loss, which perturbs the leading N polynomial coefficients. Poly-1 is the final loss formulation, which further simplifies Poly-N and only requires a simple grid search over one hyperparameter. The differences compared to cross-entropy loss are highlighted in red.

Connection to regression and general form Representing the loss function in the PolyLoss framework provides an intuitive connection to regression. For classification tasks where $y = 1$ is the effective probability of the ground-truth label, the polynomial bases $(1-P_t)^j$ can be expressed as $(y-P_t)^j$. Thus both cross-entropy loss and focal loss can be interpreted as a weighted ensemble of distances between the prediction and label raised to the $j$th power. However, a fundamental question arises for those losses: Are the coefficients in front of the regression terms optimal?

In general, PolyLoss is a monotone decreasing function¹ on $[0,1]$ which can be expressed as $\sum_{j=1}^{\infty} \alpha_j (1-P_t)^j$, and it provides a flexible framework to adjust each coefficient². PolyLoss can be generalized to non-integer $j$, but for simplicity we only focus on integer powers ($j \in \mathbb{Z}^+$) in this paper. In the next section, we investigate several strategies for designing better loss functions in the PolyLoss framework via manipulating $\alpha_j$.

¹We only consider the case where all $\alpha_j \geq 0$ in this paper for simplicity. There exist monotone decreasing functions on $[0,1]$ with some $\alpha_j$ negative, for example $\sin(1-P_t) = \sum_{j=0}^{\infty} (-1)^j/(2j+1)!\,(1-P_t)^{2j+1}$.

²To ensure the series converges, we require $1/\limsup_{j\to\infty} \sqrt[j]{|\alpha_j|} \geq 1$ for $P_t \in (0,1]$. For $P_t = 0$ we do not require point-wise convergence; in fact cross-entropy and focal loss both go to $+\infty$.
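To make the general form concrete, the following sketch (ours, for illustration only; the infinite series is truncated at N terms) evaluates $\sum_j \alpha_j(1-P_t)^j$ for an arbitrary coefficient vector, and recovers the truncated expansions of cross-entropy loss ($\alpha_j = 1/j$) and focal loss (the same coefficients shifted right by an integer $\gamma$).

import numpy as np

def poly_loss(pt, alphas):
    # General truncated PolyLoss: sum_j alphas[j-1] * (1 - pt)**j.
    j = np.arange(1, len(alphas) + 1)
    return np.sum(np.asarray(alphas) * (1.0 - pt) ** j)

N, gamma, pt = 200, 2, 0.6
ce_coeffs = [1.0 / j for j in range(1, N + 1)]                           # cross-entropy
fl_coeffs = [0.0] * gamma + [1.0 / j for j in range(1, N - gamma + 1)]   # focal loss
print(poly_loss(pt, ce_coeffs), -np.log(pt))                      # both ~0.5108
print(poly_loss(pt, fl_coeffs), -(1 - pt) ** gamma * np.log(pt))  # both ~0.0817

Any other choice of `alphas` defines a new member of the PolyLoss family; the rest of the paper is about which members are worth searching over.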
4 UNDERSTANDING THE EFFECT OF POLYNOMIAL COEFFICIENTS

In the previous section, we established the PolyLoss framework and showed that cross-entropy loss and focal loss simply correspond to different polynomial coefficients, where focal loss horizontally shifts the polynomial coefficients of cross-entropy loss.

In this section, we propose the final loss formulation Poly-1. We study in depth how vertically adjusting polynomial coefficients, shown in Figure 1, may affect training. Specifically, we explore three different strategies for assigning polynomial coefficients: dropping higher-order terms; adjusting multiple leading polynomial coefficients; and adjusting the first polynomial coefficient, summarized in Table 2. We find adjusting the first polynomial coefficient (the Poly-1 formulation) leads to the maximal gain while requiring minimal code change and hyperparameter tuning.

In these explorations, we experiment with 1000-class ImageNet (Deng et al., 2009) classification. We abbreviate it as ImageNet-1K to differentiate it from the full version, which contains 21K classes. We use ResNet-50 (He et al., 2016) and its training hyperparameters without modification.³

³Code at https://github.com/tensorflow/tpu/tree/master/models/official/

4.1 $L_{\text{Drop}}$: REVISITING DROPPING HIGHER-ORDER POLYNOMIAL TERMS

Prior works (Feng et al., 2020; Gonzalez & Miikkulainen, 2020b) have shown that dropping the higher-order polynomials and tuning the leading polynomials can improve model robustness and performance. We adopt the same loss formulation $L_{\text{Drop}} = \sum_{j=1}^{N} \frac{1}{j}(1-P_t)^j$ as in Feng et al. (2020), and compare its performance with the baseline cross-entropy loss on ImageNet-1K. As shown in Figure 2a, we need to sum up more than 600 polynomial terms to match the accuracy of cross-entropy loss. Notably, removing higher-order polynomials cannot simply be interpreted as adjusting the learning rate. To verify this, Figure 2b compares the performance for different learning rates with various cutoffs: whether we increase or decrease the learning rate from the original value of 0.1, the accuracy worsens. Additional hyperparameter tuning is shown in Section 9.

Figure 2: Training ResNet-50 on ImageNet-1K requires hundreds of polynomial terms to reproduce the same accuracy as cross-entropy loss. (a) Truncating the infinite sum of polynomials in cross-entropy loss to N terms reduces accuracy. (b) Adjusting the learning rate (default 0.1) of $L_{\text{Drop}}$ does not improve the classification accuracy.

To understand why the higher-order terms are important, we consider the residual sum after removing the first N polynomial terms from cross-entropy loss: $R_N = L_{\text{CE}} - L_{\text{Drop}} = \sum_{j=N+1}^{\infty} \frac{1}{j}(1-P_t)^j$.

Theorem 1. For any small $\zeta > 0$, $\delta > 0$, if $N > \log_{1-\delta}(\zeta \cdot \delta)$, then for any $p \in [\delta, 1]$, we have $|R_N(p)| < \zeta$ and $|R'_N(p)| < \zeta$. (Proof in Section 7.)
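To see how quickly the required N grows as $\delta$ shrinks, a small numeric check of the bound is shown below (ours, illustrative).

import math

def required_n(zeta, delta):
    # Smallest integer N satisfying N > log_{1-delta}(zeta * delta).
    return math.floor(math.log(zeta * delta) / math.log(1.0 - delta)) + 1

for delta in (0.1, 0.01, 0.001):
    print(delta, required_n(zeta=0.1, delta=delta))
# delta=0.1 -> 44; delta=0.01 -> 688; delta=0.001 -> 9206

For $\zeta = 0.1$, guaranteeing uniform closeness down to predictions of $P_t = 0.001$ already requires thousands of terms, consistent with the hundreds of terms needed empirically in Figure 2.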
Hence, taking a large N is necessary to ensure $L_{\text{Drop}}$ is uniformly close to $L_{\text{CE}}$ in terms of both the loss and the loss derivative on $[\delta, 1]$. For a fixed $\zeta$, as $\delta$ approaches 0, N grows rapidly. Our experimental results align with the theorem. The higher-order ($j \geq N+1$) polynomials play an important role during the early stages of training, where $P_t$ is typically close to zero. For example, when $P_t \sim 0.001$, according to Equation 3, the coefficient of the 500th term's gradient is $0.999^{499} \approx 0.6$, which is fairly large. Different from the aforementioned prior works, our results show that we cannot easily reduce the number of polynomial coefficients $\alpha_j$ by excluding the higher-order polynomials.

Dropping higher-order polynomials is equivalent to pushing all the higher-order ($j \geq N+1$) polynomial coefficients $\alpha_j$ vertically to zero in the PolyLoss framework. Since simply setting coefficients to zero is suboptimal for training on ImageNet-1K, in the following sections, we investigate how to manipulate the polynomial coefficients beyond setting them to zero in the PolyLoss framework. In particular, we aim to propose a simple and effective loss function that requires minimal tuning.

4.2 $L_{\text{Poly-N}}$: PERTURBING LEADING POLYNOMIAL COEFFICIENTS

In this paper, we propose an alternative way of designing a new loss function in the PolyLoss framework, where we adjust the coefficients of each polynomial. In general, there are infinitely many polynomial coefficients $\alpha_j$ that need to be tuned. Thus, it is infeasible to optimize the most general loss:

$$L_{\text{Poly}} = \alpha_1(1-P_t) + \alpha_2(1-P_t)^2 + \dots + \alpha_N(1-P_t)^N + \dots = \sum_{j=1}^{\infty} \alpha_j(1-P_t)^j \tag{5}$$

The previous section (subsection 4.1) has shown that hundreds of polynomials are required in training to do well on tasks such as ImageNet-1K classification. If we naively truncate the infinite sum in Equation 5 to the first few hundred terms, tuning the coefficients of so many polynomials still results in a prohibitively large search space. In addition, collectively tuning many coefficients also does not outperform cross-entropy loss; details are in Section 10.

To tackle this challenge, we propose to perturb the leading polynomial coefficients in cross-entropy loss, while keeping the rest the same. We denote the proposed loss formulation as Poly-N, where N stands for the number of leading coefficients that will be tuned:

$$L_{\text{Poly-N}} = (\epsilon_1 + 1)(1-P_t) + \dots + (\epsilon_N + 1/N)(1-P_t)^N + \tfrac{1}{N+1}(1-P_t)^{N+1} + \dots = -\log(P_t) + \sum_{j=1}^{N} \epsilon_j(1-P_t)^j \tag{6}$$

where the first N coefficients are perturbed by $\epsilon_j$ and the remaining terms are the same as in $L_{\text{CE}}$. Here, we replace the $j$th polynomial coefficient in cross-entropy loss, $1/j$, with $1/j + \epsilon_j$, where $\epsilon_j \in [-1/j, \infty)$ is the perturbation term. This allows us to pinpoint the first N polynomials without the need to worry about the infinitely many higher-order ($j \geq N+1$) coefficients, as in Equation 5.
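A minimal TensorFlow sketch of the Poly-N loss in Equation 6 is shown below (ours, illustrative; it follows the style of the Poly-1 reference code in the Reproducibility Statement).

import tensorflow as tf

def poly_n_cross_entropy(logits, labels, epsilons):
    # L_Poly-N = CE + sum_j epsilons[j-1] * (1 - pt)**j  (Equation 6).
    # labels are one-hot; epsilons holds the N perturbations, eps_j >= -1/j.
    pt = tf.reduce_sum(labels * tf.nn.softmax(logits), axis=-1)
    poly_n = tf.nn.softmax_cross_entropy_with_logits(labels, logits)
    for j, eps in enumerate(epsilons, start=1):
        poly_n += eps * tf.pow(1.0 - pt, float(j))
    return poly_n

Setting `epsilons=[eps_1]` reduces this to the Poly-1 formulation of Section 4.3.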
| | CE loss | N=1 | N=2 | N=3 |
|---|---|---|---|---|
| N-dim. grid search | 76.3 | 76.7 | 76.8 | – |
| Greedy grid search | 76.3 | 76.7 | 76.7 | 76.7 |

Table 3: $L_{\text{Poly-N}}$ outperforms cross-entropy loss on ImageNet-1K.

Table 3 shows $L_{\text{Poly-N}}$ outperforms the baseline cross-entropy loss accuracy. We explore N-dimensional grid search and greedy grid search of $\epsilon_j$ in $L_{\text{Poly-N}}$ up to N = 3 and find that simply adjusting the coefficient of the first polynomial (N = 1) leads to better classification accuracy. Performing 2D grid search (N = 2) can further boost the accuracy, but the additional gain is small (+0.1) compared to adjusting only the first polynomial (+0.4).

4.3 $L_{\text{Poly-1}}$: SIMPLE AND EFFECTIVE

As shown in the previous section, we find that tuning the first polynomial term leads to the most significant gain. In this section, we further simplify the Poly-N formulation and focus on evaluating Poly-1, where only the first polynomial coefficient in cross-entropy loss is modified:

$$L_{\text{Poly-1}} = (1+\epsilon_1)(1-P_t) + \tfrac{1}{2}(1-P_t)^2 + \dots = -\log(P_t) + \epsilon_1(1-P_t) \tag{7}$$

We study the effect of different first-term scalings on the accuracy and observe that increasing the first polynomial coefficient systematically increases ResNet-50 accuracy, as shown in Figure 3a. This result suggests that the cross-entropy loss is suboptimal in terms of polynomial coefficient values, and that increasing the first polynomial coefficient leads to consistent improvement, comparable to other training techniques (Section 11).

Figure 3: The first polynomial plays an important role for training ResNet-50 on ImageNet-1K. (a) Increasing the coefficient of the first polynomial term ($\epsilon_1 > 0$) consistently improves the ResNet-50 prediction accuracy. The red dashed line shows the accuracy when using cross-entropy loss. Mean and stdev of three runs are plotted. (b) The first polynomial $(1-P_t)$ contributes more than half of the cross-entropy gradient for the last 65% of the training steps, which highlights the importance of tuning the first polynomial. The red dashed line shows the crossover.

Figure 3b shows the leading polynomial contributes more than half of the cross-entropy gradient during training for the majority of the time, which highlights the significance of the first polynomial term $(1-P_t)$ compared to the rest of the infinitely many terms. Therefore, in the remainder of the paper, we adopt the form of $L_{\text{Poly-1}}$ and primarily focus on adjusting the leading polynomial coefficient. As is evident from Equation 7, it only modifies the original loss implementation by a single line of code (adding an $\epsilon_1(1-P_t)$ term on top of cross-entropy loss).

Note that all the training hyperparameters are optimized for cross-entropy loss. Even so, a simple grid search over the first polynomial coefficient in the Poly-1 formulation significantly increases the classification accuracy. We find that optimizing other hyperparameters for $L_{\text{Poly-1}}$ leads to even higher accuracy, with more details in Section 8.
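As an aside, the crossover point in Figure 3b has a simple closed form. By Equation 3, the full cross-entropy gradient is a geometric series summing to $1/P_t$, so the fraction contributed by the leading term is exactly $P_t$ (a short derivation, ours):

$$\frac{1}{\sum_{j=1}^{\infty}(1-P_t)^{j-1}} = \frac{1}{1/P_t} = P_t, \qquad P_t \in (0,1],$$

so the first polynomial supplies more than half of the gradient exactly when $P_t > 0.5$, matching the crossover in Figure 3b.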
5 EXPERIMENTAL RESULTS

In this section, we compare our PolyLoss against the commonly used cross-entropy loss and focal loss on various tasks, models, and datasets. For the following experiments, we adopt the default training hyperparameters in the public repositories without any tuning. Nevertheless, the Poly-1 formulation leads to a consistent advantage over the default loss functions at the cost of a simple grid search.

5.1 $L_{\text{Poly-1}}$ IMPROVES 2D IMAGE CLASSIFICATION ON IMAGENET

Image classification is a fundamental problem in computer vision, and progress on image classification has led to progress on many related computer vision tasks. In terms of the network architecture, in addition to the ResNet-50 already used in Section 4, we also experiment with the state-of-the-art EfficientNetV2 (Tan & Le, 2021). We use the ImageNet settings in Tan & Le (2021) except for replacing the original cross-entropy loss with our PolyLoss $L_{\text{Poly-1}}$ with different values of $\epsilon_1$. In terms of the dataset, in addition to the ImageNet-1K dataset already used in Section 4, we also consider ImageNet-21K, which has about 13M training images with 21,841 classes. We study both the ImageNet-21K pretraining results and the ImageNet-1K finetuning results.

Pretraining EfficientNetV2-L on ImageNet-21K, then finetuning it on ImageNet-1K, can improve classification accuracy (Tan & Le, 2021). Here, we follow the same pretraining and finetuning schedule as reported in Tan & Le (2021) without modification⁴ but replace the cross-entropy loss with $L_{\text{Poly-1}} = -\log(P_t) + \epsilon_1(1-P_t)$. We reserve 25,000 images from the training set as a minival to search for the optimal $\epsilon_1$.

⁴Code at https://github.com/google/automl/tree/master/efficientnetv2

Pretraining on ImageNet-21K Figure 4 highlights the importance of using a tailored loss function when pretraining a model on the ImageNet-21K dataset. A simple grid search over $\epsilon_1 \in \{0, 1, 2, \dots, 7\}$ in $L_{\text{Poly-1}}$, without changing other default hyperparameters, leads to around 1% accuracy gain for all SOTA EfficientNetV2 models of different sizes. The accuracy improvement from using a better loss function nearly matches the improvement from scaling up the model architecture (S to M and M to L).

Figure 4: PolyLoss improves the EfficientNetV2 family on the speed-accuracy Pareto curve. Validation accuracies of EfficientNetV2-S/M/L models (8.8B/24B/53B FLOPs) pretrained on ImageNet-21K are plotted. PolyLoss ($L_{\text{Poly-1}} = -\log(P_t) + 5(1-P_t)$) outperforms cross-entropy loss with about a 2× speed-up.

Surprisingly, see Figure 5a, increasing the weight of the leading polynomial coefficient improves the accuracy of pretraining on ImageNet-21K (+0.6), whereas reducing it lowers the accuracy (−0.9). Setting $\epsilon_1 = -1$ truncates the leading polynomial term in the cross-entropy loss (Equation 1), which is similar to having a focal loss with $\gamma = 1$ (Equation 2). However, the opposite change, where $\epsilon_1 > 0$, improves the accuracy on the imbalanced ImageNet-21K.

We hypothesize that the predictions on the imbalanced ImageNet-21K are not confident enough ($P_t$ is small), and that using PolyLoss with positive $\epsilon_1$ leads to more confident predictions. To validate our hypothesis, we plot $P_t$ as a function of training steps in Figure 5b. We observe that $\epsilon_1$ directly controls the mean $P_t$ over all classes. Using PolyLoss with positive $\epsilon_1$ leads to more confident predictions (higher $P_t$). On the other hand, negative $\epsilon_1$ lowers the confidence.

Figure 5: PolyLoss improves EfficientNetV2-L by increasing the prediction confidence $P_t$. (a) Validation accuracy of EfficientNetV2-L on ImageNet-21K. PolyLoss with positive $\epsilon_1$ outperforms the baseline cross-entropy loss (red dashed line). (b) Positive $\epsilon_1 = 1$ (dark) increases the prediction confidence, while negative $\epsilon_1 = -1$ (light) decreases the prediction confidence.

Fine-tuning on ImageNet-1K After pretraining on ImageNet-21K, we take the EfficientNetV2-L checkpoint and finetune it on ImageNet-1K, using the same procedure as Tan & Le (2021) except for replacing the original cross-entropy loss with the Poly-1 formulation. PolyLoss improves the finetuning accuracy by 0.4%, advancing the ImageNet-1K top-1 accuracy from 86.8% to 87.2%.

| EfficientNetV2-L | $L_{\text{CE}}$ | $L_{\text{Poly-1}}$ | Improv. |
|---|---|---|---|
| ImageNet-21K | 45.8 | 46.4 | +0.6 |
| ImageNet-1K | 86.8 | 87.2 | +0.4 |

Table 4: PolyLoss improves classification accuracy on the ImageNet validation sets. We set $\epsilon_1 = 2$ for both.
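The $\epsilon_1$ search itself is just a loop over candidates evaluated on the minival split; a hypothetical sketch is below (the `train_and_evaluate` callable is a stand-in we introduce for illustration, not part of the released code).

def search_epsilon1(train_and_evaluate, candidates=(0, 1, 2, 3, 4, 5, 6, 7)):
    # Grid search over epsilon_1; returns the value with the best minival accuracy.
    # train_and_evaluate(epsilon=...) is assumed to train with L_Poly-1 and
    # report the held-out minival accuracy.
    scores = {eps: train_and_evaluate(epsilon=eps) for eps in candidates}
    return max(scores, key=scores.get)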
5.2 $L_{\text{Poly-1}}$ IMPROVES 2D INSTANCE SEGMENTATION AND OBJECT DETECTION ON COCO

Instance segmentation and object detection require localizing objects in an image in addition to recognizing them: the former in the form of arbitrary shapes and the latter in the form of bounding boxes. For both instance segmentation and object detection, we use the popular COCO (Lin et al., 2014) dataset, which contains 80 object classes. We choose Mask R-CNN (He et al., 2017) as the representative model for instance segmentation and object detection. These models optimize multiple losses, e.g. $L_{\text{MaskRCNN}} = L_{\text{cls}} + L_{\text{box}} + L_{\text{mask}}$. For the following experiments, we only replace $L_{\text{cls}}$ with PolyLoss and leave the other losses intact. Results are summarized in Table 5.

| | Loss | Box AP | Box AR | Mask AP | Mask AR |
|---|---|---|---|---|---|
| Mask R-CNN $L_{\text{CE}}$ | $-\log(P_t)$ | 35.0±0.09 | 47.2±0.16 | 31.3±0.09 | 42.3±0.02 |
| Mask R-CNN $L_{\text{Poly-1}}$ | $-\log(P_t)-(1-P_t)$ | 35.3±0.12 | 49.7±0.07 | 31.6±0.11 | 44.4±0.07 |
| Improvement | – | +0.3 | +2.5 | +0.3 | +2.1 |

Table 5: PolyLoss improves detection results on the COCO validation set. Bounding box and instance segmentation mask average precision (AP) and average recall (AR) are reported for a Mask R-CNN model with a ResNet-50 backbone. Mean and stdev of three runs are reported.

Reducing the leading polynomial coefficient improves Mask R-CNN AP and AR. In training Mask R-CNN, we use the training schedule optimized for cross-entropy loss⁵ and replace the cross-entropy loss with $L_{\text{Poly-1}} = -\log(P_t) + \epsilon_1(1-P_t)$ for the classification loss $L_{\text{cls}}$, where $\epsilon_1 \in \{-1.0, -0.8, -0.6, -0.4, -0.2, 0, 0.5, 1.0\}$. We ensure the leading coefficient stays positive, i.e. $\epsilon_1 \geq -1$. Our results in Figure 6a show systematic improvements in box AP, box AR, mask AP, and mask AR as we reduce the weight of the first polynomial by using negative $\epsilon_1$ values. Note that Poly-1 ($\epsilon_1 = -1$) not only improves AP but also significantly increases AR, as shown in Table 5.

⁵Code at https://github.com/tensorflow/tpu/tree/master/models/official

Figure 6: PolyLoss improves Mask R-CNN by lowering overconfident predictions. Mean and stdev of three runs are plotted. (a) Box AP, box AR, mask AP, and mask AR increase as $\epsilon_1$ decreases. Negative $\epsilon_1$ outperforms cross-entropy loss (red dashed line). (b) Negative $\epsilon_1 = -1$ (light) reduces the overconfident prediction $P_t$.

Tailoring the loss function to datasets and tasks is important. ImageNet-21K and COCO are both imbalanced, but the optimal $\epsilon_1$ values for PolyLoss are opposite in sign, i.e. $\epsilon_1 = 2$ for ImageNet-21K classification and $\epsilon_1 = -1$ for Mask R-CNN detection. We plot the $P_t$ of the Mask R-CNN classification head and find the original predictions are overly confident ($P_t$ is close to 1) on the imbalanced COCO dataset; thus using a negative $\epsilon_1$ lowers the prediction confidence, as shown in Figure 6b. This effect is similar to label smoothing (Szegedy et al., 2016) and confidence penalty (Pereyra et al., 2017), but unlike those methods, as long as $0 > \epsilon_1 > -1$, PolyLoss lowers the gradients of overconfident predictions without encouraging incorrect predictions or directly penalizing prediction confidence.

5.3 $L_{\text{Poly-1}}$ IMPROVES 3D OBJECT DETECTION ON WAYMO OPEN DATASET

| Loss | Polynomial expansion in the basis of $(1-P_t)$ | Formula |
|---|---|---|
| Focal loss | $(1-P_t)^{\gamma+1} + \frac{1}{2}(1-P_t)^{\gamma+2} + \frac{1}{3}(1-P_t)^{\gamma+3} + \dots$ | $L_{\text{FL}} = -(1-P_t)^{\gamma}\log(P_t)$ |
| Poly-1 (PointPillars) | $(\epsilon_1 + 1)(1-P_t)^{\gamma+1} + \frac{1}{2}(1-P_t)^{\gamma+2} + \frac{1}{3}(1-P_t)^{\gamma+3} + \dots$ | $L^{\text{FL}}_{\text{Poly-1}} = L_{\text{FL}} + \epsilon_1(1-P_t)^{\gamma+1}$ |
| Poly-1* (RSN, drop first) | $(\frac{1}{2} + \epsilon_2)(1-P_t)^{\gamma+2} + \frac{1}{3}(1-P_t)^{\gamma+3} + \dots$ | $L^{\text{FL}}_{\text{Poly-1*}} = L_{\text{FL}} - (1-P_t)^{\gamma+1} + \epsilon_2(1-P_t)^{\gamma+2}$ |

Table 6: PolyLoss vs. focal loss for 3D detection models. Differences are highlighted in red. We found the best Poly-1 for PointPillars is $\epsilon_1 = -1$, which is equivalent to dropping the first term. Therefore, for RSN, we drop the first term and tune the new leading polynomial $(1-P_t)^{\gamma+2}$.
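A minimal sketch of the two variants in Table 6 is shown below (ours, illustrative; `focal_loss` is the standard per-element focal loss, defined here so the snippet is self-contained).

import tensorflow as tf

def focal_loss(pt, gamma):
    # Standard focal loss, -(1 - pt)**gamma * log(pt), applied elementwise.
    return -tf.pow(1.0 - pt, gamma) * tf.math.log(pt)

def poly1_focal(pt, gamma, epsilon1):
    # Poly-1 (PointPillars): FL + eps_1 * (1 - pt)**(gamma + 1).
    return focal_loss(pt, gamma) + epsilon1 * tf.pow(1.0 - pt, gamma + 1.0)

def poly1_star_focal(pt, gamma, epsilon2):
    # Poly-1* (RSN): drop the leading term, then perturb the next one.
    return (focal_loss(pt, gamma)
            - tf.pow(1.0 - pt, gamma + 1.0)
            + epsilon2 * tf.pow(1.0 - pt, gamma + 2.0))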
Detecting 3D objects from LiDAR point clouds is an important topic and can directly benefit autonomous driving applications. We conduct these experiments on the Waymo Open Dataset (Sun et al., 2020). Similar to 2D detectors, 3D detection models are commonly based on single-stage and two-stage architectures. Here, we evaluate our PolyLoss on two models: the popular single-stage PointPillars model (Lang et al., 2019), and the state-of-the-art two-stage Range Sparse Net (RSN) model (Sun et al., 2021). Both models rely on multi-task loss functions during training. Here, we focus on improving the classification focal loss by replacing it with PolyLoss. As in the 2D perception cases, we adopt the Poly-1 formulation to improve upon focal loss, shown in Table 6.

PolyLoss improves the single-stage PointPillars model. The PointPillars model converts the raw 3D point cloud to a 2D top-down pseudo image, and then detects 3D bounding boxes from the 2D image in a similar way to RetinaNet (Lin et al., 2017). Here, we replace the classification focal loss ($\gamma = 2$) with $L^{\text{FL}}_{\text{Poly-1}} = -(1-P_t)^2\log(P_t) + \epsilon_1(1-P_t)^3$ and adopt the same training schedule optimized for focal loss without any modification⁶. Table 7 shows that $L^{\text{FL}}_{\text{Poly-1}}$ with $\epsilon_1 = -1$ leads to significant improvement on all the metrics for both vehicle and pedestrian models.

⁶Code at https://github.com/tensorflow/lingvo/tree/master/lingvo/tasks/car

| | Loss | BEV AP/APH L1 | BEV AP/APH L2 | 3D AP/APH L1 | 3D AP/APH L2 |
|---|---|---|---|---|---|
| Vehicle (IoU=0.7) | | | | | |
| PointPillars $L_{\text{FL}}$ | $-(1-P_t)^2\log(P_t)$ | 82.5/81.5 | 73.9/72.9 | 63.3/62.7 | 55.2/54.7 |
| PointPillars $L^{\text{FL}}_{\text{Poly-1}}$ | $-(1-P_t)^2\log(P_t)-(1-P_t)^3$ | 83.6/82.5 | 74.8/73.7 | 63.7/63.1 | 55.5/55.0 |
| Improvement | – | +1.1/+1.0 | +0.9/+0.8 | +0.4/+0.4 | +0.3/+0.3 |
| RSN $L_{\text{FL}}$ | $-(1-P_t)^2\log(P_t)$ | 91.3/90.8 | 82.6/82.2 | 78.4/78.1 | 69.5/69.1 |
| RSN $L^{\text{FL}}_{\text{Poly-1*}}$ | $-(1-P_t)^2\log(P_t)-(1-P_t)^3-0.4(1-P_t)^4$ | 91.5/90.9 | 82.7/82.1 | 78.9/78.4 | 69.9/69.5 |
| Improvement | – | +0.2/+0.1 | +0.1/−0.1 | +0.5/+0.3 | +0.4/+0.4 |
| Pedestrian (IoU=0.5) | | | | | |
| PointPillars $L_{\text{FL}}$ | $-(1-P_t)^2\log(P_t)$ | 76.0/62.0 | 67.2/54.6 | 68.9/56.6 | 60.0/49.1 |
| PointPillars $L^{\text{FL}}_{\text{Poly-1}}$ | $-(1-P_t)^2\log(P_t)-(1-P_t)^3$ | 77.1/62.9 | 67.7/55.1 | 69.6/57.1 | 60.2/49.3 |
| Improvement | – | +1.1/+0.9 | +0.5/+0.5 | +0.7/+0.5 | +0.2/+0.2 |
| RSN $L_{\text{FL}}$ | $-(1-P_t)^2\log(P_t)$ | 85.0/81.4 | 75.5/72.2 | 79.4/76.2 | 69.9/67.0 |
| RSN $L^{\text{FL}}_{\text{Poly-1*}}$ | $-(1-P_t)^2\log(P_t)-(1-P_t)^3+0.2(1-P_t)^4$ | 85.4/81.8 | 75.8/72.5 | 80.2/77.0 | 70.6/67.7 |
| Improvement | – | +0.4/+0.4 | +0.3/+0.3 | +0.8/+0.8 | +0.7/+0.7 |

Table 7: PolyLoss improves detection results on the Waymo Open Dataset validation set. Two detection models, the single-stage PointPillars (Lang et al., 2019) and the two-stage SOTA RSN (Sun et al., 2021), are evaluated. Bird's eye view (BEV) and 3D detection average precision (AP) and average precision with heading (APH) at Level 1 (L1) and Level 2 (L2) difficulties are reported. The IoU threshold is set to 0.7 for vehicle detection and 0.5 for pedestrian detection.

Advancing the state-of-the-art with RSN. RSN segments foreground points from the 3D point cloud in the first stage, and then applies sparse convolution to predict 3D bounding boxes from the selected foreground points.
RSN uses the same focal loss as the PointPillars model, i.e., $L_{\text{FL}} = -(1-P_t)^2\log(P_t)$. Since the optimal $L^{\text{FL}}_{\text{Poly-1}}$ for PointPillars ($\epsilon_1 = -1$) is equivalent to dropping the first polynomial, we adopt the same loss formulation for RSN and tune the new leading polynomial $(1-P_t)^4$ by defining $L^{\text{FL}}_{\text{Poly-1*}} = -(1-P_t)^2\log(P_t) - (1-P_t)^3 + \epsilon_2(1-P_t)^4$, shown in Figure 7. We follow the same training schedule optimized for focal loss described in Sun et al. (2021) without adjustment. Our results, in Table 7, show that tuning the new leading polynomial improves all metrics (except vehicle detection BEV APH L2) for the SOTA 3D detector.

Figure 7: Visualizing $L^{\text{FL}}_{\text{Poly-1}}$ and $L^{\text{FL}}_{\text{Poly-1*}}$ in the PolyLoss framework. (Polynomial coefficients in the bases $(1-P_t)^3$ through $(1-P_t)^6$ are plotted for PointPillars ($-(1-P_t)^3$) and for the RSN vehicle ($-0.4(1-P_t)^4$) and pedestrian ($+0.2(1-P_t)^4$) models.)

6 CONCLUSION

In this paper, we propose the PolyLoss framework, which provides a unified view of common loss functions for classification problems. We recognize that, under polynomial expansion, focal loss is a horizontal shift of the polynomial coefficients compared to the cross-entropy loss. This new insight motivates us to explore an alternative dimension, i.e. vertically modifying the polynomial coefficients. Our PolyLoss framework provides flexible ways of changing the loss function shape by adjusting the polynomial coefficients. In this framework, we propose a simple and effective Poly-1 formulation. By simply adjusting the coefficient of the leading polynomial with just one extra hyperparameter $\epsilon_1$, we show our simple Poly-1 improves a variety of models across multiple tasks and datasets. We hope Poly-1's simplicity (one extra line of code) and effectiveness will lead to adoption in more applications of classification than the ones we have managed to explore. More importantly, our work highlights the limitations of common loss functions and shows that simple modifications can lead to improvements even on well-established state-of-the-art models. We hope these findings will encourage exploring and rethinking loss function design beyond the commonly used cross-entropy and focal loss, as well as beyond the simplest Poly-1 loss proposed in this work.

ACKNOWLEDGEMENTS

We thank James Philbin, Doug Eck, Tsung-Yi Lin, and the rest of the Waymo Research and Google Brain teams for valuable feedback.

REPRODUCIBILITY STATEMENT

Our experiments are based on public datasets and open-source code repositories, shown in footnotes 3-6. We do not tune any default training hyperparameters and only modify the loss functions, which are shown in Tables 2-7. The proposed final formulation $L_{\text{Poly-1}}$ requires a one-line code change.

Example code for $L^{\text{CE}}_{\text{Poly-1}}$ with softmax activation is shown below.

def poly1_cross_entropy(logits, labels, epsilon):
    # epsilon >= -1.
    # pt, CE, and Poly1 have shape [batch].
    pt = tf.reduce_sum(labels * tf.nn.softmax(logits), axis=-1)
    CE = tf.nn.softmax_cross_entropy_with_logits(labels, logits)
    Poly1 = CE + epsilon * (1 - pt)
    return Poly1

Example code for $L^{\text{CE}}_{\text{Poly-1}}$ with $\alpha$ label smoothing is shown below.

def poly1_cross_entropy(logits, labels, epsilon, alpha=0.1):
    # epsilon >= -1.
    # one_minus_pt, CE, and Poly1 have shape [batch].
    num_classes = labels.get_shape().as_list()[-1]
    smooth_labels = labels * (1 - alpha) + alpha / num_classes
    one_minus_pt = tf.reduce_sum(
        smooth_labels * (1 - tf.nn.softmax(logits)), axis=-1)
    CE_loss = tf.keras.losses.CategoricalCrossentropy(
        from_logits=True, label_smoothing=alpha, reduction='none')
    CE = CE_loss(labels, logits)
    Poly1 = CE + epsilon * one_minus_pt
    return Poly1

Example code for $L^{\text{FL}}_{\text{Poly-1}}$ with sigmoid activation is shown below.
def poly1_focal_loss(logits, labels, epsilon, gamma=2.0):
    # epsilon >= -1.
    # p, pt, FL, and Poly1 have shape [batch, num_classes].
    p = tf.math.sigmoid(logits)
    pt = labels * p + (1 - labels) * (1 - p)
    FL = focal_loss(pt, gamma)
    Poly1 = FL + epsilon * tf.math.pow(1 - pt, gamma + 1)
    return Poly1

Example code for $L^{\text{FL}}_{\text{Poly-1}}$ with $\alpha$ balance is shown below.

def poly1_focal_loss(logits, labels, epsilon, gamma=2.0, alpha=0.25):
    # epsilon >= -1.
    # p, pt, FL, weight, and Poly1 have shape [batch, num_classes].
    p = tf.math.sigmoid(logits)
    pt = labels * p + (1 - labels) * (1 - p)
    FL = focal_loss(pt, gamma, alpha)
    weight = labels * alpha + (1 - labels) * (1 - alpha)
    Poly1 = FL + epsilon * tf.math.pow(1 - pt, gamma + 1) * weight
    return Poly1

SUPPLEMENTARY MATERIAL

7 PROOF OF THEOREM 1

Theorem 1. For any small $\zeta > 0$, $\delta > 0$, if $N > \log_{1-\delta}(\zeta \cdot \delta)$, then for any $p \in [\delta, 1]$, we have $|R_N(p)| < \zeta$ and $|R'_N(p)| < \zeta$.

Proof.

$$|R_N(p)| = \sum_{j=N+1}^{\infty} \tfrac{1}{j}(1-p)^j \leq \sum_{j=N+1}^{\infty} (1-p)^j = \frac{(1-p)^{N+1}}{p} \leq \frac{(1-\delta)^{N+1}}{\delta} \leq \frac{(1-\delta)^N}{\delta}$$

$$|R'_N(p)| = \sum_{j=N}^{\infty} (1-p)^j = \frac{(1-p)^N}{p} \leq \frac{(1-\delta)^N}{\delta}$$

and by the choice of N, $(1-\delta)^N < \zeta \cdot \delta$, so both bounds are less than $\zeta$.

8 ADJUSTING OTHER TRAINING HYPERPARAMETERS LEADS TO HIGHER GAINS

All the experiments shown in the main text are based on hyperparameters optimized for the baseline loss function, which actually puts PolyLoss at a disadvantage. Here we use the weight decay rate for ResNet-50 as an example. The default weight decay (1e-4) is optimized for cross-entropy loss. Adjusting the decay rate may reduce the model performance of cross-entropy loss but leads to a much higher gain for PolyLoss (+0.8%), which is better than the best accuracy (76.3%) trained using cross-entropy loss.

| Weight decay | 1e-4† | 2e-4 | 9e-5 |
|---|---|---|---|
| Cross-entropy | 76.3 | 76.3 | 76.1 |
| PolyLoss | 76.7 | 77.1 | 76.7 |
| Improv. @ the same weight decay | +0.4 | +0.8 | +0.6 |
| Improv. compared to the best $L_{\text{CE}}$ (76.3%) | +0.4 | +0.8 | +0.4 |

Table 8: ResNet-50 performance on ImageNet-1K using different weight decays. †The default weight decay value is 1e-4.

Here, we add additional ablation studies on COCO detection using RetinaNet. The optimal $\gamma$ and $\alpha$ balance values for focal loss are (2.0, 0.25) (Lin et al., 2017). Since all the hyperparameters are optimized with respect to the optimal ($\gamma$, $\alpha$) values, we observe no improvement when tuning the leading polynomial term. We suspect the detection AP is at a 'local maximum' of the hyperparameters. By adjusting the ($\gamma$, $\alpha$) values, we show PolyLoss consistently outperforms the best focal loss AP (33.4), whether adjusting only the $\gamma$ value (columns 3, 4) or both $\gamma$ and $\alpha$ values (columns 5, 6).

| Focal loss ($\gamma$, $\alpha$) | (2.0, 0.25)† | (1.5, 0.25) | (2.5, 0.25) | (1.5, 0.3) | (2.5, 0.15) |
|---|---|---|---|---|---|
| Focal loss | 33.4 | 33.4 | 33.2 | 33.2 | 32.9 |
| PolyLoss | 33.4 | 33.6 | 33.7 | 33.8 | 33.8 |
| Improv. @ same ($\gamma$, $\alpha$) | 0 | +0.2 | +0.5 | +0.6 | +0.9 |
| Improv. compared to the best $L_{\text{FL}}$ (33.4) | 0 | +0.2 | +0.3 | +0.4 | +0.4 |

Table 9: RetinaNet (ResNet-50 backbone) performance on COCO using different focal loss ($\gamma$, $\alpha$). †The default ($\gamma$, $\alpha$) used in focal loss is (2.0, 0.25).

9 $L_{\text{Drop}}$ WITH MORE HYPERPARAMETER TUNING

For $L_{\text{Drop}}$ (N=2), besides adjusting the learning rate, we further tune the coefficient $\alpha$ of the second polynomial, similar to a prior work (Gonzalez & Miikkulainen, 2020b), and the weight decay:

$$L_{\text{Drop*}} = (1-P_t) + \alpha(1-P_t)^2 \tag{8}$$

Unlike Feng et al. (2020), where $\alpha = 0.5$ after dropping all higher-order polynomials, we find the optimal $\alpha = 8$, while the optimal learning rate is the same as the default setting (0.1). This alone increases the accuracy to 70.9, which shows simply dropping polynomial terms is not enough and adjusting the polynomial coefficients is critical. Further tuning the weight decay leads to less than 0.1% model quality improvement.
Compared to prior works (Gonzalez & Miikkulainen, 2020b; Feng et al., 2020), Poly-1 is more effective and only contains one hyperparameter. Tuning the weight decay of Poly-1 further increases the accuracy while requiring fewer hyperparameters than $L_{\text{Drop*}}$, as shown in Table 10.

| | Cross-entropy | Poly-1 | Poly-1 (weight decay) | $L_{\text{Drop*}}$ |
|---|---|---|---|---|
| Accuracy | 76.3 | 76.7 | 77.1 | 70.9 |
| Num. of parameters | – | 1 | 2 | 3 |

Table 10: Poly-1 outperforms $L_{\text{Drop*}}$ with hyperparameter tuning. Accuracy of ResNet-50 on ImageNet-1K is reported.

10 COLLECTIVELY TUNING MULTIPLE POLYNOMIAL COEFFICIENTS

Besides adjusting individual polynomial coefficients, in this section, we explore collectively tuning multiple polynomial coefficients in the PolyLoss framework. In particular, we change the coefficients in the original cross-entropy loss from $1/j$ (Equation 1) to an exponential decay. Here, we define

$$L_{\text{exp}} = \sum_{j=1}^{2N} e^{-(j-1)/N}(1-P_t)^j \tag{9}$$

where we cut off the infinite sum at twice the decay factor N. We performed a 2D grid search over $N \in \{5, 20, 80, 320\}$ and learning rate $\in \{0.1, 0.4, 1.6, 6.4\}$. The best accuracy is 72.3, with $N = 80$ and learning rate 1.6, shown in Table 11.

| | Cross-entropy | Poly-1 | $L_{\text{exp}}$ |
|---|---|---|---|
| Accuracy | 76.3 | 76.7 | 72.3 |
| Num. of parameters | – | 1 | 2 |

Table 11: Comparing Poly-1 with exponential decay coefficients. Accuracy of ResNet-50 on ImageNet-1K is reported.

Though Poly-1 is better than $L_{\text{exp}}$, there are many more possibilities besides exponential decay. We believe understanding how collectively tuning multiple coefficients affects training is an important topic.

11 COMPARING TO OTHER TRAINING TECHNIQUES

As shown in recent works (He et al., 2019; Bello et al., 2021; Wightman et al., 2021), though independent novel training techniques often lead to sub-1% improvements, combining them can lead to significant overall improvements. To put things into perspective, Poly-1 achieves improvements similar to other commonly used training techniques, such as label smoothing and dropout on the FC layer, shown in Table 12.

| | Cross-entropy | Poly-1 | Label smoothing | Dropout on FC |
|---|---|---|---|---|
| Accuracy | 76.3 | 76.7 | 76.7 | 76.4 |
| Num. of parameters | – | 1 | 1 | 1 |

Table 12: Comparing Poly-1 with common training techniques. Accuracy of ResNet-50 on ImageNet-1K is reported.

12 REDISCOVERING FOCAL LOSS FROM POLYLOSS

Focal loss was first developed for the single-stage detector RetinaNet to address the strong class imbalance present in object detection (Lin et al., 2017). Here, we provide an additional ablation study on how to systematically discover focal loss in the PolyLoss framework and investigate how the leading terms affect training in the presence of class imbalance.

Rediscovering the concept of focal loss from cross-entropy loss. Here, we take a step back and attempt to systematically rediscover the concept of focal loss via our PolyLoss framework. Focal loss is commonly used for training detection models; coming up with such an insight to address the class imbalance issue in detection requires strong domain expertise. We start with the PolyLoss representation of cross-entropy loss and improve it from the PolyLoss gradient perspective.

We start with the cross-entropy loss and define a PolyLoss by dropping the first N polynomials in cross-entropy loss, i.e.

$$L_{\text{Drop-front}} = \sum_{j=N+1}^{\infty} \tfrac{1}{j}(1-P_t)^j = L_{\text{CE}} - \sum_{j=1}^{N} \tfrac{1}{j}(1-P_t)^j$$
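For concreteness, the truncated front of the series can be subtracted in closed form per example; a minimal sketch of this drop-front loss (ours, illustrative, in the style of the reference code) is shown below.

import tensorflow as tf

def drop_front_cross_entropy(logits, labels, n_drop):
    # L_Drop-front = CE - sum_{j=1}^{N} (1/j) * (1 - pt)**j.
    pt = tf.reduce_sum(labels * tf.nn.softmax(logits), axis=-1)
    loss = tf.nn.softmax_cross_entropy_with_logits(labels, logits)
    for j in range(1, n_drop + 1):
        loss -= (1.0 / j) * tf.pow(1.0 - pt, float(j))
    return loss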
Dropping the first two polynomial terms significantly improves both the detection AP and AR, see Figure 8. Dropping the first two polynomials (N = 2) leads to the best RetinaNet performance, which is similar to setting $\gamma = 2$ in focal loss: focal loss with $\gamma = 2$ pushes all the polynomial coefficients to the right by 2, shown in Figure 1 (right), which is similar to truncating the first two polynomial terms.

Figure 8: Dropping leading polynomial terms can improve RetinaNet. (Box AP and box AR are plotted against the number N of dropped leading polynomials, N = 0 to 4.)

Leading polynomials cause overfitting to the majority class. In the PolyLoss framework, the leading polynomial of cross-entropy loss is a constant, shown in Equation 3. For binary classification, the leading gradient for each class is simply $N_{\text{background}} - N_{\text{object}}$, where $N_{\text{background}}$ and $N_{\text{object}}$ are the counts of background and object instances in the training mini-batch. When the class counts are extremely imbalanced, the majority class will dominate the gradient, which leads to a significant bias towards optimizing the majority class.

Dropping polynomials reduces the extremely confident predictions $P_t$, see Figure 9. To examine the composition of the overall prediction confidence, we also plot the $P_t$ for background only and the $P_t$ for objects only. Due to the extreme imbalance between the background and the object class, the overall $P_t$ is dominated by the background-only $P_t$, so reducing the overall $P_t$ decreases the background $P_t$. On the other hand, reducing overfitting to the majority background class leads to more confident predictions $P_t$ on the object class.

Figure 9: Dropping leading polynomials reduces overfitting to the majority class. $P_t$ during RetinaNet training is plotted. Top: overall. Bottom left: background. Bottom right: object. Dark blue curves represent $P_t$ for cross-entropy loss. Blue curves represent dropping the first polynomial in the cross-entropy loss. Light blue curves represent dropping both the first and second polynomials in the cross-entropy loss.

REFERENCES

Irwan Bello, William Fedus, Xianzhi Du, Ekin D Cubuk, Aravind Srinivas, Tsung-Yi Lin, Jonathon Shlens, and Barret Zoph. Revisiting ResNets: Improved training and scaling strategies. arXiv preprint arXiv:2103.07579, 2021.

Samuel Rota Bulo, Gerhard Neuhold, and Peter Kontschieder. Loss max-pooling for semantic image segmentation. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7082–7091. IEEE, 2017.

Yin Cui, Menglin Jia, Tsung-Yi Lin, Yang Song, and Serge Belongie. Class-balanced loss based on effective number of samples. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9268–9277, 2019.

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255. IEEE, 2009.

Xianzhi Du, Tsung-Yi Lin, Pengchong Jin, Golnaz Ghiasi, Mingxing Tan, Yin Cui, Quoc V Le, and Xiaodan Song. SpineNet: Learning scale-permuted backbone for recognition and localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11592–11601, 2020.

Pedro F Felzenszwalb, Ross B Girshick, and David McAllester. Cascade object detection with deformable part models. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 2241–2248. IEEE, 2010.

Lei Feng, Senlin Shu, Zhuoyi Lin, Fengmao Lv, Li Li, and Bo An. Can cross entropy loss be robust to label noise? In Proceedings of the 29th International Joint Conference on Artificial Intelligence, pp. 2206–2212, 2020.

Aritra Ghosh, Naresh Manwani, and PS Sastry. Making risk minimization tolerant to label noise. Neurocomputing, 160:93–107, 2015.
Aritra Ghosh, Himanshu Kumar, and PS Sastry. Robust loss functions under label noise for deep neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 31, 2017.

Santiago Gonzalez and Risto Miikkulainen. Improved training speed, accuracy, and data utilization through loss function optimization. In 2020 IEEE Congress on Evolutionary Computation (CEC), pp. 1–8. IEEE, 2020a.

Santiago Gonzalez and Risto Miikkulainen. Optimizing loss functions through multivariate Taylor polynomial parameterization. arXiv preprint arXiv:2002.00059, 2020b.

Hamideh Hajiabadi, Diego Molla-Aliod, and Reza Monsefi. On extending neural networks with loss ensembles for text classification. arXiv preprint arXiv:1711.05170, 2017.

Nikolaus Hansen and Andreas Ostermeier. Adapting arbitrary normal mutation distributions in evolution strategies: The covariance matrix adaptation. In Proceedings of IEEE International Conference on Evolutionary Computation, pp. 312–317. IEEE, 1996.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.

Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2961–2969, 2017.

Tong He, Zhi Zhang, Hang Zhang, Zhongyue Zhang, Junyuan Xie, and Mu Li. Bag of tricks for image classification with convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 558–567, 2019.
Shaoshuai Shi, Chaoxu Guo, Li Jiang, Zhe Wang, Jianping Shi, Xiaogang Wang, and Hongsheng Li. Pv-rcnn: Point-voxel feature set abstraction for 3d object detection. In Proceedings of the IEEE/CVFConferenceonComputerVisionandPatternRecognition,pp.10529–10538,2020. Abhinav Shrivastava, Abhinav Gupta, and Ross Girshick. Training region-based object detectors withonlinehardexamplemining. InProceedingsoftheIEEEconferenceoncomputervisionand patternrecognition,pp.761–769,2016. Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurelien Chouard, Vijaysai Patnaik, Paul Tsui, James Guo, Yin Zhou, Yuning Chai, Benjamin Caine, et al. Scalability in perception for au- tonomousdriving: Waymoopendataset. InProceedingsoftheIEEE/CVFConferenceonCom- puterVisionandPatternRecognition,pp.2446–2454,2020. Pei Sun, Weiyue Wang, Yuning Chai, Gamaleldin Elsayed, Alex Bewley, Xiao Zhang, Christian Sminchisescu, and Dragomir Anguelov. Rsn: Range sparse net for efficient, accurate lidar 3d objectdetection. InProceedingsoftheIEEE/CVFConferenceonComputerVisionandPattern Recognition,2021. Kah-KaySung. Learningandexampleselectionforobjectandpatterndetection. 1996. Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethink- ing the inception architecture for computer vision. In Proceedings of the IEEE conference on computervisionandpatternrecognition,pp.2818–2826,2016. MingxingTanandQuocVLe. Efficientnetv2: Smallermodelsandfastertraining. InInternational ConferenceonMachineLearning,2021. 15PublishedasaconferencepaperatICLR2022 MingxingTan,RuomingPang,andQuocVLe. Efficientdet:Scalableandefficientobjectdetection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 10781–10790,2020. Andrew Tao, Karan Sapra, and Bryan Catanzaro. Hierarchical multi-scale attention for semantic segmentation. arXivpreprintarXiv:2005.10821,2020. PaulViolaandMichaelJones. Rapidobjectdetectionusingaboostedcascadeofsimplefeatures. In Proceedings of the 2001 IEEE computer society conference on computer vision and pattern recognition.CVPR2001,volume1,pp.I–I.IEEE,2001. YisenWang, XingjunMa, ZaiyiChen, YuanLuo, JinfengYi, andJamesBailey. Symmetriccross entropy for robust learning with noisy labels. In Proceedings of the IEEE/CVF International ConferenceonComputerVision,pp.322–330,2019. Ross Wightman, Hugo Touvron, and Herve´ Je´gou. Resnet strikes back: An improved training procedureintimm. arXivpreprintarXiv:2110.00476,2021. Haowen Xu, Hao Zhang, Zhiting Hu, Xiaodan Liang, Ruslan Salakhutdinov, and Eric Xing. Au- toloss: Learningdiscreteschedulesforalternateoptimization. arXivpreprintarXiv:1810.02442, 2018. ZhiluZhangandMertRSabuncu. Generalizedcrossentropylossfortrainingdeepneuralnetworks withnoisylabels. arXivpreprintarXiv:1805.07836,2018. Guangxiang Zhao, Wenkai Yang, Xuancheng Ren, Lei Li, and Xu Sun. Well-classified examples areunderestimatedinclassificationwithdeepneuralnetworks. arXivpreprintarXiv:2110.06537, 2021. BarretZoph,GolnazGhiasi,Tsung-YiLin,YinCui,HanxiaoLiu,EkinDCubuk,andQuocVLe. Rethinkingpre-trainingandself-training. arXivpreprintarXiv:2006.06882,2020. 16