Can weight sharing outperform random architecture search? An investigation with TuNAS

Gabriel Bender* (Google Research) gbender@google.com
Hanxiao Liu* (Google Research) hanxiaol@google.com
Bo Chen (Google Research) bochen@google.com
Grace Chu (Google Research) cxy@google.com
Shuyang Cheng (Waymo) shuyangcheng@waymo.com
Pieter-Jan Kindermans (Google Research) pikinder@google.com
Quoc Le (Google Research) qvl@google.com
* Contributed equally.

arXiv:2008.06120v1 [cs.LG] 13 Aug 2020

Abstract

Efficient Neural Architecture Search methods based on weight sharing have shown good promise in democratizing Neural Architecture Search for computer vision models. There is, however, an ongoing debate whether these efficient methods are significantly better than random search. Here we perform a thorough comparison between efficient and random search methods on a family of progressively larger and more challenging search spaces for image classification and detection on ImageNet and COCO. While the efficacies of both methods are problem-dependent, our experiments demonstrate that there are large, realistic tasks where efficient search methods can provide substantial gains over random search. In addition, we propose and evaluate techniques which improve the quality of searched architectures and reduce the need for manual hyper-parameter tuning.¹

¹ Source code and experiment data are available at https://github.com/google-research/google-research/tree/master/tunas

Figure 1: Validation accuracies of 250 random architectures (blue bars) vs. 5 independent runs of our TuNAS search algorithm (pink lines) for three different search spaces. Rejection sampling is used to ensure the latencies of random architectures are comparable to those of searched architectures.

1. Introduction

Neural Architecture Search (NAS) tries to find network architectures with excellent accuracy-latency tradeoffs. While the resource costs of early approaches [47, 33] were prohibitively expensive for many, recent efficient architecture search methods based on weight sharing promise to reduce the costs of architecture search experiments by multiple orders of magnitude [30, 3, 26, 5, 43].

The effectiveness of these efficient NAS approaches has been questioned by recent studies (e.g., [20, 45]) presenting experimental results where efficient architecture search methods did not always outperform random search baselines. Furthermore, even when gains were reported, they were often modest. However, most existing results come with limitations. First: negative results may simply indicate that existing algorithms are challenging to implement and tune. Second: most negative results focus on fairly small datasets such as CIFAR-10 or PTB, and some are obtained on heavily restricted search spaces. With those caveats in mind, it is possible that efficient NAS methods work well only on specific search spaces and problems. But even so, they can still be useful if those problems are of high practical value. For this reason we focus on the following: "Can efficient neural architecture search be made to work reliably on large realistic search spaces for realistic problems?"
When comparing against simpler algorithms such as random search, we must consider not just explicit costs, such as the time needed to run a single search, but also implicit costs. For example, many models operate under hard latency constraints (e.g., a model running on a self-driving car in real time). However, the reward functions used by MnasNet [38] and ProxylessNAS [5] require us to run multiple searches with different hyper-parameters to match a given latency target.² Larger models cannot be deployed, while smaller models have sub-optimal accuracies.

² In our experiments we typically needed to run 7 searches to achieve the desired latency.

We present TuNAS, a new easy-to-tune and scalable implementation of efficient NAS with weight sharing, and use it to investigate the questions outlined above:

• We investigate two existing models from the NAS literature, and find that while some of their gains are likely due to better training hyper-parameters, some are due to real improvements in the network architectures.

• We use TuNAS to study the effectiveness of weight sharing on the ProxylessNAS search space for ImageNet. Prior work [5] demonstrated that an efficient NAS algorithm could find state-of-the-art image classification architectures on this space, but left open the question of whether a simple random search baseline could find equally good results. We find that efficient NAS can significantly outperform random search on the ProxylessNAS search space on ImageNet.

• We further evaluate our implementation on two new and even larger search spaces. We find that (i) TuNAS continues to find high-quality architectures, and (ii) the gap between TuNAS and random search increases significantly on these new search spaces.

• We demonstrate that when weight sharing is implemented carefully, the same algorithm can be used across different classification search spaces and across domains (classification and object detection).

• We propose and evaluate two new techniques, which we call op and filter warmup, for better training the shared model weights used during a search. These techniques improve the quality of architectures found by our algorithm, sometimes by a large margin.

• We propose a novel RL reward function which allows us to precisely control the latency of the final architecture returned by the search, significantly reducing the need for additional hyper-parameter tuning.
2. Related Work

Neural Architecture Search was proposed as a way to automate and optimize the design process for network architectures. Existing methods have achieved impressive results on image classification [47, 42, 39], object detection [11, 6], segmentation [23], and video understanding [31]. Early methods were based on Reinforcement Learning [46, 1], Genetic Algorithms [34, 25, 33] or Bayesian Optimization [19]. As these methods require a large amount of compute to achieve good results, recent works focus on addressing this requirement [2, 29, 28, 24]. In many of these methods, a single supernetwork is trained which encompasses all possible options in the search space [4, 30, 3, 5, 26, 43, 7]. A single path within the supernetwork corresponds to an architecture in the search space. With this scheme, weight sharing naturally occurs between the architectures.

Within the framework of efficient search methods based on weight sharing, our work is closely related to ENAS [30], DARTS [26], SNAS [43] and especially ProxylessNAS [5] in terms of optimizing the quality-latency tradeoff of architectures. Different from ProxylessNAS, our method is able to handle substantially larger and more difficult search spaces with less prior knowledge (e.g., hand-engineered output filter sizes) and achieve improved performance.

Multi-Objective Search. An important strength of Neural Architecture Search is that it can cope with an objective function beyond pure accuracy. Recently, Neural Architecture Search has been used intensively to find architectures that have a better tradeoff between accuracy and latency [38, 41, 8, 5, 37], FLOPS [39], power consumption [16], and memory usage [10]. We also focus on resource-aware NAS because (i) finding architectures with good trade-offs between accuracy and latency is valuable in practice, and (ii) constrained optimization may be more challenging than unconstrained optimization, which makes it ideal for a stress-test of efficient NAS algorithms. With this in mind, we make use of and extend the search spaces proposed by MnasNet [38], ProxylessNAS [5], MobileNetV2 [35] and MobileNetV3 [15], which are close to or among the state-of-the-art networks for mobile settings.

Random Search vs. Efficient Search Methods. The use of more complicated search methods in Neural Architecture Search has been challenged [45, 20]. In a nutshell, these studies find that for certain problems and search spaces, random search performs close to or just as well as more complicated search methods. These studies, however, mainly focused on relatively small tasks (such as CIFAR-10) and accuracy-only optimization. Our focus is on larger and more challenging searches which incorporate latency constraints. In these more realistic settings, efficient architecture search significantly outperforms random search.

3. Search Spaces

Our goal is to develop a NAS method that can reliably find high quality models at a specific inference cost across multiple search spaces. We next present three progressively larger search spaces and show that they are non-trivial: they contain known good reference models³ that clearly outperform models found by random search, as shown in Table 1. The same table shows that for the larger of these search spaces, the gap between known good models and random search baselines widens. Although architecture search becomes more difficult, it can also be more beneficial.

³ In Table 3 we compare our reproductions against the published numbers. In all cases our reproductions are at least as accurate as the published results.

| Search Space          | Cardinality | Ref Model | Our Search | RS (N=1) | RS (N=20) | RS (N=50) | RS (N=250) | Simulated Latency (ms) |
|-----------------------|-------------|-----------|------------|----------|-----------|-----------|------------|------------------------|
| ProxylessNAS          | ~10^21      | 76.2      | 76.3±0.2   | 74.1±0.8 | 75.4±0.1  | 75.4±0.2  | 75.6       | 83-85                  |
| ProxylessNAS-Enlarged | ~10^28      | 76.2      | 76.4±0.1   | 72.1±1.5 | 74.4±0.5  | 74.6±0.3  | 74.8       | 83-85                  |
| MobileNetV3-Like      | ~10^43      | 76.5      | 76.6±0.1   | 71.7±1.7 | 74.1±0.6  | 74.6±0.3  | 74.9       | 57-59                  |

Table 1: Comparison between reference models proposed in previous work ("Ref Model"), random search baselines in our search spaces ("RS"), and searched models found by TuNAS ("Our Search"). We report validation accuracies on ImageNet after 90 epochs of training. Cardinality refers to (an upper bound of) the number of unique architectures in the search space. The reference model for the ProxylessNAS and ProxylessNAS-Enlarged search spaces is our reproduction of the ProxylessNAS mobile CPU model [5]. The reference model for the MobileNetV3-Like search space is our reproduction of MobileNetV3 [15]. Mean and variance for Random Search are reported over 5 repeats for N=20 and N=50, and 250 repeats for N=1.

| Search Space          | Built Around | Base Filter Sizes (c_i for each layer) | Expansion Ratio    | Kernel    | Output Filter Size                      | SE        |
|-----------------------|--------------|----------------------------------------|--------------------|-----------|-----------------------------------------|-----------|
| ProxylessNAS          | MobileNetV2  | ProxylessNAS [5]                       | {3, 6}             | {3, 5, 7} | c_i                                     | no        |
| ProxylessNAS-Enlarged | MobileNetV2  | ×2 when stride = 2                     | {3, 6}             | {3, 5, 7} | c_i × {1/2, 5/8, 3/4, 1, 5/4, 3/2, 2}   | no        |
| MobileNetV3-Like      | MobileNetV3  | ×2 when stride = 2                     | {1, 2, 3, 4, 5, 6} | {3, 5, 7} | c_i × {1/2, 5/8, 3/4, 1, 5/4, 3/2, 2}   | {no, yes} |

Table 2: Search spaces we use to evaluate our method. The first two are built around MobileNetV2. The third uses the combination of ReLU and SiLU/Swish [32, 9, 14] activations and the new model head from MobileNetV3. We use a target inference time of 84ms for the first two (to compare against ProxylessNAS and MnasNet) and 57ms for the third search space (to compare against MobileNetV3).

3.1. Search Space Definitions

Details of the three search spaces are summarized in Table 2. Motivations for each of them are outlined below.

ProxylessNAS Space. The first and smallest search space is a reproduction of the one used in ProxylessNAS [5], an efficient architecture search method that reports strong results on mobile CPUs. It consists of a stack of inverted bottleneck layers, where the expansion ratio and the depthwise kernel size for each layer are searchable. The search space is built around MobileNetV2 [35], except that the output filter sizes for all the layers are handcrafted to be similar to those found by MnasNet [38].

ProxylessNAS-Enlarged Space. While earlier convolutional architectures such as VGG [36] used the heuristic of doubling the number of filters every time the feature map width and height were halved, more recent models [35, 38, 5] obtain strong performance using more carefully tuned filter sizes. Our experimental evaluation (Table 8) demonstrates that these carefully tuned filter sizes are in fact important for obtaining good accuracy/latency tradeoffs.

While output filter sizes are something that we ideally should be able to search for automatically, the original ProxylessNAS search space used manually tuned filter sizes built around those discovered by an earlier and more expensive search algorithm [38]. To understand whether this restriction can be lifted, we explore an extension of the ProxylessNAS search space to automatically search over the number of output filters in each layer of the network.

Specifically, we define a list of possible output filter sizes for each layer in our search space by multiplying an integer-valued base filter size by a predefined set of multipliers {1/2, 5/8, 3/4, 1, 5/4, 3/2, 2} and rounding to a multiple of 8.⁴ The base filter size is 16 for the first layer of the network, and is doubled whenever we start a new block. If two layers of the network are connected via a residual connection, we constrain them to have the same output filter size.

⁴ For performance reasons, working with multiples of 8 was recommended for our target inference hardware.
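To make the heuristic concrete, the sketch below enumerates the candidate output filter sizes for one layer. The function name, the clamp to a minimum of 8 filters, and the tie-breaking behavior of Python's round() are our own choices; the paper does not specify them.

```python
def filter_size_candidates(base_filters,
                           multipliers=(1/2, 5/8, 3/4, 1, 5/4, 3/2, 2)):
    """Candidate output filter sizes for a layer: the base size times each
    multiplier, rounded to a multiple of 8 (see footnote 4)."""
    candidates = set()
    for m in multipliers:
        size = int(round(base_filters * m / 8)) * 8  # round to a multiple of 8
        candidates.add(max(size, 8))  # assumption: never drop below 8 filters
    return sorted(candidates)

# The base filter size is 16 for the first layer and doubles at each new block.
print(filter_size_candidates(16))   # [8, 16, 24, 32]
print(filter_size_candidates(128))  # [64, 80, 96, 128, 160, 192, 256]
```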
MobileNetV3-Like Space. Our largest search space is inspired by MobileNetV3 [15]. Different from the previous spaces, models in this space utilize the SiLU/Swish [32, 9, 14] activation function and a compact head [15]. The search space is also much larger than the previous two. First: inverted bottleneck expansion ratios can be selected from the set {1, 2, 3, 4, 5, 6}, compared with {3, 6} in the other search spaces. Second: we optionally allow a Squeeze-and-Excite module [17] to be added to each inverted bottleneck. Output filter sizes are searched; the choices follow the same heuristic used in the ProxylessNAS-Enlarged space.

3.2. Measuring Search Algorithm Effectiveness

We measure the effectiveness of our NAS algorithm on a given search space in two different ways.

Reference Models. Can our algorithm match or exceed the quality of known good architectures within the space? For example, when evaluating the effectiveness of our algorithm on the ProxylessNAS-Enlarged space, we compare against MobileNetV2 [35] (a hand-tuned model), MnasNet [38] (a model obtained from a more expensive NAS algorithm where thousands of candidate architectures were trained from scratch), and ProxylessNAS-Mobile [5] (a model obtained using a NAS algorithm similar to ours, which we use to verify our setup).

Random Search. Can our algorithm provide better results in less time than random search without weight sharing, an easy-to-implement heuristic which is widely used in practice? In industry settings, it is common to target a specific latency T₀: slower models cannot be deployed, while faster models typically have sub-optimal accuracies [39]. However, in practice only a small percentage of models are actually close to this inference target.

To make the baseline interesting in this realistic setup, we perform rejection sampling in the range T₀ ± 1ms to obtain N random models. These models are then trained for 90 epochs and validated. The model with the best result on the validation set is subsequently trained for 360 epochs and evaluated on the test set, analogous to our searched models for final evaluation. Note that the cost of random search with N = 15 to 30 is comparable with the cost of a single run of our efficient search algorithm (Appendix F).

Besides the comparisons discussed above, the complexity of our search spaces can be quantified using several other metrics, which we report in Table 1. A clear progression in the task difficulties can be observed as we move from the smallest ProxylessNAS search space to the largest MobileNetV3-Like search space.
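A minimal sketch of this baseline follows. The three callables are hypothetical stand-ins for the sampling, latency-estimation, and training pipelines, which the paper does not expose at this level of detail.

```python
import random

def random_search_baseline(sample_arch, latency_ms, validation_accuracy,
                           target_ms, n_models, seed=0):
    """Random-search baseline with rejection sampling (a sketch)."""
    rng = random.Random(seed)
    candidates = []
    while len(candidates) < n_models:
        arch = sample_arch(rng)
        # Rejection sampling: keep architectures within +/- 1 ms of T0.
        if abs(latency_ms(arch) - target_ms) <= 1.0:
            candidates.append(arch)
    # Train each survivor for 90 epochs; the best validation model is later
    # retrained for 360 epochs and evaluated on the test set.
    return max(candidates, key=lambda a: validation_accuracy(a, epochs=90))
```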
4. TuNAS

TuNAS uses a reinforcement learning algorithm with weight sharing to perform architecture searches. Our algorithm is similar to ProxylessNAS [5] and ENAS [30], but contains changes to improve robustness and scalability and to reduce the need for manual hyper-parameter tuning.

A search space is represented as a set of categorical decisions which control different aspects of the network architecture. For example, a single categorical decision might control whether we use a 3×3, 5×5, or 7×7 convolution at a particular position in the network. An architecture is an assignment of values to these categorical decisions.

During a search, we learn a policy π, a probability distribution from which we can sample high quality architectures. Formally, π is defined as a collection of independent multinomial variables, one for each of the decisions in our search space. We also learn a set of shared weights W, which are used to efficiently estimate the quality of candidate architectures in our search space.

We alternate between learning the shared weights W using gradient descent and learning the policy π using REINFORCE [40]. At each step, we first sample a network architecture α ∼ π. Next, we use the shared weights to estimate the quality Q(α) of the sampled architecture using a single batch of examples from the validation set. We then estimate the inference time of the sampled architecture, T(α). The accuracy Q(α) and inference time T(α) jointly determine the reward r(α), which is used to update the policy π using REINFORCE.⁵ Finally, we update the shared model weights W by computing a gradient update w.r.t. the architecture α on a batch of examples from the training set.

The above process is repeated over and over until the search completes. At the end of the search, the final architecture is obtained by independently selecting the most probable value for each categorical decision in π.

⁵ We set the learning rate of the RL controller to 0 during the first 25% of training. This allows us to learn a good set of shared model weights before the RL controller kicks in. Details are provided in Appendix E.2.
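The alternating loop can be summarized in a few lines. This is a schematic sketch only: every argument below is a hypothetical callable, and none of the names come from the TuNAS code.

```python
def tunas_search_step(policy, estimate_quality, estimate_latency, reward_fn,
                      update_shared_weights):
    """One step of the alternating TuNAS optimization (schematic)."""
    alpha = policy.sample()              # sample an architecture alpha ~ pi
    quality = estimate_quality(alpha)    # Q(alpha) on one validation batch
    latency = estimate_latency(alpha)    # T(alpha), e.g. via lookup tables
    policy.reinforce_update(alpha, reward_fn(quality, latency))  # update pi
    update_shared_weights(alpha)         # gradient step on W w.r.t. alpha
```

Per footnote 5, the controller's learning rate is held at zero for the first 25% of the search, so early steps effectively train only the shared weights.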
4.1. Weight Sharing

To amortize the cost of an architecture search, NAS algorithms based on weight sharing (e.g., [3, 30, 26, 5]) train a large network – a one-shot model – with many redundant operations, most of which will be removed at the end of the search. In ProxylessNAS [5], for instance, we must select between three possible kernel sizes (3, 5, or 7) and two possible expansion factors (3 or 6), giving us 3×2 = 6 possible combinations. In ProxylessNAS, each of these six combinations will have its own path through the network: its own set of trainable weights and operations which are not shared with any other path. At each step, we randomly select one of the six paths and update the shared model weights and RL controller using just the weights from the selected path.

While this approach works well when the number of paths is small, the size of the one-shot model will quickly blow up once we add more primitives to the search. For instance, the number of unique inverted bottleneck configurations per layer can be as large as 6×3×7×2 = 252 in our MobileNetV3-Like space, in contrast to 2×3 = 6 options in the ProxylessNAS space (Table 2). As a result, the search process cannot be carried out efficiently, because each inverted bottleneck will only be trained a small fraction of the time (1/252 under a uniform policy).

Operation Collapsing. Instead of using a separate set of weights for each possible combination of choices within an inverted bottleneck, we share ("collapse") operations and weights in order to ensure that each trainable weight gets a sufficient gradient signal. The approach is illustrated in Figure 2. For example, while ProxylessNAS uses different 1×1 convolutions for each possible depthwise kernel within an inverted bottleneck, we reuse the same 1×1 convolutions regardless of which depthwise kernel is selected.

Figure 2: Illustration of our aggressive weight sharing scheme for inverted bottleneck layers. Each choice block is associated with a decision to be made based on the RL policy. The expansion ratios and output filter sizes for the 1×1 convolutions are learned through a channel masking mechanism.

Channel Masking. Complementary to operation collapsing, we also share parameters between convolutions with different numbers of input/output channels. The idea is to create only a single convolutional kernel with the largest possible number of channels. We simulate smaller channel sizes by retaining only the first N input (or output) channels and zeroing out the remaining ones. This allows us to efficiently search for both expansion factors and output filter sizes in an inverted bottleneck, as learning these is reduced to learning multinomial distributions over the masks.
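A NumPy illustration of channel masking is shown below; the helper name and the NHWC layout are assumptions made for the sketch.

```python
import numpy as np

def mask_channels(outputs, num_active):
    """Channel masking: simulate a narrower convolution by keeping only the
    first `num_active` channels of the full-width output and zeroing the
    rest. Assumes NHWC layout: shape [batch, height, width, channels]."""
    mask = np.zeros(outputs.shape[-1], dtype=outputs.dtype)
    mask[:num_active] = 1
    return outputs * mask  # broadcasts over batch and spatial dimensions

# One 1x1 convolution is instantiated with the largest candidate filter
# count; masking down to 96 channels simulates a 96-filter convolution.
full = np.random.randn(8, 14, 14, 256).astype(np.float32)
masked = mask_channels(full, num_active=96)
assert not masked[..., 96:].any()
```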
4.2. Targeting a Specific Latency

For many practical applications (e.g., real-time image perception), we want the best possible model that runs within a fixed number of milliseconds. However, we found that with the existing RL reward functions used by MnasNet [38] and ProxylessNAS [5], we frequently had to retune our search hyper-parameters in order to find the best models under a given latency target. This extra retuning step multiplied resource costs by 7× in many of our experiments. We now explain why, and propose a new reward function to address the issue.

Soft Exponential Reward Function. In previous work, Tan et al. [38] proposed a parameterized RL reward function to find architectures with good accuracy/time tradeoffs, and evaluated two instantiations of this function. In the first instantiation (later adopted by ProxylessNAS [5]), they maximize the reward

    r(α) = Q(α) × (T(α)/T₀)^β

where Q(α) indicates the quality (accuracy) of a candidate architecture α, T(α) is its inference time, T₀ is a problem-dependent inference time target, and β < 0 is the cost exponent, a tunable hyper-parameter of the setup. Since β < 0, this reward function is maximized when the model quality Q(α) is large and the inference time T(α) is small.

However, to find the best possible model whose inference time is less than T₀, we must perform a hyper-parameter sweep over β. If β is too small, the inference constraint will effectively be ignored. If β is too large, we will end up with models that have low latencies but sub-optimal accuracies. To make matters worse, we found that changing the search space or search algorithm details required us to retune the value of β, increasing search experiment costs by 7× in practice.

Figure 3 shows a geometric intuition for this behavior. Each contour line in the plot represents a set of possible tradeoffs between model quality and latency which yield the same final reward. Our goal is to try to find an architecture with the highest possible reward, corresponding to a contour line that is as far to the top left as possible. However, the reward must correspond to a viable architecture in the search space, which means the contour must intersect the population's accuracy-latency frontier (circled in black).

For the soft exponential reward, the figure suggests that a small shift in the population (e.g., due to a change in the training setup or search space) can significantly alter the optimal latency. This explains why the same value of β can lead to different latencies in different searches. Both the hard exponential reward function and the proposed absolute reward function, which we will discuss next, are more stable, thanks to the "sharp" change points in their contours.

Figure 3: Contours of different reward functions and their interactions with the frontiers. The blue and the orange denote the frontiers of two search spaces with different accuracy-latency tradeoffs. Left: Soft exponential reward function. Center: Hard exponential reward function. Right: Our absolute reward function. Regions in black indicate architectures with the highest reward.

Hard Exponential Reward Function. A second instantiation of the MnasNet reward function [38] penalizes models whose inference times T(α) are above T₀, but does not reward models whose inference times are less than T₀:

    r(α) = Q(α)                   if T(α) ≤ T₀
    r(α) = Q(α) × (T(α)/T₀)^β     if T(α) > T₀        (1)

At first glance, we might expect that an RL controller using this reward would always favor models with higher accuracies, provided that their inference times do not exceed T₀. However, this is not the case in our experiments. The reason is that the RL controller does not select a single point on the Pareto frontier. Rather, it learns a probability distribution over points. If the cost exponent β is too large, the controller will become risk-averse, preferring to sample architectures whose latencies are significantly below the target T₀. This allows it to minimize the probability of accidentally sampling architectures whose times exceed the target and incurring large penalties. Empirically, we found that if we made the cost penalty β too large, the controller would sample architectures with inference times close to 75ms, even though the target inference time was closer to 85ms.
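For concreteness, the two MnasNet-style instantiations can be written directly as Python; β = −0.07 is one value from the grid sweep discussed in Section 6 and Appendix D, not a recommended default.

```python
def soft_exponential_reward(quality, latency, target, beta=-0.07):
    """Soft reward from MnasNet [38]: r(alpha) = Q * (T/T0)^beta, beta < 0."""
    return quality * (latency / target) ** beta

def hard_exponential_reward(quality, latency, target, beta=-0.07):
    """Hard reward (Eq. 1): models at or under the target T0 are neither
    rewarded nor penalized for latency; slower models are penalized."""
    if latency <= target:
        return quality
    return quality * (latency / target) ** beta
```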
The thesearch,7 duringwhichtheRLcontrollerisdisabledand keyideaistoaddtoourrewardfunctionapriorthatlarger onlythesharedmodelweightsaretrained. models are typically more accurate. Instead of just asking thesearchtoidentifymodelswithinferencetimeslessthan Op Warmup. We enable all possible operations in the T (as in previous work), we explicitly search for the best 0 searchspaceatstartoftraining,andgraduallydropoutmore possiblemodelswhoseinferencetimesareclosetoT .This 0 and more of the operations as the search progresses. Our impliesaconstrainedoptimizationproblem: discussion will focus on a single choice block, where the max Q(α) subjectto T(α)≈T 0 (2) goal is to select one of N possible operations (e.g., convo- α Theabovecanberelaxedasanunconstrainedoptimization lutions, inverted bottleneck layers, etc.) from a predeter- problemthataimstomaximize minedsearchspace. Theideawasoriginallyproposedand evaluated by Bender et al. [3], who used carefully-tuned r(α)=Q(α)+β|T(α)/T −1| 0 heuristicstodeterminethedropoutschedule. Wefoundthat where|·|denotestheabsolutefunctionandβ <0,thecost asimplifiedversionoftheideacouldbeusedtoimproveour exponent,isafinitenegativescalarthatcontrolshowmuch searchresultswithminimaltuning. Withsomeprobability stronglyweencouragearchitecturestohaveinferencetimes pbetween0and1weenablealloperationswithinachoice closetoT . TheexpressionT(α)/T ensurestherewardis 0 0 block, rather than just enabling the operations selected by scale-invariantw.r.t.latency.Searchresultsarerobusttothe the RL controller. When multiple operations are enabled, exactvalueofβ,6 andthisscale-invariancefurtherreduces weaveragetheiroutputs. Whenp = 0,thecontroller’sbe- theneedtoretuneβ fornewdevicesandsearchspaces. haviorisunaffectedbyopwarmup.Whenp=1,weenable Using the absolute value reward, we found that while allpossibleoperationsateverytrainingstep. Inourimple- the average inference time of models sampled by the RL mentation,welinearlydecreasepfrom1to0overthefirst controllerwasconsistentlyclosetothetarget,theinference 25% of the search, during which the RL controller is dis- timeofthemostlikelyarchitectureselectedattheendofthe abled and only the shared model weights are trained. Op searchcouldbeseveralmillisecondslower. Wecombatthe warmupcanbeimplementedinamemory-efficientmanner mismatchbetweenaverageandmostlikelyinferencetimes byleveragingrematerialization(AppendixC). byadjustingthelearningratescheduleoftheRLcontroller. Insteadofusingaconstantlearningratethroughthesearch, 5.ExperimentalSetupandBaselines we exponentially increase the RL learning rate over time. Thisallowsthecontrollertoexplorethesearchspace(with For the ProxylessNAS and ProxylessNAS-Enlarged arelativelylowlearningrate)atthestartofthesearch,but search spaces, our searches target the same latency as the alsoensuresthattheentropyoftheRLcontrollerislowat published ProxylessNAS Mobile architecture [5]. For our theendofthesearch,preventingthemismatch. Additional MobileNetV3-Like search space, we target the latency of informationisprovidedinAppendB. MobileNetV3-Large [15]. The resulting architectures are 4.3.Improvingthesharedmodelweights trainedfromscratchtoobtainvalidationandtestsetaccura- cies. Unlessotherwisespecified,weobtainedvalidationset Weidentifiedtwotechniquesthatallowedustoimprove accuraciesofstandalonemodelsonaheld-outsubsetofthe the quality of the models found by architecture search ex- 7Wealsoexperimentedwithdecreasingpover50%ofstepsinsteadof 6Weusedthesamevalueofβforallourclassificationexperiments. 
4.3. Improving the shared model weights

We identified two techniques that allowed us to improve the quality of the models found by architecture search experiments. Both techniques rely on the intuition that if we can ensure that all of our shared model weights are sufficiently well-trained, we can get a more reliable signal about which parts of the search space are most promising.

Filter Warmup. We can efficiently search over different filter sizes by masking out tensors across the filters dimension. For example, we can simulate a convolution with 96 output filters by taking a convolution with 128 output filters and zeroing out the rightmost 32. However, this introduces a bias into our training process: the left-most filters will always be trained, whereas the right-most filters will be trained only occasionally. To counteract this effect, we randomly enable all the output filters – rather than just the filters selected by the RL controller – with some probability p. We linearly decrease p from 1 to 0 over the first 25% of the search,⁷ during which the RL controller is disabled and only the shared model weights are trained.

⁷ We also experimented with decreasing p over 50% of steps instead of 25%, but did not observe a significant effect on search quality.

Op Warmup. We enable all possible operations in the search space at the start of training, and gradually drop out more and more of the operations as the search progresses. Our discussion will focus on a single choice block, where the goal is to select one of N possible operations (e.g., convolutions, inverted bottleneck layers, etc.) from a predetermined search space. The idea was originally proposed and evaluated by Bender et al. [3], who used carefully-tuned heuristics to determine the dropout schedule. We found that a simplified version of the idea could be used to improve our search results with minimal tuning. With some probability p between 0 and 1 we enable all operations within a choice block, rather than just enabling the operations selected by the RL controller. When multiple operations are enabled, we average their outputs. When p = 0, the controller's behavior is unaffected by op warmup. When p = 1, we enable all possible operations at every training step. In our implementation, we linearly decrease p from 1 to 0 over the first 25% of the search, during which the RL controller is disabled and only the shared model weights are trained. Op warmup can be implemented in a memory-efficient manner by leveraging rematerialization (Appendix C).
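Both warmup techniques share the same linearly decaying probability p. A minimal NumPy sketch of the schedule and of the op-warmup forward rule for one choice block follows; the names and the module-level random generator are our own.

```python
import numpy as np

_rng = np.random.default_rng(0)

def warmup_probability(step, total_steps, warmup_fraction=0.25):
    """Probability p of enabling everything, decayed linearly from 1 to 0
    over the first 25% of the search (shared by op and filter warmup)."""
    return max(0.0, 1.0 - step / (warmup_fraction * total_steps))

def choice_block_output(op_outputs, selected_index, p):
    """Op warmup for one choice block: with probability p, average the
    outputs of all N candidate ops; otherwise pass through only the op
    selected by the RL controller."""
    if _rng.random() < p:
        return np.mean(op_outputs, axis=0)
    return op_outputs[selected_index]
```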
5. Experimental Setup and Baselines

For the ProxylessNAS and ProxylessNAS-Enlarged search spaces, our searches target the same latency as the published ProxylessNAS Mobile architecture [5]. For our MobileNetV3-Like search space, we target the latency of MobileNetV3-Large [15]. The resulting architectures are trained from scratch to obtain validation and test set accuracies. Unless otherwise specified, we obtained validation set accuracies of standalone models on a held-out subset of the ImageNet training set, after training on the remainder for 90 epochs. We obtained test set accuracies after training on the full training set for 360 epochs. Details for standalone training and architecture search are reported in Appendix E.

For architecture search experiments, we always repeat the entire search process 5 times, as suggested by Lindauer and Hutter [22], and report the mean and variance of the performance of the resulting architectures. This is different from the popular practice of training a single resulting architecture multiple times, which only reflects the variance of a single searched architecture (which can be cherry-picked) rather than the search algorithm itself. For reproductions of published models such as MobileNetV2, where no architecture search is required on our part, we report the mean and variance across five training runs.

Identical hyper-parameters are used to train all of our models. (See Appendix E for details.) The one exception is the dropout rate of the final fully connected layer, which is set to 0 when training for 90 epochs, 0.15 when training MobileNetV2-based models for 360 epochs, and 0.25 when training MobileNetV3-based models for 360 epochs. We initially experimented with tuning hyper-parameters such as the learning rate and dropout rate separately for each model. However, this did not lead to any additional quality gains, and was omitted for our final experiments.

5.1. Reproducing Reference Architectures

We start by reproducing three popular mobile image classification models in our training setup: MobileNetV2 [35], MnasNet-B1 [38], and ProxylessNAS [5]. This serves two purposes. First, it allows us to verify our training and evaluation setup. And second, it allows us to cleanly distinguish between improvements in our model training setup and improvements in our searched network architectures.

Results are presented in Table 3. Our hyper-parameters provide quality comparable to the published results for MnasNet-B1 and significantly improve upon the published results of ProxylessNAS and MobileNetV2.

| Name         | Simulated Latency | Valid Acc (%, ours) | Test Acc (%, published) | Test Acc (%, ours) |
|--------------|-------------------|---------------------|-------------------------|--------------------|
| MobileNetV2  | 77.2ms            | 74.5±0.1            | 72.0                    | 73.3±0.1           |
| MnasNet-B1   | 84.5ms            | 76.0±0.1            | 74.5                    | 74.5±0.1           |
| ProxylessNAS | 84.4ms            | 76.3±0.2            | 74.6                    | 74.9±0.1           |
| MobileNetV3  | 58.5ms            | 76.5±0.2            | 75.2                    | 75.3±0.1           |

Table 3: Reproductions of our baseline models on ImageNet.

There is an especially large (1.3%) accuracy improvement in our reproduction of MobileNetV2. This suggests that some of the quality gains from MnasNet and ProxylessNAS which were previously attributed to better network architectures may in fact be due to better hyper-parameter tuning. It underscores the importance of accounting for differences in training and evaluation setups when comparing different network architectures.
6. Results and Discussion

Compared with previous papers on efficient architecture search such as ProxylessNAS, our architecture search setup includes several novel features, including (i) a new absolute value reward, (ii) the use of op and filter warmup, and (iii) more aggressive weight sharing during searches. At the end of this section we systematically evaluate these changes. First we evaluate our proposed efficient architecture search implementation on the three search spaces presented in Section 3, and compare our results against random search with similar or higher search cost. The search spaces gradually increase in terms of both size and difficulty.

Finding 1: TuNAS outperforms Random Search (RS) by a large margin in each of our three classification search spaces. This holds even though we use 2-3x more compute resources for each RS experiment (Appendix F).

First (Table 4), we evaluate our search algorithm on the ProxylessNAS search space [5]. Despite having lower search costs, the accuracies of architectures found with an efficient architecture search improve upon random search by 1%. These results also provide a sanity check for our setup: the results of our search are competitive with those reported by the original ProxylessNAS paper.

| Model/Method      | Valid Acc (%) | Test Acc (%) | Latency  |
|-------------------|---------------|--------------|----------|
| ProxylessNAS [5]  | 76.2          | 74.8         | 84.4     |
| RS (N=20)         | 75.4±0.2      | 73.9±0.3     | 84.3±0.8 |
| RS (N=50)         | 75.4±0.2      | 74.0±0.2     | 83.8±0.6 |
| TuNAS (90 epochs) | 76.3±0.2      | 75.0±0.1     | 84.0±0.4 |

Table 4: Results in the ProxylessNAS search space. "ProxylessNAS [5]" is our reproduction of the ProxylessNAS-Mobile model. Our TuNAS implementation includes op/filter warmup, the absolute value reward, and more aggressive weight sharing.

Next (Table 5), we evaluate our search algorithm on the ProxylessNAS-Enlarged search space, which additionally searches over output filter sizes. In this larger and more challenging search space, the accuracy gap between our method and random search increases from 1% to 2%.

| Model/Method      | Valid Acc (%) | Test Acc (%) | Latency  |
|-------------------|---------------|--------------|----------|
| MobileNetV2 [35]  | 74.4          | 73.4         | 77.2     |
| MnasNet-B1 [38]   | 76.0          | 74.5         | 84.5     |
| ProxylessNAS [5]  | 76.2          | 74.8         | 84.4     |
| RS (N=20)         | 74.4±0.5      | 73.1±0.6     | 84.0±0.6 |
| RS (N=50)         | 74.6±0.3      | 73.2±0.3     | 83.5±0.3 |
| TuNAS (90 epochs) | 76.4±0.1      | 75.3±0.2     | 84.0±0.4 |

Table 5: Results in the ProxylessNAS-Enlarged search space.

Finally (Table 6), we evaluate our search algorithm on our MobileNetV3-Like search space, which is the most challenging of the three. In addition to being the largest of the three search spaces, the accuracies of architectures sampled from this search space are – on average – lower than in the other two. Increasing the size of a random sample from N = 20 to N = 50 can improve the results of random search. However, we find that increasing the search time of our algorithm from 90 to 360 epochs can also improve the results of our efficient algorithm, while still maintaining a lower search cost than random search with N = 50.

| Model/Method       | Valid Acc (%) | Test Acc (%) | Latency  |
|--------------------|---------------|--------------|----------|
| MobileNetV3-L [15] | 76.5          | 75.3         | 58.5     |
| RS (N=20)          | 74.1±0.6      | 73.0±0.5     | 58.5±0.5 |
| RS (N=50)          | 74.6±0.3      | 73.5±0.2     | 58.7±0.4 |
| TuNAS (90 epochs)  | 76.6±0.1      | 75.2±0.2     | 57.0±0.2 |
| TuNAS (360 epochs) | 76.7±0.2      | 75.4±0.1     | 57.1±0.1 |

Table 6: Results in the MobileNetV3-Like search space.

Finding 2: The TuNAS implementation generalizes to object detection. We investigate the transferability of our algorithm to the object detection task by searching for the detection backbone w.r.t. both mean average precision and inference cost. Results on COCO [21] are summarized in Table 7. The searched architecture outperforms the state-of-the-art model MobileNetV3 + SSDLite [15]. Details of the experimental setup are presented in Appendix E.3.

| Backbone           | COCO Test-dev mAP | Latency |
|--------------------|-------------------|---------|
| MobileNetV2        | 20.7              | 126     |
| MnasNet            | 21.3              | 129     |
| ProxylessNAS       | 21.8              | 140     |
| MobileNetV3-Large  | 22.0              | 106     |
| TuNAS Search       | 22.5              | 106     |

Table 7: Backbone architecture search results on MS COCO in the MobileNetV3-Like space. All detection backbones are combined with the SSDLite head. The target latency for the TuNAS search was set to 106ms (same as for MobileNetV3-Large + SSDLite).

Finding 3: Output filter sizes are important. MnasNet-B1 [38] searches over the number of output filters in addition to factors such as the kernel sizes and expansion factors. This is different from many recent papers on efficient NAS – including ENAS [30], DARTS [26], and ProxylessNAS [5] – which hard-coded the output filter sizes.

To determine the importance of output filter sizes, one possibility would be to modify the output filter sizes of a high-performing model (such as the ProxylessNAS-Mobile model) and look at how the model accuracy changes. However, we can potentially do better by searching for a new architecture whose operations are better-adapted to the new output filter sizes. We therefore perform two different variants of the latter procedure. In the first variant, we replace the ProxylessNAS output filter sizes (which are hard-coded to be almost the same as MnasNet) with a naive heuristic where we double the number of filters whenever we halve the image width and height, similar to architectures such as ResNet and VGG. In the second, we double the number of filters at each new block. Table 8 shows that searched filter sizes significantly outperform both doubling heuristics.

| Filters            | Valid Acc (%) | Test Acc (%) | Latency  |
|--------------------|---------------|--------------|----------|
| ProxylessNAS       | 76.3±0.2      | 75.0±0.1     | 84.0±0.4 |
| ×2 Every Stride-2  | 74.8±0.2      | 73.5±0.2     | 83.9±1.0 |
| ×2 Every Block     | 75.3±0.2      | 74.0±0.2     | 83.9±0.2 |

Table 8: Effect of output filter sizes on final model accuracies.

Finding 4: Aggressive weight sharing enables larger search spaces without significant quality impact. We share weights between candidate networks more aggressively than previous works such as ProxylessNAS (Section 4.1). This lets us explore much larger search spaces, including one with up to 252 options per inverted bottleneck. For the ProxylessNAS space (where searches are possible both with and without it), we verified that it does not significantly affect searched model quality (Appendix G).

Finding 5: The absolute value reward reduces hyper-parameter tuning significantly. With the MnasNet-Soft reward function, we found it necessary to grid search over β ∈ {−0.03, −0.04, −0.05, −0.06, −0.07, −0.08, −0.09} in order to reliably find network architectures close to the target latency. By switching to the absolute value reward function, we were able to eliminate the need for this search, reducing resource costs by a factor of 7. We compared the quality of both methods on our implementation of the ProxylessNAS search space with weight sharing, and found that the absolute value reward function did not significantly affect the quality/latency tradeoff (Table 9 and Appendix D).

| Reward                | Valid Acc (%) | Test Acc (%) | Latency  |
|-----------------------|---------------|--------------|----------|
| MnasNet-Soft Reward   | 76.2±0.2      | 74.8±0.3     | 79.5±3.3 |
| Absolute Value Reward | 76.4±0.1      | 75.0±0.1     | 84.1±0.4 |

Table 9: Comparison of our absolute value reward function (T₀ = 84ms) vs. the reward used by MnasNet and ProxylessNAS. While both provide similar quality/latency tradeoffs, our absolute value reward allows precise control over the inference latency and reduces the need for extra tuning to find a suitable cost exponent.

Finding 6: Op and filter warmup lead to consistent improvements across all search spaces. Controlled experiments are presented in Table 10. While improvements are small in some spaces, they account for nearly half of all quality gains in the ProxylessNAS-Enlarged space.

| Search Space       | Warmup | Valid Acc (%) | Test Acc (%) | Latency  |
|--------------------|--------|---------------|--------------|----------|
| ProxylessNAS       | no     | 76.1±0.1      | 74.7±0.1     | 84.0±0.3 |
| ProxylessNAS       | yes    | 76.3±0.2      | 75.0±0.1     | 84.0±0.4 |
| ProxylessNAS-Enl   | no     | 75.8±0.3      | 74.6±0.2     | 83.6±0.2 |
| ProxylessNAS-Enl   | yes    | 76.4±0.1      | 75.3±0.2     | 84.0±0.4 |
| MobileNetV3-Like   | no     | 76.2±0.2      | 75.0±0.1     | 57.0±0.6 |
| MobileNetV3-Like   | yes    | 76.6±0.1      | 75.2±0.2     | 57.0±0.2 |

Table 10: Comparison of search results with vs. without op and filter warmup. We use aggressive weight sharing and search for 90 epochs in all search configurations.

Acknowledgements

We would like to thank Blake Hechtman, Ryan Sepassi, and Tong Shen for their help with software changes needed to make our algorithms work well on TPUs.
We would also like to thank Berkin Akin, Okan Arikan, Yiming Chen, Zhifeng Chen, Frank Chu, Ekin D. Cubuk, Matthieu Devin, Suyog Gupta, Andrew Howard, Da Huang, Adam Kraft, Peisheng Li, Yifeng Lu, Ruoming Pang, Daiyi Peng, Mark Sandler, Yonghui Wu, Zhinan Xu, Xin Zhou, and Menglong Zhu for helpful feedback and discussions.

References

[1] Bowen Baker, Otkrist Gupta, Nikhil Naik, and Ramesh Raskar. Designing neural network architectures using reinforcement learning. In International Conference on Learning Representations, 2017.
[2] Bowen Baker, Otkrist Gupta, Ramesh Raskar, and Nikhil Naik. Accelerating neural architecture search using performance prediction. arXiv preprint arXiv:1705.10823, 2017.
[3] Gabriel Bender, Pieter-Jan Kindermans, Barret Zoph, Vijay Vasudevan, and Quoc Le. Understanding and simplifying one-shot architecture search. In International Conference on Machine Learning, pages 549–558, 2018.
[4] Andrew Brock, Theo Lim, J.M. Ritchie, and Nick Weston. SMASH: One-shot model architecture search through hypernetworks. In International Conference on Learning Representations, 2018.
[5] Han Cai, Ligeng Zhu, and Song Han. ProxylessNAS: Direct neural architecture search on target task and hardware. arXiv preprint arXiv:1812.00332, 2018.
[6] Yukang Chen, Tong Yang, Xiangyu Zhang, Gaofeng Meng, Chunhong Pan, and Jian Sun. DetNAS: Neural architecture search on object detection. In Advances in Neural Information Processing Systems, 2019.
[7] Jiequan Cui, Pengguang Chen, Ruiyu Li, Shu Liu, Xiaoyong Shen, and Jiaya Jia. Fast and practical neural architecture search. In International Conference on Computer Vision, 2019.
2 son and Ricardo Silva, editors, Proceedings of the Thirty- [8] Xiaoliang Dai, Peizhao Zhang, Bichen Wu, Hongxu Yin, Fifth Conference on Uncertainty in Artificial Intelligence, FeiSun,YanghanWang,MaratDukhan,YunqingHu,Yim- UAI2019,TelAviv,Israel,July22-25,2019,page129.AUAI ing Wu, Yangqing Jia, Peter Vajda, Matt Uyttendaele, and Press,2019. 1,2 Niraj K. Jha. Chamnet: Towards efficient network design [21] Tsung-YiLin,MichaelMaire,SergeBelongie,JamesHays, through platform-aware model adaptation. In The IEEE PietroPerona,DevaRamanan,PiotrDolla´r,andCLawrence Conference on Computer Vision and Pattern Recognition Zitnick. Microsoft coco: Common objects in context. In (CVPR),June2019. 2 European conference on computer vision, pages 740–755. [9] Stefan Elfwing, Eiji Uchibe, and Kenji Doya. Sigmoid- Springer,2014. 8 weightedlinearunitsforneuralnetworkfunctionapproxima- [22] MariusLindauerandFrankHutter. Bestpracticesforscien- tioninreinforcementlearning. NeuralNetworks,107:3–11, tificresearchonneuralarchitecturesearch,2019. 7 2018. 3 [23] Chenxi Liu, Liang-Chieh Chen, Florian Schroff, Hartwig [10] IgorFedorov,RyanPAdams,MatthewMattina,andPaulN Adam, Wei Hua, Alan L Yuille, and Li Fei-Fei. Auto- Whatmough. Sparse: Sparse architecture search for cnns deeplab:Hierarchicalneuralarchitecturesearchforsemanticimagesegmentation.InProceedingsoftheIEEEConference Platform-awareneuralarchitecturesearchformobile.InThe onComputerVisionandPatternRecognition,pages82–92, IEEEConferenceonComputerVisionandPatternRecogni- 2019. 2 tion(CVPR),June2019. 2,3,4,5,7,8 [24] Chenxi Liu, Barret Zoph, Maxim Neumann, Jonathon [39] Mingxing Tan and Quoc V Le. Efficientnet: Rethinking Shlens,WeiHua,Li-JiaLi,LiFei-Fei,AlanYuille,Jonathan model scaling for convolutional neural networks. In Inter- Huang,andKevinMurphy. Progressiveneuralarchitecture nationalConferenceonMachineLearning,2019. 2,4 search. In The European Conference on Computer Vision [40] RonaldJWilliams. Simplestatisticalgradient-followingal- (ECCV),September2018. 2 gorithmsforconnectionistreinforcementlearning. Machine [25] Hanxiao Liu, Karen Simonyan, Oriol Vinyals, Chrisantha learning,8(3-4):229–256,1992. 4 Fernando, and Koray Kavukcuoglu. Hierarchical repre- [41] BichenWu,XiaoliangDai,PeizhaoZhang,YanghanWang, sentations for efficient architecture search. arXiv preprint FeiSun,YimingWu,YuandongTian,PeterVajda,Yangqing arXiv:1711.00436,2017. 2 Jia,andKurtKeutzer. Fbnet:Hardware-awareefficientcon- [26] Hanxiao Liu, Karen Simonyan, and Yiming Yang. Darts: vnetdesignviadifferentiableneuralarchitecturesearch. In Differentiablearchitecturesearch. InInternationalConfer- Proceedings of the IEEE Conference on Computer Vision enceonLearningRepresentations,2019. 1,2,4,8,11 andPatternRecognition,pages10734–10742,2019. 2 [27] Ilya Loshchilov and Frank Hutter. Sgdr: Stochas- [42] SainingXie,AlexanderKirillov,RossGirshick,andKaim- tic gradient descent with warm restarts. arXiv preprint ingHe. Exploringrandomlywiredneuralnetworksforim- arXiv:1608.03983,2016. 12 agerecognition. arXivpreprintarXiv:1904.01569,2019. 2 [28] RenqianLuo,FeiTian,TaoQin,EnhongChen,andTie-Yan [43] SiruiXie,HehuiZheng,ChunxiaoLiu,andLiangLin.Snas: Liu.Neuralarchitectureoptimization.InAdvancesinneural stochasticneuralarchitecturesearch. InInternationalCon- informationprocessingsystems,pages7816–7827,2018. 2 ferenceonLearningRepresentations,2019. 1,2 [29] RenatoNegrinhoandGeoffGordon. Deeparchitect: Auto- [44] Tien-JuYang,AndrewHoward,BoChen,XiaoZhang,Alec matically designing and training deep architectures. 
arXiv Go, MarkSandler, VivienneSze, andHartwig Adam. Ne- preprintarXiv:1704.08792,2017. 2 tadapt:Platform-awareneuralnetworkadaptationformobile [30] HieuPham,MelodyYGuan,BarretZoph,QuocVLe,and applications. InProceedingsoftheEuropeanConferenceon JeffDean. Efficientneuralarchitecturesearchviaparameter ComputerVision(ECCV),pages285–300,2018. 13 sharing. InInternationalConferenceonMachineLearning, [45] KaichengYu,ChristianSciuto,MartinJaggi,ClaudiuMusat, 2018. 1,2,4,8 and Mathieu Salzmann. Evaluating the search phase of [31] AJ Piergiovanni, Anelia Angelova, and Michael S. Ryoo. neural architecture search. In International Conference on Tiny video networks. arXiv preprint arXiv:1910.06961, LearningRepresentations,2020. 1,2 2019. 2 [46] Barret Zoph and Quoc V Le. Neural architecture search [32] Prajit Ramachandran, Barret Zoph, and Quoc V Le. withreinforcementlearning. InInternationalConferenceon Searching for activation functions. arXiv preprint LearningRepresentations,2016. 2 arXiv:1710.05941,2017. 3 [47] BarretZoph,VijayVasudevan,JonathonShlens,andQuocV [33] EstebanReal,AlokAggarwal,YanpingHuang,andQuocV Le. Learning transferable architectures for scalable image Le. Regularized evolution for image classifier architecture recognition. In Proceedings of the IEEE conference on search. InProceedingsoftheAAAIConferenceonArtificial computervisionandpatternrecognition,pages8697–8710, Intelligence,volume33,pages4780–4789,2019. 1,2 2018. 1,2 [34] Esteban Real, Sherry Moore, Andrew Selle, Saurabh Sax- ena, Yutaka Leon Suematsu, Quoc Le, and Alex Kurakin. Large-scaleevolutionofimageclassifiers. InInternational ConferenceonMachineLearning,2017. 2 [35] MarkSandler,AndrewHoward,MenglongZhu,AndreyZh- moginov, and Liang-Chieh Chen. Mobilenetv2: Inverted residualsandlinearbottlenecks. InProceedingsoftheIEEE Conference on Computer Vision and Pattern Recognition, pages4510–4520,2018. 2,3,4,7,12 [36] KarenSimonyanandAndrewZisserman. Verydeepconvo- lutional networks for large-scale image recognition. arXiv preprintarXiv:1409.1556,2014. 3 [37] Dimitrios Stamoulis, Ruizhou Ding, Di Wang, Dimitrios Lymberopoulos, Bodhi Priyantha, Jie Liu, and Diana Mar- culescu. Single-pathnas:Designinghardware-efficientcon- vnetsinlessthan4hours. arXivpreprintarXiv:1904.02877, 2019. 2 [38] MingxingTan, BoChen, RuomingPang, VijayVasudevan, MarkSandler,AndrewHoward,andQuocV.Le. Mnasnet:A.VarianceofProxylessNAS-MobileModel theiroutputs. Ratherthanretainingtheintermediateresults of these operations for use during the backwards pass, we For stand-alone model training, we estimated the vari- throw them away. During the backwards pass, we recom- ance in accuracy across runs by starting five identical runs putetheintermediateresultsforoneoperationatatime. In of the ProxylessNAS-Mobile model. We repeated the ex- practice, this leads to a large decrease in memory require- periment in two configurations. In the first configuration, ments,asweonlyneedtoretainasingleinputtensoranda we trained for 90 epochs and then evaluated the models single output tensor for each choice block. For example, onourvalidationset;theresultingvalidationsetaccuracies in our reproduction of the original ProxylessNAS search were[76.1%,76.4%,76.4%,76.5%,76.2%]. 
A. Variance of ProxylessNAS-Mobile Model

For stand-alone model training, we estimated the variance in accuracy across runs by starting five identical runs of the ProxylessNAS-Mobile model. We repeated the experiment in two configurations. In the first configuration, we trained for 90 epochs and then evaluated the models on our validation set; the resulting validation set accuracies were [76.1%, 76.4%, 76.4%, 76.5%, 76.2%]. In the second configuration, we trained for 360 epochs and then evaluated on our test set; the resulting test set accuracies were [75.0%, 75.0%, 74.9%, 74.9%, 75.0%].⁸

⁸ Accuracies on the validation set are typically a few percentage points higher than on the test set.

B. Average vs. Argmax Inference Time

For the absolute value reward function with a constant RL learning rate, we argued that the reason models didn't converge to the target inference time was the difference between the inference times of randomly sampled models vs. the argmax model obtained by selecting the most likely choice for each categorical decision. To test this hypothesis, we compared the two at the end of a search. We obtained an average time from randomly sampled models using an exponential moving average with a decay rate of 0.9 which was updated every 100 training steps. While average times consistently converged to the desired inference time target, the inference time from the argmax could differ by 8ms or more.

| β     | Avg. Time | ArgMax Time |
|-------|-----------|-------------|
| -0.01 | 134.2ms   | 128.8ms     |
| -0.02 | 104.3ms   | 92.2ms      |
| -0.05 | 84.2ms    | 75.8ms      |
| -0.10 | 83.1ms    | 78.3ms      |
| -0.20 | 84.0ms    | 82.7ms      |
| -0.50 | 84.2ms    | 83.4ms      |
| -1.00 | 83.2ms    | 85.5ms      |

Table 11: Average vs. argmax inference times when using the absolute value reward function without a learning rate schedule for the RL controller.
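The gap can be illustrated schematically. If latency decomposes into a sum of independent per-decision costs (the same assumption behind the lookup tables of Appendix E.4), the expected latency under the policy and the latency of the argmax architecture are straightforward to compare; the helper below is our own illustration, not code from the paper.

```python
import numpy as np

def expected_vs_argmax_latency(decision_probs, latency_tables, n_samples=10000,
                               rng=np.random.default_rng(0)):
    """Compare the mean latency of architectures sampled from the policy with
    the latency of the argmax architecture, assuming latency is a sum of
    independent per-decision costs. Each element of `decision_probs` is a
    probability vector; each element of `latency_tables` lists the per-choice
    latency costs (ms) for the corresponding decision."""
    argmax_latency = sum(table[np.argmax(p)]
                         for p, table in zip(decision_probs, latency_tables))
    samples = sum(rng.choice(table, size=n_samples, p=p)
                  for p, table in zip(decision_probs, latency_tables))
    return samples.mean(), argmax_latency
```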
C. Rematerialization

If op warmup is implemented naively, then the activation memory required to train the shared model weights grows linearly with the number of possible operations in the search space. If many different possible operations can be simultaneously enabled at each position in the network, the model will be unable to fit in memory. We use rematerialization to address this issue. During the forward pass, we apply N different operations to the same input and average their outputs. Rather than retaining the intermediate results of these operations for use during the backwards pass, we throw them away. During the backwards pass, we recompute the intermediate results for one operation at a time. In practice, this leads to a large decrease in memory requirements, as we only need to retain a single input tensor and a single output tensor for each choice block. For example, in our reproduction of the original ProxylessNAS search space with a per-core batch size of 128, rematerialization decreases the memory needed for training from 29.5 GiB to 4.8 GiB. This memory-saving technique, which allowed us to scale to larger search spaces, came at the cost of roughly a 30% increase in search times to perform a second forward pass for each of the N possible operations.

Finally, we note that although this rematerialization trick was developed with our version of op warmup in mind, it could also be used to reduce the memory requirements of a method such as DARTS [26] which requires us to evaluate every possible operation in the search space at every training step.
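In TensorFlow, one way to obtain this behavior is tf.recompute_grad, which discards a wrapped function's intermediate activations after the forward pass and recomputes them during the backward pass. The sketch below is our own illustration of the idea rather than the authors' implementation.

```python
import tensorflow as tf

def choice_block_forward(op_fns, x):
    """Average the outputs of all candidate ops in a choice block (the
    op-warmup case where everything is enabled), wrapping each op in
    tf.recompute_grad so that only the block's input and output tensors
    stay live; each op's intermediates are recomputed on the backward pass."""
    outputs = [tf.recompute_grad(fn)(x) for fn in op_fns]
    return tf.add_n(outputs) / float(len(outputs))
```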
D. Discussion of Absolute Value Reward

We now contrast a typical architecture search workflow with the MnasNet-Soft reward function against a workflow with our new absolute value reward function. For the MnasNet-Soft reward function, the first step when using a new search space or training configuration is to tune the RL controller's cost exponent β to obtain inference times which are reasonably close to our target latency. In our early experiments, we found that grid searching over β ∈ {−0.03, −0.04, −0.05, −0.06, −0.07, −0.08, −0.09} worked well in practice. However, running this grid search increased the cost of architecture search experiments by a factor of 7.

Even after we fixed the value of β, the latencies and accuracies of architectures found by a search could vary significantly from one run to the next. For example, in our reproduction of the ProxylessNAS search space with β = −0.07, five identical architecture search experiments returned latencies which ranged from 74ms to 82ms. We also saw a wide variance in accuracies across the different architectures, ranging from 75.8% to 76.4% on the validation set and 74.2% to 75.1% on the test set. Larger models generally had better accuracies, indicating that the problem stemmed from our inability to precisely control the latency.

This helped motivate our absolute reward function, which allowed the RL controller to reliably find architectures whose latencies were close to the target. For example, the low variance of searched TuNAS model latencies in Tables 4, 5, 6, and 12 shows we can reliably find high-quality architectures within 1 ms of the target across several different search configurations, even when we reuse the same search hyper-parameters between different setups.

As an alternative to the absolute value reward function, we also considered searching for an architecture close to the inference time target, and then uniformly scaling up or down the number of filters in every layer. While this helped reduce the variance of searched model accuracies, it did not remove the need to tune the RL cost exponent, since we needed to find a model that was already close to the inference time target to get good results. Furthermore, finding the right scaling factor to hit a specific inference time target added an extra step to experiments in this setup. The absolute value reward function gave us high-quality architectures with a more streamlined search process.

E. Experimental Setup

E.1. Standalone Training for Classification

During stand-alone model training, each model was trained using distributed synchronous SGD on TensorFlow with a Cloud TPU v2-32 or Cloud TPU v3-32 instance (32 TPU cores) and a per-core batch size of 128. Models were optimized using RMSProp with momentum = 0.9, decay rate = 0.9, and epsilon = 0.1. The learning rate was annealed following a cosine decay schedule without restarts [27], with a maximum value of 2.64 globally (or 0.0825 per core). We linearly increased the learning rate from 0 over the first 2.5% of training steps [12]. Models were trained with batch normalization with epsilon = 0.001 and momentum = 0.99. Convolutional kernels were initialized with He initialization [13],⁹ while bias variables were initialized to 0. The final fully connected layer of the network was initialized from a random normal distribution with mean 0 and standard deviation 0.01. We applied L2 regularization with a strength of 0.00004 to all convolutional kernels except the final fully connected layer of the network. All models were trained with ResNet data preprocessing and an input image size of 224×224 pixels. When training models for 360 epochs, we applied a dropout rate of 0.15 before the final fully connected layer for models from MobileNetV2 search spaces and 0.25 for models from MobileNetV3 search spaces. We did not apply dropout when training models for 90 epochs.

⁹ TensorFlow's default variable initialization heuristics, such as tf.initializers.he_normal, are designed for ordinary convolutions, and can overestimate the fan-in of depthwise convolutional kernels by multiple orders of magnitude; we corrected this issue in our version.

As is standard for ImageNet experiments, our test set accuracies were obtained on what is confusingly called the ImageNet validation set for historical reasons. What we refer to as validation set accuracies were obtained on a held-out subset of the ImageNet training set containing 50,046 randomly selected examples.

E.2. Architecture Search for Classification

Architecture search experiments are performed using Cloud TPU v2-32 or Cloud TPU v3-32 instances with 32 TPU cores and a per-core batch size of 128.

For training the shared model weights, we use the same hyper-parameters as for stand-alone model training, except that the dropout rate of the final fully connected layer is always set to 0. When applying L2 regularization to the trainable model variables, we only regularize parameters which are used in the current training step. Because batch norm statistics can potentially vary significantly from one candidate architecture to the next, batch norm is always applied in "training" mode, even during model evaluation.

For training the RL controller, we use an Adam optimizer with a base learning rate of 3e-4, β₁ = 0, β₂ = 0.999, and ε = 1e-8. We set the learning rate of the RL controller to 0 for the first 25% of training. If using an exponential schedule, we set the learning rate equal to the base value 25% of the way through training, and increase it exponentially so that the final learning rate is 10x the base learning rate. If using a constant schedule, we set the learning rate equal to the base learning rate after the first 25% of training.

E.3. Object Detection

Our implementation is based on the TensorFlow Object Detection API [18]. All backbones are combined with SSDLite [35] as the head. Following MobileNetV2 [35] and V3 [15], we use the last feature extractor layers that have an output stride of 16 (C4) and 32 (C5) as the endpoints for the head. In contrast with MobileNetV3 + SSDLite [15], we do not manually halve the number of channels for blocks between C4 and C5, since in our case the number of channels is automatically learned by the search algorithm. All experiments use 320×320 input images.

For standalone training, each detection model is trained for 50K steps from scratch on COCO train2017 data using a Cloud TPU v2-32 or TPU v3-32 instance (32 TPU cores) with a per-core batch size of 32. We use SGD to optimize the shared model weights with a momentum of 0.9. The (global) learning rate is warmed up linearly from 0 to 4 during the first 5K steps and then decayed to 0 following a cosine schedule [27] during the rest of the training process.

For architecture searches, training configurations for the model weights remain the same as for standalone training. We split out 10% of the data from COCO train2017 to compute the reward during an architecture search. The training setup of the RL controller is the same as for classification, except that the base learning rate of the Adam optimizer is set to 5e-3. Whereas classification models are evaluated based on accuracy, detection models are evaluated using mAP (mean Average Precision). To obtain the results in Table 7, architecture searches were carried out in the MobileNetV3-Like search space with a target inference cost of 106ms to match the simulated latency of MobileNetV3 + SSDLite.

To obtain the test-dev results, each model is trained over the combined COCO train2017 and val2017 data for 100K steps. Other settings remain the same as those for standalone training and validation.

E.4. Simulated Inference Times

In early experiments, we found that if we benchmarked the same model on two different phones, the observed latencies could differ by several milliseconds. To ensure that our results were reproducible – and to mitigate the possibility of random hardware-specific variance across runs – we estimated the latencies of our models using lookup tables similar to those proposed by NetAdapt [44]. These lookup tables let us estimate the latency of each individual operation (e.g., convolution or pooling layer) in the network. The overall latency of a network architecture was estimated by summing up the latencies of all its individual operations.

We validated our use of simulated latencies by sampling 100 random architectures and comparing the simulated numbers against on-device numbers measured on a real Pixel-1 phone. Figure 4 shows that the two are well-correlated.

Figure 4: On-device vs. simulated latencies in the ProxylessNAS-Enlarged and MobileNetV3-Like search spaces. Each plot is based on 100 random architectures in the given space.
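Such a simulated estimate reduces to one table lookup per operation plus a sum. A minimal sketch with a made-up two-entry table follows; real tables would be benchmarked per target device, and the op-signature keys here are invented for illustration.

```python
# Hypothetical per-op latency lookup table (milliseconds), keyed by an
# op signature; the entries below are invented for illustration only.
LATENCY_TABLE_MS = {
    ("conv3x3", 112, 16, 16): 1.8,
    ("inverted_bottleneck_k5_e6", 56, 16, 24): 3.2,
}

def simulated_latency_ms(ops, table=LATENCY_TABLE_MS):
    """Estimate a network's latency by summing per-op lookup-table entries,
    in the spirit of NetAdapt-style simulated latencies [44]."""
    return sum(table[op] for op in ops)

print(simulated_latency_ms([("conv3x3", 112, 16, 16),
                            ("inverted_bottleneck_k5_e6", 56, 16, 24)]))  # 5.0
```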
F. Cost of Random Search vs. Efficient NAS

Training a single architecture for 90 epochs on ImageNet requires about 1.7 hours using a Cloud TPU v3-32 instance (32 cores), whereas a single architecture search run takes between 8 and 13 hours, depending on the search space. This means that for the cost of a single 90-epoch search, we can evaluate 4-8 random models. In some cases, we found that the results of an efficient architecture search could be further improved by increasing the number of search epochs from 90 to 360. For the cost of a single 360-epoch search, we can evaluate 15-30 random models. We provide a generous budget of 20-50 models for our random search experiments in order to demonstrate that efficient architecture search can outperform random search even if each random search experiment is more compute-intensive.

G. Quality of Aggressive Weight Sharing

To verify the quality impact of aggressive weight sharing, we ran architecture searches on the original ProxylessNAS search space both with and without aggressive sharing. The results (Table 12) indicate that aggressive weight sharing does not significantly affect searched model accuracies in this space. Our other two search spaces (ProxylessNAS-Enlarged and MobileNetV3-Like) were too large for us to run searches without aggressive weight sharing.

| Agg. Sharing | Valid Acc (%) | Test Acc (%) | Latency  |
|--------------|---------------|--------------|----------|
| No           | 76.4±0.1      | 75.0±0.1     | 84.1±0.4 |
| Yes          | 76.3±0.2      | 75.0±0.1     | 84.0±0.4 |

Table 12: Effect of aggressive weight sharing ("Agg. Sharing") on the quality of searched architectures. Each search is run for 90 epochs on the ProxylessNAS search space with op and filter warmup enabled.