Can weight sharing outperform random architecture search? An investigation with TuNAS

Gabriel Bender* (Google Research, gbender@google.com), Hanxiao Liu* (Google Research, hanxiaol@google.com), Bo Chen (Google Research, bochen@google.com), Grace Chu (Google Research, cxy@google.com), Shuyang Cheng (Waymo, shuyangcheng@waymo.com), Pieter-Jan Kindermans (Google Research, pikinder@google.com), Quoc Le (Google Research, qvl@google.com)
*Contributed equally.

Abstract

Efficient Neural Architecture Search methods based on weight sharing have shown promise in democratizing Neural Architecture Search for computer vision models. There is, however, an ongoing debate whether these efficient methods are significantly better than random search. Here we perform a thorough comparison between efficient and random search methods on a family of progressively larger and more challenging search spaces for image classification and detection on ImageNet and COCO. While the efficacies of both methods are problem-dependent, our experiments demonstrate that there are large, realistic tasks where efficient search methods can provide substantial gains over random search. In addition, we propose and evaluate techniques which improve the quality of searched architectures and reduce the need for manual hyper-parameter tuning. Source code and experiment data are available at https://github.com/google-research/google-research/tree/master/tunas.

1. Introduction

Neural Architecture Search (NAS) tries to find network architectures with excellent accuracy-latency tradeoffs. While the resource costs of early approaches [47, 33] were prohibitively expensive for many, recent efficient architecture search methods based on weight sharing promise to reduce the costs of architecture search experiments by multiple orders of magnitude [30, 3, 26, 5, 43].

The effectiveness of these efficient NAS approaches has been questioned by recent studies (e.g., [20, 45]) presenting experimental results where efficient architecture search methods did not always outperform random search baselines. Furthermore, even when gains were reported, they were often modest.

Figure 1: Validation accuracies of 250 random architectures (blue bars) vs. 5 independent runs of our TuNAS search algorithm (pink lines) for three different search spaces. Rejection sampling is used to ensure the latencies of random architectures are comparable to those of searched architectures.

However, most existing results come with limitations. First: negative results may simply indicate that existing algorithms are challenging to implement and tune. Second: most negative results focus on fairly small datasets such as CIFAR-10 or PTB, and some are obtained on heavily restricted search spaces. With those caveats in mind, it is possible that efficient NAS methods work well only on specific search spaces and problems. But even so, they can still be useful if those problems are of high practical value. For this reason we focus on the following question: "Can efficient neural architecture search be made to work reliably on large realistic search spaces for realistic problems?"

When comparing against simpler algorithms such as random search, we must consider not just explicit costs, such as the time needed to run a single search, but also implicit costs. For example, many models operate under hard latency constraints (e.g., a model running on a self-driving car in real time).
However, the reward functions used by MnasNet [38] and ProxylessNAS [5] require us to run multiple searches with different hyper-parameters to match a given latency target (in our experiments we typically needed to run 7 searches to achieve the desired latency). Architectures that miss the target are problematic: slower models cannot be deployed, while faster models typically have sub-optimal accuracies.

We present TuNAS, a new easy-to-tune and scalable implementation of efficient NAS with weight sharing, and use it to investigate the questions outlined above:

• We investigate two existing models from the NAS literature, and find that while some of their gains are likely due to better training hyper-parameters, some are due to real improvements in the network architectures.
• We use TuNAS to study the effectiveness of weight sharing on the ProxylessNAS search space for ImageNet. Prior work [5] demonstrated that an efficient NAS algorithm could find state-of-the-art image classification architectures on this space, but left open the question of whether a simple random search baseline could find equally good results. We find that efficient NAS can significantly outperform random search on the ProxylessNAS search space on ImageNet.
• We further evaluate our implementation on two new and even larger search spaces. We find that (i) TuNAS continues to find high-quality architectures, and (ii) the gap between TuNAS and random search increases significantly on these new search spaces.
• We demonstrate that when weight sharing is implemented carefully, the same algorithm can be used across different classification search spaces and across domains (classification and object detection).
• We propose and evaluate two new techniques, which we call op and filter warmup, for better training the shared model weights used during a search. These techniques improve the quality of architectures found by our algorithm, sometimes by a large margin.
• We propose a novel RL reward function which allows us to precisely control the latency of the final architecture returned by the search, significantly reducing the need for additional hyper-parameter tuning.

2. Related work

Neural Architecture Search was proposed as a way to automate and optimize the design process for network architectures. Existing methods have achieved impressive results on image classification [47, 42, 39], object detection [11, 6], segmentation [23], and video understanding [31]. Early methods were based on Reinforcement Learning [46, 1], Genetic Algorithms [34, 25, 33] or Bayesian Optimization [19]. As these methods require a large amount of compute to achieve good results, recent works focus on addressing this requirement [2, 29, 28, 24]. In many of these methods, a single supernetwork is trained which encompasses all possible options in the search space [4, 30, 3, 5, 26, 43, 7]. A single path within the supernetwork corresponds to an architecture in the search space. With this scheme, weight sharing naturally occurs between the architectures.

Within the framework of efficient search methods based on weight sharing, our work is closely related to ENAS [30], DARTS [26], SNAS [43] and especially ProxylessNAS [5] in terms of optimizing the quality-latency tradeoff of architectures. Different from ProxylessNAS, our method is able to handle substantially larger and more difficult search spaces with less prior knowledge (e.g., hand-engineered output filter sizes) and achieve improved performance.
Multi-Objective Search. An important strength of Neural Architecture Search is that it can cope with an objective function beyond pure accuracy. Recently, Neural Architecture Search has been used intensively to find architectures with better tradeoffs between accuracy and latency [38, 41, 8, 5, 37], FLOPS [39], power consumption [16], and memory usage [10]. We also focus on resource-aware NAS because (i) finding architectures with good trade-offs between accuracy and latency is valuable in practice, and (ii) constrained optimization may be more challenging than unconstrained optimization, which makes it ideal for a stress-test of efficient NAS algorithms. With this in mind, we make use of and extend the search spaces proposed by MnasNet [38], ProxylessNAS [5], MobileNetV2 [35] and MobileNetV3 [15], which are close to or among the state-of-the-art networks for mobile settings.

Random Search vs Efficient Search Methods. The use of more complicated search methods in Neural Architecture Search has been challenged [45, 20]. In a nutshell, these studies find that for certain problems and search spaces, random search performs close to or just as well as more complicated search methods. These studies, however, mainly focused on relatively small tasks (such as CIFAR-10) and accuracy-only optimization. Our focus is on larger and more challenging searches which incorporate latency constraints. In these more realistic settings, efficient architecture search significantly outperforms random search.

3. Search Spaces

Our goal is to develop a NAS method that can reliably find high quality models at a specific inference cost across multiple search spaces. We next present three progressively larger search spaces and show that they are non-trivial: they contain known good reference models that clearly outperform models found by random search, as shown in Table 1. Table 1 also shows that for the larger of these search spaces, the gap between known good models and random search baselines widens; although architecture search becomes more difficult, it can also be more beneficial. (In Table 3 we compare our reproductions of these reference models against the published numbers; in all cases our reproductions are at least as accurate as the published results.)

Search Space | Cardinality | Ref Model | Our Search | RS (N=1) | RS (N=20) | RS (N=50) | RS (N=250) | Simulated Latency (ms)
ProxylessNAS | ~10^21 | 76.2 | 76.3 ± 0.2 | 74.1 ± 0.8 | 75.4 ± 0.1 | 75.4 ± 0.2 | 75.6 | 83-85
ProxylessNAS-Enlarged | ~10^28 | 76.2 | 76.4 ± 0.1 | 72.1 ± 1.5 | 74.4 ± 0.5 | 74.6 ± 0.3 | 74.8 | 83-85
MobileNetV3-Like | ~10^43 | 76.5 | 76.6 ± 0.1 | 71.7 ± 1.7 | 74.1 ± 0.6 | 74.6 ± 0.3 | 74.9 | 57-59
Table 1: Comparison between reference models proposed in previous work ("Ref Model"), random search baselines in our search spaces ("RS"), and searched models found by TuNAS ("Our Search"). We report validation accuracies on ImageNet after 90 epochs of training. Cardinality refers to (an upper bound of) the number of unique architectures in the search space. The reference model for the ProxylessNAS and ProxylessNAS-Enlarged search spaces is our reproduction of the ProxylessNAS mobile CPU model [5]. The reference model for the MobileNetV3-Like search space is our reproduction of MobileNetV3 [15]. Mean and variance for Random Search are reported over 5 repeats for N=20 and N=50, and 250 repeats for N=1.

Search Space | Built Around | Base Filter Sizes (c_i per layer) | Expansion Ratio | Kernel | Output Filter Size | Squeeze-and-Excite
ProxylessNAS | MobileNetV2 | from ProxylessNAS [5] | {3, 6} | {3, 5, 7} | c_i (fixed) | no
ProxylessNAS-Enlarged | MobileNetV2 | ×2 when stride = 2 | {3, 6} | {3, 5, 7} | c_i × {1/2, 5/8, 3/4, 1, 5/4, 3/2, 2} | no
MobileNetV3-Like | MobileNetV3 | ×2 when stride = 2 | {1, 2, 3, 4, 5, 6} | {3, 5, 7} | c_i × {1/2, 5/8, 3/4, 1, 5/4, 3/2, 2} | {yes, no}
Table 2: Search spaces we use to evaluate our method. The first two are built around MobileNetV2. The third uses the combination of ReLU and SiLU/Swish [32, 9, 14] activations and the new model head from MobileNetV3. The last four columns list the typical per-layer choices within an inverted bottleneck. We use a target inference time of 84ms for the first two (to compare against ProxylessNAS and MnasNet) and 57ms for the third search space (to compare against MobileNetV3).
3.1. Search Space Definitions

Details of the three search spaces are summarized in Table 2. Motivations for each of them are outlined below.

ProxylessNAS Space. The first and smallest search space is a reproduction of the one used in ProxylessNAS [5], an efficient architecture search method that reports strong results on mobile CPUs. It consists of a stack of inverted bottleneck layers, where the expansion ratio and the depthwise kernel size for each layer are searchable. The search space is built around MobileNetV2 [35], except that the output filter sizes for all the layers are handcrafted to be similar to those found by MnasNet [38].

ProxylessNAS-Enlarged Space. While earlier convolutional architectures such as VGG [36] used the heuristic of doubling the number of filters every time the feature map width and height were halved, more recent models [35, 38, 5] obtain strong performance using more carefully tuned filter sizes. Our experimental evaluation (Table 8) demonstrates that these carefully tuned filter sizes are in fact important for obtaining good accuracy/latency tradeoffs.

While output filter sizes are something that we ideally should be able to search for automatically, the original ProxylessNAS search space used manually tuned filter sizes built around those discovered by an earlier and more expensive search algorithm [38]. To understand whether this restriction can be lifted, we explore an extension of the ProxylessNAS search space that automatically searches over the number of output filters in each layer of the network. Specifically, we define a list of possible output filter sizes for each layer in our search space by multiplying an integer-valued base filter size by a predefined set of multipliers {1/2, 5/8, 3/4, 1, 5/4, 3/2, 2} and rounding to a multiple of 8. (Working with multiples of 8 was recommended for our target inference hardware for performance reasons.) The base filter size is 16 for the first layer of the network, and is doubled whenever we start a new block. If two layers of the network are connected via a residual connection, we constrain them to have the same output filter size.

MobileNetV3-Like Space. Our largest search space is inspired by MobileNetV3 [15]. Different from the previous spaces, models in this space utilize the SiLU/Swish activation function [32, 9, 14] and a compact head [15]. The search space is also much larger than the previous two. First: inverted bottleneck expansion ratios can be selected from the set {1, 2, 3, 4, 5, 6}, compared with {3, 6} in the other search spaces. Second: we optionally allow a Squeeze-and-Excite module [17] to be added to each inverted bottleneck. Output filter sizes are searched; the choices follow the same heuristic used in the ProxylessNAS-Enlarged space.
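To make the output-filter heuristic shared by the ProxylessNAS-Enlarged and MobileNetV3-Like spaces concrete, the sketch below enumerates the candidate filter sizes for one layer. This is our own illustration in plain Python, not the TuNAS implementation; the helper names and the exact rounding convention are assumptions.

```python
# Illustrative sketch (not the TuNAS source): candidate output filter sizes
# for a layer, derived from its base filter size and the fixed multiplier set.
MULTIPLIERS = [1 / 2, 5 / 8, 3 / 4, 1, 5 / 4, 3 / 2, 2]


def round_to_multiple_of_8(value):
    """Round to the nearest multiple of 8 (never below 8); the exact rounding
    rule used in the paper is not specified, so this is an assumption."""
    return max(8, int(round(value / 8)) * 8)


def output_filter_choices(base_filters):
    """Return the sorted, de-duplicated list of searchable filter sizes."""
    return sorted({round_to_multiple_of_8(base_filters * m) for m in MULTIPLIERS})


# Example: a layer whose base filter size is 64 can choose among
# [32, 40, 48, 64, 80, 96, 128] output filters.
print(output_filter_choices(64))
```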
3.2. Measuring Search Algorithm Effectiveness

We measure the effectiveness of our NAS algorithm on a given search space in two different ways.

Reference Models. Can our algorithm match or exceed the quality of known good architectures within the space? For example, when evaluating the effectiveness of our algorithm on the ProxylessNAS-Enlarged space, we compare against MobileNetV2 [35] (a hand-tuned model), MnasNet [38] (a model obtained from a more expensive NAS algorithm where thousands of candidate architectures were trained from scratch), and ProxylessNAS-Mobile [5] (a model obtained using a NAS algorithm similar to ours, which we use to verify our setup).

Random Search. Can our algorithm provide better results in less time than random search without weight sharing, an easy-to-implement heuristic which is widely used in practice? In industry settings, it is common to target a specific latency T0; slower models cannot be deployed, while faster models typically have sub-optimal accuracies [39]. In practice, however, only a small percentage of models are actually close to this inference target. To make the baseline meaningful in this realistic setup, we perform rejection sampling in the range T0 ± 1ms to obtain N random models. These models are then trained for 90 epochs and validated. The model with the best result on the validation set is subsequently trained for 360 epochs and evaluated on the test set, analogous to our searched models for final evaluation. Note that the cost of random search with N = 15 to 30 is comparable with the cost of a single run of our efficient search algorithm (Appendix F).

Besides the comparisons discussed above, the complexity of our search spaces can be quantified using several other metrics, which we report in Table 1. A clear progression in task difficulty can be observed as we move from the smallest ProxylessNAS search space to the largest MobileNetV3-Like search space.
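The random search baseline described above can be summarized with the following sketch. It is a simplified illustration in plain Python rather than the code we actually ran; the search space is modeled as a dictionary of categorical decisions, and the latency estimator and trainer are passed in as caller-supplied callables.

```python
# Sketch of the random search baseline with rejection sampling (illustrative).
import random


def random_search_baseline(search_space, estimate_latency_ms, train_and_validate,
                           target_ms, n_models, tolerance_ms=1.0, seed=0):
    """search_space: dict mapping decision name -> list of allowed values.
    estimate_latency_ms(arch) and train_and_validate(arch, epochs) are
    problem-specific callables supplied by the caller."""
    rng = random.Random(seed)
    candidates = []
    # Rejection sampling: keep only architectures within target_ms +/- tolerance_ms.
    while len(candidates) < n_models:
        arch = {name: rng.choice(values) for name, values in search_space.items()}
        if abs(estimate_latency_ms(arch) - target_ms) <= tolerance_ms:
            candidates.append(arch)
    # Train each candidate for 90 epochs; the best one on the validation set
    # would then be retrained for 360 epochs and evaluated on the test set.
    scored = [(train_and_validate(arch, epochs=90), arch) for arch in candidates]
    best_valid_acc, best_arch = max(scored, key=lambda item: item[0])
    return best_arch, best_valid_acc
```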
4. TuNAS

TuNAS uses a reinforcement learning algorithm with weight sharing to perform architecture searches. Our algorithm is similar to ProxylessNAS [5] and ENAS [30], but contains changes to improve robustness and scalability and to reduce the need for manual hyper-parameter tuning.

A search space is represented as a set of categorical decisions which control different aspects of the network architecture. For example, a single categorical decision might control whether we use a 3×3, 5×5, or 7×7 convolution at a particular position in the network. An architecture is an assignment of values to these categorical decisions.

During a search, we learn a policy π, a probability distribution from which we can sample high quality architectures. Formally, π is defined as a collection of independent multinomial variables, one for each of the decisions in our search space. We also learn a set of shared weights W, which are used to efficiently estimate the quality of candidate architectures in our search space.

We alternate between learning the shared weights W using gradient descent and learning the policy π using REINFORCE [40]. At each step, we first sample a network architecture α ∼ π. Next, we use the shared weights to estimate the quality Q(α) of the sampled architecture using a single batch of examples from the validation set. We then estimate the inference time of the sampled architecture, T(α). The accuracy Q(α) and inference time T(α) jointly determine the reward r(α), which is used to update the policy π using REINFORCE. (We set the learning rate of the RL controller to 0 during the first 25% of training; this allows us to learn a good set of shared model weights before the RL controller kicks in. Details are provided in Appendix E.2.) Finally, we update the shared model weights W by computing a gradient update w.r.t. the architecture α on a batch of examples from the training set.

The above process is repeated until the search completes. At the end of the search, the final architecture is obtained by independently selecting the most probable value for each categorical decision in π.

4.1. Weight Sharing

To amortize the cost of an architecture search, NAS algorithms based on weight sharing (e.g., [3, 30, 26, 5]) train a large network – a one-shot model – with many redundant operations, most of which will be removed at the end of the search. In ProxylessNAS [5], for instance, we must select between three possible kernel sizes (3, 5, or 7) and two possible expansion factors (3 or 6), giving us 3 × 2 = 6 possible combinations. Each of these six combinations has its own path through the network: its own set of trainable weights and operations which are not shared with any other path. At each step, we randomly select one of the six paths and update the shared model weights and RL controller using just the weights from the selected path.

While this approach works well when the number of paths is small, the size of the one-shot model quickly blows up once we add more primitives to the search. For instance, the number of unique inverted bottleneck configurations per layer can be as large as 6 × 3 × 7 × 2 = 252 in our MobileNetV3-Like space, in contrast to 2 × 3 = 6 options in the ProxylessNAS space (Table 2). As a result, the search process cannot be carried out efficiently, because each inverted bottleneck would only be trained a small fraction of the time (1/252 under a uniform policy).

Operation Collapsing. Instead of using a separate set of weights for each possible combination of choices within an inverted bottleneck, we share ("collapse") operations and weights in order to ensure that each trainable weight gets a sufficient gradient signal. The approach is illustrated in Figure 2. For example, while ProxylessNAS uses different 1×1 convolutions for each possible depthwise kernel within an inverted bottleneck, we reuse the same 1×1 convolutions regardless of which depthwise kernel is selected.

Figure 2: Illustration of our aggressive weight sharing scheme for inverted bottleneck layers. Each choice block is associated with a decision to be made based on the RL policy. The expansion ratios and output filter sizes for the 1×1 convolutions are learned through a channel masking mechanism.

Channel Masking. Complementary to operation collapsing, we also share parameters between convolutions with different numbers of input/output channels. The idea is to create only a single convolutional kernel with the largest possible number of channels. We simulate smaller channel sizes by retaining only the first N input (or output) channels and zeroing out the remaining ones. This allows us to efficiently search for both expansion factors and output filter sizes in an inverted bottleneck, as learning these choices is reduced to learning multinomial distributions over the masks.
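The channel masking mechanism can be illustrated with a small NumPy sketch. This is a simplification we wrote for exposition (a 1×1 convolution over flattened spatial positions is just a matrix multiplication); it is not the TuNAS code.

```python
# Simplified illustration of channel masking for a shared 1x1 convolution.
import numpy as np

MAX_FILTERS = 128   # the largest output width supported by the shared kernel
IN_CHANNELS = 64
rng = np.random.default_rng(0)
shared_kernel = rng.normal(size=(MAX_FILTERS, IN_CHANNELS))


def masked_conv1x1(x, active_filters):
    """Apply the shared 1x1 convolution, keeping only the first
    `active_filters` output channels and zeroing out the rest."""
    y = x @ shared_kernel.T                                  # (batch, MAX_FILTERS)
    mask = (np.arange(MAX_FILTERS) < active_filters).astype(y.dtype)
    return y * mask


x = rng.normal(size=(8, IN_CHANNELS))
y_96 = masked_conv1x1(x, active_filters=96)    # simulates a 96-filter convolution
y_128 = masked_conv1x1(x, active_filters=128)  # the full-width convolution
assert np.allclose(y_96[:, 96:], 0.0)          # channels 96..127 are masked out
```

Because every width choice reuses the same underlying kernel, searching over expansion factors and output filter sizes only requires learning a multinomial distribution over the value of the number of active channels.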
4.2. Targeting a Specific Latency

For many practical applications (e.g., real-time image perception), we want the best possible model that runs within a fixed number of milliseconds. However, we found that with the existing RL reward functions used by MnasNet [38] and ProxylessNAS [5], we frequently had to retune our search hyper-parameters in order to find the best models under a given latency target. This extra retuning step multiplied resource costs by 7× in many of our experiments. We now explain why, and propose a new reward function to address the issue.

Soft Exponential Reward Function. In previous work, Tan et al. [38] proposed a parameterized RL reward function to find architectures with good accuracy/time tradeoffs, and evaluated two instantiations of this function. In the first instantiation (later adopted by ProxylessNAS [5]), they maximize the reward

r(α) = Q(α) × (T(α) / T0)^β

where Q(α) indicates the quality (accuracy) of a candidate architecture α, T(α) is its inference time, T0 is a problem-dependent inference time target, and β < 0 is the cost exponent, a tunable hyper-parameter of the setup. Since β < 0, this reward function is maximized when the model quality Q(α) is large and the inference time T(α) is small.

However, to find the best possible model whose inference time is less than T0, we must perform a hyper-parameter sweep over β. If the magnitude of β is too small, the inference time constraint will effectively be ignored. If the magnitude of β is too large, we will end up with models that have low latencies but sub-optimal accuracies. To make matters worse, we found that changing the search space or search algorithm details required us to retune the value of β, increasing search experiment costs by 7× in practice.

Figure 3 shows a geometric intuition for this behavior. Each contour line in the plot represents a set of possible tradeoffs between model quality and latency which yield the same final reward. Our goal is to find an architecture with the highest possible reward, corresponding to a contour line that is as far to the top left as possible. However, the reward must correspond to a viable architecture in the search space, which means the contour must intersect the population's accuracy-latency frontier (circled in black).

Figure 3: Contours of different reward functions and their interactions with the frontiers. The blue and the orange curves denote the frontiers of two search spaces with different accuracy-latency tradeoffs. Left: soft exponential reward function. Center: hard exponential reward function. Right: our absolute reward function. Regions in black indicate architectures with the highest reward.

For the soft exponential reward, the figure suggests that a small shift in the population (e.g., due to a change in the training setup or search space) can significantly alter the optimal latency. This explains why the same value of β can lead to different latencies in different searches. Both the hard exponential reward function and the proposed absolute reward function, which we discuss next, are more stable, thanks to the "sharp" change points in their contours.

Hard Exponential Reward Function. A second instantiation of the MnasNet reward function [38] penalizes models whose inference times T(α) are above T0 but does not reward models whose inference times are less than T0:

r(α) = Q(α)                       if T(α) ≤ T0
r(α) = Q(α) × (T(α) / T0)^β       if T(α) > T0        (1)

At first glance, we might expect that an RL controller using this reward would always favor models with higher accuracies, provided that their inference times do not exceed T0. However, this is not the case in our experiments. The reason is that the RL controller does not select a single point on the Pareto frontier. Rather, it learns a probability distribution over points.
If the magnitude of the cost exponent β is too large, the controller will become risk-averse, preferring to sample architectures whose latencies are significantly below the target T0. This allows it to minimize the probability of accidentally sampling architectures whose times exceed the target and incurring large penalties. Empirically, we found that if we made the cost penalty β too large, the controller would sample architectures with inference times close to 75ms, even though the target inference time was closer to 85ms.

Our Solution: Absolute Reward Function. We propose a new reward function which can find good architectures whose inference times are close to a user-specified target T0 and which is robust to the exact values of its hyper-parameters. The key idea is to add to our reward function a prior that larger models are typically more accurate. Instead of just asking the search to identify models with inference times less than T0 (as in previous work), we explicitly search for the best possible models whose inference times are close to T0. This implies a constrained optimization problem:

max_α Q(α)   subject to   T(α) ≈ T0        (2)

The above can be relaxed into an unconstrained optimization problem that aims to maximize

r(α) = Q(α) + β × |T(α) / T0 − 1|

where |·| denotes the absolute value and β < 0, the cost exponent, is a finite negative scalar that controls how strongly we encourage architectures to have inference times close to T0. The ratio T(α)/T0 ensures the reward is scale-invariant w.r.t. latency. Search results are robust to the exact value of β (we used the same value of β for all our classification experiments), and this scale-invariance further reduces the need to retune β for new devices and search spaces.

Using the absolute value reward, we found that while the average inference time of models sampled by the RL controller was consistently close to the target, the inference time of the most likely architecture selected at the end of the search could be several milliseconds lower. We combat this mismatch between average and most likely inference times by adjusting the learning rate schedule of the RL controller. Instead of using a constant learning rate throughout the search, we exponentially increase the RL learning rate over time. This allows the controller to explore the search space (with a relatively low learning rate) at the start of the search, but also ensures that the entropy of the RL controller is low at the end of the search, preventing the mismatch. Additional information is provided in Appendix B.
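For reference, the three reward functions discussed in this section, together with the exponentially increasing RL learning-rate schedule mentioned above, can be written down in a few lines. This is our own restatement in plain Python, not the exact TuNAS implementation; the default β value is only illustrative.

```python
# Sketch of the reward functions from Section 4.2 and the RL learning-rate
# schedule (illustrative; the beta defaults here are placeholders).

def soft_exponential_reward(q, t, t0, beta=-0.07):
    # Maximized by high accuracy q and low inference time t (beta < 0).
    return q * (t / t0) ** beta


def hard_exponential_reward(q, t, t0, beta=-0.07):
    # No reward for being faster than t0; only a penalty for exceeding it.
    return q if t <= t0 else q * (t / t0) ** beta


def absolute_reward(q, t, t0, beta=-0.07):
    # Maximized when q is high and t is close to t0; dividing by t0 makes the
    # penalty scale-invariant with respect to latency.
    return q + beta * abs(t / t0 - 1.0)


def rl_learning_rate(step, total_steps, base_lr=3e-4, final_multiplier=10.0):
    """Zero for the first 25% of the search, then an exponential ramp from
    base_lr up to final_multiplier * base_lr at the end of the search."""
    warmup_steps = 0.25 * total_steps
    if step < warmup_steps:
        return 0.0
    progress = (step - warmup_steps) / (total_steps - warmup_steps)  # in [0, 1]
    return base_lr * final_multiplier ** progress
```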
4.3. Improving the Shared Model Weights

We identified two techniques that allowed us to improve the quality of the models found by architecture search experiments. Both techniques rely on the intuition that if we can ensure that all of our shared model weights are sufficiently well-trained, we can get a more reliable signal about which parts of the search space are most promising.

Filter Warmup. We can efficiently search over different filter sizes by masking out tensors across the filters dimension. For example, we can simulate a convolution with 96 output filters by taking a convolution with 128 output filters and zeroing out the rightmost 32. However, this introduces a bias into our training process: the left-most filters will always be trained, whereas the right-most filters will be trained only occasionally. To counteract this effect, we randomly enable all the output filters – rather than just the filters selected by the RL controller – with some probability p. We linearly decrease p from 1 to 0 over the first 25% of the search, during which the RL controller is disabled and only the shared model weights are trained. (We also experimented with decreasing p over the first 50% of steps instead of 25%, but did not observe a significant effect on search quality.)

Op Warmup. We enable all possible operations in the search space at the start of training, and gradually drop out more and more of the operations as the search progresses. Our discussion will focus on a single choice block, where the goal is to select one of N possible operations (e.g., convolutions, inverted bottleneck layers, etc.) from a predetermined search space. The idea was originally proposed and evaluated by Bender et al. [3], who used carefully tuned heuristics to determine the dropout schedule. We found that a simplified version of the idea could be used to improve our search results with minimal tuning. With some probability p between 0 and 1, we enable all operations within a choice block, rather than just the operations selected by the RL controller. When multiple operations are enabled, we average their outputs. When p = 0, the controller's behavior is unaffected by op warmup. When p = 1, we enable all possible operations at every training step. In our implementation, we linearly decrease p from 1 to 0 over the first 25% of the search, during which the RL controller is disabled and only the shared model weights are trained. Op warmup can be implemented in a memory-efficient manner by leveraging rematerialization (Appendix C).
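A minimal sketch of the warmup schedule and of op warmup for a single choice block is shown below (our own simplification in plain Python; the candidate operations are represented as callables). Filter warmup uses the same schedule, but enables all output filters instead of all operations.

```python
# Sketch of op warmup for a single choice block (illustrative, not TuNAS code).
import random


def warmup_probability(step, total_steps, warmup_fraction=0.25):
    """Linearly decrease p from 1 to 0 over the first `warmup_fraction` of the
    search; p stays at 0 afterwards."""
    cutoff = warmup_fraction * total_steps
    return max(0.0, 1.0 - step / cutoff)


def choice_block_forward(x, ops, selected_index, p, rng=random):
    """With probability p, enable every candidate op and average the outputs;
    otherwise run only the op selected by the RL controller."""
    if rng.random() < p:
        outputs = [op(x) for op in ops]
        return sum(outputs) / len(outputs)
    return ops[selected_index](x)
```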
5. Experimental Setup and Baselines

For the ProxylessNAS and ProxylessNAS-Enlarged search spaces, our searches target the same latency as the published ProxylessNAS-Mobile architecture [5]. For our MobileNetV3-Like search space, we target the latency of MobileNetV3-Large [15]. The resulting architectures are trained from scratch to obtain validation and test set accuracies. Unless otherwise specified, we obtained validation set accuracies of standalone models on a held-out subset of the ImageNet training set, after training on the remainder for 90 epochs. We obtained test set accuracies after training on the full training set for 360 epochs. Details for standalone training and architecture search are reported in Appendix E.

For architecture search experiments, we always repeat the entire search process 5 times, as suggested by Lindauer and Hutter [22], and report the mean and variance of the performance of the resulting architectures. This is different from the popular practice of training a single resulting architecture multiple times, which only reflects the variance of a single searched architecture (which can be cherry-picked) rather than that of the search algorithm itself. For reproductions of published models such as MobileNetV2, where no architecture search is required on our part, we report the mean and variance across five training runs.

Identical hyper-parameters are used to train all of our models (see Appendix E for details). The one exception is the dropout rate of the final fully connected layer, which is set to 0 when training for 90 epochs, 0.15 when training MobileNetV2-based models for 360 epochs, and 0.25 when training MobileNetV3-based models for 360 epochs. We initially experimented with tuning hyper-parameters such as the learning rate and dropout rate separately for each model. However, this did not lead to any additional quality gains, and was omitted for our final experiments.

5.1. Reproducing Reference Architectures

We start by reproducing three popular mobile image classification models in our training setup: MobileNetV2 [35], MnasNet-B1 [38], and ProxylessNAS [5]. This serves two purposes. First, it allows us to verify our training and evaluation setup. Second, it allows us to cleanly distinguish between improvements in our model training setup and improvements in our searched network architectures.

Results are presented in Table 3. Our hyper-parameters provide quality comparable to the published results for MnasNet-B1 and significantly improve upon the published results of ProxylessNAS and MobileNetV2.

There is an especially large (1.3%) accuracy improvement in our reproduction of MobileNetV2. This suggests that some of the quality gains from MnasNet and ProxylessNAS which were previously attributed to better network architectures may in fact be due to better hyper-parameter tuning. It underscores the importance of accounting for differences in training and evaluation setups when comparing different network architectures.

Name | Simulated Latency | Valid Acc (%, ours) | Test Acc (%, published) | Test Acc (%, ours)
MobileNetV2 | 77.2 ms | 74.5 ± 0.1 | 72.0 | 73.3 ± 0.1
MnasNet-B1 | 84.5 ms | 76.0 ± 0.1 | 74.5 | 74.5 ± 0.1
ProxylessNAS | 84.4 ms | 76.3 ± 0.2 | 74.6 | 74.9 ± 0.1
MobileNetV3 | 58.5 ms | 76.5 ± 0.2 | 75.2 | 75.3 ± 0.1
Table 3: Reproductions of our baseline models on ImageNet.

6. Results and Discussion

Compared with previous papers on efficient architecture search such as ProxylessNAS, our architecture search setup includes several novel features, including (i) a new absolute value reward, (ii) the use of op and filter warmup, and (iii) more aggressive weight sharing during searches. At the end of this section we will systematically evaluate these changes.

Model / Method | Valid Acc (%) | Test Acc (%) | Latency (ms)
ProxylessNAS [5] | 76.2 | 74.8 | 84.4
RS (N = 20) | 75.4 ± 0.2 | 73.9 ± 0.3 | 84.3 ± 0.8
RS (N = 50) | 75.4 ± 0.2 | 74.0 ± 0.2 | 83.8 ± 0.6
TuNAS (90 epochs) | 76.3 ± 0.2 | 75.0 ± 0.1 | 84.0 ± 0.4
Table 4: Results in the ProxylessNAS search space. "ProxylessNAS [5]" is our reproduction of the ProxylessNAS-Mobile model. Our TuNAS implementation includes op/filter warmup, the absolute value reward, and more aggressive weight sharing.

Model / Method | Valid Acc (%) | Test Acc (%) | Latency (ms)
MobileNetV2 [35] | 74.4 | 73.4 | 77.2
MnasNet-B1 [38] | 76.0 | 74.5 | 84.5
ProxylessNAS [5] | 76.2 | 74.8 | 84.4
RS (N = 20) | 74.4 ± 0.5 | 73.1 ± 0.6 | 84.0 ± 0.6
RS (N = 50) | 74.6 ± 0.3 | 73.2 ± 0.3 | 83.5 ± 0.3
TuNAS (90 epochs) | 76.4 ± 0.1 | 75.3 ± 0.2 | 84.0 ± 0.4
Table 5: Results in the ProxylessNAS-Enlarged search space.

Model / Method | Valid Acc (%) | Test Acc (%) | Latency (ms)
MobileNetV3-L [15] | 76.5 | 75.3 | 58.5
RS (N = 20) | 74.1 ± 0.6 | 73.0 ± 0.5 | 58.5 ± 0.5
RS (N = 50) | 74.6 ± 0.3 | 73.5 ± 0.2 | 58.7 ± 0.4
TuNAS (90 epochs) | 76.6 ± 0.1 | 75.2 ± 0.2 | 57.0 ± 0.2
TuNAS (360 epochs) | 76.7 ± 0.2 | 75.4 ± 0.1 | 57.1 ± 0.1
Table 6: Results in the MobileNetV3-Like search space.
First we evaluate our proposed efficient architec- ture search implementation on the three search spaces pre- sented in Section 3, and compare our results against random search with similar or higher search cost. The search spaces gradually increase in terms of both size and difficulty. Finding 1: TuNAS outperforms Random Search (RS) by a large margin in each of our three classification search spaces. This holds even though we use 2-3x more com- pute resources for each RS experiment (Appendix F). First (Table 4), we evaluate our search algorithm on the ProxylessNAS search space [5]. Despite having lower search costs, the accuracies of architectures found with an efficient architecture search improve upon random search by 1%. These also provide a sanity check for our setup: the results of our search are competitive with those reported by the original ProxylessNAS paper. Next (Table 5), we evaluate our search algorithm on the ProxylessNAS-Enlarged search space, which additionally searches over output filter sizes. In this larger and more challenging search space, the accuracy gap between our method and random search increases from 1% to 2%. Finally (Table 6), we evaluate our search algorithm on our MobileNetV3-Like search space, which is the most challenging of the three. In addition to being the largest of the three search spaces, the accuracies of architectures sampled from this search space are – on average – lower than the other two. Increasing the size of a random sample from N = 20 to N = 50 can improve the results of random search. However, we find that increasing the search time of our algorithm from 90 to 360 epochs can also improve the results of our efficient algorithm, while still maintaining a lower search cost than random search with N = 50. Finding 2: The TuNAS implementation generalizes to object detection. We investigate the transferability of our algorithm to the object detection task, by searching for the detection backbone w.r.t. both mean average precision and the inference cost. Results on COCO [21] are summarized in Table 7. The searched architecture outperforms the state- of-the-art model MobileNetV3 + SSDLite [15]. Details of the experimental setup are presented in Appendix E.3. Finding 3: Output filter sizes are important. MnasNet-B1 [38] searches over the number of output fil- ters in addition to factors such as the kernel sizes and ex- pansion factors. This is different from many recent papers on efficient NAS–including ENAS [30], DARTS [26], and ProxylessNAS [5]–which hard-coded the output filter sizes. To determine the importance of output filter sizes, one possibility would be to modify the output filter sizes of a high-performing model (such as the ProxylessNAS-Mobile model) and look at how the model accuracy changes. How- ever, we can potentially do better by searching for a new architecture whose operations are better-adapted to the new output filter sizes. We therefore perform two different vari- ants of the latter procedure. In the first variant, we replace the ProxylessNAS output filter sizes (which are hard-coded to be almost the same as MnasNet) with a naive heuristic where we double the number of filters whenever we halve the image width and height, similar to architectures such as ResNet and VGG. In the second, we double the number of filters at each new block. Table 8 shows that searched filter sizes significantly outperform both doubling heuristics. Finding 4: Aggressive weight sharing enables larger search spaces without significant quality impact. 
We share weights between candidate networks more aggressively than previous works such as ProxylessNAS (Section 4.1). This lets us explore much larger search spaces, including one with up to 252 options per inverted bottleneck. For the ProxylessNAS space (where searches are possible both with and without it), we verified that aggressive weight sharing does not significantly affect searched model quality (Appendix G).

Backbone | COCO Test-dev mAP | Latency (ms)
MobileNetV2 | 20.7 | 126
MnasNet | 21.3 | 129
ProxylessNAS | 21.8 | 140
MobileNetV3-Large | 22.0 | 106
TuNAS Search | 22.5 | 106
Table 7: Backbone architecture search results on MS COCO in the MobileNetV3-Like space. All detection backbones are combined with the SSDLite head. Target latency for the TuNAS search was set to 106ms (same as for MobileNetV3-Large + SSDLite).

Filters | Valid Acc (%) | Test Acc (%) | Latency (ms)
ProxylessNAS | 76.3 ± 0.2 | 75.0 ± 0.1 | 84.0 ± 0.4
×2 Every Stride-2 | 74.8 ± 0.2 | 73.5 ± 0.2 | 83.9 ± 1.0
×2 Every Block | 75.3 ± 0.2 | 74.0 ± 0.2 | 83.9 ± 0.2
Table 8: Effect of output filter sizes on final model accuracies.

Reward | Valid Acc (%) | Test Acc (%) | Latency (ms)
MnasNet-Soft Reward | 76.2 ± 0.2 | 74.8 ± 0.3 | 79.5 ± 3.3
Absolute Value Reward | 76.4 ± 0.1 | 75.0 ± 0.1 | 84.1 ± 0.4
Table 9: Comparison of our absolute value reward function (T0 = 84ms) vs. the reward used by MnasNet and ProxylessNAS. While both provide similar quality/latency tradeoffs, our absolute value reward allows precise control over the inference latency, and reduces the need for extra tuning to find a suitable cost exponent.

Search Space | Warmup | Valid Acc (%) | Test Acc (%) | Latency (ms)
ProxylessNAS | no | 76.1 ± 0.1 | 74.7 ± 0.1 | 84.0 ± 0.3
ProxylessNAS | yes | 76.3 ± 0.2 | 75.0 ± 0.1 | 84.0 ± 0.4
ProxylessNAS-Enl | no | 75.8 ± 0.3 | 74.6 ± 0.2 | 83.6 ± 0.2
ProxylessNAS-Enl | yes | 76.4 ± 0.1 | 75.3 ± 0.2 | 84.0 ± 0.4
MobileNetV3-Like | no | 76.2 ± 0.2 | 75.0 ± 0.1 | 57.0 ± 0.6
MobileNetV3-Like | yes | 76.6 ± 0.1 | 75.2 ± 0.2 | 57.0 ± 0.2
Table 10: Comparison of search results with vs. without op and filter warmup. We use aggressive weight sharing and search for 90 epochs in all search configurations.

Finding 5: The absolute value reward reduces hyper-parameter tuning significantly. With the MnasNet-Soft reward function, we found it necessary to grid search over β ∈ {−0.03, −0.04, −0.05, −0.06, −0.07, −0.08, −0.09} in order to reliably find network architectures close to the target latency. By switching to the absolute value reward function, we were able to eliminate the need for this search, reducing resource costs by a factor of 7. We compared the quality of both methods on our implementation of the ProxylessNAS search space with weight sharing, and found that the absolute value reward function did not significantly affect the quality/latency tradeoff (Table 9 and Appendix D).

Finding 6: Op and filter warmup lead to consistent improvements across all search spaces. Controlled experiments are presented in Table 10. While the improvements are small in some spaces, they account for nearly half of all quality gains in the ProxylessNAS-Enlarged space.

Acknowledgements

We would like to thank Blake Hechtman, Ryan Sepassi, and Tong Shen for their help with the software changes needed to make our algorithms work well on TPUs. We would also like to thank Berkin Akin, Okan Arikan, Yiming Chen, Zhifeng Chen, Frank Chu, Ekin D. Cubuk, Matthieu Devin, Suyog Gupta, Andrew Howard, Da Huang, Adam Kraft, Peisheng Li, Yifeng Lu, Ruoming Pang, Daiyi Peng, Mark Sandler, Yonghui Wu, Zhinan Xu, Xin Zhou, and Menglong Zhu for helpful feedback and discussions.
References [1] Bowen Baker, Otkrist Gupta, Nikhil Naik, and Ramesh Raskar. Designing neural network architectures using rein- forcement learning. In International Conference on Learning Representations, 2017. 2 [2] Bowen Baker, Otkrist Gupta, Ramesh Raskar, and Nikhil Naik. Accelerating neural architecture search using perfor- mance prediction. arXiv preprint arXiv:1705.10823, 2017. 2 [3] Gabriel Bender, Pieter-Jan Kindermans, Barret Zoph, Vijay Vasudevan, and Quoc Le. Understanding and simplifying one-shot architecture search. In International Conference on Machine Learning, pages 549–558, 2018. 1, 2, 4, 6 [4] Andrew Brock, Theo Lim, J.M. Ritchie, and Nick Weston. SMASH: One-shot model architecture search through hyper- networks. In International Conference on Learning Repre- sentations, 2018. 2 [5] Han Cai, Ligeng Zhu, and Song Han. Proxylessnas: Direct neural architecture search on target task and hardware. arXiv preprint arXiv:1812.00332, 2018. 1, 2, 3, 4, 5, 6, 7, 8 [6] Yukang Chen, Tong Yang, Xiangyu Zhang, Gaofeng Meng, Chunhong Pan, and Jian Sun. Detnas: Neural architecture search on object detection. In Advances in Neural Informa- tion Processing Systems, 2019. 2 [7] Jiequan Cui, Pengguang Chen, Ruiyu Li, Shu Liu, Xiaoyong Shen, and Jiaya Jia. Fast and practical neural architecture search. In International Conference on Computer Vision, 2019. 2 [8] Xiaoliang Dai, Peizhao Zhang, Bichen Wu, Hongxu Yin, Fei Sun, Yanghan Wang, Marat Dukhan, Yunqing Hu, Yim- ing Wu, Yangqing Jia, Peter Vajda, Matt Uyttendaele, and Niraj K. Jha. Chamnet: Towards efficient network design through platform-aware model adaptation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019. 2 [9] Stefan Elfwing, Eiji Uchibe, and Kenji Doya. Sigmoid- weighted linear units for neural network function approxima- tion in reinforcement learning. Neural Networks, 107:3–11, 2018. 3 [10] Igor Fedorov, Ryan P Adams, Matthew Mattina, and Paul N Whatmough. Sparse: Sparse architecture search for cnns on resource-constrained microcontrollers. arXiv preprint arXiv:1905.12107, 2019. 2 [11] Golnaz Ghiasi, Tsung-Yi Lin, and Quoc V Le. Nas-fpn: Learning scalable feature pyramid architecture for object de- tection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7036–7045, 2019. 2 [12] Priya Goyal, Piotr Doll´ar, Ross Girshick, Pieter Noord- huis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large mini- batch sgd: Training imagenet in 1 hour. arXiv preprint arXiv:1706.02677, 2017. 12 [13] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceed- ings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016. 12 [14] Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (gelus). arXiv preprint arXiv:1606.08415, 2016. 3 [15] Andrew Howard, Mark Sandler, Grace Chu, Liang-Chieh Chen, Bo Chen, Mingxing Tan, Weijun Wang, Yukun Zhu, Ruoming Pang, Vijay Vasudevan, et al. Searching for mo- bilenetv3. In International Conference on Computer Vision, 2019. 2, 3, 6, 7, 8, 12 [16] Chi-Hung Hsu, Shu-Huan Chang, Jhao-Hong Liang, Hsin- Ping Chou, Chun-Hao Liu, Shih-Chieh Chang, Jia-Yu Pan, Yu-Ting Chen, Wei Wei, and Da-Cheng Juan. Monas: Multi-objective neural architecture search using reinforce- ment learning. arXiv preprint arXiv:1806.10332, 2018. 2 [17] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation net- works. 
In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7132–7141, 2018. 3 [18] Jonathan Huang, Vivek Rathod, Chen Sun, Menglong Zhu, Anoop Korattikara, Alireza Fathi, Ian Fischer, Zbigniew Wo- jna, Yang Song, Sergio Guadarrama, et al. Speed/accuracy trade-offs for modern convolutional object detectors. In Pro- ceedings of the IEEE conference on computer vision and pat- tern recognition, pages 7310–7311, 2017. 12 [19] Kirthevasan Kandasamy, Willie Neiswanger, Jeff Schneider, Barnabas Poczos, and Eric P Xing. Neural architecture search with bayesian optimisation and optimal transport. In Advances in Neural Information Processing Systems, pages 2016–2025, 2018. 2 [20] Liam Li and Ameet Talwalkar. Random search and repro- ducibility for neural architecture search. In Amir Glober- son and Ricardo Silva, editors, Proceedings of the Thirty- Fifth Conference on Uncertainty in Artificial Intelligence, UAI 2019, Tel Aviv, Israel, July 22-25, 2019, page 129. AUAI Press, 2019. 1, 2 [21] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Doll´ar, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014. 8 [22] Marius Lindauer and Frank Hutter. Best practices for scien- tific research on neural architecture search, 2019. 7 [23] Chenxi Liu, Liang-Chieh Chen, Florian Schroff, Hartwig Adam, Wei Hua, Alan L Yuille, and Li Fei-Fei. Auto- deeplab: Hierarchical neural architecture search for semantic image segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 82–92, 2019. 2 [24] Chenxi Liu, Barret Zoph, Maxim Neumann, Jonathon Shlens, Wei Hua, Li-Jia Li, Li Fei-Fei, Alan Yuille, Jonathan Huang, and Kevin Murphy. Progressive neural architecture search. In The European Conference on Computer Vision (ECCV), September 2018. 2 [25] Hanxiao Liu, Karen Simonyan, Oriol Vinyals, Chrisantha Fernando, and Koray Kavukcuoglu. Hierarchical repre- sentations for efficient architecture search. arXiv preprint arXiv:1711.00436, 2017. 2 [26] Hanxiao Liu, Karen Simonyan, and Yiming Yang. Darts: Differentiable architecture search. In International Confer- ence on Learning Representations, 2019. 1, 2, 4, 8, 11 [27] Ilya Loshchilov and Frank Hutter. Sgdr: Stochas- tic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983, 2016. 12 [28] Renqian Luo, Fei Tian, Tao Qin, Enhong Chen, and Tie-Yan Liu. Neural architecture optimization. In Advances in neural information processing systems, pages 7816–7827, 2018. 2 [29] Renato Negrinho and Geoff Gordon. Deeparchitect: Auto- matically designing and training deep architectures. arXiv preprint arXiv:1704.08792, 2017. 2 [30] Hieu Pham, Melody Y Guan, Barret Zoph, Quoc V Le, and Jeff Dean. Efficient neural architecture search via parameter sharing. In International Conference on Machine Learning, 2018. 1, 2, 4, 8 [31] AJ Piergiovanni, Anelia Angelova, and Michael S. Ryoo. Tiny video networks. arXiv preprint arXiv:1910.06961, 2019. 2 [32] Prajit Ramachandran, Barret Zoph, and Quoc V Le. Searching for activation functions. arXiv preprint arXiv:1710.05941, 2017. 3 [33] Esteban Real, Alok Aggarwal, Yanping Huang, and Quoc V Le. Regularized evolution for image classifier architecture search. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 4780–4789, 2019. 
1, 2 [34] Esteban Real, Sherry Moore, Andrew Selle, Saurabh Sax- ena, Yutaka Leon Suematsu, Quoc Le, and Alex Kurakin. Large-scale evolution of image classifiers. In International Conference on Machine Learning, 2017. 2 [35] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zh- moginov, and Liang-Chieh Chen. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4510–4520, 2018. 2, 3, 4, 7, 12 [36] Karen Simonyan and Andrew Zisserman. Very deep convo- lutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014. 3 [37] Dimitrios Stamoulis, Ruizhou Ding, Di Wang, Dimitrios Lymberopoulos, Bodhi Priyantha, Jie Liu, and Diana Mar- culescu. Single-path nas: Designing hardware-efficient con- vnets in less than 4 hours. arXiv preprint arXiv:1904.02877, 2019. 2 [38] Mingxing Tan, Bo Chen, Ruoming Pang, Vijay Vasudevan, Mark Sandler, Andrew Howard, and Quoc V. Le. Mnasnet: Platform-aware neural architecture search for mobile. In The IEEE Conference on Computer Vision and Pattern Recogni- tion (CVPR), June 2019. 2, 3, 4, 5, 7, 8 [39] Mingxing Tan and Quoc V Le. Efficientnet: Rethinking model scaling for convolutional neural networks. In Inter- national Conference on Machine Learning, 2019. 2, 4 [40] Ronald J Williams. Simple statistical gradient-following al- gorithms for connectionist reinforcement learning. Machine learning, 8(3-4):229–256, 1992. 4 [41] Bichen Wu, Xiaoliang Dai, Peizhao Zhang, Yanghan Wang, Fei Sun, Yiming Wu, Yuandong Tian, Peter Vajda, Yangqing Jia, and Kurt Keutzer. Fbnet: Hardware-aware efficient con- vnet design via differentiable neural architecture search. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 10734–10742, 2019. 2 [42] Saining Xie, Alexander Kirillov, Ross Girshick, and Kaim- ing He. Exploring randomly wired neural networks for im- age recognition. arXiv preprint arXiv:1904.01569, 2019. 2 [43] Sirui Xie, Hehui Zheng, Chunxiao Liu, and Liang Lin. Snas: stochastic neural architecture search. In International Con- ference on Learning Representations, 2019. 1, 2 [44] Tien-Ju Yang, Andrew Howard, Bo Chen, Xiao Zhang, Alec Go, Mark Sandler, Vivienne Sze, and Hartwig Adam. Ne- tadapt: Platform-aware neural network adaptation for mobile applications. In Proceedings of the European Conference on Computer Vision (ECCV), pages 285–300, 2018. 13 [45] Kaicheng Yu, Christian Sciuto, Martin Jaggi, Claudiu Musat, and Mathieu Salzmann. Evaluating the search phase of neural architecture search. In International Conference on Learning Representations, 2020. 1, 2 [46] Barret Zoph and Quoc V Le. Neural architecture search with reinforcement learning. In International Conference on Learning Representations, 2016. 2 [47] Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V Le. Learning transferable architectures for scalable image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 8697–8710, 2018. 1, 2 A. Variance of ProxylessNAS-Mobile Model For stand-alone model training, we estimated the vari- ance in accuracy across runs by starting five identical runs of the ProxylessNAS-Mobile model. We repeated the ex- periment in two configurations. In the first configuration, we trained for 90 epochs and then evaluated the models on our validation set; the resulting validation set accuracies were [76.1%, 76.4%, 76.4%, 76.5%, 76.2%]. 
In the second configuration, we trained for 360 epochs and then evaluated on our test set; the resulting test set accuracies were [75.0%, 75.0%, 74.9%, 74.9%, 75.0%]. (Accuracies on the validation set are typically a few percentage points higher than on the test set.)

B. Average vs Argmax Inference Time

For the absolute value reward function with a constant RL learning rate, we argued that the reason models didn't converge to the target inference time was differences between the inference times of randomly sampled models vs. the argmax model obtained by selecting the most likely choice for each categorical decision. To test this hypothesis, we compared the two at the end of a search. We obtained an average time from randomly sampled models using an exponential moving average with a decay rate of 0.9, updated every 100 training steps. While average times consistently converged to the desired inference time target, the inference time of the argmax model could differ by 8 ms or more.

β | Avg. Time | ArgMax Time
−0.01 | 134.2 ms | 128.8 ms
−0.02 | 104.3 ms | 92.2 ms
−0.05 | 84.2 ms | 75.8 ms
−0.10 | 83.1 ms | 78.3 ms
−0.20 | 84.0 ms | 82.7 ms
−0.50 | 84.2 ms | 83.4 ms
−1.00 | 83.2 ms | 85.5 ms
Table 11: Average vs. argmax inference times when using the absolute value reward function without a learning rate schedule for the RL controller.

C. Rematerialization

If op warmup is implemented naively, then the activation memory required to train the shared model weights grows linearly with the number of possible operations in the search space. If many different operations can be simultaneously enabled at each position in the network, the model will be unable to fit in memory. We use rematerialization to address this issue. During the forward pass, we apply N different operations to the same input and average their outputs. Rather than retaining the intermediate results of these operations for use during the backwards pass, we throw them away. During the backwards pass, we recompute the intermediate results for one operation at a time. In practice, this leads to a large decrease in memory requirements, as we only need to retain a single input tensor and a single output tensor for each choice block. For example, in our reproduction of the original ProxylessNAS search space with a per-core batch size of 128, rematerialization decreases the memory needed to train the one-shot model from 29.5 GiB to 4.8 GiB.

This memory-saving technique, which allowed us to scale to larger search spaces, came at the cost of roughly a 30% increase in search times, since it requires a second forward pass for each of the N possible operations.

Finally, we note that although this rematerialization trick was developed with our version of op warmup in mind, it could also be used to reduce the memory requirements of a method such as DARTS [26], which requires us to evaluate every possible operation in the search space at every training step.

D. Discussion of Absolute Value Reward

We now contrast a typical architecture search workflow with the MnasNet-Soft reward function against a workflow with our new absolute value reward function. For the MnasNet-Soft reward function, the first step when using a new search space or training configuration is to tune the RL controller's cost exponent β to obtain inference times which are reasonably close to our target latency.
In our early experiments, we found that grid searching over β ∈{−0.03, −0.04, −0.05, −0.06, −0.07, −0.08, −0.09} worked well in practice. However, running this grid search increased the cost of architecture search experiments by a factor of 7. Even after we fixed the value of β, the latencies and accuracies of architectures found by a search could vary significantly from one run to the next. For example, in our reproduction of the ProxylessNAS search space with β = −0.07, five identical architecture search experiments returned latencies which ranged from 74ms to 82ms. We also saw a wide variance in accuracies across the different architectures, ranging from 75.8% to 76.4% on the valida- tion set and 74.2% to 75.1% on the test set. Larger models generally had better accuracies, indicating that the problem stemmed from our inability to precisely control the latency. This helped motivate our Absolute Reward function, which allowed the RL controller to reliably find architec- tures whose latencies were close to the target. For example, the low variance of searched TuNAS model latencies in Ta- bles 4, 5, 6, and 12 shows we can reliably find high-quality architectures within 1 ms of the target across several dif- ferent search configurations, even when we reuse the same search hyper-parameters between different setups. As an alternative to the absolute value reward function, we also considered searching for an architecture close to the inference time target, and then uniformly scaling up or down the number of filters in every layer. While this helped reduce the variance of searched model accuracies, it did not remove the need to tune the RL cost exponent, since we needed to find a model that was already close to the infer- ence time target to get good results. Furthermore, finding the right scaling factor to hit a specific inference time tar- get added an extra step to experiments in this setup. The absolute value reward function gave us high-quality archi- tectures with a more streamlined search process. E. Experimental Setup E.1. Standalone Training for Classification During stand-alone model training, each model was trained using distributed synchronous SGD on TensorFlow with a Cloud TPU v2-32 or Cloud TPU v3-32 instance (32 TPU cores) and a per-core batch size of 128. Models were optimized using RMSProp with momentum = 0.9, decay rate = 0.9, and epsilon = 0.1. The learning rate was annealed following a cosine decay schedule without restarts [27], with a maximum value of 2.64 globally (or 0.0825 per core). We linearly increased the learning rate from 0 over the first 2.5% of training steps [12]. Models were trained with batch normalization with epsilon = 0.001 and momentum = 0.99. Convolutional kernels were initialized with He initialization [13],9 while bias variables were initialized to 0. The final fully connected layer of the network was initialized from a random normal distribution with mean 0 and standard de- viation 0.01. We applied L2 regularization with a strength of 0.00004 to all convolutional kernels except the final fully connected layer of the network. All models were trained with ResNet data preprocessing and an input image size of 224×224 pixels. When training models for 360 epochs, we applied a dropout rate of 0.15 before the final fully con- nected layer for models from MobileNetV2 search spaces and 0.25 for models from MobileNetV3 search spaces. We did not apply dropout when training models for 90 epochs. 
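For concreteness, the standalone-training learning-rate schedule described above (linear warmup over the first 2.5% of steps to a global peak of 2.64, followed by cosine decay without restarts) can be sketched as follows. This is our own restatement in plain Python, not the training code itself.

```python
# Sketch of the standalone-training learning-rate schedule (illustrative).
import math

PEAK_LR = 2.64           # global peak learning rate (0.0825 per core x 32 cores)
WARMUP_FRACTION = 0.025  # linear warmup over the first 2.5% of training steps


def learning_rate(step, total_steps):
    warmup_steps = WARMUP_FRACTION * total_steps
    if step < warmup_steps:
        return PEAK_LR * step / warmup_steps   # linear warmup from 0
    # Cosine decay without restarts over the remaining steps.
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * PEAK_LR * (1.0 + math.cos(math.pi * progress))
```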
As is standard for ImageNet experiments, our test set accuracies were obtained on what is confusingly called the ImageNet validation set for historical reasons. What we refer to as validation set accuracies were obtained on a held-out subset of the ImageNet training set containing 50,046 randomly selected examples.

9 TensorFlow's default variable initialization heuristics, such as tf.initializers.he_normal, are designed for ordinary convolutions and can overestimate the fan-in of depthwise convolutional kernels by multiple orders of magnitude; we corrected this issue in our version.

E.2. Architecture Search for Classification

Architecture search experiments are performed using Cloud TPU v2-32 or Cloud TPU v3-32 instances with 32 TPU cores and a per-core batch size of 128.

For training the shared model weights, we use the same hyper-parameters as for stand-alone model training, except that the dropout rate of the final fully connected layer is always set to 0. When applying L2 regularization to the trainable model variables, we only regularize parameters which are used in the current training step. Because batch norm statistics can potentially vary significantly from one candidate architecture to the next, batch norm is always applied in "training" mode, even during model evaluation.

For training the RL controller, we use an Adam optimizer with a base learning rate of 3e-4, β1 = 0, β2 = 0.999, and ϵ = 1e-8. We set the learning rate of the RL controller to 0 for the first 25% of training. If using an exponential schedule, we set the learning rate equal to the base value 25% of the way through training, and increase it exponentially so that the final learning rate is 10x the base learning rate. If using a constant schedule, we set the learning rate equal to the base learning rate after the first 25% of training.

E.3. Object Detection

Our implementation is based on the TensorFlow Object Detection API [18]. All backbones are combined with SSDLite [35] as the head. Following MobileNetV2 [35] and V3 [15], we use the last feature extractor layers that have an output stride of 16 (C4) and 32 (C5) as the endpoints for the head. In contrast with MobileNetV3 + SSDLite [15], we do not manually halve the number of channels for blocks between C4 and C5, since in our case the number of channels is automatically learned by the search algorithm. All experiments use 320×320 input images.

For standalone training, each detection model is trained for 50K steps from scratch on COCO train2017 data using a Cloud TPU v2-32 or TPU v3-32 instance (32 TPU cores) with a per-core batch size of 32. We use SGD with a momentum of 0.9 to optimize the model weights. The (global) learning rate is warmed up linearly from 0 to 4 during the first 5K steps and then decayed to 0 following a cosine schedule [27] during the rest of the training process.

For architecture searches, training configurations for the model weights remain the same as for standalone training. We split out 10% of the data from COCO train2017 to compute the reward during an architecture search. The training setup of the RL controller is the same as for classification, except that the base learning rate of the Adam optimizer is set to 5e-3. Whereas classification models are evaluated based on accuracy, detection models are evaluated using mAP (mean Average Precision).
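For clarity, the RL controller learning rate schedules described in E.2 (and reused here for detection, with a larger base rate of 5e-3) can be expressed as a function of training progress. The sketch below is ours and purely illustrative; it is not taken from the released code.

def rl_controller_learning_rate(progress, base_lr=3e-4, schedule="exponential"):
    # progress is the fraction of training completed, in [0, 1].
    # The controller is frozen (learning rate 0) for the first 25% of training.
    if progress < 0.25:
        return 0.0
    if schedule == "constant":
        return base_lr
    # Exponential schedule: base_lr at progress = 0.25,
    # rising to 10x base_lr at progress = 1.0.
    return base_lr * 10.0 ** ((progress - 0.25) / 0.75)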
To obtain results in Table 7, architecture searches were carried out in the MobileNetV3-Like search space with a target inference cost of 106ms to match the simulated latency of MobileNetV3 + SSDLite. To obtain the test-dev results, each model is trained over the combined COCO train2017 and val2017 data for 100K steps. Other settings remain the same as those for standalone training and validation.

E.4. Simulated Inference Times

In early experiments, we found that if we benchmarked the same model on two different phones, the observed latencies could differ by several milliseconds. To ensure that our results were reproducible, and to mitigate the possibility of random hardware-specific variance across runs, we estimated the latencies of our models using lookup tables similar to those proposed by NetAdapt [44]. These lookup tables let us estimate the latency of each individual operation (e.g., convolution or pooling layer) in the network. The overall latency of a network architecture was estimated by summing up the latencies of all its individual operations (a minimal sketch of this computation is given at the end of the appendices).

We validated our use of simulated latencies by sampling 100 random architectures and comparing the simulated numbers against on-device numbers measured on a real Pixel-1 phone. Figure 4 shows that the two are well-correlated.

Figure 4: On-device vs. simulated latencies in the ProxylessNAS-Enlarged and MobileNetV3-Like search spaces. Each plot is based on 100 random architectures in the given space.

Agg. Sharing   Valid Acc (%)   Test Acc (%)   Latency
No             76.4 ± 0.1      75.0 ± 0.1     84.1 ± 0.4
Yes            76.3 ± 0.2      75.0 ± 0.1     84.0 ± 0.4

Table 12: Effect of aggressive weight sharing (abbreviated as "Agg. Sharing" in the table above) on the quality of searched architectures. Each search is run for 90 epochs on the ProxylessNAS search space with op and filter warmup enabled.

F. Cost of Random Search vs Efficient NAS

Training a single architecture for 90 epochs on ImageNet requires about 1.7 hours using a Cloud TPU v3-32 instance (32 cores), whereas a single architecture search run takes between 8 and 13 hours, depending on the search space. This means that for the cost of a single 90-epoch search, we can evaluate 4-8 random models. In some cases, we found that the quality of an efficient architecture search could be further improved by increasing the number of search epochs from 90 to 360. For the cost of a single 360-epoch search, we can evaluate 15-30 random models. We provide a generous budget of 20-50 models for our random search experiments in order to demonstrate that efficient architecture search can outperform random search even if each random search experiment is more compute-intensive.

G. Quality of Aggressive Weight Sharing

To verify the quality impact of aggressive weight sharing, we ran architecture searches on the original ProxylessNAS search space both with and without aggressive sharing. The results (Table 12) indicate that aggressive weight sharing does not significantly affect searched model accuracies in this space. Our other two search spaces (ProxylessNAS-Enlarged and MobileNetV3-Like) were too large for us to run searches without aggressive weight sharing.
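As referenced in E.4, the lookup-table latency simulation amounts to a table keyed by each operation's configuration plus a sum over the operations in an architecture. The sketch below is purely illustrative: the key format and latency values are made up and are not taken from the actual TuNAS lookup tables.

# Hypothetical per-operation latency table, in milliseconds.
# Keys describe an operation's configuration; values are illustrative only.
SIMULATED_LATENCY_TABLE_MS = {
    # (op_name, input_resolution, in_channels, out_channels, kernel, stride)
    ("conv2d", 224, 3, 32, 3, 2): 1.91,
    ("inverted_bottleneck", 112, 32, 16, 3, 1): 1.35,
    ("inverted_bottleneck", 112, 16, 24, 5, 2): 2.10,
}

def simulated_latency_ms(architecture):
    # The network's simulated latency is the sum of its operations' latencies.
    return sum(SIMULATED_LATENCY_TABLE_MS[op] for op in architecture)

# Example: a hypothetical three-operation network.
example = [
    ("conv2d", 224, 3, 32, 3, 2),
    ("inverted_bottleneck", 112, 32, 16, 3, 1),
    ("inverted_bottleneck", 112, 16, 24, 5, 2),
]
print(round(simulated_latency_ms(example), 2))  # 5.36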
In the second configuration, we trained for 360 epochs and then evaluated on our test set; the resulting test set accuracies were [75.0%, 75.0%, 74.9%, 74.9%, 75.0%].8

8 Accuracies on the validation set are typically a few percentage points higher than on the test set.

B. Average vs Argmax Inference Time

For the absolute value reward function with a constant RL learning rate, we argued that models did not converge to the target inference time because of differences between the inference times of randomly sampled models and those of the argmax model, which is obtained by selecting the most likely choice for each categorical decision. To test this hypothesis, we compared the two at the end of a search. We obtained an average time from randomly sampled models using an exponential moving average with a decay rate of 0.9, updated every 100 training steps. While average times consistently converged to the desired inference time target, the inference time of the argmax model could differ from it by 8 ms or more.

β       Avg. Time   ArgMax Time
-0.01   134.2 ms    128.8 ms
-0.02   104.3 ms    92.2 ms
-0.05   84.2 ms     75.8 ms
-0.10   83.1 ms     78.3 ms
-0.20   84.0 ms     82.7 ms
-0.50   84.2 ms     83.4 ms
-1.00   83.2 ms     85.5 ms

Table 11: Average vs. argmax inference times when using the absolute value reward function without a learning rate schedule for the RL controller.

C. Rematerialization

If op warmup is implemented naively, the activation memory required to train the shared model weights grows linearly with the number of possible operations in the search space. If many different operations can be simultaneously enabled at each position in the network, the model will not fit in memory. We use rematerialization to address this issue. During the forward pass, we apply N different operations to the same input and average their outputs. Rather than retaining the intermediate results of these operations for use during the backward pass, we discard them. During the backward pass, we recompute the intermediate results for one operation at a time. In practice, this leads to a large decrease in memory requirements, as we only need to retain a single input tensor and a single output tensor for each choice block. For example, in our reproduction of the original ProxylessNAS search space with a per-core batch size of 128, rematerialization decreases the memory needed to train the shared model weights from 29.5 GiB to 4.8 GiB. This memory-saving technique, which allowed us to scale to larger search spaces, came at the cost of roughly a 30% increase in search times, since a second forward pass must be performed for each of the N possible operations. Finally, we note that although this rematerialization trick was developed with our version of op warmup in mind, it could also be used to reduce the memory requirements of a method such as DARTS [26], which requires evaluating every possible operation in the search space at every training step.

D. Discussion of Absolute Value Reward

We now contrast a typical architecture search workflow with the MnasNet-Soft reward function against a workflow with our new Absolute Value reward function. For the MnasNet-Soft reward function, the first step when using a new search space or training configuration is to tune the RL controller's cost exponent β to obtain inference times which are reasonably close to our target latency.
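For reference, the two reward shapes being contrasted can be sketched as simple functions of a model's quality Q, its (simulated) latency T, and the latency target T0. This is a minimal illustrative sketch written from the descriptions in this paper; the exact normalization used in our experiments may differ.

def mnasnet_soft_reward(quality, latency, target, beta):
    # MnasNet-style soft-exponential reward: quality * (latency / target) ** beta.
    # With beta < 0, exceeding the target is penalized only softly, so the latency
    # of the best architecture depends strongly on the choice of beta.
    return quality * (latency / target) ** beta

def absolute_value_reward(quality, latency, target, beta):
    # Absolute-value reward: quality + beta * |latency / target - 1|.
    # With beta < 0, any deviation from the target (in either direction) is
    # penalized, pulling searched architectures towards the target latency.
    return quality + beta * abs(latency / target - 1.0)

In both cases β < 0; the key difference is that the soft-exponential form only trades accuracy against latency, whereas the absolute-value form explicitly penalizes any deviation from the target.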
In our early experiments, we found that grid searching over β ∈ {−0.03, −0.04, −0.05, −0.06, −0.07, −0.08, −0.09} worked well in practice. However, running this grid search increased the cost of architecture search experiments by a factor of 7. Even after we fixed the value of β, the latencies and accuracies of architectures found by a search could vary significantly from one run to the next. For example, in our reproduction of the ProxylessNAS search space with β = −0.07, five identical architecture search experiments returned latencies ranging from 74 ms to 82 ms. We also saw a wide variance in accuracies across the resulting architectures, ranging from 75.8% to 76.4% on the validation set and 74.2% to 75.1% on the test set. Larger models generally had better accuracies, indicating that the problem stemmed from our inability to precisely control the latency.

This helped motivate our Absolute Value reward function, which allowed the RL controller to reliably find architectures whose latencies were close to the target. For example, the low variance of searched TuNAS model latencies in Tables 4, 5, 6, and 12 shows that we can reliably find high-quality architectures within 1 ms of the target across several different search configurations, even when we reuse the same search hyper-parameters between different setups.

As an alternative to the absolute value reward function, we also considered searching for an architecture close to the inference time target and then uniformly scaling the number of filters in every layer up or down. While this helped reduce the variance of searched model accuracies, it did not remove the need to tune the RL cost exponent, since we needed to find a model that was already close to the inference time target to get good results. Furthermore, finding the right scaling factor to hit a specific inference time target added an extra step to experiments in this setup. The absolute value reward function gave us high-quality architectures with a more streamlined search process.

E. Experimental Setup

E.1. Standalone Training for Classification

During stand-alone model training, each model was trained using distributed synchronous SGD on TensorFlow with a Cloud TPU v2-32 or Cloud TPU v3-32 instance (32 TPU cores) and a per-core batch size of 128. Models were optimized using RMSProp with momentum = 0.9, decay rate = 0.9, and epsilon = 0.1. The learning rate was annealed following a cosine decay schedule without restarts [27], with a maximum value of 2.64 globally (or 0.0825 per core). We linearly increased the learning rate from 0 over the first 2.5% of training steps [12]. Models were trained with batch normalization with epsilon = 0.001 and momentum = 0.99. Convolutional kernels were initialized with He initialization [13],9 while bias variables were initialized to 0. The final fully connected layer of the network was initialized from a random normal distribution with mean 0 and standard deviation 0.01. We applied L2 regularization with a strength of 0.00004 to all convolutional kernels, but not to the final fully connected layer of the network. All models were trained with ResNet data preprocessing and an input image size of 224×224 pixels. When training models for 360 epochs, we applied a dropout rate of 0.15 before the final fully connected layer for models from MobileNetV2 search spaces and 0.25 for models from MobileNetV3 search spaces. We did not apply dropout when training models for 90 epochs.
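To make the learning rate configuration above concrete, the following is a minimal sketch (our own, not the released code) of the per-step global learning rate: a linear warmup over the first 2.5% of steps, followed by cosine decay without restarts from the peak value of 2.64 down to 0.

import math

def standalone_learning_rate(step, total_steps, peak_lr=2.64, warmup_frac=0.025):
    # Linear warmup from 0 to peak_lr over the first warmup_frac of training.
    warmup_steps = max(1, int(warmup_frac * total_steps))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    # Cosine decay (no restarts) from peak_lr down to 0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))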
As is standard for ImageNet experiments, our test set accuracies were obtained on what is (confusingly) known as the ImageNet validation set for historical reasons. What we refer to as validation set accuracies were obtained on a held-out subset of the ImageNet training set containing 50,046 randomly selected examples.

9 TensorFlow's default variable initialization heuristics, such as tf.initializers.he_normal, are designed for ordinary convolutions, and can overestimate the fan-in of depthwise convolutional kernels by multiple orders of magnitude; we corrected this issue in our version.

E.2. Architecture Search for Classification

Architecture search experiments are performed using Cloud TPU v2-32 or Cloud TPU v3-32 instances with 32 TPU cores and a per-core batch size of 128.

For training the shared model weights, we use the same hyper-parameters as for stand-alone model training, except that the dropout rate of the final fully connected layer is always set to 0. When applying L2 regularization to the trainable model variables, we only regularize parameters which are used in the current training step. Because batch norm statistics can potentially vary significantly from one candidate architecture to the next, batch norm is always applied in "training" mode, even during model evaluation.

For training the RL controller, we use an Adam optimizer with a base learning rate of 3e-4, β1 = 0, β2 = 0.999, and ϵ = 1e-8. We set the learning rate of the RL controller to 0 for the first 25% of training. If using an exponential schedule, we set the learning rate equal to the base value 25% of the way through training and increase it exponentially so that the final learning rate is 10x the base learning rate. If using a constant schedule, we set the learning rate equal to the base learning rate after the first 25% of training.

E.3. Object Detection

Our implementation is based on the TensorFlow Object Detection API [18]. All backbones are combined with SSDLite [35] as the head. Following MobileNetV2 [35] and V3 [15], we use the last feature extractor layers that have an output stride of 16 (C4) and 32 (C5) as the endpoints for the head. In contrast with MobileNetV3 + SSDLite [15], we do not manually halve the number of channels for blocks between C4 and C5, since in our case the number of channels is automatically learned by the search algorithm. All experiments use 320×320 input images.

For standalone training, each detection model is trained for 50K steps from scratch on COCO train2017 data using a Cloud TPU v2-32 or TPU v3-32 instance (32 TPU cores) with a per-core batch size of 32. We use SGD with a momentum of 0.9 to optimize the model weights. The (global) learning rate is warmed up linearly from 0 to 4 during the first 5K steps and then decayed to 0 following a cosine schedule [27] during the rest of the training process.

For architecture searches, training configurations for the model weights remain the same as for standalone training. We split out 10% of the data from COCO train2017 to compute the reward during an architecture search. The training setup of the RL controller is the same as for classification, except that the base learning rate of the Adam optimizer is set to 5e-3; a sketch of the controller learning rate schedule is given below. Whereas classification models are evaluated based on accuracy, detection models are evaluated using mAP (mean Average Precision).
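The RL controller schedule referenced above can be sketched as follows. This is an illustrative rendering of the description in E.2 (with base_lr = 3e-4 for classification and 5e-3 for detection), not the released implementation.

def controller_learning_rate(step, total_steps, base_lr, schedule="exponential"):
    # The RL controller is frozen (zero learning rate) for the first 25% of training.
    delay_steps = 0.25 * total_steps
    if step < delay_steps:
        return 0.0
    if schedule == "constant":
        return base_lr
    # Exponential schedule: equal to base_lr at the 25% mark and rising to
    # 10x base_lr by the end of training.
    progress = (step - delay_steps) / (total_steps - delay_steps)
    return base_lr * (10.0 ** progress)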
To obtain the results in Table 7, architecture searches were carried out in the MobileNetV3-Like search space with a target inference cost of 106 ms to match the simulated latency of MobileNetV3 + SSDLite. To obtain the test-dev results, each model is trained on the combined COCO train2017 and val2017 data for 100K steps. Other settings remain the same as those for standalone training and validation.

E.4. Simulated Inference Times

In early experiments, we found that if we benchmarked the same model on two different phones, the observed latencies could differ by several milliseconds. To ensure that our results were reproducible, and to mitigate the possibility of random hardware-specific variance across runs, we estimated the latencies of our models using lookup tables similar to those proposed by NetAdapt [44]. These lookup tables let us estimate the latency of each individual operation (e.g., convolution or pooling layer) in the network. The overall latency of a network architecture was estimated by summing the latencies of all its individual operations; a short code sketch of this computation is given at the end of this document.

We validated our use of simulated latencies by sampling 100 random architectures and comparing the simulated numbers against on-device numbers measured on a real Pixel 1 phone. Figure 4 shows that the two are well-correlated.

Figure 4: On-device vs. simulated latencies in the ProxylessNAS-Enlarged and MobileNetV3-Like search spaces. Each plot is based on 100 random architectures in the given space.

F. Cost of Random Search vs Efficient NAS

Training a single architecture for 90 epochs on ImageNet requires about 1.7 hours using a Cloud TPU v3-32 instance (32 cores), whereas a single architecture search run takes between 8 and 13 hours, depending on the search space. This means that for the cost of a single 90-epoch search, we can evaluate 4-8 random models. In some cases, we found that the quality of an efficient architecture search could be further improved by increasing the number of search epochs from 90 to 360. For the cost of a single 360-epoch search, we can evaluate 15-30 random models. We provide a generous budget of 20-50 models for our random search experiments in order to demonstrate that efficient architecture search can outperform random search even if each random search experiment is more compute-intensive.

G. Quality of Aggressive Weight Sharing

To verify the quality impact of aggressive weight sharing, we ran architecture searches on the original ProxylessNAS search space both with and without aggressive sharing. The results (Table 12) indicate that aggressive weight sharing does not significantly affect searched model accuracies in this space. Our other two search spaces (ProxylessNAS-Enlarged and MobileNetV3-Like) were too large for us to run searches without aggressive weight sharing.

Agg. Sharing   Valid Acc (%)   Test Acc (%)   Latency (ms)
No             76.4 ± 0.1      75.0 ± 0.1     84.1 ± 0.4
Yes            76.3 ± 0.2      75.0 ± 0.1     84.0 ± 0.4

Table 12: Effect of aggressive weight sharing (abbreviated as "Agg. Sharing" in the table above) on the quality of searched architectures. Each search is run for 90 epochs on the ProxylessNAS search space with op and filter warmup enabled.
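Finally, the code sketch referenced in Appendix E.4: the lookup table maps each operation's configuration to a measured latency, and a network's simulated latency is the sum over its operations. The keys and values below are purely hypothetical placeholders, not entries from our actual tables.

# Hypothetical entries: (op type, input resolution, input filters, output filters) -> latency in ms.
LATENCY_TABLE = {
    ("expanded_conv_3x3", 112, 16, 24): 1.8,
    ("expanded_conv_5x5", 56, 24, 40): 1.2,
    ("conv_1x1", 28, 40, 80): 0.6,
}

def simulated_latency(architecture):
    # Sum the per-operation latencies of every op in the candidate architecture.
    return sum(LATENCY_TABLE[op] for op in architecture)

# Example (hypothetical): an architecture is represented as a list of op configurations.
print(simulated_latency([("expanded_conv_3x3", 112, 16, 24), ("conv_1x1", 28, 40, 80)]))
# Prints the summed latency of the two ops (about 2.4 ms here).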