Revisiting Multi-Scale Feature Fusion for Semantic Segmentation

Tianjian Meng 1   Golnaz Ghiasi 1   Reza Mahjourian 2   Quoc V. Le 1   Mingxing Tan 1
1 Google Research   2 Waymo Inc.
{mengtianjian, tanmingxing}@google.com

Abstract

It is commonly believed that high internal resolution combined with expensive operations (e.g., atrous convolutions) is necessary for accurate semantic segmentation, resulting in slow speed and large memory usage. In this paper, we question this belief and demonstrate that neither high internal resolution nor atrous convolutions are necessary. Our intuition is that although segmentation is a dense per-pixel prediction task, the semantics of each pixel often depend on both nearby neighbors and far-away context; therefore, a more powerful multi-scale feature fusion network plays a critical role. Following this intuition, we revisit the conventional multi-scale feature space (typically capped at P5¹) and extend it to a much richer space, up to P9, where the smallest features are only 1/512 of the input size and thus have very large receptive fields. To process such a rich feature space, we leverage the recent BiFPN to fuse the multi-scale features. Based on these insights, we develop a simplified segmentation model, named ESeg, which has neither high internal resolution nor expensive atrous convolutions. Perhaps surprisingly, our simple method can achieve better accuracy with faster speed than prior art across multiple datasets. In real-time settings, ESeg-Lite-S achieves 76.0% mIoU on CityScapes [12] at 189 FPS, outperforming FasterSeg [9] (73.1% mIoU at 170 FPS). Our ESeg-Lite-L runs at 79 FPS and achieves 80.1% mIoU, largely closing the gap between real-time and high-performance segmentation models.

¹ Pi denotes features with resolution P0/2^i, where P0 is the input image size.

Figure 1. Model sizes vs. CityScapes validation mIoU (FLOPs in billions on the x-axis, mIoU on the y-axis). All models in the figure use the single-scale evaluation protocol, and "+data" denotes using extra data for pretraining and self-training. FLOPs are calculated at 1024×2048 input resolution. The plot compares ESeg-S/M/L (with and without extra data) against Auto-DeepLab, HRNetV2, Panoptic-DeepLab, DeepLabV3, DeepLabV3+, PSPNet, Naive-Student, and SegFormer B0-B5. Our proposed ESeg models are much simpler, yet still outperform previous models with better quality and lower computation cost.

1. Introduction

Semantic segmentation is an important computer vision task that has been widely used in robotics and autonomous driving. The main challenge in semantic segmentation lies in the per-pixel dense prediction, where we need to predict the semantic class for each pixel. State-of-the-art (SOTA) segmentation models [6, 8, 35, 44, 50] heavily rely on high internal resolution for dense prediction, and it is commonly believed that such high internal resolution is necessary to learn accurate per-pixel semantics, leading to slow runtime speed and large runtime memory usage. For example, DeepLabV3+ [8] requires more than 1000B FLOPs (vs. common real-time models using <100B FLOPs), making it prohibitively expensive to apply in real-time scenarios. Meanwhile, the high-resolution feature maps make it difficult to obtain large receptive fields. Many attempts have been made to address this issue, for example by replacing regular convolutions with atrous convolutions [6, 8, 44, 50] or by adding additional attention modules [14, 45, 46] to enlarge receptive fields.
Although these methods improve accuracy, atrous convolutions and attention modules are usually much slower than regular convolutions. Recent real-time semantic segmentation models instead learn high-resolution representations in different ways to satisfy latency constraints, either by hand-crafting better networks [20, 43] or by using neural architecture search [9, 21]. These methods tend to improve accuracy, but they still need to maintain the high internal resolution.

In this paper, we question the belief in high internal resolution and demonstrate that neither high internal resolution nor atrous convolutions are necessary for accurate segmentation. Our intuition is that although segmentation is a dense per-pixel prediction task, the semantics of each pixel often depend on both nearby neighbors and far-away context; therefore, the conventional multi-scale feature space can easily become a bottleneck, and a more powerful multi-resolution feature fusion network is desired. Following this intuition, we revisit the conventional simple feature fusion space (typically up to P5 from traditional classification backbones) and extend it to a much richer space, up to P9, where the smallest internal resolution is only 1/512 of the input image size and thus can have a very large receptive field.

Based on this richer multi-resolution feature space, we present a simplified segmentation model named ESeg, which has neither high internal resolution nor expensive atrous convolutions. To keep the model as simple as possible, we adopt a naive encoder-decoder network structure, where the encoder is an off-the-shelf EfficientNet [36] and the decoder is a slightly-modified BiFPN [38]. We observe that, as the multi-resolution feature space up to P9 carries much richer information than conventional approaches, a powerful bi-directional feature pyramid is extremely important for effectively fusing these features. Unlike the conventional top-down FPN, BiFPN allows both top-down and bottom-up feature fusion, enabling more effective feature fusion in the rich multi-scale feature space.

We evaluate ESeg on CityScapes [12] and ADE20K [52], two widely used benchmarks for semantic segmentation. Surprisingly, although ESeg is much simpler, it consistently matches the performance of prior art across different datasets. By scaling up the network size, our ESeg models achieve better accuracy while using 2x-9x fewer parameters and 4x-50x fewer FLOPs than competitive methods. In real-time settings, our ESeg-Lite-S achieves 76.0% mIoU on the CityScapes validation set at 189 FPS, outperforming the prior art FasterSeg by 2.9% mIoU at faster speed. With 80.1% CityScapes mIoU at 79 FPS, our ESeg-Lite-L bridges the gap between real-time and advanced segmentation models for the first time.

In addition to the standard CityScapes/ADE20K datasets, our models also perform well on large-scale datasets. By pretraining on the larger Mapillary Vistas [27] and self-training on CityScapes coarse-labeled data, our ESeg-L achieves 84.8% mIoU on the CityScapes validation set, while being much smaller and faster than prior art. These results highlight the importance of multi-scale features for semantic segmentation, demonstrating that a simple network is also able to achieve state-of-the-art performance.

2. Revisiting Multi-Scale Feature Fusion

Semantic segmentation models need to predict the semantic class of individual pixels.
Due to this dense prediction requirement, it is common to maintain high-resolution features and perform pixel-wise classification on them. However, high-resolution features bring two critical challenges: first, they often require expensive computation and large memory, with O(n^2) complexity with respect to the resolution; second, it is difficult to obtain large receptive fields with regular convolutions, so capturing long-distance correlations would require very deep networks, expensive atrous operations, or spatial attention mechanisms, making models even slower and harder to train.

To mitigate the issues of high-resolution features, many previous works employ multi-scale feature fusion. The core idea is to leverage multiple features with different resolutions to capture both short- and long-distance patterns. Specifically, a common practice is to adopt an ImageNet-pretrained backbone network (such as ResNet [17]) to extract feature scales up to P5, and then apply an FPN [23] to fuse these features in a top-down manner, as illustrated in Figure 2(a). However, recent works [5, 35] show that such naive feature fusion is inferior to high-resolution features: for example, DeepLab tends to keep high-resolution features combined with atrous convolutions as shown in Figure 2(b), but atrous convolutions are usually hardware-unfriendly and slow. Similarly, HRNet [35] maintains multiple parallel branches of high-resolution features as shown in Figure 2(c). Both approaches lead to high computational cost and memory usage.

Figure 2. Multi-scale feature space. Pi denotes a feature map with resolution P0/2^i, where P0 is the input image size. (a) Conventional feature networks mostly use up to P5 from an ImageNet backbone, but the accuracy is relatively low; (b) DeepLab [5, 8] maintains high resolution and uses atrous convolutions to capture larger receptive fields, with the overhead of slow runtime; (c) HRNet [35] maintains a separate path for each resolution, resulting in large memory usage for the high-resolution features; (d) we simply add more low-resolution features P6-P9 to capture high-level semantics. The additional P6-P9 features improve accuracy (see Table 1) with minimal overhead, thanks to the large receptive fields of the low-resolution feature maps.

Feature Levels | mIoU        | #Params | FLOPs
P2-P5          | 78.3        | 6.4M    | 34.3B
P2-P7          | 79.5 (+1.2) | 6.6M    | 34.5B
P2-P9          | 80.1 (+1.8) | 6.9M    | 34.5B

Table 1. Comparison of different feature levels on the CityScapes dataset. The Pi feature level has resolution P0 · 2^-i, where P0 is the input image size. Using more high-level features enlarges receptive fields and improves accuracy with negligible overhead.

Here we revisit the multi-scale feature map and identify insufficient feature levels as the key limitation of the vanilla P2-P5 multi-scale feature space. P2-P5 was originally designed for ImageNet backbones, where the image size is as small as 224×224, but segmentation images have much higher resolution: in CityScapes, each image has a resolution of 1024×2048. Such large input images require much larger receptive fields, raising the need for higher-level feature maps. Table 1 compares the performance of different feature levels. When we simply increase the feature levels to P2-P9 (adding four extra levels P6-P9), accuracy improves significantly by 1.8 mIoU with negligible overhead in parameters and FLOPs.
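To make the scale gap concrete, the short Python snippet below (our illustration, not code from the paper) prints the spatial size of each feature level for a 1024×2048 CityScapes input; a single 3×3 convolution on the tiny 2×4 P9 map already covers the whole image, which is why no atrous convolutions are needed at these levels:

    # Feature level Pi has resolution P0 / 2^i (see footnote 1).
    input_h, input_w = 1024, 2048  # CityScapes image size
    for i in range(2, 10):
        print(f"P{i}: {input_h >> i} x {input_w >> i}")
    # P2: 256 x 512, ..., P5: 32 x 64, ..., P9: 2 x 4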
Notably, for high level features (P6 −P9) that are already down-sampled from the input image, there is no need to use hardware-unfriendly atrous convolutions to increase the re- ceptive fields any more. This leads to our first observation: Insight 1: Larger feature spaces up to P9 are much more effective than conventional spaces capped at P5. Given such large feature space like P2 −P9, a natu- ral question is how to fuse them effectively. Many previ- ous works simply apply a top-down FPN [23] to fuse the features. Recently, PANet [25] proposes to add an extra bottom-up fusion path, and EfficientDet [38] proposes to fuse features in bidirectional fashion for object detection. Here we study the performance of different feature fusion approaches. As shown in Table 2, when the feature space is larger (P2 −P9), a more powerful BiFPN performs signifi- cantly better than a simple FPN. Intuitively, a larger feature space needs more layers to propagate the information and more bidirectional flows to effectively fuse them. Based on these studies, we have the second observation: Insight 2: A powerful feature fusion network is critical for a large multi-scale feature space. 3. ESeg Based on our two key insights, we now develop a sim- ple encoder-decoder-based segmentation model, aiming to Feature Space Feature Network mIoU FLOPs P2 −P5 FPN 78.0 39.8B BiFPN 78.3 (+0.3) 34.3B P2 −P9 FPN 79.1 40.0B BiFPN 80.1 (+1.0) 34.5B Table 2. Comparison of different feature fusion networks for CityScapes dataset. BiFPN and FPN has similar performance for small feature space P2 −P5, but the different is much significant for large feature space P2 −P9. We adjust the filter sizes of FPN and BiFPN to ensure they have comparable computational cost. achieve better performance without relying on atrous con- volutions or dedicated high-resolution features. 3.1. Network Design Figure 3 shows an overview of our ESeg network. It has a standard encoder-decoder structure, where the encoder ex- tracts multiple levels of features from the raw images and the decoder performs extensive multi-scale feature fusion. After the decoder, we upsample and weighted-sum all fea- tures, and predict the dense per-pixel semantic class. We describe each component and highlight our design choices in following: Encoder: We use EfficientNet [36] as the backbone to extract the first five levels of features {P1, P2, P3, P4, P5}, where Pi denotes the i-th level of features with spatial size of P0/2i. Please note that our design is general and can be applied to any potentially-better convolutional classification backbones in the future. Here we use EfficientNet as our de- 3 Weighted Sum Additional Features P6 – P9 Backbone Features P2 - P5 EfficientNet Encoder Repeatable BiFPN Decoder Predictions Figure 3. ESeg network architecture. The backbone [36] extracts {P2 −P5} feature maps from the raw input images; Four additional feature maps {P6 −P9} are added on top of these backbone features with simple average pooling. The decoder perform bidirectional multi-scale feature fusion [38] to strength the internal representations for each feature map. All feature maps are upsampled and combined with weighted sum to generate the final per-pixel prediction. fault backbone to illustrate our idea. Based on our insight 1, we add four more extra features P6 −P9 on top of P5, by downsampling the features with 3×3 stride-2 pooling lay- ers. For simplicity, we use the same hidden size of P5 for all other P6 −P9 features. 
Notably, we only use regular convolutional and pooling layers in the encoder, without any hardware-unfriendly atrous convolutions.

Decoder: The decoder is responsible for combining the multi-scale features. Based on Insight 2, we employ the powerful BiFPN [38] to repeatedly apply top-down and bottom-up bidirectional feature fusion. The BiFPN outputs P2-P9 feature maps of the same shapes as its inputs, but each map is already fused with information from the other feature scales.

Prediction: To generate the final prediction, we upsample all decoder features to a fixed resolution and combine them into a unified high-resolution feature map with a weighted sum. Specifically, given the P2-P9 feature maps, we introduce eight additional learnable weights w2-w9, where wi is a scalar variable for input feature Pi. These features are then combined using a softmax-based weighted-sum fusion:

    O = \sum_i \frac{e^{w_i}}{\sum_j e^{w_j}} \cdot \mathrm{Upsample}(P_i)

where wi is a non-negative variable initialized to 1.0 and updated during back-propagation, and Upsample is a simple bilinear interpolation operation to match the resolution. Intuitively, wi represents the importance of feature map Pi: if Pi is more important to the output, then wi will become larger during training. After training is done, we can remove the softmax function and perform a plain weighted-sum fusion for faster inference.

A prediction head is applied to the final output feature O to produce the pixel-level class predictions. Notably, since the raw input image size is usually very large (e.g., 1024×2048), it is expensive to perform prediction directly at the input resolution. Instead, we apply our prediction head at the lowest feature level P2 (one fourth of the original size), and then bilinearly upsample the prediction to the original input image resolution to calculate the per-pixel loss.
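This fusion is straightforward to express in code. Below is a minimal PyTorch sketch of the weighted fusion above (our illustration rather than the authors' implementation; the module name, the explicit out_size argument, and the example shapes are assumptions):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class WeightedFusion(nn.Module):
        """O = sum_i softmax(w)_i * Upsample(P_i) over the P2-P9 decoder outputs."""
        def __init__(self, num_levels=8):
            super().__init__()
            self.w = nn.Parameter(torch.ones(num_levels))  # w2..w9, initialized to 1.0

        def forward(self, feats, out_size):
            # feats: list of [N, C, Hi, Wi] tensors; out_size: (H, W) of the fused map.
            alpha = torch.softmax(self.w, dim=0)
            up = [F.interpolate(f, size=out_size, mode="bilinear", align_corners=False)
                  for f in feats]
            return sum(a * f for a, f in zip(alpha, up))

    # Fuse eight levels at the P2 resolution (256x512 for a 1024x2048 input).
    feats = [torch.randn(1, 96, 1024 >> i, 2048 >> i) for i in range(2, 10)]
    fused = WeightedFusion(num_levels=8)(feats, out_size=(256, 512))

After training, the softmax can be folded into fixed scalar weights, matching the paper's note that the softmax is removed for faster inference.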
3.2. Network and Data Scaling

Our goal is to design a family of segmentation models with different accuracy and latency trade-offs; however, hand-crafting a new model for each latency constraint is difficult and expensive. Here we instead adopt a compound scaling approach similar to previous work [36, 38]. For the encoder backbone, we use the same scaling factors as the original EfficientNet so that we can reuse its ImageNet checkpoints. For the BiFPN decoder, we scale it up linearly by increasing the depth (number of repeats) and width (number of input/output channels). Table 3 shows the width and depth scaling configurations, where the smallest model ESeg-Lite-S has 17B FLOPs and the largest model ESeg-L has 342B FLOPs. By scaling up ESeg, we show that it consistently outperforms prior works across different network size constraints.

Since segmentation datasets are usually small, another scaling dimension is data size. In our experiments, we show that pretraining and self-training on extra data can further improve model accuracy. We explicitly note when a model is trained with extra data.

Model       | Encoder width | Encoder depth | Decoder #channels | Decoder #repeats
ESeg-Lite-S | 0.4           | 0.6           | 64                | 1
ESeg-Lite-M | 0.6           | 1.0           | 80                | 2
ESeg-Lite-L | 1.0           | 1.0           | 96                | 3
ESeg-S      | 1.0           | 1.1           | 96                | 4
ESeg-M      | 1.4           | 1.8           | 192               | 5
ESeg-L      | 2.0           | 3.1           | 288               | 6

Table 3. Network size scaling configurations.

3.3. GPU Inference Optimization

As many segmentation models run on GPUs with TensorRT, we further optimize our real-time ESeg-Lite models for TensorRT execution. First, we replace all MBConv blocks [34, 36] with Fused-MBConv [16], as depthwise convolutions are not well supported in TensorRT and cannot fully utilize GPU parallelism. Second, we remove GPU-unfriendly operations such as squeeze-and-excitation [19], and replace the SiLU (Swish-1) [13, 32] activation with regular ReLU. Lastly, we raise the base feature level from P2 to P3 to avoid expensive computation at large spatial dimensions, and move the prediction head to P3 accordingly.

4. Experiments

We evaluate ESeg on the popular CityScapes [12] and ADE20K [52] datasets, and compare the results with previous state-of-the-art segmentation models.

4.1. Setup

Our models are trained on 8 TPU cores with a batch size of 16. We use SGD as our optimizer with momentum 0.9. Following previous work [35], we apply random horizontal flipping and random scale jittering in [0.5, 2.0] during training. We use cosine learning rate decay, which yields similar performance to the commonly used polynomial learning rate decay but with fewer tunable hyperparameters. For fair comparison with [3, 10, 14, 35, 43, 45], we use the same online hard example mining (OHEM) strategy during training. We also find that applying exponential moving average decay [36, 38] helps stabilize training without influencing the final performance.

We report our main results in mIoU (mean intersection-over-union) and pixAcc (pixel accuracy) under the single-model, single-scale setting with no test-time augmentation. We note that multi-scale inference may further boost these metrics, but it is also much slower in practice.

CityScapes: The CityScapes dataset [12] contains 5,000 high-resolution images with fine annotations, focusing on urban street scene understanding. The fine-annotated images are divided into 2,975/500/1,525 images for training, validation and testing, respectively. 19 out of the 30 total classes are used for evaluation. Following the training protocol of [35, 50], we randomly crop images from 1024×2048 to 512×1024 for training. We adopt an initial learning rate of 0.08 and a weight decay of 0.00005. Our models are trained for 90k steps (484 epochs), the same schedule length as popular methods [35].

ADE20K: The ADE20K dataset [52] is a scene parsing benchmark with 20,210 images for training and 2,000 images for validation. Models are evaluated using pixel-wise accuracy and mIoU over 150 semantic categories. Following previous work [45, 48], we use the same image size of 520×520 for both training and inference. We train our models for 300 epochs with an initial learning rate of 0.08 and a weight decay of 0.00005.

4.2. Large-Scale Segmentation Results

We scale up ESeg to larger network sizes using the scaling configurations in Table 3. Table 4 and Table 5 compare their performance with other state-of-the-art segmentation models on CityScapes and ADE20K. In general, our ESeg consistently outperforms previous models, often by a large margin. For example, ESeg-S achieves higher accuracy than the recent Auto-DeepLab-S [24] while using 1.5x fewer parameters and 9.7x fewer FLOPs. On CityScapes, ESeg-L achieves 82.6% mIoU with single-scale inference, outperforming the prior art HRNetV2-W48 by 1.5% mIoU with far fewer FLOPs. These results suggest that ESeg is scalable and can achieve remarkable performance across various resource constraints.
On ADE20K, ESeg-L also outperforms previous multi-scale state-of-the-art CNN models by 1.3%, although recent transformer-based models achieve better performance with much more computation.

Model                          | val mIoU (w/o extra data) | val mIoU (w/ extra data) | Params | Ratio | FLOPs | Ratio
ESeg-S                         | 80.1  | 81.7 | 6.9M   | 1x   | 34.5B | 1x
Auto-DeepLab-S [24]            | 79.7  | -    | 10.2M  | 1.5x | 333B  | 9.7x
PSPNet (ResNet-101) [50]       | 79.7  | -    | 65.9M  | 9.6x | 2018B | 59x
OCR (ResNet-101) [45]          | 79.6  | -    | -      | -    | -     | -
DeepLabV3+ (Xception-71) [8]   | 79.6  | -    | 43.5M  | 6.3x | 1445B | 42x
DeepLabV3+ (ResNeXt-50) [53]   | 79.5  | 81.4 | -      | -    | -     | -
DeepLabV3 (ResNet-101) [6]     | 78.5  | -    | 58.0M  | 8.4x | 1779B | 52x
ESeg-M                         | 81.6  | 83.7 | 20.0M  | 1x   | 112B  | 1x
HRNetV2-W48 [35]               | 81.1  | -    | 65.9M  | 3.3x | 747B  | 6.7x
OCR (HRNet-W48) [45]           | 81.1  | -    | -      | -    | -     | -
ACNet (ResNet-101) [14]        | 80.9  | -    | -      | -    | -     | -
Naive-Student [3]              | 80.7  | 83.4 | 147.3M | 7.3x | 3246B | 29x
Panoptic-DeepLab (X-71) [10]   | 80.5  | 82.5 | 46.7M  | 2.3x | 548B  | 4.9x
DeepLabV3 (ResNeSt-101) [48]   | 80.4† | -    | -      | -    | -     | -
Auto-DeepLab-L [24]            | 80.3  | -    | 44.4M  | 2.2x | 695B  | 6.2x
HRNetV2-W40 [35]               | 80.2  | -    | 45.2M  | 2.3x | 493B  | 4.1x
Auto-DeepLab-M [24]            | 80.0  | -    | 21.6M  | 1.1x | 461B  | 4.1x
DeepLabV3 (ResNeSt-50) [48]    | 79.9† | -    | -      | -    | -     | -
OCNet (ResNet-101) [46]        | 79.6  | -    | -      | -    | -     | -
ESeg-L                         | 82.6  | 84.8 | 70.5M  | 1x   | 343B  | 1x
SegFormer-B5 [40]              | 82.4  | -    | 84.7M  | 1.2x | 1460B | 4.3x

Table 4. Performance comparison on CityScapes. † denotes results using the multi-scale evaluation protocol. All our models are evaluated with the single-scale evaluation protocol.

Model                         | mIoU       | PixAcc
ESeg-M                        | 46.0       | 81.3
OCR (ResNet-101) [45]         | 44.3/45.3† | -
HRNetV2-W48 [35]              | 43.1/44.2† | -
Auto-DeepLab-M [24]           | 42.2†      | 81.1†
PSPNet (ResNet-101) [50]      | 42.0†      | 80.6†
Auto-DeepLab-S [24]           | 40.7†      | 80.6†
ESeg-L                        | 48.2       | 81.8
DeepLabV3 (ResNeSt-101) [48]  | 46.9†      | 82.1†
ACNet (ResNet-101) [14]       | 45.9†      | 82.0†
OCR (HRNet-W48) [45]          | 44.5/45.5† | -
OCNet (ResNet-101) [46]       | 45.5†      | -
DeepLabV3 (ResNeSt-50) [48]   | 45.1†      | 81.2†
Auto-DeepLab-L [24]           | 44.0†      | 81.7†
SETR [51]                     | 46.3       | -
Swin-S [26]                   | 49.3†      | -
SegFormer-B4 [40]             | 50.3       | -

Table 5. Performance comparison on ADE20K. † denotes results using the multi-scale evaluation protocol. All our models are evaluated with the single-scale evaluation protocol. Recent transformer-based models are listed in the last three rows.

CityScapes test set performance. Previous works usually retrain or finetune their models on the train+validation set of CityScapes to boost their test set performance. However, this introduces extra training cost and more tunable hyperparameters, which is not scalable and makes comparison harder. Meanwhile, some leaderboard submissions add extra tricks such as multi-scale inference and extra post-processing. We believe the best practice is to evaluate the trained checkpoints directly on the CityScapes test set, under the simplest single-scale inference protocol. Our ESeg-L shows strong generalization by achieving a remarkable 84.1% mIoU on the CityScapes leaderboard without bells and whistles, which is on par with top leaderboard results.

4.3. Real-Time Segmentation Results

We evaluate ESeg under real-time inference settings. Since our real-time models are extremely small, we use a larger crop size for training, following common practice [18]. Table 6 shows the performance comparison between our ESeg-Lite models and other real-time segmentation models. We measure inference speed on a Tesla V100-SXM2 16GB GPU with TensorRT 7.2.1.6, CUDA 11.0 and cuDNN 8.0. For fair comparison with the strongest prior art, FasterSeg [9], we use its official open-sourced code and rerun the inference benchmark in our environment. We use a batch size of 1 and an input resolution of 1024×2048.
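For reference, FPS numbers of this kind are typically obtained with a warmed-up, synchronized timing loop; the sketch below is a generic PyTorch version we add for illustration (the paper's measurements go through TensorRT, and the warm-up and iteration counts here are assumptions):

    import time
    import torch

    @torch.no_grad()
    def measure_fps(model, input_size=(1, 3, 1024, 2048), warmup=50, iters=200):
        """Average FPS at batch size 1, mirroring the benchmark setting above."""
        model.eval().cuda()
        x = torch.randn(*input_size, device="cuda")
        for _ in range(warmup):            # warm up kernels and cuDNN autotuning
            model(x)
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(iters):
            model(x)
        torch.cuda.synchronize()           # wait for all GPU work before stopping the clock
        return iters / (time.perf_counter() - start)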
As shown in Table 6, our ESeg-Lite models outperform the SOTA by a large margin. In particular, our ESeg-Lite-S achieves the highest speed of 189 FPS with 2.9% better mIoU than FasterSeg. Moreover, our ESeg-Lite-L achieves 80.1% mIoU on CityScapes, largely bridging the accuracy gap between real-time models and previous full-size models (see Table 4 for comparison).

Model               | mIoU | Input Size | FPS
ESeg-Lite-M         | 74.0 | 512×1024   | 334†
ESeg-Lite-S         | 76.0 | 1024×2048  | 189†
FasterSeg [9]       | 73.1 | 1024×2048  | 170† / 164
DF1-Seg [21]        | 74.1 | 768×1536   | 106
BiSeNet-V2 [43]     | 73.4 | 512×1024   | 156
DF1-Seg-d8 [21]     | 72.4 | 768×1536   | 137
ESeg-Lite-M         | 78.6 | 1024×2048  | 125†
DDRNet-23-Slim [18] | 77.8 | 1024×2048  | 109
DF1-Seg2 [21]       | 76.9 | 768×1536   | 56
FANet-34 [20]       | 76.3 | 1024×2048  | 58
DF1-Seg1 [21]       | 75.9 | 768×1536   | 67
BiSeNetV2-L [43]    | 75.8 | 512×1024   | 47
SwiftNetRN-18 [30]  | 75.4 | 1024×2048  | 40
FANet-18 [20]       | 75.0 | 1024×2048  | 72
ESeg-Lite-L         | 80.1 | 1024×2048  | 79†
DDRNet-23 [18]      | 79.5 | 1024×2048  | 39

Table 6. Performance comparison on CityScapes under real-time speed settings. † denotes our measurements on the same V100 GPU with the same inference code.

Figure 4. Inference speed vs. CityScapes validation mIoU. The real-time ESeg-Lite family of models outperforms previous models (DDRNet, FANet, DF1-Seg, BiSeNetV2, SwiftNet, FasterSeg) by a large margin at much faster speed.

4.4. Pre- and Self-Training with Extra Data

Given that segmentation datasets often have limited training sets, we study two types of data size scaling: pretraining on the large-scale Mapillary Vistas dataset and self-training on CityScapes unlabelled data. Due to the dataset policy, we do not conduct data scaling experiments on the ADE20K dataset [52].

Pretraining. We follow the common setting in [10, 45] and pretrain our models on the larger urban scene understanding dataset Mapillary Vistas [27]. We resize the images to 2048 pixels on the longer side to handle the image size variations, and randomly crop the images to 1024×1024 during training. The models are pretrained for 500 epochs on Mapillary Vistas. We use a slightly different training setting when finetuning the models from Mapillary Vistas pretrained checkpoints: specifically, we train for only 200 epochs with an initial learning rate of 0.01.

Self-training. We perform self-training on the coarse-annotated data of CityScapes. Note that we treat this set as unlabeled data and do not utilize its coarse annotations. To generate pseudo-labels from a teacher, we employ multi-scale inference with scales [0.5, 1.0, 2.0] and horizontal flipping. We only keep predictions with a confidence higher than 0.5; otherwise, the pixel is set to the ignore area. We use a self-training setting similar to [54], with a batch size of 128 and an initial learning rate of 0.16, and train the model for 1000 epochs.
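The pseudo-label generation step described above can be summarized by the following sketch (our illustration, assuming a teacher that outputs per-pixel class logits; the probability averaging and the ignore index value of 255 are assumptions, not details from the paper):

    import torch
    import torch.nn.functional as F

    @torch.no_grad()
    def make_pseudo_labels(teacher, image, scales=(0.5, 1.0, 2.0),
                           thresh=0.5, ignore_index=255):
        """Multi-scale + horizontal-flip teacher inference, keeping only confident pixels."""
        n, _, h, w = image.shape
        probs = 0.0
        for s in scales:
            x = F.interpolate(image, scale_factor=s, mode="bilinear", align_corners=False)
            for flip in (False, True):
                xi = torch.flip(x, dims=[3]) if flip else x
                p = teacher(xi).softmax(dim=1)          # [N, num_classes, h*s, w*s]
                if flip:
                    p = torch.flip(p, dims=[3])
                probs = probs + F.interpolate(p, size=(h, w), mode="bilinear",
                                              align_corners=False)
        probs = probs / (2 * len(scales))
        conf, label = probs.max(dim=1)                  # per-pixel confidence and class
        label[conf < thresh] = ignore_index             # low-confidence pixels are ignored
        return label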
Half of the images in a batch are sampled from the ground-truth fine-annotated data, and the other half from the generated pseudo-labels. Our data scaling method thus consists of both pretraining and self-training.

Model size    | S           | M           | L
Baseline      | 80.1        | 81.6        | 82.6
Pre-training  | 81.1 (+1.0) | 82.5 (+0.9) | 83.8 (+1.2)
Self-training | 81.4 (+1.3) | 82.6 (+1.0) | 83.5 (+0.9)
Combined      | 81.7 (+1.6) | 83.7 (+2.1) | 84.8 (+2.2)

Table 7. Performance with pre-training and self-training on various ESeg models. Observations: (1) larger models such as ESeg-L benefit more from the extra data; (2) self-training is more effective than pretraining; (3) pretraining and self-training are complementary, and combining them obtains the best accuracy gains.

Table 7 shows the pretraining and self-training results. Not surprisingly, both pretraining and self-training improve accuracy. Self-training generally provides slightly higher gains than pretraining for most models, and the accuracy gains on larger models tend to be more pronounced, suggesting that larger models have more capacity to benefit from the extra data. Perhaps surprisingly, we observe that pretraining and self-training are complementary, and combining them can further improve accuracy by a large margin. For example, ESeg-M gains 0.9% accuracy with pretraining and 1.0% with self-training, but combining them provides a 2.1% gain, significantly better than either self-training or pretraining alone. Notably, by combining pretraining and self-training, our largest model ESeg-L achieves the best result of 84.8% mIoU on the CityScapes validation set.

5. Ablation Study

In this section, we ablate our encoder and decoder design choices on the CityScapes dataset.

A powerful backbone is crucial for an encoder-decoder architecture, as it needs to encode the high-resolution images into low-resolution features. Table 8 shows the performance of different backbones. Compared to the widely used ResNet-50 [17], EfficientNet-B1 [36] achieves 1.2% better mIoU with far fewer FLOPs, suggesting the critical role of a good backbone network in encoder-decoder architectures.

The decoder also plays an important role, as it needs to recover high-resolution features. Table 8 compares the previously widely used DeepLabV3+ decoder [8] with the BiFPN adopted in ESeg. Combined with the higher feature levels, BiFPN achieves better performance than the combination of ASPP and the DeepLabV3+ decoder in terms of both accuracy and efficiency. Our ablation study also demonstrates the importance of high internal resolution in DeepLab-like models, which is very expensive. In contrast, ESeg provides a more elegant and efficient alternative.

Encoder         | Decoder                 | mIoU | FLOPs
EfficientNet-B1 | BiFPN (w/o atrous)      | 80.1 | 34.5B
EfficientNet-B1 | DeepLabV3+ (w/ atrous)  | 79.4 | 91.8B
EfficientNet-B1 | DeepLabV3+ (w/o atrous) | 78.8 | 49.9B
ResNet-50       | BiFPN (w/o atrous)      | 78.9 | 188.0B
ResNet-50       | DeepLabV3+ (w/ atrous)  | 77.8 | 324.3B
ResNet-50       | DeepLabV3+ (w/o atrous) | 77.4 | 230.3B

Table 8. Encoder and decoder choices. All models are trained with exactly the same training settings. BiFPN outperforms the DeepLabV3+ decoder [8] regardless of whether atrous convolutions are used.

6. Related Work

Efficient Network Architecture: Many previous works aim to improve segmentation model efficiency via better hand-crafted networks [20, 43] or network pruning [21]. Recently, neural architecture search (NAS) [36, 37, 55] has become a popular tool to improve model accuracy and efficiency. For semantic segmentation, DPC [2] searches for a multi-scale dense prediction cell on top of a hand-crafted backbone; Auto-DeepLab [24] jointly searches the cell-level and network-level architecture for segmentation. Recent works [9, 49] also use a similar hierarchical search space, targeting real-time speed or better accuracy.
Encoder-decoder structure: Early works, including U-Net [33] and SegNet [1], adopt a symmetric encoder-decoder structure to effectively extract spatial context. Such structures first extract high-level semantic information from the input images using an encoder network, and then add extra top-down and lateral connections to fuse the features in the decoder. Most feature encoders used in semantic segmentation come from architectures developed for image classification [11, 17, 41, 47, 48], while some are designed to keep high-resolution features for dense prediction tasks [22, 35, 39]. Different decoder modules have also been proposed to recover high-resolution representations [1, 23, 28, 29, 31, 33, 42]. Our work is largely based on the encoder-decoder structure for its simplicity.

Multi-scale features: Multi-scale features have been widely used in dense prediction tasks such as object detection and segmentation. While high-resolution features preserve more spatial information with smaller receptive fields, lower-resolution features can have larger receptive fields to capture more contextual information. A key challenge is how to effectively combine these multi-scale features. Many previous works in object detection use feature pyramid networks (FPN) [23] and their variants [15, 25] to perform multi-scale feature fusion. For semantic segmentation, ASPP [5] is widely used to obtain multi-scale features [5, 7, 8, 50]. Other studies also exploit multi-scale features with attention [14, 45]. In our work, we leverage a more powerful bidirectional feature network [38] to extensively fuse multi-scale features.

Atrous convolutions: Atrous convolutions are widely used in semantic segmentation architectures [4-6, 8, 44, 50] to enlarge the receptive field. PSPNet [50] uses a pyramid pooling module on top of a dilated backbone. DeepLab [5] proposes atrous spatial pyramid pooling to extract multi-scale features for dense prediction, and DeepLabV3+ [8] further improves it with an additional decoder module to refine the representation. Unfortunately, atrous convolutions are not hardware friendly and usually run slowly in real-world scenarios. In contrast to this common practice, our proposed model does not use any atrous convolutions.

7. Conclusion

In this paper, we present a family of simple ESeg models, which use neither high internal resolution nor atrous convolutions. Instead, they add a much richer multi-scale feature space and use a more powerful bi-directional feature network to capture local and global semantic information. We show that the simple encoder-decoder-based ESeg can outperform the prior art on various datasets. In particular, our small ESeg-Lite models significantly bridge the gap between real-time and server-size models, and our ESeg achieves state-of-the-art performance on both CityScapes and ADE20K with far fewer parameters and FLOPs. With extra pretraining and self-training, our ESeg-L further pushes the CityScapes accuracy to 84.8% while remaining a single-model, single-scale method. Our study shows that, despite the difficulty of dense prediction, it is possible to achieve both high accuracy and fast speed. We hope our work can spark more exploration of how to design segmentation models with better performance and efficiency.

Limitations: We intentionally keep our networks simple in this paper, but it is possible to further optimize them, e.g., by searching for new backbone or feature networks. We leave this for future work.
References

[1] Vijay Badrinarayanan, Alex Kendall, and Roberto Cipolla. Segnet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(12):2481–2495, 2017.
[2] Liang-Chieh Chen, Maxwell Collins, Yukun Zhu, George Papandreou, Barret Zoph, Florian Schroff, Hartwig Adam, and Jon Shlens. Searching for efficient multi-scale architectures for dense image prediction. In Advances in Neural Information Processing Systems, pages 8699–8710, 2018.
[3] Liang-Chieh Chen, Raphael Gontijo Lopes, Bowen Cheng, Maxwell D Collins, Ekin D Cubuk, Barret Zoph, Hartwig Adam, and Jonathon Shlens. Semi-supervised learning in video sequences for urban scene segmentation. arXiv preprint arXiv:2005.10266, 2020.
[4] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. Semantic image segmentation with deep convolutional nets and fully connected CRFs. arXiv preprint arXiv:1412.7062, 2014.
[5] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(4):834–848, 2017.
[6] Liang-Chieh Chen, George Papandreou, Florian Schroff, and Hartwig Adam. Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587, 2017.
[7] Liang-Chieh Chen, Yi Yang, Jiang Wang, Wei Xu, and Alan L Yuille. Attention to scale: Scale-aware semantic image segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3640–3649, 2016.
[8] Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), pages 801–818, 2018.
[9] Wuyang Chen, Xinyu Gong, Xianming Liu, Qian Zhang, Yuan Li, and Zhangyang Wang. Fasterseg: Searching for faster real-time semantic segmentation. arXiv preprint arXiv:1912.10917, 2019.
[10] Bowen Cheng, Maxwell D Collins, Yukun Zhu, Ting Liu, Thomas S Huang, Hartwig Adam, and Liang-Chieh Chen. Panoptic-deeplab: A simple, strong, and fast baseline for bottom-up panoptic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12475–12485, 2020.
[11] François Chollet. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1251–1258, 2017.
[12] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The Cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[13] Stefan Elfwing, Eiji Uchibe, and Kenji Doya. Sigmoid-weighted linear units for neural network function approximation in reinforcement learning. Neural Networks, 107:3–11, 2018.
[14] Jun Fu, Jing Liu, Yuhang Wang, Yong Li, Yongjun Bao, Jinhui Tang, and Hanqing Lu. Adaptive context network for scene parsing. In Proceedings of the IEEE International Conference on Computer Vision, pages 6748–6757, 2019.
[15] Golnaz Ghiasi, Tsung-Yi Lin, and Quoc V Le. Nas-fpn: Learning scalable feature pyramid architecture for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7036–7045, 2019.
[16] Suyog Gupta and Mingxing Tan. Efficientnet-edgetpu: Creating accelerator-optimized neural networks with AutoML, 2019.
[17] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[18] Yuanduo Hong, Huihui Pan, Weichao Sun, Yisong Jia, et al. Deep dual-resolution networks for real-time and accurate semantic segmentation of road scenes. arXiv preprint arXiv:2101.06085, 2021.
[19] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7132–7141, 2018.
[20] Ping Hu, Federico Perazzi, Fabian Caba Heilbron, Oliver Wang, Zhe Lin, Kate Saenko, and Stan Sclaroff. Real-time semantic segmentation with fast attention. arXiv preprint arXiv:2007.03815, 2020.
[21] Xin Li, Yiming Zhou, Zheng Pan, and Jiashi Feng. Partial order pruning: for best speed/accuracy trade-off in neural architecture search. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9145–9153, 2019.
[22] Zeming Li, Chao Peng, Gang Yu, Xiangyu Zhang, Yangdong Deng, and Jian Sun. Detnet: Design backbone for object detection. In Proceedings of the European Conference on Computer Vision (ECCV), pages 334–350, 2018.
[23] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2117–2125, 2017.
[24] Chenxi Liu, Liang-Chieh Chen, Florian Schroff, Hartwig Adam, Wei Hua, Alan L Yuille, and Li Fei-Fei. Auto-deeplab: Hierarchical neural architecture search for semantic image segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 82–92, 2019.
[25] Shu Liu, Lu Qi, Haifang Qin, Jianping Shi, and Jiaya Jia. Path aggregation network for instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8759–8768, 2018.
[26] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. arXiv preprint arXiv:2103.14030, 2021.
[27] Gerhard Neuhold, Tobias Ollmann, Samuel Rota Bulo, and Peter Kontschieder. The Mapillary Vistas dataset for semantic understanding of street scenes. In Proceedings of the IEEE International Conference on Computer Vision, pages 4990–4999, 2017.
[28] Alejandro Newell, Kaiyu Yang, and Jia Deng. Stacked hourglass networks for human pose estimation. In European Conference on Computer Vision, pages 483–499. Springer, 2016.
[29] Hyeonwoo Noh, Seunghoon Hong, and Bohyung Han. Learning deconvolution network for semantic segmentation. In Proceedings of the IEEE International Conference on Computer Vision, pages 1520–1528, 2015.
[30] Marin Orsic, Ivan Kreso, Petra Bevandic, and Sinisa Segvic. In defense of pre-trained ImageNet architectures for real-time semantic segmentation of road-driving images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12607–12616, 2019.
[31] Tobias Pohlen, Alexander Hermans, Markus Mathias, and Bastian Leibe. Full-resolution residual networks for semantic segmentation in street scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4151–4160, 2017.
[32] Prajit Ramachandran, Barret Zoph, and Quoc V Le. Searching for activation functions. arXiv preprint arXiv:1710.05941, 2017.
[33] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 234–241. Springer, 2015.
[34] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4510–4520, 2018.
[35] Ke Sun, Yang Zhao, Borui Jiang, Tianheng Cheng, Bin Xiao, Dong Liu, Yadong Mu, Xinggang Wang, Wenyu Liu, and Jingdong Wang. High-resolution representations for labeling pixels and regions. arXiv preprint arXiv:1904.04514, 2019.
[36] Mingxing Tan and Quoc V Le. Efficientnet: Rethinking model scaling for convolutional neural networks. arXiv preprint arXiv:1905.11946, 2019.
[37] Mingxing Tan and Quoc V Le. Efficientnetv2: Smaller models and faster training. ICML, 2021.
[38] Mingxing Tan, Ruoming Pang, and Quoc V Le. Efficientdet: Scalable and efficient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10781–10790, 2020.
[39] Jingdong Wang, Ke Sun, Tianheng Cheng, Borui Jiang, Chaorui Deng, Yang Zhao, Dong Liu, Yadong Mu, Mingkui Tan, Xinggang Wang, et al. Deep high-resolution representation learning for visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.
[40] Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M Alvarez, and Ping Luo. Segformer: Simple and efficient design for semantic segmentation with transformers. arXiv preprint arXiv:2105.15203, 2021.
[41] Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1492–1500, 2017.
[42] Jing Yang, Qingshan Liu, and Kaihua Zhang. Stacked hourglass network for robust facial landmark localisation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 79–87, 2017.
[43] Changqian Yu, Changxin Gao, Jingbo Wang, Gang Yu, Chunhua Shen, and Nong Sang. Bisenet v2: Bilateral network with guided aggregation for real-time semantic segmentation. arXiv preprint arXiv:2004.02147, 2020.
[44] Fisher Yu, Vladlen Koltun, and Thomas Funkhouser. Dilated residual networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 472–480, 2017.
[45] Yuhui Yuan, Xilin Chen, and Jingdong Wang. Object-contextual representations for semantic segmentation. arXiv preprint arXiv:1909.11065, 2019.
[46] Yuhui Yuan and Jingdong Wang. Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916, 2018.
[47] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.
[48] Hang Zhang, Chongruo Wu, Zhongyue Zhang, Yi Zhu, Zhi Zhang, Haibin Lin, Yue Sun, Tong He, Jonas Mueller, R Manmatha, et al. Resnest: Split-attention networks. arXiv preprint arXiv:2004.08955, 2020.
[49] Xiong Zhang, Hongmin Xu, Hong Mo, Jianchao Tan, Cheng Yang, and Wenqi Ren. Dcnas: Densely connected neural architecture search for semantic image segmentation. arXiv preprint arXiv:2003.11883, 2020.
[50] Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and Jiaya Jia. Pyramid scene parsing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2881–2890, 2017.
[51] Sixiao Zheng, Jiachen Lu, Hengshuang Zhao, Xiatian Zhu, Zekun Luo, Yabiao Wang, Yanwei Fu, Jianfeng Feng, Tao Xiang, Philip HS Torr, et al. Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In CVPR, 2021.
[52] Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Scene parsing through ADE20K dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[53] Yi Zhu, Karan Sapra, Fitsum A Reda, Kevin J Shih, Shawn Newsam, Andrew Tao, and Bryan Catanzaro. Improving semantic segmentation via video propagation and label relaxation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8856–8865, 2019.
[54] Barret Zoph, Golnaz Ghiasi, Tsung-Yi Lin, Yin Cui, Hanxiao Liu, Ekin D Cubuk, and Quoc V Le. Rethinking pre-training and self-training. arXiv preprint arXiv:2006.06882, 2020.
[55] Barret Zoph and Quoc V Le. Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578, 2016.