Revisiting Multi-Scale Feature Fusion for Semantic Segmentation

Tianjian Meng1   Golnaz Ghiasi1   Reza Mahjorian2   Quoc V. Le1   Mingxing Tan1
1 Google Research   2 Waymo Inc.
{mengtianjian, tanmingxing}@google.com

Abstract

It is commonly believed that high internal resolution combined with expensive operations (e.g. atrous convolutions) is necessary for accurate semantic segmentation, resulting in slow speed and large memory usage. In this paper, we question this belief and demonstrate that neither high internal resolution nor atrous convolutions are necessary. Our intuition is that although segmentation is a dense per-pixel prediction task, the semantics of each pixel often depend on both nearby neighbors and far-away context; therefore, a more powerful multi-scale feature fusion network plays a critical role. Following this intuition, we revisit the conventional multi-scale feature space (typically capped at P5¹) and extend it to a much richer space, up to P9, where the smallest features are only 1/512 of the input size and thus have very large receptive fields. To process such a rich feature space, we leverage the recent BiFPN to fuse the multi-scale features. Based on these insights, we develop a simplified segmentation model, named ESeg, which has neither high internal resolution nor expensive atrous convolutions. Perhaps surprisingly, our simple method achieves better accuracy with faster speed than prior art across multiple datasets. In real-time settings, ESeg-Lite-S achieves 76.0% mIoU on CityScapes [12] at 189 FPS, outperforming FasterSeg [9] (73.1% mIoU at 170 FPS). Our ESeg-Lite-L runs at 79 FPS and achieves 80.1% mIoU, largely closing the gap between real-time and high-performance segmentation models.

Figure 1. Model sizes vs. CityScapes validation mIoU. All models in the figure use the single-scale evaluation protocol. +data denotes using extra data for pretraining and self-training. FLOPs are calculated at 1024×2048 input resolution. Our proposed ESeg models are much simpler, yet still outperform previous models with better quality and less computation cost.

1. Introduction

Semantic segmentation is an important computer vision task that has been widely used in robotics and autonomous driving. The main challenge in semantic segmentation lies in per-pixel dense prediction, where we need to predict the semantic class of each pixel. State-of-the-art (SOTA) segmentation models [6,8,35,44,50] heavily rely on high internal resolution for dense prediction, and it is commonly believed that such high internal resolution is necessary to learn accurate per-pixel semantics, leading to slow runtime speed and large runtime memory usage. For example, DeepLabV3+ [8] requires more than 1000B FLOPs (vs. common real-time models using <100B FLOPs), making it prohibitively expensive to apply to real-time scenarios. Meanwhile, the high-resolution feature maps make it difficult to obtain large receptive fields. Many attempts have been made to address this issue, for example by replacing regular convolutions with atrous convolutions [6,8,44,50] or adding additional attention modules [14,45,46] to enlarge receptive fields. Although these methods improve accuracy, atrous convolutions and attention modules are usually much slower than regular convolutions.

¹ Pi denotes features with resolution P0/2^i, where P0 is the input image size.
Recent real-time semantic segmentation models tend to learn high-resolution representations in different ways to satisfy the latency constraints, either by hand-crafting better networks [20,43] or by using neural architecture search [9,21]. These methods tend to improve accuracy, but they still need to maintain the high internal resolution.

In this paper, we question the belief in high internal resolution and demonstrate that neither high internal resolution nor atrous convolutions are necessary for accurate segmentation. Our intuition is that although segmentation is a dense per-pixel prediction task, the semantics of each pixel often depend on both nearby neighbors and far-away context; therefore, the conventional multi-scale feature space can easily become a bottleneck, and a more powerful multi-resolution feature fusion network is desired. Following this intuition, we revisit the conventional simple feature fusion space (typically up to P5 from traditional classification backbones) and extend it to a much richer space, up to P9, where the smallest internal resolution is only 1/512 of the input image size and thus can have a very large receptive field.

Based on this richer multi-resolution feature space, we present a simplified segmentation model named ESeg, which has neither high internal resolution nor expensive atrous convolutions. To keep the model as simple as possible, we adopt a plain encoder-decoder network structure, where the encoder is an off-the-shelf EfficientNet [36] and the decoder is a slightly-modified BiFPN [38]. We observe that, since the multi-resolution feature space up to P9 carries much richer information than conventional approaches, a powerful bi-directional feature pyramid is extremely important here to effectively fuse these features. Unlike the conventional top-down FPN, BiFPN allows both top-down and bottom-up feature fusion, enabling more effective feature fusion in the rich multi-scale feature space.

We evaluate our ESeg on CityScapes [12] and ADE20K [52], two widely used benchmarks for semantic segmentation. Surprisingly, although ESeg is much simpler, it consistently matches the performance of prior art across different datasets. By scaling up the network size, our ESeg models achieve better accuracy while using 2x-9x fewer parameters and 4x-50x fewer FLOPs than competitive methods. In real-time settings, our ESeg-Lite-S achieves 76.0% mIoU on the CityScapes validation set at 189 FPS, outperforming the prior art FasterSeg by 2.9% mIoU at faster speed. With 80.1% CityScapes mIoU at 79 FPS, our ESeg-Lite-L bridges the gap between real-time and advanced segmentation models for the first time.

In addition to the standard CityScapes/ADE20K datasets, our models also perform well on large-scale datasets. By pretraining on the larger Mapillary Vistas [27] and self-training on CityScapes coarse-labeled data, our ESeg-L achieves 84.8% mIoU on the CityScapes validation set, while being much smaller and faster than prior art. These results highlight the importance of multi-scale features for semantic segmentation, demonstrating that a simple network is also able to achieve state-of-the-art performance.

2. Revisiting Multi-Scale Feature Fusion

Semantic segmentation models need to predict the semantic class of individual pixels. Due to this dense prediction requirement, it is common to maintain high-resolution features and perform pixel-wise classification on them. However, high-resolution features bring two critical challenges: first, they often require expensive computation and large memory, at O(n^2) complexity with respect to the resolution; second, it is difficult to obtain large receptive fields with regular convolutions, so capturing long-distance correlations would require very deep networks, expensive atrous operations, or spatial attention mechanisms, making the models even slower and more difficult to train.

To mitigate the issues of high-resolution features, many previous works employ multi-scale feature fusion. The core idea is to leverage multiple features with different resolutions to capture both short- and long-distance patterns. Specifically, a common practice is to adopt an ImageNet-pretrained backbone network (such as ResNet [17]) to extract feature scales up to P5, and then apply an FPN [23] to fuse these features in a top-down manner, as illustrated in Figure 2(a). However, recent works [5,35] show that such naive feature fusion is inferior to high-resolution features: for example, DeepLab tends to keep high-resolution features combined with atrous convolutions as shown in Figure 2(b), but atrous convolutions are usually hardware-unfriendly and slow. Similarly, HRNet [35] maintains multiple parallel branches of high-resolution features as shown in Figure 2(c). Both approaches lead to high computational cost and memory usage.
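To make the feature-level notation concrete, the short sketch below (our illustration, not code from the paper) prints the stride and spatial size of each level P2-P9 for a CityScapes-sized 1024×2048 input; the coarsest level P9 is downsampled by 512x and therefore covers almost the entire image.

```python
# Illustration of the P2-P9 feature hierarchy used throughout the paper.
# Each level P_i has stride 2**i, i.e. resolution P0 / 2**i of the input P0.

def feature_levels(height: int, width: int, min_level: int = 2, max_level: int = 9):
    """Return {level: (stride, feature_height, feature_width)}."""
    levels = {}
    for i in range(min_level, max_level + 1):
        stride = 2 ** i
        # Ceiling division mirrors how stride-2 convolutions/poolings round up.
        levels[i] = (stride, -(-height // stride), -(-width // stride))
    return levels

if __name__ == "__main__":
    for level, (stride, h, w) in feature_levels(1024, 2048).items():
        print(f"P{level}: stride 1/{stride:<3d} -> {h} x {w}")
    # P2: stride 1/4   -> 256 x 512  (the resolution the prediction head works at)
    # P5: stride 1/32  -> 32 x 64    (where ImageNet backbones usually stop)
    # P9: stride 1/512 -> 2 x 4      (tiny maps with near-global receptive fields)
```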
Figure 2. Multi-scale feature space. Pi denotes a feature map with resolution P0/2^i, where P0 is the input image size. (a) Conventional feature networks mostly use features up to P5 from an ImageNet backbone, but the accuracy is relatively low; (b) DeepLab [5,8] maintains the high resolution and uses atrous convolutions to capture larger receptive fields, with the overhead of slow runtime; (c) HRNet [35] maintains a separate path for each resolution, resulting in large memory usage for the high-resolution features; (d) we simply add more low-resolution features P6-P9 to capture high-level semantics. The additional P6-P9 features improve accuracy (see Table 1) with very minimal overhead, thanks to the large receptive fields of the low-resolution feature maps.

Feature Levels | mIoU | #Params | FLOPs
P2-P5 | 78.3 | 6.4M | 34.3B
P2-P7 | 79.5 (+1.2) | 6.6M | 34.5B
P2-P9 | 80.1 (+1.8) | 6.9M | 34.5B

Table 1. Comparison of different feature levels on the CityScapes dataset. The Pi feature level has resolution P0 * 2^-i, where P0 is the input image size. Using more high-level features enlarges receptive fields and improves accuracy with negligible overhead.

Here we revisit the multi-scale feature map and identify that insufficient feature levels are the key limitation of the vanilla P2-P5 multi-scale feature space. P2-P5 is originally designed for ImageNet backbones, where the image size is as small as 224×224, but segmentation images have much higher resolution: in CityScapes, each image has a resolution of 1024×2048. Such large input images require much larger receptive fields, raising the need for higher-level feature maps. Table 1 compares the performance of different feature levels. When we simply increase the feature levels to P2-P9 (adding four extra levels P6-P9), the performance significantly improves by 1.8 mIoU with negligible overhead on parameters and FLOPs.
Notably, for the high-level features (P6-P9) that are already down-sampled from the input image, there is no need to use hardware-unfriendly atrous convolutions to increase the receptive fields anymore. This leads to our first observation:

Insight 1: Larger feature spaces up to P9 are much more effective than conventional spaces capped at P5.

Feature Space | Feature Network | mIoU | FLOPs
P2-P5 | FPN | 78.0 | 39.8B
P2-P5 | BiFPN | 78.3 (+0.3) | 34.3B
P2-P9 | FPN | 79.1 | 40.0B
P2-P9 | BiFPN | 80.1 (+1.0) | 34.5B

Table 2. Comparison of different feature fusion networks on the CityScapes dataset. BiFPN and FPN have similar performance for the small feature space P2-P5, but the difference is much more significant for the large feature space P2-P9. We adjust the filter sizes of FPN and BiFPN to ensure they have comparable computational cost.

Given such a large feature space as P2-P9, a natural question is how to fuse these features effectively. Many previous works simply apply a top-down FPN [23] to fuse the features. Recently, PANet [25] proposes to add an extra bottom-up fusion path, and EfficientDet [38] proposes to fuse features in a bidirectional fashion for object detection. Here we study the performance of different feature fusion approaches.
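To make the baseline in this comparison concrete, the sketch below shows a single top-down, FPN-style fusion pass over P2-P9 and an extra bottom-up pass of the kind PANet and BiFPN add. This is our own minimal PyTorch illustration (sum-based fusion with a shared channel width), not the exact FPN or BiFPN configuration used in the paper.

```python
import torch
import torch.nn.functional as F

def top_down_pass(feats):
    """feats: {level: tensor[N, C, H, W]}, all with the same channel width C.
    Propagates context from the coarsest level (P9) down to the finest (P2)."""
    levels = sorted(feats)                        # [2, 3, ..., 9]
    out = {levels[-1]: feats[levels[-1]]}         # start from the coarsest level
    for lvl in reversed(levels[:-1]):
        up = F.interpolate(out[lvl + 1], size=feats[lvl].shape[-2:], mode="nearest")
        out[lvl] = feats[lvl] + up                # simple sum-based fusion
    return out

def bottom_up_pass(feats):
    """Extra coarse-ward pass: fine levels refine the coarse ones."""
    levels = sorted(feats)
    out = {levels[0]: feats[levels[0]]}
    for lvl in levels[1:]:
        down = F.max_pool2d(out[lvl - 1], kernel_size=2, ceil_mode=True)
        out[lvl] = feats[lvl] + down
    return out

# One top-down pass is a plain FPN; chaining top-down and bottom-up passes
# (and repeating them) is the bidirectional idea behind PANet/BiFPN.
feats = {i: torch.randn(1, 64, 256 >> (i - 2), 512 >> (i - 2)) for i in range(2, 10)}
fused = bottom_up_pass(top_down_pass(feats))
```

Repeating the pair of passes, adding learnable fusion weights, and inserting convolutions between passes brings this sketch closer to the BiFPN-style decoder adopted in ESeg.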
As shown in Table 2, when the feature space is larger (P2-P9), the more powerful BiFPN performs significantly better than a simple FPN. Intuitively, a larger feature space needs more layers to propagate the information and more bidirectional flows to fuse it effectively. Based on these studies, we have the second observation:

Insight 2: A powerful feature fusion network is critical for a large multi-scale feature space.

3. ESeg

Based on our two key insights, we now develop a simple encoder-decoder-based segmentation model, aiming to achieve better performance without relying on atrous convolutions or dedicated high-resolution features.

3.1. Network Design

Figure 3 shows an overview of our ESeg network. It has a standard encoder-decoder structure, where the encoder extracts multiple levels of features from the raw images and the decoder performs extensive multi-scale feature fusion. After the decoder, we upsample and weighted-sum all features, and predict the dense per-pixel semantic class. We describe each component and highlight our design choices in the following.

Figure 3. ESeg network architecture. The backbone [36] extracts {P2-P5} feature maps from the raw input images; four additional feature maps {P6-P9} are added on top of these backbone features with simple average pooling. The decoder performs bidirectional multi-scale feature fusion [38] to strengthen the internal representation of each feature map. All feature maps are upsampled and combined with a weighted sum to generate the final per-pixel prediction.

Encoder: We use EfficientNet [36] as the backbone to extract the first five levels of features {P1, P2, P3, P4, P5}, where Pi denotes the i-th level of features with spatial size P0/2^i. Please note that our design is general and can be applied to any potentially better convolutional classification backbone in the future; here we use EfficientNet as our default backbone to illustrate the idea. Based on our Insight 1, we add four extra features P6-P9 on top of P5 by downsampling the features with 3×3 stride-2 pooling layers. For simplicity, we use the same hidden size as P5 for all other P6-P9 features. Notably, we only use regular convolutional and pooling layers in the encoder, without any use of hardware-unfriendly atrous convolutions.

Decoder: The decoder is responsible for combining the multi-scale features. Based on our Insight 2, we employ the powerful BiFPN [38] to repeatedly apply top-down and bottom-up bidirectional feature fusion. The BiFPN outputs P2-P9 feature maps of the same shapes as its inputs, but each of them is already fused with information from the other feature scales.

Prediction: To generate the final prediction, we upsample all decoder features to a fixed resolution and compute a weighted sum to combine them into a unified high-resolution feature map. Specifically, given the P2-P9 feature maps, we introduce eight additional learnable weights w2-w9, where wi is a scalar variable for input feature Pi. These features are then combined using a softmax-based weighted-sum fusion:

O = \sum_i \frac{e^{w_i}}{\sum_j e^{w_j}} \cdot \mathrm{Upsample}(P_i)

where wi is a non-negative variable initialized to 1.0 and updated during back-propagation, and Upsample is a simple bilinear interpolation operation to match the resolution. Intuitively, wi represents the importance of feature map Pi: if Pi is more important to the output, then wi becomes larger during training. After training is done, we can remove the softmax function and perform a normal weighted sum for faster inference.

A prediction head is applied to the final output feature O to produce the final pixel-level class predictions. Notably, since the raw input image size is usually very large (e.g., 1024×2048), it is expensive to directly perform prediction at the resolution of the input images. Instead, we apply our prediction head at the minimum feature level P2 (one fourth of the original size), and then bilinearly upsample the prediction to the original input image resolution to calculate the per-pixel loss.
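The prediction stage can be summarized in a short PyTorch sketch, shown below. It is our paraphrase of the description above rather than the released implementation: the class name, channel width, and the 1x1 classifier are illustrative placeholders, while the softmax-normalized per-level weights, the P2-level prediction, and the final bilinear upsampling mirror the text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightedFusionHead(nn.Module):
    """Softmax-weighted multi-scale fusion followed by a per-pixel classifier.

    Sketch only: `num_classes`, the channel width, and the 1x1 classifier are
    illustrative choices, not the paper's exact configuration."""

    def __init__(self, channels: int = 64, num_classes: int = 19, num_levels: int = 8):
        super().__init__()
        # One learnable scalar per level P2..P9, initialized to 1.0.
        self.level_weights = nn.Parameter(torch.ones(num_levels))
        self.classifier = nn.Conv2d(channels, num_classes, kernel_size=1)

    def forward(self, feats, image_size):
        # feats: list of 8 tensors [N, C, H_i, W_i] for P2..P9 (already fused by the decoder).
        target = feats[0].shape[-2:]                      # P2 resolution = input / 4
        ups = [F.interpolate(f, size=target, mode="bilinear", align_corners=False)
               for f in feats]
        w = torch.softmax(self.level_weights, dim=0)      # normalize importance weights
        fused = sum(wi * fi for wi, fi in zip(w, ups))    # O = sum_i softmax(w)_i * Up(P_i)
        logits = self.classifier(fused)                   # predict at P2, then upsample
        return F.interpolate(logits, size=image_size, mode="bilinear", align_corners=False)

def extra_levels(p5: torch.Tensor, num_extra: int = 4):
    """Create P6..P9 from P5 with stride-2 pooling, keeping the channel width of P5."""
    levels, x = [], p5
    for _ in range(num_extra):
        x = F.avg_pool2d(x, kernel_size=3, stride=2, padding=1)
        levels.append(x)
    return levels
```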
3.2. Network and Data Scaling

Our goal is to design a family of segmentation models with different accuracy and latency trade-offs; however, hand-crafting a new model for each latency constraint is difficult and expensive. Here we take a different approach, similar to previous compound scaling [36,38].

For the encoder backbone, we use the same scaling factors as the original EfficientNet so that we can reuse its ImageNet checkpoints. For the BiFPN decoder, we scale it up linearly by increasing the depth (number of repeats) and width (number of input/output channels). Table 3 shows the width and depth scaling configurations, where the smallest model ESeg-Lite-S has 17B FLOPs and the largest model ESeg-L has 342B FLOPs. By scaling up ESeg, we show that it consistently outperforms prior works across different network size constraints.

Since segmentation datasets are usually small, another scaling dimension is data size. In our experiments, we will show that by pretraining and self-training on extra data, we are able to further improve model accuracy. We explicitly note if a model is trained with extra data.

Model | Encoder width | Encoder depth | Decoder #channels | Decoder #repeats
ESeg-Lite-S | 0.4 | 0.6 | 64 | 1
ESeg-Lite-M | 0.6 | 1.0 | 80 | 2
ESeg-Lite-L | 1.0 | 1.0 | 96 | 3
ESeg-S | 1.0 | 1.1 | 96 | 4
ESeg-M | 1.4 | 1.8 | 192 | 5
ESeg-L | 2.0 | 3.1 | 288 | 6

Table 3. Network size scaling configurations.

3.3. GPU Inference Optimization

As many segmentation models run on GPUs with TensorRT, we further optimize our real-time ESeg-Lite models for this setting. First, we replace all MBConv blocks [34,36] with Fused-MBConv [16], as depthwise convolutions are not well supported in TensorRT and cannot fully utilize GPU parallelism. Second, we also remove GPU-unfriendly operations such as squeeze-and-excitation [19], and replace the SiLU (Swish-1) [13,32] activation with regular ReLU. Lastly, we further increase the base feature level from P2 to P3 to avoid the expensive computations on large spatial dimensions, and move the prediction head to P3 accordingly.
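To illustrate the MBConv-to-Fused-MBConv swap mentioned above, the sketch below shows generic versions of the two blocks (without squeeze-and-excitation and with ReLU, matching the other substitutions). The channel and expansion settings are hypothetical and residual connections are omitted, so this is a schematic rather than the actual ESeg-Lite blocks.

```python
import torch.nn as nn

def mbconv(cin: int, cout: int, expand: int = 4, stride: int = 1) -> nn.Sequential:
    """Generic MBConv: 1x1 expand -> 3x3 depthwise -> 1x1 project (skip path omitted)."""
    mid = cin * expand
    return nn.Sequential(
        nn.Conv2d(cin, mid, 1, bias=False), nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
        nn.Conv2d(mid, mid, 3, stride=stride, padding=1, groups=mid, bias=False),  # depthwise
        nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
        nn.Conv2d(mid, cout, 1, bias=False), nn.BatchNorm2d(cout),
    )

def fused_mbconv(cin: int, cout: int, expand: int = 4, stride: int = 1) -> nn.Sequential:
    """Fused-MBConv: the expand + depthwise pair is merged into one regular 3x3 conv,
    which maps much better onto TensorRT/GPU kernels."""
    mid = cin * expand
    return nn.Sequential(
        nn.Conv2d(cin, mid, 3, stride=stride, padding=1, bias=False),
        nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
        nn.Conv2d(mid, cout, 1, bias=False), nn.BatchNorm2d(cout),
    )
```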
4. Experiments

We evaluate ESeg on the popular CityScapes [12] and ADE20K [52] datasets, and compare the results with previous state-of-the-art segmentation models.

4.1. Setup

Our models are trained on 8 TPU cores with a batch size of 16. We use SGD as our optimizer with momentum 0.9. Following previous work [35], we apply random horizontal flipping and random scale jittering of [0.5, 2.0] during training. We use cosine learning rate decay, which yields similar performance to the previously used polynomial learning rate decay but with fewer tunable hyperparameters. For fair comparison with [3,10,14,35,43,45], we use the same online hard example mining (OHEM) strategy during training. We also find that applying exponential moving average decay [36,38] helps stabilize training without influencing the final performance.

We report our main results in mIoU (mean Intersection-over-Union) and pixAcc (pixel accuracy) under the single-model, single-scale setting with no test-time augmentation. We note that multi-scale inference may further boost these metrics, but it is also much slower in practice.

CityScapes: The CityScapes dataset [12] contains 5,000 high-resolution images with fine annotations, focusing on urban street scene understanding. The fine-annotated images are divided into 2,975/500/1,525 images for training, validation and testing respectively. 19 out of 30 total classes are used for evaluation. Following the training protocol from [35,50], we randomly crop the images from 1024×2048 to 512×1024 for training. We adopt an initial learning rate of 0.08 and a weight decay of 0.00005. Our models are trained for 90k steps (484 epochs), which is the same as popular methods [35].

ADE20K: The ADE20K dataset [52] is a scene parsing benchmark with 20,210 images for training and 2,000 images for validation. Models are evaluated using the mean of pixel-wise accuracy and mIoU over 150 semantic categories. Following previous work [45,48], we use the same image size of 520×520 for both training and inference. We train our models for 300 epochs with an initial learning rate of 0.08 and a weight decay of 0.00005.

4.2. Large-Scale Segmentation Results

We scale up our ESeg to larger network sizes using the scaling configurations in Table 3. Table 4 and Table 5 compare their performance with other state-of-the-art segmentation models on CityScapes and ADE20K. In general, our ESeg consistently outperforms previous models, often by a large margin. For example, ESeg-S achieves higher accuracy than the recent Auto-DeepLab-S [24] while using 1.5x fewer parameters and 9.7x fewer FLOPs. On CityScapes, ESeg-L achieves 82.6% mIoU with single-scale inference, outperforming the prior art HRNetV2-W48 by 1.5% mIoU with much fewer FLOPs. These results suggest that ESeg is scalable and can achieve remarkable performance across various resource constraints. On ADE20K, ESeg-L also outperforms previous multi-scale state-of-the-art CNN models by 1.3%, while recent transformer-based models achieve better performance with much more computation.

CityScapes test set performance: Previous works usually retrain or finetune their models on the train+validation set of CityScapes to boost their test set performance. However, such methods introduce extra training cost and more tunable hyperparameters, which is not scalable and makes comparisons harder. Meanwhile, some leaderboard submissions introduce extra tricks such as multi-scale inference and extra post-processing. We believe the best practice is to directly evaluate the trained checkpoints on the CityScapes test set, under the simplest single-scale inference protocol. Our ESeg-L shows great generalization ability by achieving a remarkable 84.1% mIoU on the CityScapes leaderboard without bells and whistles, which is on par with top leaderboard results.

Model | val mIoU (w/o extra data) | val mIoU (w/ extra data) | Params | Ratio | FLOPs | Ratio
ESeg-S | 80.1 | 81.7 | 6.9M | 1x | 34.5B | 1x
Auto-DeepLab-S [24] | 79.7 | - | 10.2M | 1.5x | 333B | 9.7x
PSPNet (ResNet-101) [50] | 79.7 | - | 65.9M | 9.6x | 2018B | 59x
OCR (ResNet-101) [45] | 79.6 | - | - | - | - | -
DeepLabV3+ (Xception-71) [8] | 79.6 | - | 43.5M | 6.3x | 1445B | 42x
DeepLabV3+ (ResNeXt-50) [53] | 79.5 | 81.4 | - | - | - | -
DeepLabV3 (ResNet-101) [6] | 78.5 | - | 58.0M | 8.4x | 1779B | 52x
ESeg-M | 81.6 | 83.7 | 20.0M | 1x | 112B | 1x
HRNetV2-W48 [35] | 81.1 | - | 65.9M | 3.3x | 747B | 6.7x
OCR (HRNet-W48) [45] | 81.1 | - | - | - | - | -
ACNet (ResNet-101) [14] | 80.9 | - | - | - | - | -
Naive-Student [3] | 80.7 | 83.4 | 147.3M | 7.3x | 3246B | 29x
Panoptic-DeepLab (X-71) [10] | 80.5 | 82.5 | 46.7M | 2.3x | 548B | 4.9x
DeepLabV3 (ResNeSt-101) [48] | 80.4† | - | - | - | - | -
Auto-DeepLab-L [24] | 80.3 | - | 44.4M | 2.2x | 695B | 6.2x
HRNetV2-W40 [35] | 80.2 | - | 45.2M | 2.3x | 493B | 4.1x
Auto-DeepLab-M [24] | 80.0 | - | 21.6M | 1.1x | 461B | 4.1x
DeepLabV3 (ResNeSt-50) [48] | 79.9† | - | - | - | - | -
OCNet (ResNet-101) [46] | 79.6 | - | - | - | - | -
ESeg-L | 82.6 | 84.8 | 70.5M | 1x | 343B | 1x
SegFormer-B5 [40] | 82.4 | - | 84.7M | 1.2x | 1460B | 4.3x

Table 4. Performance comparison on CityScapes. † denotes results using the multi-scale evaluation protocol. All our models are evaluated with the single-scale evaluation protocol.
Model | mIoU | PixAcc
ESeg-M | 46.0 | 81.3
OCR (ResNet-101) [45] | 44.3/45.3† | -
HRNetV2-W48 [35] | 43.1/44.2† | -
Auto-DeepLab-M [24] | 42.2† | 81.1†
PSPNet (ResNet-101) [50] | 42.0† | 80.6†
Auto-DeepLab-S [24] | 40.7† | 80.6†
ESeg-L | 48.2 | 81.8
DeepLabV3 (ResNeSt-101) [48] | 46.9† | 82.1†
ACNet (ResNet-101) [14] | 45.9† | 82.0†
OCR (HRNet-W48) [45] | 44.5/45.5† | -
OCNet (ResNet-101) [46] | 45.5† | -
DeepLabV3 (ResNeSt-50) [48] | 45.1† | 81.2†
Auto-DeepLab-L [24] | 44.0† | 81.7†
SETR [51] | 46.3 | -
Swin-S [26] | 49.3† | -
SegFormer-B4 [40] | 50.3 | -

Table 5. Performance comparison on ADE20K. † denotes results using the multi-scale evaluation protocol. All our models are evaluated with the single-scale evaluation protocol. Recent transformer-based models are listed separately at the bottom.

4.3. Real-Time Segmentation Results

We evaluate our ESeg under real-time inference settings. Since our real-time models are extremely small, we use a larger crop size for training, following common practice [18]. Table 6 and Figure 4 show the performance comparison between our ESeg-Lite and other real-time segmentation models. We measure our inference speed on a Tesla V100-SXM2 16GB GPU with TensorRT-7.2.1.6, CUDA 11.0 and CUDNN 8.0. For fair comparison with the strongest prior art FasterSeg [9], we use its official open-sourced code and rerun the inference benchmark under our environment. We use a batch size of 1 and an input resolution of 1024×2048. As shown in Table 6, our ESeg-Lite models outperform the SOTA by a large margin. In particular, our ESeg-Lite-S achieves the highest speed of 189 FPS with 2.9% better mIoU than FasterSeg. Moreover, our ESeg-Lite-L achieves 80.1% mIoU on CityScapes, largely bridging the accuracy gap between real-time models and previous full-size models (see Table 4 for comparison).

Model | mIoU | Input Size | FPS
ESeg-Lite-M | 74.0 | 512×1024 | 334†
ESeg-Lite-S | 76.0 | 1024×2048 | 189†
FasterSeg [9] | 73.1 | 1024×2048 | 170†/164
DF1-Seg [21] | 74.1 | 768×1536 | 106
BiSeNet-V2 [43] | 73.4 | 512×1024 | 156
DF1-Seg-d8 [21] | 72.4 | 768×1536 | 137
ESeg-Lite-M | 78.6 | 1024×2048 | 125†
DDRNet-23-Slim [18] | 77.8 | 1024×2048 | 109
DF1-Seg2 [21] | 76.9 | 768×1536 | 56
FANet-34 [20] | 76.3 | 1024×2048 | 58
DF1-Seg1 [21] | 75.9 | 768×1536 | 67
BiSeNetV2-L [43] | 75.8 | 512×1024 | 47
SwiftNetRN-18 [30] | 75.4 | 1024×2048 | 40
FANet-18 [20] | 75.0 | 1024×2048 | 72
ESeg-Lite-L | 80.1 | 1024×2048 | 79†
DDRNet-23 [18] | 79.5 | 1024×2048 | 39

Table 6. Performance comparison on CityScapes under real-time speed settings. † denotes our measurements on the same V100 GPU with the same inference code.

Figure 4. Inference speed vs. CityScapes validation mIoU. The real-time ESeg family of models outperforms previous models by a large margin with much faster speed.
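As a rough reference for how such batch-size-1 latency numbers can be obtained, the snippet below shows a simplified GPU timing loop in plain PyTorch; the paper itself benchmarks exported TensorRT engines, which this sketch does not reproduce, and the model is a placeholder.

```python
import time
import torch

@torch.no_grad()
def measure_fps(model, input_size=(1, 3, 1024, 2048), warmup=20, iters=100):
    """Rough single-image FPS measurement on a GPU (simplified; no TensorRT)."""
    device = torch.device("cuda")
    model = model.to(device).eval()
    x = torch.randn(*input_size, device=device)
    for _ in range(warmup):                 # warm up kernels / autotuning
        model(x)
    torch.cuda.synchronize()                # make sure queued work is finished
    start = time.perf_counter()
    for _ in range(iters):
        model(x)
    torch.cuda.synchronize()
    return iters / (time.perf_counter() - start)
```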
4.4. Pre- and Self-Training with Extra Data

Given that segmentation datasets often have limited training sets, we study two types of data size scaling: pretraining with the large-scale Mapillary Vistas dataset and self-training with CityScapes unlabelled data. Due to the dataset policy, we do not conduct data size scaling experiments on the ADE20K dataset [52].

Pretraining. We follow the common setting in [10,45] to pretrain our models on the larger urban scene understanding dataset Mapillary Vistas [27]. We resize the images to 2048 pixels on the longer side to handle the image size variations, and randomly crop the images to 1024×1024 during training. The models are pretrained for 500 epochs on the Mapillary Vistas dataset. We take a slightly different training setting for finetuning the models from Mapillary Vistas pretrained checkpoints: specifically, we train the models for only 200 epochs with an initial learning rate of 0.01.

Self-training. We perform self-training on the coarse-annotated data of CityScapes. Note that we treat this set as unlabeled data and do not utilize its coarse-annotated labels. To generate the pseudo labels from a teacher, we employ a multi-scale inference setting with scales [0.5, 1.0, 2.0] and horizontal flip. We only keep predictions with a confidence higher than 0.5, and otherwise mark the pixel as an ignore area. We use a similar self-training setting as in [54]: we use a batch size of 128 and an initial learning rate of 0.16 to train the model for 1000 epochs. Half of the images in a batch are sampled from the ground-truth fine-annotated labels, and the other half are sampled from the generated pseudo-labels. Our data scaling method consists of both pretraining and self-training.

Table 7 shows the pretraining and self-training results. Not surprisingly, both pretraining and self-training improve accuracy. In general, self-training provides slightly higher gains than pretraining for most models, and the accuracy gains on larger models tend to be more pronounced, suggesting that larger models have more network capacity to benefit from the extra data. Perhaps surprisingly, we observe that pretraining and self-training are complementary, and combining them can further improve the accuracy by a large margin. For example, ESeg-M gains 0.9% accuracy with pretraining and 1.0% accuracy with self-training, but combining them provides a 2.1% accuracy gain, significantly better than either self-training or pretraining alone. Notably, by combining pretraining and self-training, our largest ESeg-L achieves the best 84.8% mIoU on the CityScapes validation set.

Model size | S | M | L
Baseline | 80.1 | 81.6 | 82.6
Pre-training | 81.1 (+1.0) | 82.5 (+0.9) | 83.8 (+1.2)
Self-training | 81.4 (+1.3) | 82.6 (+1.0) | 83.5 (+0.9)
Combined | 81.7 (+1.6) | 83.7 (+2.1) | 84.8 (+2.2)

Table 7. Performance results with pre-training and self-training on various ESeg models. Observations: (1) large models like ESeg-L benefit more from extra data; (2) self-training is more effective than pretraining; (3) pretraining and self-training are complementary, and can be combined to obtain the best accuracy gains.
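The pseudo-label generation step described above can be sketched as follows. It assumes a hypothetical `teacher` model returning per-class logits and mirrors the multi-scale plus horizontal-flip ensembling and the 0.5 confidence threshold; every other detail is an illustrative choice rather than the paper's exact pipeline.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def make_pseudo_labels(teacher, image, scales=(0.5, 1.0, 2.0),
                       threshold=0.5, ignore_index=255):
    """image: [N, 3, H, W]; returns [N, H, W] hard labels with low-confidence
    pixels set to `ignore_index` (the 'ignore area' used during self-training)."""
    n, _, h, w = image.shape
    probs = 0.0
    for s in scales:
        x = F.interpolate(image, scale_factor=s, mode="bilinear", align_corners=False)
        for flip in (False, True):
            inp = torch.flip(x, dims=[3]) if flip else x
            logits = teacher(inp)                                  # [N, C, h', w']
            if flip:
                logits = torch.flip(logits, dims=[3])
            logits = F.interpolate(logits, size=(h, w), mode="bilinear",
                                   align_corners=False)
            probs = probs + logits.softmax(dim=1)
    probs = probs / (2 * len(scales))                              # average the ensemble
    conf, label = probs.max(dim=1)                                 # per-pixel confidence
    label[conf < threshold] = ignore_index                         # drop uncertain pixels
    return label
```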
5. Ablation Study

In this section, we ablate our encoder and decoder design choices on the CityScapes dataset.

A powerful backbone is crucial for an encoder-decoder architecture, as it needs to encode the high-resolution images into low-resolution features. Table 8 shows the performance of different backbones. Compared to the widely-used ResNet-50 [17], EfficientNet-B1 [36] achieves 1.2% better mIoU with much fewer FLOPs, suggesting the critical role of good backbone networks in encoder-decoder architectures.

The decoder also plays an important role, as it needs to recover high-resolution features. Table 8 compares the previously widely-used DeepLabV3+ [8] with the BiFPN adopted in ESeg. Combined with the higher feature levels, BiFPN achieves better performance than the combination of ASPP and the DeepLabV3+ decoder in terms of both accuracy and efficiency. Our ablation study also demonstrates the importance of employing high internal resolution in DeepLab-like models, which is very expensive. In contrast, ESeg provides a more elegant and efficient alternative.

Encoder | Decoder | mIoU | FLOPs
EfficientNet-B1 | BiFPN (w/o atrous) | 80.1 | 34.5B
EfficientNet-B1 | DeepLabV3+ (w/ atrous) | 79.4 | 91.8B
EfficientNet-B1 | DeepLabV3+ (w/o atrous) | 78.8 | 49.9B
ResNet-50 | BiFPN (w/o atrous) | 78.9 | 188.0B
ResNet-50 | DeepLabV3+ (w/ atrous) | 77.8 | 324.3B
ResNet-50 | DeepLabV3+ (w/o atrous) | 77.4 | 230.3B

Table 8. Encoder and decoder choices. All models are trained with exactly the same training settings. BiFPN outperforms DeepLabV3+ [8] regardless of whether atrous convolutions are used.

6. Related Work

Efficient Network Architecture: Many previous works aim to improve segmentation model efficiency via better hand-crafted networks [20,43] or network pruning [21]. Recently, neural architecture search (NAS) [36,37,55] has become a popular tool to improve model accuracy and efficiency. For semantic segmentation, DPC [2] searches for a multi-scale dense prediction cell after a handcrafted backbone; Auto-DeepLab [24] jointly searches for both the cell-level and network-level architecture for segmentation. Recent works [9,49] also use a similar hierarchical search space, targeting real-time speed or better accuracy.

Encoder-decoder structure: Early works including U-Net [33] and SegNet [1] adopt a symmetric encoder-decoder structure to effectively extract spatial context. Such structures first extract the high-level semantic information from input images using an encoder network, and then add extra top-down and lateral connections to fuse the features in the decoder part. Most feature encoders used in semantic segmentation come from architectures developed for image classification [11,17,41,47,48], while some of them are designed to keep high-resolution features for dense prediction tasks [22,35,39]. Different decoder modules have also been proposed to recover high-resolution representations [1,23,28,29,31,33,42]. Our work is largely based on the encoder-decoder structure for its simplicity.

Multi-scale features: Multi-scale features have been widely used in dense prediction tasks such as object detection and segmentation. While high-resolution features preserve more spatial information with smaller receptive fields, lower-resolution features can have larger receptive fields to capture more contextual information. A key challenge here is how to effectively combine these multi-scale features. Many previous works in object detection use feature pyramid networks (FPN) [23] and its variants [15,25] to perform multi-scale feature fusion. For semantic segmentation, ASPP [5] is widely used to obtain multi-scale features [5,7,8,50]. Other studies also try to exploit multi-scale features with attention [14,45]. In our work, we leverage a more powerful bidirectional feature network [38] to extensively fuse multi-scale features.

Atrous convolutions: Atrous convolutions are widely used in semantic segmentation architectures [4-6,8,44,50] to enlarge the receptive field. PSPNet [50] uses a pyramid pooling module on top of a backbone with dilation. DeepLab [5] proposes atrous spatial pyramid pooling to extract multi-scale features for dense prediction. DeepLabV3+ [8] further improves it by using an additional decoder module to refine the representation. Unfortunately, atrous convolutions are not hardware friendly and usually run slowly in real-world scenarios. As opposed to the common practice, our proposed model does not use any atrous convolutions.

7. Conclusion

In this paper, we present a family of simple ESeg models, which use neither high internal resolution nor atrous convolutions. Instead, they add a much richer multi-scale feature space and use a more powerful bi-directional feature network to capture the local and global semantic information. We show that the simple encoder-decoder-based ESeg can outperform the prior art on various datasets. In particular, our small ESeg-Lite models significantly bridge the gap between real-time and server-size models, and our ESeg achieves state-of-the-art performance on both CityScapes and ADE20K with much fewer parameters and FLOPs. With extra pretraining and self-training, our ESeg-L further pushes the CityScapes accuracy to 84.8% while being single-model and single-scale. Our study shows that, despite the difficulty of dense prediction, it is possible to achieve both high accuracy and fast speed. We hope our work can spark more explorations on how to design segmentation models with better performance and efficiency.
Limitations: We intend to keep our networks simple in this paper, but it is possible to further optimize the network, e.g., by searching for new backbone or feature networks. We leave this for future work.

References

[1] Vijay Badrinarayanan, Alex Kendall, and Roberto Cipolla. SegNet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(12):2481-2495, 2017.
[2] Liang-Chieh Chen, Maxwell Collins, Yukun Zhu, George Papandreou, Barret Zoph, Florian Schroff, Hartwig Adam, and Jon Shlens. Searching for efficient multi-scale architectures for dense image prediction. In NeurIPS, 2018.
[3] Liang-Chieh Chen, Raphael Gontijo Lopes, Bowen Cheng, Maxwell D. Collins, Ekin D. Cubuk, Barret Zoph, Hartwig Adam, and Jonathon Shlens. Semi-supervised learning in video sequences for urban scene segmentation. arXiv preprint arXiv:2005.10266, 2020.
[4] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L. Yuille. Semantic image segmentation with deep convolutional nets and fully connected CRFs. arXiv preprint arXiv:1412.7062, 2014.
[5] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L. Yuille. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(4):834-848, 2017.
[6] Liang-Chieh Chen, George Papandreou, Florian Schroff, and Hartwig Adam. Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587, 2017.
[7] Liang-Chieh Chen, Yi Yang, Jiang Wang, Wei Xu, and Alan L. Yuille. Attention to scale: Scale-aware semantic image segmentation. In CVPR, 2016.
[8] Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In ECCV, 2018.
[9] Wuyang Chen, Xinyu Gong, Xianming Liu, Qian Zhang, Yuan Li, and Zhangyang Wang. FasterSeg: Searching for faster real-time semantic segmentation. arXiv preprint arXiv:1912.10917, 2019.
[10] Bowen Cheng, Maxwell D. Collins, Yukun Zhu, Ting Liu, Thomas S. Huang, Hartwig Adam, and Liang-Chieh Chen. Panoptic-DeepLab: A simple, strong, and fast baseline for bottom-up panoptic segmentation. In CVPR, 2020.
[11] François Chollet. Xception: Deep learning with depthwise separable convolutions. In CVPR, 2017.
[12] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The Cityscapes dataset for semantic urban scene understanding. In CVPR, 2016.
[13] Stefan Elfwing, Eiji Uchibe, and Kenji Doya. Sigmoid-weighted linear units for neural network function approximation in reinforcement learning. Neural Networks, 107:3-11, 2018.
[14] Jun Fu, Jing Liu, Yuhang Wang, Yong Li, Yongjun Bao, Jinhui Tang, and Hanqing Lu. Adaptive context network for scene parsing. In ICCV, 2019.
[15] Golnaz Ghiasi, Tsung-Yi Lin, and Quoc V. Le. NAS-FPN: Learning scalable feature pyramid architecture for object detection. In CVPR, 2019.
[16] Suyog Gupta and Mingxing Tan. EfficientNet-EdgeTPU: Creating accelerator-optimized neural networks with AutoML, 2019.
[17] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
[18] Yuanduo Hong, Huihui Pan, Weichao Sun, Yisong Jia, et al. Deep dual-resolution networks for real-time and accurate semantic segmentation of road scenes. arXiv preprint arXiv:2101.06085, 2021.
[19] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In CVPR, 2018.
[20] Ping Hu, Federico Perazzi, Fabian Caba Heilbron, Oliver Wang, Zhe Lin, Kate Saenko, and Stan Sclaroff. Real-time semantic segmentation with fast attention. arXiv preprint arXiv:2007.03815, 2020.
arXiv preprint agesegmentation.InProceedingsoftheIEEEconferenceon arXiv:2007.03815,2020. 1,7,8 computervisionandpatternrecognition,pages3640–3649, [21] XinLi, YimingZhou, ZhengPan, andJiashiFeng. Partial 2016. 8 orderpruning:forbestspeed/accuracytrade-offinneuralar- [8] Liang-ChiehChen,YukunZhu,GeorgePapandreou,Florian chitecturesearch.InProceedingsoftheIEEEConferenceon Schroff, and Hartwig Adam. Encoder-decoder with atrous computervisionandpatternrecognition,pages9145–9153, separableconvolutionforsemanticimagesegmentation. In 2019. 1,7,8 ProceedingsoftheEuropeanconferenceoncomputervision [22] ZemingLi,ChaoPeng,GangYu,XiangyuZhang,Yangdong (ECCV),pages801–818,2018. 1,3,6,7,8 Deng, and Jian Sun. Detnet: Design backbone for object [9] Wuyang Chen, Xinyu Gong, Xianming Liu, Qian Zhang, detection. In Proceedings of the European conference on Yuan Li, and Zhangyang Wang. Fasterseg: Searching computervision(ECCV),pages334–350,2018. 8 for faster real-time semantic segmentation. arXiv preprint arXiv:1912.10917,2019. 1,6,7,8 [23] Tsung-Yi Lin, Piotr Dolla´r, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyra- [10] Bowen Cheng, Maxwell D Collins, Yukun Zhu, Ting Liu, mid networks for object detection. In Proceedings of the Thomas S Huang, Hartwig Adam, and Liang-Chieh Chen. IEEE conference on computer vision and pattern recogni- Panoptic-deeplab: A simple, strong, and fast baseline for tion,pages2117–2125,2017. 2,3,8 bottom-up panoptic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern [24] Chenxi Liu, Liang-Chieh Chen, Florian Schroff, Hartwig Recognition,pages12475–12485,2020. 5,6 Adam, Wei Hua, Alan L Yuille, and Li Fei-Fei. Auto- [11] Franc¸oisChollet. Xception: Deeplearningwithdepthwise deeplab:Hierarchicalneuralarchitecturesearchforsemantic separable convolutions. In Proceedings of the IEEE con- imagesegmentation. InProceedingsoftheIEEEconference ference on computer vision and pattern recognition, pages on computer vision and pattern recognition, pages 82–92, 1251–1258,2017. 8 2019. 5,6,8 [12] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo [25] Shu Liu, Lu Qi, Haifang Qin, Jianping Shi, and Jiaya Jia. Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Pathaggregationnetworkforinstancesegmentation. InPro- Franke, Stefan Roth, and Bernt Schiele. The cityscapes ceedingsoftheIEEEconferenceoncomputervisionandpat- dataset for semantic urban scene understanding. In Proc. ternrecognition,pages8759–8768,2018. 3,8 of the IEEE Conference on Computer Vision and Pattern [26] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Recognition(CVPR),2016. 1,2,5 ZhengZhang, StephenLin, andBainingGuo. Swintrans- [13] Stefan Elfwing, Eiji Uchibe, and Kenji Doya. Sigmoid- former: Hierarchical vision transformer using shifted win- weightedlinearunitsforneuralnetworkfunctionapproxima- dows. arXivpreprintarXiv:2103.14030,2021. 6 tioninreinforcementlearning. NeuralNetworks,107:3–11, [27] GerhardNeuhold,TobiasOllmann,SamuelRotaBulo,and 2018. 5 PeterKontschieder.Themapillaryvistasdatasetforsemantic 9understandingofstreetscenes. InProceedingsoftheIEEE [41] SainingXie,RossGirshick,PiotrDolla´r,ZhuowenTu,and InternationalConferenceonComputerVision,pages4990– KaimingHe. Aggregatedresidualtransformationsfordeep 4999,2017. 2,6 neuralnetworks. InProceedingsoftheIEEEconferenceon [28] AlejandroNewell,KaiyuYang,andJiaDeng.Stackedhour- computervisionandpatternrecognition,pages1492–1500, glassnetworksforhumanposeestimation.InEuropeancon- 2017. 
8 ferenceoncomputervision,pages483–499.Springer,2016. [42] JingYang,QingshanLiu,andKaihuaZhang. Stackedhour- 8 glassnetworkforrobustfaciallandmarklocalisation.InPro- [29] Hyeonwoo Noh, Seunghoon Hong, and Bohyung Han. ceedings of the IEEE Conference on Computer Vision and Learningdeconvolutionnetworkforsemanticsegmentation. PatternRecognitionWorkshops,pages79–87,2017. 8 InProceedingsoftheIEEEinternationalconferenceoncom- [43] Changqian Yu, Changxin Gao, Jingbo Wang, Gang Yu, putervision,pages1520–1528,2015. 8 ChunhuaShen, andNongSang. Bisenetv2: Bilateralnet- work with guided aggregation for real-time semantic seg- [30] MarinOrsic,IvanKreso,PetraBevandic,andSinisaSegvic. mentation. arXiv preprint arXiv:2004.02147, 2020. 1, 5, Indefenseofpre-trainedimagenetarchitecturesforreal-time 7,8 semanticsegmentationofroad-drivingimages. InProceed- [44] FisherYu,VladlenKoltun,andThomasFunkhouser.Dilated ingsoftheIEEE/CVFConferenceonComputerVisionand residual networks. In Proceedings of the IEEE conference PatternRecognition,pages12607–12616,2019. 7 oncomputervisionandpatternrecognition,pages472–480, [31] Tobias Pohlen, Alexander Hermans, Markus Mathias, and 2017. 1,8 BastianLeibe. Full-resolutionresidualnetworksforseman- [45] Yuhui Yuan, Xilin Chen, and Jingdong Wang. Object- ticsegmentationinstreetscenes.InProceedingsoftheIEEE contextualrepresentationsforsemanticsegmentation. arXiv Conference on Computer Vision and Pattern Recognition, preprintarXiv:1909.11065,2019. 1,5,6,8 pages4151–4160,2017. 8 [46] YuhuiYuanandJingdongWang. Ocnet:Objectcontextnet- [32] Prajit Ramachandran, Barret Zoph, and Quoc V Le. work for scene parsing. arXiv preprint arXiv:1809.00916, Searching for activation functions. arXiv preprint 2018. 1,6 arXiv:1710.05941,2017. 5 [47] SergeyZagoruykoandNikosKomodakis.Wideresidualnet- [33] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U- works. arXivpreprintarXiv:1605.07146,2016. 8 net: Convolutionalnetworksforbiomedicalimagesegmen- [48] HangZhang,ChongruoWu,ZhongyueZhang,YiZhu,Zhi tation. InInternationalConferenceonMedicalimagecom- Zhang, Haibin Lin, Yue Sun, Tong He, Jonas Mueller, R puting and computer-assisted intervention, pages 234–241. Manmatha, etal. Resnest: Split-attentionnetworks. arXiv Springer,2015. 8 preprintarXiv:2004.08955,2020. 5,6,8 [34] MarkSandler,AndrewHoward,MenglongZhu,AndreyZh- [49] XiongZhang,HongminXu,HongMo,JianchaoTan,Cheng moginov, and Liang-Chieh Chen. Mobilenetv2: Inverted Yang, and Wenqi Ren. Dcnas: Densely connected neural residuals and linear bottlenecks. In Proceedings of the architecturesearchforsemanticimagesegmentation. arXiv IEEE conference on computer vision and pattern recogni- preprintarXiv:2003.11883,2020. 8 tion,pages4510–4520,2018. 5 [50] Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang [35] KeSun,YangZhao,BoruiJiang,TianhengCheng,BinXiao, Wang, and Jiaya Jia. Pyramid scene parsing network. In Dong Liu, Yadong Mu, Xinggang Wang, Wenyu Liu, and ProceedingsoftheIEEEconferenceoncomputervisionand JingdongWang.High-resolutionrepresentationsforlabeling patternrecognition,pages2881–2890,2017. 1,5,6,8 pixelsandregions. arXivpreprintarXiv:1904.04514,2019. [51] SixiaoZheng, JiachenLu, HengshuangZhao, XiatianZhu, 1,2,3,5,6,8 Zekun Luo, Yabiao Wang, Yanwei Fu, Jianfeng Feng, Tao [36] Mingxing Tan and Quoc V Le. Efficientnet: Rethinking Xiang,PhilipHSTorr,etal. Rethinkingsemanticsegmen- model scaling for convolutional neural networks. arXiv tation from a sequence-to-sequence perspective with trans- preprintarXiv:1905.11946,2019. 
[49] Xiong Zhang, Hongmin Xu, Hong Mo, Jianchao Tan, Cheng Yang, and Wenqi Ren. DCNAS: Densely connected neural architecture search for semantic image segmentation. arXiv preprint arXiv:2003.11883, 2020.
[50] Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and Jiaya Jia. Pyramid scene parsing network. In CVPR, 2017.
[51] Sixiao Zheng, Jiachen Lu, Hengshuang Zhao, Xiatian Zhu, Zekun Luo, Yabiao Wang, Yanwei Fu, Jianfeng Feng, Tao Xiang, Philip H. S. Torr, et al. Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In CVPR, 2021.
[52] Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Scene parsing through ADE20K dataset. In CVPR, 2017.
[53] Yi Zhu, Karan Sapra, Fitsum A. Reda, Kevin J. Shih, Shawn Newsam, Andrew Tao, and Bryan Catanzaro. Improving semantic segmentation via video propagation and label relaxation. In CVPR, 2019.
[54] Barret Zoph, Golnaz Ghiasi, Tsung-Yi Lin, Yin Cui, Hanxiao Liu, Ekin D. Cubuk, and Quoc V. Le. Rethinking pre-training and self-training. arXiv preprint arXiv:2006.06882, 2020.
[55] Barret Zoph and Quoc V. Le. Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578, 2016.