Superpixel Transformers for Efficient Semantic Segmentation

Alex Zihao Zhu¹*, Jieru Mei²*, Siyuan Qiao³, Hang Yan¹, Yukun Zhu³, Liang-Chieh Chen⁴, Henrik Kretzschmar⁵

¹Waymo LLC   ²Johns Hopkins University (work done as an intern at Waymo)   ³Google Research   ⁴ByteDance Research (work done while at Google Research)   ⁵Work done while at Waymo   *Equal contributions

Abstract—Semantic segmentation, which aims to classify every pixel in an image, is a key task in machine perception, with many applications across robotics and autonomous driving. Due to the high dimensionality of this task, most existing approaches use local operations, such as convolutions, to generate per-pixel features. However, these methods are typically unable to effectively leverage global context information because of the high computational cost of operating on a dense image. In this work, we address this issue by leveraging the idea of superpixels, an over-segmentation of the image, and applying it within a modern transformer framework. In particular, our model learns to decompose the pixel space into a spatially low-dimensional superpixel space via a series of local cross-attentions. We then apply multi-head self-attention to the superpixels to enrich the superpixel features with global context and directly produce a class prediction for each superpixel. Finally, we project the superpixel class predictions back into the pixel space using the associations between the superpixels and the image pixel features. Reasoning in the superpixel space allows our method to be substantially more computationally efficient than methods with convolution-based decoders. Yet, our method achieves state-of-the-art performance in semantic segmentation thanks to the rich superpixel features generated by the global self-attention mechanism. Our experiments on Cityscapes and ADE20K demonstrate that our method matches the state of the art in accuracy while using fewer model parameters and running at lower latency.

Fig. 1. Our method enables efficient segmentation of high-resolution camera images by learning to decompose the images into a set of superpixels. Specifically, we over-segment the image pixels into a small set of soft superpixels (left) via a series of local cross-attentions. The superpixels are then refined via a set of multi-head self-attentions and directly classified. Finally, we fuse the class predictions with the superpixel-pixel associations to produce a dense semantic segmentation (right).

I. INTRODUCTION

The problem of semantic segmentation, or classifying every pixel in the image, is increasingly common in many robotics applications. A dense, fine-grained understanding of the world is necessary for navigation in cluttered environments, particularly for applications such as autonomous driving, where scene understanding is deeply safety-critical. On the other hand, many robotics systems are combinations of highly complex and specialized subsystems, and latency is an ever-present concern for real-time operation.

The balance between safety, performance, and latency is therefore critical for modern robotic systems. While the state of the art in semantic segmentation achieves strong performance in terms of metrics such as mean Intersection over Union (mIoU), many methods still rely on dense decoders that produce predictions for every pixel in the scene. As a result, these methods tend to be relatively expensive, and arguably perform a large amount of redundant computation for nearby pixels that are often very similar.

To address this issue, we aim to bring the classical ideas surrounding superpixels into modern deep learning. The premise of superpixels is to decompose and over-segment the image into a series of irregular patches. By grouping similar pixels into superpixels and then operating on the superpixel level, one can significantly reduce the computational cost of dense prediction tasks such as semantic segmentation. Classical superpixel algorithms such as SLIC [1], however, rely on hard associations between each image pixel and superpixel. This makes it hard to embed the superpixel representation into neural network architectures [29], as the association is not differentiable for back-propagation. Recent works, such as superpixel sampling networks (SSN) [24], resolve this issue by turning the hard association into a soft one.
While this is a step towards incorporating superpixels into neural networks, the segmentation quality of these methods still lags behind models that adopt a per-pixel or per-mask representation.

In this work, we propose a novel architecture that revives the differentiable superpixel generation pipeline in a modern transformer framework [40]. In place of the iterative clustering algorithms used in SLIC [1] and SSN [24], we learn the superpixel representation through a series of local cross-attentions between a set of learned superpixel queries and the pixel features. The outputs of the cross-attention modules act as the superpixel features and are used directly for semantic segmentation prediction. As a result, the proposed transformer decoder effectively converts object queries into superpixel features, enabling the model to learn the superpixel representation end-to-end.

Fig. 2. Our proposed Superpixel Transformer architecture. Given an image, we first generate hypercolumn features with an off-the-shelf encoder backbone. Our superpixel tokenization module uses a series of local dual-path cross-attentions to generate features for each superpixel. The superpixel features are then enriched by several multi-head self-attention (MHSA) layers to produce a class prediction for each superpixel, while the associations between each superpixel and pixel feature are computed from their respective features. Finally, the superpixel class predictions are unfolded into the dense pixel space using the associations. Note that the figure illustrates a hard assignment between pixels and superpixels for simplicity, while in practice we apply a differentiable soft assignment.

Operating on the superpixel level provides a number of notable benefits. Conventional pixel-based approaches are limited by the high dimensionality of the pixel space, which makes global self-attention computationally intractable. Numerous approaches, such as axial attention [42] or window attention [31], have been developed to work around this issue by relaxing the global attention to a local one. By over-segmenting the image into a small set of superpixels, we are able to efficiently apply global self-attention on the superpixels, providing full global context to the superpixel features even when reasoning about high-resolution images.
Even though we apply global self-attention (as opposed to the local operations of conventional convolutional neural networks), our method is more efficient than existing methods due to the low dimensionality of the superpixel space. Finally, we directly produce semantic classes for the superpixel features, and then back-project the predicted classes onto the image space using the superpixel-pixel associations.

We perform extensive evaluations on the Cityscapes [14] and ADE20K [56] datasets, where our method matches state-of-the-art performance at significantly lower computational cost.

In summary, the main contributions of this work are as follows:
• The first work that revives the superpixel representation in the modern transformer framework, where the object queries are used to learn superpixel features.
• A novel network architecture that uses local cross-attention to significantly reduce the spatial dimensionality of pixel features to a small set of superpixel features, enabling learning of the global context between them and the direct classification of each superpixel.
• A superpixel association and unfolding scheme that projects each superpixel class prediction back to a dense pixel segmentation, discarding the CNN pixel decoder.
• Experiments on the Cityscapes and ADE20K datasets, where our method outperforms the state of the art at substantially lower computational cost.

II. RELATED WORK

Superpixels for Segmentation: Before the deep learning era, the superpixel representation, paired with graphical models, was the main paradigm for image segmentation. Superpixel methods [33], [37], [13], [1] are usually used as a pre-processing step to reduce the computation cost. A shallow classifier, e.g., an SVM [15], predicts the semantic label of each superpixel [18], which aggregates hand-crafted features. Graphical models, particularly conditional random fields [28], are then employed to refine the segmentation results [22], [27], [26].

ConvNets for Segmentation: Convolutional neural networks (ConvNets) [29] deployed in a fully convolutional manner [34] perform semantic segmentation by pixel-wise classification. Typical ConvNet-based approaches include the DeepLab series [4], [6], [8], PSPNet [54], UPerNet [46], and OCRNet [51]. Alternatively, some works [19], [24] employ superpixels to aggregate features extracted by ConvNets and show promising results.

Transformers for Segmentation: Transformers [40] and their vision variants [17] have been adopted as backbone encoders for image segmentation [7], [55], [3]. Transformer encoders can be instantiated by augmenting ConvNets with self-attention modules [43], [42]. When used as stand-alone backbones [36], [17], [31], [47], they also demonstrate strong performance compared to previous ConvNet baselines. Transformers are also used as decoders [2] for image segmentation. A popular design is to generate mask embedding vectors from object queries and then multiply them with the pixel features to generate masks [39], [44]. For example, MaX-DeepLab [41] proposes an end-to-end mask transformer framework that directly predicts class-labeled object masks. Segmenter [38] and MaskFormer [12] tackle semantic segmentation from the view of mask classification. K-Net [52] generates segmentation masks with a group of learnable kernels. Inspired by the similarity between mask transformers and clustering algorithms [33], clustering-based mask transformers have been proposed to segment images [49], [48], [50]. The deformable transformer [57] is also used to improve image segmentation, as in Panoptic SegFormer [30] and Mask2Former [11].

Similar to this work, Region Proxy (RegProxy) [53] also incorporates the idea of superpixels into a deep segmentation network by using a CNN decoder to learn the association between each pixel and superpixel. However, RegProxy uses features on the regular pixel grid to represent each superpixel and applies self-attention to those features. In comparison, we apply a set of learned weights, which correspond to each superpixel, and use cross-attention with the pixel features to compute the pixel-superpixel associations. Our experiments demonstrate that our methodology provides significant performance improvements.
Fig. 3. Visualization of the superpixel-pixel association. Each superpixel is assigned to a gray grid cell in the image. For each pixel (small dots, size exaggerated) inside a given superpixel, we compute its cross-attention with its neighboring 3×3 superpixels, highlighted with the same color. The essence of our method is that these neighborhoods overlap in a sliding-window fashion.

In summary, all of the prior works that apply transformers to segmentation have, in some way, relied on a dense CNN decoder to generate the final dense features, and then combined these features with an attention mechanism to improve performance. Our method, in comparison, uses cross-attention to reduce the image to a small set of superpixels, and applies self-attention only in this superpixel space in the decoder. This allows our method to operate on a significantly lower-dimensional space (often 32²× smaller than the image resolution), while retaining the benefits that come with global self-attention to achieve state-of-the-art performance.

III. METHOD

Our proposed Superpixel Transformer architecture, summarized in Figure 2, consists of four main components:
1) Pixel Feature Extraction: a convolutional encoder backbone that generates hypercolumn features.
2) Superpixel Tokenization: a series of local dual-path cross-attentions between a set of learned queries and the pixel features, which generates a set of superpixel features.
3) Superpixel Classification: a series of multi-head self-attention layers that refine the superpixel features and produce a semantic class for each superpixel.
4) Superpixel Association: associating the predicted superpixel classes with the pixel features to obtain the final dense semantic segmentation.
We detail each component in the following subsections.
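To make the dimensionality argument concrete, the short Python sketch below (ours, not part of the paper's implementation) counts tokens and the pairwise interactions a global self-attention would have to compute for dense stride-8 pixel features versus the superpixel grid, using the Cityscapes configuration reported in Section IV.

```python
# Back-of-the-envelope comparison of global self-attention cost on dense
# pixel features vs. on superpixels (Cityscapes setup from Sec. IV:
# stride-8 hypercolumn features and a 32x64 superpixel grid).
image_h, image_w = 1024, 2048                  # input resolution
feat_h, feat_w = image_h // 8, image_w // 8    # stride-8 feature grid
sp_h, sp_w = 32, 64                            # superpixel grid

num_pixels = feat_h * feat_w                   # 32,768 pixel tokens
num_superpixels = sp_h * sp_w                  # 2,048 superpixel tokens
                                               # (32^2x fewer than image pixels)

# Self-attention scales quadratically with the number of tokens.
pixel_pairs = num_pixels ** 2                  # ~1.07e9 pairwise interactions
superpixel_pairs = num_superpixels ** 2        # ~4.2e6 pairwise interactions

print(f"pixel tokens:      {num_pixels}")
print(f"superpixel tokens: {num_superpixels}")
print(f"attention pairs:   {pixel_pairs:,} vs {superpixel_pairs:,} "
      f"({pixel_pairs / superpixel_pairs:.0f}x reduction)")
```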
A. Pixel Feature Extraction

Typical convolutional neural networks, such as ResNet [21] and ConvNeXt [32], are employed as the encoder backbone. On top of the encoder output, we apply a multi-layer perceptron (MLP) and bilinear resizing to the features after stage-1 (stride 2), stage-3 (stride 8), and stage-5 (stride 32). The multi-scale features are combined by addition to form hypercolumn features [20]. Each pixel feature is represented by its corresponding hypercolumn feature, which is fed to the following Superpixel Tokenization module.

B. Superpixel Tokenization

Before introducing our proposed Superpixel Tokenization module, we briefly review prior work on differentiable SLIC [1], [24], which we modernize within the transformer framework.

Preliminary: Differentiable SLIC: Simple Linear Iterative Clustering (SLIC) [1] adopts the classical iterative k-means algorithm [33] to generate superpixels by clustering pixels based on their features (e.g., color similarity and location proximity). Given a set of pixel features $I_p$ and initialized superpixel features $S_i^0$ at iteration 0, the algorithm alternates between two steps at iteration $t$:
1) (Hard) Assignment: compute the similarity $Q_{pi}^t$ between each pixel feature $I_p$ and superpixel feature $S_i^t$, and assign each pixel to a single superpixel based on its maximum similarity.
2) Update: update the superpixel features $S_i^t$ based on the pixel features assigned to them.

The Superpixel Sampling Networks (SSN) [24] make the whole process differentiable by replacing the hard assignment between each pixel and superpixel with a soft weight:

    $Q_{pi}^t = e^{-\| I_p - S_i^{t-1} \|^2}$,                      (1)

    $S_i^t = \frac{1}{Z_i^t} \sum_{p=1}^{n} Q_{pi}^t I_p$,          (2)

where $Z_i^t = \sum_p Q_{pi}^t$ is the normalization constant. In practice, at $t = 0$, the superpixel features are initialized as the mean feature within a set of rectangular patches that are evenly distributed over the image. To reduce the computational complexity and to impose a spatial locality constraint, the distance computation $\| I_p - S_i^{t-1} \|^2$ is restricted to a local 3×3 superpixel neighborhood around each pixel, although larger window sizes are possible.
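For reference, below is a minimal NumPy sketch of one soft SSN-style assignment/update iteration, Eqs. (1)-(2). It is illustrative only: for brevity it computes the soft assignment against all superpixels rather than the 3×3 local neighborhood used in practice, and the function and variable names are our own.

```python
import numpy as np

def ssn_soft_iteration(pixel_feats, sp_feats):
    """One differentiable SLIC / SSN iteration (Eqs. (1)-(2)).

    pixel_feats: (n, d) array of pixel features I_p.
    sp_feats:    (m, d) array of superpixel features S_i^{t-1}.
    Returns (new_sp_feats, soft_assign) with soft_assign[p, i] = Q_pi^t.
    Note: the paper restricts the distance computation to each pixel's
    3x3 superpixel neighborhood; this sketch uses all superpixels.
    """
    # Squared Euclidean distance between every pixel and superpixel.
    diff = pixel_feats[:, None, :] - sp_feats[None, :, :]      # (n, m, d)
    sq_dist = np.sum(diff ** 2, axis=-1)                        # (n, m)

    # Eq. (1): soft assignment Q_pi = exp(-||I_p - S_i||^2).
    soft_assign = np.exp(-sq_dist)                              # (n, m)

    # Eq. (2): superpixel update, normalized by Z_i = sum_p Q_pi.
    z = soft_assign.sum(axis=0, keepdims=True)                  # (1, m)
    new_sp_feats = (soft_assign / (z + 1e-8)).T @ pixel_feats   # (m, d)
    return new_sp_feats, soft_assign

# Toy usage: 6 pixels, 2 superpixels, 3-dim features.
rng = np.random.default_rng(0)
I_p = rng.normal(size=(6, 3))
S_i = rng.normal(size=(2, 3))
S_i, Q = ssn_soft_iteration(I_p, S_i)
```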
Superpixel Tokenization: We propose to unroll the SSN iterations and replace the k-means clustering steps with a set of local cross-attentions. We initialize the superpixel features, which are distributed on a regular grid in the image (Figure 3), with a set of randomly-initialized, learnable queries $S_i^0$, and perform the superpixel update step using cross-attention between the superpixel features and the pixel features by adapting the dual-path cross-attention [41]:

    $S_i^t = S_i^{t-1} + \sum_{p \in N(i)} \mathrm{softmax}_p ( q_{S_i^{t-1}} \cdot k_{I_p^{t-1}} ) \, v_{I_p^{t-1}}$,     (3)

    $I_p^t = I_p^{t-1} + \sum_{i \in N(p)} \mathrm{softmax}_i ( q_{I_p^{t-1}} \cdot k_{S_i^{t-1}} ) \, v_{S_i^{t-1}}$,     (4)

where $N(x)$ denotes the neighborhood of $x$, and $q$, $k$, and $v$ are the query, key, and value, generated by applying an MLP to each respective feature plus an additive learned position embedding. We share the same set of position embeddings for each superpixel neighborhood. For each superpixel $S_i$ there are $9 \cdot h \cdot w$ pixel neighbors, where $[h, w]$ is the size of the patch covered by one superpixel, while each pixel has 9 superpixel neighbors. We illustrate this neighborhood in Figure 3. The local dual-path cross-attention is repeated $n$ times to generate the output superpixel features $S_i^{t_n}$ and pixel features $I_p^{t_n}$.

This local dual-path cross-attention serves three purposes:
• It reduces complexity compared to a full cross-attention.
• It stabilizes training, as the final softmax is only over 9 superpixel features or $9 \cdot h \cdot w$ pixel features.
• It encourages spatial locality of the superpixels, forcing them to focus on a coherent, local over-segmentation.
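The following NumPy sketch illustrates a simplified, single-head version of the local dual-path cross-attention in Eqs. (3)-(4): each superpixel attends to the pixels in its 3×3 patch neighborhood, and each pixel attends to its 3×3 neighboring superpixels, both with residual updates. The shared projection matrices, the loop-based gathering, and the omission of the position embeddings and per-path MLPs are our simplifications, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def local_dual_path_cross_attention(pix, sp, w_q, w_k, w_v):
    """One simplified, single-head local dual-path cross-attention step
    (Eqs. (3)-(4)). Shapes and projections are illustrative only.

    pix: (H, W, d) pixel features on the stride-8 grid.
    sp:  (Gh, Gw, d) superpixel features; each superpixel covers an
         (H/Gh, W/Gw) patch of pixels.
    w_q, w_k, w_v: (d, d) shared projection matrices.
    """
    H, W, d = pix.shape
    Gh, Gw, _ = sp.shape
    h, w = H // Gh, W // Gw                      # patch size per superpixel

    new_sp = sp.copy()
    new_pix = pix.copy()
    for gi in range(Gh):
        for gj in range(Gw):
            # 3x3 superpixel neighborhood around grid cell (gi, gj).
            ni = [i for i in range(gi - 1, gi + 2) if 0 <= i < Gh]
            nj = [j for j in range(gj - 1, gj + 2) if 0 <= j < Gw]

            # Pixels covered by that neighborhood of patches (up to 9*h*w).
            pix_nb = pix[ni[0] * h:(ni[-1] + 1) * h,
                         nj[0] * w:(nj[-1] + 1) * w].reshape(-1, d)

            # Eq. (3): the superpixel attends to its pixel neighbors.
            q = sp[gi, gj] @ w_q                           # (d,)
            k = pix_nb @ w_k                               # (P, d)
            v = pix_nb @ w_v                               # (P, d)
            attn = softmax(k @ q)                          # (P,)
            new_sp[gi, gj] = sp[gi, gj] + attn @ v         # residual update

            # Eq. (4): every pixel inside cell (gi, gj) attends to its
            # (up to) 9 neighboring superpixels.
            sp_nb = sp[np.ix_(ni, nj)].reshape(-1, d)      # (<=9, d)
            cell = pix[gi * h:(gi + 1) * h, gj * w:(gj + 1) * w].reshape(-1, d)
            attn_p = softmax((cell @ w_q) @ (sp_nb @ w_k).T, axis=-1)
            upd = (cell + attn_p @ (sp_nb @ w_v)).reshape(h, w, d)
            new_pix[gi * h:(gi + 1) * h, gj * w:(gj + 1) * w] = upd
    return new_pix, new_sp

# Toy usage: 16x16 pixel grid, 4x4 superpixel grid, 8-dim features.
rng = np.random.default_rng(0)
d = 8
pix = rng.normal(size=(16, 16, d))
sp = rng.normal(size=(4, 4, d))
w_q, w_k, w_v = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
pix, sp = local_dual_path_cross_attention(pix, sp, w_q, w_k, w_v)
```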
C. Superpixel Classification

Given the updated superpixel features from the Superpixel Tokenization module, we directly predict a class for each superpixel using a series of self-attentions. In particular, we apply $k$ multi-head self-attention (MHSA) layers [40] to learn global context information between superpixels, producing outputs $F_i$. Performing MHSA on the superpixel features is significantly more efficient than on the pixel features, since the number of superpixels is much smaller. In our experiments, we typically use a superpixel resolution that is 32²× smaller than the input resolution. Finally, we apply a linear layer as a classifier, producing a semantic class prediction $C_i$ for each refined superpixel feature. As opposed to the CNN pixel decoders used in other approaches [10], [12], our superpixel class predictions $C_i$ can be directly projected back to the final pixel-level semantic segmentation output without any additional layers, as described in Section III-D.

D. Superpixel Association

To project the superpixel class predictions back into the pixel space, we use the outputs of the Superpixel Tokenization module, $I_p^{t_n}$ and $S_i^{t_n}$, to compute the association between each pixel and its 9 neighboring superpixels:

    $Q_{pi} = \mathrm{softmax}_{i \in N(p)} ( I_p^{t_n} \cdot S_i^{t_n} )$.      (5)

The final dense semantic segmentation $Y$ is then computed at each pixel $p$ as the combination of the predicted superpixel classes $C_i$ from the Superpixel Classification module, weighted by the above associations:

    $Y_p = \sum_{i \in N(p)} Q_{pi} \cdot C_i$.      (6)

During training, the dense semantic segmentation $Y$ is supervised by the semantic segmentation ground truth.
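A minimal sketch of the association and unfolding step, Eqs. (5)-(6), is shown below: each pixel is softly associated with its (up to) 9 neighboring superpixels, and the dense prediction is the association-weighted sum of the superpixel class predictions. Shapes, names, and the loop-based implementation are illustrative assumptions, not the released code.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def unfold_superpixel_predictions(pix, sp, sp_logits):
    """Superpixel association and unfolding (Eqs. (5)-(6)), illustrative only.

    pix:       (H, W, d)   pixel features I_p^{t_n} from the tokenization module.
    sp:        (Gh, Gw, d) superpixel features S_i^{t_n}.
    sp_logits: (Gh, Gw, c) per-superpixel class predictions C_i.
    Returns (H, W, c) dense class scores Y.
    """
    H, W, d = pix.shape
    Gh, Gw, c = sp_logits.shape
    h, w = H // Gh, W // Gw
    Y = np.zeros((H, W, c))

    for y in range(H):
        for x in range(W):
            gi, gj = y // h, x // w                 # grid cell of pixel (y, x)
            ni = [i for i in range(gi - 1, gi + 2) if 0 <= i < Gh]
            nj = [j for j in range(gj - 1, gj + 2) if 0 <= j < Gw]
            nb_feats = sp[np.ix_(ni, nj)].reshape(-1, d)        # (<=9, d)
            nb_logits = sp_logits[np.ix_(ni, nj)].reshape(-1, c)

            # Eq. (5): soft association between the pixel and its neighbors.
            q = softmax(nb_feats @ pix[y, x])                   # (<=9,)
            # Eq. (6): dense prediction as an association-weighted sum.
            Y[y, x] = q @ nb_logits
    return Y

# Toy usage: 16x16 pixels, 4x4 superpixels, 5 classes.
rng = np.random.default_rng(0)
pix = rng.normal(size=(16, 16, 8))
sp = rng.normal(size=(4, 4, 8))
sp_logits = rng.normal(size=(4, 4, 5))
dense = unfold_superpixel_predictions(pix, sp, sp_logits)
semantic_map = dense.argmax(-1)    # (16, 16) predicted class indices
```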
size to have significant impacts on the overall size and From these visualizations, we can see that, despite the performance, with a 70% improvement in FPS compared model being trained with a soft association, the superpixels to kMaX-DeepLab. In addition, we expect to see this effect generatedbythehardassignmenttightlyfollowtheboundaries even more for even smaller backbones. in the image. We note that this is particularly interesting as For comparison, we also provided results with a larger we do not provide any direct supervision to the superpixel ConvNeXt-L backbone. Here, our model has a similar associations, and instead these are learned implicitly by the absolute reduction in params and FLOPs, as compared network. In addition, we find that these boundaries tend to to the equivalent kMaX-DeepLab model. However, as the be more faithful to the edges of an object than the labels. model is largely dominated by the size of the backbone, 2) ADE20K Dataset: We provide quantitative results on the overall improvements are more modest. Nonetheless, our the ADE20K dataset in Table II. We choose one of the mostFig.4. QualitativeexamplesofourConvNeXt-LbackbonemodelontheCityScapes(top)andADE20k(bottom)datasets.Left:Inputimagewithsuperpixel boundariesoverlaid.Middle:Semanticprediction.Right:Semanticgroundtruth.Thesuperpixelsarevisualizedbygeneratingahardassignmentbytaking theargmaxofthesoftassignment.Despiteonlybeingtrainedwithasemanticsegmentationlossandwithsoftassignments,thehardassignmentsuperpixels faithfullyfollowboundariesintheimage.Thesuperpixeloverlaysarebestviewedzoomedin. commonly used crop sizes (640×640) and provide results for boundaries are drawn (see Figure 5 for examples). As our the ResNet-50 backbone. superpixel tokenization module operates before any semantic prediction, and each superpixel query only operates on a For ADE20K, we achieve the highest FPS and the second local neighborhood in the pixel space, the model must learn lowest # params (behind the surprisingly small RegProxy a consistent way to divide the image into a set of superpixels. model), while outperforming RegProxy. When the label boundaries are inconsistent, our model is less However, we do note that the gap in performance is larger able to effectively learn this over-segmentation, leading to a forADE20KthanCityscapes.Ourhypothesisisthatthelarge small decrease in performance. number of classes in ADE20K (152) results in ambiguities when a pixel could belong to multiple classes. This results 3) Superpixel Tokenization Ablation: In Table III, we in inconsistencies for object classes, and in particular where provide an ablation of the number of cross attention layersMethod Time(ms) Backbone:ResNet-50 28.0 Hypercolumn 7.1 SuperpixelTokenization 25.4 SuperpixelSelf-Attention 4.0 SuperpixelAssociation 1.0 Total 65.5 TABLEIV LATENCYINFORMATIONFOREACHSUB-COMPONENT.THEBULKOFTHE NON-BACKBONELATENCYISCONSUMEDBYTHELOCALCROSS ATTENTIONINTHESUPERPIXELTOKENIZATIONMODULE. Fig. 5. Inconsistent label boundaries in the ADE20K dataset make it difficult for our model to effectively learn superpixels. See the spaces in superpixels. betweenfencerailings,thebooksontheshelves,theunlabeledpeopleon thebleachersandthetrees/vegetation. 2) Latency: As seen in Tables I and II, our method provides a significant improvement in FPS. As the main #CA sp.res. 
| Method                | Backbone        | Params↓ | FLOPs↓  | FPS↑ | mIoU↑ |
| MaskFormer [12]       | ResNet-50 [21]  | -       | -       | -    | 78.5  |
| Mask2Former [11]      | ResNet-50 [21]  | -       | -       | -    | 79.4  |
| Panoptic-DeepLab [10] | ResNet-50 [21]  | 43M     | 517G    | -    | 78.7  |
| RegProxy* [53]        | ViT-S [17]      | 23M     | 270G    | -    | 79.8  |
| kMaX-DeepLab† [50]    | ResNet-50 [21]  | 56M     | 434G    | 9.0  | 79.7  |
| SP-Transformer        | ResNet-50 [21]  | 29M     | 253G    | 15.3 | 80.4  |
| Mask2Former‡ [11]     | Swin-L [31]     | -       | -       | -    | 83.3  |
| RegProxy* [53]        | ViT-L/16 [17]   | 307M    | -       | -    | 81.4  |
| SegFormer [47]        | MiT-B5 [47]     | 85M     | 1,448G  | 2.5  | 82.4  |
| kMaX-DeepLab† [50]    | ConvNeXt-L [32] | 232M    | 1,673G  | 3.1  | 83.5  |
| SP-Transformer        | ConvNeXt-L [32] | 202M    | 1,557G  | 3.6  | 83.1  |

TABLE I. Cityscapes val set results. We evaluate FLOPs and FPS with input 1024×2048 for our SP-Transformer on a Tesla V100-SXM2 GPU. SP-Transformer with ResNet-50 outperforms prior art in terms of parameters, latency, and performance. For the large models, SP-Transformer with a ConvNeXt-L backbone achieves similar reductions in parameters, while achieving the lowest latency and competitive mIoU. *RegProxy evaluates using a 768² sliding window. †kMaX-DeepLab is trained for panoptic segmentation.

| Method             | Backbone       | Crop | Params↓ | FLOPs↓ | FPS↑ | mIoU↑ |
| RegProxy [53]      | ViT-Ti/16 [17] | 512  | 6M      | 3.9G   | 38.9 | 42.1  |
| MaskFormer [12]    | ResNet-50 [21] | 512  | 41M     | 53G    | 24.5 | 44.5  |
| kMaX-DeepLab† [50] | ResNet-50 [21] | 641  | 57M     | 75G    | 38.7 | 45.0  |
| SP-Transformer     | ResNet-50 [21] | 640  | 29M     | 78G    | 40.8 | 43.7  |

TABLE II. ADE20K val set results. We evaluate FLOPs and FPS with input 640×640 for SP-Transformer on a Tesla V100-SXM2 GPU. Our method outperforms the prior RegProxy work, while remaining competitive with other prior works at the highest FPS. †kMaX-DeepLab is trained for panoptic segmentation.

Fig. 4. Qualitative examples of our ConvNeXt-L backbone model on the Cityscapes (top) and ADE20K (bottom) datasets. Left: input image with superpixel boundaries overlaid. Middle: semantic prediction. Right: semantic ground truth. The superpixels are visualized by generating a hard assignment via the argmax of the soft assignment. Despite only being trained with a semantic segmentation loss and soft assignments, the hard-assignment superpixels faithfully follow boundaries in the image. The superpixel overlays are best viewed zoomed in.

B. Results

1) Cityscapes Dataset: Table I compares the results of our Superpixel Transformer model to other transformer-based state-of-the-art models for semantic segmentation on the Cityscapes val set. In these experiments, we compare models with backbones roughly the same size as ResNet-50 and ConvNeXt-L. In addition to other semantic segmentation models, we also compare against the state-of-the-art panoptic segmentation method, kMaX-DeepLab [50]. While this method is trained on a slightly different task, we find that, in terms of computational cost, it is the fairest comparison, as our training schedule and pipeline are most similar to theirs.

With the smaller ResNet-50 backbone, our model is roughly half the size and latency of kMaX-DeepLab, while improving mIoU by 0.6 over the previous state of the art, RegProxy [53]. As the ResNet-50 backbone is relatively small, most existing models are dominated by the size of their decoders, which allows our model's reduction in decoder size to have a significant impact on the overall size and performance, with a 70% improvement in FPS compared to kMaX-DeepLab. We expect this effect to be even more pronounced for smaller backbones.

For comparison, we also provide results with a larger ConvNeXt-L backbone. Here, our model achieves a similar absolute reduction in parameters and FLOPs compared to the equivalent kMaX-DeepLab model. However, as the model is largely dominated by the size of the backbone, the overall improvements are more modest. Nonetheless, our model achieves near state-of-the-art performance, especially compared to the prior semantic segmentation methods, where we outperform most prior works except Mask2Former, to which we are within 0.2 mIoU. We note that, for this comparison, the equivalent Mask2Former [9] model is pre-trained on the significantly larger ImageNet-22k dataset, whereas our model is pre-trained on ImageNet-1k.

Figure 4 provides qualitative examples of our ConvNeXt-L model. The semantic segmentation predictions suggest that the model is able to recover thin structures, such as poles, despite largely operating in a 32×64 superpixel space.

We also provide a visualization of the learned superpixels. We convert the soft association in Section III-D to a hard assignment by selecting the argmax over the superpixels $i$:

    $\bar{Q}_p = \mathrm{argmax}_{i \in N(p)} \, Q_{pi}$.      (7)

We visualize the boundaries of these assignments overlaid on top of the input image in the left-most column of Figure 4. From these visualizations, we can see that, despite the model being trained with a soft association, the superpixels generated by the hard assignment tightly follow the boundaries in the image. We note that this is particularly interesting, as we do not provide any direct supervision for the superpixel associations; instead, they are learned implicitly by the network. In addition, we find that these boundaries tend to be more faithful to the edges of an object than the labels.
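For completeness, a small NumPy sketch of this visualization procedure: Eq. (7) collapses the soft association to a per-pixel argmax superpixel id, and boundaries are drawn wherever neighboring pixels disagree. This is an illustrative re-implementation under the same assumptions as the earlier sketches, not the authors' code.

```python
import numpy as np

def hard_superpixel_map(pix, sp):
    """Eq. (7): collapse the soft association to a hard superpixel id per pixel.

    pix: (H, W, d) pixel features; sp: (Gh, Gw, d) superpixel features.
    Returns an (H, W) int map of the argmax superpixel index for each pixel.
    """
    H, W, d = pix.shape
    Gh, Gw, _ = sp.shape
    h, w = H // Gh, W // Gw
    ids = np.zeros((H, W), dtype=np.int64)
    for y in range(H):
        for x in range(W):
            gi, gj = y // h, x // w
            ni = [i for i in range(gi - 1, gi + 2) if 0 <= i < Gh]
            nj = [j for j in range(gj - 1, gj + 2) if 0 <= j < Gw]
            nb = sp[np.ix_(ni, nj)].reshape(-1, d)
            scores = nb @ pix[y, x]                 # same logits as Eq. (5)
            best = np.argmax(scores)                # Eq. (7): argmax over N(p)
            bi, bj = ni[best // len(nj)], nj[best % len(nj)]
            ids[y, x] = bi * Gw + bj                # flat superpixel index
    return ids

def boundary_mask(ids):
    """Mark pixels whose right or bottom neighbor has a different id."""
    edges = np.zeros_like(ids, dtype=bool)
    edges[:, :-1] |= ids[:, :-1] != ids[:, 1:]
    edges[:-1, :] |= ids[:-1, :] != ids[1:, :]
    return edges

# Toy usage: overlay `edges` on the image to draw superpixel boundaries.
rng = np.random.default_rng(0)
ids = hard_superpixel_map(rng.normal(size=(16, 16, 8)), rng.normal(size=(4, 4, 8)))
edges = boundary_mask(ids)
```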
2) ADE20K Dataset: We provide quantitative results on the ADE20K dataset in Table II. We choose one of the most commonly used crop sizes (640×640) and provide results for the ResNet-50 backbone. For ADE20K, we achieve the highest FPS and the second lowest number of parameters (behind the surprisingly small RegProxy model), while outperforming RegProxy in mIoU.

However, we note that the gap in performance to the state of the art is larger for ADE20K than for Cityscapes. Our hypothesis is that the large number of classes in ADE20K (150) results in ambiguities when a pixel could belong to multiple classes. This results in inconsistencies for object classes, and in particular in where boundaries are drawn (see Figure 5 for examples). As our superpixel tokenization module operates before any semantic prediction, and each superpixel query only operates on a local neighborhood in the pixel space, the model must learn a consistent way to divide the image into a set of superpixels. When the label boundaries are inconsistent, our model is less able to effectively learn this over-segmentation, leading to a small decrease in performance.

Fig. 5. Inconsistent label boundaries in the ADE20K dataset make it difficult for our model to effectively learn superpixels. See the spaces between fence railings, the books on the shelves, the unlabeled people on the bleachers, and the trees/vegetation.

3) Superpixel Tokenization Ablation: In Table III, we provide an ablation of the number of local cross-attention layers (#CA) and the superpixel resolution, evaluated using our ResNet-50 backbone model on Cityscapes.

| #CA | sp. res. | Params | FLOPs | Runtime (ms) | mIoU |
| 1   | 32×64    | 28M    | 234G  | 50.0         | 78.0 |
| 2   | 16×32    | 29M    | 240G  | 53.5         | 74.9 |
| 2   | 32×64    | 29M    | 253G  | 63.4         | 80.4 |
| 2   | 64×128   | 31M    | 404G  | 120.5        | 79.7 |

TABLE III. Experiments ablating the number of local cross-attention layers (#CA) and the resolution of the superpixels. Ablations are performed with a ResNet-50 backbone on Cityscapes.

From these experiments, we find that increasing the number of cross-attention layers from 1 to 2 improves mIoU significantly, by 2.4. Further increasing the number of these layers may yield additional improvements, although we are currently bottlenecked by accelerator memory constraints. This is largely due to our naive TensorFlow implementation [45], which we discuss in Section IV-C.2.

Reducing the number of superpixels to 16×32 results in a significant regression of 5.5 mIoU. On the other hand, increasing the number of superpixels to 64×128 results in a mild regression of 0.7. We posit that this is because the pixel features used in the association are at stride 8 (128×256). As a result, with 64×128 superpixels, each local cross-attention sees a neighborhood of only (128/64)×3 = 6×6 pixels, which makes the receptive field of each superpixel too small to learn the over-segmentation effectively. This could be resolved by increasing the size of the neighborhood around each superpixel, but at significantly higher computational cost.

C. Discussion

1) Superpixel Quality: Despite our network not being trained with any explicit superpixel-based loss, we find that the associations learned by the network closely resemble classical superpixels. That is, the superpixels are aligned such that they follow the dominant edges in the image. We posit that this is due to the limited receptive field of each superpixel's cross-attention. As any single superpixel may not have visibility of all the pixels of a given mask, the model must use local edges and boundaries to separate the superpixels.

2) Latency: As seen in Tables I and II, our method provides a significant improvement in FPS. As the main processing of our model operates in the small superpixel space, this allows for a large reduction in model complexity, while achieving state-of-the-art performance.

However, we believe that a further large reduction in latency is still available. In particular, the local cross-attention operation is not efficient on standard accelerators with native TensorFlow or PyTorch implementations. This is because it requires a sliding window with overlapping patches, but with different operands in each patch (as opposed to convolutions, where the weights are the same for all patches). However, as we operate at the superpixel grid level (32×64 for Cityscapes and 40×40 for ADE20K), the impact of this inefficiency is low enough to make our method faster overall than methods that operate on the dense pixel space.

Nonetheless, the Superpixel Tokenization module takes up the majority of the decoder runtime, as can be seen in Table IV, where we provide timing information for our method with a ResNet-50 backbone on a 1024×2048 input. Our experiments currently use a relatively naive, pure TensorFlow implementation, which involves duplicating each superpixel or pixel 9 times (depending on the cross-attention operation). We believe that a CUDA implementation could remove this redundant copy and provide even further speedups for our method.

| Component                 | Time (ms) |
| Backbone: ResNet-50       | 28.0      |
| Hypercolumn               | 7.1       |
| Superpixel Tokenization   | 25.4      |
| Superpixel Self-Attention | 4.0       |
| Superpixel Association    | 1.0       |
| Total                     | 65.5      |

TABLE IV. Latency of each sub-component. The bulk of the non-backbone latency is consumed by the local cross-attention in the Superpixel Tokenization module.

V. CONCLUSIONS

We presented a novel network architecture for semantic segmentation that leverages superpixels to project the dense image segmentation problem into a low-dimensional superpixel space. Operating in this space enables us to significantly reduce the size and inference latency of our network compared to prior works, while achieving state-of-the-art performance.
REFERENCES

[1] Radhakrishna Achanta, Appu Shaji, Kevin Smith, Aurelien Lucchi, Pascal Fua, and Sabine Süsstrunk. SLIC superpixels compared to state-of-the-art superpixel methods. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(11):2274–2282, 2012.
[2] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In ECCV, 2020.
[3] Jieneng Chen, Yongyi Lu, Qihang Yu, Xiangde Luo, Ehsan Adeli, Yan Wang, Le Lu, Alan L Yuille, and Yuyin Zhou. TransUNet: Transformers make strong encoders for medical image segmentation. arXiv:2102.04306, 2021.
[4] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. Semantic image segmentation with deep convolutional nets and fully connected CRFs. In ICLR, 2015.
[5] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.
[6] Liang-Chieh Chen, George Papandreou, Florian Schroff, and Hartwig Adam. Rethinking atrous convolution for semantic image segmentation. arXiv:1706.05587, 2017.
[7] Liang-Chieh Chen, Yi Yang, Jiang Wang, Wei Xu, and Alan L Yuille. Attention to scale: Scale-aware semantic image segmentation. In CVPR, 2016.
[8] Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In ECCV, 2018.
[9] Bowen Cheng, Anwesa Choudhuri, Ishan Misra, Alexander Kirillov, Rohit Girdhar, and Alexander G Schwing. Mask2Former for video instance segmentation. In CVPR, 2022.
[10] Bowen Cheng, Maxwell D Collins, Yukun Zhu, Ting Liu, Thomas S Huang, Hartwig Adam, and Liang-Chieh Chen. Panoptic-DeepLab: A simple, strong, and fast baseline for bottom-up panoptic segmentation. In CVPR, 2020.
[11] Bowen Cheng, Ishan Misra, Alexander G Schwing, Alexander Kirillov, and Rohit Girdhar. Masked-attention mask transformer for universal image segmentation. In CVPR, 2022.
[12] Bowen Cheng, Alex Schwing, and Alexander Kirillov. Per-pixel classification is not all you need for semantic segmentation. In NeurIPS, 2021.
[13] Dorin Comaniciu and Peter Meer. Mean shift: A robust approach toward feature space analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(5):603–619, 2002.
[14] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The Cityscapes dataset for semantic urban scene understanding. In CVPR, 2016.
[15] Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine Learning, 20(3):273–297, 1995.
[16] Ekin D Cubuk, Barret Zoph, Dandelion Mane, Vijay Vasudevan, and Quoc V Le. AutoAugment: Learning augmentation policies from data. In CVPR, 2019.
[17] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2020.
[18] Brian Fulkerson, Andrea Vedaldi, and Stefano Soatto. Class segmentation and object localization with superpixel neighborhoods. In ICCV, 2009.
[19] Raghudeep Gadde, Varun Jampani, Martin Kiefel, Daniel Kappler, and Peter V Gehler. Superpixel convolutional networks using bilateral inceptions. In ECCV, 2016.
[20] Bharath Hariharan, Pablo Arbeláez, Ross Girshick, and Jitendra Malik. Hypercolumns for object segmentation and fine-grained localization. In CVPR, 2015.
[21] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
[22] Xuming He, Richard S Zemel, and Miguel A Carreira-Perpiñán. Multiscale conditional random fields for image labeling. In CVPR, 2004.
[23] Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Q Weinberger. Deep networks with stochastic depth. In ECCV, 2016.
[24] Varun Jampani, Deqing Sun, Ming-Yu Liu, Ming-Hsuan Yang, and Jan Kautz. Superpixel sampling networks. In ECCV, 2018.
[25] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
[26] Philipp Krähenbühl and Vladlen Koltun. Efficient inference in fully connected CRFs with Gaussian edge potentials. In NeurIPS, 2011.
[27] Ľubor Ladický, Chris Russell, Pushmeet Kohli, and Philip HS Torr. Associative hierarchical CRFs for object class image segmentation. In ICCV, 2009.
[28] John Lafferty, Andrew McCallum, and Fernando CN Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In ICML, 2001.
[29] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[30] Zhiqi Li, Wenhai Wang, Enze Xie, Zhiding Yu, Anima Anandkumar, Jose M Alvarez, Ping Luo, and Tong Lu. Panoptic SegFormer: Delving deeper into panoptic segmentation with transformers. In CVPR, 2022.
[31] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin Transformer: Hierarchical vision transformer using shifted windows. In ICCV, 2021.
[32] Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, and Saining Xie. A ConvNet for the 2020s. In CVPR, 2022.
[33] Stuart Lloyd. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2):129–137, 1982.
[34] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
[35] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In ICLR, 2019.
[36] Prajit Ramachandran, Niki Parmar, Ashish Vaswani, Irwan Bello, Anselm Levskaya, and Jon Shlens. Stand-alone self-attention in vision models. In NeurIPS, 2019.
[37] Jianbo Shi and Jitendra Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888–905, 2000.
[38] Robin Strudel, Ricardo Garcia, Ivan Laptev, and Cordelia Schmid. Segmenter: Transformer for semantic segmentation. In ICCV, 2021.
[39] Zhi Tian, Chunhua Shen, and Hao Chen. Conditional convolutions for instance segmentation. In ECCV, 2020.
[40] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017.
[41] Huiyu Wang, Yukun Zhu, Hartwig Adam, Alan Yuille, and Liang-Chieh Chen. MaX-DeepLab: End-to-end panoptic segmentation with mask transformers. In CVPR, 2021.
[42] Huiyu Wang, Yukun Zhu, Bradley Green, Hartwig Adam, Alan Yuille, and Liang-Chieh Chen. Axial-DeepLab: Stand-alone axial-attention for panoptic segmentation. In ECCV, 2020.
[43] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In CVPR, 2018.
[44] Xinlong Wang, Rufeng Zhang, Tao Kong, Lei Li, and Chunhua Shen. SOLOv2: Dynamic and fast instance segmentation. In NeurIPS, 2020.
[45] Mark Weber, Huiyu Wang, Siyuan Qiao, Jun Xie, Maxwell D Collins, Yukun Zhu, Liangzhe Yuan, Dahun Kim, Qihang Yu, Daniel Cremers, et al. DeepLab2: A TensorFlow library for deep labeling. arXiv:2106.09748, 2021.
[46] Tete Xiao, Yingcheng Liu, Bolei Zhou, Yuning Jiang, and Jian Sun. Unified perceptual parsing for scene understanding. In ECCV, 2018.
[47] Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M Alvarez, and Ping Luo. SegFormer: Simple and efficient design for semantic segmentation with transformers. In NeurIPS, 2021.
[48] Jiarui Xu, Shalini De Mello, Sifei Liu, Wonmin Byeon, Thomas Breuel, Jan Kautz, and Xiaolong Wang. GroupViT: Semantic segmentation emerges from text supervision. In CVPR, 2022.
[49] Qihang Yu, Huiyu Wang, Dahun Kim, Siyuan Qiao, Maxwell Collins, Yukun Zhu, Hartwig Adam, Alan Yuille, and Liang-Chieh Chen. CMT-DeepLab: Clustering mask transformers for panoptic segmentation. In CVPR, 2022.
[50] Qihang Yu, Huiyu Wang, Siyuan Qiao, Maxwell Collins, Yukun Zhu, Hartwig Adam, Alan Yuille, and Liang-Chieh Chen. k-means mask transformer. In ECCV, 2022.
[51] Yuhui Yuan, Xilin Chen, and Jingdong Wang. Object-contextual representations for semantic segmentation. In ECCV, 2020.
[52] Wenwei Zhang, Jiangmiao Pang, Kai Chen, and Chen Change Loy. K-Net: Towards unified image segmentation. In NeurIPS, 2021.
[53] Yifan Zhang, Bo Pang, and Cewu Lu. Semantic segmentation by early region proxy. In CVPR, 2022.
[54] Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and Jiaya Jia. Pyramid scene parsing network. In CVPR, 2017.
[55] Sixiao Zheng, Jiachen Lu, Hengshuang Zhao, Xiatian Zhu, Zekun Luo, Yabiao Wang, Yanwei Fu, Jianfeng Feng, Tao Xiang, Philip HS Torr, et al. Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In CVPR, 2021.
[56] Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Scene parsing through ADE20K dataset. In CVPR, 2017.
[57] Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, and Jifeng Dai. Deformable DETR: Deformable transformers for end-to-end object detection. In ICLR, 2020.