Instance Segmentation with Cross-Modal Consistency

Alex Zihao Zhu¹, Vincent Casser¹, Reza Mahjourian¹, Henrik Kretzschmar¹ and Sören Pirk²

Abstract—Segmenting object instances is a key task in machine perception, with safety-critical applications in robotics and autonomous driving. We introduce a novel approach to instance segmentation that jointly leverages measurements from multiple sensor modalities, such as cameras and LiDAR. Our method learns to predict embeddings for each pixel or point that give rise to a dense segmentation of the scene. Specifically, our technique applies contrastive learning to points in the scene both across sensor modalities and the temporal domain. We demonstrate that this formulation encourages the models to learn embeddings that are invariant to viewpoint variations and consistent across sensor modalities. We further demonstrate that the embeddings are stable over time as objects move around the scene. This not only provides stable instance masks, but can also provide valuable signals to downstream tasks, such as object tracking. We evaluate our method on the Cityscapes and KITTI-360 datasets. We further conduct a number of ablation studies, demonstrating benefits when applying additional inputs for the contrastive loss.

Fig. 1: Our approach to instance segmentation leverages information from multiple sensor modalities, such as 3D LiDAR point clouds and camera images (left). Our method learns to predict dense embeddings (depicted by the colors, right image) based on a novel cross-modal contrastive loss that uses samples from different sensor modalities and temporally consecutive frames of RGB images and LiDAR points. The resulting embeddings are stable over time and invariant to viewpoint changes.

I. INTRODUCTION

Identifying and segmenting objects is a fundamental and challenging task in computer vision, and critically important for many robotics applications, such as autonomous driving. Given some sensor input such as a camera image or LiDAR scan, the goal of instance segmentation is to associate each pixel or point with a unique instance label for the object to which it corresponds. This allows for fine-grained reasoning about each object beyond sparse representations such as bounding boxes. Only recently, learning-based approaches have shown remarkable success in solving this task on a number of benchmark datasets [16], [6], [34]. Reliably identifying object instances in complex real-world data, both across different data modalities and coherent in time, remains a challenging and open problem [49], [15], [8].
Many safety-critical robotics systems, such as autonomous vehicles, rely on multiple sensors that provide complementary information to perceive the environment. For instance, LiDAR sensors provide highly accurate range readings, but only in clear weather conditions and up to a limited range. On the other hand, cameras may be able to perceive objects at much longer ranges, but without any explicit range readings and only in adequate lighting conditions. As a result, there is an opportunity to improve segmentation performance for each sensor by reasoning jointly across all modalities.

Another avenue is the utilization of temporal data, which can provide motion information, varying viewpoints and context for occlusions. Temporal data comes naturally for most sensors, although sequential labeling for ground truth is more expensive.

In this paper, we aim to leverage these complementary signals and advance instance segmentation by training jointly on different data modalities and over time (Figure 1). Specifically, we propose a novel cross-modal contrastive loss that enables us to train with sequential samples from camera images and LiDAR point clouds to learn a dense representation of object instances. Our goal is to learn embeddings that coherently represent individual objects in both modalities. By training on different modalities, our method can more meaningfully disentangle individual instances. Moreover, we show that our loss formulation is flexible enough to be applied to sequences of sensor data of both modalities, enabling the learning of temporally stable embeddings for object instances. For situations where only sequential data (but no labels) are available, we propose a method for generating pseudo-groundtruth using optical flow, and demonstrate that training on this data brings similar improvements in segmentation.

In summary, our contributions are as follows:
• We introduce a novel cross-modal contrastive loss to obtain consistent instance embeddings for RGB images and point clouds.
• We show that our loss can also be used to train on sequences of sensor data from both modalities, enabling the learning of temporally stable embeddings.
• We propose a method using optical flow to generate pseudo ground-truth for video, allowing the extension of our contrastive loss to partially labeled datasets.
• Our extensive experiments on the KITTI-360 [33] and Cityscapes [16] benchmarks suggest that our method significantly improves instance segmentation quality by up to 1.9 mAP when contrasting between temporal RGB data, and by up to 1.7 mAP when contrasting between temporal LiDAR and RGB data, while also improving the temporal stability of the embeddings.

¹ Waymo LLC
² Adobe Research (Work done while at Google Research)

Fig. 2: Our goal is to learn dense cross-modal embeddings via contrastive learning to segment individual object instances. Our method leverages shared information between sensors and consecutive frames by applying the contrastive loss jointly over embeddings from all input sources at training time. This global contrastive loss allows the network to share information between different data sources, and improves overall segmentation performance. Once trained, the learned embeddings can be used to obtain object IDs through mean shift clustering.

II. RELATED WORK

Object detection and segmentation have a long tradition in computer vision. While this spans a wide range of different approaches that we cannot conclusively discuss, we aim to provide an overview of learning-based methods toward instance and semantic segmentation in images and point clouds.
Semantic Segmentation. The goal of semantic segmentation is to associate every pixel in an image with a discrete label that defines its semantic class. A major challenge of effectively training networks for semantic segmentation is to define precise labels, and a number of different datasets were introduced for benchmarking [16], [22], [6]. Many recent methods for semantic segmentation employ deep neural network architectures for feature aggregation based on pyramid pooling [53], attention modules [10], multi-scale activations [24], or end-to-end convolutions [36]. Furthermore, it has been recognized that dilated or atrous convolutions provide a powerful way of capturing more global features through wider receptive fields [52], [9]. Other approaches aim to solve semantic segmentation with an emphasis on learning structured representations based on conditional random fields [1], [9] or deep parsing networks [35].

Instance Segmentation expands on the concept of semantic segmentation. Here the goal is not only to obtain a class label for every pixel, but to predict labels for individual object instances. Instance segmentation approaches can be separated into two different categories. In the first category, most methods rely on a two-stage (or top-down) process [17], [18], [25]. A region proposal step [23] identifies object instances in a first stage, and each region is then segmented into foreground and background during a second stage. The performance of top-down approaches is determined by obtaining region proposals and by the number of objects present in an image. On the contrary, bottom-up approaches aim to learn per-pixel embeddings that represent individual object instances based on recurrent instance grouping [29], watershed transformations [3], subnetwork instantiations [2], associative embeddings [39], or more refined loss functions such as the discriminative loss [5], [26] or metric learning [20]. Bottom-up approaches are commonly implemented with more lightweight architectures, which offers more efficient training and easier integration into existing multi-task setups. However, learning meaningful embeddings from complex data to represent individual objects remains a challenging problem. Finally, a number of hybrid approaches exist that aim to learn embeddings of larger regions in images [7], [13].

3D Object Detection and Segmentation: Detecting and segmenting objects in point clouds has received a considerable amount of attention. PointNets [41] directly consume raw point clouds to generate latent representations that can be used for classification or semantic segmentation tasks. By leveraging various geometric relationships and representations, a number of methods have since improved performance on tasks such as object detection [54], semantic segmentation [30], [42], or instance segmentation [45], [51]. A number of approaches rely on Bird's Eye View (BEV) representations to generate bounding boxes [50], [14] that – along with pillar-based representations [31], [46] – are popular in autonomous driving. More recent methods for 3D instance segmentation rely on more principled approaches. To this end, 3D-MPA [19] uses an aggregation-based approach where each point votes for its object center, HAIS [11] exploits the spatial relation of points and point sets, and SSTNet [32] uses a hierarchical representation which is constructed based on learned semantic features.

Multimodal Approaches. There also exists a limited number of multimodal approaches proposed in the robotic-vision domain, including [48], [37]. Most of these works tackle indoor environments using closely paired RGB-D data. Similar to these methods, our goal is to leverage multi-modal sensor data. Specifically, we use a single-stage fully-convolutional network to learn per-point and per-pixel embeddings for instance segmentation. We train these embeddings with a metric learning loss that contrasts embeddings of the same object instance with those of different objects.

Fig. 3: Example of our sampling strategy, where points are evenly distributed amongst instances. The bottom row shows close-ups of the used sample locations.
The key idea of our approach is that we obtain the positive and negative samples spatially, within the same frame, across different modalities (images, points), and even across time, in the adjacent frames of a sequence. We accomplish this by proposing a novel cross-modal consistency loss that allows us to sample embedding vectors from different sources. Training a single-stage network with this loss enables us to compute dense embeddings and, consequently, more robust instance masks for each frame, while also obtaining temporally stable results for sequences.

III. METHOD

In this section, we introduce our model architecture, the cross-modal contrastive loss function, and implementation details. Our full pipeline is shown in Figure 2.

A. Cross-Modal Contrastive Loss

Our method consists of three main components: (1) a pair of instance segmentation networks that generate dense embeddings for RGB images and LiDAR range images; (2) a sampling-based contrastive learning loss which constrains embeddings from the same instance to be similar, while forcing embeddings from different instances to be different; and (3) a consistency module based on paired instance labels that introduces additional training data and correspondences between sensors and over temporally neighboring frames.

Contrastive Loss: The goal of the instance segmentation network is to segment object instances by learning dense embedding vectors. The network takes as input either RGB images or LiDAR range images [4], and outputs embeddings with c = 32 channels. To train the network, we use a metric learning loss for contrasting embeddings of the same instance with those of different instances. Embeddings of the same instance are pushed closer together during training, while those of different instances are pushed apart in the embedding space.

Fig. 4: Given a set of predicted embeddings with paired instance labels, we sample positive embedding vectors within the instance mask of an object in the current frame as well as in all other frames (cross-sensor and temporal). Samples from all other objects in all frames are used as negatives. We then train with our cross-modal contrastive loss to learn dense embeddings of object instances. The images at the bottom show the projected embeddings during training.

Our instance segmentation loss consists of the normalized temperature-scaled cross entropy (NT-XENT) loss defined in [12]. We define the per-pixel embeddings as E \in \mathbb{R}^{h \times w \times c}, where h and w denote the spatial dimensions of the output and c the number of channels of the embedding at each pixel. For each frame, we randomly sample K = 8192 embeddings, distributed evenly across all instances (visualized for RGB frames in Figure 3). Given a set of sampled embeddings for an object instance, we generate all possible positive pairs. The samples of all other instances are used as negatives for each pair. For a set of samples U and a positive pair e_i, e_j, our loss can then be defined as:

l_{i,j} = -\log \frac{\exp(\mathrm{sim}(e_i, e_j)/\tau)}{\sum_{k \in U} \mathbb{1}_{[\mathrm{id}(i) \neq \mathrm{id}(k)]} \exp(\mathrm{sim}(e_i, e_k)/\tau)}    (1)

\text{with} \quad \mathrm{sim}(x, y) = \frac{x^\top y}{\lVert x \rVert_2 \lVert y \rVert_2},    (2)

where τ denotes a temperature hyperparameter and id(i) is the mask ID of point i. The similarity function sim computes the cosine similarity between two normalized embedding vectors x and y.
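To make the sampled loss concrete, below is a minimal NumPy sketch of Eqs. (1)–(2) applied to one set of sampled embeddings. It is an illustration rather than our training code: the temperature value shown is arbitrary, and the pairwise terms are simply averaged over all positive pairs (the exact aggregation is given in Eq. (3) below).

```python
import numpy as np

def sampled_contrastive_loss(embeddings, instance_ids, tau=0.1):
    """NT-XENT-style loss over a set of sampled embeddings (Eqs. 1-2).

    embeddings:   (K, c) sampled embedding vectors, drawn from any mix of
                  sources (RGB, LiDAR, temporally neighboring frames).
    instance_ids: (K,) integer mask ID for each sample.
    tau:          temperature hyperparameter (illustrative value).
    """
    # Cosine similarity between all pairs of samples (Eq. 2).
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    exp_sim = np.exp(normed @ normed.T / tau)

    same_id = instance_ids[:, None] == instance_ids[None, :]

    # Denominator of Eq. 1: for each anchor i, sum over samples of *other* instances.
    neg_sum = (exp_sim * (~same_id)).sum(axis=1)          # shape (K,)

    # Positive pairs (i, j): same instance ID, i != j.
    pos_i, pos_j = np.where(same_id & ~np.eye(len(instance_ids), dtype=bool))
    l_ij = -np.log(exp_sim[pos_i, pos_j] / neg_sum[pos_i])

    # Simple average over positive pairs; stands in for the aggregation of Eq. 3.
    return l_ij.mean()
```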
The total loss over the instance embeddings is then defined as:

l_c = \frac{1}{|U|} \sum_{i,j \in U} \mathbb{1}_{[\mathrm{id}(i) = \mathrm{id}(j)]} \, l_{i,j}.    (3)

Furthermore, to obtain more stable results, we also apply a regularization loss over the norm of the embeddings:

l_r = \frac{1}{h \cdot w} \sum_{x,y} \lVert E(x,y) \rVert_2.    (4)

Our final loss, then, is:

l = l_c + \lambda l_r,    (5)

where λ = 0.01 is a weighting factor. Together, these losses allow us to learn meaningful embedding vectors as a representation for object instances.

Consistency Over Modalities and Time: Given a dataset with instance labels that are consistent between sensors and across time, such that an object has the same ID across all inputs, one can apply the training loss (5) between all inputs by randomly sampling points from each set of embeddings and using the consistent instance labels to determine positive and negative pairs. This allows the network to learn embeddings that are consistent not only within a single frame, but between all modalities and time steps seen at training.

However, it is often the case that we do not have consistent instance labels. In these cases, we propose a method for generating pseudo-groundtruth using optical flow. Given a dataset with individually labeled frames amongst unlabeled temporal sequences, we use a pre-trained optical flow network to warp the groundtruth of the labeled frame to its temporal neighbors.

Fig. 5: Temporally-coherent embeddings: as our model is trained with a temporal consistency loss based on optical flow, we are able to obtain temporally-coherent and more stable instance embeddings. Here we show the RGB sequences of two different scenes along with the multi-dimensional per-pixel embedding vectors projected into a color space.

We use the predicted flow to warp the groundtruth instance and semantic labels of the current frame to the next frame and apply nearest-neighbor resampling to avoid interpolating between integer label values. Following the work of Wang et al. [47], we compute occlusions according to the 'range map', defined by the number of points in the original image that map to the other image. The instance and semantic labels of occluded pixels, as well as pixels that have left the image, are set to invalid and are not sampled for the contrastive loss.
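A minimal sketch of this label-warping step is shown below. It assumes a backward flow field (from the unlabeled target frame to the labeled frame) and a boolean occlusion mask produced by the flow model; the function is illustrative and not the exact pipeline used in our experiments.

```python
import numpy as np

def warp_labels_with_flow(labels, flow, occlusion):
    """Warp integer label maps to a neighboring frame using optical flow.

    labels:    (H, W) integer instance (or semantic) IDs of the labeled frame.
    flow:      (H, W, 2) flow from the target (unlabeled) frame back to the
               labeled frame, as (dx, dy) pixel offsets.
    occlusion: (H, W) boolean mask of pixels estimated as occluded.
    Returns a label map for the target frame; invalid pixels are set to -1.
    """
    h, w = labels.shape
    ys, xs = np.mgrid[0:h, 0:w]

    # Nearest-neighbor resampling: round the source coordinates so that we
    # never interpolate between integer label values.
    src_x = np.rint(xs + flow[..., 0]).astype(int)
    src_y = np.rint(ys + flow[..., 1]).astype(int)

    # Pixels whose source falls outside the image have left the frame.
    in_bounds = (src_x >= 0) & (src_x < w) & (src_y >= 0) & (src_y < h)

    warped = np.full(labels.shape, -1, dtype=np.int64)
    valid = in_bounds & ~occlusion
    warped[valid] = labels[src_y[valid], src_x[valid]]
    return warped
```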
The embeddings and warped labels for the next frame are then combined with those of the current frame when performing the sampling for the contrastive loss. Figure 4 illustrates the result of the warped labels as well as the sampling of embeddings from the two frames.

This pipeline has similarities to semi-supervised methods such as the work by Chen et al. [8], which generates pseudo-groundtruth by running inference with a pre-trained model and subsequently re-training. However, in addition to providing groundtruth to temporally neighboring frames with optical flow, our method also ensures consistency between frames and thus over time. Sample outputs of our method over time can be found in Figure 5.

Semantic Segmentation: We predict a semantic mask for each frame and collapse all 'stuff' classes into a single background class. During clustering, we assign semantic classes to each clustered instance with a 'majority vote' between semantic predictions within the instance. We also generate the confidence for each mask using the average semantic score. Pixels not belonging to the selected semantic class are dropped from the instance to preserve semantic boundaries. The one-hot semantic labels are concatenated to the predicted instance embeddings, so that the network only has to predict unique instance embeddings within each class.

Instance Mask Generation: At inference time, we cluster the resulting per-pixel embeddings by applying a variant of the mean shift algorithm proposed by Brabandere et al. [5]. In particular, we randomly sample a point in the embedding space and find all inlier points with cosine distance (scaled to be within [0,1]) less than a threshold, m = 0.1, from the sampled point. We then iterate by shifting the sampled point to the mean of the set of inliers, and repeat until convergence or until some maximum number of iterations is reached. We repeat this process until all pixels have been clustered. With this method, we found that erroneous masks were typically generated on the transitions between masks, producing thin artifacts along the boundaries. To compensate, we filter out masks with an area-to-perimeter ratio less than a threshold, r = 4. As this method computes distances across the entire image, no assumptions about the connectivity of each mask are made, and so arbitrarily distributed instances can be detected.

More complex methods are available for generating instance masks from embeddings, such as the graph cut algorithm proposed in SSAP [21] or the transformer head proposed in MaX-DeepLab [44], which would likely achieve higher performance. However, these would incur additional latency penalties, and their computational power may hide improvements to the underlying embeddings.
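The clustering step can be sketched as follows on flattened embeddings. The cosine distance is scaled to [0,1] as described above; the iteration limit and the re-seeding logic are illustrative simplifications, and the area-to-perimeter filter is omitted.

```python
import numpy as np

def mean_shift_instances(embeddings, m=0.1, max_iters=20, seed=0):
    """Cluster per-pixel embeddings into instance IDs via iterative mean shift.

    embeddings: (N, c) array, one row per pixel or point.
    m:          cosine-distance threshold, with distances scaled to [0, 1].
    Returns an (N,) array of instance IDs.
    """
    rng = np.random.default_rng(seed)
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    ids = np.full(len(normed), -1, dtype=np.int64)
    next_id = 0

    while np.any(ids < 0):
        # Seed mean shift from a random, not-yet-clustered embedding.
        seed_idx = rng.choice(np.where(ids < 0)[0])
        center = normed[seed_idx]
        for _ in range(max_iters):
            dist = 0.5 * (1.0 - normed @ center)   # cosine distance in [0, 1]
            inliers = dist < m
            new_center = normed[inliers].mean(axis=0)
            new_center /= np.linalg.norm(new_center)
            if np.allclose(new_center, center):
                break
            center = new_center
        # Assign every still-unclustered inlier (and at least the seed) a new ID.
        ids[inliers & (ids < 0)] = next_id
        ids[seed_idx] = next_id
        next_id += 1
    return ids
```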
B. Implementation Details

In this section, we provide details about the implementation of our network architecture.

Model Architecture: We use a modified architecture based on Panoptic-DeepLab [15]. Our model consists of an Xception-71 backbone, with an additional ASPP and decoder module for instance embedding prediction, with c = 32 channels, in place of the instance center and offset decoder in [15]. For our cross-modal experiments, we use two separate networks for LiDAR range image inputs and RGB image inputs.

Preprocessing: For all datasets, we use full-resolution images as input. During training, we augment our data by randomly flipping, scaling, and cropping or padding the input images. Furthermore, we use random Gaussian blur and jitter brightness, contrast, saturation, and hue. For temporal data, we apply the same augmentation to each pair of images. For LiDAR range images, we only apply flipping, and ensure that the flipping is consistent with the camera images.

Training: For all experiments, we use a moving average ADAM optimizer with linear warmup for 5 epochs, an initial learning rate of 2.5e-4, and exponential decay with a decay factor of 1.2 every 10 epochs. For the KITTI-360 dataset, we train for 10,000 steps for the ablation train/val split and 20,000 steps for the train/test split, with an effective batch size of 128. For the Cityscapes dataset, we train for 60,000 iterations with an effective batch size of 32.

Fig. 6: Cross-modal results on KITTI-360: Our models take as input RGB images or LiDAR range images (a) and predict dense instance embeddings (b) which are consistent between the sensor modalities. At post-processing, we can apply a mean-shift clustering algorithm to generate instance masks for each modality (c), which we compare to the ground truth (d). Our cross-modal contrastive loss contrasts pairs between the sensor modalities and trains the network to predict the same embedding for each object instance in each sensor, denoted by similarity in color. Each image pair shows both sensor modalities, where the LiDAR data is shown as range images (bottom).

IV. EXPERIMENTS

In this section we present and discuss a variety of experiments, including baseline comparisons, ablation studies, and qualitative results our method is able to generate. For all of our experiments, we do not perform any kind of test-time augmentation.

A. Datasets

a) KITTI-360: The KITTI-360 dataset [33] consists of camera and LiDAR panoptic segmentation and LiDAR bounding box labels for all of the sequences of the KITTI dataset [22]. For each sequence, the instance labels are consistent both between the camera and LiDAR labels, as well as between different time steps. The dataset contains 83,000 frames and associated LiDAR scans, split into 9 training sequences and 2 test sequences with held-out groundtruth. As no validation set is provided, we perform our ablations on a train/val split where sequences 00, 02, 03, 04, 05, 06, 07 are the train set and sequences 09 and 10 are the validation set. The validation set was chosen from the two densest sequences, such that there are almost the same number of objects in the training and validation sets. For the final test set evaluation, we train on the entire training set.

As the dataset labels are largely transferred from primitive 3D shapes (cuboids, ellipsoids, etc.), each label has a corresponding confidence value, which is used to weight the evaluation metrics. In accordance with the experiments in KITTI-360, we only use the top 70% of labels in each frame in terms of confidence. To generate temporal frames, we randomly sample an image 2 frames before or after the current frame at each training step. To train a model on LiDAR, we use the range image representation of the point cloud generated by Bewley et al. [4]. From this 64×2048 range image, we crop a central 64×512 patch corresponding to the camera field of view and treat this as an image input to the DeepLab model. To generate per-scan groundtruth, we use nearest neighbors to find the closest ground truth segmentation label to each point in the range image.
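As an illustration of this preprocessing, the sketch below crops the camera-facing patch from a range image and transfers 3D segmentation labels to each LiDAR return with a nearest-neighbor lookup; the array shapes and the KD-tree-based lookup are assumptions made for the example.

```python
import numpy as np
from scipy.spatial import cKDTree

def prepare_lidar_training_pair(range_image, points_xyz, labeled_xyz, labels):
    """Crop the camera-facing patch and transfer 3D labels to the range image.

    range_image: (64, 2048) LiDAR range image.
    points_xyz:  (64, 2048, 3) Cartesian coordinates of each range-image pixel.
    labeled_xyz: (M, 3) labeled ground-truth points.
    labels:      (M,) instance IDs of the labeled points.
    """
    # Central 64x512 crop, roughly matching the camera field of view.
    w = range_image.shape[1]
    lo, hi = w // 2 - 256, w // 2 + 256
    crop = range_image[:, lo:hi]
    crop_xyz = points_xyz[:, lo:hi]

    # Nearest-neighbor transfer of segmentation labels to each LiDAR return.
    tree = cKDTree(labeled_xyz)
    _, nn = tree.query(crop_xyz.reshape(-1, 3))
    crop_labels = labels[nn].reshape(crop.shape)
    return crop, crop_labels
```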
b) Cityscapes: To test our method on a dataset without consistent instance labels and in a single-modal setting, we employ the Cityscapes dataset [16]. Cityscapes provides us with 2975, 500, and 1525 images of urban scenes for training, validation, and testing, respectively. Furthermore, the dataset provides 8 'thing' and 11 'stuff' classes. In this work, we train our network on the 'fine' set of training images with ground truth instance labels. In order to generate paired groundtruth, we use the unlabeled temporal frames and randomly sample frames from the images 2 frames before or after the labeled image. We then warp the groundtruth from the labeled frame to each temporal frame using a UFlow [27] model trained on the Waymo Open Dataset [43], consisting of 200,000 training images. To filter potential errors in the flow warping due to occlusions and other errors, we ignore warped labels based on the occlusion map generated by the flow model. For Cityscapes, we rely on a pre-trained semantic segmentation network for assigning class IDs, in particular an implementation of Panoptic-DeepLab [15] with an Xception-71 backbone, trained on Cityscapes. Our implementation has a mIOU of 68.8% on the Cityscapes validation set. As embeddings along the boundary are particularly challenging, and the Cityscapes labels are typically reliable, we dedicate half of the points when sampling in the contrastive loss to embeddings that are within b = 10 pixels of the boundary of each instance mask.
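A sketch of this boundary-biased sampling is given below. Boundary pixels are found by eroding each instance mask by b pixels; the erosion-based band, sampling with replacement, and the even split of the per-instance budget are illustrative choices rather than the exact implementation.

```python
import numpy as np
from scipy.ndimage import binary_erosion

def sample_pixels_with_boundary_bias(instance_map, k_per_instance, b=10):
    """Sample pixel indices per instance, with half of them near the boundary.

    instance_map:   (H, W) integer instance IDs (0 = background).
    k_per_instance: number of samples per instance.
    b:              boundary band width in pixels.
    Returns a list of (row, col) index arrays, one per instance.
    """
    rng = np.random.default_rng(0)
    samples = []
    for inst_id in np.unique(instance_map):
        if inst_id == 0:
            continue
        mask = instance_map == inst_id
        # Pixels within b pixels of the boundary: the mask minus its b-fold erosion.
        interior = binary_erosion(mask, iterations=b)
        boundary = mask & ~interior

        def draw(region, n):
            idx = np.argwhere(region)
            return idx[rng.integers(0, len(idx), size=n)] if len(idx) else idx

        half = k_per_instance // 2
        picks = np.concatenate([draw(boundary, half), draw(mask, k_per_instance - half)])
        samples.append(picks)
    return samples
```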
Fig. 7: Close-up analysis of our 3D LiDAR segmentation network, operating on range images (top-left), demonstrating accurate point-cloud instance segmentation.

B. Results on KITTI-360

Method | All | Bicycle | Building | Car | Motorcycle | Person | Rider | Truck
Single camera | 11.2 | 0.4 | 22.9 | 42.5 | 1.3 | 5.8 | 0.8 | 4.8
Temporal camera | 11.9 | 1.4 | 23.6 | 43.1 | 2.0 | 6.5 | 0.8 | 6.1
Single camera + LiDAR | 11.9 | 0.4 | 22.5 | 41.8 | 3.1 | 6.9 | 1.4 | 7.1
Temporal camera + LiDAR | 12.2 | 2.1 | 22.7 | 41.0 | 3.0 | 6.4 | 2.9 | 7.2
Temporal camera + Temporal LiDAR | 12.9 | 2.0 | 22.7 | 40.7 | 5.4 | 6.3 | 4.4 | 8.7

TABLE I: Ablation study on the effects of additional consistency on instance segmentation performance on a custom KITTI-360 train/val split. Adding additional inputs for the contrastive loss improves overall segmentation performance. Performance is measured in terms of AP (%).

In Table I, we report an ablation study on the effects of adding additional signals for the contrastive loss, both temporally and from different modalities. Overall, we observe that instance mAP for images improves as we add each additional frame to the consistency loss. The overall trend we observe is that the network improves significantly for rare classes such as bicycle, motorcycle, rider, and truck, while regressing slightly for the more common car class. This results in a significant overall mAP improvement. Note that, because the ablation models are trained on a much smaller training set, the mAP numbers in these experiments are naturally lower than those for the test set. We show qualitative examples from our best model (Temporal RGB + Temporal LiDAR) in Figure 6, where we demonstrate that the networks are able to predict embeddings that are consistent for objects across the sensor modalities. We also show our LiDAR instance segmentation results projected to 3D in Figure 7, demonstrating that the network can generate accurate 3D instance segmentation when operating only on the range image.

However, we did not observe similar improvements in 3D LiDAR mAP when training with the proposed cross-modality contrastive loss. Our observation is that, as the 2D labels are generated by projecting the 3D labels (and applying a CRF), they are typically much noisier than the 3D labels, and cannot provide useful signal for the 3D part of the model.

In Table II, we report a comparison of the mAP of our final method against the popular Mask R-CNN method [25] on the held-out test set, where we outperform the ResNet-50 baseline while approaching the performance of the ResNet-101 baseline. At the time of writing, the validation set for this dataset was not available, and better results will be achievable once the same training splits are available as for the baselines.

Method | AP (%)
Mask-RCNN ResNet-50 | 19.5
Mask-RCNN ResNet-101 | 20.9
Ours with temporal + LiDAR temporal | 20.3

TABLE II: Results on the KITTI-360 held-out test set.

C. Results on Cityscapes

Method | Approach | Input Size | AP (%) | Speed (ms)
Panoptic FPN [28] | TD | 512×1024 | 33.2 | -
UPSNet [49] | TD | 1024×2048 | 33.3 | 202 ms
Seamless [40] | TD | 1024×2048 | 33.6 | -
Panoptic-DeepLab [15] Xception-71 | TD¹ | 1025×2049 | 35.3 | 175 ms
SSAP [21] | BU | 1024×2048 | 31.5 | 260 ms+²
Ours base | BU-C | 1024×2048 | 29.0 | 182 ms
Ours temporal | BU-C | 1024×2048 | 30.9 | 182 ms

TABLE III: Cityscapes validation set results, without any test-time augmentation or additional training data for all methods. TD: Top-Down. BU: Bottom-Up. BU-C: Bottom-Up via Contrastive Learning. ¹Panoptic-DeepLab is a single-stage network, but relies on top-down object proposals of object centers. ²Reported time is for post-processing only and does not include the network forward pass.

In Table III, we report our instance segmentation results on Cityscapes compared to existing methods. The reported results were obtained on the Cityscapes validation set. To provide a fair comparison of the main methods themselves, we use the reported numbers for each method without test-time augmentation. In addition, we report mAP for our method using groundtruth semantic segmentations in Table IV, and compare against past work which has used a similar metric. This approach allows us to evaluate the instance segmentation network independently from the performance of the semantic segmentation. Overall, the improvement from our cross-modal consistency loss allows us to strongly outperform bottom-up methods such as Brabandere et al. [5] and Neven et al. [38] with groundtruth semantics. It also pushes our embedding-based method close to previous bottom-up works such as SSAP [21]. However, this work requires a complex post-processing procedure which has a significant latency, as reported in Table III. We underperform compared to Panoptic-DeepLab, but note that this method, while being single-shot, requires the prediction of object centers. These centers serve as top-down proposals, requiring instances to have distinguishable centers and some form of NMS. We also show visualizations of our predictions in Figure 8, where the embeddings can robustly and precisely represent a large variety of object instances.

Method | Approach | Input Size | AP (%) | Speed (ms)
Brabandere et al. [5] (ResNet-38) | BU-C | 384×768 | 29.0 | 200 ms
Neven et al. [38] | BU | 1024×2048 | 40.5 | 91 ms
Ours base | BU-C | 1024×2048 | 48.9 | 182 ms
Ours temporal | BU-C | 1024×2048 | 50.6 | 182 ms

TABLE IV: Cityscapes validation set results using groundtruth semantics for assigning classes to each instance. BU: Bottom-Up. BU-C: Bottom-Up via Contrastive Learning. For this setup, we significantly outperform other contrastive learning methods.

Fig. 8: Visualization of per-pixel embedding vectors on Cityscapes: our method allows us to obtain per-pixel embedding vectors from RGB input images (first row). The embeddings robustly and precisely disentangle object instances across different classes (third row). To visualize the embeddings, we project them into RGB color space (second row). In the last row, we show the ground truth instance masks. Please note that the colors of the predicted and ground truth masks are not supposed to match.

Temporal Constraints: To validate the impact of the temporal contrastive loss, we trained our model with and without contrasting over time (see Table III, Ours base vs. Ours temporal). From these results, introducing the contrastive loss over time has a significant impact on the predicted instance masks. Adding temporal frames to the contrastive loss improves overall AP by 2%. The visualization of the predicted embeddings in Figure 5 also shows fewer artifacts.

We also compute error metrics on the stability of the instance embeddings over time for the Cityscapes dataset. Given two frames, we warp the predicted embeddings of the next frame to the current frame and compute the cosine distance between the average embedding of each instance in the two time steps, while ignoring any pixels estimated as 'occlusion' from the optical flow. In addition, we compute accuracy as the proportion of instance pairs with cosine distance less than the clustering threshold m = 0.1. These results can be found in Table V. We observe that the majority of embeddings (75.1%) are stable over time, even without the proposed temporal contrastive loss. However, there are a number of cases, such as large motions and dense objects, where the embeddings are not stable. Overall, our proposed loss reduces the cosine distance between instances by 0.025 and improves accuracy by 8.3% to 83.4%.

Method | Cosine Distance ⇓ | Accuracy ⇑
Without Temporal Loss | 0.087 | 75.1
With Temporal Loss | 0.062 | 83.4

TABLE V: Temporal consistency metrics. We compute the cosine distance (in [0,1]) between the average embedding for each instance between two temporal frames. Accuracy is the percentage of instances with cosine distance < m = 0.1 over time. Training with temporal consistency significantly increases performance.
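The stability metric can be computed roughly as in the sketch below, which assumes the next-frame embeddings have already been warped into the current frame and that occluded pixels are masked out; it is an approximation of the evaluation for illustration rather than a faithful reproduction.

```python
import numpy as np

def temporal_stability(emb_t, emb_next_warped, instances, valid, m=0.1):
    """Cosine distance between per-instance mean embeddings of two frames.

    emb_t:           (H, W, c) embeddings of the current frame.
    emb_next_warped: (H, W, c) embeddings of the next frame, warped to frame t.
    instances:       (H, W) instance IDs shared by both frames (0 = background).
    valid:           (H, W) boolean mask, False for occluded pixels.
    Returns the mean cosine distance and the accuracy at threshold m.
    """
    dists = []
    for inst_id in np.unique(instances):
        if inst_id == 0:
            continue
        sel = (instances == inst_id) & valid
        if not np.any(sel):
            continue
        a = emb_t[sel].mean(axis=0)
        b = emb_next_warped[sel].mean(axis=0)
        cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
        dists.append(0.5 * (1.0 - cos))   # cosine distance scaled to [0, 1]
    dists = np.array(dists)
    return dists.mean(), (dists < m).mean()
```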
Embedding Similarity: In Figure 9, we show the results of an embedding similarity experiment. We select pixels in the image and compute the distance, shown in grayscale, of the selected embeddings with all other embeddings in the image. As illustrated, the learned embeddings allow us to disentangle object instances of the same class, across different classes, and the background. Our simple clustering scheme can generate high-quality masks by thresholding this similarity.

Fig. 9: Pixel similarity: for an image with objects of different classes (a), we randomly select different pixel positions in the image (yellow dot) and show the similarity of the selected embedding to all other embeddings as grayscale values. As shown, the learned embeddings allow us to meaningfully disentangle object instances, such as various different cars (b, c, d), intricate objects like bicycles (e), and small-scale objects, such as pedestrians (f).
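Such a similarity map can be produced directly from the dense embeddings, for example as in the following sketch (the rescaling to [0,1] for display is an arbitrary choice).

```python
import numpy as np

def similarity_map(embeddings, row, col):
    """Cosine similarity of the embedding at (row, col) to every other pixel.

    embeddings: (H, W, c) per-pixel embeddings.
    Returns an (H, W) map in [0, 1] that can be rendered as a grayscale image.
    """
    normed = embeddings / np.linalg.norm(embeddings, axis=-1, keepdims=True)
    query = normed[row, col]
    cos = normed @ query                 # (H, W) cosine similarities in [-1, 1]
    return 0.5 * (cos + 1.0)             # rescale to [0, 1] for visualization
```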
V. LIMITATIONS AND FUTURE WORK

The main contribution of this method is the contrastive learning of instance embeddings between sensors and over time. As it stands, we believe that the learned embeddings are strong, but we have chosen a relatively simple method to generate instance masks for evaluation. We use this post-processing to highlight the improvements in the embeddings themselves, where a more complex method may obscure such improvements. However, there is room for improvement if stronger AP scores are desired, by adding a graph cut optimization such as in SSAP [21] or a transformer head on top of the embeddings as in MaX-DeepLab [44].

VI. CONCLUSION

We have introduced a novel method for learning pixel-wise embeddings as a representation for individual object instances. Compared to other bottom-up approaches, we employ a contrastive learning scheme; embeddings of the same object instance are pushed closer together, while they are pulled away from other objects and the background. In particular, we leverage information between sensors, as well as over time, by applying our cross-modal contrastive loss on the union of all predicted embeddings for a given scene. Through quantitative experiments, we show that this contrastive loss is able to significantly improve instance segmentation performance and yields coherent instance embeddings between sensors and time, which are clustered to generate high-quality instance masks.

REFERENCES

[1] A. Arnab, S. Jayasumana, S. Zheng, and P. Torr. Higher order conditional random fields in deep neural networks. In ECCV, 2016.
[2] A. Arnab and P. Torr. Pixelwise instance segmentation with a dynamically instantiated network. In CVPR, pages 879–888, 2017.
[3] M. Bai and R. Urtasun. Deep watershed transform for instance segmentation. In CVPR, pages 2858–2866, 2017.
[4] A. Bewley, P. Sun, T. Mensink, D. Anguelov, and C. Sminchisescu. Range conditioned dilated convolutions for scale invariant 3D object detection. arXiv, 2020.
[5] B. De Brabandere, D. Neven, and L. Van Gool. Semantic instance segmentation with a discriminative loss function. arXiv, 2017.
[6] H. Caesar, V. Bankiti, A. H. Lang, S. Vora, V. E. Liong, Q. Xu, A. Krishnan, Y. Pan, G. Baldan, and O. Beijbom. nuScenes: A multimodal dataset for autonomous driving. arXiv, 2019.
[7] H. Chen, K. Sun, Z. Tian, C. Shen, Y. Huang, and Y. Yan. BlendMask: Top-down meets bottom-up for instance segmentation. In CVPR, 2020.
[8] L.-C. Chen, R. Gontijo Lopes, B. Cheng, M. D. Collins, E. D. Cubuk, B. Zoph, H. Adam, and J. Shlens. Naive-Student: Leveraging semi-supervised learning in video sequences for urban scene segmentation. In ECCV, pages 695–714, 2020.
[9] L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. PAMI, 40(4):834–848, 2018.
[10] L.-C. Chen, Y. Yang, J. Wang, W. Xu, and A. Yuille. Attention to scale: Scale-aware semantic image segmentation. In CVPR, pages 3640–3649, 2016.
[11] S. Chen, J. Fang, Q. Zhang, W. Liu, and X. Wang. Hierarchical aggregation for 3D instance segmentation. In CVPR, pages 15467–15476, 2021.
[12] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton. A simple framework for contrastive learning of visual representations. arXiv, 2020.
[13] X. Chen, R. Girshick, K. He, and P. Dollár. TensorMask: A foundation for dense object segmentation. In ICCV, pages 2061–2069, 2019.
[14] X. Chen, H. Ma, J. Wan, B. Li, and T. Xia. Multi-view 3D object detection network for autonomous driving. In CVPR, pages 1907–1915, 2017.
[15] B. Cheng, M. D. Collins, Y. Zhu, T. Liu, T. S. Huang, H. Adam, and L.-C. Chen. Panoptic-DeepLab: A simple, strong, and fast baseline for bottom-up panoptic segmentation. In CVPR, 2020.
[16] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. The Cityscapes dataset for semantic urban scene understanding. In CVPR, 2016.
[17] J. Dai, K. He, Y. Li, S. Ren, and J. Sun. Instance-sensitive fully convolutional networks. In ECCV, pages 534–549, 2016.
[18] J. Dai, K. He, and J. Sun. Convolutional feature masking for joint object and stuff segmentation. In CVPR, pages 3992–4000, 2015.
[19] F. Engelmann, M. Bokeloh, A. Fathi, B. Leibe, and M. Nießner. 3D-MPA: Multi-proposal aggregation for 3D semantic instance segmentation. In CVPR, pages 9028–9037, 2020.
[20] A. Fathi, Z. W., V. Rathod, P. Wang, H. O. Song, S. Guadarrama, and K. Murphy. Semantic instance segmentation via deep metric learning. arXiv, abs/1703.10277, 2017.
[21] N. Gao, Y. Shan, Y. Wang, X. Zhao, Y. Yu, M. Yang, and K. Huang. SSAP: Single-shot instance segmentation with affinity pyramid. CoRR, abs/1909.01616, 2019.
[22] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun. Vision meets robotics: The KITTI dataset. IJRR, 2013.
[23] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, pages 580–587, 2014.
[24] B. Hariharan, P. Arbelaez, R. Girshick, and J. Malik. Hypercolumns for object segmentation and fine-grained localization. In CVPR, 2015.
[25] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask R-CNN. In ICCV, pages 2961–2969, 2017.
[26] A. Hu, A. Kendall, and R. Cipolla. Learning a spatio-temporal embedding for video instance segmentation. arXiv preprint arXiv:1912.08969, 2019.
[27] R. Jonschkowski, A. Stone, J. Barron, A. Gordon, K. Konolige, and A. Angelova. What matters in unsupervised optical flow. In ECCV, 2020.
[28] A. Kirillov, R. Girshick, K. He, and P. Dollár. Panoptic feature pyramid networks. arXiv preprint arXiv:1901.02446, 2019.
[29] S. Kong and C. Fowlkes. Recurrent pixel embedding for instance grouping. In CVPR, 2018.
[30] L. Landrieu and M. Simonovsky. Large-scale point cloud semantic segmentation with superpoint graphs. In CVPR, pages 4558–4567, 2018.
[31] A. H. Lang, S. Vora, H. Caesar, L. Zhou, J. Yang, and O. Beijbom. PointPillars: Fast encoders for object detection from point clouds. In CVPR, pages 12689–12697, 2019.
[32] Z. Liang, Z. Li, S. Xu, M. Tan, and K. Jia. Instance segmentation in 3D scenes using semantic superpoint tree networks. In CVPR, pages 2783–2792, 2021.
[33] Y. Liao, J. Xie, and A. Geiger. KITTI-360: A novel dataset and benchmarks for urban scene understanding in 2D and 3D. arXiv, 2021.
[34] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In ECCV, pages 740–755, 2014.
[35] Z. Liu, X. Li, P. Luo, C.-C. Loy, and X. Tang. Semantic image segmentation via deep parsing network. In ICCV, pages 1377–1385, 2015.
[36] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, pages 3431–3440, 2015.
[37] J. Meyer, A. Eitel, T. Brox, and W. Burgard. Improving unimodal object recognition with multimodal contrastive learning. In IROS, pages 5656–5663, 2020.
[38] D. Neven, B. De Brabandere, M. Proesmans, and L. Van Gool. Instance segmentation by jointly optimizing spatial embeddings and clustering bandwidth. In CVPR, pages 8837–8845, 2019.
[39] A. Newell, Z. Huang, and J. Deng. Associative embedding: End-to-end learning for joint detection and grouping. In NIPS, pages 2274–2284, 2017.
[40] L. Porzi, S. Rota Bulò, A. Colovic, and P. Kontschieder. Seamless scene segmentation. In CVPR, pages 8277–8286, 2019.
[41] C. R. Qi, H. Su, K. Mo, and L. J. Guibas. PointNet: Deep learning on point sets for 3D classification and segmentation. In CVPR, pages 652–660, 2017.
[42] C. R. Qi, L. Yi, H. Su, and L. J. Guibas. PointNet++: Deep hierarchical feature learning on point sets in a metric space. In NeurIPS, 2017.
[43] P. Sun, H. Kretzschmar, X. Dotiwalla, A. Chouard, V. Patnaik, P. Tsui, J. Guo, Y. Zhou, Y. Chai, B. Caine, et al. Scalability in perception for autonomous driving: Waymo Open Dataset. In CVPR, pages 2446–2454, 2020.
[44] H. Wang, Y. Zhu, H. Adam, A. Yuille, and L.-C. Chen. MaX-DeepLab: End-to-end panoptic segmentation with mask transformers. In CVPR, pages 5463–5474, 2021.
[45] X. Wang, S. Liu, X. Shen, C. Shen, and J. Jia. Associatively segmenting instances and semantics in point clouds. In CVPR, pages 4091–4100, 2019.
[46] Y. Wang, A. Fathi, A. Kundu, D. A. Ross, C. Pantofaru, T. A. Funkhouser, and J. M. Solomon. Pillar-based object detection for autonomous driving. In ECCV, 2020.
[47] Y. Wang, Y. Yang, Z. Yang, L. Zhao, P. Wang, and W. Xu. Occlusion aware unsupervised learning of optical flow. In CVPR, pages 4884–4893, 2018.
[48] Y. Xiang, C. Xie, A. Mousavian, and D. Fox. Learning RGB-D feature embeddings for unseen object instance segmentation. In CoRL, 2020.
[49] Y. Xiong, R. Liao, H. Zhao, R. Hu, M. Bai, E. Yumer, and R. Urtasun. UPSNet: A unified panoptic segmentation network. In CVPR, 2019.
[50] B. Yang, W. Luo, and R. Urtasun. PIXOR: Real-time 3D object detection from point clouds. In CVPR, pages 7652–7660, 2018.
[51] L. Yi, W. Zhao, H. Wang, M. Sung, and L. J. Guibas. GSPN: Generative shape proposal network for 3D instance segmentation in point cloud. In CVPR, pages 3942–3951, 2019.
[52] F. Yu, V. Koltun, and T. Funkhouser. Dilated residual networks. In CVPR, pages 636–644, 2017.
[53] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia. Pyramid scene parsing network. In CVPR, pages 6230–6239, 2017.
[54] Y. Zhou and O. Tuzel. VoxelNet: End-to-end learning for point cloud based 3D object detection. In CVPR, pages 4490–4499, 2018.