Waymo Open Dataset: Panoramic Video Panoptic Segmentation

Jieru Mei1*, Alex Zihao Zhu2, Xinchen Yan2, Hang Yan2, Siyuan Qiao3, Yukun Zhu3, Liang-Chieh Chen3, Henrik Kretzschmar2, Dragomir Anguelov2
1Johns Hopkins University  2Waymo LLC  3Google Research

Abstract

Panoptic image segmentation is the computer vision task of finding groups of pixels in an image and assigning semantic classes and object instance identifiers to them. Research in image segmentation has become increasingly popular due to its critical applications in robotics and autonomous driving. The research community thereby relies on publicly available benchmark datasets to advance the state-of-the-art in computer vision. Due to the high costs of densely labeling the images, however, there is a shortage of publicly available ground truth labels that are suitable for panoptic segmentation. The high labeling costs also make it challenging to extend existing datasets to the video domain and to multi-camera setups. We therefore present the Waymo Open Dataset: Panoramic Video Panoptic Segmentation Dataset, a large-scale dataset that offers high-quality panoptic segmentation labels for autonomous driving. We generate our dataset using the publicly available Waymo Open Dataset, leveraging the diverse set of camera images. Our labels are consistent over time for video processing and consistent across multiple cameras mounted on the vehicles for full panoramic scene understanding. Specifically, we offer labels for 28 semantic categories and 2,860 temporal sequences that were captured by five cameras mounted on autonomous vehicles driving in three different geographical locations, leading to a total of 100k labeled camera images. To the best of our knowledge, this makes our dataset an order of magnitude larger than existing datasets that offer video panoptic segmentation labels. We further propose a new benchmark for Panoramic Video Panoptic Segmentation and establish a number of strong baselines based on the DeepLab family of models. We will make the benchmark and the code publicly available, which we hope will facilitate future research on holistic scene understanding. Find the dataset at https://waymo.com/open.
1 Introduction

Semantic visual scene understanding has been studied extensively for decades in the field of computer vision [58,65,17,35,81,74]. Researchers have tackled tasks of varying difficulty, ranging from segmenting distinct objects in individual camera images [26,24,42,9] to tracking and segmenting multiple objects in videos [72,66,14]. Robotic applications, such as autonomous driving, have led to new challenges and opportunities for semantic visual scene understanding [21,12].

*Work done as an intern at Waymo.
arXiv:2206.07704v1 [cs.CV] 15 Jun 2022

Figure 1: We provide panoptic segmentation labels for 100k camera images of the Waymo Open Dataset. Our dataset is grouped into 2,860 temporal sequences captured by five cameras, mounted on autonomous vehicles driving in three geographical locations. Instance segmentation labels are consistent both across cameras and over time. Our dataset offers diversity in terms of object classes, locations, weather, and time of day.

Modern autonomous vehicles tend to be equipped with multiple cameras and LiDAR scanners. The cameras provide rich semantic information about the scene, whereas the LiDAR scanners capture sparse, but geometrically highly accurate information. Autonomous vehicles need to be able to fuse and interpret the data streams from multiple sensors to build and maintain over time an accurate and consistent estimate of the world. One challenge when tracking and segmenting multiple objects is that objects of interest may leave the field of view of one camera to enter the field of view of another camera across consecutive video frames.

In this paper, we study the new task of video panoptic segmentation [33,30] for autonomous vehicles equipped with multiple cameras. See Fig. 1 for an illustration. Panoptic segmentation enables autonomous vehicles to reason about their surroundings in terms of semantic and geometric properties, such as fine-grained object contours. There are also important offboard applications, including auto-labeling [84,76,50] and camera sensor simulation [44,40,10].
On the one hand, most existing panoptic segmentation datasets [12,47] provide labels for individual camera images. This makes it difficult to train models that fuse information from multiple camera images, either temporally or by leveraging a multi-camera setup. On the other hand, datasets that provide panoptic segmentation labels for video data [30,71] tend to be scarce and much smaller than datasets for object detection and tracking for autonomous driving [21,61]. To bridge this gap, we present a new benchmark dataset for panoptic segmentation based on the popular Waymo Open Dataset (WOD). Specifically, we provide panoptic segmentation labels for video data that are consistent across five cameras mounted on the vehicles. We further present a benchmark that captures the task of multi-camera panoptic segmentation in video data for autonomous driving. Overall, we provide panoptic segmentation labels for 100k camera images, which we group into training (70%), validation (10%), and test (20%) sets. The training set consists of 2,800 sequences, each of which comprises labels for five cameras spanning 1.2 seconds and five temporal frames. In contrast, our validation and test sets consist of 60 longer sequences, in order to facilitate the evaluation of long-term tracking. Each validation and test sequence consists of 100 temporal frames, spanning the full 20s of a scene, while also providing labels across all five cameras. We extend the Segmentation and Tracking Quality (STQ) metric [71] to support our multi-camera setup by computing a weight for pixels depending on the cameras they correspond to. We also extend a state-of-the-art video panoptic segmentation method, ViP-DeepLab [51], to our multi-camera setup by training separate models on each camera view and by training a model on a panorama generated from all views. We present an extensive experimental evaluation on the proposed dataset and metric.

We publish the full dataset to enhance video panoptic segmentation research while also opening up the field of panoramic video panoptic segmentation.
2 Related Work

Panoptic Segmentation  The task of panoptic segmentation [33] aims to unify semantic segmentation [26] and instance segmentation [24], requiring the assignment of a class label and an instance ID to all pixels in an image. Modern panoptic segmentation systems can be roughly categorized into top-down (or proposal-based) [32,49,36,41,73,69] and bottom-up (or proposal-free) [80,20,67,11,68] approaches. Our adopted baseline methods belong to the bottom-up category.

Video Panoptic Segmentation  Extending panoptic segmentation to the video domain, Video Panoptic Segmentation (VPS) [30] requires generating instance tracking IDs (i.e., temporally consistent instance IDs) along with panoptic segmentation results across video frames. Current VPS datasets are small-scale in terms of semantic classes and sizes. Specifically, Cityscapes-VPS [30] sparsely annotates (every fifth frame) Cityscapes [12] video sequences, resulting in only 3,000 frames with 19 semantic classes for training and testing. Recently, STEP [71] extended KITTI-MOTS [21,66] and MOTS-Challenge [66,14] for VPS. However, their annotated datasets are still small-scale (18K annotated frames with 19 semantic classes for KITTI-STEP, and 2K frames with 8 classes for MOTChallenge-STEP), and the video sequences are captured by only a single front-view camera. In contrast, our annotated dataset presents the first large-scale VPS annotations and extends to the multi-camera scenario.

Segmentation Benchmarks  There are other popular video segmentation benchmarks in the literature, e.g., VSPW [45] for video semantic segmentation, and MOTS [66] and Youtube-VIS [79] for video instance segmentation. Our benchmark is also related to urban scene understanding, where typical benchmarks include [4,21,39,12,47,6,82,2,61,5,37,28,83,78,38,22]. Our work is most related to WildPASS [78], which also aims to endow machines with large field-of-view perception. However, building on top of the large-scale Waymo Open Dataset [61], our benchmark provides many more high-quality annotated video sequences.

Table 1: Dataset comparison. Our WOD:PVPS is a new large-scale panoramic video panoptic segmentation dataset. †WildPASS contains 500 panoramas.

dataset statistics  | WOD:PVPS (ours) | WildPASS [78] | Cityscapes-VPS [30] | KITTI-STEP [71] | MOT-STEP [71]
#sequences          | 2,860           | -             | 500                 | 50              | 4
#images             | 100,000         | 500†          | 3,000               | 19,103          | 2,075
#tracking classes   | 8               | -             | 8                   | 2               | 1
#semantic classes   | 28              | 8             | 19                  | 19              | 7
panoramic           | ✓               | ✓             | ✗                   | ✗               | ✗
video panoptic      | ✓               | ✗             | ✓                   | ✓               | ✓

Multi-Camera Multi-Object Tracking  Consistently tracking objects across multiple cameras, multi-camera multi-object tracking [19,15,3,55,27,13,1,53] has been a popular research topic in the computer vision community. Typical benchmarks [18,34,75,52,7,62,23] only track a single class (e.g., people or vehicles) with bounding boxes, while our proposed benchmark demands pixel-level tracking and segmentation for multiple classes.

Panoramic Semantic Segmentation  Panoramic semantic segmentation provides surround-view perception [59,46,63,85,78,77], but is limited to semantic segmentation without temporal and instance-level understanding. Our work is similar, but additionally tackles video panoptic segmentation. Recently, [54,48] predict bird's-eye view semantic segmentation using multi-camera inputs.

3 WOD: PVPS Dataset

In this section, we first recap the existing Waymo Open Dataset (WOD) [61], one of the largest and most diverse multi-sensor datasets in the autonomous driving domain. We leverage the existing data that comes with coarse-level annotations (e.g., 2D and 3D bounding boxes) as the foundation, and subsample images for our dataset. We then provide an overview of our WOD:PVPS dataset, including panorama generation, statistics of the semantic classes, and temporal frame sampling. Finally, we explain in detail our hybrid scheme to address the challenges in multi-camera and video labeling.
We obtain consistent instance IDs across temporal frames and cameras by associating the panoptic labels from each individual image with the existing box-level annotations.

3.1 Dataset Overview

The Waymo Open Dataset contains 1,150 scenes, each consisting of 20 seconds of data captured at 10Hz (i.e., 10 frames per second, and thus 200 frames per scene). Each data frame in the dataset includes 3D point clouds from the LiDAR devices, images from five cameras (positioned at Front, Front-Left, Front-Right, Side-Left, and Side-Right), and ground truth 3D and 2D bounding boxes annotated by humans in the LiDAR point clouds and camera images, respectively. Each bounding box contains an ID that is unique to that object across the entirety of each scene. For the LiDAR data, this allows for tracking in the whole scene. For the camera data, these IDs are consistent within each camera's images only.

Figure 2: Histogram of the 28 semantic categories in our dataset in terms of their pixel distributions. The vertical axis denotes the number of pixels for each class in log scale. We provide instance IDs for classes marked with diamonds.

Figure 3: Super-class distributions for each camera. Each camera sees a different distribution of classes, due to their fixed positions and different fields of view.

Built on top of the WOD, our WOD:PVPS dataset consists of 100,000 images with panoptic segmentation labels using a prescribed train, validation, and test set split, subsampled from the existing 1.15 million images. In Tab. 1, we compare our proposed WOD:PVPS dataset with the public datasets for video panoptic segmentation. Our dataset is the only one that provides panoptic segmentation annotations that are consistent both across multiple cameras and across time. Furthermore, our dataset is much larger both in terms of the number of frames and the number of semantic classes than existing datasets [30,78,71].

Equirectangular Panorama  We reconstruct the equirectangular panorama (220° coverage from five cameras) by stitching the individual camera images as an alternative input format for our dataset. Specifically, we first use the extrinsics and intrinsics of the five cameras provided by WOD to unproject each pixel coordinate to 3D space.
We then set a virtual camera [60] located at the geometric mean of all five camera centers and compute the pixel colors by equirectangular projection from the 3D space with bilinear sampling. For pixels corresponding to multiple camera views, we compute the weights based on the distance of each pixel in the panorama to each of the camera views' boundaries. For panoptic labels, we compute the labels in each camera view given the camera parameters of the five cameras and the virtual camera using nearest-neighbor sampling. We then use the method of Qiao et al. [51] to stitch the panorama labels to maintain view consistency. Finally, we fuse the five panorama labels together based on the correspondences and the distances to the camera views' boundaries. There are more sophisticated methods [57,64] that leverage cross-frame information and the geometry captured by LiDAR sensors, which could potentially improve panorama generation. We leave this as an open research topic for the future.

Semantic Class Distribution  In total, our dataset contains 28 semantic categories, outlined with their frequency in pixels in Fig. 2. In addition, we provide instance IDs for most of the classes under the vehicle and human super-classes, as they are the major dynamic categories in the autonomous driving space. We also show the pixel distribution for each camera view in Fig. 3, where we see notable differences in the distributions for each camera. For example, the front camera covers more flat (e.g., road surface) and sky pixels than the rest of the cameras, while the side-left camera covers more vehicle pixels due to the ego-vehicle driving on the right-hand side of the road. This analysis is important, as machine learning models trained on images captured by a single camera from existing datasets may not necessarily generalize to the other cameras due to large domain gaps across different cameras. In contrast, our proposed task has an emphasis on holistic scene understanding, which grants our WOD:PVPS dataset unique value to the research community.

Temporal Frame Sampling for Human Annotations  To maximize the diversity of the images in the training set, we subsample sparsely from each scene, labeling chunks of five-frame sequences from all the cameras.
We start by randomly selecting 700 out of the 798 scenes. For each scene, which typically has 200 frames, we annotate four sets of five-frame sequences, starting at frame indices {25, 50, 125, 150} (i.e., we pick the 25th, 50th, 125th, and 150th frames as the first frame of each five-frame sequence for annotation). For each set, we further select frames with offsets {0, 4, 6, 8, 12} w.r.t. the first frame for annotation. For example, the first set of five-frame sequences will contain frames with indices {25, 29, 31, 33, 37}. Our sparse sampling strategy provides a variety of sequence lengths, allowing users to train on frame pairs with a time difference as small as two frames (0.2 seconds) and as large as 12 frames (1.2 seconds). As a result, our training set contains groups of five temporal frames across all five cameras, yielding 2,800 sequences of 25 images (5 temporal frames × 5 cameras), or 70,000 images in total. Finally, we provide the associations between each instance ID and the corresponding 3D LiDAR bounding box, allowing us to compute very long associations (up to 13.7 seconds between all four sequences) if an object persists across multiple sequences in the same scene.

For the validation and test sets, we aim to enable the testing of long-term consistency across cameras and frames. We therefore densely sample frames at 5Hz from chunks of 100 frames across all cameras (i.e., every other frame is sampled from the 200-frame sequence). We select 20 and 40 scenes for the validation and test sets, respectively, maintaining diversity in the location, object density, and time-of-day distributions of WOD. In contrast to the training set, we label every other frame for each selected scene, resulting in sequences with 100 temporal frames across all five cameras. In the end, our validation set contains annotations for 20 sequences of 500 images (100 temporal frames × 5 cameras), and our test set consists of 40 sequences of 500 images, for a total of 10,000 and 20,000 annotated images for the validation and test sets, respectively.
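The training-split sampling scheme above is easy to reproduce programmatically. The following is a minimal sketch; the constant and function names are ours, not from the official dataset tooling.

```python
# Frame indices annotated per training scene, as described in the text.
START_FRAMES = [25, 50, 125, 150]   # first frame of each five-frame sequence
OFFSETS = [0, 4, 6, 8, 12]          # offsets w.r.t. the first frame (at 10 Hz)

def training_annotation_frames():
    """Return the four five-frame index sequences annotated for one scene."""
    return [[start + off for off in OFFSETS] for start in START_FRAMES]

sequences = training_annotation_frames()
# The first sequence covers frames 25, 29, 31, 33, 37, so frame pairs within
# a sequence are between 2 frames (0.2 s) and 12 frames (1.2 s) apart.
```

This makes the claimed pair spacings explicit: within any sequence, the closest annotated pair is 2 frames apart and the farthest is 12 frames apart.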
The test set annotations will not be made publicly available; instead, we will prepare a test server to evaluate on the held-out test set, once the dataset is released.

3.2 Associating Instance IDs Across Cameras and Frames

In constructing the panoramic video panoptic segmentation dataset, ensuring that the annotations have consistent instance IDs across cameras and temporal frames is one of the major challenges. Manual labeling is a straightforward option, but is time-consuming and expensive at large scales. In addition, it is difficult to develop an effective labeling interface that allows human annotators to iteratively refine instance labels across cameras and temporal frames.

Figure 4: Labeling and Association Overview. Human annotators first label each camera image for panoptic segmentation separately (step 1). LiDAR points within each ground truth 3D bounding box are then projected to each image, and associated with the single-frame instance labels (step 2). For far-range instances without corresponding 3D bounding boxes, we associate the single-frame instance labels over time using the ground truth 2D bounding boxes within each camera (step 3). New associations are highlighted in the zoomed-in views at the bottom.

We instead assigned human annotators to label each camera image for panoptic segmentation separately, and employed a hybrid scheme that leverages the existing coarse-level annotations in WOD. The coarse-level annotations include (i) 3D bounding boxes with corresponding IDs that are consistent across all frames and cameras; and (ii) 2D bounding boxes with IDs that are consistent across temporal frames, but annotated independently for each camera.
Associations were then computed between each instance and its corresponding 3D LiDAR boxes and 2D camera boxes. Instances determined to correspond to the same object are then mapped to the same ID in all frames across cameras. A sample sequence from this process can be found in Fig. 4.

For a given frame with instance labels, 3D point clouds, and 3D bounding boxes, we associate instances with boxes by filtering the LiDAR points within each box and projecting them onto the image. Association scores are then computed using the IoU between the convex hull of the projected LiDAR points and each instance label. Bipartite matching is then applied to match each projected box with an instance label. For 3D driving scenes, points inside a bounding box almost entirely correspond to the instance inside it, and so these projected LiDAR points have a high overlap with their corresponding instance masks in the image. Our label association step is related to the prior work [28,38], but our association leverages the ground-truth labeled 3D boxes and only transfers instance IDs rather than fine-grained per-pixel labels.

There are, however, a small number of instances without corresponding LiDAR ground truth boxes due to occlusions, rolling shutter artifacts, and the limited range of the provided LiDAR scans (75m). We apply an additional matching step by associating the 2D bounding boxes with our instance labels. First, we score matches between 2D boxes and instances by computing the IoU between each 2D box and the tightly-fitting bounding box around each instance mask, and then compute associations with bipartite matching.

For boxes with existing 3D associations, we extend these tracks by propagating the existing ID to all other instances that match with the same 2D box. This resolves cases where an object track misses 3D associations in only a few frames. Then, we assign the remaining unmatched instances the ID of their corresponding 2D box, if any. Finally, to capture any additional cross-camera associations, we project all of the camera views onto the panorama, and associate instances which overlap in this joint representation.
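The core of the association steps above is "score every box/instance pair by IoU, then match one-to-one". The sketch below illustrates that pattern under simplifying assumptions: masks and projected box footprints are modeled as plain sets of pixels, and the one-to-one matching is greedy rather than optimal bipartite matching; the function names and the 0.1 IoU floor are ours.

```python
def iou(a, b):
    """Intersection-over-union between two sets of (x, y) pixels."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def associate(box_footprints, instance_masks, min_iou=0.1):
    """Greedy one-to-one box/instance matching; returns {box_id: instance_id}.

    box_footprints: {box_id: pixel set} (e.g., hull of projected LiDAR points)
    instance_masks: {instance_id: pixel set} from single-frame panoptic labels
    """
    scores = [(iou(fp, mask), box_id, inst_id)
              for box_id, fp in box_footprints.items()
              for inst_id, mask in instance_masks.items()]
    matches, used_boxes, used_insts = {}, set(), set()
    # Visit pairs in order of decreasing IoU; each box and instance is
    # consumed at most once, approximating the paper's bipartite matching.
    for score, box_id, inst_id in sorted(scores, reverse=True):
        if score < min_iou or box_id in used_boxes or inst_id in used_insts:
            continue
        matches[box_id] = inst_id
        used_boxes.add(box_id)
        used_insts.add(inst_id)
    return matches
```

An optimal assignment (e.g., the Hungarian algorithm) would replace the greedy loop in a production implementation; the greedy version already behaves identically whenever the IoU matrix is near-diagonal, which the text argues is the common case for driving scenes.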
In order to identify any instances that are still not associated with any ground truth boxes after these steps, we provide an additional mask for these instance pixels indicating that they are not tracked, similar to the crowd mask used in single-frame instance segmentation labels [12].

4 Benchmark and Evaluation Metrics

In this section, we first describe the task of Panoramic Video Panoptic Segmentation (PVPS). We then review the evaluation metrics used in the literature, and propose a new metric designed for PVPS with an emphasis on consistent multi-object tracking and segmentation across multiple cameras.

4.1 Problem Definition

We represent a multi-camera video sequence with $T$ frames and $M$ independent camera views as $\{I_i^{1:T}\}_{i=1}^{M}$, where $I_i^t$ is the $i$-th camera view captured at the $t$-th time step in the video sequence. Along with the multi-view representation of the full scene, we define the panorama at the $t$-th time step as $I_{pano}^t$. The task of Panoramic Video Panoptic Segmentation (PVPS) requires a mapping $f$ of every pixel $(x, y, t, i)$ in the multi-camera video sequence to a semantic category $c \in \mathcal{C}$ and an instance ID $z$ that is consistent across camera views and temporal frames. Here, $(x, y, t, i)$ indicates the spatial coordinate $(x, y)$ of the $i$-th camera view captured at the $t$-th time step, and $\mathcal{C}$ is the set of semantic categories. Accordingly, we define the mappings $f_{id}$ and $f_{sem}$ for a particular instance ID $z$ and semantic category $c$ in Eq. (1) and Eq. (2), respectively. The mapping functions are the building blocks of our proposed metric introduced in Sec. 4.2.

$$f_{id}(z) = \{(x, y, i, t) \mid f(x, y, i, t) = (c, z),\ c \in \mathcal{C}\}, \qquad (1)$$

$$f_{sem}(c) = \{(x, y, i, t) \mid f(x, y, i, t) = (c, *),\ c \in \mathcal{C}\}. \qquad (2)$$

Compared to the existing tasks, including Video Panoptic Segmentation (VPS) and Panoramic Semantic Segmentation, the proposed task is more challenging in the following aspects. First, each individual camera has its own unique viewpoint and field-of-view, such that the semantic class statistics differ across cameras (e.g., see Fig. 3). This leads to a large domain gap between videos captured with different cameras. Second, the instance ID prediction, with long-term consistency across both time and cameras, requires holistic scene understanding.
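The mappings of Eq. (1) and Eq. (2) can be illustrated on a toy labeling. In this sketch, the mapping $f$ is a dictionary from pixel tuples $(x, y, i, t)$ to (class, instance ID) pairs; the function names mirror the notation but are otherwise our own.

```python
def f_id(f, z):
    """Eq. (1): pixels carrying instance ID z, across all cameras and frames."""
    return {p for p, (c, zp) in f.items() if zp == z}

def f_sem(f, c):
    """Eq. (2): pixels carrying semantic class c, regardless of instance ID."""
    return {p for p, (cp, _) in f.items() if cp == c}

# A three-pixel toy labeling: two "car" pixels with different instance IDs
# in camera 0, and one "road" pixel in camera 1, all at time step 0.
labels = {
    (0, 0, 0, 0): ("car", 1),
    (1, 0, 0, 0): ("car", 2),
    (0, 0, 1, 0): ("road", 0),
}
```

Note that $f_{id}$ pools pixels across camera index $i$ and time $t$: this is exactly what makes the task panoramic and video-based, since the same instance ID must be produced in every view and frame where the object appears.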
4.2 Evaluation Metrics

In this subsection, we overview the existing Video Panoptic Segmentation (VPS) metric, Segmentation and Tracking Quality (STQ) [71], which we extend to evaluate the Panoramic Video Panoptic Segmentation (PVPS) task.

VPS Metric  We use $f$ and $g$ to indicate the prediction and the ground-truth mapping, respectively. We define the true positive associations (TPA) [43] of a specific instance as $TPA(z_f, z_g) = |f_{id}(z_f) \cap g_{id}(z_g)|$, where $z_f$ is the predicted instance, $z_g \in G$ is the ground-truth instance, and $G$ is the set containing all unique ground-truth instances across cameras and temporal frames. Similarly, false negative associations (FNA) and false positive associations (FPA) can be defined to compute the Intersection over Union ($IoU_{id}$) for evaluating tracking quality. Formally, STQ is defined as follows:

$$\text{STQ} = (\text{AQ} \times \text{SQ})^{\frac{1}{2}}, \qquad (3)$$

$$\text{AQ} = \frac{1}{|G|} \sum_{z_g \in G} \frac{1}{|g_{id}(z_g)|} \sum_{z_f,\ z_f \cap z_g \neq \emptyset} \text{TPA}(z_f, z_g) \times \text{IoU}_{id}(z_f, z_g),$$

$$\text{SQ} = \frac{1}{|\mathcal{C}|} \sum_{c \in \mathcal{C}} \frac{|f_{sem}(c) \cap g_{sem}(c)|}{|f_{sem}(c) \cup g_{sem}(c)|}.$$

As defined in Eq. (3), STQ fairly balances segmentation and tracking performance, and is suitable for evaluating video sequences of arbitrary length. The Association Quality (AQ) measures the association quality for tracking classes, while the Segmentation Quality (SQ) measures the segmentation quality for semantic classes. Specifically, AQ involves the $IoU_{id}$ computation for predicted instance IDs (further weighted by true positive associations to encourage long-term tracking [71]), while SQ is the typical semantic segmentation metric [16] (i.e., mean $IoU_{sem}$ over predicted semantic classes).

Figure 5: Visualization of the weights tensor for all cameras (Side Left, Front Left, Front, Front Right, Side Right). Pixels in the blue region have weight 0.5 during evaluation, as they are covered by two cameras.

PVPS Metric  We propose to extend STQ [71] to Panoramic Video Panoptic Segmentation (PVPS). However, naïvely adopting STQ in the multi-camera scenario results in a potential issue: pixels in the overlapping regions covered by multiple cameras will be counted multiple times.
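As a concrete reference for Eq. (3), the STQ computation can be sketched on toy pixel sets. This is a simplified illustration, not the official metric code: pixels are plain Python sets of $(x, y, i, t)$ tuples, $IoU_{id}$ is taken as ordinary set IoU, and the handling of void labels and per-pixel weights is omitted.

```python
def set_iou(a, b):
    """Plain set IoU, standing in for IoU_id and per-class IoU_sem."""
    return len(a & b) / len(a | b) if a | b else 0.0

def stq(pred_insts, gt_insts, pred_sems, gt_sems):
    """Toy STQ per Eq. (3).

    pred_insts, gt_insts: {instance_id: pixel set} across cameras and frames.
    pred_sems, gt_sems:   {semantic_class: pixel set}.
    """
    # AQ: for each ground-truth track, sum TPA-weighted IoU_id over all
    # overlapping predictions, normalized by the track's size.
    aq = 0.0
    for g in gt_insts.values():
        total = sum(len(f & g) * set_iou(f, g)
                    for f in pred_insts.values() if f & g)
        aq += total / len(g)
    aq /= len(gt_insts)
    # SQ: mean per-class semantic IoU.
    sq = sum(set_iou(pred_sems.get(c, set()), g)
             for c, g in gt_sems.items()) / len(gt_sems)
    return (aq * sq) ** 0.5
```

For a perfect prediction, both AQ and SQ are 1, and hence STQ is 1; both missed pixels and ID switches lower the score, since a switched ID splits the TPA-weighted sum across several small-IoU predictions.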
Instead, we employ a simple and effective solution by exploiting the pixel-centric property of STQ. In particular, we weight each pixel prediction w.r.t. its coverage by the number of cameras, as determined by the mapping between the camera images and the panorama image. For example, if a pixel is covered by $N$ cameras (in our dataset, $N = 2$), its prediction will contribute $1/N$ when computing AQ and SQ. We name the resulting metric weighted STQ (wSTQ), since each pixel prediction takes a different weight depending on its coverage by the number of cameras. In Fig. 5, we visualize the weights for an example of five camera images.

PS Metric  We also briefly review the metric PQ (panoptic quality) [33] for evaluating image Panoptic Segmentation (PS), since we will build image-level baselines trained purely with image panoptic annotations.

For a particular semantic class $c$, the sets of true positives ($TP_c$), false positives ($FP_c$), and false negatives ($FN_c$) are formed by matching predictions $z_f$ to the ground-truth masks $z_g$ based on their IoU scores. A minimal threshold of greater than 0.5 IoU is chosen to guarantee unique matching. Formally,

$$PQ_c = \frac{\sum_{(z_f, z_g) \in TP_c} \text{IoU}(z_f, z_g)}{|TP_c| + \frac{1}{2}|FP_c| + \frac{1}{2}|FN_c|}, \qquad (4)$$

where the final PQ is obtained by averaging $PQ_c$ over the semantic classes.

Figure 6: We experiment with two evaluation schemes: (a) View and (b) Pano. The View evaluation scheme takes individual camera views as input and generates their panoptic predictions, which are then "stitched over cameras" to obtain consistent instance IDs between cameras. The Pano evaluation scheme takes panorama images as input and generates panoramic panoptic predictions, which are then reprojected back to each camera for evaluation.

5 Experimental Results

In this section, we introduce our PVPS baselines, which exploit the property of multi-camera images by taking as input either individual camera views or panorama images (generated from all camera views).
We then provide extensive experiments on the proposed dataset and metric.

5.1 ViP-DeepLab Extensions as PVPS Baselines

To tackle the new and challenging PVPS task, we extend the state-of-the-art video panoptic segmentation method, ViP-DeepLab [51], to panoramic views.

Baseline Overview  For completeness, we first briefly review ViP-DeepLab [51]. ViP-DeepLab extends the state-of-the-art image panoptic segmentation model, Panoptic-DeepLab [11], to the video domain. Panoptic-DeepLab employs two separate prediction branches for semantic segmentation [9] and instance segmentation [29], respectively. Both segmentation results are then merged [80] to form the final panoptic segmentation result. To perform video panoptic segmentation, ViP-DeepLab adopts a two-frame image panoptic segmentation framework. Specifically, during training, ViP-DeepLab takes a pair of image frames as input and their panoptic segmentation ground truths as the training target. During inference, ViP-DeepLab performs two-frame image panoptic prediction at each time step, and continues the inference process for every two consecutive frames (i.e., with one overlapping frame at the next time step) in a video sequence. The predictions in the overlapping frames are "stitched" together by propagating instance IDs based on the mask IoU between region pairs (i.e., if two masks have a high IoU overlap, they are re-assigned the same instance ID), and thus temporally consistent IDs are obtained (see Fig. 4 of Qiao et al. [51] for an illustration). We refer to this post-processing as "panoptic stitching over time".

Baseline Extension for PVPS  We explore several ViP-DeepLab extensions for PVPS, which take as input individual camera views or panorama images (generated from all camera views). The input types can differ between training and evaluation. Specifically, we define three training schemes: View, Pano, and Ensemble-View. The View scheme refers to the case where ViP-DeepLab is trained with images from all camera views, while Pano means the model is trained with full panorama images.
The Ensemble-View scheme refers to the case where we have five camera-specific ViP-DeepLab models, each of which is trained and evaluated on its own camera images. We also have two evaluation schemes: View and Pano. The View scheme refers to the case where the trained model is fed with images from individual camera views and generates the corresponding panoptic predictions for each view. However, the predicted instance IDs are not consistent between cameras, since the predictions are made independently for each view. To generate consistent instance IDs between cameras, we propose a method similar to "panoptic stitching over time": if two masks have a high IoU overlap in the overlapping regions between two cameras' fields of view, we re-assign the same instance ID to them, resulting in the "panoptic stitching over cameras" post-processing method. For the Pano evaluation scheme, the model is fed with panorama images and generates panoramic panoptic predictions. We then re-project the panoramic panoptic predictions onto each camera for evaluation. Note that, for the Pano evaluation scheme, the instance IDs are consistent between cameras by nature. We visualize the evaluation schemes in Fig. 6.

Implementation Details  We build our image-based and video-based baselines on top of Panoptic-DeepLab [11] and ViP-DeepLab [51], respectively, using the official code-base [70]. The training strategy follows Panoptic-DeepLab and ViP-DeepLab. Specifically, the models are trained with 32 TPU cores for 60k steps, batch size 32, the Adam [31] optimizer, and a poly schedule learning rate of 2.5×10⁻⁴. We use an ImageNet-1K-pretrained [56] ResNet-50 [25] with stride 16 as the backbone (using atrous convolution [8]). For image-based methods, we use a crop size of 1281 × 1921 during training, while, during inference, we use the whole image (or panorama). We use a similar strategy for the video-based methods, but we use a ResNet-50 backbone with stride 32 and crop size 641 × 961 due to memory constraints.

5.2 Qualitative Evaluation

In Fig.
7, we provide qualitative results from our two ViP-DeepLab baselines, one trained and evaluated on single camera images and the other on panorama images (i.e., one model uses the View scheme for both training and evaluation, and the other uses the Pano scheme for both), over two (non-adjacent) temporal frames. From these results, we can see that the baseline models are able to accurately track objects in very dense scenes. In addition, we note that the panorama model provides some qualitative benefits in these examples. In particular, the single-view model produces an inconsistent prediction on the crosswalk in the left and right images, but the panorama model is able to attain the full context of the scene and avoids this mistake. In addition, the single-view models fail to track the car crossing the front-right and side-right cameras at t₀, but the panorama model is again able to track this object correctly.

Figure 7: Comparison of qualitative results from our baseline ViP-DeepLab [51] models over different time intervals. Results show models trained on single images with panoptic stitching over cameras, and trained directly on panorama images. Our baseline models show strong performance for the majority of the scene, although tracking small/distant objects and crowded scenes remains challenging.

5.3 Baseline Comparisons

Video-based Baselines  In Tab. 2(a), we provide video-based baseline comparisons using ViP-DeepLab [51], evaluated with the proposed weighted STQ (wSTQ) metric. We compare different training and evaluation schemes. As shown in the table, when both are evaluated with the View scheme, training with the View scheme performs better than training with Ensemble-View by 0.86% wSTQ. That is, training a single model on all camera views performs better than training five camera-specific models, each on its own camera views. Also, when training with the View scheme, using the Pano
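The "panoptic stitching over cameras" post-processing described in Sec. 5.1 can be sketched in a few lines. In this illustration, instance masks are modeled as sets of panorama pixel coordinates restricted to the seam between two adjacent views; the 0.5 IoU threshold and all names are our assumptions rather than the paper's exact choices.

```python
def set_iou(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def stitch_ids(left_masks, right_masks, overlap, thresh=0.5):
    """Return {right_id: left_id} for instances merged across a camera seam.

    left_masks, right_masks: {instance_id: pixel set} from the two views,
    in a shared (panorama) coordinate frame.
    overlap: pixel set covered by both cameras' fields of view.
    """
    remap = {}
    for rid, rmask in right_masks.items():
        best_id, best_iou = None, thresh
        for lid, lmask in left_masks.items():
            # Compare masks only inside the overlap region, where both
            # cameras observe the same part of the scene.
            iou = set_iou(lmask & overlap, rmask & overlap)
            if iou > best_iou:
                best_id, best_iou = lid, iou
        if best_id is not None:
            remap[rid] = best_id
    return remap
```

Applying the returned remapping to the right view's prediction yields IDs that agree with the left view, which is exactly the property the wSTQ metric rewards in the overlap region.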
Table 2: Quantitative evaluation. During training and evaluation, the baselines can take different types of inputs. View: individual camera views; Pano: panoramas; Ensemble-View: camera-specific views. Results include (a) the video-baseline comparison using ViP-DeepLab, measured by weighted Segmentation and Tracking Quality (wSTQ); and (b) the image-baseline comparison using Panoptic-DeepLab, measured by Panoptic Quality (PQ) and mean Intersection-over-Union (mIoU).

(a) Video-baseline Comparison

Training Scheme | Eval Scheme | wSTQ  | wAQ  | wSQ
Ensemble-View   | View        | 16.92 | 7.61 | 37.33
View            | View        | 17.78 | 8.21 | 38.46
View            | Pano        | 14.87 | 6.13 | 36.04
Pano            | View        | 17.56 | 8.11 | 38.04
Pano            | Pano        | 15.72 | 6.22 | 39.78

(b) Image-baseline Comparison

Training Scheme | Eval Scheme | PQ    | mIoU
Ensemble-View   | View        | 35.70 | 48.15
View            | View        | 40.00 | 53.64
View            | Pano        | 33.65 | 50.61
Pano            | View        | 38.93 | 51.65
Pano            | Pano        | 36.32 | 52.19

Table 3: View transferability of our video-based baselines, measured by wSTQ. We evaluate models (1st column) trained on a specific view w.r.t. other camera views. The last row, MultiCamera, refers to the model trained with all camera views (i.e., training scheme View), and the last column, All, denotes the evaluation set using all camera views (i.e., evaluation scheme View).

Model \ Eval | Side Left | Front Left | Front | Front Right | Side Right | All
Side Left    | 18.79     | 17.41      | 14.56 | 19.06       | 19.40      | 16.31
Front Left   | 16.88     | 18.39      | 12.84 | 19.22       | 18.49      | 15.36
Front        | 16.58     | 18.02      | 14.54 | 18.55       | 18.96      | 15.98
Front Right  | 16.56     | 17.36      | 14.99 | 19.40       | 19.16      | 16.18
Side Right   | 17.91     | 16.50      | 13.16 | 18.23       | 20.47      | 15.65
MultiCamera  | 20.11     | 19.54      | 15.63 | 20.67       | 21.53      | 17.78

evaluation scheme degrades the performance by 2.91% wSTQ. When training with the Pano scheme, using the View scheme for evaluation is better than using the Pano scheme. We attribute this to the asymmetry between the training and evaluation settings: we could not use whole panorama images at resolution 1000 × 5875 as input during training (due to memory limits), and thus we only use a smaller crop size of 641 × 961. Ideally, the model should be evaluated in the same setting as its training.
The current best setting is trained and evaluated with the View scheme, reaching 17.78% wSTQ. We observe that our dataset is very challenging in terms of both tracking and segmentation, since our best wAQ is only 8.21% and our best wSQ is 39.78%.

Image-based Baselines In Tab. 2(b), we provide image-based baseline comparisons using Panoptic-DeepLab [11], evaluated by the image panoptic segmentation metric PQ [33] and the semantic segmentation metric mIoU [16]. We observe the same trends for the image-based baselines as for the video-based baselines.

5.4 Ablation Studies

Transferability of Models between Viewpoints In this ablation, we measure the ability to transfer models trained on one viewpoint to a different viewpoint. As shown in Tab. 3, we make the following observations. First, all models, even those trained on left side views, perform better on right side views. This phenomenon is due to the ego-vehicle driving on the right side of the road, which yields a wider scope, more instances, and smaller objects on the left side (i.e., the left side views are more challenging). Second, the front camera performance is inferior compared to the other cameras. We hypothesize that the front camera captures more diverse and challenging views, e.g., vehicles driving in multiple directions and more dynamic and smaller objects, making tracking more challenging.

6 Conclusion

In this work, we presented a new benchmark, the Waymo Open Dataset: Panoramic Video Panoptic Segmentation (WOD: PVPS) dataset. Our benchmark extends video panoptic segmentation to a more challenging multi-camera setting that requires consistent instance IDs both across cameras and over time. Our dataset is an order of magnitude larger than all the existing video panoptic segmentation datasets. We establish several strong baselines evaluated with a new metric, wSTQ, that takes multi-camera, multi-object tracking and segmentation into consideration. We will make our benchmark publicly available, and we hope that it will facilitate future research on panoramic video panoptic segmentation.
References

[1] Baqué, P., Fleuret, F., Fua, P.: Deep occlusion reasoning for multi-camera multi-target detection. In: ICCV (2017)
[2] Behley, J., Garbade, M., Milioto, A., Quenzel, J., Behnke, S., Stachniss, C., Gall, J.: SemanticKITTI: A dataset for semantic scene understanding of lidar sequences. In: ICCV (2019)
[3] Berclaz, J., Fleuret, F., Turetken, E., Fua, P.: Multiple object tracking using k-shortest paths optimization. PAMI 33(9), 1806–1819 (2011)
[4] Brostow, G.J., Fauqueur, J., Cipolla, R.: Semantic object classes in video: A high-definition ground truth database. Pattern Recognition Letters 30(2), 88–97 (2009)
[5] Caesar, H., Bankiti, V., Lang, A.H., Vora, S., Liong, V.E., Xu, Q., Krishnan, A., Pan, Y., Baldan, G., Beijbom, O.: nuScenes: A multimodal dataset for autonomous driving. In: CVPR (2020)
[6] Chang, M.F., Lambert, J., Sangkloy, P., Singh, J., Bak, S., Hartnett, A., Wang, D., Carr, P., Lucey, S., Ramanan, D., et al.: Argoverse: 3D tracking and forecasting with rich maps. In: CVPR (2019)
[7] Chavdarova, T., Baqué, P., Bouquet, S., Maksai, A., Jose, C., Bagautdinov, T., Lettry, L., Fua, P., Van Gool, L., Fleuret, F.: WILDTRACK: A multi-camera HD dataset for dense unscripted pedestrian detection. In: CVPR (2018)
[8] Chen, L.C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: Semantic image segmentation with deep convolutional nets and fully connected CRFs.
In: ICLR (2015)
[9] Chen, L.C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. TPAMI (2017)
[10] Chen, Y., Rong, F., Duggal, S., Wang, S., Yan, X., Manivasagam, S., Xue, S., Yumer, E., Urtasun, R.: GeoSim: Realistic video simulation via geometry-aware composition for self-driving. In: CVPR (2021)
[11] Cheng, B., Collins, M.D., Zhu, Y., Liu, T., Huang, T.S., Adam, H., Chen, L.C.: Panoptic-DeepLab: A simple, strong, and fast baseline for bottom-up panoptic segmentation. In: CVPR (2020)
[12] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The Cityscapes Dataset for Semantic Urban Scene Understanding. In: CVPR (2016)
[13] Dehghan, A., Modiri Assari, S., Shah, M.: GMMCP tracker: Globally optimal generalized maximum multi clique problem for multiple object tracking. In: CVPR (2015)
[14] Dendorfer, P., Ošep, A., Milan, A., Schindler, K., Cremers, D., Reid, I., Roth, S., Leal-Taixé, L.: MOTChallenge: A Benchmark for Single-camera Multiple Target Tracking. IJCV (2020)
[15] Eshel, R., Moses, Y.: Homography based multiple camera detection and tracking of people in a dense crowd. In: CVPR (2008)
[16] Everingham, M., Van Gool, L., Williams, C.K.I., Winn, J., Zisserman, A.: The Pascal Visual Object Classes (VOC) Challenge. IJCV (2010)
[17] Felzenszwalb, P.F., Huttenlocher, D.P.: Efficient graph-based image segmentation. IJCV (2004)
[18] Ferryman, J., Shahrokni, A.: PETS2009: Dataset and challenge. In: 2009 Twelfth IEEE International Workshop on Performance Evaluation of Tracking and Surveillance. pp. 1–6 (2009)
[19] Fleuret, F., Berclaz, J., Lengagne, R., Fua, P.: Multicamera people tracking with a probabilistic occupancy map. PAMI 30(2), 267–282 (2007)
[20] Gao, N., Shan, Y., Wang, Y., Zhao, X., Yu, Y., Yang, M., Huang, K.: SSAP: Single-Shot Instance Segmentation With Affinity Pyramid. In: ICCV (2019)
[21] Geiger, A., Lenz, P., Urtasun, R.: Are we ready for autonomous driving?
The KITTI vision benchmark suite. In: CVPR (2012)
[22] Geyer, J., Kassahun, Y., Mahmudi, M., Ricou, X., Durgesh, R., Chung, A.S., Hauswald, L., Pham, V.H., Mühlegg, M., Dorn, S., et al.: A2D2: Audi autonomous driving dataset. arXiv preprint arXiv:2004.06320 (2020)
[23] Han, X., You, Q., Wang, C., Zhang, Z., Chu, P., Hu, H., Wang, J., Liu, Z.: MMPTRACK: Large-scale densely annotated multi-camera multiple people tracking benchmark (2021)
[24] Hariharan, B., Arbeláez, P., Girshick, R., Malik, J.: Simultaneous detection and segmentation. In: ECCV (2014)
[25] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR (2016)
[26] He, X., Zemel, R.S., Carreira-Perpiñán, M.Á.: Multiscale conditional random fields for image labeling. In: CVPR (2004)
[27] Hofmann, M., Wolf, D., Rigoll, G.: Hypergraphs for joint multi-view reconstruction and multi-object tracking. In: CVPR (2013)
[28] Huang, X., Wang, P., Cheng, X., Zhou, D., Geng, Q., Yang, R.: The ApolloScape open dataset for autonomous driving and its application. PAMI (2020)
[29] Kendall, A., Gal, Y., Cipolla, R.: Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. In: CVPR (2018)
[30] Kim, D., Woo, S., Lee, J.Y., Kweon, I.S.: Video Panoptic Segmentation. In: CVPR (2020)
[31] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: ICLR (2015)
[32] Kirillov, A., Girshick, R., He, K., Dollár, P.: Panoptic Feature Pyramid Networks. In: CVPR (2019)
[33] Kirillov, A., He, K., Girshick, R., Rother, C., Dollár, P.: Panoptic Segmentation. In: CVPR (2019)
[34] Kuo, C.H., Huang, C., Nevatia, R.: Inter-camera association of multi-target tracks by on-line learned appearance affinity models. In: ECCV (2010)
[35] Ladický, L., Sturgess, P., Alahari, K., Russell, C., Torr, P.H.: What, where and how many?
Combining object detectors and CRFs. In: ECCV (2010)
[36] Li, Y., Chen, X., Zhu, Z., Xie, L., Huang, G., Du, D., Wang, X.: Attention-guided unified network for panoptic segmentation. In: CVPR (2019)
[37] Liang, J., Homayounfar, N., Ma, W.C., Xiong, Y., Hu, R., Urtasun, R.: PolyTransform: Deep polygon transformer for instance segmentation. In: CVPR (2020)
[38] Liao, Y., Xie, J., Geiger, A.: KITTI-360: A novel dataset and benchmarks for urban scene understanding in 2D and 3D. arXiv:2109.13410 (2021)
[39] Lin, T., Maire, M., Belongie, S.J., Bourdev, L.D., Girshick, R.B., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft COCO: Common objects in context. In: ECCV (2014)
[40] Ling, H., Acuna, D., Kreis, K., Kim, S.W., Fidler, S.: Variational amodal object completion. NeurIPS (2020)
[41] Liu, H., Peng, C., Yu, C., Wang, J., Liu, X., Yu, G., Jiang, W.: An End-to-End Network for Panoptic Segmentation. In: CVPR (2019)
[42] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR (2015)
[43] Luiten, J., Ošep, A., Dendorfer, P., Torr, P., Geiger, A., Leal-Taixé, L., Leibe, B.: HOTA: A Higher Order Metric for Evaluating Multi-Object Tracking. IJCV (2020)
[44] Mallya, A., Wang, T.C., Sapra, K., Liu, M.Y.: World-consistent video-to-video synthesis. In: ECCV (2020)
[45] Miao, J., Wei, Y., Wu, Y., Liang, C., Li, G., Yang, Y.: VSPW: A large-scale dataset for video scene parsing in the wild. In: CVPR (2021)
[46] Narioka, K., Nishimura, H., Itamochi, T., Inomata, T.: Understanding 3D semantic structure around the vehicle with monocular cameras.
In: IEEE Intelligent Vehicles Symposium (IV). pp. 132–137 (2018)
[47] Neuhold, G., Ollmann, T., Bulò, S.R., Kontschieder, P.: The Mapillary Vistas dataset for semantic understanding of street scenes. In: ICCV (2017)
[48] Philion, J., Fidler, S.: Lift, splat, shoot: Encoding images from arbitrary camera rigs by implicitly unprojecting to 3D. In: ECCV (2020)
[49] Porzi, L., Bulò, S.R., Colovic, A., Kontschieder, P.: Seamless Scene Segmentation. In: CVPR (2019)
[50] Qi, C.R., Zhou, Y., Najibi, M., Sun, P., Vo, K., Deng, B., Anguelov, D.: Offboard 3D object detection from point cloud sequences. In: CVPR (2021)
[51] Qiao, S., Zhu, Y., Adam, H., Yuille, A., Chen, L.C.: ViP-DeepLab: Learning visual perception with depth-aware video panoptic segmentation. In: CVPR (2021)
[52] Ristani, E., Solera, F., Zou, R., Cucchiara, R., Tomasi, C.: Performance measures and a dataset for multi-target, multi-camera tracking. In: ECCV Workshop on Benchmarking Multi-Target Tracking (2016)
[53] Ristani, E., Tomasi, C.: Features for multi-target multi-camera tracking and re-identification. In: CVPR (2018)
[54] Roddick, T., Cipolla, R.: Predicting semantic map representations from images using pyramid occupancy networks. In: CVPR (2020)
[55] Roshan Zamir, A., Dehghan, A., Shah, M.: GMCP-Tracker: Global multi-object tracking using generalized minimum clique graphs. In: ECCV (2012)
[56] Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A.C., Fei-Fei, L.: ImageNet Large Scale Visual Recognition Challenge. IJCV (2015)
[57] Schönberger, J.L., Zheng, E., Frahm, J.M., Pollefeys, M.: Pixelwise view selection for unstructured multi-view stereo. In: ECCV (2016)
[58] Shi, J., Malik, J.: Normalized cuts and image segmentation. PAMI (2000)
[59] Song, S., Zeng, A., Chang, A.X., Savva, M., Savarese, S., Funkhouser, T.: Im2Pano3D: Extrapolating 360° structure and semantics beyond the field of view. In: CVPR (2018)
[60] Su, Y.C., Grauman, K.: Making 360° video watchable in 2D: Learning videography for click free viewing. In: CVPR (2017)
[61] Sun, P., Kretzschmar, H., Dotiwalla, X., Chouard, A., Patnaik, V., Tsui, P., Guo,
J., Zhou, Y., Chai, Y., Caine, B., et al.: Scalability in perception for autonomous driving: Waymo Open Dataset. In: CVPR (2020)
[62] Tang, Z., Naphade, M., Liu, M.Y., Yang, X., Birchfield, S., Wang, S., Kumar, R., Anastasiu, D., Hwang, J.N.: CityFlow: A city-scale benchmark for multi-target multi-camera vehicle tracking and re-identification. In: CVPR (2019)
[63] Tateno, K., Navab, N., Tombari, F.: Distortion-aware convolutional filters for dense prediction in panoramic images. In: ECCV (2018)
[64] Thrun, S., Montemerlo, M.: The GraphSLAM algorithm with applications to large-scale mapping of urban structures. The International Journal of Robotics Research 25(5-6), 403–429 (2006)
[65] Tu, Z., Chen, X., Yuille, A.L., Zhu, S.C.: Image parsing: Unifying segmentation, detection, and recognition. IJCV (2005)
[66] Voigtlaender, P., Krause, M., Ošep, A., Luiten, J., Sekar, B.B.G., Geiger, A., Leibe, B.: MOTS: Multi-object tracking and segmentation. In: CVPR (2019)
[67] Wang, H., Luo, R., Maire, M., Shakhnarovich, G.: Pixel Consensus Voting for Panoptic Segmentation. In: CVPR (2020)
[68] Wang, H., Zhu, Y., Green, B., Adam, H., Yuille, A., Chen, L.C.: Axial-DeepLab: Stand-Alone Axial-Attention for Panoptic Segmentation. In: ECCV (2020)
[69] Weber, M., Luiten, J., Leibe, B.: Single-shot Panoptic Segmentation. In: IROS (2020)
[70] Weber, M., Wang, H., Qiao, S., Xie, J., Collins, M.D., Zhu, Y., Yuan, L., Kim, D., Yu, Q., Cremers, D., Leal-Taixé, L., Yuille, A.L., Schroff, F., Adam, H., Chen, L.C.: DeepLab2: A TensorFlow Library for Deep Labeling. arXiv:2106.09748 (2021)
[71] Weber, M., Xie, J., Collins, M., Zhu, Y., Voigtlaender, P., Adam, H., Green, B., Geiger, A., Leibe, B., Cremers, D., Ošep, A., Leal-Taixé, L., Chen, L.C.: STEP: Segmenting and Tracking Every Pixel. In: NeurIPS Track on Datasets and Benchmarks (2021)
[72] Wu, Y., Lim, J., Yang, M.H.: Online object tracking: A benchmark. In: CVPR (2013)
[73] Xiong, Y., Liao, R., Zhao, H., Hu, R., Bai, M., Yumer, E., Urtasun, R.: UPSNet: A Unified Panoptic Segmentation Network. In: CVPR (2019)
[74] Xu, C., Xiong, C., Corso, J.J.: Streaming hierarchical video segmentation. In: ECCV (2012)
[75] Xu, Y., Liu, X., Liu, Y., Zhu, S.C.:
Multi-view people tracking via hierarchical trajectory composition. In: CVPR (2016)
[76] Yang, B., Bai, M., Liang, M., Zeng, W., Urtasun, R.: Auto4D: Learning to label 4D objects from sequential point clouds. arXiv preprint arXiv:2101.06586 (2021)
[77] Yang, K., Hu, X., Bergasa, L.M., Romera, E., Wang, K.: PASS: Panoramic annular semantic segmentation. IEEE Transactions on Intelligent Transportation Systems 21(10), 4171–4185 (2019)
[78] Yang, K., Zhang, J., Reiß, S., Hu, X., Stiefelhagen, R.: Capturing omni-range context for omnidirectional segmentation. In: CVPR (2021)
[79] Yang, L., Fan, Y., Xu, N.: Video Instance Segmentation. In: ICCV (2019)
[80] Yang, T.J., Collins, M.D., Zhu, Y., Hwang, J.J., Liu, T., Zhang, X., Sze, V., Papandreou, G., Chen, L.C.: DeeperLab: Single-Shot Image Parser. arXiv:1902.05093 (2019)
[81] Yao, J., Fidler, S., Urtasun, R.: Describing the scene as a whole: Joint object detection, scene classification and semantic segmentation. In: CVPR (2012)
[82] Yogamani, S., Hughes, C., Horgan, J., Sistu, G., Varley, P., O'Dea, D., Uricár, M., Milz, S., Simon, M., Amende, K., et al.: WoodScape: A multi-task, multi-camera fisheye dataset for autonomous driving. In: ICCV (2019)
[83] Yu, F., Chen, H., Wang, X., Xian, W., Chen, Y., Liu, F., Madhavan, V., Darrell, T.: BDD100K: A diverse driving dataset for heterogeneous multitask learning. In: CVPR (2020)
[84] Zakharov, S., Kehl, W., Bhargava, A., Gaidon, A.: Autolabeling 3D objects with differentiable rendering of SDF shape priors. In: CVPR (2020)
[85] Zhang, C., Liwicki, S., Smith, W., Cipolla, R.: Orientation-aware semantic segmentation on icosahedron spheres. In: ICCV (2019)