Waymo Open Dataset: Panoramic Video Panoptic Segmentation

Jieru Mei1*, Alex Zihao Zhu2, Xinchen Yan2, Hang Yan2, Siyuan Qiao3, Yukun Zhu3, Liang-Chieh Chen3, Henrik Kretzschmar2, Dragomir Anguelov2
1Johns Hopkins University  2Waymo LLC  3Google Research

Abstract

Panoptic image segmentation is the computer vision task of finding groups of pixels in an image and assigning semantic classes and object instance identifiers to them. Research in image segmentation has become increasingly popular due to its critical applications in robotics and autonomous driving. The research community thereby relies on publicly available benchmark datasets to advance the state-of-the-art in computer vision. Due to the high costs of densely labeling the images, however, there is a shortage of publicly available ground truth labels that are suitable for panoptic segmentation. The high labeling costs also make it challenging to extend existing datasets to the video domain and to multi-camera setups. We therefore present the Waymo Open Dataset: Panoramic Video Panoptic Segmentation Dataset, a large-scale dataset that offers high-quality panoptic segmentation labels for autonomous driving. We generate our dataset using the publicly available Waymo Open Dataset, leveraging the diverse set of camera images. Our labels are consistent over time for video processing and consistent across multiple cameras mounted on the vehicles for full panoramic scene understanding. Specifically, we offer labels for 28 semantic categories and 2,860 temporal sequences that were captured by five cameras mounted on autonomous vehicles driving in three different geographical locations, leading to a total of 100k labeled camera images. To the best of our knowledge, this makes our dataset an order of magnitude larger than existing datasets that offer video panoptic segmentation labels. We further propose a new benchmark for Panoramic Video Panoptic Segmentation and establish a number of strong baselines based on the DeepLab family of models. We will make the benchmark and the code publicly available, which we hope will facilitate future research on holistic scene understanding. Find the dataset at https://waymo.com/open.
1 Introduction

Semantic visual scene understanding has been studied extensively for decades in the field of computer vision [58,65,17,35,81,74]. Researchers have tackled tasks of varying difficulty, ranging from segmenting distinct objects in individual camera images [26,24,42,9] to tracking and segmenting multiple objects in videos [72,66,14]. Robotic applications, such as autonomous driving, have led to new challenges and opportunities for semantic visual scene understanding [21,12].

*Work done as an intern at Waymo.
arXiv:2206.07704v1 [cs.CV] 15 Jun 2022

Figure 1: We provide panoptic segmentation labels for 100k camera images of the Waymo Open Dataset. Our dataset is grouped into 2,860 temporal sequences captured by five cameras, mounted on autonomous vehicles driving in three geographical locations. Instance segmentation labels are consistent both across cameras and over time. Our dataset offers diversity in terms of object classes, locations, weather, and time of day.

Modern autonomous vehicles tend to be equipped with multiple cameras and LiDAR scanners. The cameras provide rich semantic information about the scene, whereas the LiDAR scanners capture sparse, but geometrically highly accurate information. Autonomous vehicles need to be able to fuse and interpret the data streams from multiple sensors to build and maintain over time an accurate and consistent estimate of the world. One challenge when tracking and segmenting multiple objects is that objects of interest may leave the field of view of one camera to enter the field of view of another camera across consecutive video frames.

In this paper, we study the new task of video panoptic segmentation [33,30] for autonomous vehicles equipped with multiple cameras. See Fig. 1 for an illustration. Panoptic segmentation enables autonomous vehicles to reason about their surroundings in terms of semantic and geometric properties, such as fine-grained object contours. There are also important offboard applications, including auto-labeling [84,76,50] and camera sensor simulation [44,40,10].
On the one hand, most existing panoptic segmentation datasets [12,47] provide labels for individual camera images. This makes it difficult to train models that fuse information from multiple camera images, either temporally or by leveraging a multi-camera setup. On the other hand, datasets that provide panoptic segmentation labels for video data [30,71] tend to be scarce and much smaller than datasets for object detection and tracking for autonomous driving [21,61]. To bridge this gap, we present a new benchmark dataset for panoptic segmentation based on the popular Waymo Open Dataset (WOD). Specifically, we provide panoptic segmentation labels for video data that are consistent across five cameras mounted on the vehicles. We further present a benchmark that captures the task of multi-camera panoptic segmentation in video data for autonomous driving. Overall, we provide panoptic segmentation labels for 100k camera images, which we group into training (70%), validation (10%), and test (20%) sets. The training set consists of 2,800 sequences, each of which comprises labels for five cameras spanning 1.2 seconds and five temporal frames. In contrast, our validation and test sets consist of 60 longer sequences, in order to facilitate the evaluation of long-term tracking. Each validation and test sequence consists of 100 temporal frames, spanning the full 20s of a scene, while also providing labels across all five cameras. We extend the Segmentation and Tracking Quality (STQ) metric [71] to support our multi-camera setup by computing a weight for pixels depending on the cameras they correspond to. We also extend a state-of-the-art video panoptic segmentation method, ViP-DeepLab [51], to our multi-camera setup by training separate models on each camera view and by training a model on a panorama generated from all views. We present an extensive experimental evaluation on the proposed dataset and metric.

We publish the full dataset to enhance video panoptic segmentation research while also opening up the field of panoramic video panoptic segmentation.
2 Related Work

Panoptic Segmentation  The task of panoptic segmentation [33] aims to unify semantic segmentation [26] and instance segmentation [24], requiring the assignment of a class label and an instance ID to all pixels in an image. Modern panoptic segmentation systems can be roughly categorized into top-down (or proposal-based) [32,49,36,41,73,69] and bottom-up (or proposal-free) [80,20,67,11,68] approaches. Our adopted baseline methods belong to the bottom-up category.

Video Panoptic Segmentation  Extending panoptic segmentation to the video domain, Video Panoptic Segmentation (VPS) [30] requires generating instance tracking IDs (i.e., temporally consistent instance IDs) along with panoptic segmentation results across video frames. Current VPS datasets are small-scale in terms of semantic classes and sizes. Specifically, Cityscapes-VPS [30] sparsely annotates (every fifth frame) Cityscapes [12] video sequences, resulting in only 3,000 frames with 19 semantic classes for training and testing. Recently, STEP [71] extended KITTI-MOTS [21,66] and MOTS-Challenge [66,14] for VPS. However, their annotated datasets are still small-scale (18K annotated frames with 19 semantic classes for KITTI-STEP, and 2K frames with 8 classes for MOTChallenge-STEP), and the video sequences are captured by only a single front-view camera. In contrast, our annotated dataset presents the first large-scale VPS annotations and extends to the multi-camera scenario.

Segmentation Benchmarks  There are other popular video segmentation benchmarks in the literature, e.g., VSPW [45] for video semantic segmentation, and MOTS [66] and Youtube-VIS [79] for video instance segmentation. Our benchmark is also related to urban scene understanding, where typical benchmarks include [4,21,39,12,47,6,82,2,61,5,37,28,83,78,38,22]. Our work is most related to WildPASS [78], which also aims to endow machines with large field-of-view perception. However, building on top of the large-scale Waymo Open Dataset [61], our benchmark provides many more high-quality annotated video sequences.

Table 1: Dataset comparison. Our WOD:PVPS is a new large-scale panoramic video panoptic segmentation dataset. †WildPASS contains 500 panoramas.

dataset statistics  | WOD:PVPS (ours) | WildPASS [78] | Cityscapes-VPS [30] | KITTI-STEP [71] | MOT-STEP [71]
#sequences          | 2,860           | -             | 500                 | 50              | 4
#images             | 100,000         | 500†          | 3,000               | 19,103          | 2,075
#tracking classes   | 8               | -             | 8                   | 2               | 1
#semantic classes   | 28              | 8             | 19                  | 19              | 7
panoramic           | ✓               | ✓             | ✗                   | ✗               | ✗
video panoptic      | ✓               | ✗             | ✓                   | ✓               | ✓

Multi-Camera Multi-Object Tracking  Consistently tracking objects across multiple cameras, multi-camera multi-object tracking [19,15,3,55,27,13,1,53] has been a popular research topic in the computer vision community. Typical benchmarks [18,34,75,52,7,62,23] only track a single class (e.g., people or vehicles) with bounding boxes, while our proposed benchmark demands pixel-level tracking and segmentation for multiple classes.

Panoramic Semantic Segmentation  Panoramic semantic segmentation provides surround-view perception [59,46,63,85,78,77], but is limited to semantic segmentation without temporal and instance-level understanding. Our work is similar, but additionally tackles video panoptic segmentation. Recently, [54,48] predict bird's-eye view semantic segmentation using multi-camera inputs.

3 WOD: PVPS Dataset

In this section, we first recap the existing Waymo Open Dataset (WOD) [61], one of the largest and most diverse multi-sensor datasets in the autonomous driving domain. We leverage the existing data that comes with coarse-level annotations (e.g., 2D and 3D bounding boxes) as the foundation, and subsample images for our dataset. We then provide an overview of our WOD:PVPS dataset, including panorama generation, statistics of the semantic classes, and temporal frame sampling. Finally, we explain in detail our hybrid scheme to address the challenges in multi-camera and video labeling.
We obtain consistent instance IDs across temporal frames and cameras by associating the panoptic labels from each individual image with the existing box-level annotations.

3.1 Dataset Overview

The Waymo Open Dataset contains 1,150 scenes, each consisting of 20 seconds of data captured at 10Hz (i.e., 10 frames per second, and thus 200 frames per scene). Each data frame in the dataset includes 3D point clouds from the LiDAR devices, images from five cameras (positioned at Front, Front-Left, Front-Right, Side-Left, and Side-Right), and ground truth 3D and 2D bounding boxes annotated by humans in the LiDAR point clouds and camera images, respectively. Each bounding box contains an ID that is unique to that object across the entirety of each scene. For the LiDAR data, this allows for tracking in the whole scene. For the camera data, these IDs are consistent within each camera's images only.

Figure 2: Histogram of the 28 semantic categories in our dataset in terms of their pixel distributions. The vertical axis denotes the number of pixels for each class in log scale. We provide instance IDs for classes marked with diamonds.

Figure 3: Super-class distributions for each camera. Each camera sees a different distribution of classes, due to their fixed positions and different fields of view.

Built on top of the WOD, our WOD:PVPS dataset consists of 100,000 images with panoptic segmentation labels using a prescribed train, validation, and test set split, subsampled from the existing 1.15 million images. In Tab. 1, we compare our proposed WOD:PVPS dataset with the public datasets for video panoptic segmentation. Our dataset is the only one that provides panoptic segmentation annotations that are consistent both across multiple cameras and across time. Furthermore, our dataset is much larger both in terms of the number of frames and the number of semantic classes than existing datasets [30,78,71].

Equirectangular Panorama  We reconstruct the equirectangular panorama (220° coverage from five cameras) by stitching the individual camera images as an alternative input format for our dataset. Specifically, we first use the extrinsics and intrinsics of the five cameras provided by WOD to unproject each pixel coordinate to 3D space.
We then set a virtual camera [60] located at the geometric mean of all five camera centers and compute the pixel colors by equirectangular projection from the 3D space with bilinear sampling. For pixels corresponding to multiple camera views, we compute the weights based on the distance of each pixel in the panorama to each of the camera views' boundaries. For panoptic labels, we compute the labels in each camera view given the camera parameters of the five cameras and the virtual camera using nearest-neighbor sampling. We then use the method of Qiao et al. [51] to stitch the panorama labels to maintain view consistency. Finally, we fuse the five panorama labels together based on the correspondences and the distances to the camera views' boundaries. There are more sophisticated methods [57,64] that leverage cross-frame information and the geometry captured by LiDAR sensors, which could potentially improve panorama generation. We leave this as an open research topic for the future.

Semantic Class Distribution  In total, our dataset contains 28 semantic categories, outlined with their frequency in pixels in Fig. 2. In addition, we provide instance IDs for most of the classes under the vehicle and human super-classes, as they are the major dynamic categories in the autonomous driving space. We also show the pixel distribution for each camera view in Fig. 3, where we see notable differences in the distributions for each camera. For example, the front camera covers more flat (e.g., road surface) and sky pixels than the rest of the cameras, while the side-left camera covers more vehicle pixels due to the ego-vehicle driving on the right-hand side of the road. This analysis is important, as machine learning models trained on images captured by a single camera from existing datasets may not necessarily generalize to the other cameras due to large domain gaps across different cameras. In contrast, our proposed task has an emphasis on holistic scene understanding, which grants our WOD:PVPS dataset unique value to the research community.

Temporal Frame Sampling for Human Annotations  To maximize the diversity of the images in the training set, we subsample sparsely from each scene, labeling chunks of five-frame sequences from all the cameras.
We start by randomly selecting 700 out of the 798 scenes. For each scene, which typically has 200 frames, we annotate four sets of five-frame sequences, starting at frame indices {25, 50, 125, 150} (i.e., we pick the 25th, 50th, 125th, and 150th frames as the first frame of each five-frame sequence for annotation). For each set, we further select frames with offsets {0, 4, 6, 8, 12} w.r.t. the first frame for annotation. For example, the first set of five-frame sequences will contain frames with indices {25, 29, 31, 33, 37}. Our sparse sampling strategy provides a variety of sequence lengths, allowing users to train on frame pairs with a time difference as small as two frames (0.2 seconds) and as large as 12 frames (1.2 seconds). As a result, our training set contains groups of five temporal frames across all five cameras, yielding 2,800 sequences of 25 images (5 temporal frames × 5 cameras), or 70,000 images in total. Finally, we provide the associations between each instance ID and the corresponding 3D LiDAR bounding box, allowing us to compute very long associations (up to 13.7 seconds between all four sequences) if an object persists across multiple sequences in the same scene.

For the validation and test sets, we aim to enable the testing of long-term consistency across cameras and frames. We therefore densely sample frames at 5Hz from chunks of 100 frames across all cameras (i.e., every other frame is sampled from the 200-frame sequence). We select 20 and 40 scenes for the validation and test sets, respectively, maintaining diversity in the location, object density, and time-of-day distributions of WOD. In contrast to the training set, we label every other frame for each selected scene, resulting in sequences with 100 temporal frames across all five cameras. In the end, our validation set contains annotations for 20 sequences of 500 images (100 temporal frames × 5 cameras), and our test set consists of 40 sequences of 500 images, for a total of 10,000 and 20,000 annotated images for the validation and test sets, respectively.
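The training-split sampling scheme above is easy to reproduce programmatically. The following is a minimal sketch; the constant and function names are ours, not from the official dataset tooling.

```python
# Frame indices annotated per training scene, as described in the text.
START_FRAMES = [25, 50, 125, 150]   # first frame of each five-frame sequence
OFFSETS = [0, 4, 6, 8, 12]          # offsets w.r.t. the first frame (at 10 Hz)

def training_annotation_frames():
    """Return the four five-frame index sequences annotated for one scene."""
    return [[start + off for off in OFFSETS] for start in START_FRAMES]

sequences = training_annotation_frames()
# The first sequence covers frames 25, 29, 31, 33, 37, so frame pairs within
# a sequence are between 2 frames (0.2 s) and 12 frames (1.2 s) apart.
```

This makes the claimed pair spacings explicit: within any sequence, the closest annotated pair is 2 frames apart and the farthest is 12 frames apart.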
The test set annotations will not be made publicly available; instead, we will prepare a test server to evaluate on the held-out test set, once the dataset is released.

3.2 Associating Instance IDs Across Cameras and Frames

In constructing the panoramic video panoptic segmentation dataset, ensuring that the annotations have consistent instance IDs across cameras and temporal frames is one of the major challenges. Manual labeling is a straightforward option, but is time-consuming and expensive at large scales. In addition, it is difficult to develop an effective labeling interface that allows human annotators to iteratively refine instance labels across cameras and temporal frames.

Figure 4: Labeling and Association Overview. Human annotators first label each camera image for panoptic segmentation separately (step 1). LiDAR points within each ground truth 3D bounding box are then projected to each image, and associated with the single-frame instance labels (step 2). For far-range instances without corresponding 3D bounding boxes, we associate the single-frame instance labels over time using the ground truth 2D bounding boxes within each camera (step 3). New associations are highlighted in the zoomed-in views at the bottom.

We instead assigned human annotators to label each camera image for panoptic segmentation separately, and employed a hybrid scheme that leverages the existing coarse-level annotations in WOD. The coarse-level annotations include (i) 3D bounding boxes with corresponding IDs that are consistent across all frames and cameras; and (ii) 2D bounding boxes with IDs that are consistent across temporal frames, but annotated independently for each camera.
Associations were then computed between each instance and its corresponding 3D LiDAR boxes and 2D camera boxes. Instances determined to correspond to the same object are then mapped to the same ID in all frames across cameras. A sample sequence from this process can be found in Fig. 4.

For a given frame with instance labels, 3D point clouds, and 3D bounding boxes, we associate instances with boxes by filtering the LiDAR points within each box and projecting them onto the image. Association scores are then computed using the IoU between the convex hull of the projected LiDAR points and each instance label. Bipartite matching is then applied to match each projected box with an instance label. For 3D driving scenes, points inside a bounding box almost entirely correspond to the instance inside it, and so these projected LiDAR points have a high overlap with their corresponding instance masks in the image. Our label association step is related to the prior work [28,38], but our association leverages the ground-truth labeled 3D boxes and only transfers instance IDs rather than fine-grained per-pixel labels.

There are, however, a small number of instances without corresponding LiDAR ground truth boxes due to occlusions, rolling shutter artifacts, and the limited range of the provided LiDAR scans (75m). We apply an additional matching step by associating the 2D bounding boxes with our instance labels. First, we score matches between 2D boxes and instances by computing the IoU between each 2D box and the tightly-fitting bounding box around each instance mask, and then compute associations with bipartite matching.

For boxes with existing 3D associations, we extend these tracks by propagating the existing ID to all other instances that match with the same 2D box. This resolves cases where an object track misses 3D associations in only a few frames. Then, we assign the remaining unmatched instances the ID of their corresponding 2D box, if any. Finally, to capture any additional cross-camera associations, we project all of the camera views onto the panorama, and associate instances which overlap in this joint representation.
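The core of the association steps above is "score every box/instance pair by IoU, then match one-to-one". The sketch below illustrates that pattern under simplifying assumptions: masks and projected box footprints are modeled as plain sets of pixels, and the one-to-one matching is greedy rather than optimal bipartite matching; the function names and the 0.1 IoU floor are ours.

```python
def iou(a, b):
    """Intersection-over-union between two sets of (x, y) pixels."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def associate(box_footprints, instance_masks, min_iou=0.1):
    """Greedy one-to-one box/instance matching; returns {box_id: instance_id}.

    box_footprints: {box_id: pixel set} (e.g., hull of projected LiDAR points)
    instance_masks: {instance_id: pixel set} from single-frame panoptic labels
    """
    scores = [(iou(fp, mask), box_id, inst_id)
              for box_id, fp in box_footprints.items()
              for inst_id, mask in instance_masks.items()]
    matches, used_boxes, used_insts = {}, set(), set()
    # Visit pairs in order of decreasing IoU; each box and instance is
    # consumed at most once, approximating the paper's bipartite matching.
    for score, box_id, inst_id in sorted(scores, reverse=True):
        if score < min_iou or box_id in used_boxes or inst_id in used_insts:
            continue
        matches[box_id] = inst_id
        used_boxes.add(box_id)
        used_insts.add(inst_id)
    return matches
```

An optimal assignment (e.g., the Hungarian algorithm) would replace the greedy loop in a production implementation; the greedy version already behaves identically whenever the IoU matrix is near-diagonal, which the text argues is the common case for driving scenes.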
In order to identify any instances that are still not associated with any ground truth boxes after these steps, we provide an additional mask for these instance pixels indicating that they are not tracked, similar to the crowd mask used in single-frame instance segmentation labels [12].

4 Benchmark and Evaluation Metrics

In this section, we first describe the task of Panoramic Video Panoptic Segmentation (PVPS). We then review the evaluation metrics used in the literature, and propose a new metric designed for PVPS with an emphasis on consistent multi-object tracking and segmentation across multiple cameras.

4.1 Problem Definition

We represent a multi-camera video sequence with $T$ frames and $M$ independent camera views as $\{I_i^{1:T}\}_{i=1}^{M}$, where $I_i^t$ is the $i$-th camera view captured at the $t$-th time step in the video sequence. Along with the multi-view representation of the full scene, we define the panorama at the $t$-th time step as $I_{pano}^t$. The task of Panoramic Video Panoptic Segmentation (PVPS) requires a mapping $f$ of every pixel $(x, y, t, i)$ in the multi-camera video sequence to a semantic category $c \in \mathcal{C}$ and an instance ID $z$ that is consistent across camera views and temporal frames. Here, $(x, y, t, i)$ indicates the spatial coordinate $(x, y)$ of the $i$-th camera view captured at the $t$-th time step, and $\mathcal{C}$ is the set of semantic categories. Accordingly, we define the mappings $f_{id}$ and $f_{sem}$ for a particular instance ID $z$ and semantic category $c$ in Eq. (1) and Eq. (2), respectively. The mapping functions are the building blocks of our proposed metric introduced in Sec. 4.2.

$$f_{id}(z) = \{(x, y, i, t) \mid f(x, y, i, t) = (c, z),\ c \in \mathcal{C}\}, \qquad (1)$$

$$f_{sem}(c) = \{(x, y, i, t) \mid f(x, y, i, t) = (c, *),\ c \in \mathcal{C}\}. \qquad (2)$$

Compared to the existing tasks, including Video Panoptic Segmentation (VPS) and Panoramic Semantic Segmentation, the proposed task is more challenging in the following aspects. First, each individual camera has its own unique viewpoint and field-of-view, such that the semantic class statistics differ across cameras (e.g., see Fig. 3). This leads to a large domain gap between videos captured with different cameras. Second, the instance ID prediction, with long-term consistency across both time and cameras, requires holistic scene understanding.
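The mappings of Eq. (1) and Eq. (2) can be illustrated on a toy labeling. In this sketch, the mapping $f$ is a dictionary from pixel tuples $(x, y, i, t)$ to (class, instance ID) pairs; the function names mirror the notation but are otherwise our own.

```python
def f_id(f, z):
    """Eq. (1): pixels carrying instance ID z, across all cameras and frames."""
    return {p for p, (c, zp) in f.items() if zp == z}

def f_sem(f, c):
    """Eq. (2): pixels carrying semantic class c, regardless of instance ID."""
    return {p for p, (cp, _) in f.items() if cp == c}

# A three-pixel toy labeling: two "car" pixels with different instance IDs
# in camera 0, and one "road" pixel in camera 1, all at time step 0.
labels = {
    (0, 0, 0, 0): ("car", 1),
    (1, 0, 0, 0): ("car", 2),
    (0, 0, 1, 0): ("road", 0),
}
```

Note that $f_{id}$ pools pixels across camera index $i$ and time $t$: this is exactly what makes the task panoramic and video-based, since the same instance ID must be produced in every view and frame where the object appears.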
4.2 Evaluation Metrics

In this subsection, we overview the existing Video Panoptic Segmentation (VPS) metric, Segmentation and Tracking Quality (STQ) [71], which we extend to evaluate the Panoramic Video Panoptic Segmentation (PVPS) task.

VPS Metric  We use $f$ and $g$ to indicate the prediction and the ground-truth mapping, respectively. We define the true positive associations (TPA) [43] of a specific instance as $TPA(z_f, z_g) = |f_{id}(z_f) \cap g_{id}(z_g)|$, where $z_f$ is the predicted instance, $z_g \in G$ is the ground-truth instance, and $G$ is the set containing all unique ground-truth instances across cameras and temporal frames. Similarly, false negative associations (FNA) and false positive associations (FPA) can be defined to compute the Intersection over Union ($IoU_{id}$) for evaluating tracking quality. Formally, STQ is defined as follows:

$$\text{STQ} = (\text{AQ} \times \text{SQ})^{\frac{1}{2}}, \qquad (3)$$

$$\text{AQ} = \frac{1}{|G|} \sum_{z_g \in G} \frac{1}{|g_{id}(z_g)|} \sum_{z_f,\ z_f \cap z_g \neq \emptyset} \text{TPA}(z_f, z_g) \times \text{IoU}_{id}(z_f, z_g),$$

$$\text{SQ} = \frac{1}{|\mathcal{C}|} \sum_{c \in \mathcal{C}} \frac{|f_{sem}(c) \cap g_{sem}(c)|}{|f_{sem}(c) \cup g_{sem}(c)|}.$$

As defined in Eq. (3), STQ fairly balances segmentation and tracking performance, and is suitable for evaluating video sequences of arbitrary length. The Association Quality (AQ) measures the association quality for tracking classes, while the Segmentation Quality (SQ) measures the segmentation quality for semantic classes. Specifically, AQ involves the $IoU_{id}$ computation for predicted instance IDs (further weighted by true positive associations to encourage long-term tracking [71]), while SQ is the typical semantic segmentation metric [16] (i.e., mean $IoU_{sem}$ over predicted semantic classes).

Figure 5: Visualization of the weights tensor for all cameras (Side Left, Front Left, Front, Front Right, Side Right). Pixels in the blue region have weight 0.5 during evaluation, as they are covered by two cameras.

PVPS Metric  We propose to extend STQ [71] to Panoramic Video Panoptic Segmentation (PVPS). However, naïvely adopting STQ in the multi-camera scenario results in a potential issue: pixels in the overlapping regions covered by multiple cameras will be counted multiple times.
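As a concrete reference for Eq. (3), the STQ computation can be sketched on toy pixel sets. This is a simplified illustration, not the official metric code: pixels are plain Python sets of $(x, y, i, t)$ tuples, $IoU_{id}$ is taken as ordinary set IoU, and the handling of void labels and per-pixel weights is omitted.

```python
def set_iou(a, b):
    """Plain set IoU, standing in for IoU_id and per-class IoU_sem."""
    return len(a & b) / len(a | b) if a | b else 0.0

def stq(pred_insts, gt_insts, pred_sems, gt_sems):
    """Toy STQ per Eq. (3).

    pred_insts, gt_insts: {instance_id: pixel set} across cameras and frames.
    pred_sems, gt_sems:   {semantic_class: pixel set}.
    """
    # AQ: for each ground-truth track, sum TPA-weighted IoU_id over all
    # overlapping predictions, normalized by the track's size.
    aq = 0.0
    for g in gt_insts.values():
        total = sum(len(f & g) * set_iou(f, g)
                    for f in pred_insts.values() if f & g)
        aq += total / len(g)
    aq /= len(gt_insts)
    # SQ: mean per-class semantic IoU.
    sq = sum(set_iou(pred_sems.get(c, set()), g)
             for c, g in gt_sems.items()) / len(gt_sems)
    return (aq * sq) ** 0.5
```

For a perfect prediction, both AQ and SQ are 1, and hence STQ is 1; both missed pixels and ID switches lower the score, since a switched ID splits the TPA-weighted sum across several small-IoU predictions.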
Instead, we employ a simple and effective solution by exploiting the pixel-centric property of STQ. In particular, we weight each pixel prediction w.r.t. its coverage by the number of cameras, as determined by the mapping between the camera images and the panorama image. For example, if a pixel is covered by $N$ cameras (in our dataset, $N = 2$), its prediction will contribute $1/N$ when computing AQ and SQ. We name the resulting metric weighted STQ (wSTQ), since each pixel prediction takes a different weight depending on its coverage by the number of cameras. In Fig. 5, we visualize the weights for an example of five camera images.

PS Metric  We also briefly review the metric PQ (panoptic quality) [33] for evaluating image Panoptic Segmentation (PS), since we will build image-level baselines trained purely with image panoptic annotations.

For a particular semantic class $c$, the sets of true positives ($TP_c$), false positives ($FP_c$), and false negatives ($FN_c$) are formed by matching predictions $z_f$ to the ground-truth masks $z_g$ based on their IoU scores. A minimal threshold of greater than 0.5 IoU is chosen to guarantee unique matching. Formally,

$$PQ_c = \frac{\sum_{(z_f, z_g) \in TP_c} \text{IoU}(z_f, z_g)}{|TP_c| + \frac{1}{2}|FP_c| + \frac{1}{2}|FN_c|}, \qquad (4)$$

where the final PQ is obtained by averaging $PQ_c$ over the semantic classes.

Figure 6: We experiment with two evaluation schemes: (a) View and (b) Pano. The View evaluation scheme takes individual camera views as input and generates their panoptic predictions, which are then "stitched over cameras" to obtain consistent instance IDs between cameras. The Pano evaluation scheme takes panorama images as input and generates panoramic panoptic predictions, which are then reprojected back to each camera for evaluation.

5 Experimental Results

In this section, we introduce our PVPS baselines, which exploit the property of multi-camera images by taking as input either individual camera views or panorama images (generated from all camera views).
We then provide extensive experiments on the proposed dataset and metric.

5.1 ViP-DeepLab Extensions as PVPS Baselines

To tackle the new and challenging PVPS task, we extend the state-of-the-art video panoptic segmentation method, ViP-DeepLab [51], to panoramic views.

Baseline Overview  For completeness, we first briefly review ViP-DeepLab [51]. ViP-DeepLab extends the state-of-the-art image panoptic segmentation model, Panoptic-DeepLab [11], to the video domain. Panoptic-DeepLab employs two separate prediction branches for semantic segmentation [9] and instance segmentation [29], respectively. Both segmentation results are then merged [80] to form the final panoptic segmentation result. To perform video panoptic segmentation, ViP-DeepLab adopts a two-frame image panoptic segmentation framework. Specifically, during training, ViP-DeepLab takes a pair of image frames as input and their panoptic segmentation ground truths as the training target. During inference, ViP-DeepLab performs two-frame image panoptic prediction at each time step, and continues the inference process for every two consecutive frames (i.e., with one overlapping frame at the next time step) in a video sequence. The predictions in the overlapping frames are "stitched" together by propagating instance IDs based on the mask IoU between region pairs (i.e., if two masks have a high IoU overlap, they are re-assigned the same instance ID), and thus temporally consistent IDs are obtained (see Fig. 4 of Qiao et al. [51] for an illustration). We refer to this post-processing as "panoptic stitching over time".

Baseline Extension for PVPS  We explore several ViP-DeepLab extensions for PVPS, which take as input individual camera views or panorama images (generated from all camera views). The input types can differ between training and evaluation. Specifically, we define three training schemes: View, Pano, and Ensemble-View. The View scheme refers to the case where ViP-DeepLab is trained with images from all camera views, while Pano means the model is trained with full panorama images.
The Ensemble-View scheme refers to the case where we have five camera-specific ViP-DeepLab models, each of which is trained and evaluated on its own camera images. We also have two evaluation schemes: View and Pano. The View scheme refers to the case where the trained model is fed with images from individual camera views and generates the corresponding panoptic predictions for each view. However, the predicted instance IDs are not consistent between cameras, since the predictions are made independently for each view. To generate consistent instance IDs between cameras, we propose a method similar to "panoptic stitching over time": if two masks have a high IoU overlap in the overlapping regions between two cameras' fields of view, we re-assign the same instance ID to them, resulting in the "panoptic stitching over cameras" post-processing method. For the Pano evaluation scheme, the model is fed with panorama images and generates panoramic panoptic predictions. We then re-project the panoramic panoptic predictions onto each camera for evaluation. Note that, for the Pano evaluation scheme, the instance IDs are consistent between cameras by nature. We visualize the evaluation schemes in Fig. 6.

Implementation Details  We build our image-based and video-based baselines on top of Panoptic-DeepLab [11] and ViP-DeepLab [51], respectively, using the official code-base [70]. The training strategy follows Panoptic-DeepLab and ViP-DeepLab. Specifically, the models are trained with 32 TPU cores for 60k steps, batch size 32, the Adam [31] optimizer, and a poly schedule learning rate of 2.5×10⁻⁴. We use an ImageNet-1K-pretrained [56] ResNet-50 [25] with stride 16 as the backbone (using atrous convolution [8]). For image-based methods, we use a crop size of 1281 × 1921 during training, while, during inference, we use the whole image (or panorama). We use a similar strategy for the video-based methods, but we use a ResNet-50 backbone with stride 32 and crop size 641 × 961 due to memory constraints.

5.2 Qualitative Evaluation

In Fig.
7, we provide qualitative results from our two ViP-DeepLab baselines, one trained and evaluated on single camera images and the other on panorama images (i.e., one model uses the View scheme for both training and evaluation, and the other uses the Pano scheme for both), over two (non-adjacent) temporal frames. From these results, we can see that the baseline models are able to accurately track objects in very dense scenes. In addition, we note that the panorama model provides some qualitative benefits in these examples. In particular, the single-view model produces an inconsistent prediction on the crosswalk in the left and right images, but the panorama model is able to attain the full context of the scene and avoids this mistake. In addition, the single-view models fail to track the car crossing the front-right and side-right cameras at t₀, but the panorama model is again able to track this object correctly.

Figure 7: Comparison of qualitative results from our baseline ViP-DeepLab [51] models over different time intervals. Results show models trained on single images with panoptic stitching over cameras, and trained directly on panorama images. Our baseline models show strong performance for the majority of the scene, although tracking small/distant objects and crowded scenes remains challenging.

5.3 Baseline Comparisons

Video-based Baselines  In Tab. 2(a), we provide video-based baseline comparisons using ViP-DeepLab [51], evaluated with the proposed weighted STQ (wSTQ) metric. We compare different training and evaluation schemes. As shown in the table, when both are evaluated with the View scheme, training with the View scheme performs better than training with Ensemble-View by 0.86% wSTQ. That is, training a single model on all camera views performs better than training five camera-specific models, each on its own camera views. Also, when training with the View scheme, using the Pano
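The "panoptic stitching over cameras" post-processing described in Sec. 5.1 can be sketched in a few lines. In this illustration, instance masks are modeled as sets of panorama pixel coordinates restricted to the seam between two adjacent views; the 0.5 IoU threshold and all names are our assumptions rather than the paper's exact choices.

```python
def set_iou(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def stitch_ids(left_masks, right_masks, overlap, thresh=0.5):
    """Return {right_id: left_id} for instances merged across a camera seam.

    left_masks, right_masks: {instance_id: pixel set} from the two views,
    in a shared (panorama) coordinate frame.
    overlap: pixel set covered by both cameras' fields of view.
    """
    remap = {}
    for rid, rmask in right_masks.items():
        best_id, best_iou = None, thresh
        for lid, lmask in left_masks.items():
            # Compare masks only inside the overlap region, where both
            # cameras observe the same part of the scene.
            iou = set_iou(lmask & overlap, rmask & overlap)
            if iou > best_iou:
                best_id, best_iou = lid, iou
        if best_id is not None:
            remap[rid] = best_id
    return remap
```

Applying the returned remapping to the right view's prediction yields IDs that agree with the left view, which is exactly the property the wSTQ metric rewards in the overlap region.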
Table 2: Quantitative evaluation. During training and evaluation, the baselines can take different types of inputs. View: individual camera views; Pano: panoramas; Ensemble-View: camera-specific views. Results include (a) the video-baseline comparison using ViP-DeepLab, measured by weighted Segmentation and Tracking Quality (wSTQ); and (b) the image-baseline comparison using Panoptic-DeepLab, measured by Panoptic Quality (PQ) and mean Intersection-over-Union (mIoU).

(a) Video-baseline Comparison

Training Scheme | Eval Scheme | wSTQ  | wAQ  | wSQ
Ensemble-View   | View        | 16.92 | 7.61 | 37.33
View            | View        | 17.78 | 8.21 | 38.46
View            | Pano        | 14.87 | 6.13 | 36.04
Pano            | View        | 17.56 | 8.11 | 38.04
Pano            | Pano        | 15.72 | 6.22 | 39.78

(b) Image-baseline Comparison

Training Scheme | Eval Scheme | PQ    | mIoU
Ensemble-View   | View        | 35.70 | 48.15
View            | View        | 40.00 | 53.64
View            | Pano        | 33.65 | 50.61
Pano            | View        | 38.93 | 51.65
Pano            | Pano        | 36.32 | 52.19

Table 3: View transferability of our video-based baselines, measured by wSTQ. We evaluate models (1st column) trained on a specific view w.r.t. other camera views. The last row, MultiCamera, refers to the model trained with all camera views (i.e., training scheme View), and the last column, All, denotes the evaluation set using all camera views (i.e., evaluation scheme View).

Model \ Eval | Side Left | Front Left | Front | Front Right | Side Right | All
Side Left    | 18.79     | 17.41      | 14.56 | 19.06       | 19.40      | 16.31
Front Left   | 16.88     | 18.39      | 12.84 | 19.22       | 18.49      | 15.36
Front        | 16.58     | 18.02      | 14.54 | 18.55       | 18.96      | 15.98
Front Right  | 16.56     | 17.36      | 14.99 | 19.40       | 19.16      | 16.18
Side Right   | 17.91     | 16.50      | 13.16 | 18.23       | 20.47      | 15.65
MultiCamera  | 20.11     | 19.54      | 15.63 | 20.67       | 21.53      | 17.78

evaluation scheme degrades the performance by 2.91% wSTQ. When training with the Pano scheme, using the View scheme for evaluation is better than using the Pano scheme. We attribute this to the asymmetry between the training and evaluation settings: we could not use whole panorama images at resolution 1000 × 5875 as input during training (due to memory limits), and thus we only use a smaller crop size of 641 × 961. Ideally, the model should be evaluated in the same setting as its training.
The current best setting is trained and evaluated with the View scheme, reaching 17.78% wSTQ. We observe that our dataset is very challenging in terms of both tracking and segmentation, since our best wAQ is only 8.21% and our best wSQ is 39.78%.

Image-based Baselines In Tab. 2(b), we provide image-based baseline comparisons using Panoptic-DeepLab [11], evaluated by the image panoptic segmentation metric PQ [33] and the semantic segmentation metric mIoU [16]. We observe the same trends for the image-based baselines as for the video-based baselines.

5.4 Ablation Studies

Transferability of Models between Viewpoints In this ablation, we measure the ability to transfer models trained on one viewpoint to a different viewpoint. As shown in Tab. 3, we make the following observations. First, all models, even those trained on left side views, perform better on right side views. This phenomenon is due to the ego-vehicle driving on the right side of the road, which yields a wider scope, more instances, and smaller objects on the left side (i.e., the left side views are more challenging). Second, the front camera performance is inferior compared to the other cameras. We hypothesize that the front camera captures more diverse and challenging views, e.g., vehicles driving in multiple directions and more dynamic and smaller objects, making tracking more challenging.

6 Conclusion

In this work, we presented a new benchmark, the Waymo Open Dataset: Panoramic Video Panoptic Segmentation (WOD: PVPS) dataset. Our benchmark extends video panoptic segmentation to a more challenging multi-camera setting that requires consistent instance IDs both across cameras and over time. Our dataset is an order of magnitude larger than all the existing video panoptic segmentation datasets. We establish several strong baselines evaluated with a new metric, wSTQ, that takes multi-camera, multi-object tracking and segmentation into consideration. We will make our benchmark publicly available, and we hope that it will facilitate future research on panoramic video panoptic segmentation.
References

[1] Baqué, P., Fleuret, F., Fua, P.: Deep occlusion reasoning for multi-camera multi-target detection. In: ICCV (2017)
[2] Behley, J., Garbade, M., Milioto, A., Quenzel, J., Behnke, S., Stachniss, C., Gall, J.: SemanticKITTI: A dataset for semantic scene understanding of lidar sequences. In: ICCV (2019)
[3] Berclaz, J., Fleuret, F., Turetken, E., Fua, P.: Multiple object tracking using k-shortest paths optimization. PAMI 33(9), 1806–1819 (2011)
[4] Brostow, G.J., Fauqueur, J., Cipolla, R.: Semantic object classes in video: A high-definition ground truth database. Pattern Recognition Letters 30(2), 88–97 (2009)
[5] Caesar, H., Bankiti, V., Lang, A.H., Vora, S., Liong, V.E., Xu, Q., Krishnan, A., Pan, Y., Baldan, G., Beijbom, O.: nuScenes: A multimodal dataset for autonomous driving. In: CVPR (2020)
[6] Chang, M.F., Lambert, J., Sangkloy, P., Singh, J., Bak, S., Hartnett, A., Wang, D., Carr, P., Lucey, S., Ramanan, D., et al.: Argoverse: 3D tracking and forecasting with rich maps. In: CVPR (2019)
[7] Chavdarova, T., Baqué, P., Bouquet, S., Maksai, A., Jose, C., Bagautdinov, T., Lettry, L., Fua, P., Van Gool, L., Fleuret, F.: WILDTRACK: A multi-camera HD dataset for dense unscripted pedestrian detection. In: CVPR (2018)
[8] Chen, L.C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: Semantic image segmentation with deep convolutional nets and fully connected CRFs.
In: ICLR (2015)
[9] Chen, L.C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. TPAMI (2017)
[10] Chen, Y., Rong, F., Duggal, S., Wang, S., Yan, X., Manivasagam, S., Xue, S., Yumer, E., Urtasun, R.: GeoSim: Realistic video simulation via geometry-aware composition for self-driving. In: CVPR (2021)
[11] Cheng, B., Collins, M.D., Zhu, Y., Liu, T., Huang, T.S., Adam, H., Chen, L.C.: Panoptic-DeepLab: A simple, strong, and fast baseline for bottom-up panoptic segmentation. In: CVPR (2020)
[12] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The Cityscapes Dataset for Semantic Urban Scene Understanding. In: CVPR (2016)
[13] Dehghan, A., Modiri Assari, S., Shah, M.: GMMCP tracker: Globally optimal generalized maximum multi clique problem for multiple object tracking. In: CVPR (2015)
[14] Dendorfer, P., Ošep, A., Milan, A., Schindler, K., Cremers, D., Reid, I., Roth, S., Leal-Taixé, L.: MOTChallenge: A Benchmark for Single-camera Multiple Target Tracking. IJCV (2020)
[15] Eshel, R., Moses, Y.: Homography based multiple camera detection and tracking of people in a dense crowd. In: CVPR (2008)
[16] Everingham, M., Van Gool, L., Williams, C.K.I., Winn, J., Zisserman, A.: The Pascal Visual Object Classes (VOC) Challenge. IJCV (2010)
[17] Felzenszwalb, P.F., Huttenlocher, D.P.: Efficient graph-based image segmentation. IJCV (2004)
[18] Ferryman, J., Shahrokni, A.: PETS2009: Dataset and challenge. In: 2009 Twelfth IEEE International Workshop on Performance Evaluation of Tracking and Surveillance. pp. 1–6 (2009)
[19] Fleuret, F., Berclaz, J., Lengagne, R., Fua, P.: Multicamera people tracking with a probabilistic occupancy map. PAMI 30(2), 267–282 (2007)
[20] Gao, N., Shan, Y., Wang, Y., Zhao, X., Yu, Y., Yang, M., Huang, K.: SSAP: Single-Shot Instance Segmentation With Affinity Pyramid. In: ICCV (2019)
[21] Geiger, A., Lenz, P., Urtasun, R.: Are we ready for autonomous driving?
The KITTI vision benchmark suite. In: CVPR (2012)
[22] Geyer, J., Kassahun, Y., Mahmudi, M., Ricou, X., Durgesh, R., Chung, A.S., Hauswald, L., Pham, V.H., Mühlegg, M., Dorn, S., et al.: A2D2: Audi autonomous driving dataset. arXiv preprint arXiv:2004.06320 (2020)
[23] Han, X., You, Q., Wang, C., Zhang, Z., Chu, P., Hu, H., Wang, J., Liu, Z.: MMPTRACK: Large-scale densely annotated multi-camera multiple people tracking benchmark (2021)
[24] Hariharan, B., Arbeláez, P., Girshick, R., Malik, J.: Simultaneous detection and segmentation. In: ECCV (2014)
[25] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR (2016)
[26] He, X., Zemel, R.S., Carreira-Perpiñán, M.Á.: Multiscale conditional random fields for image labeling. In: CVPR (2004)
[27] Hofmann, M., Wolf, D., Rigoll, G.: Hypergraphs for joint multi-view reconstruction and multi-object tracking. In: CVPR (2013)
[28] Huang, X., Wang, P., Cheng, X., Zhou, D., Geng, Q., Yang, R.: The ApolloScape open dataset for autonomous driving and its application. PAMI (2020)
[29] Kendall, A., Gal, Y., Cipolla, R.: Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. In: CVPR (2018)
[30] Kim, D., Woo, S., Lee, J.Y., Kweon, I.S.: Video Panoptic Segmentation. In: CVPR (2020)
[31] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: ICLR (2015)
[32] Kirillov, A., Girshick, R., He, K., Dollár, P.: Panoptic Feature Pyramid Networks. In: CVPR (2019)
[33] Kirillov, A., He, K., Girshick, R., Rother, C., Dollár, P.: Panoptic Segmentation. In: CVPR (2019)
[34] Kuo, C.H., Huang, C., Nevatia, R.: Inter-camera association of multi-target tracks by on-line learned appearance affinity models. In: ECCV (2010)
[35] Ladický, L., Sturgess, P., Alahari, K., Russell, C., Torr, P.H.: What, where and how many?
Combining object detectors and CRFs. In: ECCV (2010)
[36] Li, Y., Chen, X., Zhu, Z., Xie, L., Huang, G., Du, D., Wang, X.: Attention-guided unified network for panoptic segmentation. In: CVPR (2019)
[37] Liang, J., Homayounfar, N., Ma, W.C., Xiong, Y., Hu, R., Urtasun, R.: PolyTransform: Deep polygon transformer for instance segmentation. In: CVPR (2020)
[38] Liao, Y., Xie, J., Geiger, A.: KITTI-360: A novel dataset and benchmarks for urban scene understanding in 2D and 3D. arXiv:2109.13410 (2021)
[39] Lin, T., Maire, M., Belongie, S.J., Bourdev, L.D., Girshick, R.B., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft COCO: Common objects in context. In: ECCV (2014)
[40] Ling, H., Acuna, D., Kreis, K., Kim, S.W., Fidler, S.: Variational amodal object completion. NeurIPS (2020)
[41] Liu, H., Peng, C., Yu, C., Wang, J., Liu, X., Yu, G., Jiang, W.: An End-to-End Network for Panoptic Segmentation. In: CVPR (2019)
[42] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR (2015)
[43] Luiten, J., Ošep, A., Dendorfer, P., Torr, P., Geiger, A., Leal-Taixé, L., Leibe, B.: HOTA: A Higher Order Metric for Evaluating Multi-Object Tracking. IJCV (2020)
[44] Mallya, A., Wang, T.C., Sapra, K., Liu, M.Y.: World-consistent video-to-video synthesis. In: ECCV (2020)
[45] Miao, J., Wei, Y., Wu, Y., Liang, C., Li, G., Yang, Y.: VSPW: A large-scale dataset for video scene parsing in the wild. In: CVPR (2021)
[46] Narioka, K., Nishimura, H., Itamochi, T., Inomata, T.: Understanding 3D semantic structure around the vehicle with monocular cameras.
In: IEEE Intelligent Vehicles Symposium (IV). pp. 132–137 (2018)
[47] Neuhold, G., Ollmann, T., Bulò, S.R., Kontschieder, P.: The Mapillary Vistas dataset for semantic understanding of street scenes. In: ICCV (2017)
[48] Philion, J., Fidler, S.: Lift, splat, shoot: Encoding images from arbitrary camera rigs by implicitly unprojecting to 3D. In: ECCV (2020)
[49] Porzi, L., Bulò, S.R., Colovic, A., Kontschieder, P.: Seamless Scene Segmentation. In: CVPR (2019)
[50] Qi, C.R., Zhou, Y., Najibi, M., Sun, P., Vo, K., Deng, B., Anguelov, D.: Offboard 3D object detection from point cloud sequences. In: CVPR (2021)
[51] Qiao, S., Zhu, Y., Adam, H., Yuille, A., Chen, L.C.: ViP-DeepLab: Learning visual perception with depth-aware video panoptic segmentation. In: CVPR (2021)
[52] Ristani, E., Solera, F., Zou, R., Cucchiara, R., Tomasi, C.: Performance measures and a dataset for multi-target, multi-camera tracking. In: ECCV Workshop on Benchmarking Multi-Target Tracking (2016)
[53] Ristani, E., Tomasi, C.: Features for multi-target multi-camera tracking and re-identification. In: CVPR (2018)
[54] Roddick, T., Cipolla, R.: Predicting semantic map representations from images using pyramid occupancy networks. In: CVPR (2020)
[55] Roshan Zamir, A., Dehghan, A., Shah, M.: GMCP-Tracker: Global multi-object tracking using generalized minimum clique graphs. In: ECCV (2012)
[56] Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A.C., Fei-Fei, L.: ImageNet Large Scale Visual Recognition Challenge. IJCV (2015)
[57] Schönberger, J.L., Zheng, E., Frahm, J.M., Pollefeys, M.: Pixelwise view selection for unstructured multi-view stereo. In: ECCV (2016)
[58] Shi, J., Malik, J.: Normalized cuts and image segmentation. PAMI (2000)
[59] Song, S., Zeng, A., Chang, A.X., Savva, M., Savarese, S., Funkhouser, T.: Im2Pano3D: Extrapolating 360° structure and semantics beyond the field of view. In: CVPR (2018)
[60] Su, Y.C., Grauman, K.: Making 360° video watchable in 2D: Learning videography for click free viewing. In: CVPR (2017)
[61] Sun, P., Kretzschmar, H., Dotiwalla, X., Chouard, A., Patnaik, V., Tsui, P., Guo,
J., Zhou, Y., Chai, Y., Caine, B., et al.: Scalability in perception for autonomous driving: Waymo Open Dataset. In: CVPR (2020)
[62] Tang, Z., Naphade, M., Liu, M.Y., Yang, X., Birchfield, S., Wang, S., Kumar, R., Anastasiu, D., Hwang, J.N.: CityFlow: A city-scale benchmark for multi-target multi-camera vehicle tracking and re-identification. In: CVPR (2019)
[63] Tateno, K., Navab, N., Tombari, F.: Distortion-aware convolutional filters for dense prediction in panoramic images. In: ECCV (2018)
[64] Thrun, S., Montemerlo, M.: The GraphSLAM algorithm with applications to large-scale mapping of urban structures. The International Journal of Robotics Research 25(5-6), 403–429 (2006)
[65] Tu, Z., Chen, X., Yuille, A.L., Zhu, S.C.: Image parsing: Unifying segmentation, detection, and recognition. IJCV (2005)
[66] Voigtlaender, P., Krause, M., Ošep, A., Luiten, J., Sekar, B.B.G., Geiger, A., Leibe, B.: MOTS: Multi-object tracking and segmentation. In: CVPR (2019)
[67] Wang, H., Luo, R., Maire, M., Shakhnarovich, G.: Pixel Consensus Voting for Panoptic Segmentation. In: CVPR (2020)
[68] Wang, H., Zhu, Y., Green, B., Adam, H., Yuille, A., Chen, L.C.: Axial-DeepLab: Stand-Alone Axial-Attention for Panoptic Segmentation. In: ECCV (2020)
[69] Weber, M., Luiten, J., Leibe, B.: Single-shot Panoptic Segmentation. In: IROS (2020)
[70] Weber, M., Wang, H., Qiao, S., Xie, J., Collins, M.D., Zhu, Y., Yuan, L., Kim, D., Yu, Q., Cremers, D., Leal-Taixé, L., Yuille, A.L., Schroff, F., Adam, H., Chen, L.C.: DeepLab2: A TensorFlow Library for Deep Labeling. arXiv:2106.09748 (2021)
[71] Weber, M., Xie, J., Collins, M., Zhu, Y., Voigtlaender, P., Adam, H., Green, B., Geiger, A., Leibe, B., Cremers, D., Ošep, A., Leal-Taixé, L., Chen, L.C.: STEP: Segmenting and Tracking Every Pixel. In: NeurIPS Track on Datasets and Benchmarks (2021)
[72] Wu, Y., Lim, J., Yang, M.H.: Online object tracking: A benchmark. In: CVPR (2013)
[73] Xiong, Y., Liao, R., Zhao, H., Hu, R., Bai, M., Yumer, E., Urtasun, R.: UPSNet: A Unified Panoptic Segmentation Network. In: CVPR (2019)
[74] Xu, C., Xiong, C., Corso, J.J.: Streaming hierarchical video segmentation. In: ECCV (2012)
[75] Xu, Y., Liu, X., Liu, Y., Zhu, S.C.:
Multi-view people tracking via hierarchical trajectory composition. In: CVPR (2016)
[76] Yang, B., Bai, M., Liang, M., Zeng, W., Urtasun, R.: Auto4D: Learning to label 4D objects from sequential point clouds. arXiv preprint arXiv:2101.06586 (2021)
[77] Yang, K., Hu, X., Bergasa, L.M., Romera, E., Wang, K.: PASS: Panoramic annular semantic segmentation. IEEE Transactions on Intelligent Transportation Systems 21(10), 4171–4185 (2019)
[78] Yang, K., Zhang, J., Reiß, S., Hu, X., Stiefelhagen, R.: Capturing omni-range context for omnidirectional segmentation. In: CVPR (2021)
[79] Yang, L., Fan, Y., Xu, N.: Video Instance Segmentation. In: ICCV (2019)
[80] Yang, T.J., Collins, M.D., Zhu, Y., Hwang, J.J., Liu, T., Zhang, X., Sze, V., Papandreou, G., Chen, L.C.: DeeperLab: Single-Shot Image Parser. arXiv:1902.05093 (2019)
[81] Yao, J., Fidler, S., Urtasun, R.: Describing the scene as a whole: Joint object detection, scene classification and semantic segmentation. In: CVPR (2012)
[82] Yogamani, S., Hughes, C., Horgan, J., Sistu, G., Varley, P., O'Dea, D., Uricár, M., Milz, S., Simon, M., Amende, K., et al.: WoodScape: A multi-task, multi-camera fisheye dataset for autonomous driving. In: ICCV (2019)
[83] Yu, F., Chen, H., Wang, X., Xian, W., Chen, Y., Liu, F., Madhavan, V., Darrell, T.: BDD100K: A diverse driving dataset for heterogeneous multitask learning. In: CVPR (2020)
[84] Zakharov, S., Kehl, W., Bhargava, A., Gaidon, A.: Autolabeling 3D objects with differentiable rendering of SDF shape priors. In: CVPR (2020)
[85] Zhang, C., Liwicki, S., Smith, W., Cipolla, R.: Orientation-aware semantic segmentation on icosahedron spheres. In: ICCV (2019)