Waymo Open Dataset:
Panoramic Video Panoptic Segmentation
Jieru Mei1*
Alex Zihao Zhu2
Xinchen Yan2
Hang Yan2
Siyuan Qiao3
Yukun Zhu3
Liang-Chieh Chen3
Henrik Kretzschmar2
Dragomir Anguelov2
1Johns Hopkins University
2Waymo LLC
3Google Research
Abstract
Panoptic image segmentation is the computer vision task of ﬁnding groups of
pixels in an image and assigning semantic classes and object instance identiﬁers
to them. Research in image segmentation has become increasingly popular due
to its critical applications in robotics and autonomous driving. The research
community thereby relies on publicly available benchmark datasets to advance
the state-of-the-art in computer vision. Due to the high costs of densely labeling
the images, however, there is a shortage of publicly available ground truth
labels that are suitable for panoptic segmentation. The high labeling costs also
make it challenging to extend existing datasets to the video domain and to multi-
camera setups. We therefore present the Waymo Open Dataset: Panoramic Video
Panoptic Segmentation Dataset, a large-scale dataset that offers high-quality
panoptic segmentation labels for autonomous driving. We generate our dataset
using the publicly available Waymo Open Dataset, leveraging the diverse set
of camera images. Our labels are consistent over time for video processing and
consistent across multiple cameras mounted on the vehicles for full panoramic
scene understanding. Speciﬁcally, we offer labels for 28 semantic categories
and 2,860 temporal sequences that were captured by ﬁve cameras mounted on
autonomous vehicles driving in three different geographical locations, leading
to a total of 100k labeled camera images. To the best of our knowledge, this
makes our dataset an order of magnitude larger than existing datasets that
offer video panoptic segmentation labels. We further propose a new benchmark
for Panoramic Video Panoptic Segmentation and establish a number of strong
baselines based on the DeepLab family of models. We will make the benchmark
and the code publicly available, which we hope will facilitate future research on
holistic scene understanding. Find the dataset at https://waymo.com/open.
1 Introduction
Semantic visual scene understanding has been studied extensively for decades in
the ﬁeld of computer vision [58, 65, 17, 35, 81, 74]. Researchers have tackled tasks
*Work done as an intern at Waymo.
arXiv:2206.07704v1  [cs.CV]  15 Jun 2022
[Figure 1 panels: Diverse Scenes (Residential, Rain, Dense Urban, Night, Highway); Multi-Camera Annotations (SL, FL, F, FR, SR); Temporal Sequences]
Figure 1: We provide panoptic segmentation labels for 100k camera images of the Waymo Open Dataset.
Our dataset is grouped into 2,860 temporal sequences captured by ﬁve cameras, mounted on autonomous
vehicles driving in three geographical locations. Instance segmentation labels are consistent both across
cameras and over time. Our dataset offers diversity in terms of object classes, locations, weather, and
time of day.
of varying difﬁculty, ranging from segmenting distinct objects in individual camera
images [26, 24, 42, 9] to tracking and segmenting multiple objects in videos [72, 66,
14]. Robotic applications, such as autonomous driving, have led to new challenges
and opportunities for semantic visual scene understanding [21, 12].
Modern autonomous vehicles tend to be equipped with multiple cameras and
LiDAR scanners. The cameras provide rich semantic information about the scene,
whereas the LiDAR scanners capture sparse, but geometrically highly accurate
information. Autonomous vehicles need to be able to fuse and interpret the data
stream from multiple sensors to build and maintain over time an accurate and
consistent estimate of the world. One challenge when tracking and segmenting
multiple objects is that objects of interest may leave the ﬁeld of view of a camera to
enter the ﬁeld of view of another camera across consecutive video frames.
In this paper, we study the new task of video panoptic segmentation [33, 30]
for autonomous vehicles equipped with multiple cameras. See Fig. 1 for an illus-
tration. Panoptic segmentation enables autonomous vehicles to reason about their
surroundings in terms of semantic and geometric properties, such as fine-grained
object contours. There are also important offboard applications, including auto-
labeling [84, 76, 50] and camera sensor simulation [44, 40, 10]. On the one hand,
most existing panoptic segmentation datasets [12, 47] provide labels for individual
camera images. This makes it difﬁcult to train models that fuse information from
multiple camera images, either temporally or by leveraging a multi-camera setup.
On the other hand, datasets that provide panoptic segmentation labels for video
data [30, 71] tend to be scarce and much smaller than datasets for object detection
and tracking for autonomous driving [21, 61]. To bridge this gap, we present a new
benchmark dataset for panoptic segmentation based on the popular Waymo Open
Dataset (WOD). Speciﬁcally, we provide panoptic segmentation labels for video
data that are consistent across ﬁve cameras mounted on the vehicles. We further
present a benchmark that captures the task of multi-camera panoptic segmentation
in video data for autonomous driving. Overall, we provide panoptic segmentation
labels for 100k camera images, which we group into training (70%), validation (10%)
and test (20%) sets. The training set consists of 2,800 sequences, each of which
comprises labels for ﬁve cameras spanning 1.2 seconds and ﬁve temporal frames.
In contrast, our validation and test sets consist of 60 longer sequences, in order to
facilitate the evaluation of long-term tracking. Each validation and test sequence
consists of 100 temporal frames, spanning the full 20s of a scene, while also provid-
ing labels across all ﬁve cameras. We extend the Segmentation and Tracking Quality
(STQ) metric [71] to support our multi-camera setup by computing a weight for
pixels depending on the cameras they correspond to. We also extend a state-of-the-
art video panoptic segmentation method, ViP-DeepLab [51], to our multi-camera
setup by training separate models on each camera view and by training a model
on a panorama generated from all views. We present an extensive experimental
evaluation on the proposed dataset and metric.
We published the full dataset to enhance video panoptic segmentation research
while also opening up the ﬁeld of panoramic video panoptic segmentation.
2 Related Work
Panoptic Segmentation
The task of panoptic segmentation [33] aims to unify
semantic segmentation [26] and instance segmentation [24], requiring assigning a
class label and instance ID to all pixels in an image. Modern panoptic segmentation
systems can be roughly categorized into top-down (or proposal-based) [32, 49,
36, 41, 73, 69] and bottom-up (or proposal-free) [80, 20, 67, 11, 68] approaches. Our
adopted baseline methods belong to the bottom-up category.
Video Panoptic Segmentation
Extending panoptic segmentation to the video
domain, Video Panoptic Segmentation (VPS) [30] requires generating instance
tracking IDs (i.e., temporally consistent instance IDs) along with panoptic
segmentation results across video frames. Current VPS datasets are small in terms
of both the number of semantic classes and the number of frames. Specifically,
Cityscapes-VPS [30] sparsely annotates (every fifth frame) Cityscapes [12] video
sequences, resulting in only 3,000 frames with 19 semantic classes for training
and testing. Recently, STEP [71] extended KITTI-MOTS [21, 66] and MOTS-Challenge
[66, 14] for VPS. However, the resulting datasets are still small-scale (18K
annotated frames with 19 semantic classes for KITTI-STEP, and 2K frames with 8
classes for MOTChallenge-STEP), and the video sequences are captured only by a
single front-view camera. In contrast, our annotated dataset presents the first
large-scale VPS annotations and extends to the multi-camera scenario.
Segmentation Benchmarks
Other popular video segmentation benchmarks in the literature include VSPW [45]
for video semantic segmentation, and MOTS [66] and Youtube-VIS [79] for video
instance segmentation. Our benchmark is also related to urban scene
understanding, where typical benchmarks
include [4, 21, 39, 12, 47, 6, 82, 2, 61, 5, 37, 28, 83, 78, 38, 22]. Our work is most re-
Table 1: Dataset comparison. Our WOD: PVPS is a new large-scale panoramic video panoptic segmenta-
tion dataset. †WildPASS contains 500 panoramas.
dataset statistics    WOD: PVPS (ours)   WildPASS [78]   Cityscapes-VPS [30]   KITTI-STEP [71]   MOT-STEP [71]
# sequences           2,860              -               500                   50                4
# images              100,000            500†            3,000                 19,103            2,075
# tracking classes    8                  -               8                     2                 1
# semantic classes    28                 8               19                    19                7
panoramic             ✓                  ✓               ✗                     ✗                 ✗
video panoptic        ✓                  ✗               ✓                     ✓                 ✓
lated to WildPASS [78], which also aims to endow machines with large ﬁeld-of-view
perception. However, building on top of the large-scale Waymo Open Dataset [61],
our benchmark provides many more high-quality annotated video sequences.
Multi-Camera Multi-Object Tracking
Multi-camera multi-object tracking [19, 15, 3, 55, 27, 13, 1, 53], which aims to
consistently track objects across multiple cameras, has been a popular research
topic in the computer vision community. Typical benchmarks [18, 34, 75, 52, 7,
62, 23] only track a single class (e.g., people or vehicles) with bounding boxes,
while our proposed benchmark demands pixel-level tracking and segmentation for
multiple classes.
Panoramic Semantic Segmentation
Panoramic semantic segmentation provides surround-view perception [59, 46, 63,
85, 78, 77], but is limited to semantic segmentation without temporal and
instance-level understanding. Our work is similar, but additionally tackles video
panoptic segmentation. Recently, [54, 48] predicted bird’s-eye view semantic
segmentation using multi-camera inputs.
3 WOD: PVPS Dataset
In this section, we ﬁrst recap the existing Waymo Open Dataset (WOD) [61], one
of the largest and most diverse multi-sensor datasets in the autonomous driving
domain. We leverage the existing data that comes with coarse-level annotations
(e.g., 2D and 3D bounding boxes) as the foundation, and subsample images for
our dataset. We then provide an overview of our WOD: PVPS dataset, including
panorama generation, statistics of the semantic classes, and temporal frame sampling.
Finally, we explain in detail our hybrid scheme to address the challenges in
multi-camera and video labeling. We obtain consistent instance IDs across temporal
frames and cameras by associating the panoptic labels from each individual image
with the existing box-level annotations.
3.1 Dataset Overview
The Waymo Open Dataset contains 1,150 scenes, each consisting of 20 seconds of
data captured at 10Hz (i.e., 10 frames per second, and thus 200 frames per scene).
Each data frame in the dataset includes 3D point clouds from the LiDAR devices,
images from ﬁve cameras (positioned at Front, Front-Left, Front-Right, Side-Left,
and Side-Right), and ground truth 3D and 2D bounding boxes annotated by humans
in the LiDAR point clouds and camera images, respectively. Each bounding box
contains an ID that is unique to that object across the entirety of each scene. For the
LiDAR data, this allows for tracking in the whole scene. For the camera data, these
IDs are consistent within each camera’s images only.
Figure 2: Histogram of the 28 semantic categories in our dataset in terms of their pixel distributions. The
vertical axis denotes the number of pixels for each class in log scale. We provide instance IDs for classes
marked with diamonds.
Figure 3: Super-class distributions for each camera. Each camera sees a different distribution of classes,
due to their ﬁxed positions and different ﬁeld-of-views.
Built on top of the WOD, our WOD: PVPS dataset consists of 100,000 images
with panoptic segmentation labels using a prescribed train, validation, and test set
split, subsampled from the existing 1.15 million images. In Tab. 1, we compare our
proposed WOD: PVPS dataset with the public datasets for video panoptic segmen-
tation. Our dataset is the only one that provides panoptic segmentation annotations
that are consistent both across multiple cameras and across time. Furthermore, our
dataset is much larger both in terms of number of frames and number of semantic
classes than existing datasets [30, 78, 71].
Equirectangular Panorama
We reconstruct the equirectangular panorama
(220° coverage from five cameras) by stitching the individual camera images, as an
alternative input format to our dataset. Specifically, we first use the extrinsics
and intrinsics of the five cameras provided by WOD to unproject each pixel
coordinate to 3D space. We then set a virtual camera [60] located at the geometric
mean of all five camera centers and compute the pixel colors by equirectangular
projection from 3D space with bilinear sampling. For pixels that correspond to
multiple camera views, we compute the weights based on the distance of each pixel
in the panorama to each of the camera views’ boundaries. For panoptic labels, we
compute labels in each camera view given the camera parameters of the five cameras
and the virtual camera using nearest-neighbor sampling. We then use the method of
Qiao et al. [51] to stitch the panorama labels to maintain view consistency.
Finally, we fuse the five panorama labels together based on the correspondences
and the distances to the camera views’ boundaries. There are more sophisticated
methods [57, 64] that leverage cross-frame information and the geometry captured
from LiDAR sensors to potentially improve panorama generation. We leave this as an
open topic for future research.
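As a rough illustration of the projection step, the sketch below maps a 3D point expressed in the virtual camera frame to equirectangular pixel coordinates, and normalizes boundary distances into blending weights. Several details are assumptions made for this example rather than details from the paper: the axis convention (x forward, y left, z up), the 90° vertical coverage, the function names, and the choice to make blending weights directly proportional to the boundary distance.

```python
import math


def equirect_pixel(point, width, height, h_fov_deg=220.0, v_fov_deg=90.0):
    """Map a 3D point (x forward, y left, z up; assumed convention) to (u, v).

    Returns None when the point lies outside the panorama's angular coverage.
    v_fov_deg is a hypothetical value; only the 220 degree horizontal coverage
    is stated in the text.
    """
    x, y, z = point
    azimuth = math.atan2(y, x)                     # positive towards the left
    elevation = math.atan2(z, math.hypot(x, y))    # positive upwards
    h_half = math.radians(h_fov_deg) / 2.0
    v_half = math.radians(v_fov_deg) / 2.0
    if abs(azimuth) > h_half or abs(elevation) > v_half:
        return None
    # Angles map linearly to pixels; the left of the scene lands on the left.
    u = (0.5 - azimuth / (2.0 * h_half)) * width
    v = (0.5 - elevation / (2.0 * v_half)) * height
    return u, v


def blend_weights(boundary_distances):
    """Normalize per-camera distances to the view boundary into weights.

    Assumes weights proportional to distance; the exact weighting function
    is not specified in the text.
    """
    total = sum(boundary_distances)
    if total == 0:
        return [1.0 / len(boundary_distances)] * len(boundary_distances)
    return [d / total for d in boundary_distances]
```

For example, with a 2200 × 900 panorama, a point straight ahead maps to the image center, and a point behind the vehicle falls outside the 220° coverage and is rejected.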
Semantic Class Distribution
In total, our dataset contains 28 semantic cate-
gories, outlined with their frequency in pixels in Fig. 2. In addition, we provide
instance IDs for most of the classes under the vehicle and human super-classes,
as they are major dynamic categories in the autonomous driving space. We also
outline the pixel distribution for each camera view in Fig. 3, where we see notable
differences in the distributions in each camera. For example, the front camera covers
more of flat (e.g., road surfaces) and sky pixels than the rest of the cameras, while
the side left camera covers more vehicle pixels due to the ego-vehicle driving
on the right hand side of the road. This analysis is important as machine learning
models trained on the images captured by a single camera from the existing datasets
may not necessarily generalize to the other cameras due to large domain gaps across
different cameras. In contrast, our proposed task has an emphasis on the holistic
scene understanding, which grants our WOD: PVPS dataset unique value to the
research community.
Temporal Frame Sampling for Human Annotations
To maximize the diver-
sity of the images on the training set, we subsample sparsely from each scene,
labeling chunks of ﬁve-frame sequences from all the cameras. We start by ran-
domly selecting 700 out of the 798 scenes. For each scene, which typically has 200
frames, we annotate four sets of ﬁve-frame sequences, starting at frame indices
{25, 50, 125, 150} (i.e., we pick the 25th, 50th, 125th, and 150th frames as the first frame
of each ﬁve-frame sequence for annotation). For each set, we further select frames
with offsets {0, 4, 6, 8, 12} w.r.t. the ﬁrst frame for annotations. For example, the ﬁrst
set of ﬁve-frame sequences will contain frames with indices {25, 29, 31, 33, 37}. Our
sparse sampling strategy facilitates a variety of different sequence lengths, allowing
users to train on frame pairs with time difference as small as two frames (0.2 sec-
onds) and as large as 12 frames (1.2 seconds). As a result, our training set contains
groups of ﬁve temporal frames across all ﬁve cameras, yielding 2,800 sequences
of 25 images (5 temporal frames × 5 cameras), or 70,000 images in total. Finally,
we provide the associations between each instance ID and the corresponding 3D
LiDAR bounding box, allowing us to compute very long associations (up to 13.7
seconds between all four sequences), if an object persists across multiple sequences
in the same scene.
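The sampling arithmetic above can be checked with a few lines of code. The sketch below only restates the numbers given in the text (start frames, offsets, 700 scenes, five cameras, 10Hz); the variable and function names are illustrative.

```python
FRAME_RATE_HZ = 10                 # WOD capture rate
START_FRAMES = (25, 50, 125, 150)  # first frame of each five-frame sequence
OFFSETS = (0, 4, 6, 8, 12)         # offsets w.r.t. the first frame
NUM_CAMERAS = 5
NUM_TRAIN_SCENES = 700


def training_frame_indices():
    """Frame indices of the four labeled five-frame sequences in one scene."""
    return [[start + offset for offset in OFFSETS] for start in START_FRAMES]


sequences = training_frame_indices()
num_sequences = NUM_TRAIN_SCENES * len(sequences)        # sequences in the training set
images_per_sequence = len(OFFSETS) * NUM_CAMERAS         # temporal frames x cameras
num_train_images = num_sequences * images_per_sequence

# Frame gaps available between labeled frames of the same sequence.
gaps = sorted({b - a for seq in sequences for a in seq for b in seq if b > a})
min_gap_s = gaps[0] / FRAME_RATE_HZ
max_gap_s = gaps[-1] / FRAME_RATE_HZ

# Longest association within a scene: first labeled frame to last labeled frame.
longest_span_s = (sequences[-1][-1] - sequences[0][0]) / FRAME_RATE_HZ
```

Here the first sequence is {25, 29, 31, 33, 37}, the last labeled frame of a scene is frame 162, and the span from frame 25 to frame 162 gives the 13.7 seconds quoted above.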
For the validation and test sets, we aim to enable the testing of long-term
consistency across cameras and frames. We therefore densely sample frames at
5Hz across all cameras (i.e., every other frame of the 200-frame scene is
labeled), resulting in sequences with 100 temporal frames across all five
cameras. We select 20 and 40 scenes for the validation and test sets,
respectively, while maintaining diversity in the location, object density, and
time of day distributions of WOD. In the end, our validation set contains
annotations for 20 sequences of 500 images (100 temporal frames × 5 cameras),
and our test set consists of 40 sequences of 500 images (10,000 and 20,000
annotated images in total for the validation and test sets, respectively). The
test set annotations will not be made publicly available; instead, we will
prepare a test server to evaluate on the held-out test set once the dataset is
released.
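As a quick consistency check on the split sizes, the sketch below recomputes the image counts from the figures in the text. Starting the 5Hz subsampling at frame index 0 is an assumption for this example, since the text only states that every other frame is labeled.

```python
FRAMES_PER_SCENE = 200  # 20 seconds at 10Hz
NUM_CAMERAS = 5


def eval_frame_indices(step=2, first=0):
    """Every `step`-th frame of a scene; first=0 is an assumed starting index."""
    return list(range(first, FRAMES_PER_SCENE, step))


def split_image_counts():
    """Number of labeled images per split, from the counts given in the text."""
    images_per_eval_sequence = len(eval_frame_indices()) * NUM_CAMERAS
    return {
        "train": 2800 * 5 * NUM_CAMERAS,        # 2,800 sequences x 5 frames x 5 cameras
        "val": 20 * images_per_eval_sequence,   # 20 scenes
        "test": 40 * images_per_eval_sequence,  # 40 scenes
    }


counts = split_image_counts()
total_images = sum(counts.values())
fractions = {name: count / total_images for name, count in counts.items()}
```

The resulting fractions reproduce the 70% / 10% / 20% training, validation, and test split of the 100k labeled images.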
3.2 Associating Instance IDs Across Cameras and Frames
In constructing the panoramic video panoptic segmentation dataset, ensuring the
annotations have consistent instance IDs across cameras and temporal frames is
one of the major challenges. Manual labeling is a straightforward option, but is
time-consuming and expensive at large scales. In addition, it is difficult to develop
an effective labeling interface that allows human annotators to iteratively refine
instance labels across cameras and temporal frames.
[Figure 4 panels: camera images from the Front Left Camera (t = 0, 1, 2) and Front Camera (t = 2), and LiDAR point clouds (t = 2); legend: 3D Box Annotations, 2D Box Annotations, Projected LiDAR Points. Step 1: Human Annotation (Human Annotated Instance Labels); Step 2: 3D Box Association (Associated Instance Labels, 3D Box); Step 3: 2D Box Association (Associated Instance Labels, 3D + 2D Box)]
Figure 4: Labeling and Association Overview. Human annotators ﬁrst label each camera image for
panoptic segmentation separately (step 1). LiDAR points within each ground truth 3D bounding box are
then projected to each image, and associated with the single frame instance labels (step 2). For far-range
instances without corresponding 3D bounding boxes, we associate the single frame instance labels over
time using the ground truth 2D bounding boxes within each camera (step 3). New associations are
highlighted in the zoomed-in views at the bottom.
We instead assigned human annotators to label each camera image for panoptic
segmentation separately and employed a hybrid scheme that leverages the existing
coarse-level annotations in WOD. The coarse-level annotations include (i) 3D
bounding boxes with corresponding IDs that are consistent across all frames and
cameras; and (ii) 2D bounding boxes with IDs that are consistent across temporal
frames, but annotated independently for each camera. Associations were then
computed between each instance and its corresponding 3D LiDAR boxes and 2D
camera boxes. Instances determined to correspond to the same object are then
mapped to the same ID in all frames across cameras. A sample sequence from this
process can be found in Fig. 4.
For a given frame with instance labels, 3D point clouds, and 3D bounding boxes,
we associate instances with boxes by ﬁltering the LiDAR points within each box,
and projecting them onto the image. Association scores are then computed using
IoU between the convex hull of the projected LiDAR points and each instance label.
Bipartite matching is then applied to match each projected box with an instance label.
For 3D driving scenes, points inside the bounding box almost entirely correspond
to the instances inside of them, and so these projected LiDAR points have a high
overlap with their corresponding instance masks in the image. Our label association
step is related to prior work [28, 38], but our association leverages the
ground-truth labeled 3D boxes and only transfers instance IDs rather than
fine-grained per-pixel labels.
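The association score described above can be sketched as follows. This example assumes the LiDAR points inside a 3D box have already been projected to 2D image coordinates, and that an instance mask is given as a set of integer (x, y) pixels; it rasterizes the convex hull at integer pixel positions, which is one simple choice among several. The function names are illustrative, and the subsequent bipartite matching is omitted here.

```python
import math


def cross(o, a, b):
    """Z-component of the cross product (a - o) x (b - o)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def hull_pixels(hull):
    """Integer pixels lying inside or on the boundary of a convex polygon."""
    if len(hull) < 3:
        return set()  # degenerate hull (fewer than three non-collinear points)
    xs = [p[0] for p in hull]
    ys = [p[1] for p in hull]
    pixels = set()
    for x in range(math.floor(min(xs)), math.ceil(max(xs)) + 1):
        for y in range(math.floor(min(ys)), math.ceil(max(ys)) + 1):
            edges = zip(hull, hull[1:] + hull[:1])
            if all(cross(a, b, (x, y)) >= 0 for a, b in edges):
                pixels.add((x, y))
    return pixels


def hull_mask_iou(projected_points, mask):
    """IoU between the convex hull of projected points and an instance mask."""
    region = hull_pixels(convex_hull(projected_points))
    union = len(region | mask)
    return len(region & mask) / union if union else 0.0
```

A score matrix built from `hull_mask_iou` over all (box, instance) pairs in a frame would then be the input to the matching step.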
There are, however, a small number of instances without corresponding LiDAR
ground truth boxes due to occlusions, rolling shutter artifacts, and the limited range
of the provided LiDAR scans (75m). We apply an additional matching step by
associating the 2D bounding boxes with our instance labels. First, we score matches
between 2D boxes and instances by computing the IoU between each 2D box and
the tightly-ﬁtting bounding boxes around each instance mask, and then compute
associations with bipartite matching.
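A minimal sketch of this 2D matching step is given below. It assumes boxes are (xmin, ymin, xmax, ymax) tuples and masks are sets of integer (x, y) pixels. The `min_iou` threshold is a hypothetical parameter, since the text does not state one, and the exhaustive search over assignments is a stand-in that is only practical for a handful of objects; a Hungarian-algorithm solver such as `scipy.optimize.linear_sum_assignment` would be the usual choice at scale.

```python
from itertools import permutations


def tight_box(mask):
    """Tightly-fitting (xmin, ymin, xmax, ymax) box around a set of pixels."""
    xs = [x for x, _ in mask]
    ys = [y for _, y in mask]
    return (min(xs), min(ys), max(xs) + 1, max(ys) + 1)


def box_iou(a, b):
    """IoU of two axis-aligned boxes."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union


def match_boxes_to_instances(boxes, masks, min_iou=0.5):
    """Return {instance_id: box_id} maximizing the total IoU of matched pairs.

    boxes: dict box_id -> box; masks: dict instance_id -> set of (x, y) pixels.
    min_iou is a hypothetical cut-off for discarding weak matches.
    """
    box_ids, inst_ids = list(boxes), list(masks)
    inst_boxes = {i: tight_box(m) for i, m in masks.items()}
    score = {(b, i): box_iou(boxes[b], inst_boxes[i])
             for b in box_ids for i in inst_ids}
    swap = len(box_ids) > len(inst_ids)
    small, large = (inst_ids, box_ids) if swap else (box_ids, inst_ids)
    best, best_total = [], -1.0
    for perm in permutations(large, len(small)):
        # Each candidate assignment is stored as (box_id, instance_id) pairs.
        pairs = [(l, s) if swap else (s, l) for s, l in zip(small, perm)]
        total = sum(score[p] for p in pairs)
        if total > best_total:
            best, best_total = pairs, total
    return {i: b for b, i in best if score[(b, i)] >= min_iou}
```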
For boxes with existing 3D associations, we extend these tracks by propagating
the existing ID to all other instances that match with the same 2D box. This resolves
cases where an object track misses 3D associations in only a few frames. Then, we
assign the remaining instances without any matches to the ID of their corresponding
2D box, if any. Finally, to capture any additional cross-camera associations, we
project all of the camera views onto the panorama, and associate instances that
overlap in this joint representation.
In order to identify any instances that are still not associated with any ground
truth boxes after these steps, we provide an additional mask for these instance
pixels indicating that they are not tracked, similar to the crowd mask used in single
frame instance segmentation labels [12].
4 Benchmark and Evaluation Metrics
In this section, we ﬁrst describe the task of Panoramic Video Panoptic Segmentation
(PVPS). Then we review the evaluation metrics used in the literature, and propose a
new metric designed for PVPS with an emphasis on consistent multi-object tracking
and segmentation across multiple cameras.
4.1 Problem Definition
We represent a multi-camera video sequence with $T$ frames and $M$ independent
camera views as $\{I_i^{1:T}\}_{i=1}^{M}$, where $I_i^t$ is the $i$-th camera view
captured at the $t$-th time step in the video sequence. Along with the multi-view
representation of the full scene, we define the panorama at the $t$-th time step as
$I_{\mathrm{pano}}^t$. In the task of Panoramic Video Panoptic Segmentation (PVPS),
we require a mapping $f$ of every pixel $(x, y, t, i)$ in the multi-camera video
sequence to a semantic category $c \in C$ and an instance ID $z$ consistent across
camera views and temporal frames. Here, $(x, y, t, i)$ indicates the spatial
coordinate $(x, y)$ of the $i$-th camera view captured at the $t$-th time step, and
$C$ is the set of semantic categories. Accordingly, we define the mappings
$f_{\mathrm{id}}$ and $f_{\mathrm{sem}}$ for a particular instance ID $z$ and
semantic category $c$ in Eq. (1) and Eq. (2), respectively. The mapping functions
are the building blocks of our proposed metric introduced in Sec. 4.2.
\[ f_{\mathrm{id}}(z) = \{(x, y, t, i) \mid f(x, y, t, i) = (c, z),\ c \in C\}, \tag{1} \]
\[ f_{\mathrm{sem}}(c) = \{(x, y, t, i) \mid f(x, y, t, i) = (c, *),\ c \in C\}. \tag{2} \]
Compared to existing tasks, including Video Panoptic Segmentation (VPS)
and Panoramic Semantic Segmentation, the proposed task is more challenging in the
following aspects. First, each individual camera has its own unique viewpoint and
ﬁeld-of-view such that the semantic class statistics are different across cameras (e.g.,
see Fig. 3). This leads to a large domain gap between videos captured with different
cameras. Second, the instance ID prediction, with the long-term consistency across
both time and cameras, requires holistic scene understanding.
4.2 Evaluation Metrics
In this subsection, we review the existing Video Panoptic Segmentation (VPS)
metric: Segmentation and Tracking Quality (STQ) [71], which we extend to evaluate
the Panoramic Video Panoptic Segmentation (PVPS) task.
VPS Metric
We use $f$ and $g$ to indicate the prediction and ground-truth mapping,
respectively. We define the true positive associations (TPA) [43] of a specific
instance as $\mathrm{TPA}(z_f, z_g) = |f_{\mathrm{id}}(z_f) \cap g_{\mathrm{id}}(z_g)|$,
where $z_f$ is the predicted instance, $z_g \in G$ is the ground-truth instance,
and $G$ is the set containing all unique ground-truth instances across cameras and
temporal frames. Similarly, false negative associations (FNA) and false positive
associations (FPA) can be defined to compute
[Figure 5 panels: Side Left, Front Left, Front, Front Right, Side Right]
Figure 5: Visualization of the weights tensor for all cameras. Pixels in the blue region have weights 0.5
during evaluation, as they are covered by two cameras.
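the Intersection over Union ($\mathrm{IoU}_{\mathrm{id}}$) for evaluating tracking
quality. Formally, STQ is defined as follows.
\[
\begin{aligned}
\mathrm{STQ} &= (\mathrm{AQ} \times \mathrm{SQ})^{\frac{1}{2}}, \\
\mathrm{AQ} &= \frac{1}{|G|} \sum_{z_g \in G} \frac{1}{|g_{\mathrm{id}}(z_g)|}
  \sum_{z_f,\, |z_f \cap z_g| \neq \emptyset}
  \mathrm{TPA}(z_f, z_g) \times \mathrm{IoU}_{\mathrm{id}}(z_f, z_g), \\
\mathrm{SQ} &= \frac{1}{|C|} \sum_{c \in C}
  \frac{|f_{\mathrm{sem}}(c) \cap g_{\mathrm{sem}}(c)|}{|f_{\mathrm{sem}}(c) \cup g_{\mathrm{sem}}(c)|}.
\end{aligned}
\tag{3}
\]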
As defined in Eq. (3), STQ fairly balances segmentation and tracking performance,
and is suitable for evaluating video sequences of arbitrary length. The
Association Quality (AQ) measures the association quality for tracking classes,
while the Segmentation Quality (SQ) measures the segmentation quality for seman-
tic classes. Speciﬁcally, AQ involves the IoUid computation for predicted instance
IDs (and further weighted by true positive associations to encourage long-term
tracking [71]), while SQ is the typical semantic segmentation metric [16] (i.e., mean
IoUsem for predicted semantic classes).
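To make the definitions concrete, the following is a minimal sketch of STQ on toy data. It assumes that predictions and ground truth are given as dictionaries from a pixel key (x, y, t, i) to a (class, instance ID) pair, that IoU_id is computed as TPA / (TPA + FPA + FNA), i.e., the IoU of the two pixel sets, and that SQ averages over the classes present in either map. The function names and data layout are illustrative and are not the official evaluation code.

```python
from collections import defaultdict
from math import sqrt


def group_pixels(mapping, index):
    """Group pixel keys by class (index=0) or by instance ID (index=1)."""
    groups = defaultdict(set)
    for pixel, label in mapping.items():
        if label[index] is not None:
            groups[label[index]].add(pixel)
    return groups


def stq(pred, gt):
    """Return (STQ, AQ, SQ) for dicts mapping (x, y, t, i) -> (class, instance_id).

    instance_id is None for pixels that carry no tracked instance.
    """
    f_id, g_id = group_pixels(pred, 1), group_pixels(gt, 1)
    f_sem, g_sem = group_pixels(pred, 0), group_pixels(gt, 0)

    # Association quality: average over all ground-truth instances.
    aq = 0.0
    for g_pixels in g_id.values():
        score = 0.0
        for f_pixels in f_id.values():
            tpa = len(f_pixels & g_pixels)
            if tpa:  # only predictions that overlap this ground-truth instance
                iou_id = tpa / len(f_pixels | g_pixels)
                score += tpa * iou_id
        aq += score / len(g_pixels)
    aq = aq / len(g_id) if g_id else 0.0

    # Segmentation quality: mean IoU over semantic classes.
    classes = set(f_sem) | set(g_sem)
    sq = sum(len(f_sem[c] & g_sem[c]) / len(f_sem[c] | g_sem[c]) for c in classes)
    sq = sq / len(classes) if classes else 0.0

    return sqrt(aq * sq), aq, sq
```

For instance, if a two-pixel ground-truth car is predicted as a one-pixel car, the single ground-truth instance has TPA = 1 and IoU_id = 1/2, so AQ = (1 × 1/2) / 2 = 0.25.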
PVPS Metric
We propose to extend the metric STQ [71] for Panoramic Video
Panoptic Segmentation (PVPS). However, naïvely adopting STQ for the multi-camera
scenario results in a potential issue, where pixels in the overlapping regions
covered by multiple cameras will be counted multiple times. Instead, we employ
a simple and effective solution by exploiting the pixel-centric property of STQ.
In particular, we weight each pixel prediction w.r.t. its coverage by the number
of cameras, as determined by the mapping between the camera images and the
panorama image. For example, if a pixel is covered by N cameras (in our dataset,
N = 2), its prediction will contribute 1/N when computing AQ and SQ. We name
the resulting metrics as weighted STQ (wSTQ), since each pixel prediction takes a
different weight depending on its coverage by the number of cameras. In Fig. 5, we
visualize the weights for an example of ﬁve-camera images.
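The weighting can be sketched by replacing pixel counts with sums of per-pixel weights. The example below assumes a dictionary that records, for every pixel key, the number of cameras N covering it; the helper names are illustrative, and the same substitution would apply to the TPA and IoU terms of AQ.

```python
def coverage_weights(coverage):
    """Turn a {pixel: number of covering cameras N} map into 1/N weights."""
    return {pixel: 1.0 / n for pixel, n in coverage.items()}


def weighted_iou(pred_pixels, gt_pixels, weights):
    """IoU where every pixel contributes its weight instead of a count of 1."""
    inter = sum(weights[p] for p in pred_pixels & gt_pixels)
    union = sum(weights[p] for p in pred_pixels | gt_pixels)
    return inter / union if union else 0.0
```

With pixel `a` seen by one camera and pixel `b` seen by two, a prediction covering {a, b} against a ground truth of {a} scores 1.0 / (1.0 + 0.5) = 2/3 rather than the unweighted 1/2.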
PS Metric
We also brieﬂy review the metric PQ (panoptic quality) [33] for
evaluating image Panoptic Segmentation (PS), since we will build image-level
baselines purely trained with image panoptic annotations.
For a particular semantic class $c$, the sets of true positives ($\mathrm{TP}_c$),
false positives ($\mathrm{FP}_c$), and false negatives ($\mathrm{FN}_c$) are formed
by matching predictions $z_f$ to the ground-truth masks $z_g$ based on the IoU
scores. An IoU threshold greater than 0.5 is chosen to guarantee unique matching.
Formally,
\[
\mathrm{PQ}_c = \frac{\sum_{(z_f, z_g) \in \mathrm{TP}_c} \mathrm{IoU}(z_f, z_g)}
  {|\mathrm{TP}_c| + \frac{1}{2}|\mathrm{FP}_c| + \frac{1}{2}|\mathrm{FN}_c|},
\tag{4}
\]
where the ﬁnal PQ is then obtained by averaging PQc over semantic classes.
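A small sketch of Eq. (4) for a single class follows. It assumes masks are given as sets of pixel keys and that masks within each list do not overlap, which is what makes the match above 0.5 IoU unique; the function names are illustrative.

```python
def mask_iou(a, b):
    """IoU of two masks given as sets of pixel keys."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def panoptic_quality_class(pred_masks, gt_masks):
    """PQ_c for one semantic class, following Eq. (4)."""
    matched_gt = set()
    iou_sum = 0.0
    for pred in pred_masks:
        for j, gt in enumerate(gt_masks):
            score = mask_iou(pred, gt)
            if score > 0.5:  # a match above 0.5 IoU is unique
                matched_gt.add(j)
                iou_sum += score
                break
    tp = len(matched_gt)
    fp = len(pred_masks) - tp
    fn = len(gt_masks) - tp
    denom = tp + 0.5 * fp + 0.5 * fn
    return iou_sum / denom if denom else 0.0
```

With one matched pair at IoU 0.75, one unmatched prediction, and one unmatched ground-truth mask, PQ_c = 0.75 / (1 + 0.5 + 0.5) = 0.375.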
[Figure 6 panels: (a) View Evaluation (Network Predictions, Stitch); (b) Pano Evaluation (Network Predictions, Re-Project); inputs shown: Front Left, Front, and Front Right views, and the Equirectangular Panorama]
Figure 6: We experiment with two evaluation schemes: (a) View and (b) Pano. The View evaluation
scheme takes individual camera views as input and generates their panoptic predictions, which are
then “stitched over cameras” to obtain consistent instance IDs between cameras. The Pano evaluation
scheme takes panorama images as input and generates panoramic panoptic predictions, which are
then reprojected back to each camera for evaluation.
5 Experimental Results
In this section, we introduce our PVPS baselines, which exploit the property of multi-
camera images by taking as input either individual camera views or panorama
images (generated from all camera views). We then provide extensive experiments
on the proposed dataset and metric.
5.1 ViP-DeepLab Extensions as PVPS baselines
To tackle the new challenging PVPS task, we extend the state-of-the-art video panoptic
segmentation method, ViP-DeepLab [51], to panoramic views.
Baseline Overview
For completeness, we first briefly review ViP-DeepLab [51].
ViP-DeepLab extends the state-of-the-art image panoptic segmentation model,
Panoptic-DeepLab [11], to the video domain. Panoptic-DeepLab employs two separate
prediction branches for semantic segmentation [9] and instance segmentation [29],
respectively. Both segmentation results are then merged [80] to form the final
panoptic segmentation result. To perform video panoptic segmentation, ViP-DeepLab
adopts a two-frame image panoptic segmentation framework. Specifically, during
training, ViP-DeepLab takes a pair of image frames as input and their panoptic
segmentation ground truths as training targets. During inference, ViP-DeepLab
performs two-frame image panoptic predictions at each time step, and continues the
inference process for every two consecutive frames (i.e., with one overlapping frame
at the next time step) in a video sequence. The predictions in the overlapping frames
are “stitched” together by propagating instance IDs based on mask IoU between
region pairs (i.e., if two masks have high IoU overlap, they will be re-assigned
the same instance ID), and thus temporally consistent IDs are obtained (see Fig. 4
of Qiao et al. [51] for an illustration). We refer to this post-processing as
“panoptic stitching over time”.
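The ID propagation can be sketched as a greedy one-to-one matching on mask IoU in the overlapping frame. This is a simplified illustration rather than the ViP-DeepLab implementation: the `min_iou` threshold and function name are hypothetical, and a check that matched regions share the same semantic class is omitted for brevity.

```python
def stitch_ids(prev_masks, next_masks, min_iou=0.5):
    """Return {next_id: prev_id} for instances to be re-assigned.

    prev_masks / next_masks: dicts mapping instance ID -> set of (x, y) pixels,
    both predicted for the same overlapping frame by two consecutive
    two-frame inferences. min_iou is a hypothetical threshold.
    """
    candidates = []
    for next_id, next_mask in next_masks.items():
        for prev_id, prev_mask in prev_masks.items():
            union = len(next_mask | prev_mask)
            score = len(next_mask & prev_mask) / union if union else 0.0
            if score >= min_iou:
                candidates.append((score, next_id, prev_id))

    # Greedily accept the highest-IoU pairs, using each ID at most once.
    candidates.sort(key=lambda item: item[0], reverse=True)
    mapping, used_prev = {}, set()
    for _, next_id, prev_id in candidates:
        if next_id not in mapping and prev_id not in used_prev:
            mapping[next_id] = prev_id
            used_prev.add(prev_id)
    return mapping
```

Instances in `next_masks` that receive no entry in the returned mapping keep their own (new) IDs.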
Baseline Extension for PVPS
We explore several ViP-DeepLab extensions
for PVPS, which take as input individual camera views or panorama images
(generated from all camera views). The input types can differ between
training and evaluation. Specifically, we define three training schemes: View, Pano,
and Ensemble-View. The View scheme refers to the case where ViP-DeepLab is
trained with images from all camera views, while Pano means the model is trained
with full panorama images. The Ensemble-View scheme refers to the case where
we have five camera-specific ViP-DeepLab models, each of which is trained and
evaluated on its own camera images. We also have two evaluation schemes:
View and Pano. The View scheme refers to the case where the trained model is
fed with images from individual camera views and generates the corresponding
panoptic predictions for each view. However, the predicted instance IDs are not
consistent between cameras, since the predictions are made independently for
each view. To generate consistent instance IDs between cameras, we propose a
similar method to “panoptic stitching over time”: if two masks have high IoU
overlap in the overlapping regions between two cameras’ ﬁeld-of-view, we re-
assign the same instance ID for them, resulting in the “panoptic stitching over
cameras” post-processing method. For the Pano evaluation scheme, the model
is fed with panorama images and generates panoramic panoptic predictions. We
then re-project panoramic panoptic predictions onto each camera for evaluation.
Note that, for the Pano evaluation scheme, the instance IDs are consistent between
cameras by nature. We visualize the evaluation schemes in Fig. 6.
Implementation Details
We build our image-based and video-based baselines
on top of Panoptic-DeepLab [11] and ViP-DeepLab [51], respectively, using the
official code-base [70]. The training strategy follows Panoptic-DeepLab and
ViP-DeepLab. Specifically, the models are trained with 32 TPU cores for 60k steps
with a batch size of 32, the Adam [31] optimizer, and a poly learning rate
schedule with a learning rate of $2.5 \times 10^{-4}$. We
use an ImageNet-1K-pretrained [56] ResNet-50 [25] with stride 16 as the backbone
(using atrous convolution [8]). For image-based methods, we use the crop size
1281 × 1921 during training, while, during inference, we use the whole image
(or panorama). We use a similar strategy for the video-based methods, but we
use a ResNet-50 backbone with stride 32 and crop size 641 × 961 due to memory
constraints.
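For reference, a poly learning rate schedule is commonly written as the base rate multiplied by (1 - step / total_steps)^power. The sketch below uses the base rate and step count stated above; the exponent of 0.9 is an assumption (a common default in DeepLab-style training) and is not given in the text.

```python
def poly_learning_rate(step, base_lr=2.5e-4, total_steps=60_000, power=0.9):
    """Poly schedule: base_lr * (1 - step / total_steps) ** power.

    power=0.9 is an assumed value; base_lr and total_steps follow the text.
    """
    progress = min(max(step, 0), total_steps) / total_steps
    return base_lr * (1.0 - progress) ** power
```

The rate starts at 2.5e-4 at step 0 and decays monotonically to 0 at step 60k.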
5.2 Qualitative Evaluation
In Fig. 7, we provide qualitative results from our two ViP-DeepLab baselines
over two (non-adjacent) temporal frames. One model is trained and evaluated on
single images (i.e., it uses the View scheme for both training and evaluation), and
the other is trained and evaluated on panorama images (i.e., it uses the Pano
scheme for both training and evaluation).
From these results, we can see that the baseline models are able to accurately track
objects in very dense scenes. In addition, we note that the panorama model provides
some qualitative benefits in these examples. In particular, the
single view model makes inconsistent predictions for the crosswalk in the left and
right images, whereas the panorama model is able to attain
the full context of the scene and avoids this mistake. Similarly, the single view
model fails to track the car crossing the front right and side right cameras at t0, but
the panorama model is again able to track this object correctly.
5.3 Baseline Comparisons
Video-based Baselines
In Tab. 2(a), we provide video-based baseline comparisons
using ViP-DeepLab [51], evaluated by the proposed weighted STQ (wSTQ). We
compare different training and evaluation schemes. As shown in the table, when
both are evaluated with the View scheme, training with the View scheme outperforms
training with Ensemble-View by 0.86% wSTQ. That is, training a single model on
all the camera views performs better than training five camera-specific models, each
on its own camera views. Also, when training with the View scheme, using the Pano
evaluation scheme degrades the performance by 2.91% wSTQ. When training with the
Pano scheme, evaluating with the View scheme is better than evaluating with the
Pano scheme. We believe this is caused by the asymmetry between the training and
evaluation settings: we
could not use whole panorama images with resolution 1000 × 5875 as input during
training (due to memory limits), and thus we only use a smaller crop size of 641 × 961.
Ideally, the model should be evaluated with the same setting as its training. The
current best setting is trained and evaluated with the View scheme, reaching 17.78%
wSTQ. We observe that our dataset is very challenging in terms of both tracking
and segmentation, since our best wAQ is only 8.21% and our best wSQ is 39.78%.
[Figure 7 panels: t0, t0 + 0.2s, t1, and t1 + 1s; each panel shows RGB + GT, Single View Predictions, and Panorama Predictions.]
Figure 7: Comparison of qualitative results from our baseline ViP-DeepLab [51] models over different
time intervals. Results show models trained on single images with panoptic stitching over cameras, and
models trained directly on panorama images. Our baseline models show strong performance for the majority of
the scene, although tracking small/distant objects and crowded scenes remains challenging.
Table 2: Quantitative evaluation: during training and evaluation, the baselines can take different types
of inputs: View: individual camera views; Pano: panoramas; Ensemble-View: camera-specific views.
Results include (a) video-baseline comparison using ViP-DeepLab, measured by weighted Segmentation
and Tracking Quality (wSTQ); and (b) image-baseline comparison using Panoptic-DeepLab, measured
by Panoptic Quality (PQ) and mean Intersection-over-Union (mIoU).
(a) Video-baseline Comparison
Training Scheme   Eval Scheme   wSTQ    wAQ    wSQ
Ensemble-View     View          16.92   7.61   37.33
View              View          17.78   8.21   38.46
View              Pano          14.87   6.13   36.04
Pano              View          17.56   8.11   38.04
Pano              Pano          15.72   6.22   39.78
(b) Image-baseline Comparison
Training Scheme   Eval Scheme   PQ      mIoU
Ensemble-View     View          35.70   48.15
View              View          40.00   53.64
View              Pano          33.65   50.61
Pano              View          38.93   51.65
Pano              Pano          36.32   52.19
Table 3: View transferability on our video-based baselines, measured by wSTQ. We evaluate models (1st
column) trained on a specific view w.r.t. other camera views. The last row, MultiCamera, refers to the
model trained with all camera views (i.e., training scheme View), and the last column, All, denotes the
evaluation set using all camera views (i.e., evaluation scheme View).
Model \ Eval   Side Left   Front Left   Front   Front Right   Side Right   All
Side Left      18.79       17.41        14.56   19.06         19.40        16.31
Front Left     16.88       18.39        12.84   19.22         18.49        15.36
Front          16.58       18.02        14.54   18.55         18.96        15.98
Front Right    16.56       17.36        14.99   19.40         19.16        16.18
Side Right     17.91       16.50        13.16   18.23         20.47        15.65
MultiCamera    20.11       19.54        15.63   20.67         21.53        17.78
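As a rough consistency check on Tab. 2(a), recall that STQ [71] is the geometric mean of an association term (AQ) and a segmentation term (SQ). The sketch below assumes the weighted variants combine in the same way, which is an assumption here rather than a restatement of the metric's definition; the values are copied from the table and the helper names are hypothetical.

```python
import math

# (training scheme, eval scheme) -> (wSTQ, wAQ, wSQ), copied from Tab. 2(a).
TABLE_2A = {
    ("Ensemble-View", "View"): (16.92, 7.61, 37.33),
    ("View", "View"): (17.78, 8.21, 38.46),
    ("View", "Pano"): (14.87, 6.13, 36.04),
    ("Pano", "View"): (17.56, 8.11, 38.04),
    ("Pano", "Pano"): (15.72, 6.22, 39.78),
}


def geometric_mean(aq, sq):
    """Geometric mean of the association and segmentation terms."""
    return math.sqrt(aq * sq)


def max_gap(table):
    """Largest absolute gap between reported wSTQ and sqrt(wAQ * wSQ)."""
    return max(abs(stq - geometric_mean(aq, sq)) for stq, aq, sq in table.values())


for (train, evaluation), (stq, aq, sq) in TABLE_2A.items():
    print(f"{train:>13} / {evaluation:<4} reported {stq:5.2f}  "
          f"sqrt(wAQ*wSQ) {geometric_mean(aq, sq):5.2f}")
```

Under this assumption, the recomputed values stay within about 0.1 of the reported wSTQ for every row; small gaps are to be expected because the table entries are rounded and the sketch does not reproduce the weighting itself.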
Image-based Baselines
In Tab. 2(b), we provide image-based baseline comparisons
using Panoptic-DeepLab [11], evaluated by the image panoptic segmentation
metric PQ [33] and the semantic segmentation metric mIoU [16]. We observe
the same trends for the image-based baselines as for the video-based baselines.
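For reference, PQ [33] for a single class is the sum of IoUs over matched (true positive) segment pairs divided by |TP| + 0.5 |FP| + 0.5 |FN|, where a predicted and a ground-truth segment match if their IoU exceeds 0.5. The following is a minimal sketch of that formula only; the function name and input format are hypothetical, and the segment matching and per-class averaging performed by a full evaluation are not shown.

```python
def panoptic_quality(matched_ious, num_false_positives, num_false_negatives):
    """PQ for one class from already-matched segment pairs.

    matched_ious: IoU of each true-positive pair (each must exceed 0.5).
    Unmatched predicted segments count as false positives, and unmatched
    ground-truth segments count as false negatives.
    """
    if any(iou <= 0.5 for iou in matched_ious):
        raise ValueError("true-positive pairs must have IoU > 0.5")
    denominator = (len(matched_ious)
                   + 0.5 * num_false_positives
                   + 0.5 * num_false_negatives)
    if denominator == 0:
        return 0.0  # class absent from both prediction and ground truth
    return sum(matched_ious) / denominator


# Two matched segments, one spurious prediction, one missed ground-truth segment.
pq = panoptic_quality([0.8, 0.6], num_false_positives=1, num_false_negatives=1)
```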
5.4 Ablation Studies
Transferability of Models between Viewpoints
In this ablation, we measure the
ability to transfer models trained on one viewpoint to a different viewpoint. As
shown in Tab. 3, we make the following observations. First, all models, even those
trained on left side views, perform better on right side views. This is because
the ego-vehicle drives on the right side of the road, so the left side views cover a
wider scope with more instances and smaller objects (i.e., the left side views are more
challenging). Second, performance on the front camera is lower than on the
other cameras. We hypothesize that the front camera captures more diverse and
challenging views, e.g., vehicles driving in multiple directions and more dynamic and
smaller objects, making tracking more challenging.
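Both observations can be checked directly against the numbers in Tab. 3. The sketch below copies the per-view wSTQ values from the table (omitting the All column) and compares, for each model, the mean over the two left views with the mean over the two right views; the helper names are hypothetical.

```python
VIEWS = ["Side Left", "Front Left", "Front", "Front Right", "Side Right"]

# Model -> wSTQ on each evaluation view, in the order of VIEWS (from Tab. 3).
TABLE_3 = {
    "Side Left": [18.79, 17.41, 14.56, 19.06, 19.40],
    "Front Left": [16.88, 18.39, 12.84, 19.22, 18.49],
    "Front": [16.58, 18.02, 14.54, 18.55, 18.96],
    "Front Right": [16.56, 17.36, 14.99, 19.40, 19.16],
    "Side Right": [17.91, 16.50, 13.16, 18.23, 20.47],
    "MultiCamera": [20.11, 19.54, 15.63, 20.67, 21.53],
}


def left_right_means(row):
    """Mean wSTQ over the two left views and over the two right views."""
    left = (row[0] + row[1]) / 2
    right = (row[3] + row[4]) / 2
    return left, right


def hardest_view(row):
    """Name of the evaluation view with the lowest wSTQ for this model."""
    return VIEWS[row.index(min(row))]


for model, row in TABLE_3.items():
    left, right = left_right_means(row)
    print(f"{model:>11}: left {left:.2f}  right {right:.2f}  "
          f"hardest view: {hardest_view(row)}")
```

For every row, the right-view mean exceeds the left-view mean and the front camera is the lowest-scoring view, in line with the two observations above.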
6 Conclusion
In this work, we presented a new benchmark, the Waymo Open Dataset: Panoramic
Video Panoptic Segmentation (WOD: PVPS) dataset. Our benchmark extends video
panoptic segmentation to a more challenging multi-camera setting that requires
consistent instance IDs both across cameras and over time. Our dataset is an order
of magnitude larger than all the existing video panoptic segmentation datasets. We
establish several strong baselines evaluated with a new metric, wSTQ, that takes
multi-camera, multi-object tracking and segmentation into consideration. We will
make our benchmark publicly available, and we hope that it will facilitate future
research on panoramic video panoptic segmentation.
References
[1] Baqué, P., Fleuret, F., Fua, P.: Deep occlusion reasoning for multi-camera
multi-target detection. In: ICCV (2017)
[2] Behley, J., Garbade, M., Milioto, A., Quenzel, J., Behnke, S., Stachniss, C.,
Gall, J.: Semantickitti: A dataset for semantic scene understanding of lidar
sequences. In: ICCV (2019)
[3] Berclaz, J., Fleuret, F., Turetken, E., Fua, P.: Multiple object tracking using
k-shortest paths optimization. PAMI 33(9), 1806–1819 (2011)
[4] Brostow, G.J., Fauqueur, J., Cipolla, R.: Semantic object classes in video: A
high-definition ground truth database. Pattern Recognition Letters 30(2), 88–97
(2009)
[5] Caesar, H., Bankiti, V., Lang, A.H., Vora, S., Liong, V.E., Xu, Q., Krishnan,
A., Pan, Y., Baldan, G., Beijbom, O.: nuscenes: A multimodal dataset for
autonomous driving. In: CVPR (2020)
[6] Chang, M.F., Lambert, J., Sangkloy, P., Singh, J., Bak, S., Hartnett, A., Wang, D.,
Carr, P., Lucey, S., Ramanan, D., et al.: Argoverse: 3d tracking and forecasting
with rich maps. In: CVPR (2019)
[7] Chavdarova, T., Baqué, P., Bouquet, S., Maksai, A., Jose, C., Bagautdinov, T.,
Lettry, L., Fua, P., Van Gool, L., Fleuret, F.: Wildtrack: A multi-camera hd
dataset for dense unscripted pedestrian detection. In: CVPR (2018)
[8] Chen, L.C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: Semantic
image segmentation with deep convolutional nets and fully connected CRFs.
In: ICLR (2015)
[9] Chen, L.C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: DeepLab:
Semantic image segmentation with deep convolutional nets, atrous convolution,
and fully connected CRFs. TPAMI (2017)
[10] Chen, Y., Rong, F., Duggal, S., Wang, S., Yan, X., Manivasagam, S., Xue, S.,
Yumer, E., Urtasun, R.: Geosim: Realistic video simulation via geometry-aware
composition for self-driving. In: CVPR (2021)
[11] Cheng, B., Collins, M.D., Zhu, Y., Liu, T., Huang, T.S., Adam, H., Chen, L.C.:
Panoptic-deeplab: A simple, strong, and fast baseline for bottom-up panoptic
segmentation. In: CVPR (2020)
[12] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R.,
Franke, U., Roth, S., Schiele, B.: The Cityscapes Dataset for Semantic Urban
Scene Understanding. In: CVPR (2016)
[13] Dehghan, A., Modiri Assari, S., Shah, M.: Gmmcp tracker: Globally optimal
generalized maximum multi clique problem for multiple object tracking. In:
CVPR (2015)
[14] Dendorfer, P., Ošep, A., Milan, A., Schindler, K., Cremers, D., Reid, I., Roth,
S., Leal-Taixé, L.: MOTChallenge: A Benchmark for Single-camera Multiple
Target Tracking. IJCV (2020)
[15] Eshel, R., Moses, Y.: Homography based multiple camera detection and
tracking of people in a dense crowd. In: CVPR (2008)
[16] Everingham, M., Van Gool, L., Williams, C.K.I., Winn, J., Zisserman, A.: The
Pascal Visual Object Classes (VOC) Challenge. IJCV (2010)
[17] Felzenszwalb, P.F., Huttenlocher, D.P.: Efficient graph-based image
segmentation. IJCV (2004)
[18] Ferryman, J., Shahrokni, A.: Pets2009: Dataset and challenge. In: 2009 Twelfth
IEEE international workshop on performance evaluation of tracking and
surveillance. pp. 1–6 (2009)
[19] Fleuret, F., Berclaz, J., Lengagne, R., Fua, P.: Multicamera people tracking with
a probabilistic occupancy map. PAMI 30(2), 267–282 (2007)
[20] Gao, N., Shan, Y., Wang, Y., Zhao, X., Yu, Y., Yang, M., Huang, K.: SSAP:
Single-Shot Instance Segmentation With Affinity Pyramid. In: ICCV (2019)
[21] Geiger, A., Lenz, P., Urtasun, R.: Are we ready for autonomous driving? the
kitti vision benchmark suite. In: CVPR (2012)
[22] Geyer, J., Kassahun, Y., Mahmudi, M., Ricou, X., Durgesh, R., Chung, A.S.,
Hauswald, L., Pham, V.H., Mühlegg, M., Dorn, S., et al.: A2d2: Audi autonomous
driving dataset. arXiv preprint arXiv:2004.06320 (2020)
[23] Han, X., You, Q., Wang, C., Zhang, Z., Chu, P., Hu, H., Wang, J., Liu, Z.:
Mmptrack: Large-scale densely annotated multi-camera multiple people tracking
benchmark (2021)
[24] Hariharan, B., Arbeláez, P., Girshick, R., Malik, J.: Simultaneous detection and
segmentation. In: ECCV (2014)
[25] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition.
In: CVPR (2016)
[26] He, X., Zemel, R.S., Carreira-Perpiñán, M.Á.: Multiscale conditional random
fields for image labeling. In: CVPR (2004)
[27] Hofmann, M., Wolf, D., Rigoll, G.: Hypergraphs for joint multi-view
reconstruction and multi-object tracking. In: CVPR (2013)
[28] Huang, X., Wang, P., Cheng, X., Zhou, D., Geng, Q., Yang, R.: The apolloscape
open dataset for autonomous driving and its application. PAMI (2020)
[29] Kendall, A., Gal, Y., Cipolla, R.: Multi-task learning using uncertainty to weigh
losses for scene geometry and semantics. In: CVPR (2018)
[30] Kim, D., Woo, S., Lee, J.Y., Kweon, I.S.: Video Panoptic Segmentation. In: CVPR
(2020)
[31] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: ICLR
(2015)
[32] Kirillov, A., Girshick, R., He, K., Dollár, P.: Panoptic Feature Pyramid Networks.
In: CVPR (2019)
[33] Kirillov, A., He, K., Girshick, R., Rother, C., Dollár, P.: Panoptic Segmentation.
In: CVPR (2019)
[34] Kuo, C.H., Huang, C., Nevatia, R.: Inter-camera association of multi-target
tracks by on-line learned appearance affinity models. In: ECCV (2010)
[35] Ladický, L., Sturgess, P., Alahari, K., Russell, C., Torr, P.H.: What, where and
how many? combining object detectors and crfs. In: ECCV (2010)
[36] Li, Y., Chen, X., Zhu, Z., Xie, L., Huang, G., Du, D., Wang, X.: Attention-guided
unified network for panoptic segmentation. In: CVPR (2019)
[37] Liang, J., Homayounfar, N., Ma, W.C., Xiong, Y., Hu, R., Urtasun, R.:
Polytransform: Deep polygon transformer for instance segmentation. In: CVPR
(2020)
[38] Liao, Y., Xie, J., Geiger, A.: Kitti-360: A novel dataset and benchmarks for
urban scene understanding in 2d and 3d. arXiv:2109.13410 (2021)
[39] Lin, T., Maire, M., Belongie, S.J., Bourdev, L.D., Girshick, R.B., Hays, J., Perona,
P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft COCO: common objects
in context. In: ECCV (2014)
[40] Ling, H., Acuna, D., Kreis, K., Kim, S.W., Fidler, S.: Variational amodal object
completion. NeurIPS (2020)
[41] Liu, H., Peng, C., Yu, C., Wang, J., Liu, X., Yu, G., Jiang, W.: An End-to-End
Network for Panoptic Segmentation. In: CVPR (2019)
[42] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic
segmentation. In: CVPR (2015)
[43] Luiten, J., Ošep, A., Dendorfer, P., Torr, P., Geiger, A., Leal-Taixé, L., Leibe, B.:
HOTA: A Higher Order Metric for Evaluating Multi-Object Tracking. IJCV
(2020)
[44] Mallya, A., Wang, T.C., Sapra, K., Liu, M.Y.: World-consistent video-to-video
synthesis. In: ECCV (2020)
[45] Miao, J., Wei, Y., Wu, Y., Liang, C., Li, G., Yang, Y.: Vspw: A large-scale dataset
for video scene parsing in the wild. In: CVPR (2021)
[46] Narioka, K., Nishimura, H., Itamochi, T., Inomata, T.: Understanding 3d
semantic structure around the vehicle with monocular cameras. In: IEEE
Intelligent Vehicles Symposium (IV). pp. 132–137 (2018)
[47] Neuhold, G., Ollmann, T., Bulò, S.R., Kontschieder, P.: The mapillary vistas
dataset for semantic understanding of street scenes. In: ICCV (2017)
[48] Philion, J., Fidler, S.: Lift, splat, shoot: Encoding images from arbitrary camera
rigs by implicitly unprojecting to 3d. In: ECCV (2020)
[49] Porzi, L., Bulò, S.R., Colovic, A., Kontschieder, P.: Seamless Scene
Segmentation. In: CVPR (2019)
[50] Qi, C.R., Zhou, Y., Najibi, M., Sun, P., Vo, K., Deng, B., Anguelov, D.: Offboard
3d object detection from point cloud sequences. In: CVPR (2021)
[51] Qiao, S., Zhu, Y., Adam, H., Yuille, A., Chen, L.C.: Vip-deeplab: Learning
visual perception with depth-aware video panoptic segmentation. In: CVPR
(2021)
[52] Ristani, E., Solera, F., Zou, R., Cucchiara, R., Tomasi, C.: Performance measures
and a data set for multi-target, multi-camera tracking. In: ECCV Workshop on
Benchmarking Multi-Target Tracking (2016)
[53] Ristani, E., Tomasi, C.: Features for multi-target multi-camera tracking and
re-identification. In: CVPR (2018)
[54] Roddick, T., Cipolla, R.: Predicting semantic map representations from images
using pyramid occupancy networks. In: CVPR (2020)
[55] Roshan Zamir, A., Dehghan, A., Shah, M.: Gmcp-tracker: Global multi-object
tracking using generalized minimum clique graphs. In: ECCV (2012)
[56] Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z.,
Karpathy, A., Khosla, A., Bernstein, M., Berg, A.C., Fei-Fei, L.: ImageNet Large
Scale Visual Recognition Challenge. IJCV (2015)
[57] Schönberger, J.L., Zheng, E., Frahm, J.M., Pollefeys, M.: Pixelwise view
selection for unstructured multi-view stereo. In: ECCV (2016)
[58] Shi, J., Malik, J.: Normalized cuts and image segmentation. PAMI (2000)
[59] Song, S., Zeng, A., Chang, A.X., Savva, M., Savarese, S., Funkhouser, T.:
Im2pano3d: Extrapolating 360 structure and semantics beyond the field of
view. In: CVPR (2018)
[60] Su, Y.C., Grauman, K.: Making 360 video watchable in 2d: Learning
videography for click free viewing. In: CVPR (2017)
[61] Sun, P., Kretzschmar, H., Dotiwalla, X., Chouard, A., Patnaik, V., Tsui, P., Guo,
J., Zhou, Y., Chai, Y., Caine, B., et al.: Scalability in perception for autonomous
driving: Waymo Open Dataset. In: CVPR (2020)
[62] Tang, Z., Naphade, M., Liu, M.Y., Yang, X., Birchfield, S., Wang, S., Kumar, R.,
Anastasiu, D., Hwang, J.N.: Cityflow: A city-scale benchmark for multi-target
multi-camera vehicle tracking and re-identification. In: CVPR (2019)
[63] Tateno, K., Navab, N., Tombari, F.: Distortion-aware convolutional filters for
dense prediction in panoramic images. In: ECCV (2018)
[64] Thrun, S., Montemerlo, M.: The graph slam algorithm with applications to
large-scale mapping of urban structures. The International Journal of Robotics
Research 25(5-6), 403–429 (2006)
[65] Tu, Z., Chen, X., Yuille, A.L., Zhu, S.C.: Image parsing: Unifying segmentation,
detection, and recognition. IJCV (2005)
[66] Voigtlaender, P., Krause, M., Ošep, A., Luiten, J., Sekar, B.B.G., Geiger, A.,
Leibe, B.: MOTS: Multi-object tracking and segmentation. In: CVPR (2019)
[67] Wang, H., Luo, R., Maire, M., Shakhnarovich, G.: Pixel Consensus Voting for
Panoptic Segmentation. In: CVPR (2020)
[68] Wang, H., Zhu, Y., Green, B., Adam, H., Yuille, A., Chen, L.C.: Axial-DeepLab:
Stand-Alone Axial-Attention for Panoptic Segmentation. In: ECCV (2020)
[69] Weber, M., Luiten, J., Leibe, B.: Single-shot Panoptic Segmentation. In: IROS
(2020)
[70] Weber, M., Wang, H., Qiao, S., Xie, J., Collins, M.D., Zhu, Y., Yuan, L., Kim, D.,
Yu, Q., Cremers, D., Leal-Taixe, L., Yuille, A.L., Schroff, F., Adam, H., Chen,
L.C.: DeepLab2: A TensorFlow Library for Deep Labeling. arXiv:2106.09748
(2021)
[71] Weber, M., Xie, J., Collins, M., Zhu, Y., Voigtlaender, P., Adam, H., Green,
B., Geiger, A., Leibe, B., Cremers, D., Osep, A., Leal-Taixe, L., Chen, L.C.:
Step: Segmenting and tracking every pixel. In: NeurIPS Track on Datasets and
Benchmarks (2021)
[72] Wu, Y., Lim, J., Yang, M.H.: Online object tracking: A benchmark. In: CVPR
(2013)
[73] Xiong, Y., Liao, R., Zhao, H., Hu, R., Bai, M., Yumer, E., Urtasun, R.: UPSNet:
A Unified Panoptic Segmentation Network. In: CVPR (2019)
[74] Xu, C., Xiong, C., Corso, J.J.: Streaming hierarchical video segmentation. In:
ECCV (2012)
[75] Xu, Y., Liu, X., Liu, Y., Zhu, S.C.: Multi-view people tracking via hierarchical
trajectory composition. In: CVPR (2016)
[76] Yang, B., Bai, M., Liang, M., Zeng, W., Urtasun, R.: Auto4d: Learning to label
4d objects from sequential point clouds. arXiv preprint arXiv:2101.06586 (2021)
[77] Yang, K., Hu, X., Bergasa, L.M., Romera, E., Wang, K.: Pass: Panoramic
annular semantic segmentation. IEEE Transactions on Intelligent Transportation
Systems 21(10), 4171–4185 (2019)
[78] Yang, K., Zhang, J., Reiß, S., Hu, X., Stiefelhagen, R.: Capturing omni-range
context for omnidirectional segmentation. In: CVPR (2021)
[79] Yang, L., Fan, Y., Xu, N.: Video Instance Segmentation. In: ICCV (2019)
[80] Yang, T.J., Collins, M.D., Zhu, Y., Hwang, J.J., Liu, T., Zhang, X., Sze, V.,
Papandreou, G., Chen, L.C.: DeeperLab: Single-Shot Image Parser. arXiv:1902.05093
(2019)
[81] Yao, J., Fidler, S., Urtasun, R.: Describing the scene as a whole: Joint object
detection, scene classification and semantic segmentation. In: CVPR (2012)
[82] Yogamani, S., Hughes, C., Horgan, J., Sistu, G., Varley, P., O’Dea, D., Uricár, M.,
Milz, S., Simon, M., Amende, K., et al.: Woodscape: A multi-task, multi-camera
fisheye dataset for autonomous driving. In: ICCV (2019)
[83] Yu, F., Chen, H., Wang, X., Xian, W., Chen, Y., Liu, F., Madhavan, V., Darrell, T.:
Bdd100k: A diverse driving dataset for heterogeneous multitask learning. In:
CVPR (2020)
[84] Zakharov, S., Kehl, W., Bhargava, A., Gaidon, A.: Autolabeling 3d objects with
differentiable rendering of sdf shape priors. In: CVPR (2020)
[85] Zhang, C., Liwicki, S., Smith, W., Cipolla, R.: Orientation-aware semantic
segmentation on icosahedron spheres. In: ICCV (2019)