LESS: Label-Efficient Semantic Segmentation for
LiDAR Point Clouds
Minghua Liu1*, Yin Zhou2**, Charles R. Qi2, Boqing Gong3, Hao Su1, and
Dragomir Anguelov2
1UC San Diego, 2Waymo, 3Google
Abstract. Semantic segmentation of LiDAR point clouds is an important
task in autonomous driving. However, training deep models via
conventional supervised methods requires large datasets which are costly to
label. It is critical to have label-efficient segmentation approaches to scale
up the model to new operational domains or to improve performance on
rare cases. While most prior works focus on indoor scenes, we are one
of the first to propose a label-efficient semantic segmentation pipeline
for outdoor scenes with LiDAR point clouds. Our method co-designs an
efficient labeling process with semi/weakly supervised learning and is
applicable to nearly any 3D semantic segmentation backbone. Specifically,
we leverage geometry patterns in outdoor scenes to perform a heuristic
pre-segmentation that reduces the manual labeling, and jointly design the
learning targets with the labeling process. In the learning step, we
leverage prototype learning to get more descriptive point embeddings and
use multi-scan distillation to exploit richer semantics from temporally
aggregated point clouds to boost the performance of single-scan models.
Evaluated on the SemanticKITTI and the nuScenes datasets, we show
that our proposed method outperforms existing label-efficient methods.
With extremely limited human annotations (e.g., 0.1% point labels),
our proposed method is even highly competitive compared to the fully
supervised counterpart with 100% labels.
1 Introduction
Light detection and ranging (LiDAR) sensors have become a necessity for most
autonomous vehicles. They capture more precise depth measurements and are
more robust against various lighting conditions compared to visual cameras.
Semantic segmentation for LiDAR point clouds is an indispensable technology as it
provides fine-grained scene understanding, complementary to object detection.
For example, semantic segmentation helps self-driving cars distinguish drivable
and non-drivable road surfaces and reason about their functionalities, like park-
ing areas and sidewalks, which is beyond the scope of modern object detectors.
Based on large-scale public driving-scene datasets [4,5], several LiDAR se-
mantic segmentation approaches have recently been developed [68,59,9,62,50].
Typically, these methods require fully labeled point clouds during training. Since
a LiDAR sensor may perceive millions of points per second, exhaustively labeling
all points is extremely laborious and time-consuming. Moreover, it may fail to
scale when we extend the operational domain (e.g., various cities and weather
conditions) and seek to cover more rare cases. Therefore, to scale up the system,
it is critical to have label-efficient approaches for LiDAR semantic segmentation,
whose goal is to minimize the quantity of human annotations while still
achieving high performance.
* Work done during internship at Waymo LLC.
** Corresponding to yinzhou@waymo.com.
arXiv:2210.08064v1 [cs.CV] 14 Oct 2022
[Figure 1: bar chart of mIoU per method, with the label ratio below each bar.
SemanticKITTI: Fully Sup. (100%) 65.9%, Ours (0.1%) 66.0%, Ours (0.01%) 61.0%,
Contra. (0.1%) 46.0%, SQN (0.1%) 52.0%, SQN (0.01%) 38.3%, OTOC (0.1%) 26.0%,
ReDAL (5%) 59.8%.
nuScenes: Fully Sup. (100%) 75.4%, Ours (0.9%) 74.8%, Ours (0.2%) 73.5%,
Contra. (0.9%) 65.5%, Contra. (0.2%) 63.5%.]
Fig.1: We compare LESS with Cylinder3D [68] (our fully-supervised counter-
part), ContrastiveSceneContext [23], SQN [24], OneThingOneClick [33], and
ReDAL [55] on the SemanticKITTI [4] and nuScenes [5] validation sets. The
ratio between labels used and all points is listed below each bar. Please note
that all competing label-efficient methods mainly focus on indoor settings and
are not specially designed for outdoor LiDAR segmentation.
While there are some prior works studying label-efficient semantic segmen-
tation, they mostly focus on indoor scenes [11,3] or 3D object parts [6], which
are quite different in point cloud appearance and object type distribution, com-
pared to the outdoor driving scenes (e.g., significant variances in point den-
sity, extremely unbalanced point counts between common types, like ground
and vehicles, and less common ones, such as cyclists and pedestrians). Besides,
most prior explorations tend to address the problem from two independent per-
spectives, which may be less effective in our outdoor setting. Specifically, one
perspective is improving labeling efficiency, where the methods resort to active
learning [47,55,34], weak labels [44,54], and 2D supervision [53] to reduce labeling
efforts. The other perspective focuses on training, where the efforts assume the
partial labels are given and design semi/weakly supervised learning algorithms to
exploit the limited labels and strive for better performance [33,60,44,61,20,34,66].
This paper proposes a novel framework, label-efficient semantic segmenta-
tion (LESS), for LiDAR point clouds captured by self-driving cars. Different
from prior works, our method co-designs the labeling process and the model
learning. Our co-design is based on two principles: 1) the labeling step is de-
signed to provide bare minimum supervision, which is suitable for state-of-the-
art semi/weakly supervised segmentation methods; 2) the model training step
can tap into the labeling policy as a prior and deduce more learning targets.
The proposed method can fit in a straightforward way with most state-of-the-art
LiDAR segmentation backbones without introducing any network architectural
change or extra computational complexity when deployed onboard. Our
approach is suitable for effectively labeling and learning from scratch. It is also
highly compatible with mining long-tail instances, where, in practice, we mainly
want to identify and annotate rare cases based on trained models.
Specifically, we leverage the observation that outdoor-scene objects are often
well-separated once the ground points are isolated, and design a heuristic approach to
pre-segment an outdoor scene into a set of connected components. The compo-
nent proposals are of high purity (i.e., only contain one or a few classes) and
cover most of the points. Then, instead of meticulously labeling all points, the
annotators are only required to label one point per class for each component. In
the model learning process, we train the backbone segmentation network with
the sparse labels directly annotated by humans as well as the derived labels based
on component proposals. To encourage a more descriptive embedding space, we
employ contrastive prototype learning [18,29,48,63,33], which increases intra-
class similarity and inter-class separation. We also leverage a multi-scan teacher
model to exploit richer semantics within the temporally fused point clouds and
distill the knowledge to boost the performance of the single-scan model.
We evaluate the proposed method on two large-scale autonomous driving
datasets, SemanticKITTI [4] and nuScenes [5]. We show that our method
significantly outperforms existing label-efficient methods (see Fig. 1). With extremely
limited human annotations, such as 0.1% labeled points, the approach achieves
highly competitive performance compared to the fully supervised counterpart,
demonstrating the potential of practical deployment.
In summary, our main contributions are as follows:
– We analyze how label-efficient segmentation of outdoor LiDAR point clouds
differs from the indoor settings, and show that the unbalanced category
distribution is one of the main challenges.
– We leverage the unique geometric structure of LiDAR point clouds and design
a heuristic algorithm to pre-segment input points into high-purity connected
components. A customized labeling policy is then proposed to exploit the
components with tailored labels and losses.
– We adapt beneficial techniques (contrastive prototype learning and multi-scan
distillation) to label-efficient LiDAR segmentation and carefully design a
network-agnostic pipeline that achieves on-par performance with the fully
supervised counterpart.
– We evaluate the proposed pipeline on two large-scale autonomous driving
datasets and extensively ablate each module.
2 Related work
2.1 Segmentation networks for LiDAR point clouds
In contrast to indoor-scene point clouds, outdoor LiDAR point clouds’ large
scale, varying density, and sparsity require the segmentation networks to be
more efficient. Many works project the 3D point clouds from spherical view
[43,27,10,12,58,2,30,39] (i.e., range images) or bird's-eye-view [45,65] onto 2D
images, or try to fuse different views [31,19,1]. There are also some works directly
consuming point clouds [52,14,25,8]. They aim to structure the irregular data
more efficiently. Zhu et al. [68] employ the voxel-based representation and
alleviate the high computational burden by leveraging cylindrical partition and
sparse asymmetrical convolution. Recent works also try to fuse the point and
voxel representations [50,62,64,9], and even with range images [59]. All of these
works can serve as the backbone network in our label-efficient framework.
dataset         vegetation   road   building   car   motorcycle   person   bicycle
SemanticKITTI         1606   1197        799   257            2        2         1
nuScenes               867   2242       1261   270            3       16         1
Table 1: Point distribution across the most common and rarest cate-
gories of SemanticKITTI [4] and nuScenes [5]. Numbers are normalized
by the sample quantity of bicycles.
2.2 Label-efficient 3D semantic segmentation
Label-efficient 3D semantic segmentation has recently received lots of atten-
tion [17]. Previous explorations are mainly two-fold: labeling and training.
As for labeling, several approaches seek active learning [47,55,34], which it-
eratively selects and requests points to be labeled during the network training.
Hou et al. [23] utilize features from unsupervised pre-training to choose points
for labeling. Wang et al. [53] project the point clouds to 2D and leverage 2D
supervision signals. Some works utilize scene-level or sub-cloud-level weak la-
bels [44,54]. There are also several approaches using rule-based heuristics or
handcrafted features to help annotation [37,51,20].
As for training, Hou et al. [23] and Xie et al. [57] utilize contrastive learning for
unsupervised pre-training. Some approaches employ self-training to generate
pseudo-labels [33,60,44]. Many works use Conditional Random Fields (CRFs) [33,61,20,34]
or random walk [66] to propagate labels. Moreover, there are also works that
utilize prototype learning [33,66], siamese learning [61,44], temporal constraints [36],
smoothness constraints [44,47], attention [54,66], cross task consistency [44], and
synthetic data [56] to help training.
However, most recent works mainly focus on indoor scenes [11,3] or 3D object
parts [6], while outdoor scenarios are largely under-explored.
3 Method
In this section, we present our LESS framework. Since existing label-efficient
segmentation works typically address domains other than autonomous driving,
we first conduct a pilot study to understand the challenges in this novel setting
and introduce motivations behind LESS (Sec. 3.1). After briefly going over our
LESS framework (Sec. 3.2), we dive into the details of each part (Secs. 3.3 to 3.6).
3.1 Pilot study: what should we pay attention to?
Previous works [33,23,47,53,44,54,61,66] on label-efficient 3D semantic segmentation
mainly focused on indoor datasets, such as ScanNet-v2 [11] and S3DIS [3]. In
these datasets, input points are sampled from high-quality reconstructed meshes
and are thus densely and uniformly distributed. Also, objects in indoor scenarios
typically share similar sizes and have a relatively balanced class distribution.
However, in outdoor settings, input point clouds demonstrate substantially
higher complexity due to the varying point density and the ubiquitous occlusions
throughout the scene. Moreover, in outdoor driving scenes, the sample
distribution across different categories is highly unbalanced due to factors including
occurring frequency and object size. Tab. 1 shows the point distribution over two
autonomous driving datasets, where the numbers of road points are 1,197 and
2,242 times larger than that of bicycle points, respectively. The extremely
unbalanced distribution adds extra difficulty for label-efficient segmentation, whose
goal is to only label a tiny portion of points.
We conduct a pilot study to further examine this challenge. Specifically, we
train a state-of-the-art semantic segmentation network, Cylinder3D [68], on the
SemanticKITTI dataset with three intuitive setups: (a) 100% labels, (b) randomly
annotating 0.1% points per scan, and (c) randomly selecting 0.1% scans and
annotating all points for the selected scans. The results are shown in
Appendix S.1. Without any special efforts, “0.1% random points” can already
achieve a mean IoU of 48.0%, compared to 65.9% by the fully supervised version.
On common categories, such as car, road, building, and vegetation, the
performances of the “0.1% label” models are close to the fully supervised
model. However, on the underrepresented categories, such as bicycle,
person, and motorcycle, we observe substantial performance gaps compared to
the fully supervised model. These categories tend to have small sizes, appear less
frequently, and are thus more vulnerable when reducing the annotation budget.
However, they are still critical for many applications such as autonomous
driving. Moreover, we find that “0.1% random points” outperforms “0.1% random
scans” by a large margin, mainly due to its label diversity.
[Figure 2: bar chart of per-category IoU (0% to 100%) for car, road, building,
vegetation, terrain, fence, bicyclist, motorcycle, person, bicycle, and mIoU,
comparing fully supervised, 0.1% random points, and 0.1% random scans.]
Fig.2: Pilot study: performances (IoU) of the most common and
rarest categories. Models are trained with 100% of labels and 0.1% of
labels (in terms of points or scans) on SemanticKITTI [4].
These observations inspire us to rethink the existing paradigm of label-
efficient segmentation. While prior works typically focus on either efficient la-
beling or improving training approaches, we argue that it can be more effective
to address the problem by co-designing both. By integrating the two parts, we
may cover more underrepresented instances with a limited labeling budget, and
exploit the labeling efforts more effectively during network training.
3.2 Overview
Our LESS framework integrates pre-segmentation, labeling, and network train-
ing. It can work with most existing LiDAR segmentation backbones without
changing their network architectures or inference latency. As shown in Fig. 3,
our pipeline takes raw LiDAR sequences as input. It first employs a heuristic
method to partition the point clouds into a set of high-purity components
(Sec. 3.3). Instead of exhaustively labeling all points, annotators only need to
quickly label a few points for each component proposal (e.g., one point label for
each class that appears). Besides the human-annotated sparse labels, we derive
other types of labels so as to train the network with more context information
(Sec. 3.4). During the network training, we employ contrastive prototype
learning to realize a more descriptive embedding space (Sec. 3.5). We also boost the
single-scan model by distilling the knowledge from a multi-scan teacher, which
exploits richer semantics within the temporally fused point clouds (Sec. 3.6).
[Figure 3: pipeline diagram. Raw LiDAR point cloud sequences -> (a) heuristic
pre-segmentation -> (b) proposed components -> (c) human labeling, yielding
sparse, weak, and propagated labels -> (d) multi-scan teacher model and
single-scan student model.]
Fig.3: Overview of our LESS pipeline. (a) We first utilize a heuristic
algorithm to pre-segment each LiDAR sequence into a set of connected components.
(b) Examples of the proposed components. Different colors indicate different
components. For clear visualization, components of ground points are not shown.
(c) Human annotators only need to coarsely label each component. Each color
denotes a proposed component, and each click icon indicates a labeled point.
Only sparse labels are directly annotated by humans. (d) We then train the
network to digest various labels and utilize multi-scan distillation to exploit richer
semantics in the temporally fused point clouds.
3.3 Pre-segmentation
We design a heuristic pre-segmentation to subdivide the point cloud into a col-
lection of components. Each resulting component proposal is of high purity,
containing only one or a few categories, which allows annotators to coarsely
label all the proposals, i.e., one point label per class (Sec. 3.4). In this way, we
can derive dense supervision by disseminating the sparse point-wise annotations
to the whole components. Since modern networks can learn the semantics of
homogeneous neighborhoods from sparse annotations, spending lots of annotation
budgets on large objects may be futile. Our component-wise coarse annotation
is agnostic to the object size, which benefits underrepresented small objects.
For indoor scenarios, many prior works [33,20,47] leverage the surface normal
and color information to generate supervoxels and assume that the points within
each supervoxel share the same category. These approaches, however, might not
generalize to outdoor LiDAR point clouds, where the surface can be noisy and
color information is not available. Since the homogeneity assumption is hard to
hold, we instead propose to lift this constraint and allow each component to
contain more than one category.
Unlike indoor scenarios, objects in outdoor scans are often well-separated
after detecting and isolating the ground points. Inspired by this observation, we
design an intuitive approach to pre-segment each LiDAR sequence, which includes
four steps:
(a) Fuse overlapping scans. We first split a LiDAR sequence into
sub-sequences, each containing t consecutive scans. We then fuse the scans of
each sub-sequence based on the provided ego-poses. In this way, we can label
the same instance across overlapping scans at one click.
(b) Detect ground points. While the ground surface may not be flat at the
full-scene scale, we assume that for each local region (e.g., 5m × 5m), the ground
points can be fitted by a plane. We thus partition the whole scene into a uniform
grid according to the xy coordinates, and then employ the RANSAC algorithm [15]
to detect the ground points for each local cell. Since the ground points may
belong to different categories (e.g., parking zone, sidewalk, and road), we regard
the ground points from each local cell as a single component instead of merging
all of them. We allow a single ground component to contain multiple classes, and
one point per class will be labeled later.
(c) Construct connected components. After detecting and isolating the ground
points, the remaining objects are often well-separated. We build a graph G, where
each node represents a point. We connect every pair of points (u,v) in the graph
whose Euclidean distance is smaller than a threshold τ. We then divide the points
into groups by calculating the connected components of the graph G. Due to the
non-uniform point density distribution of the LiDAR point clouds, it is hard to
use a fixed threshold across different ranges. We thus propose an adaptive
threshold $\tau(u,v) = \max(r_u, r_v) \times d$ to compensate for the varying
density, where $r_u$ and $r_v$ are the distances between the points and the
sensor centers, and $d$ is a pre-defined hyper-parameter.
(d) Subdivide large components. After step (c), there usually exist some
connected components covering an enormous area (e.g., buildings and vegetation),
which are prone to include some small objects. To keep each component of high
purity and facilitate network training, we subdivide oversized components to
ensure each component is bounded within a fixed size. Also, we ignore small
components with only a few points, which tend to be noisy and can lead to
excessive component proposals.
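As a rough illustration of step (c), the following Python sketch groups non-ground points with a union-find structure and the adaptive threshold max(r_u, r_v) × d. The function name `adaptive_components`, the brute-force pairwise search, and the sample value of d are assumptions made for this example, not the authors' implementation; a real pipeline would need a spatial index to avoid the quadratic cost.

```python
import math


def adaptive_components(points, d, sensor=(0.0, 0.0, 0.0)):
    """Return one component id per point.

    Points u and v are linked when their Euclidean distance is below
    tau(u, v) = max(r_u, r_v) * d, where r is the distance to the sensor.
    Hypothetical helper: brute-force O(n^2), for illustration only.
    """
    n = len(points)
    ranges = [math.dist(p, sensor) for p in points]
    parent = list(range(n))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for u in range(n):
        for v in range(u + 1, n):
            tau = max(ranges[u], ranges[v]) * d
            if math.dist(points[u], points[v]) < tau:
                parent[find(u)] = find(v)

    # Relabel roots as consecutive ids in order of first appearance.
    ids = {}
    return [ids.setdefault(find(i), len(ids)) for i in range(n)]


# A 0.8 m gap is bridged at 50 m range but not at 10 m range (d = 0.02).
far_pair = adaptive_components([(50.0, 0.0, 0.0), (50.8, 0.0, 0.0)], d=0.02)
near_pair = adaptive_components([(10.0, 0.0, 0.0), (10.8, 0.0, 0.0)], d=0.02)
```

The usage lines show the intended effect of the adaptive threshold: the same gap merges two distant, sparsely sampled points but keeps two nearby points apart.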
In practice, we find our pre-segmentation generates a small number of com-
ponents for each sequence. The component proposals cover most of the points,
and each component tends to have high purity. These open up the possibility of
quickly bootstrapping the labeling from scratch. Moreover, unlike other meth-
ods [33,20,47] relying on various handcrafted features, our method only utilizes
the simple geometrical connectivity, allowing it to generalize to various scenarios
without tuning lots of hyper-parameters. Please refer to Sec. 4.4 for statistics of
the pre-segmentation results and the supplementary material for more details.
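The ground-detection step (b) of the pre-segmentation can be sketched in the same spirit. The snippet below partitions points into xy grid cells and runs a basic RANSAC plane fit in each cell. The function names, the 0.1 m inlier threshold, the iteration count, and the fixed random seed are all assumptions for illustration; the sketch also omits checks a practical system would likely add, such as requiring the fitted plane to be roughly horizontal.

```python
import math
import random


def fit_plane(p, q, r):
    """Plane through three points as (unit normal, offset), or None if collinear."""
    ux, uy, uz = (q[i] - p[i] for i in range(3))
    vx, vy, vz = (r[i] - p[i] for i in range(3))
    n = (uy * vz - uz * vy, uz * vx - ux * vz, ux * vy - uy * vx)
    norm = math.sqrt(sum(c * c for c in n))
    if norm < 1e-9:
        return None
    n = tuple(c / norm for c in n)
    return n, -sum(n[i] * p[i] for i in range(3))


def ransac_ground(cell_points, dist_thresh=0.1, iters=100, seed=0):
    """Indices of the points closest to the best plane found by RANSAC."""
    rng = random.Random(seed)
    best = []
    if len(cell_points) < 3:
        return best
    for _ in range(iters):
        model = fit_plane(*rng.sample(cell_points, 3))
        if model is None:
            continue
        n, off = model
        inliers = [i for i, pt in enumerate(cell_points)
                   if abs(sum(n[k] * pt[k] for k in range(3)) + off) < dist_thresh]
        if len(inliers) > len(best):
            best = inliers
    return best


def detect_ground(points, cell=5.0, **kwargs):
    """Map each xy grid cell to the global indices of its ground points."""
    cells = {}
    for idx, (x, y, _) in enumerate(points):
        key = (math.floor(x / cell), math.floor(y / cell))
        cells.setdefault(key, []).append(idx)
    return {key: [idxs[i] for i in ransac_ground([points[j] for j in idxs], **kwargs)]
            for key, idxs in cells.items()}
```

Each cell's inlier set then forms one ground component, matching the choice in the text not to merge ground points across cells.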
3.4 Annotation policy & training labels
Instead of meticulously labeling every point, we propose to coarsely annotate the
component proposals. Specifically, for each component proposal, an annotator
needs to first skim through the component and then label only one point for each
identified category. Fig. 3 (c) illustrates an example where the pre-segmentation
yields three components colored in red, blue, and green, respectively. Because
the blue component only has traffic-sign points, the annotator only needs to
randomly select one point to label. The green component is similar, as it only
contains road points. In the red component, there is a bicycle lying against a
traffic sign, and the annotator needs to select one point for each class to label. By
coarsely labeling all components, we are unlikely to miss any underrepresented
instances, as the proposed components cover the majority of points. Moreover,
since the number of components is orders of magnitude smaller than that of
points and our coarse annotation policy frees annotators from carefully labeling
instance boundaries (required in the labeling process to build the SemanticKITTI [4]
dataset), we are thus able to reduce manual labeling costs.
Based on the component proposals, we can obtain three types of labels.
Sparse labels: points directly labeled by annotators. Although only a tiny
subset of points are labeled, sparse labels provide the most accurate and diverse
supervision. Weak labels: classes that appear in each component. Weak labels
are derived based on human-annotated sparse labels within each component. In
the example of Fig. 3 (c), all red points can only be either bicycles or traffic
signs. We disseminate weak labels from each component to the points therein.
The multi-category weak labels provide weak but dense supervision and cover
most points. Propagated labels: for the pure components (i.e., only one cat-
egory appears), we can propagate the label to the entire component. Given the
effectiveness of our pre-segmentation approach, the propagated labels also cover
a wide range of points. However, since some categories may be easier to be sep-
arated and prone to form pure components, the distribution of the propagated
labels may be biased and less diverse than the sparse labels.
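The derivation of the three label types from the annotator's clicks can be sketched as follows. This is a minimal Python illustration under assumed data structures (a per-point list of component ids and a dictionary of clicked points); the function name `derive_labels` and the string class names are hypothetical.

```python
def derive_labels(component_ids, clicks):
    """Derive sparse, weak, and propagated labels.

    component_ids[i]: component of point i (None if the point is uncovered).
    clicks: {point index: class chosen by the annotator}.
    Returns three dicts keyed by point index.
    """
    sparse = dict(clicks)

    # Classes that appear in each component, collected from the clicked points.
    classes = {}
    for idx, cls in clicks.items():
        classes.setdefault(component_ids[idx], set()).add(cls)

    weak, propagated = {}, {}
    for idx, comp in enumerate(component_ids):
        allowed = classes.get(comp)
        if comp is None or not allowed:
            continue
        weak[idx] = frozenset(allowed)       # weak but dense supervision
        if len(allowed) == 1:                # pure component: propagate the label
            propagated[idx] = next(iter(allowed))
    return sparse, weak, propagated


# Example mirroring Fig. 3 (c): component 0 mixes a bicycle and a traffic sign,
# component 1 is a traffic sign, component 2 is road; point 7 is uncovered.
component_ids = [0, 0, 0, 1, 1, 2, 2, None]
clicks = {0: "bicycle", 2: "traffic-sign", 3: "traffic-sign", 5: "road"}
sparse, weak, propagated = derive_labels(component_ids, clicks)
```

In this example only the two pure components receive propagated labels, while every point of the mixed component carries the weak label set {bicycle, traffic-sign}.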
We formulate a joint loss function by exploiting the three types of labels:
$L = L_{\mathrm{sparse}} + L_{\mathrm{propagated}} + L_{\mathrm{weak}}$, where
$L_{\mathrm{sparse}}$ and $L_{\mathrm{propagated}}$ are weighted cross-entropy
losses with respect to the sparse labels and propagated labels, respectively.
We utilize the inverse square root of label frequency [38,69,35] as category weights
to emphasize underrepresented categories. Here, we calculate a cross-entropy loss
for each label type separately, because propagated labels significantly outnumber
sparse labels while sparse labels provide more diverse supervision.
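A small Python sketch of the category weighting may help make this concrete. It assumes that "label frequency" means the fraction of labeled points belonging to each class and that the weighted loss is averaged over the labeled points; the text does not spell out either detail, so both are assumptions, as are the function names.

```python
import math
from collections import Counter


def inverse_sqrt_weights(labels):
    """Per-class weight 1 / sqrt(frequency), with frequency = count / total."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {c: 1.0 / math.sqrt(k / total) for c, k in counts.items()}


def weighted_cross_entropy(probs, labels, weights):
    """Mean of -w_y * log p_y over labeled points.

    probs[i][c] is the predicted probability of point i for class c.
    """
    losses = [-weights[y] * math.log(p[y]) for p, y in zip(probs, labels)]
    return sum(losses) / len(losses)


# With 99 road points and 1 bicycle point, the bicycle weight is about 10x larger.
weights = inverse_sqrt_weights(["road"] * 99 + ["bicycle"])
```

Under these assumptions the same function would be applied twice, once to the sparse labels and once to the propagated labels, and the two results summed.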
Denote the weak labels as binary masks $l_{ij}$ for point $i$ and category $j$:
$l_{ij} = 1$ when point $i$ belongs to a component that contains category $j$.
We exploit the multi-category weak labels by penalizing the impossible predictions:
$$L_{\mathrm{weak}} = -\frac{1}{n} \sum_{i=1}^{n} \log\Big(1 - \sum_{l_{ij}=0} p_{ij}\Big) \tag{1}$$
where $p_{ij}$ is the predicted probability of point $i$ for category $j$, and $n$
is the number of points.
Prior approaches [54,44] aggregate per-point predictions into component-level
predictions and then utilize the multiple-instance learning loss (MIL) [41,42] to
supervise the learning. Here, we only penalize the negative predictions without
encouraging the positive ones. This is because our network takes a single-scan
point cloud as input, but the labels are collected and derived over the temporally
fused point clouds. Hence, a positive instance may not always appear in each
individual scan, due to occlusions or limited sensor coverage.
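Eq. (1) can be written out directly. The sketch below is a plain-Python rendering with nested lists standing in for tensors; the function name and the small `eps` clamp that guards against log(0) are assumptions added for this example.

```python
import math


def weak_label_loss(probs, weak_masks, eps=1e-12):
    """Eq. (1): penalize probability mass on classes excluded by the weak labels.

    probs[i][j]: predicted probability of point i for class j.
    weak_masks[i][j]: 1 if class j may occur in point i's component, else 0.
    """
    total = 0.0
    for p, mask in zip(probs, weak_masks):
        impossible = sum(pj for pj, lj in zip(p, mask) if lj == 0)
        total += math.log(max(1.0 - impossible, eps))
    return -total / len(probs)


# One point with 10% of its probability mass on an excluded class.
loss = weak_label_loss([[0.7, 0.2, 0.1]], [[1, 1, 0]])
```

Note that the loss is zero whenever all probability mass stays on the allowed classes, regardless of how it is split among them, which reflects the choice to penalize only the negative predictions.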
3.5 Contrastive prototype learning
Besides the great success in self-supervised representation learning [7,21,40],
contrastive learning has also shown effectiveness in supervised learning and few-shot
learning [26,18,46,49]. It can overcome shortcomings of the cross-entropy loss,
such as poor margins [13,32,26,63], and construct a more descriptive embedding
space. Following [18,29,48,63,33], we exploit the limited annotations by learning
distinctive class prototypes (i.e., class centroids in the feature space). Without
pre-training, a contrastive prototype loss $L_{\mathrm{proto}}$ is added to the
joint loss of Sec. 3.4 as an auxiliary loss. Due to the limited annotations and
unbalanced label distribution,
only using samples within each batch to determine class prototypes may lead
to unstable results. Inspired by the idea of momentum contrast [21], we instead
learn the class prototypes $P_c$ by using a moving average over iterations:
$$P_c \leftarrow m P_c + (1-m) \frac{1}{n_c} \sum_{y_i = c} \mathrm{stopgrad}(h(f(x_i))) \tag{2}$$
where $f(x_i)$ is the embedding of point $x_i$, $h$ is a linear projection head with
vector normalization, stopgrad denotes the stop gradient operation, $y_i$ is the
label of $x_i$, $n_c$ is the number of points with label $c$ in a batch, and $m$ is
a momentum coefficient. In the beginning, $P_c$ are initialized randomly.
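The update in Eq. (2) amounts to an exponential moving average of per-class batch means. A minimal Python sketch follows, assuming the embeddings passed in are already the projected and normalized outputs h(f(x_i)) and are treated as constants (the role of the stop-gradient). The function name and the default momentum value are assumptions, and classes absent from the batch are simply left unchanged here.

```python
def update_prototypes(prototypes, embeddings, labels, m=0.99):
    """Eq. (2): P_c <- m * P_c + (1 - m) * mean of embeddings with label c.

    prototypes: {class: vector}; embeddings: list of vectors; labels: list of classes.
    Returns the updated prototypes dict (modified in place).
    """
    sums, counts = {}, {}
    for e, y in zip(embeddings, labels):
        acc = sums.setdefault(y, [0.0] * len(e))
        for k, x in enumerate(e):
            acc[k] += x
        counts[y] = counts.get(y, 0) + 1

    for c, acc in sums.items():
        mean = [x / counts[c] for x in acc]
        prototypes[c] = [m * p + (1 - m) * x for p, x in zip(prototypes[c], mean)]
    return prototypes


protos = {0: [1.0, 0.0], 1: [0.0, 1.0]}
protos = update_prototypes(protos, [[0.0, 1.0], [0.0, 1.0]], [0, 0], m=0.5)
```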
The prototype loss $L_{\mathrm{proto}}$ is calculated for the points with sparse
labels and propagated labels within each batch:
$$L_{\mathrm{proto}} = \frac{1}{n} \sum_{i}^{n} -w_{y_i} \log \frac{\exp(h(f(x_i)) \cdot P_{y_i} / \tau)}{\sum_{c} \exp(h(f(x_i)) \cdot P_c / \tau)} \tag{3}$$
where $h(f(x_i)) \cdot P_{y_i}$ indicates the cosine similarity between the projected
embedding and the prototype, $\tau$ is a temperature hyper-parameter, $n$ is the
number of points, and $w_{y_i}$ is the inverse square root weight of category $y_i$.
$L_{\mathrm{proto}}$ aims to learn a better embedding space by increasing
intra-class compactness and inter-class separability.
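To make Eq. (3) concrete, the following Python sketch computes the weighted prototype loss for a small batch. It assumes the embeddings are the projected outputs h(f(x_i)), computes the cosine similarity explicitly, and uses a log-sum-exp shift for numerical stability; the function name and the temperature value 0.1 are assumptions for illustration.

```python
import math


def prototype_loss(embeddings, labels, prototypes, weights, tau=0.1):
    """Eq. (3): class-weighted softmax loss over cosine similarities to prototypes."""

    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb)

    total = 0.0
    for e, y in zip(embeddings, labels):
        logits = {c: cosine(e, p) / tau for c, p in prototypes.items()}
        peak = max(logits.values())  # shift for numerical stability
        log_denom = peak + math.log(sum(math.exp(v - peak) for v in logits.values()))
        total += -weights[y] * (logits[y] - log_denom)
    return total / len(embeddings)


protos = {0: [1.0, 0.0], 1: [0.0, 1.0]}
w = {0: 1.0, 1: 1.0}
aligned = prototype_loss([[1.0, 0.0]], [0], protos, w)     # embedding on its prototype
misaligned = prototype_loss([[1.0, 0.0]], [1], protos, w)  # embedding on the wrong one
```

The loss is small when an embedding sits close to the prototype of its own class and large when it sits close to another class's prototype.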
3.6 Multi-scan distillation
We aim to learn a segmentation network that takes a single LiDAR scan as
input and can be deployed in real-time onboard applications. During our
label-efficient training, we can train a multi-scan network as a teacher model. It applies
temporal fusion of multiple scans and takes the densified point cloud as input,
compensating for the sparsity and incompleteness within a single scan. The
teacher model is thus expected to exploit the richer semantics and perform better
than a single-scan model. In particular, it may improve the performance for those
underrepresented categories, which tend to be small and sparse. After that, we
distill the knowledge from the multi-scan teacher model to boost the performance
of the single-scan student model.
Specifically, for a scan at time t, we fuse the point clouds of neighboring
scans at time {t+i∆;i ∈ [−2,2]} (∆ is a time interval) using the ego-poses of
the LiDAR sensor. To enable a large batch size, we use voxel subsampling [67]
to normalize the fused point cloud to a fixed size. Labels are then fused
accordingly. Besides the spatial coordinates, we also concatenate an additional channel
indicating the time index $i$ of each point. The teacher model is trained using the
loss functions introduced in Secs. 3.4 and 3.5.
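The temporal fusion used for the teacher input can be sketched as below. The sketch assumes each ego-pose is a rigid sensor-to-world transform given as a rotation matrix and translation vector, and it measures the interval ∆ in scan indices; both are assumptions, as are the function names. Voxel subsampling and label fusion are omitted.

```python
def to_world(point, pose):
    """Apply a rigid sensor-to-world pose (R, t); R is a 3x3 nested list."""
    R, t = pose
    return [sum(R[r][k] * point[k] for k in range(3)) + t[r] for r in range(3)]


def to_sensor(point, pose):
    """Inverse of to_world: R^T (p - t)."""
    R, t = pose
    d = [point[k] - t[k] for k in range(3)]
    return [sum(R[k][r] * d[k] for k in range(3)) for r in range(3)]


def fuse_scans(scans, poses, ref, offsets=(-2, -1, 0, 1, 2), delta=1):
    """Fuse scans ref + i * delta into the frame of scan `ref`.

    Each output point is (x, y, z, i), where i is the time index channel.
    """
    fused = []
    for i in offsets:
        j = ref + i * delta
        if not 0 <= j < len(scans):
            continue  # neighbor falls outside the sequence
        for p in scans[j]:
            x, y, z = to_sensor(to_world(p, poses[j]), poses[ref])
            fused.append((x, y, z, i))
    return fused


# A static object at world x = 5 observed while the sensor moves along x.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
poses = [(identity, [float(k), 0.0, 0.0]) for k in range(5)]
scans = [[(5.0 - k, 0.0, 0.0)] for k in range(5)]
fused = fuse_scans(scans, poses, ref=2)
```

In the example all five observations of the static object land on the same location in the reference frame and differ only in the time index channel.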
The student model shares the same backbone network and is first trained
from scratch in the same way as the teacher model except for the single-scan
input. We then fine-tune it by incorporating an additional distillation loss
$L_{\mathrm{dis}}$. Specifically, following [22], we match student predictions with
the soft pseudo-labels generated by the teacher model via a cross-entropy loss:
$$L_{\mathrm{dis}} = -\frac{T^2}{n} \sum_{i}^{n} \sum_{c} \frac{\exp(u_{ic}/T)}{\sum_{c'} \exp(u_{ic'}/T)} \log\left( \frac{\exp(v_{ic}/T)}{\sum_{c'} \exp(v_{ic'}/T)} \right) \tag{4}$$
where $u_{ic}$ and $v_{ic}$ are the predicted logits for point $i$ and category $c$
by the teacher and student models respectively, and $T$ is a temperature
hyper-parameter. A higher temperature is typically used so that the probability
distribution across classes is smoother, and the distillation is thus encouraged to
match the negative logits, which also contain rich information. The cross-entropy
is multiplied by $T^2$ to align the magnitudes of the gradients with the other
losses [22].
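Eq. (4) can be checked with a few lines of Python. The sketch below uses nested lists of logits; the function names and the default temperature T = 2 are assumptions for illustration.

```python
import math


def softmax(logits, T):
    """Temperature-scaled softmax with a max shift for numerical stability."""
    peak = max(logits)
    exps = [math.exp((z - peak) / T) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]


def distillation_loss(teacher_logits, student_logits, T=2.0):
    """Eq. (4): T^2-scaled cross-entropy between softened teacher and student outputs."""
    total = 0.0
    for u, v in zip(teacher_logits, student_logits):
        q = softmax(u, T)  # soft pseudo-labels from the teacher
        p = softmax(v, T)  # softened student prediction
        total += -sum(qc * math.log(pc) for qc, pc in zip(q, p))
    return T * T * total / len(teacher_logits)


teacher = [[2.0, 0.0, -1.0]]
matched = distillation_loss(teacher, [[2.0, 0.0, -1.0]])
mismatched = distillation_loss(teacher, [[0.0, 2.0, -1.0]])
```

For a fixed teacher, the cross-entropy is smallest when the student's softened distribution equals the teacher's, so `matched` is lower than `mismatched`.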
Please note that the idea of multi-scan distillation may only be beneficial for
our label-efficient LiDAR segmentation setting. For the fully supervised setting,
all labels are already available and accurate, and there is no need to leverage the
pseudo labels. For the indoor setting, all points are sampled from high-quality
reconstructed meshes, and there is no need for a multi-scan teacher model.
4 Experiments
We employ Cylinder3D [68], a recent state-of-the-art method for LiDAR semantic
segmentation, as our backbone network. We utilize ground truth labels to
mimic the obtained human annotations, and no extra noise is added. Please refer
to the supplementary material for more implementation and training details.
We evaluate the proposed method on two large-scale autonomous driving
datasets, SemanticKITTI [4] and nuScenes [5]. SemanticKITTI [4] is collected
in Germany with 64-beam LiDAR sensors. The (sensor) capture and annotation
frequency is 10 Hz. It contains 10 training sequences (19k scans), 1 validation
sequence (4k scans), and 11 testing sequences (20k scans). 19 classes are used for
segmentation. nuScenes [5] is collected in Boston and Singapore with 32-beam
LiDAR sensors. Although the (sensor) capture frequency is 20 Hz, the annotation
frequency is only 2 Hz. It contains 700 training sequences (28k scans), 150
validation sequences (6k scans), and 150 testing sequences (6k scans). 16 classes
are used for segmentation. For both datasets, we follow the official guidance [4,5]
to use mean intersection-over-union (mIoU) as the evaluation metric.
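For reference, mIoU averages the per-class ratio TP / (TP + FP + FN). The Python sketch below computes it from flat prediction and target lists; the function name is hypothetical, and averaging only over classes that occur in the predictions or targets is an assumption of this example rather than a statement about the official benchmark tooling.

```python
def mean_iou(pred, target, num_classes, ignore=None):
    """Return (mIoU, per-class IoUs) for classes that occur in pred or target."""
    inter = [0] * num_classes
    union = [0] * num_classes
    for p, t in zip(pred, target):
        if t == ignore:
            continue  # unlabeled points do not count
        if p == t:
            inter[t] += 1  # true positive for class t
            union[t] += 1
        else:
            union[p] += 1  # false positive for class p
            union[t] += 1  # false negative for class t
    ious = [i / u for i, u in zip(inter, union) if u > 0]
    return sum(ious) / len(ious), ious


# Class 0: IoU 1/2, class 1: IoU 2/3.
miou, per_class = mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2)
```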
4.1 Comparison on SemanticKITTI
We compare the proposed method with both label-efficient [55,24,33,23] and
fully supervised [58,65,1,16,12,27,50,31,68] methods. Please note that all competing
label-efficient methods mainly focus on indoor settings and are not specially
designed for outdoor LiDAR segmentation. Among them, ContrastiveSC [23]
employs contrastive learning as unsupervised pre-training and uses the learned
features for active labeling, ReDAL [55] also employs active labeling, OneThin-
gOneClick [33] proposes a self-training approach and iteratively propagate the
labels, and SQN [24] presents a network by leveraging the similarity between
neighboring points. We report the results on the validation set. Since Con-
trastiveSC[23]andOneThingOneClick[33]areonlytestedonindoordatasetsinLESS: Label-Efficient Semantic Segmentation for LiDAR Point Clouds 11
Method | Annot. | mIoU | car | bicycle | motorcycle | truck | other-vehicle | person | bicyclist | motorcyclist | road | parking | sidewalk | other-ground | building | fence | vegetation | trunk | terrain | pole | traffic-sign
SqueezeSegV3 [58] | 100% | 52.7 | 86 | 31 | 48 | 51 | 42 | 52 | 52 | 0 | 95 | 47 | 82 | 0 | 80 | 47 | 83 | 53 | 72 | 42 | 38
PolarNet [65] | 100% | 53.6 | 92 | 31 | 39 | 46 | 24 | 54 | 62 | 0 | 92 | 47 | 78 | 2 | 89 | 46 | 85 | 60 | 72 | 58 | 42
MPF [1] | 100% | 57.0 | 94 | 28 | 55 | 62 | 36 | 57 | 74 | 0 | 95 | 47 | 81 | 1 | 88 | 53 | 86 | 54 | 73 | 57 | 42
S-BKI [16] | 100% | 57.4 | 94 | 34 | 57 | 45 | 27 | 53 | 72 | 0 | 94 | 50 | 84 | 0 | 89 | 60 | 87 | 63 | 75 | 64 | 45
TemporalLidarSeg [12] | 100% | 61.3 | 92 | 43 | 54 | 84 | 61 | 64 | 68 | 0 | 95 | 44 | 83 | 1 | 89 | 60 | 85 | 64 | 71 | 59 | 47
KPRNet [27] | 100% | 63.1 | 95 | 43 | 60 | 76 | 51 | 75 | 81 | 0 | 96 | 51 | 84 | 0 | 90 | 60 | 88 | 66 | 76 | 63 | 43
SPVNAS [50] | 100% | 64.7 | 97 | 35 | 72 | 81 | 66 | 71 | 86 | 0 | 94 | 48 | 81 | 0 | 92 | 67 | 88 | 65 | 74 | 64 | 49
AMVNet [31] | 100% | 65.2 | 96 | 49 | 65 | 89 | 55 | 71 | 86 | 0 | 96 | 54 | 83 | 0 | 91 | 62 | 88 | 67 | 74 | 65 | 49
Cylinder3D [68] | 100% | 65.9 | 97 | 55 | 79 | 80 | 67 | 75 | 86 | 1 | 95 | 46 | 82 | 1 | 89 | 53 | 87 | 71 | 71 | 66 | 53
Cylinder3D* | 100% | 66.2 | 97 | 48 | 72 | 94 | 67 | 74 | 91 | 0 | 93 | 44 | 79 | 3 | 91 | 60 | 88 | 70 | 72 | 63 | 53
ReDAL [55] | 5% | 59.8 | 95 | 30 | 59 | 63 | 50 | 63 | 84 | 1 | 92 | 39 | 78 | 1 | 89 | 54 | 87 | 62 | 74 | 64 | 50
OneThingOneClick [33] | 0.1% | 26.0 | 77 | 0 | 0 | 2 | 1 | 0 | 2 | 0 | 63 | 0 | 38 | 0 | 73 | 44 | 78 | 39 | 53 | 25 | 0
ContrastiveSC [23] | 0.1% | 46.0 | 93 | 0 | 0 | 62 | 45 | 28 | 0 | 0 | 90 | 39 | 71 | 6 | 90 | 42 | 89 | 57 | 75 | 54 | 34
SQN [24] | 0.1% | 52.0 | 93 | 8 | 35 | 59 | 46 | 41 | 59 | 0 | 91 | 37 | 76 | 1 | 89 | 51 | 85 | 61 | 73 | 53 | 35
LESS (Ours) | 0.1% | 66.0 | 97 | 50 | 73 | 94 | 67 | 76 | 92 | 0 | 93 | 40 | 79 | 3 | 91 | 60 | 87 | 68 | 71 | 62 | 51
SQN [24] | 0.01% | 38.3 | 83 | 0 | 22 | 12 | 17 | 15 | 47 | 0 | 85 | 21 | 65 | 0 | 79 | 37 | 77 | 46 | 67 | 44 | 12
LESS (Ours) | 0.01% | 61.0 | 96 | 33 | 61 | 73 | 59 | 68 | 87 | 0 | 92 | 38 | 76 | 5 | 89 | 52 | 87 | 67 | 71 | 59 | 46
Table 2: Comparison on the SemanticKITTI validation set. Cylinder3D
[68] is our fully supervised counterpart. Cylinder3D* is our re-trained
version with our proposed prototype learning and multi-scan distillation.
the original papers, we adapt the source code published by the authors and train
their models on SemanticKITTI [4]. For other methods, the results are obtained
either from the literature or from correspondence with the authors.
Tab. 2 lists the results, where our method outperforms existing label-efficient
methods by a large margin. With only 0.1% sparse labels (as defined in Sec. 3.4),
it even matches the performance of the fully supervised baseline Cylinder3D
[68] (66.0% vs. 65.9% mIoU), which demonstrates the potential for deployment in
real applications. By checking the breakdown results, we find that the differences
between methods mainly come from the underrepresented categories, such as bicycle,
motorcycle, person, and bicyclist. Existing label-efficient methods, which are mainly
designed for indoor settings, suffer considerably from the highly unbalanced sample
distribution, while our method remains remarkably competitive in those underrepresented
classes. See Fig. 4 for further demonstration. OneThingOneClick [33] fails to
produce decent results, which is partially due to its pure super-voxel assumption
that does not always hold in outdoor scenes. As for the 0.01% annotation
setting, the performance of SQN [24] drops drastically to 38.3%, whereas our
proposed method can still achieve a high mIoU of 61.0%. For completeness, we
also re-train Cylinder3D [68] with our proposed prototype learning and multi-scan
distillation. We find that the two strategies provide only a marginal gain
(66.2% vs. 65.9% mIoU) in the fully supervised setting, where all labels are
available and accurate.
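For readers who want to relate the per-class columns of Tab. 2 to the mIoU column, the following minimal sketch computes per-class IoU from a confusion matrix and averages it. The function names and the toy confusion matrix are illustrative assumptions, not part of the released code. Averaging the rounded per-class entries of a table row only approximates the reported mIoU, since the benchmark accumulates statistics over the whole validation set before rounding.

```python
def per_class_iou(confusion):
    """confusion[i][j] = number of points with ground truth i predicted as j."""
    n = len(confusion)
    ious = []
    for c in range(n):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp                        # missed points of class c
        fp = sum(confusion[r][c] for r in range(n)) - tp   # points wrongly assigned to c
        denom = tp + fp + fn
        ious.append(tp / denom if denom > 0 else float("nan"))
    return ious


def mean_iou(ious):
    """Average the per-class IoUs, skipping classes that never occur (NaN)."""
    valid = [v for v in ious if v == v]
    return sum(valid) / len(valid)


# Toy two-class confusion matrix (hypothetical numbers).
toy_ious = per_class_iou([[8, 2], [1, 9]])

# Rounded per-class IoUs (%) of the "LESS (Ours), 0.1%" row in Tab. 2.
less_row = [97, 50, 73, 94, 67, 76, 92, 0, 93, 40, 79, 3, 91, 60, 87, 68, 71, 62, 51]
less_miou = mean_iou(less_row)
```

The zero entry for motorcyclist counts fully toward the mean, which is why rare classes have such a strong influence on the comparison.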
4.2 Comparison on nuScenes
We also compare the proposed method with existing approaches on the nuScenes [5]
dataset and report the results on the validation set. Since the author-released
model of Cylinder3D [68] utilizes SemanticKITTI for pre-training, here, we report
its result based on training the model from scratch for a fair comparison.
Fig.4: Qualitative examples on the SemanticKITTI [4] (first row) and
nuScenes [5] (second row) validation sets. Please zoom in for the details.
Red rectangles highlight the wrong predictions. Our results are similar to the
fully supervised counterpart, while ContrastiveSceneContext [23] produces worse
results on underrepresented categories (see persons and bicycles). Please note
that points in the two datasets (with different densities) are visualized with
different point sizes for better visualization.
Method | Anno. | mIoU (%)
(AF)2-S3Net [9] | 100% | 62.2
SPVNAS [50] | 100% | 74.8
Cylinder3D [68] | 100% | 75.4
AMVNet [31] | 100% | 77.2
RPVNet [59] | 100% | 77.6
ContrastiveSC [23] | 0.2% | 63.5
LESS (Ours) | 0.2% | 73.5
ContrastiveSC [23] | 0.9% | 65.5
LESS (Ours) | 0.9% | 74.8
Table 3: Comparison on nuScenes validation set. Cylinder3D [68] is our
fully supervised counterpart.

pre-seg. | weak labels | propa. labels | proto. learning | multi-scan distillation | mIoU (%)
✗ | ✗ | ✗ | ✗ | ✗ | 48.1
✓ | ✗ | ✗ | ✗ | ✗ | 59.3
✓ | ✓ | ✗ | ✗ | ✗ | 61.6
✓ | ✗ | ✓ | ✗ | ✗ | 62.2
✓ | ✓ | ✓ | ✗ | ✗ | 63.5
✓ | ✓ | ✓ | ✓ | ✗ | 64.9
✓ | ✓ | ✓ | ✓ | ✓ | 66.0
Table 4: Ablation study on the SemanticKITTI validation set. All
variants use 0.1% sparse labels.
For other fully-supervised methods [9,50,31,59], the results are either obtained
from the literature or correspondences with the authors. Since no prior label-
efficient work is tested on the nuScenes [5] dataset, we adapt the source code
published by the authors to train ContrastiveSceneContext [23] from scratch.
We want to point out that points in the nuScenes dataset are much sparser than
those in SemanticKITTI. In nuScenes, only 2 scans per second are labeled, while
in SemanticKITTI, 10 scans per second are labeled. Due to the difference in
sensors (32-beam vs. 64-beam), the number of points per scan in nuScenes is also
much smaller (26k vs. 120k). See the right inset for the comparison of the two
datasets (fused points for 0.5 seconds). Considering the sparsity of the
original ground truth labels, here we report the 0.2% and 0.9% annotation settings.
Tab. 3 shows the results, where our proposed method outperforms
ContrastiveSceneContext [23] by a large margin. With only 0.2% sparse labels, our
result is also highly competitive with the fully-supervised counterpart [68].
Statistics | SemanticKITTI | nuScenes
one-category components | 68.6% | 80.6%
two-category components | 23.8% | 14.9%
components with more than two categories | 7.6% | 4.5%
average number of categories per component | 1.40 | 1.25
coverage of sparse labels | 0.1% | 0.2%
coverage of propagated labels | 42.0% | 53.6%
coverage of weak labels | 95.5% | 99.0%
Table 5: Statistics of the pre-segmentation and labeling. Only sparse
labels are directly annotated by humans.
annotation policy | pre-segmentation | mIoU (%) | #labels for motorcycle | IoU (%) of motorcycle
randomly sample points | ✗ | 48.1 | 943 | 22.2
randomly sample scans | ✗ | 35.6 | 548 | 0.0
active labeling [23] | ✗ | 54.2 | 456 | 36.7
uniform grid partition | ✓ | 61.4 | 1024 (76k) | 61.3
geometric partition [28,33] | ✓ | 61.9 | 1190 (294k) | 64.0
LESS (Ours) | ✓ | 64.9 | 1146 (933k) | 72.3
Table 6: Comparison of various annotation policies on SemanticKITTI.
All methods utilize 0.1% annotations and the same backbone network [68]. The
fourth column indicates the number of sparse labels (and propagated labels) for
an underrepresented category (i.e., motorcycle). Multi-scan distillation is not
utilized here. The IoU results are calculated on the validation set.
4.3 Ablation study
Tab. 4 shows the ablation study of each component. The first row is the result of
training with 0.1% random point labels. By incorporating the pre-segmentation,
we spend the limited annotation budget on more underrepresented instances,
thereby significantly increasing mIoU from 48.1% to 59.3%. Derived from the
component proposals, weak labels and propagated labels complement the
human-annotated sparse labels and provide dense supervision. Compared to
multi-category weak labels, propagated labels provide more accurate supervision and
thus lead to a slightly higher gain. Both contrastive prototype learning and
multi-scan distillation further boost the performance and finally close the gap
between LESS and the fully-supervised counterpart in terms of mIoU.
4.4 Analysis of pre-segmentation & labeling
By leveraging the unique geometric structure and a careful design, our
pre-segmentation works well for outdoor LiDAR point clouds. Tab. 5 summarizes
some statistics of the pre-segmentation and labeling results. For both datasets,
fewer than 10% of the components contain more than two categories, which
validates that our pre-segmentation generates high-purity components. The high
“coverage of propagated labels” indicates that we thus deduce a good amount of
“free” supervision from the pure components. The low “coverage of sparse labels”
shows that annotators indeed only need to label a tiny portion of points, thus
reducing human effort. The “coverage of weak labels” confirms that the proposed
components can faithfully cover most points. Furthermore, the consistent results
across two distinct datasets verify that our method generalizes well in practice.
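The purity statistics in Tab. 5 can be reproduced in spirit with a few lines of code. The sketch below assumes that each point carries a component id from the pre-segmentation and a ground-truth category, and that the percentages are fractions of components (not of points); both the function name and the toy data are hypothetical.

```python
from collections import defaultdict


def component_statistics(component_ids, labels):
    """Summarize how many distinct categories fall into each component."""
    categories = defaultdict(set)
    for comp, label in zip(component_ids, labels):
        categories[comp].add(label)
    counts = [len(cats) for cats in categories.values()]
    n = len(counts)
    return {
        "one_category": sum(c == 1 for c in counts) / n,
        "two_category": sum(c == 2 for c in counts) / n,
        "more_than_two": sum(c > 2 for c in counts) / n,
        "avg_categories": sum(counts) / n,
    }


# Toy example with four components (hypothetical data).
component_ids = [0, 0, 0, 1, 1, 2, 2, 2, 3]
labels = ["car", "car", "car", "road", "sidewalk",
          "pole", "fence", "vegetation", "car"]
stats = component_statistics(component_ids, labels)
```

A high `one_category` fraction is what makes label propagation cheap: a single click on such a component labels all of its points.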
[Figure 5 panels, left to right: single-scan (before distillation), multi-scan teacher, single-scan (after distillation)]
Fig.5: Improving segmentation with multi-scan distillation. The multi-
scan teacher leverages the richer semantics via temporal fusion to accurately
segment the bicycle and ground, which provides high-quality supervision to en-
hance the single-scan model.
Model | mIoU | bicycle IoU
single-scan (before) | 64.9% | 45.6%
multi-scan teacher | 66.8% | 51.5%
single-scan (after) | 66.0% | 49.9%
Table 7: Results of the multi-scan distillation on the SemanticKITTI
validation set. 0.1% annotations are used.
Tab. 6 shows the comparison of different annotation policies (i.e., how to
use the labeling budget). The first two baselines are introduced in Sec. 3.1;
“active labeling” utilizes the features from contrastive pre-training to actively
select points [23]; “uniform grid partition” uniformly divides the fused point
clouds into a grid according to the xy coordinates and treats each cell as a
component; and “geometric partition” extracts handcrafted geometric features and
solves a minimal partition problem [28,68]. All of them are trained with the same
backbone, Cylinder3D [68]. The first three methods employ no pre-segmentation
and are trained with L_sparse only. The other approaches utilize our labeling policy
(i.e., one label per class for each component) and are trained with the additional
losses L_propagated, L_weak, and L_proto. As a result, their performance is much higher
than that of the first three methods. We also report the number of labels and the IoU
for an underrepresented category. We see that our policy leads to more useful
supervision and higher IoUs for underrepresented categories.
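To make the “uniform grid partition” baseline concrete, here is a minimal sketch that bins fused points into square cells by their xy coordinates and treats each non-empty cell as one component. The cell size of 2.0 and the function name are assumptions for illustration only; the text does not state the cell size used for this baseline.

```python
import math
from collections import defaultdict


def uniform_grid_partition(points, cell_size):
    """Group point indices by the xy grid cell they fall into (z is ignored)."""
    cells = defaultdict(list)
    for idx, (x, y, _z) in enumerate(points):
        key = (math.floor(x / cell_size), math.floor(y / cell_size))
        cells[key].append(idx)
    return dict(cells)


# Toy fused point cloud (hypothetical coordinates).
points = [(0.2, 0.3, 0.0), (1.9, 0.1, 0.5), (2.1, 0.4, 0.1), (-0.5, 0.2, 0.0)]
components = uniform_grid_partition(points, cell_size=2.0)
```

Because the cells ignore geometry, a single cell can easily mix several categories, which is consistent with the smaller number of propagated labels reported for this baseline in Tab. 6.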
4.5 Analysis of multi-scan distillation
Tab. 7 and Fig. 5 show the results of multi-scan distillation. The teacher model
exploits the densified point clouds via temporal fusion and thus performs better
than the single-scan model (even compared to the fully supervised single-scan
model). Through knowledge distillation from the teacher model, the student
model improves a lot in the underrepresented classes and completely matches
the fully supervised model in mIoU.
5 Conclusion and future work
We study label-efficient LiDAR point cloud semantic segmentation and propose
a pipeline that co-designs the labeling and the model learning and can work
with most 3D segmentation backbones. We show that our method can utilize
bare minimum human annotations to achieve highly competitive performance.
We have shown LESS is an effective approach for bootstrapping labeling
and learning from scratch. In addition, LESS is also well suited to efficiently
improving a performant model. With the predictions of an existing model,
the proposed pipeline can be used by annotators to pick and label high-value
component proposals, such as underrepresented classes, long-tail instances, and
classes with the most failures. We leave this for future exploration.
In this supplementary material, we first present the implementation and
training details of our proposed method and baseline methods (Appendix S.1).
We then show the visual examples of our pre-segmentation results (Appendix S.2),
the full results on the nuScenes dataset (Appendix S.3), and the multi-scan
distillation results on the SemanticKITTI dataset (Appendix S.4). Finally, we analyze
the generated label distribution (Appendix S.5) and the robustness to label noise
(Appendix S.6).
S.1 Implementation & training details
Pre-segmentation & labeling While some prior works require perfect
pre-segmentation results, our proposed labeling and training pipeline (using weak
and propagated labels) allows imperfect component proposals (e.g., a component
with multiple categories or an object instance divided into multiple components),
which greatly mitigates the impact of pre-segmentation quality on final
performance. Our pre-segmentation heuristic only includes two key steps: ground
removal and connected component construction. Compared to other complex
heuristics, it has fewer hyperparameters. Also, thanks to the good property of
outdoor point clouds (i.e., objects are well-separated), we find that, in our
experiments, the hyperparameters are intuitive and easy to select without much
effort.
For example, during the ground removal, we find that the cell size and the
RANSAC threshold are robust across datasets, and we set them to be 5m × 5m
and 0.2m for both datasets. When building connected components, the parameter
d should accommodate the LiDAR sensor (the sparser the points, the larger
the d). We set d to 0.01 and 0.02 for the SemanticKITTI [4] and nuScenes [5]
datasets, respectively. In our experiments, choosing hyperparameters with visual
inspection is convenient and sufficient to achieve satisfactory results.
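The role of the parameter d can be illustrated with a brute-force connected-component sketch over non-ground points. It assumes that two points are linked when their Euclidean distance is below d times the larger of their ranges to the sensor, so the threshold grows where the LiDAR sampling is sparser; this linking rule, the function name, and the toy points are assumptions for illustration and may differ from the actual implementation. A practical version would use a spatial index instead of the quadratic loop.

```python
import math


def connected_components(points, ranges, d):
    """Label points so that linked points share a component id (union-find)."""
    n = len(points)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            # Assumed rule: range-scaled distance threshold.
            threshold = d * max(ranges[i], ranges[j])
            if math.dist(points[i], points[j]) < threshold:
                parent[find(i)] = find(j)

    ids = {}
    return [ids.setdefault(find(i), len(ids)) for i in range(n)]


# Toy non-ground points with the sensor at the origin (hypothetical data).
points = [(10.0, 0.0, 0.0), (10.05, 0.0, 0.0), (10.0, 5.0, 0.0), (10.02, 5.05, 0.0)]
ranges = [math.dist(p, (0.0, 0.0, 0.0)) for p in points]
component_of = connected_components(points, ranges, d=0.01)
```

With d = 0.01, points about 10 m away are linked when they are closer than roughly 0.1 m, so the two nearby pairs above form two separate components.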
For the SemanticKITTI [4] dataset, we fuse every 5 adjacent scans for the 0.1%
setting and every 100 adjacent scans for the 0.01% setting. Fusing more adjacent
scans will improve labeling efficiency, but may sacrifice pre-segmentation
quality as points may become blurry, especially for dynamic objects. After
constructing connected components, oversized components are subdivided along the
xy axes to ensure each component is within a fixed size (i.e., 2m × 2m for
non-ground components). We also ignore small components with no more than 100
points. For each component of size s, we randomly label 1 point for each category
whose number of points is more than 0.05s. The motivation here is to prevent
those noisy and ambiguous points within each component from decreasing the
component purity. In real applications, human labelers may also miss or ignore
those noisy categories to accelerate the annotation.
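Method | Anno. | mIoU | barrier | bicycle | bus | car | construction-vehicle | motorcycle | pedestrian | traffic-cone | trailer | truck | drivable-surface | other-flat | sidewalk | terrain | manmade | vegetation
(AF)2-S3Net [9] | 100% | 62.2 | 60.3 | 12.6 | 82.3 | 80.0 | 20.1 | 62.0 | 59.0 | 49.0 | 42.2 | 67.4 | 94.2 | 68.0 | 64.1 | 68.6 | 82.9 | 82.4
RangeNet++ [39] | 100% | 65.5 | 66.0 | 21.3 | 77.2 | 80.9 | 30.2 | 66.8 | 69.6 | 52.1 | 54.2 | 72.3 | 94.1 | 66.6 | 63.5 | 70.1 | 83.1 | 79.8
PolarNet [65] | 100% | 71.0 | 74.7 | 28.2 | 85.3 | 90.9 | 35.1 | 77.5 | 71.3 | 58.8 | 57.4 | 76.1 | 96.5 | 71.1 | 74.7 | 74.0 | 87.3 | 85.7
SPVNAS [50] | 100% | 74.8 | 74.9 | 39.9 | 91.1 | 86.4 | 45.8 | 83.7 | 72.1 | 64.3 | 62.5 | 83.3 | 96.2 | 72.7 | 73.6 | 74.1 | 88.3 | 87.4
Cylinder3D [68] | 100% | 75.4 | 75.3 | 41.7 | 91.6 | 86.1 | 52.9 | 79.3 | 79.2 | 66.1 | 61.5 | 81.7 | 96.4 | 72.3 | 73.8 | 73.5 | 88.1 | 86.5
AMVNet [31] | 100% | 77.0 | 77.7 | 43.8 | 91.7 | 93.0 | 51.1 | 80.3 | 78.8 | 65.7 | 69.6 | 83.5 | 96.9 | 71.4 | 75.1 | 75.3 | 90.1 | 88.3
RPVNet [59] | 100% | 77.6 | 78.2 | 43.4 | 92.7 | 93.2 | 49.0 | 85.7 | 80.5 | 66.0 | 66.9 | 84.0 | 96.9 | 73.5 | 75.9 | 76.0 | 90.6 | 88.9
ContrastiveSC [23] | 0.2% | 63.5 | 65.6 | 0.0 | 82.7 | 87.3 | 42.8 | 46.3 | 57.1 | 32.2 | 59.0 | 76.4 | 94.2 | 62.5 | 65.9 | 68.8 | 87.8 | 86.8
LESS (Ours) | 0.2% | 73.5 | 73.7 | 38.3 | 92.0 | 89.7 | 46.9 | 75.6 | 70.9 | 58.4 | 64.8 | 83.0 | 95.6 | 67.6 | 70.9 | 71.8 | 89.2 | 87.3
ContrastiveSC [23] | 0.9% | 64.5 | 64.0 | 12.7 | 80.7 | 87.6 | 41.1 | 55.8 | 61.6 | 37.5 | 59.1 | 75.2 | 94.2 | 65.6 | 67.0 | 70.1 | 88.0 | 87.2
LESS (Ours) | 0.9% | 74.8 | 75.0 | 42.3 | 91.9 | 89.9 | 51.0 | 80.0 | 72.6 | 60.1 | 64.9 | 83.6 | 95.7 | 67.5 | 71.7 | 73.1 | 89.5 | 87.6
Table S8: Comparison of different methods on the nuScenes validation
set. Cylinder3D [68] is our fully supervised counterpart.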
For the nuScenes [5] dataset, we share the same hyperparameters as
SemanticKITTI, except for the following. We fuse every 40 adjacent scans and
ignore small components with no more than 10 points. For each component
proposal of size s, we randomly label 1 (or 4) point(s) for each category whose
number of points is more than 0.01s, corresponding to the 0.2% (0.9%) setting.
These subtle differences arise mainly because the points in the nuScenes [5]
dataset are much sparser (e.g., the right inset shows the fused points for 0.5
seconds), so we fuse more points and annotate more labels to compensate for the
point sparsity.
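The labeling rule described above for both datasets can be simulated from ground-truth labels as in the following minimal sketch. The function name, the parameter names, and the toy component are hypothetical; only the thresholds (100 points and 0.05s for SemanticKITTI, 10 points and 0.01s with 1 or 4 labels for nuScenes) come from the text.

```python
import random
from collections import defaultdict


def select_sparse_labels(component, min_points, min_fraction,
                         labels_per_category, rng):
    """component: list of (point_index, category) pairs of one proposal.

    Returns {category: [selected point indices]}; empty for ignored components.
    """
    size = len(component)
    if size <= min_points:          # "no more than min_points" -> ignored
        return {}
    by_category = defaultdict(list)
    for idx, category in component:
        by_category[category].append(idx)
    picked = {}
    for category, indices in by_category.items():
        if len(indices) > min_fraction * size:   # skip tiny, noisy categories
            k = min(labels_per_category, len(indices))
            picked[category] = rng.sample(indices, k)
    return picked


# Settings taken from the text; the dictionary names are illustrative.
SEMANTICKITTI = dict(min_points=100, min_fraction=0.05, labels_per_category=1)
NUSCENES_02 = dict(min_points=10, min_fraction=0.01, labels_per_category=1)
NUSCENES_09 = dict(min_points=10, min_fraction=0.01, labels_per_category=4)

rng = random.Random(0)
# Toy component: 180 car points, 15 road points, 5 ambiguous pole points.
component = ([(i, "car") for i in range(180)]
             + [(i, "road") for i in range(180, 195)]
             + [(i, "pole") for i in range(195, 200)])
picked = select_sparse_labels(component, rng=rng, **SEMANTICKITTI)
```

In this toy component the five pole points fall below the 0.05s threshold, so only car and road receive a sparse label. A component that ends up with a single labeled category is the case in which that label can be propagated to all of its points.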
Network training As for contrastive prototype learning, the momentum parameter
m is empirically set to 0.99, and the temperature parameter τ is set to 0.1. In
multi-scan distillation, we fuse the scans at time {t+0.5i; i ∈ [−2,2]} for
SemanticKITTI, and {t+0.5i; i ∈ [−3,3]} for nuScenes. We tried multiple sets
of parameters (different numbers of scans and intervals). They do lead to some
differences (∼3% mIoU), and we choose the best empirically. We keep all points
for scan i = 0, and use voxel downsampling to sub-sample 120k points from
other scans. The temperature T is set to 4.
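As a minimal sketch of the two settings just described, the code below lists the fused scan times for a given half-width and evaluates a temperature-softened distillation loss for one point. It assumes the standard formulation of knowledge distillation [22], i.e., the KL divergence between teacher and student distributions softened by T and scaled by T squared; whether the actual loss includes that scaling factor is an assumption, and all function names and logits are hypothetical.

```python
import math


def fusion_timestamps(t, half_width):
    """Times {t + 0.5 i} for integer i in [-half_width, half_width]."""
    return [t + 0.5 * i for i in range(-half_width, half_width + 1)]


def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    m = max(scaled)                       # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=4.0):
    """KL(teacher || student) on temperature-softened distributions."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(pt * (math.log(pt) - math.log(ps))
             for pt, ps in zip(p_teacher, p_student) if pt > 0)
    return temperature ** 2 * kl          # assumed T^2 scaling, as in [22]


semantickitti_times = fusion_timestamps(10.0, half_width=2)   # 5 scans
nuscenes_times = fusion_timestamps(10.0, half_width=3)        # 7 scans
loss = distillation_loss([2.0, 0.5, -1.0], [3.0, 0.0, -2.0], temperature=4.0)
```

A larger T flattens both distributions, so the student also learns from the relative scores the teacher assigns to non-maximal classes.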
We sum up all loss terms with equal weights and train the models on 4
NVIDIA A100 GPUs. For SemanticKITTI, the batch size is 12 and 8 for the
single-scan and the multi-scan model, respectively. For nuScenes, the batch size
is 16 and 12 for the single-scan and the multi-scan model, respectively. We utilize
the Adam optimizer, and the learning rate is initially set to 1e-3 and then decayed
to 1e-4 after convergence. During distillation, the learning rate is set to 1e-4.
Other training parameters are the same as Cylinder3D [68].
Fig.S6: Examples of the pre-segmentation results. First row: detected
ground points of each cell. Non-ground points are colored in gray. Each other
color indicates a proposed ground component. Second row: connected compo-
nents of the non-ground points. Each color indicates a connected component.
The example is from the nuScenes dataset, where 40 scans are fused.
Baseline Methods We adopt the author-released code to train
OneThingOneClick [33] and ContrastiveSceneContext [23] on SemanticKITTI and nuScenes.
For other methods, the results are obtained either from the literature or from
correspondence with the authors.
For ContrastiveSceneContext [23], we first compute the overlapping ratio
between every pair of scans within each sequence, where the voxel size is
set to 0.3m. We then use pairs of scans whose overlapping ratio is no less than
30% for contrastive pre-training. During pre-training, we train the model with
a voxel size of 0.15m for 100k iterations. The batch size is 12 and 20 for
SemanticKITTI and nuScenes, respectively. We then follow the provided pipeline
to infer the point features and select points for labeling. After that, we train the
segmentation network with the pre-trained weights for 30k iterations. The voxel
size is set to 0.1m, and the batch size is set to 18 and 36 for SemanticKITTI and
nuScenes, respectively. We disable the elastic distortion and the color-related
data augmentation.
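The pair-selection step for contrastive pre-training can be sketched as follows. The exact definition of the overlapping ratio is not given here, so this sketch assumes that both scans are expressed in a common world frame, are voxelized at 0.3m, and that the ratio is the number of shared voxels divided by the size of the smaller voxel set. The function names and toy scans are hypothetical.

```python
import math


def voxelize(points, voxel_size):
    """Set of integer voxel coordinates occupied by the points."""
    return {tuple(math.floor(c / voxel_size) for c in p) for p in points}


def overlap_ratio(scan_a, scan_b, voxel_size=0.3):
    """Assumed definition: shared voxels over the smaller occupied set."""
    voxels_a = voxelize(scan_a, voxel_size)
    voxels_b = voxelize(scan_b, voxel_size)
    if not voxels_a or not voxels_b:
        return 0.0
    return len(voxels_a & voxels_b) / min(len(voxels_a), len(voxels_b))


# Toy scans in a shared frame (hypothetical coordinates, in meters).
scan_a = [(0.1, 0.1, 0.1), (0.4, 0.1, 0.1), (0.7, 0.1, 0.1), (1.0, 0.1, 0.1)]
scan_b = [(0.75, 0.2, 0.2), (1.1, 0.2, 0.2), (1.3, 0.2, 0.2)]
ratio = overlap_ratio(scan_a, scan_b)
use_pair = ratio >= 0.30   # keep pairs with at least 30% overlap
```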
For OneThingOneClick [33], we first apply the geometrical partition de-
scribed in [28] to generate the super-voxels, where only the point coordinates
are used as input. We then randomly label a subset of super-voxels for a given
annotation budget. We follow the authors’ guidance to train the modules for
three iterations. In each iteration, we train the 3D-U-Net for 32 epochs (51k
Label type | car | bicycle | motorcycle | truck | other-vehicle | person | bicyclist | motorcyclist | road | parking | sidewalk | other-ground | building | fence | vegetation | trunk | terrain | pole | traffic-sign
sparse (×0.1%) | 0.8 | 2.7 | 0.9 | 0.6 | 0.8 | 1.8 | 1.8 | 2.7 | 0.4 | 0.7 | 0.6 | 1.4 | 0.8 | 1.0 | 1.0 | 2.0 | 1.0 | 3.1 | 4.1
propagated (%) | 79 | 12 | 75 | 77 | 75 | 52 | 64 | 48 | 16 | 6 | 9 | 17 | 77 | 25 | 55 | 29 | 32 | 28 | 9
Table S9: The coverage of sparse labels and propagated labels for the
SemanticKITTI dataset. The numbers are the ratios between the number
of sparse labels (and propagated labels) and the number of points within each
category.
Label type | barrier | bicycle | bus | car | construction-vehicle | motorcycle | pedestrian | traffic-cone | trailer | truck | drivable-surface | other-flat | sidewalk | terrain | manmade | vegetation
sparse (×0.1%) | 2.4 | 20.9 | 4.0 | 4.6 | 4.8 | 8.0 | 19.9 | 12.2 | 3.4 | 3.4 | 0.6 | 1.7 | 1.9 | 3.1 | 4.8 | 7.9
propagated (%) | 16 | 16 | 53 | 52 | 54 | 46 | 29 | 20 | 49 | 59 | 32 | 2 | 2 | 11 | 62 | 55
Table S10: The coverage of sparse labels and propagated labels for the
nuScenes dataset. The numbers are the ratios between the number of sparse
labels (and propagated labels) and the number of points within each category.
iterations) and the RelationNet for 64 epochs (102k iterations). During training,
the voxel size is set to 0.1m, and the batch size is set to 12. We disable the elastic
distortion for the data augmentation.
S.2 Visual results of pre-segmentation
Fig. S6 shows the examples of our pre-segmentation results.
S.3 Full results on nuScenes
Tab. S8 shows the full results on the nuScenes validation set.
S.4 Full table of multi-scan distillation
Tab. S11 shows the full results of the multi-scan distillation. The multi-scan
teacher model leverages the richer semantics via temporal fusion and achieves
significantly better performances in the underrepresented categories, such as
bicycle, person, and bicyclist. Through knowledge distillation from the teacher
model, the student model also improves a lot in those categories.
Method | mIoU | car | bicycle | motorcycle | truck | other-vehicle | person | bicyclist | motorcyclist | road | parking | sidewalk | other-ground | building | fence | vegetation | trunk | terrain | pole | traffic-sign
single-scan (before) | 64.9 | 97 | 46 | 72 | 91 | 69 | 73 | 88 | 0 | 92 | 39 | 77 | 4 | 90 | 58 | 88 | 66 | 73 | 61 | 52
multi-scan teacher | 66.8 | 97 | 52 | 82 | 94 | 72 | 78 | 92 | 0 | 93 | 40 | 79 | 1 | 89 | 54 | 87 | 70 | 72 | 64 | 53
single-scan (after) | 66.0 | 97 | 50 | 73 | 94 | 67 | 76 | 92 | 0 | 93 | 40 | 79 | 3 | 91 | 60 | 87 | 68 | 71 | 62 | 51
Table S11: Results of the multi-scan distillation on the SemanticKITTI
validation set. 0.1% annotations are used.
S.5 Label distribution
Tab. S9 and Tab. S10 summarize the distributions of the generated sparse labels
and the propagated labels. By leveraging our proposed pre-segmentation and
labeling policy, we put more emphasis on the underrepresented categories. For
example, the ratios of sparse labels for bicycle and road are 2.68 vs. 0.36 in the
SemanticKITTI dataset, and 20.85 vs. 0.63 in the nuScenes dataset. As for the
propagated labels, we find the distributions are unbalanced. Categories such
as car and building are easier to separate and form pure components,
thus having high coverage of propagated labels. However, some categories, such
as bicycle, road, sidewalk, and parking, are prone to be connected with other
categories, thus having low coverage of propagated labels. The discrepancy
between the distributions of the two types of labels confirms that we need to treat
them separately instead of simply merging them with a single loss function.
S.6 Robustness to label noise
In the paper, we use point labels from the original datasets to mimic the anno-
tation policy, and no extra noise is added.
To evaluate the robustness of our method to label noise, we randomly change
3% (or 10%) of the sparse point labels to a random category, which alters weak
labels and propagated labels accordingly. The resulting mIoU drops by 2.1% (or
3.7%), which is within a reasonable range and verifies that our method will not
be significantly affected by the label noise.
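The noise-injection protocol can be sketched as below. It assumes that the corrupted labels are chosen uniformly at random and that the replacement category is drawn uniformly from all categories, so it may occasionally coincide with the original label; the function name and the toy label list are hypothetical.

```python
import random


def inject_label_noise(labels, categories, noise_ratio, rng):
    """Replace a fraction of the sparse labels with a random category."""
    noisy = list(labels)
    n_flip = round(noise_ratio * len(noisy))
    for i in rng.sample(range(len(noisy)), n_flip):
        # The new category is drawn from all categories, so it can
        # occasionally equal the original label.
        noisy[i] = rng.choice(categories)
    return noisy


rng = random.Random(0)
categories = ["car", "road", "person"]
clean = ["road"] * 1000
noisy_3 = inject_label_noise(clean, categories, 0.03, rng)
noisy_10 = inject_label_noise(clean, categories, 0.10, rng)
changed_10 = sum(a != b for a, b in zip(clean, noisy_10))
```

After corrupting the sparse labels, the weak and propagated labels would be regenerated from them, which is how the noise reaches the dense supervision.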
References
1. Alnaggar, Y.A., Afifi, M., Amer, K., ElHelw, M.: Multi projection fusion for
real-time semantic segmentation of 3d lidar point clouds. In: Proceedings of the
IEEE/CVF Winter Conference on Applications of Computer Vision. pp. 1800–1809
(2021) 3, 10, 11
2. Alonso, I., Riazuelo, L., Montesano, L., Murillo, A.C.: 3d-mininet: Learning a 2d
representation from point clouds for fast and efficient 3d lidar semantic segmentation.
IEEE Robotics and Automation Letters 5(4), 5432–5439 (2020) 3
3. Armeni, I., Sax, S., Zamir, A.R., Savarese, S.: Joint 2d-3d-semantic data for indoor
scene understanding. arXiv preprint arXiv:1702.01105 (2017) 2, 4
4. Behley, J., Garbade, M., Milioto, A., Quenzel, J., Behnke, S., Stachniss, C., Gall,
J.: Semantickitti: A dataset for semantic scene understanding of lidar sequences.
In: Proceedings of the IEEE International Conference on Computer Vision (ICCV).
pp. 9297–9307 (2019) 1, 2, 3, 4, 5, 8, 10, 11, 12, 15
5. Caesar, H., Bankiti, V., Lang, A.H., Vora, S., Liong, V.E., Xu, Q., Krishnan, A.,
Pan, Y., Baldan, G., Beijbom, O.: nuscenes: A multimodal dataset for autonomous
driving. In: Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition (CVPR). pp. 11621–11631 (2020) 1, 2, 3, 4, 10, 11, 12, 15, 16
6. Chang, A.X., Funkhouser, T., Guibas, L., Hanrahan, P., Huang, Q., Li, Z.,
Savarese, S., Savva, M., Song, S., Su, H., et al.: Shapenet: An information-rich
3d model repository. arXiv preprint arXiv:1512.03012 (2015) 2, 4
7. Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for con-
trastive learning of visual representations. In: Proceedings of the International
Conference on Machine Learning (ICML). pp. 1597–1607. PMLR (2020) 8
8. Cheng, M., Hui, L., Xie, J., Yang, J., Kong, H.: Cascaded non-local neural network
for point cloud semantic segmentation. In: 2020 IEEE/RSJ International Confer-
ence on Intelligent Robots and Systems (IROS). pp. 8447–8452. IEEE (2020) 4
9. Cheng, R., Razani, R., Taghavi, E., Li, E., Liu, B.: Af2-s3net: Attentive feature
fusion with adaptive feature selection for sparse semantic segmentation network.
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition (CVPR). pp. 12547–12556 (2021) 1, 4, 12, 16
10. Cortinhal, T., Tzelepis, G., Aksoy, E.E.: Salsanext: Fast, uncertainty-aware semantic
segmentation of lidar point clouds. In: International Symposium on Visual
Computing. pp. 207–222. Springer (2020) 3
11. Dai, A., Chang, A.X., Savva, M., Halber, M., Funkhouser, T., Nießner, M.: Scan-
net: Richly-annotated 3d reconstructions of indoor scenes. In: Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
pp. 5828–5839 (2017) 2, 4
12. Duerr, F., Pfaller, M., Weigel, H., Beyerer, J.: Lidar-based recurrent 3d semantic
segmentation with temporal memory alignment. In: Proceedings of the Interna-
tional Conference on 3D Vision (3DV). pp. 781–790. IEEE (2020) 3, 10, 11
13. Elsayed, G.F., Krishnan, D., Mobahi, H., Regan, K., Bengio, S.: Large margin deep
networks for classification. In: Advances in Neural Information Processing Systems
(NeurIPS) (2018) 9
14. Fang, Y., Xu, C., Cui, Z., Zong, Y., Yang, J.: Spatial transformer point convolution.
arXiv preprint arXiv:2009.01427 (2020) 4
15. Fischler, M.A., Bolles, R.C.: Random sample consensus: a paradigm for model
fitting with applications to image analysis and automated cartography. Communications
of the ACM 24(6), 381–395 (1981) 7
16. Gan, L., Zhang, R., Grizzle, J.W., Eustice, R.M., Ghaffari, M.: Bayesian spatial
kernel smoothing for scalable dense semantic mapping. IEEE Robotics and Au-
tomation Letters 5(2), 790–797 (2020) 10, 11
17. Gao, B., Pan, Y., Li, C., Geng, S., Zhao, H.: Are we hungry for 3d lidar data for
semantic segmentation? ArXiv abs/2006.04307 3, 20 (2020) 4
18. Gao, Y., Fei, N., Liu, G., Lu, Z., Xiang, T., Huang, S.: Contrastive proto-
type learning with augmented embeddings for few-shot learning. arXiv preprint
arXiv:2101.09499 (2021) 3, 9
19. Gerdzhev, M., Razani, R., Taghavi, E., Bingbing, L.: Tornado-net: multiview total
variation semantic segmentation with diamond inception module. In: Proceedings
of the IEEE International Conference on Robotics and Automation (ICRA). pp.
9543–9549. IEEE (2021) 3
20. Guinard, S., Landrieu, L.: Weakly supervised segmentation-aided classification of
urban scenes from 3d lidar point clouds. In: ISPRS Workshop 2017 (2017) 2, 4, 6, 7
21. He, K., Fan, H., Wu, Y., Xie, S., Girshick, R.: Momentum contrast for unsupervised
visual representation learning. In: Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition (CVPR). pp. 9729–9738 (2020) 8, 9
22. Hinton, G., Vinyals, O., Dean, J.: Distilling the knowledge in a neural network. In:
Advances in Neural Information Processing Systems (NeurIPS) (2015) 10
23. Hou, J., Graham, B., Nießner, M., Xie, S.: Exploring data-efficient 3d scene
understanding with contrastive scene contexts. In: Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition (CVPR). pp. 15587–15597
(2021) 2, 4, 10, 11, 12, 13, 14, 16, 17
24. Hu, Q., Yang, B., Fang, G., Guo, Y., Leonardis, A., Trigoni, N., Markham, A.:
Sqn: Weakly-supervised semantic segmentation of large-scale 3d point clouds with
1000x fewer labels. arXiv preprint arXiv:2104.04891 (2021) 2, 10, 11
25. Hu, Q., Yang, B., Xie, L., Rosa, S., Guo, Y., Wang, Z., Trigoni, N., Markham, A.:
Randla-net: Efficient semantic segmentation of large-scale point clouds. In:
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition (CVPR). pp. 11108–11117 (2020) 4
26. Khosla, P., Teterwak, P., Wang, C., Sarna, A., Tian, Y., Isola, P., Maschinot, A.,
Liu, C., Krishnan, D.: Supervised contrastive learning. In: Advances in Neural
Information Processing Systems (NeurIPS) (2020) 9
27. Kochanov, D., Nejadasl, F.K., Booij, O.: Kprnet: Improving projection-based lidar
semantic segmentation. arXiv preprint arXiv:2007.12668 (2020) 3, 10, 11
28. Landrieu, L., Simonovsky, M.: Large-scale point cloud semantic segmentation with
superpoint graphs. In: Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition (CVPR). pp. 4558–4567 (2018) 13, 14, 17
29. Li, J., Zhou, P., Xiong, C., Hoi, S.C.: Prototypical contrastive learning of unsupervised
representations. In: Proceedings of the International Conference on Learning
Representations (ICLR) (2020) 3, 9
30. Li, S., Chen, X., Liu, Y., Dai, D., Stachniss, C., Gall, J.: Multi-scale interaction
for real-time lidar data segmentation on an embedded platform. arXiv preprint
arXiv:2008.09162 (2020) 3
31. Liong, V.E., Nguyen, T.N.T., Widjaja, S., Sharma, D., Chong, Z.J.: Amvnet:
Assertion-based multi-view fusion network for lidar semantic segmentation. arXiv
preprint arXiv:2012.04934 (2020) 3, 10, 11, 12, 16
32. Liu, W., Wen, Y., Yu, Z., Yang, M.: Large-margin softmax loss for convolutional
neural networks. In: Proceedings of the International Conference on Machine
Learning (ICML). vol. 2, p. 7 (2016) 9
33. Liu, Z., Qi, X., Fu, C.W.: One thing one click: A self-training approach for weakly
supervised 3d semantic segmentation. In: Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition (CVPR). pp. 1726–1736 (2021)
2, 3, 4, 6, 7, 9, 10, 11, 13, 17
34. Luo, H., Wang, C., Wen, C., Chen, Z., Zai, D., Yu, Y., Li, J.: Semantic label-
ing of mobile lidar point clouds via active learning and higher order mrf. IEEE
Transactions on Geoscience and Remote Sensing 56(7), 3631–3644 (2018) 2, 4
35. Mahajan, D., Girshick, R., Ramanathan, V., He, K., Paluri, M., Li, Y., Bharambe,
A., Van Der Maaten, L.: Exploring the limits of weakly supervised pretraining.
In: Proceedings of the European Conference on Computer Vision (ECCV). pp.
181–196 (2018) 8
36. Mei, J., Gao, B., Xu, D., Yao, W., Zhao, X., Zhao, H.: Semantic segmentation of
3d lidar data in dynamic scene using semi-supervised learning. IEEE Transactions
on Intelligent Transportation Systems 21(6), 2496–2509 (2019) 4
37. Mei, J., Zhao, H.: Incorporating human domain knowledge in 3-d lidar-based se-
mantic segmentation. IEEE Transactions on Intelligent Vehicles 5(2), 178–187
(2019) 4
38. Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations
of words and phrases and their compositionality. In: Advances in Neural
Information Processing Systems (NeurIPS). pp. 3111–3119 (2013) 8
39. Milioto, A., Vizzo, I., Behley, J., Stachniss, C.: Rangenet++: Fast and accurate
lidar semantic segmentation. In: 2019 IEEE/RSJ International Conference on In-
telligent Robots and Systems (IROS). pp. 4213–4220. IEEE (2019) 3, 16
40. Oord, A.v.d., Li, Y., Vinyals, O.: Representation learning with contrastive predictive
coding. arXiv preprint arXiv:1807.03748 (2018) 8
41. Pathak, D., Shelhamer, E., Long, J., Darrell, T.: Fully convolutional multi-class
multiple instance learning. arXiv preprint arXiv:1412.7144 (2014) 8
42. Pinheiro, P.O., Collobert, R.: From image-level to pixel-level labeling with con-
volutional networks. In: Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition (CVPR). pp. 1713–1721 (2015) 8
43. Razani, R., Cheng, R., Taghavi, E., Bingbing, L.: Lite-hdseg: Lidar semantic segmentation
using lite harmonic dense convolutions. arXiv preprint arXiv:2103.08852
(2021) 3
44. Ren, Z., Misra, I., Schwing, A.G., Girdhar, R.: 3d spatial recognition without
spatially labeled 3d. In: Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition (CVPR). pp. 13204–13213 (2021) 2, 4, 8
45. Rist, C.B., Schmidt, D., Enzweiler, M., Gavrila, D.M.: Scssnet: Learning
spatially-conditioned scene segmentation on lidar point clouds. In: 2020 IEEE Intelligent
Vehicles Symposium (IV). pp. 1086–1093. IEEE (2020) 3
46. Schroff, F., Kalenichenko, D., Philbin, J.: Facenet: A unified embedding for face
recognition and clustering. In: Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition (CVPR). pp. 815–823 (2015) 9
47. Shi, X., Xu, X., Chen, K., Cai, L., Foo, C.S., Jia, K.: Label-efficient point
cloud semantic segmentation: An active learning approach. arXiv preprint
arXiv:2101.06931 (2021) 2, 4, 6, 7
48. Snell,J.,Swersky,K.,Zemel,R.S.:Prototypicalnetworksforfew-shotlearning.In:
Advances in Neural Information Processing Systems (NeurIPS) (2017) 3, 9
49. Sohn, K.: Improved deep metric learning with multi-class n-pair loss objective.
In: Advances in Neural Information Processing Systems (NeurIPS). pp. 1857–1865
(2016) 9
50. Tang, H., Liu, Z., Zhao, S., Lin, Y., Lin, J., Wang, H., Han, S.: Searching efficient 3d
architectures with sparse point-voxel convolution. In: Proceedings of the European
Conference on Computer Vision (ECCV). pp. 685–702. Springer (2020) 1, 4, 10,
11, 12, 16
51. Thomas, H., Agro, B., Gridseth, M., Zhang, J., Barfoot, T.D.: Self-supervised
learning of lidar segmentation for autonomous indoor navigation. In: Proceedings
of the IEEE International Conference on Robotics and Automation (ICRA). pp.
14047–14053. IEEE (2021) 4
52. Thomas, H., Qi, C.R., Deschaud, J.E., Marcotegui, B., Goulette, F., Guibas, L.J.:
Kpconv: Flexible and deformable convolution for point clouds. In: Proceedings of
the IEEE International Conference on Computer Vision (ICCV). pp. 6411–6420
(2019) 4
53. Wang, H., Rong, X., Yang, L., Feng, J., Xiao, J., Tian, Y.: Weakly supervised
semantic segmentation in 3d graph-structured point clouds of wild scenes. arXiv
preprint arXiv:2004.12498 (2020) 2, 4
54. Wei, J., Lin, G., Yap, K.H., Hung, T.Y., Xie, L.: Multi-path region mining for
weakly supervised 3d semantic segmentation on point clouds. In: Proceedings of
the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
pp. 4384–4393 (2020) 2, 4, 8
55. Wu, T.H., Liu, Y.C., Huang, Y.K., Lee, H.Y., Su, H.T., Huang, P.C., Hsu, W.H.:
Redal: Region-based and diversity-aware active learning for point cloud semantic
segmentation. In: Proceedings of the IEEE International Conference on Computer
Vision (ICCV). pp. 15510–15519 (2021) 2, 4, 10, 11
56. Xiao, A., Huang, J., Guan, D., Zhan, F., Lu, S.: Synlidar: Learning from syn-
thetic lidar sequential point cloud for semantic segmentation. arXiv preprint
arXiv:2107.05399 (2021) 4
57. Xie, S., Gu, J., Guo, D., Qi, C.R., Guibas, L., Litany, O.: Pointcontrast: Un-
supervised pre-training for 3d point cloud understanding. In: Proceedings of the
European Conference on Computer Vision (ECCV). pp. 574–591. Springer (2020)
4
58. Xu, C., Wu, B., Wang, Z., Zhan, W., Vajda, P., Keutzer, K., Tomizuka, M.:
Squeezesegv3: Spatially-adaptive convolution for efficient point-cloud segmenta-
tion. In: Proceedings of the European Conference on Computer Vision (ECCV).
pp. 1–19. Springer (2020) 3, 10, 11
59. Xu, J., Zhang, R., Dou, J., Zhu, Y., Sun, J., Pu, S.: Rpvnet: A deep and efficient
range-point-voxel fusion network for lidar point cloud segmentation. arXiv preprint
arXiv:2103.12978 (2021) 1, 4, 12, 16
60. Xu, K., Yao, Y., Murasaki, K., Ando, S., Sagata, A.: Semantic segmentation of
sparsely annotated 3d point clouds by pseudo-labelling. In: Proceedings of the
International Conference on 3D Vision (3DV). pp. 463–471. IEEE (2019) 2, 4
61. Xu, X., Lee, G.H.: Weakly supervised semantic point cloud segmentation: Towards
10x fewer labels. In: Proceedings of the IEEE/CVF Conference on Computer Vision
and Pattern Recognition (CVPR). pp. 13706–13715 (2020) 2, 4
62. Yan, X., Gao, J., Li, J., Zhang, R., Li, Z., Huang, R., Cui, S.: Sparse single sweep
lidar point cloud segmentation via learning contextual shape priors from scene
completion. In: Proceedings of the AAAI Conference on Artificial Intelligence (AAAI)
(2020) 1, 4
63. Yang, H.M., Zhang, X.Y., Yin, F., Liu, C.L.: Robust classification with convolu-
tional prototype learning. In: Proceedings of the IEEE/CVF Conference on Com-
puter Vision and Pattern Recognition (CVPR). pp. 3474–3482 (2018) 3, 9
64. Zhang, F., Fang, J., Wah, B., Torr, P.: Deep fusionnet for point cloud semantic
segmentation. In: Proceedings of the European Conference on Computer Vision
(ECCV). pp. 644–663. Springer (2020) 4
65. Zhang, Y., Zhou, Z., David, P., Yue, X., Xi, Z., Gong, B., Foroosh, H.: Polarnet: An
improved grid representation for online lidar point clouds semantic segmentation.
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition (CVPR). pp. 9601–9610 (2020) 3, 10, 11, 16
66. Zhao, N., Chua, T.S., Lee, G.H.: Few-shot 3d point cloud semantic segmentation.
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition (CVPR). pp. 8873–8882 (2021) 2, 4
67. Zhou, Y., Tuzel, O.: Voxelnet: End-to-end learning for point cloud based 3d object
detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition (CVPR) (June 2018) 9
68. Zhu, X., Zhou, H., Wang, T., Hong, F., Ma, Y., Li, W., Li, H., Lin, D.: Cylindrical
and asymmetrical 3d convolution networks for lidar segmentation. In: Proceed-
ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
(CVPR). pp. 9939–9948 (2021) 1, 2, 4, 5, 10, 11, 12, 13, 14, 16
69. Zou, Y., Weinacker, H., Koch, B.: Towards urban scene semantic segmentation
with deep learning from lidar point clouds: A case study in baden-württemberg,
germany. Remote Sensing 13(16), 3220 (2021) 8