RIDDLE: Lidar Data Compression with Range Image Deep Delta Encoding

Xuanyu Zhou*  Charles R. Qi*  Yin Zhou  Dragomir Anguelov
Waymo LLC
*equal contribution

Abstract

Lidars are depth measuring sensors widely used in autonomous driving and augmented reality. However, the large volume of data produced by lidars can lead to high costs in data storage and transmission. While lidar data can be represented as two interchangeable representations, 3D point clouds and range images, most previous works focus on compressing the generic 3D point clouds. In this work, we show that directly compressing the range images can leverage the lidar scanning pattern, compared to compressing the unprojected point clouds. We propose a novel data-driven range image compression algorithm, named RIDDLE (Range Image Deep DeLta Encoding). At its core is a deep model that predicts the next pixel value in a raster scanning order, based on contextual laser shots from both the current and past scans (represented as a 4D point cloud of spherical coordinates and time). The deltas between predictions and original values can then be compressed by entropy encoding. Evaluated on the Waymo Open Dataset and KITTI, our method demonstrates significant improvement in the compression rate (under the same distortion) compared to widely used point cloud and range image compression algorithms as well as recent deep methods.

1. Introduction

Lidar (or LiDAR, short for light detection and ranging) sensors are commonly used in applications that require 3D scene understanding such as autonomous driving and augmented reality. However, with the growing resolution of lidars, storing and transmitting large volumes of sequential lidar data become a challenge. There is a strong need to develop effective algorithms for lidar data compression.

While the measurements of a lidar scan are often used as a 3D point cloud, the raw lidar data can be represented in a more structured format: a range image, where each pixel corresponds to a laser shot, each row represents shots from the same laser, and each column represents shots at a specific azimuth rotation angle. Given the lidar scanning mechanism (directions of the lasers) and sensor poses (6D poses in the global coordinate at the timestamp of every shot), a range image and its corresponding point cloud can be converted interchangeably and losslessly. By organizing the points in a range image, instead of storing the three-dimensional coordinates of the points, we can just store one-dimensional ranges (around a 3x saving in storage). Given this observation, in contrast to previous works that focus on compressing 3D point clouds [9,17,25], we propose to directly compress range images to leverage the lidar scanning patterns.

As range images are in the image format, naturally we can apply existing compression methods for optical images (RGB or grayscale); however, those methods have their limitations. For example, the PNG format is often used to compress depth images in indoor datasets [4,11,27], where the depth values are normalized, quantized to 16-bit integers and compressed losslessly. While PNG also applies to compressing lidar range images, it is not data-driven and does not use temporal information. There are also attempts to use auto-encoder networks [33] to lossily compress range images by storing the bottleneck layer output. However, as range values often have a much wider distribution than RGB colors, it is challenging to learn an accurate reconstruction, especially at the object boundaries.

In this work, we propose RIDDLE (Range Image Deep DeLta Encoding), a data-driven algorithm to compress range images with predictive neural networks (Fig. 2). Our method is inspired by the use of delta encoding in PNG image compression. However, instead of simply computing a difference between close-by pixels, we adopt a deep model to predict the pixel value from context pixels. The deep model takes a local patch of the decoded range image and predicts the attributes of the next pixel in a raster-scanning order (a process similar to the sequential image decoder PixelCNN [35]).
We can then entropy encode the residuals between the predicted values and the original values to achieve lossless compression under a chosen quantization rate. In this scheme, the more accurate the prediction is, the smaller the entropy of the residuals is: improving the compression rate is equivalent to developing a more accurate predictive model.

What is unique in our model design is that we represent local image patches as point clouds in the spherical coordinates (with azimuth, elevation and range values) to reflect the non-uniform ray angles of each shot (or pixel), which lifts the 2D pixels to 3D point clouds. By further lifting the 3D points to 4D with a timestamp channel, we can unify the way we represent context pixels/points from both the current and history scans. Since our model directly takes in point clouds, neither interpolation (to the image grid) nor image cropping (projected points from history frames may span different image regions) is needed. On the other hand, as to the model output formulation, instead of directly regressing the pixel values (whose distribution is often multi-modal), we treat each pixel in the input patch as an anchor and predict a confidence score as well as a residual value per anchor.

Evaluated on the large-scale Waymo Open Dataset (WOD) [28], we show that our method reduces the bitrate by more than 65% for the same distortion (measured using the point-to-point Chamfer distance), or reduces the distortion by more than 85% for the same bitrate, compared to the MPEG standard compression method G-PCC [14], while also significantly outperforming other baselines like Draco [1] and PNG. On the KITTI dataset [13], we compare with prior art deep compression methods (using octrees) and show our method has a clear advantage over them, thanks to its use of the range image representation and the accurate prediction model. We also evaluate the impact of compression on downstream perception tasks such as 3D object detection and provide extensive ablation studies to validate our design choices.
2. Related Work

Point cloud compression. As 3D applications rise, recent years have seen an increasing number of algorithms proposed for point cloud compression. One family of methods uses octrees to represent and compress quantized point clouds [10,12,26]. The Motion Picture Experts Group (MPEG) has released a related point cloud compression (PCC) standard, called geometry-based PCC (G-PCC) [14], using the octree structure and various ways to predict the next-level content. More recently, OctSqueeze [17] was proposed to use a neural network as a conditional entropy model to estimate the octree occupancy symbols, and MuSCLE [9] extends it by including temporal priors from previous frames. VoxelContextNet [25] further leverages the voxel context for the octree structure prediction. These neural network-based methods consistently show improvements over G-PCC, which uses hand-crafted entropy models. While the octree-based methods are flexible enough to model arbitrary point clouds (from either a lidar sensor or multi-view reconstruction), they do not make use of the point distribution patterns in lidar range images.

As a lidar point cloud can be represented as a range image, image-based compression methods can be adapted for its compression. For example, [3,7,16] applied traditional image compression methods such as JPEG, PNG and TIFF to compress the range images. A sequence of range images can be seen as a video, and video-based compression methods like H.264 have been applied to compress lidar sequences [22]. MPEG also proposed a PCC standard (V-PCC) that compresses dynamic point clouds via an HEVC video codec [14]. Our work extends these approaches to leverage deep models and delta encoding to compress range images.

Auto-encoders have been used to achieve lossy compression of point clouds. [36,37] proposed to train an encoder-decoder point cloud reconstruction network and entropy encode the bottleneck layer as the compressed data. Similarly, [33] trained an auto-encoder to reconstruct range images and compress the bottleneck vectors. While these methods may achieve high compression rates, the reconstructed point clouds can have strong artifacts, especially at the object boundaries, resulting in unbounded errors in the lossy compression scheme.

Learned image and video compression. Image and video compression are well-studied fields with many standards (for example: PNG, JPEG and TIFF for images; H.264 and HEVC for videos). Among them, PNG is highly related to our work as it performs lossless image compression using delta encoding. With the popularity of deep convolutional neural networks for image understanding, deep model-based image and video compression have also been widely explored [5,6,20,21,31,32]. Many of them leverage an encoder-decoder neural network (for example, a variational auto-encoder [5]) for compressing (encoding the image to a latent vector) and decompressing (decoding/generating the image from the vector). For the decoding architectures, sequential models such as PixelCNN [23] and PixelRNN [35] inspired our predictive model design.
3. Problem Formulation

For most lidar sensors, one scan can be interchangeably represented as either a point cloud P ∈ R^{N×C} or a range image I ∈ R^{H×W×C}, where N is the number of points, H and W are the height and width of the range image (H is the number of laser beams in the lidar and W is the number of shots per laser per frame), and C is the feature dimension for each point. Each valid pixel in the range image represents a laser shot corresponding to one point in the point cloud. The channels include the range value and other attributes such as reflection intensity. The conversion rule between a point cloud and a range image depends on the laser scanning mechanism (the laser shot azimuth and elevation angles) as well as the sensor poses (the 6D pose of the laser sensor at the time of each laser shot), as illustrated in Fig. 1.

Figure 1. Illustration of laser shots. Left: a single laser shot. Right: laser shots across time (in a bird's eye view): four consecutive laser shots (with delta azimuth angle ω) that measure the ranges from the (moving) sensor to the object. To convert the range values to a point cloud, we need to know the ranges, the shot angles, as well as the sensor poses at each shot.

Specifically, in a range image I, given a pixel location (i,j) (which maps to a specific laser shot angle) and its range value, we get a laser measurement (r, θ, α), where r is the range value and θ (azimuth or yaw) and α (elevation or pitch) are the shot angles relative to the lidar sensor coordinate. The measurement can be converted to a point p in the sensor coordinate by:

    p = (x, y, z) = (r cos α cos θ, r cos α sin θ, r sin α)   (1)

As the sensor pose [R|t] (rotation and translation in the global coordinate) can be different at the time of each laser shot (Fig. 1), to aggregate the shots into a point cloud we need to convert the points to a shared global coordinate system, giving the point set P = {R_i p_i + t_i}, i = 1, ..., N, where i is the index of the laser shot in a scan/range image.

Reversely, given the point cloud P of a scan (in the global coordinate), to convert it to the range image we first need to transform each point to the sensor coordinate corresponding to its time of the shot. Then, we can easily get (r, θ, α) by the reverse process of Eq. 1, which then maps back to the row and column indices.
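To make the conversion concrete, below is a minimal NumPy sketch of Eq. 1 and the per-shot pose transform. The function names and array layout (one pose per laser shot) are our own illustration, not code from the paper.

```python
import numpy as np

def range_image_to_points(ranges, azimuths, elevations, rotations, translations):
    """Convert per-pixel laser measurements to a global-frame point cloud.

    ranges:       (N,) range r of each valid pixel (laser shot)
    azimuths:     (N,) shot azimuth theta from lidar calibration
    elevations:   (N,) shot elevation alpha from lidar calibration
    rotations:    (N, 3, 3) per-shot sensor rotation R_i
    translations: (N, 3) per-shot sensor translation t_i
    """
    # Eq. 1: spherical -> Cartesian in the sensor coordinate.
    x = ranges * np.cos(elevations) * np.cos(azimuths)
    y = ranges * np.cos(elevations) * np.sin(azimuths)
    z = ranges * np.sin(elevations)
    p_sensor = np.stack([x, y, z], axis=-1)              # (N, 3)

    # Per-shot pose: P = {R_i p_i + t_i}, one pose per laser shot.
    p_global = np.einsum('nij,nj->ni', rotations, p_sensor) + translations
    return p_global

def points_to_measurements(p_global, rotations, translations):
    """Reverse of Eq. 1: recover (r, theta, alpha) for each shot."""
    # Back to the sensor coordinate at each shot's timestamp: R^T (p - t).
    p_sensor = np.einsum('nji,nj->ni', rotations, p_global - translations)
    r = np.linalg.norm(p_sensor, axis=-1)
    theta = np.arctan2(p_sensor[:, 1], p_sensor[:, 0])
    alpha = np.arcsin(p_sensor[:, 2] / np.maximum(r, 1e-9))
    return r, theta, alpha
```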
For our lidar range image compression, we first quantize the range image I by rounding its pixel values to a predetermined quantization precision. Then our goal is to compress the quantized range image I′ to a bitstream b ∈ [0,1]^n (with an n as small as possible), which can later be decompressed into the exact quantized range image I′. The compression is lossy with respect to the raw range image but lossless with respect to the quantized range image.

Note that for calibrated lidars such as the ones used in the Waymo Open Dataset [28], each pixel in the range image corresponds to a fixed shot angle (θ, α) for the same lidar, so the angles do not need to be stored for the compression.¹ Besides, as sensor poses are often stored separately from range images and are shared with other modules (such as localization), we do not need to store sensor poses either. Only the range image needs to be compressed.

¹ For the main lidars used in WOD, pixel elevations are determined by the laser beam inclinations (64 numbers) and azimuths can be calculated based on uniform azimuth rotation. For other lidars such as the Velodyne HDL-64, azimuth rotation angles are not uniform and need to be stored (one number for each column, costing only ~0.1Kb per frame) [34].

4. Range Image Deep Delta Encoding

We first describe our overall compression pipeline in Sec. 4.1, then dive into the design of our prediction model in Sec. 4.2, and finally describe how we entropy encode the residuals in Sec. 4.3.

4.1. Pipeline Overview

Figure 2. The deep delta encoding pipeline for lidar range image compression. Given a lidar range image, we first quantize the attribute values and then run inference of the predictive model on the quantized range image to derive residuals. Finally, we use entropy encoders to compress the residuals to a bitstream.

As shown in Fig. 2, the input to our compression pipeline is a raw range image. First, we quantize the range image with a certain quantization precision (this allows us to store the deltas as discrete symbols). Next, the core part of the pipeline is the deep delta encoding. We train a deep model to predict the next pixel value in a raster scanning order. We then save the delta between the prediction (quantized) and the original (quantized) pixel value instead of saving the original pixel value. As the deltas are smaller and more concentrated in distribution than the original pixel values, they can be compressed more effectively. At the last step, the deltas (or the residual map) are entropy encoded into a compressed bitstream.

4.2. Deep Delta Encoding

Commonly used delta encoding adopts a linear prediction model to estimate the pixel values. In its simplest form, to predict a pixel I_{i,j} at the i-th row and j-th column, its left pixel I_{i,j−1} is used as the prediction. Other linear filters of left, up and nearby pixels can also be used. The delta between the prediction and the original pixel value is stored to be compressed. In our work, we propose to train a deep neural network to predict the pixel values and show that it can achieve significant improvement in prediction accuracy and compression rate. Next, we first introduce our model in its intra-prediction format (only using the information from the current frame/scan for the prediction) and then describe how we extend it to take temporal input from history scans. Please see the supplementary for more details on the model architecture, the losses and the training process.
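For reference, here is a minimal sketch of the classic linear delta encoding just described (the left-pixel and linear-interpolation predictors match the baselines ablated later in Table 1); the function names are illustrative only.

```python
import numpy as np

def delta_encode(img, predictor):
    """Raster-scan delta encoding: store prediction residuals instead of values.

    img: (H, W) quantized range image (integer symbols after quantization).
    predictor: function (img, i, j) -> prediction from already-coded pixels.
    """
    residuals = np.zeros_like(img)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            residuals[i, j] = img[i, j] - predictor(img, i, j)
    return residuals  # concentrated around zero -> cheap to entropy encode

def left_pixel(img, i, j):
    # Simplest linear predictor: the previously coded left neighbor.
    return img[i, j - 1] if j > 0 else 0

def linear_interp(img, i, j):
    # I[i,j-1] + I[i-1,j] - I[i-1,j-1], as in the linear interpolation baseline.
    if i == 0 or j == 0:
        return left_pixel(img, i, j)
    return img[i, j - 1] + img[i - 1, j] - img[i - 1, j - 1]
```

Decoding inverts the process in the same raster order: img[i, j] = predictor(img, i, j) + residuals[i, j], so only the residual map needs to be transmitted.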
Intra-frame Prediction Model. Formally, the network models the conditional probability of the k-th pixel value (in the raster scanning order) conditioned on the quantized pixel values before k: p(I_k; Θ) = p(I_k | I′_{k−1}, ..., I′_1; Θ), where Θ are the network weights, I′ is the quantized range image and I is the unquantized raw range image. Empirically, as shown in Fig. 3, instead of using the entire past context (e.g. with an RNN model), we can use a local image patch of shape h×w as the context to predict the bottom right pixel of the patch, similar to the idea of the sequential image decoder PixelCNN [23].

Although the input to our network is an image patch, it is quite different from a typical RGB one. The relations of the range image pixels depend on the location of the patch and even the calibration of a specific lidar, because the laser shot angles are often non-uniformly distributed. This is even more prominent in the inter-frame prediction, when we re-project the points from history scans to the coordinate of the current shot. Therefore, we augment the range image with two extra channels: the delta azimuth and delta elevation angles relative to the angles of the to-be-predicted pixel, which lifts the 2D pixels to the 3D spherical coordinate. Furthermore, as range prediction is a geometry estimation problem, we found empirically that using a 3D deep learning model such as PointNet [24] leads to more accurate prediction than using a 2D convolutional network.

As shown in Fig. 3, given the lidar calibration data, we first convert the range image patch to a mini point cloud (with maximally hw − 1 points). Instead of directly regressing the pixel range value, which suffers from the uncertainty caused by the multi-modal distribution of attributes (especially on the object boundaries), we formulate the prediction as an anchor-based classification and anchor-residual regression problem, where the valid pixels in the range image patch are the anchors. The deep network predicts which pixel is the closest in value to the bottom right pixel and regresses a residual (an overloaded word here; it is different from the residual map in delta encoding) with respect to each anchor pixel.

Temporal Model. The temporal model extends the intra-frame prediction model by leveraging contexts from both the current scan and the past scan. The point cloud representation (compared to the 2D pixel representation) enables us to unify the input from the past and current scans, as we can represent all laser shots in the 4D (spherical plus time) coordinates.

Given the current scan (quantized) range image I′_T and the past scan range image I′_{T−1}, assume we want to predict the range value of pixel (i,j) in the current scan (the k-th pixel in the raster scanning order). A naive baseline approach to use temporal data is to take the same neighborhood as in I′_{T−1} (in terms of pixel rows and columns) from the last scan and concatenate it with the current frame image patch. However, this approach does not take the ego-motion of the lidar sensor into account. As the lidar moves over time, the range image patch with the same rows and columns can correspond to a vastly different physical space.

To take sensor poses into consideration, instead of querying pixels of the last frame using the row and column indices, we should query neighbors using 3D points in the global coordinate (Fig. 3). However, as we do not know the ground truth range value for the pixel (i,j), we have to approximate the query by using a predicted range (e.g. using the left pixel range or the predicted value from the intra-frame model). Given pixel (i,j)'s laser shot angle (θ, α) and its estimated range r̂, we get a point in the global coordinate, following Sec. 3. Then, given the points from the last frame in the global coordinate, we can directly query neighbors in the 3D space (using KD trees to accelerate the query). Those neighboring points from the last frame can then be projected to laser shot (i,j)'s spherical coordinate (to the points in the sensor coordinate at the time of the laser shot and then transformed to the spherical coordinate), to obtain extra points as temporal contexts.² This is equivalent to assuming the points from the last frame are static, and we re-scan the scene at the sensor location at the time of the laser shot (i,j). To distinguish the points from the last and current frames, we augment the points with an extra time channel (with 1 indicating the last frame and 0 indicating the current frame).

Note that the reprojected points from the last frame do not directly correspond to the rows and columns of the current frame range image. Considering such input as a point cloud is convenient, as we do not require any interpolation (to turn the points into the image grid) or any predefined neighborhood size for image cropping.

² Strictly, even the pixels/points from the current frame need to be reprojected to the sensor coordinate at the time of the shot (i,j). We have this reprojection in our intra-frame model, but the impact is small as the sensor moves little between a few pixels.

Inference. At inference time (for compression), we start from the top left patch of the range image to predict pixel I′_1 (i.e., I′_{1,1}) and store the residual. This process continues in a raster scanning order to predict pixels I_{1,2}, ..., I_{1,W}, I_{2,1}, ..., I_{i,j}, ..., I_{H,W}. The residual map (deltas between the prediction and quantized values) of size H×W is then compressed by the entropy encoder. At decompression time, we run the prediction model in the same raster-scanning order: it takes as input the already reconstructed pixels {I′_1, ..., I′_{k−1}}, predicts the next pixel value Î_k, and then reconstructs the pixel from the saved residual as I′_k = Î_k + δ_k, where δ_k is the stored delta of pixel k = (i−1)W + j. This process can be parallelized by dividing the input range image into blocks and running the inference in parallel for each block (discussed in the supplementary).
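A compact sketch of the matching decoder loop under this scheme (sequential here for clarity; the paper's block-parallel variant is described in the supplementary). `predict_next` stands in for the deep model and is a hypothetical name.

```python
import numpy as np

def decompress(residuals, predict_next, H, W, patch=(10, 10)):
    """Reconstruct the quantized range image in raster-scan order.

    residuals:    (H, W) delta map recovered by the entropy decoder.
    predict_next: wrapper mapping a context patch of already reconstructed
                  pixels (bottom-right target masked) to a predicted value.
    """
    rec = np.zeros((H, W), dtype=residuals.dtype)
    h, w = patch
    for i in range(H):
        for j in range(W):
            # Context window whose bottom-right pixel is the target (i, j);
            # only already reconstructed pixels carry information.
            ctx = rec[max(0, i - h + 1):i + 1, max(0, j - w + 1):j + 1]
            pred = predict_next(ctx, i, j)
            # Lossless w.r.t. the quantized image: prediction + stored delta.
            rec[i, j] = pred + residuals[i, j]
    return rec
```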
4.3. Entropy Encoding

After the predictive delta encoding, we get a residual map/array of the range image. An entropy encoder is used to leverage the sparsity pattern in the residual map to compress it. Given an accurate prediction model, most of the residuals would be zero. We adopt two methods to entropy encode the residuals. In practice, we select the entropy encoder with the highest compression rate depending on the quantization rate and the predictor.

The first method is to represent the residuals using a sparse representation, with the values of the nonzero residuals and their indices in the array, which can then be arithmetically encoded to further reduce its size. The second method is to represent the residuals using run-length encoding, which achieves better compression rates when the residuals are not very sparse, i.e., when the quantization step is small. After obtaining the run-length representation, we use the LZMA compressor to further reduce its size.

5. Experiments

In this section, we first introduce the datasets and the metrics in Sec. 5.1. Then we report compression results compared with strong baselines and prior art methods in Sec. 5.2, both quantitatively and qualitatively. We further evaluate the impact of compressed data on downstream perception tasks (3D detection of vehicles and pedestrians) in Sec. 5.3. Finally, we provide extensive analysis experiments to validate our design choices in Sec. 5.4.

5.1. Dataset and Metrics

Waymo Open Dataset (WOD) [28]. WOD is the main dataset we experiment with, as it provides rich lidar calibration data and full sensor poses. WOD includes a total of 1,150 sequences, with 798 for training and 202 for validation. Each sequence lasts around 20 seconds with a sampling frequency of 10Hz. A 64-beam lidar is used, providing range images of 64 rows and 2,650 columns, with provided lidar calibration metadata (beam inclination angles). The range channel is cropped to 75m, and each raw range value is stored as a 32-bit float by default. We use the training set to train our deep model and evaluate on the validation set. Only the first-return range images are used in our experiments.

SemanticKITTI [8]. We also evaluate our method on SemanticKITTI (which enhances KITTI [13] with semantic labels) to compare with prior art methods OctSqueeze [17] and MuSCLE [9] (since they do not release code, we cannot compare with them on the WOD). We directly apply the WOD-trained model on the SemanticKITTI test split (sequences 11-21). However, as KITTI only released the point cloud data but not the raw range images nor the sensor poses, we have to refer to the manual of the Velodyne lidar [2] used by KITTI to convert a point cloud to the spherical coordinate to get a pseudo range image with 64 rows and 2,088 columns. For our method, we compress the pseudo range images and do not additionally store the azimuth and elevation of the pixels, as their storage in actual Velodyne range images is negligible (elevations are known and azimuths can be compressed to less than 1Kb per frame [34]).

Metrics. Following previous works [9,14,17], we use two geometric metrics to evaluate the reconstruction quality of the compressed point cloud data: point-to-point Chamfer distance and point-to-plane peak signal-to-noise ratio (PSNR). We report these metrics as a function of bitrates, i.e., the average number of bits to store one lidar point.

The point-to-point Chamfer distance CD_sym measures the average point distances between two point clouds (the smaller the better). For a given point cloud P = {p_i}_{i=1,...,N} and the reconstructed point cloud P̂ = {p̂_j}_{j=1,...,M}:

    CD(P, P̂) = (1/|P|) Σ_i min_j ||p_i − p̂_j||_2   (2)

    CD_sym(P, P̂) = max{CD(P, P̂), CD(P̂, P)}   (3)

The second metric, the peak signal-to-noise ratio (PSNR) [30] (the larger the better), measures the ratio between the "resolution" r of the point cloud and the average point-to-plane error between the original point cloud P and the reconstructed point cloud P̂:

    PSNR(P, P̂) = 10 log_10 ( r² / max{MSE(P, P̂), MSE(P̂, P)} )   (4)

where MSE(P, P̂) = (1/|P|) Σ_i ((p_i − p̂_i) · n_i)² is the point-to-plane distance, p̂_i is the closest point in P̂ to p_i, and r = max_{p_i∈P} min_{j≠i} ||p_i − p_j||_2 is the intrinsic resolution of the original point cloud. We estimate the normals n_i using Open3D [38] with k = 12 for the k nearest neighbors.
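As a sanity reference, below is a small NumPy sketch of Eqs. 2-4 under the assumption of brute-force nearest neighbors and precomputed normals (the paper estimates normals with Open3D, k = 12); this is our illustration, not the authors' evaluation code.

```python
import numpy as np

def chamfer(P, Q):
    """Eq. 2: mean nearest-neighbor distance from P to Q. P: (N,3), Q: (M,3)."""
    d = np.linalg.norm(P[:, None, :] - Q[None, :, :], axis=-1)  # (N, M)
    return d.min(axis=1).mean()

def chamfer_sym(P, Q):
    """Eq. 3: symmetric point-to-point Chamfer distance."""
    return max(chamfer(P, Q), chamfer(Q, P))

def point_to_plane_mse(P, Q, normals_P):
    """MSE(P, Q): squared point-to-plane error using P's normals."""
    d = np.linalg.norm(P[:, None, :] - Q[None, :, :], axis=-1)
    nearest = Q[d.argmin(axis=1)]                 # closest point in Q to each p_i
    return np.mean(np.sum((P - nearest) * normals_P, axis=-1) ** 2)

def psnr(P, Q, normals_P, normals_Q):
    """Eq. 4: point-to-plane PSNR with the intrinsic resolution r of P."""
    d = np.linalg.norm(P[:, None, :] - P[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    r = d.min(axis=1).max()                       # r = max_i min_{j != i} ||p_i - p_j||
    mse = max(point_to_plane_mse(P, Q, normals_P),
              point_to_plane_mse(Q, P, normals_Q))
    return 10.0 * np.log10(r ** 2 / mse)
```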
5.2. Compression Results

In this section, we compare our method with competitive baselines as well as prior art lidar data compression methods. We focus on compressing the range channel, or equivalently the 3D coordinates of the points, as it is the most studied attribute among the others (intensity, elongation), and some of the methods in comparison do not support compressing other attributes. See the supplementary material for more results on compressing the other channels. We adjust the quantization precision of the range images to achieve different compression rates (bits per point) for our method.

Figure 3. The deep prediction model. Given a range image patch from frame T with quantized attribute values (e.g. range), we lift pixels to the spherical coordinate with azimuth and elevation angles from lidar calibration. To leverage context points from the past frame T−1, a query point is generated to find neighbors among points at frame T−1. Those neighbor points are then projected to the spherical coordinate of the pixel to be predicted. Our predictor takes the union of the intra-frame context points ((h×w−1)×3) and temporal context points ((h×w−1+m)×4 in total) and predicts the attribute of the pixel (i,j) with anchor classification and regression (with each input point as an anchor).

Baselines. G-PCC [14] is a point cloud compression method proposed by MPEG, using octrees. Draco [1] is a popular point cloud compression algorithm based on KD trees, proposed by Google. We also compare with two prior art deep model-based methods³: OctSqueeze [17] is an octree-based method that uses a neural network to predict the next-level symbol of the octree; MuSCLE [9] further strengthens OctSqueeze by leveraging multi-sweep (temporal) data for the octree prediction. In terms of the range image representation, we compare with PNG (intra-frame) as well as HEVC (a video compression standard) on top of PNG for temporal range image compression. For the PNG compression, the range is coded with 16 bits with a varying scaling factor to control the distortion/compression rate. We also compare with Cluster [29], a range image-based lidar data compression algorithm with a pipeline of segmentation, clustering, 3D-HEVC encoding and ground prediction. Besides, the supplementary provides a further experiment comparing with an auto-encoder based method on range images (not included here due to its poor performance).

³ There is another deep net based work, VoxelContextNet [25]; yet as they did not release code nor the detailed definition of the evaluation metrics, we could not compare with them.
Implementation Details. Our intra-frame prediction model, RIDDLE, takes in a context image patch of size 10×10 (the bottom right pixel is masked out) and uses a PointNet [24]-like architecture for the prediction (without the T-Net structure, with the output adapted to predict anchor classification and regression). The input to the network is a 3D point cloud in a spherical coordinate, with azimuth and elevation relative to the bottom right pixel and the range relative to the mean range of the valid context points. Our temporal model, RIDDLE-T, uses the same network architecture as the intra-frame one but takes in an extra 100 points from the last scan (projected to the spherical coordinate of the next pixel). Please see the supplementary for more details.

Waymo Open Dataset Results. We report the bitrate versus reconstruction quality metrics (PSNR, Chamfer distance) of competing methods on all frames from the sequences in the validation set of the Waymo Open Dataset. As shown in Fig. 4, our method significantly outperforms prior methods. At the same Chamfer distance of around 0.005, our method reduces the bitrate by more than 65% compared to G-PCC (from 10.78 bpp to 3.65 bpp). At a bitrate of around 4, our method reduces the distortion (measured by the Chamfer distance) by more than 85%. Our method also has a larger bitrate improvement over previous methods when the reconstruction quality is higher. This indicates our method has more advantage over the baselines when the data quality requirement is higher.

Figure 4. Evaluation of the compression methods with geometric metrics on the Waymo Open Dataset val set. Left: Chamfer distance v.s. bit per point (bpp); Right: PSNR v.s. bpp. At a certain bitrate, the lower the Chamfer distance or the higher the PSNR, the better the reconstruction quality.

SemanticKITTI Results. Since the prior art methods [9,17] have not released the code or the compression models, we turn to the SemanticKITTI dataset to compare with them (we got the raw values of the curves reported in the MuSCLE [9] paper from the authors). We apply our model trained on the Waymo Open Dataset directly to the SemanticKITTI lidar point clouds (by creating pseudo range images). As shown in Fig. 5, our method is more than 50% lower in bitrate (at around 4.3 bpp) at the same Chamfer distance of around 0.005 compared to all prior art methods, showing significant advantages.

Figure 5. Evaluation of the compression methods with geometric metrics on the SemanticKITTI test set. We only present our intra-frame model here as the per-pixel sensor pose is unavailable in SemanticKITTI.
This strong lead is attributed to our choice of directly compressing the range images as well as the effective deep model.

Qualitative results. In Fig. 7, we show the reconstructed lidar point clouds from our method, Draco and G-PCC. We can see that the point cloud reconstructed from our method remarkably resembles the original point cloud in geometry even when the bitrate is ambitiously set very low, thanks to compressing directly on the range images to keep the point distribution pattern.

5.3. Impact on Downstream Perception Tasks

For applications like autonomous driving, we want to understand the impact of lidar data compression on downstream perception tasks such as 3D object detection. To understand such impact, we trained a widely used PointPillars detector [19] on uncompressed point clouds using the Waymo Open Dataset train set, for the vehicle class and the pedestrian class respectively. Detection quality is measured by mean average precision (mAP).

As shown in Fig. 6, our method outperforms the other competing baselines in maintaining the best mAP at the same bitrate. At a bitrate of around 2, our method leads the second best method (G-PCC) by more than 1 point on vehicle detection and 3 points on pedestrian detection. We can also see that pedestrian detection is more sensitive to data distortion, probably due to the smaller average object sizes compared to vehicles.

Figure 6. Impact of lidar data compression on 3D object detection quality on the Waymo Open Dataset val set. We train PointPillars [19] detectors using the raw point clouds (with no compression) from the WOD train set and evaluate them with the compressed point clouds (or point clouds from the compressed range images) on the WOD validation set.

5.4. Analysis Experiments

In this section we ablate our deep model in terms of architecture choice, loss design and temporal context. In order to evaluate prediction quality independently of the entropy encoder, we use a prediction accuracy metric for the ablation studies. The prediction accuracy (acc.) is defined as the percentage of zero deltas (i.e. perfect prediction under quantization) in the range image residual map, under a specific quantization precision (e.g. δ = 0.1m⁴). A prediction q for the quantized range value p′ is counted as correct if |q − p′| < δ/2. The supplementary provides more analysis related to entropy encoders and model latency.

⁴ Note that 0.1m is not that coarse, as the average point displacement after the quantization is only 2.5cm.
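A one-function sketch of this accuracy metric as we read it (our own illustration): quantize the raw ranges, then count the predictions whose residual quantizes to zero.

```python
import numpy as np

def prediction_accuracy(pred, raw, delta=0.1):
    """Fraction of pixels whose prediction matches the quantized range.

    pred:  (H, W) model predictions for each pixel.
    raw:   (H, W) raw range values.
    delta: quantization precision in meters (e.g. 0.1m).
    """
    quantized = np.round(raw / delta) * delta
    # |q - p'| < delta/2 means the stored delta would be zero.
    return np.mean(np.abs(pred - quantized) < delta / 2)
```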
Table 1. Effects of prediction models.
model                   acc.@0.1m
previous valid value    54.35
linear interpolation    54.64
12-layer CNN            64.62
PointNet (adapted)      65.75

Table 2. Effects of loss functions.
loss function           acc.@0.1m
MSE                     59.83
MAE                     61.64
multi-bin loss          59.66
anchor cls. + reg.      65.75

Table 3. Effects of temporal input.
temporal context        acc.@0.1m
none (intra-frame)      65.75
10×10 image             67.34
100 knn points          69.23

Figure 7. Visualization of reconstructed point clouds, colored by per-point Chamfer distance (error bar colormap from 0 to about 238 mm on the bottom). From left to right: ground truth (32 bpp), G-PCC (4.02 bpp), Draco (4.02 bpp), PNG (4.02 bpp) and RIDDLE (ours, 4.02 bpp). It is clear that our method, under the same bit per point, has much less distortion. Best viewed in color with zoom in.

Effects of predictor choices. Table 1 compares several architecture choices. The simplest choice is to use the left valid pixel as the prediction for the current pixel: Î_{i,j} = I′_{i,j−1}. Another extension is to use linear interpolation of close-by pixels: Î_{i,j} = I′_{i,j−1} + I′_{i−1,j} − I′_{i−1,j−1}. Note that in both cases, the first valid pixel is used in case the nearby one is an empty pixel. We see that deep models significantly outperform linear models, while the point-cloud-based architecture shows a stronger empirical result than a ConvNet on the image representation.

Effects of loss functions. Table 2 compares several loss choices for our model supervision. With direct attribute prediction as a regression problem, we can see that using the mean absolute error (MAE, L1 loss) is superior to using the mean squared error (MSE, L2 loss), as it is affected less by the large errors on the object boundaries. Turning the depth regression problem into a multi-bin classification and regression problem (with classification and intra-bin regression for each depth bin of size 1m) does not help much either, as shown in the third row. Our proposed formulation (anchor classification with regression) leads to a 4.11 point increase in prediction accuracy compared to the second best option of using the mean absolute error.

Effects of temporal contexts. Table 3 shows the benefits of adding temporal contexts to the prediction model. We see that even the naive concatenation of the image patch of the last frame with the same rows and columns (second row) can already help. A more careful handling of the temporal points by considering sensor poses (as described in Sec. 4.2) leads to more gains from the temporal data.

6. Conclusion

With improving lidar sensor resolution and growing data volume, how to efficiently store and transmit lidar data becomes a challenging problem in many 3D applications, such as autonomous driving and augmented reality. To address this challenge, we propose a novel lidar data compression algorithm named RIDDLE (Range Image Deep DeLta Encoding), which combines the succinctness of traditional delta encoding and the expressiveness of deep neural networks, with support for using temporal contexts. Experiments on the Waymo Open Dataset and KITTI show that, compared to previous methods, the proposed approach yields significant improvement in the point cloud reconstruction quality and the downstream perception model performance under the same compression rates.

References

[1] Draco. https://github.com/google/draco. Accessed: 2021-09-28.
[2] Velodyne HDL-64E. https://gpsolution.oss-cn-beijing.aliyuncs.com/manual/LiDAR/MANUAL%2CUSERS%2CHDL-64E_S3.pdf. Accessed: 2021-10-04.
[3] Jae-Kyun Ahn, Kyu-Yul Lee, Jae-Young Sim, and Chang-Su Kim. Large-scale 3D point cloud compression using adaptive radial distance prediction in hybrid coordinate domains. IEEE Journal of Selected Topics in Signal Processing, 9(3):422–434, 2015.
[4] I. Armeni, A. Sax, A. R. Zamir, and S. Savarese. Joint 2D-3D-semantic data for indoor scene understanding. ArXiv e-prints, Feb. 2017.
[5] Johannes Ballé, Valero Laparra, and Eero P. Simoncelli. End-to-end optimized image compression. arXiv preprint arXiv:1611.01704, 2016.
[6] Johannes Ballé, David Minnen, Saurabh Singh, Sung Jin Hwang, and Nick Johnston. Variational image compression with a scale hyperprior. arXiv preprint arXiv:1802.01436, 2018.
[7] Peter van Beek. Image-based compression of lidar sensor data. Electronic Imaging, 2019(15):43-1, 2019.
[8] Jens Behley, Martin Garbade, Andres Milioto, Jan Quenzel, Sven Behnke, Cyrill Stachniss, and Jurgen Gall. SemanticKITTI: A dataset for semantic scene understanding of lidar sequences. In Proceedings of the IEEE International Conference on Computer Vision, pages 9297–9307, 2019.
[9] Sourav Biswas, Jerry Liu, Kelvin Wong, Shenlong Wang, and Raquel Urtasun. MuSCLE: Multi sweep compression of lidar using deep entropy models. arXiv preprint arXiv:2011.07590, 2020.
[10] Mario Botsch, Andreas Wiratanaya, and Leif Kobbelt. Efficient high quality rendering of point sampled geometry. Rendering Techniques, 2002:13th, 2002.
[11] Angela Dai, Angel X. Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. ScanNet: Richly-annotated 3D reconstructions of indoor scenes. In CVPR, 2017.
[12] Olivier Devillers and P-M. Gandoin. Geometric compression for interactive transmission. In Proceedings Visualization 2000, pages 319–326. IEEE, 2000.
[13] Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets robotics: The KITTI dataset. The International Journal of Robotics Research, 32(11):1231–1237, 2013.
[14] D. Graziosi, O. Nakagami, S. Kuma, A. Zaghetto, T. Suzuki, and A. Tabatabai. An overview of ongoing point cloud compression standardization activities: video-based (V-PCC) and geometry-based (G-PCC). APSIPA Transactions on Signal and Information Processing, 9, 2020.
[15] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[16] Hamidreza Houshiar and Andreas Nüchter. 3D point cloud compression using conventional image compression for efficient data transmission. In 2015 XXV International Conference on Information, Communication and Automation Technologies (ICAT), pages 1–8. IEEE, 2015.
[17] Lila Huang, Shenlong Wang, Kelvin Wong, Jerry Liu, and Raquel Urtasun. OctSqueeze: Octree-structured entropy model for lidar compression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1313–1323, 2020.
[18] Martin Isenburg. LASzip: lossless compression of lidar data. Photogrammetric Engineering and Remote Sensing, 79, 2013.
[19] Alex H. Lang, Sourabh Vora, Holger Caesar, Lubing Zhou, Jiong Yang, and Oscar Beijbom. PointPillars: Fast encoders for object detection from point clouds. In CVPR, 2019.
[20] Siwei Ma, Xinfeng Zhang, Chuanmin Jia, Zhenghui Zhao, Shiqi Wang, and Shanshe Wang. Image and video compression with neural networks: A review. IEEE Transactions on Circuits and Systems for Video Technology, 30(6):1683–1698, 2019.
[21] Fabian Mentzer, Eirikur Agustsson, Michael Tschannen, Radu Timofte, and Luc Van Gool. Practical full resolution learned lossless image compression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10629–10638, 2019.
[22] Fabrizio Nenci, Luciano Spinello, and Cyrill Stachniss. Effective compression of range data streams for remote robot operations using H.264. In 2014 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 3794–3799, 2014.
[23] Aaron van den Oord, Nal Kalchbrenner, Oriol Vinyals, Lasse Espeholt, Alex Graves, and Koray Kavukcuoglu. Conditional image generation with PixelCNN decoders. arXiv preprint arXiv:1606.05328, 2016.
[24] Charles R. Qi, Hao Su, Kaichun Mo, and Leonidas J. Guibas. PointNet: Deep learning on point sets for 3D classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 652–660, 2017.
[25] Zizheng Que, Guo Lu, and Dong Xu. VoxelContext-Net: An octree based framework for point cloud compression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6042–6051, 2021.
[26] Ruwen Schnabel and Reinhard Klein. Octree-based point-cloud compression. In PBG@SIGGRAPH, pages 111–120, 2006.
[27] Shuran Song, Samuel P. Lichtenberg, and Jianxiong Xiao. SUN RGB-D: A RGB-D scene understanding benchmark suite. In CVPR, 2015.
[28] Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurelien Chouard, Vijaysai Patnaik, Paul Tsui, James Guo, Yin Zhou, Yuning Chai, Benjamin Caine, et al. Scalability in perception for autonomous driving: Waymo open dataset. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2446–2454, 2020.
[29] Xuebin Sun, Han Ma, Yuxiang Sun, and Ming Liu. A novel point cloud compression algorithm based on clustering. IEEE Robotics and Automation Letters, 4(2):2132–2139, 2019.
[30] Dong Tian, Hideaki Ochimizu, Chen Feng, Robert Cohen, and Anthony Vetro. Geometric distortion metrics for point cloud compression. In 2017 IEEE International Conference on Image Processing (ICIP), pages 3460–3464. IEEE, 2017.
[31] George Toderici, Damien Vincent, Nick Johnston, Sung Jin Hwang, David Minnen, Joel Shor, and Michele Covell. Full resolution image compression with recurrent neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5306–5314, 2017.
[32] James Townsend, Thomas Bird, Julius Kunze, and David Barber. HiLLoC: Lossless image compression with hierarchical latent variable models. arXiv preprint arXiv:1912.09953, 2019.
[33] Chenxi Tu, Eijiro Takeuchi, Alexander Carballo, and Kazuya Takeda. Point cloud compression for 3D lidar sensor using recurrent neural network with residual blocks. In 2019 International Conference on Robotics and Automation (ICRA), pages 3274–3280. IEEE, 2019.
[34] Chenxi Tu, Eijiro Takeuchi, Chiyomi Miyajima, and Kazuya Takeda. Compressing continuous point cloud data using image compression methods. In 2016 IEEE 19th International Conference on Intelligent Transportation Systems (ITSC), pages 1712–1719. IEEE, 2016.
[35] Aaron Van Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. Pixel recurrent neural networks. In International Conference on Machine Learning, pages 1747–1756. PMLR, 2016.
[36] Louis Wiesmann, Andres Milioto, Xieyuanli Chen, Cyrill Stachniss, and Jens Behley. Deep compression for dense point cloud maps. IEEE Robotics and Automation Letters, 6(2):2060–2067, 2021.
[37] Wei Yan, Shan Liu, Thomas H. Li, Zhu Li, Ge Li, et al. Deep autoencoder-based lossy geometry compression for point clouds. arXiv preprint arXiv:1905.03691, 2019.
[38] Qian-Yi Zhou, Jaesik Park, and Vladlen Koltun. Open3D: A modern library for 3D data processing. arXiv preprint arXiv:1801.09847, 2018.

Supplementary

A. Overview

In this supplementary, we provide more details of our method, extra analysis experiment results and visualizations. In Sec. B, we describe more details of the deep predictive model, including its network architecture, losses and training process, as well as more explanation of the input data transformations. In Sec. C, we provide more analysis experiment results on model latency, effects of the entropy encoder choices and effects of the context sizes. In Sec. D, we apply our method to compress lidar data attributes beyond the range values. Finally, in Sec. E, we provide more visualizations of our model's predictions.

B. Details of the Predictive Models

Model architecture. For the deep prediction model, we adapt the structure of PointNet [24]. Details of the layers are visualized in Fig. 8. To reduce the latency, the network channel sizes are halved compared to the original architecture and the T-Nets are removed. After the concatenation of global and local features, we split the network into an anchor-classification branch and a residuals branch. Each branch is an MLP with layer sizes [128, 64, # of anchors], where the # of anchors is 99 for the intra-prediction model and 199 for the temporal model. For the temporal model, we build the KD trees using the neighbors.NearestNeighbors method of the Scikit-learn library. We use the left valid depth and the up valid depth as range estimates, and with each estimate we query 50 points from the last frame as the temporal context.

Figure 8. Deep predictive model architecture: shared per-point MLPs (32, 32, 32), a shared MLP (64, 512) followed by max pooling into a 512-dim global feature, concatenation with the per-point local features (n×544), a shared MLP (256, 128), and finally the anchor probability and anchor residual branches (1×n each). n is the number of context points, d is the input point dimension (d = 3 for the intra-prediction model, d = 4 for the temporal model).
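Following that description, here is a hedged PyTorch sketch of the adapted PointNet predictor. The layer sizes follow Fig. 8; everything else (class and method names, reading the two heads as shared per-point MLPs emitting one score and one residual per anchor) is our own interpretation, not the released model.

```python
import torch
import torch.nn as nn

class RiddlePredictor(nn.Module):
    """PointNet-style anchor predictor (a sketch, not the paper's code).

    Input:  points (B, d, n) - context points in spherical coords (+ time).
    Output: anchor logits (B, n) and per-anchor residuals (B, n).
    """

    def __init__(self, in_dim=3):
        super().__init__()
        self.local = nn.Sequential(            # shared per-point MLP (32, 32, 32)
            nn.Conv1d(in_dim, 32, 1), nn.ReLU(),
            nn.Conv1d(32, 32, 1), nn.ReLU(),
            nn.Conv1d(32, 32, 1), nn.ReLU())
        self.global_mlp = nn.Sequential(       # shared MLP (64, 512), then max pool
            nn.Conv1d(32, 64, 1), nn.ReLU(),
            nn.Conv1d(64, 512, 1), nn.ReLU())
        self.fuse = nn.Sequential(             # shared MLP (256, 128) on local||global
            nn.Conv1d(32 + 512, 256, 1), nn.ReLU(),
            nn.Conv1d(256, 128, 1), nn.ReLU())
        # Two heads: one classification score / one residual per anchor point.
        self.cls_head = nn.Sequential(
            nn.Conv1d(128, 64, 1), nn.ReLU(), nn.Conv1d(64, 1, 1))
        self.reg_head = nn.Sequential(
            nn.Conv1d(128, 64, 1), nn.ReLU(), nn.Conv1d(64, 1, 1))

    def forward(self, points):
        local = self.local(points)                         # (B, 32, n)
        g = self.global_mlp(local).max(dim=2).values       # (B, 512) global feature
        g = g.unsqueeze(2).expand(-1, -1, local.shape[2])  # broadcast to all points
        feat = self.fuse(torch.cat([local, g], dim=1))     # (B, 128, n)
        return self.cls_head(feat).squeeze(1), self.reg_head(feat).squeeze(1)
```

At inference, the predicted pixel value would be the selected anchor's value plus its regressed residual: value = anchor_values[argmax(logits)] + residuals[argmax(logits)].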
Loss functions. At training time, we train the deep prediction model end-to-end with the anchor classification loss and the anchor residual regression loss, weighting the classification loss by a weight γ:

    L = γ L_classification + L_regression   (5)

The classification loss is a cross-entropy loss across hw − 1 classes for the intra-frame prediction model and hw − 1 + m classes for the temporal model. The ground truth class is selected as the index of the pixel with the closest distance to the to-be-predicted pixel. As the inputs are quantized values, there can be ties. To avoid ties, we add a bias term to the distances to favor the pixels that are closer in angles (absolute delta azimuth + absolute delta elevation) to the to-be-predicted pixel. The regression loss is an L1 loss between the predicted residual corresponding to the pixel of the ground truth anchor and the ground truth residual of that pixel.

Training. We learn the weights of the prediction model by training on the range images from the Waymo Open Dataset train set. We randomly crop patches of shape 10×10 from the range images and train the deep network with batch size 128 and an Adam optimizer. We use loss weight γ = 0.01. The initial learning rate is 0.00005, and we decay the learning rate by 10x at step 1500k and step 3000k. We normalize the range values to [0,1] by dividing them by 75m. To mimic the same setting as in decoding, the training inputs are quantized. The ground truth attribute values are kept in full precision for more accurate supervision. For pixel locations at the boundary of the range images, we enforce the same patch size via zero padding.

There are two strategies for input quantization. The first strategy is to train different models for different quantization precisions, where each model uses a fixed quantization precision for its input. The second strategy is to use mixed precisions to quantize the input during training. Specifically, we uniformly sample a quantization precision for a given input from 0.0001 to 0.5000 with a sample bin size of 0.0001. With the second strategy, we only need to train one model for different compression rates. From our experiments, we observe that the second strategy does not harm the compression rates at individual quantization precisions.
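A minimal PyTorch sketch of the supervision in Eq. 5 as described above (cross-entropy over anchors plus L1 on the ground-truth anchor's residual, γ = 0.01). The tensor layout is assumed, and the tie-breaking angle bias is omitted for brevity.

```python
import torch
import torch.nn.functional as F

def riddle_loss(anchor_logits, residuals, anchor_values, target, gamma=0.01):
    """L = gamma * L_classification + L_regression (Eq. 5).

    anchor_logits: (B, n) scores over the n anchor (context) points.
    residuals:     (B, n) predicted residual per anchor.
    anchor_values: (B, n) quantized range value of each anchor.
    target:        (B,)  full-precision ground-truth range.
    """
    # Ground-truth anchor: the context point closest in value to the target.
    gt_anchor = (anchor_values - target[:, None]).abs().argmin(dim=1)  # (B,)
    cls_loss = F.cross_entropy(anchor_logits, gt_anchor)
    # L1 on the residual predicted for the ground-truth anchor only.
    idx = gt_anchor[:, None]
    pred_res = residuals.gather(1, idx).squeeze(1)
    gt_res = target - anchor_values.gather(1, idx).squeeze(1)
    return gamma * cls_loss + (pred_res - gt_res).abs().mean()
```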
Baselines. For the previous valid value method, we predict the attribute I_{i,j} at row i, column j as Î_{i,j} = I′_{i,j−1} if I′_{i,j−1} is a valid pixel (not void due to an empty laser return); otherwise we repeatedly decrement j by 1 until the pixel is valid. For the linear interpolation baseline method, we predict Î_{i,j} = I′_{i,j−1} + I′_{i−1,j} − I′_{i−1,j−1}. For the 12-layer CNN method, we adapt a structure similar to ResNet [15]. The network is composed of two convolutional layers (channel sizes 64, 32) and 5 residual blocks (channel sizes 32, 32 for each block), and all convolutional layers have filter size 3×3.

C. More Analysis

Analysis of the model latency. Table 5 shows the latency comparison between our method, OctSqueeze [17] and G-PCC from MPEG. To achieve faster decompression speed, during compression we split a range image into smaller blocks and run compression in parallel on these blocks. During decompression, we decode in parallel on these smaller blocks. Our experiments show that if we split a 64 by 2650 range image into 212 blocks of size 16 by 50, the bitrate only increases by 0.5%, which is nearly negligible. If we split it into 424 blocks of size 16 by 25, the bitrate increases by 5.13%. Table 5 reports the latency of our method when splitting into blocks of size 16 by 26 during decoding, and Table 6 breaks down the encoding time. We benchmark our method on an NVIDIA Tesla V100 GPU. Our deep model is accelerated by TensorRT with float16 quantization. Operations other than model inference (entropy encoding) are written in C++. The latency of G-PCC is benchmarked on CPU using MPEG's implementation (github.com/MPEGGroup/mpeg-pcc-tmc13). The latency of OctSqueeze (depth 16) is from the original paper [17]. Moreover, we believe further speedup of our method could be achieved by methods like predicting multiple pixels at a time or shared point embeddings for streamed prediction.

Table 5. Latencies of lidar data compression methods.
method            compressing (ms)   decompressing (ms)
G-PCC [14]        1594.5             1052.1
OctSqueeze [17]   106.0              902.3
RIDDLE (ours)     532.51             966.3
Note: G-PCC is evaluated using CPU on the Waymo Open Dataset (WOD). OctSqueeze is evaluated on the KITTI dataset (with a similar range image resolution to WOD) and our method is evaluated on the WOD. Both OctSqueeze and our method use GPU for model inference.

Table 6. Breakdown of the encoding time of RIDDLE.
preprocessing (ms)   network (ms)   entropy encoding (ms)
16.23                487.4          28.88
Note: preprocessing includes the time to compute and compress a binary mask indicating whether a pixel is a valid return in the range image. Network refers to the model prediction time. Entropy encoding refers to the time used by the entropy encoder.

Choices of entropy encoder. After the predictive delta encoding, we get a residual map of the range image. An entropy encoder is used to leverage the sparsity pattern in the residual map to compress it. Given an accurate prediction model, most of the residuals would be zero. In addition, as shown in Fig. 9, larger quantization steps round more residuals to zero, so the residuals become more sparse. We adopt two methods to entropy encode the residuals. In practice, we can select the entropy encoder with the highest compression rate depending on the quantization rate and the predictor.

The first method is to represent the residuals using a sparse representation. Given an array of residuals, we represent the array with the values of the nonzero residuals and their indices in the array. For a long run of sparse residuals, the sparse representation is quite memory efficient. After obtaining the sparse representation of the residuals, we use arithmetic encoding to further reduce its size.

The second method is to represent the residuals using run-length encoding. We first flatten the residual map to a vector and then represent it with the values and the run-lengths of values. This representation achieves better compression rates when the residuals are not that sparse, i.e., when the quantization step size is small. After obtaining the run-length representation, we use the LZMA compressor to further reduce its size.

Table 4 shows that different entropy encoders achieve different compression rates on the residuals. For a quantization precision of 0.1m, the residuals are more sparse, and the compression rate of the sparse representation (representing non-zero residuals by specifying their row/column indices and residual values) with arithmetic encoding is higher than varints with LZMA. However, for a quantization precision of 0.02m, the compression rate of varints with LZMA is higher, due to the decrease of zero residuals.

Table 4. Ablation study of the entropy encoders.
entropy encoder                        quantization   bpp
sparse repre. + arithmetic encoding    0.1m           2.28
varints + LZMA                         0.1m           2.33
huffman encoding                       0.1m           2.49
arithmetic encoding                    0.1m           2.41
sparse repre. + arithmetic encoding    0.02m          4.17
varints + LZMA                         0.02m          4.08
huffman encoding                       0.02m          4.25
arithmetic encoding                    0.02m          4.21
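To illustrate the second scheme, here is a small sketch of run-length encoding the flattened residual map followed by LZMA, using only Python's standard lzma module; the varint packing used in the paper is replaced by fixed-width int32 pairs for brevity, so the exact rates will differ.

```python
import lzma
import numpy as np

def rle_lzma_encode(residual_map):
    """Run-length encode the flattened residuals, then LZMA the byte stream."""
    flat = residual_map.ravel()
    # Positions where the value changes -> start index of each run.
    change = np.flatnonzero(np.diff(flat)) + 1
    starts = np.concatenate([[0], change])
    lengths = np.diff(np.concatenate([starts, [flat.size]]))
    values = flat[starts]
    payload = np.stack([values, lengths], axis=1).astype(np.int32).tobytes()
    return lzma.compress(payload)

def rle_lzma_decode(blob, shape):
    """Invert rle_lzma_encode: expand (value, run_length) pairs back to a map."""
    pairs = np.frombuffer(lzma.decompress(blob), dtype=np.int32).reshape(-1, 2)
    return np.repeat(pairs[:, 0], pairs[:, 1]).reshape(shape)
```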
Analysis on input choices. Table 7 shows how the input choices affect the prediction accuracy of the intra-frame model. Instead of inputting an intra-frame context of 10 by 10 (minus the bottom right pixel), we can input just the 9 pixels above (row 1), the 9 pixels in the same row (row 2), or a smaller context size (rows 3-4). We can see that enlarging the receptive field of the input with context from both the upper left and upper right improves the prediction accuracy (row 3 v.s. rows 1 and 2). Moreover, including azimuth and inclination as additional input attributes also improves the predictor (row 4 v.s. row 3) compared to just using the relative row/column indices as input.

Table 7. Effects of context size and input format (intra-frame model).
context size   input format                        acc.@0.1m
10 x 1         (∆azimuth, ∆inclination, depth)     37.53
1 x 10         (∆azimuth, ∆inclination, depth)     60.00
5 x 10         (∆row index, ∆col index, depth)     65.02
5 x 10         (∆azimuth, ∆inclination, depth)     65.21
10 x 10        (∆azimuth, ∆inclination, depth)     65.75

Generalization of the method. When we apply the deep predictive model trained on 64-beam frames from the Waymo Open Dataset directly to subsampled 32-beam frames, it achieves 2.55 bpp at 0.1m depth precision (only slightly larger than the 2.23 bpp on 64 beams). In addition, the compressor trained on WOD applies well to KITTI (Fig. 5), which shows its generalization.

Comparison with LASzip. Benchmarked on the Waymo Open Dataset, LASzip [18] achieves 67.6 PSNR and 0.0048 Chamfer distance at 10.62 bpp. Our method achieves 72.39 PSNR and 0.0026 Chamfer distance at 4.51 bpp, which clearly outperforms it.

Auto-encoder-based compression. We further compare our method with an auto-encoder-based image compression algorithm [6]. The auto-encoder is trained with a learning rate of 0.0001 and an Adam optimizer. The range values are scaled to [0,1] by 75m instead of by 255 as for RGB images. Fig. 10 shows the reconstructed point clouds of the auto-encoder-based method and our method under similar bitrates. The colors of the points in the visualizations demonstrate that our method has much better reconstruction quality than this auto-encoder baseline. The auto-encoder-based range image compression method does poorly especially at the boundary between foreground points and background points.

D. Compression of More Attributes

Since a lidar point cloud may contain additional attributes (e.g. intensity, elongation) beyond the range values, in this section we show how much we can compress the other attributes. We train a network that takes in multi-channel range images and outputs multi-channel predictions. Specifically, we train a network on the Waymo Open Dataset, which contains 3 channels for each point: range, intensity and elongation. The network is modified to have 3 anchor-classification branches and 3 residual branches for the 3 attributes. From Table 8 rows 1-4, we can see that a quantization precision of 0.02m for range, 0.1 for intensity, or 0.1 for elongation has a similarly small effect on the object detector. With those quantization precisions for each attribute, range values account for most of the storage cost (4.04 bpp) compared to the other two (0.88 bpp for intensity and 0.78 bpp for elongation).

Table 8. Effects of attribute quantization (bit per point listed as range / intensity / elongation). The uncompressed attributes are saved as 32-bit float numbers. We see that quantizing the intensity or elongation to 0.1, or quantizing the range value to 0.02m, has little impact on the detection mAPs. At these selected quantization rates, the range channel accounts for most of the bpp (around 70%).
precision (rng/int/elo)   bpp (rng/int/elo)     total bpp   vehicle mAP   pedestrian mAP
- / - / -                 32 / 32 / 32          96.00       69.59         65.62
0.02m / - / -             4.04 / 32 / 32        68.04       69.60         65.63
- / 0.1 / -               32 / 0.88 / 32        64.88       69.59         65.84
- / - / 0.1               32 / 32 / 0.78        64.78       69.59         65.62
0.02m / 0.1 / 0.1         4.04 / 0.88 / 0.78    5.70        69.59         65.62

E. More Visualizations

The distributions of residuals. Fig. 9 shows the distribution of the range residual maps after our deep delta encoding step. We can see that the larger the quantization interval, the more concentrated the residuals are (lower entropy), which explains the lower bitrate after the compression. Note that for a quantization size of 0.1m, more than 70% of the predictions have zero error compared to the ground truth quantized range image.

Figure 9. Distribution of residuals at different quantization precisions (0.02m, 0.1m, 0.2m). The proposed deep prediction model is able to accurately model the joint distributions of range image pixel attributes, resulting in a concentrated distribution of residuals with low entropy.

Figure 10. Visualization of reconstructed point clouds, colored by per-point Chamfer distance (error bar colormap on the bottom). From left to right: raw (32 bpp), RIDDLE (ours, 2.19 bpp) and auto-encoder (2.20 bpp). It is clear that our method, under the same bit per point, has much less distortion. Best viewed in color with zoom in.