RIDDLE: Lidar Data Compression with Range Image Deep Delta Encoding

Xuanyu Zhou*  Charles R. Qi*  Yin Zhou  Dragomir Anguelov
Waymo LLC
*Equal contribution.

Abstract

Lidars are depth measuring sensors widely used in autonomous driving and augmented reality. However, the large volume of data produced by lidars can lead to high costs in data storage and transmission. While lidar data can be represented as two interchangeable representations: 3D point clouds and range images, most previous works focus on compressing the generic 3D point clouds. In this work, we show that directly compressing the range images can leverage the lidar scanning pattern, compared to compressing the unprojected point clouds. We propose a novel data-driven range image compression algorithm, named RIDDLE (Range Image Deep DeLta Encoding). At its core is a deep model that predicts the next pixel value in a raster scanning order, based on contextual laser shots from both the current and past scans (represented as a 4D point cloud of spherical coordinates and time). The deltas between predictions and original values can then be compressed by entropy encoding. Evaluated on the Waymo Open Dataset and KITTI, our method demonstrates significant improvement in the compression rate (under the same distortion) compared to widely used point cloud and range image compression algorithms as well as recent deep methods.

1. Introduction

Lidar (or LiDAR, short for light detection and ranging) sensors are commonly used in applications that require 3D scene understanding such as autonomous driving and augmented reality. However, with the growing resolution of lidars, storing and transmitting large volumes of sequential lidar data become a challenge. There is a strong need to develop effective algorithms for lidar data compression.

While the measurements of a lidar scan are often used as a 3D point cloud, the raw lidar data can be represented in a more structured format: a range image, where each pixel corresponds to a laser shot, each row represents shots from the same laser, and each column represents shots at a specific azimuth rotation angle. Given the lidar scanning mechanism (directions of the lasers) and sensor poses (6D poses in the global coordinate at the timestamp of every shot), a range image and its corresponding point cloud can be converted interchangeably and losslessly. By organizing the points in a range image, instead of storing the three-dimensional coordinates of the points, we can just store one-dimensional ranges (around a 3x saving in storage). Given this observation, in contrast to previous works that focus on compressing 3D point clouds [9, 17, 25], we propose to directly compress range images to leverage the lidar scanning patterns.

As range images are in the image format, naturally we can apply existing compression methods for optical images (RGB or grayscale); however, those methods have their limitations. For example, the PNG format is often used to compress depth images in indoor datasets [4, 11, 27], where the depth values are normalized and quantized to 16-bit integers and compressed losslessly. While PNG also applies to compressing lidar range images, it is not data-driven and does not use temporal information. There are also attempts to use auto-encoder networks [33] to lossily compress range images by storing the bottleneck layer output.
However, as range values often have a much wider distribution than RGB colors, it is challenging to learn an accurate reconstruction, especially at the object boundaries.

In this work, we propose RIDDLE (Range Image Deep DeLta Encoding), a data-driven algorithm to compress range images with predictive neural networks (Fig. 2). Our method is inspired by the use of delta encoding in PNG image compression. However, instead of simply computing a difference between close-by pixels, we adopt a deep model to predict the pixel value from context pixels. The deep model takes a local patch of the decoded range image and predicts the attributes of the next pixel in a raster-scanning order (a similar process to the sequential image decoder PixelCNN [35]). We can then entropy encode the residuals between the predicted values and the original values to achieve lossless compression under a chosen quantization rate. In this scheme, the more accurate the prediction is, the smaller the entropy of the residuals is; improving the compression rate is therefore equivalent to developing a more accurate predictive model.

What is unique in our model design is that we represent local image patches as point clouds in the spherical coordinates (with azimuth, elevation and range values) to reflect the non-uniform ray angles of each shot (or pixel), which lifts the 2D pixels to 3D point clouds. By further lifting the 3D points to 4D with a timestamp channel, we can unify the way we represent context pixels/points from both the current and history scans. Since our model directly takes in point clouds, neither interpolation (to the image grid) nor image cropping (projected points from history frames may span different image regions) is needed. On the other hand, as to the model output formulation, instead of directly regressing the pixel values (whose distribution is often multi-modal), we treat each pixel in the input patch as an anchor and predict a confidence score as well as a residual value per anchor.

Evaluated on the large-scale Waymo Open Dataset (WOD) [28], we show that our method reduces the bitrate by more than 65% for the same distortion (measured using the point-to-point Chamfer distance), or reduces the distortion by more than 85% for the same bitrate, compared to the MPEG standard compression method G-PCC [14], while also significantly outperforming other baselines like Draco [1] and PNG. On the KITTI dataset [13], we compare with prior art deep compression methods (using octrees) and show our method has a clear advantage over them, thanks to its use of the range image representation and the accurate prediction model. We also evaluate the impact of compression on downstream perception tasks such as 3D object detection and provide extensive ablation studies to validate our design choices.

2. Related Work

Point cloud compression. As 3D applications rise, recent years have seen an increasing number of algorithms proposed for point cloud compression. One family of methods uses octrees to represent and compress quantized point clouds [10, 12, 26]. The Motion Picture Experts Group (MPEG) has released a related point cloud compression (PCC) standard, called geometry-based PCC (G-PCC) [14], using the octree structure and various ways to predict the next-level content. More recently, OctSqueeze [17] was proposed to use a neural network as a conditional entropy model to estimate the octree occupancy symbols, and MuSCLE [9] extends it by including temporal priors from previous frames.
VoxelContextNet [25] further leverages the voxel context for the octree structure prediction. These neural network-based methods consistently show improvements over G-PCC, which uses hand-crafted entropy models. While the octree-based methods are flexible enough to model arbitrary point clouds (from either a lidar sensor or multi-view reconstruction), they do not make use of the point distribution patterns in lidar range images.

As a lidar point cloud can be represented as a range image, image-based compression methods can be adapted for its compression. For example, [3, 7, 16] applied traditional image compression methods such as JPEG, PNG and TIFF to compress the range images. A sequence of range images can be seen as a video, and video-based compression methods like H.264 have been applied to compress lidar sequences [22]. MPEG also proposed a PCC standard (V-PCC) that compresses dynamic point clouds via the HEVC video codec [14]. Our work extends them to leverage deep models and delta encoding to compress range images.

Auto-encoders have been used to achieve lossy compression of point clouds. [36, 37] proposed to train an encoder-decoder point cloud reconstruction network and entropy encode the bottleneck layer as the compressed data. Similarly, [33] trained an auto-encoder to reconstruct range images and compress the bottleneck vectors. While these methods may achieve high compression rates, the reconstructed point clouds can have strong artifacts, especially at the object boundaries, resulting in unbounded errors in the lossy compression scheme.

Learned image and video compression. Image and video compression are well-studied fields with many standards (for example: PNG, JPEG, TIFF for images, H.264 and HEVC for videos). Among them, PNG is highly related to our work as it performs lossless image compression using delta encoding. With the popularity of deep convolutional neural networks for image understanding, deep model-based image and video compression have also been widely explored [5, 6, 20, 21, 31, 32]. Many of them leverage an encoder-decoder neural network (for example, a variational auto-encoder [5]) for compressing (encoding the image to a latent vector) and decompressing (decoding/generating the image from the vector). For the decoding architectures, sequential models such as PixelCNN [23] and PixelRNN [35] inspired our predictive model design.

3. Problem Formulation

For most lidar sensors, one scan can be interchangeably represented as either a point cloud $P \in \mathbb{R}^{N \times C}$ or a range image $I \in \mathbb{R}^{H \times W \times C}$, where $N$ is the number of points, $H$ and $W$ are the height and width of the range image ($H$ is the number of laser beams in the lidar and $W$ is the number of shots per laser per frame), and $C$ is the feature dimension for each point. Each valid pixel in the range image represents a laser shot corresponding to one point in the point cloud. The channels include the range value and other attributes such as reflection intensity.

The conversion rule between a point cloud and a range image depends on the laser scanning mechanism (the laser shot azimuth and elevation angles) as well as the sensor poses (the 6D pose of the laser sensor at the time of each laser shot), as illustrated in Fig. 1.

Figure 1. Illustration of laser shots. Left: a single laser shot. Right: laser shots across time (in a bird's eye view). We show four consecutive laser shots (with delta azimuth angle $\omega$) that measure the ranges from the (moving) sensor to the object. To convert the range values to a point cloud, we need to know the ranges, the shot angles, as well as the sensor poses at each shot.
Specifically, in a range image $I$, given a pixel location $(i, j)$ (which maps to a specific laser shot angle) and its range value, we get a laser measurement $(r, \theta, \alpha)$, where $r$ is the range value, and $\theta$ (azimuth or yaw) and $\alpha$ (elevation or pitch) are the shot angles relative to the lidar sensor coordinate. The measurement can be converted to a point $p$ in the sensor coordinate by:

$$p = (x, y, z) = (r \cos\alpha \cos\theta,\; r \cos\alpha \sin\theta,\; r \sin\alpha) \quad (1)$$

At the time of each laser shot, the sensor pose $[R \,|\, t]$ (rotation and translation in the global coordinate) can be different (Fig. 1). To aggregate the shots into a point cloud, we need to convert the points to a shared global coordinate system to get the point set $P = \{ R_i p_i + t_i \}_{i=1,\ldots,N}$, where $i$ is the index of the laser shot in a scan/range image.

Reversely, given the point cloud $P$ of a scan (in the global coordinate), to convert it to the range image, we first need to transform each point to the sensor coordinate corresponding to the time of its shot. Then, we can easily get the $(r, \theta, \alpha)$ by the reverse process of Eq. 1, which then maps back to the row and column indices.

For our lidar range image compression, we first quantize the range image $I$ by rounding its pixel values to a predetermined quantization precision. Then our goal is to compress the quantized range image $I'$ to a bitstream $b \in [0, 1]^n$ (with an $n$ as small as possible), which can later be decompressed into the exact quantized range image $I'$. The compression is lossy with respect to the raw range image but lossless with respect to the quantized range image.

Note that for calibrated lidars such as the ones used in the Waymo Open Dataset [28], each pixel in the range image corresponds to a fixed shot angle $(\theta, \alpha)$ for the same lidar, so the angles do not need to be stored for the compression.¹ Besides, as sensor poses are often stored separately from range images and are shared with other modules (such as localization), we do not need to store sensor poses either. Only the range image needs to be compressed.

¹For the main lidars used in WOD, pixel elevations are determined by the laser beam inclinations (64 numbers) and azimuths can be calculated based on uniform azimuth rotation. For other lidars such as the Velodyne HDL-64, azimuth rotation angles are not uniform and need to be stored (one number for each column, costing only ~0.1 Kb per frame) [34].
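For concreteness, the following NumPy sketch (our illustration; the function and variable names are hypothetical and not taken from any released dataset tooling) maps a single laser measurement to a 3D point in the global coordinate by applying Eq. 1 and the per-shot sensor pose:

```python
import numpy as np

def pixel_to_global_point(r, azimuth, elevation, R, t):
    """Convert one laser measurement (r, theta, alpha) to a 3D point.

    r: range in meters; azimuth/elevation: shot angles in radians
    (from the lidar calibration); R (3x3), t (3,): sensor pose at the
    time of the shot, mapping sensor coordinates to the global frame.
    """
    # Eq. 1: spherical -> Cartesian in the sensor coordinate.
    p_sensor = np.array([
        r * np.cos(elevation) * np.cos(azimuth),
        r * np.cos(elevation) * np.sin(azimuth),
        r * np.sin(elevation),
    ])
    # Apply the per-shot sensor pose to reach the global coordinate.
    return R @ p_sensor + t

# Example: a 20 m return at 30 deg azimuth, -2 deg elevation, identity pose.
point = pixel_to_global_point(20.0, np.deg2rad(30.0), np.deg2rad(-2.0),
                              np.eye(3), np.zeros(3))
print(point)
```

The reverse conversion simply inverts the per-shot pose and recovers $(r, \theta, \alpha)$ with a norm, an arctan2 and an arcsin, which then map back to row and column indices.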
Figure 2. The deep delta encoding pipeline for lidar range image compression. Given a lidar range image, we first quantize the attribute values and then run inference of the predictive model on the quantized range image to derive residuals. Finally, we use entropy encoders to compress the residuals to a bitstream.

4. Range Image Deep Delta Encoding

We first describe our overall compression pipeline in Sec. 4.1, then dive deep into the design of our prediction model in Sec. 4.2, and finally describe how we entropy encode the residuals in Sec. 4.3.

4.1. Pipeline Overview

As shown in Fig. 2, the input to our compression pipeline is a raw range image. First, we quantize the range image with a certain quantization precision (this allows us to store the deltas as discrete symbols). Next, the core part of the pipeline is the deep delta encoding. We train a deep model to predict the next pixel value in a raster scanning order. We then save the delta between the prediction (quantized) and the original (quantized) pixel value instead of saving the original pixel value. As the deltas are smaller and more concentrated in distribution than the original pixel values, they can be compressed more effectively. At the last step, the deltas (or the residual map) are entropy encoded to a compressed bitstream.

4.2. Deep Delta Encoding

Commonly used delta encoding adopts a linear prediction model to estimate the pixel values. In its simplest form, to predict a pixel $I_{i,j}$ at the $i$-th row and $j$-th column, its left pixel $I_{i,j-1}$ is used as the prediction. Other linear filters of left, up and nearby pixels can also be used. The delta between the prediction and the original pixel value is stored to be compressed. In our work, we propose to train a deep neural network to predict the pixel values and show that it can achieve significant improvements in prediction accuracy and compression rate. Next, we first introduce our model in its intra-prediction form (only using information from the current frame/scan for the prediction) and then describe how we extend it to take temporal input from history scans. Please see the supplementary for more details on the model architecture, the losses and the training process.

Intra-frame prediction model. Formally, the network models the conditional probability of the $k$-th pixel value (in the raster scanning order) conditioned on the quantized pixel values before $k$: $p(I_k; \Theta) = p(I_k \mid \{ I'_{k-1}, \ldots, I'_1 \}; \Theta)$, where $\Theta$ are the network weights, $I'$ is the quantized range image and $I$ is the unquantized raw range image. Empirically, as shown in Fig. 3, instead of using the entire past context (e.g., with an RNN model), we can use a local image patch of shape $h \times w$ as the context to predict the bottom-right pixel of the patch, similar to the idea of the sequential image decoder PixelCNN [23].

Although the input to our network is an image patch, it is quite different from a typical RGB one. The relations of the range image pixels depend on the location of the patch and even the calibration of a specific lidar, because the laser shot angles are often non-uniformly distributed. This is even more prominent in the inter-frame prediction when we re-project the points from history scans to the coordinate of the current shot. Therefore, we augment the range image with two extra channels: the delta azimuth and delta elevation angles relative to the angles of the to-be-predicted pixel, which lifts the 2D pixels to the 3D spherical coordinate. Furthermore, as range prediction is a geometry estimation problem, we found empirically that using a 3D deep learning model such as PointNet [24] leads to more accurate predictions compared to using a 2D convolutional network.

As shown in Fig. 3, given the lidar calibration data, we first convert the range image patch to a mini point cloud (with at most $hw - 1$ points). Instead of directly regressing the pixel range value, which suffers from the uncertainty caused by the multi-modal distribution of attributes (especially on the object boundaries), we formulate the prediction as an anchor-based classification and anchor-residual regression problem, where valid pixels in the range image patch are the anchors. The deep network predicts which pixel is the closest in value to the bottom-right pixel and regresses a residual (the term is overloaded here; it is different from the residual map in delta encoding) with respect to each anchor pixel.
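As a rough sketch of this input/output formulation (shapes and names are illustrative, and details such as range normalization are omitted), the context patch is lifted to a set of (Δazimuth, Δelevation, range) points, and the final prediction combines the highest-scoring anchor with its regressed residual:

```python
import numpy as np

def patch_to_points(patch, azimuths, elevations, valid):
    """Lift an h x w range-image patch to a point set in spherical coords.

    patch: (h, w) quantized ranges; azimuths/elevations: (h, w) shot angles
    from the calibration; valid: (h, w) boolean mask of non-empty returns.
    The bottom-right pixel is the one to be predicted and is excluded.
    """
    h, w = patch.shape
    mask = valid.copy()
    mask[h - 1, w - 1] = False                   # exclude the target pixel
    d_az = azimuths - azimuths[h - 1, w - 1]     # angles relative to the target
    d_el = elevations - elevations[h - 1, w - 1]
    return np.stack([d_az[mask], d_el[mask], patch[mask]], axis=-1)

def decode_prediction(anchor_ranges, anchor_scores, anchor_residuals):
    """Anchor classification + residual regression -> predicted range."""
    best = int(np.argmax(anchor_scores))         # most likely anchor pixel
    return anchor_ranges[best] + anchor_residuals[best]
```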
Temporal model. The temporal model extends the intra-frame prediction model by leveraging contexts from both the current scan and the past scan. The point cloud representation (compared to the 2D pixel representation) enables us to unify the input from the past and current scans, as we can represent all laser shots in the 4D (spherical plus time) coordinates.

Given the current scan (quantized) range image $I'_T$ and the past scan range image $I'_{T-1}$, assume we want to predict the range value of pixel $(i, j)$ in the current scan (the $k$-th pixel in the raster scanning order). A naive baseline approach to using temporal data is to take the same neighborhood as that in $I'_T$ (in terms of pixel rows and columns) from the last scan $I'_{T-1}$ and concatenate it with the current frame image patch. However, this approach does not take the ego-motion of the lidar sensor into account. As the lidar moves over time, the range image patch with the same rows and columns can correspond to vastly different physical space.

To take sensor poses into consideration, instead of querying pixels of the last frame using the row and column indices, we should query neighbors using 3D points in the global coordinate (Fig. 3). However, as we do not know the ground truth range value for the pixel $(i, j)$, we have to approximate the query by using a predicted range (e.g., using the left pixel range or the predicted value from the intra-frame model). Given pixel $(i, j)$'s laser shot angle $(\theta, \alpha)$ and its estimated range $\hat{r}$, we get a point in the global coordinate, following Sec. 3. Then, given the points from the last frame in the global coordinate, we can directly query neighbors in the 3D space (using KD-trees to accelerate the query). Those neighboring points from the last frame can then be projected to laser shot $(i, j)$'s spherical coordinate (transformed to the sensor coordinate at the time of the laser shot and then to the spherical coordinate), to obtain extra points as temporal contexts.² This is equivalent to assuming the points from the last frame are static, and that we re-scan the scene from the sensor location at the time of the laser shot $(i, j)$. To distinguish the points from the last and current frames, we augment the points with an extra time channel (with 1 indicating the last frame and 0 indicating the current frame).

²Strictly, even the pixels/points from the current frame need to be reprojected to the sensor coordinate at the time of the shot $(i, j)$. We have this reprojection in our intra-frame model, but the impact is small as the sensor moves little between a few pixels.

Note that the reprojected points from the last frame do not directly correspond to the rows and columns of the current frame range image. Considering such input as a point cloud is convenient, as we do not require any interpolation to turn the points into the image grid or any predefined neighborhood size for image cropping.

Inference. At inference time (for compression), we start from the top-left patch of the range image to predict pixel $I'_1$ (i.e., $I'_{1,1}$) and store the residual. This process continues in a raster scanning order to predict pixels $I_{1,2}, \ldots, I_{1,W}, I_{2,1}, \ldots, I_{i,j}, \ldots, I_{H,W}$. The residual map (deltas between the predictions and the quantized values) of size $H \times W$ is then compressed by the entropy encoder.
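A minimal sketch of this raster-scan encoding loop (and the mirrored decoding described next) is given below. The previous-valid-pixel rule stands in for the deep model, which would be plugged in through the same predict_fn interface; this is an illustrative assumption, not a description of the released implementation.

```python
import numpy as np

def encode_deltas(quantized, precision, predict_fn):
    """Raster-scan delta encoding over a quantized range image.

    quantized: (H, W) range image, already rounded to `precision`.
    predict_fn(recon, i, j): predicts pixel (i, j) from already decoded pixels.
    Returns an (H, W) map of integer residuals (in quantization steps).
    """
    H, W = quantized.shape
    recon = np.zeros_like(quantized)
    deltas = np.zeros((H, W), dtype=np.int64)
    for i in range(H):
        for j in range(W):
            pred_q = round(predict_fn(recon, i, j) / precision) * precision
            deltas[i, j] = int(round((quantized[i, j] - pred_q) / precision))
            recon[i, j] = quantized[i, j]  # context for the next predictions
    return deltas

def decode_deltas(deltas, precision, predict_fn):
    """Reconstruct the quantized range image from the stored residuals."""
    H, W = deltas.shape
    recon = np.zeros((H, W), dtype=np.float64)
    for i in range(H):
        for j in range(W):
            pred_q = round(predict_fn(recon, i, j) / precision) * precision
            recon[i, j] = pred_q + deltas[i, j] * precision
    return recon

def previous_valid_pixel(recon, i, j):
    """Stand-in predictor: left neighbor, or the pixel above in the first column."""
    if j > 0:
        return recon[i, j - 1]
    return recon[i - 1, j] if i > 0 else 0.0
```

In the full pipeline, the deep model replaces this stand-in predictor, and the residual map returned by the encoder is what the entropy encoder of Sec. 4.3 consumes.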
At decompression time, we run the prediction model in the same raster-scanning order: it takes as input the already reconstructed pixels $\{ I'_1, \ldots, I'_{k-1} \}$, predicts the next pixel value $\hat{I}_k$, and then reconstructs the pixel from the saved residual as $I'_k = \hat{I}_k + \delta_k$, where $\delta_k$ is the stored delta of pixel $k = (i-1)W + j$. This process can be parallelized by dividing the input range image into blocks and running the inference in parallel for each block (discussed in the supplementary).

4.3. Entropy Encoding

After the predictive delta encoding, we get a residual map/array of the range image. An entropy encoder is used to leverage the sparsity pattern in the residual map to compress it. Given an accurate prediction model, most of the residuals would be zero. We adopt two methods to entropy encode the residuals. In practice, we select the entropy encoder with the highest compression rate depending on the quantization rate and the predictor.

The first method is to represent the residuals using a sparse representation, with the values of the nonzero residuals and their indices in the array, which can then be arithmetically encoded to further reduce its size. The second method is to represent the residuals using run-length encoding, which achieves better compression rates when the residuals are not very sparse, i.e., when the quantization step is small. After obtaining the run-length representation, we use the LZMA compressor to further reduce its size.
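The run-length option can be sketched as follows (the byte layout and helper names are ours; the arithmetic-coding path and the exact serialization used in the experiments are not shown):

```python
import lzma
import numpy as np

def run_length_encode(residuals):
    """Flatten the residual map and store (value, run_length) pairs."""
    flat = residuals.ravel()
    values, runs = [], []
    prev, count = int(flat[0]), 1
    for v in flat[1:]:
        v = int(v)
        if v == prev:
            count += 1
        else:
            values.append(prev)
            runs.append(count)
            prev, count = v, 1
    values.append(prev)
    runs.append(count)
    return np.array(values, dtype=np.int32), np.array(runs, dtype=np.int32)

def compress_residuals(residuals):
    """Run-length encode the residuals, then squeeze the result with LZMA."""
    values, runs = run_length_encode(residuals)
    payload = values.tobytes() + runs.tobytes()
    return lzma.compress(payload)

# Example: a mostly-zero residual map compresses to a small bitstream.
res = np.zeros((64, 2650), dtype=np.int32)
res[10, 100:110] = 1
print(len(compress_residuals(res)), "bytes")
```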
5. Experiments

In this section, we first introduce the datasets and the metrics in Sec. 5.1. Then we report compression results compared with strong baselines and prior art methods in Sec. 5.2, both quantitatively and qualitatively. We further evaluate the impact of compressed data on downstream perception tasks (3D detection of vehicles and pedestrians) in Sec. 5.3. Finally, we provide extensive analysis experiments to validate our design choices in Sec. 5.4.

5.1. Dataset and Metrics

Waymo Open Dataset (WOD) [28]. WOD is the main dataset we experiment with, as it provides rich lidar calibration data and full sensor poses. WOD includes a total of 1,150 sequences, with 798 for training and 202 for validation. Each sequence lasts around 20 seconds with a sampling frequency of 10 Hz. A 64-beam lidar is used, providing range images of 64 rows and 2,650 columns, with lidar calibration metadata (beam inclination angles) provided. The range channel is cropped to 75 m, and each raw range value is stored as a 32-bit float by default. We use the training set to train our deep model and evaluate on the validation set. Only the first-return range images are used in our experiments.

SemanticKITTI [8]. We also evaluate our method on SemanticKITTI (which enhances KITTI [13] with semantic labels) to compare with the prior art methods OctSqueeze [17] and MuSCLE [9] (since they do not release code, we cannot compare with them on the WOD). We directly apply the WOD-trained model on the SemanticKITTI test split (sequences 11-21). However, as KITTI only released the point cloud data but not the raw range images nor the sensor poses, we have to refer to the manual of the Velodyne lidar [2] used by KITTI to convert a point cloud to the spherical coordinate to get a pseudo range image with 64 rows and 2,088 columns. For our method, we compress the pseudo range images and do not additionally store the azimuth and elevation of the pixels, as their storage in actual Velodyne range images is negligible (elevations are known and azimuths can be compressed to less than 1 Kb per frame [34]).

Metrics. Following previous works [9, 14, 17], we use two geometric metrics to evaluate the reconstruction quality of the compressed point cloud data: the point-to-point Chamfer distance and the point-to-plane peak signal-to-noise ratio (PSNR). We report these metrics as a function of the bitrate, i.e., the average number of bits to store one lidar point.

The point-to-point Chamfer distance $CD_{sym}$ measures the average point distance between two point clouds (the smaller the better). For a given point cloud $P = \{ p_i \}_{i=1,\ldots,N}$ and the reconstructed point cloud $\hat{P} = \{ \hat{p}_j \}_{j=1,\ldots,M}$:

$$CD(P, \hat{P}) = \frac{1}{|P|} \sum_i \min_j \| p_i - \hat{p}_j \|_2 \quad (2)$$

$$CD_{sym}(P, \hat{P}) = \max \{ CD(P, \hat{P}),\; CD(\hat{P}, P) \} \quad (3)$$

The second metric, the peak signal-to-noise ratio (PSNR) [30] (the larger the better), measures the ratio between the "resolution" of the point cloud $r$ and the average point-to-plane error between the original point cloud $P$ and the reconstructed point cloud $\hat{P}$:

$$PSNR(P, \hat{P}) = 10 \log_{10} \frac{r^2}{\max \{ MSE(P, \hat{P}),\; MSE(\hat{P}, P) \}} \quad (4)$$

where $MSE(P, \hat{P}) = \frac{1}{|P|} \sum_i ((p_i - \hat{p}_i) \cdot n_i)^2$ is the point-to-plane distance, $\hat{p}_i$ is the closest point in $\hat{P}$ to $p_i$, and $r = \max_{p_i \in P} \min_{j \neq i} \| p_i - p_j \|_2$ is the intrinsic resolution of the original point cloud. We estimate the normal $n_i$ using Open3D [38] with $k = 12$ for the k nearest neighbors.
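These metrics can be computed with a KD-tree for the nearest-neighbor searches; the sketch below uses SciPy and assumes unit normals have already been estimated (the paper uses Open3D with k = 12 for that step):

```python
import numpy as np
from scipy.spatial import cKDTree

def chamfer(P, Q):
    """One-directional Chamfer distance CD(P, Q) as in Eq. 2."""
    d, _ = cKDTree(Q).query(P)          # nearest-neighbor distances P -> Q
    return d.mean()

def symmetric_chamfer(P, Q):
    """Symmetric Chamfer distance CD_sym as in Eq. 3."""
    return max(chamfer(P, Q), chamfer(Q, P))

def point_to_plane_psnr(P, P_hat, normals_P, normals_Phat):
    """Point-to-plane PSNR as in Eq. 4, with precomputed unit normals."""
    def mse(A, B, normals_A):
        _, idx = cKDTree(B).query(A)    # closest point in B for each point of A
        return np.mean(np.sum((A - B[idx]) * normals_A, axis=1) ** 2)
    # Intrinsic resolution r: the largest nearest-neighbor distance within P.
    d, _ = cKDTree(P).query(P, k=2)     # k=2: the first neighbor is the point itself
    r = d[:, 1].max()
    return 10 * np.log10(r ** 2 / max(mse(P, P_hat, normals_P),
                                      mse(P_hat, P, normals_Phat)))
```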
5.2. Compression Results

In this section, we compare our methods with competitive baselines as well as prior art lidar data compression methods. We focus on compressing the range channel or the 3D coordinates of the points, as it is the most studied attribute among the others (intensity, elongation) and some of the methods in comparison do not support compressing other attributes. See the supplementary material for more results on compressing the other channels. We adjust the quantization precision of the range images to achieve different compression rates (bits per point) for our method.

Figure 3. The deep prediction model. Given a range image patch from frame T with quantized attribute values (e.g., range), we lift pixels to the spherical coordinate with azimuth and elevation angles from the lidar calibration. To leverage context points from the past frame T-1, a query point is generated to find neighbors among points at frame T-1. Those neighbor points are then projected to the spherical coordinate of the pixel to be predicted. Our predictor takes the union of the intra-frame and temporal context points and predicts the attribute of the pixel (i, j) with anchor classification and regression (with each input point as an anchor).

Baselines. G-PCC [14] is a point cloud compression method proposed by MPEG, using octrees. Draco [1] is a popular point cloud compression algorithm based on KD-trees proposed by Google. We also compare with two prior art deep model-based methods³: OctSqueeze [17] is an octree-based method that uses a neural network to predict the next-level symbol of the octree; MuSCLE [9] further strengthens OctSqueeze by leveraging multi-sweep (temporal) data for the octree prediction. In terms of the range image representation, we compare with PNG (intra-frame) as well as HEVC (a video compression standard) on top of PNG for temporal range image compression. For the PNG compression, the range is coded with 16 bits with a varying scaling factor to control the distortion/compression rate. We also compare with Cluster [29], a range image-based lidar data compression algorithm with a pipeline of segmentation, clustering, 3D-HEVC encoding and ground prediction. Besides, the supplementary provides a further experiment comparing with an auto-encoder based method on range images (not included here due to its poor performance).

³There is another deep net based work, VoxelContextNet [25], yet as they did not release code nor the detailed definition of the evaluation metrics, we could not compare with them.

Implementation details. Our intra-frame prediction model, RIDDLE, takes in a context image patch of size 10 × 10 (the bottom-right pixel is masked out) and uses a PointNet [24]-like architecture for the prediction (without the T-Net structure, and with the output adapted to predict anchor classification and regression). The input to the network is a 3D point cloud in a spherical coordinate with the azimuth and elevation relative to the bottom-right pixel and the range relative to the mean range of the valid context points. Our temporal model, RIDDLE-T, uses the same network architecture as the intra-frame one but takes in an extra 100 points from the last scan (projected to the spherical coordinate of the next pixel). Please see the supplementary for more details.

Waymo Open Dataset results. We report the bitrate versus reconstruction quality metrics (PSNR, Chamfer distance) of competing methods on all frames from the sequences in the validation set of the Waymo Open Dataset. As shown in Fig. 4, our method significantly outperforms prior methods. At the same Chamfer distance of around 0.005, our method reduces the bitrate by more than 65% compared to G-PCC (from 10.78 bpp to 3.65 bpp). At a bitrate of around 4, our method reduces the distortion (measured by Chamfer distance) by more than 85%. Our method also has a larger bitrate improvement over previous methods when the reconstruction quality is higher. This indicates our method has more advantage over baselines when the data quality requirement is higher.

SemanticKITTI results. Since the prior art methods [9, 17] have not released their code or compression models, we turn to the SemanticKITTI dataset to compare with them (we got the raw values of the curves reported in the MuSCLE [9] paper from the authors). We apply our model trained on the Waymo Open Dataset directly to the SemanticKITTI lidar point clouds (by creating pseudo range images).
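A simplified version of this spherical projection is sketched below (uniform elevation binning is an approximation for illustration; the actual Velodyne beam layout comes from the sensor manual [2]):

```python
import numpy as np

def point_cloud_to_pseudo_range_image(points, num_rows=64, num_cols=2088):
    """Project an (N, 3) point cloud into a pseudo range image.

    Rows are assigned by binning elevation, columns by binning azimuth;
    the pixel value is the range. Empty pixels stay zero.
    """
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    r = np.linalg.norm(points, axis=1)
    azimuth = np.arctan2(y, x)                          # in [-pi, pi)
    elevation = np.arcsin(z / np.maximum(r, 1e-6))
    cols = ((azimuth + np.pi) / (2 * np.pi) * num_cols).astype(int) % num_cols
    e_min, e_max = elevation.min(), elevation.max()
    rows = ((elevation - e_min) / (e_max - e_min + 1e-9)
            * (num_rows - 1)).astype(int)
    image = np.zeros((num_rows, num_cols), dtype=np.float32)
    image[rows, cols] = r                               # last point wins on collisions
    return image
```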
As shown in Fig. 5, our method is more than 50% lower in bitrate (at around 4.3 bpp) at the same Chamfer distance of around 0.005 compared to all prior art methods, showing significant advantages. This strong lead is attributed to our choice of directly compressing the range images as well as to the effective deep model.

Figure 4. Evaluation of the compression methods with geometric metrics on the Waymo Open Dataset val set. Left: Chamfer distance vs. bits per point (bpp); Right: PSNR vs. bpp. At a certain bitrate, the lower the Chamfer distance or the higher the PSNR, the better the reconstruction quality.

Figure 5. Evaluation of the compression methods with geometric metrics on the SemanticKITTI test set. We only present our intra-frame model here as the per-pixel sensor pose is unavailable in SemanticKITTI.

Figure 6. Impact of lidar data compression on 3D object detection quality on the Waymo Open Dataset val set. We train PointPillars [19] detectors using the raw point clouds (with no compression) from the WOD train set and evaluate them with the compressed point clouds (or point clouds from the compressed range images) on the WOD validation set.

Qualitative results. In Fig. 7, we show the reconstructed lidar point clouds from our method, Draco and G-PCC. We can see that the point cloud reconstructed by our method remarkably resembles the original point cloud in geometry even when the bitrate is set very low, thanks to compressing directly on the range images to keep the point distribution pattern.

5.3. Impact on Downstream Perception Tasks

For applications like autonomous driving, we want to understand the impact of lidar data compression on downstream perception tasks such as 3D object detection. To understand such impact, we trained a widely used PointPillars detector [19] on uncompressed point clouds using the Waymo Open Dataset train set, for the vehicle class and the pedestrian class respectively. Detection quality is measured by mean average precision (mAP).

As shown in Fig. 6, our method outperforms other competing baselines in maintaining the best mAP at the same bitrate. At a bitrate of around 2, our method leads the second best method (G-PCC) by more than 1 point on vehicle detection and 3 points on pedestrian detection. We can also see that pedestrian detection is more sensitive to data distortion, probably due to the smaller average object sizes compared to vehicles.

5.4. Analysis Experiments

In this section we ablate our deep model in terms of architecture choice, loss design and temporal context. In order to compare prediction quality independently from the entropy encoder, we use prediction accuracy as the metric for the ablation studies. The prediction accuracy (acc.) is defined as the percentage of zero deltas (i.e., perfect predictions under quantization) in the range image residual map, under a specific quantization precision (e.g., δ = 0.1 m).⁴ A prediction q for the quantized range value p′ is counted as correct if |q − p′| < δ/2. The supplementary provides more analysis related to entropy encoders and model latency.

⁴Note that 0.1 m is not that coarse, as the average point displacement after the quantization is only 2.5 cm.

Table 1. Effects of prediction models.
  model                   acc. @0.1m
  previous valid value    54.35
  linear interpolation    54.64
  12-layer CNN            64.62
  PointNet (adapted)      65.75

Table 2. Effects of loss functions.
  loss function        acc. @0.1m
  MSE                  59.83
  MAE                  61.64
  multi-bin loss       59.66
  anchor cls. + reg.   65.75

Table 3. Effects of temporal input.
  temporal context      acc. @0.1m
  none (intra-frame)    65.75
  10 × 10 image         67.34
  100 knn points        69.23
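For reference, the accuracy reported in Tables 1-3 is simply the fraction of valid pixels whose prediction matches the quantized ground truth to within half a quantization step; a minimal sketch, with hypothetical argument names:

```python
import numpy as np

def prediction_accuracy(predictions, quantized_gt, valid, delta=0.1):
    """Fraction of valid pixels with |q - p'| < delta / 2 (i.e., zero residual)."""
    err = np.abs(predictions - quantized_gt)
    return float((err[valid] < delta / 2).mean())
```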
Figure 7. Visualization of reconstructed point clouds, colored by per-point Chamfer distance (error colormap in mm on the bottom). From left to right: raw (groundtruth, 32 bpp), G-PCC, Draco, PNG and RIDDLE (ours), each at 4.02 bpp. It is clear that our method, under the same bits per point, has much less distortion. Best viewed in color with zoom-in.

Effects of predictor choices. Table 1 compares several architecture choices. The simplest choice is to use the left valid pixel as the prediction for the current pixel: $\hat{I}_{i,j} = I'_{i,j-1}$. Another extension is to use a linear interpolation of close-by pixels: $\hat{I}_{i,j} = I'_{i,j-1} + I'_{i-1,j} - I'_{i-1,j-1}$. Note that for both cases, the first valid pixel is used in case the nearby one is an empty pixel. We see that deep models can significantly outperform linear models, while the point-cloud-based architecture shows a stronger empirical result compared to a ConvNet on the image representation.

Effects of loss functions. Table 2 compares several loss choices for our model supervision. With direct attribute prediction as a regression problem, we can see that using the mean absolute error (MAE, L1 loss) is superior to using the mean squared error (MSE, L2 loss), as it is affected less by the large errors on the object boundaries. Turning the depth regression problem into a multi-bin classification and regression problem (with classification and intra-bin regression for each depth bin of size 1 m) does not help much either, as shown in the third row. Our proposed formulation (anchor classification with regression) leads to a 4.11-point increase in prediction accuracy compared to the second best option of using the mean absolute error.

Effects of temporal contexts. Table 3 shows the benefits of adding temporal contexts to the prediction model. We see that even the naive concatenation of the image patch of the last frame with the same rows and columns (second row) can already help. A more careful handling of the temporal points by considering sensor poses (as described in Sec. 4.2) leads to more gains from using the temporal data.

6. Conclusion

With improving lidar sensor resolution and growing data volume, how to efficiently store and transmit lidar data becomes a challenging problem in many 3D applications, such as autonomous driving and augmented reality. To address this challenge, we propose a novel lidar data compression algorithm named RIDDLE (Range Image Deep DeLta Encoding), which combines the succinctness of traditional delta encoding and the expressiveness of deep neural networks, with support for using temporal contexts. Experiments on the Waymo Open Dataset and KITTI show that, compared to previous methods, the proposed approach yields significant improvement in the point cloud reconstruction quality and the downstream perception model performance under the same compression rates.

References

[1] Draco. https://github.com/google/draco . Accessed: 2021-09-28. 2, 6
[2] Velodyne hdl-64e. https://gpsolution.oss-cn-beijing.aliyuncs.com/manual/LiDAR/MANUAL%2CUSERS%2CHDL-64E_S3.pdf . Accessed: 2021-10-04. 5
[3] Jae-Kyun Ahn, Kyu-Yul Lee, Jae-Young Sim, and Chang-Su Kim. Large-scale 3d point cloud compression using adaptive radial distance prediction in hybrid coordinate domains.
IEEE Journal of Selected Topics in Signal Processing , 9(3):422–434, 2015. 2 [4] I. Armeni, A. Sax, A. R. Zamir, and S. Savarese. Joint 2D- 3D-Semantic Data for Indoor Scene Understanding. ArXiv e-prints , Feb. 2017. 1 [5] Johannes Ball ́ e, Valero Laparra, and Eero P Simoncelli. End-to-end optimized image compression. arXiv preprint arXiv:1611.01704 , 2016. 2 [6] Johannes Ball ́ e, David Minnen, Saurabh Singh, Sung Jin Hwang, and Nick Johnston. Variational image compression with a scale hyperprior. arXiv preprint arXiv:1802.01436 , 2018. 2, 13 [7] Peter van Beek. Image-based compression of lidar sensor data. Electronic Imaging , 2019(15):43–1, 2019. 2 [8] Jens Behley, Martin Garbade, Andres Milioto, Jan Quen- zel, Sven Behnke, Cyrill Stachniss, and Jurgen Gall. Se- mantickitti: A dataset for semantic scene understanding of lidar sequences. In Proceedings of the IEEE International Conference on Computer Vision , pages 9297–9307, 2019. 5 [9] Sourav Biswas, Jerry Liu, Kelvin Wong, Shenlong Wang, and Raquel Urtasun. Muscle: Multi sweep compres- sion of lidar using deep entropy models. arXiv preprint arXiv:2011.07590 , 2020. 1, 2, 5, 6 [10] Mario Botsch, Andreas Wiratanaya, and Leif Kobbelt. Effi- cient high quality rendering of point sampled geometry. Ren- dering Techniques , 2002:13th, 2002. 2 [11] Angela Dai, Angel X Chang, Manolis Savva, Maciej Hal- ber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In CVPR , 2017. 1 [12] Olivier Devillers and P-M Gandoin. Geometric compres- sion for interactive transmission. In Proceedings Visualiza- tion 2000. VIS 2000 (Cat. No. 00CH37145) , pages 319–326. IEEE, 2000. 2 [13] Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets robotics: The kitti dataset. The Inter- national Journal of Robotics Research , 32(11):1231–1237, 2013. 2, 5 [14] D Graziosi, O Nakagami, S Kuma, A Zaghetto, T Suzuki, and A Tabatabai. An overview of ongoing point cloud com- pression standardization activities: video-based (v-pcc) and geometry-based (g-pcc). APSIPA Transactions on Signal and Information Processing , 9, 2020. 2, 5, 6, 13 [15] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceed- ings of the IEEE conference on computer vision and pattern recognition , pages 770–778, 2016. 11 [16] Hamidreza Houshiar and Andreas N ̈ uchter. 3d point cloud compression using conventional image compression for effi- cient data transmission. In 2015 XXV International Confer- ence on Information, Communication and Automation Tech- nologies (ICAT) , pages 1–8. IEEE, 2015. 2 [17] Lila Huang, Shenlong Wang, Kelvin Wong, Jerry Liu, and Raquel Urtasun. Octsqueeze: Octree-structured en- tropy model for lidar compression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages 1313–1323, 2020. 1, 2, 5, 6, 11, 12, 13 [18] Martin Isenburg. Laszip: lossless compression of lidar data. Photogrammetric Engineering and Remote Sensing , 79, 02 2013. 12 [19] Alex H Lang, Sourabh Vora, Holger Caesar, Lubing Zhou, Jiong Yang, and Oscar Beijbom. Pointpillars: Fast encoders for object detection from point clouds. In CVPR , 2019. 7 [20] Siwei Ma, Xinfeng Zhang, Chuanmin Jia, Zhenghui Zhao, Shiqi Wang, and Shanshe Wang. Image and video compres- sion with neural networks: A review. IEEE Transactions on Circuits and Systems for Video Technology , 30(6):1683– 1698, 2019. 
2 [21] Fabian Mentzer, Eirikur Agustsson, Michael Tschannen, Radu Timofte, and Luc Van Gool. Practical full resolu- tion learned lossless image compression. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages 10629–10638, 2019. 2 [22] Fabrizio Nenci, Luciano Spinello, and Cyrill Stachniss. Ef- fective compression of range data streams for remote robot operations using h.264. In 2014 IEEE/RSJ International Conference on Intelligent Robots and Systems , pages 3794– 3799, 2014. 2 [23] Aaron van den Oord, Nal Kalchbrenner, Oriol Vinyals, Lasse Espeholt, Alex Graves, and Koray Kavukcuoglu. Con- ditional image generation with pixelcnn decoders. arXiv preprint arXiv:1606.05328 , 2016. 2, 4 [24] Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition , pages 652–660, 2017. 4, 6, 11 [25] Zizheng Que, Guo Lu, and Dong Xu. Voxelcontext-net: An octree based framework for point cloud compression. In Pro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages 6042–6051, 2021. 1, 2, 6 [26] Ruwen Schnabel and Reinhard Klein. Octree-based point- cloud compression. In PBG@ SIGGRAPH , pages 111–120, 2006. 2 [27] Shuran Song, Samuel P Lichtenberg, and Jianxiong Xiao. Sun rgb-d: A rgb-d scene understanding benchmark suite. In CVPR , 2015. 1 [28] Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurelien Chouard, Vijaysai Patnaik, Paul Tsui, James Guo, Yin Zhou, Yuning Chai, Benjamin Caine, et al. Scalability in perception for autonomous driving: Waymo open dataset. In Proceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages 2446–2454, 2020. 2, 3, 5 [29] Xuebin Sun, Han Ma, Yuxiang Sun, and Ming Liu. A novel point cloud compression algorithm based on cluster- ing. IEEE Robotics and Automation Letters , 4(2):2132– 2139, 2019. 6 [30] Dong Tian, Hideaki Ochimizu, Chen Feng, Robert Cohen, and Anthony Vetro. Geometric distortion metrics for point cloud compression. In 2017 IEEE International Conference on Image Processing (ICIP) , pages 3460–3464. IEEE, 2017. 5 [31] George Toderici, Damien Vincent, Nick Johnston, Sung Jin Hwang, David Minnen, Joel Shor, and Michele Covell. Full resolution image compression with recurrent neural net- works. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition , pages 5306–5314, 2017. 2 [32] James Townsend, Thomas Bird, Julius Kunze, and David Barber. Hilloc: Lossless image compression with hierarchi- cal latent variable models. arXiv preprint arXiv:1912.09953 , 2019. 2 [33] Chenxi Tu, Eijiro Takeuchi, Alexander Carballo, and Kazuya Takeda. Point cloud compression for 3d lidar sensor using recurrent neural network with residual blocks. In 2019 In- ternational Conference on Robotics and Automation (ICRA) , pages 3274–3280. IEEE, 2019. 1, 2 [34] Chenxi Tu, Eijiro Takeuchi, Chiyomi Miyajima, and Kazuya Takeda. Compressing continuous point cloud data using im- age compression methods. In 2016 IEEE 19th International Conference on Intelligent Transportation Systems (ITSC) , pages 1712–1719. IEEE, 2016. 3, 5 [35] Aaron Van Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. Pixel recurrent neural networks. In In- ternational Conference on Machine Learning , pages 1747–1756. PMLR, 2016. 1, 2 [36] Louis Wiesmann, Andres Milioto, Xieyuanli Chen, Cyrill Stachniss, and Jens Behley. 
Deep compression for dense point cloud maps. IEEE Robotics and Automation Letters, 6(2):2060–2067, 2021. 2
[37] Wei Yan, Shan Liu, Thomas H Li, Zhu Li, Ge Li, et al. Deep autoencoder-based lossy geometry compression for point clouds. arXiv preprint arXiv:1905.03691, 2019. 2
[38] Qian-Yi Zhou, Jaesik Park, and Vladlen Koltun. Open3d: A modern library for 3d data processing. arXiv preprint arXiv:1801.09847, 2018. 5

Supplementary

A. Overview

In this supplementary, we provide more details of our method, extra analysis experiment results and visualizations. In Sec. B, we describe more details of the deep predictive model, including its network architecture, losses and training process, as well as more explanation of the input data transformations. In Sec. C, we provide more analysis experiment results on model latency, effects of the entropy encoder choices and effects of the context sizes. In Sec. D, we apply our method to compress lidar data attributes beyond the range values. Finally, in Sec. E, we provide more visualizations of our model's predictions.

B. Details of the Predictive Models

Model architecture. For the deep prediction model, we adapt the structure of PointNet [24]. Details of the layers are visualized in Fig. 8. To reduce the latency, the network channel sizes are halved compared to the original architecture and the T-Nets are removed. After the concatenation of global and local features, we split the network into an anchor-classification branch and a residual branch. Each branch is an MLP with layer sizes [128, 64, # of anchors], where the number of anchors is 99 for the intra-prediction model and 199 for the temporal model. For the temporal model, we build the KD-trees using the neighbors.NearestNeighbors method of the Scikit-learn library. We use the left valid depth and the up valid depth as estimates, and each estimate queries 50 points from the last frame as the temporal context.

Figure 8. Deep predictive model architecture. n is the number of context points, d is the input point dimension (d = 3 for the intra-prediction model, d = 4 for the temporal model).

Loss functions. At training time, we train the deep prediction model end-to-end with the anchor classification and the anchor residual regression losses. We weight the classification loss by a factor $\gamma$:

$$L = \gamma L_{classification} + L_{regression} \quad (5)$$

The classification loss is a cross-entropy loss across $hw - 1$ classes for the intra-frame prediction model and $hw - 1 + m$ classes for the temporal model. The ground truth class is selected as the index of the pixel with the closest distance (in value) to the to-be-predicted pixel. As the inputs are quantized values, there can be ties. To avoid ties, we add a bias term to the distances to favor the pixels that are closer in angles (absolute delta azimuth + absolute delta elevation) to the to-be-predicted pixel. The regression loss is an L1 loss between the predicted residual corresponding to the pixel of the ground truth anchor and the ground truth residual of that pixel.

Training. We learn the weights of the prediction model by training on the range images from the Waymo Open Dataset train set. We randomly crop patches of shape 10 × 10 from the range images and train the deep network with batch size 128 and an Adam optimizer. We use the loss weight $\gamma = 0.01$. The initial learning rate is 0.00005, and we decay the learning rate by 10x at step 1500k and step 3000k. We normalize the range values to [0, 1] by dividing them by 75 m. To mimic the same setting as in decoding, the training inputs are quantized. The ground truth attribute values are kept in full precision for more accurate supervision. For pixel locations at the boundary of the range images, we enforce the same patch size via zero padding.
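A PyTorch-style sketch of this objective is given below (our own illustration; the training framework and exact implementation are not specified here): the classification term is a cross-entropy over anchors and the regression term is an L1 loss on the residual predicted for the ground-truth anchor.

```python
import torch
import torch.nn.functional as F

def anchor_prediction_loss(anchor_logits, anchor_residuals,
                           gt_anchor, gt_residual, gamma=0.01):
    """Classification + regression loss for the anchor-based predictor.

    anchor_logits:    (B, A) scores over the A anchor pixels.
    anchor_residuals: (B, A) predicted residual per anchor.
    gt_anchor:        (B,)   index of the anchor closest to the target value.
    gt_residual:      (B,)   target value minus the ground-truth anchor's value.
    """
    cls_loss = F.cross_entropy(anchor_logits, gt_anchor)
    # Only the residual predicted for the ground-truth anchor is supervised.
    picked = anchor_residuals.gather(1, gt_anchor.unsqueeze(1)).squeeze(1)
    reg_loss = F.l1_loss(picked, gt_residual)
    return gamma * cls_loss + reg_loss
```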
There are two strategies for input quantization. The first strategy is to train different models for different quantization precisions, and for each model we use a fixed quantization precision for the input. The second strategy is to use mixed precisions to quantize the input during training. Specifically, we uniformly sample a quantization precision for a given input from 0.0001 to 0.5000 with a sample bin size of 0.0001. With the second strategy, we only need to train one model for different compression rates. From our experiments, we observe that the second strategy does not harm the compression rates at individual quantization precisions.

Baselines. For the previous valid value method, we predict the attribute $I_{i,j}$ at row $i$, column $j$ to be $\hat{I}_{i,j} = I'_{i,j-1}$ if $I'_{i,j-1}$ is a valid pixel (not void due to an empty laser return); otherwise we repeatedly decrement $j$ by 1 until the pixel is valid. For the linear interpolation baseline method, we predict $\hat{I}_{i,j} = I'_{i,j-1} + I'_{i-1,j} - I'_{i-1,j-1}$. For the 12-layer CNN method, we adapt a structure similar to ResNet [15]. The network is composed of two convolutional layers (channel sizes 64, 32) and 5 residual blocks (channel sizes 32, 32 for each block), and all convolutional layers have filter size 3x3.

C. More Analysis

Analysis of the model latency. Table 5 shows the latency comparison between our method, OctSqueeze [17] and G-PCC from MPEG. To achieve a faster decompression speed, during compression we split a range image into smaller blocks and run compression in parallel on these blocks. During decompression, we decode in parallel on these smaller blocks. Our experiments show that if we split a 64 by 2650 range image into 212 blocks of size 16 by 50, the bitrate only increases by 0.5%, which is nearly negligible. If we split it into 424 blocks of size 16 by 25, the bitrate increases by 5.13%. Table 5 reports the latency of our method when splitting into blocks of size 16 by 26 during decoding. We benchmark our method on an NVIDIA Tesla V100 GPU. Our deep model is accelerated by TensorRT with float16 quantization. Operations other than model inference (entropy encoding) are written in C++. The latency of G-PCC is benchmarked on CPU using MPEG's implementation (github.com/MPEGGroup/mpeg-pcc-tmc13). The latency of OctSqueeze (depth 16) is from the original paper [17]. Moreover, we believe further speedup of our method could be achieved by methods like predicting multiple pixels at a time or shared point embeddings for streamed prediction.
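A sketch of this blocking scheme is shown below (the per-block codec and the parallelism backend here are stand-ins, not the TensorRT/C++ implementation used for the reported numbers): the range image is tiled and each tile is compressed independently so that tiles can later be decoded in parallel.

```python
import lzma
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def split_into_blocks(image, block_h=16, block_w=50):
    """Tile an (H, W) range image into non-overlapping blocks."""
    H, W = image.shape
    return [image[i:i + block_h, j:j + block_w]
            for i in range(0, H, block_h)
            for j in range(0, W, block_w)]

def compress_blocks(image, compress_fn, block_h=16, block_w=50):
    """Compress each block independently so blocks can be decoded in parallel."""
    blocks = split_into_blocks(image, block_h, block_w)
    with ThreadPoolExecutor() as pool:
        return list(pool.map(compress_fn, blocks))

# Example with a trivial per-block codec (LZMA over raw bytes).
image = np.random.rand(64, 2650).astype(np.float32)
streams = compress_blocks(image, lambda b: lzma.compress(b.tobytes()))
print(len(streams), "blocks")   # 4 x 53 = 212 blocks for 16 x 50 tiles
```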
Choices of entropy encoder. After the predictive delta encoding, we get a residual map of the range image. An entropy encoder is used to leverage the sparsity pattern in the residual map to compress it. Given an accurate prediction model, most of the residuals would be zero. In addition, as shown in Fig. 9, larger quantization steps round more residuals to zero, so the residuals become more sparse. We adopt two methods to entropy encode the residuals. In practice, we can select the entropy encoder with the highest compression rate depending on the quantization rate and the predictor.

The first method is to represent the residuals using a sparse representation. Given an array of residuals, we represent the array with the values of the nonzero residuals and their indices in the array. For a long run of sparse residuals, the sparse representation is quite memory efficient. After obtaining the sparse representation of the residuals, we use arithmetic encoding to further reduce its size.

The second method is to represent the residuals using run-length encoding. We first flatten the residual map to a vector and then represent it with the values and the run-lengths of the values. This representation achieves better compression rates when the residuals are not that sparse, i.e., when the quantization step size is small. After obtaining the run-length representation, we use the LZMA compressor to further reduce its size.

Table 4. Ablation study of the entropy encoders.
  entropy encoder                       quantization   bpp
  sparse repre. + arithmetic encoding   0.1m           2.28
  varints+LZMA                          0.1m           2.33
  huffman encoding                      0.1m           2.49
  arithmetic encoding                   0.1m           2.41
  sparse repre. + arithmetic encoding   0.02m          4.17
  varints+LZMA                          0.02m          4.08
  huffman encoding                      0.02m          4.25
  arithmetic encoding                   0.02m          4.21

Table 4 shows that different entropy encoders have different compression rates for the residuals. For a quantization precision of 0.1m, the residuals are more sparse, and the compression rate of using the sparse representation (representing non-zero residuals by specifying their row and column indices and the residual values) with arithmetic encoding is higher than that of varints with LZMA. However, for a quantization precision of 0.02m, the compression rate of varints with LZMA is higher, due to the decrease of zero residuals.

Analysis on input choices. Table 7 shows how the input choices affect the prediction accuracy. Instead of inputting an intra-frame context of 10 by 10 (minus the bottom-right one), we can just input the up 9 pixels (row 1), the left 9 pixels (row 2), or a smaller context size (row 4). We can see that enlarging the receptive field of the input with context from both the upper left and upper right can improve the prediction accuracy (row 3 vs. rows 1 and 2). Moreover, including azimuth and inclination as additional input attributes can also improve the predictor (row 4 vs. row 3) compared to just using the relative row/column indices as input.

Generalization of the method. When we apply the deep predictive model trained on 64-beam frames from the Waymo Open Dataset directly to subsampled 32-beam frames, it achieves 2.55 bpp at 0.1m depth precision (only slightly larger than the 2.23 bpp on 64 beams). In addition, the compressor trained on WOD also works well on KITTI (Fig. 5), which shows its generalization ability.

Comparison with LASzip. Benchmarked on the Waymo Open Dataset, LASzip [18] has 67.6 PSNR and 0.0048 Chamfer distance at 10.62 bpp. Our method achieves 72.39 PSNR and 0.0026 Chamfer distance at 4.51 bpp, which clearly outperforms it.

Table 5. Latencies of lidar data compression methods. Note that G-PCC is evaluated using CPU on the Waymo Open Dataset (WOD). OctSqueeze is evaluated on the KITTI dataset (with a similar range image resolution to WOD) and our method is evaluated on the WOD. Both OctSqueeze and our method use GPU for model inference.
  method            compressing (ms)   decompressing (ms)
  G-PCC [14]        1594.5             1052.1
  OctSqueeze [17]   106.0              902.3
  RIDDLE (ours)     532.51             966.3
Table 6. Breakdown of the encoding time of RIDDLE. Preprocessing includes the time to compute and compress a binary mask indicating whether a pixel is a valid return in the range image. Network refers to the model prediction time. Entropy encoding refers to the time used by the entropy encoder.
  preprocessing (ms)   network (ms)   entropy encoding (ms)
  16.23                487.4          28.88

D. Compression of More Attributes

Since a lidar point cloud may contain additional attributes (e.g., intensity, elongation) besides the range values, in this section we show how much we can compress the attributes other than range. We train a network to take in multi-channel range images and output multi-channel predictions. Specifically, we train a network on the Waymo Open Dataset, which contains 3 channels for each point: range, intensity and elongation. The network is modified to have 3 anchor-classification branches and 3 residual branches for the 3 attributes. From Table 8, rows 1-4, we can see that a quantization precision of 0.02m for range, of 0.1 for intensity, or of 0.1 for elongation has a similar effect on the object detector. With those quantization precisions for each attribute, the range values account for most of the storage cost (4.04 bpp) compared to the other two (0.88 bpp for intensity and 0.78 bpp for elongation).

E. More Visualizations

The distributions of residuals. Fig. 9 shows the distribution of the range residual maps after our deep delta encoding step. We can see that the larger the quantization interval, the more concentrated the residuals are (lower entropy), which explains the lower bitrate after the compression. Note that for a quantization size of 0.1m, more than 70% of the predictions have zero error compared to the ground truth quantized range image.

Table 7. Effects of context size and input format. We used the intra-frame model for this evaluation.
  context size   input format                        acc. @0.1m
  10 x 1         (Δ azimuth, Δ inclination, depth)   37.53
  1 x 10         (Δ azimuth, Δ inclination, depth)   60.00
  5 x 10         (Δ row index, Δ col index, depth)   65.02
  5 x 10         (Δ azimuth, Δ inclination, depth)   65.21
  10 x 10        (Δ azimuth, Δ inclination, depth)   65.75

Auto-encoder-based compression. We further compare our method with an auto-encoder-based image compression algorithm [6]. The auto-encoder is trained with a learning rate of 0.0001 and an Adam optimizer. The range values are scaled to [0, 1] by 75m instead of by 255 as for RGB images. Fig. 10 shows the reconstructed point clouds of the auto-encoder-based method and our method under similar bitrates. The colors of the points in the visualizations demonstrate that our method has much better reconstruction quality compared to this auto-encoder baseline. The auto-encoder-based range image compression method does poorly especially at the boundary between foreground points and background points.

Table 8. Compression of different attributes and their impact on 3D detection. The uncompressed attributes are saved as 32-bit float numbers. We see that quantizing the intensity or elongation to 0.1 or quantizing the range value to 0.02m has little impact on the detection mAPs. At these selected quantization rates, the range channel accounts for most of the bpp (around 70%).
  precision (range / intensity / elongation)   bpp (range / intensity / elongation)   total bpp   vehicle mAP   pedestrian mAP
  -     / -   / -      32   / 32   / 32     96.00   69.59   65.62
  0.02m / -   / -      4.04 / 32   / 32     68.04   69.60   65.63
  -     / 0.1 / -      32   / 0.88 / 32     64.88   69.59   65.84
  -     / -   / 0.1    32   / 32   / 0.78   64.78   69.59   65.62
  0.02m / 0.1 / 0.1    4.04 / 0.88 / 0.78   5.70    69.59   65.62
Figure 9. Distribution of residuals at different quantization precisions: (a) 0.02m, (b) 0.1m, (c) 0.2m. The proposed deep prediction model is able to accurately model the joint distributions of range image pixel attributes, resulting in a concentrated distribution of residuals with low entropy.

Figure 10. Visualization of reconstructed point clouds, colored by per-point Chamfer distance (error colormap on the bottom). From left to right: raw (groundtruth, 32 bpp), RIDDLE (ours, 2.19 bpp) and auto-encoder (2.20 bpp). It is clear that our method, under the same bits per point, has much less distortion. Best viewed in color with zoom-in.