STINet: Spatio-Temporal-Interactive Network for Pedestrian Detection and Trajectory Prediction

Zhishuai Zhang1,2* Jiyang Gao1 Junhua Mao1 Yukai Liu1 Dragomir Anguelov1 Congcong Li1
1Waymo LLC  2Johns Hopkins University
zzhang99@jhu.edu, {jiyanggao, junhuamao, liuyukai, dragomir, congcongli}@waymo.com

Abstract

Detecting pedestrians and predicting their future trajectories are critical tasks for numerous applications, such as autonomous driving. Previous methods either treat detection and prediction as separate tasks or simply add a trajectory regression head on top of a detector. In this work, we present a novel end-to-end two-stage network: the Spatio-Temporal-Interactive Network (STINet). In addition to 3D geometry modeling of pedestrians, we model the temporal information for each pedestrian. To do so, our method predicts both current and past locations in the first stage, so that each pedestrian can be linked across frames and comprehensive spatio-temporal information can be captured in the second stage. We also model the interaction among objects with an interaction graph, which gathers information from neighboring objects. Comprehensive experiments on the Lyft Dataset and the recently released large-scale Waymo Open Dataset, for both object detection and future trajectory prediction, validate the effectiveness of the proposed method. On the Waymo Open Dataset, we achieve a bird's-eye-view (BEV) detection AP of 80.73 and a trajectory prediction average displacement error (ADE) of 33.67 cm for pedestrians, which establishes the state of the art for both tasks.

1. Introduction

To drive safely and smoothly, self-driving cars (SDCs) not only need to detect where objects are currently (i.e., object detection), but also need to predict where they will go in the future (i.e., trajectory prediction). Among these objects, pedestrians are an important and difficult type. The difficulty comes from the complicated properties of pedestrian appearance and behavior, e.g., deformable shape and inter-personal relations [7]. In this paper, we tackle the problem of joint pedestrian detection and trajectory prediction from a sequence of point clouds, as illustrated in Figure 1.

*Work done during an internship at Waymo.

Figure 1. Given a sequence of current and past point clouds, our task is to detect pedestrians in the current frame and predict their future trajectories. In this figure, white points are the input point cloud sequence (stacked for visualization), yellow boxes are detected objects, and cyan lines are predicted future trajectories.

Traditionally, this problem is tackled by dividing the perception pipeline into multiple modules: object detection [6, 13, 15, 16, 20, 21, 29, 30], tracking [18] and trajectory prediction [2, 7, 9]; latter modules take the outputs of the former modules. Although such a strategy makes each sub-module easy to design and implement, it sacrifices the potential advantage of joint optimization. Latter modules can lose critical information bottlenecked by the interfaces between sub-modules; e.g., a pedestrian's future trajectory depends on many useful geometry features from the raw sensor data, which may be abstracted away in the detection/tracking stage. To this end, researchers have recently proposed several end-to-end neural networks that detect objects and predict trajectories simultaneously.
FaF [17] and IntentNet [4] are two representative methods; both are designed based on single-stage detectors (SSD) [16], and in addition to the original anchor classification and regression of SSD, they also regress a future trajectory for each anchor.

We observe two major issues that are critical for joint detection and trajectory prediction but are not addressed by previous end-to-end methods. 1) Temporal modeling at the object level: the existence and future trajectory of an object are embedded in both current and past frames. Current methods simply reuse a single-stage detector and fuse the temporal information in the backbone CNN in an object-agnostic manner, either via feature concatenation or 3D CNNs [4, 17]. Such coarse-level fusion can lose fine-grained temporal information for each object, which is critical for both tasks. 2) Interaction modeling among objects: the future trajectory of an object can be influenced by other objects; e.g., a pedestrian walking inside a group may tend to follow the others. Existing methods [4, 17] do not explicitly model interactions among objects.

To address these issues, we propose an end-to-end Spatio-Temporal-Interactive network (STINet) to jointly model pedestrians' temporal and interactive information. The proposed network takes a sequence of point clouds as input, detects the current location of pedestrians, and predicts their future trajectories. Specifically, there are three sub-components in STINet: a backbone network, a proposal generation network, and a proposal prediction network. In the backbone net, we adopt a structure similar to PointPillars [13] and apply it to each frame of the point cloud; the output feature maps from multiple frames are then combined. The proposal generation network takes feature maps from the backbone net and generates potential pedestrian instances with both their current and past locations (i.e., temporal proposals); such temporal proposals allow us to link the same object across different frames. In the third module (i.e., the prediction network), we use the temporal proposals to explicitly gather the geometry appearance and temporal dynamics of each object. To reason about the interaction among pedestrians, we build a graph layer that gathers information from surrounding pedestrians. After extracting the above spatio-temporal-interactive feature for each proposal, the detection and prediction heads use the feature to regress the current detection bounding box and the future trajectory.

Comprehensive experiments are conducted on the Waymo Open Dataset [1] and the Lyft Dataset [12] to demonstrate the effectiveness of STINet. Specifically, it achieves an average precision of 80.73 for bird's-eye-view pedestrian detection and an average displacement error of 33.67 cm for trajectory prediction on the Waymo Open Dataset. It achieves real-time inference speed, taking only 74.6 ms for inference on a range of 100 m by 100 m.

The main contributions of our work are four-fold:
• We build an end-to-end network tailored to model the pedestrian past, current and future simultaneously.
• We propose to generate temporal proposals with both current and past boxes. This enables learning a comprehensive spatio-temporal representation for pedestrians, covering their geometry, dynamic movement and history path, in an end-to-end manner without explicitly associating objects across frames.
• We propose to build a graph among pedestrians to reason about their interactions, which further improves trajectory prediction quality.
• We establish state-of-the-art performance for both detection and trajectory prediction on the Lyft Dataset and the recent large-scale, challenging Waymo Open Dataset.

2. Related work

2.1. Object detection

Object detection is a fundamental task in computer vision and autonomous driving. Recent approaches can be divided into two groups: single-stage detection [15, 16, 20] and two-stage detection [6, 21]. Single-stage detectors perform classification and regression directly on backbone features, while two-stage detectors generate proposals based on backbone features and extract proposal features for second-stage classification and regression. Single-stage detectors have a simpler structure and faster speed; however, they lose the flexibility to deal with complex object behaviors, e.g., explicitly capturing pedestrians moving across frames with different speeds and history paths. In this work, we follow the two-stage detection framework and predict object boxes for both current and past frames as proposals, which are further processed to extract their geometry and movement features.

2.2. Temporal proposals

Temporal proposals have been shown to be beneficial for action localization [10, 11], where associating temporal proposals from different video clips helps leverage the temporal continuity of video frames. [25] proposed to link temporal proposals throughout a video to improve video object detection. In our work, we also exploit temporal proposals, and we go a step further to investigate how to build comprehensive spatio-temporal representations of proposals to improve future trajectory prediction. This is a hard task since no inputs are available for the future. We also investigate learning interactions between proposals via a graph. We show that these spatio-temporal features can effectively model objects' dynamics and provide accurate detection and prediction of their future trajectories.

2.3. Relational reasoning

An agent's behavior can be influenced by other agents, which naturally connects to relational reasoning [3, 23]. Graph neural networks have shown strong capability in relational modeling in recent years. Wang et al. formulated videos as space-time graphs and showed their effectiveness on the video classification task [26]. Sun et al. designed a relational recurrent network for action detection and anticipation [24]. Yang et al. proposed to build an object relationship graph for the task of scene graph generation [28].

Figure 2. Overview of the proposed method. It takes a sequence of point clouds as input, and detects pedestrians and predicts their future trajectories simultaneously. The point clouds are processed by Pillar Feature Encoding [13, 30] to generate Pillar Features. Each Pillar Feature is fed into a backbone ResUNet [22] to get backbone features. A Temporal Region Proposal Network (T-RPN) takes the backbone features and generates a temporal proposal with past and current boxes for each object. The Spatio-Temporal-Interactive (STI) Feature Extractor learns features for each temporal proposal, which are used for the final detection and trajectory prediction.
2.4. Trajectory prediction

Predicting the future trajectory of objects is an important task, especially for autonomous driving. Previous research has been conducted with perception objects as inputs [2, 5, 7, 9, 14]. Recently, FaF [17] and IntentNet [4] focused on end-to-end trajectory prediction from raw point clouds. However, they simply re-used a single-stage detection framework and added new regression heads to it. In our work, we exploit a temporal region proposal network and explicitly model Spatio-Temporal-Interactive (STI) representations of pedestrians, and our experiments show that the proposed STI modeling is superior for both detection and trajectory prediction of pedestrians.

3. Proposed method

In this section, we discuss our proposed network in detail. The overview of our proposed method is shown in Figure 2 and can be divided into three steps, which we discuss in the following subsections.

3.1. Backbone network

The backbone of our network is illustrated in Figure 3. The input is a sequence of point clouds with $t'$ frames, noted as $[PC_{-(t'-1)}, PC_{-(t'-2)}, \cdots, PC_0]$, which corresponds to the lidar sensor input from the past $t'-1$ frames as well as the current frame. All point clouds are calibrated to the SDC's pose at the current frame so that ego-motion is discarded. To build rich pillar features while keeping a feasible memory usage, we generate $t$ pillar features from the $t'$ input frames. Consecutive $t'/t$ point clouds $PC_{-(j+1)t'/t+1}, \cdots, PC_{-jt'/t}$ are processed with Voxelization [13, 30] and then concatenated to generate a pseudo image $I_j$ (i.e., Pillar Features) with shape $H \times W \times C_{in}$.

Figure 3. Backbone of the proposed network. Upper: overview of the backbone. The input point cloud sequence is fed to Voxelization and a PointNet to generate pseudo images, which are then processed by a ResNet U-Net to generate the final backbone feature sequence. Lower: detailed design of the ResNet U-Net.

Thus the output of Pillar Feature Encoding is a sequence of $t$ Pillar Features $[I_{-(t-1)}, I_{-(t-2)}, \cdots, I_0]$. Next, we adopt a backbone CNN similar to the one proposed in [22], as shown in the lower part of Figure 3. Each Pillar Feature $I_j$ is first processed by three ResNet-style blocks to generate intermediate features with shapes $\mathbb{R}^{H \times W \times C_0}$, $\mathbb{R}^{\frac{1}{2}H \times \frac{1}{2}W \times C_1}$ and $\mathbb{R}^{\frac{1}{4}H \times \frac{1}{4}W \times C_2}$. Then we use deconvolution layers to upsample them to the same spatial shape as $I_j$. The concatenation of the upsampled features serves as the backbone feature of $I_j$, noted as $B_j$.

3.2. Temporal proposal generation

In order to explicitly model objects' current and past knowledge, we propose a temporal region proposal network (T-RPN) to generate object proposals with both current and past boxes. T-RPN takes the backbone feature sequence $[B_{-(t-1)}, B_{-(t-2)}, \cdots, B_0]$ as input, concatenates the features in the channel dimension, and applies a $1 \times 1$ convolution to generate a temporal-aware feature map. Classification, current-frame regression and past-frames regression are generated by applying $1 \times 1$ convolutional layers over the temporal-aware feature map, to classify and regress the pre-defined anchors. The temporal region proposal network is supervised by ground-truth objects' current and past locations.
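For concreteness, the T-RPN head described above can be written as a short sketch. This is a minimal illustration under our own assumptions (the hidden width, the number of anchors per location, and the helper name `build_trpn_head` are not from the paper); only the structure follows the description: channel-wise concatenation, a 1x1 convolution producing the temporal-aware map, and separate 1x1 convolutional heads for classification, current-frame regression (5 values per anchor) and past-frames regression (3 values per anchor per past frame).

```python
import tensorflow as tf

def build_trpn_head(backbone_features, num_anchors=4, t=3, hidden=128):
    """Sketch of the Temporal RPN head.

    backbone_features: list of t tensors [B_-(t-1), ..., B_0],
        each of shape [batch, H, W, C].
    Returns per-location anchor classification logits, current-frame
    regression (dx, dy, dw, dl, dh) and past-frame regression
    (dx, dy, dh for each of the t-1 past frames).
    """
    # Concatenate the per-frame backbone features along channels and
    # fuse them into a temporal-aware feature map with a 1x1 convolution.
    fused = tf.concat(backbone_features, axis=-1)
    temporal_aware = tf.keras.layers.Conv2D(hidden, 1, activation='relu')(fused)

    # 1x1 convolutional heads over the temporal-aware feature map.
    cls_logits = tf.keras.layers.Conv2D(num_anchors, 1)(temporal_aware)
    cur_reg = tf.keras.layers.Conv2D(num_anchors * 5, 1)(temporal_aware)
    past_reg = tf.keras.layers.Conv2D(num_anchors * 3 * (t - 1), 1)(temporal_aware)
    return cls_logits, cur_reg, past_reg

# Random tensors standing in for the t = 3 backbone feature maps.
feats = [tf.random.normal([1, 120, 120, 64]) for _ in range(3)]
cls, cur, past = build_trpn_head(feats)
```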
For each anchor $a = (x^a, y^a, w^a, l^a, h^a)$ (where $x, y, w, l, h$ correspond to the x coordinate of the box center, the y coordinate of the box center, the width, the length and the heading of the box, respectively), the anchor is assigned to the ground-truth object whose current-frame box $gt = (x^{gt}_0, y^{gt}_0, w^{gt}, l^{gt}, h^{gt}_0)$ has the largest IoU with it. Similar to SECOND [27], we compute the regression target to learn the difference between the pre-defined anchors and the corresponding ground-truth boxes. For the current frame, we generate a 5-d regression target $d^a_0 = (dx^a_0, dy^a_0, dw^a, dl^a, dh^a_0)$:

$dx^a_0 = (x^{gt}_0 - x^a) / \sqrt{(x^a)^2 + (y^a)^2}$   (1)
$dy^a_0 = (y^{gt}_0 - y^a) / \sqrt{(x^a)^2 + (y^a)^2}$   (2)
$dw^a = \log \frac{w^{gt}}{w^a}$   (3)
$dl^a = \log \frac{l^{gt}}{l^a}$   (4)
$dh^a_0 = \sin \frac{h^{gt}_0 - h^a}{2}$   (5)

With similar equations, we also compute $t-1$ past regression targets for anchor $a$ against the same ground-truth object: $d^a_j = (dx^a_j, dy^a_j, dh^a_j)$ for $j \in \{-1, -2, \cdots, -(t-1)\}$. Width and length are not considered for the past regression since we assume the object size does not change across frames. For each anchor $a$, the classification target $s^a$ is assigned as 1 if the assigned ground-truth object has an IoU greater than $th^+$ at the current frame. If the IoU is smaller than $th^-$, the classification target is assigned as 0. Otherwise the classification target is $-1$ and the anchor is ignored when computing the loss.

For each anchor $a$, T-RPN predicts a classification score $\hat{s}^a$, a current regression vector $\hat{d}^a_0 = (\hat{dx}^a_0, \hat{dy}^a_0, \hat{dw}^a, \hat{dl}^a, \hat{dh}^a_0)$ and $t-1$ past regression vectors $\hat{d}^a_j = (\hat{dx}^a_j, \hat{dy}^a_j, \hat{dh}^a_j)$ from the aforementioned $1 \times 1$ convolutional layers. The objective of T-RPN is the weighted sum of the classification loss, the current-frame regression loss and the past-frames regression loss, as defined below, where $1(x)$ is the indicator function that returns 1 if $x$ is true and 0 otherwise.

$L_{\text{T-RPN}} = \lambda_{cls} L_{cls} + \lambda_{cur\_reg} L_{cur\_reg} + \lambda_{past\_reg} L_{past\_reg}$   (6)
$L_{cls} = \frac{\sum_a \text{CrossEntropy}(s^a, \hat{s}^a)\, 1(s^a \geq 0)}{\sum_a 1(s^a \geq 0)}$   (7)
$L_{cur\_reg} = \frac{\sum_a \text{SmoothL1}(d^a_0, \hat{d}^a_0)\, 1(s^a \geq 1)}{\sum_a 1(s^a \geq 1)}$   (8)
$L_{past\_reg} = \sum_{j=1}^{t-1} \frac{\sum_a \text{SmoothL1}(d^a_{-j}, \hat{d}^a_{-j})\, 1(s^a \geq 1)}{\sum_a 1(s^a \geq 1)}$   (9)

For proposal generation, the classification scores and regression vectors are applied to the pre-defined anchors to generate temporal proposals, by reversing Equations 1-5. Thus each temporal proposal has a confidence score as well as regressed boxes for the current and past frames. After that, non-maximum suppression is applied to the current-frame boxes of the temporal proposals to remove redundancy.
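The target encoding of Equations (1)-(5) and the decoding used for proposal generation (reversing them) can be sketched as follows. Variable names are ours, and the heading decoding inverts Equation (5) with an arcsin, which is one possible convention and may differ from the actual implementation.

```python
import numpy as np

def encode_current_target(anchor, gt):
    """Current-frame regression target following Eqs. (1)-(5).

    anchor, gt: (x, y, w, l, heading). The normalizer sqrt(x_a^2 + y_a^2)
    follows the formula as printed above.
    """
    xa, ya, wa, la, ha = anchor
    xg, yg, wg, lg, hg = gt
    norm = np.sqrt(xa ** 2 + ya ** 2)
    return np.array([
        (xg - xa) / norm,          # dx, Eq. (1)
        (yg - ya) / norm,          # dy, Eq. (2)
        np.log(wg / wa),           # dw, Eq. (3)
        np.log(lg / la),           # dl, Eq. (4)
        np.sin((hg - ha) / 2.0),   # dh, Eq. (5)
    ])

def decode_current_box(anchor, d):
    """Invert the encoding to turn a predicted regression vector into a box."""
    xa, ya, wa, la, ha = anchor
    dx, dy, dw, dl, dh = d
    norm = np.sqrt(xa ** 2 + ya ** 2)
    return np.array([
        xa + dx * norm,
        ya + dy * norm,
        wa * np.exp(dw),
        la * np.exp(dl),
        ha + 2.0 * np.arcsin(dh),  # one way to invert Eq. (5)
    ])

anchor = np.array([10.0, 5.0, 0.8, 0.8, 0.0])
gt = np.array([10.4, 5.2, 0.9, 0.9, 0.1])
print(decode_current_box(anchor, encode_current_target(anchor, gt)))  # ~ gt
```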
3.3. Proposal prediction

3.3.1 Spatio-temporal-interactive feature extraction

Given the backbone features $[B_{-(t-1)}, \cdots, B_0]$ and the temporal proposals, spatio-temporal-interactive features are learned for each temporal proposal to capture comprehensive information for detection and trajectory prediction. Several ways of modeling objects are combined to achieve this.

Local geometry feature: To extract object geometry knowledge, we use the proposal box at the $j$-th frame (i.e., $x_j, y_j, w, l$ and $h_j$) to crop features from $B_j$, as shown in the lower left part of Figure 4. This is an extension of the traditional proposal feature cropping used in Faster-RCNN [21], to gather position-discarded local geometry features from each frame. To simplify the implementation on TPU, we rotate the 5-DoF box $(x_j, y_j, w, l, h_j)$ to the closest standing box $(x_{min,j}, y_{min,j}, x_{max,j}, y_{max,j})$ for ROIAlign [8].

Local dynamic feature: As illustrated in the lower middle part of Figure 4, we use a meta box (drawn in yellow) that covers the whole movement of the pedestrian to crop features from all $B_j$'s. The meta box is the smallest box containing all current and history proposal boxes. Formally, after transferring all rotated proposal boxes $(x_j, y_j, w, l, h_j)$ to the closest standing boxes $(x_{min,j}, y_{min,j}, x_{max,j}, y_{max,j})$, the meta box is computed as:

$x_{min} = \min_j (x_{min,j}); \quad y_{min} = \min_j (y_{min,j})$
$x_{max} = \max_j (x_{max,j}); \quad y_{max} = \max_j (y_{max,j})$

This feature captures the direction, curvature and speed of the object, which are useful for future trajectory prediction.

Figure 4. Spatio-Temporal-Interactive Feature Extractor (STI-FE): Local geometry, local dynamic and history path features are extracted given a temporal proposal. For the local geometry and local dynamics features, the yellow areas are used for feature extraction. Relational reasoning is performed across proposals' local features to generate interactive features.

History path feature: In order to directly encode an object's past movement, we use the location displacements across frames as the history path feature. Specifically, given a temporal proposal with box centers $x_j, y_j$, the history path feature is $\text{MLP}([x_0 - x_{-1}, y_0 - y_{-1}, x_0 - x_{-2}, y_0 - y_{-2}, \cdots, x_0 - x_{-(t-1)}, y_0 - y_{-(t-1)}])$.

To aggregate spatial and temporal knowledge for each proposal, the concatenation of the local geometry feature and the local dynamic feature is fed into a ResNet block followed by global average pooling. The pooled feature is then concatenated with the history path feature and serves as the proposal-local feature, noted as $f_i$ for the $i$-th temporal proposal.

As discussed before, the future trajectory of a pedestrian can be influenced by the surrounding pedestrians' behaviors. To model such interactions among pedestrians, we design an interaction layer that uses a graph to propagate information among objects, as shown in the middle part of Figure 4. Specifically, we represent each temporal proposal as a graph node $i$; the embedding of node $i$ is noted as $f_i$, the corresponding proposal-local feature. The edge $v_{ij}$ represents the interaction score between node $i$ and node $j$, and is learned from $f_i$ and $f_j$:

$v_{ij} = \alpha([\phi_1(f_i); \phi_2(f_j)])$

where $\alpha$ and the $\phi$'s can be any learnable functions; in our implementation, we use fully-connected layers for $\alpha$ and the $\phi$'s. Given the interaction scores among all pairs of nodes, we can gather the information for each node from the neighboring nodes. Specifically, the interaction embedding $g_i$ gathered for node $i$ is calculated as:

$g_i = \sum_j \frac{\exp\{v_{ij}\}}{V_i} \gamma([f_i; f_j])$

where $V_i = \sum_j \exp\{v_{ij}\}$ is a normalization constant and $\gamma$ is a mapping function (a fully-connected layer in our implementation).
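A minimal sketch of this interaction layer is given below, with random linear maps standing in for the fully-connected layers α, φ1, φ2 and γ; the feature size and the toy inputs are our own assumptions, while the edge scores v_ij and the softmax-normalized aggregation for g_i follow the two equations above.

```python
import numpy as np

rng = np.random.default_rng(0)

def linear(dim_in, dim_out):
    """Stand-in for a learned fully-connected layer (random weights, no bias)."""
    weights = rng.normal(scale=0.1, size=(dim_in, dim_out))
    return lambda x: x @ weights

n, d = 4, 16                       # number of proposals, feature size (assumed)
f = rng.normal(size=(n, d))        # proposal-local features f_i

phi1, phi2 = linear(d, d), linear(d, d)
alpha = linear(2 * d, 1)           # edge score v_ij from [phi1(f_i); phi2(f_j)]
gamma = linear(2 * d, d)           # message from [f_i; f_j]

# Pairwise interaction scores v_ij.
pi, pj = phi1(f), phi2(f)
v = np.array([[alpha(np.concatenate([pi[i], pj[j]]))[0] for j in range(n)]
              for i in range(n)])

# g_i = sum_j softmax_j(v_ij) * gamma([f_i; f_j]).
att = np.exp(v)
att = att / att.sum(axis=1, keepdims=True)
g = np.array([sum(att[i, j] * gamma(np.concatenate([f[i], f[j]]))
                  for j in range(n)) for i in range(n)])
print(g.shape)  # (n, d) interaction embeddings g_i
```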
3.3.2 Proposal classification and regression

Given the proposal-local feature $f_i$ of each temporal proposal, two fully-connected layers are applied to perform classification and regression for the current frame, respectively. In line with our intuition, the proposal-local feature $f_i$ combined with the interaction feature $g_i$ is used to predict the future-frame boxes, via one fully-connected layer with $3t$ output channels, where $t$ is the number of future frames to predict and 3 stands for the x coordinate, y coordinate and heading, respectively. During training, temporal proposals are assigned classification and regression targets with the same strategy discussed in Subsection 3.2, and the objective is the weighted sum of the classification loss, the current-frame regression loss and the future-frames regression loss, similar to Equations 6-9. During inference, each proposal is predicted with a classification score and current/future boxes. Non-maximum suppression is applied based on the IoU between the current boxes to remove redundancy.

4. Experiment

4.1. Experiment settings

Dataset: We conduct experiments on the Waymo Open Dataset (WOD) [1] and the Lyft Dataset (Lyft) [12]. WOD contains lidar data from 5 sensors and labels for 1000 segments. Each segment contains roughly 200 frames and has a length of 20 seconds. The train and validation subsets have 798 and 202 segments, respectively. To model the history and predict the future, we take 1 second of history frames and 3 seconds of future frames for each example, and extract examples from the center 16 seconds (1 s-17 s) of each segment. Thus 126,437 train examples and 31,998 validation examples are extracted, each containing 1 second of history frames and 3 seconds of future frames. We sample 6 frames, including 5 history frames and the current frame, with $t_{input} = \{-1.0, -0.8, -0.6, -0.4, -0.2, 0\}$, and the point clouds from those frames are fed into the network as inputs. In order to build richer voxel features while saving computation and memory, every two frames are combined by concatenating the voxelization output features; thus we have three pillar features, as discussed in Subsection 3.1. For the future prediction, we predict trajectories for 6 future frames with $t_{future} = \{0.5, 1.0, 1.5, 2.0, 2.5, 3.0\}$. The range is 150 m by 150 m around the self-driving car, and we use a pillar size of 31.25 cm by 31.25 cm to generate pillar features of shape 480 x 480.
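As a quick sanity check of these settings, the following sketch maps lidar point coordinates to the 480 x 480 pillar grid. It assumes the 150 m x 150 m range is centered on the SDC (x, y in [-75 m, 75 m)), which the paper does not state explicitly, so the exact convention in the actual model may differ.

```python
import numpy as np

PILLAR_SIZE = 0.3125                        # 31.25 cm
HALF_RANGE = 75.0                           # 150 m x 150 m centered on the SDC (assumed)
GRID = int(2 * HALF_RANGE / PILLAR_SIZE)    # 480

def points_to_pillar_indices(points_xy):
    """Map (x, y) lidar points to integer pillar (row, col) indices.

    Points outside the range are dropped; indices lie in [0, GRID).
    """
    mask = np.all(np.abs(points_xy) < HALF_RANGE, axis=1)
    idx = ((points_xy[mask] + HALF_RANGE) / PILLAR_SIZE).astype(np.int64)
    return np.clip(idx, 0, GRID - 1)

pts = np.array([[0.0, 0.0], [-74.9, 74.9], [200.0, 0.0]])
print(GRID)                           # 480
print(points_to_pillar_indices(pts))  # the out-of-range point is dropped
```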
Lyft contains lidar data from 1 sensor and labels for only 180 segments, with 140 and 40 segments for train and validation, respectively. With the same settings, 14,840 and 4,240 examples are extracted for train and validation. Each example has a 1-second history and a 3-second future. We use $t_{future} = \{0.6, 1.2, 1.8, 2.4, 3.0\}$ for Lyft due to its 5 Hz sampling rate.

Evaluation metric: The evaluation metric for detection is BEV AP (Bird's-Eye-View Average Precision) with the IoU threshold set to 0.5. Objects with fewer than 5 points are considered hard and are excluded during evaluation. For trajectory prediction, we employ the metrics used in [4, 9]. For $t \in t_{future}$, we compute DE@t (Displacement Error) and HR@t (Hit Rate) with a displacement error threshold of 0.5 m. We also compute the ADE (Average Displacement Error), which equals $\frac{1}{|t_{future}|} \sum_{t \in t_{future}} \text{DE@}t$.

Implementation: Our models are implemented in TensorFlow and trained with the Adam optimizer on TPUv3 for 140k and 70k iterations for the Waymo Open Dataset and the Lyft Dataset, respectively. The learning rate is $4 \times 10^{-4}$ and the batch size is 1 per TPU core. We use 32 TPU cores together for training, so the effective batch size is 32.

We also implement IntentNet [4] and Faster-RCNN [21] in TensorFlow as baselines, noted as "IntentNet" and "MF-FRCNN". Our implemented IntentNet (1) takes multiple frames as input and shares the same backbone net as STINet, and (2) removes the intent classification part and only regresses a future trajectory. MF-FRCNN refers to a Faster-RCNN [21] model with several changes: (1) it uses the same backbone net as STINet (see Section 3.1); (2) for each object proposal, in addition to the bounding box, we also regress future trajectories and headings. Note that the difference between proposals from MF-FRCNN and our method is that MF-FRCNN only predicts the current box of objects, while our method exploits a novel Temporal RPN that also generates the corresponding history boxes associated with each current box.

Model     | MF | TS | DE@1 ↓     | DE@2 ↓     | DE@3 ↓     | ADE ↓      | HR@1 ↑     | HR@2 ↑     | HR@3 ↑
IntentNet | ✓  |    | 21.17±0.02 | 39.74±0.07 | 61.60±0.12 | 36.04±0.12 | 93.18±0.03 | 76.50±0.08 | 61.60±0.12
MF-FRCNN  | ✓  | ✓  | 20.87±0.08 | 39.23±0.14 | 60.59±0.22 | 35.57±0.13 | 93.45±0.05 | 76.69±0.18 | 61.57±0.21
STINet    | ✓  | ✓  | 19.63±0.03 | 37.07±0.08 | 57.60±0.14 | 33.67±0.07 | 94.36±0.05 | 78.91±0.06 | 64.43±0.15

Table 1. Trajectory prediction performance for different models on WOD. MF indicates whether the model takes multiple frames as input. TS indicates whether the model has a two-stage framework. ↑ and ↓ indicate whether higher or lower numbers are better for the corresponding metric. DE and ADE are in centimeters. For models implemented by us, we train and evaluate the model five times and report the average and standard deviation (shown after ±).

Model          | MF | TS | BEV AP ↑
PointPillar [29] |  |    | 68.57
MVF [29]       |    |    | 74.38
StarNet [19]   |    |    | 72.50
IntentNet [4]¹ | ✓  |    | 79.43±0.10
MF-FRCNN       | ✓  | ✓  | 79.69±0.19
STINet         | ✓  | ✓  | 80.73±0.26

Table 2. Detection performance for different methods on WOD. MF indicates whether the model takes multiple frames as input. TS indicates whether the model has a two-stage framework. BEV AP is computed with an IoU threshold of 0.5. ↑ indicates that higher numbers are better. ¹IntentNet without the intent prediction head, implemented by us.

4.2. Results on Waymo Open Dataset

The main results for pedestrian detection and trajectory prediction on the Waymo Open Dataset are summarized in Tables 2 and 1. For detection, we compare our proposed method (in the last row) with the current state-of-the-art detectors [19, 29]; our method surpasses the off-the-shelf baselines by a very large margin, improving the BEV AP from 74.38 to 80.73. To avoid the effects of multi-frame inputs and different implementation details, we also compare with our implementations of IntentNet and multi-frame Faster RCNN [21], noted as "IntentNet" and "MF-FRCNN" in Table 2. Our proposed method outperforms all baselines, which confirms the effectiveness of our T-RPN and the STI modeling of proposals.

In Table 1 we compare the trajectory prediction performance of our proposed method, IntentNet and MF-FRCNN. Our proposed method surpasses all competitors by a large margin, and the improvement is larger than the improvement on detection. This aligns with our intuition, since T-RPN and STI modeling are designed to better model objects' movement and are more useful for forecasting their trajectories. For a detailed comparison of STINet and MF-FRCNN, we evaluate the detection and trajectory prediction by breaking down the objects into five bins based on the future trajectory length in 3 s. The five bins are 0-2.5 m, 2.5-5 m, 5-7.5 m, 7.5-10 m and 10 m-∞, respectively.
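Before turning to the breakdown results, the trajectory metrics defined in Section 4.1 can be summarized with a short sketch; it assumes predictions have already been matched to ground-truth pedestrians and omits the handling of hard objects.

```python
import numpy as np

def trajectory_metrics(pred, gt, hit_threshold=0.5):
    """pred, gt: arrays of shape [num_objects, num_future_steps, 2], in meters.

    Returns per-step displacement error DE@t, average displacement error ADE,
    and per-step hit rate HR@t with a 0.5 m threshold.
    """
    # Euclidean displacement error per object and future step.
    de = np.linalg.norm(pred - gt, axis=-1)          # [N, T]
    de_at_t = de.mean(axis=0)                        # DE@t for each step
    ade = de_at_t.mean()                             # ADE = mean over steps
    hr_at_t = (de <= hit_threshold).mean(axis=0)     # HR@t
    return de_at_t, ade, hr_at_t

# Toy example: two pedestrians, six future steps, constant 0.2 m error.
gt = np.zeros((2, 6, 2))
pred = gt + np.array([0.2, 0.0])
de_t, ade, hr_t = trajectory_metrics(pred, gt)
print(de_t, ade, hr_t)  # DE@t = 0.2 m everywhere, ADE = 0.2 m, HR@t = 1.0
```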
We report the BEV AP, the ADE and the relative improvement in Tables 3 and 4. STINet is consistently better than MF-FRCNN for both tasks. For trajectory prediction on objects moving more than 5 m, the relative improvements are significant and consistently above 10%. This confirms that the proposed method can leverage the details of the history information and provide much better trajectory predictions, especially for pedestrians with larger movement.

Model    | 0-2.5 | 2.5-5 | 5-7.5 | 7.5-10 | 10-∞
MF-FRCNN | 63.07 | 90.44 | 93.27 | 88.00  | 77.15
STINet   | 64.23 | 91.15 | 94.46 | 88.97  | 80.50
∆%       | 1.8%  | 0.8%  | 1.3%  | 1.1%   | 4.3%

Table 3. Bird's-eye-view average precision (BEV AP) breakdown comparison of MF-FRCNN and STINet on WOD. Objects are split into five bins based on the future trajectory length, with a bin size of 2.5 m. The last row is the relative improvement of STINet.

Model    | 0-2.5 | 2.5-5 | 5-7.5 | 7.5-10 | 10-∞
MF-FRCNN | 26.90 | 37.56 | 46.39 | 104.60 | 173.50
STINet   | 26.73 | 35.42 | 41.18 | 89.74  | 137.17
∆%       | 0.6%  | 6.0%  | 11.2% | 14.2%  | 20.9%

Table 4. Average displacement error (ADE, in centimeters) breakdown comparison of MF-FRCNN and STINet on WOD. Objects are split into five bins based on the future trajectory length, with a bin size of 2.5 m. The last row is the relative improvement of STINet.

4.3. Results on Lyft Dataset

The detection and trajectory prediction results on the Lyft Dataset are summarized in Table 5. The performance on both tasks is improved substantially, and the results confirm the effectiveness of the proposed method on a small-scale dataset.

Model    | BEV AP ↑ | DE@3 ↓ | ADE ↓ | HR@3 ↑
MF-FRCNN | 33.90    | 82.61  | 51.11 | 49.74
STINet   | 37.15    | 76.17  | 46.09 | 50.73

Table 5. Detection and trajectory prediction performance on Lyft.

4.4. Ablation studies

In this section we conduct ablation experiments to analyze the contribution of each component, and compare our model with potential alternative methods on the Waymo Open Dataset. The results are summarized below. For clarity, we only show DE@3, ADE and HR@3 for trajectory prediction; the other metrics have the same tendency.

Effect of local geometry and local dynamic features: We conduct experiments to analyze the effect of the local geometry and local dynamic features, summarized in Table 6. The local geometry feature is good for detection and the local dynamic feature is good for trajectory prediction. The geometry feature by itself does not work well for trajectory prediction, since it ignores dynamics in favor of better detection. By combining both features, the benefits for detection and trajectory prediction can be obtained simultaneously.

LG | LD | BEV AP ↑ | DE@3 ↓ | ADE ↓ | HR@3 ↑
✓  |    | 80.38    | 64.15  | 37.67 | 58.46
   | ✓  | 79.69    | 59.71  | 34.96 | 62.22
✓  | ✓  | 80.53    | 58.95  | 34.49 | 62.99

Table 6. Ablation studies on the local geometry and local dynamic features (noted as LG and LD, respectively). All entries are trained without the History Path and Interactive features.
Effect of history path: Although objects’ geometry and movement are already represented by local geometry dy- namic features, taking history path as an extra feature can give another performance gain by improving the DE@3 from 58.95 to 58.04 and the HR@3 from 62.99 to 63.87 (as shown in the first two row of Table 7). This suggests the history path, as the easiest and most direct representation of objects’ movement, can still help based on the rich repre- sentations. However history path itself is far from enough to give accurate trajectory prediction, suggested by the poor performance in the last row of Table 7. Effect of proposal interaction modeling: To demonstrate the effectiveness of the proposed pedestrian interaction modeling, we measure the performance for all pedestrians as well as pedestrians in a group. Specifically, we design a heuristic rule (based on locations and speeds) to discover pedestrian groups and assign each pedestrian a group label on the evaluation set. The details about the grouping al- gorithm can be found in supplementary. We evaluate the trajectory prediction performance on all pedestrians and the pedestrians belonging to a group with at least 5 pedestri- ans, shown in Table 8. The interaction modeling improves trajectory prediction performance on “all pedestrians” and achieve a larger boost for pedestrians that belong to groups (DE@3 improved from 49.67 to 48.89 by 1.6%). 4.5. Model inference speed We measure the inference speed of our proposed model as well as baseline models on context range of 100m by Figure 5. Qualitative examples of STINet. The blue box are de- tected pedestrians. The cyan and yellow lines are predicted future and history trajectories of STINet respectively. 100m as well as 150m by 150m. All models are imple- mented in TensorFlow and the inference is executed on a single nVIDIA Tesla V100 GPU. For the context range of 100m by 100m, IntentNet, MF-FRCNN and STINet have inference time of 60.9, 69.4 and 74.6ms respectively. Both two-stage models (MF-FRCNN and STINet) are slower than the single-stage model, and STINet is slightly slower than MF-FRCNN. However, all three models can achieve a real-time inference speed higher than 10Hz. For the maximum range of Waymo Open Dataset, i.e., 150m by 150m, three models have inference time of 122.9, 132.1 and 144.7ms respectively. 4.6. Qualitative results The visualization for the predictions of STINet is shown in Figure 5. The blue boxes are the detected pedestrians. The cyan and yellow lines are the predicted future and his- tory trajectory for each detected pedestrian respectively. We show two scenarios where the SDC is stationary in the up- per sub-figure and the SDC is moving fast in the lower sub- figure. It demonstrates that our model detects and predicts very accurately in both cases. Figure 6 shows a detailed comparison between STINet and MF-FRCNN against the ground-truth for trajectory pre- diction. Green boxes are the ground-truth boxes. Yel- low, pink and cyan lines are the ground-truth future trajec- Figure 6. Comparison between MF-FRCNN and STINet. The yel- low line is the ground-truth future trajectory for pedestrians. The pink and cyan lines are the predicted future trajectory from MF- FRCNN and STINet respectively. It is clear that our proposed method gives a much better prediction compared with the baseline, for all three pedestrians. Upper: the overview of three pedestrians. Lower: zoom-in visualization for three pedestrians. 
For the left two pedestrians, who are walking in a straight line, both MF-FRCNN and STINet predict the future trajectory reasonably well, although MF-FRCNN still has a small error compared with the ground truth; for the right-most pedestrian, who is making a slight left turn, MF-FRCNN fails to capture the details of the movement and gives an unsatisfactory prediction, while STINet gives a much better trajectory prediction.

5. Conclusion

In this paper, we propose STINet to perform joint detection and trajectory prediction with raw lidar point clouds as input. We propose to build temporal proposals with pedestrians' current and past boxes, and to learn a rich representation for each temporal proposal, with local geometry, dynamic movement, history path and interaction features. We show that, by explicitly modeling the spatio-temporal-interaction features, both detection and trajectory prediction quality can be drastically improved compared with single-stage and two-stage baselines. This also makes us re-think the importance of introducing a second stage and proposals, especially for the joint detection and trajectory prediction task. Comprehensive experiments and comparisons with baselines and the state of the art confirm the effectiveness of our proposed method; it significantly improves the prediction quality while still achieving real-time inference speed, which makes our model practical for real-world applications. Combining camera/map data and utilizing longer history with LSTMs could be investigated to further improve the prediction, and we will explore them in future work.

References

[1] Waymo open dataset: An autonomous driving dataset, 2019.
[2] Alexandre Alahi, Kratarth Goel, Vignesh Ramanathan, Alexandre Robicquet, Li Fei-Fei, and Silvio Savarese. Social lstm: Human trajectory prediction in crowded spaces. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 961-971, 2016.
[3] Peter W Battaglia, Jessica B Hamrick, Victor Bapst, Alvaro Sanchez-Gonzalez, Vinicius Zambaldi, Mateusz Malinowski, Andrea Tacchetti, David Raposo, Adam Santoro, Ryan Faulkner, et al. Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261, 2018.
[4] Sergio Casas, Wenjie Luo, and Raquel Urtasun. Intentnet: Learning to predict intention from raw sensor data. In Conference on Robot Learning, pages 947-956, 2018.
[5] Ming-Fang Chang, John Lambert, Patsorn Sangkloy, Jagjeet Singh, Slawomir Bak, Andrew Hartnett, De Wang, Peter Carr, Simon Lucey, Deva Ramanan, et al. Argoverse: 3d tracking and forecasting with rich maps. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8748-8757, 2019.
[6] Ross Girshick. Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, pages 1440-1448, 2015.
[7] Agrim Gupta, Justin Johnson, Li Fei-Fei, Silvio Savarese, and Alexandre Alahi. Social gan: Socially acceptable trajectories with generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2255-2264, 2018.
[8] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, pages 2961-2969, 2017.
[9] Joey Hong, Benjamin Sapp, and James Philbin.
Rules of the road: Predicting driving behavior with a convolutional model of semantic interactions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8454-8462, 2019.
[10] Rui Hou, Chen Chen, and Mubarak Shah. Tube convolutional neural network (t-cnn) for action detection in videos. In Proceedings of the IEEE International Conference on Computer Vision, pages 5822-5831, 2017.
[11] Vicky Kalogeiton, Philippe Weinzaepfel, Vittorio Ferrari, and Cordelia Schmid. Action tubelet detector for spatio-temporal action localization. In Proceedings of the IEEE International Conference on Computer Vision, pages 4405-4413, 2017.
[12] R. Kesten, M. Usman, J. Houston, T. Pandya, K. Nadhamuni, A. Ferreira, M. Yuan, B. Low, A. Jain, P. Ondruska, S. Omari, S. Shah, A. Kulkarni, A. Kazakova, C. Tao, L. Platinsky, W. Jiang, and V. Shet. Lyft level 5 av dataset 2019. https://level5.lyft.com/dataset/, 2019.
[13] Alex H Lang, Sourabh Vora, Holger Caesar, Lubing Zhou, Jiong Yang, and Oscar Beijbom. Pointpillars: Fast encoders for object detection from point clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 12697-12705, 2019.
[14] Namhoon Lee, Wongun Choi, Paul Vernaza, Christopher B Choy, Philip HS Torr, and Manmohan Chandraker. Desire: Distant future prediction in dynamic scenes with interacting agents. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 336-345, 2017.
[15] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, pages 2980-2988, 2017.
[16] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C Berg. Ssd: Single shot multibox detector. In European Conference on Computer Vision, pages 21-37. Springer, 2016.
[17] Wenjie Luo, Bin Yang, and Raquel Urtasun. Fast and furious: Real time end-to-end 3d detection, tracking and motion forecasting with a single convolutional net. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3569-3577, 2018.
[18] Anton Milan, S Hamid Rezatofighi, Anthony Dick, Ian Reid, and Konrad Schindler. Online multi-target tracking using recurrent neural networks. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.
[19] Jiquan Ngiam, Benjamin Caine, Wei Han, Brandon Yang, Yuning Chai, Pei Sun, Yin Zhou, Xi Yi, Ouais Alsharif, Patrick Nguyen, et al. Starnet: Targeted computation for object detection in point clouds. arXiv preprint arXiv:1908.11069, 2019.
[20] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 779-788, 2016.
[21] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91-99, 2015.
[22] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 234-241. Springer, 2015.
[23] Adam Santoro, David Raposo, David G Barrett, Mateusz Malinowski, Razvan Pascanu, Peter Battaglia, and Timothy Lillicrap. A simple neural network module for relational reasoning. In Advances in Neural Information Processing Systems, pages 4967-4976, 2017.
[24] Chen Sun, Abhinav Shrivastava, Carl Vondrick, Rahul Sukthankar, Kevin Murphy, and Cordelia Schmid. Relational action forecasting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 273-283, 2019.
[25] Peng Tang, Chunyu Wang, Xinggang Wang, Wenyu Liu, Wenjun Zeng, and Jingdong Wang. Object detection in videos by high quality object linking. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.
[26] Xiaolong Wang and Abhinav Gupta. Videos as space-time region graphs. In Proceedings of the European Conference on Computer Vision (ECCV), pages 399-417, 2018.
[27] Yan Yan, Yuxing Mao, and Bo Li. Second: Sparsely embedded convolutional detection. Sensors, 18(10):3337, 2018.
[28] Jianwei Yang, Jiasen Lu, Stefan Lee, Dhruv Batra, and Devi Parikh. Graph r-cnn for scene graph generation. In Proceedings of the European Conference on Computer Vision (ECCV), pages 670-685, 2018.
[29] Yin Zhou, Pei Sun, Yu Zhang, Dragomir Anguelov, Jiyang Gao, Tom Ouyang, James Guo, Jiquan Ngiam, and Vijay Vasudevan. End-to-end multi-view fusion for 3d object detection in lidar point clouds. arXiv preprint arXiv:1910.06528, 2019.
[30] Yin Zhou and Oncel Tuzel. Voxelnet: End-to-end learning for point cloud based 3d object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4490-4499, 2018.
Among the ob- jects, pedestrian is an important and difficult type. The dif- ficulty comes from the complicated properties of pedestrian appearance and behavior, e.g. deformable shape and inter- personal relations [7]. In this paper, we tackle the problem of joint pedestrian detection and trajectory prediction from a sequence of point clouds, as illustrated in Figure 1. ∗Work done during an internship at Waymo. Figure 1. Given a sequence of current and past point clouds, our task is to detect pedestrians in the current frame, and predict the future trajectory of them. In this figure, white points are input point cloud sequence (stacked for visualization), yellow boxes are detected objects, and the cyan lines are predicted future trajectory. Traditionally, this problem is tackled by dividing the perception pipeline into multiple modules: object detec- tion [6, 13, 15, 16, 20, 21, 29, 30], tracking [18] and tra- jectory prediction [2, 7, 9]; latter modules take the outputs from the former modules. Although such strategy makes each sub-module easy to design and implement, it sacrifices the potential advantage of joint optimization. Latter mod- ules can lose critical information bottle-necked by the inter- faces between sub-modules, e.g. a pedestrian’s future tra- jectory depends on many useful geometry features from the raw sensor data, which may be abstracted away in the detec- tion/tracking stage. To this end, researchers recently have proposed several end-to-end neural networks to detect ob- jects and predict trajectories simultaneously. FaF [17] and IntentNet [4] are two of the representative methods, which are designed based on single stage detectors (SSD) [16]; in addition to original anchor classification and regression of SSD, they also regress a future trajectory for each anchor. We observed that there are two major issues that are crit- ical for joint detection and trajectory prediction, but are not addressed by previous end-to-end methods: 1) Tem- poral modeling on object level: existence and future tra- jectory of an object are embedded in both current and past arXiv:2005.04255v1 [cs.CV] 8 May 2020 frames. Current methods simply reuse single-stage detec- tor and fuse the temporal information in the backbone CNN in an object-agnostic manner either via feature concatena- tion or 3D CNN [4, 17]. Such coarse level fusion can loss fine-grained temporal information for each object, which is critical for both tasks. 2) Interaction modeling among ob- jects: the future trajectory of an object could be influenced by the other objects. E.g., a pedestrian walking inside a group may tend to follow others. Existing methods [4, 17] do not explicitly model interactions among objects. To address the aforementioned issues, we propose an end-to-end Spatio-Temporal-Interactive network (STINet) to model pedestrians temporal and interactive information jointly. The proposed network takes a sequence of point clouds as input, detects current location and predicts future trajectory for pedestrians. Specifically, there are three sub- components in STINet : backbone network, proposal gen- eration network, and proposal prediction network. In the backbone net, we adopted a similar structure as PointPil- lars [13], and applied it on each frame of the point cloud, the output feature maps from multi-frames are then com- bined. The proposal generation network takes feature maps from the backbone net and generates potential pedestrian in- stances with both their current and past locations (i.e. 
tem- poral proposals); such temporal proposals allow us to link the same object across different frames. In the third mod- ule (i.e. prediction network), we use the temporal proposals to explicitly gather the geometry appearance and temporal dynamics for each object. To reason the interaction among pedestrians, we build a graph layer to gather the information from surrounding pedestrians. After extracting the above spatial-temporal-interactive feature for each proposal, the detection and prediction head uses the feature to regress cur- rent detection bounding box and future trajectory. Comprehensive experiments are conducted on Waymo Open Dataset [1] and Lyft Dataset [12] to demonstrate the effectiveness of the STINet. Specifically, it achieves an av- erage precision of 80.73 for bird-eyes-view pedestrian de- tection, and an average displacement error of 33.67 cm for trajectory prediction on Waymo Open Dataset. It achieves real-time inference speeds and takes only 74.6 ms for infer- ence on a range of 100m by 100m. The main contributions of our work come in four folds: • We build an end-to-end network tailored to model pedestrian past, current and future simultaneously. • We propose to generate temporal proposals with both current and past boxes. This enables learning a com- prehensive spatio-temporal representation for pedestri- ans with their geometry, dynamic movement and his- tory path in an end-to-end manner without explicitly associating object across frames. • We propose to build a graph among pedestrians to rea- son the interactions to further improve trajectory pre- diction quality. • We establish the state-of-the-art performance for both detection and trajectory prediction on the Lyft Dataset and the recent large-scale challenging Waymo Open Dataset. 2. Related work 2.1. Object detection Object detection is a fundamental task in computer vi- sion and autonomous driving. Recent approaches can be divided into two folds: single-stage detection [15, 16, 20] and two-stage detection [6, 21]. Single-stage detectors do classification and regression directly on backbone fea- tures, while two-stage detectors generate proposals based on backbone features, and extract proposal features for second-stage classification and regression. Single-stage de- tectors have simpler structure and faster speed, however, they lose the possibility to flexibly deal with complex ob- jects behaviors, e.g., explicitly capturing pedestrians mov- ing across frames with different speeds and history paths. In this work, we follow the two-stage detection framework and predict object boxes for both current and past frames as proposals, which are further processed to extract their ge- ometry and movement features. 2.2. Temporal proposals Temporal proposals have been shown beneficial in action localization in [10, 11]. They showed associating temporal proposals from different video clips can help to leverage the temporal continuity of video frames. [25] proposed to link temporal proposals throughout the video to improve video object detection. In our work, we also exploit temporal proposals and step further to investigate and propose how to build comprehensive spatio-temporal representations of proposals to improve future trajectory prediction. This is a hard task since there are no inputs available for the future. Also we investigate to learn interactions between proposals via a graph. 
We show that these spatio-temporal features can effectively model objects’ dynamics and provide accu- rate detection and prediction of their future trajectory. 2.3. Relational reasoning An agent’s behavior could be influenced by other agents and it is naturally connected to relational reasoning [3, 23]. Graph neural networks have shown its strong capability in relational modeling in recent years. Wang et al. formulated the video as a space-time graph, show the effectiveness on the video classification task [26]. Sun et al. designed a re- lational recurrent network for action detection and anticipa- tion [24]. Yang et al. proposed to build an object relation- ship graph for the task of scene graph generation [28]. x y z Backbone Features x y z x y z Pillar Features ResUNet Pillar Feature Encoding T=-2 ... T=-1 T=0 Temporal Proposals T-RPN STI Feature Extractor Object Detection Head Trajectory Prediction Head Proposal STI Feature Figure 2. The overview of the proposed method. It takes a sequence of point clouds as input, detects pedestrians and predicts their future trajectories simultaneously. The point clouds are processed by Pillar Feature Encoding [13, 30] to generate Pillar Features. Then each Pillar Feature is fed into a backbone ResUNet [22] to get backbone features. A Temporal Region Proposal Network (T-RPN) takes backbone features and generated temporal proposal with past and current boxes for each object. Spatio-Temporal-Interactive (STI) Feature Extractor learns features for each temporal proposal which are used for final detection and trajectory prediction. 2.4. Trajectory prediction Predicting the future trajectory of objects is an impor- tant task, especially for autonomous driving. Previous re- search has been conducted based on perception objects as inputs [2, 5, 7, 9, 14]. Recently FaF [17] and IntentNet [4] focused on end-to-end trajectory prediction from raw point clouds as input. However, they simply re-used single- stage detection framework and added new regression heads on it. In our work, we exploit temporal region proposal network and explicitly model Spatio-Temporal-Interaction (STI) representations of pedestrians, and our experiments show that the proposed STI modeling is superior on both detection and trajectory prediction for pedestrians. 3. Proposed method In this section, we discuss our proposed network in de- tails. The overview of our proposed method is shown in Figure 2, which can be divided into three steps. For each of these steps, we discuss in the following subsections. 3.1. Backbone network The backbone of our network is illustrated in Figure 3. The input is a sequence of point clouds with t′ frames noted as [PC−(t′−1), PC−(t′−2), · · · , PC0], which corresponds to the lidar sensor input from the past t′ −1 frames as well as the current frame. All point clouds are calibrated to SDCs pose at the current frame so that the ego-motion is discarded. To build rich pillar features while keeping a feasible memory usage, we generate t pillar features from the t′ input frames. Consecutive t′/t point clouds PC−(j+1)t′/t+1, · · · , PC−jt′/t are processed with Voxeliza- tion [13, 30] and then concatenated to generate a pseudo image Ij (i.e. Pillar Features) with shape H × W × Cin. ResBlock ResBlock 1x Upsample 2x Upsample 4x Upsample Concat x y z Pillar Features ResUNet Pillar Feature Encoding Backbone Features Figure 3. Backbone of proposed network. Upper: overview of the backbone. 
The input point cloud sequence is fed to Voxeliza- tion and Point net to generate pseudo images, which are then pro- cessed by ResNet U-Net to generate final backbone feature se- quence. Lower: detailed design of ResNet U-Net. Thus the output of Pillar Feature Encoding is a sequence of t Pillar Features [I−(t−1), I−(t−2), · · · , I0]. Next we adopt a similar backbone CNN network pro- posed as in [22], as shown in the lower part of Figure 3. Each of the Pillar Features Ij is first processed by three ResNet-style blocks to generate intermediate features with shape RH×W ×C0,R 1 2 H× 1 2 W ×C1 and R 1 4 H× 1 4 W ×C2. Then we use deconvolution layers to upsample them to the same spatial shape with Ij. The concatenation of the upsampled features serve as the backbone feature of Ij, noted as Bj. 3.2. Temporal proposal generation In order to explicitly model objects’ current and past knowledge, we propose a temporal region proposal net- work (T-RPN) to generate object proposals with both cur- rent and past boxes. T-RPN takes the backbone feature se- quence [B−(t−1), B−(t−2), · · · , B0] as the input, concate- nates them in the channel dimension and applies a 1×1 con- volution to generate a temporal-aware feature map. Classifi- cation, current frame regression and past frames regression are generated by applying 1 × 1 convolutional layers over the temporal-aware feature map, to classify and regress the pre-defined anchors. The temporal region proposal network is supervised by ground-truth objects’ current and past locations. For each anchor a = (xa, ya, wa, la, ha) (x, y, w, l, h correspond to x coordinate of box center, y coordinate of box center, width of box, length of box and heading of box respec- tively), it is assigned to a ground-truth object with largest IoU of the current frame box gt = (xgt 0 , ygt 0 , wgt, lgt, hgt 0 ). Similar to SECOND [27], we compute the regression tar- get in order to learn the difference between the pre-defined anchors and the corresponding ground-truth boxes. For the current frame, we generate a 5-d regression target da 0 = (dxa 0, dya 0, dwa, dla, dha 0): dxa 0 = (xgt 0 −xa)/ p (xa)2 + (ya)2 (1) dya 0 = (ygt 0 −ya)/ p (xa)2 + (ya)2 (2) dwa = log wgt wa (3) dla = log lgt la (4) dha 0 = sin hgt 0 −ha 2 (5) With similar equations, we also compute t −1 past regres- sion targets for anchor a against the same ground-truth ob- ject: da j = (dxa j , dya j , dha j ) for j ∈{−1, −2, · · · , −(t − 1)}. Width and length are not considered for the past re- gression since we assume the object size does not change across different frames. For each anchor a, the classifica- tion target sa is assigned as 1 if the assigned ground-truth object has an IoU greater than th+ at the current frame. If the IoU is smaller than th−, classification target is assigned as 0. Otherwise the classification target is −1 and the an- chor is ignored for computing loss. For each anchor a, T-RPN predicts a classification score ˆsa, a current regression vector ˆda 0 = ( ˆdxa 0, ˆdya 0, ˆdwa, ˆdla, ˆdha 0) and t −1 past regression vectors ˆda j = ( ˆdxa j , ˆdya j , ˆdha j ) from the aforementioned 1 × 1 convolutional layers. The objective of T-RPN is the weighted sum of classifica- tion loss, current frame regression loss and past frames re- gression loss as defined in the equations below, where 1(x) is the indicator function and returns 1 if x is true otherwise 0. 
For each anchor $a$, T-RPN predicts a classification score $\hat{s}^a$, a current-frame regression vector $\hat{d}^a_0 = (\hat{dx}^a_0, \hat{dy}^a_0, \hat{dw}^a, \hat{dl}^a, \hat{dh}^a_0)$ and $t-1$ past regression vectors $\hat{d}^a_j = (\hat{dx}^a_j, \hat{dy}^a_j, \hat{dh}^a_j)$ from the aforementioned $1 \times 1$ convolutional layers. The objective of T-RPN is the weighted sum of the classification loss, the current-frame regression loss and the past-frames regression loss defined below, where $\mathbb{1}(x)$ is the indicator function that returns 1 if $x$ is true and 0 otherwise.

$L_{\text{T-RPN}} = \lambda_{cls} L_{cls} + \lambda_{cur\_reg} L_{cur\_reg} + \lambda_{past\_reg} L_{past\_reg}$  (6)

$L_{cls} = \frac{\sum_a \text{CrossEntropy}(s^a, \hat{s}^a)\, \mathbb{1}(s^a \geq 0)}{\sum_a \mathbb{1}(s^a \geq 0)}$  (7)

$L_{cur\_reg} = \frac{\sum_a \text{SmoothL1}(d^a_0, \hat{d}^a_0)\, \mathbb{1}(s^a \geq 1)}{\sum_a \mathbb{1}(s^a \geq 1)}$  (8)

$L_{past\_reg} = \sum_{j=1}^{t-1} \frac{\sum_a \text{SmoothL1}(d^a_{-j}, \hat{d}^a_{-j})\, \mathbb{1}(s^a \geq 1)}{\sum_a \mathbb{1}(s^a \geq 1)}$  (9)

For proposal generation, the classification scores and regression vectors are applied to the pre-defined anchors to generate temporal proposals, by reversing Equations 1-5. Thus each temporal proposal has a confidence score as well as regressed boxes for the current and past frames. After that, non-maximum suppression is applied on the current-frame boxes of the temporal proposals to remove redundancy.

3.3. Proposal prediction

3.3.1 Spatio-temporal-interactive feature extraction

Given the backbone features $[B_{-(t-1)}, \cdots, B_0]$ and the temporal proposals, spatio-temporal-interactive features are learned for each temporal proposal to capture comprehensive information for detection and trajectory prediction. Different ways of modeling objects are combined to achieve this.

Local geometry feature: To extract object geometry knowledge, we use the proposal box at the $j$-th frame (i.e. $x_j$, $y_j$, $w$, $l$, and $h_j$) to crop features from $B_j$, as shown in the lower left part of Figure 4. This is an extension of the traditional proposal feature cropping used in Faster-RCNN [21], to gather position-discarded local geometry features from each frame. To simplify the implementation on TPU, we rotate the 5-DoF box $(x_j, y_j, w, l, h_j)$ to the closest standing box $(x_{min,j}, y_{min,j}, x_{max,j}, y_{max,j})$ for ROIAlign [8].

Local dynamic feature: As illustrated in the lower middle part of Figure 4, we use a meta box (drawn in yellow) which covers the whole movement of the pedestrian to crop features from all $B_j$'s. The meta box is the smallest box which contains all current and history proposal boxes. Formally, after transferring all rotated proposal boxes $(x_j, y_j, w, l, h_j)$ to the closest standing boxes $(x_{min,j}, y_{min,j}, x_{max,j}, y_{max,j})$, the meta box is computed as:

$x_{min} = \min_j(x_{min,j}); \quad y_{min} = \min_j(y_{min,j})$

$x_{max} = \max_j(x_{max,j}); \quad y_{max} = \max_j(y_{max,j})$

This feature captures the direction, curvature and speed of the object, which are useful for future trajectory prediction.

Figure 4. Spatio-Temporal-Interactive Feature Extractor (STI-FE): Local geometry, local dynamic and history path features are extracted given a temporal proposal. For the local geometry and local dynamics features, the yellow areas are used for feature extraction. Relational reasoning is performed across proposals' local features to generate interactive features.

History path feature: In order to directly encode objects' past movement, we exploit the location displacement across frames as the history path feature. To be specific, given a temporal proposal with $x_j, y_j$ as the box centers, the history path feature is $\text{MLP}([x_0 - x_{-1}, y_0 - y_{-1}, x_0 - x_{-2}, y_0 - y_{-2}, \cdots, x_0 - x_{-(t-1)}, y_0 - y_{-(t-1)}])$.

To aggregate spatial and temporal knowledge for each proposal, the concatenation of the local geometry feature and the local dynamic feature is fed into a ResNet block followed by global average pooling. The pooled feature is then concatenated with the history path feature, and serves as the proposal-local feature, noted as $f_i$ for the $i$-th temporal proposal.

As discussed before, the future trajectory of a pedestrian can be influenced by the surrounding pedestrians' behaviors. In order to model such interactions among pedestrians, we design an interaction layer which uses a graph to propagate information among objects, as shown in the middle part of Figure 4. Specifically, we represent each temporal proposal as a graph node $i$; the embedding of node $i$ is noted as $f_i$, which is the corresponding proposal-local feature. The edge $v_{ij}$ represents the interaction score between node $i$ and node $j$, and is learned from $f_i$ and $f_j$:

$v_{ij} = \alpha([\phi_1(f_i); \phi_2(f_j)])$

where $\alpha$ and the $\phi$'s can be any learnable functions; in our implementation, we use fully-connected layers for $\alpha$ and the $\phi$'s. Given the interaction scores among all pairs of nodes, we gather information for each node from its neighboring nodes. Specifically, the interaction embedding $g_i$ gathered for node $i$ is calculated as:

$g_i = \sum_j \frac{\exp\{v_{ij}\}}{V_i} \gamma([f_i; f_j])$

where $V_i = \sum_j \exp\{v_{ij}\}$ is the normalization constant, and $\gamma$ is a mapping function (a fully-connected layer in our implementation).
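To illustrate the interaction layer, here is a minimal numpy sketch of the score and gathering computation above. The fully-connected layers $\alpha$, $\phi_1$, $\phi_2$ and $\gamma$ are replaced by tiny random linear maps, and the feature dimensions are illustrative assumptions, not the values used in the paper.

```python
import numpy as np

def linear(in_dim, out_dim, rng):
    """A tiny stand-in for a fully-connected layer (weights only, no bias)."""
    w = rng.normal(scale=0.01, size=(in_dim, out_dim))
    return lambda x: x @ w

def interaction_features(f, phi1, phi2, alpha, gamma):
    """Gather an interaction embedding g_i for every proposal-local feature f_i.

    f: (N, D) array of proposal-local features.
    v_ij = alpha([phi1(f_i); phi2(f_j)]) is the interaction score, and
    g_i = sum_j softmax_j(v_ij) * gamma([f_i; f_j]) is the gathered embedding.
    """
    n = f.shape[0]
    pi = phi1(f)                                             # (N, D')
    pj = phi2(f)                                             # (N, D')
    pair = np.concatenate([np.repeat(pi[:, None, :], n, axis=1),
                           np.repeat(pj[None, :, :], n, axis=0)], axis=-1)  # (N, N, 2D')
    v = alpha(pair).squeeze(-1)                              # interaction scores v_ij, (N, N)
    w = np.exp(v - v.max(axis=1, keepdims=True))
    w = w / w.sum(axis=1, keepdims=True)                     # normalize over neighbors j
    fij = np.concatenate([np.repeat(f[:, None, :], n, axis=1),
                          np.repeat(f[None, :, :], n, axis=0)], axis=-1)    # (N, N, 2D)
    msg = gamma(fij)                                         # messages gamma([f_i; f_j])
    return (w[:, :, None] * msg).sum(axis=1)                 # g_i, (N, D'')

# Usage with hypothetical dimensions:
rng = np.random.default_rng(0)
D = 128
f = rng.normal(size=(6, D))                                  # 6 temporal proposals
phi1 = linear(D, 64, rng); phi2 = linear(D, 64, rng)
alpha = linear(128, 1, rng); gamma = linear(2 * D, D, rng)
g = interaction_features(f, phi1, phi2, alpha, gamma)        # (6, 128)
# g_i is later combined with f_i for trajectory prediction (Section 3.3.2).
```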
3.3.2 Proposal classification and regression

Given the proposal-local feature $f_i$ for each temporal proposal, two fully-connected layers are applied to perform classification and regression for the current frame. In line with our intuition, the proposal-local feature $f_i$ combined with the interaction feature $g_i$ is used to predict future-frame boxes, by one fully-connected layer with $3t$ output channels, where $t$ is the number of future frames to predict and 3 stands for the x coordinate, y coordinate and heading respectively. During training, temporal proposals are assigned classification and regression targets with the same strategy discussed in Subsection 3.2, and the objective is the weighted sum of the classification loss, current-frame regression loss and future-frames regression loss, similar to Equations 6-9. During inference, each proposal is predicted with a classification score and current/future boxes. Non-maximum suppression is applied based on the IoU between their current boxes, to remove redundancy.

4. Experiment

4.1. Experiment settings

Dataset: We conduct experiments on the Waymo Open Dataset (WOD) [1] and the Lyft Dataset (Lyft) [12]. WOD contains lidar data from 5 sensors and labels for 1000 segments. Each segment contains roughly 200 frames and has a length of 20 seconds. The train and validation subsets have 798 and 202 segments respectively. To model the history and predict the future, we take 1 second of history frames and 3 seconds of future frames for each example, and extract examples from the center 16 seconds (1s∼17s) of each segment. Thus 126,437 train examples and 31,998 validation examples are extracted, each with 1 second of history and 3 seconds of future. We sample 6 frames, including 5 history frames and the current frame, with $t_{input} = \{-1.0, -0.8, -0.6, -0.4, -0.2, 0\}$, and the point clouds from those frames are fed into the network as inputs. In order to build richer voxel features while saving computation and memory, every two frames are combined by concatenating the voxelization output features; thus we have three pillar features, as discussed in Subsection 3.1.

Model | MF | TS | DE@1 ↓ | DE@2 ↓ | DE@3 ↓ | ADE ↓ | HR@1 ↑ | HR@2 ↑ | HR@3 ↑
IntentNet | ✓ | | 21.17±0.02 | 39.74±0.07 | 61.60±0.12 | 36.04±0.12 | 93.18±0.03 | 76.50±0.08 | 61.60±0.12
MF-FRCNN | ✓ | ✓ | 20.87±0.08 | 39.23±0.14 | 60.59±0.22 | 35.57±0.13 | 93.45±0.05 | 76.69±0.18 | 61.57±0.21
STINet | ✓ | ✓ | 19.63±0.03 | 37.07±0.08 | 57.60±0.14 | 33.67±0.07 | 94.36±0.05 | 78.91±0.06 | 64.43±0.15
Table 1. Trajectory prediction performance for different models on WOD. MF indicates whether the corresponding model takes multiple frames as input. TS indicates whether the model has a two-stage framework. ↑ and ↓ indicate that higher/lower numbers are better for the corresponding metric. DE and ADE are in centimeters. For models implemented by us, we train and evaluate the model five times and report the average and standard deviation (shown around ±).
Model | MF | TS | BEV AP ↑
PointPillar [29] | | | 68.57
MVF [29] | | | 74.38
StarNet [19] | | | 72.50
IntentNet [4]¹ | ✓ | | 79.43±0.10
MF-FRCNN | ✓ | ✓ | 79.69±0.19
STINet | ✓ | ✓ | 80.73±0.26
Table 2. Detection performance for different methods on WOD. MF indicates whether the corresponding model takes multiple frames as input. TS indicates whether the model has a two-stage framework. BEV AP is computed with an IoU threshold of 0.5. ↑ indicates that higher numbers are better for the corresponding metric.
¹ IntentNet without the intent prediction head, implemented by us.

For the future prediction, we predict trajectories for 6 future frames with $t_{future} = \{0.5, 1.0, 1.5, 2.0, 2.5, 3.0\}$ (in seconds). The range is 150m by 150m around the self-driving car, and we use a pillar size of 31.25cm by 31.25cm to generate pillar features of shape 480 × 480. Lyft contains lidar data from 1 sensor and labels for only 180 segments, with 140 and 40 segments for train and validation respectively. With the same settings, 14,840 and 4,240 examples are extracted for train and validation. Each example has a 1-second history and a 3-second future. We use $t_{future} = \{0.6, 1.2, 1.8, 2.4, 3.0\}$ for Lyft due to its 5Hz sampling rate.

Evaluation metric: The evaluation metric for detection is BEV AP (bird-eyes-view average precision) with the IoU threshold set to 0.5. Objects with fewer than 5 points are considered hard and are excluded during evaluation. For trajectory prediction, we employ the metrics used in [4, 9]. For $t \in t_{future}$, we compute DE@$t$ (Displacement Error) and HR@$t$ (Hit Rate) with a displacement error threshold of 0.5m. We also compute the ADE (Average Displacement Error), which equals $\frac{1}{|t_{future}|} \sum_{t \in t_{future}} \text{DE@}t$.

Implementation: Our models are implemented in TensorFlow and trained with the Adam optimizer on TPUv3 for 140k and 70k iterations for the Waymo Open Dataset and the Lyft Dataset respectively. The learning rate is $4 \times 10^{-4}$ and the batch size is 1 per TPU core. We use 32 TPU cores for training, so the effective batch size is 32.

We also implement IntentNet [4] and Faster-RCNN [21] in TensorFlow as baselines, noted as "IntentNet" and "MF-FRCNN". Our implemented IntentNet (1) takes multiple frames as input and shares the same backbone net as STINet, and (2) removes the intent classification part and only regresses a future trajectory. MF-FRCNN refers to a Faster-RCNN [21] model with several changes: (1) it uses the same backbone net as STINet (see Section 3.1); (2) for each object proposal, in addition to the bounding box, we also regress future trajectories and headings. Note that the difference between proposals from MF-FRCNN and our method is that MF-FRCNN only predicts the current box of each object, while our method exploits a novel Temporal RPN which also generates the corresponding history boxes associated with each current box.
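For reference, the following is a minimal Python sketch of the trajectory metrics defined above (DE@$t$, ADE, and HR@$t$ with a 0.5m threshold). It assumes predictions have already been matched to ground-truth pedestrians, and the array shapes are illustrative assumptions.

```python
import numpy as np

def trajectory_metrics(pred, gt, threshold=0.5):
    """Compute DE@t, ADE and HR@t for matched trajectories.

    pred, gt: (N, T, 2) arrays of predicted / ground-truth future (x, y)
    positions, in meters, for N pedestrians over T future timestamps."""
    err = np.linalg.norm(pred - gt, axis=-1)     # (N, T) displacement errors
    de_at_t = err.mean(axis=0)                   # DE@t: mean error per horizon
    ade = de_at_t.mean()                         # ADE: average of DE@t over t_future
    hr_at_t = (err < threshold).mean(axis=0)     # HR@t: fraction within 0.5 m
    return de_at_t, ade, hr_at_t
```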
4.2. Results on Waymo Open Dataset

The main results for pedestrian detection and trajectory prediction on the Waymo Open Dataset are summarized in Table 2 and Table 1. For detection, we compare our proposed method (in the last row) with current state-of-the-art detectors [19, 29]; our method surpasses the off-the-shelf baselines by a very large margin, improving the BEV AP from 74.38 to 80.73. To rule out the effects of multi-frame inputs and different implementation details, we also compare with our implementations of IntentNet and multi-frame Faster-RCNN [21], noted as "IntentNet" and "MF-FRCNN" in Table 2. Our proposed method outperforms all baselines, which confirms the effectiveness of our T-RPN and the STI modeling of proposals.

In Table 1 we compare the trajectory prediction performance of our proposed method, IntentNet and MF-FRCNN. Our proposed method surpasses all competitors by a large margin, and the improvement is larger than the improvement on detection. This aligns with our intuition, since T-RPN and STI modeling are designed to better model objects' movement and are therefore more useful for forecasting their trajectories.

For a detailed comparison of STINet and MF-FRCNN, we evaluate detection and trajectory prediction by breaking the objects down into five bins based on the future trajectory length within 3s. The five bins are 0∼2.5m, 2.5∼5m, 5∼7.5m, 7.5∼10m and 10m∼∞ respectively. We report BEV AP, ADE and the relative improvement in Tables 3 and 4. STINet is consistently better than MF-FRCNN for both tasks. For trajectory prediction on objects moving more than 5m, the relative improvements are significant and consistently more than 10%. This confirms that the proposed method can leverage the details of the history information and provide much better trajectory predictions, especially for pedestrians with larger movement.

Model | 0∼2.5 | 2.5∼5 | 5∼7.5 | 7.5∼10 | 10∼∞
MF-FRCNN | 63.07 | 90.44 | 93.27 | 88.00 | 77.15
STINet | 64.23 | 91.15 | 94.46 | 88.97 | 80.50
∆% | 1.8% | 0.8% | 1.3% | 1.1% | 4.3%
Table 3. Bird-eyes-view average precision (BEV AP) breakdown comparison of MF-FRCNN and STINet on WOD. Objects are split into five bins based on the future trajectory length, with a bin size of 2.5m. The last row is the relative improvement of STINet.

Model | 0∼2.5 | 2.5∼5 | 5∼7.5 | 7.5∼10 | 10∼∞
MF-FRCNN | 26.90 | 37.56 | 46.39 | 104.60 | 173.50
STINet | 26.73 | 35.42 | 41.18 | 89.74 | 137.17
∆% | 0.6% | 6.0% | 11.2% | 14.2% | 20.9%
Table 4. Average displacement error (ADE, in centimeters) breakdown comparison of MF-FRCNN and STINet on WOD. Objects are split into five bins based on the future trajectory length, with a bin size of 2.5m. The last row is the relative improvement of STINet.

Model | BEV AP ↑ | DE@3 ↓ | ADE ↓ | HR@3 ↑
MF-FRCNN | 33.90 | 82.61 | 51.11 | 49.74
STINet | 37.15 | 76.17 | 46.09 | 50.73
Table 5. Detection and trajectory prediction performance on Lyft.

LG | LD | BEV AP ↑ | DE@3 ↓ | ADE ↓ | HR@3 ↑
✓ | | 80.38 | 64.15 | 37.67 | 58.46
| ✓ | 79.69 | 59.71 | 34.96 | 62.22
✓ | ✓ | 80.53 | 58.95 | 34.49 | 62.99
Table 6. Ablation studies on local geometry and local dynamic features (noted as LG and LD respectively). All entries are trained without the History Path and Interactive features.

L+G | Path | DE@3 ↓ | ADE ↓ | HR@3 ↑
✓ | | 58.95 | 34.49 | 62.99
✓ | ✓ | 58.04 | 33.92 | 63.87
† | ✓ | 67.80 | 39.86 | 52.25
Table 7. Ablation studies on the history path feature. † indicates the corresponding feature is used only for detection and ignored when predicting the trajectory.

4.3. Results on Lyft Dataset
The detection and trajectory prediction results on the Lyft Dataset are summarized in Table 5. The performance on both tasks is improved by a large margin, and the results confirm the effectiveness of the proposed method on a small-scale dataset.

4.4. Ablation studies

In this section, we conduct ablation experiments to analyze the contribution of each component and compare our model with potential alternative methods on the Waymo Open Dataset. The results are summarized below. For clarity, we only show DE@3, ADE and HR@3 for trajectory prediction; the other metrics follow the same tendency.

Breakdown | I | DE@3 ↓ | ADE ↓ | HR@3 ↑
All | | 58.04 | 33.92 | 63.87
All | ✓ | 57.60 | 33.67 | 64.43
Group | | 49.67 | 30.85 | 64.87
Group | ✓ | 48.89 | 30.40 | 65.55
Table 8. Ablation studies on interaction features. "I" indicates whether the proposal interaction modeling is adopted. "All" and "Group" correspond to evaluation on all pedestrians and on pedestrians belonging to a group with at least 5 pedestrians, respectively.

Effect of local geometry and local dynamic features: We conduct experiments to analyze the effect of the local geometry and local dynamic features, summarized in Table 6. The local geometry feature is good for detection and the local dynamic feature is good for trajectory prediction. The geometry feature by itself does not work well for trajectory prediction since it ignores dynamics in favor of better detection. By combining both features, the benefits in detection and trajectory prediction can be obtained simultaneously.

Effect of history path: Although objects' geometry and movement are already represented by the local geometry and dynamic features, taking the history path as an extra feature gives another performance gain, improving DE@3 from 58.95 to 58.04 and HR@3 from 62.99 to 63.87 (as shown in the first two rows of Table 7). This suggests that the history path, as the simplest and most direct representation of objects' movement, can still help on top of the rich representations. However, the history path by itself is far from sufficient for accurate trajectory prediction, as suggested by the poor performance in the last row of Table 7.

Effect of proposal interaction modeling: To demonstrate the effectiveness of the proposed pedestrian interaction modeling, we measure the performance for all pedestrians as well as for pedestrians in a group. Specifically, we design a heuristic rule (based on locations and speeds) to discover pedestrian groups and assign each pedestrian a group label on the evaluation set. The details of the grouping algorithm can be found in the supplementary material. We evaluate the trajectory prediction performance on all pedestrians and on the pedestrians belonging to a group with at least 5 pedestrians, shown in Table 8. The interaction modeling improves trajectory prediction performance on all pedestrians and achieves a larger boost for pedestrians that belong to groups (DE@3 improved from 49.67 to 48.89, i.e., by 1.6%).

4.5. Model inference speed

We measure the inference speed of our proposed model as well as the baseline models on context ranges of 100m by 100m and 150m by 150m. All models are implemented in TensorFlow and inference is executed on a single NVIDIA Tesla V100 GPU. For the context range of 100m by 100m, IntentNet, MF-FRCNN and STINet have inference times of 60.9, 69.4 and 74.6 ms respectively. Both two-stage models (MF-FRCNN and STINet) are slower than the single-stage model, and STINet is slightly slower than MF-FRCNN. However, all three models achieve a real-time inference speed of higher than 10Hz. For the maximum range of the Waymo Open Dataset, i.e., 150m by 150m, the three models have inference times of 122.9, 132.1 and 144.7 ms respectively.
Figure 5. Qualitative examples of STINet. The blue boxes are detected pedestrians. The cyan and yellow lines are the predicted future and history trajectories of STINet respectively.

4.6. Qualitative results

The visualization of STINet's predictions is shown in Figure 5. The blue boxes are the detected pedestrians. The cyan and yellow lines are the predicted future and history trajectories for each detected pedestrian respectively. We show two scenarios: the SDC is stationary in the upper sub-figure and is moving fast in the lower sub-figure. Our model detects and predicts very accurately in both cases.

Figure 6 shows a detailed comparison between STINet and MF-FRCNN against the ground-truth for trajectory prediction. Green boxes are the ground-truth boxes. Yellow, pink and cyan lines are the ground-truth future trajectory and the predicted future trajectories from MF-FRCNN and STINet respectively. For the left two pedestrians, who are walking in a straight line, both MF-FRCNN and STINet predict the future trajectory reasonably well, but MF-FRCNN still shows a small error compared with the ground-truth; for the right-most pedestrian, who is making a slight left turn, MF-FRCNN fails to capture the details of the movement and gives an unsatisfactory prediction, while STINet gives a much better trajectory prediction.

Figure 6. Comparison between MF-FRCNN and STINet. The yellow line is the ground-truth future trajectory for pedestrians. The pink and cyan lines are the predicted future trajectories from MF-FRCNN and STINet respectively. It is clear that our proposed method gives a much better prediction than the baseline for all three pedestrians. Upper: overview of the three pedestrians. Lower: zoomed-in visualization of the three pedestrians.

5. Conclusion

In this paper, we propose STINet to perform joint detection and trajectory prediction with raw lidar point clouds as the input. We propose to build temporal proposals with pedestrians' current and past boxes, and to learn a rich representation for each temporal proposal from local geometry, dynamic movement, history path and interaction features. We show that by explicitly modeling spatio-temporal-interaction features, both detection and trajectory prediction quality can be drastically improved compared with single-stage and two-stage baselines. This also prompts us to rethink the importance of introducing a second stage and proposals, especially for the joint detection and trajectory prediction task. Comprehensive experiments and comparisons with baselines and the state of the art confirm the effectiveness of our proposed method; it significantly improves prediction quality while still achieving real-time inference speed, which makes our model practical for real-world applications. Combining camera/map data and utilizing longer histories with LSTMs could further improve the prediction, and we will explore these in future work.

References
[1] Waymo open dataset: An autonomous driving dataset, 2019. 2, 5
[2] Alexandre Alahi, Kratarth Goel, Vignesh Ramanathan, Alexandre Robicquet, Li Fei-Fei, and Silvio Savarese. Social LSTM: Human trajectory prediction in crowded spaces. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 961–971, 2016. 1, 3
[3] Peter W Battaglia, Jessica B Hamrick, Victor Bapst, Alvaro Sanchez-Gonzalez, Vinicius Zambaldi, Mateusz Malinowski, Andrea Tacchetti, David Raposo, Adam Santoro, Ryan Faulkner, et al. Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261, 2018. 2
[4] Sergio Casas, Wenjie Luo, and Raquel Urtasun. IntentNet: Learning to predict intention from raw sensor data. In Conference on Robot Learning, pages 947–956, 2018. 1, 2, 3, 6
[5] Ming-Fang Chang, John Lambert, Patsorn Sangkloy, Jagjeet Singh, Slawomir Bak, Andrew Hartnett, De Wang, Peter Carr, Simon Lucey, Deva Ramanan, et al. Argoverse: 3d tracking and forecasting with rich maps. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8748–8757, 2019. 3
[6] Ross Girshick. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pages 1440–1448, 2015. 1, 2
[7] Agrim Gupta, Justin Johnson, Li Fei-Fei, Silvio Savarese, and Alexandre Alahi. Social GAN: Socially acceptable trajectories with generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2255–2264, 2018. 1, 3
[8] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pages 2961–2969, 2017. 4
[9] Joey Hong, Benjamin Sapp, and James Philbin. Rules of the road: Predicting driving behavior with a convolutional model of semantic interactions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8454–8462, 2019. 1, 3, 6
[10] Rui Hou, Chen Chen, and Mubarak Shah. Tube convolutional neural network (T-CNN) for action detection in videos. In Proceedings of the IEEE International Conference on Computer Vision, pages 5822–5831, 2017. 2
[11] Vicky Kalogeiton, Philippe Weinzaepfel, Vittorio Ferrari, and Cordelia Schmid. Action tubelet detector for spatio-temporal action localization. In Proceedings of the IEEE International Conference on Computer Vision, pages 4405–4413, 2017. 2
[12] R. Kesten, M. Usman, J. Houston, T. Pandya, K. Nadhamuni, A. Ferreira, M. Yuan, B. Low, A. Jain, P. Ondruska, S. Omari, S. Shah, A. Kulkarni, A. Kazakova, C. Tao, L. Platinsky, W. Jiang, and V. Shet. Lyft Level 5 AV dataset 2019. https://level5.lyft.com/dataset/, 2019. 2, 5
[13] Alex H Lang, Sourabh Vora, Holger Caesar, Lubing Zhou, Jiong Yang, and Oscar Beijbom. PointPillars: Fast encoders for object detection from point clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 12697–12705, 2019. 1, 2, 3
[14] Namhoon Lee, Wongun Choi, Paul Vernaza, Christopher B Choy, Philip HS Torr, and Manmohan Chandraker. DESIRE: Distant future prediction in dynamic scenes with interacting agents. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 336–345, 2017. 3
[15] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, pages 2980–2988, 2017. 1, 2
[16] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C Berg. SSD: Single shot multibox detector. In European Conference on Computer Vision, pages 21–37. Springer, 2016. 1, 2
[17] Wenjie Luo, Bin Yang, and Raquel Urtasun. Fast and furious: Real time end-to-end 3d detection, tracking and motion forecasting with a single convolutional net. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3569–3577, 2018. 1, 2, 3
[18] Anton Milan, S Hamid Rezatofighi, Anthony Dick, Ian Reid, and Konrad Schindler. Online multi-target tracking using recurrent neural networks. In Thirty-First AAAI Conference on Artificial Intelligence, 2017. 1
[19] Jiquan Ngiam, Benjamin Caine, Wei Han, Brandon Yang, Yuning Chai, Pei Sun, Yin Zhou, Xi Yi, Ouais Alsharif, Patrick Nguyen, et al. StarNet: Targeted computation for object detection in point clouds. arXiv preprint arXiv:1908.11069, 2019. 6
[20] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 779–788, 2016. 1, 2
[21] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91–99, 2015. 1, 2, 4, 6
[22] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 234–241. Springer, 2015. 3
[23] Adam Santoro, David Raposo, David G Barrett, Mateusz Malinowski, Razvan Pascanu, Peter Battaglia, and Timothy Lillicrap. A simple neural network module for relational reasoning. In Advances in Neural Information Processing Systems, pages 4967–4976, 2017. 2
[24] Chen Sun, Abhinav Shrivastava, Carl Vondrick, Rahul Sukthankar, Kevin Murphy, and Cordelia Schmid. Relational action forecasting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 273–283, 2019. 2
[25] Peng Tang, Chunyu Wang, Xinggang Wang, Wenyu Liu, Wenjun Zeng, and Jingdong Wang. Object detection in videos by high quality object linking. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019. 2
[26] Xiaolong Wang and Abhinav Gupta. Videos as space-time region graphs. In Proceedings of the European Conference on Computer Vision (ECCV), pages 399–417, 2018. 2
[27] Yan Yan, Yuxing Mao, and Bo Li. SECOND: Sparsely embedded convolutional detection. Sensors, 18(10):3337, 2018. 4
[28] Jianwei Yang, Jiasen Lu, Stefan Lee, Dhruv Batra, and Devi Parikh. Graph R-CNN for scene graph generation. In Proceedings of the European Conference on Computer Vision (ECCV), pages 670–685, 2018. 2
[29] Yin Zhou, Pei Sun, Yu Zhang, Dragomir Anguelov, Jiyang Gao, Tom Ouyang, James Guo, Jiquan Ngiam, and Vijay Vasudevan. End-to-end multi-view fusion for 3d object detection in lidar point clouds. arXiv preprint arXiv:1910.06528, 2019. 1, 6
[30] Yin Zhou and Oncel Tuzel. VoxelNet: End-to-end learning for point cloud based 3d object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4490–4499, 2018. 1, 3