SoDA: Multi-Object Tracking with Soft Data Association

Wei-Chih Hung 1, Henrik Kretzschmar 1, Tsung-Yi Lin 2, Yuning Chai 1, Ruichi Yu 1, Ming-Hsuan Yang 2,3, and Dragomir Anguelov 1
1 Waymo LLC, 2 Google LLC, 3 UC Merced

Abstract

Robust multi-object tracking (MOT) is a prerequisite for a safe deployment of self-driving cars. Tracking objects, however, remains a highly challenging problem, especially in cluttered autonomous driving scenes in which objects tend to interact with each other in complex ways and frequently get occluded. We propose a novel approach to MOT that uses attention to compute track embeddings that encode the spatiotemporal dependencies between observed objects. This attention measurement encoding allows our model to relax hard data associations, which may lead to unrecoverable errors. Instead, our model aggregates information from all object detections via soft data associations. The resulting latent space representation allows our model to learn to reason about occlusions in a holistic data-driven way and maintain track estimates for objects even when they are occluded. Our experimental results on the Waymo Open Dataset suggest that our approach leverages modern large-scale datasets and performs favorably compared to the state of the art in visual multi-object tracking.

1. Introduction

Being able to detect and track multiple moving objects in crowded environments simultaneously is a prerequisite for a safe deployment of self-driving cars [11, 16, 24]. Despite remarkable advances in object detection in recent years [31, 23, 22], multi-object tracking (MOT) in complex scenes remains a highly challenging problem. As the objects move around the scene, they may interact with each other in complex ways that are hard to predict. Furthermore, the objects may frequently occlude each other, which may result in missed detections.
Finally, changes in object appearance, as well as similar object appearances, may make it difficult to recognize previously seen objects.

In the commonly used “tracking-by-detection” paradigm, a tracker fuses detections to produce object tracks that are consistent over time. A key challenge, therefore, is to associate incoming detections of previously observed objects with the corresponding existing tracks. Data association in most existing methods is based on similarity scores computed between the detections and the existing tracks. The representation of existing tracks can rely on the last detections [48] or it can be aggregated from historical detections associated with the tracks [26, 33]. Figure 2 groups approaches to multi-object tracking by how they leverage context information.

[Figure 1 diagram: detections at times t-2, t-1, and t pass through the stages Detection, Measurement Encoding, and Data Association, yielding a probability distribution over the candidate embeddings and $z_{\text{occ}}$.]

Figure 1: Overview. We propose a novel approach to multi-object tracking. Given object detections based on the measurements $z^0$, our model encodes spatiotemporal context information for each measurement with $N$ self-attention layers, resulting in features $z^N$ that learn from soft association values, which do not rely on hard-associated tracks. Based on the aggregated features, the model then predicts a probability distribution for each track that captures soft data associations and a latent state $z_{\text{occ}}$, which indicates that the track is occluded.

arXiv:2008.07725v2 [cs.CV] 19 Aug 2020

One common assumption made by the above-mentioned methods is that their track state estimates, updated by only the hard-associated detections, capture all information relevant for subsequent tracking decisions. This assumption, however, is suboptimal as it collapses multiple data association hypotheses into a single mode, ignoring the contextual information from other unselected association candidates and the effects of possible association errors. The error introduced by a single incorrect association may propagate and substantially affect subsequent data associations, especially when the input detections are noisy.

[Figure 2 panels, each over t=0 to t=3: (a) Exemplar, (b) Recurrence, (c) Interaction, (d) Attention (ours).]

Figure 2: Comparison of encoding methods. We illustrate how tracking methods use context and history. The red and orange circles represent detections associated with tracks. The white circles represent incoming detections that are yet to be associated. (a) Consider how similar an incoming detection is to the latest detection associated with each track. (b) Aggregate information from all detections that are associated with each track via hard data association. (c) Share information between tracks. (d) Aggregate information from all detections to leverage the spatiotemporal context without committing to any hard data associations.

To relax the hard data association constraint, some methods use multi-hypothesis tracking [8, 17, 29], and others simply pick the top associated candidates by applying a threshold to the similarities. This, however, requires complex heuristics and hyper-parameter tuning and may easily overfit to certain scenes. Instead, a unified and data-driven approach may be able to implicitly aggregate information from all candidate detections across frames to update track states without committing to any hard association. Such a soft data association framework may be able to learn long-term and highly interactive relationships between detections and tracks from large-scale datasets without cumbersome heuristics and hyper-parameters.

Inspired by recent advances in natural language processing [9, 10, 20, 36, 43], we formulate MOT as a sequence-to-sequence problem using attention models, where the input of the model is a series of frames with detections, and the output consists of the trajectories of the detected objects. Specifically, we propose an attention measurement encoding to compute track embeddings that reason about the spatiotemporal dependencies between all objects, as observed by all detections in a given temporal window. This allows us to avoid hard data association, which is prone to error propagation and may lead to unrecoverable states. Instead, our method maintains a latent space representation that aggregates information from soft data associations.

The soft associations further allow us to explicitly reason about occlusions in a data-driven way. Most existing methods determine the associations based on the similarity between an individual pair of a detection and a track. However, picking a single threshold to determine if an object is occluded without context information is challenging. The track embeddings computed by our method, on the other hand, contain rich contextual information for effective occlusion reasoning.

To evaluate the effectiveness of the proposed approach, we present an extensive ablation study on the Waymo Open Dataset [39] as well as other benchmark datasets. The results suggest that our method performs favorably when compared to the state of the art in visual multi-object tracking.

2. Related Work

Multi-Object Tracking. Most existing approaches to multi-object tracking, including this one, follow the “tracking-by-detection” paradigm. Offline methods [34, 37, 40, 41, 42, 47, 50, 53, 54] process an entire video in a batch fashion. However, these methods are not applicable to most online applications. An autonomous vehicle, for instance, must predict the states of objects immediately when new detections become available.
Most recent approaches to multi-object tracking are therefore online methods that do not depend on future frames [3, 4, 7, 17, 25, 33, 45, 48, 52, 55]. Most online methods estimate similarity scores between the detections and the existing tracks based on various cues, such as predicted bounding boxes [4] and appearance similarity [18]. While some methods [4, 18] only take the last detection corresponding to a track into account, other techniques aggregate temporal information into a track history. For instance, Deep SORT [48] computes the maximum similarity between the detection and any detections that were previously associated with the track. Other methods rely on recurrent neural networks to accumulate temporal information [7, 17, 26, 33, 55]. Sadeghian et al. [33] apply several recurrent neural networks to learn per-track motion cues, interaction cues, and appearance cues. Rather than inferring interactions directly from the tracks, however, their method relies on a rasterized occupancy map as a proxy for inferring interaction cues. For each incoming detection, the recurrence methods update an internal representation that corresponds to the matching tracks. Here, however, an erroneous data association can cause the internal representation to end up in a bad state, which the method may not be able to recover from. To tackle this issue, we propose to break up the causality between spatiotemporal information aggregation and data association. We encode the detections along with soft attentions to all the other detections in a fixed temporal window into a latent space representation without committing to any hard data associations.

Occlusion Reasoning. A key challenge in multi-object tracking is to robustly track objects even when they cannot be observed due to occlusions or false negatives of the trained object detector.
Even though some methods aim to recover detections by applying a single object tracker [55] or a regression head in the detector [3], the object may still be missed during full occlusions. Most online trackers, therefore, predict an occlusion whenever the similarity scores are below a fine-tuned threshold and adopt the buffer-and-recover mechanism [17] by extrapolating the state with a motion model, or by relying on appearance re-identification [48]. Predefined motion models, however, often cannot accurately predict target positions during occlusion, especially in complex interactive scenes. Choosing a single hand-crafted similarity threshold to predict occlusions in a variety of complex scenes may be suboptimal. We therefore directly learn to explicitly predict occlusions without any hand-crafted geometry or appearance similarity constraints by introducing an occlusion state in a latent representation. The virtual candidates proposed by FAMNet [6] are similar to our explicit occlusion reasoning. However, FAMNet chooses the locations of the virtual candidates by using heuristics, while our approach learns the occlusion embedding without any hand-crafted rules.

Attention Networks. Transformer networks [9, 10, 20, 36, 43] and other graph neural networks [2, 44, 49] have recently gained momentum. In addition to successful applications to natural language processing formulated as sequence-to-sequence problems, there is evidence that suggests that Transformer networks can capture long-range dependencies and interactions between agents [38, 46]. We formulate multi-object tracking as a sequence-to-sequence problem, where the input consists of sequences of frames with detections, and the output consists of sequences that may have different start and end times. Other recent methods have adopted attention mechanisms to capture contextual information [7, 52, 55].
However, these techniques encode information conditioned on trajectories formed by hard data associations. We propose attention measurement encoding to aggregate spatiotemporal information without conditioning on past trajectory estimates, allowing our method to gracefully handle erroneous data associations.

3. Track with Soft Data Association

We adopt the tracking-by-detection paradigm, where a tracker fuses object detections to produce object tracks that are consistent over time. We propose an attention measurement encoding mechanism to aggregate the spatiotemporal context into a latent space representation without committing to hard data associations. We further propose an attention association mechanism that explicitly reasons about occlusions without any predefined similarity thresholds. The proposed attention measurement encoding processes the detections in a temporal window consisting of object measurements, such as position and appearance, and predicts a fixed-dimensional feature vector for each detection. The output features are then passed to the attention data association to form target trajectories. See Figure 1 for an overview of our approach. We will explain the two proposed modules in detail in the rest of this section.

3.1. Attention Measurement Encoding

Most existing tracking methods associate incoming detections in a pairwise fashion with object states predicted by a simple motion model, such as a constant velocity model, using a Kalman filter or recurrent neural networks. Recent work, however, has demonstrated that aggregating temporal information, as well as context information, may improve multi-object tracking by exploiting higher-order information in addition to pairwise similarities between detections [33]. We propose to leverage the spatiotemporal dependencies by using an attention measurement encoding mechanism that applies self-attention layers to soft associated detection measurements.
This enables us to implicitly learn motion and context models without any hard data associations.

Attention with Transformer Network. For each new detection obtained at time $t$ comprising raw measurements $x_{t,i}$, a feed-forward neural network computes a measurement embedding $z^0_{t,i}$. This network consists of two fully-connected layers, followed by a normalization layer [1]. The two fully-connected layers use ReLU and linear activation functions, respectively. We then apply stacked attention measurement encoding layers on top of the measurement features $z^0$ to encode the spatiotemporal context information from each detection without any hard data associations. Considering detections at time $t$, our goal is to encode the information from the detections obtained in the past $L_{\text{enc}}$ frames, i.e., $\{ z_{\tau,i} \mid \forall i, \tau \in [t - L_{\text{enc}}, t] \}$. Following the Transformer architecture [43], an attention measurement encoding layer consists of two sub-layers, where the first is a self-attention layer, and the second is a point-wise feed-forward network. Both sub-layers follow the “Add-&-Norm” structure adopted by [43], where the output of each sub-layer is added to its input and passed through layer normalization [1] before it is fed to the following layer, i.e., $z^{o} = \mathrm{LayerNorm}(F(z^{i}) + z^{i})$, with $z^{i}$ the sub-layer input and $z^{o}$ its output. In the self-attention sub-layer, a measurement embedding feature is updated with scaled dot-product attention, leading to

$$ z^{o}_{i} = \sum_{j} \mathrm{softmax}\left( \frac{Q_i K_j^{\top}}{\sqrt{d_k}} \right) V_j, \qquad (1) $$

where $Q_i = W_q z_i$, $K_i = W_k z_i$, and $V_i = W_v z_i$ are the query, key, and value features obtained by applying a linear transformation to an embedding feature $z_i$ with size $d_k$, and $j$ indexes all detections in the previous $L_{\text{enc}}$ frames. This means that each measurement embedding will be updated by using a weighted sum of past measurements. After self-attention, we apply layer normalization followed by a feed-forward network with a fully-connected layer and “Add-&-Norm”.
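To make Equation 1 concrete, the following is a minimal pure-Python sketch of one scaled dot-product self-attention update over the embeddings in a temporal window. It assumes the standard Transformer form, in which each output is the attention-weighted sum of the values of all attended detections. The helper names (`matvec`, `softmax`, `self_attention`) and the toy weights are illustrative assumptions rather than the authors' implementation, and the residual connection, layer normalization, and feed-forward sub-layer are omitted.

```python
import math


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(z, W_q, W_k, W_v):
    """Update every embedding in z with a weighted sum of all values.

    z is a list of embeddings, one per detection in the temporal window.
    W_q, W_k, W_v are square projection matrices (hypothetical toy weights).
    """
    d_k = len(W_k)  # key dimension, used for the 1/sqrt(d_k) scaling
    Q = [matvec(W_q, zi) for zi in z]
    K = [matvec(W_k, zi) for zi in z]
    V = [matvec(W_v, zi) for zi in z]
    outputs = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)  # soft association to every detection
        outputs.append([
            sum(w * v[c] for w, v in zip(weights, V))
            for c in range(len(V[0]))
        ])
    return outputs


# Toy usage: three 2-d embeddings and identity projections.
identity = [[1.0, 0.0], [0.0, 1.0]]
window = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(self_attention(window, identity, identity, identity))
```

Because the softmax weights are non-negative and sum to one, every output is a convex combination of the value vectors, which is what allows each detection to draw on all others without selecting a single one.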
Overall, the attention measurement encoding mechanism consists of $N_{\text{enc}}$ such attention layers to increase the capacity of the model.

Relative Time Encoding. Self-attention networks are unordered. It is therefore important to properly encode the temporal information. One way to do this is to have the attention value $A_{i,j} = Q_i K_j^{\top}$ in Equation 1 consider the temporal difference, $(t_i - t_j)$, between $z_i$ and $z_j$. In this work, we employ the relative encoding proposed by Transformer-XL [9], leading to

$$ A_{i,j} = Q_i K_j^{\top} + Q_i R_{i-j}^{\top} + u K_j^{\top} + v R_{i-j}^{\top}, \qquad (2) $$

where $R_k \in \mathbb{R}^{d_k}$, $k \in [-L_{\text{enc}}, L_{\text{enc}}]$, is a learned relative attention feature, and $u, v \in \mathbb{R}^{d_k}$ are bias terms for the measurement attention and the relative attention. Note that, in contrast to our work, the Transformer-XL obtains $R_k$ by applying the linear transform to the sinusoidal positional encoding proposed by the original formulation of the Transformer [43]. The reason for this modification is that we found that using the sinusoidal positional encoding often leads to non-convergence, whereas directly learning $R_k$ yields superior performance. Figure 2 illustrates how our method differs from existing methods in terms of how contextual information flows while tracking objects.

3.2. Attention Association and Explicit Occlusion Reasoning

Data association refers to the task of assigning each incoming detection $d$ in a given frame to an existing track $T$ or a new track. To this end, many existing methods estimate a similarity score $s(d, T)$ between pairs of detections and tracks. The Hungarian algorithm [27] then computes the optimal assignment between detections and tracks using bipartite matching. In multi-object tracking, however, objects tend to occlude each other, which can lead to missed detections, also known as false negative detections.
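As a point of reference for the conventional pipeline described above, the sketch below finds the assignment that maximizes the total similarity between tracks and detections. It uses brute-force enumeration as a stand-in for the Hungarian algorithm, so it is only practical for toy problems; the function name and the similarity values are hypothetical. In the example, the object behind track 0 is not detected, yet plain optimal matching still pairs track 0 with the remaining nearby detection.

```python
from itertools import permutations


def optimal_assignment(similarity):
    """Return, for each track, the detection index that maximizes total similarity.

    similarity[t][d] is the score between track t and detection d.
    Brute force over permutations (a stand-in for the Hungarian algorithm);
    assumes there are at least as many detections as tracks.
    """
    n_tracks = len(similarity)
    n_dets = len(similarity[0])
    if n_tracks > n_dets:
        raise ValueError("this toy sketch assumes n_tracks <= n_dets")
    best_score, best_perm = float("-inf"), None
    for perm in permutations(range(n_dets), n_tracks):
        score = sum(similarity[t][d] for t, d in enumerate(perm))
        if score > best_score:
            best_score, best_perm = score, perm
    return list(best_perm)


# Hypothetical scores: detection 0 belongs to track 1, detection 1 is an
# unrelated nearby object, and the object of track 0 is currently occluded.
similarity = [
    [0.3, 0.2],  # track 0
    [0.9, 0.1],  # track 1
]
print(optimal_assignment(similarity))  # prints [1, 0]
```

Track 1 correctly receives detection 0, but track 0 is forced onto detection 1 even though its score is low, which is the failure mode that a similarity threshold or an explicit occlusion state is meant to prevent.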
A robust tracking technique must not erroneously associate another nearby detection with the track that corresponds to the occluded object. Many methods [48, 33, 55], therefore, apply a threshold to the similarity scores to prevent such detections from erroneously being associated with the track. Using the same threshold for all data associations, however, may lead to sub-optimal results despite substantial efforts to fine-tune the threshold, especially on large-scale datasets.

[Figure 3 diagram: two Association/Trajectory panels over t=0 to t=3, with association values between 0.1 and 0.7 attached to incoming detections and occlusion states.]

Figure 3: Attention association with explicit occlusion reasoning. Our method explicitly reasons about occlusions by attending to a separate occlusion state. The orange circles refer to associated detections, the white circles refer to incoming detections, and the gray circles refer to occlusion states. The model classifies a track as occluded if the track embedding most strongly attends to the occlusion state and maintains the state embedding for future association.

We propose an attention association mechanism to explicitly reason about missed detections. We formulate the data association problem as a dynamic classification problem. Each track effectively chooses one of the detections in the incoming frame. We obtain the classification scores by computing the attention values between a track embedding $z_T$ and all detection embeddings $z_{d_i}$. In our method, the track embedding $z_T$ is simply the embedding of the latest associated detection of a trajectory. In addition, considering occlusion, a track does not necessarily select one detection out of all available ones, as mentioned earlier in this section. Therefore, we propose to learn an occlusion state occ with embedding $z_{\text{occ}} \in \mathbb{R}^{d_k}$, which represents the class that a track can choose in order not to associate with any available detection.
As a result, the probability that a track $T$ is associated with a detection $d_i$, or the latent occlusion state $o$, can be cast as a softmax function, leading to

$$ p(d_i \mid T) = \frac{\exp(z_{d_i}^{\top} z_T)}{\exp(z_{\text{occ}}^{\top} z_T) + \sum_j \exp(z_{d_j}^{\top} z_T)}, \qquad (3) $$

where the logits are the attention values between the track embeddings and the detection embeddings, and $p(o \mid T)$ is defined analogously with $\exp(z_{\text{occ}}^{\top} z_T)$ in the numerator. We can then train the model using a cross-entropy association loss for each alive track, giving

$$ L(T) = -y_{T,o} \log(p(o \mid T)) - \sum_i y_{T,i} \log(p(d_i \mid T)), \qquad (4) $$

where $y_{T,i} = 1$ if track $T$ and detection $d_i$ belong to the same identity, otherwise $y_{T,i} = 0$, and $y_{T,o} = 1$ only if $y_{T,i} = 0$ for all $i$. During inference, each track will associate with the detection $d_i$, $i = \arg\max_j p(d_j \mid T)$. However, if $p(o \mid T) > p(d_i \mid T)$, the track will be marked as “occluded” and will not be associated with any detection.

3.3. Track Management

We adopt a simple track management technique that is similar to the mechanism used by SORT [4] to control the initialization and the termination of tracks. When an incoming detection is not associated with any track, it is used to initialize an “unpromoted” track for which we do not output a tracked target until it is “promoted”. An unpromoted track becomes a promoted track when it is associated with any detection in the following frames. A track is terminated if it has no associated detection for $T_{\text{lost}}$ consecutive frames. We set different thresholds for unpromoted and promoted tracks, denoted as $T^{UP}_{\text{lost}}$ and $T^{P}_{\text{lost}}$, where $T^{P}_{\text{lost}}$ is usually larger than $T^{UP}_{\text{lost}}$ since unpromoted tracks often have a higher probability of containing false positives. If two tracks compete for one measurement, the assignment step (an optimal assignment such as Hungarian matching, or a greedy algorithm) assigns that measurement to the track with the higher score.
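The association probabilities of Equation 3, the per-track loss of Equation 4, and the competition between tracks can be sketched together as follows. This is a minimal pure-Python illustration under stated assumptions: the greedy ordering by probability, the function names, and the 2-d toy embeddings are all hypothetical, and the paper does not spell out its exact conflict-resolution procedure.

```python
import math


def association_probs(z_track, z_dets, z_occ):
    """Softmax over detection logits plus one occlusion logit (Equation 3).

    Returns one probability per detection, followed by p(o | T) as the last entry.
    """
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    logits = [dot(z_d, z_track) for z_d in z_dets] + [dot(z_occ, z_track)]
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def association_loss(probs, target):
    """Cross-entropy of Equation 4 for one-hot labels; target == len(probs) - 1 means occluded."""
    return -math.log(probs[target])


def greedy_associate(tracks, dets, z_occ):
    """Greedy assignment (an assumed procedure): best pairs first, occlusion as fallback.

    Returns, per track, the chosen detection index or None if the track is occluded.
    """
    probs = [association_probs(t, dets, z_occ) for t in tracks]
    pairs = sorted(
        ((probs[ti][di], ti, di)
         for ti in range(len(tracks)) for di in range(len(dets))),
        reverse=True,
    )
    assigned, used = {}, set()
    for p, ti, di in pairs:
        if ti in assigned or di in used:
            continue
        if p > probs[ti][-1]:  # the detection must beat the occlusion state
            assigned[ti] = di
            used.add(di)
    return [assigned.get(ti) for ti in range(len(tracks))]


# Toy example: both tracks prefer detection 0, and detection 1 is far away.
tracks = [[2.0, 0.0], [1.5, 0.2]]
dets = [[2.0, 0.0], [0.0, 2.0]]
z_occ = [0.5, 0.5]
print(greedy_associate(tracks, dets, z_occ))  # prints [0, None]
```

In this toy case track 0 wins detection 0, and for track 1 the only remaining detection attends less strongly than the occlusion state, so track 1 is left unassociated.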
The other track will then typically be classified as occluded because other measurements are usually further from the track and have lower attention values than the occlusion state. Note that SORT always sets $T^{P}_{\text{lost}} = 1$ to compensate for a simple motion model. On the contrary, our proposed method implicitly learns the motion model from data using the proposed attention measurement encoding mechanism. Our technique can further reason about occlusions with the attention association mechanism. Therefore, both modules enable us to recover from occlusions and maintain tracks longer.

4. Experimental Evaluation

We present an extensive experimental evaluation based on several public benchmark datasets. We conduct a detailed ablation study and compare our method with the state of the art in visual multi-object tracking.

4.1. Experimental Setup

4.1.1 Implementation Details.

We implemented the proposed method in TensorFlow. We trained all the models with a single NVIDIA V100 GPU. The network is trained with a batch size of 16 using the SGD optimizer with a learning rate of 1e-3 and a momentum of 0.9, and each training sequence consists of 32 consecutive frames sampled with a random start timestamp. The input detection bounding boxes are normalized to $[0, 1]$ by the camera image width and height. For simplicity, we set $d_k = 64$ for all fully-connected layers in our network. We use the track management hyper-parameters $T^{UP}_{\text{lost}} = 2$ and $T^{P}_{\text{lost}} = 5$, unless specified otherwise.

4.1.2 Evaluation Metrics.

We adopt the CLEAR MOT metrics [19]. These metrics consider the multi-object tracking accuracy (MOTA), the number of track identity switches (IDS), the number of false positives (FP), and the number of false negatives (FN). MOTA, however, heavily depends on the performance of the object detector. Therefore, we also include the IDF1 score [32].

4.1.3 Baseline Methods.

We compare our approach with the following baseline methods to evaluate different aspects of the techniques:

• An IOU Tracker that relies on a constant velocity motion model and uses intersection-over-union (IOU) as the similarity function between predicted states and detections.
• A Center Tracker that relies on a constant velocity motion model and uses negative L2 distance between the box centers of the predicted states and the detections as the similarity function.
• A Learned Similarity Tracker that relies on a learned pairwise similarity function. We encode each detection measurement using 4 fully-connected layers to produce a 64-d feature for each detection. We compute the cosine similarity between each track embedding $t$ and detection embedding $d$ as $t^{\top} d / (\|t\| \|d\|)$. We optimize the similarity using a contrastive loss [18], where we set the margin to 0.3. We then use Hungarian matching based on the resulting similarity scores.

4.1.4 Public Benchmark Datasets.

We use the following public benchmark datasets in our experimental evaluation:

• The Waymo Open Dataset [39] is a large-scale dataset for autonomous driving. The dataset comprises 798 training sequences and 202 validation sequences. Each sequence spans 20 seconds and is densely labeled at 10 frames per second with camera object tracks. We trained and evaluated on the images recorded by the “front” camera to track vehicles in our experiments.
• The KITTI Vision Benchmark Suite [12] comprises 21 training sequences and 19 test sequences. We trained on the camera images to track cars in our experiments.
• The Multiple Object Tracking Benchmark [19] is a unified framework for evaluating multi-object tracking methods. We adopt the MOT17 benchmark, which comprises 7 sequences featuring crowded scenes with lots of pedestrians.

4.2. Ablation Studies in a Controlled Environment

We first evaluate our approach in a controlled environment that does not depend on the performance of a trained object detector. To this end, we run our method on the ground truth labels of the Waymo Open Dataset as if they were the detections predicted by an object detector. To simulate occlusions and missed detections, we randomly drop $N_{\text{drop}} \sim U(1, 5)$ consecutive bounding boxes for every 10 frames in a given track with probability $p_{\text{drop}}$, processing each ground truth trajectory independently. We present a performance analysis of the methods in the controlled environment in Table 1. Both the IOU Tracker and the Center Tracker perform poorly as their constant velocity motion models fail to accurately predict the motion of objects during occlusions. The Learned Similarity Tracker outperforms both the IOU Tracker and the Center Tracker by a large margin as it is able to recognize previously seen objects based on their appearance. Our method performs better than all baseline methods. Attention measurement encoding leads to a MOTA improvement of 1.07% to 1.18% and an IDS reduction of 9.0% to 9.2%. The explicit occlusion reasoning leads to an even more pronounced IDF1 improvement of 24%.

4.3. Evaluation of Explicit Occlusion Reasoning

The goal of the explicit occlusion reasoning mechanism introduced in Section 3.2 is to improve the performance when tracking objects that occasionally get occluded. False positive occlusion predictions, however, may increase the number of missed detections. Therefore, Table 2 (a) evaluates how the explicit occlusion reasoning performs for different detection drop probabilities $p_{\text{drop}}$. Interestingly, the two proposed mechanisms affect the tracking quality in different ways depending on the detection noise. On the one hand, attention measurement encoding leads to the highest gain in MOTA in the absence of simulated occlusions.
On the other hand, attending to the occlusion state leads to the most pronounced improvements when the detections are highly noisy. In Table 2 (b), we frame the occlusion prediction as a binary classification problem and report metrics, including accuracy, recall, and precision. The results suggest that our method improves the performance of the occlusion prediction by choosing to operate at a high precision.

4.4. Evaluation of Attention Measurement Encoding

Hyper-parameters. We evaluate the effects of the attention measurement encoding hyper-parameters on the performance in the controlled environment described in Section 4.2. Specifically, we set $p_{\text{drop}} = 0.3$ and consider different values for the size $L_{\text{enc}}$ of the encoding window and the number $N_{\text{enc}}$ of encoding layers. As shown in Table 3, as $L_{\text{enc}}$ increases, the tracking performance becomes better. These results suggest that our method leverages all available information from context and history. We also observe that using an encoding window that spans more than 10 frames only leads to marginal improvements, while introducing more computational complexity. Considering the number of stacked encoding layers $N_{\text{enc}}$, a performance gain is already visible with only 1 layer. When increasing $N_{\text{enc}}$, the improvement becomes more pronounced. However, with $N_{\text{enc}} > 3$, we observe that it takes the model much longer to converge, while with $N_{\text{enc}} = 3$, it takes the model only a few epochs. In fact, we found that $N_{\text{enc}} > 3$ often resulted in unstable states in our experiments. As a result, we choose $L_{\text{enc}} = 5$ and $N_{\text{enc}} = 2$ in all remaining experiments.

[Figure 4 plot: MOTA (44 to 50) versus number of training sequences (1 to 500) for Learned Similarity, Ours w/o AE, and Ours.]

Figure 4: Learning curves. The learning curves suggest that our approach benefits the most from modern large-scale datasets, such as the Waymo Open Dataset.

Exploiting future information.
To demonstrate the benefits of avoiding hard data associations, we conduct experiments in which we delay the data association for $L_{\text{future}}$ frames. In this way, all the detections at time $t$ will be encoded with context information from time $t - L_{\text{enc}}$ to $t + L_{\text{future}}$ with the proposed attention measurement encoding. We show the results in Table 4 using ground truth boxes as detections with simulated occlusions where $p_{\text{drop}} = 0.3$. The encoding window $L_{\text{enc}}$ and the number of layers $N_{\text{enc}}$ are set to 5 and 2, respectively. With only 2 frames of delayed associations, the model outperforms the variants that only have access to past information (see Table 3). This implies that the implicit state information of the measurement is improved after integrating more information from the future. Therefore, it is reasonable not to commit to hard associations when gathering spatiotemporal information from past measurements. The results also demonstrate that our technique performs well when used for offline applications, such as semi-supervised learning.

4.5. Scalability on Large-Scale Datasets

The goal of our approach is to learn complicated dependencies between tracks and detections from data. The learning curves on the Waymo Open Dataset, which are depicted in Figure 4, suggest that our method scales well with the size of the dataset. As a consequence, our method is able to learn considerably more complex dependencies between tracks and detections from data, which makes it a great technique whenever large amounts of training data are available.

Table 1: Performance in a controlled environment. We run the methods on the ground truth labels of the Waymo Open Dataset as if they were object detections. Here, we simulate occlusions with $p_{\text{drop}} = 0.3$.
Method              AE   Occ   MOTA ↑   IDF1 ↑   IDR ↑   IDP ↑   IDS (k) ↓   FP ↓   FN ↓
IOU                 -    -     38.0     33.7     24.0    56.6    25.2        4      184.7k
Center              -    -     47.9     25.2     25.1    25.4    172.1       41     4.2k
Learned Similarity  -    -     83.2     30.0     29.2    30.2    50.9        15     5.5k
Ours                -    -     87.9     32.2     32.2    32.2    40.9        31     60
Ours                ✓    -     88.9     32.4     32.4    32.4    37.3        8      33
Ours                -    ✓     91.4     56.2     56.1    56.3    28.1        21     1008
Ours                ✓    ✓     92.7     56.3     56.2    56.3    24.0        6      681

[Figure 5 image strip: frames at t=1, t=3, t=5, t=7, and t=9.]

Figure 5: An example showcasing the explicit occlusion reasoning. We show an occlusion scenario as handled by our model with and without explicit occlusion reasoning. Top: The baseline model without explicit occlusion reasoning. Bottom: Our method with attention measurement encoding and explicit occlusion reasoning. The car on the left side is occluded between $t = 3$ and $t = 7$. At $t = 3$, our method attends to the occlusion state with a value of 0.43, maintains the track throughout the occlusion, and then recovers the same track at $t = 7$.

4.6. Evaluation on Public Benchmark Datasets

4.6.1 Waymo Open Dataset

We evaluate the proposed method on the Waymo Open Dataset [39] with detections provided by a trained object detector. We compare the method with several baseline methods as well as Tracktor [3], a state-of-the-art visual tracking method that relies on a two-stage object detector. To obtain a fair comparison, we train the Faster R-CNN [31] object detector with ResNet-101 [14] as the backbone network. We refer to the supplementary material for more metrics and training details of the trained detector and the Tracktor method. We summarize the results in Table 5 (a). The IOU Tracker and the Center Tracker both perform poorly as their simple motion models do not accurately capture the complex ego and vehicle motion patterns observed in the scenes. Tracktor [3] performs 3.2% worse than our Learned Similarity Tracker baseline in terms of MOTA, but 13% better in terms of IDS.
This may be owing to the fact that Tracktor relies on a constant motion model as well as camera motion compensation to predict the ROI in the next frame. The regression head of the detector is not able to localize objects if the predicted ROI does not have enough overlap with the actual target. Tracktor also depends heavily on the appearance of objects. In the crowded scenes observed in the Waymo Open Dataset, however, many objects may share similar appearances in nearby locations.

Our method performs favorably compared to the aforementioned methods. Specifically, our method achieves a 5.0% higher MOTA and a 30.8% lower IDS when compared to Tracktor. The results suggest that our method is able to account for the noise of the object detector. In addition, the proposed attention measurement encoding effectively exploits spatiotemporal context information to improve data association. To further evaluate how the attention measurement encoding captures spatiotemporal context information, we evaluate the performance of a variant of our method that has access to the next 2 frames, effectively gathering information from the future. This variant achieves even better performance, suggesting that our method effectively leverages the additional temporal context. We also evaluate our method when the attention measurement encoding aggregates the appearance features extracted by ROI Align [13] along with the detected bounding boxes. The results suggest that the proposed framework generalizes to different feature modalities. Finally, we present an example of how our method successfully tracks through an occlusion in Figure 5.

Table 2: Robustness with respect to missed detections. We evaluate the gain in MOTA achieved by attention measurement encoding (AE) and explicit occlusion reasoning (Occ) as a function of the probability of dropped detections. The results in (a) suggest that the explicit occlusion reasoning becomes more beneficial as the rate of occlusions increases. The results in (b) summarize the occlusion classification performance as a function of the drop probability.

(a)

| Drop prob. | MOTA | Gain w/ AE | Gain w/ Occ |
|---|---|---|---|
| 0.0 | 98.6 | 2.1 | -0.5 |
| 0.1 | 97.0 | 0.8 | 2.0 |
| 0.2 | 95.4 | 1.6 | 2.5 |
| 0.3 | 92.6 | 1.0 | 3.6 |
| 0.4 | 90.8 | 1.1 | 4.8 |

(b)

| Drop prob. | AE | Accuracy ↑ | Recall ↑ | Precision ↑ |
|---|---|---|---|---|
| 0.3 | | 87.4 | 57.6 | 87.1 |
| 0.3 | ✓ | 89.4 | 60.1 | 91.9 |
| 0.4 | | 85.6 | 57.9 | 91.7 |
| 0.4 | ✓ | 88.2 | 64.2 | 93.3 |

Table 3: Attention measurement encoding hyper-parameters. We compare the tracking performance in terms of MOTA for different numbers of encoding layers $N_{\text{enc}}$ as well as different sizes of the encoding window $L_{\text{enc}}$.

| $L_{\text{enc}}$ | $N_{\text{enc}} = 1$ | $N_{\text{enc}} = 2$ | $N_{\text{enc}} = 3$ |
|---|---|---|---|
| 2 | 89.7 | 89.5 | 89.5 |
| 5 | 90.3 | 91.8 | 92.0 |
| 10 | 91.6 | 92.2 | 92.7 |
| 20 | 92.1 | 92.2 | 92.8 |

Table 4: Improved tracking with future information. We evaluate the tracking performance when delaying the data association by different numbers $L_{\text{future}}$ of frames. The results suggest that the additional implicit state information extracted from future measurements improves the performance of the model. This underlines that it is reasonable to avoid hard data associations when gathering spatiotemporal information.

| $L_{\text{future}}$ | MOTA ↑ | IDS (k) ↓ | FP ↓ | FN ↓ |
|---|---|---|---|---|
| 0 | 92.5 | 24.5 | 5 | 676 |
| 2 | 93.3 | 22.0 | 23 | 631 |
| 5 | 93.7 | 20.5 | 7 | 643 |

4.6.2 KITTI Tracking Benchmark

We evaluate the performance of our approach when tracking cars in the camera images. To this end, we use the detections predicted by RRC [30]. The results, summarized in Table 5 (b), suggest that, on this dataset, our method achieves similar performance to the state of the art. Interestingly, our method achieves even better performance when pre-trained on the Waymo Open Dataset, leading to a 17% decrease in IDS. These results confirm the benefits of methods such as ours that leverage modern large-scale datasets. Note, however, that we use neither image features nor 3D priors in this experiment, which is in contrast to other methods, such as BeyondPixel [35].
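As a quick check of the 17% figure, the relative change can be recomputed from the IDS values reported for the two variants of our method in Table 5 (b), namely 490 without and 408 with pre-training on the Waymo Open Dataset. This is only a worked arithmetic check, not an additional result:

```latex
\[
  \frac{\mathrm{IDS}_{\text{KITTI only}} - \mathrm{IDS}_{\text{pre-trained}}}{\mathrm{IDS}_{\text{KITTI only}}}
  = \frac{490 - 408}{490}
  = \frac{82}{490}
  \approx 0.167
\]
```

That is, a reduction of roughly 17% in identity switches.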
4.6.3 MOT17 Benchmark

As demonstrated in Section 4.5, our method is designed to leverage large-scale datasets comprising hundreds or thousands of scenes, which is in contrast to the MOT17 benchmark. The results on this dataset, shown in Table 5 (c), are insightful nevertheless. First, the results suggest that learning an effective association model for the MOT17 benchmark dataset is challenging. One reason for this may be that the camera view, the frame rate, and the image resolution vary substantially across sequences. It is further worth pointing out that top-performing methods tend to rely on certain guided detections to achieve better performance on this dataset, ranging from highly optimized detectors [3, 5] to single object trackers [6, 55]. To obtain the most informative evaluation, we therefore provide results for our method on both the public detections provided by the dataset as well as the private detections used by Tracktor [3].

Table 5: Performance on public benchmark datasets.

(a) Waymo Open Dataset.

| Method | MOTA ↑ | IDF1 ↑ | IDR ↑ | IDP ↑ | IDS (k) ↓ | FP (k) ↓ | FN (k) ↓ |
|---|---|---|---|---|---|---|---|
| IOU | 26.5 | 30.1 | 20.3 | 57.8 | 22.6 | 9.6 | 330.2 |
| Center | 28.7 | 25.7 | 21.1 | 32.6 | 107.5 | 33.7 | 204.1 |
| Tracktor [3] | 44.4 | 46.8 | 37.6 | 61.8 | 16.7 | 16.7 | 226.6 |
| Learned Similarity | 47.8 | 48.6 | 39.9 | 62.1 | 19.1 | 30.1 | 202.0 |
| Ours w/o AE, Occ | 48.3 | 49.8 | 41.0 | 63.6 | 15.6 | 34.0 | 200.9 |
| Ours w/o AE | 49.1 | 55.0 | 45.4 | 65.8 | 13.1 | 32.0 | 201.9 |
| Ours | 49.4 | 55.8 | 46.0 | 70.9 | 11.4 | 31.9 | 201.9 |
| Ours w/ appearance | 49.5 | 54.1 | 44.3 | 69.4 | 11.3 | 31.9 | 201.8 |
| Ours w/ future info | 49.6 | 56.5 | 46.6 | 71.7 | 10.8 | 29.1 | 201.7 |

(b) KITTI-Car Benchmark.

| Method | MOTA ↑ | MOTP ↑ | MT ↑ | ML ↓ | FP ↓ | FN ↓ | IDS ↓ |
|---|---|---|---|---|---|---|---|
| mbodSSP [21] | 72.7 | 78.8 | 48.8 | 8.7 | 1918 | 7360 | 114 |
| CIWT [28] | 75.4 | 79.4 | 49.9 | 10.3 | 954 | 7345 | 165 |
| MDP [51] | 76.6 | 82.1 | 52.2 | 13.4 | 606 | 7315 | 130 |
| FAMNet [6] | 77.1 | 79.4 | 49.9 | 10.3 | 954 | 7345 | 165 |
| BeyondPixel [35] | 84.2 | 85.7 | 73.2 | 2.8 | 705 | 4247 | 468 |
| Ours | 84.2 | 85.3 | 71.1 | 3.3 | 433 | 4531 | 490 |
| Ours pre-trained on Waymo | 84.3 | 85.3 | 70.3 | 3.5 | 406 | 4575 | 408 |

(c) MOT17 Benchmark.

| Method | Guided Detections | MOTA ↑ | IDF1 ↑ | MT ↑ | ML ↓ | FP ↓ | FN ↓ | IDS ↓ |
|---|---|---|---|---|---|---|---|---|
| DMAN [55] | ✓ | 48.2 | 55.7 | 19.3 | 38.3 | 26218 | 263608 | 2194 |
| MOTDT [5] | ✓ | 50.9 | 52.7 | 17.5 | 35.7 | 24069 | 250768 | 2474 |
| FAMNet [6] | ✓ | 52.0 | 48.7 | 17.5 | 33.4 | 14138 | 250768 | 2474 |
| Tracktor++ [3] | ✓ | 53.5 | 52.3 | 19.5 | 36.6 | 12201 | 248047 | 2072 |
| Ours | | 43.4 | 30.9 | 15.2 | 35.2 | 21600 | 285129 | 12843 |
| Ours | ✓ | 55.9 | 44.3 | 24.2 | 28.9 | 19683 | 217926 | 11178 |

5. Conclusion

We presented a novel approach to multi-object tracking that leverages recent developments in attention models in conjunction with the availability of modern large-scale datasets. We proposed attention measurement encoding to aggregate the rich spatiotemporal context observed in modern datasets into a latent space representation. This allows our model to avoid committing to any hard data associations that may lead to unrecoverable states. We proposed a mechanism to learn to explicitly reason about occlusions based on the latent space representation. This allows our model to track objects through occlusions while taking into account the context of the scene. We conducted extensive experiments on the public benchmark datasets to evaluate the effectiveness of the proposed approach. The results suggest that our approach performs favorably against, or comparably to, several baseline methods as well as state-of-the-art methods. We further demonstrated that our method benefits from large-scale datasets, as it is able to learn complex dependencies between tracks and detections from data. In future work, we will explore approaches to training our tracking model jointly with the detection model. Furthermore, we will investigate mechanisms to predict refined track estimates based on the spatiotemporal dependencies encoded in the latent space representation.

References

[1] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization.
arXiv preprint arXiv:1607.06450, 2016.
[2] Peter W Battaglia, Jessica B Hamrick, Victor Bapst, Alvaro Sanchez-Gonzalez, Vinicius Zambaldi, Mateusz Malinowski, Andrea Tacchetti, David Raposo, Adam Santoro, Ryan Faulkner, et al. Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261, 2018.
[3] Philipp Bergmann, Tim Meinhardt, and Laura Leal-Taixe. Tracking without bells and whistles. In ICCV, 2019.
[4] Alex Bewley, ZongYuan Ge, Lionel Ott, Fabio Ramos, and Ben Upcroft. Simple online and realtime tracking. In ICIP, 2016.
[5] Long Chen, Haizhou Ai, Zijie Zhuang, and Chong Shang. Real-time multiple people tracking with deeply learned candidate selection and person re-identification. In ICME, 2018.
[6] Peng Chu and Haibin Ling. Famnet: Joint learning of feature, affinity and multi-dimensional assignment for online multiple object tracking. In ICCV, 2019.
[7] Qi Chu, Wanli Ouyang, Hongsheng Li, Xiaogang Wang, Bin Liu, and Nenghai Yu. Online multi-object tracking using cnn-based single object tracker with spatial-temporal attention mechanism. In ICCV, 2017.
[8] Ingemar J. Cox and Sunita L. Hingorani. An efficient implementation of reid's multiple hypothesis tracking algorithm and its evaluation for the purpose of visual tracking. TPAMI, 1996.
[9] Zihang Dai, Zhilin Yang, Yiming Yang, William W Cohen, Jaime Carbonell, Quoc V Le, and Ruslan Salakhutdinov. Transformer-xl: Attentive language models beyond a fixed-length context. In ACL, 2019.
[10] Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Łukasz Kaiser. Universal transformers. ICLR, 2019.
[11] Davi Frossard and Raquel Urtasun. End-to-end learning of multi-sensor 3d tracking by detection. In ICRA, 2018.
[12] Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets robotics: The kitti dataset. IJRR, 2013.
[13] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask r-cnn. In ICCV, 2017.
[14] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
[15] Dirk Helbing and Peter Molnar. Social force model for pedestrian dynamics. Physical review E, 51(5):4282, 1995.
[16] Hou-Ning Hu, Qi-Zhi Cai, Dequan Wang, Ji Lin, Min Sun, Philipp Krahenbuhl, Trevor Darrell, and Fisher Yu. Joint monocular 3d vehicle detection and tracking. In ICCV, 2019.
[17] Chanho Kim, Fuxin Li, and James M Rehg. Multi-object tracking with neural gating using bilinear lstm. In ECCV, 2018.
[18] Laura Leal-Taixé, Cristian Canton-Ferrer, and Konrad Schindler. Learning by tracking: Siamese cnn for robust target association. In CVPR Workshops, 2016.
[19] Laura Leal-Taixé, Anton Milan, Ian Reid, Stefan Roth, and Konrad Schindler. Motchallenge 2015: Towards a benchmark for multi-target tracking. arXiv preprint arXiv:1504.01942, 2015.
[20] Juho Lee, Yoonho Lee, Jungtaek Kim, Adam Kosiorek, Seungjin Choi, and Yee Whye Teh. Set transformer: A framework for attention-based permutation-invariant neural networks. In ICML, 2019.
[21] Philip Lenz, Andreas Geiger, and Raquel Urtasun. Followme: Efficient online min-cost flow tracking with bounded memory and computation. In ICCV, 2015.
[22] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In ICCV, 2017.
[23] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C Berg. Ssd: Single shot multibox detector. In ECCV, 2016.
[24] Wenjie Luo, Bin Yang, and Raquel Urtasun. Fast and furious: Real time end-to-end 3d detection, tracking and motion forecasting with a single convolutional net. In CVPR, 2018.
[25] Andrii Maksai and Pascal Fua. Eliminating exposure bias and metric mismatch in multiple object tracking. In CVPR, 2019.
[26] Anton Milan, S Hamid Rezatofighi, Anthony Dick, Ian Reid, and Konrad Schindler. Online multi-target tracking using recurrent neural networks. In AAAI, 2017.
[27] James Munkres. Algorithms for the assignment and transportation problems. Journal of the society for industrial and applied mathematics, 1957.
[28] Aljoša Osep, Wolfgang Mehner, Markus Mathias, and Bastian Leibe. Combined image-and world-space tracking in traffic scenes. In ICRA. IEEE.
[29] Donald Reid. An algorithm for tracking multiple targets. IEEE transactions on Automatic Control, 1979.
[30] Jimmy Ren, Xiaohao Chen, Jianbo Liu, Wenxiu Sun, Jiahao Pang, Qiong Yan, Yu-Wing Tai, and Li Xu. Accurate single stage detector using recurrent rolling convolution. In CVPR, 2017.
[31] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In NeurIPS, 2015.
[32] Ergys Ristani, Francesco Solera, Roger Zou, Rita Cucchiara, and Carlo Tomasi. Performance measures and a data set for multi-target, multi-camera tracking. In ECCV, 2016.
[33] Amir Sadeghian, Alexandre Alahi, and Silvio Savarese. Tracking the untrackable: Learning to track multiple cues with long-term dependencies. In ICCV, 2017.
[34] Samuel Schulter, Paul Vernaza, Wongun Choi, and Manmohan Chandraker. Deep network flow for multi-object tracking. In CVPR, 2017.
[35] Sarthak Sharma, Junaid Ahmed Ansari, J Krishna Murthy, and K Madhava Krishna. Beyond pixels: Leveraging geometry and shape cues for online multi-object tracking. In ICRA. IEEE.
[36] Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. Self-attention with relative position representations. arXiv preprint arXiv:1803.02155, 2018.
[37] Jeany Son, Mooyeol Baek, Minsu Cho, and Bohyung Han. Multi-object tracking with quadruplet convolutional neural networks. In CVPR, 2017.
[38] Chen Sun, Per Karlsson, Jiajun Wu, Joshua B Tenenbaum, and Kevin Murphy. Stochastic prediction of multi-agent interactions from partial observations. In ICLR, 2019.
[39] Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurelien Chouard, Vijaysai Patnaik, Paul Tsui, James Guo, Yin Zhou, Yuning Chai, Benjamin Caine, et al. Scalability in perception for autonomous driving: Waymo open dataset. In CVPR, 2020.
[40] Siyu Tang, Bjoern Andres, Mykhaylo Andriluka, and Bernt Schiele. Subgraph decomposition for multi-target tracking. In CVPR, 2015.
[41] Siyu Tang, Bjoern Andres, Mykhaylo Andriluka, and Bernt Schiele. Multi-person tracking by multicut and deep matching. In ECCV, 2016.
[42] Siyu Tang, Mykhaylo Andriluka, Bjoern Andres, and Bernt Schiele. Multiple people tracking by lifted multicut and person re-identification. In CVPR, 2017.
[43] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017.
[44] Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. Pointer networks. In NeurIPS, 2015.
[45] Paul Voigtlaender, Michael Krause, Aljosa Osep, Jonathon Luiten, Berin Balachandar Gnana Sekar, Andreas Geiger, and Bastian Leibe. Mots: Multi-object tracking and segmentation. In CVPR, 2019.
[46] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In CVPR, June 2018.
[47] Xinchao Wang, Engin Türetken, Francois Fleuret, and Pascal Fua. Tracking interacting objects using intertwined flows. TPAMI, 2015.
[48] Nicolai Wojke, Alex Bewley, and Dietrich Paulus. Simple online and realtime tracking with a deep association metric. arXiv preprint arXiv:1703.07402, 2017.
[49] Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, and Philip S Yu. A comprehensive survey on graph neural networks. arXiv preprint arXiv:1901.00596, 2019.
[50] Zheng Wu, Ashwin Thangali, Stan Sclaroff, and Margrit Betke. Coupling detection and data association for multiple object tracking. In CVPR, 2012.
[51] Yu Xiang, Alexandre Alahi, and Silvio Savarese. Learning to track: Online multi-object tracking by decision making. In ICCV, 2015.
[52] Jiarui Xu, Yue Cao, Zheng Zhang, and Han Hu. Spatial-temporal relation networks for multi-object tracking. In ICCV, 2019.
[53] Amir Roshan Zamir, Afshin Dehghan, and Mubarak Shah. Gmcp-tracker: Global multi-object tracking using generalized minimum clique graphs. In ECCV, 2012.
[54] Li Zhang, Yuan Li, and Ramakant Nevatia. Global data association for multi-object tracking using network flows. In CVPR, 2008.
[55] Ji Zhu, Hua Yang, Nian Liu, Minyoung Kim, Wenjun Zhang, and Ming-Hsuan Yang. Online multi-object tracking with dual matching attention networks. In ECCV, 2018.

[Figure 6 diagram: Measurement $x_i = (x_1, y_1, x_2, y_2)$ → FC: 64 + ReLU → FC: 64 → LayerNorm → Attention Measurement Encoding Layers → FC: 64 + ReLU → FC: 64 + tanh → Attention Association.]

Figure 6: Overall Network Architecture. We present the end-to-end network architecture for a detection bounding box. The measurement of a detection $d_i$ is denoted as $x_i$. The embeddings before and after $N$ attention measurement encoding layers are denoted as $z_i^0$ and $z_i^N$, respectively. Before the attention association, we apply two more layers to obtain the final embedding $z_i$.

A. Tracktor Setup on the Waymo Open Dataset

Since Tracktor [3] only relies on a two-stage detector and a few hyper-parameters, we performed a fair comparison using a vanilla Faster R-CNN [31] with ResNet-101 [14] as the backbone network. The model achieves 41.3% AP (76.7% $\text{AP}_L$, 42.1% $\text{AP}_M$, 9.28% $\text{AP}_S$) on the vehicle class on the Waymo Open Dataset [39]. To gather the detections for tracking, we set the non-maximum suppression threshold to 0.6, and we discard all detections with a score lower than 0.5.
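The detection filtering described for the Tracktor setup (score threshold 0.5, non-maximum suppression threshold 0.6) can be sketched as follows. This is a minimal illustration that assumes greedy, score-ordered NMS on axis-aligned boxes; the function names `box_iou` and `filter_detections` are hypothetical, and the exact NMS variant used by the detector is not specified in the text.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def filter_detections(boxes, scores, score_thresh=0.5, nms_thresh=0.6):
    """Return indices kept after score filtering and greedy NMS (assumed variant)."""
    # Discard detections whose score is lower than the score threshold,
    # then visit the remaining ones from highest to lowest score.
    order = sorted(
        (i for i, s in enumerate(scores) if s >= score_thresh),
        key=lambda i: scores[i],
        reverse=True,
    )
    kept = []
    for i in order:
        # Keep a box only if it does not overlap a kept box by more than nms_thresh.
        if all(box_iou(boxes[i], boxes[k]) <= nms_thresh for k in kept):
            kept.append(i)
    return kept


boxes = [(0, 0, 10, 10), (1, 0, 11, 10), (20, 20, 30, 30), (0, 0, 10, 20)]
scores = [0.9, 0.8, 0.4, 0.7]
kept = filter_detections(boxes, scores)
```

In this toy input, the second box is suppressed by the first (IoU of about 0.82), the third is dropped by the score threshold, and the fourth survives because its IoU with the first box is 0.5.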
We optimized the parameters of the Tracktor method on the dataset and set $\sigma_{\text{active}} = 0.4$, $\lambda_{\text{active}} = 0.6$, and $\lambda_{\text{new}} = 0.3$.

B. Network Architecture Details

We show the detailed network architecture of the proposed method in Figure 6. For all the experiments presented in the paper, except Ours w/ appearance in Table 5 of the paper, we only use the bounding box coordinates $(x_1, y_1, x_2, y_2)$ as the detection measurement values. Before the attention measurement encoding layers, the measurements are passed through two fully-connected layers and LayerNorm [1]. After encoding with the spatiotemporal context information, the embeddings are further passed through two fully-connected layers. However, the activation of the final layer is tanh, since we found that it leads to a more stable training process.

Figure 7: Simulation Environment Illustration. To better understand the challenges introduced by different noise and group dynamic models, we create a simulation environment and separately impose several types of challenges, including measurement noise, mutual/environmental occlusions, and group context. Left: The true state of the objects. Right: Simulated detections affected by occlusions and position noise.

C. Experiments with Simulation Dataset

We create a simulated environment to evaluate the proposed multi-object tracking method under different challenges that can be faced in MOT. In the simulation, we put $N_p$ particles of the same size in the box with random initial positions $p = (p_x, p_y) \sim U(0, 1)$ and velocities $v_i = (v_x, v_y) \sim N(0, 0.1)$. In each time step, we compute the new position of a particle as $p_{t+1} = p_t + v_t$. When the center of a particle exceeds the boundary, it bounces back with reversed velocity $(-v_x, v_y)$ or $(v_x, -v_y)$. To obtain the detection results, we add another random perturbation to the ground-truth position, $d = p + n_d$, $n_d \sim N(0, 0.05)$, which serves as the input to the tracker.
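The particle dynamics and detection noise described above can be sketched in a few lines. This is a minimal sketch under stated assumptions: the box is taken to be the unit square, the second arguments of $N(0, 0.1)$ and $N(0, 0.05)$ are interpreted as standard deviations, a unit time step is used, and the function name `simulate` and its parameters are hypothetical rather than taken from the paper.

```python
import random


def simulate(num_particles=5, num_frames=100, vel_std=0.1, det_std=0.05, seed=0):
    """Simulate bouncing particles and noisy detections (illustrative sketch)."""
    rng = random.Random(seed)
    # Initial positions ~ U(0, 1); velocities ~ N(0, vel_std) per axis (assumed std).
    pos = [[rng.uniform(0, 1), rng.uniform(0, 1)] for _ in range(num_particles)]
    vel = [[rng.gauss(0, vel_std), rng.gauss(0, vel_std)] for _ in range(num_particles)]
    truth, detections = [], []
    for _ in range(num_frames):
        for p, v in zip(pos, vel):
            for axis in (0, 1):
                p[axis] += v[axis]  # p_{t+1} = p_t + v_t
                if p[axis] < 0.0 or p[axis] > 1.0:
                    v[axis] = -v[axis]  # reverse the velocity component at the boundary
        truth.append([(p[0], p[1]) for p in pos])
        # Detection = ground-truth position plus Gaussian noise n_d.
        detections.append(
            [(p[0] + rng.gauss(0, det_std), p[1] + rng.gauss(0, det_std)) for p in pos]
        )
    return truth, detections


truth, detections = simulate(num_particles=3, num_frames=50, seed=1)
```

Because only the velocity is reversed, a particle center may leave the box by at most one step before it moves back inside, which matches the "exceeds the boundary, then bounces back" description.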
Since we focus on learning the motion and context modeling in this work, the appearance feature is not considered in this setting. For each setting, we generate 1000 training sequences and 20 test sequences. Each sequence contains 600 frames. We illustrate a sample image for both the generated ground truth and the detections in Figure 7.

Basic Environment with Detection Noise. We first evaluate the proposed method in the simplest setup, where there is only displacement noise $n_d$ in the simulation. We further apply a random force $f = (f_x, f_y) \sim N(0, 0.01)$ to the particles to simulate independent environmental forces. We report the quantitative evaluation in Table 6 with $N_p = 5$. We first show the performance of a standard Kalman filter-based tracker, denoted as IOU tracker. Owing to the simplicity of the dataset, the IOU tracker already achieves 94.08% MOTA. The Learned baseline performs slightly better than the IOU tracker, by 0.54% in MOTA. However, its number of ID switches is higher than that of the IOU tracker, while it produces fewer false negatives, because a learned tracker can still associate detections that have no overlap with previous bounding boxes. We show the performance of the proposed method in Table 6 with an ablation study of the two proposed components: AE represents attention measurement encoding, and Occ represents the existence of the occlusion state for attention association. All the variants of our method achieve higher MOTA than the baselines, and the attention measurement encoding improves the tracking performance with a 5% reduction in ID switches, with or without occlusion reasoning. We note that the occlusion state does not improve the tracking performance much, since there is no actual occlusion in this simplest setting.

| Method | AE | Occ | MOTA | IDS | FP | FN |
|---|---|---|---|---|---|---|
| IOU Tracker | | | 94.08 | 1642 | 740 | 1168 |
| Learned | | | 94.62 | 1731 | 745 | 745 |
| Ours | | | 94.84 | 1669 | 732 | 732 |
| Ours | ✓ | | 94.96 | 1589 | 740 | 740 |
| Ours | | ✓ | 94.65 | 1671 | 736 | 736 |
| Ours | ✓ | ✓ | 95.07 | 1573 | 731 | 731 |

Table 6: Track with random force and measurement noise. We show the comparisons between the baseline methods and the proposed method on the synthetic dataset with random forces and measurement noise. The results show that the proposed method can better understand the motion model of the particles with attention measurement encoding.

| Method | AE | Occ | MOTA | IDS | FP | FN |
|---|---|---|---|---|---|---|
| IOU tracker | | | 75.59 | 2321 | 417 | 11907 |
| Learned | | | 91.59 | 1962 | 511 | 2563 |
| Ours | | | 91.82 | 1818 | 518 | 2558 |
| Ours | ✓ | | 91.90 | 1793 | 507 | 2550 |
| Ours | | ✓ | 91.86 | 1800 | 515 | 2556 |
| Ours | ✓ | ✓ | 91.95 | 1752 | 511 | 2550 |

Table 7: Track with Occlusions. We evaluate the proposed method on the synthetic dataset with simulated occlusions: mutual and environmental occlusion. The results show that the explicit occlusion reasoning (Occ) is able to reason about occlusion explicitly and recover the track when the detection shows up again.

Track with Occlusions. To simulate challenging cases in multi-object tracking, we inject occlusion noise into the simulation. Specifically, we simulate two types of occlusions: mutual occlusions and environmental occlusions. To simulate mutual occlusions, we assign a random depth value to each particle. When the IOU between two particles is higher than 0.3, we remove the detection of the particle with the larger depth value. For environmental occlusion, we simulate an occlusion block with random position and size in each sequence. Every particle that overlaps the block is marked as missed in the detections. We apply the baseline methods and our approach to this synthetic occlusion dataset and show the comparisons in Table 7. With synthesized occlusion, the performance of the IOU tracker drops drastically compared to Table 6 because it is not able to track the particles correctly after occlusion, resulting in many false negatives and ID switches. Learned trackers, however, suffer less from the additional occlusion noise.
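The two occlusion rules described above (mutual occlusion by depth when the IOU exceeds 0.3, and environmental occlusion by a block) can be illustrated with a small sketch. The helper names `intersection_area`, `box_iou`, and `visible_indices` are hypothetical, boxes are assumed to be axis-aligned `(x1, y1, x2, y2)` tuples, and "overlap" with the block is assumed to mean any positive intersection area.

```python
def intersection_area(a, b):
    """Intersection area of two axis-aligned boxes (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    return iw * ih


def box_iou(a, b):
    """Intersection over union of two axis-aligned boxes."""
    inter = intersection_area(a, b)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def visible_indices(boxes, depths, occluder=None, iou_thresh=0.3):
    """Return indices of particles whose detections survive the occlusion rules."""
    hidden = set()
    n = len(boxes)
    for i in range(n):
        # Environmental occlusion: any overlap with the block hides the particle.
        if occluder is not None and intersection_area(boxes[i], occluder) > 0.0:
            hidden.add(i)
        for j in range(i + 1, n):
            # Mutual occlusion: drop the particle with the larger depth value.
            if box_iou(boxes[i], boxes[j]) > iou_thresh:
                hidden.add(i if depths[i] > depths[j] else j)
    return [i for i in range(n) if i not in hidden]


boxes = [(0.0, 0.0, 1.0, 1.0), (0.1, 0.0, 1.1, 1.0), (5.0, 5.0, 6.0, 6.0)]
depths = [0.2, 0.7, 0.5]
no_block = visible_indices(boxes, depths)
with_block = visible_indices(boxes, depths, occluder=(4.5, 4.5, 5.5, 5.5))
```

Here the second particle is removed because it overlaps the first and lies deeper, and the third particle is additionally removed once the occlusion block covers part of it.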
With the proposed method, the attention measurement encoding, as in the occlusion-free environment, still brings an improvement, with a 2.5% drop in ID switches. On the other hand, the explicit reasoning about the occlusion state reduces the ID switches by 2.7%, since systematic occlusion noise is introduced.

| Method | AE | Occ | MOTA ($N_p = 5$) | IDS ($N_p = 5$) | FP ($N_p = 5$) | FN ($N_p = 5$) | MOTA ($N_p = 10$) | IDS ($N_p = 10$) | FP ($N_p = 10$) | FN ($N_p = 10$) |
|---|---|---|---|---|---|---|---|---|---|---|
| IOU tracker | | | 94.7 | 2809 | 113 | 234 | 86.6 | 14.6k | 568 | 871 |
| Learned | | | 95.2 | 2634 | 123 | 123 | 88.0 | 12.6k | 257 | 257 |
| Ours | | | 95.3 | 2606 | 116 | 116 | 87.5 | 13.8k | 605 | 605 |
| Ours | ✓ | | 97.9 | 1130 | 66 | 66 | 93.4 | 7.1k | 446 | 446 |
| Ours | | ✓ | 95.3 | 2605 | 114 | 114 | 87.3 | 14.1k | 597 | 597 |
| Ours | ✓ | ✓ | 97.7 | 1204 | 74 | 74 | 93.3 | 7.1k | 449 | 449 |

Table 8: Track with Social Forces. We evaluate the proposed method on the synthetic dataset with social forces. The results demonstrate that the proposed attention measurement encoding (AE) can leverage the context information from other particles to better track the targets.

Track with Social Forces. Finally, to demonstrate the effectiveness of the proposed attention measurement encoding in learning from spatiotemporal context information, we apply social forces [15] to the particles by considering the repulsive effects between the particles, where they try not to collide with each other. Note that there is no occlusion simulation. Specifically, let $p_i$ be the position of a particle $i$. The repulsive social force for particle $i$ can be formulated as

$$ f_i = \sum_{j \neq i} F_0 \, e^{-\frac{\lVert p_i - p_j \rVert}{R}} \cdot \nabla (p_i - p_j), \tag{5} $$

where $j$ denotes the index of all the other particles in the scene, $F_0$ denotes the base force magnitude, $R$ denotes the tolerance radius, and $\nabla (p_i - p_j)$ denotes the unit vector from $j$ to $i$. With this simple social force model, the movements of the particles now depend on each other, and therefore the state of the other particles is important information for reliably tracking a particle.
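Equation (5) can be evaluated directly for a small set of particles. The sketch below is illustrative only: the function name `social_forces` is hypothetical, and the default values for the base magnitude `f0` ($F_0$) and the tolerance radius `radius` ($R$) are assumptions, since the text does not state the values used in the simulation.

```python
import math


def social_forces(positions, f0=1.0, radius=0.1):
    """Repulsive social force on each particle, following Eq. (5)."""
    forces = []
    for i, (xi, yi) in enumerate(positions):
        fx = fy = 0.0
        for j, (xj, yj) in enumerate(positions):
            if i == j:
                continue
            dx, dy = xi - xj, yi - yj
            dist = math.hypot(dx, dy)
            if dist == 0.0:
                continue  # direction undefined for coincident particles
            magnitude = f0 * math.exp(-dist / radius)
            # Unit vector from j to i, so the force pushes i away from j.
            fx += magnitude * dx / dist
            fy += magnitude * dy / dist
        forces.append((fx, fy))
    return forces


forces = social_forces([(0.0, 0.0), (1.0, 0.0)], f0=1.0, radius=1.0)
```

For two particles at unit distance with $F_0 = R = 1$, each experiences a force of magnitude $e^{-1}$ pointing away from the other, so the two forces cancel in sum.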
We show the comparisons in Table 8 with both $N_p = 5$ and $N_p = 10$ settings to illustrate how the trackers perform at different densities. Without any occlusions, the IOU tracker performs reasonably well. However, ID switches occur when particles interact with each other and change their motion drastically as a result. Similar results are observed for both the Learned baseline and the baseline variant of our method, since it is difficult to track the objects without the context information. However, with the proposed attention measurement encoding, the MOTA is improved by 2.4% with $N_p = 5$ and 5.9% with $N_p = 10$, and the number of ID switches is greatly reduced. This is because the tracker is able to reason about the interactions between particles and associate the particles using better state estimates.