UniGen: Unified Modeling of Initial Agent States and Trajectories for Generating Autonomous Driving Scenarios

Reza Mahjourian*, Rongbing Mu*, Valerii Likhosherstov, Paul Mougin, Xiukun Huang, Joao Messias, Shimon Whiteson
Waymo
* Equal contribution.

Abstract—This paper introduces UniGen, a novel approach to generating new traffic scenarios for evaluating and improving autonomous driving software through simulation. Our approach models all driving scenario elements in a unified model: the position of new agents, their initial state, and their future motion trajectories. By predicting the distributions of all these variables from a shared global scenario embedding, we ensure that the final generated scenario is fully conditioned on all available context in the existing scene. Our unified modeling approach, combined with autoregressive agent injection, conditions the placement and motion trajectory of every new agent on all existing agents and their trajectories, leading to realistic scenarios with low collision rates. Our experimental results show that UniGen outperforms prior state of the art on the Waymo Open Motion Dataset.

I. INTRODUCTION

Autonomous Vehicles (AVs) have the potential to revolutionize transportation by providing convenient, reliable, and safe mobility for humans and goods. However, ensuring their reliability and safety in diverse and complex traffic scenarios is a significant challenge. To evaluate the performance of AV systems in safety-critical situations, it is necessary to capture rare or long-tail events [1]. However, collecting a large and diverse real-world dataset of such events is difficult and expensive, due to the extensive mileage required to encounter them in the real world. Therefore, there is a critical need for automated techniques that can generate realistic safety-critical traffic scenarios at scale.

Simulation environments [2], [3], [4] offer a solution to this problem by allowing for controlled and reproducible evaluation of AV safety and reliability. Simulated traffic scenarios are often created with manually-designed heuristics [2], [3]. However, such approaches do not capture the complexity and diversity of real-world traffic scenarios. To address these limitations, more recent approaches train deep learning models to generate traffic scenarios. Existing approaches include generating a static snapshot of the scene [5], [6] or generating trajectories separately based on initial conditions [7], [8], [9].

These methods factor the scenario generation problem as

    p(S | R) = p_φ(S_0 | R) p_ψ(S_{1..T} | S_0, R),   (1)

where R is the scene context including the road layout and traffic light states, S_0 is the initial agent states, S_{1..T} is the agent trajectories over T future timesteps, and p_φ and p_ψ are probability distributions parameterized by φ and ψ. In most prior methods, φ and ψ are disjoint and trained separately via two different training procedures.

However, this decomposition of the scenario generation problem is unnatural and mostly a result of incremental research advances. Using two separate models and processes leads not only to parameter redundancy, but also an inflexible model with limited capacity for sharing the scenario context in all stages of scenario generation.

In this paper, we introduce UniGen, a multi-agent scenario generation method with state-of-the-art performance. UniGen uses a unified model to generate the initial position, initial state attributes, and future trajectories of new agents. Using an autoregressive setup (Fig. 1), UniGen can iteratively inject fully-instantiated new agents into a blank scene or a partially-populated scenario. In particular, UniGen's model conditions all properties of the next agent on all properties of the existing agents, including agents injected in previous iterations. By ensuring consistency between initial positions and future trajectories, this approach generates more realistic traffic scenarios with lower collision rates.

Fig. 1. UniGen's autoregressive process for iteratively injecting new agents into a scenario. In each iteration, the model fully instantiates the initial state (top) and future trajectory (bottom) for a new agent (highlighted in pink). All properties of the new agent are conditioned on the scene context and the entire trajectories for existing agents (shown in white).

We evaluate UniGen on the Waymo Open Motion Dataset (WOMD) [10], and show that it achieves state-of-the-art performance on both scene distribution and collision metrics. We also report ablation experiment results to quantify the impact of each key component in UniGen's design.

II. RELATED WORK

a) Traffic scenario generation: The earliest approaches to scenario generation are procedural. They insert traffic agents into the scene based on predefined rules and heuristics [2], [3], [11]. While these methods allow for manual parameter tuning to achieve reasonable agent placements and behaviors, their scalability is limited, particularly in complex urban environments with many edge cases.

Building on recent advances in deep learning, several approaches model distributions over traffic scenarios based on real-world data and then draw from them. Most learning-based scene generation methods mainly generate the initial states for the agents [5], [6]. To fully generate a scenario, these methods rely on a separate motion forecasting model that generates trajectories given the initial agent states.

SimNet [7] initializes entire scenes using a Graph Neural Network (GNN), and then uses a separate trajectory generation model to add agent motion. Similarly, TrafficGen [8] models the initial agent states using a vectorized representation [12] and subsequently augments the agent data with trajectories generated by a separate model. The recently-proposed LCTGen [9] model uses a shared encoder and separate transformer decoder heads to predict initial agent states and trajectories. However, LCTGen produces the output scene all at once and, unlike our method, does not condition its predictions on the location and trajectories of previous agents, leading to less consistency between initial positions and future trajectories, and higher collision rates.

b) Motion forecasting: The problem of motion forecasting involves modeling the future motion of agents based on their current state and their recent motion tracks. Advances have been made through improvements in modeling inputs [12], [13], outputs [14], agent interactions [15], [16], and multimodality [17], [18]. However, these methods are not directly applicable to scenario generation, since they assume availability of current and past state information for all agents in the scene. By contrast, our approach requires only the scene context (road layout) to create realistic proposals for injecting agents at any given (x, y) location on the map, and can produce future motion trajectories for agents without access to their recent motion tracks.

c) Autoregressive generative models: Neural autoregressive models have found success in different domains, including text [19], [20], 2D [21], and 3D indoor scene generation [22], [23]. For traffic scenario generation, SceneGen [5] employs an autoregressive approach to insert agents into a scene one at a time. Building on this setup, we condition new agents on both the initial states and future trajectories of existing agents, thereby increasing scenario realism and consistency while reducing collision rates.

III. PROBLEM FORMULATION

We represent a traffic scenario as (R, S), where R represents the scene context, including the road layout and the position and state of the traffic lights, and S = {s^i}, i ∈ {1, ..., N} represents the N agents present in the scenario. Each agent's state s^i is characterized by its initial state s^i_0 and its future trajectory s^i_t over T future timesteps t ∈ {1, ..., T}. The initial agent state s^i_0 captures the position (x^i_0, y^i_0) and other attributes including the width w^i, length l^i, heading angle θ^i_0, and velocity v^i_0. Each agent's future trajectory is captured by {s^i_1, ..., s^i_T}.

The task is to generate a complete driving scenario (R, S), given the scene context R and an initial set of agents S_c ⊂ S, which might be empty. In other words, we would like to generate the conditional distribution of all agent states p(S | R, S_c).
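To make this notation concrete, the sketch below shows one way the scenario tuple (R, S) might be laid out as Python containers. It is purely illustrative: the field names, shapes, and types are our assumptions, not a data format prescribed by the paper or by WOMD.

```python
from dataclasses import dataclass
from typing import List
import numpy as np

@dataclass
class AgentState:
    """One agent s^i: the initial state s^i_0 plus the future trajectory."""
    x: float                 # x^i_0
    y: float                 # y^i_0
    width: float             # w^i
    length: float            # l^i
    heading: float           # theta^i_0, in radians
    velocity: np.ndarray     # v^i_0, shape [2]
    trajectory: np.ndarray   # {s^i_1, ..., s^i_T}, shape [T, 2]

@dataclass
class Scenario:
    """A traffic scenario (R, S)."""
    road_polylines: List[np.ndarray]  # R: one [P, 2] array per polyline
    traffic_lights: np.ndarray        # R: traffic light positions and states
    agents: List[AgentState]          # S = {s^i}, i in {1, ..., N}
```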
IV. METHOD

Fig. 2 illustrates the overall design of UniGen. The model consists of a shared scenario encoder and three separate decoder heads: an occupancy predictor for generating the locations of new agents, an attribute predictor for initializing agent states, and a trajectory predictor for generating future motion. In addition to the shared whole-scene encoder, which produces an embedding from all inputs, there is also a per-new-agent transformer encoder which encodes the polylines representing the road layout from the vantage point of each new agent location. The road layout encoding produced by this encoder is passed to the agent attribute decoder and the agent trajectory decoder, in addition to the global scenario embedding.

A. Masking Ground Truth Scenarios

To construct the ground-truth data for training the model, we convert every ground-truth scenario in the dataset into multiple training examples by randomly splitting the real agents into two sets of input and hidden agents, as illustrated in Fig. 3. The model is trained to predict the location, attributes, and trajectories of the hidden agents, given only the input agents. When generating each ground-truth example, a random probability p_keep is first sampled to control the fraction of agents to keep in the inputs. Using a wide range of fractions allows the model to see both empty and crowded scenarios during training.

This approach is conceptually similar to BERT [24] and masked autoencoders [25]. The random masking encourages the model to learn the scenario dynamics in order to be able to predict the hidden agent information. Additionally, since every ground-truth scenario can be split in many different ways, this strategy effectively increases the amount of training data available.
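A minimal sketch of this splitting step, assuming agents are held in a Python list. The sampling range for p_keep matches the [-0.3, 0.9] interval reported in Section V, with negative draws clipped to zero so that some training examples start from a blank scene.

```python
import numpy as np

def split_agents(agents, rng):
    """Randomly split scenario agents into input and hidden sets.

    A per-scenario keep probability is drawn from a wide range
    (U[-0.3, 0.9], clipped at zero, per Sec. V) so training examples
    range from blank to densely populated input scenes.
    """
    p_keep = max(0.0, rng.uniform(-0.3, 0.9))
    keep = rng.random(len(agents)) < p_keep
    input_agents = [a for a, k in zip(agents, keep) if k]
    hidden_agents = [a for a, k in zip(agents, keep) if not k]
    return input_agents, hidden_agents

# Re-splitting the same scenario with fresh draws yields new training
# examples, which is how masking multiplies the available data.
```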
B. Input Representation

Following StopNet [13], we use sparse inputs which can adequately capture both the static scene context and the dynamic elements of the scenario, captured by the location and extents of agent bounding boxes. More specifically, the road layout is represented by polylines that map the positions of lane centers, lane boundaries, road boundaries, crosswalks, speed bumps, and stop signs. The state of traffic lights is also fed to the model as points placed at the end of the traffic-controlled lanes. In addition, the model receives, as input, points uniformly sampled from the interior of Bird's-Eye View (BEV) bounding boxes for any existing or previously injected scenario agents. To encode agent trajectories, separate bounding boxes are laid out for each timestep and a separate grid of points is sampled for each timestep. These points carry feature vectors that encode all their relevant attributes, including position, heading, velocity, and one-hot vectors that represent the timestep. Despite being sparse, this input representation makes it easy for the model to see the regions occupied by agent bounding boxes.
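As an illustration of this representation, the sketch below samples a uniform grid of points from one oriented BEV box at one timestep and attaches a simple feature vector. The grid resolution and the exact feature layout are our assumptions; the paper does not specify them.

```python
import numpy as np

def box_points(cx, cy, heading, length, width, t, num_timesteps, n=4):
    """Sample an n x n grid of points inside one oriented BEV box at
    timestep t, attaching [x, y, heading, one-hot timestep] features."""
    u = np.linspace(-0.5, 0.5, n)
    local = np.stack(np.meshgrid(u * length, u * width), axis=-1).reshape(-1, 2)
    c, s = np.cos(heading), np.sin(heading)
    rotation = np.array([[c, -s], [s, c]])
    points = local @ rotation.T + np.array([cx, cy])  # [n*n, 2], world frame
    one_hot_t = np.tile(np.eye(num_timesteps)[t], (len(points), 1))
    heading_col = np.full((len(points), 1), heading)
    return np.concatenate([points, heading_col, one_hot_t], axis=1)
```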
[Figure 2: block diagram of the UniGen architecture; its components (a)-(j) are described in the caption below.]

Fig. 2. The overall design of UniGen. (a) The sparse inputs to the model consist of the polylines from the road layout, the points representing traffic lights, and the points uniformly sampled from BEV bounding boxes of existing scenario agents, if any. (b) The points are encoded into a dense scenario embedding. Three separate decoders predict the occupancy distribution of new agents to inject, their initial states, and their future trajectories. (c) The occupancy decoder predicts the distribution of initial locations separately for C classes of agents. In each iteration, one location is sampled from the occupancy heatmap to inject a new agent. (d) The location of the new agent is linearly mapped to a location in the dense scenario embedding and a feature patch is extracted surrounding that location. (e) In addition, an agent-centric road layout transformer encoder extracts and encodes the road polylines normalized to the coordinate frame of the injection location. (f) This agent-centric road layout encoding is fused with the flattened feature patch extracted from the shared scenario embedding using a 1-layer MLP. (g) The product is fed to the attribute decoder to predict the initial agent states as a 5D multivariate mixture distribution with M modes. (h) Five scalar attribute values are sampled, which together with the sampled agent location constitute the complete initial agent state. (i) The trajectory decoder receives this initial agent state in addition to the fused feature encoding from (f), and predicts a set of K trajectories with associated probabilities spanning over T timesteps. Each trajectory waypoint is represented by a 2D Gaussian. (j) Finally, a single trajectory is sampled from the K choices. At this point, the new agent is fully instantiated. The new agent is added to the scenario inputs in component (a) and the next iteration starts. Note: at training time, N equals the number of hidden ground-truth agents. At inference time, N equals 1 for injecting a single agent in each iteration.

C. Shared Scenario Encoder

All sparse inputs are encoded at once using a PointPillars encoder [26], where each pillar encodes the points residing in it using a Multi-Layer Perceptron (MLP) and produces a single feature vector from them using max-pooling. The dense feature map is further encoded using a CoAtNet backbone [27] to encode the global interactions between the different pillar features. The output of the shared encoder is a dense feature map with dimensions H_d × W_d × D_d.

D. Occupancy Prediction

The dense occupancy decoder outputs the distribution of initial positions for new agents to insert into the scenario. It receives the output of the shared encoder as input and uses a convolutional neural network to decode it into C disjoint occupancy grids of size H × W corresponding to C different agent classes, e.g., vehicles, pedestrians, and cyclists. The ground-truth occupancy grids are constructed by rendering the center points of the hidden (masked) agents as shown in Fig. 3. Each ground-truth cell contains a binary value O_{x,y} ∈ {0, 1}. The predicted occupancy values Ô_{x,y} are in the range [0, 1], representing the probability that a cell contains the center of some hidden agent belonging to the corresponding class. The occupancy decoder is trained using a cross-entropy loss. At inference time, the initial position (x^i_0, y^i_0) of the new agent is determined by sampling a cell from the predicted occupancy grid. At training time, we simply use the ground-truth positions of the hidden agents.
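At inference, drawing an injection location from the predicted grid amounts to sampling a categorical distribution over cells. A minimal sketch, assuming the per-cell probabilities are renormalized before sampling and that the grid origin sits at the corner of the modeled region; the 31.2 cm cell size comes from Section V.

```python
import numpy as np

def sample_initial_position(occupancy, cell_size=0.312, rng=None):
    """Sample an initial (x, y) from one class's H x W occupancy grid.

    The per-cell probabilities in [0, 1] are renormalized into a
    categorical distribution; the sampled cell index is mapped back
    to world coordinates at the cell center.
    """
    rng = rng or np.random.default_rng()
    probs = occupancy.ravel() / occupancy.sum()
    cell = rng.choice(occupancy.size, p=probs)
    row, col = np.unravel_index(cell, occupancy.shape)
    return (col + 0.5) * cell_size, (row + 0.5) * cell_size
```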
[Figure 3: a real driving scene is randomly split into input agents and hidden agents, from which the target occupancy grids are rendered.]

Fig. 3. Masking ground-truth agents for constructing labels for training the occupancy decoder (shown), as well as the attribute and trajectory decoders (not shown). In the ground-truth occupancy grids, only cells containing agent centers are turned on.

E. Agent-Centric Road Layout Encoder

While the shared scenario embedding is useful for capturing the global interactions between agents and road elements, its low spatial resolution is not ideal for regressing location-sensitive attributes like heading, which can be radically different for two nearby agents in opposing lanes. To this end, when predicting agent attributes and trajectories, we augment the shared scenario embedding with an agent-centric road layout embedding, which is obtained as follows. Given the sampled position (x^i_0, y^i_0) for the new agent, we extract and normalize a set of road layout polylines around that position and encode them using a transformer encoder similar to Wayformer [28]. The encoder produces a 1 × D_r feature vector per agent.

F. Per-Agent Feature Fusion

For each new agent, we bilinearly sample a k × k × D_d feature patch from the shared encoder's output, where k is a fixed hyperparameter. The location of the patch in the shared embedding is determined by linearly mapping the initial agent position (x^i_0, y^i_0) in the scene to a location in the dense feature map H_d × W_d. For each agent, we also obtain a 1 × D_r feature vector from the agent-centric road layout encoder. These two feature maps are flattened and passed through a 1-layer MLP to obtain a 1 × D feature vector.

G. Agent Attribute Prediction

While the position (x^i_0, y^i_0) of the new agent is sampled from the occupancy grid, its other initial state attributes are predicted by the attribute decoder. For each agent, the additional state attributes are captured by five values: width w^i, length l^i, a two-dimensional unit heading vector (cos θ^i_0, sin θ^i_0), and a scalar speed, which combined with the predicted heading can produce the velocity vector v^i_0.

The attribute decoder predicts M distinct modes and associated probabilities for each initial state attribute a. Inspired by trajectory prediction methods [28], we use a loss function consisting of classification and regression terms to learn the distribution of attributes. For each attribute a, let m̂* denote the index of the predicted mode closest to the ground truth, and let max(m̂) denote the index of the most likely mode according to the model. The classification term uses the cross-entropy loss to maximize the probability of selecting m̂*. The regression term minimizes the L1 distance between the ground-truth value a and the value of the most likely mode â_{max(m̂)}. We found this choice to be more effective than minimizing the distance to the closest mode â_{m̂*} as done in trajectory prediction losses [28]. We also scale all attribute values to have comparable loss magnitudes.

    L_a = -log Pr(m̂*) + |a - â_{max(m̂)}|,   (2)

where the first term is the classification loss and the second term is the regression loss.
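A sketch of the per-attribute loss of Eq. (2) in plain numpy. The argument shapes and the hard mode assignment are our reading of the equation, not a released implementation.

```python
import numpy as np

def attribute_loss(mode_logits, mode_values, target):
    """Per-attribute loss of Eq. (2).

    mode_logits: [M] unnormalized scores over the M modes.
    mode_values: [M] the attribute value predicted by each mode.
    target: scalar ground-truth value of the attribute.
    """
    log_probs = mode_logits - np.log(np.sum(np.exp(mode_logits)))
    m_star = np.argmin(np.abs(mode_values - target))  # mode closest to GT
    m_max = np.argmax(mode_logits)                    # most likely mode
    classification = -log_probs[m_star]               # -log Pr(m*)
    regression = np.abs(target - mode_values[m_max])  # L1 to most likely mode
    return classification + regression
```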
H. Trajectory Prediction

We adapt the transformer-based trajectory decoder and losses from MultiPath++ [29] and Wayformer [28]. However, these trajectory prediction methods require the current and recent positions of agents, while our approach can predict trajectories given only features sampled at arbitrary locations on the map. Our trajectory decoder takes the same inputs as the attribute decoder, namely a patch from the shared scenario embedding and the per-agent encoding of road layout polylines around the agent. The trajectory decoder also receives, as initial state, the initial position and heading of the agent. The initial position is already determined at this point. At evaluation and inference time, the initial heading is determined by the attributes sampled from the outputs of the attribute decoder. At training time, we simply use the ground-truth state of the hidden agents.

I. Autoregressive Scenario Generation

We use an autoregressive approach to generate new scenarios at inference time by injecting new agents into the scenario one at a time. In other words, we factor the conditional distribution for generating agent i as

    P(s^i | R, s^1, s^2, ..., s^{i-1}).   (3)

For each new agent, an initial position (x^i_0, y^i_0) is first sampled from the distribution of agent positions predicted in the occupancy grid. This sampled position is used to extract a fused feature vector needed to predict a distribution of initial state attributes w^i, l^i, θ^i_0, v^i_0. Sampling a set of attributes from this distribution instantiates the initial state of the agent. This initial agent state together with the fused feature vector is used to predict a distribution of future trajectories for the agent. A specific trajectory is then sampled from this distribution to fully instantiate and inject the agent, and complete one iteration of autoregressive generation. In the next iteration, the newly-generated agent is included as part of the model inputs, influencing all properties of the subsequent agents generated by the model. This autoregressive approach, coupled with the unified model, yields realistic scenarios where agent initial states and future trajectories are consistent with each other.
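The generation loop can be summarized as follows. The `model` object and its four methods are hypothetical stand-ins for the shared encoder and the three decoder heads of Fig. 2; they are named here only to make the control flow of Eq. (3) explicit.

```python
def generate_scenario(model, road_context, num_agents, rng):
    """Inject fully-instantiated agents one at a time, per Eq. (3)."""
    agents = []
    for _ in range(num_agents):
        # Re-encode the scene so the new agent is conditioned on all
        # previously injected agents and their full trajectories.
        embedding = model.encode(road_context, agents)
        position = model.sample_position(embedding, rng)   # occupancy head
        attributes = model.sample_attributes(embedding, position, rng)
        trajectory = model.sample_trajectory(embedding, position,
                                             attributes, rng)
        agents.append((position, attributes, trajectory))
    return agents
```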
V. EXPERIMENTAL SETUP

a) Dataset: We train our model on real traffic scenarios from the Waymo Open Motion Dataset [10]. The training set contains about 69,500 scenarios and we evaluate on a subset of 1000 examples from the validation set. Each scenario contains 8s of agent future trajectories recorded at 10 Hz. When masking agents in the ground-truth training set, we sample the per-scenario probability p_keep for keeping agents in inputs uniformly from [-0.3, 0.9] while clipping negative values to zero. This ensures adequate exposure to blank scenes during training. For the evaluation set, we remove all agents from the scenario and then use our method to inject the same number of agents into the blank scene. The resulting scenario is then compared with the ground truth using scene and trajectory distribution metrics.

b) Hyperparameters: We model the scenes as a 120 m × 120 m region centered on the AV. We filter out agents which are outside this BEV FOV and agents which do not have valid 8s future trajectories. Ground-truth occupancy grids are created and predicted at a resolution of H × W = 384 × 384 for C = 3 classes of vehicles, pedestrians, and cyclists. Each grid cell corresponds to a 31.2 cm × 31.2 cm region of the world.

The scenario encoder has 128 × 128 pillars and produces a 128 × 128 × 64 output. The CoAtNet backbone outputs a H_d × W_d × D_d = 32 × 32 × 64 dense feature map, from which we extract feature patches of size k × k = 5 × 5. The agent-centric road layout encoder is a 4-layer transformer with 256 KV hidden size, and outputs a 1 × 256 feature vector. A 1-layer MLP fuses this vector with the flattened dense feature map into a 1 × D = 1 × 512 feature vector per agent.

The attribute decoder is an MLP with 4 hidden layers, each containing 1024 hidden units with ReLU [30] activations. The attribute decoder predicts a distribution with M = 8 modes. The trajectory decoder is an 8-layer transformer with 512 KV hidden size, predicting K = 64 different trajectories with associated probabilities. At inference time we sample the most likely trajectory for each agent.

c) Baselines: We compare our method with TrafficGen [8] and the non-conditioned variant of LCTGen [9] on scene generation and motion prediction metrics. Unlike our method, TrafficGen is limited to vehicles only. We use the pretrained model and scenario generation code provided by the authors. For a fair assessment of the underlying model's performance, we use TrafficGen's scenario generation code without its collision check and resampling logic.

d) Ablations: To investigate the effect of different components in our approach, we train five variants of UniGen. UniGen Joint uses a unified model to generate all outputs but does not use the agent-centric road layout encoder or condition its predictions on future trajectories. UniGen Separate is similar but uses two separate models to predict initial agent states and agent trajectories, as in all prior methods. UniGen w/ Agent-Centric R improves upon UniGen Joint by adding the agent-centric road layout encoder. UniGen w/ Traj. Inputs improves upon UniGen Joint by processing future agent trajectories in the shared scenario encoder. Finally, UniGen Combined adds all these improvements.

e) Initial State Metrics: We employ the Maximum Mean Discrepancy (MMD) [31] metric, denoted as MMD²(X, Y), to quantify the similarity between the original and model-generated distributions for initial agent states. Given sets of samples X = {x_1, ..., x_m} and Y = {y_1, ..., y_n}, and a Gaussian kernel function k, an empirical estimate of MMD²(X, Y) is given by

    (1/m²) Σ_{i,j} k(x_i, x_j) - (2/(mn)) Σ_{i,j} k(x_i, y_j) + (1/n²) Σ_{i,j} k(y_i, y_j),   (4)

where k(x_i, x_j) denotes the kernel function evaluated at samples x_i and x_j, and similarly for k(x_i, y_j) and k(y_i, y_j). We compute MMD separately for each initial state attribute, including position, bounding box size, heading, and velocity vector. The MMD metric is computed for each scenario separately using a Gaussian kernel and then averaged over the validation set into a single number. We also report the Static Collision Rate (SCR), i.e., the percentage of agents with overlapping initial state bounding boxes per scenario.
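Eq. (4) translates directly into a few lines of numpy. This is a generic implementation of the biased MMD² estimator; the kernel bandwidth below is a placeholder, since the paper does not report the value it uses.

```python
import numpy as np

def mmd2(X, Y, sigma=1.0):
    """Empirical MMD^2 of Eq. (4) between X [m, d] and Y [n, d] under a
    Gaussian kernel k(a, b) = exp(-||a - b||^2 / (2 * sigma^2))."""
    def k(A, B):
        sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
        return np.exp(-sq_dists / (2.0 * sigma ** 2))
    m, n = len(X), len(Y)
    return (k(X, X).sum() / m ** 2
            - 2.0 * k(X, Y).sum() / (m * n)
            + k(Y, Y).sum() / n ** 2)
```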
f) Motion Metrics: Evaluating motion predictions for scenario generation is not straightforward, since the model generally injects agents in positions that differ from the ground truth. Some methods [8], [9] use matching algorithms to pair up and reorient predicted agents with ground-truth agents and then compute standard trajectory metrics such as ADE and FDE [16]. This may be suboptimal, since the trajectory alignment quality is mostly affected by the initial position of the agents; for example, consider cases where the closest ground-truth agent is in the opposing lane.

Inspired by the Waymo Sim Agents Challenge [32], our motion evaluation incorporates the Dynamic Collision Rate (DCR), the average percentage of agents with overlapping trajectory bounding boxes per scenario. We also report MMD metrics on motion attributes including velocity, acceleration, distance to the nearest agent, and distance to the road edge. Velocity and acceleration at each timestep are computed using finite differences.
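A minimal sketch of the finite-difference step, assuming trajectories are stored as [T, 2] waypoint arrays sampled at 10 Hz as in WOMD.

```python
import numpy as np

def speed_and_acceleration(trajectory, dt=0.1):
    """Finite-difference speed and acceleration from [T, 2] waypoints
    sampled at 10 Hz (dt = 0.1 s)."""
    velocity = np.diff(trajectory, axis=0) / dt   # [T-1, 2]
    speed = np.linalg.norm(velocity, axis=-1)     # [T-1]
    acceleration = np.diff(speed) / dt            # [T-2]
    return speed, acceleration
```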
VI. RESULTS

Table I shows MMD and static collision metrics over the initial agent states. UniGen significantly outperforms prior methods on static MMD metrics. Ablation experiments show that including the agent-centric road layout encoder has the most impact on improving initial state metrics. Conditioning the predictions on future trajectories for existing agents greatly improves the metrics as well. The multitask UniGen Joint performs similarly to UniGen Separate with a dedicated trajectory model, while also allowing for encoding agent trajectories, which is not possible in UniGen Separate.

TABLE I
INITIAL STATE MMD METRICS AND STATIC COLLISION RATES

Method                     SCR (%)  Position  Heading  Size    Velocity
Ground Truth               0.14     -         -        -       -
TrafficGen [8]             12.71    0.1451    0.1325   0.0926  0.1733
LCTGen [9]                 N/A      0.1319    0.1418   0.1092  0.1948
UniGen Separate            1.82     0.1357    0.2203   0.0835  0.1910
UniGen Joint               1.87     0.1323    0.2251   0.0831  0.1915
UniGen w/ Agent-Centric R  1.16     0.1217    0.1095   0.0817  0.1679
UniGen w/ Traj. Inputs     1.35     0.1197    0.1897   0.0826  0.1657
UniGen Combined            1.13     0.1208    0.1104   0.0815  0.1591

TABLE II
MOTION MMD METRICS AND 8S TRAJECTORY COLLISION RATES

Method                     DCR (%)  Speed  Acceleration  Dist. to Nearest  Dist. to Road Edge
Ground Truth               1.20     -      -             -                 -
TrafficGen [8]             19.05    0.207  0.133         0.205             0.258
LCTGen 5s Traj [9]         8.38     N/A    N/A           N/A               N/A
UniGen Separate            7.71     0.220  0.144         0.155             0.190
UniGen Joint               7.69     0.223  0.145         0.153             0.193
UniGen w/ Agent-Centric R  6.72     0.199  0.105         0.171             0.194
UniGen w/ Traj. Inputs     5.21     0.225  0.142         0.153             0.181
UniGen Combined            4.63     0.186  0.101         0.136             0.175

Table II shows the MMD metrics over motion attributes and the dynamic collision metrics. UniGen outperforms prior methods on all available metrics. The ablation experiments show that the collision metrics improve the most when we include the future agent trajectories in the scenario encoder. On the other hand, location- and heading-sensitive attributes like speed and acceleration improve the most by adding the agent-centric road layout encoder, which allows the model to react to the road layout around the injection position with high precision.

Despite significantly improving SCR and DCR metrics compared to prior works, some generated scenarios still contain collisions. While static collisions might be influenced by noisy bounding boxes and the presence of box overlaps in ground-truth data, dynamic collisions are mainly an indication of limited capacity in the model to incorporate future trajectories of existing scenario agents, which can be addressed in future work.

Fig. 4 shows sample scenarios generated by UniGen. The positions, initial states, and trajectories mimic the behavior of ground-truth scenario agents. By conditioning the predictions on the entire state attributes of existing agents, UniGen increases the distributional realism of agent interactions and the consistency of the generated scenarios.

[Figure 4: four columns of scenario renderings labeled Ground Truth, Sample 1, Sample 2, and Sample 3.]

Fig. 4. Example scenarios generated by UniGen. The left column shows two sample ground-truth scenarios. For each scenario, we remove all agents and generate new agents with trajectories given just the road layout. We apply the method three separate times, resulting in three different generated scenarios.

VII. CONCLUSIONS

This paper introduces UniGen, a novel scenario generation method with state-of-the-art performance. UniGen generates multimodal distributions for agent positions, initial state attributes, and future trajectories. Using a unified model allows UniGen to fully condition all properties of every new agent on all properties of existing agents. In particular, conditioning the predictions on trajectories of existing scenario agents greatly improves the consistency of the generated scenarios as reflected by the improved distributional metrics and lower collision rates. Furthermore, employing an agent-centric road layout encoder greatly improves the precision of the model and its control over location- and heading-sensitive attributes.

REFERENCES

[1] W. Ding, C. Xu, M. Arief, H. Lin, B. Li, and D. Zhao, "A survey on safety-critical driving scenario generation—a methodological perspective," IEEE Transactions on Intelligent Transportation Systems, 2023.
[2] A. Dosovitskiy, G. Ros, F. Codevilla, A. Lopez, and V. Koltun, "CARLA: An open urban driving simulator," in Conference on Robot Learning. PMLR, 2017, pp. 1–16.
[3] P. A. Lopez, M. Behrisch, L. Bieker-Walz, J. Erdmann, Y.-P. Flötteröd, R. Hilbrich, L. Lücken, J. Rummel, P. Wagner, and E. Wießner, "Microscopic traffic simulation using SUMO," in 2018 21st International Conference on Intelligent Transportation Systems (ITSC). IEEE, 2018, pp. 2575–2582.
[4] K. Wong, Q. Zhang, M. Liang, B. Yang, R. Liao, A. Sadat, and R. Urtasun, "Testing the safety of self-driving vehicles by simulating perception and prediction," in Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXVI. Springer, 2020, pp. 312–329.
[5] S. Tan, K. Wong, S. Wang, S. Manivasagam, M. Ren, and R. Urtasun, "SceneGen: Learning to generate realistic traffic scenes," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 892–901.
[6] E. Pronovost, K. Wang, and N. Roy, "Generating driving scenes with diffusion," arXiv preprint arXiv:2305.18452, 2023.
[7] L. Bergamini, Y. Ye, O. Scheel, L. Chen, C. Hu, L. Del Pero, B. Osiński, H. Grimmett, and P. Ondruska, "SimNet: Learning reactive self-driving simulations from real-world observations," in 2021 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2021, pp. 5119–5125.
[8] L. Feng, Q. Li, Z. Peng, S. Tan, and B. Zhou, "TrafficGen: Learning to generate diverse and realistic traffic scenarios," arXiv preprint arXiv:2210.06609, 2022.
[9] S. Tan, B. Ivanovic, X. Weng, M. Pavone, and P. Kraehenbuehl, "Language conditioned traffic generation," arXiv preprint arXiv:2307.07947, 2023.
[10] S. Ettinger, S. Cheng, B. Caine, C. Liu, H. Zhao, S. Pradhan, Y. Chai, B. Sapp, C. R. Qi, Y. Zhou et al., "Large scale interactive motion forecasting for autonomous driving: The Waymo Open Motion Dataset," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 9710–9719.
[11] A. Kar, A. Prakash, M.-Y. Liu, E. Cameracci, J. Yuan, M. Rusiniak, D. Acuna, A. Torralba, and S. Fidler, "Meta-Sim: Learning to generate synthetic datasets," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 4551–4560.
[12] J. Gao, C. Sun, H. Zhao, Y. Shen, D. Anguelov, C. Li, and C. Schmid, "VectorNet: Encoding HD maps and agent dynamics from vectorized representation," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 11525–11533.
[13] J. Kim, R. Mahjourian, S. Ettinger, M. Bansal, B. White, B. Sapp, and D. Anguelov, "StopNet: Scalable trajectory and occupancy prediction for urban autonomous driving," in 2022 International Conference on Robotics and Automation (ICRA). IEEE, 2022, pp. 8957–8963.
[14] R. Mahjourian, J. Kim, Y. Chai, M. Tan, B. Sapp, and D. Anguelov, "Occupancy flow fields for motion forecasting in autonomous driving," IEEE Robotics and Automation Letters, vol. 7, no. 2, pp. 5639–5646, 2022.
[15] T. Phan-Minh, E. C. Grigore, F. A. Boulton, O. Beijbom, and E. M. Wolff, "CoverNet: Multimodal behavior prediction using trajectory sets," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 14074–14083.
[16] M.-F. Chang, J. Lambert, P. Sangkloy, J. Singh, S. Bak, A. Hartnett, D. Wang, P. Carr, S. Lucey, D. Ramanan et al., "Argoverse: 3D tracking and forecasting with rich maps," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 8748–8757.
[17] Y. Chai, B. Sapp, M. Bansal, and D. Anguelov, "MultiPath: Multiple probabilistic anchor trajectory hypotheses for behavior prediction," arXiv preprint arXiv:1910.05449, 2019.
[18] H. Cui, V. Radosavljevic, F.-C. Chou, T.-H. Lin, T. Nguyen, T.-K. Huang, J. Schneider, and N. Djuric, "Multimodal trajectory predictions for autonomous driving using deep convolutional networks," in 2019 International Conference on Robotics and Automation (ICRA). IEEE, 2019, pp. 2090–2096.
[19] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever et al., "Language models are unsupervised multitask learners," OpenAI blog, vol. 1, no. 8, p. 9, 2019.
[20] A. v. d. Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, "WaveNet: A generative model for raw audio," arXiv preprint arXiv:1609.03499, 2016.
[21] A. A. Jyothi, T. Durand, J. He, L. Sigal, and G. Mori, "LayoutVAE: Stochastic scene layout generation from a label set," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 9895–9904.
[22] D. Ritchie, K. Wang, and Y.-a. Lin, "Fast and flexible indoor scene synthesis via deep convolutional generative models," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 6182–6190.
[23] X. Wang, C. Yeshwanth, and M. Nießner, "SceneFormer: Indoor scene generation with transformers," in 2021 International Conference on 3D Vision (3DV). IEEE, 2021, pp. 106–115.
[24] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," arXiv preprint arXiv:1810.04805, 2018.
[25] K. He, X. Chen, S. Xie, Y. Li, P. Dollár, and R. Girshick, "Masked autoencoders are scalable vision learners," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 16000–16009.
[26] A. H. Lang, S. Vora, H. Caesar, L. Zhou, J. Yang, and O. Beijbom, "PointPillars: Fast encoders for object detection from point clouds," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 12697–12705.
[27] Z. Dai, H. Liu, Q. V. Le, and M. Tan, "CoAtNet: Marrying convolution and attention for all data sizes," Advances in Neural Information Processing Systems, vol. 34, pp. 3965–3977, 2021.
[28] N. Nayakanti, R. Al-Rfou, A. Zhou, K. Goel, K. S. Refaat, and B. Sapp, "Wayformer: Motion forecasting via simple & efficient attention networks," arXiv preprint arXiv:2207.05844, 2022.
[29] B. Varadarajan, A. Hefny, A. Srivastava, K. S. Refaat, N. Nayakanti, A. Cornman, K. Chen, B. Douillard, C. P. Lam, D. Anguelov et al., "MultiPath++: Efficient information fusion and trajectory aggregation for behavior prediction," in 2022 International Conference on Robotics and Automation (ICRA). IEEE, 2022, pp. 7814–7821.
[30] K. Fukushima, N. H. Kyōkai, and N. H. K. S. G. Kenkyūjo, Cognitron, a Self-organizing Multilayered Neural Network Model, ser. NHK Technical Monograph. Nippon Hoso Kyokai (Japan Broadcasting Corporation), Technical Research Laboratories, 1981. [Online]. Available: https://books.google.co.uk/books?id=QiFQwAACAAJ
[31] K. M. Borgwardt, A. Gretton, M. J. Rasch, H.-P. Kriegel, B. Schölkopf, and A. J. Smola, "Integrating structured biological data by kernel maximum mean discrepancy," Bioinformatics, vol. 22, no. 14, pp. e49–e57, 2006.
[32] N. Montali, J. Lambert, P. Mougin, A. Kuefler, N. Rhinehart, M. Li, C. Gulino, T. Emrich, Z. Yang, S. Whiteson, B. White, and D. Anguelov, "The Waymo open sim agents challenge," 2023.