Hierarchical Imitation Learning for Stochastic Environments

Maximilian Igl*, Punit Shah*, Paul Mougin*, Sirish Srinivasan*, Tarun Gupta†, Brandyn White*, Kyriacos Shiarlis*, Shimon Whiteson*

*Waymo Research, †U. of Oxford. Work performed during internship.

Abstract—Many applications of imitation learning require the agent to generate the full distribution of behaviour observed in the training data. For example, to evaluate the safety of autonomous vehicles in simulation, accurate and diverse behaviour models of other road users are paramount. Existing methods that improve this distributional realism typically rely on hierarchical policies. These condition the policy on types such as goals or personas that give rise to multi-modal behaviour. However, such methods are often inappropriate for stochastic environments where the agent must also react to external factors: because agent types are inferred from the observed future trajectory during training, these environments require that the contributions of internal and external factors to the agent behaviour are disentangled and only internal factors, i.e., those under the agent's control, are encoded in the type. Encoding future information about external factors leads to inappropriate agent reactions during testing, when the future is unknown and types must be drawn independently from the actual future. We formalize this challenge as distribution shift in the conditional distribution of agent types under environmental stochasticity. We propose Robust Type Conditioning (RTC), which eliminates this shift with adversarial training under randomly sampled types. Experiments on two domains, including the large-scale Waymo Open Motion Dataset, show improved distributional realism while maintaining or improving task performance compared to state-of-the-art baselines.

Fig. 1 ((a) Training, (b) Testing): Example highlighting how stochastic environments can cause out-of-distribution issues for hierarchical policies. Top: During training, the latent ĝ is inferred from the future trajectory τ in the data using an encoder e_θ(ĝ|τ). In this example, it captures the driving direction and whether the light turns green. The policy π_θ(â|ĝ, s) 'decodes' ĝ by acting in the environment to generate τ̂. The reconstruction loss L_rec penalises differences between τ and τ̂, training the policy to follow ĝ. Bottom: During testing, without access to the future, the latent ĝ must be sampled randomly from a prior p_θ(ĝ) which was trained to match the marginal distribution of possible latents, i.e., it randomly samples red or green lights and possible driving directions. This can cause issues such as collisions when the random latent and the environment do not match. Because the prior cannot know the future, it might sample a red light while the real traffic light turns green (2nd example) or, worse, it might wrongly sample a green light, possibly leading to collisions (last example) if the agent follows the latent ĝ as it was trained to do. On the other hand, random sampling of agent-internal decisions such as driving directions is unproblematic as these do not make assumptions about the future environment.

I. INTRODUCTION

Learning to imitate behaviour is crucial when reward design is infeasible [1, 2, 3, 4], for overcoming hard exploration problems [5, 6], and for realistic modelling of dynamical systems with multiple interacting agents [7]. Such systems, including games, driving simulations, and agent-based economic models, often have known state transition functions, but require accurate agent models to be realistic.
For example, for driving simulations, which are crucial for accelerating the development of autonomous vehicles [8, 9], faithful reactions of all road users are paramount. Furthermore, it is not enough to mimic a single mode in the data; instead, agents must reproduce the full distribution of behaviours to avoid sim2real gaps in modelled systems [10, 11].

Current imitation learning (IL) methods fall short of achieving such distributional realism: while they are capable of generating individual trajectories that are realistic, they fail to match the full distribution of observed behaviour. Indeed, the adversarial training objective which enables state-of-the-art performance of most current IL methods is known to be prone to mode dropping in practice [12, 13, 14], even though it optimises a distribution-matching objective in principle [15, 16, 17]. Furthermore, progress on distributional realism is hindered by a lack of suitable benchmarks, with most relying on unimodal data and only evaluating task performance as measured by rewards, but not mode coverage or recall. By contrast, many applications, such as agent modeling for autonomous vehicles, require distributional realism in addition to good task performance. Consequently, our goal is to improve distributional realism while maintaining strong task performance.

To mitigate mode collapse and improve distributional realism in complex environments, previous work uses hierarchical policies in an autoencoder-like framework [12, 8, 9, 18]. During training, an encoder infers goals from observed future trajectories and the agent, conditioned on those goals, strives to imitate the original trajectory. At test time, a prior distribution proposes distributionally realistic goals, without requiring access to privileged future information. We refer to these goals as an agent's inferred type since it can express not only goals, but many agent characteristics responsible for multi-modal behaviour, such as persona, goal, or strategy.

However, as we show in section III, using such hierarchical policies in stochastic environments can create a distribution shift between training and testing, possibly leading to out-of-distribution inputs and reduced performance. Unfortunately, the autoencoder training does not prevent extrinsic information from being encoded. Intuitively, the type should only capture agent-intrinsic choices that are under the agent's control.
Consider a car waiting at an intersection (see fig. 1). During training, because the agent's type (e.g., goal and driving style) is not directly observed, it must be inferred from its future trajectory. However, this inferred type might not only capture the agent's goal and driving style, but also external factors out of the agent's control, such as the time until the traffic light turns green. Even innocuous seeming type representations can leak external information; for example, a goal location extracted from the future trajectory can leak information about waiting times based on its distance to the starting position.

Capturing information about external events in the inferred type causes problems at test time, when the type must be sampled randomly without foresight of future events. For example, if the type contains information about traffic light timings, the actual timing on the test data will almost surely differ from the randomly sampled one, resulting in out-of-distribution inputs to the policy, which never encountered such a mismatch during training where the type was always inferred from the actual future. Furthermore, the agent might have learned to ignore the actual traffic light, instead relying entirely on the inferred type that was always optimal during training. This could cause it to enter the intersection too early, resulting in potentially catastrophic consequences such as collisions.

Existing hierarchical work either assumes no external stochasticity in the environment [12, 8], relies on manually designed type representations that cannot capture external events but limit expressiveness [9], or relies on manually designed cost functions and type filters that mitigate the performance degradation but do not solve the underlying problem and induce biases in the learned behaviour [18].

In this paper, we identify the challenges arising under stochastic environments and formulate them as a new form of distribution shift for hierarchical policies. Unlike the familiar covariate shift in the state distribution [19], this conditional type shift occurs in the distribution of the inferred latent type. It greatly reduces performance by yielding causally confused agents that rely on the latent type for information about external factors, instead of inferring them from the latest environment observation. We propose Robust Type Conditioning (RTC) to eliminate this distribution shift through a coupled adversarial training objective under randomly sampled types. We do not require access to an expert, counterfactuals, or manually specified type labels for trajectories.

Experimentally, we show the need for improved distributional realism in state-of-the-art imitation learning techniques such as GAIL [16]. Furthermore, we show that naively trained hierarchical models with inferred types improve distributional realism, but exhibit poor task performance in stochastic environments. By contrast, RTC can maintain good task performance in stochastic environments while improving distributional realism. We evaluate RTC on the illustrative Double Goal Problem as well as the large-scale Waymo Open Motion Dataset [20] of real driving behaviour.

II. BACKGROUND

We are given a dataset D = {τ_i}_{i=1}^{N} of N trajectories τ_i = (s_0^(i), a_0^(i), ..., s_T^(i)), drawn from p(τ) of one or more experts interacting with a stochastic environment p(s_{t+1}|s_t, a_t), where s_t ∈ S are states and a_t ∈ A are actions. Our goal is to learn a policy π_θ(a_t|s_t) to match p(τ) when replacing the unknown expert and generating rollouts

    τ̂ ∼ p(τ̂) = p(s_0) Π_{t=0}^{T−1} π_θ(â_t|ŝ_t) p(ŝ_{t+1}|ŝ_t, â_t)

from the initial states s_0 ∼ p(s_0). We simplify notation and write τ̂ ∼ π_θ(τ̂) and τ ∼ D(τ) to indicate rollouts generated by the policy or drawn from the data respectively. Expectations E_{τ∼D} and E_{τ̂∼π_θ} are taken over all pairs (s_t, a_t) ∈ τ and (ŝ_t, â_t) ∈ τ̂.

Previous work [e.g., 19, 16] shows that a core challenge of learning from demonstration is reducing or eliminating the covariate shift in the state-visitation frequencies p(s) caused by accumulating errors when using π_θ. Unfortunately, Behavioural Cloning (BC), a simple supervised training objective optimising max_θ E_{τ∼D}[log π_θ(a_t|s_t)], is not robust to it. To overcome covariate shift, generative adversarial imitation learning (GAIL) [16] optimises π_θ to fool a learned discriminator D_ϕ(â_t, ŝ_t) that is trained to distinguish between trajectories in D and those generated by π_θ:

    min_θ max_ϕ  E_{τ̂∼π_θ}[ log D_ϕ(â_t, ŝ_t) ]              (1)
               + E_{τ∼D}[ log(1 − D_ϕ(a_t, s_t)) ].           (2)

The policy can be optimised using reinforcement learning, by treating the log-discriminator scores as costs, r_t = −log D_ϕ(â_t, ŝ_t). Alternatively, if the policy can be reparameterized [21] and the environment is differentiable, the sum of log discriminator scores can be optimised directly without relying on high-variance score function estimators by backpropagating through the transition dynamics, L_adv(τ̂) = E_{τ̂∼π_θ}[ Σ_t −log D_ϕ(â_t, ŝ_t) ]. We refer to this as Model-based GAIL (MGAIL) and assume a known differentiable environment instead of a learned model as in [17].
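As an illustration of these two ingredients, the following minimal PyTorch sketch shows a discriminator update corresponding to eqs. (1)–(2) and an MGAIL-style policy update that backpropagates Σ_t −log D_ϕ(â_t, ŝ_t) through a differentiable rollout. The network sizes, the toy env_step transition, and the sign convention of the discriminator output (logit of "expert") are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of adversarial imitation learning with a differentiable rollout.
import torch
import torch.nn as nn
import torch.nn.functional as F

S_DIM, A_DIM, T = 8, 2, 10

policy = nn.Sequential(nn.Linear(S_DIM, 64), nn.Tanh(), nn.Linear(64, A_DIM))
disc = nn.Sequential(nn.Linear(S_DIM + A_DIM, 64), nn.Tanh(), nn.Linear(64, 1))
opt_pi = torch.optim.Adam(policy.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(disc.parameters(), lr=1e-4)
bce = nn.BCEWithLogitsLoss()

def env_step(s, a):
    # Stand-in for a known differentiable transition p(s_{t+1} | s_t, a_t).
    return s + 0.1 * torch.tanh(a).repeat(1, S_DIM // A_DIM)

def rollout(s0):
    # Differentiable rollout: deterministic actions stand in for a reparameterised policy.
    states, actions, s = [], [], s0
    for _ in range(T):
        a = policy(s)
        states.append(s); actions.append(a)
        s = env_step(s, a)
    return torch.cat(states), torch.cat(actions)

def train_step(expert_s, expert_a, s0):
    # Discriminator update (eqs. 1-2): expert pairs labelled 1, generated pairs 0.
    gen_s, gen_a = rollout(s0)
    d_expert = disc(torch.cat([expert_s, expert_a], -1))
    d_gen = disc(torch.cat([gen_s, gen_a], -1).detach())
    loss_d = bce(d_expert, torch.ones_like(d_expert)) + \
             bce(d_gen, torch.zeros_like(d_gen))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # MGAIL-style policy update: minimise sum_t -log D(a_t, s_t) by backpropagating
    # through the discriminator and the transition dynamics.
    gen_s, gen_a = rollout(s0)
    loss_pi = -F.logsigmoid(disc(torch.cat([gen_s, gen_a], -1))).mean()
    opt_pi.zero_grad(); loss_pi.backward(); opt_pi.step()
    return float(loss_d), float(loss_pi)

expert_s, expert_a = torch.randn(32, S_DIM), torch.randn(32, A_DIM)
train_step(expert_s, expert_a, s0=torch.randn(4, S_DIM))
```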
In this work, we are concerned with multimodal distributions p(τ) and how mode collapse can be avoided when learning π_θ. To this end, we assume the dataset is sampled from

    p(τ) = p(s_0) ∫ p(g) p(ξ) Π_{t=0}^{T} p(a_t|s_t, g) p(s_{t+1}|s_t, a_t, ξ) dξ dg,

where g is the agent type, expressing agent characteristics such as persona, goal, or strategy, and ξ is a random variable capturing the stochasticity in the environment, i.e., p(s_{t+1}|s_t, a_t, ξ) is a delta distribution δ_{f(s_t, a_t, ξ)}(s_{t+1}) for some transition function f. We call an agent realistic if its generated trajectories τ̂ ∼ π_θ lie in the support of p(τ). We call an agent distributionally realistic if its distribution over trajectories matches the data, i.e. p(τ̂) ≈ p(τ). As we show in section VI, current non-hierarchical adversarial methods [16] are not distributionally realistic.

To combat mode collapse, hierarchical methods [e.g., 12, 22, 8, 9, 18] often rely on an encoder to infer latent agent types ĝ_e from trajectories during training, ĝ_e ∼ e_θ(ĝ_e|τ), and optimise the control policy π_θ(â_t|ŝ_t, ĝ_e) to generate trajectories τ̂_e similar to τ: τ̂_e ∼ p(τ̂_e|ĝ_e) = p(s_0) Π_{t=0}^{T−1} π_θ(â_t|ŝ_t, ĝ_e) p(ŝ_{t+1}|â_t, ŝ_t). As ground truth trajectories are not accessible during testing, a prior p_θ(ĝ_p), which has been trained to match the marginal distribution p_e(ĝ_e) = E_τ[e_θ(ĝ_e|τ)], is used to sample distributionally realistic types ĝ_p. We indicate by subscript ĝ_p or ĝ_e whether the inferred type and trajectory are drawn from the prior distribution p_θ(ĝ_p) or encoder e_θ(ĝ_e|τ). Subscripts are omitted for states and actions to simplify notation. Inferred types and predicted trajectories without subscripts indicate that either sampling distribution could be used. For information theoretic quantities we use capital letters S, A, Â, Ĝ and Ξ to denote the random variables for values s, a, â, ĝ and ξ.

III. CONDITIONAL TYPE SHIFT

Here we outline the challenge of conditional type shift that arises for hierarchical policies in stochastic environments. We provide a simple example illustrating the challenge and how it can be overcome, as well as formulate a proof for the exact conditions under which such a distribution shift occurs. These insights motivate the algorithm in section IV.

A. Simplified model

We use the simplified model in fig. 2. For intuition, we connect it to the example mentioned in the introduction of an agent approaching a traffic light. This model has two sources of randomness in the training data D: the environmental noise ξ (whether the traffic light is red or green) and the type g of the expert we are mimicking (whether the expert is paying attention). The crucial difference between g and ξ is that ξ represents external factors outside the agent's control to which it must react, while g encodes agent-internal decisions that can be taken independently of ξ. In this simple model, the temporal dimension is removed and the state s is a deterministic function of only ξ and not influenced by g. Hence, in this section we use s and ξ interchangeably.

During training, the inferred type ĝ_e is drawn from the encoder e_θ(ĝ_e|τ), which has access to the future 'trajectories' τ = (s, a) in the data. During testing, without access to τ, a prior p_θ(ĝ_p) is used to sample ĝ_p. Actions â are drawn from the learned control policy π_θ(â|s, ĝ) and a reconstruction loss L_rec(a, â) is minimised. As typical in autoencoders, the prior p_θ(ĝ_p) is trained to match the marginal distribution of the encoder, p_e(ĝ_e) = E_τ[e_θ(ĝ_e|τ)], by minimizing L_kl(τ) = E_{τ∼D}[ KL[ p_θ(ĝ_p) ∥ e_θ(ĝ_e|τ) ] ].

Fig. 2: Simplified, non-temporal setup with environmental noise ξ and unobserved true agent type g. The inferred type ĝ is sampled from e_θ(ĝ_e|s, a) during training (top-left) and p_θ(ĝ_p) otherwise (top-right). The control policy is π_θ(â|s, ĝ). Circles are random variables and squares deterministic functions. The loss L(a, â) penalises differences between a and â. Bottom: Example data, B denotes Bernoulli distributions. (a) Encoder ĝ_e ∼ e_θ(ĝ_e|s, a) and policy π_θ(â|s, ĝ_e). (b) Prior ĝ_p ∼ p_θ(ĝ_p) and policy π_θ(â|s, ĝ_p). (c) Example dataset.
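A minimal sketch of this autoencoder-style hierarchy is given below (PyTorch; the Gaussian type, network sizes, and stand-in data are illustrative assumptions rather than the paper's implementation). It wires up the encoder e_θ(ĝ_e|s, a), the learned prior p_θ(ĝ_p), and the control policy π_θ(â|s, ĝ), trains them with L_rec + β·L_kl, and samples the type from the prior at test time.

```python
# Sketch of the autoencoder-style hierarchy of fig. 2 with a continuous type.
import torch
import torch.nn as nn
import torch.nn.functional as F

S_DIM, A_DIM, G_DIM, BETA = 4, 2, 2, 0.1

class Hierarchy(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(S_DIM + A_DIM, 2 * G_DIM)   # mean and log-var of e(g|s,a)
        self.prior = nn.Parameter(torch.zeros(2 * G_DIM))    # mean and log-var of p(g)
        self.policy = nn.Sequential(nn.Linear(S_DIM + G_DIM, 64), nn.Tanh(),
                                    nn.Linear(64, A_DIM))

    def encode(self, s, a):
        mu, logvar = self.encoder(torch.cat([s, a], -1)).chunk(2, -1)
        g = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterised sample
        return g, mu, logvar

    def loss(self, s, a):
        g, mu, logvar = self.encode(s, a)
        a_hat = self.policy(torch.cat([s, g], -1))
        l_rec = F.mse_loss(a_hat, a)                          # reconstruction loss
        p_mu, p_logvar = self.prior.chunk(2, -1)
        # KL(p(g) || e(g|s,a)) between diagonal Gaussians, as in L_kl.
        l_kl = 0.5 * (logvar - p_logvar
                      + (p_logvar.exp() + (p_mu - mu) ** 2) / logvar.exp() - 1).sum(-1).mean()
        return l_rec + BETA * l_kl

    def act(self, s):
        # Test time: no future trajectory, so the type is sampled from the prior.
        p_mu, p_logvar = self.prior.chunk(2, -1)
        g = p_mu + torch.randn(s.shape[0], G_DIM) * (0.5 * p_logvar).exp()
        return self.policy(torch.cat([s, g], -1))

model = Hierarchy()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
s, a = torch.randn(256, S_DIM), torch.randn(256, A_DIM)      # stand-in dataset D
loss = model.loss(s, a)
opt.zero_grad(); loss.backward(); opt.step()
```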
B. Only external factors of influence

We first describe a scenario with only external sources of stochasticity that serves as a minimal example of how things can go wrong due to conditional type shift. As there are no agent-internal decisions, hierarchies are unnecessary in this minimal scenario. In section III-C, we extend this example to include agent-internal decisions which hierarchical policies capture well.

Consider the example data in fig. 2c with ϵ = 0, i.e., for now we assume the expert is always paying attention. The environment can be in two states. Half the time, it is in s_0, where the traffic light is red and the agent always takes action a_0 = stop. Otherwise, in s_1, the traffic light is green and the agent takes action a_1 = go.

During training, the encoder observes the actual future τ = (s, a) in the data and proposes the type ĝ_e ∼ e_θ(ĝ_e|τ) with ĝ_e ∈ {0, 1}. This allows, for example, the following solution 1, which minimises the reconstruction loss L_rec(a, â):

    ĝ_e(s_i, a_j) = j   and   π_θ(â|s_i, ĝ_e) = a_{ĝ_e}.

The encoder encodes the desired action in the type ĝ_e and the policy follows ĝ_e while ignoring s. This constitutes a perfect solution during training. However, during testing, we do not have access to τ and must instead draw types randomly from the prior ĝ_p ∼ p_θ(ĝ_p), which matches the marginal distribution of the encoder, i.e., p_θ(ĝ_p) = p_e(ĝ_e) = B(0.5). Here, B is the Bernoulli distribution and the prior is drawing types ĝ_p ∈ {0, 1} with equal probability because it cannot know the stochastic environment state s in advance.

The conditional type shift arises because this marginal distribution does not need to match the conditional type distribution in specific states, i.e., p_θ(ĝ_p) ≠ e_θ(ĝ_e|s, a). For example, the prior might sample ĝ_p = 1 while the stochastic environment shows a red light (s = s_0). The resulting input to the policy, (s_0, ĝ = 1), was never seen during training, where state and type always matched, i.e., the input pairs were either (s_0, ĝ = 0) or (s_1, ĝ = 1). If the policy generalises to this new input by following the type, as was optimal during training, it randomly stops or goes, clearly not reproducing the data distribution and causing potentially catastrophic mistakes such as collisions.

This problem always occurs when information about external stochastic factors is captured by the type. As there are no internal decisions by the agent in this example, the ideal solution is for the type to not encode any information. In section III-C we show that the conditional type shift problem does not arise when only agent-internal decisions are encoded, as these can be taken by the agent independently from the environment stochasticity.
C. External and internal factors of influence

To express this, we now introduce ϵ > 0 as the probability that the agent decides not to pay attention to the traffic light. Hence, the expert now either follows the traffic light with p(a_i|s_i) = 1 − ϵ, or ignores it with p(a_{≠i}|s_i) = ϵ.

The previous solution 1 is still viable during training, minimising the reconstruction loss, but still fails during testing as it generates an action distribution which ignores the traffic light 50% of the time, i.e., p(â_{≠i}|s_i) = 0.5, in contrast to the expert, which only deviates with probability p(a_{≠i}|s_i) = ϵ.

By contrast, solution 2 avoids the conditional type shift while still successfully encoding the agent-internal decision:

    ĝ_e(s_i, a_j) = { 0 if i = j,  1 if i ≠ j },      π_θ(â|s_i, ĝ) = { a_i if ĝ = 0,  a_{≠i} if ĝ = 1 }.

Here the latent type ĝ only captures whether the agent pays attention (ĝ = 0) or not (ĝ = 1). Now the marginal encoder type distribution is p_e(ĝ_e = 0) = 1 − ϵ and hence we have p_θ(ĝ_p = 0) = 1 − ϵ. Sampling types from this prior reproduces the expert's action distribution at test time, because the type now only encodes the agent-internal decision of whether to pay attention, which can be taken independently of the environment. Hierarchical policies hence only generalise at test time when the type only conveys information about agent-internal features and is uncorrelated with any external stochastic events during training.

For simplicity, the model in fig. 2c has only one time-step. For temporally extended data, the states s_t depend not only on ξ, but also on g or ĝ, complicating theoretical treatment. Nevertheless, seeing ξ as all future stochasticity in the environment, the same challenges arise. In realistic environments, such as driving, this future stochasticity includes, for example, the unpredictable behaviour of other road users.
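Both solutions can be checked numerically; the short script below (plain Python, an illustrative recreation of the fig. 2c example rather than code from the paper) estimates how often each test-time policy deviates from the light when types are drawn from the marginal prior. Solution 1 deviates roughly half the time, while solution 2 matches the expert's deviation rate ϵ.

```python
# Numerical check of the traffic-light example (fig. 2c): test-time behaviour
# when types are sampled from the marginal prior instead of the encoder.
import random

EPS = 0.1            # probability that the expert ignores the light
N = 100_000
random.seed(0)

def expert(s):
    # Follow the light (a = s) with probability 1 - eps, ignore it otherwise.
    return s if random.random() > EPS else 1 - s

def policy_solution_1(s):
    # Solution 1: the type directly encodes the action; the marginal prior is B(0.5).
    g = random.randint(0, 1)
    return g                          # the policy follows the type and ignores s

def policy_solution_2(s):
    # Solution 2: the type encodes "pay attention" (g = 0) vs "ignore" (g = 1);
    # the marginal prior samples g = 1 with probability eps.
    g = 1 if random.random() < EPS else 0
    return s if g == 0 else 1 - s

def deviation_rate(policy):
    # Estimate p(a != s | s): how often the chosen action disagrees with the light.
    deviations = 0
    for _ in range(N):
        s = random.randint(0, 1)      # red (s = 0) or green (s = 1), each half the time
        deviations += int(policy(s) != s)
    return deviations / N

print("expert     :", deviation_rate(expert))             # ~ eps
print("solution 1 :", deviation_rate(policy_solution_1))  # ~ 0.5, conditional type shift
print("solution 2 :", deviation_rate(policy_solution_2))  # ~ eps, matches the expert
```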
e p attention (gˆ=0) or not (gˆ=1). Now the marginal encoder During training on the dataset P (s,a) the interaction D type distribution is p e(gˆ e =0)=1−ϵ and hence we have information is positive because H(A|Gˆ )0. e e mentsonlygeneraliseattesttimewhenthetypeonlyconveys information about agent-internal features and is uncorrelated On the other hand, during testing, we have I(Gˆ p,S)=0 with any external stochastic events during training. because Gˆ p is drawn independently of S. The interaction For simplicity, the model in fig. 2c has only one time- information becomes weakly negative: step. For temporally extended data, the states s depend t I(Aˆ,Gˆ ,S)=I(Gˆ ,S)−I(Gˆ ,S|Aˆ)≤0 not only on ξ, but also on g or gˆ, complicating theoretical p p p treatment. Nevertheless, seeing ξ as all future stochasticity With I(Aˆ,Gˆ ,S)=I(S,Aˆ)−H(Aˆ|Gˆ ) we get p p in the environment, the same challenges arise. In realistic I(S,Aˆ)−H(Aˆ|Gˆ )≤0H(A|S) follows directly. D. Theorem IV. ROBUSTTYPECONDITIONING Here we provide an information theoretic proof that agents We present Robust Type Conditioning (RTC), a method that, during training, rely on the inferred type to acquire for improving distributional realism in imitation learning information about external events, do not react appropriately while maintaining high task performance, even in stochastic to environment stochasticity during testing. This formalises environments. As shown in previous work [8, 9, 18], and thepreviousdiscussionbutisnotneededtofollowsubsequent confirmed in section VI, hierarchical policies trained in sections of the paper. an autoencoder framework are currently the most effective Theorem 1 The hierarchical autoencoding model approach at improving distributional realism. However, such p (aˆ|s,a) and test policy p (aˆ|s) are as described above, policies require the latent type to be inferred from the θ θFig. 3: RobustTypeConditioning(RTC):Thecontrolpolicyπ (aˆ |sˆ,gˆ)istrainedunderinferredtypesgˆsampledfromboththeencoder θ t t e (gˆ |τ) and the prior p (gˆ ). The hierarchical loss L (τ,τˆ)=L (τ,τˆ)+βL (τ) improves distributional realism. The adversarial θ e θ p vae rec e kl loss L (τˆ) under prior types prevents causally confused policies and ensures good task performance at test time, even in stochastic adv environments. L (τ) optimises the prior to sample distributionally realistic types. kl future trajectory, which can cause problems in stochastic environment. Parameters θ¯are held fixed and λ and β are environments (see section III). scalar weights. D (a ,s ) is a learned per-timestep discrimi- ϕ t t To overcome this limitation, we propose to combine the nator.Lastly,L (τ,τˆ)isareconstructionlossbetweenτ and rec autoencoder training objective L = L +βL with an τˆ whichcantakedifferentforms.Forexample,insectionVI- vae rec kl e additional adversarial objective L utilising a learned dis- AweusetheBClossL (τ,τˆ)=−logπ (a |s ,gˆ )while adv rec θ t t e criminator D (a ,s ). Importantly, this additional objective in section VI-B we minimise the L distance between agent ϕ t t 2 allows us to sample training types not only from the encoder, positions in s and sˆ. The loss L (τ) optimises the prior t t kl but also from the prior. When types are sampled from the to propose distributionally realistic types by matching the prior,thehierarchicallossL cannotbeusedaswegenerally marginal encoder distribution. 
Fig. 3: Robust Type Conditioning (RTC): The control policy π_θ(â_t|ŝ_t, ĝ) is trained under inferred types ĝ sampled from both the encoder e_θ(ĝ_e|τ) and the prior p_θ(ĝ_p). The hierarchical loss L_vae(τ, τ̂_e) = L_rec(τ, τ̂_e) + β L_kl(τ) improves distributional realism. The adversarial loss L_adv(τ̂) under prior types prevents causally confused policies and ensures good task performance at test time, even in stochastic environments. L_kl(τ) optimises the prior to sample distributionally realistic types.

IV. ROBUST TYPE CONDITIONING

We present Robust Type Conditioning (RTC), a method for improving distributional realism in imitation learning while maintaining high task performance, even in stochastic environments. As shown in previous work [8, 9, 18], and confirmed in section VI, hierarchical policies trained in an autoencoder framework are currently the most effective approach at improving distributional realism. However, such policies require the latent type to be inferred from the future trajectory, which can cause problems in stochastic environments (see section III).

To overcome this limitation, we propose to combine the autoencoder training objective L_vae = L_rec + β L_kl with an additional adversarial objective L_adv utilising a learned discriminator D_ϕ(a_t, s_t). Importantly, this additional objective allows us to sample training types not only from the encoder, but also from the prior. When types are sampled from the prior, the hierarchical loss L_vae cannot be used as we generally do not have access to ground truth trajectories corresponding to this specific type, which are required for the reconstruction loss L_rec. Instead, for these types, we only optimise the adversarial objective L_adv(τ̂) = Σ_t −log D_ϕ(â_t, ŝ_t).

During training, we hence split each minibatch B = {τ^(b)}_{b=1}^{N_b} of N_b trajectories sampled from D into two parts. For the fraction f of trajectories in B, the rollouts τ̂_e are generated from types sampled from the encoder ĝ_e ∼ e_θ(ĝ_e|τ) and both L_adv and L_vae are optimised (first line in eq. (4)). For the remaining fraction (1 − f) of trajectories, types are sampled from the prior p_θ(ĝ_p) and only L_adv is optimised (second line in eq. (4)).

Because the policy does not know whether the type is sampled from the encoder or prior, this combination ensures that policies follow agent-internal information in the type, due to the autoencoder training objective, but ignore any information in the type about external stochastic events, as this would lead to unrealistic trajectories under prior types, which are penalised by the adversarial training objective. Lastly, because the KL objective L_kl(τ) = E_{τ∼D}[ KL[ p_θ(ĝ_p) ∥ e_θ(ĝ_e|τ) ] ] minimises the amount of information encoded in the type, such unused information would not even be encoded.

Consequently, the full RTC loss is

    L_RTC = E_{D(τ) e_θ(ĝ_e|τ) π_θ(τ̂_e|ĝ_e)}[ λ L_adv(τ̂_e) + L_vae(τ, τ̂_e) ]
          + E_{D(τ) p_θ̄(ĝ_p) π_θ(τ̂_p|ĝ_p)}[ λ L_adv(τ̂_p) ],                     (4)

with

    L_vae(τ, τ̂) = L_rec(τ, τ̂) + β L_kl(τ),
    L_kl(τ) = E_{τ∼D}[ KL[ p_θ(ĝ_p) ∥ e_θ(ĝ_e|τ) ] ],
    L_adv(τ̂) = Σ_t −log D_ϕ(â_t, ŝ_t),

where p_θ(ĝ_p) is a learned prior, e_θ(ĝ_e|τ) a learned trajectory encoder and π_θ(τ̂|ĝ) is shorthand for generating trajectories τ̂ by rolling out the learned control policy π_θ(â|ŝ, ĝ) in the environment. Parameters θ̄ are held fixed and λ and β are scalar weights. D_ϕ(a_t, s_t) is a learned per-timestep discriminator. Lastly, L_rec(τ, τ̂) is a reconstruction loss between τ and τ̂ which can take different forms. For example, in section VI-A we use the BC loss L_rec(τ, τ̂) = −log π_θ(a_t|s_t, ĝ_e) while in section VI-B we minimise the L_2 distance between agent positions in s_t and ŝ_t. The loss L_kl(τ) optimises the prior to propose distributionally realistic types by matching the marginal encoder distribution.

One can understand the problem of conditional type shift as one of causally confused policies which refer to the type for information about external stochastic events, instead of acquiring this information directly from the currently observed states. From this perspective, sampling from the prior constitutes a causal intervention do(ĝ) in which ĝ is changed independently of the environmental factor ξ. [23] show that causal confusion can be avoided by applying such interventions and optimising the policy to correctly predict the counterfactual expert trajectory distribution, in our case p_expert(τ|ξ, do(ĝ)). Unfortunately, we do not have access to this counterfactual trajectory. Instead, we rely on the generalisation of π_θ to get us 'close' to such a counterfactual trajectory for types do(ĝ) and then refine the policy locally using the adversarial objective.

We find that both continuous type representations and discrete type representations using straight-through gradient estimation work well in practice (see section VI-B).

Optimisation of L_adv and L_rec can either be performed directly, similar to MGAIL [17], by using a differentiable environment and reparameterised policies and encoder [21], or by treating them as rewards and using RL methods such as TRPO [24, 16] or PPO [25]. The loss L_kl can always be optimised directly.
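The sketch below shows how one RTC update along the lines of eq. (4) could look (PyTorch; the toy dynamics, network sizes, split fraction f and weights λ, β are illustrative assumptions, and the discriminator update of eqs. (1)–(2) is omitted). Encoder-sampled types receive the adversarial and hierarchical losses; prior-sampled types, with the prior held fixed, receive only the adversarial loss.

```python
# Sketch of one RTC update (eq. 4) with a continuous type.
import torch
import torch.nn as nn
import torch.nn.functional as F

S, A, G, T, F_ENC, LAM, BETA = 4, 2, 2, 5, 0.5, 1.0, 0.1

encoder = nn.GRU(S + A, 2 * G, batch_first=True)      # e(g|tau): summarises the trajectory
prior = nn.Parameter(torch.zeros(2 * G))               # p(g): mean and log-variance
policy = nn.Sequential(nn.Linear(S + G, 32), nn.Tanh(), nn.Linear(32, A))
disc = nn.Sequential(nn.Linear(S + A, 32), nn.Tanh(), nn.Linear(32, 1))
opt = torch.optim.Adam(list(encoder.parameters()) + [prior] + list(policy.parameters()), lr=1e-4)

def sample_type(mu_logvar):
    mu, logvar = mu_logvar.chunk(2, -1)
    return mu + torch.randn_like(mu) * (0.5 * logvar).exp(), mu, logvar

def rollout(s0, g):
    # Differentiable rollout of pi(a|s, g); the dynamics here are a toy stand-in.
    states, actions, s = [], [], s0
    for _ in range(T):
        a = policy(torch.cat([s, g], -1))
        states.append(s); actions.append(a)
        s = s + 0.1 * a.repeat(1, S // A)
    return torch.stack(states, 1), torch.stack(actions, 1)

def l_adv(states, actions):
    # sum_t -log D(a_t, s_t); the discriminator is trained separately (eqs. 1-2).
    return -F.logsigmoid(disc(torch.cat([states, actions], -1))).sum(1).mean()

def rtc_step(tau_s, tau_a):
    # tau_s: [N, T, S] and tau_a: [N, T, A] expert trajectories sampled from D.
    n = tau_s.shape[0]
    n_e = int(F_ENC * n)

    # Fraction f: encoder-sampled types, adversarial + hierarchical losses.
    _, h = encoder(torch.cat([tau_s[:n_e], tau_a[:n_e]], -1))
    g_e, mu, logvar = sample_type(h[-1])
    s_e, a_e = rollout(tau_s[:n_e, 0], g_e)
    l_rec = ((a_e - tau_a[:n_e]) ** 2).mean()               # reconstruction loss (L2 on actions)
    p_mu, p_logvar = prior.chunk(2, -1)
    l_kl = 0.5 * (logvar - p_logvar                          # KL(p(g) || e(g|tau))
                  + (p_logvar.exp() + (p_mu - mu) ** 2) / logvar.exp() - 1).sum(-1).mean()
    loss = LAM * l_adv(s_e, a_e) + l_rec + BETA * l_kl

    # Fraction 1-f: prior-sampled types (prior held fixed), adversarial loss only.
    with torch.no_grad():
        g_p, _, _ = sample_type(prior.unsqueeze(0).expand(n - n_e, -1))
    s_p, a_p = rollout(tau_s[n_e:, 0], g_p)
    loss = loss + LAM * l_adv(s_p, a_p)

    opt.zero_grad(); loss.backward(); opt.step()
    return float(loss)

tau_s, tau_a = torch.randn(8, T, S), torch.randn(8, T, A)
rtc_step(tau_s, tau_a)
```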
V. RELATED WORK

Hierarchical policies have been extensively studied in RL [e.g., 26, 27, 28, 29, 30] and IL. In RL, they improve exploration, sample efficiency and fast adaptation. By contrast, in IL, hierarchies are used to capture multimodal distributions, improve data efficiency [31, 32], and enable goal conditioning [33]. Similar to our work, [12] and [22] learn to encode trajectories into latent types that influence a control policy. Crucially, both only consider deterministic environments and hence avoid the distribution shifts and unwanted information leakage we address. They extend prior work in which the type, or context, is provided in the dataset [34], which is also assumed in [35]. [36] use a sampling method to infer latent types. [37] and [9] use manually designed encoders specific to road users by expressing future goals as sequences of lane segments. This avoids information leakage but cannot express all characteristics of human drivers, such as persona, and cannot transfer to other tasks. BITS [18] uses goal positions as types, which suffer from conditional type shift. Consequently, their method requires behaviour prediction and a manually specified cost function to filter goals that might mismatch with predicted futures. Future states in deterministic environments [38], language [39], and predefined strategy statistics [40] have also been used as types.

Information theoretic regularization offers an alternative to learning hierarchical policies using the auto-encoder framework [41, 42]. However, these methods are less expressive since their prior distribution cannot be learned, and they only aim to cluster modes already captured by the agent but do not penalize dropping modes in the data. This provides a useful inductive bias but often struggles in complex environments with high diversity, requiring manual feature engineering [43, 44].

Lastly, TrafficSim [8] uses IL to model driving agents, controlling all stochasticity in the scene but using independent prior distributions for separate agents. Hence, while the environment is assumed deterministic, conditional type shift can occur between the separate agent types, which are correlated during training but independent during testing. They use a biased "common sense" collision avoidance loss, motivated by covariate shift in visited states. Our work suggests that type shift might also explain the benefits gained. In contrast, our adversarial objective is unbiased.
Fig. 4: Differences between realism, coverage and distributional realism. The data distribution P(X_D) is shown in green, blue denotes a learned distribution P_θ(X_L). (a) Data from the learned distribution is realistic, i.e. supp(X_L) ⊆ supp(X_D), but not distributionally realistic. (b) The learned distribution achieves coverage but not distributional realism: the frequencies of modes are not matched. (c) The learned distribution is distributionally realistic. In practice we measure distributional realism in selected features h(X) as the dimensionality of X is too high.

VI. EXPERIMENTS

We show in two stochastic environments with multimodal expert behaviour that i) existing IL methods suffer from insufficient distributional realism, ii) hierarchical methods can suffer from conditional type shift and degraded task performance, and iii) RTC improves distributional realism while maintaining excellent task performance.

We compare the following models: MGAIL uses an adversarial training objective with a learned discriminator. It also optimises a BC loss as we found this to improve performance. Symphony [9] (called 'MGAIL+H' in the original paper), building on MGAIL, utilises future lane segments as manually specified types which avoid conditional type shift but limit expressiveness. InfoMGAIL [41] augments MGAIL to elicit distinct trajectories for different types by using an information-theoretic loss. This is an alternative training paradigm for hierarchical policies, besides using autoencoders with reconstruction loss. For fair comparison, our method RTC uses the same MGAIL implementation as adversarial objective. We investigate both continuous and discrete type representations, RTC-C and RTC-D. NaiveHierarchy is a hierarchical autoencoder not training on prior-sampled types (but also using the adversarial MGAIL objective) and hence experiencing conditional type shift and high collision frequency.

A. Double Goal Problem

Fig. 5: Top: Visualization of ten randomly sampled goal pairs and associated trajectories. Bottom: Training curves, exponentially smoothed and averaged over 20 seeds. Shading shows the standard deviation. We show task performance as 'Test Return' (higher is better) and distributional realism as 'JSD' between the goal distribution of expert and agent (lower is better).

In the double goal problem, the expert starts from the origin and creates a multimodal trajectory distribution by randomly choosing and approaching one of two possible, slowly moving goals located on the 2D plane. Stochasticity is introduced through randomized initial goal locations and movement directions. Nevertheless, the lower and upper goal {g_l, g_u} remain identifiable by their location as y_l < 0 for g_l and y_u > 0 for g_u (see fig. 5). While both goals are equally easy to reach, the expert has a preference P(G = g_l) = 0.75. Sufficiently complex expert trajectories prevent BC from achieving optimal performance, requiring more advanced approaches. The expert follows a curved path and randomly resamples the selected goal for the first ten steps to avoid a simple decision boundary along the x-axis in which experts in the lower half-plane always target goal g_l. RTC uses the BC loss as reconstruction loss L_rec(τ, τ̂) = −log π_θ(a_t|s_t, ĝ_e) and continuous types. All policies use a bimodal Gaussian mixture model as action distribution.

This experiment combines agent-internal decisions (which goal to approach) with external stochasticity (goal starting positions and movement directions). Task performance is measured as the number of steps for which the agent is within δ = 0.1 distance of one of the goals (higher is better). Distributional realism is measured as the divergence between the empirical goal distributions, JSD(p_agent(h_s) ∥ p_expert(h_s)) (lower is better), where we take h_s = sign(y_T) of the final agent position [x_T, y_T] to indicate which goal was approached. Our aim is to improve distributional realism while maintaining or improving task performance.
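A possible implementation of these two metrics is sketched below (NumPy; array shapes and the example data are illustrative assumptions, not the paper's evaluation code).

```python
# Double Goal Problem metrics: task return (steps within delta of either goal)
# and JSD between expert and agent distributions over h_s = sign(y_T).
import numpy as np

DELTA = 0.1

def task_return(agent_xy, goal_l_xy, goal_u_xy):
    # All inputs: [T, 2] positions of the agent and the two moving goals.
    d_l = np.linalg.norm(agent_xy - goal_l_xy, axis=-1)
    d_u = np.linalg.norm(agent_xy - goal_u_xy, axis=-1)
    return int(np.sum(np.minimum(d_l, d_u) < DELTA))

def goal_jsd(expert_final_y, agent_final_y):
    # Empirical distribution over which goal was approached (sign of the final y).
    def dist(final_y):
        p_lower = float(np.mean(np.asarray(final_y) < 0))
        return np.array([p_lower, 1.0 - p_lower])

    p, q = dist(expert_final_y), dist(agent_final_y)
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0
        return float(np.sum(a[mask] * np.log2(a[mask] / b[mask])))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Example: the expert prefers the lower goal with P = 0.75; a mode-collapsed agent
# choosing goals 50/50 yields a non-zero JSD.
rng = np.random.default_rng(0)
expert_y = np.where(rng.random(10_000) < 0.75, -1.0, 1.0)
agent_y = np.where(rng.random(10_000) < 0.5, -1.0, 1.0)
print(goal_jsd(expert_y, expert_y), goal_jsd(expert_y, agent_y))
```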
To com- agent s expert s style 0 T (cid:0) (cid:1) better) where we take h = sign(y ) of the final agent puteJSD p (h )∥p (h ) ,thevalueof s T agent cur/style expert cur/style position[x ,y ]toindicatewhichgoalwasapproached.Our h is discretize into 100 equisized bins. T T cur/style aim is to improve distributional realism while maintaining or Results are provided in table I. Both versions of RTC improving task performance. improvetaskperformance(collisionsandoff-roadevents)and Figure 5 shows that MGAIL improves task performance distributionalrealismmetrics(minADEanddivergences)com- compared to BC. Our method, RTC, improves it further, paredtotheflatMGAILbaselineandprevioushierarchicalap- possibly because given a type, the required action distri- proaches (Symphony, InfoMGAIL, NaiveHierarchy). bution is unimodal. Importantly, RTC substantially improves Both type representations, RTC-C and RTC-D, perform simi- distributional realism, achieving lower JSD values. The bias larly,showingrobustnessofRTCtodifferentimplementations. introduced by the information-theoretic loss in InfoMGAIL MGAIL achieves good task performance, but is outper- reduces task performance without improving distributional formed by RTC due to the use of hierarchy. On the other realism. Lastly, NaiveHierarchy achieves excellent dis- hand, Symphony, using lane segment goals to capture tributional realism through the learned hierarchy but suffers driving intent, consequently improves on the Curvature reduced task performance due to conditional type shift. JSD distributional realism metric, but not on Progress JSD which measures driving style, not intent. In contrast, RTC B. Waymo Open Motion Dataset (WOMD) improves on both distributional realism metrics since the fully learned type is more expressive. The information- To evaluate RTC on a complex environment we use theoreticlossinInfoMGAILimprovesdistributionalrealism the Waymo Open Motion Dataset [20] consisting of 487K on some metrics, but is less effective than RTC: while the segments of real world driving behaviour, each 9s long additional InfoMGAIL loss ensures that the type contains at 10Hz. We follow [9] by controlling agents at 3.3Hz some information, it does not require this information to be and replying uncontrolled agents from logs. Distributionally useful, unlike in an autoencoder framework. realisticagentsarecriticalfordrivingsimulations,forexample Lastly, the advantage of RTC in achieving both good for estimating safety metrics. Diverse intents and driving task performance and distributional realism becomes styles cause the data to be highly multimodal. External clearest by comparing it to NaiveHierarchy. While stochasticity is induced through the unpredictable behaviour NaiveHierarchy achieves some improvements in dis- of other cars, cyclists and pedestrians. We use L (τ,τˆ)= rec (cid:80)T L (s ,sˆ) where L is the average Huber loss of tributional realism, is has nearly an order of magnitude t Huber t t Huber more collisions. This is a consequence of the challenges the four vehicle bounding box corners. discussed in section III: At training time, the inferred type The percentage of segments with collisions and contains too much information, for example when to break time spent off-road are proxy metrics for task per- or start driving. At test time, because this information is formance and realism. 
TABLE I: Averages and standard deviation over 20 training runs on WOMD. The best two values are highlighted.

                       Collision      Off-road       MinADE        Curvature JSD   Progress JSD
                       rate (%) ↓     time (%) ↓     (m) ↓         (×10⁻³) ↓       (×10⁻³) ↓
    Data Distribution   1.16           0.68           -             -               -
    MGAIL               5.39 ± 0.68    0.89 ± 0.12    1.34 ± 0.08   1.32 ± 1.48     3.81 ± 1.29
    Symphony            6.39 ± 0.95    0.90 ± 0.06    1.40 ± 0.12   0.97 ± 0.62     6.44 ± 5.25
    InfoMGAIL-C         5.21 ± 0.37    0.89 ± 0.14    1.29 ± 0.07   1.24 ± 0.93     4.40 ± 1.47
    InfoMGAIL-D         4.82 ± 0.29    0.84 ± 0.10    1.35 ± 0.11   0.77 ± 0.44     4.01 ± 1.45
    NaiveHierarchy     35.08 ± 0.44    1.83 ± 0.42    1.12 ± 0.01   1.76 ± 2.05     2.54 ± 0.63
    RTC-C               4.23 ± 0.16    0.68 ± 0.04    1.15 ± 0.10   0.43 ± 0.06     2.17 ± 0.65
    RTC-D               4.21 ± 0.24    0.74 ± 0.06    1.12 ± 0.10   0.89 ± 0.66     2.56 ± 0.54

Results are provided in table I. Both versions of RTC improve task performance (collisions and off-road events) and distributional realism metrics (minADE and divergences) compared to the flat MGAIL baseline and previous hierarchical approaches (Symphony, InfoMGAIL, NaiveHierarchy). Both type representations, RTC-C and RTC-D, perform similarly, showing robustness of RTC to different implementations.

MGAIL achieves good task performance, but is outperformed by RTC due to the use of hierarchy. On the other hand, Symphony, using lane segment goals to capture driving intent, consequently improves on the Curvature JSD distributional realism metric, but not on Progress JSD, which measures driving style, not intent. In contrast, RTC improves on both distributional realism metrics since the fully learned type is more expressive. The information-theoretic loss in InfoMGAIL improves distributional realism on some metrics, but is less effective than RTC: while the additional InfoMGAIL loss ensures that the type contains some information, it does not require this information to be useful, unlike in an autoencoder framework.

Lastly, the advantage of RTC in achieving both good task performance and distributional realism becomes clearest by comparing it to NaiveHierarchy. While NaiveHierarchy achieves some improvements in distributional realism, it has nearly an order of magnitude more collisions. This is a consequence of the challenges discussed in section III: At training time, the inferred type contains too much information, for example when to brake or start driving. At test time, because this information is sampled independently of what is actually happening in the environment, the agent behaves incorrectly and collides with other road users.

VII. CONCLUSIONS, LIMITATIONS, AND FUTURE WORK

This paper identified new challenges in learning hierarchical policies from demonstration to capture multimodal trajectory distributions in stochastic environments. We expressed them as conditional type shifts in the hierarchical policy. We proposed Robust Type Conditioning (RTC) to eliminate these distribution shifts and showed improved distributional realism while maintaining or improving task performance on two stochastic environments, including the Waymo Open Motion Dataset [20]. Future work will address conditional distributional realism by not only matching the marginal distribution p(τ), but the conditional distribution p(τ|ξ) under a specific realization of the environment. For example, drivers might change their intent based on the current traffic situation or players might adapt their strategy as the game unfolds. Achieving such conditional distributional realism will also require new models and metrics.

REFERENCES

[1] D. Amodei, C. Olah, J. Steinhardt, P. Christiano, J. Schulman, and D. Mané, "Concrete problems in AI safety," arXiv preprint arXiv:1606.06565, 2016.
[2] D. Hadfield-Menell, S. Milli, P. Abbeel, S. J. Russell, and A. Dragan, "Inverse reward design," Advances in Neural Information Processing Systems, vol. 30, 2017.
[3] J. Fu, K. Luo, and S. Levine, "Learning robust rewards with adversarial inverse reinforcement learning," in International Conference on Learning Representations, 2018. [Online]. Available: https://openreview.net/forum?id=rkHywl-A-
[4] T. Everitt, M. Hutter, R. Kumar, and V. Krakovna, "Reward tampering problems and solutions in reinforcement learning: A causal influence diagram perspective," Synthese, vol. 198, no. 27, pp. 6435–6467, 2021.
[5] A. Rajeswaran, V. Kumar, A. Gupta, G. Vezzani, J. Schulman, E. Todorov, and S. Levine, "Learning complex dexterous manipulation with deep reinforcement learning and demonstrations," arXiv preprint arXiv:1709.10087, 2017.
[6] Y. Zhu, Z. Wang, J. Merel, A. Rusu, T. Erez, S. Cabi, S. Tunyasuvunakool, J. Kramár, R. Hadsell, N. de Freitas et al., "Reinforcement and imitation learning for diverse visuomotor skills," arXiv preprint arXiv:1802.09564, 2018.
[7] J. D. Farmer and D. Foley, "The economy needs agent-based modelling," Nature, vol. 460, no. 7256, pp. 685–686, 2009.
[8] S. Suo, S. Regalado, S. Casas, and R. Urtasun, "TrafficSim: Learning to simulate realistic multi-agent behaviors," in ICCV, 2021.
[9] M. Igl, D. Kim, A. Kuefler, P. Mougin, P. Shah, K. Shiarlis, D. Anguelov, M. Palatucci, B. White, and S. Whiteson, "Symphony: Learning realistic and diverse agents for autonomous driving simulation," in ICRA, 2022.
[10] A. Grover, M. Al-Shedivat, J. Gupta, Y. Burda, and H. Edwards, "Learning policy representations in multiagent systems," in International Conference on Machine Learning. PMLR, 2018, pp. 1802–1811.
[11] Y. Liang, C. Guo, Z. Ding, and H. Hua, "Agent-based modeling in electricity market using deep deterministic policy gradient algorithm," IEEE Transactions on Power Systems, vol. 35, no. 6, 2020.
[12] Z. Wang, J. S. Merel, S. E. Reed, N. de Freitas, G. Wayne, and N. Heess, "Robust imitation of diverse behaviors," NeurIPS, 2017.
[13] M. Lucic, K. Kurach, M. Michalski, S. Gelly, and O. Bousquet, "Are GANs created equal? A large-scale study," NeurIPS, 2018.
[14] A. Creswell, T. White, V. Dumoulin, K. Arulkumaran, B. Sengupta, and A. A. Bharath, "Generative adversarial networks: An overview," IEEE Signal Processing Magazine, vol. 35, no. 1, pp. 53–65, 2018.
[15] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial networks," Communications of the ACM, vol. 63, no. 11, pp. 139–144, 2020.
[16] J. Ho and S. Ermon, "Generative adversarial imitation learning," NeurIPS, 2016.
[17] N. Baram, O. Anschel, and S. Mannor, "Model-based adversarial imitation learning," arXiv preprint arXiv:1612.02179, 2016.
[18] D. Xu, Y. Chen, B. Ivanovic, and M. Pavone, "BITS: Bi-level imitation for traffic simulation," arXiv preprint arXiv:2208.12403, 2022.
[19] S. Ross, G. Gordon, and D. Bagnell, "A reduction of imitation learning and structured prediction to no-regret online learning," JMLR Workshop and Conference Proceedings, 2011.
[20] S. Ettinger, S. Cheng, B. Caine, C. Liu, H. Zhao, S. Pradhan, Y. Chai, B. Sapp, C. Qi, Y. Zhou, Z. Yang, A. Chouard, P. Sun, J. Ngiam, V. Vasudevan, A. McCauley, J. Shlens, and D. Anguelov, "Large scale interactive motion forecasting for autonomous driving: The Waymo Open Motion Dataset," CoRR, 2021.
[21] D. P. Kingma and M. Welling, "Auto-encoding variational Bayes," ICLR, 2014.
[22] C. Lynch, M. Khansari, T. Xiao, V. Kumar, J. Tompson, S. Levine, and P. Sermanet, "Learning latent plans from play," in Conference on Robot Learning. PMLR, 2020, pp. 1113–1132.
[23] P. De Haan, D. Jayaraman, and S. Levine, "Causal confusion in imitation learning," NeurIPS, 2019.
[24] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz, "Trust region policy optimization," in ICML, 2015.
[25] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, "Proximal policy optimization algorithms," arXiv:1707.06347, 2017.
[26] R. S. Sutton, D. Precup, and S. Singh, "Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning," Artificial Intelligence, vol. 112, no. 1-2, pp. 181–211, 1999.
[27] P.-L. Bacon, J. Harb, and D. Precup, "The option-critic architecture," in AAAI, 2017.
[28] A. S. Vezhnevets, S. Osindero, T. Schaul, N. Heess, M. Jaderberg, D. Silver, and K. Kavukcuoglu, "FeUdal networks for hierarchical reinforcement learning," in ICML, 2017.
[29] O. Nachum, S. Gu, H. Lee, and S. Levine, "Near-optimal representation learning for hierarchical reinforcement learning," in ICLR, 2019.
[30] M. Igl, A. Gambardella, J. He, N. Nardelli, N. Siddharth, W. Böhmer, and S. Whiteson, "Multitask soft option learning," in UAI, 2020.
[31] S. Krishnan, R. Fox, I. Stoica, and K. Goldberg, "DDCO: Discovery of deep continuous options for robot learning from demonstrations," in Conference on Robot Learning. PMLR, 2017, pp. 418–437.
[32] H. Le, N. Jiang, A. Agarwal, M. Dudik, Y. Yue, and H. Daumé III, "Hierarchical imitation and reinforcement learning," in ICML, 2018.
[33] K. Shiarlis, M. Wulfmeier, S. Salter, S. Whiteson, and I. Posner, "TACO: Learning task decomposition via temporal alignment for control," in ICML, 2018.
[34] J. Merel, Y. Tassa, D. TB, S. Srinivasan, J. Lemmon, Z. Wang, G. Wayne, and N. Heess, "Learning human behaviors from motion capture by adversarial imitation," arXiv:1707.02201, 2017.
[35] C. Fei, B. Wang, Y. Zhuang, Z. Zhang, J. Hao, H. Zhang, X. Ji, and W. Liu, "Triple-GAIL: A multi-modal imitation learning framework with generative adversarial nets," arXiv preprint arXiv:2005.10622, 2020.
[36] A. Tamar, K. Rohanimanesh, Y. Chow, C. Vigorito, B. Goodrich, M. Kahane, and D. Pridmore, "Imitation learning from visual data with multiple intentions," in ICLR, 2018.
[37] S. Khandelwal, W. Qi, J. Singh, A. Hartnett, and D. Ramanan, "What-if motion prediction for autonomous driving," arXiv preprint arXiv:2008.10587, 2020.
[38] Y. Ding, C. Florensa, P. Abbeel, and M. Phielipp, "Goal-conditioned imitation learning," NeurIPS, 2019.
[39] A. Pashevich, C. Schmid, and C. Sun, "Episodic transformer for vision-and-language navigation," in ICCV, 2021.
[40] O. Vinyals, I. Babuschkin, W. M. Czarnecki, M. Mathieu, A. Dudzik, J. Chung, D. H. Choi, R. Powell, T. Ewalds, P. Georgiev et al., "Grandmaster level in StarCraft II using multi-agent reinforcement learning," Nature, vol. 575, no. 7782, pp. 350–354, 2019.
[41] Y. Li, J. Song, and S. Ermon, "InfoGAIL: Interpretable imitation learning from visual demonstrations," Advances in Neural Information Processing Systems, vol. 30, 2017.
[42] K. Hausman, Y. Chebotar, S. Schaal, G. Sukhatme, and J. J. Lim, "Multi-modal imitation learning from unstructured demonstrations using generative adversarial nets," in NeurIPS, 2017.
[43] B. Eysenbach, A. Gupta, J. Ibarz, and S. Levine, "Diversity is all you need: Learning skills without a reward function," in ICLR, 2019.
[44] D. Pathak, D. Gandhi, and A. Gupta, "Self-supervised exploration via disagreement," in ICML, 2019.