Synthesis of Shared Control Protocols with Provable Safety and Performance Guarantees

Nils Jansen, Murat Cubuktepe, and Ufuk Topcu
(All authors are with the University of Texas at Austin, Austin, TX 78751, USA, njansen@utexas.edu)

Abstract— We formalize the synthesis of shared control protocols with correctness guarantees for temporal logic specifications. More specifically, we introduce a modeling formalism in which both a human and an autonomy protocol can issue commands to a robot towards performing a certain task. These commands are blended into a joint input to the robot. The autonomy protocol is synthesized using an abstraction of possible human commands that accounts for randomness in decisions caused by factors such as fatigue or incomprehensibility of the problem at hand. The synthesis is designed to ensure that the resulting robot behavior satisfies given safety and performance specifications, e.g., in temporal logic. Our solution is based on nonlinear programming, and we address the inherent scalability issue by presenting alternative methods. We assess the feasibility and the scalability of the approach by an experimental evaluation.

I. INTRODUCTION

We study the problem of shared control, where a robot shall accomplish a task according to a human operator’s goals and given specifications addressing safety or performance. Such scenarios are for instance found in remotely operated semi-autonomous wheelchairs [11]. In a nutshell, the human has a certain action in mind and issues a command. Simultaneously, an autonomy protocol provides—based on the available information—another command. These commands are blended—also referred to as arbitrated—and deployed to the robot.

Earlier work discusses shared control from different perspectives [7], [8], [20], [19], [13], [10]; however, formal correctness in the sense of ensuring safety or optimizing performance has not been considered. In particular, since the human is an integral factor in this scenario, correctness needs to be treated in an appropriate way, as a human might not be able to comprehend all factors of a system and—in the extreme case—can drive a system into inevitable failure.

Several issues need to be addressed. First, a human might not be sure about which command to take, depending on the scenario or on factors like fatigue or incomprehensibility of the problem. We account for uncertainties in human decisions by introducing randomness to choices. Moreover, a means of actually interpreting a command is needed in the form of a user interface, e.g., a brain-computer interface; the usually imperfect interpretation adds to the randomness. We call a formal interpretation of the human’s commands the human strategy (this concept will be explained later).

As many formal system models are inherently stochastic, our natural formal model for robot actions inside an environment is a Markov decision process (MDP), where deterministic action choices induce probability distributions over system states. Randomness in the choice of actions, as in the human strategy, is directly carried over to these probabilities when resolving nondeterminism. For MDPs, quantitative properties like “the probability to reach a bad state is lower than 0.01” or “the cost of reaching a goal is below a given threshold” can be formally verified. If a set of such specifications is satisfied for the human strategy and the MDP, the task can be carried out safely and with good performance.
If the human strategy induces certain critical actions with a high probability, one or more specifications might be refuted. In this case, the autonomy should provide an alternative strategy that—when blended with the human strategy—satisfies the specifications without discarding too much of the human’s choices. As in [8], the blending puts weight on either the human’s or the autonomy protocol’s choices depending on factors such as the confidence of the human or the level of information the autonomy protocol has at its disposal.

The question is now how such a human strategy can be obtained. It seems unrealistic that a human can comprehend an MDP modeling a realistic scenario in the first place, primarily due to the possibly very large size of the state space. Moreover, a human might not be good at making sense of probabilities or of the cost of visiting certain states at all. We employ learning techniques to collect data about typical human behavior. This can, for instance, be performed within a simulation environment. In our case study, we model a typical shared control scenario based on a wheelchair [11] where a human user and an autonomy protocol share the control responsibility. Having a human user solve a task, we compute strategies from the obtained data using inverse reinforcement learning [16], [1]. Thereby, we can give guarantees on how well the obtained strategy approximates the actual intent of the user.

The design of the autonomy protocol is the main concern of this paper. We define the underlying problem as a nonlinear optimization problem and propose a technique to address the consequent scalability issues by reducing the problem to a linear optimization problem. After an autonomy protocol is synthesized, guarantees on safety and performance can be given, assuming that the user behaves according to the human strategy obtained beforehand. The main contribution is a formal framework for the problem of shared autonomy together with thorough discussions on formal verification, experiments, and current pitfalls. A summary of the approaches and an outline are given in Section II.

Fig. 1. Shared autonomy architecture: commands of the human and of the autonomy protocol are blended, via the blending function b, into a blended command deployed to the robot execution; inputs are the formal model M_r, the specifications ϕ1,…,ϕn, and the human strategy.

Shared control has attracted considerable attention recently. We briefly put some recent approaches into context with our results. First, Dragan and Srinivasa discussed strategy blending for shared control in [8], [7]. There, the focus was on the prediction of human goals. Combining these approaches, e.g., by inferring formal safety or performance specifications from the prediction of human goals, is an interesting direction for future work. Iturrate et al. presented shared control using feedback based on electroencephalography (a method to record electrical activity of the brain) [13], where a robot is partly controlled via error signals from a brain-computer interface. In [19], Trautman proposes to treat shared control broadly as a random process where the different components are modeled by their joint probability distributions. As in our approach, randomness naturally prevents strange effects of blending: consider actions “up” and “down” being blended with equally distributed weight without any means of actually evaluating these weights.
Finally, in [10] a synthesis method switches authority between a human operator and the autonomy such that satisfaction of linear temporal logic constraints can be ensured.

II. SHARED CONTROL

Consider first Fig. 1, which recalls the general framework for shared autonomy with blending of commands; additionally, we have a set of specifications, a formal model for robot behavior, and a blending function. In detail, a robot is to take care of a certain task. For instance, it shall move to a certain landmark. This task is subject to certain performance and safety considerations, e.g., it is not safe to take the shortest route because there are too many obstacles. These considerations are expressed by a set of specifications ϕ1,…,ϕn. The possible behaviors of the robot inside an environment are given by a Markov decision process (MDP) M_r. MDPs give rise to choices of certain actions to perform and to randomness in the environment: a chosen path might induce a high probability of achieving the goal while, with a low probability, the robot might slip and therefore fail to complete the task.

Now, in particular, a human user issues a set of commands for the robot to perform. We assume that the commands issued by the human are consistent with an underlying randomized strategy σ_h for the MDP M_r. Put differently, at design time we compute an abstract strategy σ_h of which the set of human commands is one realization. This way of modeling allows us to account for a variety of imperfections. Although it is not directly issued by a human, we call this strategy the human strategy. Due to possible human incomprehension or lack of detailed information, this strategy might not satisfy the requirements.

Now, an autonomy protocol is to be designed such that it provides an alternative strategy σ_a, the autonomous strategy. The two strategies are then blended—according to the given blending function b—into a new strategy σ_ha which satisfies the specifications. The blending function reflects preference over either the decisions of the human or the autonomy protocol. We also ensure that the blended strategy deviates only minimally from the human strategy. At runtime we can then blend decisions of the human user with decisions based on the autonomous strategy. The resulting “blended” decisions are according to the blended strategy σ_ha, thereby ensuring satisfaction of the specifications. This procedure, while involving expensive computations at design time, is very efficient at runtime.

In summary, the problem we are addressing in this paper is—in addition to the proposed modeling of the scenario—to synthesize the autonomy protocol in such a way that the resulting blended strategy meets all of the specifications while deviating from the human strategy as little as possible. We introduce all formal foundations that we need in Section III. The shared control synthesis problem with all needed formalisms is presented in Section IV as a nonlinear optimization problem. Addressing scalability, we reduce the problem to a linear optimization problem in Section V. We indicate the feasibility and scalability of our techniques using data-based experiments in Section VI and draw a short conclusion in Section VII.

III. PRELIMINARIES

1) Models: A probability distribution over a finite or countably infinite set X is a function µ : X → [0,1] ⊆ R with ∑_{x∈X} µ(x) = µ(X) = 1. The set of all distributions on X is denoted by Distr(X).
Definition 1 (MDP): A Markov decision process (MDP) M = (S, s_I, A, P) is a tuple with a set of states S, a unique initial state s_I ∈ S, a finite set A of actions, and a (partial) probabilistic transition function P : S × A → Distr(S).

MDPs operate by means of nondeterministic choices of actions at each state, whose successors are then determined probabilistically with respect to the associated probability distribution. The enabled actions at state s ∈ S are denoted by A(s) = {α ∈ A | ∃µ ∈ Distr(S). µ = P(s,α)}. To avoid deadlock states, we assume that |A(s)| ≥ 1 for all s ∈ S. A cost function ρ : S × A → R≥0 for an MDP M adds cost to a transition (s,α) ∈ S × A with α ∈ A(s). A path in an MDP M is a finite (or infinite) sequence π = s_0 α_0 s_1 α_1 … with P(s_i, α_i)(s_{i+1}) > 0 for all i ≥ 0. If |A(s)| = 1 for all s ∈ S, all actions can be disregarded and the MDP M reduces to a discrete-time Markov chain (MC). The unique probability measure Pr_D(Π) for a set Π of paths of an MC D can be defined by the usual cylinder set construction; the expected cost of a set Π of paths is denoted by EC_D(Π), see [2] for details. In order to define a probability measure and expected cost on MDPs, the nondeterministic choices of actions are resolved by so-called strategies. For practical reasons, we restrict ourselves to memoryless strategies; again, refer to [2] for details.

Definition 2 (Strategy): A randomized strategy for an MDP M is a function σ : S → Distr(A) such that σ(s)(α) > 0 implies α ∈ A(s). A strategy with σ(s)(α) = 1 for some α ∈ A and σ(s)(β) = 0 for all β ∈ A \ {α} is called deterministic. The set of all strategies over M is denoted by Sched^M.

Resolving all nondeterminism for an MDP M with a strategy σ ∈ Sched^M yields an induced Markov chain M^σ. Intuitively, the random choices of actions from σ are transferred to the transition probabilities in M^σ.

Definition 3 (Induced MC): Let MDP M = (S, s_I, A, P) and strategy σ ∈ Sched^M. The MC induced by M and σ is M^σ = (S, s_I, A, P^σ) where

P^σ(s,s′) = ∑_{α∈A(s)} σ(s)(α) · P(s,α)(s′)   for all s, s′ ∈ S.

2) Specifications: A quantitative reachability property P≤λ(♦T) with upper probability threshold λ ∈ [0,1] ⊆ Q and target set T ⊆ S constrains the probability to reach T from s_I in M to be at most λ. Expected cost properties E≤κ(♦G) impose an upper bound κ ∈ Q on the expected cost to reach goal states G ⊆ S. Intuitively, bad states T shall only be reached with probability at most λ (safety specification) while the expected cost for reaching goal states G has to be below κ (performance specification). Probability and expected cost to reach T from s_I are denoted by Pr(♦T) and EC(♦T), respectively. Hence, Pr_D(♦T) ≤ λ and EC_D(♦G) ≤ κ express that the properties P≤λ(♦T) and E≤κ(♦G) are satisfied by MC D. These concepts are analogous for lower bounds on the probability. We also use until properties of the form P≥λ(¬T U G), expressing that the probability of reaching G while not reaching T beforehand is at least λ.

An MDP M satisfies both a safety specification φ and a performance specification ψ iff for all strategies σ ∈ Sched^M it holds that the induced MC M^σ satisfies φ and ψ, i.e., M^σ |= φ and M^σ |= ψ. If several performance or safety specifications ϕ1,…,ϕn are given for MDP M, the simultaneous satisfaction for all strategies, denoted by M |= ϕ1,…,ϕn, can be formally verified for an MDP using multi-objective model checking [9]. Here, we are interested in the synthesis problem, where the aim is to find one particular strategy σ for which the specifications are satisfied.
If for ϕ1,…,ϕn and strategy σ it holds that M^σ |= ϕ1,…,ϕn, then σ is said to admit the specifications, also denoted by σ |= ϕ1,…,ϕn.

Example 1: Consider Fig. 2(a), depicting MDP M with initial state s_0, where states s_0 and s_1 have choices between actions a or b and c or d, respectively. For instance, action a induces a probabilistic choice between s_1 and s_3 with probabilities 0.6 and 0.4. The self-loops at s_2, s_3, and s_4 indicate looping back with probability one for each action.

Fig. 2. MDP M with target state s_2 (a) and the MC induced by strategy σ_1 (b).

Fig. 3. A wheelchair in a shared control setting: (a) human perspective, (b) autonomy perspective.

Assume now that a safety specification is given by φ = P≤0.21(♦s_2). The specification is violated for M, as the deterministic strategy σ_1 ∈ Sched^M with σ_1(s_0)(a) = 1 and σ_1(s_1)(c) = 1 induces a probability of reaching s_2 of 0.36, see the induced MC in Fig. 2(b). For the randomized strategy σ_unif ∈ Sched^M with σ_unif(s_0)(a) = σ_unif(s_0)(b) = 0.5 and σ_unif(s_1)(c) = σ_unif(s_1)(d) = 0.5, which chooses between all actions uniformly, the specification is also violated: the probability of reaching s_2 is 0.25, hence σ_unif ̸|= φ. However, for the deterministic strategy σ_safe ∈ Sched^M with σ_safe(s_0)(b) = 1 and σ_safe(s_1)(d) = 1 the probability is 0.16, thus σ_safe |= φ. Note that σ_safe minimizes the probability of reaching s_2 while σ_1 maximizes this probability.

IV. SYNTHESIZING SHARED CONTROL PROTOCOLS

In this section we describe our formal approach to synthesizing a shared control protocol in the presence of randomization. We start by formalizing the concepts of blending and strategy perturbation. Afterwards we formulate the general problem and show that the solution to the synthesis problem is correct.

Example 2: Consider Fig. 3, where a room to navigate in is abstracted into a grid. We will use this as our ongoing example. A wheelchair as in [11] is to be steered from the lower left corner of the grid to the exit in the upper right corner of the grid. There is also an autonomous robotic vacuum cleaner moving around the room; the goal is for the wheelchair to reach the exit without crashing into the vacuum cleaner. We now assume that the vacuum cleaner moves according to probabilities that are fixed according to evidence gathered beforehand; these probabilities are unknown or incomprehensible to the human user. To improve the safety of the wheelchair, it is equipped with an autonomy protocol that is to improve decisions of the human or even overwrite them in case of safety hazards. For the design of the autonomy protocol, the evidence data about the cleaner is available. An obvious strategy for moving the wheelchair, not taking into account the vacuum cleaner, is depicted by the red solid line in Fig. 3(a). As indicated in Fig. 3(b), the strategy proposed by the human is unsafe because there is a high probability of colliding with the obstacle. The autonomy protocol computes a safe strategy, indicated by the solid line in Fig. 3(b). As this strategy deviates highly from the human strategy, the dashed line indicates a still safe enough alternative which is a compromise or—in our terminology—a blending between the two strategies.
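Before formalizing blending and perturbation, we briefly illustrate how the reachability probabilities under a given randomized strategy, as used in Example 1, can be computed. The following Python sketch is illustrative only and not part of our toolchain; the helper names and the exact targets of the 0.4/0.6 failure branches are assumptions consistent with Fig. 2 (any absorbing state other than s_2 yields the same values).

```python
# Illustrative sketch: induced MC (Definition 3) and reachability probability
# for the MDP of Example 1 / Fig. 2.  Failure branches go to absorbing states;
# their precise targets do not affect the probability of reaching s2.
P = {
    "s0": {"a": [("s1", 0.6), ("s3", 0.4)], "b": [("s1", 0.4), ("s4", 0.6)]},
    "s1": {"c": [("s2", 0.6), ("s3", 0.4)], "d": [("s2", 0.4), ("s4", 0.6)]},
    "s2": {"loop": [("s2", 1.0)]},
    "s3": {"loop": [("s3", 1.0)]},
    "s4": {"loop": [("s4", 1.0)]},
}

def induced_mc(P, sigma):
    """P_sigma(s, s') = sum over actions of sigma(s)(a) * P(s, a)(s')."""
    mc = {s: {} for s in P}
    for s, acts in P.items():
        for a, succs in acts.items():
            w = sigma.get(s, {}).get(a, 1.0 if len(acts) == 1 else 0.0)
            for t, pr in succs:
                mc[s][t] = mc[s].get(t, 0.0) + w * pr
    return mc

def reach_prob(mc, target, n_iter=100):
    """Probability of eventually reaching `target`, via value iteration."""
    p = {s: 1.0 if s == target else 0.0 for s in mc}
    for _ in range(n_iter):
        p = {s: 1.0 if s == target else sum(pr * p[t] for t, pr in succ.items())
             for s, succ in mc.items()}
    return p

sigma_unif = {"s0": {"a": 0.5, "b": 0.5}, "s1": {"c": 0.5, "d": 0.5}}
sigma_safe = {"s0": {"b": 1.0}, "s1": {"d": 1.0}}
for name, sig in [("uniform", sigma_unif), ("safe", sigma_safe)]:
    print(name, round(reach_prob(induced_mc(P, sig), "s2")["s0"], 3))
```

Running the sketch reproduces the probabilities 0.25 and 0.16 discussed in Example 1, so only σ_safe admits φ.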
We assume in the following that the possible behaviors of the robot inside the environment are modeled by an MDP M_r = (S, s_I, A, P). The human strategy is given as a randomized strategy σ_h for M_r; we explain how to obtain this strategy in Section VI. The specifications ϕ1,…,ϕn are either safety properties P≤λ(♦T) or performance properties E≤κ(♦T).

A. Strategy blending

Given two strategies, they are to be blended into a new strategy favoring decisions of one or the other in each state of the MDP. In our setting, the human strategy σ_h ∈ Sched^{M_r} is blended with the autonomous strategy σ_a ∈ Sched^{M_r} by means of an arbitrary blending function. In [8] it is argued that blending intuitively reflects the confidence in how well the autonomy protocol is able to assist with respect to the human user’s goals. In addition, factors probably unknown or incomprehensible to the human, such as safety or performance optimization, should also be reflected by such a function. Put differently, possible actions of the user should be assigned low confidence by the blending function if the user cannot be trusted to make the right decisions. For instance, recall Example 2. At cells of the grid where the wheelchair might collide with the vacuum cleaner with a very high probability, it makes sense to assign a high confidence to the autonomy protocol’s decisions, because not all safety-relevant information is available to the human.

In order to enable formal reasoning together with such a function, we instantiate the blending with a state-dependent function which, at each state of the MDP, weighs the confidence in both the human’s and the autonomy’s decisions. This formalism is called linear blending and is used in what follows; a more fine-grained instantiation might incorporate not only the current state of the MDP but also the strategies of both human and autonomy, or the history of the current run of the system. In [19], additional notions of blending are discussed.

Definition 4 (Linear blending): Given an MDP M_r = (S, s_I, A, P), two strategies σ_h, σ_a ∈ Sched^{M_r}, and a blending function b : S → [0,1], the blended strategy σ_ha ∈ Sched^{M_r} is given for all states s ∈ S and actions α ∈ A by

σ_ha(s)(α) = b(s)·σ_h(s)(α) + (1−b(s))·σ_a(s)(α).

Note that the blended strategy σ_ha is a well-defined randomized strategy. For each s ∈ S, the value b(s) represents the confidence in the human’s decisions at this state, i.e., the “weight” of σ_h at s. Coming back to Example 2, the critical cells of the grid correspond to certain states of the MDP M_r; at these states a very low confidence in the human’s decisions should be assigned. For instance, at such a state s ∈ S we might have b(s) = 0.1, meaning that all randomized choices of the human strategy are scaled down by this factor, while choices of the autonomous strategy are scaled down only by the factor 0.9. The addition of these scaled choices then gives a new strategy highly favoring the autonomy’s decisions.

B. Perturbation of strategies

As mentioned before, we want to ensure that the blended strategy deviates minimally from the human strategy. To measure such a deviation, we introduce the concept of perturbation, which was—on a complexity-theoretic level—investigated, for instance, in [5]. Here, we introduce an additive perturbation for a (randomized) strategy, incrementing or decrementing probabilities of action choices such that a well-defined distribution over actions is maintained.

Definition 5 (Strategy perturbation): Given an MDP M and a strategy σ ∈ Sched^M, an (additive) perturbation δ is a function δ : S × A → [−1,1] with

∑_{α∈A} δ(s,α) = 0   for all s ∈ S.

The value δ(s,α) is called the perturbation value at state s for action α.
Overloading the notation, the perturbed strategy δ(σ) is given by

δ(σ)(s,α) = σ(s)(α) + δ(s,α)   for all s ∈ S and α ∈ A.

C. Design of the autonomy protocol

For the formal problem, we are given a blending function b, specifications ϕ1,…,ϕn, an MDP M_r, and a human strategy σ_h ∈ Sched^{M_r}. We assume that σ_h does not satisfy all of the specifications, i.e., σ_h ̸|= ϕ1,…,ϕn. The autonomy protocol provides the autonomous strategy σ_a ∈ Sched^{M_r}. According to b, the strategies σ_h and σ_a are blended into the strategy σ_ha, see Definition 4, i.e., σ_ha(s)(α) = b(s)·σ_h(s)(α) + (1−b(s))·σ_a(s)(α). The shared control synthesis problem is to design the autonomy protocol such that the blended strategy σ_ha satisfies σ_ha |= ϕ1,…,ϕn while deviating minimally from σ_h. The deviation from σ_h is captured by finding a perturbation δ as in Definition 5 where, e.g., the infinity norm of all perturbation values is minimal.

Our problem involves the explicit computation of a randomized strategy and the induced probabilities, which is inherently nonlinear because the corresponding variables need to be multiplied. Therefore, the canonical formulation is given by a nonlinear optimization program (NLP). We first assume that the only specification is a quantitative reachability property ϕ = P≤λ(♦T); afterwards we describe how more properties can be included. The program has to encompass defining the autonomous strategy σ_a, the perturbation δ of the human strategy, the blended strategy σ_ha, and the probability of reaching the set of target states T ⊆ S. We introduce the following set Var of variables:

• σ_a^{s,α}, σ_ha^{s,α} ∈ [0,1] for each s ∈ S and α ∈ A define the autonomous strategy σ_a and the blended strategy σ_ha.
• δ^{s,α} ∈ [−1,1] for each s ∈ S and α ∈ A are the perturbation variables relating σ_h and σ_ha.
• p_s ∈ [0,1] for each s ∈ S is assigned the probability of reaching T ⊆ S from state s under strategy σ_ha.

Using these variables, the NLP reads as follows:

minimize  max{ |δ^{s,α}| | s ∈ S, α ∈ A }   (1)
subject to
p_{s_I} ≤ λ   (2)
∀s ∈ T. p_s = 1   (3)
∀s ∈ S. ∑_{α∈A} σ_a^{s,α} = ∑_{α∈A} σ_ha^{s,α} = 1   (4)
∀s ∈ S. ∀α ∈ A. σ_ha^{s,α} = σ_h(s)(α) + δ^{s,α}   (5)
∀s ∈ S. ∑_{α∈A} δ^{s,α} = 0   (6)
∀s ∈ S. ∀α ∈ A. σ_ha^{s,α} = b(s)·σ_h(s)(α) + (1−b(s))·σ_a^{s,α}   (7)
∀s ∈ S. p_s = ∑_{α∈A} σ_ha^{s,α} · ∑_{s′∈S} P(s,α)(s′) · p_{s′}   (8)

The NLP works as follows. First, the infinity norm of the perturbation, i.e., the maximum absolute value of all perturbation variables, is minimized (1). The probability assigned to the initial state s_I ∈ S has to be smaller than or equal to λ in order to satisfy ϕ = P≤λ(♦T) (2). For all target states in T ⊆ S, the corresponding probability variables are assigned one (3). To obtain well-defined strategies σ_a and σ_ha, we ensure that the assigned values of the corresponding strategy variables sum up to one at each state (4). The perturbation δ of the human strategy σ_h resulting in the strategy σ_ha as in Definition 5 is computed using the perturbation variables (5); in order for the perturbation to be well defined, the variables have to sum up to zero at each state (6). The blending of σ_h and σ_a into σ_ha with respect to b as in Definition 4 is encoded in (7). Finally, the probability of reaching T ⊆ S from each s ∈ S is computed in (8), defining a nonlinear equation system where the action probabilities, given by the blended strategy σ_ha, are multiplied with the probability variables of all possible successors. Note that this nonlinear program is in fact bilinear, due to the multiplication of the strategy variables σ_ha^{s,α} with the probability variables p_{s′} in (8).
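To illustrate constraints (1)–(8), the following sketch encodes the bilinear program for the MDP of Example 1 with σ_h = σ_unif, a constant blending function b(s) = 0.5, and ϕ = P≤0.21(♦s_2). It is a minimal, illustrative encoding under these assumptions and not our implementation: the variables σ_ha and δ are eliminated by substituting (5) and (7), and scipy’s SLSQP solver stands in for IPOPT.

```python
# Sketch of the bilinear program (1)-(8) for the MDP of Example 1.
# Assumptions: sigma_h = sigma_unif, b(s) = 0.5 for all s, lambda = 0.21.
import numpy as np
from scipy.optimize import minimize

b, lam = 0.5, 0.21
sigma_h = np.array([0.5, 0.5, 0.5, 0.5])   # sigma_h at (s0,a), (s0,b), (s1,c), (s1,d)

# decision vector x = [sigma_a(s0,a), sigma_a(s0,b), sigma_a(s1,c), sigma_a(s1,d),
#                      p_s0, p_s1, t]  with t = max |delta|
def blended(x):                            # constraint (7)
    return b * sigma_h + (1.0 - b) * x[:4]

def reach_defect(x):                       # constraint (8); p_s2 = 1, p_s3 = p_s4 = 0
    sha, p0, p1 = blended(x), x[4], x[5]
    return [p0 - (sha[0] * 0.6 + sha[1] * 0.4) * p1,
            p1 - (sha[2] * 0.6 + sha[3] * 0.4)]

constraints = [
    {"type": "eq", "fun": lambda x: [x[0] + x[1] - 1.0, x[2] + x[3] - 1.0]},  # (4)
    {"type": "eq", "fun": reach_defect},                                       # (8)
    {"type": "ineq", "fun": lambda x: lam - x[4]},                             # (2)
    {"type": "ineq", "fun": lambda x: x[6] - np.abs(blended(x) - sigma_h)},    # (1),(5)
]
x0 = np.array([0.5, 0.5, 0.5, 0.5, 0.25, 0.5, 0.0])
res = minimize(lambda x: x[6], x0, method="SLSQP",
               bounds=[(0.0, 1.0)] * 7, constraints=constraints)
print("sigma_a:", res.x[:4].round(2), "max |delta|:", round(res.x[6], 2))
# A local optimum with a maximal deviation of about 0.21 is expected here.
# Constraint (6) holds automatically, since both sigma_h and the blended
# strategy sum to one at every state.
```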
The number of constraints is governed by the number of state–action pairs, i.e., the size of the problem is in O(|S|·|A|).

An assignment of real-valued variables is a function ν : Var → R; it is satisfying for a set of (in)equations if each one evaluates to true. A satisfying assignment ν∗ is minimizing with respect to an objective o if there is no other satisfying assignment ν′ with ν′(o) < ν∗(o). Using these notions, we state the correctness of the NLP in (1)–(8).

Theorem 1 (Soundness and completeness): The NLP is sound in the sense that each minimizing assignment induces a solution to the shared control synthesis problem. It is complete in the sense that for each solution to the shared control synthesis problem there is a minimizing assignment of the NLP.

TABLE I: EXAMPLE RESULTS
        b_i   σ_a(a)  σ_a(b)  σ_a(c)  σ_a(d)  σ_ha(a)  σ_ha(b)  σ_ha(c)  σ_ha(d)  Pr_ha
i = 1   0.5   0.08    0.92    0.08    0.92    0.29     0.71     0.29     0.71     0.209
i = 2   0.1   0.27    0.73    0.27    0.73    0.29     0.71     0.29     0.71     0.209
i = 3   0     0.29    0.71    0.29    0.71    0.29     0.71     0.29     0.71     0.209

Soundness means that each satisfying assignment of the variables corresponds to strategies σ_a and σ_ha as well as a perturbation δ as defined above. Moreover, any optimal solution induces a perturbation minimally deviating from the human strategy σ_h. Completeness means that all possible solutions of the shared control synthesis problem can be encoded by this NLP; unsatisfiability means that no such solution exists and the problem is infeasible.

D. Additional specifications

We now explain how the NLP can be extended for further specifications. Assume that, in addition to ϕ = P≤λ(♦T), another reachability property ϕ′ = P≤λ′(♦T′) with T′ ≠ T is given. We add another set of probability variables p′_s for each state s ∈ S; (2) is copied for p′_{s_I} and λ′, (3) is defined for all states s ∈ T ∪ T′, and (8) is copied for all p′_s, thereby computing the probability of reaching T′ under σ_ha for all states.

To handle an expected cost property E≤κ(♦G) for G ⊆ S, we use variables r_s that are assigned the expected cost of reaching G from each s ∈ S. We add the following equations:

r_{s_I} ≤ κ   (9)
∀s ∈ G. r_s = 0   (10)
∀s ∈ S. r_s = ∑_{α∈A} ( σ_ha^{s,α} · ρ(s,α) + ∑_{s′∈S} P(s,α)(s′) · r_{s′} )   (11)

First, the expected cost of reaching G has to be smaller than or equal to κ at s_I (9). Goal states are assigned cost zero (10); otherwise, infinite cost is collected at absorbing states. Finally, the expected cost for all other states is computed by (11), where, according to the blended strategy σ_ha, the cost of each action is added to the expected cost of the successors. An important insight is that if all specifications are expected cost properties, the program is no longer nonlinear but a linear program (LP), as there is no multiplication of variables.

E. Generalized blending

If the problem is not feasible for the given blending function, the autonomy protocol can optionally try to compute a new function b : S → [0,1] for which the altered problem is feasible. We call this procedure generalized blending. The idea is that computing this function gives the designer of the protocol insight into where more confidence needs to be placed in the autonomy or, vice versa, where the human cannot be trusted to satisfy the given specifications. Computing this new function is achieved by nearly the same NLP as for a fixed blending function, while adding a variable b_s for each state s ∈ S, defining the new blending function by b(s) = b_s. We substitute Equation (7) by

∀s ∈ S. ∀α ∈ A. σ_ha^{s,α} = b_s · σ_h(s)(α) + (1−b_s) · σ_a^{s,α}.   (12)
A satisfying assignment for the resulting nonlinear program induces a suitable blending function b : S → [0,1] in addition to the strategies. If this problem is also infeasible, there is no strategy that satisfies the given specifications for the MDP M_r.

Corollary 1: If there is no solution to the NLP given by Equations (1)–(12), there is no strategy σ ∈ Sched^{M_r} such that σ |= ϕ1,…,ϕn.

As there are no restrictions on the blending function, this corollary trivially holds: consider for instance b with b(s) = 0 for each s ∈ S. This function disregards the human strategy, which may then be perturbed into any other strategy, since σ_ha = σ_a.

Example 3: Reconsider the MDP M from Example 1 with the specification ϕ = P≤0.21(♦{s_2}) and the randomized strategy σ_unif, which chooses each action uniformly at random. As we saw, σ_unif ̸|= ϕ. We choose this strategy as the human strategy σ_h = σ_unif and M_r = M as the robot MDP. For a blending function b_h putting high confidence in the human, e.g., if b_h(s) ≥ 0.6 for all s ∈ S, the problem is infeasible. In Table I we display results for putting medium (b_1), low (b_2), or no confidence (b_3) in the human at s_0 and s_1. We list the assignments for the resulting strategies σ_a and σ_ha as well as the probability Pr_ha = Pr_{s_0}^{M_r^{σ_ha}}(♦s_2) of reaching s_2 under the blended strategy σ_ha. The results were obtained using the NLP solver IPOPT [4]. We observe that for decreasing confidence in the human’s decisions, the autonomous strategy assigns higher probabilities to actions a and c, which are the “bad” actions here. That means that—if there is a higher confidence in the autonomy—solutions farther away from the optimum are good enough. The maximal deviation from the human strategy is 0.21. Generalized blending, maximizing the confidence in the human’s decisions over all states s ∈ S, yields b_h(s) = 0.582, i.e., we compute the highest possible confidence in the human’s decisions for which the problem is still feasible under the given human strategy.

V. COMPUTATIONALLY TRACTABLE APPROACH

The nonlinear programming approach presented in the previous section gives a rigorous method to solve the shared control synthesis problem and serves as a mathematically concise definition of the problem. However, NLPs are known to have severe restrictions in terms of scalability and suffer from numerical instabilities. The crucial point for an efficient solution is to circumvent the expensive computation of optimal randomized strategies and to reduce the number of variables. We propose a heuristic solution which enables the use of linear programming (LP) while ensuring soundness.

We utilize a technique referred to as model repair. Intuitively, an erroneous model is changed such that it satisfies certain specifications. In particular, given a Markov chain D and a specification ϕ that is violated by D, a repair of D is an automated method that transforms it into a new MC D′ such that ϕ is satisfied for D′. Transforming refers to changing probabilities or costs while respecting certain side constraints, such as keeping the original graph structure. In [3], the first approach to automatically repair an MC model was presented as an NLP. Simulation-based algorithms were investigated in [6]. A heuristic but very scalable technique called local repair was proposed in [17]. This approach greedily changes the probabilities or costs of the original MC until a property is satisfied.
An upper bound δ_r on the changes of probabilities or costs can be specified; correctness and completeness can be given in the sense that if a repair with respect to δ_r exists, it will be obtained.

Take now the MC D_r^{σ_h} which is induced by the robot MDP M_r and the human strategy σ_h. We perform model repair such that the repaired MC D′ = (S, s_I, P′) satisfies the specifications ϕ1,…,ϕn. The question is now how the strategy σ′ ∈ Sched^{M_r} can be extracted from the repaired MC D′. More precisely, we need a σ′ inducing exactly D′, i.e., D_r^{σ′} = D′, when applied to MDP M_r. First, we need to make sure that the repaired MC is consistent with the original MDP, such that a strategy σ′ with D_r^{σ′} = D′ actually exists. Therefore, we define the maximal and minimal possible transition probabilities P_max and P_min that can occur in any induced MC of MDP M_r:

P_max(s,s′) = max{ P(s,α)(s′) | α ∈ A }   (13)

for all s ∈ S; P_min is defined analogously. Now, the repair is performed such that in the resulting MC D′ = (S, s_I, P′) for all s, s′ ∈ S it holds that

P_min(s,s′) ≤ P′(s,s′) ≤ P_max(s,s′).   (14)

While obtaining D′, model checking needs to be performed intermediately to check whether the specifications are satisfied; once they are, the algorithm terminates. In fact, for each state s ∈ S, the probability of satisfaction is computed. We assign variables mc_s for all s ∈ S with exactly this probability:

mc_s = Pr(s |= ϕ1,…,ϕn).   (15)

Now recall the NLP from the previous section, in particular Equation (8), which is the only nonlinear equation of the program. We replace each variable p_s by the concrete model checking result mc_s for each s ∈ S:

mc_s = ∑_{α∈A} σ_ha^{s,α} · ∑_{s′∈S} P(s,α)(s′) · mc_{s′}.   (16)

As (16) is affine in the variables σ_ha^{s,α}, the program resulting from replacing (8) by (16) is a linear program (LP). Moreover, (2) and (3) can be removed, reducing the number of constraints and variables. The LP gives a feasible solution to the shared control synthesis problem.

Lemma 1 (Correctness): The LP is sound in the sense that each minimizing assignment induces a solution to the shared control problem.

The correctness is given by construction, as the specifications are satisfied for the blended strategy, which is derived from the repaired MC. However, the minimal deviation from the human strategy as in Equation (1) depends on the previous computation of probabilities for the blended strategy. Therefore, we actually compute an upper bound on the optimal solution. Let δ∗ be the minimal deviation possible for any given problem and δ be the minimal deviation obtained by the LP resulting from replacing (8) by (16). Let ∥δ∥_∞ and ∥δ∗∥_∞ denote the infinity norms of both perturbations.

Corollary 2: For the perturbations δ and δ∗ of σ_h it holds that ∥δ∗∥_∞ ≤ ∥δ∥_∞.

As mentioned before, the local repair method can employ a bound δ_r on the maximal change of probabilities or costs in the model. If a repair exists for a given δ_r, the resulting deviation δ is bounded by this δ_r.
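To illustrate the reduction, the following sketch encodes the resulting LP for the MDP of Example 1, with constraint (8) replaced by (16). The mc values are assumed to come from model checking a repaired MC (we use mc_{s_1} = 0.458 and mc_{s_0} = 0.458², the values induced by the blended strategy of Example 3), the perturbation variables are expressed through |σ_ha − σ_h| ≤ t via (5), and scipy’s linprog stands in for Gurobi. The variable layout and helper names are ours; this is an illustrative sketch, not our implementation.

```python
# Sketch of the LP from Section V on the MDP of Example 1: the model-checking
# values mc_s are constants, so constraint (16) is linear in the strategy
# variables.  Assumed mc values correspond to the blended strategy of Example 3.
import numpy as np
from scipy.optimize import linprog

b, sig_h = 0.5, np.array([0.5, 0.5, 0.5, 0.5])    # blending weight, sigma_h
mc1 = 0.458                                        # mc_{s1} (assumed)
mc0 = mc1 * mc1                                    # mc_{s0} (assumed)

# variable order: sigma_a (4 entries), sigma_ha (4 entries), t
n = 9
c = np.zeros(n); c[8] = 1.0                        # objective: minimize t   (1)

A_eq, b_eq = [], []
def eq(row, rhs):
    A_eq.append(row); b_eq.append(rhs)

eq([1, 1, 0, 0, 0, 0, 0, 0, 0], 1.0)               # (4): sigma_a sums to 1 at s0
eq([0, 0, 1, 1, 0, 0, 0, 0, 0], 1.0)               # (4): sigma_a sums to 1 at s1
eq([0, 0, 0, 0, 1, 1, 0, 0, 0], 1.0)               # (4): sigma_ha sums to 1 at s0
eq([0, 0, 0, 0, 0, 0, 1, 1, 0], 1.0)               # (4): sigma_ha sums to 1 at s1
for i in range(4):                                 # (7): sigma_ha = b*sigma_h + (1-b)*sigma_a
    row = [0.0] * n
    row[i], row[4 + i] = -(1.0 - b), 1.0
    eq(row, b * sig_h[i])
eq([0, 0, 0, 0, 0.6 * mc1, 0.4 * mc1, 0, 0, 0], mc0)   # (16) at s0
eq([0, 0, 0, 0, 0, 0, 0.6, 0.4, 0], mc1)               # (16) at s1

A_ub, b_ub = [], []
for i in range(4):                                 # |sigma_ha - sigma_h| <= t  ((1),(5))
    row = [0.0] * n; row[4 + i], row[8] = 1.0, -1.0
    A_ub.append(row); b_ub.append(sig_h[i])
    row = [0.0] * n; row[4 + i], row[8] = -1.0, -1.0
    A_ub.append(row); b_ub.append(-sig_h[i])

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
              bounds=[(0, 1)] * 8 + [(0, None)])
print(res.x[:4].round(2), round(res.x[8], 2))      # sigma_a and the deviation (about 0.21)
```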
VI. CASE STUDY AND EXPERIMENTS

Defining a formal synthesis approach to the shared control scenario requires a precomputed estimation of a human user’s intentions. As explained in the previous section, we account for inherent uncertainties by using a randomized strategy over the possible actions to take. We discuss how such strategies may be obtained and report on benchmark results.

A. Experimental setting

Our setting is the wheelchair scenario from Example 2 inside an interactive Python environment. The size of the grid is variable, and an arbitrary number of stationary and randomly moving obstacles (the vacuum cleaner) can be defined. An agent (the wheelchair) is moved according to predefined (randomized) strategies or interactively by a human user. From this scenario, an MDP with states corresponding to the positions of the agent and the obstacles is generated. Actions induce position changes of the agent. The safety specification ensures that the agent reaches a target cell without crashing into an obstacle with a certain high probability λ ∈ [0,1], formally P≥λ(¬crash U target). We use the probabilistic model checker PRISM [15] for verification, in the form of either a worst-case analysis for each possible strategy or concretely for a specific strategy. The whole toolchain integrates the simulation environment with the approaches described in the previous sections. We use the NLP solver IPOPT [4] and the LP solver Gurobi [12]. To perform model repair for strategies, see Section V, we implemented the greedy method from [17] in our framework, augmented by side constraints ensuring well-defined strategies.

B. Data collection

We asked five participants to perform tests in the environment with the goal of moving the agent to a target cell while never being in the same cell as the moving obstacle. From the data obtained from each participant, an individual randomized human strategy σ_h for this participant can be obtained via Maximum Entropy Inverse Reinforcement Learning (MEIRL) [22]. Inverse reinforcement learning has, for instance, also been used in [14] to collect data about human behavior in a shared control scenario (though without any formal guarantees) and in [18] to distinguish human intents with respect to different tasks.

Fig. 4. Experimental setting for the shared control simulation: the simulation environment provides sample data, which is processed via MEIRL into the human strategy; the shared control synthesis returns the autonomous strategy and the blending function.

In our setting, each sample is one particular command of the participant, and we have to assume that each command is actually issued with the intent to satisfy the specification of safely reaching a target cell. For the resulting strategy, the probability of a possible deviation from the actual intent can be bounded with respect to the number of samples using Hoeffding’s inequality, see [21] for details. Conversely, we can determine the number of samples needed to obtain a reasonable approximation of typical behavior. The concrete probabilities of a possible deviation are in O(exp(−nε)), where n is the number of samples and ε is the desired upper bound on the deviation between the true probability of satisfying the specification and the average obtained from the sampled data. Here, in order to ensure an upper bound of ε = 0.1 with probability 0.99, the required number of samples is 265 (the smallest n with 2·exp(−2nε²) ≤ 0.01 in the two-sided Hoeffding bound).

C. Experiments

The workflow of the experiments is depicted in Figure 4. First, we discuss sample data for one particular participant using an 8×8 grid with one moving obstacle, inducing an MDP of 2304 states. In the synthesis, we employ the model repair procedure as explained in Section V, because the approach based on the NLP is only feasible for very small examples. We design the blending function as follows: at states where the human strategy induces a high probability of crashing, we put low confidence in the human, and vice versa. Using this function, the autonomous strategy σ_a is created and passed (together with the function) back to the environment.
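One possible instantiation of such a blending function is sketched below; the concrete formula, the bounds b_min and b_max, and the helper name are assumptions for illustration and are not fixed by our approach. The per-state confidence in the human is derived directly from the probability, under σ_h, of satisfying the specification from each state.

```python
# Assumed instantiation: confidence in the human interpolates linearly between
# b_min and b_max with the probability of satisfying the specification under
# sigma_h at each state (lower satisfaction probability -> lower confidence).
def blending_function(mc_human, b_min=0.1, b_max=0.9):
    """mc_human: dict state -> Pr(s |= not crash U target) under sigma_h."""
    return {s: b_min + (b_max - b_min) * p for s, p in mc_human.items()}

# Example: a state where the human crashes with probability 0.8 gets
# confidence 0.1 + 0.2 * 0.8 = 0.26, a rather safe state gets 0.86.
print(blending_function({"s_risky": 0.2, "s_safe": 0.95}))
```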
Note that the blended strategy σ_ha is ensured to satisfy the specification, see Lemma 1. Now, we let the same participant as before do test runs, but this time we blend the human commands with the (randomized) commands of the autonomous strategy σ_a. The actual action of the agent is then determined stochastically.

We obtain the following results. Our safety specification is instantiated with λ = 0.7, ensuring that the target is safely reached with probability at least 0.7. The human strategy σ_h induces probability 0.546, violating the specification. With the aforementioned blending function we compute σ_a, which induces probability 0.906. Blending these two strategies into σ_ha yields a probability of 0.747. When testing the synthesized autonomy protocol for the individual participant, we observe that the participant’s choices are mostly corrected if intentionally bad decisions are made. Also, simulating the blended strategy leads to the expected result that the agent does not crash in roughly 70% of the cases.

Fig. 5. Graphical representation of the obtained human (a), blended (b), and autonomous (c) strategy in the grid.

To make the behavior of the strategies more accessible, consider Figure 5. For each of σ_h, σ_ha, and σ_a we indicate, for each cell of the grid, the worst-case probability of safely reaching the target. This probability depends on the current position of the obstacle, which is again probabilistic. The darker the color, the higher the probability; black indicates a probability of 1 to reach the target. We observe that the human’s decisions are rather risky even near the target, while for the blended strategy—once the agent is near the target—there is a very high probability of reaching it safely. This representation also shows that with our approach the blended strategy improves the human strategy while not changing it too much. Specifically, the maximal deviation from the human strategy is 0.27, which is the value of the infinity norm objective in Equation (1).

To finally assess the scalability of our approach, consider Table II. We generated MDPs for several grid sizes, numbers of obstacles, and human strategies. We list the number of reachable MDP states (states) and the number of transitions (trans.). We report the time the synthesis process took (synth.), which is essentially the time for solving the LP, and the total time including the model checking times using PRISM (total), both measured in seconds. To give an indication of the quality of the synthesis, we list the deviation from the human strategy (δ∞). A memory-out is indicated by “–MO–”. All experiments were conducted on a 2.3 GHz machine with 8 GB of RAM. Note that MDPs resulting from grid structures are very strongly connected, resulting in a large number of transitions. Thus, the encoding in the PRISM language [15] is very large, rendering it a very hard problem. We observe that while the procedure is very efficient for models having a few thousand states and hundreds of thousands of transitions, its scalability is ultimately limited due to memory issues. In the future, we will utilize efficient symbolic data structures internal to PRISM. Moreover, we observe that for larger benchmarks the computation time is governed by the solving time of Gurobi.

VII. CONCLUSION

We introduced a formal approach to synthesize autonomy protocols in a shared control setting with guarantees on quantitative safety and performance specifications. The practical usability of our approach was shown by means of data-based experiments.
TABLE II: SCALABILITY RESULTS
grid    obst.  states   trans.      synth. (s)  total (s)  δ∞
8×8     1      2,304    36,864      6.30        14.12      0.15
8×8     2      82,944   5,308,416   –MO–        –MO–       –MO–
10×10   1      3,600    57,600      12.29       23.80      0.24
12×12   1      14,400   230,400     157.94      250.78     0.33

Future work will concern experiments in robotic scenarios and further improvement of the scalability.

REFERENCES

[1] Pieter Abbeel and Andrew Y. Ng. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the Twenty-First International Conference on Machine Learning, page 1. ACM, 2004.
[2] Christel Baier and Joost-Pieter Katoen. Principles of Model Checking. The MIT Press, 2008.
[3] Ezio Bartocci, Radu Grosu, Panagiotis Katsaros, C. R. Ramakrishnan, and Scott A. Smolka. Model repair for probabilistic systems. In TACAS, volume 6605 of LNCS, pages 326–340. Springer, 2011.
[4] Lorenz T. Biegler and Victor M. Zavala. Large-scale nonlinear programming using IPOPT: An integrating framework for enterprise-wide dynamic optimization. Computers & Chemical Engineering, 33(3):575–582, 2009.
[5] Taolue Chen, Yuan Feng, David S. Rosenblum, and Guoxin Su. Perturbation analysis in verification of discrete-time Markov chains. In CONCUR, volume 8704 of LNCS, pages 218–233. Springer, 2014.
[6] Taolue Chen, Ernst Moritz Hahn, Tingting Han, Marta Kwiatkowska, Hongyang Qu, and Lijun Zhang. Model repair for Markov decision processes. In TASE, pages 85–92. IEEE CS, 2013.
[7] Anca D. Dragan and Siddhartha S. Srinivasa. Formalizing assistive teleoperation. In Robotics: Science and Systems, 2012.
[8] Anca D. Dragan and Siddhartha S. Srinivasa. A policy-blending formalism for shared control. I. J. Robotic Res., 32(7):790–805, 2013.
[9] Kousha Etessami, Marta Z. Kwiatkowska, Moshe Y. Vardi, and Mihalis Yannakakis. Multi-objective model checking of Markov decision processes. Logical Methods in Computer Science, 4(4), 2008.
[10] Jie Fu and Ufuk Topcu. Synthesis of shared autonomy policies with temporal logic specifications. IEEE Trans. Automation Science and Engineering, 13(1):7–17, 2016.
[11] F. Galán, M. Nuttin, E. Lew, P. W. Ferrez, G. Vanacker, J. Philips, and J. del R. Millán. A brain-actuated wheelchair: Asynchronous and non-invasive brain-computer interfaces for continuous control of robots. Clinical Neurophysiology, 119(9):2159–2169, 2008.
[12] Gurobi Optimization, Inc. Gurobi optimizer reference manual. http://www.gurobi.com, 2013.
[13] Iñaki Iturrate, Jason Omedes, and Luis Montesano. Shared control of a robot using EEG-based feedback signals. In Proceedings of the 2nd Workshop on Machine Learning for Interactive Systems: Bridging the Gap Between Perception, Action and Communication, MLIS ’13, pages 45–50, New York, NY, USA, 2013. ACM.
[14] Shervin Javdani, J. Andrew Bagnell, and Siddhartha Srinivasa. Shared autonomy via hindsight optimization. In Proceedings of Robotics: Science and Systems, 2015.
[15] Marta Kwiatkowska, Gethin Norman, and David Parker. PRISM 4.0: Verification of probabilistic real-time systems. In CAV, volume 6806 of LNCS, pages 585–591. Springer, 2011.
[16] Andrew Y. Ng, Stuart J. Russell, et al. Algorithms for inverse reinforcement learning. In ICML, pages 663–670, 2000.
[17] Shashank Pathak, Erika Ábrahám, Nils Jansen, Armando Tacchella, and Joost-Pieter Katoen. A greedy approach for the efficient repair of stochastic models. In NFM, volume 9058 of LNCS, pages 295–309. Springer, 2015.
[18] Constantin A. Rothkopf and Dana H. Ballard. Modular inverse reinforcement learning for visuomotor behavior.
Biological Cybernetics, 107(4):477–490, 2013.
[19] Pete Trautman. Assistive planning in complex, dynamic environments: a probabilistic approach. CoRR, abs/1506.06784, 2015.
[20] Pete Trautman. A unified approach to 3 basic challenges in shared autonomy. CoRR, abs/1508.01545, 2015.
[21] Brian D. Ziebart. Modeling Purposeful Adaptive Behavior with the Principle of Maximum Causal Entropy. PhD thesis, Carnegie Mellon University, 2010.
[22] Brian D. Ziebart, Andrew L. Maas, J. Andrew Bagnell, and Anind K. Dey. Maximum entropy inverse reinforcement learning. In AAAI, 2008.