1 A Learning Based Optimal Human Robot Collaboration with Linear Temporal Logic Constraints Bo Wu, Bin Hu and Hai Lin Abstract—This paper considers an optimal task allocation problem for human robot collaboration in human robot systems with persistent tasks. Such human robot systems consist of human operators and intelligent robots collaborating with each other to accomplish complex tasks that cannot be done by either part alone. The system objective is to maximize the probability of successfully executing persistent tasks that are formulated as linear temporal logic specifications and minimize the average cost between consecutive visits of a particular proposition. This paper proposes to model the human robot collaboration under a framework with the composition of multiple Markov Decision Process (MDP) with possibly unknown transition probabilities, which characterizes how human cognitive states, such as hu- man trust and fatigue, stochastically change with the robot performance. Under the unknown MDP models, an algorithm is developed to learn the model and obtain an optimal task allocation policy that minimizes the expected average cost for each task cycle and maximizes the probability of satisfying linear temporal logic constraints. Moreover, this paper shows that the difference between the optimal policy based on the learned model and that based on the underlying ground truth model can be bounded by arbitrarily small constant and large confidence level with sufficient samples. The case study of an assembly process demonstrates the effectiveness and benefits of our proposed learning based human robot collaboration. Note to Practitioners—This paper is motivated by the task allocation problem for a class of human robot collaboration systems. In such systems, the most efficient way to achieve a given task is for the human and the robot to smartly collaborate with each other. Our objective is to dynamically assign tasks to the human and the robot based on human cognitive and physiologic states as well as the robot’s performance, so that the given task can be accomplished with the optimal probability and the minimized average cost to finish one round of the operation. To this end, we propose a mathematical model to capture the dynamic evolution of human cognitive and physiologic states, such as trust and fatigue. Such model in practice could be unknown initially, so we develop an efficient learning based algorithm to learn the model with predefined accuracy. Then we show that the performance of the optimal policy obtained on the learned model is sufficiently close to the optimal policy on the underlying ground truth model. Our proposed methods are applicable to manufacturing assembly process, semi-autonomous driving and other human robot collaboration systems that require optimal allocation of the tasks between human and robot. The limitation of the proposed approach is the assumption of the perfect real time observation of the human states. The authors believe that such limitation can be addressed with the recent advances in the cognitive science research. Index Terms—Formal methods, human robot collaboration, temporal logic This work was supported by the National Science Foundation (NSF-CNS- 1446288 and NSF- ECCS-1253488). Bin Hu is with New Mexico State University, Las Cruces, NM, 88003, USA. binhu.complicated@gmail.com Bo Wu and Hai Lin are with University of Notre Dame, Notre Dame, IN, 46556, USA. bwu3@nd.edu, hlin1@nd.edu I. INTRODUCTION A. 
Background and Motivation Human-robot collaboration consists of intelligent robots and human operators collaborating with each other to accomplish complex tasks that are less efficient or cannot be completed by either human or robots alone. Human-robot collaboration has become an important and even necessary part to ensure safe and efficient operations for many safety-critical systems, to name a few, such as modern manufacturing processes [1], [2], medical and health-care systems [3], [4], teleoperation [5] and semi-autonomous vehicle systems [6]. Thus, a reliable and ef- ficient optimal control framework is important, even necessary especially for systems with safety-critical applications [7]. It is, however, fairly challenging to ensure desirable per- formance for human robot collaboration due to the complex interactions between human and robot. The complexity lies in (1) the inherent stochastic uncertainties caused by a variety of human characteristics and external environments (2) the difficulty of obtaining a priori knowledge about how human interacts with robot due to the unanticipated nature of human beings. To address these challenges, this paper proposes a learning based human-robot collaboration framework to ensure optimal system performance for a manufacturing task alloca- tion system with linear temporal logic constraints (LTL) [8], a rich and expressive specification language to specify desired properties that’s close to human natural language. B. Relevant Work Human-robot collaboration has been a rapidly growing re- search area that has found its roles in many applications [9]. It is thus, beyond the scope of this paper to conduct an exhaustive literature review on this popular topic. However, there are not many research activities that examine the optimal performance for human robot collaboration under LTL constraints. This section will focus on the discussions of the relationship and differences between the proposed work in this paper and the relevant research work. Among many factors affecting human robot interaction, human factors, such as trust and fatigue, have been long considered as overarching concerns in human robot systems that significantly affects system performance, especially as it is related to safety-critical applications [10]. Prior work in [11], [12] showed that the evolution of both human trust and fatigue are dynamical processes. The human trust toward the robot is highly dependent on the automation performance. These observations motivate research efforts on the development of quantitative models for human trust and fatigue to provide appropriate real time predictions [13], [14] and control of human trust levels [1]. In this paper, we explicitly consider arXiv:1706.00007v1 [cs.RO] 31 May 2017 2 the impact of the human trust and fatigue on the human robot interaction. We assume the complete observability of the trust and fatigue level in real time, which can be met by using the techniques mentioned in [15], [16]. To address the challenge of modeling the unknown un- certainties in human robot interaction, this paper proposes a composition framework of multiple Markov Decision Process (MDP) models with possibly unknown transition probabilities. Our proposed framework can be viewed as a stochastic gener- alization of the conventional deterministic models proposed in [1], [13]. 
Under the MDP models for human robot interactions, the second challenge is to ensure the optimal performance when conducting persistent tasks, such as “visiting regions A infinite often to pick up products” or “repeating the steps of assembling the car engine infinite often”, which are important and widely encountered situations in a variety of industrial systems. The desired performance metric for those persistent tasks is to maximize the probability of successfully executing those persistent tasks while optimizing some cost metric. Such performance specification is often formulated as an optimal control problem of MDP with temporal logic constraints. Wolff et. al [17] developed a method to automatically generate the optimal control policy subject to LTL specifications and optimizing the weighted average cost function on a non- probabilistic model. The controller synthesis framework to op- timize average cost per operation cycle, a natural performance metric for persistent tasks with temporal logic constraints in MDPs was considered in [18] and [19]. All the above results assume that the model is precisely known. However, in many practical applications, such assumption may not be true and the model has to be learned through interaction with the environment. Motivated by the need of considering unknown models in human robot collaboration , this paper studies a learning-based optimal task allocation problem where the optimal policies can be synthesized on the learned model to optimize the average cost per task cycle while maximizing the probability of satisfying the LTL constraints used to specify persistent tasks. The frameworks considered in [16], [20] are the most relevant to the work considered in this paper. To address the optimal control problem under unknown MDPs, Sadigh et. al [20] de- veloped a reinforcement learning based approach to synthesize control policy for MDP with unknown transition probabilities to satisfy LTL specification with optimal probability. Fu et. al [16] extended the probably approximate correct (PAC)-MDP learning framework that synthesize controllers of unknown MDPs with temporal logic constraints. Both work focus on developing learning methods to maximize the probability of satisfying the LTL specification without considering the cost per task cycle metric. It is, however, unclear whether the methods in [16], [20] can be used to achieve optimal average cost per task cycle. To our best knowledge, this paper presents the first results that address the optimization problem that achieves optimal average costs per task cycle with LTL constraints under the unknown MDP models. C. Contributions The objective of this paper is to (1) develop a human robot collaboration framework that is able to model the unknown stochastic uncertainties on human robot interactions, (2) as well as a learning based algorithm to ensure optimal perfor- mance for human robot collaboration with LTL constraints under unknown human models. The main contributions of this paper are summarized as below, • This paper proposes a human robot collaboration frame- work with composition of multiple MDPs with unknown transition probabilities. Compared to the existing frame- works [13], [14], [21], [22], such framework enables the characterization of stochastic uncertainties as well as their unknown features for human robot interactions as well as the use of formal design methodologies for performance guarantee. 
• Based on the unknown MDP models, this paper further develops a learning-based algorithm inspired by probably approximately correct method to achieve the optimality for both LTL satisfaction and average cost per task cycles. To the best of our knowledge, this is the first set of results addressing both issues of LTL constraints and average cost per task cycles under unknown MDP models. • This paper also shows that the optimal policies achieved by the proposed learning method under the unknown MDP model, can asymptotically approach the ones with the true model. The optimality gap of the proposed learning algorithm can be analytically bounded and to achieve such bound, the complexity to learn the model is polynomial in the size of the MDP, the size of automaton from the LTL specification and other quantities that either are task specific or measure the confidence level and accuracy of the learned model. This paper extends our preliminary results in [23] by considering synthesizing optimal policy under the model with unknown transition probabilities and prove its approximate optimality. The rest of this paper is structured as follows. Section II provides the necessary preliminaries; Section III models and formulates the problem. The task assignment problem is solved in Section IV and Section V for known and unknown model respectively. A simulation is presented in Section VI. Section VII concludes the paper. II. PRELIMINARIES In this section, we introduce the required background knowledge about transition system, Markov Decision Process and temporal logic. A. Transition System Transition system is a popular model to describe the non- probabilistic behavior of the system. For example, it can be used to describe the flow of the assembly task driven by worker’s actions. Definition 1. [8] A transition system is a tuple T S = (S, A, →, I, AP, L) where • S = {s0, s1, ...} is a finite set of states; 3 • A is a finite set of actions; • →⊆S × A × S is a transition relation. • I ⊆S is a set of initial states; • AP is a set of atomic propositions; • L : S →2AP is a labeling function that maps each s ∈S to one or several elements of AP For s ∈S and a ∈A, we denote Post(s, a) = {s′ ∈ S|(s, a, s′) ∈→}. Specifically, T S = (S, A, →, I, AP, L) is called an action-deterministic transition system if |I| ≤1 and |Post(s, a)| ≤1 for all s and a. B. Markov Decision Process Definition 2. [8] An MDP with cost is a tuple M = (S, ˆs, A, P, L, C) where • S = {s0, s1, ...} is a finite set of states; • ˆs ∈S is the initial state; • A is a finite set of actions; • P(s, a, s′) := Pr(s(i+1) = s′|s(i) = s, a(i) = a), ∀i ≥ 0. • L : S →2AP is the labeling function that maps each s ∈S to one or several elements of a set AP of atomic propositions. • c : S ×A →R+ is the cost function that maps each state and action pair to a non-negative cost. The maximum cost is bounded by Rmax. For each state s ∈S, we denote A(s) as the set of available actions. From the definition it is not hard to see that the action- deterministic transition system is a special case of MDP with P(s, a, s′) = 1 for any s ∈S and a ∈A that are defined. Vice versa, if we ignore the probabilities in each transition, the MDP will become a transition system which we denote as TSM. A path ω of an MDP is a non-empty sequence of the form ω = s0 a0 −→s1 a1 −→s2...si ai −→si+1..., where each transition is enabled by an action ai such that P(si, ai, si+1) > 0. We denote Pathfin s as the collection of finite length paths that start in a state s. 
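To make the objects in Definition 2 concrete, the sketch below stores a labeled MDP with costs as plain dictionaries and simulates a finite path under a fixed per-state action choice (a memoryless scheduler in the sense defined next). This is only an illustrative Python sketch of ours, not the authors' implementation; the models in the case study of Section VI are built with the PRISM model checker, and all names here are made up for illustration.

```python
import random
from dataclasses import dataclass

# A labeled MDP with cost (cf. Definition 2), kept as plain dictionaries.
@dataclass
class MDP:
    states: list    # S
    init: str       # the initial state
    actions: dict   # s -> list of available actions A(s)
    P: dict         # (s, a) -> {s': Pr(s' | s, a)}
    label: dict     # s -> set of atomic propositions L(s)
    cost: dict      # (s, a) -> non-negative cost c(s, a), bounded by Rmax

    def sample_path(self, policy, steps, rng=random):
        """Simulate a finite path under a memoryless policy (s -> a);
        return the visited states and the accumulated cost."""
        s, path, total = self.init, [self.init], 0.0
        for _ in range(steps):
            a = policy[s]
            total += self.cost[(s, a)]
            succ, probs = zip(*self.P[(s, a)].items())
            s = rng.choices(succ, weights=probs)[0]
            path.append(s)
        return path, total
```

Setting every defined transition probability to one recovers an action-deterministic transition system, matching the observation above that such a transition system is a special case of an MDP.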
The nondeterminism of an MDP is resolved with the help of a scheduler. Definition 3. A scheduler µ : Pathfin ˆs →A (also known as adversary or policy) of an MDP M is a function mapping every finite path ωfin ∈Pathfin ˆs onto an action a ∈ A(last(ωfin)) where last(ωfin) denotes the last state of ωfin. Intuitively, the scheduler specifies the next action to take for each finite path. The behavior of an MDP M under a given scheduler µ is purely probabilistic and thus reduces to a discrete time Markov chain (DTMC) with a set of recurrent classes. A recurrent class Src refers to a set of recurrent states that the probability to return to any state s ∈Src after leaving it is 1 and any pair s, s′ ∈Src is communicating, that is, s′ is reachable from s (there exists a state sequence s0s1s2...sn with nonzero probability such that s0 = s, sn = s′) and s is reachable from s′. A policy µ is called memoryless if µ(ωfin) = µ(last(ωfin), that is, the action to select only depends on the current state that the MDP is in. A policy is said to have memory otherwise. An MDP is said to be communicating [24] if there exits a memoryless and deterministic (stationary) policy such that the induced DTMC is communicating, that is , all the states pair s, s′ in the DTMC are communicating. C. Linear Temporal Logic Under the MDP model, LTL can be used to describe a wide range of properties of sequences of states such as safety (bad things never happen), liveness (good things eventually happen), persistence (good things happen infinitely often), response (if A then B) and so on. An LTL formula is built up from a set of atomic propositions (AP), true, false, the Boolean operators ¬ (negation), ∨(dis- junction), ∧(conjunction) and temporal operators □(always), X (next), ∪(until), ♦(eventually). An LTL formula φ can always be represented by a deterministic Rabin automaton (DRA) [8] Rφ = (Q, q0, 2AP , δ, ACC) where Q is a set of finite states, q0 ∈Q is the initial state, 2AP is the alphabet, δ : Q × 2AP →Q is the transition function, and ACC = {(L(1), K(1)), ..., (L(M), K(M))}, with M being a positive integer, is a set of tuples where L(i), K(i) ⊆Q for all i = 1, ..., M. Given a infinite word l = σ0σ1... where σi ∈2AP for all i, a unique path ω = q0q1... will be induced where qi+1 = δ(qi, σi). A run ω is accepted in Rφ if there exists a pair (L, K) ∈ACC such that 1) there exists n ≥0, such that for all m ≥n, we have qm /∈L, and 2) there exist infinitely many indices k where qk ∈K. Note that L could be empty but K may not. Intuitively, for a pair (L, K), the acceptance condition means that an accepted run should visit states in L only finite times and states in K infinite times. Given an LTL formula φ, it is always possible to construct a DRA that accepts exactly the words that satisfy φ. III. MODELING AND PROBLEM FORMULATION A. Modeling Fig. 1: Human robot collaborative assembly Task model Consider an industrial assembly process involving both human and robot as shown in Figure 1. The assembly task can be represented by an action-deterministic transition system Mw as shown in Figure 2. There are N parts that need to be assembled by actions Ar = {ar 1, .., ar Nr} denoting the robot actions, and Ah = {ah 1, ..., ah Nh} denoting the human actions. In Figure 2, ai ∈{a1, a2} can represent either human or robot’s corresponding action. During the task execution all necessary parts are fed by suitable mechanisms such as the conveyor belts. The two arrows in Figure 1 denote the distribution of the parts. 
As seen in Figure 2, these assembly tasks often consist of a sequence 4 w0 start w1 w2 a1 a2 ah 3 Fig. 2: Assembly plan Mw of stages where each stage represents one sub-task that needs human or machine’s actions to complete. In practice, the difficulty levels of the sub-tasks in the assembly process often varies from stages to stages, which may result in stochastic variations on machine’s performance. Robot Model This paper introduces an MDP to model the stochastic dynamics of the machine performances. For each robot action ar i , robot may finish it with some cost representing the energy consumption and the time, additionally, ar i may lead to a faulty state with certain probability pi to characterize the possible failure of the robot. Once the robot is in the faulty state, human repair is required. Formally the robot model is an MDP Mr = {Sr, sr 0, Ar ∪ {repair}, P r, Lr, cr} where Sr is a finite set of states repre- senting the machine state, sr 0 ∈Sr is the initial state, and T r is the transition matrix. Figure 3 shows a simple example of two performance levels in a robotic system. In particular, the state labeled Normal represents a normal performance level under which the actions taken by the robot can accomplish the tasks as expected, while the state labeled Faulty denotes an abnormal status under which the robot will fail to performing any assigned tasks. r0 start r1 ar i , pi ar i , 1 −pi repair, 1 {normal} {faulty} Fig. 3: Robot performance model Human trust and fatigue model The interaction between human trust and robot performance is well known to exist in general human robot systems, such as process control systems [25] and human-robot collaboration system [14]. In particular, the work in [13], [14] demonstrated that the dynamics of the human trust can be adequately mod- eled as a first order linear system with the robot performance as its input. This finding motivates us to model the human trust as an MDP whose states represent different trust levels of the human operator. Definition 4 (Human Trust Model). A human trust MDP model is a tuple Mt = {St, st 0, At, P t, ct} where • St is a finite set of human trust levels. • st 0 ∈St is the initial human trust level. • At = Ar ∪{repair} is the action set • P t(st(i), a, st(i + 1) = Pr(st(i + 1)|st(i), a) for a ∈ Ar ∪{repair}. • ct : St × At →R+ is a positive cost function. Similarly, it is also well known that the human performance is affected by the fatigue level [21]. Thus, we model the human fatigue process as an MDP Mf = {Sf, sf 0, Af, P f, cf} where Af = Ah ∪Ar ∪{repair}. Human robot interaction model. Under the definitions of the four models, namely task model Mw, robot model Mr, human trust model Mt and fatigue model Mf, the human robot interaction considered in this paper is characterized by the fact that the human trust and fatigue can be appro- priately regulated by controlling robot performance through strategic task assignment. A desired level of human trust and fatigue will certainly improve the human performance in the assembly process, thereby leading to an effective human-robot collaboration. In the presence of human robot interaction, the manufacturing system is modeled as an MDP M from parallel composition of the four models. M = Mw||Mr||Mt||Mf Where || denotes the parallel composition defined as follows. Definition 5 (parallel composition). 
Given two Markov de- cision processes M1 = (S1, s1 0, A1, P1, L1, c1) and M2 = (S2, s2 0, A2, P2, L2, c2), the parallel composition of M1 and M2 is the MDP M = M1||M2 = (S1 × S2, s1 0 × s2 0, A1 ∪ A2, P, L, c) where L(s1, s2) = L1(s1) ∪L2(s2), and • P((s1, s2), a, (s′ 1, s′ 2)) = P1(s1, a, s′ 1)P2(s2, a, s′ 2), c((s1, s2), a) = c1(s1, a) + c2(s2, a), if a ∈A1 ∩A2 and both P1(s1, a, s′ 1) and P2(s2, a, s′ 2) are defined, or • P((s1, s2), a, (s′ 1, s2)) = P1(s1, a, s′ 1), c((s1, s2), a) = c1(s1, a) if a ∈A1\A2 and P1(s1, a, s′ 1) is defined, or • P((s1, s2), a, (s1, s′ 2)) = P2(s2, a, s′ 2), c((s1, s2), a) = c2(s2, a) if a ∈A2\A1 and P2(s2, a, s′ 2) is defined. From the definition of the parallel composition, it can be seen that the state s of M is actually a four tuple s = (sw, sr, st, sf). Also note that the transition probability in the manufacturing process is not only dependent on the actions taken by the robot or human operator, but also dependent on the robot performance or the human trust level when those ac- tions are taken. This dependency explicitly models the impact of the human robot interaction on the manufacturing process, and addresses the practical concern that the human or robot’s real time performance does affect how well the manufacturing tasks being accomplished. Furthermore, we would like to point out that the quantification of the cost related to human models is application specific. Such metric may need to consider the time-to-completion, workload assessment, comfort level and cognitive workload and so on [22]. Detailed task analysis with subjective rating from human worker are needed for cost function construction. Inverse reinforcement learning [26] may also be applied to learn a cost function based on human demonstrations of the human robot collaborative tasks. B. Problem formulation Once we get the system model M, we are interested in how to optimally assign actions to human and robot dynamically based on their current states. 5 Consider the composed MDP M, suppose we are using the task model as shown in Figure 2, we label all the states of the form Sπ = {w0, ∗, ∗, ∗} as π where ∗means arbitrary state in each submodule as appropriate. That is, Sπ denotes all the states when one round of assembly is finished. The assembly process should keep on going and the finished state should be visited infinite times. We say that each visit to the set Sπ completes a cycle. That is, from the initial state, the first time to reach s ∈Sπ is the first cycle, after that the path revisiting s′ ∈Sπ is the second cycle and so on. Given a path ωs = s0s1...snsn+1..., then we denote the number of cycles completed up to stage n as C(ωs, n) which starts with 1 at the very beginning. As we are dealing with an assembly task, the average cost per cycle (ACPC) is a more reasonable cost measure [18] to optimize. Then we are ready to formally define our problem. Problem 1. Given the composed MDP M and Sπ, find a task assignment policy µ which optimizes the probability of finishing the assembly process infinitely often while minimizing the ACPC defined as J(s0) = lim sup N→∞ E{ PN k=0 c(sk, µ(ωsk µ )) C(ωsµ, N) |Sπ is visited infinitely often and L(ws µ) obeys other constraints} (1) where ωs µ denotes the state path under policy µ and ωsk µ denotes the state path up to stage k. L(ωs µ) is the observed label sequence along the ωs µ. Other constraints may impose additional requirements on µ as well, such as the robot and human assembly action should be alternatively assigned. IV. 
DYNAMIC TASK ASSIGNMENT WITH KNOWN MODEL The requirement that the assembly process should be fin- ished infinite times with additional constraints can be written into the following LTL formula φ = □♦π ∧ϕ (2) where ϕ denotes other constraints. Therefore, we can convert the Problem 1 into the following. Problem 2. Given the composed MDP M, LTL formula φ in (2) and Sπ, find a policy µ which optimizes the probability of satisfying φ while minimizing J(s0) = lim sup N→∞ E{ PN k=0 c(sk, µ(ωsk µ )) C(ωsµ, N) |L(ωs µ) |= φ} (3) Given the task model, robot model, human trust model and human fatigue model as MDPs, we compose them using parallel composition introduced in Definition 5 into the system model M. The specification is given in terms of an LTL formula φ and is converted into a DRA Rφ. An example of the DRA corresponds to the LTL formula φ = □♦π is shown in Figure 4(a) where K = {q1} and is marked in double circles. So the accepted run in Rφ should visit q1 infinitely often. Then the product P between the MDP M and Rabin automaton Rφ is needed to capture all the paths of M to satisfy φ. q0 start q1 π ∅ ∅ π L = ∅, K = {q1} (a) DRA Rφ s0 start s1 s2 a0, 1 a1, 1 a2, 0.4 a3, 0.5 a2, 0.6 a3, 0.5 {π} (b) MDP M s0, q0 start s1, q0 s2, q0 s1, q1 s0, q1 a0, 1 a1, 1 a2, 0.6 a3, 0.5 a1, 1 a2, 0.4 a3, 0.5 a0, 1 AMEC C {π} (c) the composed model P Fig. 4: An example of DRA Rφ for φ = □♦π, the MDP M, and the construction of the composed model P Definition 6. [18] Given an MDP M = {S, s0, A, P, AP, L, c} and a DRA Rφ = (Q, q0, 2AP , δ, ACC), the product is an MDP P = (SP, s0 × q0, A, PP, ACCP, SPπ, cP) where • SP = S × Q • PP((s, q), a, (s′, q′) = Pr(s′|s, a) if q′ = δ(q, L(s)), 0 otherwise; • ACCP = {(LP(1), KP(1)), ..., (LP(M), KP(M))} where LP(i) = S × L(i), KP(i) = S × K(i), for i = 1, ..., M; • SPπ = Sπ × Q; • cP((s, q), a) = c(s, a). A simple MDP M is shown in Figure 4(b) and its product MDP P is illustrated in Figure 4(c) where LP = ∅, KP = {(s1, q1), (s0, q1)} which are marked in double circles. Note that there is a one-to-one correspondence between a path s0s1, ... on M and a path (s0, q0)(s1, q1)... on P with the same cost according to the Definition 6 which enables us to only focus on the behavior of the composed model P. If there is a memoryless policy on P, it is always possible to map it back to M and possibly become a policy with memory which is from the underlying DRA Rφ. Then the problem of maximizing the probability of sat- isfying the LTL formula φ for M can be transformed into a problem of maximizing the probability of reaching a set C(P) of Accepting Maximal End Components (AMEC) in the product MDP P [8]. Definition 7. [18] Given a pair (LP, KP), an end component C is a communicating MDP (SC, AC, PC, KC, SCπ, cC) where SC ⊆SP,AC ⊆A, AC(s) ⊆AP(s) for all s ∈SC, KC = SC∩ KP, SCπ = SC∩SPπ, and cC(s, a) = cP(s, a) for s ∈SC, a ∈ AC(s). For any P(s, a, s′) > 0, s ∈SC, a ∈AC(s), s′ ∈SC and PC(s, a, s′) = PP(s, a, s′). An Accepting Maximal End Components (AMEC) is the largest such en component with nonempty KC and SC ∩LP = ∅. 6 AMECs with respect to the same acceptance pair are pairwise disjoint. Therefore the total number of AMECs is bounded by |S||Q||ACCP|. There exists at least one such AMEC and the procedure of finding the AMEC set C(P) can be found in [8]. As shown in Figure 4(c), there is only one AMEC C in C(P) which is inside the dotted box. 
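The product of Definition 6 is a straightforward synchronous construction, sketched below in the same illustrative Python style as the MDP container in Section II (this is our sketch, not the authors' implementation; in the case study the composition and product are handled by PRISM). The DRA transition function is assumed to be given as a dictionary delta[(q, sigma)] with sigma a set of atomic propositions; the AMEC computation on the result would follow the graph-based procedure in [8] and is not shown.

```python
# A sketch of the product P = M x R_phi from Definition 6. `mdp` is the MDP
# container sketched in Section II; `delta` maps (q, frozenset_of_labels) -> q'.
def product(mdp, dra_states, q0, delta, acc_pairs, S_pi):
    statesP = [(s, q) for s in mdp.states for q in dra_states]
    initP = (mdp.init, q0)
    PP, costP, actionsP = {}, {}, {}
    for (s, q) in statesP:
        actionsP[(s, q)] = mdp.actions[s]
        # the Rabin state advances on the label of the current MDP state: q' = delta(q, L(s))
        q_next = delta[(q, frozenset(mdp.label[s]))]
        for a in mdp.actions[s]:
            PP[((s, q), a)] = {(s2, q_next): p for s2, p in mdp.P[(s, a)].items()}
            costP[((s, q), a)] = mdp.cost[(s, a)]
    # lift each Rabin pair (L_i, K_i) to the product state space
    accP = [([(s, q) for s in mdp.states for q in L_i],
             [(s, q) for s in mdp.states for q in K_i]) for (L_i, K_i) in acc_pairs]
    S_piP = [(s, q) for s in S_pi for q in dra_states]
    return statesP, initP, actionsP, PP, costP, accP, S_piP
```

Restricted to its reachable part, running this construction on the toy MDP of Figure 4(b) and the DRA of Figure 4(a) reproduces the product structure sketched in Figure 4(c).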
Once C is reached, it is always possible to construct a policy such that some K ∈ KP is visited infinitely often, and thus the LTL constraint is satisfied. Therefore the maximum probability of satisfying φ equals the maximum probability of reaching C(P). We can find a stationary policy f→C via value iteration or linear programming such that the probability of reaching a state in one of the AMECs C ∈ C(P) equals Pmax(φ). Note that f→C is defined only outside of C.

Now we focus on minimizing the average cost per cycle (ACPC). From (3) we know that, as N goes to infinity, the cost contributed by the finite path from the initial state to a state in an AMEC vanishes, so J(s0) is solely determined by the policy f⟳C defined inside the AMEC. Once inside C, the task is to construct a stationary policy f⟳C such that a state in KP is visited infinitely often while (3) is minimized. We can utilize Algorithm 1 in [18], a policy iteration algorithm, to obtain the optimal (or suboptimal) f⟳C for each C. Then we form the policy fC for every C ∈ C(P) as

$$f_C(i)=\begin{cases} f_{\rightarrow C}(i), & \text{if } i \notin S_C,\\ f_{\circlearrowright C}(i), & \text{otherwise.}\end{cases}$$

Problem 2 is then solved by selecting, among all fC, the policy f* that incurs the optimal ACPC. Note that this stationary policy is defined on P, whose state is the tuple (s, q) with q the DRA state. Therefore different actions may be assigned to the same s of M when q differs, which induces a policy with finite memory for M.

V. DYNAMIC TASK ASSIGNMENT WITH UNKNOWN MODEL

For human robot interaction applications, in many cases the model may not be known a priori and has to be learned. This section discusses finding the optimal control policy under an unknown MDP model. Given the unknown MDP model M, the LTL specification φ and the composed MDP P, we make the following assumptions.

Assumption 1. The structure of M, i.e., TSM, is known, but the transition probabilities are not.

Assumption 2. Every loop of the form s0s1...sns0 induced by any policy in C ∈ C(P) of P includes at least one state (s, q) with s ∈ Sπ, and the longest number of steps to finish a cycle starting at (s, q) and ending at (s′, q′) with s, s′ ∈ Sπ is bounded by a finite positive integer D.

Assumption 3. Given the LTL formula φ and the MDP M, Pmax(φ) = 1, and for every AMEC C ∈ C(P) there exists a deterministic and memoryless policy such that P(♦C) = 1. Furthermore, the first state (s, q) ∈ C visited under any such policy is unique and satisfies s ∈ Sπ. We call such a state the entrance of C, denoted by ent(C).

Assumption 4. Every C ∈ C(P) is a unichain MDP [24], that is, every deterministic and memoryless policy in C induces a single recurrent class.

Remark 1. In many practical scenarios the underlying structure of the MDP models, for example the assembly process model and the robot performance model, is known beforehand; therefore Assumption 1 is not overly restrictive. Assumption 2 guarantees that, no matter which action is chosen, the system always makes progress towards finishing a cycle, so the number of steps per cycle is bounded. With Assumption 3, we can examine the optimality within each C ∈ C(P) and design the policy to reach the C with the optimal ACPC. Moreover, since the first state visited in C is fixed, it is possible to compare the optimality of bounded-time policies in different C's. Assumption 4 is similar to the one stated in [18] and guarantees that the optimal policy also satisfies the LTL specification.
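The policy f→C of Section IV, which maximizes the probability of reaching the AMEC set, can be computed by standard value iteration for reachability probabilities. The sketch below continues the illustrative Python examples above and is not the authors' code; a probabilistic model checker such as PRISM computes the same quantities.

```python
# Value iteration for f_{->C}: V(x) converges to the maximum probability of
# reaching the AMEC set from product state x, and a maximizing action is
# extracted outside the AMECs. `PP`/`actionsP` follow the product sketch above;
# `target` is the union of the AMEC state sets.
def max_reach_policy(statesP, actionsP, PP, target, iters=10000, tol=1e-9):
    target = set(target)
    V = {x: 1.0 if x in target else 0.0 for x in statesP}
    for _ in range(iters):
        diff = 0.0
        for x in statesP:
            if x in target:
                continue  # AMEC states are absorbing for the reachability objective
            best = max(sum(p * V[y] for y, p in PP[(x, a)].items())
                       for a in actionsP[x])
            diff = max(diff, abs(best - V[x]))
            V[x] = best
        if diff < tol:
            break
    f_to_C = {x: max(actionsP[x],
                     key=lambda a, x=x: sum(p * V[y] for y, p in PP[(x, a)].items()))
              for x in statesP if x not in target}
    return V, f_to_C
```

Under Assumption 3 the value at the initial product state equals one, so the extracted policy reaches some AMEC with probability one.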
Problem 3. Given the MDP model M with unknown transi- tion probability, and LTL specification φ, two parameters δ, ϵ with Assumptions 1, 2, 3 and 4, design an algorithm such that with probability at least 1 −δ, find a policy f→C for every C ∈C(P) such that C can be reached with probability 1. Furthermore, find a policy f⟲C : S × Q →A defined in one of the AMEC C ∈C(P) such that once enters this C, the T cycle ACPC is ϵ−close to the optimal ACPC in M. And the sample complexity is polynomial in |S|, |A|, T, 1 ϵ , 1 δ and linear in the number of accepting pairs. From the problem formulation, we are only interested in the ACPC after reaching the AMEC. To solve Problem 3, we first define α−approximation in MDPs by combining the definitions in [27] and [16] to consider both labels and rewards. Definition 8. Let M = (S, ˆs, A, P, L, C) and ¯ M = ( ¯S, ¯ˆs, ¯A, ¯P, ¯L, ¯C) be two labeled MDPs as defined in Defi- nition 2. We say that ¯ M is an α−approximation of M if • S = ¯S, ˆs = ¯ˆs, A = ¯A, P(s, a, s′) > 0 iff ¯P(s, a, s′) > 0, L = ¯L and C = ¯C. That is, they share the same state and action space, initial condition, structure, labeling and reward function. • |P(s, a, s′) −¯P(s, a, s′)| ≤α for any s, s′ and a. By definition, it is not hard to see that if ¯ M is an α−approximation of M, then ¯P = ¯ M × Rφ is an α−approximation of P = M × Rφ. Since we assume that Pmax(M |= φ) = 1, we then have the following lemma. Lemma 1. If ¯ M is an α−approximation of M, φ is an LTL formula and Pmax(M |= φ) = 1, then Pmax( ¯ M |= φ) = 1. Proof. Since ¯ M is an α−approximation of M, we know that ¯P = ¯ M×Rφ is an α−approximation of P = M×Rφ where Rφ is then DRA from φ. By definition, Pmax(M |= φ) = 1 implies that there exists a policy µ such that Pµ(♦C(P)) = 1 . Now we prove that Pmax( ¯ M |= φ) = 1 with the same policy µ by contradiction. First, observe that C(P) is identical with C( ¯P) expect for the transition probabilities since P and ¯P share the same structure. Suppose under the same µ such that Pµ(♦C(P)) = 1, there 7 exist paths with nonzero probability in ¯P that C( ¯P) is never reached, those paths must also exist with nonzero probability in P under µ because of the same structure, which is not possible since we know Pmax(M |= φ) = 1. Once we have an estimated model that is close enough to the true MDP model, we have the following lemma to evaluate the performance of the synthesized optimal policy, which is similar to the simulation lemma in [27] but we extended it from T-step horizon to T-cycle horizon. Lemma 2. If ¯ M is a ϵ NT RmaxD2 −approximation of M, N = |S|, T is a finite number of cycles and 0 < ϵ < 1, D is the longest number of steps to finish a cycle starting from any finishing state under any proper policy that can complete T cycle with probability 1, then for any LTL specification φ, for the T-cycle policy f : S × Q →A defined in C, an AMEC, we have that |Jf,T P (s, q)−Jf,T ¯ P (s, q)| ≤ϵ, where (s, q) = ent(C) and Jf,T is the average cost in T cycles with the policy f. Proof. Under the T-cycle policy f, we have that |Jf,T P (s, q)−Jf,T ¯ P (s, q)| ≤ X ω |PP(ω)JP(ω)−P ¯ P(ω)J ¯ P(ω)| where P(ω) and J(ω) are the probability of the path ω that runs for exactly T cycles and its corresponding average cost in T cycles under policy f. Since f is proper, we have that P ω PP(ω) = P ω P ¯ P(ω) = 1. Each such ω starts in a state (s, q) and ends in a state (s′, q′) where s, s′ ∈Sπ and runs for exactly T cycles. 
Note that the paths may be of different lengths while going through T cycles, but their lengths are bounded by DT. To facilitate the analysis, we (virtually) extend all paths to length DT: once a path ω finishes T cycles but has not reached DT steps, it stays at its last state with probability 1 and receives zero cost from then on until the path is DT steps long, and the cycle count stays at T. In this way neither the probability nor the average cost is affected. From Definition 8 it can be seen that

$$J_P(\omega) = J_{\bar P}(\omega) = \frac{\sum_{k=1}^{DT} c_k}{T} \le \frac{DT R_{max}}{T} = D R_{max}.$$

Therefore, it suffices to show that

$$\sum_\omega |P_P(\omega) - P_{\bar P}(\omega)| \le \frac{\epsilon}{D R_{max}}.$$

We define the random processes P_i by following the policy f with all paths extended to DT steps, where the first i transitions are the same as in \bar{P}_f and the remaining transitions are the same as in P_f; here P_f and \bar{P}_f are the DTMCs induced by the policy f from P and \bar{P}. Clearly, \bar{P}_f = P_{DT} and P_f = P_0. Then

$$\sum_\omega |P_P(\omega) - P_{\bar P}(\omega)| = \sum_\omega |P_{P_0}(\omega) - P_{P_{DT}}(\omega)| = \sum_\omega \big|P_{P_0}(\omega) - P_{P_1}(\omega) + P_{P_1}(\omega) - P_{P_2}(\omega) + \dots + P_{P_{DT-1}}(\omega) - P_{P_{DT}}(\omega)\big| \le \sum_{i=0}^{DT-1} \sum_\omega |P_{P_i}(\omega) - P_{P_{i+1}}(\omega)|. \quad (4)$$

It therefore remains to prove that

$$\sum_\omega |P_i(\omega) - P_{i+1}(\omega)| \le \frac{\epsilon}{T R_{max} D^2}, \quad (5)$$

where, for notational simplicity, we write P_i for P_{P_i}. Inspired by [27], let s_i denote the state reachable after i transitions, and let pre(s_i) and suf(s_i) denote the i-step prefix reaching s_i (not including s_i) and the suffix starting at s_i, respectively. For ω = pre(s_i).s_i.suf(s_{i+1}) we have P_i(ω) = P_i(pre(s_i)) P_i(s_{i+1}|s_i) P_i(suf(s_{i+1})). Then the left-hand side of (5) can be rewritten as

$$\sum_\omega |P_i(\omega) - P_{i+1}(\omega)| = \sum_{s_i} \sum_{pre(s_i)} \sum_{s_{i+1}} \sum_{suf(s_{i+1})} \big| P_i(pre(s_i)) P_i(s_{i+1}|s_i) P_i(suf(s_{i+1})) - P_{i+1}(pre(s_i)) P_{i+1}(s_{i+1}|s_i) P_{i+1}(suf(s_{i+1})) \big|. \quad (6)$$

Note that (6) holds even though all ω are virtually extended to the same length. From the definition of P_i it follows that P_i(pre(s_i)) = P_{i+1}(pre(s_i)) and P_i(suf(s_{i+1})) = P_{i+1}(suf(s_{i+1})). Therefore (6) becomes

$$\sum_{s_i} \sum_{pre(s_i)} P_i(pre(s_i)) \sum_{s_{i+1}} \sum_{suf(s_{i+1})} P_i(suf(s_{i+1})) \big| P_i(s_{i+1}|s_i) - P_{i+1}(s_{i+1}|s_i) \big| \le \sum_{s_i} \sum_{pre(s_i)} P_i(pre(s_i)) \sum_{s_{i+1}} \sum_{suf(s_{i+1})} P_i(suf(s_{i+1})) \frac{\epsilon}{N T R_{max} D^2}. \quad (7)$$

Note that \sum_{s_i} \sum_{pre(s_i)} P_i(pre(s_i)) = 1, since it sums over all prefixes of i steps, and \sum_{s_{i+1}} \sum_{suf(s_{i+1})} P_i(suf(s_{i+1})) \le N, since \sum_{suf(s_{i+1})} P_i(suf(s_{i+1})) = 1 and the number of choices of s_{i+1} after s_i is bounded by N, which can be seen from the definition of the product MDP. Therefore (5) holds, and the lemma is proved.

Lemma 3. If ¯M is an ϵ/(NTRmaxD²)−approximation of M and ϵ > 0, then for any LTL specification φ and for the T-cycle optimal policies f and g defined in C ∈ C(P) for ¯P and P respectively, we have |J^{T,f}_P(s, q) − J^{T,g}_P(s, q)| ≤ 2ϵ, where (s, q) = ent(C) and J^T denotes the average cost over T cycles.

Proof. This is a direct consequence of Lemma 2 and the fact that J^{T,f}_{¯P}(s, q) ≤ J^{T,g}_{¯P}(s, q).

Lemma 3 is analogous to Lemma 2 in [16] and intuitively bounds the performance of the optimal T-cycle policy computed on the estimated model when it is applied to the ground truth model.
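Lemma 2 is a worst-case guarantee; as a rough numerical illustration (ours, not part of the paper's analysis), the toy Monte-Carlo experiment below perturbs the transition matrix of a small three-state chain obtained under a fixed policy, with state 0 playing the role of the cycle-completion states in Sπ, and compares the resulting T-cycle average costs. All numbers are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Ground-truth 3-state chain under a fixed policy; returning to state 0 completes a cycle.
P = np.array([[0.0, 0.6, 0.4],
              [1.0, 0.0, 0.0],
              [0.7, 0.3, 0.0]])
cost = np.array([0.2, 0.5, 0.9])      # per-step cost incurred in each state

def perturb(P, alpha, rng):
    """Shift every nonzero entry by roughly alpha while keeping rows stochastic."""
    Q = P.copy()
    for i in range(len(P)):
        idx = np.flatnonzero(P[i])
        noise = rng.uniform(-alpha, alpha, size=len(idx))
        noise -= noise.mean()                      # keep the row sum near one
        Q[i, idx] = np.clip(P[i, idx] + noise, 1e-3, 1.0)
        Q[i, idx] /= Q[i, idx].sum()
    return Q

def avg_cost_per_cycle(P, cost, T=10, runs=5000, rng=rng):
    """Monte-Carlo estimate of the T-cycle average cost starting from state 0."""
    totals = np.empty(runs)
    for r in range(runs):
        s, cycles, acc = 0, 0, 0.0
        while cycles < T:
            acc += cost[s]
            s = rng.choice(3, p=P[s])
            if s == 0:                             # back at the completion state
                cycles += 1
        totals[r] = acc / T
    return totals.mean()

J_true = avg_cost_per_cycle(P, cost)
J_pert = avg_cost_per_cycle(perturb(P, 0.02, rng), cost)
print(f"T-cycle ACPC gap under a ~0.02 perturbation: {abs(J_true - J_pert):.4f}")
```

Consistently with the lemma, shrinking the perturbation shrinks the observed gap, although the simulation only shows typical behavior rather than the worst case bounded above.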
The finite cycle count T is chosen such that the average cost over T cycles under the optimal policy g satisfies the following definition.

Definition 9. Given the optimal policy g defined in C over the infinite horizon, the ϵ-mixing cycle is the smallest positive integer TC such that J^{g,TC}_{P,C}(s, q) − J^g_{P,C}(s, q) < ϵ for (s, q) = ent(C), where J^{g,TC}_{P,C}(s, q) is the average cost per cycle under policy g starting from (s, q) and running for TC cycles, and J^g_{P,C}(s, q) is the optimal ACPC. The ϵ-mixing cycle T of the MDP is defined as T = max_{C∈C(P)} TC.

From the above discussion, it is essential to obtain an estimated MDP that is statistically close to the ground truth, which requires estimating the transition probabilities of the MDP. Following [16], when the number of samples is large enough, the maximum likelihood estimator of the transition probability P(s, a, s′) can be viewed as a normally distributed random variable with mean μ = count(s, a, s′)/count(s, a) and variance σ² = count(s, a, s′)(count(s, a) − count(s, a, s′)) / (count(s, a)²(count(s, a) + 1)), where count(s, a) is the total number of times that action a has been executed in s and count(s, a, s′) is the total number of times that the transition (s, a, s′) has been observed. We then use the notion of known states [16], [27], [28] defined below.

Definition 10. Given an MDP M and an LTL specification φ, a probabilistic transition ((s, q), a, (s′, q′)) of P = M × Rφ is known if, with probability at least 1 − δ, for any (s′, q′) reachable by executing a, σ²k ≤ ϵ/(NTRmaxD²), where σ² is the variance of the transition probability estimator, k is the critical value for the 1 − δ confidence interval, ϵ > 0 is a constant, N = |S|, T is the ϵ-mixing cycle and D is the longest number of steps required to finish one cycle. A state (s, q) is known if, for every action a defined on it, ((s, q), a, (s′, q′)) is known for all (s′, q′).

From Definition 10, if all the states of the MDP are known, then with probability no less than 1 − δ it is an ϵ/(NTRmaxD²)-approximation of the true underlying MDP. In fact, since we are only interested in the behavior inside the AMECs, it is sufficient that all the states in the AMECs are known. The following algorithm and theorem show that such a model can be learned efficiently.

Algorithm 1: ModelLearningAndPolicyFinding
Input: the state and action sets S and A, the labeling function L, the underlying transition system TSM of the MDP M, the specification DRA Rφ, ϵ, δ, and the (estimated) mixing cycle T.
Output: policy f : S × Q → A
1  Obtain the AMEC set C(P), with only the transition probabilities unknown, and f→C for each C ∈ C(P) from the product TSM × Rφ;
2  for C ∈ C(P) do
3      (s, q) ← (s0, q0);
4      apply f→C until ent(C) is reached;
5      ¯C ← ModelLearning(C);
6      f⟳¯C ← optimal T-cycle policy on ¯C;
7      J¯C ← J^{f,T}_{¯C}(ent(¯C));
   end
8  C* ← arg min_{¯C} J¯C;
9  f*→ ← f→C*;
10 f*⟳ ← f⟳C*;
11 return f = [f*→, f*⟳];

Theorem 1. Given an MDP model M and an LTL specification φ of the form (2) satisfying Assumptions 1, 2, 3 and 4, and given 0 < δ < 1, ϵ > 0 and the ϵ-mixing cycle T, with probability no less than 1 − δ, Algorithm 1 learns an estimated MDP model in which all states in the AMECs are known within a number of steps polynomial in |S|, |A|, T, 1/ϵ, 1/δ and linear in the number of accepting pairs. The T-cycle optimal policy f : S × Q → A found on this estimated model satisfies J^{f,T}_P(s, q) − J*_P(s, q) < 3ϵ for (s, q) = ent(C), where C is the AMEC that incurs the minimum T-cycle average cost.

Proof. By the Chernoff bound, the number of visits needed to make a state known is polynomial in |A|, T, 1/ϵ, 1/δ. Before all states in a given C ∈ C(P) are known, the exploration policy ensures that every state in C is sufficiently visited because C is a communicating MDP.
Once all the C ∈C(P) are known which takes the number of steps polynomial in |S|, |A|, T, 1 ϵ , 1 δ and linear in the number of accepting pairs, for every C, the optimal T cycle policy fC satisfies |JfC,T P,C (s, q) −Jg,T P,C(s, q)| ≤2ϵ from Lemma 3. From the definition of ϵ−mixing time, we know that Jg,T P,C(s, q) −J∗ P,C(s, q) < ϵ. Therefore, JfC,T P,C (s, q) −J∗ P,C(s, q) < 3ϵ Since we select the AMEC with the smallest average cost, that is Jf,T P (s, q) = minC∈C(P) JfC,T P,C (s, q), therefore Jf,T P,C(s, q) − J∗ P,C(s, q) < 3ϵ for all C ∈C(P), then we have that Jf,T P (s, q) −J∗ P(s, q) < 3ϵ Note that f→C can be computed using standard value itera- tion [29]. The sub-algorithm ModelLearning(C) in Algorithm 1 can be flexible as long as it can explore all the state transitions in C sufficiently well. For example, for each state (s, q) ∈C, we could select the action in a round-robin fashion to make sure every state and action pair is sampled. In Section VI, we solve the exploration and exploitation by following the strategy that for any unknown states, we select the action in a round-robin fashion while for states that are already known, we follow the strategy based on the current model that maximizes the probability to reach unknown states which 9 can be efficiently computed by standard value iteration [30]. As to obtain the optimal stationary T-cycle policy, we utilize the policy evaluation method in [18] and find the policy that incurs the optimized T−cycle ACPC. VI. EXAMPLE We illustrate the validity of our framework in a case study of a human robot collaboration in an assembly scenario. The simulation is run in Matlab on a laptop with Intel i7 processor with 2.6GHz and 16GB of memory. For the underlying true model to be learned, the task model is as shown in Figure 2. The robot model is as shown in Figure 3 with p0 = 0.6, p1 = 0.65. The fatigue model and trust model are as shown in Figure 5 and Figure 6. Their corresponding transition probabilities are as shown in Table I. For the fatigue model, it can be seen that for each human action ah i and repair, there are probabilities f h i and frepair for the human to stay at the same fatigue level and 1 −f h i , 1 −frepair to move to a higher fatigue level if possible. If the human is already at the highest fatigue level, any action involves human wouldn’t make any change. If the robot is selected to perform actions ar i , then the human would be idle. There is f i r probability for the human to stay at the same fatigue level and 1 −f i r probability to decrease to the lower fatigue level. When the human is at the lowest fatigue level, any idleness wouldn’t change the fatigue level. As shown in the human trust model in Figure 6, human action would not affect human trust to the robots. Any robot action would have certain probability for the human to stay at the same trust level, and certain probability to transfer to adjacent trust level when applicable. The transition probabil- ities imply the robot’s performance as there is higher chance to increase human trust if the robot performs it well. f0 start f1 f2 ar i , 1 ah i , f h i repair, frepair ah i , 1 −f h i repair, 1 −frepair ar i , f r i ah i , f h i repair, frepair ar i , 1 −f r i ah i , 1 −f h i repair, 1 −frepair ar i , f r i ah i , 1 repair, 1 ar i , 1 −f r i Fig. 5: Human fatigue model t0 start t1 t2 ar i , tr i repair, 1 ar i , 1 −tr i ar i , tr 1,i ar i , (1 −tr 1,i)/2 repair, 1 ar i , (1 −tr 1,i)/2 ar i , tr i ar i , 1 −tr i repair, 1 Fig. 
6: Human trust model The four models are created using PRISM model checker [31] which also produces the composed system model M. As for the cost, we assume it is unknown to the learner initially and is learned later. For the task model, all actions in the task model would incur a cost of 0.7. For the robot model, robot actions ar 0 and ar 1 incur 0.003 and 0.07 respectively, the cost of repair is 0.07. The cost for human fatigue and trust models are as shown in Table II where the empty entries mean not applicable. Essentially we penalize the following cases by assigning a larger cost 1) human is idle and the fatigue level is low; 2) human is busy and the fatigue level is high; 3) human trust is low; 4) lost of trust due to robot failure. On the other hand, we reward the following cases 1) human is busy when fatigue level is low; 2) human is idle when fatigue level is high; 3) human is at the medium fatigue level 4) When human trust is high. The action set for composed M is {ar 0, ar 1, ah 0, ah 1, ah 2, repair}. The total number of states in this example is 54. The states refer to the tuple (sw, sr, st, sf) where each element in the tuple is from task automaton, robot model, human fatigue model and human trust model respectively. We label Snormal = {∗, 0, ∗, ∗} as normal, for all the states when the robot is in the normal state. Additionally, we label Sfaulty = {∗, 1, ∗, ∗} as faulty for the all the states when the robot is in the faulty state. Our LTL specification is as the following φ1 = □♦π ∧□(faulty →Xnormal) (8) In plain words, it requires the composed MDP M to visit the finish state in Sπ infinitely often and whenever there is fault, it will be fixed in the next step, since leaving robot in faulty state for too long may permanently damage it. The corresponding DRA has 9 states and one acceptance pair. Therefore the product model has 486 states. For this example, we set ϵ = 0.35, T = 10, Rmax = 1 and 90% confidence level. Even if the true model is unknown before learning, since the underline transition system is known, we can find that the maximum step to finish a cycle is D = 5 and calculate the AMEC. There is one AMEC with 75 states in the product model which includes the initial state (q0, s0), therefore there is no need to find f→C. The model learning took 206 seconds and 812872 cycles. Then we obtain the T- cycle optimal policy by evaluating the T-cycle ACPC with formula (21) in from [18] for each possible deterministic and memoryless policy. The optimal T−cycle ACPC we obtain is 1.134 while the optimal ACPC with infinite time horizon is 1.128. The T−cycle optimal policy we found is almost identical to the one for unbounded case with only one different action choice. It can be seen that the performance bound we have is quite conservative due to the loose upper bound. VII. CONCLUSION In this paper we proposed a learning based dynamic task assignment framework for a class of human robot interactive systems that human and robot work closely to finish a given mission repetitively. Due to the uncertainties in robot dynamics and the evolution of human cognitive and physiological states, we model the robot status, human trust and fatigue as MDPs respectively. 
A specification is given to ensure that the mission can be finished infinitely often with the optimal probability, and additional constraints can be added as necessary. The abstract model of the human robot system is a composed MDP built from the task model, robot model, human fatigue and trust models. The task assignment problem is then converted into an optimal controller synthesis problem. When the system model is unknown, we proposed a PAC-learning inspired algorithm to efficiently learn the model and then find the optimal policy. An example has been explored to show the validity of our framework. Future work includes extending the problem to more general settings, such as two-player stochastic games, to consider the case where the human action is not controllable, and partial observability, where human states such as trust and fatigue cannot be directly observed.

TABLE I: Transition probabilities

f^r_0   f^r_1   f^h_0   f^h_1   f^h_2   f_repair   t^r_0   t^r_1   t^r_{1,0}   t^r_{1,1}
0.5     0.4     0.5     0.4     0.45    0.4        0.5     0.4     0.3         0.4

TABLE II: Costs associated to each state and action (empty entries are not applicable)

            f0      f1      f2      t0      t1      t2
a^r_i       0.3     0.1     0.03    0.3     0.17    0.03
a^h_i       0.03    0.1     0.3     –       –       –
repair      0.03    0.1     0.3     0.17    0.17    0.5

REFERENCES

[1] B. Sadrfaridpour, H. Saeidi, J. Burke, K. Madathil, and Y. Wang, “Modeling and control of trust in human-robot collaborative manufacturing,” in Robust Intelligence and Trust in Autonomous Systems. Springer, 2016, pp. 115–141.
[2] F. Chen, K. Sekiyama, F. Cannella, and T. Fukuda, “Optimal subtask allocation for human and robot collaboration within hybrid assembly system,” IEEE Transactions on Automation Science and Engineering, vol. 11, no. 4, pp. 1065–1075, 2014.
[3] E. Broadbent, R. Stafford, and B. MacDonald, “Acceptance of healthcare robots for the older population: review and future directions,” International Journal of Social Robotics, vol. 1, no. 4, pp. 319–330, 2009.
[4] A. M. Okamura, M. J. Mataric, and H. I. Christensen, “Medical and health-care robotics,” IEEE Robotics & Automation Magazine, vol. 17, no. 3, pp. 26–37, 2010.
[5] C.-W. Lin, M.-H. Khong, and Y.-C. Liu, “Experiments on human-in-the-loop coordination for multirobot system with task abstraction,” IEEE Transactions on Automation Science and Engineering, vol. 12, no. 3, pp. 981–989, 2015.
[6] S. A. Seshia, D. Sadigh, and S. S. Sastry, “Formal methods for semi-autonomous driving,” in Proceedings of the 52nd Annual Design Automation Conference. ACM, 2015, p. 148.
[7] A. M. Zanchettin, N. M. Ceriani, P. Rocco, H. Ding, and B. Matthias, “Safety in human-robot collaborative manufacturing environments: Metrics and control,” IEEE Transactions on Automation Science and Engineering, vol. 13, no. 2, pp. 882–893, 2016.
[8] C. Baier and J. Katoen, Principles of Model Checking. MIT Press, 2008.
[9] A. Bauer, D. Wollherr, and M. Buss, “Human–robot collaboration: a survey,” International Journal of Humanoid Robotics, vol. 5, no. 01, pp. 47–66, 2008.
[10] P. A. Hancock, D. R. Billings, K. E. Schaefer, J. Y. Chen, E. J. De Visser, and R. Parasuraman, “A meta-analysis of factors affecting trust in human-robot interaction,” Human Factors: The Journal of the Human Factors and Ergonomics Society, vol. 53, no. 5, pp. 517–527, 2011.
[11] M. Desai, M. Medvedev, M. Vázquez, S. McSheehy, S. Gadea-Omelchenko, C. Bruggeman, A. Steinfeld, and H. Yanco, “Effects of changing reliability on trust of robot systems,” in Human-Robot Interaction (HRI), 2012 7th ACM/IEEE International Conference on. IEEE, 2012, pp. 73–80.
[12] P. Robinette, A. R. Wagner, and A. M.
Howard, “The effect of robot performance on human–robot trust in time–critical situations,” 2015. [13] J. Lee and N. Moray, “Trust, control strategies and allocation of function in human-machine systems,” Ergonomics, vol. 35, no. 10, pp. 1243– 1270, 1992. [14] M. Desai, P. Kaniarasu, M. Medvedev, A. Steinfeld, and H. Yanco, “Impact of robot failures and feedback on real-time trust,” in Proceed- ings of the 8th ACM/IEEE international conference on Human-robot interaction. IEEE Press, 2013, pp. 251–258. [15] X. Wang, Z. Shi, F. Zhang, and Y. Wang, “Mutual trust based scheduling for (semi) autonomous multi-agent systems,” in 2015 American Control Conference (ACC). IEEE, 2015, pp. 459–464. [16] J. Fu and U. Topcu, “Probably approximately correct mdp learning and control with temporal logic constraints,” in Proceedings of Robotics: Science and Systems, Berkeley, USA, July 2014. [17] E. M. Wolff, U. Topcu, and R. M. Murray, “Optimal control with weighted average costs and temporal logic specifications,” in Robotics: Science and Systems, 2012. [18] X. Ding, S. L. Smith, C. Belta, and D. Rus, “Optimal control of markov decision processes with linear temporal logic constraints,” Automatic Control, IEEE Transactions on, vol. 59, no. 5, pp. 1244–1257, 2014. [19] M. Svorenova, I. Cerna, and C. Belta, “Optimal control of mdps with temporal logic constraints,” in Decision and Control (CDC), 2013 IEEE 52nd Annual Conference on. IEEE, 2013, pp. 3938–3943. [20] D. Sadigh, E. S. Kim, S. Coogan, S. S. Sastry, and S. A. Seshia, “A learning based approach to control synthesis of markov decision processes for linear temporal logic specifications,” in Decision and Control (CDC), 2014 IEEE 53rd Annual Conference on. IEEE, 2014, pp. 1091–1096. [21] Q. Ji, P. Lan, and C. Looney, “A probabilistic framework for modeling and real-time monitoring human fatigue,” IEEE Transactions on systems, man, and cybernetics-Part A: Systems and humans, vol. 36, no. 5, pp. 862–875, 2006. [22] J. Fu and U. Topcu, “Synthesis of shared autonomy policies with tem- poral logic specifications,” IEEE Transactions on Automation Science and Engineering, vol. 13, no. 1, pp. 7–17, Jan 2016. [23] B. Wu, B. Hu, and H. Lin, “Toward efficient manufaturing systems: a trust based human robot collaboration,” in American Control Conference (ACC), 2015. IEEE, 2017, pp. 1536–1541. [Online]. Available: http://www3.nd.edu/∼bwu3/doc/ACC2017.pdf [24] M. L. Puterman, Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons, 2014. [25] B. M. Muir, Operators’ trust in and use of automatic controllers in a supervisory process control task. University of Toronto, 1990. [26] P. Abbeel and A. Y. Ng, “Apprenticeship learning via inverse rein- forcement learning,” in Proceedings of the twenty-first international conference on Machine learning. ACM, 2004, p. 1. [27] R. I. Brafman and M. Tennenholtz, “R-max-a general polynomial time algorithm for near-optimal reinforcement learning,” Journal of Machine Learning Research, vol. 3, no. Oct, pp. 213–231, 2002. [28] M. Kearns and S. Singh, “Near-optimal reinforcement learning in polynomial time,” Machine Learning, vol. 49, no. 2-3, pp. 209–232, 2002. [29] J. J. Rutten, M. Kwiatkowska, G. Norman, and D. Parker, Mathematical techniques for analyzing concurrent and probabilistic systems. Amer- ican Mathematical Soc., 2004. [30] D. P. Bertsekas, Dynamic Programming and Optimal Control, 4th ed. Athena Scientific, 2012, vol. 2. [31] M. Kwiatkowska, G. Norman, and D. 
Parker, “Prism 4.0: Verification of probabilistic real-time systems,” in International Conference on Computer Aided Verification. Springer, 2011, pp. 585–591.