Reinforcement Learning based Embodied Agents Modelling Human Users Through Interaction and Multi-Sensory Perception

Kory W. Mathewson, Patrick M. Pilarski
Departments of Computing Science and Medicine
University of Alberta, Edmonton, Alberta, Canada
[korym, pilarski] @ ualberta.ca

Abstract

This paper extends recent work in interactive machine learning (IML) focused on effectively incorporating human feedback. We show how control and feedback signals complement each other in systems which model human reward. We demonstrate that simultaneously incorporating human control and feedback signals can improve an interactive robotic system's performance on a self-mirrored movement control task, in which an RL-agent-controlled right arm attempts to match the preprogrammed movement pattern of the left arm. We illustrate the impact of varying human feedback parameters on task performance by investigating the probability of giving feedback on each time step and the likelihood of that feedback being correct. We further illustrate that varying the temporal decay with which the agent incorporates human feedback has a significant impact on task performance. We found that smearing human feedback over time steps improves performance, and we show that varying the probability of feedback at each time step, and an increased likelihood of that feedback being 'correct', can impact agent performance. We conclude that understanding latent variables in human feedback is crucial for learning algorithms acting in human-machine interaction domains.

Introduction

Reinforcement learning (RL) agents can learn optimal actions by building models of their environments through perceptive sensors during repeated interactions. RL agents often cooperate interactively with human trainers to solve difficult tasks. Human teachers are a unique component of the environment who may deliver both control signals and contextual information through feedback. As human-robot interaction becomes more complex, due to rapid advancements in actuator and sensor technology, a significant gap emerges between the number of possible control signals a human can provide and the number of controllable actuators of a robotic system. There is often a limited set of control signals which a human can provide, and a large number of controllable functions in the robotic system.

The limit on human-provided control signals is of particular interest in the field of robotic prostheses: artificial limbs attached to the body to augment and/or replace abilities lost through injury or illness. Prosthetic limbs with many degrees of freedom have been developed (Castellini et al. 2014). State-of-the-art prosthetics can perform complex functions and movements, but rapid, reactive control of this functionality by human users is limited; this limitation causes some users to abandon their devices (Castellini et al. 2014; Biddiss and Chau 2007; Micera, Carpaneto, and Raspopovic 2010; Scheme and Englehart 2011). New methods are in development to help humans control complex robotic devices through intelligent control sharing and by allowing a learning agent inside the prosthetic to model the human user. The work presented herein explores RL agents controlled by simulated human electromyography (EMG) signals, with additional reward feedback signals.

Figure 1: Configuration with A) results example, B) Myo, C) simulation/learning/feedback system, and D) Nao.
Background

RL is a learning framework inspired by behaviorism (Skinner 1938) which describes how agents improve over time by taking actions in an environment with the goal of maximizing expected return, the cumulative future reward signal received by the agent (Sutton and Barto 1998). An agent's control policy is iteratively improved by selecting actions which maximize return. RL problems are often described as sequential decision-making problems modelled as Markov Decision Processes (MDPs), defined by the tuple (State, Action, Transitions, γ, Reward); full details of MDPs are omitted for space and can be found in (Sutton and Barto 1998; Mathewson and Pilarski 2016). The ultimate goal of an RL agent is to determine a policy which maps a given current state to the actions that maximize expected return. In this work we use a continuous actor-critic (AC) algorithm (Algorithm 1) similar to that described in (Pilarski et al. 2011; Pilarski, Dick, and Sutton 2013; Mathewson and Pilarski 2016). AC methods can reduce variance in gradient estimation through the use of two learning systems: a policy-focused actor (which selects the best action) and a critic (which estimates the value function and criticizes the actor) (Sutton and Barto 1998).

The Interactive Shaping Problem (ISP) defines the problem of optimally incorporating human feedback into a learning agent in a sequential decision-making problem (Knox and Stone 2010). The ISP asks: how can the agent learn the best possible task policy, as measured by task performance or cumulative human feedback, given the information contained in the human responses (Knox and Stone 2009; Knox and Stone 2012)? While there are many ways to incorporate human knowledge into a learning system both before and during learning (Thomaz and Breazeal 2008; Chernova and Thomaz 2014), this paper focuses on incorporating human feedback directly alongside MDP-derived reward. This work builds on the work of Vien and Ertel, who showed that the human feedback model can be generalized to address the problems associated with periods of noisy and/or inconsistent human feedback (Vien, Ertel, and Chung 2013). Recent advancements in modelling human feedback with a Bayesian approach have improved on the work of Knox and Stone in discrete environments (Loftin et al. 2016). Most recently, work by Macglashan et al. shows that human feedback may be better modelled as an advantage function to handle changes in a human's feedback strategy over time (Macglashan et al. 2016).

In this study, we explore the implications of varying several latent variables in human feedback for learning algorithms acting in complex human-machine interaction domains. We investigate the probability of the human trainer providing feedback, the probability that feedback is correct, and the effect of smearing that feedback over time to account for the limited number of time steps with direct human feedback.

Algorithm 1 Continuous Actor-Critic Reinforcement Learning
1: initialize: w_µ, w_σ, v, e_µ, e_σ, e_v, s
2: repeat
3:   µ ← w_µ^T x(s)
4:   σ ← exp[w_σ^T x(s)]
5:   a ← N(µ, σ²)
6:   take action a, observe r, s′
7:   δ ← r + γ v^T x(s′) − v^T x(s)
8:   e_v ← min[1, λ_v γ e_v + x(s)]
9:   v ← v + α_v δ e_v
10:  e_µ ← λ_w e_µ + (a − µ) x(s)
11:  w_µ ← w_µ + α_µ δ e_µ
12:  e_σ ← λ_w e_σ + [(a − µ)² − σ²] x(s)
13:  w_σ ← w_σ + α_σ δ e_σ
14:  s ← s′
15: until termination criteria is met
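To make the per-step update concrete, the following is a minimal Python sketch of Algorithm 1, assuming a binary (e.g., tile-coded) feature vector x(s); the hyperparameter defaults and class structure are illustrative placeholders rather than the authors' implementation.

```python
# A minimal sketch of the continuous actor-critic update in Algorithm 1,
# assuming a binary feature vector x(s) (e.g., from tile coding) and
# placeholder hyperparameter values; not the authors' exact implementation.
import numpy as np

class ContinuousActorCritic:
    def __init__(self, n_features, alpha_v=0.1, alpha_mu=0.01, alpha_sigma=0.01,
                 gamma=0.9, lambda_w=0.3, lambda_v=0.7, sigma_min=0.01):
        self.w_mu = np.zeros(n_features)     # actor mean weights
        self.w_sigma = np.zeros(n_features)  # actor (log) std-dev weights
        self.v = np.zeros(n_features)        # critic value weights
        self.e_mu = np.zeros(n_features)     # actor eligibility traces
        self.e_sigma = np.zeros(n_features)
        self.e_v = np.zeros(n_features)      # critic eligibility trace
        self.alpha_v, self.alpha_mu, self.alpha_sigma = alpha_v, alpha_mu, alpha_sigma
        self.gamma, self.lambda_w, self.lambda_v = gamma, lambda_w, lambda_v
        self.sigma_min = sigma_min

    def act(self, x):
        """Sample a continuous action from N(mu, sigma^2) (steps 3-5)."""
        mu = self.w_mu.dot(x)
        sigma = max(np.exp(self.w_sigma.dot(x)), self.sigma_min)
        return np.random.normal(mu, sigma), mu, sigma

    def update(self, x, a, mu, sigma, r, x_next):
        """Apply the TD and policy-gradient updates (steps 7-13)."""
        delta = r + self.gamma * self.v.dot(x_next) - self.v.dot(x)            # step 7
        self.e_v = np.minimum(1.0, self.lambda_v * self.gamma * self.e_v + x)  # step 8
        self.v += self.alpha_v * delta * self.e_v                              # step 9
        self.e_mu = self.lambda_w * self.e_mu + (a - mu) * x                   # step 10
        self.w_mu += self.alpha_mu * delta * self.e_mu                         # step 11
        self.e_sigma = self.lambda_w * self.e_sigma + ((a - mu) ** 2 - sigma ** 2) * x  # step 12
        self.w_sigma += self.alpha_sigma * delta * self.e_sigma                # step 13
```

As described below, the reward r passed to the update on Step 6 is the MDP-derived reward plus any human-delivered feedback on that time step.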
Methods

Aldebaran Nao and Myo EMG Data

The experimental setup is shown in Figure 1. It is composed of the Aldebaran Nao robotic platform (Aldebaran Robotics), a wireless Myo EMG armband (Thalmic Labs), and a MacBook Air (Apple, 2.2 GHz Intel Core i7, 8 GB RAM) for human feedback and running the learning agent.

The experiments in this paper are performed using a simulated Nao platform, a simulated EMG signal, and a simulated human feedback model. We have previously shown the performance of this experimental set-up to be comparable between simulation and real-world experiments (Mathewson and Pilarski 2016). By simulating the human feedback, we are able to characterize and vary important latent variables, hidden from the agent, which impact the learning of the system. For this study, we investigate: the rate at which human-delivered feedback should decay (smear), the probability with which the human will provide feedback (P(feedback)), and the probability that this feedback will be correct for a given MDP (P(correct)). These are critical variables that have been estimated in previous experiments (Knox and Stone 2015; Loftin et al. 2016); we aim to improve understanding of their impact through an experimental grid sweep over the variables of interest and an investigation of the results.

Experiments

We extend the results in (Mathewson and Pilarski 2016) by exploring the impact of varying model parameters of human trainer feedback on the RL system during the performance of a self-mirrored movement control task. In this task, we preprogram the left arm of the Nao to move in a periodic pattern of flexion and extension at the elbow joint. The RL agent controls the right arm and selects angular displacement actions attempting to match the pattern of the left. With this configuration we are able to define an optimal policy, which would track the preprogrammed arm exactly; with this optimal trajectory we are able to derive MDP reward given a set angular error threshold. When the RL-controlled elbow joint is within the angular deviation threshold of the preprogrammed elbow joint, a reward of 1 is received from the MDP; otherwise, a negative reward is delivered proportional to the difference between the actual and optimal angles.

We are interested in modelling smear, the time decay with which the feedback given by the human should be discounted. As the human is unable to give feedback at every step that an agent takes, we need to account for the fact that, after the exact time step at which feedback is given, there are likely suboptimal states which still support the optimal trajectory. With a decay parameter we are able to smear the human feedback forward in time; it has been shown that limited human feedback can be applied across near-optimal state-action pairs and support the agent in learning an optimal solution (Pilarski et al. 2011). We further explore the following characteristics of human feedback: P(feedback), the probability of giving feedback on each time step, and P(correct), the probability of giving correct vs. incorrect feedback. These are important latent human parameters to understand; cognitively, they represent how effective and attentive a human trainer is.

The continuous state space is defined by the filtered, time-averaged, and dimensionally reduced EMG signal and the angle of the actuated joint, and is represented with function approximation using tile coding (Mathewson and Pilarski 2016).
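One way such a simulated trainer could be realized is sketched below. This is a sketch under our own assumptions (feedback values of ±1, with "correctness" judged by whether the agent's action kept the joint within the error threshold); the paper does not specify these implementation details.

```python
# A minimal sketch of a simulated human feedback model with the three latent
# variables studied here: P(feedback), P(correct), and a smear decay.
# The +/-1 feedback magnitude and the threshold-based notion of a "good" action
# are assumptions for illustration, not the authors' exact model.
import numpy as np

class SimulatedTrainer:
    def __init__(self, p_feedback=0.05, p_correct=0.75, smear=0.5):
        self.p_feedback = p_feedback  # probability of giving feedback on a time step
        self.p_correct = p_correct    # probability the feedback has the correct sign
        self.smear = smear            # decay applied to earlier feedback on later steps
        self.trace = 0.0              # smeared feedback carried forward in time

    def feedback(self, action_was_good):
        """Return the (smeared) human feedback signal for the current time step."""
        self.trace *= self.smear                    # decay earlier feedback
        if np.random.rand() < self.p_feedback:      # trainer chooses to respond
            correct = np.random.rand() < self.p_correct
            h = 1.0 if (action_was_good == correct) else -1.0
            self.trace += h                         # new feedback joins the smear
        return self.trace

# Usage sketch: the smeared feedback is simply summed with the MDP reward
# before the actor-critic update (Step 6 of Algorithm 1), e.g.
# r_total = r_mdp + trainer.feedback(abs(angle_error) < 0.1)
```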
Parameters were set as follows: α_v = 0.1/m, α_µ = α_σ, γ = 0.9, λ_w = 0.3, λ_v = 0.7, and joint angles were limited by manufacturer specifications to θ ∈ [0.0349, 1.5446] rad. Weight vectors w_µ, w_σ, v, e_µ, e_σ, e_v were initialized to 0, and the standard deviation was bounded below by σ ≥ 0.01. The eligibility trace update for the critic is scaled by γ to speed up convergence. The maximum number of time steps was 10k, learning updates and action selection occurred at 33 Hz (every 30 ms), the angular deviation threshold was set to Δθ_max = 0.1, absolute angular joint updates were clipped to 0.1, and actions were selected and performed on every time step.

The ACRL system was trained online with simulated human feedback and simulated EMG control signals (designed to mimic acceptable control signals). Human feedback is integrated into the learning algorithm as reward accumulated on Step 6 of Algorithm 1. Performance was measured by taking the mean absolute angular error over the last 5k steps. This was done to compare the experimental results after some learning had occurred, and helped to reduce the noise intrinsic in early learning.

This paper presents the results of a parameterized grid sweep over three parameters with estimates of reasonable values: smear = (0.2, 0.5, 0.9), P(feedback) = (0.03, 0.05, 0.09), and P(correct) = (0.6, 0.75, 0.9). Additionally, as a control case, n = 60 trials without human feedback were performed. On all time steps, MDP reward and human reward were directly summed and applied to the learning agent update (Algorithm 1).

Results

The results are presented in Figure 2, which shows performance over a variety of parameter combinations for the latent variables of interest: P(feedback), P(correct), and smear. Results indicate that human interaction improves agent performance on the self-mirroring movement task, where performance is measured by the mean angular error over the last 5k time steps. Fig. 2A shows that a lower probability of potentially incorrect feedback provides better performance. Fig. 2B shows that there may not be a significant difference in performance when varying the probability of human feedback being correct, given the tested values of P(feedback). This may also be due to the tested values, which all corresponded to a greater than 50% chance of being correct. Fig. 2C shows that there is a benefit to selecting a smear decay value appropriate for the task and robotic control system; this parameter may vary from task to task, and care must be taken when selecting the smear constant. The results indicate that there is benefit to be gained by correctly modelling the latent variables associated with the human reward signal, allowing for true simultaneous incorporation of human control and feedback. These results also indicate that the ACRL algorithm is robust to a small amount of incorrect feedback.

On average, without human feedback, the RL agent was able to attain a mean absolute error on the final 5k steps of 0.22 ± 0.02 (mean ± SEM, n=60). In comparison, the best-performing set of parameters (P(feedback) = 0.06, P(correct) = 0.6, smear = 0.5) attained a performance of 0.12 ± 0.01 (mean ± SEM, n=7), while the worst-performing set of parameters (P(feedback) = 0.09, P(correct) = 0.9, smear = 0.9) attained a performance of 0.38 ± 0.18 (mean ± SEM, n=4). A total of 232 trials were run over the parameter combinations.

Figure 2: Mean and standard error over experimental conditions A) P(feedback), B) P(correct), C) smear.
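For reference, the summary statistics reported above (mean absolute angular error over the final 5k steps, with SEM across trials) can be computed with a short routine of the following form; variable names are illustrative and this is not the authors' analysis code.

```python
# A sketch of the performance metric: the mean absolute angular error over the
# final 5k steps of each trial, summarized as mean +/- SEM across the trials of
# one parameter setting. Variable names are illustrative.
import numpy as np

def trial_error(angle_errors, last_n=5000):
    """Mean absolute angular error over the final last_n time steps of one trial."""
    return np.mean(np.abs(angle_errors[-last_n:]))

def summarize(per_trial_errors):
    """Mean and standard error of the mean (SEM) across trials."""
    errors = np.asarray(per_trial_errors, dtype=float)
    return errors.mean(), errors.std(ddof=1) / np.sqrt(len(errors))

# e.g., summarize([trial_error(e) for e in all_trials_for_one_setting])
```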
Discussion

The experiments in this paper are performed using a simulated Nao, a simulated EMG signal, and simulated human feedback. It has previously been shown that the performance of this experimental set-up is comparable between simulation and real-world experiments (Mathewson and Pilarski 2016). In that related work, we explored the degree to which the learning system is affected when incorporating real human feedback. While working in simulation allows rapid iteration and enables testing of many different algorithmic characteristics, simulation is often an easier learning problem than the real world, due to simplified physics and reduced noise. Future work will address robustly modelling real human feedback and will quantify the impact of varying feedback density and correctness. We have shown that smearing of human feedback impacts learning; future work will investigate whether the decay of human-delivered rewards is best modelled as dependent on time or on task performance, and whether optimal decay parameters may be learned online.

In this paper we found that modelling the delivery of human feedback can significantly impact the performance of an ACRL algorithm. While we have not optimized for the human feedback characteristics, these results indicate that some human reward paradigms may be preferable to others (Loftin et al. 2016). This idea is explored in (Macglashan et al. 2016), where, by modelling the user feedback as an advantage function, we can understand positive feedback as 'yes, that was good' and negative feedback as 'no, that was bad'. A greater understanding of human reward strategies is required. Personalized robotics will demand perception of human strategies in order to learn optimal behaviour from very few samples. Future work will focus on predicting and optimizing for known and uncertain feedback strategies.

Linking control signals in state space with feedback-shaping reward signals effectively blends multisensory human data for the learning agent. There remains an open problem of how feedback should best be interpreted by the learning agent and how to encourage human feedback without causing prohibitive additional cognitive load. Modelling, and predicting, human feedback may relieve this burden while still allowing feedback to shape control signal interpretation. Human feedback is beneficial to the agent, provided it adds complementary information about the contextual state the agent is in. Human feedback may shape the MDP reward with more specificity, and more often, than the sparse, delayed, MDP-derived reward.

Our results demonstrate the potential benefits of introducing well-modelled human feedback into a robotic learning system. The inclusion of human shaping signals was shown to improve performance over strictly environmentally derived reward. Providing consistent, correct feedback demands cognitive attention from the user, which may be difficult if the user is also required to provide control signals to the robotic system. Future work will explore the implications of inviting humans to simultaneously provide control and feedback signals to learning systems.

Conclusions

This paper contributes a set of results from experiments incorporating simulated human feedback and simultaneous human control in the training of a semi-autonomous robotic agent. These results indicate that task performance increases with the incorporation of human feedback into existing actor-critic reinforcement learning algorithms.
These results support the idea that human interaction can improve performance in complex robotic tasks when the human feedback is delivered correctly, consistently, and on a time scale consistent with the original learning problem.

This work supports an emerging viewpoint surrounding human training of a robotic system tightly coupled to a user. By showing improved performance of the RL agent, this work further supports the sharing of autonomy between human and machine.

Acknowledgments

This work is supported by the Natural Sciences and Engineering Research Council of Canada (NSERC), Alberta Innovates Technology Futures (AITF), the Canada Research Chairs Program, the Canada Foundation for Innovation, and the Alberta Machine Intelligence Institute (Amii)/Alberta Innovates Centre for Machine Learning (AICML).

References

[Biddiss and Chau 2007] Biddiss, E. A., and Chau, T. T. 2007. Upper limb prosthesis use and abandonment: a survey of the last 25 years. Prosthetics and Orthotics International 31(3):236–257.

[Castellini et al. 2014] Castellini, C.; Artemiadis, P.; Wininger, M.; Ajoudani, A.; Alimusaj, M.; Bicchi, A.; Caputo, B.; Craelius, W.; Dosen, S.; Englehart, K.; Farina, D.; Gijsberts, A.; Godfrey, S. B.; Hargrove, L.; Ison, M.; Kuiken, T.; Marković, M.; Pilarski, P. M.; Rupp, R.; and Scheme, E. 2014. Proceedings of the first workshop on peripheral machine interfaces: Going beyond traditional surface electromyography. Frontiers in Neurorobotics 8:22.

[Chernova and Thomaz 2014] Chernova, S., and Thomaz, A. L. 2014. Robot Learning from Human Teachers. Synthesis Lectures on Artificial Intelligence and Machine Learning 8(3):1–121.

[Knox and Stone 2009] Knox, W. B., and Stone, P. 2009. Interactively Shaping Agents via Human Reinforcement: The TAMER Framework. In Proceedings of the Fifth International Conference on Knowledge Capture, 9–16. New York, New York, USA: ACM Press.

[Knox and Stone 2010] Knox, W. B., and Stone, P. 2010. Combining Manual Feedback with Subsequent MDP Reward Signals for Reinforcement Learning. In Proceedings of the 9th International Conference on Autonomous Agents and Multiagent Systems, 5–12.

[Knox and Stone 2012] Knox, W. B., and Stone, P. 2012. Reinforcement learning from human reward: Discounting in episodic tasks. In The 21st IEEE International Symposium on Robot and Human Interactive Communication, 878–885.

[Knox and Stone 2015] Knox, W. B., and Stone, P. 2015. Framing reinforcement learning from human reward: Reward positivity, temporal discounting, episodicity, and performance. Artificial Intelligence 225:24–50.

[Loftin et al. 2016] Loftin, R.; Peng, B.; MacGlashan, J.; Littman, M. L.; Taylor, M. E.; Huang, J.; and Roberts, D. L. 2016. Learning behaviors via human-delivered discrete feedback: modeling implicit feedback strategies to speed up learning. Autonomous Agents and Multi-Agent Systems 30(1):30–59.

[Macglashan et al. 2016] Macglashan, J.; Littman, M. L.; Roberts, D. L.; Loftin, R.; Peng, B.; and Taylor, M. E. 2016. Convergent Actor Critic by Humans. In International Conference on Intelligent Robots and Systems.

[Mathewson and Pilarski 2016] Mathewson, K. W., and Pilarski, P. M. 2016. Simultaneous Control and Human Feedback in the Training of a Robotic Agent with Actor-Critic Reinforcement Learning. In IJCAI International Joint Conference on Artificial Intelligence, Interactive Machine Learning Workshop.

[Micera, Carpaneto, and Raspopovic 2010] Micera, S.; Carpaneto, J.; and Raspopovic, S. 2010. Control of hand prostheses using peripheral information. IEEE Reviews in Biomedical Engineering 3:48–68.
[Pilarski et al. 2011] Pilarski, P. M.; Dawson, M. R.; Degris, T.; Fahimi, F.; Carey, J. P.; and Sutton, R. S. 2011. Online human training of a myoelectric prosthesis controller via actor-critic reinforcement learning. In IEEE International Conference on Rehabilitation Robotics, 1–7.

[Pilarski, Dick, and Sutton 2013] Pilarski, P. M.; Dick, T. B.; and Sutton, R. S. 2013. Real-time Prediction Learning for the Simultaneous Actuation of Multiple Prosthetic Joints. In IEEE International Conference on Rehabilitation Robotics, 1–8.

[Scheme and Englehart 2011] Scheme, E., and Englehart, K. 2011. Electromyogram pattern recognition for control of powered upper-limb prostheses: State of the art and challenges for clinical use. Journal of Rehabilitation Research and Development 48(6):643–660.

[Skinner 1938] Skinner, B. F. 1938. The Behavior of Organisms: An Experimental Analysis.

[Sutton and Barto 1998] Sutton, R. S., and Barto, A. G. 1998. Reinforcement Learning: An Introduction. Cambridge: MIT Press, 1st edition.

[Thomaz and Breazeal 2008] Thomaz, A. L., and Breazeal, C. 2008. Teachable robots: Understanding human teaching behavior to build more effective robot learners. Artificial Intelligence 172(6-7):716–737.

[Vien, Ertel, and Chung 2013] Vien, N. A.; Ertel, W.; and Chung, T. C. 2013. Learning via human feedback in continuous state and action spaces. Applied Intelligence 39(2):267–278.