Detection of social signals for recognizing engagement in human-robot interaction

Divesh Lala, Koji Inoue, Pierrick Milhorat and Tatsuya Kawahara
Graduate School of Informatics, Kyoto University, Sakyo-ku, Kyoto, Japan

Abstract

Detection of engagement during a conversation is an important function of human-robot interaction. The level of user engagement can influence the dialogue strategy of the robot. Our motivation in this work is to detect several behaviors which will be used as social signal inputs for a real-time engagement recognition model. These behaviors are nodding, laughter, verbal backchannels and eye gaze. We describe models of these behaviors which have been learned from a large corpus of human-robot interactions with the android robot ERICA. Input data to the models comes from a Kinect sensor and a microphone array. Using our engagement recognition model, we achieve reasonable performance with the inputs from automatic social signal detection, compared to using manual annotation as input.

Introduction

Human conversation makes use of a range of non-verbal social signals, which provide additional information about the internal state of the parties involved. Conversational robots should be able to recognize such signals. By doing this, they can control their behavior and dialogue to provide more natural communication with humans.

We assume that some social signals are used to indicate the engagement of a user. Engagement has been described as the process of establishing and maintaining an interaction (Sidner et al. 2005) or, more concretely, how interested and attentive a participant is towards a conversation (Yu, Aoki, and Woodruff 2004). One major goal is to keep the user engaged in a conversation. Therefore, if the robot is able to detect a change in the engagement level of the user, we can formulate a dialogue strategy to make or keep them engaged, thereby improving the user experience. For example, the robot may choose to continue or discontinue the current topic of conversation depending on how engaged it perceives the user to be. This type of scenario motivates the work in this paper.

While engagement cannot be measured directly, we can infer it by observing a number of social signals. In this paper, we discuss four of these: nodding, laughter, verbal backchannels, and eye gaze. These have been identified in other works as being indicative of the engagement level of the user (Rich et al. 2010; Xu, Li, and Wang 2013; Oertel et al. 2015). The general architecture of our engagement system is shown in Figure 1. We use a Kinect sensor and a microphone array as the sensors in our system. Our objective in this paper is to implement models which detect the social signals in real time so that they can be used as inputs to the engagement recognition model.

Figure 1: An overview of the architecture of the engagement recognition system.

The robot which we use for this work is ERICA, a specially designed android modeled after a young Japanese woman (Glas et al. 2016). Our long-term goal is for ERICA to interact with humans in the real world. She has a human-like appearance and over 20 motors in her face which generate life-like facial movements for speech and emotion expression. ERICA can be tele-operated by another human, but we intend for her to be completely autonomous, with the ability to recognize and respond appropriately to natural speech.
Currently ERICA is integrated with a speech recognition system and a module to track the subject's body movements. We use these sensors to train our social signal recognition models.

We expect that ERICA's realism will encourage humans to interact with her as if she were a real human. In previous studies we have developed ERICA's dialogue system (Lala et al. 2017; Milhorat et al. 2017), but have not considered the state of the user. We are now integrating engagement recognition into ERICA's architecture. Our approach in this work is to create a large corpus and annotate social signals related to engagement during interactions with her. The signals occur within a human-robot conversation but are natural behaviors seen in the real world. We then use this data to train social signal detection models for the engagement recognizer. Finally, we compare the performance of the engagement recognition model when using the detection models as input against its performance when using manual annotation as input.

Related Work

Tracking user engagement for agent interaction has been an active area of research (Rich et al. 2010; Forbes-Riley and Litman 2012; Yu et al. 2016; Yu et al. 2017). The general approach is to discover which types of social signals are related to engagement and then create a model which can measure the engagement level of the user. This is often a binary or categorical classification (Bednarik, Eivazi, and Hradis 2012).

The ability to recognize engagement in the user can influence the conversation. Previous works have found that engagement is related to turn-taking (Xu, Li, and Wang 2013; Inoue et al. 2016). There has also been an implementation of an agent which modifies its dialogue strategy according to the engagement level of the user (Yu et al. 2016). We intend for ERICA to be used in a number of conversational scenarios such as interviewing, so the ability to detect the engagement level of the user is crucial to maintaining a positive interaction.

We must first detect the relevant social signals which indicate engagement. There have been numerous works which detail these behaviors and apply them to agents and robots. They include facial expressions (Castellano et al. 2009; Yu et al. 2017), posture (Sanghvi et al. 2011), and conversational phenomena (Rich et al. 2010; Xu, Li, and Wang 2013), as well as the four social signals which we investigate in this work. Typically the models to recognize social signals are trained using machine learning techniques.

Our goal is to extend this research by not only recognizing social signals but also creating the engagement recognition model itself. We propose a hierarchical Bayesian model. Rather than arbitrarily choosing which social signals are needed, we conducted an experiment to determine which are relevant as judged by multiple third-party observers of an interaction (Inoue et al. 2016). This is similar to the approach of a recent work (Oertel et al. 2015), where the engagement model is directly learned from the results of the annotation. Our model also considers the latent character of the observer as an underlying variable to accommodate differing personalities. In this work we describe the detection of the social signals identified as a result of our experiment.

Figure 2: ERICA participating in a conversation with a user.

Data Collection

We created our own corpus to collect the input data for training. The corpus consists of 91 conversational sessions between a human subject and ERICA.
Each session lasted between 5 and 20 minutes. An image of the scenario is shown in Figure 2. Note the Kinect sensor to the right of ERICA and the vase on the table, which is actually a microphone array.

In the conversation scenario, ERICA plays the role of a secretary for a laboratory in a university. The subjects arrive at the lab with the intention of talking to the professor, who is temporarily absent. ERICA informs the subject of this fact and begins a conversation while they are waiting. The topics of conversation varied and included the subject's hobbies and interests, life as a student, and their thoughts on android robots. ERICA asked and answered questions about herself according to rough guidelines set out before the experiment.

ERICA was controlled by a female operator in a remote, hidden location. We used six different operators during the sessions, all of whom were professional voice actors. The subjects heard the operator's speech, which was also used to move ERICA's lips and mouth in a natural manner (Sakai et al. 2015). The operator could also control ERICA's head movements through the use of controllers.

We recorded several streams of data for the conversational sessions. These included multiple perspectives of video, audio channels of both ERICA and the subject, and motion capture data from a Kinect sensor located next to ERICA. The relevant data from the Kinect sensor are the yaw, roll and pitch values of the head, captured at 30 Hz relative to the Kinect sensor. We used several microphones to capture separate audio channels of ERICA and the user. The Kinect and audio streams were synchronized.

To create a ground truth, annotators watched videos of the sessions and marked the beginning and end points of each target behavior. We only consider behaviors which occurred during ERICA's conversational turn. Due to the time taken to annotate the sessions, we do not use all of them when training the behavior models. Furthermore, the data used to train each of the behaviors are not necessarily consistent across all the models, because different annotators were responsible for different types of behavior and so worked independently on the sessions.

Selection of Engagement-related Behaviors

Previously we conducted an experiment in which multiple third-party annotators watched the sessions of interaction and identified the behaviors which accompany a change in perceived engagement (Inoue et al. 2016). The outcome of this experiment was that nodding, laughter, verbal backchannels, and eye gaze were commonly chosen by the annotators as being indicative of engagement. The annotation of engagement was inconsistent among the third-party annotators, so we also modeled the character of the annotator as a latent variable. Our experiment showed that facial expression was also important, but because it is more difficult to annotate, we omit it from our model.

We focus on engagement as a function of the behaviors of the human in their role as a listener. Our system has three main tasks. The first is to measure the behaviors of the user during the system's speaking turn, which is the focus of this work. Secondly, we estimate their level of engagement based on the previously mentioned behaviors using a hierarchical Bayesian model. Thirdly, we choose the robot's dialogue strategy for her next turn based on the engagement level. This is to be considered in the near future.
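The following minimal sketch illustrates how this per-turn flow fits together. The class and function names are illustrative placeholders rather than our actual implementation, and the engagement estimate shown is a dummy stand-in for the hierarchical Bayesian model described later.

```python
from dataclasses import dataclass

@dataclass
class TurnSignals:
    """Binary social signals observed during one system speaking turn."""
    nodding: bool
    laughter: bool
    backchannel: bool
    gaze_at_robot: bool

def estimate_engagement(signals: TurnSignals) -> float:
    """Placeholder for the hierarchical Bayesian model described later.
    Here we simply average the detected signals to get a dummy score in [0, 1]."""
    detected = [signals.nodding, signals.laughter,
                signals.backchannel, signals.gaze_at_robot]
    return sum(detected) / len(detected)

def choose_strategy(engagement: float) -> str:
    """Task 3 (future work): pick a dialogue strategy from the engagement level."""
    return "continue_topic" if engagement >= 0.5 else "change_topic"

# Per-turn flow: (1) detect signals during the system's speaking turn,
# (2) estimate engagement, (3) select the next dialogue strategy.
signals = TurnSignals(nodding=True, laughter=False,
                      backchannel=True, gaze_at_robot=True)
print(choose_strategy(estimate_engagement(signals)))
```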
The type of classification for each behavior is slightly different. Nodding is detected continuously as soon as new output from the Kinect sensor is received. Laughter and backchannels are detected whenever a new inter-pausal unit (IPU) is uttered by the user. In the live system, this IPU is an output of the Japanese automatic speech recognition (ASR) system Julius (Lee, Kawahara, and Shikano 2001). Eye gaze is detected per speaker turn.

The integrated system can recognize the engagement level during the system's speaking turn. The only hardware required to implement this system is a Kinect sensor and a microphone array, so we expect that it can be used for other conversational robots. In the next sections we describe the behavioral models and their performance in detail.

Nodding

Nodding is often used as a backchannel by the listener, particularly in Japanese conversations (Maynard 1987; Hanzawa 2012). Nodding was also identified by annotators as being indicative of engagement. Several nodding detection algorithms have been implemented in previous works using machine learning methods such as hidden Markov models, support vector machines and hidden conditional random fields (Fujie et al. 2004; Morency et al. 2005; Wang et al. 2006). We used a long short-term memory (LSTM) network (Hochreiter and Schmidhuber 1997) for the head nodding model, an approach which can be readily applied to gesture recognition (Ordóñez and Roggen 2016).

Input to this model consists of a sequence of frames from the Kinect sensor. The LSTM was constructed with TensorFlow, using a GPU on a desktop computer. From preliminary trials we found that a sequence of 30 frames (approximately 300 ms of data) was suitable for training. For each frame we calculated 7 features. The most basic of these were the instantaneous speeds of the yaw, roll and pitch of the head. We also included features related only to the head pitch over the previous 15 frames: the average speed, average velocity, acceleration and range. To classify a sequence as a nod, we observe whether the final frame in the sequence falls within a nod interval in the ground truth annotation.

We used an LSTM with 16 nodes, a learning rate of 0.001 and equally weighted samples. The mini-batch size was fixed at 32 samples. Every five epochs we measured the error on the validation set to determine early stopping; this epoch was then used as a parameter when evaluating the test set.

                               Median   IQR
Listener turns per session       38      8.5
Length of listener turns (s)    4.89    7.03
Nods per session                 31     30.5
Length of nods (s)              0.74    0.50

Table 1: Statistics of the 19 conversation sessions used for the nodding model. Distributions are highly skewed, so we report the median and inter-quartile range (IQR).

Due to the large amount of data and the time needed for annotation, we used only 19 sessions in this work. Table 1 displays general statistics for this subset of the corpus. We extracted 27,360 samples from the corpus, with 3,152 samples in the nodding class (approximately 11.5% of the total). We performed 10-fold cross-validation across the 19 sessions, with each fold consisting of two sessions for testing and two sessions for validation, except for one fold which contained only one session each for validation and testing.
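As a rough illustration, a 16-unit LSTM binary classifier over 30-frame windows of the 7 head-movement features could be assembled as follows. This is a minimal sketch that assumes the feature windows have already been extracted into arrays, uses random placeholder data in place of the annotated corpus, and simplifies our validation-based early stopping to a standard Keras callback.

```python
import numpy as np
import tensorflow as tf

SEQ_LEN, N_FEATURES = 30, 7  # 30 Kinect frames x 7 head-movement features

def build_nod_detector() -> tf.keras.Model:
    # Binary classifier: does this window end inside a nod?
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(SEQ_LEN, N_FEATURES)),
        tf.keras.layers.LSTM(16),                        # 16 LSTM units
        tf.keras.layers.Dense(1, activation="sigmoid"),  # P(nod)
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
                  loss="binary_crossentropy",
                  metrics=[tf.keras.metrics.Precision(), tf.keras.metrics.Recall()])
    return model

# X: (n_samples, 30, 7) feature windows, y: (n_samples,) 0/1 nod labels.
X = np.random.randn(512, SEQ_LEN, N_FEATURES).astype("float32")  # placeholder data
y = np.random.randint(0, 2, size=512)

model = build_nod_detector()
model.fit(X, y, batch_size=32, epochs=20, validation_split=0.2,
          callbacks=[tf.keras.callbacks.EarlyStopping(monitor="val_loss",
                                                      restore_best_weights=True)])
```

At inference time a window is formed from the most recent 30 frames each time a new Kinect frame arrives, matching the continuous detection described above.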
We compared our model to a baseline model which always predicts the positive class, a support vector machine (SVM) with a radial basis kernel, a single hidden layer neural network, and a deep neural network (DNN) with two hidden layers. We used the same features as the LSTM, but combined them into one input vector of size 210 (30 frames x 7 features). The neural network and DNN used the same learning rate and mini-batch size as the LSTM, and each hidden layer in the neural network and DNN had 128 nodes. We performed a frame-wise evaluation of the models. The performance across all test sets for the positive class is shown in Table 2.

Nodding model        Prec.   Rec.    F1     Acc.
Baseline             0.115   1.000   0.206  0.796
Neural network       0.482   0.411   0.444  0.881
SVM (radial basis)   0.424   0.542   0.475  0.863
DNN                  0.511   0.516   0.514  0.887
LSTM                 0.566   0.589   0.577  0.901

Table 2: Performance of frame-wise nodding detection models.

For frame-wise detection, the LSTM approach outperforms the other models on all metrics. We also evaluated the models by detecting nodding sequences rather than individual frames. We set a minimum length l for a nodding sequence and only label a nod if it is continuously detected as such for longer than l milliseconds, in order to reject short, isolated sequences of detections. If a detected nod sequence from the model overlaps with a ground truth nod sequence, we label it as correct. We evaluated various values of l and found that setting this value to 300 ms was optimal for all the models. Results are shown in Table 3.

Nodding model        Prec.   Rec.    F1
Neural network       0.551   0.581   0.566
SVM (radial basis)   0.550   0.638   0.591
DNN                  0.553   0.687   0.613
LSTM                 0.608   0.763   0.677

Table 3: Performance of event-wise nodding detection models.

We again find that the LSTM is the best performing model, and that event-wise detection is better than frame-wise detection.

Laughter

Some works have used smiling to detect engagement (Castellano et al. 2009; Yu et al. 2017). Smiling generally displays a positive feeling toward the agent, so it is natural that it would indicate engagement. However, reliably recognizing smiles would require us to integrate a camera into our system. As a proxy, we hypothesize that laughter also indicates engagement; this was also confirmed by the third-party annotators in our experiment. For these reasons we opt to build a system that can detect laughter, making use of the microphone array that is already used by the system. Automatic laughter detection is a well-researched topic (Cosentino, Sessa, and Takanishi 2016).

We constructed a binary classifier to detect whether or not a given IPU contains laughter. For each IPU, we extracted prosodic and linguistic features. We use the duration and the voiced-unvoiced intensity ratio of the IPU, plus six features each for its pitch and intensity (the mean, median, slope, minimum, maximum and range), giving a total of 14 prosodic features. For linguistic features, we include the classifications of the previous 5 IPUs, a normalized word representation of previous utterances provided by Word2Vec (Mikolov et al. 2013), and part-of-speech tags provided by JUMAN++ (Morita, Kawahara, and Kurohashi 2015). A total of 55 linguistic features are used.

We classified an IPU as containing laughter whether the laughter occurred in isolation or as part of an utterance. We used 46 sessions of the corpus to test the laughter detection system. This subset contained 9,320 IPUs, of which 6.3% were laughter samples. Cross-validation was performed leaving one session out for each fold. We tested various models and found that a two-layer DNN had the best performance.
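For illustration, prosodic features of this kind could be extracted from an IPU's audio with an off-the-shelf toolkit such as librosa, as sketched below. The choice of toolkit, the contour extraction settings, and the exact definitions of the slope statistic and of the voiced-unvoiced intensity ratio are assumptions for this sketch, not necessarily those used in our implementation.

```python
import numpy as np
import librosa

def _stats(x: np.ndarray) -> list:
    """Mean, median, slope, min, max and range of a 1-d contour."""
    t = np.arange(len(x))
    slope = np.polyfit(t, x, 1)[0] if len(x) > 1 else 0.0
    return [np.mean(x), np.median(x), slope, np.min(x), np.max(x),
            np.max(x) - np.min(x)]

def prosodic_features(wav_path: str) -> np.ndarray:
    """14 prosodic features for one IPU: duration, voiced-unvoiced intensity
    ratio, and six statistics each for the pitch and intensity contours."""
    y, sr = librosa.load(wav_path, sr=16000)
    duration = len(y) / sr

    # Pitch contour (F0) with a per-frame voicing decision.
    f0, voiced, _ = librosa.pyin(y, fmin=60.0, fmax=400.0, sr=sr)
    # Intensity contour approximated by RMS energy.
    rms = librosa.feature.rms(y=y)[0]

    n = min(len(f0), len(rms))
    f0, voiced, rms = f0[:n], voiced[:n], rms[:n]

    # Voiced-unvoiced intensity ratio (one plausible definition).
    v_int = rms[voiced].mean() if voiced.any() else 0.0
    u_int = rms[~voiced].mean() if (~voiced).any() else 1e-8
    vu_ratio = v_int / max(u_int, 1e-8)

    pitch = f0[voiced] if voiced.any() else np.zeros(1)
    return np.array([duration, vu_ratio] + _stats(pitch) + _stats(rms),
                    dtype=np.float32)
```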
Results of the laughter detection model are shown in Table 4. We compare two types of models to the baseline: the first uses only prosodic features, while the second adds linguistic information based on the transcription of the corpus.

Model                  Prec.   Rec.    F1     Acc.
Baseline               0.063   1.000   0.119  0.882
Prosody only           0.343   0.222   0.269  0.927
Prosody + linguistic   0.587   0.455   0.513  0.947

Table 4: Performance of laughter detection models.

The addition of linguistic features improves the performance of the model, largely because of transcriptions such as "hahaha" which explicitly represent laughter. However, we cannot guarantee that an ASR system would successfully generate the same output as the transcription. In this case, a prosody-only model provides an alternative solution. We acknowledge that the inclusion of other spectral features could considerably improve the model.

Verbal backchannels

Backchannels are short responses to the speaker during a conversation, such as mm and uh-huh in English or un and ee in Japanese, where they are termed aizuchi. They occur more frequently in Japanese than in English and are often accompanied by head movements (Ike 2010). Backchannels have been identified in several works as an indicator of engagement (Rich et al. 2010; Oertel et al. 2015) and were confirmed as such by the annotators in our experiment.

Our goal is to create a model which can effectively recognize backchannels during the robot's speaking turn. Although our main task is to use the results of this recognition for our engagement model, it is also useful to classify backchannels because the system should distinguish them from situations where the user is actually starting their speaking turn or trying to barge in during her speech.

We again use IPUs as input and construct a binary classifier to determine whether an IPU is a backchannel or not. We used the same sessions for training as for the laughter detection model; 26.2% of the samples were verbal backchannels. We also used the same features as the laughter detection model and found that the best model was a random forest with 56 estimators. Table 5 shows the results of backchannel detection. As with laughter detection, we compared the model with and without linguistic features.

Model                  Prec.   Rec.    F1     Acc.
Baseline               0.262   1.000   0.415  0.613
Prosody only           0.818   0.745   0.780  0.887
Prosody + linguistic   0.909   0.926   0.918  0.955

Table 5: Performance of backchannel detection models.

We again see that linguistic features improve the performance of the model. This is expected, as backchannels have several common lexical forms. However, even using only prosodic features the model still achieves a reasonable F1 score, so we do not need to rely on a correct ASR output.

Eye gaze

Eye gaze behavior has been identified in previous research as indicative of engagement (Nakano and Ishii 2010; Rich et al. 2010), and the annotators in our experiment confirmed this. We consider that the most important aspect of the user's eye gaze is whether they are looking at ERICA while she is speaking.

Although eye-gaze models using computer vision techniques have been developed in previous work (Sewell and Komogortsev 2010; Zhang et al. 2015), we opted for a simpler geometry-based method using the Kinect sensor. We define the 3-d world co-ordinates of ERICA and of the Kinect sensor, and receive the head orientation vector provided by the Kinect sensor. We first transform this orientation into world co-ordinates so that we have a vector whose origin is the user's head and whose direction is the head orientation. We then use collision detection to check for intersections between this vector and a 30 cm sphere around ERICA's head; the sphere accommodates the fact that we measure head orientation rather than actual eye gaze. If there is an intersection, we label the user as looking at ERICA.
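The geometric test itself reduces to a ray-sphere intersection, sketched below. The coordinate values are placeholders: in practice the head position and orientation from the Kinect must first be transformed into the same world coordinate frame as ERICA, and the actual positions depend on the room setup.

```python
import numpy as np

ERICA_HEAD = np.array([0.0, 1.2, 1.5])  # world coordinates (m), placeholder values
GAZE_RADIUS = 0.30                       # 30 cm sphere around ERICA's head

def looking_at_erica(head_pos: np.ndarray, head_dir: np.ndarray) -> bool:
    """Ray-sphere check: does the head-orientation ray (already in world
    coordinates) pass within GAZE_RADIUS of ERICA's head?"""
    d = head_dir / np.linalg.norm(head_dir)   # unit direction of the ray
    v = ERICA_HEAD - head_pos                 # user's head -> ERICA's head
    t = np.dot(v, d)                          # projection onto the ray
    if t < 0:                                 # ERICA is behind the user
        return False
    closest = head_pos + t * d                # closest point on the ray
    return np.linalg.norm(ERICA_HEAD - closest) <= GAZE_RADIUS

# Example: user 1.5 m in front of ERICA, facing roughly toward her.
print(looking_at_erica(np.array([0.0, 1.1, 0.0]), np.array([0.02, 0.05, 1.0])))
```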
From our experiment we found that the highest inter-annotator agreement on engagement (Spearman's correlation coefficient of 0.375) occurred when the subject gazed at ERICA continuously for 10 seconds during her speaking turn. We therefore use this rule as the basis for the eye gaze input. Turns shorter than 10 seconds were classified as negative by the eye gaze model. We manually annotated a ground truth of eye gaze through visual observation and labelled each turn according to the 10-second rule. We then generated labels according to the output of our eye gaze model. The number of turns labeled as positive (the user gazed at ERICA for at least 10 seconds continuously) was 17.1% of the total. The model was tested using 20 sessions of data, and results are shown in Table 6.

Model        Prec.   Rec.    F1     Acc.
Baseline     0.171   1.000   0.292  0.716
Gaze model   0.504   0.580   0.539  0.847

Table 6: Performance of the eye gaze detection model.

We find that the model works reasonably well for classifying continuous gaze behavior. However, in our corpus we could only estimate the positions of ERICA's head and the Kinect sensor by observing the video. In the live system we calculate these values exactly, so we expect the performance of the model to improve on this result.

Engagement Recognition

We have described four social signals and their recognition models. The performance of the models varies, but our goal is not to produce state-of-the-art individual models; it is to assess whether they can be used in conjunction with our engagement recognition model. This is a hierarchical Bayesian binary classifier which predicts whether a listener is engaged or not during the system's speaking turn. The graphical model is shown in Figure 3.

For our evaluation we selected 20 sessions from our corpus, a different subset from those used to train the individual social signal models. We recruited 12 third-party observers to watch videos of these sessions and annotate time points at which they perceived that the engagement of the subject changed to high. We defined engagement for the annotators as "how much the subject is interested in and willing to continue the current dialogue with ERICA". The annotator is represented by the variable i in our graphical model.

The input to the model, xt, is a state defined as the observed combination of social signals during a turn. Each social signal is a binary value (either detected or not detected), giving 16 possible combinations. When training the model we know this combination, but in a live system we marginalize over each behavior combination using its prior probability.
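Concretely, one way to index the 16 behavior combinations is to treat the four binary signals as bits, as in the following sketch; the encoding order is arbitrary and purely illustrative.

```python
from itertools import product

def encode_state(nodding: bool, laughter: bool, backchannel: bool, gaze: bool) -> int:
    """Map the four binary social signals of one turn to a state index in [0, 15]."""
    bits = (nodding, laughter, backchannel, gaze)
    return sum(int(b) << i for i, b in enumerate(bits))

# All 16 possible behavior combinations x_t.
ALL_STATES = [encode_state(*combo) for combo in product([False, True], repeat=4)]

print(encode_state(nodding=True, laughter=False, backchannel=True, gaze=True))  # 13
```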
We also define a variable k, which represents a specific character type of an annotator. For example, laughter may be influential for one character type but not so important for another; annotators with different characters would therefore perceive laughter differently in terms of perceived engagement. We created a distribution of characters for each annotator and found that the annotators could be grouped by similar character. From our experiments we found that including a character variable improved the model's performance and that the best model had three different character types (K = 3).

Figure 3: Graphical model of the engagement recognizer.

For brevity, we omit the details of most of the calculations used in our model. The posterior probability of perceived engagement is

p(y_{it} \mid x_t, i, \tilde{\Theta}, \tilde{\Phi}) = \sum_{k=1}^{K} \tilde{\theta}_{ik} \, \tilde{\phi}_{k x_t}    (1)

where x_t is the observed behavior combination for a system turn, i is the index of the annotator, and \tilde{\Theta} and \tilde{\Phi} are the learned parameters of the latent character model. \tilde{\theta}_{ik} is the probability that annotator i has character k, and \tilde{\phi}_{k x_t} is the probability that the behavior combination x_t is perceived as engaged by an observer of character k.
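Equation (1) is straightforward to evaluate once \tilde{\Theta} and \tilde{\Phi} have been learned. The following sketch uses random placeholder parameters purely to illustrate the computation, not the values estimated from our corpus.

```python
import numpy as np

K = 3           # number of latent character types
N_STATES = 16   # possible combinations of the four binary signals

rng = np.random.default_rng(0)
# theta[i, k]: probability that annotator i has character k (rows sum to 1).
theta = rng.dirichlet(np.ones(K), size=12)
# phi[k, x]: probability that behavior combination x is perceived as
# engaged by an observer of character k.
phi = rng.uniform(size=(K, N_STATES))

def p_engaged(i: int, x_t: int) -> float:
    """Posterior probability of perceived engagement, eq. (1):
    p(y_it | x_t, i) = sum_k theta[i, k] * phi[k, x_t]."""
    return float(theta[i] @ phi[:, x_t])

print(p_engaged(i=0, x_t=13))
```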
We compared the performance of our engagement model under two conditions: using manually annotated data as inputs and using the results of our social signal detection models as inputs. Engagement is classified per robot speaking turn. We also analyze the model according to whether it uses contextual information, meaning that the engagement result for the system's previous turn is used as a feature for classifying engagement in her current turn. We use the area under the precision-recall curve (AUC) as a performance measure. The results are shown in Table 7.

Labeling method      No context   Context
Manual annotation       0.650      0.669
Detection system        0.615      0.620

Table 7: AUC scores for the engagement model using manual annotation and our social signal detection models.

From our results we see a drop in the performance of the engagement model when using social signal detection compared to manual annotation, but the drop is not drastic. We also see that adding contextual information improves performance. These results also show that even though the individual models do not all have high recognition ability, their combination is adequate for engagement recognition.

Discussion

We find that the individual social signal models have varying levels of performance. Laughter detection is quite poor, while backchannel detection is considerably better. To improve the performance of our recognition systems, we could add other modalities, as in other works (Morency et al. 2005). In particular, the co-gesture of verbal backchannels and nodding could help to improve both systems. Similarly, visual recognition of laughing would improve the results of laughter detection. We are working to improve the performance of the individual models, including using spectral features for better laughter and backchannel detection.

Although we have evaluated the model using our corpus, we expect that laughter and backchannel detection performance in the live system will be slightly degraded, because these signals are mixed in with ERICA's speech during her turn. We limit this by using the microphone array to suppress ERICA's voice, but cannot guarantee that the user's speech will be clean. Other previous works tend to focus on non-verbal social signals, but our third-party annotation experiment shows that verbal signals are also necessary.

Our data was collected in a one-to-one conversational setting, but the models are not restricted to this specific environment. Nodding, laughter and backchannel detection are independent of both environment and user. The eye gaze model needs to be calibrated to accommodate the positions of the Kinect and ERICA. We have successfully implemented all our models in a separate environment with different placements of the Kinect and ERICA. We therefore propose that the models can function in a variety of conversational settings, including multi-party dialogue.

Our next step is to use the results of engagement recognition to modify the dialogue policy of the system. We consider that the engagement of the user has an influence on turn-taking behavior or on changing the topic of conversation. However, the flow of the dialogue may also be modified more substantially. One scenario we are considering is ERICA giving a technical explanation: by recognizing whether the user is engaged, the robot may use simpler terminology to make her talk more understandable. We intend to formulate such scenarios where conversational engagement recognition is necessary and then conduct user experiments to confirm the effectiveness of our system.

Conclusion

This paper described models for detecting nodding, laughter, verbal backchannels and eye gaze, which will be used by an engagement recognizer during conversation with a robot. The robot we use in this work is the android ERICA. We selected these social signals based on a previous experiment in which third-party observers annotated changes in engagement based on behaviors. The inputs are a Kinect sensor and a microphone array. Although the performance of our models varies, their combination is effective: we observe only a slight degradation in performance for our engagement recognition model when using the outputs of social signal detection compared to annotated values. We have integrated these models into ERICA's system architecture and intend to make her an engagement-aware conversational robot.

Acknowledgements

This work was supported by the JST ERATO Ishiguro Symbiotic Human-Robot Interaction program (Grant Number JPMJER1401), Japan.

References

[Bednarik, Eivazi, and Hradis 2012] Bednarik, R.; Eivazi, S.; and Hradis, M. 2012. Gaze and conversational engagement in multiparty video conversation: an annotation scheme and classification of high and low levels of engagement. In Proceedings of the 4th Workshop on Eye Gaze in Intelligent Human Machine Interaction, 10. ACM.

[Castellano et al. 2009] Castellano, G.; Pereira, A.; Leite, I.; Paiva, A.; and McOwan, P. W. 2009. Detecting user engagement with a robot companion using task and social interaction-based features. In Proceedings of the 2009 International Conference on Multimodal Interfaces, 119-126. ACM.

[Cosentino, Sessa, and Takanishi 2016] Cosentino, S.; Sessa, S.; and Takanishi, A. 2016. Quantitative laughter detection, measurement, and classification - a critical survey. IEEE Reviews in Biomedical Engineering 9:148-162.

[Forbes-Riley and Litman 2012] Forbes-Riley, K., and Litman, D. 2012. Adapting to multiple affective states in spoken dialogue. In Proceedings of the 13th Annual Meeting of the Special Interest Group on Discourse and Dialogue, 217-226. Association for Computational Linguistics.

[Fujie et al. 2004] Fujie, S.; Ejiri, Y.; Nakajima, K.; Matsusaka, Y.; and Kobayashi, T. 2004. A conversation robot using head gesture recognition as para-linguistic information. In Robot and Human Interactive Communication, 2004 (RO-MAN 2004), 13th IEEE International Workshop on, 159-164. IEEE.

[Glas et al. 2016] Glas, D. F.; Minato, T.; Ishi, C. T.; Kawahara, T.; and Ishiguro, H. 2016. ERICA: The ERATO intelligent conversational android. In 25th IEEE International Symposium on Robot and Human Interactive Communication (RO-MAN), 22-29. IEEE.
[Hanzawa 2012] Hanzawa, C. 2012. Listening behaviors in Japanese: Aizuchi and head nod use by native speakers and second language learners. Ph.D. Dissertation, University of Iowa.

[Hochreiter and Schmidhuber 1997] Hochreiter, S., and Schmidhuber, J. 1997. Long short-term memory. Neural Computation 9(8):1735-1780.

[Ike 2010] Ike, S. 2010. Backchannel: A feature of Japanese English. In The Japan Association for Language Teachers (JALT) 2009 Conference Proceedings, 205-215.

[Inoue et al. 2016] Inoue, K.; Lala, D.; Nakamura, S.; Takanashi, K.; and Kawahara, T. 2016. Annotation and analysis of listener's engagement based on multi-modal behaviors. In Proceedings of the Workshop on Multimodal Analyses Enabling Artificial Agents in Human-Machine Interaction, MA3HMI '16, 25-32. ACM.

[Lala et al. 2017] Lala, D.; Milhorat, P.; Inoue, K.; Ishida, M.; Takanashi, K.; and Kawahara, T. 2017. Attentive listening system with backchanneling, response generation and flexible turn-taking. In Proceedings of the 18th Annual Meeting of the Special Interest Group on Discourse and Dialogue, 127-136. Association for Computational Linguistics.

[Lee, Kawahara, and Shikano 2001] Lee, A.; Kawahara, T.; and Shikano, K. 2001. Julius - an open source realtime large vocabulary recognition engine. In EUROSPEECH, 1691-1694.

[Maynard 1987] Maynard, S. K. 1987. Interactional functions of a nonverbal sign: Head movement in Japanese dyadic casual conversation. Journal of Pragmatics 11(5):589-606.

[Mikolov et al. 2013] Mikolov, T.; Chen, K.; Corrado, G.; and Dean, J. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

[Milhorat et al. 2017] Milhorat, P.; Lala, D.; Inoue, K.; Zhao, T.; Ishida, M.; Takanashi, K.; Nakamura, S.; and Kawahara, T. 2017. A conversational dialogue manager for the humanoid robot ERICA. In Proceedings of IWSDS.

[Morency et al. 2005] Morency, L.-P.; Sidner, C.; Lee, C.; and Darrell, T. 2005. Contextual recognition of head gestures. In Proceedings of the 7th International Conference on Multimodal Interfaces, 18-24. ACM.

[Morita, Kawahara, and Kurohashi 2015] Morita, H.; Kawahara, D.; and Kurohashi, S. 2015. Morphological analysis for unsegmented languages using recurrent neural network language model. In Màrquez, L.; Callison-Burch, C.; Su, J.; Pighin, D.; and Marton, Y., eds., EMNLP, 2292-2297. Association for Computational Linguistics.

[Nakano and Ishii 2010] Nakano, Y. I., and Ishii, R. 2010. Estimating user's engagement from eye-gaze behaviors in human-agent conversations. In Proceedings of the 15th International Conference on Intelligent User Interfaces, 139-148. ACM.

[Oertel et al. 2015] Oertel, C.; Funes Mora, K. A.; Gustafson, J.; and Odobez, J.-M. 2015. Deciphering the silent participant: On the use of audio-visual cues for the classification of listener categories in group discussions. In Proceedings of the 2015 ACM International Conference on Multimodal Interaction, 107-114. ACM.

[Ordóñez and Roggen 2016] Ordóñez, F. J., and Roggen, D. 2016. Deep convolutional and LSTM recurrent neural networks for multimodal wearable activity recognition. Sensors 16(1):115.

[Rich et al. 2010] Rich, C.; Ponsler, B.; Holroyd, A.; and Sidner, C. L. 2010. Recognizing engagement in human-robot interaction. In 2010 5th ACM/IEEE International Conference on Human-Robot Interaction (HRI), 375-382. IEEE.
[Sakai et al. 2015] Sakai, K.; Ishi, C. T.; Minato, T.; and Ishiguro, H. 2015. Online speech-driven head motion generating system and evaluation on a tele-operated robot. In 24th IEEE International Symposium on Robot and Human Interactive Communication (RO-MAN), 529-534. IEEE.

[Sanghvi et al. 2011] Sanghvi, J.; Castellano, G.; Leite, I.; Pereira, A.; McOwan, P. W.; and Paiva, A. 2011. Automatic analysis of affective postures and body motion to detect engagement with a game companion. In Proceedings of the 6th International Conference on Human-Robot Interaction, HRI '11, 305-312. ACM.

[Sewell and Komogortsev 2010] Sewell, W., and Komogortsev, O. 2010. Real-time eye gaze tracking with an unmodified commodity webcam employing a neural network. In CHI '10 Extended Abstracts on Human Factors in Computing Systems, 3739-3744. ACM.

[Sidner et al. 2005] Sidner, C. L.; Lee, C.; Kidd, C. D.; Lesh, N.; and Rich, C. 2005. Explorations in engagement for humans and robots. Artificial Intelligence 166(1-2):140-164.

[Wang et al. 2006] Wang, S. B.; Quattoni, A.; Morency, L.-P.; Demirdjian, D.; and Darrell, T. 2006. Hidden conditional random fields for gesture recognition. In Computer Vision and Pattern Recognition, 2006 IEEE Computer Society Conference on, volume 2, 1521-1527. IEEE.

[Xu, Li, and Wang 2013] Xu, Q.; Li, L.; and Wang, G. 2013. Designing engagement-aware agents for multiparty conversations. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, 2233-2242. ACM.

[Yu, Aoki, and Woodruff 2004] Yu, C.; Aoki, P. M.; and Woodruff, A. 2004. Detecting user engagement in everyday conversations. In INTERSPEECH 2004 - ICSLP, 8th International Conference on Spoken Language Processing.

[Yu et al. 2016] Yu, Z.; Nicolich-Henkin, L.; Black, A. W.; and Rudnicky, A. I. 2016. A Wizard-of-Oz study on a non-task-oriented dialog system that reacts to user engagement. In Proceedings of the 17th Annual Meeting of the Special Interest Group on Discourse and Dialogue, 55-63. Association for Computational Linguistics.

[Yu et al. 2017] Yu, Z.; Ramanarayanan, V.; Lange, P.; and Suendermann-Oeft, D. 2017. An open-source dialog system with real-time engagement tracking for job interview training applications. In Proceedings of IWSDS.

[Zhang et al. 2015] Zhang, X.; Sugano, Y.; Fritz, M.; and Bulling, A. 2015. Appearance-based gaze estimation in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 4511-4520.