A Cross-cultural Corpus of Annotated Verbal and Nonverbal Behaviors in Receptionist Encounters

Maxim Makatchev
Robotics Institute, Carnegie Mellon University, Pittsburgh, PA, USA
mmakatch@cs.cmu.edu

Reid Simmons
Robotics Institute, Carnegie Mellon University, Pittsburgh, PA, USA
reids@cs.cmu.edu

Majd Sakr
Carnegie Mellon University in Qatar, Doha, Qatar
msakr@qatar.cmu.edu

ABSTRACT

We present the first annotated corpus of nonverbal behaviors in receptionist interactions, and the first nonverbal corpus (excluding the original video and audio data) of service encounters freely available online. Native speakers of American English and Arabic participated in a naturalistic role play at reception desks of university buildings in Doha, Qatar and Pittsburgh, USA. Their manually annotated nonverbal behaviors include gaze direction, hand and head gestures, torso positions, and facial expressions. We discuss possible uses of the corpus and envision it becoming a useful tool for the human-robot interaction community.

1. INTRODUCTION

Behavioral realism has been one of the promising directions in the development of on-screen conversational agents and robots capable of natural language dialogue (see [19] for an overview). For example, interactions with a robot receptionist that evoke users’ social responses are associated with better engagement and a lower rate of breakdowns during information-seeking dialogues [21]. A necessary step in designing such interactions is to identify behaviors with the potential to evoke a desired user response.

Data sources that can be used to harvest behavior candidates include ethnographic and controlled studies. Ethnographic studies provide an opportunity to collect naturalistic conversational data, but often face the issues of an unclear sample population and coarse granularity of the captured data [4]. On the other hand, collecting high-resolution data in a controlled setting may hamper the spontaneity and naturalness of the interaction.
In general, data collection methodology can influence both the sociopragmatic choices, namely, what speech act to perform, and their pragmalinguistic realization, namely, how to say it (see [4] for a discussion).

These methodological difficulties, combined with the challenges of annotating multimodal data, result in a lack of annotated corpora of naturalistic interactions for many scenarios that are currently relevant for human-robot interaction research. The corpus of role plays between a visitor and a receptionist in a realistic environment that we present in this paper attempts to help fill this gap.

In the next section, we describe related work on corpora of service encounters. After that, we introduce our data collection methodology and the annotation scheme we use. We conclude with a discussion of possible uses of the corpus.

Copyright is held by the author/owner(s).

2. CORPORA OF SERVICE ENCOUNTERS

Audio corpora of human service encounters have been used for analysis of linguistic and paralinguistic features, such as timing and prosody. For example, the Vienna-Oxford International Corpus of English (VOICE) [20] includes service encounters between speakers of English as a lingua franca. Audio recordings of Syrian shopping interactions were collected and analyzed by Traverso [22]. Service encounters gathered in public offices and shops of Catalonia were examined with respect to how bilinguals negotiate the code (language) of their interaction. Audio recordings have been used to analyze politeness strategies in shopping interactions (see, for example, [12]).

The importance of gaze (see [15] for an overview) and smile (see, for example, [10]) in defining the outcome of service interactions suggests the need for capturing and studying nonverbal behaviors in videos.
For instance, customers reported higher satisfaction when they interacted face-to-face with a bank teller who responded with a contingent smile, rather than a constant neutral or constant smiling expression [10]. The same data showed that amused and polite smiles differ with respect to their temporal properties [9]. Analysis of verbal and nonverbal expressions in the videos of interethnic encounters of Korean retailers with Korean and African-American customers showed that these language communities had different perceptions of the functions of socially minimal and socially expanded encounters [3].

Receptionist interactions, a subtype of service encounters, were analyzed with respect to their verbal content via role plays in [5]. Hewitt et al. [8] conducted discourse analysis of dialogues involving hospital receptionists. The openly accessible CUBE-G corpus of nonverbal behaviors from role plays of German and Japanese participants covers scenarios that may be relevant for service encounters, including first meeting, negotiation, and status difference [17]. The original Map Task [1] and follow-up projects collect direction-giving dialogues that may be relevant to some receptionist encounters.

We were not able to find any nonverbal corpora of human receptionist interactions. With respect to availability, among all the corpora mentioned above only the VOICE, CUBE-G, and Map Task related corpora are freely accessible. Hence, our corpus may be the first annotated corpus of nonverbal behaviors in receptionist interactions, and the first nonverbal corpus (excluding the original video and audio data) of service encounters freely available online [13].

arXiv:1203.2299v1 [cs.CL] 11 Mar 2012

3. DATA COLLECTION

3.1 Participants

We recruited participants via emails and posters in Education City in Doha, Qatar, and via announcements posted on bulletin boards across the CMU campus in Pittsburgh, USA.
The recruitment materials specified that we were looking for native speakers of American English or Arabic. The majority of the participants (17 of 22) were university students, staff, or faculty. The participants filled out demographic surveys and evaluated themselves on the ten-item personality inventory (TIPI) [7] and the 20-item positive and negative affect schedule (PANAS) [24]. The distribution of participants is shown in Table 1.

Site         Native language     Females   Males
Doha         Arabic              2         6
Doha         American English    2         3
Pittsburgh   Arabic              1         1
Pittsburgh   American English    5         1

Table 1: Distribution of participants between Doha and Pittsburgh experiment sites

People apply different criteria when they report their native language and mother tongue [14]. To control for this, we asked the participants to list the countries they had lived in for more than a year, and their age at the time of moving into and out of each country. All but 3 participants (who were all in the American English condition in Doha) spent the majority of their lives in a country where their native language is a primary spoken language. A female participant in Doha changed her reported native language from American English to Tulu after asking the experimenter a clarification question. Her data remains in the corpus, although she is not included in Table 1.

The mean age of participants in Doha was 25 years (SD = 7.8). In Pittsburgh, the mean age was 28.7 years (SD = 12.7). Native speakers of Arabic were on average 23.2 years old (SD = 4.2), while the mean age of native speakers of American English was 30.9 years (SD = 12.5).

3.2 Procedure

After filling out the questionnaires, one of the participants was asked to play the role of a receptionist while the other was asked to imagine themselves as a first-time visitor looking for a particular location inside the building.
The location was picked by the experimenter from the following list: library, restroom, cafeteria, student recreation room, a professor’s office, etc. Visitors were asked to seek the receptionist’s help with directions, using English, and then to proceed towards their destination.

Most of the participant pairs were not familiar with each other. Familiarity between participants, when evident, is noted in the annotations. Similarly, the annotations include information on whether the participant has a thorough (works or studies inside the building) or passing (works or studies in a nearby building) familiarity with the experiment site.

At both sites, the receptionist would occupy the actual receptionist area in the lobby of the building. In Doha, on-duty security guards were present in the vicinity of the reception desk.

Each pair of participants would have 2 or 3 interactions with one of the subjects as a receptionist, and then they would switch roles and have 2 or 3 more interactions, depending on the allotted time. After that, the participants were debriefed on their experiences. Overall, more than 60 interactions were recorded.

The interactions were recorded with 2 or 3 consumer-level high-definition cameras. The visitor and the receptionist were each assigned a dedicated camera that captured their torso, arms, and face and was positioned about 45 degrees off their default line of sight (namely, the line of sight that is perpendicular to the front edge of the rectangular reception desk). Most of the interactions would have a third camera capturing the side view of the scene. All cameras were in plain view. In addition to the audio captured by the cameras, an audio recorder (iPod) was placed on the receptionist desk.

4. ANNOTATION SCHEME

The main goal of our corpus is to support the analysis of the occurrence and timing of verbal and nonverbal behaviors.
Consequently, we have chosen to annotate the data at a level of granularity that minimizes the coding effort while still capturing the timing and major features of communicative events. For example, instead of annotating each of the preparation, hold, stroke, and retraction phases of a hand gesture [11], we annotate an interval between the beginnings of the stroke and retraction phases. Similarly, facial expressions are annotated as intervals approximately from the beginning of the rise phase to the beginning of the decay phase [9], with some error inherent to manual annotation. The annotation scheme, developed in the process of annotating the corpus, is summarized in Table 2.

Modality   Values
Speech     Transcribed utterances, including non-words
Eye gaze   Pointing (self-initiated), pointing (following interlocutor), focus (interlocutor, guard, desktop, down, up, left, right, front, back, scattered, destination)
Face       Smile (open or closed mouth)
Head       Nod, half nod, double nod, multiple nod, upward nod, multiple upward nod, micro nod, shake
Hand       Pointing (left or right hand), finger only
Torso      Sitting, standing, focus (left, right, front, back, destination, interlocutor, desk)

Table 2: Annotation scheme

Coding nonverbal expressions, as well as transcribing ambiguous speech, involves a degree of subjectivity. For example, the exact point of gaze fixation within the recipient’s face is hard to identify even for the recipient [23]. In fact, a typical direct eye contact consists of a sequence of fixations on different points on the face [6]. Since it is unclear whether the exact fixation pattern has any influence on social communication, in this study we do not distinguish between different fixation points within the general face area (nor does the video fidelity allow it). We plan to validate the annotations by employing a second annotator.

The annotation is done using the multi-track video annotation tool Advene [2].

5. DISCUSSION

While the small number of individual participants makes this corpus unsuitable for cross-subject analysis, the multiple trials may be accounted for by mixed-effects models [16]. More appropriately, the corpus should be used for qualitative analysis and the formation of hypotheses for further studies. For example, compare the gaze behaviors of a native Arabic-speaking female S4 (Subject 4) playing a receptionist responding to a native Arabic-speaking male S1 playing a visitor (Fig. 1) versus the dialogue with the subjects’ roles reversed (Fig. 2). Notice that both subjects gazed at their interlocutor more in the visitor role. This appears to be a trend that can be explained in part by the receptionist looking towards the destination during the direction-giving speech, while the visitor may continue looking at the receptionist.

Now, compare the receptionist gaze of S4 (Fig. 1) with that of S12 (Fig. 3), who is a female native speaker of American English. Notice the short glances that punctuate fragments of the directions sequence spoken by S12. These glances appear to precede the visitor’s backchannels and therefore may play a role in connection events [18]. Receptionist S4, in contrast, did not glance at the visitor until the very end of the directions sequence. These different gaze behaviors may reflect individual styles, genders, and cultures of receptionist-visitor pairs, or levels of comfort and expertise, among other possibilities. Further, more controlled studies may address these hypotheses.

6. ACKNOWLEDGMENTS

This publication was made possible by the support of an NPRP grant from the Qatar National Research Fund. The authors would like to express their gratitude to Michael Agar, Mark Barker, Justine Cassell, Anwar El-Shamy, Ismet Hajdarovic, Alicia Holland, Carol Miller, Dudley Reynolds, Michele de la Reza, Candace Sidner, Mark Stehlik, Mark C. Thompson, the security and receptionist staff of CMU Qatar, and the study participants.

7. REFERENCES

[1] A. Anderson, M. Bader, E. Bard, E. Boyle, G. M. Doherty, S. Garrod, S. Isard, J. Kowtko, J. McAllister, J. Miller, C. Sotillo, H. S. Thompson, and R. Weinert. The HCRC Map Task corpus. Language and Speech, 34:351–366, 1991.
[2] O. Aubert and Y. Prié. Advene: active reading through hypervideo. In Proc. of ACM Hypertext, September 2005.
[3] B. Bailey. Communication of respect in interethnic service encounters. Language in Society, 26:327–356, 1997.
[4] L. M. Beebe and M. C. Cummings. Natural speech act data versus written questionnaire data: How data collection method affects speech act performance. In S. M. Gass and J. Neu, editors, Speech Acts Across Cultures: Challenges to Communication in a Second Language, pages 65–86. Berlin / New York: Mouton de Gruyter, 1996.
[5] B. Chee, A. Wong, D. Limbu, A. Tay, Y. Tan, and T. Park. Understanding communication patterns for designing robot receptionist. In S. Ge, H. Li, J.-J. Cabibihan, and Y. Tan, editors, Social Robotics, volume 6414 of Lecture Notes in Computer Science, pages 345–354. Springer Berlin / Heidelberg, 2010.
[6] M. Cook. Gaze and mutual gaze in social encounters: How long—and when—we look others “in the eye” is one of the main signals of nonverbal communication. American Scientist, 65(2):328–333, 1977.
[7] S. D. Gosling, P. J. Rentfrow, and W. B. Swann, Jr. A very brief measure of the Big-Five personality domains. Journal of Research in Personality, 37:504–528, 2003.
[8] H. Hewitt, L. McCloughan, and B. McKinstry. Front desk talk: discourse analysis of receptionist–patient interaction. British Journal of General Practice, 59(565):e260–e266, 2009.
[9] M. E. Hoque, L.-P. Morency, and R. W. Picard. Are you friendly or just polite? – Analysis of smiles in spontaneous face-to-face interactions. In Proc. of the Affective Computing and Intelligent Interaction, October 2011.
[10] K. Kim. Affect Reflection Technology in Face-to-Face Service Encounters. MIT MS Thesis, September 2009.
[11] S. Kita, I.
van Gijn, and H. van der Hulst. Movement phases in signs and co-speech gestures, and their transcription by human coders. In I. Wachsmuth and M. Fröhlich, editors, Gesture and Sign Language in Human-Computer Interaction, pages 23–35. Springer, 1998.
[12] K. C. Kong. Politeness of service encounters in Hong Kong. Pragmatics, 8(4):555–575, 2010.
[13] M. Makatchev, R. Simmons, and M. Sakr. Carnegie Mellon Receptionist Corpus. http://www.qatar.cmu.edu/hala/corpora/.
[14] M. McPherson, L. Smith-Lovin, and J. M. Cook. What is a language community? American Journal of Political Science, 44(1):142–155, 2000.
[15] E. Montague, J. Xu, P. yu Chen, O. Asan, B. P. Barret, and B. Chewning. Modeling eye gaze patterns in clinician-patient interaction with lag sequential analysis. J. of Human Factors and Ergonomics Society, 53:502–516, October 2011.
[16] J. C. Pinheiro and D. M. Bates. Mixed-Effects Models in S and S-PLUS. Springer, 2000.
[17] M. Rehm, E. André, N. Bee, B. Endrass, M. Wissner, Y. Nakano, A. A. Lipi, T. Nishida, and H.-H. Huang. Creating standardized video recordings of multimodal interactions across cultures. In M. Kipp, J.-C. Martin, P. Paggio, and D. Heylen, editors, Multimodal corpora, pages 138–159. Springer-Verlag, Berlin, Heidelberg, 2009.
[18] C. Rich, A. Holroyd, B. Ponsler, and C. Sidner. Recognizing engagement in human-robot interaction. In Proceedings of ACM/IEEE International Conference on Human Robot Interaction, pages 375–382, 2010.
[19] C. Rich and C. L. Sidner. Robots and avatars as hosts, advisors, companions and jesters. AI Magazine, 30(1):29–41, 2009.
[20] B. Seidlhofer, A. Breiteneder, T. Klimpfinger, S. Majewski, R. Osimk, and M.-L. Pitzl. Vienna-Oxford international corpus of English (version 1.1 online). http://voice.univie.ac.at. Accessed January 16, 2012.
[21] R. Simmons, M. Makatchev, R. Kirby, M. Lee, I. Fanaswala, B. Browning, J. Forlizzi, and M. Sakr. Believable robot characters. AI Magazine, 32(4):39–52, 2011.
[22] V. Traverso.
Syrian service encounters: a case of shifting strategies within verbal exchange. Pragmatics, 11(4):421–444, 2001.
[23] M. von Cranach and J. H. Ellgring. Problems in the recognition of gaze direction. In M. von Cranach and I. Vine, editors, Social Communication and Movement: Studies of Interaction and Expression in Man and Chimpanzee, pages 419–443. Academic Press, London, 1973.
[24] D. Watson, L. A. Clark, and A. Tellegen. Development and validation of brief measures of positive and negative affect: The PANAS scales. J. of Personality and Social Psychology, 47:1063–1070, 1988.

[Figure 1 plot: timeline with a time axis in seconds (0 to 23.0), speech and gaze tracks for the visitor (v) and the receptionist (r), transcribed utterances, and gaze fractions reported as fraction of entire interaction, fraction of receptionist's speech, and fraction of visitor's speech.]

Figure 1: Interaction between S1 as a visitor and S4 as a receptionist. Wide vertical stripes represent intervals of speech. Narrow vertical stripes represent (from left to right): intervals of visitor’s and receptionist’s gaze towards the direction pointed by the receptionist, and visitor’s and receptionist’s gaze towards each other. Color coding of these modalities is specified by the icons in the upper part of the plots.

[Figure 2 plot: timeline with a time axis in seconds (0 to 15.3), speech and gaze tracks for the visitor (v) and the receptionist (r), transcribed utterances, and gaze fractions reported as fraction of entire interaction, fraction of receptionist's speech, and fraction of visitor's speech.]

Figure 2: Interaction between S4 as a visitor and S1 as a receptionist. Wide vertical stripes represent intervals of speech. Narrow vertical stripes represent (from left to right): intervals of visitor’s and receptionist’s gaze towards the direction pointed by the receptionist, and visitor’s and receptionist’s gaze towards each other. Color coding of these modalities is specified by the icons in the upper part of the plots.

[Figure 3 plot: timeline with a time axis in seconds (0 to 25.7), speech and gaze tracks for the visitor (v) and the receptionist (r), transcribed utterances, and gaze fractions reported as fraction of entire interaction, fraction of receptionist's speech, and fraction of visitor's speech.]

Figure 3: Interaction between S11 as a visitor and S12 as a receptionist. The visitor’s eye gaze for this particular dialogue is partially inferred from his head gaze. Wide vertical stripes represent intervals of speech. Narrow vertical stripes represent (from left to right): intervals of visitor’s and receptionist’s gaze towards the direction pointed by the receptionist, and visitor’s and receptionist’s gaze towards each other. Color coding of these modalities is specified by the icons in the upper part of the plots.
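
The plots in Figures 1–3 relate gaze intervals to three reference spans: the entire interaction, the receptionist's speech, and the visitor's speech. As a minimal sketch of how such fractions could be derived from interval annotations, the Python code below assumes that each annotation is a (start, end) pair in seconds and that a fraction is the time a behavior overlaps a reference span divided by the duration of that span. The function names and the sample values are hypothetical illustrations; they are not taken from the corpus, from Advene, or from the authors' own analysis scripts.

```python
from typing import List, Tuple

# Assumed representation: an annotation is a (start, end) pair in seconds.
Interval = Tuple[float, float]


def merge(intervals: List[Interval]) -> List[Interval]:
    """Sort intervals and merge any that overlap or touch."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Extend the previous interval instead of starting a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def duration(intervals: List[Interval]) -> float:
    """Total time covered by the intervals, counting overlaps once."""
    return sum(end - start for start, end in merge(intervals))


def overlap(a: List[Interval], b: List[Interval]) -> float:
    """Total time during which both interval sets are active."""
    total = 0.0
    for a_start, a_end in merge(a):
        for b_start, b_end in merge(b):
            total += max(0.0, min(a_end, b_end) - max(a_start, b_start))
    return total


def fraction_during(behavior: List[Interval], context: List[Interval]) -> float:
    """Fraction of the context time during which the behavior is active."""
    context_time = duration(context)
    return overlap(behavior, context) / context_time if context_time > 0 else 0.0


# Hypothetical sample annotations (not corpus data).
gaze_at_interlocutor = [(0.0, 2.0), (5.0, 6.0)]
receptionist_speech = [(1.0, 3.0), (5.5, 7.5)]
whole_interaction = [(0.0, 10.0)]

print(fraction_during(gaze_at_interlocutor, whole_interaction))
print(fraction_during(gaze_at_interlocutor, receptionist_speech))
```

Merging intervals before measuring keeps overlapping annotations on the same track from being counted twice. The same functions could be applied to any pair of tracks in the annotation scheme, for example head nods during the interlocutor's speech.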
7 A Cross-cultural Corpus of Annotated Verbal and Nonverbal Behaviors in Receptionist Encounters Maxim Makatchev Robotics Institute Carnegie Mellon University Pittsburgh, PA, USA mmakatch@cs.cmu.edu Reid Simmons Robotics Institute Carnegie Mellon University Pittsburgh, PA, USA reids@cs.cmu.edu Majd Sakr Carnegie Mellon University in Qatar Doha, Qatar msakr@qatar.cmu.edu ABSTRACT We present the first annotated corpus of nonverbal behaviors in receptionist interactions, and the first nonverbal corpus (excluding the original video and audio data) of service en- counters freely available online. Native speakers of American English and Arabic participated in a naturalistic role play at reception desks of university buildings in Doha, Qatar and Pittsburgh, USA. Their manually annotated nonverbal behaviors include gaze direction, hand and head gestures, torso positions, and facial expressions. We discuss possible uses of the corpus and envision it to become a useful tool for the human-robot interaction community. 1. INTRODUCTION Behavioral realism has been one of the promising direc- tions in the development of on-screen conversational agents and robots capable of natural language dialogue (see [19] for an overview). For example, interactions with a robot receptionist that evoke user’s social response are associated with better engagement and lower rate of breakdowns dur- ing information-seeking dialogues [21]. A necessary step in designing such interactions is to identify behaviors with a potential to evoke a desired user response. Data sources that can be used to harvest behavior candi- dates include ethnographic and controlled studies. Ethno- graphic studies provide an opportunity for collection of nat- uralistic conversational data, but often face the issues of un- clear sample population and coarse granularity of captured data [4]. On the other hand, collecting high resolution data in a controlled setting may hamper spontaneity and natural- ness of the interaction. 
In general, data collection methodol- ogy can influence both the sociopragmatic choices, namely, what speech act to say, and their pragmalinguistic realiza- tion, namely, how to say it (see [4] for a discussion). These methodological difficulties, combined with the chal- lenges of annotating multimodal data, result in the lack of annotated corpora of naturalistic interactions for many sce- narios that are currently relevant for human-robot interac- tion research. The corpus of role plays between a visitor and a receptionist in a realistic environment that we present in this paper attempts to help fill this gap. In the next section, we describe related work on corpora of service encounters. After that, we introduce our data col- lection methodology and the annotation scheme we use. We Copyright is held by the author/owner(s). ACM X-XXXXX-XX-X/XX/XX. conclude with the discussion of possible uses of the corpus. 2. CORPORA OF SERVICE ENCOUNTERS Audio corpora of human service encounters have been used for analysis of linguistic and paralinguistic features, such as timing and prosody. For example, Vienna-Oxford In- ternational Corpus of English (VOICE) [20] includes service encounters between speakers of English as a lingua franca. Audio recordings of Syrian shopping interactions were col- lected and analyzed by Traverso [22]. Service encounters gathered in public offices and shops of Catalonia were ex- amined with respect to how bilinguals negotiate code (lan- guage) of their interaction. Audio recordings have been used to analyze politeness strategies in shopping interactions (see, for example, [12]). The importance of gaze (see [15] for an overview) and smile (see, for example, [10]) in defining the outcome of the service interactions suggest the need for capturing and studying nonverbal behaviors in videos. 
For instance, cus- tomers reported higher satisfaction when they interacted face-to-face with a bank teller who responded with contin- gent smile, rather than constant neutral or constant smiling expression [10]. The same data showed that amused and polite smiles differ with respect to their temporal proper- ties [9]. Analysis of verbal and nonverbal expressions in the videos of interethnic encounters of Korean retailers with Korean and African-American customers showed that these language communities had different perception of function of socially minimal and socially expanded encounters [3]. Receptionist interactions, a subtype of service encounters, were analyzed with respect to their verbal content via role plays in [5]. Hewitt et al. [8] conducted discourse analysis of dialogues involving hospital receptionists. The openly ac- cessible CUBE-G corpus of nonverbal behaviors from role plays of German and Japanese participants covers scenarios that may be relevant for service encounters, including first meeting, negotiation and status difference [17]. The original Map Task [1] and followup projects collect direction-giving dialogues that may be relevant to some receptionist encoun- ters. We were not able to find any nonverbal corpora of hu- man receptionist interactions. With respect to availabil- ity, among all the corpora mentioned above only VOICE, CUBE-G and Map Task related corpora are freely accessi- ble. Hence, our corpus may be the first annotated corpus of nonverbal behaviors in receptionist interactions, and the first nonverbal corpus (excluding the original video and au- dio data) of service encounters freely available online [13]. 1 arXiv:1203.2299v1 [cs.CL] 11 Mar 2012 3. DATA COLLECTION 3.1 Participants We recruited via emails and posters in Education City, in Doha, Qatar and via announcements posted on bulletin boards across CMU campus in Pittsburgh, USA. 
The re- cruitment materials specified that we were looking for na- tive speakers of American English or Arabic. Majority of the participants (17 of 22) were university students, staff, or fac- ulty. The participants filled demographic surveys and evalu- ated themselves on ten-item personality inventory (TIPI) [7] and 20-item positive and negative affect scale (PANAS) [24]. The distribution of participants is shown in Table 1. Doha Arabic Females 2 Males 6 American English Females 2 Males 3 Pittsburgh Arabic Females 1 Males 1 American English Females 5 Males 1 Table 1: Distribution of participants between Doha and Pittsburgh experiment sites People apply different criteria when they report their na- tive language and mother tongue [14]. To control for this, we asked the participants to list the countries they lived in for more than a year, and their age at the time of moving in and out of the country. All but 3 participants (who were all in the American English condition in Doha) spent the majority of their lives in the country where their native lan- guage is a primary spoken language. A female participant in Doha changed her reported native language from American English to Tulu, after asking the experimenter a clarifica- tion question. Her data remains in the corpus although she is not included in the Table 1. Mean age of participants in Doha was 25 years (SD = 7.8). In Pittsburgh, average age was 28.7 years (SD = 12.7). Native speakers of Arabic were on average 23.2 years old (SD = 4.2), while average age of native speakers of Ameri- can English was 30.9 years (SD = 12.5). 3.2 Procedure After filling out the questionnaires, one of the participants was asked to play the role of a receptionist while another was asked to imagine themselves as a first-time visitor looking for a particular location inside the building. 
The location was picked by the experimenter from the following list: library, restroom, cafeteria, student recreation room, a professor A’s office, etc. Visitors were asked to seek help of the reception- ist for directions using English and then to proceed towards their destination. Most of the participant pairs were not familiar with each other. The fact of familiarity, when clear, is noted in the annotations. Similarly, the annotations include information on whether the participant has a thorough (works or studies inside the building) or passing (works or studies in a nearby building) familiarity with the experiment site. In both sites, the receptionist would occupy the actual receptionist area in the lobby of the building. In Doha, on-duty security guards were present in the vicinity of the reception desk. Each pair of participants would have 2-3 interactions with one of the subjects as a receptionist, and then they would switch roles and have 2 or 3 more interactions, depending on allotted time. After that, the participants were debriefed on their experiences. Overall, more than 60 interactions were recorded. The interactions were recorded with 2 or 3 consumer-level high definition cameras. Visitor and receptionist were each dedicated a camera capturing their torso, arms and face that was positioned about 45 degrees offtheir default line of sight (namely, the line of sight that is perpendicular to the front edge of the rectangular reception desk). Most of the inter- actions would have a third camera capturing the side view of the scene. All cameras were in plain view. In addition to the audio captured by the cameras, an audio recorder (iPod) was placed on the receptionist desk. 4. ANNOTATION SCHEME The main goal of our corpus is to analyze occurrences and timing of verbal and nonverbal behaviors. 
Consequently, we have chosen to annotate the data at the level of granular- ity that minimizes the coding effort while at the same time allowing to capture timing and major features of commu- nicative events. For example, instead of annotating each of preparation, hold, stroke, and retraction phases of a hand gesture [11] we annotate an interval between beginnings of the stroke and retraction phases. Similarly, facial expression are annotated as intervals approximately from the beginning of rise to the beginning of decay [9] phases, with some er- ror inherent to manual annotation. The annotation scheme, developed in the process of annotating the corpus, is sum- marized in Table 2. Modality Values Speech Transcribed utterances, including non-words Eye gaze Pointing (self-initiated), pointing (following interlocutor), focus (in- terlocutor, guard, desktop, down, up, left, right, front, back, scattered, destination) Face smile (open or closed mouth) Head nod, half nod, double nod, multiple nod, upward nod, multiple upward nod, micro nod, shake Hand Pointing (left or right hand), finger only Torso Sitting, standing, focus (left, right, front, back, destination, interlocu- tor, desk) Table 2: Annotation scheme Coding nonverbal expressions, as well as transcribing am- biguous speech involves a degree of subjectivity. For exam- ple, the exact point of gaze fixation within the recipient’s face is hard to identify even by the recipient himself [23]. In fact, a typical direct eye contact consists of a sequence of fixations on different points on the face [6]. Since it is unclear whether the exact fixation pattern has any influence 2 on social communication, in this study we do not distinguish between different fixation points within the general face area (neither does the video fidelity allow that). We plan to val- idate the annotations by employing a second annotator. The annotation is done using the multi-track video anno- tation tool Advene [2]. 5. 
DISCUSSION
While the small number of individual participants makes this corpus unsuitable for cross-subject analysis, the multiple trials per participant may be accounted for by mixed-effects models [16]. More appropriately, the corpus should be used for qualitative analysis and for the formation of hypotheses for further studies.

For example, compare the gaze behaviors of a native Arabic-speaking female S4 (Subject 4) playing a receptionist and responding to a native Arabic-speaking male S1 playing a visitor (Fig. 1) with the dialogue in which the subjects' roles are reversed (Fig. 2). Notice that both subjects gazed at their interlocutor more in the visitor role. This appears to be a trend that can be explained in part by the receptionist looking towards the destination during the direction-giving speech, while the visitor may continue looking at the receptionist.

Now, compare the receptionist gaze of S4 (Fig. 1) with that of S12 (Fig. 3), who is a female native speaker of American English. Notice the short glances that punctuate fragments of the directions sequence spoken by S12. These glances appear to precede the visitor's backchannels and therefore may play a role in connection events [18]. Receptionist S4, in contrast, did not glance at the visitor until the very end of the directions sequence. These different gaze behaviors may reflect individual styles, genders, and cultures of receptionist-visitor pairs, or levels of comfort and expertise, among other possibilities. Further, more controlled studies may address these hypotheses.

6. ACKNOWLEDGMENTS
This publication was made possible by the support of an NPRP grant from the Qatar National Research Fund. The authors would like to express their gratitude to Michael Agar, Mark Barker, Justine Cassell, Anwar El-Shamy, Ismet Hajdarovic, Alicia Holland, Carol Miller, Dudley Reynolds, Michele de la Reza, Candace Sidner, Mark Stehlik, Mark C. Thompson, the security and receptionist staff of CMU Qatar, and the study participants.

7.
REFERENCES
[1] A. Anderson, M. Bader, E. Bard, E. Boyle, G. M. Doherty, S. Garrod, S. Isard, J. Kowtko, J. McAllister, J. Miller, C. Sotillo, H. S. Thompson, and R. Weinert. The HCRC Map Task corpus. Language and Speech, 34:351–366, 1991.
[2] O. Aubert and Y. Prié. Advene: active reading through hypervideo. In Proc. of ACM Hypertext, September 2005.
[3] B. Bailey. Communication of respect in interethnic service encounters. Language in Society, 26:327–356, 1997.
[4] L. M. Beebe and M. C. Cummings. Natural speech act data versus written questionnaire data: How data collection method affects speech act performance. In S. M. Gass and J. Neu, editors, Speech Acts Across Cultures: Challenges to Communication in a Second Language, pages 65–86. Berlin / New York: Mouton de Gruyter, 1996.
[5] B. Chee, A. Wong, D. Limbu, A. Tay, Y. Tan, and T. Park. Understanding communication patterns for designing robot receptionist. In S. Ge, H. Li, J.-J. Cabibihan, and Y. Tan, editors, Social Robotics, volume 6414 of Lecture Notes in Computer Science, pages 345–354. Springer Berlin / Heidelberg, 2010.
[6] M. Cook. Gaze and mutual gaze in social encounters: How long--and when--we look others "in the eye" is one of the main signals of nonverbal communication. American Scientist, 65(2):328–333, 1977.
[7] S. D. Gosling, P. J. Rentfrow, and W. B. Swann Jr. A very brief measure of the Big-Five personality domains. Journal of Research in Personality, 37:504–528, 2003.
[8] H. Hewitt, L. McCloughan, and B. McKinstry. Front desk talk: discourse analysis of receptionist–patient interaction. British Journal of General Practice, 59(565):e260–e266, 2009.
[9] M. E. Hoque, L.-P. Morency, and R. W. Picard. Are you friendly or just polite? Analysis of smiles in spontaneous face-to-face interactions. In Proc. of the Affective Computing and Intelligent Interaction, October 2011.
[10] K. Kim. Affect Reflection Technology in Face-to-Face Service Encounters. MIT MS Thesis, September 2009.
[11] S. Kita, I. van Gijn, and H. van der Hulst. Movement phases in signs and co-speech gestures, and their transcription by human coders. In I. Wachsmuth and M. Fröhlich, editors, Gesture and Sign Language in Human-Computer Interaction, pages 23–35. Springer, 1998.
[12] K. C. Kong. Politeness of service encounters in Hong Kong. Pragmatics, 8(4):555–575, 2010.
[13] M. Makatchev, R. Simmons, and M. Sakr. Carnegie Mellon Receptionist Corpus. http://www.qatar.cmu.edu/hala/corpora/.
[14] M. McPherson, L. Smith-Lovin, and J. M. Cook. What is a language community? American Journal of Political Science, 44(1):142–155, 2000.
[15] E. Montague, J. Xu, P. yu Chen, O. Asan, B. P. Barret, and B. Chewning. Modeling eye gaze patterns in clinician-patient interaction with lag sequential analysis. J. of Human Factors and Ergonomics Society, 53:502–516, October 2011.
[16] J. C. Pinheiro and D. M. Bates. Mixed-Effects Models in S and S-PLUS. Springer, 2000.
[17] M. Rehm, E. André, N. Bee, B. Endrass, M. Wissner, Y. Nakano, A. A. Lipi, T. Nishida, and H.-H. Huang. Creating standardized video recordings of multimodal interactions across cultures. In M. Kipp, J.-C. Martin, P. Paggio, and D. Heylen, editors, Multimodal corpora, pages 138–159. Springer-Verlag, Berlin, Heidelberg, 2009.
[18] C. Rich, A. Holroyd, B. Ponsler, and C. Sidner. Recognizing engagement in human-robot interaction. In Proceedings of ACM/IEEE International Conference on Human Robot Interaction, pages 375–382, 2010.
[19] C. Rich and C. L. Sidner. Robots and avatars as hosts, advisors, companions and jesters. AI Magazine, 30(1):29–41, 2009.
[20] B. Seidlhofer, A. Breiteneder, T. Klimpfinger, S. Majewski, R. Osimk, and M.-L. Pitzl. Vienna-Oxford international corpus of English (version 1.1 online). http://voice.univie.ac.at. Accessed January 16, 2012.
[21] R. Simmons, M. Makatchev, R. Kirby, M. Lee, I. Fanaswala, B. Browning, J. Forlizzi, and M. Sakr. Believable robot characters. AI Magazine, 32(4):39–52, 2011.
[22] V. Traverso. Syrian service encounters: a case of shifting strategies within verbal exchange. Pragmatics, 11(4):421–444, 2001.
[23] M. von Cranach and J. H. Ellgring. Problems in the recognition of gaze direction. In M. von Cranach and I. Vine, editors, Social Communication and Movement: Studies of Interaction and Expression in Man and Chimpanzee, pages 419–443. Academic Press, London, 1973.
[24] D. Watson, L. A. Clark, and A. Tellegen. Development and validation of brief measures of positive and negative affect: The PANAS scales. J. of Personality and Social Psychology, 47:1063–1070, 1988.

[Figure 1 plot: timeline of speech and gaze intervals for the visitor (v) and the receptionist (r); time axis in seconds, 0 to 23.0. For each gaze modality the plot reports the fraction of the entire interaction, of the receptionist's speech, and of the visitor's speech.]
Visitor speech: (unclear) morning eh, do you know where is d.. professor Majd Sakr office? majd sakr uhu uhu ok thank you
Receptionist speech: hi, good morning eh, professor who? eh, majd sakr, professor so, you may go this way, in this corridor it's the c s corridor and then (i or o unclear)n your right... you have all offices you can read his name on the pallet on the office
Figure 1: Interaction between S1 as a visitor and S4 as a receptionist. Wide vertical stripes represent intervals of speech. Narrow vertical stripes represent (from left to right): intervals of visitor's and receptionist's gaze towards the direction pointed by the receptionist, and visitor's and receptionist's gaze towards each other. Color coding of these modalities is specified by the icons in the upper part of the plots.

[Figure 2 plot: timeline of speech and gaze intervals for the visitor (v) and the receptionist (r); time axis in seconds, 0 to 15.3. For each gaze modality the plot reports the fraction of the entire interaction, of the receptionist's speech, and of the visitor's speech.]
Visitor speech: hi um, excuse me, i am (unclear) looking for cmu library, can you lead me to this library please? ok ok. so that (unclear) over there? ok, thank you so much
Receptionist speech: hi yeah, sure the library is there.. just you have to .. walk that way uhu yeah... there
Figure 2: Interaction between S4 as a visitor and S1 as a receptionist. Wide vertical stripes represent intervals of speech. Narrow vertical stripes represent (from left to right): intervals of visitor's and receptionist's gaze towards the direction pointed by the receptionist, and visitor's and receptionist's gaze towards each other. Color coding of these modalities is specified by the icons in the upper part of the plots.

[Figure 3 plot: timeline of speech and gaze intervals for the visitor (v) and the receptionist (r); time axis in seconds, 0 to 25.7. For each gaze modality the plot reports the fraction of the entire interaction, of the receptionist's speech, and of the visitor's speech.]
Visitor speech: hey (unclear:hey) can i please know where is the library? yeah okay it's on the ground floor you mean? er um okay okay thank you
Receptionist speech: hi how can i help you a library um if you come down this hallway all the way to the end... and and you take a right it's on the ground floor it'll be right there you'll er um see these glass doors and it's the library right there so just go down the hallway and take a right and straight (rise) mm-hm
Figure 3: Interaction between S11 as a visitor and S12 as a receptionist. The visitor's eye gaze for this particular dialogue is partially inferred from his head gaze. Wide vertical stripes represent intervals of speech. Narrow vertical stripes represent (from left to right): intervals of visitor's and receptionist's gaze towards the direction pointed by the receptionist, and visitor's and receptionist's gaze towards each other. Color coding of these modalities is specified by the icons in the upper part of the plots.
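
The three fractions reported for each gaze modality in Figures 1-3 (fraction of the entire interaction, of the receptionist's speech, and of the visitor's speech) can be read as interval-overlap ratios. The Python sketch below shows one way such ratios could be computed from interval annotations. It is an illustration under stated assumptions, not the procedure used to produce the figures: the function names and the example intervals are hypothetical, and each annotation is assumed to be a (start, end) pair in seconds.

```python
from typing import List, Tuple

# Assumption: an annotation is a (start, end) pair in seconds.
Interval = Tuple[float, float]


def merge(intervals: List[Interval]) -> List[Interval]:
    """Merge overlapping intervals so that no time is counted twice."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def total(intervals: List[Interval]) -> float:
    """Total duration covered by a set of intervals."""
    return sum(end - start for start, end in merge(intervals))


def overlap(a: List[Interval], b: List[Interval]) -> float:
    """Total time covered by both interval sets at once."""
    a, b = merge(a), merge(b)
    i = j = 0
    shared = 0.0
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if hi > lo:
            shared += hi - lo
        # Advance whichever interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return shared


def fraction_during(behavior: List[Interval], context: List[Interval]) -> float:
    """Fraction of the context time during which the behavior occurs."""
    context_time = total(context)
    return overlap(behavior, context) / context_time if context_time > 0 else 0.0


if __name__ == "__main__":
    # Hypothetical example values, not taken from the corpus.
    interaction = [(0.0, 20.0)]
    receptionist_speech = [(2.0, 6.0), (10.0, 18.0)]
    visitor_speech = [(0.0, 2.0), (6.0, 10.0)]
    visitor_gaze_at_receptionist = [(0.0, 5.0), (12.0, 16.0)]

    for label, context in [
        ("entire interaction", interaction),
        ("receptionist's speech", receptionist_speech),
        ("visitor's speech", visitor_speech),
    ]:
        value = fraction_during(visitor_gaze_at_receptionist, context)
        print(f"Fraction of {label}: {value:.2f}")
```

Merging intervals before measuring them keeps the ratios within [0, 1] even if a track contains overlapping annotations, which manual coding can produce.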