arXiv:1807.08076v1 [cs.CL] 21 Jul 2018 Consequences and Factors of Stylistic Differences in Human-Robot Dialogue Stephanie M. Lukin 1 , Kimberly A. Pollard 1 , Claire Bonial 1 , Matthew Marge 1 , Cassidy Henry 1 , Ron Artstein 2 , David Traum 2 and Clare R. Voss 1 1 U.S. Army Research Laboratory, Adelphi, MD 20783 2 USC Institute for Creative Technologies, Playa Vista, CA 90094 stephanie.m.lukin.civ@mail.mil Abstract This paper identifies stylistic differences in instruction-giving observed in a corpus of human-robot dialogue. Differences in verbosity and structure (i.e., single-intent vs. multi-intent instructions) arose natu- rally without restrictions or prior guid- ance on how users should speak with the robot. Different styles were found to produce different rates of miscommunica- tion, and correlations were found between style differences and individual user varia- tion, trust, and interaction experience with the robot. Understanding potential con- sequences and factors that influence style can inform design of dialogue systems that are robust to natural variation from human users. 1 Introduction When human users engage in spontaneous lan- guage use with a dialogue system, a variety of naturally occurring language is observed. A per- sistent challenge in the development of dialogue systems is determining how to handle this diver- sity. One strategy is to limit diversity and maxi- mize the system’s natural language understanding by training users a priori on what language and syntax is valid. However, these constraints could potentially yield inefficient interactions, e.g., the user may incur greater task and cognitive load try- ing to remember the proper phrasing needed by the system, worrying whether or not their speech will be understood if they do not get it exactly right. A broader approach to dealing with diversity is to develop more robust systems that can respond ap- propriately to different styles of language. A set of dialogue system policies that takes into account natural stylistic variations in users’ speech would Dialogue 1: Lower Verbosity U : take pictures in all four directions Robot : executing... Robot : done Dialogue 2: Higher Verbosity U : robot face north, take a picture, face south, take a picture, face east, take a picture Robot : executing... Robot : done Dialogue 3: Minimal Structure Style U : go through the other door Robot : executing... Robot : done U : take a picture Robot : image sent Dialogue 4: Extended Structure Style U : face your starting position and send a picture Robot : executing... Robot : image sent Figure 1: Dialogues between Users (U) and a Robot, exemplifying stylistic differences provide for a more nuanced, adaptable, and user- focused approach to interaction. Rather than constrain users or develop a gener- alized dialogue system that attempts to cover all variations in the same way, we focus on analytic understanding of differences in observed language behavior, as well as possible causes of these differ- ences and implications of misunderstanding. This work is a first step towards a more nuanced and flexible dialogue policy that can be sensitive to in- dividual and situational differences, and adapt ap- propriately. This paper introduces a taxonomy of stylistic differences in instructions that humans is- sue to robots in a dialogue. The taxonomy consists of two classes: verbosity and structure . Verbosity is measured by number of words in an instruction. Dialogues 1 and 2 in Figure 1 show contrasting verbosity levels. Structure concerns the number of intents issued in an instruction: Minimal if it con- tains a single intent (Dialogue 3 in Figure 1 has two Minimal) or Extended if it contains more than one intent (Dialogue 4). Understanding stylistic differences can support the development of dia- logue systems with strategies that tailor system re- sponses to the user’s style, rather than constrain the user’s style to the expected input. The taxon- omy is described in more detail in Section 3 . We observe and analyze these stylistic differ- ences in a corpus of human-robot direction-giving dialogue from Marge et al. ( 2017 ). These styles are not unique to this corpus; they emerge in other human-robot and human-human dialogue, such as TeamTalk ( Marge and Rudnicky , 2011 ) and SCARE ( Stoia et al. , 2008 ). The corpus contains 60 dialogues from 20 participants (Sec- tion 4 ). The robot dialogue management in the corpus is controlled by a Wizard-of-Oz experi- menter, allowing for the study of users’ style with a fluent and naturalistic partner (i.e., with an ap- proximation of an idealized automated system). In Section 5 , we investigate possible conse- quences and implications of these categorized styles in this corpus. We examine the relation- ship of style and miscommunication frequency, applying an existing taxonomy for miscommu- nication in human-agent conversational dialogue ( Higashinaka et al. , 2015a ) to this human-robot corpus. We explore the relationship between stylistic differences and other dialogue phenom- ena described in Section 6 , specifically whether: • The rate of miscommunication is related to verbosity (H 1 ) and structure (H 2 ); • Latent user differences are related to ver- bosity (H 3 ) and structure (H 4 ); • Trust in the robot is related to verbosity (H 5 ) and structure (H 6 ); • Time/experience with the robot is related to verbosity (H 7 ) and structure (H 8 ). Finally, we speculate about how knowledge of style, miscommunication, individual differences, trust, and experience might be leveraged to imple- ment targeted and personalized dialogue manage- ment strategies and offer concluding remarks on future work (Sections 7 and 8 ). 2 Related Work A number of human-human direction-giving cor- pora exist, among them, ArtWalk ( Liu et al. , 2016 ), CReST ( Eberhard et al. , 2010 ), SCARE ( Stoia et al. , 2008 ), and SaGA ( L ̈ ucking et al. , 2010 ). The majority of existing analyses on these corpora focus on the vocabulary of referring ex- pressions and entrainment. While variations in instruction-giving verbosity and structure are ev- ident in these human-human interactions, the goal of this work is to improve human-robot communi- cation. Humans have different assumptions about how robots communicate and behave, and may speak differently to robots than they do to other humans. We therefore chose a human-robot cor- pus for our style analysis that uses a Wizard-of-Oz for dialogue management. This allowed us specif- ically to isolate the style usage and miscommu- nication errors of the human partner (because the Wizard makes very few errors on the robot’s end). Studies of human-robot automated systems tend to focus on the miscommunication errors of the dialogue system (i.e., the robot itself), rather than the miscommunication or style of the hu- man partner. In conversational agents, the re- search focus is also primarily to categorize er- rors made by the agent, not the human, includ- ing errors in ASR, surface realization, or appro- priateness of the response (e.g., Higashinaka et al. ( 2015b ); Paek and Horvitz ( 2000 )). The more generic task-oriented and agent-based response- level errors from Higashinaka et al. ( 2015a ) map well to the user miscommunication in the corpus we examine, including excess/lack of information, non-understanding, unclear intention, and misun- derstanding. Works that focus specifically on mis- communication from the user when interacting with a robot include those categorizing referential ambiguity and impossible-to-execute commands ( Marge and Rudnicky , 2015 ). These categories are common in the data we examine as well. In this analysis, we predict that trust will have an effect on stylistic variations. Factors of trust in co-present and remote human-robot collabora- tion has been studied with respect to engagement with the robot, and memory of information from the robot ( Powers et al. , 2007 ). 3 Stylistic Differences We describe two classes of stylistic differences for instruction-giving: differences in the verbosity of an instruction, and in the structure of the instruc- tion. These styles emerge when decomposing a high-level plan or intent (e.g., exploring a physical space) into (potentially, but not necessarily) low- level instructions (e.g., how to explore the space, where to move, how to turn). 3.1 Verbosity Verbosity is a continuous measure of the number of words per instruction. Compare the instruction in Dialogue 1 in Figure 1 “take pictures in all four directions” (6 words) with the instruction in Di- alogue 2 “robot face north, take a picture, face south, take a picture, face east, take a picture” (16 words). Both issue the same plan (with the excep- tion of a picture towards the west in Dialogue 2), yet Dialogue 1 condenses the instruction and as- sumes that the robot can unpack the higher-level plan. Dialogue 2 is more verbose and low-level, making reference to individual cardinal directions. Verbosity alone does not capture all style differ- ences; additional categorization is needed. 3.2 Structure of Instructions We define a Minimal instruction as one contain- ing a single intent (e.g., “turn”, “move”, or “re- quest image”). A sequence of Minimal instruc- tions often reveals the higher-level plan of the user. In Dialogue 3, the user issues a single instruc- tion “go through the other door” and waits until the instruction has been completed. Upon receiv- ing completion feedback from the robot (“execut- ing” and “done” responses), the next instruction, “take a picture”, is issued. Compare this with Di- alogue 4, where the intents “face your starting po- sition” and “send a picture” are compounded to- gether and issued at the same time. This is clas- sified as an Extended intent structure: instruc- tions that have more than one expressed intent. These structural definitions were first described in Traum et al. ( 2018 ) to classify the composition of an instruction. In this work, we use these defini- tions to classify the style of the user. 4 Human-Robot Dialogue Corpus We examine these styles in a corpus of human- robot dialogue collected from a collaborative human-robot task ( Marge et al. , 2017 ). The user and the robot were not co-present. The user in- structed the robot in three remote, search-and- navigation tasks: a Training trial and two Main tri- als (M1 and M2). During Training, users got used to speaking to the robot. Main trials lasted for 20 minutes each, and users were given concrete goals for each exploration, including counting particular objects (e.g., shoes) and making deductions (e.g., if the space could be a headquarters environment). Users spoke instructions into a microphone while looking at a live 2D-map built from the robot’s LIDAR scanner. A low bandwidth envi- ronment was simulated by disabling video stream- ing; instead, photos could be captured on-demand from the robot’s front-facing camera. To allow full natural language use, users were not provided ex- ample commands to the robot, though they were provided with a list of the robot’s capabilities which they could reference throughout the trials. Well-formed instructions (unambiguous, with a clear action, end-point, and state) could be exe- cuted without any additional clarification (e.g., all dialogues in Figure 1 ). The robot responded with status updates to the user to make it known when an instruction was heard and completed. When necessary, the robot requested instruction clarifi- cation. A human Wizard experimenter stood in for the robot’s speech recognition, natural language understanding, and language production capabil- ities, which were guided by a response protocol. 4.1 Corpus Statistics The corpus contains 3,573 utterances from 20 users, totaling 18,336 words. 1,981 instructions were issued. The least verbose instruction ob- served is 1 word (“stop”), and the most verbose is 59 words (mean 7.3, SD 5.8). Of the total instruc- tions, 1,383 are of the Minimal style, and 598, Ex- tended. A moderate, positive correlation exists be- tween higher verbosity and the Extended style in this corpus ( r s ( 1969 ) = . 613 , p < .001)), support- ing an intuition that more words would be found in Extended instructions. That this correlation is not stronger, however, may suggest that the verbosity metric is insufficient to capture critical elements of stylistic variation of structure. Number of words does not completely map onto the complexity or the “packed” nature of instructions. For example, the Minimal but highly verbose instruction from in corpus “continue down the hallway to the first entrance on the left first doorway on the left” is 16 words, but the Extended instruction “stop. take a picture” is only 4. 5 Stylistic Differences and Miscommunication A user’s utterance is classified as a miscommuni- cation if the following robot utterance is a request for clarification or indicates inability to comply; this occurred at least once in 216 (16%) of the in- structions in the corpus. We hypothesize that dif- ferent instruction styles will differ in their over- all rates of miscommunication, i.e., that miscom- munication rates are related to verbosity (H 1 ) and structure (H 2 ). If a scarcity of words leads to ambiguity or missing information, we would predict that ver- bosity and miscommunication rate would be neg- atively correlated. However, if more words simply yield more opportunities for erroneous or contra- dictory information, then we would predict a pos- itive correlation between verbosity and miscom- munication. We assessed this using binary logistic regression of verbosity on overall miscommunica- tion presence (H 1 ). Results revealed that miscom- munication significantly increases with verbosity (verbosity as a continuous independent variable, with model χ 2 = 55.94, p < .001, with Wald = 56.67, p < .001, Nagelkerke R 2 = .06) If having more intents in a single instruction leaves more opportunities for mistakes, then we would predict that greater use of Extended struc- ture would be positively related to miscommunica- tion rates. To examine this relationship, we com- pared overall miscommunication rates and use of different instruction structures (H 2 ). The over- all miscommunication rate for Minimal instruc- tions is 8%, while Extended is 18%; we confirmed that Extended instructions were significantly more likely to have miscommunication (structure as a categorical independent variable, with Chi-square test, χ 2 (1, N = 1969) = 40.91, p < .001). 5.1 Miscommunication Types While overall miscommunication rate differed significantly among styles, it would be useful to understand whether different styles are as- sociated with different types of miscommuni- cation, as this may inform what error-handing algorithms the system should favor. Follow- ing Higashinaka et al. ( 2015a ), we categorize miscommunication in the corpus according to Response-level and Environmental-level ontolo- gies. Though this ontology was designed to cat- egorize errors made by a virtual agent dialogue system, a number of categories are applicable to communicative errors committed by human users in situated dialogues. Figure 2 shows examples of user-miscommunication errors from the human- robot corpus and the robot response. Further ex- planation and expected relationships are below. Dialogue 5: Response, Missing Information U : leave the room and go to your left Robot : How far left should I go? Dialogue 6: Response, Unclear Instruction U : go into the center of the first room on your left and then take a picture at the west north east and south position Robot : Can you repeat the positions you’d like photos from? Dialogue 7: Environment, Ambiguous Reference U : can you move forward to take a picture of the object Robot : I’m not sure which object you are referring to. Can you describe it another way? Dialogue 8: Environment, Impossible Instruction U : move around the cone into the next room turn right ninety degrees Robot : There is not enough space for me to move around the cone and enter the room ahead. Would you like me to get as close as possible to the room? Dialogue 9: Environment, Misunderstand Capabili- ties U : okay turn to your right forty five degrees and turn on video camera Robot : Our bandwidth will currently not support the video feed. Would you like me to send a photo? Figure 2: Dialogues between Users (U) and Robot with different miscommunications types Response-level, Missing Information instruc- tions lack information required for execution (Di- alogue 5). We expected that lower verbosity in- structions would be more likely to lead to this type of miscommunication due to a lack of specificity, as measured by fewer words, e.g., “turn right” is missing how many degrees to turn. On the other hand, we might expect that Extended instructions would be more likely to lead to this type because with multiple intent comes a compounding poten- tial for lack of specificity. Response-level, Unclear Instructions are un- clear due to phrasing or order of information pre- sented (Dialogue 6). More verbose instructions were expected to be more prone to this type of mis- communication because more information, mea- sured as words, has a higher potential to be mis- construed (e.g., it is unclear if “north east” is “north” and “east” or “north-east”). However, in- creased information may provide additional con- text required for specification, the opposite rela- tionship. Due to compounding potential, we ex- pected Extended style would lead to more Unclear type errors. Environment-level, Ambiguous Reference in- structions include an ambiguous referent in the en- vironment, potentially due to a lack of common ground (Dialogue 7). We expected that lower ver- bosity instructions, with less information (words) would have more Ambiguous miscommunication (e.g., “go to the doorway” versus “go to the door- way furthest from you”). For Extended style, we hypothesize more Ambiguous type errors due to compounding potential. Environment-level, Impossible instructions are impossible to execute in the physical space in terms of distance and dimension (Dialogue 8). We expected that overspecified instructions (higher verbosity or Extended) might be more likely to be Impossible (e.g., in the more verbose, Extended instruction “move up two feet, turn right ninety degrees, move forward seven feet”, it is not pos- sible for the robot to move 7 feet after completing the first two actions). Environment-level, Misunderstood Capabilities instructions are those in which the user misunder- stands the robot’s capabilities (Dialogue 9). We expect verbosity and structure to affect Misunder- stood rates much as they affect Impossible mis- communication rates. Logistic regression revealed that verbosity does not significantly correlate with any type of mis- communication that occurred ( χ 2 = 4.89, p = .298). To examine this result in more detail, we conducted binomial logistic regression on each miscommunication type separately, asking, e.g., does verbosity predict whether the miscommuni- cation is of the Ambiguous type or not? None of these results were significant. With regard to structure, a Chi-square test showed a non-significant trend, suggesting there may be a possible influence of structure on mis- communication type ( χ 2 (4, N = 216) = 8.71, p = .065). We explored this result in more detail, look- ing at each miscommunication type separately, asking, e.g., does structure predict whether the miscommunication is of the Ambiguous type or not? Results were significant for Ambiguous mis- communication type ( χ 2 (1, N = 216) = 4.01, p = .045) and a trend toward significance for Unclear miscommunication type ( χ 2 (1, N = 216) = 3.34, p = .067) With Minimal styles, miscommunica- tions that arise are more likely to be Ambiguous type. With Extended styles, miscommunication that arise may tend to be Unclear type. Counts of miscommunication types for each structure style are shown in Figure 3 . Missing Unclear Ambiguous Impossible Capabilities 0 20 40 60 * 49 16 35 4 2 39 8 51 8 4 # miscommunication Extended Minimal Figure 3: Miscommunication types observed in structures style (* p < 0 . 05) 6 Factors related to Style Differences Knowing that stylistic differences are observed in unconstrained dialogue and the relationship of these differences to miscommunication rates, it is important to assess factors of these differences in the first place. We examine latent individual dif- ferences, as well as trust and interaction time with the robot, which may influence style. 6.1 Individual Differences A broad-use dialogue system can expect to receive instructions from different individuals. The dia- logue system must therefore be robust to a range of individuals who will bring different speaking styles to the interaction. We hypothesized that in- dividual users differ in their verbosity (H 3 ) and structure (H 4 ). We first examined whether individual user iden- tity predicted verbosity (H 3 ). The ANOVA as- sumption of homogeneity of variances was vio- lated, so a Kruskal-Wallis H test was used, sup- porting H 3 with significant difference in verbosity across individual participants ( χ 2 (19, N = 1969) = 422.53, p < .001). The most verbose user used an average of 15 words per instruction, and the least verbose used an average of 4 words. Chi-square tests revealed that individual users also vary in structure (H 4 ; χ 2 (19, N = 1969) = 511.70, p < .001). Figure 4 graphs the percent- age of structural style employed by users (sorted from smallest to largest percent of Extended us- age). Some users seem to simply prefer the Min- imal style (Users 1, 2, 3) while other users em- ployed a majority of Extended (Users 19, 20). Others are almost evenly split (Users 13, 14, 15). 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 0 20 40 60 80 100 98 97 96 89 87 85 83 74 73 72 70 60 53 47 41 37 37 35 32 29 2 3 4 11 13 15 17 26 27 28 30 40 47 53 59 63 63 65 68 71 Individual users % of structure style Extended Minimal Figure 4: Percent distribution of instruction structure between users (sorted smallest to largest Extended) 6.2 Trust in the Robot User trust in the robot may be a factor in how the user realizes their instructions, e.g., because the user may have different levels of confidence in the robot’s abilities. Users completed the Trust Per- ception Scale-HRI ( Schaefer , 2016 ) after M1, and again after M2. The Trust Perception Scale-HRI is a 40-item scale designed to measure an individ- ual’s subjective perception of trust in a robot. We hypothesized that trust in the robot would be related to verbosity (H 5 ) and structure (H 6 ). If reported trust is indicative of a user’s comfort with speaking more with the robot, and/or if trust is in- dicative of having higher confidence in the robot’s ability to process many words, then we would pre- dict a positive relationship between trust and ver- bosity. On the other hand, if trust scores reflect confidence that the robot will understand instruc- tions without need for additional words or expla- nations, then we would predict a negative relation- ship between trust and verbosity. To assess whether and how trust levels are re- lated to verbosity (H 5 ), we compared trust lev- els for a trial to the verbosity in that trial (there were not enough data points to control for individ- ual user ID in a regression). Spearman correlation was significant, with higher trust correlating with greater verbosity ( r s ( 38 ) = . 33 , p = .035). If higher trust scores indicate user confidence that the robot can understand, parse, and execute complex instructions, then we predict that more Extended instructions would be observed. To as- sess this relationship (H 6 ), trust levels measured for each trial were compared to the proportion of Extended instructions used in that trial. Spear- man correlation revealed a nonsignificant trend for higher trust to correlate with more use of the Ex- tended structure ( r s ( 38 ) = . 29 , p = .07). 6.3 Time and Experience As time passes and experience grows, people are known to interact differently with technology and with communication partners. We thus hypothe- size that time/experience with the robot would be related to verbosity (H 7 ) and structure (H 8 ), i.e., as the user progresses from Training to M1 and M2, instruction-giving style may change. If it is the case that users become more com- fortable or confident as they gain more experience, we predict that verbosity should increase over time/experience (H 7 ). Indeed, verbosity increased across trials from an average of 6.1 words in Train- ing, to 7.3 average words in M1, to 8.1 average words in MP2. A one-way repeated measures ANOVA was conducted to determine whether ver- bosity differed by trial (repeated measures analy- sis effectively controls for user ID). Trial was sig- nificantly related to verbosity, (F(2,38) = 13.45, p < .001), and post-hoc LSD t-tests indicated that each trial had significantly more verbose instruc- tions than previous trials (Training vs. M1 p = .003; Training vs. M2 p = .001; M1 vs. M2 p = .020). Figure 5 shows the percentage of structural style in each trial. There is a general upward trend in use of the Extended style as users engage in successive trials. A one-way repeated measures ANOVA was used to determine whether structure Training M1 M2 0 50 100 ** * 78 70 64 22 30 36 % of structure style Extended Minimal Figure 5: Percent distribution of instruction struc- ture between trials (* p < 0 . 05; ** p < 0 . 01) usage differed by trial (H 8 ). Results showed sig- nificant differences among trials (F(2,38) = 8.26, p = .001), with post-hoc LSD t-tests revealing greater Extended structure use in M1 and M2 as compared to Training (Training vs. M1 p = .014; Training vs. M2 p = .002). Structural usage be- tween M1 and M2 was not significantly different (M1 vs. M2 p = .190). 7 Discussion 7.1 Miscommunication Styles differ in the overall frequency of miscom- munication they engender, but these differences are not consistent across all miscommunication types. Among miscommunication-producing in- structions, we found no correlation between ver- bosity and what type of miscommunication was produced (H 1 ). Future analyses that look at ad- ditional linguistic features may help reveal what is happening at a level of specificity beyond a sim- ple word count. We can speculate that this may be because user misunderstandings of the robot or environment exist regardless of how many words it takes the user to express these misunderstand- ings (Impossible, Misunderstood Capabilities), or because Ambiguous, Unclear, or Missing Infor- mation miscommunication can result either from too few words, or from cases where the participant adds more words and commits more miscommuni- cation with those words. This raises the question of what it is that is being added with more ver- bose instructions, if not clarification information. Future research can aim to address this. Our analyses revealed an effect of structure on miscommunication types (H 2 ). Minimal structure had a greater tendency to yield Ambiguous mis- communications. This may be because additional intents in an instruction offer opportunities to cor- rect ambiguity in the first intent. For example, if the robot is told to go through the door and take a photo of the chair, the robot can use the presence or absence of a chair to settle any am- biguity about which door to go through. With- out the additional intent packed into the instruc- tion, this would remain ambiguous. Extended style additionally showed a nonsignificant trend toward yielding more response-level Unclear mis- communication types, which may result because Extended instructions are packed, sequentially- ordered instructions and thus have the ability to introduce miscommunication in the order of infor- mation presented. Missing Information, Impossi- ble Instructions, and Misunderstood Capabilities were not significantly related to structure. These miscommunication types might not arise from the structural style, but instead stem from a funda- mental misunderstanding on the user-end. Further analysis of the content of the instructions, rather than only the structure, may uncover if content is a factor. 7.2 Individual Differences Our analysis revealed that latent differences among individuals appear to yield differences in verbosity and structure style (H 3 and H 4 ). Fu- ture analysis may aim to identify these latent dif- ferences. Possibilities include variations in po- tential for introspection, personality, perspective- taking ability, and other differences. Regardless of the underlying factors that cause individual dif- ferences, dialogue systems must be robust to a range of individuals who bring with them differ- ent stylistic tendencies. 7.3 Varying Degrees of Trust We found that higher trust was related to higher verbosity. We speculate that this may be because when a user trusts in the robot’s competence and capabilities, they are more likely to feel comfort- able enough to speak more and be confident that the robot can parse longer instructions. Users’ propensity for trust was not measured during the experimental collection, which may be an unob- served factor in this analysis. Future analysis will incorporate this additional information about users’ latent traits. If trust is measured in questionnaires, or gauged by other means, this information could be incorpo- rated as feedback for the dialogue system to appro- priately adjust dialogue management strategies; as the users’ trust in the robot is gauged during an interaction, the system will know to expect ad- justments to verbosity and structure, so it can of- fer more appropriate and tailored responses to the user’s style. Furthermore, providing feedback that encourages trust (or discourages it) may be a gen- tle, minimally obtrusive way of guiding a user to employ a different style to avoid particular mis- communication types, if working with a less ro- bust dialogue system. 7.4 Effect of Interaction Time Users increased their verbosity (H 7 ) and use of the Extended style (H 8 ) when progressing from Train- ing to M1 (and verbosity again when progress- ing from M1 to M2). We speculate that starting with lower verbosity and Minimal style during the Training trial might suggest users initially are hes- itant or do not have a strong sense for the robot’s language processing capabilities. Users may be learning from the training and growing in comfort level over interaction time and experience with the robot, and are willing to use more verbose or Ex- tended instructions in successive trials. Another possible explanation might be that users face a more difficult task in the main trials as compared to training; when pressed for time in a more chal- lenging task, users may use more words and be more prone to combine intents together. Future studies can aim to disentangle these effects. We observed an increase in Extended style use between M1 and M2, but it did not reach statistical significance. This might suggest that any learning or strategy convergence in terms of structure that occurred from training to M1 may have mostly set- tled by M1. It is possible that future work with a greater sample size will reveal that Extended style use continues to grow across trials. An under- standing of interaction time or experience effects can be incorporated in the dialogue system to bet- ter support the change of user styles that emerge with repeated interactions. 8 Conclusion and Future Work This paper defines two classes of stylistic differ- ences: verbosity and structure, and examines these styles in a corpus of human-robot dialogue with no constraints on how robot-directed instructions were formulated. We show that stylistic differ- ences are linked to different rates and types of mis- communication (H 2 ), that latent individual differ- ences exist (H 3 and H 4 ), and that there is a rela- tionship between style and trust (H 5 and H 6 ), and style and interaction time (H 7 and H 8 ). By understanding the effects of stylistic differ- ences used in instruction-giving, we are posed to implement adjusting dialogue systems to the ex- pectations of styles to increase user interaction and system performance. Tapus et al. ( 2008 ) has shown that users prefer a robot that tailors en- couragement strategies according to users’ per- sonality (introverted or extroverted). Torrey et al. ( 2006 ) found that users prefer robots that tailor their speech to the human’s level of expertise. We posit that dialogue systems could similarly be crafted to support and interact with different ver- bosity and structural styles. Future dialogue sys- tems might adjust to the verbosity style by, for ex- ample, providing system feedback in more or less verbose styles, which may make the interaction feel more like a natural conversation. A system can adjust to the structural style by providing in- cremental feedback to users issuing Extended in- structions to capture miscommunications early, as well as provide feedback that the system under- stood the compound instruction. The monitoring of trust and interaction time can be incorporated as feedback for the dialogue system to offer more appropriate responses or attentive repair strategies in advance of miscommunication being made. This investigation of style warrants further turn- by-turn analysis to better understand where style shift occurs during an interaction, and why particu- lar styles are subject to increased rates of miscom- munication. A future robot may be able to propose alternate courses of action for certain miscommu- nication types (e.g., the suggestion to offer the user a picture of the room in Dialogue 9). These propositions may be difficult for other miscommu- nication styles, which require contextual, environ- ment information and specification directly from the user. Future work will investigate these alter- native suggestions to study if a users’ style would shift around the alternate action (e.g., reducing Minimal structure usage for Ambiguous instruc- tions), or if the user would adapt the alternate ac- tion into their own style (e.g., continuing to use Minimal, but not repeating the same mistake). References Kathleen M Eberhard, Hannele Nicholson, Sandra K ̈ ubler, Susan Gundersen, and Matthias Scheutz. 2010. The Indiana “Cooperative Remote Search Task” (CReST) Corpus. In Proceedings of the In- ternational Conference on Language Resources and Evaluation . Ryuichiro Higashinaka, Kotaro Funakoshi, Masahiro Araki, Hiroshi Tsukahara, Yuka Kobayashi, and Masahiro Mizukami. 2015a. Towards Taxonomy of Errors in Chat-oriented Dialogue Systems. In Pro- ceedings of the Special Interest Group on Discourse and Dialogue . pages 87–95. Ryuichiro Higashinaka, Masahiro Mizukami, Kotaro Funakoshi, Masahiro Araki, Hiroshi Tsukahara, and Yuka Kobayashi. 2015b. Fatal or Not? Finding Errors that Lead to Dialogue Breakdowns in Chat- Oriented Dialogue Systems. In Proceedings of Em- pirical Methods for Natural Language Processing . pages 2243–2248. Kris Liu, Jean E Fox Tree, and Marilyn A Walker. 2016. Coordinating Communication in the Wild: The Artwalk Dialogue Corpus of Pedestrian Navi- gation and Mobile Referential Communication. In Proceedings of the International Conference on Language Resources and Evaluation . Andy L ̈ ucking, Kirsten Bergmann, Florian Hahn, Ste- fan Kopp, and Hannes Rieser. 2010. The Biele- feld Speech and Gesture Alignment Corpus (SaGA). In Proceedings of the International Conference on Language Resources and Evaluation workshop: Multimodal corpora–advances in capturing, coding and analyzing multimodality . Matthew Marge, Claire Bonial, Ashley Foots, Cory Hayes, Cassidy Henry, Kimberly Pollard, Ron Art- stein, Clare Voss, and David Traum. 2017. Ex- ploring Variation of Natural Human Commands to a Robot in a Collaborative Navigation Task. In Proceedings of the First Workshop on Language Grounding for Robotics . pages 58–66. Matthew Marge and Alexander Rudnicky. 2011. The TeamTalk Corpus: Route Instructions in Open Spaces. In Proceedings of the Workshop on Ground- ing Human-Robot Dialog for Spatial Tasks. . Matthew Marge and Alexander Rudnicky. 2015. Mis- communication Recovery in Physically Situated Di- alogue. In Proceedings of the Special Interest Group on Discourse and Dialogue . pages 22–31. Tim Paek and Eric Horvitz. 2000. Conversation as Ac- tion under Uncertainty. In Proceedings of the Six- teenth conference on Uncertainty in artificial intel- ligence . Morgan Kaufmann Publishers Inc., pages 455–464. Aaron Powers, Sara Kiesler, Susan Fussell, and Cristen Torrey. 2007. Comparing a Computer Agent with a Humanoid Robot. In Proceedings of Human- Robot Interaction (HRI), 2007 2nd ACM/IEEE In- ternational Conference on . IEEE, pages 145–152. Kristin E Schaefer. 2016. Measuring Trust in Human Robot Interactions: Development of the “Trust Per- ception Scale-HRI”. In Proceedings of Robust Intel- ligence and Trust in Autonomous Systems , Springer, pages 191–218. Laura Stoia, Darla Magdalena Shockley, Donna K By- ron, and Eric Fosler-Lussier. 2008. SCARE: a Situ- ated Corpus with Annotated Referring Expressions. In Proceedings of the International Conference on Language Resources and Evaluation . Adriana Tapus, Cristian T ̧ ̆ apus ̧, and Maja J Matari ́ c. 2008. User-Robot Personality Matching and Assis- tive Robot Behavior Adaptation for Post-Stroke Re- habilitation Therapy. In Intelligent Service Robotics 1(2):169. Cristen Torrey, Aaron Powers, Matthew Marge, Su- san R Fussell, and Sara Kiesler. 2006. Ef- fects of adaptive robot dialogue on information ex- change and social relations. In Proceedings of the First ACM SIGCHI/SIGART Conference on Human- Robot Interaction . Association for Computing Ma- chinery, pages 126–133. David Traum, Cassidy Henry, Stephanie Lukin, Ron Artstein, Felix Gervits, Kimberly Pollard, Claire Bonial, Su Lei, Clare Voss, Matthew Marge, Cory Hayes, and Susan Hill. 2018. Dialogue Structure Annotation for Multi-Floor Interaction. In Proceed- ings of the International Conference on Language Resources and Evaluation .