Designing Speech, Acoustic, and Multimodal Interactions

CHI 2017 Workshop


Our workshop aims to establish speech, audio, and multimodal interaction as a well-defined area of study within HCI, leveraging current engineering advances in automatic speech recognition (ASR), natural language processing (NLP), text-to-speech synthesis (TTS), multimodal processing, gesture recognition, and brain-computer interfaces (BCI). In return, advances in HCI can contribute to creating processing algorithms that are informed by, and better address, the usability challenges of such interfaces.

We also aim to increase cohesion among research currently dispersed across many areas, including HCI, wearable design, ASR, NLP, BCI complementing speech, EMG interaction, and eye-gaze input. Our hope is to energize the CHI and engineering communities to push the boundaries of what is possible with wearable, mobile, social robot, and pervasive computing, and to make advances in each of the respective communities. As an example, the recent significant breakthroughs in deep neural networks are largely confined to audio-only features, while there is a significant opportunity to incorporate other features and context (such as multimodal input for wearables) into this framework. We anticipate this can only be accomplished through closer collaboration between the speech and HCI communities.

Our ultimate goal is to cross-pollinate ideas from the activities and priorities of different disciplines. With its unique format and reach, a CHI workshop offers the opportunity to strengthen future approaches and unify practices moving forward. The CHI community can host researchers from other disciplines with the goal of advancing multimodal interaction design for wearable, mobile, and pervasive computing. The organizing committee for this workshop (the author list) is living proof that CHI is the most appropriate venue for initiating such interdisciplinary collaborations.


We propose to build upon the discussions started during our lively and highly engaging panel on speech interaction held at CHI 2013 [6], which was followed by two successful workshops (20+ participants each) on speech and language interaction, held at CHI 2014 and 2016 [7]. Additionally, a course on speech interaction offered at CHI for the past six years [5] by two of the co-authors of the present proposal has always been well attended. As such, we propose here to broaden the scope of these community-building activities to all forms of human acoustic communication (audio and speech/language), and to all types of interfaces for which such communication may be best suited: desktops, mobiles, wearables, personal assistant robots, and smart home devices. We propose several topics for discussion and activities:

  • What are the important challenges in using speech as a “mainstream” modality? Speech is increasingly present in commercial applications – can we characterize which other applications speech is suitable for or has the highest potential to help with?

  • What interaction opportunities are presented by the rapidly evolving mobile, wearable, and pervasive computing areas? How and how much does multimodal processing increase robustness over speech alone, and in what contexts?

  • Can speech and multimodal interaction increase the usability and robustness of interfaces and improve the user experience beyond input/output?

  • What can the CHI community learn from ASR, TTS, and NLP research, and in turn, how can it help these communities improve the user acceptance of such technologies? For example, what should we be asking them to extract from speech besides words/segments? How can work in context and discourse understanding or dialogue management shape research in speech and multimodal UIs? And can we bridge the divide between the evaluation methods used in HCI and the AI-style batch evaluations used in speech processing?

  • How can UI designers make better use of the acoustic-prosodic information in speech beyond simple word recognition – for example, for emotion recognition or for identifying users' cognitive states? How can this be translated into the design of empathic voice interfaces?

  • What are the usability challenges of synthetic speech? How can expressiveness and naturalness be incorporated into interface design guidelines, particularly in mobile or wearable contexts where text-to-speech could potentially play a significant role in users' experiences? And how can this be generalized to designing usable UIs for mobile and pervasive (in-car, in-home) applications that rely on multimedia response generation?

  • What are the opportunities and challenges for speech and multimodal interaction with regard to the spontaneous access to information afforded by wearable and mobile devices? And can such modalities facilitate access in a secure and personal manner, especially since mobile and wearable interfaces raise significant privacy concerns?

  • Are there particular challenges when interacting with emerging devices such as smart home / ambient personal assistants (e.g. Amazon Echo) or when interacting with social robots?

  • What are the implications for the design of speech and multimodal interaction presented by new contexts for wearable use, including hands-busy, cognitively demanding situations and perhaps even unconscious and unintentional use (in the case of body-worn sensors)? Wearables may have form factors that verge on being ‘invisible’ or inaccessible to direct touch. Such reliance on sensors requires clearer conceptual analyses of how to combine active input modes with passive sensors to deliver optimal functionality and ease of use. And what role can understanding users' context (hands or eyes busy) play in selecting the best modality for such interactions or in predicting user needs?