Eliminating the Hole in Media Space Communication

Anuj Gujar (1), Jeremy R. Cooperstock (1,4), Koichiro Tanikoshi (2), William Buxton (1,3)

(1)Telepresence Project
Department of Computer Science
University of Toronto
Toronto, Ontario
Canada, M5S 1A4
{anujg, jer, willy}@dgp.toronto.edu

(2)Hitachi Research Laboratory
Hitachi Limited
7-1-1 Omika-cho, Ibaraki, Japan

(3)Alias | Wavefront
110 Richmond Street
Toronto, ON M5C 1P1


KEYWORDS
Media space, speech interfaces, navigation


ABSTRACT
Imagine placing a telephone call and not connecting with your party until they dialed your number as well. This absurd situation is actually the status quo when visitors try to contact a media space from another site using an audio-video modem (codec).

When visitors contact a media space without pre-arranging their calls, they receive, at best, a preset view of the environment and, at worst, no view at all. In either case, the visitor is stranded. To avoid falling into this "hole" of media space communication, visitors must pre-arrange their calls with a local attendee through, for example, e-mail. Both the visitor and the local attendee must then connect to the modem at a specified time in order to videoconference. The Audio Video Server Attendant overcomes this limitation by providing visitors with the ability to navigate independently through a media space.


INTRODUCTION
The University of Toronto media space [1][5] is a desktop videoconferencing environment that provides an audio and video channel through which individuals using the same hardware and software [2] can communicate. However, when people from other sites contact us using an audio-video modem (codec), they have no ability to navigate independently, for instance, to electronically enter one of our offices. Instead, they must rely on the presence of a local attendee to mediate communication.

Our solution to this problem, the Audio Video Server Attendant (AVSA), is an automated attendant that instructs visitors on available commands and processes these commands as they are received. The attendant allows visitors to see if individuals are present and if so, to contact them directly. The AVSA caters to the lowest common denominator in user capabilities, including those without computers. Hence, the visitor is only required to use the traditional videoconferencing equipment already in place to support human-human communication, namely a camera, microphone and monitor. Using these devices, the AVSA obtains input through speech and provides output through video overlay, as shown in Figure 1.
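The paper gives no implementation detail, but the attendant's behaviour can be sketched as a simple recognize-and-dispatch loop. The sketch below is hypothetical Python: `recognize_speech` and `draw_overlay` are invented stand-ins for the speech recognizer and the video-overlay hardware, and the registered commands are illustrative only.

```python
# Hypothetical sketch of an AVSA-style command loop; the original
# system's internals are not described in code in the paper.

def recognize_speech():
    # Stand-in: would return one recognized word from the audio channel.
    return input("say> ").strip().lower()

def draw_overlay(lines):
    # Stand-in: would render text over the outgoing video signal.
    print("\n".join(lines))

COMMANDS = {}

def command(word):
    """Register a handler for a spoken menu option."""
    def register(fn):
        COMMANDS[word] = fn
        return fn
    return register

@command("anuj")
def visit_anuj():
    # Illustrative action: route the caller to one office.
    return "connecting you to Anuj's office"

@command("help")
def show_help():
    # The registry makes the menu self-describing.
    return "options: " + ", ".join(sorted(COMMANDS))

def attendant_loop():
    # Greet the caller with the full option list, then dispatch
    # each recognized word to its handler.
    draw_overlay(["Welcome to the media space.",
                  "Say one of: " + ", ".join(sorted(COMMANDS))])
    while True:
        word = recognize_speech()
        handler = COMMANDS.get(word)
        if handler is None:
            draw_overlay(["Unrecognized option: " + word])
        else:
            draw_overlay([handler()])
```

A registry of this kind keeps the spoken vocabulary and the displayed menu in step: the help handler enumerates exactly the words the recognizer will accept.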


AUTOMATED ATTENDANTS
Automated attendants are already prevalent in many applications, as discussed by Schmandt [7]. For example, interactive voice response (IVR) systems commonly answer our calls with menus of the form, "press 1 for customer service, 2 for store hours, 3 for catalogue information, etc."

These attendants are typically controlled by the touch-tones generated from the telephone keypad. However, this control mechanism does not transfer to the codec situation: codecs may not have keypads, and certainly do not have the standardized touch tones of telephony. Regardless, the visitor should not be burdened with using a keypad, or worse still, a computer, assuming they even have one. We note that speech is a common denominator in all videoconference communication and thus use it to provide a natural interaction mechanism far better suited to the menu selection task. This is supported by previous research showing that speech is the preferred input modality when the task involves short, interactive communication [6].

Two further problems of automated attendants stem from their use of audio as the sole medium of interaction: the time required to listen to the message of options in its entirety, and the cognitive load of remembering the options and the action associated with each. Both of these problems can be solved by exploiting the extra communication channel afforded through video. By presenting the menu of available options graphically, rather than sequentially by voice, the options are visible simultaneously and instantaneously, as shown in Figure 1. Furthermore, the options are displayed continuously, remaining on screen until an action is initiated. Hence, the shortcomings of presenting the options through the audio channel are eliminated by transferring the burden to the better-suited visual channel.
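To make the contrast concrete, the hypothetical Python sketch below renders the same option list both ways: as the sequential spoken prompt an IVR system would play, and as a single overlay frame in which every option is visible at once. The option list is borrowed from the IVR example above; the function names are invented for illustration.

```python
# Contrasting sequential (audio) versus simultaneous (visual)
# presentation of the same menu; names are illustrative only.

OPTIONS = ["customer service", "store hours", "catalogue information"]

def ivr_prompt(options):
    # Sequential: the caller must listen to the whole utterance
    # and remember each option-to-digit pairing.
    return "; ".join("press %d for %s" % (i, opt)
                     for i, opt in enumerate(options, 1))

def overlay_menu(options):
    # Simultaneous: every option on screen at once, and it stays
    # displayed until the caller acts.
    return "\n".join('"%s"' % opt for opt in options)
```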


USING THE AVSA
When visitors connect to a media space that has an AVSA, they are presented with a graphical menu of options, as shown in Figure 1. The AVSA incorporates a simple speech recognition system that enables incoming callers to make selections from a menu hierarchy. These options allow visitors to navigate through our media space, visiting various offices, accessing virtual receptionist services, and interacting with equipment such as video cameras. For example, as shown in Figure 2, the visitor can change seats or control the VCR. These abilities provide the visitor with a greater sense of face-to-face communication.
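A menu hierarchy of this kind can be modelled as a nested structure in which each recognized word either descends into a submenu or triggers an action. The following sketch is illustrative Python; the menu entries and action strings are invented, not the actual AVSA menus.

```python
# Illustrative menu hierarchy for speech-driven navigation; the
# entries and action strings are hypothetical examples.

MENU = {
    "visit": {                       # navigate to a person's office
        "anuj": "connect:anuj",
        "jeremy": "connect:jeremy",
    },
    "equipment": {                   # control local devices
        "vcr": "device:vcr",
        "camera": "device:camera",
    },
    "reception": "service:receptionist",
}

def select(menu, words):
    """Walk the hierarchy one recognized word at a time.

    Returns the action string if the path ends at a leaf, the
    submenu dict if more choices remain, or None for an
    unrecognized word.
    """
    node = menu
    for word in words:
        if not isinstance(node, dict) or word not in node:
            return None
        node = node[word]
    return node
```

For example, `select(MENU, ["visit", "anuj"])` yields the leaf action, while an unrecognized word yields `None`, at which point the attendant can simply re-display the current options.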


FUTURE WORK
The AVSA is already a promising system, and we are currently pushing it in several directions to expand its utility. One such direction is the addition of a video mail function, similar to the SpeechActs research [8]. This would permit remote visitors to leave audio/video messages for people in the local media space who are busy or unavailable.

So far, we have only used audio as an input mechanism for remote control of a media space. We are also investigating the use of video, specifically gesture, as a natural input mechanism for tasks such as control of devices. A working example is the head tracking system [3], which uses the visitor's head position to control a motorized camera.
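As an illustration of how a tracked head position might drive a motorized camera, the sketch below maps a head centre in the image to pan and tilt angles. This is a hypothetical Python sketch: the frame size, gains, and motion ranges are invented, and the actual system [3] is not described at this level of detail.

```python
# Hypothetical mapping from a tracked head position to camera
# pan/tilt; all constants are invented for illustration.

PAN_MAX = 45.0    # degrees of camera pan, either direction
TILT_MAX = 20.0   # degrees of camera tilt, either direction

def head_to_camera(x, y, frame_w=320, frame_h=240):
    """Map a head centre (x, y) in image pixels to (pan, tilt) degrees.

    The head's offset from the frame centre, as a fraction of the
    half-frame, is scaled linearly into the camera's motion range.
    """
    fx = (x - frame_w / 2) / (frame_w / 2)   # -1 .. 1 across the frame
    fy = (y - frame_h / 2) / (frame_h / 2)
    pan = fx * PAN_MAX
    tilt = -fy * TILT_MAX                    # image y grows downward
    return pan, tilt
```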


CONCLUSIONS
Current interfaces to media spaces do not provide visitors with the ability to independently initiate and control desktop videoconferencing. The AVSA overcomes this limitation by providing a widely accessible mechanism for navigation, just as telephones do for audio conversations, thereby eliminating the "hole" in media space communication.


DEMONSTRATION
The AVSA will be available for use as of February 1. Codec users may dial the system at either of the following numbers: +1-416-971-2095 or +1-416-971-2096.


ACKNOWLEDGMENTS
This research resulted from the Ontario Telepresence Project. The authors thank all those who contributed to the project, either technically or through their generous support. Special thanks to Marilyn Mantei and Kimiya Yamaashi for their helpful comments and insightful discussion.


REFERENCES
1. Buxton, W. Integrating the Periphery and Context: A New Model of Telematics. Proceedings of Graphics Interface 1995 (GI'95), (Quebec, May 17-19), Canadian Human-Computer Communications Society, pp. 239-246.

2. Buxton, W. and Moran, T. EuroPARC's Integrated Interactive Intermedia Facility (iiif): Early Experience. In S. Gibbs & A. A. Verrijn-Stuart (Eds.), Multi-user Interfaces and Applications, Proceedings of the IFIP WG 8.4 Conference on Multi-user Interfaces and Applications, Heraklion, Crete. Amsterdam: Elsevier Science Publishers B.V. (North-Holland), 1990, pp. 11-34.

3. Gaver, W., Smets, G., and Overbeeke, K. A Virtual Window on Media Space. Proceedings of Human Factors in Computing Systems 1995 (CHI'95), (Denver, May 7-11), ACM Press, pp. 257-264.

4. Kelly, P. H., Katkere, A., Kuramura, D. Y., Moezzi, S., Chatterjee, S., Jain, R. An Architecture for Multiple Perspective Interactive Video. Proceedings of ACM Multimedia 1993, pp. 201-212.

5. Mantei, M., Baecker, R., Sellen, A., Buxton, W., Milligan, T., and Wellman, B. Experiences in the use of a media space. Proceedings of CHI'91, ACM Conference on Human Factors in Computing Systems, pp. 203-208. Reprinted in D. Marca & G. Bock (Eds.), 1992, Groupware: Software for Computer-Supported Collaborative Work. Los Alamitos, CA: IEEE Computer Society Press, pp. 372-377.

6. Martin, G. L. The utility of speech input in user-computer interfaces. International Journal of Man-Machine Studies, Vol. 30, 1989, pp. 355-375.

7. Schmandt, C. Phoneshell: The Telephone as Computer Terminal. Proceedings of ACM Multimedia 1993, pp. 373-382.

8. Yankelovich, N., Levow, G., and Marx, M. Designing SpeechActs: Issues in Speech User Interfaces. Proceedings of Human Factors in Computing Systems 1995 (CHI'95), (Denver, May 7-11), ACM Press, pp. 369-376.


Figure 1: The initial AVSA menu offers a selection of people with whom the user can visit. Selections are made by uttering the desired option enclosed in quotes.

Figure 2: While connected, visitors can control equipment (VCR control and PIP control) and change their electronic view (head tracking and seat changing).