Combining Technology through the Audio/Video Server Attendant

Anuj Gujar (1), Koichiro Tanikoshi (2), Jeremy Cooperstock (1,4), Shahir Daya, William Buxton (1,3)

(1) Telepresence Project
Department of Computer Science
University of Toronto
Toronto, Ontario
Canada M5S 1A4
{anujg, jer, willy}

(2) Hitachi Research Laboratory
Hitachi Limited
7-1-1, Omika-cho
Hitachi, Ibaraki, Japan

(3) Alias | Wavefront
110 Richmond Street
Toronto, Ontario
Canada M5C 1P1


Combining technology, speech interfaces, speech input, teleconferencing, ubiquitous computing


Traditional videoconferencing enables the exchange of thoughts without requiring participants to be physically present at the same location. Over time it has evolved into a forum for creativity, allowing attendees to interact with one another with the aid of many different media. A major problem, however, is that remote attendees, i.e., those not located at the site of the conference, have no way to interact with the available technology, such as cameras, monitors, and VCRs. It is our contention that this shortcoming contributes to a lack of engagement on the part of the remote attendees, significantly detracting from the quality of their experience.

To alleviate this problem, we developed the Audio/Video Server Attendant (AVSA), which provides remote attendees control over our media space, using speech as input.


Among the human aspects of technology, the most pervasive theme seems to be the convergence of media and the "information highway." What is elusive is any discussion of the topic that goes beyond rather simple models and guesswork. While it is generally agreed that applications will spell the success or failure of such network-based services, curiously little research seems to have been undertaken in this area. What exists is generally narrowly focused, and is typically idiosyncratic to the delivery appliance.

Hence, if the appliance is a television, the applications discussed are video on demand (VOD) or home shopping. If the appliance is a computer, the applications are things like e-mail, hypertext, and the World Wide Web. And if a telephone is the appliance? It is mostly taken for granted and seldom enters into the conversation. Few, if any, general models exist that rise above the individual appliances and that both contain and stimulate the discourse.

Our frustration with this situation is fueled by our collective experience over the past eight years of actually living with these technologies, first at Rank Xerox EuroPARC [2], and more recently within the context of the Ontario Telepresence Project at the University of Toronto [9]. This experience has had two main effects. First, it has convinced us that there really are powerful and valuable applications that can emerge from this convergence, and second, that some of the most interesting applications are not obvious and will not emerge from the prevalent superficial investigation of potential. One successful result is Xerox's Portholes application [5] and its successor developed at the University of Toronto [1][8].


One way to break set with the prevalent ways of thinking about the potential of these emerging technologies is through the introduction of new and rich examples, or reference points. Therein lies the motivation for the work described in this paper and other work from our lab [11]. Our intent has been to build upon the infrastructure and insights gained through the Telepresence Project, and develop an application that cuts across the functionality normally associated with each of the currently discussed delivery appliances: the television, personal computer, and the telephone.

What we have developed could be called a "video server." But it is not a video server in the way normally thought of. For example, it is not designed to provide a virtual video store on the end of a wire. Rather, it is a voice-activated server that facilitates browsing through electronic hyperdocuments, supports human-human communication, and supports video (actually demo) on demand. Each, as shown in Figure 1, is reminiscent of the computer-based World Wide Web, the telephone, and the television, respectively. Yet together, they present something quite unlike anything in the current mind-set; something that helps us stretch our thinking about these issues. We call the server the Audio/Video Server Attendant (AVSA).

Our hope is that the work described provides not only examples of a new class of application and how media can be combined, but also demonstrates an approach to unveiling other such applications.


The core of our media space environment is, in telephone terminology, an audio/video private branch exchange (PBX). The system, called iiif, is a computer-controlled switch that routes audio and video around the building. As described in other works [1][7][10], this gives us a high degree of connectivity. Connectivity with other iiif sites is straightforward, since iiif servers can register with each other and negotiate cross-site calls using ISDN lines and video codecs. But there are relatively few such sites. The problem emerges when you want connectivity to people outside the building, for example when someone with a conventional codec wants to contact me at my desk. Addressing how they can do so without calling ahead by e-mail or telephone was the catalyst for the AVSA.
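
The iiif switch itself is proprietary, so the following is only a minimal sketch of the routing role such a system plays; the class and method names here are hypothetical, not the actual iiif interface. The essential behavior is that of a crossbar: each (medium, destination) pair receives at most one source, while a source may fan out to many destinations.

```python
class AVSwitch:
    """Toy model of a computer-controlled audio/video crossbar.

    Each (medium, destination) pair is fed by at most one source;
    a single source may be routed to many destinations at once.
    """

    def __init__(self):
        self._routes = {}  # (medium, dest) -> source

    def connect(self, medium, source, dest):
        # Routing a new source to a destination implicitly replaces
        # whatever was previously feeding it.
        self._routes[(medium, dest)] = source

    def disconnect(self, medium, dest):
        self._routes.pop((medium, dest), None)

    def source_for(self, medium, dest):
        return self._routes.get((medium, dest))


switch = AVSwitch()
switch.connect("video", "window-camera", "codec-1")
switch.connect("audio", "room-mic", "codec-1")
print(switch.source_for("video", "codec-1"))  # window-camera
```

A server process mediating requests on top of such a switch is what gives clients like the AVSA their connectivity.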

Previously, when a codec called an iiif site such as the Toronto Telepresence Project, what the caller got -- at best -- was a connection to the server and a view out of the window camera. With the AVSA, the intent is to present callers with a graphical menu of options, overlaid on the view from the window camera. The AVSA incorporates a simple speech recognition system, which enables incoming callers to navigate through the menu options. The services available ideally encompass connecting to individuals (including the physical receptionist), messaging services, and on-line services such as demo-on-demand or product/service information.

Once the connection has been established, the AVSA can then be used to mediate transactions during the conversation. For example, if I begin our conversation looking at you, but subsequently want to look at a document on your desk (real or virtual), the same technology that enabled me to connect to you in the first place also supports this gaze redirection.

We now sketch out the main parts of the system by way of overview, then present the technology and experience in more detail.


Our goal was to enable visitors to interact with our media space, while not incurring a high cost, either in terms of cognitive burden or actual computer equipment. Basically, we required an automated attendant that would instruct the visitor on available commands, and process these commands as they are received.

Automated Attendants

Automated attendants are common in telephony. All of us are used to menus presented by a synthetic voice, of the form "dial 1 for x, 2 for y, 3 for z," and so on. These attendants are activated by tones generated with the telephone keypad. This does not transfer to the codec situation: codecs may not even have keypads, and they certainly do not have the standardized touch tones of telephony.

Why Speech?

Returning to human modes of interaction, we soon realized that visitors with codecs already make use of two input devices (microphone and camera) and two output devices (speaker and monitor) to communicate with members of the media space. The technology already supported human-human interaction through gesture and, to a greater extent, speech. We decided to concentrate on this latter mode of communication, using a speech recognition system, to provide the most natural control mechanism available.

Reliability, however, may be hampered by inconsistent background noise, variation in speaker utterances, and room acoustics. This problem can be reduced significantly by limiting the size of the vocabulary. In this case, we can obtain reliable speaker-independent interaction, thereby providing universal access to the system. The vocabulary we have chosen consists of the digits "1" through "5" for variable menu selections, and the words "disconnect," "page up," "page down," "show menu," "previous menu," and "hide" for the generic actions.
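
The benefit of a small fixed vocabulary can be illustrated with a sketch (this is an assumption about how such matching might be done, not the recognizer the AVSA actually uses): constrain the recognizer's raw hypothesis to the nearest vocabulary entry, and reject anything that is not close enough rather than guess.

```python
import difflib

# The AVSA vocabulary as described in the text.
VOCABULARY = ["1", "2", "3", "4", "5",
              "disconnect", "page up", "page down",
              "show menu", "previous menu", "hide"]

def match_command(hypothesis, cutoff=0.6):
    """Map a raw recognizer hypothesis onto the fixed vocabulary.

    Returns the closest vocabulary entry, or None when nothing is
    similar enough. Rejecting, rather than guessing, is what keeps
    the interaction reliably speaker-independent.
    """
    matches = difflib.get_close_matches(hypothesis.lower().strip(),
                                        VOCABULARY, n=1, cutoff=cutoff)
    return matches[0] if matches else None

print(match_command("show menue"))  # show menu
print(match_command("xyzzy"))       # None
```

With only eleven candidates, even a crude similarity measure discriminates well; the same rejection threshold applied to a large vocabulary would fail far more often.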

While speech serves as a natural input mechanism from human to computer, there are two problems with respect to its use for output from computer to human (i.e. to provide a menu of services to visitors): speech output is serial, so presenting a list of options takes time, and it is transient, so visitors must remember the options once spoken.


Again, our solution comes from technology already available to visitors: their video monitors. We provide the menu using computer-generated text, overlaid (using a genlock device) on top of the video image of our media space. Overlaying, rather than replacing the video with a menu, serves an important purpose. The disruptive effect of hiding the media space from our visitors is minimized, thereby providing a greater sense of engagement.

The effect of video overlay is much like credits in a movie. Menus too large to fit on a single screen are distributed over multiple pages, with next/previous page selections available as menu items.
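
The paging behavior can be sketched as follows (a minimal illustration, not the AVSA's actual implementation; menu item names are invented): split a long menu into pages and append next/previous page selections as ordinary menu items, as described above.

```python
def paginate(options, per_page=5):
    """Split a long menu into pages of at most `per_page` options,
    appending next/previous page selections as ordinary menu items."""
    pages = [options[i:i + per_page] for i in range(0, len(options), per_page)]
    rendered = []
    for n, page in enumerate(pages):
        items = list(page)
        if n > 0:
            items.append("Previous page")
        if n < len(pages) - 1:
            items.append("Next page")
        rendered.append(items)
    return rendered


pages = paginate(["Call receptionist", "Call Bill", "Demo on demand",
                  "Product info", "Leave a message", "Change seat"])
# Two pages: the first ends with "Next page",
# the second with "Previous page".
```

Because every page stays small, each fits in the overlay without obscuring the underlying video.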

Furthermore, with video, we can overcome the shortcomings of time and memory mentioned above: menu options are displayed simultaneously, so they can be scanned quickly, and they remain visible continuously, so they need not be remembered.

An example of this is shown in Figure 2.

The Package

Having established the means of input and output, the final question was that of architecture. For the purposes of videoconferencing, there is no need to provide a computer at each node. The full functionality of a dedicated computer is simply not required (as explained in Figure 3). Visitors only require the ability to provide a computer with speech input and receive from it video output. We concluded, therefore, that a centrally located server would be sufficient.

The AVSA is a PC-based application that lives at the called site (the University of Toronto). As such, it talks to the main iiif server just like any other client. That is to say, it is a free-standing independent module, technically consistent with the main desk-top clients.


The system has developed through several stages. First, we combined speech recognition with video overlay technology to provide a voice-controlled graphical menu system. Figure 4 shows the configuration of the AVSA system. Just as users could browse electronic hyperdocuments on the WWW, they could now navigate through the services offered by our system.

Initially, the AVSA provided an electronic seat changing service, offering remote attendees the ability to move between the front and back of the room, as dictated by their social roles in a videoconference [6]. This served as a proof of concept for the system and allowed us to obtain preliminary user feedback. As expected, we found that the presentation of menu options through video was highly effective.

More functionality and generality were added as the system progressed. The next stage introduced the ability to contact different nodes of the local media space, thereby solving the problem of contacting me at my desk. With respect to human-human communication, our system was now delivering the capabilities of the telephone. The most recent stage of the AVSA added the ability for remote visitors to obtain information about our media space and view video demos on demand (DOD). This component represents the television aspect of our system. Figure 5 depicts the current status of the system.


Several users were asked to evaluate the system at each stage of its development. Their feedback helped shape the design and improve both the usability and functionality of the interface. This section outlines the comments we received and our efforts to accommodate them.

Unintentional Actions

Users occasionally asked the AVSA to do something unintended. Once the error was recognized, they did not know how to return to the previous state. This prompted the suggestion that an undo facility to reverse the last action would be useful.

Some functions, such as connecting to a particular node, are relatively time consuming. In this case, users wanted some means by which they could interrupt or cancel a command, especially one invoked inadvertently. Furthermore, some functions, such as VCR recording, have potentially destructive consequences. Many users wanted a confirmation process for these actions. We are presently experimenting with both of these features.

Displaying the Menu

To display the menu of options, users must say, "show menu." Intuitive as this is, we should still provide a means of reminding visitors of this command. However, we do not want to waste valuable screen real estate by displaying this command at all times.

One solution is to periodically display a message indicating the required command. However, this would require forgetful users to wait until this message appears before interacting with the media space. Alternatively, we could display the message in response to a distress utterance such as "help." This type of solution is consistent with a reactive environment [4]. Another solution is to display the message until it has been used a certain number of times, after which, we assume that the user has learnt the command. We expect that some combination of these solutions is desirable.
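
The combination of strategies suggested above can be sketched as a small policy object (the class, parameter names, and the learning threshold of three are all illustrative assumptions): show the hint until the command has been used a few times, and always re-display it in response to "help."

```python
class MenuHint:
    """Policy for when to display the "show menu" reminder.

    Combines two of the strategies discussed: the hint stays visible
    until the command has been used `learn_after` times, after which
    it appears only in response to a "help" distress utterance.
    """

    def __init__(self, learn_after=3):
        self.learn_after = learn_after
        self.uses = 0

    def on_utterance(self, command):
        if command == "show menu":
            self.uses += 1

    def hint_visible(self, last_utterance=None):
        if last_utterance == "help":
            return True  # always remind on a distress utterance
        return self.uses < self.learn_after


hint = MenuHint(learn_after=3)
for _ in range(3):
    hint.on_utterance("show menu")
print(hint.hint_visible())        # False: the command has been learned
print(hint.hint_visible("help"))  # True: distress utterance
```

Once learned, the hint no longer consumes screen real estate, yet forgetful users are never stranded.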

Menu Appearance

Some users questioned the position, size and impact of the menu overlay. We observed that the bottom right corner of the screen seemed to be the least intrusive position of the menu. The text of the menu was displayed in white with a black outline to maintain high visibility while retaining as much of the original video image as possible. The size of the text was also adjusted so that options were not ambiguous, even when transmitted through the codec. While these measures are effective in maintaining the sense of connection to a media space, users may lose a sense of who they are talking to (i.e. the AVSA or the local attendee). The implications of this can be easily understood by another user's comment on seeing a single command displayed: "Is that a command or some floating text on the screen?"

We are investigating several possible solutions to these problems. One is to section off an area of the screen that will display commands through a translucent or outlined box. Another is to display the menu in the center of the screen with a larger font so that visitors know they are talking to the AVSA. Our current method assumes that when the menu is displayed, visitors are talking to the AVSA, and otherwise, to the media space.

A related issue is the disruption caused to meetings while the remote attendee issues audio commands to the system. One simple way to minimize this is for the system to temporarily mute the audio channel after the invocation of any menu option. This way, the AVSA will continue to receive speech commands without them bothering other attendees. We are also considering the use of gesture as a visual clutch that would turn audio on or off as appropriate.
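
The temporary-mute idea can be sketched as a simple timed clutch (a hypothetical illustration with an invented mute interval, not the AVSA's implementation): invoking any menu option closes the audio path to the meeting for a short window, during which speech still reaches the recognizer.

```python
import time

class AudioClutch:
    """Mute the remote attendee's audio to the meeting for a short
    interval after a menu option is invoked, so that follow-up speech
    commands reach the AVSA without disturbing other attendees."""

    def __init__(self, mute_seconds=5.0, clock=time.monotonic):
        self.mute_seconds = mute_seconds
        self.clock = clock  # injectable for testing
        self._muted_until = 0.0

    def on_menu_invocation(self):
        self._muted_until = self.clock() + self.mute_seconds

    def audio_open(self):
        """True when the attendee's audio reaches the meeting."""
        return self.clock() >= self._muted_until


# Demonstration with a fake clock so behavior is deterministic.
now = [0.0]
clutch = AudioClutch(mute_seconds=5.0, clock=lambda: now[0])
clutch.on_menu_invocation()
print(clutch.audio_open())  # False: muted during the command window
now[0] = 6.0
print(clutch.audio_open())  # True: mute interval has elapsed
```

The gesture-based "visual clutch" mentioned above would replace the timer with an explicit user signal, but the gating state machine is the same.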

Users also commented that they would like a distinction between generic and variable options. For example, generic options could be displayed in a small capitalized font since they will be remembered from previous use, whereas variable options can be displayed in a larger font with brighter color.

System Feedback

Three levels of system feedback were requested by users. First, users wanted feedback indicating whether their speech command was recognized and, if so, as what command. Second, they wanted feedback as to what action their speech initiated and whether it was being performed. Finally, they expected either confirmation that the action was carried out or a reason why it was rejected.

The first two concerns are addressed by video feedback: the speech recognition software indicates its interpretation attempts, while the AVSA displays an appropriate text message whenever a valid command is recognized. With regard to the third level of feedback, most commands produce readily perceptible changes in the audio or video channel as they are executed. For those commands that cannot otherwise be monitored by the visitor, such as rewinding a video tape, we plan to report the progress of execution whenever possible.
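
The three levels of feedback can be sketched as the messages generated over a command's lifecycle (the message wording and function signature here are invented for illustration):

```python
def feedback(command, started, succeeded, reason=None):
    """Produce the three levels of feedback users asked for:
    (1) what was recognized, (2) what action was initiated, and
    (3) whether it completed or why it was rejected."""
    msgs = [f'Recognized: "{command}"']
    if started:
        msgs.append(f"Performing: {command} ...")
        msgs.append(f"Done: {command}" if succeeded
                    else f"Failed: {command} ({reason})")
    else:
        msgs.append(f"Rejected: {command} ({reason})")
    return msgs


# A command that was recognized, attempted, and failed mid-way:
for line in feedback("rewind tape", started=True, succeeded=False,
                     reason="no tape loaded"):
    print(line)
```

In the AVSA these messages would be overlaid on the video channel, like the menu itself, rather than printed.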

Multiparty Videoconferencing

Another function desired by some users was the ability to establish simultaneous connections to multiple nodes. While this multiparty videoconferencing facility is already supported by our media space, the functionality remains to be added to the AVSA. We expect to have this completed shortly.


Since the current system is a prototype, many issues must be addressed before it can be used as a commercial system. The following discussion deals with three of the most important.

Speaker Independent Recognition

The integration of a reliable speaker-independent speech recognition system will be instrumental in making this system successful. We feel, however, that the measures we are taking to ensure reliability and usability, such as limiting the size of the vocabulary, diminish the expressive potential of speech.

Noise and Filtering

One of the major themes of our research team is ubiquitous computing [10]. Our commitment to this theme has motivated us to make certain sacrifices in terms of equipment. Specifically, we do not require the visitor to wear a microphone and instead, attempt to recognize speech through an omnidirectional microphone located in the visitor's environment.

These microphones complicate the speech recognition task, as they are as sensitive to background noise as they are to the human voice. We could individually fine-tune the audio component of the AVSA to a multitude of environments, but this would require the system to undergo manual environment training, making it environment-dependent. As an alternative, we are investigating audio filtering techniques that adapt to an environment automatically.
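
One simple form of environment-adaptive filtering, offered here as an assumption about the kind of technique under investigation rather than what was actually built, is an adaptive noise gate: track a running estimate of the room's noise floor and pass a frame to the recognizer only when its energy clearly exceeds that floor.

```python
class NoiseGate:
    """Adaptive noise gate for an omnidirectional room microphone.

    Tracks an exponential moving average of frame energy as the
    noise-floor estimate; a frame is treated as speech only when its
    energy exceeds the floor by a fixed margin. The floor adapts only
    on non-speech frames, so it follows the room, not the speaker.
    """

    def __init__(self, alpha=0.05, margin=4.0):
        self.alpha = alpha    # adaptation rate of the floor estimate
        self.margin = margin  # speech must exceed the floor by this factor
        self.floor = None

    def process(self, frame):
        energy = sum(s * s for s in frame) / len(frame)
        if self.floor is None:
            self.floor = energy  # bootstrap from the first frame
            return False
        is_speech = energy > self.margin * self.floor
        if not is_speech:
            self.floor = (1 - self.alpha) * self.floor + self.alpha * energy
        return is_speech


gate = NoiseGate()
quiet = [0.01] * 160  # a frame of low-level room noise
loud = [0.5] * 160    # a frame of speech-level energy
for _ in range(20):
    gate.process(quiet)   # let the gate learn the room's noise floor
print(gate.process(loud))   # True: treated as speech
print(gate.process(quiet))  # False: background noise, gated out
```

Because the floor estimate adapts continuously, the same gate works in a quiet office or a noisy conference room without per-site training.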

Device Contention

The AVSA assumes that there is only one remote visitor in our media space. Ideally, we would like to allow many visitors to interact with the media space simultaneously. This would require either a multi-processing AVSA or multiple AVSAs, in addition to multiple codecs, laserdisc players, VCRs, and so on. Since our media space presently contains only one codec, we place device contention low on our priority list. However, it is an important issue to note and to consider in the further development of AVSA-like systems.


The AVSA is already a promising system allowing the combination of different modes of communication. In addition to the improvements mentioned earlier, we are currently pushing it in several directions to expand its utility further. Figure 5 shows how these new directions fit into the system.

One such direction is the addition of a video mail function. This would permit remote attendees to leave A/V messages for people in the local media space who are either busy or unavailable.

The AVSA currently allows visitors to control services within a videoconference room, such as changing seats and controlling the VCR. We are adding a new dimension of control whereby any node can provide services to a remote visitor while connected. These services can then be easily added and deleted as attributes of the local node.

So far, we have only used audio as an input mechanism for remote control of a media space. As noted earlier, another input medium, video, is available. Consequently, we will be investigating the use of gesture as a natural input mechanism to allow further control of devices. The head tracking system [3], which uses the visitor's head position to control a motorized camera in our media space, is a working example of the use of gesture in this manner.

We are also augmenting our head tracking system to use speech as a means of labelling various camera positions (hot-spots) and returning to them later with a simple, user-specified command.


We have introduced a novel video server that fills a practical need. It also serves a more general purpose of providing a new data point that can help expand the scope of our discussions of the "Information Highway" and "convergence."

The AVSA represents a rare example of integrating services typically seen on only one of the computer-centric, telephone-centric, or television-centric appliances. From the computer world of the WWW, we see navigable and retrievable hyperdocuments. From the telephone world, we see the ability to support synchronous communication (conversations/phone calls), as well as automated attendants. From the TV world, we see interactive video and video on demand -- in a form that goes well beyond watching "Top Gun" from home whenever you want.

Finally, we have a concrete demonstration of a practical commercial application for a small-scale video server for business: a server that can be delivered with today's technologies, and grow with new emerging technologies and services.

In the long run, it is perhaps this last point that is most important. It seems that much, or even most, of the potential of the Information Highway is following the traditional MIS Big Bang approach to development. That is, customers are being told by the technology providers, "We are going to give you this really great system, all at once, someday soon. Trust me." Our position is that this approach has not worked in the past, and cannot work in the future.

What we believe is required, and what the work described in this paper demonstrates, is that another approach has far more promise, and is far less expensive. This is an approach that involves iterative human-centered design. For example, the development and testing of the AVSA, grounded as it was in an applied context, has provided a range of insights into the architecture of small-scale video servers; insights that would never emerge in developing a full-blown ATM super-server that can distribute 500 movies simultaneously over a network that might someday exist.

We believe that the human potential of these new technologies is immense. We are concerned, therefore, that this potential be met. Our hope is that the work described in this paper makes some small contribution to its realization.


We wish to thank Tracy Narine for his invaluable technical assistance, David Audrain for coding assistance, and Borysa Struk and Mike Ruicci for providing access to equipment necessary during the testing stages of the project. We also thank William Hunt, David Modjeska and all the members of all our research groups for their valuable feedback and patience.

This research has resulted from the Ontario Telepresence Project. Support has come from the Government of Ontario, the Information Technology Research Centre of Ontario, the Telecommunications Research Institute of Ontario, the Natural Sciences and Engineering Research Council of Canada, British Telecom, Xerox PARC, Bell Canada, Alias|Wavefront, Sun Microsystems, Hewlett Packard, Hitachi Corp., the Arnott Design Group and Adcom Electronics. This support is gratefully acknowledged.


The AVSA will be available for public use as of October 15, 1995. Users with a codec may dial into the system using the following two numbers:


1. Buxton, W., Integrating the Periphery and Context: A New Model of Telematics, Proceedings of Graphics Interface 1995 (GI'95), (Quebec, May 17-19), Canadian Human-Computer Communications Society, pp. 239-246.

2. Buxton, W. and Moran, T., EuroPARC's Integrated Interactive Intermedia Facility (iiif): Early Experience, in S. Gibbs & A.A. Verrijn-Stuart (Eds.), Multi-user Interfaces and Applications, Proceedings of the IFIP WG 8.4 Conference on Multi-user Interfaces and Applications, Heraklion, Crete. Amsterdam: Elsevier Science Publishers B.V. (North-Holland), pp. 11-34, 1990.

3. Cooperstock, J., Tanikoshi, K., and Buxton, W., Turning Your Video Monitor into a Virtual Window, Proc. of IEEE PACRIM, Pacific Rim Conference on Communications, Computers, Visualization and Signal Processing 1995, Victoria, May 1995.

4. Cooperstock, J., Tanikoshi, K., Beirne, G., Narine, T., and Buxton, W., Evolution of a Reactive Environment, Proceedings of Human Factors in Computing Systems 1995 (CHI'95), (Denver, May 7-11), ACM Press, pp. 170-177.

5. Dourish, P., and Bly, S., Portholes: Supporting Awareness in a Distributed Work Group, Proceedings of Human Factors in Computing Systems 1992 (CHI'92), (Monterey, California), pp. 541-547.

6. Gujar, A., Daya, S., Cooperstock, J., Tanikoshi, K., and Buxton, W., Talking Your Way Around a Conference: A Speech Interface for Remote Equipment Control, CASCON'95 CD-ROM Proceedings, Toronto, Ontario, Canada.

7. Mantei, M., Baecker, R., Sellen, A., Buxton, W., Milligan, T., and Wellman, B., Experiences in the Use of a Media Space, Proceedings of CHI'91, ACM Conference on Human Factors in Computing Systems, pp. 203-208. Reprinted in D. Marca & G. Bock (Eds.), 1992, Groupware: Software for Computer-Supported Collaborative Work, Los Alamitos, CA: IEEE Computer Society Press, pp. 372-377.

8. Narine, T., Leganchuk, A., Mantei, M., and Buxton, W., Collaboration Awareness and its Use to Consolidate a Disperse Group, paper submitted for publication in Proceedings of Human Factors in Computing Systems 1996 (CHI'96).

9. Riesenbach, R., The Ontario Telepresence Project, Human Factors in Computing Systems 1994 Conference Companion (CHI'94), (Boston, Massachusetts), pp. 173-174.

10. Weiser, M. (1993), Some Computer Science Issues in Ubiquitous Computing, Communications of the ACM, 36(7), 75-83.

11. Yamaashi, K., Cooperstock, J., Narine, T., and Buxton, W., Beating the Limitations of Camera-Monitor Mediated Telepresence with Extra Eyes, paper submitted for publication in Proceedings of Human Factors in Computing Systems 1996 (CHI'96).


Figure 1: Combining technology metaphors. The menu (computer) seen by a visitor to the media space. The visitor is given the options of calling a person (telephone), or viewing demos (television). Note that the image quality of this photograph has been degraded significantly due to the loss of color.

Figure 2: Simultaneous and continuous display. This example shows that the video channel can be used to reduce the time required to identify options and the memory load placed on visitors, eliminating the shortcomings of the telephone. Note that the image quality of this photograph has been degraded significantly due to the loss of color.

Figure 3: Interface functionality. A videoconference requires only a subset of the functionality provided by a human or computer interface. We are therefore developing the AVSA as a small-scale model that makes use of only the functionality required, expanding it as new technology emerges.

Figure 4: Configuration of the AVSA system.

Figure 5: AVSA status. Dotted lines indicate future work.

Figure 6: Feedback in the top section of the display indicates that the AVSA is responding to a "Page Up" command. Note that the image quality of this photograph has been degraded significantly due to the loss of color.