Buxton, W. (1990). The Natural Language of Interaction: A Perspective on
Non-Verbal Dialogues. In Laurel, B. (Ed.), The Art of Human-Computer Interface
Design. Reading, MA: Addison-Wesley, 405-416.
The "Natural" Language of Interaction:
A Perspective on Non-Verbal Dialogues
Bill Buxton
ABSTRACT
The argument is made that the concept of "natural language understanding
systems" should be extended to include non-verbal dialogues. The claim
is made that such dialogues are, in many ways, more natural than those based
on words. Furthermore, it is argued that the hopes for verbal natural language
systems are out of proportion, especially when compared with the potential
of systems that could understand natural non-verbal dialogue. The benefits
of non-verbal natural language systems can be delivered by technology available
today. In general, the benefits will most likely exceed those of verbal
interfaces (if and when they ever become generally available).
This is a revision of a paper that previously appeared under the same
title in Proceedings of CIPS '87, Intelligent Integration, Edmonton, Canadian
Information Processing Society, 311-316, and in INFOR: Canadian Journal of
Operations Research and Information Processing, 26(4), 428-438.
INTRODUCTION
There is little dispute that the user interface is a bottle-neck restricting
the potential of today's computational and communications technologies.
When we begin to look for solutions to this problem, however, consensus
evaporates. Researchers and users all have their own view of how we should
interact with computers, and each of these views is different. If there
is a thread of consistency, however, it is generally in the view that user
interfaces should be more "natural." Within the AI community,
especially, this translates into "natural language understanding systems"
which are put forward as the great panacea that will free us from all of
our current problems.
The question is, is this hope realistic? The answer, we believe, lies very
much in what is meant by "natural language".
What is normally meant by the term is the ability to converse using a language
like English or German. When conversing with a machine, such conversations
may be coupled with speech understanding and synthesis, or may involve typing
using a more conventional keyboard and CRT. Regardless, our personal view
is that the benefits and applicability of such systems will be limited,
due largely to the imprecise and verbose nature of such language.
But we do not want to argue that point, since it has too much in common
with arguments against motherhood or about politics and religion. More importantly,
it is secondary to our principal thesis: that this class of conversation
represents only a small part of the full range of natural language.
We argue that there is a rich and potent gestural language which is at least
as "natural" as verbal language, and which - in the short and
long term - may have a more important impact on facilitating human-computer
interaction. And, despite its neglect, we argue that this type of language
can be supported by existing technology, and so we can reap the potential
benefits immediately.
ANOTHER VIEW OF NATURAL LANGUAGE
There is probably little argument that verbal language coexists with a rich
variety of manual gestures (that seems to increase in range as one approaches
the Mediterranean Sea). The real question is, what does this have to do
with computers, much less with natural language? What we are going to argue
is that such gestures are part of a non-verbal vocabulary which is natural,
and is a language capable of efficiently communicating powerful concepts
to a computer.
The burden of proof, therefore, is to establish the communicative potential
of such a language, and to show that it is, in fact, natural. Our approach
is to argue by demonstration and by example. We will provide some concrete
demonstrations of how such language can be used, and argue that it is natural
in the sense that users come to the system with the basic requisite communication
skills already in place. Our main hope is that we may be able to cause some
researchers to rethink their priorities, and direct more attention to this
aspect of interaction than has previously been the case.
THE MACINTOSH AS VICTIM
Before going too much further, it is probably worth making a few comments
about my approach and my examples. As will become pretty evident, the Apple
Macintosh takes a bit of a beating in what follows. Some readers may view
this as evidence of contempt for its design. From my perspective, it is
evidence of respect. Let me explain.
User interface design today is plagued by an unhealthy degree of complacency.
We live in a world of copy-cat, unimaginative interface products, where major
manufacturers put out double-page colour spreads in Business Week announcing
"... our system's user interface is as easy to use as the most easy
to use microcomputer [i.e., the Mac]" (or words to that effect). In
what other industry can you get away with stating that you're as good as
the competition, rather than better? Especially when the competition's product
is over 4 years old!
The best ideas are the most dangerous, since they take hold and are the
hardest to change. Hence, the Macintosh is dangerous to the progress of
user interfaces precisely because it was so well done! Designers seem to
be viewing it as a measure of success rather than as a point of departure.
Consequently, it runs the risk of becoming the Cobol of the 90's.
I criticize the Macintosh precisely because it is held in such high regard.
It is one of the few worthy targets. If I can make the point that there
are other design options, and that these options appear to have a lot of
potential, then I may help wake people up to the fact that it is an unworthy
design objective to aim for anything less than trying to do to the Macintosh
what the Macintosh did to the previous state-of-the-art.
Enough of the sermon. Let's leave the pulpit for some concrete examples.
OF PROOF-READERS AND FOOTBALL COACHES
Let us start off with a "real" computer-relevant example. Of all
applications, perhaps the human factors of text editing have been the most
studied. Within text editors (and command-line interfaces in general), perhaps
no linguistic construct has received more attention and been more problematic
than that of a verb that has both a direct and an indirect object.
The classic examples of this construct, largely because they are so ubiquitous,
are the move and copy operations. In fact, the problems posed by verbs having
two different types of operand are such that in programs like MacWrite,
move and copy have each been replaced by two operations (cut-and-paste and
copy-and-paste, respectively). Each step of these compound operations has
only one operand (the second being implicit, the "magic" operand,
the clipboard).

Figure 1: Similar notations for two types of information. (a) The proof-reader's
"move" symbol, used to specify a spatial relationship within a document.
(b) The 49er Double Scrape, a football play in which a similar notation
specifies a spatial relationship over time.
What is clear, however, to anyone who has ever annotated a document, or
seen a football playbook, is that there exists an alternative notation (read
"language") for expressing these concepts. This is characterized
by the gestural proof-reader's move symbol. Whether intended in the spatial
or temporal sense, the notation is clear, succinct, and is known independently
of (but is understandable by) computers.
Fig. 1(a) shows how the proof-reader's move symbol can be used to specify
a verb with two objects (direct and indirect) without any ambiguity. Figure
1(b) illustrates the use of essentially the same notational language in a
very different context. This time it is being used to illustrate plays in
a football game. Despite the context, the notation is clear and unambiguous,
will virtually never result in an error of syntax, and is known to the user
before a computer is ever used. And, it can be used to articulate concepts
that users traditionally have a great deal of trouble expressing.
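To make the contrast with cut-and-paste concrete, the following sketch (not
drawn from any existing system; all names are hypothetical) shows how a
recognized "move" gesture could be delivered to an editor as a single command
carrying both of its operands, rather than as two steps routed through an
implicit clipboard:

    # A minimal sketch (hypothetical names throughout) of a single "move"
    # command carrying both the direct and the indirect object, as a
    # recognized proof-reader's gesture might produce.

    from dataclasses import dataclass


    @dataclass
    class Span:
        start: int   # index of the first character of the direct object
        end: int     # index one past its last character


    @dataclass
    class MoveCommand:
        source: Span       # direct object: the text circled by the gesture
        destination: int   # indirect object: the point the gesture's tail indicates


    def apply_move(text: str, cmd: MoveCommand) -> str:
        """Apply the whole verb-with-two-objects in one transaction."""
        chunk = text[cmd.source.start:cmd.source.end]
        remainder = text[:cmd.source.start] + text[cmd.source.end:]
        dest = cmd.destination
        if dest > cmd.source.end:
            # The insertion point shifts left once the chunk is removed.
            dest -= cmd.source.end - cmd.source.start
        return remainder[:dest] + chunk + remainder[dest:]


    # One gesture, one utterance: verb, direct object, and indirect object.
    print(apply_move("quick the brown fox", MoveCommand(Span(0, 6), 10)))
    # -> "the quick brown fox"
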
Could this type of notation be the "natural" language for this
type of concept? More to the point, do any of today's "state-of-the-art"
user interfaces even begin to let us find out?
NATURAL LANGUAGES ARE LEARNED
The capacity for language is one of the things that distinguishes humans
from the other animal species. Having said that, nobody would argue that
humans are born with language. Even "natural" languages are learned.
Anyone who has tried to learn a foreign language (a language that is "natural"
to others) knows this. We are considered a native speaker if and when we
have developed fluency in the language by the time we are required to draw
upon those language skills.
If we orient our discussion around computers, then the same rules apply.
A language could be considered "natural" if, upon approaching
the computer, the typical user already has language skills adequate for
expressing desired concepts in a rich, succinct, fluent, and articulate
manner.
By this definition, most methods of interacting with computers are anything
but natural. But is there an untapped resource there, a language resource
that users bring to the system on first encounter that could provide the
basis for such "natural" dialogues?
Yes!
WHAT'S NATURAL TO YOU IS FOREIGN TO ME
One of the problems with arguments in favour of natural language interfaces
is the unspoken implication that such systems will be universally accessible.
But even if we restrict ourselves to the consideration of verbal language,
we must accept the reality of foreign languages. German, for example, differs
from English in both vocabulary and syntax; there is no universally agreed
placement of the verb in a sentence.
The point to this train of thought, which is a continuation of the previous
comments about languages being learned, is that so-called natural languages
are only natural to those who have learned them. All others are foreign.
If we start to consider non-verbal forms of communication, the same thing
holds true. The graphic artist's language of using an airbrush, for example,
is foreign to the house painter. Similarly, the architectural draftsperson
has a language which includes the use of a drafting machine in combination
with a pencil. Each of these "languages" is natural (albeit learned)
for the profession.
But the argument will now be raised that I am playing with words, and that
what I am talking about are specialized skills developed for the practice
of particular professions, or domains of endeavor. But how is that different
from verbal language? What is conventional verbal language if not a highly
learned skill developed to enable one to communicate about various domains
of knowledge?
Where all of this is heading is the observation that the notion of a universally
understood natural language is naive and not very useful. Each language
has special strengths and weaknesses in its ability to communicate particular
concepts. Languages can be natural or foreign to concepts as well as speakers.
A true "natural language" system is only achieved when the language
employed is natural to the task, and the person engaged in that task is
a native speaker. But perhaps the most important concept underlying this
is acknowledging that naturalness is domain specific, and we must, therefore,
support a variety of natural languages.
WHERE THERE'S LANGUAGE THERE MUST BE PHRASES
Let us accept, therefore, that there is a world of natural non-verbal languages
out there waiting to be tapped for the purpose of improved human-computer
interaction. Then where are the constructs and conventions that we find
in verbal language? Are there, for example, concepts such as sentences and
phrases?
Let us look at one of our favorite examples (Buxton, 1986b). Consider making
a selection using a pop-up menu. Conceptually, you are doing one thing:
making a choice. But if we look more closely, there is a lot more going
on (as the underlying parser would tell you if it knew how). You are "uttering"
a complex sentence which includes:
- mouse button down: to invoke the menu
- a one-dimensional locate: moving the mouse to the item to be selected
- mouse button up: to cause the currently highlighted item to be selected
While you generate each of these tokens, you are not aware of the mechanics
of doing so. The reason is that from the moment that you depress the mouse
button, your finger holds you in a non-neutral state of tension. While you
may make a semantic error (make the wrong selection), everything in the
system is biased towards the fact that the only reasonable action to conclude
the transaction is to release your finger. There is virtually no cognitive
overhead in determining the mechanics of the interaction. Tension is used
to bind together the tokens of the transaction just as musical tension binds
together the notes in a phrase. This is a well designed interaction, and
if anything deserves to be called natural, this does.
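As a rough illustration of how such a phrase might be parsed, the following
sketch (hypothetical, and not the code of any particular toolkit) treats the
three tokens as one transaction, delimited by the held button:

    # The pop-up menu "sentence": button-down opens the menu, vertical
    # motion is a one-dimensional locate, and button-up closes the phrase
    # by selecting whatever is highlighted.  The held button is the muscular
    # tension that binds the three tokens into a single transaction.

    ITEM_HEIGHT = 20  # assumed height of a menu item, in pixels


    class PopUpMenu:
        def __init__(self, items):
            self.items = items
            self.open = False
            self.origin_y = 0
            self.highlighted = 0

        def button_down(self, y):
            """Token 1: invoke the menu at the cursor position."""
            self.open = True
            self.origin_y = y
            self.highlighted = 0

        def motion(self, y):
            """Token 2: a one-dimensional locate, valid only while the button is held."""
            if self.open:
                index = (y - self.origin_y) // ITEM_HEIGHT
                self.highlighted = max(0, min(len(self.items) - 1, index))

        def button_up(self):
            """Token 3: releasing the tension ends the phrase and makes the selection."""
            if self.open:
                self.open = False
                return self.items[self.highlighted]
            return None


    menu = PopUpMenu(["Cut", "Copy", "Paste"])
    menu.button_down(y=100)      # tension begins: the menu pops up
    menu.motion(y=142)           # drag down two items while holding the button
    print(menu.button_up())      # tension released -> "Paste"
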
But let us go one step further. If appropriate phrasing of gestural languages
can be used to reduce or eliminate errors of syntax (as in the pop-up menu
example), can we find instances where lack of phrasing permits us to predict
errors that will be made? The question is rhetorical, and we can find an
instance even within our pop-up menu example.
Consider the case where the item being selected from the menu is a verb,
like cut, which requires a direct object. Within a text editor, for example,
specifying the region of text to be cut has no gestural continuity (i.e.,
is not articulated in the same "phrase") with the selection of
the cut operator itself. Consequently, we can predict (and easily verify
by observation) a common error in systems that use this technique: that
users will invoke the verb before they have selected the direct object.
As a result, they must restart, select the text, then reselect cut.
To find the inverse situation, where the use of phrasing enables us to predict
where this type of error will not occur, we need look no further than the
proof-reader's symbol example discussed earlier. The argument is, if the
language is fluid and natural, it will permit complete concepts to be articulated
in fluid connected phrases. That is natural, and if such a criterion is
followed in language design, learning time and error rates will drop while
efficiency will improve.
ON HEAD-TAPPING AND STOMACH-RUBBING
If there is anything that makes a language natural, as the previous discussion
emphasized, it is the notion of fluidity, continuity, and phrasing. Let
us push this a little farther, using an example that moves us even further
from verbal language.
One of the problems of verbal language, especially written language, is
that it is single threaded. We can only parse one stream of words at a time.
Many people would argue that anything other than that is unnatural, and
would be akin to the awkward party trick of rubbing your stomach while tapping
your head. The logic (so called) seems to be based on the belief that since
we can only speak and read one stream of words at a time, languages based
on multiple streams are unnatural (after all, if God had wanted us to communicate
in multiple streams, she would have given us two mouths).
But this argument is so easy to refute that it is almost embarrassing to
have to do so. Imagine, if you will, a voice activated automobile. (If this
is too hard, imagine yourself as an instructor with a student driver.) Your
task is to talk the car down London's Edgware Road, around Marble Arch,
and along Park Lane. If anything will convince you that verbal language
is unnatural in some contexts - even spoken and coupled with the ultimate
speech understanding system (a human being, in the case of the student driver)
- this will. The single stream of verbal instructions does not have the
bandwidth to simultaneously give the requisite instructions for steering,
gear shifting, braking, and accelerating.
(The AI pundits will, of course, say that the solution here is to couple
the natural language system with an expert system that knows how to drive.
Fine. Then replace the student driver with an expert, but one who has never
driven in London before, and go out at rush hour. The odds are still less
than 50:50 of making it around Marble Arch. If you aren't in an accident,
you will be bound to cause one. QED)
There are some things which verbal language is singularly unsuited for.
For our purposes, we will characterize these as tasks where the single threaded
nature of such language causes us to violate the principles of continuity
and fluidity in phrasing, as outlined above. That is what was happening
in the car driving example, what would happen if one tried to talk a pianist
through a concerto, and what does happen in many common computer applications,
which are - likewise - inappropriately based on single-threaded dialogues.
FROM MARBLE ARCH TO MACWRITE
We can use the Apple Macintosh, in particular the well known word processor
MacWrite, to illustrate our statements about continuity and multi-threaded
dialogues. The example is based on an experiment undertaken by myself and
Brad Myers (Buxton and Myers, 1986).
The study was motivated by the observation that a lot of time using window-based
WYSIWYG text editors was spent switching between editing text and navigating
through the document being edited. In the direct manipulation type systems
that we were interested in, this switching took the form of the mouse being
used alternately to select text in the document and to navigate by manipulating
the scroll bar and scroll arrows at the side of the screen.
This type of task switching occurs when the text that one wants to select
is off screen. Hence, what is conceptually a "select text" task
becomes a compound "navigate / select text" task. Since the two
component tasks (navigate and select) were independent, we decided to
design an alternative interface in which each task was assigned to a separate
hand.
Assuming right handed users, the right hand manipulated the mouse and performed
the text selection task. The left hand performed the navigation task (using
two touch sensitive strips: one that permitted smooth scrolling, the other
that jumped to the same relative position in the document as the point touched
on the strip).
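The division of labour between the two strips can be sketched roughly as
follows (a hypothetical reconstruction; the document size, scaling, and names
are assumptions, not the actual apparatus of the study):

    # One strip scrolls smoothly by relative finger motion; the other jumps
    # to the absolute position in the document proportional to where it is
    # touched.  All constants here are assumed for illustration.

    DOC_LINES = 1200       # assumed document length, in lines
    WINDOW_LINES = 40      # assumed number of lines visible at once

    top_line = 0           # first document line currently visible


    def smooth_strip(delta):
        """Relative strip: sliding a finger scrolls the document smoothly."""
        global top_line
        top_line = max(0, min(DOC_LINES - WINDOW_LINES, top_line + delta))


    def jump_strip(fraction):
        """Absolute strip: touching at 0.0..1.0 jumps to the corresponding
        relative position in the document."""
        global top_line
        top_line = int(fraction * (DOC_LINES - WINDOW_LINES))


    jump_strip(0.5)     # touch the middle of the strip: jump to mid-document
    smooth_strip(-3)    # nudge back three lines with a small finger motion
    print(top_line)     # -> 577
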
We implemented the new interface in an environment that copied MacWrite.
What we saw was a dramatic improvement in the performance of experts and
novices. In particular, we saw that by simply changing the interface, not
only did performance improve, but the performance gap between experts and
novices was narrowed. Perhaps most important, it was clear that using both
hands in this manner caused no problem for experts or for novices. Clearly
they had the requisite motor skills before ever approaching the computer,
and the mapping of the skills employed to the task was an appropriate one.
Why did this multi-handed, multi-threaded approach work? One reason is that
each hand was always in "home position" for its respective task. (If the
hands started on the keyboard, which is the normal case, each could be
positioned on its own device about as fast as one hand could be placed on
the mouse.) Hence, the flow of each task was
uninterrupted, preserving the continuity of each.
The improvements in efficiency can be predicted by simple time-motion analysis,
such as is obtainable using the Keystroke Level Model of Card, Moran and
Newell (1980). Since each hand is in home position, no time is spent acquiring
the scroll gadgets or moving back to the text portion of the screen. Having
both hands simultaneously available, the user can (and did) "mouse
ahead" to where the text will appear while still scrolling it into
view. That users spontaneously used such optimal strategies is a good argument
for the naturalness of the mode of interaction.
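As a back-of-the-envelope illustration of this kind of time-motion accounting,
the following sketch uses the standard Keystroke Level Model operator times;
the task breakdowns are illustrative assumptions, not the analysis reported
in Buxton & Myers (1986):

    # Rough Keystroke Level Model comparison of one-handed and two-handed
    # scroll-and-select.  Operator times are the standard KLM estimates;
    # the breakdown of each method is an assumption for illustration only.

    M = 1.35   # mental preparation (seconds)
    P = 1.10   # point to a target with the mouse
    K = 0.20   # a keystroke or mouse-button click
    H = 0.40   # homing a hand onto a device (not needed below: both hands stay put)

    # One-handed method: leave the text, acquire the scroll widget, scroll,
    # then point back into the text and select.
    one_handed = (M + P + K +   # point to the scroll widget and click it
                  M + P + K)    # point back into the text and click to select

    # Two-handed method: the left hand is already "home" on the scrolling
    # strip, so scrolling overlaps with mousing ahead to where the text will
    # appear; only one pointing act remains on the critical path.
    two_handed = M + P + K

    print(f"one-handed: {one_handed:.2f} s  two-handed: {two_handed:.2f} s")
    # -> one-handed: 5.30 s  two-handed: 2.65 s
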
ANOTHER HANDWAVING EXAMPLE
Is this just one of those interesting idiosyncratic examples, or are there
some generally applicable principles here? We clearly believe the latter.
Most "modern" direct manipulation systems appear to have been
designed for Napoleon, or people of his ilk, who want to keep one hand tucked
away for no apparent purpose.
But everyday observation gives us numerous examples where people assign
a secondary task to their non-dominant hand (or feet) in order to avoid
interrupting the flow of some other task being undertaken by the dominant
hand. While this is common in day-to-day tasks, the single-threaded nature
of systems biases greatly against using these same techniques when conversing
with a computer. Consequently, in situations where this approach is appropriate,
single-threaded dialogues (and this includes verbal natural language understanding
systems) are anything but natural.
But it is not just a case of time-motion efficiency at play here. It is
the very nature of system complexity. If a single device must be time multiplexed
across multiple functions, there is a penalty in complexity as well as time.
This can be illustrated by another example taken from the Macintosh program,
Macpaint.
Like the previous example, this one involves navigation and switching between
functions using the mouse. In this case, the tasks are inking (drawing with
the paint brush), and navigating (moving the document under the window).

Figure 2: Grabbing the page in Macpaint.
In order to expose different parts of the "page" in Macpaint, one selects
the hand icon from the menu on the left, then uses the hand to drag the
page under the window (using the mouse).
In this example, imagine that you are painting the duck shown in Fig. 2.
Having finished the head, you now want to paint the body. However, there
is not enough of the page exposed in the window. The solution is to move
the appropriate part of the page under the window, and then go on painting.
Let us work through this in more detail, contrasting two different styles
of interaction.
- Assume that our initial state is that we have been painting (having
  selected the "brush" icon in the menu).
- Assume that our goal (target end state) is to paint on a part of the
  page not visible in the window.
- Our strategy is to move the page under the window until the desired
  part is visible, and then resume painting.
- The official method, as dictated by Macpaint:
  - Move the tracking symbol (using the mouse) from the paint window to
    the menu in the left margin.
  - Select the "hand" icon.
  - Move to the paint window.
  - Drag the page under the window (as shown in Figure 2). This may involve
    multiple "strokes," i.e., releasing the mouse button, repositioning
    the mouse, then dragging some more.
  - Move the tracking symbol from the paint window back to the menu in the
    left margin.
  - Select the "brush" icon.
  - Move back to the paint window.
  - Resume painting.
- Our multi-handed, multi-threaded method: Assume that the position of
  the page under the window is connected to a trackball which is manipulated
  by the left hand, while painting is carried out (as always) with the mouse
  in the right hand. The revised method is:
  - Drag the page under the window (which may involve multiple "strokes"
    of the trackball).
It is clear that the second method involves far fewer steps, and is far
more respectful of the continuity of the primary task, painting. It is likely
that it is easier to learn, less prone to error, and much faster to perform
(we don't make this claim outright because there can be other influences
when the larger context is considered, and we have not performed the study).
It also means that the non-intuitive hand icon can be eliminated from the
menu, reducing complexity in yet another way. It is, within the context
of the example, clearly a more natural way to perform the transaction.
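The contrast between the two methods can be sketched as a simple two-stream
event handler (hypothetical code; the event names, coordinates, and data
structures are assumptions, not Macpaint's actual implementation):

    # The right hand paints with the mouse while the left hand scrolls the
    # page with a trackball, so no trip to the menu (and no "hand" tool)
    # is ever needed.

    page_offset = [0, 0]   # where the page currently sits under the window
    canvas = []            # painted points, stored in page coordinates


    def handle_event(kind, x, y):
        if kind == "trackball":
            # Left hand, navigation stream: slide the page under the window.
            page_offset[0] += x
            page_offset[1] += y
        elif kind == "mouse_drag":
            # Right hand, inking stream: convert window coordinates to page
            # coordinates so painting stays correct wherever the page has moved.
            canvas.append((x - page_offset[0], y - page_offset[1]))


    # The two streams interleave freely; neither one interrupts the other.
    handle_event("mouse_drag", 40, 40)    # paint part of the duck's head
    handle_event("trackball", 0, -80)     # left hand slides the page up
    handle_event("mouse_drag", 40, 60)    # keep painting; no tool change needed

    print(page_offset)   # -> [0, -80]
    print(canvas)        # -> [(40, 40), (40, 140)]
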
An important part of this example is the fact that the initial and final
(goal) states are identical except for the position of the page in the window.
In general, when this situation comes up during task analysis, bells should
go off in the analyst's head prompting the question: if the start and goal
state involve the performance of the same task, is the intermediate task
a secondary task that can be assigned to the other hand? Can we preserve
the fluidity and continuity of the primary task by so doing?
In many cases, the answer will be yes, and we are doing the user a disservice
if we don't act on the observation.
IS IT THAT SIMPLE? LEARNING FROM HISTORY
To be fair, at this point it is worth addressing two specific questions:
At first glance, many of these ideas seem pretty good. Is
it really that easy?
and
Are these ideas really new?
The answer to both questions is a qualified "no." Furthermore,
the questions are very much related. Many of the ideas have shown potential,
and some were demonstrated as long as twenty years ago. However, developing
a good idea takes careful design and the right combination of ingredients.
While many of these ideas have been explored previously, the work was often
ahead of the technology's ability to deliver to any broad population in
a cost-effective form. A human-nature issue also arises: once an idea has
been tried without follow-through, people frequently seem to adopt a
"that's been tried, didn't catch on, so it must not have worked"
type of attitude.
In addition, there are some really deep issues that remain to be solved
to support a lot of this work, and certainly, the leap from a lab demo to a
robust production system is non-trivial, regardless of the power or "correctness"
of the concept. For example, gesture-based systems generally require a stylus
and tablet whose responsiveness to subtle changes of nuance is beyond what
can be delivered by most commercial tablets (and certainly any mouse). IBM,
among others, have spent a lot of time trying to get a tablet/stylus combination
with the right feel, amount of tip-switch travel, and linearity suitable
for this type of interaction.
The last point (last, but not least in importance) concerning the degree of
acceptance of gestural and two-handed input - despite isolated compelling
examples being around for a long time - has to do with user testing. Demos,
there have been. Carefully run and documented user testing and evaluation,
however, is notable only for its scarcity. The only published studies that
I'm aware of that tried to carry out formal user testing of gestural interfaces
are Rhyne & Wolf (1986), Wolf (1986), and Wolf & Morrel-Samuels (1987).
The only published formal study of user tests of two-handed navigation-selection
and scaling-positioning tasks that I am aware of is Buxton & Myers (1986).
One of the most illuminating, glaring (and depressing) points stemming from
the above is the huge discrepancy (in computer terms) between the dates
of the first prototypes and demonstrations (mid 60's - early 70's), and
the dates of any of the published user testing (mid - late 80's). I think
that this says a lot about why these techniques have not received their
due attention. Change is always hard to bring about. Without testing, we
get nowhere near an optimal understanding or implementation of the concept,
nor do we ever get the data that would otherwise permit us to quantify the
benefits. In short, we are simply handicapped in our ability to fight the
inertia that is inevitable in arguing for any type of significant change.
SUMMARY
We have argued strongly for designers to adopt a mentality that considers
non-verbal gestural modes of interaction as falling within the domain of
natural languages. While verbal language likely has an important role to
play in human-computer interaction, it is not going to be any type of general
panacea.
What is clear is that different forms of interaction support the expression
of different types of concepts. "Natural" language includes gestures.
Gestures can be used to form clear fluid phrases, and multi-threaded gestures
can capitalize on the capabilities of human performance to enable important
concepts to be expressed in a clear, appropriate, and "natural"
manner. These include concepts in which the threads are expressed simultaneously
(as in driving a car, playing an instrument, or mousing ahead), or sequentially,
where using a second thread enables us to avoid disrupting the continuity
of some primary task by having the secondary task articulated using a different
channel (as in the Macpaint example).
We believe that this notion of a natural language understanding system can
bring about a significant improvement in the quality of human-computer interaction.
To achieve this, however, requires a change in attitude on the part of researchers
and designers. Hopefully the arguments made above will help bring such a
change about.
ACKNOWLEDGEMENTS
The research reported in this paper has been undertaken at the University
of Toronto, with the support of the Natural Sciences and Engineering Research
Council of Canada, and at Rank Xerox's Cambridge EuroPARC facility in England.
This support is gratefully acknowledged. We would like to acknowledge the
helpful contributions made by Thomas Green and the comments of Rick Beach,
Elizabeth Churchill, Michael Brook and Larry Tessler.
REFERENCES AND BIBLIOGRAPHY
Baecker, R. & Buxton, W. (1987). Readings in Human-Computer Interaction:
A Multidisciplinary Approach, Los Altos, CA: Morgan Kaufmann.
Buxton, W. (1986a). Chunking and Phrasing and the Design of Human-Computer
Dialogues, Proceedings of the IFIP World Computer Congress, Dublin,
Ireland, September 1 - 5, 1986, 475 - 480.
Buxton, W. (1986b). There's More to Interaction than Meets the Eye: Some
Issues in Manual Input, in D. Norman & S. Draper (Eds.), User Centred
Systems Design: New Perspectives on Human-Computer Interaction, Hillsdale,
NJ: Lawrence Erlbaum Associates, 319 - 337.
Buxton, W. (1983). Lexical and Pragmatic Considerations of Input Structures,
Computer Graphics, 17(1), 31 - 37.
Buxton, W. (1982). An Informal Study of Selection-Positioning Tasks, Proceedings
of Graphics Interface '82, 323 - 328.
Buxton, W. & Myers, B. (1986). A Study in Two-Handed Input, Proceedings
of CHI'86, 321 - 326.
Card, S., Moran, T. & Newell, A. (1980). The Keystroke Level Model for
User Performance Time with Interactive Systems, Communications of the
ACM, 23(7), 396 - 410.
Rhyne, J.R. & Wolf, C.G. (1986). Gestural interfaces for information
processing applications, Computer Science Technical Report RC 12179,
IBM T.J. Watson Research Center, Distribution Services 73-F11, P.O. Box
218, Yorktown Heights, N.Y.
Wolf, C.G. (1986). Can People Use Gesture Commands? ACM SIGCHI Bulletin,
18(2), 73 - 74.
Wolf, C.G. & Morrel-Samuels, P. (1987). The use of hand-drawn gestures
for text-editing, International Journal of Man-Machine Studies, 27,
91 - 102.