Speech Input
Nigel Eames

Several years of concentrated research effort have resulted in significant advances in the technology of speech input to computers. Most often, this effort has been justified by an almost implicit assumption that automatic speech recognition (ASR) offers the key to dramatic improvements in the effectiveness of the human-computer interface. In the landmark volume documenting the state of the art in ASR, researchers explain that one would want to use speech whenever possible because it is the human's most natural communication modality. It is also agreed that voice input to computers offers a natural, fast, hands-free, location-free input medium.

But allow this student to explain some basic properties of speech and sound. The variety of sounds produced by our vocal cords gives us more different meanings than all other sounds put together. In fact, there is a vocal sound, or combination of sounds, to use for any idea known to man. The really interesting thing about speech is not so much that we can produce so many different sounds, but rather that we can hear and understand them.

There are many ways of transmitting speech so that one person in one place can say something to somebody in another place. Obviously, the simplest medium for transmitting speech is the air. One man talks, another listens, with air forming the link between them. One voice and one ear can get the job done. With such a system of speech communication, the oldest and most reliable, the number of problems involved is relatively small. To be sure, we are concerned with the amount of noise in the environment, with how loud a man has to shout, and a few things like that. But relatively speaking, the problems are pretty simple.

The speech pattern is really a time-frequency-intensity pattern: a change in the intensity-frequency relations through a period of time. The frequencies are not always the same, but the pattern of the frequencies is the same. Although we are not sure just how the ear is able to comprehend these relations, it seems pretty clear that speech intelligibility resides in the pattern of frequencies, rather than in the frequencies themselves. In other words, the absolute frequencies involved in speech are not nearly so important as the relative frequency pattern. The whole pitch of a voice could be changed by an octave but, if the frequency pattern did not change, we could still understand the speech.
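To make the idea of a relative frequency pattern concrete, here is a minimal sketch in Python. The three frequency values are made up for illustration and are not taken from any measurement, and the helper names `relative_pattern` and `shift_octaves` are hypothetical. The sketch assumes the simplest possible case, in which every frequency is scaled by the same factor, and shows that the ratios between the frequencies then stay the same.

```python
# Hypothetical component frequencies (Hz) for one speech sound.
# These numbers are illustrative assumptions, not measured data.
pattern_hz = [500.0, 1500.0, 2500.0]


def relative_pattern(freqs):
    """Express each frequency as a ratio to the lowest one."""
    base = min(freqs)
    return [f / base for f in freqs]


def shift_octaves(freqs, octaves):
    """Scale every frequency by 2**octaves (a negative value lowers the pitch)."""
    factor = 2.0 ** octaves
    return [f * factor for f in freqs]


# Lower everything by one octave: the absolute frequencies are halved.
lowered = shift_octaves(pattern_hz, -1)

print("original ratios:", relative_pattern(pattern_hz))
print("lowered ratios: ", relative_pattern(lowered))
```

The absolute values change, but the two printed ratio lists are identical, which is the sense in which the pattern, rather than the frequencies themselves, is preserved.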

Intensity per se is not nearly so crucial as one would first suspect. Since the ratio of the speech energy to the noise energy is so important, we frequently use the speech-to-noise ratio, or just s/n ratio, as a measure of effective speech intensity. This ratio is easily expressed in decibels. If the speech-to-noise ratio is held constant, articulation is reasonably constant, regardless of the overall level of speech. Intensity is not the only thing that determines speech intelligibility. The spectrum of the speech (the patterned distribution of energy among the various frequencies in speech sounds) also has a lot to do with the intelligibility of speech. We have pointed out previously that the frequency pattern is what makes it possible for us to understand speech. If we destroy this pattern, we will seriously affect our ability to understand words.
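As a small illustration of expressing the s/n ratio in decibels, the sketch below uses the standard definition for a ratio of powers, 10 times the base-10 logarithm of speech power over noise power. The function name `snr_db` and the power values are assumptions chosen for the example, not figures from any study.

```python
import math


def snr_db(speech_power, noise_power):
    """Speech-to-noise ratio in decibels, for power (energy) quantities."""
    return 10.0 * math.log10(speech_power / noise_power)


# Hypothetical power values in arbitrary units (assumed for illustration).
quiet_room = snr_db(speech_power=10.0, noise_power=1.0)
loud_room = snr_db(speech_power=1000.0, noise_power=100.0)

print(f"quiet room: {quiet_room:.1f} dB")
print(f"loud room:  {loud_room:.1f} dB")
```

Both cases give the same ratio even though the overall level differs by a factor of 100, which matches the point that what matters is the ratio of speech to noise rather than the absolute level of the speech.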

Now back to the computer model. In reference to the superiority of speech as an input medium, authorities most often refer to the use of speech in cooperative problem solving between humans, in terms of solution speed. However, others have found obvious and marked differences between human-human and human-computer interaction, which mean that advantages in the former case may not necessarily transfer to the latter situation. Less often do authors attempt a more direct justification for ASR based on demonstrated, concrete advantages of speech relative to alternative input media. One possible reason for this is that "despite advances in speech technology, human factors research since the late 1970s has provided only weak evidence that ASR devices are superior to conventional input devices" (Karl, Pettey and Shneiderman 1993).

The results of early work comparing speech and keying were essentially pessimistic with regard to the utility of speech input in isolated-word, small-vocabulary, primary data-entry tasks. Advantages were, however, more apparent in situations of concurrent, secondary tasking and high workload. Using speech recognition, keying and light pen for primary data entry (the so-called "simple scenario"), researchers found that the keyboard provided the fastest and most accurate entry of random numeric strings, although the percentage degradation in performance when a secondary, hand-occupying task was added was smaller for speech (10%) than for keying (30%). For random alphameric data entry in the simple scenario, keying was also faster but speech had a lower error rate.

In one experiment, 24 subjects entered and edited program code by speech and keyboard using a language-directed editor (i.e. one having "knowledge of the underlying syntax"). The input and edit vocabularies contained 40 commands. The keyed edit commands could be abbreviated to the first three letters but, in the absence of information in the paper on this point, it must be assumed that input commands had to be keyed in full. The researchers found that keyboard input was faster in that "subjects were able to complete more of the input and edit tasks by keyboard (70%) than by voice (50-55%)". However, speech had a much lower error rate on both the input task (3.8% versus 11.0%) and the edit task (1.2% versus 14.3%) (Leggett and Williams 1984).

Another experiment compared speech selection of editing commands with mouse-activated selection in a word-processing task. Literal text entry was by keyboard and the mouse was used for direct manipulation in both cases. This student considers this work to be relevant here because a mouse combines the pointing function with a rudimentary (one out of two or three) key-press selection mechanism, so implementing what amounts to a "dynamic" keypad. Sixteen subjects achieved an average reduction in task time of 18.7% when using speech rather than mouse activation of commands. The authors apparently draw an implicit distinction between "user" and "system" errors, in that they report "error rates due to subject mistakes were roughly the same" for speech and mouse, while quoting a "recognition error" of 6.3% for speech input but no error data for mouse entry. Again, this highlights another of the problems in effecting a fair comparison of input media (Karl, Pettey and Shneiderman 1993).

One particular subject proved to be a very problematic user of speech input. His errors and timings were some 100% higher than the average. Accordingly, in line with a maximally stringent test, data for certain subjects were removed in order to present speech input in as fair a light as possible. The literature reveals that this is not an uncommon necessity. For instance, one researcher states: "One of the subjects suffered from very bad performance, due partly to an inability to adjust to the consistent pronunciation required... and one subject encountered difficulties in this regard, because he frequently talked to himself" (Martin 1989).

In some experiments, researchers have tried to effect a fair comparison of speech and keying by minimizing the transaction cycle for the two media. That is, they have varied the mappings from commands (sequences of input actions) to the command strings elicited, to suit the properties of the individual media. They found that isolated-word speech input in a simulated command and control application is slightly slower than keying (10.6%), although the difference is not significant. However, ASR is enormously more error-prone (360.4%). This is in stark contrast to similar work claiming that speech was 17.5% faster while keying produced 183.2% more errors (Martin 1989).

These figures can be interpreted as broad support for the notion that commands do indeed disadvantage keying relative to speech. However, many estimates do not align perfectly with other findings, in that keying remains superior to speech in terms of error rate. One strong possibility is that keying errors increase disproportionately with command length, so that the basic assumption underlying the above extrapolation is violated. Apart from this, the most likely causes of the remaining discrepancy are different. Some subjects had a significantly higher cognitive load imposed during the task than others, which suggests that speech confers an advantage in such situations.

References Cited

Karl, L. R., Pettey, M. and Shneiderman, B. 1993. Speech versus mouse commands for word-processing: an empirical evaluation. International Journal of Man-Machine Studies, 39, 667-687.

Leggett, J. and Williams, G. 1984. An empirical investigation of voice as an input modality for computer programming. International Journal of Man-Machine Studies, 21, 493-520.

Martin, G. L. 1989. The utility of speech input in user-computer interfaces. International Journal of Man-Machine Studies, 30, 355-375.
