SGI's Speech Recognition Project


See also speech-faq from comp.speech




UNFORTUNATELY, the speech software has not been ported to IRIX 6.


What can I do with this speech software?

You will be able to issue verbal commands and have applications respond with the appropriate actions. For example, in MediaMail, you may say "reply" and a pre-addressed compose window will open. Only a subset of commands is offered for a few applications in this first release. However, you may add commands to these applications and to new ones.

For more general information about speech technologies, refer to the speech recognition and speech synthesis web pages available from Lance Welsh at Silicon Graphics. Several third-party vendors have ported their software to SGI. Some ports are intended for evaluation only, while others are products fully supported by the third party that ported the technology.

How does it work?

This version of the speech images can recognize "discrete utterances", which are words or short phrases. The size of the "vocabulary" (the set of recognizable utterances) should not exceed roughly 50-100 words at any given time. The exact limit depends on the utterances, which lose their uniqueness in large vocabularies (longer utterances are more discernible). Utterances may be pretrained, since the algorithms are speaker-independent, and may also be added by users for their particular applications. Currently, speech can respond to recognitions with keystroke events, button events, and shell commands.

What do I need to get involved?

Requirements for speech include an SGI Indy or R4k Indigo, or any machine with Indigo audio capabilities (R3k Indigos may not have sufficient CPU power to run the speech recognition algorithms with usable results). IRIX 5.x or better is also required.

You may receive the speech software (distributed as SGI inst'able images) on DAT tape, via FTP, or from fizz.engr.sgi.com:/dist/speech/LATEST (SGI network only). Contact speech@sgi.com with any questions, problems, or suggestions.

What to expect

The biggest challenge is getting a sufficient signal-to-noise ratio, meaning you may have to get a better mic or headset, or just speak up in a quiet office. You might get away without reading any documentation, but just in case, be aware of the following as well as the release notes and online help.

How do I use speech (getting started)?

Install the default systems, then reboot if this is the first time you have installed speech (launching speech will present an error message if you don't reboot).

Ensure that your audio works well. The key to accurate recognition is optimizing your signal-to-noise ratio. Use a headset for best results, especially in the presence of any ambient noise. Placing your mic close to your mouth is important, but not as important as getting it away from noise, such as your keyboard or the computer's fan. Don't even think about holding it. The lapel position works if you don't move. I just drape it on top of the monitor. Check that apanel's meter rises when you speak. As long as there is some measurable response in apanel's meter, don't worry about apanel's input level for now (the algorithms can compensate somewhat with automatic gain control). The input should be set to the mic, and the rate is usually 8 kHz. You can leave apanel open to fiddle with later if needed.

Launch "speech" from the command line, by using find an icon, or from the icon pages DesktopTools or ControlPanels. Drag the titlebar-less window to a convenient place using button1.

Test recognition with the commands "go_to_sleep" and "wake_up" by observing state changes of the character and textual messages. Go on to test other global commands (see the Global window in speech's Vocabulary menu), such as 4Dwm's "minimize_window" and "restore_window". The vocabularies are not currently optimized, so at this point you may want to select words and "train" them from the vocabulary window's menu.

Commands in the global vocabulary are special in that they do not depend on input focus. Other commands, such as those in the MediaMail vocabulary, require that the input focus be in the main MediaMail window. Focus is per application window, so multi-windowed apps like MediaMail have multiple vocabularies, and the active vocabulary is determined by input focus.
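
The focus rule above can be sketched in a few lines of Python. This is only an illustration, not the speech manager's implementation: the window names, the dictionary layout, the "send" command, and the function name are all hypothetical. Only the idea that global commands are always candidates, while other commands depend on the window with focus, comes from the text.

```python
# Hypothetical data model; none of these names come from the speech software.
GLOBAL_VOCABULARY = {"go_to_sleep", "wake_up", "minimize_window", "restore_window"}

# One vocabulary per application window (window keys are made up for illustration).
WINDOW_VOCABULARIES = {
    "MediaMail:main": {"reply", "read_message"},
    "MediaMail:compose": {"send"},  # "send" is an invented example command
}

def candidate_commands(focused_window):
    """Return the commands eligible for recognition for the window with focus."""
    # Windows without a vocabulary contribute nothing beyond the global commands.
    local = WINDOW_VOCABULARIES.get(focused_window, set())
    return GLOBAL_VOCABULARY | local
```

With focus in the hypothetical compose window, "reply" is not a candidate, but "wake_up" still is.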

Action syntax

Specifying actions is easy. Really. But it is also the number one area for improvement, not just in syntax, but in power. This should be the desktop macro recorder that allows users to simply describe complex behaviors, even beyond use by speech. Messages in the new desktop bus should be included. Anyway, here's how it works:

Text is text: "abc" in a command's action would stuff "abc" into the keyboard buffer, just as if it were typed. So this command wouldn't make sense in the root window's vocabulary, but it might make sense in MediaMail's compose window, or the shell (if you still use those). Which brings us to "what about special keys?".

Let's say you used "make" for a shell command instead of "abc". The command was recognized, but now "make" just sits there on the command line, taunting you, waiting for a return. To include a return after "make", use the action syntax "make<return>".

Any key can be described by specifying the keycap name inside angle brackets (names are not case-sensitive, and common synonyms exist, such as ctrl and control, esc and escape). This also holds for meta keys, so for instance, the action in response to a copy command might be "<control>c". Meta keys are special in that they hold their 'press' for one stroke beyond normal, so that "<control>ca" would generate a control-c followed by a normal 'a'.

Finally, shell commands are specified inside angle brackets and preceded with a bang or exclamation point, such as "<!xrset -switch work>" (which switches the desktop to a desk named "work").

There is a special action "<delay>" which introduces 30 ticks of delay between events. This is useful when specifying mnemonics which require the menu subwindows to be mapped before receiving subsequent keystrokes (which you won't have noticed unless you type really fast). So you might use "<alt>m<delay>r" to specify an action for MediaMail's read command. This syntax works, but obviously it could be improved. One such improvement, at least for a keystroke action, is to grab the keystrokes just as they are typed. I'd almost implement this now, but I want to consider the new problems it would introduce. Should the keystrokes be replayed with the timing in which they were specified? What does the user see to represent the action? How does the user edit the action? How are normal editing keys specified, such as backspace and return? The whole action/macro facility needs to be revisited and I'd like to hear your comments.
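
To make the action syntax concrete, here is a minimal Python sketch of a tokenizer for action strings. It is not the code used by speech, just one way to read the rules above. The event tuples, the function name, and the set of modifier names (shift and meta in particular) are assumptions. Only the text, "<key>", "<!command>", and "<delay>" forms, the ctrl/control and esc/escape synonyms, and the 30-tick delay come from the description.

```python
import re

# Assumed modifier names; the text only mentions control and alt explicitly.
MODIFIERS = {"control", "alt", "shift", "meta"}
# Synonyms named in the text.
SYNONYMS = {"ctrl": "control", "esc": "escape"}
# Either a bracketed item such as <return>, or a single literal character.
TOKEN = re.compile(r"<([^<>]*)>|([^<])", re.DOTALL)

def parse_action(action):
    """Turn an action string into a list of (hypothetical) event tuples."""
    events = []
    held = []  # modifiers waiting to be applied to the next stroke
    for match in TOKEN.finditer(action):
        bracketed, char = match.group(1), match.group(2)
        if char is not None:
            # Plain text: one keystroke per character.
            events.append(("key", tuple(held), char))
            held = []
        elif bracketed.startswith("!"):
            # <!command> runs a shell command.
            events.append(("shell", bracketed[1:]))
        else:
            name = SYNONYMS.get(bracketed.lower(), bracketed.lower())
            if name == "delay":
                events.append(("delay", 30))  # 30 ticks, per the text
            elif name in MODIFIERS:
                held.append(name)  # held for exactly one following stroke
            else:
                events.append(("key", tuple(held), name))
                held = []
    return events
```

For example, parse_action("<control>ca") yields a control-c followed by a plain 'a', matching the rule that a meta key holds its press for one stroke. A shell command that itself contains '>' (such as output redirection) is not handled by this sketch.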

Top 10 troubles with speech recognition (with solutions!)

10. It doesn't hear me at all: Make sure the audio stream is flowing by ensuring that:
- a microphone or headset is connected to your computer
- apanel's source is set to that device (usually mic)
- apanel's input rate is set to 8 or 16 kHz (usually 8)
- apanel's meters jump when you speak
Also, make sure that speech is listening and not in the sleep or off state by observing the character and text.

9. It doesn't recognize the word I'm saying: Make sure that the window with focus has the vocabulary containing the word of interest (unless it is a global word, which is always a candidate for recognition). Use speech's vocabulary "show" to display a window containing the commands available for that window. Make sure that the word is trained sufficiently. Also make sure that it has not been "deactivated" (an available option on the word menu).

8. A particular word is consistently misrecognized: If a word is consistently misrecognized, more training of the correct word should help, since the other words are untrained to adapt with respect to the trained word. Increasing the adaptation "ask for clarification" on the preferences panel may also help. However, some words may just be too much in conflict with each other and need to be removed or renamed when found in the same context.

7. The accuracy is not good: If speaking louder or closer to the microphone helps, then this is because of a poor signal-to-noise ratio. The best way to increase this ratio is to get a better microphone (directional, noise cancelling), preferably a headset, or move the microphone to a quieter location.

6. Speech goes to sleep or wakes up too easily: The command word (go_to_sleep or wake_up) has been trained with noise or other words. You can either deactivate it or remove /var/public/speech/templates/trained/go_to_sleep, reboot your system, and try again (planned improvement).

5. Speech seems "trigger happy", reacting even when I'm quiet: If you're using a headset, chances are that your breath is flowing over the microphone element, or that you are rustling the microphone element. Reposition the element at least one inch off to the side of your mouth (or whatever the manual directs) and make sure it has a windscreen. Also, you may want to turn down apanel's input level if it is set high.

4. Speech goes to sleep too easily: Speech sleeps when it receives too many errors in a given amount of time. To adjust the number of errors that puts it to sleep, see "noise events" in the preferences panel and set it higher. The amount of quiet time that puts it to sleep may also be adjusted.
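
The "too many errors in a given amount of time" rule can be pictured as a sliding-window counter. The Python sketch below is an assumption about how such a rule might work, not the speech manager's actual logic: the class name, the parameters, and the choice to sleep only once the count exceeds the threshold are all hypothetical.

```python
from collections import deque

class NoiseEventMonitor:
    """Hypothetical sliding-window counter for recognition errors."""

    def __init__(self, max_events, window_seconds):
        self.max_events = max_events          # stands in for the "noise events" preference
        self.window_seconds = window_seconds  # assumed length of the time window
        self._timestamps = deque()

    def record_error(self, now):
        """Record an error at time `now` (seconds); return True if speech should sleep."""
        self._timestamps.append(now)
        # Forget errors that have fallen out of the time window.
        while now - self._timestamps[0] > self.window_seconds:
            self._timestamps.popleft()
        return len(self._timestamps) > self.max_events
```

In this model, raising max_events makes speech more tolerant of noise, which is the adjustment suggested above when it goes to sleep too easily.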

3. Speech responds to sounds my computer makes: The sounds from the computer are another source of noise that can cause problems. Unlike ambient noise, these transient noises are usually louder, making them more difficult to adjust to and ignore. An obvious solution is to turn down the output level of the sound. You can also position the microphone farther away from the source of the noise, or get a better microphone (noise-cancelling or directional) or headset. In the future, there may be some active echo/noise cancellation built into our machines so that this problem is reduced for applications like speech recognition or audio conferencing.

2. Speech responds to conversational speech: The speech manager should go to sleep because of errors before conversational speech causes a misrecognition. If not, reduce the "noise events" parameter in the preferences panel, or put speech to sleep or off before engaging in conversation.

1. Speech does not understand my yes/no answers: Train yes and no from the global vocabulary. The same applies to go_to_sleep and wake_up.

Miscellaneous (MediaMail)

Three words are not properly trained. Chance of recognition is poor unless you train them. These are MediaMail's "read message", jot's "save as", and the global vocabulary's "double click".

To do more with speech in MediaMail, be aware of the following capability. In ~/.zmailrc, include "command" in the main_panes variable, as follows:

set main_panes=messages,buttons,folder,command,output

then type "help" in the command area to see all your options. This capability allows you to verbally do things like save speech messages and open folders without going through the cumbersome GUI browsers. Unfortunately, this command text field is only available in the main MediaMail window and this field must have focus before the text is interpreted.

The speech manager UI will invoke speech synthesis for its prompts if it is available as a command called "speak" taking command line arguments as text to synthesize into speech. There are some public domain speech synthesizers as well as some available inside SGI for evaluation.
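
As a rough illustration of the "speak" convention, the Python sketch below checks whether a command named speak is on the PATH and, if so, passes it the prompt text as command line arguments. Only the command name and the use of command line arguments come from the text above. The function names, and the choice to pass one argument per word, are assumptions.

```python
import shutil
import subprocess

def build_speak_command(text):
    """Return the argument list for 'speak', one argument per word (an assumption)."""
    return ["speak"] + text.split()

def say(text):
    """Speak `text` if a 'speak' command is installed; return True if it was invoked."""
    if shutil.which("speak") is None:
        return False  # no synthesizer available, so stay silent
    subprocess.run(build_speak_command(text), check=False)
    return True
```

Any synthesizer can be plugged in this way by wrapping it in a script named speak that accepts its text as arguments.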

-=+=--=+=--=+=--=+=--=+=--=+=--=+=--=+=--=+=--=+=--+=+--+=+-
Lance Welsh M/S 01L-545 Silicon Graphics, Inc. PO Box 7311 Mountain View, CA 94039-7311
wk: (415) 390-1860 fax: (415) 390-6056 hm: (415) 322-7225
lance@sgi.com
-=+=--=+=--=+=--=+=--=+=--=+=--=+=--=+=--=+=--=+=--=+=--=+=-



Copyright © 1995, Silicon Graphics, Inc.