Sphinx-4 Frequently Asked Questions

General

Java
Sphinx-4
Acoustic & Language Models

Who created Sphinx-4?

Sphinx-4 was created via a joint collaboration between the Sphinx group at Carnegie Mellon University, Sun Microsystems Laboratories, Mitsubishi Electric Research Labs (MERL), and Hewlett Packard (HP), with contributions from the University of California at Santa Cruz (UCSC) and the Massachusetts Institute of Technology (MIT).
I have a question about Sphinx-4. How can I get it answered?
First, check this FAQ; many questions are answered here. If your question is not answered in the FAQ, you can post it to the Sphinx-4 Open Discussion Forum on SourceForge. Many of the Sphinx-4 developers monitor this forum and answer technical questions.
How can I contact the Sphinx-4 team?
You can contact the Sphinx-4 team by sending email to cmusphinx-contacts at sourceforge dot net.
How well does Sphinx-4 perform compared to other speech recognizers?
Comparing speech recognizers is often difficult. Speed and accuracy data for commercial recognizers is not typically available. We have compared Sphinx-4 with the Sphinx 3.3 recognizer. Results of this comparison are here: Sphinx-4 Performance Comparison
Isn't the Java Platform too slow to be used for speech recognition?
No, rumors of the poor performance of the Java platform are unfounded. Sphinx-4 runs faster than Sphinx 3.3 (CMU's fast recognizer) on many tests. For a good discussion of Java platform performance in speech engines, see FreeTTS - A Performance Case Study, a technical paper that compares the performance of a speech synthesis engine written in the Java programming language to that of its native-C counterpart.
Which Sphinx-4 distribution should I use?
Download the binary distribution if you just want to run the demos and write applications against the prebuilt Sphinx-4 jar files. Download the source distribution if you want to do everything above, plus build Sphinx-4 from source and modify or extend it.
Does Sphinx-4 support the Java Speech API (JSAPI)?
Currently, Sphinx-4 does not support the full Java Speech API. Instead, Sphinx-4 uses a lower-level API. However, Sphinx-4 does support Java Speech Grammar Format (JSGF) grammars.
Where can I learn more about the Java Speech Grammar Format (JSGF)?
A complete description of the JSGF can be found in the JSGF Grammar Format Specification.
Can I use Sphinx-4 in a J2ME device such as a phone or a PDA?
Probably not. Sphinx-4 requires version 1.4 of the Java platform, which is typically not available on smaller devices. Sphinx-4 also requires more memory than is typically available on a J2ME device; even simple digit recognition requires a 16MB heap. Finally, Sphinx-4 makes extensive use of floating-point math, and most J2ME devices do not have adequate floating-point performance for Sphinx-4.
Why can't I use Java versions prior to 1.4?
Sphinx-4 uses many language and API features of version 1.4 of the Java platform, including the logging API, the regular expressions API, the XML parsing APIs, and the assert facility.
I am having microphone troubles under Linux. What can I do?
There seems to be a significant difference in how different versions of the JDK determine which audio resources are available on Linux, and this difference affects different machines in different ways. We are working with the Java Sound folks to get to the root cause of the problem. In the meantime, if you are having trouble getting the demos to work on your Linux box, try the following:
How do I select a different microphone (e.g., a USB headset) on my machine?

By default, Sphinx-4 uses the getLine method of the Java Sound AudioSystem class to obtain a TargetDataLine (i.e., the object used to interface to your microphone). This method grabs a line from any of the available Mixers known to the AudioSystem. As such, when using the AudioSystem to obtain the TargetDataLine, you have little control over which line is chosen if more than one line matches the requirements of the front end. For example, if you plug a USB headset into a Macintosh PowerBook, the getLine method of the AudioSystem class will typically never select a line from the USB device.

This behavior can be frustrating, especially when you have a nice USB microphone you'd like to use.

To override the default behavior, you can set the selectMixer property of the Microphone class. In Java Sound, a Mixer is an audio device with one or more lines. In practice, a Mixer tends to be mapped to a particular system audio device. For example, on the Mac, there's a Mixer associated with the built-in audio hardware. Furthermore, when you plug in a USB headset, a new Mixer will appear for that headset. The selectMixer property allows you to specify which specific Mixer Sphinx-4 will use to select the TargetDataLine.

The value of the selectMixer property can be "default," which means let the AudioSystem decide which line to use from all the available Mixers, "last," which means select the last Mixer supported by the AudioSystem (USB headsets tend to be associated with the last Mixer), or an integer value that represents the index of the Mixer.Info that is returned by AudioSystem.getMixerInfo().

To get the list of Mixer.Info objects available on your system, along with their integer index values, run the AudioTool application with a command line argument of "-dumpMixers".

To set the selectMixer property of the Microphone, you need a component in your config file that defines the microphone. In the Sphinx-4 examples, this component is aptly named "microphone." In the configuration for the microphone component, you can then set the selectMixer property in the config file for the application. For example:

        <property name="selectMixer" value="last"/>

You can also set the selectMixer property from the command line. For example:

        java -Dmicrophone[selectMixer]=last -jar bin/AudioTool.jar 

In both of these examples, the last Mixer discovered by the Java Sound AudioSystem class will be used to select the TargetDataLine for the microphone.

Where can I find a speech synthesizer for the Java platform?
The Speech Integration group of Sun Labs has released FreeTTS, a speech synthesis system written in the Java programming language.
I want to add speech recognition to my application. Where do I start?
First, look at the source code of the Sphinx-4 demos to get a feel for how a Sphinx-4 application is put together. After that, read the Sphinx-4 Application Programmer's Guide for a more detailed description of how to write a Sphinx-4 application.
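
For orientation, here is a minimal sketch of the pattern used by the demos. The configuration file name ("myapp.config.xml") and the component names "recognizer" and "microphone" are examples and must match your own configuration file:

    import edu.cmu.sphinx.frontend.util.Microphone;
    import edu.cmu.sphinx.recognizer.Recognizer;
    import edu.cmu.sphinx.result.Result;
    import edu.cmu.sphinx.util.props.ConfigurationManager;

    public class SimpleRecognizer {
        public static void main(String[] args) throws Exception {
            // load the XML configuration; the file name is an example
            ConfigurationManager cm = new ConfigurationManager(
                    SimpleRecognizer.class.getResource("myapp.config.xml"));

            // look up and allocate the recognizer defined in the config file
            Recognizer recognizer = (Recognizer) cm.lookup("recognizer");
            recognizer.allocate();

            // look up the microphone component and start capturing audio
            Microphone microphone = (Microphone) cm.lookup("microphone");
            if (microphone.startRecording()) {
                Result result = recognizer.recognize();
                if (result != null) {
                    System.out.println("You said: " + result.getBestResultNoFiller());
                }
                microphone.stopRecording();
            }
            recognizer.deallocate();
        }
    }
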
How can I decode/transcribe .wav files?
Take a look at the Hello Wave Demo, a command-line program that transcribes the audio in a '.wav' file. Additionally, the Transcriber Demo demonstrates how Sphinx-4 can be used to transcribe a continuous audio file containing multiple utterances.
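
For example, here is a hedged sketch along the lines of the Transcriber Demo. It assumes your configuration feeds the front end from a StreamDataSource component named "streamDataSource" (the file and component names are examples, and the setInputStream call is used in the same way as in that demo):

    import java.io.File;
    import javax.sound.sampled.AudioInputStream;
    import javax.sound.sampled.AudioSystem;

    import edu.cmu.sphinx.frontend.util.StreamDataSource;
    import edu.cmu.sphinx.recognizer.Recognizer;
    import edu.cmu.sphinx.result.Result;
    import edu.cmu.sphinx.util.props.ConfigurationManager;

    public class WavTranscriber {
        public static void main(String[] args) throws Exception {
            ConfigurationManager cm = new ConfigurationManager(
                    WavTranscriber.class.getResource("transcriber.config.xml"));

            Recognizer recognizer = (Recognizer) cm.lookup("recognizer");
            recognizer.allocate();

            // feed the .wav file into the front end instead of the microphone
            AudioInputStream audio =
                    AudioSystem.getAudioInputStream(new File(args[0]));
            StreamDataSource source = (StreamDataSource) cm.lookup("streamDataSource");
            source.setInputStream(audio, args[0]);

            // decode utterance by utterance until the stream is exhausted
            Result result;
            while ((result = recognizer.recognize()) != null) {
                System.out.println(result.getBestResultNoFiller());
            }
            recognizer.deallocate();
        }
    }
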
How can I get the recognizer to return partial results while a recognition is in process?
It is possible to configure Sphinx-4 to generate partial results, that is, to inform you periodically as to what it thinks is the best possible hypothesis so far, even before the user has stopped speaking.

To get this information, add a result listener to the recognizer. Your listener will receive a result (which may or may not be a final result). The hypothesis text can be extracted from this result.

There is a good example of this in sphinx4/tests/live/Live.java. You can control how often the result listener is fired by setting the configuration variable 'featureBlockSize' in the decoder. The default setting of 50 means that the listener will be called after every 50 frames. Since each frame represents 10 ms of speech, the listener is called every 500 ms.
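
A minimal sketch of such a listener, assuming the recognizer exposes an addResultListener() method and that ResultListener is a plain callback interface with a newResult(Result) method (as in the releases of this era):

    import edu.cmu.sphinx.decoder.ResultListener;
    import edu.cmu.sphinx.recognizer.Recognizer;
    import edu.cmu.sphinx.result.Result;

    public class PartialResultPrinter {
        /** Attaches a listener that prints the best hypothesis so far as decoding proceeds. */
        public static void attach(Recognizer recognizer) {
            recognizer.addResultListener(new ResultListener() {
                public void newResult(Result result) {
                    // fired every 'featureBlockSize' frames; the result may or may not be final
                    if (!result.isFinal()) {
                        System.out.println("partial: " + result.getBestResultNoFiller());
                    }
                }
            });
        }
    }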

How can I get the N-Best list?
The method 'Result.getResultTokens()' returns a list of all the tokens associated with paths that have reached the end-of-sentence state.

This list is not a traditional N-best list of results; some good results may not be represented in it. Sphinx-4 also supports full word lattices that can provide full N-best lists. We currently do not have any user documentation for this, but we will be providing some shortly.
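
For illustration, a small sketch that walks the list returned by Result.getResultTokens(). The getWordPath() convenience method on Token is assumed here as the accessor for the word sequence along the path:

    import java.util.Iterator;
    import java.util.List;

    import edu.cmu.sphinx.decoder.search.Token;
    import edu.cmu.sphinx.result.Result;

    public class SentenceEndDumper {
        /** Prints the score and word sequence of every path that reached the final state. */
        public static void dump(Result result) {
            List tokens = result.getResultTokens();
            for (Iterator i = tokens.iterator(); i.hasNext();) {
                Token token = (Token) i.next();
                // scores are in the log domain; getWordPath() is assumed to return
                // the words along the path ending at this token
                System.out.println(token.getScore() + " : " + token.getWordPath());
            }
        }
    }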

See also: How can I obtain confidence scores for the recognition result?

How can I detect and ignore out-of-grammar utterances?
An out-of-grammar utterance occurs when a speaker says something that is not represented by the speech grammar. Usually, the recognizer will try to force a match between what was said and the grammar. Many applications need to detect when the user has spoken something unexpected. This is called out-of-grammar detection.

The FlatLinguist and the DynamicFlatLinguist can be configured to detect out-of-grammar utterances. To do so, set the following properties of either linguist:

When configured this way, the search will look for out-of-grammar utterances. If an out-of-grammar utterance is detected, Sphinx-4 will return a result that contains a single <unk> word. Moreover, if you want to know the exact sequence of phones that make up the unknown word, you can call the method Result.getBestToken().getWordUnitPath().
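
For example, a hedged sketch of checking a result for the <unk> marker. The getBestResultNoFiller() hypothesis accessor is the one used in the demos and is an assumption here; the getWordUnitPath() call is as described above:

    import edu.cmu.sphinx.result.Result;

    public class OutOfGrammarCheck {
        /** Returns true if the recognizer flagged the utterance as out-of-grammar. */
        public static boolean isOutOfGrammar(Result result) {
            // with out-of-grammar detection enabled, the result contains the single word <unk>
            String hypothesis = result.getBestResultNoFiller();
            if (hypothesis != null && hypothesis.indexOf("<unk>") >= 0) {
                // optionally inspect the phones the unknown word was matched against
                System.out.println("phones: " + result.getBestToken().getWordUnitPath());
                return true;
            }
            return false;
        }
    }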

How can I change my language models or grammars at runtime?
The JSGFGrammar class provides methods for swapping in a new JSGF grammar or modifying the currently active RuleGrammar used by a given Recognizer. The JSGFDemo gives an example of how to do this.
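
A hedged sketch in the style of the JSGFDemo. The JSGFGrammar package name, the loadJSGF() method, and the component name "jsgfGrammar" are assumptions that must match your version and configuration:

    // package name as in the Sphinx-4 source tree of this era; adjust if it differs
    import edu.cmu.sphinx.jsapi.JSGFGrammar;
    import edu.cmu.sphinx.util.props.ConfigurationManager;

    public class GrammarSwitcher {
        /** Swaps in a different JSGF grammar at runtime. */
        public static void switchGrammar(ConfigurationManager cm, String grammarName)
                throws Exception {
            // "jsgfGrammar" is an example component name taken from the demo configurations
            JSGFGrammar jsgfGrammar = (JSGFGrammar) cm.lookup("jsgfGrammar");
            // loads the named grammar from the configured grammar location
            jsgfGrammar.loadJSGF(grammarName);
        }
    }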

To handle more complex problems, such as switching between N-Gram language models, you can configure more than one Recognizer (one grammar per Recognizer) and switch between those Recognizers. The Dialog Demo provides an example of how to do this.

How can I perform word-spotting?
There is no support for word-spotting right now.
Can I use Sphinx-4 to recognize 8khz audio?
Yes. In your configuration file, specify the 8kHz Wall Street Journal acoustic models and configure them to use the 8kHz model loader:
<component name="wsj"
    type="edu.cmu.sphinx.model.acoustic.WSJ_8gau_13dCep_8kHz_31mel_200Hz_3500Hz.Model">
    <property name="loader" value="8kLoader">
    ...
</component>

<component name="8kLoader">
    type="edu.cmu.sphinx.model.acoustic.WSJ_8gau_13dCep_8kHz_31mel_200Hz_3500Hz.ModelLoader">
    ...
</component>
    
Where can I get the audio data used in the regression tests?
Much of the audio data used in the regression tests is obtained from the Linguistic Data Consortium.
How do I use the Result object?
A search result typically consists of a number of hypotheses. Each hypothesis is represented by a path through the search space, and each path is represented by a single token that corresponds to the end point of the path. Using the token.getPredecessor() method, an application can trace back through the entire path to the beginning of the utterance.

Each token along the path carries a number of interesting pieces of data that can be used by the application, including:

The method getScore() returns the path score for the path represented by a particular token. This is the total score, which includes the language, acoustic, and insertion components.

The method getAcousticScore() returns the acoustic score for the token. This score represents how well the associated search state matches the input feature for the frame associated with the token, and is typically meaningful only for 'emitting' states.

The method getLanguageScore() returns the language component of the score, and getInsertionProbability() returns the insertion component of the score.

Thus getScore() returns (all values are in the log domain):

        entryScore + getAcousticScore() + getLanguageScore() + getInsertionProbability()

where entryScore is token.getPredecessor().getScore().
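
Putting this together, a small sketch that walks the best path from its final token back to the start of the utterance, using only the accessors described above:

    import edu.cmu.sphinx.decoder.search.Token;
    import edu.cmu.sphinx.result.Result;

    public class PathDumper {
        /** Walks the best path back to the beginning of the utterance, printing each score. */
        public static void dumpBestPath(Result result) {
            for (Token token = result.getBestToken(); token != null;
                    token = token.getPredecessor()) {
                // all values are in the log domain
                System.out.println("total=" + token.getScore()
                        + " acoustic=" + token.getAcousticScore()
                        + " language=" + token.getLanguageScore()
                        + " insertion=" + token.getInsertionProbability());
            }
        }
    }
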
Does Sphinx-4 support speaker identification?
Sphinx-4 currently does not support speaker identification, which is the process of identifying who is speaking. However, the architecture of Sphinx-4 is flexible enough for someone to add such capabilities. To learn more about speaker identification, see: http://www.speech.cs.cmu.edu/comp.speech/Section6/Q6.6.html
How can I obtain confidence scores for the recognition result?
Some experimental work has been done to support confidence scores. As this work is still experimental, please use it with caution. Refer to the Confidence Score Demo for example code showing how to do this. Note that currently this only works for configurations using the LexTreeLinguist and the WordPruningBreadthFirstSearchManager.
How can I train my own acoustic models?
Sphinx-4 loads Sphinx-3 acoustic models, which can be trained with the Sphinx-3 trainer, SphinxTrain.
How do I use models trained by SphinxTrain in Sphinx-4?
Please refer to the document Using SphinxTrain Models in Sphinx-4.
Does the Sphinx-4 front end generate the same features as the SphinxTrain wave2feat program?
The features that SphinxTrain generates are cepstra, which are usually 13-dimensional. The features that Sphinx-4 generates contain more than the cepstra: they are 39-dimensional, consisting of the cepstra, the deltas of the cepstra, and the double deltas of the cepstra (thus three times the size). To make Sphinx-4 generate the same cepstra as the SphinxTrain wave2feat program, remove the last two steps from the front end, so that it looks like:
    <component name="mfcFrontEnd" type="edu.cmu.sphinx.frontend.FrontEnd">
        <propertylist name="pipeline">
	    <item>streamDataSource</item>
	    <item>premphasizer</item>
	    <item>windower</item>
	    <item>fft</item>
	    <item>melFilterBank</item>
	    <item>dct</item>
	</propertylist>
    </component>
    
How can I create my own language models?
N-Gram language models can be created with the CMU Statistical Language Modeling (SLM) Toolkit. For more information, see this Example of building a Language Model.
How can I create my own dictionary?
Sphinx-4 currently supports dictionaries in the CMU dictionary format. The CMU dictionary format is described in the FullDictionary javadocs.

Each line of the dictionary specifies the word, followed by spaces or a tab, followed by the pronunciation of the word (given as a list of phones). Each word can have more than one pronunciation. For example, a digits dictionary would look like:

    ONE HH W AH N
    ONE(2) W AH N
    TWO T UW
    THREE TH R IY
    FOUR F AO R
    FIVE F AY V
    SIX S IH K S
    SEVEN S EH V AH N
    EIGHT EY T
    NINE N AY N
    ZERO Z IH R OW
    ZERO(2) Z IY R OW
    OH OW
    
In the above example, the words "one" and "zero" have two pronunciations each.

Some more details on the format of the dictionary can be found at the CMU Pronouncing Dictionary page.

Note that the phones used to define the pronunciation of a word can be arbitrary strings. It is important, however, that they match the units in the acoustic model. If you unpack an acoustic model, you will find among the many files one with the suffix ".mdef". This file contains a mapping of units to senones (tied Gaussian mixtures); the first column in this file gives the unit names (phones) used by the acoustic model.

Your dictionary should use these units to define the pronunciation for a word.

I've created my own language model. How do I create the binary (DMP) form?
The tool lm3g2dmp generates a DMP file from a language model in ARPA format (which can be generated by the CMU/CU Statistical Language Modeling Toolkit). The lm3g2dmp code is available on sf.net in the module share/lm3g2dmp, but the easiest way to get it is via a nightly build (a tarball generated nightly from the CVS tree). The Sphinx utilities site has a link to this build (and also to the CMU SLM toolkit); alternatively, you can retrieve it from the Sphinx nightly builds.

Note that the format of the output from the CMU/CU SLM Toolkit program idngram2lm (using the -binary option) is different from the DMP format, and therefore cannot be read by Sphinx-4.


Last update on June 30, 2004.
Copyright 1999-2004 Carnegie Mellon University.
Portions Copyright 2002-2004 Sun Microsystems, Inc.
Portions Copyright 2002-2004 Mitsubishi Electric Research Laboratories.
All Rights Reserved. Usage is subject to license terms.