Sphinx-4 Application Programmer's Guide
This tutorial shows you how to write Sphinx-4 applications. We will use the HelloDigits demo as an example of how a simple application can be written, and then proceed to a more complex example.
We will look at a very simple Sphinx-4 speech application, the HelloDigits demo. This application recognizes connected digits. As you will see, the code is very simple. The harder part is understanding the configuration, but we will guide you through every step of it. Let's look at the code first.
All the source code of the HelloDigits demo is in one short file, sphinx4/demo/sphinx/hellodigits/HelloDigits.java:
/*
 * Copyright 2004 Carnegie Mellon University.
 * Portions Copyright 2004 Sun Microsystems, Inc.
 * Portions Copyright 2004 Mitsubishi Electric Research Laboratories.
 * All Rights Reserved. Use is subject to license terms.
 *
 * See the file "license.terms" for information on usage and
 * redistribution of this file, and for a DISCLAIMER OF ALL
 * WARRANTIES.
 *
 */

package demo.sphinx.hellodigits;

import edu.cmu.sphinx.frontend.util.Microphone;
import edu.cmu.sphinx.recognizer.Recognizer;
import edu.cmu.sphinx.result.Result;
import edu.cmu.sphinx.util.props.ConfigurationManager;
import edu.cmu.sphinx.util.props.PropertyException;

import java.io.File;
import java.io.IOException;
import java.net.URL;

/**
 * A simple HelloDigits demo showing a simple speech application
 * built using Sphinx-4. This application uses the Sphinx-4 endpointer,
 * which automatically segments incoming audio into utterances and silences.
 */
public class HelloDigits {

    /**
     * Main method for running the HelloDigits demo.
     */
    public static void main(String[] args) {
        try {
            URL url;
            if (args.length > 0) {
                url = new File(args[0]).toURI().toURL();
            } else {
                url = HelloDigits.class.getResource("hellodigits.config.xml");
            }

            ConfigurationManager cm = new ConfigurationManager(url);

            Recognizer recognizer = (Recognizer) cm.lookup("recognizer");
            Microphone microphone = (Microphone) cm.lookup("microphone");

            /* allocate the resource necessary for the recognizer */
            recognizer.allocate();

            /* the microphone will keep recording until the program exits */
            if (microphone.startRecording()) {

                System.out.println
                    ("Say any digit(s): e.g. \"two oh oh four\", " +
                     "\"three six five\".");

                while (true) {
                    System.out.println
                        ("Start speaking. Press Ctrl-C to quit.\n");

                    /*
                     * This method will return when the end of speech
                     * is reached. Note that the endpointer will determine
                     * the end of speech.
                     */
                    Result result = recognizer.recognize();

                    if (result != null) {
                        String resultText = result.getBestResultNoFiller();
                        System.out.println("You said: " + resultText + "\n");
                    } else {
                        System.out.println("I can't hear what you said.\n");
                    }
                }
            } else {
                System.out.println("Cannot start microphone.");
                recognizer.deallocate();
                System.exit(1);
            }
        } catch (IOException e) {
            System.err.println("Problem when loading HelloDigits: " + e);
            e.printStackTrace();
        } catch (PropertyException e) {
            System.err.println("Problem configuring HelloDigits: " + e);
            e.printStackTrace();
        } catch (InstantiationException e) {
            System.err.println("Problem creating HelloDigits: " + e);
            e.printStackTrace();
        }
    }
}
The three main Sphinx-4 classes used here are edu.cmu.sphinx.recognizer.Recognizer, edu.cmu.sphinx.result.Result, and edu.cmu.sphinx.util.props.ConfigurationManager. The Recognizer is the main class any application should interact with (refer also to the architecture diagram above). The Result is returned by the Recognizer to the application after recognition completes. The ConfigurationManager creates the entire Sphinx-4 system according to the configuration specified by the user.
Let's look at the main() method. The first few lines create the URL of the XML-based configuration file. A ConfigurationManager is then created using that URL, and it reads in the file internally. Since the configuration file specifies the components "recognizer" and "microphone" (we will look at the configuration file next), we perform a lookup() in the ConfigurationManager to obtain these components. The allocate() method of the Recognizer is then called to allocate the resources needed for the recognizer. The Microphone class is used for capturing live audio from the system audio device. Both the Recognizer and the Microphone are configured as specified in the configuration file.
Once all the necessary components are created, we can start running the demo. The program first turns on the microphone (microphone.startRecording()). After the microphone is turned on successfully, the program enters a loop that repeats the following. It tries to recognize what the user is saying, using the Recognizer.recognize() method. Recognition stops when the user stops speaking, which is detected by the endpointer built into the front end by configuration. Once an utterance is recognized, the recognized text, which is returned by the method Result.getBestResultNoFiller(), is printed out. If the Recognizer recognized nothing (i.e., the result is null), it prints out a message saying so. Finally, if the demo program cannot turn on the microphone in the first place, the Recognizer is deallocated and the program exits.
It is generally a good practice to call the method deallocate() after the work is done, to release all the resources.
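As a rough sketch (not part of the demo source), a one-shot recognition that releases its resources when it is done might look like this, assuming recognizer and microphone were obtained from the ConfigurationManager exactly as above:

// Sketch only: a single recognition followed by cleanup, using the same
// objects (recognizer, microphone) looked up in the HelloDigits main() method.
recognizer.allocate();
try {
    if (microphone.startRecording()) {
        Result result = recognizer.recognize();
        if (result != null) {
            System.out.println("You said: " + result.getBestResultNoFiller());
        }
    }
} finally {
    recognizer.deallocate();   // release the resources acquired by allocate()
}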
Note that several exceptions are caught. The IOException is thrown by the constructor of the ConfigurationManager and by the Recognizer.allocate() method. The PropertyException is thrown by the constructor and by the lookup() method of the ConfigurationManager. These exceptions should be caught and handled appropriately.
Hopefully, by this point, you will have some idea of how to write a simple Sphinx-4 application. We will now turn to the harder part, understanding the various components necessary to create a connected-digits recognizer. These components are specified in the configuration file, which we will now explain in depth.
In this section, we will explain the various Sphinx-4 components that are used for the HelloDigits demo, as specified in the configuration file. We will look at each section of the config file in depth. If you want to learn about the format of these configuration files, please refer to the document Sphinx-4 Configuration Management.
The lines below define the frequently tuned properties. They are located at the top of the configuration file so that they can be edited quickly.
<!-- ******************************************************** -->
<!-- frequently tuned properties                              -->
<!-- ******************************************************** -->

<property name="logLevel" value="WARNING"/>
<property name="absoluteBeamWidth" value="-1"/>
<property name="relativeBeamWidth" value="1E-80"/>
<property name="wordInsertionProbability" value="1E-36"/>
<property name="languageWeight" value="8"/>
<property name="frontend" value="epFrontEnd"/>
<property name="recognizer" value="recognizer"/>
<property name="showCreations" value="false"/>
The lines below define the recognizer component that performs speech recognition. They give the component the name "recognizer" and the class edu.cmu.sphinx.recognizer.Recognizer, which is the class any application should interact with. If you look at the javadoc of the Recognizer class, you will see that it has two properties, 'decoder' and 'monitors'. This configuration file is where the values of these properties are defined.
<!-- ******************************************************** -->
<!-- word recognizer configuration                            -->
<!-- ******************************************************** -->

<component name="recognizer" type="edu.cmu.sphinx.recognizer.Recognizer">
    <property name="decoder" value="decoder"/>
    <propertylist name="monitors">
        <item>accuracyTracker</item>
        <item>speedTracker</item>
        <item>memoryTracker</item>
    </propertylist>
</component>
We will explain the monitors later. For now, let's look at the decoder.
<component name="decoder" type="edu.cmu.sphinx.decoder.Decoder">
    <property name="searchManager" value="searchManager"/>
</component>

The decoder component is of class edu.cmu.sphinx.decoder.Decoder. Its property 'searchManager' is set to the component 'searchManager', defined as:
<component name="searchManager"
    type="edu.cmu.sphinx.decoder.search.SimpleBreadthFirstSearchManager">
    <property name="logMath" value="logMath"/>
    <property name="linguist" value="flatLinguist"/>
    <property name="pruner" value="trivialPruner"/>
    <property name="scorer" value="threadedScorer"/>
    <property name="activeListFactory" value="activeList"/>
</component>

The searchManager is of class edu.cmu.sphinx.decoder.search.SimpleBreadthFirstSearchManager. This class performs a simple breadth-first search through the search graph during the decoding process to find the best path. This search manager is suitable for small to medium sized vocabulary decoding.
The logMath property is the log math that is used for calculating scores during the search process. It is defined with a log base of 1.0001. Note that typically the same log base should be used throughout all components, and therefore there should be only one logMath definition in a configuration file:
<component name="logMath" type="edu.cmu.sphinx.util.LogMath">
    <property name="logBase" value="1.0001"/>
    <property name="useAddTable" value="true"/>
</component>

The linguist of the searchManager is set to the component 'flatLinguist' (which we will look at later), which again is suitable for small to medium sized vocabulary decoding. The pruner is set to the 'trivialPruner':
<component name="trivialPruner" type="edu.cmu.sphinx.decoder.pruner.SimplePruner"/>

which is of class edu.cmu.sphinx.decoder.pruner.SimplePruner. This pruner performs simple absolute beam and relative beam pruning based on the scores of the tokens.
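To make the two beams concrete, here is a small, self-contained sketch of the idea (not the actual SimplePruner source). Scores are assumed to be in the log domain, so a token is within the relative beam if its score is no more than the beam width below the best score, and the absolute beam simply caps how many tokens are kept:

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Conceptual illustration of absolute + relative beam pruning on log-domain scores.
public class BeamPruningSketch {

    static List<Double> prune(List<Double> logScores,
                              int absoluteBeamWidth,        // -1 means "no limit"
                              double logRelativeBeamWidth)  // e.g. the log of 1E-80
    {
        List<Double> sorted = new ArrayList<Double>(logScores);
        Collections.sort(sorted, Collections.reverseOrder()); // best score first
        List<Double> kept = new ArrayList<Double>();
        if (sorted.isEmpty()) {
            return kept;
        }
        double best = sorted.get(0);
        for (double score : sorted) {
            boolean withinAbsolute =
                absoluteBeamWidth < 0 || kept.size() < absoluteBeamWidth;
            boolean withinRelative = score >= best + logRelativeBeamWidth;
            if (withinAbsolute && withinRelative) {
                kept.add(score);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Double> scores = new ArrayList<Double>();
        Collections.addAll(scores, -10.0, -12.0, -500.0, -11.0);
        // keep at most 3 tokens, and only those within 100 log units of the best
        System.out.println(prune(scores, 3, -100.0));   // prints [-10.0, -11.0, -12.0]
    }
}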
The scorer of the searchManager is set to the component 'threadedScorer', which is of class edu.cmu.sphinx.decoder.scorer.ThreadedAcousticScorer. It can use multiple threads (usually one per CPU) to score the tokens in the active list. Scoring is one of the most time-consuming steps of the decoding process. Tokens can be scored independently of each other, so using multiple CPUs will definitely speed things up. The threadedScorer is defined as follows:
<component name="threadedScorer" type="edu.cmu.sphinx.decoder.scorer.ThreadedAcousticScorer">
    <property name="frontend" value="${frontend}"/>
    <property name="isCpuRelative" value="true"/>
    <property name="numThreads" value="0"/>
    <property name="minScoreablesPerThread" value="10"/>
    <property name="scoreablesKeepFeature" value="true"/>
</component>

The 'frontend' property is the front end from which features are obtained. For details about the other properties of the threadedScorer, please refer to the javadoc for ThreadedAcousticScorer. Finally, the activeListFactory property of the searchManager is set to the component 'activeList', which is defined as follows:
<component name="activeList" type="edu.cmu.sphinx.decoder.search.PartitionActiveListFactory">
    <property name="logMath" value="logMath"/>
    <property name="absoluteBeamWidth" value="${absoluteBeamWidth}"/>
    <property name="relativeBeamWidth" value="${relativeBeamWidth}"/>
</component>

It is of class edu.cmu.sphinx.decoder.search.PartitionActiveListFactory. It uses a partitioning algorithm to select the top N highest-scoring tokens when performing absolute beam pruning. The 'logMath' property specifies the logMath used for score calculation, which is the same LogMath used in the searchManager. The property 'absoluteBeamWidth' is set to the value given at the very top of the configuration file using ${absoluteBeamWidth}. The same goes for 'relativeBeamWidth'.
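To make the substitution explicit, the pattern used throughout the file is simply to define a global property once near the top and then reference it by name inside any component (illustrative fragment only):

<!-- global property, defined once near the top of the file -->
<property name="absoluteBeamWidth" value="-1"/>

<!-- referenced later; ${absoluteBeamWidth} is replaced by -1 when the file is loaded -->
<component name="activeList" type="edu.cmu.sphinx.decoder.search.PartitionActiveListFactory">
    <property name="absoluteBeamWidth" value="${absoluteBeamWidth}"/>
</component>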
The linguist of the searchManager is the component 'flatLinguist', which is of class edu.cmu.sphinx.linguist.flat.FlatLinguist and is defined as:

<component name="flatLinguist" type="edu.cmu.sphinx.linguist.flat.FlatLinguist">
    <property name="logMath" value="logMath"/>
    <property name="grammar" value="jsgfGrammar"/>
    <property name="acousticModel" value="tidigits"/>
    <property name="wordInsertionProbability" value="${wordInsertionProbability}"/>
    <property name="languageWeight" value="${languageWeight}"/>
</component>

It also uses the logMath that we've seen already. The grammar used is the component called 'jsgfGrammar', which is a BNF-style grammar:
<component name="jsgfGrammar" type="edu.cmu.sphinx.jsapi.JSGFGrammar">
    <property name="grammarLocation"
        value="resource:/demo.sphinx.helloworld.HelloWorld!/demo/sphinx/hellodigits/"/>
    <property name="dictionary" value="dictionary"/>
    <property name="grammarName" value="digits"/>
    <property name="logMath" value="logMath"/>
</component>
JSGF grammars are defined in JSAPI. The class that translates JSGF into a form that Sphinx-4 understands is edu.cmu.sphinx.jsapi.JSGFGrammar (note that its javadoc also describes the limitations of the current implementation). The property 'grammarLocation' can take two kinds of values. If it is a URL, it specifies the URL of the directory where JSGF grammar files are to be found. Otherwise, it is interpreted as a resource locator. In our example, the HelloDigits demo is being deployed as a JAR file. The 'grammarLocation' property is therefore used to specify the location of the resource digits.gram within the JAR file. Note that it is not necessary to specify the JAR file within which to search. The system searches the JAR file in which the class demo.sphinx.helloworld.HelloWorld resides.
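For reference, a JSGF grammar for digits might look roughly like the following. This is only an illustrative sketch, not the actual digits.gram file shipped with the demo:

#JSGF V1.0;

grammar digits;

public <digits> = ( zero | oh | one | two | three | four |
                    five | six | seven | eight | nine )+;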
The 'grammarName' property specifies the grammar to use when creating the search graph. 'logMath' is the same log math as the other components. The 'dictionary' is the component that maps words to their phonemes. It is almost always the dictionary of the acoustic model, which lists all the words that were used to train the acoustic model:
<component name="dictionary" type="edu.cmu.sphinx.linguist.dictionary.FastDictionary">
    <property name="dictionaryPath"
        value="resource:/edu.cmu.sphinx.model.acoustic.TIDIGITS_8gau_13dCep_16k_40mel_130Hz_6800Hz!/edu/cmu/sphinx/model/acoustic/dictionary"/>
    <property name="fillerPath"
        value="resource:/edu.cmu.sphinx.model.acoustic.TIDIGITS_8gau_13dCep_16k_40mel_130Hz_6800Hz!/edu/cmu/sphinx/model/acoustic/fillerdict"/>
    <property name="addSilEndingPronunciation" value="false"/>
    <property name="wordReplacement" value="<sil>"/>
    <property name="allowMissingWords" value="true"/>
</component>
The locations of these dictionary files are specified using the Sphinx-4 resource mechanism. In short, this mechanism locates the JAR file in which a class resides, and looks into that JAR file for the desired resource. The syntax is:

resource:/{name of class to locate the JAR file}!{location in the JAR file of the desired resource}

Take the 'dictionaryPath' property, for example. The "name of class to locate the JAR file" is "edu.cmu.sphinx.model.acoustic.TIDIGITS_8gau_13dCep_16k_40mel_130Hz_6800Hz", while the "location in the JAR file of the desired resource" is "/edu/cmu/sphinx/model/acoustic/dictionary".
The dictionary for filler words like "BREATH" and "LIP_SMACK" is the file fillerdict.
For details about the other properties, please refer to the javadoc for FastDictionary.
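Conceptually, a resource string like the ones above is resolved by loading the named class and then asking its class loader for the path after the '!'. The following is a rough, hypothetical illustration of that idea (Sphinx-4's actual implementation may differ), and it assumes the TIDIGITS JAR is on the classpath:

import java.net.URL;

// Hypothetical sketch of how a "resource:/..." locator could be resolved.
public class ResourceLocatorSketch {

    static URL resolve(String locator) throws ClassNotFoundException {
        String spec = locator.substring("resource:/".length());
        int bang = spec.indexOf('!');
        String className = spec.substring(0, bang);   // class that pins down the JAR
        String path = spec.substring(bang + 1);       // e.g. /edu/cmu/sphinx/...
        Class<?> anchor = Class.forName(className);
        return anchor.getResource(path);              // looked up via the class's loader
    }

    public static void main(String[] args) throws Exception {
        System.out.println(resolve(
            "resource:/edu.cmu.sphinx.model.acoustic.TIDIGITS_8gau_13dCep_16k_40mel_130Hz_6800Hz"
            + "!/edu/cmu/sphinx/model/acoustic/dictionary"));
    }
}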
The next important property of the flatLinguist is the acoustic model, which is defined as:
<component name="tidigits"
    type="edu.cmu.sphinx.model.acoustic.TIDIGITS_8gau_13dCep_16k_40mel_130Hz_6800Hz">
    <property name="loader" value="sphinx3Loader"/>
</component>

<component name="sphinx3Loader" type="edu.cmu.sphinx.model.acoustic.TIDIGITSLoader">
    <property name="logMath" value="logMath"/>
</component>
'tidigits' stands for the TIDIGITS acoustic models. In Sphinx-4, different acoustic models are represented by classes that implement the AcousticModel interface. The implementation class, together with the actual model data files, is packaged in a JAR file, which is included in the classpath of the Sphinx-4 application. The JAR file for the TIDIGITS models is called TIDIGITS_8gau_13dCep_16k_40mel_130Hz_6800Hz.jar, and is in the sphinx4/lib directory. Inside this JAR file is a class called edu.cmu.sphinx.model.acoustic.TIDIGITS_8gau_13dCep_16k_40mel_130Hz_6800Hz, which implements the AcousticModel interface. This class will automatically load all the actual model data files when it is loaded. The loader of this AcousticModel is called TIDIGITSLoader, and it is also inside the JAR file. As a programmer, all you need to do is specify the class of the AcousticModel and the loader of the AcousticModel, as shown above (note that if you are using the TIDIGITS model in other applications, these lines should be the same, except that you might have called your 'logMath' component something else).
The 'wordInsertionProbability' and 'languageWeight' properties typically depend on the vocabulary size of the task. The table below lists typical values for tasks of various sizes (the first row is the one used by HelloDigits):

Vocabulary Size | Word Insertion Probability | Language Weight |
Digits (11 words - TIDIGITS) | 1E-36 | 8 |
Small (80 words - AN4) | 1E-26 | 7 |
Medium (1000 words - RM1) | 1E-10 | 7 |
Large (64000 words - HUB4) | 0.2 | 10.5 |
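For example, to retune this configuration for a medium-sized (RM1-like) vocabulary, you would only need to change the two global properties at the top of the file to the values from the table (illustrative fragment only):

<property name="wordInsertionProbability" value="1E-10"/>
<property name="languageWeight" value="7"/>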
<!-- ******************************************************** -->
<!-- The frontend configuration                               -->
<!-- ******************************************************** -->

<component name="frontEnd" type="edu.cmu.sphinx.frontend.FrontEnd">
    <propertylist name="pipeline">
        <item>microphone</item>
        <item>premphasizer</item>
        <item>windower</item>
        <item>fft</item>
        <item>melFilterBank</item>
        <item>dct</item>
        <item>liveCMN</item>
        <item>featureExtraction</item>
    </propertylist>
</component>

<!-- ******************************************************** -->
<!-- The live frontend configuration                          -->
<!-- ******************************************************** -->

<component name="epFrontEnd" type="edu.cmu.sphinx.frontend.FrontEnd">
    <propertylist name="pipeline">
        <item>microphone</item>
        <item>speechClassifier</item>
        <item>speechMarker</item>
        <item>nonSpeechDataFilter</item>
        <item>premphasizer</item>
        <item>windower</item>
        <item>fft</item>
        <item>melFilterBank</item>
        <item>dct</item>
        <item>liveCMN</item>
        <item>featureExtraction</item>
    </propertylist>
</component>

As you might notice, the only difference between these two front ends is that the live front end (epFrontEnd) has the additional components speechClassifier, speechMarker and nonSpeechDataFilter. These three components make up the default endpointer of Sphinx-4. Below is a listing of all the components of both front ends, and those properties which have values different from the default:
<component name="speechClassifier" type="edu.cmu.sphinx.frontend.endpoint.SpeechClassifier">
    <property name="threshold" value="13"/>
</component>

<component name="nonSpeechDataFilter" type="edu.cmu.sphinx.frontend.endpoint.NonSpeechDataFilter"/>

<component name="speechMarker" type="edu.cmu.sphinx.frontend.endpoint.SpeechMarker">
    <property name="speechTrailer" value="50"/>
</component>

<component name="premphasizer" type="edu.cmu.sphinx.frontend.filter.Preemphasizer"/>

<component name="windower" type="edu.cmu.sphinx.frontend.window.RaisedCosineWindower"/>

<component name="fft" type="edu.cmu.sphinx.frontend.transform.DiscreteFourierTransform"/>

<component name="melFilterBank" type="edu.cmu.sphinx.frontend.frequencywarp.MelFrequencyFilterBank"/>

<component name="dct" type="edu.cmu.sphinx.frontend.transform.DiscreteCosineTransform"/>

<component name="batchCMN" type="edu.cmu.sphinx.frontend.feature.BatchCMN"/>

<component name="liveCMN" type="edu.cmu.sphinx.frontend.feature.LiveCMN"/>

<component name="featureExtraction" type="edu.cmu.sphinx.frontend.feature.DeltasFeatureExtractor"/>

<component name="microphone" type="edu.cmu.sphinx.frontend.util.Microphone">
    <property name="msecPerRead" value="10"/>
    <property name="closeBetweenUtterances" value="false"/>
</component>

Let's explain some of the properties set here that have values different from the default. The property 'threshold' of the SpeechClassifier specifies the minimum difference between the input signal level and the background signal level for the input signal to be classified as speech. Therefore, the smaller this number, the more sensitive the endpointer, and vice versa. The property 'speechTrailer' of the SpeechMarker specifies the amount of non-speech signal to be included after the end of speech to make sure that no speech signal is lost; here it is set to 50 milliseconds. The property 'msecPerRead' of the Microphone specifies the number of milliseconds of data to read at a time from the system audio device; the value specified here is 10 ms. The property 'closeBetweenUtterances' specifies whether the system audio device should be released between utterances. It is set to false here, meaning that the system audio device will not be released between utterances. This is done because on certain systems (Linux, for one), closing and reopening the audio device does not work well.
The monitors listed in the recognizer component are defined as follows:

<component name="accuracyTracker" type="edu.cmu.sphinx.instrumentation.AccuracyTracker">
    <property name="recognizer" value="${recognizer}"/>
    <property name="showAlignedResults" value="false"/>
    <property name="showRawResults" value="false"/>
</component>

<component name="memoryTracker" type="edu.cmu.sphinx.instrumentation.MemoryTracker">
    <property name="recognizer" value="${recognizer}"/>
    <property name="showSummary" value="false"/>
    <property name="showDetails" value="false"/>
</component>

<component name="speedTracker" type="edu.cmu.sphinx.instrumentation.SpeedTracker">
    <property name="recognizer" value="${recognizer}"/>
    <property name="frontend" value="${frontend}"/>
    <property name="showSummary" value="true"/>
    <property name="showDetails" value="false"/>
</component>
The various knobs of these monitors mainly control whether statistical information about accuracy, speed, and memory usage should be printed out. Since the monitors observe the behavior of a recognizer, they each need a reference to the recognizer that they are monitoring.
HelloDigits uses a very small vocabulary and a guided grammar. What if you want to use a larger vocabulary, and there is no guided grammar for your application? One way to do it would be to use what is known as a language model, which describes the probability of occurrence of a series of words. The HelloNGram demo shows you how to do this with Sphinx-4.
The source code for the HelloNGram demo is exactly the same as that of the HelloDigits demo, except for the name of the demo class. The demo runs exactly the same way: it keeps listening and recognizing what you say, and when it detects the end of an utterance, it shows the recognition result.
Sphinx-4 supports the n-gram language models (both ASCII and binary versions) generated by the Carnegie Mellon University Statistical Language Modeling toolkit. The input to the toolkit is a long list of sample utterances. Using the occurrences of words and sequences of words in this input, a language model can be trained. The resulting trigram language model file is hellongram.trigram.lm.
<!-- ******************************************************** -->
<!-- frequently tuned properties                              -->
<!-- ******************************************************** -->

<property name="absoluteBeamWidth" value="500"/>
<property name="relativeBeamWidth" value="1E-80"/>
<property name="absoluteWordBeamWidth" value="20"/>
<property name="relativeWordBeamWidth" value="1E-60"/>
<property name="wordInsertionProbability" value="1E-16"/>
<property name="languageWeight" value="7.0"/>
<property name="silenceInsertionProbability" value=".1"/>
<property name="frontend" value="epFrontEnd"/>
<property name="recognizer" value="recognizer"/>
<property name="showCreations" value="false"/>

The above lines define the frequently tuned properties. They are located at the top of the configuration file so that they can be edited quickly.
<component name="recognizer" type="edu.cmu.sphinx.recognizer.Recognizer">
    <property name="decoder" value="decoder"/>
    <propertylist name="monitors">
        <item>accuracyTracker</item>
        <item>speedTracker</item>
        <item>memoryTracker</item>
        <item>recognizerMonitor</item>
    </propertylist>
</component>

The above lines define the recognizer component that performs speech recognition. They define the name and class of the recognizer, which is the class any application should interact with. If you look at the javadoc of the Recognizer class, you will see that it has two properties, 'decoder' and 'monitors'. This configuration file is where the values of these properties are defined.
<component name="decoder" type="edu.cmu.sphinx.decoder.Decoder">
    <property name="searchManager" value="wordPruningSearchManager"/>
    <property name="featureBlockSize" value="50"/>
</component>

The decoder component is defined to be of class edu.cmu.sphinx.decoder.Decoder. Its property 'searchManager' is set to the component 'wordPruningSearchManager':
<component name="wordPruningSearchManager"
    type="edu.cmu.sphinx.decoder.search.WordPruningBreadthFirstSearchManager">
    <property name="logMath" value="logMath"/>
    <property name="linguist" value="lexTreeLinguist"/>
    <property name="pruner" value="trivialPruner"/>
    <property name="scorer" value="threadedScorer"/>
    <property name="activeListManager" value="activeListManager"/>
    <property name="growSkipInterval" value="0"/>
    <property name="checkStateOrder" value="false"/>
    <property name="buildWordLattice" value="false"/>
    <property name="acousticLookaheadFrames" value="1.7"/>
    <property name="relativeBeamWidth" value="${relativeBeamWidth}"/>
</component>

The searchManager is of class edu.cmu.sphinx.decoder.search.WordPruningBreadthFirstSearchManager. It is better suited than the SimpleBreadthFirstSearchManager for larger vocabulary recognition. This class also performs a simple breadth-first search through the search graph, but at each frame it prunes the different types of states separately. The logMath property is the log math that is used for calculating scores during the search process. It is defined with a log base of 1.0001. Note that typically the same log base should be used throughout all components, and therefore there should be only one logMath definition:
<component name="logMath" type="edu.cmu.sphinx.util.LogMath">
    <property name="logBase" value="1.0001"/>
    <property name="useAddTable" value="true"/>
</component>

The linguist of the searchManager is set to the component 'lexTreeLinguist' (which we will look at later), which again is suitable for large vocabulary recognition. The pruner is set to the 'trivialPruner':
<component name="trivialPruner" type="edu.cmu.sphinx.decoder.pruner.SimplePruner"/>

which is of class edu.cmu.sphinx.decoder.pruner.SimplePruner. This pruner performs simple absolute beam and relative beam pruning based on the scores of the tokens.
The scorer of the searchManager is set to the component 'threadedScorer', which is of class edu.cmu.sphinx.decoder.scorer.ThreadedAcousticScorer. It can use multiple threads (usually one per CPU) to score the tokens in the active list. Scoring is one of the most time-consuming steps of the decoding process. Tokens can be scored independently of each other, so using multiple CPUs will definitely speed things up. The threadedScorer is defined as follows:
<component name="threadedScorer" type="edu.cmu.sphinx.decoder.scorer.ThreadedAcousticScorer">
    <property name="frontend" value="${frontend}"/>
    <property name="isCpuRelative" value="true"/>
    <property name="numThreads" value="0"/>
    <property name="minScoreablesPerThread" value="10"/>
    <property name="scoreablesKeepFeature" value="true"/>
</component>

The 'frontend' property is the front end from which features are obtained. For details about the other properties of the threadedScorer, please refer to the javadoc for ThreadedAcousticScorer. Finally, the 'activeListManager' property of the wordPruningSearchManager is set to the component 'activeListManager', which is defined as follows:
<component name="activeListManager" type="edu.cmu.sphinx.decoder.search.SimpleActiveListManager">
    <propertylist name="activeListFactories">
        <item>standardActiveListFactory</item>
        <item>wordActiveListFactory</item>
        <item>wordActiveListFactory</item>
        <item>standardActiveListFactory</item>
        <item>standardActiveListFactory</item>
        <item>standardActiveListFactory</item>
    </propertylist>
</component>

<component name="standardActiveListFactory" type="edu.cmu.sphinx.decoder.search.PartitionActiveListFactory">
    <property name="logMath" value="logMath"/>
    <property name="absoluteBeamWidth" value="${absoluteBeamWidth}"/>
    <property name="relativeBeamWidth" value="${relativeBeamWidth}"/>
</component>

<component name="wordActiveListFactory" type="edu.cmu.sphinx.decoder.search.PartitionActiveListFactory">
    <property name="logMath" value="logMath"/>
    <property name="absoluteBeamWidth" value="${absoluteWordBeamWidth}"/>
    <property name="relativeBeamWidth" value="${relativeWordBeamWidth}"/>
</component>

The SimpleActiveListManager is of class edu.cmu.sphinx.decoder.search.SimpleActiveListManager.
Since the word-pruning search manager performs pruning on different
search state types separately, we need a different active list for each
state type. Therefore, you see different active list factories being listed
in the SimpleActiveListManager, one for each type. So how do we know which
active list factory is for which state type? It depends on the 'search order'
as returned by the search graph (which in this case is generated by the
LexTreeLinguist). The search state order and active list factory used here are:
State Type | ActiveListFactory |
LexTreeNonEmittingHMMState | standardActiveListFactory |
LexTreeWordState | wordActiveListFactory |
LexTreeEndWordState | wordActiveListFactory |
LexTreeEndUnitState | standardActiveListFactory |
LexTreeUnitState | standardActiveListFactory |
LexTreeHMMState | standardActiveListFactory |
The linguist of the wordPruningSearchManager is set to the component 'lexTreeLinguist', which is defined as follows:

<component name="lexTreeLinguist" type="edu.cmu.sphinx.linguist.lextree.LexTreeLinguist">
    <property name="logMath" value="logMath"/>
    <property name="acousticModel" value="wsj"/>
    <property name="languageModel" value="trigramModel"/>
    <property name="dictionary" value="dictionary"/>
    <property name="addFillerWords" value="false"/>
    <property name="fillerInsertionProbability" value="1E-10"/>
    <property name="generateUnitStates" value="false"/>
    <property name="wantUnigramSmear" value="true"/>
    <property name="unigramSmearWeight" value="1"/>
    <property name="wordInsertionProbability" value="${wordInsertionProbability}"/>
    <property name="silenceInsertionProbability" value="${silenceInsertionProbability}"/>
    <property name="languageWeight" value="${languageWeight}"/>
</component>
For details about the LexTreeLinguist, please refer to the Javadocs of the LexTreeLinguist. In general, the LexTreeLinguist is the one to use for large vocabulary speech recognition, and the FlatLinguist is the one to use for small vocabulary speech recognition. The LexTreeLinguist has a lot of properties that can be set, but the ones that must be set are 'logMath', 'acousticModel', 'languageModel', and 'dictionary'. These properties are the necessary sources of information for the LexTreeLinguist to build the search graph. The rest of the properties control the speed and accuracy performance of the linguist, and you can read more about them in the Javadocs of the LexTreeLinguist.
The 'acousticModel' is where the LexTreeLinguist obtains the HMM for the words or units. For the HelloNGram demo, it is defined as:
<component name="wsj" type="edu.cmu.sphinx.model.acoustic.WSJ_8gau_13dCep_16k_40mel_130Hz_6800Hz">
    <property name="loader" value="wsjLoader"/>
</component>

<component name="wsjLoader" type="edu.cmu.sphinx.model.acoustic.WSJLoader">
    <property name="logMath" value="logMath"/>
</component>
'wsj' stands for the Wall Street Journal acoustic models. In Sphinx-4, different acoustic models are represented by classes that implement the AcousticModel interface. The implementation class, together with the actual model data files, is packaged in a JAR file, which is included in the classpath of the Sphinx-4 application. The JAR file for the WSJ models is called WSJ_8gau_13dCep_16k_40mel_130Hz_6800Hz.jar, and is in the sphinx4/lib directory. Inside this JAR file is a class called edu.cmu.sphinx.model.acoustic.WSJ_8gau_13dCep_16k_40mel_130Hz_6800Hz, which implements the AcousticModel interface. This class will automatically load all the actual model data files when it is loaded. The loader of this AcousticModel is called WSJLoader, and it is also inside the JAR file. As a programmer, all you need to do is specify the class of the AcousticModel and the loader of the AcousticModel, as shown above (note that if you are using the WSJ model in other applications, these lines should be the same, except that you might have called your 'logMath' component something else).
The 'languageModel' property of the lexTreeLinguist is set to the component 'trigramModel', which is defined as follows:

<component name="trigramModel" type="edu.cmu.sphinx.linguist.language.ngram.SimpleNGramModel">
    <property name="location"
        value="resource:/demo.sphinx.hellongram.HelloNGram!/demo/sphinx/hellongram/hellongram.trigram.lm"/>
    <property name="logMath" value="logMath"/>
    <property name="dictionary" value="dictionary"/>
    <property name="maxDepth" value="3"/>
    <property name="unigramWeight" value=".7"/>
</component>
The language model is generated by the CMU Statistical Language Modeling Toolkit. It is in text format, which can be loaded by the SimpleNGramModel class. For this class, you also need to specify the dictionary that you are using, which is the same as the one used by the lexTreeLinguist. The same goes for 'logMath' (note that the same logMath component should be used throughout the system). The 'maxDepth' property is 3, since this is a trigram language model. The 'unigramWeight' should normally be set to 0.7.
The last important component of the LexTreeLinguist is the 'dictionary', which is defined as follows:
<component name="dictionary" type="edu.cmu.sphinx.linguist.dictionary.FastDictionary">
    <property name="dictionaryPath"
        value="resource:/edu.cmu.sphinx.model.acoustic.WSJ_8gau_13dCep_16k_40mel_130Hz_6800Hz!/edu/cmu/sphinx/model/acoustic/dict/cmudict.0.6d"/>
    <property name="fillerPath"
        value="resource:/edu.cmu.sphinx.model.acoustic.WSJ_8gau_13dCep_16k_40mel_130Hz_6800Hz!/edu/cmu/sphinx/model/acoustic/dict/fillerdict"/>
    <property name="addSilEndingPronunciation" value="false"/>
    <property name="wordReplacement" value="<sil>"/>
</component>
As you might realize, it is using the dictionary inside the JAR file of the Wall Street Journal acoustic model. The main dictionary for words is the edu/cmu/sphinx/model/acoustic/dict/cmudict.0.6d file inside the JAR file, and the dictionary for filler words like "BREATH" and "LIP_SMACK" is edu/cmu/sphinx/model/acoustic/dict/fillerdict. You can inspect the contents of a JAR file (assuming your JAR file is called myJar.jar) with:

jar tvf myJar.jar

You can see the contents of the WSJ JAR file by:
sphinx4> jar tvf lib/WSJ_8gau_13dCep_16k_40mel_130Hz_6800Hz.jar
      0 Fri May 07 10:50:26 EDT 2004 META-INF/
    219 Fri May 07 10:50:24 EDT 2004 META-INF/MANIFEST.MF
      0 Fri May 07 10:50:20 EDT 2004 edu/
      0 Fri May 07 10:50:20 EDT 2004 edu/cmu/
      0 Fri May 07 10:50:20 EDT 2004 edu/cmu/sphinx/
      0 Fri May 07 10:50:20 EDT 2004 edu/cmu/sphinx/model/
      0 Fri May 07 10:50:24 EDT 2004 edu/cmu/sphinx/model/acoustic/
    262 Fri May 07 10:50:16 EDT 2004 edu/cmu/sphinx/model/acoustic/WSJLoader.class
    688 Fri May 07 10:50:16 EDT 2004 edu/cmu/sphinx/model/acoustic/WSJ_8gau_13dCep_16k_40mel_130Hz_6800Hz.class
      0 Fri May 07 10:50:24 EDT 2004 edu/cmu/sphinx/model/acoustic/cd_continuous_8gau/
      0 Fri May 07 10:50:24 EDT 2004 edu/cmu/sphinx/model/acoustic/dict/
      0 Fri May 07 10:50:22 EDT 2004 edu/cmu/sphinx/model/acoustic/etc/
   1492 Fri May 07 10:50:22 EDT 2004 edu/cmu/sphinx/model/acoustic/README
5175518 Fri May 07 10:50:22 EDT 2004 edu/cmu/sphinx/model/acoustic/cd_continuous_8gau/means
 132762 Fri May 07 10:50:24 EDT 2004 edu/cmu/sphinx/model/acoustic/cd_continuous_8gau/mixture_weights
   2410 Fri May 07 10:50:24 EDT 2004 edu/cmu/sphinx/model/acoustic/cd_continuous_8gau/transition_matrices
5175518 Fri May 07 10:50:22 EDT 2004 edu/cmu/sphinx/model/acoustic/cd_continuous_8gau/variances
    353 Fri May 07 10:50:24 EDT 2004 edu/cmu/sphinx/model/acoustic/dict/alpha.dict
4718935 Fri May 07 10:50:24 EDT 2004 edu/cmu/sphinx/model/acoustic/dict/cmudict.0.6d
    373 Fri May 07 10:50:22 EDT 2004 edu/cmu/sphinx/model/acoustic/dict/digits.dict
    204 Fri May 07 10:50:22 EDT 2004 edu/cmu/sphinx/model/acoustic/dict/fillerdict
5654967 Fri May 07 10:50:24 EDT 2004 edu/cmu/sphinx/model/acoustic/etc/WSJ_clean_13dCep_16k_40mel_130Hz_6800Hz.4000.mdef
   2641 Fri May 07 10:50:22 EDT 2004 edu/cmu/sphinx/model/acoustic/etc/WSJ_clean_13dCep_16k_40mel_130Hz_6800Hz.ci.mdef
    375 Fri May 07 10:50:22 EDT 2004 edu/cmu/sphinx/model/acoustic/etc/variables.def
   1797 Fri May 07 10:50:22 EDT 2004 edu/cmu/sphinx/model/acoustic/license.terms
    719 Fri May 07 10:50:24 EDT 2004 edu/cmu/sphinx/model/acoustic/model.props

The locations of the dictionary files within the JAR file are specified using the Sphinx-4 resource mechanism. In short, this mechanism locates the JAR file in which a class resides, and looks into that JAR file for the desired resource. The general syntax is:
resource:/{name of class to locate the JAR file}!{location in the JAR file of the desired resource}

Take the 'dictionaryPath' property, for example. The "name of class to locate the JAR file" is "edu.cmu.sphinx.model.acoustic.WSJ_8gau_13dCep_16k_40mel_130Hz_6800Hz", while the "location in the JAR file of the desired resource" is "/edu/cmu/sphinx/model/acoustic/dict/cmudict.0.6d". This gives the string "resource:/edu.cmu.sphinx.model.acoustic.WSJ_8gau_13dCep_16k_40mel_130Hz_6800Hz!/edu/cmu/sphinx/model/acoustic/dict/cmudict.0.6d".
For details about the other properties, please refer to the javadoc for FastDictionary.
The rest of the configuration file, which includes the front end configuration and the configuration of the monitors, is the same as in the HelloDigits demo. Therefore, please refer to those sections for explanations. This concludes the walkthrough of the simple HelloNGram example.
As you can see from the above examples, the Recognizer returns a Result object which provides the recognition results. The Result object essentially contains all the paths during the recognition search that have reached the final state (the end of sentence, usually denoted by "</s>"). They are ranked by the ending score of the path, and the one with the highest score is the best hypothesis. Moreover, the Result also contains all the active paths (those that have not reached the final state) at the end of the recognition. Usually, one would call the Result.getBestResultNoFiller method to obtain a string of the best result that has no filler words like "++SMACK++". This method first attempts to return the best path that has reached the final state. If no paths have reached the final state, it returns the best path out of the paths that have not reached the final state.
If you only want to return those paths that have reached the final state, you should call the method Result.getBestFinalResultNoFiller. For example, the HelloWorld demo uses this method to avoid treating any partial sentence in the grammar as the result.
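As a quick illustration (a sketch, not taken from the demos), the two calls differ only in whether non-final paths may be returned:

// Sketch: inspecting a Result with the two methods described above.
// Assumes 'recognizer' has been allocated as in the HelloDigits demo.
Result result = recognizer.recognize();
if (result != null) {
    // Best hypothesis with fillers removed; may fall back to a path that
    // has not reached the final state.
    String best = result.getBestResultNoFiller();

    // Best hypothesis among paths that did reach the final state only.
    String bestFinal = result.getBestFinalResultNoFiller();

    System.out.println("best: " + best + ", best final: " + bestFinal);
}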
There are other methods in the Result object that can give you more information, e.g., the N-best results. You will also notice that there are a number of methods that return Tokens. Tokens are objects along a search path that record where we are in the search and the various scores at that particular location. For example, the Token object has a getWord method that tells you which word the search is in. For details about the Token object, please refer to the javadoc for Token. For details about the Result object, please refer to the javadoc for Result.