Sphinx-4
Sphinx-4 is a state-of-the-art speech recognition system written entirely in the Java™ programming language. It was created via a joint collaboration between the Sphinx group at Carnegie Mellon University, Sun Microsystems Laboratories, Mitsubishi Electric Research Labs (MERL), and Hewlett Packard (HP), with contributions from the University of California at Santa Cruz (UCSC) and the Massachusetts Institute of Technology (MIT).
Sphinx-4 started out as a port of Sphinx-3 to the Java programming language, but evolved into a recognizer designed to be much more flexible than Sphinx-3, thus becoming an excellent platform for speech research.
Live mode and batch mode speech recognizers, capable of recognizing discrete and continuous speech.
Generalized pluggable front end architecture. Includes pluggable implementations of preemphasis, Hamming window, FFT, Mel frequency filter bank, discrete cosine transform, cepstral mean normalization, and feature extraction of cepstra, delta cepstra, double delta cepstra features.
Generalized pluggable language model architecture. Includes pluggable language model support for ASCII and binary versions of unigram, bigram, trigram, Java Speech API Grammar Format (JSGF), and ARPA-format FST grammars.
Generalized acoustic model architecture. Includes pluggable support for Sphinx-3 acoustic models.
Generalized search management. Includes pluggable support for breadth first and word pruning searches.
Utilities for post-processing recognition results, including obtaining confidence scores, generating lattices and embedding ECMAScript into JSGF tags.
Standalone tools. Includes tools for displaying waveforms and spectrograms and generating features from audio.
(NOTE: The links in this section point to local files created by javadoc. If they are broken, please follow the instructions on Creating Javadocs to create these links.)
Sphinx-4 is a very flexible system capable of performing many different types of recognition tasks. As such, it is difficult to characterize the performance and accuracy of Sphinx-4 with just a few simple numbers such as speed and accuracy. Instead, we regularly run regression tests on Sphinx-4 to determine how it performs under a variety of tasks. These tasks and their latest results are as follows (each task is progressively more difficult than the previous task):
The following table compares the performance of Sphinx 3.3 with Sphinx-4.
Test | S3.3 WER | S4 WER | S3.3 RT | S4 RT (1) | S4 RT (2) | Vocabulary Size | Language Model |
---|---|---|---|---|---|---|---|
TI46 | 1.217 | 0.168 | 0.14 | 0.03 | 0.02 | 11 | isolated digits recognition |
TIDIGITS | 0.661 | 0.549 | 0.16 | 0.07 | 0.05 | 11 | continuous digits |
AN4 | 1.300 | 1.192 | 0.38 | 0.25 | 0.20 | 79 | trigram |
RM1 | 2.746 | 2.88 | 0.50 | 0.50 | 0.41 | 1,000 | trigram |
WSJ5K | 7.323 | 6.97 | 1.36 | 1.22 | 0.96 | 5,000 | trigram |
HUB4 | 18.845 | 18.756 | 3.06 | ~4.4 | 3.95 | 60,000 | trigram |
This data was collected on a dual-CPU UltraSPARC®-III running at 1015 MHz with 2 GB of memory.
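The WER columns in the table report word error rate. As a rough illustration of how such a number can be derived, the sketch below computes the minimum number of word substitutions, insertions and deletions between a reference transcript and a hypothesis, then divides by the number of reference words. The class and method names (WerSketch, wordErrors, wordErrorRate) are hypothetical and are not part of Sphinx-4; the aligner that Sphinx-4 actually uses may count errors differently.

```java
import java.util.Arrays;
import java.util.List;

/** Sketch only: word error rate via minimum edit distance over words. */
final class WerSketch {

    /** Minimum number of substitutions, insertions and deletions. */
    static int wordErrors(List<String> reference, List<String> hypothesis) {
        int n = reference.size();
        int m = hypothesis.size();
        int[][] d = new int[n + 1][m + 1];
        for (int i = 0; i <= n; i++) {
            d[i][0] = i; // delete every reference word
        }
        for (int j = 0; j <= m; j++) {
            d[0][j] = j; // insert every hypothesis word
        }
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                boolean same = reference.get(i - 1).equals(hypothesis.get(j - 1));
                int substitute = d[i - 1][j - 1] + (same ? 0 : 1);
                int delete = d[i - 1][j] + 1;
                int insert = d[i][j - 1] + 1;
                d[i][j] = Math.min(substitute, Math.min(delete, insert));
            }
        }
        return d[n][m];
    }

    /** WER in percent: errors divided by the number of reference words. */
    static double wordErrorRate(List<String> reference, List<String> hypothesis) {
        if (reference.isEmpty()) {
            throw new IllegalArgumentException("reference must not be empty");
        }
        return 100.0 * wordErrors(reference, hypothesis) / reference.size();
    }

    public static void main(String[] args) {
        List<String> ref = Arrays.asList("two four zero nine eight two zero".split(" "));
        List<String> hyp = Arrays.asList("two four oh nine eight zero".split(" "));
        System.out.println(wordErrors(ref, hyp) + " errors, WER = "
                + wordErrorRate(ref, hyp) + "%");
    }
}
```

Because insertions count as errors, a WER computed this way can exceed 100% when the hypothesis is much longer than the reference.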
Sphinx-4 has been built and tested on the Solaris™ Operating Environment, Mac OS X, Linux and Win32 operating systems. Running, building, and testing Sphinx-4 requires additional software. Before you start, you will need the following software available on your machine.
Sphinx-4 has two packages available for download:
See this FAQ question to help determine whether you should get the binary or the source distribution.
After you have downloaded the distribution, unjar the ZIP files using the jar command, which is in the bin directory of your Java installation:
jar xvf sphinx4-{version}-bin.zip
jar xvf sphinx4-{version}-src.zip
For both downloads, a directory called "sphinx4-{version}" will be created.
There are also the RM1 acoustic model, and HUB4 acoustic and language models, available for download at the same location on SourceForge. Download them only if you want to run the regression tests for RM1 and HUB4.
If you want to be able to get the latest updates, you should retrieve the code from the CVS source tree on SourceForge, where the Sphinx-4 code is hosted as open source. Please follow the instructions below to retrieve it.
% export CVS_RSH=ssh
% cvs -z3 -d:ext:developername@cvs.sourceforge.net:/cvsroot/cmusphinx co sphinx4
where developername is your SourceForge developer name.
% cvs -d:pserver:anonymous@cvs.sourceforge.net:/cvsroot/cmusphinx login
% cvs -z3 -d:pserver:anonymous@cvs.sourceforge.net:/cvsroot/cmusphinx co sphinx4
Since the sphinx4-{version}-bin.zip distribution does not contain the source code, you must download sphinx4-{version}-src.zip, or retrieve the code from SourceForge using CVS, in order to build from the sources. The software required for building Sphinx-4 is listed in the Required Software section.
Setup JSAPI 1.0
Before you build Sphinx-4, it is important to set up your environment to support the Java Speech API (JSAPI), because a number of tests and demos rely on having JSAPI installed.
To build Sphinx-4, at the command prompt change to the directory where you installed Sphinx-4 (usually, a simple "cd sphinx4" will do). Set your JAVA_HOME, ANT_HOME and PATH environment variables as described above. Then type the following:
ant
This executes the Apache Ant command to build the Sphinx-4 classes under the bld directory, the jar files under the lib directory, and the demo jar files under the bin directory.
To delete all the output from the build to give you a fresh start:
ant clean
The javadocs have already been built if you downloaded the sphinx4-{version}-bin.zip. In order to build the javadocs yourself, you must download the sphinx4-{version}-src.zip distribution instead. To build the javadocs, go to the top level directory ("sphinx4-{version}"), and type:
ant javadoc
This will build javadocs from public classes, displaying only the public methods and fields. In general, this is all the information you will need. If you need more details, such as private or protected classes, you can generate the corresponding javadoc by doing, for example:
ant -Daccess=private javadoc
setenv JAVA_HOME /lab/speech/java/j2sdk1.4.0
export JAVA_HOME='/lab/speech/java/jdk1.4.1_01'
export JAVA_HOME='c:/Progra~1/J2SDK_Forte/jdk1.4.0'
setenv CVS_RSH ssh
export CVS_RSH='/usr/local/bin/ssh'
export CVS_RSH='ssh'
d:\work\sphinx4>ant clean
/c: Can't open /c: No such file or directory
Sphinx-4 contains a number of demo programs. If you downloaded the binary distribution (sphinx4-{version}-bin.zip), the JAR files of the demos are already built, so you can just run them directly. However, if you downloaded the source distribution (sphinx4-{version}-src.zip or via CVS), you need to build the demos. Click on the links below for instructions on how to build and run the demos.
There is also a live-mode test program (this link only works if you downloaded the source distribution). It is available in the sphinx4-{version}-src.zip file but not in the sphinx4-{version}-bin.zip file.
The AudioTool is a visual tool that records and displays the waveform and spectrogram of an audio signal. It is available in both the binary and source releases.
The document Sphinx-4 Configuration Management describes, in detail, how to configure a Sphinx-4 system.
The document Sphinx-4 Instrumentation describes, in detail, how to use the instrumentation facilities of the Sphinx-4 system.
Sphinx-4 contains a number of regression tests using common speech databases. Again, you have to download the source distribution or retrieve the source tree using CVS in order to get the regression tests directory. The regression tests we have are:
Before you run any of the tests, make sure that you have built Sphinx-4 already. To do so, go to the top level and type:
ant
You also need to make sure you have the appropriate acoustic model(s) installed. More details below.
The Sphinx-4 regression tests have different directories for the different tasks. The directory sphinx4/tests/performance contains directories named ti46, tidigits, an4, rm1, hub4, and some other tests. Each of these directories contains a build.xml with targets specific to the particular task. The build.xml allows you to run a number of different tests. Type:
ant -projecthelp
to list a help text with the possible targets.
The TIDIGITS models are already included as part of the distribution. Therefore, you do not need to download them separately. You must have the TI46 test data, available from the LDC TI46 website.
You need to edit the batch file called ti46.batch, located in the tests/performance/ti46 directory. You will need to change it so that it matches where you stored the TI46 test files. Refer to the section Batch Files for details about the format of batch files.
To run the tests:
% cd sphinx4/tests/performance/ti46
% ant -projecthelp # to see a list of possible targets
% ant ti46_wordlist
The TIDIGITS models are already included as part of the distribution. Therefore, you do not need to download them separately.
You must have the TIDIGITS test data, available from the LDC TIDIGITS website.
You need to edit the batch file called tidigits.batch, located in the tests/performance/tidigits directory. You will need to change it so that it matches where you stored the TIDIGITS test files. Refer to the section Batch Files for details about the format of batch files.
To run the tests:
% cd sphinx4/tests/performance/tidigits
% ant -projecthelp # to see a list of possible targets
% ant tidigits_flat_unigram
The Wall Street Journal (WSJ) models are already included as part of the distribution. Therefore, you do not need to download them separately.
Download the big endian raw audio format of the AN4 Database. Unpack it at a directory of your choice:
% gunzip an4_raw.bigendian.tar.gz
% tar -xvf an4_raw.bigendian.tar
Then update the following batch files (located in the tests/performance/an4 directory), so that they match up with where you unpacked the AN4 data. You probably just need to replace all instances of the string "/lab/speech/sphinx4/data" inside these batch files. Please refer to the Batch Files section for details about batch files:
an4_full.batch
an4_spelling.batch
an4_words.batch
After you have updated the batch files, you can run the tests by:
% cd sphinx4/tests/performance/an4
% ant -projecthelp # to see a list of possible targets
% ant an4_words_unigram
Make sure that you have downloaded the binary RM1 model file, called RM1_13dCep_16k_40mel_130Hz_6800Hz.jar, located in the sphinx4 package on the downloads page. Then, in the build file for the RM1 tests, sphinx4/tests/performance/rm1/build.xml, change the classpath property to point to the location of your RM1_13dCep_16k_40mel_130Hz_6800Hz.jar.
You must have the RM1 test data, available from the LDC RM1 website.
You also need to prepare a batch file called rm1.batch, by following the instructions in the Batch Files section. There is already one in the RM1 test directory, but it will not work for you, since the paths to test files will not match your setup.
To run the tests:
% cd sphinx4/tests/performance/rm1
% ant -projecthelp # to see a list of possible targets
% ant rm1_bigram
You must have the HUB4 test data, available from the LDC HUB4 website.
You must download the binary HUB4 model file, called HUB4_8gau_13dCep_16k_40mel_133Hz_6855Hz.jar, and the binary HUB4 trigram language model, called HUB4_trigram_lm.zip, both located in the sphinx4 package on the downloads page. For the trigram language model file, unpack it by:
jar xvf HUB4_trigram_lm.zip
The trigram model file is called language_model.arpaformat.DMP.
Then, in the build file for the HUB4 tests, sphinx4/tests/performance/hub4/build.xml, change the classpath property to point to the location of your HUB4_8gau_13dCep_16k_40mel_133Hz_6855Hz.jar.
In the configuration file, tests/performance/hub4/hub4.config.xml, change the 'location' of the 'trigramModel' component to where your language_model.arpaformat.DMP file is located.
You also need to prepare a batch file, which is currently called f0_hub4.batch in the build.xml file, by following the instructions in the Batch Files section.
To run the test:
% cd sphinx4/tests/performance/hub4
% ant -projecthelp # to see a list of possible targets
% ant hub4_trigram
Each batch mode regression test consists of the following components:
To learn how to set up a regression test, take a look at the walkthrough of setting up the AN4 tests.
Batch files are used in batch mode regression tests. A batch file is a text file that contains the list of files to be processed, with the transcription for each file. The format is as shown below: one line for each file, where the first element in a line is the file name, which can be an absolute or relative path and includes the file extension, followed by the words that make up the transcription for the audio. Sphinx-4 uses the transcription provided here to compute the system's accuracy after each sentence is processed. Processing an utterance produces a hypothesis for what was said. This hypothesis is compared with the transcription, i.e., the hypothesis is aligned against the reference transcript, and a summary of the results is reported.
/lab/speech/sphinx4/data/tidigits/test/raw16k/man/man.ah.24z982za.raw two four zero nine eight two zero
/lab/speech/sphinx4/data/tidigits/test/raw16k/man/man.ah.25896o4a.raw two five eight nine six oh four
An example batch file is tidigits.batch (this link only works if you downloaded the source distribution).
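To make the batch file format concrete, here is a minimal sketch that splits each line into the audio file name and the words of its transcription. BatchFileSketch and its Entry type are hypothetical names used only for illustration; this is not the parser that Sphinx-4 itself uses, and it assumes that file paths contain no whitespace.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Sketch only: parses batch lines of the form "<file> <word> <word> ...". */
final class BatchFileSketch {

    /** One batch entry: an audio file path plus its reference transcription. */
    static final class Entry {
        final String audioPath;
        final List<String> words;

        Entry(String audioPath, List<String> words) {
            this.audioPath = audioPath;
            this.words = words;
        }
    }

    /** Splits one line into the file name (first token) and the transcription. */
    static Entry parseLine(String line) {
        String[] tokens = line.trim().split("\\s+");
        if (tokens[0].isEmpty()) {
            throw new IllegalArgumentException("blank batch line");
        }
        List<String> words =
                new ArrayList<>(Arrays.asList(tokens).subList(1, tokens.length));
        return new Entry(tokens[0], words);
    }

    /** Parses every non-blank line of a batch file's contents. */
    static List<Entry> parseAll(List<String> lines) {
        List<Entry> entries = new ArrayList<>();
        for (String line : lines) {
            if (!line.trim().isEmpty()) {
                entries.add(parseLine(line));
            }
        }
        return entries;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "/lab/speech/sphinx4/data/tidigits/test/raw16k/man/man.ah.24z982za.raw"
                        + " two four zero nine eight two zero",
                "/lab/speech/sphinx4/data/tidigits/test/raw16k/man/man.ah.25896o4a.raw"
                        + " two five eight nine six oh four");
        for (Entry entry : parseAll(lines)) {
            System.out.println(entry.audioPath + " -> " + entry.words);
        }
    }
}
```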
The audio files used by Sphinx-4 can contain raw audio or cepstra, which is a form of encoded speech. The Java platform has support for other data formats, such as MS WAV or Sun's au, but, provided as is, Sphinx-4 can handle only raw data.
The audio defaults to 2 bytes/sample, at 16000 samples per second. The files are expected to be binaries without header. The Java platform assumes big endian order, always. These defaults can be changed. For example, the byte order or the sampling rate can be changed.
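As an illustration of the default raw format (headerless, 2 bytes per sample, big endian, 16000 samples per second), the following sketch converts a byte array into 16-bit samples using the standard java.nio classes. RawAudioSketch is a hypothetical helper written for this example and is not part of the Sphinx-4 front end.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Sketch only: decodes headerless 16-bit PCM bytes into samples. */
final class RawAudioSketch {

    /** Interprets the bytes as consecutive 2-byte samples in the given byte order. */
    static short[] toSamples(byte[] raw, ByteOrder order) {
        if (raw.length % 2 != 0) {
            throw new IllegalArgumentException("expected 2 bytes per sample");
        }
        short[] samples = new short[raw.length / 2];
        ByteBuffer.wrap(raw).order(order).asShortBuffer().get(samples);
        return samples;
    }

    /** Duration of the audio for a given sample count and sampling rate. */
    static double durationSeconds(int sampleCount, int samplesPerSecond) {
        return (double) sampleCount / samplesPerSecond;
    }

    public static void main(String[] args) {
        byte[] raw = {0x01, 0x00, (byte) 0xFF, (byte) 0xFE};
        short[] bigEndian = toSamples(raw, ByteOrder.BIG_ENDIAN);
        short[] littleEndian = toSamples(raw, ByteOrder.LITTLE_ENDIAN);
        System.out.println("big endian:    " + bigEndian[0] + ", " + bigEndian[1]);
        System.out.println("little endian: " + littleEndian[0] + ", " + littleEndian[1]);
        System.out.println("1 s of audio at 16 kHz = "
                + 16000 * 2 + " bytes; 16000 samples last "
                + durationSeconds(16000, 16000) + " s");
    }
}
```

The same four bytes decode to different sample values depending on the byte order, which is why the byte order setting must match how the data was recorded.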
The input can also be cepstra. A cepstral file starts with a 4-byte integer containing the number of floats that follow. The floats that follow are concatenated 13-dimensional vectors. Since the first piece of information is the number of floats, the expected total file size can be computed. If a comparison with the actual size fails, either the byte order has to be reversed or the file is corrupted. This means the byte order can be detected automatically.
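The size check described above can be sketched as follows. The code reads the 4-byte header in big endian order, compares the implied size (4 bytes plus 4 bytes per float) with the actual length, and retries with the bytes reversed. CepstraSketch and its methods are hypothetical names for illustration, and the sketch assumes 4-byte floats; it is not the loader that Sphinx-4 uses.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Sketch only: byte-order detection for a cepstra file held in memory. */
final class CepstraSketch {

    static final int DIMENSION = 13; // floats per cepstral vector

    /** Returns the byte order whose header value matches the file length. */
    static ByteOrder detectOrder(byte[] file) {
        if (file.length < 4) {
            throw new IllegalArgumentException("file too short for a header");
        }
        int big = ByteBuffer.wrap(file).order(ByteOrder.BIG_ENDIAN).getInt();
        if (sizeMatches(big, file.length)) {
            return ByteOrder.BIG_ENDIAN;
        }
        if (sizeMatches(Integer.reverseBytes(big), file.length)) {
            return ByteOrder.LITTLE_ENDIAN;
        }
        throw new IllegalArgumentException("header does not match size; corrupted file?");
    }

    private static boolean sizeMatches(int numFloats, long length) {
        return numFloats >= 0 && 4L + 4L * numFloats == length;
    }

    /** Reads all 13-dimensional vectors after detecting the byte order. */
    static float[][] readFrames(byte[] file) {
        ByteBuffer buf = ByteBuffer.wrap(file).order(detectOrder(file));
        int numFloats = buf.getInt();
        if (numFloats % DIMENSION != 0) {
            throw new IllegalArgumentException("float count is not a multiple of 13");
        }
        float[][] frames = new float[numFloats / DIMENSION][DIMENSION];
        for (float[] frame : frames) {
            for (int i = 0; i < DIMENSION; i++) {
                frame[i] = buf.getFloat();
            }
        }
        return frames;
    }

    /** Builds file contents in the given byte order (used for the demo below). */
    static byte[] encode(float[] values, ByteOrder order) {
        ByteBuffer buf = ByteBuffer.allocate(4 + 4 * values.length).order(order);
        buf.putInt(values.length);
        for (float v : values) {
            buf.putFloat(v);
        }
        return buf.array();
    }

    public static void main(String[] args) {
        float[] values = new float[2 * DIMENSION];
        for (int i = 0; i < values.length; i++) {
            values[i] = i;
        }
        byte[] file = encode(values, ByteOrder.LITTLE_ENDIAN);
        System.out.println("detected order: " + detectOrder(file));
        System.out.println("frames read:    " + readFrames(file).length);
    }
}
```

Note that a header value whose bytes read the same in both orders (for example 0) cannot distinguish the two byte orders; this sketch then simply reports big endian.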
Walkthrough of Setting up the AN4 Tests
To illustrate the process of setting up a regression test, let's use AN4, an existing test, as an example. Use the following steps to create the AN4 tests.
tests/performance. For example, the AN4 tests reside in tests/performance/an4.
/lab/speech/sphinx4/data/an4. Since the AN4 test data already comes in raw audio format, no conversion is necessary. However, other test databases might require conversion to raw audio. For example, the TIDIGITS test files are in SPHERE format, so it is necessary to convert them to raw audio format before they can be read by the Sphinx-4 front end. This is usually accomplished by using the program sox on UNIX platforms.
tests/performance/an4/an4_full.batch file looks like:
/lab/speech/sphinx4/data/an4/an4_clstk/fash/an251-fash-b.raw yes
/lab/speech/sphinx4/data/an4/an4_clstk/fash/an253-fash-b.raw go
/lab/speech/sphinx4/data/an4/an4_clstk/fash/an254-fash-b.raw yes
/lab/speech/sphinx4/data/an4/an4_clstk/fash/an255-fash-b.raw u m n y h six
...
All batch files should reside in the test directory, in this case tests/performance/an4.
ant at the top level directory will create the JAR file for the WSJ model. The JAR file should be included in the classpath of the application you are deploying. In this case, the WSJ JAR file (lib/WSJ_8gau_13dCep_16k_40mel_130Hz_6800Hz.jar) is included in the java command line inside the build.xml run file. We also need to specify in the config file (see the next item below) the acoustic model class we are using, which in this case is edu.cmu.sphinx.model.acoustic.WSJ_8gau_13dCep_16k_40mel_130Hz_6800Hz. The dictionary is also specified in the config file using the resource mechanism of Sphinx-4.
tests/performance/an4/an4.config.xml, please take a look at it. This file describes how the batch-mode recognizer and its various sub-components should be configured. Note that this file also contains configurations for the live-mode recognizer, which is not the subject of this walkthrough. In the following we will refer to components in the config file using highlights.
In an4.config.xml, the batch-mode recognizer is called batch. It uses the Recognizer called wordRecognizer, which contains the decoder, as well as various monitors that keep track of recognition accuracy, speed, and memory. The decoder contains the searchManager, which in turn contains the linguist, the pruner, the scorer, and the activeList. Refer to the Javadoc (go to bottom of the page) for a description of each of these components.
The linguist used is the flatLinguist, and the grammar of the flatLinguist is either the wordListGrammar, which is a file with a list of words, e.g.,
AND APOSTROPHE APRIL AREA AUGUST CODE
the lmGrammar (i.e., N-gram language model), or fstGrammar (i.e., finite state transducer grammar).
The lmGrammar uses a language model file (text-based for AN4) generated by the CMU Statistical Language Modeling (SLM) Toolkit. The flatLinguist also specifies the acoustic model used, and in this case it is the WSJ models. The location and format of the WSJ model, as well as the location of the various files in the model, are also specified. The scorer contains the front end, which is called mfcFrontEnd since it produces MFCC features.
build.xml is necessary to run Ant. This file is the Ant version of the Makefile in Make. All Ant targets are listed in this file. For details on how to write this file, refer to the documentation at http://ant.apache.org/.
Let's use the first Ant target, an4_words_wordlist, as an example. This Ant target invokes the java command on the class edu.cmu.sphinx.tools.batch.BatchModeRecognizer. This class takes a configuration file (an4.config.xml) and a batch file (an4_words.batch) as arguments. This class looks for the component named batch in the configuration file. The configuration manager will create this component (and its subcomponents). Therefore, the component edu.cmu.sphinx.tools.batch.BatchModeRecognizer should always be named "batch" in the config.xml file.
Other AN4 Ant targets are created similarly.
The two main acoustic models that are used by Sphinx-4, TIDIGITS and Wall Street Journal, are already included in the "lib" directory of the binary distribution. For the source distribution, you will build them when you type ant at the top level directory. Our regression tests also use the RM1 and HUB4 models, which are available for download separately on the download page.
Sphinx-4 can handle model packages provided as a jar file. Each acoustic model implements the AcousticModel interface. For example, the WSJ models are wrapped by a class called edu.cmu.sphinx.model.acoustic.WSJ_8gau_13dCep_16k_40mel_130Hz_6800Hz, which implements the AcousticModel interface. This implementation class is in the JAR file of the models, together with the actual data files of the model. This way, two simple steps are needed to use a particular acoustic model:
You can find out the model implementation class of a JAR file using the java -jar command. For example, you can find out the model class of the WSJ model by:
sphinx4>java -jar lib/WSJ_8gau_13dCep_16k_40mel_130Hz_6800Hz.jar
Wall Street Journal acoustic models
Class: edu.cmu.sphinx.model.acoustic.WSJ_8gau_13dCep_16k_40mel_130Hz_6800Hz
Is Binary: true
Sparse Form: false
Filters: 40
Vector Length: 39
Gaussians: 8
Model Definition: etc/WSJ_clean_13dCep_16k_40mel_130Hz_6800Hz.4000.mdef
Data Location: cd_continuous_8gau
Feature Type: cepstra_delta_doubledelta
Sample Rate: 16000
Description: Wall Street Journal acoustic models
Number Fft Points: 512
Max Freq: 6800
Min Freq.: 130
The printout also includes details about how the model was trained, but this is not important for the average user.
The language model used by Sphinx-4 follows the ARPA format. Language models provided with the acoustic model packages were created with the Carnegie Mellon University Statistical Language Modeling toolkit (CMU SLM toolkit), available at CMU. A manual is available there.
The language model is created from a list of transcriptions. Given a file with training transcriptions, the following script creates a list of words that appear in the transcriptions, then creates bigram and trigram LM files in the ARPA format. The file with extension ccs contains the context cues, which are usually a list of words used as markers (beginning or end of speech, etc.).
set task = RM
# Location of the CMU SLM toolkit
set bindir = ~/src/CMU-SLM_Toolkit_v2/bin
cat $task.transcript | $bindir/text2wfreq | $bindir/wfreq2vocab > $task.vocab
set mode = "-absolute"
# Create bigram
cat $task.transcript | $bindir/text2idngram -n 2 -vocab $task.vocab | \
$bindir/idngram2lm $mode -context $task.ccs -n 2 -vocab $task.vocab \
-idngram - -arpa $task.bigram.arpa
# Create trigram
cat $task.transcript | $bindir/text2idngram -n 3 -vocab $task.vocab | \
$bindir/idngram2lm $mode -context $task.ccs -n 3 -vocab $task.vocab \
-idngram - -arpa $task.trigram.arpa
Sphinx-4 uses the Java Speech API Grammar Format (JSGF) to perform speech recognition using a BNF-style grammar. Currently, you can only use JSGF grammars with the FlatLinguist. To specify JSGF grammars, set the following in the configuration file:
<component name="flatLinguist" type="edu.cmu.sphinx.linguist.flat.FlatLinguist">
    <property name="grammar" value="jsgfGrammar"/>
    <!-- ... other properties ... -->
</component>
<component name="jsgfGrammar" type="edu.cmu.sphinx.jsapi.JSGFGrammar">
    <property name="grammarLocation" value="...URL of grammar directory"/>
</component>
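In the configuration fragment above, the value of the grammar property is the name of another component. The sketch below uses only the standard Java XML parser to show how such a reference can be followed from one component to the class that implements the other. ConfigSketch is a hypothetical class written for this example and is not the Sphinx-4 configuration manager. The enclosing config root element is an assumption added so that the fragment parses as one document.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Sketch only: resolves component references in a configuration fragment. */
final class ConfigSketch {

    // The <config> root element is an assumption made for this example.
    static final String SAMPLE =
            "<config>"
            + "<component name=\"flatLinguist\""
            + " type=\"edu.cmu.sphinx.linguist.flat.FlatLinguist\">"
            + "<property name=\"grammar\" value=\"jsgfGrammar\"/>"
            + "</component>"
            + "<component name=\"jsgfGrammar\" type=\"edu.cmu.sphinx.jsapi.JSGFGrammar\">"
            + "<property name=\"grammarLocation\" value=\"...URL of grammar directory\"/>"
            + "</component>"
            + "</config>";

    private static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse configuration", e);
        }
    }

    /** Maps each component name to the class named in its type attribute. */
    static Map<String, String> componentTypes(String xml) {
        NodeList components = parse(xml).getElementsByTagName("component");
        Map<String, String> types = new LinkedHashMap<>();
        for (int i = 0; i < components.getLength(); i++) {
            Element component = (Element) components.item(i);
            types.put(component.getAttribute("name"), component.getAttribute("type"));
        }
        return types;
    }

    /** Follows a property whose value names another component; null if absent. */
    static String referencedType(String xml, String componentName, String propertyName) {
        NodeList components = parse(xml).getElementsByTagName("component");
        for (int i = 0; i < components.getLength(); i++) {
            Element component = (Element) components.item(i);
            if (!component.getAttribute("name").equals(componentName)) {
                continue;
            }
            NodeList properties = component.getElementsByTagName("property");
            for (int j = 0; j < properties.getLength(); j++) {
                Element property = (Element) properties.item(j);
                if (property.getAttribute("name").equals(propertyName)) {
                    return componentTypes(xml).get(property.getAttribute("value"));
                }
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(componentTypes(SAMPLE));
        System.out.println("flatLinguist.grammar is implemented by "
                + referencedType(SAMPLE, "flatLinguist", "grammar"));
    }
}
```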
For information on how to write JSGF grammars, and how to specify the location of your JSGF grammar file(s), and the limitations of the current implementation of JSGF grammar, please refer to the Javadocs for JSGFGrammar.
The Sphinx-4 API can be found in the javadoc documentation.
If the link above is broken, please build the javadocs using the instructions in Creating Javadocs. In fact, rebuilding the javadocs is something you should do every time you change code in Sphinx-4.
In this section, we will provide an overview of Sphinx-4, starting with
an introduction of HMM-based recognizers. We will highlight in
Sphinx-4 is an HMM-based speech recognizer.
During speech recognition, features are derived from the incoming speech (we will use "speech" to mean the same thing as "audio") in the same way as in the training process. The component of the recognizer that generates these features is called the front end.
The process of speech recognition is to find the best possible sequence
of words (or units) that will fit the given input speech. It is a
Constructing the above graph requires knowledge from various sources.
It requires a
Usually, the search graph also has information about how likely certain words are to occur. This information is supplied by the language model.
Once this graph is constructed, the sequence of parametrized speech signals (i.e., the features) is matched against different paths through the graph to find the best fit. The best fit is usually the least cost or highest scoring path, depending on the implementation. In Sphinx-4, the task of searching through the graph for the best path is done by the search manager.
As you can see from the above graph, a lot of the nodes have self transitions. This can lead to a very large number of possible paths through the graph. As a result, finding the best possible path can take a very long time. The purpose of the pruner is to reduce the number of possible paths during the search.
As we described earlier, the input speech signal is transformed into a
sequence of feature vectors. After the last feature vector is decoded,
we look at all the paths that have reached the final exit node
(the red node). The path with the highest score is the best fit, and a
In this section, we describe the main components of Sphinx-4, and how they work together during the recognition process. First of all, let's look at the architecture diagram of Sphinx-4. It contains almost all the concepts (the words in red) that were introduced in the previous section. There are a few additional concepts in the diagram, which we will explain shortly.
When the recognizer starts up, it constructs the front end (which generates features from speech), the decoder, and the linguist (which generates the search graph) according to the configuration specified by the user. These components will in turn construct their own subcomponents. For example, the linguist will construct the acoustic model, the dictionary, and the language model. It will use the knowledge from these three components to construct a search graph that is appropriate for the task. The decoder will construct the search manager, which in turn constructs the scorer, the pruner, and the active list.
Most of these components represent interfaces. The search manager, linguist, acoustic model, dictionary, language model, active list, scorer, pruner, and search graph are all Java interfaces. There can be different implementations of these interfaces. For example, there are two different implementations of the search manager. How, then, does the system know which implementation to use? It is specified by the user via the configuration file, an XML-based file that is loaded by the configuration manager.
The
When the application asks the recognizer to perform recognition, the search manager will ask the scorer to score each token in the active list against the next feature vector obtained from the front end. This gives a new score for each of the active paths. The pruner will then prune the tokens (i.e., active paths) using certain heuristics. Each surviving path will then be expanded to the next states, where a new token will be created for each next state. The process repeats itself until no more feature vectors can be obtained from the front end for scoring. This usually means that there is no more input speech data. At that point, we look at all paths that have reached the final exit state, and return the highest scoring path as the result to the application.
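The score, prune and expand cycle described above can be sketched with a toy example. Everything in it is a simplifying assumption made for illustration: features are single numbers rather than vectors, each state is scored by its squared distance to one mean value, pruning is a simple beam relative to the best score, and only the best token per state is kept. TokenPassingSketch is a hypothetical class, not one of the Sphinx-4 search manager implementations.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Sketch only: a toy token-passing search with beam pruning. */
final class TokenPassingSketch {

    /** A token marks one active path: its current state, score, and history. */
    static final class Token {
        final int state;
        final double score;
        final Token predecessor;

        Token(int state, double score, Token predecessor) {
            this.state = state;
            this.score = score;
            this.predecessor = predecessor;
        }
    }

    /** Toy scorer: higher (closer to zero) when the feature is near the state mean. */
    static double emission(double[] stateMeans, int state, double feature) {
        double diff = feature - stateMeans[state];
        return -diff * diff;
    }

    /** Returns the state sequence of the best path ending in exitState, or an empty list. */
    static List<Integer> decode(int[][] successors, double[] stateMeans, int startState,
                                int exitState, double[] features, double beam) {
        List<Token> active = new ArrayList<>();
        active.add(new Token(startState, 0.0, null));
        for (double feature : features) {
            // 1. Score every active token against the next feature.
            List<Token> scored = new ArrayList<>();
            double best = Double.NEGATIVE_INFINITY;
            for (Token t : active) {
                if (t.state == exitState) {
                    continue; // the exit state consumes no features
                }
                double s = t.score + emission(stateMeans, t.state, feature);
                scored.add(new Token(t.state, s, t.predecessor));
                best = Math.max(best, s);
            }
            // 2. Prune tokens outside the beam, 3. expand survivors to next states.
            Map<Integer, Token> next = new HashMap<>();
            for (Token t : scored) {
                if (t.score < best - beam) {
                    continue;
                }
                for (int successor : successors[t.state]) {
                    Token old = next.get(successor);
                    if (old == null || old.score < t.score) {
                        next.put(successor, new Token(successor, t.score, t));
                    }
                }
            }
            active = new ArrayList<>(next.values());
        }
        // No more features: pick the best token that reached the exit state.
        Token bestExit = null;
        for (Token t : active) {
            if (t.state == exitState && (bestExit == null || t.score > bestExit.score)) {
                bestExit = t;
            }
        }
        List<Integer> path = new ArrayList<>();
        for (Token t = bestExit; t != null; t = t.predecessor) {
            path.add(0, t.state);
        }
        return path;
    }

    public static void main(String[] args) {
        int[][] successors = {{0, 1}, {1, 2}, {2, 3}, {}}; // self loops plus exit state 3
        double[] means = {0.0, 5.0, 10.0, 0.0};
        double[] features = {0.1, 0.2, 4.9, 5.1, 9.8, 10.1};
        System.out.println(decode(successors, means, 0, 3, features, 100.0));
    }
}
```

A narrower beam discards more tokens at each step, which speeds up the search at the risk of pruning away the path that would eventually have scored best.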
The performance of Sphinx-4 critically depends on your task and how you configured Sphinx-4 to suit your task. For example, a large vocabulary task needs a different linguist than a small vocabulary task. Your system has to be configured differently for the two tasks. This section will not tell you the exact configuration for different tasks, which will be dealt with later. Instead, this section will introduce you to the configuration mechanism of Sphinx-4, which is via an XML-based configuration file. Please click on the document Sphinx-4 Configuration Management to learn how to do this. It is important that you read this document before you proceed.