Front End Framework
Typically, you will need to fine-tune some of the front end parameters depending on the sampling frequency used to collect the audio data. Sphinx-4 has default values that may be different from what you need, so please check the constant field values page to verify the defaults.
The javadocs for the relevant classes describe how to choose the values when fine-tuning the front end. Other than the sampling rate itself, a user most commonly has to change the filterbank parameters, i.e., the number of filters and the frequency range spanned by the filters. Depending on the sampling rate, a user may also want to change the window shift and size, as well as the number of FFT points.
For more information on these parameters, please refer to the javadocs of the corresponding front end components. You do not need all of them, just the ones used in your configuration.
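For illustration, here is a minimal sketch of what such overrides might look like in an XML configuration file, using values commonly suggested for 8 kHz telephone speech. The component names are arbitrary, and the property names are assumed from the corresponding front end classes, so verify them against the javadocs for your version of Sphinx-4:
<component name="melFilterBank" type="edu.cmu.sphinx.frontend.frequencywarp.MelFrequencyFilterBank">
    <property name="numberFilters" value="31"/>
    <property name="minimumFrequency" value="200"/>
    <property name="maximumFrequency" value="3500"/>
</component>

<component name="windower" type="edu.cmu.sphinx.frontend.window.RaisedCosineWindower">
    <property name="windowSizeInMs" value="25.625"/>
    <property name="windowShiftInMs" value="10"/>
</component>

<component name="fft" type="edu.cmu.sphinx.frontend.transform.DiscreteFourierTransform">
    <property name="numberFftPoints" value="256"/>
</component>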
To create a binary MFCC cepstrum file, go to the edu/cmu/sphinx/tools/feature directory and type:
ant -Dinput="input file" -Doutput="output file" cepstra_producer
To create a binary PLP cepstrum file using this program, type:
ant -Dinput="input file" -Doutput="output file" plp_producer
To create a binary spectra file, type:
ant -Dinput="input file" -Doutput="output file" spectra_producer
As you might notice by comparing the files cepstra_dump.props and spectra_dump.props in the "edu/cmu/sphinx/tools/feature" directory, the only difference in setup between dumping different types of features is the sequence of data processors specified in the properties file. If you give it a different data processor sequence, it will produce different output.
Binary File Format
The first 4 bytes of the binary file are an integer indicating the total number of data points in the file. A program that reads the file can use this value to check the file's endianness by comparing it against the file size. The rest of the file is simply the data points. Each data point is a 4-byte floating point number, in big-endian order.
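For illustration, here is a minimal Java sketch of a reader for this format. The class name FeatureFileReader is hypothetical; only the layout described above (a 4-byte count followed by 4-byte big-endian floats) is assumed:

import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

// Minimal sketch of a reader for the binary feature file format described
// above: a 4-byte integer count followed by 4-byte big-endian floats.
// The class name is hypothetical.
public class FeatureFileReader {

    public static float[] read(String path) throws IOException {
        long fileSize = Files.size(Paths.get(path));
        try (DataInputStream in =
                 new DataInputStream(new FileInputStream(path))) {
            // DataInputStream reads big-endian values by default.
            int numPoints = in.readInt();

            // Sanity check: the count should match the remaining file size
            // (4 bytes per data point). If it does not, the file was probably
            // written in the opposite byte order.
            if ((long) numPoints * 4 != fileSize - 4) {
                throw new IOException(
                    "Data point count does not match file size; "
                    + "check the endianness of the file.");
            }

            float[] data = new float[numPoints];
            for (int i = 0; i < numPoints; i++) {
                data[i] = in.readFloat();
            }
            return data;
        }
    }

    public static void main(String[] args) throws IOException {
        float[] cepstra = read(args[0]);
        System.out.println("Read " + cepstra.length + " data points");
    }
}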
This normally applies to batch mode decoding using the BatchModeRecognizer. In the configuration file, set the first processor of the front end to be StreamCepstrumSource, and use that as the input source of the BatchModeRecognizer. Also, change the front end pipeline so that either BatchCMN or LiveCMN follows the StreamCepstrumSource, skipping the preemphasis, windowing, MFCC, and DCT steps. The configuration should contain the following lines:
<component name="batchRecognizer" type="edu.cmu.sphinx.tools.batch.BatchModeRecognizer">
<property name="inputSource" value="streamCepstrumSource">
// ... other properties ...
</component>
<component name="frontEnd" type="edu.cmu.sphinx.frontend.FrontEnd">
<propertylist name="pipeline">
<item>streamCepstrumSource</item>
<item>batchCMN</item>
// ... other front end processors
</propertylist>
</component>
<component name="streamCepstrumSource" type="edu.cmu.sphinx.frontend.util.StreamCepstrumSource"/>
For more information on configuration files, please refer to the Sphinx-4 Configuration Management document.
The Sphinx-4 audio endpointer is composed of three data processors that carry out different functions:
SpeechClassifier - classifies chunks of audio into speech and non-speech.
SpeechMarker - marks the audio stream into speech and non-speech regions, giving some 'cushion areas' around these regions.
NonSpeechDataFilter - removes the non-speech regions from the audio.
The Sphinx-4 audio endpointer is enabled by including it as part of the front end pipeline. The three data processors should be placed in front of all the other data processors, e.g.,
<component name="frontEnd" type="edu.cmu.sphinx.frontend.FrontEnd">
<propertylist name="pipeline">
<item>speechClassifier</item>
<item>speechMarker</item>
<item>nonSpeechDataFilter</item>
<item>preemphasizer</item>
// ... other front end processors
</propertylist>
</component>
<component name="speechClassifier" type="edu.cmu.sphinx.frontend.endpoint.SpeechClassifier">
<property name="threshold" value="13"/>
</component>
<component name="speechMarker" type="edu.cmu.sphinx.frontend.endpoint.SpeechMarker"/>
<component name="nonSpeechDataFilter" type="edu.cmu.sphinx.frontend.endpoint.NonSpeechDataFilter"/>
The SpeechClassifier property 'threshold' controls how sensitive the endpointer is. The value of 13 was determined empirically to work well in most environments. A lower threshold makes the endpointer more sensitive, that is, it marks more audio as speech; a higher threshold makes it less sensitive, that is, it marks less audio as speech.
It is also important to understand the operation of the mergeSpeechSegment property of the NonSpeechDataFilter when using the audio endpointer. Refer to the Javadoc for NonSpeechDataFilter for details about its operation.