Front End Framework


Fine Tuning the Front End to the Sampling Rate

Typically, you will need to fine tune some of the front end parameters depending on the sampling frequency used to collect the audio data. Sphinx-4 has default values that may be different from what you need. Please check the constant field values page to verify the default values.

You will find details about how to choose these values in the javadocs for the appropriate classes. Other than the sampling rate itself, users commonly have to change the filterbank parameters, i.e., the number of filters and the frequency range spanned by the filters. Depending on the sampling rate, a user may also want to change the window shift and size, as well as the number of FFT points.
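As a rough illustration of how the window parameters depend on the sampling rate, the sketch below converts a window size and shift given in milliseconds into samples, then picks the smallest power of two that holds one window as the number of FFT points. The function name `frame_parameters` and the default values of 25.625 ms and 10 ms are assumptions made for this example, not values taken from this document; check the constant field values page for the actual Sphinx-4 defaults.

```python
def frame_parameters(sample_rate_hz, window_size_ms=25.625, window_shift_ms=10.0):
    """Convert window settings from milliseconds to samples for a given rate.

    The millisecond defaults are illustrative assumptions only.
    """
    window_size = round(window_size_ms * sample_rate_hz / 1000.0)
    window_shift = round(window_shift_ms * sample_rate_hz / 1000.0)
    # Smallest power of two that can hold one full window of samples.
    fft_points = 1 << (window_size - 1).bit_length()
    return window_size, window_shift, fft_points


for rate in (8000, 16000):
    size, shift, nfft = frame_parameters(rate)
    print(f"{rate} Hz: window={size} samples, shift={shift} samples, FFT={nfft} points")
```

Under these assumptions, halving the sampling rate halves the window length in samples, so a smaller FFT is enough to cover one window.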

For more information on these variables, please follow the links below. You do not need all of them, just the ones for the components defined in your configuration.


Creating MFC Cepstrum/PLP Cepstrum/Spectrum from Audio

There is a program called FeatureFileDumper that uses the front end to turn an audio file into a binary cepstra file. To create an MFCC cepstrum file using this program, go to the edu/cmu/sphinx/tools/feature directory, and type:

ant -Dinput="input file" -Doutput="output file" cepstra_producer

To create a binary PLP cepstrum file using this program, type:

ant -Dinput="input file" -Doutput="output file" plp_producer

To create a binary spectra file, type:

ant -Dinput="input file" -Doutput="output file" spectra_producer

As you might notice by comparing the files cepstra_dump.props and spectra_dump.props in the "edu/cmu/sphinx/tools/feature" directory, the only difference in setup between dumping different types of features is the sequence of data processors specified in the properties file. If you give the program a different data processor sequence, it will produce different output.

Binary File Format

The first 4 bytes of the binary file are an integer indicating the total number of data points in the file. A program that reads the file can use this value to check the endianness of the file by comparing it with the file size. The rest of the file is simply the data points. Each data point is a 4-byte floating point number, in big-endian order.
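The format described above can be read with a few lines of code. The following is a minimal sketch, not part of Sphinx-4: the function name `read_feature_file` is hypothetical, and the fallback to little-endian is an assumption about how a reader might apply the size check described above.

```python
import struct


def read_feature_file(path):
    """Read a feature dump: a 4-byte count followed by that many 4-byte floats."""
    with open(path, "rb") as f:
        raw = f.read()
    if len(raw) < 4:
        raise ValueError("file too short to contain a header")
    payload_bytes = len(raw) - 4
    # Try big-endian first, as the format specifies. If the header does not
    # agree with the file size, try little-endian before giving up.
    for order in (">", "<"):
        (count,) = struct.unpack(order + "i", raw[:4])
        if count * 4 == payload_bytes:
            return list(struct.unpack(f"{order}{count}f", raw[4:]))
    raise ValueError("header count does not match file size")
```

Calling `read_feature_file("output file")` on a file produced by one of the ant targets above should return a flat list of floats; grouping them into frames depends on the feature dimension, which the file itself does not record.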


Decoding from Cepstra Files

This normally applies to batch mode decoding using the BatchModeRecognizer. In the configuration file, set the first processor of the front end to be StreamCepstrumSource, and use that as the input source of the BatchModeRecognizer. Also, change the front end pipeline so that either BatchCMN or LiveCMN follows the StreamCepstrumSource, skipping the preemphasis, windowing, MFCC, and DCT steps. The configuration should contain the lines:

<component name="batchRecognizer" type="edu.cmu.sphinx.tools.batch.BatchModeRecognizer">

    <property name="inputSource" value="streamCepstrumSource"/>
    <!-- ... other properties ... -->

</component>

<component name="frontEnd" type="edu.cmu.sphinx.frontend.FrontEnd">

    <propertylist name="pipeline">
        <item>streamCepstrumSource</item>
        <item>batchCMN</item>
        <!-- ... other front end processors ... -->
    </propertylist>

</component>

<component name="streamCepstrumSource" type="edu.cmu.sphinx.frontend.util.StreamCepstrumSource"/>
      
For more information on configuration files, please refer to the document Sphinx-4 Configuration Management.


Enabling the Endpointer

The Sphinx-4 audio endpointer is composed of three data processors that carry out different functions:

SpeechClassifier - classifies chunks of audio into speech and non-speech.
SpeechMarker - marks the audio stream into speech and non-speech regions, giving some 'cushion areas' around these regions.
NonSpeechDataFilter - removes the non-speech regions from the audio.

The Sphinx-4 audio endpointer is enabled by including it as part of the front end pipeline. The three data processors should be placed in front of all the other data processors, e.g.,

<component name="frontEnd" type="edu.cmu.sphinx.frontend.FrontEnd">

    <propertylist name="pipeline">
        <item>speechClassifier</item>
        <item>speechMarker</item>
        <item>nonSpeechDataFilter</item>
        <item>preemphasizer</item>
        <!-- ... other front end processors ... -->
    </propertylist>

</component>

<component name="speechClassifier" type="edu.cmu.sphinx.frontend.endpoint.SpeechClassifier">
    <property name="threshold" value="13"/>
</component>
<component name="speechMarker" type="edu.cmu.sphinx.frontend.endpoint.SpeechMarker"/>
<component name="nonSpeechDataFilter" type="edu.cmu.sphinx.frontend.endpoint.NonSpeechDataFilter"/>
      

The SpeechClassifier property 'threshold' controls how sensitive the endpointer is. A value of 13 has been determined empirically to be optimal for most environments. A lower threshold makes the endpointer more sensitive, that is, it marks more audio as speech. A higher threshold makes the endpointer less sensitive, that is, it marks less audio as speech.
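The effect of the threshold can be seen in a toy model. The sketch below is not the SpeechClassifier algorithm; it assumes, purely for illustration, that a frame counts as speech when its level exceeds a fixed background level by more than the threshold. The function name `classify_frames`, the frame levels, and the background level are all made up.

```python
def classify_frames(levels, background, threshold):
    """Toy rule: a frame is speech when it exceeds the background by `threshold`."""
    return [level - background > threshold for level in levels]


levels = [42.0, 48.0, 53.0, 56.0, 61.0, 50.0, 44.0]  # made-up frame levels
background = 40.0                                     # made-up background level

for threshold in (10, 13, 16):
    speech = classify_frames(levels, background, threshold)
    print(f"threshold={threshold}: {sum(speech)} of {len(levels)} frames marked as speech")
```

In this model, raising the threshold can only reduce the number of frames marked as speech, which matches the behavior described above.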

It is also important to understand the operation of the mergeSpeechSegment property of the NonSpeechDataFilter when using the audio endpointer. Refer to the Javadoc for NonSpeechDataFilter for details about its operation.


Copyright 1999-2004 Carnegie Mellon University.
Portions Copyright 2002-2004 Sun Microsystems, Inc.
Portions Copyright 2002-2004 Mitsubishi Electric Research Laboratories.
All Rights Reserved. Usage is subject to license terms.