Sphinx 3.3 vs. Sphinx 4 Algorithm Differences
There are some differences in performance between Sphinx 3.3 and
Sphinx 4. This paper describes some of the major algorithmic
differences between the two systems. The goal is to help us
understand the performance gap and to give us some ideas of how to
improve the performance of Sphinx 4.
Major Differences
- Token Passing vs. Multiple Lex
Trees - This is a fundamental difference between Sphinx
3.3 and Sphinx 4. In large vocabulary recognition there are
potentially many active search paths. Managing the active paths
efficiently is central to the decoding process. Sphinx 3.3 and
Sphinx 4 use different techniques in dealing with this issue. In
Sphinx 3.3 the search space is managed as a set of lex trees with
active paths managed directly in the lex tree. Only the best path
through a particular node is retained (this is known as Viterbi
pruning). Since there can be many simultaneous active paths through a
particular node of the lex tree (due to different word histories)
these paths would incorrectly compete against each other. To
avoid this, a set of lex trees is used (typically 3 in Sphinx 3.3, plus
one for filler words). As paths reach the end of a lex tree they
are propagated into the next lex tree. This reduces, but does not
eliminate, the chance of unfair Viterbi pruning. For a good
discussion of multiple lex trees see Spoken
Language Processing (Huang et al.) page 648.
Sphinx-4 takes a very different approach to managing the active search
paths. Sphinx-4 maintains a single lex tree. However, no active search
information is retained in this tree. Instead, a token tree is used to
manage the active paths during the search. Paths only compete
against each other if their lex tree state, HMM state and word
histories are identical. This approach allows Sphinx-4 to
perform true n-gram decoding; it is theoretically correct and results
in higher accuracy. This generality has a price, however.
For instance, it is straightforward in S3.3 to determine if two
paths have reached the same state in the lex tree since the two paths
are maintained in the same lex tree structure. In S4, however, there
isn't this explicit structure; instead, the contexts of the paths
(lex tree state, HMM state and word history) have to be explicitly
compared to determine if the two paths collide. Similarly, to
propagate a path into a new state, S3.3 merely has to set the active flag in the new lex tree
node to true and then add the node to the active list, while S4
has to generate a new search graph state (containing the lex tree
state, HMM state and word history), generate a new token to
add to the path, and add the token to the active list.
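To make this bookkeeping concrete, here is a minimal Java sketch of the
token-passing idea. The names (Token, SearchState, the word-history string)
are illustrative stand-ins rather than the actual Sphinx-4 classes; the point
is only that two paths prune each other only when their lex tree node, HMM
state and word history all match.

    import java.util.HashMap;
    import java.util.Map;

    // Illustrative only: Token and SearchState are simplified stand-ins, not the
    // actual Sphinx-4 classes.
    public class TokenPassingSketch {

        // The identity of a path: lex tree node, HMM state and word history.
        // Two paths compete (Viterbi-style) only when all three match.
        record SearchState(int lexTreeNode, int hmmState, String wordHistory) { }

        // A token holds a score, the search state it occupies, and a link to its
        // predecessor so the word sequence can be recovered later.
        record Token(SearchState state, float score, Token predecessor) { }

        // Active list keyed by search state: keeps only the best token per state.
        static final Map<SearchState, Token> activeList = new HashMap<>();

        static void propagate(Token from, SearchState next, float transitionScore) {
            Token candidate = new Token(next, from.score() + transitionScore, from);
            activeList.merge(next, candidate,
                    (oldTok, newTok) -> newTok.score() > oldTok.score() ? newTok : oldTok);
        }

        public static void main(String[] args) {
            Token root = new Token(new SearchState(0, 0, "<s>"), 0.0f, null);

            // Different word histories reaching the same lex tree node and HMM state
            // occupy different SearchStates, so they do NOT prune each other.
            propagate(root, new SearchState(5, 1, "the dog"), -10.0f);
            propagate(root, new SearchState(5, 1, "my dog"), -12.0f);

            // The same history reaching the same state twice: only the better survives.
            propagate(root, new SearchState(5, 1, "the dog"), -8.0f);

            System.out.println("Active states: " + activeList.size()); // prints 2
        }
    }

Because different word histories map to different search states, the two "dog"
paths above survive side by side; that is the generality, and the extra state
creation and comparison, described in this item.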
- Composites - Sphinx
3.3 uses composite senones to represent the word ending nodes.
Composite senones are a method of dealing with the large fanout in
states that occur at word boundaries. This fanout typically
accounts for 40 to 50% of all active states in the search. Since
we are typically dealing with context dependent units where the context
is a single left and right unit, the right context for a word ending
unit must include all possible initial units for the next set of
words. Typically there are 20 or 30 separate CD units needed to
represent the final phone in any word. Composites reduce
this fanout by collecting all of the possible word ending CD units into
a single bucket. This bucket of senones is scored by taking the maximum
(or the average) of the scores of the individual senones in the
bucket. The composite doesn't reduce the number of HMM states
that are scored since all of the underlying senones are scored, but it
does reduce the overall processing time by reducing the number of
states that need to be managed and propagated. The imprecision of
the composites can potentially reduce the accuracy of the recognizer.
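As a rough illustration of the scoring step, the following Java sketch scores a
composite as the maximum over its member senones. The Senone record and the
placeholder scorer are made up for the example; in a real decoder each member
score would come from evaluating its Gaussian mixture against the current frame.

    import java.util.List;
    import java.util.function.ToDoubleFunction;

    // Illustrative sketch: a "composite senone" bundles the word-ending senones for
    // all possible right contexts and is scored as the max of its members.
    public class CompositeSenoneSketch {

        // Stand-in for a real senone: just an id here; a real senone would hold the
        // Gaussian mixture evaluated against the current feature frame.
        record Senone(int id) { }

        // Score a composite as the best score among its member senones.
        static double scoreComposite(List<Senone> members,
                                     ToDoubleFunction<Senone> senoneScorer) {
            return members.stream()
                    .mapToDouble(senoneScorer)
                    .max()
                    .orElse(Double.NEGATIVE_INFINITY);
        }

        public static void main(String[] args) {
            // Pretend member senones of one word-ending composite, with fake scores.
            List<Senone> composite = List.of(new Senone(1), new Senone(2), new Senone(3));
            ToDoubleFunction<Senone> fakeScorer = s -> -5.0 * s.id();

            // One composite score stands in for the 20-30 word-ending right contexts.
            System.out.println("composite score = " + scoreComposite(composite, fakeScorer));
        }
    }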
Differences in Constraints
- Arbitrary HMM Topologies -
Sphinx 4 allows HMMs to consist of an arbitrary number of
states with no constraints on the transitions between these states.
Sphinx 3.3 only allows 3- or 5-state HMMs with left-to-right
transitions. Sphinx-4 currently cannot take advantage of this
flexibility since we are relying on Sphinx-3 acoustic models, which are
3- or 5-state left-to-right HMMs.
- Arbitrary Context Sizes - Sphinx-4,
in general, allows for arbitrary context sizes, while Sphinx 3.3 has a
fixed context of a single left and right unit. Note, however,
that the current Sphinx-4 linguists also constrain the context size to
a single left and right unit since there are some performance
optimizations to be gained with this constraint.
- Arbitrary language model depth
- The Sphinx-4 decoder supports any depth of n-gram.
Currently, we are constrained to 3-grams by the n-gram model loader
since we only have access to 3-gram language models. Sphinx 3.3 is
restricted to use 3-grams.
- Support for rule grammars
- Sphinx-4 can use rule based grammars such as those
specified by JSGF or FSTs. Sphinx 3.3 supports only stochastic grammars.
- Arbitrary search graph
topologies - The Sphinx-4 representation of the search graph is
independent of the topology of the graph. For instance, a
flat topology is used for small vocabulary recognition, while a
lex tree is used for medium and large vocabulary recognition. Sphinx
3.3 only supports a lex tree topology.
- Tuning Parameters - Sphinx-4
is more configurable and tunable, with about 300 separate configurable
properties compared to S3.3 with about 25.
Scoring / Pruning Differences
There are a number of differences in how pruning is performed.
- Sub-vector quantization - Sphinx
3.3 uses sub-vector quantization to reduce the amount of time needed to
score the active states. Sphinx 4 scores all the active states fully,
while Sphinx 3.3 will perform a quick pass using the sub-VQ approximation to select a
subset of active states to score fully.
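A minimal Java sketch of this two-pass idea, under the assumption that a cheap
approximate scorer (standing in for the sub-vector-quantized evaluation) is
available alongside the full scorer:

    import java.util.Comparator;
    import java.util.LinkedHashMap;
    import java.util.Map;
    import java.util.stream.IntStream;

    // Illustrative two-pass scoring: a cheap scorer stands in for the sub-VQ
    // approximation; only the shortlisted senones get the full mixture evaluation.
    public class SubVqSketch {

        interface Scorer { double score(int senoneId, float[] feature); }

        // Fully score only the senones whose cheap scores fall in the top `keep`
        // fraction; the rest are simply not scored this frame.
        static Map<Integer, Double> twoPassScore(int[] activeSenones, float[] feature,
                                                 Scorer cheap, Scorer full, double keep) {
            int shortlist = Math.max(1, (int) (activeSenones.length * keep));
            Map<Integer, Double> fullScores = new LinkedHashMap<>();
            IntStream.of(activeSenones).boxed()
                    .sorted(Comparator.comparingDouble((Integer id) -> -cheap.score(id, feature)))
                    .limit(shortlist)
                    .forEach(id -> fullScores.put(id, full.score(id, feature)));
            return fullScores;
        }

        public static void main(String[] args) {
            Scorer cheapScorer = (id, f) -> -id;        // placeholder approximate scores
            Scorer fullScorer = (id, f) -> -1.01 * id;  // placeholder full GMM scores
            System.out.println(twoPassScore(new int[]{1, 2, 3, 4, 5},
                    new float[39], cheapScorer, fullScorer, 0.4));
        }
    }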
- HMMs vs. HMM States - In
Sphinx 3.3, the lowest level scoreable entity is an HMM, while in
Sphinx-4 it is an HMM State. Sphinx 3.3 treats the HMM as a
whole. Whenever an HMM is active, S3.3 scores all of the states of the
HMM against the incoming feature and generates two values, a 'best
score' and an 'exit score'. The 'best score' is the best score
among all of the states of the HMM. This score is used to determine if
the HMM will remain active in the succeeding frames. The 'exit
score' is used to determine if the subsequent HMMs in the lex tree will
become active (that is, should we propagate the path into the next
HMM). In Sphinx-4, each individual state of the HMM is scored
separately and is subject to pruning and propagation. This method
used by Sphinx-4 allows for arbitrary HMM topologies and for
consistent pruning. It does, however, require handling state transitions
much more frequently. Sphinx-4 will generate state transitions
four times more often than Sphinx 3.3 because of this method of
handling finer-grained states. This most likely has a performance
impact. For instance, the test that determines if a path should be
dropped because of a competing higher-scoring path (aka Viterbi
pruning) is done four times as often in S4.
Although this is a fairly quick test, it is done quite often (50,000
times per 10ms audio frame), so it can add significantly to the
processing time for S4.
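The Sphinx 3.3 side of this difference can be sketched in a few lines of Java.
The scores, the beam threshold, and the assumption that the last state is the
exit state of a left-to-right HMM are all illustrative; the sketch only shows
the two summary values (best score and exit score) that S3.3 derives from
scoring a whole HMM, and how they gate survival and propagation.

    // Illustrative sketch of the Sphinx 3.3 style of treating a whole HMM as the
    // scoreable unit: every state of an active HMM is scored each frame, and two
    // summary values come out of it, a best score and an exit score.
    public class HmmLevelScoringSketch {

        record HmmResult(double bestScore, double exitScore) { }

        // stateScores[i] is the (log) path score of state i after consuming the
        // frame; the last state is taken to be the exit state of a left-to-right HMM.
        static HmmResult scoreHmm(double[] stateScores) {
            double best = Double.NEGATIVE_INFINITY;
            for (double s : stateScores) best = Math.max(best, s);
            return new HmmResult(best, stateScores[stateScores.length - 1]);
        }

        public static void main(String[] args) {
            // Pretend log scores of a 3-state left-to-right HMM after one frame.
            HmmResult r = scoreHmm(new double[]{-120.0, -95.0, -140.0});

            double beamThreshold = -130.0;                            // illustrative beam
            boolean keepHmmActive = r.bestScore() > beamThreshold;    // best score gates survival
            boolean propagateToNext = r.exitScore() > beamThreshold;  // exit score gates propagation

            System.out.println("keep active = " + keepHmmActive
                    + ", propagate = " + propagateToNext);
        }
    }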
- Folding of common senone
sequences - Due to state tying, it is often the case that
several distinct context dependent phones will share a common senone
sequence. Both Sphinx 3.3 and Sphinx 4 will identify states
that have identical senone sequences and allow these to be shared,
greatly reducing the number of states processed. Sphinx 3.3 also
performs this folding on the entry into the lex tree. That is, the
initial set of states entering the lex tree is also compressed based upon
common senone sequences. It is unclear how much this optimization
saves in terms of state creation.
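A rough Java sketch of the folding idea, under the simplifying assumption that
an HMM is identified purely by its senone-id sequence (the triphone names and
senone ids below are made up):

    import java.util.Arrays;
    import java.util.HashMap;
    import java.util.Map;

    // Illustrative sketch of folding: distinct context-dependent phones whose tied
    // states reduce to the same senone-id sequence share a single HMM instance.
    public class SenoneFoldingSketch {

        record SenoneSequence(int[] ids) {
            // Value semantics over the id array so sequences can key a map.
            @Override public boolean equals(Object o) {
                return o instanceof SenoneSequence s && Arrays.equals(ids, s.ids);
            }
            @Override public int hashCode() { return Arrays.hashCode(ids); }
        }

        static final Map<SenoneSequence, String> hmmPool = new HashMap<>();
        static int nextHmmId = 0;

        // Create (or reuse) the HMM for a triphone, keyed only by its senone sequence.
        static String hmmFor(String triphone, int[] senoneIds) {
            return hmmPool.computeIfAbsent(new SenoneSequence(senoneIds),
                    seq -> "HMM#" + (nextHmmId++) + " (first created for " + triphone + ")");
        }

        public static void main(String[] args) {
            // Two different triphones that state tying maps onto the same senones.
            System.out.println(hmmFor("AX(B,T)", new int[]{12, 57, 98}));
            System.out.println(hmmFor("AX(P,D)", new int[]{12, 57, 98})); // reuses the HMM
            System.out.println("distinct HMMs created: " + hmmPool.size()); // prints 1
        }
    }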
- Absolute Beam - Sphinx
4 applies the absolute beam using a 'randomized partitioning
algorithm', while Sphinx 3.3 uses a simple binning algorithm.
Both of these algorithms are O(n); however, they have different overhead
constants. The Sphinx-4 algorithm is precise in that it
guarantees that the surviving beam contains the top scorers, whereas
the Sphinx 3.3 binning algorithm is only approximate.
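Here is a minimal Java sketch of the partition-based approach: a randomized
quickselect-style partition rearranges the token scores so that exactly the top
beamSize scores end up at the front, in expected O(n) time. This illustrates
the general technique only; it is not the actual Sphinx-4 implementation.

    import java.util.Random;

    // Illustrative sketch of enforcing an absolute beam with a randomized
    // quickselect-style partition: afterwards the first `beamSize` entries are
    // exactly the top scorers, in expected O(n) time.
    public class AbsoluteBeamSketch {

        static final Random rng = new Random(42);

        // Rearrange scores so that scores[0..beamSize-1] hold the beamSize largest values.
        static void selectTop(double[] scores, int beamSize) {
            int lo = 0, hi = scores.length - 1;
            while (lo < hi) {
                int pivot = partition(scores, lo, hi);
                if (pivot == beamSize) return;
                if (pivot < beamSize) lo = pivot + 1; else hi = pivot - 1;
            }
        }

        // Lomuto partition around a random pivot, placing larger scores first.
        static int partition(double[] a, int lo, int hi) {
            swap(a, lo + rng.nextInt(hi - lo + 1), hi);
            double pivot = a[hi];
            int store = lo;
            for (int i = lo; i < hi; i++) {
                if (a[i] > pivot) swap(a, i, store++);
            }
            swap(a, store, hi);
            return store;
        }

        static void swap(double[] a, int i, int j) { double t = a[i]; a[i] = a[j]; a[j] = t; }

        public static void main(String[] args) {
            double[] tokenScores = {-10, -55, -3, -91, -27, -8, -64, -12};
            int beamSize = 3;
            selectTop(tokenScores, beamSize);
            // The first three entries now hold -3, -8 and -10 in some order.
            for (int i = 0; i < beamSize; i++) System.out.println(tokenScores[i]);
        }
    }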
- LexTree Pruning - In
S3.3, at the end of a frame, only the best word exit for each distinct
word final CI phone per lex tree is allowed to continue. In S4
there is no pruning based on the final phone.
- Phone Beam - In
S3.3 there is a relative beam that is applied on the transition out of an
HMM. Only active HMMs whose score exceeds this relative beam are
allowed to propagate. S4 has no such beam.
- Narrow Phone Beam - In
S3.3 there is an option to narrow the phone beam every Nth frame.
- Word History Pruning - In
S3.3 the word beam only ever contains a single path for a particular
word. For instance, if there are three path candidates for the
beam:
- (1) "the dog", score 100
- (2) "my dog", score 75
- (3) "my cat", score 50
Only paths (1) and (3) would be retained, whereas in S4 both paths (1)
and (2) could survive.
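The example above can be turned into a small Java sketch of the S3.3
behavior, keeping only the best-scoring path for each final word (the Path
record and the scores are illustrative):

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Illustrative sketch of S3.3-style word history pruning: the word beam keeps
    // at most one path per final word, namely the best-scoring one.
    public class WordHistoryPruningSketch {

        record Path(String history, String lastWord, double score) { }

        static Map<String, Path> pruneByWord(List<Path> candidates) {
            Map<String, Path> best = new HashMap<>();
            for (Path p : candidates) {
                best.merge(p.lastWord(), p,
                        (oldP, newP) -> newP.score() > oldP.score() ? newP : oldP);
            }
            return best;
        }

        public static void main(String[] args) {
            List<Path> beam = List.of(
                    new Path("the dog", "dog", 100),
                    new Path("my dog", "dog", 75),
                    new Path("my cat", "cat", 50));
            // Only "the dog" (the best path ending in "dog") and "my cat" survive.
            pruneByWord(beam).values().forEach(p ->
                    System.out.println(p.history() + " : " + p.score()));
        }
    }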
- More Word History Pruning
- In S3.3 there is a constraint placed upon the total
number of word histories retained; in S4 there is no such constraint.
- Filler Pruning - In
S3.3 only the best scoring filler word is allowed into the beam. All
others are dropped. In S4 fillers compete for space in the beam
with all other words.
- Integer vs. Float Scores -
In S4 all path scores are maintained as single precision
floating point values. In S3.3, path scores are maintained as
32-bit integer values.
- Single vs. Double Precision
Scoring - Internally, S3.3 uses double precision math to
score HMMs whereas S4 uses single precision math. Note that we have
done some detailed analysis on the scoring code and have determined
that the differences in scoring code do not significantly affect the
results.
- AcousticLookAhead - In
S4 there is an option to prune based upon an acoustic score that
has been extrapolated into the future by N frames. S3.3 has no such
option.
- Grow Skipping - In S4
there is an option to skip the propagation of states every Nth frame.
S3.3 has no such option.
- Frame Dropping - In
S4 there is an option to drop every Nth frame. S3.3 has no such option.
- Frame Replacement - In
S4 there is an option to replace every Nth frame with the preceding
frame. S3.3 has no such option.
Miscellaneous Differences
- S3.3 has an optional filler penalty for each type of filler,
whereas S4 has only a single filler penalty for all non-silence
fillers.
- S3.3 filler words are maintained in their own lex tree. Since the
actual identity of fillers (cough, breath, silence, smack) is not
usually important, S3.3 treats filler words separately, giving them
their own lex tree where they compete only against each other and not
against other words.
- S4 has a threaded scorer that can be used to reduce the time
spent scoring states by spreading the computation over multiple CPUs if
available.
- S3.3 seems to have a memory leak: its footprint slowly grows
over time (an estimated 1 MB for every 3 minutes of audio
processed).