- Newsgroups: comp.speech
- Path: sparky!uunet!convex!darwin.sura.net!spool.mu.edu!agate!ames!riacs!danforth
- From: danforth@riacs.edu (Douglas G. Danforth)
- Subject: Re: Very simple speech recognition Alg. wanted.
- Message-ID: <1992Nov12.180625.13886@riacs.edu>
- Sender: news@riacs.edu
- Organization: RIACS, NASA Ames Research Center
- References: <MHALL.92Nov9152432@occs.cs.oberlin.edu>
- Distribution: comp.speech
- Date: Thu, 12 Nov 92 18:06:25 GMT
- Lines: 120
-
- In <MHALL.92Nov9152432@occs.cs.oberlin.edu> mhall@occs.cs.oberlin.edu (Matthew Hall) writes:
-
- >Hello-
- > I asked this question before, and received no replies.
- >However I did receive at least five requests to pass information on.
- >If you can help, please do. Many people want to know.
-
- >Simply the question is this - How does one implement a speaker
- >dependent, discrete recognition system? For my purposes, the
- >vocabulary can be very small (<100 commands), but others have shown
- >interest in larger vocabularies.
-
- >Specifically, what data should one store - what patterns are unique to
- >different words. How does one search a "dictionary" for a specific
- >word, and how does one quickly and somewhat accurately match a spoken
- >word to its saved pattern? The sound, at least in my case, is
- >stored in a raw waveform. I am using pascal on a Macintosh, but I am
- >pretty flexible.
-
- >If you can help me and the other querents out, either by source code
- >or pointers to information, please do. There seems to be a great
- >interest in this.
-
- >Thank you,
- >-matt hall
- >--
- >-------------------------------------------------------------------------------
- >Matt Hall. mhall@occs.oberlin.edu OR SMH9666@OBERLIN.BITNET
- > (216)-775-6613 (That's a Cleveland Area code. Lucky Me)
-
- >"Life's good, but not fair at all" -Lou Reed
-
- QUICKY RECOGNIZER sketch:
-
- Here is a simple recognizer that should give you 85%+ recognition
- accuracy. The accuracy is a function of WHAT words you have in
- your vocabulary. Long distinct words are easy. Short similar
- words are hard. You can get 98+% on the digits with this recognizer.
-
- Overview:
- (1) Find the beginning and end of the utterance.
- (2) Filter the raw signal into frequency bands.
- (3) Cut the utterance into a fixed number of segments.
- (4) Average data for each band in each segment.
- (5) Store this pattern with its name.
- (6) Collect training set of about 3 repetitions of each pattern (word).
- (7) Recognize unknown by comparing its pattern against all patterns
- in the training set and returning the name of the pattern closest
- to the unknown.
-
- This type of recognizer has been used by several companies such as
- Interstate Electronics. There are many variations on this theme:
- Use Mel-Cepstral coefficients rather than frequency bands, dynamic
- time warping rather than the linear segmentation rule, Hidden Markov
- Models with no explicit end-point determination, etc.
-
- If you use filter bands then you need to know how to construct a
- filter which has a center frequency and band width. There are many
- signal processing books that describe how to do this, but they can
- get quite technical very fast. I have found that a simple "second
- order state space" filter works very well. By this I mean that
- each filter is represented by a 2x2 matrix which specifies its
- center frequency and bandwidth along with a 2x1 vector, its state.
- The state is modified from sample to sample by first adding the
- input signal from whatever hardware board you have to one of the
- components of the state and then multiplying that state by the
- 2x2 matrix: add and rotate. The output of the filter is just
- one of the components of the state (it doesn't really matter which,
- the phase is just shifted slightly).
-
- The 2x2 matrix is constructed as follows:
-
- |a -b|
- R = r | |
- |b a|
-
- where 0 < r < 1, a=cos(t), b=sin(t).
-
- The parameter r determines the width of the filter. If r is close to 1
- then the width is very narrow and the output can grow very large for
- inputs with frequency in resonance with the filter. For r small the
- width is broad and the amplitude grows less strongly.
-
- The parameter t is the frequency of the filter, small t low frequency,
- large (near pi) t high frequency. You should spread your filters
- over the range 200Hz to 4000Hz. The spread should be heavy near the
- low frequency with fewer filters near the high (critical bands).
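- The "add and rotate" update above can be sketched in code as follows.
- This is a minimal illustration under my own naming (make_filter and
- its parameters are mine, not from the post), not a definitive
- implementation:

```python
import math

def make_filter(center_hz, r, sample_rate):
    """Second-order state-space resonator: 'add and rotate'.
    r in (0, 1) sets the bandwidth (near 1 = narrow and resonant);
    the angle t = 2*pi*center_hz/sample_rate sets the center frequency."""
    t = 2.0 * math.pi * center_hz / sample_rate
    a, b = math.cos(t), math.sin(t)
    state = [0.0, 0.0]
    def step(x):
        # add the input sample to one component of the state ...
        s0 = state[0] + x
        s1 = state[1]
        # ... then multiply the state by the 2x2 matrix r*[[a,-b],[b,a]]
        state[0] = r * (a * s0 - b * s1)
        state[1] = r * (b * s0 + a * s1)
        # either component will do as output (phase is shifted slightly)
        return state[0]
    return step
```

- A bank of these, say 16 of them spread over 200Hz-4000Hz with more
- filters at the low end, gives you the band outputs. Feeding a filter
- a sine near its center frequency produces a much larger output than
- a sine far from it.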
-
- The output of a filter will look choppy and irregular just like the
- input, but will be large for input signals at the filter's resonance.
- One needs to smooth the output of each band filter by "lowpass"
- filtering the full-wave rectified signal (the absolute value: make
- all negative values positive). This entails a second stage with a
- single scalar state that adds a fraction of the rectified bandpass
- output to a fraction of its own value:
- Lowpass := (1-u)*Lowpass + u*|Bandpass|, where 0 < u < 1.
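- As a sketch (one smoothing stage per band; the function name is mine):

```python
def make_lowpass(u):
    """Smooth one band's output by full-wave rectifying and averaging:
    Lowpass := (1-u)*Lowpass + u*|Bandpass|, with 0 < u < 1."""
    state = [0.0]
    def step(bandpass_sample):
        state[0] = (1.0 - u) * state[0] + u * abs(bandpass_sample)
        return state[0]
    return step
```

- A rapidly alternating +/-1 input, for example, settles to a smooth
- envelope value of 1 after enough samples.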
-
- Resample the Lowpass at about 200 times a second to use for the other
- parts of the pattern generation.
-
- How many filters? How many segments? Well 16 for both works quite
- nicely. This gives a pattern of 256 numbers. That's what you store.
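- The segment-and-average step might look like this, assuming you have
- already collected a list of per-tick band-energy vectors between the
- utterance's endpoints (a sketch; the names are mine):

```python
def make_pattern(frames, n_segments=16):
    """frames: one band-energy vector (e.g. 16 lowpass outputs) per
    resampled tick, from the start to the end of the utterance.
    Cut the utterance into n_segments equal pieces and average each
    band within each piece -> a flat 16x16 = 256-number pattern."""
    pattern = []
    n = len(frames)
    n_bands = len(frames[0])
    for s in range(n_segments):
        lo = s * n // n_segments
        hi = max(lo + 1, (s + 1) * n // n_segments)
        chunk = frames[lo:hi]
        for b in range(n_bands):
            pattern.append(sum(f[b] for f in chunk) / len(chunk))
    return pattern
```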
-
- How do you find the beginning and end of an utterance? Use a threshold
- for the total energy (square of the input signal) and remember that
- just because the signal drops below the threshold does not mean that
- the word is finished. It may come up again! Consider the word "it".
- There is a long pause between the "i" and the release of the "t" so
- you need to allow for this. Again, other more sophisticated techniques
- can avoid having to make these "end point" decisions in this way but
- take more work to implement.
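- A simple threshold scheme with a "hangover" for pauses like the one in
- "it" could look like this (a sketch; the hangover length is my guess,
- roughly 150 ms at 200 frames per second, and should be tuned):

```python
def find_endpoints(energy, threshold, hang_frames=30):
    """energy: per-frame total energy (sum of squared input samples).
    Start = first frame above threshold; end = last frame above it.
    Only declare the word finished after hang_frames consecutive quiet
    frames, so a mid-word gap (the "i"..."t" pause) doesn't end it."""
    start = end = None
    quiet = 0
    for i, e in enumerate(energy):
        if e >= threshold:
            if start is None:
                start = i
            end = i
            quiet = 0
        elif start is not None:
            quiet += 1
            if quiet > hang_frames:
                break
    return start, end
```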
-
- I think I have provided enough information for you to begin building
- your first speech recognition system. Oh yes, just use a Euclidean
- distance between the 256 elements of two patterns (other metrics
- also work).
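- The nearest-neighbor match over the training set can be sketched as
- (again, the names are mine):

```python
import math

def recognize(unknown, training_set):
    """training_set: list of (name, pattern) pairs, each pattern the
    256 stored numbers. Return the name of the stored pattern closest
    to the unknown in Euclidean distance."""
    def dist(p, q):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
    return min(training_set, key=lambda item: dist(unknown, item[1]))[0]
```

- With about 3 training repetitions per word, every repetition simply
- goes into the list under the same name.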
-
- Good luck,
-
- Doug Danforth
-
-