- Path: senator-bedfellow.mit.edu!bloom-beacon.mit.edu!pad-thai.aktis.com!pad-thai.aktis.com!not-for-mail
- From: andrewh@ee.su.oz.au (Andrew Hunt)
- Newsgroups: comp.speech,comp.answers,news.answers
- Subject: comp.speech FAQ (Frequently Asked Questions)
- Supersedes: <comp-speech-faq_741931206@GZA.COM>
- Followup-To: comp.speech
- Date: 21 Aug 1993 00:00:15 -0400
- Organization: Speech Technology Group, The University of Sydney
- Lines: 2006
- Sender: faqserv@GZA.COM
- Approved: news-answers-request@MIT.Edu
- Expires: 2 Oct 1993 04:00:08 GMT
- Message-ID: <comp-speech-faq_745905608@GZA.COM>
- Reply-To: andrewh@ee.su.oz.au (Andrew Hunt)
- NNTP-Posting-Host: pad-thai.aktis.com
- Summary: Useful information about Speech Technology
- X-Last-Updated: 1993/08/20
- Xref: senator-bedfellow.mit.edu comp.speech:1132 comp.answers:1680 news.answers:11634
-
- Archive-name: comp-speech-faq
- Last-modified: 1993/08/20
-
-
- comp.speech
-
- Frequently Asked Questions
- ==========================
-
- This document is an attempt to answer commonly asked questions and to
- reduce the bandwidth taken up by these posts and their associated replies.
- If you have a question, please check this file before you post.
-
- The FAQ is not meant to discuss any topic exhaustively. It will hopefully
- provide readers with pointers on where to find useful information. It also
- tries to list useful material available elsewhere on the net.
-
- This FAQ is posted monthly to comp.speech, comp.answers and news.answers.
- It is also available for anonymous ftp from the comp.speech archive site
- svr-ftp.eng.cam.ac.uk:/comp.speech/FAQ
- It is also available from the news.answers ftp site (and its mirrors) as
- rtfm.mit.edu:/pub/usenet/news.answers/comp-speech-faq
-
- If you have not already read the Usenet introductory material posted to
- "news.announce.newusers", please do. For help with FTP (file transfer
- protocol) look for a regular posting of "Anonymous FTP List - FAQ" in
- comp.misc, comp.archives.admin and news.answers amongst others.
-
-
- Admin
- -----
-
- This month brings a few updates on the CELP information, updates of
- information on many of the software packages and a few minor changes.
-
-
- Cheers,
-
- Andrew Hunt
- Speech Technology Research Group email: andrewh@ee.su.oz.au
- Department of Electrical Engineering Ph: 61-2-692 4509
- University of Sydney, NSW, Australia. Fax: 61-2-692 3847
-
-
- ========================== Acknowledgements ===========================
-
- Thanks to the following for their significant comments and contributions.
-
- Barry Arons <barons@media-lab.mit.edu>
- Joe Campbell <jpcampb@afterlife.ncsc.mil>
- Oliver Jakobs <jakobs@ldv01.Uni-Trier.de>
- Sonja Kowalewski <kowa@uniko.uni-koblenz.de>
- Tony Robinson <ajr@eng.cam.ac.uk>
- Mike <mike%jim.uucp@wupost.wustl.edu>
-
- Many others have provided useful information. Thanks to all.
-
-
- ============================ Contents =================================
-
- PART 1 - General
-
- Q1.1: What is comp.speech?
- Q1.2: Where are the comp.speech archives?
- Q1.3: Common abbreviations and jargon.
- Q1.4: What are related newsgroups and mailing lists?
- Q1.5: What are related journals and conferences?
- Q1.6: What speech data is available?
- Q1.7: Speech File Formats, Conversion and Playing.
- Q1.8: What "Speech Laboratory Environments" are available?
-
- PART 2 - Signal Processing for Speech
-
- Q2.1: What speech sampling and signal processing hardware can I use?
- Q2.2: What signal processing techniques are used for speech technology?
- Q2.3: How do I find the pitch of a speech signal?
- Q2.4: How do I convert to/from mu-law format?
-
- PART 3 - Speech Coding and Compression
-
- Q3.1: Speech compression techniques.
- Q3.2: What are some good references/books on coding/compression?
- Q3.3: What software is available?
-
- PART 4 - Speech Synthesis
-
- Q4.1: What is speech synthesis?
- Q4.2: How can speech synthesis be performed?
- Q4.3: What are some good references/books on synthesis?
- Q4.4: What software/hardware is available?
-
- PART 5 - Speech Recognition
-
- Q5.1: What is speech recognition?
- Q5.2: How can I build a very simple speech recogniser?
- Q5.2: What does speaker dependent/adaptive/independent mean?
- Q5.3: What does small/medium/large/very-large vocabulary mean?
- Q5.4: What does continuous speech or isolated-word mean?
- Q5.5: How is speech recognition done?
- Q5.6: What are some good references/books on recognition?
- Q5.7: What speech recognition packages are available?
-
- PART 6 - Natural Language Processing
-
- Q6.1: What are some good references/books on NLP?
- Q6.2: What NLP software is available?
-
- =======================================================================
-
- PART 1 - General
-
- Q1.1: What is comp.speech?
-
- comp.speech is a newsgroup for discussion of speech technology and
- speech science. It covers a wide range of issues from application of
- speech technology, to research, to products and lots more. By nature
- speech technology is an inter-disciplinary field and the newsgroup reflects
- this. However, computer application is the basic theme of the group.
-
- The following is a list of topics but does not cover all matters related
- to the field - no order of importance is implied.
-
- [1] Speech Recognition - discussion of methodologies, training, techniques,
- results and applications. This should cover the application of techniques
- including HMMs, neural-nets and so on to the field.
-
- [2] Speech Synthesis - discussion concerning theoretical and practical
- issues associated with the design of speech synthesis systems.
-
- [3] Speech Coding and Compression - both research and application matters.
-
- [4] Phonetic/Linguistic Issues - coverage of linguistic and phonetic issues
- which are relevant to speech technology applications. Could cover parsing,
- natural language processing, phonology and prosodic work.
-
- [5] Speech System Design - issues relating to the application of speech
- technology to real-world problems. Includes the design of user interfaces,
- the building of real-time systems and so on.
-
- [6] Other matters - relevant conferences, books, public domain software,
- hardware and related products.
-
- ------------------------------------------------------------------------
-
- Q1.2: Where are the comp.speech archives?
-
- comp.speech is being archived for anonymous ftp.
-
- ftp site: svr-ftp.eng.cam.ac.uk (or 129.169.24.20).
- directory: comp.speech/archive
-
- comp.speech/archive contains the articles as they arrive. Batches of 100
- articles are grouped into a shar file, along with an associated file of
- Subject lines.
-
- Other useful information is also available in comp.speech/info.
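
The anonymous ftp retrieval described above can also be scripted. The following is a minimal sketch using Python's standard ftplib module; the function name fetch_anonymous and its ftp_factory parameter are illustrative assumptions, not part of any archive's tooling, and the sketch cannot guarantee that a host listed in this FAQ still answers.

```python
import ftplib


def fetch_anonymous(host, remote_path, local_path, ftp_factory=ftplib.FTP):
    """Download remote_path from host via anonymous FTP into local_path.

    ftp_factory is a hypothetical hook (default: ftplib.FTP) so that the
    function can be exercised without a network connection.
    """
    ftp = ftp_factory(host)          # connects to the host on creation
    try:
        ftp.login()                  # no user given, so the login is anonymous
        with open(local_path, "wb") as out:
            # RETR streams the remote file; each chunk is written as it arrives.
            ftp.retrbinary("RETR " + remote_path, out.write)
    finally:
        ftp.quit()                   # always close the control connection
```

A call such as fetch_anonymous("svr-ftp.eng.cam.ac.uk", "comp.speech/FAQ", "FAQ") would mirror the manual steps: connect, log in as anonymous, and retrieve one file in binary mode.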
-
- ------------------------------------------------------------------------
-
- Q1.3: Common abbreviations and jargon.
-
- ANN - Artificial Neural Network.
- ASR - Automatic Speech Recognition.
- ASSP - Acoustics Speech and Signal Processing
- AVIOS - American Voice I/O Society
- CELP - Code-book excited linear prediction.
- COLING - Computational Linguistics
- DTW - Dynamic time warping.
- FAQ - Frequently asked questions.
- HMM - Hidden Markov model.
- IEEE - Institute of Electrical and Electronics Engineers
- JASA - Journal of the Acoustical Society of America
- LPC - Linear predictive coding.
- LVQ - Learning vector quantisation.
- NLP - Natural Language Processing.
- NN - Neural Network.
- TI - Texas Instruments.
- TIMIT - A big speech database from TI and MIT - see Q1.6
- TTS - Text-To-Speech (i.e. synthesis).
- VQ - Vector Quantisation.
-
- ------------------------------------------------------------------------
-
- Q1.4: What are related newsgroups and mailing lists?
-
-
- NEWSGROUPS
-
- comp.ai - Artificial Intelligence newsgroup.
- Postings on general AI issues, language processing and AI techniques.
- Has a good FAQ including NLP, NN and other AI information.
-
- comp.ai.nat-lang - Natural Language Processing Group
- Postings regarding Natural Language Processing. Set up to cover
- a broad range of related issues and different viewpoints.
-
- comp.ai.nlang-know-rep - Natural Language Knowledge Representation
- Moderated group covering Natural Language.
-
- comp.ai.neural-nets - discussion of Neural Networks and related issues.
- There are often postings on speech related matters - phonetic recognition,
- connectionist grammars and so on.
-
- comp.compression - occasional articles on compression of speech.
- FAQ for comp.compression has some info on audio compression standards.
-
- comp.dcom.telecom - Telecommunications newsgroup.
- Has occasional articles on voice products.
-
- comp.dsp - discussion of signal processing - hardware and algorithms and more.
- Has a good FAQ posting.
- Has a regular posting of a comprehensive list of Audio File Formats.
-
- comp.multimedia - Multi-Media discussion group.
- Has occasional articles on voice I/O.
-
- sci.lang - Language.
- Discussion about phonetics, phonology, grammar, etymology and lots more.
-
- alt.sci.physics.acoustics - some discussion of speech production & perception.
-
- alt.binaries.sounds.misc - posting of various sound samples
- alt.binaries.sounds.d - discussion about sound samples, recording and playback.
-
-
- MAILING LISTS
-
- ECTL - Electronic Communal Temporal Lobe
- Founder & Moderator: David Leip
- Moderated mailing list for researchers with interests in computer speech
- interfaces. This list serves a broad community including persons from
- signal processing, AI, linguistics and human factors.
-
- To subscribe, send the following information to:
- ectl-request@snowhite.cis.uoguelph.ca
- name, institute, department, daytime phone & e-mail address
-
- To access the archive, ftp snowhite.cis.uoguelph.ca, login as anonymous,
- and supply your local userid as a password. All the ECTL things can be
- found in pub/ectl.
-
- Prosody Mailing List
- Unmoderated mailing list for discussion of prosody. The aim is
- to facilitate the spread of information relating to the research
- of prosody by creating a network of researchers in the field.
- If you want to participate, send the following one-line
- message to "listserv@purccvm.bitnet" :-
-
- subscribe prosody Your Name
-
- foNETiks
- A monthly newsletter distributed by e-mail. It carries job
- advertisements, notices of conferences, and other news of
- general interest to phoneticians, speech scientists and others.
- The current editors are Linda Shockey and Gerry Docherty.
- To subscribe, send a message to FONETIKS-REQUEST@dev.rdg.ac.uk.
-
- Digital Mobile Radio
- Covers many areas, including some speech topics such as speech
- coding and speech compression.
- Mail Peter Decker (dec@dfv.rwth-aachen.de) to subscribe.
-
- ------------------------------------------------------------------------
-
- Q1.5: What are related journals and conferences?
-
- Try the following commercially oriented magazines...
-
- Speech Technology - no longer published
-
- Try the following technical journals...
-
- IEEE Transactions on Speech and Audio Processing (from Jan 93)
- Computational Linguistics (COLING)
- Computer Speech and Language
- Journal of the Acoustical Society of America (JASA)
- Transactions of IEEE ASSP
- AVIOS Journal
-
- Try the following conferences...
-
- ICASSP Intl. Conference on Acoustics Speech and Signal Processing (IEEE)
- ICSLP Intl. Conference on Spoken Language Processing
- EUROSPEECH European Conference on Speech Communication and Technology
- AVIOS American Voice I/O Society Conference
- SST Australian Speech Science and Technology Conference
-
- ------------------------------------------------------------------------
-
- Q1.6: What speech data is available?
-
- A wide range of speech databases have been collected. These databases
- are primarily for the development of speech synthesis/recognition and for
- linguistic research.
-
- Some databases are free but most appear to be available for a small cost.
- The databases normally require lots of storage space - do not expect to be
- able to ftp all the data you want.
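
To give a feel for the storage involved, here is a small worked calculation for uncompressed PCM audio. It is only a sketch: the function name pcm_bytes is an assumption made for illustration, and real databases add headers, labels and documentation on top of the raw samples.

```python
def pcm_bytes(sample_rate_hz, bits_per_sample, channels, seconds):
    """Size in bytes of uncompressed PCM audio (raw samples only)."""
    total_bits = sample_rate_hz * bits_per_sample * channels * seconds
    return total_bits // 8


# One hour of 16 kHz, 16-bit, single-channel speech:
one_hour_wideband = pcm_bytes(16000, 16, 1, 3600)    # 115,200,000 bytes

# One hour of 8 kHz, 8-bit, single-channel (telephone-style) speech:
one_hour_telephone = pcm_bytes(8000, 8, 1, 3600)     # 28,800,000 bytes
```

So an hour of 16 kHz, 16-bit speech is roughly 115 MB before any compression, which is why the larger corpora are distributed on CD-ROM rather than by ftp.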
-
- [There are too many to list here in detail - perhaps someone would like to
- set up a special posting on speech databases?]
-
-
- PHONEMIC SAMPLES
- ================
-
- First, some basic data. The following sites have samples of English phonemes
- (American accent I believe) in Sun audio format files. See Question 1.7
- for information on audio file formats.
-
- sounds.sdsu.edu:/.1/phonemes
- phloem.uoregon.edu:/pub/Sun4/lib/phonemes
- sunsite.unc.edu:/pub/multimedia/sun-sounds/phonemes
-
-
- HOMOPHONE LIST
- ==============
-
- A list of homophones in General American English is available by anonymous
- FTP from the comp.speech archive site:
-
- machine name: svr-ftp.eng.cam.ac.uk
- directory: comp.speech/data
- file name: homophones-1.01.txt
-
-
- LINGUISTIC DATA CONSORTIUM (LDC)
- ================================
-
- Information about the Linguistic Data Consortium is available via
- anonymous ftp from: ftp.cis.upenn.edu (130.91.6.8)
- in the directory: /pub/ldc
-
- Here are some excerpts from the files in that directory:
-
- Briefly stated, the LDC has been established to broaden the collection
- and distribution of speech and natural language data bases for the
- purposes of research and technology development in automatic speech
- recognition, natural language processing and other areas where large
- amounts of linguistic data are needed.
-
- Here is the brief list of corpora:
-
- * The TIMIT and NTIMIT speech corpora
- * The Resource Management speech corpus (RM1, RM2)
- * The Air Travel Information System (ATIS0) speech corpus
- * The Association for Computational Linguistics - Data Collection
- Initiative text corpus (ACL-DCI)
- * The TI Connected Digits speech corpus (TIDIGITS)
- * The TI 46-word Isolated Word speech corpus (TI-46)
- * The Road Rally conversational speech corpora (including "Stonehenge"
- and "Waterloo" corpora)
- * The Tipster Information Retrieval Test Collection
- * The Switchboard speech corpus ("Credit Card" excerpts and portions
- of the complete Switchboard collection)
-
- Further resources to be made available within the first year (or two):
-
- * The Machine-Readable Spoken English speech corpus (MARSEC)
- * The Edinburgh Map Task speech corpus
- * The Message Understanding Conference (MUC) text corpus of FBI
- terrorist reports
- * The Continuous Speech Recognition - Wall Street Journal speech
- corpus (WSJ-CSR)
- * The Penn Treebank parsed/tagged text corpus
- * The Multi-site ATIS speech corpus (ATIS2)
- * The Air Traffic Control (ATC) speech corpus
- * The Hansard English/French parallel text corpus
- * The European Corpus Initiative multi-language text corpus (ECI)
- * The Int'l Labor Organization/Int'l Trade Union multi-language
- text corpus (ILO/ITU)
- * Machine-readable dictionaries/lexical data bases (COMLEX, CELEX)
-
- The files in the directory include more detailed information on the
- individual databases. For further information contact
-
- Linguistic Data Consortium
- 441 Williams Hall
- University of Pennsylvania
- Philadelphia, PA 19104-6305
- Phone: +1 (215) 898-0464
- Fax: +1 (215) 573-2175
- e-mail: ldc@unagi.cis.upenn.edu
-
-
- Center for Spoken Language Understanding (CSLU)
- ===============================================
-
- 1. The ISOLET speech database of spoken letters of the English alphabet.
- The speech is high quality (16 kHz with a noise cancelling microphone).
- 150 speakers x 26 letters of the English alphabet twice in random order.
- The "ISOLET" data base can be purchased for $100 by sending an email request
- to vincew@cse.ogi.edu. (This covers handling, shipping and medium costs).
- The data base comes with a technical report describing the data.
-
- 2. CSLU has a telephone speech corpus of 1000 English alphabets. Callers
- recite the alphabet with brief pauses between letters. This database is
- available to not-for-profit institutions for $100. The data base is described
- in the proceedings of the International Conference on Spoken Language
- Processing. Contact vincew@cse.ogi.edu if interested.
-
-
- PhonDat - A Large Database of Spoken German
- ===========================================
-
- The PhonDat continuous speech corpora are now available on
- CD-ROM media (ISO 9660 format).
-
- PhonDat I (Diphone Corpus) : 6 CDs (1140.- DM)
- PhonDat II (Train Enquiries Corpus): 1 CD ( 190.- DM)
-
- PhonDat I comprises approx. 20,000 and PhonDat II approx. 1,500 signal
- files in high-quality 16-bit, 16 kHz recordings.
- The corpora come with documentation containing the orthographic
- transcription and a citation form of the utterances, as well as a
- detailed file format description. A narrow phonetic transcription
- is available for selected files from corpus I and II.
-
- For information and orders contact
-
- Barbara Eisen
- Institut fuer Phonetik
- Schellingstr. 3 / II
- D 8000 Munich 40
-
- Tel: +49 / 89 / 2180 -2454 or -2758
- Fax: +49 / 89 / 280 03 62
-
- ------------------------------------------------------------------------
-
- Q1.7: Speech File Formats, Conversion and Playing.
-
- Section 2 of this FAQ has information on mu-law coding.
-
- A very good and very comprehensive list of audio file formats is prepared
- by Guido van Rossum. The list is posted regularly to comp.dsp and
- alt.binaries.sounds.misc, amongst others. It includes information on
- sampling rates, hardware, compression techniques, file format definitions,
- format conversion, standards, programming hints and lots more. It is much
- too long to include within this posting.
-
- It is also available by ftp
- from: ftp.cwi.nl
- directory: /pub
- file: AudioFormats<version>
-
- ------------------------------------------------------------------------
-
- Q1.8: What "Speech Laboratory Environments" are available?
-
- First, what is a Speech Laboratory Environment? A speech lab is a
- software package which provides the capability of recording, playing,
- analysing, processing, displaying and storing speech. Your computer
- will require audio input/output capability. The different packages
- vary greatly in features and capability - best to know what you want
- before you start looking around.
-
- Most general purpose audio processing packages will be able to process speech
- but do not necessarily have some specialised capabilities for speech (e.g.
- formant analysis).
-
- The following article provides a good survey.
-
- Read, C., Buder, E., & Kent, R. "Speech Analysis Systems: An Evaluation"
- Journal of Speech and Hearing Research, pp 314-332, April 1992.
-
-
- Package: Entropic Signal Processing System (ESPS) and Waves
- Platform: Range of Unix platforms.
- Description: ESPS is a very comprehensive set of speech analysis/processing
- tools for the UNIX environment. The package includes UNIX commands,
- and a comprehensive C library (which can be accessed from other
- languages). Waves is a graphical front-end for speech processing.
- Speech waveforms, spectrograms, pitch traces etc can be displayed,
- edited and processed in X windows and Openwindows (versions 2 & 3).
- The HTK (Hidden Markov Model Toolkit) is now available from Entropic.
- HTK is described in some detail in Section 5 of this FAQ - the
- section on Speech Recognition.
- Cost: On request.
- Contact: Entropic Research Laboratory, Washington Research Laboratory,
- 600 Pennsylvania Ave, S.E. Suite 202, Washington, D.C. 20003
- (202) 547-1420. email - info@wrl.epi.com
-
-
- Package: CSRE: Canadian Speech Research Environment
- Platform: IBM/AT-compatibles
- Description: CSRE is a comprehensive, microcomputer-based system designed
- to support speech research. CSRE provides a powerful, low-cost
- facility in support of speech research, using mass-produced and
- widely-available hardware. The project is non-profit, and relies
- on the cooperation of researchers at a number of institutions and
- fees generated when the software is distributed. Functions
- include speech capture, editing, and replay; several alternative
- spectral analysis procedures, with color and surface/3D displays;
- parameter extraction/tracking and tools to automate measurement
- and support data logging; alternative pitch-extraction systems;
- parametric speech (KLATT80) and non-speech acoustic synthesis,
- with a variety of supporting productivity tools; and a
- comprehensive experiment generator, to support behavioral testing
- using a variety of common testing protocols.
- A paper about the whole package can be found in:
- Jamieson D.G. et al, "CSRE: A Speech Research Environment",
- Proc. of the Second Intl. Conf. on Spoken Language Processing,
- Edmonton: University of Alberta, pp. 1127-1130.
- Hardware: Can use a range of data acquisition/DSP
- Cost: Distributed on a cost recovery basis.
- Availability: For more information on availability
- contact Krystyna Marciniak - email march@uwovax.uwo.ca
- Tel (519) 661-3901 Fax (519) 661-3805.
- For technical information - email ramji@uwovax.uwo.ca
- Note: Also included in Q4.4 on speech synthesis packages.
-
-
- Package: OGI Speech Tools from the Center for Spoken Language
- Understanding (CSLU) at the Oregon Graduate Institute of Science
- and Technology (Portland Oregon)
- Platform: Unix????
- Description: The OGI Speech tools include :-
- 1. An X windows display tool (LYRE) for displaying data in a time
- synchronous fashion for a. the speech signal b. spectrograms
- c. phoneme labels, and other information.
- 2. A Neural Network (NOPT) training package.
- 3. A set of C library routines (LIBNSPEECH) for the manipulation
- of speech data, including: a. PLP Analysis, b. Rasta PLP
- Analysis, c. Linear Predictive Coding, d. Mel Cepstrum Coding,
- e. Fast Fourier Transform
- 4. A set of utilities for converting file formats such as ADC, NIST,
- mu-law, binary files, and ascii. Includes filtering.
- 5. A set of Perl scripts which have been used mainly to automate
- the use of the OGI Speech Tools.
- 6. Routines to play speech on different machines: Sun Sparc,
- decmips, Gradient box (general purpose atod/dtoa), Macintosh.
- 7. MAN Pages for all routines and programs developed, as well as
- a user manual in both PostScript and TeX format.
- Misc: Software is written in ANSI C.
- Availability: By anonymous ftp from
- lydia.cse.ogi.edu:/pub/tools/*
- Contact: Try tools@cse.ogi.edu ?
-
-
- Package: Signalyze 2.0 from InfoSignal
- Platform: Macintosh
- Description: Signalyze's basic conception revolves around up to 100
- signals, displayed synchronously in HyperCard fashion on "cards".
- The program offers a full complement of signal editing features,
- quite a few spectral analysis tools, manual scoring tools, pitch
- extraction routines, a good set of signal manipulation tools, and
- extensive input-output capacity.
- Handles multiple file formats: Signalyze, MacSpeech Lab, AudioMedia,
- SoundDesigner II, SoundEdit/MacRecorder, SoundWave, three sound
- resource formats, and ASCII-text.
- Sound I/O: Direct sound input from MacRecorder and similar devices,
- AudioMedia, AudioMedia II and AD IN, some MacADIOS boards and devices,
- Apple sound input (built-in microphone). Sound output via Macintosh
- internal sound, some MacADIOS boards and devices as well as via the
- Digidesign 16-bit boards.
- Compatibility: MacPlus and higher (including II, IIx, IIcx, IIci, IIfx,
- IIvx, IIvi, Portable, all PowerBooks, Centris and Quadras). Takes
- advantage of large and multiple screens and 16/256 color/grayscales.
- System 7.0 compatible. Runs in background with adjustable priority.
- Misc: A demo available upon request.
- Manuals and tutorial included.
- It is available in English, French, and German.
- Cost: Individual licence US$350, site license US$500, plus shipping.
- Contact: North America - Network Technology Corporation
- 91 Baldwin St., Charlestown MA 02129
- Fax: 617-241-5064 Phone: 617-241-9205
- Elsewhere - InfoSignal Inc.
- C.P. 73, 1015 LAUSANNE, Switzerland,
- FAX: +41 21 691-1372,
- Email: 76357.1213@COMPUSERVE.COM.
-
-
- Package: Kay Elemetrics CSL (Computer Speech Lab) 4300
- Platform: Minimum IBM PC-AT compatible with extended memory (min 2MB)
- with at least VGA graphics. Optimal would be 386 or 486 machine
- with more RAM for handling larger amounts of data.
- Description: Speech analysis package, with optional separate LPC program
- for analysis/synthesis. Uses its own file format for data, but has
- some ability to export data as ascii. The main editing/analysis prog
- (but not the LPC part) has its own macro language, making it easy to
- perform repetitive tasks. Probably not much use without the extra
- LPC program, which also allows manipulation of pitch, formant and
- bandwidth parameters.
- Hardware includes an internal DSP board for the PC (requires ISA
- slot), and an external module containing signal processing chips
- which does A/D and D/A conversion.
- A speaker and microphone are supplied.
- Misc: A programmers kit is available for programming signal processing
- chips (experts only).
- Manuals included.
- Cost: Recently approx 6000 pounds sterling. (Less in USA?)
- Availability: UK distributors are Wessex Electronics,
- 114-116 North Street, Downend, Bristol, B16 5SE
- Tel: 0272 571404.
- In USA: Kay Elemetrics Corp,
- 12 Maple Avenue, PO Box 2025, Pine Brook, NJ 07058-9798
- Tel:(201) 227-7760
-
-
- Package: MacSpeech Lab II (MSL II)
- Platform: Macintosh
- Description: A sound analysis and acquisition package for Macs. MSL II delivers
- the most common functions for speech analysis (FFTs, LPCs, f0
- extraction, etc.) & produces grayscale spectrographic displays.
- Can be used for various speech technology and phonetic training
- tasks. The software can trade off accuracy and speed.
- Hardware: requires MacADIOS ("Macintosh Analog/Digital Input/Output
- System") hardware for speech I/O at 12/16 bits.
- Misc: Software no longer updated by GW Instruments; MSL soft/hardware will
- not perform input/output on Quadras, for example, though analysis
- seems fine. Known to operate properly on systems as high as IIcx &
- II fx.
- Cost: $4990 (in May '92 price list; no MSL soft/hardware package
- listed in January '93).
- Contact: GW Instruments
- 35 Medford Street, Somerville, MA 02143
- Phone: (617) 625-4096 Fax: (617) 625-1322
-
-
- Package: Ptolemy
- Platform: Sun SPARC, DecStation (MIPS), HP (hppa).
- Description: Ptolemy provides a highly flexible foundation for the
- specification, simulation, and rapid prototyping of systems.
- It is an object oriented framework within which diverse models
- of computation can co-exist and interact. Ptolemy can be used
- to model entire systems.
- Ptolemy has been used for a broad range of applications including
- signal processing, telecommunications, parallel processing, wireless
- communications, network design, radio astronomy, real time systems,
- and hardware/software co-design. Ptolemy has also been used as a lab
- for signal processing and communications courses.
- Ptolemy has been developed at UC Berkeley over the past 3 years.
- Further information, including papers and the complete release
- notes, is available from the FTP site.
- Cost: Free
- Availability: The source code, binaries, and documentation are available
- by anonymous ftp from "ptolemy.berkeley.edu" - see the README file -
- ptolemy.berkeley.edu:/pub/README
-
-
- Package: Khoros
- Description: Public domain image processing package with a basic DSP
- library. Not particularly applicable to speech, but not bad
- for the price.
- Cost: FREE
- Availability: By anonymous ftp from pprg.eece.unm.edu
-
-
- Can anyone provide information on capability and availability of the
- following packages?
-
- ILS ("Interactive Laboratory System")
- SpeechViewer (PC)
-
-
-
- =======================================================================
-
- PART 2 - Signal Processing for Speech
-
- Q2.1: What speech sampling and signal processing hardware can I use?
-
- In addition to the following information, have a look at the Audio File
- format document prepared by Guido van Rossum (see details in Section 1.7).
-
-
- Product: Sun standard audio port (SPARC 1 & 2)
- Input: 1 channel, 8 bit mu-law encoded (telephone quality)
- Output: 1 channel, 8 bit mu-law encoded (telephone quality)
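
For readers unfamiliar with the term, "8 bit mu-law encoded" means that samples are compressed logarithmically before being quantised to 8 bits, so that quiet sounds keep more resolution than a linear 8-bit scale would give them. The sketch below illustrates the continuous mu-law companding curve with mu = 255. It is an assumption-laden illustration only: the function names are hypothetical, and the 0..255 code used here is a plain uniform quantisation of the companded value, not the bit layout used by telephone codecs or by the Sun audio device.

```python
import math

MU = 255.0  # companding constant for 8-bit mu-law


def mulaw_compress(x):
    """Map a linear sample in [-1, 1] to a companded value in [-1, 1]."""
    x = max(-1.0, min(1.0, x))
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)


def mulaw_expand(y):
    """Inverse of mulaw_compress."""
    y = max(-1.0, min(1.0, y))
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)


def to_8bit(y):
    """Uniformly quantise a companded value in [-1, 1] to an integer 0..255."""
    return int(round((y + 1.0) / 2.0 * 255.0))


def from_8bit(code):
    """Map an integer 0..255 back to a companded value in [-1, 1]."""
    return code / 255.0 * 2.0 - 1.0
```

A quiet sample such as 0.01 survives the round trip from_8bit(to_8bit(mulaw_compress(0.01))) followed by mulaw_expand with an error far smaller than one step of a linear 8-bit scale (2/255), which is the point of the companding.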
-
-
- Product: Ariel
- Platform: Sun + others?
- Input: 2 channels, 16bit linear, sample rate 8-96kHz (inc 32, 44.1, 48kHz).
- Output: 2 channels, 16bit linear, sample rate 8-50kHz (inc 32, 44.1, 48kHz).
- Contact: Ariel Corp., 433 River Road,
- Highland Park, NJ 08904.
- Ph: 908-249-2900 Fax: 908-249-2123 DSP BBS: 908-249-2124
-
-
- Product: IBM RS/6000 ACPA (Audio Capture and Playback Adapter)
- Description: The card supports PCM, Mu-Law, A-Law and ADPCM at 44.1kHz
- (& 22.05, 11.025, 8kHz) with 16-bits of resolution in stereo.
- The card has a built-in DSP (don't know which one). The device
- also supports various formats for the output data, like big-endian,
- twos complement, etc. Good noise immunity.
- The card is used for IBM's VoiceServer (they use the DSP for
- speech recognition). Apparently, the IBM voiceserver has a
- speaker-independent vocabulary of over 20,000 words and each
- ACPA can support two independent sessions at once.
- Cost: $US495
- Contact: ?
-
- Product: Sound Galaxy NX , Aztech Systems
- Platform: PC - DOS,Windows 3.1
- Cost: ??
- Input: 8bit linear, 4-22 kHz.
- Output: 8bit linear, 4-44.1 kHz
- Misc: 11-voice FM Music Synthesizer YM3812; Built-in power amplifier;
- DSP signal processing support - ST70019SB
- Hardware ADPCM decompression (2:1,3:1,4:1)
- Full "AdLib" and "Sound Blaster" compatibility.
- Software includes a simple Text-to-Speech program "Monologue".
-
-
- Product: Sound Galaxy NX PRO, Aztech Systems
- Platform: PC - DOS,Windows 3.1
- Cost: ??
- Input: 2 * 8bit linear, 4-22.05 kHz(stereo), 4-44.1 KHz(mono).
- Output: 2 * 8bit linear, 4-44.1 kHz(stereo/mono)
- Misc: 20-voice FM Music Synthesizer; Built-in power amplifier;
- Stereo Digital/Analog Mixer; Configuration in EEPROM.
- Hardware ADPCM decompression (2:1,3:1,4:1).
- Includes DSP signal processing support
- Full "AdLib" and "Sound Blaster Pro II" compatibility.
- Software includes a simple Text-to-Speech program "Monologue"
- and Sampling laboratory for Windows 3.1: WinDAT.
- Contact: USA (510) 623-8988
-
-
- Other PC Sound Cards
- ==============================================================================
- sound          stereo/mono              compatible     included       voices
- card           & sample rate            with           ports
- ==============================================================================
- Adlib Gold     stereo:  8-bit 44.1kHz   Adlib ?        audio in/out,  20 (opl3)
- 1000                   16-bit 44.1kHz                  mic in,        +2 digital
-                mono:    8-bit 44.1kHz                  joystick,      channels
-                        16-bit 44.1kHz                  MIDI
-
- Sound Blaster  mono:    8-bit 22.1kHz   Adlib          audio in/out,  11 synth.
-                                         FM synth with  joystick,
-                                         2 operators
-
- Sound Blaster  stereo:  8-bit 22.05kHz  Adlib          audio in/out,  22
- Pro Basic      mono:    8-bit 44.1kHz   Sound Blaster  joystick,
-
- Sound Blaster  stereo:  8-bit 22.05kHz  Adlib          audio in/out,  11
- Pro            mono:    8-bit 44.1kHz   Sound Blaster  joystick,
-                                                        MIDI, SCSI
-
- Sound Blaster  stereo:  8-bit 4-44.1kHz Sound Blaster  audio in/out,  20
- 16 ASP         stereo: 16-bit 4-44.1kHz                joystick,
-                                                        MIDI
-
- Audio Port     mono:    8-bit 22.05kHz  Adlib          audio in/out,  11
-                                         Sound Blaster  joystick
-
- Pro Audio      stereo:  8-bit 44.1kHz   Adlib          audio in/out,  20
- Spectrum +                              Pro Audio      joystick
-                                         Spectrum
-
- Pro Audio      stereo: 16-bit 44.1kHz   Adlib          audio in/out,  20
- Spectrum 16                             Pro Audio      joystick,
-                                         Spectrum       MIDI, SCSI
-                                         Sound Blaster
-
- Thunder Board  stereo:  8-bit 22kHz     Adlib          audio in/out,  11
-                                         Sound Blaster  joystick
-
- Gravis         stereo:  8-bit 44.1kHz   Adlib,         audio line     32 sampled
- Ultrasound     mono:    8-bit 44.1kHz   Sound Blaster  in/out,        32 synth.
-                (w/16-bit daughtercard)                 amplified out,
-                stereo: 16-bit 44.1kHz                  mic in, CD
-                mono:   16-bit 44.1kHz                  audio in,
-                                                        daughterboard
-                                                        ports (for
-                                                        SCSI and
-                                                        16-bit)
-
- MultiSound     stereo: 16-bit 44.1kHz   Nothing        audio in/out,  32 sampled
-                64x oversampling                        joystick,
-                                                        MIDI
-
- ==============================================================================
-
-
- Can anyone provide information on Mac, NeXT and other hardware?
-
- Product: xxx
- Platform: PC, Mac, Sun, ...
- Rough Cost (pref $US):
- Input: e.g. 16bit linear, 8,10,16,32kHz.
- Output: e.g. 16bit linear, 8,10,16,32kHz.
- DSP: signal processing support
- Other:
- Contact:
-
- ------------------------------------------------------------------------
-
- Q2.2: What signal processing techniques are used for speech technology?
-
- This question is far too big to be answered in a FAQ posting. Fortunately,
- there are many good books which answer the question!
-
- Some good introductory books include
-
- Digital processing of speech signals; L. R. Rabiner, R. W. Schafer.
- Englewood Cliffs; London: Prentice-Hall, 1978
-
- Voice and Speech Processing; T. W. Parsons.
- New York; McGraw Hill 1986
-
- Computer Speech Processing; ed Frank Fallside, William A. Woods
- Englewood Cliffs: Prentice-Hall, c1985
-
- Digital speech processing : speech coding, synthesis, and recognition
- edited by A. Nejat Ince; Kluwer Academic Publishers, Boston, c1992
-
- Speech science and technology; edited by Shuzo Saito
- pub. Ohmsha, Tokyo, c1992
-
- Speech analysis; edited by Ronald W. Schafer, John D. Markel
- New York, IEEE Press, c1979
-
- Douglas O'Shaughnessy -- Speech Communication: Human and Machine
- Addison Wesley series in Electrical Engineering: Digital Signal Processing,
- 1987.
-
- ------------------------------------------------------------------------
-
- Q2.3: How do I find the pitch of a speech signal?
-
- This topic comes up regularly in the comp.dsp newsgroup. Question 2.5
- of the FAQ posting for comp.dsp gives a comprehensive list of references
- on the definition, perception and processing of pitch.
-
- ------------------------------------------------------------------------
-
- Q2.4: How do I convert to/from mu-law format?
-
- Mu-law coding is a form of compression for audio signals including speech.
- It is widely used in the telecommunications field because it improves the
- signal-to-noise ratio without increasing the amount of data. Typically,
- mu-law compressed speech is carried in 8-bit samples. It is a companding
- technique, which means that it carries more information about smaller
- signals than about larger signals. Mu-law coding is provided as standard
- for the audio input and output of the Sun SPARCstation 1 and 2
- (SPARCstation 10s use linear samples).
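
For reference, the companding curve is usually written in textbooks as the
continuous function below, where x is the input normalised to the range
-1..1 and mu = 255 for the telephony standard. This is the idealised curve
only; G.711 and the sample C code in this answer use a piecewise-linear
(segmented) approximation to it, not this formula directly.

```latex
F(x) = \operatorname{sgn}(x)\,\frac{\ln\bigl(1 + \mu\,|x|\bigr)}{\ln(1 + \mu)},
\qquad -1 \le x \le 1, \quad \mu = 255
```

The logarithm is what gives small amplitudes finer quantisation steps than
large ones.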
-
-
- On Sun SPARC systems, have a look in the directory /usr/demo/SOUND. Included
- are table lookup macros for ulaw conversions. [Note however that not all
- systems will have /usr/demo/SOUND installed as it is optional - see your
- system admin if it is missing.]
-
-
- OR, here is some sample conversion code in C.
-
- #include <stdio.h>
-
- unsigned char linear2ulaw(int sample);
- int ulaw2linear(unsigned char ulawbyte);
-
- /*
- ** This routine converts from linear to ulaw.
- **
- ** Craig Reese: IDA/Supercomputing Research Center
- ** Joe Campbell: Department of Defense
- ** 29 September 1989
- **
- ** References:
- ** 1) CCITT Recommendation G.711 (very difficult to follow)
- ** 2) "A New Digital Technique for Implementation of Any
- **    Continuous PCM Companding Law," Villeret, Michel,
- **    et al. 1973 IEEE Int. Conf. on Communications, Vol 1,
- **    1973, pg. 11.12-11.17
- ** 3) MIL-STD-188-113,"Interoperability and Performance Standards
- **    for Analog-to-Digital Conversion Techniques,"
- **    17 February 1987
- **
- ** Input: Signed 16 bit linear sample
- ** Output: 8 bit ulaw sample
- */
-
- #define ZEROTRAP    /* turn on the trap as per the MIL-STD */
- #undef ZEROTRAP     /* ...but leave it off here; delete this line to enable it */
- #define BIAS 0x84   /* define the add-in bias for 16 bit samples */
- #define CLIP 32635  /* largest magnitude for which sample + BIAS fits in 15 bits */
-
- unsigned char linear2ulaw(int sample)
- {
-     /* exponent (segment number), indexed by bits 7-14 of the biased magnitude */
-     static const int exp_lut[256] = {
-         0,0,1,1,2,2,2,2,3,3,3,3,3,3,3,3,
-         4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,
-         5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,
-         5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,
-         6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,
-         6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,
-         6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,
-         6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,
-         7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,
-         7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,
-         7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,
-         7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,
-         7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,
-         7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,
-         7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,
-         7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7
-     };
-     int sign, exponent, mantissa;
-     unsigned char ulawbyte;
-
-     /* Get the sample into sign-magnitude. */
-     sign = (sample >> 8) & 0x80;          /* set aside the sign */
-     if (sign != 0) sample = -sample;      /* get magnitude */
-     if (sample > CLIP) sample = CLIP;     /* clip the magnitude */
-
-     /* Convert from 16 bit linear to ulaw. */
-     sample = sample + BIAS;
-     exponent = exp_lut[(sample >> 7) & 0xFF];
-     mantissa = (sample >> (exponent + 3)) & 0x0F;
-     ulawbyte = (unsigned char)~(sign | (exponent << 4) | mantissa);
- #ifdef ZEROTRAP
-     if (ulawbyte == 0) ulawbyte = 0x02;   /* optional CCITT trap */
- #endif
-
-     return ulawbyte;
- }
-
- /*
- ** This routine converts from ulaw to 16 bit linear.
- **
- ** Craig Reese: IDA/Supercomputing Research Center
- ** 29 September 1989
- **
- ** References:
- ** 1) CCITT Recommendation G.711 (very difficult to follow)
- ** 2) MIL-STD-188-113,"Interoperability and Performance Standards
- **    for Analog-to-Digital Conversion Techniques,"
- **    17 February 1987
- **
- ** Input: 8 bit ulaw sample
- ** Output: signed 16 bit linear sample
- */
-
- int ulaw2linear(unsigned char ulawbyte)
- {
-     /* linear value at the start of each segment, with BIAS already removed */
-     static const int exp_lut[8] = { 0, 132, 396, 924, 1980, 4092, 8316, 16764 };
-     int sign, exponent, mantissa, sample;
-
-     ulawbyte = (unsigned char)~ulawbyte;  /* ulaw bytes are stored inverted */
-     sign = (ulawbyte & 0x80);
-     exponent = (ulawbyte >> 4) & 0x07;
-     mantissa = ulawbyte & 0x0F;
-     sample = exp_lut[exponent] + (mantissa << (exponent + 3));
-     if (sign != 0) sample = -sample;
-
-     return sample;
- }
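
Since there are only 256 possible mu-law bytes, decoding is often done by
table lookup, as in the Sun macros mentioned above. The following is a
minimal sketch of that idea, not the Sun code: it repeats the arithmetic of
ulaw2linear() so that it stands alone, and the names ulaw_decode() and
build_ulaw_table() are made up for this illustration.

```c
#include <assert.h>
#include <stdio.h>

/* Decode one mu-law byte to a signed linear sample, using the same
   arithmetic as ulaw2linear() in the sample code above. */
static int ulaw_decode(unsigned char ulawbyte)
{
    static const int exp_lut[8] = { 0, 132, 396, 924, 1980, 4092, 8316, 16764 };
    unsigned int u = (unsigned char)~ulawbyte;  /* bytes are stored inverted */
    int exponent = (u >> 4) & 0x07;
    int mantissa = u & 0x0F;
    int sample = exp_lut[exponent] + (mantissa << (exponent + 3));

    return (u & 0x80) ? -sample : sample;
}

/* Hypothetical helper: fill a 256-entry decode table once at start-up.
   Afterwards a sample is decoded with a single lookup, table[byte]. */
static void build_ulaw_table(short table[256])
{
    int i;

    assert(table != NULL);
    for (i = 0; i < 256; i++)
        table[i] = (short)ulaw_decode((unsigned char)i);
}
```

The largest decoded magnitude is 32124, so every entry fits in a 16-bit
short. An encode table is less attractive because it would need one entry
per possible 16-bit input.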
-
- =======================================================================
-
- PART 3 - Speech Coding and Compression
-
- Q3.1: Speech compression techniques.
-
- Can anyone provide a 1-2 page summary on speech compression? Topics to
- cover might include common techniques, where speech compression might be
- used and perhaps something on why speech is difficult to compress.
-
- [The FAQ for comp.compression includes a few questions and answers
- on the compression of speech.]
-
- ------------------------------------------------------------------------
-
- Q3.2: What are some good references/books on coding/compression?
-
- Douglas O'Shaughnessy -- Speech Communication: Human and Machine
- Addison Wesley series in Electrical Engineering: Digital Signal
- Processing, 1987.
-
- ------------------------------------------------------------------------
-
- Q3.3: What software is available?
-
- Note: there are two types of speech compression technique referred to below.
- Lossless techniques preserve the speech exactly through a
- compression-decompression cycle. Lossy techniques do not preserve the speech
- perfectly. As a general rule, the more you compress speech, the more the
- quality degrades.
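
To make "lossless" concrete, here is a small sketch in C of a round trip
through a first-order predictor: each sample is replaced by its difference
from the previous one, and the decoder adds the differences back up. This is
only an illustration under simple assumptions. It is not the algorithm used
by shorten or by any other package listed here, the function names are
invented, and it omits the entropy coding stage that would actually save
space by storing the (usually small) differences in fewer bits.

```c
#include <assert.h>
#include <stddef.h>

/* Encode: store each sample as its difference from the previous sample.
   The difference of two 16-bit samples can need 17 bits, so it is kept
   in a long. */
static void delta_encode(const short *in, long *out, size_t n)
{
    long prev = 0;
    size_t i;

    assert(in != NULL && out != NULL);
    for (i = 0; i < n; i++) {
        out[i] = (long)in[i] - prev;
        prev = in[i];
    }
}

/* Decode: a running sum of the differences restores the samples exactly. */
static void delta_decode(const long *in, short *out, size_t n)
{
    long prev = 0;
    size_t i;

    assert(in != NULL && out != NULL);
    for (i = 0; i < n; i++) {
        prev += in[i];
        out[i] = (short)prev;
    }
}
```

A lossy coder, by contrast, would quantise the differences (or model
parameters derived from the speech), so the decoded samples would only
approximate the input.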
-
-
- Package: shorten - a lossless compressor for speech signals
- Platform: UNIX/DOS
- Description: A lossless compressor for speech signals. It will compile and
- run on UNIX workstations and will cope with a wide variety of
- formats. Compression is typically 50% for 16bit clean speech
- sampled at 16kHz.
- Availability: Anonymous ftp svr-ftp.eng.cam.ac.uk: /misc/shorten-1.09.tar.Z
-
-
- Package: CELP 3.2a & LPC
- Platform: Sun (the makefiles & source can be modified for other platforms)
- Description: CELP is a lossy compression technique.
- The U.S. DoD's Federal-Standard-1016 based 4800 bps code excited
- linear prediction voice coder version 3.2a (CELP 3.2a) Fortran and
- C simulation source codes. Available for worldwide distribution
- (on DOS diskettes, but configured to compile on Sun SPARC stations)
- from NTIS and DTIC. Example input and processed speech files are
- included. A Technical Information Bulletin (TIB), "Details to Assist
- in Implementation of Federal Standard 1016 CELP," and the official
- standard, "Federal Standard 1016, Telecommunications: Analog to
- Digital Conversion of Radio Voice by 4,800 bit/second Code Excited
- Linear Prediction (CELP)," are also available.
-
- Availability 1: Through the National Technical Information Service:
- NTIS
- U.S. Department of Commerce
- 5285 Port Royal Road,
- Springfield, VA 22161, USA
-
- The "AD" ordering number for the CELP software is AD M000 118
- (US$ 90.00) and for the TIB it's AD A256 629 (US$ 17.50).
- The LPC-10 standard, described below, is FIPS Pub 137 (US$ 12.50).
- There is a $3.00 shipping charge on all U.S. orders. The telephone
- number for their automated system is 703-487-4650, or 703-487-4600
- if you'd prefer to talk with a real person.
-
- (U.S. DoD personnel and contractors can receive the package from the
- Defense Technical Information Center: DTIC, Building 5, Cameron
- Station, Alexandria, VA 22304-6145. Their telephone number is
- 703-274-7633.)
-
- Availability 2: By anonymous ftp from:
- super.org (192.31.192.1):/pub/celp_3.2a.tar.Z
- OR
- svr-ftp.eng.cam.ac.uk:comp.speech/sources/celp_3.2a.tar.Z
-
- Misc: The following articles describe the Federal-Standard-1016 4.8-kbps
- CELP coder (it's unnecessary to read more than one):
-
- Campbell, Joseph P. Jr., Thomas E. Tremain and Vanoy C. Welch,
- "The Federal Standard 1016 4800 bps CELP Voice Coder," Digital Signal
- Processing, Academic Press, 1991, Vol. 1, No. 3, p. 145-155.
-
- Campbell, Joseph P. Jr., Thomas E. Tremain and Vanoy C. Welch,
- "The DoD 4.8 kbps Standard (Proposed Federal Standard 1016),"
- in Advances in Speech Coding, ed. Atal, Cuperman and Gersho,
- Kluwer Academic Publishers, 1991, Chapter 12, p. 121-133.
-
- Campbell, Joseph P. Jr., Thomas E. Tremain and Vanoy C. Welch, "The
- Proposed Federal Standard 1016 4800 bps Voice Coder: CELP," Speech
- Technology Magazine, April/May 1990, p. 58-64.
-
- * The U.S. DoD's Federal-Standard-1015/NATO-STANAG-4198 based 2400
- bps linear prediction coder (LPC-10) was republished as a Federal
- Information Processing Standards Publication 137 (FIPS Pub 137).
- It is described in:
-
- Thomas E. Tremain, "The Government Standard Linear Predictive Coding
- Algorithm: LPC-10," Speech Technology Magazine, April 1982, p. 40-49.
-
- There is also a section about FS-1015 in the book:
- Panos E. Papamichalis, Practical Approaches to Speech Coding,
- Prentice-Hall, 1987.
-
- * The voicing classifier used in the enhanced LPC-10 (LPC-10e) is
- described in: Campbell, Joseph P., Jr. and T. E. Tremain, "Voiced/
- Unvoiced Classification of Speech with Applications to the U.S.
- Government LPC-10E Algorithm," Proceedings of the IEEE International
- Conf. on Acoustics, Speech, and Signal Processing, 1986, p. 473-6.
-
- * Copies of the official standard, "Federal Standard 1016, Tele-
- communications: Analog to Digital Conversion of Radio Voice by 4,800
- bit/second Code Excited Linear Prediction (CELP)" are available for
- US$ 5.00 each from:
- GSA Federal Supply Service Bureau
- Specification Section, Suite 8100
- 470 E. L'Enfant Place, S.W.
- Washington, DC 20407
- (202)755-0325
-
- * Realtime DSP code for FS-1015 and FS-1016 is sold by:
-
- John DellaMorte, DSP Software Engineering
- 165 Middlesex Tpk, Suite 206
- Bedford, MA 01730, USA
- Ph: 1-617-275-3733 Fax: 1-617-275-4323
- dspse.bedford@channel1.com
-
- * DSP Software Engineering's FS-1016 code can run on a DSP Research's
- Tiger 30 (a PC board with a TMS320C3x and analog interface suited
- to development work).
-
- DSP Research
- 1095 E. Duane Ave.
- Sunnyvale, CA 94086, USA
- Ph: (408)773-1042 Fax: (408)736-3451
-
-
-
- Package: 32 kbps ADPCM
- Platform: SGI and Sun Sparcs
- Description: 32 kbps ADPCM C-source code (G.721 compatibility is uncertain)
- Contact: Jack Jansen
- Availability: Anonymous ftp to ftp.cwi.nl: pub/adpcm.shar
-
-
- Package: GSM 06.10 Compression
- Platform: Runs faster than real time on most Sun SPARCstations
- Description: GSM 06.10 is a lossy compression technique.
- European GSM 06.10 provisional standard for full-rate speech
- transcoding, prI-ETS 300 036, which uses RPE/LTP (regular
- pulse excitation/long term prediction) coding at 13 kbit/s.
- Contact: Carsten Bormann <cabo@cs.tu-berlin.de>
- Availability: An implementation can be ftp'ed from:
- tub.cs.tu-berlin.de: /pub/tubmik/gsm-1.0.tar.Z
- +/pub/tubmik/gsm-1.0-patch1
- or as a faster but not always up-to-date alternative:
- liasun3.epfl.ch: /pub/audio/gsm-1.0pl1.tar.Z
-
- Package: G.721/722/723 Compression
- Description: ?
- Availability: By email to teledoc@itu.arcom.ch, with
- GET ITU-3022
- as the *only* line in the body of the message.
- This is also available by anonymous ftp from:
- svr-ftp.eng.cam.ac.uk:comp.speech/sources/G711_G722_G723.tar.Z
-
-
- Package: U.S.F.S. 1016 CELP vocoder for DSP56001
- Platform: DSP56001
- Description: Real-time U.S.F.S. 1016 CELP vocoder that runs on a single
- 27MHz Motorola DSP56001. Free demo software available from PC-56
- and PC-56D. Source and object code available for a one-time
- license fee.
- Contact: Cole Erskine
- Analogical Systems
- 2916 Ramona St.
- Palo Alto, CA 94306, USA
- Tel:(415) 323-3232 FAX:(415) 323-4222
- Internet: cole@analogical.com
-
-
- Product: 8 Kbit/s CELP on the TMS320C5x family of DSP chips.
- Description: For low bandwidth transmission of voice, compact voice storage
- for archival purposes, low-cost digital answering machines and
- efficient storage for voice mail. Features :-
- - near toll quality at 8 Kb/s.
- - Variable rate option with 1 Kb/s silence encoding
- - Implemented on a fixed-point processor for lower system cost.
- - Attractive licensing scheme.
- - Future availability of 4 Kb/s.
- - Custom rates possible.
- Capacity :-
- - Two half-duplex or one full duplex channels on the 20 MIPS 'C5x
- (at 95% and 55% CPU utilization respectively).
- - Two full duplex channels on the 28.6 MIPS 'C5x
- (at 77% CPU utilization).
- - Requires 9 K-words program memory and 3 K-words data memory.
- - Decoding in real-time on a 486 class CPU.
- Contact: CVI Inc.
- 443 Vienna Cres. North Vancouver, BC, Canada V7N 3B3
- Tel: (604) 987 1719 Fax: (604) 986 8139
- Email: cvi@extropia.wimsey.com
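
The bit rates quoted in the entries above translate directly into storage
requirements. The helper below is a trivial sketch (the function name is
invented here) that converts a bit rate and a duration into bytes.

```c
#include <assert.h>

/* Bytes needed to store `seconds` of speech coded at `bits_per_second`. */
static long storage_bytes(long bits_per_second, long seconds)
{
    assert(bits_per_second >= 0 && seconds >= 0);
    return bits_per_second * seconds / 8;
}
```

For one minute of speech this gives 18,000 bytes at 2400 bps (LPC-10),
36,000 bytes at 4800 bps (FS-1016 CELP), 97,500 bytes at 13 kbit/s
(GSM 06.10) and 240,000 bytes at 32 kbps (ADPCM). For comparison, 8-bit
mu-law at an assumed 8 kHz sampling rate is 64 kbit/s, or 480,000 bytes per
minute.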
-
-
- =======================================================================
-
- PART 4 - Speech Synthesis
-
- Q4.1: What is speech synthesis?
-
- Speech synthesis is the task of transforming written input to spoken output.
- The input can either be provided in a graphemic/orthographic or a phonemic
- script, depending on its source.
-
- ------------------------------------------------------------------------
-
- Q4.2: How can speech synthesis be performed?
-
- There are several algorithms. The choice depends on the task they're used
- for. The easiest way is to just record the voice of a person speaking the
- desired phrases. This is useful if only a restricted volume of phrases and
- sentences is used, e.g. messages in a train station, or schedule information
- via phone. The quality depends on the way recording is done.
-
- More sophisticated, but lower in quality, are algorithms which split the
- speech into smaller pieces. The smaller those units are, the fewer of them
- are needed, but the quality also decreases. An often used unit is the phoneme,
- the smallest linguistic unit. Depending on the language used there are about
- 35-50 phonemes in western European languages, i.e. there are 35-50 single
- recordings. The problem is that combining them into fluent speech requires
- fluent transitions between the elements. The intelligibility is therefore
- lower, but the memory required is small.
-
- A solution to this dilemma is using diphones. Instead of splitting at the
- transitions, the cut is done at the center of the phonemes, leaving the
- transitions themselves intact. The number of elements is then of the order
- of the square of the number of phonemes (about 400 elements for 20 phonemes,
- 20*20), and the quality increases.
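
As a rough illustration of how a diphone synthesiser chooses its units, the
sketch below turns a phoneme sequence into the list of diphone labels that
would have to be fetched and concatenated. The label format ("left-right"),
the use of "_" for silence at the utterance boundaries, and the function
name are all assumptions made for this example, not taken from any of the
systems in this FAQ.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_LABEL 32

/* Write the diphone labels for a sequence of n phonemes into out[0..n].
   Each diphone runs from the middle of one phoneme to the middle of the
   next; "_" stands for silence before and after the utterance.
   Returns the number of diphones, which is n + 1. */
static int phonemes_to_diphones(const char **phonemes, int n,
                                char out[][MAX_LABEL])
{
    int i;

    assert(phonemes != NULL && n > 0);
    for (i = 0; i <= n; i++) {
        const char *left  = (i == 0) ? "_" : phonemes[i - 1];
        const char *right = (i == n) ? "_" : phonemes[i];

        /* Each name is truncated to 14 characters so the label fits. */
        sprintf(out[i], "%.14s-%.14s", left, right);
    }
    return n + 1;
}
```

For the four phonemes h, e, l, ou this yields the five labels _-h, h-e,
e-l, l-ou and ou-_. A real system would then look up the recorded waveform
for each label and smooth the joins.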
-
- The longer the units become, the more elements there are, and both the
- quality and the memory required increase. Other units which are widely used
- are half-syllables, syllables, words, or combinations of them, e.g. word stems
- and inflectional endings.
-
- ------------------------------------------------------------------------
-
- Q4.3: What are some good references/books on synthesis?
-
- The following are good introductory books/articles.
-
- Douglas O'Shaughnessy -- Speech Communication: Human and Machine
- Addison Wesley series in Electrical Engineering: Digital Signal Processing,
- 1987.
-
- D. H. Klatt, "Review of Text-To-Speech Conversion for English", Jnl. of
- the Acoustical Society of America (JASA), v82, Sept. 1987, pp 737-793.
-
- I. H. Witten. Principles of Computer Speech.
- (London: Academic Press, Inc., 1982).
-
- John Allen, Sharon Hunnicutt and Dennis H. Klatt, "From Text to Speech:
- The MITalk System", Cambridge University Press, 1987.
-
- ------------------------------------------------------------------------
-
- Q4.4: What software/hardware is available?
-
- There appears to be very little Public Domain or Shareware speech synthesis
- related software available for FTP. However, the following are available.
- Strictly speaking, not all the following sources are speech synthesis - all
- are speech output systems. They are in no particular order.
-
-
- SIMTEL-20
- The following is a list of speech related software available from SIMTEL-20
- and its mirror sites for PCs.
-
- The SIMTEL internet address is WSMR-SIMTEL20.Army.Mil [192.88.110.20].
- Try looking at your nearest archive site first.
-
- Directory PD1:<MSDOS.VOICE>
- Filename Type Length Date Description
- ==============================================
- AUTOTALK.ARC B 23618 881216 Digitized speech for the PC
- CVOICE.ARC B 21335 891113 Tells time via voice response on PC
- HEARTYPE.ARC B 10112 880422 Hear what you are typing, crude voice synth.
- HELPME2.ARC B 8031 871130 Voice cries out 'Help Me!' from PC speaker
- SAY.ARC B 20224 860330 Computer Speech - using phonemes
- SPEECH98.ZIP B 41003 910628 Build speech (voice) on PC using 98 phonemes
- TALK.ARC B 8576 861109 BASIC program to demo talking on a PC speaker
- TRAN.ARC B 39766 890715 Repeats typed text in digital voice
- VDIGIT.ZIP B 196284 901223 Toolkit: Add digitized voice to your programs
- VGREET.ARC B 45281 900117 Voice says good morning/afternoon/evening
-
-
-
- Package: ORATOR Text-to-Speech Synthesizer
- Platform: SUN SPARC, Decstation 5000. Portable to other UNIX platforms.
- Description: Sophisticated speech synthesis package. Has text preprocessing
- (for abbreviations, numbers), acronym citation rules, and human-like
- spelling routines. High accuracy for pronunciation of names of
- people, places and businesses in America, text-to-speech translation
- for common words; rules for stress and intonation marking, based on
- natural-sounding demisyllable synthesis; various methods of user
- control and customization at most stages of processing. Currently,
- ORATOR is most appropriate for applications containing a large
- component of names in the text, and requires some amount of user-
- specified text-preprocessing to produce good quality speech for
- general text.
- Hardware: Standard audio output of SPARC, or Decstation audio hardware.
- At least 16M of memory recommended.
- Cost: Binary License: $5,000.
- Source license for porting or commercial use: $30,000.
- Availability: Contact Bellcore's Licensing Office (1-800-527-1080)
- or email: jzilg@cc.bellcore.com (John Zilg)
-
-
- Package: Text to phoneme program (1)
- Platform: unknown
- Description: Text to phoneme program. Based on Naval Research Lab's
- set of text to phoneme rules.
- Availability: By FTP from "shark.cse.fau.edu" (131.91.80.13) in the directory
- /pub/src/phon.tar.Z
-
-
- Package: Text to phoneme program (2)
- Platform: unknown
- Description: Text to phoneme program.
- Availability: By FTP from "wuarchive.wustl.edu" in the file
- /mirrors/unix-c/utils/phoneme.c
-
-
- Package: Text to phoneme program (3)
- Description: A public domain version of the same Naval Research Lab
- text to phoneme rules.
- Availability: By anonymous ftp from
- svr-ftp.eng.cam.ac.uk:comp.speech/sources/english2phoneme.shar
-
-
- Package: Text to speech program
- Description: An implementation of the Klatt phoneme to waveform speech
- synthesiser.
- Availability: By anonymous ftp from
- svr-ftp.eng.cam.ac.uk:comp.speech/sources/klatt-0.02.tar.Z
-
-
- Package: "Speak" - a Text to Speech Program
- Platform: Sun SPARC
- Description: Text to speech program based on concatenation of pre-recorded
- speech segments. A function library can be used to integrate
- speech output into other code.
- Hardware: SPARC audio I/O
- Availability: by FTP from "wilma.cs.brown.edu" as /pub/speak.tar.Z
-
-
- Package: TheBigMouth - a Text to Speech Program
- Platform: NeXT
- Description: Text to speech program based on concatenation of pre-recorded
- speech segments. NeXT equivalent of "Speak" for Suns.
- Availability: try NeXT archive sites such as sonata.cc.purdue.edu.
-
-
- Package: TextToSpeech Kit
- Platform: NeXT Computers
- Description: The TextToSpeech Kit does unrestricted conversion of English
- text to synthesized speech in real-time. The user has control over
- speaking rate, median pitch, stereo balance, volume, and intonation
- type. Text of any length can be spoken, and messages can be queued
- up, from multiple applications if desired. Real-time controls such
- as pause, continue, and erase are included. Pronunciations are
- derived primarily by dictionary look-up. The Main Dictionary has
- nearly 100,000 hand-edited pronunciations which can be supplemented
- or overridden with the User and Application dictionaries. A number
- parser handles numbers in any form. A letter-to-sound knowledge base
- provides pronunciations for words not in the Main or customized
- dictionaries. Dictionary search order is under user control.
- Special modes of text input are available for spelling and emphasis
- of words or phrases. The actual conversion of text to speech is done
- by the TextToSpeech Server. The Server runs as an independent task
- in the background, and can handle up to 50 client connections.
- Misc: The TextToSpeech Kit comes in two packages: the Developer Kit and the
- User Kit. The Developer Kit enables developers to build and test
- applications which incorporate text-to-speech. It includes the
- TextToSpeech Server, the TextToSpeech Object, the pronunciation
- editor PrEditor, several example applications, phonetic fonts,
- example source code, and developer documentation. The User Kit
- provides support for applications which incorporate text-to-speech.
- It is a subset of the Developer Kit.
- Hardware: Uses standard NeXT Computer hardware.
- Cost: TextToSpeech User Kit: $175 CDN ($145 US)
- TextToSpeech Developer Kit: $350 CDN ($290 US)
- Upgrade from User to Developer Kit: $175 CDN ($145 US)
- Availability: Trillium Sound Research
- 1500, 112 - 4th Ave. S.W., Calgary, Alberta, Canada, T2P 0H3
- Tel: (403) 284-9278 Fax: (403) 282-6778
- Order Desk: 1-800-L-ORATOR (US and Canada only)
- Email: manzara@cpsc.UCalgary.CA
-
-
- Package: SENSYN speech synthesizer
- Platform: PC, Mac, Sun, and NeXT
- Rough Cost: $300
- Description: This formant synthesizer produces speech waveform files
- based on the (Klatt) KLSYN88 synthesizer. It is intended
- for laboratory and research use. Note that this is NOT a
- text-to-speech synthesizer, but creates speech sounds based
- upon a large number of input variables (formant frequencies,
- bandwidths, glottal pulse characteristics, etc.) and would
- be used as part of a TTS system. Includes full source code.
- Availability: Sensimetrics Corporation, 64 Sidney Street, Cambridge MA 02139.
- Fax: (617) 225-0470; Tel: (617) 225-2442.
- Email: sensimetrics@sens.com
-
-
- Package: SPCHSYN.EXE
- Platform: PC?
- Availability: By anonymous ftp from evans.ee.adfa.oz.au (131.236.30.24)
- in /mirrors/tibbs/Applications/SPCHSYN.EXE
- It is a self extracting DOS archive.
- Requirements: May require special TI product(s), but all source is there.
-
-
- Package: CSRE: Canadian Speech Research Environment
- Platform: PC
- Cost: Distributed on a cost recovery basis
- Description: CSRE is a software system which includes in addition to the
- Klatt speech synthesizer, SPEECH ANALYSIS and EXPERIMENT CONTROL
- SYSTEM. A paper about the whole package can be found in:
- Jamieson D.G. et al, "CSRE: A Speech Research Environment", Proc.
- of the Second Intl. Conf. on Spoken Language Processing, Edmonton:
- University of Alberta, pp. 1127-1130.
- Hardware: Can use a range of data acquisition/DSP
- Availability: For more information about the availability of this software
- contact Krystyna Marciniak - email march@uwovax.uwo.ca
- Tel (519) 661-3901 Fax (519) 661-3805.
- For technical information email ramji@uwovax.uwo.ca
- Note: A more detailed description is given in Q1.8 on speech environments.
-
-
- Package: JSRU
- Platform: UNIX and PC
- Cost: 100 pounds sterling (for academic institutions and industry)
- Description: A C version of the JSRU system, Version 2.3 is available.
- It's written in Turbo C but runs on most Unix systems with very
- little modification. A Form of Agreement must be signed to say
- that the software is required for research and development only.
- Contact: Dr. E.Lewis (eric.lewis@uk.ac.bristol)
-
-
- Package: Klatt-style synthesiser
- Platform: Unix
- Cost: FREE
- Description: Software posted to comp.speech in late 1992.
- Availability: By anonymous ftp from the comp.speech archives.
- Two files are available from the directory "comp.speech/sources".
- The files are "klatt-cti.tar.Z" and "klatt-jpi.tar.Z". The first
- is the original source, the second is a modified version.
-
-
- Package: MacinTalk
- Platform: Macintosh
- Cost: Free
- Description: Formant based speech synthesis.
- There is also a program called "tex-edit" which apparently
- can pronounce English sentences reasonably using Macintalk.
- Availability: By anonymous ftp from many archive sites (have a look on
- archie if you can). tex-edit is on many of the same sites. Try
- wuarchive.wustl.edu:/mirrors2/info-mac/Old/card/macintalk.hqx[.Z]
- /macintalk-stack.hqx[.Z]
- wuarchive.wustl.edu:/mirrors2/info-mac/app/tex-edit-15.hqx
-
-
- Package: Tinytalk
- Platform: PC
- Description: A shareware speech 'screen reader' which is used
- by many blind users.
- Availability: By anonymous ftp from handicap.shel.isc-br.com.
- Get the files /speech/ttexe145.zip & /speech/ttdoc145.zip.
-
-
- Package: Narrator - narrator.device
- Platform: Amiga
- Description: Formant based speech synthesis. Includes an English-to-phoneme
- translation library, and a SPEAK: pseudo-device for speech
- output.
- Hardware: Standard Amiga hardware
- Availability: Part of AmigaOS
-
-
- Package: Bliss
- Contact: Dr. John Mertus (Brown University) Mertus@browncog.bitnet
-
-
- Package: xxx
- Platform: (PC, Mac, Sun, NeXT etc)
- Rough Cost: (if appropriate)
- Description: (keep it brief)
- Hardware: (requirement list)
- Availability: (ftp info, email contact or company contact)
-
-
-
-
-
- Can anyone provide information on the following:
-
- MultiVoice
- Monolog
-
-
- Please email or post suitable information for this list. Commercial,
- public domain and research packages are all appropriate.
-
- [Perhaps someone would like to start a separate posting on this area.]
-
-
- =======================================================================
-
- PART 5 - Speech Recognition
-
- Q5.1: What is speech recognition?
-
- Automatic speech recognition is the process by which a computer maps an
- acoustic speech signal to text.
-
- Automatic speech understanding is the process by which a computer maps an
- acoustic speech signal to some form of abstract meaning of the speech.
-
- ------------------------------------------------------------------------
-
- Q5.2: How can I build a very simple speech recogniser?
-
- Doug Danforth provides a detailed account in article 253 in the comp.speech
- archives - also available as file info/DIY_Speech_Recognition.
-
- The first part is reproduced here.
-
- QUICKY RECOGNIZER sketch:
-
- Here is a simple recognizer that should give you 85%+ recognition
- accuracy. The accuracy is a function of WHAT words you have in
- your vocabulary. Long distinct words are easy. Short similar
- words are hard. You can get 98+% on the digits with this recognizer.
-
- Overview:
- (1) Find the beginning and end of the utterance.
- (2) Filter the raw signal into frequency bands.
- (3) Cut the utterance into a fixed number of segments.
- (4) Average data for each band in each segment.
- (5) Store this pattern with its name.
- (6) Collect training set of about 3 repetitions of each pattern (word).
- (7) Recognize unknown by comparing its pattern against all patterns
- in the training set and returning the name of the pattern closest
- to the unknown.
-
- Many variations upon the theme can be made to improve the performance.
- Try different filtering of the raw signal and different processing methods.
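
Steps 3, 4 and 7 of the overview above can be sketched in C roughly as
follows. This is not Danforth's code. It assumes that steps 1 and 2 have
already produced, for each utterance, one row of BANDS band energies per
frame; the constants BANDS and SEGMENTS, the flat array layout and the
function names are all choices made for this example.

```c
#include <assert.h>
#include <stddef.h>

#define BANDS        4   /* band energies per frame (assumed) */
#define SEGMENTS     3   /* fixed number of segments per utterance (assumed) */
#define PATTERN_SIZE (BANDS * SEGMENTS)

/* Steps 3-4: cut n_frames rows of band energies into SEGMENTS roughly
   equal pieces and average each band within each piece.
   frames[f * BANDS + b] is the energy of band b in frame f. */
static void make_pattern(const double *frames, int n_frames, double *pattern)
{
    int s, b, f;

    assert(frames != NULL && pattern != NULL && n_frames >= SEGMENTS);
    for (s = 0; s < SEGMENTS; s++) {
        int start = s * n_frames / SEGMENTS;
        int end = (s + 1) * n_frames / SEGMENTS;

        for (b = 0; b < BANDS; b++) {
            double sum = 0.0;

            for (f = start; f < end; f++)
                sum += frames[f * BANDS + b];
            pattern[s * BANDS + b] = sum / (end - start);
        }
    }
}

/* Step 7: return the index of the stored pattern closest to `unknown`,
   using squared Euclidean distance. patterns holds n_patterns consecutive
   blocks of PATTERN_SIZE values. */
static int nearest_pattern(const double *patterns, int n_patterns,
                           const double *unknown)
{
    int best = -1, p, i;
    double best_dist = 0.0;

    assert(patterns != NULL && unknown != NULL && n_patterns > 0);
    for (p = 0; p < n_patterns; p++) {
        double dist = 0.0;

        for (i = 0; i < PATTERN_SIZE; i++) {
            double d = patterns[p * PATTERN_SIZE + i] - unknown[i];
            dist += d * d;
        }
        if (best < 0 || dist < best_dist) {
            best = p;
            best_dist = dist;
        }
    }
    return best;
}
```

The caller keeps a parallel array of word names (step 5) and returns the
name at the index found. Because every utterance is reduced to the same
number of segments, words spoken at different speeds still produce patterns
of equal length.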
-
- ------------------------------------------------------------------------
-
- Q5.2: What does speaker dependent/adaptive/independent mean?
-
- A speaker dependent system is developed (trained) to operate for a single
- speaker. These systems are usually easier to develop, cheaper to buy and
- more accurate, but are not as flexible as speaker adaptive or speaker
- independent systems.
-
- A speaker independent system is developed (trained) to operate for any
- speaker or speakers of a particular type (e.g. male/female, American/English).
- These systems are the most difficult to develop and the most expensive, and
- currently their accuracy is not as good. They are, however, the most flexible.
-
- A speaker adaptive system is developed to adapt its operation to new
- speakers that it encounters, usually based on a general model of speaker
- characteristics. It lies somewhere between speaker independent and speaker
- dependent systems.
-
- Each type of system is suited to different applications and domains.
-
- ------------------------------------------------------------------------
-
- Q5.3: What does small/medium/large/very-large vocabulary mean?
-
- The size of vocabulary of a speech recognition system affects the complexity,
- processing requirements and the accuracy of the system. Some applications
- only require a few words (e.g. numbers only), others require very large
- dictionaries (e.g. dictation machines).
-
- There are no established definitions but the following may be a helpful guide.
-
- small vocabulary - tens of words
- medium vocabulary - hundreds of words
- large vocabulary - thousands of words
- very-large vocabulary - tens of thousands of words.
-
- ------------------------------------------------------------------------
-
- Q5.4: What does continuous speech or isolated-word mean?
-
- An isolated-word system operates on single words at a time - requiring a
- pause between saying each word. This is the simplest form of recognition
- to perform, because the pronunciations of the words tend not to affect each
- other. Because the occurrences of each particular word are similar they are
- easier to recognise.
-
- A continuous speech system operates on speech in which words are connected
- together, i.e. not separated by pauses. Continuous speech is more difficult
- to handle because of a variety of effects. First, it is difficult to find
- the start and end points of words. Another problem is "coarticulation".
- The production of each phoneme is affected by the production of surrounding
- phonemes, and similarly the start and end of words are affected by the
- preceding and following words. The recognition of continuous speech is also
- affected by the rate of speech (fast speech tends to be harder).
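-
- As a rough illustration of why pauses make isolated-word input easier to
- handle, the Python sketch below finds word boundaries by looking for runs
- of low-energy frames. The function names, frame length and threshold
- values are assumptions chosen for this example and are not taken from any
- particular system; practical endpoint detectors are more elaborate.

```python
def frame_energies(samples, frame_len):
    """Mean squared amplitude of each non-overlapping frame."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energies.append(sum(x * x for x in frame) / frame_len)
    return energies


def find_words(samples, frame_len=4, threshold=0.01, min_pause=2):
    """Return (start, end) sample indices of regions separated by pauses.

    A pause is a run of at least `min_pause` consecutive frames whose
    energy is below `threshold` (both values are illustrative).
    """
    words = []
    start = None   # first frame of the word currently being tracked
    quiet = 0      # consecutive low-energy frames seen inside that word
    energies = frame_energies(samples, frame_len)
    for i, e in enumerate(energies):
        if e >= threshold:
            if start is None:
                start = i
            quiet = 0
        elif start is not None:
            quiet += 1
            if quiet >= min_pause:
                # The word ended where the quiet run began.
                words.append((start * frame_len, (i - quiet + 1) * frame_len))
                start = None
                quiet = 0
    if start is not None:
        # Input ended before a full pause was seen.
        words.append((start * frame_len, (len(energies) - quiet) * frame_len))
    return words


# Two synthetic "words" separated by silence.
signal = [0.0] * 8 + [0.5, -0.5] * 4 + [0.0] * 8 + [0.3, -0.3] * 6 + [0.0] * 8
print(find_words(signal))   # [(8, 16), (24, 36)]
```

- When words run together with few or no quiet frames between them, a
- detector of this kind returns one long region, so the word boundaries
- have to be found in some other way.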
-
- ------------------------------------------------------------------------
-
- Q5.5: How is speech recognition done?
-
- A wide variety of techniques is used to perform speech recognition, and
- there are many types and levels of speech recognition, processing and
- understanding.
-
- Typically speech recognition starts with the digital sampling of speech.
- The next stage would be acoustic signal processing. Common techniques
- include a variety of spectral analyses, LPC analysis, the cepstral transform,
- cochlea modelling and many, many more.
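-
- As one concrete example of this stage, the Python sketch below computes
- LPC (linear prediction) coefficients for a single frame of samples, using
- the autocorrelation method and the Levinson-Durbin recursion. The function
- names and the test signal are assumptions made for illustration; a
- practical front end would typically also window each frame and process
- many overlapping frames.

```python
def autocorrelation(frame, max_lag):
    """r[k] = sum over n of frame[n] * frame[n + k], for k = 0..max_lag."""
    n = len(frame)
    return [sum(frame[i] * frame[i + k] for i in range(n - k))
            for k in range(max_lag + 1)]


def lpc(frame, order):
    """Return predictor coefficients a[1..order] and the residual energy.

    The model predicts frame[n] as the sum over k of a[k] * frame[n - k].
    """
    r = autocorrelation(frame, order)
    a = [0.0] * (order + 1)   # a[0] is unused
    error = r[0]
    for i in range(1, order + 1):
        if error == 0.0:
            break   # nothing left to predict (e.g. a silent frame)
        # Reflection coefficient for model order i.
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / error
        # Update the lower-order coefficients, then add the new one.
        prev = a[:]
        for j in range(1, i):
            a[j] = prev[j] - k * prev[i - j]
        a[i] = k
        error *= (1.0 - k * k)
    return a[1:], error


# A decaying exponential: each sample is 0.9 times the previous one,
# so the first coefficient comes out close to 0.9 and the second close to 0.
frame = [0.9 ** n for n in range(200)]
coeffs, residual = lpc(frame, 2)
print(coeffs, residual)
```

- The resulting coefficients (or quantities derived from them) can then
- serve as the feature vector for that frame in the later stages.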
-
- The next stage will typically try to recognise phonemes, groups of phonemes
- or words. This stage can be achieved by many processes such as DTW (Dynamic
- Time Warping), HMM (hidden Markov modelling), NNs (Neural Networks), and
- sometimes expert systems. In crude terms, all these processes try to
- recognise the patterns of speech. The most advanced systems are
- statistically motivated.
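-
- To make one of these techniques concrete, here is a minimal Python sketch
- of DTW used for template matching. It assumes each utterance has already
- been reduced to one number per frame, which is a simplification made for
- this example; real systems compare a vector of features per frame. The
- function names and the templates are hypothetical.

```python
def dtw_distance(a, b):
    """Cumulative cost of the best monotonic alignment of sequences a and b."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j]: best cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(cost[i - 1][j],      # a advances alone
                                     cost[i][j - 1],      # b advances alone
                                     cost[i - 1][j - 1])  # both advance
    return cost[n][m]


def recognise(utterance, templates):
    """Return the label of the stored template nearest to the utterance."""
    return min(templates,
               key=lambda label: dtw_distance(utterance, templates[label]))


templates = {"yes": [1, 3, 5, 3, 1], "no": [5, 5, 1, 1]}
slow_yes = [1, 1, 3, 3, 5, 5, 3, 1]    # same shape as "yes", spoken slower
print(recognise(slow_yes, templates))  # yes
```

- The warping lets a slowly spoken word match a template recorded at a
- different speed, which a plain frame-by-frame comparison would not allow.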
-
- Some systems utilise knowledge of grammar to help with the recognition
- process.
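-
- One simple way to picture this is a word-pair grammar that lists which
- words may follow which. The Python sketch below is an illustration under
- that assumption only: it takes made-up acoustic scores for the candidate
- words at each position and uses dynamic programming to pick the best
- sequence the grammar permits. All names and numbers are hypothetical, and
- actual systems differ in how they represent and apply grammar.

```python
def best_sentence(slot_scores, allowed_next, start_words):
    """Pick the highest-scoring word sequence that the grammar permits.

    slot_scores:  one dict per word position, mapping word -> acoustic
                  score (higher is better).
    allowed_next: dict mapping a word to the set of words that may follow.
    start_words:  set of words that may begin a sentence.
    Returns (score, words), or None if no sequence is permitted.
    """
    # best[w] = (total score, word list) for the best legal path ending in w.
    best = {w: (s, [w]) for w, s in slot_scores[0].items() if w in start_words}
    for scores in slot_scores[1:]:
        nxt = {}
        for word, s in scores.items():
            for prev, (total, path) in best.items():
                if word in allowed_next.get(prev, ()):
                    cand = (total + s, path + [word])
                    if word not in nxt or cand[0] > nxt[word][0]:
                        nxt[word] = cand
        best = nxt
    if not best:
        return None
    return max(best.values(), key=lambda item: item[0])


# Acoustically "tall hone" scores best, but the grammar only allows "call home".
slots = [{"call": -1.0, "tall": -0.8}, {"home": -0.5, "hone": -0.4}]
grammar = {"call": {"home", "office"}}
print(best_sentence(slots, grammar, {"call", "dial"}))  # (-1.5, ['call', 'home'])
```

- Here the grammar overrides a slightly better acoustic match that would
- not form a legal sentence.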
-
- Some systems attempt to utilise prosody (pitch, stress, rhythm etc) to
- process the speech input.
-
- Some systems try to "understand" speech. That is, they try to convert the
- words into a representation of what the speaker intended to mean or achieve
- by what they said.
-
- ------------------------------------------------------------------------
-
- Q5.6: What are some good references/books on recognition?
-
- Some general introduction books on speech recognition:
-
- Fundamentals of Speech Recognition; Lawrence Rabiner & Biing-Hwang Juang
- Englewood Cliffs NJ: PTR Prentice Hall (Signal Processing Series), c1993
- ISBN 0-13-015157-2
-
- Speech recognition by machine; W.A. Ainsworth
- London: Peregrinus for the Institution of Electrical Engineers, c1988
-
- Speech synthesis and recognition; J.N. Holmes
- Wokingham: Van Nostrand Reinhold, c1988
-
- Douglas O'Shaughnessy -- Speech Communication: Human and Machine
- Addison Wesley series in Electrical Engineering: Digital Signal Processing,
- 1987.
-
- Electronic speech recognition: techniques, technology and applications
- edited by Geoff Bristow, London: Collins, 1986
-
- Readings in Speech Recognition; edited by Alex Waibel & Kai-Fu Lee.
- San Mateo: Morgan Kaufmann, c1990
-
- More specific books/articles:
-
- Hidden Markov models for speech recognition; X.D. Huang, Y. Ariki, M.A. Jack.
- Edinburgh: Edinburgh University Press, c1990
-
- Automatic speech recognition: the development of the SPHINX system;
- by Kai-Fu Lee; Boston; London: Kluwer Academic, c1989
-
- Prosody and speech recognition; Alex Waibel
- (Pitman: London) (Morgan Kaufmann: San Mateo, Calif) 1988
-
- S. E. Levinson, L. R. Rabiner and M. M. Sondhi, "An Introduction to the
- Application of the Theory of Probabilistic Functions of a Markov Process
- to Automatic Speech Recognition" in Bell Syst. Tech. Jnl. v62(4),
- pp1035--1074, April 1983
-
- R. P. Lippmann, "Review of Neural Networks for Speech Recognition", in
- Neural Computation, v1(1), pp 1-38, 1989.
-
- ------------------------------------------------------------------------
-
- Q5.7: What speech recognition packages are available?
-
- Package Name: Votan
- Platform: MS-DOS, SCO UNIX
- Description: Isolated word and continuous speech modes, speaker dependent
- and (limited) speaker independent. Vocab size is 255 words or up to a
- fixed memory limit - but it is possible to dynamically load different
- words for an effectively unlimited number of words.
- Rough Cost: Approx US $1,000-$1,500
- Requirements: Cost includes one Votan Voice Recognition ISA-bus board
- for 386/486-based machines. A software development system is also
- available for DOS and Unix.
- Misc: Up to 8 Votan boards may co-exist for 8 simultaneous voice users.
- A telephone interface is also available. There is also a 4GL and a
- software development system.
- Apparently there is more than one version - more info required.
- Contact: 800-877-4756, 510-426-5600
-
-
- Package: HTK (HMM Toolkit) - From Entropic
- Platform: Range of Unix platforms.
- Description: HTK is a software toolkit for building continuous density HMM
- based speech recognisers. It consists of a number of library
- modules and a number of tools. Functions include speech analysis,
- training tools, recognition tools, results analysis, and an
- interactive tool for speech labelling. Many standard forms of
- continuous density HMM are possible. Can perform isolated word or
- connected word speech recognition. It can model whole words or
- sub-word units. Can perform speaker verification and other pattern
- recognition work using HMMs. HTK is now integrated with the
- ESPS/Waves speech research environment which is described in
- Section 1.8 of this posting.
- Misc: The availability of HTK changed in early 1993 when Entropic obtained
- exclusive marketing rights to HTK from the developers at Cambridge.
- Cost: On request.
- Contact: Entropic Research Laboratory, Washington Research Laboratory,
- 600 Pennsylvania Ave, S.E. Suite 202, Washington, D.C. 20003
- (202) 547-1420. email - info@wrl.epi.com
-
-
- Package Name: DragonDictate-30K
- Platform: PC
- Description: Speaker dependent/adaptive system requiring words to be
- separated by short pauses. Vocabulary of 25,000 words including
- a "custom" word set.
- Rough Cost: $5000
- Requirements: Minimum of 20 MHz 386 with 8M memory and 10M disk space
- Contact: Dragon Systems Inc.
- 90 Bridge Street, Newton MA 02158
- Tel: 1-617-965-5200, Fax: 1-617-527-0372
-
-
- Product Name: IN3 Voice Command for Windows
- Platform: PC with Windows 3.1
- Description: IN3 is now available for MS-Windows. Users can call
- applications to the foreground with voice commands. Once the
- application is called, the user may enter commands and data with
- voice commands. Voice macros can reduce the strain of repetitive
- stress injuries (RSI) such as Carpal Tunnel Syndrome (CTS) by
- replacing heavy repetitive keyboard hammering with simple voice
- operations. Voice macros take complex operations and reduce them
- to simple verbal commands. Voice input can provide new facilities
- for tasks which could not easily have been otherwise performed
- without the multiple axis of input. IN3 is hardware-independent:
- users with any Windows-compatible audio board can add speech recognition
- to the desktop. IN3 works with either 8 bit or 16 bit Windows audio
- boards. IN3 is based on continuous word-spotting technology. A
- developer API is also available for creating voice-enabled
- applications.
- Price: $179 U.S.
- Requirements: PC with 80386 processor or better, Microsoft Windows 3.1, and
- Windows compatible audio system with microphone.
- Misc: Fully functional demos are available on Compuserve in various
- Multimedia and CAD forums. Demos are also available from "America
- on Line", the comp.binaries.ms-windows archive sites, and various
- BBS systems.
- An equivalent Sun product is described below.
- Contact: Brantley Kelly
- Email: cbk@gacc.atl.ga.us CIS: 75120,431
- FAX: 1-404-925-7924 Phone: 1-404-925-7950
- Command Corp. Inc, 3675 Crestwood Parkway, Duluth GA 30136, USA
-
-
-
- Product Name: IN3 Voice Command
- Platform: Sun SPARCstation
- Description: IN3 provides a secure, robust, word spotting, continuous
- speech recognition facility for the Sun OS or Solaris operating
- systems. The recognition system is a secure operating system
- facility capable of working with various interfaces, microphones,
- and devices. The operating system interface works with native UNIX
- outside of X Windows as well as provides enhanced X Windows facilities
- including named window support. The user interface provides a
- means to quickly create commands on the fly for replacing long strings
- and complex operations with voice macros. [Voice macros can reduce
- the strain of repetitive stress injuries (RSI) such as Carpal Tunnel
- Syndrome (CTS) by replacing heavy repetitive keyboard hammering with
- simple voice operations. ]
- The IN3 user interface works with generic X servers and window
- managers. A developer API is also available for creating voice-
- enabled applications, interfacing with other audio sources, and
- providing extensive application control over the recognition facility.
- Availability: SunSite archive at SunSITE.unc.edu as well as on Catalyst
- CDware as both a runnable demo and unlockable software.
- Hardware Required: Sun SPARCstation with audio input.
- Noise canceling microphone recommended but not required.
- Software Required: Sun OS 4.1.2 with OpenWindows 3.0 or
- Sun OS 4.1.3 or
- Solaris 2.1 or Solaris 2.2
- Misc: An equivalent MS-Windows product is described above.
- Price: $495 U.S.
- Contact: Brantley Kelly
- Email: cbk@gacc.atl.ga.us CIS: 75120,431
- FAX: 1-404-925-7924 Phone: 1-404-925-7950
- Command Corp. Inc, 3675 Crestwood Parkway, Duluth GA 30136, USA
-
-
- Package Name: SayIt
- Platform: Sun SPARCstation
- Description: Voice recognition and macro building package for Suns
- in the OpenWindows 3.0 environment. Speaker dependent discrete speech
- recognition. Vocabularies can be associated with applications and the
- active vocabulary follows the application that has input focus.
- Macros can include mouse commands, keystrokes, Unix commands,
- sound, OpenWindows actions and more.
- An evaluation copy is available by email.
- Hardware: Microphone required (SunMicrophone is fine).
- Cost: $US295
- Contact: Phone: 1-800-245-UNIX or 1-415-572-0200
- Fax: 1-415-572-1300
- Email: info@qualix.com
-
-
- Package Name: recnet
- Platform: UNIX
- Description: Speech recognition for the speaker independent TIMIT and
- Resource Management tasks. It uses recurrent networks to estimate
- phone probabilities and Markov models to find the most probable
- sequence of phones or words. The system is a snapshot of evolving
- research code. There is no documentation other than published
- research papers. The components are:
- 1. A preprocessor which implements many standard and many non-
- standard front end processing techniques.
- 2. A recurrent net recogniser and parameter files
- 3. Two Markov model based recognisers, one for phone recognition
- and one for word recognition
- 4. A dynamic programming scoring package
- The complete system performs competitively.
- Cost: Free
- Requirements: TIMIT and Resource Management databases
- Contact: ajr@eng.cam.ac.uk (Tony Robinson)
- Availability: by FTP from "svr-ftp.eng.cam.ac.uk" as /misc/recnet-1.3.tar.Z
-
-
- Package Name: Voice Command Line Interface
- Platform: Amiga
- Description: VCLI will execute CLI commands, ARexx commands, or ARexx
- scripts by voice command through your audio digitizer. VCLI allows
- you to launch multiple applications or control any program with an
- ARexx capability entirely by spoken voice command. VCLI is fully
- multitasking and will run in the background, continuously listening
- for your voice commands even while other programs are running.
- Documentation is provided in AmigaGuide format.
- VCLI 6.0 runs under either Amiga DOS 2.0 or 3.0.
- Cost: Free?
- Requirements: Supports the DSS8, PerfectSound 3, Sound Master, Sound Magic,
- and Generic audio digitizers.
- Availability: by ftp from wuarchive.wustl.edu in the file
- systems/amiga/incoming/audio/VCLI60.lha and from
- amiga.physik.unizh.ch as the file pub/aminet/util/misc/VCLI60.lha
- Contact: Author's email is RHorne@cup.portal.com
-
-
- Package Name: DATAVOX - French
- Platform: PC
- Description: Continuous speech - speaker independent or dependent.
- Rough Cost: ?
- Requirements: 2 PC format boards (RdF1000 and TdS 96/25) and an
- A/D - D/A module (ASA116)
- Misc: Application software may dialog with DATAVOX through 2 types
- of interfaces :
- 1) Keyboard overlay
- The application software may be used with any PC compatible
- package. No specific adaptation is necessary, you only need
- to define your configuration with the application software.
- 2) C library
- Allows a user-written program to drive the recognition system.
- DATAVOX is based on the AMADEUS speech recognition software
- developed at LIMSI. It provides
- - Continuous speech recognition with
- * speaker dependent : 500 words
- * speaker independent : 50 words (custom-made vocabulary).
- - Grammar of the application language (syntax acquisition,
- verification and simplification software).
- - Large vocabulary : DATAVOX can recognize vocabularies of several
- thousand words as long as there are no more than 500 words in the
- active vocabulary at any given node. It takes less than 1 second
- to change syntax and vocabulary.
- - Training controlled by the system (use of co-articulation models).
- - Response time less than 500 ms for any phrase length.
- Synthesis (ADPCM) can be heard simultaneously while recognition
- is being carried out.
- Contact: VECSYS, Le Chene rond, 91570 Bievres, France
- Fax: 33 1 69 41 24 30
- Voice: 33 1 69 41 15 04
-
-
-
- Package Name: xxx
- Platform: PC, Mac, UNIX, Amiga ....
- Description: (e.g. isolated word, speaker independent...)
- Rough Cost: (if applicable)
- Requirements: (hardware/software needs - if applicable)
- Misc:
- Contact: (email, ftp or address)
-
-
- Can anyone provide info on
-
- Verbex Listen for Windows
- Voice Navigator (from Articulate Systems)
- IN3 Voice Command
-
-
- Can you provide information on any other software/hardware/packages?
- Commercial, public domain and research packages are all appropriate.
-
- [There should be enough info for someone to start a separate posting.]
-
-
- =======================================================================
-
- PART 6 - Natural Language Processing
-
- There is now a newsgroup specifically for Natural Language Processing.
- It is called comp.ai.nat-lang.
-
- There is also a lot of useful information on Natural Language Processing
- in the FAQ for comp.ai. That FAQ lists available software and useful
- references. It includes a substantial list of software, documentation
- and other info available by ftp.
-
- ------------------------------------------------------------------------
-
- Q6.1: What are some good references/books on NLP?
-
-
- Take a look at the FAQ for the "comp.ai" newsgroup as it also includes some
- useful references.
-
-
- James Allen: Natural Language Understanding. (Benjamin/Cummings Series in
- Computer Science) Menlo Park: Benjamin/Cummings Publishing Company, 1987.
-
- This book consists of four parts: syntactic processing, semantic
- interpretation, context and world knowledge, and response generation.
-
- G. Gazdar and C. Mellish, Natural Language Processing in {Prolog/Lisp/Pop11},
- Addison Wesley, 1989
-
- Emphasis on parsing, especially unification-based parsing, lots of
- details on the lexicon, feature propagation, etc. Fair coverage of
- semantic interpretation, inference in natural language processing,
- and pragmatics; much less extensive than in Allen's book, but more
- formal. There are three versions, one for each programming language
- listed above, with complete code.
-
- Shapiro, Stuart C.: Encyclopedia of Artificial Intelligence Vol.1 and 2.
- New York: John Wiley & Sons, 1990.
-
- There are articles on the different areas of natural language
- processing which also give additional references.
-
- Paris, Ce'cile L.; Swartout, William R.; Mann, William C.: Natural Language
- Generation in Artificial Intelligence and Computational Linguistics. Boston:
- Kluwer Academic Publishers, 1991.
-
- The book describes the most current research developments in natural
- language generation and all aspects of the generation process are
- discussed. The book is comprised of three sections: one on text
- planning, one on lexical choice, and one on grammar.
-
- Readings in Natural Language Processing, ed by B. Grosz, K. Sparck Jones
- and B. Webber, Morgan Kaufmann, 1986
-
- A collection of classic papers on Natural Language Processing.
- Fairly complete at the time the book came out (1986) but now
- seriously out of date. Still useful for ATN's, etc.
-
- Klaus K. Obermeier, Natural Language Processing Technologies
- in Artificial Intelligence: The Science and Industry Perspective,
- Ellis Horwood Ltd, John Wiley & Sons, Chichester, England, 1989.
-
-
- The major journals of the field are "Computational Linguistics" and
- "Cognitive Science" for the artificial intelligence aspects, "Cognition"
- for the psychological aspects, "Language", "Linguistics and Philosophy" and
- "Linguistic Inquiry" for the linguistic aspects. "Artificial Intelligence"
- occasionally has papers on natural language processing.
-
-
- The major conferences are ACL (held every year) and COLING (held every two
- years). Most AI conferences have a NLP track; AAAI, ECAI, IJCAI and the
- Cognitive Science Society conferences usually are the most interesting for
- NLP. CUNY is an important psycholinguistic conference. There are lots of
- linguistic conferences: the most important seem to be NELS, the conference
- of the Chicago Linguistic Society (CLS), WCCFL, LSA, the Amsterdam Colloquium,
- and SALT.
-
-
- ------------------------------------------------------------------------
-
- Q6.2: What NLP software is available?
-
- The FAQ for the "comp.ai" newsgroup lists a variety of language processing
- software that is available. That FAQ is posted monthly.
-
- Natural Language Software Registry
-
- The Natural Language Software Registry is available from the German Research
- Institute for Artificial Intelligence (DFKI) in Saarbrucken.
-
- The current version details
- + speech signal processors, e.g. Computerized Speech Lab (Kay Electronics)
- + morphological analyzers, e.g. PC-KIMMO (Summer Institute for Linguistics)
- + parsers, e.g. Alveytools (University of Edinburgh)
- + knowledge representation systems, e.g. Rhet (University of Rochester)
- + multicomponent systems, such as ELU (ISSCO), PENMAN (ISI), Pundit (UNISYS),
- SNePS (SUNY Buffalo),
- + applications programs (misc.)
-
- This document is available on-line via anonymous ftp to
- Site: ftp.dfki.uni-sb.de
- Directory: /registry
- or by email to registry@dfki.uni-sb.de.
-
- If you have developed a piece of software for natural language processing
- that other researchers might find useful, you can include it by returning
- a description form, available from the same source.
-
- Contacts: Christoph Jung, Markus Vonerden
- Natural Language Software Registry
- Deutsches Forschungsinstitut fuer Kuenstliche Intelligenz (DFKI)
- Stuhlsatzenhausweg 3
- D-W-6600 Saarbruecken
- Germany
-
- phone: +49 (681) 303-5282
- e-mail: registry@dfki.uni-sb.de
-
-
-
-
- Andrew Hunt
- Speech Technology Research Group Ph: 61-2-692 4509
- Dept. of Electrical Engineering Fax: 61-2-692 3847
- University of Sydney, NSW, 2006, Australia email: andrewh@ee.su.oz.au
-