═══ 1. Preface ═══

This file documents the Festival Speech Synthesis System, a general text-to-speech system for making your computer talk and for developing new synthesis techniques.

Copyright (C) 1996-1999 University of Edinburgh

Permission is granted to make and distribute verbatim copies of this manual provided the copyright notice and this permission notice are preserved on all copies.

Permission is granted to copy and distribute modified versions of this manual under the conditions for verbatim copying, provided that the entire resulting derived work is distributed under the terms of a permission notice identical to this one.

Permission is granted to copy and distribute translations of this manual into another language, under the above conditions for modified versions, except that this permission notice may be stated in a translation approved by the authors.

This file documents the Festival Speech Synthesis System VERSION. This document contains many gaps and is still in the process of being written.

Abstract: initial comments
Copying: How you can copy and share the code
Acknowledgements: List of contributors
What is new: Enhancements since last public release
Overview: Generalities and Philosophy
Installation: Compilation and Installation
Quick start: Just tell me what to type
Scheme: A quick introduction to Festival's scripting language
Text methods for interfacing to Festival
TTS: Text to speech modes
XML/SGML mark-up: XML/SGML mark-up Language
Emacs interface: Using Festival within Emacs
Internal functions
Phonesets: Defining and using phonesets
Lexicons: Building and compiling Lexicons
Utterances: Existing and defining new utterance types
Modules
Text analysis: Tokenizing text
POS tagging: Part of speech tagging
Phrase breaks: Finding phrase breaks
Intonation: Intonation modules
Duration: Duration modules
UniSyn synthesizer: The UniSyn waveform synthesizer
Diphone synthesizer: Building and using diphone synthesizers
Other synthesis methods: Other waveform synthesis methods
Audio output: Getting sound from Festival
Voices: Adding new voices (and languages)
Tools: CART, Ngrams etc.; Building models from databases
Adding new modules and writing C++ code
Programming: Programming in Festival (Lisp/C/C++)
API: Using Festival in other programs
Examples: Some simple (and not so simple) examples
Problems: Reporting bugs
References: Other sources of information
Feature functions: List of builtin feature functions
Variable list: Short descriptions of all variables
Function list: Short descriptions of all functions
Index: Index of concepts

═══ 2. Abstract ═══

This document provides a user manual for the Festival Speech Synthesis System, version VERSION. Festival offers a general framework for building speech synthesis systems as well as including examples of various modules. As a whole it offers full text to speech through a number of APIs: from shell level, through a Scheme command interpreter, as a C++ library, and via an Emacs interface. Festival is multi-lingual; we have developed voices in many languages including English (UK and US), Spanish and Welsh, though English is the most advanced. The system is written in C++, uses the Edinburgh Speech Tools for its low level architecture, and has a Scheme (SIOD) based command interpreter for control. Documentation is given in the FSF texinfo format, which can generate a printed manual, info files and HTML.
The latest details and a full software distribution of the Festival Speech Synthesis System are available through its home page, which may be found at http://www.cstr.ed.ac.uk/projects/festival.html

═══ 3. Copying ═══

As we feel the core system has reached an acceptable level of maturity, from 1.4.0 the basic system is released under a free licence, without the commercial restrictions we imposed on earlier versions. The basic system has been placed under an X11-type licence which, as free licences go, is pretty free. No GPL code is included in festival or the speech tools themselves (though some auxiliary files are GPL'd, e.g. the Emacs mode for Festival). We have deliberately chosen a licence that should be compatible with our commercial partners and our free software users.

However, although the code is free, we still offer no warranties and no maintenance. We will continue to endeavor to fix bugs and answer queries when we can, but are not in a position to guarantee it. We will consider maintenance contracts and consultancy if desired; please contact us for details.

Also note that not all the voices and lexicons we distribute with festival are free. In particular the British English lexicon derived from the Oxford Advanced Learners' Dictionary is free only for non-commercial use (we will release an alternative soon). Also the Spanish diphone voice we release is only free for non-commercial use.

If you are using Festival or the speech tools in a commercial environment, even though no licence is required, we would be grateful if you let us know, as it helps us justify ourselves to our various sponsors.

The current copyright on the core system is

The Festival Speech Synthesis System: version 1.4.1
Centre for Speech Technology Research
University of Edinburgh, UK
Copyright (c) 1996-1999
All Rights Reserved.

Permission is hereby granted, free of charge, to use and distribute this software and its documentation without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of this work, and to permit persons to whom this work is furnished to do so, subject to the following conditions:
1. The code must retain the above copyright notice, this list of conditions and the following disclaimer.
2. Any modifications must be clearly marked as such.
3. Original authors' names are not deleted.
4. The authors' names are not used to endorse or promote products derived from this software without specific prior written permission.

THE UNIVERSITY OF EDINBURGH AND THE CONTRIBUTORS TO THIS WORK DISCLAIM ALL WARRANTIES WITH REGARD TO THIS SOFTWARE, INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS, IN NO EVENT SHALL THE UNIVERSITY OF EDINBURGH NOR THE CONTRIBUTORS BE LIABLE FOR ANY SPECIAL, INDIRECT OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.

═══ 4. Acknowledgements ═══

The code in this system was primarily written by Alan W Black, Paul Taylor and Richard Caley. Festival sits on top of the Edinburgh Speech Tools Library, and uses much of its functionality. Amy Isard wrote a synthesizer for her MSc project in 1995, which first used the Edinburgh Speech Tools Library. Although Festival doesn't contain any code from that system, her system was used as a basic model.
Much of the design and philosophy of Festival has been built on the experience both Paul and Alan gained from the development of various previous synthesizers and software systems, especially CSTR's Osprey and Polyglot systems taylor91 and ATR's CHATR system black94. However, it should be stated that Festival is fully developed at CSTR and contains neither proprietary code nor ideas.

Festival contains a number of subsystems integrated from other sources, and we acknowledge those systems here.

═══ 4.1. SIOD ═══

The Scheme interpreter (SIOD -- Scheme In One Defun 3.0) was written by George Carrett (gjc@mitech.com, gjc@paradigm.com) and offers a basic small Scheme (Lisp) interpreter suitable for embedding in applications such as Festival as a scripting language. A number of changes and improvements have been added in our development, but it remains essentially that basic system. We are grateful to George and Paradigm Associates Incorporated for providing such a useful and well-written sub-system.

Scheme In One Defun (SIOD)
COPYRIGHT (c) 1988-1994 BY PARADIGM ASSOCIATES INCORPORATED, CAMBRIDGE, MASSACHUSETTS. ALL RIGHTS RESERVED

Permission to use, copy, modify, distribute and sell this software and its documentation for any purpose and without fee is hereby granted, provided that the above copyright notice appear in all copies and that both that copyright notice and this permission notice appear in supporting documentation, and that the name of Paradigm Associates Inc not be used in advertising or publicity pertaining to distribution of the software without specific, written prior permission.

PARADIGM DISCLAIMS ALL WARRANTIES WITH REGARD TO THIS SOFTWARE, INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS, IN NO EVENT SHALL PARADIGM BE LIABLE FOR ANY SPECIAL, INDIRECT OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.

═══ 4.2. editline ═══

Because of conflicts with the copyright for GNU readline, for which an optional interface was included in earlier versions, we have replaced that interface with a complete command line editing system based on 'editline'. 'Editline' was posted to the USENET newsgroup 'comp.sources.misc' in 1992. A number of modifications have been made to make it more useful to us, but the original code (contained within the standard speech tools distribution) and our modifications fall under the following licence.

Copyright 1992 Simmule Turner and Rich Salz. All rights reserved.

This software is not subject to any license of the American Telephone and Telegraph Company or of the Regents of the University of California.

Permission is granted to anyone to use this software for any purpose on any computer system, and to alter it and redistribute it freely, subject to the following restrictions:
1. The authors are not responsible for the consequences of use of this software, no matter how awful, even if they arise from flaws in it.
2. The origin of this software must not be misrepresented, either by explicit claim or by omission. Since few users ever read sources, credits must appear in the documentation.
3. Altered versions must be plainly marked as such, and must not be misrepresented as being the original software. Since few users ever read sources, credits must appear in the documentation.
4. This notice may not be removed or altered.
═══ 4.3. Edinburgh Speech Tools Library ═══

The Edinburgh Speech Tools Library lies at the core of Festival. Although developed separately, much of the development of certain parts of the Edinburgh Speech Tools has been directed by Festival's needs. In turn, those who have contributed to the Speech Tools make Festival a more usable system. See the Acknowledgements section of the Edinburgh Speech Tools Library Manual. Online information about the Edinburgh Speech Tools library is available through http://www.cstr.ed.ac.uk/projects/speech_tools.html

═══ 4.4. Others ═══

Many others have provided actual code and support for Festival, for which we are grateful. Specifically:

Alistair Conkie: various low level code points and some design work, Spanish synthesis, the old diphone synthesis code.
Steve Isard: directorship and LPC diphone code, design of diphone schema.
EPSRC: who fund Alan Black and Paul Taylor.
Sun Microsystems Laboratories: for supporting the project and funding Richard.
AT&T Labs - Research: for supporting the project.
Paradigm Associates and George Carrett: for Scheme in one defun.
Mike Macon: improving the quality of the diphone synthesizer and LPC analysis.
Kurt Dusterhoff: Tilt intonation training and modelling.
Amy Isard: for her SSML project and related synthesizer.
Richard Tobin: for answering all those difficult questions, the socket code, and the XML parser.
Simmule Turner and Rich Salz: command line editor (editline).
Borja Etxebarria: help with the Spanish synthesis.
Briony Williams: Welsh synthesis.
Jacques H. de Villiers ('jacques@cse.ogi.edu') from CSLU at OGI: for the TCL interface, and other usability issues.
Kevin Lenzo ('lenzo@cs.cmu.edu') from CMU: for the PERL interface.
Rob Clarke: for support under Linux.
Samuel Audet ('guardia@cam.org'): OS/2 support.
Mari Ostendorf: for providing access to the BU FM Radio corpus, from which some modules were trained.
Melvin Hunt: on whose work we based our residual LPC synthesis model.
Oxford Text Archive: for the computer users version of the Oxford Advanced Learners' Dictionary (redistributed with permission).
Reading University: for access to MARSEC, from which the phrase break model was trained.
LDC & Penn Tree Bank: from which the POS tagger was trained; redistribution of the models is with permission from the LDC.
Roger Burroughes and Kurt Dusterhoff: for letting us capture their voices.
ATR and Nick Campbell: for first getting Paul and Alan to work together and for the experience we gained.
FSF: for G++, make, ...
Center for Spoken Language Understanding: CSLU at OGI, particularly Ron Cole and Mike Macon, have acted as significant users for the system, giving significant feedback and allowing us to teach courses on Festival, offering valuable real-use feedback.
Our beta testers: thanks to all the people who put up with previous versions of the system and reported bugs, both big and small. These comments are very important to the constant improvement of the system. And thanks for your quick responses when I had specific requests.
And our users: many people have downloaded earlier versions of the system. Many have found problems with installation and use and have reported them to us. Many of you have put up with multiple compilations trying to fix bugs remotely. We thank you for putting up with us and are pleased you've taken the time to help us improve our system. Many of you have come up with uses we hadn't thought of, which is always rewarding. Even if you haven't actively responded, the fact that you use the system at all makes it worthwhile.
═══ 5. What is new ═══

Compared to the previous major release (1.3.0, released August 1998), 1.4.0 is not functionally so different from its previous versions. This release is primarily a consolidation release, fixing and tidying up some of the lower level aspects of the system to allow better modularity for some of our future planned modules.

Copyright change: the system is now free and has no commercial restriction. Note that currently only the US voices (ked and kal) are unrestricted. The UK English voices depend on the Oxford Advanced Learners' Dictionary of Current English, which cannot be used commercially without permission from Oxford University Press.

Architecture tidy up: the interfaces to the lower level parts of the system have been tidied up, deleting some of the older code that was supported for compatibility reasons. There is now a much heavier dependence on features and easier (and safer) ways to register new objects as feature values and Scheme objects. Scheme has been tidied up. It is no longer "in one defun" but "in one directory".

New documentation system for speech tools: a new docbook-based documentation system has been added to the speech tools. Festival's documentation will move over to this sometime soon too.

Initial JSAPI support: both JSAPI and JSML (somewhat similar to Sable) now have initial implementations. They of course depend on Java support, which so far we have only (successfully) investigated under Solaris and Linux.

Generalization of statistical models: CART, ngrams, and WFSTs are now fully supported from Lisp and can be used with a generalized Viterbi function. This makes adding quite complex statistical models easy without adding new C++.

Tilt intonation modelling: full support is now included for the Tilt intonation models, both training and use.

Documentation on Building New Voices in Festival: documentation, scripts etc. for building new voices and languages in the system; see http://www.cstr.ed.ac.uk/projects/festival/docs/festvox/

═══ 6. Overview ═══

Festival is designed as a speech synthesis system for at least three levels of user. First, those who simply want high quality speech from arbitrary text with the minimum of effort. Second, those who are developing language systems and wish to include synthesis output. In this case, a certain amount of customization is desired, such as different voices, specific phrasing, dialog types etc. The third level is in developing and testing new synthesis methods.

This manual is not designed as a tutorial on converting text to speech but as documentation of the processes and use of our system. We do not discuss the detailed algorithms involved in converting text to speech or the relative merits of multiple methods, though we will often give references to relevant papers when describing the use of each module.

For more general information about text to speech we recommend Dutoit's 'An introduction to Text-to-Speech Synthesis' dutoit97. For more detailed research issues in TTS see sproat98 or vansanten96.

Philosophy: Why we did it like it is
Future: How much better it's going to get

═══ 6.1. Philosophy ═══

One of the biggest problems in the development of speech synthesis, and other areas of speech and language processing systems, is that there are a lot of simple well-known techniques lying around which can help you realise your goal. But in order to improve some part of the whole system it is necessary to have a whole system in which you can test and improve your part.
Festival is intended as that whole system in which you may simply work on your small part to improve the whole. Without a system like Festival, before you could even start to test your new module you would need to spend significant effort building a whole system, or adapting an existing one, before you could start working on your improvements. Festival is specifically designed to allow the addition of new modules, easily and efficiently, so that development need not get bogged down in re-implementing the wheel.

But there is another aspect of Festival which makes it more useful than simply an environment for researching new synthesis techniques. It is a fully usable text-to-speech system suitable for embedding in other projects that require speech output. The provision of a fully working, easy-to-use speech synthesizer in addition to just a testing environment is good for two specific reasons. First, it offers a conduit for our research, in that our experiments can quickly and directly benefit users of our synthesis system. And secondly, in ensuring we have a fully working usable system we can immediately see what problems exist and where our research should be directed, rather than where our whims take us.

These concepts are not unique to Festival. ATR's CHATR system (black94) follows very much the same philosophy and Festival benefits from the experiences gained in the development of that system. Festival benefits from various pieces of previous work. As well as CHATR, CSTR's previous synthesizers, Osprey and the Polyglot projects, influenced many design decisions. We are also influenced by more general programs in considering software engineering issues, especially GNU Octave and Emacs, on which the basic script model was based.

Unlike in some other speech and language systems, software engineering is considered very important to the development of Festival. Too often research systems consist of random collections of hacky little scripts and code. No one person can confidently describe the algorithms they perform, as parameters are scattered throughout the system, with tricks and hacks making it impossible to really evaluate why the system is good (or bad). Such systems do not help the advancement of speech technology, except perhaps in pointing at ideas that should be further investigated. If the algorithms and techniques cannot be described externally from the program such that they can be reimplemented by others, what is the point of doing the work?

Festival offers a common framework where multiple techniques may be implemented (by the same or different researchers) so that they may be tested more fairly in the same environment.

As a final word, we'd like to make two short statements which both achieve the same end but unfortunately perhaps not for the same reasons:

Good software engineering makes good research easier

But the following seems to be true also

If you spend enough effort on something it can be shown to be better than its competitors.

═══ 6.2. Future ═══

Festival is still very much in development. Hopefully this state will continue for a long time. It is never possible to complete software; there are always new things that can make it better. However, as time goes on Festival's core architecture will stabilise and few or no changes will be made to it. Other aspects of the system will then gain greater attention, such as waveform synthesis modules, intonation techniques, text type dependent analysers etc. Festival will improve, so don't expect it to be the same six months from now.
A number of new modules and enhancements are already under consideration at various stages of implementation. The following is a non-exhaustive list of what we may (or may not) add to Festival over the next six months or so.

Selection-based synthesis: moving away from diphone technology to more generalized selection of units from a speech database.

New structure for linguistic content of utterances: using techniques from Metrical Phonology, we are building more structured representations of utterances, reflecting their linguistic significance better. This will allow improvements in prosody and unit selection.

Non-prosodic prosodic control: for language generation systems and custom tasks where the speech to be synthesized is being generated by some program, more information about text structure will probably exist, such as phrasing, contrast, key items etc. We are investigating the relationship of high-level tags to prosodic information through the Sole project, http://www.cstr.ed.ac.uk/projects/sole.html

Dialect independent lexicons: currently for each new dialect we need a new lexicon. We are investigating a form of lexical specification that is dialect independent and allows the core form to be mapped to different dialects. This will make the generation of voices in different dialects much easier.

═══ 7. Installation ═══

This section describes how to install Festival from source in a new location and customize that installation.

Requirements: Software/Hardware requirements for Festival
Configuration: Setting up compilation
Site initialization: Settings for your particular site
Checking an installation: But does it work ...
Y2K: Comment on Festival and year 2000

═══ 7.1. Requirements ═══

In order to compile Festival you first need the following source packages:

festival-1.4.1.tar.gz: Festival Speech Synthesis System source.
speech_tools-1.2.1.tar.gz: The Edinburgh Speech Tools Library.
festlex_NAME.tar.gz: The lexicon distributions, which, where possible, include the lexicon input file as well as the compiled form, for your convenience. The lexicons have varying distribution policies, but are all free except OALD, which is only free for non-commercial use (we are working on a free replacement). In some cases only a pointer to an ftp'able file plus a program to convert that file to the Festival format is included.
festvox_NAME.tar.gz: You'll need a speech database. A number are available (with varying distribution policies). Each voice may have other dependencies, such as requiring particular lexicons.
festdoc_1.4.1.tar.gz: Full postscript, info and html documentation for Festival and the Speech Tools. The source of the documentation is available in the standard distributions, but for your convenience it has been pre-generated.

In addition to the Festival specific sources you will also need:

A UNIX machine: currently we have compiled and tested the system under Solaris (2.5(.1), 2.6 and 2.7), SunOS (4.1.3), FreeBSD 2.2 and 3.x, and Linux (Redhat 4.1, 5.0, 5.1, 5.2, 6.0 and other Linux distributions), and it should work under OSF (Dec Alphas), SGI (Irix) and HPs (HPUX). But any standard UNIX machine should be acceptable. We have now successfully ported this version to Windows NT and Windows 95 (using the Cygnus GNU win32 environment). This is still a young port but seems to work.

A C++ compiler: note that C++ is not very portable even between different versions of the compiler from the same vendor.
Although we've tried very hard to make the system portable, we know it is very unlikely to compile without change except with compilers that have already been tested. The currently tested systems are:

1. Sun Sparc Solaris 2.5, 2.5.1, 2.6, 2.7: GCC 2.7.2, GCC 2.8.1, SunCC 4.1, egcs 1.1.1, egcs 1.1.2, GCC 2.95.1
2. Sun Sparc SunOS 4.1.3: GCC 2.7.2
3. Intel SunOS 2.5.1: GCC 2.7.2
4. FreeBSD for Intel 2.2.1, 2.2.6 and 3.x (ELF based): GCC 2.7.2.1, GCC 2.95.1
5. Linux (2.0.30) for Intel (RedHat 4.1/5.0/5.1/5.2/6.0): GCC 2.7.2, GCC 2.7.2/egcs-1.0.2, egcs 1.1.1, egcs-1.1.2, GCC 2.95.1
6. Windows NT 4.0: GCC 2.7.2 plus egcs (from Cygnus GNU win32 b19), Visual C++ PRO v5.0

Note that if GCC works on one version of Unix it usually works on others. We still recommend GCC 2.7.2, which we use as our standard compiler. It is (mostly) standard across platforms and compiles faster and produces better code than any of the other compilers we've used. We have compiled both the speech tools and Festival under Windows NT 4.0 and Windows 95 using the GNU tools available from Cygnus, ftp://ftp.cygnus.com/pub/gnu-win32/.

GNU make: due to there being too many different make programs out there, we have tested the system using GNU make on all systems we use. Others may work but we know GNU make does.

Audio hardware: you can use Festival without audio output hardware, but it doesn't sound very good (though admittedly you will hear fewer problems with it). A number of audio systems are supported (directly inherited from the audio support in the Edinburgh Speech Tools Library): NCD's NAS (formerly called netaudio), a network transparent audio system (which can be found at ftp://ftp.x.org/contrib/audio/nas/); '/dev/audio' (at 8k ulaw and 8/16bit linear), found on Suns, Linux machines and FreeBSD; and a method allowing arbitrary UNIX commands. See Audio output.

Earlier versions of Festival mistakenly offered a command line editor interface to the GNU package readline, but due to conflicts with the GNU Public Licence and Festival's licence this interface was removed in version 1.3.1. Even Festival's new free licence would cause problems, as readline support would restrict Festival from linking with non-free code. A new command line interface based on editline was provided that offers similar functionality. Editline remains a compilation option as it is probably not yet as portable as we would like it to be.

In addition to the above, in order to process the documentation you will need 'TeX', 'dvips' (or similar), GNU's 'makeinfo' (part of the texinfo package) and 'texi2html', which is available from http://wwwcn.cern.ch/dci/texi2html/. However the document files are also available pre-processed into postscript, DVI, info and html as part of the distribution in 'festdoc-1.4.X.tar.gz'.

Most of the related software not part of the Festival distribution has been made available in ftp://ftp.cstr.ed.ac.uk/pub/festival/extras/

Ensure you have a fully installed and working version of your C++ compiler. Most of the problems people have had in installing Festival have been due to incomplete or bad compiler installations. It might be worth checking if the following program works if you don't know whether anyone has used your C++ installation before.

#include <iostream.h>
int main (int argc, char **argv)
{
    cout << "Hello world\n";
}

Unpack all the source files in a new directory. The directory will then contain two subdirectories:

speech_tools/
festival/

═══ 7.2. Configuration ═══

First ensure you have a compiled version of the Edinburgh Speech Tools Library.
See 'speech_tools/INSTALL' for instructions.

Before compilation of Festival it is necessary to configure your installation to be aware of the environment it is being compiled in. Specifically it must know the names of various local programs, such as your compiler; the directories where local libraries are held; and choices for various options about sub-systems it is to use. In most cases this can be done automatically if your system is supported; otherwise it may be necessary to edit a few lines.

All compilation information is set in a local per-installation file called 'config/config'. You should copy the example one, mark it writable and edit it according to your local set up.

cd config/
cp config-dist config
chmod +w config

'config/config' is included by all 'Makefiles' in the system and therefore should be the only place machine specific information need be changed. Note that all 'Makefiles' define the variable TOP to allow appropriate relative addressing of directories within the 'Makefiles' and their included files.

For the most part Festival's configuration inherits the configuration from your speech tools config file ('../speech_tools/config/config'). Additional optional modules may be added by adding them to the end of your config file, e.g.

ALSO_INCLUDE += clunits

Adding a new module here will treat it as a new directory in 'src/modules/' and compile it into the system in the same way the OTHER_DIRS feature was used in previous versions.

If the compilation directory is being accessed by NFS, or if you use an automounter (e.g. amd), it is recommended to explicitly set the variable FESTIVAL_HOME in 'config/config'. The command pwd is not reliable when a directory may have multiple names.

To check your configuration, type the following in the 'festival/' directory

gnumake info

If that seems fine, compile the system with

gnumake

On completion you can check the system with

gnumake test

Note that the single most common reason for problems in compilation and linking found amongst the beta testers was a bad installation of GNU C++. If you get many strange errors in G++ library header files or link errors, it is worth checking that your system has the compiler, header files and runtime libraries properly installed. This may be checked by compiling a simple program under C++ and also by finding out if anyone at your site has ever used the installation. Most of these installation problems are caused by upgrading to a newer version of libg++ without removing the older version, so that a mixed version of the '.h' files exists.

Although we have tried very hard to ensure that Festival compiles with no warnings, this is not possible under some systems. Under SunOS the system include files do not declare a number of system provided functions. This is a bug in Sun's include files. It will cause warnings like "implicit definition of fprintf". These are harmless. Under Sun's CC compiler a number of warnings are given about not being able to find source, particularly for operator << and some == operators. It is unclear why this should be a warning, as the code exists in other files deliberately for modularity purposes and should not be visible in these files anyway. These warnings are harmless. Under Linux a warning at link time about reducing the size of some symbols is often produced. This is harmless. There are also occasional warnings about some socket system function having an incorrect argument type; these too are harmless.
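Before moving on to the Windows build, here is the whole Unix configuration and build sequence collected in one place as a sketch; the clunits line is only an illustrative optional module, and paths should be adjusted to your own tree.

# in the festival/ directory, assuming speech_tools/ is already built
cd config
cp config-dist config
chmod +w config
# optionally append extra modules to the end of config, e.g.
#   ALSO_INCLUDE += clunits
cd ..
gnumake info     # check the configuration
gnumake          # compile
gnumake test     # run the basic tests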
The speech tools and festival compile under Windows95 or Windows NT with Visual C++ v5.0 using the Microsoft 'nmake' make program. We've only done this with the Professional edition, but have no reason to believe that it relies on anything not in the standard edition.

In accordance with VC++ conventions, object files are created with extension .obj, executables with extension .exe and libraries with extension .lib. This may mean that both unix and Win32 versions can be built in the same directory tree, but I wouldn't rely on it.

To do this you require nmake Makefiles for the system. These can be generated from the gnumake Makefiles, using the command

gnumake VCMakefile

in the speech_tools and festival directories. I have only done this under unix; it's possible it would work under the cygnus gnuwin32 system.

If 'make.depend' files exist (i.e. if you have done 'gnumake depend' in unix), equivalent 'vc_make.depend' files will be created; if not, the VCMakefiles will not contain dependency information for the '.cc' files. The result will be that you can compile the system once, but changes will not cause the correct things to be rebuilt.

In order to compile from the DOS command line using Visual C++ you need to have a collection of environment variables set. In Windows NT there is an installation option for Visual C++ which sets these globally. Under Windows95, or if you don't ask for them to be set globally under NT, you need to run

vcvars32.bat

See the VC++ documentation for more details.

Once you have the source trees with VCMakefiles somewhere visible from Windows, you need to copy 'speech_tools\config\vc_config-dist' to 'speech_tools\config\vc_config' and edit it to suit your local situation. Then do the same with 'festival\config\vc_config-dist'. The thing most likely to need changing is the definition of FESTIVAL_HOME in 'festival\config\vc_config_make_rules', which needs to point to where you have put festival.

Now you can compile. cd to the speech_tools directory and do

nmake /nologo /fVCMakefile

and the library, the programs in main and the test programs should be compiled. The tests can't be run automatically under Windows. A simple test to check that things are probably OK is:

main\na_play testsuite\data\ch_wave.wav

which reads and plays a waveform. Next go into the festival directory and do

nmake /nologo /fVCMakefile

to build festival. When it's finished, and assuming you have the voices and lexicons unpacked in the right place, festival should run just as under unix.

We should remind you that the NT/95 ports are still young and there may yet be problems that we've not found. We only recommend use of the speech tools and Festival under Windows if you have significant experience in C++ under those platforms.

Most of the modules in 'src/modules' are actually optional and the system could be compiled without them. The basic set could be reduced further if certain facilities are not desired. Particularly: 'donovan', which is only required if the donovan voice is used; 'rxp', if no XML parsing is required (e.g. Sable); and 'parser', if no stochastic parsing is required (this parser isn't used for any of our currently released voices). Actually, even 'UniSyn' and 'UniSyn_diphone' could be removed if some external waveform synthesizer is being used (e.g. MBROLA) or some alternative one like 'OGIresLPC'. Removing unused modules will make the festival binary smaller and (potentially) start up faster, but don't expect too much.
You can delete these by changing the BASE_DIRS variable in 'src/modules/Makefile'.

═══ 7.3. Site initialization ═══

Once compiled, Festival may be further customized for particular sites. At start up time Festival loads the file 'init.scm' from its library directory. This file further loads other necessary files such as phoneset descriptions, duration parameters, intonation parameters, definitions of voices etc. It will also load the files 'sitevars.scm' and 'siteinit.scm' if they exist. 'sitevars.scm' is loaded after the basic Scheme library functions are loaded but before any of the festival related functions are loaded. This file is intended to set various path names before various subsystems are loaded. Typically variables such as lexdir (the directory where the lexicons are held) and voices_dir (pointing to voice directories) should be reset here if necessary.

The default installation will try to find its lexicons and voices automatically based on the value of load-path (this is derived from FESTIVAL_HOME at compilation time or by using --libdir at run-time). If the voices and lexicons have been unpacked into subdirectories of the library directory (the default), then no site specific initialization of the above pathnames will be necessary.

The second site specific file is 'siteinit.scm'. Typical examples of local initialization are as follows. The default audio output method is NCD's NAS system if that is supported, as that's what we use normally in CSTR. If it is not supported, any hardware specific mode is the default (e.g. sun16audio, freebsd16audio, linux16audio or mplayeraudio). But that default is just a setting in 'init.scm'. If, for example, you wish the default audio output method in your environment to be 8k mulaw through '/dev/audio', you should add the following line to your 'siteinit.scm' file

(Parameter.set 'Audio_Method 'sunaudio)

Note the use of Parameter.set rather than Parameter.def; the second function will not reset the value if it is already set. Remember that you may use the audio methods sun16audio, linux16audio or freebsd16audio only if NATIVE_AUDIO was selected in 'speech_tools/config/config' and you are on such a machine. The Festival variable *modules* contains a list of all supported functions/modules in a particular installation, including audio support. Check the value of that variable if things aren't what you expect.

If you are installing on a machine whose audio is not directly supported by the speech tools library, an external command may be executed to play a waveform. The following example is for an imaginary machine that can play audio files through a program called 'adplay' with arguments for sample rate and file type. When playing waveforms, Festival, by default, outputs an unheadered waveform in native byte order. In this example you would set up the default audio playing mechanism in 'siteinit.scm' as follows

(Parameter.set 'Audio_Method 'Audio_Command)
(Parameter.set 'Audio_Command "adplay -raw -r $SR $FILE")

For the Audio_Command method of playing waveforms Festival supports two additional audio parameters. Audio_Required_Rate allows you to use Festival's internal sample rate conversion function to convert to any desired rate. Note this may not be as good as playing the waveform at the sample rate it was originally created at, but as some hardware devices are restrictive in what sample rates they support, or have naive resample functions, this could be optimal.
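For example, a minimal sketch for 'siteinit.scm' (the 'adplay' program is the imaginary player from above, and the 16000 Hz rate is just an assumed value for illustration) that forces all output to be resampled to 16kHz before it is handed to the play command:

(Parameter.set 'Audio_Method 'Audio_Command)
;; assume the imaginary player only accepts 16kHz raw audio
(Parameter.set 'Audio_Required_Rate 16000)
(Parameter.set 'Audio_Command "adplay -raw -r $SR $FILE")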
The second additional audio parameter is Audio_Required_Format, which can be used to specify the desired output format of the file. The default is unheadered raw, but this may be any of the values supported by the speech tools (including nist, esps, snd, riff, aiff, audlab, raw and, if you really want it, ascii). For example, suppose you run Festival on a remote machine, are not running any network audio system, and want Festival to copy files back to your local machine and simply cat them to '/dev/audio'. The following would do that (assuming permissions for rsh are allowed).

(Parameter.set 'Audio_Method 'Audio_Command)
;; Make output file ulaw 8k (format ulaw implies 8k)
(Parameter.set 'Audio_Required_Format 'ulaw)
(Parameter.set 'Audio_Command
 "userhost=`echo $DISPLAY | sed 's/:.*$//'`; rcp $FILE $userhost:$FILE; \
  rsh $userhost \"cat $FILE >/dev/audio\" ; rsh $userhost \"rm $FILE\"")

Note there are limits on how complex a command you want to put in the Audio_Command string directly. It can get very confusing with respect to quoting. It is therefore recommended that once you get past a certain complexity you consider writing a simple shell script and calling it from the Audio_Command string.

A second typical customization is setting the default speaker. Speakers depend on many things, but due to various licence (and resource) restrictions you may only have some diphone/nphone databases available in your installation. The function name that is the value of voice_default is called immediately after 'siteinit.scm' is loaded, offering the opportunity for you to change it. In the standard distribution no change should be required. If you download all the distributed voices, voice_rab_diphone is the default voice. You may change this for a site by adding the following to 'siteinit.scm', or per person by changing your '.festivalrc'. For example, if you wish to change the default voice to the American one, voice_ked_diphone

(set! voice_default 'voice_ked_diphone)

Note the single quote, and note that, unlike in early versions, voice_default is not a function you can call directly.

A second level of customization is on a per user basis. After loading 'init.scm', which includes 'sitevars.scm' and 'siteinit.scm' for local installation, Festival loads the file '.festivalrc' from the user's home directory (if it exists). This file may contain arbitrary Festival commands. For example, a particular installation of Festival may set Spanish as the default language by adding (language_spanish) in 'siteinit.scm', while a user may wish their version to use Welsh by default. In this case they would add (language_welsh) to their '.festivalrc' in their home directory.

═══ 7.4. Checking an installation ═══

Once compiled and site initialization is set up, you should test to see if Festival can speak or not. Start the system

$ bin/festival
Festival Speech Synthesis System 1.4.1:release November 1999
Copyright (C) University of Edinburgh, 1996-1999. All rights reserved.
For details type `(festival_warranty)'
festival> ^D

If errors occur at this stage they most likely have to do with pathname problems. If any error messages are printed about non-existent files, check that those pathnames point to where you intended them to be. Most of the (default) pathnames are dependent on the basic library path. Ensure that is correct. To find out what it has been set to, start the system without loading the init files.

$ bin/festival -q
Festival Speech Synthesis System 1.4.1:release November 1999
Copyright (C) University of Edinburgh, 1996-1999.
All rights reserved. For details type `(festival_warranty)'
festival> libdir
"/projects/festival/lib/"
festival> ^D

This should show the pathname you set in your 'config/config'.

If the system starts with no errors, try to synthesize something

festival> (SayText "hello world")

Some files are only accessed at synthesis time, so this may show up other problem pathnames. If it talks, you're in business; if it doesn't, here are some possible problems.

If you get the error message

Can't access NAS server

you have selected NAS as the audio output but have no server running on that machine, or your DISPLAY or AUDIOSERVER environment variable is not set properly for your output device. Either set these properly or change the audio output device in 'lib/siteinit.scm' as described above.

Ensure your audio device actually works the way you think it does. On Suns, the audio output device can be switched into a number of different output modes: speaker, jack, headphones. If this is set to the wrong one you may not hear the output. Use one of Sun's tools to change this (try '/usr/demo/SOUND/bin/soundtool'). Try to find an audio file independent of Festival and get it to play on your audio device. Once you have done that, ensure that the audio output method set in Festival matches it.

Once you have got it talking, test the audio spooling device.

festival> (intro)

This plays a short introduction of two sentences, spooling the audio output.

Finally exit from Festival (by end of file or (quit)) and test the script mode with

$ examples/saytime

A test suite is included with Festival, but it makes certain assumptions about which voices are installed. It assumes that voice_rab_diphone ('festvox_rabxxxx.tar.gz') is the default voice and that voice_ked_diphone and voice_don_diphone ('festvox_kedxxxx.tar.gz' and 'festvox_don.tar.gz') are installed. Also, local settings in your 'festival/lib/siteinit.scm' may affect these tests. However, after installation it may be worth trying

gnumake test

from the 'festival/' directory. This will do various tests including basic utterance tests and tokenization tests. It also checks that voices are installed and that they don't interfere with each other. These tests are primarily regression tests for the developers of Festival, to ensure new enhancements don't mess up existing supported features. They are not designed to test whether an installation is successful, though if they run correctly it is most probable the installation has worked.

═══ 7.5. Y2K ═══

Festival comes with no warranty, therefore we will not make any legal statement about the performance of the system. However, a number of people have asked about Festival and Y2K compliance, and we have decided to make some comments on this. Every effort has been made to ensure that Festival will continue running as before into the next millennium. However, even if Festival itself has no problems, it is dependent on the operating system environment it is running in. During compilation, dates on files are important and the compilation process may not work if your machine cannot assign (reasonable) dates to new files. At run time there is less dependence on system dates and times. Specifically, times are used in the generation of random numbers (where only relative time is important) and as time stamps in log files when Festival runs in server mode, thus we feel it is unlikely there will be any problems.
However, as a speech synthesizer, Festival must make explicit decisions about the pronunciation of dates in the next two decades when people themselves have not yet made such decisions. Most people are still unsure how to read years written as '01, '04, '12, 00s, 10s (cf. '86, 90s). It is interesting to note that while there is a convenient short name for the last decade of the twentieth century, the "nineties", there is no equivalent name for the first decade of the twenty-first century (or the second). In the meantime we have made reasonable decisions about such pronunciations. Once people have themselves become Y2K compliant and decided what to actually call these years, if their choices are different from how Festival pronounces them, we reserve the right to change how Festival speaks these dates to match their belated decisions. However, as we do not give out warranties about compliance, we will not be requiring our users to return signed Y2K compliance warranties about their own compliance either.

═══ 8. Quick start ═══

This section is for those who just want to know the absolute basics to run the system.

Festival works in two fundamental modes, command mode and text-to-speech mode (tts-mode). In command mode, information (in files or through standard input) is treated as commands and is interpreted by a Scheme interpreter. In tts-mode, information (in files or through standard input) is treated as text to be rendered as speech. The default mode is command mode, though this may change in later versions.

Basic command line options
Sample command driven session
Getting some help

═══ 8.1. Basic command line options ═══

Festival's basic calling method is

festival [options] file1 file2 ...

Options may be any of the following:

-q : start Festival without loading 'init.scm' or the user's '.festivalrc'

-b, --batch : after processing any file arguments do not become interactive

-i, --interactive : after processing file arguments become interactive. This option overrides any batch argument.

--tts : treat file arguments in text-to-speech mode, causing them to be rendered as speech rather than interpreted as commands. When selected in interactive mode the command line edit functions are not available.

--command : treat file arguments in command mode. This is the default.

--language LANG : set the default language to LANG. Currently LANG may be one of english, spanish or welsh (depending on what voices are actually available in your installation).

--server : after loading any specified files, go into server mode. This is a mode where Festival waits for clients on a known port (the value of server_port, default is 1314). Connected clients may send commands (or text) to the server and expect waveforms back. See Server/client API. Note server mode may be unsafe and allow unauthorised access to your machine; be sure to read the security recommendations in Server/client API.

--script scriptfile : run scriptfile as a Festival script file. This is similar to --batch but it encapsulates the command line arguments into the Scheme variables argv and argc, so that Festival scripts may process their command line arguments just like any other program. It also does not load the basic initialisation files, as sometimes you may not want to do this. If you wish them, you should copy the loading sequence from an example Festival script like 'festival/examples/saytext'.

--heap NUMBER : the Scheme heap (basic number of Lisp cells) is of a fixed size and cannot be dynamically increased at run time (this would complicate garbage collection).
The default size is 210000, which seems to be more than adequate for most work. In some of our training experiments, where very large list structures are required, it is necessary to increase this. Note there is a trade off between the size of the heap and the time it takes to garbage collect, so making this unnecessarily big is not a good idea. If you don't understand the above explanation you almost certainly don't need to use the option.

In command mode, if the file name starts with a left parenthesis, the name itself is read and evaluated as a Lisp command. This is often convenient when running in batch mode and a simple command is necessary to start the whole thing off after loading in some other specific files.

═══ 8.2. Sample command driven session ═══

Here is a short session using Festival's command interpreter.

Start Festival with no arguments

$ festival
Festival Speech Synthesis System 1.4.1:release November 1999
Copyright (C) University of Edinburgh, 1996-1999. All rights reserved.
For details type `(festival_warranty)'
festival>

Festival uses a command line editor based on editline for terminal input, so command line editing may be done with Emacs commands. Festival also supports history as well as function, variable name, and file name completion via the TAB key.

Typing help will give you more information; that is, help without any parentheses. (It is actually a variable name whose value is a string containing help.)

Festival offers what is called a read-eval-print loop, because it reads an s-expression (atom or list), evaluates it and prints the result. As Festival includes the SIOD Scheme interpreter, most standard Scheme commands work

festival> (car '(a d))
a
festival> (+ 34 52)
86

In addition to standard Scheme commands, a number of commands specific to speech synthesis are included. Although, as we will see, there are simpler methods for getting Festival to speak, here are the basic underlying explicit functions used in synthesizing an utterance. Utterances can consist of various types (see Utterance types), but the simplest form is plain text. We can create an utterance and save it in a variable

festival> (set! utt1 (Utterance Text "Hello world"))
#<Utterance ...>
festival>

The (hex) number in the return value may be different for your installation. That is the print form for utterances. Their internal structure can be very large, so only a token form is printed.

Although this creates an utterance, it doesn't do anything else. To get a waveform you must synthesize it.

festival> (utt.synth utt1)
#<Utterance ...>
festival>

This calls various modules, including tokenizing, duration, intonation etc. Which modules are called is defined with respect to the type of the utterance, in this case Text. It is possible to individually call the modules by hand but you just wanted it to talk didn't you. So

festival> (utt.play utt1)
#<Utterance ...>
festival>

will send the synthesized waveform to your audio device. You should hear "Hello world" from your machine.

To make this all easier, a small function doing these three steps exists. SayText simply takes a string of text, synthesizes it and sends it to the audio device.

festival> (SayText "Good morning, welcome to Festival")
#<Utterance ...>
festival>

Of course, as history and command line editing are supported, c-p or up-arrow will allow you to edit the above to whatever you wish.

Festival may also synthesize from files rather than simply text.
festival> (tts "myfile" nil)
nil
festival>

The end of file character c-d will exit from Festival and return you to the shell; alternatively the command quit may be called (don't forget the parentheses).

Rather than starting the command interpreter, Festival may synthesize files specified on the command line

unix$ festival --tts myfile
unix$

Sometimes a simple waveform is required from text that is to be kept and played at some later time. The simplest way to do this with festival is by using the 'text2wave' program. This is a festival script that will take a file (or text from standard input) and produce a single waveform. An example use is

text2wave myfile.txt -o myfile.wav

Options exist to specify the waveform file type, for example if Sun audio format is required

text2wave myfile.txt -otype snd -o myfile.wav

Use '-h' on 'text2wave' to see all options.

═══ 8.3. Getting some help ═══

If no audio is generated then you must check to see if audio is properly initialized on your machine. See Audio output.

In the command interpreter, m-h (meta-h) will give you help on the current symbol before the cursor. This will be a short description of the function or variable, how to use it and what its arguments are. A listing of all such help strings appears at the end of this document. m-s will synthesize and say the same information, but this extra function is really just for show.

The lisp function manual will send the appropriate command to an already running Netscape browser process. If nil is given as an argument, the browser will be directed to the table of contents of the manual. If a non-nil value is given, it is assumed to be a section title and that section is searched for and, if found, displayed. For example

festival> (manual "Accessing an utterance")

Another related function is manual-sym which, given a symbol, will check its documentation string for a cross reference to a manual section and request Netscape to display it. This function is bound to m-m and will display the appropriate section for the given symbol.

Note also that the TAB key can be used to find out the names of available commands, as can the function Help (remember the parentheses).

For more up to date information on Festival, regularly check the Festival Home Page at

http://www.cstr.ed.ac.uk/projects/festival.html

Further help is available by mailing questions to

festival-help@cstr.ed.ac.uk

Although we cannot guarantee the time required to answer you, we will do our best to offer help. Bug reports should be submitted to

festival-bug@cstr.ed.ac.uk

If there is enough user traffic, a general mailing list will be created so all users may share comments and receive announcements. In the meantime watch the Festival Home Page for news.

═══ 9. Scheme ═══

Many people seem daunted by the fact that Festival uses Scheme as its scripting language and feel they can't use Festival because they don't know Scheme. However, most of those same people use Emacs every day, which also has (a much more complex) Lisp system underneath. The number of Scheme commands you actually need to know in Festival is really very small and you can easily just find them out as you go along. Also, people use the Unix shell often but only know a small fraction of the actual commands available in the shell (or are even aware that there is a distinction between shell builtin commands and user definable ones). So take it easy, you'll learn the commands you need fairly quickly.
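To give a flavour of how little is typically needed, the following illustrative pair of commands (both covered in detail elsewhere in this manual, and assuming the ked diphone voice is installed) is close to all that many users ever type interactively: pick a voice, then say something.

festival> (voice_ked_diphone)
festival> (SayText "This is about all you need to know.")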
Scheme references: Places to learn more about Scheme
Scheme fundamentals: Syntax and semantics
Scheme Festival specifics
Scheme I/O

═══ 9.1. Scheme references ═══

If you wish to learn about Scheme in more detail I recommend the book abelson85. The Emacs Lisp documentation is reasonable as it is comprehensive and many of the underlying uses of Scheme in Festival were influenced by Emacs. Emacs Lisp however is not Scheme, so there are some differences. Other Scheme tutorials and resources available on the Web are:

The Revised Revised Revised Revised Scheme Report, the document defining the language, available from http://tinuviel.cs.wcu.edu/res/ldp/r4rs-html/r4rs_toc.html
A Scheme tutorial from the net: http://www.cs.uoregon.edu/classes/cis425/schemeTutorial.html
The Scheme FAQ: http://www.landfield.com/faqs/scheme-faq/part1/

═══ 9.2. Scheme fundamentals ═══

But you want more now, don't you, not just be referred to some other book. OK here goes.

Syntax: an expression is an atom or a list. A list consists of a left paren, a number of expressions and a right paren. Atoms can be symbols, numbers, strings or other special types like functions, hash tables, arrays, etc.

Semantics: all expressions can be evaluated. Lists are evaluated as function calls. When evaluating a list, all the members of the list are evaluated first, then the first item (a function) is called with the remaining items in the list as arguments. Atoms are evaluated depending on their type: symbols are evaluated as variables, returning their values. Numbers, strings, functions, etc. evaluate to themselves.

Comments are started by a semicolon and run until end of line.

And that's it. There is nothing more to the language than that. But just in case you can't follow the consequences of that, here are some key examples.

festival> (+ 2 3)
5
festival> (set! a 4)
4
festival> (* 3 a)
12
festival> (define (add a b) (+ a b))
#<CLOSURE ...>
festival> (add 3 4)
7
festival> (set! alist '(apples pears bananas))
(apples pears bananas)
festival> (car alist)
apples
festival> (cdr alist)
(pears bananas)
festival> (set! blist (cons 'oranges alist))
(oranges apples pears bananas)
festival> (append alist blist)
(apples pears bananas oranges apples pears bananas)
festival> (cons alist blist)
((apples pears bananas) oranges apples pears bananas)
festival> (length alist)
3
festival> (length (append alist blist))
7

═══ 9.3. Scheme Festival specifics ═══

There are a number of additions to SIOD that are Festival specific, though still part of the Lisp system rather than the synthesis functions per se.

By convention, if the first statement of a function is a string, it is treated as a documentation string. The string will be printed when help is requested for that function symbol.

In interactive mode, if the function :backtrace is called (within parentheses) the previous stack trace is displayed. Calling :backtrace with a numeric argument will display that particular stack frame in full. Note that any command other than :backtrace will reset the trace. You may optionally call

(set_backtrace t)

which will cause a backtrace to be displayed whenever a Scheme error occurs. This can be put in your '.festivalrc' if you wish. This is especially useful when running Festival in non-interactive mode (batch or script mode) so that more information is printed when an error occurs.

A hook, in Lisp terms, is a position within some piece of code where a user may specify their own customization. The notion is used heavily in Emacs. In Festival there are a number of places where hooks are used.
When an error occurs, in either Scheme or the C++ part of Festival, by default the system jumps to the top level, resets itself and continues. Note that errors are usually serious things, pointing to bugs in parameters or code. Every effort has been made to ensure that the processing of text never causes errors in Festival. However when using Festival as a development system it is common for errors to occur in code.

Sometimes in writing Scheme code you know there is a potential for an error but you wish to ignore it and continue on to the next thing without exiting or stopping and returning to the top level. For example, suppose you are processing a number of utterances from a database and some of the files containing the descriptions have errors in them, but you want your processing to continue through every utterance that can be processed rather than stopping five minutes after you have gone home, having set a big batch job going overnight.

Festival's Scheme provides the function unwind-protect which allows the catching of errors and then continuing normally. For example, suppose you have the function process_utt which takes a filename and does things which you know might cause an error. You can write the following to ensure you continue processing even if an error occurs.

(unwind-protect
 (process_utt filename)
 (begin
  (format t "Error found in processing %s\n" filename)
  (format t "continuing\n")))

The unwind-protect function takes two arguments. The first is evaluated and, if no error occurs, the value returned from that expression is returned. If an error does occur while evaluating the first expression, the second expression is evaluated. unwind-protect may be used recursively. Note that all files opened while evaluating the first expression are closed if an error occurs. All global variables outside the scope of the unwind-protect will be left as they were set up until the error. Care should be taken in using this function, but its power is necessary to be able to write robust Scheme code.
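As a small illustration of the documentation string convention mentioned at the start of this section, here is a sketch of a trivial user-defined function; the function and its body are hypothetical, and the leading string becomes the help text shown by the interactive help commands.

(define (double x)
"(double X)
Return X added to itself.  This leading string is treated as the
function's documentation string."
 (+ x x))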
═══ 9.4. Scheme I/O ═══

Different Schemes may have quite different implementations of file I/O functions, so in this section we describe the basic functions in Festival's SIOD regarding I/O.

Simple printing to the screen may be achieved with the function print, which prints the given s-expression to the screen. The printed form is preceded by a new line. This is often useful for debugging but isn't really powerful enough for much else.

Files may be opened and closed and referred to by file descriptors in a direct analogy to C's stdio library. The SIOD functions fopen and fclose work in exactly the same way as their equivalently named partners in C.

The format command follows the command of the same name in Emacs and a number of other Lisps. C programmers can think of it as fprintf. format takes a file descriptor, a format string and arguments to print. The file descriptor may be one returned by the Scheme function fopen; it may also be t, which means the output will be directed to standard output (cf. printf). A third possibility is nil, which will cause the output to be printed to a string which is returned (cf. sprintf).

The format string closely follows the format strings in ANSI C, but it is not the same. Specifically the directives currently supported are %%, %d, %x, %s, %f, %g and %c. All modifiers for these are also supported. In addition %l is provided for printing Scheme objects as objects. For example

(format t "%03d %3.4f %s %l %l %l\n" 23 23 "abc" "abc" '(a b d) utt1)

will produce

023 23.0000 abc "abc" (a b d) #

on standard output.

When large lisp expressions are printed they are difficult to read because of the parentheses. The function pprintf prints an expression to a file descriptor (or t for standard output). It prints so that the s-expression is nicely lined up and indented. This is often called pretty printing in Lisps.

For reading input from terminal or file, there is currently no equivalent to scanf. Items may only be read as Scheme expressions. The command

(load FILENAME t)

will load all s-expressions in FILENAME and return them, unevaluated, as a list. Without the extra argument t, the load function will load and evaluate each s-expression in the file.

To read individual s-expressions use readfp. For example

(let ((fd (fopen trainfile "r"))
      (entry)
      (count 0))
 (while (not (equal? (set! entry (readfp fd)) (eof-val)))
  (if (string-equal (car entry) "home")
      (set! count (+ 1 count))))
 (fclose fd))

To convert a symbol whose print name is a number to a number use parse-number. This is the equivalent of atof in C.

Note that all I/O from Scheme input files is assumed to be basically some form of Scheme data (though it can be just numbers or tokens). For more elaborate analysis of incoming data it is possible to use the text tokenization functions, which offer a fully programmable method of reading data.
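Putting a few of these functions together, here is a minimal sketch of writing a small s-expression to a file and reading it back; the filename is purely illustrative.

(let ((fd (fopen "/tmp/report.scm" "w")))
 (format fd "(home %d)\n" 3)   ;; write an s-expression as text
 (fclose fd))

(let ((fd (fopen "/tmp/report.scm" "r")))
 (print (readfp fd))           ;; prints (home 3)
 (fclose fd))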
═══ 10. TTS ═══

Festival supports text to speech for raw text files. If you are not interested in using Festival in any other way except as a black box for rendering text as speech, the following method is probably what you want.

festival --tts myfile

This will say the contents of 'myfile'. Alternatively text may be submitted on standard input

echo hello world | festival --tts
cat myfile | festival --tts

Festival supports the notion of text modes where the text file type may be identified, allowing Festival to process the file in an appropriate way. Currently only two types are considered stable: STML and raw, but other types such as email, HTML, Latex, etc. are being developed and are discussed below. This follows the idea of buffer modes in Emacs, where a file's type can be utilized to best display the text. Text mode may also be selected based on a filename's extension. Within the command interpreter the function tts is used to render text files as speech; it takes a filename and the text mode as arguments.

Utterance chunking: From text to utterances
Text modes: Mode specific text analysis
Example text mode: An example mode for reading email

═══ 10.1. Utterance chunking ═══

Text to speech works by first tokenizing the file and chunking the tokens into utterances. The definition of utterance breaks is determined by the utterance tree in the variable eou_tree. A default version is given in 'lib/tts.scm'. This uses a decision tree to determine what signifies an utterance break. Obviously blank lines are probably the most reliable, followed by certain punctuation. The use of periods for both sentence breaks and abbreviations requires some more heuristics to best guess their different uses.

The following tree is currently used, which works better than simply using punctuation.

(defvar eou_tree
 '((n.whitespace matches ".*\n.*\n\\(.\\|\n\\)*") ;; 2 or more newlines
   ((1))
   ((punc in ("?" ":" "!"))
    ((1))
    ((punc is ".")
     ;; This is to distinguish abbreviations vs periods
     ;; These are heuristics
     ((name matches "\\(.*\\..*\\|[A-Z][A-Za-z]?[A-Za-z]?\\|etc\\)")
      ((n.whitespace is " ")
       ((0))                     ;; if abbrev single space isn't enough for break
       ((n.name matches "[A-Z].*")
        ((1))
        ((0))))
      ((n.whitespace is " ")     ;; if it doesn't look like an abbreviation
       ((n.name matches "[A-Z].*") ;; single space and non-cap is no break
        ((1))
        ((0)))
       ((1))))
     ((0))))))

The token items this is applied to will always (except in the end of file case) include one following token, so look ahead is possible. The "n.", "p." and "p.p." prefixes allow access to the surrounding token context. The features name, whitespace and punc allow access to the contents of the token itself. At present there is no way to access the lexicon from this tree, which is unfortunate, as it might be useful if certain abbreviations were identified as such there.

Note these are heuristics, written by hand rather than trained from data, though problems have been fixed as they have been observed in data. The above rules may make mistakes where abbreviations appear at the ends of lines, and when improper spacing and capitalization is used. This is probably worth changing for modes where more casual text appears, such as email messages and USENET news messages. A possible improvement could be made by analysing a text to find out its basic threshold of utterance break (i.e. if no sequence of full stop, two spaces and a capitalized word appears, and the text is of a reasonable length, then look for other criteria for utterance breaks).

Ultimately what we are trying to do is to chunk the text into utterances that can be synthesized quickly and start to play them quickly, to minimise the time someone has to wait for the first sound when starting synthesis. Thus it would be better if this chunking were done on prosodic phrases rather than chunks more similar to linguistic sentences. Prosodic phrases are bounded in size, while sentences are not.

═══ 10.2. Text modes ═══

We do not believe that all texts are of the same type. Often information about the general contents of a file will aid synthesis greatly. For example, in Latex files we do not want to hear "left brace, backslash e m" before each emphasized word, nor do we necessarily want to hear formatting commands. Festival offers a basic method for specifying customization rules depending on the mode of the text. In this we are following the notion of modes in Emacs, and eventually we will allow customization at a similar level.

Modes are specified as the second argument to the function tts. When using the Emacs interface to Festival the buffer mode is automatically passed as the text mode. If the mode is not supported a warning message is printed and the raw text mode is used.

Our initial text mode implementation allows configuration both in C++ and in Scheme. Obviously in C++ almost anything can be done, but it is not as easy to reconfigure without recompilation. Here we will discuss those modes which can be fully configured at run time. A text mode may contain the following.

filter
  A Unix shell program filter that processes the text file in some appropriate way. For example, for email it might remove uninteresting headers and just output the subject, from line and the message body. If not specified, an identity filter is used.
init_function
  This (Scheme) function will be called before any processing is done. It allows further setting up of tokenization rules, voices, etc.

exit_function
  This (Scheme) function will be called at the end of any processing, allowing resetting of tokenization rules etc.

analysis_mode
  If the analysis mode is xml the file is read through the built-in XML parser rxp. Alternatively, if the analysis mode is xxml the filter should be an SGML normalising parser and the output is processed in a way suitable for it. Any other value is ignored.

These mode specific parameters are specified in the a-list held in tts_text_modes. When using Festival in Emacs the Emacs buffer mode is passed to Festival as the text mode. Note that the above mechanism is not really designed to be re-entrant; this should be addressed in later versions.

Following the use of auto-selection of mode in Emacs, Festival can auto-select the text mode based on the filename when no explicit mode is given. The Lisp variable auto-text-mode-alist is a list of dotted pairs of regular expression and mode name. For example, to specify that the email mode is to be used for files ending in '.email' we would add to the current auto-text-mode-alist as follows

(set! auto-text-mode-alist
      (cons (cons "\\.email$" 'email)
            auto-text-mode-alist))

If the function tts is called with a mode other than nil, that mode overrides any specified by the auto-text-mode-alist. The mode fundamental is the explicit "null" mode; it is used when no mode is specified in the function tts and no match is found in auto-text-mode-alist, or when the specified mode is not found.

By convention, if a requested text mode is not found in tts_text_modes the file 'MODENAME-mode' will be required. Therefore if you have the file 'MODENAME-mode.scm' in your library then it will be automatically loaded on reference. Modes may be quite large and it is not necessary to have Festival load them all at start up time. Because of the auto-text-mode-alist and the auto-loading of currently undefined text modes you can use Festival like this

festival --tts example.email

Festival will automatically synthesize 'example.email' in text mode email.

If you add your own personal text modes you should do the following. Suppose you've written an HTML mode. You have named it 'html-mode.scm' and put it in '/home/awb/lib/festival/'. In your '.festivalrc', first identify your personal Festival library directory by adding it to lib-path.

(set! lib-path (cons "/home/awb/lib/festival/" lib-path))

Then add a definition to the auto-text-mode-alist stating that file names ending '.html' or '.htm' should be read in HTML mode.

(set! auto-text-mode-alist
      (cons (cons "\\.html?$" 'html)
            auto-text-mode-alist))

Then you may synthesize an HTML file either from Scheme

(tts "example.html" nil)

or from the shell command line

festival --tts example.html

Anyone familiar with modes in Emacs should recognise that the process of adding a new text mode to Festival is very similar to adding a new buffer mode to Emacs.

═══ 10.3. Example text mode ═══

Here is a short example of a tts mode for reading email messages. It is by no means complete but is a start at showing how you can customize tts modes without writing new C++ code.

The first task is to define a filter that will take a saved mail message, remove extraneous headers and just leave the from line, subject and body of the message. The filter program is given a file name as its first argument and should output the result on standard out. For our purposes we will do this as a shell script.
#!/bin/sh
# Email filter for Festival tts mode
# usage: email_filter mail_message >tidied_mail_message
grep "^From: " $1
echo
grep "^Subject: " $1
echo
# delete up to first blank line (i.e. the header)
sed '1,/^$/ d' $1

Next we define the email init function, which will be called when we start this mode. What we will do is save the current token to words function and slot in our own new one. We can then restore the previous one when we exit.

(define (email_init_func)
 "Called on starting email text mode."
 (set! email_previous_t2w_func token_to_words)
 (set! english_token_to_words email_token_to_words)
 (set! token_to_words email_token_to_words))

Note that both english_token_to_words and token_to_words should be set to ensure that our new token to word function is still used when we change voices. The corresponding exit function puts the token to words function back.

(define (email_exit_func)
 "Called on exiting email text mode."
 (set! english_token_to_words email_previous_t2w_func)
 (set! token_to_words email_previous_t2w_func))

Now we can define the email specific token to words function. In this example we deal with two specific cases. First we deal with the common form of email addresses, so that the angle brackets are not pronounced. Second, we recognise quoted text and immediately switch to the alternative speaker for it.

(define (email_token_to_words token name)
 "Email specific token to word rules."
 (cond

This first condition identifies the token as a bracketed email address, removes the brackets and splits the token into the name and the IP address. Note that we recursively call the function email_previous_t2w_func on the email name and IP address so that they will be pronounced properly. Because that function returns a list of words, we need to append the results together.

  ((string-matches name "<.*@.*>")
   (append
    (email_previous_t2w_func token
     (string-after (string-before name "@") "<"))
    (cons "at"
     (email_previous_t2w_func token
      (string-before (string-after name "@") ">")))))

Our next condition deals with identifying a greater than sign being used as a quote marker. When we detect this we select the alternative speaker, even though it may already be selected. We then return no words so the quote marker is not spoken. The following condition finds greater than signs which are the first token on a line.

  ((and (string-matches name ">")
        (string-matches (item.feat token "whitespace")
                        "[ \t\n]*\n *"))
   (voice_don_diphone)
   nil ;; return nothing to say
   )

If the token doesn't match any of these, we can go ahead and use the builtin token to words function. Actually, we call the function that was set before we entered this mode, to ensure any other specific rules still remain. But before that we need to check whether we've had a newline which doesn't start with a greater than sign. In that case we switch back to the primary speaker.

  (t ;; for all other cases
   (if (string-matches (item.feat token "whitespace")
                       ".*\n[ \t\n]*")
       (voice_rab_diphone))
   (email_previous_t2w_func token name))))

In addition to these we have to actually declare the text mode. This we do by adding to any existing modes as follows.

(set! tts_text_modes
 (cons
  (list
   'email  ;; mode name
   (list   ;; email mode params
    (list 'init_func email_init_func)
    (list 'exit_func email_exit_func)
    '(filter "email_filter")))
  tts_text_modes))

This will now allow simple email messages to be dealt with in a mode specific way. An example mail message is included in 'examples/ex1.email'.
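As an aside, the nested string-before and string-after calls above can be hard to read; for a hypothetical token name such as "<someone@somewhere.org>" they pick out the two halves of the address roughly as follows (the address and the results shown are illustrative only).

(string-after (string-before "<someone@somewhere.org>" "@") "<")   => "someone"
(string-before (string-after "<someone@somewhere.org>" "@") ">")   => "somewhere.org"

The first expression keeps everything before the "@" and strips the leading "<"; the second keeps everything after the "@" and strips the trailing ">".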
To hear the result of the above text mode, start Festival, load in the email mode descriptions, and call TTS on the example file.

(tts "···/examples/ex1.email" 'email)

The above falls well short of a real email mode but does illustrate how one might go about building one. It should be reiterated that text modes are new in Festival and their most effective form has not been discovered yet. This will improve with time and experience.

═══ 11. XML/SGML mark-up ═══

The idea of a general, synthesizer-independent mark-up language for labelling text has been under discussion for some time. Festival has supported an SGML based markup language through multiple versions, most recently STML (sproat97). This is based on the earlier SSML (Speech Synthesis Markup Language) which was supported by previous versions of Festival (taylor96). With this version of Festival we support Sable, a similar mark-up language devised by a consortium from Bell Labs, Sun Microsystems, AT&T and Edinburgh (sable98). Unlike the previous versions, which were SGML based, the implementation of Sable in Festival is now XML based. To the user the difference is negligible, but using XML makes processing of files easier and more standardized. Also Festival now includes an XML parser, thus reducing the dependencies in processing Sable text.

Raw text has the problem that it cannot always easily be rendered as speech in the way the author wishes. Sable offers a well-defined way of marking up text so that the synthesizer may render it appropriately. The definition of Sable is by no means settled and is still in development. In this release Festival offers people working on Sable and other XML (and SGML) based markup languages a chance to quickly experiment with prototypes by providing a DTD (document type description) and the mapping of the elements in the DTD to Festival functions. Although we have not yet (personally) investigated facilities like cascading style sheets and generalized SGML specification languages like DSSSL, we believe the facilities offered by Festival allow rapid prototyping of speech output markup languages.

Primarily we see Sable marked-up text as a language that will be generated by other programs, e.g. text generation systems, dialog managers, etc.; therefore a standard, easy to parse format is required, even if it seems overly verbose for human writers.

For more information on Sable and access to the mailing list see

http://www.cstr.ed.ac.uk/projects/sable.html

Sable example: an example of Sable with descriptions
Supported Sable tags: Currently supported Sable tags
Adding Sable tags: Adding new Sable tags
XML/SGML requirements: Software environment requirements for use
Using Sable: Rendering Sable files as speech

═══ 11.1. Sable example ═══

Here is a simple example of Sable marked up text

The boy saw the girl in the park with the telescope.
The boy saw the girl in the park with the telescope.
Good morning My name is Stuart, which is spelled stuart though some people pronounce it stuart. My telephone number is 2787.
I used to work in Buccleuch Place, but no one can pronounce that.
By the way, my telephone number is actually

After the initial definition of the SABLE tags, through the file 'Sable.v0_2.dtd', which is distributed as part of Festival, the body is given. There are tags for identifying the language and the voice. Explicit boundary markers may be given in the text. Also duration and intonation control can be explicitly specified, as can new pronunciations of words.
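A rough sketch of how part of such a passage might be marked up is given below, using the SPEAKER and AUDIO tags described in the next section together with a BREAK element for an explicit boundary (tag names follow the Sable 0.2 specification referenced above); the attribute values and audio file names here are illustrative only and are not the exact ones used in the distributed example.

<SABLE>
<SPEAKER NAME="male1">
The boy saw the girl in the park <BREAK/> with the telescope.
By the way, my telephone number is actually
<AUDIO SRC="touchtone.2.au"/>
<AUDIO SRC="touchtone.7.au"/>
</SPEAKER>
</SABLE>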
The last sentence specifies some external filenames to play at that point.

═══ 11.2. Supported Sable tags ═══

There is not yet a definitive set of tags, but hopefully such a list will form over the next few months. As adding support for new tags is often trivial, the problem lies much more in defining what tags there should be than in actually implementing them. The following are based on version 0.2 of Sable as described in http://www.cstr.ed.ac.uk/projects/sable_spec2.html, though some aspects are not currently supported in this implementation. Further updates will be announced through the Sable mailing list.

LANGUAGE
  Allows the specification of the language through the ID attribute. Valid values in Festival are english, en1, spanish, en, and others depending on your particular installation. For example ··· If the language isn't supported by the particular installation of Festival, "Some text in ··" is said instead and the section is omitted.

SPEAKER
  Select a voice. Accepts a parameter NAME which takes values male1, male2, female1, etc. There is currently no definition of what happens when a voice is selected which the synthesizer doesn't support. An example is ···

AUDIO
  This allows the specification of an external waveform that is to be included. There are attributes for specifying volume and whether the waveform is to be played in the background of the following text or not. Festival as yet only supports insertion. My telephone number is