OS/2 Help File | 1999-12-24 | 256KB | 7,190 lines
═══ 1. Preface ═══
This file documents the Festival Speech Synthesis System, a general
text-to-speech system for making your computer talk and for developing new
synthesis techniques.
Copyright (C) 1996-1999 University of Edinburgh
Permission is granted to make and distribute verbatim copies of this manual
provided the copyright notice and this permission notice are preserved on all
copies.
Permission is granted to copy and distribute modified versions of this manual
under the conditions for verbatim copying, provided that the entire resulting
derived work is distributed under the terms of a permission notice identical to
this one.
Permission is granted to copy and distribute translations of this manual into
another language, under the above conditions for modified versions, except that
this permission notice may be stated in a translation approved by the authors.
This file documents the Festival Speech Synthesis System VERSION. This
document contains many gaps and is still in the process of being written.
Abstract            Initial comments
Copying             How you can copy and share the code
Acknowledgements    List of contributors
What is new         Enhancements since last public release
Overview            Generalities and Philosophy
Installation        Compilation and Installation
Quick start         Just tell me what to type
Scheme              A quick introduction to Festival's scripting language
Text methods for interfacing to Festival
TTS                 Text to speech modes
XML/SGML mark-up    XML/SGML mark-up Language
Emacs interface     Using Festival within Emacs
Internal functions
Phonesets           Defining and using phonesets
Lexicons            Building and compiling Lexicons
Utterances          Existing and defining new utterance types
Modules
Text analysis       Tokenizing text
POS tagging         Part of speech tagging
Phrase breaks       Finding phrase breaks
Intonation          Intonation modules
Duration            Duration modules
UniSyn synthesizer  The UniSyn waveform synthesizer
Diphone synthesizer Building and using diphone synthesizers
Other synthesis methods  Other waveform synthesis methods
Audio output        Getting sound from Festival
Voices              Adding new voices (and languages)
Tools               CART, Ngrams etc.
Building models from databases
Adding new modules and writing C++ code
Programming         Programming in Festival (Lisp/C/C++)
API                 Using Festival in other programs
Examples            Some simple (and not so simple) examples
Problems            Reporting bugs
References          Other sources of information
Feature functions   List of builtin feature functions
Variable list       Short descriptions of all variables
Function list       Short descriptions of all functions
Index               Index of concepts
═══ 2. Abstract ═══
This document provides a user manual for the Festival Speech Synthesis System,
version VERSION.
Festival offers a general framework for building speech synthesis systems, as
well as including examples of various modules. As a whole it offers full text
to speech through a number of APIs: from shell level, through a Scheme command
interpreter, as a C++ library, and via an Emacs interface. Festival is
multi-lingual; we have developed voices in many languages including English
(UK and US), Spanish and Welsh, though English is the most advanced.
The system is written in C++ and uses the Edinburgh Speech Tools for low level
architecture and has a Scheme (SIOD) based command interpreter for control.
Documentation is given in the FSF texinfo format which can generate a printed
manual, info files and HTML.
The latest details and a full software distribution of the Festival Speech
Synthesis System are available through its home page which may be found at
http://www.cstr.ed.ac.uk/projects/festival.html
═══ 3. Copying ═══
As we feel the core system has reached an acceptable level of maturity, from
1.4.0 the basic system is released under a free licence, without the
commercial restrictions we imposed on earlier versions. The basic system has
been placed under an X11-type licence which, as free licences go, is pretty
free. No GPL code is included in Festival or the speech tools themselves
(though some auxiliary files are GPL'd, e.g. the Emacs mode for Festival). We
have deliberately chosen a licence that should be compatible with our
commercial partners and our free software users.
However, although the code is free, we still offer no warranties and no
maintenance. We will continue to endeavour to fix bugs and answer queries
when we can, but are not in a position to guarantee it. We will consider
maintenance contracts and consultancy if desired; please contact us for
details.
Also note that not all the voices and lexicons we distribute with Festival are
free. In particular, the British English lexicon derived from the Oxford
Advanced Learners' Dictionary is free only for non-commercial use (we will
release an alternative soon). Also the Spanish diphone voice we release is
only free for non-commercial use.
If you are using Festival or the speech tools in a commercial environment,
even though no licence is required, we would be grateful if you let us know,
as it helps justify ourselves to our various sponsors.
The current copyright on the core system is
The Festival Speech Synthesis System: version 1.4.1
Centre for Speech Technology Research
University of Edinburgh, UK
Copyright (c) 1996-1999
All Rights Reserved.
Permission is hereby granted, free of charge, to use and distribute
this software and its documentation without restriction, including
without limitation the rights to use, copy, modify, merge, publish,
distribute, sublicense, and/or sell copies of this work, and to
permit persons to whom this work is furnished to do so, subject to
the following conditions:
1. The code must retain the above copyright notice, this list of
conditions and the following disclaimer.
2. Any modifications must be clearly marked as such.
3. Original authors' names are not deleted.
4. The authors' names are not used to endorse or promote products
derived from this software without specific prior written
permission.
THE UNIVERSITY OF EDINBURGH AND THE CONTRIBUTORS TO THIS WORK
DISCLAIM ALL WARRANTIES WITH REGARD TO THIS SOFTWARE, INCLUDING
ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS, IN NO EVENT
SHALL THE UNIVERSITY OF EDINBURGH NOR THE CONTRIBUTORS BE LIABLE
FOR ANY SPECIAL, INDIRECT OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN
AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION,
ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF
THIS SOFTWARE.
═══ 4. Acknowledgements ═══
The code in this system was primarily written by Alan W Black, Paul Taylor and
Richard Caley. Festival sits on top of the Edinburgh Speech Tools Library,
and uses much of its functionality.
Amy Isard wrote a synthesizer for her MSc project in 1995, which first used
the Edinburgh Speech Tools Library. Although Festival doesn't contain any
code from that system, her system was used as a basic model.
Much of the design and philosophy of Festival has been built on the experience
both Paul and Alan gained from the development of various previous
synthesizers and software systems, especially CSTR's Osprey and Polyglot
systems taylor91 and ATR's CHATR system black94.
However, it should be stated that Festival is fully developed at CSTR and
contains neither proprietary code nor ideas.
Festival contains a number of subsystems integrated from other sources and we
acknowledge those systems here.
═══ 4.1. SIOD ═══
The Scheme interpreter (SIOD -- Scheme In One Defun 3.0) was written by George
Carrett (gjc@mitech.com, gjc@paradigm.com) and offers a basic small Scheme
(Lisp) interpreter suitable for embedding in applications such as Festival as
a scripting language. A number of changes and improvements have been added in
our development but it still remains that basic system. We are grateful to
George and Paradigm Associates Incorporated for providing such a useful and
well-written sub-system.
Scheme In One Defun (SIOD)
COPYRIGHT (c) 1988-1994 BY
PARADIGM ASSOCIATES INCORPORATED, CAMBRIDGE, MASSACHUSETTS.
ALL RIGHTS RESERVED
Permission to use, copy, modify, distribute and sell this software
and its documentation for any purpose and without fee is hereby
granted, provided that the above copyright notice appear in all copies
and that both that copyright notice and this permission notice appear
in supporting documentation, and that the name of Paradigm Associates
Inc not be used in advertising or publicity pertaining to distribution
of the software without specific, written prior permission.
PARADIGM DISCLAIMS ALL WARRANTIES WITH REGARD TO THIS SOFTWARE, INCLUDING
ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS, IN NO EVENT SHALL
PARADIGM BE LIABLE FOR ANY SPECIAL, INDIRECT OR CONSEQUENTIAL DAMAGES OR
ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS,
WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION,
ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS
SOFTWARE.
═══ 4.2. editline ═══
Because of conflicts with the copyright for GNU readline, for which an
optional interface was included in earlier versions, we have replaced that
interface with a complete command line editing system based on 'editline'.
'Editline' was posted to the USENET newsgroup 'comp.sources.misc' in 1992. A
number of modifications have been made to make it more useful to us but the
original code (contained within the standard speech tools distribution) and
our modifications fall under the following licence.
Copyright 1992 Simmule Turner and Rich Salz. All rights reserved.
This software is not subject to any license of the American Telephone
and Telegraph Company or of the Regents of the University of California.
Permission is granted to anyone to use this software for any purpose on
any computer system, and to alter it and redistribute it freely, subject
to the following restrictions:
1. The authors are not responsible for the consequences of use of this
software, no matter how awful, even if they arise from flaws in it.
2. The origin of this software must not be misrepresented, either by
explicit claim or by omission. Since few users ever read sources,
credits must appear in the documentation.
3. Altered versions must be plainly marked as such, and must not be
misrepresented as being the original software. Since few users
ever read sources, credits must appear in the documentation.
4. This notice may not be removed or altered.
═══ 4.3. Edinburgh Speech Tools Library ═══
The Edinburgh Speech Tools lies at the core of Festival. Although developed
separately, much of the development of certain parts of the Edinburgh Speech
Tools has been directed by Festival's needs. In turn those who have
contributed to the Speech Tools make Festival a more usable system.
See Section Acknowledgements of Edinburgh Speech Tools Library Manual.
Online information about the Edinburgh Speech Tools library is available
through
http://www.cstr.ed.ac.uk/projects/speech_tools.html
═══ 4.4. Others ═══
Many others have provided actual code and support for Festival, for which we
are grateful. Specifically:
Alistair Conkie: various low level code points and some design work, Spanish
synthesis, the old diphone synthesis code.
Steve Isard: directorship and LPC diphone code, design of diphone schema.
EPSRC: who fund Alan Black and Paul Taylor.
Sun Microsystems Laboratories: for supporting the project and funding Richard.
AT&T Labs - Research: for supporting the project.
Paradigm Associates and George Carrett: for Scheme in one defun.
Mike Macon: improving the quality of the diphone synthesizer and LPC analysis.
Kurt Dusterhoff: Tilt intonation training and modelling.
Amy Isard: for her SSML project and related synthesizer.
Richard Tobin: for answering all those difficult questions, the socket code,
and the XML parser.
Simmule Turner and Rich Salz: command line editor (editline).
Borja Etxebarria: help with the Spanish synthesis.
Briony Williams: Welsh synthesis.
Jacques H. de Villiers ('jacques@cse.ogi.edu') from CSLU at OGI: for the TCL
interface, and other usability issues.
Kevin Lenzo ('lenzo@cs.cmu.edu') from CMU: for the PERL interface.
Rob Clarke: for support under Linux.
Samuel Audet ('guardia@cam.org'): OS/2 support.
Mari Ostendorf: for providing access to the BU FM Radio corpus, from which
some modules were trained.
Melvin Hunt: on whose work we based our residual LPC synthesis model.
Oxford Text Archive: for the computer users' version of the Oxford Advanced
Learners' Dictionary (redistributed with permission).
Reading University: for access to MARSEC, from which the phrase break model
was trained.
LDC & Penn Tree Bank: from which the POS tagger was trained; redistribution
of the models is with permission from the LDC.
Roger Burroughes and Kurt Dusterhoff: for letting us capture their voices.
ATR and Nick Campbell: for first getting Paul and Alan to work together, and
for the experience we gained.
FSF: for G++, make, ...
Center for Spoken Language Understanding: CSLU at OGI, particularly Ron Cole
and Mike Macon, have acted as significant users of the system, giving
valuable feedback and allowing us to teach courses on Festival.
Our beta testers: thanks to all the people who put up with previous versions
of the system and reported bugs, both big and small. These comments are very
important to the constant improvement of the system. And thanks for your
quick responses when I had specific requests.
And our users ... Many people have downloaded earlier versions of the system.
Many have found problems with installation and use and have reported them to
us. Many of you have put up with multiple compilations trying to fix bugs
remotely. We thank you for putting up with us, and are pleased you've taken
the time to help us improve our system. Many of you have come up with uses we
hadn't thought of, which is always rewarding. Even if you haven't actively
responded, the fact that you use the system at all makes it worthwhile.
═══ 5. What is new ═══
Compared to the previous major release (1.3.0, released Aug 1998), 1.4.0 is
not functionally very different. This release is primarily a consolidation
release, fixing and tidying up some of the lower level aspects of the system
to allow better modularity for some of our future planned modules.
Copyright change: the system is now free and has no commercial restriction.
Note that currently only the US voices (ked and kal) are unrestricted. The
UK English voices depend on the Oxford Advanced Learners' Dictionary of
Current English, which cannot be used commercially without permission from
Oxford University Press.
Architecture tidy up: the interfaces to lower level parts of the system have
been tidied up, deleting some of the older code that was supported for
compatibility reasons. There is now a much higher dependence on features,
and easier (and safer) ways to register new objects as feature values and
Scheme objects. Scheme has been tidied up. It is no longer "in one defun"
but "in one directory".
New documentation system for speech tools: a new docbook-based documentation
system has been added to the speech tools. Festival's documentation will
move over to this sometime soon too.
Initial JSAPI support: both JSAPI and JSML (somewhat similar to Sable) now
have initial implementations. They of course depend on Java support, which
so far we have only (successfully) investigated under Solaris and Linux.
Generalization of statistical models: CART, ngrams, and WFSTs are now fully
supported from Lisp and can be used with a generalized Viterbi function.
This makes adding quite complex statistical models easy without adding new
C++.
Tilt intonation modelling: full support is now included for the Tilt
intonation models, both training and use.
Documentation on Building New Voices in Festival: documentation, scripts
etc. for building new voices and languages in the system; see
http://www.cstr.ed.ac.uk/projects/festival/docs/festvox/
═══ 6. Overview ═══
Festival is designed as a speech synthesis system for at least three levels of
user. First, those who simply want high quality speech from arbitrary text
with the minimum of effort. Second, those who are developing language systems
and wish to include synthesis output. In this case, a certain amount of
customization is desired, such as different voices, specific phrasing, dialog
types etc. The third level is in developing and testing new synthesis
methods.
This manual is not designed as a tutorial on converting text to speech but for
documenting the processes and use of our system. We do not discuss the
detailed algorithms involved in converting text to speech or the relative
merits of multiple methods, though we will often give references to relevant
papers when describing the use of each module.
For more general information about text to speech we recommend Dutoit's 'An
introduction to Text-to-Speech Synthesis' dutoit97. For more detailed
research issues in TTS see sproat98 or vansanten96.
Philosophy  Why we did it like it is
Future      How much better it's going to get
═══ 6.1. Philosophy ═══
One of the biggest problems in the development of speech synthesis, and other
areas of speech and language processing systems, is that there are a lot of
simple well-known techniques lying around which can help you realise your
goal. But in order to improve some part of the whole system it is necessary
to have a whole system in which you can test and improve your part. Festival
is intended as that whole system in which you may simply work on your small
part to improve the whole. Without a system like Festival, before you could
even start to test your new module you would need to spend significant effort
to build a whole system, or adapt an existing one before you could start
working on your improvements.
Festival is specifically designed to allow the addition of new modules, easily
and efficiently, so that development need not get bogged down in
re-implementing the wheel.
But there is another aspect of Festival which makes it more useful than simply
an environment for researching into new synthesis techniques. It is a fully
usable text-to-speech system suitable for embedding in other projects that
require speech output. The provision of a fully working easy-to-use speech
synthesizer in addition to just a testing environment is good for two specific
reasons. First, it offers a conduit for our research, in that our experiments
can quickly and directly benefit users of our synthesis system. And secondly,
in ensuring we have a fully working usable system we can immediately see what
problems exist and where our research should be directed, rather than where
our whims take us.
These concepts are not unique to Festival. ATR's CHATR system (black94)
follows very much the same philosophy and Festival benefits from the
experiences gained in the development of that system. Festival benefits from
various pieces of previous work. As well as CHATR, CSTR's previous
synthesizers, Osprey and the Polyglot projects influenced many design
decisions. Also we are influenced by more general programs in considering
software engineering issues, especially GNU Octave and Emacs on which the
basic script model was based.
Unlike in some other speech and language systems, software engineering is
considered very important to the development of Festival. Too often research
systems consist of random collections of hacky little scripts and code. No
one person can confidently describe the algorithms it performs, as parameters
are scattered throughout the system, with tricks and hacks making it
impossible to really evaluate why the system is good (or bad). Such systems
do not help the advancement of speech technology, except perhaps in pointing
at ideas that should be further investigated. If the algorithms and
techniques cannot be described externally from the program such that they can
be reimplemented by others, what is the point of doing the work?
Festival offers a common framework where multiple techniques may be
implemented (by the same or different researchers) so that they may be tested
more fairly in the same environment.
As a final word, we'd like to make two short statements which both achieve
the same end, but unfortunately perhaps not for the same reasons: "Good
software engineering makes good research easier." But the following seems to
be true also: "If you spend enough effort on something, it can be shown to be
better than its competitors."
═══ 6.2. Future ═══
Festival is still very much in development. Hopefully this state will
continue for a long time. It is never possible to complete software; there
are always new things that can make it better. However, as time goes on
Festival's core architecture will stabilise and few or no changes will be
made. Other aspects of the system will then gain greater attention, such as
waveform synthesis modules, intonation techniques, text type dependent
analysers etc.
Festival will improve, so don't expect it to be the same six months from
now.
A number of new modules and enhancements are already under consideration at
various stages of implementation. The following is a non-exhaustive list of
what we may (or may not) add to Festival over the next six months or so.
Selection-based synthesis: moving away from diphone technology to more
generalized selection of units from speech databases.
New structure for linguistic content of utterances: using techniques from
Metrical Phonology, we are building more structured representations of
utterances, reflecting their linguistic significance better. This will
allow improvements in prosody and unit selection.
Non-prosodic prosodic control:For language generation systems and custom
tasks where the speech to be synthesized is being generated by some
program, more information about text structure will probably exist, such
as phrasing, contrast, key items etc. We are investigating the
relationship of high-level tags to prosodic information through the Sole
project http://www.cstr.ed.ac.uk/projects/sole.html
Dialect independent lexicons: currently for each new dialect we need a new
lexicon. We are investigating a form of lexical specification that is
dialect independent, allowing the core form to be mapped to different
dialects. This will make the generation of voices in different dialects
much easier.
═══ 7. Installation ═══
This section describes how to install Festival from source in a new location
and customize that installation.
Requirements Software/Hardware requirements for Festival
Configuration Setting up compilation
Site initialization Settings for your particular site
Checking an installation  But does it work ...
Y2K Comment on Festival and year 2000
═══ 7.1. Requirements ═══
In order to compile Festival you first need the following source packages:
festival-1.4.1.tar.gz: Festival Speech Synthesis System source.
speech_tools-1.2.1.tar.gz: The Edinburgh Speech Tools Library.
festlex_NAME.tar.gz: The lexicon distributions, which where possible include
the lexicon input file as well as the compiled form, for your convenience.
The lexicons have varying distribution policies, but all are free except
OALD, which is only free for non-commercial use (we are working on a free
replacement). In some cases only a pointer to an ftp'able file plus a
program to convert that file to the Festival format is included.
festvox_NAME.tar.gz: You'll need a speech database. A number are available
(with varying distribution policies). Each voice may have other
dependencies, such as requiring particular lexicons.
festdoc_1.4.1.tar.gz: Full postscript, info and html documentation for
Festival and the Speech Tools. The source of the documentation is available
in the standard distributions, but for your convenience it has been
pre-generated.
In addition to the Festival-specific sources you will also need:
A UNIX machine: Currently we have compiled and tested the system under
Solaris (2.5(.1), 2.6 and 2.7), SunOS (4.1.3), FreeBSD 2.2 and 3.x, and
Linux (RedHat 4.1, 5.0, 5.1, 5.2, 6.0 and other Linux distributions), and it
should work under OSF (DEC Alphas), SGI (Irix) and HP (HPUX). Any standard
UNIX machine should be acceptable. We have now successfully ported this
version to Windows NT and Windows 95 (using the Cygnus GNU win32
environment). This is still a young port but seems to work.
A C++ compiler: Note that C++ is not very portable, even between different
versions of the compiler from the same vendor. Although we've tried very
hard to make the system portable, we know it is very unlikely to compile
without change except with compilers that have already been tested. The
currently tested systems are:
1. Sun Sparc Solaris 2.5, 2.5.1, 2.6, 2.7: GCC 2.7.2, GCC 2.8.1,
SunCC 4.1, egcs 1.1.1, egcs 1.1.2, GCC 2.95.1
2. Sun Sparc SunOS 4.1.3: GCC 2.7.2
3. Intel SunOS 2.5.1: GCC 2.7.2
4. FreeBSD for Intel 2.2.1, 2.2.6 and 3.x (ELF based) GCC
2.7.2.1, GCC 2.95.1
5. Linux (2.0.30) for Intel (RedHat 4.1/5.0/5.1/5.2/6.0): GCC
2.7.2, GCC 2.7.2/egcs-1.0.2, egcs 1.1.1, egcs-1.1.2, GCC
2.95.1
6. Windows NT 4.0: GCC 2.7.2 plus egcs (from Cygnus GNU win32
b19), Visual C++ PRO v5.0
Note if GCC works on one version of Unix it usually works on
others.
We still recommend GCC 2.7.2 which we use as our standard
compiler. It is (mostly) standard across platforms and compiles
faster and produces better code than any of the other compilers
we've used.
We have compiled both the speech tools and Festival under Windows
NT 4.0 and Windows 95 using the GNU tools available from Cygnus.
ftp://ftp.cygnus.com/pub/gnu-win32/.
GNU make: Because there are too many different make programs out there, we
have tested the system using GNU make on all systems we use. Others may
work, but we know GNU make does.
Audio hardware: You can use Festival without audio output hardware, but it
doesn't sound very good (though admittedly you can hear fewer problems with
it). A number of audio systems are supported (directly inherited from the
audio support in the Edinburgh Speech Tools Library): NCD's NAS (formerly
called netaudio), a network transparent audio system (which can be found at
ftp://ftp.x.org/contrib/audio/nas/); '/dev/audio' (at 8k ulaw and 8/16bit
linear), found on Suns, Linux machines and FreeBSD; and a method allowing
arbitrary UNIX commands. See Audio output.
Earlier versions of Festival mistakenly offered a command line editor
interface to the GNU package readline, but due to conflicts between the GNU
Public Licence and Festival's licence this interface was removed in version
1.3.1. Even Festival's new free licence would cause problems, as readline
support would restrict Festival from linking with non-free code. A new
command line interface based on editline is provided instead, offering
similar functionality. Editline remains a compilation option, as it is
probably not yet as portable as we would like it to be.
In addition to the above, in order to process the documentation you will need
'TeX', 'dvips' (or similar), GNU's 'makeinfo' (part of the texinfo package)
and 'texi2html', which is available from http://wwwcn.cern.ch/dci/texi2html/.
However, the document files are also available pre-processed into PostScript,
DVI, info and html as part of the distribution in 'festdoc-1.4.X.tar.gz'.
Most of the related software not part of the Festival distribution has been
made available in
ftp://ftp.cstr.ed.ac.uk/pub/festival/extras/
Ensure you have a fully installed and working version of your C++ compiler.
Most of the problems people have had in installing Festival have been due to
incomplete or bad compiler installations. If you don't know whether anyone
has used your C++ installation before, it is worth checking that the
following program compiles and runs:
#include <iostream.h>
int main(int argc, char **argv)
{
    cout << "Hello world\n";
    return 0;
}
Unpack all the source files in a new directory. The directory will then
contain two subdirectories
speech_tools/
festival/
═══ 7.2. Configuration ═══
First ensure you have a compiled version of the Edinburgh Speech Tools
Library. See 'speech_tools/INSTALL' for instructions.
Before compilation of Festival it is necessary to configure your installation
so that it is aware of the environment it is being compiled in.
Specifically, it must know the names of various local programs (such as your
compiler), the directories where local libraries are held, and your choices
for the various optional sub-systems. In most cases this can be done
automatically if your system is supported; otherwise it may be necessary to
edit a few lines.
All compilation information is set in a local per installation file called
'config/config'. You should copy the example one, mark it writable and edit
it according to your local set up.
cd config/
cp config-dist config
chmod +w config
'config/config' is included by all 'Makefiles' in the system and therefore
should be the only place machine specific information need be changed. Note
that all 'Makefiles' define the variable TOP to allow appropriate relative
addressing of directories within the 'Makefiles' and their included files.
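For instance, a module 'Makefile' two levels below the top of the tree might
begin along these lines (a hypothetical sketch based on the convention just
described; only TOP and the include of 'config/config' are taken from the
text above, and the relative depth depends on the directory):

```
# Hypothetical start of a Festival module Makefile.
TOP = ../..
include $(TOP)/config/config
```

Because every 'Makefile' includes the same 'config/config' via TOP, a
machine-specific change made there is picked up by the whole build.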
For the most part Festival configuration inherits the configuration from your
speech tools config file ('../speech_tools/config/config'). Additional
optional modules may be added by adding them to the end of your config file,
e.g.
ALSO_INCLUDE += clunits
Adding a new module here will treat it as a new directory in 'src/modules/'
and compile it into the system, in the same way the OTHER_DIRS feature was
used in previous versions.
If the compilation directory is being accessed by NFS, or if you use an
automounter (e.g. amd), it is recommended to explicitly set the variable
FESTIVAL_HOME in 'config/config'. The command pwd is not reliable when a
directory may have multiple names.
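For example, the relevant line in 'config/config' might look like this (the
path shown is purely illustrative; use wherever you unpacked the 'festival/'
directory):

```
# Illustrative path only -- point this at your own festival directory.
FESTIVAL_HOME = /home/me/src/festival
```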
To check your configuration, type (in the 'festival/' directory)
gnumake info
If that seems fine, compile the system with
gnumake
On completion you can check the system with
gnumake test
Note that the single most common reason for problems in compilation and
linking found amongst the beta testers was a bad installation of GNU C++. If
you get many strange errors in G++ library header files or link errors it is
worth checking that your system has the compiler, header files and runtime
libraries properly installed. This may be checked by compiling a simple
program under C++ and also finding out if anyone at your site has ever used
the installation. Most of these installation problems are caused by upgrading
to a newer version of libg++ without removing the older version, so that a
mixed set of '.h' files exists.
Although we have tried very hard to ensure that Festival compiles with no
warnings, this is not possible under some systems.
Under SunOS the system include files do not declare a number of system
provided functions. This is a bug in Sun's include files. It will cause
warnings like "implicit definition of fprintf". These are harmless.
Under Sun's CC compiler a number of warnings are given about not being able
to find source, particularly for operator << and some == operators. It is
unclear why this should be a warning, as the code deliberately exists in
other files for modularity purposes and should not be visible in these files
anyway. These warnings are harmless.
Under Linux a warning at link time about reducing the size of some symbols is
often produced. This is harmless. There are also occasional warnings about
some socket system function having an incorrect argument type; these too are
harmless.
The speech tools and Festival compile under Windows 95 or Windows NT with
Visual C++ v5.0 using the Microsoft 'nmake' make program. We've only done
this with the Professional edition, but have no reason to believe it relies
on anything not in the standard edition.
In accordance with VC++ conventions, object files are created with extension
'.obj', executables with extension '.exe' and libraries with extension
'.lib'. This may mean that both unix and Win32 versions can be built in the
same directory tree, but I wouldn't rely on it.
To do this you require nmake Makefiles for the system. These can be generated
from the gnumake Makefiles, using the command
gnumake VCMakefile
in the speech_tools and festival directories. I have only done this under
unix; it is possible it would work under the cygnus gnuwin32 system.
If 'make.depend' files exist (i.e. if you have done 'gnumake depend' under
unix), equivalent 'vc_make.depend' files will be created; if not, the
VCMakefiles will not contain dependency information for the '.cc' files. The
result is that you can compile the system once, but changes will not cause
the correct things to be rebuilt.
In order to compile from the DOS command line using Visual C++ you need to
have a collection of environment variables set. In Windows NT there is an
installation option for Visual C++ which sets these globally. Under Windows
95, or if you did not ask for them to be set globally under NT, you need to
run
vcvars32.bat
See the VC++ documentation for more details.
Once you have the source trees with VCMakefiles somewhere visible from
Windows, you need to copy 'speech_tools\config\vc_config-dist' to
'speech_tools\config\vc_config' and edit it to suit your local situation. Then
do the same with 'festival\config\vc_config-dist'.
The thing most likely to need changing is the definition of FESTIVAL_HOME in
'festival\config\vc_config_make_rules' which needs to point to where you have
put festival.
Now you can compile. cd to the speech_tools directory and do
nmake /nologo /fVCMakefile
and the library, the programs in main and the test programs should be
compiled.
The tests can't be run automatically under Windows. A simple test to check
that things are probably OK is:
main\na_play testsuite\data\ch_wave.wav
which reads and plays a waveform.
Next go into the festival directory and do
nmake /nologo /fVCMakefile
to build festival. When it's finished, and assuming you have the voices and
lexicons unpacked in the right place, festival should run just as under unix.
We should remind you that the NT/95 ports are still young and there may be
problems we have not yet found. We only recommend using the speech tools and
Festival under Windows if you have significant experience with C++ on those
platforms.
Most of the modules in 'src/modules' are actually optional and the system
could be compiled without them. The basic set could be reduced further if
certain facilities are not desired. In particular: 'donovan' is only
required if the donovan voice is used; 'rxp' only if XML parsing is required
(e.g. for Sable); and 'parser' only if stochastic parsing is required (this
parser isn't used for any of our currently released voices). Even 'UniSyn'
and 'UniSyn_diphone' could be removed if some external waveform synthesizer
(e.g. MBROLA) or an alternative such as 'OGIresLPC' is being used. Removing
unused modules will make the festival binary smaller and (potentially) start
up faster, but don't expect too much. You can drop modules by changing the
BASE_DIRS variable in 'src/modules/Makefile'.
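As a sketch only (the real directory list differs between releases, so check
your own Makefile before editing), the change might look like this:

```makefile
# src/modules/Makefile (sketch): drop optional entries such as donovan,
# rxp and parser from BASE_DIRS if you do not need them; keep UniSyn and
# UniSyn_diphone unless an external synthesizer such as MBROLA is used.
BASE_DIRS = UniSyn UniSyn_diphone   # plus the other modules you keep
```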
═══ 7.3. Site initialization ═══
Once compiled Festival may be further customized for particular sites. At
start up time Festival loads the file 'init.scm' from its library directory.
This file further loads other necessary files such as phoneset descriptions,
duration parameters, intonation parameters, definitions of voices etc. It
will also load the files 'sitevars.scm' and 'siteinit.scm' if they exist.
'sitevars.scm' is loaded after the basic Scheme library functions are loaded
but before any of the festival related functions are loaded. This file is
intended to set various path names before various subsystems are loaded.
Typically variables such as lexdir (the directory where the lexicons are
held), and voices_dir (pointing to voice directories) should be reset here if
necessary.
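As a sketch (the pathnames below are invented examples, and you should check
how your version represents these variables), a 'sitevars.scm' might contain:

```scheme
;; sitevars.scm -- illustrative only; the paths are examples, not defaults
(set! lexdir "/projects/festival/lib/dicts/")             ; lexicon directory
(set! voices_dir (list "/projects/festival/lib/voices/")) ; voice directories
```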
The default installation will try to find its lexicons and voices
automatically based on the value of load-path (this is derived from
FESTIVAL_HOME at compilation time or by using the --libdir at run-time). If
the voices and lexicons have been unpacked into subdirectories of the library
directory (the default) then no site specific initialization of the above
pathnames will be necessary.
The second site-specific file is 'siteinit.scm'. Typical examples of local
initialization are as follows. The default audio output method is NCD's NAS
system if that is supported, as that is what we normally use in CSTR. If it
is not supported, any hardware-specific mode is the default (e.g. sun16audio,
freebsd16audio, linux16audio or mplayeraudio). But that default is just a
setting in 'init.scm'. If, for example, you wish the default audio output
method in your environment to be 8k mulaw through '/dev/audio', you should
add the following line to your 'siteinit.scm' file
(Parameter.set 'Audio_Method 'sunaudio)
Note the use of Parameter.set rather than Parameter.def; the second function
will not reset the value if it is already set. Remember that you may use the
audio methods sun16audio, linux16audio or freebsd16audio only if NATIVE_AUDIO
was selected in 'speech_tools/config/config' and you are on such a machine.
The Festival variable *modules* contains a list of all supported
functions/modules in a particular installation including audio support. Check
the value of that variable if things aren't what you expect.
If you are installing on a machine whose audio is not directly supported by
the speech tools library, an external command may be executed to play a
waveform. The following example is for an imaginary machine that can play
audio files through a program called 'adplay' with arguments for sample rate
and file type. When playing waveforms Festival by default outputs an
unheadered waveform in native byte order. In this example you would set up
the default audio playing mechanism in 'siteinit.scm' as follows
(Parameter.set 'Audio_Method 'Audio_Command)
(Parameter.set 'Audio_Command "adplay -raw -r $SR $FILE")
For the Audio_Command method of playing waveforms Festival supports two
additional audio parameters. Audio_Required_Rate allows you to use
Festival's internal sample rate conversion function to convert to any
desired rate. Note this may not be as good as playing the waveform at the
sample rate it was originally created in, but as some hardware devices are
restrictive in the sample rates they support, or have naive resampling
functions, this could be the best option. The second additional audio
parameter is Audio_Required_Format, which can be used to specify the desired
output format of the file. The default is unheadered raw, but it may be any
of the values supported by the speech tools (including nist, esps, snd,
riff, aiff, audlab, raw and, if you really want it, ascii).
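Putting these together, a 'siteinit.scm' for hardware that only accepts
16kHz RIFF files might contain the following (the rate is illustrative and
'adplay' is the imaginary player from the example above):

```scheme
(Parameter.set 'Audio_Method 'Audio_Command)
(Parameter.set 'Audio_Required_Rate 16000)   ; resample internally to 16kHz
(Parameter.set 'Audio_Required_Format 'riff) ; write a RIFF (.wav) header
(Parameter.set 'Audio_Command "adplay -r $SR $FILE") ; 'adplay' is imaginary
```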
For example suppose you run Festival on a remote machine and are not running
any network audio system and want Festival to copy files back to your local
machine and simply cat them to '/dev/audio'. The following would do that
(assuming permissions for rsh are allowed).
(Parameter.set 'Audio_Method 'Audio_Command)
;; Make output file ulaw 8k (format ulaw implies 8k)
(Parameter.set 'Audio_Required_Format 'ulaw)
(Parameter.set 'Audio_Command
"userhost=`echo $DISPLAY | sed 's/:.*$//'`; rcp $FILE $userhost:$FILE; \
rsh $userhost \"cat $FILE >/dev/audio\" ; rsh $userhost \"rm $FILE\"")
Note there are limits on how complex a command you can sensibly put in the
Audio_Command string directly; the quoting can get very confusing. It is
therefore recommended that once you get past a certain complexity you write
a simple shell script and call it from the Audio_Command string.
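A sketch of that approach follows. The function stands in for a standalone
script; 'adplay' is the manual's imaginary player, and for illustration the
sketch only prints the command that would run rather than executing a real
player.

```shell
# Sketch: wrap the complex play command in a function (or a standalone
# script named from Audio_Command) so the quoting stays simple.
# 'adplay' is imaginary; substitute your real player.
play_wave() {
    sr="$1"
    file="$2"
    # For this sketch we only print the command that would be executed.
    echo "adplay -raw -r $sr $file"
}
cmd=$(play_wave 16000 /tmp/utt.wav)
```

Installed as, say, 'play_wave.sh' (a hypothetical name), the 'siteinit.scm'
entry then reduces to
(Parameter.set 'Audio_Command "play_wave.sh $SR $FILE").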
A second typical customization is setting the default speaker. Speakers
depend on many things but due to various licence (and resource) restrictions
you may only have some diphone/nphone databases available in your
installation. The function name that is the value of voice_default is called
immediately after 'siteinit.scm' is loaded offering the opportunity for you to
change it. In the standard distribution no change should be required. If you
download all the distributed voices, voice_rab_diphone is the default voice.
You may change this for a site by adding the following to 'siteinit.scm', or
per person by changing your '.festivalrc'. For example, if you wish to
change the default voice to the American English one, voice_ked_diphone:
(set! voice_default 'voice_ked_diphone)
Note the single quote, and note that unlike in early versions voice_default is
not a function you can call directly.
A second level of customization is on a per user basis. After loading
'init.scm', which includes 'sitevars.scm' and 'siteinit.scm' for local
installation, Festival loads the file '.festivalrc' from the user's home
directory (if it exists). This file may contain arbitrary Festival commands.
For example a particular installation of Festival may set Spanish as the
default language by adding
(language_spanish)
in 'siteinit.scm', while a user may wish their version to use Welsh by
default. In this case they would add
(language_welsh)
to their '.festivalrc' in their home directory.
═══ 7.4. Checking an installation ═══
Once compiled and site initialization is set up you should test to see if
Festival can speak or not.
Start the system
$ bin/festival
Festival Speech Synthesis System 1.4.1:release November 1999
Copyright (C) University of Edinburgh, 1996-1999. All rights reserved.
For details type `(festival_warranty)'
festival> ^D
If errors occur at this stage they are most likely to do with pathname
problems. If any error messages are printed about non-existent files check
that those pathnames point to where you intended them to be. Most of the
(default) pathnames are dependent on the basic library path. Ensure that is
correct. To find out what it has been set to, start the system without
loading the init files.
$ bin/festival -q
Festival Speech Synthesis System 1.4.1:release November 1999
Copyright (C) University of Edinburgh, 1996-1999. All rights reserved.
For details type `(festival_warranty)'
festival> libdir
"/projects/festival/lib/"
festival> ^D
This should show the pathname you set in your 'config/config'.
If the system starts with no errors try to synthesize something
festival> (SayText "hello world")
Some files are only accessed at synthesis time so this may show up other
problem pathnames. If it talks, you're in business, if it doesn't, here are
some possible problems.
If you get the error message
Can't access NAS server
then you have selected NAS as the audio output but have no server running on
that machine, or your DISPLAY or AUDIOSERVER environment variable is not set
properly for your output device. Either set these properly or change the
audio output device in 'lib/siteinit.scm' as described above.
Ensure your audio device actually works the way you think it does. On Suns,
the audio output device can be switched into a number of different output
modes: speaker, jack, headphones. If this is set to the wrong one you may
not hear the output. Use one of Sun's tools to change this (try
'/usr/demo/SOUND/bin/soundtool'). Try to find an audio file independent of
Festival and get it to play on your audio. Once you have done that ensure that
the audio output method set in Festival matches that.
Once you have got it talking, test the audio spooling device.
festival> (intro)
This plays a short introduction of two sentences, spooling the audio output.
Finally exit from Festival (by end of file or (quit)) and test the script
mode with:
$ examples/saytime
A test suite is included with Festival but it makes certain assumptions about
which voices are installed. It assumes that voice_rab_diphone
('festvox_rabxxxx.tar.gz') is the default voice and that voice_ked_diphone and
voice_don_diphone ('festvox_kedxxxx.tar.gz' and 'festvox_don.tar.gz') are
installed. Also local settings in your 'festival/lib/siteinit.scm' may affect
these tests. However, after installation it may be worth trying
gnumake test
from the 'festival/' directory. This will do various tests including basic
utterance tests and tokenization tests. It also checks that voices are
installed and that they don't interfere with each other. These tests are
primarily regression tests for the developers of Festival, to ensure new
enhancements don't mess up existing supported features. They are not designed
to test an installation is successful, though if they run correctly it is most
probable the installation has worked.
═══ 7.5. Y2K ═══
Festival comes with no warranty, therefore we will not make any legal
statement about the performance of the system. However, a number of people
have asked about Festival and Y2K compliance, and we have decided to make
some comments on this.
Every effort has been made to ensure that Festival will continue running as
before into the next millennium. However, even if Festival itself has no
problems, it is dependent on the operating system environment it is running
in.
During compilation dates on files are important and the compilation process
may not work if your machine cannot assign (reasonable) dates to new files.
At run time there is less dependence on system dates and times. Specifically
times are used in generation of random numbers (where only relative time is
important) and as time stamps in log files when festival runs in server mode,
thus we feel it is unlikely there will be any problems.
However, as a speech synthesizer, Festival must make explicit decisions about
the pronunciation of dates in the next two decades when people themselves have
not yet made such decisions. Most people are still unsure how to read years
written as '01, '04, '12, 00s, 10s (cf. '86, 90s). It is interesting to note
that while there is a convenient short name for the last decade of the
twentieth century, the "nineties", there is no equivalent name for the first
decade of the twenty-first century (or the second). In the meantime we have
made reasonable decisions about such pronunciations.
Once people have themselves become Y2K compliant and decided what to actually
call these years, if their choices are different from how Festival pronounces
them we reserve the right to change how Festival speaks these dates to match
their belated decisions. However as we do not give out warranties about
compliance we will not be requiring our users to return signed Y2K compliant
warranties about their own compliance either.
═══ 8. Quick start ═══
This section is for those who just want to know the absolute basics to run the
system.
Festival works in two fundamental modes, command mode and text-to-speech mode
(tts-mode). In command mode, information (in files or through standard input)
is treated as commands and is interpreted by a Scheme interpreter. In
tts-mode, information (in files or through standard input) is treated as text
to be rendered as speech. The default mode is command mode, though this may
change in later versions.
Basic command line options
Simple command driven session
Getting some help
═══ 8.1. Basic command line options ═══
Festival's basic calling method is as
festival [options] file1 file2 ...
Options may be any of the following
-q start Festival without loading 'init.scm' or user's '.festivalrc'
-b
--batch After processing any file arguments do not become interactive
-i
--interactive  After processing file arguments become interactive. This
       option overrides any batch argument.
--tts Treat file arguments in text-to-speech mode, causing them to be
rendered as speech rather than interpreted as commands. When
selected in interactive mode the command line edit functions are not
available
--command Treat file arguments in command mode. This is the default.
--language LANG  Set the default language to LANG. Currently LANG may be
       one of english, spanish or welsh (depending on what voices are
       actually available in your installation).
--server After loading any specified files go into server mode. This is a
mode where Festival waits for clients on a known port (the value of
server_port, default is 1314). Connected clients may send commands
(or text) to the server and expect waveforms back. See Server/client
API. Note server mode may be unsafe and allow unauthorised access
to your machine, be sure to read the security recommendations in
Server/client API
--script scriptfile  Run scriptfile as a Festival script file. This is
       similar to --batch but it encapsulates the command line arguments
       into the Scheme variables argv and argc, so that Festival scripts
       may process their command line arguments just like any other
       program. It also does not load the basic initialisation files, as
       sometimes you may not want to do this. If you wish them, you should
       copy the loading sequence from an example Festival script like
       'festival/examples/saytext'.
--heap NUMBER  The Scheme heap (basic number of Lisp cells) is of a fixed
       size and cannot be dynamically increased at run time (this would
       complicate garbage collection). The default size is 210000, which
       seems to be more than adequate for most work. In some of our
       training experiments, where very large list structures are required,
       it is necessary to increase this. Note there is a trade-off between
       the size of the heap and the time it takes to garbage collect, so
       making this unnecessarily big is not a good idea. If you don't
       understand the above explanation you almost certainly don't need to
       use this option.
In command mode, if the file name starts with a left parenthesis, the name
itself is read and evaluated as a Lisp command. This is often convenient when
running in batch mode and a simple command is necessary to start the whole
thing off after loading in some other specific files.
═══ 8.2. Sample command driven session ═══
Here is a short session using Festival's command interpreter.
Start Festival with no arguments
$ festival
Festival Speech Synthesis System 1.4.1:release November 1999
Copyright (C) University of Edinburgh, 1996-1999. All rights reserved.
For details type `(festival_warranty)'
festival>
Festival uses a command line editor based on editline for terminal input,
so command line editing may be done with Emacs commands. Festival also
supports history as well as function, variable name, and file name completion
via the TAB key.
Typing help will give you more information; that is, type help without any
parentheses. (It is actually a variable name whose value is a string
containing help.)
Festival offers what is called a read-eval-print loop, because it reads an
s-expression (atom or list), evaluates it and prints the result. As Festival
includes the SIOD Scheme interpreter most standard Scheme commands work
festival> (car '(a d))
a
festival> (+ 34 52)
86
In addition to standard Scheme commands a number of commands specific to
speech synthesis are included. Although, as we will see, there are simpler
methods for getting Festival to speak, here are the basic underlying explicit
functions used in synthesizing an utterance.
Utterances can consist of various types (See Utterance types), but the
simplest form is plain text. We can create an utterance and save it in a
variable
festival> (set! utt1 (Utterance Text "Hello world"))
#<Utterance 1d08a0>
festival>
The (hex) number in the return value may be different for your installation.
That is the print form for utterances. Their internal structure can be very
large so only a token form is printed.
Although this creates an utterance it doesn't do anything else. To get a
waveform you must synthesize it.
festival> (utt.synth utt1)
#<Utterance 1d08a0>
festival>
This calls various modules, including tokenizing, duration, intonation, etc.
Which modules are called is defined with respect to the type of the
utterance, in this case Text. It is possible to call the modules
individually by hand, but you just wanted it to talk, didn't you. So
festival> (utt.play utt1)
#<Utterance 1d08a0>
festival>
will send the synthesized waveform to your audio device. You should hear
"Hello world" from your machine.
To make this all easier a small function doing these three steps exists.
SayText simply takes a string of text, synthesizes it and sends it to the
audio device.
festival> (SayText "Good morning, welcome to Festival")
#<Utterance 1d8fd0>
festival>
Of course as history and command line editing are supported c-p or up-arrow
will allow you to edit the above to whatever you wish.
Festival may also synthesize from files rather than simply text.
festival> (tts "myfile" nil)
nil
festival>
The end of file character c-d will exit from Festival and return you to the
shell, alternatively the command quit may be called (don't forget the
parentheses).
Rather than starting the command interpreter, Festival may synthesize files
specified on the command line
unix$ festival --tts myfile
unix$
Sometimes a simple waveform is required from text that is to be kept and
played at some later time. The simplest way to do this with festival is by
using the 'text2wave' program. This is a festival script that will take a
file (or text from standard input) and produce a single waveform.
An example use is
text2wave myfile.txt -o myfile.wav
Options exist to specify the waveform file type, for example if Sun audio
format is required
text2wave myfile.txt -otype snd -o myfile.wav
Use '-h' on 'text2wave' to see all options.
═══ 8.3. Getting some help ═══
If no audio is generated then you must check to see if audio is properly
initialized on your machine. See Audio output.
In the command interpreter m-h (meta-h) will give you help on the current
symbol before the cursor. This will be a short description of the function or
variable, how to use it and what its arguments are. A listing of all such
help strings appears at the end of this document. m-s will synthesize and say
the same information, but this extra function is really just for show.
The lisp function manual will send the appropriate command to an already
running Netscape browser process. If nil is given as an argument the browser
will be directed to the table of contents of the manual. If a non-nil value
is given it is assumed to be a section title and that section is searched and
if found displayed. For example
festival> (manual "Accessing an utterance")
Another related function is manual-sym which given a symbol will check its
documentation string for a cross reference to a manual section and request
Netscape to display it. This function is bound to m-m and will display the
appropriate section for the given symbol.
Note also that the TAB key can be used to find out the name of commands
available as can the function Help (remember the parentheses).
For more up to date information on Festival regularly check the Festival Home
Page at
http://www.cstr.ed.ac.uk/projects/festival.html
Further help is available by mailing questions to
festival-help@cstr.ed.ac.uk
Although we cannot guarantee the time required to answer you, we will do our
best to offer help.
Bug reports should be submitted to
festival-bug@cstr.ed.ac.uk
If there is enough user traffic a general mailing list will be created so
all users may share comments and receive announcements. In the meantime
watch the Festival Home Page for news.
═══ 9. Scheme ═══
Many people seem daunted by the fact that Festival uses Scheme as its
scripting language and feel they can't use Festival because they don't know
Scheme. However most of those same people use Emacs every day, which also
has (a much more complex) Lisp system underneath. The number of Scheme
commands
you actually need to know in Festival is really very small and you can easily
just find out as you go along. Also people use the Unix shell often but only
know a small fraction of actual commands available in the shell (or in fact
that there even is a distinction between shell builtin commands and user
definable ones). So take it easy, you'll learn the commands you need fairly
quickly.
Scheme references Places to learn more about Scheme
Scheme fundamentals Syntax and semantics
Scheme Festival specifics
Scheme I/O
═══ 9.1. Scheme references ═══
If you wish to learn about Scheme in more detail I recommend the book
abelson85.
The Emacs Lisp documentation is reasonable as it is comprehensive and many of
the underlying uses of Scheme in Festival were influenced by Emacs. Emacs
Lisp however is not Scheme so there are some differences.
Other Scheme tutorials and resources available on the Web are
The Revised Revised Revised Revised Scheme Report, the document defining
the language, is available from
http://tinuviel.cs.wcu.edu/res/ldp/r4rs-html/r4rs_toc.html
Scheme tutorials from the net:
- http://www.cs.uoregon.edu/classes/cis425/schemeTutorial.html
the Scheme FAQ
- http://www.landfield.com/faqs/scheme-faq/part1/
═══ 9.2. Scheme fundamentals ═══
But you want more now, don't you, not just be referred to some other book. OK
here goes.
Syntax: an expression is an atom or a list. A list consists of a left paren,
a number of expressions and a right paren. Atoms can be symbols, numbers,
strings or other special types like functions, hash tables, arrays, etc.
Semantics: All expressions can be evaluated. Lists are evaluated as function
calls. When evaluating a list all the members of the list are evaluated first
then the first item (a function) is called with the remaining items in the
list as arguments. Atoms are evaluated depending on their type: symbols are
evaluated as variables returning their values. Numbers, strings, functions,
etc. evaluate to themselves.
Comments are started by a semicolon and run until end of line.
And that's it. There is nothing more to the language than that. But just in
case you can't follow the consequences of that, here are some key examples.
festival> (+ 2 3)
5
festival> (set! a 4)
4
festival> (* 3 a)
12
festival> (define (add a b) (+ a b))
#<CLOSURE (a b) (+ a b)>
festival> (add 3 4)
7
festival> (set! alist '(apples pears bananas))
(apples pears bananas)
festival> (car alist)
apples
festival> (cdr alist)
(pears bananas)
festival> (set! blist (cons 'oranges alist))
(oranges apples pears bananas)
festival> (append alist blist)
(apples pears bananas oranges apples pears bananas)
festival> (cons alist blist)
((apples pears bananas) oranges apples pears bananas)
festival> (length alist)
3
festival> (length (append alist blist))
7
═══ 9.3. Scheme Festival specifics ═══
There are a number of additions to SIOD that are Festival-specific, though
still part of the Lisp system rather than the synthesis functions per se.
By convention if the first statement of a function is a string, it is treated
as a documentation string. The string will be printed when help is requested
for that function symbol.
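For example, a user-defined function might carry a documentation string like
this (greet is a made-up name for illustration, not a Festival function):

```scheme
(define (greet name)
"(greet NAME)
Return a greeting for NAME.  Because this string is the first statement
in the body it is treated as the documentation string for greet."
(string-append "hello " name))
```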
In interactive mode, if the function :backtrace is called (within
parentheses), the previous stack trace is displayed. Calling :backtrace
with a numeric
argument will display that particular stack frame in full. Note that any
command other than :backtrace will reset the trace. You may optionally call
(set_backtrace t)
which will cause a backtrace to be displayed whenever a Scheme error occurs.
This can be put in your '.festivalrc' if you wish. This is especially useful
when running Festival in non-interactive mode (batch or script mode) so that
more information is printed when an error occurs.
A hook in Lisp terms is a position within some piece of code where a user may
specify their own customization. The notion is used heavily in Emacs. In
Festival there are a number of places where hooks are used. A hook variable
contains either a function or a list of functions that are to be applied at
some
point in the processing. For example the after_synth_hooks are applied after
synthesis has been applied to allow specific customization such as resampling
or modification of the gain of the synthesized waveform. The Scheme function
apply_hooks takes a hook variable as argument and an object and applies the
function/list of functions in turn to the object.
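As a sketch, a hypothetical function added to after_synth_hooks (the name
note_done and its body are invented for illustration) might look like:

```scheme
(define (note_done utt)
"Hypothetical hook: report after synthesis.  Hook functions are applied
to the utterance in turn via apply_hooks."
(format t "synthesis finished\n")
utt)

(set! after_synth_hooks (list note_done))
```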
When an error occurs in either Scheme or within the C++ part of Festival by
default the system jumps to the top level, resets itself and continues. Note
that errors are usually serious things, pointing to bugs in parameters or
code. Every effort has been made to ensure that the processing of text never
causes errors in Festival. However, when using Festival as a development
system, errors often do occur in code under development.
Sometimes in writing Scheme code you know there is a potential for an error
but you wish to ignore that and continue on to the next thing without exiting
or stopping and returning to the top level. For example, suppose you are
processing a number of utterances from a database and some files containing
the descriptions have errors in them, but you want your processing to
continue through every utterance that can be processed, rather than stopping
five minutes after you have gone home, having set off a big batch job for
overnight.
Festival's Scheme provides the function unwind-protect which allows the
catching of errors and then continuing normally. For example suppose you have
the function process_utt, which takes a filename and does things which you
know might cause an error. You can write the following to ensure you
continue processing even if an error occurs.
(unwind-protect
(process_utt filename)
(begin
(format t "Error found in processing %s\n" filename)
(format t "continuing\n")))
The unwind-protect function takes two arguments. The first is evaluated and
if no error occurs the value returned from that expression is returned. If an
error does occur while evaluating the first expression, the second expression
is evaluated. unwind-protect may be used recursively. Note that all files
opened while evaluating the first expression are closed if an error occurs.
All global variables outside the scope of the unwind-protect will be left as
they were set up until the error. Care should be taken in using this function
but its power is necessary to be able to write robust Scheme code.
═══ 9.4. Scheme I/O ═══
Different Schemes may have quite different implementations of file i/o
functions so in this section we will describe the basic functions in Festival
SIOD regarding i/o.
Simple printing to the screen may be achieved with the function print which
prints the given s-expression to the screen. The printed form is preceded by a
new line. This is often useful for debugging but isn't really powerful enough
for much else.
Files may be opened and closed, and are referred to by file descriptors in
a direct analogy to C's stdio library. The SIOD functions fopen and fclose
work in exactly the same way as their equivalently named partners in C.
The format command follows the command of the same name in Emacs and a
number of other Lisps. C programmers can think of it as fprintf. format
takes a file descriptor, a format string and arguments to print. The file
descriptor may be one returned by the Scheme function fopen; it may also be
t, which means the output will be directed to standard out (cf. printf). A
third possibility is nil, which will cause the output to be printed to a
string which is returned (cf. sprintf).
The format string closely follows the format strings in ANSI C, but it is not
the same. Specifically the directives currently supported are, %%, %d, %x,
%s, %f, %g and %c. All modifiers for these are also supported. In addition
%l is provided for printing of Scheme objects as objects.
For example
(format t "%03d %3.4f %s %l %l %l\n" 23 23 "abc" "abc" '(a b d) utt1)
will produce
023 23.0000 abc "abc" (a b d) #<Utterance 32f228>
on standard output.
When large lisp expressions are printed they are difficult to read because of
the parentheses. The function pprintf prints an expression to a file
descriptor (or t for standard output) so that the s-expression is nicely
lined up and indented. This is often called pretty printing in Lisps.
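For example, a nested expression can be pretty printed to standard output as follows (a sketch; the exact indentation produced may vary):

```scheme
;; Pretty print a nested s-expression to standard output
(pprintf '(a (b (c d) (e f)) (g h)) t)
```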
For reading input from terminal or file, there is currently no equivalent to
scanf. Items may only be read as Scheme expressions. The command
(load FILENAME t)
will load all s-expressions in FILENAME and return them, unevaluated as a
list. Without the third argument the load function will load and evaluate
each s-expression in the file.
To read individual s-expressions use readfp. For example
(let ((fd (fopen trainfile "r"))
(entry)
(count 0))
(while (not (equal? (set! entry (readfp fd)) (eof-val)))
(if (string-equal (car entry) "home")
(set! count (+ 1 count))))
(fclose fd))
To convert a symbol whose print name is a number to a number, use
parse-number. This is the equivalent of atof in C.
Note that all i/o from Scheme input files is assumed to be basically some
form of Scheme data (though it can be just numbers or tokens). For more
elaborate analysis of incoming data it is possible to use the text
tokenization functions, which offer a fully programmable method of reading
data.
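A minimal illustration of parse-number (variable names here are only examples):

```scheme
;; Convert print names that look like numbers into actual numbers
(set! n (parse-number "3.14"))
(set! total (+ n (parse-number "10")))
```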
ΓòÉΓòÉΓòÉ 10. TTS ΓòÉΓòÉΓòÉ
Festival supports text-to-speech for raw text files. If you are not
interested in using Festival in any other way than as a black box for
rendering text as speech, the following method is probably what you want.
festival --tts myfile
This will say the contents of 'myfile'. Alternatively text may be submitted
on standard input
echo hello world | festival --tts
cat myfile | festival --tts
Festival supports the notion of text modes, where the text file type may be
identified, allowing Festival to process the file in an appropriate way.
Currently only two types are considered stable: STML and raw, but other types
such as email, HTML, LaTeX, etc. are under development and are discussed
below. This follows the idea of buffer modes in Emacs, where a file's type is
used to display the text in the best way. Text mode may also be selected
based on a filename's extension.
Within the command interpreter the function tts is used to render files as
speech; it takes a filename and the text mode as arguments.
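For example, from the Festival command interpreter ('myfile.txt' is a hypothetical filename; the email mode is described later in this chapter):

```scheme
;; Render a file as speech; nil selects the mode via auto-text-mode-alist
;; (or the fundamental mode if no match is found)
(tts "myfile.txt" nil)
;; Render a saved mail message using an explicit text mode
(tts "mymessage.email" 'email)
```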
Utterance chunking From text to utterances
Text modes Mode specific text analysis
Example text mode An example mode for reading email
ΓòÉΓòÉΓòÉ 10.1. Utterance chunking ΓòÉΓòÉΓòÉ
Text to speech works by first tokenizing the file and chunking the tokens into
utterances. The definition of utterance breaks is determined by the utterance
tree in variable eou_tree. A default version is given in 'lib/tts.scm'. This
uses a decision tree to determine what signifies an utterance break.
Obviously blank lines are probably the most reliable, followed by certain
punctuation. The confusion of the use of periods for both sentence breaks and
abbreviations requires some more heuristics to best guess their different use.
The following tree is currently used which works better than simply using
punctuation.
(defvar eou_tree
'((n.whitespace matches ".*\n.*\n\\(.\\|\n\\)*") ;; 2 or more newlines
((1))
((punc in ("?" ":" "!"))
((1))
((punc is ".")
;; This is to distinguish abbreviations vs periods
;; These are heuristics
((name matches "\\(.*\\..*\\|[A-Z][A-Za-z]?[A-Za-z]?\\|etc\\)")
((n.whitespace is " ")
((0)) ;; if abbrev single space isn't enough for break
((n.name matches "[A-Z].*")
((1))
((0))))
((n.whitespace is " ") ;; if it doesn't look like an abbreviation
((n.name matches "[A-Z].*") ;; single space and non-cap is no break
((1))
((0)))
((1))))
((0)))))
The token items this tree is applied to will always (except in the end of
file case) include one following token, so look-ahead is possible. The "n.",
"p." and "p.p." prefixes allow access to the surrounding token context. The
features name, whitespace and punc allow access to the contents of the token
itself. At present there is no way to access the lexicon from this tree,
which unfortunately might be useful if certain abbreviations were identified
as such there.
Note these are heuristics, written by hand rather than trained from data,
though problems have been fixed as they have been observed in data. The above
rules may make mistakes where abbreviations appear at the ends of lines, and
where improper spacing and capitalization is used. This is probably worth
changing for modes where more casual text appears, such as email messages and
USENET news messages. A possible improvement would be to first analyse a text
to find out its basic utterance-break conventions (i.e. if no sequences of
full stop, two spaces and a capitalized word appear, and the text is of a
reasonable length, then look for other criteria for utterance breaks).
Ultimately what we are trying to do is to chunk the text into utterances that
can be synthesized quickly and start to play them quickly to minimise the time
someone has to wait for the first sound when starting synthesis. Thus it
would be better if this chunking were done on prosodic phrases rather than
chunks more similar to linguistic sentences. Prosodic phrases are bounded in
size, while sentences are not.
ΓòÉΓòÉΓòÉ 10.2. Text modes ΓòÉΓòÉΓòÉ
We do not believe that all texts are of the same type. Often information
about the general contents of a file will aid synthesis greatly. For example,
in LaTeX files we do not want to hear "left brace, backslash e m" before each
emphasized word, nor do we necessarily want to hear formatting commands.
Festival offers a basic method for specifying customization rules depending on
the mode of the text. Here we are following the notion of modes in Emacs,
and eventually will allow customization at a similar level.
Modes are specified as the third argument to the function tts. When using the
Emacs interface to Festival the buffer mode is automatically passed as the
text mode. If the mode is not supported a warning message is printed and the
raw text mode is used.
Our initial text mode implementation allows configuration both in C++ and in
Scheme. Obviously in C++ almost anything can be done but it is not as easy to
reconfigure without recompilation. Here we will discuss those modes which can
be fully configured at run time.
A text mode may contain the following
filter A Unix shell program filter that processes the text file in some
appropriate way. For example, for email it might remove
uninteresting headers and just output the subject, the from line and
the message body. If not specified, an identity filter is used.
init_function This (Scheme) function will be called before any processing is
done. It allows further set-up of tokenization rules, voices,
etc.
exit_function This (Scheme) function will be called at the end of any
processing, allowing resetting of tokenization rules etc.
analysis_mode If the analysis mode is xml the file is read through the
built-in XML parser rxp. Alternatively, if the analysis mode is xxml
the filter should be an SGML normalising parser and the output is
processed in a way suitable for it. Any other value is ignored.
These mode-specific parameters are specified in the a-list held in
tts_text_modes.
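Putting these parameters together, a skeletal entry in tts_text_modes might look like the following (the mode name and my_* functions are hypothetical placeholders; a fuller, working version appears in the email example later in this chapter):

```scheme
;; Hypothetical skeleton of a text mode entry in tts_text_modes.
;; 'mymode and the my_mode_* functions are placeholders for your own.
(set! tts_text_modes
  (cons
   (list
    'mymode                            ;; mode name
    (list                              ;; mode parameters
     (list 'init_func my_mode_init_func)
     (list 'exit_func my_mode_exit_func)
     '(filter "cat")                   ;; identity filter
     '(analysis_mode xml)))            ;; parse input with the built-in XML parser
   tts_text_modes))
```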
When using Festival in Emacs the emacs buffer mode is passed to Festival as
the text mode.
Note that the above mechanism is not really designed to be re-entrant; this
should be addressed in later versions.
Following the use of auto-selection of mode in Emacs, Festival can auto-select
the text mode based on the filename given when no explicit mode is given. The
Lisp variable auto-text-mode-alist is a list of dotted pairs of regular
expression and mode name. For example to specify that the email mode is to be
used for files ending in '.email' we would add to the current
auto-text-mode-alist as follows
(set! auto-text-mode-alist
(cons (cons "\\.email$" 'email)
auto-text-mode-alist))
If the function tts is called with a mode other than nil, that mode overrides
any specified by the auto-text-mode-alist. The mode fundamental is the
explicit "null" mode; it is used when no mode is specified in the function
tts, no match is found in auto-text-mode-alist, or the specified mode is not
found.
By convention, if a requested text mode is not found in tts_text_modes the
file 'MODENAME-mode' will be required. Therefore if you have the file
'MODENAME-mode.scm' in your library it will be automatically loaded on
reference. Modes may be quite large and it is not necessary to have Festival
load them all at start-up time.
Because of the auto-text-mode-alist and the auto loading of currently
undefined text modes you can use Festival like
festival --tts example.email
Festival will automatically synthesize 'example.email' in text mode email.
If you add your own personal text modes you should do the following. Suppose
you've written an HTML mode. You have named it 'html-mode.scm' and put it in
'/home/awb/lib/festival/'. In your '.festivalrc', first identify your
personal Festival library directory by adding it to lib-path.
(set! lib-path (cons "/home/awb/lib/festival/" lib-path))
Then add a definition to the auto-text-mode-alist so that file names ending
'.html' or '.htm' are read in HTML mode.
(set! auto-text-mode-alist
(cons (cons "\\.html?$" 'html)
auto-text-mode-alist))
Then you may synthesize an HTML file either from Scheme
(tts "example.html" nil)
Or from the shell command line
festival --tts example.html
Anyone familiar with modes in Emacs should recognise that the process of
adding a new text mode to Festival is very similar to adding a new buffer mode
to Emacs.
ΓòÉΓòÉΓòÉ 10.3. Example text mode ΓòÉΓòÉΓòÉ
Here is a short example of a tts mode for reading email messages. It is by no
means complete but is a start at showing how you can customize tts modes
without writing new C++ code.
The first task is to define a filter that will take a saved mail message and
remove extraneous headers and just leave the from line, subject and body of
the message. The filter program is given a file name as its first argument
and should output the result on standard out. For our purposes we will do
this as a shell script.
#!/bin/sh
# Email filter for Festival tts mode
# usage: email_filter mail_message >tidied_mail_message
grep "^From: " $1
echo
grep "^Subject: " $1
echo
# delete up to first blank line (i.e. the header)
sed '1,/^$/ d' $1
Next we define the email init function, which will be called when we start
this mode. What we will do is save the current token to words function and
slot in our own new one. We can then restore the previous one when we exit.
(define (email_init_func)
"Called on starting email text mode."
(set! email_previous_t2w_func token_to_words)
(set! english_token_to_words email_token_to_words)
(set! token_to_words email_token_to_words))
Note that both english_token_to_words and token_to_words should be set to
ensure that our new token to word function is still used when we change
voices.
The corresponding end function puts the token to words function back.
(define (email_exit_func)
"Called on exit email text mode."
(set! english_token_to_words email_previous_t2w_func)
(set! token_to_words email_previous_t2w_func))
Now we can define the email-specific token to words function. In this example
we deal with two specific cases. First we deal with the common form of email
addresses, so that the angle brackets are not pronounced. Second, we
recognise quoted text and immediately switch to the alternative speaker.
(define (email_token_to_words token name)
"Email specific token to word rules."
(cond
This first condition identifies the token as a bracketed email address and
removes the brackets and splits the token into name and IP address. Note that
we recursively call the function email_previous_t2w_func on the email name and
IP address so that they will be pronounced properly. Note that because that
function returns a list of words we need to append them together.
((string-matches name "<.*.*>")
(append
(email_previous_t2w_func token
(string-after (string-before name "@") "<"))
(cons
"at"
(email_previous_t2w_func token
(string-before (string-after name "@") ">")))))
Our next condition deals with identifying a greater than sign being used as a
quote marker. When we detect this we select the alternative speaker, even
though it may already be selected. We then return no words so the quote
marker is not spoken. The following condition finds greater than signs which
are the first token on a line.
((and (string-matches name ">")
(string-matches (item.feat token "whitespace")
"[ \t\n]*\n *"))
(voice_don_diphone)
nil ;; return nothing to say
)
If the token doesn't match any of these cases we can go ahead and use the
built-in token to words function. Actually, we call the function that was set
before we entered this mode, to ensure any other specific rules still remain.
But before that we need to check if we've had a newline which doesn't start
with a greater than sign. In that case we switch back to the primary speaker.
(t ;; for all other cases
(if (string-matches (item.feat token "whitespace")
".*\n[ \t\n]*")
(voice_rab_diphone))
(email_previous_t2w_func token name))))
In addition to these we have to actually declare the text mode. This we do by
adding to any existing modes as follows.
(set! tts_text_modes
(cons
(list
'email ;; mode name
(list ;; email mode params
(list 'init_func email_init_func)
(list 'exit_func email_exit_func)
'(filter "email_filter")))
tts_text_modes))
This will now allow simple email messages to be dealt with in a mode specific
way.
An example mail message is included in 'examples/ex1.email'. To hear the
result of the above text mode start Festival, load in the email mode
descriptions, and call TTS on the example file.
(tts ".../examples/ex1.email" 'email)
The above falls well short of a real email mode but does illustrate how one
might go about building one. It should be reiterated that text modes are new
in Festival and their most effective form has not been discovered yet. This
will improve with time and experience.
ΓòÉΓòÉΓòÉ 11. XML/SGML mark-up ΓòÉΓòÉΓòÉ
The idea of a general, synthesizer-nonspecific mark-up language for
labelling text has been under discussion for some time. Festival has
supported an SGML-based markup language through multiple versions, most
recently STML (sproat97). This is based on the earlier SSML (Speech Synthesis
Markup Language) which was supported by previous versions of Festival
(taylor96). With this version of Festival we support Sable, a similar mark-up
language devised by a consortium from Bell Labs, Sun Microsystems, AT&T and
Edinburgh, sable98. Unlike the previous versions, which were SGML based, the
implementation of Sable in Festival is now XML based. To the user the
difference is negligible, but using XML makes processing of files easier and
more standardized. Also Festival now includes an XML parser, thus reducing
the dependencies in processing Sable text.
Raw text has the problem that it cannot always easily be rendered as speech in
the way the author wishes. Sable offers a well-defined way of marking up text
so that the synthesizer may render it appropriately.
The definition of Sable is by no means settled and is still in development.
In this release Festival offers people working on Sable and other XML (and
SGML) based markup languages a chance to quickly experiment with prototypes
by providing a DTD (document type definition) and the mapping of the elements
in the DTD to Festival functions. Although we have not yet (personally)
investigated facilities like cascading style sheets and generalized SGML
specification languages like DSSSL, we believe the facilities offered by
Festival allow rapid prototyping of speech output markup languages.
Primarily we see Sable marked-up text as a language that will be generated by
other programs, e.g. text generation systems, dialog managers etc.; therefore
a standard, easy-to-parse format is required, even if it seems overly verbose
for human writers.
For more information of Sable and access to the mailing list see
http://www.cstr.ed.ac.uk/projects/sable.html
Sable example an example of Sable with descriptions
Supported Sable tags Currently supported Sable tags
Adding Sable tags Adding new Sable tags
XML/SGML requirements Software environment requirements for use
Using Sable Rendering Sable files as speech
ΓòÉΓòÉΓòÉ 11.1. Sable example ΓòÉΓòÉΓòÉ
Here is a simple example of Sable marked up text
<?xml version="1.0"?>
<!DOCTYPE SABLE PUBLIC "-//SABLE//DTD SABLE speech mark up//EN"
"Sable.v0_2.dtd"
[]>
<SABLE>
<SPEAKER NAME="male1">
The boy saw the girl in the park <BREAK/> with the telescope.
The boy saw the girl <BREAK/> in the park with the telescope.
Good morning <BREAK /> My name is Stuart, which is spelled
<RATE SPEED="-40%">
<SAYAS MODE="literal">stuart</SAYAS> </RATE>
though some people pronounce it
<PRON SUB="stoo art">stuart</PRON>. My telephone number
is <SAYAS MODE="literal">2787</SAYAS>.
I used to work in <PRON SUB="Buckloo">Buccleuch</PRON> Place,
but no one can pronounce that.
By the way, my telephone number is actually
<AUDIO SRC="http://www.cstr.ed.ac.uk/~awb/sounds/touchtone.2.au"/>
<AUDIO SRC="http://www.cstr.ed.ac.uk/~awb/sounds/touchtone.7.au"/>
<AUDIO SRC="http://www.cstr.ed.ac.uk/~awb/sounds/touchtone.8.au"/>
<AUDIO SRC="http://www.cstr.ed.ac.uk/~awb/sounds/touchtone.7.au"/>.
</SPEAKER>
</SABLE>
After the initial definition of the SABLE tags, through the file
'Sable.v0_2.dtd', which is distributed as part of Festival, the body is given.
There are tags for identifying the language and the voice. Explicit boundary
markers may be given in the text. Duration and intonation control can also be
explicitly specified, as can new pronunciations of words. The last sentence
specifies some external filenames to play at that point.
ΓòÉΓòÉΓòÉ 11.2. Supported Sable tags ΓòÉΓòÉΓòÉ
There is not yet a definitive set of tags but hopefully such a list will form
over the next few months. As adding support for new tags is often trivial the
problem lies much more in defining what tags there should be than in actually
implementing them. The following are based on version 0.2 of Sable as
described in http://www.cstr.ed.ac.uk/projects/sable_spec2.html, though some
aspects are not currently supported in this implementation. Further updates
will be announced through the Sable mailing list.
LANGUAGE Allows the specification of the language through the ID attribute.
Valid values in Festival are english, en1, spanish, en, and others
depending on your particular installation. For example
<LANGUAGE id="english"> ... </LANGUAGE>
If the language isn't supported by the particular installation of Festival,
"Some text in ..." is said instead and the section is omitted.
SPEAKER Select a voice. Accepts a parameter NAME which takes values male1,
male2, female1, etc. There is currently no definition about what
happens when a voice is selected which the synthesizer doesn't
support. An example is
<SPEAKER name="male1"> ... </SPEAKER>
AUDIO This allows the specification of an external waveform that is to be
included. There are attributes for specifying volume and whether
the waveform is to be played in the background of the following text
or not. Festival as yet only supports insertion.
My telephone number is
<AUDIO SRC="http://www.cstr.ed.ac.uk/~awb/sounds/touchtone.2.au"/>
<AUDIO SRC="http://www.cstr.ed.ac.uk/~awb/sounds/touchtone.7.au"/>
<AUDIO SRC="http://www.cstr.ed.ac.uk/~awb/sounds/touchtone.8.au"/>
<AUDIO SRC="http://www.cstr.ed.ac.uk/~awb/sounds/touchtone.7.au"/>.
MARKER This allows Festival to mark when a particular part of the text has
been reached. At present the value of the MARK attribute is simply
printed. This is done when that piece of text is analyzed, not when
it is played. To use this in any real application would require
changes to this tag's implementation.
Move the <MARKER MARK="mouse" /> mouse to the top.
BREAK Specifies a boundary at some LEVEL. The level may take the values
Large, Medium, Small or a number. Note that this is an empty tag
and must include the closing slash within its own specification.
<BREAK LEVEL="LARGE"/>
DIV This signals a division. In Festival this causes an utterance
break. A TYPE attribute may be specified but it is ignored by
Festival.
PRON Allows the pronunciation of the enclosed text to be explicitly given.
It supports the attributes IPA for an IPA specification (not
currently supported by Festival); SUB for text to be substituted,
which can be in some form of phonetic spelling; and ORIGIN where the
linguistic origin of the enclosed text may be identified to assist in
etymologically sensitive letter to sound rules.
<PRON SUB="toe maa toe">tomato</PRON>
SAYAS Allows identification of the enclosed tokens/text. The attribute
MODE can take any of the following values: literal, date, time,
phone, net, postal, currency, math, fraction, measure, ordinal,
cardinal, or name. Further specification of type for dates (MDY,
DMY etc.) may be specified through the MODETYPE attribute.
As a test of marked-up numbers. Here we have
a year <SAYAS MODE="date">1998</SAYAS>,
an ordinal <SAYAS MODE="ordinal">1998</SAYAS>,
a cardinal <SAYAS MODE="cardinal">1998</SAYAS>,
a literal <SAYAS MODE="literal">1998</SAYAS>,
and phone number <SAYAS MODE="phone">1998</SAYAS>.
EMPH Specifies that the enclosed text should be emphasized. A LEVEL
attribute may be specified but its value is currently ignored by
Festival (besides, the emphasis Festival generates isn't very good
anyway).
The leaders of <EMPH>Denmark</EMPH> and <EMPH>India</EMPH> meet on
Friday.
PITCH Allows the specification of pitch range, mid and base points.
Without his penguin, <PITCH BASE="-20%"> which he left at home, </PITCH>
he could not enter the restaurant.
RATE Allows the specification of speaking rate
The address is <RATE SPEED="-40%"> 10 Main Street </RATE>.
VOLUME Allows the specification of volume. Note in Festival this causes an
utterance break before and after this tag.
Please speak more <VOLUME LEVEL="loud">loudly</VOLUME>, except
when I ask you to speak <VOLUME LEVEL="quiet">in a quiet voice</VOLUME>.
ENGINE This allows specification of engine specific commands
An example is <ENGINE ID="festival" DATA="our own festival speech
synthesizer"> the festival speech synthesizer</ENGINE> or
the Bell Labs speech synthesizer.
These tags may change in name but they cover the aspects of speech mark up
that we wish to express. Later additions and changes to these are expected.
See the files 'festival/examples/example.sable' and
'festival/examples/example2.sable' for working examples.
Note the definition of Sable is ongoing and there are likely to be later,
more complete implementations of Sable for Festival as independent releases;
consult http://www.cstr.ed.ac.uk/projects/sable.html for the most recent
updates.
ΓòÉΓòÉΓòÉ 11.3. Adding Sable tags ΓòÉΓòÉΓòÉ
We do not yet claim that there is a fixed standard for Sable tags but we wish
to move towards such a standard. In the mean time we have made it easy in
Festival to add support for new tags without, in general, having to change any
of the core functions.
Two changes are necessary to add a new tag. First, change the definition in
'lib/Sable.v0_2.dtd' so that Sable files may use it. The second stage is to
make Festival sensitive to that new tag. The example in
festival/lib/sable-mode.scm shows how a new text mode may be implemented for
an XML/SGML-based markup language. The basic point is that an identified
function will be called on finding a start tag or an end tag in the document.
It is the tag-function's job to synthesize the given utterance if the tag
signals an utterance boundary. The return value from the tag-function is the
new status of the current utterance, which may remain unchanged; if the
current utterance has been synthesized, nil should be returned, signalling a
new utterance.
Note that the hierarchical structure of the document is not available in this
method of tag-functions. Any hierarchical state that must be preserved has to
be maintained using explicit stacks in Scheme. This is an artifact of the
cross-cutting relationship between utterances and tags (utterances may end
within start and end tags), and the desire to have all specification in
Scheme rather than C++.
The tag-functions are defined in an elements list. They are identified with
names such as "(SABLE" and ")SABLE" denoting start and end tags respectively.
Two arguments are passed to these tag functions: an assoc list of attributes
and values as specified in the document, and the current utterance. If the
tag denotes an utterance break, call xxml_synth on UTT and return nil. If a
tag (start or end) is found in the document and there is no corresponding
tag-function it is ignored.
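As a sketch of these conventions, a pair of tag-functions for a hypothetical DIALOG element might look like this (the element name is invented; xxml_synth is the utterance-synthesis function named above):

```scheme
;; Hypothetical tag-functions for a <DIALOG> element.
;; ATTS is the assoc list of attributes; UTT is the current utterance.
(define (dialog_start_tag atts utt)
  (xxml_synth utt)   ;; this tag marks an utterance break
  nil)               ;; nil signals that a new utterance starts here

(define (dialog_end_tag atts utt)
  (xxml_synth utt)
  nil)
```

These would be registered in the elements list under the names "(DIALOG" and ")DIALOG" respectively.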
New features may be added to words with a start and end tag by adding features
to the global xxml_word_features. Any features in that variable will be added
to each word.
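For instance, to attach a feature to every word seen between a start tag and its matching end tag, one might push it onto xxml_word_features on the start tag and pop it on the end tag (the feature name here is illustrative only):

```scheme
;; Sketch: mark words inside the current element with a feature.
;; 'in_emph is a hypothetical feature name.
(set! xxml_word_features
      (cons (list 'in_emph "1") xxml_word_features))
;; ... and on the matching end tag, remove it again:
(set! xxml_word_features (cdr xxml_word_features))
```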
Note that this method may be used for both XML-based languages and SGML-based
markup languages (though an external normalizing SGML parser is required in
the SGML case). The type (XML vs SGML) is identified by the analysis_type
parameter in the tts text mode specification.
ΓòÉΓòÉΓòÉ 11.4. XML/SGML requirements ΓòÉΓòÉΓòÉ
Festival is distributed with rxp an XML parser developed by Richard Tobin of
the Language Technology Group, University of Edinburgh. Sable is set up as an
XML text mode so no further requirements or external programs are required to
synthesize from Sable marked up text (unlike previous releases). Note that
rxp is not a full validation parser and hence doesn't check some aspects of
the file (tags within tags).
Festival still supports SGML based markup but in such cases requires an
external SGML normalizing parser. We have tested 'nsgmls-1.0' which is
available as part of the SGML tools set 'sp-1.1.tar.gz' which is available
from http://www.jclark.com/sp/index.html. This seems portable between many
platforms.
ΓòÉΓòÉΓòÉ 11.5. Using Sable ΓòÉΓòÉΓòÉ
Support in Festival for Sable is as a text mode. In the command mode, use the
following to process a Sable file
(tts "file.sable" 'sable)
Also the automatic selection of mode based on file type has been set up such
that files ending '.sable' will be automatically synthesized in this mode.
Thus
festival --tts fred.sable
will render 'fred.sable' as speech in Sable mode.
Another way of using Sable is through the Emacs interface. The say-buffer
command will send the Emacs buffer mode to Festival as its tts-mode. If the
Emacs mode is stml or sgml the file is treated as a Sable file. See Emacs
interface.
Many people experimenting with Sable (and TTS in general) want all the
waveform output to be saved so it can be played at a later date. The simplest
way to do this is with the 'text2wave' script. It respects the auto mode
selection, so
text2wave fred.sable -o fred.wav
Note this renders the file as a single waveform (produced by concatenating
the waveforms for each utterance in the Sable file).
If you wish the waveform for each utterance in a file to be saved, you can
cause the tts process to save the waveforms during synthesis. After a call to
festival> (save_waves_during_tts)
any future call to tts will cause the waveforms to be saved in files
'tts_file_xxx.wav' where 'xxx' is a number. A call to
(save_waves_during_tts_STOP) will stop the saving of waves. A message is
printed when each waveform is saved, as otherwise people forget about this
and wonder why their disk has filled up.
This is done by inserting a function in tts_hooks which saves the wave. To do
other things to each utterances during TTS (such as saving the utterance
structure), try redefining the function save_tts_output (see
festival/lib/tts.scm).
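A sketch of such a redefinition, assuming save_tts_output is handed each utterance as in 'festival/lib/tts.scm' (the counter variable and filenames below are our own, not Festival's):

```scheme
;; Hypothetical replacement for save_tts_output: save each utterance
;; structure alongside its waveform.  utt.save and utt.save.wave are
;; standard Festival utterance save functions.
(defvar my_tts_count 0)
(define (save_tts_output utt)
  (set! my_tts_count (+ 1 my_tts_count))
  (utt.save utt (format nil "tts_utt_%03d.utt" my_tts_count))
  (utt.save.wave utt (format nil "tts_file_%03d.wav" my_tts_count) 'riff))
```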
ΓòÉΓòÉΓòÉ 12. Emacs interface ΓòÉΓòÉΓòÉ
One easy method of using Festival is via an Emacs interface that allows
selection of text regions to be sent to Festival for rendering as speech.
'festival.el' offers a new minor mode which offers an extra menu (in emacs-19
and 20) with options for saying a selected region, or a whole buffer, as well
as various general control functions. To use this you must install
'festival.el' in a directory where Emacs can find it, then add to your
'.emacs' in your home directory the following lines.
(autoload 'say-minor-mode "festival" "Menu for using Festival." t)
(say-minor-mode t)
Successive calls to say-minor-mode will toggle the minor mode, switching the
'say' menu on and off.
Note that the optional voice selection offered by the language sub-menu is
not sensitive to the actual voices supported by your Festival installation.
Hand customization is required in the 'festival.el' file. Thus some voices
may appear in your menu that your Festival doesn't support, and some voices
supported by your Festival may not appear in the menu.
When the Emacs Lisp function festival-say-buffer or the menu equivalent is
used the Emacs major mode is passed to Festival as the text mode.
ΓòÉΓòÉΓòÉ 13. Phonesets ΓòÉΓòÉΓòÉ
The notion of phonesets is important to a number of different subsystems
within Festival. Festival supports multiple phonesets simultaneously and
allows mapping between sets when necessary. The lexicons, letter to sound
rules, waveform synthesizers, etc. all require the definition of a phoneset
before they will operate.
A phoneset is a set of symbols which may be further defined in terms of
features, such as vowel/consonant, place of articulation for consonants, type
of vowel etc. The set of features and their values must be defined with the
phoneset. The definition is used to ensure compatibility between sub-systems
as well as allowing groups of phones to be referred to in various prediction
systems (e.g. duration).
A phoneset definition has the form
(defPhoneSet
NAME
FEATUREDEFS
PHONEDEFS )
The NAME is any unique symbol used e.g. mrpa, darpa, etc. FEATUREDEFS is a
list of definitions each consisting of a feature name and its possible values.
For example
(
(vc + -) ;; vowel consonant
(vlength short long diphthong schwa 0) ;; vowel length
...
)
The third section is a list of phone definitions themselves. Each phone
definition consists of a phone name and the values for each feature in the
order the features were defined in the above section.
A typical example of a phoneset definition can be found in
'lib/mrpa_phones.scm'.
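A cut-down, purely illustrative definition (not a real phoneset; the set name, phones and feature values are invented for the example) might look like:

```scheme
;; Illustrative only: a tiny phoneset with two features.
(defPhoneSet
  tiny
  ;; feature definitions
  ((vc + -)                                ;; vowel/consonant
   (vlength short long diphthong schwa 0)) ;; vowel length
  ;; phone definitions: name then a value for each feature, in order
  ((#  - 0)       ;; silence
   (a  + short)
   (ii + long)
   (t  - 0)))
;; identify the silence phone to the system
(PhoneSet.silences '(#))
```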
Note the phoneset should also include a definition for any silence phones. In
addition to the definition of the set the silence phone(s) themselves must
also be identified to the system. This is done through the command
PhoneSet.silences. In the mrpa set this is done by the command
(PhoneSet.silences '(#))
There may be more than one silence phone (e.g. breath, start silence etc.) in
any phoneset definition. However, the first phone in this set is treated
specially and should be the canonical silence. Among other things, it is this
phone that is inserted by the pause prediction module.
In addition to declaring phonesets, alternate sets may be selected by the
command PhoneSet.select.
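For example, to make a previously defined set current (mrpa is the set defined in 'lib/mrpa_phones.scm'):

```scheme
;; Select an already defined phoneset as the current one
(PhoneSet.select 'mrpa)
```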
Phones in different sets may be automatically mapped between using their
features. This mapping is not yet as general as it could be, but is useful
when mapping between various phonesets of the same language. When a phone
needs to be mapped from one set to another the phone with matching features is
selected. This allows, at least to some extent, lexicons, waveform
synthesizers, duration modules etc. to use different phonesets (though in
general this is not advised).
A list of currently defined phonesets is returned by the function
(PhoneSet.list)
Note phonesets are often not defined until a voice is actually loaded, so
this list is not the list of sets that are distributed but the list of sets
that are used by currently loaded voices.
The name, phones, features and silences of the current phoneset may be
accessed with the function
(PhoneSet.description nil)
If the argument to this function is a list, only those parts of the phoneset
description named are returned. For example
(PhoneSet.description '(silences))
(PhoneSet.description '(silences phones))
ΓòÉΓòÉΓòÉ 14. Lexicons ΓòÉΓòÉΓòÉ
A lexicon in Festival is a subsystem that provides pronunciations for words.
It can consist of three distinct parts: an addenda, typically short,
consisting of hand-added words; a compiled lexicon, typically large (10,000s
of words), which sits on disk somewhere; and a method for dealing with words
not in either list.
Lexical entries Format of lexical entries
Defining lexicons Building new lexicons
Lookup process Order of significance
Letter to sound rules Dealing with unknown words
Building letter to sound rules Building rules from data
Lexicon requirements What should be in the lexicon
Available lexicons Current available lexicons
Post-lexical rules Modification of words in context
ΓòÉΓòÉΓòÉ 14.1. Lexical entries ΓòÉΓòÉΓòÉ
Lexical entries consist of three basic parts: a head word, a part of speech
and a pronunciation. The head word is what you might normally think of as a
word, e.g. 'walk', 'chairs' etc., but it might be any token.
The part-of-speech field currently consists of a simple atom (or nil if none
is specified). Of course there are many part of speech tag sets, and whatever
you mark in your lexicon must be compatible with the subsystems that use that
information. You can optionally set a part of speech tag mapping for each
lexicon. The value should be a reverse assoc-list of the following form
(lex.set.pos.map
'((( punc fpunc) punc)
(( nn nnp nns nnps ) n)))
All part of speech tags not appearing in the left hand side of a pos map are
left unchanged.
The third field contains the actual pronunciation of the word. This is an
arbitrary Lisp S-expression. In many of the lexicons distributed with
Festival this entry has an internal format, identifying syllable structure,
stress markings and of course the phones themselves. In some of our other
lexicons we simply list the phones with stress marking on each vowel.
Some typical example entries are
( "walkers" n ((( w oo ) 1) (( k @ z ) 0)) )
( "present" v ((( p r e ) 0) (( z @ n t ) 1)) )
( "monument" n ((( m o ) 1) (( n y u ) 0) (( m @ n t ) 0)) )
Note you may have two entries with the same headword, distinguished by their
different part of speech fields. For example
( "lives" n ((( l ai v z ) 1)) )
( "lives" v ((( l i v z ) 1)) )
See Lookup process for a description of how multiple entries with the same
headword are used during lookup.
By current conventions, single syllable function words should have no stress
marking, while single syllable content words should be stressed.
NOTE: the POS field may change in future to contain more complex formats. The
same lexicon mechanism (but different lexicon) is used for holding part of
speech tag distributions for the POS prediction module.
ΓòÉΓòÉΓòÉ 14.2. Defining lexicons ΓòÉΓòÉΓòÉ
As stated above, lexicons consist of three basic parts (compiled form, addenda
and unknown word method) plus some other declarations.
Each lexicon in the system has a name which allows different lexicons to be
selected efficiently when switching between voices during synthesis. The
basic steps involved in a lexicon definition are as follows.
First a new lexicon must be created with a new name
(lex.create "cstrlex")
A phone set must be declared for the lexicon, both to allow checks on the
entries themselves and to allow phone mapping between the different phone
sets used in the system
(lex.set.phoneset "mrpa")
The phone set must be already declared in the system.
A compiled lexicon, the construction of which is described below, may
optionally be specified
(lex.set.compile.file "/projects/festival/lib/dicts/cstrlex.out")
The method for dealing with unknown words (see Letter to sound rules) may be
set
(lex.set.lts.method 'lts_rules)
(lex.set.lts.ruleset 'nrl)
In this case we are specifying the use of a set of letter to sound rules
originally developed by the U.S. Naval Research Laboratories. The default
method is to give an error if a word is not found in the addenda or compiled
lexicon. (This and other options are discussed more fully below.)
Finally, addenda items may be added for words that are known to be common but
are not in the lexicon and cannot reasonably be analysed by the letter to
sound rules.
(lex.add.entry
'( "awb" n ((( ei ) 1) ((d uh) 1) ((b @ l) 0) ((y uu) 0) ((b ii) 1))))
(lex.add.entry
'( "cstr" n ((( s ii ) 1) (( e s ) 1) (( t ii ) 1) (( aa ) 1)) ))
(lex.add.entry
'( "Edinburgh" n ((( e m ) 1) (( b r @ ) 0))))
Using lex.add.entry again for the same word and part of speech will redefine
the current pronunciation. Note these add entries to the current lexicon, so
it's a good idea to explicitly select the lexicon before you add addenda
entries, particularly if you are doing this in your own '.festivalrc' file.
For large lists, compiled lexicons are best. The function lex.compile takes
two filename arguments, a file name containing a list of lexical entries and
an output file where the compiled lexicon will be saved.
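A compilation run might be sketched as follows (the file names here are
hypothetical, following the cstrlex example above):

```scheme
;; A minimal sketch with hypothetical file names: select the lexicon,
;; then compile a file of lexical entries into its binary form.
(lex.select "cstrlex")
(lex.compile "/projects/festival/lib/dicts/cstrlex.scm"
             "/projects/festival/lib/dicts/cstrlex.out")
```

The resulting output file is what would then be named in
lex.set.compile.file as shown earlier.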
Compilation can take some time and may require lots of memory, as all
entries are loaded in, checked and then sorted before being written out
again. If an entry is malformed, the reading process halts during
compilation with a not particularly useful message. Note that if any of your
entries include single or double quotes the entries will probably be
misparsed and cause such an error.
In such cases try setting
(debug_output t)
before compilation. This will print out each entry as it is read in which
should help to narrow down where the error is.
ΓòÉΓòÉΓòÉ 14.3. Lookup process ΓòÉΓòÉΓòÉ
When looking up a word, either through the C++ interface or the Lisp
interface, a word is identified by its headword and part of speech. If no
part of speech is specified, nil is assumed, which matches any part of
speech tag.
The lexicon look-up process first checks the addenda; if there is a full
match (head word plus part of speech) it is returned. If there is an addenda
entry whose head word matches and whose part of speech is nil, that entry is
returned.
If no match is found in the addenda, the compiled lexicon, if present, is
checked. Again a match occurs when both head word and part of speech tag
match, or when either the word being searched for has a part of speech of
nil or an entry has nil as its tag. Unlike the addenda, if no full head word
and part of speech tag match is found, the first entry in the lexicon whose
head word matches is returned. The rationale is that the letter to sound
rules (the next line of defence) are unlikely to do better than an alternate
pronunciation of the word with a different part of speech. Moreover, the
very existence of an entry with that head word but a different part of
speech suggests the word may have an unusual pronunciation that the letter
to sound rules have no chance of producing.
Finally if the word is not found in the compiled lexicon it is passed to
whatever method is defined for unknown words. This is most likely a letter to
sound module. See Letter to sound rules.
Optional pre- and post-lookup hooks can be specified for a lexicon, as a
single Lisp function or a list of them. The pre-hooks will be called with
two arguments (word and features) and should return a pair (word and
features). The post-hooks will be given a lexical entry and should return a
lexical entry. The pre- and post-hooks do nothing by default.
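As a sketch (assuming the hooks are installed with lex.set.pre_hooks, a
function name not shown above), a pre-hook that downcases every word before
lookup might look like:

```scheme
;; Hypothetical sketch: downcase the head word before lookup.
;; The installation function lex.set.pre_hooks is assumed; the hook
;; itself follows the contract above: (word . features) in and out.
(define (downcase_word word feats)
  (cons (downcase word) feats))
(lex.set.pre_hooks (list downcase_word))
```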
Compiled lexicons may be created from lists of lexical entries. A compiled
lexicon is much more efficient for look up than the addenda. Compiled
lexicons use a binary search method while the addenda is searched linearly.
Also it would take a prohibitively long time to load in a typical full lexicon
as an addenda. If you have more than a few hundred entries in your addenda
you should seriously consider adding them to your compiled lexicon.
Because many publicly available lexicons do not have syllable markings for
entries, the compilation method supports automatic syllabification. Thus for
lexicon entries for compilation, two forms for the pronunciation field are
supported: the standard fully syllabified and stressed form, and a simpler
linear form found in at least the BEEP and CMU lexicons. If the
pronunciation field is a flat atomic list it is assumed syllabification is
required.
Syllabification is done by finding the minimum sonorant position between
vowels. It is not guaranteed to be accurate but does give a solution that is
sufficient for many purposes. A little work would probably improve this
significantly. Of course syllabification requires the entry's phones to be in
the current phone set. The sonorant values are calculated from the vc, ctype,
and cvox features for the current phoneset. See
'src/arch/festival/Phone.cc:ph_sonority()' for actual definition.
Additionally, in this flat structure vowels (atoms starting with a, e, i, o
or u) may have 1, 2 or 0 appended, marking stress. This again follows the
form found in the BEEP and CMU lexicons.
Some example entries in the flat form (taken from BEEP) are
("table" nil (t ei1 b l))
("suspicious" nil (s @ s p i1 sh @ s))
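The same syllabification may be invoked directly from Lisp via
lex.syllabify.phstress (also used in the Welsh example later in this
chapter); a sketch:

```scheme
;; Sketch: syllabify a flat phone list, taking the syllable stress
;; from the digits appended to the vowels.
(lex.syllabify.phstress '(t ei1 b l))
;; should produce a syllabified, stressed structure along the lines
;; of ((( t ei b l ) 1)), with the "1" lifted from the vowel ei1
```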
Also if syllabification is required there is an opportunity to run a set of
letter-to-sound rules on the input (actually an arbitrary re-write rule
system). If the variable lex_lts_set is set, the lts ruleset of that name is
applied to the flat input before syllabification. This allows simple
predictable changes, such as conversion of a final r into a longer vowel,
for deriving English RP pronunciations from American-labelled lexicons.
A list of all matching entries in the addenda and the compiled lexicon may be
found by the function lex.lookup_all. This function takes a word and returns
all matching entries irrespective of part of speech.
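For example, using the "lives" entries from the previous section, and
assuming lookup with an explicit part of speech as described at the start
of this section, one might call:

```scheme
;; Look up with an explicit part of speech, then all matches.
(lex.lookup "lives" 'v)    ; the verb entry
(lex.lookup_all "lives")   ; noun and verb entries, irrespective of POS
```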
ΓòÉΓòÉΓòÉ 14.4. Letter to sound rules ΓòÉΓòÉΓòÉ
Each lexicon may define what action to take when a word cannot be found in
the addenda or the compiled lexicon. There are a number of options, which
will hopefully be added to as more general letter to sound rule systems are
developed.
The method is set by the command
(lex.set.lts.method METHOD)
Where METHOD can be any of the following
'Error' Throw an error when an unknown word is found (default).
'lts_rules'Use an externally specified set of letter to sound rules
(described below). The name of the rule set to use is defined with the
lex.set.lts.ruleset function. This method runs one set of rules on an
exploded form of the word and assumes the rules return a list of
phonemes (in the appropriate set). If multiple rule sets must be
applied, use the function method described next.
'none' This returns an entry with a nil pronunciation field. This will
only be valid in very special circumstances.
'FUNCTIONNAME'Call the named Lisp function. The function is given two
arguments: the word and the part of speech. It should return a
valid lexical entry.
The basic letter to sound rule system is very simple but is powerful enough
to build reasonably complex letter to sound rules. Although we've found
trained LTS rules better than hand-written ones (for complex languages),
where no data is available and rules must be written by hand the following
formalism is much easier to use than that generated by the LTS training
system (described in the next section).
The basic form of a rule is as follows
( LEFTCONTEXT [ ITEMS ] RIGHTCONTEXT = NEWITEMS )
The interpretation is that if ITEMS appear in the specified left and right
contexts then the output string is to contain NEWITEMS. Any of LEFTCONTEXT,
RIGHTCONTEXT or NEWITEMS may be empty. Note that NEWITEMS is written to a
different "tape" and hence cannot feed further rules (within this ruleset).
An example is
( # [ c h ] C = k )
The special character # denotes a word boundary, and the symbol C denotes
the set of all consonants; sets are declared before the rules. This rule
states that a ch at the start of a word followed by a consonant is to be
rendered as the k phoneme. Symbols in contexts may be followed by the symbol
* for zero or more occurrences, or + for one or more occurrences.
The symbols in the rules are treated as set names if they are declared as such
or as symbols in the input/output alphabets. The symbols may be more than one
character long and the names are case sensitive.
The rules are tried in order until one matches the first (or more) symbols
of the tape. That rule is applied, adding its right hand side to the output
tape, and the rules are then applied again from the start of the list.
The function used to apply a set of rules will, if given an atom, explode it
into a list of single characters; if given a list, it will use it as is.
This reflects the common usage of wishing to re-write the individual letters
in a word to phonemes but without excluding the possibility of using the
system for more complex manipulations, such as multi-pass LTS systems and
phoneme conversion.
From Lisp there are three basic access functions; there are corresponding
functions in the C/C++ domain.
(lts.ruleset NAME SETS RULES) Define a new set of lts rules, where NAME is
the name for this rule set, SETS is a list of set definitions of the
form (SETNAME e0 e1 ...) and RULES is a list of rules as described
above.
(lts.apply WORD RULESETNAME)Apply the set of rules named RULESETNAME to WORD.
If WORD is a symbol it is exploded into a list of the individual
characters in its print name. If WORD is a list it is used as is.
If the rules cannot be successfully applied an error is given. The
result of (successful) application is returned in a list.
(lts.check_alpha WORD RULESETNAME)The symbols in WORD are checked against the
input alphabet of the rules named RULESETNAME. If they are all
contained in that alphabet t is returned, else nil. Note this does
not necessarily mean the rules will successfully apply (contexts may
restrict the application of the rules), but it allows general
checking like numerals, punctuation etc, allowing application of
appropriate rule sets.
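As a toy illustration of how the three functions fit together (the rule set
name and its deliberately tiny coverage are invented for this example, not a
real alphabet):

```scheme
;; A toy ruleset covering just the letters of the word "chat".
(lts.ruleset
 toy
 ;; sets: C is the consonant set used in the contexts below
 ((C b c d f g h k l m n p r s t))
 ;; rules: word-initial "ch" before a consonant gives k, otherwise
 ;; "ch" gives ch; the remaining letters map to themselves
 (( # [ c h ] C = k )
  ( [ c h ] = ch )
  ( [ c ] = k )
  ( [ h ] = h )
  ( [ a ] = a )
  ( [ t ] = t )))
(lts.check_alpha "chat" 'toy)  ; should return t: all letters covered
(lts.apply "chat" 'toy)        ; should give (ch a t)
```

Here the first rule fails on "chat" (the a after ch is not in C), so the
second rule fires and ch is emitted as a single phone.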
The letter to sound rule system may be used directly from Lisp and can easily
be used to do relatively complex operations for analyzing words without
requiring modification of the C/C++ system. For example the Welsh letter to
sound rule system consists of three rule sets: the first explicitly
identifies epenthesis, the second identifies stressed vowels, and the third
rewrites this augmented letter string to phonemes. This is achieved by the
following function
(define (welsh_lts word features)
(let (epen str wel)
(set! epen (lts.apply (downcase word) 'newepen))
(set! str (lts.apply epen 'newwelstr))
(set! wel (lts.apply str 'newwel))
(list word
nil
(lex.syllabify.phstress wel))))
The LTS method for the Welsh lexicon is set to welsh_lts, so this function
is called when a word is not found in the lexicon. The above function first
downcases the word and then applies the rule sets in turn, finally calling
the syllabification process and returning a constructed lexical entry.
ΓòÉΓòÉΓòÉ 14.5. Building letter to sound rules ΓòÉΓòÉΓòÉ
As writing letter to sound rules by hand is hard and very time consuming, an
alternative method is also available whereby a letter to sound system may be
built from a lexicon of the language. This technique has successfully been
used for English (British and American), French and German. The difficulty
and appropriateness of using letter to sound rules is very language
dependent.
The following outlines the process involved in building a letter to sound
model for a language, given a large lexicon of pronunciations. This
technique is likely to work for most European languages (including Russian)
but doesn't seem particularly suitable for very large alphabet languages
like Japanese and Chinese. The process described here is not (yet) fully
automatic, but the hand intervention required is small and may easily be
done even by people with only a little knowledge of the language being dealt
with.
The process involves the following steps
Pre-processing lexicon into suitable training set
Defining the set of allowable pairing of letters to phones. (We intend
to do this fully automatically in future versions).
Constructing the probabilities of each letter/phone pair.
Aligning letters to an equal set of phones/_epsilons_.
Extracting the data by letter suitable for training.
Building CART models for predicting phone from letters (and context).
Building additional lexical stress assignment model (if necessary).
All except the first two stages of this are fully automatic.
Before building a model it's wise to think a little about what you want it
to do. Ideally the model is an auxiliary to the lexicon, so only words not
found in the lexicon will require use of the letter to sound rules. Thus
only unusual forms are likely to require the rules. More precisely, the most
common words, often having the most non-standard pronunciations, should
probably always be explicitly listed. It is possible to reduce the size of
the lexicon (sometimes drastically) by removing all entries that the trained
LTS model correctly predicts.
Before starting, it is wise to consider removing some entries from the
lexicon before training. I will typically remove words under 4 letters and,
if part of speech information is available, all function words, ideally
training only from nouns, verbs and adjectives, as these are the forms most
likely to be unknown in text. It is useful to have morphologically inflected
and derived forms in the training set, as it is often such variant forms
that are not found in the lexicon even though their root morpheme is. Note
that in many forms of text proper names are the most common form of unknown
word, and even the technique presented here may not adequately cater for
that form of unknown word (especially if the unknown words are non-native
names). All this is to say that this may or may not be appropriate for your
task, but the rules generated by this learning process have, in the examples
we've done, been much better than what we could produce by hand writing
rules of the form described in the previous section.
First preprocess the lexicon into a file of lexical entries to be used for
training, removing function words and changing the head words to all lower
case (this may be language dependent). The entries should be of the form
used for input to Festival's lexicon compilation. Specifically, the
pronunciations should be simple lists of phones (no syllabification).
Depending on the language you may wish to remove the stressing---for the
examples here we have, though later tests suggest that we should keep it in
even for English. Thus the training set should look something like
("table" nil (t ei b l))
("suspicious" nil (s @ s p i sh @ s))
It is best to split the data into a training set and a test set if you wish to
know how well your training has worked. In our tests we remove every tenth
entry and put it in a test set. Note this will mean our test results are
probably better than if we removed say the last ten in every hundred.
The second stage is to define the set of allowable letter to phone mappings
irrespective of context. This can sometimes be initially done by hand and
then checked against the training set. Initially construct a file of the
form
(require 'lts_build)
(set! allowables
'((a _epsilon_)
(b _epsilon_)
(c _epsilon_)
...
(y _epsilon_)
(z _epsilon_)
(# #)))
All letters that appear in the alphabet should (at least) map to _epsilon_,
including any accented characters that appear in that language. Note the
last two hashes. These are used to denote the beginning and end of a word
and are automatically added during training; they must appear in the list
and should only map to themselves.
To incrementally add to this allowable list run festival as
festival allowables.scm
and at the prompt type
festival> (cummulate-pairs "oald.train")
with your train file. This will print out each lexical entry that couldn't be
aligned with the current set of allowables. At the start this will be every
entry. Looking at these entries add to the allowables to make alignment work.
For example if the following word fails
("abate" nil (ah b ey t))
Add ah to the allowables for letter a, b to letter b, ey to letter a and t
to letter t. After doing that, restart festival and call cummulate-pairs
again. Incrementally add to the allowable pairs until the number of failures
becomes acceptable. Often there are entries for which there is no real
relationship between the letters and the pronunciation, such as in
abbreviations and foreign words (e.g. "aaa" as "t r ih p ax l ey"). For the
lexicons I've used this technique on, less than 10 per thousand fail in this
way.
It is worthwhile being consistent in defining your set of allowables. (At
least) two mappings are possible for the letter sequence ch: letter c goes
to phone ch and letter h goes to _epsilon_, or letter c goes to _epsilon_
and letter h goes to ch. However only one should be allowed; we preferred c
going to ch.
It may also be the case that some letters give rise to more than one phone.
For example the letter x in English is often pronounced as the phone
combination k and s. To allow this, use the multiphone k-s. The multiphone
k-s will then be predicted for x in some contexts, and the model will
separate it into two phones while also ignoring any predicted _epsilon_s.
Note that multiphone units are relatively rare but do occur. In English,
letter x gives rise to a few: k-s in taxi, g-z in example, and sometimes
g-zh and k-sh in luxury. Others are w-ah in one, t-s in pizza, y-uw in new
(British), ah-m in -ism etc. Three-phone multiphones are much rarer but may
exist; they are not supported by this code as is, but such entries should
probably be ignored. Note the - sign in the multiphone examples is
significant and is used to identify multiphones.
The allowables for OALD end up being
(set! allowables
'
((a _epsilon_ ei aa a e@ @ oo au o i ou ai uh e)
(b _epsilon_ b )
(c _epsilon_ k s ch sh @-k s t-s)
(d _epsilon_ d dh t jh)
(e _epsilon_ @ ii e e@ i @@ i@ uu y-uu ou ei aa oi y y-u@ o)
(f _epsilon_ f v )
(g _epsilon_ g jh zh th f ng k t)
(h _epsilon_ h @ )
(i _epsilon_ i@ i @ ii ai @@ y ai-@ aa a)
(j _epsilon_ h zh jh i y )
(k _epsilon_ k ch )
(l _epsilon_ l @-l l-l)
(m _epsilon_ m @-m n)
(n _epsilon_ n ng n-y )
(o _epsilon_ @ ou o oo uu u au oi i @@ e uh w u@ w-uh y-@)
(p _epsilon_ f p v )
(q _epsilon_ k )
(r _epsilon_ r @@ @-r)
(s _epsilon_ z s sh zh )
(t _epsilon_ t th sh dh ch d )
(u _epsilon_ uu @ w @@ u uh y-uu u@ y-u@ y-u i y-uh y-@ e)
(v _epsilon_ v f )
(w _epsilon_ w uu v f u)
(x _epsilon_ k-s g-z sh z k-sh z g-zh )
(y _epsilon_ i ii i@ ai uh y @ ai-@)
(z _epsilon_ z t-s s zh )
(# #)
))
Note this is an exhaustive list and (deliberately) says nothing about the
contexts or frequency that these letter to phone pairs appear. That
information will be generated automatically from the training set.
Once the number of failed matches is sufficiently low, let cummulate-pairs
run to completion. This counts the number of times each letter/phone pair
occurs in allowable alignments.
Next call
festival> (save-table "oald-")
with the name of your lexicon. This converts the table of cumulated counts
into probabilities and saves it.
Restart festival loading this new table
festival allowables.scm oald-pl-table.scm
Now each word can be aligned to an equal-length string of phones, _epsilon_s
and multiphones.
festival> (aligndata "oald.train" "oald.train.align")
Do this also for your test set.
This will produce entries like
aaronson _epsilon_ aa r ah n s ah n
abandon ah b ae n d ah n
abate ah b ey t _epsilon_
abbe ae b _epsilon_ iy
The next stage is to build features suitable for 'wagon' to build models.
This is done by
festival> (build-feat-file "oald.train.align" "oald.train.feats")
Again the same for the test set.
Now you need to construct a description file for 'wagon' for the given data.
This can be done using the script 'make_wgn_desc' provided with the speech
tools.
Here is an example script for building the models. You will need to modify
it for your particular database, but it shows the basic process.
for i in a b c d e f g h i j k l m n o p q r s t u v w x y z
do
# Stop value for wagon
STOP=2
echo letter $i STOP $STOP
# Find training set for letter $i
cat oald.train.feats |
awk '{if ($6 == "'$i'") print $0}' >ltsdataTRAIN.$i.feats
# split training set to get heldout data for stepwise testing
traintest ltsdataTRAIN.$i.feats
# Extract test data for letter $i
cat oald.test.feats |
awk '{if ($6 == "'$i'") print $0}' >ltsdataTEST.$i.feats
# run wagon to predict model
wagon -data ltsdataTRAIN.$i.feats.train -test ltsdataTRAIN.$i.feats.test \
-stepwise -desc ltsOALD.desc -stop $STOP -output lts.$i.tree
# Test the resulting tree against the test set for letter $i
wagon_test -heap 2000000 -data ltsdataTEST.$i.feats -desc ltsOALD.desc \
-tree lts.$i.tree
done
The script 'traintest' splits the given file 'X' into 'X.train' and 'X.test'
with every tenth line in 'X.test' and the rest in 'X.train'.
This script can take a significant amount of time to run, about 6 hours on a
Sun Ultra 140.
Once the models are created they must be collected together into a single
list structure. The trees generated by 'wagon' contain full probability
distributions at each leaf; at this stage this information can be removed,
as only the most probable phone will actually be predicted. This
substantially reduces the size of the trees.
(merge_models 'oald_lts_rules "oald_lts_rules.scm")
(merge_models is defined within 'lts_build.scm'.) The given file will
contain a set! of the given variable name to an assoc list of letter to
trained tree. Note the above function naively assumes that the letters in
the alphabet are the 26 lower case letters of the English alphabet; you will
need to edit it to add accented letters if required. Note that adding "'"
(single quote) as a letter is a little tricky in Scheme but can be
done---the command (intern "'") will give you the symbol for single quote.
To test a set of lts models load the saved model and call the following
function with the test align file
festival oald-table.scm oald_lts_rules.scm
festival> (lts_testset "oald.test.align" oald_lts_rules)
The result (after showing all the failed entries) will be a table showing
the results for each letter, for all letters, and for complete words. The
failed entries may give some notion of how good or bad the result is.
Sometimes it will be simple vowel differences, long versus short, schwa
versus full vowel; other times whole consonants may be missing. Remember the
ultimate measure of the quality of the letter to sound rules is how adequate
they are at providing acceptable pronunciations rather than how good the
numeric score is.
For some languages (e.g. English) it is necessary to also find a stress
pattern for unknown words. Ultimately for this to work well you need to know
the morphological decomposition of the word. At present we provide a
CART-trained system to predict stress patterns for English. It does get
94.6% correct for an unseen test set, but that isn't really very good. Later
tests suggest that predicting stressed and unstressed phones directly is
actually better for getting whole words correct, even though the models do
slightly worse on a per-phone basis black98.
As the lexicon may be a large part of the system, we have also experimented
with removing entries from the lexicon if the letter to sound rule system
(and stress assignment system) can correctly predict them. For OALD this
allows us to halve the size of the lexicon; it could possibly allow more if
a certain amount of fuzzy acceptance were allowed (e.g. with schwa). For
other languages the gain here can be very significant: for German and French
we can reduce the lexicon by over 90%. The function reduce_lexicon in
'festival/lib/lts_build.scm' was used to do this. The use of the above
technique as a dictionary compression method is discussed in pagel98. A
morphological decomposition algorithm, like that described in black91, may
help even more.
The technique described in this section, and its relative merits with
respect to a number of languages/lexicons and tasks, is discussed more fully
in black98.
ΓòÉΓòÉΓòÉ 14.6. Lexicon requirements ΓòÉΓòÉΓòÉ
For English there are a number of assumptions made about the lexicon which are
worthy of explicit mention. If you are basically going to use the existing
token rules you should try to include at least the following in any lexicon
that is to work with them.
The letters of the alphabet: when a token is identified as an acronym it
is spelled out. The tokenization assumes that the individual letters of
the alphabet are in the lexicon with their pronunciations. They should
be identified as nouns. (This is to distinguish a as a determiner, which
can be schwa'd, from a as a letter, which cannot.) The part of speech
should be nn by default, but the value of the variable token.letter_pos
is used and may be changed if this is not what is required.
One-character symbols such as dollar, at-sign, percent etc. It's
difficult to get a complete list, and to know what the pronunciation of
some of these is (e.g. hash or pound sign). But the letter to sound
rules cannot deal with them so they need to be explicitly listed. See
the list in the function mrpa_addend in
'festival/lib/dicts/oald/oaldlex.scm'. This list should also contain the
control characters and eight bit characters.
The possessive 's should be in your lexicon as schwa plus voiced fricative
(z). It should be in twice: once with part of speech type pos and once as
n (used in plurals of numbers, acronyms etc., e.g. 1950's). 's is treated
as a word and is separated from the tokens it appears with. The
post-lexical rule (the function postlex_apos_s_check) will delete the
schwa and devoice the z in appropriate contexts. Note this post-lexical
rule brazenly assumes that the unvoiced fricative in the phoneset is s.
If it is not in your phoneset, copy the function (it is in
'festival/lib/postlex.scm'), change it for your phoneset and use your
version as a post-lexical rule.
Numbers as digits (e.g. "1", "2", "34", etc.) should normally not be in
the lexicon. The number conversion routines convert numbers to words
(i.e. "one", "two", "thirty four", etc.).
The word "unknown" or whatever is in the variable
token.unknown_word_name. This is used in a few obscure cases when there
just isn't anything that can be said (e.g. single characters which aren't
in the lexicon). Some people have suggested it should be possible to
make this a sound rather than a word. I agree, but Festival doesn't
support that yet.
ΓòÉΓòÉΓòÉ 14.7. Available lexicons ΓòÉΓòÉΓòÉ
Currently Festival supports a number of different lexicons. They are all
defined in the file 'lib/lexicons.scm', each with a number of common extra
words added to their addenda. They are
'CUVOALD' The Computer Users Version of Oxford Advanced Learner's Dictionary
is available from the Oxford Text Archive
ftp://ota.ox.ac.uk/pub/ota/public/dicts/710. It contains about
70,000 entries and is a part of the BEEP lexicon. It is more
consistent in its marking of stress though its syllable marking is
not what works best for our synthesis methods. Many syllabic 'l's,
'n's and 'm's mess up the syllabification algorithm, making
results sometimes appear over-reduced. It is however our current
default lexicon. It is also the only lexicon with part of speech
tags that can be distributed (for non-commercial use).
'CMU' This is automatically constructed from 'cmu_dict-0.4' available from
many places on the net (see comp.speech archives). It is not in the
mrpa phone set because it is American English pronunciation.
Although mappings exist between its phoneset ('darpa') and 'mrpa'
the results for British English speakers are not very good. However
this is probably the biggest, most carefully specified lexicon
available. It contains just under 100,000 entries. Our
distribution has been modified to include part of speech tags on
words we know to be homographs.
'mrpa' A version of the CSTR lexicon which has been floating about for
years. It contains about 25,000 entries. A new updated free version
of this is due to be released soon.
'BEEP' A British English rival for the 'cmu_lex'. BEEP has been made
available by Tony Robinson at Cambridge and is available in many
archives. It contains 163,000 entries and has been converted to the
'mrpa' phoneset (which was a trivial mapping). Although large, it
suffers from a certain randomness in its stress markings, making use
of it for synthesis dubious.
All of the above lexicons have some distribution restrictions (though mostly
pretty light), but as they are mostly freely available we provide programs
that can convert the originals into Festival's format.
The MOBY lexicon has recently been released into the public domain and will be
converted into our format soon.
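At run time one of these lexicons may be chosen with lex.select (voice setup
normally does this for you). A minimal sketch, assuming the lexicon names
defined in 'lib/lexicons.scm':

```
;; Select the CMU (American English) lexicon as the current lexicon,
;; then look a word up.  lex.lookup returns an entry of the form
;; (WORD POS (SYLLABLES ...)).
(lex.select "cmu")
(lex.lookup "project" 'n)
```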
ΓòÉΓòÉΓòÉ 14.8. Post-lexical rules ΓòÉΓòÉΓòÉ
It is the lexicon's job to produce a pronunciation of a given word. However in
most languages the most natural pronunciation of a word cannot be found in
isolation from the context in which it is to be spoken. This includes such
phenomena as reduction, phrase final devoicing and r-insertion. In Festival
this is done by post-lexical rules.
PostLex is a module which is run after accent assignment but before duration
and F0 generation. This is because knowledge of accent position is necessary
for vowel reduction and other post lexical phenomena and changing the
segmental items will affect durations.
The PostLex module first applies a set of built-in rules (which could be done
in Scheme but for historical reasons are still in C++). It then applies the
functions set in the hook postlex_rules_hook. These should be a set of
functions that take an utterance and apply appropriate rules. This should be
set up on a per voice basis.
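Setting the hook up on a per-voice basis might look like the following sketch.
Here my_reduce_vowels is a hypothetical user function, while
postlex_apos_s_check is the standard rule distributed in
'festival/lib/postlex.scm':

```
;; A post-lexical rule function takes and returns an utterance.
(define (my_reduce_vowels utt)
  "Hypothetical example: mark every schwa segment as reduced."
  (mapcar
   (lambda (seg)
     (if (string-equal "@" (item.name seg))
         (item.set_feat seg "reduced" 1)))
   (utt.relation.items utt 'Segment))
  utt)

;; Typically done inside a voice definition function:
(set! postlex_rules_hook (list postlex_apos_s_check my_reduce_vowels))
```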
Although a rule system could be devised for post-lexical sound rules it is
unclear what the scope of them should be, so we have left it completely open.
Our vowel reduction model uses a CART decision tree to predict which syllables
should be reduced, while the "'s" rule is very simple (shown in
'festival/lib/postlex.scm').
The 's in English may be pronounced in a number of different ways depending on
the preceding context. If the preceding consonant is a fricative or
affricate, and not a palatal, labio-dental, or dental, a schwa is required
(e.g. "bench's"); otherwise no schwa is required (e.g. "John's"). Also, if the
previous phoneme is unvoiced the "s" is rendered as an "s", while in all other
cases it is rendered as a "z".
For our English voices we have a lexical entry for "'s" as a schwa followed by
a "z". We use a post lexical rule function called postlex_apos_s_check to
modify the basic given form when required. After lexical lookup the segment
relation contains the concatenation of segments directly from lookup in the
lexicon. Post lexical rules are applied after that.
In the following rule we check each segment to see if it is part of a word
labelled "'s". If so, we check whether we are currently looking at the schwa
or the z part, and test if modification is required.
(define (postlex_apos_s_check utt)
  "(postlex_apos_s_check UTT)
Deal with possesive s for English (American and British).  Delete
schwa of 's if previous is not a fricative or affricative, and
change voiced to unvoiced s if previous is not voiced."
  (mapcar
   (lambda (seg)
     (if (string-equal "'s" (item.feat
                             seg "R:SylStructure.parent.parent.name"))
         (if (string-equal "a" (item.feat seg 'ph_vlng))
             (if (and (member_string (item.feat seg 'p.ph_ctype)
                                     '(f a))
                      (not (member_string
                            (item.feat seg "p.ph_cplace")
                            '(d b g))))
                 t ;; don't delete schwa
                 (item.delete seg))
             (if (string-equal "-" (item.feat seg "p.ph_cvox"))
                 (item.set_name seg "s"))))) ;; from "z"
   (utt.relation.items utt 'Segment))
  utt)
ΓòÉΓòÉΓòÉ 15. Utterances ΓòÉΓòÉΓòÉ
The utterance structure lies at the heart of Festival. This chapter describes
its basic form and the functions available to manipulate it.
Utterance structure internal structure of utterances
Utterance types Type defined synthesis actions
Example utterance types Some example utterances
Utterance modules
Accessing an utterance getting the data from the structure
Features Features and features names
Utterance I/O Saving and loading utterances
ΓòÉΓòÉΓòÉ 15.1. Utterance structure ΓòÉΓòÉΓòÉ
Festival's basic object for synthesis is the utterance. An utterance
represents some chunk of text that is to be rendered as speech. In general you
may think of it as a sentence, but in many cases it won't actually conform to
the standard linguistic syntactic form of a sentence. In general the process
of text to speech is to take an utterance which contains a simple string of
characters and convert it step by step, filling out the utterance structure
with more information until a waveform is built that says what the text
contains.
The processes involved in conversion are, in general, as follows
Tokenization Converting the string of characters into a list of tokens.
         Typically this means whitespace separated tokens of the original
         text string.
Token identification Identification of general types for the tokens. Usually
         this is trivial, but it requires some work to identify tokens of
         digits as years, dates, numbers etc.
Token to word Convert each token to zero or more words, expanding numbers,
         abbreviations etc.
Part of speech Identify the syntactic part of speech for the words.
Prosodic phrasing Chunk the utterance into prosodic phrases.
Lexical lookup Find the pronunciation of each word from a lexicon/letter to
         sound rule system, including phonetic and syllable structure.
Intonational accents Assign intonation accents to appropriate syllables.
Assign duration Assign a duration to each phone in the utterance.
Generate F0 contour (tune) Generate the tune based on accents etc.
Render waveform Render the waveform from phones, durations and F0 target
         values; this itself may take several steps including unit selection
         (be they diphones or other sized units), imposition of desired
         prosody (duration and F0) and waveform reconstruction.
The number of steps and what actually happens may vary and is dependent on the
particular voice selected and the utterance's type, see below.
Each of these steps in Festival is achieved by a module which will typically
add new information to the utterance structure.
An utterance structure consists of a set of items which may be part of one or
more relations. Items represent things like words and phones, though may also
be used to represent less concrete objects like noun phrases, and nodes in
metrical trees. An item contains a set of features, (name and value).
Relations are typically simple lists of items or trees of items. For example
the Word relation is a simple list of items, each of which represents a word
in the utterance. Those words will also be in other relations, such as the
SylStructure relation, where the word will be the top of a tree structure
containing its syllables and segments.
Unlike previous versions of the system, items (then called stream items) are
not tied to any particular relation (or stream); they are merely part of the
relations they are within. Importantly this allows much more general
relations to be made over items than was allowed in the previous system. This
new architecture is the continuation of our goal of providing a general,
efficient structure for representing complex interrelated utterance objects.
The architecture is fully general and new items and relations may be defined
at run time, such that new modules may use any relations they wish. However
within our standard English (and other) voices we have used a specific set of
relations, as follows.
Token a list of trees. This is first formed as a list of tokens found in
a character text string. Each root's daughters are the Word's that
the token is related to.
Word a list of words. These items will also appear as daughters (leaf
nodes) of the Token relation. They may also appear in the Syntax
relation (as leafs) if the parser is used. They will also be leafs
of the Phrase relation.
Phrase a list of trees. This is a list of phrase roots whose daughters are
the Word's within those phrases.
Syntax a single tree. This, if the probabilistic parser is called, is a
syntactic binary branching tree over the members of the Word
relation.
SylStructure a list of trees. This links the Word, Syllable and Segment
relations. Each Word is the root of a tree whose immediate
daughters are its syllables and their daughters in turn as its
segments.
Syllable a list of syllables. Each member will also be in the SylStructure
relation. In that relation its parent will be the word it is in and
its daughters will be the segments that are in it. Syllables are
also in the Intonation relation giving links to their related
intonation events.
Segment a list of segments (phones). Each member (except silences) will be
leaf nodes in the SylStructure relation. These may also be in the
Target relation linking them to F0 target points.
IntEvent a list of intonation events (accents and boundaries). These are
related to syllables through the Intonation relation as leafs on
that relation. Thus their parent in the Intonation relation is the
syllable these events are attached to.
Intonation a list of trees relating syllables to intonation events. Roots of
the trees in Intonation are Syllables and their daughters are
IntEvents.
Wave a single item with a feature called wave whose value is the
generated waveform.
This is a non-exhaustive list; some modules may add other relations, and not
all utterances will have all these relations, but the above is the general case.
ΓòÉΓòÉΓòÉ 15.2. Utterance types ΓòÉΓòÉΓòÉ
The primary purpose of utterance types is to define which modules are to be
applied to an utterance. UttTypes are defined in 'lib/synthesis.scm'. The
function defUttType defines which modules are to be applied to an utterance of
that type. The function utt.synth applies this list of modules to an
utterance before waveform synthesis is called.
For example when a Segment type Utterance is synthesized it needs only have
its values loaded into a Segment relation and a Target relation, then the low
level waveform synthesis module Wave_Synth is called. This is defined as
follows
(defUttType Segments
(Initialize utt)
(Wave_Synth utt))
A more complex type is Text type utterance which requires many more modules to
be called before a waveform can be synthesized
(defUttType Text
(Initialize utt)
(Text utt)
(Token utt)
(POS utt)
(Phrasify utt)
(Word utt)
(Intonation utt)
(Duration utt)
(Int_Targets utt)
(Wave_Synth utt)
)
The Initialize module should normally be called for all types. It loads the
necessary relations from the input form and deletes all other relations (if
any exist) ready for synthesis.
Modules may be directly defined as C/C++ functions and declared with a Lisp
name or simple functions in Lisp that check some global parameter before
calling a specific module (e.g. choosing between different intonation
modules).
These types are used when calling the function utt.synth and individual
modules may be called explicitly by hand if required.
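For example, from the Festival command interpreter you can build an utterance
and synthesize it with utt.synth, or step through modules by hand. A sketch,
using the module names from the Text type above:

```
(set! utt1 (Utterance Text "Hello world."))
(utt.synth utt1)   ;; applies all modules for the Text type
(utt.play utt1)    ;; play the resulting waveform

;; Or apply modules explicitly, one at a time, e.g. for debugging:
(set! utt2 (Utterance Text "Hello again."))
(Initialize utt2)
(Text utt2)
(Token utt2)
;; ... the remaining modules may be applied similarly
```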
Because we expect waveform synthesis methods themselves to become complex,
with a defined set of functions to select, join, and modify units, we now
support an additional notion of SynthTypes. Like UttTypes these define a set
of functions to apply to an utterance. These may be defined using the
defSynthType function. For example
For example
(defSynthType Festival
(print "synth method Festival")
(print "select")
(simple_diphone_select utt)
(print "join")
(cut_unit_join utt)
(print "impose")
(simple_impose utt)
(simple_power utt)
(print "synthesis")
(frames_lpc_synthesis utt)
)
A SynthType is selected by naming it as the value of the parameter
Synth_Method.
During the application of the function utt.synth three hooks are applied.
This allows additional control of the synthesis process.
before_synth_hooks is applied before any modules are applied.
after_analysis_hooks is applied at the start of Wave_Synth when all text,
linguistic and prosodic processing have been done. after_synth_hooks is
applied after all modules have been applied. These are useful for things such
as altering the volume of a voice that happens to be quieter than others, or
outputting information for a talking head before waveform synthesis occurs so
that preparation of the facial frames and synthesizing the waveform may be
done in parallel. (See 'festival/examples/th-mode.scm' for an example use
of these hooks for a talking head text mode.)
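A minimal sketch of using these hooks, assuming a voice that is quieter than
the others (utt.wave.rescale is the standard gain-changing function described
under Utterance I/O):

```
;; Boost the output volume of every synthesized utterance by 40%.
(define (my_boost_volume utt)
  (utt.wave.rescale utt 1.4))

(set! after_synth_hooks (list my_boost_volume))
```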
ΓòÉΓòÉΓòÉ 15.3. Example utterance types ΓòÉΓòÉΓòÉ
A number of utterance types are currently supported. It is easy to add new
ones but the standard distribution includes the following.
Text Raw text as a string.
(Utterance Text "This is an example")
Words A list of words
(Utterance Words (this is an example))
Words may be atomic or lists if further features need to be specified. For
example to specify a word and its part of speech you can use
(Utterance Words (I (live (pos v)) in (Reading (pos n) (tone H-H%))))
Note: the use of the tone feature requires an intonation mode that supports
it.
Any feature and value named in the input will be added to the Word item.
Phrase This allows explicit phrasing and features on Tokens to be
         specified. The input consists of a list of phrases, each containing
         a list of tokens.
(Utterance
Phrase
((Phrase ((name B))
I saw the man
(in ((EMPH 1)))
the park)
(Phrase ((name BB))
with the telescope)))
ToBI tones and accents may also be specified on Tokens but these will only
take effect if the selected intonation method uses them.
Segments This allows specification of segments, durations and F0 target
values.
(Utterance
Segments
((# 0.19 )
(h 0.055 (0 115))
(@ 0.037 (0.018 136))
(l 0.064 )
(ou 0.208 (0.0 134) (0.100 135) (0.208 123))
(# 0.19)))
Note the times are in seconds NOT milliseconds. The format of each segment
entry is: segment name, duration in seconds, and a list of target values.
Each target value consists of a pair of a point into the segment (in seconds)
and an F0 value in Hz.
Phones This allows a simple specification of a list of phones. Synthesis
specifies fixed durations (specified in FP_duration, default 100 ms)
and monotone intonation (specified in FP_F0, default 120Hz). This
may be used for simple checks for waveform synthesizers etc.
(Utterance Phones (# h @ l ou #))
Note the function SayPhones allows synthesis and playing of lists of phones
through this utterance type.
Wave A waveform file. Synthesis here simply involves loading the file.
(Utterance Wave fred.wav)
Others are supported, as defined in 'lib/synthesis.scm' but are used
internally by various parts of the system. These include Tokens used in TTS
and SegF0 used by utt.resynth.
ΓòÉΓòÉΓòÉ 15.4. Utterance modules ΓòÉΓòÉΓòÉ
The module is the basic unit that does the work of synthesis. Within Festival
there are duration modules, intonation modules, wave synthesis modules etc.
As stated above the utterance type defines the set of modules which are to be
applied to the utterance. These modules in turn will create relations and
items so that ultimately a waveform is generated, if required.
Many of the chapters in this manual are solely concerned with particular
modules in the system. Note that many modules have internal choices, such as
which duration method to use or which intonation method to use. Such general
choices are often done through the Parameter system. Parameters may be set
for different features like Duration_Method, Synth_Method etc. Formerly the
values for these parameters were atomic values but now they may be the
functions themselves. For example, to select the Klatt duration rules
(Parameter.set 'Duration_Method Duration_Klatt)
This allows new modules to be added without requiring changes to the central
Lisp functions such as Duration, Intonation, and Wave_Synth.
ΓòÉΓòÉΓòÉ 15.5. Accessing an utterance ΓòÉΓòÉΓòÉ
There are a number of standard functions that allow one to access parts of an
utterance and traverse through it.
Functions exist in Lisp (and of course C++) for accessing an utterance. The
Lisp access functions are
'(utt.relationnames UTT)' returns a list of the names of the relations
currently created in UTT.
'(utt.relation.items UTT RELATIONNAME)'returns a list of all items in
         RELATIONNAME in UTT. This is nil if no relation of that name
         exists. Note that for tree relations this will give the items in
         pre-order.
'(utt.relation_tree UTT RELATIONNAME)'A Lisp tree presentation of the items
         in RELATIONNAME in UTT. The Lisp bracketing reflects the tree
         structure in the relation.
'(utt.relation.leafs UTT RELATIONNAME)'A list of all the leafs of the items in
RELATIONNAME in UTT. Leafs are defined as those items with no
daughters within that relation. For simple list relations
utt.relation.leafs and utt.relation.items will return the same
thing.
'(utt.relation.first UTT RELATIONNAME)'returns the first item in RELATIONNAME.
Returns nil if this relation contains no items
'(utt.relation.last UTT RELATIONNAME)'returns the last (the most next) item in
RELATIONNAME. Returns nil if this relation contains no items
'(item.feat ITEM FEATNAME)'returns the value of feature FEATNAME in ITEM.
         FEATNAME may be a feature name, feature function name, or pathname
         (see below), allowing reference to other parts of the utterance this
         item is in.
'(item.features ITEM)'Returns an assoc list of feature-value pairs of all
local features on this item.
'(item.name ITEM)'Returns the name of this ITEM. This could also be accessed
as (item.feat ITEM 'name).
'(item.set_name ITEM NEWNAME)'Sets name on ITEM to be NEWNAME. This is
equivalent to (item.set_feat ITEM 'name NEWNAME)
'(item.set_feat ITEM FEATNAME FEATVALUE)'set the value of FEATNAME to
FEATVALUE in ITEM. FEATNAME should be a simple name and not refer to
next, previous or other relations via links.
'(item.relation ITEM RELATIONNAME)'Return the item as viewed from
RELATIONNAME, or nil if ITEM is not in that relation.
'(item.relationnames ITEM)'Return a list of relation names that this item is
in.
'(item.relationname ITEM)'Return the relation name that this item is currently
being viewed as.
'(item.next ITEM)'Return the next item in ITEM's current relation, or nil if
there is no next.
'(item.prev ITEM)'Return the previous item in ITEM's current relation, or nil
if there is no previous.
'(item.parent ITEM)'Return the parent of ITEM in ITEM's current relation, or
nil if there is no parent.
'(item.daughter1 ITEM)'Return the first daughter of ITEM in ITEM's current
relation, or nil if there are no daughters.
'(item.daughter2 ITEM)'Return the second daughter of ITEM in ITEM's current
relation, or nil if there is no second daughter.
'(item.daughtern ITEM)'Return the last daughter of ITEM in ITEM's current
relation, or nil if there are no daughters.
'(item.leafs ITEM)'Return a list of all leaf items (those with no daughters)
         dominated by this item.
'(item.next_leaf ITEM)'Find the next item in this relation that has no
daughters. Note this may traverse up the tree from this point to
search for such an item.
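As a sketch of how these access functions combine, the following walks the
Segment relation and prints each segment's name together with the word it
belongs to, using a feature pathname to climb through the SylStructure
relation (silences have no parent word, so the path simply yields "0" for
them):

```
(define (print_segments_and_words utt)
  "Print each segment name and, where present, its containing word."
  (mapcar
   (lambda (seg)
     (format t "%s\t%s\n"
             (item.name seg)
             (item.feat seg "R:SylStructure.parent.parent.name")))
   (utt.relation.items utt 'Segment))
  t)
```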
As from 1.2 the utterance structure may be fully manipulated from Scheme.
Relations and items may be created and deleted as easily as they can in C++.
'(utt.relation.present UTT RELATIONNAME)' returns t if relation named
RELATIONNAME is present, nil otherwise.
'(utt.relation.create UTT RELATIONNAME)'Creates a new relation called
         RELATIONNAME. If this relation already exists it is deleted first
         and items in the relation are dereferenced from it (deleting the
         items if they are no longer referenced by any relation). Thus create
         relation guarantees an empty relation.
'(utt.relation.delete UTT RELATIONNAME)'Deletes the relation called
         RELATIONNAME in utt. All items in that relation are dereferenced
         from the relation and if they are no longer in any relation the
         items themselves are deleted.
'(utt.relation.append UTT RELATIONNAME ITEM)'Append ITEM to the end of the
         relation named RELATIONNAME in UTT. Returns nil if there is no
         relation named RELATIONNAME in UTT, otherwise returns the item
         appended. This new item becomes the last in the top list. ITEM may
         be an item itself (in this or another relation) or a Lisp
         description of an item, which consists of a list containing a name
         and a set of feature value pairs. If ITEM is nil or unspecified a
         new empty item is added. If ITEM is already in this relation it is
         dereferenced from its current position (and an empty item
         re-inserted).
'(item.insert ITEM1 ITEM2 DIRECTION)'Insert ITEM2 into ITEM1's relation in the
         direction specified by DIRECTION. DIRECTION may take the values
         before, after, above and below. If unspecified, after is assumed.
         Note it is not recommended to insert above or below; the functions
         item.insert_parent and item.append_daughter should normally be used
         for tree building. Inserting using before and after within daughters
         is perfectly safe.
'(item.append_daughter PARENT DAUGHTER)'Append DAUGHTER, an item or a
description of an item to the item PARENT in the PARENT's relation.
'(item.insert_parent DAUGHTER NEWPARENT)'Insert a new parent above DAUGHTER.
NEWPARENT may be a item or the description of an item.
'(item.delete ITEM)'Delete this item from all relations it is in. All
         daughters of this item in each relation are also removed from the
         relation (which may in turn cause them to be deleted if they cease
         to be referenced by any other relation).
'(item.relation.remove ITEM)'Remove this item from this relation, along with
         any of its daughters. Other relations this item is in remain
         untouched.
'(item.move_tree FROM TO)'Move the item FROM to the position of TO in TO's
         relation. FROM will often be in the same relation as TO but that
         isn't necessary. The contents of TO are dereferenced: its daughters
         are saved, then the descendants of FROM are recreated under the new
         TO, then TO's previous daughters are dereferenced. The order of this
         is important as FROM may be part of TO's descendants. Note that if
         TO is part of FROM's descendants no moving occurs and nil is
         returned.
For example to remove all punctuation terminal nodes in the Syntax
relation the call would be something like
(define (syntax_remove_punc p)
  (if (string-equal "punc" (item.feat (item.daughter2 p) "pos"))
      (item.move_tree (item.daughter1 p) p)
      (mapcar syntax_remove_punc (item.daughters p))))
'(item.exchange_trees ITEM1 ITEM2)'Exchange ITEM1 and ITEM2 and their
         descendants in ITEM2's relation. If ITEM1 is within ITEM2's
         descendants, or vice versa, nil is returned and no exchange takes
         place. If ITEM1 is not in ITEM2's relation, no exchange takes
         place.
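Putting the creation functions together, a small tree can be built from
scratch as in this sketch. The relation name NP, the item names and the
feature values are purely illustrative:

```
;; Create an empty utterance and a new relation within it.
(set! utt1 (Utterance Text nil))
(utt.relation.create utt1 'NP)

;; Append a root item, described as a (name (features ...)) list,
;; then hang two daughters off it.
(set! root (utt.relation.append utt1 'NP '("np" ((cat "NP")))))
(item.append_daughter root '("the" ((cat "det"))))
(item.append_daughter root '("man" ((cat "noun"))))
```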
Daughters of a node are actually represented as a list whose first daughter is
doubly linked to the parent. Although being aware of this structure may be
useful, it is recommended that all access go through the tree specific
functions *.parent and *.daughter*, which properly deal with the structure;
thus if the internal structure ever changes in the future only these tree
access functions need be updated.
With the above functions quite elaborate utterance manipulations can be
performed. For example in post-lexical rules where modifications to the
segments are required based on the words and their context. See Post-lexical
rules for an example of using various utterance access functions.
ΓòÉΓòÉΓòÉ 15.6. Features ΓòÉΓòÉΓòÉ
In previous versions items had a number of predefined features. This is no
longer the case and all features are optional. In particular the start and
end features are no longer fixed, though those names are still used in the
relations where they are appropriate. Specific functions are provided for the
name feature but they are just shorthand for normal feature access. Simple
features directly access the features in the underlying EST_Feature class in
an item.
In addition to simple features there is a mechanism for relating functions to
names; thus accessing a feature may actually call a function. For example the
feature num_syls is defined as a feature function which will count the number
of syllables in the given word, rather than simply access a pre-existing
feature. Feature functions are usually dependent on the particular relation
the item is in, e.g. some feature functions are only appropriate for items in
the Word relation, or only appropriate for those in the IntEvent relation.
The third aspect of feature names is a path component. These are parts of the
name (each ending in a '.') that indicate some traversal of the utterance
structure. For example the feature name will access the name feature on the
given item. The feature n.name will return the name feature on the next item
(in that item's relation). A number of basic direction operators are defined.
n.        next
p.        previous
nn.       next next
pp.       previous previous
parent.   parent
daughter1. first daughter
daughter2. second daughter
daughtern. last daughter
first.    most previous item
last.     most next item
Also you may specify traversal into another relation, through the
R:<relationname>. operator. For example given an item in the Syllable relation
R:SylStructure.parent.name would give the name of the word the syllable is in.
Some more complex examples are as follows, assuming we are starting from an
item in the Syllable relation.
'stress' This item's lexical stress
'n.stress'The next syllable's lexical stress
'p.stress'The previous syllable's lexical stress
'R:SylStructure.parent.name'The word this syllable is in
'R:SylStructure.parent.R:Word.n.name'The word next to the word this syllable
is in
'n.R:SylStructure.parent.name'The word the next syllable is in
'R:SylStructure.daughtern.ph_vc'The phonetic feature vc of the final segment
in this syllable.
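Such pathnames may be passed directly to item.feat. A sketch, assuming syl is
an item in the Syllable relation and the feature names of the standard phone
sets:

```
;; The name of the word containing this syllable:
(item.feat syl "R:SylStructure.parent.name")

;; The vowel length feature of this syllable's first segment:
(item.feat syl "R:SylStructure.daughter1.ph_vlng")

;; The lexical stress of the syllable two positions ahead:
(item.feat syl "n.n.stress")
```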
A list of all feature functions is given in an appendix of this document. See
Feature functions. New functions may also be added in Lisp.
In C++ feature values are of class EST_Val, which may be a string, int, or
float (or any arbitrary object). In Scheme this distinction cannot always
be made and sometimes when you expect an int you actually get a string. Care
should be taken to ensure the right matching functions are used in Scheme. It
is recommended you use string-append or string-match as they will always work.
If a pathname does not identify a valid path for the particular item (e.g.
there is no next) "0" is returned.
When collecting data from speech databases it is often useful to collect a
whole set of features from all utterances in a database. These features can
then be used for building various models (both CART tree models and linear
regression models use these feature names).
A number of functions exist to help in this task. For example
(utt.features utt1 'Word '(name pos p.pos n.pos))
will return a list of word, and part of speech context for each word in the
utterance.
See Extracting features for an example of extracting sets of features from a
database for use in building stochastic models.
ΓòÉΓòÉΓòÉ 15.7. Utterance I/O ΓòÉΓòÉΓòÉ
A number of functions are available to allow an utterance's structure to be
made available for other programs.
The whole structure, all relations, items and features, may be saved in an
ascii format using the function utt.save. This file may be reloaded using the
utt.load function. Note the waveform is not saved in this form.
Individual aspects of an utterance may be selectively saved. The waveform
itself may be saved using the function utt.save.wave. This will save the
waveform in the named file in the format specified in the Parameter
Wavefiletype. All formats supported by the Edinburgh Speech Tools are valid
including nist, esps, sun, riff, aiff, raw and ulaw. Note the functions
utt.wave.rescale and utt.wave.resample may be used to change the gain and
sample frequency of the waveform before saving it. A waveform may be imported
into an existing utterance with the function utt.import.wave. This is
specifically designed to allow external methods of waveform synthesis.
However if you just wish to play an external wave or make it into an utterance
you should consider the utterance Wave type.
The segments of an utterance may be saved in a file using the function
utt.save.segs, which saves the segments of the named utterance in xlabel
format. Any other relation may also be saved using the more general
utt.save.relation, which takes the additional argument of a relation name.
The names of each item and the end feature of each item are saved in the named
file, again in xlabel format; other features are saved in extra fields. For
more elaborated saving methods you can easily write a Scheme function to save
data in an utterance in whatever format is required. See the file
'lib/mbrola.scm' for an example.
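For instance, rescaling and writing out the waveform and then saving the
segments and words of an utterance might look like the following sketch (the
filenames here are illustrative):

```
(utt.wave.rescale utt1 0.8)                     ;; reduce the gain by 20%
(utt.save.wave utt1 "example.wav" 'riff)        ;; waveform in RIFF format
(utt.save.segs utt1 "example.segs")             ;; segments in xlabel format
(utt.save.relation utt1 'Word "example.words")  ;; any other relation likewise
```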
A simple function to allow the displaying of an utterance in Entropic's Xwaves
tool is provided by the function display. It simply saves the waveform and the
segments and sends appropriate commands to (the already running) Xwaves and
xlabel programs.
A function to synthesize an externally specified utterance is provided for by
utt.resynth which takes two filename arguments, an xlabel segment file and an
F0 file. This function loads, synthesizes and plays an utterance synthesized
from these files. The loading is provided by the underlying function
utt.load.segf0.
ΓòÉΓòÉΓòÉ 16. Text analysis ΓòÉΓòÉΓòÉ
Tokenizing Splitting text into tokens
Token to word rules
Homograph disambiguation "Wed 5 may wind US Sen up"
ΓòÉΓòÉΓòÉ 16.1. Tokenizing ΓòÉΓòÉΓòÉ
A crucial stage in text processing is the initial tokenization of text. A
token in Festival is an atom separated with whitespace from a text file (or
string). If punctuation for the current language is defined, characters
matching that punctuation are removed from the beginning and end of a token
and held as features of the token. The default list of characters to be
treated as white space is defined as
(defvar token.whitespace " \t\n\r")
While the default set of punctuation characters is
(defvar token.punctuation "\"'`.,:;!?(){}[]")
(defvar token.prepunctuation "\"'`({[")
These are declared in 'lib/token.scm' but may be changed for different
languages, text modes etc.
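For example, a text mode for material in which single quotes should be kept as
part of the token could redefine the punctuation sets as follows (the
particular character sets here are illustrative):

```
;; Treat ' as part of the token rather than as punctuation.
(set! token.punctuation "\"`.,:;!?(){}[]")
(set! token.prepunctuation "\"`({[")
```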
ΓòÉΓòÉΓòÉ 16.2. Token to word rules ΓòÉΓòÉΓòÉ
Tokens are further analysed into lists of words. A word is an atom that can
be given a pronunciation by the lexicon (or letter to sound rules). A token
may give rise to a number of words or none at all.
For example the basic tokens
This pocket-watch was made in 1983.
would give a word relation of
this pocket watch was made in nineteen eighty three
Because the relationship between tokens and words is in some cases complex, a
user function may be specified for translating tokens into words. This is
designed to deal with things like numbers, email addresses, and other
non-obvious pronunciations of tokens as zero or more words. Currently a
builtin function builtin_english_token_to_words offers much of the necessary
functionality for English but a user may further customize this.
If the user defines a function token_to_words which takes two arguments: a
token item and a token name, it will be called by the Token_English and
Token_Any modules. A substantial example is given as english_token_to_words
in 'festival/lib/token.scm'.
An example of this function is in 'lib/token.scm'. It is quite elaborate and
covers most of the common multi-word tokens in English including, numbers,
money symbols, Roman numerals, dates, times, plurals of symbols, number
ranges, telephone number and various other symbols.
Let us look at the treatment of one particular phenomenon which shows the use
of these rules. Consider the expression "$12 million" which should be
rendered as the words "twelve million dollars". Note the word "dollars",
which is introduced by the "$" sign, ends up after the end of the expression.
There are two cases we need to deal with, as there are two tokens. The first
condition in the cond checks if the current token name is a money symbol,
while the second condition checks that the following word is a magnitude
(million, billion, trillion, zillion etc.). If that is the case the "$" is
removed and the remaining numbers are pronounced by calling the builtin token
to word function. The second cond clause deals with the second token. It
confirms that the previous token is a money value (the same regular expression
as before) and then returns the word followed by the word "dollars". If the
token matches neither of these forms the builtin function is called.
(define (token_to_words token name)
  "(token_to_words TOKEN NAME)
Returns a list of words for NAME from TOKEN."
  (cond
   ((and (string-matches name "\\$[0-9,]+\\(\\.[0-9]+\\)?")
         (string-matches (item.feat token "n.name") ".*illion.?"))
    (builtin_english_token_to_words token (string-after name "$")))
   ((and (string-matches (item.feat token "p.name")
                         "\\$[0-9,]+\\(\\.[0-9]+\\)?")
         (string-matches name ".*illion.?"))
    (list
     name
     "dollars"))
   (t
    (builtin_english_token_to_words token name))))
It is valid to make some conditions return no words, though some care should
be taken with that, as punctuation information may no longer be available to
later processing if there are no words related to a token.
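For instance, a clause that silently swallows a token can be written by returning nil. The following is an illustrative sketch only: the regular expression and the idea of dropping bracketed footnote markers are invented here, not taken from 'lib/token.scm'.

```scheme
(define (token_to_words token name)
  "(token_to_words TOKEN NAME)
Returns a list of words for NAME from TOKEN."
  (cond
   ((string-matches name "\\[[0-9]+\\]") ; hypothetical: bracketed footnote markers
    nil)                                 ; return no words at all for this token
   (t
    (builtin_english_token_to_words token name))))
```

Any punctuation carried by the deleted token is lost to later processing, as noted above.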
═══ 16.3. Homograph disambiguation ═══
Not all tokens can be rendered as words easily. Their context may affect the
way they are to be pronounced. For example in the utterance
On May 5 1985, 1985 people moved to Livingston.
the two tokens "1985" should be pronounced differently: the first as a year,
"nineteen eighty five", and the second as a quantity, "one thousand nine
hundred and eighty five". Numbers may also be pronounced as ordinals, as with
the "5" above, which should be "fifth" rather than "five".
Also, the pronunciation of certain words cannot simply be found from their
orthographic form alone. Linguistic part of speech tags help to disambiguate
a large class of homographs, e.g. "lives". A part of speech tagger is
included in Festival and discussed in POS tagging. But even part of speech
isn't sufficient in a number of cases. Words such as "bass", "wind", "bow"
etc cannot be distinguished by part of speech alone; some semantic
information is also required. As full semantic analysis of text is outwith
the realms of Festival's capabilities some other method for disambiguation is
required. Following the work of yarowsky96 we have included a method by which
identified tokens can be further labelled with extra tags to help identify
their type.
Yarowsky uses decision lists to identify different types for homographs.
Decision lists are a restricted form of decision trees which have some
advantages over full trees: they are easier to build, and Yarowsky has shown
them to be adequate for typical homograph resolution.
═══ 16.3.1. Using disambiguators ═══
Festival offers a method for assigning a token_pos feature to each token. It
does so using Yarowsky-type disambiguation techniques. A list of
disambiguators can be provided in the variable token_pos_cart_trees. Each
disambiguator consists of a regular expression and a CART tree (which may be a
decision list as they have the same format). If a token matches the regular
expression the CART tree is applied to the token and the resulting class is
assigned to the token via the feature token_pos. This is done by the
Token_POS module.
For example, the following disambiguator distinguishes "St" (street or
saint) and "Dr" (doctor or drive).
("\\([dD][Rr]\\|[Ss][tT]\\)"
((n.name is 0)
((p.cap is 1)
((street))
((p.name matches "[0-9]*\\(1[sS][tT]\\|2[nN][dD]\\|3[rR][dD]\\|[0-9][tT][hH]\\)")
((street))
((title))))
((punc matches ".*,.*")
((street))
((p.punc matches ".*,.*")
((title))
((n.cap is 0)
((street))
((p.cap is 0)
((p.name matches "[0-9]*\\(1[sS][tT]\\|2[nN][dD]\\|3[rR][dD]\\|[0-9][tT][hH]\\)")
((street))
((title)))
((pp.name matches "[1-9][0-9]+")
((street))
((title)))))))))
Note that these only assign values for the feature token_pos and do nothing
more. You must have a related token to word rule that interprets this feature
value and does the required translation. For example the corresponding token
to word rule for the above disambiguator is
((string-matches name "\\([dD][Rr]\\|[Ss][tT]\\)")
(if (string-equal (item.feat token "token_pos") "street")
(if (string-matches name "[dD][rR]")
(list "drive")
(list "street"))
(if (string-matches name "[dD][rR]")
(list "doctor")
(list "saint"))))
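A disambiguator only takes effect once it has been added to token_pos_cart_trees. A minimal sketch, assuming the tree above has been stored in a variable (st_dr_tree is a name invented here for illustration):

```scheme
;; st_dr_tree is assumed to hold the St/Dr decision list shown above
(set! token_pos_cart_trees
      (cons (list "\\([dD][Rr]\\|[Ss][tT]\\)" st_dr_tree)
            token_pos_cart_trees))
```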
═══ 16.3.2. Building disambiguators ═══
Festival offers some support for building disambiguation trees. The basic
method is to find all occurrences of a homographic token in a large text
database, label each occurrence into classes, extract appropriate context
features for these tokens and finally build a classification tree or decision
list based on the extracted features.
The extraction and building of trees is not yet a fully automated process in
Festival but the file 'festival/examples/toksearch.scm' shows some basic
Scheme code we use for extracting tokens from very large collections of text.
The function extract_tokens does the real work. It reads the given file,
token by token into a token stream. Each token is tested against the desired
tokens and if there is a match the named features are extracted. The token
stream will be extended to provide the necessary context. Note that only some
features will make any sense in this situation. There is only a token
relation so referring to words, syllables etc. is not productive.
In this example databases are identified by a file that lists all the files in
the text databases. Its name is expected to be 'bin/DBNAME.files' where
DBNAME is the name of the database. The file should contain a list of
filenames in the database, e.g. for the Gutenberg texts the file
'bin/Gutenberg.files' contains
gutenberg/etext90/bill11.txt
gutenberg/etext90/const11.txt
gutenberg/etext90/getty11.txt
gutenberg/etext90/jfk11.txt
···
Extracting the tokens is typically done in two passes. The first pass
extracts the context (I've used 5 tokens either side). It extracts the file
name and position, so the token can be identified, together with the word in
context.
Next those examples should be labelled with a small set of classes which
identify the type of the token. For example for a token like "Dr" whether it
is a person's title or a street identifier. Note that hand-labelling can be
laborious, though it is surprising how few tokens of particular types actually
exist in 62 million words.
The next task is to extract the tokens with the features that will best
distinguish the particular token. In our "Dr" case this will involve
punctuation around the token, capitalisation of surrounding tokens etc. After
extracting the distinguishing tokens you must line up the labels with these
extracted features. It would be easier to extract both the context and the
desired features at the same time but experience shows that in labelling, more
appropriate features come to mind that will distinguish classes better and you
don't want to have to label twice.
Once a set of examples consisting of the label and features is created it is
easy to use 'wagon' to create the corresponding decision tree or decision
list. 'wagon' supports both decision trees and decision lists; it may be
worth experimenting to find out which gives the best results on some held-out
test data. It appears that decision trees are typically better, but are often
much larger, and the size does not always justify the sometimes only slightly
better results.
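A typical invocation might look like the following. This is a hedged sketch: the file names are invented, and the exact options should be checked against the 'wagon' documentation in the speech tools.

```shell
# dr.desc describes the features, dr.train.feats holds label+feature vectors
wagon -desc dr.desc -data dr.train.feats -test dr.test.feats \
      -stop 8 -output dr.tree
```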
═══ 17. POS tagging ═══
Part of speech tagging is a fairly well-defined process. Festival includes a
part of speech tagger following the HMM-type taggers as found in the Xerox
tagger and others (e.g. DeRose88). Part of speech tags are assigned, based on
the probability distribution of tags given a word, and from ngrams of tags.
These models are externally specified and a Viterbi decoder is used to assign
part of speech tags at run time.
So far this tagger has only been used for English but there is nothing
language specific about it. The module POS assigns the tags. It accesses the
following variables for parameterization.
pos_lex_name The name of a "lexicon" holding reverse probabilities of words
given a tag (indexed by word). If this is unset or has the value
NIL no part of speech tagging takes place.
pos_ngram_name The name of a loaded ngram model of part of speech tags
(loaded by ngram.load).
pos_p_start_tag The name of the most likely tag before the start of an
utterance. This is typically the tag for sentence final punctuation
marks.
pos_pp_start_tag The name of the most likely tag two before the start of an
utterance. For English this is typically a simple noun, but for other
languages it might be a verb. If the ngram model is bigger than
three this tag is effectively repeated for the previous left
contexts.
pos_map We have found that it is often better to use a rich tagset for
prediction of part of speech tags but that in later use (phrase
breaks and dictionary lookup) a much more constrained tagset is
better. Thus mapping of the predicted tagset to a different tagset
is supported. pos_map should be a list of pairs consisting of a
list of tags to be mapped and the new tag they are to be mapped to.
Note it is important to have the part of speech tagger match the tags used in
later parts of the system, particularly the lexicon. Only two of our lexicons
used so far have (mappable) part of speech labels.
An example of the part of speech tagger for English can be found in
'lib/pos.scm'.
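Putting the parameters together, a voice might set up the tagger along the following lines. This is a sketch only: the lexicon, ngram and tag names below are invented for illustration and are not the contents of 'lib/pos.scm'.

```scheme
(set! pos_lex_name "english_poslex")     ; reverse-probability "lexicon" (invented name)
(set! pos_ngram_name 'english_pos_ngram) ; previously loaded with ngram.load
(set! pos_p_start_tag "punc")            ; e.g. the sentence final punctuation tag
(set! pos_pp_start_tag "nn")             ; e.g. a simple noun for English
(set! pos_map                            ; rich tagset -> reduced tagset
      '(((nn nns nnp nnps) n)
        ((vb vbd vbg vbn vbp vbz) v)))
```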
═══ 18. Phrase breaks ═══
There are two methods for predicting phrase breaks in Festival, one simple and
one sophisticated. These two methods are selected through the parameter
Phrase_Method and phrasing is achieved by the module Phrasify.
The first method is by CART tree. If parameter Phrase_Method is cart_tree,
the CART tree in the variable phrase_cart_tree is applied to each word to see
if a break should be inserted or not. The tree should predict categories BB
(for big break), B (for break) or NB (for no break). A simple example of a
tree to predict phrase breaks is given in the file 'lib/phrase.scm'.
(set! simple_phrase_cart_tree
'
((R:Token.parent.punc in ("?" "." ":"))
((BB))
((R:Token.parent.punc in ("'" "\"" "," ";"))
((B))
((n.name is 0)
((BB))
((NB))))))
The second and more elaborate method of phrase break prediction is used when
the parameter Phrase_Method is prob_models. In this case a probabilistic
model is used, predicting the probability of a break after a word based on
the part of speech of the neighbouring words and the previous word. This is
combined with an ngram model of the distribution of breaks and non-breaks,
using a Viterbi decoder to find the optimal phrasing of the utterance. The
results using this technique are good and even show good results on unseen
data from other researchers' phrase break tests (see black97b). However it
does sometimes sound wrong, suggesting there is still further work required.
Parameters for this module are set through the feature list held in the
variable phr_break_params, an example of which for English is set in
english_phr_break_params in the file 'lib/phrase.scm'. The feature names and
meanings are
pos_ngram_name The name of a loaded ngram that gives probability distributions
of B/NB given previous, current and next part of speech.
pos_ngram_filename The filename containing pos_ngram_name.
break_ngram_name The name of a loaded ngram of B/NB distributions. This is
typically a 6 or 7-gram.
break_ngram_filename The filename containing break_ngram_name.
gram_scale_s A weighting factor for breaks in the break/non-break ngram.
Increasing the value inserts more breaks; reducing it causes fewer
breaks to be inserted.
phrase_type_tree A CART tree that is used to predict the type of break given
the predicted break position. This (rather crude) technique is
currently used to distinguish major and minor breaks.
break_tags A list of the break tags (typically (B NB)).
pos_map A part of speech map used to map the pos feature of words into a
smaller tagset used by the phrase predictor.
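Assembled as a feature list, a setup might look like the following sketch. The model names, filenames and the gram_scale_s value are invented; see english_phr_break_params in 'lib/phrase.scm' for real values.

```scheme
(Parameter.set 'Phrase_Method 'prob_models)
(set! phr_break_params
      (list
       (list 'pos_ngram_name 'pos_break_ngram)              ; invented name
       (list 'pos_ngram_filename "models/pos_break.ngrambin")
       (list 'break_ngram_name 'break_ngram)                ; invented name
       (list 'break_ngram_filename "models/break7.ngrambin")
       (list 'gram_scale_s 0.6)                             ; invented value
       (list 'break_tags '(B NB))))
```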
═══ 19. Intonation ═══
A number of different intonation modules are available with varying levels of
control. In general intonation is generated in two steps.
1. Prediction of accents (and/or end tones) on a per syllable basis.
2. Prediction of F0 target values; this must be done after durations are
predicted.
Reflecting this split there are two main intonation modules that call
sub-modules depending on the desired intonation methods. The Intonation and
Int_Targets modules are defined in Lisp ('lib/intonation.scm') and call
sub-modules which are (so far) in C++.
Default intonation Effectively none at all.
Simple intonation Accents and hats.
Tree intonation Accents and Tones, and F0 prediction by LR
Tilt intonation Using the Tilt intonation model
General intonation A programmable intonation module
Using ToBI A ToBI by rule example
═══ 19.1. Default intonation ═══
This is the simplest form of intonation and offers the modules
Intonation_Default and Intonation_Targets_Default, the first of which
actually does nothing at all. Intonation_Targets_Default simply creates a
target at the start of the utterance and one at the end, whose values by
default are 130 Hz and 110 Hz. These values may be set through the parameter
duffint_params; for example the following will generate a monotone at 150 Hz.
(set! duffint_params '((start 150) (end 150)))
(Parameter.set 'Int_Method 'DuffInt)
(Parameter.set 'Int_Target_Method Int_Targets_Default)
═══ 19.2. Simple intonation ═══
This module uses the CART tree in int_accent_cart_tree to predict if each
syllable is accented or not. A predicted value of NONE means no accent is
generated by the corresponding Int_Targets_Simple function. Any other
predicted value will cause a `hat' accent to be put on that syllable.
A default int_accent_cart_tree is available in the value
simple_accent_cart_tree in 'lib/intonation.scm'. It simply predicts accents
on the stressed syllables on content words in poly-syllabic words, and on the
only syllable in single syllable content words. Its form is
(set! simple_accent_cart_tree
'
((R:SylStructure.parent.gpos is content)
((stress is 1)
((Accented))
((position_type is single)
((Accented))
((NONE))))
((NONE))))
The function Int_Targets_Simple uses parameters in the a-list in variable
int_simple_params. There are two interesting parameters: f0_mean, which
gives the mean F0 for this speaker (default 110 Hz), and f0_std, the standard
deviation of F0 for this speaker (default 25 Hz). This second value is used
to determine the amount of variation to be put in the generated targets.
For each Phrase in the given utterance an F0 is generated starting at
f0_mean+(f0_std*0.6) and declining f0_std Hz over the length of the phrase
until the last syllable, whose end is set to f0_mean-f0_std. An imaginary
line called the baseline is drawn from the start to the end (minus the final
extra fall). For each syllable that is accented (i.e. has an IntEvent related
to it) three targets are added: one at the start, one in the mid vowel, and
one at the end. The start and end are at the baseline (as declined for that
syllable) and the mid vowel is set to baseline+f0_std.
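With the default values the arithmetic works out as follows (a worked sketch using the defaults quoted above; the a-list format for int_simple_params is assumed):

```scheme
(set! int_simple_params '((f0_mean 110) (f0_std 25)))
;; phrase start:          110 + (25 * 0.6) = 125 Hz
;; end of last syllable:  110 - 25        =  85 Hz
;; accented syllables:    start/end on the declining baseline,
;;                        mid vowel at baseline + 25 Hz
```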
Note this model is not supposed to be complex or comprehensive but it offers a
very quick and easy way to generate something other than a fixed line F0.
Something similar to this has been used for Spanish and Welsh without (too
many) people complaining. However it is not designed as a serious intonation
module.
═══ 19.3. Tree intonation ═══
This module is more flexible. Two different CART trees can be used to predict
`accents' and `endtones'. Although at present this module is used for an
implementation of the ToBI intonation labelling system it could be used for
many different types of intonation system.
The target module for this method uses a Linear Regression model to predict
start, mid-vowel and end targets for each syllable using arbitrarily
specified features. This follows the work described in black96. The LR
models are held as described below (see Linear regression). Three models are
used, in the variables f0_lr_start, f0_lr_mid and f0_lr_end.
═══ 19.4. Tilt intonation ═══
Tilt description to be inserted.
═══ 19.5. General intonation ═══
As there seem to be a number of intonation theories that predict F0 contours
by rule (possibly using trained parameters), this module aids the external
specification of such rules for a wide class of intonation theories (though
primarily those that might be referred to as the ToBI group). It is designed
to be multi-lingual and offer a quick way to port often pre-existing rules
into Festival without writing new C++ code.
The accent prediction part uses the same mechanisms as the Simple intonation
method described above, a decision tree for accent prediction, thus the tree
in the variable int_accent_cart_tree is used on each syllable to predict an
IntEvent.
The target part calls a specified Scheme function which returns a list of
target points for a syllable. In this way any arbitrary tests may be done to
produce the target points. For example here is a function which returns three
target points for each syllable with an IntEvent related to it (i.e. accented
syllables).
(define (targ_func1 utt syl)
"(targ_func1 UTT STREAMITEM)
Returns a list of targets for the given syllable."
(let ((start (item.feat syl 'syllable_start))
(end (item.feat syl 'syllable_end)))
(if (equal? (item.feat syl "R:Intonation.daughter1.name") "Accented")
(list
(list start 110)
(list (/ (+ start end) 2.0) 140)
(list end 100)))))
This function may be identified as the function to call by the following setup
parameters.
(Parameter.set 'Int_Method 'General)
(Parameter.set 'Int_Target_Method Int_Targets_General)
(set! int_general_params
(list
(list 'targ_func targ_func1)))
═══ 19.6. Using ToBI ═══
An example implementation of a ToBI to F0 target module is included in
'lib/tobi_rules.scm' based on the rules described in jilka96. This uses the
general intonation method discussed in the previous section. This is designed
to be useful to people who are experimenting with ToBI (silverman92), rather
than general text to speech.
To use this method you need to load 'lib/tobi_rules.scm' and call
setup_tobi_f0_method. The default is in a male pitch range, i.e. for
voice_rab_diphone. You can change it for other pitch ranges by changing the
following variables.
(Parameter.set 'Default_Topline 110)
(Parameter.set 'Default_Start_Baseline 87)
(Parameter.set 'Default_End_Baseline 83)
(Parameter.set 'Current_Topline (Parameter.get 'Default_Topline))
(Parameter.set 'Valley_Dip 75)
An example using this from STML is given in 'examples/tobi.stml'. But it can
also be used from Scheme. For example, before defining an utterance you
should execute the following, either from the command line or in some setup
file.
(voice_rab_diphone)
(require 'tobi_rules)
(setup_tobi_f0_method)
In order to allow specification of accents, tones, and break levels you must
use an utterance type that allows such specification. For example
(Utterance
Words
(boy
(saw ((accent H*)))
the
(girl ((accent H*)))
in the
(park ((accent H*) (tone H-)))
with the
(telescope ((accent H*) (tone H-H%)))))
(Utterance Words
(The
(boy ((accent L*)))
saw
the
(girl ((accent H*) (tone L-)))
with
the
(telescope ((accent H*) (tone H-H%)))))
You can display the synthesized form of these utterances in Xwaves. Start
an Xwaves and an Xlabeller and call the function display on the synthesized
utterance.
═══ 20. Duration ═══
A number of different duration prediction modules are available with varying
levels of sophistication.
Segmental duration prediction is done by the module Duration which calls
different actual methods depending on the parameter Duration_Method.
All of the following duration methods may be further affected by both a global
duration stretch and a per word one.
If the parameter Duration_Stretch is set, all absolute durations predicted by
any of the duration methods described here are multiplied by the parameter's
value. For example
(Parameter.set 'Duration_Stretch 1.2)
will make everything speak more slowly.
In addition to the global stretch method, if the feature dur_stretch on the
related Token is set it will also be used as a multiplicative factor on the
duration produced by the selected method. That is
R:Syllable.parent.parent.R:Token.parent.dur_stretch. There is a lisp
function duration_find_stretch which will return the combined global and
local duration stretch factor for a given segment item.
Note these global and local methods of affecting the duration produced by
models are crude and should be considered hacks. Uniform modification of
durations is not what happens in real speech. These parameters are typically
used when the underlying duration method is lacking in some way. However
these can be useful.
Note it is quite easy to implement new duration methods in Scheme directly.
Default durations Fixed length durations
Average durations
Klatt durations Klatt rules from book.
CART durations Tree based durations
═══ 20.1. Default durations ═══
If parameter Duration_Method is set to Default, the simplest duration model is
used. All segments are 100 milliseconds (this can be modified by
Duration_Stretch, and/or the localised Token related dur_stretch feature).
═══ 20.2. Average durations ═══
If parameter Duration_Method is set to Averages then segmental durations are
set to their averages. The variable phoneme_durations should be an a-list of
phones and averages in seconds. The file 'lib/mrpa_durs.scm' has an example
for the mrpa phoneset.
If a segment is found that does not appear in the list a default duration of
0.1 seconds is assigned, and a warning message generated.
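The a-list has a simple form; the values below are invented for illustration (see 'lib/mrpa_durs.scm' for real averages):

```scheme
(set! phoneme_durations
      '((#  0.200)   ; silence (invented value)
        (a  0.080)
        (ch 0.110)))
```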
═══ 20.3. Klatt durations ═══
If parameter Duration_Method is set to Klatt, the duration rules from the
Klatt book (allen87, chapter 9) are used. This method requires minimum and
inherent
durations for each phoneme in the phoneset. This information is held in the
variable duration_klatt_params. Each member of this list is a three-tuple, of
phone name, inherent duration and minimum duration. An example for the mrpa
phoneset is in 'lib/klatt_durs.scm'.
═══ 20.4. CART durations ═══
Two very similar methods of duration prediction by CART tree are supported.
The first, used when parameter Duration_Method is Tree simply predicts
durations directly for each segment. The tree is set in the variable
duration_cart_tree.
The second, which seems to give better results, is used when parameter
Duration_Method is Tree_ZScores. In this second model the tree predicts
zscores (number of standard deviations from the mean) rather than duration
directly. (This follows campbell91, but we don't deal in syllable durations
here.) This method requires means and standard deviations for each phone.
The variable duration_cart_tree should contain the zscore prediction tree and
the variable duration_ph_info should contain a list of phone, mean duration,
and standard deviation for each phone in the phoneset.
An example tree trained from 460 sentences spoken by Gordon is in
'lib/gswdurtreeZ'. Phone means and standard deviations are in
'lib/gsw_durs.scm'.
After prediction the segmental duration is calculated by the simple formula
duration = mean + (zscore * standard deviation)
This method has also been used for other duration models that modify an
inherent duration by some factor: if the tree predicts factors rather than
zscores, and the duration_ph_info entries are phone, 0.0, inherent duration,
the above formula will generate the desired result. Klatt and Klatt-like
rules can be implemented in this way without adding a new method.
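A worked sketch with invented numbers shows both uses of the formula:

```scheme
;; zscore form: entry is (phone mean stddev); for aa below a predicted
;; zscore of 1.5 gives 0.080 + (1.5 * 0.020) = 0.110 seconds.
(set! duration_ph_info
      '((aa 0.080 0.020)   ; invented values
        (t  0.060 0.015)))
;; factor form: an entry (ph 0.0 inherent) makes a predicted value F act
;; as a multiplier, since 0.0 + (F * inherent) = F * inherent.
```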
═══ 21. UniSyn synthesizer ═══
Since 1.3 a new general synthesizer module has been included. This is
designed to replace the older diphone synthesizer described in the next
chapter. A redesign was made in order to have a generalized waveform
synthesizer and signal processing module that could be used even when the
units being concatenated are not diphones. Also at this stage the full
diphone (or other) database pre-processing functions were added to the Speech
Tool library.
═══ 21.1. UniSyn database format ═══
The Unisyn synthesis modules can use databases in two basic formats, separate
and grouped. Separate is when all files (signal, pitchmark and coefficient
files) are accessed individually during synthesis. This is the standard use
during database development. Group format is when a database is collected
together into a single special file containing all information necessary for
waveform synthesis. This format is designed to be used for distribution and
general use of the database.
A database should consist of a set of waveforms (which may be translated
into a set of coefficients if the desired signal processing method requires
it), a set of pitchmarks and an index. The pitchmarks are necessary as most
of our current signal processing methods are pitch synchronous.
═══ 21.1.1. Generating pitchmarks ═══
Pitchmarks may be derived from laryngograph files using the 'pitchmark'
program distributed with the speech tools. The actual parameters to this
program are still a bit of an art form. The first major issue is which way
up the lar files are. We have seen both, though it does seem that CSTR's
are most often upside down while others (e.g. OGI's) are the right way
up. The -inv argument to 'pitchmark' is specifically provided to cater for
this. There are other issues in getting the pitchmarks aligned. The basic
command for generating pitchmarks is
pitchmark -inv lar/file001.lar -o pm/file001.pm -otype est \
-min 0.005 -max 0.012 -fill -def 0.01 -wave_end
The '-min', '-max' and '-def' (fill values for unvoiced regions), may need to
be changed depending on the speaker pitch range. The above is suitable for a
male speaker. The '-fill' option states that unvoiced sections should be
filled with equally spaced pitchmarks.
═══ 21.1.2. Generating LPC coefficients ═══
LPC coefficients are generated using the 'sig2fv' command. Two stages are
required, generating the LPC coefficients and generating the residual. The
prototypical commands for these are
sig2fv wav/file001.wav -o lpc/file001.lpc -otype est -lpc_order 16 \
-coefs "lpc" -pm pm/file001.pm -preemph 0.95 -factor 3 \
-window_type hamming
sigfilter wav/file001.wav -o lpc/file001.res -otype nist \
-lpcfilter lpc/file001.lpc -inv_filter
For some databases you may need to normalize the power. Properly
normalizing power is difficult but we provide a simple function which may do
the job acceptably. You should do this on the waveform before LPC analysis
(and ensure you also do the residual extraction on the normalized waveform
rather than the original).
ch_wave -scaleN 0.5 wav/file001.wav -o file001.Nwav
This normalizes the power by maximizing the signal first then multiplying it
by the given factor. If the database waveforms are clean (i.e. no clicks)
this can give reasonable results.
═══ 21.2. Generating a diphone index ═══
The diphone index consists of a short header followed by an ascii list of
each diphone: the file it comes from, followed by its start, middle and end
times in seconds. For most databases this file needs to be generated by some
database specific script.
An example header is
EST_File index
DataType ascii
NumEntries 2005
IndexName rab_diphone
EST_Header_End
The most notable part is the number of entries, which you should note can get
out of sync with the actual number of entries if you hand edit entries. I.e.
if you add an entry and the system still can't find it check that the number
of entries is right.
The entries themselves may take on one of two forms: full entries or index
entries. Full entries consist of a diphone name, where the phones are
separated by "-"; a file name which is used to index into the pitchmark, LPC
and waveform files; and the start, middle (change-over point between the
phones) and end times of the diphone in the file, in seconds. For example
r-uh edx_1001 0.225 0.261 0.320
r-e edx_1002 0.224 0.273 0.326
r-i edx_1003 0.240 0.280 0.321
r-o edx_1004 0.212 0.253 0.320
The second form of entry is an index entry which simply states that reference
to that diphone should actually be made to another. For example
aa-ll &aa-l
This states that the diphone aa-ll should actually use the diphone aa-l.
Note there are a number of ways to specify alternates for missing diphones,
and this method is best used for fixing single or small classes of missing
or broken diphones. Index entries may appear anywhere in the file but can't
be nested. Some checks are made on reading this index to ensure times etc
are reasonable, but multiple entries for the same diphone are not checked;
in that case the later one will be selected.
═══ 21.3. Database declaration ═══
There are two major types of database: grouped and ungrouped. Grouped
databases come as a single file containing the diphone index, coefficients
and residuals for the diphones. This is the standard way databases are
distributed as voices in Festival. Ungrouped databases access diphones from
individual files and are designed as a method for debugging and testing
databases before distribution. Using an ungrouped database is slower but
allows quicker changes to the index, and to associated coefficient files and
residuals, without rebuilding the group file.
A database is declared to the system through the command us_diphone_init.
This function takes a parameter list of various features used for setting up a
database. The features are
name An atomic name for this database, used in selecting it from the
current set of loaded databases.
index_file A filename containing either a diphone index, as described
above, or a group file. The feature grouped defines the distinction
between this being a group file or a simple index file.
grouped Takes the value "true" or "false". This defines whether the index
file is a simple index or a grouped file.
coef_dir The directory containing the coefficients, (LPC or just pitchmarks
in the PSOLA case).
sig_dir The directory containing the signal files (residual for LPC, full
waveforms for PSOLA).
coef_ext The extension for coefficient files, typically ".lpc" for LPC
files and ".pm" for pitchmark files.
sig_ext The extension for signal files, typically ".res" for LPC residual
files and ".wav" for waveform files.
default_diphone The diphone to be used when the requested one doesn't exist.
No matter how careful you are you should always include a default
diphone for a distributed diphone database. Synthesis will throw an
error if no diphone is found and there is no default. Although it
is usually an error when this is required, it's better to fill in
something than stop synthesizing. Typical values for this are
silence to silence or schwa to schwa.
alternates_left A list of pairs showing the alternate phone names for the
left phone in a diphone pair. This list is used to rewrite the
diphone name when the directly requested one doesn't exist. This is
the recommended method for dealing with systematic holes in a
diphone database.
alternates_right A list of pairs showing the alternate phone names for the
right phone in a diphone pair. This list is used to rewrite the
diphone name when the directly requested one doesn't exist. This is
the recommended method for dealing with systematic holes in a
diphone database.
An example database definition is
(set! rab_diphone_dir "/projects/festival/lib/voices/english/rab_diphone")
(set! rab_lpc_group
(list
'(name "rab_lpc_group")
(list 'index_file
(path-append rab_diphone_dir "group/rablpc16k.group"))
'(alternates_left ((i ii) (ll l) (u uu) (i@ ii) (uh @) (a aa)
(u@ uu) (w @) (o oo) (e@ ei) (e ei)
(r @)))
'(alternates_right ((i ii) (ll l) (u uu) (i@ ii)
(y i) (uh @) (r @) (w @)))
'(default_diphone @-@@)
'(grouped "true")))
(us_diphone_init rab_lpc_group)
═══ 21.4. Making groupfiles ═══
The function us_make_group_file will make a group file of the currently
selected US diphone database. It loads in all diphones in the database and
saves them in the named file. An optional second argument allows
specification of how the group file will be saved. These options are given
as a feature list. There are three possible options
track_file_format The format for the coefficient files. By default this is
est_binary, currently the only other alternative is est_ascii.
sig_file_format The format for the signal parts of the database. By
default this is snd (Sun's audio format). This was chosen as it
has the smallest header and supports various sample formats. Any
format supported by the Edinburgh Speech Tools is allowed.
sig_sample_formatThe format for the samples in the signal files. By default
this is mulaw. This is suitable when the signal files are LPC
residuals. LPC residuals have a much smaller dynamic range that
plain PCM files. Because mulaw representation is half the size (8
bits) of standard PCM files (16bits) this significantly reduces the
size of the group file while only marginally altering the quality of
synthesis (and from experiments the effect is not perceptible).
However when saving group files where the signals are not LPC
residuals (e.g. in PSOLA) using this default mulaw is not
recommended and short should probably be used.
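For example, after selecting a database, a call might look like the
following sketch; the option names follow the list above, but the output
filename here is illustrative:

```scheme
;; Save the currently selected US diphone database as a group file.
;; Here the sample format is switched to short, as recommended above
;; for non-residual (e.g. PSOLA) signals.  The path is illustrative.
(us_make_group_file "group/rab_pcm.group"
  '((track_file_format est_binary)
    (sig_file_format snd)
    (sig_sample_format short)))
```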
ΓòÉΓòÉΓòÉ 21.5. UniSyn module selection ΓòÉΓòÉΓòÉ
In a voice selection a UniSyn database may be selected as follows
(set! UniSyn_module_hooks (list rab_diphone_const_clusters ))
(set! us_abs_offset 0.0)
(set! window_factor 1.0)
(set! us_rel_offset 0.0)
(set! us_gain 0.9)
(Parameter.set 'Synth_Method 'UniSyn)
(Parameter.set 'us_sigpr 'lpc)
(us_db_select rab_db_name)
The UniSyn_module_hooks are run before synthesis; see the next section about
diphone name selection. At present only lpc is supported by the UniSyn
module, though potentially there may be others.
An optional implementation of TD-PSOLA moulines90 has been written, but fear
of legal problems unfortunately prevents it from being included in the public
distribution; this policy should not be taken as acknowledging or denying any
alleged patent violation.
ΓòÉΓòÉΓòÉ 21.6. Diphone selection ΓòÉΓòÉΓòÉ
Diphone names are constructed for each phone-phone pair in the Segment
relation in an utterance. In forming a diphone name, UniSyn first checks
for the feature us_diphone_left (or us_diphone_right for the right hand
part of the diphone), then, if that doesn't exist, the feature us_diphone,
then, if that doesn't exist, the feature name. Thus it is possible
to specify diphone names which are not simply the concatenation of two
segment names.
This feature is used to specify consonant cluster diphone names for our
English voices. The hook UniSyn_module_hooks is run before selection and we
specify a function to add us_diphone_* features as appropriate. See the
function rab_diphone_fix_phone_name in
'lib/voices/english/rab_diphone/festvox/rab_diphone.scm' for an example.
Once the diphone name is created it is used to select the diphone from the
database. If it is not found, the name is converted using the lists of
alternates_left and alternates_right as specified in the database declaration.
If that still doesn't identify a diphone in the database, the default_diphone
is selected and a warning is printed. If no default diphone is specified, or
the default diphone doesn't exist in the database, an error is thrown.
ΓòÉΓòÉΓòÉ 22. Diphone synthesizer ΓòÉΓòÉΓòÉ
NOTE: use of this diphone synthesizer is deprecated and it will probably be
removed from future versions; all of its functionality has been replaced by
the UniSyn synthesizer. It is not compiled by default; if required, add
ALSO_INCLUDE += diphone to your 'festival/config/config' file.
A basic diphone synthesizer offers a method for making speech from segments,
durations and intonation targets. This module was mostly written by Alistair
Conkie but the base diphone format is compatible with previous CSTR diphone
synthesizers.
The synthesizer offers residual excited LPC based synthesis (hunt89) and PSOLA
(TM) (moulines90) (PSOLA is not available for distribution).
Diphone database format Format of basic dbs
LPC databases Building and using LPC files.
Group files Efficient binary formats
Diphone_Init Loading diphone databases
Access strategies Various access methods
Diphone selection Mapping phones to special diphone names
ΓòÉΓòÉΓòÉ 22.1. Diphone database format ΓòÉΓòÉΓòÉ
A diphone database consists of a dictionary file, a set of waveform files, and
a set of pitch mark files. These files are the same format as the previous
CSTR (Osprey) synthesizer.
The dictionary file consists of one entry per line. Each entry consists of
five fields: a diphone name of the form P1-P2, a filename (without extension),
a floating point start position in the file in milliseconds, a mid position in
milliseconds (change in phone), and an end position in milliseconds. Lines
starting with a semi-colon and blank lines are ignored. The list may be in
any order.
For example, a partial list of entries may look like this:
ch-l r021 412.035 463.009 518.23
jh-l d747 305.841 382.301 446.018
h-l d748 356.814 403.54 437.522
#-@ d404 233.628 297.345 331.327
@-# d001 836.814 938.761 1002.48
Waveform files may be in any form, as long as every file is the same type,
headered or unheadered, and the format is supported by the speech tools
wave reading functions. These may be standard linear PCM waveform files in
the case of PSOLA, or LPC coefficients and residual when using the residual
LPC synthesizer (see LPC databases).
Pitch mark files consist of a simple list of positions in milliseconds (plus
places after the point) in order, one per line, of each pitch mark in the file.
For high quality diphone synthesis these should be derived from laryngograph
data. During unvoiced sections pitch marks should be artificially created at
reasonable intervals (e.g. 10 ms). In the current format there is no way to
determine the "real" pitch marks from the "unvoiced" pitch marks.
It is normal to hold a diphone database in a directory with a number of
sub-directories, namely 'dic/' containing the dictionary file, 'wave/' for the
waveform files, typically of whole nonsense words (sometimes this directory is
called 'vox/' for historical reasons), and 'pm/' for the pitch mark files. The
filename in the dictionary entry should be the same for the waveform file and
the pitch mark file (with different extensions).
ΓòÉΓòÉΓòÉ 22.2. LPC databases ΓòÉΓòÉΓòÉ
The standard method for diphone resynthesis in the released system is residual
excited LPC (hunt89). The actual method of resynthesis isn't important to the
database format, but if residual LPC synthesis is to be used then it is
necessary to make the LPC coefficient files and their corresponding residuals.
Previous versions of the system used a "host of hacky little scripts" to do
this, but now that the Edinburgh Speech Tools support LPC analysis we can
provide a walk-through for generating these.
We assume that the waveform files of nonsense words are in a directory called
'wave/'. The LPC coefficients and residuals will be, in this example, stored
in 'lpc16k/' with extensions '.lpc' and '.res' respectively.
Before starting it is worth considering power normalization. We have found
this important on all of the databases we have collected so far. The ch_wave
program, part of the speech tools, with the option -scaleN 0.4, may be used
if a more complex method is not available.
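For example, the whole 'wave/' directory could be normalised in place with a
loop like the following sketch (check ch_wave's options on your installation,
and keep a copy of the original data before overwriting it):

```shell
# Power-normalise each nonsense word to 40% of maximum amplitude
# before LPC analysis.  ch_wave is part of the Edinburgh Speech Tools.
for i in wave/*.wav
do
   ch_wave -scaleN 0.4 -o $i $i
done
```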
The following shell command generates the files
for i in wave/*.wav
do
fname=`basename $i .wav`
echo $i
lpc_analysis -reflection -shift 0.01 -order 18 -o lpc16k/$fname.lpc \
-r lpc16k/$fname.res -otype htk -rtype nist $i
done
A common rule of thumb is that the LPC order should be the sample rate
divided by one thousand, plus 2. This may or may not be appropriate, and if
you are particularly worried about the database size it is worth experimenting.
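That rule of thumb is easy to compute; as a small sketch in Python (not part
of Festival, just illustrating the arithmetic):

```python
def lpc_order(sample_rate_hz):
    """Rule of thumb from the text: sample rate divided by 1000, plus 2."""
    return sample_rate_hz // 1000 + 2

print(lpc_order(16000))  # 18, matching the -order 18 used above for 16 kHz data
```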
The program 'lpc_analysis', found in 'speech_tools/bin', can be used to
generate the lpc coefficients and residual. Note these should be reflection
coefficients so they may be quantised (as they are in group files).
The coefficients and residual files produced by different LPC analysis
programs may start at different offsets. For example, Entropic's ESPS
functions generate LPC coefficients that are offset by one frame shift (e.g.
0.01 seconds). Our own 'lpc_analysis' routine has no offset. The Diphone_Init
parameter list allows these offsets to be specified. Using the above function
to generate the LPC files the description parameters should include
(lpc_frame_offset 0)
(lpc_res_offset 0.0)
While when generating using ESPS routines the description should be
(lpc_frame_offset 1)
(lpc_res_offset 0.01)
The defaults actually follow the ESPS form, that is lpc_frame_offset is 1 and
lpc_res_offset is equal to the frame shift, if they are not explicitly
mentioned.
Note that the biggest problem we had in implementing the residual excited LPC
resynthesizer was getting the right part of the residual to line up with the
right LPC coefficients describing the pitch mark. Making errors in this
degrades the synthesized waveform noticeably, though not severely, making it
difficult to determine if it is an offset problem or some other bug.
Although we have started investigating if extracting pitch synchronous LPC
parameters rather than fixed shift parameters gives better performance, we
haven't finished this work. 'lpc_analysis' supports pitch synchronous
analysis but the raw "ungrouped" access method does not yet. At present the
LPC parameters are extracted at a particular pitch mark by interpolating over
the closest LPC parameters. The "group" files hold these interpolated
parameters pitch synchronously.
The American English voice 'kd' was created using the speech tools
'lpc_analysis' program and its set up should be looked at if you are going to
copy it. The British English voice 'rb' was constructed using ESPS routines.
ΓòÉΓòÉΓòÉ 22.3. Group files ΓòÉΓòÉΓòÉ
Databases may be accessed directly but this is usually too inefficient for any
purpose except debugging. It is expected that group files will be built which
contain a binary representation of the database. A group file is a compact
efficient representation of the diphone database. Group files are byte order
independent, so may be shared between machines of different byte orders and
word sizes. Certain information in a group file may be changed at load time so
a database name, access strategy etc. may be changed from what was set
originally in the group file.
A group file contains the basic parameters, the diphone index, the signal
(original waveform or LPC residual), LPC coefficients, and the pitch marks.
It is all you need for a run-time synthesizer. Various compression mechanisms
are supported to allow smaller databases if desired. A full English LPC plus
residual database at 8k ulaw is about 3 megabytes, while a full 16 bit version
at 16k is about 8 megabytes.
Group files are created with the Diphone.group command, which takes a database
name and an output filename as arguments. Making group files can take some
time especially if they are large. The group_type parameter specifies raw or
ulaw for encoding signal files. This can significantly reduce the size of
databases.
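As a sketch, building a ulaw-encoded group file might look like this; the
database name and output path are illustrative, and the exact way group_type
is set here is an assumption, so check your voice's setup files for the real
convention:

```scheme
;; Encode signals as ulaw to roughly halve the group file size,
;; then build the group file for the database named "rab".
(set! group_type 'ulaw)           ; assumption: set as a global variable
(Diphone.group "rab" "/tmp/rab.group")
```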
Group files may be partially loaded (see access strategies) at run time for
quicker start up and to minimise run-time memory requirements.
ΓòÉΓòÉΓòÉ 22.4. Diphone_Init ΓòÉΓòÉΓòÉ
The basic method for describing a database is through the Diphone_Init
command. This function takes a single argument, a list of pairs of parameter
name and value. The parameters are
name An atomic name for this database.
group_file The filename of a group file, which may itself contain parameters
describing itself.
type The default value is pcm, but for distributed voices this is always
lpc.
index_file A filename containing the diphone dictionary.
signal_dir A directory (slash terminated) containing the pcm waveform files.
signal_ext A dot prefixed extension for the pcm waveform files.
pitch_dir A directory (slash terminated) containing the pitch mark files.
pitch_ext A dot prefixed extension for the pitch files
lpc_dir A directory (slash terminated) containing the LPC coefficient files
and residual files.
lpc_ext A dot prefixed extension for the LPC coefficient files
lpc_type The type of LPC file (as supported by the speech tools)
lpc_frame_offset The number of frames "missing" from the beginning of the
file. Often LPC parameters are offset by one frame.
lpc_res_ext A dot prefixed extension for the residual files.
lpc_res_type The type of the residual files; this is a standard waveform type
as supported by the speech tools.
lpc_res_offset The number of seconds "missing" from the beginning of the
residual file. Some LPC analysis techniques do not generate a
residual until after one frame.
samp_freq Sample frequency of signal files
phoneset Phoneset used, must already be declared.
num_diphones The total number of diphones in the database. If specified this
must be equal to or bigger than the number of entries in the index
file. If it is not specified, the square of the number of phones in
the phoneset is used.
sig_band The number of sample points around the actual diphone to take from
the file. This should be larger than any windowing used on the
signal, and/or up to the pitch marks outside the diphone signal.
alternates_after A list of pairs of phones stating replacements for the second
part of the diphone when the basic diphone is not found in the
diphone database.
alternates_before A list of pairs of phones stating replacements for the first
part of the diphone when the basic diphone is not found in the
diphone database.
default_diphone When unexpected combinations occur and no appropriate diphone
can be found, this diphone is used. It should be specified for all
diphone databases that are to be robust. We usually use the
silence to silence diphone. No matter how carefully you design
your diphone set, conditions in which an unknown diphone is required
always seem to arise. If this is not set and a diphone is requested
that is not in the database, an error occurs and synthesis will stop.
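Putting these parameters together, a minimal database description might look
like the following sketch. All paths, the dictionary filename and the
phoneset are illustrative only; the real set up for the rab voice is in the
file mentioned below:

```scheme
;; A minimal (hypothetical) description of a residual LPC diphone
;; database, using only parameters documented above.
(set! rab_db_dir "/projects/festival/lib/voices/english/rab_diphone/")
(Diphone_Init
 (list
  '(name "rab")
  (list 'index_file (path-append rab_db_dir "dic/diphdic.est"))
  '(type lpc)
  (list 'lpc_dir (path-append rab_db_dir "lpc16k/"))
  '(lpc_ext ".lpc")
  '(lpc_frame_offset 0)     ; lpc_analysis output: no missing frames
  '(lpc_res_ext ".res")
  '(lpc_res_offset 0.0)
  (list 'pitch_dir (path-append rab_db_dir "pm/"))
  '(pitch_ext ".pm")
  '(samp_freq 16000)
  '(phoneset mrpa)
  '(default_diphone #-#)))  ; silence-to-silence fallback
```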
Examples of both general set up, making group files and general use are in
'lib/voices/english/rab_diphone/festvox/rab_diphone.scm'
ΓòÉΓòÉΓòÉ 22.5. Access strategies ΓòÉΓòÉΓòÉ
Three basic accessing strategies are available when using diphone databases.
They are designed to optimise access time, start up time and space
requirements.
direct Load all signals at database init time. This is the slowest startup
but the fastest to access. This is ideal for servers. It is also
useful for small databases that can be loaded quickly. It is
reasonable for many group files.
dynamic Load signals as they are required. This has much faster start up
and will only gradually use up memory as the diphones are actually
used. Useful for larger databases, and for non-group file access.
ondemand Load the signals as they are requested but free them if they are not
required again immediately. This is slower access but requires low
memory usage. In group files the re-reads are quite cheap as the
database is well cached and a file descriptor is already open for
the file.
Note that in group files pitch marks (and LPC coefficients) are always fully
loaded (cf. direct), as they are typically smaller. Only signals (waveform
files or residuals) are potentially dynamically loaded.
ΓòÉΓòÉΓòÉ 22.6. Diphone selection ΓòÉΓòÉΓòÉ
The appropriate diphone is selected based on the name of the phone identified
in the segment stream. However for better diphone synthesis it is useful to
augment the diphone database with other diphones in addition to the ones
directly from the phoneme set. For example dark and light l's, distinguishing
consonants from their consonant cluster form and their isolated form. There
are however two methods to identify this modification from the basic name.
When the diphone module is called, the hook diphone_module_hooks is applied;
that is, a list of functions is applied to the utterance. Its main purpose is
to allow the conversion of the basic name into an augmented one, for example
converting a basic l into a dark l, denoted by ll. The functions given in
diphone_module_hooks may set the feature diphone_phone_name which, if set,
will be used rather than the name of the segment.
For example, suppose we wish to use a dark l (ll) rather than a normal l for
all l's that appear in the coda of a syllable. First we would define a
function which identifies this condition and adds the feature
diphone_phone_name to mark the name change. The following function would
achieve this
(define (fix_dark_ls utt)
"(fix_dark_ls UTT)
Identify ls in coda position and relabel them as ll."
(mapcar
(lambda (seg)
(if (and (string-equal "l" (item.name seg))
(string-equal "+" (item.feat seg "p.ph_vc"))
(item.relation.prev seg "SylStructure"))
(item.set_feat seg "diphone_phone_name" "ll")))
(utt.relation.items utt 'Segment))
utt)
Then when we wish to use this for a particular voice we need to add
(set! diphone_module_hooks (list fix_dark_ls))
in the voice selection function.
For a more complex example including consonant cluster identification see the
American English voice 'ked' in
'festival/lib/voices/english/ked/festvox/kd_diphone.scm'. The function
ked_diphone_fix_phone_name carries out a number of mappings.
The second method for changing a name is during actual look up of a diphone in
the database. The list of alternates is given by the Diphone_Init function.
These are used when the specified diphone can't be found. For example we
often allow mappings of dark l, ll to l as sometimes the dark l diphone
doesn't actually exist in the database.
ΓòÉΓòÉΓòÉ 23. Other synthesis methods ΓòÉΓòÉΓòÉ
Festival supports a number of other synthesis systems
LPC diphone synthesizer A small LPC synthesizer (Donovan diphones)
MBROLA Interface to MBROLA
Synthesizers in development
ΓòÉΓòÉΓòÉ 23.1. LPC diphone synthesizer ΓòÉΓòÉΓòÉ
A very simple, and very efficient LPC diphone synthesizer using the "donovan"
diphones is also supported. This synthesis method is primarily the work of
Steve Isard and later Alistair Conkie. The synthesis quality is not as good as
the residual excited LPC diphone synthesizer but has the advantage of being
much smaller. The donovan diphone database is under 800k.
The diphones are loaded through the Donovan_Init function which takes the name
of the dictionary file and the diphone file as arguments, see the following
for details
lib/voices/english/don_diphone/festvox/don_diphone.scm
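For example, a voice setup might load the diphones like this; both paths are
illustrative, and the real call is in the file named above:

```scheme
;; Load the Donovan diphones: the first argument is the dictionary
;; file, the second the diphone file, as described above.
(Donovan_Init "lib/voices/english/don_diphone/dic/don.dic"
              "lib/voices/english/don_diphone/don.lpc")
```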
ΓòÉΓòÉΓòÉ 23.2. MBROLA ΓòÉΓòÉΓòÉ
As an example of how Festival may use a completely external synthesis method
we support the free system MBROLA. MBROLA is both a diphone synthesis
technique and an actual system that constructs waveforms from segment,
duration and F0 target information. For details see the MBROLA home page at
http://tcts.fpms.ac.be/synthesis/mbrola.html. MBROLA already supports a
number of diphone sets including French, Spanish, German and Romanian.
Festival support for MBROLA is in the file 'lib/mbrola.scm'. It is all in
Scheme. The function MBROLA_Synth is called when parameter Synth_Method is
MBROLA. The function simply saves the segment, duration and target
information from the utterance, calls the external 'mbrola' program with the
selected diphone database, and reloads the generated waveform back into the
utterance.
An MBROLA-ized version of the Roger diphoneset is available from the MBROLA
site. The simple Festival end is distributed as part of the system in
'festvox_en1.tar.gz'. The following variables are used by the process
mbrola_progname the pathname of the mbrola executable.
mbrola_database the name of the database to use. This variable is switched
between different speakers.
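Putting it together, a voice selection function might set these up roughly as
follows; the paths and database location are illustrative:

```scheme
;; Point Festival at the external mbrola binary and a diphone
;; database, then route waveform synthesis through MBROLA_Synth.
(set! mbrola_progname "/usr/local/bin/mbrola")
(set! mbrola_database "/usr/local/share/mbrola/en1/en1")
(Parameter.set 'Synth_Method 'MBROLA)
```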
ΓòÉΓòÉΓòÉ 23.3. Synthesizers in development ΓòÉΓòÉΓòÉ
In addition to the above synthesizers Festival also supports CSTR's older
PSOLA synthesizer written by Paul Taylor. But as the newer diphone synthesizer
produces similar quality output and is a newer (and hence cleaner)
implementation further development of the older module is unlikely.
An experimental unit selection synthesis module is included in
'modules/clunits/'; it is an implementation of black97c. It is included for
people wishing to continue research in the area rather than as a fully usable
waveform synthesis engine. Although it sometimes gives excellent results, it
also sometimes gives amazingly bad ones. We include it as an example of
one possible framework for selection-based synthesis.
As one of our funded projects is to specifically develop new selection based
synthesis algorithms we expect to include more models within later versions of
the system.
Also, now that Festival has been released other groups are working on new
synthesis techniques in the system. Many of these will become available and
where possible we will give pointers from the Festival home page to them.
Particularly there is an alternative residual excited LPC module implemented
at the Center for Spoken Language Understanding (CSLU) at the Oregon Graduate
Institute (OGI).
ΓòÉΓòÉΓòÉ 24. Audio output ΓòÉΓòÉΓòÉ
If you have never heard any audio ever on your machine then you must first
work out if you have the appropriate hardware. If you do, you also need the
appropriate software to drive it. Festival can directly interface with a
number of audio systems or use external methods for playing audio.
The currently supported audio methods are
'NAS' NCD's NAS is a network transparent audio system (formerly called
netaudio). If you already run servers on your machines you simply
need to ensure your AUDIOSERVER environment variable is set (or your
DISPLAY variable if your audio output device is the same as your X
Windows display). You may set NAS as your audio output method by the
command
(Parameter.set 'Audio_Method 'netaudio)
'/dev/audio' On many systems '/dev/audio' offers a simple low level method for
audio output. It is limited to mu-law encoding at 8KHz. Some
implementations of '/dev/audio' allow other sample rates and sample
types but as that is non-standard this method only uses the common
format. Typical systems that offer these are Suns, Linux and
FreeBSD machines. You may set direct '/dev/audio' access as your
audio method by the command
(Parameter.set 'Audio_Method 'sunaudio)
'/dev/audio (16bit)' Later Sun Microsystems workstations support 16 bit linear
audio at various sample rates. This form of audio output is
supported as a compile time option (as it requires include files
that only exist on Sun machines). If your installation
supports it (check the members of the list *modules*) you can select
16 bit audio output on Suns by the command
(Parameter.set 'Audio_Method 'sun16audio)
Note this will send audio to the local machine where the festival binary is
running; this might not be the one you are sitting next to, which is why we
recommend netaudio. A hacky solution for playing audio on a local machine from
a remote machine without using netaudio is described in Installation.
'/dev/dsp (voxware)' Both FreeBSD and Linux have a very similar audio interface
through '/dev/dsp'. There is compile time support for these in the
speech tools and when compiled with that option Festival may utilise
it. Check the value of the variable *modules* to see which audio
devices are directly supported. On FreeBSD, if supported, you may
select local 16 bit linear audio by the command
(Parameter.set 'Audio_Method 'freebsd16audio)
While under Linux, if supported, you may use the command
(Parameter.set 'Audio_Method 'linux16audio)
Some earlier (and smaller) machines only have 8 bit audio even though they
include a '/dev/dsp' (SoundBlaster Pro for example). This was not dealt with
properly in earlier versions of the system, but now the support automatically
checks the sample width supported and uses it accordingly. 8 bit audio at
frequencies higher than 8K sounds better than straight 8K ulaw, so this
feature is useful.
'mplayer' Under Windows NT or 95 you can use the 'mplayer' command, which we
have found requires special treatment to get its parameters right.
Rather than using Audio_Command you can select this on Windows
machines with the following command
(Parameter.set 'Audio_Method 'mplayeraudio)
Alternatively built-in audio output is available with
(Parameter.set 'Audio_Method 'win32audio)
'SGI IRIX' Built-in audio output is now available for SGI's IRIX 6.2 using the
command
(Parameter.set 'Audio_Method 'irixaudio)
'Audio Command' Alternatively the user can provide a command that can play an
audio file. Festival will execute that command in an environment
where the shell variable SR is set to the sample rate (in Hz) and
FILE, by default, to the name of an unheadered raw, 16 bit file
containing the synthesized waveform in the byte order of the machine
Festival is running on. You can specify your audio play command,
and that you wish Festival to execute it, through the following
commands
(Parameter.set 'Audio_Command "sun16play -f $SR $FILE")
(Parameter.set 'Audio_Method 'Audio_Command)
On SGI machines under IRIX the equivalent would be
(Parameter.set 'Audio_Command
"sfplay -i integer 16 2scomp rate $SR end $FILE")
(Parameter.set 'Audio_Method 'Audio_Command)
For the Audio_Command method of playing waveforms, Festival supports two
additional audio parameters. Audio_Required_Rate allows you to use Festival's
internal sample rate conversion function to convert to any desired rate. Note
this may not be as good as playing the waveform at the sample rate it was
originally created in, but as some hardware devices are restrictive in what
sample rates they support, or have naive resample functions, this could be
optimal. The second additional audio parameter is Audio_Required_Format,
which can be used to specify the desired output format of the file. The
default is unheadered raw, but this may be any of the values supported by the
speech tools (including nist, esps, snd, riff, aiff, audlab, raw and, if you
really want it, ascii).
For example, suppose you have a program that only plays Sun headered files at
16 kHz; you can set up audio output as
(Parameter.set 'Audio_Method 'Audio_Command)
(Parameter.set 'Audio_Required_Rate 16000)
(Parameter.set 'Audio_Required_Format 'snd)
(Parameter.set 'Audio_Command "sunplay $FILE")
If Netaudio is not available and you need to play audio on a machine different
from the one Festival is running on, we have had reports that 'snack'
(http://www.speech.kth.se/snack/) is a possible solution. It allows remote
play but importantly also supports Windows 95/NT based clients.
Because you do not want to wait for a whole file to be synthesized before you
can play it, Festival also offers an audio spooler that allows the playing of
audio files while continuing to synthesize the following utterances. On
reasonable workstations this allows the breaks between utterances to be as
short as your hardware allows them to be.
The audio spooler may be started by selecting asynchronous mode
(audio_mode 'async)
This is switched on by default by the function tts. You may put Festival back
into synchronous mode (i.e. the utt.play command will wait until the audio has
finished playing before returning) by the command
(audio_mode sync)
Additional related commands are
(audio_mode 'close) Close the audio server down but wait until the queue is cleared.
This is useful in scripts etc. when you wish to only exit when all
audio is complete.
(audio_mode 'shutup) Close the audio down now, stopping the current file being
played and any in the queue. Note that this may take some time to
take effect depending on which audio method you use. Sometimes
there can be 100s of milliseconds of audio in the device itself
which cannot be stopped.
(audio_mode 'query) Lists the size of each waveform currently in the queue.
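A typical script-level use of these modes, with SayText as used elsewhere in
this manual, is:

```scheme
(audio_mode 'async)            ; start the audio spooler
(SayText "First utterance.")
(SayText "Second utterance.")  ; synthesized while the first still plays
(audio_mode 'close)            ; wait for all queued audio before exiting
```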
ΓòÉΓòÉΓòÉ 25. Voices ΓòÉΓòÉΓòÉ
This chapter gives some general suggestions about adding new voices to
Festival. Festival attempts to offer an environment where new voices and
languages can easily be slotted in to the system.
Current voices Currently available voices
Building a new voice Building a new voice
Defining a new voice Defining a new voice
ΓòÉΓòÉΓòÉ 25.1. Current voices ΓòÉΓòÉΓòÉ
Currently there are a number of voices available in Festival and we expect
that number to increase. Each is selected via a function named 'voice_*'
which sets up the waveform synthesizer, phone set, lexicon, duration and
intonation models (and anything else necessary) for that speaker. These voice
setup functions are defined in 'lib/voices.scm'.
The current voice functions are
voice_rab_diphone A British English male RP speaker, Roger. This uses the
UniSyn residual excited LPC diphone synthesizer. The lexicon is the
computer users version of Oxford Advanced Learners' Dictionary, with
letter to sound rules trained from that lexicon. Intonation is
provided by a ToBI-like system using a decision tree to predict
accent and end tone position. The F0 itself is predicted as three
points on each syllable, using linear regression trained from the
Boston University FM database (f2b) and mapped to Roger's pitch
range. Duration is predicted by decision tree, predicting zscore
durations for segments, trained from the 460 TIMIT sentences spoken by
another British male speaker.
voice_ked_diphone An American English male speaker, Kurt. Again this uses the
UniSyn residual excited LPC diphone synthesizer. This uses the CMU
lexicon, and letter to sound rules trained from it. Intonation as
with Roger is trained from the Boston University FM Radio corpus.
Duration for this voice also comes from that database.
voice_kal_diphone An American English male speaker. Again this uses the UniSyn
residual excited LPC diphone synthesizer. And like ked, uses the
CMU lexicon, and letter to sound rules trained from it. Intonation
as with Roger is trained from the Boston University FM Radio corpus.
Duration for this voice also comes from that database. This voice
was built in two days work and is at least as good as ked due to us
understanding the process better. The diphone labels were
autoaligned with hand correction.
voice_don_diphone Steve Isard's LPC based diphone synthesizer, Donovan
diphones. The other parts of this voice, lexicon, intonation, and
duration are the same as voice_rab_diphone described above. The
quality of the diphones is not as good as the other voices because
it uses spike excited LPC. Although the quality is not as good it
is much faster and the database is much smaller than the others.
voice_el_diphone A male Castilian Spanish speaker, using the Eduardo Lopez
diphones. Alistair Conkie and Borja Etxebarria did much to make
this. It has improved recently but is not as comprehensive as our
English voices.
voice_gsw_diphone This offers a male RP speaker, Gordon, famed for many
previous CSTR synthesizers, using the standard diphone module. Its
higher levels are very similar to the Roger voice above. This voice
is not in the standard distribution, and is unlikely to be added for
commercial reasons, even though it sounds better than Roger.
voice_en1_mbrola The Roger diphone set using the same front end as
voice_rab_diphone but uses the MBROLA diphone synthesizer for
waveform synthesis. The MBROLA synthesizer and Roger diphone
database (called en1) is not distributed by CSTR but is available
for non-commercial use for free from
http://tcts.fpms.ac.be/synthesis/mbrola.html. We do however provide
the Festival part of the voice in 'festvox_en1.tar.gz'.
voice_us1_mbrola A female American English voice using our standard US English
front end and the us1 database for the MBROLA diphone synthesizer
for waveform synthesis. The MBROLA synthesizer and the us1 diphone
database is not distributed by CSTR but is available for
non-commercial use for free from
http://tcts.fpms.ac.be/synthesis/mbrola.html. We provide the
Festival part of the voice in 'festvox_us1.tar.gz'.
voice_us2_mbrola A male American English voice using our standard US English
front end and the us2 database for the MBROLA diphone synthesizer
for waveform synthesis. The MBROLA synthesizer and the us2 diphone
database is not distributed by CSTR but is available for
non-commercial use for free from
http://tcts.fpms.ac.be/synthesis/mbrola.html. We provide the
Festival part of the voice in 'festvox_us2.tar.gz'.
voice_us3_mbrola Another male American English voice using our standard US
English front end and the us2 database for the MBROLA diphone
synthesizer for waveform synthesis. The MBROLA synthesizer and the
us2 diphone database is not distributed by CSTR but is available for
non-commercial use for free from
http://tcts.fpms.ac.be/synthesis/mbrola.html. We provide the
Festival part of the voice in 'festvox_us1.tar.gz'.
Other voices will become available through time. Groups other than CSTR are
working on new voices. In particular, OGI's CSLU has released a number of
American English voices, two Mexican Spanish voices and two German voices.
All use CSLU's own residual excited LPC synthesizer, which is distributed
as a plug-in for Festival (see http://www.cse.ogi.edu/CSLU/research/TTS for
details).
Other languages are being worked on: German, Basque, Welsh, Greek and Polish
voices have already been developed and could be released soon. CSTR has a set
of Klingon diphones, though the text analysis for Klingon still requires some
work. (If anyone has access to a good Klingon continuous speech corpus, please
let us know.)
Pointers and examples of voices developed at CSTR and elsewhere will be posted
on the Festival home page.
ΓòÉΓòÉΓòÉ 25.2. Building a new voice ΓòÉΓòÉΓòÉ
This section runs through the definition of a new voice in Festival. Although
this voice is simple (it is a simplified version of the distributed Spanish
voice) it shows all the major parts that must be defined to get Festival to
speak in a new voice. Thanks go to Alistair Conkie for helping me define this,
but as I don't speak Spanish there are probably many mistakes. Hopefully its
pedagogical use is better than its ability to be understood in Castille.
A much more detailed document on building voices in Festival has been written
and is recommended reading for anyone attempting to add a new voice to
Festival black99. The information here is a little sparse, though it gives the
basic requirements.
The general method for defining a new voice is to define the parameters for
all the various sub-parts (e.g. phoneset, duration parameters, intonation
parameters, etc.), then define a function of the form voice_NAME which when
called will actually select the voice.
ΓòÉΓòÉΓòÉ 25.2.1. Phoneset ΓòÉΓòÉΓòÉ
For most new languages and often for new dialects, a new phoneset is required.
It is really the basic building block of a voice and most other parts are
defined in terms of this set, so defining it first is a good start.
(defPhoneSet
spanish
;;; Phone Features
(;; vowel or consonant
(vc + -)
;; vowel length: short long diphthong schwa
(vlng s l d a 0)
;; vowel height: high mid low
(vheight 1 2 3 -)
;; vowel frontness: front mid back
(vfront 1 2 3 -)
;; lip rounding
(vrnd + -)
;; consonant type: stop fricative affricative nasal liquid
(ctype s f a n l 0)
;; place of articulation: labial alveolar palatal labio-dental
;; dental velar
(cplace l a p b d v 0)
;; consonant voicing
(cvox + -)
)
;; Phone set members (features are not! set properly)
(
(# - 0 - - - 0 0 -)
(a + l 3 1 - 0 0 -)
(e + l 2 1 - 0 0 -)
(i + l 1 1 - 0 0 -)
(o + l 3 3 - 0 0 -)
(u + l 1 3 + 0 0 -)
(b - 0 - - + s l +)
(ch - 0 - - + a a -)
(d - 0 - - + s a +)
(f - 0 - - + f b -)
(g - 0 - - + s p +)
(j - 0 - - + l a +)
(k - 0 - - + s p -)
(l - 0 - - + l d +)
(ll - 0 - - + l d +)
(m - 0 - - + n l +)
(n - 0 - - + n d +)
(ny - 0 - - + n v +)
(p - 0 - - + s l -)
(r - 0 - - + l p +)
(rr - 0 - - + l p +)
(s - 0 - - + f a +)
(t - 0 - - + s t +)
(th - 0 - - + f d +)
(x - 0 - - + a a -)
)
)
(PhoneSet.silences '(#))
Note some phonetic features may be wrong.
ΓòÉΓòÉΓòÉ 25.2.2. Lexicon and LTS ΓòÉΓòÉΓòÉ
Spanish is a language whose pronunciation can almost completely be predicted
from its orthography, so in this case we do not need a list of words and their
pronunciations and can do most of the work with letter to sound rules.
Let us first make a lexicon structure as follows
(lex.create "spanish")
(lex.set.phoneset "spanish")
However if we did just want a few entries to test our system without building
any letter to sound rules we could add entries directly to the addenda. For
example
(lex.add.entry
'("amigos" nil (((a) 0) ((m i) 1) ((g o s) 0))))
A letter to sound rule system for Spanish is quite simple in the format
supported by Festival. The following is a good start to a full set.
(lts.ruleset
; Name of rule set
spanish
; Sets used in the rules
(
(LNS l n s )
(AEOU a e o u )
(AEO a e o )
(EI e i )
(BDGLMN b d g l m n )
)
; Rules
(
( [ a ] = a )
( [ e ] = e )
( [ i ] = i )
( [ o ] = o )
( [ u ] = u )
( [ "'" a ] = a1 ) ;; stressed vowels
( [ "'" e ] = e1 )
( [ "'" i ] = i1 )
( [ "'" o ] = o1 )
( [ "'" u ] = u1 )
( [ b ] = b )
( [ v ] = b )
( [ c ] "'" EI = th )
( [ c ] EI = th )
( [ c h ] = ch )
( [ c ] = k )
( [ d ] = d )
( [ f ] = f )
( [ g ] "'" EI = x )
( [ g ] EI = x )
( [ g u ] "'" EI = g )
( [ g u ] EI = g )
( [ g ] = g )
( [ h u e ] = u e )
( [ h i e ] = i e )
( [ h ] = )
( [ j ] = x )
( [ k ] = k )
( [ l l ] # = l )
( [ l l ] = ll )
( [ l ] = l )
( [ m ] = m )
( [ ~ n ] = ny )
( [ n ] = n )
( [ p ] = p )
( [ q u ] = k )
( [ r r ] = rr )
( # [ r ] = rr )
( LNS [ r ] = rr )
( [ r ] = r )
( [ s ] BDGLMN = th )
( [ s ] = s )
( # [ s ] C = e s )
( [ t ] = t )
( [ w ] = u )
( [ x ] = k s )
( AEO [ y ] = i )
( # [ y ] # = i )
( [ y ] = ll )
( [ z ] = th )
))
We could simply set our lexicon to use the above letter to sound system with
the following command
(lex.set.lts.ruleset 'spanish)
But this would not deal with upper case letters. Instead of writing new rules
for upper case letters we can arrange for a Lisp function to be called when
looking up a word, intercepting the lookup with our own function. First we
state that unknown words should call a function, and then define the function
we wish called. The actual link to ensure our function will be called is made
below at lexicon selection time.
(define (spanish_lts word features)
"(spanish_lts WORD FEATURES)
Using letter to sound rules build a spanish pronunciation of WORD."
(list word
nil
(lex.syllabify.phstress (lts.apply (downcase word) 'spanish))))
(lex.set.lts.method spanish_lts)
In the function we downcase the word and apply the LTS rules to it. Next we
syllabify it and return the created lexical entry.
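To check the rules interactively we can apply them directly. Assuming the
ruleset above has been loaded, a session should look something like the
following (the pronunciation shown is what the rules above produce for this
word):

festival> (lts.apply "amigos" 'spanish)
(a m i g o s)

Passing the result through lex.syllabify.phstress, as spanish_lts does, then
adds the syllable structure and stress marking.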
ΓòÉΓòÉΓòÉ 25.2.3. Phrasing ΓòÉΓòÉΓòÉ
Without detailed labelled databases we cannot build statistical models of
phrase breaks, but we can simply build a phrase break model based on
punctuation. The following is a CART tree to predict simple breaks from
punctuation.
(set! spanish_phrase_cart_tree
'
((lisp_token_end_punc in ("?" "." ":"))
((BB))
((lisp_token_end_punc in ("'" "\"" "," ";"))
((B))
((n.name is 0) ;; end of utterance
((BB))
((NB))))))
ΓòÉΓòÉΓòÉ 25.2.4. Intonation ΓòÉΓòÉΓòÉ
For intonation there are a number of simple options that do not require
training data. For this example we will simply use a hat pattern on all
stressed syllables in content words and on single syllable content words
(i.e. the Simple intonation method). Thus we need an accent prediction CART
tree.
(set! spanish_accent_cart_tree
'
((R:SylStructure.parent.gpos is content)
((stress is 1)
((Accented))
((position_type is single)
((Accented))
((NONE))))
((NONE))))
We also need to specify the pitch range of our speaker. We will be using a
male Spanish diphone database with the following range
(set! spanish_el_int_simple_params
'((f0_mean 120) (f0_std 30)))
ΓòÉΓòÉΓòÉ 25.2.5. Duration ΓòÉΓòÉΓòÉ
We will use the trick mentioned above for duration prediction: rather than
predicting zscores with the zscore CART tree method, we will use it to
predict multiplicative factors.
The tree predicts longer durations in stressed syllables and in clause initial
and clause final syllables.
(set! spanish_dur_tree
'
((R:SylStructure.parent.R:Syllable.p.syl_break > 1 ) ;; clause initial
((R:SylStructure.parent.stress is 1)
((1.5))
((1.2)))
((R:SylStructure.parent.syl_break > 1) ;; clause final
((R:SylStructure.parent.stress is 1)
((2.0))
((1.5)))
((R:SylStructure.parent.stress is 1)
((1.2))
((1.0))))))
In addition to the tree we need durations for each phone in the set
(set! spanish_el_phone_data
'(
(# 0.0 0.250)
(a 0.0 0.090)
(e 0.0 0.090)
(i 0.0 0.080)
(o 0.0 0.090)
(u 0.0 0.080)
(b 0.0 0.065)
(ch 0.0 0.135)
(d 0.0 0.060)
(f 0.0 0.100)
(g 0.0 0.080)
(j 0.0 0.100)
(k 0.0 0.100)
(l 0.0 0.080)
(ll 0.0 0.105)
(m 0.0 0.070)
(n 0.0 0.080)
(ny 0.0 0.110)
(p 0.0 0.100)
(r 0.0 0.030)
(rr 0.0 0.080)
(s 0.0 0.110)
(t 0.0 0.085)
(th 0.0 0.100)
(x 0.0 0.130)
))
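To make the trick concrete: on the usual reading of the Tree_ZScores method,
the predicted duration is mean + predicted_value * stddev, taking the mean and
stddev fields from the per-phone data. Since every mean above is 0.0 and the
'stddev' field holds the phone's average duration, the tree's prediction acts
as a simple multiplicative factor. For example a stressed clause-final /a/
would get

   0.0 + 2.0 * 0.090 = 0.180 seconds

while an unstressed /a/ in neither clause-initial nor clause-final position
keeps its average 0.090 seconds (factor 1.0).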
ΓòÉΓòÉΓòÉ 25.2.6. Waveform synthesis ΓòÉΓòÉΓòÉ
There are a number of choices for waveform synthesis currently supported.
MBROLA supports Spanish, so we could use that, but their Spanish diphones in
fact use a slightly different phoneset, so we would need to change the above
definitions to use it effectively. Here we will use a diphone database for
Spanish recorded by Eduardo Lopez when he was a Masters student some years
ago.
Here we simply load our pre-built diphone database
(us_diphone_init
(list
'(name "el_lpc_group")
(list 'index_file
(path-append spanish_el_dir "group/ellpc11k.group"))
'(grouped "true")
'(default_diphone "#-#")))
ΓòÉΓòÉΓòÉ 25.2.7. Voice selection function ΓòÉΓòÉΓòÉ
The standard way to define a voice in Festival is to define a function of the
form voice_NAME which selects all the appropriate parameters. Because the
definition below follows the above definitions we know that everything
appropriate has been loaded into Festival and hence we just need to select the
appropriate parameters.
(define (voice_spanish_el)
"(voice_spanish_el)
Set up synthesis for Male Spanish speaker: Eduardo Lopez"
(voice_reset)
(Parameter.set 'Language 'spanish)
;; Phone set
(Parameter.set 'PhoneSet 'spanish)
(PhoneSet.select 'spanish)
(set! pos_lex_name nil)
;; Phrase break prediction by punctuation
(set! pos_supported nil)
;; Phrasing
(set! phrase_cart_tree spanish_phrase_cart_tree)
(Parameter.set 'Phrase_Method 'cart_tree)
;; Lexicon selection
(lex.select "spanish")
;; Accent prediction
(set! int_accent_cart_tree spanish_accent_cart_tree)
(set! int_simple_params spanish_el_int_simple_params)
(Parameter.set 'Int_Method 'Simple)
;; Duration prediction
(set! duration_cart_tree spanish_dur_tree)
(set! duration_ph_info spanish_el_phone_data)
(Parameter.set 'Duration_Method 'Tree_ZScores)
;; Waveform synthesizer: diphones
(Parameter.set 'Synth_Method 'UniSyn)
(Parameter.set 'us_sigpr 'lpc)
(us_db_select 'el_lpc_group)
(set! current-voice 'spanish_el)
)
(provide 'spanish_el)
ΓòÉΓòÉΓòÉ 25.2.8. Last remarks ΓòÉΓòÉΓòÉ
We save the above definitions in a file 'spanish_el.scm'. Now we can declare
the new voice to Festival. See Defining a new voice for a description of
methods for adding new voices. For testing purposes we can explicitly load
the file 'spanish_el.scm'.
The voice is now available for use in Festival.
festival> (voice_spanish_el)
spanish_el
festival> (SayText "hola amigos")
<Utterance 0x04666>
As you can see, adding a new voice is not very difficult. Of course there is
quite a lot more to do than the above to add a high quality robust voice to
Festival. But as we can see, many of the basic tools that we wish to use
already exist. The main difference between the above voice and the English
voices already in Festival is that their models are better trained, from
databases. This produces, in general, better results, but the concepts behind
them are basically the same. All of those trainable methods may be
parameterized with data for new voices.
As Festival develops, more modules will be added with better support for
training new voices so in the end we hope that adding in high quality new
voices is actually as simple as (or indeed simpler than) the above
description.
ΓòÉΓòÉΓòÉ 25.2.9. Resetting globals ΓòÉΓòÉΓòÉ
Because the version of Scheme used in Festival has only a single flat name
space it is unfortunately too easy for voices to set some global which
accidentally affects all other voices selected after it. Because of this we
have introduced a convention to try to minimise the chance of such
interference. Each voice function defined should always call voice_reset at
the start. This will reset any globals and also call a tidy up function
provided by the previous voice function.
Likewise in your new voice function you should provide a tidy up function to
reset any non-standard global variables you set. The function
current_voice_reset will be called by voice_reset. If the value of
current_voice_reset is nil then it is not called. voice_reset sets
current_voice_reset to nil, after calling it.
For example, suppose some new voice requires the audio device to be directed
to a different machine. In this example we make the giant's voice go through
the netaudio machine big_speakers while the standard voices go through
small_speakers.
Although we can easily select the machine big_speakers for output when our
voice_giant is called, we also need to set it back when the next voice is
selected, and we don't want to have to modify every other voice defined in the
system. Let us first define two functions to select the audio output.
(define (select_big)
(set! giant_previous_audio (getenv "AUDIOSERVER"))
(setenv "AUDIOSERVER" "big_speakers"))
(define (select_normal)
(setenv "AUDIOSERVER" giant_previous_audio))
Note we save the previous value of AUDIOSERVER rather than simply assuming it
was small_speakers.
Our definition of voice_giant will look something like
(define (voice_giant)
"comment comment ..."
(voice_reset) ;; get into a known state
(select_big)
;;; other giant voice parameters
...
(set! current_voice_reset select_normal)
(set! current-voice 'giant))
The obvious question is which variables a voice should reset. Unfortunately
there is no definitive answer to that. To a certain extent I don't want to
define that list, as there will be many variables set by various people using
Festival which are not in the original distribution and we don't want to
restrict them. The longer term answer is some form of partitioning of the
Scheme name space, perhaps having voice local variables (cf. Emacs buffer
local variables). But ultimately a voice may set global variables which could
redefine the operation of later selected voices, and there seems no real way
to stop that while keeping the generality of the system.
Note the convention of setting the global current-voice at the end of any
voice definition file. We do not enforce this but probably should. The
variable current-voice at any time should identify the current voice; the
voice description information (described below) will relate this name to
properties identifying it.
ΓòÉΓòÉΓòÉ 25.3. Defining a new voice ΓòÉΓòÉΓòÉ
As there are a number of voices available for Festival and they may or may not
exist in different installations, we have tried to make it as simple as
possible to add new voices to the system without having to change any of the
basic distribution. In fact, if the voices use the following standard method
for describing themselves it is merely a matter of unpacking them in order for
them to be used by the system.
The variable voice-path contains a list of directories where voices will be
automatically searched for. If this is not set, it is set automatically by
appending '/voices/' to all paths in the Festival load-path. You may add new
directories explicitly to this variable in your 'sitevars.scm' file or your
own '.festivalrc' as you wish.
Each voice directory is assumed to be of the form
LANGUAGE/VOICENAME/
Within the VOICENAME/ directory itself it is assumed there is a file
'festvox/VOICENAME.scm' which when loaded will define the voice itself. The
actual voice function should be called voice_VOICENAME.
For example the voices distributed with the standard Festival distribution all
unpack in 'festival/lib/voices'. The American voice 'ked_diphone' unpacks into
festival/lib/voices/english/ked_diphone/
Its actual definition file is in
festival/lib/voices/english/ked_diphone/festvox/ked_diphone.scm
Note the name of the directory and the name of the Scheme definition file must
be the same.
Alternative voices, using perhaps a different encoding of the database but the
same front end, may be defined in the same way by using symbolic links in the
language directory to the main directory. For example a PSOLA version of the
ked voice may be defined in
festival/lib/voices/english/ked_diphone/festvox/ked_psola.scm
Adding a symbolic link in 'festival/lib/voices/english/' to 'ked_diphone'
called 'ked_psola' will allow that voice to be automatically registered when
Festival starts up.
Note that this method doesn't actually load the voices it finds; that could be
prohibitively time consuming at start up. It blindly assumes that there is a
file 'VOICENAME/festvox/VOICENAME.scm' to load. An autoload definition is
given for voice_VOICENAME which, when called, will load that file and call the
real definition if it exists in the file.
This is only a recommended method to make adding new voices easier; it may be
ignored if you wish. However we still recommend that even if you use your own
conventions for adding new voices you consider using the autoload function to
define them in, for example, the 'siteinit.scm' file or '.festivalrc'. The
autoload function takes three arguments: a function name, a file containing
the actual definition and a comment. For example a voice can be defined
explicitly by
(autoload voice_f2b "/home/awb/data/f2b/ducs/f2b_ducs"
"American English female f2b")
Of course you can also load the definition file explicitly if you wish.
In order to allow the system to start making intelligent use of voices, we
recommend that all voice definitions include a call to the function
proclaim_voice. This allows the system to know some properties of the voice
such as language, gender and dialect. The proclaim_voice function takes two
arguments: a name (e.g. rab_diphone) and an assoc list of features and values.
Currently we require language, gender, dialect and description, the last
being a textual description of the voice itself. An example proclamation is
(proclaim_voice
'rab_diphone
'((language english)
(gender male)
(dialect british)
(description
"This voice provides a British RP English male voice using a
residual excited LPC diphone synthesis method. It uses a
modified Oxford Advanced Learners' Dictionary for pronunciations.
Prosodic phrasing is provided by a statistically trained model
using part of speech and local distribution of breaks. Intonation
is provided by a CART tree predicting ToBI accents and an F0
contour generated from a model trained from natural speech. The
duration model is also trained from data using a CART tree.")))
There are functions to access a description. voice.description will return
the description for a given voice and will load that voice if it is not
already loaded. voice.describe will describe the given voice by synthesizing
the textual description using the current voice. It would be nice to use the
voice itself to give a self introduction, but unfortunately that introduces
the problem of deciding which language the description should be in; we are
not all as fluent in Welsh as we'd like to be.
The function voice.list will list the potential voices in the system. These
are the names of voices which have been found in the voice-path. As they have
not actually been loaded they cannot be confirmed as usable voices. One
solution to this would be to load all voices at start up time, which would
allow confirmation that they exist and would give their full descriptions
through proclaim_voice. But start up is already too slow in Festival so we
have to accept this state for the time being. Splitting the description of
the voice from the actual definition is a possible solution to this problem
but we have not yet looked into this.
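As a quick illustration (a sketch, not output from an actual session), the
access functions described above might be used as follows:

(voice.list)                      ;; names of all voices found on voice-path
(voice.description 'rab_diphone) ;; returns the proclaimed description,
                                 ;; loading the voice if necessary
(voice.describe 'rab_diphone)    ;; synthesizes the description aloud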
ΓòÉΓòÉΓòÉ 26. Tools ΓòÉΓòÉΓòÉ
A number of basic data manipulation tools are supported by Festival. These
often make building new modules very easy and are already used in many of the
existing modules. They typically offer a Scheme method for entering data, and
Scheme and C++ functions for evaluating it.
Regular expressions
CART trees Building and using CART
Ngrams Building and using Ngrams
Viterbi decoder Using the Viterbi decoder
Linear regression Building and using linear regression models
ΓòÉΓòÉΓòÉ 26.1. Regular expressions ΓòÉΓòÉΓòÉ
Regular expressions are a formal method for describing a certain class of
mathematical languages. They may be viewed as patterns which match some set
of strings. They are very common in many software tools such as scripting
languages like the UNIX shell, PERL, awk, Emacs etc. Unfortunately the exact
form of regular expressions often differs slightly between different
applications, making their use a little tricky.
Festival supports regular expressions based mainly on the form used in the GNU
libg++ Regex class, though we have our own implementation of it. Our
implementation (EST_Regex) is actually based on Henry Spencer's 'regex.c' as
distributed with BSD 4.4.
Regular expressions are represented as character strings which are interpreted
as regular expressions by certain Scheme and C++ functions. Most characters
in a regular expression are treated as literals and match only that character
but a number of others have special meaning. Some characters may be escaped
with preceding backslashes to change them from operators to literals (or
sometimes literals to operators).
. Matches any character.
$ matches end of string
^ matches beginning of string
X* matches zero or more occurrences of X; X may be a character, range
or parenthesized expression.
X+ matches one or more occurrences of X; X may be a character, range or
parenthesized expression.
X? matches zero or one occurrence of X; X may be a character, range or
parenthesized expression.
[...] a range matches any of the characters in the brackets. The range
operator "-" allows specification of ranges e.g. a-z for all lower
case characters. If the first character of the range is ^ then it
matches any character except those specified in the range. If you
wish - to be in the range you must put it first.
\\(...\\) Treat contents of parentheses as a single object allowing operators
*, +, ? etc to operate on more than single characters.
X\\|Y matches either X or Y. X and Y may be single characters, ranges or
parenthesized expressions.
Note that actually only one backslash is needed before a character to escape
it, but because these expressions are most often contained within Scheme or
C++ strings, the escape mechanism for those strings requires that the
backslash itself be escaped, hence you will most often be required to type
two backslashes.
Some examples may help in understanding the use of regular expressions.
a.b matches any three letter string starting with an a and ending with a
b.
.*a matches any string ending in an a
.*a.* matches any string containing an a
[A-Z].* matches any string starting with a capital letter
[0-9]+ matches any string of digits
-?[0-9]+\\(\\.[0-9]+\\)? matches any positive or negative real number. Note
the optional preceding minus sign and the optional part containing the
point and following digits. The point itself must be escaped as a
dot on its own matches any character.
[^aeiouAEIOU]+ matches any non-empty string which doesn't contain a vowel
\\([Ss]at\\(urday\\)?\\)\\|\\([Ss]un\\(day\\)?\\) matches Saturday and Sunday
in various ways
The Scheme function string-matches takes a string and a regular expression and
returns t if the regular expression matches the string and nil otherwise.
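For example, in a Festival session string-matches can be used to test patterns
like those above:

festival> (string-matches "1997" "[0-9]+")
t
festival> (string-matches "year" "[0-9]+")
nil

Remember that when the pattern is typed as a Scheme string, each backslash in
the regular expression must itself be doubled.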
ΓòÉΓòÉΓòÉ 26.2. CART trees ΓòÉΓòÉΓòÉ
One of the basic tools available with Festival is a system for building and
using Classification and Regression Trees (breiman84). This standard
statistical method can be used to predict both categorical and continuous data
from a set of feature vectors.
The tree itself contains yes/no questions about features and ultimately
provides either a probability distribution, when predicting categorical values
(classification tree), or a mean and standard deviation when predicting
continuous values (regression tree). Well defined techniques can be used to
construct an optimal tree from a set of training data. The program 'wagon',
developed in conjunction with Festival and distributed with the speech tools,
provides a basic but increasingly powerful method for constructing trees.
A tree need not be automatically constructed. CART trees have the advantage
over some other automatic training methods, such as neural networks and
linear regression, in that their output is more readable and often
understandable by humans. Importantly, this makes it possible to modify them.
CART trees may also be fully hand constructed. This is used, for example, in
generating some duration models for languages for which we do not yet have
full databases to train from.
A CART tree has the following syntax
CART ::= QUESTION-NODE || ANSWER-NODE
QUESTION-NODE ::= ( QUESTION YES-NODE NO-NODE )
YES-NODE ::= CART
NO-NODE ::= CART
QUESTION ::= ( FEATURE in LIST )
QUESTION ::= ( FEATURE is STRVALUE )
QUESTION ::= ( FEATURE = NUMVALUE )
QUESTION ::= ( FEATURE > NUMVALUE )
QUESTION ::= ( FEATURE < NUMVALUE )
QUESTION ::= ( FEATURE matches REGEX )
ANSWER-NODE ::= CLASS-ANSWER || REGRESS-ANSWER
CLASS-ANSWER ::= ( (VALUE0 PROB) (VALUE1 PROB) ... MOST-PROB-VALUE )
REGRESS-ANSWER ::= ( ( STANDARD-DEVIATION MEAN ) )
Note that answer nodes are distinguished by their car not being atomic.
The interpretation of a tree is with respect to a Stream_Item. The FEATURE in
a tree is a standard feature (see Features).
The following example tree is used in one of the Spanish voices to predict
variations from average durations.
(set! spanish_dur_tree
'
((R:SylStructure.parent.R:Syllable.p.syl_break > 1 ) ;; clause initial
((R:SylStructure.parent.stress is 1)
((1.5))
((1.2)))
((R:SylStructure.parent.syl_break > 1) ;; clause final
((R:SylStructure.parent.stress is 1)
((2.0))
((1.5)))
((R:SylStructure.parent.stress is 1)
((1.2))
((1.0))))))
It is applied to the segment stream to give a factor to multiply the average
by.
wagon is constantly improving and, with version 1.2 of the speech tools, may
now be considered fairly stable for its basic operations. Experimental
features are described in the help it gives. See the Speech Tools manual for
a more comprehensive discussion of using 'wagon'.
However the above format of trees is similar to those produced by many other
systems and hence it is reasonable to translate their formats into one which
Festival can use.
ΓòÉΓòÉΓòÉ 26.3. Ngrams ΓòÉΓòÉΓòÉ
Bigrams, trigrams, and general ngrams are used in the part of speech tagger
and the phrase break predictor. An Ngram C++ class is defined in the speech tools
library and some simple facilities are added within Festival itself.
Ngrams may be built from files of tokens using the program ngram_build which
is part of the speech tools. See the speech tools documentation for details.
Within Festival ngrams may be named and loaded from files and used when
required. The LISP function load_ngram takes a name and a filename as
arguments and loads the Ngram from that file. For an example of its use once
loaded see 'src/modules/base/pos.cc' or 'src/modules/base/phrasify.cc'.
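For example, a previously built ngram might be loaded and given a name with
something like the following (the filename here is purely hypothetical):

(load_ngram 'pos-tri-gram "/usr/local/festival/lib/pos-tri-gram.ngrambin")

The name given as the first argument is how the ngram is then referred to
elsewhere, for example as the ngramname parameter to the Viterbi decoder.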
ΓòÉΓòÉΓòÉ 26.4. Viterbi decoder ΓòÉΓòÉΓòÉ
Another common tool is a Viterbi decoder. This C++ class is defined in the
speech tools library 'speech_tools/include/EST_viterbi.h' and
'speech_tools/stats/EST_viterbi.cc'. A Viterbi decoder requires two functions
at declaration time. The first constructs candidates at each stage, while the
second combines paths. A number of options are available (which may change).
The prototypical example of use is in the part of speech tagger, which uses
standard Ngram models to predict probabilities of tags. See
'src/modules/base/pos.cc' for an example.
The Viterbi decoder can also be used through the Scheme function Gen_Viterbi.
This function respects the parameters defined in the variable get_vit_params.
Like other modules this parameter list is an assoc list of feature name and
value. The parameters supported are:
Relation The name of the relation the decoder is to be applied to.
cand_functionA function that is to be called for each item and that will
return a list of candidates (with probabilities).
return_featThe name of a feature that the best candidate is to be returned in
for each item in the named relation.
p_word The previous word to the first item in the named relation (only used
when ngrams are the "language model").
pp_word The previous previous word to the first item in the named relation
(only used when ngrams are the "language model").
ngramname the name of an ngram (loaded by ngram.load) to be used as a
"language model".
wfstmname the name of a WFST (loaded by wfst.load) to be used as a "language
model", this is ignored if an ngramname is also specified.
debug If specified more debug features are added to the items in the
relation.
gscale_p Grammar scaling factor.
Here is a short example to help make the use of this facility clearer.
There are two parts required for the Viterbi decoder: a set of candidate
observations and some "language model". For the math to work properly the
candidate observations must be reverse probabilities (for each candidate, the
probability of the observation given the candidate, rather than the
probability of the candidate given the observation). By Bayes' rule these can
be calculated as the probability of the candidate given the observation
divided by the probability of the candidate in isolation (dropping the
constant probability of the observation).
For the sake of simplicity let us assume we have a lexicon mapping words to
distributions of part of speech tags with reverse probabilities, and a
trigram model called pos-tri-gram over sequences of part of speech tags.
First we must define the candidate function
(define (pos_cand_function w)
;; select the appropriate lexicon
(lex.select 'pos_lex)
;; return the list of cands with rprobs
(cadr
(lex.lookup (item.name w) nil)))
The returned candidate list would look something like
( (jj -9.872) (vbd -6.284) (vbn -5.565) )
Our part of speech tagger function would look something like this
(define (pos_tagger utt)
(set! get_vit_params
(list
(list 'Relation "Word")
(list 'return_feat 'pos_tag)
(list 'p_word "punc")
(list 'pp_word "nn")
(list 'ngramname "pos-tri-gram")
(list 'cand_function 'pos_cand_function)))
(Gen_Viterbi utt)
utt)
This will assign the optimal part of speech tags to each word in utt.
ΓòÉΓòÉΓòÉ 26.5. Linear regression ΓòÉΓòÉΓòÉ
The linear regression model takes models built by some external package and
computes a prediction as a weighted sum of feature values. A model consists
of a list of features. The first should be the atom Intercept plus a value.
Each following entry in the list should consist of a feature (see Features) followed by a
weight. An optional third element may be a list of atomic values. If the
result of the feature is a member of this list the feature's value is treated
as 1 else it is 0. This third argument allows an efficient way to map
categorical values into numeric values. For example, from the F0 prediction
model in 'lib/f2bf0lr.scm'. The first few parameters are
(set! f2b_f0_lr_start
'(
( Intercept 160.584956 )
( Word.Token.EMPH 36.0 )
( pp.tobi_accent 10.081770 (H*) )
( pp.tobi_accent 3.358613 (!H*) )
( pp.tobi_accent 4.144342 (*? X*? H*!H* * L+H* L+!H*) )
( pp.tobi_accent -1.111794 (L*) )
...
)
Note the feature pp.tobi_accent returns an atom, and is hence tested with the
map groups specified as third arguments.
Models may be built from feature data (in the same format as used by 'wagon')
using the 'ols' program distributed with the speech tools library.
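Reading the model: the prediction is the Intercept value plus the sum of each
feature's value multiplied by its weight, where a feature with a map list
(third element) contributes 1 times its weight if the feature's value is in
the list and 0 otherwise. For example, for a syllable whose Word.Token.EMPH
is 0 and whose pp.tobi_accent is H*, the terms shown above contribute

   160.584956 + (0 * 36.0) + (1 * 10.081770) = 170.666726

to the predicted F0; the full model of course contains many more terms.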
ΓòÉΓòÉΓòÉ 27. Building models from databases ΓòÉΓòÉΓòÉ
Because our research interests tend towards creating statistical models
trained from real speech data, Festival offers various support for extracting
information from speech databases, in a way suitable for building models.
Models for accent prediction, F0 generation, duration, vowel reduction,
homograph disambiguation, phrase break assignment and unit selection have been
built using Festival to extract and process various databases.
Labelling databases Phones, syllables, words etc.
Extracting features Extraction of model parameters.
Building models Building stochastic models from features
ΓòÉΓòÉΓòÉ 27.1. Labelling databases ΓòÉΓòÉΓòÉ
In order for Festival to use a database it is most useful to build utterance
structures for each utterance in the database. As discussed earlier,
utterance structures contain relations of items. Given such a structure for
each utterance in a database we can easily read in the utterance
representation and access it, dumping information in a normalised way allowing
for easy building and testing of models.
Of course the level of labelling that exists, or that you are willing to do by
hand or using some automatic tool, for a particular database will vary. For
many purposes you will at least need phonetic labelling. Hand labelled data is
still better than auto-labelled data, but that could change. The size and
consistency of the data is important too.
For this discussion we will assume labels for: segments, syllables, words,
phrases, intonation events, and pitch targets. Some of these can be derived,
some need to be labelled. The process would not fail with less labelling, but
of course you wouldn't be able to extract as much information from the result.
In our databases these labels are in Entropic's Xlabel format, though it is
fairly easy to convert from any reasonable format.
Segment These give phoneme labels for files. Note that these labels must be
members of the phoneset that you will be using for this database.
Often phone label files may contain extra labels (e.g. beginning and
end silence) which are not really part of the phoneset. You should
remove (or re-label) these phones accordingly.
Word Again these will need to be provided. The end of the word should
come at the last phone in the word (or just after). Pauses/silences
should not be part of the word.
Syllable There is a chance these can be automatically generated from Word and
Segment files given a lexicon. Ideally these should include lexical
stress.
IntEvent These should ideally mark accent/boundary tone type for each
syllable, but this almost certainly requires hand-labelling. Also,
given that hand-labelling of accent type is harder and less
accurate, it is arguable whether anything more than an accented vs.
non-accented distinction can be used reliably.
Phrase This could just mark the last non-silence phone in each utterance,
or before any silence phones in the whole utterance.
Target This can be automatically derived from an F0 file and the Segment
files. Marking the mean F0 in each voiced phone seems to give
adequate results.
Once these files are created, an utterance file can be automatically created
from the above data. Note it is pretty easy to get the streams right, but
getting the relations between the streams is much harder. Firstly, labelling
is rarely accurate, and small windows of error must be allowed to ensure
things line up properly. The second problem is that some label files identify
point-type information (IntEvent and Target) while others identify segments
(e.g. Segment, Word, etc.). Relations have to know this in order to get the
linking right. For example, it is not right for all syllables between two
IntEvents to be linked to the IntEvent; only the Syllable the IntEvent is
within should be linked to it.
The script 'festival/examples/make_utts' is an example Festival script which
automatically builds the utterance files from the above labelled files.
The script, by default, assumes a hierarchy in a database directory of the
following form. Under a directory 'festival/', where all festival-specific
database information can be kept, a directory 'relations/' contains a
subdirectory for each basic relation (e.g. 'Segment/', 'Syllable/', etc.),
each of which contains the basic label files for that relation.
The following command will build a set of utterance structures (including
building the relations that link between these basic relations).
make_utts -phoneset radio festival/relations/Segment/*.Segment
This will create utterances in 'festival/utts/'. There are a number of
options to 'make_utts'; use '-h' to list them. The '-eval' option allows
extra Scheme code to be loaded which may be called by the utterance building
process. The function make_utts_user_function will be called on each
utterance created. Redefining it in database-specific loaded code allows
database-specific fixes to the utterances.
ΓòÉΓòÉΓòÉ 27.2. Extracting features ΓòÉΓòÉΓòÉ
The easiest way to extract features from a labelled database of the form
described in the previous section is by loading in each of the utterance
structures and dumping the desired features.
Using the same mechanism to extract the features as will eventually be used by
models built from the features has the important advantage of avoiding
spurious errors easily introduced when collecting data. For example a feature
such as n.accent in a Festival utterance will be defined as 0 when there is no
next accent. Extracting all the accents and using an external program to
calculate the next accent may make a different decision so that when the
generated model is used a different value for this feature will be produced.
Such mismatches in training models and actual use are unfortunately common, so
using the same mechanism to extract data for training, and for actual use is
worthwhile.
The recommended method for extracting features is the Festival script
'dumpfeats'. It basically takes a list of feature names and a list of
utterance files, and dumps the desired features.
Features may be dumped into a single file or into separate files, one for
each utterance. Feature names may be specified on the command line or in a
separate file. Extra code to define new features may be loaded too.
For example, suppose we wanted to save, for all segments in each utterance,
the duration, phone name, and previous and next phone names.
dumpfeats -feats "(segment_duration name p.name n.name)" \
-output feats/%s.dur -relation Segment \
festival/utts/*.utt
This will save these features in files named for the utterances they come
from, in the directory 'feats/'. The argument to '-feats' is treated as a
literal list only if it starts with a left parenthesis; otherwise it is
treated as the name of a file containing named features (unbracketed).
Extra code (for new feature definitions) may be loaded through the '-eval'
option. If the argument to '-eval' starts with a left parenthesis it is
treated as an s-expression rather than a filename, and is evaluated. If the
argument to '-output' contains "%s" it will be filled in with the utterance's
filename; if it is a simple filename the features from all utterances will be
saved in that same file. The features for each item in the named relation are
saved on a single line.
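Since each item's features appear whitespace-separated on one line, in the order given to '-feats', reading such a dump back is straightforward. A small Python sketch (the two data lines are invented, following the duration example above):

```python
# Sketch: reading dumped feature lines back into records.  Each line holds
# one item's features in the order requested; the values are invented.

dumped = """\
0.045 t pau ey
0.110 ey t k
"""

feature_names = ["segment_duration", "name", "p.name", "n.name"]

def read_feats(text, names):
    records = []
    for line in text.splitlines():
        fields = line.split()
        if fields:
            # pair each field with its feature name
            records.append(dict(zip(names, fields)))
    return records

records = read_feats(dumped, feature_names)
print(records[0]["segment_duration"])
# -> 0.045
```

Note the values come back as strings; a consumer would convert numeric fields (such as segment_duration) itself, just as wagon's description file declares which fields are floats.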
ΓòÉΓòÉΓòÉ 27.3. Building models ΓòÉΓòÉΓòÉ
This section describes how to build models from data extracted from databases
as described in the previous section. It uses the CART building program
'wagon', which is available in the speech tools distribution. But the data is
suitable for many other types of model building technique, such as linear
regression or neural networks.
Wagon is described in the speech tools manual, though we will cover simple use
here. To use Wagon you need a datafile and a data description file.
A datafile consists of a number of vectors, one per line, each containing the
same number of fields. This, not coincidentally, is exactly the format
produced by 'dumpfeats' described in the previous section. The data
description file describes the fields in the datafile and their range. Fields
may be of any of the following types: class (a list of symbols), floats, or
ignored. Wagon will build a classification tree if the first field (the
predictee) is of type class, or a regression tree if the first field is a
float. An example data description file would be
(
( duration float )
( name # @ @@ a aa ai au b ch d dh e e@ ei f g h i i@ ii jh k l m n
ng o oi oo ou p r s sh t th u u@ uh uu v w y z zh )
( n.name # @ @@ a aa ai au b ch d dh e e@ ei f g h i i@ ii jh k l m n
ng o oi oo ou p r s sh t th u u@ uh uu v w y z zh )
( p.name # @ @@ a aa ai au b ch d dh e e@ ei f g h i i@ ii jh k l m n
ng o oi oo ou p r s sh t th u u@ uh uu v w y z zh )
( R:SylStructure.parent.position_type 0 final initial mid single )
( pos_in_syl float )
( syl_initial 0 1 )
( syl_final 0 1)
( R:SylStructure.parent.R:Syllable.p.syl_break 0 1 3 )
( R:SylStructure.parent.syl_break 0 1 3 4 )
( R:SylStructure.parent.R:Syllable.n.syl_break 0 1 3 4 )
( R:SylStructure.parent.R:Syllable.p.stress 0 1 )
( R:SylStructure.parent.stress 0 1 )
( R:SylStructure.parent.R:Syllable.n.stress 0 1 )
)
The script 'speech_tools/bin/make_wagon_desc' goes some way to helping. Given
a datafile and a file containing the field names, it will construct an
approximation of the description file. This file should still be edited, as
all fields are treated as of type class by 'make_wagon_desc', and you may
want to change some of them to float.
The data file must be a single file, although the process described in the
previous section may have created a number of feature files. From a list of
file ids select, say, 80% of them as training data and cat them into a single
datafile. The remaining 20% may be catted together as test data.
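The split and concatenation can be sketched in Python as follows. The file ids and the 'feats/' layout follow the earlier examples but are otherwise illustrative; in practice a shuffled or held-out-speaker split may be preferable.

```python
# Sketch: select 80% of the file ids as training data and concatenate
# their per-utterance feature files; the remaining 20% become test data.
# File ids and the feats directory layout are invented for illustration.

file_ids = ["utt%03d" % i for i in range(100)]

cut = int(len(file_ids) * 0.8)
train_ids, test_ids = file_ids[:cut], file_ids[cut:]

def cat_feats(ids, out_path, feats_dir="feats"):
    # Equivalent of: cat feats/utt000.dur feats/utt001.dur ... > out_path
    with open(out_path, "w") as out:
        for fid in ids:
            with open("%s/%s.dur" % (feats_dir, fid)) as f:
                out.write(f.read())
```

The resulting training datafile is what is passed to wagon's -data option, and the test file to -test.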
To build a tree use a command like
wagon -desc DESCFILE -data TRAINFILE -test TESTFILE
The minimum cluster size (default 50) may be reduced using the command line
option -stop plus a number.
Varying the features and stop size may improve the results.
Building the models and getting good figures is only one part of the process.
You must integrate a model into Festival if it is going to be of any use. In
the case of CART trees generated by Wagon, Festival supports these directly.
In the case of CART trees predicting zscores, or factors to modify duration
averages, these can be used as is.
Note there are other options to Wagon which may help build better CART models.
Consult the chapter in the speech tools manual on Wagon for more information.
Other parts of the distributed system use CART trees and linear regression
models that were trained using the processes described in this chapter. Some
other parts of the distributed system use CART trees which were written by
hand and may be improved by properly applying these processes.
ΓòÉΓòÉΓòÉ 28. Programming ΓòÉΓòÉΓòÉ
This chapter covers aspects of programming within the Festival environment,
creating new modules, and modifying existing ones. It describes the basic
classes available and gives some particular examples of things you may wish
to add.
The source code A walkthrough of the source code
Writing a new module Example access of an utterance
ΓòÉΓòÉΓòÉ 28.1. The source code ΓòÉΓòÉΓòÉ
The ultimate authority on what happens in the system lies in the source code
itself. No matter how hard we try, and how automatic we make it, the source
code will always be ahead of the documentation. Thus if you are going to be
using Festival in a serious way, familiarity with the source is essential.
The lowest level functions are catered for in the Edinburgh Speech Tools, a
separate library distributed with Festival. The Edinburgh Speech Tool Library
offers the basic utterance structure, waveform file access, and other various
useful low-level functions which we share between different speech systems in
our work. See Section Overview of Edinburgh Speech Tools Library Manual.
The directory structure for the Festival distribution reflects the conceptual
split in the code.
./bin/ The user-level executable binaries and scripts that are part of the
festival system. These are simple symbolic links to the binaries or
if the system is compiled with shared libraries small wrap-around
shell scripts that set LD_LIBRARY_PATH appropriately
./doc/ This contains the texinfo documentation for the whole system. The
'Makefile' constructs the info and/or html version as desired. Note
that the festival binary itself is used to generate the lists of
functions and variables used within the system, so must be compiled
and in place to generate a new version of the documentation.
./examples/This contains various examples. Some are explained within this
manual, others are there just as examples.
./lib/ The basic Scheme parts of the system, including 'init.scm' the first
file loaded by festival at start-up time. Depending on your
installation, this directory may also contain subdirectories
containing lexicons, voices and databases. This directory and its
sub-directories are used by Festival at run-time.
./lib/etc/Executables for Festival's internal use. A subdirectory containing
at least the audio spooler will be automatically created (one for
each different architecture the system is compiled on). Scripts are
added to this top level directory itself.
./lib/voices/By default this contains the voices used by Festival including
their basic Scheme set up functions as well as the diphone
databases.
./lib/dicts/This contains various lexicon files distributed as part of the
system.
./config/ This contains the basic 'Makefile' configuration files for compiling
the system (run-time configuration is handled by Scheme in the
'lib/' directory). The file 'config/config', created as a copy of
the standard 'config/config-dist', is the installation-specific
configuration. In most cases a simple copy of the distribution file
will be sufficient.
./src/ The main C++/C source for the system.
./src/lib/Where the 'libFestival.a' is built.
./src/include/Where include files shared between various parts of the system
live. The file 'festival.h' provides access to most of the parts of
the system.
./src/main/Contains the top level C++ files for the actual executables. This
is directory where the executable binary 'festival' is created.
./src/arch/The main core of the Festival system. At present everything is
held in a single sub-directory './src/arch/festival/'. This contains
the basic core of the synthesis system itself. This directory
contains lisp front ends to access the core utterance architecture,
and phonesets, basic tools like client/server support, ngram
support, etc, and an audio spooler.
./src/modules/In contrast to the 'arch/' directory this contains the non-core
parts of the system. A set of basic example modules are included
with the standard distribution. These are the parts that do the
synthesis, the other parts are just there to make module writing
easier.
./src/modules/base/This contains some basic simple modules that weren't quite
big enough to deserve their own directory. Most importantly it
includes the Initialize module called by many synthesis methods
which sets up an utterance structure and loads in initial values.
This directory also contains phrasing, part of speech, and word
(syllable and phone construction from words) modules.
./src/modules/Lexicon/This is not really a module in the true sense (the Word
module is the main user of this). This contains functions to
construct, compile, and access lexicons (entries of words, part of
speech and pronunciations). This also contains a letter-to-sound
rule system.
./src/modules/Intonation/This contains various intonation systems, from the
very simple to quite complex parameter driven intonation systems.
./src/modules/Duration/This contains various duration prediction systems, from
the very simple (fixed duration) to quite complex parameter driven
duration systems.
./src/modules/UniSyn/A basic diphone synthesizer system, supporting a simple
database format (which can be grouped into a more efficient binary
representation). It is multi-lingual, and allows multiple databases
to be loaded at once. It offers a choice of concatenation methods
for diphones: residual excited LPC or PSOLA (TM) (which is not
distributed)
./src/modules/Text/Various text analysis functions, particularly the tokenizer
and utterance segmenter (from arbitrary files). This directory also
contains the support for text modes and SGML.
./src/modules/donovan/An LPC based diphone synthesizer. Very small and neat.
./src/modules/rxp/The Festival/Scheme front end to an XML parser written by
Richard Tobin of the University of Edinburgh's Language Technology
Group. rxp is now part of the speech tools rather than just
Festival.
./src/modules/parser/A simple interface to the Stochastic Context Free
Grammar parser in the speech tools library.
./src/modules/diphone/An optional module containing the previously used
diphone synthesizer.
./src/modules/clunitsA partial implementation of a cluster unit selection
algorithm as described in black97c.
./src/modules/Database rjc_synthesisThese consist of a new set of modules for
doing waveform synthesis. They are intended to be unit-size
independent (e.g. diphone, phone, non-uniform unit). Also
selection, prosodic modification, joining and signal processing are
separately defined. Unfortunately this code has not been exercised
enough to be considered stable enough for use in the default
synthesis method, but those working on new synthesis techniques may
be interested in integrating with these new modules. They may be
updated before the next full release of Festival.
./src/modules/*Other optional directories may be contained here containing
various research modules not yet part of the standard distribution.
See below for descriptions of how to add modules to the basic
system.
One intended use of Festival is to offer a software system where new modules may
be easily tested in a stable environment. We have tried to make the addition
of new modules easy, without requiring complex modifications to the rest of
the system.
All of the basic modules should really be considered merely as example
modules. Without much effort all of them could be improved.
ΓòÉΓòÉΓòÉ 28.2. Writing a new module ΓòÉΓòÉΓòÉ
This section gives a simple example of writing a new module, showing the
basic steps that must be done to create and add a new module that is
available for the rest of the system to use. Note that many things can now be
done solely in Scheme, and really only low-level, very intensive things (like
waveform synthesizers) need be coded in C++.
ΓòÉΓòÉΓòÉ 28.2.1. Example 1: adding new modules ΓòÉΓòÉΓòÉ
The example here is a duration module which sets durations of phones from a
given list of averages. To make this example more interesting, all durations
in accented syllables are increased by a factor of 1.5. Note that this is just
an example for its own sake; this (and much better techniques) could easily
be done within the system as it is at present, using a hand-crafted CART
tree.
Our new module, called Duration_Simple, can most easily be added to the
'./src/modules/Duration/' directory in a file 'simdur.cc'. You can worry
about the copyright notice, but after that you'll probably need the following
include
#include <festival.h>
The module itself must be declared in a fixed form: it receives a single LISP
form (an utterance) as an argument and returns that LISP form at the end.
Thus our definition will start
LISP FT_Duration_Simple(LISP utt)
{
Next we need to declare an utterance structure and extract it from the LISP
form. We also make a few other variable declarations
EST_Utterance *u = get_c_utt(utt);
EST_Item *s;
float end=0.0, dur;
LISP ph_avgs,ldur;
We cannot list the average durations for each phone in the source code as we
cannot tell which phoneset we are using (or what modifications we want to make
to durations between speakers). Therefore the phone and average duration
information is held in a Scheme variable for easy setting at run time. To use
the information in our C++ domain we must get that value from the Scheme
domain. This is done with the following statement.
ph_avgs = siod_get_lval("phoneme_averages","no phoneme durations");
The first argument to siod_get_lval is the Scheme name of a variable which has
been set to an assoc list of phone and average duration before this module is
called. See the variable phone_durations in 'lib/mrpa_durs.scm' for the
format. The second argument to siod_get_lval is an error message to be
printed if the variable phoneme_averages is not set. If the second argument
to siod_get_lval is NULL then no error is given and, if the variable is
unset, this function simply returns the Scheme value nil.
Now that we have the duration data we can go through each segment in the
utterance and add the duration. The loop looks like
for (s=u->relation("Segment")->head(); s != 0; s = next(s))
{
We can lookup the average duration of the current segment name using the
function siod_assoc_str. As arguments, it takes the segment name s->name()
and the assoc list of phones and duration.
ldur = siod_assoc_str(s->name(),ph_avgs);
Note the return value is actually a LISP pair (phone name and duration), or
nil if the phone isn't in the list. Here we check whether the segment is in
the list: if it is not, we print an error and set the duration to 100 ms; if
it is, the floating point number is extracted from the LISP pair.
if (ldur == NIL)
{
cerr << " Phoneme: " << s->name() << " no duration "
<< endl;
dur = 0.100;
}
else
dur = get_c_float(car(cdr(ldur)));
If this phone is in an accented syllable we wish to increase its duration by a
factor of 1.5. To find out if it is accented we use the feature system to
find the syllable this phone is part of and find out if that syllable is
accented.
if (ffeature(s,"R:SylStructure.parent.accented") == 1)
dur *= 1.5;
Now that we have the desired duration we increment the running end time by
our predicted duration for this segment and set the end of the current
segment.
end += dur;
s->fset("end",end);
}
Finally we return the utterance from the function.
return utt;
}
Once a module is defined it must be declared to the system so it may be
called. To do this one must call the function festival_def_utt_module which
takes a LISP name, the C++ function name and a documentation string describing
what the module does. This will automatically be available at run-time and
added to the manual. The call to this function should be added to the
initialization function in the directory you are adding the module to. The
function is called festival_DIRNAME_init(). If one doesn't exist you'll need
to create it.
In './src/modules/Duration/' the function festival_Duration_init() is at the end of
the file 'dur_aux.cc'. Thus we can add our new modules declaration at the end
of that function. But first we must declare the C++ function in that file.
Thus above that function we would add
LISP FT_Duration_Simple(LISP args);
Then at the end of the function festival_Duration_init() we would add
festival_def_utt_module("Duration_Simple",FT_Duration_Simple,
"(Duration_Simple UTT)\n\
Label all segments with average duration ... ");
In order for our new file to be compiled we must add it to the 'Makefile' in
that directory, to the SRCS variable. Then when we type make in './src/' our
new module will be properly linked in and available for use.
Of course we are not quite finished. We still have to say when our new
duration module should be called. When we set
(Parameter.set 'Duration_Method Duration_Simple)
for a voice, calls to the function utt.synth will use our new duration
module.
Note in earlier versions of Festival it was necessary to modify the duration
calling function in 'lib/duration.scm' but that is no longer necessary.
ΓòÉΓòÉΓòÉ 28.2.2. Example 2: accessing the utterance ΓòÉΓòÉΓòÉ
In this example we will make more direct use of the utterance structure,
showing the gory details of following relations in an utterance. This time we
will create a module that will name all syllables with a concatenation of the
names of the segments they are related to.
As before we need the same standard includes
#include "festival.h"
Now the definition of the function
LISP FT_Name_Syls(LISP utt)
{
As with the previous example we are called with an utterance LISP object and
will return the same. The first task is to extract the utterance object from
the LISP object.
EST_Utterance *u = get_c_utt(utt);
EST_Item *syl,*seg;
Now for each syllable in the utterance we want to find which segments are
related to it.
for (syl=u->relation("Syllable")->head(); syl != 0; syl = next(syl))
{
Here we declare a variable in which to accumulate the names of the segments.
EST_String sylname = "";
Now we iterate through the SylStructure daughters of the syllable. These will
be the segments in that syllable.
for (seg=daughter1(syl,"SylStructure"); seg; seg=next(seg))
sylname += seg->name();
Finally we set the syllable's name to the concatenated name, and loop to the
next syllable.
syl->set_name(sylname);
}
Finally we return the LISP form of the utterance.
return utt;
}
ΓòÉΓòÉΓòÉ 28.2.3. Example 3: adding new directories ΓòÉΓòÉΓòÉ
In this example we will add a whole new subsystem. This will often be a
common way for people to use Festival. For example let us assume we wish to
add a formant waveform synthesizer (e.g. like that in the free 'rsynth'
program). In this case we will add a whole new sub-directory to the modules
directory. Let us call it 'rsynth/'.
In the directory we need a 'Makefile' of the standard form so we should copy
one from one of the other directories, e.g. 'Intonation/'. Standard methods
are used to identify the source code files in a 'Makefile' so that the '.o'
files are properly added to the library. Following the other examples will
ensure your code is integrated properly.
We'll just skip over the bit where you extract the information from the
utterance structure and synthesize the waveform (see 'donovan/donovan.cc' or
'diphone/diphone.cc' for examples).
To get Festival to use your new module you must tell it to compile the
directory's contents. This is done in 'festival/config/config'. Add the line
ALSO_INCLUDE += rsynth
to the end of that file (there are similar ones mentioned there). Simply
adding the name of the directory here will add it as a new module and the
directory will be compiled.
What you must provide in your code is a function festival_DIRNAME_init()
which will be called at initialization time. In this function you should call
any further initialization required and define any new Lisp functions you
wish to make available to the rest of the system. For example, in the
'rsynth' case we would define, in some file in 'rsynth/'
#include "festival.h"
static LISP utt_rsynth(LISP utt)
{
EST_Utterance *u = get_c_utt(utt);
// Do formant synthesis
return utt;
}
void festival_rsynth_init()
{
proclaim_module("rsynth");
festival_def_utt_module("Rsynth_Synth",utt_rsynth,
"(Rsynth_Synth UTT)
A simple formant synthesizer");
...
}
Integration of the code in optional (and standard) directories is done by
automatically creating 'src/modules/init_modules.cc' from the list of
standard directories plus those defined in ALSO_INCLUDE. A call to a function
called festival_DIRNAME_init() will be made for each.
This mechanism is specifically designed so you can add modules to the system
without changing anything in the standard distribution.
ΓòÉΓòÉΓòÉ 28.2.4. Example 4: adding new LISP objects ΓòÉΓòÉΓòÉ
This third examples shows you how to add a new Object to Scheme and add
wraparounds to allow manipulation within the the Scheme (and C++) domain.
Like example 2 we are assuming this is done in a new directory. Suppose you
have a new object called Widget that can transduce a string into some other
string (with some optional continuous parameter. Thus, here we create a new
file 'widget.cc' like this
#include "festival.h"
#include "widget.h" // definitions for the widget class
In order to register the widgets as Lisp objects we actually need to register
them as EST_Val's as well. Thus we now need
VAL_REGISTER_CLASS(widget,Widget)
SIOD_REGISTER_CLASS(widget,Widget)
The first name given to these macros should be a short mnemonic name for the
object, which will be used in defining a set of access and construction
functions. It must of course be unique within the whole system. The second
name is the name of the object class itself.
To understand their usage we can add a few simple widget manipulation functions
LISP widget_load(LISP filename)
{
EST_String fname = get_c_string(filename);
Widget *w = new Widget; // build a new widget
if (w->load(fname) == 0) // successful load
return siod(w);
else
{
cerr << "widget load: failed to load \"" << fname << "\"" << endl;
festival_error();
}
return NIL; // for compilers that get confused
}
Note that the function siod constructs a LISP object from a widget; the class
registration macro defines that for you. Also note that when you give an
object to a LISP object, the LISP object then owns it and is responsible for
deleting it when garbage collection occurs on that LISP object. Care should
be taken that you don't put the same object within different LISP objects.
The macro VAL_REGISTER_CLASS_NODEL should be used instead if you do not want
your object to be deleted by the LISP system (though this may cause leaks).
If you want to refer to these functions in other files within your module you
can use
VAL_REGISTER_CLASS_DCLS(widget,Widget)
SIOD_REGISTER_CLASS_DCLS(widget,Widget)
in a common '.h' file.
The following defines a function that takes a LISP object containing a
widget, applies some method, and returns a string.
LISP widget_apply(LISP lwidget, LISP string, LISP param)
{
Widget *w = widget(lwidget);
EST_String s = get_c_string(string);
float p = get_c_float(param);
EST_String answer;
answer = w->apply(s,p);
return strintern(answer);
}
The function widget, defined by the registration macros, takes a LISP object
and returns a pointer to the widget inside it. If the LISP object does not
contain a widget, an error will be thrown.
Finally, to add these functions to the Lisp system
void festival_widget_init()
{
init_subr_1("widget.load",widget_load,
"(widget.load FILENAME)\n\
Load in widget from FILENAME.");
init_subr_3("widget.apply",widget_apply,
"(widget.apply WIDGET INPUT VAL)\n\
Returns widget applied to string INPUT with float VAL.");
}
In your 'Makefile' for this directory you'll need to add the include
directory where 'widget.h' is, if it is not contained within the directory
itself. This is done through the make variable LOCAL_INCLUDES as
LOCAL_INCLUDES = -I/usr/local/widget/include
And for the linker, you'll need to identify where your widget library is. At
the end of your 'festival/config/config' file add
COMPILERLIBS += -L/usr/local/widget/lib -lwidget
ΓòÉΓòÉΓòÉ 29. API ΓòÉΓòÉΓòÉ
If you wish to use Festival within some other application there are a number
of possible interfaces.
Scheme API Programming in Scheme
Shell API From Unix shell
Server/client API Festival as a speech synthesis server
C/C++ API Through function calls from C++.
C only API Small independent C client access
Java and JSAPI Synthesizing from Java
ΓòÉΓòÉΓòÉ 29.1. Scheme API ΓòÉΓòÉΓòÉ
Festival includes a full programming language, Scheme (a variant of Lisp), as
a powerful interface to its speech synthesis functions. Often this will be
the easiest method of controlling Festival's functionality. Even when using
other APIs, they will ultimately depend on the Scheme interpreter.
Scheme commands (as s-expressions) may simply be written in files and
interpreted by Festival, either by specifying them as arguments on the
command line, in the interactive interpreter, or through standard input as a
pipe.
Suppose we have a file 'hello.scm' containing
;; A short example file with Festival Scheme commands
(voice_rab_diphone) ;; select Gordon
(SayText "Hello there")
(voice_don_diphone) ;; select Donovan
(SayText "and hello from me")
From the command interpreter we can execute the commands in this file by
loading them
festival> (load "hello.scm")
nil
Or we can execute the commands in the file directly from the shell command
line
unix$ festival -b hello.scm
The '-b' option denotes batch operation meaning the file is loaded and then
Festival will exit, without starting the command interpreter. Without this
option '-b' Festival will load 'hello.scm' and then accept commands on
standard input. This can be convenient when some initial set up is required
for a session.
Note one disadvantage of the batch method is that time is required for
Festival's initialisation every time it starts up. Although this will
typically only be a few seconds, for saying short individual expressions that
lead-in time may be unacceptable. Thus it is often preferable to execute
commands within an already running system, or to use the server/client mode.
Of course it is not just about strings of commands: because Scheme is a full
programming language, functions, loops, variables, file access and arithmetic
operations may all be used in your Scheme programs. Also, access to
Unix is available through the system function. For many applications, directly
programming them in Scheme is both the easiest and the most efficient method.
A number of example Festival scripts are included in 'examples/', including a
program for saying the time and one for telling you the latest news (by
accessing a page from the web). Also see the detailed discussion of a script
example in POS Example.
═══ 29.2. Shell API ═══
The simplest use of Festival (though not the most powerful) is simply using it
to directly render text files as speech. Suppose we have a file 'hello.txt'
containing
Hello world. Isn't it excellent weather
this morning.
We can simply call Festival as
unix$ festival --tts hello.txt
Or for even simpler one-off phrases
unix$ echo "hello " | festival --tts
This is easy to use but you will need to wait for Festival to start up and
initialise its databases before it starts to render the text as speech. This
may take several seconds on some machines. A socket-based server mechanism is
provided in Festival which allows a single server process to start up once
and be used efficiently by multiple client programs.
Note also the use of Sable for marked-up text, see XML/SGML mark-up. Sable
allows various forms of additional information in text, such as phrasing,
emphasis and pronunciation, as well as changing voices and the inclusion of
external waveform files (i.e. random noises). For many applications this will
be the preferred interface method. Other text modes are also available from
the command line by using auto-text-mode-alist.
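If you are driving Festival from another program rather than from a shell, the
same text-to-speech pipe can be wrapped in a few lines. The sketch below is
not part of Festival; it simply assumes a 'festival' binary is on your PATH
and pipes text to it, exactly as the echo example above does.

```python
import subprocess

def speak(text, cmd=("festival", "--tts")):
    """Pipe TEXT to a text-to-speech command, as in: echo text | festival --tts.

    CMD is the command to run; it defaults to the festival binary, which is
    assumed to be on PATH.  Returns the process exit status.
    """
    result = subprocess.run(cmd, input=text, text=True)
    return result.returncode

# Example (requires festival to be installed):
# speak("Hello world.  Isn't it excellent weather this morning.")
```

As with the shell form, each call pays Festival's start-up cost; for frequent
short utterances the server/client mode described below is preferable.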
═══ 29.3. Server/client API ═══
Festival offers a BSD socket-based interface. This allows Festival to run as
a server, with client programs accessing it. Basically the server offers
a new command interpreter for each client that attaches to it. The server is
forked for each client, but this is much faster than having to wait for a
Festival process to start from scratch. Also the server can run on a bigger
machine, offering much faster synthesis.
Note: the Festival server is inherently insecure and may allow arbitrary users
access to your machine.
Every effort has been made to minimise the risk of unauthorised access through
Festival and a number of levels of security are provided. However, as with any
program offering socket access, like httpd, sendmail or ftpd, there is a risk
that unauthorised access is possible. I trust Festival's security enough to
often run it on my own machine and departmental servers, restricting access to
within our department. Please read the information below before using the
Festival server so you understand the risks.
═══ 29.3.1. Server access control ═══
The following access control is available for Festival when running as a
server. When the server starts it will usually start by loading in various
commands specific for the task it is to be used for. The following variables
are used to control access.
server_port A number identifying the inet socket port. By default this is
1314. It may be changed as required.
server_log_file If nil no logging takes place; if t logging is printed to
standard out; and if a file name, log messages are appended to that
file. All connections and attempted connections are logged with a
time stamp and the name of the client. All commands sent from the
client are also logged (output and data input is not logged).
server_deny_list If non-nil it is used to identify which machines are not
allowed access to the server. This is a list of regular
expressions. If the host name of the client matches any of the
regexes in this list the client is denied access. This overrides all
other access methods. Remember that sometimes hosts are identified
as numbers not as names.
server_access_list If this is non-nil only machines whose names match at
least one of the regexes in this list may connect as clients. Remember
that sometimes hosts are identified as numbers not as names, so you
should probably exclude the IP number of a machine as well as its name
to be properly secure.
server_passwd If this is non-nil, the client must send this passwd to the
server followed by a newline before access is given. This is
required even if the machine is included in the access list. This
is designed so servers for specific tasks may be set up with
reasonable security.
(set_server_safe_functions FUNCNAMELIST) If called, this can restrict which
functions the client may call. This is the most restrictive form of
access, and is thoroughly recommended. In this mode it would be normal
to include only the specific functions the client can execute (i.e.
the function to set up output, and a tts function). For example a
server could call the following at set-up time, thus restricting
calls to only those that 'festival_client' --ttw uses.
(set_server_safe_functions
'(tts_return_to_client tts_text tts_textall Parameter.set))
It is strongly recommended that you run Festival in server mode as userid
nobody to limit the access the process will have; running it in a chroot
environment is more secure still.
For example suppose we wish to allow access to all machines in the CSTR domain
except for holmes.cstr.ed.ac.uk and adam.cstr.ed.ac.uk. This may be done by
the following two commands
(set! server_deny_list '("holmes\\.cstr\\.ed\\.ac\\.uk"
"adam\\.cstr\\.ed\\.ac\\.uk"))
(set! server_access_list '("[^\\.]*\\.cstr\\.ed\\.ac\\.uk"))
This is not complete though, as when DNS is not working holmes and adam will
still be able to access the server (but if our DNS isn't working we probably
have more serious problems). However, the above is secure in that only
machines in the domain cstr.ed.ac.uk can access the server, though there may
be ways to get machines to identify themselves as being in that domain even
when they are not.
By default Festival in server mode will only accept client connections from
localhost.
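To make the deny/access semantics above concrete, here is a rough Python model
of the checks. This is an illustrative sketch, not Festival's actual code, and
the exact regex anchoring Festival uses may differ; it encodes the rules as
stated: the deny list overrides everything, and a non-empty access list must
be matched.

```python
import re

def client_allowed(host, deny_list=(), access_list=()):
    """Model of the Festival server access checks described above.

    A host matching any regex in DENY_LIST is refused, overriding everything
    else.  If ACCESS_LIST is non-empty the host must match at least one of
    its regexes to be admitted.  Each regex must match the whole host name.
    """
    if any(re.fullmatch(rx, host) for rx in deny_list):
        return False
    if access_list:
        return any(re.fullmatch(rx, host) for rx in access_list)
    return True

# The CSTR example from above, written as Python raw-string regexes
# (Scheme needs the backslashes doubled, hence "\\." in the Lisp version):
deny = [r"holmes\.cstr\.ed\.ac\.uk", r"adam\.cstr\.ed\.ac\.uk"]
access = [r"[^.]*\.cstr\.ed\.ac\.uk"]
```

With these lists, a host such as foo.cstr.ed.ac.uk is admitted, while holmes
is denied even though it matches the access pattern, because the deny list is
checked first.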
═══ 29.3.2. Client control ═══
An example client program called 'festival_client' is included with the system
that provides a wide range of access methods to the server. A number of
options for the client are offered.
--server The name (or IP number) of the server host. By default this is
'localhost' (i.e. the same machine you run the client on).
--port The port number the Festival server is running on. By default this
is 1314.
--output FILENAME If a waveform is to be synchronously returned, it will be
saved in FILENAME. The --ttw option uses this, as does the use of
the Festival command utt.send.wave.client. If an output waveform
file is received by 'festival_client' and no output file has been
given, the waveform is discarded with an error message.
--passwd PASSWD If a passwd is required by the server it should be stated on
the client call. PASSWD plus a newline is sent before any other
communication takes place. If this isn't specified and a passwd is
required, you must enter it first; if the --ttw option is used and a
passwd is required but none is specified, access will be denied.
--prolog FILE FILE is assumed to contain Festival commands and its contents
are sent to the server after the passwd but before anything else.
This is convenient to use in conjunction with --ttw, which otherwise
does not offer any way to send commands as well as the text to the
server.
--otype OUTPUTTYPE If an output waveform file is to be used this specifies
the output type of the file. The default is nist, but ulaw, riff
and others as supported by the Edinburgh Speech Tools Library are
valid. You may use raw too, but note that Festival may return
waveforms of various sampling rates depending on the sample rates of
the databases it is using. You can of course make Festival only
return one particular sample rate, by using after_synth_hooks. Note
that the byte order will be that of the client machine if the
output format allows it.
--ttw Text to wave is an attempt to make festival_client useful in many
simple applications. Although you can connect to the server and
send arbitrary Festival Scheme commands, this option automatically
does what you probably want most often. When specified this
option takes text from the specified file (or stdin), synthesizes
it (in one go) and saves it in the specified output file. It
basically does the following
(Parameter.set 'Wavefiletype '<output type>)
(tts_textall "
<file/stdin contents>
")
Note that this is best used for small, single utterance texts as you have to
wait for the whole text to be synthesized before it is returned.
--aucommand COMMAND Execute COMMAND on each waveform returned by the server.
The variable FILE will be set when COMMAND is executed.
--async So as to reduce the delay between the text being sent and the first
sound being available to play, this option in conjunction with --ttw
causes the text to be synthesized utterance by utterance and sent
back in separate waveforms. Using --aucommand, each waveform may be
played locally, and when 'festival_client' is interrupted the sound
will stop. Getting the client to connect to an audio server
elsewhere means the sound will not necessarily stop when the
'festival_client' process is stopped.
--withlisp With each command sent to Festival a Lisp return value is sent
back; also, Lisp expressions may be sent from the server to the
client through the command send_client. If this option is specified
the Lisp expressions are printed to standard out, otherwise this
information is discarded.
A typical example use of 'festival_client' is
festival_client --async --ttw --aucommand 'na_play $FILE' fred.txt
This will use 'na_play' to play each waveform generated for the utterances in
'fred.txt'. Note the single quotes so that the $ in $FILE isn't expanded
locally.
Note the server must be running before you can talk to it. At present
Festival is not set up for automatic invocation through 'inetd' and
'/etc/services'. If you do that yourself, note that it is a different type of
interface, as 'inetd' assumes all communication goes through standard in/out.
Also note that each connection to the server starts a new session. Variables
are not persistent over multiple calls to the server so if any initialization
is required (e.g. loading of voices) it must be done each time the client
starts or more reasonably in the server when it is started.
A Perl festival client is also available in
'festival/examples/festival_client.pl'.
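The same socket interface can be reached from other languages too. Below is a
minimal, hypothetical Python sketch: it connects to the server port, sends the
passwd (plus newline, as described above) if one is given, then sends a single
Scheme command and collects the raw reply. It deliberately does not parse the
server's reply framing; see 'festival_client.c' for the full protocol.

```python
import socket

def send_command(command, host="localhost", port=1314, passwd=None):
    """Connect to a Festival server, send one Scheme command, return raw reply.

    PASSWD, if given, is sent first followed by a newline, as the server
    expects.  The reply is returned as undecoded bytes; interpreting the
    server's reply markers is beyond this sketch.
    """
    with socket.create_connection((host, port)) as sock:
        if passwd is not None:
            sock.sendall(passwd.encode() + b"\n")
        sock.sendall(command.encode() + b"\n")
        sock.shutdown(socket.SHUT_WR)   # signal that we are done sending
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:                # server closed the connection
                break
            chunks.append(data)
    return b"".join(chunks)

# Example (requires a running server, e.g. started with: festival --server):
# send_command('(SayText "hello")')
```

Remember that, as noted above, each connection starts a fresh session, so any
per-session initialization must be resent on every connection.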
═══ 29.4. C/C++ API ═══
As well as offering an interface through Scheme and the shell, some users may
also wish to embed Festival within their own C++ programs. A number of
simple-to-use high-level functions are available for such uses.
In order to use Festival you must include 'festival/src/include/festival.h'
which in turn will include the necessary other include files in
'festival/src/include' and 'speech_tools/include'; you should ensure these are
included in the include path for your program. Also you will need to link
your program with 'festival/src/lib/libFestival.a',
'speech_tools/lib/libestools.a', 'speech_tools/lib/libestbase.a' and
'speech_tools/lib/libeststring.a' as well as any other optional libraries such
as net audio.
The main external functions available for C++ users of Festival are:
void festival_initialize(int load_init_files,int heapsize); This must be
called before any other festival functions may be called. It sets up
the synthesizer system. The first argument, if true, causes the
system set-up files to be loaded (which is normally what is
necessary); the second argument is the initial size of the Scheme
heap, which should normally be 210000 unless you envisage processing
very large Lisp structures.
int festival_say_file(const EST_String &filename); Say the contents of the
given file. Returns TRUE or FALSE depending on whether this was
successful.
int festival_say_text(const EST_String &text); Say the contents of the given
string. Returns TRUE or FALSE depending on whether this was
successful.
int festival_load_file(const EST_String &filename); Load the contents of the
given file and evaluate its contents as Lisp commands. Returns TRUE
or FALSE depending on whether this was successful.
int festival_eval_command(const EST_String &expr); Read the given string as a
Lisp command and evaluate it. Returns TRUE or FALSE depending on
whether this was successful.
int festival_text_to_wave(const EST_String &text,EST_Wave &wave); Synthesize
the given string into the given wave. Returns TRUE or FALSE
depending on whether this was successful.
Many other commands are also available but often the above will be sufficient.
Below is a simple top-level program that uses the Festival functions:
#include <festival.h>

int main(int argc, char **argv)
{
EST_Wave wave;
int heap_size = 210000; // default scheme heap size
int load_init_files = 1; // we want the festival init files loaded
festival_initialize(load_init_files,heap_size);
// Say simple file
festival_say_file("/etc/motd");
festival_eval_command("(voice_ked_diphone)");
// Say some text;
festival_say_text("hello world");
// Convert to a waveform
festival_text_to_wave("hello world",wave);
wave.save("/tmp/wave.wav","riff");
// festival_say_file puts the system in async mode so we better
// wait for the spooler to reach the last waveform before exiting
// This isn't necessary if only festival_say_text is being used (and
// your own wave playing stuff)
festival_wait_for_spooler();
return 0;
}
═══ 29.5. C only API ═══
A simpler C-only interface example is given in
'festival/examples/festival_client.c'. That interface talks to a festival
server. The code does not require linking with any other EST or Festival code
so it is much smaller and easier to include in other programs. The code is
missing some functionality, but not much considering how much smaller it is.
═══ 29.6. Java and JSAPI ═══
Initial support for talking to a Festival server from Java is included from
version 1.3.0 and initial JSAPI support is included from 1.4.0. At present the
JSAPI implementation talks to a Festival server elsewhere rather than running
as part of the Java process itself.
A simple (pure) Java festival client is given in
'festival/src/modules/java/cstr/festival/Client.java' with a wraparound script
in 'festival/bin/festival_client_java'.
See the file 'festival/src/modules/java/cstr/festival/jsapi/ReadMe' for
requirements and a small example of using the JSAPI interface.
═══ 30. Examples ═══
This chapter contains some simple walkthrough examples of using Festival in
various ways, not just as a speech synthesizer.
POS Example Using Festival as a part of speech tagger
═══ 30.1. POS Example ═══
This example shows how we can use part of the standard synthesis process to
tokenize and tag a file of text. This section does not cover training and
setting up a part of speech tag set (See POS tagging), only how to go about
using the standard POS tagger on text.
This example also shows how to use Festival as a simple scripting language,
and how to modify various methods used during text to speech.
The file 'examples/text2pos' contains an executable shell script which will
read arbitrary ascii text from standard input and produce words and their part
of speech (one per line) on standard output.
A Festival script, like any other UNIX script, must start with the
characters #! followed by the name of the 'festival' executable. For scripts
the option -script is also required. Thus our first line looks like
#!/usr/local/bin/festival -script
Note that the pathname may need to be different on your system.
Following this we have copious comments, to keep our lawyers happy, before we
get into the real script.
The basic idea we use is that the tts process segments text into utterances;
those utterances are then passed to a list of functions, as defined by the
Scheme variable tts_hooks. Normally this variable contains a list of two
functions, utt.synth and utt.play, which will synthesize and play the
resulting waveform. In this case, instead, we wish to predict the part of
speech value and then print it out.
The first function we define basically replaces the normal synthesis function
utt.synth. It runs the standard Festival utterance modules used in the
synthesis process, up to the point where POS is predicted. This function looks
like
(define (find-pos utt)
"Main function for processing TTS utterances. Predicts POS and
prints words with their POS"
(Token utt)
(POS utt)
)
The normal text-to-speech process first tokenizes the text, splitting it into
``sentences''. The utterance type of these is Token. Then we call the Token
utterance module, which converts the tokens to a stream of words. Then we
call the POS module to predict part of speech tags for each word. Normally we
would call other modules, ultimately generating a waveform, but in this case
we need no further processing.
The second function we define is one that will print out the words and parts
of speech
(define (output-pos utt)
"Output the word/pos for each word in utt"
(mapcar
(lambda (pair)
(format t "%l/%l\n" (car pair) (car (cdr pair))))
(utt.features utt 'Word '(name pos))))
This uses the utt.features function to extract features from the items in a
named stream of an utterance. In this case we want the name and pos features
for each item in the Word stream. Then for each pair we print out the word's
name, a slash and its part of speech followed by a newline.
Our next job is to redefine the functions to be called during text to speech.
The variable tts_hooks is defined in 'lib/tts.scm'. Here we set it to our two
newly-defined functions
(set! tts_hooks (list find-pos output-pos))
So that garbage collection messages do not appear on the screen we stop them
from being output with the following command
(gc-status nil)
The final stage is to start the tts process running on standard input.
Because we have redefined what functions are to be run on the utterances, it
will no longer generate speech but just predict part of speech and print it to
standard output.
(tts_file "-")
═══ 31. Problems ═══
There will be many problems with Festival, both in installation and in running
it. It is a young system and there is a lot to it. We believe the basic
design is sound and that problems will be features that are missing or
incomplete rather than fundamental ones.
We are always open to suggestions on how to improve it and fix problems; we
don't guarantee we'll have the time to fix problems, but we are interested in
hearing what problems you have.
Before you smother us with mail, here is an incomplete list of general
problems we have already identified:
The more documentation we write the more we realize how much more
documentation is required. Most of the Festival documentation was
written by someone who knows the system very well, and it contains many
English mistakes. A good re-write by someone else would be a good start.
The system is far too slow. Although machines are getting faster, it
still takes too long to start the system and get it to speak some given
text. Even so, on reasonable machines, Festival can generate the speech
several times faster than it takes to say it. But even if it is five
times faster, it will take 2 seconds to generate a 10 second utterance. A
2 second wait is too long. Faster machines would improve this but a
change in design is a better solution.
The system is too big. It takes a long time to compile even on quite
large machines, and its footprint is still in the tens of megabytes, as is
the run-time requirement. Although we have spent some time trying to fix
this (optional modules have made it possible to build a much
smaller binary), we haven't done enough yet.
The signal quality of the voices isn't very good by today's standard of
synthesizers, even given the improvement in quality since the last release.
This is partly our fault in not spending the time (or perhaps also not
having enough expertise) on the low-level waveform synthesis parts of the
system. This will improve in the future with better signal processing
(under development) and better synthesis techniques (also under
development).
═══ 32. References ═══
allen87 Allen J., Hunnicut S. and Klatt, D. Text-to-speech: the MITalk
system, Cambridge University Press, 1987.
abelson85 Abelson H. and Sussman G. Structure and Interpretation of Computer
Programs, MIT Press, 1985.
black94 Black A. and Taylor, P. "CHATR: a generic speech synthesis system.",
Proceedings of COLING-94, Kyoto, Japan 1994.
black96 Black, A. and Hunt, A. "Generating F0 contours from ToBI labels
using linear regression", ICSLP96, vol. 3, pp 1385-1388,
Philadelphia, PA. 1996.
black97b Black, A, and Taylor, P. "Assigning Phrase Breaks from
Part-of-Speech Sequences", Eurospeech97, Rhodes, Greece, 1997.
black97c Black, A, and Taylor, P. "Automatically clustering similar units for
unit selection in speech synthesis", Eurospeech97, Rhodes, Greece,
1997.
black98 Black, A., Lenzo, K. and Pagel, V., "Issues in building general
letter to sound rules.", 3rd ESCA Workshop on Speech Synthesis,
Jenolan Caves, Australia, 1998.
black99 Black, A., and Lenzo, K., "Building Voices in the Festival Speech
Synthesis System," unpublished document, Carnegie Mellon University,
available at
http://www.cstr.ed.ac.uk/projects/festival/docs/festvox/
breiman84 Breiman, L., Friedman, J. Olshen, R. and Stone, C. Classification
and regression trees, Wadsworth and Brooks, Pacific Grove, CA. 1984.
campbell91 Campbell, N. and Isard, S. "Segment durations in a syllable
frame", Journal of Phonetics, 19:1 37-47, 1991.
DeRose88 DeRose, S. "Grammatical category disambiguation by statistical
optimization". Computational Linguistics, 14:31-39, 1988.
dusterhoff97 Dusterhoff, K. and Black, A. "Generating F0 contours for speech
synthesis using the Tilt intonation theory" Proceedings of ESCA
Workshop on Intonation, September, Athens, Greece. 1997
dutoit97 Dutoit, T. An introduction to Text-to-Speech Synthesis, Kluwer
Academic Publishers, 1997.
hunt89 Hunt, M., Zwierynski, D. and Carr, R. "Issues in high quality LPC
analysis and synthesis", Eurospeech89, vol. 2, pp 348-351, Paris,
France. 1989.
jilka96 Jilka M. Regelbasierte Generierung natuerlich klingender Intonation
des Amerikanischen Englisch, Magisterarbeit, Institute of Natural
Language Processing, University of Stuttgart. 1996
moulines90 Moulines, E. and Charpentier, F. "Pitch-synchronous waveform
processing techniques for text-to-speech synthesis using diphones"
Speech Communication, 9(5/6) pp 453-467. 1990.
pagel98 Pagel, V., Lenzo, K., and Black, A. "Letter to Sound Rules for
Accented Lexicon Compression", ICSLP98, Sydney, Australia, 1998.
ritchie92 Ritchie, G., Russell, G., Black, A. and Pulman, S. Computational
Morphology: practical mechanisms for the English Lexicon, MIT Press,
Cambridge, Mass. 1992.
vansanten96 van Santen, J., Sproat, R., Olive, J. and Hirschberg, J. eds,
"Progress in Speech Synthesis," Springer Verlag, 1996.
silverman92 Silverman, K., Beckman, M., Pitrelli, J., Ostendorf, M.,
Wightman, C., Price, P., Pierrehumbert, J., and Hirschberg, J. "ToBI: a
standard for labelling English prosody." Proceedings of ICSLP92 vol
2. pp 867-870, 1992
sproat97 Sproat, R., Taylor, P, Tanenblatt, M. and Isard, A. "A Markup
Language for Text-to-Speech Synthesis", Eurospeech97, Rhodes,
Greece, 1997.
sproat98 Sproat, R. ed., "Multilingual Text-to-Speech Synthesis: The Bell
Labs approach", Kluwer 1998.
sable98 Sproat, R., Hunt, A., Ostendorf, M., Taylor, P., Black, A., Lenzo,
K., and Edgington, M. "SABLE: A standard for TTS markup." ICSLP98,
Sydney, Australia, 1998.
taylor91 Taylor, P., Nairn, I., Sutherland, A. and Jack, M. "A real time
speech synthesis system", Eurospeech91, vol. 1, pp 341-344, Genoa,
Italy. 1991.
taylor96 Taylor P. and Isard, A. "SSML: A speech synthesis markup language"
to appear in Speech Communications.
wwwxml97 World Wide Web Consortium Working Draft "Extensible Markup Language
(XML) Version 1.0 Part 1: Syntax",
http://www.w3.org/pub/WWW/TR/WD-xml-lang-970630.html
yarowsky96 Yarowsky, D., "Homograph disambiguation in text-to-speech
synthesis", in "Progress in Speech Synthesis," eds. van Santen, J.,
Sproat, R., Olive, J. and Hirschberg, J. pp 157-172. Springer
Verlag, 1996.
═══ 33. Feature functions ═══
This chapter contains a list of the basic feature functions available for
stream items in utterances. See Features. These are the basic features, which
can be combined with relative features (such as n. for next, and relations to
follow links). Some of these features are implemented as short C++ functions
(e.g. asyl_in) while others are simple features on an item (e.g. pos). Note
that functional features take precedence over simple features, so accessing a
feature called "X" will always use the function called "X" even if a simple
feature called "X" exists on the item.
Unlike previous versions there are no features that are built in on all items,
except addr (reintroduced in 1.3.1) which returns a unique string for that
item (it is the hex address of the item within the machine). Features may also
be defined through Scheme; these all have the prefix lisp_.
The feature functions are listed in the form Relation.name where Relation is
the name of the stream that the function is appropriate to and name is its
name. Note that you will not require the Relation part of the name if the
stream item you are applying the function to is of that type.