Text file | 1993-03-22 | 12KB | 217 lines
COMPUTER NOTE
DISTANCE: A Program to Calculate Distances and Other Parameters of an
Alignment of DNA Sequences.
J.A. Lopez-Bueno and F. Gonzalez-Candelas
E-mail: LOPEZJ@vm.ci.uv.es
GONZALEZ@evalsb.geneti.uv.es
Departament de Genetica and Servei de Bioinformatica
Universitat de Valencia
Dr. Moliner 50
E-46100 Burjassot, Valencia, Spain
The advent of DNA sequence analysis techniques has brought an
explosion of knowledge of molecular biology and also a period of development
in the study of molecular evolution. Application of these techniques has led to
the discovery of new features of genes and genomes and to a rapid
accumulation of data on DNA sequences. These discoveries and the availability
of a large number of DNA sequences have facilitated the study of molecular
evolution and considerable progress has already been made in the comparative
study of DNA sequences (Li et al. 1985).
Methods for estimating the number of nucleotide base substitutions are
crucial for studies of molecular evolution. Knowledge of the number of base
substitutions is particularly important for computing the evolutionary rate and
constructing phylogenetic trees at the DNA level (Gojobori et al. 1990). Several
methods have been proposed, some of which are:
- Jukes and Cantor's (1969) method, which assumes equal, random substitution
rates among the four types of nucleotides.
- Kimura's (1980) Two-Parameter Method, which allows transitional and
transversional substitutions to occur at different rates.
- Kimura's (1981) Three-Parameter Method, which allows one type of
transitional substitution and two types of transversional substitutions.
- Tajima and Nei's (1984) Four-Parameter Method, in which each nucleotide is
substituted by another at a fixed rate for each substituting nucleotide.
- Takahata and Kimura's (1981) Four-Parameter Method. This model allows
the two types of transversional substitutions at a nucleotide site to occur
at different rates. Moreover, the four types of transitions can occur at
different rates.
- Kimura's Six-Parameter Method, which is based on the model originally
proposed by Kimura (1981), who called it the Two Frequency Class
Model. It was solved by Gojobori et al. (1982).
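As an illustration of the two simplest methods in this list, the following is a
minimal Python sketch of the textbook formulas for the Jukes-Cantor and Kimura
two-parameter distances and their large-sample variances. It is not taken from
the DISTANCE source code (which is written in Pascal); the function and
parameter names are hypothetical, and the proportions p, P and Q are assumed to
have been counted beforehand from a pair of aligned sequences.

```python
import math


def jukes_cantor(p, n_sites):
    """Jukes-Cantor distance and variance.

    p       -- observed proportion of sites that differ
    n_sites -- number of sites compared
    """
    w = 1.0 - 4.0 * p / 3.0
    if p < 0.0 or w <= 0.0:
        raise ValueError("distance undefined: p must satisfy 0 <= p < 0.75")
    d = -0.75 * math.log(w)
    var = p * (1.0 - p) / (n_sites * w * w)
    return d, var


def kimura_two_parameter(P, Q, n_sites):
    """Kimura (1980) two-parameter distance and variance.

    P -- observed proportion of sites showing a transition
    Q -- observed proportion of sites showing a transversion
    """
    w1 = 1.0 - 2.0 * P - Q
    w2 = 1.0 - 2.0 * Q
    if w1 <= 0.0 or w2 <= 0.0:
        raise ValueError("distance undefined for these proportions")
    # d = -1/2 ln[(1 - 2P - Q) * sqrt(1 - 2Q)]
    d = -0.5 * math.log(w1) - 0.25 * math.log(w2)
    a = 1.0 / w1
    b = 0.5 * (1.0 / w1 + 1.0 / w2)
    var = (a * a * P + b * b * Q - (a * P + b * Q) ** 2) / n_sites
    return d, var
```

Both functions raise an error when the observed divergence is too large for the
logarithm to be defined, which is the usual sign of saturated sequences.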
The program DISTANCE is intended to provide researchers with all these
nucleotide distances, and their corresponding variances when possible, for any
set of previously aligned nucleotide sequences.
The program DISTANCE has been written in Pascal using a Turbo
Pascal compiler (version 6.0, Borland Co.). It runs on PCs and DOS-based
computers and needs a CGA graphics adapter. We supply the executable
file DISTANCE.EXE and the source code along with several other files
described below.
The input file can have three different formats:
a,b) A first line with two integers (number of sequences and number of
positions), followed by the sequence data in the remaining lines, either in
interleaved PHYLIP format (version 3.3) or in aligned PHYLIP format (older
versions; Felsenstein, 1990). The two integers in the first line are as follows:
i) Number of sequences. It is an integer in the range
2 <= NumberSeq <= 20. The program constant NumMaxSp in
the source file UNITDIST.PAS can be modified to allow for
larger numbers.
ii) Length of the aligned sequences. It is an integer with no
fixed upper limit.
Both values are ultimately restricted only by the memory available
in the computer. See the example files PHYNEW.SEQ and
PHYOLD.SEQ, and the on-screen help for more information.
c) MSF format, the output of the program PILEUP (Genetics Computer Group 1991).
See the example file MSF.SEQ and the on-screen help for more information.
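The check on the first line of a PHYLIP-style input file can be sketched as
follows. This is only an illustration in Python, not the Pascal code of
DISTANCE; the function name parse_header is hypothetical, and the limit of 20
sequences simply mirrors the documented default of the constant NumMaxSp.

```python
NUM_MAX_SP = 20  # assumed to mirror the default NumMaxSp in UNITDIST.PAS


def parse_header(line, max_species=NUM_MAX_SP):
    """Read the two integers of the first line: sequences and positions."""
    fields = line.split()
    if len(fields) < 2:
        raise ValueError("the first line must contain two integers")
    n_seq, n_pos = int(fields[0]), int(fields[1])
    if not 2 <= n_seq <= max_species:
        raise ValueError(
            "number of sequences must be between 2 and %d" % max_species
        )
    if n_pos < 1:
        raise ValueError("sequence length must be a positive integer")
    return n_seq, n_pos
```

For example, parse_header("  5   120") returns (5, 120), while a header
announcing 21 sequences is rejected unless max_species is raised.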
DISTANCE can be executed in two different modes: as a menu-driven
program, if no parameters are given, or as a command, for use in batch
mode.
In the first mode, MENU MODE, it has a main menu with the following ten options:
i) Path and name of the input file. If the input file is in the current directory
only its name is necessary. Otherwise, the whole path must be typed. The
default filename is SEQUENCE.SEQ.
ii) Path and name of the results file. Once this file has been written,
subsequent results are appended to the end of it, unless this option is
changed. The default filename is SEQUENCE.RST.
iii) Format of the aligned sequences file. DISTANCE currently accepts three
different formats for the aligned DNA sequences: interleaved (PHYLIP version
3.3), sequential (PHYLIP older versions) or MSF. You must tell the
program which one you are using.
iv) Method to compute the distances. You can choose among six methods:
-Jukes and Cantor's method.
-Kimura's Two-Parameter Method.
-Kimura's Three-Parameter Method.
-Tajima and Nei's Four-Parameter Method.
-Takahata and Kimura's Four-Parameter Method.
-Kimura's Six-Parameter Method.
The program also computes the corresponding matrix of variances of the
estimates, except for the last two methods, whose variances have not been
derived yet.
v) Bases in each codon to use in the computations. You can choose to calculate
distances for any combination of codon positions in the aligned DNA sequences
(all the bases, first base, second base, third base, first and second bases,
first and third bases, or second and third bases). When all the bases are
selected, the matrices of synonymous and non-synonymous difference proportions
and substitutions per site can also be computed, using the unweighted pathway
method (Nei, 1987) and the genetic code chosen in (vi).
vi) Trinucleotide-to-amino-acid translation table (genetic code) to use:
- Don't make these calculations (default),
- Standard nuclear code,
- Drosophila mitochondrial,
- Yeast mitochondrial,
- Mammalian mitochondrial,
- Ciliated.
These calculations can slow the execution of the program.
vii) Output format. The results file can be of three types:
-Large output file (all the matrices). DISTANCE prints the common length
vector, the Hamming distance matrix (absolute number of changes), the
transversions and transitions matrix, the matrices of changes for all the
nucleotide pairs, and the distance matrix chosen in (iv), along with its
variance matrix, if available.
-Brief output file. The program prints only the distance estimates and the
corresponding variance matrices.
-Fitch and Kitsch output file. The program prints only one lower triangular
matrix with the distances, for use with the PHYLIP package programs Fitch
and Kitsch.
viii) Do it! This option will execute the program, once all the options have been
chosen.
ix) Help. On-screen help (fourteen pages) about all the previous topics. The
index of this help is:
-Page 1. Index.
-Page 2. Phylip New input -interleaved- format. Example.
-Page 3. Phylip Old input -aligned- format. Example.
-Page 4. MSF input format. Example.
-Page 5. Methods to calculate distances (Jukes-Cantor).
-Page 6. Methods to calculate distances (Kimura 2).
-Page 7. Methods to calculate distances (Kimura 3).
-Page 8. Methods to calculate distances (Kimura 4).
-Page 9. Methods to calculate distances (Kimura 6).
-Page 10. Methods to calculate distances (Tajima and Nei).
-Page 11. Bases to use.
-Page 12. Code tables.
-Page 13. Output.
-Page 14. Some important notes. Future.
x) Quit. Return to DOS.
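The choice of codon positions in option (v) and the counts of transitions and
transversions reported in the large output of option (vii) can be sketched in
Python as follows. This is an illustration only and not the DISTANCE source
code: the function names are hypothetical, the reading frame is assumed to start
at the first column of the alignment, and columns where either sequence has
something other than A, C, G or T (a gap, for instance) are simply skipped,
which may differ from how DISTANCE treats deletions.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def select_codon_positions(seq, positions):
    """Keep the bases whose position within the codon (1, 2 or 3) is wanted."""
    return "".join(b for i, b in enumerate(seq) if (i % 3) + 1 in positions)


def count_changes(seq_a, seq_b):
    """Return (sites compared, transitions, transversions) for two sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    compared = transitions = transversions = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue  # assumption: skip gaps and ambiguous bases
        compared += 1
        if a == b:
            continue
        pair = {a, b}
        if pair <= PURINES or pair <= PYRIMIDINES:
            transitions += 1  # purine<->purine or pyrimidine<->pyrimidine
        else:
            transversions += 1  # purine<->pyrimidine
    return compared, transitions, transversions
```

The Hamming distance between the two sequences is the sum of the transitions
and transversions, and dividing each count by the number of sites compared
gives the proportions needed by the distance methods of option (iv).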
In the second mode (COMMAND MODE) you must supply
parameters in the following way:
DISTANCE /h (/H) : Help for command mode, or
DISTANCE parm1 parm2 parm3 parm4 parm5 parm6 parm7
where parm1 is the input file path, parm2 is the output file path, parm3 is the
input format, parm4 is the distance measure, parm5 is the bases to be used,
parm6 is the genetic code and parm7 is the output format. You can omit options
starting from the right side; the defaults are
DISTANCE sequence.seq sequence.rst 1 2 1 1 1,
with the options given in the same order as in the menu mode. For instance,
DISTANCE myfile1.dat myfile1.dis 2 1 2
will execute DISTANCE with input file = myfile1.dat, writing the results in the
output file = myfile1.dis, reading the sequences in sequential PHYLIP format,
using Jukes-Cantor's distance measure, and using only the first bases of each
codon. By default, no translation into amino acids is performed (parm6 =
1) and the brief output format (parm7 = 1) is used.
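The way omitted trailing parameters fall back to their defaults can be sketched
in Python as follows. The function name resolve_parameters is hypothetical and
the code only mimics the behaviour described above; it is not how DISTANCE
itself is implemented.

```python
# Defaults as documented: DISTANCE sequence.seq sequence.rst 1 2 1 1 1
DEFAULTS = ["sequence.seq", "sequence.rst", "1", "2", "1", "1", "1"]


def resolve_parameters(args):
    """Fill in the parameters omitted from the right with their defaults."""
    if len(args) > len(DEFAULTS):
        raise ValueError("at most %d parameters are accepted" % len(DEFAULTS))
    return list(args) + DEFAULTS[len(args):]
```

With the five parameters of the example, resolve_parameters(["myfile1.dat",
"myfile1.dis", "2", "1", "2"]) keeps those values and appends "1" and "1" for
parm6 and parm7.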
Although several other programs are available to compute distances from
nucleic acid sequences (DNADIST in the PHYLIP package, for
instance), DISTANCE has several interesting features. It can read three
different input formats, including the two standard PHYLIP formats, compute
six different distances and also compute the variance matrices of four of them.
To provide more information on which distance to use, and for further
use in other programs, DISTANCE reports all the nucleotide pair
matrices (including deletions) when the large output option is chosen, as well
as the number of synonymous and non-synonymous substitutions.
DISTANCE comprises the following files:
- DISTANCE.PAS : main source program.
- DISTANCE.EXE : the executable file.
- DISTANCE.DOC : help text file for the program.
- UNITDIST.PAS : source unit with some variables and types.
- UNITDIST.TPU : compiled unit UNITDIST.PAS.
- STANDNUC.TAB, DROSOMIT.TAB, YEASTMIT.TAB, MAMMIT.TAB,
CILIATED.TAB : text files with the five different translation codes.
- HELP.PAS : unit with the help text.
- HELP.TPU : compiled unit of HELP.PAS.
- PHYNEW.SEQ: Example of sequence data file in PHYLIP interleaved format.
- PHYOLD.SEQ: Example of sequence data file in PHYLIP aligned format.
- MSF.SEQ: Example of sequence data file in MSF format.
Copies of the source code and executable files can be obtained from the authors
by sending a floppy disk (either 3.5" or 5.25") or by electronic mail (Internet
addresses: GONZALEZ@EVALSB.GENETI.UV.ES and
LBUENO@VM.CI.UV.ES).
References
Felsenstein J, 1990. PHYLIP Manual, version 3.3. Berkeley: University of
California Press.
Genetics Computer Group 1991. Program Manual for the GCG package,
version 7.
Gojobori T, Moriyama EN, and Kimura M, 1990. Statistical methods for
estimating sequence divergence. Methods in Enzymology 183: 531-550.
Gojobori T, Ishii K, and Nei M, 1982. Estimation of average number of
nucleotide substitutions when the rate of substitution varies with nucleotide. J
Mol Evol 18: 414-423.
Jukes TH and Cantor CR, 1969. Evolution of protein molecules. In:
Mammalian protein metabolism (Munro HN, ed). New York: Academic Press;
21-123.
Kimura M, 1980. A simple method for estimating evolutionary rates of base
substitutions through comparative studies of nucleotide sequences. J Mol Evol
16: 111-120.
Kimura M, 1981. Estimation of evolutionary distances between homologous
nucleotide sequences. Proc Natl Acad Sci USA 78: 454-455.
Li WH, Luo CC, and Wu CI, 1985. Evolution of DNA sequences. In: Molecular
evolutionary genetics (MacIntyre RI, ed). New York: Plenum Press; 4-16.
Nei M, 1987. Molecular Evolutionary Genetics. New York: Columbia
University Press.
Tajima F, and Nei M, 1984. Estimation of evolutionary distance between
nucleotide sequences. Mol Biol Evol 1: 269-285.
Takahata N, and Kimura M, 1981. A model of evolutionary base substitutions
and its application with special reference to rapid change of pseudogenes.
Genetics 98: 641-657.