Text File | 1993-03-22 | 12KB | 217 lines

COMPUTER NOTE

DISTANCE: A Program to Calculate Distances and Other Parameters of an Alignment of DNA Sequences.

J.A. Lopez-Bueno and F. Gonzalez-Candelas
E-mail: LOPEZJ@vm.ci.uv.es GONZALEZ@evalsb.geneti.uv.es

Departament de Genetica and Servei de Bioinformatica
Universitat de Valencia
Dr. Moliner 50
E-46100 Burjassot, Valencia, Spain

The advent of DNA sequence analysis techniques has brought an explosion of knowledge in molecular biology and a period of rapid development in the study of molecular evolution. Application of these techniques has led to the discovery of new features of genes and genomes and to a rapid accumulation of DNA sequence data. These discoveries, and the availability of a large number of DNA sequences, have facilitated the study of molecular evolution, and considerable progress has already been made in the comparative study of DNA sequences (Li et al. 1985).

Methods for estimating the number of nucleotide base substitutions are crucial for studies of molecular evolution. Knowledge of the number of base substitutions is particularly important for computing evolutionary rates and constructing phylogenetic trees at the DNA level (Gojobori et al. 1990). Several methods have been proposed, some of which are:

- Jukes and Cantor's (1969) method, which assumes equal, random substitution rates among the four types of nucleotides.
- Kimura's (1980) Two-Parameter Method, which allows transitional and transversional substitutions to occur at different rates.
- Kimura's (1981) Three-Parameter Method, which allows one rate for transitional substitutions and separate rates for the two types of transversional substitutions.
- Tajima and Nei's (1984) Four-Parameter Method, in which each nucleotide is replaced by another at a rate that is fixed for each substituting nucleotide.
- Takahata and Kimura's (1981) Four-Parameter Method. This model allows the two types of transversional substitutions at a nucleotide site to occur at different rates.
  Moreover, the four types of transitions can occur at different rates.
- Kimura's Six-Parameter Method, which is based on the model originally proposed by Kimura (1981), who called it the Two Frequency Class Model. It was solved by Gojobori et al. (1982).

The program DISTANCE is intended to provide researchers with all these nucleotide distances, and their corresponding variances when possible, for any set of previously aligned nucleotide sequences.

The program DISTANCE has been written in Pascal using a Turbo Pascal compiler (version 6.0, Borland Co.). It runs on PCs and other DOS-based computers and needs a CGA graphics adapter. We supply the executable file DISTANCE.EXE and the source code, along with several other files described below.

The input file can have three different formats:

a,b) A first line with two integers (number of species and number of positions), with the sequence information in the remaining lines (in interleaved PHYLIP format, version 3.3, or aligned PHYLIP format, older versions; Felsenstein, 1990). The two integers in the first line are as follows:

   i) Number of sequences. It is an integer in the range 2 <= NumberSeq <= 20. The program constant NumMaxSp in the source file UNITDIST.PAS can be modified to allow for larger numbers.
   ii) Length of the aligned sequences. It is an integer, without limits.

   These two parameters are only restricted by the memory available in the computer. See the example files PHYNEW.SEQ and PHYOLD.SEQ, and the on-screen help, for more information.

c) MSF format, the output of the program PILEUP (Genetics Computer Group 1991). See the example file MSF.SEQ and the on-screen help for more information.

DISTANCE can be executed in two different modes: as a menu-driven program, if no parameters are given, or as a command, for use in batch mode.

In the first mode, MENU MODE, it has a main menu with ten entries:

i) Path and name of the input file. If the input file is in the current directory only its name is necessary.
   Otherwise, the whole path must be typed. The default filename is SEQUENCE.SEQ.

ii) Path and name of the results file. Once this file has been written, subsequent results are appended to the end of it, unless this option is changed. The default filename is SEQUENCE.RST.

iii) Format of the aligned sequences file. DISTANCE currently accepts three different formats for the aligned DNA sequences: interleaved (PHYLIP version 3.3), sequential (PHYLIP older versions) or MSF. You have to tell the program which one you are using.

iv) Method to compute the distances. You can choose among six methods:
   - Jukes and Cantor's method.
   - Kimura's Two-Parameter Method.
   - Kimura's Three-Parameter Method.
   - Tajima and Nei's Four-Parameter Method.
   - Takahata and Kimura's Four-Parameter Method.
   - Kimura's Six-Parameter Method.
   The program also computes the corresponding matrix of variances of the estimates, except for the last two methods, for which the variances have not been derived yet.

v) Bases in each codon to use in the computations. You can choose to calculate distances for any combination of codon positions in the aligned DNA sequences (all the bases, first base, second base, third base, first and second bases, first and third bases, or second and third bases). When all the bases are selected, matrices of synonymous and non-synonymous difference proportions and substitutions per site can be computed with the unweighted pathway method (Nei, 1987), according to one of the genetic codes listed under the next option.

vi) Table of trinucleotide-amino acid translation code to use:
   - Don't make these calculations (default),
   - Standard nuclear code,
   - Drosophila mitochondrial,
   - Yeast mitochondrial,
   - Mammalian mitochondrial,
   - Ciliated.
   These calculations can slow the execution of the program.

vii) Output format. The results file can be of three types:
   - Large output file (all the matrices).
     DISTANCE prints the common length vector, the matrix of Hamming distances (absolute number of changes), the matrix of transversions and transitions, the matrices of changes for all nucleotide pairs, and the distance matrix chosen in (iv), along with its matrix of variances, if possible.
   - Brief output file. The program only prints the distance estimates and the corresponding matrix of variances.
   - Fitch and Kitsch output file. The program only prints one lower triangular matrix with the distances, for use with the PHYLIP package programs Fitch and Kitsch.

viii) Do it! This option executes the program, once all the options have been chosen.

ix) Help. On-screen help (fourteen pages) about all the previous topics. The index of this help is:
   - Page 1. Index.
   - Page 2. Phylip New input -interleaved- format. Example.
   - Page 3. Phylip Old input -aligned- format. Example.
   - Page 4. MSF input format. Example.
   - Page 5. Methods to calculate distances (Jukes-Cantor).
   - Page 6. Methods to calculate distances (Kimura 2).
   - Page 7. Methods to calculate distances (Kimura 3).
   - Page 8. Methods to calculate distances (Kimura 4).
   - Page 9. Methods to calculate distances (Kimura 6).
   - Page 10. Methods to calculate distances (Tajima and Nei).
   - Page 11. Bases to use.
   - Page 12. Code tables.
   - Page 13. Output.
   - Page 14. Some important notes. Future.

x) Quit. Return to DOS.

In the second mode (COMMAND MODE) you must supply parameters in the following way:

   DISTANCE /h (/H) : Help for command mode, or
   DISTANCE parm1 parm2 parm3 parm4 parm5 parm6 parm7

where parm1 is the input file path, parm2 is the output file path, parm3 is the input format, parm4 is the distance measure, parm5 is the bases to be used, parm6 is the genetic code and parm7 is the output format. You can omit options starting from the right-hand side; the defaults are DISTANCE sequence.seq sequence.rst 1 2 1 1 1, with the options in the same order as in the menu mode.
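
The way omitted trailing parameters fall back to their defaults can be sketched as follows. This is a minimal Python illustration, not the Pascal code of DISTANCE: the function name resolve_parameters and the setting names are hypothetical, and only the order of the parameters and their default values are taken from the text above.

```python
# Hypothetical sketch of the positional-parameter defaults (not the authors' code).
# Default values come from: DISTANCE sequence.seq sequence.rst 1 2 1 1 1
DEFAULTS = ["sequence.seq", "sequence.rst", "1", "2", "1", "1", "1"]

# Illustrative names for parm1..parm7; the program itself does not define them.
NAMES = ["input_file", "output_file", "input_format", "distance_method",
         "bases", "genetic_code", "output_format"]


def resolve_parameters(args):
    """Fill in parameters omitted from the right-hand side with the defaults."""
    if len(args) > len(DEFAULTS):
        raise ValueError("at most seven parameters are accepted")
    merged = list(args) + DEFAULTS[len(args):]
    settings = dict(zip(NAMES, merged))
    # parm3..parm7 are numeric option codes, in menu order.
    for name in NAMES[2:]:
        settings[name] = int(settings[name])
    return settings


# Only three parameters given: the last four take their default values.
print(resolve_parameters(["myfile1.dat", "myfile1.dis", "2"]))
```

Because the parameters are positional, an option in the middle cannot be skipped: to change parm5, for example, parm1 to parm4 must also be given.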
For instance, DISTANCE myfile1.dat myfile1.dis 2 1 2 will execute DISTANCE with input file = myfile1.dat, writing the results to the output file = myfile1.dis, reading the sequences in sequential PHYLIP format, using the Jukes-Cantor distance measure, and using only the first base of each codon. By default, no translation into amino acids will be performed (parm6 = 1) and the brief output format (parm7 = 1) will be used.

Although several other programs are available to compute distances from nucleic acid sequences (DNADIST in the PHYLIP package, for instance), DISTANCE has several interesting features. It can use three different input formats, including the two standard PHYLIP formats, compute six different distances and, also, compute the matrix of variances for four of them. In order to provide more information on which distance to use, and for further application in other programs, DISTANCE shows all the nucleotide pair matrices (including deletions) when the large output option is chosen, as well as the number of synonymous and non-synonymous substitutions.

DISTANCE comprises the following files:

- DISTANCE.PAS : main source program.
- DISTANCE.EXE : the executable file.
- DISTANCE.DOC : help text file for the program.
- UNITDIST.PAS : source unit with some variables and types.
- UNITDIST.TPU : compiled unit UNITDIST.PAS.
- STANDNUC.TAB, DROSOMIT.TAB, YEASTMIT.TAB, MAMMIT.TAB, CILIATED.TAB : text files with the five different translation codes.
- HELP.PAS : unit with the help text.
- HELP.TPU : compiled unit of HELP.PAS.
- PHYNEW.SEQ : example of sequence data file in PHYLIP interleaved format.
- PHYOLD.SEQ : example of sequence data file in PHYLIP aligned format.
- MSF.SEQ : example of sequence data file in MSF format.

Copies of the source code and executable files can be obtained from the authors by sending a floppy disk (either 3.5" or 5.25") or by electronic mail (Internet addresses: GONZALEZ@EVALSB.GENETI.UV.ES and LBUENO@VM.CI.UV.ES).
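
As an illustration of the two simplest distance measures named in this note, the following Python sketch estimates the Jukes-Cantor and Kimura two-parameter distances, with their sampling variances, for one pair of aligned sequences, using the standard published formulas. It is not taken from the DISTANCE source code: the function names are hypothetical, and skipping every site that holds a gap or an ambiguous symbol is an assumption that may differ from the program's own treatment of such sites.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def count_differences(seq1, seq2):
    """Return (sites, transitions, transversions) for two aligned sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    sites = transitions = transversions = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue  # assumption: ignore gaps and ambiguous symbols
        sites += 1
        if a == b:
            continue
        pair = {a, b}
        if pair <= PURINES or pair <= PYRIMIDINES:
            transitions += 1      # A<->G or C<->T
        else:
            transversions += 1    # purine <-> pyrimidine
    return sites, transitions, transversions


def jukes_cantor(seq1, seq2):
    """Jukes-Cantor distance and its variance.

    Raises ZeroDivisionError if no site is comparable and ValueError
    if the sequences are too divergent for the logarithm to exist.
    """
    n, ts, tv = count_differences(seq1, seq2)
    p = (ts + tv) / n                 # proportion of differing sites
    w = 1.0 - 4.0 * p / 3.0
    distance = -0.75 * math.log(w)
    variance = p * (1.0 - p) / (n * w * w)
    return distance, variance


def kimura_two_parameter(seq1, seq2):
    """Kimura (1980) two-parameter distance and its variance."""
    n, ts, tv = count_differences(seq1, seq2)
    P, Q = ts / n, tv / n             # transition and transversion proportions
    a = 1.0 - 2.0 * P - Q
    b = 1.0 - 2.0 * Q
    distance = -0.5 * math.log(a) - 0.25 * math.log(b)
    c1, c2 = 1.0 / a, 1.0 / b
    c3 = (c1 + c2) / 2.0
    variance = (c1 * c1 * P + c3 * c3 * Q - (c1 * P + c3 * Q) ** 2) / n
    return distance, variance


# Two short made-up sequences, for illustration only.
s1 = "ACGTACGTACGTACGTACGT"
s2 = "ACGTACGTACATACGTACGA"
print(jukes_cantor(s1, s2))
print(kimura_two_parameter(s1, s2))
```

Both functions return the estimate together with its variance, mirroring the pairing of distance and variance matrices in the program's output.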

References

Felsenstein J, 1990. PHYLIP Manual, version 3.3. Berkeley: University of California Press.

Genetics Computer Group, 1991. Program Manual for the GCG package, version 7.

Gojobori T, Moriyama EN, and Kimura M, 1990. Statistical methods for estimating sequence divergence. Methods in Enzymology 183: 531-550.

Gojobori T, Ishii K, and Nei M, 1982. Estimation of average number of nucleotide substitutions when the rate of substitution varies with nucleotide. J Mol Evol 18: 414-423.

Jukes TH and Cantor CR, 1969. Evolution of protein molecules. In: Mammalian protein metabolism (Munro HN, ed). New York: Academic Press; 21-123.

Kimura M, 1980. A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. J Mol Evol 16: 111-120.

Kimura M, 1981. Estimation of evolutionary distances between homologous nucleotide sequences. Proc Natl Acad Sci USA 78: 454-455.

Li WH, Luo CC, and Wu CI, 1985. Evolution of DNA sequences. In: Molecular evolutionary genetics (MacIntyre RI, ed). New York: Plenum Press; 4-16.

Nei M, 1987. Molecular Evolutionary Genetics. Washington: Columbia University Press.

Tajima F and Nei M, 1984. Estimation of evolutionary distance between nucleotide sequences. Mol Biol Evol 1: 269-285.

Takahata N and Kimura M, 1981. A model of evolutionary base substitutions and its application with special reference to rapid change of pseudogenes. Genetics 98: 641-657.