PROANALYST - QUANTITATIVE STRUCTURE-ACTIVITY RELATIONSHIPS (QSAR) IN
PROTEINS, PROTEIN ENGINEERING, PATTERNS RECOGNITION IN COMBINATORIAL
LIBRARIES, PHYSICO-CHEMICAL AND ALPHABETICAL ANALYSIS FOR MULTIPLE
SEQUENCE ALIGNMENTS AND 3D STRUCTURE
COPYRIGHT (C) 1996 Vladimir A.Ivanisenko, Alexey M.Eroshkin
Theoretical Dept., Research Institute of Molecular Biology,
SRC VB "Vector", 633159, Koltsovo, Novosibirsk region, Russia
Tel. (3822) 647774
Telex 133196 NPO SU
Fax: (3832) 328831;
E.mail: salex@vector.nsk.su, eroshkin@vector.nsk.su
DEMO version (Print and Save are disabled; the number of proteins is limited)
TABLE OF CONTENTS:
1. Introduction.
2. Menu items:
2.1 File (data loading)
2.1.1 Property (physico-chemical factor(s) selection).
2.1.2 Load protein (selection of protein family to study).
2.1.3 Load 3D-structure (selection of file with protein
3D structure).
2.1.4 View result (viewing files with the results).
2.1.5 Save protein (saving current protein subset).
2.1.6 Load protein sequence from PDB file.
2.1.7 Commander (access to MS-DOS commands).
2.2 Options (setting the parameters and functions for further
calculations).
2.2.1 Calculation (parameters and functions for calculations).
2.2.1.1 Analysis based on (setting the type of data to be used in
calculation: sequences only, 3D structure or both).
2.2.1.2 Functions for 1D-structure (selection of functions
to calculate fragment physico-chemical characteristics).
2.2.1.3 Functions for 3D-structure (selection of functions
to calculate spatial site characteristics).
2.2.1.4 Hypotheses amount (maximal number of sought sites).
2.2.1.5 Min frame (Minimal size of a sequential site).
2.2.1.6 Max frame (Maximal size of a sequential site).
2.2.1.7 Gaps treatment (ignore or exclude).
2.2.1.8 Fragments (type of searching site: linear/sequential,
discrete).
2.2.1.9 Min number of factors (for multiple regression
and discriminant analysis).
2.2.1.10 Max number of factors (for multiple regression
and discriminant analysis).
2.2.1.11 Cutoff radius (cutoff radius for neighbors in 3D
site).
2.2.1.12 Type of input atoms (type of atoms to be used in 3D
model, C-alpha, C-beta or all).
2.2.1.13 Profile smoothing (on/off profile smoothing).
2.2.2 Sequences display mode (modes of sequence displaying on
the screen).
2.2.3 Display residues exposure (turning on/off displaying the
solvent exposed amino acid residues on the screen).
2.2.4 Display 2D structure (turning on/off displaying secondary
structures on the screen).
2.2.5 Display protein name (turning on/off displaying protein
names on the screen).
2.2.6 Display protein activity (turning on/off displaying protein
activities on the screen).
2.2.7 View marked fragments (viewing current or all marked
fragments).
2.2.8 Window for protein name (the number of positions used to
display protein name).
2.2.9 Window for activities (the number of positions used to
display activities).
2.2.10 Window for graphs (dimensions).
2.2.11 Show property on the screen (value of aa physico-chemical
property for residue under cursor).
2.2.12 Sorting by (sorting the proteins according to their
activities (increase/decrease) or group numbers).
2.3 Prepare data (viewing multiple sequence alignment, protein names,
activities, group numbers, marked fragment(s) etc.; fragment(s)
choosing, splitting proteins into groups, sequence editing, etc.).
2.4 Analysis (main program block for all calculations).
2.4.1. Define factors: fragment, function and property (specifying
the factors for calculations).
2.4.2. Structure-activity (selection of procedure for finding
activity-modulating center and analysis of relationships).
2.4.2.1 Multiple linear regression analysis (physico-
chemical analysis).
2.4.2.2 Discriminant analysis (physico-chemical analysis).
2.4.2.3 Cross groups variation (alphabetical analysis).
2.4.3 Functional center.
2.4.3.1 Amino acid residues conservation in current group
2.4.4 Profile analysis (searching regions with high and low
physico-chemical properties, conservative and variable
sites, etc.).
2.4.4.1 Average profile (physico-chemical profiles).
2.4.4.2 Min profile (physico-chemical profiles).
2.4.4.3 Max profile (physico-chemical profiles).
2.4.4.4 Cross groups variation (comparison of aa
between the groups of proteins).
2.4.4.5 Variation in current group (comparison of aa
in one group of proteins).
2.4.4.6 SADC-PROFILE
2.4.4.7 Residual dispersion 3D.
2.4.4.8 Cross groups variation 3D.
2.4.4.9 Variation in current group 3D.
2.4.4.10 Normalized cross groups variation 3D.
2.4.4.11 Coordinated changes 3D.
2.4.4.12 Number of coordinated position 3D.
2.4.4.13 View profile on 3D structure.
2.4.4.14 Save profile (writing the values of profile to disk).
2.4.4.15 Dispersion profiles (physico-chemical profiles).
2.4.5 Motifs Search.
2.4.6 Sort (sorting the protein according to activity values or
group number).
2.4.7 Save last result (writing the results, obtained in automatic
mode).
2.4.8 View last result (viewing the results, obtained in automatic
mode).
2.5 View 3D-structure
2.5.1 Spatial site (viewing 3D structures and spatial sites,
selection of the sites for further calculation, etc.).
2.5.2 Simple marking (marking and viewing).
2.6 Help.
2.7 Quit.
3. General information
4. Requirements
5. Standard errors list.
6. Note from authors
7. References
1. INTRODUCTION
PROANALYST is an integrated applied system for studying quantitative
structure-activity relationships in proteins. PROANALYST provides
multivariate statistical, pattern and profile analyses; physico-chemical
and alphabetical analyses of protein sequences and 3D structures;
and the design of protein engineering experiments.
The program is a further development of the earlier described program
PROANAL (Eroshkin et al., 1993, 1995). PROANALYST examines the
relationships between protein activity and the physico-chemical
characteristics (or amino acid residue composition) of different regions
in the primary and tertiary structures (3D QSAR). The
structure-activity analysis is based on aligned protein amino acid
residue sequences, data on their activity (pK, ED50, Km or any other)
and the 3D structure of at least one of these proteins. The program is
also useful when protein families are divided by evolutionary, functional
or other criteria. The following methods are implemented: empirical
energy calculations, spatial site moment calculations, discriminant
analysis, multiple linear regression, analysis of variance (ANOVA) and
some others. Regression plots, 3D pictures and graphs of various
physico-chemical profiles for the sequences and 3D structures make it
easier for the researcher to get a picture of the problem. The
program allows the user to look for protein sites whose physico-chemical
characteristics vary little (candidates for functionally important
regions) and for regions with high or low values of these characteristics.
PROANALYST may be used for simulating protein-engineering
experiments, predicting protein activity and searching for different
protein regions such as functional sites, elements of secondary
structure, solvent-exposed regions, T- and B-cell antigenic
determinants, etc. In automatic mode PROANALYST generates and
verifies hypotheses on the location of modulating regions in the sequence
or 3D structure of a protein, and on the key characteristics of these
regions. In manual mode the researcher can generate and analyze his own
hypotheses.
The program is implemented for IBM PC or compatible computers. It is
designed to be easily handled by an occasional computer user, yet it
is powerful enough for experienced professionals.
2. MENU ITEMS
To start the program, change to the directory containing all
the necessary files (see General information), type PANALYST.EXE and
press Enter. The main menu appears on the screen:
-----------------------------------------------------------------------------
File Options Prepare data Analysis View 3D-structure Help Quit
-----------------------------------------------------------------------------
The Right and Left arrow keys navigate through the items of the
menu. A one-line description of the highlighted menu item appears
at the same time on the bottom line of the screen. Pressing the <Enter>
key confirms the selection. Menu items are available only if the program
is ready to perform the corresponding action. For instance, "Analysis" is
available only after the protein family is loaded and the regions of
interest are chosen. "View 3D-structure" is available only after a
3D-structure is loaded.
2.1 FILE
When "File" item of the main menu is selected the following submenu
appears:
Property
Load protein
Load 3D-structure
View result
Save protein
Commander
2.1.1 PROPERTY
This menu item lists the files containing physico-chemical properties of
amino acid residues (*.ppt). After a file is selected, the program
displays the list of all properties available in this file, and the user
can choose any subset of these properties. Each file can contain up to 50
different properties. The data in the file are organized in the
following format:
Comments (any number of lines, each starting with a space symbol)
Property name (no spaces are allowed at the beginning of the name)
Literature reference (the same requirement as above)
Values of amino acid residue properties, 7 positions per value (with a
space in the 1st position)
The order of the values must correspond to the following order of amino
acids: ACDEFGHIKLMNPQRSTVWY.
For example:
Any comments must have a leading space
{ Property names (from 1-st position of string)
Literature source (from 1-st position of string)
Values of amino acid residue properties }
Hydrophilicity Hopp-Woods
T.P.Hopp, K.R.Woods, PNAS 78 (1981) 3824
-.5 -1.0 3.0 3.0 -2.5 0.0 -.5 -1.8 3.0 -1.8
-1.3 .2 0.0 .2 3.0 .3 -.4 -1.5 -3.4 -2.3
Hydropathy Kyte-Doolittle
J.Kyte, R.F.Doolittle J.Mol.Biol 157 (1982) 105
1.8 2.5 -3.5 -3.5 2.8 -.4 -3.2 4.5 -3.9 3.8
1.9 -3.5 -1.6 -3.5 -4.5 -.8 -.7 4.2 -.9 -1.3
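The residue order above is what ties each number in a record to an amino acid. The following minimal Python sketch pairs the Hopp-Woods values from the example with that order. The function name values_by_residue is hypothetical, and splitting on whitespace is an assumption made for brevity (the real file uses fixed 7-character fields).

```python
# Residue order required by the manual for every property record.
AA_ORDER = "ACDEFGHIKLMNPQRSTVWY"

# The two value lines of the Hopp-Woods record from the example above.
HOPP_WOODS_LINES = [
    " -.5 -1.0 3.0 3.0 -2.5 0.0 -.5 -1.8 3.0 -1.8",
    " -1.3 .2 0.0 .2 3.0 .3 -.4 -1.5 -3.4 -2.3",
]


def values_by_residue(value_lines):
    """Map each one-letter residue code to its property value."""
    # Assumption: whitespace-separated tokens stand in for the 7-column fields.
    values = [float(tok) for line in value_lines for tok in line.split()]
    if len(values) != len(AA_ORDER):
        raise ValueError("expected 20 values, got %d" % len(values))
    return dict(zip(AA_ORDER, values))


hopp_woods = values_by_residue(HOPP_WOODS_LINES)
print(hopp_woods["D"], hopp_woods["W"])
```

A record with more or fewer than 20 values is rejected, which catches the most likely mistake when preparing a property set by hand.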
You can select any subset of the properties for the analysis or
prepare your own set of properties. The following hot keys are available:
Command keys     Effect
< Esc >          Returns to the previous menu.
< Up >           Moves the cursor one line up.
< Down >         Moves the cursor one line down.
< Enter >        Selects/cancels selection of a property for
                 inclusion in the data set.
< Alt I >        Selects all properties from a given file.
< Alt C >        Cancels the selection of all properties that
                 were chosen earlier.
2.1.2 LOAD PROTEIN
Here the user selects the protein/peptide family to be analyzed. The
data must be prepared in three separate files:
1) File of protein names (with extension *.seq). The format of the file
corresponds to that of the protein sequence database SWISS-PROT. All
lines except DE (name of a protein) and // (the end of the data for a
protein) are ignored.
For example:
DE INTERFERON ALPHA 2
//
DE INTERFERON ALPHA 1
OS HOMO SAPIENS
//
....
2) File of aligned protein sequences (with extension *.ali) in one-letter
code. The format of the file is as follows: the first line gives the length
of the aligned amino acid sequences (after the special words "Seq.file ").
The protein sequences follow, one sequence per line (even in the case of
long sequences). NO ADDITIONAL SYMBOLS SUCH AS " " (BLANK) ARE ALLOWED AT
THE END OF THE LINES. Gaps are coded by the symbol '-'. Add one or more
blank lines at the end of the file.
For example:
Seq. file 43
QCGEGLCCDQCSFIEEGTVCRIARGDDLDDYCNGRSAGCPRNP
QCGE---CDQCSFMKKGTICRRARGDDLDDYCNGRSAGCPRNP
QCGEGPCCDQCSFMKKGTICRRARGDDLDDYCNGRSAGCPRNP
QCGEGLCCDQCSFMKKGTICRRARGDDLDDYCNGISAGCPRNP
QCGEGLCCDQCSFMKKGTICRRARGDDLDDYCNGISAGCPRNP
QCADGLCCDQCRFKKKRTICRRARGDD--DRCTGQSADCPRNG
QCADGLCCDQCRFKKKTGICRIARGDFPDDRCTGLSNDCPRWN
Q--DGLCCDQCRFKKKRTICRIARGDFPDDRCTGQSADCPRWN
QCAEGLCCDQCRFKGAGKICRRARGDNPDDRCTGQSADCPRNR
QCAEGLCCDQCLFMKEGTVC-RARGDDVNDYCNGISAGCPRNP
PCATGPCCRRCKFKRAGKVCRVARGDWNNDYCTGKSCDCPRNP
3) File of protein activities (with extension *.act). The file contains
one activity value per line, one line for each protein. If you do not
need to investigate structure-activity relations or do not have activity
data, just type the ordinal numbers 1, 2, 3, 4, 5, .... Add one or more
blank lines at the end of the file.
For example:
1.602
1.699
1.716
1.748
1.982
2
2.033
2.124
2.134
2.188
2.204
2.326
In this menu item the user can also select any subset of the proteins
for analysis (use the <Enter> key to select or unselect a particular
protein). Several examples are provided on the distribution disk. They
are useful for trying the program and for seeing the structure of the
data files.
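The three file formats described above can be checked before loading them into the program. The following Python sketch reads the text of a names file, an alignment file and an activity file and verifies that they agree with each other. All function names are hypothetical, and taking the last token of the first *.ali line as the alignment length is an assumption.

```python
def parse_names(seq_text):
    """Return protein names from DE lines; every other line is ignored."""
    return [line[2:].strip() for line in seq_text.splitlines()
            if line.startswith("DE")]


def parse_alignment(ali_text):
    """Return the aligned sequences after checking the declared length."""
    lines = [ln for ln in ali_text.splitlines() if ln.strip()]
    length = int(lines[0].split()[-1])  # assumption: length is the last token
    sequences = lines[1:]
    for number, seq in enumerate(sequences, start=1):
        # A trailing blank makes the line too long, so it is caught here too.
        if len(seq) != length:
            raise ValueError("sequence %d has length %d, expected %d"
                             % (number, len(seq), length))
    return sequences


def parse_activities(act_text):
    """Return one activity value per non-blank line."""
    return [float(ln) for ln in act_text.splitlines() if ln.strip()]


names = parse_names("DE INTERFERON ALPHA 2\n//\nDE INTERFERON ALPHA 1\n"
                    "OS HOMO SAPIENS\n//\n")
sequences = parse_alignment("Seq. file 7\nQCGEGLC\nQCGE--C\n\n")
activities = parse_activities("1.602\n1.699\n\n")

# The three files must describe the same number of proteins.
if not (len(names) == len(sequences) == len(activities)):
    raise ValueError("names, sequences and activities differ in count")
```

Checking the counts up front is a design choice of this sketch; the manual does not say how the program itself reports a mismatch.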
4) File of residue solvent exposure and protein 2D structure (file name
with extension *.exp). This file is optional.
Format of the file:
The file must have two lines, each as long as the aligned protein
sequences. The first line gives the exposure of the amino acid residues
(aa) to the solvent (0 - internal aa, 1 - external). Positions
containing 1 are marked green in the sequence window. The second line is
simply displayed on the screen, so you can type any appropriate codes
for the elements of the secondary structure.
Example:
1010110010011011001100011100101110101001100110100100000000000000000010
tttaaaaaaaaaaaaaaaaaattttttttttttbbbbbbbbbtttttttttaaaaaaaaaaaaaaaattt
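As an illustration of the *.exp layout, the following Python sketch reads the two lines and lists the solvent-exposed positions. The function name parse_exposure and the 1-based position numbering are assumptions of this sketch.

```python
def parse_exposure(exp_text, alignment_length):
    """Return (exposed positions, secondary-structure string)."""
    lines = exp_text.splitlines()
    exposure, secondary = lines[0], lines[1]
    for row in (exposure, secondary):
        if len(row) != alignment_length:
            raise ValueError("line length must equal the alignment length")
    if set(exposure) - {"0", "1"}:
        raise ValueError("the exposure line may contain only 0 and 1")
    # Assumption: positions are reported 1-based, as on the program screen.
    exposed = [i + 1 for i, flag in enumerate(exposure) if flag == "1"]
    return exposed, secondary


exposed, secondary = parse_exposure("10110\nttaab\n", 5)
print(exposed, secondary)
```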
There are several special command keys which make it easier to extract
any protein subset from the initial files.
Command keys     Effect
< Esc >          Returns to the previous menu.
< Up Arrow >     Moves the cursor one line up.
< Down Arrow >   Moves the cursor one line down.
< Enter >        Selects/unselects the protein.
< Alt I >        Selects all proteins.
< Alt C >        Cancels the selection.
< PgUp >         Moves the cursor seventeen lines up.
< PgDn >         Moves the cursor seventeen lines down.
CAUTION: The current version of the program has the following
limits:
- the length of the aligned sequences must be less than or equal to 5000
amino acid residues;
- a protein name cannot have more than 80 symbols;
- the number of proteins should not exceed 500.
2.1.3 LOAD 3D-STRUCTURE
The user can load the 3D structure of one of the analyzed proteins (when
available). This structure will be used as a model for all analyzed
proteins. The files containing 3D data (.cb, .pdb) must be
in PDB format. PROANALYST can work with files containing only
C-alpha or C-beta atoms, or all protein atoms. A part of the 3D
structure can also be loaded.
2.1.4 VIEW RESULTS
This option shows all results of PROANALYST calculations (files
with the extension *.res). A whole library of results can be built up
while working with the program.
The F3 key is used to view the files.
The F4 key is used to edit the files.
2.1.5 SAVE PROTEIN
The user can save to disk any subset of the initially selected protein
family and activities in a form suitable for further use (topic 2.1.2).
All sequences from groups with numbers 1, 2, 3, etc. (but not from group
number 0) will be saved. After this option is chosen, a window
appears listing the files with the extension *.ALI. To
create a new file, type the new filename and press
ENTER. To append to an existing file, select its name and press ENTER. You
can enter the section "Save protein" only after the options "File",
"Load Protein" and "Prepare data" have been executed.
2.1.6 LOAD PROTEIN FROM PDB FILE
Loads a protein sequence from a PDB file. This option is available only
after a protein 3D-structure is loaded.
2.1.7 COMMANDER
The user can execute any DOS command from this option.
2.2 OPTIONS
This option opens the following menu for setting the parameters of
further calculations:
Calculation
Sequences display mode
Display residues exposure (on/off)
Display 2D structure (on/off)
Display protein name (on/off)
Display activity (on/off)
View marked fragments (Current/All)
Window for protein name
Window for activities
Show property on a screen
Sorting by (increase/decrease)
Each parameter has some default value.
2.2.1 CALCULATION
Parameters and functions for calculations are set in this option.
2.2.1.1 EVALUATE
User chooses the type of data to be used in calculation:
1. Only 1D-structure (default option),
2. Only 3D-structure.
3. 1D-structure and 3D-structure.
2.2.1.2 FUNCTION 1D-STRUCTURE.
Ten possible ways of calculating a fragment characteristic from the amino
acid residue sequence or composition are provided:
Average for a fragment (on/off)
Moment, Alpha-helix periodicity (on/off)
Moment, Pi-helix periodicity (on/off)
Moment, Beta-strand flat periodicity (on/off)
Moment, Beta-strand twist periodicity (on/off)
Moment, 3-10-helix periodicity (on/off)
Minimum value for a fragment (on/off)
Maximum value for a fragment (on/off)
Amplitude value for a fragment (max - min) (on/off)
Sum for a fragment (on/off)
The user can switch each function on or off. Five modes of moment
calculation are provided, each connected with a particular type
of secondary structure of the region (Schultz and Schirmer, 1979).
All five are calculated by the same formula (Eisenberg et
al., 1984) and differ only in the value of the periodicity angle (that
is, in the number of amino acid residues per turn). In this section the
user switches particular functions for calculating the fragment
characteristic on or off and chooses how gaps are processed.
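The fragment functions listed above can be sketched in Python as follows. The moment is written in the usual Eisenberg form, the length of the vector sum of property values placed at successive multiples of the periodicity angle. The manual does not list the angles PROANALYST uses, so the angle is left as a parameter. The function names are hypothetical, and the moment is not normalized by fragment length, which is an assumption.

```python
import math

# Kyte-Doolittle values from the property file example, in the manual's order.
KYTE_DOOLITTLE = dict(zip(
    "ACDEFGHIKLMNPQRSTVWY",
    [1.8, 2.5, -3.5, -3.5, 2.8, -0.4, -3.2, 4.5, -3.9, 3.8,
     1.9, -3.5, -1.6, -3.5, -4.5, -0.8, -0.7, 4.2, -0.9, -1.3]))


def fragment_functions(fragment, scale):
    """Average, minimum, maximum, amplitude and sum for a gap-free fragment."""
    values = [scale[aa] for aa in fragment]
    return {
        "average": sum(values) / len(values),
        "minimum": min(values),
        "maximum": max(values),
        "amplitude": max(values) - min(values),
        "sum": sum(values),
    }


def moment(fragment, scale, degrees_per_residue):
    """Periodic moment of a fragment for a given periodicity angle."""
    delta = math.radians(degrees_per_residue)
    sin_part = sum(scale[aa] * math.sin(n * delta)
                   for n, aa in enumerate(fragment))
    cos_part = sum(scale[aa] * math.cos(n * delta)
                   for n, aa in enumerate(fragment))
    return math.hypot(sin_part, cos_part)


stats = fragment_functions("AILK", KYTE_DOOLITTLE)
# 100 degrees per residue corresponds to 3.6 residues per turn (alpha helix).
alpha_moment = moment("AILK", KYTE_DOOLITTLE, 100.0)
print(stats, alpha_moment)
```

Keeping the angle as an argument means one function covers all five moment modes; only the number passed in changes.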
2.2.1.3 FUNCTION 3D-STRUCTURE.
This option is available only if the option "Analysis based on - 3D
structure" is chosen. Using the protein 3D coordinates, all neighbors
of each amino acid residue are found (those lying within a threshold
distance between C-alpha, C-beta or all atoms). These amino acid residues
are considered a spatial site. For this spatial site the following
characteristics can be calculated:
Average for a spatial site (on/off)
Minimum value for a spatial site (on/off)
Maximum value for a spatial site (on/off)
Amplitude value (max - min) for a spatial site (on/off)
Sum for a spatial site (on/off)
Dipole moment for a spatial site (on/off)
Empirical potential energy for ss (on/off)
Short range potential energy for ss (on/off)
Long range potential energy for ss (on/off)
Disp of L-r potential energy for ss (on/off)
Mean of L-r potential energy for ss (on/off)
The dipole moment for a spatial site is calculated from the
aa physical and chemical properties and the 3D coordinates.
Empirical potential energy functions are defined by the algorithm of
Crippen (G.M. Crippen and V.N. Viswanadhan, 1984). The empirical potential
energy consists of two terms: short-range and long-range energies.
The dispersion and mean value of the long-range potential energy are
calculated over 6 site tertiary structures obtained by
varying the amino acid residue (Ca-atom) coordinates of the initial
site. The modified Ca-atom coordinates are the initial
coordinates +/- 1 angstrom along the X, Y and Z axes. The long-range
potential energy is calculated for each new site structure, and then the
dispersion and mean value of the long-range potential energies of the 6
varied structures are calculated.
The user can switch each function on or off.
2.2.1.4 HYPOTHESES AMOUNT (MAXIMAL AMOUNT OF SOUGHT SITES)
The maximal number of protein fragments (sites) to be displayed on the
screen as the result of the automatic search. The user can select any
number up to 200. The default value is 25.
2.2.1.5 MIN FRAME and 2.2.1.6 MAX FRAME
The minimal and maximal length of protein fragments to be investigated in
the automatic search. MIN FRAME >= 1, MAX FRAME <= length of the
studied protein. The default values are: MIN FRAME=1, MAX FRAME=5.
2.2.1.7 GAPS TREATMENT
In this section the user selects how gaps are processed.
There are two modes. In mode "Ignore" gaps are
omitted (for example, the sequence ACDE--FG turns into ACDEFG). In mode
"Exclude" sequences having gaps are excluded from the calculation of the
fragment characteristic. The second mode reflects the view
that deletions greatly distort the local protein structure, so sites with
gaps cannot be analyzed adequately by such a simple procedure (and hence
there is reason not to take sites having gaps into account). The default
value is IGNORE.
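The two gap modes can be illustrated with a short Python sketch. The function name fragment_for_calculation and the use of None to signal an excluded sequence are assumptions.

```python
def fragment_for_calculation(fragment, mode="ignore"):
    """Apply the gap treatment to one fragment of one sequence."""
    if mode == "ignore":
        # Gaps are dropped and the remaining residues are used.
        return fragment.replace("-", "")
    if mode == "exclude":
        # A fragment containing any gap is left out of the calculation.
        return None if "-" in fragment else fragment
    raise ValueError("mode must be 'ignore' or 'exclude'")


print(fragment_for_calculation("ACDE--FG", "ignore"))
print(fragment_for_calculation("ACDE--FG", "exclude"))
```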
2.2.1.8 FRAGMENTS (split/merge)
There are two modes:
SPLIT - Each marked fragment is a separate site for investigation.
MERGE - Several marked fragments (with numbers 1, 2, 3, etc.) are combined
into one discrete site. Only the average, sum, min and max functions
are available in this case. The default value is SPLIT.
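The difference between the two modes can be sketched in Python as follows. The function name sites_to_investigate and the representation of a fragment as a 1-based inclusive (begin, end) pair are assumptions.

```python
def sites_to_investigate(sequence, fragments, mode="split"):
    """Return the residue strings analysed for one sequence."""
    # Each fragment is a 1-based, inclusive (begin, end) pair (assumption).
    pieces = [sequence[begin - 1:end] for begin, end in fragments]
    if mode == "split":
        return pieces                 # every fragment is its own site
    if mode == "merge":
        return ["".join(pieces)]      # one discrete site from all fragments
    raise ValueError("mode must be 'split' or 'merge'")


marked = [(1, 3), (7, 9)]
print(sites_to_investigate("QCGEGLCCDQ", marked, "split"))
print(sites_to_investigate("QCGEGLCCDQ", marked, "merge"))
```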
2.2.1.9 MIN NUMBER OF FACTORS and
2.2.1.10 MAX NUMBER OF FACTORS.
Min and max number of evaluated factors for regression and discriminant
analysis. Default values are: 1 and 1.
2.2.1.11 CUTOFF RADIUS
CUTOFF RADIUS is the threshold distance (between C-alpha, C-beta or all
atoms) for creation of spatial sites. Default value is 5 angstroms (for
C-alpha atoms).
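A spatial site built with a cutoff radius can be sketched in Python as follows, with one coordinate triple per residue (for example its C-alpha atom). The function name spatial_sites is hypothetical, and including the central residue in its own site is an assumption.

```python
import math


def spatial_sites(coords, cutoff=5.0):
    """For each residue, list the residues within `cutoff` angstroms of it."""
    sites = []
    for center in coords:
        # Assumption: the central residue belongs to its own site
        # (its distance to itself is 0).
        site = [j for j, other in enumerate(coords)
                if math.dist(center, other) <= cutoff]
        sites.append(site)
    return sites


# Four residues on a line; the first three are 3.8 angstroms apart.
ca_atoms = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (20.0, 0.0, 0.0)]
print(spatial_sites(ca_atoms))
```

The loop compares every pair of residues, which is adequate for a sketch; a large structure would call for a spatial grid or similar index.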
2.2.1.12 TYPE OF INPUT ATOMS
Selects the type of atoms (CA, CB or ALL) to be used in the
analysis of tertiary structure. Choose what is necessary in your case.
Default value is: CA.
2.2.1.13 PROFILE SMOOTHING
Switches profile smoothing on or off. The value S(i) at position i
of the unsmoothed profile is equal to the protein site characteristic
calculated for the window [i,i+current frame length]. The formula for
smoothing is:
SS(i)=(S(i)+S(i-1)+S(i-2)+...+S(i-current frame length))/current frame length
Default value is: OFF.
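The smoothing step can be sketched in Python as follows. The formula as printed lists terms from S(i) down to S(i - current frame length) but divides by the frame length, so the exact window is ambiguous. This sketch assumes a trailing average over exactly "frame" values and, at the start of the profile, averages the values that exist. The average is used as the site characteristic, the window is assumed to hold exactly "frame" residues, and both function names are hypothetical.

```python
def raw_profile(values, frame):
    """S(i): average property value in the window of `frame` residues at i."""
    return [sum(values[i:i + frame]) / frame
            for i in range(len(values) - frame + 1)]


def smooth(profile, frame):
    """SS(i): trailing average of the last `frame` profile values."""
    smoothed = []
    for i in range(len(profile)):
        # Assumption: near the start, fewer than `frame` values are averaged.
        window = profile[max(0, i - frame + 1):i + 1]
        smoothed.append(sum(window) / len(window))
    return smoothed


profile = raw_profile([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], frame=2)
print(profile)
print(smooth(profile, frame=2))
```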
2.2.2 SEQUENCES DISPLAY MODE
One of the two types of displaying protein sequences on the screen may
be used: "sequence" and "change". If "sequence" mode is selected then
complete sequences are displayed. If "change" mode is selected then
only amino acid differences are displayed for second, third, fourth,
etc. sequences relative to the first one. Default value is: CHANGE.
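The "change" mode can be illustrated with a short Python sketch. The manual does not say how identical positions are drawn, so the dot used here is an assumption, as is the function name change_mode.

```python
def change_mode(sequences, same="."):
    """Show the first sequence in full and only the differences below it."""
    first = sequences[0]
    rows = [first]
    for seq in sequences[1:]:
        # Assumption: positions equal to the first sequence are drawn as `same`.
        rows.append("".join(same if a == b else b
                            for a, b in zip(first, seq)))
    return rows


for row in change_mode(["QCGEGLC", "QCGE--C", "QCADGLC"]):
    print(row)
```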
2.2.3 DISPLAY RESIDUES EXPOSURE (ON/OFF)
The surface amino acid residues are shown in green if the mode "ON"
is chosen (and if the corresponding *.exp file exists). Default value is: ON.
2.2.4 DISPLAY 2D STRUCTURE (on/off).
The protein 2D structure is shown above the sequences if the
mode ON is chosen (and if the corresponding *.exp file exists).
Default value is: ON.
2.2.5 DISPLAY PROTEIN NAME (on/off)
The names of proteins are displayed on the screen when the mode is ON.
Default value is: ON.
2.2.6 DISPLAY ACTIVITY (on/off)
The activities of proteins are displayed on the screen when the mode is
ON. Default value is: ON.
2.2.7 VIEW MARKED FRAGMENTS (Current/All)
There are two modes of working:
CURRENT - only current fragment will be marked in red.
ALL - all selected fragments will be marked in red.
Default value is: CURRENT.
2.2.8 WINDOW FOR PROTEIN NAME and
2.2.9 WINDOW FOR ACTIVITIES
Size of windows for protein names and protein activities. Default values
are: 15 and 5.
2.2.10 WINDOW FOR GRAPHS
The size of the window for graphs (structure-activity, profiles,
discriminant function) can be changed in this option.
Left - the coordinate of the left side of the window on the screen
(min value is 1, max value is 80). Default value is 6.
Top - the coordinate of the top side of the window on the screen
(min value is 1, max value is 25). Default value is 5.
Right - the coordinate of the right side of the window on the screen
(min value is 1, max value is 80). Default value is 74.
Down - the coordinate of the bottom side of the window on the screen
(min value is 1, max value is 25). Default value is 20.
2.2.11 SHOW PROPERTY ON THE SCREEN
User can choose amino acid property to display on a screen.
Default value is: first property in the list.
2.2.12 SORTING BY (increase/decrease)
There are two ways of sorting the proteins in the set:
INCREASE - to sort by increasing protein activities or group numbers.
DECREASE - to sort by decreasing protein activities or group numbers.
Default value is: DECREASE.
To sort the protein set press "Alt S" in the main program window
and select appropriate mode of sorting.
2.3 PREPARE DATA
This is one of the two main program units. The user can enter this
unit only after correctly choosing the data in units "File" and "Protein".
On entering "Prepare Data" the user gets a window with the
aligned amino acid sequences of the investigated protein/peptide family.
The upper line of the window shows:
Line, Col - position of the cursor in the multiple sequence alignment,
Grp - the number of the current active group (the number of groups is less
than 9),
Frg - the number of the current active fragment (the number of fragments
is less than 9),
Ppt - the value of the physico-chemical property of the amino acid under
the cursor.
Keys <Alt B> and <Alt E> mark the fragment for analysis.
Keys <Alt 1>, ... , <Alt 8> change the current fragment number.
Keys <Ctrl Alt 1>, ... , <Ctrl Alt 8> change the current
active group of proteins.
Keys <Ctrl Alt I> combine all proteins into one group (with the
current group number).
Keys <Ctrl Alt C> exclude protein(s) from the analysis or
clear the current grouping. The proteins are then marked with group
number 0; such proteins are not used in calculations, only in
predictions. To change the group number press ENTER. There are several
special command keys which facilitate the investigation of the protein
family.
Command keys Effect
< Esc > Returns to the Main menu.
< Left > Moves cursor one position to the left.
< Right> Moves cursor one position to the right.
< Up > Moves the string up.
< PgUp > Page up.
< PgDn > Page down.
< Down > Moves the string down.
< Home > Moves cursor to first position on the window.
< End > Moves cursor to the last position on the window.
< Ctrl Home > Moves cursor to the first position of the sequence.
< Ctrl End> Moves cursor to last position of the sequence.
< Alt N > Moves to the window for editing protein names.
< Alt D > Moves to the window for editing protein activities.
<Alt 1>,...,<Alt 8> Switch between fragments. User may define up to 8
different regions in a protein family (fragments
are marked by red color). Initially the number
of current fragment is equal to 1.
< Alt B >, < Alt E > Define the beginning and the end of a fragment
< Alt U > Unmark current fragment.
< Alt C > Unmark all selected fragments.
< F7 > Finds a string (word) in the sequence under the
cursor. A matrix of amino acid similarity is
used in the search (the user chooses the matrix from
the set). The symbol x should be used for undefined
positions.
ACDEFGHIKLMNPQRSTVWY Edits the protein sequence (to simulate a
protein-engineering experiment).
The window for editing protein names has the following command keys:
Command keys Effect
< Esc > Returns to main working window.
< Left > Moves cursor one position to the left.
< Right > Moves cursor one position to the right.
< Up > Moves the string up.
< Down > Moves the string down.
< Home > Moves cursor to the position of first
character of the name.
< End > Moves cursor to the position of
end character of the name.
< F4 > Starts editing.
The window for editing protein activities has the following command keys:
Command keys Effect
< Esc > Return to main working window.
< Up > Moves the string up.
< Down > Moves the string down.
< F4 > Starts editing.
2.4 ANALYSIS.
In all methods of alphabetical analysis the sequences can be compared
using matrices of amino acid similarity (MDM78, minimal
mutational distance, etc.). For physico-chemical profiles and
structure-activity analysis the factors must be defined before the
calculations.
2.4.1 FACTORS DEFINITION: FRAGMENT, FUNCTION and PROPERTY.
Here the user defines the set of factors that will be used in the
regression, discriminant and profile analyses. A factor is a
combination of three parameters: a fragment of the sequence, a function
and a physico-chemical property. The factors are selected by going
through a series of menus. First, select the fragment
(from the set of fragments defined in section PREPARE DATA). Then
select the function(s) used to calculate site characteristics. Finally,
select the physico-chemical properties used to calculate site
characteristics. All selected physico-chemical properties will be used in
calculations with the previously chosen functions. To move through the menu
use the keys: End, Home, PgUp, PgDn, Up, Down.
Command keys Effect
< Enter > Choosing the fragment, function or property.
< Esc > Returns to previous menu.
< F1 > Help
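Since every selected property is combined with every chosen function, the factor set grows as a product of the selections. The following Python sketch enumerates it; the fragment numbers and function names are illustrative values, not program output.

```python
from itertools import product

fragments = [1, 2]                       # fragment numbers marked in PREPARE DATA
functions = ["Average", "Sum"]           # functions switched on
properties = ["Hydrophilicity Hopp-Woods", "Hydropathy Kyte-Doolittle"]

# A factor is one (fragment, function, property) combination.
factors = list(product(fragments, functions, properties))

for fragment, function, prop in factors:
    print(fragment, function, prop)
```

Two fragments, two functions and two properties already give eight factors, which is worth keeping in mind when setting MAX NUMBER OF FACTORS.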
2.4.2 STRUCTURE-ACTIVITY ANALYSIS.
This section analyzes the relationship between structure and activity in
a family of proteins/peptides.
The analysis can be run in Automatic and Manual (by hand) modes.
In automatic mode the program generates and verifies hypotheses on the
location of a sequential activity-modulating region in a protein and on
the key characteristics of this region. The search depends on the values of
MIN FRAME and MAX FRAME as well as on the types and numbers of selected
factors ("Factors definition"). A window with the results appears at
the end of the automatic search. The results are a ranked set of the best
hypotheses on the structure-activity relation. The user can mark one
hypothesis for further analysis in the manual mode (BY HAND). Remember
that after a hypothesis is marked, the set of initially selected factors
is replaced by the factors from the marked hypothesis.
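The idea of the automatic search can be sketched in Python as follows. This is not the program's actual procedure: it assumes gap-free sequences, a single factor (the average of one property over the fragment), and ranking by the absolute Pearson correlation with activity. The function names and the toy scale are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation; 0.0 when either variable is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def automatic_search(sequences, activities, scale,
                     min_frame=1, max_frame=5, hypotheses=25):
    """Rank fragments (|r|, start position, frame length) by |correlation|."""
    length = len(sequences[0])
    found = []
    for frame in range(min_frame, min(max_frame, length) + 1):
        for start in range(length - frame + 1):
            xs = [sum(scale[aa] for aa in seq[start:start + frame]) / frame
                  for seq in sequences]
            found.append((abs(pearson(xs, activities)), start + 1, frame))
    found.sort(key=lambda hypothesis: -hypothesis[0])
    return found[:hypotheses]        # keep at most HYPOTHESES AMOUNT results


toy_scale = {"A": 0.0, "K": 1.0}     # hypothetical two-letter property scale
toy_sequences = ["AAA", "AKA", "KAA", "KKA"]
toy_activities = [1.0, 2.0, 1.0, 2.0]
best = automatic_search(toy_sequences, toy_activities, toy_scale, max_frame=1)[0]
print(best)
```

In the toy data only position 2 changes together with activity, so it comes out on top.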
2.4.2.1 MULTIPLE LINEAR REGRESSION ANALYSIS.
Multiple linear regression makes it possible to estimate the correlations
(dependencies) between the variables and activity and to verify
hypotheses on the nature of activity-modulating centers. Such centers
can include different parts of the protein structure (discrete centers) or
can have more than one key amino acid property influencing protein
activity (e.g. charge and volume).
The window with the results of the automatic search can be shown via the
submenu "VIEW LAST RESULT".
In this unit the following characteristics are shown:
- protein site, property, function;
- the regression equation;
- 0.95 confidence intervals for all parameters of regression;
- all test statistics and their confidence levels;
- multiple correlation coefficient;
- coefficient of multiple determination;
- total variance of activity;
- residual mean square error (RMSE).
To visualize the results of multiple regression analysis, graphs of
projections and of the correlation line between theoretical and measured
activities are used (press the F5 and F6 keys in the window with the
textual results).
A projection is the regression line for one factor with the other factors
fixed (their mean values are substituted into the regression equation).
Command keys Effect
< Esc > Abort.
< F2 > Writes the results to the disk.
< F5 > Plot structure-activity graphs.
Press F2 in this window to print the
screen on a laser or matrix printer.
< F6 > Correlation line between theoretical and
measured activities.
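The regression equation, the coefficient of determination, the RMSE and a projection can be sketched in plain Python as follows. The fit solves the normal equations by Gauss-Jordan elimination. The RMSE is taken as sqrt(SSE / (n - p)), with p counting the intercept, which is an assumption about the program's definition. All function names are hypothetical, and collinear factors are not handled.

```python
import math


def solve(a, b):
    """Solve a*x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit(factors, activities):
    """Return (coefficients with intercept first, R^2, RMSE)."""
    rows = [[1.0] + list(f) for f in factors]
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * y for r, y in zip(rows, activities)) for i in range(p)]
    coef = solve(xtx, xty)
    predicted = [sum(c * x for c, x in zip(coef, r)) for r in rows]
    mean = sum(activities) / len(activities)
    sse = sum((y - q) ** 2 for y, q in zip(activities, predicted))
    sst = sum((y - mean) ** 2 for y in activities)
    return coef, 1.0 - sse / sst, math.sqrt(sse / (len(rows) - p))


def projection(coef, factors, which, value):
    """Predicted activity with factor `which` set to `value`, others at means."""
    means = [sum(column) / len(column) for column in zip(*factors)]
    means[which] = value
    return coef[0] + sum(c * x for c, x in zip(coef[1:], means))


# Synthetic data generated from activity = 1 + 2*x1 + 3*x2.
xs = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1)]
ys = [1.0, 3.0, 4.0, 6.0, 8.0]
coef, r_squared, rmse = fit(xs, ys)
print(coef, r_squared, rmse, projection(coef, xs, 0, 0.0))
```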
2.4.2.2 DISCRIMINANT ANALYSIS.
Discriminant analysis is used when protein activities are
given only qualitatively or when the proteins are divided into groups by
some criterion (Klecka, 1986). With this analysis it is possible to find
the site and physico-chemical factor that describe the given partition of
the proteins. The coefficients obtained for the canonical discriminant
functions can be used in further classification of proteins.
First, the proteins must be divided into two or more groups. Then
the proteins must be sorted in order of increasing group number
(press "Alt S" and select "SORT BY GROUPS").
The window with the results of the automatic search can be shown
via the submenu "VIEW LAST RESULT".
The results of discriminant analysis include:
1. Eigenvalues.
2. Ratio of eigenvalues to their sum.
3. Canonical correlation (R) %.
4. Square of canonical correlation (R^2) %.
5. Wilks' lambda statistic (S.S. Wilks).
5.1 Number of functions (k).
5.2 The values of statistics.
5.3 Statistics Chi-square (degrees of freedom and critical values for
95% and 99% confidence level).
6. Discriminant function coefficients.
7. Standardized coefficients.
8. Structural coefficients (Pearson's correlation coefficients).
This window has the following command keys:
Command keys Effect
< Esc > Abort.
< F5 > Plots graph of discriminant function.
Press F2 in this window to
print the screen on a laser or matrix printer.
< F2 > Writes the results to the disk.
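Several items of the result list (the eigenvalue ratios, the canonical
correlation, Wilks' lambda and the chi-square statistic) are connected by
standard textbook relations of canonical discriminant analysis. The Python
sketch below shows these relations only; it is an assumption that the
program computes the values in exactly this way, and the function and
argument names are hypothetical.

```python
import math


def discriminant_summary(eigenvalues, n_proteins, n_factors, n_groups):
    """Textbook statistics derived from the eigenvalues of the discriminant functions."""
    total = sum(eigenvalues)
    rows = []
    for k, lam in enumerate(eigenvalues):
        # Wilks' lambda for the functions that remain after the first k are removed.
        wilks = 1.0
        for rest in eigenvalues[k:]:
            wilks *= 1.0 / (1.0 + rest)
        # Bartlett's chi-square approximation and its degrees of freedom.
        chi2 = -(n_proteins - (n_factors + n_groups) / 2.0 - 1.0) * math.log(wilks)
        dof = (n_factors - k) * (n_groups - k - 1)
        rows.append({
            "eigenvalue": lam,
            "share": lam / total,                           # ratio to the sum
            "canonical_r": math.sqrt(lam / (1.0 + lam)),    # canonical correlation
            "wilks_lambda": wilks,
            "chi_square": chi2,
            "dof": dof,
        })
    return rows


# Hypothetical case: 30 proteins, 2 factors, 3 groups, eigenvalues 3.0 and 1.0
for row in discriminant_summary([3.0, 1.0], 30, 2, 3):
    print(row)
```

The chi-square value of each row is compared with the critical value for
its number of degrees of freedom to decide whether the remaining functions
are significant.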
2.4.2.3 CROSS GROUPS VARIATION.
At the first stage of finding activity-modulating regions it is
reasonable to use alphabetical analysis. Let us divide the protein
family into several groups of proteins with similar activities. To
calculate the inter-group variability index, protein sites (sequential
or spatial) are compared. At the first step, the number of protein pairs
(the two proteins taken from different groups) that have the same
content of amino acid residues in the given site is calculated. Then
this number is divided by the total number of all possible pairs of
proteins. This gives a number (varying from 0 to 1) that characterizes
the site variability.
The variability index is estimated from pairwise comparison of proteins
from different groups:
    I = 1 - \log_{10}\left( \frac{9}{N} \sum_{i=1}^{N} R_i + 1 \right),

    R_i = \prod_{j=1}^{M} r_{ij},

where:
N - the number of protein pairs,
M - the number of positions in the site,
r_{ij} - the element of the aa similarity matrix for position j in
pair i,
\sum - summation,
\prod - multiplication.
If r_{ij} varies in the interval [0,1], then the values of I lie in the
interval [0,1] (the logarithm is decimal, so that I=0 when all R_i=1).
In a conservative position I=0.
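As an illustration, the index can be computed with the following Python
sketch. It assumes a decimal logarithm (which matches the stated range
[0,1]) and uses a simple identity comparison as a stand-in for the
similarity matrix; the function names are hypothetical and the sketch is
not the program's own code.

```python
import math
from itertools import product


def identity_similarity(a, b):
    """Assumed stand-in for a similarity matrix: 1 for identical residues, else 0."""
    return 1.0 if a == b else 0.0


def variability_index(pairs, similarity=identity_similarity):
    """I = 1 - log10(9 * mean(R_i) + 1); R_i is the product of r_ij over the site."""
    total = 0.0
    for site_a, site_b in pairs:
        r = 1.0
        for a, b in zip(site_a, site_b):
            r *= similarity(a, b)
        total += r
    return 1.0 - math.log10(9.0 * total / len(pairs) + 1.0)


def cross_group_pairs(group_a, group_b):
    """All pairs of sites with one protein taken from each of two groups."""
    return list(product(group_a, group_b))


# Two-residue site in two hypothetical groups of proteins
pairs = cross_group_pairs(["FR", "FR"], ["CA", "FR"])
print(round(variability_index(pairs), 4))  # 0.2596
```

With more than two groups, the pairs from every combination of two
different groups would be pooled before calling variability_index.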
The following matrices are implemented in the program:
ONE - uniform matrix,
PHY-CHEM - physico-chemical similarity, based on McLachlan's matrix,
ESAB - matrix of evolutionarily related aa.
The window with the results of automatic search can be shown in
submenu "VIEW LAST RESULT".
Command keys Effect
< Esc > Abort.
< F2 > Writes the result to the disk.
2.4.3 FUNCTIONAL CENTER
2.4.3.1 AMINO ACID RESIDUES CONSERVATION IN CURRENT GROUP
Amino acid residue conservation in the current group is calculated in
the same way as in the CROSS GROUPS VARIATION procedure (see 2.4.2.3),
but the pairs of proteins are taken from the current group. Regions with
low variability indexes are considered conservative.
The window with the results of automatic search can be shown in
submenu "VIEW LAST RESULT".
2.4.4 PROFILE ANALYSIS.
The profile for the primary structure is built using a sliding window
(the investigated site) of 1, 2, 3, 4 and so on amino acids. For
tertiary structure profiles the procedure is as follows. All amino acids
that are neighbors of the given position (or positions) are found using
a threshold distance between C-alpha atoms, C-beta atoms or all atoms of
the amino acid residues (three different cases are implemented) in the
protein 3D structure. These amino acid residues are considered as a
spatial activity-modulating site. The spatial profile is the result of
consecutive calculation of the characteristics described earlier for all
spatial sites.
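Both kinds of site can be sketched in a few lines of Python. The sketch
is only an illustration under stated assumptions: the mean is used as an
example of a fragment function, distances are measured between C-alpha
atoms only, the center residue itself is not listed among its neighbors,
and all names are hypothetical.

```python
import math


def sliding_window_profile(values, window):
    """Mean property value in every window of the given length along the sequence."""
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


def spatial_site(coords, center, cutoff):
    """Indices of residues whose C-alpha atom lies within cutoff of the center residue."""
    return [
        i for i, xyz in enumerate(coords)
        if i != center and math.dist(xyz, coords[center]) <= cutoff
    ]


# Hypothetical property values for a four-residue sequence, window of 2
print(sliding_window_profile([1.0, 2.0, 3.0, 4.0], 2))  # [1.5, 2.5, 3.5]

# Hypothetical C-alpha coordinates (in angstroms), cutoff radius 5.0
ca = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (0.0, 4.0, 0.0), (10.0, 0.0, 0.0)]
print(spatial_site(ca, center=0, cutoff=5.0))  # [1, 2]
```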
2.4.4.1 AVERAGE PROFILE
Physico-chemical or structural profiles are calculated for all
individual proteins (with the exception of proteins with group number
0). The resulting average profile is the average value, at each point,
of all these profiles.
2.4.4.2 MIN PROFILE
The resulting profile is equal to minimal value (not average) in each
point of all individual profiles.
2.4.4.3 MAX PROFILE
The resulting profile is equal to maximal value in each point
of all individual profiles.
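The average, minimal and maximal profiles differ only in the function
applied at each point. A minimal Python sketch follows; it assumes that
proteins with group number 0 have already been removed and that all
profiles have the same length, and the function name is hypothetical.

```python
def combine_profiles(profiles, how):
    """Combine per-protein profiles point by point with 'average', 'min' or 'max'."""
    reducers = {
        "average": lambda column: sum(column) / len(column),
        "min": min,
        "max": max,
    }
    reduce_column = reducers[how]
    # zip(*profiles) yields, for every profile point, the values from all proteins.
    return [reduce_column(column) for column in zip(*profiles)]


profiles = [[1.0, 4.0, 2.0], [3.0, 0.0, 2.0]]   # two hypothetical proteins
print(combine_profiles(profiles, "average"))  # [2.0, 2.0, 2.0]
print(combine_profiles(profiles, "min"))      # [1.0, 0.0, 2.0]
print(combine_profiles(profiles, "max"))      # [3.0, 4.0, 2.0]
```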
2.4.4.4 CROSS GROUPS VARIATION
2.4.4.5 VARIATION IN CURRENT GROUP
The algorithms for profile calculation are the same as in 2.4.2.3 (CROSS
GROUPS VARIATION), with the only difference that the sites are moving
windows in the protein structure (1D or 3D). When MAX FRAME > MIN FRAME,
the profile is the average of the profiles calculated for the set of
windows (with lengths varying from MIN FRAME to MAX FRAME). On entering
this section the user needs to select the matrix of aa similarity (from
the suggested catalog of names).
2.4.4.6 SADC-PROFILE
For calculation of the Structure-Activity Determination Coefficient
profile (SADC profile) the proteins are divided into groups with the
same amino acid content in the given site. The SADC profile reflects the
relation between the variance of protein activity and the variability of
amino acid residues in the site (to be published).
Amino acid comparison in all the methods of alphabetical analysis can be
done using matrices of amino acid similarity (physico-chemical, minimal
mutational distance, etc.). In all these methods the sites can be
determined from the protein sequence as well as from the tertiary
structure. After entering this section the user needs to select the
matrix of aa similarity (from the suggested catalog of names). Then it
is necessary to input the threshold value for site similarity.
Command keys Effect
< Esc > Abort.
< F5 > Print screen.
2.4.4.7 RESIDUAL DISPERSION
This parameter is the residual dispersion of activities after separation
of the proteins into groups. The procedure for calculating the Residual
Dispersion statistic is taken from (W.R. Klecka, 1986). Briefly, the
proteins are divided into groups with the same amino acid content in the
given site (sequential or spatial). The matrix of amino acid similarity
is used in the calculations.
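The exact formula is not given in this manual. As a rough illustration
only, the Python sketch below groups proteins by identical site content
(the similarity matrix and threshold are ignored) and returns the pooled
within-group variance of the activities, which is one common definition
of a residual dispersion. Both the definition and the names are
assumptions.

```python
from collections import defaultdict


def residual_dispersion(sites, activities):
    """Pooled within-group variance of activities; groups share the same site content."""
    groups = defaultdict(list)
    for site, activity in zip(sites, activities):
        groups[site].append(activity)
    # Sum of squared deviations of each activity from the mean of its own group.
    ss_within = 0.0
    for values in groups.values():
        mean = sum(values) / len(values)
        ss_within += sum((v - mean) ** 2 for v in values)
    dof = len(activities) - len(groups)
    if dof <= 0:
        raise ValueError("need more proteins than groups")
    return ss_within / dof


# Four hypothetical proteins, two different site contents
print(residual_dispersion(["FR", "FR", "CA", "CA"], [1.0, 3.0, 10.0, 12.0]))  # 2.0
```

A site that separates the proteins into groups with similar activities
gives a low value.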
2.4.4.8 CROSS GROUPS VARIATION 3D
2.4.4.9 VARIATION IN CURRENT GROUP 3D
These methods have the same meaning as in the case of sequences.
2.4.4.10 NORMALIZED CROSS GROUPS VARIATION 3D
Normalized cross groups variation (used only for 3D sites) is equal to
the cross groups variation (calculated as described earlier) multiplied
by the average value of the intra-group conservation indexes. The
intra-group conservation index is defined as 1 minus the variability
index.
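The definition amounts to one line of arithmetic, shown here as a Python
sketch with hypothetical names and input values.

```python
def normalized_cross_variation(cross_variation, intra_variabilities):
    """Cross groups variation times the mean intra-group conservation (1 - variability)."""
    conservation = [1.0 - v for v in intra_variabilities]
    return cross_variation * sum(conservation) / len(conservation)


# Cross groups variation 0.5; variability indexes 0.0 and 0.5 inside two groups
print(normalized_cross_variation(0.5, [0.0, 0.5]))  # 0.375
```

A site scores high only if it varies between the groups and is conserved
inside each of them.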
2.4.4.11 COORDINATED CHANGES 3D
The values of the profile are the maximal correlation coefficients (in
%) between mutations in the given position and mutations in neighboring
positions of the 3D structure (Sander, 1994). A high correlation
coefficient indicates that there are other positions (close to the first
one in the 3D structure) with coordinated changes.
Example of coordinated changes:
M D
F R
F R
C A
C A
All positions within spatial site are tested for presence of coordinated
changes with the center of the site.
2.4.4.12 NUMBER OF COORDINATED POSITION 3D
The number of coordinated positions with a correlation coefficient
higher than the cutoff parameter is calculated. The correlation is
calculated using the algorithm described in the previous section.
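The manual does not give the correlation formula. The Python sketch below
shows one simple way to measure coordinated changes, in the spirit of
correlated-mutation analysis: for every pair of sequences the residues of
a column are compared (1 if identical, 0 otherwise), and the resulting
vectors of two columns are correlated. The identity comparison, the
handling of constant columns and all names are assumptions.

```python
import math
from itertools import combinations


def pair_similarities(column):
    """Identity (1/0) of the residues of one column for every pair of sequences."""
    return [1.0 if a == b else 0.0 for a, b in combinations(column, 2)]


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either vector has no variance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        return 0.0  # a fully conserved (or fully variable) column carries no signal
    return cov / math.sqrt(var_x * var_y)


def coordinated_changes(center_column, neighbor_columns, cutoff):
    """Maximal correlation (in %) and the number of neighbors above the cutoff."""
    s_center = pair_similarities(center_column)
    scores = [
        100.0 * pearson(s_center, pair_similarities(col))
        for col in neighbor_columns
    ]
    best = max(scores) if scores else 0.0
    return best, sum(1 for s in scores if s > cutoff)


# The two columns of the example (M F F C C and D R R A A) plus a conserved column
best, count = coordinated_changes("MFFCC", ["DRRAA", "GGGGG"], cutoff=50.0)
print(best, count)  # a correlation close to 100 and one coordinated position
```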
2.4.4.13 VIEW PROFILE ON 3D STRUCTURE
Each profile calculated in this program can be displayed on the protein
3D structure. The user needs to input two numbers that divide the amino
acid residues of the protein into three sets: positions with high values
on the profile, positions with low values on the profile, and positions
with intermediate values on the profile. The user can then see the
protein 3D structure with these three types of amino acid residues
marked by different colors (and line widths). It is also possible to
define only two sets of positions (for example, to distinguish only the
high values on the profile).
The user can easily change the modes of displaying the protein 3D
structure (see SPATIAL SITE in menu item VIEW 3D).
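The division of positions by two numbers can be sketched in Python as
follows. How the program treats values that are exactly equal to a
threshold is not stated, so the boundary handling here is an assumption,
as are the names.

```python
def split_by_thresholds(profile, low_cut, high_cut):
    """Divide profile positions into low, intermediate and high sets."""
    sets = {"low": [], "intermediate": [], "high": []}
    for position, value in enumerate(profile):
        if value >= high_cut:
            sets["high"].append(position)
        elif value <= low_cut:
            sets["low"].append(position)
        else:
            sets["intermediate"].append(position)
    return sets


# Hypothetical profile of four positions, thresholds 0.2 and 0.8
print(split_by_thresholds([0.1, 0.5, 0.9, 0.3], 0.2, 0.8))
# {'low': [0], 'intermediate': [1, 3], 'high': [2]}
```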
2.4.4.14 SAVE PROFILE
In this submenu the user can save to disk each profile calculated in the
program.
Command keys Effect
< Esc > Abort.
< F3 > View file from the catalog.
< F4 > Edit the file.
< F9 > New mask.
< New name > Type new name in command line to save profile
to new file (otherwise the profile will be
added to existing file).
< Enter > Save profile.
2.4.4.15 DISPERSIONS PROFILE
Physico-chemical or structural profiles are calculated for all
individual proteins (with the exception of proteins with group number
0). The resulting dispersion profile is equal to variance of
values in each point of these profiles.
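A minimal Python sketch of the dispersion profile is given below. The
manual does not say whether the population or the sample variance is
used; the population variance is assumed here, and proteins with group
number 0 are assumed to be removed beforehand.

```python
from statistics import pvariance


def dispersion_profile(profiles):
    """Variance, at each profile point, of the values from all proteins."""
    # zip(*profiles) groups together the values of all proteins at one point.
    return [pvariance(column) for column in zip(*profiles)]


print(dispersion_profile([[1.0, 4.0], [3.0, 0.0]]))  # [1.0, 4.0]
```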
2.4.5 MOTIFS SEARCH (IN CURRENT GROUP).
This menu item serves to find common motifs in the current group of
sequences (in fragment No 1). The minimal and maximal lengths of the
motifs to search for should be defined beforehand in menu item OPTIONS
(MIN FRAME, MAX FRAME). The results of the search can be seen in menu
item VIEW LAST RESULT. The table shows the motif frequency (the number
of proteins from the current group possessing this motif) and the motif
sequence. Frequent motifs, which are present in most of the sequences,
are placed in the upper part of the table. Variable positions are shown
by the symbol "-".
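A simplified version of such a search can be sketched in Python. The
sketch counts only exact motifs (variable positions marked by "-" are
not handled), and its names and ordering of ties are assumptions; the
program's own algorithm is not described in this manual.

```python
from collections import Counter


def common_motifs(sequences, min_frame, max_frame):
    """Motifs of length min_frame..max_frame with the number of sequences containing each."""
    counts = Counter()
    for seq in sequences:
        found = set()  # each motif is counted once per sequence
        for length in range(min_frame, max_frame + 1):
            for start in range(len(seq) - length + 1):
                found.add(seq[start:start + length])
        counts.update(found)
    # Most frequent motifs first; ties are ordered alphabetically.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Three hypothetical sequences, motifs of length 3 only
for motif, frequency in common_motifs(["ACDE", "ACDF", "GCDE"], 3, 3):
    print(frequency, motif)
```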
2.4.6 SORT
To sort the protein set, press "Alt S" in the main program window
and select the appropriate sorting mode.
By activity - sorting by protein activity,
By groups - sorting by protein group number,
By motifs - sorting by the number of motifs in the sequence.
Choose what you need, then press ENTER and return. You will get a
reordered list of proteins in the multiple alignment. Sorting by protein
group numbers is an absolutely necessary step if you are going to
perform discriminant analysis.
2.4.7 SAVE LAST RESULT
Any result found in automatic mode can be saved to disk in this
submenu item (the file may also be edited here).
Command keys Effect
< Esc > Abort.
< F3 > View file from the catalog.
< F4 > Edit the file.
< F9 > Select new mask for filename.
< New name > Type new name in command line to save result
to new file; otherwise the result will be
added to existing file.
< Enter > Save result.
2.4.8 VIEW LAST RESULT
This item shows the result found in automatic mode. The result is shown
as a short table. Each line has a brief description of the result of one
calculation (one hypothesis). To see a detailed description, move the
cursor to the desired line and press F3. Then you can see the
corresponding graph, if available. You can also mark (press ENTER) one
hypothesis for further investigation in manual mode ("by hand"). After a
hypothesis is chosen, the set of initially selected factors (marked
site(s), amino acid factors and functions, see section 2.4.1) will be
replaced by the site(s) and factors from the selected hypothesis. The
user can then analyze these factors BY HAND (in manual mode). The window
with the results of automatic search can be shown at any time.
2.5 VIEW 3D-STRUCTURE.
2.5.1 SPATIAL SITE
This module is designed for viewing the 3D structure of proteins and
spatial sites. Protein sites marked in section PREPARE DATA (or VIEW
LAST RESULT) will be shown here in color. The graphical module is highly
flexible: there are three types of residues in the protein that may be
treated separately:
A (SITE CENTERS) - sites marked earlier in the PREPARE DATA menu item,
or just now in this window (press ENTER on the desired position), or as
the result of selecting a site in submenu VIEW RESULTS;
B (NEIGHBORS) - the residues that are close to the SITE CENTERS
(the cutoff radius is given in menu items OPTIONS, CALCULATION);
C (PROTEIN) - all other residues of the protein.
The style of displaying residues of types A, B and C can be changed
independently (see the help lines at the top and the bottom of the
screen). The amino acid sequence of the protein is shown on the screen
(the sequence is taken from the PDB file, not from the *.ali file). The
colors of the letters correspond to the colors of the residues in the
stereo picture.
This module can also be used independently to display protein structures
(without loading any protein family for analysis).
User has options to:
- display alternatively C-alpha or all-atom protein models;
- rotate the structure, change the dimensions of the picture, stereo
angle, distance between stereo pictures (even to display mono pictures,
when the distance is zero);
- change the width of the main chain and the backbone, the colors of
types A, B and C residues (the site, its neighbors and the rest of the
molecule);
- change the size and colors of C-alpha atoms;
- create new spatial site (by moving cursor through backbone and
pressing ENTER);
- print picture to printer.
The following function keys are used in the program (as shown on the
bottom line of the screen):
-----------------------------------------------------------------------------
F2-PROTEIN F3-SITE CENTERS F4-NEIGHBORS F5-OPTIONS F6-PRINT F10-QUIT
-----------------------------------------------------------------------------
Command key Effect
F2-PROTEIN changes the modes to display the residues of the type C.
F3-SITE CENTERS changes the modes to display the residues that are
CENTERS OF THE SITES marked earlier in PREPARE DATA or
here (type A).
F4-NEIGHBORS changes the modes to display the residues that are
within cutoff radius from CENTERS OF THE SITES (type B).
F5-OPTIONS changes the numeration, picture size, distance between the
stereo pictures, stereo angle.
F6-PRINT prints the picture to printer (laser or matrix).
F10-QUIT returns back.
After pressing F2, F3 or F4 the following submenu appears:
-----------------------------------------------------------------------------
COLOR:SIDE CH,CA,BACKBONE,BACKGRND; WIDTH: SIDE CH,BACKBONE,CA;CHANGE:Grey +,-
-----------------------------------------------------------------------------
The first part (COLOR) serves to change the colors of the lines in the
chosen type of residues. The user can change the colors of side chains
(SIDE CH), of C-alpha atoms (CA) and of the protein backbone (BACKBONE).
The color of the screen background can also be changed (BACKGRND).
The second part (WIDTH) serves to change the width of the lines in the
chosen type of residues. The user can change the width of side chains
(SIDE CH), of the protein backbone (BACKBONE) or of C-alpha atoms (CA).
To change any picture element, move the cursor to the corresponding
position (using the LEFT and RIGHT ARROW keys) and press <GREY-> or
<GREY+>. By pressing <GREY-> or <GREY+> several times you can get the
desired size, color, numeration and width of the lines and C-alpha
atoms. The line width has three settings: heavy lines, thin lines and no
lines (the last setting lets you remove some parts of the structure from
the picture if necessary).
In submenu item OPTIONS (F5) the following submenu appears:
-----------------------------------------------------------------------------
NUMERATION SIZE DISTANCE-BETWEEN-PICTURES STEREO-ANGLE;CHANGE:GREY <-,->,+,-
-----------------------------------------------------------------------------
To change a picture element, move the cursor to the corresponding
position (using the LEFT and RIGHT ARROW keys) and press <GREY-> or
<GREY+>. By pressing <GREY-> or <GREY+> several times you can set the
desired options: numeration, picture size, distance between the stereo
pictures and stereo angle. (If you increase the stereo angle to 90
degrees, you will get two pictures from different sides of the protein.)
2.5.2 SIMPLE MARKING
This module serves for marking different groups of amino acid residues
without relation to the window with the sequences and without automatic
marking of spatial neighbors (as in the SPATIAL SITE module). All amino
acid residues are divided into three groups that can be treated
separately: LOW GROUP, MEAN GROUP and HIGH GROUP. The status of the
current group is displayed at the top left of the screen. The style of
displaying residues of these three groups can be changed independently.
Working in this module is the same as in module SPATIAL SITE.
The help line at the bottom of the screen is:
-----------------------------------------------------------------------------
F2-MEAN GROUP F3-HIGH GROUP F4-LOW GROUP F5-OPTIONS F6-PRINT F10-QUIT
-----------------------------------------------------------------------------
To change the options of the current group, press F2, F3 or F4
respectively. The following submenu appears, which is the same as in
module SPATIAL SITE.
-----------------------------------------------------------------------------
COLOR:SIDE CH,CA,BACKBONE,BACKGRND; WIDTH: SIDE CH,BACKBONE,CA;CHANGE:Grey +,-
-----------------------------------------------------------------------------
To change only the current group status (mean, high, low), press
F2 and <Esc>, F3 and <Esc>, or F4 and <Esc>.
The colors of the groups in the 3D picture coincide with the colors of
the letters in the protein sequence at the bottom of the screen.
2.6 HELP.
Help information can be invoked by pressing the F1 key in each submenu.
In addition, the last line of the screen shows short help for each menu
option.
2.7 QUIT.
To exit from the program go to menu option QUIT and press Enter.
3. GENERAL INFORMATION
The distribution diskette contains several necessary and supplementary
files. All files should be in the same directory. Files with examples
are included.
4. REQUIREMENTS
PROANALYST runs on the IBM PC family of computers, including XT, AT.
PROANALYST requires DOS 3.3 or higher and at least 560K of RAM; it will run
on any 80-column monitor but graphics require EGA/VGA monitor. A hard disk
is recommended for enhancing performance of the program.
5. STANDARD ERRORS LIST
ERROR=0 - the program finished successfully.
In case of problems the program reports the following errors:
ERROR= -1 - file input/output error. The program cannot find the file
with the given filename. Recommendation: check the filename and correct
it.
ERROR= -2 - there is not enough RAM (random access memory) in your computer
to execute the program. Recommendations: unload resident programs, use
high memory, or divide your protein sequences into two overlapping parts and
investigate them separately.
ERROR= -79 - errors in data files: 1. The number of sequences is not
equal to the number of values in the activity file. 2. There are lines
without data in the activity file. Recommendation: check the activity file
and correct it (if the activity is not known for some member of the
family, input a number such as 9999, but do not use this protein in
structure-activity analysis).
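The two conditions behind ERROR= -79 can be checked before starting the
program, for example with the Python sketch below. The layout of the
activity file is not described in this section, so the sketch assumes one
numeric value per line; the function name and the messages are
hypothetical.

```python
def check_activities(activity_lines, n_sequences, placeholder=9999.0):
    """Return (problems, line numbers holding the 'activity unknown' placeholder)."""
    problems = []
    values = []  # (line number, value) for every line that holds a number
    for number, line in enumerate(activity_lines, start=1):
        text = line.strip()
        if not text:
            problems.append(f"line {number}: no data")
            continue
        try:
            values.append((number, float(text)))
        except ValueError:
            problems.append(f"line {number}: not a number")
    if len(values) != n_sequences:
        problems.append(
            f"{len(values)} activity values for {n_sequences} sequences"
        )
    unknown = [number for number, value in values if value == placeholder]
    return problems, unknown


# Hypothetical activity file for four sequences with an empty second line
problems, unknown = check_activities(["1.5", "", "9999", "2.0"], n_sequences=4)
print(problems)  # ['line 2: no data', '3 activity values for 4 sequences']
print(unknown)   # [3]
```

The proteins listed in the second result carry the placeholder value and
should be excluded from structure-activity analysis.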
6. NOTE FROM AUTHORS
The program is constantly growing, so the manual may have some
differences with the current state of the program.
If you've found errors or have any proposals to improve PROANALYST
program please contact Ivanisenko V.A. or Eroshkin A.M., Research Institute
of the Molecular Biology, SRC VB "Vector", Koltsovo, Novosibirsk region,
633159, Russia.
Tel.(3832)-64-77-74
Telex 133196 NPOSU
Fax: (3832) - 328831;
E.mail: salex@vector.nsk.su (to Vladimir Ivanisenko),
eroshkin@vector.nsk.su (to Alexey Eroshkin)
FEEL FREE TO CALL US. We'll be very glad to hear from you anyway.
7. REFERENCES
Eroshkin A.M., Fomin V.I., Zhilkin P.A., Ivanisenko V.V., Kondrachin Y.V.
(1995) PROANAL version 2: multifunctional program for analysis of
multiple protein sequence alignments and for studying the
structure-activity relationships in protein families. CABIOS, v.11,
n.1, 39-44.
Eroshkin A.M., Zhilkin P.A., Fomin V.I. (1993) Algorithm and computer
program Pro_Anal for analysis of relationship between structure and
activity in a family of proteins or peptides. CABIOS, v.9, n. 5, 491-
497.
Crippen G.M. and Viswanadhan V.N. (1984) A potential function for
conformational analysis of proteins. Int. J.Pept. Prot. Res. 24, 279-296.
Bolshev,L.N. and Smirnov,N.B. (1983) Tables of mathematics statistics
(in Russian), Moscow, Nauka, p.284.
Eisenberg,D., Schwarz,E., Komaromy,M., Wall,R. (1984) Analysis of
membrane and surface protein sequences with the hydrophobic moment plot.
J. Mol. Biol., 179, 125-142.
Klecka W.R. Discriminant analysis. Seventh Printing, 1986.
Kidera,A., Konishi,Y., Oka,M., Ooi,T., Scheraga,H.A. (1985) Relation
between sequence similarity and structural similarity in proteins. Role
of important properties of amino acids. J. Prot. Chem., 4, 23-55.
Schulz,G.E. and Schirmer,R.H. (1979) Principles of Protein
Structure, Springer-Verlag, New York.
Kendall,M. (1970) Rank correlation methods, Griffin London.