
Text File | 1996-08-05 | 58KB | 1,361 lines

PROANALYST - QUANTITATIVE STRUCTURE-ACTIVITY RELATIONSHIPS (QSAR) IN PROTEINS,
PROTEIN ENGINEERING, PATTERN RECOGNITION IN COMBINATORIAL LIBRARIES,
PHYSICO-CHEMICAL AND ALPHABETICAL ANALYSIS FOR MULTIPLE SEQUENCE ALIGNMENTS
AND 3D STRUCTURE

COPYRIGHT (C) 1996 Vladimir A.Ivanisenko, Alexey M.Eroshkin
Theoretical Dept., Research Institute of Molecular Biology,
SRC VB "Vector", 633159, Koltsovo, Novosibirsk region, Russia
Tel. (3822) 647774  Telex 133196 NPO SU  Fax: (3832) 328831
E.mail: salex@vector.nsk.su, eroshkin@vector.nsk.su

DEMO version (Print and Save are disabled, the number of proteins is limited)

TABLE OF CONTENTS:

1. Introduction.
2. Menu items:
  2.1 File (data loading)
    2.1.1 Property (physico-chemical factor(s) selection).
    2.1.2 Load protein (selection of protein family to study).
    2.1.3 Load 3D-structure (selection of file with protein 3D structure).
    2.1.4 View result (viewing files with the results).
    2.1.5 Save protein (saving current protein subset).
    2.1.6 Load protein sequence from PDB file.
    2.1.7 Commander (access to MS-DOS commands).
  2.2 Options (setting the parameters and functions for further calculations).
    2.2.1 Calculation (parameters and functions for calculations).
      2.2.1.1 Analysis based on (setting the type of data to be used in calculation: sequences only, 3D structure or both).
      2.2.1.2 Functions for 1D-structure (selection of functions to calculate fragment physico-chemical characteristics).
      2.2.1.3 Functions for 3D-structure (selection of functions to calculate spatial site characteristics).
      2.2.1.4 Hypotheses amount (maximal number of sites to keep).
      2.2.1.5 Min frame (minimal size of a sequential site).
      2.2.1.6 Max frame (maximal size of a sequential site).
      2.2.1.7 Gaps treatment (ignore or exclude).
      2.2.1.8 Fragments (type of site to search for: linear/sequential, discrete).
      2.2.1.9 Min number of factors (for multiple regression and discriminant analysis).
      2.2.1.10 Max number of factors (for multiple regression and discriminant analysis).
      2.2.1.11 Cutoff radius (cutoff radius for neighbors in 3D site).
      2.2.1.12 Type of input atoms (type of atoms to be used in 3D model: C-alpha, C-beta or all).
      2.2.1.13 Profile smoothing (profile smoothing on/off).
    2.2.2 Sequences display mode (modes of displaying sequences on the screen).
    2.2.3 Display residues exposure (turning on/off the display of solvent-exposed amino acid residues).
    2.2.4 Display 2D structure (turning on/off the display of secondary structures).
    2.2.5 Display protein name (turning on/off the display of protein names).
    2.2.6 Display protein activity (turning on/off the display of protein activities).
    2.2.7 View marked fragments (viewing current or all marked fragments).
    2.2.8 Window for protein name (the number of positions used to display protein name).
    2.2.9 Window for activities (the number of positions used to display activities).
    2.2.10 Window for graphs (dimensions).
    2.2.11 Show property on the screen (value of aa physico-chemical property for the residue under the cursor).
    2.2.12 Sorting by (sorting the proteins according to their activities (increase/decrease) or group numbers).
  2.3 Prepare data (viewing multiple sequence alignment, protein names, activities, group numbers, marked fragment(s) etc.; choosing fragment(s), splitting proteins into groups, sequence editing, etc.).
  2.4 Analysis (main program block for all calculations).
    2.4.1 Define factors: fragment, function and property (specifying the factors for calculations).
    2.4.2 Structure-activity (selection of procedure for finding activity-modulating center and analysis of relationships).
      2.4.2.1 Multiple linear regression analysis (physico-chemical analysis).
      2.4.2.2 Discriminant analysis (physico-chemical analysis).
      2.4.2.3 Cross groups variation (alphabetical analysis).
    2.4.3 Functional center.
      2.4.3.1 Amino acid residues conservation in current group
    2.4.4 Profile analysis (searching regions with high and low physico-chemical properties, conservative and variable sites, etc.).
      2.4.4.1 Average profile (physico-chemical profiles).
      2.4.4.2 Min profile (physico-chemical profiles).
      2.4.4.3 Max profile (physico-chemical profiles).
      2.4.4.4 Cross groups variation (comparison of aa between the groups of proteins).
      2.4.4.5 Variation in current group (comparison of aa in one group of proteins).
      2.4.4.6 SADC-PROFILE
      2.4.4.7 Residual dispersion 3D.
      2.4.4.8 Cross groups variation 3D.
      2.4.4.9 Variation in current group 3D.
      2.4.4.10 Normalized cross groups variation 3D.
      2.4.4.11 Coordinated changes 3D.
      2.4.4.12 Number of coordinated position 3D.
      2.4.4.13 View profile on 3D structure.
      2.4.4.14 Save profile (writing the values of profile to disk).
      2.4.4.15 Dispersion profiles (physico-chemical profiles).
    2.4.5 Motifs Search.
    2.4.6 Sort (sorting the proteins according to activity values or group number).
    2.4.7 Save last result (writing the results obtained in automatic mode).
    2.4.8 View last result (viewing the results obtained in automatic mode).
  2.5 View 3D-structure
    2.5.1 Spatial site (viewing 3D structures and spatial sites, selecting sites for further calculation, etc.).
    2.5.2 Simple marking (marking and viewing).
  2.6 Help.
  2.7 Quit.
3. General information
4. Requirements
5. Standard errors list.
6. Note from authors
7. References

1. INTRODUCTION

PROANALYST is an integrated applied system for studying quantitative structure-activity relationships in proteins. PROANALYST provides multivariate statistical, pattern and profile analyses; physico-chemical and alphabetical analyses of protein sequences and 3D structures; and the design of protein engineering experiments. The program is a further development of the earlier described program PROANAL (Eroshkin, et al., 1993, 1995).

PROANALYST examines the relationships between protein activity and the physico-chemical characteristics (or amino acid residue composition) of different regions in the primary and tertiary structures of the proteins (3D QSAR).
The structure-activity analysis is based on aligned protein amino acid sequences, data on their activity (pK, ED50, Km or any other) and the 3D structure of at least one of these proteins. The program is also useful when protein families are divided into groups by evolutionary, functional or other criteria.

The following methods are implemented: empirical energy calculations, spatial site moment calculations, discriminant analysis, multiple linear regression, analysis of variance (ANOVA) and some others. Regression plots, 3D pictures and graphs of various physico-chemical profiles for the sequences and 3D structures make it easier for the researcher to get a picture of the problem. The program allows the user to look for protein sites in which physico-chemical characteristics are conserved (candidates for functionally important regions) and for regions with high or low values of these characteristics.

PROANALYST may be used for simulation of protein-engineering experiments, prediction of protein activity and the search for different protein regions such as functional sites, elements of secondary structure, solvent-exposed regions, T- and B-cell antigenic determinants, etc.

In automatic mode PROANALYST generates and verifies hypotheses on the location of modulating regions in the sequence or 3D structure of a protein, and on the key characteristics of such a region. In manual mode the researcher can generate and analyze his own hypotheses.

The program is implemented for IBM PC or compatible computers. It is designed to be easily handled by an occasional computer user, yet it is powerful enough for experienced professionals.

2. MENU ITEMS

To start the program, make the directory containing all the necessary files (see General information) the current directory, type PANALYST.EXE and press Enter.
On the screen you will see the main menu:

-----------------------------------------------------------------------------
 File   Options   Prepare data   Analysis   View 3D-structure   Help   Quit
-----------------------------------------------------------------------------

The right and left arrow keys navigate through the menu items. A one-line description of the highlighted menu item appears on the bottom line of the screen. Pressing the <Enter> key confirms the selection. Menu items are available only when the program is ready to perform the corresponding action. For instance, "Analysis" is available only after a protein family is loaded and the regions of interest are chosen. "View 3D-structure" is available only after a 3D structure is loaded.

2.1 FILE

When the "File" item of the main menu is selected, the following submenu appears:

    Property
    Load protein
    Load 3D-structure
    View result
    Save protein
    Commander

2.1.1 PROPERTY

This menu item lists the files containing physico-chemical properties of amino acid residues (*.ppt). After a file is selected, the program displays the list of all properties available in that file, and the user can choose any subset of them. Every file can contain up to 50 different properties. The data in the file are organized in the following format:

    Comments (any number of lines, each starting with a space)
    Property name (no spaces are allowed at the beginning of the name)
    Literature reference (same requirement as above)
    Values of amino acid residue properties, 7 positions per value (with a space in the 1st position)

The order of the values must correspond to the following order of amino acids: ACDEFGHIKLMNPQRSTVWY.
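
As an illustration only, the property-file layout described above could be read with a short script such as the following. This is a minimal sketch and not part of PROANALYST: the function name parse_ppt is hypothetical, and it assumes that the values are separated by whitespace and may be spread over one or more lines, which the format description does not state explicitly.

```python
# Hypothetical reader for the *.ppt layout described above (not PROANALYST code).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # order of values stated in the manual


def parse_ppt(text):
    """Return {property name: {"reference": str, "values": {aa: float}}}."""
    properties = {}
    lines = iter(text.splitlines())
    for line in lines:
        # Blank lines and lines starting with a space are treated as comments.
        if not line.strip() or line.startswith(" "):
            continue
        name = line.strip()                 # property name starts in column 1
        reference = next(lines, "").strip() # the next line is the reference
        values = []
        # Assumption: values are whitespace-separated and may span several lines.
        while len(values) < len(AMINO_ACIDS):
            row = next(lines, None)
            if row is None:
                raise ValueError(f"too few values for property {name!r}")
            values.extend(float(token) for token in row.split())
        if len(values) != len(AMINO_ACIDS):
            raise ValueError(f"expected 20 values for property {name!r}")
        properties[name] = {
            "reference": reference,
            "values": dict(zip(AMINO_ACIDS, values)),
        }
    return properties
```

Splitting on whitespace instead of reading fixed 7-character fields is a deliberate simplification; a stricter reader would slice each value line into 7-column fields.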

Example of a property file:

 Any comments must have a leading space
 {
 Property names (from 1-st position of string)
 Literature source (from 1-st position of string)
 Values of amino acid residue properties
 }
Hydrophilicity Hopp-Woods
T.P.Hopp, K.R.Woods, PNAS 78 (1981) 3824
 -.5 -1.0 3.0 3.0 -2.5 0.0 -.5 -1.8 3.0 -1.8 -1.3 .2 0.0 .2 3.0 .3 -.4 -1.5 -3.4 -2.3
Hydropathy Kyte-Doolittle
J.Kyte, R.F.Doolittle J.Mol.Biol 157 (1982) 105
 1.8 2.5 -3.5 -3.5 2.8 -.4 -3.2 4.5 -3.9 3.8 1.9 -3.5 -1.6 -3.5 -4.5 -.8 -.7 4.2 -.9 -1.3

You can select any subset of properties for the analysis or prepare your own set of properties. There are several hot keys.

Command keys   Effect
< Esc >        Return back.
< Up >         Moves the cursor one line up.
< Down >       Moves the cursor one line down.
< Enter >      Selects/cancels selection of a property for inclusion in the data set.
< Alt I >      Selects all properties from a given file.
< Alt C >      Cancels selection of all properties that were chosen earlier.

2.1.2 LOAD PROTEIN

Here the user selects the protein/peptide family to be analyzed. The data must be prepared in three separate files, plus an optional fourth one:

1) File of protein names (with extension *.seq). The format of the file corresponds to that of the protein sequence database SWISS-PROT. All lines except DE (name of a protein) and // (end of the data for a protein) are ignored. For example:

DE INTERFERON ALPHA 2
//
DE INTERFERON ALPHA 1
OS HOMO SAPIENS
//
....

2) File of aligned protein sequences (with extension *.ali) in one-letter code. The format of the file is as follows: the first line gives the length of the aligned amino acid sequences (after the special words "Seq.file "). Then come the protein sequences, one sequence per line (even for long sequences). NO ADDITIONAL SYMBOLS LIKE " " (BLANK) ARE ALLOWED AT THE END OF THE LINES. Gaps are coded by the symbol '-'. Add one or more blank lines at the end of the file. For example:

Seq. file 43
QCGEGLCCDQCSFIEEGTVCRIARGDDLDDYCNGRSAGCPRNP
QCGE---CDQCSFMKKGTICRRARGDDLDDYCNGRSAGCPRNP
QCGEGPCCDQCSFMKKGTICRRARGDDLDDYCNGRSAGCPRNP
QCGEGLCCDQCSFMKKGTICRRARGDDLDDYCNGISAGCPRNP
QCGEGLCCDQCSFMKKGTICRRARGDDLDDYCNGISAGCPRNP
QCADGLCCDQCRFKKKRTICRRARGDD--DRCTGQSADCPRNG
QCADGLCCDQCRFKKKTGICRIARGDFPDDRCTGLSNDCPRWN
Q--DGLCCDQCRFKKKRTICRIARGDFPDDRCTGQSADCPRWN
QCAEGLCCDQCRFKGAGKICRRARGDNPDDRCTGQSADCPRNR
QCAEGLCCDQCLFMKEGTVC-RARGDDVNDYCNGISAGCPRNP
PCATGPCCRRCKFKRAGKVCRVARGDWNNDYCTGKSCDCPRNP

3) File of protein activities (with extension *.act). The file contains one activity value per line, one line for each protein. If you do not need to investigate structure-activity relations or do not have activity data, just type the ordinal numbers 1, 2, 3, 4, 5, .... Add one or more blank lines at the end of the file. For example:

1.602
1.699
1.716
1.748
1.982
2
2.033
2.124
2.134
2.188
2.204
2.326

In this menu item the user can also select any subset of the proteins for analysis (use the <Enter> key to select or unselect a particular protein). There are several examples on the distribution disk. They are useful for trying the program and for looking at the structure of the data files.

4) File of residue solvent exposure and protein 2D structure (file name with extension *.exp). This file is optional. The file must have two lines, each of a length equal to the length of the aligned protein sequences. The first line gives the exposure of the amino acid residues (aa) to the solvent (0 - internal aa, 1 - external). Positions containing 1 are marked green in the sequence window. The second line is simply displayed on the screen, so you can type any appropriate codes for the elements of the secondary structure. Example:

1010110010011011001100011100101110101001100110100100000000000000000010
tttaaaaaaaaaaaaaaaaaattttttttttttbbbbbbbbbtttttttttaaaaaaaaaaaaaaaattt

There are several special command keys which facilitate the extraction of any protein subset from the initial files.

Command keys     Effect
< Esc >          Return back.
< Up Arrow >     Moves cursor one line up.
< Down Arrow >   Moves cursor one line down.
< Enter >        Selects/unselects the protein.
< Alt I >        Selects all proteins.
< Alt C >        Cancels selection.
< PgUp >         Moves cursor seventeen lines up.
< PgDn >         Moves cursor seventeen lines down.

CAUTION: The current version of the program has the following limits:
- the length of the aligned sequences must be less than or equal to 5000 amino acid residues;
- a protein name cannot have more than 80 symbols;
- the number of proteins must not exceed 500.

2.1.3 LOAD 3D-STRUCTURE

The user can load the 3D structure of one of the analyzed proteins (when available). This structure will be used as a model for all analyzed proteins. The files containing 3D data (.cb,.pdb) must be in PDB format. PROANALYST is able to work with files containing only C-alpha atoms, only C-beta atoms or all protein atoms. A part of a 3D structure can be loaded too.

2.1.4 VIEW RESULTS

This option allows the user to see all results of PROANALYST calculations (files with the extension *.res). A whole library of results can be built up while working with the program. The F3 key is used to view the files. The F4 key is used to edit the files.

2.1.5 SAVE PROTEIN

The user can save to disk any subset of the initially selected protein family and its activities in a form suitable for further use (topic 2.1.2). All sequences from groups with numbers 1, 2, 3, etc. (except the group with number 0) will be saved. After this option is selected, a window appears listing the files with the extension *.ALI. To create a new file, type the new filename and press ENTER. To append to an existing file, select its name and press ENTER. You can enter the section "Save protein" only after the options "File", "Load Protein" and "Prepare data" have been executed.

2.1.6 LOAD PROTEIN FROM PDB FILE

Load protein sequence from PDB file.
This option is available only after a protein 3D structure is loaded.

2.1.7 COMMANDER

The user can execute any DOS command in this option.

2.2 OPTIONS

In this option the user has the following menu for entering parameters for further calculations:

    Calculation
    Sequences display mode
    Display residues exposure (on/off)
    Display 2D structure (on/off)
    Display protein name (on/off)
    Display activity (on/off)
    View marked fragments (Current/All)
    Window for protein name
    Window for activities
    Show property on a screen
    Sorting by (increase/decrease)

Each parameter has a default value.

2.2.1 CALCULATION

Parameters and functions for calculations are set in this option.

2.2.1.1 EVALUATE

The user chooses the type of data to be used in the calculation:

    1. Only 1D-structure (default option).
    2. Only 3D-structure.
    3. 1D-structure and 3D-structure.

2.2.1.2 FUNCTION 1D-STRUCTURE.

Ten possible ways of calculating a fragment characteristic from the amino acid residue sequence or composition are taken into account:

    Average for a fragment (on/off)
    Moment, Alpha-helix periodicity (on/off)
    Moment, Pi-helix periodicity (on/off)
    Moment, Beta-strand flat periodicity (on/off)
    Moment, Beta-strand twist periodicity (on/off)
    Moment, 3-10-helix periodicity (on/off)
    Minimum value for a fragment (on/off)
    Maximum value for a fragment (on/off)
    Amplitude value for a fragment (max - min) (on/off)
    Sum for a fragment (on/off)

The user can switch each type of function on or off. Five modes of moment calculation are provided, each connected with a particular type of secondary structure of the region (Schultz and Schirmer, 1979). These characteristics are calculated by the same formula (Eisenberg et al., 1984) and differ only in the value of the periodicity angle (or the number of amino acid residues per turn). In this section the user switches particular functions for calculating the fragment characteristic on or off and chooses how to process gaps.

2.2.1.3 FUNCTION 3D-STRUCTURE.
This option is available only if the option "Analysis based on - 3D structure" is chosen. Using the protein 3D coordinates, all neighbors of each amino acid residue are determined (via a threshold distance between C-alpha, C-beta or all atoms). These amino acid residues are considered a spatial site. For this spatial site the following characteristics can be calculated:

    Average for a spatial site (on/off)
    Minimum value for a spatial site (on/off)
    Maximum value for a spatial site (on/off)
    Amplitude value (max - min) for a spatial site (on/off)
    Sum for a spatial site (on/off)
    Dipole moment for a spatial site (on/off)
    Empirical potential energy for ss (on/off)
    Short range potential energy for ss (on/off)
    Long range potential energy for ss (on/off)
    Disp of L-r potential energy for ss (on/off)
    Mean of L-r potential energy for ss (on/off)

The dipole moment for a spatial site is calculated from the aa physico-chemical properties and the 3D coordinates. The empirical potential energy functions are defined by the algorithm of Crippen (G.M. Crippen and V.N. Viswanadhan, 1984). The empirical potential energy consists of two terms: short-range and long-range energies.

The dispersion and mean value of the long-range potential energy are calculated over 6 site tertiary structures obtained by varying the amino acid residue (Ca-atom) coordinates of the initial site. The modified Ca-atom coordinates are calculated as the initial coordinates +/- 1 angstrom along the axes X, Y and Z. The long-range potential energy is calculated for each new site structure. Then the dispersion and mean value of the long-range potential energies of the 6 varied structures are calculated.

The user can switch each type of function on or off.

2.2.1.4 HYPOTHESES AMOUNT (MAXIMAL AMOUNT OF SOUGHT SITES)

Maximal number of protein fragments (sites) to be displayed on the screen as the result of an automatic search. The user can select any number up to 200. Default value is 25.
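
The construction of spatial sites described in 2.2.1.3 can be illustrated with a short sketch. It is not the program's own code: the function names are hypothetical, only one atom per residue is considered (for example C-alpha), and it assumes that the central residue is itself a member of its site, which the manual does not state. Only the five simple characteristics (average, minimum, maximum, amplitude, sum) are shown; the dipole moment and energy functions are not covered.

```python
import math


def spatial_sites(coords, cutoff=5.0):
    """For each residue, list the indices of residues within `cutoff` angstroms.

    coords: one (x, y, z) tuple per residue, e.g. C-alpha positions.
    Assumption: the central residue counts as a member of its own site.
    """
    return [
        [j for j, other in enumerate(coords) if math.dist(center, other) <= cutoff]
        for center in coords
    ]


def site_characteristics(property_values):
    """Average, minimum, maximum, amplitude and sum of a property over one site."""
    low, high = min(property_values), max(property_values)
    total = sum(property_values)
    return {
        "average": total / len(property_values),
        "min": low,
        "max": high,
        "amplitude": high - low,
        "sum": total,
    }


# Hypothetical example: four residues on a line, 3.8 angstroms apart except the last.
ca_coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (20.0, 0.0, 0.0)]
hydropathy = [1.0, 3.0, -2.0, 4.5]  # made-up property values, one per residue

sites = spatial_sites(ca_coords, cutoff=5.0)
site_of_residue_1 = [hydropathy[j] for j in sites[1]]
summary = site_characteristics(site_of_residue_1)
```

The all-against-all distance loop is quadratic in the number of residues, which is adequate for a sketch at the sequence lengths the manual allows.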

2.2.1.5 MIN FRAME and 2.2.1.6 MAX FRAME

Minimal and maximal length of the protein fragments to be investigated in an automatic search. MIN FRAME >= 1, MAX FRAME <= length of the studied protein. Default values are: MIN FRAME=1, MAX FRAME=5.

2.2.1.7 GAPS TREATMENT

In this section the user selects how gaps are processed. There are two ways of processing gaps. In mode "Ignore" gaps are omitted (for example, the sequence ACDE--FG turns into ACDEFG). In mode "Exclude" sequences having gaps are excluded from the calculation of the fragment characteristic. The second way is based on the view that deletions greatly distort local protein structure, so sites with gaps cannot be analyzed adequately by such a simple procedure (and hence there is reason not to take sites with gaps into account). Default value is IGNORE.

2.2.1.8 FRAGMENTS (split/merge)

There are two modes of working:

    SPLIT - Each marked fragment is a separate site for investigation.
    MERGE - Several marked fragments (with numbers 1, 2, 3, etc.) are combined into one discrete site. Only the average, sum, min and max functions are available in this case.

Default value is: SPLIT.

2.2.1.9 MIN NUMBER OF FACTORS and 2.2.1.10 MAX NUMBER OF FACTORS.

Minimal and maximal number of factors evaluated in regression and discriminant analysis. Default values are: 1 and 1.

2.2.1.11 CUTOFF RADIUS

CUTOFF RADIUS is the threshold distance (between C-alpha, C-beta or all atoms) used to create spatial sites. Default value is 5 angstroms (for C-alpha atoms).

2.2.1.12 TYPE OF INPUT ATOMS

The type of atoms (CA, CB or ALL) that will be used in the analysis of tertiary structure. Choose what is necessary in your case. Default value is: CA.

2.2.1.13 PROFILE SMOOTHING

Switches profile smoothing On or Off. The value S(i) at position i of the unsmoothed profile is equal to the protein site characteristic calculated for the window [i,i+current frame length].
The formula for smoothing is:

    SS(i)=(S(i)+S(i-1)+S(i-2)+...+S(i-current frame length))/current frame length

Default value is: OFF.

2.2.2 SEQUENCES DISPLAY MODE

Protein sequences can be displayed on the screen in one of two modes: "sequence" and "change". If "sequence" mode is selected, complete sequences are displayed. If "change" mode is selected, only the amino acid differences relative to the first sequence are displayed for the second, third, fourth, etc. sequences. Default value is: CHANGE.

2.2.3 DISPLAY RESIDUES EXPOSURE (on/off)

The surface amino acid residues are shown in green if the mode ON is chosen (and if the corresponding file *.exp exists). Default value is: ON.

2.2.4 DISPLAY 2D STRUCTURE (on/off)

The protein 2D structure is shown above the sequences if the mode ON is chosen (and if the corresponding file *.exp exists). Default value is: ON.

2.2.5 DISPLAY PROTEIN NAME (on/off)

The names of the proteins are displayed on the screen if the mode ON is chosen. Default value is: ON.

2.2.6 DISPLAY ACTIVITY (on/off)

The activities of the proteins are displayed on the screen if the mode ON is chosen. Default value is: ON.

2.2.7 VIEW MARKED FRAGMENTS (Current/All)

There are two modes of working:

    CURRENT - only the current fragment is marked in red.
    ALL - all selected fragments are marked in red.

Default value is: CURRENT.

2.2.8 WINDOW FOR PROTEIN NAME and 2.2.9 WINDOW FOR ACTIVITIES

Sizes of the windows for protein names and protein activities. Default values are: 15 and 5.

2.2.10 WINDOW FOR GRAPHS

The size of the window for graphs (structure-activity, profiles, discriminant function) can be changed in this option.

    Left - the coordinate of the left side of the window on the screen (min value is 1, max value is 80). Default value is 6.
    Top - the coordinate of the top side of the window on the screen (min value is 1, max value is 25). Default value is 5.
    Right - the coordinate of the right side of the window on the screen (min value is 1, max value is 80). Default value is 74.
    Down - the coordinate of the bottom side of the window on the screen (min value is 1, max value is 25). Default value is 20.

2.2.11 SHOW PROPERTY ON THE SCREEN

The user can choose the amino acid property to display on the screen. Default value is: first property in the list.

2.2.12 SORTING BY (increase/decrease)

There are two ways of sorting the proteins in the set:

    INCREASE - sort by increasing protein activities or group numbers.
    DECREASE - sort by decreasing protein activities or group numbers.

Default value is: DECREASE. To sort the protein set, press "Alt S" in the main program window and select the appropriate sorting mode.

2.3 PREPARE DATA

This is one of the two main program units. The user can enter this unit only after the data have been correctly chosen in the units "File" and "Protein". On entering "Prepare Data" the user gets a window with the aligned amino acid sequences of the investigated protein/peptide family. The upper line of the window shows:

    Line, Col - position of the cursor in the multiple alignment,
    Grp - the number of the current active group (the number of groups is less than 9),
    Frg - the number of the current active fragment (the number of fragments is less than 9),
    Ppt - the value of the physico-chemical property of the amino acid under the cursor.

Keys <Alt B> and <Alt E> are used to mark a fragment for analysis.
Keys <Alt 1>, ... , <Alt 8> are used to change the current fragment number.
Keys <Ctrl Alt 1>, ... , <Ctrl Alt 8> are used to change the current active group of proteins.
Keys <Ctrl Alt I> are used to combine all proteins into one group (with the current group number).
Keys <Ctrl Alt C> are used to exclude protein(s) from the analysis or to clear the current grouping. Such proteins are marked with group number 0; they are not used in calculations, only in predictions.

To change the group number press ENTER.

There are several special command keys which facilitate the investigation of the protein family.

Command keys            Effect
< Esc >                 Returns to the Main menu.
< Left >                Moves cursor one position to the left.
< Right >               Moves cursor one position to the right.
< Up >                  Moves the string up.
< PgUp >                Page up.
< PgDn >                Page down.
< Down >                Moves the string down.
< Home >                Moves cursor to the first position in the window.
< End >                 Moves cursor to the last position in the window.
< Ctrl Home >           Moves cursor to the first position of the sequence.
< Ctrl End >            Moves cursor to the last position of the sequence.
< Alt N >               Moves to the window for editing protein names.
< Alt D >               Moves to the window for editing protein activities.
<Alt 1>,...,<Alt 8>     Switch between fragments. The user may define up to 8 different regions in a protein family (fragments are marked in red). Initially the number of the current fragment is 1.
< Alt B >, < Alt E >    Define the beginning and the end of a fragment.
< Alt U >               Unmark current fragment.
< Alt C >               Unmark all selected fragments.
< F7 >                  Find a string (word) in the sequence under the cursor. The matrix of amino acid similarity is used in the search (the user chooses the matrix from the set). The symbol x should be used for undefined positions.
ACDEFGHIKLMNPQRSTVWY    Edit the protein sequence (to carry out a protein-engineering experiment).

The window for editing protein names has the following command keys:

Command keys   Effect
< Esc >        Returns to main working window.
< Left >       Moves cursor one position to the left.
< Right >      Moves cursor one position to the right.
< Up >         Moves the string up.
< Down >       Moves the string down.
< Home >       Moves cursor to the first character of the name.
< End >        Moves cursor to the last character of the name.
< F4 >         Starts editing.

The window for editing protein activities has the following command keys:

Command keys   Effect
< Esc >        Returns to main working window.
< Up >         Moves the string up.
< Down >       Moves the string down.
< F4 >         Starts editing.

2.4 ANALYSIS.

In all methods of alphabetical analysis, sequences can be compared using matrices of amino acid similarity (MDM78, minimal mutational distance, etc.). For physico-chemical profiles and structure-activity analysis the factors must be defined before the calculations.

2.4.1 FACTORS DEFINITION: FRAGMENT, FUNCTION and PROPERTY.

Here the user enters the set of factors that will be used in regression, discriminant and profile analysis. A factor is a combination of three parameters: a fragment of the sequence, a function and a physico-chemical property. The factors are selected through a series of menus. First, select the fragment (from the set of fragments defined in the section PREPARE DATA). Then select the function(s) used to calculate site characteristics. Then select the physico-chemical properties used to calculate site characteristics. All selected physico-chemical properties will be used in calculations with the functions chosen earlier. To move through the menu use the keys: End, Home, PgUp, PgDn, Up, Down.

Command keys   Effect
< Enter >      Chooses the fragment, function or property.
< Esc >        Returns to previous menu.
< F1 >         Help

2.4.2 STRUCTURE-ACTIVITY ANALYSIS.

The relationship between structure and activity in a family of proteins/peptides is analyzed in this section. The analysis can be run in Automatic or Manual (by hand) mode. In automatic mode the program generates and verifies hypotheses on the location of a sequential activity-modulating region in a protein, and on the key characteristics of this region. The search depends on the values of MIN FRAME and MAX FRAME as well as on the types and number of selected factors ("Factors definition"). The window with the results appears when the automatic search is finished. The results are a ranked set of the best hypotheses on the structure-activity relation. The user can mark one hypothesis for further analysis in the manual mode (BY HAND).
Remember that after a hypothesis is marked, the set of initially selected factors is replaced by the factors from the marked hypothesis.

2.4.2.1 MULTIPLE LINEAR REGRESSION ANALYSIS.

Multiple linear regression makes it possible to estimate the correlations (dependencies) between the variables and activity and to verify hypotheses on the nature of activity-modulating centers. Such centers can include different parts of the protein structure (discrete centers) or can have more than one key amino acid property influencing protein activity (e.g. charge and volume). The window with the results of the automatic search can be shown in the submenu "VIEW LAST RESULT". In this unit the following characteristics are shown:

- protein site, property, function;
- the regression equation;
- 0.95 confidence intervals for all parameters of the regression;
- all test statistics and their confidence levels;
- multiple correlation coefficient;
- coefficient of multiple determination;
- total variance of activity;
- residual mean square error (RMSE).

To visualize the results of multiple regression analysis, graphs of projections and of the correlation line between theoretical and measured activities are used (press the F5 and F6 keys in the window with the textual results). A projection is the regression line for one factor with the other factors fixed (their mean values are substituted into the regression equation).

Command keys   Effect
< Esc >        Abort.
< F2 >         Writes the results to the disk.
< F5 >         Plots structure-activity graphs. Press F2 in this window to print the screen on a laser or matrix printer.
< F6 >         Correlation line between theoretical and measured activities.

2.4.2.2 DISCRIMINANT ANALYSIS.

Discriminant analysis is used when protein activities are given only qualitatively or when the proteins are divided into groups by some criterion (Klecka, 1986). With this analysis it is possible to determine the site and physico-chemical factor that describe the given partition of the proteins.
The coefficients obtained for the canonical discriminant functions can be used in further classification of proteins. First, the proteins must be divided into two or more groups. Then the proteins must be sorted in order of increasing group numbers (press "Alt S" and select "SORT BY GROUPS"). The window with the results of the automatic search can be shown in the submenu "VIEW LAST RESULT". The results of discriminant analysis contain the following information:

1. Eigenvalues.
2. Ratio of eigenvalues to their sum.
3. Canonical correlation (R) %.
4. Square of canonical correlation (R^2) %.
5. Lambda-statistic of Wilks S.S.
   5.1 Number of functions (k).
   5.2 The values of the statistic.
   5.3 Chi-square statistic (degrees of freedom and critical values for the 95% and 99% confidence levels).
6. Discriminant function coefficients.
7. Standardized coefficients.
8. Structural coefficients (Pearson's correlation coefficients).

This window has the following command keys:

Command keys   Effect
< Esc >        Abort.
< F5 >         Plots the graph of the discriminant function. Press the F2 key in this window to print the screen on a laser or matrix printer.
< F2 >         Writes the results to the disk.

2.4.2.3 CROSS GROUPS VARIATION.

At the first stage of finding activity-modulating regions it is reasonable to use alphabetical analysis. Let us divide the protein family into N groups of proteins with similar activities. To calculate the inter-group variability index, protein sites (sequential or spatial) are compared. First, the number of protein pairs (each protein of a pair taken from a different group) that have the same amino acid residue content in the given site is calculated. Then this number is divided by the total number of all possible pairs of proteins. This gives a number (varying from 0 to 1) that characterizes the site variability. The variability index is estimated from pairwise comparison of proteins from different groups.
    I = 1 - \log_{10}\left[ \frac{9}{N} \sum_{i=1}^{N} R_i + 1 \right],

    R_i = \prod_{j=1}^{M} r_{ij},

where:

N      - the number of protein pairs,
M      - the number of positions in the site,
r_{ij} - the element of the aa similarity matrix for protein pair i at position j.

If r_{ij} varies in the interval [0,1], then the values of I lie in the interval [0,1]. In a conservative position I = 0 (every R_i = 1, so the decimal logarithm of 10 is subtracted from 1).

The following matrices are implemented in the program:

ONE      - uniform matrix,
PHY-CHEM - physico-chemical similarity, based on McLachlan's matrix,
ESAB     - matrix of evolutionarily related aa.

The window with the results of the automatic search can be opened from the submenu "VIEW LAST RESULT".

Command keys   Effect
< Esc >        Abort.
< F2 >         Writes the result to the disk.

2.4.3 FUNCTIONAL CENTER

2.4.3.1 AMINO ACID RESIDUES CONSERVATION IN CURRENT GROUP

Amino acid residue conservation in the current group is calculated in the same way as in the CROSS GROUPS VARIATION procedure (see 2.4.2.3), but the pairs of proteins are taken from the current group. Regions with low variability indexes are considered conservative.

The window with the results of the automatic search can be opened from the submenu "VIEW LAST RESULT".

2.4.4 PROFILE ANALYSIS.

The profile for the primary structure is built using a sliding window (the investigated site) of 1, 2, 3, 4 and so on amino acids. For tertiary structure profiles the procedure is as follows. All amino acids that are neighbors of the given position (or positions) are determined by a threshold distance between C-alpha atoms, C-beta atoms or all atoms of the amino acid residues (three different cases are implemented) in the protein 3D structure. These amino acid residues are considered a spatial activity-modulating site. The spatial profile is obtained by calculating the characteristics described earlier for each spatial site in turn.

2.4.4.1 AVERAGE PROFILE

Physico-chemical or structural profiles are calculated for all individual proteins (with the exception of proteins with group number 0).
The resulting average profile is the average value (at each point) of all these profiles.

2.4.4.2 MIN PROFILE

The resulting profile is equal to the minimal value (not the average) at each point of all individual profiles.

2.4.4.3 MAX PROFILE

The resulting profile is equal to the maximal value at each point of all individual profiles.

2.4.4.4 CROSS GROUPS VARIATION
2.4.4.5 VARIATION IN CURRENT GROUP

The algorithms for calculating these profiles are the same as in 2.4.2.3 (CROSS GROUPS VARIATION), with the only difference that the sites are moving windows in the protein structure (1D or 3D). When MAX FRAME > MIN FRAME, the profile is the average of the profiles calculated for a set of windows (with lengths varying from MIN FRAME to MAX FRAME). On entering this section the user needs to select the matrix of aa similarity (from the suggested catalog of names).

2.4.4.6 SADC-PROFILE

To calculate the Structure-Activity Determination Coefficient profile (SADC profile), the proteins are divided into groups with the same amino acid content in the given site. The SADC profile reflects the relation between the variance of protein activity and the variability of amino acid residues in the site (to be published). In all the methods of alphabetical analysis, amino acids can be compared using matrices of amino acid similarity (physico-chemical, minimal mutational distance, etc.). In all the methods used, the sites can be determined from the protein sequence as well as from the tertiary structure.

On entering this section the user needs to select the matrix of aa similarity (from the suggested catalog of names). Then the threshold value for site similarity must be entered.

Command keys   Effect
< Esc >        Abort.
< F5 >         Print screen.

2.4.4.7 RESIDUAL DISPERSION

This parameter is the residual dispersion of activities after the proteins are separated into groups. The procedure for calculating the Residual Dispersion statistic is taken from (W.R. Klecka, 1986).
Briefly, the proteins are divided into groups with the same amino acid content in the given site (sequential or spatial). The matrix of amino acid similarity is used in the calculations.

2.4.4.8 CROSS GROUPS VARIATION 3D
2.4.4.9 VARIATION IN CURRENT GROUP 3D

These methods have the same meaning as in the case of sequences.

2.4.4.10 NORMALIZED CROSS GROUPS VARIATION 3D

Normalized cross groups variation (used only for 3D sites) is equal to the cross groups variation (calculated as described earlier) multiplied by the average value of the intra-group conservation indexes. The latter parameter, the intra-group conservation index, is defined as 1 minus the variability index.

2.4.4.11 COORDINATED CHANGES 3D

The values of the profile are the maximal correlation coefficient (in %) between mutations in the given position and mutations in neighboring positions of the 3D structure (Sander, 1994). A high correlation coefficient indicates that there are other positions (close to the first one in the 3D structure) with coordinated changes.

Example of coordinated changes:

    M  D
    F  R
    F  R
    C  A
    C  A

All positions within the spatial site are tested for the presence of coordinated changes with the center of the site.

2.4.4.12 NUMBER OF COORDINATED POSITION 3D

The number of coordinated positions with a correlation coefficient higher than the cutoff parameter is calculated. The correlation is calculated with the algorithm described in the previous section.

2.4.4.13 VIEW PROFILE ON 3D STRUCTURE

Each profile calculated in this program can be displayed on the protein 3D structure. The user needs to enter two numbers to divide the amino acid residues of the protein into three sets:

- one set - positions with high values on the profile,
- second set - positions with low values on the profile,
- third set - positions with intermediate values on the profile.

The user can then see the protein 3D structure with these three types of amino acid residues marked by different colors (and line widths).
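The division of positions by two numbers can be illustrated with a short Python sketch. The function name, the set labels and the handling of values that fall exactly on a cutoff are assumptions; the manual does not state how PROANALYST treats boundary values.

```python
def split_profile(profile, low_cut, high_cut):
    """Assign each position to the 'low', 'intermediate' or 'high' set.

    profile:  list of profile values, one per position
    low_cut:  values below this number go to the 'low' set
    high_cut: values above this number go to the 'high' set
    (Hypothetical helper; boundary values are treated as intermediate.)
    """
    if low_cut > high_cut:
        raise ValueError("low_cut must not exceed high_cut")
    sets = {"low": [], "intermediate": [], "high": []}
    for position, value in enumerate(profile, start=1):  # 1-based positions
        if value < low_cut:
            sets["low"].append(position)
        elif value > high_cut:
            sets["high"].append(position)
        else:
            sets["intermediate"].append(position)
    return sets


sets = split_profile([0.1, 0.5, 0.9, 0.3], low_cut=0.3, high_cut=0.7)
print(sets)
```

Each of the three lists could then be given its own color and line width in the 3D picture.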
It is also possible to define only two sets of positions (for example, to pick out only the high values on the profile). The user can easily change the display modes of the protein 3D structure (see SPATIAL SITE in the menu item VIEW 3D).

2.4.4.14 SAVE PROFILE

In this submenu the user can save to disk any profile calculated in the program.

Command keys   Effect
< Esc >        Abort.
< F3 >         View a file from the catalog.
< F4 >         Edit the file.
< F9 >         New mask.
< New name >   Type a new name in the command line to save the profile to a new file (otherwise the profile is appended to the existing file).
< Enter >      Save profile.

2.4.4.15 DISPERSIONS PROFILE

Physico-chemical or structural profiles are calculated for all individual proteins (with the exception of proteins with group number 0). The resulting dispersion profile is equal to the variance of the values at each point of these profiles.

2.4.5 MOTIFS SEARCH (IN CURRENT GROUP).

This menu item finds common motifs in the current group of sequences (in fragment No 1). The minimal and maximal lengths of the motifs searched for should be set beforehand in the menu item OPTIONS (MIN FRAME, MAX FRAME). The results of the search can be seen in the menu item VIEW LAST RESULT. The table gives the motif frequency (the number of proteins from the current group that contain the motif) and the motif sequence. Frequent motifs, present in most of the sequences, are placed in the upper part of the table. Variable positions are shown by the symbol "-".

2.4.6 SORT

To sort the protein set, press "Alt S" in the main program window and select the appropriate sorting mode:

By activity - sorting by protein activity,
By groups   - sorting by protein group number,
By motifs   - sorting by the number of motifs in the sequence.

Choose what you need, then press ENTER and return. You will get a reordered list of proteins in the multiple alignment. Sorting by protein group number is an absolutely necessary step if you are preparing to perform discriminant analysis.
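As an illustration of what sorting by groups produces, here is a minimal Python sketch. The record layout (name, group number, activity) and the function name are hypothetical and do not describe the program's internal data format.

```python
def sort_by_groups(proteins):
    """Order (name, group, activity) records by increasing group number."""
    # sorted() is stable: proteins with equal group numbers keep their
    # original relative order from the alignment.
    return sorted(proteins, key=lambda record: record[1])


proteins = [("P1", 2, 0.8), ("P2", 1, 0.1), ("P3", 2, 0.7), ("P4", 1, 0.2)]
for name, group, activity in sort_by_groups(proteins):
    print(name, group, activity)
```

After sorting, the members of each group form one contiguous block in order of increasing group number, which is the arrangement the discriminant analysis step expects.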
2.4.7 SAVE LAST RESULT

Any result found in automatic mode can be saved to disk in this submenu item (the file may also be edited here).

Command keys   Effect
< Esc >        Abort.
< F3 >         View a file from the catalog.
< F4 >         Edit the file.
< F9 >         Select a new mask for the filename.
< New name >   Type a new name in the command line to save the result to a new file; otherwise the result is appended to the existing file.
< Enter >      Save result.

2.4.8 VIEW LAST RESULT

Views the result found in automatic mode. The result is shown as a short table. Each line gives a brief description of the result of one calculation (one hypothesis). To see the detailed description, move the cursor to the desired line and press F3. You can then view the related graph, if available. You can also mark (press ENTER) one hypothesis for further investigation in manual mode ("by hand"). After a hypothesis is chosen, the initially selected set of factors (marked site(s), amino acid factors and functions, see section 2.4.1) is replaced by the site(s) and factors from the selected hypothesis. The user can then analyze these factors BY HAND (in manual mode). The window with the results of the automatic search can be opened at any time.

2.5 VIEW 3D-STRUCTURE.

2.5.1 SPATIAL SITE

This module is intended for viewing the 3D structure of proteins and spatial sites. Protein sites marked in the section PREPARE DATA (or VIEW LAST RESULT) are shown here in color. The graphical module is highly flexible: there are three types of residues in the protein, which may be treated separately:

A (SITE CENTERS) - sites marked earlier in the PREPARE DATA menu item, or just now in this window (press ENTER on the desired position), or as the result of selecting a site in the submenu VIEW RESULTS;
B (NEIGHBORS)    - the residues that are close to the SITE CENTERS (the cutoff radius is set in the menu items OPTIONS, CALCULATION);
C (PROTEIN)      - all other residues of the protein.

The display style of residues of types A, B and C can be changed independently (see the help lines at the top and bottom of the screen).
The amino acid sequence of the protein is shown on the screen (the sequence is taken from the PDB file, not from the *.ali file). The colors of the letters correspond to the colors of the residues in the stereo picture. This module can also be used independently to display protein structures (without loading any protein family for analysis).

The user has options to:

- display alternatively C-alpha or all-atom protein models;
- rotate the structure, change the dimensions of the picture, the stereo angle and the distance between the stereo pictures (even display mono pictures, when the distance is zero);
- change the width of the main chain and the backbone, and the colors of residues of types A, B and C (the site, its neighbors and the rest of the molecule);
- change the size and colors of C-alpha atoms;
- create a new spatial site (by moving the cursor along the backbone and pressing ENTER);
- print the picture on a printer.

The following function keys are used in the program (as shown on the bottom line of the screen):

-----------------------------------------------------------------------------
F2-PROTEIN F3-SITE CENTERS F4-NEIGHBORS F5-OPTIONS F6-PRINT F10-QUIT
-----------------------------------------------------------------------------

Command key       Effect
F2-PROTEIN        changes the display mode of the residues of type C.
F3-SITE CENTERS   changes the display mode of the residues that are CENTERS OF THE SITES marked earlier in PREPARE DATA or here (type A).
F4-NEIGHBORS      changes the display mode of the residues that are within the cutoff radius of the CENTERS OF THE SITES (type B).
F5-OPTIONS        changes the numeration, picture size, distance between the stereo pictures and stereo angle.
F6-PRINT          prints the picture on a printer (laser or matrix).
F10-QUIT          returns back.
After F2, F3 or F4 is pressed, the following submenu appears:

-----------------------------------------------------------------------------
COLOR:SIDE CH,CA,BACKBONE,BACKGRND; WIDTH: SIDE CH,BACKBONE,CA;CHANGE:Grey +,-
-----------------------------------------------------------------------------

The first part (COLOR) changes the colors of the lines in the chosen type of residues. The user can change the colors of side chains (SIDE CH), of C-alpha atoms (CA) and of the protein backbone (BACKBONE). The color of the screen background can also be changed (BACKGRND).

The second part (WIDTH) changes the width of the lines in the chosen type of residues. The user can change the width of side chains (SIDE CH), of the protein backbone (BACKBONE) or of C-alpha atoms (CA).

To change any picture element, move the cursor to the corresponding position (using the LEFT and RIGHT ARROW keys) and press <GREY-> or <GREY+>. By pressing <GREY-> or <GREY+> several times you can get the desired size, color, numeration and width of the lines and C-alpha atoms. The line width has three settings: heavy lines, thin lines and no lines (the last setting lets you remove some parts of the structure from the picture if necessary).

In the submenu item OPTIONS (F5) the following submenu appears:

-----------------------------------------------------------------------------
NUMERATION SIZE DISTANCE-BETWEEN-PICTURES STEREO-ANGLE;CHANGE:GREY <-,->,+,-
-----------------------------------------------------------------------------

To change a picture element, move the cursor to the corresponding position (using the LEFT and RIGHT ARROW keys) and press <GREY-> or <GREY+>. By pressing <GREY-> or <GREY+> several times you can set the desired options: the numeration, picture size, distance between the stereo pictures and stereo angle. (By increasing the stereo angle to 90 degrees you get two pictures of the protein from different sides.)
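The roles of the STEREO-ANGLE and DISTANCE-BETWEEN-PICTURES options can be illustrated with a small Python sketch of one common way to build a side-by-side stereo pair. The manual does not say how PROANALYST computes its pictures, so the function name, the default values, the rotation about the vertical axis and the orthographic projection are all assumptions.

```python
import math


def stereo_pair(coords, stereo_angle_deg=6.0, distance=10.0):
    """Return (left, right) lists of 2D points for a side-by-side stereo pair.

    coords:           iterable of (x, y, z) atom coordinates
    stereo_angle_deg: total angle between the two views, in degrees
    distance:         horizontal distance between the two pictures
    (Hypothetical helper, not the program's own routine.)
    """
    half = math.radians(stereo_angle_deg) / 2.0

    def project(angle, x_shift):
        c, s = math.cos(angle), math.sin(angle)
        # Rotate about the vertical (y) axis, then drop z (orthographic view)
        # and shift the picture horizontally.
        return [(x * c + z * s + x_shift, y) for x, y, z in coords]

    left = project(-half, -distance / 2.0)
    right = project(+half, +distance / 2.0)
    return left, right


atoms = [(1.0, 2.0, 0.0), (0.0, 0.0, 1.0)]
left, right = stereo_pair(atoms, stereo_angle_deg=90.0, distance=4.0)
print(left)
print(right)
```

With the angle and the distance both set to zero the two projections coincide, which corresponds to a mono picture; with a 90 degree angle the two views differ by a quarter turn.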
2.5.2 SIMPLE MARKING

This module is used for marking different groups of amino acid residues without relation to the window with the sequences and without automatic marking of spatial neighbors (as in the SPATIAL SITE module). All amino acid residues are divided into three groups, which can be treated separately: LOW GROUP, MEAN GROUP and HIGH GROUP. The status of the current group is displayed at the top left of the screen. The display style of residues of these three groups can be changed independently. The module is operated in the same way as the SPATIAL SITE module.

The help line at the bottom of the screen is:

-----------------------------------------------------------------------------
F2-MEAN GROUP F3-HIGH GROUP F4-LOW GROUP F5-OPTIONS F6-PRINT F10-QUIT
-----------------------------------------------------------------------------

To change the options of a group, press F2, F3 or F4 respectively. The following submenu appears, which is the same as in the SPATIAL SITE module:

-----------------------------------------------------------------------------
COLOR:SIDE CH,CA,BACKBONE,BACKGRND; WIDTH: SIDE CH,BACKBONE,CA;CHANGE:Grey +,-
-----------------------------------------------------------------------------

To change only the current group status (mean, high, low), press F2 and <Esc>, F3 and <Esc>, or F4 and <Esc>. The colors of the groups in the 3D picture coincide with the colors of the letters in the protein sequence at the bottom of the screen.

2.6 HELP.

Help information can be invoked by pressing the F1 key in any submenu. In addition, the last line of the screen gives short help for each menu option.

2.7 QUIT.

To exit from the program, go to the menu option QUIT and press Enter.

3. GENERAL INFORMATION

The distribution diskette contains several necessary and supplementary files. All files should be in the same directory. Files with examples are included.

4. REQUIREMENTS

PROANALYST runs on the IBM PC family of computers, including XT, AT.
PROANALYST requires DOS 3.3 or higher and at least 560K of RAM; it will run on any 80-column monitor, but graphics require an EGA/VGA monitor. A hard disk is recommended to improve the performance of the program.

5. STANDARD ERRORS LIST

ERROR=0 - the program finished successfully.

In case of problems the program issues the following messages:

ERROR= -1 - file input/output error. The program cannot find the file with the given filename.
Recommendation: check the filename and correct it.

ERROR= -2 - there is not enough RAM (random access memory) in your computer to execute the program.
Recommendations: unload resident programs, use high memory, or divide your protein sequences into two overlapping parts and investigate them separately.

ERROR= -79 - errors in data files:
1. The number of sequences is not equal to the number of entries in the activity file.
2. There are lines without data in the file with activities.
Recommendation: check the file with activities and correct it (if the activity is not known for some member of the family, enter a number such as 9999, but do not use this protein in structure-activity analysis).

6. NOTE FROM AUTHORS

The program is constantly growing, so the manual may differ in places from the current state of the program. If you have found errors or have any proposals for improving the PROANALYST program, please contact Ivanisenko V.A. or Eroshkin A.M., Research Institute of Molecular Biology, SRC VB "Vector", Koltsovo, Novosibirsk region, 633159, Russia.
Tel. (3832)-64-77-74 Telex 133196 NPOSU Fax: (3832) - 328831;
E.mail: salex@vector.nsk.su (to Vladimir Ivanisenko), eroshkin@vector.nsk.su (to Alexey Eroshkin)

FEEL FREE TO CALL US. We'll be very glad to hear from you anyway.

7. REFERENCES

Eroshkin A.M., Fomin V.I., Zhilkin P.A., Ivanisenko V.V., Kondrachin Y.V.
PROANAL version 2: multifunctional program for analysis of multiple protein sequence alignments and for studying the structure-activity relationships in protein families. CABIOS, 1995, v. 11, n. 1, pp. 39-44.

Eroshkin A.M., Zhilkin P.A., Fomin V.I. (1993) Algorithm and computer program Pro_Anal for analysis of relationship between structure and activity in a family of proteins or peptides. CABIOS, v. 9, n. 5, pp. 491-497.

Crippen G.M. and Viswanadhan V.N. (1984) A potential function for conformational analysis of proteins. Int. J. Pept. Prot. Res., 24, pp. 279-296.

Bolshev L.N. and Smirnov N.B. (1983) Tables of mathematics statistics (in Russian), Moscow, Nauka, p. 284.

Eisenberg D., Schwarz E., Komaromy M., Wall R. (1984) Analysis of membrane and surface protein sequences with the hydrophobic moment plot. J. Mol. Biol., 189, pp. 125-142.

Klecka W.R. (1986) Discriminant analysis. Seventh Printing.

Kidera A., Konishi Y., Oka M., Ooi T., Scheraga H.A. (1985) Relation between sequence similarity and structural similarity in proteins. Role of important properties of amino acids. J. Prot. Chem., 4, pp. 23-55.

Schulz G.E. and Schirmer R.H. (1979) The Principles of Protein Structure, Springer-Verlag, New York.

Kendall M. (1970) Rank correlation methods, Griffin, London.