Genquire ©

Installation and Usage Manual

© 2001, National Research Council of Canada

Authors:
Mark Wilkinson (mwilkinson@gene.pbi.nrc.ca)

David Block (dblock@gene.pbi.nrc.ca)

Installer Program:

Matthew Links (mlinks@gene.pbi.nrc.ca)

Support:
Genquire GUI: Mark Wilkinson

Genquire Database: David Block

Documentation: Mark Wilkinson

BioPerl: http://www.bioperl.org

GFF: http://www.sanger.ac.uk/Software/formats/GFF

GO: http://www.geneontology.org

GO_Perl API:  Chris Mungall
 

Mailing Lists and CVS access:

http://bioinformatics.org/Genquire

INSTALLING GENQUIRE:
 

INSTALLATION IS BEST LEFT TO YOUR SYS-ADMIN IF YOU ARE ON A *NIX SYSTEM.
Root access is required to run some of the installation scripts!  This message does not apply to
MS Windows users...
 
  • The easiest way to install Genquire is to run the install.pl program. This will install all support files
    that genquire needs.  In case you are interested, however, here is the list of support files:
  • If you plan to run a local database, you must have MySQL installed, and you must be the MySQL root user.
  • Unpack the files into your final working directory.
  • If you plan to run a local database, type "perl init_db.pl" to initialize the database, and follow the instructions given.
  • Type "perl Genquire.pl" to start the program.
  • Try connecting to the sample database, listed as "SAMPLE" under the data-source menu on the startup screen.



  • It is strongly recommended that only your SysAdmin attempt to install Genquire.

    If you wish to install our sample data (and have write access to the database you are connecting to), start the Genquire program, select "Import Sequence" from the File menu on the QueryScreen window, and import the sequence "sample_contig.fas". When prompted for a name, give it the name "dy_c_5" (this is the name of the sequence in the FASTA header). The sequence has now been imported into the database. Now select "Import GFF" from the File menu and import the GFF file "sample_contig.gff". All features for this sequence have now been imported to the database. The sample contig may now be browsed in Genquire.
     
     

    BEFORE YOU BEGIN

    genquire.conf:

    genquire.conf contains path information for tools and folders on your system. This file is created when you run the install script.  You may modify it by hand using any text editor; however, using the install script ensures that the file is complete and in the correct format.  It should contain, at a minimum, the following information:
     

    {
    use lib "/path_to/bioperl-gui";
    use lib "/path_to/bioperl-live";
    use lib "/path_to/GO/perl-api";
    use lib "/path_to/Genquire/PLUGINS/";
    use lib "/path_to/Genquire";
    
    
    $TEMP_DIR = "/tmp";
    $BLAST_BINARIES = "/path_to/BLAST";
    $BROWSER = "netscape";  # command to invoke browser
    
    $BLAST_URL = "http://URL_to/blast.cgi";
    $WORKING_DIR = "/path_to/wherever";
    $PLUGINS_DIR = "/path_to/Genquire/PLUGINS/";
    $DATA_SOURCES = [["GENQUIRE_LOCAL", "gq_local.cfg"],
                     ["GENBANK_FLAT", "gb_flat.cfg"],
                    ];
    $BLAST_CONFIG = {
            # below is a list of all *acceptable* Blast parameters
            # note that some are removed because the GUI must be able
            # to receive and interpret the results.  
            # when CGI_param is not available for your particular CGI,
            # leave it undef.
            #
            # blastall_param                #[CGI_param,            default]
            '-p'                    =>       ['blastprogram',        'blastx'],
            ... see the file for details...
            ...};
    
    }
    Filling your local database from scripts can be difficult: there are several database "flags" that Genquire uses to achieve some of its features.  If you are planning to use a local DB, we highly recommend that you read the Supplementary Information at the end of this document; contact the mailing list if you have any questions.
     

    gq_local.cnf

    This file must be edited to contain your database login and password information.

    gb_flat.cnf

    This file should require no editing.

    PLUGINS/plugins.conf

    This file should be edited to reflect the availability and location of various sequence analysis tools on your system.  A sample file is shown below:

    #Plugin_Name                plugin_script,          command-line-params
    #______________________________________________________
    Blast_This_Genome,    gq_blast.pl,              /tmp/,   /home/markw/BLAST/
    Remote_Blast,                 remoteBlast.pl,      http://fingal/PANDA/pandablast.cgi,   /tmp/
    GeneMarkHMM,          GMHMM.pl,         gmhmme, /tmp/
    Import GFF,                     ImpExpGFF.pl,     /tmp/

    Of the plugins included with Genquire, the necessary command-line arguments are:
    gq_blast.pl:  requires a temp directory and the location of the blastall executable.
    remoteBlast.pl:  requires the URL of the remote Blast CGI and a temp directory.
    GMHMM.pl:  requires the command that runs the GeneMarkHMM executable on your system, and a temp directory.
    ImpExpGFF.pl:  requires a temp directory.
    THE PLUGIN SCRIPTS PROVIDED WITH GENQUIRE WILL NEED TO BE EDITED FOR YOUR LOCAL SYSTEM:  They contain some "use lib" references which you will have to modify to reflect the location of your libraries (e.g. BioPerl).  This is only a minute or two of effort.  For information on creating your own plugins, read the pod documentation in PlugInHandler.pm, which includes a comprehensive description of the relatively straightforward plugins API.
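
To illustrate the layout of plugins.conf described above, here is a minimal Perl sketch that reads lines in that style into a list of records. It assumes, based only on the sample file shown, that fields are comma-separated, that lines beginning with # are comments, and that everything after the script name is a command-line parameter. The subroutine name parse_plugin_lines is hypothetical; it is not part of Genquire or of PlugInHandler.pm.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Genquire): split plugins.conf-style
# lines into a plugin name, a script name, and a list of parameters.
sub parse_plugin_lines {
    my @lines = @_;
    my @plugins;
    for my $line (@lines) {
        next if $line =~ /^\s*#/;    # skip comment lines
        next if $line !~ /\S/;       # skip blank lines
        my @fields = split /,/, $line;
        for (@fields) { s/^\s+//; s/\s+$//; }   # trim padding around each field
        my ($name, $script, @params) = @fields;
        push @plugins, { name => $name, script => $script, params => \@params };
    }
    return @plugins;
}

# Two entries taken from the sample file above, plus its header comment.
my @plugins = parse_plugin_lines(
    '#Plugin_Name                plugin_script,          command-line-params',
    'Blast_This_Genome,    gq_blast.pl,              /tmp/,   /home/markw/BLAST/',
    'GeneMarkHMM,          GMHMM.pl,         gmhmme, /tmp/',
);

for my $p (@plugins) {
    print "$p->{name}: $p->{script} (@{ $p->{params} })\n";
}
```

A quick check like this can help confirm that each entry carries the number of parameters its plugin script expects before you start Genquire.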

    spider.pl

    In order to make whole-genome annotation queries fast enough to be usable, annotations are keyword-indexed by running spider.pl.  Newly entered annotations will not be found by queries until this script is run.  The script erases all previous indexes and rebuilds the table from scratch each time it is run.  This process takes 10-15 minutes.  You might consider running this as a cron job each night.
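
The idea behind keyword indexing can be sketched as follows. This is only an illustration of an inverted index that is rebuilt from scratch, not the actual implementation of spider.pl: the annotation ids, the sample texts, and the subroutine name build_keyword_index are made up for the example, and the real script works against database tables rather than an in-memory hash.

```perl
use strict;
use warnings;

# Hypothetical sketch: rebuild a keyword => [annotation ids] index from scratch.
sub build_keyword_index {
    my (%annotations) = @_;    # annotation id => free-text annotation
    my %index;                 # always starts empty, as a full rebuild would
    for my $id (sort keys %annotations) {
        my %seen;
        for my $word (map { lc($_) } $annotations{$id} =~ /(\w+)/g) {
            next if $seen{$word}++;          # index each word once per annotation
            push @{ $index{$word} }, $id;
        }
    }
    return \%index;
}

my $index = build_keyword_index(
    101 => 'putative protein kinase',
    102 => 'Kinase inhibitor',
);
print "kinase: @{ $index->{kinase} }\n";
```

Because the index is built once and then only read, a query for a keyword becomes a single lookup instead of a scan over every annotation, which is also why new annotations stay invisible until the next rebuild.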

    Context.pm

    Below is a list of the feature types that Genquire recognizes by default, and the feature objects that they map onto.  Features marked with * are required for the Genquire system to function correctly!
     

    *Gene        =>'GQ::Server::Gene',
    *Exon        =>'GQ::Server::Feature',
    *UTR          =>'GQ::Server::Feature',
    'TRNA Gene'   =>'GQ::Server::Gene',
    'RNA Exon'    =>'GQ::Server::Feature',
    Promoter      =>'GQ::Server::Feature',
    Intron        =>'GQ::Server::Feature',
    *Transcript   =>'GQ::Server::Transcript',
    *Poly_A_site  =>'GQ::Server::Feature',
    Misc_Feature  =>'GQ::Server::Feature',
    DEFAULT       =>'GQ::Server::Feature',

    Note that this list is case-sensitive.  It matches a set of default features in the FeatureTypes table.  If it does not match the list of feature types you plan to work with, you will need either to edit BOTH the names in the database and the %type_hash in Context.pm (around lines 10-20 of Context.pm) by hand to add new types, remove types, or change the spelling/capitalization of the equivalent types, or to make your feature types conform to this list.  Note, however, that the starred features above may not be changed, as changing them will prevent Genquire from being fully functional.
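
The case-sensitive mapping can be illustrated with a small Perl sketch. The hash contents are copied from the list above (with the asterisks omitted); the subroutine class_for_type is hypothetical, and the assumption that DEFAULT acts as the fallback for unlisted types is suggested by its name but is not spelled out in this manual.

```perl
use strict;
use warnings;

# Feature-type mapping copied from the list above (asterisks omitted).
my %type_hash = (
    Gene         => 'GQ::Server::Gene',
    Exon         => 'GQ::Server::Feature',
    UTR          => 'GQ::Server::Feature',
    'TRNA Gene'  => 'GQ::Server::Gene',
    'RNA Exon'   => 'GQ::Server::Feature',
    Promoter     => 'GQ::Server::Feature',
    Intron       => 'GQ::Server::Feature',
    Transcript   => 'GQ::Server::Transcript',
    Poly_A_site  => 'GQ::Server::Feature',
    Misc_Feature => 'GQ::Server::Feature',
    DEFAULT      => 'GQ::Server::Feature',
);

# Hypothetical lookup: exact, case-sensitive match, otherwise fall back
# to DEFAULT (an assumption about how the real module behaves).
sub class_for_type {
    my ($type) = @_;
    return exists $type_hash{$type} ? $type_hash{$type} : $type_hash{DEFAULT};
}

print class_for_type('Transcript'), "\n";
print class_for_type('gene'), "\n";    # lower case is not 'Gene', so DEFAULT applies
```

Under these assumptions a feature type spelled "gene" would not be treated as a Gene, which is why the names in the database and in %type_hash have to agree exactly.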
     

    Genquire Database Structure
     

    The Genquire database is designed for storing information related to biological sequences.  As such, the central element of the database is the contig.  A group of contigs that are joined in a tiling path make up an assembly, and a group of assemblies related in some way (e.g. by release date) are a version.  Finally, there can be several versions available for each organism of interest.

    Entering contig and supra-contig level information can be done by hand, or via scripts.  Some scripts may come with the Genquire distribution, but it is recommended that this process only be attempted by those with a good understanding of the Genquire schema.
     

    Loading a Genquire Database
    If you are using data from TIGR, you will be able to use the bulk_parse_contigs.pm file to load your data directly from the TIGR XML into the Genquire database.  If not... sorry...

    The Genquire database schema is available in the distribution, in the file schema.  An experienced MySQL administrator needs no further instructions.  If you are not so confident, run the init_db.pl script, and follow the instructions.  The init_db.pl script also enters sample organism data from the org.data file.

     
    The first step in loading data into Genquire is to create the appropriate organisms.  The file org.data in the Genquire distribution contains the table structure for the Organism table, along with the data for two plants, Arabidopsis thaliana and rice.  Other organisms can be entered as required, either interactively, using the mysql client program, or via the Genquire 'Add new organism' menu option upon database login.

    Each organism has a common name, a Latin name, a short code (a two-letter abbreviation of the Latin name, in our system), and an id, which is assigned automatically by the database.

    The organism id is used in only one table, the Assembly table.  An assembly is a series of contigs in a tiling path, with no gaps.  An assembly is therefore not the same thing as a chromosome: if there are gaps in the tiling path, you may assign multiple assemblies to a single chromosome to represent them.  If one has a complete tiling path for an organism, there will be one assembly for each chromosome.  Assemblies never cross chromosomes.  Therefore, the columns of the Assembly table are organism, chromosome (chr_id), version, and id (again, generated automatically by the database).

    The version column in the Assembly table refers to the id of the Version table, which is a place to describe the different versions of your tiling paths.  Each version is related to its organism, so organism 1 can have versions 1 and 2, and organism 2 can have unrelated versions 1, 2, and 3.

    The same contig could be placed in two different Assemblies, in different versions of the tiling path.  There is, therefore, a ContigAssembly join table that relates each contig_id to its parent assembly.
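
The role of the join table can be sketched in Perl with a few made-up rows. The rows and the helper versions_for_contig are hypothetical, and the column name assembly_id is an assumption (this manual names only contig_id for the ContigAssembly table); consult the schema file for the real definitions.

```perl
use strict;
use warnings;

# Made-up rows. The Assembly columns follow the description above;
# "assembly_id" in the join table is an assumed column name.
my @assembly = (
    { id => 1, organism => 1, chr_id => 1, version => 1 },
    { id => 2, organism => 1, chr_id => 1, version => 2 },
);
my @contig_assembly = (
    { contig_id => 7, assembly_id => 1 },
    { contig_id => 7, assembly_id => 2 },    # same contig, newer tiling path
    { contig_id => 8, assembly_id => 2 },
);

# Hypothetical helper: the versions in which a given contig is placed.
sub versions_for_contig {
    my ($contig_id) = @_;
    my %version_of;
    $version_of{ $_->{id} } = $_->{version} for @assembly;
    return sort { $a <=> $b }
           map  { $version_of{ $_->{assembly_id} } }
           grep { $_->{contig_id} == $contig_id } @contig_assembly;
}

print "contig 7 appears in versions: @{[ versions_for_contig(7) ]}\n";
```

Keeping the contig-to-assembly relation in its own table is what lets one contig row be reused across versions instead of being duplicated for each tiling path.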

    Which brings us to contigs.  The Contig table holds only an id (auto-generated), the name of the contig (which is intended to be the widely recognized designation of that contig for human researchers), and a centromere column, which is simply y or n.

    The sequence of the contig should be entered as a single string into the database, in the Sequence table, with its contig_id.

    The Tiling_Path table is the last table to be concerned about here.  The tiling path needs to be defined in terms of assembly coordinates, and preferably in chromosomal coordinates.  Each contig then has an abs_start, which is the nucleotide coordinate, in the parent assembly, of the first nucleotide of that contig sequence (seq column in Sequence table), and a length, which is the length of that sequence.  Due to overlaps, you may not want to use part of your contig.  The answer to that is the virtual contig, or VC.  The VC is defined in terms of your contig, with a VC_start and a VC_length.  So if there is no overlap and the VC is the same as the contig, the VC_start equals 1 and the VC_length is the same as length.  If there is some overlap, the VC_length is shortened by the amount of the overlap, and the VC_start may be moved, if the overlap occurs on the 5' end of the contig.

     assembly:           ##############################################################################################################...
                         1   5   10   15   20   25   30   35   40   45   50   55   60   65   70   75   80   85   90   95  100  105  110...
     contig:                                   @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
     abs_start: 24 length: 70                  1   5   10   15   20   25   30   35   40   45   50   55   60   65   70
     virtual contig:                               %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
     VC_start: 5 VC_length: 60                     1                                                         60
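
The arithmetic behind the diagram can be written out as a short Perl sketch. It assumes 1-based, inclusive coordinates, which follows from the statement above that VC_start equals 1 when the VC is the same as the contig; the subroutine vc_assembly_range is hypothetical and is not part of Genquire.

```perl
use strict;
use warnings;

# Hypothetical helper: assembly coordinates covered by a virtual contig (VC).
# All coordinates are 1-based and inclusive, as in the diagram above.
sub vc_assembly_range {
    my (%c) = @_;
    die "VC extends past the end of the contig\n"
        if $c{VC_start} + $c{VC_length} - 1 > $c{length};
    my $start = $c{abs_start} + $c{VC_start} - 1;    # first assembly position of the VC
    my $end   = $start + $c{VC_length} - 1;          # last assembly position of the VC
    return ($start, $end);
}

# Values from the diagram: abs_start 24, length 70, VC_start 5, VC_length 60.
my ($start, $end) = vc_assembly_range(
    abs_start => 24, length => 70, VC_start => 5, VC_length => 60,
);
print "VC covers assembly positions $start..$end\n";    # 28..87
```

In this example the VC drops the first 4 and the last 6 nucleotides of the 70-nucleotide contig, which is how overlaps at either end are excluded from the tiling path.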
     

    A final note: Genquire assumes that all contigs and their accompanying annotation are oriented the same way.  If some contigs in your data set are oriented in the reverse direction, you need to reverse complement those contigs, and reverse the coordinates of all of their features, prior to entering them into the database.
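
A minimal Perl sketch of that preprocessing step follows. Both subroutines are hypothetical helpers, not part of Genquire. The sketch assumes 1-based inclusive feature coordinates and a strand stored as 1 or -1 (the BioPerl convention); adapt it if your data uses + and -.

```perl
use strict;
use warnings;

# Reverse complement a DNA string. Only A, C, G and T are complemented;
# N and any other symbols are left unchanged.
sub revcomp {
    my ($seq) = @_;
    my $rc = reverse $seq;    # scalar context: reverses the string
    $rc =~ tr/ACGTacgt/TGCAtgca/;
    return $rc;
}

# Hypothetical helper: map a feature onto the reverse-complemented contig.
sub flip_feature {
    my ($start, $end, $strand, $contig_length) = @_;
    return ($contig_length - $end + 1, $contig_length - $start + 1, -$strand);
}

my $contig = 'AAACCGT';
print revcomp($contig), "\n";

# A forward-strand feature at positions 2..4 of the original contig.
my @flipped = flip_feature(2, 4, 1, length($contig));
print "@flipped\n";
```

Note that start and end swap roles during the flip, so the new start is computed from the old end; this keeps start less than or equal to end after the transformation.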

    FEATURES:

    Good luck with your installation!  Detailed questions are welcome, as are improvements to this documentation.  Please send them to me, Dave, at dblock@gene.pbi.nrc.ca, and I'll reply as soon as I can.
     


    STARTING THE PROGRAM


    Main Genquire Window "QueryScreen"

    File Menu


    Genome Map Window:

    The genome map is constructed on the fly from information in the Contig, ContigAssembly, and Assembly tables (see database documentation or contact Dave Block for details).  Mouse over a contig to display its name in the QueryScreen status bar, click it to select it, and double-click it to open it.

    Right-click on the genome map to get a drop-down menu of functions:

    NOTE:  Flagging of features requires that this functionality be present in the data-source adaptor layer.  Chances are, this feature will only be available when using the Genquire database schema.

    The Genome map also responds to clicks on mapped features (in the Features table) with the primary_tag = 'dupl'.  This can be used to store duplication information as interactive features in the database.

    Sequence Feature Display Window: Creating and Destroying Genes:

    Once you have selected a group of features which you believe belong to a gene, you can drag them to an empty portion of the Finished (blue) canvas to create a new Gene object. The gene will appear with a unique database-generated identifier as a dark blue bar, and all of its sub-features will be visible.  Dragging features onto a gene (blue bar) will create a new transcript for that gene.  Dragging features onto a transcript (grey bar) will add the features to that transcript.  Features are deleted via the right-click drop-down menu.

    Annotating Genes:

    Gene-type objects behave exactly the same way as all other on-screen objects. Double (sometimes triple) clicking them will bring up the Hand Annotation window, in which you can create tag/value pairs and Gene Ontology (GO) annotations for a given gene.

    Chat window

    Hand Annotation Window

    Sequence Context annotation window

    * this is hard-coded into the program. If you wish to modify this, you need only change the regexp in the %attr{} hash at the beginning of the ShowSequenceContext.pm module. The values of:

    donor_site => [regexp-value, "read/write"],
    acceptor_site => [regexp-value, "read/write"],

    must be valid Perl regular expressions. You should also fill in the values for donor_site_length and acceptor_site_length in order to ensure that the sites are properly highlighted.
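
As an illustration of how such regular expressions and site lengths might be used, here is a small Perl sketch. The GT and AG patterns are the canonical splice-site dinucleotides and are chosen for the example only; they are not necessarily the values shipped in ShowSequenceContext.pm. The %site_length hash is a stand-in for donor_site_length and acceptor_site_length, whose exact layout in the module is not shown in this manual, and site_positions is a hypothetical helper.

```perl
use strict;
use warnings;

# Illustrative values only (assumptions, not the module's defaults).
my %attr = (
    donor_site    => [qr/GT/i, 'read/write'],
    acceptor_site => [qr/AG/i, 'read/write'],
);
# Stand-in for donor_site_length / acceptor_site_length.
my %site_length = (donor_site => 2, acceptor_site => 2);

# Hypothetical helper: 1-based start positions of every match of a site regexp.
sub site_positions {
    my ($seq, $regexp) = @_;
    my @positions;
    # The lookahead makes each match zero-width, so overlapping sites are found.
    while ($seq =~ /(?=$regexp)/g) {
        push @positions, pos($seq) + 1;
    }
    return @positions;
}

my $seq = 'AGGTAAGT';
for my $site (sort keys %attr) {
    for my $start (site_positions($seq, $attr{$site}[0])) {
        my $end = $start + $site_length{$site} - 1;
        print "$site: highlight $start..$end\n";
    }
}
```

The length value determines how many characters are highlighted from each match start, which is why it has to agree with what the regular expression actually matches.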

    Genquire Supplementary Information:

    This information is a supplement to the installation instructions for Genquire.  It is particularly related to the data-format for certain database fields which will "automagically" cause the GUI to react in useful ways.

    The primary_tag field of the Feature table will contain many different feature types, some of which will be common to most genome databases, and are thus interpreted by Genquire in a special way.  If you wish to enable this functionality, these common features should have the following (case-sensitive) primary_tags:
     
     
    Table      primary_tag   type of feature described
    Feature    prom          promoter
    Feature    plyA          polyadenylation site
    Feature    dupl          duplicated region*

    Annot_Feature, Annot_Gene, and Annot_discard should, preferably, not be filled in by hand, as the entries for these tables come from within Genquire itself.

    *w.r.t. the annotation of duplicated regions:  additional information about duplications is held in the TagValue, Tags and Values tables as follows:
     
     
    Table      tag              value
    TagValue   Homology_to      Contig.name (from the Contigs table)
    TagValue   Homology_start   start position on the homologous contig
    TagValue   Homology_stop    end position on the homologous contig
    TagValue   Orientation      direct or inverted (this is only relevant for dups on the same chromosome)

    In addition to the tags above, there are certain Feature.primary_tags that will be completely ignored by Genquire, i.e. features with these primary_tags will not be mapped, and will not be visible on the annotation canvas.  This was done to allow display of BioPerl-parsed GenBank entries.  The ignored primary_tags are (case-sensitive):
     


    The BlastLookUp table holds the ids (primary keys) of the BlastAcc table for exon_ids which share an NCBI gi accession number from their Blast homology.  This table is difficult to generate by hand/script, and is best filled from within Genquire, if possible, using the Blast Exon(s) function on the pop-up menu.  When full, this table allows the selection of all exons on a canvas which share common Blast hits (useful for finding gene boundaries).
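
The kind of selection this table enables can be sketched in Perl. This is an in-memory illustration only: the [exon_id, gi] pairs are made up, the subroutine exons_sharing_hits is hypothetical, and the real lookup goes through the BlastLookUp and BlastAcc tables, whose columns are not reproduced here.

```perl
use strict;
use warnings;

# Hypothetical stand-in for Blast hit rows: each hit is [exon_id, gi].
# Returns the other exons that share at least one gi with $exon_id.
sub exons_sharing_hits {
    my ($exon_id, @hits) = @_;
    my (%exons_by_gi, %gis_by_exon);
    for my $hit (@hits) {
        my ($exon, $gi) = @$hit;
        $exons_by_gi{$gi}{$exon} = 1;
        $gis_by_exon{$exon}{$gi} = 1;
    }
    my %shared;
    for my $gi (keys %{ $gis_by_exon{$exon_id} || {} }) {
        $shared{$_} = 1 for keys %{ $exons_by_gi{$gi} };
    }
    delete $shared{$exon_id};    # do not report the query exon itself
    return sort { $a <=> $b } keys %shared;
}

my @hits = ([1, 'gi_A'], [2, 'gi_A'], [2, 'gi_B'], [3, 'gi_B'], [4, 'gi_C']);
print "exons sharing hits with exon 2: @{[ exons_sharing_hits(2, @hits) ]}\n";
```

Exons that repeatedly hit the same database entry are likely to belong to the same gene, which is the reasoning behind using shared hits to look for gene boundaries.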

    The Blast_vs_EST table is fairly self-explanatory: ESTs are (outside of Genquire) Blasted against the genome of the organism, and the Blast reports are parsed into this table.  When parsing Blast data into the EST and Blast_versus_EST tables, the "query" is considered to be the EST, and the "subject" is considered to be the contig/genome.  ESTs (or other transcript-based sequences like cDNAs) may then be mapped onto your annotation canvas, and used in Sim4 alignments.  Information about the EST itself is held in the EST table.  Note that both the EST table and the Blast_versus_EST table have a "source" column, and this needs to be consistent between the two tables.