Installation and Usage Manual
© 2001, National Research Council of Canada
David Block (dblock@gene.pbi.nrc.ca)
Genquire Database: David Block
Documentation: Mark Wilkinson
BioPerl: http://www.bioperl.org
Mailing Lists and CVS access:
INSTALLING GENQUIRE:
If you wish to install our sample data (and have write
access to the database you are connecting to), start the Genquire program,
select "Import Sequence" from the File menu on the QueryScreen window,
and import the sequence "sample_contig.fas". When prompted for a name,
give it the name "dy_c_5" (this is the name of the sequence in the FASTA
header). The sequence has now been imported into the database. Now select
"Import GFF" from the File menu and import the GFF file "sample_contig.gff".
All features for this sequence have now been imported to the database.
The sample contig may now be browsed in Genquire.
BEFORE YOU BEGIN
genquire.conf:
genquire.conf contains various path information for tools/folders
on your system. This file is created when you run the install script.
You may modify this file by hand using any text editor; however,
using the install script ensures that this file is complete and in the
correct format. It should contain, at a minimum, the following information:
    {
    use lib "/path_to/bioperl-gui";
    use lib "/path_to/bioperl-live";
    use lib "/path_to/GO/perl-api";
    use lib "/path_to/Genquire/PLUGINS/";
    use lib "/path_to/Genquire";

    $TEMP_DIR       = "/tmp";
    $BLAST_BINARIES = "/path_to/BLAST";
    $BROWSER        = "netscape";    # command to invoke browser
    $BLAST_URL      = "http://URL_to/blast.cgi";
    $WORKING_DIR    = "/path_to/wherever";
    $PLUGINS_DIR    = "/path_to/Genquire/PLUGINS/";
    $DATA_SOURCES   = [["GENQUIRE_LOCAL", "gq_local.cfg"],
                       ["GENBANK_FLAT", "gb_flat.cfg"],
                      ];
    $BLAST_CONFIG = {
        # below is a list of all *acceptable* Blast parameters
        # note that some are removed because the GUI must be able
        # to receive and interpret the results.
        # when CGI_param is not available for your particular CGI,
        # leave it undef.
        #
        # blastall_param   #[CGI_param, default]
        '-p' => ['blastprogram', 'blastx'],
        ... see the file for details......
    };
    }

Filling your local database from scripts can be difficult - there are several database "flags" that Genquire uses to achieve some of its features. If you are planning to use a local DB we highly recommend that you read the Supplementary Documentation at the end of this document; contact the mailing list if you have any questions.
gq_local.cnf
This file must be edited with your database login and password information.
gb_flat.cnf
This file should require no editing.
PLUGINS/plugins.conf
This file should be edited to reflect the availability and location of various sequence analysis tools on your system. A sample file is shown below:
    #Plugin_Name plugin_script, command-line-params
    #______________________________________________________
    Blast_This_Genome, gq_blast.pl, /tmp/, /home/markw/BLAST/
    Remote_Blast, remoteBlast.pl, http://fingal/PANDA/pandablast.cgi, /tmp/
    GeneMarkHMM, GMHMM.pl, gmhmme, /tmp/
    Import GFF, ImpExpGFF.pl, /tmp/

Of the plugins included with Genquire, the necessary command line arguments are:
gq_blast: requires a temp directory, and the location of the blastall executable.
remoteBlast.pl: requires the URL of the remote Blast CGI and a temp directory.
GMHMM.pl: requires the command to execute the geneMarkHMM executable on your system, and a temp directory.
ImpExpGFF.pl: requires a temp directory

THE PLUGIN SCRIPTS PROVIDED WITH GENQUIRE WILL NEED TO BE EDITED FOR YOUR LOCAL SYSTEM: They contain some "use lib" references which you will have to modify to reflect the location of your libraries (eg. BioPerl). This is only a minute or two of effort.

For information on creating your own plugins, read the pod documentation in PlugInHandler.pm, which includes a comprehensive description of the relatively straightforward plugins API.
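To sanity-check a plugins.conf file before starting Genquire, a small parser can help. The following is only a sketch in Python, not part of Genquire: it assumes each non-comment line is a comma-separated list of plugin name, plugin script, and then zero or more command-line parameters, as in the sample entries of this section. The function name parse_plugins_conf is hypothetical.

```python
def parse_plugins_conf(text):
    """Parse plugins.conf-style text into a list of plugin entries.

    Assumed format (inferred from the sample file, not from Genquire's code):
    name, script, param1, param2, ...   Lines starting with '#' are comments.
    """
    plugins = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = [field.strip() for field in line.split(",")]
        if len(fields) < 2:
            raise ValueError(f"expected 'name, script, params...': {raw_line!r}")
        name, script, *params = fields
        plugins.append({"name": name, "script": script, "params": params})
    return plugins


SAMPLE = """\
#Plugin_Name plugin_script, command-line-params
#______________________________________________________
Blast_This_Genome, gq_blast.pl, /tmp/, /home/markw/BLAST/
Remote_Blast, remoteBlast.pl, http://fingal/PANDA/pandablast.cgi, /tmp/
GeneMarkHMM, GMHMM.pl, gmhmme, /tmp/
Import GFF, ImpExpGFF.pl, /tmp/
"""

for plugin in parse_plugins_conf(SAMPLE):
    print(plugin["name"], "->", plugin["script"], plugin["params"])
```

A line with fewer than two fields raises an error, which is a convenient way to spot a missing comma.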
spider.pl
In order to make whole-genome annotation queries fast enough to be usable, annotations are keyword-indexed by running spider.pl. Newly entered annotations will not be found by queries until this script is run. The script erases all previous indexes and rebuilds the table from scratch each time it is run, which takes 10-15 minutes. Consider running it as a nightly cron job.
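The idea behind keyword indexing can be illustrated with a small inverted index. This Python sketch is not how spider.pl is implemented; the tokenization rule, the function names, and the sample annotations are all assumptions made for illustration. Like spider.pl, it rebuilds the whole index from scratch, and it supports the AND/OR joining offered by the Query Annotations dialog.

```python
import re
from collections import defaultdict


def build_keyword_index(annotations):
    """Build a fresh index: lowercase word -> set of feature ids.

    `annotations` maps a (hypothetical) feature id to its annotation text.
    """
    index = defaultdict(set)
    for feature_id, text in annotations.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(feature_id)
    return index


def query(index, words, mode="AND"):
    """Return the feature ids matching all (AND) or any (OR) of the words."""
    hit_sets = [index.get(word.lower(), set()) for word in words]
    if not hit_sets:
        return set()
    if mode == "AND":
        return set.intersection(*hit_sets)
    if mode == "OR":
        return set.union(*hit_sets)
    raise ValueError("mode must be 'AND' or 'OR'")


# Made-up annotations, for illustration only.
annotations = {
    1: "putative protein kinase",
    2: "kinase inhibitor",
    3: "hypothetical protein",
}
index = build_keyword_index(annotations)
print(sorted(query(index, ["protein", "kinase"], "AND")))  # [1]
print(sorted(query(index, ["protein", "kinase"], "OR")))   # [1, 2, 3]
```

An annotation added to the dictionary after the index is built is not found until build_keyword_index is run again, which mirrors the behaviour described above.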
Context.pm
Below is a list of the feature types that Genquire recognizes
by default, and the feature objects that they map onto. Features
with * are required features for the Genquire system to function
correctly!
    *Gene          =>'GQ::Server::Gene',
    *Exon          =>'GQ::Server::Feature',
    *UTR           =>'GQ::Server::Feature',
    'TRNA Gene'    =>'GQ::Server::Gene',
    'RNA Exon'     =>'GQ::Server::Feature',
    Promoter       =>'GQ::Server::Feature',
    Intron         =>'GQ::Server::Feature',
    *Transcript    =>'GQ::Server::Transcript',
    *Poly_A_site   =>'GQ::Server::Feature',
    Misc_Feature   =>'GQ::Server::Feature',
    DEFAULT        =>'GQ::Server::Feature',

Note that this list is case sensitive. It matches a set of default features in the FeatureTypes table. If this does not match the list of feature types you plan to work with, then you will need to edit BOTH the names in the database and the %type_hash in Context.pm (~ line 10-20 of Context.pm) by hand to add new types, remove types, or change the spelling/capitalization of the equivalent types... either that, or make your feature types conform to this list. Note, however, that the starred features above may not be changed as they will prevent Genquire from being fully functional.
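Before editing Context.pm, it can be useful to check which of your feature types fail to match the default list exactly. The Python sketch below mirrors the default mapping; the helper names are hypothetical, and the fallback of unrecognized types to the DEFAULT entry is an assumption about how the lookup behaves, not something confirmed by the Genquire source.

```python
# Mirror of the default feature-type list (keys are case sensitive).
TYPE_HASH = {
    "Gene": "GQ::Server::Gene",
    "Exon": "GQ::Server::Feature",
    "UTR": "GQ::Server::Feature",
    "TRNA Gene": "GQ::Server::Gene",
    "RNA Exon": "GQ::Server::Feature",
    "Promoter": "GQ::Server::Feature",
    "Intron": "GQ::Server::Feature",
    "Transcript": "GQ::Server::Transcript",
    "Poly_A_site": "GQ::Server::Feature",
    "Misc_Feature": "GQ::Server::Feature",
    "DEFAULT": "GQ::Server::Feature",
}

# The starred (required) types from the list.
REQUIRED_TYPES = {"Gene", "Exon", "UTR", "Transcript", "Poly_A_site"}


def class_for(feature_type):
    """Exact-match lookup; ASSUMES unknown types fall back to DEFAULT."""
    return TYPE_HASH.get(feature_type, TYPE_HASH["DEFAULT"])


def unmapped_types(db_types):
    """Return the feature types that do not match a key exactly."""
    return sorted(t for t in db_types if t not in TYPE_HASH)


print(class_for("Gene"))   # GQ::Server::Gene
print(class_for("gene"))   # GQ::Server::Feature (wrong case, no exact match)
print(unmapped_types(["Gene", "exon", "Poly_A_site", "Enhancer"]))
# ['Enhancer', 'exon']
```

Any name reported by unmapped_types needs either a new entry in %type_hash or a rename in the database.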
Genquire Database Structure
The Genquire database is designed for storing information related to biological sequences. As such, the central element of the database is the contig. A group of contigs that are joined in a tiling path make up an assembly, and a group of assemblies related in some way (i.e. release date) are a version. Finally, there can be several versions available for each organism of interest.

Loading a Genquire Database

Entering contig and supra-contig level information can be done by hand, or via scripts. Some scripts may come with the Genquire distribution, but it is recommended that this process only be attempted by those with a good understanding of the Genquire schema.

If you are using data from TIGR, you will be able to use the bulk_parse_contigs.pm file to load your data directly from the TIGR XML into the Genquire database. If not... sorry...

The Genquire database schema is available in the distribution, in the file schema. An experienced MySQL administrator needs no further instructions. If you are not so confident, run the init_db.pl script, and follow the instructions. The init_db.pl script also enters sample organism data from the org.data file.

The first step in loading data into Genquire is to create the appropriate organisms. The file org.data in the Genquire distribution contains the table structure for the Organism table, along with the data for two plants, Arabidopsis thaliana and rice. Other organisms can be entered as required, either interactively, using the mysql client program, or via the Genquire 'Add new organism' menu option upon database login.

Each organism has a common name, a Latin name, a short code (two-letter abbreviation of the Latin name, in our system), and an id, which is assigned automatically by the database.
The organism id is used in only one table, the Assembly table. An assembly is a series of contigs in a tiling path, with no gaps (i.e. Assembly is not equal to Chromosome, so if there are gaps in the tiling path you may assign multiple assemblies to a single chromosome to represent that). If one has a complete tiling path for an organism, there will be one assembly for each chromosome. Assemblies never cross chromosomes. Therefore, the columns of the Assembly table are organism, chromosome (chr_id), version, and id (again, generated automatically by the database).
The version column in the Assembly table refers to the id of the Version table, which is a place to describe the different versions of your tiling paths. Each version is related to its organism, so organism 1 can have versions 1 and 2, and organism 2 can have unrelated versions 1, 2, and 3.
The same contig could be placed in two different Assemblies, in different versions of the tiling path. There is, therefore, a ContigAssembly join table that relates each contig_id to its parent assembly.
Which brings us to contigs. The Contig table holds only an id (auto-generated), the name of the contig (which is intended to be the widely recognized designation of that contig for human researchers), and a centromere column, which is simply y or n.
The sequence of the contig should be entered as a single string into the database, in the Sequence table, with its contig_id.
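The relationships described so far can be tried out in a throwaway database. The sketch below uses Python's built-in sqlite3 module in place of MySQL and is not the real Genquire schema (see the schema file in the distribution for that). The Organism column names and the assembly_id column of ContigAssembly are assumptions; the remaining column names follow the text above. It shows one contig placed in two assemblies belonging to different versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Organism (id INTEGER PRIMARY KEY, common_name TEXT,
                       latin_name TEXT, code TEXT);       -- column names assumed
CREATE TABLE Assembly (id INTEGER PRIMARY KEY, organism INTEGER,
                       chr_id INTEGER, version INTEGER);
CREATE TABLE Contig   (id INTEGER PRIMARY KEY, name TEXT, centromere TEXT);
CREATE TABLE ContigAssembly (contig_id INTEGER,
                             assembly_id INTEGER);         -- assembly_id assumed
CREATE TABLE Sequence (contig_id INTEGER, seq TEXT);
""")

# One organism; ids are generated by the database.
org_id = conn.execute(
    "INSERT INTO Organism (common_name, latin_name, code) VALUES (?, ?, ?)",
    ("thale cress", "Arabidopsis thaliana", "At"),
).lastrowid

# Two assemblies of chromosome 1, in versions 1 and 2 of the tiling path.
for version in (1, 2):
    conn.execute(
        "INSERT INTO Assembly (organism, chr_id, version) VALUES (?, 1, ?)",
        (org_id, version),
    )

# One contig (made-up sequence), placed in both assemblies via the join table.
contig_id = conn.execute(
    "INSERT INTO Contig (name, centromere) VALUES ('dy_c_5', 'n')"
).lastrowid
conn.execute("INSERT INTO Sequence VALUES (?, 'ACGTACGT')", (contig_id,))
conn.executemany(
    "INSERT INTO ContigAssembly VALUES (?, ?)", [(contig_id, 1), (contig_id, 2)]
)

rows = conn.execute("""
    SELECT a.id, a.chr_id, a.version
    FROM ContigAssembly ca
    JOIN Assembly a ON a.id = ca.assembly_id
    JOIN Contig c   ON c.id = ca.contig_id
    WHERE c.name = ? ORDER BY a.version
""", ("dy_c_5",)).fetchall()
print(rows)  # [(1, 1, 1), (2, 1, 2)]
```

The join table is what lets the same contig row be reused when a new version of the tiling path is loaded.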
The Tiling_Path table is the last table to be concerned about here. The tiling path needs to be defined in terms of assembly coordinates, and preferably in chromosomal coordinates. Each contig then has an abs_start, which is the nucleotide coordinate, in the parent assembly, of the first nucleotide of that contig sequence (seq column in Sequence table), and a length, which is the length of that sequence. Due to overlaps, you may not want to use part of your contig. The answer to that is the virtual contig, or VC. The VC is defined in terms of your contig, with a VC_start and a VC_length. So if there is no overlap and the VC is the same as the contig, the VC_start equals 1 and the VC_length is the same as length. If there is some overlap, the VC_length is shortened by the amount of the overlap, and the VC_start may be moved, if the overlap occurs on the 5' end of the contig.
assembly:                  ##############################################################################################################...
                           1   5    10   15   20   25   30   35   40   45   50   55   60   65   70   75   80   85   90   95   100  105  110...
contig:                                           @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
abs_start: 24 length: 70                          1   5    10   15   20   25   30   35   40   45   50   55   60   65   70
virtual contig:                                       %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
VC_start: 5 VC_length: 60                             1                                                          60
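The arithmetic behind the diagram can be written out explicitly. This Python sketch assumes 1-based, inclusive coordinates, which is consistent with VC_start being 1 when there is no overlap; the class and method names are hypothetical and not part of Genquire.

```python
from dataclasses import dataclass


@dataclass
class TilingPathEntry:
    abs_start: int   # assembly coordinate of the contig's first nucleotide
    length: int      # length of the contig sequence
    vc_start: int    # VC_start, in contig coordinates
    vc_length: int   # VC_length

    def validate(self):
        """Check that the virtual contig lies inside the contig."""
        vc_end = self.vc_start + self.vc_length - 1
        if self.vc_start < 1 or self.vc_length < 1 or vc_end > self.length:
            raise ValueError("virtual contig extends outside the contig")

    def contig_span(self):
        """Assembly coordinates (start, end) covered by the whole contig."""
        return (self.abs_start, self.abs_start + self.length - 1)

    def vc_span(self):
        """Assembly coordinates (start, end) covered by the virtual contig."""
        start = self.abs_start + self.vc_start - 1
        return (start, start + self.vc_length - 1)


entry = TilingPathEntry(abs_start=24, length=70, vc_start=5, vc_length=60)
entry.validate()
print(entry.contig_span())  # (24, 93)
print(entry.vc_span())      # (28, 87)
```

With the values from the diagram, the contig covers assembly positions 24 to 93, and the virtual contig covers contig positions 5 to 64, that is, assembly positions 28 to 87.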
A final note: Genquire assumes that all contigs and their accompanying annotations are oriented the same way. If some contigs in your data set are oriented in the reverse direction, you need to reverse-complement those contigs and reverse the coordinates of all of their features before entering them into the database.
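Re-orienting a contig and its features can be sketched as follows. This Python example is an illustration only: it assumes 1-based, inclusive feature coordinates and a strand stored as +1 or -1, neither of which is stated in this manual, and the function names are hypothetical.

```python
# Translation table for complementing DNA; N is left as N.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def flip_feature(start, end, strand, contig_length):
    """Map a feature onto the reverse-complemented contig.

    ASSUMES 1-based inclusive coordinates and strand given as +1 or -1.
    """
    new_start = contig_length - end + 1
    new_end = contig_length - start + 1
    return new_start, new_end, -strand


contig = "AACGT"
print(reverse_complement(contig))          # ACGTT
print(flip_feature(1, 2, 1, len(contig)))  # (4, 5, -1)
```

In the example, the feature covering "AA" at positions 1-2 ends up at positions 4-5 of the re-oriented contig, where the sequence reads "TT".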
FEATURES:
Good luck with your installation! Detailed questions are welcome, as are improvements to this documentation. Please send them to me, Dave, at dblock@gene.pbi.nrc.ca, and I'll reply as soon as I can.
STARTING THE PROGRAM
Main Genquire I Window "QueryScreen"
File Menu
Options Menu
Tools Menu
PlugIns Menu
Genome Map Window:
The genome map is constructed on the fly from information in the Contig, ContigAssembly, and Assembly tables (see database documentation or contact Dave Block for details). Mouse over a contig to display its name in the QueryScreen status bar, click it to select it, double-click it to open it.
Right-click on the genome map to get a drop-down menu of functions:
NOTE: Flagging of features requires that this functionality be present in the data-source adaptor layer. Chances are, this feature will only be available when using the Genquire database schema.
- Refresh: re-display the last query results.
- Query Annotations: brings up a query dialog box in which you can enter one or more query words (select AND or OR to join them) to query against the annotations. "Hits" are flagged and highlighted both on the Genome Map and on the Sequence Feature Display Window. Flags are user-specific and persistent from one session to the next. Flags take on your database username by default, and are reset after every query. If you wish to 'archive' your query results, use the 'Re-name Current Flags' drop-down menu option to assign a new flag name, which can be retrieved later using the 'View Flags of Type' menu option.
- Remove Current Flags: removes all current query flags.
- Re-name Current Flags: change the flag name to prevent the next query from overwriting the current query. This 'saves' the query.
- View Flags of Type: select a saved query result and make it the current selection set.
The Genome map also responds to clicks on mapped features (in the Features table) with the primary_tag = 'dupl'. This can be used to store duplication information as interactive features in the database.
Sequence Feature Display Window:
Once you have selected a group of features which you believe belong to a gene, you can drag them to an empty portion of the Finished (blue) canvas as new Gene objects. The gene will appear with a unique database-generated identifier as a dark blue bar, and all of its sub-features will be visible. Dragging features onto a gene (blue bar) will create a new transcript for that gene. Dragging features onto a transcript (grey bar) will add the features to that transcript. Features are deleted from the right-click drop-down menu.
Annotating Genes:
Gene-type objects behave exactly the same way as all other on-screen objects. Double (sometimes triple) clicking them will bring up the Hand Annotation window, in which you can create tag/value pairs and GO ontology annotations for a given gene.
acceptor_site => [regexp-value, "read/write"],
This information is a supplement to the installation instructions for Genquire. It is particularly related to the data-format for certain database fields which will "automagically" cause the GUI to react in useful ways.
The primary_tag field of the Feature table
will contain many different feature types, some of which will be common
to most genome databases, and are thus interpreted by Genquire in a special
way. If you wish to enable this functionality, these common features
should have the following (case-sensitive) primary_tags:
Table   | primary_tag | type of feature described |
Feature | prom        | promoter                  |
Feature | plyA        | polyadenylation site      |
Feature | dupl        | duplicated region*        |
Annot_Feature, Annot_Gene, and Annot_discard should, preferably, not be filled in by hand as the entries for these tables come from within Genquire itself.
*w.r.t. the annotation of duplicated regions: additional
information about duplications is held in the TagValue, Tags and Values
tables as follows:
Table | tag | value |
TagValue | Homology_to | Contig.name (from the Contigs table) |
TagValue | Homology_start | start position on the homologous contig |
TagValue | Homology_stop | end position on the homologous contig |
TagValue | Orientation | direct or inverted (this is only relevant for dups on the same chromosome) |
In addition to the tags above, there are certain Feature primary_tags
that will be completely ignored by Genquire, i.e. features
with these primary_tags will not be mapped,
and will not be visible on the annotation canvas. This
was done to allow display of BioPerl-parsed GenBank entries. The
ignored primary_tags are (case sensitive):
The BlastLookUp table holds the ids (primary keys)
of the BlastAcc table for exon_id's which share an NCBI gi accession
number in common from their Blast homology. This table is difficult
to generate by hand/script, and is best done within Genquire if possible
using the Blast Exon(s) function on the pop-up menu. When full, this
table allows the selection of all exons on a canvas which share common
Blast hits (useful for finding gene boundaries).
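The kind of grouping that this table makes possible can be sketched in a few lines of Python. This is not Genquire's implementation: the input shape (pairs of exon id and gi number) and the function name are assumptions, and the ids below are made up.

```python
from collections import defaultdict


def exons_sharing_hits(blast_hits):
    """Group exon ids by the gi numbers they hit.

    `blast_hits` is an iterable of (exon_id, gi) pairs (an assumed shape).
    Only gi numbers hit by more than one exon are returned.
    """
    exons_by_gi = defaultdict(set)
    for exon_id, gi in blast_hits:
        exons_by_gi[gi].add(exon_id)
    return {
        gi: sorted(exons)
        for gi, exons in exons_by_gi.items()
        if len(exons) > 1
    }


# Hypothetical exon ids and gi numbers, for illustration only.
hits = [(101, 1111), (102, 1111), (103, 2222), (104, 1111), (104, 3333)]
print(exons_sharing_hits(hits))  # {1111: [101, 102, 104]}
```

Exons 101, 102 and 104 share a hit and are therefore candidates for belonging to the same gene, which is the use case mentioned above.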
The Blast_vs_EST table is fairly self-explanatory - EST's
are (outside of Genquire) Blasted against the genome of the organism, and
the Blast reports are parsed into this table. When parsing Blast
data into the EST and Blast_versus_EST tables, the "query" is considered
to be the EST, and the "subject" is considered to be the contig/genome.
EST's (or other transcript-based sequences like cDNA's) may then be mapped
onto your annotation canvas, and used in Sim4 alignments. Information
about the EST itself is held in the EST table. Note that both the
EST table and the Blast_versus_EST tables have a "source" column, and this
needs to be consistent between the two tables.