Database API - what the Genquire schema means
The Genquire reference database implementation is an efficient way of storing genomic annotation in a MySQL database. It is flexible and enables all of the functionality available in Genquire.
Like it or lump it, let me know at dblock@gene.pbi.nrc.ca
Email dblock@gene.pbi.nrc.ca
This database schema was developed to enable Genquire functionality. Most of that functionality was dreamed up by Mark Wilkinson, mwilkinson@gene.pbi.nrc.ca
For the current schema, look at the file /Admin/schema in your Genquire installation.
This is an overview of the roles of the different tables in the Genquire DB Schema.
The centrepiece of the schema is the Feature table, which stores all of the features present on the genomic sequence. Each feature has a contig id, which is a link to its parent contig, and contig_start/contig_stop coordinates. It has a strand, which is stored in GFF format as +/-, and possibly a score and a frame. It has a source_tag, which is a simple text string identifying the source of the annotation, and a primary_tag, which is what the feature was annotated as by the source. It also has a type, which is a code referencing the FeatureType table. This decouples object types from primary_tags, which allows flexibility in annotation.
The genomic data is organized in terms of contigs. Each contig is expected to have a name, a string given it by the sequencing body or organism database. The Contig table stores that name, and what chromosome the contig is on, and whether that contig is in a centromeric region.
A chromosome is the sensible coordinate system for organisms with finished genomes, but until that point is reached, contigs can be placed in sub-chromosomal assemblies. The Assembly table stores the list of assemblies, which chromosomes they sit on, and which version, or data release, they are associated with. Each assembly gets a unique id.
With multiple releases of data, each version gets its own id, and a description.
This table relates a Contig, which has a name which may survive several versions, and the assembly on which it finds itself in the current version. The resulting combination gets a unique id, which is referred to as 'contig_id' throughout the rest of the schema. REPEAT: contig_id does not refer to the contig name, but to the combination of unique contig name and assembly associated with the current version.
Each feature in the Feature table has a contig_id. Now you know what that is. The contig_start and contig_stop coordinates can be translated into assembly coordinates once we know the tiling path of the contig's assembly. We know which assembly the contig is on because the contig_id points not to the Contig table but to the ContigAssembly table.
Each assembly is made up of a series of contigs, associated in a tiling path. This information is stored in the Tiling_Path table, keyed on contig_id.
There are two ways of completely specifying a location in a tiling path. One is to store the neighbours of the contig, and the relative distance between them. The other is to use absolute coordinates relative only to the parent assembly.
The first is convenient for loading data from some data sources. A script is provided for translating that relative data into absolute coordinates, which are much more efficient for Genquire's purposes.
The length of the contig is stored in the length column. The abs_start column is the position in assembly coordinates of the start of the contig, reading always 5' to 3'. The neighbours in the table are the contig names of the 5' and 3' neighbours of the contig.
The VC_start and VC_length refer to the virtual contig. Genquire deals with overlapping contigs in a tiling path by only using one non-redundant tiling path. It is up to the database curator which of two overlapping contigs is the 'correct' sequence. If the sequence is identical, this may not matter, but the annotations of the dropped sequence will not be loaded into the Genquire sequnce object, so be aware.
Once the 'correct' contig coordinates are determined, they are stored in VC_start and VC_length. VC_start is in contig coordinates, so if the overlap is 200 bp, the VC_start would be 201 if that 200-bp segment will come from the other contig. The VC_length is simply how much of the contig is believed.
Keyed on contig_id, this table stores the version of the sequence associated with the current data release. If the contig changes its sequence in subsequent data releases, it will get a new ContigAssembly entry, and therefore a new Sequence table entry, where the new sequence can be put.
Now that you can get the information you need about the sequence, what have you put on it? The Feature table is based on GFF format, and is meant to store everything that can be attached to a sequence, including genes, transcripts, exons, promoters, and repeats.
Of course, it needs some help with some of those details. So these are the Feature- associated tables in the schema:
Tables associated with Feature:
In order to enable undo functionality, and/or to have a history of the annotation hung on any feature, any edits to the database can be stored in these two tables. The Edit table stores the who/when of the editing, and gives each editing event a unique id. The Discard table stores the data as it looked before the edit, allowing a return to the previous data. It is a big table, with columns present to handle any of Genes, Transcripts, Gene features (exons, etc.), and any of their associated tag/value pairs.
These tables are tied very tightly to Genqiure objects, and as such, won't require much in the way of setup, as long as they exist and are not corrupt. They may become a source for data mining in the future, as an extensive annotation history builds up around Genes.
Genes contain transcripts. Transcripts contain features. This containing is expressed in the Container table, which has only two columns: parent and element. The parent is the unique Feature table id of the parent object, and the elements are the ids of the objects that are contained within that parent.
The FeatureType table was created in order to allow a limited number of genomic feature types. Each type is given a name and a unique id, which is referenced by the type column in the Feature table. It is the responsibility of Genquire to have an appropriate object type available for all of these types.
In order to allow any interesting features to be marked as such by annotators (i.e. This gene looks like an F-Box gene), flags can be placed on them using these tables. Any text string up to 30 characters long can be used as a flag, and this flag can be associated with any group of Feature ids.
The Flags table holds the flag text, and gives it a unique id. The Flag table associates the flag with feature ids and contig_ids, allowing easy contig marking in the Chromosome View in Genquire. Each row in the Flag table also gets its own unique id.
Each row in the Feature table is really a signpost saying that 'something is here'. What that something actually is cannot be stored in a flat format like a single table row. Our approach again comes from GFF (version 2), in which any annotation beyond the GFF basics is stored as a series of tag/vectors. That is, given an object, one can look at what kinds of things are known about that object, and for each of those kinds of things (tags), there can be one or more pieces of data, or things that we know. For example, given a gene, we can look for BLAST homologies, and find one or more BLAST homologies associated with a given gene.
That pattern is played out here. The list of allowed tags is kept in the Tags table, where each tag is given a unique id. Each row in the TagValue table references a Feature id and a Tags id, and contains a value. There can be multiple rows with the same Feature id and Tags id, so each TagValue row is also given a unique id.
Some kinds of annotation are themselves structured. GO annotations are like that. In order to associate a gene with a GO term, some evidence must be given, following the list of possible evidence codes available from GO. Handling this layered annotation by a simple tag/vector was too unwieldy, so another layer was put in place in the TagValue table, using the parent column.
If an annotation is part of another annotation, it simply references that annotation's unique id as its parent. For annotation types (tags) that may contain sub-parts, the annotation object must look for rows in the database that have those annotations as parents. This is implemented in Genquire as it relates to GO. Each GO annotation has a tag of GO_annotation, and a value of the GO_id. The GO term associated with that GO_id is stored in the GO_local table. The term comes from the GO_browser module that is part of bioperl-gui and integrated into Genquire.
The associated evidence is stored in further rows of the TagValue table, with the GO_annotation's TagValue id as the parent, and the GO evidence code as the tag, and the actual evidence description (article reference, author, etc.) as the value.
These tables are used by the Genquire Chat object, and can be left alone by the administrator. They allow two annotators, working in separate labs, but on the same database, to discuss what they are doing as they work. This was a Friday afternoon project that was fun, but (so far) not incredibly useful. On the other hand, Mark is usually in the desk right beside mine, so we haven't needed to use this functionality.
This table is used extensively by Genquire. It keeps track of which regions of the various assemblies are being actively annotated by which user. It contains fields for the user, the version and assembly, and the start and stop of the region registered to that user. Each lock has its own lock id, which allows Genquire to release the lock when no longer needed.
NOTE: if Genquire is killed unnaturally, which happens regularly while in development, this table tends to fill up with old locks that were not released properly. If the users in question are not online, a simple ``Delete From AnnotationLock'' clears this up.
One of the useful features of Genquire is its keyword searches. These are enabled by these three tables, the Spider.pl script in the Admin directory, and the flag_text method of the Context object.
The Spider script crawls through the database, breaking down the text from the TagValue table and other textual tables, associating those words with a unique id (in the Word table) and each Feature id/contig_id with a fid in the fid table, then combining those two in the WordBase table. A simple search of the Word table gives the unique ids associated with a set of words, and then the WordBase table associates those words with a set of fids. The fids can be traced back to Feature ids and contig_ids, which allow simple flagging of relevant keywords.
Genetic markers can be associated with a particular contig_id, and even with a contig start/stop pair. They can have a locus_name, a locus_source, and be of type 'RFLP', 'AFLP', 'Gene' or 'SNP'.
Storing an EST in the database allows a whole bunch of extra features in Genquire to be enabled. The EST table holds a unique id for the EST, the date the EST was placed in the database, the gi, gb, and description of the EST, the sequence, and a label for the source.
Once there has been an alignment of the EST with genomic sequence, the results can be stored in the Blast_vs_EST table. This has a unique id, and holds all of the data necessary to localize and display the alignment.
Blasting a feature or set of features against a large external database can be slow, so the database can store blast results, allowing easy lookup of homology information, and easy comparison of shared homology.
BlastAcc stores a unique id for each blast hit, the feature id, the gi and db of the subject, the rawscore and probability of the hit, and the longer accession and description of the subject.
BlastLookUp simply stores two feature ids, id1 and id2. The pair form the primary key of the table. Two feature ids are placed in the table if they are both homologous to the same accession. This can be useful for determining real gene boundaries.
These tables are vestigial organs and can safely be ignored for now. They may come back into relevance at a later date.
In the admin folder, the schema file holds the latest current schema. The file schema.dia is a Dia illustration of the schema that may or may not be up to date. We will try to keep these docs up to date, including this file and schema.dia. If there are any inconsistencies or outright errors, or you want to quibble with grammar or spelling, just email me at dblock@gene.pbi.nrc.ca.