WWC snapshot of http://www.ncbi.nlm.nih.gov/dbEST/how_to_submit taken on Thu May 4 3:33:38 1995
NCBI GenBank

Submission of ESTs to GenBank and dbEST


ESTs by nature are usually submitted to GenBank and dbEST as batches of dozens to thousands of entries, with a great deal of redundancy in the citation, submitter and library information. To improve the efficiency of the submission process for this type of data, we have designed a separate streamlined submission process and data format.

EST entries may be submitted by email to "batch-sub@ncbi.nlm.nih.gov". A special tagged flat file input format (see below) has been designed for this data, to allow it to be submitted as one or more text files in this manner.

A document describing this data submission format is included below. If you have questions about this format, please contact "info@ncbi.nlm.nih.gov" and a member of the support staff will get back to you or pass your question on to Carolyn Tolstoshev or Mark Boguski for a response.

Once your data is ready to submit, email it to "batch-sub@ncbi.nlm.nih.gov" as described above. A dbEST curator will email you a list of the dbEST ids and GenBank accession numbers assigned to your entries.

Carolyn Tolstoshev, carolyn@ncbi.nlm.nih.gov
Mark Boguski, boguski@ncbi.nlm.nih.gov

Submission Format

The following is a specification for flat file formats for delivering EST and related data to the NCBI EST database. The format consists of capitalized tags, each terminated by a colon and followed by data. Except for sequence, comment and library description data, the data fields should appear on the same line as the tag, with no line wrapping. Each record (including the last record in the file) should end with the marker || to indicate the end of the record.
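To make the record structure concrete, here is a minimal Python sketch of a reader for this format. It is an illustration only, not the parsing software NCBI uses: the function name parse_records, the tag pattern and the set of multi-line tags are assumptions drawn from the description above. A free-text line that happens to begin with a capitalized word and a colon would be misread as a tag by this sketch.

```python
import re

# A tag is assumed to be capital letters, digits, "_" or "#", then a colon.
TAG_LINE = re.compile(r"^([A-Z][A-Z0-9_#]*):\s*(.*)$")
# Fields whose data starts on the line below the tag.
MULTILINE_TAGS = {"SEQUENCE", "COMMENT", "DESCR"}


def parse_records(text):
    """Split tagged flat-file text into a list of {tag: value} dicts."""
    records, current, open_tag = [], {}, None
    for raw in text.splitlines():
        line = raw.strip()
        if line == "||":  # end-of-record marker
            if current:
                records.append(current)
            current, open_tag = {}, None
            continue
        match = TAG_LINE.match(line)
        if match:
            tag, value = match.groups()
            current[tag] = value
            # Remember the tag only if continuation lines may follow it.
            open_tag = tag if tag in MULTILINE_TAGS else None
        elif open_tag and line:
            # Sequence lines are concatenated; text lines are joined by a space.
            joiner = "" if open_tag == "SEQUENCE" else " "
            current[open_tag] = (current[open_tag] + joiner + line).strip()
    return records
```

Calling parse_records on the text of a submission file returns one dictionary per record, keyed by tag, so the TYPE value can be used to tell publication, library, contact and EST entries apart.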
1.1 File Types
There are four types of deliverable files: 
a. Publication
b. Library
c. Contact
d. EST

2.1 Publication Files
The following is an example of the valid tags and some illustrative data:

TYPE: 	  Entry type - must be "Pub" for publication entries. Obligatory field.
MEDUID:   Medline unique identifier.
CITATION: Journal citation for the publication. Obligatory field.
STATUS:   Status field. 1=unpublished, 2=submitted, 3=in press, 4=published.
	  Obligatory field.


e.g.
TYPE: Pub
MEDUID: 91262645
CITATION: Science, 252:1651 (1991)
STATUS: 4
||

The TYPE field is obligatory at the beginning of each entry, even if there are multiple entries of a given type in a file. The MEDUID field is the Medline unique identifier of the record. We do not normally expect you to supply it; we try to retrieve it from our relational version of the Medline database. The STATUS field takes one of the values 1=unpublished, 2=submitted, 3=in press, 4=published. The CITATION field is a free-format string. The only requirement is that you put an identical string in the CITATION field of the ESTs, since we match that field automatically against the publications in the publication table and replace the string with the publication id in the EST table.
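As an illustration of the publication entry rules, the following Python sketch writes one entry and rejects status codes outside 1 to 4. The helper name format_pub_entry and its argument names are hypothetical; only the tags and the status codes come from the specification above.

```python
# Status codes as listed in the specification.
STATUS_CODES = {1: "unpublished", 2: "submitted", 3: "in press", 4: "published"}


def format_pub_entry(citation, status, meduid=""):
    """Return one publication entry as text, ending with the || marker."""
    if status not in STATUS_CODES:
        raise ValueError("STATUS must be 1, 2, 3 or 4")
    if not citation.strip():
        raise ValueError("CITATION is obligatory")
    lines = ["TYPE: Pub"]
    if meduid:  # optional; it is normally left for NCBI to look up
        lines.append(f"MEDUID: {meduid}")
    lines.append(f"CITATION: {citation}")
    lines.append(f"STATUS: {status}")
    lines.append("||")
    return "\n".join(lines) + "\n"
```

Keeping the citation in a single variable and passing the same variable to whatever writes the EST entries is one simple way to guarantee that the two strings are identical.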

2.2 Library Files
The following is an example of the valid tags and some illustrative data:

TYPE: 	  Entry type - must be "Lib" for library entries. Obligatory field.
NAME: 	  Name of library. Obligatory field.
ORGANISM: Organism from which the library was prepared.
ABBREV:	  Organism abbreviation e.g. C.elegans
STRAIN:	  Organism strain
COMMON:	  Common name of organism.
VECTOR:	  Name of vector.
V_TYPE:	  Type of vector (Cosmid, Phage, Plasmid, YAC, other)
RE_1:	  Restriction enzyme at site1 of vector
RE_2:	  Restriction enzyme at site2 of vector
HOST:	  Host name
DESCR: 	  Description of library preparation methods, vector, etc. This field
	  starts on the line below the DESCR: tag.



e.g.
TYPE: Lib
NAME:  Hippocampus, Ruben Moreno
ORGANISM: Homo sapiens
ABBREV: H.sapiens
STRAIN:
COMMON: Human
VECTOR: E145 
V_TYPE: Phage
RE_1: NotI
RE_2: HindIII
HOST:
DESCR: 
mRNA was purified from the hippocampus of an adult female.  cDNA was constructed 
and cloned simultaneously using vector priming with the E145 vector and method 
described by Rubenstein et al. (Nucl. Acids Res. 18:4833, 1990). cDNA was 
directionally synthesized from the NotI site in the vector to the HindIII site.
||

The TYPE field is obligatory at the beginning of each entry, even if there are multiple entries of a given type in a file. Try to keep the library NAME: field to

2.3 Contact Files
The following is an example of the valid tags and some illustrative data:

TYPE: 	Entry type - must be "Cont" for contact entries. Obligatory field.
NAME: 	Name of person who provided the EST. 
FAX: 	Fax number as string of digits.
TEL: 	Telephone number as string of digits.
EMAIL: 	E-mail address
LAB: 	Laboratory providing EST.
INST: 	Institution name
ADDR: 	Address string, comma-delimited.


e.g.
TYPE: Cont
NAME: Kerlavage AR
FAX: 3014808588
TEL: 3014968800
EMAIL: arkerlav@loglady.ninds.nih.gov
LAB: Receptor Biochemistry & Molecular Biology
INST: NIH
ADDR: NIH/NINDS/RBMB, Park Building,  Room 405,   Bethesda,   MD 20892
||

The TYPE field is obligatory at the beginning of each entry, even if there are multiple entries of a given type in a file. None of the other fields are obligatory, but we require at least the name of a contact person or lab. We would like as many of the fields filled in as possible, to provide complete information to the user for contacting a source for the EST or further information about it. The contact name or contact lab fields in the EST entries must contain an identical string to the string used for NAME and LAB fields in the contact entry, for automatic matching.
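The contact rules can be checked before submission. The sketch below is a hypothetical helper (check_contact is not part of any NCBI tool) that takes one contact entry as a dictionary of tag to value and reports the problems it finds. Treating FAX and TEL as digits-only follows the tag descriptions above.

```python
def check_contact(record):
    """Return a list of problems found in one contact entry (empty if none)."""
    problems = []
    if record.get("TYPE") != "Cont":
        problems.append('TYPE must be "Cont"')
    # At least the name of a contact person or of a lab is required.
    if not (record.get("NAME", "").strip() or record.get("LAB", "").strip()):
        problems.append("either NAME or LAB must be filled in")
    # FAX and TEL are described as strings of digits.
    for tag in ("FAX", "TEL"):
        value = record.get(tag, "").strip()
        if value and not value.isdigit():
            problems.append(f"{tag} should contain digits only")
    return problems
```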

2.4 EST Files
The following is an example of the valid tags and some illustrative data:

TYPE: Entry type - must be "EST" for EST entries. Obligatory field.
STATUS:	Status of EST entry - "New" or "Update". Obligatory field.
CONT_NAME: Name of contact (must be identical string to the contact entry)
CONT_LAB: Contact laboratory. (Must be identical string to the contact entry.)
EST#: EST id assigned by contact lab. Obligatory field. For EST entry 
	updates, this is the string we match on.
GB#: GenBank id.
GDB#: Genome database accession number
GDB_DSEG: Genome database D-segment number
CLONE: Clone id.
ATCC_DNA: ATCC id for the clone as pure DNA
ATCC_INHOST: ATCC id for the clone stored in the host.
OTHER_EST: Other ESTs for this gene.
CITATION: Journal citation. (Must be identical string to the  publication entry)
PRIMER: Sequence primer description or sequence.
P_END: Which end sequenced e.g. 5'
DNA_TYPE: cDNA (default), Genomic, Viral, Synthetic, Other
MAP: Map location.
LIBRARY: Library name.  (Must be identical string to library name entry.)
PUBLIC: 1=for release to public, 0=confidential, no general release. Obligatory field.
PUT_ID:	Putative identification, found by homology with a database sequence.
COMMENT: Comments about EST. Starts on line below COMMENT: tag.
E_DATE:	EMBL date of entry (used for parsing EMBL entries only. Ignore.)
U_DATE:	EMBL last update date. (used for EMBL entries only. Ignore.)
OWNER: N, L, E or D - submitting database (For internal use only. Ignore.)
SEQUENCE: Sequence string. Starts on line below SEQUENCE: tag. Obligatory field.

e.g.
TYPE: EST
STATUS:  New
CONT_NAME: Kerlavage AR
CONT_LAB: Receptor Biochemistry & Molecular Biology
EST#: EST00001
GB#: M61954
GDB#:  
GDB_DSEG: 
CLONE: HHC189
ATCC_INHOST: 65128
OTHER_EST: EST00093, EST000101
CITATION: Science, 252:1651 (1991)
PRIMER: M13 Forward
P_END: 5' end
DNA_TYPE: cDNA
MAP: Chromosome 1
LIBRARY: Hippocampus, Stratagene (cat. #936205)
PUBLIC: 1 
PUT_ID: Actin, gamma, skeletal
COMMENT:
This is a comment about the sequence. It may contain features.
It may span several lines.
SEQUENCE:
AATCAGCCTGCAAGCAAAAGATAGGAATATTCACCTACAGTGGGCACCTCCTTAAGAAGCTG
ATAGCTTGTTACACAGTAATTAGATTGAAGATAATGGACACGAAACATATTCCGGGATTAAA
CATTCTTGTCAAGAAAGGGGGAGAGAAGTCTGTTGTGCAAGTTTCAAAGAAAAAGGGTACCA
GCAAAAGTGATAATGATTTGAGGATTTCTGTCTCTAATTGGAGGATGATTCTCATGTAAGGT
TGTTAGGAAATGGCAAAGTATTGATGATTGTGTGCTATGTGATTGGTGCTAGATACTTTAAC
TGAGTATACGAGTGAAATACTTGAGACTCGTGTCACTT
||


The TYPE field is obligatory at the beginning of each entry, even if there are multiple entries of a given type in a file. Valid values for the EST STATUS field are New (new entry) or Update (change an existing EST entry). When updating an EST, only the fields present in the EST file will be changed. Please try to stick to standard map location formats, so that we will be able to write functions to parse them in the future. Sequences start on the line below the SEQUENCE: tag and should be written 60 characters per line, with no blank spaces.
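The sequence layout rule is easy to apply mechanically. The following Python sketch strips whitespace from a sequence, wraps it at 60 characters per line, and writes a complete EST entry. The function names wrap_sequence and format_est_entry are illustrative assumptions, and the caller is expected to supply the single-line fields in the order they should appear.

```python
def wrap_sequence(sequence, width=60):
    """Remove all whitespace and split the sequence into lines of `width`."""
    cleaned = "".join(sequence.split())
    return [cleaned[i:i + width] for i in range(0, len(cleaned), width)]


def format_est_entry(fields, sequence):
    """Return one EST entry as text.

    `fields` maps single-line tags (for example "STATUS", "EST#", "PUBLIC")
    to their values; the sequence goes on the lines below the SEQUENCE: tag.
    """
    lines = ["TYPE: EST"]
    lines += [f"{tag}: {value}" for tag, value in fields.items()]
    lines.append("SEQUENCE:")
    lines += wrap_sequence(sequence)
    lines.append("||")  # end-of-record marker
    return "\n".join(lines) + "\n"
```

For example, format_est_entry({"STATUS": "New", "EST#": "EST00001", "PUBLIC": "1"}, seq) produces an entry containing the three obligatory single-line fields followed by the wrapped sequence.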

Notes:

  • If data is not available for some fields, the field can either be omitted entirely, or the tag may be included with an empty data field. Please do not use "*", "-" or similar placeholders to indicate missing data.
  • In the EST file, you are given a choice between contact name or contact lab. If the name of the person supplying the ESTs is in the contact table, use the CONT_NAME field. If the contact entry is only identified with the name of a lab, use the CONT_LAB field instead.
  • For the fields CONT_LAB, CONT_NAME, LIBRARY and CITATION, it is very important that the string in the EST file field is completely identical to that provided for contact, library and publication files. We will be scanning these fields from the EST file and matching them automatically to library, contact and publication records in the other tables, so content, spelling, letter case and spacing must match.
  • The DNA type is assumed to be cDNA, so this field may be omitted unless the DNA type differs from this.
  • If you wish, you can submit libraries, pubs, contacts and ESTs all in one file - the TYPE field will differentiate them for the parsing software. However, if you are submitting new libraries, contacts and/or publications in the file with ESTs, and the new ESTs refer to them, they must precede the ESTs in the file, otherwise the EST crossmatching will not succeed.
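The matching and ordering rules in these notes can be checked on a combined file before it is sent. The sketch below is an assumption-laden illustration, not NCBI's cross-matching code: it takes the records of one file in order, each as a dictionary of tag to value, and reports every EST reference that has no exactly identical string in an earlier entry of the same file. It cannot know about libraries, contacts or publications that are already in the database from earlier submissions, so a reported reference is a warning rather than an error.

```python
# EST tag -> (TYPE of the entry it refers to, tag holding the string there)
REFERENCES = {
    "CITATION": ("Pub", "CITATION"),
    "LIBRARY": ("Lib", "NAME"),
    "CONT_NAME": ("Cont", "NAME"),
    "CONT_LAB": ("Cont", "LAB"),
}


def find_unmatched_references(records):
    """Return (record index, EST tag, value) for each unmatched reference."""
    seen = set()  # (entry type, tag, value) triples defined so far
    problems = []
    for index, record in enumerate(records):
        entry_type = record.get("TYPE")
        if entry_type == "EST":
            for est_tag, target in REFERENCES.items():
                value = record.get(est_tag, "")
                # Comparison is exact: content, spelling, case and spacing.
                if value and target + (value,) not in seen:
                    problems.append((index, est_tag, value))
        else:
            for tag, value in record.items():
                if value:
                    seen.add((entry_type, tag, value))
    return problems
```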

    Carolyn Tolstoshev, NCBI, carolyn@ncbi.nlm.nih.gov,

    On-Line EST Database, Data Input Format Specification

    version 2.1 July 27, 1993 Draft Copy


    This draft document is being made available solely for review purposes and should not be quoted, circulated, reproduced or represented as an official NCBI document. The draft is undergoing revisions and should not be considered or represented as reflecting the views, positions or intentions of the NCBI or the National Library of Medicine.

    Submission Format for Mapping Data

    The following is a specification for flat file formats for delivering EST mapping and related data to the NCBI EST database. The format consists of capitalized tags, each terminated by a colon and followed by data. Each record (including the last record in the file) should end with the marker || to indicate the end of the record.
    1.1 File Types
    There are four types of deliverable files: 
    a. Publication
    b. Contact
    c. Method
    d. Map data
    
    2.1 Publication Files
    The following is an example of the valid tags and some illustrative data:

    TYPE:     Entry type - must be "Pub" for publication entries. Obligatory field.
    MEDUID:   Medline unique identifier.
    CITATION: Journal citation for the publication. Obligatory field.
    STATUS:   Status field. 1=unpublished, 2=submitted, 3=in press, 4=published.
    	  Obligatory field.
    ||        Entry separator
    
    e.g.
    TYPE: Pub
    MEDUID: 
    CITATION: Nature Genetics, 2:180-185 (1992)
    STATUS: 4
    ||
    
    
    The TYPE field is obligatory at the beginning of each entry, even if there are multiple entries of a given type in a file. The MEDUID field is the Medline unique identifier of the record. We do not normally expect you to supply it; we try to retrieve it from our relational version of the Medline database. The STATUS field takes one of the values 1=unpublished, 2=submitted, 3=in press, 4=published. The CITATION field is a free-format string. The only requirement is that you put an identical string in the CITATION field of the map data entries, since we match that field automatically against the publications in the publication table and replace the string with the publication id in the map table.

    2.2 Contact Files
    The following is an example of the valid tags and some illustrative data:

    TYPE:   Entry type - must be "Cont" for contact entries. Obligatory field.
    NAME:   Name of person who provided the mapping data. 
    FAX:    Fax number as string of digits.
    TEL:    Telephone number as string of digits.
    EMAIL:  E-mail address
    LAB:    Laboratory providing mapping data.
    INST:   Institution name
    ADDR:   Address string, comma-delimited.
    ||      Entry separator
    
    
    
    e.g.
    TYPE: Cont
    NAME: Sikela JM
    FAX: 3032707097
    TEL: 3032708637
    EMAIL: sikela_j%maui@vaxf.colorado.edu
    LAB: Sikela
    INST: University of Colorado Health Sciences Center
    ADDR: Pharmacology Box C236, 4200 E 9th Ave, Denver, CO 80262
    ||
    
    
    The TYPE field is obligatory at the beginning of each entry, even if there are multiple entries of a given type in a file. None of the other fields are obligatory, but we require at least the name of a contact person or lab. The contact name or contact lab fields in the map entries must contain an identical string to the string used for NAME and LAB fields in the contact entry, for automatic matching.

    2.3 Method Files
    The following is an example of the valid tags and some illustrative data:

    TYPE:     Entry type - must be "Meth" for method entries. Obligatory field.
    NAME:     Name of method. Obligatory field.
    ORGANISM: Organism from which library prepared. Obligatory field.
    ABSOLUTE: Method gives absolute or relative address? Y or N. Obligatory field.
    L1:       Interpretation of line 1.
    L2:       Interpretation of line 2.
    L3:       Interpretation of line 3.
    L4:       Interpretation of line 4.
    L5:       Interpretation of line 5.
    L6:       Interpretation of line 6.
    L7:       Interpretation of line 7.
    L8:       Interpretation of line 8.
    L9:       Interpretation of line 9.
    L10:      Interpretation of line 10.
    DESCR:    Description of method. Description starts on line after DESCR 
              heading. May be multi-line free format text.
    ||        Entry separator
    
    
    
    
    e.g.
    TYPE: Meth
    NAME:  YAC/CEPH JMS
    ORGANISM: Homo sapiens
    ABSOLUTE: n
    L1: plate
    L2: row
    L3: column
    L4: comment
    L5: comment
    L6: comment
    L7: comment
    DESCR:
    PCR-based mapping of 3'UT-derived primers to CEPH YAC DNA pools.
    Primers are chosen using the PRIMER program by Lincoln et al., ver 0.5 (1991).
    To date, MIT puts out YAC pools A and B;  if both pools
    were used for the mapping data given, then 'C' is designated.
    ||
    
    
    
    TYPE: Meth
    NAME:  Radiation Hybrid JMS
    ORGANISM: Homo sapiens
    ABSOLUTE: y
    L1: chromosome
    L2: bin
    L3: comment
    L4: comment
    L5: comment
    DESCR:
    Radiation hybrid panels with binning.
    Primers are chosen using the PRIMER program by Lincoln et al., ver 0.5 (1991).
    ||
    
    
    
    TYPE: Meth
    NAME:  Somatic Hybrid JMS
    ORGANISM: Homo sapiens
    ABSOLUTE: y
    L1: chromosome
    L2: arm
    L3: band
    L4: band range
    L5: comment
    L6: comment
    DESCR:
    Somatic cell hybrid mapping.
    Primers are chosen using the PRIMER program by Lincoln et al., ver 0.5 (1991).
    ||
    
    
    The TYPE field is obligatory at the beginning of each entry, even if there are multiple entries of a given type in a file. The tags L1 to L10 are available for describing how to interpret the data in the corresponding lines of the map data entries. There must be a method interpretation line for each map data line. When referring to the method in the map data entries, the METHOD field must contain a string identical to the NAME field of the method entry.
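The requirement that every map data line has a matching interpretation line can be tested with a few lines of Python. This is a sketch under stated assumptions: the function names are hypothetical, and each entry is represented as a dictionary of tag to value.

```python
import re

LINE_TAG = re.compile(r"^L([1-9]|10)$")  # matches L1 through L10 only


def line_tags(record):
    """Return the set of L1..L10 tags present in an entry."""
    return {tag for tag in record if LINE_TAG.match(tag)}


def undescribed_map_lines(method, map_entry):
    """Return map-entry L-tags that have no interpretation in the method."""
    described = {tag for tag in line_tags(method) if method[tag].strip()}
    missing = line_tags(map_entry) - described
    return sorted(missing, key=lambda tag: int(tag[1:]))
```

An empty result means every L-tag used in the map data entry is explained by the method entry; any tags returned need an interpretation line added to the method.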

    2.4 Map Data Files
    The following is an example of the valid tags and some illustrative data:

    TYPE:      Entry type - must be "Map" for map data entries. Obligatory
    STATUS:    Status of EST entry - "New", "Replace" or "Update". Obligatory
    CONT_NAME: Name of contact (must be identical string to contact entry)
    CONT_LAB:  Contact laboratory. (Must be identical string to contact entry.)
    METHOD:    Method name. (Must be identical string to the method entry name)
    CITATION:  Journal citation. (Must be identical string to publication entry)
    NCBI#:     NCBI Id of EST. (Must have either NCBI#, EST# or GB#)
    GB#:       GenBank accession number of EST.
    EST#:      EST name (can only use this if you are the original submitter 
               of ESTs)
    PUBLIC:    1=for release to public, 0=confidential, no general release. 
               Obligatory.
    MAPSTRING: Full mapping information. Unparsed. For output only.
    CHROM:     Chromosome name or number
    L1:        Line 1 of parsed mapping information.
    L2:        Line 2 of parsed mapping information.
    L3:        Line 3.
    L4:        Line 4.
    L5:        Line 5.
    L6:        Line 6.
    L7:        Line 7.
    L8:        Line 8.
    L9:        Line 9.
    L10:       Line 10 of parsed mapping information.
    ||         Entry separator
    
    
    e.g.
    TYPE: Map
    STATUS:  New
    CONT_NAME: Sikela JM
    METHOD: YAC/CEPH JMS
    CITATION: Nature Genetics, 2:180-185 (1992)
    NCBI#:21839
    PUBLIC: 1
    MAPSTRING: 959H08
    CHROM: 
    L1: 959
    L2: H
    L3: 08
    L4: Pool B
    L5: Forward Primer: CCCCAGCAGAGAAGTTAATT
    L6: Reverse Primer: GTCAACGTCAACATTCGTTT
    L7: Product Length: 162
    ||
    TYPE: Map
    STATUS:  New
    CONT_NAME: Sikela JM
    METHOD: Radiation Hybrid JMS
    CITATION: Nature Genetics, 2:180-185 (1992)
    NCBI#:21839
    PUBLIC: 1
    MAPSTRING: 4, bin 2
    CHROM: 4
    L1: 4
    L2: 2
    L3: Forward Primer: TTGAGGGTTTACAACAGATAGG
    L4: Reverse Primer: GAAATGGAAGAGAACCAGCT
    L5: Product Length: 119
    ||
    
    TYPE: Map
    STATUS:  New
    CONT_NAME: Sikela JM
    METHOD: Somatic Hybrid JMS
    CITATION: Nature Genetics, 2:180-185 (1992)
    GB#: T02813
    PUBLIC: 1
    MAPSTRING: 20
    CHROM: 20
    L1: 20
    L2:
    L3:
    L5: Forward Primer: GTCTTCCTGTGTCTGCTGAG
    L6: Reverse Primer: CACCTCACCTTACATCCAAA
    ||
    TYPE: Map
    STATUS:  New
    CONT_NAME: Sikela JM
    METHOD: Somatic Hybrid JMS
    CITATION: Nature Genetics, 2:180-185 (1992)
    EST#: EST0023c
    PUBLIC: 1
    MAPSTRING: 20
    CHROM: 20
    L1: 20
    L2:
    L3:
    L4:
    L5: Forward Primer: GTCTTCCTGTGTCTGCTGAG
    L6: Reverse Primer: CACCTCACCTTACATCCAAA
    ||
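The obligatory parts of a map data entry can be verified before submission. The Python sketch below is a hypothetical helper, not an NCBI tool. It assumes that STATUS values must be written exactly as quoted in the tag list, which the specification does not state explicitly, and it takes the entry as a dictionary of tag to value.

```python
VALID_STATUS = {"New", "Replace", "Update"}
ID_TAGS = ("NCBI#", "GB#", "EST#")  # at least one must identify the EST


def check_map_entry(record):
    """Return a list of problems found in one map data entry."""
    problems = []
    if record.get("TYPE") != "Map":
        problems.append('TYPE must be "Map"')
    if record.get("STATUS", "").strip() not in VALID_STATUS:
        problems.append("STATUS must be New, Replace or Update")
    if record.get("PUBLIC", "").strip() not in {"0", "1"}:
        problems.append("PUBLIC must be 0 or 1")
    if not any(record.get(tag, "").strip() for tag in ID_TAGS):
        problems.append("one of NCBI#, GB# or EST# is required")
    return problems
```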
    

    Carolyn Tolstoshev, NCBI, carolyn@ncbi.nlm.nih.gov,

    On-Line EST Database, Data Input Format Specification

    version 1.2 April 24, 1994 Draft Copy


    This draft document is being made available solely for review purposes and should not be quoted, circulated, reproduced or represented as an official NCBI document. The draft is undergoing revisions and should not be considered or represented as reflecting the views, positions or intentions of the NCBI or the National Library of Medicine.
