WWC snapshot of http://www.ncbi.nlm.nih.gov/dbEST/how_to_submit taken on Thu May 4 3:33:38 1995
ESTs by nature are usually submitted to GenBank and dbEST as batches of dozens to thousands of entries, with a great deal of redundancy in the citation, submitter and library information. To improve the efficiency of the submission process for this type of data, we have designed a separate streamlined submission process and data format.
EST entries may be submitted by email to "batch-sub@ncbi.nlm.nih.gov". A special tagged flat file input format (see below) has been designed for this data, to allow it to be submitted as one or more text files in this manner.
A document describing this data submission format is included below. If you have questions about this format, please contact "info@ncbi.nlm.nih.gov" and a member of the support staff will get back to you or pass your question on to Carolyn Tolstoshev or Mark Boguski for a response.
Once your data is ready to submit, email it to "batch-sub@ncbi.nlm.nih.gov" as described above. You will receive a list of dbEST ids and GenBank accession numbers from a dbEST curator via email.
1.1 File Types
There are four types of deliverable files:
a. Publication
b. Library
c. Contact
d. EST
2.1 Publication Files
The following is an example of the valid tags and some illustrative data:

TYPE:      Entry type - must be "Pub" for publication entries. Obligatory field.
MEDUID:    Medline unique identifier.
CITATION:  Journal citation for the publication. Obligatory field.
STATUS:    Status field. 1=unpublished, 2=submitted, 3=in press, 4=published. Obligatory field.

e.g.

TYPE: Pub
MEDUID: 91262645
CITATION: Science, 252:1651 (1991)
STATUS: 4
||

The TYPE field is obligatory at the beginning of each entry, even if there are multiple entries of a given type in a file.
The MEDUID field is a Medline record unique identifier. We do not normally expect you to supply this - we try to retrieve this from our relational version of the Medline database.
The status field is 1=unpublished, 2=submitted, 3=in press, 4=published.
The citation field is a free format string. The only requirement is that you put an identical string in the publication field of ESTs, since we will be matching that field automatically against the publications in the publication table and replacing the string with the publication id in the EST table.
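The tagged layout above can be read mechanically. The following is a minimal sketch, not NCBI's own loader: it assumes one "TAG: value" pair per line, tags made only of upper-case letters, digits, "_" and "#", and the "||" separator on a line of its own. The names parse_entries, check_pub and TAG_LINE are hypothetical.

```python
import re

# Assumed tag shape: upper-case letters, digits, "_" or "#", then a colon.
TAG_LINE = re.compile(r"^([A-Z][A-Z0-9_#]*):\s*(.*)$")

STATUS_CODES = {"1": "unpublished", "2": "submitted",
                "3": "in press", "4": "published"}


def parse_entries(text):
    """Split tagged flat-file text into one {tag: value} dict per entry."""
    entries, current, last_tag = [], {}, None
    for line in text.splitlines():
        stripped = line.strip()
        if stripped == "||":                 # entry separator
            if current:
                entries.append(current)
            current, last_tag = {}, None
            continue
        match = TAG_LINE.match(stripped)
        if match:
            last_tag = match.group(1)
            current[last_tag] = match.group(2)
        elif stripped and last_tag is not None:
            # Continuation of a field that starts below its tag (e.g. DESCR).
            current[last_tag] = (current[last_tag] + " " + stripped).strip()
    if current:                              # tolerate a missing final "||"
        entries.append(current)
    return entries


def check_pub(entry):
    """Return a list of problems found in one publication entry."""
    problems = []
    if entry.get("TYPE") != "Pub":
        problems.append('TYPE must be "Pub"')
    if not entry.get("CITATION"):
        problems.append("CITATION is obligatory")
    if entry.get("STATUS") not in STATUS_CODES:
        problems.append("STATUS must be 1, 2, 3 or 4")
    return problems


sample = """TYPE: Pub
MEDUID: 91262645
CITATION: Science, 252:1651 (1991)
STATUS: 4
||
"""
pubs = parse_entries(sample)
print(pubs[0]["CITATION"], check_pub(pubs[0]))
```

Continuation lines are joined with a space here, which suits free text; a free-text line that itself begins with an upper-case word and a colon would be misread as a tag under these assumptions.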
2.2 Library Files
The following is an example of the valid tags and some illustrative data:

TYPE:      Entry type - must be "Lib" for library entries. Obligatory field.
NAME:      Name of library. Obligatory field.
ORGANISM:  Organism from which library prepared.
ABBREV:    Organism abbreviation e.g. C.elegans
STRAIN:    Organism strain
COMMON:    Common name of organism.
VECTOR:    Name of vector.
V_TYPE:    Type of vector (Cosmid, Phage, Plasmid, YAC, other)
RE_1:      Restriction enzyme at site1 of vector
RE_2:      Restriction enzyme at site2 of vector
HOST:      Host name
DESCR:     Description of library preparation methods, vector, etc. This field starts on the line below the DESCR: tag.

e.g.

TYPE: Lib
NAME: Hippocampus, Ruben Moreno
ORGANISM: Homo sapiens
ABBREV: H.sapiens
STRAIN:
COMMON: Human
VECTOR: E145
V_TYPE: Phage
RE_1: NotI
RE_2: HindIII
HOST:
DESCR:
mRNA was purified from the hippocampus of an adult female. cDNA was constructed and cloned simultaneously using vector priming with the E145 vector and method described by Rubenstein et al. (Nucl. Acids Res. 18:4833, 1990). cDNA was directionally synthesized from the NotI site in the vector to the HindIII site.
||

The TYPE field is obligatory at the beginning of each entry, even if there are multiple entries of a given type in a file.
Try to keep the library NAME: field to

2.3 Contact Files
The following is an example of the valid tags and some illustrative data:
TYPE:   Entry type - must be "Cont" for contact entries. Obligatory field.
NAME:   Name of person who provided the EST.
FAX:    Fax number as string of digits.
TEL:    Telephone number as string of digits.
EMAIL:  E-mail address
LAB:    Laboratory providing EST.
INST:   Institution name
ADDR:   Address string, comma delineation.

e.g.

TYPE: Cont
NAME: Kerlavage AR
FAX: 3014808588
TEL: 3014968800
EMAIL: arkerlav@loglady.ninds.nih.gov
LAB: Receptor Biochemistry & Molecular Biology
INST: NIH
ADDR: NIH/NINDS/RBMB, Park Building, Room 405, Bethesda, MD 20892
||

The TYPE field is obligatory at the beginning of each entry, even if there are multiple entries of a given type in a file.
None of the other fields are obligatory, but we require at least the name of a contact person or lab. We would like as many of the fields filled in as possible, to provide complete information to the user for contacting a source for the EST or further information about it.
The contact name or contact lab fields in the EST entries must contain an identical string to the string used for NAME and LAB fields in the contact entry, for automatic matching.
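Because the matching between EST entries and contact entries is by identical string, it can be worth checking it before submission. The sketch below is one possible reading of the rule, not the curators' actual procedure: it assumes entries are already held as tag-to-value dictionaries, that comparison is exact (case and spacing included), and that every CONT_NAME or CONT_LAB supplied in the EST entry must match. The function names are hypothetical.

```python
def check_contact(entry):
    """Return a list of problems found in one contact entry."""
    problems = []
    if entry.get("TYPE") != "Cont":
        problems.append('TYPE must be "Cont"')
    if not (entry.get("NAME") or entry.get("LAB")):
        problems.append("give at least NAME or LAB")
    return problems


def matches_contact(est, contact):
    """True if every contact field given in the EST entry is identical
    to the corresponding field of the contact entry."""
    pairs = [("CONT_NAME", "NAME"), ("CONT_LAB", "LAB")]
    supplied = [(e, c) for e, c in pairs if est.get(e)]
    return bool(supplied) and all(est[e] == contact.get(c) for e, c in supplied)


contact = {
    "TYPE": "Cont",
    "NAME": "Kerlavage AR",
    "LAB": "Receptor Biochemistry & Molecular Biology",
}
est = {
    "TYPE": "EST",
    "CONT_NAME": "Kerlavage AR",
    "CONT_LAB": "Receptor Biochemistry & Molecular Biology",
}
print(check_contact(contact), matches_contact(est, contact))
```

Under this exact comparison a string such as "Kerlavage, A.R." would not match "Kerlavage AR", so the safest approach is to copy the strings from the contact entry rather than retype them.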
2.4 EST Files
The following is an example of the valid tags and some illustrative data:

TYPE:         Entry type - must be "EST" for EST entries. Obligatory field.
STATUS:       Status of EST entry - "New" or "Update". Obligatory field.
CONT_NAME:    Name of contact. (Must be identical string to the contact entry.)
CONT_LAB:     Contact laboratory. (Must be identical string to the contact entry.)
EST#:         EST id assigned by contact lab. Obligatory field. For EST entry updates, this is the string we match on.
GB#:          GenBank id.
GDB#:         Genome database accession number
GDB_DSEG:     Genome database Dsegment number
CLONE:        Clone id.
ATCC_DNA:     ATCC id for the clone as pure DNA
ATCC_INHOST:  ATCC id for the clone stored in the host.
OTHER_EST:    Other ESTs for this gene.
CITATION:     Journal citation. (Must be identical string to the publication entry.)
PRIMER:       Sequence primer description or sequence.
P_END:        Which end sequenced e.g. 5'
DNA_TYPE:     cDNA (default), Genomic, Viral, Synthetic, Other
MAP:          Map location.
LIBRARY:      Library name. (Must be identical string to library name entry.)
PUBLIC:       1=for release to public, 0=confidential, no general release. Obligatory.
PUT_ID:       Putative identification, found by homology with a database sequence.
COMMENT:      Comments about EST. Starts on line below COMMENT: tag.
E_DATE:       EMBL date of entry (used for parsing EMBL entries only. Ignore.)
U_DATE:       EMBL last update date. (used for EMBL entries only. Ignore.)
OWNER:        N, L, E or D - submitting database (For internal use only. Ignore.)
SEQUENCE:     Sequence string. Starts on line below SEQUENCE: tag. Obligatory field.

e.g.

TYPE: EST
STATUS: New
CONT_NAME: Kerlavage AR
CONT_LAB: Receptor Biochemistry & Molecular Biology
EST#: EST00001
GB#: M61954
GDB#:
GDB_DSEG:
CLONE: HHC189
ATCC_INHOST: 65128
OTHER_EST: EST00093, EST000101
CITATION: Science, 252:1651 (1991)
PRIMER: M13 Forward
P_END: 5' end
DNA_TYPE: cDNA
MAP: Chromosome 1
LIBRARY: Hippocampus, Stratagene (cat. #936205)
PUBLIC: 1
PUT_ID: Actin, gamma, skeletal
COMMENT:
This is a comment about the sequence.
It may contain features.
It may span several lines.
SEQUENCE:
AATCAGCCTGCAAGCAAAAGATAGGAATATTCACCTACAGTGGGCACCTCCTTAAGAAGCTG
ATAGCTTGTTACACAGTAATTAGATTGAAGATAATGGACACGAAACATATTCCGGGATTAAA
CATTCTTGTCAAGAAAGGGGGAGAGAAGTCTGTTGTGCAAGTTTCAAAGAAAAAGGGTACCA
GCAAAAGTGATAATGATTTGAGGATTTCTGTCTCTAATTGGAGGATGATTCTCATGTAAGGT
TGTTAGGAAATGGCAAAGTATTGATGATTGTGTGCTATGTGATTGGTGCTAGATACTTTAAC
TGAGTATACGAGTGAAATACTTGAGACTCGTGTCACTT
||

The TYPE field is obligatory at the beginning of each entry, even if there are multiple entries of a given type in a file.
Valid data values for the EST status line are New (new entry) or Update (change existing EST entry). When updating an EST, only the fields present in the EST file will be changed.
Please try to stick to standard map location formats, so that we will be able to write functions to parse them in the future.
Sequences start on the line below the tag, and should be 60 bases per line with no blank spaces.
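Two of the rules above lend themselves to a small illustration: laying the sequence out 60 per line with no blanks, and the Update rule that only fields present in the file are changed. This is a sketch under assumptions, not the dbEST implementation: entries are plain dictionaries, TYPE and STATUS are treated as control fields rather than data, and how an empty tag in an Update is handled is not stated in the text, so it is simply copied here. The function names wrap_sequence and apply_update are hypothetical.

```python
def wrap_sequence(sequence, width=60):
    """Strip all whitespace, then cut the sequence into fixed-width lines."""
    compact = "".join(sequence.split())
    return [compact[i:i + width] for i in range(0, len(compact), width)]


def apply_update(stored, update):
    """Overwrite only the fields that the Update entry actually carries."""
    merged = dict(stored)
    for tag, value in update.items():
        if tag not in ("TYPE", "STATUS"):    # control fields, not data
            merged[tag] = value
    return merged


stored = {"EST#": "EST00001", "MAP": "Chromosome 1", "PUBLIC": "0"}
update = {"TYPE": "EST", "STATUS": "Update", "EST#": "EST00001", "PUBLIC": "1"}
print(apply_update(stored, update))

print("SEQUENCE:")
print("\n".join(wrap_sequence("ACGT" * 40)))
```

In this example MAP is left untouched because the Update entry does not mention it, while PUBLIC changes from 0 to 1.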
Notes:
Carolyn Tolstoshev, NCBI, carolyn@ncbi.nlm.nih.gov,
On-Line EST Database, Data Input Format Specification
version 2.1 July 27, 1993 Draft Copy
1.1 File Types
There are four types of deliverable files:
a. Publication
b. Contact
c. Method
d. Map data

2.1 Publication Files
The following is an example of the valid tags and some illustrative data:

TYPE:      Entry type - must be "Pub" for publication entries. Obligatory field.
MEDUID:    Medline unique identifier.
CITATION:  Journal citation for the publication. Obligatory field.
STATUS:    Status field. 1=unpublished, 2=submitted, 3=in press, 4=published. Obligatory field.
||         Entry separator

e.g.

TYPE: Pub
MEDUID:
CITATION: Nature Genetics, 2:180-185 (1992)
STATUS: 4
||

The TYPE field is obligatory at the beginning of each entry, even if there are multiple entries of a given type in a file.
The MEDUID field is a Medline record unique identifier. We do not normally expect you to supply this - we try to retrieve this from our relational version of the Medline database.
The status field is 1=unpublished, 2=submitted, 3=in press, 4=published.
The citation field is a free format string. The only requirement is that you put an identical string in the publication field of map data, since we will be matching that field automatically against the publications in the publication table and replacing the string with the publication id in the map table.
2.2 Contact Files
The following is an example of the valid tags and some illustrative data:

TYPE:   Entry type - must be "Cont" for contact entries. Obligatory field.
NAME:   Name of person who provided the mapping data.
FAX:    Fax number as string of digits.
TEL:    Telephone number as string of digits.
EMAIL:  E-mail address
LAB:    Laboratory providing mapping data.
INST:   Institution name
ADDR:   Address string, comma delineation.
||      Entry separator

e.g.

TYPE: Cont
NAME: Sikela JM
FAX: 3032707097
TEL: 3032708637
EMAIL: sikela_j%maui@vaxf.colorado.edu
LAB: Sikela
INST: University of Colorado Health Sciences Center
ADDR: Pharmacology Box C236, 4200 E 9th Ave, Denver, CO 80262
||

The TYPE field is obligatory at the beginning of each entry, even if there are multiple entries of a given type in a file.
None of the other fields are obligatory, but we require at least the name of a contact person or lab.
The contact name or contact lab fields in the map entries must contain an identical string to the string used for NAME and LAB fields in the contact entry, for automatic matching.
2.3 Method Files
The following is an example of the valid tags and some illustrative data:

TYPE:      Entry type - must be "Meth" for method entries. Obligatory field.
NAME:      Name of method. Obligatory field.
ORGANISM:  Organism from which library prepared. Obligatory field.
ABSOLUTE:  Method gives absolute or relative address? Y or N. Obligatory field.
L1:        Interpretation of line 1.
L2:        Interpretation of line 2.
L3:        Interpretation of line 3.
L4:        Interpretation of line 4.
L5:        Interpretation of line 5.
L6:        Interpretation of line 6.
L7:        Interpretation of line 7.
L8:        Interpretation of line 8.
L9:        Interpretation of line 9.
L10:       Interpretation of line 10.
DESCR:     Description of method. Description starts on line after DESCR heading. May be multi-line free format text.
||         Entry separator

e.g.

TYPE: Meth
NAME: YAC/CEPH JMS
ORGANISM: Homo sapiens
ABSOLUTE: n
L1: plate
L2: row
L3: column
L4: comment
L5: comment
L6: comment
L7: comment
DESCR:
PCR-based mapping of 3'UT-derived primers to CEPH YAC DNA pools. Primers are chosen using the PRIMER program by Lincoln et al., ver 0.5 (1991). To date, MIT puts out YAC pools A and B; if both pools were used for the mapping data given, then 'C' is designated.
||
TYPE: Meth
NAME: Radiation Hybrid JMS
ORGANISM: Homo sapiens
ABSOLUTE: y
L1: chromosome
L2: bin
L3: comment
L4: comment
L5: comment
DESCR:
Radiation hybrid panels with binning. Primers are chosen using the PRIMER program by Lincoln et al., ver 0.5 (1991).
||
TYPE: Meth
NAME: Somatic Hybrid JMS
ORGANISM: Homo sapiens
ABSOLUTE: y
L1: chromosome
L2: arm
L3: band
L4: band range
L5: comment
L6: comment
DESCR:
Somatic cell hybrid mapping. Primers are chosen using the PRIMER program by Lincoln et al., ver 0.5 (1991).
||

The TYPE field is obligatory at the beginning of each entry, even if there are multiple entries of a given type in a file.
Lines 1 to 10 are available for describing interpretation of data in the corresponding map data entries. There must be a method interpretation line for each map data line.
When referring to the method in the map data entries, you must use an identical string for the method name to the string used for method name in the method entry.
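The relationship between a method entry's L1 to L10 labels and the L1 to L10 values of a map data entry can be pictured as a simple pairing. The sketch below assumes both entries are already tag-to-value dictionaries and that an empty map data line needs no label; the function name interpret is hypothetical and this is only an illustration of the rule, not how the database itself stores the data.

```python
def interpret(method, map_entry):
    """Pair each non-empty map data line with its label from the method entry."""
    labelled = []
    for n in range(1, 11):
        tag = f"L{n}"
        value = map_entry.get(tag, "")
        if not value:
            continue                         # nothing to interpret on this line
        if not method.get(tag):
            raise ValueError(f"{tag} has data but no interpretation in the method entry")
        labelled.append((method[tag], value))
    return labelled


method = {
    "TYPE": "Meth", "NAME": "YAC/CEPH JMS",
    "L1": "plate", "L2": "row", "L3": "column", "L4": "comment",
}
map_entry = {
    "TYPE": "Map", "METHOD": "YAC/CEPH JMS",
    "L1": "959", "L2": "H", "L3": "08", "L4": "Pool B",
}
for label, value in interpret(method, map_entry):
    print(f"{label}: {value}")
```

A map data entry that fills in a line the method entry does not describe (here, anything from L5 onward) raises an error, mirroring the requirement that every map data line has an interpretation line.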
2.4 Map Data Files
The following is an example of the valid tags and some illustrative data:

TYPE:       Entry type - must be "Map" for map data entries. Obligatory.
STATUS:     Status of map data entry - "New", "Replace" or "Update". Obligatory.
CONT_NAME:  Name of contact. (Must be identical string to contact entry.)
CONT_LAB:   Contact laboratory. (Must be identical string to contact entry.)
METHOD:     Method name. (Must be identical string to the method entry name.)
CITATION:   Journal citation. (Must be identical string to publication entry.)
NCBI#:      NCBI Id of EST. (Must have either NCBI#, EST# or GB#.)
GB#:        GenBank accession number of EST.
EST#:       EST name (can only use this if you are the original submitter of ESTs)
PUBLIC:     1=for release to public, 0=confidential, no general release. Obligatory.
MAPSTRING:  Full mapping information. Unparsed. For output only.
CHROM:      Chromosome name or number
L1:         Line 1 of parsed mapping information.
L2:         Line 2 of parsed mapping information.
L3:         Line 3.
L4:         Line 4.
L5:         Line 5.
L6:         Line 6.
L7:         Line 7.
L8:         Line 8.
L9:         Line 9.
L10:        Line 10 of parsed mapping information.
||          Entry separator

e.g.

TYPE: Map
STATUS: New
CONT_NAME: Sikela JM
METHOD: YAC/CEPH JMS
CITATION: Nature Genetics, 2:180-185 (1992)
NCBI#: 21839
PUBLIC: 1
MAPSTRING: 959H08
CHROM:
L1: 959
L2: H
L3: 08
L4: Pool B
L5: Forward Primer: CCCCAGCAGAGAAGTTAATT
L6: Reverse Primer: GTCAACGTCAACATTCGTTT
L7: Product Length: 162
||
TYPE: Map
STATUS: New
CONT_NAME: Sikela JM
METHOD: Radiation Hybrid JMS
CITATION: Nature Genetics, 2:180-185 (1992)
NCBI#: 21839
PUBLIC: 1
MAPSTRING: 4, bin 2
CHROM: 4
L1: 4
L2: 2
L3: Forward Primer: TTGAGGGTTTACAACAGATAGG
L4: Reverse Primer: GAAATGGAAGAGAACCAGCT
L5: Product Length: 119
||
TYPE: Map
STATUS: New
CONT_NAME: Sikela JM
METHOD: Somatic Hybrid JMS
CITATION: Nature Genetics, 2:180-185 (1992)
GB#: T02813
PUBLIC: 1
MAPSTRING: 20
CHROM: 20
L1: 20
L2:
L3:
L5: Forward Primer: GTCTTCCTGTGTCTGCTGAG
L6: Reverse Primer: CACCTCACCTTACATCCAAA
||
TYPE: Map
STATUS: New
CONT_NAME: Sikela JM
METHOD: Somatic Hybrid JMS
CITATION: Nature Genetics, 2:180-185 (1992)
EST#: EST0023c
PUBLIC: 1
MAPSTRING: 20
CHROM: 20
L1: 20
L2:
L3:
L4:
L5: Forward Primer: GTCTTCCTGTGTCTGCTGAG
L6: Reverse Primer: CACCTCACCTTACATCCAAA
||
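The obligatory fields and identical-string rules for map data entries can be checked before a file is mailed. This is a hedged sketch of such a pre-submission check, not the checks NCBI actually runs: it assumes entries are tag-to-value dictionaries, that string comparison is exact and case-sensitive, and that PUBLIC is given as the literal "0" or "1". The names check_map, VALID_STATUS and ID_TAGS are hypothetical.

```python
VALID_STATUS = {"New", "Replace", "Update"}
ID_TAGS = ("NCBI#", "GB#", "EST#")


def check_map(entry, method_names):
    """Return a list of problems found in one map data entry."""
    problems = []
    if entry.get("TYPE") != "Map":
        problems.append('TYPE must be "Map"')
    if entry.get("STATUS") not in VALID_STATUS:
        problems.append("STATUS must be New, Replace or Update")
    if entry.get("PUBLIC") not in ("0", "1"):
        problems.append("PUBLIC must be 0 or 1")
    if not any(entry.get(tag) for tag in ID_TAGS):
        problems.append("one of NCBI#, GB# or EST# is required")
    method = entry.get("METHOD")
    if method and method not in method_names:
        problems.append(f"METHOD {method!r} has no identical method entry")
    return problems


method_names = {"YAC/CEPH JMS", "Radiation Hybrid JMS", "Somatic Hybrid JMS"}
good = {"TYPE": "Map", "STATUS": "New", "METHOD": "YAC/CEPH JMS",
        "NCBI#": "21839", "PUBLIC": "1"}
bad = {"TYPE": "Map", "STATUS": "New", "METHOD": "Radiation hybrid JMS",
       "PUBLIC": "1"}
print(check_map(good, method_names))
print(check_map(bad, method_names))
```

The second entry is reported twice: it carries no EST identifier, and its method name differs from the method entry by the case of a single letter, which an exact string match treats as a different name.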
Carolyn Tolstoshev, NCBI, carolyn@ncbi.nlm.nih.gov,
On-Line EST Database, Data Input Format Specification
version 1.2 April 24, 1994 Draft Copy