WWC snapshot of http://www.ncbi.nlm.nih.gov/dbSTS/how_to_submit taken on Thu May 4 3:33:38 1995
STSs by nature are usually submitted to GenBank and dbSTS as batches of dozens to thousands of entries, with a great deal of redundancy in the citation, submitter and library information. To improve the efficiency of the submission process for this type of data, we have designed a separate streamlined submission process and data format.
STS entries may be submitted by email to "batch-sub@ncbi.nlm.nih.gov". A special tagged flat file input format (see below) has been designed for this data, to allow it to be submitted as one or more text files in this manner.
A document describing this data submission format is included below. If you have questions about this format, please contact "info@ncbi.nlm.nih.gov" and a member of the support staff will get back to you or pass your question on to Carolyn Tolstoshev or Mark Boguski for a response.
Once your data is ready to submit, email it to "batch-sub@ncbi.nlm.nih.gov" as described above. You will receive a list of dbSTS ids and GenBank accession numbers from a dbSTS curator via email.
1.1 File Types
There are six types of deliverable files:
a. Publication
b. Source
c. Contact
d. Protocol
e. Buffer
f. STS

2.1 Publication Files
The following is an example of the valid tags and some illustrative data:
TYPE:     Entry type - must be "Pub" for publication entries. Obligatory field.
MEDUID:   Medline unique identifier.
CITATION: Journal citation for the publication. Obligatory field.
AUTHORS:  Names of authors. Obligatory field.
STATUS:   Status field. 1=unpublished, 2=submitted, 3=in press, 4=published. Obligatory field.

e.g.

TYPE: Pub
MEDUID:
CITATION: Human chromosome 7 STS
AUTHORS: Green,E.
STATUS: 1
||

The TYPE field is obligatory at the beginning of each entry, even if there are multiple entries of a given type in a file. The MEDUID field is a Medline record unique identifier. We do not normally expect you to supply this - we try to retrieve it from our relational version of the Medline database. The status field is 1=unpublished, 2=submitted, 3=in press, 4=published. The authors field is a free format string. The citation field is a free format string. The only requirement is that you put an identical string in the CITATION field of STSs, since we will be matching that field automatically against the publications in the publication table and replacing the string with the publication id in the STS table.
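As a rough illustration of the rules above, a submitter might check publication entries before mailing them. The sketch below is not part of the NCBI tooling: the function names are hypothetical, and it assumes one tag per line with the "||" separator on a line of its own. Only the tags, the obligatory fields and the STATUS codes are taken from the description above.

```python
# Hypothetical pre-submission check; only the tags, the "||" separator
# and the STATUS codes come from the format description.
PUB_TAGS = ("TYPE", "MEDUID", "CITATION", "AUTHORS", "STATUS")
PUB_OBLIGATORY = ("TYPE", "CITATION", "AUTHORS", "STATUS")
STATUS_CODES = {"1": "unpublished", "2": "submitted",
                "3": "in press", "4": "published"}


def parse_pub_entries(text):
    """Return one dict per entry; entries end at a line containing only '||'."""
    entries, current = [], {}
    for raw in text.splitlines():
        line = raw.strip()
        if line == "||":
            if current:
                entries.append(current)
            current = {}
            continue
        tag, sep, value = line.partition(":")
        if sep and tag in PUB_TAGS:
            current[tag] = value.strip()
    if current:  # tolerate a missing final separator (an assumption)
        entries.append(current)
    return entries


def check_pub_entry(entry):
    """Return a list of problems found in one parsed publication entry."""
    problems = [f"missing {tag}" for tag in PUB_OBLIGATORY if not entry.get(tag)]
    if entry.get("TYPE") and entry["TYPE"] != "Pub":
        problems.append("TYPE must be Pub")
    if entry.get("STATUS") and entry["STATUS"] not in STATUS_CODES:
        problems.append("STATUS must be 1, 2, 3 or 4")
    return problems


sample = """TYPE: Pub
MEDUID:
CITATION: Human chromosome 7 STS
AUTHORS: Green,E.
STATUS: 1
||
"""
entries = parse_pub_entries(sample)
print(entries[0]["CITATION"], check_pub_entry(entries[0]))
```

An empty list from check_pub_entry means none of the problems this sketch looks for were found; it says nothing about how the dbSTS curators will process the entry.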
2.2 Source Files
The following is an example of the valid tags and some illustrative data:
TYPE:     Entry type - must be "Source" for source entries. Obligatory field.
NAME:     Name of source. Obligatory field.
ORGANISM: Organism from which the source was prepared. Obligatory field.
ABBREV:   Organism abbreviation, e.g. C.elegans
STRAIN:   Organism strain
COMMON:   Common name of organism.
VECTOR:   Name of vector.
V_TYPE:   Type of vector (Cosmid, Phage, Plasmid, YAC, other)
HOST:     Host name
DESCR:    Description of source preparation methods, vector, etc.
          This field starts on the line below the DESCR: tag.
TAX:      Ignore this. (Taxonomy line for GenBank.) Internal use only.

e.g.

TYPE: Source
NAME: Human EGreen
ORGANISM: Homo sapiens
ABBREV: H.sapiens
STRAIN:
COMMON: human
VECTOR:
V_TYPE:
HOST:
DESCR:
Put description text here.
||

The TYPE field is obligatory at the beginning of each entry, even if there are multiple entries of a given type in a file. When referring to the source in the STS entries, you must use a string for the source name that is identical to the source name in the source entry. The DESCR: field should contain as much detail about the source as seems appropriate.
2.3 Contact Files
The following is an example of the valid tags and some illustrative data:
TYPE:  Entry type - must be "Cont" for contact entries. Obligatory field.
NAME:  Name of person who provided the STS.
FAX:   Fax number as string of digits.
TEL:   Telephone number as string of digits.
EMAIL: E-mail address
LAB:   Laboratory providing STS.
INST:  Institution name
ADDR:  Address string, comma delineation.

e.g.

TYPE: Cont
NAME: Eric Green
EMAIL: egreen@wugenmail.wustl.edu
LAB: Center for Genetics in Medicine
INST: Washington University School of Medicine
ADDR: Box 8232, 4566 Scott Avenue, St. Louis, MO 63110, USA
||

The TYPE field is obligatory at the beginning of each entry, even if there are multiple entries of a given type in a file. None of the other fields are obligatory, but we require at least the name of a contact person or lab. We would like as many of the fields filled in as possible, to provide complete information to the user for contacting a source for the STS or further information about it. The contact name or contact lab fields in the STS entries must contain an identical string to the string used for NAME and LAB fields in the contact entry, for automatic matching.
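Because the matching is on identical strings, a difference in case or spacing would presumably leave an STS entry without its contact. The following sketch shows one way a submitter could check this before sending files. The function name and the dictionary representation of entries are assumptions made for illustration; only the tag names come from the format.

```python
def unmatched_contact_fields(sts_entry, contact_entries):
    """Return the STS contact tags whose value matches no contact entry exactly."""
    # Pairs of (tag in the STS entry, tag in the contact entry).
    pairs = (("CONT_NAME", "NAME"), ("CONT_LAB", "LAB"))
    unmatched = []
    for sts_tag, cont_tag in pairs:
        value = sts_entry.get(sts_tag)
        if not value:
            continue  # field not supplied in the STS entry
        known = {c.get(cont_tag) for c in contact_entries}
        if value not in known:  # exact comparison, no case folding
            unmatched.append(sts_tag)
    return unmatched


contacts = [{"TYPE": "Cont", "NAME": "Eric Green",
             "LAB": "Center for Genetics in Medicine"}]

print(unmatched_contact_fields({"CONT_NAME": "Eric Green"}, contacts))
print(unmatched_contact_fields({"CONT_NAME": "eric green"}, contacts))
```

The comparison is deliberately strict, so "eric green" is reported as unmatched. The same pattern could be applied to the source, protocol, buffer and citation strings, which the format also matches on identical strings.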
2.4 Protocol Files
The following is an example of the valid tags and some illustrative data:
TYPE:     Entry type - must be "Protocol" for protocol entries. Obligatory field.
NAME:     Name of protocol. Obligatory field.
PROTOCOL: Description of protocol used. Starts on the line below the
          PROTOCOL tag. Lay out this description as you want it to appear
          in GenBank, using blanks to line up columns, not tabs.
||

e.g.

TYPE: Protocol
NAME: STS-A
PROTOCOL:
Template:        30-100 ng
Primer:          each 1 uM
dNTPs:           each 200 uM
Taq Polymerase:  0.05 units/ul
Total Vol:       5 ul
||
TYPE: Protocol
NAME: STS-B
PROTOCOL:
Template:        30-100 ng
Primer:          each 1 uM
dNTPs:           each 200 uM
Taq Polymerase:  0.05 units/ul
Total Vol:       10 ul
||

The TYPE field is obligatory at the beginning of each entry, even if there are multiple entries of a given type in a file. When referring to the protocol in the STS entries, you must use a string for the protocol name that is identical to the protocol name in the protocol entry. The PROTOCOL: field should contain as much detail about the protocol as seems appropriate. Lay out this description as you want it to appear in GenBank, using blanks to line up columns, not tabs.
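If a description was typed with tabs, it can be converted to blanks before submission. The sketch below uses Python's built-in str.expandtabs; the function name and the tab width of 8 are assumptions, so use whatever width your editor displayed when the columns were lined up.

```python
def tabs_to_blanks(description, tabsize=8):
    """Replace each tab with blanks up to the next tab stop, line by line."""
    return "\n".join(line.expandtabs(tabsize)
                     for line in description.splitlines())


# Hypothetical tab-separated input; both values start at column 16 after conversion.
protocol = "Template:\t30-100 ng\nPrimer:\t\teach 1 uM"
print(tabs_to_blanks(protocol))
```

The same conversion applies to the BUFFER and PCR_PROFILE descriptions, which carry the same blanks-not-tabs requirement.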
2.5 Buffer Files
The following is an example of the valid tags and some illustrative data:
TYPE:   Entry type - must be "Buffer" for buffer entries. Obligatory field.
NAME:   Name of buffer. Obligatory field.
BUFFER: Description of buffer used. Starts on the line below the BUFFER
        tag. Lay out this description as you want it to appear in GenBank,
        using blanks to line up columns, not tabs.
||

e.g.

TYPE: Buffer
NAME: STS-1
BUFFER:
MgCl2:     1.5 mM
KCl:       50 mM
Tris-HCl:  10 mM
pH:        8.3
||
TYPE: Buffer
NAME: STS-2
BUFFER:
MgCl2:     2.5 mM
KCl:       50 mM
Tris-HCl:  10 mM
pH:        8.3
||

The TYPE field is obligatory at the beginning of each entry, even if there are multiple entries of a given type in a file. When referring to the buffer in the STS entries, you must use a string for the buffer name that is identical to the buffer name in the buffer entry. The BUFFER: field should contain as much detail about the buffer as seems appropriate. Lay out this description as you want it to appear in GenBank, using blanks to line up columns, not tabs.
2.6 STS Files
The following is an example of the valid tags and some illustrative data:
TYPE:        Entry type - must be "STS" for STS entries. Obligatory field.
STATUS:      Status of STS entry - "New" or "Update". Obligatory field.
CONT_NAME:   Name of contact. (Must be identical string to the contact entry.)
CONT_LAB:    Contact laboratory. (Must be identical string to contact entry.)
PROTOCOL:    Protocol name. (Must be identical string to the protocol entry.)
BUFFER:      Buffer name. (Must be identical string to the buffer entry.)
SOURCE:      Source name. (Must be identical string to source name entry.)
CITATION:    Journal citation. (Must be identical string to publication entry.)
STS#:        STS id assigned by contact lab. Obligatory field.
             For STS entry updates, this is the string we match on.
SYNONYMS:    Synonyms list, separated by commas.
GB#:         GenBank id.
GDB#:        Human genome database accession number
GDB_DSEG:    Human genome database Dsegment number
CLONE:       Clone id.
P_END:       Which end sequenced, e.g. 5'
DNA_TYPE:    Genomic (default), cDNA, Viral, Synthetic, Other
SIZE:        Size of STS (in nucleotides)
F_PRIMER:    Sequence of forward primer
B_PRIMER:    Sequence of backward primer.
PCR_PROFILE: Description of PCR profile. This starts on line below
             PCR_PROFILE: tag. Line up data as you wish it to appear in
             GenBank. Use blanks, not tabs to format this data.
PUBLIC:      1= for release to public, 0=confidential, no general release.
             Obligatory.
GENE_SYMBOL: Putative gene symbol
GENE_NAME:   Full name of putative gene.
PRODUCT:     Putative product identification.
COMMENT:     Comments about STS. Starts on line below COMMENT: tag.
E_DATE:      EMBL date of entry (used for parsing EMBL entries only. Ignore.)
U_DATE:      EMBL last update date. (used for EMBL entries only. Ignore.)
OWNER:       N, L, E or D - submitter database (For internal use only. Ignore.)
SEQUENCE:    Sequence string. Starts on line below SEQUENCE: tag.
             Obligatory field.

e.g.
TYPE: STS
STATUS: New
CONT_NAME: Eric Green
PROTOCOL: STS-A
BUFFER: STS-1
CITATION: Human chromosome 7 STS
SOURCE: Human EGreen
STS#: sWSS282
SYNONYMS:
F_PRIMER: AAGCACAGGAGAAGATGG
B_PRIMER: GAATTGACAGACAGTAAGGAAG
DNA_TYPE: Genomic
P_END:
PUBLIC: 1
PRODUCT:
GENE_SYMBOL:
GENE_NAME:
SIZE: 143
PCR_PROFILE:
Presoak:         0 degrees C for 0.00 minute(s)
Denaturation:    92 degrees C for 1.00 minute(s)
Annealing:       60 degrees C for 2.00 minute(s)
Polymerization:  72 degrees C for 2.00 minute(s)
PCR Cycles:      35
Thermal Cycler:  Perkin Elmer TC
SEQUENCE:
ATTCTATCCAAGTCTCAAGGCCCCACAACCTGGAGCTCTGATGCTCAAGCACAGGAGAAG
ATGGGTGTCCAGCTCAAACACAGAGAACACATTCACCCTTCCCTGCCTTTTTGTTCTGTT
CAGACCCTCAGCAGATAGGATGCCTGCCCACAGCGGTAAGGGCACATCTTCCTTACTGTC
TGTCAATTCAGATGCTGATCACTCTGGT
||

The TYPE field is obligatory at the beginning of each entry, even if there are multiple entries of a given type in a file. Valid data values for the STS status line are New (new entry) or Update (change existing STS entry). When updating an STS, only the fields present in the STS file will be changed. Sequences start on the line below the tag, and should be 60 bases per line with no blank spaces.
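The sequence layout rule (60 per line, no blanks) is easy to apply mechanically. Here is a minimal sketch, with a hypothetical function name, that strips all whitespace from a sequence string and re-wraps it:

```python
def format_sequence(sequence, width=60):
    """Remove all whitespace, then split the sequence into lines of `width`."""
    compact = "".join(sequence.split())
    return "\n".join(compact[i:i + width]
                     for i in range(0, len(compact), width))


# Made-up input with stray blanks and a line break in the wrong place.
raw = "AAGCACAGGA GAAGATGG\nGAATTGACAGACAGTAAGGAAG"
print("SEQUENCE:")
print(format_sequence(raw))
```

The output of format_sequence goes on the lines below the SEQUENCE: tag, as in the example entry above. The function does not check that the characters are valid nucleotide codes.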
Notes:
Carolyn Tolstoshev, NCBI, carolyn@ncbi.nlm.nih.gov,
On-Line EST Database, Data Input Format Specification
version 2.1 July 27, 1993 Draft Copy
1.1 File Types
There are four types of deliverable files:
a. Publication
b. Contact
c. Method
d. Map data

2.1 Publication Files
The following is an example of the valid tags and some illustrative data:
TYPE:     Entry type - must be "Pub" for publication entries. Obligatory field.
MEDUID:   Medline unique identifier.
CITATION: Journal citation for the publication. Obligatory field.
STATUS:   Status field. 1=unpublished, 2=submitted, 3=in press, 4=published. Obligatory field.
||        Entry separator

e.g.

TYPE: Pub
MEDUID:
CITATION: Nature Genetics, 2:180-185 (1992)
STATUS: 4
||

The TYPE field is obligatory at the beginning of each entry, even if there are multiple entries of a given type in a file. The MEDUID field is a Medline record unique identifier. We do not normally expect you to supply this - we try to retrieve it from our relational version of the Medline database. The status field is 1=unpublished, 2=submitted, 3=in press, 4=published. The citation field is a free format string. The only requirement is that you put an identical string in the publication field of map data, since we will be matching that field automatically against the publications in the publication table and replacing the string with the publication id in the map table.
2.2 Contact Files
The following is an example of the valid tags and some illustrative data:

TYPE:  Entry type - must be "Cont" for contact entries. Obligatory field.
NAME:  Name of person who provided the mapping data.
FAX:   Fax number as string of digits.
TEL:   Telephone number as string of digits.
EMAIL: E-mail address
LAB:   Laboratory providing mapping data.
INST:  Institution name
ADDR:  Address string, comma delineation.
||     Entry separator

e.g.

TYPE: Cont
NAME: Sikela JM
FAX: 3032707097
TEL: 3032708637
EMAIL: sikela_j%maui@vaxf.colorado.edu
LAB: Sikela
INST: University of Colorado Health Sciences Center
ADDR: Pharmacology Box C236, 4200 E 9th Ave, Denver, CO 80262
||

The TYPE field is obligatory at the beginning of each entry, even if there are multiple entries of a given type in a file. None of the other fields are obligatory, but we require at least the name of a contact person or lab. The contact name or contact lab fields in the map entries must contain an identical string to the string used for NAME and LAB fields in the contact entry, for automatic matching.
2.3 Method Files
The following is an example of the valid tags and some illustrative data:
TYPE:     Entry type - must be "Meth" for method entries. Obligatory field.
NAME:     Name of method. Obligatory field.
ORGANISM: Organism from which library prepared. Obligatory field.
ABSOLUTE: Method gives absolute or relative address? Y or N. Obligatory field.
L1:       Interpretation of line 1.
L2:       Interpretation of line 2.
L3:       Interpretation of line 3.
L4:       Interpretation of line 4.
L5:       Interpretation of line 5.
L6:       Interpretation of line 6.
L7:       Interpretation of line 7.
L8:       Interpretation of line 8.
L9:       Interpretation of line 9.
L10:      Interpretation of line 10.
DESCR:    Description of method. Description starts on line after DESCR
          heading. May be multi-line free format text.
||        Entry separator

e.g.

TYPE: Meth
NAME: YAC/CEPH JMS
ORGANISM: Homo sapiens
ABSOLUTE: n
L1: plate
L2: row
L3: column
L4: comment
L5: comment
L6: comment
L7: comment
DESCR:
PCR-based mapping of 3'UT-derived primers to CEPH YAC DNA pools.
Primers are chosen using the PRIMER program by Lincoln et al.,
ver 0.5 (1991). To date, MIT puts out YAC pools A and B; if both
pools were used for the mapping data given, then 'C' is designated.
||
TYPE: Meth
NAME: Radiation Hybrid JMS
ORGANISM: Homo sapiens
ABSOLUTE: y
L1: chromosome
L2: bin
L3: comment
L4: comment
L5: comment
DESCR:
Radiation hybrid panels with binning. Primers are chosen using the
PRIMER program by Lincoln et al., ver 0.5 (1991).
||
TYPE: Meth
NAME: Somatic Hybrid JMS
ORGANISM: Homo sapiens
ABSOLUTE: y
L1: chromosome
L2: arm
L3: band
L4: band range
L5: comment
L6: comment
DESCR:
Somatic cell hybrid mapping. Primers are chosen using the PRIMER
program by Lincoln et al., ver 0.5 (1991).
||

The TYPE field is obligatory at the beginning of each entry, even if there are multiple entries of a given type in a file. Lines 1 to 10 are available for describing interpretation of data in the corresponding map data entries. There must be a method interpretation line for each map data line.
When referring to the method in the map data entries, you must use a string for the method name that is identical to the method name in the method entry.
2.4 Map Data Files
The following is an example of the valid tags and some illustrative data:
TYPE:      Entry type - must be "Map" for map data entries. Obligatory field.
STATUS:    Status of STS entry - "New", "Replace" or "Update". Obligatory.
CONT_NAME: Name of contact. (Must be identical string to the contact entry.)
CONT_LAB:  Contact laboratory. (Must be identical string to contact entry.)
METHOD:    Method name. (Must be identical string to the method entry name.)
CITATION:  Journal citation. (Must be identical string to publication entry.)
NCBI#:     NCBI Id of STS. (Must have either NCBI#, STS# or GB#.)
STS#:      Name of STS. (Must have STS#, NCBI# or GB#.)
GB#:       GenBank accession number of STS.
PUBLIC:    1= for release to public, 0=confidential, no general release.
           Obligatory.
MAPSTRING: Full mapping information. Unparsed. For output only.
CHROM:     Chromosome name or number
L1:        Line 1 of parsed mapping information.
L2:        Line 2 of parsed mapping information.
L3:        Line 3.
L4:        Line 4.
L5:        Line 5.
L6:        Line 6.
L7:        Line 7.
L8:        Line 8.
L9:        Line 9.
L10:       Line 10 of parsed mapping information.
||         Entry separator

e.g.

TYPE: Map
STATUS: New
CONT_NAME: Sikela JM
METHOD: YAC/CEPH JMS
CITATION: Nature Genetics, 2:180-185 (1992)
NCBI#: 51839
PUBLIC: 1
MAPSTRING: 956H08
CHROM:
L1: 959
L2: H
L3: 08
L4: Pool B
L5: Forward Primer: CCCCAGAGTTCCAAGTTAATT
L6: Reverse Primer: GTCGCATTGCTCAACATTCGTTT
L7: Product Length: 162
||
TYPE: Map
STATUS: New
CONT_NAME: Sikela JM
METHOD: Radiation Hybrid JMS
CITATION: Nature Genetics, 2:180-185 (1992)
STS#: STST001a
PUBLIC: 1
MAPSTRING: 4, bin 2
CHROM: 4
L1: 4
L2: 2
L3: Forward Primer: TTDDGTAGAGGGTGCTAAGAAGG
L4: Reverse Primer: GAAATGGACCTATTAAAACCAGCT
L5: Product Length: 119
||
TYPE: Map
STATUS: New
CONT_NAME: Sikela JM
METHOD: Somatic Hybrid JMS
CITATION: Nature Genetics, 2:180-185 (1992)
GB#: T12813
PUBLIC: 1
MAPSTRING: 20
CHROM: 20
L1: 20
L2:
L3:
L4:
L5: Forward Primer: CGTAATGTCCCTGTGTCTGAG
L6: Reverse Primer: CACCTCACCCATAGCCTTAGCTA
||
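The requirement that every map data line has a method interpretation line can be checked mechanically. The sketch below assumes the method and map entries have already been read into dictionaries keyed by tag; the function name and that representation are illustrative assumptions, not part of the format.

```python
LINE_TAGS = [f"L{i}" for i in range(1, 11)]  # L1 .. L10


def lines_without_interpretation(map_entry, method_entry):
    """Return L-tags filled in the map entry but not interpreted by the method."""
    return [tag for tag in LINE_TAGS
            if map_entry.get(tag) and not method_entry.get(tag)]


method = {"TYPE": "Meth", "NAME": "Radiation Hybrid JMS",
          "L1": "chromosome", "L2": "bin",
          "L3": "comment", "L4": "comment", "L5": "comment"}

map_entry = {"TYPE": "Map", "METHOD": "Radiation Hybrid JMS",
             "L1": "4", "L2": "2",
             "L3": "Forward Primer: TTDDGTAGAGGGTGCTAAGAAGG",
             "L4": "Reverse Primer: GAAATGGACCTATTAAAACCAGCT",
             "L5": "Product Length: 119"}

print(lines_without_interpretation(map_entry, method))
# A map entry using L6 would be flagged, since the method stops at L5.
print(lines_without_interpretation({**map_entry, "L6": "extra"}, method))
```

Empty map lines are not flagged, which matches the somatic hybrid example above, where L2 to L4 are left blank.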