WWC snapshot of http://www.ncbi.nlm.nih.gov/dbSTS/how_to_submit taken on Thu May 4 3:33:38 1995
NCBI GenBank

Submission of STSs to GenBank and dbSTS


STSs are by nature usually submitted to GenBank and dbSTS in batches of dozens to thousands of entries, with a great deal of redundancy in the citation, submitter and library information. To make submission of this type of data more efficient, we have designed a separate, streamlined submission process and data format.

STS entries may be submitted by email to "batch-sub@ncbi.nlm.nih.gov". A special tagged flat file input format (see below) has been designed for this data, to allow it to be submitted as one or more text files in this manner.

A document describing this data submission format is included below. If you have questions about this format, please contact "info@ncbi.nlm.nih.gov" and a member of the support staff will get back to you or pass your question on to Carolyn Tolstoshev or Mark Boguski for a response.

Once your data is ready to submit, email it to "batch-sub@ncbi.nlm.nih.gov" as described above. A dbSTS curator will send you a list of dbSTS ids and GenBank accession numbers by email.

Carolyn Tolstoshev, carolyn@ncbi.nlm.nih.gov
Mark Boguski, boguski@ncbi.nlm.nih.gov

Submission Format

The following is a specification for flat file formats for delivering STS and related data to the NCBI STS database. The format consists of capitalized tags, each terminated by a colon and followed by data. Except for sequence, comment, PCR profile, and source, protocol and buffer description data, each data field should appear on the same line as its tag, with no line wrapping. Each record (including the last record in the file) should end with a || tag to indicate the end of the record.
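
The record layout described above can be read with a few lines of Python. This is only a sketch of one possible reader, not the parsing software used by NCBI. It assumes that a tag is a run of capitals, digits, "_" or "#" at the very start of a line followed by a colon, that any other non-blank line continues the previous field, and that a record without a closing || is dropped. The names TAG_RE and parse_records are made up for this illustration.

```python
import re

# Assumption: a tag starts in column 1 and uses only capitals, digits, "_" or "#".
TAG_RE = re.compile(r"^([A-Z][A-Z0-9_#]*):[ \t]*(.*)$")

def parse_records(text):
    """Split tagged flat-file text into a list of {tag: value} dicts."""
    records, current, last_tag = [], {}, None
    for raw in text.splitlines():
        line = raw.rstrip()
        if line.strip() == "||":              # end-of-record marker
            if current:
                records.append(current)
            current, last_tag = {}, None
            continue
        match = TAG_RE.match(line)
        if match:
            last_tag = match.group(1)
            current[last_tag] = match.group(2)
        elif last_tag is not None and line.strip():
            # Continuation line of a multi-line field (sequence, description, ...).
            previous = current[last_tag]
            current[last_tag] = line if not previous else previous + "\n" + line
    return records

sample = """TYPE: Pub
MEDUID:
CITATION: Human chromosome 7 STS
AUTHORS: Green,E.
STATUS: 1
||
"""
for record in parse_records(sample):
    print(record["TYPE"], "|", record["CITATION"])
```

Indented continuation lines such as "        Template:       30-100 ng" are kept as data because they do not start in column 1.
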
1.1 File Types
There are six types of deliverable files:
a. Publication
b. Source
c. Contact
d. Protocol
e. Buffer
f. STS
2.1 Publication Files

The following is an example of the valid tags and some illustrative data:

TYPE:     Entry type - must be "Pub" for publication entries. Obligatory field.
MEDUID:   Medline unique identifier.
CITATION: Journal citation for the publication. Obligatory field.
AUTHORS:  Names of authors. Obligatory field.
STATUS:   Status field. 1=unpublished, 2=submitted, 3=in press, 4=published.
          Obligatory field.


e.g.
TYPE: Pub
MEDUID:
CITATION: Human chromosome 7 STS
AUTHORS: Green,E.
STATUS: 1
||

The TYPE field is obligatory at the beginning of each entry, even if there are multiple entries of a given type in a file. The MEDUID field is a Medline record unique identifier. We do not normally expect you to supply this; we try to retrieve it from our relational version of the Medline database. The STATUS field takes one of the values 1=unpublished, 2=submitted, 3=in press, 4=published. The AUTHORS and CITATION fields are free format strings. The only requirement is that you put an identical string in the CITATION field of the STS entries, since we will be matching that field automatically against the publications in the publication table and replacing the string with the publication id in the STS table.

2.2 Source Files

The following is an example of the valid tags and some illustrative data:

TYPE:     Entry type - must be "Source" for source entries. Obligatory field.
NAME:     Name of source. Obligatory field.
ORGANISM: Organism from which source prepared. Obligatory field.
ABBREV:   Organism abbreviation e.g. C.elegans
STRAIN:   Organism strain
COMMON:   Common name of organism.
VECTOR:   Name of vector.
V_TYPE:   Type of vector (Cosmid, Phage, Plasmid, YAC, other)
HOST:     Host name
DESCR:    Description of source preparation methods, vector, etc. This field
          starts on the line below the DESCR: tag.
TAX:      Ignore this. (Taxonomy line for GenBank.) Internal use only.


e.g.
TYPE: Source
NAME:  Human EGreen
ORGANISM: Homo sapiens
ABBREV: H.sapiens
STRAIN:
COMMON: human
VECTOR:
V_TYPE:
HOST:
DESCR:
Put description text here.
||

The TYPE field is obligatory at the beginning of each entry, even if there are multiple entries of a given type in a file. When referring to the source in the STS entries, you must use a string for the source name that is identical to the NAME string in the source entry. The DESCR: field should contain as much detail about the source as seems appropriate.

2.3 Contact Files

The following is an example of the valid tags and some illustrative data:

TYPE:   Entry type - must be "Cont" for contact entries. Obligatory field.
NAME:   Name of person who provided the STS.
FAX:    Fax number as string of digits.
TEL:    Telephone number as string of digits.
EMAIL:  E-mail address
LAB:    Laboratory providing STS.
INST:   Institution name
ADDR:   Address string, comma delimited.


e.g.
TYPE: Cont
NAME: Eric Green
EMAIL: egreen@wugenmail.wustl.edu
LAB: Center for Genetics in Medicine
INST: Washington University School of Medicine
ADDR: Box 8232, 4566 Scott Avenue, St. Louis, MO 63110, USA
||

The TYPE field is obligatory at the beginning of each entry, even if there are multiple entries of a given type in a file. None of the other fields are obligatory, but we require at least the name of a contact person or lab. We would like as many of the fields filled in as possible, to provide complete information to the user for contacting a source for the STS or further information about it. The contact name or contact lab fields in the STS entries must contain an identical string to the string used for NAME and LAB fields in the contact entry, for automatic matching.
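
The rule that a contact entry needs at least a contact person or a lab can be checked before mailing a file. The function below is a hypothetical pre-submission check written for this illustration; it assumes each entry has already been read into a dictionary keyed by tag.

```python
def check_contact(entry):
    """Return a list of problems found in one contact entry (a dict keyed by tag)."""
    problems = []
    if entry.get("TYPE") != "Cont":
        problems.append('TYPE must be "Cont"')
    # Either NAME or LAB must carry data; an empty tag counts as missing.
    if not (entry.get("NAME", "").strip() or entry.get("LAB", "").strip()):
        problems.append("need at least NAME or LAB")
    return problems

print(check_contact({"TYPE": "Cont", "NAME": "Eric Green"}))
print(check_contact({"TYPE": "Cont", "EMAIL": "egreen@wugenmail.wustl.edu"}))
```
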

2.4 Protocol Files

The following is an example of the valid tags and some illustrative data:

TYPE:     Entry type - must be "Protocol" for protocol entries. Obligatory field.
NAME:     Name of protocol. Obligatory field.
PROTOCOL: Description of protocol used. Starts on the line below the PROTOCOL:
          tag. Lay out this description as you want it to appear in GenBank,
          using blanks to line up columns, not tabs.
||        Entry separator



e.g.
TYPE: Protocol
NAME: STS-A
PROTOCOL:
        Template:       30-100 ng
        Primer:         each 1 uM
        dNTPs:          each 200 uM
        Taq Polymerase: 0.05 units/ul
        Total Vol:      5 ul
||
TYPE: Protocol
NAME: STS-B
PROTOCOL:
        Template:       30-100 ng
        Primer:         each 1 uM
        dNTPs:          each 200 uM
        Taq Polymerase: 0.05 units/ul
        Total Vol:      10 ul
||

The TYPE field is obligatory at the beginning of each entry, even if there are multiple entries of a given type in a file. When referring to the protocol in the STS entries, you must use a string for the protocol name that is identical to the NAME string in the protocol entry. The PROTOCOL: field should contain as much detail about the protocol as seems appropriate. Lay out this description as you want it to appear in GenBank, using blanks to line up columns, not tabs.

2.5 Buffer Files

The following is an example of the valid tags and some illustrative data:

TYPE:   Entry type - must be "Buffer" for buffer entries. Obligatory field.
NAME:   Name of buffer. Obligatory field.
BUFFER: Description of buffer used. Starts on the line below the BUFFER:
        tag. Lay out this description as you want it to appear in GenBank,
        using blanks to line up columns, not tabs.
||      Entry separator



e.g.

TYPE: Buffer
NAME: STS-1
BUFFER:
        MgCl2:    1.5 mM
        KCl:      50 mM
        Tris-HCl: 10 mM
        pH:       8.3
||
TYPE: Buffer
NAME: STS-2
BUFFER:
        MgCl2:    2.5 mM
        KCl:      50 mM
        Tris-HCl: 10 mM
        pH:       8.3
||

The TYPE field is obligatory at the beginning of each entry, even if there are multiple entries of a given type in a file. When referring to the buffer in the STS entries, you must use a string for the buffer name that is identical to the NAME string in the buffer entry. The BUFFER: field should contain as much detail about the buffer as seems appropriate. Lay out this description as you want it to appear in GenBank, using blanks to line up columns, not tabs.
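
One way to get columns lined up with blanks rather than tabs is to generate the description lines from label/value pairs. The helper below is an illustrative sketch; its name, the 8-blank indent and the single blank between columns are choices made for this example, not requirements of the format.

```python
def aligned_block(pairs, indent=8):
    """Format (label, value) pairs as blank-aligned lines containing no tabs."""
    width = max(len(label) for label, _ in pairs) + 1     # +1 for the colon
    lines = []
    for label, value in pairs:
        # Pad "label:" so that every value starts in the same column.
        lines.append(" " * indent + (label + ":").ljust(width + 1) + value)
    return "\n".join(lines)

buffer_text = aligned_block([
    ("MgCl2", "1.5 mM"),
    ("KCl", "50 mM"),
    ("Tris-HCl", "10 mM"),
    ("pH", "8.3"),
])
print("BUFFER:")
print(buffer_text)
```

The same helper could be used for PROTOCOL: and PCR_PROFILE: descriptions, which follow the same layout rule.
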

2.6 STS Files

The following is an example of the valid tags and some illustrative data:

TYPE:        Entry type - must be "STS" for STS entries. Obligatory field.
STATUS:      Status of STS entry - "New" or "Update". Obligatory field.
CONT_NAME:   Name of contact (must be identical string to the contact entry)
CONT_LAB:    Contact laboratory. (Must be identical string to contact entry.)
PROTOCOL:    Protocol name. (Must be identical string to the protocol entry.)
BUFFER:      Buffer name. (Must be identical string to the buffer entry.)
SOURCE:      Source name.  (Must be identical string to source name entry.)
CITATION:    Journal citation. (Must be identical string to publication entry)
STS#:        STS id assigned by contact lab. Obligatory field. For STS entry
             updates, this is the string we match on.
SYNONYMS:    Synonyms list, separated by commas.
GB#:         GenBank id.
GDB#:        Human genome database accession number
GDB_DSEG:    Human genome database Dsegment number
CLONE:       Clone id.
P_END:       Which end sequenced e.g. 5'
DNA_TYPE:    Genomic (default), cDNA, Viral, Synthetic, Other
SIZE:        Size of STS (in nucleotides)
F_PRIMER:    Sequence of forward primer
B_PRIMER:    Sequence of backward primer.
PCR_PROFILE: Description of PCR profile. This starts on line below PCR_PROFILE:
             tag. Line up data as you wish it to appear in GenBank. 
             Use blanks, not tabs to format this data.
PUBLIC:      1=for release to public, 0=confidential, no general release.
             Obligatory.
GENE_SYMBOL: Putative gene symbol
GENE_NAME:   Full name of putative gene.
PRODUCT:     Putative product identification.
COMMENT:     Comments about STS. Starts on line below COMMENT: tag.
E_DATE:      EMBL date of entry (used for parsing EMBL entries only. Ignore.)
U_DATE:      EMBL last update date. (used for EMBL entries only. Ignore.)
OWNER:       N, L, E or D - submitter database (For internal use only. Ignore.)
SEQUENCE:    Sequence string. Starts on line below SEQUENCE: tag. Obligatory
             field.


e.g.
TYPE: STS
STATUS: New
CONT_NAME: Eric Green
PROTOCOL: STS-A
BUFFER: STS-1
CITATION: Human chromosome 7 STS
SOURCE: Human EGreen
STS#: sWSS282
SYNONYMS:
F_PRIMER: AAGCACAGGAGAAGATGG
B_PRIMER: GAATTGACAGACAGTAAGGAAG
DNA_TYPE: Genomic
P_END:
PUBLIC: 1
PRODUCT:
GENE_SYMBOL:
GENE_NAME:
SIZE: 143
PCR_PROFILE:
        Presoak:         0 degrees C for 0.00 minute(s)
        Denaturation:   92 degrees C for 1.00 minute(s)
        Annealing:      60 degrees C for 2.00 minute(s)
        Polymerization: 72 degrees C for 2.00 minute(s)
        PCR Cycles:     35
        Thermal Cycler: Perkin Elmer TC
SEQUENCE:
ATTCTATCCAAGTCTCAAGGCCCCACAACCTGGAGCTCTGATGCTCAAGCACAGGAGAAG
ATGGGTGTCCAGCTCAAACACAGAGAACACATTCACCCTTCCCTGCCTTTTTGTTCTGTT
CAGACCCTCAGCAGATAGGATGCCTGCCCACAGCGGTAAGGGCACATCTTCCTTACTGTC
TGTCAATTCAGATGCTGATCACTCTGGT
||


The TYPE field is obligatory at the beginning of each entry, even if there are multiple entries of a given type in a file. Valid data values for the STATUS line of an STS entry are New (new entry) or Update (change an existing STS entry). When updating an STS, only the fields present in the STS file will be changed. The sequence starts on the line below the SEQUENCE: tag and should be written 60 characters per line, with no blank spaces.
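
The 60-characters-per-line layout for sequences is easy to produce mechanically. The function below is a small sketch with a name chosen for this illustration; it removes all whitespace from the input and then cuts the sequence into lines of 60 characters, with a shorter final line if needed.

```python
def format_sequence(seq, width=60):
    """Remove all whitespace from a sequence and wrap it at `width` characters per line."""
    compact = "".join(seq.split())
    return "\n".join(compact[i:i + width] for i in range(0, len(compact), width))

print("SEQUENCE:")
print(format_sequence("ACGT" * 40))   # 160 bases: lines of 60, 60 and 40 characters
print("||")
```
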

Notes:

  • If data is not available for some fields, the field can either be omitted entirely, or the tag may be included with an empty data field. Please do not use placeholders such as "*" or "-" to indicate missing data.
  • In the STS file, you are given a choice between contact name or contact lab. If the name of the person supplying the STSs is in the contact table, use the CONT_NAME field. If the contact entry is only identified with the name of a lab, use the CONT_LAB field instead.
  • For the fields CONT_LAB, CONT_NAME, SOURCE, PROTOCOL, BUFFER and CITATION, it is very important that the string in the STS file field is completely identical to that provided in the contact, source, protocol, buffer and publication files. We will be scanning these fields from the STS file and matching them automatically to source, contact, protocol, buffer and publication records in the other tables, so content, spelling, letter case and spacing must match.
  • The DNA type is assumed to be Genomic, so this field may be omitted unless the DNA type differs from this.
  • If you wish, you can submit sources, pubs, contacts, protocols, buffers and STSs all in one file - the TYPE field will differentiate them for the parsing software. However, if you are submitting new sources, protocols, buffers, contacts and/or publications in the file with STSs, and the new STSs refer to them, they must precede the STSs in the file, otherwise the STS crossmatching will not succeed.
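
The matching and ordering rules in the notes above can be checked on a combined file before it is sent. The sketch below assumes the entries have already been read, in file order, into dictionaries keyed by tag; the table REFERENCES and the function name are made up for this illustration. A reported reference is not necessarily an error, since it may point to a source, contact, protocol, buffer or publication that was submitted earlier and is not repeated in this file.

```python
# STS field -> (TYPE of the entry it refers to, tag holding the string to match)
REFERENCES = {
    "CONT_NAME": ("Cont", "NAME"),
    "CONT_LAB": ("Cont", "LAB"),
    "PROTOCOL": ("Protocol", "NAME"),
    "BUFFER": ("Buffer", "NAME"),
    "SOURCE": ("Source", "NAME"),
    "CITATION": ("Pub", "CITATION"),
}

def unresolved_references(records):
    """List (STS#, field, value) for STS fields with no identical string earlier in the file."""
    seen = set()       # (entry type, tag, exact string) triples met so far
    missing = []
    for rec in records:
        rtype = rec.get("TYPE")
        if rtype == "STS":
            for field, (ref_type, ref_tag) in REFERENCES.items():
                value = rec.get(field, "")
                # Exact comparison: spelling, letter case and spacing all count.
                if value and (ref_type, ref_tag, value) not in seen:
                    missing.append((rec.get("STS#"), field, value))
        else:
            for tag, value in rec.items():
                seen.add((rtype, tag, value))
    return missing

records = [
    {"TYPE": "Source", "NAME": "Human EGreen"},
    {"TYPE": "STS", "STS#": "sWSS282", "SOURCE": "Human EGreen", "BUFFER": "STS-1"},
    {"TYPE": "Buffer", "NAME": "STS-1"},   # defined after the STS that uses it
]
print(unresolved_references(records))
```

In this example the buffer is reported because its entry follows the STS entry that refers to it.
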

    Carolyn Tolstoshev, NCBI, carolyn@ncbi.nlm.nih.gov,

    On-Line EST Database, Data Input Format Specification

    version 2.1 July 27, 1993 Draft Copy


    This draft document is being made available solely for review purposes and should not be quoted, circulated, reproduced or represented as an official NCBI document. The draft is undergoing revisions and should not be considered or represented as reflecting the views, positions or intentions of the NCBI or the National Library of Medicine.

    Submission Format for Map Data

    The following is a specification for flat file formats for delivering STS mapping and related data to the NCBI STS database. The format consists of capitalized tags, each terminated by a colon and followed by data. Each record (including the last record in the file) should end with a || tag to indicate the end of the record.
    1.1 File Types
    There are four types of deliverable files: 
    a. Publication
    b. Contact
    c. Method
    d. Map data
    
    2.1 Publication Files

    The following is an example of the valid tags and some illustrative data:

    TYPE:     Entry type - must be "Pub" for publication entries. Obligatory field.
    MEDUID:   Medline unique identifier.
    CITATION: Journal citation for the publication. Obligatory field.
    STATUS:   Status field. 1=unpublished, 2=submitted, 3=in press, 4=published.
              Obligatory field.
    ||	  Entry separator
    
    e.g.
    TYPE: Pub
    MEDUID: 
    CITATION: Nature Genetics, 2:180-185 (1992)
    STATUS: 4
    ||
    
    
    The TYPE field is obligatory at the beginning of each entry, even if there are multiple entries of a given type in a file. The MEDUID field is a Medline record unique identifier. We do not normally expect you to supply this; we try to retrieve it from our relational version of the Medline database. The STATUS field takes one of the values 1=unpublished, 2=submitted, 3=in press, 4=published. The CITATION field is a free format string. The only requirement is that you put an identical string in the CITATION field of the map data entries, since we will be matching that field automatically against the publications in the publication table and replacing the string with the publication id in the map table.

    2.2 Contact Files

    The following is an example of the valid tags and some illustrative data:
    TYPE:   Entry type - must be "Cont" for contact entries. Obligatory field.
    NAME:   Name of person who provided the mapping data. 
    FAX:    Fax number as string of digits.
    TEL:    Telephone number as string of digits.
    EMAIL:  E-mail address
    LAB:    Laboratory providing mapping data.
    INST:   Institution name
    ADDR:   Address string, comma delimited.
    ||      Entry separator
    
    
    
    e.g.
    TYPE: Cont
    NAME: Sikela JM
    FAX: 3032707097
    TEL: 3032708637
    EMAIL: sikela_j%maui@vaxf.colorado.edu
    LAB: Sikela
    INST: University of Colorado Health Sciences Center
    ADDR: Pharmacology Box C236, 4200 E 9th Ave, Denver, CO 80262
    ||
    
    
    The TYPE field is obligatory at the beginning of each entry, even if there are multiple entries of a given type in a file. None of the other fields are obligatory, but we require at least the name of a contact person or lab. The contact name or contact lab fields in the map entries must contain an identical string to the string used for NAME and LAB fields in the contact entry, for automatic matching.

    2.3 Method Files

    The following is an example of the valid tags and some illustrative data:

    TYPE:     Entry type - must be "Meth" for method entries. Obligatory field.
    NAME:     Name of method. Obligatory field.
    ORGANISM: Organism from which library prepared. Obligatory field.
    ABSOLUTE: Method gives absolute or relative address? Y or N. Obligatory field.
    L1:       Interpretation of line 1.
    L2:       Interpretation of line 2.
    L3:       Interpretation of line 3.
    L4:       Interpretation of line 4.
    L5:       Interpretation of line 5.
    L6:       Interpretation of line 6.
    L7:       Interpretation of line 7.
    L8:       Interpretation of line 8.
    L9:       Interpretation of line 9.
    L10:      Interpretation of line 10.
    DESCR:    Description of method. Description starts on line after 
              DESCR heading. May be multi-line free format text.
    ||        Entry separator
    
    
    
    
    e.g.
    TYPE: Meth
    NAME:  YAC/CEPH JMS
    ORGANISM: Homo sapiens
    ABSOLUTE: n
    L1: plate
    L2: row
    L3: column
    L4: comment
    L5: comment
    L6: comment
    L7: comment
    DESCR:
    PCR-based mapping of 3'UT-derived primers to CEPH YAC DNA pools.
    Primers are chosen using the PRIMER program by Lincoln et al., ver 0.5 (1991).
    To date, MIT puts out YAC pools A and B;  if both pools
    were used for the mapping data given, then 'C' is designated.
    ||
    
    
    
    TYPE: Meth
    NAME:  Radiation Hybrid JMS
    ORGANISM: Homo sapiens
    ABSOLUTE: y
    L1: chromosome
    L2: bin
    L3: comment
    L4: comment
    L5: comment
    DESCR:
    Radiation hybrid panels with binning.
    Primers are chosen using the PRIMER program by Lincoln et al., ver 0.5 (1991).
    ||
    
    
    
    TYPE: Meth
    NAME:  Somatic Hybrid JMS
    ORGANISM: Homo sapiens
    ABSOLUTE: y
    L1: chromosome
    L2: arm
    L3: band
    L4: band range
    L5: comment
    L6: comment
    DESCR:
    Somatic cell hybrid mapping.
    Primers are chosen using the PRIMER program by Lincoln et al., ver 0.5 (1991).
    ||
    
    
    
    The TYPE field is obligatory at the beginning of each entry, even if there are multiple entries of a given type in a file. Lines L1 to L10 are available for describing how to interpret the data in the corresponding lines of the map data entries. There must be a method interpretation line for each map data line. When referring to the method in the map data entries, you must use a string for the method name that is identical to the NAME string in the method entry.
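
The requirement that every map data line has a matching interpretation line in the method entry can be checked with a short function. This is a sketch under the assumption that both entries are available as dictionaries keyed by tag; the names LINE_TAG and uninterpreted_lines are hypothetical, and empty L tags are treated as unused.

```python
import re

LINE_TAG = re.compile(r"^L([1-9]|10)$")      # matches the tags L1 to L10

def uninterpreted_lines(method, map_entry):
    """Return L tags that carry data in a map entry but have no interpretation in the method."""
    described = {tag for tag, value in method.items()
                 if LINE_TAG.match(tag) and value.strip()}
    used = [tag for tag, value in map_entry.items()
            if LINE_TAG.match(tag) and value.strip()]
    return [tag for tag in used if tag not in described]

method = {"TYPE": "Meth", "NAME": "Radiation Hybrid JMS",
          "L1": "chromosome", "L2": "bin", "L3": "comment"}
map_entry = {"TYPE": "Map", "METHOD": "Radiation Hybrid JMS",
             "L1": "4", "L2": "2", "L3": "Product Length: 119", "L4": "extra note"}
print(uninterpreted_lines(method, map_entry))
```

Here L4 is reported, because the method entry in this example describes only L1 to L3.
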

    2.4 Map Data Files

    The following is an example of the valid tags and some illustrative data:

    TYPE:       Entry type - must be "Map" for map data entries. Obligatory field
    STATUS:     Status of STS entry - "New","Replace" or "Update". Obligatory.
    CONT_NAME:  Name of contact (must be identical string to the contact entry)
    CONT_LAB:   Contact laboratory. (Must be identical string to contact entry.)
    METHOD:     Method name. (Must be identical string to the method entry name)
    CITATION:   Journal citation. (Must be identical string to publication entry)
    NCBI#:      NCBI Id of STS. (Must have either NCBI#, STS# or GB#)
    STS#:       Name of STS. (Must have STS#, NCBI# or GB#)
    GB#:        GenBank accession number of STS.
    PUBLIC:     1=for release to public, 0=confidential, no general release.
                Obligatory.
    MAPSTRING:  Full mapping information. Unparsed. For output only.
    CHROM:      Chromosome name or number
    L1:         Line 1 of parsed mapping information.
    L2:         Line 2 of parsed mapping information.
    L3:         Line 3.
    L4:         Line 4.
    L5:         Line 5.
    L6:         Line 6.
    L7:         Line 7.
    L8:         Line 8.
    L9:         Line 9.
    L10:        Line 10 of parsed mapping information.
    ||          Entry separator
    
    
    e.g.
    TYPE: Map
    STATUS:  New
    CONT_NAME: Sikela JM
    METHOD: YAC/CEPH JMS
    CITATION: Nature Genetics, 2:180-185 (1992)
    NCBI#: 51839
    PUBLIC: 1
    MAPSTRING: 956H08
    CHROM: 
    L1: 959
    L2: H
    L3: 08
    L4: Pool B
    L5: Forward Primer: CCCCAGAGTTCCAAGTTAATT
    L6: Reverse Primer: GTCGCATTGCTCAACATTCGTTT
    L7: Product Length: 162
    ||
    
    TYPE: Map
    STATUS:  New
    CONT_NAME: Sikela JM
    METHOD: Radiation Hybrid JMS
    CITATION: Nature Genetics, 2:180-185 (1992)
    STS#: STST001a
    PUBLIC: 1
    MAPSTRING: 4, bin 2
    CHROM: 4
    L1: 4
    L2: 2
    L3: Forward Primer: TTDDGTAGAGGGTGCTAAGAAGG
    L4: Reverse Primer: GAAATGGACCTATTAAAACCAGCT
    L5: Product Length: 119
    ||
    
    TYPE: Map
    STATUS:  New
    CONT_NAME: Sikela JM
    METHOD: Somatic Hybrid JMS
    CITATION: Nature Genetics, 2:180-185 (1992)
    GB#: T12813
    PUBLIC: 1
    MAPSTRING: 20
    CHROM: 20
    L1: 20
    L2:
    L3:
    L4:
    L5: Forward Primer: CGTAATGTCCCTGTGTCTGAG
    L6: Reverse Primer: CACCTCACCCATAGCCTTAGCTA
    ||
    
    
