How to submit nucleotide sequence data to the EMBL Data Library: Information for Authors

The EMBL Data Library, Postfach 10.2209, D-6900 Heidelberg, Federal Republic of Germany

January 1990

1 The first step in getting an accession number

Before doing anything else, authors should get a copy of a sequence data submission form. This form solicits all of the information needed to make a database entry; that is, the primary sequence data together with descriptive information such as the source of the sequenced segment (e.g., organism, strain, tissue) and the location of interesting regions within the sequence (e.g., coding regions, regulatory signals). It also contains information about data formats. The data submission form exists in both a paper and a computer-readable version; the latter can be completed using a text editor. These versions are available from the following sources:

(a) Paper form: printed at the end of this article; also available from the Development editorial office and, upon request, from EMBL, GenBank and the DNA Databank of Japan (DDBJ) at the addresses given in Appendix II.

(b) Computer-readable form:
(1) With all releases of the EMBL and GenBank databases since January 1987 and with DDBJ releases since January 1988.
(2) From EMBL by electronic mail (computer network) via our file server. Anyone with access to BITNET (either directly or via a gateway) can send a request to the EMBL file server, which will automatically return a copy of the data submission form by electronic mail. Instructions for using the EMBL file server are given in Appendix I.
(3) From EMBL, on Macintosh or IBM-compatible (5¼" or 3½") floppy diskettes. Complete information on how to contact the EMBL Data Library is given in Appendix II.
(4) From GenBank via electronic mail or on floppy diskette. For information on requesting the form from GenBank via Telenet, contact David Benton (+1-415-962-7360). Researchers in Japan can obtain the form by dialing up the DDBJ computer system (0559-75-6026).

2 What to submit to the EMBL Data Library

A data submission should include the following (for further details, see the data submission form itself):

(a) the sequence itself, in computer-readable form (computer network mail, magnetic tape or IBM-compatible or Macintosh floppy diskette). Printouts will be accepted only if the authors have no access to a computer.
(b) a completed data submission form for each submitted sequence. The form is available from the sources listed in section 1(a).
(c) a computer network address, a telex number or a telefax number (advisable, to help speed things up, but not required).

3 How to send data to the EMBL Data Library

Data can be sent to the Data Library in one of several ways:

(a) Electronic file transfer: files can be sent via computer network to DATASUBS@EMBL.BITNET. This BITNET address can be reached directly (by people at BITNET sites) or via various gateways from Arpanet, Usenet, JANET, etc. Ask your local network expert for help or phone us (+49-6221-387-258).
(b) Telefax to Data Submissions, EMBL Data Library. Our fax number is: +49-6221-387-306.
(c) Normal post. See address given in Appendix II.

4 How long will it take to get an accession number?

We will process data submissions within 7 working days of receipt and send authors notification of either what accession number(s) their data have been assigned or what additional information is needed. There are several things authors can do to minimise the time it takes to get an accession number:

(a) Be sure that submissions include all the necessary materials and that all relevant questions on the data submission form have been answered.

(b) Check the data to be sure that they do not contain inconsistencies or errors (e.g., a stop codon in the middle of a region listed on the form as an exon).
(c) Be sure to include either a computer network address or a telex or telefax number. If this information is not provided, notification of accession numbers will be sent by regular post. Telephoning is costly and time-consuming, and the Data Library will therefore not attempt to contact authors by phone.

Although we will process data submissions as quickly as we can, we strongly encourage authors to submit their data at or before the time they begin writing the manuscript, rather than once it is finished. This way we can process the data while the manuscript is being written, and authors will not have to delay submission of their manuscript while they wait for notification of their accession number.

It should be emphasised that authors are responsible for communicating their accession number(s) to the journal at the time they submit their manuscript; the Data Library will not contact the journal.

5 Data security

The data submission form asks authors whether their submitted data can be made available to the public immediately or whether it should be withheld until publication.

6 Updating your data

Once a database entry has been created from a submission, a copy is sent to the submitter for his/her reference and for comments or corrections. However, it often happens that the entry is correct when it is created but, with the passage of time, becomes out of date: the authors may make corrections to the sequence itself, or may discover new features of the sequence. Since such findings are generally not published, the only way to keep entries correct and up to date is for the authors to communicate their new findings to the database. This can be done by normal post or electronic mail to the address given in Appendix II.

One type of update which merits separate mention is that relating to citations.
Most submissions represent data that have not yet been accepted for publication, and therefore the journal citation is not available when the entry is created. Adding this information at a later date requires that the database staff identify which submissions correspond to which publications; while this is often straightforward, it can also be problematic, especially if the journal does not print an accession number in the article, or if the submitted and the published data are not identical. We therefore strongly encourage researchers to let us know when and where data they have submitted to us are published.

Appendix I. EMBL network file server

Computer users with access to BITNET (directly or via a gateway) can obtain copies of the data submission form, or of database entries, by sending commands to a file server running on the VAXcluster at EMBL. The file server facility is provided free of charge, though users may have to meet some or all of the communication costs, depending on the accounting system of their local computer service.

To use this facility, send file server commands (as electronic mail) to the address NETSERV@EMBL.BITNET. Each line of the mail message should consist of a single file server command, and nothing else. The mail can be sent over BITNET, or from any other network which has a gateway into BITNET (e.g., JANET in the UK or ARPANET in the USA). The most important file server command, to get users started, is HELP. If the file server receives this command, it will return a help file to the sender, explaining in some detail how to use the facility.

In order to send electronic mail to a BITNET address, users must find out which command they have to use on their own local machine and how they should format the address NETSERV@EMBL.BITNET. Users who don't already know how to do this should contact their local computer service or, if all else fails, contact the Data Library and we will do our best to help.
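The rule that each line of the mail message carries exactly one file server command can be illustrated with a short Python sketch. This is only an illustration under stated assumptions: the helper name build_netserv_message is hypothetical, only the HELP command named above is used, and actually sending the mail is left to whatever mail system is available locally.

```python
NETSERV_ADDRESS = "NETSERV@EMBL.BITNET"  # address given in the text


def build_netserv_message(commands):
    """Return a mail body with one file server command per line.

    Hypothetical helper for illustration: the file server only ever
    sees the resulting text, so all this does is enforce the
    "one command per line, and nothing else" rule.
    """
    lines = []
    for command in commands:
        text = command.strip()
        if not text:
            continue  # ignore empty entries rather than send blank lines
        if "\n" in text or "\r" in text:
            raise ValueError("each command must fit on a single line")
        lines.append(text)
    if not lines:
        raise ValueError("at least one command is required")
    return "\n".join(lines) + "\n"


# A first message asking the server for its help file.
body = build_netserv_message(["HELP"])
```

The resulting text in body would then be mailed to NETSERV_ADDRESS using the local mail command.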
Below are some examples which illustrate how to send commands to the file server using a VAX/VMS system that is a BITNET node running JNET software. To send commands to the file server, you could use the operating system command MAIL as follows:

    $ MAIL <filename> "JNET% ""NETSERV@EMBL"""

where <filename> is the name of a file containing file server commands. To request help information the file should contain the following command:

    HELP

To request a copy of the data submission form, it should contain the following GET command:

    GET DATALIB:DATASUB.TXT

Users can also request specific sequences via the file server. Information on how to do this is provided in the HELP file.

Appendix II. How to contact the nucleotide sequence databases

EMBL Data Library:
(a) Computer network: datasubs@embl.bitnet (for data submissions); datalib@embl.bitnet (for questions requiring a personal response)
(b) Postal address: Data Submissions, EMBL Data Library, Postfach 10.2209, 6900 Heidelberg, Federal Republic of Germany
(c) Telephone: +49-6221-387-258
(d) Telefax: +49-6221-387-306
(e) Telex: 461613 (embl d)

GenBank:
(a) Computer network address: gb-subs@lanl.gov
(b) Postal address: GenBank Submissions, Mail Stop K710, Los Alamos National Laboratory, Los Alamos, NM 87545, USA
(c) Telephone: +1-505-665-2177
(d) Telefax: +1-505-665-3493

DNA Databank of Japan:
(a) Computer network: ddbjsub@ddbj.nig.ac.jp (for data submissions); ddbj@ddbj.nig.ac.jp (for other enquiries)
(b) Postal address: Laboratory of Genetic Information Analysis, Center for Genetic Information Research, National Institute of Genetics, Mishima, Shizuoka 411, Japan
(c) Telephone: +81-559-75-0771 x647
(d) Telefax: +81-559-75-6040

EMBL Data Library Sequence Data Submission Form

This form solicits the information needed for a nucleotide or amino acid sequence database entry. By completing and returning it to us promptly you help us to enter your data in the database accurately and rapidly. These data will be shared among the following databases: EMBL Data Library (Heidelberg, Federal Republic of Germany); GenBank (Los Alamos, NM, U.S.A. and Mountain View, CA, U.S.A.); DNA Data Bank of Japan (DDBJ; Mishima, Japan); National Biomedical Research Foundation Protein Identification Resource (NBRF-PIR; Washington, D.C., U.S.A.); Martinsried Institute for Protein Sequence Data (MIPS; Martinsried, Federal Republic of Germany) and International Protein Information Database in Japan (JIPID; Noda, Japan).

Please answer all questions which apply to your data. If you submit 2 or more non-contiguous sequences, copy and fill out this form for each additional sequence. Please include in your submission any additional sequence data which are not reported in your manuscript but which have been reliably determined (for example, introns or flanking sequences). When submitting nucleic acid sequences containing protein coding regions, also include a translation (SEPARATELY from the nucleic acid sequence). Then send (1) this form, (2) a copy of your manuscript (if available) and (3) your sequence data (in machine-readable form) to the address shown below. Information about the various ways you can send us your data and about formats for the sequence data is given in the following two sections. Thank you.

SUBMITTING DATA TO THE EMBL DATA LIBRARY

We are happy to accept data submitted in any of the following ways:

(1) Electronic file transfer: files can be sent via computer network to DATASUBS@EMBL.EARN. This BITNET/EARN address can be reached via various gateways from Arpanet, Usenet, JANET, etc. Ask your local network expert for help or phone us. Please ensure that each line in your file is not longer than 80 characters; longer lines often get truncated when they are sent.
(2) Floppy disks: we can read Macintosh and IBM-compatible diskettes. Please use the 'save as text only' feature of your editor to save your sequence file, as otherwise we might have difficulty processing it.
(3) Magnetic tapes: 9-track only (fixed-length records preferred); 800, 1600 or 6250 bpi (any blocksize); ASCII or EBCDIC character codes; any label type or unlabelled.

Our address is:

EMBL Data Library Submissions
Postfach 10.2209
D-6900 Heidelberg
Federal Republic of Germany
Computer network: DATASUBS@EMBL.BITNET
Telefax: (+49) 6221 387 306
Telephone: (+49) 6221 387 258

When we receive your data we will assign them an accession number, which serves as a reference that permanently identifies them in the database. We will inform you what accession number your data have been given and we recommend that you cite this number when referring to these data in publications. If your manuscript has already been accepted for publication, the accession number can be included at the galley proof stage as a note added in proof. So that we can process your data and inform you of your accession number before you receive the galley proofs, please return this form to us as soon as possible. We suggest that the note added in proof should read approximately as follows: "The nucleotide sequence data reported will appear in the EMBL, GenBank and DDBJ Nucleotide Sequence Databases under the accession number ______."

A computer-readable version of this form is available on the distribution tapes of the EMBL Data Library from Release 11 onwards and on GenBank Releases 48 onwards. The BIONET National Computer Resource for Molecular Biology (Mountain View, CA, U.S.A.) also has a copy. Feel free to use the computer-readable form rather than this printed one. In this case, the form should be filled out with a text editor and sent via computer network or normal post to the address indicated above.

FORMATS FOR SUBMITTED DATA

We would appreciate receiving the sequence data in a form which conforms as closely as possible to the following standards. Each sequence should include the names of the authors. Each distinct sequence should be listed separately using the same number of bases/residues per line. The length of each sequence in bases/residues should be clearly indicated. Enumeration should begin with a "1" and continue in the direction 5' to 3' (or amino- to carboxy-terminus). Amino acid sequences should be listed using the one-letter code. Translations of protein coding regions in nucleotide sequences should be submitted in a separate computer file from the nucleotide sequences themselves. The code for representing the sequence characters should conform to the IUPAC-IUB standards, which are described in: Nucl. Acids Res. 13: 3021-3030 (1985) (for nucleic acids) and J. Biol. Chem. 243: 3557-3559 (1968) and Eur. J. Biochem. 5: 151-153 (1968) (for amino acids).

El.5/11.89
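The formatting standards above (a fixed number of bases per line, a clearly stated length, IUPAC-IUB codes, and lines no longer than 80 characters for network transfer) can be checked before a file is sent. The following Python sketch shows one way this might be done for a nucleotide sequence. The function name format_sequence and the default of 60 bases per line are assumptions made for illustration; the form does not prescribe a particular line length below the 80-character limit.

```python
# IUPAC-IUB single-letter codes for nucleic acids, including the
# codes for incompletely specified bases.
IUPAC_NUCLEOTIDES = set("ACGTURYMKSWHBVDN")


def format_sequence(sequence, per_line=60, max_width=80):
    """Return (length, lines) for a nucleotide sequence.

    Whitespace is removed, letters are upper-cased, and the sequence
    is split into lines holding the same number of bases each (the
    final line may be shorter). per_line=60 is an illustrative
    default, not a value taken from the submission form.
    """
    if not 1 <= per_line <= max_width:
        raise ValueError("per_line must be between 1 and max_width")
    seq = "".join(sequence.split()).upper()
    unknown = sorted(set(seq) - IUPAC_NUCLEOTIDES)
    if unknown:
        raise ValueError("non-IUPAC characters: " + ", ".join(unknown))
    lines = [seq[i:i + per_line] for i in range(0, len(seq), per_line)]
    return len(seq), lines


# Example: an 80-base sequence written 30 bases per line.
length, lines = format_sequence("acgt" * 20, per_line=30)
```

The returned length can be quoted alongside the sequence, and the lines can be written to the file that is submitted.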

I. GENERAL INFORMATION

Your last name ____ First name ____ Middle initials ____
Institution ____
Address ____
Computer mail address ____
Telephone ____
Telex number ____
Telefax number ____

On what medium and in what format are you sending us your sequence data? (see instructions on front page)
[ ] electronic mail
[ ] diskette: computer ____ operating system ____ editor ____
[ ] magnetic tape: record length ____ blocksize ____ label type ____
    density [ ] 800 [ ] 1600 [ ] 6250
    character code [ ] ASCII [ ] EBCDIC
[ ] printed copy (please, ONLY if it is impossible to send us machine-readable data)

II. CITATION INFORMATION

These data are [ ] published [ ] in press [ ] submitted [ ] in preparation [ ] no plans to publish
authors ____
title of paper ____
journal ____ volume ____ first-last pages ____ year ____

Do you agree that these data can be made available in the database before they appear in print?
[ ] yes
[ ] no, they should be made available only after publication (estimated date: ____)

Does the sequence which you are sending with this form include data that does not appear in the above citation?
[ ] no
[ ] yes, from position ____ to ____ [ ] base pairs OR [ ] amino acid residues
(If your sequence contains 2 or more such spans, use the feature table in section IV to indicate their positions)

If so, how should these data be cited in the database?
[ ] published [ ] in press [ ] submitted [ ] in preparation [ ] no plans to publish
authors ____
address (if different from that given in section I) ____
title of paper ____
journal ____ volume ____ first-last pages ____ year ____

List references to papers and/or database entries which report sequences overlapping with that submitted here.
first author ____ journal, vol., pages, year and/or database, accession number ____

C2J/I1.89

III. DESCRIPTION OF SEQUENCED SEGMENT

Wherever possible, please use standard nomenclature or conventions. If a question is not applicable to your sequence, answer by writing N.A.; if the information is relevant but not available, write a question mark (?).

What kind of molecule did you sequence? (check all boxes which apply)
[ ] genomic DNA [ ] genomic RNA [ ] virus [ ] provirus
[ ] cDNA to mRNA [ ] cDNA to genomic RNA
[ ] organelle DNA [ ] organelle RNA please specify organelle ____
[ ] tRNA [ ] rRNA [ ] snRNA [ ] scRNA
[ ] other nucleic acid (please specify) ____
[ ] peptide: [ ] sequence assembled by [ ] overlap of sequenced fragments [ ] homology with related sequence [ ] other (please specify) ____
[ ] partial: [ ] N-terminal or [ ] C-terminal or [ ] internal fragment

length of sequence ____ [ ] base pairs or [ ] amino acid residues
gene name(s) (e.g., lacZ) ____
gene product name(s) (e.g., beta-D-galactosidase) ____
Enzyme Commission number (e.g., EC 3.2.1.23) ____
gene product subunit structure (e.g., hemoglobin

The following items refer to the original source of the molecule you have sequenced.
organism (species) name (e.g., Escherichia coli; Mus musculus) ____
sub-species ____
strain (e.g., K12; BALB/c) ____
name/number of individual or isolate (e.g., patient 123; influenza virus A/PR/8#4) ____
developmental stage ____ [ ] germ line [ ] rearranged
haplotype ____
tissue type ____
cell type ____

The following items refer to the immediate experimental source of the submitted sequence.
name of cell line (e.g., HeLa; 3T3-L1) ____
library (type; name) ____
clone(s) ____

The following items refer to the position of the submitted sequence in the genome.
chromosome (or segment) name/number ____
map position ____ units: [ ] genome % or [ ] nucleotide number or [ ] other ____

Using single words or short phrases, describe the properties of the sequence in terms of: its associated phenotype(s); the biological/enzymatic activity of its product; the general functional classification of the gene and/or gene product; macromolecules to which the gene product can bind (e.g., DNA, calcium, other proteins); subcellular localization of the gene product; any other relevant information.
Example (for viral erbB nucleotide sequence): transforming capacity; EGF receptor-related; tyrosine kinase; oncogene; transmembrane protein.

C3.1/2.88

IV. FEATURES OF THE SEQUENCE

Please list below the types and locations of all significant features experimentally identified within the sequence. Be sure that your sequence is numbered beginning with "1."

In the column marked    fill in
feature    type of feature (see information below)
from       number of first base/amino acid in the feature
to         number of last base/amino acid in the feature
bp         x, if your numbers refer to positions of base pairs in a nucleotide sequence
aa         x, if your numbers refer to positions of amino acid residues in a peptide sequence
id         method by which the feature was identified: E = experimentally; S = by similarity with known sequence or to an established consensus sequence; P = by similarity to some other pattern, such as an open reading frame
comp       x, if feature is located on the nucleic acid strand complementary to that reported here

Significant features include:
regulatory signals (e.g., promoters, attenuators, enhancers)
transcribed regions (e.g., mRNA, rRNA, tRNA)
regions subject to post-transcriptional modification (e.g., introns, modified bases)
translated regions (indicate reading frame if start and stop codons are not present)
extent of signal peptide, prepropeptide, propeptide, mature peptide
regions subject to post-translational modification (e.g., glycosylated or phosphorylated sites)
other domains/sites of interest (e.g., extracellular domain, DNA-binding domain, active site, inhibitory site)
sites involved in bonding (disulfide, thiolester, intrachain, interchain)
regions of protein secondary structure (e.g., alpha helix or beta sheet)
conflicts with sequence data reported by other authors
variations and polymorphisms

The first 2 lines of the table are filled in with examples. If you think you will need more space than the table below provides, please photocopy this page before you fill it out.

Numbering for features on the sequence submitted here [ ] matches paper [ ] does not match paper

feature             from   to     bp   aa   id   comp
EXAMPLE TATA box    1      8
EXAMPLE exon 1      9      264

C4.1/2.88
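Before the feature table is filled in, the positions entered in the "from" and "to" columns can be checked against the sequence, together with the kind of inconsistency mentioned in the article (a stop codon in the middle of a region that is supposed to be coding). The Python sketch below is one possible check, not a description of how the Data Library processes submissions. The function names are hypothetical, and it assumes that positions are 1-based and inclusive, that the region is read in frame from its first base, and that the standard genetic code applies.

```python
# Stop codons of the standard genetic code (an assumption: other
# codes, e.g. mitochondrial, use different stop codons).
STOP_CODONS = {"TAA", "TAG", "TGA"}


def feature_in_range(seq_length, start, end):
    """True if a feature with 1-based inclusive positions fits the sequence."""
    return 1 <= start <= end <= seq_length


def internal_stops(sequence, start, end):
    """Return 1-based positions of in-frame stop codons inside a region.

    The region runs from start to end (1-based, inclusive) and is read
    in frame from its first base. The final complete codon is skipped,
    since a stop codon is expected there.
    """
    region = sequence[start - 1:end].upper().replace("U", "T")
    n_codons = len(region) // 3
    positions = []
    for i in range(n_codons - 1):
        codon = region[3 * i:3 * i + 3]
        if codon in STOP_CODONS:
            positions.append(start + 3 * i)
    return positions


# Example: the third codon (bases 7-9) is a premature stop.
example = "ATGAAATAGCCCTAA"
premature = internal_stops(example, 1, len(example))
```

An empty list from internal_stops means no in-frame stop codon was found before the final codon of the region.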