Public Repositories Tutorial: Bulk Downloads

Almost all of the public databases, genome browsers, and other tools you have explored so far offer some way to rapidly download all, or large chunks, of their raw data. This access is usually provided via two routes: manual downloads and programmatic downloads. This tutorial covers manual downloads of bulk data from public repositories, which are performed with a Web browser such as Firefox, IE, Chrome or Safari. Because programmatic downloads require some computer programming experience, we will mention them where appropriate in this tutorial, but will not provide worked examples.

Why would you ever want to do bulk downloads? Sometimes you will need to gather many records, and rather than collect everything manually, one piece at a time, you can save a great deal of time by first downloading all or a large chunk of the raw data, and then sorting out what you actually need. Once you have the data in hand, there are a number of ways to manipulate it and extract the pieces that you want. These include using a program that reads and displays tabulated data (e.g., Microsoft Excel), using Unix commands, loading the data into a database and running SQL queries (e.g., MySQL), or writing your own computer programs to parse and manipulate the data (e.g., in Perl or Python).

Some background information about data formats

Raw data that are represented as one or more files (i.e., not in a relational database such as MySQL or Oracle) are commonly referred to as flat file data. The bioinformatics data in these files can be represented in many different formats. The most common bioinformatics flat file formats contain text characters (and are therefore viewable with a text viewer such as Apple TextEdit or Microsoft Word) and have a relatively simple and regular internal structure.
These (with some of their file extensions) include the following:

- FASTA (*fasta, *fa, *ffa, *fna)
- Genbank and GenPept (*gb, *gbk, *gp)
- Sequence alignment (*aln, *fa, *ffa)
- Tabulated (*txt, *tsv, *csv)

Tabulated data deserves special consideration since it is not as well-defined a format as the others. Tabulated data can have any number of columns, which may or may not be carefully defined. Always check the file format specifications, examine the data in the files in detail, and/or ask the authors before working with these files. In addition, before attempting to use Microsoft Excel to manipulate tabulated data, realize that Excel has built-in size limits (256 columns and 65,536 rows in older versions; 16,384 columns and 1,048,576 rows in Excel 2007 and later) and will occasionally re-interpret cell content. For example, the human gene symbol DEC1 is usually converted to a date representation of December 1, such as 1-Dec, and a leading = is interpreted as the start of a formula.

Raw data can also be represented in files with a more complex internal structure, which is usually hierarchical. There are two common hierarchical formats that are used to represent bioinformatics data: ASN.1 and XML. ASN.1 predates XML, and is used almost exclusively by NCBI. XML is a relatively common format in bioinformatics, but is similar to tabulated data in that knowing a file is XML tells you nothing about what kind of data, or how much of it, the file contains. (The internal structure of an XML file is defined in an additional document called a DTD, but this usually tells you nothing about what the data represent and how much of it there is.) Like the simpler formats listed above, ASN.1 and XML contain text character data, but they were designed to be generated and read by computer programs; they are generally very difficult for our eyes to read and our brains to make sense of.

Several years ago, there was a lot of hype about XML in bioinformatics circles. Ignore the residual hype. There is nothing inherently magic about XML (in fact, it looks a lot like ASN.1), and since it tends to inflate file sizes two- to three-fold, it should be avoided if there is another, more compact format available that contains all the data you need (e.g., tabulated or FASTA).

Here are some brief comments about three additional file formats that you might come across when dealing with bulk bioinformatics data.

Archived and/or compressed text data files (file extensions *gz, *tgz, *gzip, *zip, *tar) are common when data files are large or numerous, because they take up less disk storage space and download faster. Compressed files generally have to be uncompressed before they are viewed or manipulated, although some programs read compressed files directly and uncompress them on the fly.

Sanger sequencing data is output from sequencing machines as sequence chromatograms or trace files (with file extensions *ab1, *scf).
It is rarely necessary to deal with trace files directly, unless you are using an existing software package to display the traces (to look for evidence of SNPs or mutations, for example) or to perform relatively sophisticated sequence mapping or assembly tasks (with software such as Phred, Phrap or Consed).

Binary data simply means any non-text character data, such as image data or proprietary format data. Like trace data, you will rarely need to deal with this sort of data, unless you are doing some highly specialized task or are dealing with data output directly from a piece of lab equipment. The format of binary data is never clear from visual inspection (using a text editor, for example). Therefore, either a special program written by someone familiar with the format, or a clear and detailed specification, is usually needed.

Worked example #1: NCBI Entrez Gene FTP site

In this worked example, you will download two compressed tabulated files from the NCBI Entrez Gene FTP site: one with information about each human gene, and the other with GO annotations.

1. Open a Web browser on an Internet-connected computer.
2. Go to NCBI's listing of their FTP sites by entering the following URL into your browser: <http://www.ncbi.nlm.nih.gov/ftp/>
3. Click on the link to Gene.

4. Click on the file named README. Readme files are common on File Transfer Protocol (FTP) sites, and usually contain important information about file contents and structure.
5. Scroll down to the description of a file called gene_info, about halfway down the page. Note that the file is tabulated (tab-delimited), and that essential details are given about what each row represents (a gene) and what the different columns contain (different pieces of information about each gene).
6. Click the BACK button on your browser.
7. Click on the directory named DATA, followed by GENE_INFO and Mammalia.
8. Click on the file named Homo_sapiens.gene_info.gz to download the file containing information on only human genes. You may be asked to confirm the download, and you should. The *gz extension indicates that the file is compressed using the GNU Zip (gzip) program.
9. Find your local copy of the downloaded file and double-click on it. Your computer should recognize that it is compressed and open a decompression or archive program (which one will vary depending on your operating system). If it doesn't, you'll have to find or install a program that is able to decompress GNU Zip files. If you were successful, a decompressed tabulated file called Homo_sapiens.gene_info should have been created.
10. Launch Microsoft Excel and open the decompressed file. In order to load the file, you may have to set "Enable" to "All Documents" in the Open file dialog.
11. The Excel Text Import Wizard will open (this is where you can specify parser settings, such as the delimiting character). For this example, it is safe to use the default settings and click the Finish button.
12. Scroll to the right of the spreadsheet to verify that columns A-O were loaded. Scroll to the bottom to confirm that all 39,856 rows in the file were loaded. Notice that the first column is filled only with 9606. This is the Taxonomy ID for human.
If we had downloaded the gene_info file for all organisms (well, all organisms in which gene records have been defined), this column would contain many different IDs.
13. Go back to <ftp://ftp.ncbi.nlm.nih.gov/gene/data/>.
14. Click on the file named gene2go.gz to download it as you did for the file above.
15. As before, find your local copy of this file and double-click it to decompress. A file named gene2go should have been created.
16. Also as before, launch Excel and open the file gene2go.
17. Since this file contains over a million lines, Excel will not be able to load it completely and will indicate this with a dialog stating: File not loaded completely. Click OK on this dialog. Notice that there are multiple entries for each Gene ID, indicating that each row represents a GO annotation-gene combination.
18. Close this file.

Remember that Excel can load only a limited number of rows (65,536 in older versions). If you really need to deal with files of this size, the only ways to do it are with special software that can handle many rows of data, with Unix commands, or by doing your own computer programming.

If you would like to find out more about Unix, go to <http://www.trii.org/courses/unix.html> and download the course materials. Examples of useful Unix commands and programs that can manipulate tabulated data, and which bioinformaticians use every day, include grep (filter rows), cut (filter columns), paste (paste columns), cat (paste rows), sort (sort rows), uniq (filter unique rows), join (intersect tables), sed (replace text), wc (count rows), head (extract top of table), tail (extract bottom of table), more and less (browse text) and vi (edit text). Unix I/O piping and redirection are also exceedingly useful.

Advanced homework (worth one cookie): use a terminal and Unix commands to add the GO annotations in gene2go to the human gene information in Homo_sapiens.gene_info. Hint: open a Unix terminal, navigate to the directory where the gene2go and Homo_sapiens.gene_info files are located, use grep to select the rows in gene2go where the first column contains 9606 (i.e., human genes), cut the first column off, sort both the filtered gene2go file and Homo_sapiens.gene_info by Gene ID, and then join the two sorted files. Extra credit (worth an extra cookie): when doing the previous join, retain the genes in Homo_sapiens.gene_info that don't appear in gene2go (i.e., the genes without GO annotations).

If you would like to find out more about computer programming languages, there are many languages to choose from and many good books available. For bioinformatics applications, we recommend Python or Perl, both established and powerful languages that are relatively easy to learn.

If you already know how to program, and would like programmatic access to NCBI data in the Entrez suite of databases (such as Entrez Gene), NCBI offers the e-utilities web services. Both the Bioperl module and the Biopython package have data structures, parsers and query functions that handle Entrez data. More information about the e-utilities is available at <http://www.ncbi.nlm.nih.gov/bookshelf/br.fcgi?book=coursework&part=eutils>.

Super advanced homework (worth three cookies): write a Perl or Python program that takes a database name, report type and Entrez query string, connects to the NCBI e-utilities server, and performs the given query on the given database. Hint #1: use the Bioperl module or Biopython package.
Hint #2: Python code that does just this is located at <http://trii.org/courses/public_repositories/get_ncbi_record.py>. You can look at the source code using any text editor, but in order to run the code, you'll have to use a command terminal and make sure you have Python and Biopython installed.

Worked example #2: Ensembl human genome FTP site

In this worked example, you will download a single FASTA file containing all human protein sequences from the Ensembl FTP site.

1. Open a browser and go to <http://www.ensembl.org/info/downloads/>.
2. Note that there are many methods for doing bulk data downloads from the Ensembl site. Scroll down to the section entitled FTP, and click on the link "Table of links to Ensembl FTP files".
3. Notice that the sentence near the top of the page indicates that all data has been compressed using the GNU Zip program (file extension *gz).
4. Scroll down to the table of species and FTP links at the bottom of the page, specifically to the row marked Homo sapiens (human), and click on the FTP link under the Peptides column. This will take you to an FTP directory.

5. Click on the README file, and note the information about Ensembl file naming conventions, and that all files are in FASTA format (*fa) and compressed with GNU Zip (*gz).
6. Click the browser's BACK button.
7. For this demonstration, we want peptide translations with some additional evidence, and not just gene predictions on the genomic assembly alone, so click on the file named Homo_sapiens.NCBI36.50.pep.all.fa.gz in order to download this file to your local computer.
8. As in the previous worked example, find your local copy of the downloaded file and double-click on it. Your computer should recognize that it is compressed and open a decompression or archive program. If it doesn't, you'll have to find or install a program that is able to decompress GNU Zip files. If you were successful, a decompressed FASTA file called Homo_sapiens.NCBI36.50.pep.all.fa should have been created.
9. Launch a text editor such as Apple TextEdit or Microsoft Word, and then open this downloaded and uncompressed file. As expected, the protein sequence data is represented in the FASTA format.

What is FASTA format, and why is it everywhere I look?

FASTA format is a compact way to represent sequence data that is also easy for humans and computers to scan through. It is named for the FASTA alignment program (predecessor to BLAST), which originally used this file format. It is a very simple format with two types of lines: definition lines (or deflines) and sequence lines. A defline always begins with a > symbol and is followed by some limited information, such as, in this example, the unique peptide identifier; peptide type; chromosome build, number, start and stop coordinates and strand; gene ID; and transcript ID. The unique ID is usually the first piece of information given on a defline. Any line that is not a defline (i.e., doesn't begin with >) is sequence data.
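
To make the two line types concrete, here is a minimal Python sketch of a FASTA reader. It is only an illustration under stated assumptions: the function names read_fasta and record_id are hypothetical, the two example records are made up (they are not real Ensembl entries), and the reader assumes the first whitespace-separated token of each defline is the unique ID, as described above.

```python
from typing import Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (defline, sequence) pairs from an iterable of FASTA text lines.

    Hypothetical helper for illustration; the defline is returned
    without its leading '>' and the sequence lines are concatenated.
    """
    defline = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new defline closes the previous record, if there was one.
            if defline is not None:
                yield defline, "".join(chunks)
            defline = line[1:]
            chunks = []
        elif defline is not None:
            # Any non-defline line is sequence data for the current record.
            chunks.append(line)
    if defline is not None:
        yield defline, "".join(chunks)


def record_id(defline: str) -> str:
    """Return the first whitespace-separated token of a non-empty defline."""
    return defline.split()[0]


# Made-up records in the general shape described in the text.
example = """>PEP0001 pep:known gene:GENE0001 transcript:TRANS0001
MKTAYIAKQR
QISFVKSHFS
>PEP0002 pep:novel gene:GENE0002 transcript:TRANS0002
MLLAVLYCLA
"""

records = list(read_fasta(example.splitlines()))
for defline, sequence in records:
    print(record_id(defline), len(sequence))
```

Because read_fasta accepts any iterable of lines, it should also work on an open file handle for the uncompressed download; for real work, the parsers in Biopython or Bioperl mentioned earlier are a better choice than a hand-written reader.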
FASTA is such a popular and durable format that it even has its own Wikipedia page at <http://en.wikipedia.org/wiki/fasta_format>. Because so many programs read and generate files in this format, get used to it; it is here to stay.

Advanced homework (worth two cookies): build a local BLAST database from the FASTA file containing human proteins, which you just downloaded and uncompressed, and then retrieve one of the NCBI protein isoform sequences for APP and run it against the database. Hints: if you don't already have NCBI's blastall and formatdb programs installed, go to <http://www.ncbi.nlm.nih.gov/blast/download.shtml> and download the BLAST program suite appropriate for your computer. You'll then have to build a BLAST database with the included program formatdb. Documentation on formatdb is available at <http://www.ncbi.nlm.nih.gov/blast/docs/formatdb.html> and on blastall at <http://www.ncbi.nlm.nih.gov/blast/docs/blastall.html>.

Worked example #3: Ensembl BioMart tool

In this worked example, you will use the Ensembl BioMart tool to download the 3' UTRs for all human transcripts for which there is a Drosophila gene ortholog (using protein sequence homology as a surrogate), and do a spot check to confirm that human 3' UTR sequence was downloaded.

1. Open a browser and go to <http://www.ensembl.org/info/downloads/>.
2. Scroll down to the section labeled BioMart, and click on the link "BioMart data mining tool".
3. Under the menu labeled CHOOSE DATABASE, select Ensembl 50, and under CHOOSE DATASET, select Homo sapiens genes (NCBI36).
4. Click on the Filters link on the left sidebar, expand the MULTI SPECIES COMPARISONS section by clicking on the + icon to its immediate left, click the checkbox labeled Homolog filters, and select Orthologous Drosophila Genes.
5. Click on the Attributes link on the left sidebar, select the radio button labeled Sequences, expand the SEQUENCES section, and select the radio button labeled 3' UTR.
6. Scroll down and expand the HEADER INFORMATION section in order to verify that Ensembl Gene ID and Ensembl Transcript ID are checked.
7. Click on the Count tab, and notice that the gene record count next to the Dataset label on the left sidebar appears, or is updated.
8. Click on the Results tab, and notice that your results appear in the main tool panel, and that they are in FASTA format.
9. Select "Compressed web file (notify by email)", enter your email address, and click on the Go button to cause a gzipped FASTA-formatted file to be created on the Ensembl website. You will get an email shortly with a URL link to the file.
10. When you get the email (after a couple of minutes, at least), click on the link and download the file.
11. As before, find the local downloaded file and double-click it to uncompress.
12. For a spot check of the data, open this downloaded and uncompressed file with a text editor, and highlight and copy one of the FASTA records.
13. Go to the UCSC Genome Browser at <http://genome.ucsc.edu> and click on the BLAT link on the header bar.
14. Confirm that the Genome and Assembly are Human and Mar. 2006, respectively, paste the FASTA record into the text box, and click the "I'm feeling lucky" button.
15. Zoom out 3X in order to confirm that the sequence aligns very well with the 3' UTR of a gene.

Just as the e-utilities provide programmatic access to NCBI Entrez databases, programmatic access to the Ensembl database is provided through a DAS server. DAS stands for Distributed Sequence Annotation System and is a community standard for a web service that represents genomic annotations on a reference genome. All annotations are indexed with exact start and stop positions on the reference genome. There is more information about the Ensembl DAS server instance at <http://www.ensembl.org/info/using/external_data/das/ensembl_das.html>, and about the DAS specification at <http://biodas.org/documents/spec.html>.

Advanced homework (worth half a cookie): use Ensembl BioMart to retrieve the HGNC gene symbols, chromosome, start and end positions, and Affymetrix U95 expression annotations for all genes on Chromosome 1.

Worked example #4: UCSC Genome Browser Table downloads

In this worked example, you will download UCSC Genome Browser gene annotations from a particular region of the human genome.

1. Open a browser and go to <http://genome.ucsc.edu/cgibin/hgtables?org=human&db=hg18>.
2. Copy and paste the following coordinates into the text field labeled position/search: chr21:26,074,733-26,965,003. After clicking on the jump button, you should see exon, intron, UTR and coding annotation tracks for the APP and CYYR1 genes at the top of the image. The browser has control buttons for zooming, panning and display of data tracks. In the language of the UCSC Genome Browser, a track is a collection of related data, each datum of which is position-indexed by chromosome, and start and stop nucleotide.
3. In order to download tables of data, click the Tables link on the blue header bar. On the new page, note that the region we were viewing is now preset in the position field, and confirm that clade=vertebrate, genome=Human and assembly=Mar. 2006.
4. Make the following selections in the drop-down menus: group=Gene and Prediction Tracks, track=UCSC Genes, table=knowngene, and output format=selected fields from primary and related tables. Click on the get output button.
5. In the hg18.knowngene section, click on the check all button. In the Linked Tables section, check the box by kgxref. Click "Allow Selection From Checked Tables" at the bottom.
6. From the hg18.kgxref section, check the following boxes: genesymbol, refseq, protacc, and description. Scroll back up to the hg18.knowngene section, and click the get output button.
7. On the tabulated output, notice that all of the rows contain genomic coordinates (this will be true of any table downloaded from UCSC), and that, as expected, isoforms from two genes are represented: amyloid beta A4, and cysteine and tyrosine-rich 1.
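
The position string pasted in step 2 packs a chromosome name and the start and stop coordinates into one field, and every row of the downloaded table carries the same three pieces of information. As a rough Python sketch of how such a string could be split and used to test whether a row falls inside the region, consider the following. The names Region, parse_position and overlaps are hypothetical, and the sketch treats both ends as inclusive; it does not account for the coordinate conventions of any particular UCSC table, so check the table documentation before comparing real rows.

```python
from typing import NamedTuple


class Region(NamedTuple):
    """A genomic interval: chromosome name plus start and end coordinates."""
    chrom: str
    start: int
    end: int


def parse_position(text: str) -> Region:
    """Split a string such as 'chr21:26,074,733-26,965,003' into a Region.

    Hypothetical helper: commas used as thousands separators are removed,
    and a ValueError is raised if the string is malformed.
    """
    chrom, _, span = text.strip().partition(":")
    start_text, _, end_text = span.partition("-")
    start = int(start_text.replace(",", ""))
    end = int(end_text.replace(",", ""))
    if not chrom or start > end:
        raise ValueError("not a valid position string: %r" % text)
    return Region(chrom, start, end)


def overlaps(region: Region, chrom: str, start: int, end: int) -> bool:
    """Return True if the interval chrom:start-end overlaps the region.

    Both intervals are treated as inclusive at each end (an assumption).
    """
    return chrom == region.chrom and start <= region.end and end >= region.start


region = parse_position("chr21:26,074,733-26,965,003")
print(region.chrom, region.start, region.end)
```

A check like overlaps(region, chrom, start, end), applied to the coordinate columns of each downloaded row, is one way to confirm that nothing outside the requested region slipped into the output.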
The UCSC Genome Browser offers many genomes other than human, and has many other functions that can be explored, including table filtering and DNA sequence downloads (in FASTA format, of course). You will be able to find more information at <http://genome.ucsc.edu/training.html>. Unlike NCBI Entrez and Ensembl, there is no programmatic interface for the UCSC Genome Browser, but if you know how to create relational databases, you can recreate their table structure locally and use SQL to run queries.

Advanced homework (worth half a cookie): for all genes on human chromosome X, use UCSC Genome Browser Tables to grab the name, chromosome, strand, start, end, HGNC gene symbol, gene description, and Ensembl ID. Hint: you'll have to select data from an additional table.

The importance of checking bulk downloads

When downloading data in bulk, it is always prudent to remember the old carpenter's adage: measure twice, cut once. You will save a lot of time by performing several spot checks of the data, using a different database and method (if possible), to confirm that you actually got what you expected to get. For spot-checking sequence data, use a text editor, and then BLAST <http://blast.ncbi.nlm.nih.gov/blast.cgi> or BLAT <http://genome.ucsc.edu/cgibin/hgblat?command=start> against an appropriate database. To spot-check tabulated data, use a text editor, Excel, and Unix commands (to do things like count the number of rows), if you know them.
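
For readers who prefer a script to Unix commands, the kind of row counting mentioned above can be sketched in a few lines of Python. This is a minimal illustration, not a definitive check: the function name summarize_table is hypothetical, the three example rows are made up, and the sketch assumes a tab-delimited file with no quoting. It counts rows, tallies how many columns each row has (a quick way to spot truncated or malformed rows), and tallies the values in the first column (for example, to confirm that only the expected Taxonomy ID is present).

```python
import csv
import io
from collections import Counter


def summarize_table(handle, delimiter="\t"):
    """Return (row_count, column_width_counts, first_column_counts).

    Hypothetical helper for spot-checking a delimited text file.
    Blank lines are skipped; quoting is disabled so that quote
    characters inside fields are kept as ordinary text.
    """
    row_count = 0
    widths = Counter()
    first_column = Counter()
    reader = csv.reader(handle, delimiter=delimiter, quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row:
            continue
        row_count += 1
        widths[len(row)] += 1
        first_column[row[0]] += 1
    return row_count, widths, first_column


# Made-up rows: two complete rows and one that is missing a column.
example = io.StringIO(
    "9606\t1\tGENE_A\n"
    "9606\t2\tGENE_B\n"
    "9606\t3\n"
)

rows, widths, first = summarize_table(example)
print("rows:", rows)
print("column counts:", dict(widths))
print("first-column values:", dict(first))
```

To run the same check on a downloaded file, pass an open file handle instead of the in-memory example. If more than one column width shows up, or the row count differs from what the source site reports, that is a cue to look more closely before using the data.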