Colorado State University
Bioinformatics Algorithms
Assignment 6: Analysis of High-Throughput Biological Data
Hamidreza Chitsaz, Ali Sharifi-Zarchi

Although it is a little long, this is an easy exercise: you only need to perform the analysis step by step as instructed. You are not required to develop algorithms or write programs, but at each step think about: What are we doing at this step? Why is it required?

Be careful about timing. For each mission you submit a series of steps as a batch, which takes a few minutes, but it may take a few hours or a few days for all of the jobs to be processed.

The exercise comes in two phases. In the Warm-Up phase you learn the basic procedures on a toy dataset. For this phase, prepare a one-page report describing the logic of the major steps, what you learned, and what challenges or questions you encountered.

During the Challenge phase you will analyze real RNA-seq data. Our goal is to compare the gene expression profiles of breast cancer and normal tissue, find differentially expressed genes, and see which biological processes are altered in breast cancer. As the result of this phase, prepare a report of one or two pages of text plus several figures and tables, written the way you would present results in a scientific manuscript. Focus on the most important findings and try to discuss them scientifically. State any hypotheses you have formed, and describe the challenges in analyzing the data.

Please don't hesitate to ask any questions via email. Thank you!

Phase 1: Warming-up

Mission 1: Importing data

1. Open https://usegalaxy.org/
2. From the upper menu, select User, Register, and then register for a new account.
3. An email will be sent to your address; click the activation link in the email.
4. Login to Galaxy from the User menu, then Login. (Just after registration you are logged in automatically, so you only need this step on future visits.)
5. Now we need to import a small RNA-seq dataset for the analysis. Open this URL in a separate tab of the same browser in which you are viewing Galaxy: https://usegalaxy.org/u/jeremy/p/galaxy-rna-seq-analysis-exercise
6. Press the green (+) buttons of the first 4 datasets one by one, in order. This imports 4 RNA-seq datasets into Galaxy, two from human adrenal and two from brain. After pressing each (+) button, wait for the new page to load and then return to the previous page.

7. Go back to the Galaxy tab in your web browser and press the refresh button above the right panel (History). You should now see the 4 imported datasets in the history.
8. Press the View Data button (eye icon) of every dataset to see the FASTQ contents, including the read sequences and quality values.

9. Time to check the quality of the data! From the left panel (Tools), click NGS: QC and manipulation, then FastQC: Read QC, select the 1st dataset (1: imported: adrenal_1.fastq), and press Execute.
10. Do the same for datasets 2, 3, and 4, in the same order.
11. You should have 4 new records in the right panel (History). Wait until they have been processed and their color turns green.
12. Click the View Data button of each FastQC result to view the quality check results. Each one is a long report that shows different aspects of the quality of your data. Scroll through the page or click on each item in the summary to view the detailed report.

13. Try to understand the items in the quality reports by searching the web or by any other means!
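The quality values seen in the FASTQ files (step 8) and summarized by FastQC are Phred scores stored as ASCII characters. The following R sketch shows how one quality string could be decoded, assuming the Sanger / Illumina 1.8+ encoding (offset 33). The record and the function name phred_scores are made up for illustration and are not taken from the exercise data.

```r
# Decode a FASTQ quality string into Phred scores.
# Assumption: Sanger / Illumina 1.8+ encoding (ASCII offset 33).
phred_scores <- function(quality_string, offset = 33) {
  utf8ToInt(quality_string) - offset
}

# Hypothetical 4-line FASTQ record: header, sequence, separator, qualities.
record <- c("@read_1", "GATTACA", "+", "II5!#?I")

q <- phred_scores(record[4])
print(q)

# Error probability implied by each score: p = 10^(-Q/10).
error_prob <- 10^(-q / 10)
print(signif(error_prob, 2))
```

A score of 20 corresponds to a 1% chance that the base call is wrong, and a score of 40 to a 0.01% chance.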

Mission 2: Alignment & Visualization

Now it's time to align the read sequences to the reference genome, measure the expression values of the genes, and visualize them.

1. Login to the Galaxy.
2. From the Tools menu select NGS: RNA-seq, then select Tophat2. This tool aligns the short RNA-seq reads to the reference genome and finds the location to which each read belongs. (For aligning DNA short reads to the reference genome we use different tools, such as Bowtie2 or BWA.)
3. The data we are using are paired-end (two reads, one from each end of every RNA fragment, which is first converted to cDNA). Hence select Paired-end as the first argument to Tophat2. Then select the two FASTQ files belonging to the same tissue (e.g. brain). For paired-end data, Tophat2 needs an estimate of the expected distance between the two reads of a pair. For this purpose we set 110 (bp) as the Mean Inner Distance between Mate Pairs; this value comes from the RNA-seq library preparation experiment. We also need to indicate the organism to which the reads belong, so we select Human hg19 Canonical Female, since the RNA-seq sample belongs to a woman and most human studies currently use hg19 as the reference assembly. Finally press Execute, and do the same for the FASTQ files of the other tissue (e.g. adrenal).

4. Wait until the Tophat2 results are ready (you can see this in the History panel).
5. Each run of Tophat2 produces several results, including a summary, insertions, deletions, splice junctions, and accepted hits (aligned reads), which are shown as separate outputs in the History panel. View the results one by one using the View Data buttons and try to understand them.
6. Pressing the View Data button on accepted hits downloads the aligned reads as a file with the extension .bam. Save it in a suitable folder; we will need it later. This file is compressed and cannot be viewed directly, so we need to convert it to an uncompressed format called a .sam file.
7. For this purpose, go to NGS: SAM Tools in the Tools menu and select BAM-to-SAM. In each run, select one of your accepted hits outputs and execute.
8. When the results of the last step are ready, view them using the View Data button and try to understand them.

8. Run IGV. From the upper left box, select Human hg19 and open the genome. If hg19 is not listed there, select More and then choose hg19.
9. From the Tools menu, select Run igvtools; a new window will appear. Select Count as the command, press the Browse button for the input file, and select the .bam file that was downloaded for the accepted hits (in step 6 above). The output file name is set automatically to the input file name with the extension .bam.tdf. Change the extension to .tdf. You can give the files better names (e.g. Brain.tdf and Adrenal.tdf). Select the maximum Zoom level (10) and press Run. Wait until it is finished (the progress is shown in the Message area below the Run button). Do the same for the other downloaded .bam file, then close the igvtools window.

10. In the main menu of IGV, go to File, then Load from File, and open both .tdf files you have just created. In the location box, enter chr19:3,000,000-3,500,000 to focus on this region. In the left panel, right-click on the names of the .tdf files and select Log scale. Identify a gene that is differentially expressed between Brain and Adrenal.

Mission 3: Differential Expression Analysis

Time to perform differential expression analysis. This time we will submit several steps together, without waiting for the results of each step. After a few hours, all of the results will be ready (hopefully!).

1. Login to the Galaxy.
2. From the Tools menu select NGS: RNA-seq, then select Cufflinks. This tool measures the expression level of each region (gene, isoform, etc.). As the first option (SAM or BAM file of aligned RNA-Seq reads) select the Tophat2 accepted hits on data 1, 2, leave the other parameters unchanged, and press Execute. Do the same for the accepted hits of data 3, 4.

3. While the requests of the previous step are waiting to be processed, run the following steps (don't wait for the previous results to be ready). From NGS: RNA-seq, select Cuffmerge. This merges the results of the Brain and Adrenal tissues, which is required for the next step. As the first parameter (GTF file produced by Cufflinks) select the Cufflinks results for one of your datasets (e.g. Adrenal, though it is not ready yet). Then press Add new Additional GTF Input Files; a new (GTF file produced by Cufflinks) parameter will appear, where you can select the Cufflinks results for the other dataset (e.g. Brain; it does not matter whether you select Adrenal or Brain as the first GTF file). Finally press Execute.
4. Now select Cuffdiff from the NGS: RNA-seq menu to find the differentially expressed genes. For the first option (Transcripts) select the Cuffmerge results. In the text box for the Name of Condition 1, enter Adrenal, and select the Tophat2 accepted hits on the Adrenal data as the Replicate 1 option. Name Condition 2 Brain and select the Tophat2 accepted hits on the Brain data as the Replicate 1 data of Condition 2 (see the image below carefully). Then press Execute.

5. Be patient until all results are ready.

Phase 2: The Challenge!

Mission 1: Looking for data

1. Open ArrayExpress: http://www.ebi.ac.uk/arrayexpress. In the search text box, search for: breast cancer normal
2. There will be around 503 results, so we need to narrow them down. In the Filter experiments menu, select Homo sapiens as Organism, RNA assay as experiment assay, and Sequencing assay as technique, and press the Filter button.
4. Find the experiment titled mRNA-sequencing of breast cancer subtypes and normal tissue. It looks like a good experiment. Click on its Accession number (E-GEOD-52194) to see the experiment. In the Samples section there is a link: Click for detailed sample information and links to data. Click on it to see the samples.

5. On the following page you will see rows 1-25 of 117 (samples). Set the page size to 250 to see all samples. The tumor type column shows whether each sample is a type of breast cancer tumor or normal tissue. The FASTQ column contains the links for downloading the FASTQ files of each sample, and the processed column contains the processed data (gene expression levels) for each sample.
6. Below the table there is a link: Download Samples and Data table in Tab-delimited format. Right-click on it and select Save target as or Save linked file as (the wording depends on your web browser). Save the data table on your computer and open it with Microsoft Excel or other spreadsheet software. You might need to rename the file to .xls before opening it with Excel.
7. The column Characteristics [tumor type] in the Excel file again shows whether each sample is normal tissue or breast cancer. Randomly select one normal sample and one tumor sample. There are different tumor subtypes; it does not matter which of them you select. But please make your selection at random, and do not pick the same set that you know your friends have selected.

8. In the column Comment [ENA_EXPERIMENT] you can find the accession numbers of your selected samples (SRX followed by a 6-digit number). Note the ENA numbers of the samples you have selected.
9. Login to the Galaxy. Above the History panel there is a History options button with a gear icon. Press it and select Create New to make a new history, which is something like a new project. The initial name will be Unnamed history. Click on it and rename it to Breast Cancer RNA-Seq Analysis.
10. In the Tools menu, select Get Data and then EBI SRA. Enter the ENA accession number of the Normal sample you have selected and press Search.

11. You will see a window with brief information about the experiment, including the model of the sequencing platform (Illumina HiSeq 2000) and the Library Layout, which is PAIRED, meaning that you have paired reads for each RNA fragment.
12. Below it you will find another table showing individual samples. You should see a single record, with the same ENA number you searched for in the Experiment accession column. If you wanted to download the FASTQ files you would press the links in the Fastq files (ftp) column. However, we need to analyze them with Galaxy, so locate the Fastq files (galaxy) column. You should see two links, one for each read of the pair (if you see fewer or more files, choose a different sample). Press the first link, File 1. This adds a new job to the History panel of Galaxy, which downloads the first FASTQ file.
13. Press the Back button of your web browser once and wait a few seconds, or go to EBI SRA again and search for the ENA accession number. This time select File 2; a new job will be added to the History accordingly.

14. In the History panel, each job has an Edit attributes button with a pencil icon. Click on it and set the names of your data jobs to Normal Breast 1 and Normal Breast 2. In the Info text box, also enter the ENA accession number of the sample, so that you will know the source of the data in the future (the accession number you enter should be different from the SRX123456 I have entered in this picture!), and press Save.
15. Repeat steps 10-14 for the tumor sample you have selected. Set the names to Cancer Breast 1 and Cancer Breast 2. Wait until all of your data have been downloaded. It might take a few minutes.
16. It is possible that the data you have selected are not available in FASTQ format, or that the file server is not available at that moment. In this situation the job will fail and its color will turn red. Delete those data (both paired files of the same sample), select another sample ENA number, and repeat the steps above until you have four data files successfully available.

17. We need one additional dataset: the annotation of the human genes. Go to the Tools panel, then Get Data, then UCSC Main. Leave the genome as Human, the assembly as Feb. 2009 (GRCh37/hg19), and the group as Genes and Gene Predictions; set the track to RefSeq Genes and the table to refFlat. Also set the output format to GTF - gene transfer format and leave Send output to Galaxy checked (if you wanted to download the file to your computer you would need to fill in the output file name). Press get output, and then Send query to Galaxy. The new job in the history will download the gene annotations.
18. In the search box of the Tools panel, search for fastq groomer. Select the FASTQ Groomer tool, select your first FASTQ dataset as File to groom, and Sanger and Illumina 1.8+ as Input FASTQ quality scores type. Press Execute, and do the same for the other 3 FASTQ datasets you have added.

19. You don't need to wait for the FASTQ Groomer jobs to finish. In the search box of the Tools menu, enter fastqc to show FastQC: Read QC. Open it, select the results of FASTQ Groomer on data 1, and press Execute. Do the same for the results of FASTQ Groomer on the other data files. These jobs will be queued and will run after the FASTQ Groomer jobs have finished.
20. See my history in this picture. I now have 4 FastQC jobs that are waiting for the results of the FASTQ Groomer jobs. The 4 FASTQ Groomer jobs are still running (yellow) and all Get Data jobs have already finished successfully (green). While my top job is number 16, your job numbers might be different, since I had 2 failed Get Data jobs and had to delete them and get another sample.
21. Wait until the FastQC jobs are finished. In the next exercise we will look at the FastQC results and decide accordingly whether to trim the data or align without trimming.
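The random choice of one normal and one tumor sample in step 7 can also be made in R rather than by eye. The sketch below uses a toy table as a stand-in for the downloaded sample table. The accession numbers, the column names ena_experiment and tumor_type, and the helper pick_one are hypothetical, and the real file lists several tumor subtypes rather than a single "tumor" label.

```r
# Toy stand-in for the ArrayExpress sample table.
# Accessions and labels are hypothetical placeholders, not real samples.
samples <- data.frame(
  ena_experiment = c("SRX000001", "SRX000002", "SRX000003",
                     "SRX000004", "SRX000005", "SRX000006"),
  tumor_type = c("normal", "normal", "tumor", "tumor", "tumor", "normal"),
  stringsAsFactors = FALSE
)

# Pick one accession at random among the rows of the requested type.
pick_one <- function(tbl, type) {
  candidates <- tbl$ena_experiment[tbl$tumor_type == type]
  candidates[sample.int(length(candidates), 1)]
}

normal_pick <- pick_one(samples, "normal")
tumor_pick  <- pick_one(samples, "tumor")
cat("Normal:", normal_pick, "\nTumor:", tumor_pick, "\n")
```

Indexing with sample.int avoids the surprising behavior of sample() when only one candidate is left.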

Mission 2: Quality Control & Alignment

1. Login to the Galaxy. The FastQC results should be ready now. Click the View Data button to check the quality results of each dataset. Pay special attention to the Per base sequence quality. If the quality of some part of a dataset is too poor (e.g. below 20), use FASTQ Trimmer to remove the low-quality ends. My data seem fine.
2. Look at the other sections of the FastQC quality results. For example, look at Per base sequence content, which is the average frequency of each nucleotide at each position across all reads. Since our reads are of length 60, there are 60 columns. I see higher GC than AT content in my selected dataset. What about you? How do you interpret these results?

3. Now we want to align the RNA-seq reads to the reference genome. From the Tools panel run Tophat2, select Paired-end, select the FASTQ Groomer results for the normal reads, select the Human hg19 reference genome, leave the other parameters unchanged, and press Execute. We did not modify the mean distance between the reads because it is not given with the dataset. Do the same for the cancer data.

4. Without waiting for the Tophat results, we continue to Cufflinks. Using Cufflinks we will quantify the expression levels of the genes. From the Tools menu select Cufflinks; in SAM or BAM file select the Tophat accepted hits for your normal data, change Use Reference Annotation to Use reference annotation, select UCSC Main on Human: refFlat (genome) as your reference annotation, and press Execute. Do the same for the Tophat results of the cancer sample. Remember that the reference we are using is the same refFlat annotation of all human genes that we obtained from the UCSC Table Browser. If we did not use this gene annotation, Cufflinks would not know the names or positions of the genes whose expression levels it should quantify.
5. Now we want to find differentially expressed genes. Without waiting for the Cufflinks results, run Cuffdiff from the Tools menu, select the same UCSC annotation of human genes (UCSC Main on Human: refFlat (genome)) as the Transcripts, set Normal as the name of Condition 1, and select the Tophat result for normal cells as Replicate 1 of Condition 1. Set Cancer as the name of Condition 2 and select the corresponding Tophat result for cancer cells. Press Execute.
6. Drink a cup of tea and have a long rest (maybe a few days) until the results are ready.
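The trimming decision in step 1 of this mission (remove read ends whose quality falls below about 20) amounts to finding the last read position that still meets the threshold. This is a minimal R sketch of that reasoning, assuming you have copied the per-position mean qualities from a FastQC report by hand. The numbers and the helper bases_to_keep are hypothetical, and FASTQ Trimmer itself only needs the resulting offsets.

```r
# Hypothetical per-position mean Phred qualities for 10-bp reads.
mean_quality <- c(34, 35, 36, 36, 35, 33, 30, 24, 18, 12)

# Number of leading bases to keep when trimming from the right end
# until the last remaining position reaches the threshold.
bases_to_keep <- function(quality, threshold = 20) {
  good <- which(quality >= threshold)
  if (length(good) == 0) return(0L)
  max(good)
}

keep <- bases_to_keep(mean_quality)
cat("Keep the first", keep, "bases; trim",
    length(mean_quality) - keep, "from the right end\n")
```

Only the tail is trimmed: a single low-quality position in the middle of the read does not shorten the result.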

Mission 3: Differential Expression Analysis

1. We will start with the Tophat2 results, which begin with a record in the History panel named Tophat2 on data X and data Y: alignment_summary, where X and Y are numbers. Select the alignment summary for normal cells and click the View Data button to see the results. Here are my results. As you see, 93.5% of single reads are aligned, with 83.2% of them having a concordant pair alignment (both reads of the pair are aligned within an acceptable distance of each other). Multiple alignment occurs for reads or pairs whose correct position in the genome is uncertain. As you see, 14.7% of the pairs in normal cells have multiple alignments.
2. Now look at the Tophat2 alignment summary for cancer cells. As you see below, it differs greatly from the results for normal cells. The rate of pairs with multiple alignments has increased to 71.1%, which is too high, and the concordant pair alignment rate has fallen to 65.1%. So there must be a problem.

3. To find out what the problem with the cancer cell data is, look back at the FastQC results of all 4 data files (as we did in steps 1 and 2 of Mission 2: Quality Control & Alignment). The main difference in my case was the length of the reads: the reads are 75 bp long for normal cells but 60 bp long for cancer cells. The quality is also higher for the normal reads, particularly at the right ends of the reads. It is natural that longer, higher-quality reads align better, with fewer multiple alignments, and this is a property of the data that we cannot easily fix. One suggestion is to trim the right ends of the cancer reads, but this makes them even shorter.

[Figure: FastQC results for Normal Cells and Cancer Cells]

4. There is a big difference between the Overrepresented sequences sections of the two cell types in my data. While there are no overrepresented sequences in the normal data, the cancer data show several overrepresented sequences, some of which are Illumina paired-end PCR primers, as below:

5. To obtain more accurate and reliable results, it would be necessary to use the FASTQ Quality Trimmer, Trim sequences, and Clip adapter sequences tools from the NGS: QC and manipulation section of the Tools menu to remove these sequencing problems from the data. You can try this later and compare your Tophat2 alignment with the current results. For now, let's continue with our alignments.
6. Now we look at the differential expression analysis. Locate Cuffdiff on data X, data Y, and data Z: gene differential expression testing in the History panel and press the View Data button. There are several columns, including the gene ID and location; value_1 and value_2, which represent the expression values of each gene in normal and cancer cells (in FPKM); log2(fold_change), which shows the expression change of the gene on a log2 scale; p_value, which is the p-value of the change; q_value, which is the adjusted p-value (hence we must use the q-value, not the p-value, to assess significance); and a significant column that shows whether or not the gene has significant differential expression between normal and cancer.
7. Click on the same record in the History panel, and then on the Download button (disk icon) to download your differential analysis results. It should download a file with the extension .tabular to your computer.

8. Download and install the statistical language R, which is widely used for bioinformatics analysis (http://www.r-project.org). Also download and install RStudio, a development environment for R (http://www.rstudio.com). Run RStudio. Read the downloaded file using the following command (change YOURPATH and YOURFILE according to the path and name of the downloaded file):

    x <- read.delim("YOURPATH/YOURFILE.tabular")

9. Since the log2(fold_change) value is not defined for many genes, we define a new logfc value with the following command:

    x$logfc <- log2(x$value_2 + 1) - log2(x$value_1 + 1)

10. First find the genes with logfc > 2 (more than 4-fold upregulation in cancer cells) and those with logfc < -2 (more than 4-fold upregulation in normal cells), using the following commands:

    y <- subset(x, logfc > 2)
    write.table(y, file = "YOURPATH/cancer.xls", sep = "\t", quote = FALSE, row.names = FALSE)
    z <- subset(x, logfc < -2)
    write.table(z, file = "YOURPATH/normal.xls", sep = "\t", quote = FALSE, row.names = FALSE)
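To see why step 9 adds 1 to both expression values, it may help to run the same formula on a tiny made-up table. The gene names and FPKM values below are hypothetical. The column names value_1, value_2, and logfc follow the commands above.

```r
# Toy FPKM values; gene names and numbers are hypothetical.
x <- data.frame(
  gene    = c("geneA", "geneB", "geneC", "geneD"),
  value_1 = c(0, 10, 3, 50),   # normal
  value_2 = c(15, 10, 0, 2),   # cancer
  stringsAsFactors = FALSE
)

# Plain log2 ratio: infinite whenever one of the two values is 0.
raw_lfc <- log2(x$value_2 / x$value_1)
print(raw_lfc)

# Pseudocount version from step 9: always finite for non-negative FPKM.
x$logfc <- log2(x$value_2 + 1) - log2(x$value_1 + 1)
print(x)

# Same filters as in step 10.
up_in_cancer <- subset(x, logfc > 2)$gene
up_in_normal <- subset(x, logfc < -2)$gene
print(up_in_cancer)
print(up_in_normal)
```

The pseudocount also pulls the fold change of weakly expressed genes toward zero, so a change from 0 to 3 FPKM no longer passes a filter that a change from 2 to 50 FPKM does.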

Mission 4: Gene Ontology and Pathway Analysis

So far we have two lists of genes, one over-expressed and the other under-expressed in cancer compared to normal cells. But what are the functions of these genes in the cells? Which parts of the cells' biological pathways are changed in cancer? These are the questions to answer during this mission. We will repeat the following steps for each of the two gene lists: once for the genes with logfc > 2, and once for the genes with logfc < -2. If there are too few genes (fewer than 100) you can relax the condition (e.g. logfc > 1 and logfc < -1). If there are too many genes (more than 2000) you can use more stringent conditions.

1. Select the gene symbols (gene names or gene_id) of the filtered genes and copy them. Open DAVID (http://david.abcc.ncifcrf.gov). In the left panel (Shortcut to DAVID Tools) press Functional Annotation. Paste the gene symbols into box A (Paste a list), select OFFICIAL_GENE_SYMBOL in Step 2: Select Identifier, select Gene List in Step 3: List Type, and press Submit List.
2. Depending on how well DAVID recognizes the gene symbols of your dataset, it might show you different results. Usually it asks you to select the species in the next step; when done, press the Select Species button. If most of your gene symbols are not recognized by DAVID, it might ask you to convert your gene list to a recognized alternative, which is also straightforward. Pay attention to the number of genes recognized by DAVID. It will not work very well if you have too few genes (e.g. fewer than 50) or too many genes (e.g. more than 2000). In such cases you might need to change the data, the groups being compared, or the level of stringency used to define significantly differentially expressed genes.

3. Press the Functional Annotation Clustering button. A new window will appear that shows you the results.
4. In the DAVID functional annotation clustering table, the second column shows the biological process, molecular function, or cellular location that is overrepresented in your list of genes. The Count column shows the number of genes in your selection that match that biological process or function. The Benjamini column shows the significance of the results; you can consider values less than 5E-2 (0.05) significant. For instance, below you can see the results for the genes overexpressed in breast cancer cells. The cell cycle and cell division biological processes are significantly overrepresented in this gene set. If you press the bar in a row, you will be shown the genes in your list that are involved in that biological process. Repeat the steps above for the genes overexpressed in the second group.

5. In addition to pressing the Functional Annotation Clustering button, explore the other results of DAVID. In particular, expand the Pathways item, press the Chart button of each pathway database (for instance KEGG_PATHWAY), and look at the pathways one by one. Save some of the pathways that interest you and that you believe are relevant to your differential expression analysis.
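Both the Benjamini column in DAVID and the q-value in Cuffdiff are p-values adjusted for multiple testing. As a rough illustration of what such an adjustment does, the sketch below applies the Benjamini-Hochberg method from base R (p.adjust) to a handful of made-up p-values. The numbers are hypothetical, and the exact procedure inside each tool is assumed here rather than verified.

```r
# Hypothetical raw p-values for five annotation terms.
p_values <- c(0.001, 0.008, 0.02, 0.045, 0.30)

# Benjamini-Hochberg adjustment (base R).
q_values <- p.adjust(p_values, method = "BH")
print(round(q_values, 4))

# Terms that remain significant at the 0.05 level after adjustment.
significant <- q_values < 0.05
print(significant)
```

Four of the five raw p-values fall below 0.05, but only three remain below it after adjustment, which is why the adjusted column is the one to use when judging significance.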

Good luck! The End.