Phylogeny: Yun Gyeong Lee


SplitsTree Instruction
Phylogeny: Yun Gyeong Lee (ylee307@mail.gatech.edu)

1. Go to cygwin-x. (If you don't have cygwin-x, you can either download it or use X11 on one of the brand new Macs in 306.)
2. Log in to compgenomics.biology.gatech.edu:
   $ ssh -X username@compgenomics.biology.gatech.edu
3. Go to the comparative folder: compgenomics2009/comparative/
4. Execute SplitsTree: ls ./splitstree
5. Registration for SplitsTree 4 (to extract trees you need a personal license; just use mine): yun gyeong Lee, Georgia Tech, vbokonly@hotmail.com
6. Command line: go to Window -> Enter a command.
   Useful commands:
   EXECUTE FILE=file : open and execute a file in Nexus format
   OPEN FILE=file : open (but don't execute) a file in Nexus format
   SAVE FILE=file [REPLACE={YES|NO}] [APPEND={YES|NO}] [DATA={ALL|LIST-OF-BLOCKS}] : save all data or named blocks to a file in Nexus format
   BOOTSTRAP RUNS=number-of-runs : perform bootstrapping on character data currently excluded
   HELP : show this info
   QUIT : exit program

7. Tabs: Main: Network tab / Main: Data tab / Main: Source tab
   Network tab: displays the computed tree or network.
   Data tab: provides a read-only textual display of the data associated with the given document in the program's native Nexus format, organized as a linear list of items that can be either collapsed or expanded.
   Source tab: provides an editable view of the source data associated with the given view.
   - Data can be entered by hand or by copy-and-paste.
   - Once the data has been executed, the source data is displayed in Nexus format.
   - If an error is encountered while parsing an input file, the file is opened and the line in which the error was detected is selected.
8. Manual for SplitsTree

MEGA Instruction
1. Go to the MEGA 4 homepage and download it.

Homework
- SplitsTree4
1. Use data: dolphins_binary.nex
   1) Build trees with both methods, UPGMA and NJ.
   2) Run the bootstrap with different numbers of replicates, 1) 100 and 2) 1000, and compare the results.

2. Use data: bees.nex
   1) Open the bees.nex data from the command line.
   2) Interpret the Main: Source tab.
      - How many taxa are there?
      - What does nchar=677 mean?
   3) If you want to remove some of the taxa from the graph and build a tree, how can you do it?
   4) After removing any 3 taxa from the bees.nex file, build a tree.
3. Go to Main: Source and make a taxa block (e.g. Taxa: A, B, C). (See the user manual, page 31.)

- MEGA 4
1. Download data: Compgenomics2009/comparative/CFTR_ABCC4_ABCC5_final_aln.txt
2. Main page: open the file CFTR_ABCC4_ABCC5_final_aln.txt
3. Convert to MEGA format: Utilities -> Convert to MEGA Format -> Data Format: .fasta
4. After converting, save the file in .meg format, then open it again from the main page. Open file: CFTR_ABCC4_ABCC5_final_aln.meg
5. Input data: Protein Sequences

6. Go to the main page (= minimize the current page).
7. Main page -> Phylogeny
   1) Build trees with four different methods (NJ, ME, MP, UPGMA), extract the trees, and compare the results across the tree methods: Phylogeny -> Construct Phylogeny -> NJ
   2) Bootstrap test of phylogeny with neighbor joining: double-click the green rectangle (picture below).
   3) Run the bootstrap test and the interior branch test with the default values (Replications: 500; Random Seed: 64238) and compare each result with the original tree.

Good luck~

Genome Alignment: MUMmer and MAUVE (Ziming)

Genome Alignment Instructions

1. Dataset: use the two genomes NeisseriameningitidisZ2491.fasta and NeisseriameningitidisMC58.fasta. Use NeisseriameningitidisZ2491 as the reference genome and NeisseriameningitidisMC58 as the query genome in MUMmer. You can get the sequences from this folder on the server compgenomics.biology.gatech.edu: compgenomics2009/comparative/genomesequences/ncbi-4virulent
2. MUMmer instructions:
   a) Online manual and tutorial:
   b) Useful command lines:
      mummer
      mummerplot
      mummer -h
      MUMmer options:
      -mum: MUM
      -mumreference: MAM
      -maxmatch: MEM
      -b: both strands (forward and reverse)
   c) Command line examples:
      mummer -mum -b -c NeisseriameningitidisZ2491.fasta NeisseriameningitidisMC58.fasta > Neisseriameningitidis_b.mums
      mummerplot -postscript -p MUMb Neisseriameningitidis_b.mums
      mummer -mumreference -b -c NeisseriameningitidisZ2491.fasta NeisseriameningitidisMC58.fasta > Neisseriameningitidis_b.mams
      mummer -maxmatch -b -c NeisseriameningitidisZ2491.fasta NeisseriameningitidisMC58.fasta > Neisseriameningitidis_b.mems
3. MAUVE instructions:
   a) Online user guide:
   b) Command line:
      mauvealigner
      Mauve
      Mauve options: --output --output-alignment --permutation-matrix-output
   c) Command line examples:
      1) Run the mauve alignment:

         mauvealigner --output=mauve.out --output-alignment=out.alignment --permutation-matrix-output=out.permutation NeisseriameningitidisZ2491.fasta NeisseriameningitidisZ2491.sml NeisseriameningitidisMC58.fasta NeisseriameningitidisMC58.sml
         (Note: each sequence must be followed by the file name of its corresponding Sorted Mer List (SML). If the SML file does not exist, mauvealigner will create it automatically, but make sure you put the corresponding SML file name right after each sequence file.)
      2) Visualize the mauve output file:
         X Windows will be used for graphical display under Linux. Log in to the server compgenomics.biology.gatech.edu with an X11 window; go to the folder compgenomics2009/comparative/mauve_2.2.0; run the following command line:
         Mauve mauve.out
         You can save the graph by exporting the image as a jpg file.

Genome Alignment Questions

MUMmer Questions:
1. Run the command line:
   mummer -mum -b -c NeisseriameningitidisZ2491.fasta NeisseriameningitidisMC58.fasta > Neisseriameningitidis_b.mums
   In the output file Neisseriameningitidis_b.mums:
   A) What are the coordinates of the longest MUM (maximal unique match) on the query sequence?
   B) Which strand is the longest MUM from (forward or reverse), and how long is it?
   C) How many MUMs are longer than 2000 bp?
2. Run MUMmer on both strands of the query sequence with the MUM, MAM and MEM options separately, as shown in the instructions.
   A) List and rank the number of matches in the three output files.
   B) Explain why the number of matches differs for MUM, MAM and MEM.
3. Run mummerplot and get the 2D plot.
   A) Which color represents an inversion between the two sequences?
   B) Please attach the pdf file of the 2D plot.

MAUVE Questions:
1. How many LCBs can you find? What is the length of the longest LCB that you find?
2. Paste the permutation matrix output that you get. What software can you use to get the genomic phylogeny from it?
3. Attach the jpg file of the mauve alignment.
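For MUMmer Question 1, the .mums file can be summarized with a short script. The following is only a sketch in Python. It assumes the usual mummer text output, in which each query (and, with -b, each reverse strand) is introduced by a line starting with ">" (ending in "Reverse" for the reverse strand) and each match line ends with three integers: reference position, query position and match length. The function names parse_mums and summarize are made up for this example.

```python
def parse_mums(lines):
    """Yield (strand, ref_start, query_start, length) for each match line."""
    strand = "forward"
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line: "> name" or "> name Reverse" (assumed layout).
            strand = "reverse" if line.split()[-1] == "Reverse" else "forward"
            continue
        # The last three columns are assumed to be ref pos, query pos, length.
        ref_start, query_start, length = (int(x) for x in line.split()[-3:])
        yield strand, ref_start, query_start, length


def summarize(lines, min_len=2000):
    """Return the longest match and the number of matches longer than min_len."""
    matches = list(parse_mums(lines))
    longest = max(matches, key=lambda m: m[3])
    n_long = sum(1 for m in matches if m[3] > min_len)
    return longest, n_long
```

To use it on the real output, pass an open file handle, for example summarize(open("Neisseriameningitidis_b.mums")).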

Comparative Genomics Homework

Horizontal Gene Transfer (Emily Rogers)

Instructions

There are two main methods to predict horizontally transferred genes, which are genes acquired by an organism from another organism that is not its parent. Both methods look for genes whose characteristics stand out from those of the rest of the genome, but they differ in which characteristics are of interest. One method examines phylogenetic information, looking for genes with unusually close matches to evolutionarily distant organisms. The other relies on intrinsic, ab initio calculations to capture abnormal genetic composition.

In predicting horizontally transferred genes, we will use programs that cover both methods. DarkHorse finds genes whose close BLAST matches belong to distantly related organisms, and alien_hunter uses complex statistics to detect unusual genetic composition.

For this homework, we have already run Darkhorse, which is located in the compgenomics2009/comparative folder. It takes as its arguments a configuration file (using the sample provided by the program), an output file produced by BLASTing the query genome against the nr database, a file that contains a list of terms to exclude from the results (sample also provided by the program), and finally the query sequence of the genome of interest in fasta format. Move into the darkhorse/darkhorse-1.0_rev137/ folder in the comparative directory. Examine the command line usage by typing ./darkhorse.pl.

Question 1: Assuming you may use any of the configuration files given by the program, plus all the files under the test_data directory of Darkhorse in the comparative folder, what is a sample command line execution of darkhorse?

Alien_hunter slides a window over raw genomic data to calculate outliers. Navigate to the alien_hunter-1.6/ directory under comparative/, and type ./alien_hunter to see how to run it from the command line. How many arguments does it take? What does this program output?

Question 2: Assuming we want to use the raw genomic sequences available from the results of the assembly group, what is a command line we would type to run alien_hunter?

Although predictions by both programs are valid, overlapping predictions are especially compelling, and we would like to investigate these. Navigate to the results directory of the comparative group, and look at the HGT folder. There should be three folders under the HGT directory; we're interested in the results from Darkhorse and alien_hunter. In which files are the coordinates of the HGT predictions for each?

Question 3: Write a script that takes the output prediction files of both Darkhorse and alien_hunter and finds all genes in which the predictions overlap. In other words, which genes are predicted to be HGTs by both programs?
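The core of Question 3 is an interval-overlap test. The sketch below, in Python, shows only that step and is not a parser for either program's output. It assumes the Darkhorse predictions have already been reduced to (gene_id, start, end) tuples and the alien_hunter predictions to (start, end) regions on the same coordinate system; the names overlaps and shared_predictions are hypothetical.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if the closed intervals [a_start, a_end] and [b_start, b_end] share a base."""
    return a_start <= b_end and b_start <= a_end


def shared_predictions(genes, regions):
    """Return ids of genes that overlap at least one predicted region.

    genes:   list of (gene_id, start, end), assumed to come from Darkhorse
    regions: list of (start, end), assumed to come from alien_hunter
    """
    shared = []
    for gene_id, g_start, g_end in genes:
        if any(overlaps(g_start, g_end, r_start, r_end)
               for r_start, r_end in regions):
            shared.append(gene_id)
    return shared
```

For genomes of this size the simple nested loop is fast enough; sorting both lists first would only matter for much larger inputs.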


SNP analysis (Nitya Sharma)

Background Information

This analysis aims to find patterns of SNPs that discriminate carriage from virulent strains of N. meningitidis. Our aim is to find positions at which all disease (virulent) strains share the same nucleotide and the carriage strains have anything but that nucleotide (Figure 1).

Figure 1. A SNP of interest in which the virulent strains have an "A" at a given position, whereas none of the carriage strains contain an "A" at that same position.

These SNP positions will be defined as SNPs of interest. (Refer to the pipeline on the Wiki and to Figure 1.)

The goal of these exercises is to find all SNPs. This can be considered the intermediate step toward finding SNPs of interest. At this point, you will find all positions at which there is at least one difference across all 12 genomes (9 virulent strains and 3 carriage strains). You are given one local collinear block (LCB) for all 12 strains, labeled V1 (for virulent strain 1) through V9 and C1 through C3. Our genome under study is labeled V1. The coordinates of the LCB within each respective genome are also given. The label format is: V1_start-stop.

You will use ClustalW on the command line to perform the multiple sequence alignment, then parse the result and find all SNPs (displayed as the gaps in the row of *s, Figure 2).

Figure 2. Arrow indicates the position of a SNP.
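The definition of a SNP of interest can be written as a small check on one alignment column. This is a sketch in Python under two assumptions that are not stated in the handout: the column is passed in as two strings of bases (one character per strain), and a gap ("-") in the virulent strains does not count as a shared nucleotide. The function name is_snp_of_interest is made up here.

```python
def is_snp_of_interest(virulent, carriage):
    """Check one alignment column.

    virulent: bases of the virulent strains at this column, e.g. "AAAAAAAAA"
    carriage: bases of the carriage strains at this column, e.g. "GCT"
    """
    bases = set(virulent.upper())
    if len(bases) != 1:
        return False  # virulent strains disagree with each other
    base = next(iter(bases))
    # Assumption: a shared gap is not a nucleotide, so it is rejected.
    return base != "-" and base not in carriage.upper()
```

With the situation of Figure 1, is_snp_of_interest("AAAAAAAAA", "GCT") is True, while a single carriage strain carrying "A" makes it False.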

Instructions:

Use the input sequence /compgenomics2009/comparative/hw/practice_lcb.fna

On the command line:
1.) Type: clustalw
2.) Choose the option for Sequence Input from Disc.
3.) Choose the option for Multiple Alignments.
4.) Choose the option for Do complete multiple alignment now (Slow/Accurate).
5.) Output all files to a folder in /compgenomics2009/comparative/hw/ named after your group, and name the files with your group name, e.g. comparative.aln

Question 1: Write a script to parse the output (groupname.aln) and identify all SNP positions with respect to our genome (V1). Name your script SNPcode_group and your output Parsed_groupname.txt. Make sure to put these files in the folder you already created in /compgenomics2009/comparative/hw/

Question 2: What is the biological significance of finding SNP patterns that discriminate carriage from virulent strains? Referring to the pipeline, why are we interested in first finding the first-order gene environment (that is, the genes surrounding the SNP or the gene that the SNP lies within)?
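One possible shape for the Question 1 parser is sketched below in Python. It assumes the standard ClustalW .aln layout (a header line starting with "CLUSTAL", then blocks of "name sequence" lines, each block followed by a conservation line that starts with whitespace) and that the V1 label starts with "V1_". A column is reported when the strains show more than one distinct nucleotide, ignoring gaps, and V1 itself has a base there; the reported position counts V1 bases within the block, so the start coordinate from the label still has to be added. The names read_clustal and find_snps are hypothetical.

```python
def read_clustal(lines):
    """Return {sequence name: full aligned sequence} from ClustalW .aln lines."""
    seqs = {}
    for line in lines:
        # Skip blank lines, the header, and conservation lines (leading space).
        if not line.strip() or line.startswith("CLUSTAL") or line[0].isspace():
            continue
        parts = line.split()
        seqs[parts[0]] = seqs.get(parts[0], "") + parts[1]
    return seqs


def find_snps(seqs, ref_prefix="V1_"):
    """Return [(position in the reference block, column bases)] for SNP columns."""
    ref_name = next(name for name in seqs if name.startswith(ref_prefix))
    ref = seqs[ref_name]
    snps = []
    ref_pos = 0  # 1-based count of non-gap reference bases seen so far
    for col in range(len(ref)):
        if ref[col] != "-":
            ref_pos += 1
        column = "".join(seq[col] for seq in seqs.values())
        bases = set(column.upper()) - {"-"}
        if len(bases) > 1 and ref[col] != "-":
            snps.append((ref_pos, column))
    return snps
```

Whether a gap in another strain should also count as a difference is a design choice; this sketch ignores gaps so that only nucleotide substitutions are reported.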

Cluster of Orthologous Groups (Kanika Arora)

Steps for searching for COGs:

1. Log in to the server and go to the directory compgenomics2009/
2. The first step is to compare the protein sequences from a strain to the protein sequences in the COG database. The COG database is saved in the folder comparative/cog as COGdb. You need to give the path of this database when running the BLAST command. On the command line, type:
   blastall -p blastp -d [path_for_the_cog_database/cogdb] -i comparative/hw/strain1.faa -e 1e-5 -o [path_of_output_file] -m 8 -v 5 -b 5
   Example: if your present directory is compgenomics2009, you can type:
   blastall -p blastp -d comparative/cog/cogdb -i comparative/hw/strain1.faa -e 1e-5 -o [your group directory]/blast_output_1.txt -m 8 -v 5 -b 5
3. Output parsing: for this you need the file cog.txt, which is also saved in the hw folder. Type:
   perl comparative/hw/cogparse.pl [path of cog.txt] [path of the output file from BLAST] [path of where you would like your results file to be saved]
   For example:
   perl comparative/hw/cogparse.pl comparative/hw/cog.txt [your group directory]/blast_output_1.txt [your group directory]/cogs_output_1.txt
   This perl script will give you output in this format:
   [Prot name   Hit 1   COG of hit 1   Hit 2   COG of hit 2   Hit 3   COG of hit 3   Hit 4   COG of hit 4   Hit 5   COG of hit 5]
   NMO0001   NMA0262   COG0362   NMB0015   COG0362   HI0553    COG0362   PM1554   COG0362   VCA0898   COG0362
   NMO0002   NMB0014   COG1519   NMA0261   COG1519   RSc0693   COG1519   PA4988   COG1519   kdta      COG1519
   This output file will be tab-delimited. The first column has the names of the proteins of the given strain, the second column has the topmost hit of the corresponding protein, and the third column is the name of the COG that this hit belongs to.

   [The COGs to which the best hits belong can be found in the coginfo.txt file, which has a list of COGs and the names of the proteins that belong to each COG.]
4. Follow the same steps for strain2.faa
5. Write a script to find the list of COGs for each strain and the total number of proteins that belong to COGs.
   a. Here, consider a protein to be associated with a COG if its three topmost hits belong to the same COG.
   b. Two proteins from the same strain may belong to the same COG. Can you explain why? [The total number of proteins in COGs may be greater than the total number of COGs.]
   c. Your output should have the following:
      List of COGs. For example: Strain1: COG0001, COG0004, COG0010, COG0132
      Number of COGs
      Number of proteins present in COGs
6. With the lists of COGs for the two strains, make a presence/absence matrix of COGs.
   a. For this you will need a comprehensive list of COGs from both strains.
   b. For each COG in this comprehensive list, check whether the COG is present in each of the strains.
   c. If a COG is present, represent that as 1; if it is absent, represent that as 0.
   d. An example of such a matrix is:
      COG0001 COG0005 COG0010 COG0021 COG0111
      Strain
      Strain
      In the above example, COG0001 is present in both strains. COG0005 is absent in strain1 and present in strain2.
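Steps 5 and 6 above can be sketched in Python as follows. The sketch assumes the tab-delimited layout described for the cogparse.pl output (protein name, then alternating hit and COG columns) and applies the rule from step 5a: a protein is assigned to a COG only when its three topmost hits share that COG. The function names assign_cogs and presence_absence are made up for this example.

```python
def assign_cogs(lines):
    """Map protein name -> COG when the top three hits share one COG."""
    assigned = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # no hits reported for this protein
        protein = fields[0]
        cogs = fields[2::2]  # columns 3, 5, 7, ... hold the COG of each hit
        if len(cogs) >= 3 and len(set(cogs[:3])) == 1:
            assigned[protein] = cogs[0]
    return assigned


def presence_absence(strain_to_cogs):
    """Build a 1/0 matrix over the combined COG list of all strains.

    strain_to_cogs: {strain name: set of COG names found in that strain}
    Returns (sorted list of all COGs, {strain name: list of 1/0 values}).
    """
    all_cogs = sorted(set().union(*strain_to_cogs.values()))
    matrix = {
        strain: [1 if cog in cogs else 0 for cog in all_cogs]
        for strain, cogs in strain_to_cogs.items()
    }
    return all_cogs, matrix
```

For one strain, the list of COGs is set(assign_cogs(lines).values()), the number of COGs is the size of that set, and the number of proteins present in COGs is the size of the returned dictionary.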
