Genomic Evolutionary Rate Profiling (GERP) Sidow Lab


Last Updated: June 29, 2005

Maintained by Gregory M. Cooper, a PhD student in the lab of Arend Sidow in the Stanford University Departments of Pathology and Genetics. See the following link for more info about the lab, including information about the Java GUI we have developed to view and analyze GERP data (the ABC):

All of the GERP Perl scripts were written by GMC; statistical analysis done in collaboration with Eric A. Stone.

Table of Contents:

1. Overview
2. GERP.pl
3. GERP_dataprep.pl
4. GERP_window.pl; GERP_modifytree.pl
5. GERP_findconssegs.pl
6. GERP_permutes.pl; GERP_highconf_thresh.pl

Please use the following citation if you use GERP for any published analyses or results:

Cooper, G.M., Stone, E.A., Asimenos, G., NISC Comparative Sequencing Program, Green, E.D., Batzoglou, S., and Sidow, A. 2005. Distribution and intensity of constraint in mammalian genomic sequence. Genome Research. In press.

Licensing and copying information:

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA USA

1. Overview

Conceptually, Genomic Evolutionary Rate Profiling (GERP) is a method for the identification of slowly evolving regions in a multiple sequence alignment, defined here as constrained elements. Given a multiple sequence alignment and an unrooted tree relating the sequences of the alignment, rates of evolution are estimated in small windows or single columns stepped across the alignment using maximum likelihood. In addition to requiring a multiple sequence alignment and a topology with relative branch lengths, a neutral rate estimate for the sequences captured by the alignment is required. This neutral rate, in conjunction with the phylogenetic tree, is used to define expected rates of evolution for each window in which the observed rate is quantified. Constrained elements are identified by comparing the observed to the expected rates of evolution for each window, and defining all those regions whose collective observed rates of evolution are significantly lower than would be expected under a null model.

This method has several significant advantages, including: realistic estimation of substitution events using a likelihood, tree-based method; tractable statistics; high-resolution quantification of evolutionary rates and identification of constrained elements; and the ability to cope with missing data by excluding gapped or ambiguous sequences within each window (or column) and adjusting the neutral (expected) rate accordingly. Note that while it is not currently implemented, with some simple additions the expected rate could also be dynamically adjusted according to regional fluctuations in the neutral rate.

In practice, GERP is a simple group of Perl scripts that allow the automation of one particular instance of the methodology. Note that this package will only be useful for people who are comfortable with reading and running Perl scripts, installing programs in a Unix environment, and dealing with flat text files.
These scripts by default call upon three external programs: RepeatMasker (Smit, AFA & Green, P.), the multiple sequence alignment program MLAGAN (Brudno et al. 2003), and the maximum likelihood rate estimation program SEMPHY (Friedman et al. 2002). Note, however, that it is trivial to supply your own repeats and/or your own alignments, so neither RepeatMasker nor MLAGAN is required. Additionally, with some basic but non-trivial tweaks to one of the scripts, a replacement for SEMPHY could be utilized.

These GERP scripts will produce the following raw output data: RepeatMasker masked sequence files and annotations; an MLAGAN-generated multiple sequence alignment; a rates file describing the observed and expected rate of evolution estimated for each site of the alignment; a constrained elements file (in a simple tab-delimited format) containing the coordinates of all elements identified at a given threshold, along with their score (see GERP_findconssegs.pl); and a compressed alignment consisting of the alignment projected down to the ungapped coordinates of a specified lead sequence (see GERP_dataprep.pl).

Unless the option is disabled, these scripts will also perform permutations of the rates files, detect conserved segments within these permutations, and use a score threshold that meets a specified false positive rate (defaults to 0.05). Note that the user MUST supply one or more files of permuted coordinates (see below).

Additionally, specially formatted data files can be generated so that the user can view the alignment, rate, and annotation data in the Java application, the Application for Browsing Constraints (ABC); this application was developed in our lab for the visualization and exploration of multiple sequence alignments, evolutionary rate data, and annotations. See for the application and documentation.

2. GERP.pl <GERP Parameters file>:

GERP.pl is a master script that can be used to automate the entire analysis pipeline, from RepeatMasking to alignment to sliding window analysis to identification of constrained elements to output of results for ABC browsing.

To get started, you must have RepeatMasker, SEMPHY, and MLAGAN installed (unless replacements are utilized) in a Unix environment; see the appropriate citations for information on obtaining and installing each of these programs (references at the bottom of this document). You must then create a parameters file that describes file locations and other options (see below), and run the script GERP.pl, supplying the path to the parameters file as the lone argument. GERP.pl reads the parameters file and performs the appropriate actions.

The parameters file should be in plain text, with two tab-delimited columns. The first column should be the name of the option, and the second should be the value for that option. For example,

    seq_file	MySeqs.mfa

denotes that the seq_file option has a value of MySeqs.mfa.
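The two-column layout described above can be read with a few lines of code. The following Python sketch is illustrative only and is not taken from GERP.pl (which is written in Perl); the function name `parse_gerp_params` is hypothetical. It assumes that an option may appear more than once (as the documentation states for extra_mergedist), so values are collected in lists, and that a line with no second column is stored with an empty value, which is a guess about how flag-style options would be written.

```python
import io


def parse_gerp_params(handle):
    """Read a two-column, tab-delimited options file into {name: [values]}."""
    params = {}
    for raw in handle:
        line = raw.rstrip("\n")
        if not line.strip():
            continue  # skipping blank lines is a choice of this sketch
        # First column is the option name, second column is its value.
        name, _, value = line.partition("\t")
        # Assumption: an option with no second column gets an empty value.
        params.setdefault(name.strip(), []).append(value.strip())
    return params


example = io.StringIO(
    "seq_file\tMySeqs.mfa\n"
    "extra_mergedist\t2\n"
    "extra_mergedist\t3\n"
)
options = parse_gerp_params(example)
print(options["seq_file"])         # ['MySeqs.mfa']
print(options["extra_mergedist"])  # ['2', '3']
```

Keeping every value in a list avoids silently discarding a repeated option; a caller that expects a single value can take the last element.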
The following parameters can be specified in the parameters file:

seq_file
    Path of a sequence file to be used, which must be in multi-fasta format.

phylo_tree
    Path of a tree file, which must be in standard parenthesis tree format and include branch lengths.

align_tree
    Path of an alignment tree file; MLAGAN requires a separate tree for the progressive alignment strategy it utilizes (see MLAGAN documentation).

neutral_rate
    Neutral rate estimate for the full phylogenetic tree supplied; this value may be different from the sum of the branch lengths in the phylogenetic tree.

window_length
    Length of the window to use for sliding window rate estimation.

lead_sequence
    Name of the sequence representing the "lead sequence"; the alignment will be compressed to the ungapped coordinates of this species, to make annotations consistent between the alignment and features of this sequence; this value must correspond exactly to one of the sequence names (not including the > character).

rej_subs_min
    Threshold score to define a candidate constrained element as significant; defaults to 8.5 (see GERP_findconssegs.pl).

merge_distance
    Maximum tolerated number of unconstrained scores between candidate constrained elements; defaults to 1 (see GERP_findconssegs.pl).

extra_mergedist
    Will find conserved segments using additional merge distances (see GERP_findconssegs.pl); this option can be reused as many times as desired.

repeats
    Path of a repeat annotation file, which must be in RepeatMasker .out format; you can ignore repeats altogether by using this option with a value of NULL.

genes_file
    Path of a genes file to be included in ABC-formatted annotation, which must be in ABC-ready format.

masked_sequence
    Path of a repeat-masked version of the sequence file; must be in multi-fasta format, and should be N-masked.

alignment
    Path to an alignment file to be used, which must be in multi-fasta format; note that this disables the call to MLAGAN.

align_length
    If you supply an alignment and disable the sliding window analysis, you must supply the alignment length to get properly formatted ABC files.

block_sequence
    Path of an alignment file in block format (see GERP_dataprep.pl); this is only useful when running GERP multiple times on the same alignment; note that this disables the call to MLAGAN.

rates_file
    Path of a GERP rates file; note that this will disable the sliding window analysis.

no_sliding_window
    Disables sliding window analysis; reduces GERP functionality to producing ABC-ready annotation given a set of annotations and results. NOTE: you must use either the no_abc_rates, no_abc_files, or rates_file option in conjunction with this option to ensure functionality.

gap_file
    Path of an ABC-ready gap coordinates file (see GERP_dataprep.pl).

read_gerp_segs
    Will import GERP-formatted constrained element files in the working directory and output them in ABC-ready format.

no_abc_files
    Disables the generation of all ABC-ready results files.

no_abc_seq
    Disables printing of the ABC-ready sequence file.

no_abc_rates
    Disables printing of the ABC-ready rates file.

keep_anchors
    Will allow CHAOS (part of MLAGAN) anchor files to remain; only useful for rerunning MLAGAN.

high_conf_thresh
    Specify the minimum rej. subs. score to consider as 100% confident; defaults to ESTIMATE, meaning this value will be automatically determined; to disable this, supply NULL as a value, or alternatively supply your own threshold in terms of rejected substitutions (see GERP_permutes.pl and GERP_highconf_thresh.pl).

permutations
    Specify the directory path containing permuted coordinates; by default this is ./permutations.

false_pos_rate
    Specify the maximum false positive rate accepted (defaults to 0.05; see GERP_permutes.pl).

no_rsmin_estimate
    Skip automatic estimation of RS for the given false positive rate (will use the default RS minimum of 8.5, or whatever RS minimum is supplied).

Note that, while no options are specifically required, you may not get sensible output if sensible groups of options are not supplied. If you use no_sliding_window, for example, but do not either supply a rates file or turn off ABC-rates file printing, the script will throw an error when attempting to find rate values for printing.

For a typical GERP analysis, the options file (also see included example) might be:

    seqfile	MySeqs.mfa
    phylotree	MySeqs.tree
    aligntree	MySeqs_align.tree
    leadseq	MyLeadSeq
    winlen	1
    neurate	2.5
    genesfile	MyLeadSeq_genes.txt

Running GERP.pl with these seven options is sufficient to run a complete analysis and prepare files for browsing with the ABC.
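The defaults and the option-group requirements stated in the parameter descriptions can be checked before launching a run. The Python sketch below is a hypothetical pre-flight check, not part of GERP: the names `DEFAULTS`, `with_defaults`, and `check_option_groups` are invented for illustration, option names follow the spellings used in the parameter descriptions (the example options file uses shorter spellings, and which spelling GERP.pl accepts is not settled here), and only the two requirements stated explicitly in the text are encoded.

```python
# Defaults as stated in the parameter descriptions.
DEFAULTS = {
    "rej_subs_min": "8.5",
    "merge_distance": "1",
    "high_conf_thresh": "ESTIMATE",
    "permutations": "./permutations",
    "false_pos_rate": "0.05",
}


def with_defaults(options):
    """Return a copy of options with the documented defaults filled in."""
    return {**DEFAULTS, **options}


def check_option_groups(options):
    """Return a list of problems for option combinations the text warns about."""
    problems = []
    if "no_sliding_window" in options:
        companions = ("no_abc_rates", "no_abc_files", "rates_file")
        if not any(name in options for name in companions):
            problems.append(
                "no_sliding_window needs one of: " + ", ".join(companions)
            )
        if "alignment" in options and "align_length" not in options:
            problems.append(
                "alignment with no_sliding_window needs align_length"
            )
    return problems


opts = with_defaults({"seq_file": "MySeqs.mfa", "no_sliding_window": ""})
for problem in check_option_groups(opts):
    print("warning:", problem)
```

Returning a list of problems rather than raising on the first one lets a user fix every inconsistent group in a single pass.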

3. GERP_dataprep.pl <Alignment> <LeadSequence> <WindowLength> <BlockSize>:

The function of GERP_dataprep.pl is to format the alignment into the block sequence format necessary for GERP_window.pl to function. The basic idea is to split the alignment into chunks of size "BlockSize"; GERP_window.pl then loads these chunks into memory for efficient sliding window analysis.

In addition, if a lead sequence is identified, GERP_dataprep.pl will project the alignment to the ungapped coordinates of the lead sequence prior to generating the block sequence file. In the process of doing this, it will generate a gap coordinates file (in an ABC-ready format) that identifies the location, relative to the lead sequence, of each gap that was deleted from the alignment. It also records the number of nucleotides deleted from each sequence of the alignment that were aligned to the lead sequence gap; see the ABC documentation for a thorough description of this file and its format. This file can be safely ignored if desired.

The block sequence file format is as follows. At the top of the block sequence file should be four tab-delimited values: number of sequences, number of blocks (0-based), length of each block, and the length of the residual block. For example, for an alignment of 5 species, 102 kb in length, and a block size of 50 kb, the header line would be:

Subsequently, each line should consist of a sequence name, followed by a space, followed by the sequence for that block for that species. Species names and sequences should repeat in the same order, one for each block, until the alignment is exhausted.

In addition, to facilitate the sliding window analysis, an overhang is added to each block to allow the sliding window analysis to extend past the end of the block. By default, this overhang size is 99, allowing a window size of at most 100. For example, relative to the alignment coordinates of the above example, block 1 would contain columns 1-50,099 and block 2 would contain columns 50,001-100,099. This allows the windowing to run smoothly across the break points. Note that smaller window sizes will function normally; the overhang size has no effect on the sliding window analysis, assuming it is large enough to accommodate the window size used.
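The block boundaries in the example above follow from simple arithmetic. This Python sketch (the function name `block_ranges` is hypothetical, and the code is not taken from GERP_dataprep.pl) computes 1-based column ranges for each block, assuming that 1 kb means 1,000 columns and that the final, residual block simply stops at the end of the alignment.

```python
def block_ranges(alignment_length, block_size, overhang=99):
    """Return 1-based (first_column, last_column) for each block.

    Each block starts block_size columns after the previous one and is
    extended by `overhang` columns so that a window of up to
    overhang + 1 columns can slide across the break point.
    """
    ranges = []
    start = 1
    while start <= alignment_length:
        # Extend past the nominal block end, but never past the alignment.
        end = min(start + block_size - 1 + overhang, alignment_length)
        ranges.append((start, end))
        start += block_size
    return ranges


for first, last in block_ranges(102_000, 50_000):
    print(first, last)
# 1 50099
# 50001 100099
# 100001 102000
```

The first two ranges match the columns given in the text for blocks 1 and 2; the third is the residual block under the assumptions stated above.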

4. GERP_window.pl <PhylogeneticTree> <Alignment> <WindowLength> <NeutralRate> <Block_Sequence>:

This is the core script of the analysis pipeline. Given a phylogenetic tree with branch lengths, a neutral rate estimate describing the neutral rate across this entire tree, and an alignment in block sequence format, this will produce a file containing rate estimates and expected rates for each site of the alignment. Note that the second parameter ("Alignment") is used for naming purposes only, and dictates the name of the resultant rates file.

A summary of the procedure is as follows:

- Collect the nucleotides for each species for each window, moving processively along the length of the alignment, one block at a time.

- For each species, decide if it is to be retained in the rate estimation step based on two criteria: gap and ambiguous/repeat percentage. If the number of gap characters for that sequence, or the number of Ns, or the number of lower-case nucleotides (assuming repeats are lower-case masked), exceeds the threshold, this sequence is excluded from the rate estimation procedure for that window. These threshold percentages can be modified directly in the script, lines 19 and 20.

- Once the list of excluded species is complete, the script GERP_modifytree.pl (see below) is invoked to prune the phylogenetic tree. Subsequently, the expected (neutral) rate is estimated by summing the remaining branch lengths and pro-rating the total neutral rate according to the fraction of the original tree that remains in the analysis. Note that this maintains the topological constraints of the original tree, and also keeps the tree unrooted.

- Once a tree is generated and an expected (neutral) rate is determined, a tree and fasta file for the window are generated and passed to SEMPHY. This step is skipped, however, if the amount of neutral evolution captured by the sequences in the window is small (default threshold is 0.5 substitutions per site; if desired, this can be changed directly in line 22). SEMPHY will optimize the lengths of the branches of this tree, given the data in the window; these branch lengths are then summed to generate an estimate for the observed rate of substitution within the window.

- Finally, the observed and expected rates are printed to the rates output file for each window.

GERP_modifytree.pl <Tree> <seqname1> <seqname2> <seqname3> :

This script reads in an unrooted, parenthesis tree with branch lengths, recursively eliminates each of the seqname variables, and returns the residual tree, keeping it unrooted. See the script for details.
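The pruning and pro-rating steps can be sketched in Python. This is not the logic of GERP_modifytree.pl itself: the tree is held as nested `(name, branch_length, children)` tuples rather than parsed from parenthesis format, all function names are hypothetical, the root is assumed to carry a branch length of 0, and the sketch tracks only branch lengths (it does not rebalance the root to keep a strict unrooted shape, which leaves the summed length unchanged). Treating the 0.5 threshold as a comparison against the expected rate is likewise an assumption.

```python
def total_length(node):
    """Sum of all branch lengths at and below this node."""
    _, length, children = node
    return length + sum(total_length(child) for child in children)


def prune(node, excluded):
    """Drop excluded leaves; splice out internal nodes left with one child."""
    name, length, children = node
    if not children:  # leaf
        return None if name in excluded else node
    kept = [sub for sub in (prune(c, excluded) for c in children) if sub is not None]
    if not kept:
        return None
    if len(kept) == 1:
        # Merge this branch into the surviving child's branch.
        child_name, child_length, grandchildren = kept[0]
        return (child_name, length + child_length, grandchildren)
    return (name, length, kept)


def expected_rate(tree, excluded, neutral_rate):
    """Pro-rate the neutral rate by the fraction of tree length retained."""
    pruned = prune(tree, set(excluded))
    if pruned is None or not pruned[2]:
        return 0.0  # fewer than two sequences remain
    # The root has no branch of its own, so a stem merged into it is dropped.
    retained = total_length(pruned) - pruned[1]
    return neutral_rate * retained / total_length(tree)


def should_estimate(expected, minimum=0.5):
    """Assumed reading of the 'small neutral evolution' skip rule."""
    return expected >= minimum


# Hypothetical tree: ((A:0.1,B:0.2):0.05,C:0.3,D:0.4)
tree = (None, 0.0, [
    (None, 0.05, [("A", 0.1, []), ("B", 0.2, [])]),
    ("C", 0.3, []),
    ("D", 0.4, []),
])
rate = expected_rate(tree, ["B"], neutral_rate=2.1)
print(round(rate, 3), should_estimate(rate))
```

With B excluded, the A branch absorbs the internal branch (0.1 + 0.05), so 0.85 of the original 1.05 units of tree length remain and the neutral rate is scaled by that fraction.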

5. GERP_findconssegs.pl <GERPRatesFile> <NeutralRate> <MergeDistance> <RejSubsMinimum> <Comma-separatedThresholds>:

This script finds slowly evolving regions within the rates file, subject to the neutral rate, merge distance, rejected substitution, and threshold criteria. The process is as follows:

A vector of rate ratios is generated by dividing the observed rate from each window by its expected rate. This vector is scanned processively from beginning to end; this amounts to scanning the alignment from first to last column.

Identify all groups of consecutive ratios below a specified threshold; each of these groups, defined by start and stop coordinates within the vector, constitutes a candidate constrained element. For example, using the following rates as input (O: observed, E: expected):

O E

A threshold of 1 (i.e., an obs/exp ratio less than or equal to 1) would produce two candidate elements, one starting at position 2 and ending at position 5, and another from 7-9. A threshold of 0.5 would only produce a single candidate, from 2-4. Note that for two thresholds A and B, with A < B, the candidate list generated using threshold A is a subset (in terms of bases) of the list generated with B.

After defining a group of candidate elements, a merge step is performed which will join nearby candidates that are separated by at most MergeDistance columns, where MergeDistance represents the maximum tolerated number of bases that do not meet the ratio threshold. For example, with a MergeDistance of 1 column, the two candidates produced using a threshold of 1 would be merged into one, from 2-9. Note that the MergeDistance parameter can be set to 0 if no merging is desired. Also note that merging proceeds recursively, so an arbitrary number of nearby candidates can be merged provided they meet the MergeDistance criterion.

After defining a list of candidate constrained elements, each candidate is evaluated in terms of rejected substitutions (R).
Rejected substitutions are defined here as the deficit in the number of observed substitutions when compared to the number of expected substitutions. When evaluating candidates that contain unconstrained bases, the rejected substitutions value is allowed to be negative, but is capped at 3 times the expected neutral rate. For example, from the above rate estimates, the candidate region from position 2 to 9 would be scored by summing the per-position terms, giving R = 4.9, and the candidate element from 2-5 would have an R value of 3.

More thorough statistical treatment of these scores and the identification of constrained elements can be found in the published GERP manuscript (Cooper et al. 2005).
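The three stages described in this section (candidate detection, merging, and scoring) can be sketched as follows. This Python code is illustrative and is not taken from GERP_findconssegs.pl; the function names are hypothetical. The rate values are invented so that they reproduce the candidate coordinates described in the text; they are not the values of the original example table, so the resulting scores differ from those quoted there. The cap is implemented under one reading of the text, namely that the observed rate at a position counts for at most 3 times its expected rate; other readings are possible.

```python
def find_candidates(observed, expected, threshold):
    """Return 1-based (start, end) runs with observed/expected <= threshold."""
    candidates, start = [], None
    for pos, (obs, exp) in enumerate(zip(observed, expected), start=1):
        meets = exp > 0 and obs / exp <= threshold
        if meets and start is None:
            start = pos
        elif not meets and start is not None:
            candidates.append((start, pos - 1))
            start = None
    if start is not None:
        candidates.append((start, len(observed)))
    return candidates


def merge_candidates(candidates, merge_distance):
    """Join candidates separated by at most merge_distance columns."""
    merged = []
    for start, end in candidates:
        if merged and start - merged[-1][1] - 1 <= merge_distance:
            merged[-1] = (merged[-1][0], end)  # extend the previous element
        else:
            merged.append((start, end))
    return merged


def rejected_substitutions(observed, expected, start, end, cap=3.0):
    """Sum expected minus observed over 1-based columns start..end.

    Assumption: observed is capped at cap * expected, which bounds how
    negative an unconstrained column's contribution can be.
    """
    total = 0.0
    for obs, exp in zip(observed[start - 1:end], expected[start - 1:end]):
        total += exp - min(obs, cap * exp)
    return total


# Invented rates for 10 columns (not the document's original table).
observed = [1.5, 0.4, 0.2, 0.5, 0.8, 1.3, 0.9, 0.6, 1.0, 2.0]
expected = [1.0] * 10

cands = find_candidates(observed, expected, threshold=1.0)
print(cands)                       # [(2, 5), (7, 9)]
print(merge_candidates(cands, 1))  # [(2, 9)]
print(find_candidates(observed, expected, threshold=0.5))  # [(2, 4)]
```

Because the merge step always compares against the most recently extended element, a chain of nearby candidates collapses into one element in a single pass, which matches the recursive merging described above.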

6. GERP_permutes.pl <PermutationFile(s)Path> <RatesFile> <ConstrainedElements> <MaxNeutralRate> <MergeDist> <RatioThreshold>:

This script generates permutations of the rates in RatesFile, and generates estimates of the number of constrained element bases discovered in each of these permutations, using RS score minimums from 0 to 50, in 0.5 unit increments (note, in the event that 50 does not ultimately satisfy the confidence criterion described above, a warning will be issued and 50 will be used).

Also, note that permuted coordinate files must be supplied in a separate directory; it is assumed that all files within this directory are permuted coordinate files. Each of these files should contain the indices of each alignment column once and only once, one coordinate per line, with the first position set to 1. For example, the following file (applicable to an alignment of length 5):

would generate a new set of rates in which the 3rd column becomes position 1, the 5th column becomes position 2, the 4th becomes position 3, etc. This permuted set of rates is then used for constrained element discovery (identical procedure, using identical parameters, as described in GERP_findconssegs.pl) for each of the 100 RS thresholds. The number of constrained element bases is output at each threshold for each permutation; you may supply as many permutation files as desired, and the average number of constrained element bases identified across all permutations will ultimately be used (in GERP.pl) to estimate the false positive rate.

Note that this script also takes a constrained element file (which must be formatted as the output of GERP_findconssegs.pl) and will exclude positions within these constrained elements from the permutation. This option can be ignored by setting it to NULL.
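Applying a permuted coordinate file and choosing a score threshold from the permutation results can be sketched in Python. Neither function is taken from GERP_permutes.pl or GERP.pl, and both names are hypothetical. The first ignores the exclusion of constrained-element positions. The second assumes that the false positive rate at a threshold is the average number of constrained element bases in the permutations divided by the number found in the actual alignment at the same threshold; the text does not spell out this formula, so treat it as a guess. In the usage example, the last two permutation entries (1 and 2) are invented, since only the first three are described.

```python
def apply_permutation(rates, order):
    """Reorder rates so that new position k holds old column order[k-1].

    `order` holds 1-based column indices, each appearing exactly once.
    """
    if sorted(order) != list(range(1, len(rates) + 1)):
        raise ValueError("each column index must appear once and only once")
    return [rates[index - 1] for index in order]


def pick_rs_threshold(permuted_bases, actual_bases, max_fpr=0.05):
    """Return (threshold, satisfied) for the smallest acceptable threshold.

    permuted_bases: {threshold: [bases found in each permutation]}
    actual_bases:   {threshold: bases found in the real alignment}
    Assumed estimate: FPR = mean(permuted bases) / actual bases.
    If no threshold qualifies, the largest is returned with satisfied=False,
    mirroring the documented fallback to the top of the range.
    """
    for threshold in sorted(permuted_bases):
        counts = permuted_bases[threshold]
        mean_permuted = sum(counts) / len(counts)
        actual = actual_bases[threshold]
        if actual > 0 and mean_permuted / actual <= max_fpr:
            return threshold, True
    return max(permuted_bases), False


print(apply_permutation(["r1", "r2", "r3", "r4", "r5"], [3, 5, 4, 1, 2]))
# ['r3', 'r5', 'r4', 'r1', 'r2']
```

Validating the index list up front catches files with missing or repeated coordinates before any rates are shuffled.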
By default GERP.pl will identify a threshold at which no elements are identified in the permuted alignments, and use constrained elements in the actual alignment that meet this threshold as excluded coordinates (see GERP_highconf_thresh.pl, below). This score may be set directly, however, by setting the high_conf_thresh option in the parameters file (standing for high confidence threshold). Finally, note that positions for which a rate estimate was not made (denoted by a -1 in the rates file) are also excluded if they are flanked on both sides by -1s.

Also note that GERP.pl will only estimate the RS threshold for the first supplied merge distance (see above), and use this same score threshold for all subsequent constrained element identification calls, even if they use a different merge distance criterion.

GERP_highconf_thresh.pl <PermutationFile(s)Path> <RatesFile> <ConstrainedElements> <MaxNeutralRate> <MergeDist> <RatioThreshold>:

This script functions almost identically to GERP_permutes.pl, except the goal is to define the smallest threshold at which no constrained element is identified in ANY of the permuted alignments generated. This script will score constrained elements identified in all the permuted alignments, and will return the minimal score (rounded up to the nearest tenths-place) that exceeds this threshold.
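The high-confidence threshold rule can be illustrated with a short Python sketch. It is not taken from GERP_highconf_thresh.pl, and the function name is hypothetical. It reads "rounded up to the nearest tenths-place" as "the smallest multiple of 0.1 that is strictly greater than the best score seen in any permutation", which is an interpretation, and it returns 0.0 when no permuted elements exist, which is a choice of this sketch. Decimal arithmetic is used to avoid floating-point surprises at the tenths place.

```python
from decimal import Decimal, ROUND_FLOOR


def high_conf_threshold(permuted_scores):
    """Smallest tenth strictly above every element score in any permutation.

    permuted_scores: one list of element scores per permuted alignment.
    """
    scores = [Decimal(str(s)) for run in permuted_scores for s in run]
    if not scores:
        return Decimal("0.0")  # nothing to exceed (sketch-specific choice)
    tenth = Decimal("0.1")
    # Round the top score down to a tenth, then step up by one tenth,
    # so the result always exceeds the top score.
    return max(scores).quantize(tenth, rounding=ROUND_FLOOR) + tenth


print(high_conf_threshold([[3.2, 7.25], [6.0]]))  # 7.3
print(high_conf_threshold([[7.3]]))               # 7.4
```

Stepping up from the floored value keeps the result strictly above the top permuted score even when that score already falls exactly on a tenth.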

References:

Brudno, M., Do, C.B., Cooper, G.M., Kim, M.F., Davydov, E., Green, E.D., Sidow, A., and Batzoglou, S. 2003. LAGAN and Multi-LAGAN: efficient tools for large-scale multiple alignment of genomic DNA. Genome Res 13.

Cooper, G.M., Stone, E.A., Asimenos, G., NISC Comparative Sequencing Program, Green, E.D., Batzoglou, S., and Sidow, A. 2005. Distribution and intensity of constraint in mammalian genomic sequence. Genome Research. In press.

Friedman, N., Ninio, M., Pe'er, I., and Pupko, T. 2002. A structural EM algorithm for phylogenetic inference. J Comput Biol 9.

Smit, AFA & Green, P. RepeatMasker at
