Spotter Documentation

Version 0.5, Released 4/12/2010

Purpose

Spotter is a program for delineating an association signal from a genome-wide association study using features such as recombination rates, genetic distance, linkage disequilibrium, and association p-values.

Requirements

    A program for calculating LD information called new_fugue, written by Goncalo Abecasis. It can be downloaded from: http://genome.sph.umich.edu/wiki/new_fugue

    Python version 2.5 (or greater) and less than 3.0. Python 3 is a separate and incompatible branch.

Synopsis

Typical usage of Spotter:

    spotter --metal association_results_file.txt --snplist rs1,rs2,rs3

If you have a file that already contains a list of SNPs, you can do the following:

    spotter --metal association_results_file.txt --hits file_with_snps.txt

Spotter will parse out the rs### SNPs from that file and run the algorithm on each one.

Installation

Spotter is ready to run out of the box, as long as new_fugue and a Python interpreter are already installed on your system.

We provide as a starting point:

    HapMap phase II CEU build 36 genotype files, for computing LD information
    HapMap recombination rates, build 36
    HapMap genetic map, build 36
    UCSC refflat table (gene information), build 36
    UCSC sequence gap table, build 36

These various pieces of data can be changed by the user.

Configuration

Users will likely wish to supply their own data. To change the data that Spotter uses, the user can edit the conf/config.xml file. Data sources listed under required_data can be provided by the user. Sources listed under created_data must be created by running bin/setup.py, a script for downloading and formatting data from the UCSC database.

Input

Spotter requires 2 pieces of information to run: a file containing association results, and a list of SNPs to use for defining regions.
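As an illustration of how a list of SNPs might be pulled out of a file of hits (the rs### matching mentioned in the Synopsis), here is a minimal sketch. The exact pattern Spotter uses is not documented here, so the regular expression and the function name extract_rs_ids are assumptions made for this example only.

```python
import re

# Assumed pattern: "rs" followed by one or more digits, matched as a whole word.
# Spotter's actual pattern may differ.
RS_PATTERN = re.compile(r"\brs\d+\b")


def extract_rs_ids(text):
    """Return the distinct rs identifiers found in text, in first-seen order."""
    seen = set()
    ordered = []
    for match in RS_PATTERN.finditer(text):
        snp = match.group(0)
        if snp not in seen:
            seen.add(snp)
            ordered.append(snp)
    return ordered


# Made-up file contents, for illustration only.
example_text = "locus\tsnp\nA\trs1\nB\trs2\nnote: rs1 was also seen with rs3"
snp_ids = extract_rs_ids(example_text)  # ['rs1', 'rs2', 'rs3']
```

Duplicates are dropped in this sketch so that each SNP would be processed once; whether Spotter itself does this is not stated.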
The association results file should look like the following (this is just an example):

    snp         chr    pos          p-value
    rs114141    3      191414141    2.7343e-04

Each row should be a SNP, with its chromosome, position, and p-value (from GWAS or meta-analysis). Please take care to provide SNP positions that are from the same build as the HapMap and UCSC files provided with Spotter (build 36 / hg18). The file should be tab-delimited.

SNPs can be provided by using either the --snplist option or the --hits option (see the Options section below).

Options

Argument
    Description

General program options

-o, --out <file>
    Specify output file for report. Spotter attempts to create two files: one containing the identified intervals, and one containing information on each gene found within those intervals. For example, if you specify:

        --out my_project.txt

    Spotter will create the following two report files:

        my_project_intervals.txt
        my_project_genes.txt

--cache <file>
    Change the location of the LD cache file. Spotter caches LD on each run of the program so that it only needs to be computed once per locus. This greatly speeds up additional runs of the program. By default, a file called ld_cache.db is created in the current directory.

Options for specifying SNPs

--snplist <string>
    Run algorithm for each SNP in this list. The list must be specified in quotes, separated by commas. Example: --snplist "rs1,rs2,rs3"

--hits <file>
    Run algorithm for each SNP present in this file. Spotter attempts to pattern match rs### identifiers from the file.

Options for specifying association results

--metal <file>
    Association results file. See the section on Input for more information about the format of this file.

--delim <character>
    The delimiter of the association results file. Defaults to tab.

Options for specifying algorithm

--method <string>
    Specify which algorithm Spotter should use. Currently only a sliding window method is supported. See the Algorithm section below for more information.

Options for sliding window method

--slw_pval <float>
    P-value threshold for sliding window method. This should be specified as a raw p-value, not a log-transformed one.

--slw_usepval
    Toggles using p-values for the sliding window. By default, p-values are not used. Enable this with caution, especially if running over many different loci, since p-value scales are often very different for each locus.

--slw_r2 <float>
    LD (r^2) threshold. Default is 0.5.

--slw_ratethresh <float>
    Recombination rate threshold, in cM/Mb units. Default is 10.

--slw_winsize <int>
    Sliding window size, in bases. Default is 75,000 bases (75 kb).

--slw_userate
    Toggles using recombination rates instead of genetic distance.

--slw_cmdist <float>
    Genetic distance to travel beyond the LD interval, in cM. Default is 0.02.

--slw_usecm
    Toggles using genetic map distance. This is the default.

Output

Spotter creates two output files: one containing intervals identified by the algorithm, the other containing information on genes within each interval. See "--out" for controlling the names of these output files.

Algorithm

Sliding window method

Spotter's primary algorithm uses a sliding window approach. Within each window, the algorithm checks whether any SNP exceeds either the p-value threshold (that is, is more significant than it) or the LD threshold. If at least one SNP within the window exceeds either of these thresholds, the window slides forward and the same check is repeated until it fails. Once a window is found that does not contain a SNP exceeding either the p-value or LD threshold, the algorithm stops scanning. It then finds the last best SNP, i.e. the last SNP to pass the LD or p-value threshold, and expands beyond this point until it reaches either 1) the nearest recombination peak, or 2) a specified genetic distance.
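The scan described above can be sketched roughly as follows. This is not Spotter's actual code: the function names, the data layout (a list of dictionaries with pos, pval, and r2 keys), and the choice to advance by one full window length per step are all assumptions made for illustration, and the SNP and recombination values are invented. Only the rightward scan and the recombination-peak stopping rule are shown.

```python
def snp_passes(snp, r2_thresh=0.5, pval_thresh=None):
    """True if the SNP exceeds the LD threshold or, when enabled, the p-value threshold."""
    if snp["r2"] > r2_thresh:
        return True
    # pval_thresh=None plays the role of p-values being switched off.
    return pval_thresh is not None and snp["pval"] < pval_thresh


def scan_right(snps, index_pos, win_size=75000, r2_thresh=0.5, pval_thresh=None):
    """Slide windows to the right of index_pos; return the position of the last passing SNP."""
    last_pass = index_pos
    start = index_pos
    while True:
        end = start + win_size
        passing = [s["pos"] for s in snps
                   if start < s["pos"] <= end
                   and snp_passes(s, r2_thresh, pval_thresh)]
        if not passing:
            # First window with no passing SNP: stop scanning.
            return last_pass
        last_pass = max(passing)
        start = end  # assumption: the window advances by its full length


def extend_to_peak(last_pass, rates, rate_thresh=10.0):
    """Return the first position at or right of last_pass whose rate exceeds rate_thresh."""
    for pos, rate in sorted(rates):
        if pos >= last_pass and rate > rate_thresh:
            return pos
    return None  # no peak found in the supplied data


# Invented data: r2 is LD with the index SNP at position 1,000,000.
snps = [
    {"pos": 1020000, "pval": 1e-8, "r2": 0.9},  # window 1: passes on both
    {"pos": 1100000, "pval": 1e-3, "r2": 0.6},  # window 2: passes on LD only
    {"pos": 1180000, "pval": 0.2, "r2": 0.1},   # window 3: fails
    {"pos": 1300000, "pval": 1e-9, "r2": 0.8},  # never reached
]
rates = [(1050000, 2.0), (1120000, 14.5), (1200000, 30.0)]  # (position, cM/Mb)

last_snp = scan_right(snps, 1000000, pval_thresh=1e-5)  # 1100000
right_edge = extend_to_peak(last_snp, rates)            # 1120000
```

Running the same scan to the left of the index SNP would give the other boundary of the interval. Stopping at a fixed genetic distance instead of a recombination peak would replace extend_to_peak with a lookup in the genetic map.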
The figure below shows how the algorithm works as it scans to the right from the index SNP (in this case, we choose the index SNP as the SNP with the best p-value in the region). For this example, we use the following settings:

    P-value threshold: 1E-05 (--slw_pval 1e-05, --slw_usepval)
    LD threshold: r^2 > 0.5 (--slw_r2 0.5)
    Recombination peak: cM/Mb > 10 (--slw_ratethresh 10, --slw_userate)
    Sliding window size: 75 kilobases (--slw_winsize 75000)

Here, we've chosen a particular locus, and want to identify the interval near the index SNP. We start by scanning to the right of the index SNP:

    Window 1: succeeds, since SNPs underneath the window exceed both the p-value and LD thresholds
    Window 2: succeeds, since SNPs underneath the window exceed the LD threshold
    Window 3: fails, since no SNPs are above the p-value or LD threshold
The last SNP to exceed either the p-value or LD threshold is shown with a red arrow. From here, we scan to the right to find the nearest recombination peak, shown with a blue arrow (it's actually quite difficult to see here; you may need to zoom in!). This exact same procedure would also be repeated to scan to the left of the index SNP (not shown here). The interval detected in this example is shown by two black lines (or the highlighted blue region in the gene track).

Figure: A visual overview of how the sliding window algorithm detects an association signal interval. The plot shows association results from a GWAS or meta-analysis of GWAS studies. Each point is a SNP, colored by its LD (r^2) value with the index SNP. The x axis is genomic position, and the y axis is the -log10 of the association p-value.

Visualizing Results

Plots such as the one in the figure above can be created by using our software at: http://csg.sph.umich.edu/locuszoom/. The interval selected by Spotter can be plotted using the "Highlight Region of Interest" feature under the "Plot Using Your Data" section, or by supplying histart=# and hiend=# parameters under the "Plots Using Your Data And Your Hitspec File" batch mode section.

Licensing

Spotter is written by Ryan Welch (welchr@umich.edu) and is copyrighted by the University of Michigan. This software is licensed under the MIT free software license:

Copyright (c) 2010, The University of Michigan

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.