SOLOMON: Parentage Analysis

Corresponding author: Mark Christie (christim@science.oregonstate.edu)

Table of Contents:

    Installing SOLOMON on Windows/Linux
    Installing SOLOMON on Mac
    Modules
    Input File Format
    Creating an Allele Frequency File
    Creating Simulated Data Sets
    Power Analysis
    Exclusion
    Bayes Parentage
    Choosing a posterior, etc.
    Including Known Parents
    Siblings
    Frequently Asked Questions

Installing SOLOMON on Windows or Linux:

SOLOMON is implemented as an R package. No knowledge of the R programming language or statistical environment is needed, because SOLOMON runs solely as an interactive program with a graphical user interface. SOLOMON is available through the Comprehensive R Archive Network (CRAN). To install, follow these steps:

(1) Download and install the latest version of R (at least version 2.15.1, which is available at http://cran.cnr.berkeley.edu/index.html). The 32-bit and 64-bit versions should both work, though SOLOMON was most extensively tested on the 64-bit version.

(2) Open R and copy and paste the following code into the console:

    install.packages("solomon")

SOLOMON is now installed. To run the program, type:

    library(solomon)
    solomon()

You can close R at any time, and the package will remain installed. To reload the package when starting a new R session, retype the two lines above, which load and launch the program, respectively. These commands are case sensitive.

Troubleshooting:

Problem: The package does not install correctly.

Possible solutions:
  - Make sure you are using R 2.15.1 or later.
  - Consider manually installing the package tcltk from CRAN.
  - Search online for "installing R packages".

Please contact me if any of the above instructions are incorrect.

Installing SOLOMON on a Mac:

SOLOMON is available through the Comprehensive R Archive Network (CRAN). To install on a Mac, follow these steps:

(1) Download and install the latest version of R (at least version 2.15.1, which is available at http://cran.cnr.berkeley.edu/index.html). The 32-bit and 64-bit versions should both work, though SOLOMON was most extensively tested on the 64-bit version.

(2) Go to http://cran.r-project.org/bin/macosx/tools/ and download and install tcltk-8.55 (note that this comes with its own installer). You may have to restart R.

(3) Open R and copy and paste the following code into the console:

    install.packages("solomon")

SOLOMON is now installed. To run the program, type:

    library(solomon)
    solomon()

You can close R at any time, and the package will remain installed. To reload the package when starting a new R session, retype the two lines above, which load and launch the program, respectively. These commands are case sensitive.

SOLOMON modules:

After typing solomon() you should see the following screen:

[Screenshot: SOLOMON main menu]

Clicking on the button next to any of these options brings up a new window. From top to bottom, the modules are:

  Frequencies: Calculates allele frequencies from a data set, or generates allele frequencies from user-specified numbers of loci and alleles per locus. Creates an allele frequency file.

  Sims: Uses an allele frequency file to create data sets. The user specifies the number of wanted parents, the number of offspring per parent, the genotyping error rate, the number of unrelated individuals, and an allele frequency file.

  Power: Uses allele frequencies from genotype data sets to calculate the expected number of pairs that share alleles by chance.

  Exclusion: Performs Mendelian exclusion. The user can choose how many loci to allow to mismatch and whether to allow missing data to match to scored data.

  Bayes: Performs Bayesian parentage analysis.

Parentage options:

  No Known Parents: No parents of any offspring are known.

  One Known Parent: One parent of an offspring has been genotyped (e.g., mothers at a nest).

  Known Parent-Pairs: Observational data lets you know which males and females were paired together.

Input File Format:

The input file format is relatively simple and should be easy to create from most population genetic and bioinformatics file formats. If you have any difficulties converting a file into this format, please contact me.

SOLOMON input files should be tab-delimited text files (easy to create in Microsoft Excel, for example). The input files can be named whatever you like because you will load them using a standard file-select window, but as stated above they should be text files (ending in .txt). For parentage analysis with no known parents only two files are required (e.g., adults.txt and juveniles.txt). For the other parentage analyses three files are required (e.g., moms.txt, dads.txt, and juveniles.txt).

Below is an example of an input file of 6 individuals genotyped at 4 loci:

    Individuals  Locus1  Locus1  Locus2  Locus2  Locus3  Locus3  Locus4  Locus4
    Adult1       105     113     190     225     78      100     107     116
    Adult2       0       0       193     207     100     100     98      113
    Adult3       90      90      173     210     87      94      98      113
    Adult4       93      119     193     205     100     116     107     122
    Adult5       110     125     179     237     78      122     104     107
    Adult6       93      102     181     210     94      103     0       0

All files with genetic data should have exactly the same format:

  - The first column must contain a unique ID for every individual, and the first row should contain a header for each locus. The locus and ID names here are just examples; you can use any nomenclature that you prefer.
  - SOLOMON only uses codominant genetic data, so each locus must have two columns (one for each allele).
  - Alleles cannot be greater than 900. Two-digit or three-digit alleles (or both, as in the example above) are acceptable.
  - Missing data must be entered as 0.
  - The headers (first row) in the adult and juvenile files must be identical.
  - Monomorphic loci will be ignored in the Bayesian analyses.

An easy way to see the file format is to create some data yourself using the Frequencies and Sims modules (see below).
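The format rules above can be checked before loading a file into SOLOMON. The following is a minimal sketch in Python, written only for illustration: the function name read_genotypes and the constant MAX_ALLELE are hypothetical and are not part of SOLOMON, and the checks cover only the rules stated in this section (one ID column, two columns per locus, unique IDs, alleles between 0 and 900 with 0 meaning missing).

```python
import io

MAX_ALLELE = 900  # the manual states that alleles cannot be greater than 900


def read_genotypes(handle):
    """Parse a tab-delimited genotype table (hypothetical helper, not SOLOMON code).

    Returns (locus_names, genotypes), where genotypes maps each individual ID
    to a list of (allele1, allele2) tuples, one tuple per locus.
    """
    header = handle.readline().rstrip("\r\n").split("\t")
    n_cols = len(header)
    if n_cols < 3 or (n_cols - 1) % 2 != 0:
        raise ValueError("expected one ID column plus two columns per locus")
    genotypes = {}
    for line in handle:
        if not line.strip():
            continue  # skip blank lines
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) != n_cols:
            raise ValueError(f"wrong number of columns for {fields[0]!r}")
        ind = fields[0]
        if ind in genotypes:
            raise ValueError(f"duplicate individual ID: {ind!r}")
        alleles = [int(x) for x in fields[1:]]
        if any(a < 0 or a > MAX_ALLELE for a in alleles):
            raise ValueError(f"allele out of range for {ind!r}")
        # Pair up consecutive columns: (allele 1, allele 2) for each locus.
        genotypes[ind] = list(zip(alleles[0::2], alleles[1::2]))
    return header[1::2], genotypes


# First two loci of Adult1 and Adult2 from the example table above.
example = (
    "Individuals\tLocus1\tLocus1\tLocus2\tLocus2\n"
    "Adult1\t105\t113\t190\t225\n"
    "Adult2\t0\t0\t193\t207\n"
)
loci, adults = read_genotypes(io.StringIO(example))
```

To check a real file, pass an open file object instead, for example open("adults.txt"), and compare the returned locus names between the adult and juvenile files.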

Creating an Allele Frequency File:

Upon clicking FREQUENCIES, you will see the following menu:

[Screenshot: FREQUENCIES menu]

First set a working directory (C:/SOLOMON is the default), which can be any directory in your file system. Regardless of your operating system, use forward slashes in the path. Because of the way the graphical user interface was created, you must always click the OK button for your choices to be passed to R.

If you have a data set whose allele frequencies you want to use, choose Create Allele Frequency File with a Data Set. The input file must be in the format described under Input File Format. This feature is useful because you can use the allele frequency file to create a wide variety of simulated data sets (e.g., varying the numbers of sampled individuals, parent-offspring pairs, offspring per pair, siblings, etc.).

If you do not have a data set, or if you wish to experiment with different sorts of data sets, choose Create an Allele Frequency File. Simply enter the number of loci and the number of alleles per locus. Allele frequencies are modeled as:

$$ z_i = \frac{(N_a + 1 - i)/i}{\sum_{j=1}^{N_a} (N_a + 1 - j)/j} $$

where $N_a$ equals the total number of alleles and $i$ indexes the alleles in the set $1, \ldots, N_a$. This distribution is fairly conservative, as it results in several fairly common alleles.
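As an illustration, the frequency model can be evaluated in a few lines. This is a sketch in Python rather than SOLOMON's own code; it assumes each allele i receives a weight of (Na + 1 - i)/i that is then normalized to sum to one, and the function name simulated_allele_frequencies is hypothetical. Under that assumption, five alleles reproduce the values shown in the example AlleleFrequencies.txt file in this manual.

```python
def simulated_allele_frequencies(n_alleles):
    """Return allele frequencies for one locus with n_alleles alleles.

    Assumed model: weight_i = (Na + 1 - i) / i, normalized over i = 1..Na.
    """
    weights = [(n_alleles + 1 - i) / i for i in range(1, n_alleles + 1)]
    total = sum(weights)
    return [w / total for w in weights]


freqs = simulated_allele_frequencies(5)
print([round(f, 6) for f in freqs])
# prints [0.574713, 0.229885, 0.114943, 0.057471, 0.022989]
```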

Allele frequency files are written to the working directory as AlleleFrequencies.txt. If you are creating an allele frequency file from scratch, the resulting file simply contains 2 columns, one with the locus ID and one with the allele frequency (allele identities are not created). Here is an example of two loci with five alleles each:

    1  0.574713
    1  0.229885
    1  0.114943
    1  0.057471
    1  0.022989
    2  0.574713
    2  0.229885
    2  0.114943
    2  0.057471
    2  0.022989

If you create an allele frequency file from genotypic data, SOLOMON will produce two allele frequency files. One is identical in format to the file displayed above and should be used if you would like to create data sets to experiment with in the Create test data sets module (see Creating Simulated Data Sets below). An additional user-friendly file is also created that contains the locus name, allele name, count, and frequency:

    Locus    allele  count  frequency
    Locus.1  100     1      0.005
    Locus.1  101     3      0.015
    Locus.1  102     6      0.03
    Locus.1  104     12     0.06
    Locus.1  105     50     0.25
    Locus.1  106     38     0.19
    Locus.1  107     16     0.08
    Locus.1  108     9      0.045
    Locus.1  109     19     0.095
    Locus.1  110     19     0.095
    Locus.1  112     1      0.005
    Locus.1  113     1      0.005
    Locus.1  114     1      0.005
    Locus.1  115     1      0.005
    Locus.1  116     21     0.105
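The count and frequency columns can be reproduced by tallying alleles at a locus. The sketch below is in Python and is not SOLOMON's implementation; the function name allele_table is hypothetical, and excluding missing alleles (coded 0) from the total is an assumption about how the frequencies are normalized.

```python
from collections import Counter


def allele_table(locus_genotypes):
    """Return sorted (allele, count, frequency) rows for one locus.

    locus_genotypes is a list of (allele1, allele2) tuples. Alleles coded 0
    are treated as missing and left out of the total (an assumption).
    """
    counts = Counter(a for pair in locus_genotypes for a in pair if a != 0)
    total = sum(counts.values())
    return [(allele, n, n / total) for allele, n in sorted(counts.items())]


# Locus1 genotypes of Adult1..Adult6 from the Input File Format example.
locus1 = [(105, 113), (0, 0), (90, 90), (93, 119), (110, 125), (93, 102)]
table = allele_table(locus1)
for allele, count, freq in table:
    print(allele, count, freq)
```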

Creating Simulated Data Sets:

If you have created an allele frequency file (see Creating an Allele Frequency File), you can now create simulated data sets to test with SOLOMON. Simulated data sets are useful for many reasons, including running them through the Bayesian modules to see how many parent-offspring pairs would be identified for various sample sizes, error rates, and marker sets. This module creates three SOLOMON input files (see Input File Format): Dads, Moms, and Juveniles. If you click on the SIMS button from the main menu you will see the following menu:

[Screenshot: SIMS menu]

As before, you must set your working directory, and you must click OK after typing a value into a field for that value to be passed to R. Then enter the following:

  - The number of wanted parents (i.e., the number of true parents) and the number of offspring to be produced by each of those parents in accordance with Mendelian expectation.
  - A genotyping error rate (0.01 is a commonly assumed value for microsatellites); enter 0 if you do not want to introduce genotyping error. Genotyping errors are distributed randomly across individuals and loci by copying the alleles at a locus from a randomly selected individual.
  - The number of unrelated individuals you wish to create. This number of unrelated individuals will be added to each of the three output files.
  - The number of full-sibling pairs to split between the Dads and Juveniles files, if you want to examine the effects of full siblings.

You may enter 0 in the unrelated individuals field, the full-siblings field, or both if you do not wish to create these types of genotypes.

After selecting your allele frequency file and clicking Run, you will end up with simulated data sets for mothers, fathers, and juveniles. If, for example, you chose 5 parents, 1 offspring per parent, 2 unrelated individuals, and 2 full siblings, then the Dads file would look like this:

    IDs           Locus.1  Locus.1.1  Locus.2  Locus.2.1  Locus.3  Locus.3.1
    Dad 1         100      105        101      101        102      105
    Dad 2         105      108        101      102        104      105
    Dad 3         107      108        101      101        103      105
    Dad 4         106      108        102      102        101      105
    Dad 5         108      114        102      102        102      105
    Individual 1  106      114        101      102        102      102
    Individual 2  102      105        101      103        101      105
    Sibling 1     104      114        102      102        102      103
    Sibling 2     106      108        101      103        101      102

Notice that the file has five true fathers, two unrelated individuals, and two siblings. Each of the siblings belongs to a pair, with its full-sibling counterpart placed in the offspring file. The offspring and mothers are written to separate files (not shown here). Two unrelated individuals will be added to each of those files as well. If multiple offspring per parent are created, they will be named Offspring1.1, Offspring1.2, etc., where the first number represents the family and the second number represents the unique individual. Larger data sets (thousands of individuals, hundreds of loci) may take several minutes to create.
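To make "offspring produced in accordance with Mendelian expectation" concrete, here is a minimal Python sketch of drawing one offspring genotype from two parents. It is an illustration only and not the code SOLOMON uses: the function name make_offspring is hypothetical, the mother's genotype is invented for the example, and genotyping error is not simulated.

```python
import random


def make_offspring(mother, father, rng):
    """Draw one offspring genotype: at each locus, one randomly chosen
    allele from the mother and one from the father (Mendelian segregation).

    mother and father are lists of (allele1, allele2) tuples, one per locus.
    """
    return [
        (rng.choice(m_locus), rng.choice(f_locus))
        for m_locus, f_locus in zip(mother, father)
    ]


dad5 = [(108, 114), (102, 102), (102, 105)]  # "Dad 5" from the example Dads file
mom5 = [(100, 106), (101, 103), (101, 104)]  # hypothetical mother, for illustration
rng = random.Random(1)  # fixed seed so the sketch is repeatable
child = make_offspring(mom5, dad5, rng)
print(child)
```

By construction the simulated child shares at least one allele with each parent at every locus, which is the property that the Exclusion module tests for.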

Power Analysis:

For data sets with no known parents, you can calculate the expected number of false pairs (i.e., pairs that share alleles by chance alone) using only the allele frequencies. Here, we employ equations 1 through 4 of Christie (2010)*. To quickly calculate the expected number of false pairs, click on POWER in the main menu to bring up the following menu:

[Screenshot: POWER menu]

This module requires an Adult genotype file and a Juvenile genotype file (they can be simulated or real). The first field asks for the minimum number of loci to be considered in the analysis. Next enter the working directory and select the adult and juvenile files. Then press Run.

Choosing the minimum number of loci: If your data set has 100 loci, you can enter 100 to include all loci. To determine whether you could have high power with fewer loci, you could change this field to 80, for example. This module is relatively fast, so testing a range of values should not take long. If you are designing a project, you should aim for as few expected false pairs as is logistically possible.

* Christie, M. R. 2010. Parentage in natural populations: novel methods to detect parent-offspring pairs in large data sets. Mol Ecol Resour 10:115-128.
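The idea behind an expected number of false pairs can be sketched as follows. This Python example is not a transcription of the published equations and is not SOLOMON's code. It assumes Hardy-Weinberg genotype proportions, independent loci, no genotyping error, and that every adult-juvenile combination is a candidate pair; the function names are hypothetical. It enumerates all genotype pairs at a locus to find the probability that two unrelated individuals share at least one allele, multiplies across loci, and scales by the number of adult-juvenile comparisons.

```python
from itertools import combinations_with_replacement


def prob_share_allele(freqs):
    """Probability that two unrelated individuals share at least one allele
    at a locus with the given allele frequencies (Hardy-Weinberg assumed)."""
    genotypes = []
    for i, j in combinations_with_replacement(range(len(freqs)), 2):
        # Homozygote probability p_i^2, heterozygote probability 2 p_i p_j.
        p = freqs[i] ** 2 if i == j else 2 * freqs[i] * freqs[j]
        genotypes.append(({i, j}, p))
    return sum(p1 * p2 for g1, p1 in genotypes for g2, p2 in genotypes if g1 & g2)


def expected_false_pairs(loci_freqs, n_adults, n_juveniles):
    """Expected number of unrelated adult-juvenile pairs that share an allele
    at every locus by chance (independent loci assumed)."""
    prob_all_loci = 1.0
    for freqs in loci_freqs:
        prob_all_loci *= prob_share_allele(freqs)
    return n_adults * n_juveniles * prob_all_loci


# Toy example: three loci, each with two equally common alleles.
loci = [[0.5, 0.5]] * 3
print(prob_share_allele(loci[0]))          # prints 0.875
print(expected_false_pairs(loci, 10, 20))  # prints 133.984375
```

With so few, weakly variable loci almost every pair matches by chance, which illustrates why adding loci (or more variable loci) reduces the expected number of false pairs.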

Exclusion:

If you have no known parents and wish to perform exclusion, click on the first EXCLUSION button to launch the following menu:

[Screenshot: EXCLUSION menu]

The first option to select is the number of loci allowed to mismatch. If 0 is selected, then only pairs that share at least one allele at all loci will be identified. Because genotyping errors can cause some true parent-offspring pairs to not share an allele at a locus, users may want to change this value to a positive integer (e.g., 1 or 2). Next, set the working directory and load the adult and juvenile files. Pressing Run will perform exclusion and write three output files to the working directory. To evaluate the effects of mismatching loci, simply change the number of mismatching loci and press Run again; the output files will be overwritten.

Three output files are produced:

(1) Output_by_parent lists all of the offspring assigned to each parent:

    Parent  Number_Offspring  Offspring     Number_Loci_mismatching
    Dad 5   6                 Offspring 1   2
                              Offspring 4   2
                              Offspring 5   0
                              Individual 5  2
                              Individual 6  2
                              Sibling 1.2   2

The first column presents the ID of a particular parent (Dad 5). The second column shows the number of offspring assigned to that parent (6). The third column shows the IDs of the 6 offspring assigned to that parent. Notice that because we used the simulated data set creator in SOLOMON, the IDs reveal exactly what the relationships are. We know that Offspring 5 is the true offspring of Dad 5. Offspring 1 and 4 were the progeny of Dads 1 and 4, respectively, so we know that they are not correct assignments; the same is true of the unrelated Individuals and the Sibling. The last column shows the number of mismatching loci, up to 2, which was the value entered for this particular run. Notice that the true offspring mismatches at 0 loci, while the 5 erroneous assignments mismatch at 2 loci.

(2) Output_by_offspring provides the same output, except that it is sorted from the offspring's perspective. This file is useful for seeing whether an offspring matches multiple parents (i.e., a mother and a father or, more likely, false matches that have occurred by chance).

(3) Output_genotypes is the last file generated:

    Dad 1        0  100  105  101  101  102  105
    Offspring 1  0  105  108  101  101  105  105
    Dad 3        2  107  108  103  103  103  108
    Offspring 1  2  105  108  101  101  105  105

This file shows the IDs, number of mismatching loci, and complete genotypes for each pair identified with the user-specified number of mismatching loci.
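The mismatch counts reported in these files follow from a simple per-locus test: a pair mismatches at a locus when the two individuals share no allele there. The Python sketch below illustrates that test and is not SOLOMON's code. The function name count_mismatches and the missing_matches flag are hypothetical; the flag stands in for the module option that lets missing data (coded 0) match scored data, and its exact behavior in SOLOMON is an assumption here.

```python
def count_mismatches(parent, offspring, missing_matches=True):
    """Count loci at which a candidate parent and offspring share no allele.

    Genotypes are lists of (allele1, allele2) tuples, one per locus.
    A locus with missing data (allele coded 0) counts as a match when
    missing_matches is True and as a mismatch otherwise (assumed behavior).
    """
    mismatches = 0
    for (p1, p2), (o1, o2) in zip(parent, offspring):
        if 0 in (p1, p2, o1, o2):
            if not missing_matches:
                mismatches += 1
            continue
        if not {p1, p2} & {o1, o2}:
            mismatches += 1
    return mismatches


# Genotypes taken from the Output_genotypes example above.
dad1 = [(100, 105), (101, 101), (102, 105)]
dad3 = [(107, 108), (103, 103), (103, 108)]
offspring1 = [(105, 108), (101, 101), (105, 105)]

print(count_mismatches(dad1, offspring1))  # prints 0
print(count_mismatches(dad3, offspring1))  # prints 2
```

A pair is retained by exclusion when this count is less than or equal to the number of mismatching loci entered in the menu.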

Bayesian Parentage Analysis:

If you select BAYES from the main menu, you will launch the interface for performing parentage analysis with Bayes' theorem. The menu will look like this:

[Screenshot: BAYES menu]

First set your working directory. Next, choose the number of simulated data sets to use for calculating the number of shared alleles (used in the calculation of the prior and the posterior). Testing has shown that 1000 simulated data sets work well for microsatellites, while 100 are sufficient for SNPs. Then select the number of simulated genotypes to create. Testing has shown that 50,000,000 simulated genotypes work well for microsatellites, while 500,000 are sufficient for SNPs. We do not recommend using fewer than the recommended values, as the precision of the posterior probability will decrease.

A good check that appropriate values were chosen is to run the Bayesian analysis a second time and compare the posterior values. If an appropriate number of simulated data sets and genotypes were selected, then the posterior values for each pair should differ by less than 0.001 between runs.

Run time increases with the size of the data sets and with the values chosen above. An average data set should complete in about 30 minutes. A large data set (e.g., thousands of individuals or loci) can take several hours (4-5). Please contact me if a run has taken longer than 48 hours.
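The two-run check described above can be automated once the posterior values from each run have been read into memory. The following Python sketch is an illustration, not part of SOLOMON; the function name max_posterior_difference is hypothetical, and the values in run2 are invented for the example.

```python
def max_posterior_difference(run_a, run_b):
    """Largest absolute difference in posterior value between two runs.

    Each run is a dict mapping (adult_id, juvenile_id) to a posterior value.
    Only pairs reported in both runs are compared.
    """
    shared = run_a.keys() & run_b.keys()
    if not shared:
        raise ValueError("the two runs have no pairs in common")
    return max(abs(run_a[pair] - run_b[pair]) for pair in shared)


run1 = {("Dad 33", "Offspring 33"): 0.000244, ("Dad 37", "Offspring 37"): 0.000328}
run2 = {("Dad 33", "Offspring 33"): 0.000251, ("Dad 37", "Offspring 37"): 0.000320}  # hypothetical second run

consistent = max_posterior_difference(run1, run2) < 0.001
print(consistent)  # prints True
```

If the largest difference is 0.001 or more, increase the number of simulated data sets and genotypes and repeat both runs.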

Bayesian Output:

After performing a Bayesian parentage analysis there will be three output files. The first output file is Output_Pr(Phi)_Bayesian Prior, which displays the number of mismatching loci and the corresponding Pr(Phi) value (see eqn. 1 of the paper). This equals the probability that any given pair that mismatches at a given number of loci has occurred by chance:

    Number of Mismatching Loci  Pr(Phi)
    0                           0.007689
    1                           0.664317
    2                           1
    3                           1
    4                           1
    5                           1
    6                           1

In the above output, any pair that matched at all loci (0 mismatches) would have a 0.0077 probability of occurring by chance. Any pair that mismatched at 1 locus would have a 0.66 probability of occurring by chance. This is the prior that is subsequently used in Bayes' theorem. Pairs with a number of mismatching loci for which Pr(Phi) equals one are not evaluated further (see the paper for details).

The second file that is written is graphical output (in the form of a pdf file) entitled Output_Dataset_Power:

[Figure: Output_Dataset_Power plots]

For both plots the x-axis represents the number of mismatching loci. The top plot illustrates the number of observed pairs (green points) and the expected number of false pairs (blue points), that is, the number of pairs that are expected to occur by chance. Notice that the top plot is on a log10 scale. Pr(Phi) equals the expected number of false pairs divided by the total observed number of pairs (for a given number of mismatching loci) and is displayed in the bottom plot.

The last output file, Output_Posterior_Probabilities, gives the Bayesian posterior probability for each pair that has a posterior value less than one:

    Adult   Juvenile      NL_mismatch  Probability of pair being false given frequencies of shared alleles
    Dad 16  Offspring 16  0            9.34E-13
    Dad 34  Offspring 34  0            1.03E-06
    Dad 38  Offspring 38  0            1.19E-06
    Dad 21  Offspring 21  0            2.22E-05
    Dad 22  Offspring 22  0            2.93E-05
    Dad 15  Offspring 15  0            3.18E-05
    Dad 18  Offspring 18  0            3.22E-05
    Dad 43  Offspring 43  0            3.22E-05
    Dad 6   Offspring 6   0            3.40E-05
    Dad 11  Offspring 11  0            4.70E-05
    Dad 4   Offspring 4   0            5.97E-05
    Dad 20  Offspring 20  0            5.97E-05
    Dad 42  Offspring 42  0            6.91E-05
    Dad 45  Offspring 45  0            0.00014
    Dad 33  Offspring 33  0            0.000244
    Dad 37  Offspring 37  1            0.000328

The first and second columns show the adult and juvenile IDs, respectively. The third column reports the number of mismatching loci. The fourth column shows the Bayesian posterior, with pairs sorted by increasing values. Pairs with low posterior probabilities (e.g., <0.05) can be identified as parent-offspring pairs. SOLOMON does not choose a cutoff value, but we have found that a good tradeoff between maximizing the number of correct assignments and minimizing false assignments occurs when all pairs with posterior values <0.05 are retained as parent-offspring pairs. Which cutoff value to choose, however, should ultimately be dictated by a cost-benefit analysis weighing the risks associated with committing type I or type II errors. Because the posterior has a clear interpretation, this can be done in a quantitative way. Here, the posterior is interpreted as the probability of a parent-offspring pair being false given the frequencies of shared alleles.
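Both quantities described in this section lend themselves to a short worked example. The Python sketch below is illustrative and not SOLOMON's code. The function names pr_phi and accepted_pairs are hypothetical, capping Pr(Phi) at one is an assumption suggested by the values of 1 in the prior table, and the last row of the example data is invented to show a rejected pair.

```python
def pr_phi(expected_false_pairs, observed_pairs):
    """Expected number of false pairs divided by the observed number of
    pairs for a given number of mismatching loci, capped at 1 (assumed)."""
    if observed_pairs <= 0:
        raise ValueError("no observed pairs for this number of mismatching loci")
    return min(1.0, expected_false_pairs / observed_pairs)


def accepted_pairs(rows, cutoff=0.05):
    """Keep rows whose posterior (probability of the pair being false)
    is below the cutoff. Each row is (adult, juvenile, n_mismatch, posterior)."""
    return [row for row in rows if row[3] < cutoff]


rows = [
    ("Dad 16", "Offspring 16", 0, 9.34e-13),  # from the example output above
    ("Dad 37", "Offspring 37", 1, 0.000328),  # from the example output above
    ("Dad 2", "Individual 9", 1, 0.41),       # hypothetical pair, for illustration
]

print(pr_phi(5, 10))   # prints 0.5
print(pr_phi(20, 10))  # prints 1.0
print([(adult, juvenile) for adult, juvenile, _, _ in accepted_pairs(rows)])
# prints [('Dad 16', 'Offspring 16'), ('Dad 37', 'Offspring 37')]
```

The cutoff is left as a parameter because, as noted above, the appropriate value depends on the relative costs of type I and type II errors in a given study.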

Choosing a posterior:

Choosing a threshold for accepting putative parent-offspring relationships depends on the goals of the study and on weighing the relative risks of type I and type II errors. In general, a cutoff value of 0.05 has been shown to maximize the number of correct assignments while minimizing the number of incorrect assignments.

Choosing a lower number of simulations will result in a faster run time, but the precision of the posterior estimates may be reduced. For a fixed number of simulations, the precision decreases as the posterior value increases. This typically should not be an issue, as most users will be interested in pairs with a low posterior value. If time is not limiting, we suggest using a large number of simulations and running the program twice to check the variance of the posterior values. Precision can also be worse in low-power data sets (e.g., those with large numbers of false pairs), so we recommend using more simulations in these cases.

Known Parents:

If you have one known parent, then you will need an additional file with the known-parent genotypes. Each offspring must have a corresponding known parent (i.e., if there are 10 rows in the offspring file, then there must be 10 rows in the known-parent file). Each known parent must share at least one allele at all loci with the offspring. If you would like to include a known parent with a mismatch, simply change the adult genotype to the offspring genotype at the mismatching locus. Given this data format, if you are creating data from within SOLOMON, you must delete any unrelated individuals from the juvenile and known-parent files. The unknown-parent files can have unrelated individuals.

If you have known parent-pairs, then the members of each pair must occupy the same row of their respective files. The mother and father files should thus have the same number of rows. If, for example, a female was mated to multiple males, then the female genotype must be duplicated on every row on which one of her mates occurs. SOLOMON will not include her genotype frequencies more than once in the analyses.

Siblings:

If you have a species with long generation times, for example, then you may have unknowingly sampled an individual in your adults file that has its full sibling in your juvenile file. If this is a possibility for your system, and could have occurred at high frequency, then we recommend using the Bayesian module that corrects for siblings. This module considers alleles that are identical-by-descent as well as those that are identical-by-state, and modifies the prior accordingly. This results in a more conservative test, so we only recommend it when many full siblings are suspected. Note that we generally do not recommend this approach if two siblings may co-occur in your adults file. Aunts and uncles (i.e., two siblings in your adults file, one of which is a true parent) were never found to be falsely identified in the data sets we tested in the manuscript. The output for the siblings module is identical to the standard Bayesian output, with an additional graph illustrating shared alleles:

[Figure: two plots of shared alleles] SOLOMON output figures illustrating how frequently each number of loci shares an allele identical-by-state and identical-by-descent for unrelated pairs (orange circles), full siblings (blue circles), and parent-offspring pairs (green circles). The plot on the left has a genotyping error rate of 0.01 and the plot on the right has a genotyping error rate of 0.001 (more typical of SNP data sets). Notice that when the error rate is low, almost all parent-offspring pairs will share an allele at all loci, whereas less than 5 percent (dashed line) of full siblings will.

SOLOMON can generate these plots for interested users, and can apply a correction to the Bayesian prior to include alleles that are identical-by-descent, when the Sibs option is selected. Also, please keep in mind that SOLOMON has been developed to identify parent-offspring pairs. If you suspect the presence of many siblings or other relatives, we recommend using alternative approaches (e.g., COLONY, Pedigree Reconstruction Tools). See the SOLOMON website for links to additional programs.

Linux/Macs:

SOLOMON was developed and extensively tested on Windows 7. Limited testing with the Ubuntu distribution of Linux and on Macs has revealed no current issues. Progress bars were written for Windows and have been disabled on other platforms. If running on Linux or Mac, please use your system monitor to check CPU and RAM usage to confirm that the program is still running. Very large data sets can take more than 12 hours to complete.

Contact:

Please feel free to contact me at christim@science.oregonstate.edu with any questions, comments, requests, or recommendations. Please check the website for additional FAQs and read through this manual before sending an email.

Frequently Asked Questions:

Can I run SOLOMON without a graphical user interface?

Yes! Simply email me for scripts to be run from the command line. Once SOLOMON has been uploaded to a repository, I will make these scripts available at the SOLOMON website: http://www.science.oregonstate.edu/~christim/solomon.html

What should SOLOMON users report for peer-reviewed publications?

If using the Bayesian methods, SOLOMON users should always report (1) the prior probabilities, (2) the posterior probabilities, and (3) the number of simulated data sets and genotypes used in the Bayesian calculation. Users are free to directly use or manipulate any SOLOMON output for publications. Please also cite the associated publication.

What assumptions does SOLOMON make?

SOLOMON has been developed to be as assumption-free as possible. Currently, we assume that all loci are in linkage equilibrium. We recommend that users test for and remove loci that are tightly linked.

Could SOLOMON be used to improve findings from other parentage software?

Possibly. For example, SOLOMON calculates the ratio of the expected number of false pairs to the observed number of putative pairs. One minus this quantity is approximately equal to the proportion of true parent-offspring pairs in the data set and may be correlated with the proportion of candidate parents sampled in a data set. The proportion of candidate parents sampled is a necessary parameter in the likelihood program CERVUS. I tested this idea using the empirical steelhead data presented in the manuscript and did not see substantial improvements in the performance of CERVUS, but this approach may yield more positive results in different data sets.
