USER'S MANUAL FOR THE AMaCAID PROGRAM

TABLE OF CONTENTS

Introduction
How to download and install R
Folder
Data
The three AMaCAID models
- Model 1
- Model 2
- Model 3
- Processing times
Changing directory
Running an analysis
Results

Introduction

AMaCAID is a program written to run under R. It is designed to analyze multilocus genotypic patterns in large samples. It computes (i) the number and frequency of the different multilocus patterns present in the dataset, and (ii) the discrimination power of each combination of k markers among the n available. It thus identifies the smallest subset of markers that distinguishes all the multilocus genotypes revealed in the dataset.

AMaCAID can be used with any kind of molecular marker, on datasets mixing different kinds of markers, and also on qualitative characters such as morphological or taxonomic traits. More generally, it can be used to screen any dataset characterizing a set of individuals (e.g. population genetics studies) or species (e.g. taxonomic or phylogenetic studies) for discrimination purposes.

There is no limit on the size of the sample, but all combinations of markers can be computed only for datasets involving fewer than 25 markers. For larger numbers of markers/characters, AMaCAID can be asked to screen a limited number of combinations of markers.

The aim of this manual is to enable beginners with R to use AMaCAID easily. Particular attention has been given to the details of where to download and save the files and of how the dataset should be constructed. Each paragraph is illustrated by screenshots. The script and example files are available online: http://www.montpellier.inra.fr/brc-mtr/amacaid/
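The manual does not explain why exhaustive screening is limited to fewer than 25 markers. One plausible reason is how quickly the number of combinations grows, which can be counted with base R. This is only an illustrative sketch, not part of AMaCAID; the variable names are arbitrary.

```r
# Number of ways to choose k markers among n, using base R's choose().
n <- 24
combos_per_k <- choose(n, 1:n)

# Total number of non-empty marker subsets, which equals 2^n - 1.
total <- sum(combos_per_k)
print(total)

# The k with the most combinations, and how many there are for that k.
print(which.max(combos_per_k))
print(max(combos_per_k))

# The total roughly doubles with every additional marker.
print(format(2^c(10, 20, 24) - 1, big.mark = ","))
```

Each additional marker doubles the number of subsets to test, which is why screening only a limited number of combinations becomes attractive for larger datasets.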

How to download and install R

R is a free programming environment for data analysis and graphics, available online at: http://www.r-project.org/

On this website the user can download the program in their language. On the starting web page, select CRAN in the left-hand menu. Then choose a country and select one of the proposed URLs. On the download page, select the appropriate operating system: Linux, Mac or Windows.

On the next page, select Base. On the page after that, select Download R.

This will download an .exe file. Launch the installation procedure and follow the instructions.

Folder

Create a folder in which to place the script file (AMaCAIDscript.r) and the input file (data.txt). This folder will be the working directory that R uses for the calculation.

If the name of the script file or of the dataset is changed, AMaCAID won't work.

Changing directory

R needs to know where to find the files it is going to use, so you have to tell it where your folder is. To do that, open R, click on File and then on Change directory. A window opens in which you select the folder containing your dataset and the script. Once it is selected, click OK.
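The same step can also be done from the R console with the base functions setwd() and getwd(). The sketch below is an optional alternative to the menu, not part of AMaCAID: the folder path and the helper check_amacaid_folder() are hypothetical and only serve to illustrate the check.

```r
# Hypothetical helper: reports whether the two required files are in 'dir'.
check_amacaid_folder <- function(dir) {
  required <- c("AMaCAIDscript.r", "data.txt")
  present <- file.exists(file.path(dir, required))
  names(present) <- required
  present
}

# Hypothetical path; replace it with the location of your own folder.
amacaid_dir <- "C:/AMaCAID"

if (dir.exists(amacaid_dir) && all(check_amacaid_folder(amacaid_dir))) {
  setwd(amacaid_dir)   # same effect as File > Change directory
  print(getwd())       # confirm the current working directory
} else {
  message("Folder or required files not found: ", amacaid_dir)
}
```

Checking the file names first catches the most common problem, a renamed script or dataset, before the analysis is started.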

Data

To run AMaCAID, it is first necessary to create the input file, named data.txt. The data can be coded either numerically or alphabetically. For genotypic data, it is preferable to use numbers to code alleles, with a 1-digit (1-9), 2-digit (01-99) or 3-digit (001-999) number per allele. For morphological or taxonomic characters, simply use the same code, number or word for identical character values. A single string (word or number) must be used to describe each character value. The columns must be separated by tabs.

The file must be built as follows. The first line contains the names of the columns of the table: the first column should contain the names of the accessions/individuals/species to consider, and the following columns contain the characters or loci for which the individuals have been screened. On the second line, the first item is the name of the first individual or sample, the second is the genotype of that individual at the first locus (in the microsatellite example below, coded with a 3-digit number for each allele), the third is the genotype at the second locus, and so on.

You can encode missing data as you wish (in the microsatellite example below, 000000 denotes missing data), but for simplicity when reading the output file we recommend using a string of the same length as a genotype or character value (for example, a string of length 6 such as nnnnnn when alleles are coded with a 3-digit number).

Note that identical heterozygous genotypes must always be written with the same code, otherwise they will be treated as different genotypes (for instance, AMaCAID will consider 110113 and 113110 to be two different genotypes).

For example, for microsatellite genotyping patterns obtained on diploid individuals, the input file should have the following format:

    LineCode  MTIC37  MTIC59  ENPB1   MTIC86
    L000043   095095  113113  269269  000000
    L000044   095095  110110  267267  130130
    L000045   101101  100100  273273  134134
    L000046   086086  100100  273273  128128

For AFLP data, the input file should be as follows:

    LineCode  aflp1  aflp2  aflp3  aflp4  aflp5  aflp6
    L000043   0      0      0      1      1      1
    L000044   1      1      1      1      0      0
    L000045   1      1      1      1      1      1
    L000046   0      0      0      0      1      1

For SNP data:

    LineCode  snp1  snp2  snp3  snp4  snp5  snp6
    L000043   00    01    00    01    01    01
    L000044   11    11    11    11    01    01
    L000045   11    11    11    11    11    11
    L000046   01    01    01    01    11    11

Here individual L000043 is homozygous for the allele named 0 at snp1 and heterozygous at the locus named snp2. As with SSR data, identical heterozygous genotypes must always be written with the same code.

For morphological data:

    LineCode  PetalColor  LeafShape    FlowerPerInflorescence
    L000043   white       denticulate  5
    L000044   white       denticulate  3
    L000045   pink        denticulate  3
    L000046   purple      rounded      5

No empty lines are needed between samples. The numbering of samples does not need to be sequential, and samples do not need to be ordered.
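The rule about heterozygous genotypes can be enforced before running AMaCAID. The following is a minimal sketch, assuming diploid data with fixed-width allele codes; the helper sort_alleles() is hypothetical and not part of AMaCAID, and the two 110113/113110 rows are invented to illustrate the problem.

```r
# Tab-separated rows in the layout of the microsatellite example.
# The MTIC59 values are made up to show the allele-ordering problem.
lines <- c(
  "LineCode\tMTIC37\tMTIC59",
  "L000043\t095095\t113110",
  "L000044\t095095\t110113"
)

# Read every column as text so that leading zeros are preserved.
dat <- read.table(text = lines, header = TRUE, sep = "\t",
                  colClasses = "character")

# Hypothetical helper: put the two alleles of a diploid genotype in a fixed
# order, so that 113110 and 110113 receive the same code.
# Assumes each allele is coded with 'width' characters.
sort_alleles <- function(g, width = 3) {
  a1 <- substr(g, 1, width)
  a2 <- substr(g, width + 1, 2 * width)
  ifelse(a1 <= a2, paste0(a1, a2), paste0(a2, a1))
}

# Apply the helper to every locus column (all columns except the first).
dat[-1] <- lapply(dat[-1], sort_alleles)
print(dat)
```

After this step both individuals carry the same code at MTIC59, so they would no longer be counted as two different genotypes. To use the result, the table would have to be written back to data.txt with tab separators.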

You can create your dataset using Excel and save it in text format (data.txt) with Tab as the separator. If the data are not saved in this format, AMaCAID will not work. Be careful to leave no spaces within words or numbers, because R does not distinguish between a space and a tab when reading the file: if a space is left in one of the columns, R will treat it as two distinct columns and an error message will appear. R is case sensitive, so a character value must always be written in exactly the same way. For instance, Right and right will be considered two different entries.

Four files showing how to build input files are available: three for different kinds of data (example_codominant.txt, example_aflp.txt, example_morphodata.txt) and one for polyploid organisms (example_tetraploid.txt). To analyze or test one of these datasets using AMaCAID, you will have to rename the file to data.txt.

The three AMaCAID models

Model 1

This is the most complete model. It browses the dataset and generates all the combinations of k loci/characters among the n loci/characters available. It produces n text files, one for each value of k (named AllCombinationTested_kloci.txt), each containing the list of the k loci/characters that maximize the number of individuals/accessions/species that can be discriminated, the number of discriminated genotypes (or accessions, species, etc.), what they are, and their occurrences. Finally, a graph is drawn representing the maximum number of discriminated genotypes as a function of the number of characters used. To reduce processing times, the program stops the calculation as soon as the maximum number of discriminated genotypes is reached.

Model 2

This model lets the user choose the number of characters (k) to use for the calculation. The result of this model is saved in a text file named OutputModel2_klocus.txt.

Model 3

In this model the user chooses the number of draws (d) the program should perform. The model is therefore not exhaustive: it generates only d combinations of k characters among the n characters available. The results are the same as those of the first model, but the output files are named OutputModel3_kloci.txt.
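The idea behind the exhaustive and the sampling approaches can be sketched in a few lines of base R. This is not the AMaCAID script itself, whose internals are not described in this manual: the functions n_patterns(), best_for_k() and random_best() are hypothetical names, and drawing the random combinations independently with sample() is an assumption. The toy data are the AFLP example rows.

```r
# Toy dataset: the marker columns of the AFLP example, read as text.
dat <- data.frame(
  aflp1 = c("0", "1", "1", "0"), aflp2 = c("0", "1", "1", "0"),
  aflp3 = c("0", "1", "1", "0"), aflp4 = c("1", "1", "1", "0"),
  aflp5 = c("1", "0", "1", "1"), aflp6 = c("1", "0", "1", "1"),
  stringsAsFactors = FALSE
)

# Number of distinct multilocus patterns for a given set of columns.
# A separator is used so that adjacent values cannot run together.
n_patterns <- function(dat, cols) {
  keys <- do.call(paste, c(unname(as.list(dat[cols])), sep = "_"))
  length(unique(keys))
}

# Exhaustive search for one k: test every combination of k columns.
best_for_k <- function(dat, k) {
  combos <- combn(ncol(dat), k, simplify = FALSE)
  counts <- vapply(combos, function(cols) n_patterns(dat, cols), integer(1))
  best <- which.max(counts)
  list(loci = combos[[best]], n = counts[best])
}

# Sampling search: test only d randomly drawn combinations of k columns.
random_best <- function(dat, k, d) {
  draws <- replicate(d, sort(sample(ncol(dat), k)), simplify = FALSE)
  counts <- vapply(draws, function(cols) n_patterns(dat, cols), integer(1))
  best <- which.max(counts)
  list(loci = draws[[best]], n = counts[best])
}

# Exhaustive loop over k, stopping once every pattern is discriminated.
total <- n_patterns(dat, seq_len(ncol(dat)))
for (k in seq_len(ncol(dat))) {
  res <- best_for_k(dat, k)
  cat("k =", k, "| loci:", res$loci, "| patterns:", res$n, "\n")
  if (res$n == total) break
}

# Sampling search with 5 draws of 3 columns.
set.seed(1)
str(random_best(dat, k = 3, d = 5))
```

The exhaustive search is guaranteed to find the best set for each k but its cost grows with the number of combinations, whereas the sampling search has a fixed cost of d evaluations and may miss the best set.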

For small datasets we suggest the first model, because it is exhaustive. For large datasets we recommend the third model with a relatively low number of iterations. The second model is practical when the user knows how many characters they are going to use, or when the third model has been applied for a first look and the user wants to refine the results for a given number of characters.

Processing times

The table at the end of this manual gives the processing times for different datasets with the different models.

Running an analysis

Here we assume that the folder containing the dataset (data.txt) and the script file (AMaCAIDscript.r), downloaded from http://www.montpellier.inra.fr/brc-mtr/amacaid/, has already been created.

- Open R.
- Change the directory.
- Type exactly: source("AMaCAIDscript.r")
- Choose the model:
  o For model 1: type 1 and press Enter; the calculation begins.
  o For model 2: type 2, press Enter, type the number of characters you want to use, and press Enter; the calculation begins.
  o For model 3: type 3, press Enter, type the number of samplings you want the program to do, and press Enter; the calculation begins.
- When the program has finished, the message Analysis finished appears.

To abort a calculation, press Escape.

Results

The results, all in text format, are saved in the directory chosen at the beginning, the one containing the script and the input file.

Below is an example of an output file generated using Model 1 for k=2:

    RESULTS Model#1
    Set of loci maximizing the number of different haplotypes/genotypes/morpho patterns
    Loci Number : 2 8
    Number of different genotypes/haplotypes/patterns detected with this set of loci 6
    List and frequency of the different genotypes
    "1" "220220223223" 1
    "2" "223223223223" 7
    "3" "223223230230" 2
    "4" "230230220220" 1
    "5" "230230223223" 1
    "6" "230230230230" 2

In this example, the optimal set of 2 loci is the combination of loci number 2 and 8 (these loci are the third and the ninth columns of the input file, since the first column holds the sample names). With these two loci, 6 multilocus genotypes can be distinguished in the sample studied. These 6 genotypes are then listed: for example, the first multilocus genotype is 220220 at locus 2 and 223223 at locus 8, and it is represented once in the sample.

When using models 1 and 3, a graph is generated. The graph appears in a new R window when the calculation is finished. To save it, click on File, then Save as, and choose the format.
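A list of patterns and frequencies like the one in the output file can be reproduced for any chosen pair of loci with base R's table(). This is an illustrative sketch on the first two loci of the SNP example, not the code AMaCAID uses to write its output.

```r
# Columns snp1 and snp2 of the SNP example, stored as text.
dat <- data.frame(
  snp1 = c("00", "11", "11", "01"),
  snp2 = c("01", "11", "11", "01"),
  stringsAsFactors = FALSE
)

# Concatenate the genotypes of the chosen loci into one multilocus pattern.
patterns <- paste0(dat$snp1, dat$snp2)

# Count how many individuals carry each pattern.
freq <- table(patterns)

print(length(freq))   # number of different multilocus patterns
print(as.data.frame(freq, stringsAsFactors = FALSE))
```

Here two individuals share the pattern 1111, so these two loci alone cannot discriminate all four samples.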