USER'S MANUAL FOR THE AMaCAID PROGRAM


TABLE OF CONTENTS

Introduction
How to download and install R
Folder
Data
The three AMaCAID models
  - Model 1
  - Model 2
  - Model 3
  - Processing times
Changing directory
Running an analysis
Results

Introduction

AMaCAID is a program written to run under R. It is designed to analyze multilocus genotypic patterns in large samples. It computes (i) the number and frequency of the different multilocus patterns present in the dataset, and (ii) the discrimination power of each combination of k markers among the n available. It thus identifies the smallest subset of markers that distinguishes all the multilocus genotypes revealed by the full marker set.

AMaCAID can be used with any kind of molecular marker, on datasets mixing different kinds of markers, and also on qualitative characters such as morphological or taxonomic traits. More generally, it can be used to screen any dataset characterizing a set of individuals (e.g. population genetics studies) or species (e.g. taxonomic or phylogenetic studies) for discrimination purposes.

There is no limit on the size of the sample, but all combinations of markers can be computed only for datasets involving fewer than 25 markers. For larger numbers of markers/characters, AMaCAID can be asked to screen a limited number of marker combinations.

The aim of this manual is to enable beginners with R to use AMaCAID easily. Particular attention has been given to details such as where to download and save the files and how the dataset should be constructed. Each paragraph is illustrated by screenshots. The script and example files are available online: http://www.montpellier.inra.fr/brc-mtr/amacaid/
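The limit on exhaustive searches follows from how fast the number of combinations grows. The short R sketch below is only an illustration and is not taken from the AMaCAID script: it counts the combinations of k markers among n = 25, for every k, using the base R function choose(). The total is 2^25 - 1 = 33,554,431 marker subsets.

```r
# Number of marker combinations an exhaustive search has to evaluate.
n <- 25                      # number of markers in the dataset
per_k <- choose(n, 1:n)      # combinations of k markers among n, for k = 1..n
total <- sum(per_k)          # every non-empty subset of markers: 2^n - 1

print(per_k[c(1, 2, 12)])    # counts for subsets of size 1, 2 and 12
print(total)
```

Changing n in the first line shows how quickly the count rises or falls with the number of markers.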

How to download and install R

R is a free programming environment for data analysis and graphics, available online at: http://www.r-project.org/

On this website the user can download the program in their language. On the starting web page, select CRAN in the left-hand menu. Then choose a country and select one of the proposed URLs. On the download page, select the appropriate operating system: Linux, Mac or Windows.

On the next page select Base, and on the page after that select Download R.

This will download an .exe file. Launch the install procedure and follow the instructions.

Folder

Create a folder in which the script file (AMaCAIDscript.r) and the input file (data.txt) will be placed. This folder will be the directory that R uses for the calculation.

If the names of the script file and/or the dataset are changed, AMaCAID won't work.

Changing directory

To work, R needs to know where to look for the files it is going to use, so you have to tell it where your folder is. To do that, open R, click on File and then Change directory. This opens a window in which you select the folder containing your dataset and the script. Once it is selected, click OK.
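The same step can be done from the R console with the base functions setwd() and getwd(). The sketch below is an illustration only: check_amacaid_folder() is a hypothetical helper, not part of AMaCAID, and the folder path in the comments is a placeholder. It reports which of the two required files are present in a folder, and is demonstrated on a temporary folder that contains only data.txt.

```r
# Hypothetical helper (not part of AMaCAID): report which required files exist.
check_amacaid_folder <- function(dir) {
  required <- c("AMaCAIDscript.r", "data.txt")
  present <- file.exists(file.path(dir, required))
  names(present) <- required
  present
}

# Demonstration on a temporary folder that contains only data.txt.
demo_dir <- file.path(tempdir(), "amacaid_demo")
dir.create(demo_dir, showWarnings = FALSE)
file.create(file.path(demo_dir, "data.txt"))
status <- check_amacaid_folder(demo_dir)
print(status)

# With a real folder, the console equivalent of File > Change directory is:
# setwd("C:/path/to/your/folder")   # placeholder path, adapt to your machine
# getwd()                           # shows the directory R is currently using
```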

Data

To run AMaCAID, it is first necessary to create the input file named data.txt. The data can be coded either numerically or alphabetically. For genotypic data, it is preferable to use numbers to code alleles, with either a 1-digit (1-9), a 2-digit (01-99) or a 3-digit (001-999) number per allele. For morphological or taxonomic characters, you just have to use the same code, number or word for identical character values. A single string (word or number) must be used to describe each character value. The columns must be separated by tabulations.

The file must be built as follows. The first line contains the names of the columns of the table: the first column holds the names of the accessions/individuals/species to consider, and the following columns hold the characters or loci for which the individuals have been screened. On the second line, the first item is the name of the first individual or sample, the second is the genotype of that individual at the first locus (in the microsatellite example below, coded with a 3-digit number for each allele), the third is the genotype at the second locus, and so on.

You can encode missing data as you want (in the microsatellite example below, 000000 refers to missing data), but to keep the output file easy to read we recommend using a string of the same length as a genotype or character value (for example, a string of length 6 such as nnnnnn when alleles are coded with a 3-digit number).

Note that identical heterozygous genotypes must always be written with the same code, otherwise they will be considered as different genotypes (for instance, AMaCAID will consider 110113 and 113110 as two different genotypes).

For example, for microsatellite genotyping patterns obtained on diploid individuals, the input file should have the following format:

    LineCode  MTIC37  MTIC59  ENPB1   MTIC86
    L000043   095095  113113  269269  000000
    L000044   095095  110110  267267  130130
    L000045   101101  100100  273273  134134
    L000046   086086  100100  273273  128128

For AFLP data, the input file should be as follows:

    LineCode  aflp1  aflp2  aflp3  aflp4  aflp5  aflp6
    L000043   0      0      0      1      1      1
    L000044   1      1      1      1      0      0
    L000045   1      1      1      1      1      1
    L000046   0      0      0      0      1      1

For SNP data:

    LineCode  snp1  snp2  snp3  snp4  snp5  snp6
    L000043   00    01    00    01    01    01
    L000044   11    11    11    11    01    01
    L000045   11    11    11    11    11    11
    L000046   01    01    01    01    11    11

Here individual L000043 is homozygous for the allele named 0 at snp1 and heterozygous at the locus named snp2. As for SSR data, identical heterozygous genotypes must always be written with the same character value.

For morphological data:

    LineCode  PetalColor  LeafShape    FlowerPerInflorescence
    L000043   white       denticulate  5
    L000044   white       denticulate  3
    L000045   pink        denticulate  3
    L000046   purple      rounded      5

No empty lines are needed between samples. Sample numbering doesn't need to be sequential and samples don't need to be ordered.
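The rule about heterozygous genotypes can be checked before running AMaCAID. The sketch below is an illustration under stated assumptions and is not part of the AMaCAID script: sort_alleles() is a hypothetical helper, and the small table is an inline excerpt modelled on the microsatellite example, with one genotype deliberately written as 113110. Columns are read as text so that leading zeros such as 095095 are kept, then the alleles of each genotype are sorted so that 113110 and 110113 end up with the same code.

```r
# Inline excerpt in the input format (tab-separated), invented for illustration.
lines <- c(
  "LineCode\tMTIC37\tMTIC59",
  "L000043\t095095\t113110",
  "L000044\t095095\t110113",
  "L000045\t101101\t100100"
)
# For a real file, the equivalent call would be:
# read.table("data.txt", header = TRUE, sep = "\t", colClasses = "character")
geno <- read.table(text = lines, header = TRUE, sep = "\t",
                   colClasses = "character", stringsAsFactors = FALSE)

# Hypothetical helper: cut a genotype into alleles of `digits` characters,
# sort the alleles, and paste them back together.
sort_alleles <- function(g, digits = 3) {
  starts <- seq(1, nchar(g), by = digits)
  alleles <- substring(g, starts, starts + digits - 1)
  paste(sort(alleles), collapse = "")
}

# Apply the helper to every locus column (all columns except the names).
loci <- setdiff(names(geno), "LineCode")
geno[loci] <- lapply(geno[loci], function(col) {
  vapply(col, sort_alleles, character(1), USE.NAMES = FALSE)
})
print(geno)
```

The digits argument has to match the allele coding of the file (1, 2 or 3 digits per allele); the normalized table would still need to be written back to data.txt as tab-separated text before use.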

You can create your dataset using Excel and save it in text format (data.txt) with Tab as the separator. If the data are not saved in this format, AMaCAID will not work.

Be careful to leave no spaces inside words or numbers, because R doesn't distinguish between a space and a tab. If a space is left in one of the columns, R will read it as two distinct columns and an error message will appear.

R is case sensitive: the values of a character must be written in strictly identical form. For instance, Right and right will be considered as two different entries.

Four files showing how to build input files with different kinds of data (example_codominant.txt; example_aflp.txt; example_morphodata.txt) and for polyploid organisms (example_tetraploid.txt) are available. To analyze or test one of these datasets using AMaCAID, you will have to rename the file to data.txt.

The three AMaCAID models

Model 1

This is the most complete model. It browses the dataset and generates all the combinations of k loci/characters among the n loci/characters available. It produces n text files, one for each value of k (named AllCombinationTested_kloci.txt), each containing the list of the k loci/characters that maximize the number of individuals/accessions/species that can be discriminated, the number of discriminated genotypes, accessions, species..., what they are, and their occurrences. Finally, a graph is drawn representing the maximum number of discriminated genotypes as a function of the number of characters used. To reduce processing times, the program stops the calculation as soon as the maximum number of discriminated genotypes is reached.

Model 2

This model allows the user to choose the number of characters (k) they want to use to perform the calculation. The result of this model is saved in a text file named OutputModel2_klocus.txt.

Model 3

In this model the user chooses the number of drawings (d) they want the program to do. In this case the model isn't exhaustive: it generates only d combinations of k characters among the n characters available. The results given by this method are the same as in the first model, but the files are called OutputModel3_kloci.txt.
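The difference between the exhaustive and the sampled searches can be sketched in a few lines of R. The code below is a minimal illustration of the idea, not the code of AMaCAIDscript.r: the function names n_patterns(), best_for_k() and best_of_draws() are hypothetical, the toy dataset is invented, and the way AMaCAID itself draws combinations or breaks ties is not described in this manual.

```r
# Toy AFLP-like dataset (invented): one row per individual, one column per locus.
geno <- data.frame(
  aflp1 = c("0", "1", "1", "0"),
  aflp2 = c("0", "1", "1", "0"),
  aflp3 = c("1", "1", "1", "0"),
  aflp4 = c("1", "0", "1", "1"),
  stringsAsFactors = FALSE
)

# Number of distinct multilocus patterns seen with a given set of loci.
n_patterns <- function(data, loci) {
  length(unique(do.call(paste, c(data[loci], sep = "_"))))
}

# Exhaustive search over all combinations of k loci (spirit of Models 1 and 2).
best_for_k <- function(data, k) {
  combos <- combn(names(data), k, simplify = FALSE)
  scores <- vapply(combos, function(loci) n_patterns(data, loci), integer(1))
  list(loci = combos[[which.max(scores)]], patterns = max(scores))
}

# Search restricted to d randomly drawn combinations (spirit of Model 3).
best_of_draws <- function(data, k, d) {
  combos <- replicate(d, sample(names(data), k), simplify = FALSE)
  scores <- vapply(combos, function(loci) n_patterns(data, loci), integer(1))
  list(loci = combos[[which.max(scores)]], patterns = max(scores))
}

# Model 1 style loop: increase k and stop once all patterns are discriminated.
full <- n_patterns(geno, names(geno))
for (k in seq_along(geno)) {
  res <- best_for_k(geno, k)
  cat(k, "loci:", res$patterns, "patterns with",
      paste(res$loci, collapse = ", "), "\n")
  if (res$patterns == full) break
}

# Model 3 style call: only 5 random combinations of 2 loci are examined.
sampled <- best_of_draws(geno, k = 2, d = 5)
```

On this toy dataset the loop stops at k = 3, where the four individuals are all distinguished. The sampled search can miss the best combination, which is the price paid for a shorter run on large datasets.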

For small datasets we suggest the first model, because it is exhaustive. For large datasets we recommend the third model with a relatively low number of iterations. The second model is practical when the user knows how many characters they are going to use, or when the third model has been applied for a first look and the user wants to refine the results for a given number of characters.

Processing times

The table at the end of this manual gives the processing times for different datasets with the different models.

Running an analysis

Here we assume that the folder containing the dataset (data.txt) and the script file (AMaCAIDscript.r), downloaded from http://www.montpellier.inra.fr/brc-mtr/amacaid/, has already been created.

- Open R
- Change the directory
- Write exactly: source("AMaCAIDscript.r")
- To choose the model:
  o For model 1: type 1 and Enter; the calculation begins.
  o For model 2: type 2, Enter, type the number of characters you want to use, Enter; the calculation begins.
  o For model 3: type 3, Enter, type the number of samplings you want the program to do, Enter; the calculation begins.
- When the program has finished, "Analysis finished" appears.

To abort a calculation, press Escape.

Results

The results, all in text format, are saved in the directory chosen at the beginning, the one containing the script and the input file.

Below you can see an example of an output file generated using Model 1 for k=2:

    RESULTS Model#1
    Set of loci maximizing the number of different haplotypes/genotypes/morpho patterns
    Loci Number : 2 8
    Number of different genotypes/haplotypes/patterns detected with this set of loci
    6
    List and frequency of the different genotypes
    "1" "220220223223" 1
    "2" "223223223223" 7
    "3" "223223230230" 2
    "4" "230230220220" 1
    "5" "230230223223" 1
    "6" "230230230230" 2

In this example, the optimal set of 2 loci is the combination of loci number 2 and 8 (these loci were represented by the third and the ninth columns in the input file). With these two loci, 6 multilocus genotypes can be distinguished in the sample studied. These 6 genotypes are then listed: for example, the first multilocus genotype is 220220 at locus 2 and 223223 at locus 8, and is represented once in the sample.

When using models 1 and 3, a graph is generated. The graph appears in a new R window when the calculation is finished. To save it, click on File, then Save as, and choose the format.
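A list of this kind can be reproduced by hand for any pair of loci, which helps when reading the output. The sketch below is an illustration only and does not come from the AMaCAID script: the genotypes are invented, and split_pattern() is a hypothetical helper. It concatenates the genotypes at two loci, counts each multilocus pattern with the base function table(), and cuts a 12-character pattern back into its two 6-character genotypes.

```r
# Invented genotypes at two loci for five individuals (3 digits per allele).
locus2 <- c("223223", "223223", "230230", "220220", "223223")
locus8 <- c("223223", "230230", "230230", "223223", "223223")

# Concatenate the per-locus genotypes into one multilocus pattern per individual.
patterns <- paste0(locus2, locus8)

# Count how many individuals carry each pattern.
freq <- as.data.frame(table(patterns), stringsAsFactors = FALSE)
names(freq) <- c("genotype", "count")
print(freq)

# Hypothetical helper: split a pattern into genotypes of `width` characters.
split_pattern <- function(p, width = 6) {
  starts <- seq(1, nchar(p), by = width)
  substring(p, starts, starts + width - 1)
}
print(split_pattern("220220223223"))
```

The width argument is the length of one genotype, here 6 because each of the two alleles is coded with 3 digits.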