Importing and Merging Data Tutorial Release 1.0 Golden Helix, Inc. February 17, 2012
Contents
1. Overview
2. Import Pedigree Data
3. Import Phenotypic Data
4. Import Genetic Data
5. Import and Apply Marker Map
6. Join or Merge Data Together
Updated: February 7th, 2012
Level: Fundamentals
Packages: All Packages of SVS

One of the greatest challenges of any genetic analysis project is the seemingly endless formatting, manipulation, and editing of data required before it can be properly analyzed. The problem is compounded when the project involves whole-genome data with millions to billions of data points. SNP & Variation Suite 7 (SVS 7) eliminates much of this hassle with streamlined import of virtually any file format, as well as real-time spreadsheet manipulation and editing on a grand scale.

Because data comes in all sizes, formats, and orientations, no single workflow can encompass every scenario. This tutorial therefore leads you through a typical workflow: importing your pedigree data (if applicable), phenotype data, genetic data, and marker map data separately, and then merging them into a single spreadsheet for analysis, as illustrated in the steps below.
1. Overview

Figure 1: Importing and merging data

A. Pedigree Information - These columns always contain the six standard fields included in a pedigree file: Family ID, Patient ID, Father ID, Mother ID, Sex, and Affection Status.

B. Phenotypic Variables - There are often additional phenotypic variables beyond affection status. Once joined, these will be located to the right of the pedigree information (if available) and to the left of the mapped genetic variables. Phenotypic variables can be of the following types: categorical, real, integer, and binary.

C. Genetic Variables - These may be of type genotype, LogR, copy number variation, etc. Genetic variables have special qualities, as they allow you to perform genetic-specific analyses (e.g. LD analysis). A variable will be recognized as a genotype if it has an allele delimiter, which you can specify upon import. Once imported, genotypes are represented as two alleles delimited by an underscore, for example A_B.

D. Map Indicator - This button, if green, indicates that a genetic marker map has been applied to the spreadsheet, meaning each genetic marker has been mapped to a common chromosome and position coordinate system. By clicking this button you can see the map and any additional annotation information associated with each genetic marker, based on the fields included in the original map file.

E. Row Labels - Beyond being the identifiers for rows, these grey columns provide a common key by which multiple spreadsheets can be joined or merged accurately.
F. Column Data Types - Some column operations are specific to the type of column. The type is indicated by a large blue letter on the column number header. The types are as follows:
   B : Indicates a binary column (values 0, 1, or ?).
   C : Indicates a categorical column (values such as Low, Medium, or High).
   G : Indicates a genotype column (bi- or multi-allelic markers with alleles separated by an underscore, such as A_B or 2_2).
   I : Indicates an integer-valued column (values such as -1, 0, 1, 2, 10, etc.).
   R : Indicates a real-valued column (values containing decimal places, encoded as single or double precision floating point values).

Note: For more detailed instructions on how to handle each specific data format, see Importing Your Data Into A Project in the Golden Helix SVS Manual.
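The column types listed above can be illustrated with a small type-inference routine. The following is a minimal sketch in plain Python, not the detection logic SVS itself uses; the function name `infer_column_type`, the default missing code `?`, and the rule that an all-missing column is categorical are assumptions made for illustration.

```python
def infer_column_type(values, missing="?", delimiter="_"):
    """Guess a column type letter (B, C, G, I or R) from string cell values.

    Illustrative only: the rules below are assumptions, not SVS's own.
    """
    present = [v.strip() for v in values if v.strip() != missing]
    if not present:
        return "C"  # assumption: treat an all-missing column as categorical

    def is_genotype(cell):
        # Exactly two non-empty alleles separated by the delimiter.
        parts = cell.split(delimiter)
        return len(parts) == 2 and all(parts)

    def parses_as(convert, cell):
        try:
            convert(cell)
            return True
        except ValueError:
            return False

    if all(v in ("0", "1") for v in present):
        return "B"
    # Genotypes are tested before numbers because Python's int() and float()
    # accept underscores between digits, so "2_2" would otherwise parse as 22.
    if all(is_genotype(v) for v in present):
        return "G"
    if all(parses_as(int, v) for v in present):
        return "I"
    if all(parses_as(float, v) for v in present):
        return "R"
    return "C"
```

For example, `infer_column_type(["A_B", "2_2"])` yields `G`, while a column of `Low`, `Medium`, and `High` falls through to `C`. Note that under these simplified rules any text value containing a single underscore would also be read as a genotype, which is one reason to review detected types after import.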
2. Import Pedigree Data

Pedigree information is only required if you're doing family-based analysis. If you're not doing family-based analysis, you can skip this step and begin importing your phenotype data.

1. Before you can import any data you need to open a project. Open SVS and go to File > New Project.

2. A number of options are available to import pedigree information. The most common are the PED/MAP and FBAT pedigree formats. To import these, from an open project go to Import > PED/TPED/BED, Import > PBAT > FBAT Pedigree, or Import > PBAT > Text Pedigree.
   For more information on the Import > PED/TPED/BED dialog see: PED/TPED/BED File
   For more information on the Import > PBAT > FBAT Pedigree and Import > PBAT > Text Pedigree dialogs see: Importing PBAT Family-Based Data

3. If you have all your data in a regular text file, Excel spreadsheet, or some other general file format, you can use Import > Text or Import > Third Party. Once imported, you can convert the resulting spreadsheet to a pedigree spreadsheet by selecting Edit > Convert to Pedigree Spreadsheet.
   For more information on the Import > Text dialog see: Text File
   For more information on the Import > Third Party dialog see: Third Party File
   For more information on the Edit > Convert to Pedigree Spreadsheet feature see: Convert to Pedigree Spreadsheet

You will know you have a pedigree spreadsheet in your project if the first six column headers are blue, as in Figure 2. The spreadsheet icon in the project navigator will also have a pedigree symbol.
Figure 2. Pedigree Spreadsheet
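Before importing a pedigree file it can help to check that its six leading fields are well formed. The sketch below is a plain Python illustration, not part of SVS: it assumes a whitespace-delimited file whose first six fields are Family ID, Patient ID, Father ID, Mother ID, Sex, and Affection Status, and it assumes that a parent ID of `0` marks a founder. The names `PedigreeRow`, `parse_pedigree_line`, and `unknown_parents` are hypothetical.

```python
from collections import namedtuple

# The six standard pedigree fields, in file order.
PedigreeRow = namedtuple(
    "PedigreeRow",
    "family_id patient_id father_id mother_id sex affection",
)


def parse_pedigree_line(line):
    """Split one line into the six pedigree fields and any trailing columns."""
    fields = line.split()
    if len(fields) < 6:
        raise ValueError(f"expected at least 6 fields, got {len(fields)}")
    return PedigreeRow(*fields[:6]), fields[6:]


def unknown_parents(rows, founder_code="0"):
    """List (family, patient, parent) triples whose parent is not in the file.

    Assumption: founder_code marks an individual with no recorded parent.
    """
    known = {(r.family_id, r.patient_id) for r in rows}
    problems = []
    for r in rows:
        for parent in (r.father_id, r.mother_id):
            if parent != founder_code and (r.family_id, parent) not in known:
                problems.append((r.family_id, r.patient_id, parent))
    return problems
```

Running `unknown_parents` on the parsed rows returns an empty list when every non-founder parent ID refers to an individual in the same family.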
3. Import Phenotypic Data

Phenotype information is needed for most, but not all, analyses in SVS. It most often supplies the dependent variable (e.g. case-control status) and the independent variables (e.g. gender, age) in association and regression analysis. If you only have pedigree information, Affection Status is the phenotype variable you'd use as your dependent variable.

1. Phenotype information usually comes in the form of a text file or Excel spreadsheet. To import a text file, from the Project Navigator, go to Import > Text. Here you will specify how your data is formatted and which column you want to use as the row labels. Under the Advanced Options tab, you can specify the following:
   - How your missing data is encoded in your text file
   - Whether or not there is genotypic data and how its alleles are delimited
   - How many header rows to skip, if any
   - The base numeric type
   - How real-valued columns should be encoded
   The skip header rows option applies to files that contain ancillary information before the data you want to import begins, as highlighted in the Illumina Final Report file in Figure 3. See Text File for more information.

Figure 3. Illumina text file

2. If your phenotype data is in an Excel spreadsheet, from the Project Navigator, go to Import > Third Party. Click the Browse button to locate your file. Third Party includes quite a number of file formats. To import Excel files you need to select Excel (*.xls) or Excel 2007 (*.xlsx) from the file type drop-down (Figure 4). Upon import you will have a phenotype spreadsheet. See Third Party File for more information.
Figure 4. Third Party file format selection dialog

3. In order for SVS to perform the correct statistical tests, phenotype data must be in the proper format. Data comes in all shapes and sizes, and although SVS is good at detecting the format of each variable in a dataset upon import, the detected format may not be what the researcher intended (e.g. categorical data represented as numbers will be interpreted as integers). You can use the Spreadsheet Editor (Edit > Edit this Spreadsheet) to manipulate your data and make sure every variable is in the proper format. For more information on using the Spreadsheet Editor see Editing a Spreadsheet in the Golden Helix SVS Manual.
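The import options described in this section (header rows to skip, missing-value encoding, row label column) and the recoding of number-coded categories can be sketched outside SVS in plain Python. This is an illustration under assumptions, not the SVS importer: the function names `read_phenotypes` and `recode`, the default missing codes, and the sample file contents are all hypothetical.

```python
import csv


def read_phenotypes(text, skip_rows=0, delimiter="\t", label_column=0,
                    missing=("?", "")):
    """Parse delimited text into (column names, {row label: {column: value}}).

    skip_rows drops ancillary lines that precede the real header row.
    Cells matching a missing code are stored as None.
    """
    lines = text.splitlines()[skip_rows:]
    reader = csv.reader(lines, delimiter=delimiter)
    header = next(reader)
    names = [h for i, h in enumerate(header) if i != label_column]
    table = {}
    for row in reader:
        if not row:
            continue  # ignore blank lines
        cells = [c for i, c in enumerate(row) if i != label_column]
        table[row[label_column]] = {
            name: (None if cell in missing else cell)
            for name, cell in zip(names, cells)
        }
    return names, table


def recode(table, column, labels):
    """Replace number-coded categories in one column with category labels."""
    for record in table.values():
        if record[column] is not None:
            record[column] = labels[record[column]]
```

A file with three ancillary lines before its header would be read with `read_phenotypes(text, skip_rows=3, delimiter=",")`, and a 0/1 status column could then be relabeled with `recode(table, "Status", {"0": "Control", "1": "Case"})`, where the column name and labels are examples only.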
4. Import Genetic Data

Genetic data comes in a myriad of custom formats and file types.

1. SVS directly supports a number of file formats for several types of analysis, including Affymetrix (e.g. CEL, CHP), Illumina (Final Report, Illumina DSF), Agilent, Nimblegen, and more. All of these can be found under the Import menu from the Project Navigator.

2. If your genetic data is in a text file you can use either Import > Text or Import > Third Party, as with pedigree and phenotype data (above). SVS will recognize genotypes as such as long as they are delimited (e.g. A_B, A/B). The delimiter can be specified in both import dialogs. You'll also want to specify how missing values are encoded, as this can vary from file to file. Built-in missing encodings include ?, and for each allele.

3. For file formats not handled natively in SVS 7, we often write custom import scripts using SVS 7's built-in Python scripting interface. Many of the scripts we've written for our customers are provided for others on our Add-on Scripts Repository. For more information, or if you need help importing a custom file format, please email support@goldenhelix.com or call us at 1-888-589-4629.

Upon import you will have a spreadsheet that contains unmapped genotype information, as in Figure 5. Notice that the Map button in the upper left portion of the spreadsheet is greyed out. This will turn green once a map is applied.

Figure 5. Spreadsheet with unmapped genotype data
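To make the role of the allele delimiter and missing-value encoding concrete, the following sketch converts delimited genotype cells into the underscore-delimited form described in the Overview. It is plain Python written for illustration and does not use the SVS scripting interface; the function name `normalize_genotype` and the default set of missing codes (other than `?`) are assumptions.

```python
def normalize_genotype(cell, delimiter="/", missing_codes=("?", "0", "N")):
    """Convert a delimited genotype such as 'A/B' to the form 'A_B'.

    Alleles that match a missing code, or that are empty, become '?'.
    The default missing codes other than '?' are illustrative assumptions.
    """
    parts = cell.strip().split(delimiter)
    if len(parts) != 2:
        raise ValueError(f"not a two-allele genotype: {cell!r}")
    alleles = ["?" if (a == "" or a in missing_codes) else a for a in parts]
    return "_".join(alleles)
```

For instance, `normalize_genotype("A/B")` gives `A_B`, and a space-delimited missing call such as `0 0` becomes `?_?` when called with `delimiter=" "`. A cell with no delimiter raises an error rather than being silently treated as a genotype.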
5. Import and Apply Marker Map

Genetic marker maps contain chromosome and position data for individual genetic markers relative to some common coordinate system, as well as other annotation information for each genetic marker (if available). Most often, marker map information is provided in a separate file from the genetic data. SVS allows you to convert a text file with map information to a marker map file (*.dsm), download an Affymetrix annotation file using the integrated NetAffx service from Affymetrix, or download a marker map from Golden Helix's data repository.

1. To access the marker map manager, from the Project Navigator, go to Tools > Manage Marker Maps. Click the Convert Text File button to convert a text file to a marker map. Figure 6 provides an example of a marker map text file. Once you choose the file you want to convert and click OK, the text marker map will be scanned and the Choose Columns to Use dialog will appear. At a minimum, the columns for the marker name, chromosome, and position must be specified, although additional columns can be imported from the marker map as well. Clicking OK will convert the text marker map file into a *.dsm file for use in any project.

Figure 6. Text marker map file opened in Excel

See Convert Text File into Marker Map DSM Format for more information.

2. For Affymetrix customers, Affymetrix NetAffx provides array design and annotation information for its GeneChip array results. You can sign up for and use the NetAffx Analysis Center through Affymetrix's website at http://www.affymetrix.com. SVS is able to communicate with NetAffx through a web service interface, allowing you to download and update genetic marker map information that can be mapped to Affymetrix data.
Begin by clicking on the Download from Affymetrix NetAffx button in the Manage Marker Maps window. You will be prompted for your Affymetrix NetAffx login information. After you enter it, the Download Annotations window will appear, listing the latest annotation files provided by Affymetrix.

Note: There are actually two annotation files for the 500K array: 250K_Nsp and 250K_Sty. Both need to be downloaded simultaneously for the program to properly merge them. To download both annotation files simultaneously, highlight the first annotation file and then Ctrl+click to highlight the second. Click Download.

Data that is available through Affymetrix NetAffx is also available on the Golden Helix server, eliminating the need to go to more than one location to download maps for different arrays. To use it, click on the Download from Golden Helix button in the Manage Marker Maps window.

These files are quite large and may take a few minutes to download, depending on the speed of your Internet connection. Once the download has finished, you will be prompted to select the fields you want imported, in addition to the six defaults. See Download Affymetrix Annotation Files for more information.

3. The Golden Helix data repository contains marker maps for both Affymetrix and Illumina arrays as *.dsm files, ready to apply to a spreadsheet once downloaded. The annotation files available through the Affymetrix NetAffx site are only the latest version. If an older version or human genome build is required, these maps can be obtained from Golden Helix. Begin by clicking on the Download from Golden Helix button in the Manage Marker Maps window. Select one or more marker map files to download. Once downloaded, the files will be saved to the Marker Maps folder and will be visible in the Marker Map file list in the Manage Marker Maps window. See Download from Golden Helix for more information.

4. Additional annotation data can be added to any marker map through a utility function available in the Manage Marker Maps window. For example, gene names from a gene annotation track, or sense/nonsense classifications from a SIFT track, can be added to a marker map. To add annotation data to a marker map, click on the Utilities button in the Manage Marker Maps window and select Add Annotation Data to Marker Map. Choose the marker map to add the data to and the annotation track that contains the data you want to add. Clicking Next > will bring up a new dialog that lists fields from the annotation track. Select the field, the name for the field (if other than the default name), and the overlap conflict resolution. Click Next >. The new marker map will be created and saved in the Marker Maps folder. See Add Annotation Data to Marker Map for more information.

5. Next you will need to apply the marker map file you converted to your spreadsheet containing genetic data. Open your spreadsheet containing genetic information and go to File > Apply Genetic Marker Map. Select the map file you just converted.

Note: SVS 7 allows you to apply a marker map to a spreadsheet with marker names as either column headers or row labels, such as an output p-value spreadsheet. You will need to indicate which at the bottom of the Apply Genetic Marker Map window under Marker Names Are.

6. Once the genetic marker map is applied, the Map button in the upper left of the spreadsheet will turn green. You can view each marker's associated map information by clicking this button, as in Figure 7.
Figure 7. Marker-mapped spreadsheet displaying map and annotation information about each marker

Note: Genotype data is a special data type. You can still map other genetic data types (e.g. CNV, LogRs) as long as each marker name in your data set matches a name in the marker map.
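Conceptually, applying a marker map is a lookup from each marker name in the spreadsheet to a chromosome and position. The sketch below shows that idea in plain Python; it does not read or write *.dsm files and is not how SVS implements the feature. The column names `Marker`, `Chromosome`, and `Position`, the marker names in the usage note, and the function names are all assumptions for illustration.

```python
import csv


def load_marker_map(lines, name_col="Marker", chr_col="Chromosome",
                    pos_col="Position", delimiter="\t"):
    """Read a delimited text map into {marker name: (chromosome, position)}.

    Mirrors the three minimum columns: marker name, chromosome, position.
    """
    reader = csv.DictReader(lines, delimiter=delimiter)
    return {
        row[name_col]: (row[chr_col], int(row[pos_col]))
        for row in reader
    }


def apply_marker_map(marker_names, marker_map):
    """Return (mapped, unmapped) for the given spreadsheet marker names."""
    mapped = {m: marker_map[m] for m in marker_names if m in marker_map}
    unmapped = [m for m in marker_names if m not in marker_map]
    return mapped, unmapped
```

Given a map containing `rs1` and `rs2`, calling `apply_marker_map(["rs2", "rs9"], marker_map)` returns the coordinates for `rs2` and reports `rs9` as unmapped. Checking the unmapped list is a quick way to spot naming mismatches between a data set and its map.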
6. Join or Merge Data Together

Now that you have all your individual data sources imported and formatted, you can join them together into a single spreadsheet.

1. Starting with the phenotype spreadsheet, go to File > Join or Merge Spreadsheets. Select the spreadsheet containing pedigree information and click OK. This will bring up the Join or Merge Spreadsheets window. Here you will specify how you want to join the two spreadsheets. The safest option is to join the spreadsheets using row labels as the matching criterion. If, for some reason, the two spreadsheets do not contain matching row labels, you can define a custom order.

2. Repeat this process, joining the spreadsheet containing genetic data with the first joined spreadsheet, which contains both pedigree and phenotype data. Upon completion you will have a fully merged spreadsheet, as in Figure 8.

Figure 8. Spreadsheet containing pedigree, phenotype, and genetic data
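Joining on row labels can be pictured as matching two tables by a shared key. The following plain Python sketch keeps only the rows whose labels appear in both tables and reports the rest; this inner-join behavior is an assumption chosen for illustration, and the Join or Merge Spreadsheets window may offer other options. The function name `join_by_row_label` and the sample labels are hypothetical.

```python
def join_by_row_label(left, right):
    """Join two {row label: {column: value}} tables on their row labels.

    Returns (joined, only_left, only_right). Columns from `left` come first,
    followed by columns from `right`; on a column name clash the value from
    `right` wins.
    """
    joined = {
        label: {**left[label], **right[label]}
        for label in left
        if label in right
    }
    only_left = [label for label in left if label not in right]
    only_right = [label for label in right if label not in left]
    return joined, only_left, only_right
```

With phenotypes for samples S1 and S2 and genotypes for S2 and S3, the joined table contains only S2, and the two leftover lists show which samples lacked a match. Repeating the call with the joined result as the left table chains a third data source in the same way.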