Importing and Merging Data Tutorial

Importing and Merging Data Tutorial Release 1.0 Golden Helix, Inc. February 17, 2012

Contents
1. Overview
2. Import Pedigree Data
3. Import Phenotypic Data
4. Import Genetic Data
5. Import and Apply Marker Map
6. Join or Merge Data Together

Updated: February 7th, 2012
Level: Fundamentals
Packages: All Packages of SVS

One of the greatest challenges of any genetic analysis project is the seemingly endless formatting, manipulation, and editing of data needed before it can be properly analyzed. The problem is compounded when the project involves whole genome data with millions to billions of data points. SNP & Variation Suite 7 (SVS7) eliminates much of this hassle with streamlined data import of virtually any file format, as well as real-time spreadsheet manipulation and editing on a grand scale.

Because data comes in all sizes, formats, and orientations, no single workflow can encompass every scenario. This tutorial therefore leads you through a typical workflow: importing your pedigree data (if applicable), phenotype data, genetic data, and marker map data separately, and then merging them together in a single spreadsheet for analysis, as illustrated in the steps below.

1. Overview

Figure 1: Importing and merging data

A. Pedigree Information - These columns always contain the six standard fields included in a pedigree file: Family ID, Patient ID, Father ID, Mother ID, Sex, and Affection Status.
B. Phenotypic Variables - Often there are additional phenotypic variables beyond affection status. Once joined, these will be located to the right of the pedigree information (if available) and to the left of the mapped genetic variables. Phenotypic variables can be of types categorical, real, integer, and binary.
C. Genetic Variables - These may be of type genotype, logR, copy number variation, etc. Genetic variables have special qualities, as they allow you to perform genetic-specific analyses (e.g. LD analysis). A variable will be recognized as a genotype if it has an allele delimiter, which you can specify upon import. Once imported, genotypes are represented as two alleles delimited with an underscore, A_B.
D. Map Indicator - This button, if green, indicates that a genetic marker map has been applied to the spreadsheet, meaning each genetic marker has been mapped to a common chromosome and position coordinate system. By clicking this button you can see the map and any additional annotation information associated with each genetic marker, based on fields included in the original map file.
E. Row Labels - Beyond being the identifiers for rows, these grey columns provide a common key by which multiple spreadsheets can be joined or merged accurately.

F. Column Data Types - Some column operations are specific to the type of column. The type is indicated by a large blue letter on the column number header. The types are as follows:
   B: Indicates a binary column (values 0, 1, ?).
   C: Indicates a categorical column (values such as Low, Medium, or High).
   G: Indicates a genotype column (bi- or multi-allelic markers with alleles separated by an underscore, such as A_B or 2_2).
   I: Indicates an integer-valued column (values such as -1, 0, 1, 2, 10, etc.).
   R: Indicates a real-valued column (values containing decimal places, encoded as single or double precision floating point values).

Note: For more detailed instructions on how to handle each specific data format, see Importing Your Data Into A Project in the Golden Helix SVS Manual.
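The column types listed above can be pictured with a small classifier. The following is only a rough sketch in plain Python, not SVS's actual detection logic: the function name infer_column_type, the order of the checks, and the choice to treat an all-missing column as categorical are all assumptions made for illustration.

```python
def infer_column_type(values, missing="?", allele_delimiter="_"):
    """Guess a one-letter column type (B, C, G, I, R) from string values.

    Hypothetical helper for illustration; SVS's own rules may differ.
    """
    present = [v for v in values if v != missing]
    if not present:
        return "C"  # assumption: an all-missing column defaults to categorical

    def is_genotype(v):
        # Two non-empty alleles separated by the delimiter, e.g. A_B or 2_2.
        parts = v.split(allele_delimiter)
        return len(parts) == 2 and all(parts)

    def is_int(v):
        try:
            int(v)
            return True
        except ValueError:
            return False

    def is_real(v):
        try:
            float(v)
            return True
        except ValueError:
            return False

    # Genotypes are tested first so that values like 2_2 are not read as numbers.
    if all(is_genotype(v) for v in present):
        return "G"
    if all(v in ("0", "1") for v in present):
        return "B"
    if all(is_int(v) for v in present):
        return "I"
    if all(is_real(v) for v in present):
        return "R"
    return "C"
```

For example, infer_column_type(["A_B", "B_B", "?"]) classifies the column as a genotype column, while a column of Low/Medium/High values falls through to categorical.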

2. Import Pedigree Data

Pedigree information is only required if you're doing family-based analysis. If you're not doing family-based analysis, you can skip this step and begin importing your phenotype data.

1. Before you can import any data you need to open a project. Open SVS and go to File > New Project.

2. A number of options are available to import pedigree information. The most common are the PED/MAP and FBAT pedigree formats. To import these, from an open project go to Import > PED/TPED/BED, Import > PBAT > FBAT Pedigree, or Import > PBAT > Text Pedigree.
   For more information on the Import > PED/TPED/BED dialog see: PED/TPED/BED File
   For more information on the Import > PBAT > FBAT Pedigree and Import > PBAT > Text Pedigree dialogs see: Importing PBAT Family-Based Data

3. If you have all your data in a regular text file, Excel spreadsheet, or some other general file format, you can use Import > Text or Import > Third Party. Once imported, you can convert the resulting spreadsheet to a pedigree file by selecting Edit > Convert to Pedigree Spreadsheet.
   For more information on the Import > Text dialog see: Text File
   For more information on the Import > Third Party dialog see: Third Party File
   For more information on the Edit > Convert to Pedigree Spreadsheet feature see: Convert to Pedigree Spreadsheet

You will know you have a pedigree spreadsheet in your project if the first six column headers are blue, as in Figure 2. The spreadsheet icon in the project navigator will also have a pedigree symbol.

Figure 2. Pedigree Spreadsheet
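To make the six-field pedigree layout concrete, here is a minimal sketch in plain Python that splits one whitespace-delimited line into the six pedigree fields named in this tutorial, followed by any remaining columns. It is not the SVS importer: the names PedigreeRow and parse_pedigree_line and the sample line are hypothetical, and how sex and affection status are coded depends on your file.

```python
from collections import namedtuple

# Field order follows the six standard pedigree fields listed in the Overview.
PedigreeRow = namedtuple(
    "PedigreeRow",
    ["family_id", "patient_id", "father_id", "mother_id", "sex", "affection"],
)


def parse_pedigree_line(line):
    """Split one whitespace-delimited line into six pedigree fields plus the rest."""
    fields = line.split()
    if len(fields) < 6:
        raise ValueError("expected at least 6 pedigree fields, got %d" % len(fields))
    return PedigreeRow(*fields[:6]), fields[6:]


# Hypothetical example line: six pedigree fields followed by two genotypes.
row, rest = parse_pedigree_line("FAM1 P3 P1 P2 1 2 A_B B_B")
```

Here row.family_id is "FAM1" and rest holds the two trailing genotype values; a line with fewer than six fields raises ValueError instead of being silently misread.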

3. Import Phenotypic Data

Phenotype information is needed for most, but not all, analyses in SVS. It is most often used as the dependent variable (e.g. case-control status) and the independent variables (e.g. gender, age) in association and regression analysis. If you only have pedigree information, Affection Status would be the phenotype variable you'd use as your dependent variable.

1. Phenotype information usually comes in the form of a text file or Excel spreadsheet. To import a text file, from the Project Navigator, go to Import > Text. Here you will specify how your data is formatted and which column you want to use as the row labels. Under the Advanced Options tab, you can specify the following:
   - How your missing data is encoded in your text file
   - Whether or not there is genotypic data and how its alleles are delimited
   - How many header rows to skip, if any
   - The base numeric type
   - How real-valued columns should be encoded
   The skip header rows option is for files that contain ancillary information before the data you want to import starts, as highlighted in the Illumina Final Report file in Figure 3. See Text File for more information.

   Figure 3. Illumina text file

2. If your phenotype data is in an Excel spreadsheet, from the Project Navigator, go to Import > Third Party. Click the Browse button to locate your file. Third Party covers a large number of file formats. To import Excel files, select Excel (*.xls) or Excel 2007 (*.xlsx) from the file type drop-down (Figure 4). Upon import you will have a phenotype spreadsheet. See Third Party File for more information.

   Figure 4. Third Party file format selection dialog

3. In order for SVS to perform the correct statistical tests, phenotype data must be in the proper format. Data comes in all shapes and sizes, and though SVS is good at detecting the format of each variable in a dataset upon import, the detected format may not be what the researcher intended (e.g. categorical data represented as numbers will be interpreted as integers). You can use the Spreadsheet Editor (Edit > Edit this Spreadsheet) to manipulate your data and make sure every variable is in the proper format. For more information on using the Spreadsheet Editor, see Editing a Spreadsheet in the Golden Helix SVS Manual.
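The text import options described in this section (header rows to skip, missing-value encoding, and the row label column) can be illustrated outside SVS with Python's standard csv module. This is a sketch under stated assumptions, not the Import > Text dialog itself: the function read_table, its parameter names, the default missing codes, and the made-up three-line header block in the sample are all hypothetical.

```python
import csv


def read_table(text, skip_rows=0, delimiter="\t", missing=("?", "NA", ""), label_column=0):
    """Read delimited text into {row_label: {column: value}}.

    skip_rows    -- number of ancillary lines before the column header row
    missing      -- strings treated as missing and stored as None (assumed codes)
    label_column -- index of the column used as the row label
    """
    lines = text.splitlines()[skip_rows:]
    reader = csv.reader(lines, delimiter=delimiter)
    header = next(reader)
    columns = [name for i, name in enumerate(header) if i != label_column]
    table = {}
    for record in reader:
        if not record:
            continue  # ignore blank lines
        label = record[label_column]
        values = [None if v in missing else v
                  for i, v in enumerate(record) if i != label_column]
        table[label] = dict(zip(columns, values))
    return columns, table


# Hypothetical file: three ancillary lines, then the header row and two samples.
sample = "[Header]\nProject\tDemo\n[Data]\nSample\tAge\tStatus\nS1\t34\t1\nS2\t?\t0\n"
columns, table = read_table(sample, skip_rows=3)
```

All values are kept as strings here; deciding whether a column such as Status should be treated as binary, integer, or categorical is a separate step, which is the point of step 3 above.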

4. Import Genetic Data

Genetic data comes in a myriad of custom formats and file types.

1. SVS directly supports a number of file formats for several types of analysis, including Affymetrix (e.g. CEL, CHP), Illumina (Final Report, Illumina DSF), Agilent, Nimblegen, and more. All of these can be found under the Import menu from the Project Navigator.

2. If your genetic data is in a text file, you can use either Import > Text or Import > Third Party, as with pedigree and phenotype data (above). SVS will recognize genotypes as such as long as they are delimited (e.g. A_B, A/B). The delimiter can be specified in both import options. You'll also want to specify how missing values are encoded, as this can vary from file to file. Built-in missing encodings include ?, and for each allele.

3. For file formats not handled natively in SVS 7, we often write custom import scripts using SVS 7's built-in Python scripting interface. Many of the scripts we've written for our customers are provided for others in our Add-on Scripts Repository. For more information, or if you need help importing a custom file format, please email support@goldenhelix.com or call us at 1-888-589-4629.

Upon import you will have a spreadsheet that contains unmapped genotype information, as in Figure 5. Notice that the Map button in the upper left portion of the spreadsheet is greyed out. It will turn green once a map is applied.

Figure 5. Spreadsheet with unmapped genotype data
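As an illustration of what specifying an allele delimiter and a missing encoding amounts to, the sketch below converts delimited genotype strings to the underscore form used in this tutorial (A_B). It is plain Python, not SVS code: normalize_genotype is a hypothetical helper, "?" is used as the only missing code, and treating a genotype with one missing allele as entirely missing is an assumption.

```python
def normalize_genotype(value, delimiter="/", missing_codes=("?",)):
    """Return an underscore-delimited genotype, or None if missing or not a genotype."""
    alleles = value.strip().split(delimiter)
    if len(alleles) != 2:
        return None  # no delimiter (or too many): not recognized as a genotype
    if any(a == "" or a in missing_codes for a in alleles):
        return None  # assumption: one missing allele makes the whole call missing
    return "_".join(alleles)
```

With the default settings, normalize_genotype("A/B") gives "A_B", while "?/?" and the undelimited string "AB" both give None.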

5. Import and Apply Marker Map

Genetic marker maps contain chromosome and position data for individual genetic markers relative to some common coordinate system, as well as other annotation information for each genetic marker (if available). Most often, marker map information is provided in a separate file from the genetic data. SVS allows you to convert a text file with map information to a marker map file (*.dsm), download an Affymetrix annotation file using the integrated NetAffx service from Affymetrix, or download a marker map from Golden Helix's data repository.

1. To access the marker map manager, from the Project Navigator, go to Tools > Manage Marker Maps. Click the Convert Text File button to convert a text file to a marker map. Figure 6 provides an example of a marker map text file. Once you choose the file you want to convert and click OK, the text marker map will be scanned and the Choose Columns to Use dialog will appear. At minimum, columns for the marker name, chromosome, and position must be specified, although additional columns can be imported from the marker map as well. Clicking OK will convert the text marker map file into a *.dsm file for use in any project.

   Figure 6. Text marker map file opened in Excel

   See Convert Text File into Marker Map DSM Format for more information.

2. For Affymetrix customers, Affymetrix NetAffx provides array design and annotation information for its GeneChip array results. You can sign up for and use the NetAffx Analysis Center through Affymetrix's website at http://www.affymetrix.com. SVS is able to communicate with NetAffx through a web service interface, allowing you to download and update genetic marker map information mappable to Affymetrix data.

   Begin by clicking on the Download from Affymetrix NetAffx button in the Manage Marker Maps window. You will be prompted for your Affymetrix NetAffx login information. After you enter it, the Download Annotations window will appear, listing the latest annotation files provided by Affymetrix.

   Note: There are actually two annotation files for the 500K array: 250K_Nsp and 250K_Sty. Both need to be downloaded simultaneously for the program to properly merge them. To download both annotation files simultaneously, highlight the first annotation file and then Ctrl+click to highlight the second. Click Download.

   Data that is available through Affymetrix NetAffx is also available on the Golden Helix server, eliminating the need to go to more than one location to download maps for different arrays. To do this, click on the Download from Golden Helix button in the Manage Marker Maps window.

   These files are quite large and may take a few minutes to download, depending on the speed of your Internet connection. Once finished, you will be prompted to select the fields you want imported, in addition to the six defaults. See Download Affymetrix Annotation Files for more information.

3. The Golden Helix data repository contains marker maps for both Affymetrix and Illumina arrays as *.dsm files, ready to apply to a spreadsheet once downloaded. The annotation files available through the Affymetrix NetAffx site are only the latest version. If an older version or human genome build is required, these maps can be obtained from Golden Helix. Begin by clicking on the Download from Golden Helix button in the Manage Marker Maps window. Select one or more marker map files to download. Once downloaded, the files will be saved to the Marker Maps folder and will be visible in the Marker Map file list in the Manage Marker Maps window. See Download from Golden Helix for more information.

4. Additional annotation data can be added to any marker map through a utility function available in the Manage Marker Maps window. For example, gene names from a gene annotation track, or sense/nonsense classifications from a SIFT track, can be added to a marker map. To add annotation data to a marker map, click on the Utilities button in the Manage Marker Maps window and select Add Annotation Data to Marker Map. Choose the marker map to add the data to and the annotation track that contains the data you want to add. Clicking Next > will bring up a new dialog that lists fields from the annotation track. Select the field, the name for the field (if other than the default name), and the overlap conflict resolution. Click Next >. The new marker map will be created and saved in the Marker Maps folder. See Add Annotation Data to Marker Map for more information.

5. Next you will need to apply the marker map file you converted to your spreadsheet containing genetic data. Open your spreadsheet containing genetic information and go to File > Apply Genetic Marker Map. Select the map file you just converted.

   Note: SVS 7 allows you to apply a marker map to a spreadsheet with marker names as either column headers or row labels, such as an output p-value spreadsheet. You will need to indicate this at the bottom of the Apply Genetic Marker Map window under Marker Names Are.

6. Once the genetic marker map is applied, the Map button in the upper left of the spreadsheet will turn green. You can view each marker's associated map information by clicking this button, as in Figure 7.

Figure 7. Marker mapped spreadsheet displaying map and annotation information about each marker

Note: Genotype data is a special data type. You can still map other genetic data types (e.g. CNV, LogRs) as long as the marker name in your data set maps to a name in the marker map spreadsheet.
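Conceptually, applying a marker map is a lookup from marker name to chromosome and position. The sketch below shows that idea in plain Python under stated assumptions: it does not read or write the *.dsm format, the function names are hypothetical, and the column headers Name, Chromosome, and Position are invented for the sample map text.

```python
import csv


def load_marker_map(text, name_col="Name", chr_col="Chromosome", pos_col="Position"):
    """Read a tab-delimited map into {marker_name: (chromosome, position)}."""
    reader = csv.DictReader(text.splitlines(), delimiter="\t")
    return {row[name_col]: (row[chr_col], int(row[pos_col])) for row in reader}


def apply_marker_map(marker_names, marker_map):
    """Pair each marker name with its map entry; collect names with no entry."""
    mapped, unmapped = {}, []
    for name in marker_names:
        if name in marker_map:
            mapped[name] = marker_map[name]
        else:
            unmapped.append(name)
    return mapped, unmapped


# Hypothetical map text with the three required columns.
map_text = "Name\tChromosome\tPosition\nrs1\t1\t1000\nrs2\tX\t2500\n"
marker_map = load_marker_map(map_text)
mapped, unmapped = apply_marker_map(["rs1", "rs2", "rs9"], marker_map)
```

In this example rs9 ends up in unmapped because its name does not appear in the map, which mirrors the requirement in the note above that marker names in the data set match names in the marker map.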

6. Join or Merge Data Together

Now that you have all your individual data sources imported and formatted, you can join them together into a single spreadsheet.

1. Starting with the phenotype spreadsheet, go to File > Join or Merge Spreadsheets. Select the spreadsheet containing pedigree information and click OK. This will bring up the Join or Merge Spreadsheets window. Here you will specify how you want to join the two spreadsheets. The safest option is to join the spreadsheets using row labels as the matching criteria. If, for some reason, the two spreadsheets do not contain matching row labels, you can define a custom order.

2. Repeat this process by joining the spreadsheet containing genetic data with the first joined spreadsheet containing both pedigree and phenotype data. Upon completion you will have a fully merged spreadsheet, as in Figure 8.

Figure 8. Spreadsheet containing pedigree, phenotype, and genetic data
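Joining on row labels can be sketched as a dictionary lookup. The example below is plain Python, not the Join or Merge Spreadsheets dialog: join_by_row_label and the sample tables are hypothetical, and keeping only rows present in both tables is an assumption (the SVS dialog may offer other choices).

```python
def join_by_row_label(left, right):
    """Inner-join two {row_label: {column: value}} tables on shared row labels.

    Rows keep the order of the left table. If both tables have a column with
    the same name, the value from the right table wins.
    """
    joined = {}
    for label, left_row in left.items():
        if label in right:
            merged = dict(left_row)
            merged.update(right[label])
            joined[label] = merged
    return joined


# Hypothetical tables keyed by sample ID; the row order differs between them.
phenotype = {"S1": {"Age": 34}, "S2": {"Age": 41}, "S3": {"Age": 29}}
genotype = {"S2": {"rs1": "A_B"}, "S1": {"rs1": "A_A"}}
joined = join_by_row_label(phenotype, genotype)
```

Because rows are matched by label rather than by position, the differing row order in the two tables does not matter, and S3, which has no genotype row, is left out of the result.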