affyqcreport: A Package to Generate QC Reports for Affymetrix Array Data

Similar documents
Package yaqcaffy. January 19, 2019

Preprocessing Affymetrix Data. Educational Materials 2005 R. Irizarry and R. Gentleman Modified 21 November, 2009, M. Morgan

Description of gcrma package

Frozen Robust Multi-Array Analysis and the Gene Expression Barcode

Microarray Data Analysis (VI) Preprocessing (ii): High-density Oligonucleotide Arrays

Codelink Legacy: the old Codelink class

Package sscore. R topics documented: June 27, Version Date

sscore July 13, 2010 vector of data tuning constant (see details) fuzz value to avoid division by zero (see details)

Using FARMS for summarization Using I/NI-calls for gene filtering. Djork-Arné Clevert. Institute of Bioinformatics, Johannes Kepler University Linz

Package ffpe. October 1, 2018

Package simpleaffy. January 11, 2018

Package makecdfenv. March 13, 2019

Package affyio. January 18, Index 6. CDF file format function

Applying Data-Driven Normalization Strategies for qpcr Data Using Bioconductor

Using oligonucleotide microarray reporter sequence information for preprocessing and quality assessment

Package affypdnn. January 10, 2019

affypara: Parallelized preprocessing algorithms for high-density oligonucleotide array data

Microarray Technology (Affymetrix ) and Analysis. Practicals

Preprocessing -- examples in microarrays

All About PlexSet Technology Data Analysis in nsolver Software

Introduction to the xps Package: Comparison to Affymetrix Power Tools (APT)

Introduction: microarray quality assessment with arrayqualitymetrics

QuickReferenceCard. Axiom TM Analysis Suite - Analyzing your Samples. Setting Up and Running an Analysis

Course on Microarray Gene Expression Analysis

Analysis of Genomic and Proteomic Data. Practicals. Benjamin Haibe-Kains. February 17, 2005

Package mimager. March 7, 2019

BIOINFORMATICS. A comparison of normalization methods for high density oligonucleotide array data based on variance and bias

/ Computational Genomics. Normalization

Averages and Variation

Microarray Excel Hands-on Workshop Handout

STA 570 Spring Lecture 5 Tuesday, Feb 1

Vector Xpression 3. Speed Tutorial: III. Creating a Script for Automating Normalization of Data

Package AffyExpress. October 3, 2013

Chapter 2 Describing, Exploring, and Comparing Data

Release notes for StatCrunch mid-march 2015 update

Parallelized preprocessing algorithms for high-density oligonucleotide arrays

AB1700 Microarray Data Analysis

Introduction to GE Microarray data analysis Practical Course MolBio 2012

Quick Reference Card. GeneChip Sequence Analysis Software 4.1. I. GSEQ Introduction

LAB 1 INSTRUCTIONS DESCRIBING AND DISPLAYING DATA

proficient in applying mathematics knowledge/skills as specified in the Utah Core State Standards. The student generally content, and engages in

Table of Contents (As covered from textbook)

Gene Expression an Overview of Problems & Solutions: 1&2. Utah State University Bioinformatics: Problems and Solutions Summer 2006

Visualization and PCA with Gene Expression Data

No. of blue jelly beans No. of bags

Digital Image Processing. Prof. P. K. Biswas. Department of Electronic & Electrical Communication Engineering

Agilent Feature Extraction Software (v10.5)

CHAPTER 6. The Normal Probability Distribution

Package affyplm. January 6, 2019

Ex.1 constructing tables. a) find the joint relative frequency of males who have a bachelors degree.

Image Processing

Visualization and PCA with Gene Expression Data. Utah State University Fall 2017 Statistical Bioinformatics (Biomedical Big Data) Notes 4

GSEQ Software User s Guide for AccuID TM

Alternative CDF environments

BD Lyoplate Human Screen Analysis Instructions For analysis using FCS Express or FlowJo and heatmap representation in Excel 2007

Regression III: Advanced Methods

Chapter 5. Track Geometry Data Analysis

Middle School Math Course 2

Vocabulary. 5-number summary Rule. Area principle. Bar chart. Boxplot. Categorical data condition. Categorical variable.

Chapter 3 Set Redundancy in Magnetic Resonance Brain Images

2.1 Objectives. Math Chapter 2. Chapter 2. Variable. Categorical Variable EXPLORING DATA WITH GRAPHS AND NUMERICAL SUMMARIES

Tutorial 7: Automated Peak Picking in Skyline

MATH Grade 6. mathematics knowledge/skills as specified in the standards. support.

Oligonucleotide expression array technology (1) has recently

DataAssist v2.0 Software User Instructions

Bioconductor tutorial

MAGMA joint modelling options and QC read-me (v1.07a)

Example how not to do it: JMP in a nutshell 1 HR, 17 Apr Subject Gender Condition Turn Reactiontime. A1 male filler

R-Packages for Robust Asymptotic Statistics

Kernel Density Estimation (KDE)

Stat 428 Autumn 2006 Homework 2 Solutions

Points Lines Connected points X-Y Scatter. X-Y Matrix Star Plot Histogram Box Plot. Bar Group Bar Stacked H-Bar Grouped H-Bar Stacked

ReadqPCR: Functions to load RT-qPCR data into R

A web app for guided and interactive generation of multimarker panels (

AND NUMERICAL SUMMARIES. Chapter 2

Agilent CytoGenomics 2.0 Feature Extraction for CytoGenomics

Math 7 Glossary Terms

Supporting Information. High-Throughput, Algorithmic Determination of Nanoparticle Structure From Electron Microscopy Images

SVM Classification in -Arrays

miscript mirna PCR Array Data Analysis v1.1 revision date November 2014

SAS Visual Analytics 8.2: Working with Report Content

How many toothpicks are needed for her second pattern? How many toothpicks are needed for her third pattern?

Supervised Clustering of Yeast Gene Expression Data

Multivariate Calibration Quick Guide

Overview. Lecture 13: Graphics and Visualisation. Graphics & Visualisation 2D plotting. Graphics and visualisation of data in Matlab

Unit Title Key Concepts Vocabulary CCS

Learn the various 3D interpolation methods available in GMS

Package RobLoxBioC. August 3, 2018

Metabolomic Data Analysis with MetaboAnalyst

Glossary Common Core Curriculum Maps Math/Grade 6 Grade 8

Bayesian Robust Inference of Differential Gene Expression The bridge package

Exploring cdna Data. Achim Tresch, Andreas Buness, Tim Beißbarth, Florian Hahne, Wolfgang Huber. June 17, 2005

10.1 Overview. Section 10.1: Overview. Section 10.2: Procedure for Generating Prisms. Section 10.3: Prism Meshing Options

Searching & Sorting in Java Bubble Sort

Problem 1: Hello World!

MathMammoth Grade 6. Supplemental Curriculum: Learning Goals/Performance Objectives:

Math 227 EXCEL / MEGASTAT Guide

Office Excel. Charts

Expander Online Documentation

Affymetrix Microarrays

Transcription:

affyqcreport: A Package to Generate QC Reports for Affymetrix Array Data Craig Parman and Conrad Halling April 30, 2018 Contents 1 Introduction 1 2 Getting Started 2 3 Figure Details 3 3.1 Report page 1................................. 3 3.2 Report page 2................................. 3 3.3 Report page 3................................. 3 3.4 Report page 4................................. 6 3.5 Report page 5................................. 6 3.6 Report page 6................................. 9 1 Introduction This document describes an R package for generating QC reports. The goal of this project is to create a tool to allow users of the popular Affymetrix GeneChip 1 arrays to quickly access the data quality of a batch of processed arrays. The package makes use of the affy Gautier et al. (2004) package for reading cell files and generating several of the plots. The QC plot from the package simpleaffy Miller (2005) is also used. Several new plots are generated and a printable pdf file is created. The functions in the package will work on normalized or non-normalized data. The results of these QC procedures will change slightly depending on whether normalization has been done. When in doubt it is recommended that the procedures be run before and after normalization. The example data included in the affydata package can be used to generate a example report. 1 www.affymetrix.com/ 1

2 Getting Started After starting R, the package should be loaded using the following. > library(affyqcreport) This will load affyqcreport as well as the affy and simpleaffy packages and their dependencies. The example data named Dilution which is an object of class AffyBatch his loaded with the following data command. > library(affydata) > data(dilution) To generate an example report simply use the method QCReport R> QCReport(Dilution,file="ExampleQC.pdf") Any valid AffyBatch object can be used as long as the corresponding CDF environment is also available. Phenotypic data contained in the AffyBatch object will be used to group arrays, but it is not required. If the AffyBatch object needs to be created from the cel files, a call directly to the various forms of the ReadAffy method can be used. For example the graphical user interface widget can be used for input as shown below. R> QCReport(ReadAffy(widget=TRUE)) The methods for creating AffyBatch objects are described further in the affy package documentation. The report consists of 6 pages. The first page consists of a list of the sample names and an index number that is used to identify each array in later plots. The second page consists of two plots made using the affy package. The first is a box plot of the pm intensities and the second plot consists of density plot of the log these intensities. The third page is the QC plot generated with the simpleaffy package. This plot shows the 3 : 5 ratios for spiked-in and control genes specific to the array type. Additionally the percentages of present gene calls and background levels are given. The next two pages are generated by analyzing the intensities of the positive and negative control elements on the outer edges of the Affymetrix arrays. The fourth page contains box plots of the intensities of these positive and negative elements. The fifth page is a plot of the center of intensity (COI) for the positive and negative border elements. The sixth page is a heat map of the array-array Spearman rank correlation coefficients of the array intensities. The arrays are ordered using the phenotypic data (if available) in order to place arrays with similar samples adjacent to each other. Arrays of similar expression patterns will have a higher correlation coefficient. If desired each page can be separately generated by a single function call with the AffyBatch object as the argument. For example the following command will generate the titlepage. R> titlepage(dilution) 2

3 Figure Details This section will describe the details of each page of the report and the function call to generate the individual pages. An expample of each page is shown in a figure. 3.1 Report page 1 The first page simple list the names of the arrays and assigns an index number to be used in future plotting. The names taken from the data set by use of the samplenames method of the affy package. These sample names and indexes are also listed on several other plots. An example is shown in Fig. 1. The plot is generated with the following command. R> titlepage(dilution) 3.2 Report page 2 The second page consists of two plots. The first is a boxplot plot of the all pm intensities and the second plot consists of kernel density estimates of these intensities. Both of these methods are defined in the affy package. These plots are useful assessing the overall signal quality for the arrays. Any array with a low average intensity or a significantly different shaped density would be suspect. An example is shown in Fig. 2. The plot is generated with the following command. R> signaldist(dilution) 3.3 Report page 3 The third page is the QC plot from the simpleaffy package. This plot shows the 3 : 5 ratio for spiked-in and control genes specific to the array type. Additionally the percentage of present gene calls and background levels are given. An example is shown in Fig. 3.The plot is described in detail in the document QC and Affymetrix data included in the simpleaffy documentation. The following is an excerpt from that document describing the plot. The figure is plotted from the bottom up with the first chip being at the base of the diagram and the last chip in the QCStats object at the top. If the standard steps for generating a QCStats object are followed, then this corresponds to the order of your samples in the AffyBatch object. Dotted horizontal lines separate the plot into rows, one for each chip. Dotted vertical lines provide a scale from -3 to 3. Each row shows the %present, average background, scale factors and GAPDH / β-actin ratios for an individual chip. 3

AffyBatch QC Report Array Name 1 20A 2 20B 3 10A 4 10B Mon Apr 30 18:47:36 2018 Produced by AffyQCReport R Package Figure 1: First page: Table of arrays in the data set. 4

Small part of dilution study Log2(Intensity) 6 8 12 1 2 3 4 density 0.0 0.4 0.8 1 2 3 4 6 8 10 12 14 log intensity Figure 2: Second page: Boxplot and histograms of pm intensities. 5

ˆ GAPDH 3 : 5 values are plotted as circles. According to Affymetrix they should be about 1. GAPDH values that are considered potential outlier (ratio > 1.25) are coloured red, otherwise they are blue. ˆ β-actin, 3 : 5 ratios are plotted as triangles. Because this is a longer gene, the recommendation is for the 3 : 5 ratios to be below 3; values below 3 are coloured blue, those above, red. ˆ The blue stripe in the image represents the range where scale factors are within 3-fold of the mean for all chips. Scale factors are plotted as a line from the centre line of the image. A line to the left corresponds to a down-scaling, to the right, to an up-scaling. If any scale factors fall outside this 3-fold region, they are all coloured red, otherwise they are blue. ˆ % present and average background, are listed to left of the figure. The plot is generated with the following command. R> plot(qc(dilution)) 3.4 Report page 4 The next two pages are generated by analyzing the positive and negative control elements on the outer edges of the Affymetrix arrays. For each array the intensities for all border elements are collected. Elements with an intensity greater the 1.2 times the mean for that group are assumed to be positive controls. Elements with a signal less that 0.8 of the mean are assumed to be negative controls. This method of separation into positive and negative controls is used so that exact details of the arrangement of these elements is not required. Elements falling in between these cut offs are not used in further calculations. The fourth page consists of box plots of the positive and negative elements. The means and standard deviations of the intensities for each array should be comparable. Large variations in the positive control can indicate non-uniform hybridization or gridding problems. Variations in the negative controls indicate background fluctuations. The plot (shown in Fig. 4) is generated with the following command. R> borderqc1(dilution) 3.5 Report page 5 As a further test, the elements are separated based on which edge of the array they are located. The mean values for the left, right, top, and bottom elements are calculated for positive and negative controls. Once the elements are separated into positive and 6

Figure 3: Third page: simpleaffy QC plot of 3 : 5 ratios and percent present calls. 7

1 2 3 4 200 400 600 800 1000 1200 1400 Positive Border Elements Intensity 1 2 3 4 100 200 300 400 500 Negative Border Elements Intensity Sample Name 1 2 3 4 20A 20B 10A 10B Figure 4: Fourth page: Boxplot of positive and negative feature intensities. 8

negative controls, and further divided by the four locations, the center of intensity (COI) for the controls is calculated. If the hybridization is uniform across the array, the location the COI for the positive elements will be located at the physical center of the array. Any spatial variations in the hybridization, such as those caused by a bubble being present during hybridization, will cause the COI to move from center. Another cause to the COI being off center is a slight misalignment of the grid used to determine the cell intensities. The COI is plotted on a relative scale where the point (0,0) is the center and 1 and -1 represent the edges of the array. Some variation to the COI is expected but an array with visible intensity variations stands out in these plots as an outlier. Any array that where the COI has coordinate with and magnitude greater that 0.5 is flagged by labeling the data point with the array index. A similar plot is made for the negative controls. This plot is a measure of the uniformity of the background across the array. Again arrays where the COI has coordinate with and magnitude greater that 0.5 is flagged. An example is shown in Fig. 5. The plot is generated with the following command. R> borderqc2(dilution) 3.6 Report page 6 The sixth page is a heat map of the array-array Spearman rank correlation coefficients. The arrays are ordered using the phenotypic data (if available) in order to place arrays with similar samples adjacent to each other. Self-self correlations are on the diagonal and by definition have a correlation coefficient of 1.0. Data from similar tissues or treatments will tend to have higher coefficients. This plot is useful for detecting outliers, failed hybridizations, or mistracked samples. See in Fig. 6 for an exaple. Of course caution must be used in deciding if an array should be discarded, because the differences in the expression patterns might be due to interesting biology, not a processing error. The plot is generated with the following command. R> correlationplot(dilution) 9

Positive Elements Negative Elements Y Center of Intensity position 1.0 0.5 0.0 0.5 1.0 Y Center of Intensity position 1.0 0.5 0.0 0.5 1.0 1.0 0.5 0.0 0.5 1.0 1.0 0.5 0.0 0.5 1.0 X Center of Intensity position X Center of Intensity position Sample Name 20A 20B 10A 10B 1 2 3 4 Figure 5: Fifth page: Center of intensity for positive and negative feature intensities. 10

Array Array Intensity Correlation 3 4 1 2 Correlation Coefficient 0.96 0.965 0.97 0.975 0.98 0.985 0.99 0.995 1 3 4 1 2 1 2 1 2 0 0 0 0 10 10 20 20 2.5 1.5 0.5 3 4 1 2 Figure 6: Sixth page: Array-array Spearman rank correlation coefficients 11

References Laurent Gautier, Leslie Cope, Benjamin M. Bolstad, and Rafael A. Irizarry. affy analysis of Affymetrix GeneChip data at the probe level. Bioinformatics, 20(3):307 315, 2004. http://bioinformatics.oupjournals.org/cgi/content/abstract/20/3/307. Crispin J Miller. simpleaffy, 2005. http://bioinformatics.picr.man.ac.uk/simpleaffy/index.jsp. 12