/ Computational Genomics. Normalization

Similar documents
Course on Microarray Gene Expression Analysis

Preprocessing -- examples in microarrays

Microarray Data Analysis (V) Preprocessing (i): two-color spotted arrays

Introduction to GE Microarray data analysis Practical Course MolBio 2012

Microarray Data Analysis (VI) Preprocessing (ii): High-density Oligonucleotide Arrays

Gene Expression an Overview of Problems & Solutions: 1&2. Utah State University Bioinformatics: Problems and Solutions Summer 2006

Gene signature selection to predict survival benefits from adjuvant chemotherapy in NSCLC patients

Towards an Optimized Illumina Microarray Data Analysis Pipeline

Gene Clustering & Classification

MATH3880 Introduction to Statistics and DNA MATH5880 Statistics and DNA Practical Session Monday, 16 November pm BRAGG Cluster

SVM Classification in -Arrays

MICROARRAY IMAGE SEGMENTATION USING CLUSTERING METHODS

Drug versus Disease (DrugVsDisease) package

Analyzing ICAT Data. Analyzing ICAT Data

Analysis of (cdna) Microarray Data: Part I. Sources of Bias and Normalisation

Expander Online Documentation

BIOINFORMATICS. A comparison of normalization methods for high density oligonucleotide array data based on variance and bias

Exploratory data analysis for microarrays

How do microarrays work

Expander 7.2 Online Documentation

Computer Exercise - Microarray Analysis using Bioconductor

Class Discovery and Prediction of Tumor with Microarray Data

Micro-array Image Analysis using Clustering Methods

ROTS: Reproducibility Optimized Test Statistic

Introduction to the xps Package: Comparison to Affymetrix Power Tools (APT)

Frozen Robust Multi-Array Analysis and the Gene Expression Barcode

Review of feature selection techniques in bioinformatics by Yvan Saeys, Iñaki Inza and Pedro Larrañaga.

Using FARMS for summarization Using I/NI-calls for gene filtering. Djork-Arné Clevert. Institute of Bioinformatics, Johannes Kepler University Linz

Genomics - Problem Set 2 Part 1 due Friday, 1/26/2018 by 9:00am Part 2 due Friday, 2/2/2018 by 9:00am

Organizing, cleaning, and normalizing (smoothing) cdna microarray data

Applying Data-Driven Normalization Strategies for qpcr Data Using Bioconductor

Nature Publishing Group

PROCEDURE HELP PREPARED BY RYAN MURPHY

Guide to Microarray Analysis

SEEK User Manual. Introduction

Statistical Analysis of Metabolomics Data. Xiuxia Du Department of Bioinformatics & Genomics University of North Carolina at Charlotte

Preprocessing Affymetrix Data. Educational Materials 2005 R. Irizarry and R. Gentleman Modified 21 November, 2009, M. Morgan

How to use the DEGseq Package

Weka ( )

Application of Hierarchical Clustering to Find Expression Modules in Cancer

Bayesian Analysis of Differential Gene Expression

CS6716 Pattern Recognition

Supervised Clustering of Yeast Gene Expression Data

User s Guide for R Routines to Perform Reference Marker Normalization

Evaluating Machine-Learning Methods. Goals for the lecture

Genomics - Problem Set 2 Part 1 due Friday, 1/25/2019 by 9:00am Part 2 due Friday, 2/1/2019 by 9:00am

Incorporating Known Pathways into Gene Clustering Algorithms for Genetic Expression Data

Feature selection. LING 572 Fei Xia

FMA901F: Machine Learning Lecture 3: Linear Models for Regression. Cristian Sminchisescu

From microarray images to Biological knowledge. Junior Barrera BIOINFO-USP DCC/IME-USP

Acquisition Description Exploration Examination Understanding what data is collected. Characterizing properties of data.

CARMAweb users guide version Johannes Rainer

Evaluation Measures. Sebastian Pölsterl. April 28, Computer Aided Medical Procedures Technische Universität München

High throughput Data Analysis 2. Cluster Analysis

Microarray data analysis

COSC160: Detection and Classification. Jeremy Bolton, PhD Assistant Teaching Professor

Min Wang. April, 2003

Gene expression & Clustering (Chapter 10)

Clustering Techniques

Evaluation of different biological data and computational classification methods for use in protein interaction prediction.

Package SCAN.UPC. October 9, Type Package. Title Single-channel array normalization (SCAN) and University Probability of expression Codes (UPC)

Big Data Methods. Chapter 5: Machine learning. Big Data Methods, Chapter 5, Slide 1

Title: Optimized multilayer perceptrons for molecular classification and diagnosis using genomic data

Fuzzy C-means with Bi-dimensional Empirical Mode Decomposition for Segmentation of Microarray Image

1 Principles 1 2 Data Types 1 3 Normalization 3. 5 Clustering 7 6 Statistical Analysis 13 7 Conclusion 16 4 Dataset Filtering and Management 4

CompClustTk Manual & Tutorial

Support Vector Machines: Brief Overview" November 2011 CPSC 352

EECS 730 Introduction to Bioinformatics Microarray. Luke Huan Electrical Engineering and Computer Science

Description of gcrma package

Identifying differentially expressed genes with siggenes

Supplementary information: Detection of differentially expressed segments in tiling array data

Announcements. CS 188: Artificial Intelligence Spring Generative vs. Discriminative. Classification: Feature Vectors. Project 4: due Friday.

Biosphere: the interoperation of web services in microarray cluster analysis

Things you ll know (or know better to watch out for!) when you leave in December: 1. What you can and cannot infer from graphs.

Resampling Methods. Levi Waldron, CUNY School of Public Health. July 13, 2016

Exercise 1 Review. --outfiltermismatchnmax : max number of mismatch (Default 10) --outreadsunmapped fastx: output unmapped reads

Release Notes. JMP Genomics. Version 4.0

Noise-based Feature Perturbation as a Selection Method for Microarray Data

Dimension reduction : PCA and Clustering

Stat 342 Exam 3 Fall 2014

Cluster Analysis for Microarray Data

COMPUTATIONAL INTELLIGENCE SEW (INTRODUCTION TO MACHINE LEARNING) SS18. Lecture 6: k-nn Cross-validation Regularization

Evaluating Classifiers

CANCER PREDICTION USING PATTERN CLASSIFICATION OF MICROARRAY DATA. By: Sudhir Madhav Rao &Vinod Jayakumar Instructor: Dr.

Classification by Nearest Shrunken Centroids and Support Vector Machines

Normalization: Bioconductor s marray package

SVM in Analysis of Cross-Sectional Epidemiological Data Dmitriy Fradkin. April 4, 2005 Dmitriy Fradkin, Rutgers University Page 1

GS Analysis of Microarray Data

MIT 801. Machine Learning I. [Presented by Anna Bosman] 16 February 2018

Metrics for Performance Evaluation How to evaluate the performance of a model? Methods for Performance Evaluation How to obtain reliable estimates?

mirnet Tutorial Starting with expression data

All About PlexSet Technology Data Analysis in nsolver Software

A Reliable and Distributed LIMS for Efficient Management of the Microarray Experiment Environment

CS 229 Final Project Report Learning to Decode Cognitive States of Rat using Functional Magnetic Resonance Imaging Time Series

Gene Expression Clustering with Functional Mixture Models

Image Processing. Bilkent University. CS554 Computer Vision Pinar Duygulu

Evaluation. Evaluate what? For really large amounts of data... A: Use a validation set.

Exon Probeset Annotations and Transcript Cluster Groupings

TAIR User guide. TAIR User Guide Version 1.0 1

Cross-validation and the Bootstrap

Transcription:

10-810 /02-710 Computational Genomics Normalization

Genes and Gene Expression Technology Display of Expression Information

Yeast cell cycle expression Experiments (over time) baseline expression program gene 1 0 10 20 70 80 genes Higher expression compared to baseline Lower expression compared to baseline Spellman et al Mol. Biol. Cell 1998

ROS response

Visualization: Relative vs. absolute expression

Exercising the Genome 600 Conditions/Mutations 6200 Genes Environment Single-gene Mutations

Using annotation databases Statistical tests to identify the overlap with various functional categories

Genome wide binding experiments (transcription factors) no binding by this TF gene 1 TF1 TF8 genes Probably bound by this TF Lee et al Science 2002

What you should know The basic idea behind microarray profiling The two different microarray technologies Pros and cons for each Noise factors in microarray experiments (more next)

Gene expression analysis

Gene Expression Analysis Model Computational information fusion Biological regulatory networks Pattern Recognition Data Analysis clustering, classification normalization, miss. value estimation functional assignment, response programs diff. expressed genes Experimental Design array design, number of repeats experiment selection

Experiment design A number of computational issues should be addressed: Selecting short subsequences for oligo arrays to minimize cross hybridizations Determining the number of replicates for each sample Sampling rates for time series experiments

Data analysis Normalization Combining results from replicates Identifying differentially expressed genes Dealing with missing values Static vs. time series

Data analysis Normalization Combining results from replicates Identifying differentially expressed genes Dealing with missing values Static vs. time series

Typical experiment: replicates healthy cancer Technical replicates: same sample using multiple arrays Dye swap: reverse the color code between arrays Clinical replicates: samples from different individuals Many experiments have all three kinds of replicates

Normalization Single Channel Background Correction Probe Set Summarization (Affymetrix) Between Array Normalization Normalization Two Channel Array

Background Correction On Affymetrix major normalization software (e.g. MAS 5, dchip, RMA, gcrma) have different approaches to background correction On Agilent the feature extraction software gives various options on ways to do background correction Image from Agilent User Manual

Agilent Background Correction Table from Agilent User Manual

Normalization Single Channel Background Correction Between Array Normalization Normalization Probe Set Summarization (Affymetrix) Two Channel Array

In theory these really should have the same distribution Image from Terry Speed s Slides Ideally normalization should remove variation between microarray arrays unrelated to the biologly of interest, without removing the variation due to the biology of interest

Two repeats from Agilent mouse expression data

Normalization: Assumptions In order to normalize between arrays we need to make certain assumptions. Methods we will discuss will assume (with decreasing order of assumption strength): 1. Values are exactly the same between the arrays (though different genes may be assigned different values in each array). 2. Values are normally distributed with the same mean and variance across arrays. 3. Some of the genes do not change between arrays and thus should have relatively similar values (rank invariant).

Three Major Approaches to Between Array Normalization Quantile Normalization Scale Factor Normalization Invariant Set Normalization 1. Values are exactly the same between arrays (though different genes may be assigned different values in each array). 2. Values are normally distributed with the same mean and variance across arrays. 3. Some of the genes do not change between arrays and thus should have relatively similar values (rank invariant).

Three Major Approaches to Between Array Normalization Quantile Normalization Scale Factor Normalization Invariant Set Normalization 1. Values are exactly the same between arrays (though different genes may be assigned different values in each array). 2. Values are normally distributed with the same mean and variance across arrays. 3. Some of the genes do not change between arrays and thus should have relatively similar values (rank invariant).

Normalizing across arrays Consider the following two sets of values:

Lets put them together

Normalizing between arrays The first step in the analysis of microarray data in a given experiment is to normalize between the different arrays. Simple assumption: mrna quantity is the same for all arrays M j = 1 n n! i= 1 y i j Where n and T are the total number of genes and arrays, respectfully. M j is known as the sample mean M Next we transform each value to make all arrays have the same mean: j j j yˆ i = yi! M + M = 1 T T! j= 1 M j

Normalizing the mean

Variance normalization In many cases normalizing the mean is not enough. We may further assume that the variance should be the same for each array Implicitly we assume that the expression distribution is the same for all arrays (though different genes may change in each of the arrays) Here V j is the sample variance. Next, we transform each value as follows:!! = = = " = T j j n i j j j V T V M y n V i 1 1 2 1 ) ( 1 ( ) j j j i j i V V M M y y +! = ˆ

Normalizing mean and variance

Mean Scaling Before After

Distribution is still not identical

Three Major Approaches to Between Array Normalization Quantile Normalization Scale Factor Normalization Invariant Set Normalization 1. Values are exactly the same between arrays (though different genes may be assigned different values in each array). 2. Values are normally distributed with the same mean and variance across arrays. 3. Some of the genes do not change between arrays and thus should have relatively similar values (rank invariant).

Quantile Normalization Maps the distribution of expression values for every microarray experiment to the same target distribution This is the approach of Terry Speed s group s RMA software Also an option in Agilent s Gene Spring Software

Mapping raw data to normalized data Image from Terry Speed s Slides

Normalized distribution For each order all the values in the array Take the mean (or median) value in each position Sort Mean 1 9 2 2 2 7 8 4 6 5 8 3 C B A 8 9 7 5 8 6 2 4 3 1 2 2 C B A 8 6.33 3 1.67 Normalized

Quantile Example Image from Terry Speed s Slides

Three Major Approaches to Between Array Normalization Quantile Normalization Scale Factor Normalization Invariant Set Normalization 1. Values are exactly the same between arrays (though different genes may be assigned different values in each array). 2. Values are normally distributed with the same mean and variance across arrays. 3. Some of the genes do not change between arrays and thus should have relatively similar values (rank invariant).

Invariant Set Normalization Normalization based on probes that do not change between data Housekeeping Genes Control Probes Rank invariant genes This is the approach of Wing Wong s dchip software

dchip Invariant Set Normalization Select one baseline array Default Median Intensity All other arrays are normalized against baseline Computationally finds rank invariant probes Different genes may be used for different arrays Running Median curve through rank invariant probes becomes the new y=x curve, all values adjust accordingly

dchip on Replicate Data Spline on All Unnormalized Data Running Median Line Normalized Data Q-Q Plot Image From Li and Wong, Genome Biology 2001

dchip on data with true differentially expressed genes Spline on All Unnormalized Data Running Median Line Normalized Data Q-Q Plot Differently expressed genes not selected as rank invariant Image From Li and Wong, Genome Biology 2001

Normalizing Across Very Different Distributions can be Problematic May want to normalize subsets from different tissues separately Image From dchip manual

cdna arrays: ratios healthy cancer In many experiments we are interested in the ratio between two samples For example - Cancer vs. healthy - Progression of disease (ratio to time point 0)

Transformation While ratios are useful, they are not symmetric. If R = 2*G then R/G = 2 while G/R = ½ This makes it hard to visualize the different changes Instead, we use a log transform, and focus on the log ratio: y i = R i log = log Ri! Gi logg i Empirical studies have also shown that in microarray experiments the log ratio of (most) genes tends to be normally distributed

Normalizing between array: Locally weighted linear regression Normalizing the mean and the variance works well if the variance is independent of the measured value. However, this is not the case in gene expression. For microarrays it turns out that the variance is value dependent.

Locally weighted linear regression Instead of computing a single mean and variance for each array, we can compute different means and variances for different expression values. Given two arrays, R and G, or two channels on the same array, we plot on the x axis the (log) of their intensity and on the y axis their ratio We are interested in normalizing the average (log) expression ratio for the different intensity values

Computing local mean and variance Setting may work, however, it requires that many genes have the same x value, which is usually not the case Instead, we can use a weighted sum where the weight is propotional to the distance of the point from x:!! = = " = = x x i x x i i i x m y k x v y k x m 2 )) ( ( 1 ) ( 1 ) (!!!! " = = i i i i i i i i i i x m y x w x w x v y x w x w x m 2 )) ( ) ( ( ) ( 1 ) ( ) ( ) ( 1 ) ( ( ) ) ( ) ( ) ( ) ˆ( x v V M x m x y x y +! =

Determining the weights There are a number of ways to determine the weights Here we will use a Gaussian centered at x, such that 2 ( x# x i ) # 2 w( ) = 2! x i 1 2"! e σ 2 is a parameter that should be selected by the user

Locally weighted regression: Results Original values normalized values

Oligo arrays: Negative values In many cases oligo array can return values that are less than 0 (Why?) There are a number of ways to handle these values The most common is to threshold at a certain positive value A more sophisticated way is to use the negative values to learn something about the variance of the specific gene

Data analysis Normalization Combining results from replicates Identifying differentially expressed genes Dealing with missing values Static vs. time series

Motivation In many cases, this is the goal of the experiment. Such genes can be key to understanding what goes wrong / or get fixed under certain condition (cancer, stress etc.). In other cases, these genes can be used as features for a classifier. These genes can also serve as a starting point for a model for the system being studied (e.g. cell cycle, phermone response etc.).

Problems As mentioned in the previous lecture, differences in expression values can result from many different noise sources. Our goal is to identify the real differences, that is, differences that can be explained by the various errors introduced during the experimental phase. Need to understand both the experimental protocol and take into account the underlying biology / chemistry

The wrong way During the early days (though some continue to do this today) the common method was to select genes based on their fold change between experiments. The common value was 2 (or absolute log of 1). Obviously this method is not perfect

Significance bands for Affy arrays

Value dependent variance

Typical experiment: replicates healthy cancer Technical replicates: same sample using multiple arrays Dye swap: reverse the color code between arrays Clinical replicates: samples from different individuals Many experiments have all three kinds of replicates

What you should know The different noise factors that influence microarray results The assumptions and corresponding methods for normalizing oligo (single channel) arrays. Methods for normalizing two channel (cdna) arrays.