Course on Microarray Gene Expression Analysis

Similar documents
Microarray Data Analysis (V) Preprocessing (i): two-color spotted arrays

/ Computational Genomics. Normalization

Preprocessing -- examples in microarrays

Gene Expression an Overview of Problems & Solutions: 1&2. Utah State University Bioinformatics: Problems and Solutions Summer 2006

DAY 52 BOX-AND-WHISKER

MATH3880 Introduction to Statistics and DNA MATH5880 Statistics and DNA Practical Session Monday, 16 November pm BRAGG Cluster

Package ffpe. October 1, 2018

Introduction to GE Microarray data analysis Practical Course MolBio 2012

Microarray Data Analysis (VI) Preprocessing (ii): High-density Oligonucleotide Arrays

Acquisition Description Exploration Examination Understanding what data is collected. Characterizing properties of data.

STA Rev. F Learning Objectives. Learning Objectives (Cont.) Module 3 Descriptive Measures

Preprocessing Affymetrix Data. Educational Materials 2005 R. Irizarry and R. Gentleman Modified 21 November, 2009, M. Morgan

Data Preprocessing. S1 Teknik Informatika Fakultas Teknologi Informasi Universitas Kristen Maranatha

Measures of Position

Regression on SAT Scores of 374 High Schools and K-means on Clustering Schools

STP 226 ELEMENTARY STATISTICS NOTES PART 2 - DESCRIPTIVE STATISTICS CHAPTER 3 DESCRIPTIVE MEASURES

10.4 Measures of Central Tendency and Variation

10.4 Measures of Central Tendency and Variation

CHAPTER 3: Data Description

The first few questions on this worksheet will deal with measures of central tendency. These data types tell us where the center of the data set lies.

No. of blue jelly beans No. of bags

STA 570 Spring Lecture 5 Tuesday, Feb 1

Chapter 6: DESCRIPTIVE STATISTICS

Chapter 3 - Displaying and Summarizing Quantitative Data

CARMAweb users guide version Johannes Rainer

Normalization: Bioconductor s marray package

STA Module 2B Organizing Data and Comparing Distributions (Part II)

STA Learning Objectives. Learning Objectives (cont.) Module 2B Organizing Data and Comparing Distributions (Part II)

MiChip. Jonathon Blake. October 30, Introduction 1. 5 Plotting Functions 3. 6 Normalization 3. 7 Writing Output Files 3

How do microarrays work

Data Mining and Analytics. Introduction

Gene Clustering & Classification

Gene signature selection to predict survival benefits from adjuvant chemotherapy in NSCLC patients

ECLT 5810 Data Preprocessing. Prof. Wai Lam

CHAPTER 1. Introduction. Statistics: Statistics is the science of collecting, organizing, analyzing, presenting and interpreting data.

Frozen Robust Multi-Array Analysis and the Gene Expression Barcode

Vocabulary. 5-number summary Rule. Area principle. Bar chart. Boxplot. Categorical data condition. Categorical variable.

Expander Online Documentation

MS data processing. Filtering and correcting data. W4M Core Team. 22/09/2015 v 1.0.0

Chapter 3: Data Description - Part 3. Homework: Exercises 1-21 odd, odd, odd, 107, 109, 118, 119, 120, odd

Chapter 1. Looking at Data-Distribution

Further Maths Notes. Common Mistakes. Read the bold words in the exam! Always check data entry. Write equations in terms of variables

15 Wyner Statistics Fall 2013

Sections 4.3 and 4.4

LOESS curve fitted to a population sampled from a sine wave with uniform noise added. The LOESS curve approximates the original sine wave.

Chapter 2 Describing, Exploring, and Comparing Data

Measures of Dispersion

3.3 The Five-Number Summary Boxplots

Exploratory data analysis for microarrays

PROCEDURE HELP PREPARED BY RYAN MURPHY

Learner Expectations UNIT 1: GRAPICAL AND NUMERIC REPRESENTATIONS OF DATA. Sept. Fathom Lab: Distributions and Best Methods of Display

Visualization and PCA with Gene Expression Data. Utah State University Fall 2017 Statistical Bioinformatics (Biomedical Big Data) Notes 4

Statistical Analysis of Metabolomics Data. Xiuxia Du Department of Bioinformatics & Genomics University of North Carolina at Charlotte

3. Data Analysis and Statistics

SUPERVISED LEARNING METHODS. Stanley Liang, PhD Candidate, Lassonde School of Engineering, York University Helix Science Engagement Programs 2018

Chapter 3. Descriptive Measures. Slide 3-2. Copyright 2012, 2008, 2005 Pearson Education, Inc.

Prepare a stem-and-leaf graph for the following data. In your final display, you should arrange the leaves for each stem in increasing order.

Package TilePlot. February 15, 2013

Data processing. Filters and normalisation. Mélanie Pétéra W4M Core Team 31/05/2017 v 1.0.0

SLStats.notebook. January 12, Statistics:

Points Lines Connected points X-Y Scatter. X-Y Matrix Star Plot Histogram Box Plot. Bar Group Bar Stacked H-Bar Grouped H-Bar Stacked

Averages and Variation

Using FARMS for summarization Using I/NI-calls for gene filtering. Djork-Arné Clevert. Institute of Bioinformatics, Johannes Kepler University Linz

Expander 7.2 Online Documentation

CHAPTER 2 DESCRIPTIVE STATISTICS

Bioconductor exercises 1. Exploring cdna data. June Wolfgang Huber and Andreas Buness

Nature Methods: doi: /nmeth Supplementary Figure 1

Section 9: One Variable Statistics

In Minitab interface has two windows named Session window and Worksheet window.

DS Machine Learning and Data Mining I. Alina Oprea Associate Professor, CCIS Northeastern University

Part I. Graphical exploratory data analysis. Graphical summaries of data. Graphical summaries of data

2. (a) Briefly discuss the forms of Data preprocessing with neat diagram. (b) Explain about concept hierarchy generation for categorical data.

Analysis of (cdna) Microarray Data: Part I. Sources of Bias and Normalisation

BIOINFORMATICS. A comparison of normalization methods for high density oligonucleotide array data based on variance and bias

Measures of Position. 1. Determine which student did better

EECS730: Introduction to Bioinformatics

Summarising Data. Mark Lunt 09/10/2018. Arthritis Research UK Epidemiology Unit University of Manchester

IAT 355 Visual Analytics. Data and Statistical Models. Lyn Bartram

Bar Charts and Frequency Distributions

STANDARDS OF LEARNING CONTENT REVIEW NOTES ALGEBRA I. 4 th Nine Weeks,

Data Mining Chapter 3: Visualizing and Exploring Data Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University

FMA901F: Machine Learning Lecture 3: Linear Models for Regression. Cristian Sminchisescu

CPSC 340: Machine Learning and Data Mining. Outlier Detection Fall 2016

After opening Stata for the first time: set scheme s1mono, permanently

ROTS: Reproducibility Optimized Test Statistic

Exploring and Understanding Data Using R.

1.3 Graphical Summaries of Data

Math 120 Introduction to Statistics Mr. Toner s Lecture Notes 3.1 Measures of Central Tendency

Data Mining: Concepts and Techniques

Lecture Notes 3: Data summarization

Package stepnorm. R topics documented: April 10, Version Date

Integrated Math I. IM1.1.3 Understand and use the distributive, associative, and commutative properties.

Name: Date: Period: Chapter 2. Section 1: Describing Location in a Distribution

More Numerical and Graphical Summaries using Percentiles. David Gerard

2.1 Objectives. Math Chapter 2. Chapter 2. Variable. Categorical Variable EXPLORING DATA WITH GRAPHS AND NUMERICAL SUMMARIES

Week 4: Describing data and estimation

STANDARDS OF LEARNING CONTENT REVIEW NOTES. ALGEBRA I Part II. 3 rd Nine Weeks,

Road Map. Data types Measuring data Data cleaning Data integration Data transformation Data reduction Data discretization Summary

Statistics. Lowess. Now, compute the lowess model,, and plot. Finally, plot the data points by themselves, and display the two plots together.

An Introduction to Minitab Statistics 529

Transcription:

Course on Microarray Gene Expression Analysis ::: Normalization methods and data preprocessing Madrid, April 27th, 2011. Gonzalo Gómez ggomez@cnio.es Bioinformatics Unit CNIO

::: Introduction. The probe-level data must be cleaned and processed to obtain biologically meaningful measurement: Background correction: eliminate signals due to non-specific binding; Normalization: make multiple arrays comparable; Preprocessing: Scale linearization, summarization (calculating ratios, log scale transformation), missing values imputation, etc.

::: Why normalize? Normalization techniques are needed to accurately interpret the results: Remove aberrations and outliers due to non-biological factors. Reveal the patterns and outliers of actual biological factors.

::: Why normalize? Sources of variation between multiple high-density oligonucleotidearrays: Biological (e.g., diseased vs. normal) Non-biological: - Total RNA preparation, transcription, amplification, abundances; - Sample labeling differences; - Hybridization parametres; - Scanner differences; - Image analysis; etc. Goal: make multiple arrays comparable.

::: Normalization Assumptions Changes in expression are independent of abundance. Rare transcripts are as likely to change in response to a given stress as common ones. Most transcripts are not differentially expressed in response to a given stress. Expression ratio of typical spot: tumor/control= 1 Range of abundance begins at 0. Negative transcripts = error measurement Outliers are biologically relevant. Average cases are less interesting.

::: Normalization plots Boxplots Whiskers: In most cases represents 1.5 times the box width. Can be customized. IQR: InterQuartile Range the difference between the 75th percentile and 25th percentile Median Outliers <1.5 IQR below 0-25 quartile Outliers >1.5 IQR over 75-100 quartile

::: Normalization plots MA plots (ratio vs intensity) R M = log R logg = log G log R + logg A = = log 2 R G The MA-plots show the relationship between A (the "average signal" [(log R + log G)/2]), where R is the background subtracted red [mean of F635 - median of B635] and G the background subtracted green [mean of F532 - median of B532]) and M (the log [base 2] differential ratio: log(r/g)).

::: Normalization plots AFTER NORMALIZATION BEFORE NORMALIZATION

::: Normalization methods Two main categories for study: Intra-slide normalization (within array). Normalizes expression values to make intensities consistent within each array. Inter-slide normalization (between array). Normalizes expression values to achieve consistency between arrays. Normalization between arrays is usually, but not necessarily, applied after normalization within arrays (an exception in VSN method).

::: Normalization methods Within-array normalization. Median: Subtracts the weighted median from the M-values for each array. Lowess: LOcally WEighted Scatterplot Smoothing, 2-channel microarrays. Utilize a locally weighted polynomial regression of theintensity scatterplot in order to obtain the calibration factor. Print-tip Lowess: Regional Lowess. 2-channel spotted microarrays. Control: Fits a global loess curve through a set of control spots and applies that curve to all the other spots. Robustspline: Normalize the M-values for a single microarray using robustly fitted regression splines and empirical Bayes shrinkage. Between-array normalization. Scale: Simply to scale the log-ratios to have the same median-absolute-deviation across arrays MAS5: Affymetrix RMA: Affymetrix Cyclic Loess: Affymetrix, CodeLink Quantiles: Ensures that the intensities have the same empirical distribution across arrays and across channels. Affymetrix, Agilent Invariant set normalization: Affymetrix VSN: Variance Stabilization Normalization

::: Within-array methods Global Lowess Normalization Lowess is a technique for fitting a smoothing curve to a dataset. An intensity-dependent normalization. Predicted loess value is subtracted from the data to decrease the standard deviation and place the mean log ratio at 0. Lowess normalization may be applied to a two-color array expression dataset. All samples in the dataset are corrected independently. Lowess normalization can be applied to complete or incomplete datasets. Global normalized data {(M,A)} n=1..5184 : M norm = M-c(A) where c(a) is an intensity dependent function. Lowess curves adjustment

::: Within-array methods Print-tip Lowess normalization Oncochip One loess line for each block (print-tip block). 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 Print-tip normalized data {(M,A)} n=1..5184 : M p,norm = M p -c p (A); (1-16) p=print tip where c p (A) is an intensity dependent function for print tip p.

::: Between-array methods Quantiles normalization Bolstad et al, Bioinformatics (2003) Significant variation in the distribution of intensity values across arrays Transforms the distribution of probe intensities to be the same across arrays Final distribution is the average of each quantile across chips Density Log Intensities

::: Between-array methods Quantiles normalization Sort each column of original matrix Take average across rows 1 5 3 5 2 1 6 7 3 2 2 6 4 6 1 8 2 3 4.5 6 1 1 1 5 2 2 2 6 3 5 3 7 4 6 6 8 Substitute each value to corresponding row average 2 2 2 2 3 3 3 3 4.5 4.5 4.5 4.5 6 6 6 6 Unsort columns of matrix to original order 2 4.5 4.5 2 3 2 6 4.5 4.5 3 3 3 6 6 2 6

::: Between-array methods Variance Stabilization Normalization (VSN) Huber W. et al. Bioinformatics (2002) log asinh Log tranform can inflate variance near the background. VSN transformation = asinh(ai + bi * y) Well-defined and meaningful close to 0. Original intensities may be negative. linear log

::: Preprocessing. Log Transformation Common technique used for two-color arrays (one-color as well) Log ratio transformations convert data to a linear scale M= log2(cy5/cy3) A=log2(Cy5*Cy3)*0.5 Ratio scale 0.25 0.5 1 2 4 No Log transform Log scale (lineal) -2-1 0 1 2 After Log transform

::: Preprocessing. Imputing Missing Values Fill with zeros: Replace missing values by zeros. Fill with row average: Replace missing values by the. row average. Fill with row median: Replace missing values by the. row median. K-Nearest Neighbors (KNN) impute: Replace missing values by the average value of the K nearest patterns.

::: Preprocessing. KNN Imputing Machine learning algorithm. Usually euclidean (or Manhattan) metric, confined to the columns for which that gene is NOT missing K < number of samples in smallest class. Class A 1 2 3 4 5 6 7 8 Class B 1 2 3 4 1 2 3 4 5 6 7 8 5 na 7 8 9 10 11 12 13 14 15 16 9 10 na 12 13 14 15 16 9 10 11 12 13 14 15 16

::: Preprocessing. Other Parametres Filter Flat Patterns (no relevant differences between classes): - By number of peaks - By root mean square - By standard deviation Standardize Patterns: Subtracts the mean of the pattern and divide it by the standard deviation (z-score). Allows comparison of observations from different distributions (recommended) Indicates how many standard deviations an observation is above or below the mean.

Thanks! Visit us: http://ubio.bioinfo.cnio.es/ubio/ Gonzalo Gómez ggomez@cnio.es