Visualization and PCA with Gene Expression Data. Utah State University Fall 2017 Statistical Bioinformatics (Biomedical Big Data) Notes 4

Size: px
Start display at page:

Download "Visualization and PCA with Gene Expression Data. Utah State University Fall 2017 Statistical Bioinformatics (Biomedical Big Data) Notes 4"

Transcription

1 Visualization and PCA with Gene Expression Data Utah State University Fall 2017 Statistical Bioinformatics (Biomedical Big Data) Notes 4 1

2 References Chapter 10 of Bioconductor Monograph Ringner (2008). What is principal component analysis? Nature Biotechnology 26: Chapter 8 of Johnson & Wichern s Applied Multivariate Statistical Analysis, 5 th ed. 2

3 MA plot compare samples MA plot: M=Y-X vs. A=0.5(Y+X) Rotate and scale Y vs. X scatterplot For log-scale expression on arrays Y and X: M = log fold-change (Y vs. X) A = average expression For comparing multiple arrays, can create a pseudo-array reference X by taking median for each gene across all samples Sometimes add a Loess curve: locally weighted polynomial regression 3

4 Visualization and efficiency library(affydata); data(dilution) eset <- log2(exprs(dilution)) dim(eset) # 409,600 rows (gene probes), 4 columns (samples) X <- eset[,1]; Y <- eset[,2] M <- Y - X; A <- (Y+X)/2 plot(a,m,main='default M-A plot',pch=16,cex=0.1); abline(h=0) Interpretation: M-direction shows differential expression A-direction shows average expression Look for: systematic changes outliers patterns quality / normalization (larger M-variability, curvature) Problems: 1. loss of information (overlayed points) 2. file size Note: Dilution is from an Affymetrix experiment, only used for illustration here. 4

5 Alternative: show density by color library(geneplotter); library(rcolorbrewer) blues.ramp <- colorramppalette(brewer.pal(9,"blues")[-1]) dc <- denscols(a,m,colramp=blues.ramp) plot(a,m,col=dc,pch=16,cex=0.1,main='color density M-A plot') abline(h=0) Interpretation: color represents density around corresponding point How is this better? visualize - overlayed points What problem(s) remain? file size (one point per probe) 5

6 Alternative: highlight outer points op <- par() par(bg="black", fg="white", col.axis="white", col.lab="white", col.sub="white", col.main="white") dcol <- denscols(a,m,colramp=blues.ramp) plot(a,m,col=dcol,pch=16,cex=0.1, main='color density M-A plot') abline(h=0) par(op) 6

7 Alternative: smooth density smoothscatter(a,m,nbin=250,nrpoints=50,colramp=blues.ramp, main='smooth scatter M-A plot'); abline(h=0) What does this do? 1. smooth the local density using a kernel density estimator 2. black points represent isolated data points - But it can be a bit blurry (creates visual artifacts) 7

8 Alternative: hexagonal binning library(hexbin) hb <- hexbin(cbind(a,m),xbins=40) plot(hb, colramp = blues.ramp, main='hexagonal binning M-A plot') What does this do? essentially discretizes density - Maybe a little clunky, and adding reference lines can be tricky - But probably the safest plot 8

9 # Same as previous slide, but try adding horizontal ref hb <- hexbin(cbind(a,m),xbins=40) plot(hb, colramp = blues.ramp, main='hexagonal binning M-A plot') abline(h=0) 9

10 Image file types and sizes (slide 5 ex.) dcol <- denscols(a,m,colramp=blues.ramp) pdf("c:\\folder\\f1.pdf") plot(a,m,col=dcol,pch=16,cex=0.1,main='m-a plot'); abline(h=0) dev.off() # 3,643 KB postscript("c:\\folder\\f1.ps") plot(a,m,col=dcol,pch=16,cex=0.1,main='m-a plot'); abline(h=0) dev.off() # 25,315 KB jpeg("c:\\folder\\f1.jpg") plot(a,m,col=dcol,pch=16,cex=0.1,main='m-a plot'); abline(h=0) dev.off() # 14 KB png("c:\\folder\\f1.png") plot(a,m,col=dcol,pch=16,cex=0.1,main='m-a plot'); abline(h=0) dev.off() # 9 KB (Note other options for these functions to control size and quality.) 10

11 Principal Components Analysis A common approach in high-dimensional data: reduce dimensionality Notation: X lj = [log-scale] expression / abundance level for variable (gene / protein / metabolite / substance) j in observation (sample) l of the data [so X T expression set matrix] Define i th principal component (like a new variable or column): PC i = a ij X j j (where X j is the j th column of X) 11

12 Principal Components Analysis In the ith principal component: the coefficients a ij are chosen (automatically) so that: PC 1, PC 2, each have the most variation possible PC 1, PC 2, are independent (uncorrelated) PC 1 has most variation, followed by PC 2, then PC 3, i (a ij ) 2 = 1 for each i PC i = j a ij X j 12

13 PCA: Interpretation Size of a ij s indicates importance in variability Example: Suppose a 1j s are large for a certain class of gene / protein / metabolite, but small for other classes. Then PC1 can be interpreted as representing that class Problem: such clean interpretation not guaranteed 13

14 PCA: Visualization with the Biplot Several tools exist, but the biplot is fairly common Represent both observations / samples (rows of X) and variables [genes / proteins / etc.] (columns of X) Observations usually plotted as text labels at coordinates determined by first two PC s Greater interest: Variables plotted as labeled arrows, to coordinate (on arbitrary scale of top and right axes) weight in the first two PC s 14

15 PCA Implementation and Example Problem with high-dimensional wide data If have many more variables than observations Solution: transpose X and convert back to original space [princomp2 function in msprocess package] Example here: Naples data 4 Ctrl patients, 4 diseased (RCM or DCM) patients Focus on C vs. non-c (D or R) differentiation Just use previously-selected set of 20 genes for now (recall Notes 3 slide 23) 15

16 ## Get subset of Naples data (same as in Notes 3, slide 23) url <- " naples <- read.csv(url, row.names=1) head(naples) emat <- as.matrix(naples) gn <- rownames(emat) gn.list <- c("ensg ","ensg ", "ENSG ","ENSG ","ENSG ", "ENSG ","ENSG ","ENSG ", "ENSG ","ENSG ","ENSG ", "ENSG ","ENSG ","ENSG ", "ENSG ","ENSG ","ENSG ", "ENSG ","ENSG ","ENSG ") t <- is.element(gn,gn.list) small.eset <- emat[t,] group <- c('c','r','c','d','c','r','c','d') colnames(small.eset) <- group # Run PCA and visualize result source(" pc <- pc2(t(small.eset), scale = TRUE) 16

17 biplot(pc, cex=c(2,1), xlim=c(-.8,.8)) Biplot arrows show how certain variables put observations in certain parts of the plot. Here, dominant variables are , , , and

18 small.eset C R C D C R C D ENSG ENSG ENSG ENSG ENSG ENSG ENSG ENSG ENSG ENSG ENSG ENSG ENSG ENSG ENSG ENSG ENSG ENSG ENSG ENSG Why would , , , and dominate? 18

19 Caution with high-dimensional PCA With a large # of variables (genes, dimensions, etc.), need to be aware of scale Variables with large variance (perhaps due to scale) will dominate principal components Need to standardize variables (genes; rows here) if they are measured on scales with widely different ranges or if the units of measurement are not [comparable] (Johnson & Wichern text) Log-scale expression values often better behaved 19

20 After standardizing rows of small.eset: Now able to detect effects, both huge and subtle: C > non-c for , , , and C < non-c for 68137, 20

21 Scree plot shows variance of the first few PC s Best to have majority of variance accounted for by the first two (or so) PC s 21

22 # standardize each row X <- log(small.eset+1) # Steps omitted here; each row of X standardized. # Think about how you could do it! pcx <- pc2(t(x), scale=true) biplot(pcx, xlim=c(-.8,.6), cex=c(2,1)) screeplot(pcx) 22

23 Side note: archived package Package msprocess was removed from the CRAN repository. Formerly available versions can be obtained from the archive. (But only there as source, not compiled package) We may return to this package later, but for now, we only need the princomp2 function: This function performs principal component analysis (PCA) for wide data x, i.e. dim(x)[1] < dim(x)[2]. This kind of data can be handled by princomp in S-PLUS but not in R. The trick is to do PCA for t(x) first and then convert back to the original space. It might be more efficient than princomp for high dimensional data. The pc2 function sourced in (previous slide) is a modified (but fully functional) version 23

24 # Try a 3-d biplot library(biplotgui) Biplots(Data=t(X)) # actually, using t(log(small.eset+1)) # looks similar here # lower-left: External --> In 3D 24

25 Summary Visualization efficiency Avoid overlaying points Conserve file size (note file type) Choose meaningful color palette Principal Components Emphasize major sources of variability No guaranteed interpretability (no accounting for bio.) Pay attention to scale of variables 25

26 Warnings why end with this? What can be gained from clustering? general ideas, possible structures; NOT - statistical inference (at least not in the same way) Be wary of: elaborate pictures non-normalized data unjustifiable decisions clustering method distance function color scheme claims of - proof (maybe support ) arbitrary decisions What is clustering best for? exploratory data analysis and summary, NOT statistical inference 26

Visualization and PCA with Gene Expression Data

Visualization and PCA with Gene Expression Data References Visualization and PCA with Gene Expression Data Utah State University Spring 2014 STAT 5570: Statistical Bioinformatics Notes 2.4 Chapter 10 of Bioconductor Monograph (course text) Ringner (2008).

More information

Gene Expression an Overview of Problems & Solutions: 1&2. Utah State University Bioinformatics: Problems and Solutions Summer 2006

Gene Expression an Overview of Problems & Solutions: 1&2. Utah State University Bioinformatics: Problems and Solutions Summer 2006 Gene Expression an Overview of Problems & Solutions: 1&2 Utah State University Bioinformatics: Problems and Solutions Summer 2006 Review DNA mrna Proteins action! mrna transcript abundance ~ expression

More information

Course on Microarray Gene Expression Analysis

Course on Microarray Gene Expression Analysis Course on Microarray Gene Expression Analysis ::: Normalization methods and data preprocessing Madrid, April 27th, 2011. Gonzalo Gómez ggomez@cnio.es Bioinformatics Unit CNIO ::: Introduction. The probe-level

More information

Distance Measures for Gene Expression (and other high-dimensional) Data

Distance Measures for Gene Expression (and other high-dimensional) Data Distance Measures for Gene Expression (and other high-dimensional) Data Utah State University Spring 04 STAT 5570: Statistical Bioinformatics Notes. References Chapter of Bioconductor Monograph (course

More information

Statistical Analysis of Metabolomics Data. Xiuxia Du Department of Bioinformatics & Genomics University of North Carolina at Charlotte

Statistical Analysis of Metabolomics Data. Xiuxia Du Department of Bioinformatics & Genomics University of North Carolina at Charlotte Statistical Analysis of Metabolomics Data Xiuxia Du Department of Bioinformatics & Genomics University of North Carolina at Charlotte Outline Introduction Data pre-treatment 1. Normalization 2. Centering,

More information

CPSC 340: Machine Learning and Data Mining. Principal Component Analysis Fall 2017

CPSC 340: Machine Learning and Data Mining. Principal Component Analysis Fall 2017 CPSC 340: Machine Learning and Data Mining Principal Component Analysis Fall 2017 Assignment 3: 2 late days to hand in tonight. Admin Assignment 4: Due Friday of next week. Last Time: MAP Estimation MAP

More information

Acquisition Description Exploration Examination Understanding what data is collected. Characterizing properties of data.

Acquisition Description Exploration Examination Understanding what data is collected. Characterizing properties of data. Summary Statistics Acquisition Description Exploration Examination what data is collected Characterizing properties of data. Exploring the data distribution(s). Identifying data quality problems. Selecting

More information

Data Mining Chapter 3: Visualizing and Exploring Data Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University

Data Mining Chapter 3: Visualizing and Exploring Data Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University Data Mining Chapter 3: Visualizing and Exploring Data Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University Exploratory data analysis tasks Examine the data, in search of structures

More information

Week 7 Picturing Network. Vahe and Bethany

Week 7 Picturing Network. Vahe and Bethany Week 7 Picturing Network Vahe and Bethany Freeman (2005) - Graphic Techniques for Exploring Social Network Data The two main goals of analyzing social network data are identification of cohesive groups

More information

CSE 547: Machine Learning for Big Data Spring Problem Set 2. Please read the homework submission policies.

CSE 547: Machine Learning for Big Data Spring Problem Set 2. Please read the homework submission policies. CSE 547: Machine Learning for Big Data Spring 2019 Problem Set 2 Please read the homework submission policies. 1 Principal Component Analysis and Reconstruction (25 points) Let s do PCA and reconstruct

More information

Part I, Chapters 4 & 5. Data Tables and Data Analysis Statistics and Figures

Part I, Chapters 4 & 5. Data Tables and Data Analysis Statistics and Figures Part I, Chapters 4 & 5 Data Tables and Data Analysis Statistics and Figures Descriptive Statistics 1 Are data points clumped? (order variable / exp. variable) Concentrated around one value? Concentrated

More information

CBioVikings. Richard Röttger. Copenhagen February 2 nd, Clustering of Biomedical Data

CBioVikings. Richard Röttger. Copenhagen February 2 nd, Clustering of Biomedical Data CBioVikings Copenhagen February 2 nd, Richard Röttger 1 Who is talking? 2 Resources Go to http://imada.sdu.dk/~roettger/teaching/cbiovikings.php You will find The dataset These slides An overview paper

More information

9/29/13. Outline Data mining tasks. Clustering algorithms. Applications of clustering in biology

9/29/13. Outline Data mining tasks. Clustering algorithms. Applications of clustering in biology 9/9/ I9 Introduction to Bioinformatics, Clustering algorithms Yuzhen Ye (yye@indiana.edu) School of Informatics & Computing, IUB Outline Data mining tasks Predictive tasks vs descriptive tasks Example

More information

Tips and Guidance for Analyzing Data. Executive Summary

Tips and Guidance for Analyzing Data. Executive Summary Tips and Guidance for Analyzing Data Executive Summary This document has information and suggestions about three things: 1) how to quickly do a preliminary analysis of time-series data; 2) key things to

More information

Lecture 27: Review. Reading: All chapters in ISLR. STATS 202: Data mining and analysis. December 6, 2017

Lecture 27: Review. Reading: All chapters in ISLR. STATS 202: Data mining and analysis. December 6, 2017 Lecture 27: Review Reading: All chapters in ISLR. STATS 202: Data mining and analysis December 6, 2017 1 / 16 Final exam: Announcements Tuesday, December 12, 8:30-11:30 am, in the following rooms: Last

More information

Click Trust to launch TableView.

Click Trust to launch TableView. Visualizing Expression data using the Co-expression Tool Web service and TableView Introduction. TableView was written by James (Jim) E. Johnson and colleagues at the University of Minnesota Center for

More information

Exploratory data analysis for microarrays

Exploratory data analysis for microarrays Exploratory data analysis for microarrays Jörg Rahnenführer Computational Biology and Applied Algorithmics Max Planck Institute for Informatics D-66123 Saarbrücken Germany NGFN - Courses in Practical DNA

More information

The first thing we ll need is some numbers. I m going to use the set of times and drug concentration levels in a patient s bloodstream given below.

The first thing we ll need is some numbers. I m going to use the set of times and drug concentration levels in a patient s bloodstream given below. Graphing in Excel featuring Excel 2007 1 A spreadsheet can be a powerful tool for analyzing and graphing data, but it works completely differently from the graphing calculator that you re used to. If you

More information

affyqcreport: A Package to Generate QC Reports for Affymetrix Array Data

affyqcreport: A Package to Generate QC Reports for Affymetrix Array Data affyqcreport: A Package to Generate QC Reports for Affymetrix Array Data Craig Parman and Conrad Halling April 30, 2018 Contents 1 Introduction 1 2 Getting Started 2 3 Figure Details 3 3.1 Report page

More information

Data Mining. CS57300 Purdue University. Bruno Ribeiro. February 1st, 2018

Data Mining. CS57300 Purdue University. Bruno Ribeiro. February 1st, 2018 Data Mining CS57300 Purdue University Bruno Ribeiro February 1st, 2018 1 Exploratory Data Analysis & Feature Construction How to explore a dataset Understanding the variables (values, ranges, and empirical

More information

Points Lines Connected points X-Y Scatter. X-Y Matrix Star Plot Histogram Box Plot. Bar Group Bar Stacked H-Bar Grouped H-Bar Stacked

Points Lines Connected points X-Y Scatter. X-Y Matrix Star Plot Histogram Box Plot. Bar Group Bar Stacked H-Bar Grouped H-Bar Stacked Plotting Menu: QCExpert Plotting Module graphs offers various tools for visualization of uni- and multivariate data. Settings and options in different types of graphs allow for modifications and customizations

More information

Lecture 25: Review I

Lecture 25: Review I Lecture 25: Review I Reading: Up to chapter 5 in ISLR. STATS 202: Data mining and analysis Jonathan Taylor 1 / 18 Unsupervised learning In unsupervised learning, all the variables are on equal standing,

More information

Clustering analysis of gene expression data

Clustering analysis of gene expression data Clustering analysis of gene expression data Chapter 11 in Jonathan Pevsner, Bioinformatics and Functional Genomics, 3 rd edition (Chapter 9 in 2 nd edition) Human T cell expression data The matrix contains

More information

Exploratory Data Analysis EDA

Exploratory Data Analysis EDA Exploratory Data Analysis EDA Luc Anselin http://spatial.uchicago.edu 1 from EDA to ESDA dynamic graphics primer on multivariate EDA interpretation and limitations 2 From EDA to ESDA 3 Exploratory Data

More information

CPSC 340: Machine Learning and Data Mining. Multi-Dimensional Scaling Fall 2017

CPSC 340: Machine Learning and Data Mining. Multi-Dimensional Scaling Fall 2017 CPSC 340: Machine Learning and Data Mining Multi-Dimensional Scaling Fall 2017 Assignment 4: Admin 1 late day for tonight, 2 late days for Wednesday. Assignment 5: Due Monday of next week. Final: Details

More information

Your Name: Section: INTRODUCTION TO STATISTICAL REASONING Computer Lab #4 Scatterplots and Regression

Your Name: Section: INTRODUCTION TO STATISTICAL REASONING Computer Lab #4 Scatterplots and Regression Your Name: Section: 36-201 INTRODUCTION TO STATISTICAL REASONING Computer Lab #4 Scatterplots and Regression Objectives: 1. To learn how to interpret scatterplots. Specifically you will investigate, using

More information

Chapter 1. Using the Cluster Analysis. Background Information

Chapter 1. Using the Cluster Analysis. Background Information Chapter 1 Using the Cluster Analysis Background Information Cluster analysis is the name of a multivariate technique used to identify similar characteristics in a group of observations. In cluster analysis,

More information

CS6220: DATA MINING TECHNIQUES

CS6220: DATA MINING TECHNIQUES CS6220: DATA MINING TECHNIQUES 2: Data Pre-Processing Instructor: Yizhou Sun yzsun@ccs.neu.edu September 10, 2013 2: Data Pre-Processing Getting to know your data Basic Statistical Descriptions of Data

More information

CIE L*a*b* color model

CIE L*a*b* color model CIE L*a*b* color model To further strengthen the correlation between the color model and human perception, we apply the following non-linear transformation: with where (X n,y n,z n ) are the tristimulus

More information

Nature Methods: doi: /nmeth Supplementary Figure 1

Nature Methods: doi: /nmeth Supplementary Figure 1 Supplementary Figure 1 Schematic representation of the Workflow window in Perseus All data matrices uploaded in the running session of Perseus and all processing steps are displayed in the order of execution.

More information

Dimension reduction : PCA and Clustering

Dimension reduction : PCA and Clustering Dimension reduction : PCA and Clustering By Hanne Jarmer Slides by Christopher Workman Center for Biological Sequence Analysis DTU The DNA Array Analysis Pipeline Array design Probe design Question Experimental

More information

Math background. 2D Geometric Transformations. Implicit representations. Explicit representations. Read: CS 4620 Lecture 6

Math background. 2D Geometric Transformations. Implicit representations. Explicit representations. Read: CS 4620 Lecture 6 Math background 2D Geometric Transformations CS 4620 Lecture 6 Read: Chapter 2: Miscellaneous Math Chapter 5: Linear Algebra Notation for sets, functions, mappings Linear transformations Matrices Matrix-vector

More information

Microarray Data Analysis (V) Preprocessing (i): two-color spotted arrays

Microarray Data Analysis (V) Preprocessing (i): two-color spotted arrays Microarray Data Analysis (V) Preprocessing (i): two-color spotted arrays Preprocessing Probe-level data: the intensities read for each of the components. Genomic-level data: the measures being used in

More information

CPSC 340: Machine Learning and Data Mining. Kernel Trick Fall 2017

CPSC 340: Machine Learning and Data Mining. Kernel Trick Fall 2017 CPSC 340: Machine Learning and Data Mining Kernel Trick Fall 2017 Admin Assignment 3: Due Friday. Midterm: Can view your exam during instructor office hours or after class this week. Digression: the other

More information

Data Mining. Data preprocessing. Hamid Beigy. Sharif University of Technology. Fall 1395

Data Mining. Data preprocessing. Hamid Beigy. Sharif University of Technology. Fall 1395 Data Mining Data preprocessing Hamid Beigy Sharif University of Technology Fall 1395 Hamid Beigy (Sharif University of Technology) Data Mining Fall 1395 1 / 15 Table of contents 1 Introduction 2 Data preprocessing

More information

Data Mining. Data preprocessing. Hamid Beigy. Sharif University of Technology. Fall 1394

Data Mining. Data preprocessing. Hamid Beigy. Sharif University of Technology. Fall 1394 Data Mining Data preprocessing Hamid Beigy Sharif University of Technology Fall 1394 Hamid Beigy (Sharif University of Technology) Data Mining Fall 1394 1 / 15 Table of contents 1 Introduction 2 Data preprocessing

More information

Knowledge Discovery and Data Mining

Knowledge Discovery and Data Mining Knowledge Discovery and Data Mining Basis Functions Tom Kelsey School of Computer Science University of St Andrews http://www.cs.st-andrews.ac.uk/~tom/ tom@cs.st-andrews.ac.uk Tom Kelsey ID5059-02-BF 2015-02-04

More information

STA 490H1S Initial Examination of Data

STA 490H1S Initial Examination of Data Initial Examination of Data Alison L. Department of Statistics University of Toronto Winter 2011 Course mantra It s OK not to know. Expressing ignorance is encouraged. It s not OK to not have a willingness

More information

Goals of the Lecture. SOC6078 Advanced Statistics: 9. Generalized Additive Models. Limitations of the Multiple Nonparametric Models (2)

Goals of the Lecture. SOC6078 Advanced Statistics: 9. Generalized Additive Models. Limitations of the Multiple Nonparametric Models (2) SOC6078 Advanced Statistics: 9. Generalized Additive Models Robert Andersen Department of Sociology University of Toronto Goals of the Lecture Introduce Additive Models Explain how they extend from simple

More information

Part I. Graphical exploratory data analysis. Graphical summaries of data. Graphical summaries of data

Part I. Graphical exploratory data analysis. Graphical summaries of data. Graphical summaries of data Week 3 Based in part on slides from textbook, slides of Susan Holmes Part I Graphical exploratory data analysis October 10, 2012 1 / 1 2 / 1 Graphical summaries of data Graphical summaries of data Exploratory

More information

Principal Component Image Interpretation A Logical and Statistical Approach

Principal Component Image Interpretation A Logical and Statistical Approach Principal Component Image Interpretation A Logical and Statistical Approach Md Shahid Latif M.Tech Student, Department of Remote Sensing, Birla Institute of Technology, Mesra Ranchi, Jharkhand-835215 Abstract

More information

Package densityclust

Package densityclust Type Package Package densityclust October 24, 2017 Title Clustering by Fast Search and Find of Density Peaks Version 0.3 Date 2017-10-24 Maintainer Thomas Lin Pedersen An improved

More information

Workload Characterization Techniques

Workload Characterization Techniques Workload Characterization Techniques Raj Jain Washington University in Saint Louis Saint Louis, MO 63130 Jain@cse.wustl.edu These slides are available on-line at: http://www.cse.wustl.edu/~jain/cse567-08/

More information

Advanced Applied Multivariate Analysis

Advanced Applied Multivariate Analysis Advanced Applied Multivariate Analysis STAT, Fall 3 Sungkyu Jung Department of Statistics University of Pittsburgh E-mail: sungkyu@pitt.edu http://www.stat.pitt.edu/sungkyu/ / 3 General Information Course

More information

Package pca3d. February 17, 2017

Package pca3d. February 17, 2017 Package pca3d Type Package Title Three Dimensional PCA Plots Version 0.10 Date 2017-02-17 Author January Weiner February 17, 2017 URL http://logfc.wordpress.com Maintainer January Weiner

More information

Cluster Analysis. Mu-Chun Su. Department of Computer Science and Information Engineering National Central University 2003/3/11 1

Cluster Analysis. Mu-Chun Su. Department of Computer Science and Information Engineering National Central University 2003/3/11 1 Cluster Analysis Mu-Chun Su Department of Computer Science and Information Engineering National Central University 2003/3/11 1 Introduction Cluster analysis is the formal study of algorithms and methods

More information

Hierarchical clustering

Hierarchical clustering Hierarchical clustering Rebecca C. Steorts, Duke University STA 325, Chapter 10 ISL 1 / 63 Agenda K-means versus Hierarchical clustering Agglomerative vs divisive clustering Dendogram (tree) Hierarchical

More information

CPSC 340: Machine Learning and Data Mining. Principal Component Analysis Fall 2016

CPSC 340: Machine Learning and Data Mining. Principal Component Analysis Fall 2016 CPSC 340: Machine Learning and Data Mining Principal Component Analysis Fall 2016 A2/Midterm: Admin Grades/solutions will be posted after class. Assignment 4: Posted, due November 14. Extra office hours:

More information

CPSC 340: Machine Learning and Data Mining. Deep Learning Fall 2018

CPSC 340: Machine Learning and Data Mining. Deep Learning Fall 2018 CPSC 340: Machine Learning and Data Mining Deep Learning Fall 2018 Last Time: Multi-Dimensional Scaling Multi-dimensional scaling (MDS): Non-parametric visualization: directly optimize the z i locations.

More information

Data Preprocessing. Javier Béjar. URL - Spring 2018 CS - MAI 1/78 BY: $\

Data Preprocessing. Javier Béjar. URL - Spring 2018 CS - MAI 1/78 BY: $\ Data Preprocessing Javier Béjar BY: $\ URL - Spring 2018 C CS - MAI 1/78 Introduction Data representation Unstructured datasets: Examples described by a flat set of attributes: attribute-value matrix Structured

More information

Image Processing. Image Features

Image Processing. Image Features Image Processing Image Features Preliminaries 2 What are Image Features? Anything. What they are used for? Some statements about image fragments (patches) recognition Search for similar patches matching

More information

/ Computational Genomics. Normalization

/ Computational Genomics. Normalization 10-810 /02-710 Computational Genomics Normalization Genes and Gene Expression Technology Display of Expression Information Yeast cell cycle expression Experiments (over time) baseline expression program

More information

A Survey on Pre-processing and Post-processing Techniques in Data Mining

A Survey on Pre-processing and Post-processing Techniques in Data Mining , pp. 99-128 http://dx.doi.org/10.14257/ijdta.2014.7.4.09 A Survey on Pre-processing and Post-processing Techniques in Data Mining Divya Tomar and Sonali Agarwal Indian Institute of Information Technology,

More information

Metabolomic Data Analysis with MetaboAnalyst

Metabolomic Data Analysis with MetaboAnalyst Metabolomic Data Analysis with MetaboAnalyst User ID: guest6522519400069885256 April 14, 2009 1 Data Processing and Normalization 1.1 Reading and Processing the Raw Data MetaboAnalyst accepts a variety

More information

CREATING THE DISTRIBUTION ANALYSIS

CREATING THE DISTRIBUTION ANALYSIS Chapter 12 Examining Distributions Chapter Table of Contents CREATING THE DISTRIBUTION ANALYSIS...176 BoxPlot...178 Histogram...180 Moments and Quantiles Tables...... 183 ADDING DENSITY ESTIMATES...184

More information

Stat 290: Lab 2. Introduction to R/S-Plus

Stat 290: Lab 2. Introduction to R/S-Plus Stat 290: Lab 2 Introduction to R/S-Plus Lab Objectives 1. To introduce basic R/S commands 2. Exploratory Data Tools Assignment Work through the example on your own and fill in numerical answers and graphs.

More information

Multivariate Analysis

Multivariate Analysis Multivariate Analysis Project 1 Jeremy Morris February 20, 2006 1 Generating bivariate normal data Definition 2.2 from our text states that we can transform a sample from a standard normal random variable

More information

3 Graphical Displays of Data

3 Graphical Displays of Data 3 Graphical Displays of Data Reading: SW Chapter 2, Sections 1-6 Summarizing and Displaying Qualitative Data The data below are from a study of thyroid cancer, using NMTR data. The investigators looked

More information

Database Repository and Tools

Database Repository and Tools Database Repository and Tools John Matese May 9, 2008 What is the Repository? Save and exchange retrieved and analyzed datafiles Perform datafile manipulations (averaging and annotations) Run specialized

More information

Scientific Graphing in Excel 2007

Scientific Graphing in Excel 2007 Scientific Graphing in Excel 2007 When you start Excel, you will see the screen below. Various parts of the display are labelled in red, with arrows, to define the terms used in the remainder of this overview.

More information

Data preprocessing Functional Programming and Intelligent Algorithms

Data preprocessing Functional Programming and Intelligent Algorithms Data preprocessing Functional Programming and Intelligent Algorithms Que Tran Høgskolen i Ålesund 20th March 2017 1 Why data preprocessing? Real-world data tend to be dirty incomplete: lacking attribute

More information

( ) = Y ˆ. Calibration Definition A model is calibrated if its predictions are right on average: ave(response Predicted value) = Predicted value.

( ) = Y ˆ. Calibration Definition A model is calibrated if its predictions are right on average: ave(response Predicted value) = Predicted value. Calibration OVERVIEW... 2 INTRODUCTION... 2 CALIBRATION... 3 ANOTHER REASON FOR CALIBRATION... 4 CHECKING THE CALIBRATION OF A REGRESSION... 5 CALIBRATION IN SIMPLE REGRESSION (DISPLAY.JMP)... 5 TESTING

More information

Statistical graphics in analysis Multivariable data in PCP & scatter plot matrix. Paula Ahonen-Rainio Maa Visual Analysis in GIS

Statistical graphics in analysis Multivariable data in PCP & scatter plot matrix. Paula Ahonen-Rainio Maa Visual Analysis in GIS Statistical graphics in analysis Multivariable data in PCP & scatter plot matrix Paula Ahonen-Rainio Maa-123.3530 Visual Analysis in GIS 11.11.2015 Topics today YOUR REPORTS OF A-2 Thematic maps with charts

More information

Normalization: Bioconductor s marray package

Normalization: Bioconductor s marray package Normalization: Bioconductor s marray package Yee Hwa Yang 1 and Sandrine Dudoit 2 October 30, 2017 1. Department of edicine, University of California, San Francisco, jean@biostat.berkeley.edu 2. Division

More information

Density estimation. In density estimation problems, we are given a random from an unknown density. Our objective is to estimate

Density estimation. In density estimation problems, we are given a random from an unknown density. Our objective is to estimate Density estimation In density estimation problems, we are given a random sample from an unknown density Our objective is to estimate? Applications Classification If we estimate the density for each class,

More information

Calculating a PCA and a MDS on a fingerprint data set

Calculating a PCA and a MDS on a fingerprint data set BioNumerics Tutorial: Calculating a PCA and a MDS on a fingerprint data set 1 Aim Principal Components Analysis (PCA) and Multi Dimensional Scaling (MDS) are two alternative grouping techniques that can

More information

Multivariate Standard Normal Transformation

Multivariate Standard Normal Transformation Multivariate Standard Normal Transformation Clayton V. Deutsch Transforming K regionalized variables with complex multivariate relationships to K independent multivariate standard normal variables is an

More information

Sec 4.1 Coordinates and Scatter Plots. Coordinate Plane: Formed by two real number lines that intersect at a right angle.

Sec 4.1 Coordinates and Scatter Plots. Coordinate Plane: Formed by two real number lines that intersect at a right angle. Algebra I Chapter 4 Notes Name Sec 4.1 Coordinates and Scatter Plots Coordinate Plane: Formed by two real number lines that intersect at a right angle. X-axis: The horizontal axis Y-axis: The vertical

More information

18.02 Multivariable Calculus Fall 2007

18.02 Multivariable Calculus Fall 2007 MIT OpenCourseWare http://ocw.mit.edu 18.02 Multivariable Calculus Fall 2007 For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms. 18.02 Problem Set 4 Due Thursday

More information

Gene signature selection to predict survival benefits from adjuvant chemotherapy in NSCLC patients

Gene signature selection to predict survival benefits from adjuvant chemotherapy in NSCLC patients 1 Gene signature selection to predict survival benefits from adjuvant chemotherapy in NSCLC patients 1,2 Keyue Ding, Ph.D. Nov. 8, 2014 1 NCIC Clinical Trials Group, Kingston, Ontario, Canada 2 Dept. Public

More information

explorase: Multivariate exploratory analysis and visualization for systems biology

explorase: Multivariate exploratory analysis and visualization for systems biology explorase: Multivariate exploratory analysis and visualization for systems biology Michael Lawrence, Dianne Cook, Eun-Kyung Lee April 30, 2018 1 Overview of explorase As its name suggests, explorase is

More information

Using the DATAMINE Program

Using the DATAMINE Program 6 Using the DATAMINE Program 304 Using the DATAMINE Program This chapter serves as a user s manual for the DATAMINE program, which demonstrates the algorithms presented in this book. Each menu selection

More information

3 Graphical Displays of Data

3 Graphical Displays of Data 3 Graphical Displays of Data Reading: SW Chapter 2, Sections 1-6 Summarizing and Displaying Qualitative Data The data below are from a study of thyroid cancer, using NMTR data. The investigators looked

More information

Density estimation. In density estimation problems, we are given a random from an unknown density. Our objective is to estimate

Density estimation. In density estimation problems, we are given a random from an unknown density. Our objective is to estimate Density estimation In density estimation problems, we are given a random sample from an unknown density Our objective is to estimate? Applications Classification If we estimate the density for each class,

More information

CSE 6242 A / CS 4803 DVA. Feb 12, Dimension Reduction. Guest Lecturer: Jaegul Choo

CSE 6242 A / CS 4803 DVA. Feb 12, Dimension Reduction. Guest Lecturer: Jaegul Choo CSE 6242 A / CS 4803 DVA Feb 12, 2013 Dimension Reduction Guest Lecturer: Jaegul Choo CSE 6242 A / CS 4803 DVA Feb 12, 2013 Dimension Reduction Guest Lecturer: Jaegul Choo Data is Too Big To Do Something..

More information

Learn What s New. Statistical Software

Learn What s New. Statistical Software Statistical Software Learn What s New Upgrade now to access new and improved statistical features and other enhancements that make it even easier to analyze your data. The Assistant Data Customization

More information

Drug versus Disease (DrugVsDisease) package

Drug versus Disease (DrugVsDisease) package 1 Introduction Drug versus Disease (DrugVsDisease) package The Drug versus Disease (DrugVsDisease) package provides a pipeline for the comparison of drug and disease gene expression profiles where negatively

More information

SPSS Basics for Probability Distributions

SPSS Basics for Probability Distributions Built-in Statistical Functions in SPSS Begin by defining some variables in the Variable View of a data file, save this file as Probability_Distributions.sav and save the corresponding output file as Probability_Distributions.spo.

More information

Chapter 2 Modeling Distributions of Data

Chapter 2 Modeling Distributions of Data Chapter 2 Modeling Distributions of Data Section 2.1 Describing Location in a Distribution Describing Location in a Distribution Learning Objectives After this section, you should be able to: FIND and

More information

Automated Bioinformatics Analysis System on Chip ABASOC. version 1.1

Automated Bioinformatics Analysis System on Chip ABASOC. version 1.1 Automated Bioinformatics Analysis System on Chip ABASOC version 1.1 Phillip Winston Miller, Priyam Patel, Daniel L. Johnson, PhD. University of Tennessee Health Science Center Office of Research Molecular

More information

3. Data Preprocessing. 3.1 Introduction

3. Data Preprocessing. 3.1 Introduction 3. Data Preprocessing Contents of this Chapter 3.1 Introduction 3.2 Data cleaning 3.3 Data integration 3.4 Data transformation 3.5 Data reduction SFU, CMPT 740, 03-3, Martin Ester 84 3.1 Introduction Motivation

More information

Chemometrics. Description of Pirouette Algorithms. Technical Note. Abstract

Chemometrics. Description of Pirouette Algorithms. Technical Note. Abstract 19-1214 Chemometrics Technical Note Description of Pirouette Algorithms Abstract This discussion introduces the three analysis realms available in Pirouette and briefly describes each of the algorithms

More information

What s New in iogas Version 6.3

What s New in iogas Version 6.3 What s New in iogas Version 6.3 November 2017 Contents Import Tools... 2 Support for IMDEXHUB-IQ Database... 2 Import Random Subsample... 2 Stereonet Contours... 2 Gridding and contouring... 2 Analysis

More information

Minitab 17 commands Prepared by Jeffrey S. Simonoff

Minitab 17 commands Prepared by Jeffrey S. Simonoff Minitab 17 commands Prepared by Jeffrey S. Simonoff Data entry and manipulation To enter data by hand, click on the Worksheet window, and enter the values in as you would in any spreadsheet. To then save

More information

Visual Encoding Design

Visual Encoding Design CSE 442 - Data Visualization Visual Encoding Design Jeffrey Heer University of Washington Review: Expressiveness & Effectiveness / APT Choosing Visual Encodings Assume k visual encodings and n data attributes.

More information

Scaling Techniques in Political Science

Scaling Techniques in Political Science Scaling Techniques in Political Science Eric Guntermann March 14th, 2014 Eric Guntermann Scaling Techniques in Political Science March 14th, 2014 1 / 19 What you need R RStudio R code file Datasets You

More information

Multiple Linear Regression

Multiple Linear Regression Multiple Linear Regression Rebecca C. Steorts, Duke University STA 325, Chapter 3 ISL 1 / 49 Agenda How to extend beyond a SLR Multiple Linear Regression (MLR) Relationship Between the Response and Predictors

More information

2. Data Preprocessing

2. Data Preprocessing 2. Data Preprocessing Contents of this Chapter 2.1 Introduction 2.2 Data cleaning 2.3 Data integration 2.4 Data transformation 2.5 Data reduction Reference: [Han and Kamber 2006, Chapter 2] SFU, CMPT 459

More information

Two-Stage Least Squares

Two-Stage Least Squares Chapter 316 Two-Stage Least Squares Introduction This procedure calculates the two-stage least squares (2SLS) estimate. This method is used fit models that include instrumental variables. 2SLS includes

More information

Using FARMS for summarization Using I/NI-calls for gene filtering. Djork-Arné Clevert. Institute of Bioinformatics, Johannes Kepler University Linz

Using FARMS for summarization Using I/NI-calls for gene filtering. Djork-Arné Clevert. Institute of Bioinformatics, Johannes Kepler University Linz Software Manual Institute of Bioinformatics, Johannes Kepler University Linz Using FARMS for summarization Using I/NI-calls for gene filtering Djork-Arné Clevert Institute of Bioinformatics, Johannes Kepler

More information

Introduction to Homogeneous coordinates

Introduction to Homogeneous coordinates Last class we considered smooth translations and rotations of the camera coordinate system and the resulting motions of points in the image projection plane. These two transformations were expressed mathematically

More information

CHAPTER 2 Modeling Distributions of Data

CHAPTER 2 Modeling Distributions of Data CHAPTER 2 Modeling Distributions of Data 2.2 Density Curves and Normal Distributions The Practice of Statistics, 5th Edition Starnes, Tabor, Yates, Moore Bedford Freeman Worth Publishers Density Curves

More information

PCOMP http://127.0.0.1:55825/help/topic/com.rsi.idl.doc.core/pcomp... IDL API Reference Guides > IDL Reference Guide > Part I: IDL Command Reference > Routines: P PCOMP Syntax Return Value Arguments Keywords

More information

Tutorial on Machine Learning Tools

Tutorial on Machine Learning Tools Tutorial on Machine Learning Tools Yanbing Xue Milos Hauskrecht Why do we need these tools? Widely deployed classical models No need to code from scratch Easy-to-use GUI Outline Matlab Apps Weka 3 UI TensorFlow

More information

DI TRANSFORM. The regressive analyses. identify relationships

DI TRANSFORM. The regressive analyses. identify relationships July 2, 2015 DI TRANSFORM MVstats TM Algorithm Overview Summary The DI Transform Multivariate Statistics (MVstats TM ) package includes five algorithm options that operate on most types of geologic, geophysical,

More information

Resting state network estimation in individual subjects

Resting state network estimation in individual subjects Resting state network estimation in individual subjects Data 3T NIL(21,17,10), Havard-MGH(692) Young adult fmri BOLD Method Machine learning algorithm MLP DR LDA Network image Correlation Spatial Temporal

More information

Preprocessing Short Lecture Notes cse352. Professor Anita Wasilewska

Preprocessing Short Lecture Notes cse352. Professor Anita Wasilewska Preprocessing Short Lecture Notes cse352 Professor Anita Wasilewska Data Preprocessing Why preprocess the data? Data cleaning Data integration and transformation Data reduction Discretization and concept

More information

Data Mining: Concepts and Techniques. (3 rd ed.) Chapter 3. Chapter 3: Data Preprocessing. Major Tasks in Data Preprocessing

Data Mining: Concepts and Techniques. (3 rd ed.) Chapter 3. Chapter 3: Data Preprocessing. Major Tasks in Data Preprocessing Data Mining: Concepts and Techniques (3 rd ed.) Chapter 3 1 Chapter 3: Data Preprocessing Data Preprocessing: An Overview Data Quality Major Tasks in Data Preprocessing Data Cleaning Data Integration Data

More information

May 1, CODY, Error Backpropagation, Bischop 5.3, and Support Vector Machines (SVM) Bishop Ch 7. May 3, Class HW SVM, PCA, and K-means, Bishop Ch

May 1, CODY, Error Backpropagation, Bischop 5.3, and Support Vector Machines (SVM) Bishop Ch 7. May 3, Class HW SVM, PCA, and K-means, Bishop Ch May 1, CODY, Error Backpropagation, Bischop 5.3, and Support Vector Machines (SVM) Bishop Ch 7. May 3, Class HW SVM, PCA, and K-means, Bishop Ch 12.1, 9.1 May 8, CODY Machine Learning for finding oil,

More information

Data Processing and Analysis in Systems Medicine. Milena Kraus Data Management for Digital Health Summer 2017

Data Processing and Analysis in Systems Medicine. Milena Kraus Data Management for Digital Health Summer 2017 Milena Kraus Digital Health Summer Agenda Real-world Use Cases Oncology Nephrology Heart Insufficiency Additional Topics Data Management & Foundations Biology Recap Data Sources Data Formats Business Processes

More information