Workshop: R and Bioinformatics


1 Workshop: R and Bioinformatics
Jean Monlong & Simon Papillon
Human Genetics department
October 28,

2 Why use R for bioinformatics?
- Flexible statistics and data visualization software.
- Many packages and a vast community: Bioconductor.
- Simple and easy compared to other computing languages.

3 Today's workshop
Goal
- Explain and demonstrate some key principles for doing good bioinformatics.
- How to structure the analysis, write efficient scripts, and explore your data.
- What to do and what not to do.
- This is NOT a package tutorial, but it will point to useful resources.
Learning from your mistakes
You will get errors today. Before raising your hand:
- Check your command for typos.
- Try to understand the error message.
- Check your input/objects.

4 Functions
Previously on HGSS workshops:
function  Defines a function.
return    Defines what the function returns.
All objects created within the function are temporary.
Structure
    myfunctionname <- function(input.obj1, second.input.obj) {
      # Instructions on input.obj1 and second.input.obj ...
      return(my.output.obj)
    }
    myfunctionname(1, c(2, 4, 5))
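As a concrete instance of the structure above, here is a minimal sketch. The function name `scale.values` and its arguments are illustrative choices, not part of the workshop material.

```r
# Hypothetical example: multiply a vector by a factor and return the result.
scale.values <- function(values, factor = 2) {
  scaled <- values * factor   # 'scaled' only exists while the function runs
  return(scaled)
}

res <- scale.values(c(2, 4, 5), factor = 10)
print(res)
print(exists("scaled"))   # FALSE here: the temporary object is gone after the call
```

Only the returned value survives the call, which is why the result has to be assigned (`res <- ...`) to be reused.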

5 Conditions
Logical tests
==         Are both values equal.
> or >=    Is the left value greater (greater or equal) than the right value.
< or <=    Is the left value smaller (smaller or equal) than the right value.
!          Is a NOT operator that negates the value of a test.
|          Is an OR operator used to combine logical tests. Returns TRUE if either is TRUE.
&          Is an AND operator used to combine logical tests. Returns TRUE if both are TRUE.
Example
    test <- 2 + 2 == 4    ## (TRUE)
    !test                 ## (FALSE)
    test | !test          ## (TRUE)
    test & !test          ## (FALSE)

6 Conditions
Boolean
Any logical test can be vectorized (compare 2 vectors).
which  Returns the indices of the TRUE values in a vector.
Example
    5:8 == 6                            ## FALSE,TRUE,FALSE,FALSE
    5:8 >= 6 & 5:8 <= 7                 ## FALSE,TRUE,TRUE,FALSE
    c(TRUE, TRUE) & c(TRUE, FALSE)      ## TRUE,FALSE
    c(TRUE, FALSE) | c(FALSE, FALSE)    ## TRUE,FALSE
    which(5:10 == 6)                    ## 2
    which(5:10 > 6)                     ## 3,4,5,6

7 Conditions - Exercise
Exercise
Create a function that:
1. removes values below 3 from a vector.
2. removes values below a specified threshold from a vector.
For more advanced users
Have a look at these functions on logical vectors:
- any, %in%.
- sum, mean, table.
Extra tips
- Don't filter out, keep in.
- Use the boolean vector directly between [ ].
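One possible way to approach this exercise is sketched below. The function name `keep.above` and the test vector are made up for illustration; the sketch follows the two tips by building the logical vector of values to keep and using it directly between [ ].

```r
# Keep the values that are at or above a threshold (default 3).
keep.above <- function(values, threshold = 3) {
  values[values >= threshold]   # the logical vector is used directly as an index
}

x <- c(1, 5, 2, 3, 8)
keep.above(x)       # part 1: fixed threshold of 3
keep.above(x, 5)    # part 2: user-specified threshold

# Functions on logical vectors mentioned for advanced users:
any(x >= 8)         # is at least one value at or above 8?
sum(x >= 3)         # how many values are kept
mean(x >= 3)        # proportion of values kept
```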

8 Testing conditions
if else
Test a condition: if TRUE run some instructions, if FALSE run something else (or nothing).
    if( Condition ){
      ... Instructions
    }
Example
    if(length(luckynumbers) > 3){
      cat("too many lucky numbers.")
      luckynumbers = luckynumbers[1:3]
    } else if(length(luckynumbers) == 3){
      cat("just enough lucky numbers.")
    } else {
      cat("you need more lucky numbers.")
    }

9 Loops
for loops
Iterate over the elements of a container and run instructions.
    for(v in vec){
      ... Instructions
    }
while loops
Run instructions as long as a condition is TRUE.
    while( CONDITION ){
      ... Instructions
    }
Example
    facto = 1
    for(n in 1:10){
      facto = facto * n
    }

10 Exercises
if else
Create a function that classifies the average value of a vector. It returns:
- "low" if the average is below 3.
- "medium" if the average is between 3 and 7.
- "high" if the average is above 7.
Loops
Write a function that computes the mean of each column of a matrix:
1. using the apply function.
2. using a for loop.
3. (using a while loop.)
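The two exercises above could be solved along these lines. This is only a sketch: the function names are invented here, and treating averages of exactly 3 or exactly 7 as "medium" is an assumption, since the slide does not say how the boundaries should be handled.

```r
# Classify the average of a vector (boundaries 3 and 7 counted as "medium").
classify.mean <- function(values) {
  avg <- mean(values)
  if (avg < 3) {
    return("low")
  } else if (avg <= 7) {
    return("medium")
  } else {
    return("high")
  }
}

# Column means with apply: MARGIN = 2 means "over columns".
col.means.apply <- function(mat) apply(mat, 2, mean)

# Column means with a for loop: create the result first, then fill it.
col.means.for <- function(mat) {
  res <- numeric(ncol(mat))
  for (j in seq_len(ncol(mat))) {
    res[j] <- mean(mat[, j])
  }
  res
}

m <- matrix(1:12, nrow = 3)
classify.mean(c(1, 2, 3))
col.means.apply(m)
col.means.for(m)
```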

11 Important principles
Scripting
Write scripts of your analysis:
- Keeping track, easy rerun, easy parameter tweaking.
- RStudio or other interfaces (Emacs+ESS, ...).
Clear and modular code
- Define clear analysis steps.
- Write function(s) for each step.
- Keeps the data and parameters used clear (for you and for R).
- No confusing temporary objects.
- No repeated code.
- More suitable for apply-like usage.
- Easy parameter tweaking.
Efficiency matters
- Data structure and manipulation.
- Especially relevant with our large data.

12 Data exploration - All you can plot
Answer your questions with plots!
Utility
- Get an idea of the quality of the data and potential issues.
- Get a full answer.
- Detect potential biases.
- (Find unexpected results.)
Throughout your analysis
- Quality control plots at the beginning.
- Control plots after each step.
- Awesome plots of your results.

13 Today's dataset - Methylation analysis
Goal
Analyze methylation data following those principles.
DNA methylation data
- Identify sample outliers (and remove them).
- Identify co-variates (sex, age) using PCA.
- Use a heatmap to plot sample groupings.
- Point out differentially methylated sites.
- Plot methylation levels of interesting sites.
The data set
    beta_value   # The methylation data
    probe_data   # The annotation of each probe
    pheno_data   # The annotation of each sample (metadata)
Get to know your data
Check the first few rows of each object (remember the head function).
Check the type and size of your data (str).

14 Importing large files more efficiently
Extra parameters in read.table
colClasses  A vector with the data type of each column, e.g. "character", "numeric".
nrows       The number of rows to read. Potentially in combination with system and wc.
Read a file line by line (or by chunk)
    con = file(file.name, "r")   # open the connection once so each read resumes where the last one stopped
    while(length(line <- readLines(con, n=1)) > 0){
      ... Instructions
    }
    close(con)
To test the performance
    system.time({
      ... Instructions
    })
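To make the options above concrete, the sketch below writes a small temporary file and reads it back twice: once with `colClasses` and `nrows` declared, once in chunks through an open connection. The file content, column names and chunk size are invented for the example.

```r
# Write a small example file so the sketch is self-contained.
tmp.file <- tempfile(fileext = ".tsv")
example.df <- data.frame(probe = paste0("p", 1:1000), value = runif(1000))
write.table(example.df, tmp.file, sep = "\t", quote = FALSE, row.names = FALSE)

# Declaring the column types and the number of rows spares read.table some guessing.
fast.df <- read.table(tmp.file, header = TRUE, sep = "\t",
                      colClasses = c("character", "numeric"), nrows = 1000)

# Read the same file in chunks of 100 lines through an open connection.
con <- file(tmp.file, open = "r")
header <- readLines(con, n = 1)          # first line holds the column names
n.lines <- 0
while (length(chunk <- readLines(con, n = 100)) > 0) {
  n.lines <- n.lines + length(chunk)     # replace with the real per-chunk work
}
close(con)
unlink(tmp.file)
```

Opening the connection explicitly matters: on an unopened connection, each `readLines` call starts again from the top of the file.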

15 Manipulating large files: a classical error
Don't do this!
Concatenate iteratively on big data.
    mymatrix = NULL
    for(i in 1:100000){
      ... Instructions
      mymatrix = rbind(mymatrix, mynewline)
    }
Instead do this
Create the data and then fill it.
    mymatrix = matrix(NA, 100000, 100)
    for(i in 1:100000){
      ... Instructions
      mymatrix[i,] = mynewline
    }
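The difference between the two patterns can be measured with `system.time`. The sketch below uses a much smaller matrix than the slide (2,000 rows of 10 values, an arbitrary choice) so that it runs quickly; the function names are illustrative.

```r
n <- 2000

# Pattern to avoid: the matrix is copied and extended at every iteration.
grow <- function(n) {
  m <- NULL
  for (i in 1:n) m <- rbind(m, rep(i, 10))
  m
}

# Preferred pattern: allocate once, then fill row by row.
prealloc <- function(n) {
  m <- matrix(NA, n, 10)
  for (i in 1:n) m[i, ] <- rep(i, 10)
  m
}

t.grow <- system.time(m1 <- grow(n))
t.pre  <- system.time(m2 <- prealloc(n))
print(rbind(grow = t.grow, prealloc = t.pre)[, "elapsed"])
```

Both functions build the same matrix; only the elapsed time differs, and the gap widens as the number of rows increases.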

16 Identify outliers
Checking the distribution
- Identify the case samples.
- Take a look at the density of those samples.
- Are there any samples that stand out?
Useful functions
    density(x)         # Compute the density of x
    plot(x)            # Create a plot of x
    plot(density(x))   # Plot the density of x
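A sketch of how per-sample densities could be overlaid to spot a sample that stands out. The matrix `sim.beta` is simulated and only stands in for the workshop's `beta_value`; the last sample is deliberately given a different distribution.

```r
set.seed(1)
# Simulated stand-in for beta_value: 500 probes (rows) x 6 samples (columns).
sim.beta <- matrix(rbeta(500 * 6, 0.5, 0.5), nrow = 500,
                   dimnames = list(NULL, paste0("sample", 1:6)))
sim.beta[, 6] <- rbeta(500, 5, 5)   # make the last sample deliberately odd

# One density per sample, computed on the same [0, 1] range.
dens <- lapply(seq_len(ncol(sim.beta)), function(j) {
  density(sim.beta[, j], from = 0, to = 1)
})
names(dens) <- colnames(sim.beta)

# Overlay all densities on the same axes, one color per sample.
y.max <- max(sapply(dens, function(d) max(d$y)))
plot(dens[[1]], ylim = c(0, y.max), xlab = "beta value",
     main = "Methylation density per sample")
for (i in 2:length(dens)) lines(dens[[i]], col = i)
```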

17 Quality Control plots
Aim
Assess the quality of your data and potential artefacts that could bias your analysis.
Basic approaches
- Principal Component Analysis: representing the largest variation in the data.
- Clustering: summarizing similarity relations between samples/genes.
- Heatmaps: summary of the clustering on both rows and columns.
- Testing metadata: gender, age, ...

18 Quality Control plots - Functions
PCA using prcomp
PCA of the matrix columns; plot of the variance explained by the first PCs; representation of the rows using the first two PCs.
    peeseaaye = prcomp(input.matrix)
    plot(peeseaaye)
    plot(peeseaaye$x, type="n")
    text(peeseaaye$x, labels=rownames(input.matrix))
Clustering using hclust
Clustering using a distance matrix, e.g. from correlation between columns.
    cor.dist = as.dist(1-cor(input.matrix))
    euc.dist = dist(input.matrix)
    kleusteur = hclust(cor.dist, method="ward.D")   # this method was called "ward" before R 3.1.0
    plot(kleusteur)
    library(MASS)
    mdees = isoMDS(cor.dist)
    plot(mdees$points)

19 Heatmaps
Using heatmap function
    heatmap(input.matrix)

20 Application - Identify outliers
Using the PCA
- Using all samples, plot the first 2 Principal Components.
- Color the samples according to their Sample Group status.
- How do the outliers identified previously behave?
- Remove those samples from your data set (don't forget to propagate the changes in all objects!).
PCA example
    pca <- prcomp(t(beta_value))
    plot(pca$x)
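The steps above could look like the following sketch. It runs on simulated objects (`sim.beta`, `sim.pheno`) rather than the workshop data, and the sample flagged as an outlier is picked arbitrarily for the sake of the example.

```r
set.seed(2)
# Simulated stand-ins: 200 probes x 10 samples, plus a sample annotation table.
sim.beta <- matrix(runif(200 * 10), nrow = 200,
                   dimnames = list(paste0("probe", 1:200), paste0("s", 1:10)))
sim.pheno <- data.frame(sample = colnames(sim.beta),
                        group = rep(c("control", "case"), each = 5))
is.case <- sim.pheno$group == "case"
sim.beta[1:50, is.case] <- sim.beta[1:50, is.case] / 2   # planted group effect

# prcomp expects observations in rows, hence the transposition.
pca <- prcomp(t(sim.beta))
group.col <- ifelse(is.case, "red", "blue")
plot(pca$x[, 1:2], col = group.col, pch = 19)

# Removing samples: apply the SAME logical vector to every object.
outliers <- c("s3")                       # hypothetical outlier
keep <- !(colnames(sim.beta) %in% outliers)
sim.beta <- sim.beta[, keep]
sim.pheno <- sim.pheno[keep, ]
```

Building one `keep` vector and reusing it is what keeps the data matrix and the sample annotation in the same order after filtering.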

21 Application - Identify phenotypic co-variates
Using the PCA
- Re-compute the PCA without the outlier samples.
- Plot the 1st PC against the age of the samples (do you notice something?).
- Color the samples according to their Sample Group status.
- Re-compute the PCA using control samples only.
- Does sex have an effect on the data?
- Knowing that females have 2 X chromosomes, 1 of them silenced by methylation, are you able to predict the sex of the samples?
- Check your prediction in the PCA plot.

22 Application - Predicting the sex
Some tips
    # Here we assume that the X chromosome is more methylated
    # in females
    x_probes <- probe_data$chrom == "chrx"
    summary(x_probes)
    controls <- pheno_data$sample_group == "control"
    summary(controls)
    pca <- prcomp(t(beta_value[x_probes, controls]))
    mean_x_meth <- apply(beta_value[x_probes, controls], 2, mean)
    color_vec <- ifelse(mean_x_meth >= 0.45, "red", "blue")
    plot(pca$x, col=color_vec)

23 Automated approach to identify metadata co-variates
Linear regression
For example, between a Principal Component and metadata.
- Can be automated to test other Principal Components and numerous metadata.
- Helps to make a decision on borderline/unclear cases.
Application
    summary(lm(pca$x[,1]~pheno_data$age))
    summary(lm(pca$x[,2]~pheno_data$age))
    summary(lm(pca$x[,3]~pheno_data$age))
    covar.test = summary(lm(pca$x~pheno_data$age))
    covar.test.pv = unlist(lapply(covar.test, function(l) 1-pf(l$fstatistic[1], l$fstatistic[2], l$fstatistic[3])))
    covar.test.pv[covar.test.pv<.01]

24 Cluster the samples
Using the heatmap
- We don't need to use the full data set (too big).
- Let's find the 1,000 most variable sites and use them for clustering.
- Make the heatmap (see previous examples).
- Color the samples according to their Sample Group status (use ColSideColors=color_vec in heatmap).
Finding the most variable sites
    # Getting the 1,000 most variable sites
    #
    # Compute the variance
    probe_var <- apply(beta_value, 1, var)
    # Order them decreasingly
    probe_var <- order(probe_var, decreasing = TRUE)
    # Get the top 1,000
    probe_var <- probe_var[1:1000]
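Putting the slide together, a heatmap on the most variable sites might be produced as sketched below. The data are simulated (`sim.beta`, `sim.group` are stand-ins for `beta_value` and the sample groups), and the group labels and colors are chosen only for illustration.

```r
set.seed(3)
# Simulated stand-in for beta_value: 5,000 probes x 8 samples.
sim.beta <- matrix(runif(5000 * 8), nrow = 5000,
                   dimnames = list(NULL, paste0("s", 1:8)))
sim.group <- rep(c("control", "ETMR"), each = 4)

# Variance of each probe (rows), then the indices of the 1,000 largest.
probe.var <- apply(sim.beta, 1, var)
top.idx <- order(probe.var, decreasing = TRUE)[1:1000]

# One color per sample, in the same order as the matrix columns.
color_vec <- ifelse(sim.group == "control", "blue", "red")
heatmap(sim.beta[top.idx, ], ColSideColors = color_vec, labRow = NA)
```

Keeping the variances in `probe.var` and the selected indices in a separate `top.idx` avoids overwriting one with the other, which makes later checks on the variances easier.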

25 Find differentially methylated sites
A t-test compares 2 distributions
- We want to compare cases vs controls at each site (loop).
- First, make a boolean vector for cases and one for controls (we saw how in previous slides).
t-test, an example
    t.test(rnorm(100), rnorm(100))
    # Getting only the p-value
    t.test(rnorm(100), rnorm(100))$p.value
Non-parametric alternative
The Wilcoxon test, available through wilcox.test, is a rank-based test.
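The per-site comparison described above can be sketched with `apply` over the rows of the methylation matrix. Everything here is simulated: `sim.beta`, the group vectors and the planted difference at the first site are assumptions made for the example.

```r
set.seed(4)
# Simulated stand-in: 100 sites x 20 samples (10 cases, then 10 controls).
sim.beta <- matrix(rnorm(100 * 20, mean = 0.5, sd = 0.1), nrow = 100)
is.case <- rep(c(TRUE, FALSE), each = 10)
is.control <- !is.case
sim.beta[1, is.case] <- sim.beta[1, is.case] + 0.5   # planted difference at site 1

# One t-test per site (row), keeping only the p-value.
pvals <- apply(sim.beta, 1, function(site) {
  t.test(site[is.case], site[is.control])$p.value
})

which.min(pvals)      # index of the most significant site
head(sort(pvals))     # smallest p-values
```

Swapping `t.test` for `wilcox.test` in the anonymous function gives the rank-based alternative with the same loop structure.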

26 Plot differentially methylated sites
Using boxplots
- Using a previously identified differentially methylated site, make a boxplot of this site for cases and controls.
[Figure: boxplot of methylation for the control and ETMR groups]
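A boxplot of one site split by group can be drawn with the formula interface of `boxplot`. The values below are simulated for a single hypothetical site; in the workshop they would come from one row of `beta_value`.

```r
set.seed(5)
# Simulated methylation values for one site: 10 controls, then 10 ETMR samples.
meth <- c(rnorm(10, mean = 0.3, sd = 0.05), rnorm(10, mean = 0.7, sd = 0.05))
group <- factor(rep(c("control", "ETMR"), each = 10))

# One box per level of 'group'; the statistics are returned invisibly.
bp <- boxplot(meth ~ group, ylab = "methylation", xlab = "")
bp$names
```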

27 GenomicRanges...
Introduction
Represents genomic intervals. All annotations can be represented through GenomicRanges objects.
Creation
    mygr = GRanges(chrs, IRanges(start=starts, end=ends))
Overlapping intervals
    myoverlap = findOverlaps(mygr, genes.gr)
    queryHits(myoverlap)
    length(unique(queryHits(myoverlap)))
Distance to nearest
    disttogene = distanceToNearest(mygr, genes.gr)
    # In current Bioconductor releases the result is a Hits object
    # with the distances stored in a metadata column.
    which(mcols(disttogene)$distance < 100)

28 ...and annotation
Introduction
Many annotations are already available directly from R, see the Bioconductor website. Otherwise you can create your own GenomicRanges object.
TxDb
Gene annotation.
    source("...")   # Bioconductor installation script
    biocLite("TxDb.Hsapiens.UCSC.hg19.knownGene")
    library(TxDb.Hsapiens.UCSC.hg19.knownGene)
AnnotationHub
Many different tracks, including most of ENCODE's.
    source("...")   # Bioconductor installation script
    biocLite("AnnotationHub")
    library(AnnotationHub)
    ah = AnnotationHub()
    ctcf.tfbs = ah$goldenpath.hg19.encodedcc.wgencodeuwtfbs.wgencodeuwtfbsmcf7ctcfstdpkrep1.narrowpeak_0.0.1.rdata

29 Extra: ggplot2
Introduction
A package to construct beautiful and/or complex graphs. Many aspects of the graph are arranged automatically, but everything can be specified. Layers are easy to add.
[Figure: example ggplot2 graphs with one panel per probe (cg...), showing methylation density and methylation level for the control and ETMR groups]

30 Extra: ggplot2 - data.frame only
data.frame
The input object is always a data.frame, with each row being one point to represent and each column a different piece of information about it.
data.frame: in its simplest version, a matrix where each column can have a different data type.
Useful functions
data.frame    To create a data.frame.
subset        To subset a data.frame using conditions on the columns.
melt/reshape  To deconstruct a matrix into a data.frame, or the opposite. reshape package.
aggregate     To compute summary statistics on subsets of the data.frame.
ddply         apply-like function on subsets of the data.frame. plyr package.
Example
    gene.expression.df = data.frame(gene=c("a","b","c"), gene.expression=1:3)
    dim(gene.expression.df)
    gene.expression.df = subset(gene.expression.df, gene.expression>1)
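The base R functions listed above can be tried on a small long-format data.frame, with one row per measurement. The probe names and values below are made up; the column names `probe`, `group` and `methylation` are chosen to resemble what a ggplot2 call on methylation data would expect.

```r
# One row per (probe, sample) measurement.
beta.df <- data.frame(
  probe = rep(c("cg01", "cg02"), each = 4),
  group = rep(c("control", "ETMR"), times = 4),
  methylation = c(0.2, 0.8, 0.3, 0.7, 0.5, 0.5, 0.4, 0.6)
)

# subset: keep the rows satisfying a condition on the columns.
high.df <- subset(beta.df, methylation > 0.5)

# aggregate: mean methylation for each probe/group combination.
mean.df <- aggregate(methylation ~ probe + group, data = beta.df, FUN = mean)
mean.df
```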

31 Extra: ggplot2
Simple graph - Histogram
    df.to.plot = data.frame(value=rnorm(1000))
    library(ggplot2)
    ggplot(df.to.plot, aes(x=value)) + geom_histogram()
More complex graph
Putting a color for each group and different panels for each probe.
    ggplot(beta.df, aes(x=methylation)) +
      geom_density(aes(fill=group, colour=group), alpha=.7) +
      facet_wrap(~probe)
Learn from the examples there

32 Online tutorials
R
- Small video-tutorials.
- R and statistics small web-tutorials.
- Coursera "Computing for Data Analysis" videos. Other interesting videos, e.g. ggplot2.
- R manual.
R and Bioinformatics
- List of online resources for Bioinformatics.
- Bioinformatics workshop material.
- Pieces of code for bioinformatics analysis, plots. Including Bioconductor.
- Bioinformatics tutorials material: pdf and R scripts.

33 Thank you!!
If you're interested in potentially more sessions, in a different format (more often, more specific), maybe some kind of R club, let us know through the survey or by email.

34 Lists
Flexible container
A list can contain any element type. It does not require elements to be of the same type.
list      Create a list.
l[[i]]    Get or set the i-th object of the list.
l$toto    Get or set the element labeled as toto.
names     Get or set the names of the list elements.
length    Get the number of elements in the list.
str       Output the structure of an R object.
Example
    l = list(vec=1:10, mat=matrix(runif(25),5))
    str(l)
    l
    l$vec = 1
    l

35 Functions - lapply
apply for lists
- Useful way to iterate through lists.
Example
    file_list <- list.files(".")
    files_content <- lapply(file_list, function(file) {
      data <- read.csv(file)
      # Do something with the data
      return(data)
    })

Analyzing Genomic Data with NOJAH

Analyzing Genomic Data with NOJAH Analyzing Genomic Data with NOJAH TAB A) GENOME WIDE ANALYSIS Step 1: Select the example dataset or upload your own. Two example datasets are available. Genome-Wide TCGA-BRCA Expression datasets and CoMMpass

More information

Install RStudio from - use the standard installation.

Install RStudio from   - use the standard installation. Session 1: Reading in Data Before you begin: Install RStudio from http://www.rstudio.com/ide/download/ - use the standard installation. Go to the course website; http://faculty.washington.edu/kenrice/rintro/

More information

file:///users/williams03/a/workshops/2015.march/final/intro_to_r.html

file:///users/williams03/a/workshops/2015.march/final/intro_to_r.html Intro to R R is a functional programming language, which means that most of what one does is apply functions to objects. We will begin with a brief introduction to R objects and how functions work, and

More information

An Introduction to R- Programming

An Introduction to R- Programming An Introduction to R- Programming Hadeel Alkofide, Msc, PhD NOT a biostatistician or R expert just simply an R user Some slides were adapted from lectures by Angie Mae Rodday MSc, PhD at Tufts University

More information

Introduction to Cancer Genomics

Introduction to Cancer Genomics Introduction to Cancer Genomics Gene expression data analysis part I David Gfeller Computational Cancer Biology Ludwig Center for Cancer research david.gfeller@unil.ch 1 Overview 1. Basic understanding

More information

COSC160: Detection and Classification. Jeremy Bolton, PhD Assistant Teaching Professor

COSC160: Detection and Classification. Jeremy Bolton, PhD Assistant Teaching Professor COSC160: Detection and Classification Jeremy Bolton, PhD Assistant Teaching Professor Outline I. Problem I. Strategies II. Features for training III. Using spatial information? IV. Reducing dimensionality

More information

Bioconductor tutorial

Bioconductor tutorial Bioconductor tutorial Adapted by Alex Sanchez from tutorials by (1) Steffen Durinck, Robert Gentleman and Sandrine Dudoit (2) Laurent Gautier (3) Matt Ritchie (4) Jean Yang Outline The Bioconductor Project

More information

Package dmrseq. September 14, 2018

Package dmrseq. September 14, 2018 Type Package Package dmrseq September 14, 2018 Title Detection and inference of differentially methylated regions from Whole Genome Bisulfite Sequencing Version 1.1.15 Author Keegan Korthauer ,

More information

Lab: Using R and Bioconductor

Lab: Using R and Bioconductor Lab: Using R and Bioconductor Robert Gentleman Florian Hahne Paul Murrell June 19, 2006 Introduction In this lab we will cover some basic uses of R and also begin working with some of the Bioconductor

More information

Lecture 25: Review I

Lecture 25: Review I Lecture 25: Review I Reading: Up to chapter 5 in ISLR. STATS 202: Data mining and analysis Jonathan Taylor 1 / 18 Unsupervised learning In unsupervised learning, all the variables are on equal standing,

More information

ITS Introduction to R course

ITS Introduction to R course ITS Introduction to R course Nov. 29, 2018 Using this document Code blocks and R code have a grey background (note, code nested in the text is not highlighted in the pdf version of this document but is

More information

Introduction to R Programming

Introduction to R Programming Course Overview Over the past few years, R has been steadily gaining popularity with business analysts, statisticians and data scientists as a tool of choice for conducting statistical analysis of data

More information

Package RAPIDR. R topics documented: February 19, 2015

Package RAPIDR. R topics documented: February 19, 2015 Package RAPIDR February 19, 2015 Title Reliable Accurate Prenatal non-invasive Diagnosis R package Package to perform non-invasive fetal testing for aneuploidies using sequencing count data from cell-free

More information

Data Import and Export

Data Import and Export Data Import and Export Eugen Buehler October 17, 2018 Importing Data to R from a file CSV (comma separated value) tab delimited files Excel formats (xls, xlsx) SPSS/SAS/Stata RStudio will tell you if you

More information

R on BioHPC. Rstudio, Parallel R and BioconductoR. Updated for

R on BioHPC. Rstudio, Parallel R and BioconductoR. Updated for R on BioHPC Rstudio, Parallel R and BioconductoR 1 Updated for 2015-07-15 2 Today we ll be looking at Why R? The dominant statistics environment in academia Large number of packages to do a lot of different

More information

CTL mapping in R. Danny Arends, Pjotr Prins, and Ritsert C. Jansen. University of Groningen Groningen Bioinformatics Centre & GCC Revision # 1

CTL mapping in R. Danny Arends, Pjotr Prins, and Ritsert C. Jansen. University of Groningen Groningen Bioinformatics Centre & GCC Revision # 1 CTL mapping in R Danny Arends, Pjotr Prins, and Ritsert C. Jansen University of Groningen Groningen Bioinformatics Centre & GCC Revision # 1 First written: Oct 2011 Last modified: Jan 2018 Abstract: Tutorial

More information

Recap From Last Time: Today s Learning Goals BIMM 143. Data analysis with R Lecture 4. Barry Grant.

Recap From Last Time: Today s Learning Goals BIMM 143. Data analysis with R Lecture 4. Barry Grant. BIMM 143 Data analysis with R Lecture 4 Barry Grant http://thegrantlab.org/bimm143 Recap From Last Time: Substitution matrices: Where our alignment match and mis-match scores typically come from Comparing

More information

Introduction to Matlab. Sasha Lukyanov, 2018 Xenopus Bioinformatics Workshop, MBL, Woods Hole

Introduction to Matlab. Sasha Lukyanov, 2018 Xenopus Bioinformatics Workshop, MBL, Woods Hole Introduction to Matlab Sasha Lukyanov, 2018 Xenopus Bioinformatics Workshop, MBL, Woods Hole MATLAB Environment This image cannot currently be displayed. What do we use? Help? If you know the name of the

More information

Package SC3. September 29, 2018

Package SC3. September 29, 2018 Type Package Title Single-Cell Consensus Clustering Version 1.8.0 Author Vladimir Kiselev Package SC3 September 29, 2018 Maintainer Vladimir Kiselev A tool for unsupervised

More information

k-nn classification with R QMMA

k-nn classification with R QMMA k-nn classification with R QMMA Emanuele Taufer file:///c:/users/emanuele.taufer/google%20drive/2%20corsi/5%20qmma%20-%20mim/0%20labs/l1-knn-eng.html#(1) 1/16 HW (Height and weight) of adults Statistics

More information

Introducing Categorical Data/Variables (pp )

Introducing Categorical Data/Variables (pp ) Notation: Means pencil-and-paper QUIZ Means coding QUIZ Definition: Feature Engineering (FE) = the process of transforming the data to an optimal representation for a given application. Scaling (see Chs.

More information

Expression Analysis with the Advanced RNA-Seq Plugin

Expression Analysis with the Advanced RNA-Seq Plugin Expression Analysis with the Advanced RNA-Seq Plugin May 24, 2016 Sample to Insight CLC bio, a QIAGEN Company Silkeborgvej 2 Prismet 8000 Aarhus C Denmark Telephone: +45 70 22 32 44 www.clcbio.com support-clcbio@qiagen.com

More information

/ Computational Genomics. Normalization

/ Computational Genomics. Normalization 10-810 /02-710 Computational Genomics Normalization Genes and Gene Expression Technology Display of Expression Information Yeast cell cycle expression Experiments (over time) baseline expression program

More information

Applied Regression Modeling: A Business Approach

Applied Regression Modeling: A Business Approach i Applied Regression Modeling: A Business Approach Computer software help: SAS SAS (originally Statistical Analysis Software ) is a commercial statistical software package based on a powerful programming

More information

An introduction to Genomic Data Structures

An introduction to Genomic Data Structures An introduction to Genomic Data Structures Cavan Reilly October 30, 2017 Table of contents Object Oriented Programming The ALL data set ExpressionSet Objects Environments More on ExpressionSet Objects

More information

Package SC3. November 27, 2017

Package SC3. November 27, 2017 Type Package Title Single-Cell Consensus Clustering Version 1.7.1 Author Vladimir Kiselev Package SC3 November 27, 2017 Maintainer Vladimir Kiselev A tool for unsupervised

More information

A brief introduction to R

A brief introduction to R A brief introduction to R Cavan Reilly September 29, 2017 Table of contents Background R objects Operations on objects Factors Input and Output Figures Missing Data Random Numbers Control structures Background

More information

DATA SCIENCE INTRODUCTION QSHORE TECHNOLOGIES. About the Course:

DATA SCIENCE INTRODUCTION QSHORE TECHNOLOGIES. About the Course: DATA SCIENCE About the Course: In this course you will get an introduction to the main tools and ideas which are required for Data Scientist/Business Analyst/Data Analyst/Analytics Manager/Actuarial Scientist/Business

More information

Cross-validation and the Bootstrap

Cross-validation and the Bootstrap Cross-validation and the Bootstrap In the section we discuss two resampling methods: cross-validation and the bootstrap. 1/44 Cross-validation and the Bootstrap In the section we discuss two resampling

More information

Python for Data Analysis. Prof.Sushila Aghav-Palwe Assistant Professor MIT

Python for Data Analysis. Prof.Sushila Aghav-Palwe Assistant Professor MIT Python for Data Analysis Prof.Sushila Aghav-Palwe Assistant Professor MIT Four steps to apply data analytics: 1. Define your Objective What are you trying to achieve? What could the result look like? 2.

More information

Package PCADSC. April 19, 2017

Package PCADSC. April 19, 2017 Type Package Package PCADSC April 19, 2017 Title Tools for Principal Component Analysis-Based Data Structure Comparisons Version 0.8.0 A suite of non-parametric, visual tools for assessing differences

More information

Getting Started. Slides R-Intro: R-Analytics: R-HPC:

Getting Started. Slides R-Intro:   R-Analytics:   R-HPC: Getting Started Download and install R + Rstudio http://www.r-project.org/ https://www.rstudio.com/products/rstudio/download2/ TACC ssh username@wrangler.tacc.utexas.edu % module load Rstats %R Slides

More information

Introduction to Data Science. Introduction to Data Science with Python. Python Basics: Basic Syntax, Data Structures. Python Concepts (Core)

Introduction to Data Science. Introduction to Data Science with Python. Python Basics: Basic Syntax, Data Structures. Python Concepts (Core) Introduction to Data Science What is Analytics and Data Science? Overview of Data Science and Analytics Why Analytics is is becoming popular now? Application of Analytics in business Analytics Vs Data

More information

Package r.jive. R topics documented: April 12, Type Package

Package r.jive. R topics documented: April 12, Type Package Type Package Package r.jive April 12, 2017 Title Perform JIVE Decomposition for Multi-Source Data Version 2.1 Date 2017-04-11 Author Michael J. O'Connell and Eric F. Lock Maintainer Michael J. O'Connell

More information

ACHIEVEMENTS FROM TRAINING

ACHIEVEMENTS FROM TRAINING LEARN WELL TECHNOCRAFT DATA SCIENCE/ MACHINE LEARNING SYLLABUS 8TH YEAR OF ACCOMPLISHMENTS AUTHORIZED GLOBAL CERTIFICATION CENTER FOR MICROSOFT, ORACLE, IBM, AWS AND MANY MORE. 8411002339/7709292162 WWW.DW-LEARNWELL.COM

More information

Why use R? Getting started. Why not use R? Introduction to R: Log into tak. Start R R or. It s hard to use at first

Why use R? Getting started. Why not use R? Introduction to R: Log into tak. Start R R or. It s hard to use at first Why use R? Introduction to R: Using R for statistics ti ti and data analysis BaRC Hot Topics October 2011 George Bell, Ph.D. http://iona.wi.mit.edu/bio/education/r2011/ To perform inferential statistics

More information

Using R for statistics and data analysis

Using R for statistics and data analysis Introduction ti to R: Using R for statistics and data analysis BaRC Hot Topics October 2011 George Bell, Ph.D. http://iona.wi.mit.edu/bio/education/r2011/ Why use R? To perform inferential statistics (e.g.,

More information

Prepare input data for CINdex

Prepare input data for CINdex 1 Introduction Prepare input data for CINdex Genomic instability is known to be a fundamental trait in the development of tumors; and most human tumors exhibit this instability in structural and numerical

More information

Package FunciSNP. November 16, 2018

Package FunciSNP. November 16, 2018 Type Package Package FunciSNP November 16, 2018 Title Integrating Functional Non-coding Datasets with Genetic Association Studies to Identify Candidate Regulatory SNPs Version 1.26.0 Date 2013-01-19 Author

More information

Step-by-Step Guide to Advanced Genetic Analysis

Step-by-Step Guide to Advanced Genetic Analysis Step-by-Step Guide to Advanced Genetic Analysis Page 1 Introduction In the previous document, 1 we covered the standard genetic analyses available in JMP Genomics. Here, we cover the more advanced options

More information

Package pandar. April 30, 2018

Package pandar. April 30, 2018 Title PANDA Algorithm Version 1.11.0 Package pandar April 30, 2018 Author Dan Schlauch, Joseph N. Paulson, Albert Young, John Quackenbush, Kimberly Glass Maintainer Joseph N. Paulson ,

More information

Step-by-Step Guide to Relatedness and Association Mapping Contents

Step-by-Step Guide to Relatedness and Association Mapping Contents Step-by-Step Guide to Relatedness and Association Mapping Contents OBJECTIVES... 2 INTRODUCTION... 2 RELATEDNESS MEASURES... 2 POPULATION STRUCTURE... 6 Q-K ASSOCIATION ANALYSIS... 10 K MATRIX COMPRESSION...

More information

Predictive Analysis: Evaluation and Experimentation. Heejun Kim

Predictive Analysis: Evaluation and Experimentation. Heejun Kim Predictive Analysis: Evaluation and Experimentation Heejun Kim June 19, 2018 Evaluation and Experimentation Evaluation Metrics Cross-Validation Significance Tests Evaluation Predictive analysis: training

More information

Clustering analysis of gene expression data

Clustering analysis of gene expression data Clustering analysis of gene expression data Chapter 11 in Jonathan Pevsner, Bioinformatics and Functional Genomics, 3 rd edition (Chapter 9 in 2 nd edition) Human T cell expression data The matrix contains

More information

Exploring gene expression datasets

Exploring gene expression datasets Exploring gene expression datasets Alexey Sergushichev Dec 4-5, St. Louis About the workshop We will cover the basic analysis of gene expression matrices No working with raw data The focus is on being

More information

Statistical Analysis of Metabolomics Data. Xiuxia Du Department of Bioinformatics & Genomics University of North Carolina at Charlotte

Statistical Analysis of Metabolomics Data. Xiuxia Du Department of Bioinformatics & Genomics University of North Carolina at Charlotte Statistical Analysis of Metabolomics Data Xiuxia Du Department of Bioinformatics & Genomics University of North Carolina at Charlotte Outline Introduction Data pre-treatment 1. Normalization 2. Centering,

More information

CPSC 340: Machine Learning and Data Mining. Principal Component Analysis Fall 2016

CPSC 340: Machine Learning and Data Mining. Principal Component Analysis Fall 2016 CPSC 340: Machine Learning and Data Mining Principal Component Analysis Fall 2016 A2/Midterm: Admin Grades/solutions will be posted after class. Assignment 4: Posted, due November 14. Extra office hours:

More information

Step-by-Step Guide to Basic Genetic Analysis

Step-by-Step Guide to Basic Genetic Analysis Step-by-Step Guide to Basic Genetic Analysis Page 1 Introduction This document shows you how to clean up your genetic data, assess its statistical properties and perform simple analyses such as case-control

More information

Package GEM. R topics documented: January 31, Type Package

Package GEM. R topics documented: January 31, Type Package Type Package Package GEM January 31, 2018 Title GEM: fast association study for the interplay of Gene, Environment and Methylation Version 1.5.0 Date 2015-12-05 Author Hong Pan, Joanna D Holbrook, Neerja

More information
