Workshop: R and Bioinformatics
1 Workshop: R and Bioinformatics
Jean Monlong & Simon Papillon
Human Genetics department
October 28,

2 Why use R for bioinformatics?
- Flexible statistics and data visualization software.
- Many packages and a vast community: Bioconductor.
- Simple and easy, compared to other computing languages.

3 Today's workshop

Goal
- Explain and demonstrate some key principles to do good bioinformatics.
- How to structure the analysis, how to write efficient scripts, how to explore your data.
- What to do and not to do...
- This is NOT a package tutorial but will point at useful resources.

Learning from your mistakes
You will get errors today. Before raising your hand:
- Check your command for typos.
- Try to understand the error message.
- Check your input/objects.

4 Functions

Previously on HGSS workshops:
function  To define functions.
return    Define what will be returned by the function.
All the objects created within the function are temporary.

Structure

    myfunctionname <- function(input.obj1, second.input.obj) {
      # Instructions on input.obj1 and second.input.obj ...
      return(my.output.obj)
    }
    myfunctionname(1, c(2, 4, 5))

5 Conditions

Logical tests
==        Are both values equal.
> or >=   Is the left value greater (greater or equal) than the right value.
< or <=   Is the left value smaller (smaller or equal) than the right value.
!         Is a NOT operator that negates the value of a test.
|         Is an OR operator used to combine logical tests. Returns TRUE if either is TRUE.
&         Is an AND operator used to combine logical tests. Returns TRUE if both are TRUE.

Example

    test <- ... == 4    ## (TRUE)
    !test               ## (FALSE)
    test | !test        ## (TRUE)
    test & !test        ## (FALSE)

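The operators above can be tried directly at the console. This is a minimal sketch; the left-hand side `2 + 2` is an assumption chosen for illustration, since the expression used on the slide is not recoverable.

```r
# Assumed expression: any comparison that evaluates to TRUE would do here.
test <- 2 + 2 == 4
test           # TRUE
!test          # FALSE: NOT negates the test
test | !test   # TRUE:  OR needs only one side to be TRUE
test & !test   # FALSE: AND needs both sides to be TRUE

# Comparison operators return a logical value as well.
3 >= 3         # TRUE
3 < 2          # FALSE
```
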
6 Conditions

Boolean
Any logical test can be vectorized (compare 2 vectors).
which  Returns the indices of the elements that are TRUE.

Example

    5:8 == 6                           ## FALSE,TRUE,FALSE,FALSE
    5:8 >= 6 & 5:8 <= 7                ## FALSE,TRUE,TRUE,FALSE
    c(TRUE, TRUE) & c(TRUE, FALSE)     ## TRUE,FALSE
    c(TRUE, FALSE) | c(FALSE, FALSE)   ## TRUE,FALSE
    which(5:10 == 6)                   ## 2
    which(5:10 > 6)                    ## 3,4,5,6

7 Conditions - Exercise

Exercise
Create a function that:
1. removes values below 3 from a vector.
2. removes values below a specified threshold from a vector.

For more advanced users
Have a look at these functions on logical vectors:
- any, %in%.
- sum, mean, table.

Extra tips
- Don't filter out, keep in.
- Use the boolean vector directly between [ ].

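One possible solution to the exercise, following the two tips (keep in rather than filter out, and index with the boolean vector directly). The function name `keep_above` and the example vector are hypothetical, chosen only for this sketch.

```r
# Keep the values at or above a threshold; the default of 3 covers part 1.
keep_above <- function(vec, threshold = 3) {
  vec[vec >= threshold]
}

x <- c(1, 5, 2, 8, 3)
keep_above(x)       # 5 8 3
keep_above(x, 6)    # 8

# Functions for logical vectors mentioned for advanced users:
any(x < 3)          # TRUE: at least one value is below 3
sum(x < 3)          # 2:    number of values below 3
mean(x < 3)         # 0.4:  proportion of values below 3
table(x < 3)        # counts of FALSE and TRUE
5 %in% x            # TRUE: 5 is one of the values
```
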
8 Testing conditions

if else
Test a condition: if TRUE run some instructions, if FALSE run something else (or nothing).

    if( Condition ){
      # ... Instructions
    }

Example

    if(length(luckynumbers)>3){
      cat("too many lucky numbers.")
      luckynumbers = luckynumbers[1:3]
    } else if(length(luckynumbers)==3){
      cat("just enough lucky numbers.")
    } else {
      cat("you need more lucky numbers.")
    }

9 Loops

for loops
Iterate over the elements of a container and run instructions.

    for(v in vec){
      # ... Instruction
    }

while loops
Run instructions as long as a condition is TRUE.

    while( CONDITION ){
      # ... Instruction
    }

Example

    facto = 1
    for(n in 1:10){
      facto = facto * n
    }

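The slide's example uses a for loop only. As a hedged illustration, here is the same factorial computed with both loop types; the variable names `facto_for` and `facto_while` are introduced for this sketch.

```r
# for loop: the number of iterations is known in advance.
facto_for <- 1
for (n in 1:10) {
  facto_for <- facto_for * n
}

# while loop: the counter must be updated by hand, otherwise it never stops.
facto_while <- 1
n <- 1
while (n <= 10) {
  facto_while <- facto_while * n
  n <- n + 1
}

facto_for == factorial(10)     # TRUE
facto_while == factorial(10)   # TRUE
```
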
10 Exercises

if else
Create a function that classifies the average value of a vector. It returns:
- low if the average is below 3.
- medium if the average is between 3 and 7.
- high if the average is above 7.

Loops
Write a function that computes the mean values of a matrix's columns:
1. using the apply function.
2. using a for loop.
3. (using a while loop.)

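A sketch of one way to solve these exercises. The function names are hypothetical, and treating averages of exactly 3 or 7 as "medium" is an assumption, since the exercise does not say how to handle the boundaries.

```r
# Classify the average of a vector (boundaries 3 and 7 count as "medium").
classify_mean <- function(vec) {
  avg <- mean(vec)
  if (avg < 3) {
    "low"
  } else if (avg <= 7) {
    "medium"
  } else {
    "high"
  }
}

# Column means with apply: the 2 means "apply over columns".
col_means_apply <- function(mat) {
  apply(mat, 2, mean)
}

# Column means with a for loop, filling a pre-allocated vector.
col_means_for <- function(mat) {
  res <- numeric(ncol(mat))
  for (j in seq_len(ncol(mat))) {
    res[j] <- mean(mat[, j])
  }
  res
}

mat <- matrix(1:12, nrow = 3)
classify_mean(c(1, 2, 3))   # "low"
col_means_apply(mat)        # 2 5 8 11
col_means_for(mat)          # 2 5 8 11
```
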
11 Important principles

Scripting
Write scripts of your analysis:
- Keeping track, easy rerun, easy parameter tweaking.
- RStudio or other interfaces (Emacs+ESS, ...).

Clear and modular code
- Define clear analysis steps.
- Write function(s) for each step.
- Keeps the data and parameters used clear (for you and for R).
- No confusing temporary objects.
- No repeating code.
- More suitable for apply-like usage.
- Easy parameter tweaking.

Efficiency matters
- Data structure and manipulation.
- Especially relevant with our large data.

12 Data exploration - All you can plot

Answer your questions with plots!

Utility
- Get an idea of the quality of the data and potential issues.
- Get a full answer.
- Detect potential biases.
- (Find unexpected results.)

Through all your analysis
- Quality Control plots at the beginning.
- Control plot after each step.
- Awesome plot with your results.

13 Today's dataset - Methylation analysis

Goal
Analyze methylation data following those principles.

DNA methylation data
- Identify sample outliers (and remove them).
- Identify co-variates (sex, age) using PCA.
- Use heatmap to plot sample groupings.
- Point out differentially methylated sites.
- Plot methylation levels of interesting sites.

The data set

    beta_value  # The methylation data
    probe_data  # The annotation of each probe
    pheno_data  # The annotation of each sample (metadata)

Get to know your data
Check the first few rows of each object (remember the head function). Check the type and size of your data (str).

14 Importing large files more efficiently

Extra parameters in read.table
colClasses  A vector with the data type of each column: e.g. character, numeric.
nrows       The number of rows to read. Potentially in combination with system and wc.

Read a file line by line (or by chunk)

    con = file(file.name, "r")
    while(length(line <- readLines(con, n=1)) > 0){
      # ... Instructions
    }
    close(con)

To test the performance

    system.time({
      # ... Instructions
    })

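A small self-contained sketch of both techniques. The toy table, its column names and the temporary file are assumptions made for illustration; real data would of course be much larger.

```r
# Write a small tab-delimited file to a temporary location (toy data).
toy <- data.frame(probe = paste0("cg", 1:5), beta = seq(0.1, 0.9, by = 0.2))
file.name <- tempfile(fileext = ".tsv")
write.table(toy, file.name, sep = "\t", quote = FALSE, row.names = FALSE)

# Declaring the column types and the row count spares read.table some guessing.
dat <- read.table(file.name, header = TRUE, sep = "\t",
                  colClasses = c("character", "numeric"), nrows = 5)
str(dat)

# Read the same file line by line through an explicitly opened connection.
# Opening it with "r" matters: each readLines call then continues where the
# previous one stopped instead of starting again from the first line.
con <- file(file.name, "r")
n.lines <- 0
while (length(line <- readLines(con, n = 1)) > 0) {
  n.lines <- n.lines + 1
}
close(con)
n.lines   # 6: one header line plus five data lines
```
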
15 Manipulating large files: a classical error

Don't do this!
Concatenate iteratively on big data.

    mymatrix = NULL
    for(i in 1:100000){
      # ... Instructions
      mymatrix = rbind(mymatrix, mynewline)
    }

Instead do this
Create the data and then fill it.

    mymatrix = matrix(NA, 100000, 100)
    for(i in 1:100000){
      # ... Instructions
      mymatrix[i,] = mynewline
    }

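The difference can be measured with system.time. This is a sketch on a deliberately small matrix so that the slow version still finishes quickly; the sizes `n` and `p` and the function names `grow` and `prealloc` are assumptions for illustration, and the timings will vary between machines.

```r
n <- 2000   # number of rows, kept small on purpose
p <- 10     # number of columns

# Slow: the matrix is copied and extended at every iteration.
grow <- function() {
  m <- NULL
  for (i in 1:n) {
    m <- rbind(m, rep(i, p))
  }
  m
}

# Faster: the matrix is created once and filled in place.
prealloc <- function() {
  m <- matrix(NA, n, p)
  for (i in 1:n) {
    m[i, ] <- rep(i, p)
  }
  m
}

time.grow <- system.time(m1 <- grow())
time.prealloc <- system.time(m2 <- prealloc())
time.grow["elapsed"]
time.prealloc["elapsed"]

# Both approaches build the same content.
all(m1 == m2)
```
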
16 Identify outliers

Checking the distribution
- Identify the case samples.
- Take a look at the density of those samples.
- Are there any samples that stand out?

Useful functions

    density(x)        # Compute the density of x
    plot(x)           # Create a plot of x
    plot(density(x))  # Plot the density of x

17 Quality Control plots

Aim
Assess the quality of your data and potential artefacts that could bias your analysis.

Basic approaches
- Principal Component Analysis: representing the largest variation in the data.
- Clustering: summarizing similarity relations between samples/genes.
- Heatmaps: summary of the clustering on both rows and columns.
- Testing metadata: gender, age, ...

18 Quality Control plots - Functions

PCA using prcomp
PCA of the matrix columns; plot of the variance explained by the first PCs; representation of the rows using the first two PCs.

    peeseaaye = prcomp(input.matrix)
    plot(peeseaaye)
    plot(peeseaaye$x, type="n")
    text(peeseaaye$x, labels=rownames(input.matrix))

Clustering using hclust
Clustering using a distance matrix, e.g. from correlation between columns.

    cor.dist = as.dist(1-cor(input.matrix))
    euc.dist = dist(input.matrix)
    kleusteur = hclust(cor.dist, method="ward.D")
    plot(kleusteur)
    library(MASS)
    mdees = isoMDS(cor.dist)
    plot(mdees$points)

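To try these functions without the workshop data, here is a sketch on simulated values. The toy matrix (20 samples in rows, 50 probes in columns, two groups with opposite patterns) and all object names are assumptions for illustration. Because samples are in rows here, the matrix is transposed before cor, which computes correlations between columns.

```r
set.seed(1)
# Two groups of 10 samples: group A is high on the first 25 probes,
# group B is high on the last 25 probes.
pattern.a <- rep(c(3, 0), each = 25)
pattern.b <- rep(c(0, 3), each = 25)
input.matrix <- rbind(
  t(replicate(10, pattern.a + rnorm(50))),
  t(replicate(10, pattern.b + rnorm(50)))
)
rownames(input.matrix) <- paste0("s", 1:20)

# PCA: proportion of variance explained, then the samples on the first two PCs.
pca <- prcomp(input.matrix)
var.explained <- pca$sdev^2 / sum(pca$sdev^2)
plot(pca$x[, 1:2], type = "n")
text(pca$x[, 1:2], labels = rownames(input.matrix))

# Hierarchical clustering of the samples from a correlation-based distance.
cor.dist <- as.dist(1 - cor(t(input.matrix)))
clust <- hclust(cor.dist, method = "ward.D")
plot(clust)

# Cut the tree in two and compare with the simulated groups.
groups <- cutree(clust, k = 2)
table(groups, rep(c("A", "B"), each = 10))
```
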
19 Heatmaps

Using heatmap function

    heatmap(input.matrix)

20 Application - Identify outliers

Using the PCA
- Using all samples, plot the first 2 Principal Components.
- Color the samples according to their Sample Group status.
- How do the outliers identified previously behave?
- Remove those samples from your data set (don't forget to propagate the changes in all objects!).

PCA example

    pca <- prcomp(t(beta_value))
    plot(pca$x)

21 Application - Identify phenotypic co-variates

Using the PCA
- Re-compute the PCA without the outlier samples.
- Plot the 1st PC with the age of the samples (do you notice something?)
- Color the samples according to their Sample Group status.
- Re-compute the PCA using control samples only.
- Does sex have an effect on the data?
- Knowing that females have 2 X chromosomes, 1 silenced by methylation, are you able to predict the sex of the samples?
- Check your prediction in the PCA plot.

22 Application - Predicting the sex

Some tips

    # Here we assume that the X chromosome is more methylated
    # in females
    x_probes <- probe_data$chrom == "chrX"
    summary(x_probes)
    controls <- pheno_data$sample_group == "control"
    summary(controls)
    pca <- prcomp(t(beta_value[x_probes, controls]))
    mean_x_meth <- apply(beta_value[x_probes, controls], 2, mean)
    color_vec <- ifelse(mean_x_meth >= 0.45, "red", "blue")
    plot(pca$x, col=color_vec)

23 Automated approach to identify metadata co-variates

Linear regression
For example, between Principal Component and metadata.
- Can be automated to test other Principal Components and numerous metadata.
- To make a decision on borderline/unclear cases.

Application

    summary(lm(pca$x[,1]~pheno_data$age))
    summary(lm(pca$x[,2]~pheno_data$age))
    summary(lm(pca$x[,3]~pheno_data$age))

    covar.test = summary(lm(pca$x~pheno_data$age))
    covar.test.pv = unlist(lapply(covar.test, function(l)
      1-pf(l$fstatistic[1], l$fstatistic[2], l$fstatistic[3])))
    covar.test.pv[covar.test.pv<.01]

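The same automation can be tried on simulated values. In this sketch the matrix `scores` stands in for pca$x and `age` for pheno_data$age; both are hypothetical toy data in which only the first component depends on age. Passing a matrix as the response makes lm fit one regression per column, and summary then returns one summary per column.

```r
set.seed(42)
n <- 40
age <- runif(n, 20, 80)

# Toy "PC scores": the first column depends on age, the other two do not.
scores <- cbind(PC1 = 0.1 * age + rnorm(n),
                PC2 = rnorm(n),
                PC3 = rnorm(n))

# One linear regression per column of the response matrix.
covar.test <- summary(lm(scores ~ age))

# P-value of the F-test for each regression.
covar.test.pv <- sapply(covar.test, function(s) {
  f <- s$fstatistic
  pf(f[1], f[2], f[3], lower.tail = FALSE)
})
covar.test.pv

# Components associated with age at the 0.01 level.
covar.test.pv[covar.test.pv < .01]
```

Using lower.tail = FALSE is equivalent to 1 - pf(...) but stays accurate for very small p-values.
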
24 Cluster the samples

Using the heatmap
- We don't need to use the full data set (too big).
- Let's find the 1,000 most variant sites and use them for clustering.
- Make the heatmap (see previous examples).
- Color the samples according to their Sample Group status (use ColSideColors=color_vec in heatmap).

Predicting the sex

    # Getting the 1,000 most variant sites
    #
    # Compute the variance
    probe_var <- apply(beta_value, 1, var)
    # Order them decreasingly
    probe_var <- order(probe_var, decreasing = TRUE)
    # Get the top 1,000
    probe_var <- probe_var[1:1000]

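A sketch of the whole step on simulated values. The toy matrix `beta_toy`, the group labels and the choice of the top 50 sites are assumptions for illustration; with the workshop data, beta_value and the 1,000 most variant sites would be used instead.

```r
set.seed(7)
# Toy beta values: 200 probes (rows) x 12 samples (columns).
beta_toy <- matrix(rnorm(200 * 12, mean = 0.5, sd = 0.05), nrow = 200,
                   dimnames = list(paste0("cg", 1:200), paste0("s", 1:12)))
group <- rep(c("control", "case"), each = 6)

# Shift the first 20 probes in the cases so that they are the most variant.
beta_toy[1:20, group == "case"] <- beta_toy[1:20, group == "case"] + 0.3

# Variance of each probe, then the indices of the 50 most variant ones.
probe_var <- apply(beta_toy, 1, var)
top <- order(probe_var, decreasing = TRUE)[1:50]

# One color per sample, in the same order as the columns of the matrix.
color_vec <- ifelse(group == "case", "red", "blue")
hm <- heatmap(beta_toy[top, ], ColSideColors = color_vec)
```
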
25 Find differentially methylated sites

A t-test compares the means of 2 distributions
- We want to compare cases vs controls at each site (loop).
- First, make a boolean vector for cases and one for controls (we saw how in previous slides).

t-test, an example

    t.test(rnorm(100), rnorm(100))
    # Getting only the p-value
    t.test(rnorm(100), rnorm(100))$p.value

Non-parametric alternative
The Wilcoxon test, using wilcox.test, is a rank-based test.

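A sketch of the per-site loop on simulated values. The toy matrix, the boolean vectors and the five shifted sites are assumptions for illustration. The result vector is created before the loop and then filled, in line with the earlier advice on large data.

```r
set.seed(3)
# Toy beta values: 100 sites (rows) x 12 samples (columns).
beta_toy <- matrix(rnorm(100 * 12, mean = 0.5, sd = 0.05), nrow = 100)
cases <- rep(c(TRUE, FALSE), each = 6)
controls <- !cases

# Make the first 5 sites differentially methylated in the cases.
beta_toy[1:5, cases] <- beta_toy[1:5, cases] + 0.3

# One t-test and one Wilcoxon test per site.
pvals.t <- rep(NA, nrow(beta_toy))
pvals.w <- rep(NA, nrow(beta_toy))
for (i in seq_len(nrow(beta_toy))) {
  pvals.t[i] <- t.test(beta_toy[i, cases], beta_toy[i, controls])$p.value
  pvals.w[i] <- wilcox.test(beta_toy[i, cases], beta_toy[i, controls])$p.value
}

# Sites with the smallest t-test p-values.
head(order(pvals.t), 5)
```

With thousands of sites tested, the p-values would normally be corrected for multiple testing, for example with p.adjust.
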
26 Plot differentially methylated sites

Using boxplots
- Using a previously identified differentially methylated site, make a boxplot of this site for cases and controls.

[Figure: boxplot of methylation for the control and ETMR groups.]

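A minimal sketch of such a boxplot on simulated values. The methylation values are made up; the group labels follow the ones visible on the slide's figure.

```r
set.seed(5)
# Toy methylation levels of one site in 8 controls and 8 cases.
group <- factor(rep(c("control", "ETMR"), each = 8),
                levels = c("control", "ETMR"))
methylation <- c(rnorm(8, mean = 0.3, sd = 0.05),
                 rnorm(8, mean = 0.7, sd = 0.05))

# The formula interface draws one box per group.
bp <- boxplot(methylation ~ group, ylab = "methylation")

# Medians of the two groups, as computed for the plot.
bp$stats[3, ]
```
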
27 GenomicRanges...

Introduction
Represents genomic intervals. All annotation can be represented through GenomicRanges objects.

Creation

    mygr = GRanges(chrs, IRanges(start=starts, end=ends))

Overlapping intervals

    myoverlap = findOverlaps(mygr, genes.gr)
    queryHits(myoverlap)
    length(unique(queryHits(myoverlap)))

Distance to nearest

    distToGene = distanceToNearest(mygr, genes.gr)
    which(mcols(distToGene)$distance < 100)

28 ...and annotation

Introduction
Many annotations are already available directly from R, see the Bioconductor website. Otherwise you can create your own GenomicRanges object.

TxDb
Gene annotation.

    source("...")
    biocLite("TxDb.Hsapiens.UCSC.hg19.knownGene")
    library(TxDb.Hsapiens.UCSC.hg19.knownGene)

AnnotationHub
Many different tracks, including most of ENCODE's.

    source("...")
    biocLite("AnnotationHub")
    library(AnnotationHub)
    ah = AnnotationHub()
    ctcf.tfbs = ah$goldenpath.hg19.encodeDCC.wgEncodeUwTfbs.wgEncodeUwTfbsMcf7CtcfStdPkRep1.narrowPeak_0.0.1.RData

29 Extra: ggplot2

Introduction
A package to construct beautiful and/or complex graphs. Many aspects of the graph are arranged automatically, but everything can be specified.

Easy layers addition
[Figure: example ggplot2 graphs of methylation in control and ETMR samples, with one panel per cg probe and a methylation density plot.]

30 Extra: ggplot2 - data.frame only

data.frame
The input object is always a data.frame, with each row being one point to represent and each column a different piece of information on it.
data.frame: in its simplest version, a matrix with a different data type possible in each column.

Useful functions
data.frame    To create a data.frame.
subset        To subset a data.frame using a condition on the columns.
melt/reshape  To deconstruct a matrix into a data.frame, or the opposite. reshape package.
aggregate     To compute summary statistics on subsets of the data.frame.
ddply         apply-like function on subsets of the data.frame. plyr package.

Example

    gene.expression.df = data.frame(gene=c("a","b","c"),
                                    gene.expression=1:3)
    dim(gene.expression.df)
    gene.expression.df = subset(gene.expression.df, gene.expression>1)

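A sketch of the base R functions from this list (data.frame, subset, aggregate) on a small made-up table in the "one row per point" layout. The column names and values are assumptions for illustration; melt and ddply come from extra packages and are not shown here.

```r
# One row per measurement: which probe, which group, which value.
df <- data.frame(probe = rep(c("cg1", "cg2"), each = 4),
                 group = rep(c("control", "ETMR"), times = 4),
                 methylation = c(0.2, 0.8, 0.3, 0.7, 0.5, 0.5, 0.4, 0.6))
dim(df)   # 8 rows, 3 columns

# Keep the rows that satisfy a condition on a column.
high <- subset(df, methylation > 0.5)
nrow(high)   # 3

# Mean methylation for each combination of probe and group.
means <- aggregate(methylation ~ probe + group, data = df, FUN = mean)
means
```
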
31 Extra: ggplot2

Simple graph - Histogram

    df.to.plot = data.frame(value=rnorm(1000))
    library(ggplot2)
    ggplot(df.to.plot, aes(x=value)) + geom_histogram()

More complex graph
Putting a color for each group and different panels for each probe.

    ggplot(beta.df, aes(x=methylation)) +
      geom_density(aes(fill=group, colour=group), alpha=.7) +
      facet_wrap(~probe)

Learn from the examples there

32 Online tutorials

R
- Small video-tutorials.
- R and statistics small web-tutorials.
- Coursera Computing for Data Analysis videos. Other interesting videos, e.g. ggplot2.
- R manual.

R and Bioinformatics
- List of online resources for Bioinformatics.
- Bioinformatics workshop material.
- Pieces of code for bioinformatics analysis, plots. Including Bioconductor.
- Bioinformatics tutorials material: pdf and R scripts.

33 Thank you!!

If you're interested in potentially more sessions, in a different format (more often, more specific), maybe some kind of R club, let us know through the survey or by email.

34 Lists

Flexible container
A list can contain any element type. It does not require elements to be of the same type.

list    Create a list.
l[[i]]  Get or set the i-th object of the list.
l$toto  Get or set the element labeled as toto.
names   Get or set the names of the list elements.
length  Get the number of elements in the list.
str     Output the structure of an R object.

Example

    l = list(vec=1:10, mat=matrix(runif(25),5))
    str(l)
    l
    l$vec = 1
    l

35 Functions - lapply

apply for lists
- Useful way to iterate through lists.

Example

    file_list <- list.files(".")
    files_content <- lapply(file_list, function(file) {
      data <- read.csv(file)
      # Do something with the data
      return(data)
    })

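A self-contained variant of this example, as a sketch: it first writes three small CSV files into a temporary folder (hypothetical toy files, created only for the demonstration) and then reads them back with lapply.

```r
# Create a temporary folder holding three small CSV files.
tmp.dir <- tempfile("csv_demo")
dir.create(tmp.dir)
for (i in 1:3) {
  write.csv(data.frame(x = 1:i),
            file.path(tmp.dir, paste0("file", i, ".csv")),
            row.names = FALSE)
}

# full.names = TRUE returns paths that can be read from any working directory.
file_list <- list.files(tmp.dir, pattern = "\\.csv$", full.names = TRUE)

# lapply returns a list with one data.frame per file.
files_content <- lapply(file_list, function(file) {
  data <- read.csv(file)
  return(data)
})
names(files_content) <- basename(file_list)

# Number of rows read from each file.
sapply(files_content, nrow)
```
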
More informationDimension reduction : PCA and Clustering
Dimension reduction : PCA and Clustering By Hanne Jarmer Slides by Christopher Workman Center for Biological Sequence Analysis DTU The DNA Array Analysis Pipeline Array design Probe design Question Experimental
More informationCertified Data Science with Python Professional VS-1442
Certified Data Science with Python Professional VS-1442 Certified Data Science with Python Professional Certified Data Science with Python Professional Certification Code VS-1442 Data science has become
More informationUsing generxcluster. Charles C. Berry. April 30, 2018
Using generxcluster Charles C. Berry April 30, 2018 Contents 1 Overview 1 2 Basic Use 1 2.1 Reading Data from a File...................... 2 2.2 Simulating Data........................... 2 2.3 Invoking
More informationReading and writing data
An introduction to WS 2017/2018 Reading and writing data Dr. Noémie Becker Dr. Sonja Grath Special thanks to: Prof. Dr. Martin Hutzenthaler and Dr. Benedikt Holtmann for significant contributions to course
More informationYour Name: Section: INTRODUCTION TO STATISTICAL REASONING Computer Lab #4 Scatterplots and Regression
Your Name: Section: 36-201 INTRODUCTION TO STATISTICAL REASONING Computer Lab #4 Scatterplots and Regression Objectives: 1. To learn how to interpret scatterplots. Specifically you will investigate, using
More informationLAB #1: DESCRIPTIVE STATISTICS WITH R
NAVAL POSTGRADUATE SCHOOL LAB #1: DESCRIPTIVE STATISTICS WITH R Statistics (OA3102) Lab #1: Descriptive Statistics with R Goal: Introduce students to various R commands for descriptive statistics. Lab
More informationMerge Conflicts p. 92 More GitHub Workflows: Forking and Pull Requests p. 97 Using Git to Make Life Easier: Working with Past Commits p.
Preface p. xiii Ideology: Data Skills for Robust and Reproducible Bioinformatics How to Learn Bioinformatics p. 1 Why Bioinformatics? Biology's Growing Data p. 1 Learning Data Skills to Learn Bioinformatics
More informationCSE 255 Lecture 6. Data Mining and Predictive Analytics. Community Detection
CSE 255 Lecture 6 Data Mining and Predictive Analytics Community Detection Dimensionality reduction Goal: take high-dimensional data, and describe it compactly using a small number of dimensions Assumption:
More informationEvaluating Machine Learning Methods: Part 1
Evaluating Machine Learning Methods: Part 1 CS 760@UW-Madison Goals for the lecture you should understand the following concepts bias of an estimator learning curves stratified sampling cross validation
More informationGene Survey: FAQ. Gene Survey: FAQ Tod Casasent DRAFT
Gene Survey: FAQ Tod Casasent 2016-02-22-1245 DRAFT 1 What is this document? This document is intended for use by internal and external users of the Gene Survey package, results, and output. This document
More informationOverview. Linear Algebra Notation. MATLAB Data Types Data Visualization. Probability Review Exercises. Asymptotics (Big-O) Review
Tutorial 1 1 / 21 Overview Linear Algebra Notation Data Types Data Visualization Probability Review Exercises Asymptotics (Big-O) Review 2 / 21 Linear Algebra Notation Notation and Convention 3 / 21 Linear
More informationSupervised vs unsupervised clustering
Classification Supervised vs unsupervised clustering Cluster analysis: Classes are not known a- priori. Classification: Classes are defined a-priori Sometimes called supervised clustering Extract useful
More informationsrap: Simplified RNA-Seq Analysis Pipeline
srap: Simplified RNA-Seq Analysis Pipeline Charles Warden October 30, 2017 1 Introduction This package provides a pipeline for gene expression analysis. The normalization function is specific for RNA-Seq
More informationW ASHU E PI G ENOME B ROWSER
W ASHU E PI G ENOME B ROWSER Keystone Symposium on DNA and RNA Methylation January 23 rd, 2018 Fairmont Hotel Vancouver, Vancouver, British Columbia, Canada Presenter: Renee Sears and Josh Jang Tutorial
More informationIntroduction to R Jason Huff, QB3 CGRL UC Berkeley April 15, 2016
Introduction to R Jason Huff, QB3 CGRL UC Berkeley April 15, 2016 Installing R R is constantly updated and you should download a recent version; the version when this workshop was written was 3.2.4 I also
More informationAssignment No: 2. Assessment as per Schedule. Specifications Readability Assignments
Specifications Readability Assignments Assessment as per Schedule Oral Total 6 4 4 2 4 20 Date of Performance:... Expected Date of Completion:... Actual Date of Completion:... ----------------------------------------------------------------------------------------------------------------
More informationRNA-Seq. Joshua Ainsley, PhD Postdoctoral Researcher Lab of Leon Reijmers Neuroscience Department Tufts University
RNA-Seq Joshua Ainsley, PhD Postdoctoral Researcher Lab of Leon Reijmers Neuroscience Department Tufts University joshua.ainsley@tufts.edu Day four Quantifying expression Intro to R Differential expression
More informationEECS730: Introduction to Bioinformatics
EECS730: Introduction to Bioinformatics Lecture 15: Microarray clustering http://compbio.pbworks.com/f/wood2.gif Some slides were adapted from Dr. Shaojie Zhang (University of Central Florida) Microarray
More informationPrepare a stem-and-leaf graph for the following data. In your final display, you should arrange the leaves for each stem in increasing order.
Chapter 2 2.1 Descriptive Statistics A stem-and-leaf graph, also called a stemplot, allows for a nice overview of quantitative data without losing information on individual observations. It can be a good
More informationRepresenting sequencing data in Bioconductor
Representing sequencing data in Bioconductor Mark Dunning mark.dunning@cruk.cam.ac.uk Last modified: July 28, 2015 Contents 1 Accessing Genome Sequence 1 1.1 Alphabet Frequencies...................................
More informationVisualizing the World
Visualizing the World An Introduction to Visualization 15.071x The Analytics Edge Why Visualization? The picture-examining eye is the best finder we have of the wholly unanticipated -John Tukey Visualizing
More informationPackage demi. February 19, 2015
Type Package Package demi February 19, 2015 Title Differential Expression from Multiple Indicators Implementation of the DEMI method for the analysis of high-density microarray data. URL http://biit.cs.ut.ee/demi
More informationDATA MINING AND MACHINE LEARNING. Lecture 6: Data preprocessing and model selection Lecturer: Simone Scardapane
DATA MINING AND MACHINE LEARNING Lecture 6: Data preprocessing and model selection Lecturer: Simone Scardapane Academic Year 2016/2017 Table of contents Data preprocessing Feature normalization Missing
More informationAgenda. Predicate Testing. CSE 5321/4321, Ali Sharifara, UTA
Agenda Predicate Testing CSE 5321/4321, Ali Sharifara, UTA 1 Predicate Testing Introduction Basic Concepts Predicate Coverage Summary CSE 5321/4321, Ali Sharifara, UTA 2 Motivation Predicates are expressions
More information