OCAP: An R package for analysing iTRAQ data.


Penghao Wang, Pengyi Yang, Yee Hwa Yang
10 January 2012

Contents
1. Introduction
2. Getting started
2.1 Download
2.2 Install to R
2.3 Load the Package
2.4 The Test Dataset
3. Input and Output
4. Preprocessing Analysis Workflow
4.1 Fully Automatic Analysis
4.2 Individual Analysis Component - Peak Picking
Step 1: Individual Analysis Component - Peak Picking
Step 2: Individual Analysis Component - Protein Identification
Step 3: Individual Analysis Component - Protein Quantification
5. Descriptive Analysis Workflow
Technical details
Reference

1. Introduction

OCAP (Open Comprehensive iTRAQ Analysis Pipeline) is a comprehensive analysis pipeline for pre-processing, exploration and data analysis of mass spectrometry-based iTRAQ-labelled protein experiments. There are two versions of the software: (a) OCAP_C++, a stand-alone C++ version, and (b) OCAP, an R package that provides an R interface to OCAP_C++ as well as downstream statistical tools for visualisation.

Preprocessing of iTRAQ data has three major stages: (1) spectrum peak picking; (2) peptide and protein identification; (3) protein quantification. OCAP incorporates DyWave (Wang et al. 2010) for peak-picking of mass spectra, X!Tandem (Craig and Beavis 2004) for protein identification, and WQuant (a wavelet-based iTRAQ protein quantification algorithm) for extracting iTRAQ reporter ions for protein quantification.

2. Getting started

2.1 Download

Currently, users need to download the OCAP package from the OCAP webpage at Google Code, as a ZIP file and a tar.gz file. The webpage of the OCAP project is at The current version of OCAP is 1.2, and it can be built under the R version If you require OCAP for a specific version of R, please feel free to contact us: penghao.wang@sydney.edu.au.

2.2 Install to R

The next step is to install the downloaded OCAP package into the R environment. The package has a number of dependencies. First, click on Packages and then Select repositories. Select ALL repositories before proceeding.

Copy and paste the following code to install all dependencies:

    ## Install packages
    install.packages("limma")
    install.packages("multtest")
    install.packages("corrgram")
    install.packages("miscTools")
    install.packages("fUtilities")
    install.packages("PBSmodelling")
    install.packages("affy")

After the dependencies have been installed, select the Packages menu from R's main drop-down menu, then select Install packages from local zip files and specify the downloaded package in ZIP format. Figure 1 shows an example.
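Before installing, it can help to see which of the dependencies are already present. The following is a minimal sketch using only base R; `missing_packages` is a hypothetical helper written for this illustration and is not part of OCAP. Note that limma, multtest and affy are Bioconductor packages, which is why all repositories need to be selected, and some of the listed packages may no longer be available for current R versions.

```r
# Dependencies listed in this section (names as spelled in their repositories)
deps <- c("limma", "multtest", "corrgram", "miscTools",
          "fUtilities", "PBSmodelling", "affy")

# Hypothetical helper (not part of OCAP): return the packages in `pkgs`
# that are not found in any library on this machine.
missing_packages <- function(pkgs) {
  installed <- rownames(installed.packages())
  setdiff(pkgs, installed)
}

to_install <- missing_packages(deps)
if (length(to_install) == 0) {
  message("All dependencies are already installed.")
} else {
  message("Still to install: ", paste(to_install, collapse = ", "))
  # install.packages(to_install)  # run after selecting all repositories
}
```

The install call is left commented out so that the check can be run safely without changing the local library.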

Figure 1. How to install the downloaded packages to R.

Alternatively, users may install from the source tar.gz file. On the command line, go to the directory where the downloaded tar.gz package is located and type R CMD INSTALL OCAP_1.2.tar.gz.

2.3 Load the Package

Before starting any analysis, use the following code to load the package.

    ## load the OCAP package
    library(OCAP)

2.4 The Test Dataset

One iTRAQ dataset is provided for testing the package. The test data come from the published study of Whitehead et al. (2006), a 4-plex iTRAQ experiment evaluating the cellular response of bacteria to gamma radiation. In addition, the SWISS-PROT bacteria protein database sp_bacteria.fasta for this test data is included in the compressed data file.

3. Input and Output

OCAP can either perform the complete data analysis automatically, including peak-picking, protein identification and protein quantification, or perform the analysis step by step. It expects as input (a) raw mass spectra in mzXML format and (b) a protein sequence database in FASTA format.

Output: Upon completion of the analysis, the quantification results are returned as a list of two data.frame objects, representing peptide and protein level results.

If users are not familiar with R, it might be best to place all the mzXML spectra files together with the protein sequence database in a single directory. Raw mass spectra are very commonly larger than 1 GB, which may place a burden on memory during the peak-picking and quantification processes. It is therefore recommended to split large mzXML files into separate files and to merge the results once the quantification is completed. OCAP provides the functions exp_pep_mat and exp_prot_mat for combining the analyses.

4. Preprocessing Analysis Workflow

Before running OCAP on mass spectrometry data, it is important to organise the raw spectra and the other required files:

(1) mzXML files: Organise all mzXML raw spectra into a directory, and name this directory "mzxml".
(2) FASTA file: Organise a protein sequence database in FASTA format into a directory. The SWISS-PROT database may be obtained at UNIPROT:

We have provided a small test dataset as well as a human SWISS-PROT database for testing, and these can be downloaded from the OCAP webpage. Due to the size restriction in Google Doc and the size of our data set, users need to download all 6 components and put them together in the same directory. The remaining code illustrations are based on this test dataset.

4.1 Fully Automatic Analysis

OCAP can analyse the raw mass spectra automatically using a single function: pipeline_analyse. Place all the downloaded data (two files: 245.mzXML and sp_bacteria.fasta) into a C:/test directory. The following code performs the one-step analysis.

    ## FASTA database and mzXML directory are given as full paths; the result is returned as a list
    re <- pipeline_analyse(premode = "fast", mzxmldir = "C:/test", threshold = 1.1,
                           quanmode = "intensity", fasta = "C:/test/sp_bacteria.fasta")

The main parameters are:

premode: The peak-picking mode of the preprocessing algorithm, either "fast" or "full". On large datasets, full mode can take significantly longer to finish than fast mode.
mzxmldir: The directory where the mzxml directory is located. This parameter has to be set properly.
threshold: The protein identification expectation value threshold. Any identification with an expectation value larger than this is omitted.
quanmode: The quantification estimation mode: "AUC" (area under the curve), "intensity", or "trapzoid".
fasta: The protein database file, which has to be in FASTA format. The full path and file name have to be specified. The package has a built-in SWISS-PROT human database; to use it, simply leave this parameter blank.

For the full parameter list, please refer to the help file, which you can access via the following command.

    ## get the manual
    help(pipeline_analyse)

The result of the quantification workflow contains both peptide level and protein level quantification results.

    ## peptide level results
    re$peptide[1:5,]
    ## protein level results
    re$protein[1:5,]

The user can now proceed to the next step of the analysis: higher-level statistical analysis.

4.2 Individual Analysis Component - Peak Picking

OCAP gives users the option to run the preprocessing procedure step by step. Users may also output intermediate results for use in other statistical analysis software.

Step 1: Individual Analysis Component - Peak Picking

Peak-picking in OCAP is performed by the function preprocess, which uses the DyWave method (Wang et al. 2010).

    ## running DyWave for spectrum peak-picking
    spect = preprocess(runmode = "fast", totalpeak = 50, Normalise = "N", mzxmlpath = "C:/test")

Some parameters of preprocess may be important:

totalpeak: The maximum number of peaks allowed. If unsure, leave it at the default.
Normalise: "Y" or "N". When "Y" is specified, the peak intensities are normalised before further preprocessing steps. This may have a small impact; the default is "N".

The peak-picking process may take a while to complete, depending on the size of the spectra, so please be patient if the R interface appears to freeze. Once completed, a processed spectrum file is stored temporarily on disk, and users may access a spectrum through showspectrum by giving the spectrum index:

    ## display a specific spectrum
    showspectrum(1, DrawPeakNum = 20)

Step 2: Individual Analysis Component - Protein Identification

After peak picking, the next analysis procedure is protein identification. OCAP uses X!Tandem (Craig and Beavis 2004) for searching the database. All database search parameters can be given directly to OCAP, which does the necessary parsing and initiates the X!Tandem search. The database search result is stored temporarily on disk for the downstream quantification procedure. The X!Tandem algorithm is known for its speed; however, on large datasets the database search may still take some time.

    ## initiate X!Tandem database search
    proteinid(cutoff = 1.1, refine = FALSE, database = "C:/test/sp_bacteria.fasta")

See the help file for more details.
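To illustrate what an expectation value cutoff such as 1.1 does, here is a minimal base-R sketch. It does not call OCAP or X!Tandem; the table `ids`, its column names and the helper `filter_by_expect` are assumptions made up for this example, not OCAP's actual output format.

```r
# Hypothetical identification table; the column names are assumptions
# for illustration, not OCAP's actual output format.
ids <- data.frame(
  peptide = c("AAGLK", "TLVDR", "GGSPEK", "VNQWR"),
  protein = c("P1", "P1", "P2", "P3"),
  expect  = c(0.001, 0.8, 1.5, 3.2),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep identifications whose expectation value does
# not exceed the cutoff (smaller expect = more confident identification).
filter_by_expect <- function(tab, cutoff) {
  tab[tab$expect <= cutoff, , drop = FALSE]
}

kept <- filter_by_expect(ids, cutoff = 1.1)
print(kept)
```

A stricter cutoff (for example 0.05) would retain only the most confident identifications, at the cost of fewer peptides per protein for quantification.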

Step 3: Individual Analysis Component - Protein Quantification

The last procedure of the preprocessing workflow is protein quantification. It must be applied after the peak picking and protein identification procedures have been completed; if they have not been performed, unexpected results may occur. Protein quantification is performed by the function quantisation, and the syntax should follow the example below:

    ## perform protein quantification
    re = quantisation(plex = "4", runmode = "intensity")

Once the quantification is completed, the results are returned as a list object containing two data.frame objects, representing peptide and protein level results. Users can now move to the next phase of the analysis: quality control and higher-level statistical analysis.

5. Descriptive Analysis Workflow

OCAP incorporates several functions for visually exploring the data. These include examining the data quality, removing spurious or problematic samples, and checking the reliability of protein identifications and quantifications. This is important because mass spectrometry data are usually very noisy and protein identification can be error-prone. It may be advisable to apply stringent filtering to the results before proceeding with the analysis.

It may be important to look at individual spectra when checking a protein identification. OCAP provides the function showspectrum to display a specific spectrum for checking the peaks. Users may also visually compare two spectra to see whether they are reasonably close by looking at the major peaks, which in theory correspond to the ion series. This can be done with the function compare_spectrum.

    ## compare two spectra
    compare_spectrum(1, 6, peaknum = 20)

Users may want to examine a specific protein by looking at the identification confidence scores of its underlying peptides. OCAP uses X!Tandem's expect score as the protein identification confidence score (the smaller the score, the more confident the identification). This can be done with the function protein_conf_plot. A protein accession is needed to uniquely specify the protein to display.

    ## protein confidence plot
    acc = "sp P25970";
    protein_conf_plot(re, acc)

To look at the sequence coverage of a protein identification, users may use the function protein_coverage_plot. Each peptide is shown in a different colour in the plot.

    ## the protein identification sequence coverage
    protein_coverage_plot(re, acc);
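Sequence coverage itself is simple to compute: it is the fraction of residues in the protein that fall inside at least one identified peptide. The sketch below shows the idea in base R; the protein sequence, the peptides and the helper `sequence_coverage` are made up for illustration and are not taken from OCAP or from the test dataset.

```r
# Hypothetical protein sequence and identified peptides (illustrative only)
protein  <- "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
peptides <- c("TAYIAK", "SHFSR", "QISFVK")

# Hypothetical helper: fraction of residues covered by at least one peptide
sequence_coverage <- function(protein, peptides) {
  covered <- logical(nchar(protein))
  for (pep in peptides) {
    # Start positions of exact matches (fixed = TRUE: no regex);
    # gregexpr reports non-overlapping matches, or -1 if there is none
    starts <- gregexpr(pep, protein, fixed = TRUE)[[1]]
    for (s in starts[starts > 0]) {
      covered[s:(s + nchar(pep) - 1)] <- TRUE
    }
  }
  mean(covered)
}

coverage <- sequence_coverage(protein, peptides)
print(round(100 * coverage, 1))  # percentage of residues covered
```

Tracking coverage per residue, rather than summing peptide lengths, means that overlapping peptides are not counted twice.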

OCAP can display all the peptides assigned to a specific protein with the function protein_peptide_plot. Users may want to change the display setting for the number of sub-plots per line.

    ## to look at all the peptides assigned to a specific protein
    protein_peptide_plot(re, acc, xshow = 1, showpeak = 50);

OCAP also provides a function to evaluate the correlation between the samples of the experiment. Calling the function protein_corr_plot displays the correlation plot.

    ## look at the protein correlation graph
    protein_corr_plot(re, acc)

OCAP incorporates the biplot (Pittelkow and Wilson 2003) for an overall quality check of the data. Its main interface is the function protein_biplot, which can plot at both peptide and protein level. However, users may need to impute the data before applying the biplot, because the biplot cannot handle missing values. An example is given below.

    ## first need to define the class category for biplot
    class_label = c("good", "good", "bad", "bad")
    ## need to impute before doing biplot
    library(impute)
    ## as.matrix() makes the conversion work whether the result is a data.frame or a matrix
    pep1 = matrix(as.numeric(as.matrix(re$peptide[,3:6])), ncol=4);
    prot1 = matrix(as.numeric(as.matrix(re$protein[,3:6])), ncol=4);
    pep1 = matrix(as.character(impute.knn(pep1)[[1]]), ncol=4);
    prot1 = matrix(as.character(impute.knn(prot1)[[1]]), ncol=4);
    re.imp = re;
    re.imp$peptide[,3:6] = pep1;
    re.imp$protein[,3:6] = prot1;
    protein_biplot(plex = "4", re.imp, level = "protein", class_label, use_log = FALSE, use_stand = FALSE);

The peptide expression image plot can be drawn with the function peptide_imageplot. This plot may help in checking the quality of the identification and quantification of some peptides by their expression.

    ## the expression image plot of peptides of a protein
    peptide_imageplot(re, acc)

Statistical analysis of protein data is easy in R; nevertheless, OCAP provides some simple functions for the user's convenience. A straightforward normalisation of the protein data may be performed with the function post_norm, which also produces box-plots of the data.

    ## perform simple normalisation and box-plot the expression
    re_nor = post_norm(plex = "4", re);

Differentially expressed proteins or peptides may be analysed with the function DE_analysis, which uses limma (Smyth 2005) for the analysis. A very simple example is given below.

    ## DE analysis example, first need to specify a design matrix
    design = c(0,0,1,1);
    DE_protein = DE_analysis(plex = "4", re, design, MA_plot = TRUE);
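To make the role of the design vector c(0,0,1,1) concrete, the following base-R sketch normalises a small simulated 4-plex matrix and fits a two-group linear model per protein. It is only a rough stand-in under stated assumptions: the data and channel names are simulated, the median centring is not necessarily what post_norm does, and plain least squares is used where limma would additionally moderate the variances with an empirical Bayes step.

```r
set.seed(1)
# Simulated log2 reporter intensities: 6 proteins x 4 channels (4-plex).
# Entirely made-up data; the channel names are assumptions.
expr <- matrix(rnorm(24, mean = 10, sd = 0.2), nrow = 6,
               dimnames = list(paste0("prot", 1:6),
                               c("ch114", "ch115", "ch116", "ch117")))
expr[1, 3:4] <- expr[1, 3:4] + 2   # prot1 is higher in the second group

# Median-centre each channel (a simple stand-in normalisation)
expr_norm <- sweep(expr, 2, apply(expr, 2, median))

# Same grouping as the design vector used in the text
group  <- c(0, 0, 1, 1)
design <- cbind(Intercept = 1, group = group)

# Ordinary least squares for all proteins at once; the 'group' coefficient
# is the log2 fold change between the two groups of channels
fit   <- lm.fit(design, t(expr_norm))
logfc <- fit$coefficients["group", ]
print(round(sort(logfc, decreasing = TRUE), 2))
```

With a 0/1 group indicator, the fitted group coefficient equals the difference between the two group means for each protein, which is the quantity a differential expression analysis then tests.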

Technical details:

The version number of R and packages loaded for generating the figures were:

R version ( )
Platform: i386-pc-mingw32/i386 (32-bit)
Locale: LC_COLLATE = English_Australia.1252, LC_CTYPE = English_Australia.1252, LC_MONETARY = English_Australia.1252, LC_NUMERIC = C, LC_TIME = English_Australia.1252
Attached base packages: stats, graphics, grDevices, utils, datasets, methods, base
Other attached packages: impute_1.24.0, OCAP_1.2, affy_1.32, PBSmodelling_ , fUtilities_ , miscTools_0.6-13, corrgram_1.1, seriation_1.0-6, colorspace_1.1-0, gclus_1.3, TSP_1.0-6, cluster_1.14.1, multtest_2.10.0, Biobase_2.14.0, MASS_7.3-16, limma_
Loaded via a namespace (and not attached): affyio_1.22.0, BiocInstaller_1.2.1, preprocessCore_ , splines_2.14.0, survival_ , tcltk_2.14.0, zlibbioc_1.0.0

Reference

Craig, R. and Beavis, R. (2004) TANDEM: matching proteins with mass spectra. Bioinformatics 20(9):
Pittelkow, Y.E. and Wilson, S.R. (2003) Visualisation of gene expression data: the GE-biplot, the Chip-plot and the Gene-plot. Stat. Appl. Genet. Mol. Biol. 2: Article 6. Epub 2003 Sep 4.
Smyth, G.K. (2005) Limma: linear models for microarray data. Bioinformatics and Computational Biology Solutions using R and Bioconductor. pp
Wang, P. et al. (2010) A dynamic wavelet-based algorithm for pre-processing tandem mass spectrometry data. Bioinformatics 26(18):
Whitehead, K. et al. (2006) An integrated systems approach for understanding cellular responses to gamma radiation. Mol. Syst. Biol. 2: 47.
