Supplementary Figure 1. Schematic representation of the Workflow window in Perseus. All data matrices uploaded in the running session of Perseus and all processing steps are displayed in the order of execution. The workflow lets users keep track of every step in the analysis and navigate between data matrices and visualization components by clicking on the respective node in the diagram. Nodes can be annotated with a description and additional information for clarity. If a data matrix node is selected, information about the number of samples and data points is displayed in the rightmost panel of Perseus. If an analysis node is selected, all parameters that were used in that step can be reviewed. Each data matrix and each visualization window can be exported in publication-ready formats. The workflow scheme can be saved as a PDF file and used to document all steps of the analysis.
Supplementary Figure 2. Plug-in architecture of Perseus. Perseus is built around a data matrix type and a set of functions for accessing and transforming the matrix. The base code implementing these operations is open source and can be downloaded from GitHub (github.com/jurgencox/perseus-plugins). The rest of the functionality is organized in two main interfaces, Processing and Analysis, and the resulting modules are added to the software core as plug-ins. Developers wishing to extend the software can build upon the main source code and contribute new plug-ins to our online plug-in store.
Supplementary Figure 3. Missing value imputation. Perseus offers several imputation techniques, including a method that draws random values from a distribution meant to simulate expression below the detection limit. The width and the down-shift of this distribution can be set so that it closely represents the missing population. When missing values occur at random, a distribution similar to that of the measured data is normally used for imputation. In proteomics experiments, in contrast, a frequently used assumption is that low-expression proteins give rise to missing values; a Gaussian distribution whose median is shifted from the median of the measured data towards low expression should therefore impute such values accurately. The mode parameter defines which measured data distribution is used to calculate the random distribution. When the samples do not differ greatly in their overall distribution, use of the complete dataset is recommended. The measured distribution is shown in blue and the imputed values in orange. (a) No down-shift and a distribution width of 0.5 do not simulate low-abundance missing values. (b) A down-shift of 1.8 and a distribution width of 0.5 simulate the assumption that low-abundance proteins give rise to missing values. (c) A down-shift of 3.6 and a width of 0.5 result in an undesirable bimodal distribution.
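The down-shifted imputation described in this caption can be sketched in a few lines. This is an illustration only, not the Perseus implementation: the function name `impute_down_shifted` is hypothetical, and we assume that the width and down-shift are expressed as multiples of the standard deviation of the measured values and that the Gaussian is centred relative to their mean.

```python
import math
import random
import statistics


def impute_down_shifted(values, width=0.5, down_shift=1.8, seed=0):
    """Fill NaN entries with draws from a narrowed, down-shifted Gaussian.

    Assumption (not taken from the caption): width and down_shift are
    multiples of the standard deviation of the measured (non-missing)
    values, and the shift is applied to their mean.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    measured = [v for v in values if not math.isnan(v)]
    sd = statistics.stdev(measured)
    center = statistics.mean(measured) - down_shift * sd
    spread = width * sd
    # Measured values are kept as they are; only NaN entries are replaced.
    return [rng.gauss(center, spread) if math.isnan(v) else v for v in values]


# Toy log-intensities for one sample; NaN marks a missing value.
sample = [24.1, 25.3, float("nan"), 26.0, 23.7, float("nan"), 25.8]
imputed = impute_down_shifted(sample, width=0.5, down_shift=1.8)
```

Under these assumptions, calling the function once per sample column or once on the pooled values of all samples would correspond to the two choices of the mode parameter.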
Supplementary Figure 4. Density-enhanced scatterplots between proteome, transcriptome and translatome levels produced by the upload plug-in. Short-read NGS data, for instance as produced by the Illumina platform, can be imported for further analysis in the Perseus workflow. In the example, we calculate RPKM values for each gene (Ingolia N. T. et al., Science, 2009) and compare them with iBAQ values calculated by MaxQuant from proteomics data derived from yeast (Kulak N. A. et al., Nature Methods, 2014).
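For reference, RPKM (reads per kilobase per million mapped reads) follows the standard definition: read count times 10^9, divided by the product of gene length in base pairs and the total number of mapped reads. The sketch below is a minimal illustration of that formula; the function name, gene names and counts are made up and do not come from the datasets cited in the caption.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of gene length per million mapped reads."""
    if gene_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("gene length and total mapped reads must be positive")
    # 1e9 = 1e3 (bases per kilobase) * 1e6 (reads per million)
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


# Hypothetical example: gene -> (mapped reads, gene length in bp)
counts = {"geneA": (500, 2000), "geneB": (120, 600)}
total_reads = 10_000_000
rpkm_values = {
    gene: rpkm(n_reads, length, total_reads)
    for gene, (n_reads, length) in counts.items()
}
```

Such values are typically log-transformed before they are plotted against protein-level quantities.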
Supplementary Figure 5. Augmented data matrix. In addition to the main data matrix, Perseus can make use of background information complementary to the expression columns. (a) One of the first processing steps in data analysis is often filtering for a minimum number of valid values. Because some statistical methods (e.g. PCA) require all values to be present, data imputation may be necessary. Upon imputation, a second matrix is created in the background that records which values were measured and which were imputed; it can later be used to highlight or remove the imputed values. (b) In a more advanced filtering option, a Quality matrix is created first. It contains additional information about each expression value in the main matrix and is used for filtering. For example, the number of peptides used for protein quantification can be used to filter out proteins that were identified with fewer than two peptides.
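The two panels can be illustrated with a small sketch that keeps the background matrices alongside the main matrix. It is not the Perseus data model: all function names are hypothetical, matrices are plain dictionaries mapping a protein name to a list of values, and we assume that quality filtering keeps a protein only if every sample meets the threshold.

```python
import math


def filter_valid_values(rows, min_valid):
    """Keep rows with at least min_valid non-missing values."""
    return {
        name: vals for name, vals in rows.items()
        if sum(not math.isnan(v) for v in vals) >= min_valid
    }


def impute_with_mask(rows, fill_value):
    """Fill NaNs with a constant and record which entries were imputed."""
    mask = {name: [math.isnan(v) for v in vals] for name, vals in rows.items()}
    filled = {
        name: [fill_value if m else v for v, m in zip(vals, mask[name])]
        for name, vals in rows.items()
    }
    return filled, mask


def remove_imputed(filled, mask):
    """Use the background mask to turn imputed values back into NaN."""
    return {
        name: [float("nan") if m else v for v, m in zip(vals, mask[name])]
        for name, vals in filled.items()
    }


def filter_by_quality(rows, quality, min_quality):
    """Keep rows whose quality values all reach min_quality (assumed rule)."""
    return {
        name: vals for name, vals in rows.items()
        if all(q >= min_quality for q in quality[name])
    }
```

For panel (b), the quality matrix would hold, for example, the number of peptides per protein and sample, so that `filter_by_quality(rows, peptide_counts, 2)` drops proteins with fewer than two peptides in any sample.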
Supplementary Table 1. A list of the main functionalities in Perseus.
LOAD: Generic matrix upload; Raw upload; Create gene list; Binary upload; Create random matrix; Next generation sequencing data upload
MULTI-PROC., Basic: Match rows by name; Match columns by name; Replace strings
EXPORT: Generic matrix export
ANALYSIS, Visualization: Scatter plot; Profile plot; Histogram; Multi-scatter plot; 3D plot
ANALYSIS, Clustering/PCA: Hierarchical clustering; Principal component analysis
ANALYSIS, Misc.: Volcano plot; Select rows manually; Sequence logos; Numeric venn diagram
PROCESSING, Basic: Transform; Combine main columns; Column correlation; Row correlation; Summary statistics (columns); Summary statistics (rows); Quantiles; Density estimation; Performance curves; Combine rows by identifiers; Clone; Add noise
PROCESSING, Rearrange: Change column type; Rename columns; Rename columns [reg. ex.]; Reorder/remove columns; Reorder/remove annotation rows; Duplicate columns; Combine annotations; Remove empty columns; Transpose; Sort by column; Fill categorical columns; De-hyphenate ids; Expand multi-numeric and text columns; Unique values; Convert multi-numeric column; Combine categorical columns; Process text column; Search text column
PROCESSING, Normalization: Z-score; Rank; Unit vectors; Scale to interval; Width adjustment; Subtract; Divide; Modify by column; Subtract row cluster; Un-Z-score
PROCESSING, Filter rows: Filter rows based on categorical column; Filter rows based on numerical/main column; Filter rows based on text column; Filter rows based on valid values; Filter rows based on random sampling
PROCESSING, Filter columns: Filter columns based on categorical row; Filter columns based on valid values
PROCESSING, Quality: Create quality matrix; Filter quality; Convert to NaN
PROCESSING, Annot. columns: Add annotation; To base identifiers; Fisher exact test; Average categories; Category counting; 1D annotation enrichment; 2D annotation enrichment
PROCESSING, Annot. rows: Categorical annotation rows; Numerical annotation rows; Average groups; Join terms in categorical row
PROCESSING, Tests: One-sample tests; Two-sample tests; Multiple-sample tests; Two-way ANOVA; Three-way ANOVA
PROCESSING, Imputation: Replace missing val. from normal distrib.; Replace missing values by constant; Replace imputed values by NaN
PROCESSING, Modifications: Expand site table; Add linear motifs; Add known sites; Add modification counts; Kinase-substrate relations; Add sequence features; Add regulatory sites; Shorten motif length
PROCESSING, Time series: Cyclic annotation enrichment; Periodicity analysis; Periodogram; Time series ordering
PROCESSING, Outliers: Significance A; Significance B
PROCESSING, Learning: Classification; Classification feature optimization; Classification parameter optimization
PROCESSING, Clustering: Generic clustering