SEEK User Manual. Introduction

SEEK is a computational gene co-expression search engine. It uses a large human gene expression compendium to deliver fast, integrative, cross-platform co-expression analyses. SEEK also provides instant visualization of the co-expressed genes in relevant datasets. The following sections walk through the pages of the SEEK website, annotate the functions on each page, and explain how to interpret the heat maps that SEEK generates.

Version 1.0

1.0 Main Page
2.0 Expression View Page
3.0, 4.0 Gene and Dataset Analysis Dialogs
5.0 Expression Zoom-In Page
6.0 Gene Enrichment Page
7.0 Search Refinement Page
8.0 Options Page
9.0 Co-expression View Page
10.0 Gene Co-expression Contribution Page
11.0 Export Pages

1.0 Main Page

1. Query box: enter the query gene(s) in gene symbol format, separated by spaces.

2.0 Expression View Page (Search Result)

When the search completes, the Expression View is the first page presented to the user.

1. The user's query.
2. The query genes' expression profiles across the top relevant datasets detected for this query.
3. The co-expressed genes' expression profiles.
4. The list of co-expressed genes, showing each gene's rank, co-expression score, and gene name. Clicking a gene name invokes the Gene Analysis Dialog.
5. An expression profile within a dataset. The dataset heading shows the rank of the dataset, the relevance weight (in parentheses), and a list of keywords mined from the dataset description. Clicking the dataset heading invokes the Dataset Analysis Dialog.
6. Hierarchical clustering of the conditions within a dataset, based on the expression of the top 50 co-expressed genes.
6b. Expression heat map. Hatching lines indicate that a gene is not present in the dataset. Clicking a row in the heat map invokes the Expression Zoom-In Page.
7. Color gradient indicating down- and up-regulated expression values.
8. Navigation to the next page of co-expressed genes or the next page of datasets.
9. Gene enrichment analysis of the top co-expressed genes (see Gene Enrichment Page).
10. Search refinement by narrowing down datasets (see Search Refinement Page).
11. Textual export of SEEK's retrieved gene list, dataset list, and expression matrices on this page (see Export Pages).
12. Search and visualization options (see Options Page).
13. Toggle between Expression View and Co-expression View (see Co-expression View Page).

3.0 Gene Analysis Dialog

1. Gene name, description, and ENTREZ id. Clicking the ENTREZ id redirects the user to the NCBI Entrez Gene page.
2. Gene description, shown by clicking the [ + ] icon.

4.0 Dataset Analysis Dialog

1. Dataset GSE id, keywords, and NCBI GEO link.
2. Description of the dataset, shown by clicking [ + ].
3. Down-regulated conditions, where the query genes and the top 50 co-expressed genes have an average gene-centered expression of -1.0 or lower. Gene-centered expression is calculated by subtracting the row average from the absolute expression value and then dividing by the row standard deviation.
4. Up-regulated conditions, where the query genes and the top 50 co-expressed genes have an average gene-centered expression of 1.0 or above.
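The gene-centered expression used in items 3 and 4 can be sketched in Python as follows. This is a minimal illustration and not SEEK's own code: the function names and the toy matrix are hypothetical, and the use of the population standard deviation is an assumption, since the manual does not state which form SEEK uses.

```python
from statistics import mean, pstdev

def gene_center(row):
    """Subtract the row average, then divide by the row standard deviation."""
    mu = mean(row)
    sd = pstdev(row)  # assumption: population standard deviation
    if sd == 0:
        return [0.0 for _ in row]  # a constant row carries no signal
    return [(x - mu) / sd for x in row]

def classify_conditions(matrix, threshold=1.0):
    """Average the gene-centered rows per condition and split by threshold."""
    centered = [gene_center(row) for row in matrix]
    n_cond = len(centered[0])
    avg = [mean([r[j] for r in centered]) for j in range(n_cond)]
    down = [j for j, v in enumerate(avg) if v <= -threshold]
    up = [j for j, v in enumerate(avg) if v >= threshold]
    return avg, down, up

# Toy example: 2 genes (rows) x 4 conditions (columns); values are made up.
matrix = [
    [1.0, 2.0, 3.0, 10.0],
    [2.0, 2.5, 3.0, 9.0],
]
avg, down, up = classify_conditions(matrix)
print("up-regulated condition indices:", up)
print("down-regulated condition indices:", down)
```

In this toy matrix only the last condition has an average gene-centered expression above the threshold, so it is the only one reported as up-regulated.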

5.0 Expression Zoom-In Page

This view shows the expression of a given gene (SMO) across the conditions of a given dataset (GSE2630). Each dataset is made up of a set of conditions. Each condition has a set of attributes annotated to it (acquired from NCBI Gene Expression Omnibus), such as patient age, sex, subtype information, and drug treatments. SEEK uses these attribute-value pairs for display and sorting.

1. The display attribute. Changing the display attribute changes the information shown about each condition but does not alter the order of the conditions.

2. The sort attribute. Changing this reorders the conditions according to the attribute values. For example, selecting sex as the sort attribute reorders the conditions so that all conditions with the value female are grouped together, followed by the group of conditions with the value male. Another sorting option is to hierarchically cluster the conditions by expression, which does not use attribute annotations.
3. The attribute values. Mousing over a condition label displays all information about that condition. Clicking the label redirects the user to the NCBI page for that condition.

Tip: The display attribute and the sort attribute do not need to be the same. Users can take advantage of this to discover relationships that might exist between attributes.

Tip: We also suggest pairing hierarchical clustering (as the sorting option) with a display attribute of the user's choice. This combination can reveal connections between the expression value and an attribute type. For example, users might want to know whether ESR+ status in breast cancer patients is correlated with up-regulated expression of the gene of interest.
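The separation between the sort attribute and the display attribute can be sketched as below. This is only an illustration of the idea: the condition records, attribute values, and function names are hypothetical and do not come from SEEK or from any real dataset.

```python
# Hypothetical conditions with made-up attribute values.
conditions = [
    {"id": "cond1", "sex": "male", "age": "61"},
    {"id": "cond2", "sex": "female", "age": "48"},
    {"id": "cond3", "sex": "male", "age": "55"},
    {"id": "cond4", "sex": "female", "age": "70"},
]

def order_conditions(conditions, sort_attr):
    """Reorder conditions so that equal values of sort_attr are grouped together."""
    return sorted(conditions, key=lambda c: c.get(sort_attr, ""))

def labels(conditions, display_attr):
    """Return the label shown for each condition; the order is left untouched."""
    return [c.get(display_attr, "") for c in conditions]

by_sex = order_conditions(conditions, "sex")
print(labels(by_sex, "sex"))  # conditions grouped by the sort attribute
print(labels(by_sex, "age"))  # same order, different display attribute
```

Sorting by one attribute and displaying another makes it easy to see whether the values of the displayed attribute cluster within the sorted groups.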

6.0 Gene Enrichment Page

1. The number of top-ranked genes on which to perform enrichment.
2. Perform a gene-set enrichment analysis against an external functional database of gene sets. The functional databases connected to SEEK are:
   Chromosomal position: MSigDB positional gene sets
   Pathway: BioCarta gene sets; KEGG pathway gene sets; Reactome gene sets
   Differentially expressed genes: MSigDB chemical and genetic perturbation gene sets
   Biological process: Gene Ontology Biological Process branch (with experimental annotations only); Gene Ontology Biological Process branch (with electronic annotations)
   miRNA motif: TargetScan gene sets of shared miRNA motifs
3. The set of enriched terms found to be statistically significant. Standard hypergeometric tests are performed, followed by a Benjamini-Hochberg multiple hypothesis testing correction. The table displays the term name, p-value, q-value, size of the term (T), size of the analysis set after removing genes without any annotation (A), and size of the overlap (T&A).

Tip: Mouse over a number in the T&A column to see the names of the overlapping genes.
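The statistics named in item 3 can be sketched with the Python standard library. This is a generic illustration of a one-sided hypergeometric test followed by a Benjamini-Hochberg correction, not SEEK's implementation. The function names are hypothetical, and the background size (the number of annotated genes the analysis set is drawn from) is an assumption, because the manual does not state which background SEEK uses.

```python
from math import comb

def hypergeom_pvalue(overlap, term_size, analysis_size, background_size):
    """P(X >= overlap) when analysis_size genes are drawn from the background
    and term_size of the background genes belong to the term."""
    total = comb(background_size, analysis_size)
    upper = min(term_size, analysis_size)
    tail = sum(
        comb(term_size, k) * comb(background_size - term_size, analysis_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total

def benjamini_hochberg(pvalues):
    """Return q-values (BH-adjusted p-values) in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone q-values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        qvalues[i] = running_min
    return qvalues

# Made-up example: T = 40, A = 100, overlap T&A = 8, background of 20000 genes.
p = hypergeom_pvalue(overlap=8, term_size=40, analysis_size=100, background_size=20000)
q = benjamini_hochberg([p, 0.04, 0.03, 0.20])
print(p, q)
```

The q-value column of the enrichment table corresponds to the adjusted values returned by the second function, computed over all terms tested together.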

7.0 Search Refinement Page

Users may find search refinement helpful in the following situations:
-- The query is too small to be informative for SEEK's dataset weighting.
-- After seeing the search results, users wish to continue the analysis by running the same query on a subset of datasets related to a certain disease, cell type, or tissue.
-- Users wish to see the effect of integrating only the top X datasets ranked by SEEK.

The search refinement page supplements the query-based dataset weighting to provide more dataset-specific results. Users can:
1. Limit datasets by tissue, cell type, or disease.
2. Limit datasets by the rank assigned by SEEK.
3. Restore the default option of using all datasets.

Limiting datasets by tissue, cell type, or disease:

SEEK has pre-mapped the datasets to UMLS tissue, cell type, and disease categories based on the dataset descriptions obtained from Gene Expression Omnibus.

1. Select one or more tissue, cell type, or disease categories.
2. Enter a keyword in the text box to filter the list (optional). All categories are displayed in the list if no keyword is entered. Use None to select none, or All to select all categories in the list.
3. Click Check Selection to see which datasets are selected.
4. A list of the selected datasets is displayed.
5. Click Refine to begin search refinement of the current query within the selected subset of datasets.

Limiting datasets by rank:

1. Rank of the dataset, based on the dataset ranking for the current query. Users can select the top 10, the top 100, or any other number of datasets.
2. Click Refine to begin the search integration of the current query within the top X datasets.
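The two refinement modes described in this section amount to simple selections over the dataset list. The sketch below illustrates them; the category names, dataset identifiers, and function names are hypothetical placeholders, and SEEK's actual category mapping is not reproduced here.

```python
def filter_categories(categories, keyword=""):
    """Return the categories whose name contains the keyword (all if empty)."""
    kw = keyword.lower()
    return [c for c in categories if kw in c.lower()]

def datasets_for_categories(category_to_datasets, selected):
    """Return the union of the datasets mapped to the selected categories."""
    chosen = set()
    for category in selected:
        chosen.update(category_to_datasets.get(category, []))
    return sorted(chosen)

def top_ranked(ranked_datasets, x):
    """Keep only the top x datasets of a list already ordered by rank."""
    return ranked_datasets[:x]

# Placeholder mapping; the identifiers are made up for illustration.
category_to_datasets = {
    "breast carcinoma": ["DATASET_A", "DATASET_B"],
    "breast tissue": ["DATASET_B", "DATASET_C"],
    "liver": ["DATASET_D"],
}
selected = filter_categories(list(category_to_datasets), "breast")
print(datasets_for_categories(category_to_datasets, selected))
print(top_ranked(["DATASET_C", "DATASET_A", "DATASET_D"], 2))
```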

8.0 Options Page

1. Dataset aggregation method. Three choices are offered: CV RBP weighted (default; see Note 1), Order statistics (see Note 2), and Equally weighted.
2. If CV RBP weighted is chosen, the parameter p used in dataset weighting (see Note 3).
3. Minimum threshold for using a dataset. Because not all datasets contain all of the query genes, this option lets users specify the minimum fraction of query genes that must be present for a dataset to be considered for integration. A dataset that does not meet this threshold is skipped for the query.
4. Distance measure. Three choices are offered. All are based on Pearson correlation but differ in their additional processing:
   1) Pearson correlation.
   2) Pearson correlation + Fisher transform + standardization (producing a z-score).
   3) Z-score + gene-connectivity correction (see Note 4). This reduces the influence of well-connected genes in the search while boosting weakly connected genes.
5. If true, display the gene-centered expressions in the Expression View.

6. If specified, restrict the genes shown in the Expression View by their P-values. A P-value is computed for each gene with respect to the current query, as the empirical probability that the gene's score in a large pool of random queries exceeds its observed co-expression score.
7. If true, each cell of the heat map in the Co-expression View shows the co-expression z-score multiplied by the dataset weight.
8. If true, and the user has previously selected datasets with the Refine Search function, the dataset selection is carried over to all future queries. To change the selection, the user must select the datasets again with Refine Search; there is currently no function to modify an existing selection.

Note 1. CV RBP weighting (the default in SEEK) uses a novel dataset weighting algorithm:

$$ w_d = \frac{1}{|Q|} \sum_{q \in Q} \sum_{i=1}^{|R_q|} (1 - p)\, \mathrm{rel}(i)\, p^{\mathrm{rank}(i)} $$

where $Q$ is the query and $R_q$ is the ranking of genes obtained by sorting the genes in the genome by their correlation to a single query gene $q$; $i$ is an item in the ranking $R_q$; $\mathrm{rel}(i)$ is 1.0 if $i$ is one of the genes in $Q$ and 0 otherwise; $\mathrm{rank}(i)$ is the rank of $i$ in $R_q$; and $p$ is a tuning parameter (see Note 3) that is set to 0.99.

The integration of datasets using the weights is described by:

$$ \mathrm{score}(g, Q) = \frac{\sum_{d \in D} w_d \left[ \frac{1}{|Q|} \sum_{q \in Q} z_d(g, q) \right]}{\sum_{d \in D} w_d} $$

where $D$ is the set of datasets, $Q$ is the query, $z_d$ is the z-scored correlation corrected for gene connectivity, and $w_d$ is the weight of dataset $d$.

Note 2. Order statistics is the search algorithm described in the MEM paper (Adler et al., Genome Biology, 2009, 10:R139).

Note 3. The parameter $p$ refers to the $w_d$ formula (the default value of $p$ is 0.99). It tunes the importance of highly ranked correlations: a lower value of $p$ attaches more importance to the top-ranked correlations. The recommended range for $p$ is between 0.95 and 0.999. Adjusting this parameter changes the distribution of weights across datasets.
Note 4. Gene-connectivity correction on the z-score:

$$ z_{\mathrm{corrected}}(g, q) = z(g, q) - \frac{1}{|G|} \sum_{x \in G} z(g, x) $$

where $G$ is the genome.
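The dataset weighting, the gene-connectivity correction, and the weighted integration described in the notes above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not SEEK's implementation: all function names and toy inputs are hypothetical, each ranking is assumed to exclude the query gene it was built from, and the connectivity correction is read as subtracting the gene's average z-score over the genome.

```python
def rbp_weight(rankings, query, p=0.99):
    """Dataset weight w_d. rankings maps each query gene q to R_q, the list of
    genes sorted by decreasing correlation to q (assumed to exclude q)."""
    query_set = set(query)
    total = 0.0
    for q in query:
        for rank, gene in enumerate(rankings[q], start=1):
            if gene in query_set:  # rel(i) = 1 for query genes, 0 otherwise
                total += (1 - p) * p ** rank
    return total / len(query)

def connectivity_correct(z, g, q, genome):
    """Subtract the average z-score of gene g over the genome (assumed form)."""
    return z[g][q] - sum(z[g][x] for x in genome) / len(genome)

def integrate(z_by_dataset, weights, g, query):
    """Weighted average over datasets of the mean z-score of g to the query."""
    numerator = sum(
        w * sum(z_by_dataset[d][g][q] for q in query) / len(query)
        for d, w in weights.items()
    )
    return numerator / sum(weights.values())

# Toy inputs with made-up genes, rankings, and z-scores.
query = ["A", "B"]
rankings = {"A": ["B", "C", "D"], "B": ["C", "A", "D"]}
w = rbp_weight(rankings, query, p=0.5)  # p = 0.5 only to keep the numbers small

z_by_dataset = {
    "d1": {"C": {"A": 2.0, "B": 4.0}},
    "d2": {"C": {"A": 1.0, "B": 1.0}},
}
score = integrate(z_by_dataset, {"d1": 1.0, "d2": 3.0}, "C", query)
print(w, score)
```

A dataset in which the query genes rank each other highly receives a larger weight, so its z-scores contribute more to the final score of each candidate gene.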

9.0 Co-expression View Page

The Co-expression View displays the co-expression landscape across 50 datasets at a time.

1. The top 50 datasets. The heading for a dataset includes the dataset rank and a list of keywords describing the dataset.
2. Query cross-validations. The heat map shows how each query gene correlates with the rest of the query across the 50 datasets.
3. Color gradient scale for the query cross-validation heat map. Values indicate the contribution of query gene $q$ in the formula for $w_d$.
4. The co-expression heat map. A solid-color cell denotes the co-expression score of a gene to the query in a particular dataset. A cell with hatching lines denotes a gene that is missing from the dataset. With this view, users can see each dataset's contribution to the final co-expression score of every gene in the ranked list.
5. Color gradient scale for the co-expression heat map. Values indicate the z-scored correlation.

Clicking a gene name on the Co-expression View Page invokes the Gene Co-expression Contribution Page.

10.0 Gene Co-expression Contribution Page

This view presents the co-expression scores of a given gene to the query across all 50 datasets.

11.0 Export Pages

Gene list export

A tab-delimited export of the list of co-expressed genes, ranked by co-expression score to the current query.

Tip: To see where a gene of interest is ranked, search for it directly in the browser using the browser's search function.

Dataset list export

A tab-delimited export of the list of datasets, ranked by query-relevance weight to the current query.

Tip: To see where a dataset of interest is ranked, search for it directly in the browser using the browser's search function.
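As an alternative to the browser search, a saved export can be scanned with a short script. The sketch below is hypothetical: the manual only states that the exports are tab-delimited and ranked, so the column layout (name in the first column, one entry per line, no header row) and the sample content are assumptions to be adjusted to the actual file.

```python
import csv
import io

def find_rank(export_text, name, name_column=0):
    """Return the 1-based line position of name in a tab-delimited export,
    or None if it is absent. The column layout is an assumption."""
    reader = csv.reader(io.StringIO(export_text), delimiter="\t")
    for position, row in enumerate(reader, start=1):
        if len(row) > name_column and row[name_column] == name:
            return position
    return None

# Made-up export content: gene symbol, then a score, separated by a tab.
sample = "GENE_A\t0.91\nGENE_B\t0.85\nGENE_C\t0.80\n"
print(find_rank(sample, "GENE_B"))
```

The same function works for the dataset list export if `name_column` points at the column holding the dataset identifier.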