BCHM 6280 2017 Excel Tutorial

Tutorial 1: Using Excel to find unique values in a list

It is not uncommon to have a list of data that contains redundant values. Genes with multiple transcript isoforms are one example. If you are only interested in the genes and not the different transcripts, then you will probably want to filter the list to remove the redundant values.

I did a search of the UCSC human genome browser with the query "colon cancer" and got back >500 matches. I created a text file listing the first 500 matches. You can download this data from the Exercise 1 home page by clicking on the link ListofGenesfromUCSC.txt. The file has 2 columns: Gene Name and Chromosome Location. You will filter on Gene Name.

Once you've downloaded the text file, do the following:

1. Open Excel and, from within Excel, open the text document. If the file you want to open is greyed out, change the drop-down menu to Enable: All Readable Documents.
2. Double-click the file you want to open. This should bring up the Text Import Wizard, which should recognize the file as delimited.
3. Click the Next button to define the delimiters. By default, Excel assumes a .txt file is tab-delimited.
4. Click Next and then Finish to finish the import.

Advanced filter:

1. Select the column of gene names.
2. Click on the Data menu and select Advanced filter (if you get a warning about being unable to determine which row contains column labels and you have a column header in row 1, just click OK).
3. Check the radio button Copy to another location. This should move your cursor to the Copy to text box.
4. Select a column (not Columns A-C) as the destination.
5. Check the box Unique records only.
6. Click the OK button.

This should produce a list of 208 genes from the original 500 genes.
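For readers who want to check the Excel result outside the spreadsheet, the same filtering idea can be sketched in Python. This is only an illustration, not part of the course materials: the function name unique_values and the sample rows are made up, and the sketch compares names exactly (including letter case), which may not match how Excel compares text.

```python
import csv
import io


def unique_values(tsv_text, column):
    """Return the distinct values of one column, in first-seen order."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    seen = set()
    unique = []
    for row in reader:
        value = row[column]
        if value not in seen:  # keep only the first occurrence
            seen.add(value)
            unique.append(value)
    return unique


# Made-up sample rows, NOT taken from ListofGenesfromUCSC.txt.
SAMPLE = (
    "Gene Name\tChromosome Location\n"
    "GENE_A\tchr1:100-200\n"
    "GENE_B\tchr2:300-400\n"
    "GENE_A\tchr1:100-250\n"
)

print(unique_values(SAMPLE, "Gene Name"))
```

To run it on the downloaded file instead, the file's contents could be read into a string and passed in place of SAMPLE, assuming the header row is spelled exactly "Gene Name".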
Tutorial 2: Using Excel to manage text data

A common issue with gene names or gene identifiers is that slight variations can prevent their identification via a database lookup. For example, as gene or transcript records are reviewed by curators, they are often given an appended number, such as NM_0012345.1 or NM_0012345.3, indicating which version they are. The base identifier NM_0012345 is the same in both, but if your list has the appended version number, a database lookup or Excel lookup won't recognize the two as being the same record.

In this example, there are two Excel files available from the Exercise 2 homepage: ExpressionData.xlsx and GeneInfo.xlsx.

The ExpressionData file has two columns. The first has Ensembl GeneIDs with the version number. The second column contains gene expression information in the form of the Log2 ratio of treatment/control.

The GeneInfo file has four columns. The first has Ensembl GeneIDs, but as the stable identifier without the version number. The remaining columns have the gene symbol, NCBI Gene ID and gene description.

You want to bring information from the GeneInfo file into the ExpressionData file, but at the moment they do not share the same identifiers. To correct this, you will use a text-related function called LEFT to change the GeneIDs in the ExpressionData file to match those in the GeneInfo file.

1. Insert a column to the left of the GeneID column in the ExpressionData file.
2. In cell A2, type = and select the LEFT function.
3. Select cell B2 for the text box in the Formula Builder dialog box.
4. Tab to the num_chars box and type in 15.
5. This should return the ENSG identifier up to, but not including, the period: the stable ID without its version number.
6. Select from the newly generated ID in A2 down to the end of the column, then type Ctrl-D to copy the function down the rest of the column.
7. Then Edit->Copy the newly generated IDs and use Edit->Paste->Special->Values to replace the formula with values.
8. Now you can use the two files in the next section to bring the data from GeneInfo into the ExpressionData file.

Tutorial 3: Using Excel to compare lists of data

A very common problem in bioinformatics, or information processing of any kind, is having multiple lists of data that you want to compare to each other. Excel has a function called VLOOKUP that makes this easy to do. It is also useful for transferring data from one worksheet to another.

For this part of the tutorial, you will use the GeneInfo file and your modified ExpressionData file from the previous section. You can delete the column from the ExpressionData file that had the GeneIDs with the version number in them. In this part of the tutorial, you will bring the Gene Name and NCBI Gene ID into the ExpressionData file.
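Because the lookup below depends on the trimmed identifiers from Tutorial 2, it may help to see what LEFT with num_chars set to 15 does to a versioned ID. The following Python sketch is an illustration under stated assumptions: the helper names left and strip_version are hypothetical, and the IDs are made up in the Ensembl gene format (ENSG plus 11 digits, then a period and a version number). The second helper is an alternative, not what the tutorial uses; it cuts at the period rather than at a fixed length, so it would also handle identifiers such as NM_0012345.3.

```python
def left(text, num_chars):
    """Mimic Excel's LEFT: the first num_chars characters of text."""
    return text[:num_chars]


def strip_version(identifier):
    """Drop everything from the first '.' onward, whatever the ID length."""
    return identifier.split(".", 1)[0]


# Made-up versioned IDs in the Ensembl gene format; not from ExpressionData.xlsx.
versioned = [f"ENSG{n:011d}.{v}" for n, v in [(1, 4), (2, 11)]]

for vid in versioned:
    # Both approaches should agree for 15-character stable IDs.
    print(vid, "->", left(vid, 15), "|", strip_version(vid))
```

The fixed-length approach works here only because every stable ID has the same length; for identifiers of varying length, cutting at the period is the safer choice.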
Last updated: May, 2017

o Open both worksheets in Excel.
o In the ExpressionData file, insert a column between columns 1 and 2.
o In the second row of column 2 (cell B2), type an = sign. Then go to the drop-down menu in the upper left of the worksheet, find the function VLOOKUP and select it. If you do not see VLOOKUP on the main menu, scroll down to More Functions, which opens a dialog box with all of the available Excel functions. Under Lookup and Reference you will find VLOOKUP.

Figure 1: Inserting a VLOOKUP function into column 2 of the ExpressionData worksheet.

o Once you've inserted the function, you must fill out the arguments for the function using the dialog box that opens up. Select cell A2 as the lookup value.
o Then click into the box Table_array. Go up to the Window menu and select GeneInfor_ExcelTutorial.xlsx as shown in Figure 2.

Figure 2: Selecting the second worksheet as the table_array in the VLOOKUP function.

o This will activate GeneInfo.xlsx.
o Select the first 2 columns of GeneInfo.xlsx.
o Tab or click on the box Col_index_num. This tells the function which column of data to bring over to the first worksheet. Type in a 2.
o In the final box, Range_lookup, type false. If the value in A2 of the ExpressionData worksheet is found in the first column of the selected GeneInfo range, then the value from column 2 of the matching GeneInfo row will be entered into cell B2 of ExpressionData. If no match is found, the cell will show #N/A.
o To fill in the rest of the column, select from cell B2 through the end of the data and, under the Edit menu, select Fill Down or use the keyboard shortcut Ctrl+D.

Figure 3: Filling in the rest of the column with the same function.

When you are done, your ExpressionData worksheet should look like that shown in Figure 4.

Figure 4: GeneExpression worksheet after completing VLOOKUP
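The lookup steps above can be sketched in Python to make the exact-match behaviour concrete. This is a simplified illustration, not Excel's implementation: the function name vlookup, the NA constant and all sample rows are made up, and the comparison here is case-sensitive, which may differ from Excel's text matching.

```python
NA = "#N/A"


def vlookup(lookup_value, table, col_index_num):
    """Exact-match lookup (Range_lookup = FALSE) on the first column of table."""
    for row in table:
        if row[0] == lookup_value:
            # col_index_num is 1-based, as in Excel
            return row[col_index_num - 1]
    return NA  # no row matched


# Made-up rows standing in for the first two columns of GeneInfo.
gene_info = [
    ("ENSG_EXAMPLE_1", "SYMBOL1"),
    ("ENSG_EXAMPLE_2", "SYMBOL2"),
]

# Made-up IDs standing in for column A of ExpressionData.
expression_ids = ["ENSG_EXAMPLE_2", "ENSG_EXAMPLE_9", "ENSG_EXAMPLE_1"]

# Equivalent of filling the formula down column B.
column_b = [vlookup(gene_id, gene_info, 2) for gene_id in expression_ids]
print(column_b)
```

Note that the row order of the two lists does not need to agree: each ID is searched for in the whole first column of the table, which is why the second sample ID, absent from gene_info, is the only one that comes back as #N/A.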
At this point, the data in column 2 is still linked to the GeneInfo worksheet. You can see this if you click on one of the gene names and look at what is displayed in the formula bar (the text box at the top of the sheet). You do not want to leave your file like that; otherwise, every time you open it, Excel will run the lookup function again. To avoid this, select the entire column, copy it, then choose Edit->Paste Special and select Values in the Paste Special dialog box. This will replace the function with the value of the function. After you complete that, click on a gene name. You should see just the gene name displayed in the text box at the top.

Figure 5: GeneExpression worksheet after copying and Paste Special with values

To bring in the NCBI Gene ID, insert another column in the ExpressionData worksheet and repeat the VLOOKUP process. This time, select the first 3 columns of GeneInfo as the table array and type 3 for Col_index_num, so that column 3 of GeneInfo is brought in rather than column 2.