R Does Pivot Tables and More

Jim Holtman
jholtman@gmail.com

There were several papers at CMG 2008, and previous conferences, that got me thinking about other ways that R can help with the analysis and visualization of performance data. A couple of sessions made use of pivot tables in Excel to help analyze data, and another paper referenced sparklines as a method of visualizing data. This paper will show how R can be used to do these, and other, procedures that will enhance your ability to analyze performance data.

1 Overview

At CMG many of the papers describe how various performance metrics about a system can be analyzed. There are a number of different ways that this data is collected (proprietary vendor code, open source, user-written scripts, etc.). Once the data is collected, there is again a variety of vendor, open source and user-written procedures to process it. Many of these are very flexible in providing a user with ways of customizing the subset of data to be analyzed, the algorithms used to analyze it, and the format for presenting the results. I have used many of these tools in the past and still rely on them.

Like most practitioners of computer performance analysis, I have my own tool chest of things that make my life easier. These include Perl for preprocessing/formatting unstructured data from log files, standard text editors for examining/changing data, Excel for quick looks at the data and for communicating results to others who are used to working with Excel, and of course R, which is my favorite because of its versatility for analysis and graphical presentation of results.

R is an open source language and environment for statistical processing. It is based on the S language, originally developed at Bell Labs by John Chambers, who won the 1998 ACM Software System Award for the language.
R easily handles data files with millions of records (e.g., transaction response times) and can, for example, compute the average response time and create a histogram of the response times in a couple of seconds. The graphics available in R for data visualization are very rich and flexible. Being able to slice and dice your data and then visualize it in various ways allows you to quickly see patterns that numbers in a table will not reveal. R is well supported through an active users group, and there are over 85 books available covering the areas in which R has been used. I have used it for the last 25 years for doing computer performance analysis. To quickly find R on the internet, just type "R" into Google and it will be the first hit. The links will provide an overview of R. There is a learning curve, but it is well worth the effort if you are serious about performance analysis. The presentation slides include a 10-minute R workshop which provides an overview of R.

2 Pivot Tables

John Van Wagenen's paper Pivot Tables/Charts: Magic Beans Without Living in a Fairy Tale at CMG 2008 gave a very good overview of how pivot tables can help in analyzing, and visualizing, the data that a performance analyst typically works with. Pivot tables allow an analyst to slice and dice the data in various ways, and to create aggregations of the data by various classifications. Pivot tables are typically associated with Excel, but the same information can be constructed by a variety of packages. For example, SQL statements can be used to group the data by various criteria and then summarize the results. Most of the vendor-supplied packages have similar capabilities.

Figure 1 - Sample 15 Minute Data From Excel

John gave his permission to use the data from his paper so that I can illustrate that the results are similar when using R. The spreadsheet that he shared with
me had some different data, but it did have the pivot tables generated from this data.

2.1 Pivot Tables

The first example is from 15-minute data that was collected on system utilization. Figure 1 is a sample of the first entries in the Excel spreadsheet. To read this data into R, I converted the spreadsheet to a CSV file. R can read directly from Excel spreadsheets, but it is easier to illustrate the processing if we assume the data is in a file, since that is probably where most data is located. The resulting CSV is shown in Figure 2.

DAY,HOUR,MIN,SEC,MACHINE,LPAR,PHY_TOT,MIPS,CPS,CPU_HOUR,TYPE
6/2/2008,0,30,1,713,*PHYSI,3.26,176.5907684,13,0.4238,PROD
6/2/2008,0,30,1,713,AAMTBC,0,0,13,0,TEST
6/2/2008,0,30,1,713,BBMTBC,0,0,13,0,TEST
6/2/2008,0,30,1,713,GGMTBC,0,0,13,0,TEST
6/2/2008,0,30,1,713,QA,0.63,34.12643684,13,0.0819,TEST
6/2/2008,0,30,1,713,QB,1.32,71.50301053,13,0.1716,TEST
6/2/2008,0,30,1,713,QD,0.44,23.83433684,13,0.0572,TEST
6/2/2008,0,30,1,713,SOLAR1,0.33,17.87575263,13,0.0429,PROD

Figure 2 - CSV File for Input to R

In Excel a pivot table was created summarizing the CPU_HOUR over each DAY, HOUR and MIN, and generating totals on each of the breaks. The Excel pivot table is shown in Figure 3. You can read John's paper to see how to set up the pivot table from the given input.

Figure 3 - Pivot Table From Excel

The data is read into an R object (cpu.15), which is a dataframe. In R, a dataframe is very similar to an Excel spreadsheet in that it looks like a table where each of the columns can have a different attribute (e.g., character, numeric, etc.), and it is easy to reference the data items individually or as a vector representing the entire column. Part of the power of R comes from the vectorized operations that make it easy to define transformations on the data. The contents of the dataframe are shown in Figure 5; notice that it looks very similar to the Excel spreadsheet in Figure 1.
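As a small, self-contained sketch of reading such a file (the two-row sample below is hypothetical, shaped like the Figure 2 data, and is written to a temporary file so the example runs on its own):

```r
# Write a two-row sample of Figure 2-style data to a temporary file,
# then read it back; read.csv produces a dataframe.
csv.text <- c(
  "DAY,HOUR,MIN,SEC,MACHINE,LPAR,PHY_TOT,MIPS,CPS,CPU_HOUR,TYPE",
  "6/2/2008,0,30,1,713,QA,0.63,34.12643684,13,0.0819,TEST",
  "6/2/2008,0,30,1,713,QB,1.32,71.50301053,13,0.1716,TEST"
)
csv.file <- tempfile(fileext = ".csv")
writeLines(csv.text, csv.file)

cpu.15 <- read.csv(csv.file)     # defaults: sep = ",", header = TRUE
# No header line?  read.csv(csv.file, header = FALSE), then set names(cpu.15)
# Tab or semicolon separators: read.csv(csv.file, sep = "\t") or sep = ";"

# Columns are vectors, so operations apply element-by-element (vectorized)
cpu.15$CPU_MIN <- cpu.15$CPU_HOUR * 60   # derived column, no loop required
str(cpu.15)                              # column names and types at a glance
```

The `tempfile()` dance is only there to make the sketch self-contained; in practice you would point `read.csv` at your own CSV file.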
To create a similar output in R, the script shown in Figure 4 is used. The first statement (read.csv) calls a function that reads a CSV (comma-separated values) file. The default parameters assume that the separator is a comma and that there is a header line in the file defining the names of the columns when the data is read in. If your data file does not have a header line, the parameter header=FALSE tells the function to start reading the data at line 1; you can then assign names to the columns as you desire. If you have another separator, like a tab or semicolon, it can be specified.

As in any programming environment, there are a number of ways of getting similar results. In R there are a number of functions (apply, aggregate, tapply, etc.) that can summarize data in a pivot-table-like format. R also has a number of packages (similar to modules in Perl, classes in Java, or libraries in C/C++) which encapsulate useful functions that minimize the amount of code that has to be written. Several of these packages make it easy to transform data, aggregate it and then summarize the results. One package that I have found very useful is reshape, which lets you restructure and aggregate your data.

Figure 4 - R Commands to Create the Pivot Table
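The actual reshape calls appear in Figure 4; as a hedged base-R illustration of the same DAY + HOUR ~ MIN aggregation, tapply can sum CPU_HOUR over row and column groupings (the four-row dataframe below is made up for the example):

```r
# Hypothetical four-row fragment shaped like the Figure 2 data
cpu.15 <- data.frame(
  DAY      = rep("6/2/2008", 4),
  HOUR     = c(0, 0, 1, 1),
  MIN      = c(30, 45, 30, 45),
  CPU_HOUR = c(0.42, 0.17, 0.35, 0.28)
)

# Rows = DAY + HOUR, columns = MIN, each cell = sum of CPU_HOUR
pivot <- tapply(cpu.15$CPU_HOUR,
                list(paste(cpu.15$DAY, cpu.15$HOUR), cpu.15$MIN),
                sum)
print(pivot)

# The reshape version sketched in the paper would look like:
#   library(reshape)
#   cpu.melt <- melt(cpu.15, id = c("DAY", "HOUR", "MIN"),
#                    measure = "CPU_HOUR")
#   cast(cpu.melt, DAY + HOUR ~ MIN, sum, margins = TRUE)
```

tapply gives the cross-tabulated sums directly; reshape's melt/cast adds the margin (total) rows and columns that match the Excel pivot table.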
The reshape package does this with just two functions: melt and cast. melt puts the data into a format that can be used by cast to create new aggregations of the data. Documentation provided with the package gives plenty of examples of how to use it.

So in the script, I indicate that I want to use the package [require(reshape)], and then I melt the dataframe that was read in, specifying that I intend to use three of the columns (DAY, HOUR, MIN) to aggregate the data and that the value I want to aggregate is CPU_HOUR. Now that the data has been melted, it can be cast into some output. The cast function has as its first parameter the object (cpu.melt) from the melt, followed by a formula specifying how the data is to be aggregated. The formula DAY + HOUR ~ MIN indicates that the rows will contain DAY and HOUR, and that the columns will contain MIN. The data will be aggregated over these variables, and the sum will be computed and stored in the resulting dataframe. There is also a parameter to indicate that margins are to be created; margins produce row and column totals on the control breaks, which in this case is DAY. The first 25 lines of the output are shown in Figure 4. Comparing this output with Figure 3 shows that the results are the same; only the layout of the data is different. The last command just creates a pivot table summarizing the CPU_HOUR per day. The data file had over 10,000 lines of data; it took 1 second to read the data in and create the two pivot table outputs. The script can be reused to read in any number of data files.

Figure 5 - Dataframe in R Created from the CSV File (Looks Like an Excel Spreadsheet)
Figure 7 - Pie Chart of Shift Usage
Figure 8 - Pivot Table of Shift Usage

2.2 Pivot Charts

Another use of the output from a pivot table is to generate a chart. John had a data file about batch jobs being run. A sample of the contents of the Excel spreadsheet is shown in Figure 6.

Figure 6 - Batch Data From Excel
This data was summarized by shift, and the pie chart in Figure 7 was created; the pivot table for this chart is shown in Figure 8. Figure 9 shows the R script used to read in the CSV file created from the Excel spreadsheet, summarize the CPU hours by shift and then create the pie chart in Figure 11. This used another R function (tapply) to create the aggregation by shift. As I mentioned previously, there are a number of ways of doing things in R. I did notice one difference in the data: John's pivot table filtered out HOLIDAY since there was such a small usage. I chose to leave it in, but could have easily removed it from the data. This file had about 24,000 data lines; it took 0.5 seconds to read in the data, aggregate it and generate the pie chart.

The final example makes use of some implied information in the data. In the spreadsheet, the column DB2 had a name such that if the 3rd character was a 'P', then it
was production (PROD); otherwise it was development (DEV). So when the data was read in, a new column was added with this indication so that the pivot table could be generated. Figure 10, Figure 12 and Figure 13 show the data in the Excel spreadsheet and the pivot table and chart created from that data. Figure 14 is the R script to read in the data, create the new workload column, create the pivot table and then generate the chart in Figure 15. This data had only 96 rows, and it took 0.2 seconds to read in the data, do the transformations, generate the pivot table and create the chart.

Figure 9 - R Script for Shift Usage
Figure 10 - Excel Data for Prod/Dev Pivot Table
Figure 11 - Pie Chart from R
Figure 12 - Excel Pivot Table from Data
Figure 13 - Chart from Excel Pivot Table

3 Sparklines

In Ron Kaminski's paper Automating Process Pathology Detection Rule Engine Design Hints, he described sparklines as one way of presenting a lot of data in a small amount of space. Basically, sparklines are graphs without the axes that clutter up the presentation of information. Sparklines were invented by Edward Tufte, a well-known expert on data visualization. Figure 16 is an example of sparklines showing the price of 4 stocks over a 5-year period. You can see that they have roughly the same shape, even though the y-axis has different ranges. Numbers provide the extent of these ranges and identify other important points. I have used multiple graphs on a page to show the relationships between various measurements, but typically I was limited to displaying around 15 charts, with all the extra space being taken up
by the labeling of the axes. Figure 25 shows the amount of space that is taken up by labeling the axes and such; it also makes it hard to compare different graphs to look for patterns. With R it is easy to generate sparklines because you have complete control over how graphics are created. R has some very sophisticated graphics, but I will use just the basic graphics to show how sparklines can be created.

Figure 14 - R Script to Create Pivot Table and Chart
Figure 15 - Chart Generated from R
Figure 16 - Example of Sparklines
Figure 17 - Example of vmstat Log File

The only difference between creating a set of charts like Figure 25 and sparklines is telling the system not to create the axes and to plot the data in a smaller window. The charts in Figure 25 were created from running the vmstat command on a UNIX system. vmstat will record about 20 different measurements, including CPU utilization, memory and the number of running processes. Similar data will be used to demonstrate sparklines. One of the scripts that I have running on the systems I monitor writes the vmstat data to a file with a timestamp. This data is then read by the analysis programs, and reports and charts are created. An example of the log file is shown in Figure 17. This data is read in and results in a matrix with each row being a sample and the columns being the data for that sample. Figure 18 shows the amount of R code that was written to create a plot of sparklines with nr rows and nc columns on a single page. Figure 26 shows the sparklines that were generated. This represents one day of system operation (00:00-24:00). On the left side of each sparkline is the name of the measurement being plotted. This is followed by its average value over the day.
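A minimal sketch of the idea, using only base graphics on synthetic data (not the paper's vmstat log): suppress the axes, stack thin panels with par(mfrow), and mark the average, maximum and minimum:

```r
# Fake one day of 5-minute samples for three made-up metrics
set.seed(42)
n <- 288
cpu  <- pmin(100, pmax(0, 60 * exp(-((1:n) - 60)^2 / 5000) + rnorm(n, 10, 4)))
mem  <- 40 + cumsum(rnorm(n, 0, 0.5))
runq <- rpois(n, 2)
metrics <- list(cpu = cpu, mem = mem, runq = runq)

op <- par(mfrow = c(length(metrics), 1),   # one thin panel per metric
          mar = c(0.2, 8, 0.2, 1))         # shrink margins to kill whitespace
for (nm in names(metrics)) {
  y <- metrics[[nm]]
  plot(y, type = "l", axes = FALSE, xlab = "", ylab = "")  # no axis clutter
  abline(h = mean(y), col = "gray")                     # average reference
  points(which.max(y), max(y), col = "red", pch = 19)   # first maximum
  points(which.min(y), min(y), col = "green", pch = 19) # first minimum
  mtext(sprintf("%s  avg %.1f", nm, mean(y)), side = 2, las = 1, cex = 0.8)
}
par(op)
```

With the margins and axes gone, dozens of these panels fit on one page, which is the whole point of sparklines.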
The average value is represented by the horizontal gray line, which can be used as a reference for the variation of the data. The red number on the left above the gray line is the maximum value; the green number on the right below the gray line is the minimum value for the day. This allows you to quickly see some of the relationships. There is also a red dot to mark the first maximum and a green dot to mark the first minimum of the sample. The easiest relationship to point out is in the last two lines on the chart: the idle time and the user + system time. As you can see, these are mirror images of each other, which is what you would expect from the data. Even without the time being explicit, since we know that this represents a 24-hour day, we can see that the first third of the day appears to be the busiest, with the overall activity in the rest of the day being low. For this system, that is what happens; it processes the performance data from a number of systems by downloading log files and then processing the data so that it is
ready by 07:00 for review to see how the system performed the previous day.

Figure 18 - R Function to Plot Each Column as a Sparkline With nr Rows and nc Columns

Figure 27 is from a CMG 2004 paper I wrote and is a levelplot of the system utilization for a month. It uses color to show what would be the z-axis value (utilization) if this were a 3D graph. The data used to create the sparklines is from 5/16/05, so you should be able to compare the utilization (user + sys) of the sparklines with the levelplot. I also added to the plot the set of sparklines for the same period. Do they both convey the same information to you? In Figure 28 I took the month's worth of sparklines and replicated them 12 times to show what a year's worth of utilization might look like. Wouldn't it be nice to have a page like this for each of your systems so that you could look for patterns? You could also line up the plot so that a given day of the week forms a row, letting you see the pattern for that day across the year. If you really like 3D plots, R can generate those also. The rgl package will create a 3D plot that you can rotate with a mouse to see different views. Figure 29, Figure 30 and Figure 31 show the interactive 3D graphs that can be created with R.

Figure 19 - Transaction Count for User/Tran
Figure 20 - Stacked Bar Chart of the Transaction Count by User

4 Transaction Data

I want to use some transaction data to show another way of visualizing the data from a pivot table. I originally had a transaction log of 79,000 transactions: 159 transaction types across 300 users. To make the data easier to present, I created 10 transaction types by splitting the transactions based on their response times (Trans.01 has the shortest average response time and
Trans.10 has the longest). The users were just split into 10 groups randomly. The log file has the user, transaction, start and end times. The file was read in with an R script, and the pivot table in Figure 19 was created. If you look at the data, User.06 has the smallest transaction count and User.08 the largest.

One way of visualizing this information is a stacked bar chart, as shown in Figure 20. Here it is easy to see that User.08 entered the most transactions and User.06 the least. But it is hard to determine, for each user, the ratios between the individual transactions for that user. This is where a mosaic plot helps to visualize the relationship. In a mosaic plot, the values are plotted as rectangles and the area of each rectangle is proportional to the count. The vertical axis is the same for all variables so that you can see the relationships of the transaction counts for a user. Figure 21 is the mosaic plot of the pivot table data. You can see on the chart that User.08 has the widest vertical area, indicating that this user has the highest total transaction count; User.06 has the least area, indicating the lowest transaction count. In this view of the data, you can see that User.06 has a higher percentage of transactions Tran.06, Tran.09 and Tran.10 than User.08. This might indicate that these two users have different roles and therefore execute different transaction mixes. A mosaic plot can help identify this condition.

Figure 24 shows the ratios of the average response times of the transactions for each user. Here you can see that Trans.10 appears to have an average response time that is almost equal to the sum of the response times of the other 9 transactions.
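As a sketch, base R's mosaicplot (and barplot for the stacked bars) can be driven directly from a count table; the 3x3 table below is hypothetical, standing in for the 10x10 pivot table of Figure 19:

```r
# Hypothetical 3x3 count table standing in for the 10x10 pivot of Figure 19
counts <- matrix(c(40, 10,  5,
                   12, 30,  8,
                    6,  9, 25),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(User = c("User.01", "User.02", "User.03"),
                                 Tran = c("Trans.01", "Trans.02", "Trans.03")))

# Each column of rectangles is one user; rectangle area is proportional
# to that user/transaction count, so transaction mixes are easy to compare.
mosaicplot(counts, main = "Transactions by User", color = TRUE)

# A stacked bar chart comes from the same table
barplot(t(counts), legend.text = TRUE, main = "Transaction Count by User")
```

The same table feeds both plots; the bar chart emphasizes totals per user, while the mosaic plot emphasizes each user's mix.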
Again, based on how I partitioned the transactions, Trans.10 should have the longest response time, but even across some of the users there is quite a bit of variation. Remember that this chart does not show the value of the average response time of a transaction for a user, just the ratio of its response time compared to the other transactions executed by that user.

Figure 21 - Mosaic Plot of Transaction Counts for a User
Figure 22 - Average Response Time of Transactions for Each User

Figure 22 shows the average transaction response time for each user. Even though Trans.10 has the longest response time, it is executed relatively less frequently than most of the other transactions, as you can see in Figure 21. Figure 23 is a graph of sparklines of the distribution of the average response times of the transactions for a given user. Think of each as a histogram drawn with a smooth line. The x-axis is 0-3 seconds for the response times. In the data, there was a maximum of 879 seconds for one transaction (I am not sure the user really waited for a response in this case); the 95th percentile was 1.7 seconds, so I chose 3 seconds for the chart since this encompassed over 95% of all the transactions. In most cases, SLAs (service level agreements) are based on XX% of the response times being less than a given number; systems I have worked on in the past had this number at 90%/95%. Looking at the data, it appears that User.06 and User.07 have larger tails on the right side, indicating that they are experiencing longer average response times. The pivot table in Figure 22 shows that these
users do have the longest average response times across all their transactions. These users might have different roles, and therefore execute a different mix of transactions, some of which have longer response times. It is this type of analysis that leads to a better understanding of your environment.

Figure 23 - Sparklines of the Density (Histogram) Plot of Response Times for a User
Figure 24 - Ratios of Average Response Time of Transactions for a User (Mosaic Plot of Response Times - Area Proportional to Time)

5 Wrap-Up

Hopefully I have given you some examples of other things that R can do, and hopefully they will whet your appetite to learn more about R. R should be considered one of the tools that you have in your toolkit. In my current engagement, I use R for most of the analysis that I do, but I still make extensive use of Excel. Excel happens to be the preferred way of interchanging data among the other people on the projects. They will give me data in an Excel spreadsheet that I can use as input. When I generate output, in many cases I will transfer the results to an Excel spreadsheet (R can write Excel workbooks with multiple sheets), since that allows the recipient to do further manipulation of the data, or to include the data in Word documents or PowerPoint presentations. The R scripts, and data, used in this paper are available if you send me an email requesting them.

6 References

[1] J. Van Wagenen, Pivot Tables/Charts: Magic Beans Without Living in a Fairy Tale, CMG 2008
[2] R. Kaminski, Automating Process Pathology Detection Rule Engine Design Hints, CMG 2008
[3] R Development Core Team, R: A Language and Environment for Statistical Computing, ISBN 3-900051-07-0, http://www.r-project.org
[4] J. Holtman, Using R for System Performance Analysis, CMG 2004
[5] J. Holtman, Visualization Techniques for Analyzing Patterns in System Performance Data, CMG 2005
[6] N. J. Gunther, Guerrilla Capacity Planning, Springer-Verlag, Heidelberg, Germany, 2007
[7] H. Wickham, Reshaping Data with the reshape Package, Journal of Statistical Software, 21(12), 2007
[8] W. N. Venables and B. D. Ripley, Modern Applied Statistics with S, Fourth Edition, Springer, 2002, ISBN 0-387-95458-0
[9] E. Tufte, Beautiful Evidence, Graphics Press, 2006
[10] P. Spector, Data Manipulation with R (Use R!), Springer, 2009, ISBN 978-0387747309
Figure 25 - Typical Multiplots Per Page - Data from 5/16/05
Figure 26 - Sparklines Created from 'vmstat' Log File: 19 Different Measurements for 5/16/05 (red is max; green is min)
Figure 27 - Levelplot (3D on 2D Surface) of System Utilization for a Month + Equivalent Sparklines
Figure 28 - What One Year of System Utilization Might Look Like in Sparklines
Figure 29 - 3D Chart of the Utilization Data
Figure 30 - Another View of the Same Data
Figure 31 - Yet Another View from Underneath