Data Visualisation with SAS/INSIGHT Software. Gerhard Held, SAS Institute.

Summary

Interactive data analysis packages have recently become popular. These packages attempt to visualise the structure in multivariate data and offer a more intuitive approach to data analysis. The software is usually offered as stand-alone tools for data visualisation (or graphical data analysis). However, it is desirable to extend the data visualisation approach beyond simple but efficient techniques for describing data, such as identifying outliers or brushing points in a plot. In this paper we demonstrate how the data visualisation techniques implemented in SAS/INSIGHT software can be applied to analyse linear relationships in data, including general linear models. We also discuss how intermediate results can be saved for further processing.

Introduction

Since the early 1970s we have seen a revival of the exploratory data analysis tradition in statistics (Tukey, 1970). The first attempts to implement this approach in computer programs were made in the early 1980s (Velleman & Hoaglin, 1981). One goal of exploratory data analysis implementations on a computer was to enable individual users to see structure in multivariate data. This gave rise to a new class of interactive data analysis packages, such as MACSPIN (Donoho, Donoho & Gasko, 1986) or Data Desk, pioneered by Paul Velleman, on "personal" platforms such as the Apple Macintosh. A major attraction of these packages was that they enabled the application of dynamic graphical methods to data, defined as the "direct manipulation of elements of a graph on a computer screen" (Cleveland & McGill, 1988). As the aim of dynamic graphical methods is to visualise the structure of multivariate data, "data visualisation" is often used as a synonym.

One drawback of these early implementations was that they focused on the data visualisation aspect without integrating this functionality into a wider spectrum of data analysis techniques. As early as 1980 John Tukey argued that "we need both the exploratory and confirmatory (analysis)" (Tukey, 1980). Since then, more general data analysis packages, such as S-PLUS or the SAS System, have included data visualisation tools as a speciality.

In this paper we will: introduce SAS/INSIGHT software as an example of the data visualisation packages available today; discuss the relative merits of data visualisation as opposed to traditional data analysis methods; and point out some new functionality within SAS/INSIGHT software which enables tighter integration of data visualisation and traditional methods.

Graphical Exploration

SAS/INSIGHT software, an integrated component of the SAS System, is a dynamic tool for data exploration and analysis. With it you can explore data through interactive histograms, box plots, scatter plots, and 3D rotating plots. You can examine correlations and principal components to find the structure of your data. Finally, you can construct predictive models based on relationships in the data. All interactive graphs and analyses are linked across multiple windows. Any change in one window is immediately reflected in all windows related to the same data.

SAS/INSIGHT software is implemented on a large variety of hardware platforms: many popular UNIX workstations, Digital Equipment minicomputers and DECstations running VMS, IBM mainframes (running MVS, CMS, and VSE), and, with Release 6.08 of the SAS System, also on workstations running Windows 3.1 and OS/2 2.0. A powerful workstation is ideal for graphical data visualisation. UNIX workstations, or 486-based PCs running OS/2 or Windows, are recommended. As SAS/INSIGHT software allows many views of the same data set, large monitors (15" or larger) are also recommended.

The discussions in this paper are based on a data set containing statistics about 407 large commercial companies in 1991 (in terms of sales). Variables include the Company Name, Nationality, Industry Type, Number of Employees, and Sales, Profits, Assets and Equity in millions of U.S. dollars.

As SAS/INSIGHT software is an integrated component, it can be invoked from within the SAS System by simply typing in its name (INSIGHT) or by using pull-down menus. SAS/INSIGHT then prompts the user for the data set. We select the business data set (COMP91). The user then sees a data window, in which the data set is presented as a table (rows are companies and columns are variables; see Figure 1). In a data window you can sort, edit, and extract subsets of your data. You can also assign measurement levels and default roles that determine how your variables are used in graphs and analyses. As we would like to identify data points as companies, we assign the variable COMPANY to be a LABEL using the DATA menu. Click on COMPANY, select DATA, then PROPERTIES and finally LABEL (sequence: DATA : PROPERTIES : LABEL).

Previous users of SAS/INSIGHT software will notice at this point that the pull-down menus have been regrouped in a more logical structure. All analysis and graphical functions have been combined in the ANALYSIS menu; a new DATA menu covers all data manipulation functions.

Figure 1: Data table of the COMP91 data set

We would like to explore measures of economic success for the selected companies (PROFITS, SALES) as well as find factors which determine economic success. Release 6.08 of SAS/INSIGHT software offers box plots as an additional way to explore distributions graphically. Selecting BOX PLOT, PROFITS as Y (graph variable), and INDUSTRY as group variable creates the side-by-side box plot shown in Figure 2. Note that group processing (available with all analyses and graphics) has also been added with Release 6.08. In Figure 2 we have already clicked on a few data points (companies) with extreme values for PROFITS. Notice that IBM is one of the least profitable companies in 1991 (a loss of 2.827 billion dollars), whereas in 1990 it was still one of the most profitable companies (a profit of 6.020 billion dollars)!

Figure 2: Box plot of profits for industry groups

The labeled extreme values somewhat distort other relationships, so we click or drag with the mouse over the extreme values, thus creating a rectangular brush, and select EDIT : OBSERVATIONS (RECORDS) : HIDE IN GRAPHS. This removes the extreme values from the graph (not from any calculations!) and realigns the graph (Figure 3).

In addition we click on INDUSTRY and activate the new MARKERS window (EDIT : WINDOWS : MARKERS). We could assign an individual marker to each INDUSTRY or click on the "multiple markers" button at the bottom of the MARKERS window. This automatically assigns a different marker to each value of INDUSTRY. In the same way, colours can also be assigned individually or automatically (a new feature in Release 6.08 of SAS/INSIGHT software). The markers are now an "observation state", i.e. companies retain their marker for any subsequent graphs unless it is changed.

Figure 3 now shows more clearly that some industries are consistently profitable (e.g. the Pharmaceutical industry), others show a large internal variation (e.g. the Computer and Oil Refining industries), and still others feature quite a number of outliers (e.g. the Automobiles, Food, and Electronics industries).
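The distinction between hiding observations in a graph and excluding them from calculations can be illustrated outside SAS/INSIGHT. The following Python sketch is not SAS/INSIGHT code: the profit figures are invented, the names `profits` and `box_stats` are hypothetical, and the 1.5 x IQR rule is assumed here only as a stand-in for picking extreme points by brushing. It computes box plot statistics per industry and then drops the flagged points from the displayed values only, while the mean is still taken over all observations.

```python
from statistics import mean, quantiles

# Synthetic, made-up profits in millions of dollars (NOT the COMP91 data).
profits = {
    "Pharmaceuticals": [310, 420, 450, 480, 530, 610],
    "Computers": [-2800, -150, 40, 120, 300, 700],
}


def box_stats(values):
    """Return quartiles and the points beyond the 1.5 x IQR fences."""
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [v for v in values if v < low_fence or v > high_fence]
    return {"q1": q1, "median": median, "q3": q3, "outliers": outliers}


for industry, values in profits.items():
    stats = box_stats(values)
    # "Hide in graphs": only the displayed values change ...
    shown = [v for v in values if v not in stats["outliers"]]
    # ... while summary calculations still use every observation.
    print(industry, "shown in graph:", shown,
          "| mean over all observations:", round(mean(values), 1))
```

In this sketch the hidden points are chosen by a rule, whereas in the paper they are chosen interactively with the mouse; the point is only that the display and the calculations use different sets of observations.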

Figure 3: Box plot of profits for industry groups (outliers removed)

Figure 4: Scatter plot matrix of EMPLOYS, SALES, and ASSETS

To explore the other measure of economic success, SALES, we take a different approach. We suspect that the number of employees (EMPLOYS) and ASSETS may correlate with SALES. Therefore we click on all three variables in the data table and then select ANALYSE : SCATTER PLOT (X Y), which creates Figure 4.

Figure 4 is again distorted by some very big companies (e.g. Toyota Motor, high in SALES; General Electric, high in ASSETS and EMPLOYS), and the variation also seems to increase with increasing values of SALES, ASSETS and EMPLOYS. As this is an overall pattern, we opt for a data transformation rather than hiding extreme values again. To do that we simply click on each of the variables in the graph and select EDIT : VARIABLES : LOG(X). This calculates the logarithm of each variable and adjusts the scatter plot matrix accordingly. The new transformed variables are named L_SALES, L_ASSETS, and L_EMPLOY.

Figure 5: Scatter plot of L_EMPLOY by L_SALES

For Figure 5 we have selected the "MAGNIFYING GLASS" from the TOOLS window (EDIT : WINDOWS : TOOLS) and dragged it over the L_SALES and L_EMPLOY portion of the scatter plot matrix to focus on this part. It shows a clear linear relationship between L_SALES and L_EMPLOY and also reveals that oil companies (upward pointing triangles in Figure 5) consistently generate larger L_SALES than could have been expected based on their L_EMPLOY value.

If needed, we could also explore the data in three dimensions using a rotating plot. One function of the rotating plot is to show multivariate outliers. As text alone cannot adequately convey what a rotation reveals, we will not try to illustrate this here.
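A small numerical sketch may help to show why the log transformation brings out a linear structure. It is written in Python rather than SAS, uses invented numbers instead of the COMP91 data, and assumes that LOG(X) denotes the natural logarithm; `fit_line` and `r_squared` are hypothetical helper names. Data that grow roughly like a power law span several orders of magnitude on the raw scale but fall close to a straight line after both variables are logged.

```python
import math

# Synthetic, made-up (employees, sales) values (NOT the COMP91 data).
employees = [1_000, 5_000, 20_000, 100_000, 400_000]
sales = [200, 900, 3_500, 16_000, 60_000]

# Assumption: the transformation is the natural logarithm.
l_employ = [math.log(v) for v in employees]
l_sales = [math.log(v) for v in sales]


def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def r_squared(x, y):
    """Share of the variation in y explained by the fitted line."""
    intercept, slope = fit_line(x, y)
    mean_y = sum(y) / len(y)
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    sst = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - sse / sst


intercept, slope = fit_line(l_employ, l_sales)
print(f"slope on the log scale: {slope:.3f}")
print(f"R-Square on the log scale: {r_squared(l_employ, l_sales):.4f}")
```

On the log scale the slope can be read as an elasticity: a slope close to 1 would mean that sales grow roughly in proportion to the number of employees.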

Previous results suggest that we should take a separate look at the influence of INDUSTRY type on this relationship. A simple graphical way to do this is to generate a series of scatter plots of L_SALES by L_EMPLOY grouped by INDUSTRY. Figure 6 shows two of the industries as an example, Oil Refining and Pharmaceuticals. The slope of a regression line of L_SALES on L_EMPLOY would evidently be very similar for both industries, but the intercept for Pharmaceuticals would be negative, meaning that the Pharmaceutical industry would require many more employees to reach the same level of sales.

Figure 6: Scatter plot of L_EMPLOY by L_SALES for oil and pharmaceutical industries

Model Formulation

We now have enough evidence to formulate a model of sales as a measure of economic success. SAS/INSIGHT supports traditional parametric regression analysis, but also the general linear model as implemented in the GLM procedure of SAS/STAT software. SAS/INSIGHT now also offers an implementation of the generalized linear model, supporting response distributions from the exponential family (normal, inverse Gaussian, gamma, Poisson, binomial) and corresponding link functions (identity, logit, probit, complementary log-log, and power link function). Both the general and the generalized linear model have been introduced in Release 6.08 of SAS/INSIGHT software. In addition, SAS/INSIGHT covers residual plots (residual by predicted, residual normal QQ plot and partial leverage plots) as well as parametric and nonparametric fit curves (splines, kernel estimation).

As generalized linear models are not appropriate for our data, we confine our test to a general linear model. The model we would like to test has L_SALES as the response variable and L_EMPLOY and INDUSTRY as factors. We also include the interaction of both factors in the model. Figure 7 shows part of the results. The R-Square of 0.7169 indicates that about 71.7% of the variation in L_SALES can be explained by the variables in the model. All variables in the model, including the interaction, are significant (see the Prob > F values associated with the F tests).

Figure 7: General Linear Model for L_SALES

It is often necessary to save results for further processing. This is easily done if the aim is to integrate graphics or reports into a text document, as for instance for this paper. A typical problem area, however, is saving statistical tables in a computerised form in order to apply them to new data or to reformat the output. For this purpose SAS/INSIGHT software supports the Output Delivery System (ODS) of the SAS System. This is another new feature in Release 6.08 of the SAS System. Procedures using the ODS produce their results in the form of output objects, data structures in machine precision that persist in memory. With output objects you can create data sets, produce listings, reorganize output, and create and save custom report formats.

The ODS is easily activated. For example, if you mark with the mouse the Summary of Fit, Analysis of Variance and Type III Tests tables of Figure 7 and then select FILE : SAVE : TABLES, the system responds with a Note window indicating that the tables were saved as output objects. Using the OUTPUT procedure of the SAS System, the saved tables can then be accessed and manipulated (SAS Institute, 1992, see Figure 8).

Figure 8: SAS/INSIGHT tables as output objects

Conclusion

We had three goals with this article. Concerning the first goal, we have shown that SAS/INSIGHT software adequately covers dynamic graphical data analysis. All interactive graphs and analyses are linked across multiple windows, and changes in one window are immediately reflected in all windows related to the same data. SAS/INSIGHT offers typical tools for data visualisation, such as identification and labeling of points, use of colours and markers, brushing, scatter plot matrices, and 3D rotating plots. The latest implementation of SAS/INSIGHT software includes additional functionality beyond the standards of data visualisation, such as: interactive box plots, group processing for graphical and statistical analyses, integration of general and generalized linear models, and saving of any results (data, graphics, or output) in a form ready for immediate further processing.

Dynamic graphical methods greatly facilitate exploratory data analysis, using state-of-the-art computer technology. This technology helps the analyst to concentrate on finding structures in multivariate data rather than dealing with the software mechanics of how to code an analysis. Graphical data analysis is a great time saver. All steps discussed in this paper took roughly 15 minutes to accomplish. It is up to the reader to determine how long this would have taken using traditional methods and tools.

A word of caution: data visualisation gives new insights into data but needs traditional hypothesis testing methods as a complement. Therefore, users of these systems are advised to look for integrated software covering both approaches to data analysis.
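The model tested in the Model Formulation section, with L_EMPLOY, INDUSTRY, and their interaction as effects, gives each industry its own intercept and its own slope, so its fitted values coincide with those of a separate straight-line fit per industry. The Python sketch below relies on that equivalence to show how such a model and its R-Square can be computed. It is an illustration under stated assumptions, not the GLM procedure: the rows are invented (not the COMP91 data), and `fit_line`, `fit_interaction_model`, and `r_squared` are hypothetical helper names.

```python
from collections import defaultdict

# Synthetic, made-up rows (industry, l_employ, l_sales); NOT the COMP91 data.
rows = [
    ("Oil", 9.0, 9.1), ("Oil", 10.0, 9.9),
    ("Oil", 11.0, 11.0), ("Oil", 12.0, 11.8),
    ("Pharma", 9.0, 7.2), ("Pharma", 10.0, 8.1),
    ("Pharma", 11.0, 8.9), ("Pharma", 12.0, 10.0),
]


def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def fit_interaction_model(data):
    """One (intercept, slope) pair per industry, i.e. the full interaction model."""
    groups = defaultdict(list)
    for industry, x, y in data:
        groups[industry].append((x, y))
    return {
        industry: fit_line([x for x, _ in pts], [y for _, y in pts])
        for industry, pts in groups.items()
    }


def r_squared(data, coefs):
    """1 - SSE/SST over all rows, using each row's industry-specific line."""
    ys = [y for _, _, y in data]
    mean_y = sum(ys) / len(ys)
    sse = sum((y - (coefs[g][0] + coefs[g][1] * x)) ** 2 for g, x, y in data)
    sst = sum((y - mean_y) ** 2 for y in ys)
    return 1 - sse / sst


coefs = fit_interaction_model(rows)
r2 = r_squared(rows, coefs)
for industry, (intercept, slope) in coefs.items():
    print(f"{industry}: intercept {intercept:.2f}, slope {slope:.2f}")
print(f"R-Square = {r2:.4f}, i.e. {100 * r2:.1f}% of the variation explained")
```

The invented numbers were chosen so that both industries share nearly the same slope while differing in intercept, mirroring the pattern described for Figure 6. The last line also shows how an R-Square value translates directly into a percentage of explained variation.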

References

Cleveland, W.S. & McGill, M.E. (Eds.) (1988). Dynamic Graphics for Statistics. Belmont, CA: Wadsworth Inc.
Donoho, A.W., Donoho, D.L. & Gasko, M. (1986). MACSPIN Graphical Data Analysis Software. Austin, Texas: D2 Software.
SAS Institute Inc. (1993). SAS/INSIGHT User's Guide, Version 6, Second Edition. Cary, NC: SAS Institute Inc.
Tukey, J.W. (1970). Exploratory Data Analysis, Vol. I. Reading, MA: Addison-Wesley.
Tukey, J.W. (1980). We need both exploratory and confirmatory. The American Statistician, 34, 23-25.
Velleman, P.F. & Hoaglin, D.C. (1981). Applications, Basics, and Computing of Exploratory Data Analysis. Boston: Duxbury Press.