Visualizing large epidemiological data sets using depth and density


Jukka Pekka Kontto
University of Helsinki, Faculty of Social Sciences, Statistics
Master's thesis, May 2007
Preface
This master's thesis work was carried out in the International Cardiovascular Disease Epidemiology Unit at the National Public Health Institute in Helsinki from the summer of 2006 to the spring. I want to express my sincere gratitude to Kari Kuulasmaa from the International Cardiovascular Disease Epidemiology Unit for providing me with the opportunity to work in the MORGAM Project, and to the personnel of the unit for offering me a good working environment and excellent guidance during this work. I would especially like to thank my supervisors Juha Karvanen from the International Cardiovascular Disease Epidemiology Unit and Juha Puranen from the Department of Mathematics and Statistics at the University of Helsinki for their help and guidance throughout this work. I also thank Simona Giampaoli from the National Institute of Health in Rome, Abdonas Tamosiunas from the Kaunas University of Medicine Institute of Cardiology and Veikko Salomaa from the National Public Health Institute in Helsinki for allowing the use of the MORGAM data of their cohorts.
Helsinki, May 2007
Jukka Kontto
Contents
1 Introduction
2 Visualizing large data sets
  2.1 The exploratory analysis of large data sets
  2.2 Problems of visualizing large data sets
    2.2.1 Overplotting
    2.2.2 High plotting and computation times
    2.2.3 Problems in sub group comparison
    2.2.4 High number of variables
3 Methods for visualizing association of variables in large data sets using depth and density
  3.1 Introduction
  3.2 Data depth
    3.2.1 Halfspace depth
    3.2.2 Depth contours and the depth median of halfspace depth
    3.2.3 The bagplot and other graphical methods using halfspace depth
    3.2.4 Some other notions of data depth
  3.3 Data density
    3.3.1 Kernel density
    3.3.2 Some graphical methods for bivariate data using kernel density
4 Application to the MORGAM data
  4.1 Introduction
  4.2 The MORGAM Project
  4.3 Example of unimodal distribution
  4.4 Example of bimodal distribution
  4.5 MORGAM Interactive Databook
    4.5.1 Structure
    4.5.2 Technical implementation
    4.5.3 Current version of the interactive databook and plans for the future
5 Conclusion
Bibliography
Appendix
  datamerging.r
  cohorts.r
  demo11.r
  tables.r
  The selection page of MORGAM Interactive Databook
Chapter 1 Introduction
The developments in computer technology over the last ten years have enabled the processing of ever larger data sets, and this has opened new opportunities for statistical data analysis. The computing revolution has made many of yesterday's large data sets available (Carr, 1991), but in order to get the most out of them, new methods have to be developed, since the standard data analysis tools are not necessarily efficient for analyzing these data sets. The emphasis in this work is on graphical methods for analyzing large data sets in epidemiology.
Epidemiology studies health-related states or events and their genetic and environmental factors in human populations. In epidemiology, environment is defined to include any biological, chemical, physical, psychological or other factor affecting health. The data in an epidemiological study is collected experimentally or observationally, with the latter providing data for study types such as case-control and follow-up studies (Beaglehole et al., 1993). Epidemiological data sets are typically large, and studies of rare events in a population in particular require a large sample size. In longitudinal studies the amount of information is increased by repeated measurements at different times. Also, national health services and other government officials collect information concerning people's health, and this information can be used for epidemiological studies. In this work the data from the MORGAM Project, a large multinational follow-up study, serves as an example of a large data set in epidemiology.
Tufte (1983) argued that well-designed data graphics are usually the simplest and at the same time the most powerful methods for analyzing statistical information. Graphical displays are tools for exploring and unveiling the underlying structure of a data set, and when dealing with large data sets some specific methods are needed in order for this structure to be presented clearly. The main research question of this thesis follows from this need:
How to visualize large data sets in epidemiology?
Here large refers mainly to the high number of observations in a data set. The definition of a large data set used in this work was presented by Carr et al. (1987), and it relates strongly to the visualization properties of the data set: the number of observations is large if plotting or computation times are long, or if plots have an extensive amount of overplotting. Overplotting occurs when the graph locations of some observations of a bivariate data set are identical or very close. In chapter 2, overplotting and other specific characteristics and problems relating to the visualization of large data sets are discussed, with the emphasis on the situation of two continuous variables. One or two categorical variables are then added, which enables the observations of a bivariate data set to be divided into groups defined by the levels of the categorical variables. Throughout this work, the general principles of visualization have been taken into consideration, including the use of colors, scales and labels as well as clear vision and understanding. These principles are discussed, for instance, by Tufte (1983) and Cleveland (1985).
The graphical methods used in this thesis can be divided into two groups: the methods using data depth and the methods using data density. Tukey (1975) first introduced the concept of data depth when he used the halfspace depth, which is the most studied depth function, for visualizing bivariate data sets. Thus data depth is a relatively new concept in visualizing data sets. The development of depth-based methods has been enabled by the advances in computer technology, since the theory of depth-based methods is largely based on computational geometry (Liu et al., 1999; Rafalin and Souvaine, 2004).
The methods using data density discussed in this work are based on the kernel density estimator, which is one of the most commonly used and most studied density estimators (Silverman, 1986). The graphical methods using data density are better known than those using data depth; hence the emphasis in this work is on depth-based methods.
Both depth-based and density-based methods deal with overplotting by aggregating data. In other words, instead of visualizing every observation in the data set, the observations are divided into groups according to a certain criterion, and the structure of the data set is presented with regions based on these groups. The emphasis therefore shifts from single data points to the general structure of the data set.
The actual methods for visualizing large data sets are presented in chapter 3 and implemented using R (R Development Core Team, 2006), which sets some boundaries to the discussion, although the selection of graphical tools in R is quite wide. In chapter 4 the graphical methods are applied to data sets from the MORGAM Project, a large international follow-up study of cardiovascular diseases and genetic risk factors. Many examples are produced using MORGAM Interactive Databook, a web interface tool developed for researchers of the MORGAM Project to be used for exploratory data analysis. On the starting page, the user of the web application selects a data set to be analyzed and the graphical methods to be used. The application then shows the graphs based on these selections. The application was constructed during the summer of 2006 as a part of this thesis.
Chapter 2 Visualizing large data sets
2.1 The exploratory analysis of large data sets
The term exploratory data analysis was introduced by Tukey (1977), who divided data analysis into exploratory and confirmatory data analysis. In the exploratory approach, the assumptions concerning the data set are abandoned and the analysis is instead based on the data itself. After the data set is collected, it is explored and analyzed with specific techniques, which are generally graphical, in order to unveil the underlying structure and the main qualities of the data set. Finally, the appropriate statistical model is constructed. This approach differs from classical data analysis, where the data collection is followed by the imposition of a model with specific assumptions such as normality, and the subsequent analysis concentrates on the parameters of that model. On the other hand, the techniques of the classical approach are formal and objective, whereas in the exploratory approach the results are suggestive and the interpretation may vary by analyst. In practice, elements from both approaches are often used in the same analysis process (NIST/SEMATECH, 2007).
Most of the techniques in exploratory data analysis are graphical by nature, since graphical displays fulfil the idea of exploring data open-mindedly and without preliminary assumptions. The graphical displays, combined with the pattern-recognition capabilities that we all possess, make it possible to detect and uncover the underlying structure and other relevant qualities of the data set (NIST/SEMATECH, 2007). These graphical techniques include methods of plotting the data points, for instance scatterplots and histograms, and methods of plotting simple statistics, for instance boxplots and mean plots.
The development in computer technology has brought out new possibilities for exploratory data analysis.
Since it has become possible to process ever larger amounts of information, many large data sets have become available for data analysis. Large data sets of this kind have been produced by monitoring studies, for instance by monitoring the continued impact of mankind on its environment, or by monitoring human populations in terms of their behavioral characteristics such as product consumption and disease transmission. Computer simulations also generate large data sets (Carr, 1991). These data sets can be processed with the traditional graphical techniques of exploratory data analysis up to a certain limit, but problems emerge when the data set is too large. When a data set is considered to be large, the traditional techniques of exploratory data analysis are not always efficient and new methods have to be developed.
So when is a data set considered to be large? A data set is an N × p matrix, where N is the number of observations and p is the number of variables. Thus a large data set can be a result of either large N or large p or both. In this work the emphasis is on large N, since the current graphical methods in R are functional only for a relatively small number of dimensions p. The situation of large p is discussed in the subsection on the high number of variables. There is no unambiguous definition of large N, but Carr et al. (1987) presented one definition which is also used in this work:
Definition 2.1. N is large if plotting or computation times are long, or if plots can have an extensive amount of overplotting.
In other words, a large data set is not determined by a certain value of N, but in terms of the visualization properties of the data set. The concepts of computation time and overplotting are discussed more extensively in section 2.2.
2.2 Problems of visualizing large data sets
As presented in definition 2.1, the qualities which define a large data set are long plotting or computation times and an extensive amount of overplotting. These qualities are quite subjective: it depends on the frame of reference how much overplotting is considered to be extensive, or when the plotting or computation time of a graph is considered to be long.
Furthermore, due to the development in data processing, a data set with a long computation time one year could be analyzed much faster a few years later. Carr et al. (1987) argued that since display speed and computer capabilities can be expected to improve in the future, the key definition of large concerns overplotting.
The problems of visualizing large data sets are discussed in the situation of two continuous variables, since the focus is on the distribution and the spread of the data points. The number of variables is therefore secondary and is discussed in the subsection on the high number of variables.
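Definition 2.1 ties the notion of large to overplotting rather than to a fixed value of N. As a rough illustration only, the following Python sketch counts the share of points that land on a pixel cell already occupied by an earlier point. The function name overplotted_fraction, the 400 × 400 pixel plotting area, the standard normal test data and the sample sizes 500 and 50000 are all assumptions made for this example; none of them come from the thesis, which uses R.

```python
import random


def overplotted_fraction(points, x_range, y_range, width=400, height=400):
    """Share of points that fall on a pixel cell already used by an earlier point."""
    (x_min, x_max), (y_min, y_max) = x_range, y_range
    occupied = set()
    hidden = 0
    for x, y in points:
        # Map data coordinates to an integer pixel cell, clamped to the plot area.
        col = min(width - 1, max(0, int((x - x_min) / (x_max - x_min) * width)))
        row = min(height - 1, max(0, int((y - y_min) / (y_max - y_min) * height)))
        if (col, row) in occupied:
            hidden += 1
        else:
            occupied.add((col, row))
    return hidden / len(points)


def gaussian_cloud(n, rng):
    """Hypothetical test data: n independent standard normal pairs."""
    return [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(n)]


rng = random.Random(1)
small = overplotted_fraction(gaussian_cloud(500, rng), (-4, 4), (-4, 4))
large = overplotted_fraction(gaussian_cloud(50000, rng), (-4, 4), (-4, 4))
print(f"N=500: {small:.3f}  N=50000: {large:.3f}")
```

On the same plotting area the share of hidden points grows with N, which is one way of reading "extensive overplotting" in practice. The result depends on the assumed pixel resolution, so it should be taken as an indication and not as a threshold.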
2.2.1 Overplotting
Overplotting occurs when the graph locations of some observations of a bivariate data set are identical or very close to each other. This can cause different plotting symbols to obscure one another and the structure of the data set to be lost (Cleveland, 1985). Furthermore, if a graph with extensive overplotting is saved as vector graphics¹, the size of the file becomes large.
The problem of overplotting is demonstrated with the standard scatterplot, in which the association between two continuous variables is visualized. The scatterplot is one of the basic methods in exploratory data analysis for presenting the bivariate distribution of the variables, and an informative tool that unveils many qualities of a data set. It provides the first look at the distribution of the bivariate data set and identifies characteristics of the data set, for instance clusters of points and outliers (Cleveland, 1993).
In figure 2.1 the influence of the number of data points on overplotting is presented using three data sets $A = (X_1, Y_1)$, $B = (X_2, Y_2)$ and $C = (X_3, Y_3)$, which are constructed as follows: $X_1$, $X_2$ and $X_3$ are random samples from $N(0, 2)$, $N(-2, 0.5)$ and $N(2, 0.5)$ respectively, while $Y_1$ is a random sample from $N(0, 2)$ added componentwise to $X_1$, $Y_2$ is a random sample from $N(-2, 0.5)$ added componentwise to $X_2$, and $Y_3$ is a random sample from $N(2, 0.5)$ added componentwise to $X_3$. In figure 2.1a random samples of size 200 are generated, creating 600 data points. From the resulting scatterplot it can be observed that, in addition to the positive correlation, there are two clusters of data points in the plot, which are the data sets B and C. However, when the sample sizes are increased (figure 2.1b), the correlation can still be observed but the two clusters are no longer visible.
Exact overplotting is encountered when plotting variables with many equal values.
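Exact overplotting of this kind can be detected before any plotting by counting how many observations share each coordinate pair. The following is a minimal Python sketch under stated assumptions: the observations are hypothetical integer-valued measurements invented for this example, and the helper name multiplicities is not taken from the thesis or from R.

```python
from collections import Counter

# Hypothetical integer-valued measurements; repeated pairs overplot exactly.
observations = [
    (120, 80), (120, 80), (120, 80),
    (130, 85), (130, 85),
    (140, 90),
    (125, 80), (125, 80), (125, 80), (125, 80),
]


def multiplicities(points):
    """Number of observations at each distinct plotting location."""
    return Counter(points)


counts = multiplicities(observations)
# Every observation beyond the first at a location is hidden in a plain scatterplot.
hidden = sum(count - 1 for count in counts.values())

for location, count in sorted(counts.items()):
    print(location, count)
print(f"{hidden} of {len(observations)} observations are hidden behind another one")
```

The same counts can then drive whatever remedy is chosen, since each remedy needs to know which locations hold more than one observation.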
Exact overplotting can occur, for instance, with two discrete variables or with two continuous variables with low measurement accuracy. Many solutions to exact overplotting in the bivariate situation have been developed; some of them are presented in the following list and visualized in figure 2.2. The original data set consists of 10 observations plotted in figure 2.2a. There are two observations at point (2, 2), three observations at point (1, 1) and four observations at point (2, 1).
¹ Vector graphics are images that are completely described using mathematical definitions (April 26, 2007).
Figure 2.1: An example of overplotting with two bivariate distributions with different numbers of data points: a) A bivariate data set with 600 points. b) A bivariate data set with a larger number of points.
Jittering (Chambers et al., 1983)
The points can be jittered by adding random noise to one or both of the variables. This random noise can be generated by various methods (see e.g. Chambers et al. (1983)), and the R function jitter (R Development Core Team, 2006) implements the jittering as follows: let $x_i$ and $y_i$, where $i = 1, \ldots, n$, be two sets of points, and let $u_i$ and $v_i$ be random samples of length $n$ from the uniformly distributed random variables $U \sim U(-a_1, a_1)$ and $V \sim U(-a_2, a_2)$ respectively. Then the jittered values of $x_i$ and $y_i$ are $x_i + u_i$ and $y_i + v_i$, where $a_1$ and $a_2$ are selected as the amount of jittering. In figure 2.2b, $a_1$ and $a_2$ are selected to be equal.
Sunflowers (Chambers et al., 1983; Cleveland and McGill, 1984)
The symbols show the number of data points that occur at the centers of the symbols. A dot with no lines means one data point, a dot with two lines means two data points, a dot with three lines means three data points, and so on. The example in figure 2.2c is constructed using the R function sunflowerplot (R Development Core Team, 2006).
Moving (Cleveland, 1985)
The points are moved to free space. The locations of the overlapping points are altered slightly in order to visualize all data points. This method works well if the number of overlapping points is small. In figure 2.2d the overlapping points are moved vertically.
Figure 2.2: Different methods for dealing with overplotting: a) The original data set with 10 data points. b) The data points are jittered. c) The overlapping points are presented with sunflowers. d) The overlapping points are moved slightly vertically.
In addition, other methods of dealing with overplotting include transformation of the data set, for instance a logarithmic transformation, and visualization of the residuals instead of the data points (Cleveland, 1985). These methods could reduce the density of data in one area, but they also make the interpretation of the graph more difficult, as the original data values are not plotted. The use of open circles (Cleveland and McGill, 1984) in place of solid data points, on the other hand, could help in distinguishing the individual points if there is only partial overlapping in the data set. However, all of these methods fail if the data set is large enough (Unwin, 1999).
Overplotting in large data sets is caused simply by plotting too many points in too small an area. Therefore, specific tools have to be considered when dealing with overplotting due to a high number of observations. In practice, this implies that the ideal of presenting all data points in a scatterplot has to be abandoned and methods of aggregating data have to be found instead. Since the influence of one data point decreases when the size of the data set increases, it is not necessary to visualize every data point of a large data set; instead, methods of presenting the overall characteristics of the data set can be developed. It has to be remembered that the data must not be aggregated too much, so that the underlying structure and the qualities of the data set can still be identified and explored.
The methods of this kind presented in this work can be roughly divided into two groups: the methods using density and the methods using depth. These approaches to the problem of overplotting are presented in chapter 3. The basic idea in both is that the data points are divided into smaller groups according to a certain criterion, and these groups are used for visualization. The methods using depth define a depth value for each data point according to its location relative to the other data points. Data points with the same depth value can be interpreted as a group. The methods using density, on the other hand, divide the plotting region into bins in which the numbers of data points are counted separately.
2.2.2 High plotting and computation times
Although the advances in computer technology have enabled the analysis of ever larger data sets, there is a limit to how much data current computer software can process efficiently. If a data set is large enough, the plotting time of the graph, or the computation time of some operations needed for the construction of the graph, is no longer real time, that is, a fraction of a second (Carr et al., 1987). Data sets of this kind can be considered to have a high plotting or computation time. However, the size of a data set that causes a high plotting or computation time is always changing, since data processing capabilities continue to improve. In fact, Carr et al. (1987) argued that since display speed and computing capabilities can be expected to improve dramatically, the biggest problem in dealing with large data sets concerns overplotting.
However, high plotting or computation time is a problem that has to be taken into consideration in this work. One of the reasons for this is the construction of an interactive application (subsection 4.5), which puts the graphical methods discussed in this work into practice.
The interactivity of an application requires near real-time computation of the graphs. The problem of high plotting or computation time is discussed only in the situation where the cause is a large number of observations; other reasons, such as inefficient programming, are ignored.
A basic example of the relationship between plotting and computation times is presented next. A scatterplot with extensive overplotting usually has a high plotting time. One method for decreasing this plotting time is to decrease the amount of plotted information without losing relevant information concerning the data set. Plotting density estimates (section 3.3) instead of data points solves the problem of overplotting and also decreases the plotting time. However, the construction of density estimates requires computation, which increases the computation time of the graph. Thus decreasing both plotting and computation times in the same graph is usually difficult. In these situations, though, the increase in computation time is accepted, since it results in a better display (Carr et al., 1987). Also, if the decrease in plotting time exceeds the increase in computation time, the construction of density estimates proves to be useful also in terms of the overall processing time.
Generally, the methods dealing with overplotting decrease the plotting time but increase the computation time of a graph. Thus the problem of high plotting time is solved when the problem of overplotting is solved. The problem of computation time is more complex, since the computation time depends on the operations used for the construction of the plot. If the possibility of inefficient programming of those operations is ignored, the computation time is difficult to decrease. Naturally, if more than one graphical method visualizes the same information of a data set, one solution is to select the method with the fastest computation time. In this work this applies especially to the graphical methods using density, since there are many density-based methods presenting the same characteristics of a data set.
In addition, the question of how long a computation time counts as high depends on the purpose. The importance of fast computation is emphasized in subsection 4.5, where an interactive application for exploratory data analysis is introduced. One of the most essential properties of this, or any, interactive tool is almost real-time computation and plotting. Exploratory data analysis is detective work by nature (Tukey, 1977): for instance, a number of different combinations of variables are tried in order to find something interesting. This repetition requires the computation time of one examination to be near real time. Combined with the fact that large data sets have high computation times, this inevitably creates a difficult problem.
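The trade-off between plotting time and computation time can be made concrete with a small Python sketch. It uses crude rectangular bin counts as a stand-in for the density estimates of section 3.3, which is a simplification chosen for this example and not the method of the thesis. The function name bin_counts, the 30 × 30 grid, the sample size of 100000 and the standard normal test data are likewise assumptions.

```python
import random
import time
from collections import Counter


def bin_counts(points, x_min, x_max, y_min, y_max, bins=30):
    """Count points per cell of a bins x bins grid; out-of-range points go to border cells."""
    counts = Counter()
    for x, y in points:
        col = min(bins - 1, max(0, int((x - x_min) / (x_max - x_min) * bins)))
        row = min(bins - 1, max(0, int((y - y_min) / (y_max - y_min) * bins)))
        counts[(col, row)] += 1
    return counts


rng = random.Random(7)
points = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(100_000)]

# The aggregation step is the extra computation that a plain scatterplot avoids.
start = time.perf_counter()
counts = bin_counts(points, -4, 4, -4, 4)
elapsed = time.perf_counter() - start

print(f"scatterplot: {len(points)} symbols to draw, no aggregation step")
print(f"binned display: {len(counts)} cells to draw, after {elapsed:.3f} s of aggregation")
```

The number of objects to draw drops from the number of observations to at most the number of grid cells, while the aggregation step adds computation time that grows with the number of observations. Whether the exchange pays off in overall time depends on the plotting device, which this sketch does not measure.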
In fact, achieving near real-time computation with large data sets turns out to be the biggest challenge during the construction of this interactive application.
2.2.3 Problems in sub group comparison
In this subsection, situations which involve categorical variables in addition to the two continuous variables are considered. Categorical variables divide the bivariate distribution into groups, and identifying these groups in order to observe the differences between them is one of the goals of this work. Two problems in sub group comparison of large data sets are overplotting and a large number of levels in the categorical variables. There can be two kinds of overplotting: in addition to the overplotting of data points presented in subsection 2.2.1, the groups themselves can overlap, which makes the identification of groups difficult. A large number of levels in the categorical variables, on the other hand, brings its own problems.
These problems are demonstrated in the situation of two continuous variables and one categorical variable. The visualization of this three-dimensional data set is implemented using symbolic scatterplots and partitioning displays (Chambers et al., 1983; Cleveland, 1985), which are two methods of visualizing the difference between groups.
The symbolic scatterplot is a standard scatterplot in which the data points have been replaced with symbols so that there is a different symbol for each level of the categorical variable. Superimposing different symbols in one scatterplot could help in identifying the differences between groups. In figure 2.3 the problem of the symbolic scatterplot with large data sets is presented. Two data sets $A = (X_1, Y_1)$ and $B = (X_2, Y_2)$ are constructed as follows: $X_1$ and $X_2$ are random samples from $N(0, 1)$, while $Y_1$ is a random sample from $N(0, 0.5)$ added componentwise to $X_1$, and $Y_2$ is a random sample from $N(0, 0.5)$ added componentwise to $X_2$ and 1.5. First, random samples of size 25 are generated, and a scatterplot of the bivariate data set $A \cup B$ with 50 data points is presented in figure 2.3a. Then a symbolic scatterplot of data sets A and B is presented in figure 2.3b, where the data points of A are marked with crosses and the data points of B with filled bullets. Clearly, two separate groups can be identified from the graph. In figures 2.3c and 2.3d the same procedure is repeated with larger random samples, and the consequence is that overplotting prevents the two groups from being identified.
Figure 2.3: The visualization of two data sets with two continuous variables and one categorical variable. a) The scatterplot of the continuous variables with 50 data points. b) The symbolic scatterplot with 50 data points. c) The scatterplot of the continuous variables with a larger number of data points. d) The symbolic scatterplot with a larger number of data points.
The overlapping of data points prevents the observation of different symbols and furthermore of different groups. This problem becomes more evident if there are more than two levels in the categorical variable. Thus the symbolic scatterplot is not a functional way of identifying different groups in large data sets.
The partitioning display is a sequence of scatterplots in which the data points of each level of the categorical variable are plotted separately. It is essential that the scales of the axes in each graph are equal, so that the bivariate distributions can be compared in order to find differences between groups. The partitioning display works well with overlapping groups. Obviously there can be overplotting in the scatterplots after partitioning, but that leads back to the overplotting of data points discussed in subsection 2.2.1.
However, the partitioning display may suffer if the number of levels in the categorical variable is large. Each level is plotted in a separate plot, and a large number of plots could make the comparison of the groups difficult. Furthermore, a large number of plots requires more computation, and the resulting sequence of scatterplots is not an efficient or compact way of visualizing data. That is why it is important to find methods of visualizing different groups in the same graph.
In order to develop methods for presenting three-dimensional data in one graphical display, the data has to be aggregated. As discussed in subsection 2.2.1, the bivariate distribution of two continuous variables needs to be aggregated. The level of aggregation has to be increased further in order to make the categorical variable visible. One solution is to visualize groups with regions instead of individual data points. Two possible implementations of this idea are the grouped bagplot (subsection 3.2.3) and the bivariate highest density region boxplot (subsection 3.3.2). These methods can also be used if a second categorical variable is added, creating a four-dimensional data set.
The partitioning display is used for creating separate scatterplots for all levels of the first categorical variable, and then the grouped bagplot and the bivariate highest density region boxplot can be implemented in each scatterplot using the second categorical variable.
2.2.4 High number of variables
So far in this work, the visualization of data sets with a large number of observations has been discussed in the bivariate situation, with one or two categorical variables added. However, many large data sets, for instance those collected in epidemiological follow-up studies, are multidimensional by nature, containing tens or even hundreds of variables. This multidimensionality requires specific attention in visualization. In this subsection the problem of a high number of variables is discussed.
The limitation in visualizing multivariate data is that graphs are two-dimensional. This implies that the multidimensional structure of a data set must be inferred using the two-dimensional view (Cleveland, 1985). An example of this limitation is the visualization of three continuous variables using a three-dimensional coordinate system. Although this presentation includes three variables on three perpendicular axes, the two-dimensionality of the graph prevents clear observation of the point cloud. One solution to this problem is to rotate the point cloud on a computer screen, which improves graphical perception, but this is not possible in static environments (Cleveland, 1993). Thus the methods of visualizing multivariate data sets have to be implemented in a two-dimensional context.
Many graphical methods have been developed for presenting multivariate data in two dimensions, for instance by Chambers et al. (1983), Cleveland (1985) and Cleveland (1993), but most of the methods are not efficient when the data set is large; in other words, the methods do not take overplotting into consideration. An example of this kind of method is the symbolic scatterplot, which was already presented in the subsection on sub group comparison. It can be used for visualizing three continuous variables, where the third variable is coded into the size of the plotted symbols, with the largest and smallest sizes selected conveniently. This method can be used for outlier detection, but the structure of the data set is not unveiled because of overplotting, as was discussed in the same subsection. Another example of this kind of method is parallel coordinates (Inselberg, 1985), where a large number of observations causes only the outlying values of each variable to be observed. Since the emphasis is on overplotting, the methods used only for outlier detection are not discussed further.
An example of a graphical method for multivariate data which has the ability to deal with overplotting is the scatterplot matrix, one of the most traditional ways of visualizing multivariate data.
Each pair of variables of a data set is visualized by a scatterplot and then the scatterplots are presented in a matrix. Along each row (or column) of the matrix, one variable is plotted against all others with same scale for that variable in every graph in that row (or column). However, this presentation is nothing but a division of the multivariate data into many bivariate situations, which takes the discussion back in subsection Another example of visualizing multivariate data is the biplot (Gabriel, 1971; Gower and Hand, 1996; Le Roux and Gardner, 2005) which is a graphical presentation displaying the data points of a multivariate data in a single graph using methods of multidimensional scaling, most commonly principal component analysis, multiple correspondence analysis and canonical variate analysis (see, for instance Lebart et al. (1984) or Krzanowski (2000)). Using these methods the relationships between the observations in all variables are approximated and presented in a twodimensional space with all variables also approximated and presented by axes, one for each variable and not perpendicular. Thus Gower 17
and Hand (1996) consider a biplot as a multivariate analogue of a standard scatterplot. This method is a very useful and informative presentation of multivariate data with many interesting possibilities concerning data visualization which cannot be implemented using tools for bivariate data. However, when it comes to overplotting in a biplot, the solution is the same as in the bivariate situation: Le Roux and Gardner (2005) introduced the use of the bagplot (subsection 3.2.3) to visualize the structure of the multivariate point cloud. Thus the problem of overplotting in multivariate data goes back to the bivariate situation.

Other methods for visualizing multivariate data include the multiway dot plot (Cleveland, 1993) or trellis display, where the data set is divided into groups by some categorical variable and each group along with its data is presented in a separate dot plot. There is no problem of overplotting with this method, but the number of dot plots can rise with large data sets.

There are several methods for visualizing multivariate data. However, since these methods result in graphs viewed in two-dimensional space, the ways of dealing with overplotting are the same in the multivariate situation as in the bivariate situation. Thus in chapter 3 the graphical methods for large data sets are discussed in a bivariate situation.
Chapter 3

Methods for visualizing association of variables in large data sets using depth and density

3.1 Introduction

The biggest problem of visualizing large data sets is overplotting, as discussed in chapter 2. One solution to this problem is to aggregate data by dividing the data set into groups according to a certain criterion and then visualizing these groups instead of every data point. In this section two solutions of grouping data are presented, namely the methods using data depth and the methods using data density. Data depth is a relatively new concept in data visualization, whereas methods using density are better known. Thus the emphasis in this section is on data depth and its graphical implementations, but the methods using density are also discussed. The methods are discussed in a bivariate situation since, irrespective of the number of variables in a data set, the distribution of the variables is viewed in a two-dimensional plane.

3.2 Data depth

One of the most essential tasks in data analysis is the estimation of the location of the data. Several useful tools have been developed for this purpose, and in the univariate situation there are many familiar estimators for location, including the mean (in this work the mean refers to the arithmetic mean) and the median. These statistics are unambiguously defined when dealing with univariate data sets, but problems arise when the discussion is turned to the multivariate situation. Although the multivariate mean is still unambiguous, the estimation of location with median is no
longer straightforward and unambiguous, because there are different options for carrying out the generalization of the median into higher dimensions (Small, 1990).

The location estimators of univariate data sets have been widely studied. The mean is the most well-known location estimator due to the fact that it is the best location estimator when the underlying distribution is normal. Therefore other estimators have been developed to deal with non-normal data. One of the most important properties for a location estimator is robustness. An estimator is considered to be robust if it is not sensitive to outliers. This concept is discussed in a univariate situation with the mean and the median. These estimators are introduced in definitions 3.1 and 3.2.

Definition 3.1. Let $X = \{x_1, \ldots, x_n\}$ be a data set in $\mathbb{R}^1$. The univariate mean of $X$, $\bar{X}$, is the sum of the data points in $X$ divided by the number of data points:

$$\bar{X} = \frac{1}{n} \sum_{i=1}^{n} x_i. \tag{3.1}$$

Definition 3.2. Let $X = \{x_1, \ldots, x_n\}$ be a data set in $\mathbb{R}^1$ and let $\{x_{(1)}, \ldots, x_{(n)}\}$ be the ranking of $X$ from the smallest to the largest. The univariate median of $X$, $\mathrm{Md}(X)$, is the most central point or the mean of the two most central points of $\{x_{(1)}, \ldots, x_{(n)}\}$:

$$\mathrm{Md}(X) = \begin{cases} x_{\left(\frac{n+1}{2}\right)} & \text{if } n \text{ is odd} \\ \left( x_{\left(\frac{n}{2}\right)} + x_{\left(\frac{n}{2}+1\right)} \right)/2 & \text{if } n \text{ is even.} \end{cases} \tag{3.2}$$

In figure 3.1 the robustness of the mean and the median is compared with a data set of five points in $\mathbb{R}^1$ marked with filled bullets. In the upper situation the mean of the data set is denoted with $M_1$ and in the lower situation the median is denoted with $\mathrm{Md}_1$. Then one data point, $a_1$, is moved to the right to the point $a_2$ marked with a circle. The estimates of the modified data set are denoted with $M_2$ and $\mathrm{Md}_2$. The consequence of the increase of one data point is that the mean also increases but the median remains unchanged. In an extreme case, if one data point is moved to infinity, the mean will also go to infinity.
The median, on the other hand, is not affected by the manipulation of one data point. This sensitivity of location estimators to outlying data points can be summarized with the notion of breakdown point. This concept was introduced by Hodges (1967), who called this quality "tolerance of extreme values". Donoho and Huber (1983) introduced the definition of breakdown point for finite data sets:

Definition 3.3. Let $X = \{x_1, \ldots, x_n\}$ be a data set in $\mathbb{R}^d$, let $T(X)$ be a location estimator and let $Y = \{y_1, \ldots, y_m\}$ in $\mathbb{R}^d$ be $m$ arbitrary values replacing $m$ values in $X$. Thus, the fraction of the values of $Y$ in the resulting contaminated data set $X'$ is $\varepsilon = m/n$. The breakdown point
of $T$ at $X$ is denoted with $\varepsilon^*(X, T)$:

$$\varepsilon^*(X, T) = \min_{1 \le m \le n} \left\{ \frac{m}{n} : \sup_{X'} \left\| T(X) - T(X') \right\| = \infty \right\}, \tag{3.3}$$

where the supremum is taken over all possible collections $X'$. Thus the breakdown point of an estimator is the smallest fraction of contaminated values that may cause the estimator to take on arbitrarily large aberrant values.

Figure 3.1: The comparison of robustness of the mean and the median in $\mathbb{R}^1$. When the value of one point $a_1$ is increased, the mean also increases but the median remains unchanged.

In the example presented in figure 3.1, the mean has the breakdown point of $1/5$. When it comes to the median, it seems that it requires the contamination of three data points before the median is affected. However, after the contamination of three data points, the estimator can no longer tell whether the three moved points or the two remaining points are the contaminated part, and the breakdown point is defined from the smaller part according to definition 3.3. Thus the median has the breakdown point of $2/5$. Generally, when dealing with a finite data set of $n$ points, the mean has a breakdown point of $1/n$, while the median has the breakdown point of $1/2$ (Donoho and Huber, 1983). In fact, the largest possible value for the breakdown point of any sensible estimator of location is $1/2$ (Lopuhaä and Rousseeuw, 1991). Bassett (1991) called the breakdown point of $1/2$ the 50% breakdown property and discussed it along with two other desirable properties of a univariate location estimator, namely equivariance and monotonicity. These properties are presented in definitions 3.4 and 3.5.

Definition 3.4. Let $X = \{x_1, \ldots, x_n\}$ be a data set in $\mathbb{R}^1$. An estimate $T(X)$ is equivariant if $T(aX + b) = aT(X) + b$, where $a$ and $b$ are constants.

Definition 3.5. Let $X = \{x_1, \ldots, x_n\}$ and $X' = \{x'_1, \ldots, x'_n\}$ be data sets in $\mathbb{R}^1$. An estimate $T(X)$ is monotonic if $T(X) \le T(X')$ when $X \le X'$, where the vector inequality is read componentwise.
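The robustness comparison of figure 3.1 and the properties in definitions 3.4 and 3.5 can be tried out numerically. The following is a minimal sketch only: the five data values, the choice of moving the largest point, and the helper names `replace_largest`, `is_equivariant` and `is_monotonic` are assumptions made for illustration and do not come from the thesis. The two checks test a single example each, so they illustrate the definitions rather than prove anything.

```python
from statistics import mean, median

# Hypothetical five-point data set in R^1; the values are illustrative only.
X = [1.0, 2.0, 3.0, 4.0, 5.0]


def replace_largest(points, new_value):
    """Return a copy of points with the largest value replaced by new_value."""
    moved = sorted(points)
    moved[-1] = new_value
    return moved


def is_equivariant(estimator, points, a, b):
    """Check T(aX + b) == a*T(X) + b for one choice of the constants a and b."""
    transformed = [a * x + b for x in points]
    return abs(estimator(transformed) - (a * estimator(points) + b)) < 1e-9


def is_monotonic(estimator, points, larger):
    """Check T(X) <= T(X') for one pair of data sets with X <= X' componentwise."""
    assert all(x <= y for x, y in zip(points, larger))
    return estimator(points) <= estimator(larger)


# Robustness: moving one point far to the right drags the mean along,
# while the median stays where it was.
for new_value in (5.0, 50.0, 5000.0):
    moved = replace_largest(X, new_value)
    print(new_value, mean(moved), median(moved))

# Equivariance and monotonicity, each checked on a single example.
X_larger = [x + 0.5 for x in X]
for name, T in (("mean", mean), ("median", median)):
    print(name, is_equivariant(T, X, a=2.0, b=-3.0), is_monotonic(T, X, X_larger))
```

In this sketch both estimators pass the equivariance and monotonicity checks; what separates them is the behaviour in the first loop, where a single moved point changes the mean without bound, in line with its breakdown point of $1/n$.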
Bassett (1991) showed that the median is the only sensible estimator fulfilling all of these three properties. Because of this result the median can be considered to be a good
estimator for the location of a univariate data set. An ideal generalization of the median for multivariate data sets would be an estimator with the same properties Bassett (1991) introduced for the median in the univariate situation. These properties can be defined in higher dimensions, but defining a median that fulfills these properties is not unambiguous, since there are several ways of accomplishing the ranking of data points and determining the most central point in a multivariate data set (cf. definition 3.2).

Barnett (1976) introduced various solutions for ordering multivariate data sets. He recognized the lack of any obvious and unambiguous means of fully ranking the data points in a multivariate sample and thus called the methods he introduced subordering principles, alluding to this defect. Barnett classified these principles into four categories. In the following list these categories are presented along with examples, which are visualized in figure 3.2. The same data set of nine points, marked with filled bullets, is used in all four examples. The rank of each data point is marked in the graphs. Although these examples are presented in a bivariate situation, they can be generalized to higher dimensions.

Conditional ordering: The data points are ranked according to one dimension of the data set. In figure 3.2a the data points are ranked in terms of the values of $X_1$ so that the point with the smallest value is ranked first.

Reduced ordering: The data points are reduced to single values using some distance metric. These values are ranked. In figure 3.2b the data points are ranked with respect to their distances from the origin so that the point with the shortest distance is ranked first.

Partial ordering: The data set is partitioned into smaller groups which are ranked. In figure 3.2c the data points are divided into groups using convex hull peeling (subsection 3.2.4) and the points in the same hull belong to the same group.
The points on the periphery of the outermost hull are ranked first.

Marginal ordering: Each variable is considered independently. The values of each variable are sorted and new points are created using the sorted values. These new points are ranked. In figure 3.2d the values of $X_1$ and $X_2$ are ranked separately in increasing order. Then nine new points are created so that the first new point is located in the intersection of the smallest value of $X_1$ and the smallest value of $X_2$, the second new point is
located in the intersection of the second smallest value of $X_1$ and the second smallest value of $X_2$ and so on. The new set of points is ranked and marked with circles in the graph, although the sixth point coincides with a point from the original data set. The original data points are not ranked.

Figure 3.2: Four principles of ordering multivariate data sets (Barnett, 1976): a) Conditional ordering (ranking in terms of $X_1$). b) Reduced ordering (ranking with respect to the distances from the origin $(0,0)$). c) Partial ordering (ranking with convex hull peeling). d) Marginal ordering (ranking in both dimensions independently; original data points are not ranked).

The concept of data depth has developed from the problem of generalizing the univariate median and ordering multivariate data sets. Data depth offers one solution for forming a multivariate order of the data points and, according to the classification of Barnett (1976), data depth is placed in the category of reduced ordering. Data depth is a nonparametric method (Liu et al., 1999; Zuo and Serfling, 2000a) in the sense that no assumptions concerning the underlying distribution of the data set or the existence of the moments are needed. This is considered to be an advantage, since the problem with many methods in multivariate analysis is that they require the assumption of normally distributed variables, which in most research frames is difficult to justify (Liu et al., 1999). On the other hand, advances in computer technology have made it possible to process ever larger multivariate data sets, and this has enabled the development of depth-based methods, whose theory is largely based on computational geometry (Liu et al., 1999; Rafalin and Souvaine, 2004).
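The four subordering principles of Barnett (1976) listed above can be sketched in a few lines of code. This is a hedged illustration under stated assumptions: the nine points are hypothetical and are not the data set of figure 3.2, all function names are invented for this example, and the convex hull is computed with Andrew's monotone chain algorithm, which is simply one convenient choice and not a method named in the thesis.

```python
import math

# Hypothetical bivariate data set of nine points (x1, x2); not the data of figure 3.2.
points = [(1, 2), (9, 1), (8, 9), (2, 8), (3, 3), (7, 4), (6, 7), (4, 6), (5, 5)]


def conditional_ordering(pts):
    """Rank the points by their first coordinate, smallest first."""
    return sorted(pts, key=lambda p: p[0])


def reduced_ordering(pts, origin=(0.0, 0.0)):
    """Rank the points by Euclidean distance from the origin, shortest first."""
    return sorted(pts, key=lambda p: math.dist(p, origin))


def convex_hull(pts):
    """Vertices of the convex hull (Andrew's monotone chain, an assumed choice).

    Points lying strictly inside a hull edge are not returned as vertices.
    """
    pts = sorted(set(pts))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    def half_hull(sequence):
        chain = []
        for p in sequence:
            while len(chain) >= 2 and cross(chain[-2], chain[-1], p) <= 0:
                chain.pop()
            chain.append(p)
        return chain[:-1]

    return half_hull(pts) + half_hull(reversed(pts))


def partial_ordering(pts):
    """Convex hull peeling: group 1 is the outermost hull, group 2 the next, and so on."""
    remaining = list(pts)
    layers = []
    while remaining:
        hull = convex_hull(remaining)
        layers.append(hull)
        remaining = [p for p in remaining if p not in hull]
    return layers


def marginal_ordering(pts):
    """Sort each coordinate separately and pair the k-th smallest values.

    The result is a new set of points that need not belong to the data set.
    """
    xs = sorted(p[0] for p in pts)
    ys = sorted(p[1] for p in pts)
    return list(zip(xs, ys))


print(conditional_ordering(points))
print(reduced_ordering(points))
print(partial_ordering(points))
print(marginal_ordering(points))
```

Only the conditional and reduced orderings return a full ranking of the original points. The partial ordering returns groups of equally ranked points, and the marginal ordering ranks newly created points, which mirrors the remark above that the original data points are not ranked in that case.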
The theoretical and sample versions of data depth are introduced in definitions 3.6 and 3.7.

Definition 3.6. Let $F$ be a probability distribution in $\mathbb{R}^d$, where $d \ge 1$. A data depth function $D(x; F)$ measures how central (or deep) a point $x \in \mathbb{R}^d$ is relative to $F$.

Definition 3.7. Let $F$ be a probability distribution in $\mathbb{R}^d$, where $d \ge 1$, let $X = \{X_1, \ldots, X_n\}$ be a finite sample of data points from $F$ and let $F_n$ be the empirical distribution of $X$. A finite sample data depth function $D(x; F_n)$ measures how central (or deep) a point $x \in \mathbb{R}^d$ is relative to $F_n$.

It has to be emphasized that the whole $F$ (definition 3.6) or every data point of $X$ (definition 3.7) contributes to the depth value of point $x$, and that the more central $x$ is relative to $F$ (definition 3.6) or $X$ (definition 3.7), the higher the depth value of $x$ is. The properties of data depth are discussed in a later subsection.

Given a notion $D(x; F_n)$ of sample data depth function in definition 3.7, one can compute the depth values of all data points in a data set and order them in terms of these depth values in decreasing order. This results in a center-outward ranking, where the most central data point is ranked first and the most outlying data point is ranked last, suggesting that a relevant notion of center is available. This center is proposed, according to its univariate analog, as the depth median:

Definition 3.8. The depth median is the point in $\mathbb{R}^d$, not necessarily in the data set, globally maximizing the data depth function $D(x; F_n)$.

When a multivariate data set is ordered using the center-outward ranking induced by a depth function, the depth median is located near the point ranked first. This is a marked difference from the univariate situation, where the median is located near the point in the middle of the ranking, as the data points are ordered using linear ranking from the smallest to the largest.

Rafalin et al. (2005) presented two problems concerning the concept of data depth.
The first problem concerns the definition of data depth. Data depth is a nonparametric method with no distributional assumptions made. It creates a center-outward ordering of data points and the depth median (definition 3.8) is the point with the largest depth value. This, together with the fact that depth values are higher closer to the depth median, implies that data depth results in a unimodal distribution of data depth values, even if the underlying distribution or data set is multimodal. Thus Rafalin et al. (2005) argued
that one distributional assumption is made, namely that the underlying distribution is unimodal.

Another problem Rafalin et al. (2005) introduced is that the current data depth methods are not computationally efficient in higher dimensions. Despite the development in computer technology, the computational complexity of data depth grows exponentially with respect to the dimension, and this prevents data depth from being a practical method of analysis in higher dimensions. Thus the current depth-based methods available in R are for analyzing data sets with no more than three dimensions. In this work that is not considered to be a disadvantage, since the focus is on the visualization of data sets and graphical displays of four or more dimensions would not be informative and clear. So, the concentration is on situations with two or three dimensions, and only a fraction of the theory is presented in higher dimensions.

3.2.1 Halfspace depth

The halfspace depth is the most well-known and most studied depth function (Zuo and Serfling, 2000b). One of the reasons behind this is that the halfspace depth function possesses many desirable properties as a depth function. These properties and the comparison between halfspace depth and other depth functions are discussed in a later subsection. The popularity of halfspace depth among depth functions has led to the fact that many of the graphical depth-based presentations currently available in R have been implemented using halfspace depth. Examples of these applications include depth contours (subsection 3.2.2) and the bagplot (subsection 3.2.3). Since the depth-based visualization methods currently available use halfspace depth, the emphasis in this work is on it.

Halfspace depth was first introduced by Hodges (1955) as a tool for performing a bivariate analog of the two-sided sign test, although he did not use the name halfspace depth.
Tukey (1975) used the halfspace depth for visualizing bivariate data sets, but started from the definition of the depth values in the univariate situation. In the literature halfspace depth is also called Tukey depth or location depth (Rafalin and Souvaine, 2004; Hugg et al., 2005). The univariate definition is presented in figure 3.3 with a ranked data set $X = \{x_1, \ldots, x_7\}$. The depth value of the median ($x_4$) is equal to $\frac{1}{2}(1 + n)$, where $n$ is the number of data points in the data set. Generally, the depth value of the point $x_i$, where $i = 1, \ldots, 7$, is equal to $\min\{i, n + 1 - i\}$, and the depth values of each point have been included in figure 3.3. As this figure clearly shows, the depth value of a point $x_i$ is the minimum of the number of data points on one