Fuzzy Clustering of Time-variant and invariant Features: Application to Sepsis Outcome Prediction



Marta C. Ferreira*
* Technical University of Lisbon, Instituto Superior Técnico, Dept. of Mechanical Engineering, CIS/IDMEC LAETA, Av. Rovisco Pais, Lisbon, Portugal

ARTICLE INFO
Keywords: Data Mining; Machine Learning; Clustering; Time Series Analysis; Mixed Data; Septic Shock

ABSTRACT
This dissertation proposes a novel clustering method based on fuzzy c-means that handles information from both time-variant and time-invariant features. The new method, Mixed Clustering, shows the advantages of aggregating both data components to identify systems in a wide range of application domains, such as Medical, Management or Energy Systems. The flexible formulation of the proposed methodology adapts to data sets with multivariate time series and to different distance-based similarity measures. In addition to the Euclidean distance, a distance based on the popular Dynamic Time Warping (DTW) method is used for time series similarity search, since it can overcome the temporal misalignment between series that is commonly found in these applications.
The contribution of the Mixed Clustering approach is demonstrated for forecasting and classification problems. The first is addressed through its application to a meteorological system, forecasting temperature and humidity from geographical location. The method's performance as a binary classifier is demonstrated with a medical application, where the goal is to predict the outcome of a patient diagnosed with septic shock from physiological variables measured during a sampling period and from the patient's demographic data, which is constant during the stay in an Intensive Care Unit. The machine learning process is tested under unsupervised and supervised alternatives. The application of the method showed that when the temporal information of the patient is poorer, the demographic information can improve the classifier's performance.

1. Introduction

1.1. Knowledge Data Discovery
Current developments in data warehousing enable the storage of increasingly large data sets, leading to growth both in the amount of information available about any given system and in the analytical possibilities it provides. The Knowledge Data Discovery (KDD) process focuses on methodologies for extracting useful knowledge from the information available in databases (Fayyad, Piatetsky-Shapiro, & Smyth, 1996). First, the data relevant to the system under identification (the target data) is selected from the available database. The target data is then pre-processed: the information is cleaned, missing values are handled, and the data is adapted to the requirements of the analysis.

Figure 1-1 KDD Process

The data is then transformed, that is, consolidated into structures appropriate for the data mining method to be applied, in this case the Mixed Clustering, which identifies patterns in the data.


The results obtained from the mined patterns are then interpreted in the original system's field, finally yielding the useful knowledge desired. The focus of this dissertation is the proposal of a new, efficient data mining method based on clustering, designated Mixed Clustering, for databases combining time-variant and time-invariant features. The method is valid for forecasting and classification problems and applicable to a diverse range of domains, from medical problems and climatic analysis to power management and economic studies.

1.2. Time Series Data Mining
This innovative data mining method searches for patterns and similarities in both data components, time-variant and time-invariant, combining the extracted information to better characterize the data objects. The mining of time series, and in particular their clustering, attracts the interest of researchers, and the complexity of this type of data requires careful examination of the proposed algorithms (Rani & Sikka, 2012). While the time-invariant features are easily compared with a common and simple distance function, the Euclidean distance, the time-variant features, represented by time series, require a more complex analysis (Rani & Sikka, 2012). Thus, a more modern measure is implemented for similarity search of time series, the Dynamic Time Warping (DTW).

Figure 1-2 Euclidean and DTW matching of Time Series

This similarity measure can cope with temporally misaligned time series, identifying similar tendencies and patterns even when they are out of phase in their time of occurrence. It has been successfully applied in areas such as handwriting and online signature matching, time series database search, computer vision, surveillance and signal processing (Gaudin & Nicoloyannis, 2006).

1.3. Outline
This work is structured as follows: in section 2, the mixed clustering concept is described and the methodology presented.
In section 3, the use of the method's outputs to solve a forecasting problem is presented and applied to a Meteorological System, followed by a demonstration and discussion of the results. The method's contribution to a classification problem is demonstrated in section 4 and applied to a Medical System, again followed by a demonstration and discussion of the results achieved. Finally, in section 5 the results of the different applications are reviewed and compared to previous works on the subject, concluding with a set of suggestions for future work.

2. Clustering

2.1. Concept
Clustering is a data mining technique that groups similar data objects based on identified patterns while separating objects with distinct behaviours: the data is divided into clusters so that intra-group differences are smaller than inter-group differences. This concept is useful in a wide range of applications, from image analysis, wireless sensor network applications and population segmentation to bioinformatics (Liao, 2005). Often, the information that describes a system is not all represented by the same type of data: there are categorical, numerical and text features, and constant and time-varying features. In such cases, a clustering method capable of reconciling distinct data types becomes necessary.
In (Izakian, Witold, & Jamal, 2013), a clustering method to handle spatiotemporal systems is proposed. These systems are characterized not only by temporal features but also by the spatial location at which they were measured. Geography, climatology and epidemiology systems are examples of applications relying on spatiotemporal data for their identification. The methodology proposed in (Izakian, Witold, & Jamal, 2013) expands the Fuzzy C-Means (FCM) clustering technique (Bezdek, Ehrlich, & Full, 1984) to handle spatiotemporal data by adding a weighting factor λ that sets the importance given to the temporal component. This factor greatly benefits the algorithm's flexibility, allowing it to search for the best combination of temporal and spatial contributions.
The aim of this dissertation is to expand this notion of spatiotemporal data to any dataset containing different types of data, constant and time-varying, that may require specific treatment, by generalizing the spatiotemporal clustering methodology to databases with mixed data and multivariate time series. We will show that there are benefits in combining both data components to model systems in a wide range of application domains, such as Medical Care, Finance, Management and Energy Systems.

2.2. Mixed Clustering Methodology
When working with a database with time-variant and time-invariant features, the input data is considered as a concatenation of both data components:

$$ x_i = [\, x_i^s \;\; x_i^t \,], \quad i = 1, \dots, n \tag{2.1} $$

The invariant component, represented by numeric values, is structured as follows:

$$ x_i^s = [\, x_{i,1}^s, \dots, x_{i,r}^s \,] \tag{2.2} $$

where r is the number of invariant features.
The time-variant data component, represented by multivariate time series, is structured as a three-dimensional matrix:

$$ x^t = [\, x_{i,j,k}^t \,] \tag{2.3} $$

In this format, each value is defined by three coordinates: i = 1, ..., n, indicating the sample number; j = 1, ..., q, the sampling point; and k = 1, ..., f, the feature.
The clustering method defines a set of prototypes, or centers, one for each of the c clusters, each comprised of a variant and an invariant component. The invariant component's prototypes are determined by:

$$ v_l^s = \frac{\sum_{i=1}^{n} u_{l,i}^m \, x_i^s}{\sum_{i=1}^{n} u_{l,i}^m} \tag{2.4} $$

The time-variant prototypes require an expansion to deal with the increased dimensionality of the data. A three-dimensional structure was defined, with dimensions [c × q × f]:

$$ v_{l,k}^t = \frac{\sum_{i=1}^{n} u_{l,i}^m \, x_{i,k}^t}{\sum_{i=1}^{n} u_{l,i}^m} \tag{2.5} $$

where the fuzziness parameter m makes the partition more fuzzy or more crisp. The value $u_{l,i}$ is an element of the partition matrix U, which defines the degree to which each sample belongs to each cluster. Being a fuzzy clustering method, the membership of a sample i in a cluster l is a value in the interval [0, 1], subject to:

$$ u_{l,i} \in [0,1], \qquad \sum_{l=1}^{c} u_{l,i} = 1, \qquad 0 < \sum_{i=1}^{n} u_{l,i} < n $$

The similarity between a sample and a cluster is then measured by the sample's augmented distance to the cluster's center, given by:

$$ d_\lambda^2(v_l, x_i) = \lVert v_l^s - x_i^s \rVert^2 + \lambda \sum_{k=1}^{f} \delta(v_{l,k}^t, x_{i,k}^t) \tag{2.6} $$

where $\delta(v_{l,k}^t, x_{i,k}^t)$ is the distance between the k-th feature of prototype l and of sample i, calculated by the distance function used, and λ is a parameter that defines the influence given to the time-variant features. The optimal value of this parameter is determined by running the clustering process sequentially for different values and choosing the one that yields the best performance.
By adding the distances of all features for each sample, the matrix of distances maintains its dimension [c × n], resulting in a meaningful partition matrix defined, as for a univariate time-series system, by:

$$ u_{l,i} = \frac{1}{\sum_{o=1}^{c} \left( \dfrac{d_\lambda(v_l, x_i)}{d_\lambda(v_o, x_i)} \right)^{2/(m-1)}} \tag{2.7} $$

Since the objective function J depends directly only on the distances and membership degrees, it can also be defined as for a univariate time-series system:

$$ J = \sum_{l=1}^{c} \sum_{i=1}^{n} u_{l,i}^m \, d_\lambda^2(v_l, x_i) \tag{2.8} $$

The clustering process continues until the objective function converges or the maximum number of iterations is reached.

3. Forecasting Problem: Meteorological System

3.1. Modelling
The Alberta Agriculture and Rural Development organization provides current and historical weather data from approximately 340 meteorological stations located across the Canadian province, mapped in Figure 3-1. The meteorological variables available include temperature, humidity, precipitation and solar radiation, and are of great interest for users such as epidemiologists seeking to better understand, for instance, the relationships between measures of environmental health and those of animal health. This platform, available at (ARD), is also valuable for environmental or agricultural analysis.

Figure 3-1 Map of the province of Alberta, Canada. Area where the meteorological stations are located

The Alberta province covers areas with different geographical and meteorological profiles, including mountains, valleys, lakes and arid areas.
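To make the setup concrete, the clustering iteration of equations (2.4) to (2.8) can be tried on a handful of stations, each described by its location (invariant part) and one short series (temporal part). The following is a minimal sketch rather than the implementation used in this work: the function names, the seeding with the first c samples, the choice of the Euclidean distance for δ and the station values are all illustrative assumptions. Series are stored as `xt[sample][feature][point]`.

```python
from math import sqrt

def sq_euclid(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def augmented_sq_dist(v_s, v_t, x_s, x_t, lam):
    """Equation (2.6): invariant term plus lam times the summed per-feature
    temporal distances (delta is assumed to be the Euclidean distance)."""
    temporal = sum(sqrt(sq_euclid(vk, xk)) for vk, xk in zip(v_t, x_t))
    return sq_euclid(v_s, x_s) + lam * temporal

def memberships(d2, m=2.0):
    """Equation (2.7), written in terms of squared distances d2[l][i]."""
    c, n = len(d2), len(d2[0])
    u = [[0.0] * n for _ in range(c)]
    for i in range(n):
        zeros = [l for l in range(c) if d2[l][i] == 0.0]
        if zeros:  # sample coincides with a prototype: it takes full membership
            for l in zeros:
                u[l][i] = 1.0 / len(zeros)
            continue
        for l in range(c):
            u[l][i] = 1.0 / sum((d2[l][i] / d2[o][i]) ** (1.0 / (m - 1.0))
                                for o in range(c))
    return u

def prototypes(u, xs, xt, m=2.0):
    """Equations (2.4) and (2.5): membership-weighted means."""
    n = len(xs)
    vs, vt = [], []
    for row in u:
        w = [row[i] ** m for i in range(n)]
        total = sum(w)
        vs.append([sum(w[i] * xs[i][j] for i in range(n)) / total
                   for j in range(len(xs[0]))])
        vt.append([[sum(w[i] * xt[i][k][j] for i in range(n)) / total
                    for j in range(len(xt[0][k]))]
                   for k in range(len(xt[0]))])
    return vs, vt

def mixed_fcm(xs, xt, c, lam, m=2.0, max_iter=100, eps=1e-5):
    """Alternate (2.7) and (2.4)-(2.5) until J of (2.8) changes by less than eps."""
    vs = [list(x) for x in xs[:c]]              # assumption: first c samples as seeds
    vt = [[list(s) for s in x] for x in xt[:c]]
    j_prev = None
    for _ in range(max_iter):
        d2 = [[augmented_sq_dist(vs[l], vt[l], xs[i], xt[i], lam)
               for i in range(len(xs))] for l in range(c)]
        u = memberships(d2, m)
        j_val = sum(u[l][i] ** m * d2[l][i]
                    for l in range(c) for i in range(len(xs)))
        vs, vt = prototypes(u, xs, xt, m)
        if j_prev is not None and abs(j_prev - j_val) < eps:
            break
        j_prev = j_val
    return u, vs, vt

# Made-up stations: (latitude, longitude) and one short temperature series each.
xs = [[49.0, -112.0], [58.0, -117.0], [49.5, -112.5], [57.5, -116.5]]
xt = [[[5.0, 6.0, 7.0]], [[-3.0, -2.0, -1.0]],
      [[5.5, 6.5, 7.5]], [[-3.5, -2.5, -1.5]]]
u, vs, vt = mixed_fcm(xs, xt, c=2, lam=1.0)
labels = [max(range(2), key=lambda l: u[l][i]) for i in range(4)]
```

In this toy case `labels` pairs the two southern stations in one cluster and the two northern stations in the other. Rerunning with different values of `lam` is one way to explore how strongly the temporal part should weigh, in the spirit of the sequential search for λ described in Section 2.2.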
For these experiments, the daily average temperature and daily average humidity records were considered, taken from 1/1/2009 to 12/31/2009, forming the time-variant input features. The time-invariant features consisted of the latitude and longitude of the station at which they were measured. All stations for which all features were available with no missing values were considered, resulting in 168 samples. The time series were represented by the Discrete Fourier Transform (DFT). The following parameters were used:
- Fuzziness parameter: m = 2
- Number of samples: n = 249
- Number of time-invariant features: r = 2
- Number of time-variant features: f = 2
- Time-variant feature length: q =

3.2. Experimental Setup
The Mixed Clustering methodology was applied to the Meteorological System under two distinct criteria. The first, the Reconstruction Criterion (RC), evaluates the cluster validity, while the Prediction Criterion (PC) evaluates the method's forecasting ability.

Reconstruction Criterion
The RC assesses the quality of the clusters constructed by attempting to recreate the original data. Defining $\hat{x}$ as the reconstructed data, its invariant and variant components are respectively defined as

$$ \hat{x}_i^s = \frac{\sum_{l=1}^{c} u_{l,i}^m \, v_l^s}{\sum_{l=1}^{c} u_{l,i}^m} \tag{3.1} $$

$$ \hat{x}_{i,k}^t = \frac{\sum_{l=1}^{c} u_{l,i}^m \, v_{l,k}^t}{\sum_{l=1}^{c} u_{l,i}^m}, \quad k \in [1, f] \tag{3.2} $$

The Average Reconstruction Error (ARE) is calculated as:

$$ \mathrm{ARE}(\lambda) = \frac{1}{n} \left( \frac{1}{r} \sum_{i=1}^{n} \sum_{j=1}^{r} \frac{(\hat{x}_{i,j}^s - x_{i,j}^s)^2}{\sigma_j^2} + \frac{1}{f q} \sum_{i=1}^{n} \sum_{k=1}^{f} \sum_{j=1}^{q} \frac{(\hat{x}_{i,j,k}^t - x_{i,j,k}^t)^2}{\sigma_j^2} \right) \tag{3.3} $$

where $\sigma_j^2$ is the variance of the j-th feature.

Prediction Criterion
The aim of the PC is to predict the temporal component of the data from the available spatial component, minimizing the resulting error by adjusting the temporal influence parameter λ. A partition matrix is estimated from the invariant data and prototypes as:

$$ u_{l,i} = \frac{1}{\sum_{o=1}^{c} \left( \dfrac{\lVert v_l^s - x_i^s \rVert}{\lVert v_o^s - x_i^s \rVert} \right)^{2/(m-1)}} \tag{3.4} $$

The Average Prediction Error (APE) is then calculated as:

$$ \mathrm{APE}(\lambda) = \frac{1}{n f q} \sum_{i=1}^{n} \sum_{k=1}^{f} \sum_{j=1}^{q} \frac{(\hat{x}_{i,j,k}^t - x_{i,j,k}^t)^2}{\sigma_j^2} \tag{3.5} $$

The stopping criteria for the clustering algorithm in this experiment were the following:
- Minimal variation of the objective function: $\Delta J < \varepsilon = 10^{-5}$
- Maximum number of iterations: maxit = 100

3.3. Results and Discussion

Reconstruction Criterion
The RC was applied to each time-variant feature, humidity or temperature, individually and to the combination of both in a multivariate approach, each with between 2 and 5 clusters, using the Euclidean distance and the DTW for similarity search. It was observed that the multivariate alternative did not improve the quality of the clusters created, according to this criterion, and that the best results were obtained for the temperature features, with 5 clusters and the Euclidean distance. Figure 3-2 shows a plot of the analysed stations according to their geographical location, coloured according to the cluster in which they have the highest membership degree, under the best RC conditions. Four stations in different regions are highlighted.

Figure 3-2 Geographical Distribution under best RC conditions, c=5

It is clear that the method was capable of recognizing and distinguishing the areas with the most different climatic profiles.

Prediction Criterion
The PC was applied under the same experimental conditions as the RC: multivariate and univariate time series were considered, and the Euclidean distance and the DTW were used as similarity measures, for a number of clusters between 2 and 10.
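The Prediction Criterion can be illustrated in a few lines: memberships are computed from the invariant part only, as in (3.4), the temporal part is then rebuilt from the temporal prototypes as in (3.2), and the result is scored with a variance-normalised squared error in the spirit of (3.5). This is a hedged sketch under stated assumptions: the function names, the prototype values and the optional per-point variance argument `var[k][j]` are hypothetical, and series are stored as `v_t[cluster][feature][point]`.

```python
def invariant_memberships(v_s, x_s, m=2.0):
    """Equation (3.4) for one sample: memberships from invariant distances only."""
    d2 = [sum((a - b) ** 2 for a, b in zip(v, x_s)) for v in v_s]
    hits = [d == 0.0 for d in d2]
    if any(hits):  # the sample sits exactly on one or more prototypes
        return [1.0 / sum(hits) if h else 0.0 for h in hits]
    return [1.0 / sum((dl / do) ** (1.0 / (m - 1.0)) for do in d2) for dl in d2]

def predict_series(u, v_t, m=2.0):
    """Equation (3.2): membership-weighted average of the temporal prototypes."""
    w = [ul ** m for ul in u]
    total = sum(w)
    return [[sum(w[l] * v_t[l][k][j] for l in range(len(v_t))) / total
             for j in range(len(v_t[0][k]))]
            for k in range(len(v_t[0]))]

def normalised_sq_error(pred, true, var=None):
    """Mean squared error over features and sampling points, each term divided
    by a caller-supplied variance var[k][j] (assumed 1 when not given)."""
    total, count = 0.0, 0
    for k in range(len(true)):
        for j in range(len(true[k])):
            s2 = 1.0 if var is None else var[k][j]
            total += (pred[k][j] - true[k][j]) ** 2 / s2
            count += 1
    return total / count

# Two made-up prototypes: locations and one temporal feature of length 3.
v_s = [[0.0, 0.0], [10.0, 0.0]]
v_t = [[[1.0, 1.0, 1.0]], [[3.0, 3.0, 3.0]]]

u_mid = invariant_memberships(v_s, [5.0, 0.0])   # station halfway between both
forecast = predict_series(u_mid, v_t)
error = normalised_sq_error(forecast, [[2.0, 2.0, 4.0]])
```

A station halfway between the two prototypes receives equal memberships, so its forecast is the average of the two temporal prototypes. Averaging `normalised_sq_error` over all test samples would give an APE-like figure for a given λ.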


The best result was obtained using the multivariate approach, again with the Euclidean distance, and 8 clusters. These conditions were used to forecast the temperature and humidity. The total samples were separated into training and testing sets:

$$ x_{train} = [\, x_{train}^s \;\; x_{train}^t \,] \tag{3.6} $$

and

$$ x_{test} = [\, x_{test}^s \;\; x_{test}^t \,] \tag{3.7} $$

The procedure followed is described in Figure 3-3.

Figure 3-3 Workflow representing process for temporal forecasting of test set

In this experiment, around 70% of the samples were used as the training set, $n_{train} = 117$, while the rest were used as the test set. The forecasting results of humidity and temperature for one exemplary test sample, under the best conditions, are shown in Figure 3-4 and Figure 3-5, respectively.

Figure 3-4 Humidity Predicting under best PC conditions

Figure 3-5 Temperature Predicting under best PC conditions

In the forecasting problem, the DTW did not improve on the Euclidean distance as a similarity measure. The multivariate approach achieved the best forecasts of temperature and humidity during 2009 at the selected stations.

4. Classification Problem: Medical System
An analogy was made with the spatiotemporal concept: in medical applications, the geographical location becomes the patient's demography (age, weight, height and sex, among other possibilities). In this equivalence, the temporal component comprises all time-varying features that characterize the system, such as heart rate, blood pressure and body temperature, measured over a period of time and represented as time series.

4.1. Modelling
Septic shock is a medical emergency that can occur as a reaction of the immune system to, for example, an operation. It is estimated to affect about 12% of patients in an Intensive Care Unit (ICU) and has a high death rate, which is reported to depend on the patient's age and overall health.
The database used, MEDAN, comprises several physiological features of patients diagnosed with abdominal septic shock, uniformly sampled during the whole period in which the patient was in the ICU (Paetz, 2003). This database was pre-processed by (Marques, Moutinho, Vieira, & Sousa, 2011), who analysed the most determinant features for outcome prediction, creating a sub-dataset of patients with measurements of 12 of the available features. This data underwent further processing, resulting in a data set with 100 samples, each comprised of:
- 2 time-invariant features: the patient's age and weight, each represented by a numeric value;
- 12 time-variant features representing physiological variables by time series with a sampling time of 24 hours, over the last 10 days of the patient's stay in an Intensive Care Unit;
- 1 outcome represented by a binary value, where 0 represents the patient's survival and 1 the patient's death.

4.2. Experimental Setup
The concept of classification based on clustering assumes that similar objects will share outcomes, and uses this knowledge to predict an object's classification. The classification approach proposed in this work is based on this concept and defines an object as belonging to a cluster if its membership degree is higher than a certain threshold. It then assumes that objects grouped together must share the same outcome. Thus, this concept is only valid for binary classifiers using two clusters, c = 2.
To evaluate the method's ability to predict an object's outcome, a 5-fold cross-validation was performed. At each fold, the train set is clustered to determine the optimal λ and the resulting clustering output v.
The membership degrees of each test sample are then determined from its distance to each cluster prototype, and the predicted outcome is assigned according to the highest membership degree. The experiments described in this section share the following experimental conditions:
Clustering Conditions:
o Minimal variation of the objective function: $\Delta J < \varepsilon = 10^{-8}$
o Maximum number of iterations: maxit = 500
o Fuzziness parameter: m = 2
Classification Conditions:
o 5-fold cross-validation
o Class distribution: 44%/56%
The Mixed Clustering methodology was applied under two learning approaches: unsupervised and supervised. The first partitions the data without knowledge of its outcome, while the second uses labelled samples for training, following the steps:
i. Unsupervised clustering of the train set to determine λ;
ii. Supervised clustering of the train set using λ to obtain the prototypes v;
iii. Unsupervised classification of the test set using v.
The criteria implemented to evaluate the quality of the outcome prediction are frequently used in health care problems (Lavrač, 1999):
- Accuracy: measures the number of correct classifications out of all samples classified;
- Sensitivity: accounts for the number of correct positive classifications out of all positive samples;
- Specificity: accounts for the number of correct negative classifications out of all negative samples.

4.3. Results and Discussion
The experiments performed with the Mixed Clustering include the use of data representations in the time domain (raw data) and the frequency domain (DFT), and of the Euclidean distance and the DTW as similarity measures. In addition to the mixed clustering, an alternative clustering using only the time-variant features, designated Temporal Clustering, was tested to assess the actual benefit of combining both information components.
A Forward Feature Selection method was used to assess the quality of each time-variant feature under all combinations of the conditions described. It was observed that the superiority of a similarity measure or time series representation depended on the feature. The benefit of the mixed clustering over the temporal clustering was also not universal across features. When the time-variant features were rich enough by themselves, the addition of the patient's demography misled the algorithm, leading to weaker results. However, when the temporal feature was weaker, it benefited from the mixed clustering approach.
The best overall Unsupervised Mixed Clustering result was obtained using the Euclidean distance with the DFT and one time-variant feature, no. 6, the Central Venous Pressure. Figure 4-1 shows the differences between the temporal and mixed alternatives, under unsupervised learning, for the best feature and for an example of a weaker temporal feature that benefited from the mixed clustering approach, feature 8: pH.

Figure 4-1 Unsupervised Mixed and Temporal Clustering Accuracy for features 6 and 8

While the addition of the patient's demography did not increase the performance of feature 6, the weaker feature 8 needed the additional information that came with it. Figure 4-2 shows the equivalent results for the supervised learning alternative.

Figure 4-2 Supervised Mixed and Temporal Clustering Accuracy for features 6 and 8

The best result under supervised clustering was also achieved for feature 6, using the DTW and DFT.
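The accuracy values compared in Figures 4-1 and 4-2, together with the sensitivity and specificity defined in the experimental setup, can be computed from the predicted and true outcomes. The sketch below is illustrative only: the function names, the membership values and the `cluster_outcome` mapping from cluster index to outcome are hypothetical, and the positive class is taken to be outcome 1 (death), as in the data set description.

```python
def predict_outcomes(u, cluster_outcome):
    """Assign each sample the outcome of the cluster where its membership
    u[l][i] is highest; cluster_outcome[l] is an assumed cluster-to-label map."""
    n = len(u[0])
    return [cluster_outcome[max(range(len(u)), key=lambda l: u[l][i])]
            for i in range(n)]

def outcome_scores(y_true, y_pred):
    """Return (accuracy, sensitivity, specificity) with 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    sensitivity = tp / (tp + fn) if tp + fn else float("nan")
    specificity = tn / (tn + fp) if tn + fp else float("nan")
    return accuracy, sensitivity, specificity

# Made-up memberships for three test samples in c = 2 clusters.
u_test = [[0.9, 0.2, 0.6],
          [0.1, 0.8, 0.4]]
predicted = predict_outcomes(u_test, cluster_outcome=[0, 1])

# Made-up labels for five patients.
acc, sens, spec = outcome_scores([1, 1, 0, 0, 1], [1, 0, 0, 0, 1])
```

In a cross-validation setting these scores would be computed on each test fold and then averaged.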
It is also shown that, for these features, the supervised clustering alternative improved on the results of the unsupervised alternative. This effect was not observed for all features; overall, however, supervised learning increased the performance of the features that were also the best under the unsupervised alternative, suggesting that the features most related to the outcome benefit from its inclusion in the learning process.

5. Conclusions and Future Work
A new expanded clustering algorithm was formulated to mine databases represented by both time-variant and time-invariant features, combining the information extracted to further characterize a given system. The results of the data mining and pattern recognition process were applied to machine learning purposes, where distinct methodologies were proposed to solve forecasting and classification problems, the former on a Meteorological System and the latter on a Medical application, demonstrating the method's wide applicability.

Different measures were implemented for similarity search between time series: the commonly used Euclidean distance and the increasingly popular Dynamic Time Warping. The benefit of the joint clustering of different types of data was also demonstrated by comparing it to the clustering of individual data types.
Table 5-1 shows the best result obtained in previous work on the same database. It should be noted that these results are not directly comparable, since the studies performed different processing on the input data and the methods used are different. The authors of (Cismondi, et al., 2012) used multi-criteria Feature Selection with Fuzzy Models (FM) and Neural Networks (NN) to predict the patient's outcome. While the FM constructed produced the best ACC, the Mixed Clustering produced comparable results using four times fewer features, two of which were numerical values, significantly easier to measure and process.

Table 5-1 Best Mixed Clustering and best previous work result

Reference                  Method             No. features   ACC (%)   Sens. (%)   Spec. (%)
(Cismondi, et al., 2012)   NN, Max. Sens.
                           NN, Max. Spec.
                           NN, Parallel
                           FM, Max. Sens.
                           FM, Max. Spec.
                           FM, Parallel
Mixed Clustering           Unsupervised       3*
                           Supervised         3*

* The Mixed Clustering used two constant features, the patient's age and weight, combined with one time-variant feature.

In addition, the Mixed Clustering method has the highest sensitivity, or true positive rate, which is crucial since the positive class represents a deceased patient.
As future work, it would be interesting to expand the clustering possibilities to any number of partitions and to databases with any number of classes. Since the DTW method is able to compare time series of different lengths, extending the method to form prototypes of variable length would expand the applicability of the mixed clustering method to databases with time series of different lengths.
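The property mentioned above, that DTW can compare series of different lengths and tolerates temporal misalignment, is easy to see in code. The following is a minimal textbook-style sketch, not the implementation used in this work: it assumes the absolute difference as local cost and no warping-window constraint, and the function name is illustrative.

```python
def dtw_distance(a, b):
    """Dynamic Time Warping cost between two numeric sequences, which may
    differ in length. D[i][j] is the cheapest alignment of a[:i] with b[:j]."""
    n, m = len(a), len(b)
    inf = float("inf")
    D = [[inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])      # assumed local cost
            D[i][j] = cost + min(D[i - 1][j],      # a[i-1] matched again
                                 D[i][j - 1],      # b[j-1] matched again
                                 D[i - 1][j - 1])  # both advance
    return D[n][m]

# The same peak, shifted by one sampling point: DTW aligns it at no cost,
# while the point-by-point Euclidean distance is non-zero.
shifted = dtw_distance([0, 1, 0, 0], [0, 0, 1, 0])
euclid = sum((x - y) ** 2 for x, y in zip([0, 1, 0, 0], [0, 0, 1, 0])) ** 0.5

# Series of different lengths can be compared directly.
stretched = dtw_distance([1, 2, 3], [1, 2, 2, 3])
```

The quadratic table makes the cost O(n·m) per pair of series, which is one reason warping-window constraints are often added in practice.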
Also, a reformulation of the method should include the possibility of using a different similarity measure for each feature, as well as a different influence for each, through the implementation of separate temporal influence parameters $\lambda_i$, where i = 1, 2, ..., f.
Even though one of the great advantages of data mining and soft computing techniques is their ability to treat any problem specific to a given field as a generalized system, the final step in the KDD approach is the interpretation of the results, bringing the problem back to its field and enabling practical conclusions. Thus, the medical application demonstrated would benefit from further analysis of the best features that resulted from the feature selection algorithms, possibly raising awareness of the importance of a feature in the medical community. In this context, a feature sensitivity study could also be performed on the time-variant and time-invariant features, pre-assessing the quality of the knowledge they contain. The causes of septic shock are not yet fully understood; however, some risk factors have been studied (Fink, Abraham, Vincent, & Kochanek, 2005) and could be inserted in the Mixed Clustering method as time-invariant features.
Finally, the validation of the mixed clustering methodology requires its application to problems from different domains and fields, such as Financial, Power Consumption or Surveillance applications. The use of benchmark databases can demonstrate its value against different techniques. However, due to the specific characteristics of the mixed clustering's inputs, there is a shortage of available databases (Keogh & Kasetty, 2003).

References
ARD. (n.d.). Current and Historical Alberta Weather Station Data Viewer. Retrieved May 2014.
Bezdek, J. C., Ehrlich, R., & Full, W. (1984). FCM: The fuzzy c-means clustering algorithm. Computers & Geosciences, 10.
Cismondi, F., Horn, A. L., Fialho, A. S., Vieira, S. M., Reti, S. R., Sousa, J. M., et al. (2012). Multistage Modeling Using Fuzzy Multi-criteria Feature Selection to Improve Survival Prediction of ICU Septic Shock Patients. Expert Systems with Applications, 39.
Devijver, P. A., & Kittler, J. (1982). Pattern Recognition: A Statistical Approach. Prentice-Hall.
Fayyad, U., Piatetsky-Shapiro, G., & Smyth, P. (1996). From data mining to knowledge discovery in databases. AI Magazine, 17.
Fink, M., Abraham, E., Vincent, J., & Kochanek, P. M. (2005). Septic Shock. In Textbook of Critical Care (5th ed.). Saunders Elsevier.
Gaudin, R., & Nicoloyannis, N. (2006). An Adaptable Time Warping Distance for Time Series Learning. 5th International Conference on Machine Learning and Applications (ICMLA 06). Orlando, USA.
Han, J., & Kamber, M. (2006). Data Mining: Concepts and Techniques (2nd ed.). Morgan Kaufmann Publishers.
Izakian, H., Witold, P., & Jamal, I. (2013, October). Clustering Spatiotemporal Data: An Augmented Fuzzy C-Means. IEEE Transactions on Fuzzy Systems, 21.
Keogh, E., & Kasetty, S. (2003, October). On the Need for Time Series Data Mining Benchmarks: A Survey and Empirical Demonstration. Data Mining and Knowledge Discovery, 7.
Lavrač, N. (1999). Artificial Intelligence in Medicine: Machine Learning for Data Mining in Medicine (Vol. 1620).
Liao, T. W. (2005, November). Clustering of time series data - a survey. Pattern Recognition.
Marques, F. J., Moutinho, A., Vieira, S. M., & Sousa, J. M. (2011). Preprocessing of Clinical Databases to improve classification accuracy of patient diagnosis. World Congress.
Paetz, J. (2003). Knowledge-based approach to septic shock patient data using a neural network with trapezoidal activation functions. Artificial Intelligence in Medicine, 28.
Rani, S., & Sikka, G. (2012). Recent Techniques of Clustering of Time Series Data: A Survey. International Journal of Computer Applications, 52(15).
