Data mining and modeling to predict the necessity of vasopressors for sepsis patients


Data mining and modeling to predict the necessity of vasopressors for sepsis patients

José Miguel Mourinho Rodrigues

Thesis to obtain the Master of Science Degree in Mechanical Engineering

Examination Committee
Chairperson: Professor João Rogério Caldas Pinto
Supervisor: Professor João Miguel da Costa Sousa
Co-supervisor: Doctor Susana Margarida da Silva Vieira
Members of the committee: Professor Luís Manuel Fernandes Mendonça

June 2013


Acknowledgments

I would like to thank my research advisors: professor João Sousa, for his professionalism throughout the whole endeavor; Dr. Susana Vieira, for her constant attention; and professor Luís Mendonça, for his constant support and encouragement. Thanks must also go to André Fialho for his insights, availability, and help regarding the database used. A general thanks to all the people at Centro Académico Edith Stein, especially to those who wrote their theses there, for making me feel I'm not alone. A particular thanks goes to Felipe Blanco, Pedro Viegas, Joana Peleja and Pedro Antunes for their friendship and for keeping me focused on the tasks at hand. A special thanks goes also to João Campos and Paulo Araújo for their prayers and emotional support. A final word of thanks goes to my family, for their constant support, and to whom I dedicate this work.


Resumo

Shock is a life-or-death medical condition that requires the administration of powerful drugs - vasopressors. The timely identification of these patients, so that therapy can be prepared, is an important goal. A set of the most frequently sampled variables readily available in an intensive care unit was used for grouping - clustering - patients. A data exploration process was then initiated using fuzzy clustering with the fuzzy c-means algorithm; four clusters were obtained and the characteristics of the groups were analyzed. A relationship between the clusters obtained and the use of vasopressors was found, and these results were visualized with the help of histograms. First, a general model was obtained. Then, four models were trained and used in a multi-model approach, one for each of the identified patient groups. For the multi-model approach, two decision criteria were used. First, an a priori decision based on the distance between the cluster centers and the patients' feature values was used, and then an a posteriori decision using each of the models, in which the final value used is based on the uncertainty of the output - the response relative to the threshold - of each model. The multi-model approach with a posteriori decision performed best of the two schemes tested, and also achieved better results than the general model.

Keywords: Multi-model, clustering, fuzzy modeling, vasopressors


Abstract

Shock is a life-threatening medical condition requiring the administration of powerful drugs - vasopressors. Early identification of these patients is a worthy goal, so that they can be prepared for therapy in time. A subset composed of the most frequently sampled and readily available variables in an intensive care unit (ICU) was used for clustering patients. Then, a data exploration process was executed through fuzzy clustering based on the fuzzy c-means algorithm. Four clusters were obtained and the groups' characteristics were analyzed. A relationship between the clusters obtained and the use of vasopressors was found, and these results were visualized with the help of histograms. First, a single general model was derived. Then, four models were trained and used in a multi-model approach, one for each identified group of patients. For the multi-model approach, two decision criteria were used: 1) an a priori decision based on the distance from the cluster centers to the patient's characteristics, and 2) an a posteriori decision in which every model is applied and the final outcome is based on the uncertainty of each model's output relative to its threshold. The multi-model approach with a posteriori decision performed best of the two tested schemes, and also performed better than the single general model approach.

Keywords: Multi-model, fuzzy clustering, fuzzy modeling, vasopressors


Contents

Acknowledgments
Resumo
Abstract
List of Tables
List of Figures
Nomenclature
1 Introduction
    1.1 Problem overview
    1.2 Data mining in medical care
    1.3 Related work
    1.4 Contributions
2 Clustering
    2.1 Hierarchical Clustering
    2.2 Partitioning Clustering
    2.3 Fuzzy Clustering
        2.3.1 Fuzzy c-means
        2.3.2 Other fuzzy clustering algorithms
    2.4 Validation Measures
        2.4.1 Partition Coefficient
        2.4.2 Partition Entropy
        2.4.3 Partition Index
        2.4.4 Separation Index
        2.4.5 Xie-Beni Index
        2.4.6 Other validation measures
3 Knowledge Discovery in Databases
    3.1 Modeling
        3.1.1 Fuzzy modeling
        3.1.2 Model assessment
        3.1.3 Model layouts

4 Preprocessing of MIMIC II Database
    MIMIC II Database
    Vasopressors subset
    Preprocessing
    Feature Selection
    Fluid subset
5 Clustering of MIMIC II Database
    Clustering
    Full dataset
    First data reduction
    Clustering with high data
    Cluster Evaluation
    Clusters obtained
    Cluster centers
    Main features histograms
    Demographics
    Clusters by pathology
6 Results
    Single-model Results
    Multi-model Results
    Best results comparison
7 Conclusions
A Data Partitioning - GK Results
    A.1 All data, all features
    A.2 All features, last point
    A.3 5 features
    A.4 Last records, 5 features
B Cluster Histograms
    B.1 Physiological variables
    B.2 Variables at ICU entrance/exit
    B.3 Output variable
    B.4 Cluster distribution by pathology
C Projections
References

List of Tables

4.1 Physiological variables
4.2 Static variables
4.3 Chosen data features with the low sample times
5.1 Number of patients and feature mean values for each cluster
6.1 Single-model results
6.2 Multi-model results after optimization
6.3 Single-model and multi-model results
B.1 Mean of physiological variables by cluster


List of Figures

1.1 The interrelationship between SIRS, sepsis and infection
3.1 KDD processes
3.2 Confusion matrix
3.3 Finding the ROC point of maximum AUC by summing the areas of the trapezoid
Multi-model scheme with decision a priori based on cluster centers
Multi-model scheme with decision a posteriori
Classifier values
Patient records plot of the first 10 entries
Validation measures for a set of 20 runs for each number of clusters between 2 and 20 (mean and variance) using the full feature set
Validation measures for FCM clustering of the last three records for every patient and the five most frequently sampled features
Validation measures for the reduced set and most sampled variables
Patients in cluster and vasopressor distribution
Most frequently sampled features histograms
Patient distribution per cluster by demographics
Patient distribution per cluster by pathology
A.1 Validation measures for the GK clustering of the full data set
A.2 Validation measures using the GK clustering for the last 3 records of each patient and all features
A.3 Validation measures for GK clustering of the last 3 records of each patient and the 5-feature subset data
A.4 Validation measures for GK clustering of the 5-feature subset data and the mean of the last three records
B.1 Heart rate distribution per cluster
B.2 Temperature distribution per cluster
B.3 SpO2 distribution per cluster
B.4 Respiratory rate distribution per cluster

B.5 GCS total distribution per cluster
B.6 Braden score distribution per cluster
B.7 Hematocrit distribution per cluster
B.8 Platelets distribution per cluster
B.9 White blood cells (WBC) distribution per cluster
B.10 Hemoglobin distribution per cluster
B.11 Red blood cells (RBC) distribution per cluster
B.12 Blood urea nitrogen (BUN) distribution per cluster
B.13 Creatinine distribution per cluster
B.14 Glucose distribution per cluster
B.15 Potassium distribution per cluster
B.16 Chloride distribution per cluster
B.17 Sodium distribution per cluster
B.18 Magnesium distribution per cluster
B.19 Non-invasive blood pressure (NBP) distribution per cluster
B.20 NBP mean distribution per cluster
B.21 Arterial pH distribution per cluster
B.22 Arterial base excess distribution per cluster
B.23 Lactic acid distribution per cluster
B.24 Urine output distribution per cluster
B.25 Age distribution per cluster
B.26 Sex distribution per cluster
B.27 Mortality distribution per cluster
B.28 SOFA score distribution per cluster
B.29 Vasopressors administration distribution
B.30 Pneumonia patients distribution per cluster
B.31 Pancreatitis patients distribution per cluster
B.32 SIRS patients distribution per cluster
C.1 Projections

Nomenclature

Abbreviations
AUC   Area under curve.
BUN   Blood urea nitrogen.
CE    Classification Entropy.
FCM   Fuzzy c-means algorithm.
GK    Gustafson-Kessel algorithm.
ICU   Intensive care unit.
KDD   Knowledge discovery in databases.
NBP   Non-invasive blood pressure.
PC    Partition Coefficient.
RBC   Red blood cells.
ROC   Receiver operating characteristic curve.
SAPS  Simplified Acute Physiology Score.
SIRS  Systemic inflammatory response syndrome.
SOFA  Sequential Organ Failure Assessment score.
WBC   White blood cells.
XB    Xie-Beni index.

Greek symbols
δ_i   Threshold of model i.

Roman symbols
C     Number of clusters.
e     Stopping criterion threshold.

M_i   Model i.
O_i   Output of model i.
S     Separation index.
S_c   Partition index.

Subscripts
f           fuzzy.
h           hard.
i, j, k, l  Computational indexes.

Superscripts
c     Cluster number.
m     Fuzziness index.

Chapter 1

Introduction

The aim of this study is to address the prediction of the administration of vasopressors to septic shock patients in intensive care units (ICU). In particular, our hypothesis is that a multi-model approach to the prediction of vasopressor need, based on fuzzy clustering, leads to improved performance compared to a one-model-fits-all approach. First, a problem overview regarding the use of vasopressors in medical care is given, followed by a study of data mining techniques used in medical care, then a review of related work, and finally the specific contribution of this dissertation.

1.1 Problem overview

Shock is a life-threatening medical emergency that can be defined as acute circulatory failure with inadequate or inappropriately distributed tissue perfusion resulting in generalized cellular hypoxia [1]. This means that end cells aren't getting enough blood, depriving them of the oxygen they need, which in turn can lead to tissue death, causing organ failure. The maintenance of end-organ perfusion is critical to prevent irreversible organ injury and failure, and this frequently requires the use of fluid resuscitation and vasopressors [2], medicines that contract blood vessels so as to increase blood pressure in critically ill patients. Unlike most clinical conditions, for which a clinical diagnosis is made before treatment is initiated, the treatment of shock often occurs at the same time as, or even before, the diagnostic process [2]. Thus it is possible for patients who would not require the medication to have it administered anyway. The added risk of the administration procedure must also be considered: when performed urgently, it can lead to infections, increasing costs in the end.

If a prediction can be made beforehand on which patients are going to need vasopressors, costs will be reduced, since fewer patients will receive unnecessary treatment and catheter placement can be performed in a more timely manner, making it less prone to medical complications.

Figure 1.1: The interrelationship between SIRS, sepsis and infection.

Figure 1.1 shows the interrelationship between systemic inflammatory response syndrome (SIRS), sepsis, and infection. Essentially, there is a condition marked by a generalized inflammatory response of the body which, when caused by a blood-borne infection, is called sepsis. Depending on the source of the blood-borne infection, it can be called bacteremia - in the case of pneumonia, for example -, fungemia, parasitemia or viremia, among others. Nonetheless, this state of inflammatory response can also have other causes, such as burns, trauma, or the initial stage of pancreatitis [3] - also a focus of this study. A common development of these conditions is shock, called septic shock when it develops from sepsis, and it is that condition that is addressed in this study for the prediction of vasopressors.

1.2 Data mining in medical care

Data mining is the analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data owner [4]. Data mining, while recent, is not something new, as it has been used intensively and extensively in other areas: by financial institutions, for credit scoring and fraud detection; by marketers, for direct marketing and cross-selling or up-selling; by retailers, for market segmentation and store layout; and by manufacturers, for quality control and maintenance scheduling [5].

In medical care, the adoption of data mining has been slower, owing to the limited availability of data due to privacy and legal issues [6]. But this trend is changing as, according to [5], in health care data mining is becoming increasingly popular, if not increasingly essential. Some driving factors are the existence of medical insurance fraud and abuse; the ever increasing volume of data generated by health care transactions, too complex to treat by traditional methods; financial pressures to increase operating efficiency while maintaining a high level of care; and the realization that data mining can generate information that is very useful for all parties involved in the industry, from health care insurers and providers to customers. There are many applications of data mining in the field of health care. In [5] these applications are grouped in four distinct areas: evaluation of treatment effectiveness; management of health care; customer relationship management; and detection of fraud and abuse. The limitations of health care data mining deal with the availability and quality of the data. The data has to be collected and integrated before data mining is attempted. There can be missing, corrupted, inconsistent or non-standardized data due to different formats and sources. The data can also be unavailable due to ethical, legal and social issues, such as data ownership [5]. Additionally, the databases can be primarily designed for financial/billing purposes rather than for medical/clinical purposes, resulting in lower-quality clinical data for data mining [6]. The success of health care data mining hinges on the availability of clean health care data. Possible future directions include the standardization of clinical vocabulary and the sharing of data across organizations [5].

1.3 Related work

One particular focus of the present work is the combination of multiple models for prediction.
One application of model combination is in weather forecasting, in particular for hydrological forecasts of rainfall-runoff models for water discharge prediction, such as in [7]. Initial studies have looked into combining the outputs of different models through various methods; that study uses a simple average method, a weighted average method and a neural network method. In the merging of forecast models, models have also been combined non-linearly using fuzzy rules, as in [8]. In [9] the authors address the question of whether multi-model combination can really enhance the prediction skill of ensemble forecasts when less skillful models are included, and also whether a multi-model can perform better than the best single model available, assuming that there is a best model and that it can be identified. Another use of multi-model combination is in hybrid systems control applications in the process industry, where models of different setups are used. One such application is in the field of fault diagnosis, as in

[10]. Here a multi-model architecture is used in which a fuzzy decision-making approach isolates faults based on the analysis of the residuals - the differences between the system output and the outputs of the models identified with and without faults. Also of relevance is the machine learning field of ensemble learning, the process by which multiple models, such as classifiers or experts, are strategically generated and combined to solve a particular computational intelligence problem [11]. Ensemble learning is primarily used to improve the (classification, prediction, function approximation, etc.) performance of a model, or to reduce the likelihood of an unfortunate selection of a poor one. Random forest [12] is a representative algorithm, consisting of many decision trees that vote to select class membership. A recent paper [13] addresses the task of combining classifiers, from the creation of ensembles to decision fusion; for further study, the book [14] on combining pattern classifiers, of which that paper is a summary, is advised. More recently, a survey of decision fusion and feature fusion strategies for pattern classification became available in [15]. Finally, for further research, some related areas are the fields of data fusion, decision fusion, ensemble learning and clustering ensembles.

1.4 Contributions

This thesis follows the work done on the vasopressor data subset of MIMIC II, as compiled by André Fialho and indicated in [16]. In the present work the dataset is similar, but a different approach is taken. In particular, an alternative selection of features is presented, and these features are used for clustering patients with a focus on visualizing the results. Like [17], it builds upon the notion of using a multi-model architecture and combining the models to create a better performing classification algorithm.
Unlike that work, though, in the present work fuzzy clustering is used to build these models, and different methodologies are presented to combine them. In [6] it is said that in biomedicine clustering is normally used for microarray data analysis rather than for general health care data analysis, since in that application very little is known about the genes, while more information is known about health conditions and disease symptoms; it is mentioned, in particular, that clustering is often used when little or no information is available. The present study contributes to showing that there is a place for clustering in health care, either as an aid to modeling, or used alone to gain new insights and confirm medical information. Finally, following this work, a paper was submitted and accepted for presentation at the 2013 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE 2013) and for publication in the conference proceedings published by IEEE.
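The two decision criteria used to combine the per-cluster models (introduced in the Abstract and evaluated in the Results chapter) can be sketched as follows. This is only an illustration: the function names, the toy "models" (constant functions) and the thresholds are assumptions made here, not the thesis's actual implementation.

```python
import numpy as np

def a_priori_decision(x, centers, models):
    """Use only the model whose cluster center is closest to patient x."""
    d2 = ((centers - x) ** 2).sum(axis=1)
    return models[int(np.argmin(d2))](x)

def a_posteriori_decision(x, models, thresholds):
    """Run every model and keep the output that is least uncertain,
    i.e. farthest from its own decision threshold."""
    outs = np.array([m(x) for m in models])
    return float(outs[int(np.argmax(np.abs(outs - thresholds)))])

# Toy usage: two patient groups, one hypothetical "model" per group
centers = np.array([[0.0], [10.0]])
models = [lambda x: 0.2, lambda x: 0.9]
x = np.array([1.0])
print(a_priori_decision(x, centers, models))                   # 0.2 (closest center)
print(a_posteriori_decision(x, models, np.array([0.5, 0.5])))  # 0.9 (least uncertain)
```

The a priori rule commits to one model before seeing any output, while the a posteriori rule spends the extra computation of evaluating every model in exchange for a confidence-based choice.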



Chapter 2

Clustering

Clustering is an unsupervised learning task that aims at decomposing a given set of objects into subgroups or clusters based on similarity. The goal is to divide the data set in such a way that objects belonging to the same cluster are as similar as possible, whereas objects belonging to different clusters are as dissimilar as possible [18]. Cluster analysis is primarily a tool for discovering previously hidden structure in a set of unordered objects. In this case one assumes that a true or natural grouping exists in the data; however, the assignment of objects to the classes and the description of these classes are unknown [18]. The grade of similarity is obtained through distance functions that are part of the clustering methods and that measure the dissimilarity of the presented example cases. Traditionally, clustering techniques have been divided into two main groups, hierarchical and partitioning [19], though more groupings can be stated, such as density-based or grid-based clustering.

2.1 Hierarchical Clustering

Hierarchical techniques organize data in a nested sequence of groups, which can be visualized in the form of a dendrogram or tree. Based on a dendrogram, one can decide on the number of clusters at which the data are best represented for a given purpose [18]. These methods can be further divided into agglomerative and divisive methods, depending on whether a bottom-up sequential aggregation of data points into a tree is performed, or a top-down division of the data into several groups is made instead. The most used algorithms for hierarchical clustering are known as single linkage (also known as nearest neighbor) and complete linkage (furthest distance), based on the distance function used.

It is necessary to note that, in this type of clustering, once a data point is assigned to a given cluster it remains there until the end, making the method more sensitive to outliers and initial conditions.

2.2 Partitioning Clustering

Another clustering type is partitioning clustering. Given a positive integer c, these algorithms aim at finding the best partition of the data into c groups based on the given dissimilarity measure, and they regard the space of possible partitions into c subsets only [18]. One of the better known partitioning clustering methods is K-means. Many other methods were designed as variations of parts of this algorithm, such as K-medoids, which uses points from the set as centers (medoids), or K-medians, which uses medians instead of means. K-means [20] is one of the simplest unsupervised learning algorithms that solve the well known clustering problem. It aims at minimizing the objective function

$$J_h(X, U_h, C) = \sum_{i=1}^{c} \sum_{j=1}^{n} u_{ij}\, d_{ij}^2 \quad (2.1)$$

where $d_{ij}$ is the distance function (for the standard K-means, the Euclidean distance) and $u_{ij}$ the partition matrix. K-means is also known as hard c-means, in the light of the fuzzy c-means clustering algorithm to be described, since in classical (hard) cluster analysis each datum is assigned to exactly one cluster.

2.3 Fuzzy Clustering

If the requirement $u_{ij} \in \{0, 1\}$ placed on the cluster assignments in hard partitioning approaches is relaxed, gradual memberships of data points, measured as degrees in $[0, 1]$, become possible. A data point can thus belong to more than one cluster. The concept of these membership degrees is built upon the notion of fuzzy sets, as introduced in [21].

2.3.1 Fuzzy c-means

Fuzzy partitioning is carried out through an iterative optimization of the objective function (Eq. 2.2), with the update of the memberships $u_{ij}$ and the cluster centers $c_j$.

$$J_f(X, U_f, C) = \sum_{i=1}^{c} \sum_{j=1}^{n} (u_{ij})^m\, d_{ij}^2 \quad (2.2)$$

where $u_{ij}$ is the partition matrix, $m > 1$ is the fuzziness index, and $d_{ij}$ is a distance function - for the standard FCM case, the Euclidean distance. The problem is then the optimization of the objective function $J_f$ (Eq. 2.2). This method was first described in [22] with $m = 2$, and later generalized in [23] to the formula in its current form. The update equations are

$$u_{ij} = \frac{1}{\sum_{l=1}^{c} \left( d_{ij}^2 / d_{lj}^2 \right)^{\frac{1}{m-1}}} \quad (2.3)$$

$$c_i = \frac{\sum_{j=1}^{n} (u_{ij})^m x_j}{\sum_{j=1}^{n} (u_{ij})^m} \quad (2.4)$$

2.3.2 Other fuzzy clustering algorithms

As with K-means, there are many possible variations leading to other algorithms that can take advantage of particular cases known in advance, such as the presence of given shapes (ellipsoids, lines) or noise. One notable variation is the Gustafson-Kessel (GK) clustering algorithm, where the distance function is changed to the Mahalanobis distance in order to detect clusters of different size and orientation. This allows it to extract more information, but makes the algorithm more sensitive to initialization and computationally more demanding. Also of note is kernel-based fuzzy clustering, where the distance function is further modified to handle non-vectorial data such as sequences, trees or graphs.

2.4 Validation Measures

Usually the number of (true) clusters in the given data is unknown in advance. However, when using the partitioning methods one is usually required to specify the number of clusters c as an input parameter, so estimating the actual number of clusters is an important issue [18]. The name given to these criteria is not consistent in the literature: they are called validation measures, validity criteria, evaluation measures, or validity
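As a concrete sketch, the alternating updates of Eqs. (2.2)-(2.4) can be written in a few lines of NumPy. This is a minimal illustration only (the function name, random initialization and stopping rule are choices made here, not the implementation used in the thesis):

```python
import numpy as np

def fcm(X, c, m=2.0, eps=1e-5, max_iter=100, seed=0):
    """Minimal fuzzy c-means: alternate the center update (Eq. 2.4)
    and the membership update (Eq. 2.3) until U stops changing."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    U = rng.random((c, n))
    U /= U.sum(axis=0)                       # each column sums to 1
    for _ in range(max_iter):
        Um = U ** m
        centers = (Um @ X) / Um.sum(axis=1, keepdims=True)        # Eq. 2.4
        d2 = ((X[None, :, :] - centers[:, None, :]) ** 2).sum(axis=2)
        d2 = np.fmax(d2, 1e-12)              # guard against division by zero
        inv = d2 ** (-1.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=0)        # Eq. 2.3
        if np.abs(U_new - U).max() < eps:
            U = U_new
            break
        U = U_new
    return centers, U
```

On two well-separated groups of points, the returned memberships are close to crisp, which is the behavior the validation measures of the next section quantify.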

indices. As in [24], the term used in this work will be validation measures. Such measures can be used to evaluate the clustering quality quantitatively and to compare algorithms with one another. They can also be applied to compare the results obtained with a single algorithm when the parameter values are changed. In particular, they can be used to select the optimal number of clusters: applying the algorithm for several values of c, the value leading to the optimal decomposition according to the considered criterion is selected [18]. External clustering validation and internal clustering validation are the two main categories of clustering validation; the main difference is whether or not external information is used [25]. Internal validation measures rely only on information in the data, and are therefore applicable to situations where no prior knowledge, such as the true number of clusters or previously known classes, is available. They are thus more suitable for an exploratory knowledge discovery process such as the one in this study, and are less database specific. In the literature, a number of internal clustering validation measures for crisp clustering have been proposed, such as the Dunn index [22], the silhouette index [26] or the Davies-Bouldin index [27]. But starting with Bezdek in 1975 [23], other measures were proposed specifically for fuzzy clustering, which use information about the partition matrix and other fuzzy clustering parameters. Moreover, using a validity measure intended for crisp clustering on a fuzzy clustering would make the results dependent on some kind of defuzzification scheme. For these reasons we focused on internal and fuzzy validation measures alone, some of which are now presented.

2.4.1 Partition Coefficient

The partition coefficient measures the amount of overlapping between clusters.
It is defined by Bezdek as follows [23]:

$$PC = \frac{1}{n} \sum_{i=1}^{c} \sum_{j=1}^{n} u_{ij}^2 \quad (2.5)$$

2.4.2 Partition Entropy

The partition entropy computes the entropy of the obtained membership degrees, and must be minimized. Like the partition coefficient, it measures the fuzziness of the cluster partition only, and is defined as [23]:

$$PE = -\frac{1}{n} \sum_{i=1}^{c} \sum_{j=1}^{n} u_{ij} \log u_{ij} \quad (2.6)$$
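For a c-by-n partition matrix U whose columns sum to one, both measures are essentially one-liners (a sketch; the small constant added inside the logarithm is an assumption here to avoid log(0) for crisp memberships):

```python
import numpy as np

def partition_coefficient(U):
    # Eq. 2.5: mean squared membership; ranges from 1/c (fully fuzzy)
    # up to 1 (crisp partition)
    return (U ** 2).sum() / U.shape[1]

def partition_entropy(U):
    # Eq. 2.6: entropy of the memberships; 0 for a crisp partition,
    # log(c) for a completely fuzzy one
    return -(U * np.log(U + 1e-12)).sum() / U.shape[1]
```

A crisp partition gives PC = 1 and PE near 0, while uniform memberships over two clusters give PC = 0.5 and PE = log 2, matching the stated optimization directions (maximize PC, minimize PE).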

2.4.3 Partition Index

The partition index is the ratio of the sum of compactness and separation of the clusters. It is a sum of individual cluster validity measures, normalized through division by the fuzzy cardinality of each cluster [28]:

$$S_c(c) = \sum_{i=1}^{c} \frac{\sum_{j=1}^{N} (\mu_{ij})^m \|x_j - v_i\|^2}{N_i \sum_{k=1}^{c} \|v_k - v_i\|^2} \quad (2.7)$$

2.4.4 Separation Index

In contrast to the partition index $S_c$, the separation index uses a minimum-distance separation for partition validity [28]:

$$S(c) = \frac{\sum_{i=1}^{c} \sum_{j=1}^{N} (\mu_{ij})^2 \|x_j - v_i\|^2}{N \min_{i,k} \|v_k - v_i\|^2} \quad (2.8)$$

2.4.5 Xie-Beni Index

Also intended for fuzzy clustering, and in widespread use, is the Xie-Beni index, which quantifies the ratio between the total variation within the clusters and the separation of the clusters [29]:

$$XB(c) = \frac{\sum_{i=1}^{c} \sum_{j=1}^{N} (\mu_{ij})^m \|x_j - v_i\|^2}{N \min_{i,k} \|v_i - v_k\|^2} \quad (2.9)$$

A better clustering is obtained by minimizing XB(c). However, the Xie-Beni index has been shown to be monotonically decreasing for high numbers of clusters. This can be addressed by calculating one of the corrected Xie-Beni indices available in the literature; in practical terms, however, only a rather small number of clusters is usually sought, and so the uncorrected Xie-Beni index is still used, given its widespread adoption.

2.4.6 Other validation measures

Other possible validation measures that were not used in this study, but are used often enough to at least name them, are the Fukuyama-Sugeno index [30] and Gath and Geva's validation measures, the fuzzy hypervolume and the partition density [31]. These last two, in particular, are better suited to GK clustering, as they require the calculation of the covariance matrix, which is already part of the GK clustering algorithm.
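The separation-based measures share the same structure, a within-cluster variation term over a cluster-separation term, so they can be sketched together. Note the classical Xie-Beni separation term (minimum squared distance between cluster centers) is used here; with m = 2 the two measures coincide:

```python
import numpy as np

def _within_and_separation(X, centers, U, m):
    # numerator: sum over clusters of (mu^m * squared point-to-center distance)
    d2 = ((X[None, :, :] - centers[:, None, :]) ** 2).sum(axis=2)   # (c, n)
    within = ((U ** m) * d2).sum()
    # separation: minimum squared distance between distinct centers
    c2 = ((centers[None, :, :] - centers[:, None, :]) ** 2).sum(axis=2)
    sep = c2[~np.eye(len(centers), dtype=bool)].min()
    return within, sep

def separation_index(X, centers, U):
    within, sep = _within_and_separation(X, centers, U, m=2.0)  # Eq. 2.8
    return within / (X.shape[0] * sep)

def xie_beni(X, centers, U, m=2.0):
    within, sep = _within_and_separation(X, centers, U, m=m)    # Eq. 2.9
    return within / (X.shape[0] * sep)
```

On a small one-dimensional example, centers placed at the two natural groups give a lower (better) Xie-Beni value than centers squeezed together in the middle, as expected for a measure to be minimized.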


Chapter 3

Knowledge Discovery in Databases

Knowledge discovery in databases (KDD) is the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data [32].

Figure 3.1: KDD processes. [32] [33]

According to Fayyad [32], there are essentially five steps in the KDD process: selection, preprocessing, transformation, data mining and interpretation (also called evaluation). From an engineering background (systems modeling), these steps could be called data acquisition, data preprocessing, feature selection, modeling and interpretation without significant loss of meaning, as in [33] and as shown in Figure 3.1. The overall process of finding and interpreting patterns in data involves the repeated application of these steps, now explained in more detail:

1. Data acquisition / selection - Comprises the creation or selection of a target dataset, or of the particular variables on which to perform data discovery. It is an important task, as the quality of the

data will impact the quality of the data discovery process.

2. Data preprocessing - This step focuses on the cleaning or preprocessing of the data. In particular, it deals with noise and outlier removal, noise modeling, handling missing data fields, and accounting for time sequence information.

3. Feature selection / Transformation - Includes data reduction and projection, so as to represent the data in a form more amenable to data mining. It tries on the one hand to select the fewer variables that best represent the data, while on the other hand eliminating redundancy that can impair the data mining process.

4. Modeling / Data mining - In this step, a data mining method is selected depending on the goal of the analysis. Six common tasks are:

- Anomaly detection (outlier/change/deviation detection) - the identification of unusual data records that might be interesting, or of data errors that require further investigation.
- Association rule learning (dependency modeling) - searches for relationships between variables, sometimes referred to as market basket analysis.
- Clustering - the task of discovering groups and structures in the data that are in some way similar, without using known structures in the data.
- Classification - the task of generalizing known structure to apply to new data.
- Regression - attempts to find a function which models the data with the least error.
- Summarization - provides a more compact representation of the data set, including visualization and report generation.

5. Interpretation - In this step, the patterns obtained through data mining are evaluated to see whether they are interesting or not; it can thus also be called evaluation. The duty is to represent the result in an appropriate way, so that it can be examined thoroughly. If a pattern is not interesting, the cause has to be found, and more attempts can be made, with some previous steps redone.
3.1 Modeling

The choice of the modeling technique to be used may depend on many factors, including the source of the data set and the values that it contains. Methods based on fuzzy systems inherit model transparency and enjoy good function approximation properties. For this reason, fuzzy modeling was used in this work along with fuzzy clustering.

3.1.1 Fuzzy modeling

Fuzzy modeling is a tool that allows the approximation of nonlinear systems when there is little or no previous knowledge of the problem to be modeled [34].

This approach provides a transparent, non-crisp model, and also makes possible a linguistic interpretation in the form of rules and logical connectives. These are used to establish relations between the defined features in order to derive a model. A fuzzy classifier contains a rule base consisting of a set of fuzzy if-then rules, together with a fuzzy inference mechanism. These models ultimately classify each instance of the dataset as belonging, with a certain degree, to one of the possible classes defined for the specific problem being modeled. As suggested in [35], a discriminant method was used in this work, where the classification is based on the largest discriminant function. In this method, a separate discriminant function $d^c(x)$ is associated with each class $w_c$, with $c = 1, \ldots, C$. The discriminant functions can be implemented as fuzzy inference systems. Here, we use Takagi-Sugeno (TS) fuzzy models [36], with which each discriminant function consists of rules of the type:

Rule $R_i^c$: If $x_1$ is $A_{i1}$ and ... and $x_M$ is $A_{iM}$ then $d^c(x) = f_i^c$, $\quad i = 1, 2, \ldots, K$,

where $f_i^c$ is the consequent function for rule $R_i^c$. In these K rules, the index c indicates that the rule is associated with the output class c. Note that the antecedent parts of the rules can be different for different discriminants, as can the consequents. The classifier assigns the class label corresponding to the maximum value of the discriminant functions, i.e.

$$\max_c d^c(x). \quad (3.1)$$

3.1.2 Model assessment

In describing the performance of binary classifiers, the accuracy of classification cannot be considered alone [37]. Both the sensitivity, or hit rate, and the specificity, or true rejection rate, must also be analyzed.
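The decision rule (3.1) is independent of how the discriminants are implemented, so it can be illustrated with stand-in score functions. The linear functions below are purely hypothetical placeholders for the TS fuzzy discriminants $d^c(x)$:

```python
import numpy as np

# Hypothetical stand-ins for the per-class discriminant functions d^c(x);
# in the thesis these would be Takagi-Sugeno fuzzy inference systems.
discriminants = [
    lambda x: 1.0 - 0.5 * x[0],   # d^1(x)
    lambda x: 0.2 + 0.4 * x[0],   # d^2(x)
]

def classify(x):
    # Decision rule (3.1): assign the class whose discriminant is largest
    scores = [d(x) for d in discriminants]
    return int(np.argmax(scores)) + 1   # classes numbered 1..C

print(classify(np.array([0.0])))  # d^1 = 1.0 > d^2 = 0.2 -> class 1
print(classify(np.array([2.0])))  # d^1 = 0.0 < d^2 = 1.0 -> class 2
```

The transparency argument made above survives this construction: each $d^c$ remains an inspectable rule base, and only the final argmax is crisp.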
In medical diagnosis and in the machine learning community, one of the methods for combining these two measures into the evaluation task is the analysis of the area under the ROC curve (AUC), where ROC stands for receiver operating characteristic, a graphical plot which illustrates the performance of a binary classifier system as its discrimination threshold is varied. ROC curves allow a visualization of the trade-off between the hit rate and the false alarm rate of classifiers [38]. In this work, the measures used to assess the quality of the obtained classifiers were the AUC, specificity (3.2), sensitivity (3.3) and accuracy (3.4), which can be calculated as:

    specificity = 1 - FP rate   (3.2)

    sensitivity = TP rate   (3.3)

    accuracy = (TP + TN) / (P + N)   (3.4)

where

    FP rate = FP / N   (3.5)

    TP rate = TP / P   (3.6)

and P are the positives, N the negatives, TP the true positives, TN the true negatives, and FP the false positives. All of these are entries of a matrix widely used in pattern recognition called the confusion matrix, which is used to represent errors in assigning classes to observed patterns: its ij-th element is the number of samples from class i that were classified as class j.

Sensitivity and specificity are statistical measures of the performance of a binary classification test, also known in statistics as a classification function. Sensitivity (also called the true positive rate, or the recall rate in some fields) measures the proportion of actual positives which are correctly identified as such (e.g. the percentage of sick people who are correctly identified as having the condition). Specificity measures the proportion of negatives which are correctly identified as such (e.g. the percentage of healthy people who are correctly identified as not having the condition); it is sometimes called the true negative rate. These two measures are closely related to the concepts of type I and type II errors. A perfect predictor would be described as 100% sensitive (i.e. predicting all people from the sick group as sick) and 100% specific (i.e. not predicting anyone from the healthy group as sick).

Optimization of the AUC

In [16] the optimized discriminating threshold is obtained through maximization of the AUC. The AUC for each value of the threshold is calculated through a trapezoidal approximation, by adding the areas of the triangles and square shown in Figure 3.3, where an example ROC curve is shown together with the calculation of the AUC at the point where its value is maximum. A good value of the AUC is therefore one with a large TP rate and a small FP rate, thus maximizing both sensitivity and specificity.
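The measures (3.2) to (3.6) follow directly from the confusion-matrix counts; the counts below are made up for illustration.

```python
def binary_metrics(tp, fn, fp, tn):
    """Sensitivity, specificity and accuracy from confusion-matrix counts."""
    p, n = tp + fn, fp + tn          # actual positives / actual negatives
    sensitivity = tp / p             # TP rate, eq. (3.3) and (3.6)
    specificity = 1 - fp / n         # 1 - FP rate, eq. (3.2) and (3.5)
    accuracy = (tp + tn) / (p + n)   # eq. (3.4)
    return sensitivity, specificity, accuracy

print(binary_metrics(tp=40, fn=10, fp=5, tn=45))
# sensitivity 0.8, specificity 0.9, accuracy 0.85
```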
Another option would be to select different weights for specificity and sensitivity, as used in [17]. It was found that giving equal weight to specificity and sensitivity led to the same results.
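A minimal sketch of the threshold optimization described above: for a single operating point (FP rate, TP rate), the trapezoidal approximation of the AUC reduces to (1 + TP rate - FP rate)/2, i.e. the mean of sensitivity and specificity, so maximizing it balances the two measures. The sweep below is a generic reimplementation under that reading, not the code used in the thesis.

```python
import numpy as np

def best_threshold(outputs, labels):
    """Sweep every observed output as a candidate threshold and keep the one
    maximizing the single-point AUC approximation, i.e. the area under the
    polyline (0,0) -> (FP rate, TP rate) -> (1,1)."""
    outputs = np.asarray(outputs, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    best_delta, best_auc = None, -1.0
    for delta in np.unique(outputs):
        pred = outputs >= delta
        tp_rate = (pred & labels).sum() / labels.sum()
        fp_rate = (pred & ~labels).sum() / (~labels).sum()
        auc = (1 + tp_rate - fp_rate) / 2   # trapezoid area at this point
        if auc > best_auc:
            best_delta, best_auc = delta, auc
    return best_delta, best_auc
```

For example, with outputs [0.1, 0.2, 0.6, 0.8] and labels [0, 0, 1, 1], the sweep selects the threshold 0.6, which separates the two classes perfectly.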

                 True class
                 p                n
Hypothesized  Y  True Positives   False Positives
class         N  False Negatives  True Negatives
                 P                N

Figure 3.2: Confusion matrix.

Figure 3.3: Finding the ROC point of maximum AUC by summing the areas of the trapezoid: 1, 2 and 3. (a) ROC curve; (b) AUC approximation.

Model layouts

When modeling a system, one can opt for a one-model-fits-all approach, where a single general model is built with all available training data, or for a multi-model solution. In the latter case, there is the additional problem of selecting which model to use for each data point. This work proposes to use the division of the data obtained from unsupervised fuzzy clustering to obtain the multi-models. Thus, the number of models is equal to the number of clusters. The multi-model approach is compared to a single model, which was derived using the whole dataset. The model that maximized the AUC was chosen as the best one, in order to balance specificity and sensitivity.
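The division of the data referred to above comes from fuzzy c-means clustering. A generic sketch of the algorithm (alternating center and membership updates, with the common fuzziness exponent m = 2 assumed here) is:

```python
import numpy as np

def fuzzy_c_means(X, c, m=2.0, iters=100, seed=0):
    """Basic fuzzy c-means: alternate the center update (weighted means with
    weights U**m) and the membership update for a fixed number of passes.
    Returns the cluster centers and the fuzzy partition matrix U (n x c)."""
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), c))
    U /= U.sum(axis=1, keepdims=True)          # memberships sum to 1 per point
    for _ in range(iters):
        W = U ** m
        centers = (W.T @ X) / W.sum(axis=0)[:, None]
        # distances of every point to every center, small epsilon avoids 0-division
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        inv = d ** (-2.0 / (m - 1.0))
        U = inv / inv.sum(axis=1, keepdims=True)
    return centers, U
```

On two well-separated 2-D blobs, the centers converge to the blob means and each point receives a high membership in its own cluster.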

A priori decision

The a priori decision scheme is based on cluster similarity. The criterion used for the choice of the model was the distance of each point to the cluster centers. Thus, the output of the model whose cluster is closest to the point is the one passed on, as shown in Figure 3.4, where M1, M2, ..., Mn are the models. In this case, 4 clusters were used, and so we ended up with 4 models.

Figure 3.4: Multi-model scheme with a priori decision based on cluster centers.

A posteriori decision

In this scheme, the multi-model approach implements an a posteriori decision. Figure 3.5 shows the proposed layout.

Figure 3.5: Multi-model scheme with a posteriori decision.

The decision is given by the model that has the largest difference between the model output O_i and a threshold δ_i, see (3.7). This threshold is optimized for each model in order to maximize the AUC. The approach is based on the hypothesis that a point further away from the threshold is classified more accurately, as there is less uncertainty. Figure 3.6 shows the values taken by the output of a model prior to classification. Here, one can see that the values are not 0 and 1, but range from -0.6 to 0.8. A threshold has to be applied to turn these real values into binary ones. The value chosen was the one that maximized the AUC.

Figure 3.6: Classifier values: (a) range of classifier values; (b) classifier threshold.

Later, a similar idea was used to choose which model to use. After optimization, the value chosen was 0.17 (Figure 3.6 b)). The criterion was based on the distance to the threshold (3.7), choosing at each point the model whose output is furthest from the threshold at which the class changes,

    max_i |O_i - δ_i|   (3.7)

where the threshold used was δ_i = 0.17, with O_i being the output of model i. Another criterion tested was based on the distance from each model output to the extremes of classification, [0, 1], choosing the model for which the difference between the output and the extreme was lowest.
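The two decision schemes described above can be sketched side by side. The cluster centers, model functions and thresholds below are illustrative stand-ins for the trained ones.

```python
import numpy as np

def a_priori_select(x, centers, models):
    """A priori: pick the model whose cluster center is nearest to x and
    pass on that model's output (Figure 3.4)."""
    i = int(np.argmin([np.linalg.norm(x - c) for c in centers]))
    return models[i](x)

def a_posteriori_select(x, models, thresholds):
    """A posteriori: run every model on x and keep the output furthest from
    its own optimized threshold, i.e. the least uncertain one (eq. 3.7)."""
    outputs = [m(x) for m in models]
    margins = [abs(o - d) for o, d in zip(outputs, thresholds)]
    i = margins.index(max(margins))
    return int(outputs[i] >= thresholds[i])   # binarize the chosen output
```

With two models whose outputs are 0.5 and 0.9 and a shared threshold of 0.17, the a posteriori rule keeps the second model (margin 0.73 versus 0.33) and classifies the point as positive.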


Chapter 4

Preprocessing of MIMIC II Database

One of the essential parts of data mining is access to a database. In real life there can be missing data and other obstacles that must be tackled before the dataset can be fully used. In this work we use data from a medical database known as MIMIC II, which will be introduced. An overview of the actual subset used will be given, along with information regarding the necessary preprocessing and feature selection that were required prior to using the data.

4.1 MIMIC II Database

The MIMIC II (Multi-parameter Intelligent Monitoring in Intensive Care) Clinical Database is composed of detailed information on more than 25,000 intensive care unit patients. It was initially composed of the data from adult patients admitted to ICUs at Boston's Beth Israel Deaconess Medical Center during the period 2001-2007, an academic medical center with 620 beds, 77 of which for critical care [39]. It is composed of two parts. The first is the MIMIC II Waveform Database, which includes bedside monitor trends and waveforms and is freely available. The second is the MIMIC II Clinical Database, which includes all other elements of MIMIC II, such as patient demographics, physiological measures, lists of procedures, medications, lab tests, fluid balance, notes and staff reports; it is available to qualified researchers who obtain human subjects training, under the terms of a data use agreement concerning issues of human research and privacy. This information can be queried or downloaded from the database website.

Preprocessing was undertaken to improve data quality. Missing data was imputed consistently with the accepted last-value-carried-forward method [40, 41, 42]. Outliers were addressed using the inter-quartile

range method [43]. Normalization of the data used the min-max procedure. Finally, the data was aligned with a gridding approach based on heart rate sampling [44].

4.2 Vasopressors subset

In clinical practice, it is common to attribute to each patient a series of ICD-9 codes. This is a medical coding system in which every health condition (sign, symptom or disease) is assigned a unique code, with similar conditions grouped together. These codes were used to select two specific groups of patients: pancreatitis 2 or pneumonia 3. These are two conditions prone to the development of systemic shock, which may end up requiring the use of fluids and vasopressor agents. For the selection of patients, a set of variables usually obtained in the ICU by non-invasive means was chosen, along with the indication of the times (in hours) between samples, as indicated in Table 4.1.

#   Variables (units)                           95% CI for time between samples (hours)
1   Heart Rate (beats/min)
2   Temperature (°C)
3   SpO2 (%)
4   Respiratory Rate (breaths/min)
5   GCS Total
6   Braden Score
7   Hematocrit (%)
8   Platelets (cells/L)
9   WBC - White Blood Cells (10^3/mL)
10  Hemoglobin (g/L)
11  RBC - Red Blood Cells (10^6/mL)
12  BUN - Blood urea nitrogen (mg/dL)
13  Creatinine (mg/dL)
14  Glucose (mg/dL)
15  Potassium (mEq/L)
16  Chloride (mEq/L)
17  Sodium (mEq/L)
18  Magnesium (mg/dL)
19  NBP - Non-invasive blood pressure (mmHg)
20  NBP Mean (mmHg)
21  Arterial pH
22  Arterial Base Excess (mEq/L)
23  Lactic Acid (mg/dL)
24  Urine Output (mL)

Table 4.1: Physiological variables.

Also, a series of variables connected to the patient information obtained on ICU admission were grouped, as indicated in Table 4.2.

2 Pancreatitis ICD-9 codes: 577. ; ; ; ; ;
3 Pneumonia ICD-9 codes: 3.22 ; 2.3 ; 2.4 ; 2.5 ; 21.2 ; 22.1 ; 31. ; 39.1 ; 52.1 ; 55.1 ; 73. ; 83. ; ; 114. ; ; ; ; ; ; 13.4 ; ; 48. ; 48.1 ; 48.2 ; 48.3 ; 48.8 ; 48.9 ; 481 ; 482. ; ; ; ; ; ; ; ; ; ; ; ; ; ; ; ; ; ; ; ; 483 ; 483. ; ; ; ; ; ; ; ; ; 485 ; 486 ; 513. ;

#  Variable                      Remark
1  Patient ID
2  Age at ICU admission
3  Sex                           0 if female, 1 if male
4  Mortality                     1 if patient died while in the ICU
5  Hospital time stay
6  ICU time stay
7  SAPS score at ICU admission
8  SOFA score at ICU admission
9  Vasopressor administration    1 if it was administered

Table 4.2: Static variables.

For these records, there was also a binary variable recording whether, at each instant, a given patient was being administered vasopressors or not. This served as the output variable for the prediction of vasopressor need.

4.3 Preprocessing

As with any real database, a few steps had to be taken before the data could be used, in particular due to the presence of missing data, outliers and data synchronization issues. In order to tackle differences in times of collection, the heart rate signal was used as a template variable to align the remaining variables, since it was the most frequently measured one. This process is presented in more detail in [44]. In particular, the values were interpolated and the points in sync with this template variable were chosen. Regarding missing data, the chosen procedure was to impute recoverable missing segments by cubic interpolation. Additionally, in the de-identification step of the MIMIC II Database, patients whose age was higher than 90 were set the value 200, as a visible outlier. We then changed these ages to 92, so as not to weigh too heavily on the data mining processes.

4.4 Feature Selection

Normally, for the task of data mining, one focuses on a subgroup of variables rather than the complete set. This can be for various reasons. Too many variables can lead to redundancy, which can lower the prediction performance. It can also be for computational purposes, as fewer variables lead to a
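The preprocessing steps mentioned in this chapter can be sketched as follows. The 1.5 factor in the inter-quartile range rule is the usual default, assumed here; the last-value-carried-forward imputation shown stands in for the cubic interpolation of recoverable segments, which works analogously on the valid samples.

```python
import numpy as np

def locf(x):
    """Last-value-carried-forward imputation: each NaN takes the most recent
    observed value (leading NaNs are left untouched)."""
    x = x.copy()
    for i in range(1, len(x)):
        if np.isnan(x[i]):
            x[i] = x[i - 1]
    return x

def iqr_outliers(x):
    """Mark values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]; the 1.5 factor is
    the common default, assumed here."""
    q1, q3 = np.nanpercentile(x, [25, 75])
    iqr = q3 - q1
    return (x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)

def min_max(x):
    """Min-max normalization to [0, 1]."""
    return (x - np.nanmin(x)) / (np.nanmax(x) - np.nanmin(x))
```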

decrease in computation time, and there is often a large amount of data in data mining applications. To select which variables to discard in feature selection, one could apply a process of discovering which variables, separately or in groups, have the highest predictive power, so as to avoid removing the most important ones and in turn diminishing the prediction rate. In [16], several techniques were used to select these variables on a dataset similar to that of the vasopressor prediction problem. In particular, bottom-up and top-down tree search approaches and an ant colony optimization method were used without significant loss of performance.

In this work, this step of selecting an optimal subset of features was not performed. We focused instead on a particular subset of 5 variables that were more frequently sampled, as shown in Table 4.3. This was also the variable subset used for the clustering process.

ID  Variables (units)
1   Heart Rate (beats/min)
3   SpO2 (%)
4   Respiratory Rate (breaths/min)
19  NBP - Non-invasive blood pressure (mmHg)
24  Urine Output (mL)

Table 4.3: Chosen data features, those with the lowest sample times.

Since the preprocessing stage required all variables to be synchronized to the heart rate [44], in the features where the sample time was largest, and which were thus less frequently sampled, some of the records were obtained through interpolation and so were more prone to errors. The chosen features had a sample time similar to that of the template variable, and so their values were less influenced by the interpolation process. Also, there were initially 1489 patients, but a significant number of patients with too few records was found, as can be seen in Figure 4.1. Patients with fewer than three records were therefore removed, ending with 122 patients, 8 of which with pancreatitis and 43 with pneumonia.
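The two selection steps above, keeping the most frequently sampled variables and dropping patients with too few records, can be sketched as below; the function names and the toy sampling intervals are illustrative, not taken from the thesis.

```python
def most_frequent_features(sample_interval_hours, k=5):
    """Return the IDs of the k variables with the smallest sampling interval,
    i.e. the most frequently sampled ones."""
    ranked = sorted(sample_interval_hours, key=sample_interval_hours.get)
    return ranked[:k]

def keep_patients(record_counts, min_records=3):
    """Drop patients with fewer than `min_records` records."""
    return [pid for pid, n in record_counts.items() if n >= min_records]

# Toy sampling intervals (hours) keyed by the variable IDs of Table 4.1.
intervals = {1: 0.5, 2: 4.0, 3: 1.0, 4: 1.0, 19: 1.0, 24: 2.0, 7: 12.0}
print(most_frequent_features(intervals, 5))
```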


More information

Outlier Detection and Removal Algorithm in K-Means and Hierarchical Clustering

Outlier Detection and Removal Algorithm in K-Means and Hierarchical Clustering World Journal of Computer Application and Technology 5(2): 24-29, 2017 DOI: 10.13189/wjcat.2017.050202 http://www.hrpub.org Outlier Detection and Removal Algorithm in K-Means and Hierarchical Clustering

More information

International Journal of Data Mining & Knowledge Management Process (IJDKP) Vol.7, No.3, May Dr.Zakea Il-Agure and Mr.Hicham Noureddine Itani

International Journal of Data Mining & Knowledge Management Process (IJDKP) Vol.7, No.3, May Dr.Zakea Il-Agure and Mr.Hicham Noureddine Itani LINK MINING PROCESS Dr.Zakea Il-Agure and Mr.Hicham Noureddine Itani Higher Colleges of Technology, United Arab Emirates ABSTRACT Many data mining and knowledge discovery methodologies and process models

More information

Machine Learning. Unsupervised Learning. Manfred Huber

Machine Learning. Unsupervised Learning. Manfred Huber Machine Learning Unsupervised Learning Manfred Huber 2015 1 Unsupervised Learning In supervised learning the training data provides desired target output for learning In unsupervised learning the training

More information

9/29/13. Outline Data mining tasks. Clustering algorithms. Applications of clustering in biology

9/29/13. Outline Data mining tasks. Clustering algorithms. Applications of clustering in biology 9/9/ I9 Introduction to Bioinformatics, Clustering algorithms Yuzhen Ye (yye@indiana.edu) School of Informatics & Computing, IUB Outline Data mining tasks Predictive tasks vs descriptive tasks Example

More information

Using Machine Learning to Optimize Storage Systems

Using Machine Learning to Optimize Storage Systems Using Machine Learning to Optimize Storage Systems Dr. Kiran Gunnam 1 Outline 1. Overview 2. Building Flash Models using Logistic Regression. 3. Storage Object classification 4. Storage Allocation recommendation

More information

Comparative Study of Clustering Algorithms using R

Comparative Study of Clustering Algorithms using R Comparative Study of Clustering Algorithms using R Debayan Das 1 and D. Peter Augustine 2 1 ( M.Sc Computer Science Student, Christ University, Bangalore, India) 2 (Associate Professor, Department of Computer

More information

University of Florida CISE department Gator Engineering. Clustering Part 5

University of Florida CISE department Gator Engineering. Clustering Part 5 Clustering Part 5 Dr. Sanjay Ranka Professor Computer and Information Science and Engineering University of Florida, Gainesville SNN Approach to Clustering Ordinary distance measures have problems Euclidean

More information

CS6375: Machine Learning Gautam Kunapuli. Mid-Term Review

CS6375: Machine Learning Gautam Kunapuli. Mid-Term Review Gautam Kunapuli Machine Learning Data is identically and independently distributed Goal is to learn a function that maps to Data is generated using an unknown function Learn a hypothesis that minimizes

More information

CHAPTER 3 A FAST K-MODES CLUSTERING ALGORITHM TO WAREHOUSE VERY LARGE HETEROGENEOUS MEDICAL DATABASES

CHAPTER 3 A FAST K-MODES CLUSTERING ALGORITHM TO WAREHOUSE VERY LARGE HETEROGENEOUS MEDICAL DATABASES 70 CHAPTER 3 A FAST K-MODES CLUSTERING ALGORITHM TO WAREHOUSE VERY LARGE HETEROGENEOUS MEDICAL DATABASES 3.1 INTRODUCTION In medical science, effective tools are essential to categorize and systematically

More information

Unsupervised Data Mining: Clustering. Izabela Moise, Evangelos Pournaras, Dirk Helbing

Unsupervised Data Mining: Clustering. Izabela Moise, Evangelos Pournaras, Dirk Helbing Unsupervised Data Mining: Clustering Izabela Moise, Evangelos Pournaras, Dirk Helbing Izabela Moise, Evangelos Pournaras, Dirk Helbing 1 1. Supervised Data Mining Classification Regression Outlier detection

More information

Lecture 25: Review I

Lecture 25: Review I Lecture 25: Review I Reading: Up to chapter 5 in ISLR. STATS 202: Data mining and analysis Jonathan Taylor 1 / 18 Unsupervised learning In unsupervised learning, all the variables are on equal standing,

More information

Artificial Intelligence. Programming Styles

Artificial Intelligence. Programming Styles Artificial Intelligence Intro to Machine Learning Programming Styles Standard CS: Explicitly program computer to do something Early AI: Derive a problem description (state) and use general algorithms to

More information

CS 1675 Introduction to Machine Learning Lecture 18. Clustering. Clustering. Groups together similar instances in the data sample

CS 1675 Introduction to Machine Learning Lecture 18. Clustering. Clustering. Groups together similar instances in the data sample CS 1675 Introduction to Machine Learning Lecture 18 Clustering Milos Hauskrecht milos@cs.pitt.edu 539 Sennott Square Clustering Groups together similar instances in the data sample Basic clustering problem:

More information

Statistical Pattern Recognition

Statistical Pattern Recognition Statistical Pattern Recognition Features and Feature Selection Hamid R. Rabiee Jafar Muhammadi Spring 2013 http://ce.sharif.edu/courses/91-92/2/ce725-1/ Agenda Features and Patterns The Curse of Size and

More information

Cluster Analysis. Ying Shen, SSE, Tongji University

Cluster Analysis. Ying Shen, SSE, Tongji University Cluster Analysis Ying Shen, SSE, Tongji University Cluster analysis Cluster analysis groups data objects based only on the attributes in the data. The main objective is that The objects within a group

More information

Seminars of Software and Services for the Information Society

Seminars of Software and Services for the Information Society DIPARTIMENTO DI INGEGNERIA INFORMATICA AUTOMATICA E GESTIONALE ANTONIO RUBERTI Master of Science in Engineering in Computer Science (MSE-CS) Seminars in Software and Services for the Information Society

More information

Web Based Fuzzy Clustering Analysis

Web Based Fuzzy Clustering Analysis Research Inventy: International Journal Of Engineering And Science Vol.4, Issue 11 (November2014), PP 51-57 Issn (e): 2278-4721, Issn (p):2319-6483, www.researchinventy.com Web Based Fuzzy Clustering Analysis

More information

Preface to the Second Edition. Preface to the First Edition. 1 Introduction 1

Preface to the Second Edition. Preface to the First Edition. 1 Introduction 1 Preface to the Second Edition Preface to the First Edition vii xi 1 Introduction 1 2 Overview of Supervised Learning 9 2.1 Introduction... 9 2.2 Variable Types and Terminology... 9 2.3 Two Simple Approaches

More information

Regulatory Aspects of Digital Healthcare Solutions

Regulatory Aspects of Digital Healthcare Solutions Regulatory Aspects of Digital Healthcare Solutions TÜV SÜD Product Service GmbH Dr. Markus Siebert Rev. 02 / 2017 02.05.2017 TÜV SÜD Product Service GmbH Slide 1 Contents Digital solutions as Medical Device

More information

Unsupervised Learning

Unsupervised Learning Harvard-MIT Division of Health Sciences and Technology HST.951J: Medical Decision Support, Fall 2005 Instructors: Professor Lucila Ohno-Machado and Professor Staal Vinterbo 6.873/HST.951 Medical Decision

More information

Credit card Fraud Detection using Predictive Modeling: a Review

Credit card Fraud Detection using Predictive Modeling: a Review February 207 IJIRT Volume 3 Issue 9 ISSN: 2396002 Credit card Fraud Detection using Predictive Modeling: a Review Varre.Perantalu, K. BhargavKiran 2 PG Scholar, CSE, Vishnu Institute of Technology, Bhimavaram,

More information

Statistical Pattern Recognition

Statistical Pattern Recognition Statistical Pattern Recognition Features and Feature Selection Hamid R. Rabiee Jafar Muhammadi Spring 2014 http://ce.sharif.edu/courses/92-93/2/ce725-2/ Agenda Features and Patterns The Curse of Size and

More information

8/3/2017. Contour Assessment for Quality Assurance and Data Mining. Objective. Outline. Tom Purdie, PhD, MCCPM

8/3/2017. Contour Assessment for Quality Assurance and Data Mining. Objective. Outline. Tom Purdie, PhD, MCCPM Contour Assessment for Quality Assurance and Data Mining Tom Purdie, PhD, MCCPM Objective Understand the state-of-the-art in contour assessment for quality assurance including data mining-based techniques

More information

Contents Machine Learning concepts 4 Learning Algorithm 4 Predictive Model (Model) 4 Model, Classification 4 Model, Regression 4 Representation

Contents Machine Learning concepts 4 Learning Algorithm 4 Predictive Model (Model) 4 Model, Classification 4 Model, Regression 4 Representation Contents Machine Learning concepts 4 Learning Algorithm 4 Predictive Model (Model) 4 Model, Classification 4 Model, Regression 4 Representation Learning 4 Supervised Learning 4 Unsupervised Learning 4

More information

Outlier Detection Using Unsupervised and Semi-Supervised Technique on High Dimensional Data

Outlier Detection Using Unsupervised and Semi-Supervised Technique on High Dimensional Data Outlier Detection Using Unsupervised and Semi-Supervised Technique on High Dimensional Data Ms. Gayatri Attarde 1, Prof. Aarti Deshpande 2 M. E Student, Department of Computer Engineering, GHRCCEM, University

More information

Acquisition Description Exploration Examination Understanding what data is collected. Characterizing properties of data.

Acquisition Description Exploration Examination Understanding what data is collected. Characterizing properties of data. Summary Statistics Acquisition Description Exploration Examination what data is collected Characterizing properties of data. Exploring the data distribution(s). Identifying data quality problems. Selecting

More information

Contents. List of Figures. List of Tables. List of Algorithms. I Clustering, Data, and Similarity Measures 1

Contents. List of Figures. List of Tables. List of Algorithms. I Clustering, Data, and Similarity Measures 1 Contents List of Figures List of Tables List of Algorithms Preface xiii xv xvii xix I Clustering, Data, and Similarity Measures 1 1 Data Clustering 3 1.1 Definition of Data Clustering... 3 1.2 The Vocabulary

More information

Clustering part II 1

Clustering part II 1 Clustering part II 1 Clustering What is Cluster Analysis? Types of Data in Cluster Analysis A Categorization of Major Clustering Methods Partitioning Methods Hierarchical Methods 2 Partitioning Algorithms:

More information

Improving the Efficiency of Fast Using Semantic Similarity Algorithm

Improving the Efficiency of Fast Using Semantic Similarity Algorithm International Journal of Scientific and Research Publications, Volume 4, Issue 1, January 2014 1 Improving the Efficiency of Fast Using Semantic Similarity Algorithm D.KARTHIKA 1, S. DIVAKAR 2 Final year

More information

Data mining fundamentals

Data mining fundamentals Data mining fundamentals Elena Baralis Politecnico di Torino Data analysis Most companies own huge bases containing operational textual documents experiment results These bases are a potential source of

More information

Machine Learning (BSMC-GA 4439) Wenke Liu

Machine Learning (BSMC-GA 4439) Wenke Liu Machine Learning (BSMC-GA 4439) Wenke Liu 01-31-017 Outline Background Defining proximity Clustering methods Determining number of clusters Comparing two solutions Cluster analysis as unsupervised Learning

More information

CS249: ADVANCED DATA MINING

CS249: ADVANCED DATA MINING CS249: ADVANCED DATA MINING Classification Evaluation and Practical Issues Instructor: Yizhou Sun yzsun@cs.ucla.edu April 24, 2017 Homework 2 out Announcements Due May 3 rd (11:59pm) Course project proposal

More information

Application of fuzzy set theory in image analysis. Nataša Sladoje Centre for Image Analysis

Application of fuzzy set theory in image analysis. Nataša Sladoje Centre for Image Analysis Application of fuzzy set theory in image analysis Nataša Sladoje Centre for Image Analysis Our topics for today Crisp vs fuzzy Fuzzy sets and fuzzy membership functions Fuzzy set operators Approximate

More information

CHAPTER 2. Morphometry on rodent brains. A.E.H. Scheenstra J. Dijkstra L. van der Weerd

CHAPTER 2. Morphometry on rodent brains. A.E.H. Scheenstra J. Dijkstra L. van der Weerd CHAPTER 2 Morphometry on rodent brains A.E.H. Scheenstra J. Dijkstra L. van der Weerd This chapter was adapted from: Volumetry and other quantitative measurements to assess the rodent brain, In vivo NMR

More information

Introduction to Pattern Recognition Part II. Selim Aksoy Bilkent University Department of Computer Engineering

Introduction to Pattern Recognition Part II. Selim Aksoy Bilkent University Department of Computer Engineering Introduction to Pattern Recognition Part II Selim Aksoy Bilkent University Department of Computer Engineering saksoy@cs.bilkent.edu.tr RETINA Pattern Recognition Tutorial, Summer 2005 Overview Statistical

More information

Data Mining Chapter 9: Descriptive Modeling Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University

Data Mining Chapter 9: Descriptive Modeling Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University Data Mining Chapter 9: Descriptive Modeling Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University Descriptive model A descriptive model presents the main features of the data

More information

9.1. K-means Clustering

9.1. K-means Clustering 424 9. MIXTURE MODELS AND EM Section 9.2 Section 9.3 Section 9.4 view of mixture distributions in which the discrete latent variables can be interpreted as defining assignments of data points to specific

More information

CS145: INTRODUCTION TO DATA MINING

CS145: INTRODUCTION TO DATA MINING CS145: INTRODUCTION TO DATA MINING 08: Classification Evaluation and Practical Issues Instructor: Yizhou Sun yzsun@cs.ucla.edu October 24, 2017 Learnt Prediction and Classification Methods Vector Data

More information

Machine Learning (BSMC-GA 4439) Wenke Liu

Machine Learning (BSMC-GA 4439) Wenke Liu Machine Learning (BSMC-GA 4439) Wenke Liu 01-25-2018 Outline Background Defining proximity Clustering methods Determining number of clusters Other approaches Cluster analysis as unsupervised Learning Unsupervised

More information

Clustering & Classification (chapter 15)

Clustering & Classification (chapter 15) Clustering & Classification (chapter 5) Kai Goebel Bill Cheetham RPI/GE Global Research goebel@cs.rpi.edu cheetham@cs.rpi.edu Outline k-means Fuzzy c-means Mountain Clustering knn Fuzzy knn Hierarchical

More information

CHAPTER 6 MODIFIED FUZZY TECHNIQUES BASED IMAGE SEGMENTATION

CHAPTER 6 MODIFIED FUZZY TECHNIQUES BASED IMAGE SEGMENTATION CHAPTER 6 MODIFIED FUZZY TECHNIQUES BASED IMAGE SEGMENTATION 6.1 INTRODUCTION Fuzzy logic based computational techniques are becoming increasingly important in the medical image analysis arena. The significant

More information