Data mining and modeling to predict the necessity of vasopressors for sepsis patients


Data mining and modeling to predict the necessity of vasopressors for sepsis patients

José Miguel Mourinho Rodrigues

Thesis to obtain the Master of Science Degree in Mechanical Engineering

Examination Committee
Chairperson: Professor João Rogério Caldas Pinto
Supervisor: Professor João Miguel da Costa Sousa
Co-supervisor: Doctor Susana Margarida da Silva Vieira
Members of the committee: Professor Luís Manuel Fernandes Mendonça

June 2013


Acknowledgments

I would like to thank my research advisors: professor João Sousa, for his professionalism throughout the whole endeavor; Dr. Susana Vieira, for her constant attention; and professor Luís Mendonça, for his constant support and encouragement. Thanks must also go to André Fialho for his insights, availability, and help regarding the database used. A general thanks to all the people at Centro Académico Edith Stein, especially to those who wrote their theses there, for making me feel I'm not alone. A particular thanks goes to Felipe Blanco, Pedro Viegas, Joana Peleja and Pedro Antunes for their friendship and for keeping me focused on the tasks at hand. A special thanks goes also to João Campos and Paulo Araújo for their prayers and emotional support. A final word of thanks goes to my family, for their constant support, and to whom I dedicate this work.


Resumo

Shock is a life-or-death medical condition that requires the administration of powerful drugs - vasopressors. The timely identification of these patients, so that therapy can be prepared, is an important goal. A set of the most frequently sampled variables readily available in an intensive care unit was used for grouping - clustering - patients. A data exploration process was then initiated using fuzzy clustering with the fuzzy c-means algorithm; four clusters were obtained and the characteristics of the groups were analyzed. A relationship between the clusters obtained and the use of vasopressors was found, and these results were visualized with the help of histograms. First, a general model was obtained. Then, four models were trained and used in a multi-model approach, one for each of the identified patient groups. For the multi-model approach, two decision criteria were used. First, an a priori decision based on the distance between the cluster centers and the patients' feature values was used, and then an a posteriori decision using each of the models, in which the final value used is based on the uncertainty of the output - the response relative to the threshold - of each model. The multi-model approach with a posteriori decision performed best of the two schemes tested, and also achieved better results than the general model.

Keywords: Multi-model, clustering, fuzzy modeling, vasopressors


Abstract

Shock is a life-threatening medical condition requiring the administration of powerful drugs - vasopressors. Early identification of these patients is a worthy goal, so that they can be prepared for therapy in time. A subset composed of the most frequently sampled and readily available variables in an intensive care unit (ICU) was used for clustering patients. Then, a data exploration process was executed through fuzzy clustering based on the fuzzy c-means algorithm. Four clusters were obtained and the groups' characteristics were analyzed. A relationship between the clusters obtained and the use of vasopressors was found, and these results were visualized with the help of histograms. First, a single general model was derived. Then, four models were trained and used in a multi-model approach, one for each identified group of patients. For the multi-model approach, two decision criteria were used: 1) an a priori decision based on the distance from the cluster centers to the patient's characteristics, and 2) an a posteriori decision in which every model is applied and the final outcome is based on the uncertainty of each model's output relative to its threshold. The multi-model approach with a posteriori decision performed best of the two tested schemes, and also performed better than the single general model approach.

Keywords: Multi-model, fuzzy clustering, fuzzy modeling, vasopressors


Contents

Acknowledgments
Resumo
Abstract
List of Tables
List of Figures
Nomenclature
1 Introduction
    1.1 Problem overview
    1.2 Data mining in medical care
    1.3 Related work
    1.4 Contributions
2 Clustering
    2.1 Hierarchical Clustering
    2.2 Partitioning Clustering
    2.3 Fuzzy Clustering
        2.3.1 Fuzzy c-means
        2.3.2 Other fuzzy clustering algorithms
    2.4 Validation Measures
        2.4.1 Partition Coefficient
        2.4.2 Partition Entropy
        2.4.3 Partition Index
        2.4.4 Separation Index
        2.4.5 Xie-Beni Index
        2.4.6 Other validation measures
3 Knowledge Discovery in Databases
    3.1 Modeling
        3.1.1 Fuzzy modeling
        3.1.2 Model assessment
        3.1.3 Model layouts

4 Preprocessing of MIMIC II Database
    MIMIC II Database
    Vasopressors subset
    Preprocessing
    Feature Selection
    Fluid subset
5 Clustering of MIMIC II Database
    Clustering
    Full dataset
    First data reduction
    Clustering with high data
    Cluster Evaluation
    Clusters obtained
    Cluster centers
    Main features histograms
    Demographics
    Clusters by pathology
6 Results
    Single-model Results
    Multi-model Results
    Best results comparison
7 Conclusions
A Data Partitioning - GK Results
    A.1 All data, all features
    A.2 All features, last point
    A.3 5 features
    A.4 Last records, 5 features
B Cluster Histograms
    B.1 Physiological variables
    B.2 Variables at ICU entrance/exit
    B.3 Output variable
    B.4 Cluster distribution by pathology
C Projections
References

List of Tables

4.1 Physiological variables
4.2 Static variables
4.3 Chosen data features with the low sample times
5.1 Number of patients and feature mean values for each cluster
6.1 Single-model results
6.2 Multi-model results after optimization
6.3 Single-model and multi-model results
B.1 Mean of physiological variables by cluster


List of Figures

1.1 The interrelationship between SIRS, sepsis and infection
3.1 KDD processes
3.2 Confusion matrix
3.3 Finding the ROC point of maximum AUC by summing the areas of the trapezoid
Multi-model scheme with decision a priori based on cluster centers
Multi-model scheme with decision a posteriori
Classifier values
Patient records plot of the first 10 entries
Validation measures for a set of 20 runs for each number of clusters between 2 and 20 (mean and variance) using the full feature set
Validation measures for FCM clustering of the last three records for every patient and the five most frequently sampled features
Validation measures for the reduced set and most sampled variables
Patients in cluster and vasopressor distribution
Most frequently sampled features histograms
Patient distribution per cluster by demographics
Patient distribution per cluster by pathology
A.1 Validation measures for the GK clustering of the full data set
A.2 Validation measures using the GK clustering for the last 3 records of each patient and all features
A.3 Validation measures for GK clustering of the last 3 records of each patient and the 5-feature subset data
A.4 Validation measures for GK clustering of the 5-feature subset data and the mean of the last three records
B.1 Heart rate distribution per cluster
B.2 Temperature distribution per cluster
B.3 SpO2 distribution per cluster
B.4 Respiratory rate distribution per cluster

B.5 GCS total distribution per cluster
B.6 Braden score distribution per cluster
B.7 Hematocrit distribution per cluster
B.8 Platelets distribution per cluster
B.9 White blood cells (WBC) distribution per cluster
B.10 Hemoglobin distribution per cluster
B.11 Red blood cells (RBC) distribution per cluster
B.12 Blood urea nitrogen (BUN) distribution per cluster
B.13 Creatinine distribution per cluster
B.14 Glucose distribution per cluster
B.15 Potassium distribution per cluster
B.16 Chloride distribution per cluster
B.17 Sodium distribution per cluster
B.18 Magnesium distribution per cluster
B.19 Non-invasive blood pressure (NBP) distribution per cluster
B.20 NBP mean distribution per cluster
B.21 Arterial pH distribution per cluster
B.22 Arterial base excess distribution per cluster
B.23 Lactic acid distribution per cluster
B.24 Urine output distribution per cluster
B.25 Age distribution per cluster
B.26 Sex distribution per cluster
B.27 Mortality distribution per cluster
B.28 SOFA score distribution per cluster
B.29 Vasopressors administration distribution
B.30 Pneumonia patients distribution per cluster
B.31 Pancreatitis patients distribution per cluster
B.32 SIRS patients distribution per cluster
C.1 Projections

Nomenclature

Abbreviations
AUC   Area under curve.
BUN   Blood urea nitrogen.
CE    Classification Entropy.
FCM   Fuzzy c-means algorithm.
GK    Gustafson-Kessel algorithm.
ICU   Intensive care unit.
KDD   Knowledge discovery in databases.
NBP   Non-invasive blood pressure.
PC    Partition Coefficient.
RBC   Red blood cells.
ROC   Receiver operating characteristic curve.
SAPS  Simplified Acute Physiology Score.
SIRS  Systemic inflammatory response syndrome.
SOFA  Sequential Organ Failure Assessment score.
WBC   White blood cells.
XB    Xie-Beni index.

Greek symbols
δ_i   Threshold of model i.

Roman symbols
C     Number of clusters.
e     Stopping criterion threshold.

M_i   Model i.
O_i   Output of model i.
S     Separation index.
S_c   Partition index.

Subscripts
f           fuzzy.
h           hard.
i, j, k, l  Computational indexes.

Superscripts
c     Cluster number.
m     Fuzziness index.

Chapter 1

Introduction

The aim of this study is to address the prediction of the administration of vasopressors to septic shock patients in intensive care units (ICU). In particular, our hypothesis is that a multi-model approach to the prediction of vasopressor need, based on fuzzy clustering, leads to improved performance compared to a one-model-fits-all approach. First, a problem overview regarding the use of vasopressors in medical care is given, followed by a study of data mining techniques used in medical care, then a review of related work, and finally the specific contribution of this dissertation.

1.1 Problem overview

Shock is a life-threatening medical emergency that can be defined as acute circulatory failure with inadequate or inappropriately distributed tissue perfusion resulting in generalized cellular hypoxia [1]. This means that end cells aren't getting enough blood, depriving them of the oxygen they need, which in turn can lead to tissue death, causing organ failure. The maintenance of end-organ perfusion is critical to prevent irreversible organ injury and failure, and this frequently requires the use of fluid resuscitation and vasopressors [2], medicines that contract blood vessels so as to increase blood pressure in critically ill patients. Unlike most clinical conditions, for which a clinical diagnosis is made before treatment is initiated, the treatment of shock often occurs at the same time as, or even before, the diagnostic process [2]. Thus it is possible for patients who would not require the medication to have it administered anyway. The added risk of the administration procedure must also be considered: when performed urgently, it can lead to infections, increasing costs in the end.

If a prediction can be made beforehand on which patients are going to need vasopressors, costs will be reduced, since fewer patients will receive unnecessary treatment and catheter placement can be performed in a more timely manner, making it less prone to medical complications.

Figure 1.1: The interrelationship between SIRS, sepsis and infection.

Figure 1.1 shows the interrelationship between systemic inflammatory response syndrome (SIRS), sepsis, and infection. Essentially, there is a condition marked by a generalized inflammatory response of the body which, when caused by a blood-borne infection, is called sepsis. Depending on the source of the blood-borne infection, it can be called bacteremia - in the case of pneumonia, for example -, fungemia, parasitemia or viremia, among others. Nonetheless, this state of inflammatory response can also have other causes, such as burns, trauma, or the initial stage of pancreatitis [3] - also a focus of this study. A common development of these conditions is shock, called septic shock when it develops from sepsis, and it is that condition that is addressed in this study for the prediction of vasopressors.

1.2 Data mining in medical care

Data mining is the analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data owner [4]. Data mining, while recent, is not something new, as it has been used intensively and extensively in other areas: by financial institutions, for credit scoring and fraud detection; by marketers, for direct marketing and cross-selling or up-selling; by retailers, for market segmentation and store layout; and by manufacturers, for quality control and maintenance scheduling [5].

In medical care, the adoption of data mining has been slower, owing to the limited availability of data due to privacy and legal issues [6]. But this trend is changing as, according to [5], in health care data mining is becoming increasingly popular, if not increasingly essential. Some driving factors are the existence of medical insurance fraud and abuse; the ever increasing volume of data generated by health care transactions, too complex to treat by traditional methods; financial pressures to increase operating efficiency while maintaining a high level of care; and the realization that data mining can generate information that is very useful for all parties involved in the industry, from health care insurers and providers to customers. There are many applications of data mining in the field of health care. In [5] these applications are grouped in four distinct areas: evaluation of treatment effectiveness; management of health care; customer relationship management; and detection of fraud and abuse. The limitations of health care data mining deal with the availability and quality of the data. The data has to be collected and integrated before data mining is attempted. There can be missing, corrupted, inconsistent or non-standardized data due to different formats and sources. The data can also be unavailable due to ethical, legal and social issues, such as data ownership [5]. Additionally, the databases can be primarily designed for financial/billing purposes rather than for medical/clinical purposes, resulting in lower-quality clinical data for data mining [6]. The success of health care data mining hinges on the availability of clean health care data. Possible future directions include the standardization of clinical vocabulary and the sharing of data across organizations [5].

1.3 Related work

One particular focus of the present work is the combination of multiple models for prediction.
One application of model combination is in weather forecasting, in particular for hydrological forecasts of rainfall-runoff models for water discharge prediction, such as in [7]. Initial studies have looked into combining the outputs of different models through various methods; that study uses a simple average method, a weighted average method and a neural network method. In the merging of forecast models, models have also been combined non-linearly using fuzzy rules, as in [8]. In [9] the authors address the question of whether multi-model combination can really enhance the prediction skill of ensemble forecasts when less skillful models are included, and also whether a multi-model can perform better than the best single model available, assuming that there is a best model and that it can be identified. Another use of multi-model combination is in hybrid systems control applications in the process industry, where models of different setups are used. One such application is in the field of fault diagnosis, as in

[10]. Here a multi-model architecture is used in which a fuzzy decision-making approach isolates faults based on the analysis of the residuals - the differences between the system output and the outputs of the models identified with and without faults. Also of relevance is the machine learning field of ensemble learning, the process by which multiple models, such as classifiers or experts, are strategically generated and combined to solve a particular computational intelligence problem [11]. Ensemble learning is primarily used to improve the (classification, prediction, function approximation, etc.) performance of a model, or to reduce the likelihood of an unfortunate selection of a poor one. Random forest [12] is a representative algorithm, consisting of many decision trees that vote to select class membership. A recent paper [13] addresses the task of combining classifiers, from the creation of ensembles to decision fusion; for further study, the book [14] on combining pattern classifiers, of which that paper is a summary, is advised. More recently, a survey of decision fusion and feature fusion strategies for pattern classification became available in [15]. Finally, for further research, some related areas are the fields of data fusion, decision fusion, ensemble learning and clustering ensembles.

1.4 Contributions

This thesis follows the work done on the vasopressor data subset of MIMIC II, as compiled by André Fialho and indicated in [16]. In the present work the dataset is similar, but a different approach is taken. In particular, an alternative selection of features is presented, and these features are used for clustering patients with a focus on visualizing the results. Like [17], it builds upon the notion of using a multi-model architecture and combining the models to create a better performing classification algorithm.
Unlike that work, though, in the present work fuzzy clustering is used to build these models, and different methodologies are presented to combine them. In [6] it is said that in biomedicine clustering is normally used for microarray data analysis rather than for general health care data analysis, since in that application very little is known about the genes, while more information is known about health conditions and disease symptoms; it is mentioned, in particular, that clustering is often used when little or no information is available. The present study contributes to showing that there is a place for clustering in health care, either as an aid to modeling, or used alone to gain new insights and confirm medical information. Finally, following this work, a paper was submitted and accepted for presentation at the 2013 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE 2013) and for publication in the conference proceedings published by IEEE.
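The two decision criteria used to combine the per-cluster models (introduced in the Abstract and evaluated in the Results chapter) can be sketched as follows. This is only an illustration: the function names, the toy "models" (constant functions) and the thresholds are assumptions made here, not the thesis's actual implementation.

```python
import numpy as np

def a_priori_decision(x, centers, models):
    """Use only the model whose cluster center is closest to patient x."""
    d2 = ((centers - x) ** 2).sum(axis=1)
    return models[int(np.argmin(d2))](x)

def a_posteriori_decision(x, models, thresholds):
    """Run every model and keep the output that is least uncertain,
    i.e. farthest from its own decision threshold."""
    outs = np.array([m(x) for m in models])
    return float(outs[int(np.argmax(np.abs(outs - thresholds)))])

# Toy usage: two patient groups, one hypothetical "model" per group
centers = np.array([[0.0], [10.0]])
models = [lambda x: 0.2, lambda x: 0.9]
x = np.array([1.0])
print(a_priori_decision(x, centers, models))                   # 0.2 (closest center)
print(a_posteriori_decision(x, models, np.array([0.5, 0.5])))  # 0.9 (least uncertain)
```

The a priori rule commits to one model before seeing any output, while the a posteriori rule spends the extra computation of evaluating every model in exchange for a confidence-based choice.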



Chapter 2

Clustering

Clustering is an unsupervised learning task that aims at decomposing a given set of objects into subgroups or clusters based on similarity. The goal is to divide the data set in such a way that objects belonging to the same cluster are as similar as possible, whereas objects belonging to different clusters are as dissimilar as possible [18]. Cluster analysis is primarily a tool for discovering previously hidden structure in a set of unordered objects. In this case one assumes that a true or natural grouping exists in the data; however, the assignment of objects to the classes and the description of these classes are unknown [18]. The grade of similarity is obtained through distance functions that are part of the clustering methods and that measure the dissimilarity of the presented example cases. Traditionally, clustering techniques have been divided into two main groups, hierarchical and partitioning [19], though more groupings can be stated, such as density-based or grid-based clustering.

2.1 Hierarchical Clustering

Hierarchical techniques organize data in a nested sequence of groups, which can be visualized in the form of a dendrogram or tree. Based on a dendrogram, one can decide on the number of clusters at which the data are best represented for a given purpose [18]. These methods can be further divided into agglomerative and divisive methods, depending on whether a bottom-up sequential aggregation of data points into a tree is performed, or a top-down division of the data into several groups is made instead. The most used algorithms for hierarchical clustering are known as single linkage (also known as nearest neighbor) and complete linkage (furthest distance), based on the distance function used.

It is necessary to note that, in this type of clustering, once a data point is assigned to a given cluster it remains there until the end, making the method more sensitive to outliers and initial conditions.

2.2 Partitioning Clustering

Another clustering type is partitioning clustering. Given a positive integer c, these algorithms aim at finding the best partition of the data into c groups based on the given dissimilarity measure, and they regard the space of possible partitions into c subsets only [18]. One of the better known partitioning clustering methods is K-means. Many other methods were designed as variations of parts of this algorithm, such as K-medoids, which uses points from the set as centers (medoids), or K-medians, which uses medians instead of means. K-means [20] is one of the simplest unsupervised learning algorithms that solve the well known clustering problem. It aims at minimizing the objective function

$$J_h(X, U_h, C) = \sum_{i=1}^{c} \sum_{j=1}^{n} u_{ij}\, d_{ij}^2 \quad (2.1)$$

where $d_{ij}$ is the distance function (for the standard K-means, the Euclidean distance) and $u_{ij}$ the partition matrix. K-means is also known as hard c-means, in the light of the fuzzy c-means clustering algorithm to be described, since in classical (hard) cluster analysis each datum is assigned to exactly one cluster.

2.3 Fuzzy Clustering

If the requirement $u_{ij} \in \{0, 1\}$ placed on the cluster assignments in hard partitioning approaches is relaxed, gradual memberships of data points, measured as degrees in $[0, 1]$, become possible. A data point can thus belong to more than one cluster. The concept of these membership degrees is built upon the notion of fuzzy sets, as introduced in [21].

2.3.1 Fuzzy c-means

Fuzzy partitioning is carried out through an iterative optimization of the objective function (Eq. 2.2), with the update of the memberships $u_{ij}$ and the cluster centers $c_j$.

$$J_f(X, U_f, C) = \sum_{i=1}^{c} \sum_{j=1}^{n} (u_{ij})^m\, d_{ij}^2 \quad (2.2)$$

where $u_{ij}$ is the partition matrix, $m > 1$ is the fuzziness index, and $d_{ij}$ is a distance function - for the standard FCM case, the Euclidean distance. The problem is then the optimization of the objective function $J_f$ (Eq. 2.2). This method was first described in [22] with $m = 2$, and later generalized in [23] to the formula in its current form. The update equations are

$$u_{ij} = \frac{1}{\sum_{l=1}^{c} \left( d_{ij}^2 / d_{lj}^2 \right)^{\frac{1}{m-1}}} \quad (2.3)$$

$$c_i = \frac{\sum_{j=1}^{n} (u_{ij})^m x_j}{\sum_{j=1}^{n} (u_{ij})^m} \quad (2.4)$$

2.3.2 Other fuzzy clustering algorithms

As with K-means, there are many possible variations leading to other algorithms that can take advantage of particular cases known in advance, such as the presence of given shapes (ellipsoids, lines) or noise. One notable variation is the Gustafson-Kessel (GK) clustering algorithm, where the distance function is changed to the Mahalanobis distance in order to detect clusters of different size and orientation. This allows it to extract more information, but makes the algorithm more sensitive to initialization and computationally more demanding. Also of note is kernel-based fuzzy clustering, where the distance function is further modified to handle non-vectorial data such as sequences, trees or graphs.

2.4 Validation Measures

Usually the number of (true) clusters in the given data is unknown in advance. However, when using the partitioning methods one is usually required to specify the number of clusters c as an input parameter, so estimating the actual number of clusters is an important issue [18]. The name given to these criteria is not consistent in the literature: they are called validation measures, validity criteria, evaluation measures, or validity
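As a concrete sketch, the alternating updates of Eqs. (2.2)-(2.4) can be written in a few lines of NumPy. This is a minimal illustration only (the function name, random initialization and stopping rule are choices made here, not the implementation used in the thesis):

```python
import numpy as np

def fcm(X, c, m=2.0, eps=1e-5, max_iter=100, seed=0):
    """Minimal fuzzy c-means: alternate the center update (Eq. 2.4)
    and the membership update (Eq. 2.3) until U stops changing."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    U = rng.random((c, n))
    U /= U.sum(axis=0)                       # each column sums to 1
    for _ in range(max_iter):
        Um = U ** m
        centers = (Um @ X) / Um.sum(axis=1, keepdims=True)        # Eq. 2.4
        d2 = ((X[None, :, :] - centers[:, None, :]) ** 2).sum(axis=2)
        d2 = np.fmax(d2, 1e-12)              # guard against division by zero
        inv = d2 ** (-1.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=0)        # Eq. 2.3
        if np.abs(U_new - U).max() < eps:
            U = U_new
            break
        U = U_new
    return centers, U
```

On two well-separated groups of points, the returned memberships are close to crisp, which is the behavior the validation measures of the next section quantify.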

indices. As in [24], the term used in this work will be validation measures. Such measures can be used to evaluate the clustering quality quantitatively and to compare algorithms with one another. They can also be applied to compare the results obtained with a single algorithm when the parameter values are changed. In particular, they can be used to select the optimal number of clusters: applying the algorithm for several values of c, the value leading to the optimal decomposition according to the considered criterion is selected [18]. External clustering validation and internal clustering validation are the two main categories of clustering validation; the main difference is whether or not external information is used [25]. Internal validation measures rely only on information in the data, and are therefore applicable to situations where no prior knowledge, such as the true number of clusters or previously known classes, is available. They are thus more suitable for an exploratory knowledge discovery process such as the one in this study, and are less database specific. In the literature, a number of internal clustering validation measures for crisp clustering have been proposed, such as the Dunn index [22], the silhouette index [26] or the Davies-Bouldin index [27]. But starting with Bezdek in 1975 [23], other measures were proposed specifically for fuzzy clustering, which use information about the partition matrix and other fuzzy clustering parameters. Moreover, using a validity measure intended for crisp clustering on a fuzzy clustering would make the results dependent on some kind of defuzzification scheme. For these reasons we focused on internal and fuzzy validation measures alone, some of which are now presented.

2.4.1 Partition Coefficient

The partition coefficient measures the amount of overlapping between clusters.
It is defined by Bezdek as follows [23]:

$$PC = \frac{1}{n} \sum_{i=1}^{c} \sum_{j=1}^{n} u_{ij}^2 \quad (2.5)$$

2.4.2 Partition Entropy

The partition entropy computes the entropy of the obtained membership degrees, and must be minimized. Like the partition coefficient, it measures the fuzziness of the cluster partition only, and is defined as [23]:

$$PE = -\frac{1}{n} \sum_{i=1}^{c} \sum_{j=1}^{n} u_{ij} \log u_{ij} \quad (2.6)$$
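For a c-by-n partition matrix U whose columns sum to one, both measures are essentially one-liners (a sketch; the small constant added inside the logarithm is an assumption here to avoid log(0) for crisp memberships):

```python
import numpy as np

def partition_coefficient(U):
    # Eq. 2.5: mean squared membership; ranges from 1/c (fully fuzzy)
    # up to 1 (crisp partition)
    return (U ** 2).sum() / U.shape[1]

def partition_entropy(U):
    # Eq. 2.6: entropy of the memberships; 0 for a crisp partition,
    # log(c) for a completely fuzzy one
    return -(U * np.log(U + 1e-12)).sum() / U.shape[1]
```

A crisp partition gives PC = 1 and PE near 0, while uniform memberships over two clusters give PC = 0.5 and PE = log 2, matching the stated optimization directions (maximize PC, minimize PE).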

2.4.3 Partition Index

The partition index is the ratio of the sum of compactness and separation of the clusters. It is a sum of individual cluster validity measures, normalized through division by the fuzzy cardinality of each cluster [28]:

$$S_c(c) = \sum_{i=1}^{c} \frac{\sum_{j=1}^{N} (\mu_{ij})^m \|x_j - v_i\|^2}{N_i \sum_{k=1}^{c} \|v_k - v_i\|^2} \quad (2.7)$$

2.4.4 Separation Index

In contrast to the partition index $S_c$, the separation index uses a minimum-distance separation for partition validity [28]:

$$S(c) = \frac{\sum_{i=1}^{c} \sum_{j=1}^{N} (\mu_{ij})^2 \|x_j - v_i\|^2}{N \min_{i,k} \|v_k - v_i\|^2} \quad (2.8)$$

2.4.5 Xie-Beni Index

Also intended for fuzzy clustering, and in widespread use, is the Xie-Beni index, which quantifies the ratio between the total variation within the clusters and the separation of the clusters [29]:

$$XB(c) = \frac{\sum_{i=1}^{c} \sum_{j=1}^{N} (\mu_{ij})^m \|x_j - v_i\|^2}{N \min_{i,k} \|v_i - v_k\|^2} \quad (2.9)$$

A better clustering is obtained by minimizing XB(c). However, the Xie-Beni index has been shown to be monotonically decreasing for high numbers of clusters. This can be addressed by calculating one of the corrected Xie-Beni indices available in the literature; in practical terms, however, only a rather small number of clusters is usually sought, and so the uncorrected Xie-Beni index is still used, given its widespread adoption.

2.4.6 Other validation measures

Other possible validation measures that were not used in this study, but are used often enough to at least name them, are the Fukuyama-Sugeno index [30] and Gath and Geva's validation measures, the fuzzy hypervolume and the partition density [31]. These last two, in particular, are better suited to GK clustering, as they require the calculation of the covariance matrix, which is already part of the GK clustering algorithm.
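The separation-based measures share the same structure, a within-cluster variation term over a cluster-separation term, so they can be sketched together. Note the classical Xie-Beni separation term (minimum squared distance between cluster centers) is used here; with m = 2 the two measures coincide:

```python
import numpy as np

def _within_and_separation(X, centers, U, m):
    # numerator: sum over clusters of (mu^m * squared point-to-center distance)
    d2 = ((X[None, :, :] - centers[:, None, :]) ** 2).sum(axis=2)   # (c, n)
    within = ((U ** m) * d2).sum()
    # separation: minimum squared distance between distinct centers
    c2 = ((centers[None, :, :] - centers[:, None, :]) ** 2).sum(axis=2)
    sep = c2[~np.eye(len(centers), dtype=bool)].min()
    return within, sep

def separation_index(X, centers, U):
    within, sep = _within_and_separation(X, centers, U, m=2.0)  # Eq. 2.8
    return within / (X.shape[0] * sep)

def xie_beni(X, centers, U, m=2.0):
    within, sep = _within_and_separation(X, centers, U, m=m)    # Eq. 2.9
    return within / (X.shape[0] * sep)
```

On a small one-dimensional example, centers placed at the two natural groups give a lower (better) Xie-Beni value than centers squeezed together in the middle, as expected for a measure to be minimized.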


Chapter 3

Knowledge Discovery in Databases

Knowledge discovery in databases (KDD) is the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data [32].

Figure 3.1: KDD processes. [32] [33]

According to Fayyad [32], there are essentially five steps in the KDD process: selection, preprocessing, transformation, data mining and interpretation (also called evaluation). From an engineering background (systems modeling), these steps could be called data acquisition, data preprocessing, feature selection, modeling and interpretation without significant loss of meaning, as in [33] and as shown in Figure 3.1. The overall process of finding and interpreting patterns in data involves the repeated application of these steps, now explained in more detail:

1. Data acquisition / selection - Comprises the creation or selection of a target dataset, or of the particular variables on which to perform data discovery. It is an important task, as the quality of the

data will impact the quality of the data discovery process.

2. Data preprocessing - This step focuses on the cleaning or preprocessing of the data. In particular, it deals with noise and outlier removal, noise modeling, handling missing data fields, and accounting for time sequence information.

3. Feature selection / Transformation - Includes data reduction and projection, so as to represent the data in a form more amenable to data mining. It tries on the one hand to select the fewer variables that best represent the data, while on the other hand eliminating redundancy that can impair the data mining process.

4. Modeling / Data mining - In this step, a data mining method is selected depending on the goal of the analysis. Six common tasks are:

- Anomaly detection (outlier/change/deviation detection) - the identification of unusual data records that might be interesting, or of data errors that require further investigation.
- Association rule learning (dependency modeling) - searches for relationships between variables, sometimes referred to as market basket analysis.
- Clustering - the task of discovering groups and structures in the data that are in some way similar, without using known structures in the data.
- Classification - the task of generalizing known structure to apply to new data.
- Regression - attempts to find a function which models the data with the least error.
- Summarization - provides a more compact representation of the data set, including visualization and report generation.

5. Interpretation - In this step, the patterns obtained through data mining are evaluated to see whether they are interesting or not; it can thus also be called evaluation. The duty is to represent the result in an appropriate way, so that it can be examined thoroughly. If a pattern is not interesting, the cause has to be found, and more attempts can be made, with some previous steps redone.
3.1 Modeling

The choice of the modeling technique to be used may depend on many factors, including the source of the data set and the values that it contains. Methods based on fuzzy systems inherit model transparency and enjoy good function approximation properties. For this reason, fuzzy modeling was used in this work along with fuzzy clustering.

3.1.1 Fuzzy modeling

Fuzzy modeling is a tool that allows the approximation of nonlinear systems when there is little or no previous knowledge of the problem to be modeled [34].

This approach provides a transparent, non-crisp model, and also makes possible a linguistic interpretation in the form of rules and logical connectives. These are used to establish relations between the defined features in order to derive a model. A fuzzy classifier contains a rule base consisting of a set of fuzzy if-then rules, together with a fuzzy inference mechanism. These models ultimately classify each instance of the dataset as belonging, with a certain degree, to one of the possible classes defined for the specific problem being modeled. As suggested in [35], a discriminant method was used in this work, where the classification is based on the largest discriminant function. In this method, a separate discriminant function $d^c(x)$ is associated with each class $w_c$, with $c = 1, \ldots, C$. The discriminant functions can be implemented as fuzzy inference systems. Here, we use Takagi-Sugeno (TS) fuzzy models [36], with which each discriminant function consists of rules of the type:

Rule $R_i^c$: If $x_1$ is $A_{i1}$ and ... and $x_M$ is $A_{iM}$ then $d^c(x) = f_i^c$, $\quad i = 1, 2, \ldots, K$,

where $f_i^c$ is the consequent function for rule $R_i^c$. In these K rules, the index c indicates that the rule is associated with the output class c. Note that the antecedent parts of the rules can be different for different discriminants, as can the consequents. The classifier assigns the class label corresponding to the maximum value of the discriminant functions, i.e.

$$\max_c d^c(x). \quad (3.1)$$

3.1.2 Model assessment

In describing the performance of binary classifiers, the accuracy of classification cannot be considered alone [37]. Both the sensitivity, or hit rate, and the specificity, or true rejection rate, must also be analyzed.
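The decision rule (3.1) is independent of how the discriminants are implemented, so it can be illustrated with stand-in score functions. The linear functions below are purely hypothetical placeholders for the TS fuzzy discriminants $d^c(x)$:

```python
import numpy as np

# Hypothetical stand-ins for the per-class discriminant functions d^c(x);
# in the thesis these would be Takagi-Sugeno fuzzy inference systems.
discriminants = [
    lambda x: 1.0 - 0.5 * x[0],   # d^1(x)
    lambda x: 0.2 + 0.4 * x[0],   # d^2(x)
]

def classify(x):
    # Decision rule (3.1): assign the class whose discriminant is largest
    scores = [d(x) for d in discriminants]
    return int(np.argmax(scores)) + 1   # classes numbered 1..C

print(classify(np.array([0.0])))  # d^1 = 1.0 > d^2 = 0.2 -> class 1
print(classify(np.array([2.0])))  # d^1 = 0.0 < d^2 = 1.0 -> class 2
```

The transparency argument made above survives this construction: each $d^c$ remains an inspectable rule base, and only the final argmax is crisp.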
In medical diagnosis and in the machine learning community, one of the methods for combining these two measures into the evaluation task is the analysis of the area under the ROC curve (AUC), where ROC stands for receiver operating characteristic, a graphical plot which illustrates the performance of a binary classifier system as its discrimination threshold is varied. ROC curves allow a visualization of the trade-off between the hit rate and the false alarm rate of classifiers [38]. In this work, the measures used to assess the quality of the obtained classifiers were the AUC, specificity (3.2), sensitivity (3.3) and accuracy (3.4), which can be calculated as:

    specificity = 1 - FP rate   (3.2)

    sensitivity = TP rate   (3.3)

    accuracy = (TP + TN) / (P + N)   (3.4)

where

    FP rate = FP / N   (3.5)

    TP rate = TP / P   (3.6)

and P are the positives, N the negatives, TP the true positives, TN the true negatives, and FP the false positives. All of these are entries of a matrix widely used in pattern recognition called the confusion matrix, which is used to represent errors in assigning classes to observed patterns: its ij-th element is the number of samples from class i that were classified as class j.

Sensitivity and specificity are statistical measures of the performance of a binary classification test, also known in statistics as a classification function. Sensitivity (also called the true positive rate, or the recall rate in some fields) measures the proportion of actual positives which are correctly identified as such (e.g. the percentage of sick people who are correctly identified as having the condition). Specificity measures the proportion of negatives which are correctly identified as such (e.g. the percentage of healthy people who are correctly identified as not having the condition); it is sometimes called the true negative rate. These two measures are closely related to the concepts of type I and type II errors. A perfect predictor would be described as 100% sensitive (i.e. predicting all people from the sick group as sick) and 100% specific (i.e. not predicting anyone from the healthy group as sick).

Optimization of the AUC

In [16] the optimized discriminating threshold is obtained through maximization of the AUC. The AUC for each value of the threshold is calculated through a trapezoidal approximation, by adding the areas of the triangles and square shown in Figure 3.3, where an example ROC curve is shown together with the calculation of the AUC at the point where its value is maximum. A good value of the AUC is therefore one with a large TP rate and a small FP rate, thus maximizing both sensitivity and specificity.
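The measures (3.2) to (3.6) follow directly from the confusion-matrix counts; the counts below are made up for illustration.

```python
def binary_metrics(tp, fn, fp, tn):
    """Sensitivity, specificity and accuracy from confusion-matrix counts."""
    p, n = tp + fn, fp + tn          # actual positives / actual negatives
    sensitivity = tp / p             # TP rate, eq. (3.3) and (3.6)
    specificity = 1 - fp / n         # 1 - FP rate, eq. (3.2) and (3.5)
    accuracy = (tp + tn) / (p + n)   # eq. (3.4)
    return sensitivity, specificity, accuracy

print(binary_metrics(tp=40, fn=10, fp=5, tn=45))
# sensitivity 0.8, specificity 0.9, accuracy 0.85
```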
Another option would be to select different weights for specificity and sensitivity, as used in [17]. It was found that giving equal weight to specificity and sensitivity led to the same results.
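A minimal sketch of the threshold optimization described above: for a single operating point (FP rate, TP rate), the trapezoidal approximation of the AUC reduces to (1 + TP rate - FP rate)/2, i.e. the mean of sensitivity and specificity, so maximizing it balances the two measures. The sweep below is a generic reimplementation under that reading, not the code used in the thesis.

```python
import numpy as np

def best_threshold(outputs, labels):
    """Sweep every observed output as a candidate threshold and keep the one
    maximizing the single-point AUC approximation, i.e. the area under the
    polyline (0,0) -> (FP rate, TP rate) -> (1,1)."""
    outputs = np.asarray(outputs, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    best_delta, best_auc = None, -1.0
    for delta in np.unique(outputs):
        pred = outputs >= delta
        tp_rate = (pred & labels).sum() / labels.sum()
        fp_rate = (pred & ~labels).sum() / (~labels).sum()
        auc = (1 + tp_rate - fp_rate) / 2   # trapezoid area at this point
        if auc > best_auc:
            best_delta, best_auc = delta, auc
    return best_delta, best_auc
```

For example, with outputs [0.1, 0.2, 0.6, 0.8] and labels [0, 0, 1, 1], the sweep selects the threshold 0.6, which separates the two classes perfectly.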

                 True class
                 p                n
Hypothesized  Y  True Positives   False Positives
class         N  False Negatives  True Negatives
                 P                N

Figure 3.2: Confusion matrix.

Figure 3.3: Finding the ROC point of maximum AUC by summing the areas of the trapezoid: 1, 2 and 3. (a) ROC curve; (b) AUC approximation.

Model layouts

When modeling a system, one can opt for a one-model-fits-all approach, where a single general model is built with all available training data, or for a multi-model solution. In the latter case, there is the additional problem of selecting which model to use for each data point. This work proposes to use the division of the data obtained from unsupervised fuzzy clustering to obtain the multi-models. Thus, the number of models is equal to the number of clusters. The multi-model approach is compared to a single model, which was derived using the whole dataset. The model that maximized the AUC was chosen as the best one, in order to balance specificity and sensitivity.
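The division of the data referred to above comes from fuzzy c-means clustering. A generic sketch of the algorithm (alternating center and membership updates, with the common fuzziness exponent m = 2 assumed here) is:

```python
import numpy as np

def fuzzy_c_means(X, c, m=2.0, iters=100, seed=0):
    """Basic fuzzy c-means: alternate the center update (weighted means with
    weights U**m) and the membership update for a fixed number of passes.
    Returns the cluster centers and the fuzzy partition matrix U (n x c)."""
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), c))
    U /= U.sum(axis=1, keepdims=True)          # memberships sum to 1 per point
    for _ in range(iters):
        W = U ** m
        centers = (W.T @ X) / W.sum(axis=0)[:, None]
        # distances of every point to every center, small epsilon avoids 0-division
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        inv = d ** (-2.0 / (m - 1.0))
        U = inv / inv.sum(axis=1, keepdims=True)
    return centers, U
```

On two well-separated 2-D blobs, the centers converge to the blob means and each point receives a high membership in its own cluster.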

A priori decision

The a priori decision scheme is based on cluster similarity. The criterion used for the choice of the model was the distance of each point to the cluster centers. Thus, the output of the model whose cluster is closest to the point is the one passed on, as shown in Figure 3.4, where M1, M2, ..., Mn are the models. In this case, 4 clusters were used, and so we ended up with 4 models.

Figure 3.4: Multi-model scheme with a priori decision based on cluster centers.

A posteriori decision

In this scheme, the multi-model approach implements an a posteriori decision. Figure 3.5 shows the proposed layout.

Figure 3.5: Multi-model scheme with a posteriori decision.

The decision is given by the model that has the largest difference between the model output O_i and a threshold δ_i, see (3.7). This threshold is optimized for each model in order to maximize the AUC. The approach is based on the hypothesis that a point further away from the threshold is classified more accurately, as there is less uncertainty. Figure 3.6 shows the values taken by the output of a model prior to classification. Here, one can see that the values are not 0 and 1, but range from -0.6 to 0.8. A threshold has to be applied to turn these real values into binary ones. The value chosen was the one that maximized the AUC.

Figure 3.6: Classifier values: (a) range of classifier values; (b) classifier threshold.

Later, a similar idea was used to choose which model to use. After optimization, the value chosen was 0.17 (Figure 3.6 b)). The criterion was based on the distance to the threshold (3.7), choosing at each point the model whose output is furthest from the threshold at which the class changes,

    max_i |O_i - δ_i|   (3.7)

where the threshold used was δ_i = 0.17, with O_i being the output of model i. Another criterion tested was based on the distance from each model output to the extremes of classification, [0, 1], choosing the model for which the difference between the output and the extreme was lowest.
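The two decision schemes described above can be sketched side by side. The cluster centers, model functions and thresholds below are illustrative stand-ins for the trained ones.

```python
import numpy as np

def a_priori_select(x, centers, models):
    """A priori: pick the model whose cluster center is nearest to x and
    pass on that model's output (Figure 3.4)."""
    i = int(np.argmin([np.linalg.norm(x - c) for c in centers]))
    return models[i](x)

def a_posteriori_select(x, models, thresholds):
    """A posteriori: run every model on x and keep the output furthest from
    its own optimized threshold, i.e. the least uncertain one (eq. 3.7)."""
    outputs = [m(x) for m in models]
    margins = [abs(o - d) for o, d in zip(outputs, thresholds)]
    i = margins.index(max(margins))
    return int(outputs[i] >= thresholds[i])   # binarize the chosen output
```

With two models whose outputs are 0.5 and 0.9 and a shared threshold of 0.17, the a posteriori rule keeps the second model (margin 0.73 versus 0.33) and classifies the point as positive.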


Chapter 4

Preprocessing of MIMIC II Database

One of the essential parts of data mining is access to a database. In real life there can be missing data and other obstacles that must be tackled before the dataset can be fully used. In this work we use data from a medical database known as MIMIC II, which will be introduced. An overview of the actual subset used will be given, along with information regarding the necessary preprocessing and feature selection that were required prior to using the data.

4.1 MIMIC II Database

The MIMIC II (Multi-parameter Intelligent Monitoring in Intensive Care) Clinical Database is composed of detailed information on more than 25,000 intensive care unit patients. It was initially composed of the data from adult patients admitted to ICUs at Boston's Beth Israel Deaconess Medical Center during the period 2001-2007, an academic medical center with 620 beds, 77 of which for critical care [39]. It is composed of two parts. The first is the MIMIC II Waveform Database, which includes bedside monitor trends and waveforms and is freely available. The second is the MIMIC II Clinical Database, which includes all other elements of MIMIC II, such as patient demographics, physiological measures, lists of procedures, medications, lab tests, fluid balance, notes and staff reports; it is available to qualified researchers who obtain human subjects training, under the terms of a data use agreement concerning issues of human research and privacy. This information can be queried or downloaded from the database website.

Preprocessing was undertaken to improve data quality. Missing data was imputed consistently with the accepted last-value-carried-forward method [40, 41, 42]. Outliers were addressed using the inter-quartile

range method [43]. Normalization of the data used the min-max procedure. Finally, the data was aligned with a gridding approach based on heart rate sampling [44].

4.2 Vasopressors subset

In clinical practice, it is common to attribute to each patient a series of ICD-9 codes. This is a medical coding system in which every health condition (sign, symptom or disease) is assigned a unique code, with similar conditions grouped together. These codes were used to select two specific groups of patients: pancreatitis 2 or pneumonia 3. These are two conditions prone to the development of systemic shock, which may end up requiring the use of fluids and vasopressor agents. For the selection of patients, a set of variables usually obtained in the ICU by non-invasive means was chosen, along with the indication of the times (in hours) between samples, as indicated in Table 4.1.

#   Variables (units)                           95% CI for time between samples (hours)
1   Heart Rate (beats/min)
2   Temperature (°C)
3   SpO2 (%)
4   Respiratory Rate (breaths/min)
5   GCS Total
6   Braden Score
7   Hematocrit (%)
8   Platelets (cells/L)
9   WBC - White Blood Cells (10^3/mL)
10  Hemoglobin (g/L)
11  RBC - Red Blood Cells (10^6/mL)
12  BUN - Blood urea nitrogen (mg/dL)
13  Creatinine (mg/dL)
14  Glucose (mg/dL)
15  Potassium (mEq/L)
16  Chloride (mEq/L)
17  Sodium (mEq/L)
18  Magnesium (mg/dL)
19  NBP - Non-invasive blood pressure (mmHg)
20  NBP Mean (mmHg)
21  Arterial pH
22  Arterial Base Excess (mEq/L)
23  Lactic Acid (mg/dL)
24  Urine Output (mL)

Table 4.1: Physiological variables.

Also, a series of variables connected to the patient information obtained on ICU admission were grouped, as indicated in Table 4.2.

2 Pancreatitis ICD-9 codes: 577. ; ; ; ; ;
3 Pneumonia ICD-9 codes: 3.22 ; 2.3 ; 2.4 ; 2.5 ; 21.2 ; 22.1 ; 31. ; 39.1 ; 52.1 ; 55.1 ; 73. ; 83. ; ; 114. ; ; ; ; ; ; 13.4 ; ; 48. ; 48.1 ; 48.2 ; 48.3 ; 48.8 ; 48.9 ; 481 ; 482. ; ; ; ; ; ; ; ; ; ; ; ; ; ; ; ; ; ; ; ; 483 ; 483. ; ; ; ; ; ; ; ; ; 485 ; 486 ; 513. ;

#  Variable                      Remark
1  Patient ID
2  Age at ICU admission
3  Sex                           0 if female, 1 if male
4  Mortality                     1 if patient died while in the ICU
5  Hospital time stay
6  ICU time stay
7  SAPS score at ICU admission
8  SOFA score at ICU admission
9  Vasopressor administration    1 if it was administered

Table 4.2: Static variables.

For these records, there was also a binary variable recording whether, at each instant, a given patient was being administered vasopressors or not. This served as the output variable for the prediction of vasopressor need.

4.3 Preprocessing

As with any real database, a few steps had to be taken before the data could be used, in particular due to the presence of missing data, outliers and data synchronization issues. In order to tackle differences in times of collection, the heart rate signal was used as a template variable to align the remaining variables, since it was the most frequently measured one. This process is presented in more detail in [44]. In particular, the values were interpolated and the points in sync with this template variable were chosen. Regarding missing data, the chosen procedure was to impute recoverable missing segments by cubic interpolation. Additionally, in the de-identification step of the MIMIC II Database, patients whose age was higher than 90 were set the value 200, as a visible outlier. We then changed these ages to 92, so as not to weigh too heavily on the data mining processes.

4.4 Feature Selection

Normally, for the task of data mining, one focuses on a subgroup of variables rather than the complete set. This can be for various reasons. Too many variables can lead to redundancy, which can lower the prediction performance. It can also be for computational purposes, as fewer variables lead to a
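The preprocessing steps mentioned in this chapter can be sketched as follows. The 1.5 factor in the inter-quartile range rule is the usual default, assumed here; the last-value-carried-forward imputation shown stands in for the cubic interpolation of recoverable segments, which works analogously on the valid samples.

```python
import numpy as np

def locf(x):
    """Last-value-carried-forward imputation: each NaN takes the most recent
    observed value (leading NaNs are left untouched)."""
    x = x.copy()
    for i in range(1, len(x)):
        if np.isnan(x[i]):
            x[i] = x[i - 1]
    return x

def iqr_outliers(x):
    """Mark values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]; the 1.5 factor is
    the common default, assumed here."""
    q1, q3 = np.nanpercentile(x, [25, 75])
    iqr = q3 - q1
    return (x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)

def min_max(x):
    """Min-max normalization to [0, 1]."""
    return (x - np.nanmin(x)) / (np.nanmax(x) - np.nanmin(x))
```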

decrease in computation time, and there is often a large amount of data in data mining applications. To select which variables to discard in feature selection, one could apply a process of discovering which variables, separately or in groups, have the highest predictive power, so as to avoid removing the most important ones and in turn diminishing the prediction rate. In [16], several techniques were used to select these variables on a dataset similar to that of the vasopressor prediction problem. In particular, bottom-up and top-down tree search approaches and an ant colony optimization method were used without significant loss of performance.

In this work, this step of selecting an optimal subset of features was not performed. We focused instead on a particular subset of 5 variables that were more frequently sampled, as shown in Table 4.3. This was also the variable subset used for the clustering process.

ID  Variables (units)
1   Heart Rate (beats/min)
3   SpO2 (%)
4   Respiratory Rate (breaths/min)
19  NBP - Non-invasive blood pressure (mmHg)
24  Urine Output (mL)

Table 4.3: Chosen data features, those with the lowest sample times.

Since the preprocessing stage required all variables to be synchronized to the heart rate [44], in the features where the sample time was largest, and which were thus less frequently sampled, some of the records were obtained through interpolation and so were more prone to errors. The chosen features had a sample time similar to that of the template variable, and so their values were less influenced by the interpolation process. Also, there were initially 1489 patients, but a significant number of patients with too few records was found, as can be seen in Figure 4.1. Patients with fewer than three records were therefore removed, ending with 122 patients, 8 of which with pancreatitis and 43 with pneumonia.
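The two selection steps above, keeping the most frequently sampled variables and dropping patients with too few records, can be sketched as below; the function names and the toy sampling intervals are illustrative, not taken from the thesis.

```python
def most_frequent_features(sample_interval_hours, k=5):
    """Return the IDs of the k variables with the smallest sampling interval,
    i.e. the most frequently sampled ones."""
    ranked = sorted(sample_interval_hours, key=sample_interval_hours.get)
    return ranked[:k]

def keep_patients(record_counts, min_records=3):
    """Drop patients with fewer than `min_records` records."""
    return [pid for pid, n in record_counts.items() if n >= min_records]

# Toy sampling intervals (hours) keyed by the variable IDs of Table 4.1.
intervals = {1: 0.5, 2: 4.0, 3: 1.0, 4: 1.0, 19: 1.0, 24: 2.0, 7: 12.0}
print(most_frequent_features(intervals, 5))
```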


More information

Outlier Detection and Removal Algorithm in K-Means and Hierarchical Clustering

Outlier Detection and Removal Algorithm in K-Means and Hierarchical Clustering World Journal of Computer Application and Technology 5(2): 24-29, 2017 DOI: 10.13189/wjcat.2017.050202 http://www.hrpub.org Outlier Detection and Removal Algorithm in K-Means and Hierarchical Clustering

More information

International Journal of Data Mining & Knowledge Management Process (IJDKP) Vol.7, No.3, May Dr.Zakea Il-Agure and Mr.Hicham Noureddine Itani

International Journal of Data Mining & Knowledge Management Process (IJDKP) Vol.7, No.3, May Dr.Zakea Il-Agure and Mr.Hicham Noureddine Itani LINK MINING PROCESS Dr.Zakea Il-Agure and Mr.Hicham Noureddine Itani Higher Colleges of Technology, United Arab Emirates ABSTRACT Many data mining and knowledge discovery methodologies and process models

More information

Machine Learning. Unsupervised Learning. Manfred Huber

Machine Learning. Unsupervised Learning. Manfred Huber Machine Learning Unsupervised Learning Manfred Huber 2015 1 Unsupervised Learning In supervised learning the training data provides desired target output for learning In unsupervised learning the training

More information

9/29/13. Outline Data mining tasks. Clustering algorithms. Applications of clustering in biology

9/29/13. Outline Data mining tasks. Clustering algorithms. Applications of clustering in biology 9/9/ I9 Introduction to Bioinformatics, Clustering algorithms Yuzhen Ye (yye@indiana.edu) School of Informatics & Computing, IUB Outline Data mining tasks Predictive tasks vs descriptive tasks Example

More information

Using Machine Learning to Optimize Storage Systems

Using Machine Learning to Optimize Storage Systems Using Machine Learning to Optimize Storage Systems Dr. Kiran Gunnam 1 Outline 1. Overview 2. Building Flash Models using Logistic Regression. 3. Storage Object classification 4. Storage Allocation recommendation

More information

Comparative Study of Clustering Algorithms using R

Comparative Study of Clustering Algorithms using R Comparative Study of Clustering Algorithms using R Debayan Das 1 and D. Peter Augustine 2 1 ( M.Sc Computer Science Student, Christ University, Bangalore, India) 2 (Associate Professor, Department of Computer

More information

University of Florida CISE department Gator Engineering. Clustering Part 5

University of Florida CISE department Gator Engineering. Clustering Part 5 Clustering Part 5 Dr. Sanjay Ranka Professor Computer and Information Science and Engineering University of Florida, Gainesville SNN Approach to Clustering Ordinary distance measures have problems Euclidean

More information

CS6375: Machine Learning Gautam Kunapuli. Mid-Term Review

CS6375: Machine Learning Gautam Kunapuli. Mid-Term Review Gautam Kunapuli Machine Learning Data is identically and independently distributed Goal is to learn a function that maps to Data is generated using an unknown function Learn a hypothesis that minimizes

More information

CHAPTER 3 A FAST K-MODES CLUSTERING ALGORITHM TO WAREHOUSE VERY LARGE HETEROGENEOUS MEDICAL DATABASES

CHAPTER 3 A FAST K-MODES CLUSTERING ALGORITHM TO WAREHOUSE VERY LARGE HETEROGENEOUS MEDICAL DATABASES 70 CHAPTER 3 A FAST K-MODES CLUSTERING ALGORITHM TO WAREHOUSE VERY LARGE HETEROGENEOUS MEDICAL DATABASES 3.1 INTRODUCTION In medical science, effective tools are essential to categorize and systematically

More information

Unsupervised Data Mining: Clustering. Izabela Moise, Evangelos Pournaras, Dirk Helbing

Unsupervised Data Mining: Clustering. Izabela Moise, Evangelos Pournaras, Dirk Helbing Unsupervised Data Mining: Clustering Izabela Moise, Evangelos Pournaras, Dirk Helbing Izabela Moise, Evangelos Pournaras, Dirk Helbing 1 1. Supervised Data Mining Classification Regression Outlier detection

More information

Lecture 25: Review I

Lecture 25: Review I Lecture 25: Review I Reading: Up to chapter 5 in ISLR. STATS 202: Data mining and analysis Jonathan Taylor 1 / 18 Unsupervised learning In unsupervised learning, all the variables are on equal standing,

More information

Artificial Intelligence. Programming Styles

Artificial Intelligence. Programming Styles Artificial Intelligence Intro to Machine Learning Programming Styles Standard CS: Explicitly program computer to do something Early AI: Derive a problem description (state) and use general algorithms to

More information

CS 1675 Introduction to Machine Learning Lecture 18. Clustering. Clustering. Groups together similar instances in the data sample

CS 1675 Introduction to Machine Learning Lecture 18. Clustering. Clustering. Groups together similar instances in the data sample CS 1675 Introduction to Machine Learning Lecture 18 Clustering Milos Hauskrecht milos@cs.pitt.edu 539 Sennott Square Clustering Groups together similar instances in the data sample Basic clustering problem:

More information

Statistical Pattern Recognition

Statistical Pattern Recognition Statistical Pattern Recognition Features and Feature Selection Hamid R. Rabiee Jafar Muhammadi Spring 2013 http://ce.sharif.edu/courses/91-92/2/ce725-1/ Agenda Features and Patterns The Curse of Size and

More information

Cluster Analysis. Ying Shen, SSE, Tongji University

Cluster Analysis. Ying Shen, SSE, Tongji University Cluster Analysis Ying Shen, SSE, Tongji University Cluster analysis Cluster analysis groups data objects based only on the attributes in the data. The main objective is that The objects within a group

More information

Seminars of Software and Services for the Information Society

Seminars of Software and Services for the Information Society DIPARTIMENTO DI INGEGNERIA INFORMATICA AUTOMATICA E GESTIONALE ANTONIO RUBERTI Master of Science in Engineering in Computer Science (MSE-CS) Seminars in Software and Services for the Information Society

More information

Web Based Fuzzy Clustering Analysis

Web Based Fuzzy Clustering Analysis Research Inventy: International Journal Of Engineering And Science Vol.4, Issue 11 (November2014), PP 51-57 Issn (e): 2278-4721, Issn (p):2319-6483, www.researchinventy.com Web Based Fuzzy Clustering Analysis

More information

Preface to the Second Edition. Preface to the First Edition. 1 Introduction 1

Preface to the Second Edition. Preface to the First Edition. 1 Introduction 1 Preface to the Second Edition Preface to the First Edition vii xi 1 Introduction 1 2 Overview of Supervised Learning 9 2.1 Introduction... 9 2.2 Variable Types and Terminology... 9 2.3 Two Simple Approaches

More information

Regulatory Aspects of Digital Healthcare Solutions

Regulatory Aspects of Digital Healthcare Solutions Regulatory Aspects of Digital Healthcare Solutions TÜV SÜD Product Service GmbH Dr. Markus Siebert Rev. 02 / 2017 02.05.2017 TÜV SÜD Product Service GmbH Slide 1 Contents Digital solutions as Medical Device

More information

Unsupervised Learning

Unsupervised Learning Harvard-MIT Division of Health Sciences and Technology HST.951J: Medical Decision Support, Fall 2005 Instructors: Professor Lucila Ohno-Machado and Professor Staal Vinterbo 6.873/HST.951 Medical Decision

More information

Credit card Fraud Detection using Predictive Modeling: a Review

Credit card Fraud Detection using Predictive Modeling: a Review February 207 IJIRT Volume 3 Issue 9 ISSN: 2396002 Credit card Fraud Detection using Predictive Modeling: a Review Varre.Perantalu, K. BhargavKiran 2 PG Scholar, CSE, Vishnu Institute of Technology, Bhimavaram,

More information

Statistical Pattern Recognition

Statistical Pattern Recognition Statistical Pattern Recognition Features and Feature Selection Hamid R. Rabiee Jafar Muhammadi Spring 2014 http://ce.sharif.edu/courses/92-93/2/ce725-2/ Agenda Features and Patterns The Curse of Size and

More information

8/3/2017. Contour Assessment for Quality Assurance and Data Mining. Objective. Outline. Tom Purdie, PhD, MCCPM

8/3/2017. Contour Assessment for Quality Assurance and Data Mining. Objective. Outline. Tom Purdie, PhD, MCCPM Contour Assessment for Quality Assurance and Data Mining Tom Purdie, PhD, MCCPM Objective Understand the state-of-the-art in contour assessment for quality assurance including data mining-based techniques

More information

Contents Machine Learning concepts 4 Learning Algorithm 4 Predictive Model (Model) 4 Model, Classification 4 Model, Regression 4 Representation

Contents Machine Learning concepts 4 Learning Algorithm 4 Predictive Model (Model) 4 Model, Classification 4 Model, Regression 4 Representation Contents Machine Learning concepts 4 Learning Algorithm 4 Predictive Model (Model) 4 Model, Classification 4 Model, Regression 4 Representation Learning 4 Supervised Learning 4 Unsupervised Learning 4

More information

Outlier Detection Using Unsupervised and Semi-Supervised Technique on High Dimensional Data

Outlier Detection Using Unsupervised and Semi-Supervised Technique on High Dimensional Data Outlier Detection Using Unsupervised and Semi-Supervised Technique on High Dimensional Data Ms. Gayatri Attarde 1, Prof. Aarti Deshpande 2 M. E Student, Department of Computer Engineering, GHRCCEM, University

More information

Acquisition Description Exploration Examination Understanding what data is collected. Characterizing properties of data.

Acquisition Description Exploration Examination Understanding what data is collected. Characterizing properties of data. Summary Statistics Acquisition Description Exploration Examination what data is collected Characterizing properties of data. Exploring the data distribution(s). Identifying data quality problems. Selecting

More information

Contents. List of Figures. List of Tables. List of Algorithms. I Clustering, Data, and Similarity Measures 1

Contents. List of Figures. List of Tables. List of Algorithms. I Clustering, Data, and Similarity Measures 1 Contents List of Figures List of Tables List of Algorithms Preface xiii xv xvii xix I Clustering, Data, and Similarity Measures 1 1 Data Clustering 3 1.1 Definition of Data Clustering... 3 1.2 The Vocabulary

More information

Clustering part II 1

Clustering part II 1 Clustering part II 1 Clustering What is Cluster Analysis? Types of Data in Cluster Analysis A Categorization of Major Clustering Methods Partitioning Methods Hierarchical Methods 2 Partitioning Algorithms:

More information

Improving the Efficiency of Fast Using Semantic Similarity Algorithm

Improving the Efficiency of Fast Using Semantic Similarity Algorithm International Journal of Scientific and Research Publications, Volume 4, Issue 1, January 2014 1 Improving the Efficiency of Fast Using Semantic Similarity Algorithm D.KARTHIKA 1, S. DIVAKAR 2 Final year

More information

Data mining fundamentals

Data mining fundamentals Data mining fundamentals Elena Baralis Politecnico di Torino Data analysis Most companies own huge bases containing operational textual documents experiment results These bases are a potential source of

More information

Machine Learning (BSMC-GA 4439) Wenke Liu

Machine Learning (BSMC-GA 4439) Wenke Liu Machine Learning (BSMC-GA 4439) Wenke Liu 01-31-017 Outline Background Defining proximity Clustering methods Determining number of clusters Comparing two solutions Cluster analysis as unsupervised Learning

More information

CS249: ADVANCED DATA MINING

CS249: ADVANCED DATA MINING CS249: ADVANCED DATA MINING Classification Evaluation and Practical Issues Instructor: Yizhou Sun yzsun@cs.ucla.edu April 24, 2017 Homework 2 out Announcements Due May 3 rd (11:59pm) Course project proposal

More information

Application of fuzzy set theory in image analysis. Nataša Sladoje Centre for Image Analysis

Application of fuzzy set theory in image analysis. Nataša Sladoje Centre for Image Analysis Application of fuzzy set theory in image analysis Nataša Sladoje Centre for Image Analysis Our topics for today Crisp vs fuzzy Fuzzy sets and fuzzy membership functions Fuzzy set operators Approximate

More information

CHAPTER 2. Morphometry on rodent brains. A.E.H. Scheenstra J. Dijkstra L. van der Weerd

CHAPTER 2. Morphometry on rodent brains. A.E.H. Scheenstra J. Dijkstra L. van der Weerd CHAPTER 2 Morphometry on rodent brains A.E.H. Scheenstra J. Dijkstra L. van der Weerd This chapter was adapted from: Volumetry and other quantitative measurements to assess the rodent brain, In vivo NMR

More information

Introduction to Pattern Recognition Part II. Selim Aksoy Bilkent University Department of Computer Engineering

Introduction to Pattern Recognition Part II. Selim Aksoy Bilkent University Department of Computer Engineering Introduction to Pattern Recognition Part II Selim Aksoy Bilkent University Department of Computer Engineering saksoy@cs.bilkent.edu.tr RETINA Pattern Recognition Tutorial, Summer 2005 Overview Statistical

More information

Data Mining Chapter 9: Descriptive Modeling Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University

Data Mining Chapter 9: Descriptive Modeling Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University Data Mining Chapter 9: Descriptive Modeling Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University Descriptive model A descriptive model presents the main features of the data

More information

9.1. K-means Clustering

9.1. K-means Clustering 424 9. MIXTURE MODELS AND EM Section 9.2 Section 9.3 Section 9.4 view of mixture distributions in which the discrete latent variables can be interpreted as defining assignments of data points to specific

More information

CS145: INTRODUCTION TO DATA MINING

CS145: INTRODUCTION TO DATA MINING CS145: INTRODUCTION TO DATA MINING 08: Classification Evaluation and Practical Issues Instructor: Yizhou Sun yzsun@cs.ucla.edu October 24, 2017 Learnt Prediction and Classification Methods Vector Data

More information

Machine Learning (BSMC-GA 4439) Wenke Liu

Machine Learning (BSMC-GA 4439) Wenke Liu Machine Learning (BSMC-GA 4439) Wenke Liu 01-25-2018 Outline Background Defining proximity Clustering methods Determining number of clusters Other approaches Cluster analysis as unsupervised Learning Unsupervised

More information

Clustering & Classification (chapter 15)

Clustering & Classification (chapter 15) Clustering & Classification (chapter 5) Kai Goebel Bill Cheetham RPI/GE Global Research goebel@cs.rpi.edu cheetham@cs.rpi.edu Outline k-means Fuzzy c-means Mountain Clustering knn Fuzzy knn Hierarchical

More information

CHAPTER 6 MODIFIED FUZZY TECHNIQUES BASED IMAGE SEGMENTATION

CHAPTER 6 MODIFIED FUZZY TECHNIQUES BASED IMAGE SEGMENTATION CHAPTER 6 MODIFIED FUZZY TECHNIQUES BASED IMAGE SEGMENTATION 6.1 INTRODUCTION Fuzzy logic based computational techniques are becoming increasingly important in the medical image analysis arena. The significant

More information