Fuzzy Clustering of Time-variant and invariant Features: Application to Sepsis Outcome Prediction


Marta C. Ferreira*

* Technical University of Lisbon, Instituto Superior Técnico, Dept. of Mechanical Engineering, CIS/IDMEC LAETA, Av. Rovisco Pais, Lisbon, Portugal

ARTICLE INFO

Keywords: Data Mining, Machine Learning, Clustering, Time Series Analysis, Mixed Data, Septic Shock

ABSTRACT

This dissertation proposes a novel clustering method based on fuzzy c-means that is capable of handling information from time-variant and time-invariant features. The new method, Mixed Clustering, shows the advantages of successfully aggregating both data components to identify systems in a wide range of application domains, such as Medical, Management or Energy Systems. The flexible formulation of the proposed methodology can adapt to data sets with multivariate time series and to different distance-based similarity measures. In addition to the Euclidean distance, a distance based on the popular Dynamic Time Warping (DTW) method is used for time series similarity search, since it can overcome the temporal misalignment between series that is commonly found in these applications.

The contribution of the Mixed Clustering approach is demonstrated for forecasting and classification problems. The first is addressed through an application to a meteorological system, forecasting temperature and humidity based on geographical location. The method's performance as a binary classifier is demonstrated with a medical application, where the goal is to predict the outcome of a patient diagnosed with septic shock through the analysis of physiological variables measured during a sampling period and of the patient's demography, which is constant during the stay in an Intensive Care Unit. The machine learning process is tested under unsupervised and supervised alternatives. The application of the method showed that when the temporal information of the patient is poorer, the demographic information can improve the classifier's performance.

1. Introduction

1.1. Knowledge Data Discovery

Current developments in data warehousing enable the storage of increasingly large data sets, leading to a growth in the amount of information available regarding any given system, as well as in the analytical possibilities it provides. The Knowledge Data Discovery (KDD) process focuses on methodologies for extracting useful knowledge from the information available in databases (Fayyad, Piatetsky-Shapiro, & Smyth, 1996). First, the data relevant to the system under identification (target data) is selected from the available database, after which the target data is pre-processed, cleaning the information, handling missing values and adapting it to the requirements of the analysis.

Figure 1-1 KDD Process

The data is then transformed, consolidated into structures appropriate for the data mining method to be applied, in this case the Mixed Clustering, which identifies patterns in the data.

The results obtained from the mined patterns are then interpreted in the original system's field, finally yielding the useful knowledge desired.

The focus of this dissertation is the proposal of a new, efficient data mining method based on clustering, for databases combining time-variant and time-invariant features, valid for forecasting and classification problems and applicable to a diverse range of domains, from medical problems and climatic analysis to power management and economic studies. The method is designated Mixed Clustering.

1.2. Time Series Data Mining

This innovative data mining method searches for patterns and similarities in both data components, time-variant and time-invariant, combining the extracted information to better characterize the data objects. The mining of time series, and particularly their clustering, attracts the interest of researchers. The complexity of this type of data requires careful examination of the proposed algorithms (Rani & Sikka, 2012). While the time-invariant features are easily compared by a common and simple distance function, the Euclidean distance, the time-variant features, represented by time series, require a more complex analysis (Rani & Sikka, 2012). Thus, a more modern measure, Dynamic Time Warping (DTW), is implemented for the similarity search of time series.

Figure 1-2 Euclidean and DTW matching of Time Series

This similarity measure is able to handle temporally misaligned time series, identifying similar tendencies and patterns even if they are out of phase in their time of occurrence. It has been successfully applied in areas such as handwriting and online signature matching, time series database search, computer vision, surveillance and signal processing (Gaudin & Nicoloyannis, 2006).

1.3. Outline

This work is structured as follows: in section 2, the mixed clustering concept is described and the methodology presented.
In section 3, the use of the method's outputs to solve a forecasting problem is presented and applied to a Meteorological System, followed by a demonstration and discussion of the results. The method's contribution to a classification problem is demonstrated in section 4 and applied to a Medical System, followed by a demonstration and discussion of the results achieved. Finally, in section 5 the results of the different applications are reviewed and compared to previous works on the subject, concluding with a set of suggestions for further development of the study as future work.

2. Clustering

2.1. Concept

Clustering is a data mining technique that aims to group similar data objects, based on the patterns identified, while distinguishing objects with distinct behaviours. The data is divided into clusters so that intra-group differences are smaller than inter-group differences. This concept is useful in a wide range of applications, from image analysis, wireless sensor network applications and population segmentation to bioinformatics (Liao, 2005).

Often, the information that describes a system is not all represented by the same type of data: there are categorical, numerical and text features, and constant and time-varying features. In such cases, a clustering method capable of reconciling distinct data types becomes necessary. In (Izakian, Witold, & Jamal, 2013), a clustering method to handle spatiotemporal systems is proposed. These systems are characterized not only by temporal features but also by the spatial location at which they were measured. Geography, climatology and epidemiology are examples of applications relying on spatiotemporal data for their identification.

The methodology proposed in (Izakian, Witold, & Jamal, 2013) expands the Fuzzy C-Means (FCM) clustering technique (Bezdek, Ehrlich, & Full, 1984) to handle spatiotemporal data by adding a weighting parameter λ that sets the importance given to the temporal component. This parameter greatly benefits the algorithm's flexibility, allowing it to search for the best combination of temporal and spatial contributions.

The aim of this dissertation is to expand this notion of spatiotemporal data to any dataset containing different types of data, constant and time-varying, that may require specific treatment, by generalizing the spatiotemporal clustering methodology to databases with mixed data and multivariate time series. We will show that there are benefits in successfully combining both data components to model systems in a wide range of application domains, such as Medical Care, Finance, Management and Energy Systems.

2.2. Mixed Clustering Methodology

When working with a database with time-variant and time-invariant features, the input data is considered as a concatenation of both data components:

$$x_i = [x_i^s \;\; x_i^t], \quad i = 1, \dots, n \qquad (2.1)$$

The invariant component, represented by numeric values, is structured as follows:

$$x_i^s = [x_{i,1}^s, \dots, x_{i,r}^s] \qquad (2.2)$$

where r is the number of invariant features.
The time-variant data component, represented by multivariate time series, is structured as a three-dimensional array:

$$x^t = [x_{i,j,k}^t] \qquad (2.3)$$

In this format, each value is defined by three coordinates: i = 1, …, n, indicating the sample number; j = 1, …, q, the sampling point; and k = 1, …, f, the feature.

The clustering method defines a set of prototypes, or centers, for each of the c clusters, comprised of a variant and an invariant component. The invariant component's prototypes are determined by:

$$v_l^s = \frac{\sum_{i=1}^{n} u_{l,i}^m \, x_i^s}{\sum_{i=1}^{n} u_{l,i}^m} \qquad (2.4)$$

The time-variant prototypes require an expansion to deal with the increased dimensionality of the data. A three-dimensional structure was defined, with dimensions [c × q × f]:

$$v_{l,k}^t = \frac{\sum_{i=1}^{n} u_{l,i}^m \, x_{i,k}^t}{\sum_{i=1}^{n} u_{l,i}^m} \qquad (2.5)$$

where the fuzziness parameter m makes the partition fuzzier or crisper.

The membership degree $u_{l,i}$ is an element of the partition matrix U, which defines the degree to which each sample belongs to each cluster. Being a fuzzy clustering method, the membership of a sample i in a cluster l is a value in the interval [0, 1], subject to:

$$u_{l,i} \in [0, 1], \quad \sum_{l=1}^{c} u_{l,i} = 1 \quad \text{and} \quad 0 < \sum_{i=1}^{n} u_{l,i} < n$$

The similarity between a sample and a cluster is then measured by the sample's augmented distance to the cluster's center, given by:

$$d_\lambda^2(v_l, x_i) = \lVert v_l^s - x_i^s \rVert^2 + \lambda \sum_{k=1}^{f} \delta(v_{l,k}^t, x_{i,k}^t) \qquad (2.6)$$

where $\delta(v_{l,k}^t, x_{i,k}^t)$ is the distance between the k-th feature of prototype l and of sample i, calculated by the distance function used, and λ is a parameter that defines the influence given to the time-variant features. The optimal value of this parameter is determined by sequential runs of the clustering process for different values, choosing the one that generates the best performance.

By adding the distances of all features for each sample, the matrix of distances maintains its dimension [c × n], resulting in a meaningful partition matrix defined, as for a univariate time-series system, by:

$$u_{l,i} = \frac{1}{\sum_{o=1}^{c} \left( \dfrac{d_\lambda(v_l, x_i)}{d_\lambda(v_o, x_i)} \right)^{2/(m-1)}} \qquad (2.7)$$

Since the objective function J depends directly only on the distances and membership degrees, it can be defined as for a univariate time-series system:

$$J = \sum_{l=1}^{c} \sum_{i=1}^{n} u_{l,i}^m \, d_\lambda^2(v_l, x_i) \qquad (2.8)$$

The clustering process continues until the objective function converges or the maximum number of iterations is reached.

3. Forecasting Problem: Meteorological System

3.1. Modelling

The Alberta Agriculture and Rural Development organization provides current and historical weather data from approximately 340 meteorological stations located across the Canadian province, mapped in Figure 3-1. The meteorological variables available include temperature, humidity, precipitation and solar radiation, and are of great interest to users such as epidemiologists seeking to better understand, for instance, the relationships between measures of environmental health and those of animal health. This platform, available at (ARD), is also valuable for environmental or agricultural analysis.

Figure 3-1 Map of the province of Alberta, Canada. Area where the meteorological stations are located

The Alberta province covers areas with different geographical and meteorological profiles, including mountains, valleys, lakes and arid areas.
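Looking back at the update rules of Section 2.2, the alternating computation of prototypes (Eqs. 2.4 and 2.5), augmented distances (Eq. 2.6), memberships (Eq. 2.7) and objective (Eq. 2.8) might be sketched as follows. This is a minimal illustration rather than the implementation used in this work: the function names, the random initialization and the toy data are hypothetical, δ is assumed to be the squared Euclidean distance between series, and the default tolerance and iteration limit simply mirror the stopping criteria reported for the meteorological experiment.

```python
import numpy as np


def augmented_sq_distances(xs, xt, vs, vt, lam):
    """Eq. (2.6) with delta assumed to be the squared Euclidean distance."""
    # xs: (n, r), vs: (c, r) -> invariant term of shape (c, n)
    d_s = ((vs[:, None, :] - xs[None, :, :]) ** 2).sum(axis=2)
    # xt: (n, q, f), vt: (c, q, f) -> temporal term summed over time and features
    d_t = ((vt[:, None, :, :] - xt[None, :, :, :]) ** 2).sum(axis=(2, 3))
    return d_s + lam * d_t


def mixed_fcm(xs, xt, c, lam, m=2.0, eps=1e-5, max_iter=100, seed=0):
    """Alternate the updates until |delta J| < eps or max_iter is reached."""
    rng = np.random.default_rng(seed)
    n = xs.shape[0]
    u = rng.random((c, n))
    u /= u.sum(axis=0, keepdims=True)            # memberships of a sample sum to 1
    history = []
    for _ in range(max_iter):
        w = u ** m
        denom = w.sum(axis=1)
        vs = (w @ xs) / denom[:, None]                                  # Eq. (2.4)
        vt = np.einsum("ln,nqf->lqf", w, xt) / denom[:, None, None]     # Eq. (2.5)
        d2 = np.maximum(augmented_sq_distances(xs, xt, vs, vt, lam), 1e-12)
        inv = d2 ** (-1.0 / (m - 1.0))
        u = inv / inv.sum(axis=0, keepdims=True)                        # Eq. (2.7)
        history.append(float(((u ** m) * d2).sum()))                    # Eq. (2.8)
        if len(history) > 1 and abs(history[-2] - history[-1]) < eps:
            break
    return u, vs, vt, history


# Synthetic toy data (illustrative only): two well-separated groups of 10 samples,
# with r = 2 invariant features and f = 2 series of length q = 6.
rng = np.random.default_rng(1)
xs = np.vstack([rng.normal(0.0, 0.1, (10, 2)), rng.normal(5.0, 0.1, (10, 2))])
xt = np.concatenate([rng.normal(0.0, 0.1, (10, 6, 2)),
                     rng.normal(3.0, 0.1, (10, 6, 2))])
u, vs, vt, history = mixed_fcm(xs, xt, c=2, lam=1.0)
labels = u.argmax(axis=0)                        # cluster with highest membership
```

With a squared Euclidean δ, the weighted means of Eqs. (2.4) and (2.5) are the exact minimizers of J for fixed memberships, so the recorded objective values should not increase. With DTW as δ that guarantee is not obvious, which is one reason to track `history` when trying other measures.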
For these experiments, the daily average temperature and daily average humidity records were considered, taken from 1/1/2009 to 12/31/2009, forming the time-variant input features. The time-invariant features consisted of the latitude and longitude of the station at which they were measured. All stations in which all features were available with no missing values were considered, resulting in 168 samples. The time series were represented by the Discrete Fourier Transform (DFT).

The modelling parameters were the following:
Fuzziness parameter: m = 2
Number of samples: n = 168
Number of time invariant features: r = 2
Number of time variant features: f = 2
Time variant feature's length: q =

3.2. Experimental Setup

The proposed Mixed Clustering methodology was applied to the Meteorological System under two distinct criteria. The first, the Reconstruction Criterion (RC), evaluates the cluster validity, while the Prediction Criterion (PC) evaluates the method's forecasting ability.

Reconstruction Criterion

The RC assesses the quality of the clusters constructed by attempting to recreate the original data. Defining $\hat{x}$ as the reconstructed data, its invariant and variant components are respectively defined as

$$\hat{x}_i^s = \frac{\sum_{l=1}^{c} u_{l,i}^m \, v_l^s}{\sum_{l=1}^{c} u_{l,i}^m} \qquad (3.1)$$

$$\hat{x}_{i,k}^t = \frac{\sum_{l=1}^{c} u_{l,i}^m \, v_{l,k}^t}{\sum_{l=1}^{c} u_{l,i}^m}, \quad k \in [1, f] \qquad (3.2)$$

The Average Reconstruction Error (ARE) is calculated as:

$$ARE(\lambda) = \frac{1}{n} \left( \frac{1}{r} \sum_{i=1}^{n} \sum_{j=1}^{r} \frac{(\hat{x}_{i,j}^s - x_{i,j}^s)^2}{\sigma_j^2} + \frac{1}{f q} \sum_{i=1}^{n} \sum_{k=1}^{f} \sum_{j=1}^{q} \frac{(\hat{x}_{i,j,k}^t - x_{i,j,k}^t)^2}{\sigma_j^2} \right) \qquad (3.3)$$

where $\sigma_j^2$ is the variance of the j-th feature.

Prediction Criterion

The aim of the PC is to predict the temporal component of the data by using the available spatial component, minimizing the resulting error by adjusting the temporal influence parameter λ. A partition matrix is estimated from the invariant data and prototypes:

$$u_{l,i} = \frac{1}{\sum_{o=1}^{c} \left( \dfrac{\lVert v_l^s - x_i^s \rVert}{\lVert v_o^s - x_i^s \rVert} \right)^{2/(m-1)}} \qquad (3.4)$$

from which the temporal component is predicted as in Eq. (3.2). The Average Prediction Error (APE) is then calculated as:

$$APE(\lambda) = \frac{1}{n f q} \sum_{i=1}^{n} \sum_{k=1}^{f} \sum_{j=1}^{q} \frac{(\hat{x}_{i,j,k}^t - x_{i,j,k}^t)^2}{\sigma_j^2} \qquad (3.5)$$

The stopping criteria for the clustering algorithm in this experiment were the following:
Minimal variation of the objective function: $\Delta J < \varepsilon = 10^{-5}$
Maximum number of iterations: maxit = 100

3.3. Results and Discussion

Reconstruction Criterion

The RC was applied to each time-variant feature, humidity or temperature, individually and to the combination of both in a multivariate approach, each using between 2 and 5 clusters and both the Euclidean distance and DTW for similarity search. It was observed that the multivariate alternative did not improve the quality of the clusters created, according to this criterion, and that the best results were obtained for the temperature feature, with 5 clusters and the Euclidean distance. Figure 3-2 shows a plot of the analysed stations according to their geographical location, coloured according to the cluster in which they have the highest membership degree, under the best RC conditions. Four stations in different regions are highlighted.

Figure 3-2 Geographical Distribution under best RC conditions, c=5

It is clear that the method was capable of recognizing and distinguishing the areas with the most different climatic profiles.

Prediction Criterion

The PC was applied under the same experimental conditions as the RC: multivariate and univariate time series were considered, and the Euclidean distance and DTW were used as similarity measures, for a number of clusters between 2 and 10.
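The Prediction Criterion of Section 3.2 can be illustrated with a short sketch: memberships are estimated from the invariant (spatial) features alone as in Eq. (3.4), the temporal component is predicted as the membership-weighted combination of temporal prototypes as in Eq. (3.2), and the error is scored as in Eq. (3.5). The function names and the toy prototypes below are hypothetical, and passing the variance term as a value that broadcasts against the data is an assumption about how $\sigma_j^2$ is applied.

```python
import numpy as np


def memberships_from_invariant(xs, vs, m=2.0):
    """Eq. (3.4): partition matrix estimated from the invariant data only."""
    d2 = ((vs[:, None, :] - xs[None, :, :]) ** 2).sum(axis=2)    # (c, n)
    d2 = np.maximum(d2, 1e-12)                                    # guard against zero distance
    inv = d2 ** (-1.0 / (m - 1.0))
    return inv / inv.sum(axis=0, keepdims=True)


def predict_temporal(u, vt, m=2.0):
    """Eq. (3.2): membership-weighted combination of the temporal prototypes."""
    w = u ** m                                                    # (c, n)
    return np.einsum("ln,lqf->nqf", w, vt) / w.sum(axis=0)[:, None, None]


def average_prediction_error(xt_hat, xt, var):
    """Eq. (3.5): squared error scaled by a variance term, averaged over n, q and f."""
    return float((((xt_hat - xt) ** 2) / var).mean())


# Hypothetical prototypes: two clusters, r = 2 invariant features, q = 4, f = 1.
vs = np.array([[0.0, 0.0], [10.0, 10.0]])
vt = np.stack([np.zeros((4, 1)), np.ones((4, 1))])               # (c, q, f)

# Three test locations: on prototype 0, on prototype 1, and halfway between them.
xs_test = np.array([[0.0, 0.0], [10.0, 10.0], [5.0, 5.0]])
u_test = memberships_from_invariant(xs_test, vs)
xt_hat = predict_temporal(u_test, vt)
```

In this toy case the halfway location receives equal memberships, so its predicted series is the midpoint of the two temporal prototypes, while the other two locations reproduce their own cluster's prototype almost exactly.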

The best result was obtained using the multivariate approach, with the Euclidean distance and 8 clusters. These conditions were used to forecast the temperature and humidity. The total samples were separated into training and testing sets:

$$x_{train} = [x_{train}^s \;\; x_{train}^t] \qquad (3.6)$$

and

$$x_{test} = [x_{test}^s \;\; x_{test}^t] \qquad (3.7)$$

The procedure followed is described in Figure 3-3.

Figure 3-3 Workflow representing process for temporal forecasting of test set

In this experiment, around 70% of the samples were used as the train set, $n_{train} = 117$, while the rest were used as the test set. The forecasting results for humidity and temperature of one exemplary test sample, under the best conditions, are shown in Figure 3-4 and Figure 3-5, respectively.

Figure 3-4 Humidity prediction under best PC conditions

Figure 3-5 Temperature prediction under best PC conditions

In the forecasting problem, DTW did not show improvement over the Euclidean distance as a similarity measure. The multivariate approach achieved the best forecasts of temperature and humidity during 2009 at the selected stations.

4. Classification Problem: Medical System

An analogy was drawn from the spatiotemporal concept, where the geographical location becomes, in medical applications, a patient's demography: age, weight, height and sex, among other possibilities. In this equivalence, the temporal component comprises all time-varying features that characterize the system, such as heart rate, blood pressure and body temperature, measured over a period of time and represented as time series.

4.1. Modelling

Septic shock is a medical emergency that can occur as a reaction of the immune system to, for example, an operation. It is estimated to affect about 12% of patients in an Intensive Care Unit (ICU) and has a high death rate, which is reported to depend on the patient's age and overall health.

The database used, MEDAN, comprises several physiological features of patients diagnosed with abdominal septic shock, uniformly sampled during the whole period in which the patient was in the ICU (Paetz, 2003). This database was pre-processed by (Marques, Moutinho, Vieira, & Sousa, 2011), who analysed the most determinant features for outcome prediction, creating a sub-dataset of patients with measurements of 12 of the available features. This data underwent further processing, resulting in a data set with 100 samples, each comprised of:
2 time invariant features: the patient's age and weight, each represented by a numeric value;
12 time variant features representing physiological variables by time series with a sampling time of 24 hours, over the last 10 days of the patient's stay in an Intensive Care Unit;
1 outcome represented by a binary value, where 0 represents the patient's survival and 1 the patient's death.

4.2. Experimental Setup

The concept of classification based on clustering assumes that similar objects will share outcomes, and uses this knowledge to predict an object's classification. The classification approach proposed in this work is based on this concept and defines an object as belonging to a cluster if its membership degree is higher than a certain threshold. It then assumes that objects grouped together must share the same outcome. Thus, this approach is only valid for binary classifiers using two clusters, c = 2.

To evaluate the method's ability to predict an object's outcome, a 5-fold cross-validation was performed. At each fold, the train set is clustered to determine the optimal λ and the resulting prototypes v.
The membership degrees of each test set sample are then determined, depending on their distance to each cluster prototype, and the predicted outcome is assigned according to the highest membership degree.

The experiments described in this section share the following experimental conditions:
Clustering Conditions:
o Minimal variation of the objective function: $\Delta J < \varepsilon = 10^{-8}$
o Maximum number of iterations: maxit = 500
o Fuzziness parameter: m = 2
Classification Conditions:
o 5-fold cross-validation
o Class Distribution: 44%/56%

The Mixed Clustering methodology was applied under two learning approaches: unsupervised and supervised. The first partitions the data without knowledge of its outcome, while the second uses labelled samples for training, following the steps:
i Unsupervised Clustering of Train set to determine λ;
ii Supervised Clustering of Train set using λ to obtain prototypes v;
iii Unsupervised Classification of Test set using v.

The criteria implemented to evaluate the quality of the outcome prediction are frequently used in health care problems (Lavrač, 1999):
Accuracy: measures the number of correct classifications out of all samples classified;
Sensitivity: accounts for the number of correct positive classifications, out of all positive samples;
Specificity: accounts for the number of correct negative classifications, out of all negative samples.

4.3. Results and Discussion

The experiments performed with the Mixed Clustering include the use of data representations in the time domain (raw data) and the frequency domain (DFT), and of the Euclidean distance and DTW as similarity measures. In addition to the mixed clustering, an alternative clustering, designated Temporal Clustering, was tested using only the time-variant features, to assess the actual benefit of combining both information components.

A Forward Feature Selection method was used to assess the quality of each time-variant feature under all combinations of the conditions described. It was observed that the superiority of a similarity measure or time series representation method depended on the feature. The benefit of the mixed clustering over the temporal clustering was also not universal across features. It was verified that when the time-variant features were rich enough by themselves, the addition of the patient's demography misled the algorithm, leading to weaker results. However, when the temporal feature was weaker, it benefited from the mixed clustering approach.

The best overall Unsupervised Mixed Clustering result was obtained using the Euclidean distance with the DFT and one time-variant feature, no. 6, the Central Venous Pressure. Figure 4-1 shows the differences between the temporal and mixed alternatives, under unsupervised learning, for the best feature and for an example of a weaker temporal feature that benefited from the mixed clustering approach, feature 8: pH.

Figure 4-1 Unsupervised Mixed and Temporal Clustering Accuracy for features 6 and 8

It can be observed that while the addition of the patient's demography did not increase the performance of feature 6, the weaker feature 8 needed the additional information it brought. In Figure 4-2, the equivalent results are shown for the supervised learning alternative.

Figure 4-2 Supervised Mixed and Temporal Clustering Accuracy for features 6 and 8

The best result under Supervised clustering was also achieved for feature 6, using the DTW and DFT.
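As a small illustration of the classification step and evaluation criteria defined in Section 4.2, the sketch below assigns each sample the outcome of its highest-membership cluster and computes accuracy, sensitivity and specificity from the resulting labels. The function names, the `cluster_outcome` mapping from cluster index to outcome, and the example values are hypothetical; how each cluster is associated with an outcome is not detailed here, so the mapping is simply passed in as an assumption.

```python
import numpy as np


def classify_by_membership(u, cluster_outcome):
    """Assign each sample the outcome of its highest-membership cluster.

    u: (c, n) partition matrix; cluster_outcome: length-c array of 0/1 outcomes
    (a hypothetical mapping, supplied by the caller).
    """
    return np.asarray(cluster_outcome)[np.asarray(u).argmax(axis=0)]


def outcome_metrics(y_true, y_pred):
    """Accuracy, sensitivity and specificity, with 1 (death) as the positive class."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    return {
        "accuracy": (tp + tn) / len(y_true),
        "sensitivity": tp / (tp + fn) if (tp + fn) else float("nan"),
        "specificity": tn / (tn + fp) if (tn + fp) else float("nan"),
    }


# Made-up example: five samples, two clusters.
u_example = np.array([[0.1, 0.2, 0.7, 0.9, 0.4],
                      [0.9, 0.8, 0.3, 0.1, 0.6]])
y_pred = classify_by_membership(u_example, cluster_outcome=[0, 1])
y_true = np.array([1, 1, 1, 0, 0])
metrics = outcome_metrics(y_true, y_pred)
```

In this made-up example two of the three positive samples and one of the two negative samples are classified correctly.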
It is also shown that, for these features, the supervised clustering alternative improved on the results of the unsupervised alternative. This effect was not verified for all features; however, overall the supervised learning increased the performance of the features that were also the best under the unsupervised alternative, suggesting that the features most related to the outcome benefit from its inclusion in the learning process.

5. Conclusions and Future Work

A new expanded clustering algorithm was formulated to mine databases represented by both time-variant and time-invariant features, combining the information extracted to further characterize a given system. The results of the data mining and pattern recognition process were applied to machine learning purposes, where distinct methodologies were proposed to solve forecasting and classification problems, the first with a Meteorological System and the second with a Medical application, demonstrating its wide applicability.

Different measures were implemented for similarity search between time series: the commonly used Euclidean distance and the increasingly popular Dynamic Time Warping. The benefit of the joint clustering of different types of data was also demonstrated, by comparing it to the clustering of individual data types.

Table 5-1 shows the best result obtained from previous work on the same database. It should be noted that these results are not directly comparable, since the studies performed different processing on the input data and the methods used are different. The authors of (Cismondi, et al., 2012) used multi-criteria Feature Selection with Fuzzy Models (FM) and Neural Networks (NN) to predict the patient's outcome. While the FM produced the best ACC, the Mixed Clustering produced comparable results using 4 times fewer features, 2 of which were numerical values, significantly easier to measure and process.

Table 5-1 Best Mixed Clustering and best previous work result

Reference | Method | No. features | ACC (%) | Sens. (%) | Spec. (%)
(Cismondi, et al., 2012) | NN, Max. Sens. | | | |
(Cismondi, et al., 2012) | NN, Max. Spec. | | | |
(Cismondi, et al., 2012) | NN, Parallel | | | |
(Cismondi, et al., 2012) | FM, Max. Sens. | | | |
(Cismondi, et al., 2012) | FM, Max. Spec. | | | |
(Cismondi, et al., 2012) | FM, Parallel | | | |
Mixed Clustering | Unsupervised | 3* | | |
Mixed Clustering | Supervised | 3* | | |

* The Mixed Clustering used two constant features, the patient's age and weight, combined with one time-variant feature.

In addition, the Mixed Clustering method has the highest sensitivity, or true positive rate, which is crucial since the positive class represents a deceased patient.

As future work, it would be interesting to expand the clustering possibilities to any number of partitions and to databases with any number of classes. Since the DTW method is able to compare time series of different length, extending the method to form prototypes of variable length would expand the applicability of the mixed clustering method to databases with time series of different length.
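The property mentioned above, that DTW can compare series of different length and tolerate temporal misalignment, can be illustrated with a minimal sketch of the classic dynamic-programming recurrence. This is a textbook formulation with a squared pointwise cost and no warping-window constraint, not necessarily the exact variant used in this work, and the example series are made up.

```python
import numpy as np


def dtw_distance(a, b):
    """Dynamic Time Warping distance between two 1-D series of any lengths."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n, m = len(a), len(b)
    # cost[i, j] = minimal accumulated cost of aligning a[:i] with b[:j]
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (a[i - 1] - b[j - 1]) ** 2
            cost[i, j] = d + min(cost[i - 1, j],      # a[i-1] repeats the match with b[j-1]
                                 cost[i, j - 1],      # b[j-1] repeats the match with a[i-1]
                                 cost[i - 1, j - 1])  # both series advance
    return float(np.sqrt(cost[n, m]))


# The same bump, shifted by one sampling point, plus a shorter copy of it.
a = np.array([0.0, 0.0, 1.0, 2.0, 1.0, 0.0])
b = np.array([0.0, 1.0, 2.0, 1.0, 0.0, 0.0])
shorter = np.array([0.0, 1.0, 2.0, 1.0, 0.0])

euclidean_ab = float(np.linalg.norm(a - b))   # point-by-point comparison
dtw_ab = dtw_distance(a, b)                   # alignment absorbs the shift
dtw_a_shorter = dtw_distance(a, shorter)      # lengths 6 and 5 can still be compared
```

Here the point-by-point Euclidean distance penalizes the one-step shift, whereas DTW aligns the two bumps exactly; the same call also works for the shorter series, for which the Euclidean distance is not defined.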
Also, a reformulation of the method should include the possibility of using different similarity measures for each feature, as well as a different influence for each, through the implementation of separate temporal influence parameters $\lambda_i$, where i = 1, 2, …, f.

Even though one of the great advantages of data mining and soft computing techniques is their ability to treat any problem specific to a given field as a generalized system, the final step in the KDD approach is the interpretation of the results, bringing the problem back to its field and enabling practical conclusions. Thus, the medical application demonstrated would benefit from further analysis of the best features that resulted from the feature selection algorithms, possibly raising awareness of the importance of a feature in the medical community. In this context, a feature sensitivity study could also be performed on the time-variant and time-invariant features, pre-assessing the quality of the knowledge they contain. The causes of septic shock are not yet fully understood; however, some risk factors have been studied (Fink, Abraham, Vincent, & Kochanek, 2005) and could be inserted in the Mixed Clustering method as time-invariant features.

Finally, the validation of the mixed clustering methodology requires its application to problems from different domains and fields, such as Financial, Power Consumption or Surveillance applications. The use of benchmark databases can demonstrate its value against different techniques. However, due to the specific characteristics of the mixed clustering's inputs, there is a shortage of available databases (Keogh & Kasetty, 2003).

References

ARD. (n.d.). Current and Historical Alberta Weather Station Data Viewer. Retrieved May 2014.

Bezdek, J. C., Ehrlich, R., & Full, W. (1984). FCM: The fuzzy c-means clustering algorithm. Computers & Geosciences, 10.

Cismondi, F., Horn, A. L., Fialho, A. S., Vieira, S. M., Reti, S. R., Sousa, J. M., et al. (2012). Multistage Modeling Using Fuzzy Multi-criteria Feature Selection to Improve Survival Prediction of ICU Septic Shock Patients. Expert Systems with Applications, 39.

Devijver, P. A., & Kittler, J. (1982). Pattern Recognition: A Statistical Approach. Prentice-Hall.

Fayyad, U., Piatetsky-Shapiro, G., & Smyth, P. (1996). From data mining to knowledge discovery in databases. AI Magazine, 17.

Fink, M., Abraham, E., Vincent, J., & Kochanek, P. M. (2005). Septic Shock. In Textbook of Critical Care (5th ed.). Saunders Elsevier.

Gaudin, R., & Nicoloyannis, N. (2006). An Adaptable Time Warping Distance for Time Series Learning. 5th International Conference on Machine Learning and Applications (ICMLA 06). Orlando, USA.

Han, J., & Kamber, M. (2006). Data Mining: Concepts and Techniques (2nd ed.). Morgan Kaufmann Publishers.

Izakian, H., Witold, P., & Jamal, I. (2013, October). Clustering Spatiotemporal Data: An Augmented Fuzzy C-Means. IEEE Transactions on Fuzzy Systems, 21.

Keogh, E., & Kasetty, S. (2003, October). On the Need for Time Series Data Mining Benchmarks: A Survey and Empirical Demonstration. Data Mining and Knowledge Discovery, 7.

Lavrač, N. (1999). Artificial Intelligence in Medicine: Machine Learning for Data Mining in Medicine (Vol. 1620).

Liao, T. W. (2005, November). Clustering of time series data - a survey. Pattern Recognition.

Marques, F. J., Moutinho, A., Vieira, S. M., & Sousa, J. M. (2011). Preprocessing of Clinical Databases to improve classification accuracy of patient diagnosis. World Congress.

Paetz, J. (2003). Knowledge-based approach to septic shock patient data using a neural network with trapezoidal activation functions. Artificial Intelligence in Medicine, 28.

Rani, S., & Sikka, G. (2012). Recent Techniques of Clustering of Time Series Data: A Survey. International Journal of Computer Applications, 52(15).


More information

Spatial Outlier Detection

Spatial Outlier Detection Spatial Outlier Detection Chang-Tien Lu Department of Computer Science Northern Virginia Center Virginia Tech Joint work with Dechang Chen, Yufeng Kou, Jiang Zhao 1 Spatial Outlier A spatial data point

More information

Data Mining. Part 2. Data Understanding and Preparation. 2.4 Data Transformation. Spring Instructor: Dr. Masoud Yaghini. Data Transformation

Data Mining. Part 2. Data Understanding and Preparation. 2.4 Data Transformation. Spring Instructor: Dr. Masoud Yaghini. Data Transformation Data Mining Part 2. Data Understanding and Preparation 2.4 Spring 2010 Instructor: Dr. Masoud Yaghini Outline Introduction Normalization Attribute Construction Aggregation Attribute Subset Selection Discretization

More information

FUZZY C-MEANS ALGORITHM BASED ON PRETREATMENT OF SIMILARITY RELATIONTP

FUZZY C-MEANS ALGORITHM BASED ON PRETREATMENT OF SIMILARITY RELATIONTP Dynamics of Continuous, Discrete and Impulsive Systems Series B: Applications & Algorithms 14 (2007) 103-111 Copyright c 2007 Watam Press FUZZY C-MEANS ALGORITHM BASED ON PRETREATMENT OF SIMILARITY RELATIONTP

More information

Basic Data Mining Technique

Basic Data Mining Technique Basic Data Mining Technique What is classification? What is prediction? Supervised and Unsupervised Learning Decision trees Association rule K-nearest neighbor classifier Case-based reasoning Genetic algorithm

More information

EE 589 INTRODUCTION TO ARTIFICIAL NETWORK REPORT OF THE TERM PROJECT REAL TIME ODOR RECOGNATION SYSTEM FATMA ÖZYURT SANCAR

EE 589 INTRODUCTION TO ARTIFICIAL NETWORK REPORT OF THE TERM PROJECT REAL TIME ODOR RECOGNATION SYSTEM FATMA ÖZYURT SANCAR EE 589 INTRODUCTION TO ARTIFICIAL NETWORK REPORT OF THE TERM PROJECT REAL TIME ODOR RECOGNATION SYSTEM FATMA ÖZYURT SANCAR 1.Introductıon. 2.Multi Layer Perception.. 3.Fuzzy C-Means Clustering.. 4.Real

More information

Customer Clustering using RFM analysis

Customer Clustering using RFM analysis Customer Clustering using RFM analysis VASILIS AGGELIS WINBANK PIRAEUS BANK Athens GREECE AggelisV@winbank.gr DIMITRIS CHRISTODOULAKIS Computer Engineering and Informatics Department University of Patras

More information

SK International Journal of Multidisciplinary Research Hub Research Article / Survey Paper / Case Study Published By: SK Publisher

SK International Journal of Multidisciplinary Research Hub Research Article / Survey Paper / Case Study Published By: SK Publisher ISSN: 2394 3122 (Online) Volume 2, Issue 1, January 2015 Research Article / Survey Paper / Case Study Published By: SK Publisher P. Elamathi 1 M.Phil. Full Time Research Scholar Vivekanandha College of

More information

Clustering & Classification (chapter 15)

Clustering & Classification (chapter 15) Clustering & Classification (chapter 5) Kai Goebel Bill Cheetham RPI/GE Global Research goebel@cs.rpi.edu cheetham@cs.rpi.edu Outline k-means Fuzzy c-means Mountain Clustering knn Fuzzy knn Hierarchical

More information

Clustering Analysis based on Data Mining Applications Xuedong Fan

Clustering Analysis based on Data Mining Applications Xuedong Fan Applied Mechanics and Materials Online: 203-02-3 ISSN: 662-7482, Vols. 303-306, pp 026-029 doi:0.4028/www.scientific.net/amm.303-306.026 203 Trans Tech Publications, Switzerland Clustering Analysis based

More information

Data Mining. Introduction. Piotr Paszek. (Piotr Paszek) Data Mining DM KDD 1 / 44

Data Mining. Introduction. Piotr Paszek. (Piotr Paszek) Data Mining DM KDD 1 / 44 Data Mining Piotr Paszek piotr.paszek@us.edu.pl Introduction (Piotr Paszek) Data Mining DM KDD 1 / 44 Plan of the lecture 1 Data Mining (DM) 2 Knowledge Discovery in Databases (KDD) 3 CRISP-DM 4 DM software

More information

Collaborative Rough Clustering

Collaborative Rough Clustering Collaborative Rough Clustering Sushmita Mitra, Haider Banka, and Witold Pedrycz Machine Intelligence Unit, Indian Statistical Institute, Kolkata, India {sushmita, hbanka r}@isical.ac.in Dept. of Electrical

More information

Hybrid Fuzzy C-Means Clustering Technique for Gene Expression Data

Hybrid Fuzzy C-Means Clustering Technique for Gene Expression Data Hybrid Fuzzy C-Means Clustering Technique for Gene Expression Data 1 P. Valarmathie, 2 Dr MV Srinath, 3 Dr T. Ravichandran, 4 K. Dinakaran 1 Dept. of Computer Science and Engineering, Dr. MGR University,

More information

Unsupervised Learning : Clustering

Unsupervised Learning : Clustering Unsupervised Learning : Clustering Things to be Addressed Traditional Learning Models. Cluster Analysis K-means Clustering Algorithm Drawbacks of traditional clustering algorithms. Clustering as a complex

More information

Dynamic Clustering of Data with Modified K-Means Algorithm

Dynamic Clustering of Data with Modified K-Means Algorithm 2012 International Conference on Information and Computer Networks (ICICN 2012) IPCSIT vol. 27 (2012) (2012) IACSIT Press, Singapore Dynamic Clustering of Data with Modified K-Means Algorithm Ahamed Shafeeq

More information

Climate Precipitation Prediction by Neural Network

Climate Precipitation Prediction by Neural Network Journal of Mathematics and System Science 5 (205) 207-23 doi: 0.7265/259-529/205.05.005 D DAVID PUBLISHING Juliana Aparecida Anochi, Haroldo Fraga de Campos Velho 2. Applied Computing Graduate Program,

More information

Classification. Vladimir Curic. Centre for Image Analysis Swedish University of Agricultural Sciences Uppsala University

Classification. Vladimir Curic. Centre for Image Analysis Swedish University of Agricultural Sciences Uppsala University Classification Vladimir Curic Centre for Image Analysis Swedish University of Agricultural Sciences Uppsala University Outline An overview on classification Basics of classification How to choose appropriate

More information

Feature Selection. CE-725: Statistical Pattern Recognition Sharif University of Technology Spring Soleymani

Feature Selection. CE-725: Statistical Pattern Recognition Sharif University of Technology Spring Soleymani Feature Selection CE-725: Statistical Pattern Recognition Sharif University of Technology Spring 2013 Soleymani Outline Dimensionality reduction Feature selection vs. feature extraction Filter univariate

More information

CHAPTER 4 FUZZY LOGIC, K-MEANS, FUZZY C-MEANS AND BAYESIAN METHODS

CHAPTER 4 FUZZY LOGIC, K-MEANS, FUZZY C-MEANS AND BAYESIAN METHODS CHAPTER 4 FUZZY LOGIC, K-MEANS, FUZZY C-MEANS AND BAYESIAN METHODS 4.1. INTRODUCTION This chapter includes implementation and testing of the student s academic performance evaluation to achieve the objective(s)

More information

Introduction to Data Mining

Introduction to Data Mining Introduction to JULY 2011 Afsaneh Yazdani What motivated? Wide availability of huge amounts of data and the imminent need for turning such data into useful information and knowledge What motivated? Data

More information

Semi-Supervised Clustering with Partial Background Information

Semi-Supervised Clustering with Partial Background Information Semi-Supervised Clustering with Partial Background Information Jing Gao Pang-Ning Tan Haibin Cheng Abstract Incorporating background knowledge into unsupervised clustering algorithms has been the subject

More information

Knowledge Discovery. URL - Spring 2018 CS - MIA 1/22

Knowledge Discovery. URL - Spring 2018 CS - MIA 1/22 Knowledge Discovery Javier Béjar cbea URL - Spring 2018 CS - MIA 1/22 Knowledge Discovery (KDD) Knowledge Discovery in Databases (KDD) Practical application of the methodologies from machine learning/statistics

More information

Chapter 28. Outline. Definitions of Data Mining. Data Mining Concepts

Chapter 28. Outline. Definitions of Data Mining. Data Mining Concepts Chapter 28 Data Mining Concepts Outline Data Mining Data Warehousing Knowledge Discovery in Databases (KDD) Goals of Data Mining and Knowledge Discovery Association Rules Additional Data Mining Algorithms

More information

Data Cleaning and Prototyping Using K-Means to Enhance Classification Accuracy

Data Cleaning and Prototyping Using K-Means to Enhance Classification Accuracy Data Cleaning and Prototyping Using K-Means to Enhance Classification Accuracy Lutfi Fanani 1 and Nurizal Dwi Priandani 2 1 Department of Computer Science, Brawijaya University, Malang, Indonesia. 2 Department

More information

Application of Clustering as a Data Mining Tool in Bp systolic diastolic

Application of Clustering as a Data Mining Tool in Bp systolic diastolic Application of Clustering as a Data Mining Tool in Bp systolic diastolic Assist. Proffer Dr. Zeki S. Tywofik Department of Computer, Dijlah University College (DUC),Baghdad, Iraq. Assist. Lecture. Ali

More information

Keyword Extraction by KNN considering Similarity among Features

Keyword Extraction by KNN considering Similarity among Features 64 Int'l Conf. on Advances in Big Data Analytics ABDA'15 Keyword Extraction by KNN considering Similarity among Features Taeho Jo Department of Computer and Information Engineering, Inha University, Incheon,

More information

WKU-MIS-B10 Data Management: Warehousing, Analyzing, Mining, and Visualization. Management Information Systems

WKU-MIS-B10 Data Management: Warehousing, Analyzing, Mining, and Visualization. Management Information Systems Management Information Systems Management Information Systems B10. Data Management: Warehousing, Analyzing, Mining, and Visualization Code: 166137-01+02 Course: Management Information Systems Period: Spring

More information

CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS

CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS CHAPTER 4 CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS 4.1 Introduction Optical character recognition is one of

More information

Applying Supervised Learning

Applying Supervised Learning Applying Supervised Learning When to Consider Supervised Learning A supervised learning algorithm takes a known set of input data (the training set) and known responses to the data (output), and trains

More information

Data Mining and Analytics. Introduction

Data Mining and Analytics. Introduction Data Mining and Analytics Introduction Data Mining Data mining refers to extracting or mining knowledge from large amounts of data It is also termed as Knowledge Discovery from Data (KDD) Mostly, data

More information

Keywords Fuzzy, Set Theory, KDD, Data Base, Transformed Database.

Keywords Fuzzy, Set Theory, KDD, Data Base, Transformed Database. Volume 6, Issue 5, May 016 ISSN: 77 18X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Fuzzy Logic in Online

More information

APPLICATION OF MULTIPLE RANDOM CENTROID (MRC) BASED K-MEANS CLUSTERING ALGORITHM IN INSURANCE A REVIEW ARTICLE

APPLICATION OF MULTIPLE RANDOM CENTROID (MRC) BASED K-MEANS CLUSTERING ALGORITHM IN INSURANCE A REVIEW ARTICLE APPLICATION OF MULTIPLE RANDOM CENTROID (MRC) BASED K-MEANS CLUSTERING ALGORITHM IN INSURANCE A REVIEW ARTICLE Sundari NallamReddy, Samarandra Behera, Sanjeev Karadagi, Dr. Anantha Desik ABSTRACT: Tata

More information

Cluster analysis of 3D seismic data for oil and gas exploration

Cluster analysis of 3D seismic data for oil and gas exploration Data Mining VII: Data, Text and Web Mining and their Business Applications 63 Cluster analysis of 3D seismic data for oil and gas exploration D. R. S. Moraes, R. P. Espíndola, A. G. Evsukoff & N. F. F.

More information

Comparative Study of Clustering Algorithms using R

Comparative Study of Clustering Algorithms using R Comparative Study of Clustering Algorithms using R Debayan Das 1 and D. Peter Augustine 2 1 ( M.Sc Computer Science Student, Christ University, Bangalore, India) 2 (Associate Professor, Department of Computer

More information

9. Conclusions. 9.1 Definition KDD

9. Conclusions. 9.1 Definition KDD 9. Conclusions Contents of this Chapter 9.1 Course review 9.2 State-of-the-art in KDD 9.3 KDD challenges SFU, CMPT 740, 03-3, Martin Ester 419 9.1 Definition KDD [Fayyad, Piatetsky-Shapiro & Smyth 96]

More information

Time Series Clustering Ensemble Algorithm Based on Locality Preserving Projection

Time Series Clustering Ensemble Algorithm Based on Locality Preserving Projection Based on Locality Preserving Projection 2 Information & Technology College, Hebei University of Economics & Business, 05006 Shijiazhuang, China E-mail: 92475577@qq.com Xiaoqing Weng Information & Technology

More information

Data Mining Course Overview

Data Mining Course Overview Data Mining Course Overview 1 Data Mining Overview Understanding Data Classification: Decision Trees and Bayesian classifiers, ANN, SVM Association Rules Mining: APriori, FP-growth Clustering: Hierarchical

More information

CHAPTER 4 AN IMPROVED INITIALIZATION METHOD FOR FUZZY C-MEANS CLUSTERING USING DENSITY BASED APPROACH

CHAPTER 4 AN IMPROVED INITIALIZATION METHOD FOR FUZZY C-MEANS CLUSTERING USING DENSITY BASED APPROACH 37 CHAPTER 4 AN IMPROVED INITIALIZATION METHOD FOR FUZZY C-MEANS CLUSTERING USING DENSITY BASED APPROACH 4.1 INTRODUCTION Genes can belong to any genetic network and are also coordinated by many regulatory

More information

INTRODUCTION TO DATA MINING. Daniel Rodríguez, University of Alcalá

INTRODUCTION TO DATA MINING. Daniel Rodríguez, University of Alcalá INTRODUCTION TO DATA MINING Daniel Rodríguez, University of Alcalá Outline Knowledge Discovery in Datasets Model Representation Types of models Supervised Unsupervised Evaluation (Acknowledgement: Jesús

More information

Time Series Classification in Dissimilarity Spaces

Time Series Classification in Dissimilarity Spaces Proceedings 1st International Workshop on Advanced Analytics and Learning on Temporal Data AALTD 2015 Time Series Classification in Dissimilarity Spaces Brijnesh J. Jain and Stephan Spiegel Berlin Institute

More information

CHAPTER 3 TUMOR DETECTION BASED ON NEURO-FUZZY TECHNIQUE

CHAPTER 3 TUMOR DETECTION BASED ON NEURO-FUZZY TECHNIQUE 32 CHAPTER 3 TUMOR DETECTION BASED ON NEURO-FUZZY TECHNIQUE 3.1 INTRODUCTION In this chapter we present the real time implementation of an artificial neural network based on fuzzy segmentation process

More information

Keywords Clustering, Goals of clustering, clustering techniques, clustering algorithms.

Keywords Clustering, Goals of clustering, clustering techniques, clustering algorithms. Volume 3, Issue 5, May 2013 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com A Survey of Clustering

More information

Enhancing K-means Clustering Algorithm with Improved Initial Center

Enhancing K-means Clustering Algorithm with Improved Initial Center Enhancing K-means Clustering Algorithm with Improved Initial Center Madhu Yedla #1, Srinivasa Rao Pathakota #2, T M Srinivasa #3 # Department of Computer Science and Engineering, National Institute of

More information

Chapter 1, Introduction

Chapter 1, Introduction CSI 4352, Introduction to Data Mining Chapter 1, Introduction Young-Rae Cho Associate Professor Department of Computer Science Baylor University What is Data Mining? Definition Knowledge Discovery from

More information

Using Cluster Analysis in the Synthesis of Electrical Equipment Diagnostic Models

Using Cluster Analysis in the Synthesis of Electrical Equipment Diagnostic Models Using Cluster Analysis in the Synthesis of Electrical Equipment Diagnostic Models Ksenia Gnutova, Denis Eltyshev Electrotechnical Department, Perm National Research Polytechnic University, Komsomolsky

More information

Data Mining with Oracle 10g using Clustering and Classification Algorithms Nhamo Mdzingwa September 25, 2005

Data Mining with Oracle 10g using Clustering and Classification Algorithms Nhamo Mdzingwa September 25, 2005 Data Mining with Oracle 10g using Clustering and Classification Algorithms Nhamo Mdzingwa September 25, 2005 Abstract Deciding on which algorithm to use, in terms of which is the most effective and accurate

More information

Redefining and Enhancing K-means Algorithm

Redefining and Enhancing K-means Algorithm Redefining and Enhancing K-means Algorithm Nimrat Kaur Sidhu 1, Rajneet kaur 2 Research Scholar, Department of Computer Science Engineering, SGGSWU, Fatehgarh Sahib, Punjab, India 1 Assistant Professor,

More information

Statistical Analysis of Metabolomics Data. Xiuxia Du Department of Bioinformatics & Genomics University of North Carolina at Charlotte

Statistical Analysis of Metabolomics Data. Xiuxia Du Department of Bioinformatics & Genomics University of North Carolina at Charlotte Statistical Analysis of Metabolomics Data Xiuxia Du Department of Bioinformatics & Genomics University of North Carolina at Charlotte Outline Introduction Data pre-treatment 1. Normalization 2. Centering,

More information

Computational Intelligence Meets the NetFlix Prize

Computational Intelligence Meets the NetFlix Prize Computational Intelligence Meets the NetFlix Prize Ryan J. Meuth, Paul Robinette, Donald C. Wunsch II Abstract The NetFlix Prize is a research contest that will award $1 Million to the first group to improve

More information

International Journal of Data Mining & Knowledge Management Process (IJDKP) Vol.7, No.3, May Dr.Zakea Il-Agure and Mr.Hicham Noureddine Itani

International Journal of Data Mining & Knowledge Management Process (IJDKP) Vol.7, No.3, May Dr.Zakea Il-Agure and Mr.Hicham Noureddine Itani LINK MINING PROCESS Dr.Zakea Il-Agure and Mr.Hicham Noureddine Itani Higher Colleges of Technology, United Arab Emirates ABSTRACT Many data mining and knowledge discovery methodologies and process models

More information

NEURAL NETWORKS ... FEATURE SELECTION USING ANT COLONY OPTIMIZATION: APPLICATIONS IN HEALTH CARE. Motivation. Outline.

NEURAL NETWORKS ... FEATURE SELECTION USING ANT COLONY OPTIMIZATION: APPLICATIONS IN HEALTH CARE. Motivation. Outline. Motivation FEATURE SELECTION USING ANT COLONY OPTIMIZATION: APPLICATIONS IN HEALTH CARE João M. C. Sousa jmsousa@ist.utl.pt S. M. Vieira, S. N. Finkelstein 2,3, A. S. Fialho,2, F. Cismondi,2, S. R. Reti

More information

DATA MINING AND WAREHOUSING

DATA MINING AND WAREHOUSING DATA MINING AND WAREHOUSING Qno Question Answer 1 Define data warehouse? Data warehouse is a subject oriented, integrated, time-variant, and nonvolatile collection of data that supports management's decision-making

More information

Fuzzy Ant Clustering by Centroid Positioning

Fuzzy Ant Clustering by Centroid Positioning Fuzzy Ant Clustering by Centroid Positioning Parag M. Kanade and Lawrence O. Hall Computer Science & Engineering Dept University of South Florida, Tampa FL 33620 @csee.usf.edu Abstract We

More information

6. Dicretization methods 6.1 The purpose of discretization

6. Dicretization methods 6.1 The purpose of discretization 6. Dicretization methods 6.1 The purpose of discretization Often data are given in the form of continuous values. If their number is huge, model building for such data can be difficult. Moreover, many

More information

Improving Tree-Based Classification Rules Using a Particle Swarm Optimization

Improving Tree-Based Classification Rules Using a Particle Swarm Optimization Improving Tree-Based Classification Rules Using a Particle Swarm Optimization Chi-Hyuck Jun *, Yun-Ju Cho, and Hyeseon Lee Department of Industrial and Management Engineering Pohang University of Science

More information

3-D MRI Brain Scan Classification Using A Point Series Based Representation

3-D MRI Brain Scan Classification Using A Point Series Based Representation 3-D MRI Brain Scan Classification Using A Point Series Based Representation Akadej Udomchaiporn 1, Frans Coenen 1, Marta García-Fiñana 2, and Vanessa Sluming 3 1 Department of Computer Science, University

More information

Biclustering Bioinformatics Data Sets. A Possibilistic Approach

Biclustering Bioinformatics Data Sets. A Possibilistic Approach Possibilistic algorithm Bioinformatics Data Sets: A Possibilistic Approach Dept Computer and Information Sciences, University of Genova ITALY EMFCSC Erice 20/4/2007 Bioinformatics Data Sets Outline Introduction

More information

Outlier Detection Using Unsupervised and Semi-Supervised Technique on High Dimensional Data

Outlier Detection Using Unsupervised and Semi-Supervised Technique on High Dimensional Data Outlier Detection Using Unsupervised and Semi-Supervised Technique on High Dimensional Data Ms. Gayatri Attarde 1, Prof. Aarti Deshpande 2 M. E Student, Department of Computer Engineering, GHRCCEM, University

More information

Online Pattern Recognition in Multivariate Data Streams using Unsupervised Learning

Online Pattern Recognition in Multivariate Data Streams using Unsupervised Learning Online Pattern Recognition in Multivariate Data Streams using Unsupervised Learning Devina Desai ddevina1@csee.umbc.edu Tim Oates oates@csee.umbc.edu Vishal Shanbhag vshan1@csee.umbc.edu Machine Learning

More information

Seminars of Software and Services for the Information Society

Seminars of Software and Services for the Information Society DIPARTIMENTO DI INGEGNERIA INFORMATICA AUTOMATICA E GESTIONALE ANTONIO RUBERTI Master of Science in Engineering in Computer Science (MSE-CS) Seminars in Software and Services for the Information Society

More information

University of Florida CISE department Gator Engineering. Data Preprocessing. Dr. Sanjay Ranka

University of Florida CISE department Gator Engineering. Data Preprocessing. Dr. Sanjay Ranka Data Preprocessing Dr. Sanjay Ranka Professor Computer and Information Science and Engineering University of Florida, Gainesville ranka@cise.ufl.edu Data Preprocessing What preprocessing step can or should

More information

Image Mining: frameworks and techniques

Image Mining: frameworks and techniques Image Mining: frameworks and techniques Madhumathi.k 1, Dr.Antony Selvadoss Thanamani 2 M.Phil, Department of computer science, NGM College, Pollachi, Coimbatore, India 1 HOD Department of Computer Science,

More information

Dynamic Data in terms of Data Mining Streams

Dynamic Data in terms of Data Mining Streams International Journal of Computer Science and Software Engineering Volume 1, Number 1 (2015), pp. 25-31 International Research Publication House http://www.irphouse.com Dynamic Data in terms of Data Mining

More information

Density Based Clustering using Modified PSO based Neighbor Selection

Density Based Clustering using Modified PSO based Neighbor Selection Density Based Clustering using Modified PSO based Neighbor Selection K. Nafees Ahmed Research Scholar, Dept of Computer Science Jamal Mohamed College (Autonomous), Tiruchirappalli, India nafeesjmc@gmail.com

More information

In the recent past, the World Wide Web has been witnessing an. explosive growth. All the leading web search engines, namely, Google,

In the recent past, the World Wide Web has been witnessing an. explosive growth. All the leading web search engines, namely, Google, 1 1.1 Introduction In the recent past, the World Wide Web has been witnessing an explosive growth. All the leading web search engines, namely, Google, Yahoo, Askjeeves, etc. are vying with each other to

More information

Intelligent Risk Identification and Analysis in IT Network Systems

Intelligent Risk Identification and Analysis in IT Network Systems Intelligent Risk Identification and Analysis in IT Network Systems Masoud Mohammadian University of Canberra, Faculty of Information Sciences and Engineering, Canberra, ACT 2616, Australia masoud.mohammadian@canberra.edu.au

More information

Available online Journal of Scientific and Engineering Research, 2019, 6(1): Research Article

Available online   Journal of Scientific and Engineering Research, 2019, 6(1): Research Article Available online www.jsaer.com, 2019, 6(1):193-197 Research Article ISSN: 2394-2630 CODEN(USA): JSERBR An Enhanced Application of Fuzzy C-Mean Algorithm in Image Segmentation Process BAAH Barida 1, ITUMA

More information

A Comparative Study of Locality Preserving Projection and Principle Component Analysis on Classification Performance Using Logistic Regression

A Comparative Study of Locality Preserving Projection and Principle Component Analysis on Classification Performance Using Logistic Regression Journal of Data Analysis and Information Processing, 2016, 4, 55-63 Published Online May 2016 in SciRes. http://www.scirp.org/journal/jdaip http://dx.doi.org/10.4236/jdaip.2016.42005 A Comparative Study

More information

Data Preprocessing. Data Preprocessing

Data Preprocessing. Data Preprocessing Data Preprocessing Dr. Sanjay Ranka Professor Computer and Information Science and Engineering University of Florida, Gainesville ranka@cise.ufl.edu Data Preprocessing What preprocessing step can or should

More information

Review on Methods of Selecting Number of Hidden Nodes in Artificial Neural Network

Review on Methods of Selecting Number of Hidden Nodes in Artificial Neural Network Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 3, Issue. 11, November 2014,

More information

The Comparison of CBA Algorithm and CBS Algorithm for Meteorological Data Classification Mohammad Iqbal, Imam Mukhlash, Hanim Maria Astuti

The Comparison of CBA Algorithm and CBS Algorithm for Meteorological Data Classification Mohammad Iqbal, Imam Mukhlash, Hanim Maria Astuti Information Systems International Conference (ISICO), 2 4 December 2013 The Comparison of CBA Algorithm and CBS Algorithm for Meteorological Data Classification Mohammad Iqbal, Imam Mukhlash, Hanim Maria

More information

ARTICLE; BIOINFORMATICS Clustering performance comparison using K-means and expectation maximization algorithms

ARTICLE; BIOINFORMATICS Clustering performance comparison using K-means and expectation maximization algorithms Biotechnology & Biotechnological Equipment, 2014 Vol. 28, No. S1, S44 S48, http://dx.doi.org/10.1080/13102818.2014.949045 ARTICLE; BIOINFORMATICS Clustering performance comparison using K-means and expectation

More information

A Novel Approach for Minimum Spanning Tree Based Clustering Algorithm

A Novel Approach for Minimum Spanning Tree Based Clustering Algorithm IJCSES International Journal of Computer Sciences and Engineering Systems, Vol. 5, No. 2, April 2011 CSES International 2011 ISSN 0973-4406 A Novel Approach for Minimum Spanning Tree Based Clustering Algorithm

More information

The k-means Algorithm and Genetic Algorithm

The k-means Algorithm and Genetic Algorithm The k-means Algorithm and Genetic Algorithm k-means algorithm Genetic algorithm Rough set approach Fuzzy set approaches Chapter 8 2 The K-Means Algorithm The K-Means algorithm is a simple yet effective

More information

Cluster Analysis. Mu-Chun Su. Department of Computer Science and Information Engineering National Central University 2003/3/11 1

Cluster Analysis. Mu-Chun Su. Department of Computer Science and Information Engineering National Central University 2003/3/11 1 Cluster Analysis Mu-Chun Su Department of Computer Science and Information Engineering National Central University 2003/3/11 1 Introduction Cluster analysis is the formal study of algorithms and methods

More information

CHAPTER 6 MODIFIED FUZZY TECHNIQUES BASED IMAGE SEGMENTATION

CHAPTER 6 MODIFIED FUZZY TECHNIQUES BASED IMAGE SEGMENTATION CHAPTER 6 MODIFIED FUZZY TECHNIQUES BASED IMAGE SEGMENTATION 6.1 INTRODUCTION Fuzzy logic based computational techniques are becoming increasingly important in the medical image analysis arena. The significant

More information

Remotely Sensed Image Processing Service Automatic Composition

Remotely Sensed Image Processing Service Automatic Composition Remotely Sensed Image Processing Service Automatic Composition Xiaoxia Yang Supervised by Qing Zhu State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing, Wuhan University

More information

CS490D: Introduction to Data Mining Prof. Chris Clifton

CS490D: Introduction to Data Mining Prof. Chris Clifton CS490D: Introduction to Data Mining Prof. Chris Clifton April 5, 2004 Mining of Time Series Data Time-series database Mining Time-Series and Sequence Data Consists of sequences of values or events changing

More information

Data Mining. Chapter 1: Introduction. Adapted from materials by Jiawei Han, Micheline Kamber, and Jian Pei

Data Mining. Chapter 1: Introduction. Adapted from materials by Jiawei Han, Micheline Kamber, and Jian Pei Data Mining Chapter 1: Introduction Adapted from materials by Jiawei Han, Micheline Kamber, and Jian Pei 1 Any Question? Just Ask 3 Chapter 1. Introduction Why Data Mining? What Is Data Mining? A Multi-Dimensional

More information

Simulation of Zhang Suen Algorithm using Feed- Forward Neural Networks

Simulation of Zhang Suen Algorithm using Feed- Forward Neural Networks Simulation of Zhang Suen Algorithm using Feed- Forward Neural Networks Ritika Luthra Research Scholar Chandigarh University Gulshan Goyal Associate Professor Chandigarh University ABSTRACT Image Skeletonization

More information

Data Mining. 3.2 Decision Tree Classifier. Fall Instructor: Dr. Masoud Yaghini. Chapter 5: Decision Tree Classifier

Data Mining. 3.2 Decision Tree Classifier. Fall Instructor: Dr. Masoud Yaghini. Chapter 5: Decision Tree Classifier Data Mining 3.2 Decision Tree Classifier Fall 2008 Instructor: Dr. Masoud Yaghini Outline Introduction Basic Algorithm for Decision Tree Induction Attribute Selection Measures Information Gain Gain Ratio

More information

NORMALIZATION INDEXING BASED ENHANCED GROUPING K-MEAN ALGORITHM

NORMALIZATION INDEXING BASED ENHANCED GROUPING K-MEAN ALGORITHM NORMALIZATION INDEXING BASED ENHANCED GROUPING K-MEAN ALGORITHM Saroj 1, Ms. Kavita2 1 Student of Masters of Technology, 2 Assistant Professor Department of Computer Science and Engineering JCDM college

More information

Iteration Reduction K Means Clustering Algorithm

Iteration Reduction K Means Clustering Algorithm Iteration Reduction K Means Clustering Algorithm Kedar Sawant 1 and Snehal Bhogan 2 1 Department of Computer Engineering, Agnel Institute of Technology and Design, Assagao, Goa 403507, India 2 Department

More information

Machine Learning with MATLAB --classification

Machine Learning with MATLAB --classification Machine Learning with MATLAB --classification Stanley Liang, PhD York University Classification the definition In machine learning and statistics, classification is the problem of identifying to which

More information

International Journal of Scientific Research & Engineering Trends Volume 4, Issue 6, Nov-Dec-2018, ISSN (Online): X

International Journal of Scientific Research & Engineering Trends Volume 4, Issue 6, Nov-Dec-2018, ISSN (Online): X Analysis about Classification Techniques on Categorical Data in Data Mining Assistant Professor P. Meena Department of Computer Science Adhiyaman Arts and Science College for Women Uthangarai, Krishnagiri,

More information

Research on Applications of Data Mining in Electronic Commerce. Xiuping YANG 1, a

Research on Applications of Data Mining in Electronic Commerce. Xiuping YANG 1, a International Conference on Education Technology, Management and Humanities Science (ETMHS 2015) Research on Applications of Data Mining in Electronic Commerce Xiuping YANG 1, a 1 Computer Science Department,

More information