Hybrid Algorithm for predict heart disease



International Research Journal of Applied and Basic Sciences 2015 Available online at www.irjabs.

ISSN X / Vol, 9 (3): Science Explorer Publications

Mitra Mohamadi 1*
1 Department of Computer Engineering, Malayer Branch, Islamic Azad University, Malayer, Iran
* Corresponding author: mitramohamadi374@yahoo.com

ABSTRACT: The remarkable growth of heart disease, together with its consequences and the high costs it imposes on society, has led the medical community to seek plans for further investigation, prevention, early detection, and effective treatment. To build the predictive models, several techniques were used: the CART decision tree, k-nearest neighbor, and a k-nearest neighbor improved by the particle swarm ("birds") algorithm. Extending the models and involving new parameters can give more reliable results. Using the results of these models to predict heart disease may help reduce the incidence of, and mortality from, heart disease.

Keywords: data mining, heart disease, CART decision tree, k-nearest neighbor algorithm, PSO.

DATA MINING AND ITS IMPORTANCE

Today, with the development of information systems and the high volume of data stored in them, there is a need for tools that can process the stored data and put the resulting information in the hands of users. Knowledge discovery is the process of identifying understandable and potentially useful patterns in large amounts of data [1,2]. Data mining is one stage of the knowledge discovery process: the stage in which the data are actually mined [3]. It is a process that uses data analysis tools to discover patterns and relationships in the available data that may lead to the extraction of new information from the database [3]. Data mining involves using analysis tools to explore valid, previously unknown patterns in large volumes of data.
Data mining algorithms are used in a variety of professional fields [4]. Several definitions have been given:
- Data mining is the process of extracting valid, previously unknown, understandable, and reliable information from large databases and using it to make decisions in important business activities [5].
- Data mining is a process that applies intelligent techniques to extract knowledge from a set of data [6].
- Data mining means searching a database to find patterns among the data [6].
- Data mining discovers interesting and valuable structures in a vast collection of data; it is essentially an activity of detailed data analysis [5].

Importance of the issue: Figures from the World Health Organization (WHO) for 2005 show that cardiovascular disease claimed 17.5 million victims, 30% of all deaths worldwide, and the figure is expected to rise to 23 million by 2030. Reports from Iran show that 38% of all deaths are related to heart disease. According to research, the share of heart disease in annual deaths in Kermanshah Province is more than 40% [7]. Diagnosing heart disease is a significant and complex task in medicine that must be carried out accurately and efficiently. The availability of data mining tools for analyzing massive collections of medical data makes sound analysis possible in this field. Using medical information such as age, gender, blood pressure, and blood sugar, the probability of heart disease can be predicted. The data should be collected and organized; the collected data can then be integrated into a prevention system [8, 9].

Diagnosis and prediction of heart attacks using a genetic K-means clustering algorithm: Clustering is one of the main data mining tasks. It aims to sort data into meaningful classes (clusters) so that the similarity among data within a cluster is highest and the similarity between data from two different clusters is lowest. The study in [10] presents a genetic K-means clustering algorithm for data with mixed attribute types.
By changing the way cluster centers are described, the proposed algorithm addresses a limitation of the genetic K-means algorithm and thus identifies clusters better. With these new features, the clustering algorithm suits data (such as heart disease patient data) whose attributes are in many cases categorical or numeric, and it was used for the diagnosis and prediction of heart attacks [10].

Improving the accuracy of the KNN data mining algorithm using association rules: The study in [11] offers a new classification algorithm that uses association rules within the KNN algorithm in order to increase KNN's classification accuracy. A weight is allocated to each attribute: an attribute with more weight has more influence on the calculated distance between two records, and an attribute with less weight has less influence. Practical tests showed that the proposed algorithm is more accurate than the NBTree, C4.5, NB, NN, LWL, IBL, and VFI algorithms [11].

K-nearest neighbor algorithm: KNN is one of the most important classification algorithms and is used in many fields. To classify a record, the algorithm calculates the distance between that record and every record in the training set, selects the K most similar records (its nearest neighbors), and assigns the new record the class label held by the majority of them. The distance is calculated with the Euclidean distance formula [12]. If records with n attributes are represented as n-dimensional vectors:

$$X = (x_1, x_2, x_3, \ldots, x_n) \tag{1}$$

$$Y = (y_1, y_2, y_3, \ldots, y_n) \tag{2}$$

$$\mathrm{DIST}(X, Y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \tag{3}$$

After the distances are calculated with the above formula, the K most similar records are chosen and their labels are used to label the new data. The nearest neighbor technique rests on this principle: items located close to each other are expected to have similar values. So if we know the value for one item, we can also predict the value for its close neighbors.

Database: The database used in this study consists of data on heart patients of Imam Ali (RA) Hospital, Kermanshah. It contains 396 records after preparation and cleaning in SQL Server, during which the useful records were identified and the unusable records were eliminated.
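Returning to the k-nearest-neighbor rule and Equation (3) above, the procedure can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the Matlab implementation used in the study: the function names `euclidean` and `knn_predict` and the toy records are hypothetical.

```python
import math
from collections import Counter

def euclidean(x, y):
    """Equation (3): square root of the summed squared differences."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def knn_predict(train_rows, train_labels, query, k):
    """Label `query` with the majority class among its k nearest training rows."""
    ranked = sorted(
        zip(train_rows, train_labels),
        key=lambda pair: euclidean(pair[0], query),
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical toy records: (age, blood pressure); 1 = heart disease, 0 = healthy
rows = [(40, 120), (42, 118), (65, 160), (70, 155), (68, 150)]
labels = [0, 0, 1, 1, 1]
prediction = knn_predict(rows, labels, (66, 158), k=3)
```

Here the three nearest neighbors of the query all carry label 1, so the query is labeled 1. In practice the attributes would usually be normalized first so that no single attribute dominates the distance.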
The database has 12 fields. Using them with the prediction models, we predict whether a person may have heart disease or not. The parameters in the database fall into two categories, input and output: heart disease is the output parameter and the other parameters are inputs. The input parameters are listed in Table 1.

Table 1 - Features used in predicting cardiovascular disease

Attribute        Comments
Age              Patient age
Blood-sugar      Blood sugar
Disease          Disease (other than heart disease)
Heart beat       Heart rate
Cholesterol      Cholesterol level
HDL              High-density cholesterol
LDL              Low-density cholesterol
Smoking          Smoking
Gender           Patient gender
Blood pressure   Blood pressure
PTT              Screening test that assesses the blood's ability to form a clot appropriately


Table 1 lists the features used in predicting cardiovascular disease; 11 parameters were studied.

Figure 1 - Applying normalization

Feature selection model: Feature selection techniques are used to reduce the number of attributes before a data mining algorithm is applied. The technique reports the importance of each field as a percentage, and from this percentage one can judge whether the field needs to take part in data mining or not. As shown in Figure 2, the Disease and HDL fields are unnecessary, so there is no need to include them in data mining and model development. The other fields are shown in order of their importance and influence on the target field.

Figure 2 - Feature Selection model

Decision trees: Among classification methods, the decision tree is one of the most important and most widely used [13]. A decision tree is a flowchart-like structure in which each internal node is a test on an attribute, each branch is an outcome of the test, and each leaf node holds a class label. Given a record whose class is unknown, its attribute values are tested at the tree nodes, and a path is traced from the root of the tree to a leaf node, which gives the record's label. Decision trees are popular for classification because of their simplicity and the speed with which they are built. In general, decision trees have good accuracy, although successful use may depend on the data at hand. Decision trees are generally built with a top-down, recursive divide-and-conquer approach that tries to partition the input variable space into the terminal nodes. A number of algorithms can be used to build a decision tree, including C5.0, CHAID, CART, and QUEST. The size of the tree can be controlled through stopping rules that halt its growth.

C&R algorithm: CART is a tree-based classification and prediction algorithm. It was first designed by Olshen, Friedman, Breiman, and Stone [14] in 1984. At each step the training records are split into two subsets so that the records in each subset are more homogeneous than those in the previous set, and the procedure continues until one of the stopping criteria is met. In CART, splits are chosen using an impurity measure. Impurity here refers to how mixed the target field values are among the records that reach a node. A predictor field may be used several times at different levels of the tree. All splits made by the algorithm are binary, meaning that each node is split into exactly two subgroups. The predictor and target fields may be numeric or categorical.

Figure 3 - C&R model: recognition accuracy (%) on the training set and on the test set.

The training set includes 300 records and the test set includes 96 records. After the model was run, the importance of the characteristics that influence the C&R model is as shown in Figure 4.

Figure 4 - The importance of characteristics in the C&R model
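To make the idea of impurity-driven binary splitting concrete, the following Python sketch scores candidate thresholds on a single numeric field. It assumes the Gini index as the impurity measure, which is a common choice for CART but is not named in the text; the function names and the toy blood-pressure data are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a node: 0.0 when every record has the same class."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_binary_split(values, labels):
    """Return (threshold, weighted impurity) of the best split `value <= threshold`."""
    best = (None, gini(labels))  # start from the impurity of the unsplit node
    for t in sorted(set(values))[:-1]:  # the largest value would leave one side empty
        left = [lab for v, lab in zip(values, labels) if v <= t]
        right = [lab for v, lab in zip(values, labels) if v > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (t, score)
    return best

# Hypothetical blood-pressure readings with disease labels (1 = disease)
pressure = [110, 120, 130, 150, 160, 170]
disease = [0, 0, 0, 1, 1, 1]
threshold, impurity = best_binary_split(pressure, disease)
```

On this toy data the split `pressure <= 130` separates the two classes completely, so the weighted impurity of the two child nodes drops to zero. A full tree builder would apply the same search recursively to each child until a stopping rule is met.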

Figure 4 shows the importance of the fields, that is, their dependence on the target variable: blood pressure has the highest importance and blood sugar the lowest importance for prediction with the C&R model.

K-nearest neighbor algorithm: KNN is one of the most important classification algorithms and is used in many fields. To classify a record, the algorithm calculates the distance between that record and every record in the training set. It then selects the K most similar records, its nearest neighbors, and gives the new record the class label held by the majority of them. The distance is calculated with the Euclidean distance formula [15]. After the distances are calculated with that formula, the K most similar records are chosen and their labels are used to label the new data. In this algorithm, all attributes have the same effect on the calculated distance between the new record and its neighboring records, even though some of these attributes may be irrelevant to the classification. This misleads the classification process and reduces the accuracy of the algorithm. In this study, to solve this problem, particle swarm optimization (PSO) is used to allocate weights to the features and improve the accuracy of KNN.

K-nearest neighbor model on the data: To develop the k-nearest neighbor model, the data set was randomly divided into training and test parts of 75% and 25%. The algorithm was implemented in Matlab 2012 with different values of k; in the end, it was observed that k = 7 gives a better result than other values of k. The accuracy of the k-nearest neighbor model is shown in Table 2.
Table 2 - Accuracy of the k-nearest neighbor model

Train performance   90%
Test performance    %

Table 2 shows that the model's recognition accuracy is 90% on the training set, alongside its accuracy (%) on the test set.

Particle swarm optimization (PSO) algorithm: Particle swarm optimization is a population-based stochastic optimization technique introduced in 1995 by Dr. Eberhart and Dr. Kennedy. The basic idea of the method comes from the collective behavior of fish or birds searching for food. In PSO each solution, called a particle, is the equivalent of a bird in a flock. Each particle has a fitness value computed by a fitness function. The closer a particle in the search space is to the goal (food, in the bird model), the fitter it is. Each particle also has a velocity that directs its motion. Each particle keeps moving through the problem space by following the particles that are currently optimal. In this way every particle tries to adjust its path and move toward its best personal experience and the best collective experience, and so toward the final solution.

PSO starts with a group of particles (solutions) generated at random and then searches for the optimal solution by updating the generations. At every step, each particle is updated using two best values. The first is the best position the particle itself has reached so far; this position is known as pbest. The second best value used by the algorithm is the best position reached so far by the whole population of particles; this position is denoted gbest. After the best values are found, the velocity and position of each particle are updated with the following relations:

$$V[t+1] = W \cdot V[t] + C_1 \cdot \mathrm{rand}(t) \cdot (\mathrm{pbest}[t] - \mathrm{Position}[t]) + C_2 \cdot \mathrm{rand}(t) \cdot (\mathrm{gbest}[t] - \mathrm{Position}[t]) \tag{4}$$

$$\mathrm{Position}[t+1] = \mathrm{Position}[t] + V[t+1] \tag{5}$$

In relations 4 and 5, V[t] is the particle velocity and Position[t] is the current position of the particle; both are arrays whose length equals the number of dimensions of the problem.
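The update rules in relations (4) and (5) can be sketched in Python as follows. This is an illustrative sketch rather than the study's Matlab code: the names `pbest` and `gbest` follow the usual PSO convention, the inertia weight is assumed to decrease linearly from `w_start` to `w_end`, the parameter values are arbitrary, and the `sphere` test function is hypothetical.

```python
import random

def pso(fitness, dim, n_particles=20, iters=100,
        w_start=0.7, w_end=0.4, c1=1.5, c2=1.5, seed=0):
    """Minimize `fitness` over `dim` dimensions with a basic particle swarm."""
    rng = random.Random(seed)
    pos = [[rng.uniform(0, 1) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]                  # best position seen by each particle
    pbest_fit = [fitness(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_fit[i])
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]  # best position seen by the swarm
    for t in range(iters):
        # Assumed schedule: inertia weight decreases linearly over the run
        w = w_start - (w_start - w_end) * t / max(iters - 1, 1)
        for i in range(n_particles):
            for d in range(dim):
                # Relation (4): inertia + pull toward pbest + pull toward gbest
                vel[i][d] = (w * vel[i][d]
                             + c1 * rng.random() * (pbest[i][d] - pos[i][d])
                             + c2 * rng.random() * (gbest[d] - pos[i][d]))
                # Relation (5): move the particle with its new velocity
                pos[i][d] += vel[i][d]
            f = fitness(pos[i])
            if f < pbest_fit[i]:
                pbest[i], pbest_fit[i] = pos[i][:], f
                if f < gbest_fit:
                    gbest, gbest_fit = pos[i][:], f
    return gbest, gbest_fit

def sphere(p):
    """Hypothetical test function with its minimum at (0.3, 0.7)."""
    return (p[0] - 0.3) ** 2 + (p[1] - 0.7) ** 2

best, best_fit = pso(sphere, dim=2)
```

Each particle's velocity blends its previous velocity with random pulls toward its own best position and the swarm's best position; the fixed seed only makes the run repeatable.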
rand is a Matlab function that produces a random number in the interval (0, 1). C1 and C2 are learning parameters, and W is the inertia weight (ω), which provides a good balance between global search and local search. A decreasing function is used for w. Initially, a larger share of the current velocity carries over into the future velocity, and with the passage of time this share is reduced. In other words, at the outset the particles tend toward exploratory movement and new experiences, and over time they increasingly follow the best particles. In many cases this method could still become trapped in a local minimum.

The right side of equation 4 has three parts: the first is the current velocity, and the second and third change the velocity and turn it toward the best personal experience and the best experience of the group. If the first part is removed from the equation, the particle velocity is determined only by the current position, the particle's best experience, and the group's best experience. Thus the best particle stays fixed in place and the others move toward it. Indeed, the swarm without the first part of equation 4 is a process in which the search space gradually shrinks and a local search takes shape around the best particle. Conversely, if only the first part of equation 4 is kept, the particles continue on their own way until they reach the boundary, performing a global search. The most important advantages of PSO, which have led to its widespread use, are its simplicity of application, small number of parameters, and high speed [16].

Improving k-nearest neighbor with the PSO algorithm: The KNN algorithm uses all features in the distance calculation for classification [17]. All properties of the records therefore play the same role in classification, and irrelevant features can cause two records that are close to each other to be identified as far apart, so that the right classification does not take place. This is called the curse of dimensionality [17]. To solve the problem, when the distance between two records is calculated, features that are more important should have more impact than features that are less important. For this purpose, a weight w_i is defined for each feature i. The larger the weight of a feature, the greater its impact on the distance calculation. If the database has n features, we define an n-dimensional weight vector w = (w1, w2, ..., wn), and the formula for the distance between two records is modified to include these weights [18, 15, 17]. This type of distance calculation takes into account not only the values of the features but also their importance, which improves classification accuracy. Clearly, the more accurate the weights, the more accurate the classification; but if bad weights are selected, classification accuracy may even fall below what it was before [18]. That is, the goal of the optimization problem is to minimize the classification error.

The dataset was randomly divided into training and test parts in proportions of 75% and 25%. The algorithm was run in Matlab with different values of k; finally, it was observed that k = 4 gives a better result than other values of k. The accuracy of the improved k-nearest neighbor algorithm is shown in Table 3.
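The weighted distance and the error that PSO would minimize can be sketched as below. The exact weighted formula is not legible in the text, so the sketch assumes one common form in which each squared difference in the Euclidean distance is multiplied by its feature weight w_i. The function names and the toy data are hypothetical.

```python
import math
from collections import Counter

def weighted_distance(x, y, w):
    """Euclidean distance with each squared difference scaled by a feature weight."""
    return math.sqrt(sum(wi * (xi - yi) ** 2 for xi, yi, wi in zip(x, y, w)))

def knn_error(weights, train, train_labels, test, test_labels, k):
    """Fraction of test rows misclassified by weighted KNN (the value to minimize)."""
    wrong = 0
    for row, truth in zip(test, test_labels):
        ranked = sorted(zip(train, train_labels),
                        key=lambda pair: weighted_distance(pair[0], row, weights))
        vote = Counter(label for _, label in ranked[:k]).most_common(1)[0][0]
        wrong += vote != truth
    return wrong / len(test)

# Hypothetical data: feature 0 separates the classes, feature 1 is large-scale noise
train = [(0.1, 9.0), (0.2, 2.0), (0.8, 10.5), (0.9, 0.5)]
train_labels = [0, 0, 1, 1]
test = [(0.15, 0.5), (0.85, 9.0)]
test_labels = [0, 1]

error_equal = knn_error((1, 1), train, train_labels, test, test_labels, k=1)
error_weighted = knn_error((1, 0), train, train_labels, test, test_labels, k=1)
```

With equal weights the noisy feature dominates the distance and both test rows are misclassified; with the weight of the noisy feature set to zero both are classified correctly. An optimizer such as PSO would search over the weight vector, using `knn_error` on held-out data as its fitness function.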
Table 3 - Accuracy of the improved k-nearest neighbor algorithm

Train performance   95%
Test performance    81.83%

Table 3 shows that the model has a recognition accuracy of 95% on the training set and 81.83% on the test set. As can be seen, improving k-nearest neighbor with the particle swarm algorithm gives an increase of about 14 percent in prediction accuracy on the training data and a 7% increase in accuracy on the test data.

Confusion matrix: The confusion matrix, or contingency table, is a visual tool for displaying classification accuracy; it shows the relationship between the predicted results and the actual results [19]. Its format is given in Table 4, in which:
* TP: the number of correct predictions of the positive class
* FN: the number of false predictions of the negative class
* FP: the number of false predictions of the positive class
* TN: the number of correct predictions of the negative class

Table 4 - Confusion matrix

                          PREDICTED CLASS
                          Positive    Negative
ACTUAL CLASS   Positive   a (TP)      b (FN)
               Negative   c (FP)      d (TN)

On the basis of Table 4, the following formulas are used for assessing the models.

$$\mathrm{Accuracy} = \frac{a + d}{a + b + c + d} = \frac{TP + TN}{TP + TN + FP + FN} \tag{3}$$

$$\mathrm{Error} = \frac{b + c}{a + b + c + d} = \frac{FN + FP}{TP + TN + FP + FN} \tag{4}$$

These measures were applied to each model, and the analysis is as follows.

Evaluation of the C&R model: In this study, the confusion matrix of the model is first obtained with the software, and then the Error and Accuracy values are calculated from it. The confusion matrix of the CART model is shown in Figure 5.

Figure 5 - Confusion matrix of the C&R model

According to the accuracy and error formulas, the results are as follows:

$$\mathrm{ACCURACY} = \frac{251}{396} \approx 0.6338 = 63.38\%$$

$$\mathrm{ERROR} = \frac{145}{396} \approx 0.366 = 36.6\%$$

Evaluation of the k-nearest neighbor model: After the k-nearest neighbor model was implemented in Matlab 2012, the confusion matrices for the training and test data sets are as given in the tables.

Table 5 - Evaluation of the k-nearest neighbor model (K Nearest Neighborhood)
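As a quick check of the arithmetic, the accuracy and error formulas can be evaluated in Python. Only the totals are taken from the text (251 correct and 145 incorrect predictions out of 396 records); the split into the four individual cells below is hypothetical and chosen only so that the totals match.

```python
def accuracy(tp, fn, fp, tn):
    """Share of all predictions that are correct."""
    return (tp + tn) / (tp + fn + fp + tn)

def error_rate(tp, fn, fp, tn):
    """Share of all predictions that are wrong."""
    return (fn + fp) / (tp + fn + fp + tn)

# Hypothetical cells: tp + tn = 251 correct, fn + fp = 145 wrong, 396 in total
tp, fn, fp, tn = 150, 80, 65, 101

acc = accuracy(tp, fn, fp, tn)
err = error_rate(tp, fn, fp, tn)
print(f"accuracy = {acc:.2%}, error = {err:.2%}")
```

Accuracy and error always sum to 1, so either value determines the other.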

Table 6 - Confusion matrix of the test set, k-nearest neighbor model (K Nearest Neighborhood)

Table 7 - Confusion matrix of the training set, improved k-nearest neighbor model (Improved KNN with PSO algorithm)

Table 8 - Confusion matrix of the test set, improved k-nearest neighbor model (Improved KNN with PSO algorithm)

Comparison of the results

To compare the proposed method with the other methods, Table 9 lists all the models discussed together with their classification accuracy and error.

Table 9 - Results of the models used in the study

Model                         Accuracy   Error
Improved k-nearest neighbor   81.81%     18.19%
K-nearest neighbor            73.73%     26.27%
C&R                           63.38%     36.61%

REFERENCES
[1] Amir Amiri and Vahid Rafe, "Hybrid Algorithm for Detecting Diabetes", International Research Journal of Applied and Basic Sciences, Vol, 8 (12):
[2] Amir Amiri and Vahid Rafe, "Diagnosing diabetes using data mining algorithms and artificial intelligence systems", Elixir Comp. Engg. 78 (2015)
[3] Krzysztof J. Cios and Lukasz A. Kurgan, "Trends in Data Mining and Knowledge Discovery", Advanced Information and Knowledge Processing, pp 1-26, 2005.
[4] Prodromidis A.L., Stolfo S., "Agent-Based Distributed Learning Applied to Fraud Detection", Sixteenth National Conference on Artificial Intelligence, 1999.
[5] Parthiban L., Subramanian R., "Intelligent Heart Disease Prediction System using CANFIS and Genetic Algorithm", International Journal of Biological and Life Sciences, 2007.
[6] Phua C., Alahakoon D. and Lee V., "Report in Fraud Detection: Classification of Skewed Data", 2004.
[7]
[8] Rani B.K., Srinivas R.K., Govrdhan A., "Applications of Data Mining Techniques in Healthcare and Prediction of Heart Attacks", (IJCSE) International Journal on Computer Science and Engineering pp ,
[9] Jyoti Soni U.A., Sharma D., "Predictive Data Mining for Medical Diagnosis: An Overview of Heart Disease Prediction", International Journal of Computer Applications ( ), vol. 17 No.8, pp , March
[10] Dehghani T., Afshari Saleh M., Khalilzadeh M., "A genetic K-means clustering algorithm for heart disease data", 5th Conference of Data Mining of Iran, Amirkabir University, 2011.
[11] Bradley P., Fayyad U. and Reina C., "Scaling Clustering Algorithms to Large Databases", Proceedings of the 4th International Conference on Knowledge Discovery and Data Mining, Menlo Park, California, pp. 9-15, 1998.
[12] Gyorodi C., Gyorodi R., Holban S., "A Comparative Study of Association Rules Mining Algorithms", SACI st Romanian-Hungarian Joint Symposium on Applied Computational Intelligence, Timisoara, Romania, May 26-26, 2004 page
[13] Han J. and Kamber M., Data Mining: Concepts and Techniques, Second Edition, Morgan Kaufmann Publishers,
[14] Alpaydin E., "Introduction to Machine Learning", The MIT Press, Cambridge,
[15] Larose D.T., "Discovering Knowledge in Data: An Introduction to Data Mining", New Jersey,
[16] Aqueel A., S.A. Hannan, "Data Mining Techniques to Find Out Heart Diseases: An Overview", International Journal of Innovative Technology and Exploring Engineering (IJITEE), vol 11, pp , September
[17] Zhan Y., Chen H. and Zhang G.C., "An Optimization Algorithm of K-NN Classification", Proceedings of the Fifth International Conference on Machine Learning and Cybernetics, Dalian, 13-16, August
[18] Shamsul Huda Md., Rokibul Alam Md., Mutsuddi K., "A Dynamic K-Nearest Neighbor Algorithm for Pattern Analysis Problem", 3rd International Conference on Electrical & Computer Engineering, Dhaka, Bangladesh, ICECE, December 28-30,
[19] Chaitrali P., Dangare Sulabha S., Apte S., "Improved Study of Heart Disease Prediction System using Data Mining Classification Techniques", International Journal of Computer Applications, ( ), vol 47 No.10, pp , June
