Genetic-PSO Fuzzy Data Mining With Divide and Conquer Strategy

Amin Jourabloo
Department of Computer Engineering, Sharif University of Technology, Tehran, Iran
E-mail: jourabloo@ce.sharif.edu

Abstract - The discovery of association rules is an important and active area of data mining research. These rules describe noteworthy association relationships among different attributes. While most studies have focused on binary-valued transaction data, the data in real-world applications usually consist of quantitative values. With that in mind, this paper proposes a fuzzy data mining algorithm for extracting membership functions from quantitative transactions. It is a hybrid genetic-PSO algorithm that finds membership functions suitable for mining problems through a close cooperation of GA and PSO. The algorithm integrates the two techniques for the entire run of the simulation: in each iteration, one part of the population is replaced by new individuals generated by GA, while the other part is carried over from the previous generation but moved through the solution space by PSO. At the end, the best final sets of membership functions in all the populations are gathered to be used for mining fuzzy association rules. According to the experimental results, the proposed genetic-PSO fuzzy data mining algorithm improves the fitness of the membership functions.

Keywords: data mining, fuzzy sets, genetic algorithms (GA), Particle Swarm Optimization (PSO), membership functions.

1 Introduction

An important area of data mining research deals with the discovery of association rules, which describe interesting association relationships among different attributes []. Association rule techniques are generally applied to databases of transactions where each transaction consists of a set of items [6]. Let us consider a database of customer transactions T, where each transaction is a set of items.
The objective is to find all rules of the form X => Y, which correlate the presence of one set of items X with another set of items Y. An example of such a rule is: 98% of people who purchase diapers and baby food also buy baby soap [].

Most previous studies have focused on binary-valued transaction data. Transaction data in real-world applications, however, usually consist of quantitative values. Designing a data mining algorithm able to deal with various types of data is a challenge for researchers in this field.

Recently, fuzzy set theory has been used more and more frequently in intelligent systems because of its simplicity and similarity to human reasoning. The theory has been applied in fields such as manufacturing, engineering, diagnosis, and economics, among others. Several fuzzy learning algorithms for inducing rules from given sets of data have been designed and used to good effect in specific domains [7].

Evolutionary computing (EC) is an exciting development in mining algorithms. It amounts to building, applying and studying algorithms based on Darwinian principles of natural selection [1, 2]. Genetic algorithms (GAs) are a family of computational models developed by Holland [3, 4]. GAs and Particle Swarm Optimization (PSO) are both population-based algorithms that have proven successful in solving a variety of difficult problems. However, both models have strengths and weaknesses. Comparisons between GAs and PSO have been performed by Eberhart and Angeline, and both conclude that a hybrid of the standard GA and PSO models could lead to further advances [9, 10].

Several hybrid GA-PSO algorithms have been designed. Settles and Soule [8] proposed an algorithm that combines the standard velocity and position update rules of PSO with the ideas of selection, crossover and mutation from GAs. The algorithm is designed so that the GA facilitates a global search and the PSO performs a local search [4]. In 2006, Esmin et al. [11] proposed a new model called Hybrid Particle Swarm Optimizer with Mutation (HPSOM), which integrates the mutation process often used in GA into PSO. This process allows the search to escape from local optima and to explore different zones of the search space.

This paper thus proposes a fuzzy data mining algorithm for extracting both association rules and membership functions from quantitative transactions. A hybrid genetic-PSO algorithm for finding membership functions suitable for mining problems is proposed. It relies on a close cooperation of GA and PSO, since it maintains the integration of the two techniques for the entire run of the simulation. In each iteration, some of the individuals are replaced by newly generated ones by means of GA, while the remaining part is carried over from the previous generation but moved through the solution space by PSO. Comparing the two techniques, PSO usually has a faster initial convergence rate than GA, but it is often outperformed by GA over long simulation runs.
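To make the rule form X => Y from the start of this section concrete, the following Python sketch computes the two measures usually attached to such a rule on binary transaction data: the support of an itemset and the confidence of a rule. It is a minimal illustration only. The function names and the toy transactions are assumptions made here, not data from the paper, and the quantitative (fuzzy) case treated in the rest of the paper replaces the simple counting with fuzzy membership values.

```python
from typing import FrozenSet, Iterable, List


def support(transactions: List[FrozenSet[str]], itemset: Iterable[str]) -> float:
    """Fraction of transactions that contain every item of `itemset`."""
    items = frozenset(itemset)
    if not transactions:
        return 0.0
    hits = sum(1 for t in transactions if items <= t)
    return hits / len(transactions)


def confidence(transactions: List[FrozenSet[str]],
               antecedent: Iterable[str],
               consequent: Iterable[str]) -> float:
    """Confidence of the rule antecedent => consequent."""
    x = frozenset(antecedent)
    xy = x | frozenset(consequent)
    support_x = support(transactions, x)
    # A rule whose antecedent never occurs has no meaningful confidence.
    return support(transactions, xy) / support_x if support_x > 0 else 0.0


# Toy data (hypothetical): four customer transactions.
transactions = [
    frozenset({"diapers", "baby food", "baby soap"}),
    frozenset({"diapers", "baby food", "baby soap"}),
    frozenset({"diapers", "baby food"}),
    frozenset({"milk"}),
]

rule_support = support(transactions, {"diapers", "baby food", "baby soap"})
rule_confidence = confidence(transactions, {"diapers", "baby food"}, {"baby soap"})
print(rule_support, rule_confidence)
```

In this toy data the antecedent {diapers, baby food} occurs in three of the four transactions and the full itemset in two, so the rule holds for two thirds of the customers who bought the antecedent.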
2 GA-PSO Mining framework based on the divide-and-conquer strategy

In this section, the fuzzy and GA-PSO concepts are used to discover both useful association rules and suitable membership functions from quantitative values. A GA-PSO framework with the divide-and-conquer strategy is proposed for searching for membership functions suitable for the mining problems. The final best sets of membership functions in all the populations are then gathered together to be used for mining fuzzy association rules. The proposed framework is shown in Fig. 1.

Fig. 1. The proposed GA-PSO flowchart for fuzzy data mining.

The proposed framework in Fig. 1 is divided into two phases: mining membership functions and mining fuzzy association rules. Assume the number of items is m. In the phase of mining membership functions, the framework maintains m populations of membership functions, one population for each item. Each chromosome in a population represents a possible set of membership functions for that item. The chromosomes in the same population are of the same length. The proposed mechanism then chooses appropriate strings for mating, gradually creating good offspring sets of membership functions. The offspring sets of membership functions undergo recursive evolution until a good set of membership functions has been obtained. Next, in the phase of mining fuzzy association rules, the sets of membership functions for all the items are gathered together and used to mine the interesting fuzzy association rules from the given quantitative database.

2.1 Genetic Algorithm

Genetic algorithms were first introduced by Holland in the early 1970s [3] and have been widely successful in optimization problems. The genetic operators used in this paper are described in [7] and [17]. In the representation step, each set of membership functions is encoded as a chromosome and handled as an individual with a real-number schema. Genetic operators are very important to the success of a specific GA application. The crossover and mutation operators chosen in this paper are the max-min-arithmetical (MMA) crossover proposed in [17] and the one-point mutation proposed in [7].

2.1.1 Fitness

The fitness value of each set of membership functions is determined according to two factors: the suitability of the membership functions and the fuzzy supports of the large 1-itemsets, as described in [7]. Following [7], the fitness value of a chromosome $C_q$ is defined as

$$f(C_q) = \frac{\sum_{X \in L_1} \mathit{fuzzy\_support}(X)}{\mathit{suitability}(C_q)} \qquad (1)$$

where $L_1$ is the set of large 1-itemsets obtained by using the set of membership functions in $C_q$.

2.2 PSO

The Particle Swarm Optimization (PSO) algorithm is a relatively new optimization algorithm inspired by social behavior in nature. Like genetic algorithms, PSO is a population-based optimization method that searches multiple solutions in parallel. However, PSO employs a cooperative strategy, unlike GA, which uses a competitive strategy. During each generation, each particle is accelerated toward its own previous best position and the global best position. At each iteration, a new velocity value for each particle is calculated based on its current velocity, the distance from its previous best position, and the distance from the global best position. The new velocity value is then used to calculate the next position of the particle in the search space. This process is iterated a set number of times or until a minimum error is achieved. In the inertia version of the algorithm, an inertia weight, reduced linearly each generation, is multiplied by the current velocity, and the other two components are weighted randomly to produce a new velocity value for the particle; this in turn affects the position of the particle during the next generation. Thus, the governing equations are:

$$V_i(t) = \omega V_i(t-1) + c_1 \phi_1 \left(P_i - X_i(t-1)\right) + c_2 \phi_2 \left(P_g - X_i(t-1)\right)$$
$$X_i(t) = X_i(t-1) + V_i(t) \qquad (2)$$
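As a rough illustration of the update in equation (2), the Python sketch below advances one particle by one generation. It assumes that a particle's position is a flat list of real numbers (for instance, the parameters of its membership functions) and that the two random weights are drawn once per particle from [0, 1]. The function name `pso_step` and the parameter values in the usage lines are illustrative choices made here, not values taken from the paper.

```python
import random
from typing import List, Tuple


def pso_step(position: List[float],
             velocity: List[float],
             personal_best: List[float],
             global_best: List[float],
             omega: float,
             c1: float,
             c2: float,
             rng: random.Random) -> Tuple[List[float], List[float]]:
    """Apply one velocity and position update to a single particle."""
    # One random weight per attraction term, each uniform in [0, 1).
    phi1 = rng.random()
    phi2 = rng.random()
    new_velocity = [
        omega * v + c1 * phi1 * (p - x) + c2 * phi2 * (g - x)
        for x, v, p, g in zip(position, velocity, personal_best, global_best)
    ]
    # The new position is the old position shifted by the new velocity.
    new_position = [x + v for x, v in zip(position, new_velocity)]
    return new_position, new_velocity


# Usage with arbitrary illustrative values.
rng = random.Random(0)
x, v = pso_step(position=[1.0, 2.0], velocity=[0.5, -1.0],
                personal_best=[1.5, 1.0], global_best=[2.0, 0.0],
                omega=0.7, c1=1.5, c2=1.5, rng=rng)
print(x, v)
```

When a particle already sits on both its personal best and the global best, the two attraction terms vanish and only the inertia term moves it, which is a convenient sanity check for an implementation.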
In equation (2), t is the generation number, $X_i$ is the membership function of particle i, $V_i$ is its velocity, $P_i$ is its previous best membership function, and $P_g$ is the membership function of the global best particle. The parameter $\omega$ is the inertia weight, $c_1$ and $c_2$ are social parameters, and $\phi_1$ and $\phi_2$ are random numbers in the range [0.0, 1.0].

2.3 GA-PSO

GA and PSO are very similar in their inherent parallel characteristics, whereas experiments show that each has specific advantages when solving different problems. Our aim is to exploit the strengths of both by combining the two algorithms.

3 The proposed mining algorithm

The input is a body of n quantitative transaction data, a set of m items, each with a number of predefined linguistic terms, a support threshold α, and a population size P. The output is a set of membership functions for extracting association rules. The algorithm proceeds as follows:

1. Randomly generate m populations, one for each item. Each individual in a population represents a possible set of membership functions for that item. Encode each set of membership functions into a string representation.
2. Calculate the fitness value of each chromosome in each population, and randomly divide each population into two parts.
3. Execute the GA on part one. Use a selection operation to choose membership functions; any selection operation, such as the elitism selection strategy or the roulette selection strategy, may be used here. Then execute the crossover operations followed by the mutation operations in each population.
4. Execute PSO on part two. Find the best membership function in each population, update the global best membership function, and then update the velocities and the membership functions.
5. Merge the two parts to create the new population, and calculate the fitness value of each chromosome in each population.
6. If the termination criterion is not satisfied, divide the new population into two parts again and repeat the GA and PSO steps on them. Otherwise, gather the sets of membership functions that have the highest fitness value in their respective populations.

Note that the termination criterion may be the number of iterations, the allowed execution time, or the convergence of the fitness values. The proposed mining algorithm is illustrated in Fig. 1.

3.1 An Example

In this section, a simple example is given to show how the proposed algorithm can be used to mine membership functions from quantitative data. Assume that the transaction database contains one item, Milk, and that the data set includes six transactions. Assume the item has three fuzzy regions: Low, Middle, and High. Thus, three fuzzy membership functions must be derived for the item. The population is randomly generated. Assume the population size is ten in this example. To compare the proposed algorithm with GA and PSO, the best fitness in each generation obtained with GA, PSO and GA-PSO is shown in Fig. 2.

Fig. 2. Comparison of the proposed algorithm with GA and PSO.

For a statistical analysis of the proposed mining algorithm, the results of 32 runs of GA and of GA-PSO are shown in Fig. 3 and Fig. 4, respectively.

Fig. 3. Statistical analysis for 32 runs of GA (Min, Max, Average).

4 Experimental Results

In this section, the experiments conducted to show the performance of the proposed approach are described. They were implemented in Matlab on an Intel Core 2 personal computer with 2.00 GHz and GB RAM. The initial population size P was set at 0, the crossover rate Pc was set at 0.8, and the mutation rate Pm was set at 0.0 according to [8]. The parameter d of the crossover operator was set at 0.3 according to [7], the parameter of the mutation operator was set at 3, the minimum support α was set at 0.04 (4%), the inertia weight $\omega$ was set at 0.86, and the social parameters $c_1$ and $c_2$ were set at 0.2.
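To show where the parameters listed above (population size, Pc, Pm, the inertia weight and the two social parameters) enter the procedure of Section 3, the following Python sketch runs the divide, GA, PSO and merge loop on a single population. It is a simplified sketch under stated assumptions, not the Matlab implementation used in the experiments: the fitness is a toy function standing in for the fuzzy-support fitness, truncation selection and a plain arithmetic crossover stand in for the selection step and the MMA crossover, GA offspring restart with zero velocity, and all function names and default values are hypothetical.

```python
import random


def toy_fitness(x):
    """Hypothetical stand-in fitness; maximal (0) when every gene equals 0.5."""
    return -sum((xi - 0.5) ** 2 for xi in x)


def make_particle(x):
    """Wrap a chromosome with zero velocity and itself as personal best."""
    return {"x": list(x), "v": [0.0] * len(x), "best": list(x)}


def ga_part(chromosomes, fitness, pc, pm, rng):
    """Produce len(chromosomes) offspring by selection, crossover, mutation."""
    if len(chromosomes) < 2:
        return [list(c) for c in chromosomes]
    ranked = sorted(chromosomes, key=fitness, reverse=True)
    parents = ranked[: max(2, len(ranked) // 2)]  # truncation selection (assumed)
    children = []
    while len(children) < len(chromosomes):
        a, b = rng.sample(parents, 2)
        if rng.random() < pc:  # arithmetic crossover (assumed, not MMA)
            lam = rng.random()
            child = [lam * ai + (1 - lam) * bi for ai, bi in zip(a, b)]
        else:
            child = list(a)
        if rng.random() < pm:  # one-point mutation on a random gene
            k = rng.randrange(len(child))
            child[k] += rng.uniform(-0.1, 0.1)
        children.append(child)
    return children


def pso_part(particles, gbest, fitness, omega, c1, c2, rng):
    """Move each particle with the velocity and position update of PSO."""
    for p in particles:
        phi1, phi2 = rng.random(), rng.random()
        p["v"] = [omega * v + c1 * phi1 * (b - x) + c2 * phi2 * (g - x)
                  for x, v, b, g in zip(p["x"], p["v"], p["best"], gbest)]
        p["x"] = [x + v for x, v in zip(p["x"], p["v"])]
        if fitness(p["x"]) > fitness(p["best"]):
            p["best"] = list(p["x"])
    return particles


def hybrid_ga_pso(fitness, dim, pop_size=10, iterations=50,
                  pc=0.8, pm=0.1, omega=0.7, c1=1.5, c2=1.5, seed=0):
    """Return the best chromosome found by the hybrid loop."""
    rng = random.Random(seed)
    pop = [make_particle([rng.random() for _ in range(dim)])
           for _ in range(pop_size)]
    gbest = list(max((p["best"] for p in pop), key=fitness))
    for _ in range(iterations):
        rng.shuffle(pop)                      # random division into two parts
        half = pop_size // 2
        ga_half, pso_half = pop[:half], pop[half:]
        offspring = ga_part([p["x"] for p in ga_half], fitness, pc, pm, rng)
        ga_half = [make_particle(c) for c in offspring]
        pso_half = pso_part(pso_half, gbest, fitness, omega, c1, c2, rng)
        pop = ga_half + pso_half              # merge into the new population
        candidate = max((p["best"] for p in pop), key=fitness)
        if fitness(candidate) > fitness(gbest):
            gbest = list(candidate)
    return gbest


best = hybrid_ga_pso(toy_fitness, dim=3)
print(best, toy_fitness(best))
```

In the framework of Section 2 this loop would be run once per item, with each chromosome encoding that item's membership functions and with a fixed iteration count as only one of the possible termination criteria.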
Simulated datasets with 3 items and different dataset sizes of up to 20 k transactions were used in the experiments. Table 1 gives the numerical results of running GA, PSO and GA-PSO on databases of different sizes up to 20 k; the results show that GA-PSO has the best performance. The relationship between the average fitness and the database size is shown in Fig. 5, and the relationship between the execution time and the database size is shown in Fig. 6.

Fig. 4. Statistical analysis for 32 runs of GA-PSO.

Table 1
Algorithm (size), Generation, Fitness1, Fitness2, Fitness3, Average Fitness, Time (minute)
GA( KB) 0.40 0.7 0.49 0.4
PSO( KB) 0.70 3
GA-PSO( KB) 0.98 0.90 0.8 0.9 4
GA(0 KB) 0.30 0.8 0.39 0.42 4
PSO(0 KB) 0.87
GA-PSO(0 KB) 0.9 0.9 0.96 7
GA( KB) 0.07 0.3 0.46 0.28 2
PSO( KB) 0.47 0.82 0.93 8
GA-PSO( KB) 0.9 0.94 0.94 32
GA(20 KB) 0.2 0.0 0.4 0.49 24
PSO(20 KB) 0.79 0.86 0.8 0.9 0.9 0.93
GA-PSO(20 KB)

Fig. 5. The relationship between the average fitness and the database size.

Fig. 6. The relationship between the execution time and the database size.
5 Conclusions

In this paper, we have proposed a GA-PSO fuzzy data mining algorithm for extracting both association rules and membership functions from quantitative transactions. The experimental results show that the proposed genetic-PSO fuzzy mining algorithm improves the fitness of the membership functions. In the future, we will continue to enhance the GA-PSO based mining framework for more complex problems.

6 References

[1] A.E. Eiben, M. Schoenauer, Evolutionary computing, Inform. Process. Lett. 82 (1) (2002) 1-6.
[2] A.E. Eiben, J.E. Smith, Introduction to Evolutionary Computing, Springer, Berlin, 2003.
[3] J.H. Holland, Adaptation in Natural and Artificial Systems, The University of Michigan Press, Ann Arbor, MI, 1975.
[4] D.E. Goldberg, Genetic Algorithms in Search, Optimization & Machine Learning, Addison-Wesley, Reading, MA, 1989.
[5] Sushmita Mitra, Tinku Acharya, Data Mining: Multimedia, Soft Computing, and Bioinformatics, John Wiley & Sons, Inc., 2003 (page 64 & 267).
[6] Olivia Parr Rud, Data Mining Cookbook: Modeling Data for Marketing, Risk, and Customer Relationship Management, 2001, page 3.
[7] Tzung-Pei Hong, Chun-Hao Chen, Yeong-Chyi Lee, and Yu-Lung Wu, Genetic-Fuzzy Data Mining With Divide-and-Conquer Strategy, IEEE Transactions on Evolutionary Computation, vol. 12, no. 2, April 2008.
[8] Matthew Settles, Terence Soule, Breeding Swarms: A GA-PSO Hybrid, GECCO, ACM, 2005, 1-59593-010.
[9] R. Eberhart and Y. Shi, Comparison between genetic algorithms and particle swarm optimization. In V. W. Porto et al., editors, Evolutionary Programming, volume 1447 of Lecture Notes in Computer Science, pages 611-616. Springer, 1998.
[10] P. Angeline, Evolutionary optimization versus particle swarm optimization: Philosophy and performance differences. In V. W. Porto et al., editors, Evolutionary Programming, volume 1447 of Lecture Notes in Computer Science, pages 601-610. Springer, 1998.
[11] A. A. A. Esmin, G. Lambert-Torres, G. B. Alvarenga, Hybrid Evolutionary Algorithm Based on PSO and GA mutation, Proceedings of the Sixth International Conference on Hybrid Intelligent Systems (HIS'06), 0-7695-2662-4/06, IEEE, 2006.
[12] X.H. Shi, Y.C. Liang, H.P. Lee, C. Lu, L.M. Wang, An improved GA and a novel PSO-GA-based hybrid algorithm, Information Processing Letters 93 (2005) 255-261.
[13] O. Cordón, F. Herrera, and P. Villar, Generating the knowledge base of a fuzzy rule-based system by the genetic learning of the data base, IEEE Trans. Fuzzy Systems, vol. 9, no. 4, 2001.
[14] A. Parodi and P. Bonelli, A new approach of fuzzy classifier systems, in Proc. 5th Int. Conf. Genetic Algorithms, 1993, pp. 223-230.
[15] C. H. Wang, T. P. Hong, and S. S. Tseng, Integrating fuzzy knowledge by genetic algorithms, IEEE Trans. Evol. Comput., vol. 2, no. 4, pp. 138-149, 1998.
[16] C. H. Wang, T. P. Hong, and S. S. Tseng, Integrating membership functions and fuzzy rule sets from multiple knowledge sources, Fuzzy Sets Syst., vol. 112, pp. 141-154, 2000.
[17] F. Herrera, M. Lozano, and J. L. Verdegay, Fuzzy connectives based crossover operators to model genetic algorithms population diversity, Fuzzy Sets Syst., vol. 92, no. 1, pp. 21-30, 1997.
[18] M. Srinivas and L. M. Patnaik, Genetic algorithms: A survey, Computer, vol. 27, no. 6, pp. 17-26, 1994.