Efficient Distributed Data Mining using Intelligent Agents

Similar documents
An Improved Apriori Algorithm for Association Rules

Temporal Weighted Association Rule Mining for Classification

To Enhance Projection Scalability of Item Transactions by Parallel and Partition Projection using Dynamic Data Set

Improved Frequent Pattern Mining Algorithm with Indexing

An Evolutionary Algorithm for Mining Association Rules Using Boolean Approach

An Efficient Algorithm for Finding the Support Count of Frequent 1-Itemsets in Frequent Pattern Mining

Discovering interesting rules from financial data

Discovery of Multi-level Association Rules from Primitive Level Frequent Patterns Tree

Understanding Rule Behavior through Apriori Algorithm over Social Network Data

Mining Quantitative Association Rules on Overlapped Intervals

A mining method for tracking changes in temporal association rules from an encoded database

PSON: A Parallelized SON Algorithm with MapReduce for Mining Frequent Sets

Pattern Mining. Knowledge Discovery and Data Mining 1. Roman Kern KTI, TU Graz. Roman Kern (KTI, TU Graz) Pattern Mining / 42

A Comparative Study of Data Mining Process Models (KDD, CRISP-DM and SEMMA)

Using Association Rules for Better Treatment of Missing Values

Mining High Average-Utility Itemsets

A Data Mining Framework for Extracting Product Sales Patterns in Retail Store Transactions Using Association Rules: A Case Study

ANU MLSS 2010: Data Mining. Part 2: Association rule mining

Enhancing Forecasting Performance of Naïve-Bayes Classifiers with Discretization Techniques

MINING ASSOCIATION RULE FOR HORIZONTALLY PARTITIONED DATABASES USING CK SECURE SUM TECHNIQUE

Mining High Order Decision Rules

Efficient Mining of Generalized Negative Association Rules

Data Mining. Introduction. Piotr Paszek. (Piotr Paszek) Data Mining DM KDD 1 / 44

Dynamic Optimization of Generalized SQL Queries with Horizontal Aggregations Using K-Means Clustering

LOAD BALANCING IN MOBILE INTELLIGENT AGENTS FRAMEWORK USING DATA MINING CLASSIFICATION TECHNIQUES

COMPARISON OF K-MEAN ALGORITHM & APRIORI ALGORITHM AN ANALYSIS

A Decremental Algorithm for Maintaining Frequent Itemsets in Dynamic Databases *

Clustering Algorithms In Data Mining

Tadeusz Morzy, Maciej Zakrzewicz

EFFICIENT ALGORITHM FOR MINING FREQUENT ITEMSETS USING CLUSTERING TECHNIQUES

C-NBC: Neighborhood-Based Clustering with Constraints

An Efficient Reduced Pattern Count Tree Method for Discovering Most Accurate Set of Frequent itemsets

Appropriate Item Partition for Improving the Mining Performance

Graph Based Approach for Finding Frequent Itemsets to Discover Association Rules

Frequent Item Set using Apriori and Map Reduce algorithm: An Application in Inventory Management

Materialized Data Mining Views *

CSE 626: Data mining. Instructor: Sargur N. Srihari. Phone: , ext. 113

Transforming Quantitative Transactional Databases into Binary Tables for Association Rule Mining Using the Apriori Algorithm

Association Rule Learning

Performance Analysis of Data Mining Classification Techniques

DMSA TECHNIQUE FOR FINDING SIGNIFICANT PATTERNS IN LARGE DATABASE

A Comparative Study of Association Rules Mining Algorithms

Database and Knowledge-Base Systems: Data Mining. Martin Ester

AN IMPROVISED FREQUENT PATTERN TREE BASED ASSOCIATION RULE MINING TECHNIQUE WITH MINING FREQUENT ITEM SETS ALGORITHM AND A MODIFIED HEADER TABLE

Improving Quality of Products in Hard Drive Manufacturing by Decision Tree Technique

Structure of Association Rule Classifiers: a Review

Elena Marchiori Free University Amsterdam, Faculty of Science, Department of Mathematics and Computer Science, Amsterdam, The Netherlands

Data Mining Part 3. Associations Rules

SD-Map A Fast Algorithm for Exhaustive Subgroup Discovery

A Novel method for Frequent Pattern Mining

Mining of Web Server Logs using Extended Apriori Algorithm

Induction of Association Rules: Apriori Implementation

Parallel Implementation of Apriori Algorithm Based on MapReduce

A Survey on Moving Towards Frequent Pattern Growth for Infrequent Weighted Itemset Mining

MySQL Data Mining: Extending MySQL to support data mining primitives (demo)

Analysis of a Population of Diabetic Patients Databases in Weka Tool P.Yasodha, M. Kannan

International Journal of Advanced Research in Computer Science and Software Engineering

Overview. Data-mining. Commercial & Scientific Applications. Ongoing Research Activities. From Research to Technology Transfer

PRIVACY PRESERVING IN DISTRIBUTED DATABASE USING DATA ENCRYPTION STANDARD (DES)

Fuzzy Cognitive Maps application for Webmining

Performance Based Study of Association Rule Algorithms On Voter DB

The Fuzzy Search for Association Rules with Interestingness Measure

Data Access Paths for Frequent Itemsets Discovery

Comparative Study of Subspace Clustering Algorithms

Association Rule Mining. Entscheidungsunterstützungssysteme

Study on Mining Weighted Infrequent Itemsets Using FP Growth

Dynamic Load Balancing of Unstructured Computations in Decision Tree Classifiers

Optimization using Ant Colony Algorithm

Concurrent Processing of Frequent Itemset Queries Using FP-Growth Algorithm

Machine Learning: Symbolische Ansätze

Hiding Sensitive Predictive Frequent Itemsets

Mining Temporal Association Rules in Network Traffic Data

A Further Study in the Data Partitioning Approach for Frequent Itemsets Mining

Estimating Missing Attribute Values Using Dynamically-Ordered Attribute Trees

Mining Generalised Emerging Patterns

On Multiple Query Optimization in Data Mining

CMPUT 391 Database Management Systems. Data Mining. Textbook: Chapter (without 17.10)

UAPRIORI: AN ALGORITHM FOR FINDING SEQUENTIAL PATTERNS IN PROBABILISTIC DATA

Performance Analysis of Apriori Algorithm with Progressive Approach for Mining Data

A Quantified Approach for large Dataset Compression in Association Mining

Data Mining. Chapter 1: Introduction. Adapted from materials by Jiawei Han, Micheline Kamber, and Jian Pei

Lecture Topic Projects 1 Intro, schedule, and logistics 2 Data Science components and tasks 3 Data types Project #1 out 4 Introduction to R,

USING FREQUENT PATTERN MINING ALGORITHMS IN TEXT ANALYSIS

Raunak Rathi 1, Prof. A.V.Deorankar 2 1,2 Department of Computer Science and Engineering, Government College of Engineering Amravati

Infrequent Weighted Itemset Mining Using SVM Classifier in Transaction Dataset

Gurpreet Kaur 1, Naveen Aggarwal 2 1,2

Study on Apriori Algorithm and its Application in Grocery Store

Association Rules Mining using BOINC based Enterprise Desktop Grid

SEQUENTIAL PATTERN MINING FROM WEB LOG DATA

Mining Association Rules with Item Constraints. Ramakrishnan Srikant and Quoc Vu and Rakesh Agrawal. IBM Almaden Research Center

A Novel Texture Classification Procedure by using Association Rules

Probabilistic Abstraction Lattices: A Computationally Efficient Model for Conditional Probability Estimation

Building a Concept Hierarchy from a Distance Matrix

2002 Journal of Software.. (stacking).

Product presentations can be more intelligently planned

Enhancing Cluster Quality by Using User Browsing Time

STUDY ON FREQUENT PATTEREN GROWTH ALGORITHM WITHOUT CANDIDATE KEY GENERATION IN DATABASES

IJREAT International Journal of Research in Engineering & Advanced Technology, Volume 1, Issue 5, Oct-Nov, ISSN:

Chapter 1, Introduction

Transcription:

Efficient Distributed Data Mining using Intelligent Agents

Cristian Aflori and Florin Leon

Abstract: Data mining is the process of extracting interesting (non-trivial, implicit, previously unknown and potentially useful) information or patterns from data in large databases. Agents are defined as software or hardware entities that perform some set of tasks on behalf of users with some degree of autonomy. In order to work for somebody as an assistant, an agent has to include a certain amount of intelligence, which is the ability to choose among various courses of action, plan, communicate, adapt to changes in the environment, and learn from experience. In general, an intelligent agent can be described as consisting of a sensing element that can receive events, a recognizer or classifier that determines which event occurred, a set of logic ranging from hard-coded programs to rule-based inference, and a mechanism for taking action. In the steps of knowledge discovery, which include data preparation, mining model selection and application, and output analysis, the intelligent agent paradigm can be used to automate the individual tasks. In the experimental setup, we discover association rules in a distributed database using intelligent agents. We apply an original approach for effective distributed mining of association rules: loosely coupled incremental methods. We compare the results obtained with similar work done in the field.

Index Terms: Association rules, distributed data mining, intelligent agents, loosely coupled incremental methods.

This work was supported in part by the National University Research Council under Grant AT no. 66/2004, "The prototype of a GIS web system for data mining using intelligent agents". Cristian Aflori is a lecturer at the Gh. Asachi Technical University, Department of Automatic Control and Computer Science (e-mail: caflori@cs.tuiasi.ro). Florin Leon is a PhD student at the Gh. Asachi Technical University, Department of Automatic Control and Computer Science (e-mail: fleon@cs.tuiasi.ro).

I. INTELLIGENT AGENTS

In recent years, distributed artificial intelligence has developed and diversified, as it is a research field that merges concepts and results from many disciplines, such as psychology, sociology and economics. Its interdisciplinary nature makes it difficult to establish a unanimously accepted definition, but generally distributed artificial intelligence refers to "the study, construction, and application of multiagent systems, that is, systems in which several interacting, intelligent agents pursue some set of goals or perform some set of tasks" [1].

This development was facilitated by the progress accomplished in computer science. Multi-tasking operating systems, communicating processes, distributed computing, and object-oriented programming languages supported the design, implementation and deployment of agent-based systems. Most classical artificial intelligence systems are static, with a predefined architecture, whereas agent-based systems change dynamically over time. Distributed artificial intelligence studies the issues related to the design of distributed, interactive systems.

An autonomous agent can be considered a system situated within and a part of an environment that senses that environment and acts on it, over time, in pursuit of its own agenda and so as to affect what it senses in the future [2]. The use of agents is mainly justified by the fact that they are a solution for managing complex systems. Because of their autonomy, they can act on behalf of the user, without being reduced to the role of a simple interface.

Agents are defined as software or hardware entities that perform some set of tasks on behalf of users with some degree of autonomy [3]. In order to work for somebody as an assistant, an agent has to include a certain amount of intelligence, which is the ability to choose among various courses of action, plan, communicate, adapt to changes in the environment, and learn from experience. In general, an intelligent agent can be described as consisting of a sensing element that can receive events, a recognizer or classifier that determines which event occurred, a set of logic ranging from hard-coded programs to rule-based inference, and a mechanism for taking action [4], [5]. Other attributes that are important for the agent paradigm include mobility and learning. An agent is mobile if it can navigate through a network and perform tasks on remote machines. A learning agent adapts to the requirements of its user and automatically changes its behavior in response to environmental changes.

II. DATA MINING AGENTS

Data mining (DM), or knowledge discovery in databases (KDD), is the process of searching for valuable information in large volumes of data: the exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns and rules [6]. Data mining agents seek data and information based on the profile of the user and the instructions she gives. A group of flexible data mining agents can cooperate to discover knowledge from distributed sources. They are responsible for accessing data and extracting higher-level useful information from it. A data mining agent specializes in performing some activity in the domain of interest.

Agents can work in parallel and share the information they have gathered so far. In the steps of knowledge discovery, which include data preparation, mining model selection and application, and output analysis, the intelligent agent paradigm can be used to automate the individual tasks. In data preparation, agents can be used especially for handling the sensitivity to learning parameters, applying triggers for database updates, and handling missing or invalid data. In the mining model itself, agent-based approaches have been applied to classification, clustering, summarization and generalization, tasks that have a learning nature and involve rule generation, since current learning methods are able to find regularities in large data sets [8], [9]. An intelligent agent can use domain knowledge with embedded simple rules and, using the training data, it can learn and reduce the need for domain experts. In the interpretation of what is learned, a scanning agent can go through the rules and facts generated and identify items that may contain valuable information [10]. Searching for patterns of interest by using learning and intelligence in classification, clustering, summarization and generalization can also be accomplished by intelligent agents. An agent can learn from a profile or from examples, and feedback from the user can be used to refine the confidence in the agent's predictions. Data mining using neural networks and the possible use of intelligent agents in the data mining process are discussed in [4]. In the understanding of what is learned, agents are used only as fixed agents or simple visualization programs.

The major advantage of using intelligent agents in the automation of data mining is their possible support for mining online transaction data. When new data is added to the database, an alarm or triggering agent can send events to the main mining application and to the learning task within it, so that the new data can be evaluated together with the already mined data. This automated decision support using triggers in data mining is called active data mining [11]. Since the main mining functions can be performed using learning methods, implementing and applying these methods by means of intelligent agents provides a flexible, modular and delegated solution. Additionally, this paradigm can be used to parallelize data mining algorithms, given its suitability for distributed environments.

III. DISTRIBUTED DATA MINING

Nowadays, organizations have branches located in various geographical places, and each branch owns a local database that stores information about its own business. If top-level management needs to mine novel information for decision making, there are two options. The first is to transfer all the data to a single database and mine it there. The second is to mine the local databases independently and still generate information corresponding to the combination of the data in the multiple databases. The architecture of a data mining system using intelligent agents is presented in Fig. 1.

The development of distributed rule mining is a challenging and critical task, since it requires knowledge of all the data stored at different locations and the ability to combine partial results from individual databases into a single result [12].
Fig. 1. Mining rules in a distributed environment using intelligent agents (top-level branch with the central database, local branches, and agent communication).

The individual databases have to be analyzed to generate rules that support local decisions. It would be easier for the organization to make decisions based on the rules generated by the individual branches, rather than using the raw data. If the raw data from each of the individual databases were sent to a single database to generate the rules, certain useful rules, which would aid in making decisions about local branches, would be lost. If the raw data from all the databases were transferred to a single database, then the individual branches would not generate the rules with respect to their own data. In such a case, the organization might miss certain rules that were prominent in some branches and not found in the others. Generating such rules would aid in making decisions about specific branches.

The patterns in multi-databases are divided into the following classes [13]:
- Local patterns: branches need to consider the original raw data in their datasets so that they can identify local patterns for local decisions.
- High-vote patterns: these are the patterns that are supported by most of the branches and are used for making global decisions.
- Exceptional patterns: these patterns are strongly supported by only a few branches and are used to create policies for specific branches.

IV. INCREMENTAL ASSOCIATION

In this paper, we focus on an important data mining operation, association, which we perform by means of agents. Association rule induction is a powerful method that aims at finding regularities in the data. With the induction of association rules, one tries to find sets of data instances that frequently appear together.

Such information is usually expressed in the form of rules. An association rule expresses an association between (sets of) items. However, not every association rule is useful, only those that are expressive and reliable. Therefore, the standard measures for assessing association rules are the support and the confidence of a rule, both of which are computed from the support of certain itemsets.

Our procedure is based on the idea of the Apriori algorithm [14] for extracting the rules. From the implementation point of view, we used the variant described in the data mining book by Witten and Frank [15]. The algorithm is founded on the observation that if any given set of attributes S is not adequately supported, any superset of S will also not be adequately supported, and consequently any effort to calculate the support of such supersets is wasted. For example, if we know that {A, B} is not supported, it follows that {A, B, C}, {A, B, D}, etc. will also not be supported. The algorithm first determines the support of all single attributes (sets of cardinality 1) in the data set and deletes all the single attributes that are not adequately supported. Then, for all supported single attributes, it constructs pairs of attributes (sets of cardinality 2). If there are no pairs, it finishes; otherwise it determines the support of the constructed pairs. For all supported pairs of attributes, candidate sets of cardinality 3 (triples) are built. Again, if there are no triples, it ends; otherwise it determines the support of the constructed triples. It continues likewise until no more candidate sets can be produced.

In a distributed incremental approach, individual agents have access only to a limited number of transactions. Therefore, by employing the Apriori algorithm, they only have a partial view of the association rules. However, they can memorize the rules with a lower support and gradually update them as they access more databases or communicate with other agents. In related work, the incremental mining algorithm [16], [17], [18] is used to find new frequent itemsets with minimal recomputation when new transactions are added to or deleted from the transaction database. The algorithm uses the negative border concept for this. The negative border [19] consists of all itemsets that were candidates but did not have the minimum support. During each pass of the Apriori algorithm, the set of candidate itemsets Ck is computed from the frequent itemsets Fk-1 in the join and prune steps of the algorithm. The negative border is the set of all those itemsets that were candidates in the k-th pass but did not satisfy the user-specified support, that is, NBd(Fk) = Ck \ Fk. The algorithm uses a full scan of the whole database only if the negative border of the frequent itemsets expands.
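To make the level-wise search and the negative border concrete, the following Java sketch mines the frequent itemsets of a list of transactions and records, pass by pass, the candidates that missed the minimum support. It is only a minimal illustration of the idea under our own naming (AprioriSketch, mine), not the implementation used in the paper.

import java.util.*;

/** Minimal level-wise (Apriori-style) frequent itemset miner that also
 *  records the negative border: candidates that missed minSupport. */
public class AprioriSketch {

    /** Returns the frequent itemsets with their support counts; fills negativeBorder. */
    public static Map<Set<String>, Integer> mine(List<Set<String>> transactions,
                                                 int minSupport,
                                                 Map<Set<String>, Integer> negativeBorder) {
        Map<Set<String>, Integer> frequent = new HashMap<>();

        // Candidates of cardinality 1: every item that occurs in the data.
        Set<Set<String>> candidates = new HashSet<>();
        for (Set<String> t : transactions)
            for (String item : t)
                candidates.add(new TreeSet<>(Collections.singleton(item)));

        while (!candidates.isEmpty()) {
            // Count the support of each candidate with one scan of the transactions.
            Map<Set<String>, Integer> counts = new HashMap<>();
            for (Set<String> t : transactions)
                for (Set<String> c : candidates)
                    if (t.containsAll(c))
                        counts.merge(c, 1, Integer::sum);

            // Keep the supported candidates; remember the unsupported ones (negative border).
            Set<Set<String>> frequentK = new HashSet<>();
            for (Set<String> c : candidates) {
                int supp = counts.getOrDefault(c, 0);
                if (supp >= minSupport) { frequent.put(c, supp); frequentK.add(c); }
                else                    { negativeBorder.put(c, supp); }
            }

            // Join step: build candidates of cardinality k+1 from the frequent k-itemsets;
            // prune those having an unsupported k-subset (the Apriori observation above).
            Set<Set<String>> next = new HashSet<>();
            for (Set<String> a : frequentK)
                for (Set<String> b : frequentK) {
                    Set<String> union = new TreeSet<>(a);
                    union.addAll(b);
                    if (union.size() == a.size() + 1 && allSubsetsFrequent(union, frequentK))
                        next.add(union);
                }
            candidates = next;
        }
        return frequent;
    }

    private static boolean allSubsetsFrequent(Set<String> candidate, Set<Set<String>> frequentK) {
        for (String item : candidate) {
            Set<String> subset = new TreeSet<>(candidate);
            subset.remove(item);
            if (!frequentK.contains(subset)) return false;
        }
        return true;
    }
}

For the case study in the next section, each transaction would correspond to the set of its four attribute=value pairs (e.g. {first=A1, second=B2, third=C4, fourth=D1}) and the minimum support would be 2, i.e., rules with a support higher than 1.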
V. CASE STUDY

In order to demonstrate our approach, let us consider a database with four attributes, each taking nine possible values. The structure of the database is given as ARFF (Attribute-Relation File Format), a common format used to describe data for machine learning algorithms:

@RELATION alphanum
@ATTRIBUTE first {A1,A2,A3,A4,A5,A6,A7,A8,A9}
@ATTRIBUTE second {B1,B2,B3,B4,B5,B6,B7,B8,B9}
@ATTRIBUTE third {C1,C2,C3,C4,C5,C6,C7,C8,C9}
@ATTRIBUTE fourth {D1,D2,D3,D4,D5,D6,D7,D8,D9}
@DATA
A1,B2,C4,D1
A3,B2,C1,D7
A1,B2,C2,D3
A2,B1,C2,D2
A2,B2,C2,D7
A1,B3,C4,D2
A1,B4,C2,D3
A2,B2,C4,D4
A1,B2,C3,D7
A3,B4,C4,D5

We split the main database into two databases, each containing an equal number of transactions:

A1,B2,C4,D1
A3,B2,C1,D7
A1,B2,C2,D3
A2,B1,C2,D2
A2,B2,C2,D7
A1,B3,C4,D2
A1,B4,C2,D3
A2,B2,C4,D4
A1,B2,C3,D7
A3,B4,C4,D5
A4,B4,C3,D5
A1,B4,C4,D7
A6,B2,C4,D6
A1,B5,C5,D7
A5,B2,C6,D6
A1,B6,C4,D5
A3,B2,C6,D7
A3,B5,C5,D5
A3,B3,C5,D4

We applied the standard Apriori algorithm to the two databases and retained the rules with a support higher than 1, because the databases were simple enough. The first database gives the following rules:

1) fourth=d7 6 ==> second=b2 6
2) first=a1 fourth=d7 4 ==> second=b2 4
3) third=c4 fourth=d7 3 ==> first=a1 second=b2 3
4) first=a1 third=c4 fourth=d7 3 ==> second=b2 3
5) second=b2 third=c4 fourth=d7 3 ==> first=a1 3
6) third=c4 fourth=d7 3 ==> second=b2 3
7) third=c4 fourth=d7 3 ==> first=a1 3
8) fourth=d3 2 ==> first=a1 third=c2 2
9) first=a1 third=c2 2 ==> fourth=d3 2
10) first=a1 fourth=d3 2 ==> third=c2 2
11) third=c2 fourth=d3 2 ==> first=a1 2
12) fourth=d3 2 ==> third=c2 2
13) fourth=d3 2 ==> first=a1 2

The second database gives the following rules:

1) third=c4 fourth=d7 5 ==> first=a1 5
2) first=a1 second=b2 4 ==> third=c4 fourth=d7 4
3) first=a1 second=b2 third=c4 4 ==> fourth=d7 4
4) first=a1 second=b2 fourth=d7 4 ==> third=c4 4
5) second=b2 third=c4 fourth=d7 4 ==> first=a1 4
6) first=a1 second=b2 4 ==> fourth=d7 4
7) first=a1 second=b2 4 ==> third=c4 4
8) second=b5 2 ==> third=c5 2
9) fourth=d6 2 ==> second=b2 2
10) third=c6 2 ==> second=b2 2

These results can be combined in order to produce a common set of rules by adding, over the individual databases, the supports of the premises and, respectively, the supports of the conclusions:

Supp_p = Σ_i Supp_pi    (1)
Supp_c = Σ_i Supp_ci    (2)

The confidence factor can be recomputed by dividing the summed supports:

Conf = Supp_c / Supp_p = (Σ_i Supp_ci) / (Σ_i Supp_pi)    (3)

We also computed the main database rules. The following are the first ten rules obtained:

1) third=c4 fourth=d7 8 ==> first=a1 8
2) second=b2 third=c4 fourth=d7 7 ==> first=a1 7
3) third=c4 fourth=d7 8 ==> first=a1 second=b2 7
4) first=a1 second=b2 third=c4 8 ==> fourth=d7 7
5) first=a1 second=b2 fourth=d7 8 ==> third=c4 7
6) first=a1 third=c4 fourth=d7 8 ==> second=b2 7
7) third=c4 fourth=d7 8 ==> second=b2 7
8) fourth=d7 13 ==> second=b2 11 conf:(0.85)
9) first=a1 fourth=d7 10 ==> third=c4 8 conf:(0.8)
10) first=a1 second=b2 10 ==> fourth=d7 8 conf:(0.8)

One can observe that these rules can also be obtained by combining partial rules from the two databases (Table I). It is important to note that some global rules cannot appear among the partial database rules, because there are not enough transactions to form them. In one case (the third line of the table), we considered the premise and conclusion as interchangeable, because their confidence was 1. By merging only the selected partial rules one cannot obtain exactly the main rules; had we memorized all the partial rules, we could have combined them into more precise rules, closer to those of the main database.

TABLE I. GENERATED ASSOCIATION RULES

Main database rule | First partial database rule | Second partial database rule
1. third=c4 fourth=d7 8 ==> first=a1 8 | 7. third=c4 fourth=d7 3 ==> first=a1 3 | 1. third=c4 fourth=d7 5 ==> first=a1 5
2. second=b2 third=c4 fourth=d7 7 ==> first=a1 7 | 5. second=b2 third=c4 fourth=d7 3 ==> first=a1 3 | 5. second=b2 third=c4 fourth=d7 4 ==> first=a1 4
3. third=c4 fourth=d7 8 ==> first=a1 second=b2 7 | 3. third=c4 fourth=d7 3 ==> first=a1 second=b2 3 | 2. first=a1 second=b2 4 ==> third=c4 fourth=d7 4
4. first=a1 second=b2 third=c4 8 ==> fourth=d7 7 | - | 3. first=a1 second=b2 third=c4 4 ==> fourth=d7 4
5. first=a1 second=b2 fourth=d7 8 ==> third=c4 7 | - | 4. first=a1 second=b2 fourth=d7 4 ==> third=c4 4
6. first=a1 third=c4 fourth=d7 8 ==> second=b2 7 | 4. first=a1 third=c4 fourth=d7 3 ==> second=b2 3 | -
7. third=c4 fourth=d7 8 ==> second=b2 7 | 6. third=c4 fourth=d7 3 ==> second=b2 3 | -
8. fourth=d7 13 ==> second=b2 11 conf:(0.85) | 1. fourth=d7 6 ==> second=b2 6 | -
9. first=a1 fourth=d7 10 ==> third=c4 8 conf:(0.8) | - | -
10. first=a1 second=b2 10 ==> fourth=d7 8 conf:(0.8) | - | 6. first=a1 second=b2 4 ==> fourth=d7 4
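In code, the combination of equations (1)-(3) amounts to grouping the partial rules that share the same premise and conclusion, summing their two support counts, and recomputing the confidence from the sums. The following Java sketch illustrates this under an assumed rule representation of our own (PartialRule, RuleMerger); it is not the agents' actual exchange format.

import java.util.*;

/** An association rule with the supports needed for merging (eqs. (1)-(3)). */
class PartialRule {
    final Set<String> premise;     // e.g. {third=C4, fourth=D7}
    final Set<String> conclusion;  // e.g. {first=A1}
    int premiseSupport;            // the count printed after the premise (Supp_p in eq. (1))
    int ruleSupport;               // the count printed after the conclusion (Supp_c in eq. (2))

    PartialRule(Set<String> premise, Set<String> conclusion, int premiseSupport, int ruleSupport) {
        this.premise = premise;
        this.conclusion = conclusion;
        this.premiseSupport = premiseSupport;
        this.ruleSupport = ruleSupport;
    }

    /** Two rules describe the same association if premise and conclusion coincide. */
    String key() { return new TreeSet<>(premise) + " ==> " + new TreeSet<>(conclusion); }

    double confidence() { return (double) ruleSupport / premiseSupport; }  // eq. (3)
}

public class RuleMerger {
    /** Combine the rule sets reported by several agents: supports are summed (eqs. (1), (2))
     *  and the confidence is recomputed from the summed supports (eq. (3)). */
    static Collection<PartialRule> merge(List<List<PartialRule>> perAgentRules) {
        Map<String, PartialRule> merged = new LinkedHashMap<>();
        for (List<PartialRule> rules : perAgentRules)
            for (PartialRule r : rules) {
                PartialRule acc = merged.get(r.key());
                if (acc == null) {
                    merged.put(r.key(), new PartialRule(r.premise, r.conclusion,
                                                        r.premiseSupport, r.ruleSupport));
                } else {
                    acc.premiseSupport += r.premiseSupport;  // Supp_p = sum of Supp_pi
                    acc.ruleSupport    += r.ruleSupport;     // Supp_c = sum of Supp_ci
                }
            }
        return merged.values();
    }

    public static void main(String[] args) {
        // First row of Table I: the two agents report supports 3 and 5 for the same rule.
        PartialRule fromAgent1 = new PartialRule(Set.of("third=C4", "fourth=D7"), Set.of("first=A1"), 3, 3);
        PartialRule fromAgent2 = new PartialRule(Set.of("third=C4", "fourth=D7"), Set.of("first=A1"), 5, 5);
        for (PartialRule r : merge(List.of(List.of(fromAgent1), List.of(fromAgent2))))
            System.out.println(r.key() + "  supp=" + r.ruleSupport + "  conf=" + r.confidence());
        // Prints the merged rule with supp=8 and conf=1.0, matching the main database rule.
    }
}

Running the example in main combines the supports 3 and 5 reported by the two agents into the global support 8 of the corresponding main database rule.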

VI. PERFORMANCE EVALUATION

In an organization with several branches it is crucial for the top-level management to have a complete, up-to-date image of its activities all over the world. In order to achieve this goal, it is important to efficiently mine novel information in a distributed environment. For these reasons we need to measure the performance of the incremental association algorithm against the standard Apriori algorithm. The classical Apriori algorithm performs a recomputation over all the data each time a database increment arrives from a local database.

To measure the performance of the algorithms, an experiment was set up with the following parameters: a synthetic dataset in ARFF format and a server (Pentium 4 at 2.4 GHz, 1 GB RAM) running the Windows XP operating system. The algorithms are implemented in J# (the Microsoft version of Java for the .NET framework): the classical Apriori and the incremental version. The nomenclature of the datasets is of the form TxxIyyDzzzK, where "xx" denotes the average number of items present per transaction, "yy" denotes the average support of each item in the dataset, and "zzzK" denotes the total number of transactions in thousands. A percentage of the transactions of the database is considered the original database and the remaining transactions are added incrementally, in percentages. The experiments are performed for three increments to the original database, since in most cases recomputation may turn out to be better than incremental mining during the initial iterations, until the size of the dataset grows considerably. The database grows by 10% of the total size with each increment. The initial size of the database is 700k transactions; the database is then updated with three increments of 100k transactions each, and the resulting rules have support factors of 30%, 25% and 20%.

The efficiency of the incremental algorithm compared with the classical Apriori is presented in Figure 2, which shows an improvement in performance for the incremental association algorithm of 48% for 30% support, decreasing to about 43% for 25% support and to 40% for 20% support. The resulting performance is similar to the related work on incremental association mining using the negative border concept [20]. The experiments show that incremental mining performs better than recomputation for larger datasets.

Fig. 2. Performance of the incremental association algorithm compared with the classical Apriori algorithm.
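The incremental strategy timed above can be outlined in a few lines of Java. The sketch below reuses the hypothetical AprioriSketch class from Section IV and simplifies the negative-border bookkeeping of [16]: one scan of the new transactions updates the counts of the itemsets already tracked, and a full re-mining of the accumulated database is triggered only when an itemset from the negative border reaches the support threshold (i.e., the negative border would expand). This is an illustration of the idea, not the measured J# implementation.

import java.util.*;

/** Simplified incremental maintenance of frequent itemsets (negative-border idea, Section IV).
 *  The whole accumulated database is rescanned only when the negative border would expand.
 *  (Simplification: itemsets that drop out of the frequent set are not re-inserted into the
 *  negative border here.) */
public class IncrementalMinerSketch {

    private final List<Set<String>> database = new ArrayList<>();
    private final double minSupportFraction;                           // e.g. 0.30 for 30% support
    private Map<Set<String>, Integer> frequent;                         // F with support counts
    private Map<Set<String>, Integer> negativeBorder = new HashMap<>(); // NBd(F) with counts

    public IncrementalMinerSketch(List<Set<String>> initial, double minSupportFraction) {
        this.minSupportFraction = minSupportFraction;
        database.addAll(initial);
        frequent = AprioriSketch.mine(database, threshold(), negativeBorder);
    }

    private int threshold() { return (int) Math.ceil(minSupportFraction * database.size()); }

    /** Add one increment of transactions and bring the frequent itemsets up to date. */
    public void addIncrement(List<Set<String>> increment) {
        database.addAll(increment);

        // One scan of the increment only: update the counts of every itemset already tracked.
        for (Set<String> t : increment) {
            frequent.replaceAll((itemset, c) -> t.containsAll(itemset) ? c + 1 : c);
            negativeBorder.replaceAll((itemset, c) -> t.containsAll(itemset) ? c + 1 : c);
        }

        int minSupport = threshold();
        boolean borderExpands = negativeBorder.values().stream().anyMatch(c -> c >= minSupport);

        if (borderExpands) {
            // An itemset of the negative border became frequent; its supersets were never
            // counted, so only a full re-mining of the whole database is safe (full scan).
            negativeBorder = new HashMap<>();
            frequent = AprioriSketch.mine(database, minSupport, negativeBorder);
        } else {
            // Otherwise, dropping the itemsets that no longer reach the threshold is enough.
            frequent.entrySet().removeIf(e -> e.getValue() < minSupport);
        }
    }

    public Map<Set<String>, Integer> frequentItemsets() { return frequent; }
}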
VII. CONCLUSION AND FUTURE WORK

This paper presented an original approach for efficiently mining association rules in a distributed environment using intelligent agents. The case study showed that the incremental algorithm produced almost the same rules as the classical recomputation algorithm. The performance of the incremental algorithm varies depending on the database size, the increment size and the support factor, but for large datasets the improvement is about 40%-48% compared to the classical algorithm.

As a future direction of research, we will consider forming sets of higher cardinality from the partial rules transported by the agents, instead of simply merging the existing rules into similar ones with higher support. We will also explore other incremental versions of association algorithms, implemented inside the database (tightly coupled with the database), and we will evaluate the algorithms on a real-world case. It would also be interesting to extend the data mining multi-agent framework with the visualization features of a Geographic Information System and with capabilities for mining spatial data.

REFERENCES

[1] G. Weiß, S. Sen (eds.), Adaptation and Learning in Multiagent Systems, Springer Verlag, Berlin, 1996.
[2] S. Franklin, A. Graesser, "Is it an Agent, or just a Program?: A Taxonomy for Autonomous Agents", Institute for Intelligent Systems, University of Memphis, http://www.msci.memphis.edu/~franklin/agentprog.html
[3] S. Russell, P. Norvig, Artificial Intelligence: A Modern Approach, Prentice-Hall, 1995.
[4] J. P. Bigus, Data Mining with Neural Networks: Solving Business Problems from Application Development to Decision Support, McGraw-Hill, 1996.
[5] T. Dean, J. Allen, Y. Aloimonos, Artificial Intelligence: Theory and Practice, The Benjamin/Cummings Publishing Co. Inc., 1995.
[6] U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, "From Data Mining to Knowledge Discovery: An Overview", in Advances in Knowledge Discovery and Data Mining, AAAI Press, Menlo Park, 1996, pp. 1-34.
[7] J. Han, M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, 2001.
[8] T. R. Payne, P. Edwards, C. L. Green, "Experience with Rule Induction and k-Nearest Neighbor Methods for Interface Agents that Learn", IEEE Transactions on Knowledge and Data Engineering, vol. 9, no. 2, pp. 329-335, Mar/Apr 1997.
[9] J. Yang, P. Pai, V. Honavar, L. Miller, "Mobile Intelligent Agents for Document Classification and Retrieval: A Machine Learning Approach", Proceedings of the European Symposium on Cybernetics and Systems Research, Vienna, Austria, 1998.
[10] H. S. Nwana, M. Wooldridge, "Software Agent Technologies", in Software Agents and Soft Computing: Towards Enhanced Machine Intelligence, Lecture Notes in Artificial Intelligence 1198, pp. 59-77, 1997.
[11] R. Agrawal, G. Psaila, "Active Data Mining", Proceedings of the First International Conference on Knowledge Discovery and Data Mining (KDD-95), 1995.
[12] S. Zhang, X. Wu, C. Zhang, "Multi-Database Mining", IEEE Computational Intelligence Bulletin, vol. 2, no. 1, June 2003, pp. 5-13.
[13] X. Wu, S. Zhang, "Synthesizing High-Frequency Rules from Different Data Sources", IEEE Transactions on Knowledge and Data Engineering, 2003.
[14] R. Agrawal, R. Srikant, "Fast Algorithms for Mining Association Rules", Proceedings of the 20th International Conference on Very Large Data Bases (VLDB), 1994.
[15] I. H. Witten, E. Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, Morgan Kaufmann, San Francisco, 2000.
[16] S. Thomas et al., "An Efficient Algorithm for the Incremental Updation of Association Rules in Large Databases", in Knowledge Discovery and Data Mining, 1997.
[17] B. Thuraisingham, "A Primer for Understanding and Applying Data Mining", IEEE IT Professional, vol. 2, no. 1, pp. 28-31, 2000.
[18] R. Agrawal, T. Imielinski, A. Swami, "Mining Association Rules between Sets of Items in Large Databases", Proceedings of the ACM SIGMOD International Conference on Management of Data, Washington, D.C., 1993.
[19] H. Toivonen, "Sampling Large Databases for Association Rules", Proceedings of the 1996 International Conference on Very Large Data Bases, Morgan Kaufmann, 1996.
[20] Hima Valli Kona, Association Rule Mining Over Multiple Databases: Partitioned and Incremental Approaches, The University of Texas at Arlington, 2003.