Domain Independent Prediction with Evolutionary Nearest Neighbors.

Research Summary

Introduction

In January of 1848, a few tiny gold nuggets were discovered on the American River at Coloma, near Sacramento. This triggered one of the largest human migrations in history, as a half-million people from around the world descended upon California in search of instant wealth [2]. We live in a data-rich but information-poor environment [1], which calls for a comparable migration of computational tools to suit the data and extract valuable information nuggets. Data mining is a multidisciplinary field whose primary objective is to support knowledge workers in extracting information from large volumes of data. In most practical data mining applications, the tool is customized extensively for the specific application domain. There is a need for a generalized data mining tool for at least one major area of data mining.

This work is an attempt to investigate, build, and evaluate a generalized prediction framework that facilitates easy migration of the tool to the application in order to do meaningful data mining. We propose a nearest neighbor prediction approach with genetic algorithm (GA) based relevance tuning for the particular application domain. Generalizing the tool to enable easy migration with a GA may be computationally prohibitive for large data sets. We therefore propose the use of a vertical, data-mining-ready data structure, P-trees (P-tree technology is patent pending), that would enable the tool framework to be computationally efficient in the generalized setting.

Background

The proposed work falls into the area of research categorized as data mining and knowledge discovery with evolutionary algorithms. The main motivation for applying evolutionary algorithms to data mining tasks is that they are robust and adaptive. Classification (prediction) is probably the most widely studied data mining task [3]. K Nearest Neighbor (KNN) classification is well explored in the literature and has been shown to have good classification (prediction) performance on a wide range of real-world data sets [4]. KNN is simple and straightforward to implement. The use of a distance metric in KNN opens a wide array of opportunities to use evolutionary techniques to tune the metric to a particular application domain. Most existing research uses evolutionary techniques for dimensionality reduction and attribute relevance [4], [5], [6], [7]. In some other cases, evolutionary techniques are used to optimize other parameters, such as the optimum k in KNN [4]. Genetic algorithms [8] are parallel, iterative optimizers and have been successfully applied to a broad spectrum of optimization problems [4]. Attribute dimensions can be scaled using a genetic algorithm to optimize the classification accuracy of a separate algorithm, such as KNN [7].

The artificial tuning process requires iterative evaluation of the prediction model, which could be computationally expensive for large data sets. P-trees are a lossless, compressed, data-mining-ready data structure. This data structure has been successfully applied in data mining applications for real-world data [6], [9], [10], [11]. Efficient computation of the required neighborhood counts leads to a low-cost solution for the iterative evaluation required for the GA-based evolution (tuning). Two major obstacles to quick and easy migration of a tool framework are diversity in data and diversity in domain knowledge. These could be addressed with the use of P-trees and a GA, respectively.

Proposed approach

The main objective of the proposed work is a generalized prediction framework. This should allow easy migration of the tool framework to different application domains.
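The attribute scaling idea mentioned in the Background (a GA tuning one weight per attribute in the KNN distance metric) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the framework proposed here: the function names, the toy data, the population size, and the crossover and mutation operators are all hypothetical choices, and fitness is taken to be leave-one-out accuracy.

```python
import random

def weighted_distance(a, b, weights):
    """Squared Euclidean distance with one scale factor per attribute."""
    return sum(w * (x - y) ** 2 for w, x, y in zip(weights, a, b))

def knn_predict(train, query, weights, k=3):
    """Majority vote among the k training samples closest to `query`."""
    nearest = sorted(train, key=lambda s: weighted_distance(s[0], query, weights))[:k]
    votes = {}
    for _, label in nearest:
        votes[label] = votes.get(label, 0) + 1
    return max(votes, key=votes.get)

def fitness(weights, data, k=3):
    """Leave-one-out accuracy of the weighted KNN predictor (assumed fitness)."""
    hits = 0
    for i, (x, y) in enumerate(data):
        rest = data[:i] + data[i + 1:]
        hits += knn_predict(rest, x, weights, k) == y
    return hits / len(data)

def evolve_weights(data, n_attrs, pop_size=20, generations=15, k=3, seed=0):
    """Toy GA: keep the better half, refill with crossover plus mutation."""
    rng = random.Random(seed)
    # Include the unweighted metric so the result is never worse than plain KNN.
    pop = [[1.0] * n_attrs]
    pop += [[rng.random() for _ in range(n_attrs)] for _ in range(pop_size - 1)]
    for _ in range(generations):
        ranked = sorted(pop, key=lambda w: fitness(w, data, k), reverse=True)
        elite = ranked[: pop_size // 2]
        children = []
        while len(elite) + len(children) < pop_size:
            p1, p2 = rng.sample(elite, 2)
            cut = rng.randrange(1, n_attrs) if n_attrs > 1 else 0
            child = p1[:cut] + p2[cut:]            # one-point crossover
            j = rng.randrange(n_attrs)             # mutate a single gene
            child[j] = min(1.0, max(0.0, child[j] + rng.gauss(0, 0.2)))
            children.append(child)
        pop = elite + children
    return max(pop, key=lambda w: fitness(w, data, k))

def make_toy_data(n=40, seed=1):
    """Two attributes: the first tracks the class, the second is pure noise."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        data.append(((label + rng.gauss(0, 0.2), rng.uniform(-5, 5)), label))
    return data

data = make_toy_data()
best = evolve_weights(data, n_attrs=2)
baseline_acc = fitness([1.0, 1.0], data)
tuned_acc = fitness(best, data)
print("weights:", best, "baseline:", baseline_acc, "tuned:", tuned_acc)
```

Because every fitness call re-evaluates the predictor over the whole training set, the cost of this loop grows quickly with data size, which is the motivation given above for pairing the GA with an efficient counting structure.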
We propose the use of an artificially tuned nearest-neighbor-type prediction model. We propose exploring all possible tuning parameters for the nearest neighbor prediction model with the use of a GA. For example, instead of restricting the neighborhood search to a fixed k, the similarity metric can use a GA-optimized influence function (Figure 2). The use of the P-tree data structure allows this work to go beyond classical nearest neighbor classification. In the classical approach, similarity counting is done through expensive database scans; with P-trees, these scans are replaced by a collection of logical operations on compressed bit vectors.

An outline of the proposed architecture is shown in Figure 1. Training data from the application domain will first be converted to P-trees, which will be used for neighborhood counting in the predictor. The GA will be used to tune the predictor. Finally, the input samples will be predicted with the tuned predictor.

Figure 1: Proposed outline of architecture. (Diagram labels: Application Training Data; Genetic Algorithm; P-tree Engine & Data Repository; Nearest Neighbor Predictor; Data to be predicted; Tuned Predictor; Prediction.)

Figure 2: Example of two parameters that could be tuned on the similarity metric of the nearest neighbor predictor with the use of the genetic algorithm. (Panel labels: Attr. Relevance; Neighborhood influence.)

Proposed Evaluation

With respect to the main objective of this work, the proposed tool framework should be evaluated in at least two diverse application domains to show the ease of migration independent of the application domain. It would also be an advantage to choose an application domain with a high potential for return from the use of data mining. Prediction applications in bioinformatics and software project cost estimation are proposed as two initial application areas. In bioinformatics there is an abundance of data [12], with some specific applications, such as protein function prediction, and other less specific classification and prediction applications. In software engineering there is a specific need for good software cost predictions [13] for the survival of the industry, along with a general intuition that data mining can provide a reasonable solution.

The two major criteria for evaluating the quality of a solution are accuracy and computational cost. In this work, more emphasis will be placed on accuracy. The proposed evaluation will test the tool framework against published results of existing solutions. Each selected application domain will be tested with only the migration enabled by the artificial tuning proposed in this work.

Conclusion

As with the human migration in the Gold Rush, we are proposing a tool framework with quick and easy migration across application domains to find valuable information. The two major obstacles to the migration of data mining tools are diversity in application data and diversity in domain knowledge. Data diversity is handled by the use of a uniform and computationally efficient data structure, P-trees. Diversity in domain knowledge is handled by the use of an evolutionary algorithm, a GA. Successful completion of the proposed work will contribute to the body of knowledge by demonstrating the feasibility of an enabling technology for a computationally intelligent gold rush for information in diverse application domains.
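To make the data-structure half of this argument more concrete, the sketch below shows how neighborhood counts can be obtained with logical operations on vertical bit vectors instead of row-by-row scans. It is only an illustration under assumptions: it is not the P-tree implementation (there is no compression or tree structure, just plain Python integers used as bit slices), the function names are hypothetical, and defining a neighborhood as "all samples that share the query's high-order bits in every attribute" is an assumed similarity notion chosen for brevity.

```python
def build_bit_slices(values, n_bits):
    """One integer bit vector per bit position, most significant first.
    Bit i of a slice is set when sample i has that bit set in its value."""
    slices = []
    for b in range(n_bits - 1, -1, -1):
        vec = 0
        for i, v in enumerate(values):
            if (v >> b) & 1:
                vec |= 1 << i
        slices.append(vec)
    return slices

def class_vectors(labels):
    """One bit vector per class label, marking the samples of that class."""
    vecs = {}
    for i, lab in enumerate(labels):
        vecs[lab] = vecs.get(lab, 0) | (1 << i)
    return vecs

def match_prefix(slices, query, n_bits, prefix_bits, n_samples):
    """Bit vector of samples whose top `prefix_bits` bits equal the query's."""
    all_ones = (1 << n_samples) - 1
    mask = all_ones
    for pos in range(prefix_bits):
        b = n_bits - 1 - pos
        if (query >> b) & 1:
            mask &= slices[pos]                # sample bit must be 1
        else:
            mask &= ~slices[pos] & all_ones    # sample bit must be 0
    return mask

def neighborhood_counts(attr_slices, class_vecs, n_samples, query, n_bits, prefix_bits):
    """Per-class count of samples in the query's neighborhood, using only
    AND / NOT on bit vectors followed by a population count."""
    mask = (1 << n_samples) - 1
    for slices, q in zip(attr_slices, query):
        mask &= match_prefix(slices, q, n_bits, prefix_bits, n_samples)
    return {lab: bin(mask & vec).count("1") for lab, vec in class_vecs.items()}

# Toy training data: two 4-bit attributes stored column-wise, six samples.
columns = [[1, 2, 3, 12, 13, 14],
           [0, 1, 7, 12, 15, 2]]
labels = ["a", "a", "a", "b", "b", "b"]

# The vertical structures are built once and reused for every query.
attr_slices = [build_bit_slices(col, 4) for col in columns]
class_vecs = class_vectors(labels)

print(neighborhood_counts(attr_slices, class_vecs, 6, (2, 1), 4, 2))
# {'a': 2, 'b': 0}
```

Since the bit slices are built once and every subsequent count is a handful of bitwise operations, repeated evaluation of a predictor (as a GA would require) does not have to rescan the training rows.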

References

[1] Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, Academic Press, Morgan Kaufmann Publishers, 2001.
[2] http://ceres.ca.gov/ceres/calweb/geology/goldrush.html
[3] A. A. Freitas, A survey of evolutionary algorithms for data mining and knowledge discovery, Advances in Evolutionary Computation, pp. 819-845, Springer-Verlag, August 2002.
[4] M. L. Raymer, W. F. Punch, E. D. Goodman, L. A. Kuhn, and L. C. Jain, Dimensionality reduction using genetic algorithms, IEEE Trans. on Evolutionary Computation, 4(2):164-171, 2000.
[5] J. Yang and V. Honavar, Feature subset selection using a genetic algorithm, in H. Liu and H. Motoda (Eds.), Feature Extraction, Construction and Selection: A Data Mining Perspective, pp. 117-136, Kluwer, 1998.
[6] Amal Perera, Anne Denton, Pratap Kotala, William Jockheck, Willy Valdivia Granda, and William Perrizo, P-tree Classification of Yeast Gene Deletion Data, SIGKDD Explorations, Vol. 4, Issue 2, January 2003.
[7] W. F. Punch, E. D. Goodman, M. Pei, L. Chia-Shun, P. Hovland, and R. Enbody, Further research on feature selection and classification using genetic algorithms, Proc. of the Fifth Int. Conf. on Genetic Algorithms, pp. 557-564, San Mateo, CA, 1993.
[8] D. E. Goldberg, Genetic Algorithms in Search, Optimization, and Machine Learning, Addison Wesley, 1989.
[9] Q. Ding, Q. Ding, and W. Perrizo, ARM on RSI Using P-trees, Pacific-Asia KDD Conf., pp. 66-79, Taipei, May 2002.
[10] Q. Ding, Q. Ding, and W. Perrizo, Decision Tree Classification of Spatial Data Streams Using Peano Count Trees, ACM SAC, pp. 426-431, Madrid, Spain, March 2002.
[11] M. Khan, Q. Ding, and W. Perrizo, KNN on Data Stream Using P-trees, Pacific-Asia KDD, pp. 517-528, Taipei, May 2002.
[12] S. Beck and P. Sterk, Genome-scale DNA sequencing: where are we?, Curr. Opin. Biotechnol., 9:116-120, 1998.
[13] S. Chulani, B. Boehm, and B. Steece, Bayesian analysis of empirical software engineering cost models, IEEE Transactions on Software Engineering, 25(4), July/August 1999.