CLASSIFICATION FOR SCALING METHODS IN DATA MINING

Eric Kyper, College of Business Administration, University of Rhode Island, Kingston, RI 02881, (401) 874-7563, ekyper@mail.uri.edu
Lutz Hamel, Department of Computer Science and Statistics, University of Rhode Island, Kingston, RI 02881, (401) 874-2701, hamel@cs.uri.edu
Scott Lloyd, College of Business Administration, University of Rhode Island, Kingston, RI 02881, (401) 874-7056, sjlloyd@uri.edu

ABSTRACT

This paper presents a proposed classification scheme for methods of scaling data mining techniques. Three basic methods, based on hardware, algorithms, and statistics, are defined. The intersections of those areas are then explored, and examples of previous research are classified within them.

KEYWORDS: Scaling, Data Mining, Classification

INTRODUCTION

Storage Law describes the rate at which storage capacity grows. For the last decade capacity has doubled every nine months, twice the rate predicted by Moore's Law for computing power. The result is a situation in which the ability to capture and store data far outpaces the ability to analyze it [1]. One consequence is a need for data mining techniques able to adapt to this changing circumstance. Hence scalability (the ability to adapt problem solutions to ever-growing problem sizes) has become an important aspect of data mining.

The obstacles to scaling are many. Human limitations in visualization, computational complexity, memory costs, and the lack of parallel software are just a few of the problems faced in scaling data mining techniques [2]. However, technological advances in hardware, improvements in algorithms, and the continued advancement of statistics have all contributed to progress in this area. So much research has been conducted that a way to classify it has become necessary. In addition, previous research has involved scientists from many disciplines (statistics, computer science, electrical engineering, information systems) that do not necessarily share a common language. A classification scheme provides a common framework by which researchers can refer to and analyze previous work.

This paper proposes a classification for scalable data mining techniques. It posits that the most interesting areas of scalability are not the independent advancements of hardware, algorithms, or statistics (although each has much to contribute), but the intersections of these three areas. The remainder of this paper presents the proposed classification model, classifies previous research, and discusses the classification scheme.

PROPOSED CLASSIFICATION SCHEME

A classification scheme provides insights and guidance regarding classes of objects. By showing how the attributes of certain objects are similar or dissimilar, it is possible to discover the distinguishing characteristics that make up the essence of each class. Here the objects are data mining scaling methods, and the scheme allows more precise discourse about them. To help clarify the issue of scaling data mining techniques, the proposed classification scheme is diagrammed in Figure 1 below.

[Figure 1: Classification Scheme. A Venn diagram of three overlapping circles labeled Hardware, Statistics, and Algorithms; intersection I lies between Hardware and Algorithms, intersection II between Algorithms and Statistics, and intersection III between Statistics and Hardware.]

In this classification the objects are scaling methods, and the classes are the circles and the intersections they form. The circles represent three basic ways to scale techniques to larger datasets: hardware, algorithms, and statistics. Each is briefly described below.

Using hardware to scale simply means implementing existing algorithms and datasets on faster machines: machines with faster processors, more memory, and the many other options bigger budgets can afford. This creates an environment that can handle larger datasets and process them more quickly.

Modifying algorithms in order to scale involves changing the code itself so the machine can process a larger amount of data, for example by creating routines that execute more efficiently. This often requires a great deal of time and knowledge about how the existing system works.

Traditional statistical sampling techniques have long been used to allow larger datasets to be processed [3]. They form a set of techniques that have proven reliable in a wide variety of circumstances and are available in common statistical packages.

The intersections of these circles are of greatest interest in this study. They represent combinations of the basic methods of scaling; the goal is to draw on the strengths of the three basic areas to create powerful scaling techniques within the intersections. The intersections may contain various ways of combining technologies, among them parallelism, pre- and post-processors, and hardware/statistics hybrids.

Parallelism can be thought of as a pure hardware solution to scaling: simply add more processors to a machine. However, code written for single-processor machines cannot take advantage of the parallelism without changes, so parallelism often falls into the intersection between hardware and algorithms. The algorithms must be modified to properly exploit the advantages parallel processors provide; the resulting power enables existing techniques to handle larger datasets. Note that not all data mining techniques lend themselves to this process: some contain routines that must run serially and therefore cannot benefit from the added processors.

Preprocessors can also be written to apply statistical techniques before the data mining algorithms run, for example by aggregating the data into a smaller dataset or by sampling to reduce its size. So wide a variety of possibilities lies within these intersections that the best way to show what can be done is to look at what has been done previously. The next section classifies examples of previous research into the intersections presented in Figure 1.

CLASSIFYING PREVIOUS RESEARCH

This section provides relevant examples of research papers on scaling techniques and fits them within the proposed classification scheme.

Intersection I: Hardware X Algorithms

Joshi, Karypis, and Kumar [4] presented ScalParC (Scalable Parallel Classifier), a decision tree based classification algorithm. Their tests showed that both runtime and memory requirements decrease as the number of processors increases, but the gains grow at a decreasing rate: going from 16 to 32 processors produced a speedup of 1.61 (a parallel efficiency of roughly 80% for the doubling), while going from 32 to 64 processors produced a speedup of only 1.31.

Collobert, Bengio, and Bengio [5] presented a methodology that reduces the training time of support vector machines from roughly O(T^3) to close to O(T), where T is the number of training instances. This was accomplished by iteratively partitioning the data and then training the partitions (modules) on parallel machines. On datasets of 100,000 to 400,000 instances they reduced training time to a linear function of T; in one example they solved a problem with 400,000 instances in less than four hours on 50 computers, and they estimate that the same problem would have taken more than one month with a single support vector machine.
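To make the partition-based pattern concrete, the following minimal Python sketch splits a training set across worker processes, fits one model per partition, and combines the results by voting. It is illustrative only: the trivial majority-class "learner" and all function names are our own stand-ins, not the actual algorithms of [4] or [5].

    from collections import Counter
    from concurrent.futures import ProcessPoolExecutor

    def train_partition(partition):
        # Stand-in learner: remember the majority class of the partition.
        labels = [label for _, label in partition]
        return Counter(labels).most_common(1)[0][0]

    def parallel_train(data, n_workers=4):
        # Split the data into one partition per worker; train concurrently.
        chunk = (len(data) + n_workers - 1) // n_workers
        partitions = [data[i:i + chunk] for i in range(0, len(data), chunk)]
        with ProcessPoolExecutor(max_workers=n_workers) as pool:
            return list(pool.map(train_partition, partitions))

    def predict_by_vote(models):
        # Each per-partition model casts one vote; the majority wins.
        return Counter(models).most_common(1)[0][0]

    if __name__ == "__main__":
        data = [((i,), "pos" if i % 3 else "neg") for i in range(10_000)]
        print(predict_by_vote(parallel_train(data)))

The point of the pattern is that the speedup comes from restructuring the algorithm around independent partitions, not from faster hardware alone.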

Garcke, Hegland, and Nielsen [6] fit sparse grids (a technique for approximating high-dimensional functions) by exploiting parallelism at multiple stages of the process.

Shafer, Agrawal, and Mehta [7] presented an algorithm called SPRINT that works on disk-resident datasets too large to fit into memory. SPRINT removes the memory restriction but still requires a hash tree proportional in size to the training set, which can become expensive for large training sets.

Wang, Iyer, and Vitter [8] proposed a classification algorithm called MIND (MINing in Databases) that uses extended relational calculus, allowing it to be built into a relational database system. This yields greater I/O efficiency, and thus scalability, because the data need not be memory resident. The algorithm also scales easily over multiple processors.

Zhang, Bajaj, and Blanke [9] presented a fully parallel visualization solution that incorporates both parallel processors and disk arrays.

Intersection II: Algorithms X Statistics

Here statistics are applied to a dataset so that its size can be reduced without affecting its content, that is, the representation of the information to be learned or mined. Sampling the dataset, with or without replacement, is perhaps the most popular technique for reducing the number of instances. Another way to reduce the size of a dataset is to reduce the number of attributes. A standard statistical technique for removing attributes is Principal Component Analysis. Another approach to attribute reduction, put forward by Kohavi and John, is the wrapper approach [10]: the sensitivity of the model in question is systematically tested against the available attributes, and attributes for which the model exhibits low sensitivity are eliminated. Bradley et al. [11] discuss the use of a smaller set of sufficient statistics in decision tree construction; the aggregated set is smaller than the original data, creating a way to scale to larger datasets.
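As a deliberately simplified illustration of these ideas, the Python sketch below reduces rows by sampling without replacement and reduces columns by projecting onto leading principal components. A wrapper approach [10] would instead loop over candidate attribute subsets and keep those to which the model is sensitive. The 10% sample fraction and five components are arbitrary illustrative choices, not recommendations.

    import numpy as np

    def sample_rows(X, fraction=0.10, seed=0):
        # Reduce the instance count by uniform sampling without replacement.
        rng = np.random.default_rng(seed)
        idx = rng.choice(len(X), size=int(len(X) * fraction), replace=False)
        return X[idx]

    def pca_reduce(X, n_components=5):
        # Reduce the attribute count via principal components (SVD on
        # centered data); keep only scores on the leading components.
        Xc = X - X.mean(axis=0)
        _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
        return Xc @ Vt[:n_components].T

    X = np.random.default_rng(1).normal(size=(100_000, 50))
    X_small = pca_reduce(sample_rows(X))
    print(X_small.shape)  # (10000, 5): one percent of the original cells

Either step alone already shrinks the input handed to the mining algorithm; combined, they cut both dimensions of the problem before any mining code runs.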

Intersection III: Statistics X Hardware

Yu and Skillicorn [12] tested the effects of parallelism (more processors) on bagging and boosting, two techniques for generating subsets of training data for use in predictor generation with voting or regression. In bagging the subsets are chosen independently; boosting is a more complex alternative that uses information about how hard objects are to classify. Ideally, objects that are hard to classify should be overrepresented in samples so that new predictors can spend more effort defining tighter boundaries between classes. The results for bagging showed that more processors do not equate to less total time spent on the problem. The boosting results showed that increasing parallelism increases accuracy, suggesting that the information shared among processors during voting does improve the learning ability of the algorithm; interestingly, total time increased as more processors were used, reflecting the added communication costs. The limitation of these techniques is that the original dataset must be large enough for the partitions to allow reasonable samples to be selected. This research fits into the intersection between hardware and statistics: the authors used a hardware-based solution to improve the performance of specialized statistical techniques.

Hegland [6] reviews the effects of small-granularity parallelism in four nonparametric regression techniques: additive models, radial basis function fitting with thin plate splines, multivariate adaptive regression splines, and sparse grids. The models discussed in the paper deal effectively with the curse of dimensionality (see http://www.statsoftinc.com/textbook/stathome.html for a brief description) and scale well through parallelism of the algorithms.
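A minimal sketch of the bagging side of this idea follows, with a toy nearest-class-mean base learner standing in for a real predictor. All names and parameter values are illustrative, not those used in [12].

    import random
    from collections import Counter

    def train(sample):
        # Toy base learner: the per-class mean of one numeric feature.
        sums, counts = {}, {}
        for x, y in sample:
            sums[y] = sums.get(y, 0.0) + x
            counts[y] = counts.get(y, 0) + 1
        return {y: sums[y] / counts[y] for y in sums}

    def predict(model, x):
        # Assign x to the class whose training mean is closest.
        return min(model, key=lambda y: abs(model[y] - x))

    def bag(data, n_predictors=25, seed=0):
        # Bagging: each predictor is trained on an independently drawn
        # bootstrap sample (with replacement, same size as the data).
        rng = random.Random(seed)
        return [train([rng.choice(data) for _ in data])
                for _ in range(n_predictors)]

    data = ([(random.gauss(0, 1), "a") for _ in range(500)] +
            [(random.gauss(3, 1), "b") for _ in range(500)])
    votes = Counter(predict(m, 2.6) for m in bag(data))
    print(votes.most_common(1)[0][0])  # majority vote; expect "b"

Because each bootstrap sample is independent, the calls to train parallelize trivially; boosting would replace the independent draws with a distribution that up-weights hard-to-classify objects, which is what introduces the inter-processor communication observed in [12].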

DISCUSSION

This paper presents a classification scheme for scaling methods in data mining. The classification focuses on the intersections of the three main areas of hardware, algorithms, and statistics. The idea is to draw on the strengths of multiple areas to create more powerful scaling solutions. For example, statistical techniques such as boosting are powerful in their own right but become even more powerful when coupled with parallel processors. And while increased memory is useful, modified algorithms that reduce or eliminate the memory requirement during mining are even better suited to scaling.

The proposed scheme creates a formalized way of classifying scaling solutions. This provides clarity for future research and may allow researchers to see common attributes between existing scaling methods that are not apparent at first glance, making it possible to classify the literature dealing with scaling in data mining. It should also be clear that a multi-disciplinary perspective offers real advantages when conducting research in scaling. Each field (statistics, computer science, electrical engineering, information systems) has its own language, and without a common one it can be difficult for researchers in separate disciplines to find, understand, and apply concepts or principles from other areas. This classification scheme may be a stepping stone for future research aimed at developing such a common language.

REFERENCES

[1] U. Fayyad and R. Uthurusamy, "Evolving data mining into solutions for insights," Communications of the ACM, vol. 45, pp. 28-31, 2002.

[2] P. J. Huber, "Massive datasets workshop: Four years after," Journal of Computational and Graphical Statistics, vol. 8, pp. 635-652, 1999.

[3] J. H. Friedman, "Data mining and statistics: What's the connection?," in Proceedings of the Second International Conference on Knowledge Discovery in Databases and Data Mining. AAAI/MIT Press, 1997.

[4] M. V. Joshi, G. Karypis, and V. Kumar, "ScalParC: A new scalable and efficient parallel classification algorithm for mining large datasets," in Proceedings of the First Merged International Parallel Processing Symposium and Symposium on Parallel and Distributed Processing (IPPS/SPDP). IEEE, 1998.

[5] R. Collobert, Y. Bengio, and S. Bengio, "Scaling large learning problems with hard parallel mixtures," International Journal of Pattern Recognition and Artificial Intelligence, vol. 17, pp. 349-365, 2003.

[6] M. Hegland, "Parallel algorithms for predictive modeling," working paper, 2003.

[7] J. Shafer, R. Agrawal, and M. Mehta, "SPRINT: A scalable parallel classifier for data mining," in Proceedings of the 22nd International Conference on Very Large Databases (VLDB), 1996.

[8] M. Wang, B. Iyer, and J. S. Vitter, "Scalable mining for classification rules in relational databases," in Proceedings of the International Database Engineering and Applications Symposium (IDEAS'98), Cardiff, Wales, U.K., July 1998.

[9] X. Zhang, C. Bajaj, and W. Blanke, "Scalable isosurface visualization of massive datasets on COTS clusters," in Proceedings of the IEEE 2001 Symposium on Parallel and Large-Data Visualization and Graphics. IEEE, 2001.

[10] R. Kohavi and G. John, "Wrappers for feature subset selection," Artificial Intelligence, vol. 97, pp. 273-324, 1997.

[11] P. Bradley, J. Gehrke, R. Ramakrishnan, and R. Srikant, "Scaling mining algorithms to large databases," Communications of the ACM, vol. 45, pp. 38-43, 2002.

[12] C. Yu and D. B. Skillicorn, "Parallelizing boosting and bagging," Technical Report 2001-442, Department of Computing and Information Science, Queen's University, Kingston, Canada, 2001.