CLASSIFICATION FOR SCALING METHODS IN DATA MINING

Eric Kyper, College of Business Administration, University of Rhode Island, Kingston, RI 02881, (401) 874-7563, ekyper@mail.uri.edu
Lutz Hamel, Department of Computer Science and Statistics, University of Rhode Island, Kingston, RI 02881, (401) 874-2701, hamel@cs.uri.edu
Scott Lloyd, College of Business Administration, University of Rhode Island, Kingston, RI 02881, (401) 874-7056, sjlloyd@uri.edu

ABSTRACT

This paper presents a proposed classification scheme for methods of scaling data mining techniques. Three basic methods, based on hardware, algorithms, and statistics, are defined. The intersections of those areas are then explored, and examples of previous research are classified within them.

KEYWORDS: Scaling, Data Mining, Classification

INTRODUCTION

Storage Law describes the rate at which storage capacity grows. For the last decade capacity has doubled every nine months, twice the rate predicted by Moore's Law for computing power. The result is a situation in which the ability to capture and store data far outpaces the ability to analyze it [1]. One consequence is a need for data mining techniques able to adapt to this changing circumstance. Hence scalability (the ability to adapt problem solutions to ever-growing problem sizes) has become an important aspect of data mining.

The obstacles to scaling are many. Human limitations in visualization, computational complexity, memory costs, and the lack of parallel software are just a few of the problems faced in scaling data mining techniques [2]. However, technological advances in hardware, improvements in algorithms, and the continued advancement of statistics have all contributed to progress in this area. So much research has been conducted that a way to classify it has become necessary. In addition, previous research has involved scientists from many disciplines (statistics, computer science, electrical engineering, information systems) that do not necessarily share a common language. A classification scheme provides a common framework by which researchers can refer to and analyze previous work.

This paper proposes a classification for scalable data mining techniques. It posits that the most interesting areas of scalability are not the independent advancements of hardware, algorithms, or statistics (although each has much to contribute), but the intersections of these three areas. The remainder of this paper presents the proposed classification model, classifies previous research, and discusses the classification scheme.

PROPOSED CLASSIFICATION SCHEME

A classification scheme provides insights and guidance regarding classes of objects. By showing how the attributes of certain objects are similar or dissimilar, it is possible to discover the distinguishing characteristics that make up the essence of each class. Here the objects are data mining scaling methods, and the scheme allows more precise discourse about them. To help clarify the issue of scaling data mining techniques, the proposed classification scheme is diagrammed in Figure 1 below.

[Figure 1: Classification Scheme. A Venn diagram of three overlapping circles labeled Hardware, Statistics, and Algorithms; intersection I lies between Hardware and Algorithms, intersection II between Algorithms and Statistics, and intersection III between Statistics and Hardware.]

In this classification the objects are scaling methods, and the classes are the circles and the intersections they form. The circles represent three basic ways to scale techniques to larger datasets: hardware, algorithms, and statistics. Each is briefly described below.

Using hardware to scale simply means implementing existing algorithms and datasets on faster machines: machines with faster processors, more memory, and the many other options bigger budgets can afford. This creates an environment that can handle larger datasets and process them more quickly.

Modifying algorithms in order to scale involves changing the code itself so the machine can process a larger amount of data, for example by creating routines that execute more efficiently. This often requires a great deal of time and knowledge about how the existing system works.

Traditional statistical sampling techniques have long been used to allow larger datasets to be processed [3]. They form a set of techniques that have proven reliable in a wide variety of circumstances and are available in common statistical packages.

The intersections of these circles are of greatest interest in this study. They represent combinations of the basic methods of scaling; the goal is to draw on the strengths of the three basic areas to create powerful scaling techniques within the intersections. The intersections may contain various ways of combining technologies, among them parallelism, pre- and post-processors, and hardware/statistics hybrids.

Parallelism can be thought of as a pure hardware solution to scaling: simply add more processors to a machine. However, code written for single-processor machines cannot take advantage of the parallelism without changes, so parallelism often falls into the intersection between hardware and algorithms. The algorithms must be modified to properly exploit the advantages parallel processors provide; the resulting power enables existing techniques to handle larger datasets. Note that not all data mining techniques lend themselves to this process: some contain routines that must run serially and therefore cannot benefit from the added processors.

Preprocessors can also be written to apply statistical techniques before the data mining algorithms run, for example by aggregating the data into a smaller dataset or by sampling to reduce its size. So wide a variety of possibilities lies within these intersections that the best way to show what can be done is to look at what has been done previously. The next section classifies examples of previous research into the intersections presented in Figure 1.

CLASSIFYING PREVIOUS RESEARCH

This section provides relevant examples of research papers on scaling techniques and fits them within the proposed classification scheme.

Intersection I: Hardware X Algorithms

Joshi, Karypis, and Kumar [4] presented ScalParC (Scalable Parallel Classifier), a decision tree based classification algorithm. Their tests showed that both runtime and memory requirements decrease as the number of processors increases, but the gains grow at a decreasing rate: going from 16 to 32 processors produced a speedup of 1.61 (a parallel efficiency of roughly 80% for the doubling), while going from 32 to 64 processors produced a speedup of only 1.31.

Collobert, Bengio, and Bengio [5] presented a methodology that reduces the training time of support vector machines from roughly O(T^3) to close to O(T), where T is the number of training instances. This was accomplished by iteratively partitioning the data and then training the partitions (modules) on parallel machines. On datasets of 100,000 to 400,000 instances they reduced training time to a linear function of T; in one example they solved a problem with 400,000 instances in less than four hours on 50 computers, and they estimate that the same problem would have taken more than one month with a single support vector machine.
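To make the partition-based pattern concrete, the following minimal Python sketch splits a training set across worker processes, fits one model per partition, and combines the results by voting. It is illustrative only: the trivial majority-class "learner" and all function names are our own stand-ins, not the actual algorithms of [4] or [5].

    from collections import Counter
    from concurrent.futures import ProcessPoolExecutor

    def train_partition(partition):
        # Stand-in learner: remember the majority class of the partition.
        labels = [label for _, label in partition]
        return Counter(labels).most_common(1)[0][0]

    def parallel_train(data, n_workers=4):
        # Split the data into one partition per worker; train concurrently.
        chunk = (len(data) + n_workers - 1) // n_workers
        partitions = [data[i:i + chunk] for i in range(0, len(data), chunk)]
        with ProcessPoolExecutor(max_workers=n_workers) as pool:
            return list(pool.map(train_partition, partitions))

    def predict_by_vote(models):
        # Each per-partition model casts one vote; the majority wins.
        return Counter(models).most_common(1)[0][0]

    if __name__ == "__main__":
        data = [((i,), "pos" if i % 3 else "neg") for i in range(10_000)]
        print(predict_by_vote(parallel_train(data)))

The point of the pattern is that the speedup comes from restructuring the algorithm around independent partitions, not from faster hardware alone.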

Garcke, Hegland, and Nielsen [6] fit sparse grids (a technique for approximating high-dimensional functions) by exploiting parallelism at multiple stages of the process.

Shafer, Agrawal, and Mehta [7] presented an algorithm called SPRINT that works on disk-resident datasets too large to fit into memory. SPRINT removes the memory restriction but still requires a hash tree proportional in size to the training set, which can become expensive for large training sets.

Wang, Iyer, and Vitter [8] proposed a classification algorithm called MIND (MINing in Databases) that uses extended relational calculus, allowing it to be built into a relational database system. This yields greater I/O efficiency, and thus scalability, because the data need not be memory resident. The algorithm also scales easily over multiple processors.

Zhang, Bajaj, and Blanke [9] presented a fully parallel visualization solution that incorporates both parallel processors and disk arrays.

Intersection II: Algorithms X Statistics

Here statistics are applied to a dataset so that its size can be reduced without affecting its content, that is, the representation of the information to be learned or mined. Sampling the dataset, with or without replacement, is perhaps the most popular technique for reducing the number of instances. Another way to reduce the size of a dataset is to reduce the number of attributes. A standard statistical technique for removing attributes is Principal Component Analysis. Another approach to attribute reduction, put forward by Kohavi and John, is the wrapper approach [10]: the sensitivity of the model in question is systematically tested against the available attributes, and attributes for which the model exhibits low sensitivity are eliminated. Bradley et al. [11] discuss the use of a smaller set of sufficient statistics in decision tree construction; the aggregated set is smaller than the original data, creating a way to scale to larger datasets.
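As a deliberately simplified illustration of these ideas, the Python sketch below reduces rows by sampling without replacement and reduces columns by projecting onto leading principal components. A wrapper approach [10] would instead loop over candidate attribute subsets and keep those to which the model is sensitive. The 10% sample fraction and five components are arbitrary illustrative choices, not recommendations.

    import numpy as np

    def sample_rows(X, fraction=0.10, seed=0):
        # Reduce the instance count by uniform sampling without replacement.
        rng = np.random.default_rng(seed)
        idx = rng.choice(len(X), size=int(len(X) * fraction), replace=False)
        return X[idx]

    def pca_reduce(X, n_components=5):
        # Reduce the attribute count via principal components (SVD on
        # centered data); keep only scores on the leading components.
        Xc = X - X.mean(axis=0)
        _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
        return Xc @ Vt[:n_components].T

    X = np.random.default_rng(1).normal(size=(100_000, 50))
    X_small = pca_reduce(sample_rows(X))
    print(X_small.shape)  # (10000, 5): one percent of the original cells

Either step alone already shrinks the input handed to the mining algorithm; combined, they cut both dimensions of the problem before any mining code runs.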

Intersection III: Statistics X Hardware

Yu and Skillicorn [12] tested the effects of parallelism (more processors) on bagging and boosting, two techniques for generating subsets of training data for use in predictor generation with voting or regression. In bagging the subsets are chosen independently; boosting is a more complex alternative that uses information about how hard objects are to classify. Ideally, objects that are hard to classify should be overrepresented in samples so that new predictors can spend more effort defining tighter boundaries between classes. The results for bagging showed that more processors do not equate to less total time spent on the problem. The boosting results showed that increasing parallelism increases accuracy, suggesting that the information shared among processors during voting does improve the learning ability of the algorithm; interestingly, total time increased as more processors were used, reflecting the added communication costs. The limitation of these techniques is that the original dataset must be large enough for the partitions to allow reasonable samples to be selected. This research fits into the intersection between hardware and statistics: the authors used a hardware-based solution to improve the performance of specialized statistical techniques.

Hegland [6] reviews the effects of small-granularity parallelism in four nonparametric regression techniques: additive models, radial basis function fitting with thin plate splines, multivariate adaptive regression splines, and sparse grids. The models discussed in the paper deal effectively with the curse of dimensionality (see http://www.statsoftinc.com/textbook/stathome.html for a brief description) and scale well through parallelism of the algorithms.
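A minimal sketch of the bagging side of this idea follows, with a toy nearest-class-mean base learner standing in for a real predictor. All names and parameter values are illustrative, not those used in [12].

    import random
    from collections import Counter

    def train(sample):
        # Toy base learner: the per-class mean of one numeric feature.
        sums, counts = {}, {}
        for x, y in sample:
            sums[y] = sums.get(y, 0.0) + x
            counts[y] = counts.get(y, 0) + 1
        return {y: sums[y] / counts[y] for y in sums}

    def predict(model, x):
        # Assign x to the class whose training mean is closest.
        return min(model, key=lambda y: abs(model[y] - x))

    def bag(data, n_predictors=25, seed=0):
        # Bagging: each predictor is trained on an independently drawn
        # bootstrap sample (with replacement, same size as the data).
        rng = random.Random(seed)
        return [train([rng.choice(data) for _ in data])
                for _ in range(n_predictors)]

    data = ([(random.gauss(0, 1), "a") for _ in range(500)] +
            [(random.gauss(3, 1), "b") for _ in range(500)])
    votes = Counter(predict(m, 2.6) for m in bag(data))
    print(votes.most_common(1)[0][0])  # majority vote; expect "b"

Because each bootstrap sample is independent, the calls to train parallelize trivially; boosting would replace the independent draws with a distribution that up-weights hard-to-classify objects, which is what introduces the inter-processor communication observed in [12].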

DISCUSSION

This paper presents a classification scheme for scaling methods in data mining. The classification focuses on the intersections of the three main areas of hardware, algorithms, and statistics. The idea is to draw on the strengths of multiple areas to create more powerful scaling solutions. For example, statistical techniques such as boosting are powerful in their own right but become even more powerful when coupled with parallel processors. And while increased memory is useful, modified algorithms that reduce or eliminate the memory requirement during mining are even better suited to scaling.

The proposed scheme creates a formalized way of classifying scaling solutions. This provides clarity for future research and may allow researchers to see common attributes between existing scaling methods that are not apparent at first glance, making it possible to classify the literature dealing with scaling in data mining. It should also be clear that a multi-disciplinary perspective offers real advantages when conducting research in scaling. Each field (statistics, computer science, electrical engineering, information systems) has its own language, and without a common one it can be difficult for researchers in separate disciplines to find, understand, and apply concepts or principles from other areas. This classification scheme may be a stepping stone for future research aimed at developing such a common language.

REFERENCES

[1] U. Fayyad and R. Uthurusamy, "Evolving data mining into solutions for insights," Communications of the ACM, vol. 45, pp. 28-31, 2002.

[2] P. J. Huber, "Massive datasets workshop: Four years after," Journal of Computational and Graphical Statistics, vol. 8, pp. 635-652, 1999.

[3] J. H. Friedman, "Data mining and statistics: What's the connection?," in Proceedings of the Second International Conference on Knowledge Discovery in Databases and Data Mining. AAAI/MIT Press, 1997.

[4] M. V. Joshi, G. Karypis, and V. Kumar, "ScalParC: A new scalable and efficient parallel classification algorithm for mining large datasets," in Proceedings of the First Merged International Parallel Processing Symposium and Symposium on Parallel and Distributed Processing (IPPS/SPDP). IEEE, 1998.

[5] R. Collobert, Y. Bengio, and S. Bengio, "Scaling large learning problems with hard parallel mixtures," International Journal of Pattern Recognition and Artificial Intelligence, vol. 17, pp. 349-365, 2003.

[6] M. Hegland, "Parallel algorithms for predictive modeling," working paper, 2003.

[7] J. Shafer, R. Agrawal, and M. Mehta, "SPRINT: A scalable parallel classifier for data mining," in Proceedings of the 22nd International Conference on Very Large Databases (VLDB), 1996.

[8] M. Wang, B. Iyer, and J. S. Vitter, "Scalable mining for classification rules in relational databases," in Proceedings of the International Database Engineering and Applications Symposium (IDEAS'98), Cardiff, Wales, U.K., July 1998.

[9] X. Zhang, C. Bajaj, and W. Blanke, "Scalable isosurface visualization of massive datasets on COTS clusters," in Proceedings of the IEEE 2001 Symposium on Parallel and Large-Data Visualization and Graphics. IEEE, 2001.

[10] R. Kohavi and G. John, "Wrappers for feature subset selection," Artificial Intelligence, vol. 97, pp. 273-324, 1997.

[11] P. Bradley, J. Gehrke, R. Ramakrishnan, and R. Srikant, "Scaling mining algorithms to large databases," Communications of the ACM, vol. 45, pp. 38-43, 2002.

[12] C. Yu and D. B. Skillicorn, "Parallelizing boosting and bagging," Technical Report 2001-442, Department of Computing and Information Science, Queen's University, Kingston, Canada, 2001.