KDD, SEMMA AND CRISP-DM: A PARALLEL OVERVIEW. Ana Azevedo and M.F. Santos

Similar documents
A Comparative Study of Data Mining Process Models (KDD, CRISP-DM and SEMMA)

Data Mining. Introduction. Piotr Paszek. (Piotr Paszek) Data Mining DM KDD 1 / 44

Lluis Belanche + Alfredo Vellido Data Mining II An Introduction to Mining (2)

International Journal of Data Mining & Knowledge Management Process (IJDKP) Vol.7, No.3, May Dr.Zakea Il-Agure and Mr.Hicham Noureddine Itani

International Journal of Computer Engineering and Applications, ICCSTAR-2016, Special Issue, May.16

DATA SCIENCE METHODOLOGY FOR CYBERSECURITY PROJECTS

9. Conclusions. 9.1 Definition KDD

Knowledge Discovery. URL - Spring 2018 CS - MIA 1/22

Materialized Data Mining Views *

A Rough Set Approach for Generation and Validation of Rules for Missing Attribute Values of a Data Set

Web Page Classification using FP Growth Algorithm Akansha Garg,Computer Science Department Swami Vivekanad Subharti University,Meerut, India

Data Mining Overview. CHAPTER 1 Introduction to SAS Enterprise Miner Software

K-Mean Clustering Algorithm Implemented To E-Banking

Dynamic Data in terms of Data Mining Streams

Knowledge Discovery. Javier Béjar URL - Spring 2019 CS - MIA

Data Management Glossary

> Data Mining Overview with Clementine

BEST PRACTICES FOR ADAPTATION OF DATA MINING TECHNIQUES IN EDUCATION SECTOR Jikitsha Sheth, Dr. Bankim Patel

Bitmap index-based decision trees

This proposed research is inspired by the work of Mr Jagdish Sadhave 2009, who used

An Overview of various methodologies used in Data set Preparation for Data mining Analysis

The Data Mining usage in Production System Management

IMPLEMENTATION AND COMPARATIVE STUDY OF IMPROVED APRIORI ALGORITHM FOR ASSOCIATION PATTERN MINING

International Journal of Scientific Research & Engineering Trends Volume 4, Issue 6, Nov-Dec-2018, ISSN (Online): X

The CRISP-DM Process Model

The Knowledge Discovery Process

Fast Discovery of Sequential Patterns Using Materialized Data Mining Views

Web page recommendation using a stochastic process model

A Model-View-Controller architecture for Knowledge Discovery

Data Mining An Overview ITEV, F /18

IJMIE Volume 2, Issue 9 ISSN:

Introduction to Data Mining

Discovery of Actionable Patterns in Databases: The Action Hierarchy Approach

ISSN: [Saurkar* et al., 6(4): April, 2017] Impact Factor: 4.116

Introduction to Data Science

With data-based models and design of experiments towards successful products - Concept of the product design workbench

Introduction to Data Mining

Knowledge Discovery in Data Bases

Now, Data Mining Is Within Your Reach

An Improved Apriori Algorithm for Association Rules

Multi-relational Decision Tree Induction

Taccumulation of the social network data has raised

Database Infrastructure to Support Knowledge Management in Physicochemical Data - Application in NIST/TRC SOURCE Data System

PPKM: Preserving Privacy in Knowledge Management

Elena Marchiori Free University Amsterdam, Faculty of Science, Department of Mathematics and Computer Science, Amsterdam, The Netherlands

CHAPTER 7 CONCLUSION AND FUTURE WORK

Log Information Mining Using Association Rules Technique: A Case Study Of Utusan Education Portal

Chapter 1, Introduction

Item Set Extraction of Mining Association Rule

How are XML-based Marc21 and Dublin Core Records Indexed and ranked by General Search Engines in Dynamic Online Environments?

arxiv: v1 [cs.db] 7 Dec 2011

Designing Data Warehouses. Data Warehousing Design. Designing Data Warehouses. Designing Data Warehouses

Multidimensional Process Mining with PMCube Explorer

UML-Based Conceptual Modeling of Pattern-Bases

Capturing and Formalizing SAF Availability Management Framework Configuration Requirements

Customer Clustering using RFM analysis

A Survey of the State of the Art in Data Mining and Integration Query Languages

DATA MINING II - 1DL460

Frequent Item Set using Apriori and Map Reduce algorithm: An Application in Inventory Management

SCENARIO BASED ADAPTIVE PREPROCESSING FOR STREAM DATA USING SVM CLASSIFIER

Question Bank. 4) It is the source of information later delivered to data marts.

INFREQUENT WEIGHTED ITEM SET MINING USING NODE SET BASED ALGORITHM

Performance Analysis of Data Mining Classification Techniques

PROJECT PERIODIC REPORT

Web Usage Mining: How to Efficiently Manage New Transactions and New Clients

IJREAS Volume 2, Issue 2 (February 2012) ISSN: COMPARING MANUAL AND AUTOMATIC NORMALIZATION TECHNIQUES FOR RELATIONAL DATABASE ABSTRACT

Data Access Paths in Processing of Sets of Frequent Itemset Queries

SAS Enterprise Miner 7.1

An Approach for Accessing Linked Open Data for Data Mining Purposes

Association Rules Mining using BOINC based Enterprise Desktop Grid

Information mining and information retrieval : methods and applications

DENSITY BASED AND PARTITION BASED CLUSTERING OF UNCERTAIN DATA BASED ON KL-DIVERGENCE SIMILARITY MEASURE

SEQUENTIAL PATTERN MINING FROM WEB LOG DATA

AC-Close: Efficiently Mining Approximate Closed Itemsets by Core Pattern Recovery

Datasets Size: Effect on Clustering Results

Data Mining: Concepts and Techniques

Mining Frequent Itemsets for data streams over Weighted Sliding Windows

Web Mining Evolution & Comparative Study with Data Mining

Discovering Paths Traversed by Visitors in Web Server Access Logs

ADVANCED ANALYTICS USING SAS ENTERPRISE MINER RENS FEENSTRA

Efficient integration of data mining techniques in DBMSs

Marketing Knowledge Management Model

Mining High Order Decision Rules

WEB SEARCH, FILTERING, AND TEXT MINING: TECHNOLOGY FOR A NEW ERA OF INFORMATION ACCESS

DIGIT.B4 Big Data PoC

Towards Rule Learning Approaches to Instance-based Ontology Matching

Database Systems: Design, Implementation, and Management Tenth Edition. Chapter 9 Database Design

Symbolic Evaluation of Sums for Parallelising Compilers

Introduction to Data Mining and Data Analytics

Dynamic Optimization of Generalized SQL Queries with Horizontal Aggregations Using K-Means Clustering

Method to Study and Analyze Fraud Ranking In Mobile Apps

Collaborative Rough Clustering

BSIT 1 Technology Skills: Apply current technical tools and methodologies to solve problems.

Research on Applications of Data Mining in Electronic Commerce. Xiuping YANG 1, a

Circle Graphs: New Visualization Tools for Text-Mining

TDWI strives to provide course books that are contentrich and that serve as useful reference documents after a class has ended.

Enhancing Forecasting Performance of Naïve-Bayes Classifiers with Discretization Techniques

Keywords Fuzzy, Set Theory, KDD, Data Base, Transformed Database.

Data Mining: Approach Towards The Accuracy Using Teradata!

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

Transcription:

KDD, SEMMA AND CRISP-DM: A PARALLEL OVERVIEW Ana Azevedo and M.F. Santos ABSTRACT In the last years there has been a huge growth and consolidation of the Data Mining field. Some efforts are being done that seek the establishment of standards in the area. Included on these efforts there can be enumerated SEMMA and CRISP-DM. Both grow as industrial standards and define a set of sequential steps that pretends to guide the implementation of data mining applications. The question of the existence of substantial differences between them and the traditional KDD process arose. In this paper, is pretended to establish a parallel between these and the KDD process as well as an understanding of the similarities between them. KEYWORDS Data Mining Standards, Knowledge Discovery in Databases, Data Mining. 1. INTRODUCTION Fayyad considers Data Mining (DM) as one of the phases of the KDD 1 process (Fayyad et al., 1996). The data mining phase concerns, mainly, to the means by which the patterns are extracted and enumerated from data. The literature is a source of some confusion because de two terms are indistinctively used, making it difficult to determine exactly each of the concepts (Benoît, 2002). In this paper, data mining is seen as one of the phases of the KDD process, as presented in (Fayyad et al., 1996) and in (Brachman & Anand, 1996). The growth of the attention paid to the area emerged from the rising of big databases in an increasing and differentiate number of organizations. There is the risk of wasting all the value and wealthy of information contained on these databases unless there are used the adequate techniques to extract useful knowledge (Chen et al, 1996) (Simoudis, 1996) (Fayyad, 1996). In the latest years, it has been occurring the growth and consolidation of the data mining area. Some efforts are being done that seek the establishment of standards in the area, both by academics and by people in the industry field. The academics efforts are centered in the attempt to formulate a general framework for DM (Dzeroski, 2006). The bulk of these efforts are centered in the definition of a language for DM that can be accepted as a standard, in the same way that SQL was accepted as a standard for relational databases (Han et al, 1996) (Meo et al, 1998) (Imielinski et al, 1999) (Sarawagi, 2000) (Botta et al, 2004). The efforts in the industrial field concern mainly the definition of processes/methodologies that can guide the implementation of DM applications. In this paper, SEMMA and CRISP-DM have been chosen, because they are considered to be the most popular. Although it is not scientific this perception exists, because SEMMA and CRISP-DM are presented in many of the publications of the area and are really used in practice. During the analysis of the documentation on SEMMA and on CRISP-DM, the question of the existence of substantial differences between them and the traditional KDD process arose. In this paper, it is pretended 1 Knowledge Discovery in Databases

to establish a parallel between these and the KDD process as well as an understanding of the similarities between them. The paper begins, on section 2, by presenting KDD, SEMMA and CRISP-DM. Next, on section 3, a comparative study is done, presenting the analogies and the differences between the three processes. Finally, on section 4, conclusions and future work are presented. 2. KDD, SEMMA AND CRISP-DM DESCRIPTION The term knowledge discovery in databases or KDD, for short, was coined in 1989 to refer to the broad process of finding knowledge in data, and to emphasize the high-level application of particular data mining methods (Fayyad et al, 1996). Fayyad considers DM as one of the phases of the KDD process and considers that the data mining phase concerns, mainly, to the means by which the patterns are extracted and enumerated from data. In this paper there is a concern with the overall KDD process, which will be described in section 2.1. SEMMA was developed by the SAS Institute. CRISP-DM was developed by the means of the efforts of a consortium initially composed with DaimlerChryrler, SPSS and NCR. They will be described in sections 2.2 and 2.3, respectively. Despite that SEMMA and CRISP-DM are usually referred as methodologies, in this paper they are referred as processes, in the sense that they consist of a particular course of action intended to achieve a result. 2.1 The KDD process The KDD process, as presented in (Fayyad et al, 1996) is the process of using DM methods to extract what is deemed knowledge according to the specification of measures and thresholds, using a database along with any required preprocessing, sub sampling, and transformation of the database. There are considered five stages, presented in figure 1: 1. Selection This stage consists on creating a target data set, or focusing on a subset of variables or data samples, on which discovery is to be performed. 2. Pre processing This stage consists on the target data cleaning and pre processing in order to obtain consistent data. 3. Transformation This stage consists on the transformation of the data using dimensionality reduction or transformation methods. 4. Data Mining This stage consists on the searching for patterns of interest in a particular representational form, depending on the data mining objective (usually, prediction) 5. Interpretation/Evaluation This stage consists on the interpretation and evaluation of the mined patterns. Figure 1. The five stages of KDD

The KDD process is interactive and iterative, involving numerous steps with many decisions being made by the user. (Brachman, Anand, 1996). Additionally, the KDD process must be preceded by the development of an understanding of the application domain, the relevant prior knowledge and the goals of the end-user. It also must be continued by the knowledge consolidation by incorporating this knowledge into the system (Fayyad et al, 1996). 2.2 The SEMMA process The SEMMA process was developed by the SAS Institute. The acronym SEMMA stands for Sample, Explore, Modify, Model, Assess, and refers to the process of conducting a data mining project. The SAS Institute considers a cycle with 5 stages for the process: 1. Sample This stage consists on sampling the data by extracting a portion of a large data set big enough to contain the significant information, yet small enough to manipulate quickly. This stage is pointed out as being optional. 2. Explore This stage consists on the exploration of the data by searching for unanticipated trends and anomalies in order to gain understanding and ideas. 3. Modify This stage consists on the modification of the data by creating, selecting, and transforming the variables to focus the model selection process. 4. Model This stage consists on modeling the data by allowing the software to search automatically for a combination of data that reliably predicts a desired outcome. 5. Assess This stage consists on assessing the data by evaluating the usefulness and reliability of the findings from the data mining process and estimate how well it performs. Although the SEMMA process is independent from de DM chosen tool, it is linked to the SAS Enterprise Miner software and pretends to guide the user on the implementations of DM applications. SEMMA offers an easy to understand process, allowing an organized and adequate development and maintenance of DM projects. It thus confers a structure for his conception, creation and evolution, helping to present solutions to business problems as well as to find de DM business goals. (Santos & Azevedo, 2005) 2.3 The CRISP-DM process The CRISP-DM process was developed by the means of the effort of a consortium initially composed with DaimlerChryrler, SPSS and NCR. CRISP-DM stands for CRoss-Industry Standard Process for Data Mining. It consists on a cycle that comprises six stages (figure 2): 1. Business understanding This initial phase focuses on understanding the project objectives and requirements from a business perspective, then converting this knowledge into a data mining problem definition and a preliminary plan designed to achieve the objectives. 2. Data understanding The data understanding phase starts with an initial data collection and proceeds with activities in order to get familiar with the data, to identify data quality problems, to discover first insights into the data or to detect interesting subsets to form hypotheses for hidden information. 3. Data preparation The data preparation phase covers all activities to construct the final dataset from the initial raw data. 4. Modeling In this phase, various modeling techniques are selected and applied and their parameters are calibrated to optimal values. 5. Evaluation At this stage the model (or models) obtained are more thoroughly evaluated and the steps executed to construct the model are reviewed to be certain it properly achieves the business objectives. 6. Deployment Creation of the model is generally not the end of the project. Even if the purpose of the model is to increase knowledge of the data, the knowledge gained will need to be organized and presented in a way that the customer can use it. (Chapman et al, 2000)

Figure 2 The CRISP-DM life cycle (Chapman et al, 2000) The sequence of the six stages is not rigid, as is schematize in figure 2. CRISP-DM is extremely complete and documented. All his stages are duly organized, structured and defined, allowing that a project could be easily understood or revised (Santos & Azevedo, 2005). Although the CRISP-DM process is independent from de DM chosen tool, it is linked to the SPSS Clementine software 3. A COMPARATIVE STUDY By doing a comparison of the KDD and SEMMA stages we would, on a first approach, affirm that they are equivalent: Sample can be identified with Selection, Explore can be identified with Pre processing Modify can be identified with Transformation Model can be identified with Data Mining Assess can be identified with Interpretation/Evaluation. Examining it thoroughly, we may affirm that the five stages of the SEMMA process can be seen as a practical implementation of the five stages of the KDD process, since it is directly linked to the SAS Enterprise Miner software. Comparing the KDD stages with the CRISP-DM stages is not as straightforward as in the SEMMA situation. Nevertheless, we can first of all observe that the CRISP-DM methodology incorporates the steps that, as referred above, must precede and follow the KDD process that is to say: The Business Understanding phase can be identified with the development of an understanding of the application domain, the relevant prior knowledge and the goals of the end-user The Deployment phase can be indentified with the consolidation by incorporating this knowledge into the system. Concerning the remaining stages, we can say that: The Data Understanding phase can be identified as the combination of Selection and Pre processing The Data Preparation phase can be identified with Transformation The Modeling phase can be identified with Data Mining The Evaluation phase can be identified with Interpretation/Evaluation. In table 1, we present a summary of the presented correspondences.

Table 1. Summary of the correspondences between KDD, SEMMA and CRISP-DM KDD SEMMA CRISP-DM Pre KDD ------------- Business understanding Selection Sample Pre processing Explore Data Understanding Transformation Modify Data preparation Data mining Model Modeling Interpretation/Evaluation Assessment Evaluation Post KDD ------------- Deployment 4. CONCLUSIONS AND FUTURE WORK Considering the presented analysis we conclude that SEMMA and CRISP-DM can be viewed as an implementation of the KDD process described by (Fayyad et al, 1996). At first sight, we can get to the conclusion that CRISP-DM is more complete than SEMMA. However, analyzing it deeper, we can integrate the development of an understanding of the application domain, the relevant prior knowledge and the goals of the end-user, on the Sample stage of SEMMA, because the data can not be sampled unless there exists a truly understanding of all the presented aspects. With respect to the consolidation by incorporating this knowledge into the system, we can assume that it is present, because it is truly the reason for doing it. This leads to the fact that standards have been achieved, concerning the overall process: SEMMA and CRISP-DM do guide people to know how DM can be applied in practice in real systems. In the future we pretend to analyze other aspects related to DM standards, namely SQL-based languages for DM, as well as XML-based languages for DM. As a complement, we pretend to investigate the existence of other standards for DM. REFERENCES Fayyad, U. M. et al. 1996. From data mining to knowledge discovery: an overview. In Fayyad, U. M.et al (Eds.), Advances in knowledge discovery and data mining. AAAI Press / The MIT Press. Benoît, G., 2002. Data Mining. Annual Review of Information Science and Technology, Vol. 36, No. 1, pp 265-310. Brachman, R. J. & Anand, T., 1996. The process of knowledge discovery in databases. In Fayyad, U. M. et al. (Eds.), Advances in knowledge discovery and data mining. AAAI Press / The MIT Press. Chen, M. et al, 1996. Data Mining: An Overview from a Database Perspective. IEEE Transactions on Knowledge and Data Engineering, Vol. 8, No. 6, pp 866-883. Simoudis, E., 1996. Reality check for data mining. IEEE Expert, Vol. 11, No. 5, pp 26-33. Fayyad, U. M., 1996. Data mining and knowledge discovery: making sense out of data. IEEE Expert, Vol. 11 No. 5, pp 20-25. Dzeroski, S., 2006. Towards a General Framework for Data Mining.. In Dzeroski, S and Struyf, J (Eds.), Knowledge Discovery in Inductive Databases. LNCS 47474. Springer-Verlag. Han, J. et al, 1996. DMQL: A Data Mining Query Language for Relational Databases. In proceedings of DMKD-96 (SIGMOD-96 Workshop on KDD). Montreal. Canada.

Meo, R. e tal, 1998. An Extension to SQL for Mining Association Rules. Data Mining and Knowledge Discovery Vol. 2, pp 195-224. Kluwer Academic Publishers. Imielinski, T.; Virmani, A., 1999. MSQL: A Query Language for Database Mining. Data Mining and Knowledge Discovery Vol. 3, pp 373-408. Kluwer Academic Publishers. Sarawagi, S. et al, 2000. Integrating Association Rule Mining with Relational Database Systems: Alternatives and Implications. Data Mining and Knowledge Discovery, Vol. 4, pp 89 125. Botta, Marco, et al, 2004. Query Languages Supporting Descriptive Rule Mining: A Comparative Study. Database Support for Data Mining Applications. LNAI 2682, pp 24-51. SAS Enterprise Miner SEMMA. SAS Institute. Accessed from http://www.sas.com/technologies/analytics/datamining/miner/semma.html, on May 2008 Santos, M & Azevedo, C (2005). Data Mining Descoberta de Conhecimento em Bases de Dados. FCA Publisher. Chapman, P. et al, 2000. CRISP-DM 1.0 - Step-by-step data mining guide. Accessed from http://www.crisp-dm.org/crispwp-0800.pdf on May 2008