Chapter 1 Introduction to Data Mining


Sheena Manning
5 years ago

1.1 Introduction to Data Mining

Data mining refers to the process of extracting, or mining, knowledge from large amounts of data (Hand, et al., 2000). It is the process of searching for patterns by scanning huge volumes of data (Han & Kamber, 2006). Storing enormous quantities of data is useful only if valuable knowledge can be extracted from it. To find useful patterns within the data, different kinds of algorithms categorize the data either automatically or semi-automatically (Agrawal & Srikant, 1994). These patterns are then used to derive sets of rules. The discovered patterns must be meaningful so that they support decision making, market analysis, financial growth, business intelligence and similar goals. Obtaining such meaningful patterns requires a significantly large amount of data. To cope with this volume, data mining borrows concepts from machine learning and statistics. Data mining yields insight into and understanding of the data, and thereby provides knowledge. It also provides the capability to predict future observations. Besides prediction, data mining is useful for summarizing the underlying relationships in data. Data mining can be applied to many kinds of data storage, such as text data, databases, data warehouses, transactional data, multimedia data, sequences, the web, data streams, time series, spatiotemporal data, graphs, and social and information networks. Nowadays data mining has grown to the point where it produces useful results in many fields, such as insurance, risk management, health care, customer management, financial analysis, manufacturing operations and the anticipation of reimbursements for corporate expense claims. The focus of this thesis is on how data mining is relevant to knowledge discovery at multiple levels of abstraction. Data mining examines data from various angles and summarizes the outcome as valuable information.
It also explores data along different dimensions and then categorizes and summarizes the associations among them. To be precise, the process of finding patterns and interrelations among data is known as data mining. Ongoing development in data mining has contributed several types of algorithms, drawn from the areas of databases, statistics, machine learning and pattern recognition, which are useful for technology utilization and adaptation.

Data mining is mainly used today by companies to acquire information about their products, customers, marketing strategies and other relevant aspects (Barbara, et al., 2001). Using data mining, companies can find associations between "external" elements, such as customer demographics and economic indicators, and "internal" elements, such as product positioning, staff skills and price.

1.2 History of Data Mining

Data mining is the outgrowth of an area with an extensive history (Coenen, 2004). The term itself was coined in the 1990s. The origins of data mining can be traced back along three lines: artificial intelligence (Wu, 2004), statistics, and machine learning (Zhou, 2003). Artificial intelligence (AI) is based on heuristics; it tries to apply human-like thinking procedures to statistical problems. Many high-end business products use artificial intelligence techniques; for example, relational database systems use query optimization techniques. Statistics is the basis for numerous data mining techniques, for example variance, regression analysis, discriminant analysis, confidence intervals, standard deviation, standard distribution and cluster analysis. These are used to examine data and the relationships within it. Machine learning (ML) (Michalski, et al., 1998) combines artificial intelligence and statistics. The focus of machine learning is to develop algorithms that can teach themselves and improve whenever they encounter new kinds of data. It also uses several statistical techniques. With the help of these techniques, people can take decisions that are based on the quality of the data. At first, data mining algorithms were developed mainly for numerical data, but they were later extended to all types of data, such as multimedia, text, spatial data, images and web data.
Initially, data mining started with the analysis of individual databases; subsequently, data mining techniques were formulated for traditional and relational databases, flat files and data warehouses. Afterwards, by blending machine learning techniques and statistics, diverse algorithms were developed to mine structured and unstructured data. The area of data mining has kept developing because of its considerable achievements in terms of range of applications, scientific progress and understanding. The ever-increasing complexity in several fields and improvements in technology have posed fresh challenges to data mining. These include heterogeneous data formats, progress in networking and computational resources, new scientific research fields and new business demands.

According to Fayyad (1996), KDD will continue to develop in various fields such as machine learning, artificial intelligence, databases, machine discovery, scientific discovery and information retrieval (Fayyad, et al., 1996). Techniques from all of these fields are used in the knowledge discovery process.

1.3 Knowledge Discovery in Database

The basic purpose of knowledge discovery is to generate knowledge in a form that a human can understand. It is a way to extract valuable information and knowledge from a large volume of data. Data mining is one step in the Knowledge Discovery in Database (KDD) process; it uses particular algorithms to extract patterns (models) from data (Pazzani, 2000). The term KDD refers to the entire process of extracting usable knowledge from data. The KDD process consists of five stages, as shown in Figure 1.1.

Figure 1.1: KDD Process Model

The phases of this process are data selection, data preprocessing, data transformation, data mining and evaluation. First, data is obtained from data warehouses or other data sources; then preprocessing such as data cleaning and data integration is applied. After that, data transformation or data reduction is performed on the preprocessed data. In the next phase of the KDD process, an appropriate data mining technique is applied to the data. As a result of the data mining process, useful patterns and structures emerge. These patterns and structures are then interpreted as knowledge. Therefore, data mining

plays an essential role in the knowledge discovery process. The KDD process converts data into high-level knowledge. The overall process of discovering patterns and relationships in a database can be automated or semi-automated. In the KDD process, data mining is one of the most important steps. Although the terms KDD and DM are closely related, they refer to slightly different concepts. Data mining is only the application of a specific algorithm, chosen according to the overall goal of the KDD process. The data mining phase extracts knowledge from the data. The knowledge is then represented in a form that any user can understand, and on this basis the user is able to make important decisions. Data mining can be applied to many kinds of data storage, such as text data, databases, data warehouses, transactional data, multimedia data, data streams, spatiotemporal data, sequences, time series, the web, graphs, and social and information networks (Han & Kamber, 2008).

1.4 Data Mining Models

By convention, data mining models are mainly of two types: descriptive models and predictive models.

Descriptive Model

The primary goal of these models is to derive patterns (correlations, trends) that summarize the underlying relationships in data. Descriptive data mining is normally applied to generate correlations, frequencies and cross tabulations (Witten & Frank, 2005). A descriptive model can be designed to extract interesting patterns from the data, to find previously unknown patterns and to find interesting subgroups in the data; for example, to identify web pages that are accessed together by users. Association rule discovery, clustering, sequential pattern mining and summarization fall under the descriptive model.

Predictive Model

The idea behind these models is to build a model from data whose outcome is known and to use it to anticipate the outcome for unknown data (Han & Kamber, 2008) (Witten & Frank, 2005).
For example, a bank has data about the loans it granted in the past. In this data, the independent variables are the characteristics of the clients to whom loans were granted, and the dependent variable is whether the loan was repaid or not. A model built from this data helps in deciding whether a loan should be given to a new client. For predictive data mining, regression, deviation detection and classification are used.
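The loan example can be made concrete with a very small sketch. It is only an illustration: the two client characteristics, the records and the choice of a 1-nearest-neighbour rule are all assumptions made here and are not taken from the thesis.

```python
from math import dist

# Hypothetical past loans: (monthly income in thousands, years employed) -> repaid?
# The features and values are invented purely for illustration.
past_loans = [
    ((2.0, 1.0), False),
    ((2.5, 0.5), False),
    ((6.0, 8.0), True),
    ((5.5, 6.0), True),
    ((4.0, 3.0), True),
]


def predict_repaid(applicant, history=past_loans):
    """Predict the dependent variable (repaid or not) for a new client by
    copying the outcome of the most similar past client (1-nearest neighbour)."""
    _, outcome = min(history, key=lambda record: dist(record[0], applicant))
    return outcome


print(predict_repaid((5.0, 7.0)))  # closest past client repaid
print(predict_repaid((2.2, 1.0)))  # closest past client did not repay
```

A real predictive model would be trained and validated on far more data; the sketch only shows how known outcomes are used to anticipate an unknown one.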

1.5 Functionality of Data Mining

The function of data mining is to extract knowledge and interesting patterns from given data, and many functionalities are available for extracting patterns. Data mining searches the data for interesting patterns. At early stages these patterns are generally unknown but potentially useful. Data mining provides several kinds of functionality. A particular functionality can be selected on the basis of the application area and the kind of information to be mined. Using these functionalities, various kinds of knowledge can be mined, such as association rules, classification rules, characterization, clustering, discriminant rules, deviation analysis and predictive analysis. The uses of data mining are rich and extensive; it can serve many applications and areas (Tan, et al., 2005). Figure 1.2 shows the basic functionalities, such as outlier analysis, clustering, classification, frequent pattern mining and characterization. These functionalities are explained below.

Figure 1.2: Data Mining Functionalities

Classification & Prediction

Classification in data mining is capable of processing large amounts of data (Han & Kamber, 2008). Classification assigns the items in a data set to target classes: it predicts the target category for each instance in the data, that is, it assigns a class label to a set of uncategorized cases. Because a class label is given for all the training data, this is termed supervised learning. Classification is used to map data items into various predefined classes (Weiss & Kulikowski, 1991). In this kind of task, training samples are provided first. On the basis of these samples, a model is built that works on the values of the other attributes. Classification techniques are used by knowledge discovery applications such as

the categorization of trends in financial markets, and in this way they automatically recognize interesting patterns in large databases. Classification techniques deduce a model from the database. The database comprises numerous kinds of attributes. Those that denote the category of a tuple are called the predicted attributes; the remaining attributes are known as the predicting attributes. A combination of values for the predicted attributes specifies a class. When learning classification rules, the user must first define conditions for all the classes. On the basis of these rules, the system predicts the class, and the data mining system then builds descriptions for these classes. Initially, the system requires a tuple or case with known attribute values so that it can predict the class related to the case. Once the classes are defined, the system can infer the patterns that determine the classification and can therefore find a description of each class. The description refers only to those attributes of the training set that are helpful for prediction, so it matches the related values that satisfy it and ignores the others. A case or rule is said to be accurate if its description matches the related entries of its class and ignores the unrelated ones. An illustration of a decision tree used for classification is shown in Figure 1.3.

Figure 1.3: Classification using Decision Tree
Image Source: [Weblink1]

There are several methods of data mining classification, such as Decision Tree based Methods, Rule-based Methods, Naïve Bayes and Bayesian Belief Networks, Nearest-Neighbor

Method, Neural Networks, Support Vector Machines and Ensemble Methods, all usable for classification and prediction.

Clustering

Clustering is the task of grouping a set of objects in such a way that objects of a similar kind are kept in the same cluster (Cheeseman & Stutz, 1996) (Ester & Kriegel, 1995) (Ng & Han, 1994). It differs from classification because it does not use any training data. It is a key technique of data mining, commonly used for statistical data analysis in areas including machine learning, image analysis, pattern recognition, information retrieval and bioinformatics. Basically, different partitions are created in clustering (Witten & Frank, 2005), and objects are then placed into those partitions on the basis of similarity under some metric. Clustering is an unsupervised technique: the categories, groups or classes are not defined in advance. In an unsupervised technique, objects are grouped on the basis of their proximity or similarity. In this kind of learning the system discovers the classes itself, i.e. the system selects an attribute based on the given data and partitions the data on that attribute. It then selects another attribute to partition the data, and so on. Objects are often represented as mutually exclusive and/or exhaustive groups of clusters.

Figure 1.4: Clustering
Image Source: [Weblink2]

Clustering in terms of similarity is a very powerful method. It translates an intuitive notion of similarity into a quantitative measure (Zhang & Ramakrishan, 1996). There are many approaches to creating clusters. One approach is to create rules that assign membership in the same group based on the degree of similarity

between members. Another approach is to construct a set of functions that measure some property of the partitions as a function of some parameter of the partition. Figure 1.4 illustrates clustering.

Characterization and Discrimination

Data characterization is a summarization of the general features or characteristics of a selected class of data (Witten & Frank, 2005). In data characterization, the abstraction is performed according to the specific requirements of the user. Usually, the data can be collected by issuing a query. Summarization is the procedure of finding a brief description for a subset of data (Fayyad, et al., 1996). There are many refined techniques for summarization, and they are generally applied to perform data analysis and to assist in automatic report generation. In data discrimination, the target objects of a class are compared with objects from one or more other classes with regard to specific generalized characteristics (Dash & Liu, 1997) (Pitt & Nayak, 2007).

Outlier Analysis / Deviation Detection

Outliers are objects that do not comply with the general model or behavior of the data (Han & Kamber, 2008). If outliers are present in a dataset, they are discarded before the data is processed by other data mining functionalities.

Figure 1.5: Outlier Analysis
Image Source: [Weblink3]

Generally, outliers represent noise or exceptions. Figure 1.5 shows outlier analysis, where R represents the data that is an outlier. Deviation detection is used to detect major changes in data from earlier normalized or calculated values (Fayyad, et al., 1996).
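As a rough illustration of outlier analysis, the sketch below flags values that lie far from the mean of a sample. The z-score rule, the threshold of 2 standard deviations and the readings are assumptions chosen for this example; the thesis does not prescribe a particular detection method.

```python
from statistics import mean, stdev


def find_outliers(values, threshold=2.0):
    """Return the values lying more than `threshold` sample standard
    deviations away from the mean (a simple z-score rule)."""
    mu = mean(values)
    sigma = stdev(values)
    if sigma == 0:
        return []  # all values are identical, so nothing deviates
    return [v for v in values if abs(v - mu) / sigma > threshold]


# Hypothetical readings: seven values near 10 and one exceptional value.
readings = [10, 11, 9, 10, 12, 10, 11, 50]
print(find_outliers(readings))  # [50]
```

The flagged values could then be discarded, or examined as exceptions, before other data mining functionalities are applied.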

1.5.5 Frequent Patterns Mining

Patterns that appear frequently in the data are known as frequent patterns (Agrawal & Srikant, 1994). Itemsets, sequences and subsequences can all be considered patterns. A frequent pattern, or large itemset, is an itemset that meets the minimum support requirement. The support of an itemset is the number (or fraction) of transactions in which it occurs. For example, if items A, B and C occur together in eight transactions out of ten, the itemset {A, B, C} has support 80%. Finding such frequent patterns plays an essential role in mining associations and many other interesting relationships among data. Frequent pattern mining is therefore an important data mining task, with numerous practical applications, and has received a lot of attention in data mining research. The discovery of frequent patterns helps in many business decision-making processes, such as catalog design, cross marketing and the analysis of customer shopping behavior. Frequent itemsets are used to generate association rules. Data mining functionalities cover a wide range of applications and allow the discovery of different kinds of knowledge at different levels of abstraction. Accordingly, data mining works effectively when the appropriate functionality is applied to the data at hand. This research work emphasizes the mining of frequent patterns, which can be obtained using association rules, explained in detail in the section on association rules.

Association Rules

Association rule mining is a method for finding interesting relations among items in databases (Agrawal & Srikant, 1994). Using various measures, it identifies strong rules in the databases. An association rule comprises two parts, an antecedent and a consequent. The antecedent is an item found in the database, and the consequent is an item found in combination with the antecedent.
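The support figure from the {A, B, C} example can be reproduced with a few lines of code. This is a minimal sketch: the function names and the two extra transactions are hypothetical, and support is expressed as a fraction of all transactions.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    wanted = set(itemset)
    hits = sum(1 for t in transactions if wanted <= t)
    return hits / len(transactions)


def is_frequent(itemset, transactions, min_support):
    """An itemset is frequent if it meets the minimum support requirement."""
    return support(itemset, transactions) >= min_support


# Ten illustrative transactions: A, B and C occur together in eight of them.
transactions = [{"A", "B", "C"}] * 8 + [{"A", "D"}, {"B"}]

print(support({"A", "B", "C"}, transactions))    # 0.8, i.e. 80%
print(is_frequent({"D"}, transactions, 0.5))     # False
```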
Association rules are created by inspecting the data for frequent patterns and then using the measures of support and confidence to identify the most important relationships. Support is the number of times an item occurs in the database. Confidence depends on support: it is the proportion of the transactions containing the antecedent that also contain the consequent. Take as an example the rule "80% of all the records that contain item A also contain item B". Here A plays the role of antecedent and B that of consequent. Support is counted over the records containing A and B, B's occurrence depends upon A, and the confidence of the rule is 80%. On the basis of level, association rules can be classified in two categories.

1) Single level association rules
2) Multiple level association rules

Single Level Association Rules

Single level association rules provide only loosely detailed information: they yield general rules without the more precise ones. For example, it is good to find that 80% of customers who buy milk also buy bread, but it is better to find that, among these customers who buy bread, 75% buy only wheat bread. This type of information hidden in or between levels of abstraction can be provided by multiple level association rules.

Multiple Level Association Rules

Many associations or correlations of interest occur among hierarchies of items. Depending on the nature of the domain, items can be organized into different hierarchies. For instance, beverages in a supermarket, goods in a department store or objects in a sports shop can be arranged into classes and subclasses. From these, hierarchies can be constructed, which play a key role in providing associations among items and are essential for mining multiple level association rules. Discovering these rules requires two passes: first, frequent patterns are generated at each level of the concept hierarchy on the basis of a minimum support threshold; then association rules are generated from these frequent patterns. In multiple level association rule mining, the items in an itemset are characterized using a concept hierarchy, and mining occurs at multiple levels of the hierarchy. At the lowest levels, it may be that no rules satisfy the constraints. At the highest levels, rules can be extremely general.
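The milk and bread example can be sketched with a small concept hierarchy. Everything below is illustrative: the hierarchy, the ten baskets and the helper names are assumptions chosen so that the figures match the 80% and 75% quoted above, and confidence is taken in its usual sense as the share of transactions containing the antecedent that also contain the consequent.

```python
# Hypothetical concept hierarchy: leaf item -> higher-level category.
PARENT = {
    "whole milk": "milk",
    "skim milk": "milk",
    "wheat bread": "bread",
    "white bread": "bread",
}


def generalize(transaction):
    """Replace every item by its category (items without a parent are kept)."""
    return {PARENT.get(item, item) for item in transaction}


def confidence(antecedent, consequent, transactions):
    """Share of the transactions containing `antecedent` that also contain `consequent`."""
    antecedent, consequent = set(antecedent), set(consequent)
    with_antecedent = [t for t in transactions if antecedent <= t]
    if not with_antecedent:
        return 0.0
    both = sum(1 for t in with_antecedent if consequent <= t)
    return both / len(with_antecedent)


# Ten illustrative baskets recorded at the lowest level of the hierarchy.
baskets = (
    [{"whole milk", "wheat bread"}] * 6
    + [{"skim milk", "white bread"}] * 2
    + [{"whole milk"}] * 2
)

# Higher level: every item is replaced by its category.
level1 = [generalize(t) for t in baskets]
# Mixed view: leaf items plus their categories, so a rule can span both levels.
mixed = [t | generalize(t) for t in baskets]

high = confidence({"milk"}, {"bread"}, level1)              # milk -> bread
low = confidence({"milk", "bread"}, {"wheat bread"}, mixed)  # milk, bread -> wheat bread
print(high, low)  # 0.8 0.75
```

The general rule is found at the category level, while the more precise rule only becomes visible once the lower level of the hierarchy is examined.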
Generally, a top-down approach is used in which the support threshold is either the same at every level or varies from level to level (support is reduced going from higher to lower levels) (Han & Fu, 1995).

Association Rule Mining Applications

Initially, association rule mining was intended to address the decision support problem faced by the majority of retail organizations (Agrawal, et al., 1993). With the development of bar-code technology, it became possible for retail organizations to collect and store large amounts of sales data, referred to as basket data. In this type of data, a record typically consists of the date of the transaction and the items purchased in it. Successful organizations viewed such databases as an important part of their marketing infrastructure. They were interested in

establishing information-driven marketing processes managed by database technology, which help marketers to develop and use customized marketing strategies and programs. Nowadays association rules have a variety of applications in multiple domains. Market basket analysis, the original motivation for mining association rules, is described first, followed by other applications.

Market Basket Analysis: By mining transactional data, a retail store can discover associations among the sales of items. This information can be useful in various ways. For example, rules with Maintenance Agreement as the consequent might help to increase Maintenance Agreement sales. Rules involving Home Appliances might indicate other related products. A related application is loss-leader analysis. Stores often sell some products at a loss during a promotion, in the hope that customers will buy other items along with the loss-leader. However, many customers may cherry-pick the item on sale. By mining associations over the time interval of the promotion as well as before it, and watching the changes in support and confidence of rules involving the promotional product, the store can determine whether cherry-picking took place (Rajak & Gupta, 2008).

Item Placement: Deciding where to place items in a store requires knowledge of which items are sold together. A closely related application is catalog layout: mail-order companies can use association rule mining to help determine which items should be placed on the same page of a catalog.

Attached Mailing: Instead of sending an identical catalog to everyone, direct marketing retailers can use associations and sequential patterns to tailor the catalog to the items a customer has bought.
Moreover, such tailored catalogs may be much smaller and mailed less frequently, which helps to reduce mailing costs.

Fraud Detection: Insurance companies are interested in finding groups of medical service providers, such as doctors or clinics, who ping-pong patients among each other for unnecessary tests. Using paid medical claims data, each patient can be mapped to a transaction, and every doctor or clinic visited by the patient to an item in that transaction. The items in an association rule then correspond to a set of providers, and the support of the rule corresponds to the number of patients these providers have in common. The insurance company can then investigate the claim records for sets of providers that have a large number of common patients to decide whether any fraudulent activity actually occurred. Another practical application is

detecting the use of wrong medical settlement codes. For example, insurance companies are interested in detecting unbundling, where a set of settlement codes for the components of a medical procedure is used to claim payment, rather than the settlement code for the overall procedure. (The motivation is that the sum of the payments for the component codes may be greater than the normal settlement for the procedure.) Associations among medical settlement codes can also be used to find sets of payment codes that are frequently used together.

Medical Research: A data sequence may correspond to the symptoms or diseases of a patient, with each transaction corresponding to the symptoms displayed or diseases diagnosed during one visit to the doctor. The patterns discovered in such data could be used in disease research to help identify symptoms or diseases that precede certain diseases (Serban, et al., 2006).

1.7 Outline of the Thesis

The organization of the thesis is as follows:

Chapter-1 This chapter gives a basic introduction to data mining, its different models and its various functionalities. It also explains single level and multiple level association rules and their applications in multiple domains.

Chapter-2 This chapter covers the literature review related to association rule mining algorithms. It is organized into sub-sections covering the literature on different aspects of multiple level association rule mining. As the main focus of our research is mining multiple level association rules, it is first necessary to look at what association rule mining is. The chapter therefore begins with some basic concepts that are helpful, directly or indirectly, in carrying out the research work. After that, single level association rule mining approaches are explained.
Subsequently, the chapter presents an overview of the pertinent literature and research on multiple level association rule mining methods. The last part comprises a study of miscellaneous research papers used in carrying out the research work.

Chapter-3 This chapter outlines the definition of the problem, on the basis of which the research objectives were articulated. It highlights the objectives of the research and outlines the significance of the study. A research methodology to address the identified objectives is also given.

Chapter-4 This chapter gives a comprehensive survey and study of some problems with various existing methods. These existing methods have some issues and challenges in this

field. The discussion of the shortcomings of evolutionary algorithms leads to some improvements. This chapter also introduces concept hierarchies and their types, and investigates the need for concept hierarchies in multiple level association rule mining and in other data warehousing and data mining applications. A case study of an efficient encoding scheme for concept hierarchies is described. Finally, the chapter provides a summary of concept hierarchies.

Chapter-5 The traditional algorithms for mining association rules at multiple levels of abstraction are explained in this chapter. Two well-established algorithms, MLT2_L1 and Level Wise Filtered Table (LWFT), are presented for finding multiple level frequent itemsets. The main focus of the chapter is the basic working of the MLT2_L1 and LWFT algorithms. Finally, it critically examines the weaknesses of the two algorithms.

Chapter-6 In this chapter, the new algorithms TransTrie and MLTransTrie are proposed for the discovery of association rules at different levels of abstraction. The MLTransTrie algorithm employs the TransTrie algorithm at each level to generate frequent patterns. The working of the proposed algorithm is demonstrated with an example database, followed by a summary of the chapter.

Chapter-7 In this chapter, the results of the proposed MLTransTrie algorithm are given and discussed. To study the performance of the algorithm, different support thresholds were used. In this experimental research, the process model starts with the selection of the datasets. The datasets used in this study were taken from the UCI Repository of Machine Learning databases, available online. Four datasets of various sizes and with different numbers of attributes are used. These real-world datasets are Breast-cancer, Credit-g, Mushroom and Soybean.
To demonstrate the competence of the proposed algorithm, a comparative analysis is performed against well-known evolutionary algorithms. Finally, the research work carried out in the previous chapters is wrapped up and inferences are drawn.

Chapter-8 This chapter presents the detailed conclusions of the research work carried out during the doctoral work and discusses future research. This research work has expanded the scope of the study of mining association rules from a single level to multiple concept levels. The chapter is followed by the bibliography/references and then the appendix.

After carrying out this research work, the researcher believes that this effort will certainly be a great contribution to the research community, academicians, society, the corporate sector and decision analysts.
