DATA MINING TEAM #1. Kristen Durst Mark Gillespie Banan Mandura. MBA 664: Database Management


OUTLINE

INTRODUCTION
DATA MINING DEFINITION AND EXAMPLES
DATA MINING PRODUCTS
DATA MINING PROCESS
DATA MINING TECHNIQUES
DATA MINING EXAMPLE
CONCLUSION
REFERENCES
APPENDIX: FIGURES

INTRODUCTION

The purpose of this paper is to provide a brief overview of data mining and how data mining complements database technology. First, a definition of data mining is provided and some example applications are discussed. Next, a few of the better-known data mining companies are presented along with the software and services they provide. Following the review of data mining products, an approach to the data mining process is discussed along with an overview of a few of the more prominent data mining analysis techniques. Finally, a data mining example is presented that illustrates the data mining process by means of a data collection and statistical approach to a real-world problem. The intent is to give the reader a better feel for the data mining process and how it may be applied in actual applications.

DATA MINING DEFINITION AND APPLICATIONS

Data mining is an analysis process applied to large amounts of data with the intent of identifying hidden, previously unknown patterns and relationships within the data, thereby enabling the user to draw conclusions and predict future outcomes. Practitioners of data mining are less concerned with determining what has happened than with predicting what will happen in the future. Interest in and application of data mining have grown over the last several years as advances in computer processing and digital data storage have greatly increased the speed with which data can be accessed and processed, while simultaneously reducing the cost and infrastructure required to store the data and the results.

As will be discussed later, data mining does require a process, but in practice the process is not uniform from user to user. However, it will generally include the following three high-level steps:

(a) Description of the data to summarize attributes of the available data
(b) Predictive modeling derived from a portion of the existing data
(c) Verification of the model against the larger domain of data in the real world

Despite the wide interest in and buzzword status of data mining, a user who wishes to implement data mining must recognize what data mining is not and what data mining cannot do.

Data mining is not simply the blind application of a series of algorithms to large sets of data. The data mining analyst must still understand the data and its origins, the business in which the data originated and is used, the analytical methods applied to the data, and the results of that analysis. Furthermore, data mining does not indicate what you must do with the data and the results. Only a knowledgeable user of the data will be able to assess the value of the patterns and relationships gleaned from the data mining approach and apply them to make a positive impact on their business.

Data mining can be implemented in any business to aid the analysis and resolution of multiple problems; however, its use has been most widely noted in the telecommunications, credit card, financial, and retail industries, among others. For instance, the telecommunications industry has studied data to determine which customers are most likely to turn over, or churn, on their cell phone contracts; the credit card industry is able to detect and track fraudulent use of its services; financial companies are able to predict corporate stock performance; and retailers are able to tailor which products to stock and offer to particular customers.

Unfortunately, the benefits of data mining do not come without a cost, and practitioners must recognize the potential legal and ethical concerns resulting from the widespread application of data mining tools. In particular, the ability to track and identify individual consumer behavior by aggregating data from multiple sources, when the original data was in fact anonymous, is of concern and has resulted in the adoption of data control policies within many corporations.

DATA MINING PRODUCTS

A wide range of data mining software and service providers exist in the marketplace today, and they serve a wide range of customers. According to a 2008 study by the Gartner Group, an information technology research and advisory firm, five of the largest data mining software companies are the following:

ANGOSS SOFTWARE COMPANY
Angoss offers a suite of software tools to perform predictive diagnostics. These tools cover all phases of the data mining process, including profiling, exploration, modeling, implementation, scoring, and validation. Key software tools include KnowledgeSEEKER for profiling and visualization; KnowledgeSTUDIO, a decision tree based tool for predictive analytics; and StrategyBUILDER, a tool combining analysis results into business rules.

INFOR GLOBAL SOLUTIONS
Infor Global Solutions is the world's third largest software company and has acquired a wide range of software applications that include Infor CRM Epiphany, an integrated software tool that performs marketing, sales, and service analytics.

PORTRAIT SOFTWARE
Portrait Software provides a suite of marketing analysis tools to support marketing, service, and selling activities. Portrait Software offers products that perform marketing automation as well as predictive analytics. Quadstone Analytics is one of its predictive modeling tools, and it employs various techniques including decision trees, regression, additive scorecards, clustering, and uplift modeling.

SAS INSTITUTE
SAS is a leader in the data mining community and provides tools and solutions to a broad range of customers. SAS Enterprise Miner and SAS Analytics offer customers access to a multitude of methods and techniques to perform statistical analysis, data visualization, forecasting, and model management and deployment. (SAS was originally an acronym for Statistical Analysis System.)

SPSS INC
SPSS Inc. provides a range of products in four families allowing customers to perform Data Collection, Modeling, Statistical Analysis, and Deployment. These tools can be integrated with Clementine, a data mining workbench that uses a wide range of data mining techniques. (The name SPSS is derived from Statistical Package for the Social Sciences.)

DATA MINING PROCESS

A formal, uniformly accepted methodology for the process of data mining does not truly exist. However, a 2002 survey by KDnuggets.com, a leading web-based data mining resource, indicated that 51% of the 189 respondents follow CRISP-DM (CRoss Industry Standard Process for Data Mining), a methodology developed and advocated by SPSS. Another 12% of respondents reported that they apply the tools described by SAS's SEMMA approach. The remaining 38% of those taking the survey indicated that they follow their own methodology, the methodology devised by their employer, or nothing at all.

Despite the apparent lack of a uniform process, all approaches to data mining will likely incorporate activities to accomplish the tasks of (1) problem definition, (2) data collection, (3) data review, (4) data conditioning, (5) model building, (6) model evaluation, and (7) documentation and deployment. Because SPSS and SAS are known leaders in the data mining community, SPSS's CRISP-DM method and SAS's SEMMA approach are discussed in more detail below. Although these approaches do not explicitly call out the seven activities just described, those activities are embedded within both approaches, and they will likely be incorporated into any successful data mining approach.

CRISP-DM (CRoss Industry Standard Process for Data Mining)

CRISP-DM was conceived in 1996 by a consortium consisting of DaimlerChrysler, SPSS, and NCR. The intent was to develop a data mining approach that was not specific to any particular industry, application, or analysis tool. With funding from the European Commission, the consortium conducted a workshop and, upon finding general agreement on the need for a data mining template, CRISP-DM was born.

CRISP-DM is a hierarchical process model that consists of a set of tasks with various degrees of definition. The top level of the hierarchy is the Phase. Each Phase consists of generic tasks, the second level of the hierarchy. The tasks are generic in order to maintain the neutrality of the process, and they are intended to be complete (applicable to the entire process) as well as stable (tolerant of new and unplanned developments). Specialized tasks form the third level, and these are designed for the unique, particular nature of the problems to be solved. Finally, records of actions, decisions, and results form the fourth and final level of the CRISP-DM hierarchy. The data mining context will determine the mapping from the generic levels (levels 1 and 2) to the more specific levels (levels 3 and 4).

Moreover, CRISP-DM is described by a six-phase reference model that flows in a particular sequence but does not require the user to follow the phases in a fixed path. Users will likely find a need to move back and forth iteratively between phases as individual phase results come into focus, and the CRISP-DM methodology accommodates that requirement. Finally, CRISP-DM is designed to be cyclical in nature, with an understanding that the data mining activity may not end once a solution is derived. New questions and problems are likely to be identified from the solution, and these may demand a continuous flow of follow-on activity. The six-phase CRISP-DM cyclical model is briefly described below.

Phase 1 Business Understanding: The purpose of Business Understanding is to assess the objectives and requirements of the business and articulate these needs as a specific problem or problems the business wishes to solve.

Phase 2 Data Understanding: Data Understanding consists of preliminary data collection along with the assessment of any insights into the data and any data quality issues. Potential data segregation may occur and preliminary hypotheses may be formed.

Phase 3 Data Preparation: In Data Preparation, data quality issues are resolved and the final data set for analysis is generated. Any required data transforms are completed, as is any necessary data cleansing. Multiple iterations may be required.

Phase 4 Modeling: The methodology is neutral with respect to the various data modeling approaches. Multiple modeling choices may be reviewed and tailored to the specific problem and available data. If the desired modeling technique requires specific data conditions, a return to Phase 3 may be required. Multiple techniques may be applied.

Phase 5 Evaluation: In Phase 5, the completed model is validated for sufficient quality. If quality in the model is lacking or it fails to meet the needs of the business, a review and return to Phase 4 may be necessary.

Phase 6 Deployment: In Deployment, the data and model are organized and presented to the customer for use. Data visualization is critical, as is documentation of all detailed process steps and their results.

SEMMA (Sample, Explore, Modify, Model, Assess)

SAS proposes that SEMMA is not so much a data mining methodology as it is a set of tools, deployed within its SAS Enterprise Miner software, that can be integrated into any data mining method. Under SEMMA it is the user's responsibility to define the business problem to be solved and to acquire and condition the data appropriately. The SEMMA focus is on model development. A brief description of the five elements of SEMMA follows:

SAMPLE: The Sample activity consists of extracting a statistically significant data set from the larger data domain. The sample must adequately represent the larger data set but be small enough for ease of manipulation. The data may be partitioned to facilitate model training, validation, and test.

EXPLORE: In the Explore activity, different views and plots of the data are generated and trends or unusual data instances are discovered. Additionally, traditional statistical analysis tools or data mining techniques may be employed to identify any data subgroups.

MODIFY: Modification consists of the creation, selection, and transformation of data in preparation for the modeling activity. New variables or groups may be defined, and any outliers (data points resulting from special cause variation) may be eliminated. The data set is updated accordingly.

MODEL: Modeling allows the user to fit the data using a wide variety of modeling techniques and predict outcomes as derived from the overarching business need. Techniques that may be applied include neural nets, decision trees, logistic regression, and k-nearest neighbor, to name a few.

ASSESS: Finally, in assessment, the model is evaluated for usefulness in solving the articulated business problem and validated against the subset of data. As in all data mining approaches, the model is checked for overfitting to ensure it is not tuned so tightly to the model development subset that it cannot adequately predict outcomes from other data sets.

From the brief assessment of CRISP-DM and SEMMA above, it is clear that there is commonality of activity in any data mining approach even if the terminology and articulation of methods differ. As indicated in the introductory paragraph, any good data mining approach will include the tasks of (1) problem definition, (2) data collection, (3) data review, (4) data conditioning, (5) model building, (6) model evaluation, and (7) documentation and deployment.

DATA MINING TECHNIQUES

STATISTICAL METHODS

The data mining techniques that most people are familiar with are statistical methods such as sample statistics and linear regression. These are usually used for very simple problems that have very few predictive variables. If the problem is more complex, another method is more appropriate.

Sample statistics involve looking at particular variables and calculating the minimum value, maximum value, mean, median, and variance. For example, a retail store could analyze its sales data and compute summary statistics for particular products over the previous quarter. The store can quickly draw conclusions about particular product lines, and if it finds something unexpected or interesting it can mine further using more complex methods.

Linear regression is an easy way to predict values based on a simple equation. When many interactions among variables are needed to find the correct model, linear regression is not appropriate. However, for simple situations it can be a very powerful method.
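As a minimal sketch of the idea, the following fits a straight line to a handful of points by ordinary least squares and then uses the fitted equation to predict a new value. The data points and the function names `fit_line` and `predict` are invented for illustration; they are assumptions and not part of any tool discussed in this paper.

```python
# Ordinary least-squares fit of y = intercept + slope * x.
# The data below are hypothetical and exist only to exercise the code.

def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope is the covariance of x and y divided by the variance of x.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def predict(intercept, slope, x):
    """Evaluate the fitted line at a new predictor value."""
    return intercept + slope * x

xs = [1.0, 2.0, 3.0, 4.0, 5.0]    # hypothetical predictor values
ys = [3.0, 5.0, 7.0, 9.0, 11.0]   # hypothetical responses (exactly 1 + 2x)
intercept, slope = fit_line(xs, ys)
prediction = predict(intercept, slope, 6.0)
```

With real data the points will not fall exactly on a line, and the quality of the fit should be checked before the equation is used for prediction.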
For example, Company ABC has data on its customers' income levels and its sales. As shown in Figure 4, as the customer's income increases, their total purchase amount increases. A line is fit through the data to minimize the error between the data points and the line. The line then becomes an equation:

Total Purchase Amount = y-intercept + (Slope * Customer's Income)

ABC can predict the sales amount from a customer's income level by putting the income into the equation. Also, since ABC knows the relationship between income and sales, it can restrict its marketing to certain income levels.

NEAREST NEIGHBOR FOR PREDICTION

Nearest neighbor for prediction is a very easy data mining technique to understand. The concept comes from the idea that you can predict an outcome, or how something is going to behave, based on how other cases with similar values of the predictive variables behave. An everyday example of how this is used is in real estate. When someone buys a house, the realtor will check what other houses in the area sold for, because this is a good predictor of what the house for sale should be worth. This technique works best when there are only a few predictive variables.

A simple example of how a business might use this technique is a company with a product it wants to start selling in a new city. Company XYZ wants to estimate how many units will sell so it can determine whether it is worth moving into the new market. XYZ has a database of the current sales data for each city where the product is already being sold. The predictive variables are the population of the city and the distance from the city to where the competitor's product is sold. As shown in Figure 5, each city is represented by a letter, which corresponds to one of three categories of units sold: more than 200 units, 100 to 200 units, and fewer than 100 units. These markers are placed in the graph by the population of the city and the distance from the competitor.
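The setup just described can be sketched as a small k-nearest-neighbor vote. Everything concrete here is an assumption made for illustration: the city values are invented, the labels "A", "B", and "C" stand for the three sales categories from highest to lowest, both variables are rescaled to a 0 to 1 range so that population does not dominate distance, and the share of agreeing neighbors is returned as a rough stand-in for confidence.

```python
import math
from collections import Counter

def knn_predict(cities, new_city, k=3):
    """Predict a sales category for new_city from its k nearest known cities.

    cities: list of ((population_thousands, competitor_distance_miles), category)
    Returns (predicted_category, share_of_neighbors_that_agree).
    """
    n_features = len(new_city)
    lows = [min(features[i] for features, _ in cities) for i in range(n_features)]
    highs = [max(features[i] for features, _ in cities) for i in range(n_features)]

    def scale(features):
        # Min-max rescale each variable to [0, 1] using the known cities.
        return [(features[i] - lows[i]) / (highs[i] - lows[i]) for i in range(n_features)]

    target = scale(new_city)
    # Sort known cities by Euclidean distance to the new city and keep the closest k.
    nearest = sorted(cities, key=lambda city: math.dist(scale(city[0]), target))[:k]
    votes = Counter(label for _, label in nearest)
    label, count = votes.most_common(1)[0]
    return label, count / k

# Hypothetical cities, invented for this sketch.
known_cities = [
    ((500, 40), "A"), ((450, 35), "A"), ((480, 45), "A"),
    ((200, 20), "B"), ((220, 15), "B"),
    ((80, 5), "C"), ((60, 10), "C"),
]
category, agreement = knn_predict(known_cities, (470, 38), k=3)
```

The choice of k and of the distance measure are design decisions; a different choice can change the predicted category when the new city sits between groups.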
There is a U marker that represents the new city, where the amount of product sold is unknown and is what we want to predict. Using the nearest neighbor for prediction method, the U marker is nearest to more cities falling into the A sales category than into any other sales category. This means we can predict that this city will behave in the same way and will have sales greater than 200 units. XYZ should plan on extending its product line to this market given the high sales prediction.

With the nearest neighbor for prediction method it is also possible to estimate how confident the company can be in its prediction. If the prediction variables are extremely close to those of their neighbors, there is a higher level of confidence. If no neighbors are close, a prediction can still be made, but with very little confidence. This is extremely valuable because a company would not want to follow through with a major investment based on a prediction that has a low level of confidence.

NEURAL NETWORK

A neural network is a much more complex data mining technique. Some benefits of this technique are that it can use extremely large numbers of predictive variables; that once the network has been created and confirmed as successful it can be used again and again; and that it can be used in many different types of situations. The disadvantages are that the outcomes are not very easy to interpret and that it can be very time consuming to get the data into the right format for running the model.

A neural network is a complex computer model that takes input variables and outputs a solution. Neural networks include an input layer, a hidden layer, and an output layer. The input layer consists of the predictive variables that go into the model. The hidden layer is created by the computer model and is not seen by the user. The output layer is the end prediction that has been calculated by the model. All of the variables that go into the neural network have to be converted into numeric variables with values between 0 and 1.

Company XYZ could use a simple neural network to predict the same scenario that was used for the nearest neighbor for prediction method. The database that has the population of the city, the distance from the city to where the competitor's product is sold, and the product sales would be used to create the model. The computer would use the population of the city and the distance from the competition for the input layer and then go through a testing phase.
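The layer structure described above can be sketched as a single forward pass through a very small network. This is an illustration under stated assumptions only: the weights and biases are arbitrary placeholders rather than fitted values, the sigmoid activation is one common choice that keeps every output between 0 and 1, and the ranges used to rescale the two inputs are invented.

```python
import math

def min_max_scale(value, low, high):
    """Rescale a raw value to the 0-1 range expected by the network."""
    return (value - low) / (high - low)

def sigmoid(z):
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def forward(inputs, hidden_weights, hidden_biases, output_weights, output_bias):
    """Input layer -> one hidden layer -> single output node."""
    hidden = [
        sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)
        for weights, bias in zip(hidden_weights, hidden_biases)
    ]
    return sigmoid(sum(w * h for w, h in zip(output_weights, hidden)) + output_bias)

# Hypothetical inputs: city population (thousands) and competitor distance (miles),
# rescaled with assumed minimum and maximum values.
inputs = [min_max_scale(470, 60, 500), min_max_scale(38, 5, 45)]

# Placeholder weights; in practice these are adjusted while fitting the model.
hidden_weights = [[0.8, -0.4], [-0.3, 0.9]]
hidden_biases = [0.1, -0.2]
output_weights = [1.2, -0.7]
output_bias = 0.05

score = forward(inputs, hidden_weights, hidden_biases, output_weights, output_bias)
```

Because the weights here are not fitted to any data, the resulting score carries no meaning beyond showing that the output falls between 0 and 1.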
During the testing phase, more commonly called training, the computer assigns various weights to each of the variables and then outputs a number that represents the predicted product sales. This number will be between 0 and 1 and needs to be interpreted in terms of the actual ranges provided in the database. For example, outputs below a lower cutoff correspond to product sales of less than 100 units, outputs between the lower and upper cutoffs correspond to sales between 100 and 200 units, and outputs above the upper cutoff correspond to sales greater than 200 units. The computer keeps testing against the actual data and adjusting the weights as needed to create the best model for computing the predicted product sales. As shown in Figure 6, once the model has finished this phase, the new city can be entered into the model and the predicted product sales can be computed. This model has an output of 0.736, so the product sales are predicted to be greater than 200 units.

CLUSTERING/SEGMENTING

Clustering, or segmenting, is a data mining technique in which nothing specific is predicted. Instead, the technique forms groups whose members are similar to each other and very different from the members of other groups. This can give a good overall view of the data and of what is going on in the business. For example, if Company MNO has a database of demographic information on its customers and their buying habits, segmenting can be used to find buying patterns based on that demographic data. As shown in Figure 7, male consumers under the age of forty behave in the same way, which is drastically different from the behavior of female consumers over the age of forty. In this particular example the data can be grouped by gender, by age, and by both gender and age. These groupings can then be used for different marketing campaigns, with marketing techniques tailored to each group. Not all the variables in the database will be used for clustering or segmentation, and some will need to be removed by the user if they do not make meaningful sense.

Clustering can also be used to identify potential problems by finding outliers. For example, through clustering Company MNO determined that it has a much higher sales volume for snowboards in stores that are within fifty miles of a ski resort. However, it found one store with a low volume of sales even though the store is only twenty-four miles from a ski resort. With some more research, Company MNO realized that sales were down in that store because the area had become saturated with competitors. With this new information the company decided to pull its store out of the area because it cannot compete with the larger stores.
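The grouping idea can be sketched with k-means, one common clustering algorithm; the text above does not name a specific algorithm, so this choice is an assumption. The customer records (age and monthly spend) and the starting centroids are invented for illustration.

```python
import math

def k_means(points, centroids, iterations=10):
    """Group points around centroids by repeated assign-and-update steps.

    Returns (final_centroids, clusters), where clusters[i] holds the points
    assigned to centroid i during the last assignment step.
    """
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for point in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(point, centroids[i]))
            clusters[nearest].append(point)
        # Update step: move each centroid to the mean of its assigned points.
        # An empty cluster keeps its previous centroid.
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
    return centroids, clusters

# Hypothetical customers as (age, monthly spend), invented for this sketch.
customers = [(25, 80), (30, 90), (28, 85), (55, 20), (60, 25), (58, 22)]
start = [(25, 80), (60, 25)]   # assumed starting centroids
centroids, clusters = k_means(customers, start)
```

In practice the number of groups and the starting centroids affect the result, and variables measured on very different scales are usually rescaled first.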
DECISION TREE

A decision tree is a predictive model that groups classification variables together into a tree. Each branch represents a classification group that has been divided. The decision tree splits the data into groups by examining all of the data and first picking the variable that produces the greatest split between categories. Each category can then continue to be split at each level until there are no more logical splits to be made. Decision trees are designed to handle categorical data, but numeric data can be converted into categories for use in the tree. The advantages of decision trees are that they can be very easy to interpret, that little is involved in getting the data ready to process, and that they can be used in a variety of situations. One disadvantage is that for simpler problems this method is sometimes more time consuming than linear regression.

Recently a decision tree was used at a consumer products company to help determine why a consumer study we had done did not produce the results that were predicted. The conclusions and numbers are the same, but the product and categories have been changed to protect confidentiality. A study was designed to test how well consumers like a change in the design of a chair cushion. Each consumer ranked how comfortable they thought the chair was on a scale of zero to five in 0.5 step increments, with zero being extremely uncomfortable and five being extremely comfortable. Then a second chair was presented to the consumer and they scored this chair as well. It was predicted that the second chair would score higher, and the average increase in score from chair one to chair two was calculated across all the consumers. As shown in Figure 8, the change was minimal, and a decision tree was constructed to provide insight into the reason why. The largest split in what influenced the average change was the baseline score, that is, the score for chair one. If the score for chair one was below 3.75, the average score increased from chair one to chair two; if the score for chair one was greater than 3.75, the average score decreased. The company continued to split the tree to see what was influencing the scores further down, but the true value of the tree was in the first split.
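How a tree might locate such a first split can be sketched as follows. This is an illustration under assumptions: it searches a single numeric variable for the threshold that minimizes the squared error within the two resulting groups, which is one common splitting criterion and not necessarily the one used in the study. The rows are invented and were deliberately constructed so that the best split falls at 3.75, the threshold reported above; the group averages are not the study's values.

```python
def squared_error(values):
    """Sum of squared deviations from the mean (0.0 for an empty group)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_split(rows):
    """Find the threshold on the first column that best separates the second.

    rows: list of (baseline_score, change_in_score).
    Returns (threshold, mean_change_below, mean_change_at_or_above).
    """
    baselines = sorted({baseline for baseline, _ in rows})
    best = None
    # Candidate thresholds are midpoints between adjacent distinct baselines.
    for low, high in zip(baselines, baselines[1:]):
        threshold = (low + high) / 2
        below = [change for baseline, change in rows if baseline < threshold]
        above = [change for baseline, change in rows if baseline >= threshold]
        cost = squared_error(below) + squared_error(above)
        if best is None or cost < best[0]:
            best = (cost, threshold, sum(below) / len(below), sum(above) / len(above))
    return best[1:]

# Hypothetical (baseline score, change in score) pairs, invented for this sketch.
rows = [(2.0, 1.0), (2.5, 0.5), (3.0, 1.0), (3.5, 0.5),
        (4.0, -0.5), (4.5, -0.5), (5.0, -1.0)]
threshold, mean_below, mean_above = best_split(rows)
```

A full tree would apply the same search to every candidate variable and then repeat it within each resulting group.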
The consumers who started with a high score did not have much room for improvement and actually averaged a decrease. The consumers who started with a low score did improve their score, as predicted. From this information it was determined that the study was designed incorrectly and that it should have recruited only consumers who were using more of the scale in their evaluation of the first chair.

DATA MINING EXAMPLE

A brief data mining example follows. Although this example did not specifically follow one of the more widely accepted data mining methodologies or use any of the sophisticated data mining modeling techniques, it does illustrate the principles of data mining in that large amounts of data were extracted from multiple databases and a meaningful model was generated to describe a business process and ultimately change behavior.

In this example, the business problem was assessing and improving the on time delivery of New Product Introduction (NPI) hardware. NPI hardware is defined as fabricated assemblies that are components of a larger machine assembly. The delivery problem stems from the fact that the very first set of NPI hardware, required for the assembly of the very first machine, is often late relative to the required due date. For the study at hand, the actual delivery of the first set of hardware was over 25 days late relative to the customer orders, with an interquartile range of 35 days.

To initiate analysis, a fishbone diagram was derived to assess potential causes of late delivery, and a data collection plan was generated. In the data collection plan, data was identified for extraction from two different databases: the engineering product definition database and the manufacturing database. The data extracted from these two databases was merged into a single data set for analysis. An initial review of the data, following evaluation of the process capability for on time delivery, indicated two distinct subgroups: assemblies described as brackets and assemblies described as not brackets. From this segregation, analysis proceeded on the distinct subgroups, and new data was generated to describe design and manufacturing subprocesses based on time-based event data extracted from the databases.

Preliminary plots of the data were generated, and initial models were attempted using linear regression and general linear model approaches. Unfortunately, no single regression was able to adequately describe the data.
Finally, an attempt was made to categorize some of the subprocess times by percentiles and plot the data against on time delivery as a main effects plot. This plot revealed that manufacturing activity was related to on time delivery but design activity was not. In fact, the main effects plot indicated that parts designed closer to the due date actually had better on time delivery than parts designed further from the due date. Thus it was clear that no single regression of design and manufacturing data could adequately describe the process.

From this new understanding of the data, an attempt was made to fit only the manufacturing data to on time delivery, and an acceptable regression was found. Similarly, an attempt was made to fit the design data to one of the manufacturing variables, the creation of the part identity in the manufacturing database, and again a suitable regression was found. These two regressions were combined through a common term to form a single regression equation. This equation, while not fitting the on time delivery data very well, as evidenced by a poor R-squared value, did produce an accurate representation of the on time delivery distribution. Thus, while the combined regression model could not predict how late a particular part would be, it could, based on the design release distribution, predict what percentage of parts would be late and when all the parts would be available for assembly of the first machine. Moreover, this model showed that the greatest contributor to the variation in on time delivery was the variation in design lead time.

By using the derived regression equation along with the known distributions for the input parameters and the desired on time delivery distribution, a Monte Carlo simulation was employed to calculate the cumulative distribution of design releases required to ensure on time delivery. This cumulative distribution was compared to previous design release plans, and those plans were shown to be faulty.

As a result of this study, design plans were updated to reflect the learning and the required design release schedules. A data visualization tool was developed to extract data from the design database and plot the design release requirements against the design release plan and the actual design releases. This tool made it possible to assess the ability to meet on time delivery before all designs were released and to take corrective action in advance of the hardware due date.
Furthermore, a data tool was developed to extract and format manufacturing data (Bill of Material, manufacturing status, quantity on hand, and manufacturing work order information) for coordination of design and manufacturing processes, along with rapid assessment of manufacturing status for all assemblies required to build a particular machine.

This example is not a direct implementation of one of the data mining methodologies described in this paper, nor does it illustrate any of the elaborate data modeling techniques that might be used on larger, more complex data sets. However, it does illustrate the basic principles of the data mining process (defining a business need, data collection, data review, data conditioning, model generation, model evaluation, and documentation and deployment) and the use of data and data modeling to change behavior and solve a business need.
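The Monte Carlo step in the example above can be sketched in a few lines. Everything numeric here is hypothetical: the example's actual regression coefficients and input distributions are not reproduced, so the sketch assumes a deliberately simple model in which lateness is the sum of a design release margin and a manufacturing time, each drawn from an assumed normal distribution.

```python
import random

def simulate_fraction_late(n_trials, seed=0):
    """Estimate the share of parts delivered after the due date.

    Assumed model (not the paper's fitted equation):
        lateness_days = design_release_margin + manufacturing_time
    where a negative margin means the design was released before the due date.
    """
    rng = random.Random(seed)   # fixed seed so repeated runs match
    late = 0
    for _ in range(n_trials):
        design_release_margin = rng.gauss(-60.0, 20.0)   # hypothetical, in days
        manufacturing_time = rng.gauss(45.0, 10.0)       # hypothetical, in days
        lateness_days = design_release_margin + manufacturing_time
        if lateness_days > 0:
            late += 1
    return late / n_trials

fraction_late = simulate_fraction_late(10_000, seed=1)
```

Shifting the assumed design release distribution earlier and rerunning the simulation shows how much earlier designs would need to be released to reach a target on time delivery rate, which mirrors the use of the simulation described above.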

CONCLUSION

This paper has provided an overview of data mining and has included a data mining definition, a review of where data mining may be applied, a summary of data mining products, an assessment of the data mining process, and a synopsis of some of the major data mining techniques. The paper concludes with an example project that attempts to illustrate the data mining process and how it may be used to solve a real-world problem.

REFERENCES

Angoss Software. Angoss Software Corporation, Inc. Last visited 7 APR 09.
Chapman, Pete et al. CRISP-DM. Retrieved from < Step-by- Step_Data_Mining_Guide.pdf> 5 APR 09.
Infor Solutions. Infor Global Solutions. Last visited 7 APR 09.
Poll: What Main Methodology Are You Using for Data Mining. JUL KDnuggets. Last visited 5 APR 09.
Portrait Software Solutions. Portrait Software plc. Last visited 7 APR 09.
SAS Enterprise Miner. SAS Institute, Inc. Last visited 5 APR 09.
SAS Products and Solutions. SAS Institute, Inc. Last visited 7 APR 09.
SPSS Software. SPSS, Inc. Last visited 7 APR 09.
Two Crows Corporation. Introduction to Data Mining and Knowledge Discovery, Third Edition. Retrieved 25 FEB 09.

Figure 1: KDnuggets.com, 2002 Survey; Data Mining Process
Figure 2: CRISP-DM Breakdown
Figure 3: CRISP-DM Phases and Flow
Figure 4: Linear Regression Technique
Figure 5: Nearest Neighbor Technique
Figure 6: Neural Net Technique
Figure 7: Clustering / Segmenting Technique
Figure 8: Decision Tree Technique

Figure 9: Example On Time Delivery Problem Statement (on time delivery probability: actual delivery with fit versus required delivery)
Figure 10: Example On Time Delivery Data Collection (brainstorm of variation sources; data collection plan)
Figure 11: Example Data Segmentation (TOTAL LEAD TIME by Part Type, p < .05; means and standard deviations appear with a leading "x" in the source)

Level      N     Mean    StDev
BRACKET    520   x6.76   x3.14
DUCT       138   x6.70   x0.40
MANIFOLD   44    x9.95   x4.68
TUBE       47    x3.60   x2.79

Figure 12.1: Example Model Building (main effects plot, data means for SHIP-DUE)
Figure 12.2: Example Model Building (process timeline from model creation through ship date; regression of SHIP-DUE on (MODEL_CR-DUE), (CR-ISS), (MAN_BOMC), (SCH_ST-MAN), and (MOS_MOFIN); [R^2A 4.4%] {R^2A(1) 76.5%, R^2A(2) 68.0%})
Figure 13: Example Model Evaluation (overlay chart of delivery probability: actual SHIP DUE versus regression-predicted SHIP DUE)
Figure 14: Example Model Evaluation (overlay chart of issue probability: issue required for on-time delivery versus actual issue)
Figure 15: Example Deployment (brackets summary: cumulative required, planned, and actual issues by date, with planning warnings)
