Transactions on Information and Communications Technologies vol WIT Press, ISSN
A comparison of methods for customer classification

Maria Celia S. Lopes, Myrian C. A. Costa & Nelson F. F. Ebecken
COPPE/Federal University of Rio de Janeiro, Caixa Postal , CEP , Rio de Janeiro - RJ - Brazil
Phone: (55 21) Fax: (55 21)

Abstract

This work presents a comparison of current methods used for classification problems. The solutions of two typical applications related to Customer Classification for Business Applications are considered: the first proposed in the Second International Competition of Data Analysis by Intelligent Techniques [6], and a higher-dimensionality case.

1 Introduction

Recently, databases with hundreds of fields and tables and millions of records have been treated with data mining tools. A very large number of records in the database, combined with a very large number of fields, generates a high-dimensionality problem. This creates difficulties by increasing the size of the search space for model induction and by generating spurious patterns related to irrelevant variables. This paper is organized as follows: section 2 presents a description of the employed methods; in section 3, problem 1 and its results are briefly described; section 4 deals with a more complex problem; finally, section 5 presents some conclusions.
2 Data Mining Methods

M I - DECISION TREE - Personal Computer Implementation [7]

This induction algorithm is binary, since it creates a two-way branch at every split in the tree. The attribute to split on at every stage of tree building is selected according to the information content of each attribute in terms of classifying the outcome groups: the most informative attribute is selected at every branching point. For discrete attributes, the value groups are split between the two branches so as to maximize the information content of the attribute. For numeric attributes, the two-way split is based on a numeric threshold derived to maximize the information content of the attribute. When the outcome is numeric, the standard deviations of the data filtering to both branches are used as the basis for selecting the best attribute and the best threshold.

One of the parameters that has to be specified before the induction process commences is the Minimum Examples in a Branch. This figure gives the induction algorithm a criterion for stopping the creation of new branches from any given point in the tree if the number of data samples filtering to that point falls below this limit. The limit provides a defence against noise in the data: in effect, it only allows branches to be developed from an acceptable number of records. Normally this figure is set depending on the total number of records and the level of noise in the data.

M II - DECISION TREE - Workstation Implementation [8]

This decision tree algorithm builds a classification model in the form of a binary tree that can be interpreted visually or by reading rules in if-then format. Classification starts at the root node and follows a path determined by the attribute tests until a leaf node is encountered. Each leaf node has a label assigned that represents the classification of the record.
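As an illustration only (not the vendors' implementations), the information-based selection of a numeric split described for M I can be sketched as follows, using Shannon entropy to score candidate thresholds:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_numeric_split(values, labels):
    """Find the threshold v maximizing information gain for a two-way
    split (A <= v vs. A > v), as in the binary decision-tree methods."""
    base = entropy(labels)
    best_gain, best_v = 0.0, None
    for v in sorted(set(values))[:-1]:  # each distinct value is a candidate threshold
        left = [l for x, l in zip(values, labels) if x <= v]
        right = [l for x, l in zip(values, labels) if x > v]
        remainder = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
        gain = base - remainder
        if gain > best_gain:
            best_gain, best_v = gain, v
    return best_v, best_gain

# Toy data: attribute values 1..3 belong to class 'a', 10..12 to class 'b'
v, gain = best_numeric_split([1, 2, 3, 10, 11, 12], ['a', 'a', 'a', 'b', 'b', 'b'])
```

A Minimum Examples in a Branch rule would simply refuse to call `best_numeric_split` on any node holding fewer records than the limit.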
The split used to create the binary tree employs a breadth-first tree-growing technique and depends on the type of the attribute considered. If the attribute is numeric, the splits are of the form A <= v, where A is the attribute and v is a numeric value for this attribute. If the attribute is discrete, the method considers splits of the type A in S', where S(A) is the set of possible values for attribute A and S' is a subset of S(A).

M III - NEURAL INDUCTION - Workstation Implementation [4]

The Neural Induction algorithm employs a back-propagation neural network, with a heuristic search for the best network architecture, to produce a trained network and a sensitivity analysis of the attributes as outputs. Back-propagation is a general-purpose supervised learning algorithm. The sensitivity analysis shows how the fields contribute to the classification. The resulting classification model can then be used to predict the classes of new attribute values, as well as to produce a ranked list of fields relevant to the classification. In this particular implementation the classification is based on the value of one class field. The complete algorithm has four parts:

a) Normalization, where the data are examined to determine how the values are translated into the format required by the input.
b) Selection of architectures, which chooses several network configurations with different numbers of hidden units based on the numbers of input and output units.
c) Training of the architectures, using the back-propagation algorithm.
d) Choice of the best network architecture and overall training, based on a score calculated from the desired accuracy, error limits and complexity of the networks.

M IV - MULTILAYERED PERCEPTRON/Exhaustive Network Search [3]

This method utilizes the Multi-Layer Perceptron (MLP) model and a constructive approach to build networks in conjunction with an adaptive gradient learning rule. The network construction algorithm is characterized by the addition of hidden units one or a few at a time.
Construction is stopped when performance on an independent test set shows no further improvement. The method follows the steps listed below to build a neural network model:

a) analyzing and converting the data into a form suitable for the network inputs;
b) attribute selection, which utilizes a genetic algorithm to search for good sets of input attributes; for each candidate set, a logistic regression or a neural network is trained and used to rank the subsets of inputs;
c) network construction and training, using a method of network construction with an adaptive gradient learning rule.

M V - MULTILAYERED PERCEPTRON/Genetic Algorithm [5]

The Multi-layered Perceptron (MLP) is a nonparametric architecture. Used with the backpropagation algorithm, it is capable of generating smooth nonlinear mappings between input and output variables. The multilayered perceptron is considered a type of neural network. Both of these terms come from the fact that this architecture was originally proposed as a model for biological neural processes. In this approach, however, that vantage point is ignored, and the MLP is viewed simply as a useful architecture for nonparametric modeling.

Figure 1 - An MLP network (input layer, hidden layer, output layer)
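A minimal sketch of the forward pass through an MLP like the one in Figure 1 (a toy illustration, not any of the packages used in this paper; the weights below are made up):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def mlp_forward(x, hidden_w, output_w):
    """Forward pass of a one-hidden-layer MLP for classification.
    Each weight vector's last element is a bias term; every node applies
    a sigmoidal function to a weighted sum of its inputs."""
    hidden = [sigmoid(sum(w * v for w, v in zip(ws, x + [1.0])))
              for ws in hidden_w]
    return [sigmoid(sum(w * v for w, v in zip(ws, hidden + [1.0])))
            for ws in output_w]

# Toy network: 2 inputs, 2 hidden nodes, 1 output node
out = mlp_forward([0.5, -1.0],
                  hidden_w=[[0.1, 0.2, 0.0], [-0.3, 0.4, 0.1]],
                  output_w=[[0.7, -0.5, 0.2]])
```

For estimation rather than classification, the output node would omit the sigmoid and emit the weighted sum directly; backpropagation tunes all the weight vectors to minimize the mean-squared error over the training set.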
As shown in Figure 1, an MLP can be viewed as an interconnected network made up of nodes that can be thought of as simple computational elements. The nodes are arranged into one or more layers. The first layer is the input layer, the final layer is the output layer, and the layers in between are called hidden layers. The output of a node in a hidden layer is used as an input to the nodes of the next layer. Each hidden node outputs the value obtained from applying a sigmoidal function to a weighted sum of its inputs. In classification, the output nodes also apply a sigmoidal function; in estimation, the output nodes simply output weighted sums of their inputs. A separate weight exists for each connection in the network (i.e., between each pair of nodes in adjoining layers). These are the weights used by nodes to weigh the inputs they are summing, and they constitute the free parameters to be tuned by the data. Backpropagation is one of many error-minimizing procedures that tune these weights to generate the desired mapping. The error function used is usually the mean-squared error (MSE) over a data set.

M VI - GENERALIZED REGRESSION NEURAL NETWORK [2]

The GRNN is a four-layer feed-forward neural network that accepts discrete and/or continuous-valued inputs and generates discrete or continuous-valued outputs. The GRNN also memorizes the training records by storing the input and output variables in the network itself. Once they are stored, as new records are presented the GRNN looks at the difference between the current record and all the stored records, performs what can be thought of as an interpolation, and generates an estimated output based on the history stored in the network. Unlike back propagation, which attempts to create a mathematical formula that generates outputs based on inputs, the GRNN performs an interpolated estimate within its previous experience.
The GRNN uses a recall factor called "sigma" to adjust the acuity (degree of discrimination) of the neural network's response. Sigma is inversely related to acuity: a low sigma generates a highly discriminating response, while a high sigma generates a more generalized response. GRNNs are sometimes a good substitute for Back Propagation, within the GRNN's constraints (covered below). They can be used for regression and time-series types of applications.
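The interpolation a GRNN performs amounts to a kernel-weighted average of the stored outputs. A minimal sketch (illustrative only, with a Gaussian kernel assumed; the data below are made up):

```python
import math

def grnn_predict(x, train_x, train_y, sigma=1.0):
    """GRNN-style estimate: a Gaussian-kernel weighted average of the
    stored training outputs. Sigma controls the acuity: a smaller sigma
    lets the nearest stored records dominate the response."""
    weights = [math.exp(-sum((a - b) ** 2 for a, b in zip(x, xi)) / (2 * sigma ** 2))
               for xi in train_x]
    return sum(w * y for w, y in zip(weights, train_y)) / sum(weights)

# Stored records: two clusters of inputs with outputs 0 and 1
X = [[0.0], [0.1], [1.0], [1.1]]
Y = [0.0, 0.0, 1.0, 1.0]
near_zero = grnn_predict([0.05], X, Y, sigma=0.1)  # dominated by the 0-cluster
near_one = grnn_predict([1.05], X, Y, sigma=0.1)   # dominated by the 1-cluster
```

With a large sigma (say 10.0), both queries would instead drift toward the overall mean of 0.5, which is the "more generalized response" described above.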
Advantages of the GRNN:
- Fast training speed for modest training sets
- Sometimes higher accuracy than Back Propagation

Disadvantages of the GRNN:
- Large networks with large training data sets
- Slower performance on "recall" with large networks
- Does not handle trending inputs and outputs well
- Difficult to determine the proper "recall factor" (sigma)

The GRNN memorizes the data in one pass. This reduces training greatly compared to iterative techniques like Back Propagation. Depending on the nature of the data, it can be much more accurate than Back Propagation. This can mean some outstanding advantages. Since the GRNN memorizes the training data, large training sets (records) can create large networks: the first hidden layer gets one node per training record, and the second hidden layer one node per output (plus 1). Thus, for N training records and 2 predicted outputs using 5 inputs, this ends up with a 5-input, N-hidden, 3-hidden, 2-output network. If Back Propagation is used, a much smaller network might be found. Since these networks can get large, depending on the application their "recall time", the time to pass new data through them, can be slow. Also, since they compare current records to a previous history and do not create a mathematical "formula" relating inputs and outputs as Back Propagation does, they do not handle data outside the range they were trained on very well. Additionally, knowing what "sigma" factor to use can be problematic: it depends on how discriminating the output needs to be. This issue is reduced considerably by automatically optimizing sigma in the algorithm as the networks are built.

M VII - TIME DELAY NEURAL NETWORK [1]

The TDNN is a more general form of Back Propagation. It employs the Back Propagation technique for setting the weights between neurons, but they also
consider time inherently in the structure of the architecture. The TDNN can be viewed as a back propagation network in which there are multiple connections from the input neurons to the output neurons. Each of these connections looks back over time and sets its weights so as to minimize the Mean Squared Error (MSE) of the overall network. Figure 2 depicts such a network.

Figure 2 - A TDNN network (input layer, hidden layer, output layer, with optional recurrent outputs)

The network shown in Figure 2 is a Time Delay Neural Network with two (2) connections from each input neuron to each hidden neuron, and one connection from the hidden neuron to the output neuron. In a TDNN, each connection is set to a specific data interval back in time, with the first connection set at the current time (current record) and the second connection set to one period ago (a third connection to 2 periods ago, etc.). This look-back is performed by providing each neuron with memory, so that it can remember previous-layer outputs for N periods of time. Thus, a TDNN can be thought of as a back propagation network with fixed time delays back N periods of time, exactly like lagging your inputs by N periods. The one major difference is that the TDNN also does this with the hidden neurons' outputs, thus seeing, remembering and using "features" in your data over time. The TDNN, in this implementation, also has the option of being recurrent. This means that the network architecture can be set to take the last output of the network and use it as an input. These inputs also include the TDNN's time-based look-back ability, so it can look back over the history of the neural outputs for extended time-span features of the data. The look-back feature of TDNN networks makes them particularly appropriate for time-series applications where histories of input variables
are used to produce predictions into the future. TDNN networks can also be used for time-based classification and diagnostics, where histories of inputs are used to identify the existence of some condition.

Advantages of TDNN networks:
- No input lagging required: just load the data in time-sequential order and run
- Uses familiar back propagation techniques
- Much higher accuracy on some problem types than back propagation

Disadvantages of TDNN networks:
- Time delays consume usable records (it takes records to "pre-load" a network, just as with recurrent networks and much like the records lost when lagging inputs manually)
- More free parameters require more data
- More free parameters mean relatively slower training

3 Problem 1 - Description

Customer Classification for Business Applications [6]

The objective of database marketing is the discovery of inhomogeneous information about the customers' personal and demographic background, as well as the products each customer already uses. The information extracted from the databases is used to select, as exactly as possible, those persons from a database who show the greatest potential to actually buy an advertised product, or to get information about the customers' requirements. The data records included personal data, such as age, gender and job, as well as detailed information about the customers' product utilization. 18 different products were considered, ranging from checking and savings accounts to investment plans and securities. Additionally, a classification of their place of residence was available for each person in the database. Altogether, about 100 data fields were available for each customer, and 27 of the 100 fields were selected as input data. The whole set of data contains 300 data sets, including personal data, information about the customers' product utilization, and the information
whether the customer has ordered a special product during a marketing campaign. Parts of the data are encoded to protect the personal data without a loss of information. The training data are in ASCII format.

Table 1: Structure of the data of problem 1

  Feature 1    Feature 2    ...  Feature j    ...  Feature 27
  pd(1,1)      pd(1,2)      ...  pd(1,j)      ...  pd(1,27)
  ...          ...          ...  ...          ...  ...
  pd(150,1)    pd(150,2)    ...  pd(150,j)    ...  pd(150,27)

Explanation: pd(i,j): personal data j of data set i (i = 1, ..., 300; j = 1, ..., 27). Feature : customer personal data; Feature : product utilization data (0 or 1)

RESULTS FOR PROBLEM 1

M I - DECISION TREE - Personal Computer Implementation

The building of this tree was completed, with the following results:
Number of classes = 2
General accuracy for class 1: 93.30%
General accuracy for class 0: 93.30%
Overall accuracy: 93.3%

M II - DECISION TREE - Workstation Implementation

The building of this tree was completed, with the following results:
Number of classes = 2
Errors = 64 (21.33%)
Total = 150 records
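The per-class and overall accuracy figures reported in this section can be derived from a prediction vector as sketched below (an illustration with made-up labels, not the competition data):

```python
def class_accuracies(actual, predicted):
    """Per-class accuracy (fraction of each class correctly predicted)
    and overall accuracy, as reported in the Problem 1 results."""
    per_class = {}
    for c in sorted(set(actual)):
        idx = [i for i, a in enumerate(actual) if a == c]
        per_class[c] = sum(predicted[i] == actual[i] for i in idx) / len(idx)
    overall = sum(p == a for p, a in zip(predicted, actual)) / len(actual)
    return per_class, overall

# Toy example: one record of class 1 misclassified
per_class, overall = class_accuracies([1, 1, 0, 0], [1, 0, 0, 0])
```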
M III - NEURAL INDUCTION - Workstation Implementation

The training of this network was completed, with the following results:
General accuracy for class 1: 83.30%
General accuracy for class 0: 82.60%

M IV - MULTILAYERED PERCEPTRON/Exhaustive Network Search

The training of this network was completed, with the following results:
Mean squared error on training set:
Min. mean squared error on test set:
This network is a Multilayered Perceptron neural network, employing 9 inputs and 19 hidden neurons, with 1 output neuron.

M V - MULTILAYERED PERCEPTRON/Genetic Algorithm

The training of this network was completed, with the following results:
Mean squared error on training set:
This network is a Multilayered Perceptron neural network. All the columns in the data file were used.

M VI - GENERALIZED REGRESSION NEURAL NETWORK

The training of this network was completed, with the following results:
Mean squared error on training set:
Min. mean squared error on test set:
This network is a Generalized Regression neural network, employing 27 inputs and 2 hidden layers. The second hidden layer used a summation transfer function. There was 1 output neuron using a direct transfer function. All the columns in the data file were used.
M VII - TIME DELAY NEURAL NETWORK

The training of this network was completed, with the following results:
Accuracy on training set: 98.00%
Max. accuracy on test set: 84.00%
This network is a Time Delay neural network, employing 24 inputs and 2 hidden layers. The first hidden layer had 4 Tanh and 2 Linear neurons with 3 connections. The second hidden layer had 2 Logistic and 2 Linear neurons with 7 connections. There was 1 output neuron using a linear transfer function and 1 connection each. The following columns in the data file were used: C1, C2, C3, C5, C7, C8, C9, C10, C11, P1, P2, P3, P5, P6, P7, P8, P9, P10, P11, P12, P13, P14, P15, P16.

COMPARISON OF RESULTS

Figure 3 attempts to show the results of the described methods.

Figure 3 - Output comparison of the results of Methods II, III, IV, V, VI and VII
4 Problem 2 - Description

Problem 2 uses data extracted from a real-world insurance database. A data set of 32 attributes and registers was collected from a huge database containing information on insured persons and companies. This set of data has typical properties like fragmentation, varying data quality, irregular data value coding, missing values, noise, etc., which make the application of data mining a challenge. The complexity and dimensionality of this problem raise some discussion about the algorithms and their results.

The database describes relations among customers, insurance contracts and components of insurance tariffs. Each customer can play roles in certain insurance policies, and an insurance contract can have several components, each of which is related to a tariff role of the respective customer. Each policy concerns a certain product, and tariff components are bound to dedicated insurance tariffs. The 32 attributes are distributed as follows:

- 9 attributes describe the customer: sex, birth date, marital status, etc.;
- 11 attributes hold policy information: type of contract, status of contract, mode of payment, etc.;
- 12 attributes specify the tariff components: insured benefits, regular premium, etc.

Several methods have been used to discover the relationships among the attributes, but only two of them are considered here, since their results are reasonable. The first result utilizes Method II to classify the data. The obtained decision tree has 43 levels of depth. This run identified 3 classes, with 3624 registers (11.32%) erroneously classified. Figure 4 shows the comparison of the predicted and the target classification for an excerpt of a hundred points.
Figure 4 - Target vs. predicted classification

The second result uses Method III. The obtained neural network classified the data with 63.11% accuracy. Figure 5 presents the resulting and the desired classification for a hundred outputs.

Figure 5 - Desired vs. resulting classification

5 Conclusion

In this work, two considerably difficult problems were chosen to examine the main characteristics of the methods usually applied to customer classification problems.
These methods were selected from those most commonly employed in today's software. No special effort was made in the amalgamation of the raw data. The simple consideration of the 28 attributes resulted in a hard task.

Problem 1 (28 attributes x 300 records) could easily be analysed on personal computer platforms. In this case, the Multilayered Perceptron/Exhaustive Network Search produced the most accurate solution. In Problem 2 (32 attributes x records), the solutions could only be obtained on workstation implementations. Naturally, the induction solution showed excellent computational performance; in terms of accuracy, again the neural net strategy yielded the best results.

In conclusion, the data considered here can serve as benchmarks for the research of mining methods implemented to scale to the dimensions of very large databases.

REFERENCES

[1] Masters, T., Signal and Image Processing with Neural Networks, John Wiley and Sons, Inc., USA.
[2] Masters, T., Advanced Algorithms for Neural Networks, John Wiley and Sons, Inc., New York.
[3] Michie, D., Spiegelhalter, D.J. and Taylor, C.C., Machine Learning, Neural and Statistical Classification, Ellis Horwood Limited.
[4] Arbib, M.A., The Handbook of Brain Theory and Neural Networks, The MIT Press, Massachusetts.
[5] Kennedy, R.L., Lee, Y., Van Roy, B., Reed, C.D. and Lippmann, R.P., Solving Data Mining Problems through Pattern Recognition, Prentice Hall, USA, 1997.
[6] ERUDIT'98 - Second International Competition of Data Analysis by Intelligent Techniques - European Network of Excellence for Uncertainty
Modelling, September 1998.
[7] Quinlan, J.R., C4.5: Programs for Machine Learning, Morgan Kaufmann Publishers, California, 1993.
[8] Agrawal, R., Mannila, H., Srikant, R., Toivonen, H. and Verkamo, A.I., Fast Discovery of Association Rules, Chapter 12 in Advances in Knowledge Discovery and Data Mining, eds. Fayyad, U.M., Piatetsky-Shapiro, G., Smyth, P. and Uthurusamy, R., The MIT Press, California.
Data Mining 3.2 Decision Tree Classifier Fall 2008 Instructor: Dr. Masoud Yaghini Outline Introduction Basic Algorithm for Decision Tree Induction Attribute Selection Measures Information Gain Gain Ratio
More informationLecture #11: The Perceptron
Lecture #11: The Perceptron Mat Kallada STAT2450 - Introduction to Data Mining Outline for Today Welcome back! Assignment 3 The Perceptron Learning Method Perceptron Learning Rule Assignment 3 Will be
More informationNeural Network Neurons
Neural Networks Neural Network Neurons 1 Receives n inputs (plus a bias term) Multiplies each input by its weight Applies activation function to the sum of results Outputs result Activation Functions Given
More informationRough Set Approach to Unsupervised Neural Network based Pattern Classifier
Rough Set Approach to Unsupervised Neural based Pattern Classifier Ashwin Kothari, Member IAENG, Avinash Keskar, Shreesha Srinath, and Rakesh Chalsani Abstract Early Convergence, input feature space with
More informationLogical Rhythm - Class 3. August 27, 2018
Logical Rhythm - Class 3 August 27, 2018 In this Class Neural Networks (Intro To Deep Learning) Decision Trees Ensemble Methods(Random Forest) Hyperparameter Optimisation and Bias Variance Tradeoff Biological
More information11/14/2010 Intelligent Systems and Soft Computing 1
Lecture 7 Artificial neural networks: Supervised learning Introduction, or how the brain works The neuron as a simple computing element The perceptron Multilayer neural networks Accelerated learning in
More informationLiquefaction Analysis in 3D based on Neural Network Algorithm
Liquefaction Analysis in 3D based on Neural Network Algorithm M. Tolon Istanbul Technical University, Turkey D. Ural Istanbul Technical University, Turkey SUMMARY: Simplified techniques based on in situ
More informationConstructively Learning a Near-Minimal Neural Network Architecture
Constructively Learning a Near-Minimal Neural Network Architecture Justin Fletcher and Zoran ObradoviC Abetract- Rather than iteratively manually examining a variety of pre-specified architectures, a constructive
More informationCluster analysis of 3D seismic data for oil and gas exploration
Data Mining VII: Data, Text and Web Mining and their Business Applications 63 Cluster analysis of 3D seismic data for oil and gas exploration D. R. S. Moraes, R. P. Espíndola, A. G. Evsukoff & N. F. F.
More informationCSC 578 Neural Networks and Deep Learning
CSC 578 Neural Networks and Deep Learning Fall 2018/19 7. Recurrent Neural Networks (Some figures adapted from NNDL book) 1 Recurrent Neural Networks 1. Recurrent Neural Networks (RNNs) 2. RNN Training
More informationMachine Learning. Deep Learning. Eric Xing (and Pengtao Xie) , Fall Lecture 8, October 6, Eric CMU,
Machine Learning 10-701, Fall 2015 Deep Learning Eric Xing (and Pengtao Xie) Lecture 8, October 6, 2015 Eric Xing @ CMU, 2015 1 A perennial challenge in computer vision: feature engineering SIFT Spin image
More informationCOMPUTATIONAL INTELLIGENCE
COMPUTATIONAL INTELLIGENCE Fundamentals Adrian Horzyk Preface Before we can proceed to discuss specific complex methods we have to introduce basic concepts, principles, and models of computational intelligence
More informationNotes on Multilayer, Feedforward Neural Networks
Notes on Multilayer, Feedforward Neural Networks CS425/528: Machine Learning Fall 2012 Prepared by: Lynne E. Parker [Material in these notes was gleaned from various sources, including E. Alpaydin s book
More informationClimate Precipitation Prediction by Neural Network
Journal of Mathematics and System Science 5 (205) 207-23 doi: 0.7265/259-529/205.05.005 D DAVID PUBLISHING Juliana Aparecida Anochi, Haroldo Fraga de Campos Velho 2. Applied Computing Graduate Program,
More informationCLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS
CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS CHAPTER 4 CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS 4.1 Introduction Optical character recognition is one of
More informationMultilayer Feed-forward networks
Multi Feed-forward networks 1. Computational models of McCulloch and Pitts proposed a binary threshold unit as a computational model for artificial neuron. This first type of neuron has been generalized
More informationAssignment # 5. Farrukh Jabeen Due Date: November 2, Neural Networks: Backpropation
Farrukh Jabeen Due Date: November 2, 2009. Neural Networks: Backpropation Assignment # 5 The "Backpropagation" method is one of the most popular methods of "learning" by a neural network. Read the class
More informationDecision Tree CE-717 : Machine Learning Sharif University of Technology
Decision Tree CE-717 : Machine Learning Sharif University of Technology M. Soleymani Fall 2012 Some slides have been adapted from: Prof. Tom Mitchell Decision tree Approximating functions of usually discrete
More informationAn Integer Recurrent Artificial Neural Network for Classifying Feature Vectors
An Integer Recurrent Artificial Neural Network for Classifying Feature Vectors Roelof K Brouwer PEng, PhD University College of the Cariboo, Canada Abstract: The main contribution of this report is the
More informationCS570: Introduction to Data Mining
CS570: Introduction to Data Mining Classification Advanced Reading: Chapter 8 & 9 Han, Chapters 4 & 5 Tan Anca Doloc-Mihu, Ph.D. Slides courtesy of Li Xiong, Ph.D., 2011 Han, Kamber & Pei. Data Mining.
More information4.12 Generalization. In back-propagation learning, as many training examples as possible are typically used.
1 4.12 Generalization In back-propagation learning, as many training examples as possible are typically used. It is hoped that the network so designed generalizes well. A network generalizes well when
More informationDEEP LEARNING REVIEW. Yann LeCun, Yoshua Bengio & Geoffrey Hinton Nature Presented by Divya Chitimalla
DEEP LEARNING REVIEW Yann LeCun, Yoshua Bengio & Geoffrey Hinton Nature 2015 -Presented by Divya Chitimalla What is deep learning Deep learning allows computational models that are composed of multiple
More informationInternational Journal of Electrical and Computer Engineering 4: Application of Neural Network in User Authentication for Smart Home System
Application of Neural Network in User Authentication for Smart Home System A. Joseph, D.B.L. Bong, and D.A.A. Mat Abstract Security has been an important issue and concern in the smart home systems. Smart
More informationAdaptive Building of Decision Trees by Reinforcement Learning
Proceedings of the 7th WSEAS International Conference on Applied Informatics and Communications, Athens, Greece, August 24-26, 2007 34 Adaptive Building of Decision Trees by Reinforcement Learning MIRCEA
More informationCSE 626: Data mining. Instructor: Sargur N. Srihari. Phone: , ext. 113
CSE 626: Data mining Instructor: Sargur N. Srihari E-mail: srihari@cedar.buffalo.edu Phone: 645-6164, ext. 113 1 What is Data Mining? Different perspectives: CSE, Business, IT As a field of research in
More informationData Mining: Concepts and Techniques. Chapter 9 Classification: Support Vector Machines. Support Vector Machines (SVMs)
Data Mining: Concepts and Techniques Chapter 9 Classification: Support Vector Machines 1 Support Vector Machines (SVMs) SVMs are a set of related supervised learning methods used for classification Based
More informationNeural Networks for Machine Learning. Lecture 15a From Principal Components Analysis to Autoencoders
Neural Networks for Machine Learning Lecture 15a From Principal Components Analysis to Autoencoders Geoffrey Hinton Nitish Srivastava, Kevin Swersky Tijmen Tieleman Abdel-rahman Mohamed Principal Components
More informationDecision Trees Oct
Decision Trees Oct - 7-2009 Previously We learned two different classifiers Perceptron: LTU KNN: complex decision boundary If you are a novice in this field, given a classification application, are these
More information7. Decision or classification trees
7. Decision or classification trees Next we are going to consider a rather different approach from those presented so far to machine learning that use one of the most common and important data structure,
More informationResearch Article International Journals of Advanced Research in Computer Science and Software Engineering ISSN: X (Volume-7, Issue-6)
International Journals of Advanced Research in Computer Science and Software Engineering Research Article June 17 Artificial Neural Network in Classification A Comparison Dr. J. Jegathesh Amalraj * Assistant
More informationLecture 2 Notes. Outline. Neural Networks. The Big Idea. Architecture. Instructors: Parth Shah, Riju Pahwa
Instructors: Parth Shah, Riju Pahwa Lecture 2 Notes Outline 1. Neural Networks The Big Idea Architecture SGD and Backpropagation 2. Convolutional Neural Networks Intuition Architecture 3. Recurrent Neural
More informationArgha Roy* Dept. of CSE Netaji Subhash Engg. College West Bengal, India.
Volume 3, Issue 3, March 2013 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Training Artificial
More informationPattern Recognition. Kjell Elenius. Speech, Music and Hearing KTH. March 29, 2007 Speech recognition
Pattern Recognition Kjell Elenius Speech, Music and Hearing KTH March 29, 2007 Speech recognition 2007 1 Ch 4. Pattern Recognition 1(3) Bayes Decision Theory Minimum-Error-Rate Decision Rules Discriminant
More informationMLPQNA-LEMON Multi Layer Perceptron neural network trained by Quasi Newton or Levenberg-Marquardt optimization algorithms
MLPQNA-LEMON Multi Layer Perceptron neural network trained by Quasi Newton or Levenberg-Marquardt optimization algorithms 1 Introduction In supervised Machine Learning (ML) we have a set of data points
More informationSimple Model Selection Cross Validation Regularization Neural Networks
Neural Nets: Many possible refs e.g., Mitchell Chapter 4 Simple Model Selection Cross Validation Regularization Neural Networks Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University February
More informationElena Marchiori Free University Amsterdam, Faculty of Science, Department of Mathematics and Computer Science, Amsterdam, The Netherlands
DATA MINING Elena Marchiori Free University Amsterdam, Faculty of Science, Department of Mathematics and Computer Science, Amsterdam, The Netherlands Keywords: Data mining, knowledge discovery in databases,
More informationImplementation of a Library for Artificial Neural Networks in C
Implementation of a Library for Artificial Neural Networks in C Jack Breese TJHSST Computer Systems Lab 2007-2008 June 10, 2008 1 Abstract In modern computing, there are several approaches to pattern recognition
More informationCOMP 465: Data Mining Classification Basics
Supervised vs. Unsupervised Learning COMP 465: Data Mining Classification Basics Slides Adapted From : Jiawei Han, Micheline Kamber & Jian Pei Data Mining: Concepts and Techniques, 3 rd ed. Supervised
More informationAllstate Insurance Claims Severity: A Machine Learning Approach
Allstate Insurance Claims Severity: A Machine Learning Approach Rajeeva Gaur SUNet ID: rajeevag Jeff Pickelman SUNet ID: pattern Hongyi Wang SUNet ID: hongyiw I. INTRODUCTION The insurance industry has
More informationData Mining. Covering algorithms. Covering approach At each stage you identify a rule that covers some of instances. Fig. 4.
Data Mining Chapter 4. Algorithms: The Basic Methods (Covering algorithm, Association rule, Linear models, Instance-based learning, Clustering) 1 Covering approach At each stage you identify a rule that
More informationNeuro-Fuzzy Inverse Forward Models
CS9 Autumn Neuro-Fuzzy Inverse Forward Models Brian Highfill Stanford University Department of Computer Science Abstract- Internal cognitive models are useful methods for the implementation of motor control
More informationComputational Intelligence Meets the NetFlix Prize
Computational Intelligence Meets the NetFlix Prize Ryan J. Meuth, Paul Robinette, Donald C. Wunsch II Abstract The NetFlix Prize is a research contest that will award $1 Million to the first group to improve
More informationSupervised Learning with Neural Networks. We now look at how an agent might learn to solve a general problem by seeing examples.
Supervised Learning with Neural Networks We now look at how an agent might learn to solve a general problem by seeing examples. Aims: to present an outline of supervised learning as part of AI; to introduce
More informationArtificial Neural Networks MLP, RBF & GMDH
Artificial Neural Networks MLP, RBF & GMDH Jan Drchal drchajan@fel.cvut.cz Computational Intelligence Group Department of Computer Science and Engineering Faculty of Electrical Engineering Czech Technical
More informationNeural Networks and Deep Learning
Neural Networks and Deep Learning Example Learning Problem Example Learning Problem Celebrity Faces in the Wild Machine Learning Pipeline Raw data Feature extract. Feature computation Inference: prediction,
More informationIntegrating Logistic Regression with Knowledge Discovery Systems
Association for Information Systems AIS Electronic Library (AISeL) AMCIS 1997 Proceedings Americas Conference on Information Systems (AMCIS) 8-15-1997 Integrating Logistic Regression with Knowledge Discovery
More informationUbiquitous Computing and Communication Journal (ISSN )
A STRATEGY TO COMPROMISE HANDWRITTEN DOCUMENTS PROCESSING AND RETRIEVING USING ASSOCIATION RULES MINING Prof. Dr. Alaa H. AL-Hamami, Amman Arab University for Graduate Studies, Amman, Jordan, 2011. Alaa_hamami@yahoo.com
More informationUnivariate and Multivariate Decision Trees
Univariate and Multivariate Decision Trees Olcay Taner Yıldız and Ethem Alpaydın Department of Computer Engineering Boğaziçi University İstanbul 80815 Turkey Abstract. Univariate decision trees at each
More informationData Analysis and Solver Plugins for KSpread USER S MANUAL. Tomasz Maliszewski
Data Analysis and Solver Plugins for KSpread USER S MANUAL Tomasz Maliszewski tmaliszewski@wp.pl Table of Content CHAPTER 1: INTRODUCTION... 3 1.1. ABOUT DATA ANALYSIS PLUGIN... 3 1.3. ABOUT SOLVER PLUGIN...
More informationImage Compression: An Artificial Neural Network Approach
Image Compression: An Artificial Neural Network Approach Anjana B 1, Mrs Shreeja R 2 1 Department of Computer Science and Engineering, Calicut University, Kuttippuram 2 Department of Computer Science and
More informationFuzzy Partitioning with FID3.1
Fuzzy Partitioning with FID3.1 Cezary Z. Janikow Dept. of Mathematics and Computer Science University of Missouri St. Louis St. Louis, Missouri 63121 janikow@umsl.edu Maciej Fajfer Institute of Computing
More informationThe Data Mining usage in Production System Management
The Data Mining usage in Production System Management Pavel Vazan, Pavol Tanuska, Michal Kebisek Abstract The paper gives the pilot results of the project that is oriented on the use of data mining techniques
More informationBig Data Methods. Chapter 5: Machine learning. Big Data Methods, Chapter 5, Slide 1
Big Data Methods Chapter 5: Machine learning Big Data Methods, Chapter 5, Slide 1 5.1 Introduction to machine learning What is machine learning? Concerned with the study and development of algorithms that
More informationA *69>H>N6 #DJGC6A DG C<>C::G>C<,8>:C8:H /DA 'D 2:6G, ()-"&"3 -"(' ( +-" " " % '.+ % ' -0(+$,
The structure is a very important aspect in neural network design, it is not only impossible to determine an optimal structure for a given problem, it is even impossible to prove that a given structure
More informationNearest neighbor classification DSE 220
Nearest neighbor classification DSE 220 Decision Trees Target variable Label Dependent variable Output space Person ID Age Gender Income Balance Mortgag e payment 123213 32 F 25000 32000 Y 17824 49 M 12000-3000
More informationCredit card Fraud Detection using Predictive Modeling: a Review
February 207 IJIRT Volume 3 Issue 9 ISSN: 2396002 Credit card Fraud Detection using Predictive Modeling: a Review Varre.Perantalu, K. BhargavKiran 2 PG Scholar, CSE, Vishnu Institute of Technology, Bhimavaram,
More informationMotivation. Problem: With our linear methods, we can train the weights but not the basis functions: Activator Trainable weight. Fixed basis function
Neural Networks Motivation Problem: With our linear methods, we can train the weights but not the basis functions: Activator Trainable weight Fixed basis function Flashback: Linear regression Flashback:
More informationCombining Models to Improve Classifier Accuracy and Robustness 1
Combining Models to Improve Classifier Accuracy and Robustness 1 Dean W. Abbott Abbott Consulting P.O. Box 22536 San Diego, CA 92192-2536 USA Email: dean@abbott-consulting.com Abstract Recent years have
More information