A Neural Network Model Of Insurance Customer Ratings


Jan Jantzen¹

Abstract

Given a set of data on customers, the engineering problem in this study is to model the data and classify customers into two classes. The data were split into a training set and a test set, and the training set was modelled by a multilayer perceptron network. The model classified 62 percent of the unseen test data correctly.

Contents

1 Introduction
2 Method And Equipment
3 Results and discussion
3.1 Discussion
4 Conclusions

¹ Technical University of Denmark, Department of Automation, Bldg 326, DK-2800 Lyngby, DENMARK (project report), 16 Nov

1. Introduction

The study concerns data from an insurance company, and the problem at hand can be characterised as a data mining problem. That is, given a fair amount of customer information, is it possible to extract a model of customer goodness (rating) from those data? If so, the model could be used, for instance, to predict the goodness of new customers.

A neural network approach is one of several possible ways to attack such a problem, and there are many commercial software packages that can assist the designer in building the model. Neural network models are quite difficult to build, since there are many parameters to tune in order to get the best fit to the data. Alternative methods exist, for example within the fields of pattern recognition, computational intelligence, or soft computing. A new alternative is to use Dimac's patented classification algorithm, which is simpler and faster than neural network training.

The objective of this study is primarily to provide a reference for benchmarking of modelling approaches, in particular the Dimac classifier. The benchmark data consist of more than 50 features (columns) on each customer (row), along with a binary classification into good / less good related to the historical indemnity paid to the policy holder.

Similar problems exist in the literature, for example customer segmentation in a bank. The bank created customer profiles for banking products such as telephone banking and home banking. The objective was to improve the direct marketing. The approach was to search for clusters or groups of typical product users using a fuzzy c-means algorithm. Several other financial applications exist, for example assessment of creditworthiness. In this case another bank wished to enforce a consistent decision-making procedure in all branches of the bank, by means of a model rather than human decisions.
There are many commercial tools for building and using neural networks, either alone or together with fuzzy logic tools; for an overview, see the database CITE (MIT, 1995), available free of charge from the Web. Neural network computations are naturally expressed in matrix notation, and there are several toolboxes in the matrix language Matlab, for example the commercial neural network toolbox (Demuth & Beale, 1992), and a university-developed toolbox for identification and control, downloadable from the World Wide Web (Nørgaard, NNSYSID with NNCTRL). DataEngine is a software tool for data analysis combining fuzzy rules, fuzzy clustering, neural networks and fuzzy neural systems with mathematics, statistics and signal processing (MIT, 1997). It has a plug-in, FeatureSelector, which automatically selects the most relevant set of features for a given application.

This study is somewhat different from the ones mentioned since it had to be completed in two man-days, so the focus is on the modelling aspect rather than the application aspect. The nature of the task is thus rather ordinary: to apply an established method to a set of given data. The task does not require knowledge of the insurance business, since the data features (the columns of the data) have symbolic names which make no sense to an outsider. Insight and skill are required, however, to build the neural network model. The work plan was the following.

Step 1. Cleaning of data
Step 2. Feature selection
Step 3. Training
Step 4. Test

The usual practice is to divide the data into a training set and a test set. The neural network is supposed to learn from the training data so that it can reproduce the classes in the test data. Inputs to the network are a subset of features, and output is a label for each unseen customer: good or less good. An error rate, a number, is produced to assess the performance of the classifier.

2. Method And Equipment

A neural network is basically a model structure and an algorithm for fitting the model to a given set of data. The network uses a generic nonlinearity and allows all the parameters to be adjusted. In this manner it can deal with a wide range of nonlinearities. Learning is the procedure of training a neural network to represent a model of the data. For an introduction to networks, see the textbook by Haykin (1994), the overview article by Lippmann (1987), or the downloadable introduction by Jantzen (1997).

The multilayer perceptron (MLP) network was chosen for the study. It has one or more hidden layers of neurons. The graph in Fig. 1 illustrates a multilayer network with one hidden layer, and fully connected, as every node in a layer is connected to all nodes in the next layer.

Figure 1: Fully connected multi-layer perceptron (input layer, hidden layer, output layer).

A neuron is the fundamental processor of a neural network (Fig. 2). It has three basic elements:

1. A set of connecting links (or synapses); each link carries a weight (or gain) w_0, w_1, w_2, ...
2. A summation (or adder) sums the input signals after they are multiplied by their respective weights.
3. An activation function f(x) limits the output of the neuron. Typically the output is limited to the interval [0, 1] or alternatively [-1, 1].
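The three elements of a neuron listed above can be sketched in a few lines of Python. This is only an illustration: the input values, the weights, and the choice of 0 and 1 as the two output levels of the hard limiter are assumptions, not values from the study.

```python
import math

def hard_limiter(x):
    """Step activation (assumed levels): 1 for a non-negative net input, else 0."""
    return 1.0 if x >= 0.0 else 0.0

def neuron(inputs, weights, offset, activation=math.tanh):
    """Weighted sum of the inputs plus the offset w_0, passed through the activation."""
    net = offset + sum(w * x for w, x in zip(weights, inputs))
    return activation(net)

# Illustrative numbers only; they are not taken from the insurance data.
x = [0.5, -1.0]    # two input signals
w = [0.8, 0.3]     # link weights w_1, w_2
w0 = -0.2          # offset w_0

print(neuron(x, w, w0, hard_limiter))  # output limited to {0, 1}
print(neuron(x, w, w0))                # tanh output, limited to (-1, 1)
```

With tanh as the activation the output lies between -1 and 1, which matches the second of the two intervals mentioned in the list.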

Figure 2: Perceptron consisting of a neuron (a) with an offset w_0 and an activation function f(x), which is a hard limiter (b).

The summation in the neuron also includes an offset w_0 for lowering or raising the net input to the activation function. Learning (training) is a matter of adjusting the weights in order to get the best fit of the input-output relationship in the training data.

The MLP architecture is chosen here because it is widely used and is a standard textbook method. An MLP network can approximate functions and it can perform discrete classifications. It does require that the training data and the test data are similar; it cannot reproduce instances outside of the modelled region of the feature space. In other words: it can interpolate, not extrapolate.

The software used is DataEngine with FeatureSelector. It is compiled software, which makes it difficult for the user to include personal programs, but it is specifically developed for the type of problem at hand.

3. Results and discussion

The given data are in a table containing an identification column (SE) and 52 features in the first 53 columns. Additionally, there is a class assignment column (GOD) among the remaining columns. The data are contained in a comma-separated file (erhverv.csv) which can be processed by a spreadsheet program (Microsoft Excel) and then imported into DataEngine.

Cleaning of data

There are further columns, including columns with textual information, but these columns are disregarded. There are 300 rows having a class assignment; the rest of the rows are without a class assignment in the raw data, and those rows are disregarded. All data within the feature space are numerical. There are, however, some practical problems, and engineering choices have to be made. Many cells are empty. The MLP has no built-in mechanism for coping with missing data, so the data had to be cleaned or completed.
The data are most likely contaminated with noise, and one can also assume that some of the data are faulty, resulting in outlier points with no logical connection with the rest of the data.

Parameter                        Value
Errors                           26 of 69
Neurons (input-hidden-output)
Connections                      7
Activation functions             tanh
Features                         V4, V10, V13, V14, V29_2, S31
Training epochs                  1000
RMS test error
Learning method                  backpropagation
Strategy                         Cumulative (batch)
Learning rate, hidden layer      0.1
Learning rate, output layer

Table 1: Final results

The training set was built from original data. Five columns (V30_1 to V30_5) were discovered to contain many empty cells, and those columns were deleted from the data. Of the remaining data, the rows having no empty cells were then split into a training set and a test set, and all columns were standardised (scaled to a mean value of zero and a standard deviation of one).

Feature selection

In principle, the MLP can cope with as many features (inputs) as necessary, but in practice the training is sensitive to noise and outliers. The fewer the features, the less the noise, and the clearer any functional relationship becomes. Selecting a few significant features is thus an important but difficult task. The Feature Selector produces a ranked list of combinations of good features. It does not necessarily find the best features, but it does provide a selection of good feature combinations that the designer can test. For example, it has happened that one suggested set of features had a fairly strong correlation between two features, and leaving one of them out improved the classifier. In the final model six features were selected out of the 52 original features.

Training

The network has three layers: an input layer, one hidden layer, and an output layer. It is a small network and the training proceeds smoothly and relatively fast (Fig. 3). It learns the training data very well (bottom curve), while the reproduction of the test data results is difficult (middle curve). That indicates a weak functional relationship in the data. The training is stopped when the test error starts to increase.
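The training procedure described above (batch backpropagation on a small tanh network, stopped when the test error starts to increase) can be sketched in plain Python. This is not the DataEngine implementation. The synthetic data, the targets of +1 and -1, the set sizes, the weight initialisation range, the explicit offsets, and the single learning rate of 0.1 for both layers are all assumptions made for illustration.

```python
import math
import random

def forward(x, p):
    """Forward pass of a 6-1-1 network with tanh in both layers."""
    h = math.tanh(p["b_h"] + sum(w * xi for w, xi in zip(p["w_h"], x)))
    y = math.tanh(p["b_o"] + p["w_o"] * h)
    return h, y

def rms_error(data, p):
    """Root-mean-square error between targets and network outputs."""
    return math.sqrt(sum((t - forward(x, p)[1]) ** 2 for x, t in data) / len(data))

def train(train_set, test_set, epochs=1000, lr=0.1, seed=0):
    """Batch backpropagation; stops once the test error starts to increase."""
    rng = random.Random(seed)
    p = {"w_h": [rng.uniform(-0.5, 0.5) for _ in range(6)], "b_h": 0.0,
         "w_o": rng.uniform(-0.5, 0.5), "b_o": 0.0}
    best_err, best_p = rms_error(test_set, p), p
    n = len(train_set)
    for epoch in range(1, epochs + 1):
        g = {"w_h": [0.0] * 6, "b_h": 0.0, "w_o": 0.0, "b_o": 0.0}
        for x, t in train_set:  # accumulate gradients over the whole set (batch)
            h, y = forward(x, p)
            d_o = (y - t) * (1.0 - y * y)          # delta at the output node
            d_h = d_o * p["w_o"] * (1.0 - h * h)   # delta at the hidden node
            g["w_o"] += d_o * h
            g["b_o"] += d_o
            g["b_h"] += d_h
            for i, xi in enumerate(x):
                g["w_h"][i] += d_h * xi
        # One gradient step per epoch; a new dict is built, so best_p is never mutated.
        p = {"w_h": [w - lr * gw / n for w, gw in zip(p["w_h"], g["w_h"])],
             "b_h": p["b_h"] - lr * g["b_h"] / n,
             "w_o": p["w_o"] - lr * g["w_o"] / n,
             "b_o": p["b_o"] - lr * g["b_o"] / n}
        err = rms_error(test_set, p)
        if err > best_err:  # test error has started to increase: stop
            return best_p, best_err, epoch
        best_err, best_p = err, p
    return best_p, best_err, epochs

def make_data(n, rng):
    """Synthetic stand-in for the customer table: 6 features, target +1 or -1."""
    data = []
    for _ in range(n):
        x = [rng.gauss(0.0, 1.0) for _ in range(6)]
        noisy = x[0] - x[1] + rng.gauss(0.0, 1.0)  # weak, noisy relationship
        data.append((x, 1.0 if noisy > 0 else -1.0))
    return data

rng = random.Random(1)
train_set, test_set = make_data(70, rng), make_data(69, rng)
params, test_err, stopped_at = train(train_set, test_set)
errors = sum((forward(x, params)[1] > 0) != (t > 0) for x, t in test_set)
print(f"stopped after {stopped_at} epochs, RMS test error {test_err:.3f}, "
      f"{errors} of {len(test_set)} misclassified")
```

The sketch has six input-to-hidden weights and one hidden-to-output weight, in line with the seven connections of Table 1; the two offsets are an addition of this sketch. The stopping rule returns the weights from the epoch before the test error first rose.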
Test

The final solution misclassified 26 rows out of 69 (38%) or, conversely, it classified 43 out of 69 (62%) correctly (Table 1). This number can be compared with the worst case: a random selection would result in 50 percent correct classifications, since there are only two classes. The result is thus 12 percentage points better than random.
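The figures in the test result follow from simple arithmetic, which the short Python sketch below reproduces. The 50 percent baseline is the random-selection reference used in the text; the variable names are illustrative.

```python
errors, total = 26, 69            # misclassified rows and test set size (Table 1)

error_rate = errors / total       # share of misclassified rows
accuracy = 1 - error_rate         # share of correctly classified rows
baseline = 0.5                    # random selection between two classes
gain_points = (accuracy - baseline) * 100

print(f"error rate {error_rate:.1%}")                 # 37.7%
print(f"accuracy   {accuracy:.1%}")                   # 62.3%
print(f"gain over random {gain_points:.1f} points")   # 12.3 points
```

The gain is a difference of two percentages, so it is expressed in percentage points rather than as a relative improvement.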

Figure 3: Learning curve (RMS training error, max. training error, RMS test error and max. test error plotted against epoch). After 1000 epochs the test error is more or less stable; further training is in vain.

3.1 Discussion

The network has a total of 7 connection weights to adjust, and thus the ratio of weights to training examples is about 1:10, which is acceptable. The training produces slightly different results after retraining, due to the random initialisation of the weights; the result varies between 25 and 30 misclassifications. The learning method was supplemented with the momentum and learning rate decay options, but it seemed that for a small network, a simple learning method works best.

The network is remarkably simple; it has an input for each feature, one output to represent the two classes, and thus only one extra node. One interpretation is that the data are most likely noisy, such that a larger network would learn the noise, spoiling the generalisation ability and increasing the test error.

The result is less sensitive to the network parameters than to the choice of training data and test data. The strategy has been to avoid empty cells in the raw data. This is in order to facilitate the benchmarking of methods. For example, better classification rates can be achieved by filling in the missing data according to some rule, but then one runs the risk of introducing an artificial bias in the data. To illustrate the sensitivity to the data, a simple swap of training and test data produced worse results. The sensitivity is explained, again, by the presence of noise. If faulty data could be removed, and empty cells could be filled, it would improve the situation. It would probably also help to know what the features are and how they were derived. For example, it might be possible to combine two or more features into one more significant indicator using expert knowledge.

The major sensitivity item is the choice of features. A small set of features must be selected, and Feature Selector turned out to be a time saver in this respect. Ideally, many of Feature Selector's suggestions ought to be tried out in order to find a near-optimal set, but only a few were tested due to the tight schedule. The feature selection is the major subproblem in a neural network approach.
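The strategy of avoiding empty cells, as discussed in this section, can be sketched as follows. In the study the cleaning was carried out with a spreadsheet and DataEngine, so this is not the procedure actually used. The toy values, the 50 percent threshold for deleting a column, and the use of the population standard deviation are assumptions; the split into training and test sets is left out because the report does not say how it was made.

```python
from statistics import mean, pstdev

def drop_sparse_columns(rows, max_missing_fraction):
    """Remove columns whose share of empty cells (None) exceeds the threshold."""
    n = len(rows)
    keep = [c for c in rows[0]
            if sum(r[c] is None for r in rows) / n <= max_missing_fraction]
    return [{c: r[c] for c in keep} for r in rows]

def drop_incomplete_rows(rows):
    """Keep only the rows without any empty cell."""
    return [r for r in rows if all(v is not None for v in r.values())]

def standardise(rows, columns):
    """Scale each listed column to zero mean and unit standard deviation."""
    out = [dict(r) for r in rows]
    for c in columns:
        values = [r[c] for r in rows]
        mu, sigma = mean(values), pstdev(values)
        for r in out:
            r[c] = (r[c] - mu) / sigma if sigma > 0 else 0.0
    return out

# Toy table: the column names follow the report's style, the values are invented.
raw = [
    {"V4": 1.0, "V10": 2.0, "V30_1": None},
    {"V4": 3.0, "V10": None, "V30_1": None},
    {"V4": 5.0, "V10": 6.0, "V30_1": 1.0},
    {"V4": 7.0, "V10": 8.0, "V30_1": None},
]
cleaned = drop_incomplete_rows(drop_sparse_columns(raw, max_missing_fraction=0.5))
scaled = standardise(cleaned, ["V4", "V10"])
print(len(cleaned), sorted(cleaned[0]))
```

Deleting the sparse column first matters: if the rows were filtered first, almost every row of the toy table would be lost because of the nearly empty column V30_1.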

4. Conclusions

The objective was to provide a reference for benchmarking. The achieved solution is believed to be near-optimal, although better solutions are conceivable because of the lack of a systematic design approach. The training data and the test data can be used by other testers in order to make a fair comparison of approaches. The objective focused on neural networks, but other approaches may work better. Especially hybrid approaches, a combination of established approaches, should be tried if the goal is to find the best classifier.

References

Demuth, H. and Beale, M. (1992). Neural Network Toolbox: For Use with Matlab, The MathWorks, Inc, Natick, MA, USA.

Haykin, S. (1994). Neural Networks: A Comprehensive Foundation, Macmillan College Publishing Company, Inc., 866 Third Ave, New York, NY

Lippmann, R. (1987). An introduction to computing with neural nets, IEEE ASSP Magazine pp

Meier, W., Weber, R. and Zimmermann, H.-J. (1994). Fuzzy data analysis-methods and industrial applications, Fuzzy Sets and Systems 61:

MIT (1995). CITE Literature and Products Database, MIT GmbH / ELITE, Promenade 9, D Aachen, Germany.

MIT (1997). DataEngine: Part II, Tutorials, MIT GmbH, Promenade 9, D Aachen, Germany.

Nørgaard, P. M. (n.d.a). NNCTRL Toolkit, Technical University of Denmark: Dept. of Automation,

Nørgaard, P. M. (n.d.b). NNSYSID Toolbox, Technical University of Denmark: Dept. of Automation,
