A Deeper Insight into Neural Network Spatial Interaction Models' Performance Trained with Different Training Algorithms


A Deeper Insight into Neural Network Spatial Interaction Models' Performance Trained with Different Training Algorithms

Gusri YALDI
Civil Engineering Department, Padang State Polytechnic, Padang, Indonesia
gusri.yaldi@yahoo.com

Abstract: Neural Network (NN) development can be divided into three crucial stages: before 1960, 1960-1986, and after 1986. The era before 1960 is when NN was discovered and developed; the next period saw a decline in NN research, while the last period has witnessed advanced developments in NN, especially in the improvement of training algorithms. The training algorithm is therefore considered the key to the success of NN applications, and an inappropriate choice of training algorithm may lead to poor NN performance. This paper reports the performance of NN models for spatial interaction modelling trained with three different algorithms and discusses some fundamental issues such as the training error and gradient, connection weight updates, mapping output, performance consistency, and over fitting. These issues are rarely discussed in previous studies. Findings from this study are expected to assist transport modellers in using NN as a robust and sound modelling tool.

Keywords: Training Algorithm, Training Error and Gradient, Connection Weight Update, Over Fitting

1. INTRODUCTION

The Neural Network (NN) approach is widely known as an intelligent computer system that works on the principles of the human brain. It is a forecasting method that specifies output by minimizing an error term, indicated by the deviation between the observed and modelled output, through the use of a specific training algorithm and random learning rate (Black, 1995; Zhang et al., 1998). NN development can be divided into three crucial stages: before 1960, 1960-1986, and after 1986. The era before 1960 is when NN was discovered and developed; the next period saw a decline in NN research, while the last period has witnessed advanced developments in NN, especially in the improvement of training algorithms.

A training algorithm is a series of procedures used by the neural model to learn and update or adjust the connection weights according to specific rules, an objective function, an activation function, and an error threshold. Learning is the process of determining the optimum values of the connection weights, the key elements of a neural model, by using the available training patterns/data. The operation and performance of a neural model are determined by the learning rules used to perform a specific task; the connection weights are adjusted over time and become stable (converge) through an iterative process. The training algorithm is therefore a fundamental component in the use of NN.

Backpropagation (BP) has been considered the most famous training algorithm. It was also used for spatial interaction modelling by Black (1995), one of the earliest studies applying neural models in transport. Although BP is considered a landmark in the revival of NN

in the 1990s, it is considered too slow in the training process and it often becomes trapped in local minima, which led to the development of alternative training algorithms intended to improve its convergence speed (Jacobs, 1988; Barnard, 1992). One modified version of BP is the Variable Learning Rate (VLR) proposed by Vogl et al. (1988). The improvement described by Vogl et al. (1988) is categorized as an ad hoc or heuristic technique. According to Wilamowski et al. (2001), the heuristic approach does contribute improvements, but these are considered minor. Second-order approaches such as Newton's method, the conjugate gradient method, or the Levenberg-Marquardt (LM) optimization technique can improve neural model performance more significantly. The LM algorithm proposed by Hagan and Menhaj (1994) is the second-order training algorithm reported as the most efficient and widely accepted (Wilamowski et al., 2001).

Thus, this paper explores the performance of the neural model for spatial interaction modelling using three different algorithms, namely (1) BP, (2) VLR, and (3) LM. A Multilayer Feedforward Neural Network (MLFFNN) with a single hidden layer and ten hidden nodes, as depicted in Figure 1, is used in the training process.

Figure 1 The trained NN model structure

In spatial movement interaction modelling, NN has been adopted at different levels of modelling with different methods of training. Black (1995) conducted the modelling at the calibration level only and suggested that neural models have good replication ability. The next neural models for trip distribution were reported by Mozolin et al. (2000), who also found good calibration capability. However, the so-called predictive capability of the neural models was found to be poor. The ad hoc BP modification called Quickprop was used to train those models (Mozolin et al., 2000). More recently, VLR and LM were used in trip distribution neural models as reported by Yaldi et al. (2009), Shirmohammadli et al. (2010), and Yaldi et al. (2011). Different neural model performance was reported, especially at the testing level.

The techniques for back propagating the error through the system and adjusting the connection weights are the basic processes that differ between BP, VLR, and LM. This paper investigates the impact of the different techniques for adjusting the connection weight values and, consequently, on the neural model performance. The investigation involves BP as used by the earlier studies of NN in spatial movement (Dantas et al., 2000; Black, 1995), followed by the ad hoc BP modification represented by VLR (Vogl et al., 1988). The second-order (second partial derivative) BP modification is represented by LM as proposed by Hagan and Menhaj (1994).
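As a reference point for the following sections, the structure of Figure 1 can be sketched in Python/NumPy as below. The ten hidden nodes and the single output node follow the paper; the number of input nodes (three here) and the sigmoid activation are illustrative assumptions, since the paper does not list them in the text.

    import numpy as np

    # Minimal sketch of the MLFFNN in Figure 1: one hidden layer with ten nodes
    # and a single output node.  The three input nodes are an assumption.
    rng = np.random.default_rng(0)
    n_input, n_hidden, n_output = 3, 10, 1

    # Random initial connection weights (a new set for every trial/experiment).
    w_ih = rng.uniform(-0.5, 0.5, size=(n_input, n_hidden))   # input -> hidden
    b_h  = rng.uniform(-0.5, 0.5, size=n_hidden)
    w_ho = rng.uniform(-0.5, 0.5, size=(n_hidden, n_output))  # hidden -> output (w11..w110)
    b_o  = rng.uniform(-0.5, 0.5, size=n_output)

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def forward(x):
        """Feedforward pass: summation and activation at every node after the input layer."""
        h = sigmoid(x @ w_ih + b_h)
        y = sigmoid(h @ w_ho + b_o)
        return h, y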

2. MODEL SCENARIO

The experimental scenario and training algorithm details are reported in Table 1. The number of experiments is set at 30 trials, where each trial has different initial connection weight values. The experiments were undertaken using different numbers of epochs, for example 10, 100, 500, and 1000 epochs. The aim was to investigate the mapping performance of neural models trained with each training algorithm for different maximum epoch numbers. The neural model trained with a maximum epoch of 10 iterations was trained 30 times, and likewise for the other maximum epoch numbers. The purpose was to assess the sensitivity and consistency of neural models trained with different algorithms and maximum epoch numbers. Testing performance is not discussed in this paper, nor is a comparison with other modelling approaches.

Table 1 Model scenarios and training algorithm details

Scenario 1 - BP:  learning rate 0.1; minimum gradient 1E-10.
             Maximum epochs: 10, 100, 500, 1000, 5000.
Scenario 2 - VLR: learning rate 0.1; ratio to increase learning rate 1.05; ratio to
             decrease learning rate 0.7; maximum performance increase 1.04;
             minimum gradient 1E-10.
             Maximum epochs: 10, 100, 500, 1000.
Scenario 3 - LM:  minimum performance gradient 1E-10; initial Mu; Mu decrease factor 0.1;
             Mu increase factor 10; maximum Mu 1E10.
             Maximum epochs: 10, 100.

3. MODEL DATA

The simple three-region flow problem described by Black (1995) is used (see Tables 2 and 3). The number inside the brackets in Table 2 is the distance between the regions. The model is for the calibration level only. It is expected that the calibration performance will determine the testing performance too; therefore, the best training algorithm among BP, VLR, and LM found at the calibration level should also be the best one at the testing level.

Table 2 Black's three-region flow problem

O/D       a        b        c      Total
a      15 (2)    4 (3)    1 (4)     20
b      18 (3)   21 (2)    1 (5)     40
c      17 (4)    5 (5)   18 (2)     40
Total    50       30       20      100
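The scenario settings of Table 1 and the flow data of Table 2 can be collected in a small configuration sketch, shown below. The per-algorithm lists of maximum epochs follow the reconstruction of Table 1 together with the experiments summarized later (Table 6), so they should be read as an interpretation rather than a verbatim copy of the original table.

    import numpy as np

    # Scenario settings as reported in Table 1 (epoch lists reconstructed).
    scenarios = {
        "BP":  {"learning_rate": 0.1, "min_gradient": 1e-10,
                "max_epochs": [10, 100, 500, 1000, 5000]},
        "VLR": {"learning_rate": 0.1, "lr_increase": 1.05, "lr_decrease": 0.7,
                "max_perf_increase": 1.04, "min_gradient": 1e-10,
                "max_epochs": [10, 100, 500, 1000]},
        "LM":  {"min_gradient": 1e-10, "mu_decrease": 0.1, "mu_increase": 10,
                "max_mu": 1e10, "max_epochs": [10, 100]},
    }

    # Black's three-region flow problem (Table 2): observed flows and distances.
    flows = np.array([[15.0, 4.0, 1.0],
                      [18.0, 21.0, 1.0],
                      [17.0, 5.0, 18.0]])
    dist = np.array([[2.0, 3.0, 4.0],
                     [3.0, 2.0, 5.0],
                     [4.0, 5.0, 2.0]])

    n_trials = 30   # each trial starts from different random initial weights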

Table 3 Input code/label for Black's three-region flow problem

O/D    a    b    c
a      A    B    C
b      D    E    F
c      G    H    I

4. MODEL OUTPUT AND DISCUSSION

4.1 Training error and gradient

The process of connection weight updating is divided into two stages, namely the feedforward and backpropagation stages. The first stage begins with the summation at each node in the layers after the input layer and ends with the computation of the error at each output node. This process is the same for BP, VLR, and LM. The error is computed using the following equation:

    E = (1/2) \sum_{p} \sum_{k} (y_{k,p} - d_{k,p})^2        (1)

where E is the total error over the output layer nodes, y_{k,p} is the output of node k in the output layer for training pattern p, and d_{k,p} is the corresponding desired (observed) output.

The initial error computed by this equation is the same for all training algorithms, as depicted in Figure 2. This figure illustrates the mean square error (MSE) for the first experiment only; however, it shows the typical trend of the MSE for the other experiments.

Figure 2 Example of MSE for the model trained with different algorithms

Figure 2 represents the MSE for neural models trained with the different algorithms. It can be seen that although the initial MSE is the same for all training algorithms, the MSE for LM decreases sharply from the first epoch. The MSE at the first iteration is even lower than the final MSE of the other two algorithms. It then decreases gradually and almost linearly until epoch 10 is reached. Meanwhile, the models trained with BP and VLR also experience decreases in the MSE; however, these are not as sharp as for the model trained with LM.
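Equation (1) and the MSE plotted in Figure 2 can be transcribed directly, as in the sketch below. Whether the paper's MSE averages over patterns only or over patterns and output nodes is not stated, so averaging over all entries is an assumption here.

    import numpy as np

    def total_error(y, d):
        """Equation (1): E = 1/2 * sum over patterns p and output nodes k of (y - d)^2."""
        y = np.asarray(y, dtype=float)
        d = np.asarray(d, dtype=float)
        return 0.5 * np.sum((y - d) ** 2)

    def mse(y, d):
        """Mean square error of the kind plotted in Figure 2 (averaging is assumed)."""
        y = np.asarray(y, dtype=float)
        d = np.asarray(d, dtype=float)
        return np.mean((y - d) ** 2)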

When the MSE for the models trained with BP and VLR are compared, they follow almost the same pattern; VLR has a slightly lower MSE at the end of the training process, starting to separate from BP at the 7th epoch (see Figure 2).

The difference is related to the gradient computation of each training algorithm. Different methods are used to calculate the gradient. BP and VLR compute the gradient from the following equation,

    \nabla E_{w_{kj}} = \partial E / \partial w_{kj}        (2)

whereas LM computes the gradient from Equation 3 below,

    \nabla E(w) = J^T(w) e(w)        (3)

where \partial E denotes the change in the error, \partial w_{kj} the change in the connection weight between layers j and k, J(w) the Jacobian matrix, and e(w) the vector of errors.

Unlike BP, the LM gradient is computed using the Jacobian matrix. Thus, the optimization commences with different initial values of the gradient, although the initial error is the same. See Table 4 and Figure 3 for the gradient states resulting from the training. They represent the gradient for the first experiment; however, the other experiments show the same trend. The decrease with LM from the initial to the next iteration is very sharp, unlike the models trained with BP and VLR. There is a similar trend between the gradient (Figure 3) and the MSE (Figure 2), except that the initial gradients differ between the training algorithms while the initial MSE is the same even though the training algorithms are different. It can be seen from both figures that the values of the gradient and the MSE decrease gradually after the first iteration for LM, while the gradient for the other two decreases gradually and almost linearly from the beginning.

Table 4 Gradient for BP, VLR, and LM for the 1st experiment

Epoch #    BP    VLR    LM
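The two gradient expressions translate into different weight updates. The sketch below contrasts a plain gradient-descent step, as used by BP and VLR (Equation 2), with a Levenberg-Marquardt step built from Equation (3); the full LM step, including the mu damping term, follows the standard formulation of Hagan and Menhaj (1994) rather than anything spelled out explicitly in this paper.

    import numpy as np

    def bp_step(w, grad, lr=0.1):
        """Gradient-descent update used by BP and VLR (Equation 2):
        each weight moves against dE/dw, scaled by the learning rate."""
        return w - lr * grad

    def lm_step(w, jacobian, errors, mu):
        """Levenberg-Marquardt update built from Equation (3):
        gradient = J^T e, and the step solves (J^T J + mu*I) dw = J^T e.
        Standard LM formulation; the damping with mu is assumed from Table 1."""
        JtJ = jacobian.T @ jacobian
        g = jacobian.T @ errors            # Equation (3): gradient from the Jacobian
        dw = np.linalg.solve(JtJ + mu * np.eye(JtJ.shape[0]), g)
        return w - dw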

Figure 3 Gradient for the 1st experiment

The MSE and gradient for the neural models trained with BP and VLR tend to follow the same trend, except that the gaps between the MSE and gradient of VLR and BP grow after a few iterations. Allowing the learning rate to vary and utilizing a momentum term have slightly improved the model performance. However, training here is limited to ten epochs only, and the model may need more iterations. Thus, the improvement is expected to be more obvious when the neural models are trained with more epochs. This is discussed in the next section.

4.2 Connection weight update

Observing the movement of the connection weights during the updating process is an interesting task, especially when three different training algorithms are used in that process. As noted, the MSE and gradient of the neural models trained with BP, VLR, and LM each have a distinctive trend. Since the MSE, gradient, and weight update are three important components of neural network training and are related to each other, it can be expected that similar trends will be found in the connection weight updates.

Figures 4-6 illustrate the typical movements of the connection weights resulting from the training process. They only depict the connection weights between the hidden layer and output layer nodes. The same figures could also be produced for the connection weights from the input to the hidden layer; however, they would become too crowded and hence difficult to interpret. There is only one output node, symbolized by the number 1 after the letter w, which represents the connection weight; the remaining digits represent the numbers of the hidden layer nodes connected to the output layer. There are ten nodes in the hidden layer, so the connection weights between the hidden and output layer nodes are w11-w110.

None of the figures shows a clear movement. The changes in the connection weights from the iterative training process are almost invisible; all lines tend to plateau. This suggests that the magnitude of the changes is so small that it is hard to recognize. Thus, more iterations and a longer time to converge are required for BP and VLR, based on Figures 4 and 5.
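The trajectories plotted in Figures 4-6 can be produced simply by storing the hidden-to-output weight vector after every epoch. The sketch below assumes a hypothetical train_one_epoch routine standing in for a single BP, VLR, or LM epoch; it is not the paper's implementation.

    import numpy as np

    def record_weight_history(w_ho, n_epochs, train_one_epoch):
        """Store the hidden-to-output weights (w11..w110) after each epoch so that
        their movement over the iterations can be plotted, as in Figures 4-6."""
        history = [w_ho.copy()]
        for epoch in range(n_epochs):
            w_ho = train_one_epoch(w_ho)   # hypothetical single-epoch update
            history.append(w_ho.copy())
        return np.array(history)           # shape: (n_epochs + 1, n_hidden)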

Figure 4 Connection weight updating with BP

Figure 5 Connection weight updating with VLR

Figure 6 Connection weight updating with LM

Unlike BP and VLR, the neural model trained with LM shows the connection weight changes from the initial to the next iteration much more clearly, as seen in Figure 6. The direction and the magnitude of the changes are also more noticeable, as depicted in Figures 7-9. This is again the contribution of the Jacobian matrix to the computation of the gradient. The search for the gradient is more effective and efficient. The method is more efficient as it requires fewer iterations and hence less time to converge. It is

more effective as the expected output is much more accurate and precise than that from BP and VLR. The percentages of connection weight changes are shown for the weights connecting hidden nodes one, five, and ten to the output node (w11, w15, w110) only. It might be useful to show all nodes from 1 to 10; however, these three figures are considered sufficient to illustrate the changes.

Figure 7 Percentages of connection weight changes, w11

Figure 8 Percentages of connection weight changes, w15

Figure 9 Percentages of connection weight changes, w110
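The percentage changes plotted in Figures 7-9 can be computed from the recorded weight history. The paper does not state the reference value for the percentage, so the previous epoch's weight is assumed in the sketch below.

    import numpy as np

    def percent_weight_change(weight_history):
        """Percentage change of each hidden-to-output weight (w11..w110) between
        successive epochs, roughly as plotted in Figures 7-9.  weight_history has
        shape (n_epochs + 1, n_hidden); the previous-epoch reference is assumed."""
        w = np.asarray(weight_history, dtype=float)
        return 100.0 * (w[1:] - w[:-1]) / np.abs(w[:-1])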

4.3 BP Mapping Output

The ability of each training algorithm to update the connection weights and gradient is realized in the distribution of the estimations, which are compared with the observations. This is called mapping. The mapping of each training algorithm is illustrated in Figures 10-18, beginning with BP and followed by VLR and LM. The performance is depicted for 10, 100, 1000, and 5000 epochs.

The mapping results of BP are illustrated in Figures 10-13. Each figure shows the distribution of estimations and observations for a different maximum epoch number, namely (1) 10 epochs, (2) 100 epochs, (3) 1000 epochs, and (4) 5000 epochs respectively. The epoch number is limited to 5000 iterations. The performance is classified as coincident or close. A sample point is classified as coincident when its location deviates from the observed point by a maximum of ten per cent; it is classified as close when the deviation is larger than ten per cent but no more than 50%. This location is represented by the proximity of the estimation to the observation on the Y axis. The X axis has no meaning; it simply uses a uniform distance between the points of observation or estimation. The movement of each point, symbolized by the capital letters A-I, across the different maximum epoch numbers can be seen through these figures.

For a maximum epoch number of ten, all estimations are distributed along a straight line that is far from the observation points (see Figure 10). The model needs more iterations so that the estimations can spread closer to the observations. The same model is therefore trained with a higher maximum epoch number, namely 100 epochs.

Figure 10 BP mapping figure, trained with 10 epochs

The results of the neural model trained with BP with a maximum epoch number of 100 are displayed in Figure 11. Although the estimations still spread along a straight line as in the previous figure, that line is now surrounded by the observations. There is improvement; however, only one point (point A) is classified as coincident, while four estimations are classified as close (points D, E, G, and I). Thus, the same model is trained again, this time with the maximum epoch number increased to 1000 iterations. It is expected that more estimations should migrate from close to coincident, and more unclassified points should move into the close classification. However, Figure 12 shows unexpected outputs.
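The coincident/close rule defined above can be expressed directly. The sketch below applies it to the nine cells A-I of Black's matrix; the observed values are taken from Table 2.

    import numpy as np

    def classify_points(estimated, observed):
        """Classify each estimation against its observation using the paper's rule:
        'coincident' if it deviates by at most 10%, 'close' if the deviation is more
        than 10% but no more than 50%, otherwise unclassified."""
        est = np.asarray(estimated, dtype=float)
        obs = np.asarray(observed, dtype=float)
        dev = np.abs(est - obs) / obs * 100.0
        return np.where(dev <= 10, "coincident",
                        np.where(dev <= 50, "close", "unclassified"))

    # The nine cells A-I of Black's matrix, read row-wise from Table 2.
    observed = [15, 4, 1, 18, 21, 1, 17, 5, 18]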

Figure 11 BP mapping figure, trained with 100 epochs

No estimation is classified as coincident after the model is trained with 1000 epochs (see Figure 12). The estimation point previously classified as coincident migrates to the close group. There are two newcomers in the close classification, namely points B and H, while points E and G, which were previously in the close classification, move further away. Thus, there are five estimations classified as close, namely points A, B, D, H, and I. The same model is trained again, with the maximum epoch increased to 5000 iterations. Figure 13 shows the results.

Figure 12 BP mapping figure, trained with 1000 epochs

Estimation point A is back in the coincident classification, and point E is also back in the close classification. Therefore, there is one point classified as coincident and five estimations classified as close. The estimation points are now distributed more widely and closer to the observations, as can be seen in Figure 13; it is no longer appropriate to describe them with a straight line. This is much better than the outputs shown in the previous figures.

Figure 13 BP mapping figure, trained with 5000 epochs

Table 5 shows the estimations resulting from the neural model trained with BP with a maximum epoch number of 5000. There are 30 experiments, and the estimations are averaged. The percentage in the brackets is the difference between the estimations and the observations. The black shading represents the estimations classified as coincident, while the grey shading is for close. Although the model has been trained with 5000 epochs, the row and column totals still show large differences from the observed ones, except for column total b.

Table 5 Estimated flow numbers trained with BP (30 trials, 5000 epochs)

O/D       a           b            c           Total
a      14 (-7%)     8 (100%)     5 (400%)    27 (35%)
b      14 (-22%)   15 (-28.5%)   5 (400%)    34 (-15%)
c      10 (-41%)    6 (20%)     14 (-18%)    30 (-25%)
Total  38 (-24%)   29 (-3%)     25 (25%)     92 (-8%)

The training of BP is stopped at this stage. The model is actually also trained with a maximum epoch number of 500, as reported in Table 6. In general, it is found that a higher epoch number may result in a distribution of estimations closer to the observations.

Table 6 Summary of model outputs

Training    Max       Average   Training   Coincident (sample)        Close (sample)
algorithm   epoch #   epoch #   time
LM          10        9         About 1    7 (A, B, D, E, G, H, I)    2 (C and F)
LM          100                 About 8    9 (all)                    -
VLR         100                 <5         -                          5 (A, B, D, H, I)
VLR         500                 <30        6 (A, B, D, E, G, I)       2 (C and H)
VLR         1000                <60        6 (A, B, D, E, G, I)       2 (C and H)
BP          100                 <5         1 (A)                      4 (D, E, G, I)
BP          500                 <30        -                          3 (A, B, H)
BP          1000                <60        -                          5 (A, B, D, H, I)
BP          5000                <250       1 (A)                      5 (D, E, G, H, I)

Close: deviation from the observed data larger than 10% but no more than 50%
Coincident: deviation from the observed data no more than 10%

Therefore, it can be expected that more estimations will move to the coincident classification when a higher epoch number is applied. There is little point, however, in training the model with

a higher number of epochs in this section, as it would require more training time. It can be seen in Table 6 that more training time is required when more epochs are used. The intention to investigate the general trend of neural models trained with BP using different epoch numbers has been adequately addressed by Figures 10-13. In addition, more training could cause over fitting.

4.4 VLR Mapping Output

The next training uses VLR. The same models as in the previous section are trained, and the results are shown in Figures 14-16. Three different maximum epoch numbers are applied in the training, namely 10, 100, and 1000 epochs. As with BP, the model is also trained with 500 epochs; however, that figure is not displayed here, simply to reduce the number of figures reported in this paper and hence avoid confusion. The results are nevertheless reported in Table 6.

VLR was developed by allowing the learning rate to vary depending on the state of the total error (a sketch of such a rule is given at the end of this passage), so the training is expected to converge more quickly than BP. Figures 14-16 depict the results. It can be seen that the outputs are similar to BP when the maximum epoch is limited to ten iterations; the effects of the adaptive learning rate as well as the momentum term are not yet visible at this stage. It should be noted that VLR is basically a gradient descent method, like BP. The estimations are distributed along a straight line, located far from the observations (see Figure 14). No estimation is classified as coincident or close.

Figure 14 VLR mapping figure, trained with 10 epochs

Then, the maximum training epoch is increased to 100 iterations. This also results in a distribution similar to the model trained with BP (see Figure 15). Although no estimation is in the coincident classification, there are five points that are close, namely A, B, D, H, and I. This is slightly different from BP's results (see Table 6). The effects of the dynamic learning rate and momentum start to appear at this stage. Better results should be obtained when more iterations are used. Thus, the neural model is trained to a maximum epoch number of 1000 iterations. The outputs are illustrated in Figure 16.
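The adaptive behaviour described above can be sketched with the ratios from Table 1. The accept/reject logic below follows the usual Vogl-style heuristic and is an assumption about the implementation; the momentum handling is omitted for brevity.

    def adapt_learning_rate(lr, new_error, old_error,
                            lr_inc=1.05, lr_dec=0.7, max_perf_inc=1.04):
        """Heuristic learning-rate adaptation in the spirit of VLR (Vogl et al., 1988),
        using the ratios from Table 1.  If the error grows by more than the allowed
        factor, the step is rejected and the rate is reduced; otherwise the rate grows."""
        if new_error > old_error * max_perf_inc:
            return lr * lr_dec, False      # reject the step, shrink the learning rate
        return lr * lr_inc, True           # accept the step, grow the learning rate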

Figure 15 VLR mapping figure, trained with 100 epochs

Figure 16 VLR mapping figure, trained with 1000 epochs

A substantial improvement is seen in Figure 16. Six estimations are found to be coincident while two estimations are in the close classification. Almost all of the estimations previously classified as close migrate to coincident, namely points A, B, D, and I, plus two newcomers, E and G. Point H remains in the close classification. Thus, the effect of the variable learning rate and momentum is gained when the model is trained with more iterations. The distribution of estimations is also reported in Table 7. Another interesting finding from this table is that the row and column totals are almost the same as the observations. Thus, it can be concluded that VLR converges more quickly than BP, requiring 1000 iterations to properly distribute the estimations.

Table 7 Estimated flow numbers trained with VLR (30 trials, 1000 epochs)

O/D       a         b           c           Total
a      15 (-)     4 (-)       1 (-)       20 (-)
b      18 (-)    21 (-)       2 (100%)    41 (2.5%)
c      17 (-)     4 (-20%)   18 (-)       39 (-2.5%)
Total  50 (-)    29 (-3%)    21 (5%)     101 (1%)

4.5 LM Mapping Output

In terms of the mapping performance of the neural models trained with BP and VLR, BP requires at least 5000 iterations to properly distribute the estimations, whereas VLR requires only one fifth of that number. Another version of BP modification is also used to train the neural model, namely the second partial derivative based training algorithm called the Levenberg-Marquardt (LM) algorithm. The results are shown in Figures 17 and 18.

Using the same classification as before, it can be seen that seven estimations are classified as coincident and another two as close. These are the results of training with 10 iterations. The same model is then trained again with an epoch number of up to 100 iterations. It was found that the neural model trained with LM requires a maximum epoch number of only 39 iterations to properly distribute all of the estimations (see Figure 18 and Table 8). This number is considerably lower than the maximum epochs used by BP and VLR. The key factor in this success is the method of calculating the gradient using the Jacobian matrix. The gradient computation of the training with LM is more efficient than that of BP and VLR, resulting in more effective and efficient training performance. Thus, the use of a second partial derivative based training algorithm can speed up the training (see also Rumelhart et al. (1986)).

Figure 17 LM mapping figure, trained with 10 epochs

Figure 18 LM mapping figure, trained with maximum 39 epochs

The distribution of estimations is reported in Table 8. It can be seen that all estimations are perfectly distributed according to the observed data, with a maximum of 39 epochs. Thus, all column and row totals fit the observed ones. However, it should be noted that this is based on a 3x3 matrix, which is a relatively simple and linear problem. The performance of the

neural model trained with LM may not be as perfect as for this sample flow problem, especially when larger and more complex problems are used.

Table 8 Estimated flow numbers trained with LM (30 trials, maximum 39 epochs)

O/D       a         b         c         Total
a      15 (-)     4 (-)     1 (-)     20 (-)
b      18 (-)    21 (-)     1 (-)     40 (-)
c      17 (-)     5 (-)    18 (-)     40 (-)
Total  50 (-)    30 (-)    20 (-)    100 (-)

4.6 Performance consistency

The training consists of 30 trials, and each trial has different random initial weights. Thus, the final, optimum connection weights resulting from the training will also be different; however, the results should be statistically similar. In order to illustrate this, the best performance of the neural models trained with BP, VLR, and LM is depicted in Figure 19. It shows the MSE for BP, VLR, and LM trained with 5000, 1000, and 39 epochs respectively. It can be seen that the MSE for BP is the highest, followed by VLR. The fluctuation is much more obvious for BP than for the other training algorithms. In order to see the differences between VLR and LM more clearly, the MSE for BP is removed in Figure 20. The difference between VLR and LM then becomes obvious: the MSE for VLR is clearly much higher than for LM, although VLR has been trained for up to 25 times more epochs than the maximum used by LM. The figures suggest that LM tends to perform at a consistently higher level than BP and VLR.

Figure 19 MSE for the best BP, VLR, and LM models

The average MSE over the 30 trials for the different maximum epoch numbers is given in Table 9. The MSE for LM is significantly lower than for the other two, although it was trained for a maximum of 39 epochs compared to 1000 and 5000 epochs for VLR and BP respectively. This is accompanied by a significantly lower standard deviation for LM compared to the other two. However, the lowest coefficient of variation belongs to VLR, followed by BP; LM has the highest coefficient of variation (see the ratio between the SD and the average MSE in Table 9).
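The consistency statistics reported in Table 9 are straightforward to compute from the 30 final MSE values of each algorithm. Whether the paper uses the population or the sample standard deviation is not stated, so the sample version is assumed in the sketch below.

    import numpy as np

    def consistency_stats(final_mse_per_trial):
        """Average MSE, standard deviation, and coefficient of variation (SD / mean)
        over the 30 trials, as reported in Table 9."""
        m = np.asarray(final_mse_per_trial, dtype=float)
        avg = m.mean()
        sd = m.std(ddof=1)          # sample standard deviation over the trials (assumed)
        return {"average_mse": avg, "sd": sd, "cov": sd / avg}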

Figure 20 MSE for the best VLR and LM models

Table 9 Average MSE, SD, and coefficient of variation for BP, VLR, and LM

Statistics                      BP       VLR      LM
Epoch number                    5000     1000     39
Average MSE
SD
Ratio of SD to average MSE

It should be noted that this is for the calibration only, without using the validation dataset. Although LM is trained for a significantly lower number of epochs than the other two, its epoch number over the 30 trials varies from 6 to 39 iterations, with an average of 14 iterations. Its coefficient of variation therefore varies more than those of BP and VLR, whose models are trained with a consistent 5000 and 1000 epochs for all trials. It is expected that when the same epoch number is used, or when the training is stopped by the validation stop, the coefficient of variation for LM will drop significantly.

4.7 Over fitting

The neural model trained with LM is used to illustrate over fitting, as it has the best performance and also the lowest iteration number. It has also been suggested as the best model for passenger trip distribution compared to the BP and VLR models (Yaldi et al., 2010). Because Black's sample flow dataset is too small to be divided into calibration, validation, and testing datasets, a different dataset is used to illustrate the effect of over fitting. It is based on real work trip data collected in Padang in 2005 (Interplan, 2005). The data are split into 40, 30, and 30 per cent for training, validation, and testing respectively. All other procedures are the same as before.

Figures 21-23 and Table 10 show the performance curves and MSE of the LM model for passenger trip estimation. The blue, red, and green lines represent the performance of training, validation, and testing in terms of the normalized mean squared error (MSE). The maximum epochs for the training displayed in Figures 21, 22, and 23 are 22, 100, and 1000 iterations respectively. Firstly, the training is stopped at the 22nd epoch as the testing error starts to increase (Figure 21). The training error at this point is of the order of 1E-06, while the testing error is substantially higher.

Then, the same model is trained again with 100 (Figure 22) and 1000 (Figure 23) epochs. The resulting training MSEs remain of the order of 1E-06 for both 100 and 1000 iterations, while the testing MSEs are higher; these results are provided in Table 10. It can be concluded that a higher number of iterations generates a lower error for the training, but a higher error for the testing. Therefore, the training must be stopped to avoid this over fitting.

Figure 21 Over-fitting behaviour and its impact (22 epochs)

Figure 22 Over-fitting behaviour and its impact (100 epochs)
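The 40/30/30 split and the "stop when the error starts to increase" rule can be sketched as below. The patience of one epoch is an assumption; the paper only states that training is stopped once the error begins to rise.

    import numpy as np

    def split_indices(n_patterns, seed=0):
        """40/30/30 split into training, validation, and testing indices, matching
        the split used for the Padang work-trip data."""
        idx = np.random.default_rng(seed).permutation(n_patterns)
        a, b = int(0.4 * n_patterns), int(0.7 * n_patterns)
        return idx[:a], idx[a:b], idx[b:]

    def early_stopping(validation_errors, patience=1):
        """Return the epoch at which training should stop: the point just before the
        validation error has risen for `patience` consecutive epochs (assumed rule)."""
        for t in range(patience, len(validation_errors)):
            if all(validation_errors[t - i] > validation_errors[t - i - 1]
                   for i in range(patience)):
                return t - patience
        return len(validation_errors) - 1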

Figure 23 Over-fitting behaviour and its impact (1000 epochs)

Table 10 MSE for calibration, validation, and testing

Epoch #    Calibration MSE    Validation MSE    Testing MSE
22
100
1000

Figure 24 depicts the MSE for calibration and testing. The MSE for calibration is much smaller than that for testing, as expected. In order to make the difference visible, the calibration MSE in Figure 24 is scaled up by a constant factor.

Figure 24 MSE for training and validation

5. CONCLUSIONS

The findings from this study suggest that neural models trained with BP, VLR, and LM have different levels of performance in calibrating the three-region flow sample used by Black (1995). The maximum training time required for calibrating the neural model in this study is less than five minutes (see Table 6). This was for the neural model trained with BP for 5000

epochs. Although its mapping performance is better when trained with more epochs, it could experience over fitting when even more epochs are allowed. The real work trip neural model has shown that when the model is trained beyond a certain number of epochs, the calibration performance improves but the testing performance deteriorates. Meanwhile, it takes less than five seconds to train the neural model for 100 epochs with any of the algorithms. This shows that the time to train the model is almost the same for the different training algorithms, yet the performance is very different. The key factor is the method used in adjusting the connection weights and calculating the gradient. The second partial derivative based training algorithm is much more efficient and effective than BP and its ad hoc modification. Therefore, it is recommended to use LM in order to obtain more precise and accurate results. These findings are based on the MLFFNN with ten hidden layer nodes only; when more nodes and bigger datasets are used, the training time is expected to be longer.

REFERENCES

BARNARD, E. (1992) Optimization for training neural nets. IEEE Transactions on Neural Networks, 3.

BLACK, W. R. (1995) Spatial interaction modeling using artificial neural networks. Journal of Transport Geography, 3.

DANTAS, A., YAMAMOTO, K., LAMAR, M. V. & YAMASHITA, Y. (2000) Neural network for travel demand forecast using GIS and remote sensing. In: Proceedings of the IEEE-INNS-ENNS International Joint Conference on Neural Networks (IJCNN 2000), vol. 4.

HAGAN, M. T. & MENHAJ, M. B. (1994) Training feedforward networks with the Marquardt algorithm. IEEE Transactions on Neural Networks, 5.

INTERPLAN, B. (2005) Master Plan of Transportation for Padang City (in Indonesian). Padang: Transportation Agency, Padang City.

JACOBS, R. A. (1988) Increased rates of convergence through learning rate adaptation. Neural Networks, 1.

MOZOLIN, M., THILL, J. C. & LYNN, U. E. (2000) Trip distribution forecasting with multilayer perceptron neural networks: A critical evaluation. Transportation Research Part B: Methodological, 34.

RUMELHART, D. E., HINTON, G. E. & WILLIAMS, R. J. (1986) Learning representations by back-propagating errors. Nature, 323.

SHIR-MOHAMMADLI, M., SHETAB-BUSHEHRI, S. N., POORZAHEDY, H. & HEJAZI, S. R. (2010) A comparative study of a hybrid Logit-Fratar and neural network models for trip distribution: case of the city of Isfahan. Journal of Advanced Transportation, 45.

VOGL, T., MANGIS, J., RIGLER, A., ZINK, W. & ALKON, D. (1988) Accelerating the convergence of the back-propagation method. Biological Cybernetics, 59.

WILAMOWSKI, B. M., IPLIKCI, S., KAYNAK, O. & EFE, M. Ö. (2001) An algorithm for fast convergence in training neural networks. IEEE, 3.

YALDI, G., TAYLOR, M. A. P. & YUE, W. L. (2009) Using artificial neural network in passenger trip distribution modelling (a case study in Padang, Indonesia). Journal of the Eastern Asia Society for Transportation Studies (in press).

YALDI, G., TAYLOR, M. A. P. & YUE, W. L. (2010) Refining the performance of the neural network approach in modelling work trip distribution by using the Levenberg-Marquardt algorithm. Journal of the Society for Transportation and Traffic Studies (JSTS).

YALDI, G., TAYLOR, M. A. P. & YUE, W. L. (2011) Forecasting origin-destination matrices by using neural network approach: A comparison of testing performance between back propagation, variable learning rate and Levenberg-Marquardt algorithms. The 34th Australasian Transport Research Forum, Adelaide, South Australia.


More information

Notes on Multilayer, Feedforward Neural Networks

Notes on Multilayer, Feedforward Neural Networks Notes on Multilayer, Feedforward Neural Networks CS425/528: Machine Learning Fall 2012 Prepared by: Lynne E. Parker [Material in these notes was gleaned from various sources, including E. Alpaydin s book

More information

Performance analysis of a MLP weight initialization algorithm

Performance analysis of a MLP weight initialization algorithm Performance analysis of a MLP weight initialization algorithm Mohamed Karouia (1,2), Régis Lengellé (1) and Thierry Denœux (1) (1) Université de Compiègne U.R.A. CNRS 817 Heudiasyc BP 49 - F-2 Compiègne

More information

Machine Learning 13. week

Machine Learning 13. week Machine Learning 13. week Deep Learning Convolutional Neural Network Recurrent Neural Network 1 Why Deep Learning is so Popular? 1. Increase in the amount of data Thanks to the Internet, huge amount of

More information

Static Gesture Recognition with Restricted Boltzmann Machines

Static Gesture Recognition with Restricted Boltzmann Machines Static Gesture Recognition with Restricted Boltzmann Machines Peter O Donovan Department of Computer Science, University of Toronto 6 Kings College Rd, M5S 3G4, Canada odonovan@dgp.toronto.edu Abstract

More information

Artificial Neuron Modelling Based on Wave Shape

Artificial Neuron Modelling Based on Wave Shape Artificial Neuron Modelling Based on Wave Shape Kieran Greer, Distributed Computing Systems, Belfast, UK. http://distributedcomputingsystems.co.uk Version 1.2 Abstract This paper describes a new model

More information

A Data Classification Algorithm of Internet of Things Based on Neural Network

A Data Classification Algorithm of Internet of Things Based on Neural Network A Data Classification Algorithm of Internet of Things Based on Neural Network https://doi.org/10.3991/ijoe.v13i09.7587 Zhenjun Li Hunan Radio and TV University, Hunan, China 278060389@qq.com Abstract To

More information

Simulation of Back Propagation Neural Network for Iris Flower Classification

Simulation of Back Propagation Neural Network for Iris Flower Classification American Journal of Engineering Research (AJER) e-issn: 2320-0847 p-issn : 2320-0936 Volume-6, Issue-1, pp-200-205 www.ajer.org Research Paper Open Access Simulation of Back Propagation Neural Network

More information

Seismic regionalization based on an artificial neural network

Seismic regionalization based on an artificial neural network Seismic regionalization based on an artificial neural network *Jaime García-Pérez 1) and René Riaño 2) 1), 2) Instituto de Ingeniería, UNAM, CU, Coyoacán, México D.F., 014510, Mexico 1) jgap@pumas.ii.unam.mx

More information

Natural Language Processing CS 6320 Lecture 6 Neural Language Models. Instructor: Sanda Harabagiu

Natural Language Processing CS 6320 Lecture 6 Neural Language Models. Instructor: Sanda Harabagiu Natural Language Processing CS 6320 Lecture 6 Neural Language Models Instructor: Sanda Harabagiu In this lecture We shall cover: Deep Neural Models for Natural Language Processing Introduce Feed Forward

More information

Introduction to and calibration of a conceptual LUTI model based on neural networks

Introduction to and calibration of a conceptual LUTI model based on neural networks Urban Transport 591 Introduction to and calibration of a conceptual LUTI model based on neural networks F. Tillema & M. F. A. M. van Maarseveen Centre for transport studies, Civil Engineering, University

More information

Perceptron: This is convolution!

Perceptron: This is convolution! Perceptron: This is convolution! v v v Shared weights v Filter = local perceptron. Also called kernel. By pooling responses at different locations, we gain robustness to the exact spatial location of image

More information

Transactions on Information and Communications Technologies vol 16, 1996 WIT Press, ISSN

Transactions on Information and Communications Technologies vol 16, 1996 WIT Press,  ISSN Comparative study of fuzzy logic and neural network methods in modeling of simulated steady-state data M. Järvensivu and V. Kanninen Laboratory of Process Control, Department of Chemical Engineering, Helsinki

More information

Spatial Variation of Sea-Level Sea level reconstruction

Spatial Variation of Sea-Level Sea level reconstruction Spatial Variation of Sea-Level Sea level reconstruction Biao Chang Multimedia Environmental Simulation Laboratory School of Civil and Environmental Engineering Georgia Institute of Technology Advisor:

More information

Neural Networks Laboratory EE 329 A

Neural Networks Laboratory EE 329 A Neural Networks Laboratory EE 329 A Introduction: Artificial Neural Networks (ANN) are widely used to approximate complex systems that are difficult to model using conventional modeling techniques such

More information

MODIFIED KALMAN FILTER BASED METHOD FOR TRAINING STATE-RECURRENT MULTILAYER PERCEPTRONS

MODIFIED KALMAN FILTER BASED METHOD FOR TRAINING STATE-RECURRENT MULTILAYER PERCEPTRONS MODIFIED KALMAN FILTER BASED METHOD FOR TRAINING STATE-RECURRENT MULTILAYER PERCEPTRONS Deniz Erdogmus, Justin C. Sanchez 2, Jose C. Principe Computational NeuroEngineering Laboratory, Electrical & Computer

More information

A Network Intrusion Detection System Architecture Based on Snort and. Computational Intelligence

A Network Intrusion Detection System Architecture Based on Snort and. Computational Intelligence 2nd International Conference on Electronics, Network and Computer Engineering (ICENCE 206) A Network Intrusion Detection System Architecture Based on Snort and Computational Intelligence Tao Liu, a, Da

More information

Automatic Machinery Fault Detection and Diagnosis Using Fuzzy Logic

Automatic Machinery Fault Detection and Diagnosis Using Fuzzy Logic Automatic Machinery Fault Detection and Diagnosis Using Fuzzy Logic Chris K. Mechefske Department of Mechanical and Materials Engineering The University of Western Ontario London, Ontario, Canada N6A5B9

More information

XES Tensorflow Process Prediction using the Tensorflow Deep-Learning Framework

XES Tensorflow Process Prediction using the Tensorflow Deep-Learning Framework XES Tensorflow Process Prediction using the Tensorflow Deep-Learning Framework Demo Paper Joerg Evermann 1, Jana-Rebecca Rehse 2,3, and Peter Fettke 2,3 1 Memorial University of Newfoundland 2 German Research

More information

THE NEURAL NETWORKS: APPLICATION AND OPTIMIZATION APPLICATION OF LEVENBERG-MARQUARDT ALGORITHM FOR TIFINAGH CHARACTER RECOGNITION

THE NEURAL NETWORKS: APPLICATION AND OPTIMIZATION APPLICATION OF LEVENBERG-MARQUARDT ALGORITHM FOR TIFINAGH CHARACTER RECOGNITION International Journal of Science, Environment and Technology, Vol. 2, No 5, 2013, 779 786 ISSN 2278-3687 (O) THE NEURAL NETWORKS: APPLICATION AND OPTIMIZATION APPLICATION OF LEVENBERG-MARQUARDT ALGORITHM

More information

A Multiple-Line Fitting Algorithm Without Initialization Yan Guo

A Multiple-Line Fitting Algorithm Without Initialization Yan Guo A Multiple-Line Fitting Algorithm Without Initialization Yan Guo Abstract: The commonest way to fit multiple lines is to use methods incorporate the EM algorithm. However, the EM algorithm dose not guarantee

More information

Back Propagation Neural Network for Classification of IRS-1D Satellite Images

Back Propagation Neural Network for Classification of IRS-1D Satellite Images Bac Propagation eural etwor for Classification of IR-1D atellite Images E. Hosseini Aria, J. Amini, M.R.aradjian Department of geomantics, Faculty of Engineering, ehran University, Iran - aria@engineer.com,

More information

Instantaneously trained neural networks with complex inputs

Instantaneously trained neural networks with complex inputs Louisiana State University LSU Digital Commons LSU Master's Theses Graduate School 2003 Instantaneously trained neural networks with complex inputs Pritam Rajagopal Louisiana State University and Agricultural

More information

Applying Neural Network Architecture for Inverse Kinematics Problem in Robotics

Applying Neural Network Architecture for Inverse Kinematics Problem in Robotics J. Software Engineering & Applications, 2010, 3: 230-239 doi:10.4236/jsea.2010.33028 Published Online March 2010 (http://www.scirp.org/journal/jsea) Applying Neural Network Architecture for Inverse Kinematics

More information

Ensemble methods in machine learning. Example. Neural networks. Neural networks

Ensemble methods in machine learning. Example. Neural networks. Neural networks Ensemble methods in machine learning Bootstrap aggregating (bagging) train an ensemble of models based on randomly resampled versions of the training set, then take a majority vote Example What if you

More information

In this assignment, we investigated the use of neural networks for supervised classification

In this assignment, we investigated the use of neural networks for supervised classification Paul Couchman Fabien Imbault Ronan Tigreat Gorka Urchegui Tellechea Classification assignment (group 6) Image processing MSc Embedded Systems March 2003 Classification includes a broad range of decision-theoric

More information

For Monday. Read chapter 18, sections Homework:

For Monday. Read chapter 18, sections Homework: For Monday Read chapter 18, sections 10-12 The material in section 8 and 9 is interesting, but we won t take time to cover it this semester Homework: Chapter 18, exercise 25 a-b Program 4 Model Neuron

More information

CHAPTER 7 MASS LOSS PREDICTION USING ARTIFICIAL NEURAL NETWORK (ANN)

CHAPTER 7 MASS LOSS PREDICTION USING ARTIFICIAL NEURAL NETWORK (ANN) 128 CHAPTER 7 MASS LOSS PREDICTION USING ARTIFICIAL NEURAL NETWORK (ANN) Various mathematical techniques like regression analysis and software tools have helped to develop a model using equation, which

More information

Analyzing Images Containing Multiple Sparse Patterns with Neural Networks

Analyzing Images Containing Multiple Sparse Patterns with Neural Networks Analyzing Images Containing Multiple Sparse Patterns with Neural Networks Rangachari Anand, Kishan Mehrotra, Chilukuri K. Mohan and Sanjay Ranka School of Computer and Information Science Syracuse University,

More information

6. NEURAL NETWORK BASED PATH PLANNING ALGORITHM 6.1 INTRODUCTION

6. NEURAL NETWORK BASED PATH PLANNING ALGORITHM 6.1 INTRODUCTION 6 NEURAL NETWORK BASED PATH PLANNING ALGORITHM 61 INTRODUCTION In previous chapters path planning algorithms such as trigonometry based path planning algorithm and direction based path planning algorithm

More information

CS231A Course Project Final Report Sign Language Recognition with Unsupervised Feature Learning

CS231A Course Project Final Report Sign Language Recognition with Unsupervised Feature Learning CS231A Course Project Final Report Sign Language Recognition with Unsupervised Feature Learning Justin Chen Stanford University justinkchen@stanford.edu Abstract This paper focuses on experimenting with

More information

NEURAL NETWORK VISUALIZATION

NEURAL NETWORK VISUALIZATION Neural Network Visualization 465 NEURAL NETWORK VISUALIZATION Jakub Wejchert Gerald Tesauro IB M Research T.J. Watson Research Center Yorktown Heights NY 10598 ABSTRACT We have developed graphics to visualize

More information