A Learning Algorithm for Piecewise Linear Regression

Giancarlo Ferrari-Trecate (1), Marco Muselli (2), Diego Liberati (3), Manfred Morari (1)

(1) Institut für Automatik, ETHZ - ETL, CH 8092 Zürich, Switzerland
(2) Istituto per i Circuiti Elettronici - CNR, via De Marini, 6 - 16149 Genova, Italy
(3) Ce.S.T.I.A. - CNR, c/o Politecnico di Milano, Piazza Leonardo da Vinci, 32 - 20133 Milano, Italy

Abstract

A new learning algorithm for solving piecewise linear regression problems is proposed. It is able to train a proper multilayer feedforward neural network so as to reconstruct a target function assuming a different linear behavior on each set of a polyhedral partition of the input domain. The proposed method combines local estimation, clustering in weight space, classification and regression in order to achieve the desired result. A simulation on a benchmark problem shows the good properties of this new learning algorithm.

1 Introduction

Real-world problems to be solved by artificial neural networks are normally subdivided into two groups according to the range of values assumed by the output. If it is Boolean or nominal, we speak of classification problems; otherwise, when the output is coded by a continuous variable, we face a regression problem. In most cases, the techniques employed to train a connectionist model depend on the kind of problem we are dealing with.

However, applications can be found that lie on the borderline between classification and regression; these occur when the input space can be subdivided into disjoint regions $X_i$ characterized by different behaviors of the function $f$ to be reconstructed. The target of the learning problem is consequently twofold: by analyzing a set of samples of $f$, possibly affected by noise, it has to generate both the collection of regions $X_i$ and the behavior of the unknown function $f$ in each of them.
If the region $X_i$ corresponding to each sample in the training set were known, we could add the index $i$ of the region as an output, thus obtaining a classification problem whose target is to find the effective form of each $X_i$. On the other hand, if the actual partition $X_i$ were known, we could solve several regression problems to find the behavior of the function $f$ within each $X_i$.
Because of this mixed nature, classical techniques for neural network training cannot be applied directly; specific methods are necessary to deal with this kind of problem. Perhaps the simplest situation one can think of is piecewise linear regression: in this case the regions $X_i$ are polyhedra and the behavior of the function $f$ in each $X_i$ can be modeled by a linear expression. Several authors have treated this kind of problem [2, 3, 4, 8], providing algorithms for reaching the desired result. Unfortunately, most of them are difficult to extend beyond two dimensions [2], whereas others consider only local approximations [3, 4], thus missing the effective extension of the regions $X_i$.

In this contribution a new training algorithm for neural networks solving piecewise linear regression problems is proposed. It combines clustering and supervised learning to obtain the correct values for the weights of a proper multilayer feedforward architecture.

2 The piecewise linear regression problem

Let $X$ be a polyhedron in the $n$-dimensional space $\mathbb{R}^n$ and $X_i$, $i = 1, \dots, s$, a polyhedral partition of $X$, i.e. $X_i \cap X_j = \emptyset$ for every $i, j = 1, \dots, s$ with $i \neq j$, and $\bigcup_{i=1}^{s} X_i = X$. The target of a Piecewise Linear Regression (PLR) problem is to reconstruct an unknown function $f : X \to \mathbb{R}$ having a linear behavior in each region $X_i$,

$$f(x) = z_i = w_{i0} + \sum_{j=1}^{n} w_{ij} x_j ,$$

when only a training set $S$ containing $m$ samples $(x_k, y_k)$, $k = 1, \dots, m$, is available. The output $y_k$ is an evaluation of $f(x_k)$ subject to noise, with $x_k \in X$; the region $X_i$ to which $x_k$ belongs is not known in advance. The scalars $w_{i0}, w_{i1}, \dots, w_{in}$, for $i = 1, \dots, s$, uniquely characterize the function $f$, and their estimation is a target of the PLR problem; for notational purposes they will be collected in a vector $w_i$.
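To make the setting concrete, the following minimal Python sketch draws a training set $S$ from a one-dimensional piecewise linear function. The function name `sample_plr`, the two-region target and the noise level are illustrative assumptions and are not taken from the paper; the region index of each sample is returned only for inspection, since the learner does not observe it.

```python
import numpy as np

def sample_plr(m, breakpoints, weights, noise_std, rng):
    """Draw m noisy samples of a 1-D piecewise linear function.

    breakpoints: sorted region boundaries, e.g. [-1, 0, 1] for two regions.
    weights: one (w_i0, w_i1) pair per region, so f(x) = w_i0 + w_i1 * x.
    """
    x = rng.uniform(breakpoints[0], breakpoints[-1], size=m)
    # Index i of the region X_i containing each x_k (hidden from the learner).
    region = np.searchsorted(breakpoints, x, side="right") - 1
    region = np.clip(region, 0, len(weights) - 1)
    w = np.asarray(weights, dtype=float)
    # y_k = f(x_k) + noise, with f linear inside each region.
    y = w[region, 0] + w[region, 1] * x + rng.normal(0.0, noise_std, size=m)
    return x, y, region

# Hypothetical target: f(x) = -x on [-1, 0) and f(x) = 2x on [0, 1).
rng = np.random.default_rng(0)
x, y, region = sample_plr(50, [-1.0, 0.0, 1.0], [(0.0, -1.0), (0.0, 2.0)], 0.1, rng)
```

Only the pairs `(x, y)` would be passed to a learning algorithm; recovering `region` and the weight pairs is exactly the twofold target described above.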
Since the regions $X_i$ are polyhedral, they can be defined by a set of $l_i$ linear inequalities of the following kind:

$$a_{ij0} + \sum_{k=1}^{n} a_{ijk} x_k \le 0 \qquad (1)$$

The scalars $a_{ijk}$, for $j = 1, \dots, l_i$ and $k = 0, 1, \dots, n$, can be collected in a matrix $A_i$, whose estimation is likewise a target of the reconstruction process for every $i = 1, \dots, s$. Discontinuities may be present in the function $f$ at the boundaries between two regions $X_i$.

Following the general idea presented in [8], a neural network realizing a piecewise linear function $f$ of this kind can be modeled as in Fig. 1. It contains a gate layer that verifies inequalities (1) and decides which of the terms $z_i$ must be used as the output $y$ of the whole network. Thus, the $i$-th unit in the gate
layer has output equal to its input $z_i$ if all the constraints (1) are satisfied for $j = 1, \dots, l_i$, and equal to 0 otherwise. All the other units perform a weighted sum of their inputs; the weights of the output neuron, which has no bias, are always set to 1.

Figure 1: General neural network realizing a piecewise linear function (layers, from input to output: Input Layer, Hidden Layer, Gate Layer, Output Layer).

3 The proposed learning algorithm

As previously noted, the solution of a PLR problem requires a technique that combines classification and regression: the former has the aim of finding the matrices $A_i$ to be inserted in the gate layer of the neural network (Fig. 1), whereas the latter provides the weight vectors $w_i$ for the input-to-hidden-layer connections. A method of this kind is reported in Fig. 2; it is composed of four steps, each of which is devoted to a specific task.

The first of them (Step 1) has the aim of obtaining a first estimate of the weight vectors $w_i$ by performing local linear regressions based on small subsets of the whole training set $S$. In fact, points $x_k$ that are close to each other are likely to belong to the same region $X_i$. Thus, for each sample $(x_k, y_k)$, with $k = 1, \dots, m$, we build a set $C_k$ containing $(x_k, y_k)$ and the $c - 1$ distinct pairs $(x, y) \in S$ with the smallest values of the distance $\|x_k - x\|$. The parameter $c$ can be freely chosen, though the inequality $c > n$ must be respected to perform the linear regression, since each local fit has $n + 1$ unknown weights. It can easily be seen that some sets $C_k$, called mixed, will contain input patterns belonging to different regions $X_i$. They lead to wrong estimates for $w_i$, and consequently their number must be kept to a minimum; this can be obtained by lowering the value of $c$. However,
the quality of the estimate improves as the size $c$ of the sets $C_k$ increases; a tradeoff must therefore be struck in selecting a reasonable value for $c$.

ALGORITHM FOR PIECEWISE LINEAR REGRESSION

1. (Local regression) For every $k = 1, \dots, m$ do
   1a. Form the set $C_k$ containing the pair $(x_k, y_k)$ and the samples $(x, y) \in S$ associated with the $c - 1$ nearest neighbors $x$ of $x_k$.
   1b. Perform a linear regression to obtain the weight vector $v_k$ of a linear unit fitting the samples in $C_k$.
2. (Clustering) Perform a clustering process in the space $\mathbb{R}^{n+1}$ to subdivide the set of weight vectors $v_k$ into $s$ groups $V_i$.
3. (Classification) Build a new training set $S'$ containing the $m$ pairs $(x_k, i_k)$, where $V_{i_k}$ is the cluster including $v_k$. Train a multicategory classification method to produce the matrices $A_i$ for the regions $X_i$.
4. (Regression) For every $i = 1, \dots, s$ perform a linear regression on the samples $(x, y) \in S$ with $x \in X_i$ to obtain the weight vector $w_i$ for the $i$-th unit in the hidden layer.

Figure 2: Proposed learning method for piecewise linear regression.

Denote by $v_k$ the weight vector of the linear unit produced through the linear regression on the samples in $C_k$. If the generation of the samples in the training set is not affected by noise, most of the $v_k$ coincide with the desired weight vectors $w_i$. Only mixed sets $C_k$ yield spurious vectors $v_k$, which can be considered as outliers. Nevertheless, even in the presence of noise, a clustering algorithm (Step 2) can be used to determine the sets $V_i$ of vectors $v_k$ associated with the same $w_i$. A proper version of the K-means algorithm [6] can be adopted for this purpose if the number $s$ of regions is fixed beforehand; otherwise, adaptive techniques, such as the Growing Neural Gas [7], can be employed to find the value of $s$ at the same time.

The sets $V_i$ generated by the clustering process induce a classification on the input patterns $x_k$ belonging to the training set $S$.
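Steps 1 and 2 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `local_regressions` and `kmeans` are hypothetical, the clustering is plain Lloyd K-means with the Euclidean norm rather than the modified version of [6], and the deterministic farthest-point initialization is chosen here only for reproducibility (it is sensitive to outliers when the data are noisy).

```python
import numpy as np

def local_regressions(X, y, c):
    """Step 1: fit one linear unit per sample on its c nearest samples."""
    m, n = X.shape
    Phi = np.hstack([np.ones((m, 1)), X])        # rows (1, x_1, ..., x_n)
    V = np.empty((m, n + 1))
    for k in range(m):
        dist = np.linalg.norm(X - X[k], axis=1)
        C_k = np.argsort(dist, kind="stable")[:c]  # x_k and its c - 1 nearest neighbors
        V[k] = np.linalg.lstsq(Phi[C_k], y[C_k], rcond=None)[0]
    return V

def kmeans(V, s, n_iter=50):
    """Step 2 (stand-in): Lloyd K-means on the weight vectors v_k."""
    centers = [V[0]]
    for _ in range(1, s):                          # farthest-point initialization
        d = np.min([np.linalg.norm(V - ctr, axis=1) for ctr in centers], axis=0)
        centers.append(V[d.argmax()])
    centers = np.array(centers)
    for _ in range(n_iter):
        d = np.linalg.norm(V[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)                  # cluster index i_k of each v_k
        for i in range(s):
            if np.any(labels == i):
                centers[i] = V[labels == i].mean(axis=0)
    return labels, centers

# Hypothetical noise-free target with two regions: f(x) = -x for x < 0, 2x otherwise.
x = np.linspace(-1.0, 1.0, 41)
y = np.where(x < 0, -x, 2.0 * x)
V = local_regressions(x[:, None], y, c=5)
labels, centers = kmeans(V, s=2)
```

In this toy run only the few sets $C_k$ that straddle $x = 0$ are mixed, so almost all rows of `V` equal one of the two true weight vectors and the cluster centers land close to them; the array `labels` plays the role of the indices $i_k$ used in Step 3.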
As a matter of fact, if $v_k \in V_i$ for a given $i$, the set $C_k$ is fitted by the linear neuron with weight vector $w_i$, and consequently $x_k$ is located in the region $X_i$. The effective extension of this region can be determined by solving a linear multicategory classification problem (Step 3), whose training set $S'$ is built by adding as output to each input pattern $x_k$ the index $i_k$ of the set $V_{i_k}$ to which the corresponding vector $v_k$ belongs. To avoid the presence of multiply classified points or of unclassified patterns in the input space, proper techniques [1] based on linear and quadratic programming can be employed. In this way the $s$ matrices $A_i$ for the gate layer are generated; they can include redundant rows that are not necessary for the determination of the polyhedral regions $X_i$. These rows can be removed by applying standard linear programming techniques.
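Once the matrices $A_i$ are available, the gate layer of Fig. 1 can be evaluated directly. The Python sketch below is only an illustration: the function name `plr_network`, the example matrices and the weight vectors are hypothetical, and it assumes that each region is described by constraints of the form "affine expression $\le 0$".

```python
import numpy as np

def plr_network(x, A_list, W):
    """Output of a gated piecewise linear network for one input x.

    A_list[i]: (l_i, n + 1) array with rows (a_ij0, a_ij1, ..., a_ijn).
    W[i]: weight vector (w_i0, w_i1, ..., w_in) of the i-th hidden unit.
    Assumption: region X_i is where a_ij0 + sum_k a_ijk * x_k <= 0 for all j.
    """
    x1 = np.concatenate(([1.0], np.atleast_1d(x)))   # prepend the bias input
    y = 0.0
    for A_i, w_i in zip(A_list, W):
        z_i = w_i @ x1                               # hidden linear unit
        gate_open = np.all(A_i @ x1 <= 0.0)          # all constraints of X_i hold
        y += z_i if gate_open else 0.0               # output neuron: unit weights, no bias
    return y

# Hypothetical 1-D example: X_1 = [-1, 0] and X_2 = [0, 1].
A_list = [
    np.array([[-1.0, -1.0],    # -1 - x <= 0, i.e. x >= -1
              [0.0, 1.0]]),    #      x <= 0
    np.array([[0.0, -1.0],     #     -x <= 0, i.e. x >= 0
              [-1.0, 1.0]]),   # -1 + x <= 0, i.e. x <= 1
]
W = np.array([[0.0, -1.0],     # z_1 = -x
              [0.0, 2.0]])     # z_2 = 2x
y_left = plr_network(-0.5, A_list, W)
y_right = plr_network(0.5, A_list, W)
```

With non-strict inequalities on both sides the two example regions overlap at $x = 0$, where both gates would open and the outputs would be summed; this is the kind of multiply classified point that the techniques of [1] are meant to rule out.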
Finally, the weight vectors $w_i$ for the neural network in Fig. 1 can be obtained directly by solving $s$ linear regression problems (Step 4) having as training sets the samples $(x, y) \in S$ with $x \in X_i$, where $X_1, \dots, X_s$ are the regions built by the classification process.

Figure 3: Simulation results for a benchmark problem: a) unknown piecewise linear function $f$ and training set $S$, b) function realized by the trained neural network (dashed line). Both panels plot $y$ against $x$.

4 Simulation results

The proposed algorithm for piecewise linear regression has been tested on a one-dimensional benchmark problem, in order to analyze the quality of the resulting neural network. The unknown function to be reconstructed is the following:

$$f(x) = \begin{cases} -x & \text{if } -4 \le x \le 0 \\ x & \text{if } 0 < x < 2 \\ 2 + 3x & \text{if } 2 \le x \le 4 \end{cases} \qquad (2)$$

with $X = [-4, 4]$ and $s = 3$. A training set $S$ containing $m = 100$ samples $(x, y)$ has been generated, where $y = f(x) + \varepsilon$ and $\varepsilon$ is a normal random variable with zero mean and variance $\sigma^2 = 0.05$. The behavior of $f(x)$ together with the elements of $S$ are depicted in Fig. 3a.

The method described in Fig. 2 has been applied by choosing at Step 1 the value $c = 6$. At Step 2 the number $s$ of regions has been assumed to be known, thus allowing the application of the K-means clustering algorithm [5]; a proper definition of norm has been employed to improve the convergence of the clustering process [6]. Multicategory classification (Step 3) has then been performed by using the method described in [1], which can be easily extended to realize nonlinear boundaries among the $X_i$ when treating a multidimensional
problem. Finally, least squares estimation is adopted to generate the vectors $w_i$ for piecewise linear regression. The resulting neural network realizes the following function, represented as a dashed line in Fig. 3b:

$$f(x) = \begin{cases} -0.0043 - 0.9787x & \text{if } -4 \le x \le -0.24 \\ 0.0899 + 0.9597x & \text{if } -0.24 < x < 2.12 \\ 1.8208 + 3.0608x & \text{if } 2.12 \le x \le 4 \end{cases}$$

As one can note, this is a good approximation to the unknown function (2). Errors can only be detected at the boundaries between two adjacent regions $X_i$; they are mainly due to the effect of mixed sets $C_k$ on the classification process.

References

[1] E. J. Bredensteiner and K. P. Bennett, Multicategory classification by support vector machines. Computational Optimization and Applications, 12 (1999) 53-79.
[2] V. Cherkassky and H. Lari-Najafi, Constrained topological mapping for nonparametric regression analysis. Neural Networks, 4 (1991) 27-40.
[3] C.-H. Choi and J. Y. Choi, Constructive neural networks with piecewise interpolation capabilities for function approximation. IEEE Transactions on Neural Networks, 5 (1994) 936-944.
[4] J. Y. Choi and J. A. Farrell, Nonlinear adaptive control using networks of piecewise linear approximators. IEEE Transactions on Neural Networks, 11 (2000) 390-401.
[5] R. O. Duda and P. E. Hart, Pattern Classification and Scene Analysis. (1973) New York: John Wiley and Sons.
[6] G. Ferrari-Trecate, M. Muselli, D. Liberati, and M. Morari, A Clustering Technique for the Identification of Piecewise Affine Systems. Accepted at the Fourth International Workshop on Hybrid Systems: Computation and Control, Roma, Italy, March 28-30, 2001.
[7] B. Fritzke, A growing neural gas network learns topologies. In Advances in Neural Information Processing Systems 7 (1995) Cambridge, MA: MIT Press, 625-632.
[8] K. Nakayama, A. Hirano, and A. Kanbe, A structure trainable neural network with embedded gating units and its learning algorithm. In Proceedings of the International Joint Conference on Neural Networks (2000) Como, Italy, 253-258.