Identification of Hybrid and Linear Parameter Varying Models via Recursive Piecewise Affine Regression and Discrimination

Valentina Breschi, Alberto Bemporad and Dario Piga

Abstract: Piecewise affine (PWA) regression is a supervised learning method which aims at estimating, from a set of training data, a PWA map approximating the relationship between a set of explanatory variables (commonly called regressors) and continuous-valued outputs. PWA maps are described by a finite collection of affine functions, defined over a polyhedral partition of the regressor space. The PWA regression problem amounts to estimating both the affine functions and the polyhedral partition from data. In this paper, we describe a recursive and numerically efficient PWA regression algorithm based on a combined use of (i) clustering and multi-model recursive least-squares techniques and (ii) linear multi-category discrimination. Then, we discuss its application to the identification of multi-input multi-output PWA dynamical models in autoregressive form and to the identification of linear parameter-varying (LPV) models. In the former identification problem, the regressor is defined as a collection of past input and output observations, and local linear time-invariant models are estimated together with the partition of the regressor space. In the latter problem, PWA approximations of the scheduling-variable-dependent coefficients of an LPV model are estimated from the data, along with a partition of the scheduling variable domain.

I. INTRODUCTION

One of the most challenging problems in regression learning is choosing a proper structure to describe the underlying relation between input (or regressor) vectors x in R^{n_x} and continuous-valued target outputs y in R^{n_y}. On the one hand, too simple a model (e.g., a linear function of the independent variable x) might not be flexible enough to describe the input/output relationship, leading to a systematic error when predicting the output for a given input. On the other hand, complex model structures tend to be computationally demanding and to overfit the training data, leading to poor generalization to unseen data. This problem is commonly referred to in machine learning and system identification as the bias-variance tradeoff [19]. Several methods exist in the literature to deal with the bias-variance tradeoff. A first possibility is to use overparameterized models, described by a large set of a priori fixed nonlinear basis functions, of which only a few might actually be needed for an accurate approximation of the underlying relationship among the variables. Sparse estimation methods like the LASSO [18] are then used to identify a low-complexity model, by minimizing the fitting error along with a penalty term that penalizes the complexity of the model. However, the final estimate strongly depends on the chosen basis functions. Motivated by the need for simple and flexible model structures, algorithms for computing PieceWise Affine (PWA) approximations of nonlinear regression functions have been developed.

(This work was partially supported by the European Commission under the H2020-SPIRE project DISIRE - Distributed In-Situ Sensors Integrated into Raw Material and Energy Feedstock. The authors are with the IMT Institute for Advanced Studies Lucca, Lucca, Italy.)
Piecewise affine functions are obtained by partitioning the regressor space into a finite number of polyhedral regions and by considering an affine submodel on each polyhedron. Both the partition of the regressor space and the parameters defining each affine submodel are inferred from data. PWA regression algorithms can also be extended to the identification of hybrid and PieceWise Affine autoregressive with exogenous inputs (PWARX) systems by defining the regressor as the collection of past input and output observations (see [15] for a literature overview). Among the algorithms developed for the identification of hybrid and PWARX systems, we mention the bounded-error methods [7], [14], the clustering-based procedures [2], [9], [12], the mixed-integer quadratic programming approach [17], the Bayesian method [10], and the sparse optimization method [13]. All these algorithms are executed in a batch fashion, and thus they are not suitable for an on-line implementation, apart from the method in [2]. This paper proposes an alternative, numerically efficient algorithm for the identification of discrete-time multi-input multi-output PWARX models. The same algorithm can be properly extended to the identification of Linear Parameter-Varying (LPV) models. In this case, a PWA approximation of the scheduling-variable-dependent LPV model coefficients is estimated from the training data, along with a polyhedral partition of the scheduling variable domain. The proposed identification approach relies on the two-stage PWA regression algorithm recently developed by the authors and detailed in [6]. The main idea of the algorithm is to sequentially process the observed regressor/output pairs and assign each pair to the affine submodel that is most compatible with it. At the same time, while a pair is being assigned, the parameters of the corresponding submodel are updated via a recursive least-squares technique based on inverse QR decomposition. The second stage starts once all the observations have been classified. At this stage, the regressor domain is partitioned into polyhedral regions through a piecewise linear separator, which is computed by solving a convex unconstrained optimization problem, whose solution can be efficiently computed either off-line through a regularized piecewise-smooth Newton method or recursively through an averaged stochastic gradient descent algorithm.

Overall, the proposed identification approach turns out to be computationally very effective for offline learning, and suitable for online learning. The paper is organized as follows. The general PWA regression problem is formulated in Section II. The specific problems of data-driven modeling of discrete-time PWARX models and LPV models are formulated in Section II-B and Section II-C, respectively, extending the contribution in [6] by showing its applicability to system identification problems. Section III summarizes the general PWA regression method, which consists of an algorithm to simultaneously cluster the observed regressors and update the model parameters, and of two multi-category discrimination algorithms to compute the polyhedral partition of the regressor domain (more details on these algorithms can be found in [6]). The application of the PWA regression algorithm described in Section III to the identification of PWARX and LPV models is discussed in Section IV through two numerical examples. Concluding remarks and some directions for future research are drawn in Section V.

A. Notation

Let R^n be the set of real vectors of dimension n. Let I be a finite set of integers and denote by |I| the cardinality of I. Given a vector a in R^n, a_i denotes the i-th entry of a, a_I the subvector obtained by collecting the entries a_i for all i in I, ||a|| the Euclidean norm of a, and a_+ the vector whose i-th element is max{a_i, 0}. Given two vectors a, b in R^n, max(a, b) is the vector whose i-th component is max{a_i, b_i}. Given a matrix A in R^{n x m}, A' denotes the transpose of A and A_i the i-th row of A. Given two matrices A and B, A (x) B denotes the Kronecker product of A and B. I_n is the identity matrix of size n and 1_n the n-dimensional column vector of ones.

II. PROBLEM STATEMENT

A. Piecewise affine regression

Consider the vector-valued PWA function f : X -> R^{n_y} defined as

  f(x) = A_i [x' 1]'  if x in X_i,  i = 1, ..., s,   (1)

where x in R^{n_x}, X is a subset of R^{n_x}, s denotes the number of modes (i.e., the number of affine functions defining f), A_i in R^{n_y x (n_x+1)} are parameter matrices, and the sets X_i, i = 1, ..., s, are polyhedra defined by the linear inequalities

  X_i := {x in R^{n_x} : H_i x <= D_i},   (2)

with H_i and D_i real-valued matrices, forming a complete polyhedral partition of the space X. (A collection {X_i}_{i=1}^s is a complete partition of the regressor domain X if the union of the sets X_i equals X and the interiors of X_i and X_j are disjoint for all i different from j.) The following regression problem is addressed.

Problem 1 (PWA regression): Consider the data-generating system

  y(k) = f_o(x(k)) + e_o(k),   (3)

where e_o(k) in R^{n_y} is a zero-mean additive random noise independent of x(k) and f_o : X -> R^{n_y} is an unknown function, possibly discontinuous and not necessarily PWA. The PWA regression problem aims at finding a PWA approximation f of the unknown regression function f_o based on a set of N observations {x(k), y(k)}_{k=1}^N.

Estimating a PWA approximation f of the true function f_o requires (i) choosing the number of modes s, (ii) computing the parameter matrices {A_i}_{i=1}^s that characterize the PWA submodels defining f, and (iii) finding the polyhedral partition {X_i}_{i=1}^s of the regressor space X where those submodels are defined. In this work, we assume that the number of modes s is fixed by the user. When choosing the value of s, one must take into account the tradeoff between data fitting and model complexity.
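To make the objects in Problem 1 concrete, the following minimal NumPy sketch evaluates a PWA map of the form (1)-(2), given the parameter matrices A_i and the polyhedra (H_i, D_i); all function and variable names are illustrative, and membership ties on region boundaries are simply resolved by the first match.

    import numpy as np

    def pwa_eval(x, A_list, H_list, D_list):
        # Evaluate the PWA map (1): return A_i @ [x; 1] for the first region
        # X_i = {x : H_i x <= D_i} that contains x.
        x = np.asarray(x, dtype=float)
        xe = np.append(x, 1.0)                      # extended regressor [x; 1]
        for A_i, H_i, D_i in zip(A_list, H_list, D_list):
            if np.all(H_i @ x <= D_i + 1e-9):       # membership test for polyhedron X_i
                return A_i @ xe
        raise ValueError("x lies outside the polyhedral partition of X")

    # Two scalar modes split at x = 0: f(x) = x + 1 for x <= 0, f(x) = -2x + 1 for x >= 0
    A = [np.array([[1.0, 1.0]]), np.array([[-2.0, 1.0]])]
    H = [np.array([[1.0]]), np.array([[-1.0]])]
    D = [np.array([0.0]), np.array([0.0])]
    print(pwa_eval([-0.5], A, H, D), pwa_eval([0.5], A, H, D))   # [0.5] [0.]
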
For small values of s, the PWA map f cannot accurately capture the shape of the nonlinear function f_o. On the other hand, increasing the number of modes also increases the degrees of freedom in the description of the PWA map f, which may cause overfitting and poor generalization to unseen data (i.e., the final estimate becomes sensitive to the noise corrupting the observations), besides increasing the complexity of the estimation procedure and of the resulting PWA model. The value of s can be chosen through cross-validation, by testing, for different values of s, the performance of the estimated model on a fresh set of data (i.e., a calibration data set) which is not used in the estimation of the parameter matrices {A_i}_{i=1}^s and of the polyhedral partition {X_i}_{i=1}^s. The following subsections show how the problems of identifying hybrid dynamical models in PWARX form and Linear Parameter-Varying models can be cast as the PWA regression problem.

B. Identification of PWARX models

Model (1) represents a Multi-Input Multi-Output (MIMO) PWARX discrete-time dynamical system if the regressor vector x(k) at sampling step k is a collection of past input and output observations, i.e.,

  x(k) = [y'(k-1) y'(k-2) ... y'(k-n_a) u'(k-1) u'(k-2) ... u'(k-n_b)]',   (4)

where u(k) in R^{n_u} and y(k) in R^{n_y} denote the observed input and output signals at time k, respectively.

C. Identification of LPV models

Linear parameter-varying systems can be seen as an extension of Linear Time-Invariant (LTI) systems: the dynamic relation between input and output signals is linear, but it can change over time according to a measurable time-varying signal p(k) in R^{n_p}, the so-called scheduling vector.

The identification problem of MIMO LPV systems can be formulated as the PWA regression Problem 1 by adopting the MIMO LPV-ARX form

  y(k) = a_0(p(k)) + sum_{j=1}^{n_a} a_j(p(k)) y(k-j) + sum_{j=1}^{n_b} a_{j+n_a}(p(k)) u(k-j),

where the coefficients a_j(p(k)), j = 0, ..., n_a + n_b, are PWA functions of p(k) defined as

  a_j(p(k)) = A_i^j(p(k))  if p(k) in P_i,  i = 1, ..., s,   (5)

and p(k) in P, a subset of R^{n_p}, is the value of the scheduling vector at time k. By imposing that each entry of the matrix A_i^j(p(k)) depends affinely on the scheduling vector p(k), the LPV identification problem reduces to reconstructing the PWA mapping of the p-dependent coefficient functions {a_j(p(k))}_{j=0}^{n_a+n_b} over the polyhedral partition {P_i}_{i=1}^s of the scheduling vector set P. In this case, the N-length sequence of observations {x(k), y(k)}_{k=1}^N is built with the extended regressor

  x(k) = [y'(k-1) ... y'(k-n_a) u'(k-1) ... u'(k-n_b)]' (x) [1 p'(k)]'.

Note that, since the polyhedral partition {P_i}_{i=1}^s is not fixed a priori, the underlying dependencies of the functions a_j(p(k)) on the scheduling vector p(k) are directly reconstructed from data. This is one of the main advantages w.r.t. widely used parametric LPV identification approaches (see, e.g., [3], [16]), which require parameterizing a_j(p(k)) as a linear combination of basis functions specified a priori (e.g., polynomial functions).

III. PWA REGRESSION ALGORITHM

In this section, we summarize the two-stage algorithm proposed in [6] to solve the general PWA regression problem (namely, Problem 1). Its application to the identification of PWARX and LPV models is discussed in Section IV through two simulation examples. The proposed PWA regression algorithm consists of two stages:

S1. Simultaneous clustering of the regressor vectors {x(k)}_{k=1}^N and estimation of the parameter matrices {A_i}_{i=1}^s describing the PWA map in (1). This is performed recursively, by processing the training pairs {x(k), y(k)} sequentially.

S2. Computation of a polyhedral partition of the regressor space X through a multi-category linear separation method. This stage is performed after stage S1, and it can be executed either offline or online (recursively).

A. Iterative clustering and parameter estimation

Stage S1 is carried out as described in Algorithm 1, which extends to the case of multiple linear regressions and clustering the approach proposed in [1] for solving recursive least-squares problems via inverse QR decomposition.

Algorithm 1: Recursive clustering and parameter estimation
Input: sequence of observations {x(k), y(k)}_{k=1}^N; desired number s of modes; noise covariance matrix Lambda_e; initial conditions for the matrices A_i, the cluster centroids c_i, and the centroid covariance matrices R_i, i = 1, ..., s.
1. let C_i <- empty set, i = 1, ..., s;
2. for k = 1, ..., N do
   2.1. let e_i(k) <- y(k) - A_i [x'(k) 1]', i = 1, ..., s;
   2.2. let i(k) <- arg min over i = 1, ..., s of
        (x(k) - c_i)' R_i^{-1} (x(k) - c_i) + e_i'(k) Lambda_e^{-1} e_i(k);   (6)
   2.3. let C_{i(k)} <- C_{i(k)} union {x(k)};
   2.4. update the parameter matrix A_{i(k)} using the inverse QR factorization approach of [1];
   2.5. let dc_{i(k)} <- (x(k) - c_{i(k)}) / |C_{i(k)}|;
   2.6. update the centroid of cluster C_{i(k)}:
        c_{i(k)} <- c_{i(k)} + dc_{i(k)};   (7)
   2.7. update the centroid covariance matrix of cluster C_{i(k)}:
        R_{i(k)} <- ((|C_{i(k)}| - 1)/|C_{i(k)}|) (R_{i(k)} + dc_{i(k)} dc_{i(k)}') + (1/|C_{i(k)}|) (x(k) - c_{i(k)})(x(k) - c_{i(k)})';   (8)
3. end for;
4. end.
Output: estimated matrices A_1, ..., A_s and clusters C_1, ..., C_s.
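The following NumPy sketch mirrors the structure of Algorithm 1 under simplifying assumptions: step 2.4 is replaced by a standard covariance-form recursive least-squares update instead of the inverse-QR factorization of [1], a small ridge term keeps R_i invertible for nearly empty clusters, and simple default initializations (zero A_i, randomly picked centroids, identity R_i) are hard-coded. It is a sketch of the idea, not the authors' implementation, and all names are illustrative.

    import numpy as np

    def recursive_pwa_fit(X, Y, s, Lam_e=None, seed=0):
        # X: N-by-nx array of regressors, Y: N-by-ny array of outputs.
        X = np.asarray(X, dtype=float)
        Y = np.asarray(Y, dtype=float)
        N, nx = X.shape
        ny = Y.shape[1]
        rng = np.random.default_rng(seed)
        Lam_inv = np.eye(ny) if Lam_e is None else np.linalg.inv(Lam_e)
        A = [np.zeros((ny, nx + 1)) for _ in range(s)]     # parameter matrices A_i
        c = [X[rng.integers(N)].copy() for _ in range(s)]  # centroids c_i (random samples)
        R = [np.eye(nx) for _ in range(s)]                 # centroid covariance matrices R_i
        P = [1e3 * np.eye(nx + 1) for _ in range(s)]       # RLS matrices, one per mode
        count = np.zeros(s, dtype=int)
        labels = np.empty(N, dtype=int)
        for k in range(N):
            xe = np.append(X[k], 1.0)                      # extended regressor [x(k); 1]
            # steps 2.1-2.2: mode selection via criterion (6)
            cost = np.empty(s)
            for i in range(s):
                e = Y[k] - A[i] @ xe
                d = X[k] - c[i]
                cost[i] = d @ np.linalg.solve(R[i] + 1e-8 * np.eye(nx), d) + e @ Lam_inv @ e
            i_k = int(np.argmin(cost))
            labels[k] = i_k
            count[i_k] += 1
            n = count[i_k]
            # step 2.4 (simplified): covariance-form RLS update of A_{i(k)}
            Px = P[i_k] @ xe
            g = Px / (1.0 + xe @ Px)
            A[i_k] = A[i_k] + np.outer(Y[k] - A[i_k] @ xe, g)
            P[i_k] = P[i_k] - np.outer(g, Px)
            # steps 2.5-2.7: recursive centroid and covariance updates (7)-(8)
            dc = (X[k] - c[i_k]) / n
            c[i_k] = c[i_k] + dc
            if n > 1:   # keep the identity initialization for singleton clusters
                R[i_k] = ((n - 1) / n) * (R[i_k] + np.outer(dc, dc)) \
                         + np.outer(X[k] - c[i_k], X[k] - c[i_k]) / n
        return A, c, R, labels

In practice, as discussed next, the quality of the estimate improves if such a routine is executed more than once on the same batch, re-using the returned A_i, c_i and R_i as initial conditions.
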
Algorithm 1 requires an initial guess for the parameter matrices A_i and the cluster centroids c_i, i = 1, ..., s. Zero matrices A_i, randomly chosen centroids c_i, and identity centroid covariance matrices R_i are a possible choice. Alternatively, if Algorithm 1 can be executed in a batch fashion on a subset of data, one can initialize the parameter matrices A_1, ..., A_s all equal to the best linear model

  A_i <- arg min over A of sum_{k=1}^N ||y(k) - A [x'(k) 1]'||^2,  i = 1, ..., s,   (9)

that fits all data, and classify the regressors {x(k)}_{k=1}^N through k-means clustering [11] to compute the cluster centroids and the centroid covariance matrices. Otherwise, the initial guess for the parameters can be obtained by executing Algorithm 1 without the first term in (6) and updating the cluster centroids and covariance matrices only once all the data have been classified. Clearly, the final estimate of the parameter matrices A_1, ..., A_s and of the clusters C_1, ..., C_s depends on their initial values. When working offline on a batch of identification data, estimation quality may be improved by repeating Algorithm 1 iteratively, using its output as the initial condition for the following execution.

After computing the estimation error e_i(k) for all models i at Step 2.1, Step 2.2 picks the best mode i(k) with which the current sample x(k) must be associated (the reader is referred to [6] for a stochastic interpretation of the discrimination criterion (6) used to cluster the observed data points). Step 2.4 updates the parameter matrix A_{i(k)} associated with the selected mode i(k) using the recursive least-squares algorithm of [1] based on inverse QR factorization. Note that only the parameters of the matrix A_{i(k)} associated with the selected mode i(k) are updated at time k, while the parameters associated with the other modes are not.

Remark 1: In case the covariance matrix Lambda_e of the noise e_o cannot be determined a priori, a possible choice is Lambda_e = I_{n_y} (i.e., the regression errors are not weighted in (6)). Alternatively, if Algorithm 1 is executed in a batch fashion and iteratively repeated by using its output as initial condition for the next execution, an estimate of Lambda_e can be computed at the end of each execution as

  Lambda_e_hat = (1/(N - s)) sum_{i=1}^s sum_{k: x(k) in C_i} (y(k) - A_i [x'(k) 1]') (y(k) - A_i [x'(k) 1]')'.

B. Partitioning the regressor space

When one is interested in obtaining also the partition {X_i}_{i=1}^s, besides the affine models {A_i}_{i=1}^s, the clusters {C_i}_{i=1}^s must be separated via linear multicategory discrimination. The linear multicategory discrimination problem amounts to computing a convex piecewise affine separator function phi : R^{n_x} -> R discriminating between the clusters C_1, ..., C_s. The piecewise affine separator phi is defined as the maximum of s affine functions {phi_i(x)}_{i=1}^s, i.e.,

  phi(x) = max over i = 1, ..., s of phi_i(x),   (10)

where each affine function phi_i(x) is described by the parameters omega_i in R^{n_x} and gamma_i in R, namely

  phi_i(x) = x' omega_i - gamma_i.   (11)

For i = 1, ..., s, let M_i be the m_i x n_x matrix (with m_i denoting the cardinality of cluster C_i) obtained by stacking the regressors x(k) belonging to C_i in its rows. According to [8], in the case of linearly separable clusters, the piecewise affine separator phi satisfies the conditions

  M_i omega_i - 1_{m_i} gamma_i >= M_i omega_j - 1_{m_i} gamma_j + 1_{m_i},  i, j = 1, ..., s, i != j.   (12)

1) Off-line multicategory discrimination: Rather than solving a robust linear programming (RLP) problem as in [8], the parameters {omega_i, gamma_i}_{i=1}^s are computed by solving the convex unconstrained optimization problem

  min over xi of (lambda/2) sum_{i=1}^s (||omega_i||^2 + gamma_i^2) + sum_{i=1}^s sum_{j=1, j!=i}^s (1/m_i) ||( M_i (omega_j - omega_i) - 1_{m_i}(gamma_j - gamma_i) + 1_{m_i} )_+||^2,   (13)

with xi = [omega_1' ... omega_s' gamma_1 ... gamma_s]'. Problem (13) aims at minimizing the averaged squared 2-norm of the violation of the inequalities in (12) (a minimal numerical sketch of the separator (10)-(11) and of problem (13) is given below). The regularization parameter lambda > 0 is introduced so that the objective function in (13) is strongly convex. Problem (13) can be efficiently solved via the Regularized Piecewise-Smooth Newton (RPSN) method described in [6] and originally proposed in [5].

2) Recursive multicategory discrimination: As an alternative to the off-line approach to multicategory discrimination, or in addition to it for refining the partition phi online based on streaming data, a recursive approach to problem (13) based on on-line convex programming can be used. Let us treat the data points x in R^{n_x} as random vectors and assume that an oracle function i : R^{n_x} -> {1, ..., s} exists that assigns to any x in R^{n_x} the corresponding mode i(x) in {1, ..., s}.
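Before developing the stochastic formulation, the deterministic objects above can be made concrete with a short sketch: the parameters (omega_i, gamma_i) of the separator (10)-(11) are fitted here by plain full-batch gradient descent on the objective (13), whereas the paper solves (13) with the RPSN method; names, step sizes and iteration counts are illustrative only.

    import numpy as np

    def fit_pwl_separator(X, labels, s, lam=1e-5, lr=1e-2, iters=2000):
        # Full-batch gradient descent on objective (13).
        # Returns (Omega, gamma), with phi_i(x) = Omega[i] @ x - gamma[i].
        X = np.asarray(X, dtype=float)
        nx = X.shape[1]
        M = [X[labels == i] for i in range(s)]          # regressors of cluster C_i, stacked row-wise
        Omega = np.zeros((s, nx))
        gamma = np.zeros(s)
        for _ in range(iters):
            gO = lam * Omega.copy()
            gg = lam * gamma.copy()
            for i in range(s):
                if len(M[i]) == 0:
                    continue
                for j in range(s):
                    if j == i:
                        continue
                    # hinge-type violation of condition (12) for the points of cluster C_i vs mode j
                    v = np.maximum(M[i] @ (Omega[j] - Omega[i]) - gamma[j] + gamma[i] + 1.0, 0.0)
                    gO[j] += 2.0 / len(M[i]) * (M[i].T @ v)
                    gO[i] -= 2.0 / len(M[i]) * (M[i].T @ v)
                    gg[j] -= 2.0 / len(M[i]) * v.sum()
                    gg[i] += 2.0 / len(M[i]) * v.sum()
            Omega -= lr * gO
            gamma -= lr * gg
        return Omega, gamma

    def assign_mode(x, Omega, gamma):
        # Region of the polyhedral partition induced by the separator: argmax_i phi_i(x).
        return int(np.argmax(Omega @ np.asarray(x, dtype=float) - gamma))

Replacing the full-batch gradient with the gradient of a per-sample loss, drawn one data point at a time, leads to the recursive scheme described next.
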
Function i implicitly defines clusters in the data-point space R^{n_x}. Let us also assume that the values

  pi_i = Prob[i(x) = i] = integral over R^{n_x} of delta(i, i(x)) p(x) dx

are known for all i = 1, ..., s, where delta(i, j) = 1 if i = j and zero otherwise. The multicategory discrimination problem (13) can then be generalized to the following convex regularized stochastic optimization problem

  xi* = arg min over xi of E_x[l(x, xi)] + (lambda/2) xi' xi,   (14)
  l(x, xi) = (1/pi_{i(x)}) sum_{j=1, j!=i(x)}^s ( ( x'(omega_j - omega_{i(x)}) - gamma_j + gamma_{i(x)} + 1 )_+ )^2,

where E_x[.] denotes the expected value w.r.t. x. Problem (14) aims at minimizing, on average over x, the violation of the conditions in (12) for i = i(x). The coefficients pi_i can be estimated a priori offline from a subset of data, namely pi_i = m_i/N, and possibly updated iteratively. Numerical experiments have shown, however, that constant and uniform coefficients pi_i = 1/s work equally well. The solution of (14), which provides the piecewise affine multicategory discrimination function phi, can be computed by online convex optimization, through the Averaged Stochastic Gradient Descent (ASGD) method described in [6].

IV. SIMULATION EXAMPLES

In this section, we consider two numerical examples to show the application of the proposed PWA regression approach to the identification of PWARX and LPV systems.

All computations are carried out on an Intel Core i7 2.40-GHz processor with 4 GB of RAM running MATLAB R2014b. In both examples, the output used in the training phase is corrupted by an additive zero-mean white noise e_o with Gaussian distribution. The effect of the measurement noise on the output signal is quantified through the Signal-to-Noise Ratio (SNR), defined for the i-th output channel as

  SNR_i = 10 log10 ( sum_{k=1}^N (y_i(k) - e_{o,i}(k))^2 / sum_{k=1}^N e_{o,i}^2(k) ),   (15)

with e_{o,i}(k) denoting the i-th component of e_o(k). The results obtained after the training phase are validated on a noiseless data sequence. Let y_o and y_hat denote, respectively, the vectors stacking the actual outputs and the simulated outputs of the estimated model, let y_bar_{o,i} be the sample mean of the i-th output, and N_V the length of the validation data sequence. The Best Fit Rate (BFR) and Mean Square Error (MSE) indicators

  BFR_i = max{1 - ||y_{o,i} - y_hat_i|| / ||y_{o,i} - y_bar_{o,i}||, 0},   (16)
  MSE_i = (1/N_V) sum_{k=1}^{N_V} (y_{o,i}(k) - y_hat_i(k))^2,   (17)

defined for each output channel i = 1, ..., n_y, are used to assess the quality of the estimated models (a computational sketch of these indicators is given below).

A. Identification of a multivariable PWARX system

Let the data be generated by a MIMO PWARX system with two inputs and two outputs, whose difference equation expresses y(k) in R^2 as the sum of an affine term in y(k-1) and u(k-1), of a componentwise max(., 0) term applied to a second affine function of y(k-1) and u(k-1), and of the noise e_o(k). The system is therefore characterized by four operating modes, given by the possible combinations of sign of the two components of the first vector argument of the max operator. The excitation input u(k) is a uniformly distributed white noise sequence of length N = 4000, and e_o(k) in R^2 is a zero-mean white noise with covariance matrix Lambda_e, corresponding to signal-to-noise ratios SNR_1 = 8.7 dB and SNR_2 = 6.9 dB on the first and second output channels, respectively. We run Algorithm 1 with s = 4, equal to the number of modes of the data-generating system. The initial guesses for the cluster centroids and covariance matrices are computed by running an instance of Algorithm 1 without the first term in (6) (i.e., only the modeling error is used as a discrimination criterion to cluster the observed data points at the first run). Algorithm 1 is then run again 5 times with the full criterion (6), initializing A_i, c_i and R_i with the output of the previous run. The clusters generated by Algorithm 1 are then separated through the off-line multicategory discrimination method described in Section III-B.1, by solving the convex optimization problem (13) via the RPSN method described in [6], initialized with {omega_i, gamma_i}_{i=1}^s = 0. The regularization parameter lambda is set to 10^-5. The quality of the estimated PWA model is assessed w.r.t. a validation set of N_V = 500 samples. The true output y_o and the open-loop simulated output y_hat generated by the estimated model are plotted in Fig. 1, along with the simulation error y_o(k) - y_hat(k).

TABLE I. PWARX identification: BFR and MSE on the two output channels.
  BFR_1: 96. %   BFR_2: 96.3 %   MSE_1: -   MSE_2: -

TABLE II. PWARX identification: BFR on the two output channels vs length N of the training set.
                             N = 4000   N = 0000   N = 00000
  BFR_1  (Off-line) RLP [8]   96.0 %     96.5 %     99.0 %
  BFR_1  (Off-line) RPSN      96. %      96.4 %     98.9 %
  BFR_1  (On-line) ASGD       86.7 %     95.0 %     96.7 %
  BFR_2  (Off-line) RLP [8]   96. %      96.9 %     99.0 %
  BFR_2  (Off-line) RPSN      96.3 %     96.8 %     99.0 %
  BFR_2  (On-line) ASGD       87.4 %     95. %      96.4 %

TABLE III. PWARX identification: CPU time required to partition the regressor space vs length N of the training set.
                             N = 4000   N = 0000   N = 00000
  (Off-line) RLP [8]            - s       3.7 s      .435 s
  (Off-line) RPSN              0.06 s       - s         - s
  (On-line) ASGD               0.03 s     0.03 s        - s
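For reference, the validation indicators (16)-(17) can be computed channel by channel as in the following sketch, where y_true and y_sim are assumed to store the noiseless validation outputs and the open-loop simulated outputs, one column per output channel:

    import numpy as np

    def bfr_mse(y_true, y_sim):
        # Best Fit Rate (16) and Mean Square Error (17), one value per output channel.
        y_true = np.asarray(y_true, dtype=float)
        y_sim = np.asarray(y_sim, dtype=float)
        if y_true.ndim == 1:                      # allow single-output signals
            y_true, y_sim = y_true[:, None], y_sim[:, None]
        err = y_true - y_sim
        bfr = np.maximum(1.0 - np.linalg.norm(err, axis=0)
                         / np.linalg.norm(y_true - y_true.mean(axis=0), axis=0), 0.0)
        mse = np.mean(err ** 2, axis=0)
        return bfr, mse
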
For the sake of visualization, only a short window of samples related to the first output channel is shown in Fig. 1. The obtained BFR and MSE are reported in Table I. The estimated polyhedral partition of the regressor space misclassifies only a small fraction of the validation samples. As the accuracy of the final model estimate and the total CPU time are influenced by the number M of runs of Algorithm 1, the performance of the proposed learning approach has been tested with respect to both M and the size N of the training set. The obtained BFR as a function of the number of runs of Algorithm 1 is plotted in Fig. 2 for different values of N. Clearly, there is no improvement in the BFR on the two output channels after about 8 runs. The total CPU time for estimating the PWARX model is 0.76 s, of which 0.06 s are taken to solve the linear multi-category discrimination problem (13). For the sake of comparison, both the robust linear programming approach of [8] and the on-line method described in Section III-B.2 are also run to generate the partition. Results in Table II show that all three algorithms used to compute the polyhedral partition of the regressor space lead to an accurate estimate of the system, with BFRs larger than 95 % (except when N = 4000 and the on-line multi-category discrimination method is used). In executing the on-line approach of Section III-B.2, the weights pi_i and the initial guess of phi used by the stochastic gradient method are computed by executing the batch algorithm of Section III-B.1 on the first 3000 training samples.

Fig. 2. PWARX identification: BFR on the first and on the second output channel vs number of runs of Algorithm 1, for different lengths N of the training set.

Furthermore, the results show that the estimated models become more accurate as the number of training samples increases. The CPU times taken to compute the polyhedral partition are reported in Table III, which shows that, for a large training set (i.e., N = 00000), the off-line RPSN and the on-line ASGD methods are about 300x and 600x faster, respectively, than the robust linear programming method of [8].

B. Identification of an LPV system

Let the data be collected from a MIMO LPV system of the form

  y(k) = A_bar(p(k)) y(k-1) + B_bar(p(k)) u(k-1) + e_o(k),   (18)

where the entries a_bar_{i,j}(p(k)) of A_bar(p(k)) and b_bar_{i,j}(p(k)) of B_bar(p(k)) are nonlinear functions of the two-dimensional scheduling vector p(k) = [p_1(k) p_2(k)]': they include saturated linear combinations of p_1(k) and p_2(k) (e.g., a_bar_{1,1}(p(k)) equals 0.4(p_1(k) + p_2(k)) saturated at magnitude 0.3), a sign-type switching function of magnitude 0.5, a sinusoidal term, and a constant entry. Both the input u(k) and the scheduling vector p(k) are uniformly distributed white noise sequences, independent of each other. The noise e_o(k) in R^2 is a zero-mean white noise with covariance matrix Lambda_e, corresponding to signal-to-noise ratios SNR_1 = 4 dB and SNR_2 = 7 dB on the first and on the second output channel, respectively.

Fig. 3. LPV identification: polyhedral partition of the scheduling vector space P.

The goal is to estimate, from the gathered data, a PWA approximation of the p-dependent nonlinear functions a_bar_{i,j} and b_bar_{i,j} defining the behaviour of the LPV data-generating system (18).

1) Choice of the number of modes: The number s of polyhedral regions defining the partition of the scheduling vector space P is chosen through cross-validation. Specifically, the training data set is split into two disjoint sets: the first part is used to estimate a PWA approximation of a_bar_{i,j} and b_bar_{i,j}, along with the polyhedral partition of the scheduling vector space P, for different values of s; for each value of s, Algorithm 1 is run multiple times. The second part of the training data is used to assess the quality of the identified LPV models, and for each value of s the BFR on the two output channels is computed. Among the identified LPV models, the one providing the largest aggregated BFR_T = BFR_1 + BFR_2 is selected, which fixes the number s of polyhedral regions. The corresponding polyhedral partition, obtained by solving problem (13) through the regularized piecewise-smooth Newton method described in [6], is plotted in Fig. 3 (the Hybrid Toolbox for MATLAB [4] has been used to plot the polytopes).

2) Model quality assessment: The quality of the estimated LPV model is then assessed w.r.t. a validation dataset, consisting of a new sequence of N_V noiseless samples used neither to estimate the LPV model nor to select the number of modes s.
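Open-loop simulation of the estimated model on validation data follows the structure of Section II-C: at every step, the active region of the scheduling-space partition is found through the separator of Section III-B, and the corresponding affine parameterization is applied to the extended regressor. The sketch below assumes one particular parameterization, a single stacked coefficient matrix Theta[i] per region acting on the Kronecker-product regressor of Section II-C; it is an illustration of the mechanism, not necessarily the exact implementation used by the authors.

    import numpy as np

    def simulate_lpv_pwa(u, p, Theta, Omega, gamma, na, nb):
        # Theta[i]: coefficient matrix of region i, acting on the extended regressor
        # kron([y(k-1); ...; y(k-na); u(k-1); ...; u(k-nb); 1], [1; p(k)]).
        # (Omega, gamma): piecewise affine separator over the scheduling space,
        # so the active region is argmax_i (Omega[i] @ p(k) - gamma[i]).
        u = np.asarray(u, dtype=float)
        p = np.asarray(p, dtype=float)
        N = u.shape[0]
        ny = Theta[0].shape[0]
        y = np.zeros((N, ny))                 # zero initial conditions for simplicity
        for k in range(max(na, nb), N):
            phi = np.concatenate([y[k - j] for j in range(1, na + 1)]
                                 + [u[k - j] for j in range(1, nb + 1)]
                                 + [np.ones(1)])
            x = np.kron(phi, np.concatenate(([1.0], p[k])))   # affine dependence on p(k)
            i = int(np.argmax(Omega @ p[k] - gamma))          # region containing p(k)
            y[k] = Theta[i] @ x
        return y

The per-step cost reduces to evaluating the s affine functions of the separator plus one matrix-vector product, which is consistent with the low online evaluation times discussed below.
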
For the sake of comparison, the nonlinear coefficient functions a_bar_{i,j}(p(k)) and b_bar_{i,j}(p(k)) are also estimated through the parametric LPV identification approach proposed in [3], by parameterizing them as fourth-order polynomials in the two-dimensional scheduling vector p(k). The true outputs y_o and the simulated output sequences y_hat of the estimated LPV models are plotted in Fig. 4, along with the simulation error y_o(k) - y_hat(k). For the sake of visualization, only a short window of samples related to the second output channel is reported. The BFR and MSE on the two output channels are reported in Table IV.

Fig. 1. PWARX identification: output signal and simulation error on the first output channel. (a) Output signal: black = true, red = estimated. (b) Simulation error.

TABLE IV. LPV identification: BFR and MSE of the LPV models estimated by using the proposed PWA regression approach and the parametric LPV identification approach of [3].
                         BFR_1   BFR_2   MSE_1   MSE_2
  PWA regression          87 %    84 %     -       -
  parametric LPV [3]      80 %    70 %     -       -

The obtained results show that the proposed LPV identification approach, based on the PWA approximation of the coefficient functions a_bar_{i,j}(p(k)) and b_bar_{i,j}(p(k)), outperforms the parametric LPV identification approach of [3]. We also remark that the online computational time required to evaluate the output of the LPV model, given the current value of the scheduling vector p and the past input/output observations, is very limited: about 40 microseconds of it are spent determining which region the current scheduling vector belongs to. Note that the latter step only requires computing the maximum of the s affine functions {phi_i(p)}_{i=1}^s defining the piecewise affine separator phi(p) in (10). This low online computational burden is mainly due to the PWA structure of the coefficient functions describing the LPV model, and it allows using the estimated LPV model in applications requiring a fast online determination of the operating mode, such as gain scheduling or LPV model predictive control.

3) Performance of the multi-category discrimination algorithms: The CPU time required to estimate the LPV model through the proposed PWA regression approach is 759 s, including the cross-validation phase used to choose the number of modes s. For the selected value of s, the CPU time required to compute the LPV model is 4 s, 0.4 s of which are spent computing the polyhedral partition via problem (13) (the robust linear programming multicategory discrimination algorithm of [8] takes about ten times longer). For a more exhaustive comparison between the regularized piecewise-smooth Newton approach used to solve problem (13) and the robust linear programming algorithm of [8], the CPU time required by the two algorithms to partition the scheduling parameter space is plotted, as a function of s, in Fig. 5. Fig. 5 also shows the CPU time required by the averaged stochastic gradient descent algorithm of [6] to compute the solution of problem (14). The weights pi_i and the initial estimate used by the averaged stochastic gradient descent algorithm are computed by solving problem (13) on a first batch of training samples; the remaining 9000 training samples are processed recursively. The regularization parameter lambda in problems (13) and (14) is set to 10^-5. In order to test the performance of the three multicategory discrimination algorithms in terms of model accuracy, the aggregate best fit rate BFR_T obtained by the three algorithms is plotted, as a function of s, in Fig. 6. Results in Figs. 5 and 6 show that:
- the CPU time required by all three discrimination algorithms to partition the scheduling vector space increases with the number of modes s (Fig. 5).
This is due to the fact that the number of parameters in xi defining the piecewise affine separator phi(x) in (10) increases linearly with s;
- the (offline) regularized piecewise-smooth Newton method and the (online) averaged stochastic gradient method used to solve problems (13) and (14), respectively, are faster than the robust linear programming based approach of [8], by a factor of about 6 or more;
- in terms of model accuracy, the robust linear programming approach of [8] and the regularized piecewise-smooth Newton method achieve similar performance (Fig. 6), while, for a few values of s (including s = 4), the averaged stochastic gradient descent algorithm does not provide an accurate partition of the scheduling vector space, leading to LPV models with a noticeably smaller aggregate best fit rate. This means that, for those values of s, the solution of the averaged stochastic gradient descent algorithm fails to converge to the batch solution of problem (13) with the available number of training samples.

Fig. 5. LPV identification: CPU time required to partition the scheduling vector space vs number of modes s.

Fig. 6. LPV identification: aggregate best fit rate BFR_T vs number of modes s.

Fig. 4. LPV model estimate: output signal and simulation error on the second output channel. (a) Output signal: black = true, red = PWA regression estimate, green = polynomial parameterization. (b) Simulation error: red = PWA regression estimate, green = polynomial parameterization.

V. CONCLUSIONS

In this paper we have reviewed the PWA regression algorithm introduced in [6] and discussed its application to the identification of PWARX and LPV dynamical systems. The approach consists of two stages: first, the observed regressors are simultaneously clustered and an affine submodel is estimated for each cluster; second, a convex unconstrained multicategory discrimination problem is solved to determine a piecewise linear separator function and a polyhedral partition of the regressor space. The regularized piecewise-smooth Newton method used to compute the polyhedral partition of the regressor space makes the proposed regression algorithm computationally effective for offline learning, while the averaged stochastic gradient descent method allows processing the data samples recursively (i.e., one data sample at a time), so that the overall algorithm can be implemented in real time, for instance for adaptive control purposes. Future research includes the extension of the PWA regression algorithm presented in the paper to the identification of hybrid and LPV systems under different noise conditions (e.g., Output-Error and Box-Jenkins model structures), and the generalization to piecewise-nonlinear models (such as piecewise polynomial ones) by feeding regression data manipulated through nonlinear basis functions to Algorithm 1 and to the numerical algorithms used to solve the multicategory discrimination problems.

REFERENCES

[1] S.T. Alexander and A.L. Ghirnikar. A method for recursive least squares filtering based upon an inverse QR decomposition. IEEE Transactions on Signal Processing, 41(1):20-30, 1993.
A clustering technique for the identification of piecewise affine systems. Automatica, 39():05 7, 003. [0] A. L. Juloski, S. Weiland, and W. P. M. H. Heemels. A bayesian approach to identification of hybrid systems. IEEE Transactions on Automatic Control, 50(0):50 533, 005. [] A. Likas, N. Vlassis, and J. Verbeek. The global k-means clustering algorithm. Pattern recognition, 36():45 46, 003. [] H. Nakada, K. Takaba, and T. Katayama. Identification of piecewise affine systems based on statistical clustering technique. Automatica, 4(5):905 93, 005. [3] H. Ohlsson and L. Ljung. Identification of switched linear regression models using sum-of-norms regularization. Automatica, 49(4): , 03. [4] N. Ozay, C. Lagoa, and M. Sznaier. Set membership identification of switched linear systems with known number of subsystems. Automatica, 5:80 9, 05. [5] S. Paoletti, A. L. Juloski, G. Ferrari-Trecate, and R. Vidal. Identification of hybrid systems a tutorial. European journal of control, 3():4 60, 007. [6] D. Piga, P. Cox, R. Tóth, and V. Laurain. LPV system identification under noise corrupted scheduling and output signal observations. Automatica, 53:39 338, 05. [7] J. Roll, A. Bemporad, and L. Ljung. Identification of piecewise affine systems via mixed-integer programming. Automatica, 40():37 50, 004. [8] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pages 67 88, 996. [9] V. Vapnik. Statistical learning theory. Wiley New York, 998.


Clustering. CS294 Practical Machine Learning Junming Yin 10/09/06 Clustering CS294 Practical Machine Learning Junming Yin 10/09/06 Outline Introduction Unsupervised learning What is clustering? Application Dissimilarity (similarity) of objects Clustering algorithm K-means,

More information

PATTERN CLASSIFICATION AND SCENE ANALYSIS

PATTERN CLASSIFICATION AND SCENE ANALYSIS PATTERN CLASSIFICATION AND SCENE ANALYSIS RICHARD O. DUDA PETER E. HART Stanford Research Institute, Menlo Park, California A WILEY-INTERSCIENCE PUBLICATION JOHN WILEY & SONS New York Chichester Brisbane

More information

Using Machine Learning to Optimize Storage Systems

Using Machine Learning to Optimize Storage Systems Using Machine Learning to Optimize Storage Systems Dr. Kiran Gunnam 1 Outline 1. Overview 2. Building Flash Models using Logistic Regression. 3. Storage Object classification 4. Storage Allocation recommendation

More information

Cluster Analysis. Ying Shen, SSE, Tongji University

Cluster Analysis. Ying Shen, SSE, Tongji University Cluster Analysis Ying Shen, SSE, Tongji University Cluster analysis Cluster analysis groups data objects based only on the attributes in the data. The main objective is that The objects within a group

More information

CS294-1 Assignment 2 Report

CS294-1 Assignment 2 Report CS294-1 Assignment 2 Report Keling Chen and Huasha Zhao February 24, 2012 1 Introduction The goal of this homework is to predict a users numeric rating for a book from the text of the user s review. The

More information

Network Traffic Measurements and Analysis

Network Traffic Measurements and Analysis DEIB - Politecnico di Milano Fall, 2017 Sources Hastie, Tibshirani, Friedman: The Elements of Statistical Learning James, Witten, Hastie, Tibshirani: An Introduction to Statistical Learning Andrew Ng:

More information

Classification: Linear Discriminant Functions

Classification: Linear Discriminant Functions Classification: Linear Discriminant Functions CE-725: Statistical Pattern Recognition Sharif University of Technology Spring 2013 Soleymani Outline Discriminant functions Linear Discriminant functions

More information

The fastclime Package for Linear Programming and Large-Scale Precision Matrix Estimation in R

The fastclime Package for Linear Programming and Large-Scale Precision Matrix Estimation in R Journal of Machine Learning Research (2013) Submitted ; Published The fastclime Package for Linear Programming and Large-Scale Precision Matrix Estimation in R Haotian Pang Han Liu Robert Vanderbei Princeton

More information

3 Nonlinear Regression

3 Nonlinear Regression CSC 4 / CSC D / CSC C 3 Sometimes linear models are not sufficient to capture the real-world phenomena, and thus nonlinear models are necessary. In regression, all such models will have the same basic

More information

5 Machine Learning Abstractions and Numerical Optimization

5 Machine Learning Abstractions and Numerical Optimization Machine Learning Abstractions and Numerical Optimization 25 5 Machine Learning Abstractions and Numerical Optimization ML ABSTRACTIONS [some meta comments on machine learning] [When you write a large computer

More information

Big Data Methods. Chapter 5: Machine learning. Big Data Methods, Chapter 5, Slide 1

Big Data Methods. Chapter 5: Machine learning. Big Data Methods, Chapter 5, Slide 1 Big Data Methods Chapter 5: Machine learning Big Data Methods, Chapter 5, Slide 1 5.1 Introduction to machine learning What is machine learning? Concerned with the study and development of algorithms that

More information

Robust Kernel Methods in Clustering and Dimensionality Reduction Problems

Robust Kernel Methods in Clustering and Dimensionality Reduction Problems Robust Kernel Methods in Clustering and Dimensionality Reduction Problems Jian Guo, Debadyuti Roy, Jing Wang University of Michigan, Department of Statistics Introduction In this report we propose robust

More information

An algorithm for censored quantile regressions. Abstract

An algorithm for censored quantile regressions. Abstract An algorithm for censored quantile regressions Thanasis Stengos University of Guelph Dianqin Wang University of Guelph Abstract In this paper, we present an algorithm for Censored Quantile Regression (CQR)

More information

3D Surface Recovery via Deterministic Annealing based Piecewise Linear Surface Fitting Algorithm

3D Surface Recovery via Deterministic Annealing based Piecewise Linear Surface Fitting Algorithm 3D Surface Recovery via Deterministic Annealing based Piecewise Linear Surface Fitting Algorithm Bing Han, Chris Paulson, and Dapeng Wu Department of Electrical and Computer Engineering University of Florida

More information

Cluster Analysis. Jia Li Department of Statistics Penn State University. Summer School in Statistics for Astronomers IV June 9-14, 2008

Cluster Analysis. Jia Li Department of Statistics Penn State University. Summer School in Statistics for Astronomers IV June 9-14, 2008 Cluster Analysis Jia Li Department of Statistics Penn State University Summer School in Statistics for Astronomers IV June 9-1, 8 1 Clustering A basic tool in data mining/pattern recognition: Divide a

More information

All lecture slides will be available at CSC2515_Winter15.html

All lecture slides will be available at  CSC2515_Winter15.html CSC2515 Fall 2015 Introduc3on to Machine Learning Lecture 9: Support Vector Machines All lecture slides will be available at http://www.cs.toronto.edu/~urtasun/courses/csc2515/ CSC2515_Winter15.html Many

More information

Pattern Recognition. Kjell Elenius. Speech, Music and Hearing KTH. March 29, 2007 Speech recognition

Pattern Recognition. Kjell Elenius. Speech, Music and Hearing KTH. March 29, 2007 Speech recognition Pattern Recognition Kjell Elenius Speech, Music and Hearing KTH March 29, 2007 Speech recognition 2007 1 Ch 4. Pattern Recognition 1(3) Bayes Decision Theory Minimum-Error-Rate Decision Rules Discriminant

More information

Computation of the Constrained Infinite Time Linear Quadratic Optimal Control Problem

Computation of the Constrained Infinite Time Linear Quadratic Optimal Control Problem Computation of the Constrained Infinite Time Linear Quadratic Optimal Control Problem July 5, Introduction Abstract Problem Statement and Properties In this paper we will consider discrete-time linear

More information

Neural Networks. CE-725: Statistical Pattern Recognition Sharif University of Technology Spring Soleymani

Neural Networks. CE-725: Statistical Pattern Recognition Sharif University of Technology Spring Soleymani Neural Networks CE-725: Statistical Pattern Recognition Sharif University of Technology Spring 2013 Soleymani Outline Biological and artificial neural networks Feed-forward neural networks Single layer

More information

CPSC 340: Machine Learning and Data Mining. Principal Component Analysis Fall 2016

CPSC 340: Machine Learning and Data Mining. Principal Component Analysis Fall 2016 CPSC 340: Machine Learning and Data Mining Principal Component Analysis Fall 2016 A2/Midterm: Admin Grades/solutions will be posted after class. Assignment 4: Posted, due November 14. Extra office hours:

More information

Optimization. Industrial AI Lab.

Optimization. Industrial AI Lab. Optimization Industrial AI Lab. Optimization An important tool in 1) Engineering problem solving and 2) Decision science People optimize Nature optimizes 2 Optimization People optimize (source: http://nautil.us/blog/to-save-drowning-people-ask-yourself-what-would-light-do)

More information

Machine Learning / Jan 27, 2010

Machine Learning / Jan 27, 2010 Revisiting Logistic Regression & Naïve Bayes Aarti Singh Machine Learning 10-701/15-781 Jan 27, 2010 Generative and Discriminative Classifiers Training classifiers involves learning a mapping f: X -> Y,

More information

Unlabeled Data Classification by Support Vector Machines

Unlabeled Data Classification by Support Vector Machines Unlabeled Data Classification by Support Vector Machines Glenn Fung & Olvi L. Mangasarian University of Wisconsin Madison www.cs.wisc.edu/ olvi www.cs.wisc.edu/ gfung The General Problem Given: Points

More information

COMS 4771 Support Vector Machines. Nakul Verma

COMS 4771 Support Vector Machines. Nakul Verma COMS 4771 Support Vector Machines Nakul Verma Last time Decision boundaries for classification Linear decision boundary (linear classification) The Perceptron algorithm Mistake bound for the perceptron

More information

732A54/TDDE31 Big Data Analytics

732A54/TDDE31 Big Data Analytics 732A54/TDDE31 Big Data Analytics Lecture 10: Machine Learning with MapReduce Jose M. Peña IDA, Linköping University, Sweden 1/27 Contents MapReduce Framework Machine Learning with MapReduce Neural Networks

More information

Integer Programming Theory

Integer Programming Theory Integer Programming Theory Laura Galli October 24, 2016 In the following we assume all functions are linear, hence we often drop the term linear. In discrete optimization, we seek to find a solution x

More information

DS Machine Learning and Data Mining I. Alina Oprea Associate Professor, CCIS Northeastern University

DS Machine Learning and Data Mining I. Alina Oprea Associate Professor, CCIS Northeastern University DS 4400 Machine Learning and Data Mining I Alina Oprea Associate Professor, CCIS Northeastern University January 24 2019 Logistics HW 1 is due on Friday 01/25 Project proposal: due Feb 21 1 page description

More information

THE preceding chapters were all devoted to the analysis of images and signals which

THE preceding chapters were all devoted to the analysis of images and signals which Chapter 5 Segmentation of Color, Texture, and Orientation Images THE preceding chapters were all devoted to the analysis of images and signals which take values in IR. It is often necessary, however, to

More information

3 Feature Selection & Feature Extraction

3 Feature Selection & Feature Extraction 3 Feature Selection & Feature Extraction Overview: 3.1 Introduction 3.2 Feature Extraction 3.3 Feature Selection 3.3.1 Max-Dependency, Max-Relevance, Min-Redundancy 3.3.2 Relevance Filter 3.3.3 Redundancy

More information

Advanced Operations Research Techniques IE316. Quiz 1 Review. Dr. Ted Ralphs

Advanced Operations Research Techniques IE316. Quiz 1 Review. Dr. Ted Ralphs Advanced Operations Research Techniques IE316 Quiz 1 Review Dr. Ted Ralphs IE316 Quiz 1 Review 1 Reading for The Quiz Material covered in detail in lecture. 1.1, 1.4, 2.1-2.6, 3.1-3.3, 3.5 Background material

More information

Computational Statistics The basics of maximum likelihood estimation, Bayesian estimation, object recognitions

Computational Statistics The basics of maximum likelihood estimation, Bayesian estimation, object recognitions Computational Statistics The basics of maximum likelihood estimation, Bayesian estimation, object recognitions Thomas Giraud Simon Chabot October 12, 2013 Contents 1 Discriminant analysis 3 1.1 Main idea................................

More information

Algebraic Iterative Methods for Computed Tomography

Algebraic Iterative Methods for Computed Tomography Algebraic Iterative Methods for Computed Tomography Per Christian Hansen DTU Compute Department of Applied Mathematics and Computer Science Technical University of Denmark Per Christian Hansen Algebraic

More information

Perceptron: This is convolution!

Perceptron: This is convolution! Perceptron: This is convolution! v v v Shared weights v Filter = local perceptron. Also called kernel. By pooling responses at different locations, we gain robustness to the exact spatial location of image

More information

The exam is closed book, closed notes except your one-page (two-sided) cheat sheet.

The exam is closed book, closed notes except your one-page (two-sided) cheat sheet. CS 189 Spring 2015 Introduction to Machine Learning Final You have 2 hours 50 minutes for the exam. The exam is closed book, closed notes except your one-page (two-sided) cheat sheet. No calculators or

More information

Support Vector Machines.

Support Vector Machines. Support Vector Machines srihari@buffalo.edu SVM Discussion Overview 1. Overview of SVMs 2. Margin Geometry 3. SVM Optimization 4. Overlapping Distributions 5. Relationship to Logistic Regression 6. Dealing

More information

3 Nonlinear Regression

3 Nonlinear Regression 3 Linear models are often insufficient to capture the real-world phenomena. That is, the relation between the inputs and the outputs we want to be able to predict are not linear. As a consequence, nonlinear

More information

Machine Learning. B. Unsupervised Learning B.1 Cluster Analysis. Lars Schmidt-Thieme

Machine Learning. B. Unsupervised Learning B.1 Cluster Analysis. Lars Schmidt-Thieme Machine Learning B. Unsupervised Learning B.1 Cluster Analysis Lars Schmidt-Thieme Information Systems and Machine Learning Lab (ISMLL) Institute for Computer Science University of Hildesheim, Germany

More information

Louis Fourrier Fabien Gaie Thomas Rolf

Louis Fourrier Fabien Gaie Thomas Rolf CS 229 Stay Alert! The Ford Challenge Louis Fourrier Fabien Gaie Thomas Rolf Louis Fourrier Fabien Gaie Thomas Rolf 1. Problem description a. Goal Our final project is a recent Kaggle competition submitted

More information