ANFIS: Adaptive-Network-Based Fuzzy Inference System


Jyh-Shing Roger Jang
Department of Electrical Engineering and Computer Science, University of California, Berkeley, CA 94720

Abstract. This paper presents the architecture and learning procedure underlying ANFIS (Adaptive-Network-based Fuzzy Inference System), a fuzzy inference system implemented in the framework of adaptive networks. By using a hybrid learning procedure, the proposed ANFIS can construct an input-output mapping based on both human knowledge (in the form of fuzzy if-then rules) and stipulated input-output data pairs. In our simulations, we employ the ANFIS architecture to model nonlinear functions, identify nonlinear components on-line in a control system, and predict a chaotic time series, all yielding remarkable results. Comparisons with artificial neural networks and earlier work on fuzzy modeling are listed and discussed. Other extensions of the proposed ANFIS and promising applications to automatic control and signal processing are also suggested.

I. Introduction

System modeling based on conventional mathematical tools (e.g., differential equations) is not well suited for dealing with ill-defined and uncertain systems. By contrast, a fuzzy inference system employing fuzzy if-then rules can model the qualitative aspects of human knowledge and reasoning processes without employing precise quantitative analyses. This fuzzy modeling, or fuzzy identification, first explored systematically by Takagi and Sugeno [54], has found numerous practical applications in control [48, 38] and in prediction and inference [8, 9]. However, some basic aspects of this approach are in need of better understanding. More specifically:

1. No standard methods exist for transforming human knowledge or experience into the rule base and database of a fuzzy inference system.
2. There is a need for effective methods of tuning the membership functions (MF's) so as to minimize the output error measure or maximize a performance index.

In this perspective, the aim of this paper is to suggest a novel architecture called Adaptive-Network-based Fuzzy Inference System, or simply ANFIS, which can serve as a basis for constructing a set of fuzzy if-then rules with appropriate membership functions to generate the stipulated input-output pairs. The next section introduces the basics of fuzzy if-then rules and fuzzy inference systems. Section III describes the structures and learning rules of adaptive networks. By embedding the fuzzy inference system into the framework of adaptive networks, we obtain the ANFIS architecture, the backbone of this paper, which is covered in Section IV. Application examples such as nonlinear function modeling and chaotic time series prediction are given in Section V. Section VI concludes the paper with important extensions and future directions of this work.

II. Fuzzy If-Then Rules and Fuzzy Inference Systems

A. Fuzzy If-Then Rules

(Research supported in part by NASA Grant NCC 2-275, MICRO Grant 92-180, EPRI Agreement RP 8010-34, and the BISC program. IEEE Trans. on Systems, Man, and Cybernetics, vol. 23, no. 3, pp. 665-685, May 1993.)

Figure 1: Fuzzy inference system: crisp input, fuzzification interface, decision-making unit (drawing on a knowledge base composed of a database and a rule base), defuzzification interface, crisp output.

Fuzzy if-then rules or fuzzy conditional statements are expressions of the form IF A THEN B, where A and B are labels of fuzzy sets [66] characterized by appropriate membership functions. Due to their concise form, fuzzy if-then rules are often employed to capture the imprecise modes of reasoning that play an essential role in the human ability to make decisions in an environment of uncertainty and imprecision. An example that describes a simple fact is

If pressure is high, then volume is small,

where pressure and volume are linguistic variables [67], and high and small are linguistic values or labels characterized by membership functions.

Another form of fuzzy if-then rule, proposed by Takagi and Sugeno [53], has fuzzy sets involved only in the premise part. By using Takagi and Sugeno's fuzzy if-then rule, we can describe the resistant force on a moving object as follows:

If velocity is high, then force = k * (velocity)^2,

where, again, high in the premise part is a linguistic label characterized by an appropriate membership function. However, the consequent part is described by a nonfuzzy equation of the input variable, velocity.

Both types of fuzzy if-then rules have been used extensively in both modeling and control. Through the use of linguistic labels and membership functions, a fuzzy if-then rule can easily capture the spirit of a "rule of thumb" used by humans. From another angle, due to the qualifiers on the premise parts, each fuzzy if-then rule can be viewed as a local description of the system under consideration. Fuzzy if-then rules form a core part of the fuzzy inference system to be introduced below.

B. Fuzzy Inference Systems

Fuzzy inference systems are also known as fuzzy-rule-based systems, fuzzy models, fuzzy associative memories (FAM), or fuzzy controllers when used as controllers. Basically a fuzzy inference system is composed of five functional blocks (Figure 1):

- a rule base containing a number of fuzzy if-then rules;
- a database which defines the membership functions of the fuzzy sets used in the fuzzy rules;
- a decision-making unit which performs the inference operations on the rules;
- a fuzzification interface which transforms the crisp inputs into degrees of match with linguistic values;
- a defuzzification interface which transforms the fuzzy results of the inference into a crisp output.

Usually, the rule base and the database are jointly referred to as the knowledge base. The steps of fuzzy reasoning (inference operations upon fuzzy if-then rules) performed by fuzzy inference systems are:

Figure 2: Commonly used fuzzy if-then rules and fuzzy reasoning mechanisms (types 1, 2 and 3), showing premise matching, firing-strength combination via multiplication or min, and defuzzification via weighted average, max composition with centroid of area, or weighted average of crisp outputs.

1. Compare the input variables with the membership functions on the premise part to obtain the membership values (or compatibility measures) of each linguistic label. (This step is often called fuzzification.)
2. Combine (through a specific T-norm operator, usually multiplication or min) the membership values on the premise part to get the firing strength (weight) of each rule.
3. Generate the qualified consequent (either fuzzy or crisp) of each rule depending on the firing strength.
4. Aggregate the qualified consequents to produce a crisp output. (This step is called defuzzification.)

Several types of fuzzy reasoning [5, 6] have been proposed in the literature. Depending on the types of fuzzy reasoning and fuzzy if-then rules employed, most fuzzy inference systems can be classified into three types (Figure 2):

Type 1: The overall output is the weighted average of each rule's crisp output, induced by the rule's firing strength (the product or minimum of the degrees of match with the premise part) and the output membership functions. The output membership functions used in this scheme must be monotonically non-decreasing [55].

Type 2: The overall fuzzy output is derived by applying the max operation to the qualified fuzzy outputs (each of which is equal to the minimum of the firing strength and the output membership function of each rule). Various schemes have been proposed to choose the final crisp output based on the overall fuzzy output; some of them are center of area, bisector of area, mean of maxima, maximum criterion, etc. [5, 6].

Type 3: Takagi and Sugeno's fuzzy if-then rules are used [53]. The output of each rule is a linear combination of the input variables plus a constant term, and the final output is the weighted average of each rule's output.

Figure 2 uses a two-rule, two-input fuzzy inference system to show the different types of fuzzy rules and fuzzy reasoning mentioned above. Be aware that most of the differences lie in the specification of the consequent part (monotonically non-decreasing or bell-shaped membership functions, or a crisp function), and thus the defuzzification schemes (weighted average, centroid of area, etc.) also differ.
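To make the four steps and the type-3 scheme concrete, the following Python sketch runs a two-rule Takagi-Sugeno (type-3) inference. It is a minimal illustration only: the Gaussian membership functions and all coefficient values are arbitrary choices for the example, not values from the paper.

```python
import numpy as np

def gaussmf(x, c, sigma):
    """Gaussian membership function centered at c with width sigma."""
    return np.exp(-((x - c) / sigma) ** 2)

def sugeno_two_rule(x, y):
    """Type-3 (Takagi-Sugeno) reasoning with two illustrative rules:
       Rule 1: if x is small and y is small, then f1 = p1*x + q1*y + r1
       Rule 2: if x is large and y is large, then f2 = p2*x + q2*y + r2"""
    # Step 1: fuzzification -- membership grades of each premise label
    mu_small_x, mu_large_x = gaussmf(x, 0.0, 1.0), gaussmf(x, 2.0, 1.0)
    mu_small_y, mu_large_y = gaussmf(y, 0.0, 1.0), gaussmf(y, 2.0, 1.0)
    # Step 2: firing strengths via the product T-norm
    w1 = mu_small_x * mu_small_y
    w2 = mu_large_x * mu_large_y
    # Step 3: crisp consequent of each rule (coefficients are placeholders)
    f1 = 1.0 * x + 2.0 * y + 0.5     # p1, q1, r1
    f2 = -1.0 * x + 0.5 * y + 2.0    # p2, q2, r2
    # Step 4: defuzzification -- firing-strength-weighted average
    return (w1 * f1 + w2 * f2) / (w1 + w2)

print(sugeno_two_rule(1.0, 1.5))
```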

III. Adaptive Networks: Architectures and Learning Algorithms

Figure 3: An adaptive network, mapping an input vector (x1, x2) to an output vector (y1, y2) through layers of nodes connected by directional links.

This section introduces the architecture and learning procedure of the adaptive network, which is in fact a superset of all kinds of feedforward neural networks with supervised learning capability. An adaptive network, as its name implies, is a network structure consisting of nodes and directional links through which the nodes are connected. Moreover, part or all of the nodes are adaptive, which means that the output of each such node depends on the parameter(s) pertaining to that node, and the learning rule specifies how these parameters should be changed to minimize a prescribed error measure.

The basic learning rule of adaptive networks is based on gradient descent and the chain rule, which was proposed by Werbos [6] in the 1970s. However, due to the state of artificial neural network research at that time, Werbos' early work failed to receive the attention it deserved. In the following presentation, the derivation is based on the author's work [, ], which generalizes the formulas in [4]. Since the basic learning rule is based on the gradient method, which is notorious for its slowness and its tendency to become trapped in local minima, we propose a hybrid learning rule which can speed up the learning process substantially. Both the batch learning and the pattern learning versions of the proposed hybrid learning rule are discussed below.

A. Architecture and Basic Learning Rule

An adaptive network (Figure 3) is a multi-layer feedforward network in which each node performs a particular function (node function) on its incoming signals, using a set of parameters pertaining to this node. The nature of the node functions may vary from node to node, and the choice of each node function depends on the overall input-output function which the adaptive network is required to carry out. Note that the links in an adaptive network only indicate the flow direction of signals between nodes; no weights are associated with the links.

To reflect different adaptive capabilities, we use both circle and square nodes in an adaptive network. A square node (adaptive node) has parameters while a circle node (fixed node) has none. The parameter set of an adaptive network is the union of the parameter sets of all adaptive nodes. In order to achieve a desired input-output mapping, these parameters are updated according to given training data and a gradient-based learning procedure described below.

Suppose that a given adaptive network has $L$ layers and the $k$-th layer has $\#(k)$ nodes. We can denote the node in the $i$-th position of the $k$-th layer by $(k, i)$, and its node function (or node output) by $O_i^k$. Since a node output depends on its incoming signals and its parameter set, we have

$$O_i^k = O_i^k\big(O_1^{k-1}, \ldots, O_{\#(k-1)}^{k-1},\ a, b, c, \ldots\big), \tag{1}$$

where $a$, $b$, $c$, etc. are the parameters pertaining to this node. (Note that we use $O_i^k$ as both the node output and the node function.)

Assuming the given training data set has $P$ entries, we can define the error measure (or "energy function") for the $p$-th ($1 \le p \le P$) entry of the training data as the sum of squared errors:

$$E_p = \sum_{m=1}^{\#(L)} \big(T_{m,p} - O_{m,p}^L\big)^2, \tag{2}$$

where $T_{m,p}$ is the $m$-th component of the $p$-th target output vector, and $O_{m,p}^L$ is the $m$-th component of the actual output vector produced by the presentation of the $p$-th input vector. Hence the overall error measure is $E = \sum_{p=1}^{P} E_p$.

In order to develop a learning procedure that implements gradient descent in $E$ over the parameter space, we first have to calculate the error rate $\partial E_p / \partial O$ for the $p$-th training data pair and for each node output $O$. The error rate for the output node at $(L, i)$ can be calculated readily from equation (2):

$$\frac{\partial E_p}{\partial O_{i,p}^L} = -2\,\big(T_{i,p} - O_{i,p}^L\big). \tag{3}$$

For the internal node at $(k, i)$, the error rate can be derived by the chain rule:

$$\frac{\partial E_p}{\partial O_{i,p}^k} = \sum_{m=1}^{\#(k+1)} \frac{\partial E_p}{\partial O_{m,p}^{k+1}} \, \frac{\partial O_{m,p}^{k+1}}{\partial O_{i,p}^k}, \tag{4}$$

where $1 \le k \le L - 1$. That is, the error rate of an internal node can be expressed as a linear combination of the error rates of the nodes in the next layer. Therefore for all $1 \le k \le L$ and $1 \le i \le \#(k)$, we can find $\partial E_p / \partial O_{i,p}^k$ by equations (3) and (4).

Now if $\alpha$ is a parameter of the given adaptive network, we have

$$\frac{\partial E_p}{\partial \alpha} = \sum_{O^* \in S} \frac{\partial E_p}{\partial O^*} \, \frac{\partial O^*}{\partial \alpha}, \tag{5}$$

where $S$ is the set of nodes whose outputs depend on $\alpha$. Then the derivative of the overall error measure $E$ with respect to $\alpha$ is

$$\frac{\partial E}{\partial \alpha} = \sum_{p=1}^{P} \frac{\partial E_p}{\partial \alpha}. \tag{6}$$

Accordingly, the update formula for the generic parameter $\alpha$ is

$$\Delta \alpha = -\eta \, \frac{\partial E}{\partial \alpha}, \tag{7}$$

in which $\eta$ is a learning rate that can be further expressed as

$$\eta = \frac{k}{\sqrt{\sum_{\alpha} \left( \dfrac{\partial E}{\partial \alpha} \right)^2}}, \tag{8}$$

where $k$ is the step size, the length of each gradient transition in the parameter space. Usually we can change the value of $k$ to vary the speed of convergence. The heuristic rules for changing $k$ are discussed in Section V, where we report simulation results.

Actually, there are two learning paradigms for adaptive networks. With batch learning (off-line learning), the update formula for parameter $\alpha$ is based on equation (6), and the update action takes place only after the whole training data set has been presented, i.e., only after each epoch or sweep. On the other hand, if we want the parameters to be updated immediately after each input-output pair has been presented, then the update formula is based on equation (5), and this is referred to as pattern learning (on-line learning). In the following we will derive a faster hybrid learning rule, along with both of its learning paradigms.
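To make the distinction between the two paradigms concrete, the following Python sketch contrasts batch and pattern updates of a generic parameter vector. It is a minimal illustration, not the paper's implementation: the linear model, the data, and the learning rate are invented for the example.

```python
import numpy as np

def grad_Ep(alpha, x, t):
    """Gradient of the single-pattern squared error Ep = (t - f(x; alpha))^2
    for an illustrative linear model f(x; alpha) = alpha[0]*x + alpha[1]."""
    y = alpha[0] * x + alpha[1]
    return np.array([-2 * (t - y) * x, -2 * (t - y)])

X = np.array([0.0, 1.0, 2.0, 3.0])
T = 2.0 * X + 1.0                      # targets from an assumed "true" model
eta = 0.02

# Batch (off-line) learning: accumulate over all P patterns, then update
# once per epoch (equations (6) and (7)).
alpha = np.zeros(2)
for epoch in range(100):
    g = sum(grad_Ep(alpha, x, t) for x, t in zip(X, T))
    alpha -= eta * g

# Pattern (on-line) learning: update right after each pattern
# (equations (5) and (7)).
alpha = np.zeros(2)
for epoch in range(100):
    for x, t in zip(X, T):
        alpha -= eta * grad_Ep(alpha, x, t)
```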

B. Hybrid Learning Rule: Batch (Off-Line) Learning

Though we can apply the gradient method to identify the parameters in an adaptive network, the method is generally slow and likely to become trapped in local minima. Here we propose a hybrid learning rule [] which combines the gradient method and the least squares estimate (LSE) to identify parameters. For simplicity, assume that the adaptive network under consideration has only one output,

$$\text{output} = F(\vec{I}, S), \tag{9}$$

where $\vec{I}$ is the set of input variables and $S$ is the set of parameters. If there exists a function $H$ such that the composite function $H \circ F$ is linear in some of the elements of $S$, then these elements can be identified by the least squares method. More formally, if the parameter set $S$ can be decomposed into two sets

$$S = S_1 \oplus S_2 \tag{10}$$

(where $\oplus$ represents direct sum) such that $H \circ F$ is linear in the elements of $S_2$, then upon applying $H$ to equation (9), we have

$$H(\text{output}) = H \circ F(\vec{I}, S), \tag{11}$$

which is linear in the elements of $S_2$. Now given values of the elements of $S_1$, we can plug $P$ training data pairs into equation (11) and obtain a matrix equation:

$$A X = B, \tag{12}$$

where $X$ is an unknown vector whose elements are the parameters in $S_2$. Let $|S_2| = M$; then the dimensions of $A$, $X$ and $B$ are $P \times M$, $M \times 1$ and $P \times 1$, respectively. Since $P$ (the number of training data pairs) is usually greater than $M$ (the number of linear parameters), this is an overdetermined problem and generally there is no exact solution to equation (12). Instead, a least squares estimate of $X$, $X^*$, is sought to minimize the squared error $\|AX - B\|^2$. This is a standard problem that forms the grounds for linear regression, adaptive filtering and signal processing. The most well-known formula for $X^*$ uses the pseudo-inverse of $A$:

$$X^* = (A^T A)^{-1} A^T B, \tag{13}$$

where $A^T$ is the transpose of $A$, and $(A^T A)^{-1} A^T$ is the pseudo-inverse of $A$ if $A^T A$ is non-singular. While equation (13) is concise in notation, it is expensive in computation when dealing with the matrix inverse; moreover, it becomes ill-defined if $A^T A$ is singular. As a result, we employ sequential formulas to compute the LSE of $X$. This sequential method of LSE is more efficient (especially when $M$ is small) and can be easily modified to an on-line version (see below) for systems with changing characteristics. Specifically, let the $i$-th row vector of the matrix $A$ defined in equation (12) be $a_i^T$ and the $i$-th element of $B$ be $b_i$; then $X$ can be calculated iteratively using the sequential formulas widely adopted in the literature [, 7, 8, 47]:

$$\begin{aligned} X_{i+1} &= X_i + S_{i+1}\, a_{i+1} \big(b_{i+1} - a_{i+1}^T X_i\big), \\ S_{i+1} &= S_i - \frac{S_i\, a_{i+1} a_{i+1}^T S_i}{1 + a_{i+1}^T S_i\, a_{i+1}}, \qquad i = 0, 1, \ldots, P-1, \end{aligned} \tag{14}$$

where $S_i$ is often called the covariance matrix, and the least squares estimate $X^*$ is equal to $X_P$. The initial conditions needed to bootstrap equation (14) are $X_0 = 0$ and $S_0 = \gamma I$, where $\gamma$ is a large positive number and $I$ is the identity matrix of dimension $M \times M$. When dealing with multi-output adaptive networks (the output in equation (9) is a column vector), equation (14) still applies except that $b_i^T$ is the $i$-th row of matrix $B$.

This sequential least squares estimate of $X$ can be interpreted as a Kalman filter [7] for the process

$$X(k+1) = X(k), \tag{15}$$

$$B(k) = A(k)\,X(k) + \text{noise}, \tag{16}$$

where $X(k) = X_k$, $B(k) = b_k$ and $A(k) = a_k$. For this reason, equation (14) is sometimes loosely referred to as the Kalman filter algorithm. (Note that all the simulations described later are deterministic; no noise is added in the simulation settings.)

Now we can combine the gradient method and the least squares estimate to update the parameters in an adaptive network. Each epoch of this hybrid learning procedure is composed of a forward pass and a backward pass. In the forward pass, we supply input data and functional signals go forward to calculate each node output until the matrices $A$ and $B$ in equation (12) are obtained, and the parameters in $S_2$ are identified by the sequential least squares formulas in equation (14). After the parameters in $S_2$ are identified, the functional signals keep going forward until the error measure is calculated. In the backward pass, the error rates (the derivatives of the error measure with respect to each node output; see equations (3) and (4)) propagate from the output end toward the input end, and the parameters in $S_1$ are updated by the gradient method in equation (7).

For given fixed values of the parameters in $S_1$, the parameters in $S_2$ thus found are guaranteed to be the global optimum in the $S_2$ parameter space, due to the choice of the squared error measure. Not only can this hybrid learning rule decrease the dimension of the search space used by the gradient method, but, in general, it will also substantially cut down the convergence time.
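The sequential formulas of equation (14), together with the forgetting-factor variant introduced later in equation (18), condense into a few lines of NumPy. This is a generic sketch of recursive least squares, not code from the paper; the small check at the end compares it against the pseudo-inverse solution of equation (13).

```python
import numpy as np

def sequential_lse(A, B, gamma=1e6, lam=1.0):
    """Sequential (recursive) least squares estimate of X in A X = B (eq. 14).
    gamma sets the initial covariance S0 = gamma * I; lam is an optional
    forgetting factor (lam = 1 reproduces the plain LSE of eq. 13)."""
    P, M = A.shape
    X = np.zeros((M, 1))
    S = gamma * np.eye(M)          # covariance matrix
    for i in range(P):
        a = A[i:i+1].T             # column vector a_{i+1}
        b = B[i]
        S = (S - (S @ a @ a.T @ S) / (lam + a.T @ S @ a)) / lam
        X = X + S @ a * (b - a.T @ X)
    return X

# Tiny usage check against the pseudo-inverse formula of eq. (13):
A = np.random.rand(20, 3)
B = A @ np.array([[1.0], [-2.0], [0.5]])
print(sequential_lse(A, B).ravel())            # ~ [1, -2, 0.5]
print((np.linalg.pinv(A) @ B).ravel())
```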

Take for example a one-hidden-layer back-propagation neural network with sigmoid activation functions. If this neural network has $p$ output units, then the output in equation (9) is a column vector. Let $H(\cdot)$ be the inverse sigmoid function

$$H(x) = \ln\left(\frac{x}{1 - x}\right); \tag{17}$$

then equation (11) becomes a linear (vector) function such that each element of $H(\text{output})$ is a linear combination of the parameters (weights and thresholds) pertaining to the output layer. In other words:

$S_1$ = weights and thresholds of the hidden layer;
$S_2$ = weights and thresholds of the output layer.

Therefore we can apply the back-propagation learning rule to tune the parameters in the hidden layer, and the parameters in the output layer can be identified by the least squares method. However, it should be kept in mind that by using the least squares method on the data transformed by $H(\cdot)$, the obtained parameters are optimal in terms of the transformed squared error measure instead of the original one. Usually this will not cause a practical problem as long as $H(\cdot)$ is monotonically increasing.

C. Hybrid Learning Rule: Pattern (On-Line) Learning

If the parameters are updated after each data presentation, we have the pattern learning or on-line learning paradigm. This learning paradigm is vital to on-line parameter identification for systems with changing characteristics. To modify the batch learning rule to its on-line version, it is obvious that the gradient descent should be based on $E_p$ (see equation (5)) instead of $E$. Strictly speaking, this is not a truly gradient search procedure to minimize $E$, yet it will approximate one if the learning rate is small.

For the sequential least squares formulas to account for the time-varying characteristics of the incoming data, we need to decay the effects of old data pairs as new data pairs become available. Again, this problem is well studied in the adaptive control and system identification literature, and a number of solutions are available [7]. One simple method is to formulate the squared error measure as a weighted version that gives higher weighting factors to more recent data pairs. This amounts to the addition of a forgetting factor $\lambda$ to the original sequential formula:

$$\begin{aligned} X_{i+1} &= X_i + S_{i+1}\, a_{i+1} \big(b_{i+1} - a_{i+1}^T X_i\big), \\ S_{i+1} &= \frac{1}{\lambda}\left[ S_i - \frac{S_i\, a_{i+1} a_{i+1}^T S_i}{\lambda + a_{i+1}^T S_i\, a_{i+1}} \right], \end{aligned} \tag{18}$$

where the value of $\lambda$ is between 0 and 1. The smaller $\lambda$ is, the faster the effects of old data decay. But a small $\lambda$ sometimes causes numerical instability and should be avoided.

IV. ANFIS: Adaptive-Network-based Fuzzy Inference System

The architecture and learning rules of adaptive networks have been described in the previous section. Functionally, there are almost no constraints on the node functions of an adaptive network except piecewise differentiability. Structurally, the only limitation on the network configuration is that it should be of feedforward type. Due to these minimal restrictions, the adaptive network's applications are immediate and immense in various areas. In this section, we propose a class of adaptive networks which are functionally equivalent to fuzzy inference systems. The proposed architecture is referred to as ANFIS, standing for Adaptive-Network-based Fuzzy Inference System. We describe how to decompose the parameter set in order to apply the hybrid learning rule. Besides, we demonstrate how to apply the Stone-Weierstrass theorem to ANFIS with simplified fuzzy if-then rules, and how the radial basis function network relates to this kind of simplified ANFIS.
A. ANFIS Architecture

For simplicity, we assume the fuzzy inference system under consideration has two inputs $x$ and $y$ and one output $z$. Suppose that the rule base contains two fuzzy if-then rules of Takagi and Sugeno's type [53]:

Rule 1: If $x$ is $A_1$ and $y$ is $B_1$, then $f_1 = p_1 x + q_1 y + r_1$;
Rule 2: If $x$ is $A_2$ and $y$ is $B_2$, then $f_2 = p_2 x + q_2 y + r_2$.

Figure 4: (a) Type-3 fuzzy reasoning; (b) the equivalent ANFIS (type-3 ANFIS), with layer 1 (membership functions), layer 2 (firing strengths), layer 3 (normalization), layer 4 (consequents), and layer 5 (summation).

Then the type-3 fuzzy reasoning is illustrated in Figure 4(a), and the corresponding equivalent ANFIS architecture (type-3 ANFIS) is shown in Figure 4(b). The node functions in the same layer are of the same function family, as described below.

Layer 1. Every node $i$ in this layer is a square node with a node function

$$O_i^1 = \mu_{A_i}(x), \tag{19}$$

where $x$ is the input to node $i$, and $A_i$ is the linguistic label (small, large, etc.) associated with this node function. In other words, $O_i^1$ is the membership function of $A_i$, and it specifies the degree to which the given $x$ satisfies the quantifier $A_i$. Usually we choose $\mu_{A_i}(x)$ to be bell-shaped with maximum equal to 1 and minimum equal to 0, such as the generalized bell function

$$\mu_{A_i}(x) = \frac{1}{1 + \left[ \left( \frac{x - c_i}{a_i} \right)^2 \right]^{b_i}}, \tag{20}$$

or the Gaussian function

$$\mu_{A_i}(x) = \exp\left[ -\left( \frac{x - c_i}{a_i} \right)^2 \right], \tag{21}$$

where $\{a_i, b_i, c_i\}$ (or $\{a_i, c_i\}$ in the latter case) is the parameter set. As the values of these parameters change, the bell-shaped functions vary accordingly, thus exhibiting various forms of membership functions for the linguistic label $A_i$. In fact, any continuous and piecewise differentiable function, such as the commonly used trapezoidal or triangular membership functions, is also a qualified candidate for a node function in this layer. Parameters in this layer are referred to as premise parameters.

Layer 2. Every node in this layer is a circle node labeled Π which multiplies the incoming signals and sends the product out. For instance,

$$w_i = \mu_{A_i}(x) \times \mu_{B_i}(y), \qquad i = 1, 2. \tag{22}$$

Each node output represents the firing strength of a rule. (In fact, other T-norm operators that perform generalized AND can be used as the node function in this layer.)

Layer 3. Every node in this layer is a circle node labeled N. The $i$-th node calculates the ratio of the $i$-th rule's firing strength to the sum of all rules' firing strengths:

$$\bar{w}_i = \frac{w_i}{w_1 + w_2}, \qquad i = 1, 2. \tag{23}$$

Figure 5: (a) Type-1 fuzzy reasoning; (b) the equivalent ANFIS (type-1 ANFIS).

For convenience, outputs of this layer will be called normalized firing strengths.

Layer 4. Every node $i$ in this layer is a square node with a node function

$$O_i^4 = \bar{w}_i f_i = \bar{w}_i (p_i x + q_i y + r_i), \tag{24}$$

where $\bar{w}_i$ is the output of layer 3, and $\{p_i, q_i, r_i\}$ is the parameter set. Parameters in this layer will be referred to as consequent parameters.

Layer 5. The single node in this layer is a circle node labeled Σ that computes the overall output as the summation of all incoming signals, i.e.,

$$O_1^5 = \text{overall output} = \sum_i \bar{w}_i f_i = \frac{\sum_i w_i f_i}{\sum_i w_i}. \tag{25}$$

Thus we have constructed an adaptive network which is functionally equivalent to a type-3 fuzzy inference system. For type-1 fuzzy inference systems, the extension is quite straightforward, and the type-1 ANFIS is shown in Figure 5, where the output of each rule is induced jointly by the output membership function and the firing strength. For type-2 fuzzy inference systems, if we replace the centroid defuzzification operator with a discrete version which calculates the approximate centroid of area, then a type-2 ANFIS can still be constructed accordingly. However, it will be more complicated than its type-3 and type-1 versions, and thus not worth the effort.

Figure 6 shows a 2-input, type-3 ANFIS with 9 rules. Three membership functions are associated with each input, so the input space is partitioned into 9 fuzzy subspaces, each of which is governed by a fuzzy if-then rule. The premise part of a rule defines a fuzzy subspace, while the consequent part specifies the output within this fuzzy subspace.
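The five layers above translate almost line for line into code. The following sketch of the forward pass is an illustration only: the bell-function parameters and consequent coefficients are arbitrary placeholders, not values from the paper.

```python
import numpy as np

def bell(x, a, b, c):
    """Generalized bell membership function (eq. 20)."""
    return 1.0 / (1.0 + (((x - c) / a) ** 2) ** b)

def anfis_forward(x, y, premise, consequent):
    """Forward pass of the two-rule type-3 ANFIS of Figure 4(b).
    premise: [(a,b,c) for A1, A2, B1, B2]; consequent: [(p,q,r) per rule]."""
    A1, A2, B1, B2 = premise
    # Layer 1: membership grades (premise parameters)
    mu = [bell(x, *A1), bell(x, *A2), bell(y, *B1), bell(y, *B2)]
    # Layer 2: firing strengths via the product T-norm
    w = np.array([mu[0] * mu[2], mu[1] * mu[3]])
    # Layer 3: normalized firing strengths
    wn = w / w.sum()
    # Layer 4: weighted rule outputs (consequent parameters)
    f = np.array([p * x + q * y + r for (p, q, r) in consequent])
    # Layer 5: overall output
    return np.dot(wn, f)

premise = [(1.0, 2.0, -1.0), (1.0, 2.0, 1.0),   # A1, A2 on x
           (1.0, 2.0, -1.0), (1.0, 2.0, 1.0)]   # B1, B2 on y
consequent = [(1.0, 1.0, 0.0), (-1.0, 1.0, 2.0)]
print(anfis_forward(0.5, -0.3, premise, consequent))
```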

Figure 6: (a) A 2-input, type-3 ANFIS with 9 rules (premise parameters feeding consequent parameters); (b) the corresponding fuzzy subspaces of the input space.

B. Hybrid Learning Algorithm

From the proposed type-3 ANFIS architecture (Figure 4), it is observed that, given values of the premise parameters, the overall output can be expressed as a linear combination of the consequent parameters. More precisely, the output $f$ in Figure 4 can be rewritten as

$$\begin{aligned} f &= \frac{w_1}{w_1 + w_2} f_1 + \frac{w_2}{w_1 + w_2} f_2 \\ &= \bar{w}_1 (p_1 x + q_1 y + r_1) + \bar{w}_2 (p_2 x + q_2 y + r_2) \\ &= (\bar{w}_1 x)\, p_1 + (\bar{w}_1 y)\, q_1 + (\bar{w}_1)\, r_1 + (\bar{w}_2 x)\, p_2 + (\bar{w}_2 y)\, q_2 + (\bar{w}_2)\, r_2, \end{aligned} \tag{26}$$

which is linear in the consequent parameters ($p_1$, $q_1$, $r_1$, $p_2$, $q_2$ and $r_2$). As a result, we have

$S$ = set of total parameters,
$S_1$ = set of premise parameters,
$S_2$ = set of consequent parameters,

in equation (10); $H(\cdot)$ and $F(\cdot,\cdot)$ are the identity function and the function of the fuzzy inference system, respectively. Therefore the hybrid learning algorithm developed in the previous section can be applied directly. More specifically, in the forward pass of the hybrid learning algorithm, functional signals go forward until layer 4 and the consequent parameters are identified by the least squares estimate. In the backward pass, the error rates propagate backward and the premise parameters are updated by gradient descent. Table 1 summarizes the activities in each pass.

Table 1: Two passes in the hybrid learning procedure for ANFIS.

|                       | forward pass           | backward pass    |
|-----------------------|------------------------|------------------|
| premise parameters    | fixed                  | gradient descent |
| consequent parameters | least squares estimate | fixed            |
| signals               | node outputs           | error rates      |

As mentioned earlier, the consequent parameters thus identified are optimal (in the consequent parameter space) under the condition that the premise parameters are fixed. Accordingly, the hybrid approach converges much faster than strict gradient descent, and it is worthwhile to look for the possibility of decomposing the parameter set in the manner of equation (10). For type-1 ANFIS, this can be achieved if the membership function on the consequent part of each rule is replaced by a piecewise linear approximation with two consequent parameters (Figure 7). In this case, again, the consequent parameters constitute $S_2$, and the hybrid learning rule can be employed directly.

However, it should be noted that the computational complexity of the least squares estimate is higher than that of gradient descent. In fact, there are four methods to update the parameters, as listed below according to their computational complexity (a sketch of the proposed hybrid rule appears at the end of this subsection):

1. Gradient descent only: all parameters are updated by gradient descent.
2. Gradient descent and one pass of LSE: the LSE is applied only once at the very beginning to get the initial values of the consequent parameters, and then gradient descent takes over to update all parameters.

Figure 7: Piecewise linear approximation of the membership functions on the consequent part of a type-1 ANFIS.

3. Gradient descent and LSE: this is the proposed hybrid learning rule.
4. Sequential (approximate) LSE only: the ANFIS is linearized with respect to the premise parameters and the extended Kalman filter algorithm is employed to update all parameters. This has been proposed in the neural network literature [45, 44, 43].

The choice among the above methods should be based on the trade-off between computational complexity and resulting performance. Our simulations presented in the next section were performed with the third method. Note that the consequent parameters can also be updated by the Widrow-Hoff LMS algorithm [63], as reported in [46]. The Widrow-Hoff algorithm requires less computation and favors parallel hardware implementation, but it converges relatively slowly when compared to the least squares estimate.

As pointed out by one of the reviewers, the learning mechanisms should not be applied to the determination of membership functions, since they convey linguistic and subjective descriptions of ill-defined concepts. We think this is a case-by-case situation and the decision should be left to the users. In principle, if the size of the available input-output data set is large enough, then fine-tuning of the membership functions is applicable (or even necessary), since human-determined membership functions are subject to differences from person to person and from time to time; therefore they are rarely optimal in terms of reproducing desired outputs. However, if the data set is too small, then it probably does not contain enough information about the system under consideration. In this situation, the human-determined membership functions represent important knowledge obtained through human experts' experience that might not be reflected in the data set; therefore the membership functions should be kept fixed throughout the learning process.

Interestingly enough, if the membership functions are fixed and only the consequent part is adjusted, the ANFIS can be viewed as a functional-link network [, 36] where the "enhanced representation" of the input variables is achieved by the membership functions. This "enhanced representation", which takes advantage of human knowledge, is apparently more insight-revealing than the functional expansion and the tensor (outer product) models [36]. By fine-tuning the membership functions, we actually make this "enhanced representation" adaptive as well.

Because the update formulas of the premise and consequent parameters are decoupled in the hybrid learning rule (see Table 1), further speedup of learning is possible by using other versions of the gradient method on the premise parameters, such as conjugate gradient descent, second-order back-propagation [37], quick-propagation [5], nonlinear optimization [58] and many others.
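Here is a compact sketch of one epoch of the proposed method 3 for the two-rule ANFIS above, reusing the bell function from the forward-pass sketch. np.linalg.lstsq plays the role of the sequential LSE of equation (14), and a finite-difference gradient stands in for the back-propagated error rates of equations (3) and (4) purely to keep the illustration short; premise is assumed to be a NumPy array of shape (4, 3) holding (a, b, c) per membership function.

```python
import numpy as np

def hybrid_epoch(X, Y, T, premise, eta=0.01):
    """One epoch of hybrid learning (method 3) for the two-rule ANFIS.
    Forward pass: consequent parameters by LSE. Backward pass: premise
    parameters by gradient descent, approximated with finite differences."""
    def regressors(x, y, prem):
        A1, A2, B1, B2 = prem
        w = np.array([bell(x, *A1) * bell(y, *B1),
                      bell(x, *A2) * bell(y, *B2)])
        wn = w / w.sum()
        # One row of the LSE matrix A in eq. (26): [w1x, w1y, w1, w2x, w2y, w2]
        return np.concatenate([wn[0] * np.array([x, y, 1.0]),
                               wn[1] * np.array([x, y, 1.0])])

    def sse(prem):
        A = np.array([regressors(x, y, prem) for x, y in zip(X, Y)])
        theta = np.linalg.lstsq(A, T, rcond=None)[0]   # LSE of consequents
        return np.sum((A @ theta - T) ** 2), theta

    E, theta = sse(premise)
    grad = np.zeros_like(premise)
    h = 1e-6
    for idx in np.ndindex(premise.shape):              # finite-difference grad
        pert = premise.copy()
        pert[idx] += h
        grad[idx] = (sse(pert)[0] - E) / h
    return premise - eta * grad, theta, E
```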
C. Fuzzy Inference Systems with Simplified Fuzzy If-Then Rules

Though the reasoning mechanisms (Figure 2) introduced earlier are commonly used in the literature, each of them has inherent drawbacks. For type-1 reasoning (Figure 2 or 5), the membership functions on the consequent part are restricted to monotonically non-decreasing functions, which are not compatible with linguistic terms such as "medium", whose membership function should be bell-shaped. For type-2 reasoning (Figure 2), the defuzzification process is time-consuming, and systematic fine-tuning of the parameters is not easy. For type-3 reasoning (Figure 2 or 4), it is simply hard to assign any appropriate linguistic term to the consequent part, which is a nonfuzzy function of the input variables. To cope with these disadvantages, simplified fuzzy if-then rules of the following form are introduced:

If $x$ is big and $y$ is small, then $z$ is $c$,

where $c$ is a crisp value. Due to the fact that the output $z$ is described by a crisp value (or, equivalently, a singular membership function), this class of simplified fuzzy if-then rules can employ all three types of reasoning mechanisms.

More specifically, the consequent part of this simplified fuzzy if-then rule is represented by a step function (centered at $z = c$) in type 1, a singular membership function (at $z = c$) in type 2, and a constant output function in type 3, respectively. Thus the three reasoning mechanisms are unified under this simplified fuzzy if-then rule.

Most of all, with this simplified fuzzy if-then rule, it is possible to prove that under certain circumstances the resulting fuzzy inference system has unlimited approximation power to match any nonlinear function arbitrarily well on a compact set. We will proceed in a descriptive way by applying the Stone-Weierstrass theorem [, 4], stated below.

Theorem. Let domain $D$ be a compact space of $N$ dimensions, and let $\mathcal{F}$ be a set of continuous real-valued functions on $D$ satisfying the following criteria:

1. Identity function: the constant function $f(x) = 1$ is in $\mathcal{F}$.
2. Separability: for any two points $x_1 \ne x_2$ in $D$, there is an $f$ in $\mathcal{F}$ such that $f(x_1) \ne f(x_2)$.
3. Algebraic closure: if $f$ and $g$ are any two functions in $\mathcal{F}$, then $fg$ and $af + bg$ are in $\mathcal{F}$ for any two real numbers $a$ and $b$.

Then $\mathcal{F}$ is dense in $C(D)$, the set of continuous real-valued functions on $D$. In other words, for any $\epsilon > 0$ and any function $g$ in $C(D)$, there is a function $f$ in $\mathcal{F}$ such that $|g(x) - f(x)| < \epsilon$ for all $x \in D$.

In applications of fuzzy inference systems, the domain in which we operate is almost always closed and bounded, and therefore it is compact. For the first and second criteria, it is trivial to find simplified fuzzy inference systems that satisfy them. Now all we need to do is examine algebraic closure under addition and multiplication. Suppose we have two fuzzy inference systems $S$ and $\hat{S}$; each has two rules, and the outputs of the two systems can be expressed as

$$z = \frac{w_1 f_1 + w_2 f_2}{w_1 + w_2}, \tag{27}$$

$$\hat{z} = \frac{\hat{w}_1 \hat{f}_1 + \hat{w}_2 \hat{f}_2}{\hat{w}_1 + \hat{w}_2}, \tag{28}$$

where $f_1$, $f_2$, $\hat{f}_1$ and $\hat{f}_2$ are the constant outputs of the rules. Then $az + b\hat{z}$ and $z\hat{z}$ can be calculated as follows:

$$az + b\hat{z} = \frac{\sum_{i,j} w_i \hat{w}_j \big(a f_i + b \hat{f}_j\big)}{\sum_{i,j} w_i \hat{w}_j}, \qquad z\hat{z} = \frac{\sum_{i,j} w_i \hat{w}_j\, f_i \hat{f}_j}{\sum_{i,j} w_i \hat{w}_j}, \tag{29}$$

which are of the same form as equations (27) and (28). Apparently the ANFIS architectures that compute $az + b\hat{z}$ and $z\hat{z}$ are of the same class as those of $S$ and $\hat{S}$ if and only if the class of membership functions is invariant under multiplication. This is loosely true if the class of membership functions is the set of all bell-shaped functions, since the multiplication of two bell-shaped functions is almost always still bell-shaped. Another, more tightly defined class of membership functions satisfying this criterion, as pointed out by Wang [56, 57], is the scaled Gaussian membership function:

$$\mu_{A_i}(x) = a_i \exp\left[ -\left( \frac{x - c_i}{\sigma_i} \right)^2 \right]. \tag{30}$$

Therefore, by choosing an appropriate class of membership functions, we can conclude that the ANFIS with simplified fuzzy if-then rules satisfies the criteria of the Stone-Weierstrass theorem. Consequently, for any given $\epsilon > 0$ and any real-valued function $g$, there is a fuzzy inference system $f$ such that $|g(x) - f(x)| < \epsilon$ for all $x$ in the underlying compact set. Moreover, since the simplified ANFIS is a proper subset of all three types of ANFIS in Figure 2, we can draw the conclusion that all three types of ANFIS have unlimited approximation power to match any given data set.

Figure 8: A typical initial membership function setting in our simulation.

However, caution has to be taken in accepting this claim, since nothing has been said about how to construct the ANFIS according to a given data set. That is why learning plays a role in this context.

Another interesting aspect of the simplified ANFIS architecture is its functional equivalence to the radial basis function network (RBFN). This functional equivalence is established when the Gaussian membership function is used in the simplified ANFIS. A detailed treatment can be found in [4]. This functional equivalence provides a shortcut to a better understanding of ANFIS and RBFN, in the sense that advances in either literature apply to both directly. For instance, the hybrid learning rule of ANFIS can be applied to RBFN directly and, vice versa, the approaches used to identify RBFN parameters, such as clustering preprocessing [3, 3], orthogonal least squares learning [3], generalization properties [], and sequential adaptation [6], among others [5, 33], are all applicable techniques for ANFIS.

V. Application Examples

This section presents the simulation results of the proposed type-3 ANFIS with both batch (off-line) and pattern (on-line) learning. In the first two examples, ANFIS is used to model highly nonlinear functions, and the results are compared with a neural network approach and earlier work. In the third example, ANFIS is used as an identifier to identify a nonlinear component on-line in a discrete control system. Lastly, we use ANFIS to predict a chaotic time series and compare the results with various statistical and connectionist approaches.

A. Practical Considerations

In a conventional fuzzy inference system, the number of rules is decided by an expert who is familiar with the system to be modeled. In our simulations, however, no expert is available, and the number of membership functions (MF's) assigned to each input variable is chosen empirically, i.e., by examining the desired input-output data and/or by trial and error. This situation is much the same as that of neural networks; there are no simple ways to determine in advance the minimal number of hidden nodes necessary to achieve a desired performance level.

After the number of MF's associated with each input is fixed, the initial values of the premise parameters are set in such a way that the MF's are equally spaced along the operating range of each input variable. Moreover, they satisfy ε-completeness [5, 6] with ε = 0.5, which means that given a value $x$ of one of the inputs in the operating range, we can always find a linguistic label $A$ such that $\mu_A(x) \ge \epsilon$. In this manner, the fuzzy inference system can provide smooth transitions and sufficient overlap from one linguistic label to another. Though we did not attempt to maintain ε-completeness during learning in our simulations, it can easily be achieved by using the constrained gradient method [65]. Figure 8 shows a typical initial MF setting when the number of MF's is 4. Note that throughout the simulation examples presented below, all the membership functions used are the bell function defined in equation (20):

$$\mu_A(x) = \frac{1}{1 + \left[ \left( \frac{x - c}{a} \right)^2 \right]^{b}}, \tag{31}$$

Figure 9: Physical meanings of the parameters in the bell membership function: c is the center, a is the half width, and the slopes at the crossover points x = c - a and x = c + a (where the MF value is 0.5) have magnitude b/(2a).

Figure 10: Two heuristic rules for updating the step size: increase it after 4 consecutive reductions of the error measure (point A); decrease it after 2 consecutive combinations of one increase and one reduction (point B).

which contains three fitting parameters $a$, $b$ and $c$. Each of these parameters has a physical meaning: $c$ determines the center of the corresponding membership function; $a$ is the half width; and $b$ (together with $a$) controls the slopes at the crossover points (where the MF value is 0.5). Figure 9 shows these concepts.

We mentioned that the step size $k$ in equation (8) may influence the speed of convergence. It is observed that if $k$ is small, the gradient method will closely approximate the gradient path, but convergence will be slow since the gradient must be calculated many times. On the other hand, if $k$ is large, convergence will initially be very fast, but the algorithm will oscillate about the optimum. Based on these observations, we update $k$ according to the following two heuristic rules (see Figure 10; a sketch follows this list):

1. If the error measure undergoes 4 consecutive reductions, increase $k$ by 10%.
2. If the error measure undergoes 2 consecutive combinations of one increase and one reduction, decrease $k$ by 10%.

Though the numbers 10%, 4 and 2 were chosen more or less arbitrarily, the results shown in our simulations appear to be satisfactory. Furthermore, due to this dynamic update strategy, the initial value of $k$ is usually not critical as long as it is not too big.
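The two heuristic rules translate directly into a small update function. This sketch assumes the error measure is recorded once per epoch and uses a five-entry window, which is one of several reasonable readings of Figure 10.

```python
def update_step_size(k, error_history):
    """Heuristic step-size schedule (Figure 10): +10% after 4 consecutive
    error reductions, -10% after 2 consecutive up-down combinations.
    error_history holds the per-epoch error measures, most recent last."""
    if len(error_history) >= 5:
        last = error_history[-5:]
        # Rule 1: four consecutive reductions
        if all(last[i] > last[i + 1] for i in range(4)):
            return k * 1.10
        # Rule 2: two consecutive (increase, reduction) combinations
        if (last[0] < last[1] > last[2]) and (last[2] < last[3] > last[4]):
            return k * 0.90
    return k
```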

B. Simulation Results

Example 1: Modeling a Two-Input Nonlinear Function

In this example, we consider using ANFIS to model the nonlinear sinc equation

$$z = \operatorname{sinc}(x, y) = \frac{\sin x}{x} \cdot \frac{\sin y}{y}. \tag{32}$$

Figure 11: RMSE curves for the quick-propagation neural network and the ANFIS.

From the grid points of the range [-10, 10] x [-10, 10] within the input space of the above equation, 121 training data pairs were obtained first. The ANFIS used here contains 16 rules, with four membership functions assigned to each input variable; the total number of fitting parameters is 72, composed of 24 premise parameters and 48 consequent parameters. (We also tried ANFIS with 4 rules and 9 rules, but obviously they are too simple to describe the highly nonlinear sinc function.)

Figure 11 shows the RMSE (root mean squared error) curves for both the 2-18-1 neural network and the ANFIS. Each curve is the average of ten runs: for the neural network, the ten runs were started from different sets of initial random weights; for the ANFIS, ten different initial step sizes were used. The neural network, containing 73 fitting parameters (connection weights and thresholds), was trained with quick-propagation [5], which is considered one of the best learning algorithms for connectionist models. Figure 11 demonstrates how effectively ANFIS can model a highly nonlinear surface as compared to neural networks. However, this comparison cannot be taken to be universal, since we did not attempt an exhaustive search for the optimal settings of the quick-propagation learning rule for the neural networks.

The training data and the reconstructed surfaces at different epoch numbers are shown in Figure 12. (Since the error measure is always computed after the forward pass is over, the epoch numbers shown in Figure 12 always end with 0.5.) Note that the reconstructed surface after 0.5 epochs is due to the identification of the consequent parameters only, and it already looks similar to the training data surface. Figure 13 lists the initial and final membership functions. It is interesting to observe that the sharp changes of the training data surface around the origin are accounted for by the membership functions moving toward the origin. Theoretically, the final MF's on both $x$ and $y$ should be symmetric with respect to the origin. However, they are not symmetric, due to computer truncation errors and the approximate initial conditions used to bootstrap the calculation of the sequential least squares estimate in equation (14).

Example 2: Modeling a Three-Input Nonlinear Function

The training data in this example are obtained from

$$\text{output} = \left( 1 + x^{0.5} + y^{-1} + z^{-1.5} \right)^2, \tag{33}$$

which was also used by Takagi et al. [5], Sugeno et al. [49] and Kondo [] to verify their approaches. The ANFIS (see Figure 14) used here contains 8 rules, with two membership functions assigned to each input variable. 216 training data pairs and 125 checking data pairs were sampled uniformly from the input ranges [1, 6] x [1, 6] x [1, 6] and [1.5, 5.5] x [1.5, 5.5] x [1.5, 5.5], respectively. The training data were used to train the ANFIS, while the checking data were used only to verify the identified ANFIS.
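For concreteness, here is one way the Example 2 data sets could be generated. The 6x6x6 and 5x5x5 grid layouts are assumptions of this sketch; the paper states only uniform sampling and the two data-set sizes.

```python
import numpy as np

def target(x, y, z):
    """Three-input nonlinear function of Example 2 (eq. 33)."""
    return (1.0 + x ** 0.5 + y ** -1.0 + z ** -1.5) ** 2

# 216 = 6^3 training points from [1, 6]^3 and 125 = 5^3 checking points
# from [1.5, 5.5]^3, laid out on uniform grids.
tr_axis = np.linspace(1.0, 6.0, 6)
ck_axis = np.linspace(1.5, 5.5, 5)
train = np.array([(x, y, z, target(x, y, z))
                  for x in tr_axis for y in tr_axis for z in tr_axis])
check = np.array([(x, y, z, target(x, y, z))
                  for x in ck_axis for y in ck_axis for z in ck_axis])
print(train.shape, check.shape)   # (216, 4) (125, 4)
```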

Figure 12: Training data (upper left) and reconstructed surfaces at 0.5 (upper right), 99.5 (lower left) and 249.5 (lower right) epochs (Example 1).

Figure 13: Initial and final membership functions of Example 1, on x and on y.

Figure 14: The ANFIS architecture for Example 2. (The connections from the inputs to layer 4 are not shown.)

Figure 15: Example 2: (a) membership functions before learning; (b), (c), (d) membership functions on x, y and z after learning.

To allow comparison, we use the same performance index adopted in [49, ]:

$$\mathrm{APE} = \text{average percentage error} = \frac{1}{P} \sum_{i=1}^{P} \frac{|T(i) - O(i)|}{|T(i)|} \times 100\%, \tag{34}$$

where $P$ is the number of data pairs, and $T(i)$ and $O(i)$ are the $i$-th desired output and calculated output, respectively.

Figure 15 illustrates the membership functions before and after training. The training error curves with different initial step sizes (from 0.1 to 0.9) are shown in Figure 16(a), which demonstrates that the initial step size is not too critical to the final performance as long as it is not too big. Figure 16(b) shows the training and checking error curves with initial step size equal to 0.1. After 199.5 epochs, the final results are APE_trn = 0.043% and APE_chk = 1.066%, which are listed in Table 2 along with the results of earlier work [49, ]. Since each simulation cited here was performed under different assumptions and with different training and checking data sets, we cannot make conclusive comments here.
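The performance index of equation (34) is straightforward to compute; the following sketch assumes the targets never equal zero, as in Example 2.

```python
import numpy as np

def ape(target, output):
    """Average percentage error of eq. (34), in percent."""
    target = np.asarray(target, dtype=float)
    output = np.asarray(output, dtype=float)
    return 100.0 * np.mean(np.abs(target - output) / np.abs(target))

print(ape([2.0, 4.0, 5.0], [2.1, 3.8, 5.0]))   # ~3.33
```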

Figure 16: Error curves of Example 2: (a) nine training error curves for nine initial step sizes from 0.1 (solid line) to 0.9; (b) training (solid line) and checking (dashed line) error curves with initial step size equal to 0.1.

Table 2: Example 2: comparisons with earlier work. (The last three rows are from [49].)

| Model              | APE_trn (%) | APE_chk (%) | Parameter no. | Training set size | Checking set size |
|--------------------|-------------|-------------|---------------|-------------------|-------------------|
| ANFIS              | 0.043       | 1.066       | 50            | 216               | 125               |
| GMDH model []      |             |             |               |                   |                   |
| Fuzzy model 1 [49] | 1.5         | 2.1         |               |                   |                   |
| Fuzzy model 2 [49] |             |             |               |                   |                   |

Example 3: On-line Identification in Control Systems

Here we repeat the simulation example of [34], where a 1-20-10-1 neural network was employed to identify a nonlinear component in a control system, except that we use ANFIS to replace the neural network. The plant under consideration is governed by the following difference equation:

$$y(k+1) = 0.3\,y(k) + 0.6\,y(k-1) + f(u(k)), \tag{35}$$

where $y(k)$ and $u(k)$ are the output and input, respectively, at time index $k$, and the unknown function $f(\cdot)$ has the form

$$f(u) = 0.6 \sin(\pi u) + 0.3 \sin(3 \pi u) + 0.1 \sin(5 \pi u). \tag{36}$$

In order to identify the plant, a series-parallel model governed by the difference equation

$$\hat{y}(k+1) = 0.3\,\hat{y}(k) + 0.6\,\hat{y}(k-1) + F(u(k)) \tag{37}$$

was used, where $F(\cdot)$ is the function implemented by the ANFIS, whose parameters are updated at each time index. Here the ANFIS has 7 membership functions on its input (thus 7 rules and 35 fitting parameters), and the pattern (on-line) learning paradigm was adopted with a learning rate of 0.1 and a forgetting factor $\lambda$ (equation (18)). The input to the plant and the model was the sinusoid $u(k) = \sin(2\pi k / 250)$, and the adaptation started at $k = 1$ and stopped at $k = 250$. As shown in Figure 17, the output of the model follows the output of the plant almost immediately, even after the adaptation stopped at $k = 250$ and the input $u(k)$ was changed after $k = 500$. As a comparison, the neural network in [34] fails to follow the plant when the adaptation stops at $k = 500$, and the identification procedure had to continue for 50,000 time steps using a random input. Table 3 summarizes the comparison.

In the above, the number of MF's was determined by trial and error. If the number of MF's is below 7, the model output will not follow the plant output satisfactorily after 250 adaptations. But can we decrease the number of parameters by using batch learning, which is supposed to be more effective? Figures 18, 19 and 20 show the results after 49.5 epochs of batch learning when the numbers of MF's are 5, 4 and 3, respectively. As can be seen, the ANFIS is a good model even when the number of MF's is as small as 3. However, as the number of MF's gets smaller, the correlation between $f(u)$ and each rule's output becomes less obvious, in the sense that it is harder to sketch $f(u)$ from each rule's consequent part.
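Below is a runnable skeleton of the Example 3 simulation loop defined by equations (35) to (37). The function F is a placeholder where the on-line-trained ANFIS estimate of f would go (updated at every time index with the pattern-learning rule and the forgetting-factor RLS of equation (18)); the zero initialization is only to keep the sketch self-contained.

```python
import numpy as np

def f(u):
    """Unknown plant nonlinearity of eq. (36)."""
    return (0.6 * np.sin(np.pi * u) + 0.3 * np.sin(3 * np.pi * u)
            + 0.1 * np.sin(5 * np.pi * u))

def F(u):
    """Placeholder for the on-line-trained ANFIS estimate of f."""
    return 0.0

y = [0.0, 0.0]         # plant output y(0), y(1)
yh = [0.0, 0.0]        # model output
for k in range(1, 700):
    u = np.sin(2 * np.pi * k / 250)
    y.append(0.3 * y[-1] + 0.6 * y[-2] + f(u))      # eq. (35)
    yh.append(0.3 * yh[-1] + 0.6 * yh[-2] + F(u))   # eq. (37)
```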

Figure 17: Example 3: (a) u(k); (b) f(u(k)) and F(u(k)); (c) plant output and model output.

Table 3: Example 3: comparison with the NN identifier [34].

| Method | Parameter number | Time steps of adaptation |
|--------|------------------|--------------------------|
| NN     | 261              | 50,000                   |
| ANFIS  | 35               | 250                      |

Figure 18: Example 3: batch learning with 5 MF's (initial and final MF's on u, f(u) and F(u), and each rule's output).

In other words, when the number of parameters is reduced mildly, the ANFIS can usually still do the job, but at the cost of sacrificing its semantics in terms of the local-description nature of fuzzy if-then rules; it becomes less of a structured knowledge representation and more of a black-box model (like neural networks).

Example 4: Predicting Chaotic Dynamics

Examples 1, 2 and 3 show that ANFIS can be used to model highly nonlinear functions effectively. In this example, we demonstrate how the proposed ANFIS can be employed to predict future values of a chaotic time series. The performance obtained in this example will be compared with the results of a cascade-correlation neural network approach reported in [39] and a simple conventional statistical approach, the auto-regressive (AR) model.

The time series used in our simulation is generated by the chaotic Mackey-Glass differential delay equation [9] defined below:

$$\dot{x}(t) = \frac{0.2\, x(t - \tau)}{1 + x^{10}(t - \tau)} - 0.1\, x(t). \tag{38}$$

The prediction of future values of this time series is a benchmark problem which has been considered by a number of connectionist researchers (Lapedes and Farber [4], Moody [3, 3], Jones et al. [5], Crowder [39] and Sanger [4]).

The goal of the task is to use known values of the time series up to the point $x(t)$ to predict the value at some point in the future, $x(t + P)$. The standard method for this type of prediction is to create a mapping from $D$ points of the time series spaced $\Delta$ apart, that is, $(x(t - (D-1)\Delta), \ldots, x(t - \Delta), x(t))$, to a predicted future value $x(t + P)$. To allow comparison with earlier work (Lapedes and Farber [4], Moody [3, 3], Crowder [39]), the values $D = 4$ and $\Delta = P = 6$ were used. All other simulation settings in this example were purposely arranged to be as close as possible to those reported in [39].

To obtain the time series value at each integer time point, we applied the fourth-order Runge-Kutta method to find the numerical solution to equation (38). The time step used in the method was 0.1, the initial condition was $x(0) = 1.2$, $\tau = 17$, and $x(t)$ was thus derived for $0 \le t \le 2000$. (We assume $x(t) = 0$ for $t < 0$ in the integration.) From the Mackey-Glass time series $x(t)$, we extracted 1000 input-output data pairs of the following format:

$$[\,x(t-18),\ x(t-12),\ x(t-6),\ x(t);\ x(t+6)\,], \tag{39}$$

where $t$ ranges from 118 to 1117.
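The Mackey-Glass series can be generated as described with a fourth-order Runge-Kutta integrator. The handling of the delayed term at the half-step (midpoint of the two neighbouring history samples) is an assumption of this sketch, since the paper does not specify it.

```python
import numpy as np

def mackey_glass(n_steps, h=0.1, tau=17.0, x0=1.2):
    """Integrate the Mackey-Glass equation (38) with fourth-order Runge-Kutta.
    x(t) = 0 for t < 0; the delayed value is read from the stored history
    (h divides tau exactly here, so the lag is an integer number of steps)."""
    lag = int(round(tau / h))            # 170 steps of history
    x = np.zeros(n_steps + 1)
    x[0] = x0

    def f(xt, xlag):
        return 0.2 * xlag / (1.0 + xlag ** 10) - 0.1 * xt

    for i in range(n_steps):
        # History lookups for the RK4 stages; the half-step delayed value is
        # approximated by the midpoint of the two neighbouring samples.
        x_lag0 = x[i - lag] if i >= lag else 0.0
        x_lag1 = x[i - lag + 1] if i + 1 >= lag else 0.0
        x_lag_half = 0.5 * (x_lag0 + x_lag1)
        k1 = f(x[i], x_lag0)
        k2 = f(x[i] + 0.5 * h * k1, x_lag_half)
        k3 = f(x[i] + 0.5 * h * k2, x_lag_half)
        k4 = f(x[i] + h * k3, x_lag1)
        x[i + 1] = x[i] + h * (k1 + 2 * k2 + 2 * k3 + k4) / 6.0

    return x[::10]                       # samples at integer time points

series = mackey_glass(20000)             # x(0) .. x(2000)
pairs = np.array([[series[t - 18], series[t - 12], series[t - 6], series[t],
                   series[t + 6]] for t in range(118, 1118)])
print(pairs.shape)                       # (1000, 5)
```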

Figure 19: Example 3: batch learning with 4 MF's (initial and final MF's on u, f(u) and F(u), and each rule's output).

Figure 20: Example 3: batch learning with 3 MF's (initial and final MF's on u, f(u) and F(u), and each rule's output).

Figure 21: Membership functions of Example 4: (a) initial MF's for all four input variables; (b) final MF's for each input variable, x(t-18), x(t-12), x(t-6) and x(t).

The first 500 pairs (the training data set) were used to train the ANFIS, while the remaining 500 pairs (the checking data set) were used to validate the identified model. The number of membership functions assigned to each input of the ANFIS was arbitrarily set to 2, so the number of rules is 16. Figure 21(a) shows the initial membership functions for each input variable. The ANFIS used here contains a total of 104 fitting parameters, of which 24 are premise parameters and 80 are consequent parameters.

After training, we had RMSE_trn = 0.0016 and RMSE_chk = 0.0015, which are much better than the other approaches explained below. The resulting 16 fuzzy if-then rules are listed in the Appendix. The desired and predicted values for both training data and checking data are essentially the same in Figure 22(a); their differences (Figure 22(b)) can only be seen on a finer scale. Figure 21(b) shows the final membership functions; Figure 23 shows the RMSE curves, which indicate that most of the learning was done in the early epochs. It is quite unusual to observe the phenomenon that RMSE_chk is smaller than RMSE_trn during the training process. Considering that both RMSE's are very small, we conclude that: (1) the ANFIS has captured the essential components of the underlying dynamics; (2) the training data contain the effects of the initial conditions (recall that we set $x(t) = 0$ for $t < 0$ in the integration) which might not be easily accounted for by the essential components identified by the ANFIS.

As a comparison, we performed the same prediction using the auto-regressive (AR) model with the same number of parameters:

$$x(t+6) = a_0 + a_1 x(t) + a_2 x(t-6) + \cdots \tag{40}$$
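For the AR comparison of equation (40), the coefficients can themselves be fitted by least squares. This sketch reuses the series array from the Mackey-Glass sketch above; the 104-coefficient lag structure and the index bounds are assumptions chosen so that every lagged sample exists.

```python
import numpy as np

D = 104                                   # as many parameters as the ANFIS
rows, targets = [], []
for t in range(6 * (D - 2), len(series) - 6):
    # Row: [1, x(t), x(t-6), ..., x(t - 102*6)] -> 104 coefficients a0..a103
    rows.append([1.0] + [series[t - 6 * i] for i in range(D - 1)])
    targets.append(series[t + 6])
A, B = np.array(rows), np.array(targets)
coeffs = np.linalg.lstsq(A, B, rcond=None)[0]
rmse = np.sqrt(np.mean((A @ coeffs - B) ** 2))
print(rmse)
```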


More information

G Force Tolerance Sample Solution

G Force Tolerance Sample Solution G Force Tolerance Sample Solution Introduction When different forces are applied to an oject, G-Force is a term used to descrie the resulting acceleration, and is in relation to acceleration due to gravity(g).

More information

Fuzzy if-then rules fuzzy database modeling

Fuzzy if-then rules fuzzy database modeling Fuzzy if-then rules Associates a condition described using linguistic variables and fuzzy sets to a conclusion A scheme for capturing knowledge that involves imprecision 23.11.2010 1 fuzzy database modeling

More information

Theory of Integers. CS389L: Automated Logical Reasoning. Lecture 13: The Omega Test. Overview of Techniques. Geometric Description

Theory of Integers. CS389L: Automated Logical Reasoning. Lecture 13: The Omega Test. Overview of Techniques. Geometric Description Theory of ntegers This lecture: Decision procedure for qff theory of integers CS389L: Automated Logical Reasoning Lecture 13: The Omega Test şıl Dillig As in previous two lectures, we ll consider T Z formulas

More information

Pedestrian Detection Using Structured SVM

Pedestrian Detection Using Structured SVM Pedestrian Detection Using Structured SVM Wonhui Kim Stanford University Department of Electrical Engineering wonhui@stanford.edu Seungmin Lee Stanford University Department of Electrical Engineering smlee729@stanford.edu.

More information

UNIT 10 Trigonometry UNIT OBJECTIVES 287

UNIT 10 Trigonometry UNIT OBJECTIVES 287 UNIT 10 Trigonometry Literally translated, the word trigonometry means triangle measurement. Right triangle trigonometry is the study of the relationships etween the side lengths and angle measures of

More information

CHAPTER 4 FUZZY LOGIC, K-MEANS, FUZZY C-MEANS AND BAYESIAN METHODS

CHAPTER 4 FUZZY LOGIC, K-MEANS, FUZZY C-MEANS AND BAYESIAN METHODS CHAPTER 4 FUZZY LOGIC, K-MEANS, FUZZY C-MEANS AND BAYESIAN METHODS 4.1. INTRODUCTION This chapter includes implementation and testing of the student s academic performance evaluation to achieve the objective(s)

More information

FUZZY INFERENCE SYSTEMS

FUZZY INFERENCE SYSTEMS CHAPTER-IV FUZZY INFERENCE SYSTEMS Fuzzy inference is the process of formulating the mapping from a given input to an output using fuzzy logic. The mapping then provides a basis from which decisions can

More information

Dynamic Analysis of Structures Using Neural Networks

Dynamic Analysis of Structures Using Neural Networks Dynamic Analysis of Structures Using Neural Networks Alireza Lavaei Academic member, Islamic Azad University, Boroujerd Branch, Iran Alireza Lohrasbi Academic member, Islamic Azad University, Boroujerd

More information

Lecture notes. Com Page 1

Lecture notes. Com Page 1 Lecture notes Com Page 1 Contents Lectures 1. Introduction to Computational Intelligence 2. Traditional computation 2.1. Sorting algorithms 2.2. Graph search algorithms 3. Supervised neural computation

More information

Fuzzy If-Then Rules. Fuzzy If-Then Rules. Adnan Yazıcı

Fuzzy If-Then Rules. Fuzzy If-Then Rules. Adnan Yazıcı Fuzzy If-Then Rules Adnan Yazıcı Dept. of Computer Engineering, Middle East Technical University Ankara/Turkey Fuzzy If-Then Rules There are two different kinds of fuzzy rules: Fuzzy mapping rules and

More information

Some New Implication Operations Emerging From Fuzzy Logic

Some New Implication Operations Emerging From Fuzzy Logic Some New Implication Operations Emerging From Fuzzy Logic Manish K. Srivastava, S. Kumar 2 and R.S. Parihar Sunrise University, Alwar, Rajsthan tomanishis@gmail.com 2 sanjeevis@yahoo.co.in Astract: We

More information

AN ALGORITHM FOR BLIND RESTORATION OF BLURRED AND NOISY IMAGES

AN ALGORITHM FOR BLIND RESTORATION OF BLURRED AND NOISY IMAGES AN ALGORITHM FOR BLIND RESTORATION OF BLURRED AND NOISY IMAGES Nader Moayeri and Konstantinos Konstantinides Hewlett-Packard Laboratories 1501 Page Mill Road Palo Alto, CA 94304-1120 moayeri,konstant@hpl.hp.com

More information

Natural Language Processing CS 6320 Lecture 6 Neural Language Models. Instructor: Sanda Harabagiu

Natural Language Processing CS 6320 Lecture 6 Neural Language Models. Instructor: Sanda Harabagiu Natural Language Processing CS 6320 Lecture 6 Neural Language Models Instructor: Sanda Harabagiu In this lecture We shall cover: Deep Neural Models for Natural Language Processing Introduce Feed Forward

More information

Automatic basis selection for RBF networks using Stein s unbiased risk estimator

Automatic basis selection for RBF networks using Stein s unbiased risk estimator Automatic basis selection for RBF networks using Stein s unbiased risk estimator Ali Ghodsi School of omputer Science University of Waterloo University Avenue West NL G anada Email: aghodsib@cs.uwaterloo.ca

More information

Fuzzy Expert Systems Lecture 8 (Fuzzy Systems)

Fuzzy Expert Systems Lecture 8 (Fuzzy Systems) Fuzzy Expert Systems Lecture 8 (Fuzzy Systems) Soft Computing is an emerging approach to computing which parallels the remarkable ability of the human mind to reason and learn in an environment of uncertainty

More information

Application of fuzzy set theory in image analysis. Nataša Sladoje Centre for Image Analysis

Application of fuzzy set theory in image analysis. Nataša Sladoje Centre for Image Analysis Application of fuzzy set theory in image analysis Nataša Sladoje Centre for Image Analysis Our topics for today Crisp vs fuzzy Fuzzy sets and fuzzy membership functions Fuzzy set operators Approximate

More information

This leads to our algorithm which is outlined in Section III, along with a tabular summary of it's performance on several benchmarks. The last section

This leads to our algorithm which is outlined in Section III, along with a tabular summary of it's performance on several benchmarks. The last section An Algorithm for Incremental Construction of Feedforward Networks of Threshold Units with Real Valued Inputs Dhananjay S. Phatak Electrical Engineering Department State University of New York, Binghamton,

More information

Deciphering Data Fusion Rule by using Adaptive Neuro-Fuzzy Inference System

Deciphering Data Fusion Rule by using Adaptive Neuro-Fuzzy Inference System Deciphering Data Fusion Rule by using Adaptive Neuro-Fuzzy Inference System Ramachandran, A. Professor, Dept. of Electronics and Instrumentation Engineering, MSRIT, Bangalore, and Research Scholar, VTU.

More information

Vertex 3-colorability of claw-free graphs

Vertex 3-colorability of claw-free graphs R u t c o r Research R e p o r t Vertex 3-coloraility of claw-free graphs M. Kamiński a V. Lozin RRR 8 2007, Feruary 2007 RUTCOR Rutgers Center for Operations Research Rutgers University 640 Bartholomew

More information

Gybeh' c.s 9 Springer-Verlag 1993

Gybeh' c.s 9 Springer-Verlag 1993 Biol. yern. 69, 345-351 (1993) Blogr =- Gyeh' c.s 9 Springer-Verlag 1993 Solution of the direct and inverse kinematic prolems y a common algorithm ased on the mean of multiple computations H. ruse, U.

More information

1. Introduction. 2. Motivation and Problem Definition. Volume 8 Issue 2, February Susmita Mohapatra

1. Introduction. 2. Motivation and Problem Definition. Volume 8 Issue 2, February Susmita Mohapatra Pattern Recall Analysis of the Hopfield Neural Network with a Genetic Algorithm Susmita Mohapatra Department of Computer Science, Utkal University, India Abstract: This paper is focused on the implementation

More information

15th Special Course on Image Processing for Industrial Applications October, 2000, Delft, The Netherlands. 1. Lab exercises: Mathematical Morphology

15th Special Course on Image Processing for Industrial Applications October, 2000, Delft, The Netherlands. 1. Lab exercises: Mathematical Morphology 1. La exercises: Mathematical Morphology In these la exercises we will use matla and the SDC morphology toolox, a special lirary for mathematical morphological image processing. A demo version of this

More information

On the use of Fuzzy Logic Controllers to Comply with Virtualized Application Demands in the Cloud

On the use of Fuzzy Logic Controllers to Comply with Virtualized Application Demands in the Cloud On the use of Fuzzy Logic Controllers to Comply with Virtualized Application Demands in the Cloud Kyriakos M. Deliparaschos Cyprus University of Technology k.deliparaschos@cut.ac.cy Themistoklis Charalambous

More information

Section 3.2 Quadratic Functions

Section 3.2 Quadratic Functions 3. Quadratic Functions 163 Section 3. Quadratic Functions In this section, we will explore the family of nd degree polynomials, the quadratic functions. While they share many characteristics of polynomials

More information

ANALYSIS AND REASONING OF DATA IN THE DATABASE USING FUZZY SYSTEM MODELLING

ANALYSIS AND REASONING OF DATA IN THE DATABASE USING FUZZY SYSTEM MODELLING ANALYSIS AND REASONING OF DATA IN THE DATABASE USING FUZZY SYSTEM MODELLING Dr.E.N.Ganesh Dean, School of Engineering, VISTAS Chennai - 600117 Abstract In this paper a new fuzzy system modeling algorithm

More information

Function approximation using RBF network. 10 basis functions and 25 data points.

Function approximation using RBF network. 10 basis functions and 25 data points. 1 Function approximation using RBF network F (x j ) = m 1 w i ϕ( x j t i ) i=1 j = 1... N, m 1 = 10, N = 25 10 basis functions and 25 data points. Basis function centers are plotted with circles and data

More information

Image Compression: An Artificial Neural Network Approach

Image Compression: An Artificial Neural Network Approach Image Compression: An Artificial Neural Network Approach Anjana B 1, Mrs Shreeja R 2 1 Department of Computer Science and Engineering, Calicut University, Kuttippuram 2 Department of Computer Science and

More information

Reverse order Triangular, Trapezoidal and Pentagonal Fuzzy Numbers

Reverse order Triangular, Trapezoidal and Pentagonal Fuzzy Numbers nnals of Pure and pplied Mathematics Vol. 9, No., 205, 07-7 ISSN: 2279-087X (P), 2279-0888(online) Pulished on 20 January 205 www.researchmathsci.org nnals of Reverse order Triangular, Trapezoidal and

More information

NEW HYBRID LEARNING ALGORITHMS IN ADAPTIVE NEURO FUZZY INFERENCE SYSTEMS FOR CONTRACTION SCOUR MODELING

NEW HYBRID LEARNING ALGORITHMS IN ADAPTIVE NEURO FUZZY INFERENCE SYSTEMS FOR CONTRACTION SCOUR MODELING Proceedings of the 4 th International Conference on Environmental Science and Technology Rhodes, Greece, 3-5 September 05 NEW HYBRID LEARNING ALGRITHMS IN ADAPTIVE NEUR FUZZY INFERENCE SYSTEMS FR CNTRACTIN

More information

A New Class of ANFIS based Channel Equalizers for Mobile Communication Systems

A New Class of ANFIS based Channel Equalizers for Mobile Communication Systems A New Class of ANFIS based Channel Equalizers for Mobile Communication Systems K.C.Raveendranathan Department of Electronics and Communication Engineering, Government Engineering College Barton Hill, Thiruvananthapuram-695035.

More information

Binary Decision Tree Using Genetic Algorithm for Recognizing Defect Patterns of Cold Mill Strip

Binary Decision Tree Using Genetic Algorithm for Recognizing Defect Patterns of Cold Mill Strip Binary Decision Tree Using Genetic Algorithm for Recognizing Defect Patterns of Cold Mill Strip Kyoung Min Kim 1,4, Joong Jo Park, Myung Hyun Song 3, In Cheol Kim 1, and Ching Y. Suen 1 1 Centre for Pattern

More information

Ensemble methods in machine learning. Example. Neural networks. Neural networks

Ensemble methods in machine learning. Example. Neural networks. Neural networks Ensemble methods in machine learning Bootstrap aggregating (bagging) train an ensemble of models based on randomly resampled versions of the training set, then take a majority vote Example What if you

More information

VLSI Logic Test, Validation and Verification Lecture 7 Properties & Applications of Binary Decision Diagrams

VLSI Logic Test, Validation and Verification Lecture 7 Properties & Applications of Binary Decision Diagrams VLSI Logic Test, Validation and Verification Lecture 7 Properties & Applications of Binary Decision Diagrams 1 Instructor: Priyank Kalla Department of Electrical and Computer Engineering University of

More information

Chapter 2: FUZZY SETS

Chapter 2: FUZZY SETS Ch.2: Fuzzy sets 1 Chapter 2: FUZZY SETS Introduction (2.1) Basic Definitions &Terminology (2.2) Set-theoretic Operations (2.3) Membership Function (MF) Formulation & Parameterization (2.4) Complement

More information

A Graph-Based Web Usage Mining Considering Page Browsing Time

A Graph-Based Web Usage Mining Considering Page Browsing Time A Graph-Based We Usage Mining Considering Page Browsing Time Koichiro Mihara Masahiro Terae Kazuo Hashimoto Graduate School of Information Sciences, TOHOKU University {mihara, terae, kh}@aiet.ecei.tohoku.ac.jp

More information

A Modified Playfair Cipher for a Large Block of Plaintext

A Modified Playfair Cipher for a Large Block of Plaintext International Journal of Computer Theory and Engineering, Vol 1, No 5, Decemer, 2009 A Modified layfair Cipher for a Large Block of laintext V Umakanta Sastry, N Ravi Shankar, and S Durga Bhavani Astract

More information

Note Sheets Chapter 8: Area

Note Sheets Chapter 8: Area Ch 8 Notesheet L Key V3 Note Sheets Chapter 8: rea In General ON LL PROBLEMS!!. State the relationship (or the formula).. Sustitute in known values. 3. Simplify or Solve the equation. Use the order of

More information

arxiv: v1 [cond-mat.dis-nn] 30 Dec 2018

arxiv: v1 [cond-mat.dis-nn] 30 Dec 2018 A General Deep Learning Framework for Structure and Dynamics Reconstruction from Time Series Data arxiv:1812.11482v1 [cond-mat.dis-nn] 30 Dec 2018 Zhang Zhang, Jing Liu, Shuo Wang, Ruyue Xin, Jiang Zhang

More information

Learning internal representations

Learning internal representations CHAPTER 4 Learning internal representations Introduction In the previous chapter, you trained a single-layered perceptron on the problems AND and OR using the delta rule. This architecture was incapable

More information

OPTIMIZATION. Optimization. Derivative-based optimization. Derivative-free optimization. Steepest descent (gradient) methods Newton s method

OPTIMIZATION. Optimization. Derivative-based optimization. Derivative-free optimization. Steepest descent (gradient) methods Newton s method OPTIMIZATION Optimization Derivative-based optimization Steepest descent (gradient) methods Newton s method Derivative-free optimization Simplex Simulated Annealing Genetic Algorithms Ant Colony Optimization...

More information

Exploring Gaussian and Triangular Primary Membership Functions in Non-Stationary Fuzzy Sets

Exploring Gaussian and Triangular Primary Membership Functions in Non-Stationary Fuzzy Sets Exploring Gaussian and Triangular Primary Membership Functions in Non-Stationary Fuzzy Sets S. Musikasuwan and J.M. Garibaldi Automated Scheduling, Optimisation and Planning Group University of Nottingham,

More information

TENTAMEN / EXAM. General instructions

TENTAMEN / EXAM. General instructions Linköpings universitet IDA Department of Computer and Information Sciences Prof. Peter Fritzson and Doc. Christoph Kessler TENTAMEN / EXAM TDDB29 Kompilatorer och interpretatorer / Compilers and interpreters

More information

SECTION 8.2 the hyperbola Wake created from shock wave. Portion of a hyperbola

SECTION 8.2 the hyperbola Wake created from shock wave. Portion of a hyperbola SECTION 8. the hperola 6 9 7 learning OjeCTIveS In this section, ou will: Locate a hperola s vertices and foci. Write equations of hperolas in standard form. Graph hperolas centered at the origin. Graph

More information

FUZZY LOGIC TECHNIQUES. on random processes. In such situations, fuzzy logic exhibits immense potential for

FUZZY LOGIC TECHNIQUES. on random processes. In such situations, fuzzy logic exhibits immense potential for FUZZY LOGIC TECHNIQUES 4.1: BASIC CONCEPT Problems in the real world are quite often very complex due to the element of uncertainty. Although probability theory has been an age old and effective tool to

More information

Unsupervised Learning and Clustering

Unsupervised Learning and Clustering Unsupervised Learning and Clustering Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2008 CS 551, Spring 2008 c 2008, Selim Aksoy (Bilkent University)

More information

Two Kinds of Golden Triangles, Generalized to Match Continued Fractions

Two Kinds of Golden Triangles, Generalized to Match Continued Fractions Journal for Geometry and Graphics Volume (2007), No. 2, 65 7. Two Kinds of Golden Triangles, Generalized to Match Continued Fractions Clark Kimerling Department of Mathematics, University of Evansville

More information

UNIT 5 MEASUREMENT AND GEOMETRY JUSTIFICATIONS

UNIT 5 MEASUREMENT AND GEOMETRY JUSTIFICATIONS UNIT 5 MEASUREMENT AND GEOMETRY JUSTIFICATIONS UNIT 5 MEASUREMENT AND GEOMETRY JUSTIFICATIONS... 1 TEXTBOOK HOMEWORK FOR THIS UNIT... 1 UNDERSTANDING WHY THE EQUATIONS ARE CORRECT... 3 UNDERSTANDING THE

More information

RECONSTRUCTING FINE-SCALE AIR POLLUTION STRUCTURES FROM COARSELY RESOLVED SATELLITE OBSERVATIONS

RECONSTRUCTING FINE-SCALE AIR POLLUTION STRUCTURES FROM COARSELY RESOLVED SATELLITE OBSERVATIONS RECONSTRUCTING FINE-SCALE AIR POLLUTION STRUCTURES FROM COARSELY RESOLVED SATELLITE OBSERVATIONS Dominik Brunner, Daniel Schau, and Brigitte Buchmann Empa, Swiss Federal Laoratories for Materials Testing

More information

742 IEEE TRANSACTIONS ON FUZZY SYSTEMS, VOL. 13, NO. 6, DECEMBER Dong Zhang, Luo-Feng Deng, Kai-Yuan Cai, and Albert So

742 IEEE TRANSACTIONS ON FUZZY SYSTEMS, VOL. 13, NO. 6, DECEMBER Dong Zhang, Luo-Feng Deng, Kai-Yuan Cai, and Albert So 742 IEEE TRANSACTIONS ON FUZZY SYSTEMS, VOL 13, NO 6, DECEMBER 2005 Fuzzy Nonlinear Regression With Fuzzified Radial Basis Function Network Dong Zhang, Luo-Feng Deng, Kai-Yuan Cai, and Albert So Abstract

More information

Theoretical Concepts of Machine Learning

Theoretical Concepts of Machine Learning Theoretical Concepts of Machine Learning Part 2 Institute of Bioinformatics Johannes Kepler University, Linz, Austria Outline 1 Introduction 2 Generalization Error 3 Maximum Likelihood 4 Noise Models 5

More information

Accelerating the convergence speed of neural networks learning methods using least squares

Accelerating the convergence speed of neural networks learning methods using least squares Bruges (Belgium), 23-25 April 2003, d-side publi, ISBN 2-930307-03-X, pp 255-260 Accelerating the convergence speed of neural networks learning methods using least squares Oscar Fontenla-Romero 1, Deniz

More information

Neural Networks Lesson 9 - Fuzzy Logic

Neural Networks Lesson 9 - Fuzzy Logic Neural Networks Lesson 9 - Prof. Michele Scarpiniti INFOCOM Dpt. - Sapienza University of Rome http://ispac.ing.uniroma1.it/scarpiniti/index.htm michele.scarpiniti@uniroma1.it Rome, 26 November 2009 M.

More information

CT79 SOFT COMPUTING ALCCS-FEB 2014

CT79 SOFT COMPUTING ALCCS-FEB 2014 Q.1 a. Define Union, Intersection and complement operations of Fuzzy sets. For fuzzy sets A and B Figure Fuzzy sets A & B The union of two fuzzy sets A and B is a fuzzy set C, written as C=AUB or C=A OR

More information

Knowledge Discovery and Data Mining. Neural Nets. A simple NN as a Mathematical Formula. Notes. Lecture 13 - Neural Nets. Tom Kelsey.

Knowledge Discovery and Data Mining. Neural Nets. A simple NN as a Mathematical Formula. Notes. Lecture 13 - Neural Nets. Tom Kelsey. Knowledge Discovery and Data Mining Lecture 13 - Neural Nets Tom Kelsey School of Computer Science University of St Andrews http://tom.home.cs.st-andrews.ac.uk twk@st-andrews.ac.uk Tom Kelsey ID5059-13-NN

More information

Knowledge Discovery and Data Mining

Knowledge Discovery and Data Mining Knowledge Discovery and Data Mining Lecture 13 - Neural Nets Tom Kelsey School of Computer Science University of St Andrews http://tom.home.cs.st-andrews.ac.uk twk@st-andrews.ac.uk Tom Kelsey ID5059-13-NN

More information

Unsupervised Learning and Clustering

Unsupervised Learning and Clustering Unsupervised Learning and Clustering Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2009 CS 551, Spring 2009 c 2009, Selim Aksoy (Bilkent University)

More information

A TSK-Type Recurrent Fuzzy Network for Dynamic Systems Processing by Neural Network and Genetic Algorithms

A TSK-Type Recurrent Fuzzy Network for Dynamic Systems Processing by Neural Network and Genetic Algorithms IEEE TRANSACTIONS ON FUZZY SYSTEMS, VOL. 10, NO. 2, APRIL 2002 155 A TSK-Type Recurrent Fuzzy Network for Dynamic Systems Processing by Neural Network and Genetic Algorithms Chia-Feng Juang, Member, IEEE

More information

The Composition of Number

The Composition of Number An introduction to The Composition of Number Original Work in the Field of Theoretical Mathematics Barbara Mae Brunner Revision 1.1.0 Copyright 1995-2002, Barbara Mae Brunner, All Rights Reserved bmb192@earthlink.net

More information

Analyzing the Performance of Pattern Database Heuristics

Analyzing the Performance of Pattern Database Heuristics Analyzing the Performance of Pattern Dataase Heuristics Richard E. Korf Computer Science Department University of California, Los Angeles Los Angeles, CA 90095 korf@cs.ucla.edu Astract We introduce a model

More information

Time Series Prediction and Neural Networks

Time Series Prediction and Neural Networks Time Series Prediction and Neural Networks N.Davey, S.P.Hunt, R.J.Frank, University of Hertfordshire Hatfield, UK. Email: {N.Davey, S.P.Hunt, R.J.Frank}@herts.ac.uk Abstract Neural Network approaches to

More information

Otto-von-Guericke-university Magdeburg

Otto-von-Guericke-university Magdeburg Otto-von-Guericke-university Magdeurg The idea of using anologies A typical form of thinking to approach prolem solutions humanly is thinking in analogous structures. Therefore school, espacially mathematical

More information

Research Article Prediction of Surface Roughness in End Milling Process Using Intelligent Systems: A Comparative Study

Research Article Prediction of Surface Roughness in End Milling Process Using Intelligent Systems: A Comparative Study Applied Computational Intelligence and Soft Computing Volume 2, Article ID 83764, 8 pages doi:.55/2/83764 Research Article Prediction of Surface Roughness in End Milling Process Using Intelligent Systems:

More information

A Compensatory Wavelet Neuron Model

A Compensatory Wavelet Neuron Model A Compensatory Wavelet Neuron Model Sinha, M., Gupta, M. M. and Nikiforuk, P.N Intelligent Systems Research Laboratory College of Engineering, University of Saskatchewan Saskatoon, SK, S7N 5A9, CANADA

More information

Lotfi Zadeh (professor at UC Berkeley) wrote his original paper on fuzzy set theory. In various occasions, this is what he said

Lotfi Zadeh (professor at UC Berkeley) wrote his original paper on fuzzy set theory. In various occasions, this is what he said FUZZY LOGIC Fuzzy Logic Lotfi Zadeh (professor at UC Berkeley) wrote his original paper on fuzzy set theory. In various occasions, this is what he said Fuzzy logic is a means of presenting problems to

More information

y ax bx c y a x h 2 Math 11 Pre-Cal Quadratics Review

y ax bx c y a x h 2 Math 11 Pre-Cal Quadratics Review Math 11 Pre-Cal Quadratics Review A quadratic function can e descried as y ax x c (or equivalent forms, see elow). There are an infinite numer of solutions (x,y pairs) to a quadratic function. If we plot

More information

6. NEURAL NETWORK BASED PATH PLANNING ALGORITHM 6.1 INTRODUCTION

6. NEURAL NETWORK BASED PATH PLANNING ALGORITHM 6.1 INTRODUCTION 6 NEURAL NETWORK BASED PATH PLANNING ALGORITHM 61 INTRODUCTION In previous chapters path planning algorithms such as trigonometry based path planning algorithm and direction based path planning algorithm

More information

ESSENTIALLY, system modeling is the task of building

ESSENTIALLY, system modeling is the task of building IEEE TRANSACTIONS ON INDUSTRIAL ELECTRONICS, VOL. 53, NO. 4, AUGUST 2006 1269 An Algorithm for Extracting Fuzzy Rules Based on RBF Neural Network Wen Li and Yoichi Hori, Fellow, IEEE Abstract A four-layer

More information

Discovery of the Source of Contaminant Release

Discovery of the Source of Contaminant Release Discovery of the Source of Contaminant Release Devina Sanjaya 1 Henry Qin Introduction Computer ability to model contaminant release events and predict the source of release in real time is crucial in

More information

Machine Learning Classifiers and Boosting

Machine Learning Classifiers and Boosting Machine Learning Classifiers and Boosting Reading Ch 18.6-18.12, 20.1-20.3.2 Outline Different types of learning problems Different types of learning algorithms Supervised learning Decision trees Naïve

More information

Rule Base Reduction: Some Comments on the Use of Orthogonal Transforms

Rule Base Reduction: Some Comments on the Use of Orthogonal Transforms IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS PART C: APPLICATIONS AND REVIEWS, VOL. 31, NO. 2, MAY 2001 199 Rule Base Reduction: Some Comments on the Use of Orthogonal Transforms Magne Setnes, Member,

More information

Note Set 4: Finite Mixture Models and the EM Algorithm

Note Set 4: Finite Mixture Models and the EM Algorithm Note Set 4: Finite Mixture Models and the EM Algorithm Padhraic Smyth, Department of Computer Science University of California, Irvine Finite Mixture Models A finite mixture model with K components, for

More information

QUALITATIVE MODELING FOR MAGNETIZATION CURVE

QUALITATIVE MODELING FOR MAGNETIZATION CURVE Journal of Marine Science and Technology, Vol. 8, No. 2, pp. 65-70 (2000) 65 QUALITATIVE MODELING FOR MAGNETIZATION CURVE Pei-Hwa Huang and Yu-Shuo Chang Keywords: Magnetization curve, Qualitative modeling,

More information

Why Fuzzy Fuzzy Logic and Sets Fuzzy Reasoning. DKS - Module 7. Why fuzzy thinking?

Why Fuzzy Fuzzy Logic and Sets Fuzzy Reasoning. DKS - Module 7. Why fuzzy thinking? Fuzzy Systems Overview: Literature: Why Fuzzy Fuzzy Logic and Sets Fuzzy Reasoning chapter 4 DKS - Module 7 1 Why fuzzy thinking? Experts rely on common sense to solve problems Representation of vague,

More information