OPTIMIZATION

Optimization
- Derivative-based optimization
  - Steepest descent (gradient) methods
  - Newton's method
- Derivative-free optimization
  - Simplex
  - Simulated annealing
  - Genetic algorithms
  - Ant colony optimization
  - ...
Derivative-based optimization
- Goal: solving (nonlinear) optimization problems using derivative information
- Methods:
  - Gradient-based optimization
  - Steepest descent
  - Newton methods
  - Conjugate gradient
  - Nonlinear least-squares problems

Derivative-based optimization
- Methods used in:
  - deriving membership functions
  - deriving consequents of Takagi-Sugeno models
  - neural network learning
  - regression analysis in nonlinear models
  - optimization of nonlinear neuro-fuzzy models
  - ...
- The methods will be introduced as needed in the next lectures.
DATA CLUSTERING

Data clustering
- Extensively used for
  - data categorization (e.g. classification)
  - data compression
  - model construction
- Partitions data into several groups such that within-group similarity is
  larger than between-group similarity
- Two types:
  - hierarchical clustering
  - objective function-based clustering
Example
(figure: scatter plots of a two-dimensional example data set)

Hierarchical clustering
- Proceeds by a series of successive mergers of data points (bottom-up) or
  successive divisions of the data set (top-down)
- Induces a tree-like structure in the data set
- No pre-determined number of clusters
- Clusters are obtained after partitioning by thresholding the induced tree
  based on similarity
Similarity
- Based on some notion of distance
- Inputs are usually normalized to the range [0, 1]
- Inversely related to distance, e.g.
  $$S(x_i, x_j) = \frac{1}{d(x_i, x_j)}$$
- Similarity is a reflexive and symmetric relation
- Suitable distance metrics can be defined even for nominal data

Linkage algorithms
Given the data $x_k = [x_{1k}, x_{2k}, \dots, x_{nk}]^T$, $k = 1, \dots, N$:
1. Start with N clusters and compute the N x N matrix of similarities.
2. Determine the most similar pair of clusters $i^*$ and $j^*$.
3. Merge clusters $i^*$ and $j^*$ to form a new cluster $i'$.
4. Delete the rows and columns corresponding to $i^*$ and $j^*$ in the
   similarity matrix.
5. Determine the similarity between $i'$ and all other remaining clusters,
   and repeat from step 2 until the desired number of clusters remains.
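As an illustration, here is a minimal single-linkage sketch of the procedure above in Python with NumPy. The function name is mine, and the similarity $S = 1/(1+d)$ is a small variation of the $1/d$ example above that avoids division by zero for coincident points:

```python
import numpy as np

def single_linkage(X, n_clusters):
    """Minimal sketch of the linkage algorithm above (single linkage: the
    similarity of two clusters is that of their most similar members)."""
    N = X.shape[0]
    clusters = [[k] for k in range(N)]            # start with N singleton clusters
    # N x N similarity matrix, S = 1 / (1 + distance), so S lies in (0, 1]
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    S = 1.0 / (1.0 + D)
    np.fill_diagonal(S, 0.0)                      # ignore self-similarity
    while len(clusters) > n_clusters:
        i, j = np.unravel_index(np.argmax(S), S.shape)   # most similar pair
        if i > j:
            i, j = j, i
        clusters[i] += clusters[j]                # merge cluster j into i
        del clusters[j]
        # similarity of the merged cluster to the rest (single linkage = max)
        S[i, :] = np.maximum(S[i, :], S[j, :])
        S[:, i] = S[i, :]
        S[i, i] = 0.0
        S = np.delete(np.delete(S, j, axis=0), j, axis=1)  # drop row/col j
    return clusters
```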
Cost function-based clustering
- Partition data by optimizing an objective function
- Clusters represented by cluster prototypes
- The objective function usually minimizes the distance to the cluster
  prototypes
- Pre-determined number of clusters
- Simultaneous estimation of the partition and the cluster prototypes

Crisp vs. fuzzy clustering
- Crisp clustering algorithms
  - partition the data set into disjoint groups, i.e. each data point belongs
    to one cluster only
  - similarity is quantified using some metric
- Fuzzy clustering algorithms
  - partition the data set into overlapping groups, i.e. each data point
    belongs to multiple clusters with varying degrees of membership
  - similarity is quantified using some metric which is modified by the
    membership values
Clustering algorithms
- Hard c-means (K-means)
- Fuzzy c-means
- Gustafson-Kessel
- Possibilistic clustering
- Mountain clustering
- Subtractive clustering
- And many, many others

K(C)-means clustering
- Partition data into disjoint sets based on similarity amongst patterns
- Given the data $x_k = [x_{1k}, x_{2k}, \dots, x_{nk}]^T \in \mathbb{R}^n$,
  $k = 1, \dots, N$, find
  - the crisp partition matrix $U = [\mu_{ik}] \in \{0, 1\}^{C \times N}$
  - and the cluster centres $V = \{v_1, \dots, v_C\}$, $v_i \in \mathbb{R}^n$
K-means clustering
- The following cost function of dissimilarity measures is minimized:
  $$J = \sum_{i=1}^{C} J_i = \sum_{i=1}^{C} \sum_{k,\, x_k \in G_i} d(x_k, v_i), \quad \text{often } d(x_k, v_i) = \|x_k - v_i\|^2$$
- The partition matrix can be calculated if the centers are fixed:
  $$\mu_{ij} = \begin{cases} 1 & \text{if } \|x_j - v_i\|^2 \le \|x_j - v_k\|^2 \text{ for all } k \neq i \\ 0 & \text{otherwise} \end{cases}$$
- The centers can be calculated if the partition matrix is fixed:
  $$v_i = \frac{1}{|G_i|} \sum_{k,\, x_k \in G_i} x_k$$

C-means algorithm
1. Initialize the cluster centers V.
2. Determine the partition matrix U.
3. Compute the cost function J; stop if J is below a threshold or if it has
   not improved.
4. Update the cluster centers V and iterate from step 2.
Note that $1 < C < N$, $\sum_{i=1}^{C} \mu_{ij} = 1$ for $j = 1, \dots, N$,
and $0 < \sum_{j=1}^{N} \mu_{ij} < N$ for each cluster.
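A minimal NumPy sketch of the C-means loop above (the function name, the initialization strategy, and the handling of empty clusters are illustrative choices):

```python
import numpy as np

def k_means(X, C, max_iter=100, tol=1e-9, seed=0):
    """Minimal K-means sketch following the C-means steps above."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    V = X[rng.choice(len(X), size=C, replace=False)]  # 1. initialize centers
    J_old = np.inf
    for _ in range(max_iter):
        # 2. partition: assign each point to its nearest center
        d2 = ((X[:, None, :] - V[None, :, :]) ** 2).sum(axis=-1)  # (N, C)
        labels = d2.argmin(axis=1)
        # 3. cost function J: sum of squared distances to assigned centers
        J = d2[np.arange(len(X)), labels].sum()
        if J_old - J < tol:                           # stop if J stops improving
            break
        J_old = J
        # 4. update each center as the mean of the points assigned to it
        for i in range(C):
            if np.any(labels == i):                   # leave empty clusters as-is
                V[i] = X[labels == i].mean(axis=0)
    return V, labels
```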
FUZZY CLUSTERING

Fuzzy c-means
- Partition data into overlapping sets based on similarity amongst patterns
- Given the data $x_k = [x_{1k}, x_{2k}, \dots, x_{nk}]^T \in \mathbb{R}^n$,
  $k = 1, \dots, N$, find
  - the fuzzy partition matrix $U = [\mu_{ik}] \in [0, 1]^{C \times N}$
  - and the cluster centres $V = \{v_1, \dots, v_C\}$, $v_i \in \mathbb{R}^n$
- This is a generalization of hard c-means!
Fuzzy clustering
- Minimize the objective function
  $$J(X, U, V) = \sum_{i=1}^{C} \sum_{k=1}^{N} \mu_{ik}^m \, d^2(x_k, v_i)$$
  subject to
  - $\mu_{ik} \in [0, 1]$, $i = 1, \dots, C$, $k = 1, \dots, N$  (membership degree)
  - $\sum_{i=1}^{C} \mu_{ik} = 1$, $k = 1, \dots, N$  (total membership)
  - $0 < \sum_{k=1}^{N} \mu_{ik} < N$, $i = 1, \dots, C$  (no cluster empty)
- $m \in (1, \infty)$ is the fuzziness parameter

Fuzzy c-means algorithm
Initialize V or U, then repeat:
1. Compute the cluster centers (assumes the partition matrix is fixed):
   $$v_i = \frac{\sum_{k=1}^{N} \mu_{ik}^m x_k}{\sum_{k=1}^{N} \mu_{ik}^m}$$
2. Calculate the distances:
   $$d_{ik}^2 = (x_k - v_i)^T (x_k - v_i)$$
3. Update the partition matrix (assumes the cluster centers are fixed):
   $$\mu_{ik} = \frac{1}{\sum_{j=1}^{C} \left( d_{ik}/d_{jk} \right)^{2/(m-1)}}$$
Until $\|\Delta U\| < \epsilon$ (other stopping criteria are possible).
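The loop above translates almost directly into NumPy. This sketch assumes Euclidean distances and random initialization of U; the function name is mine:

```python
import numpy as np

def fuzzy_c_means(X, C, m=2.0, max_iter=100, tol=1e-6, seed=0):
    """Minimal fuzzy c-means sketch of the algorithm above. Returns cluster
    centers V (C, n) and partition matrix U (C, N)."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    N = len(X)
    U = rng.random((C, N))
    U /= U.sum(axis=0)                     # columns sum to 1 (total membership)
    for _ in range(max_iter):
        Um = U ** m
        # 1. cluster centers: membership-weighted means of the data
        V = (Um @ X) / Um.sum(axis=1, keepdims=True)
        # 2. squared Euclidean distances d_ik^2
        d2 = ((X[None, :, :] - V[:, None, :]) ** 2).sum(axis=-1)   # (C, N)
        d2 = np.fmax(d2, 1e-12)            # avoid division by zero
        # 3. partition update: u_ik = 1 / sum_j (d_ik/d_jk)^(2/(m-1))
        U_new = d2 ** (-1.0 / (m - 1.0))
        U_new /= U_new.sum(axis=0)
        if np.abs(U_new - U).max() < tol:  # stop when U no longer changes
            return V, U_new
        U = U_new
    return V, U
```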
Fuzzy c-means example
(figures: the membership functions MF of the clusters plotted over the X-Y
plane)

Fuzzy c-means example
(figure)
Distance measures
- Euclidean norm:
  $$d^2(x_k, v_i) = (x_k - v_i)^T (x_k - v_i)$$
- Inner-product norm (A is diagonal):
  $$d^2(x_k, v_i) = (x_k - v_i)^T A (x_k - v_i)$$
- Mahalanobis norm (rotated clusters):
  $$d^2(x_k, v_i) = (x_k - v_i)^T F^{-1} (x_k - v_i)$$

Gustafson-Kessel clustering
- Uses an adaptive distance metric:
  $$d^2(x_k, v_i) = (x_k - v_i)^T A_i (x_k - v_i), \qquad A_i = (\rho_i \det F_i)^{1/n} F_i^{-1}$$
- with the fuzzy covariance matrix
  $$F_i = \frac{\sum_{k=1}^{N} \mu_{ik}^m (x_k - v_i)(x_k - v_i)^T}{\sum_{k=1}^{N} \mu_{ik}^m}$$
- Clusters are constrained by volume ($\rho_i$)
- Clusters adapt themselves to the shape and location of the data
GK algorithm
Repeat:
1. Compute the cluster centers (assumes the partition matrix is fixed):
   $$v_i = \frac{\sum_{k=1}^{N} \mu_{ik}^m x_k}{\sum_{k=1}^{N} \mu_{ik}^m}$$
2. Calculate the covariance matrices and distances:
   $$F_i = \frac{\sum_{k=1}^{N} \mu_{ik}^m (x_k - v_i)(x_k - v_i)^T}{\sum_{k=1}^{N} \mu_{ik}^m}, \qquad d_{ik}^2 = (x_k - v_i)^T (\det F_i)^{1/n} F_i^{-1} (x_k - v_i)$$
3. Update the partition matrix (assumes the cluster centers are fixed):
   $$\mu_{ik} = \frac{1}{\sum_{j=1}^{C} \left( d_{ik}/d_{jk} \right)^{2/(m-1)}}$$
Until $\|\Delta U\| < \epsilon$ (other stopping criteria are possible).
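A sketch of one GK iteration in NumPy, assuming unit cluster volumes ($\rho_i = 1$ by default) and well-conditioned covariance matrices; the function name is mine:

```python
import numpy as np

def gk_step(X, U, m=2.0, rho=None):
    """One Gustafson-Kessel iteration (sketch): centers, fuzzy covariance
    matrices and adaptive distances. rho_i are the fixed cluster volumes."""
    X = np.asarray(X, dtype=float)
    C, N = U.shape
    n = X.shape[1]
    rho = np.ones(C) if rho is None else rho
    Um = U ** m
    V = (Um @ X) / Um.sum(axis=1, keepdims=True)     # cluster centers
    d2 = np.empty((C, N))
    for i in range(C):
        Z = X - V[i]                                  # (N, n) deviations
        F = (Um[i, None, :] * Z.T) @ Z / Um[i].sum()  # fuzzy covariance F_i
        # adaptive norm matrix A_i = (rho_i det F_i)^(1/n) F_i^(-1)
        A = (rho[i] * np.linalg.det(F)) ** (1.0 / n) * np.linalg.inv(F)
        d2[i] = np.einsum('kj,jl,kl->k', Z, A, Z)     # d_ik^2 = z^T A_i z
    d2 = np.fmax(d2, 1e-12)
    U_new = d2 ** (-1.0 / (m - 1.0))                  # same update as in FCM
    return V, U_new / U_new.sum(axis=0)
```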
GK algorithm example
(figures: data and cluster centers with the level curves of the distances
d(z_k, v_1) and d(z_k, v_2) induced by the norm matrices A_1 and A_2; the
smallest eigenvectors of the clusters are indicated)

Mountain clustering
Algorithm:
- Lay a grid (of any type) on the data space.
- Compute the mountain function (a measure of data density) at each grid
  point $v_i$:
  $$f(v_i) = \sum_{k=1}^{N} e^{-d^2(x_k, v_i)/(2\sigma^2)}$$
- Sequentially destroy the mountain function after selecting the grid point
  $v_j$ with the highest value:
  $$f^{(C)}(v_i) = f^{(C-1)}(v_i) - f^{(C-1)}(v_j)\, e^{-d^2(v_i, v_j)/(2\beta^2)}$$
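A sketch of mountain clustering in NumPy; grid construction is left to the caller, and the parameter values and function name are illustrative:

```python
import numpy as np

def mountain_clustering(X, grid, C, sigma=0.1, beta=0.15):
    """Mountain clustering sketch: build the density ('mountain') function on
    the grid, then repeatedly select the highest peak and destroy it."""
    d2 = ((grid[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)  # (G, N)
    f = np.exp(-d2 / (2 * sigma ** 2)).sum(axis=1)   # mountain function
    centers = []
    for _ in range(C):
        j = f.argmax()                               # highest grid point
        centers.append(grid[j])
        # destroy the selected mountain so nearby grid points are not re-picked
        g2 = ((grid - grid[j]) ** 2).sum(axis=-1)
        f = f - f[j] * np.exp(-g2 / (2 * beta ** 2))
    return np.array(centers)
```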
Mountain construction
(figures: a data set and its mountain function for several values of the
width parameter, panels a-d)

Mountain destruction
(figures: successive destruction of the mountain function after each
selected center, panels a-d)
Subtractive clustering
- Mountain clustering with four variables and 10 grid lines per dimension
  already results in $10^4$ grid points to be evaluated.
- In subtractive clustering, the grid consists of the data points themselves
  (see the sketch below).
- The computational complexity is independent of the dimension of the data
  vectors.
- Rule of thumb: $\beta = 1.5\,\sigma$.

Effect of probabilistic constraint
- Probabilistic constraint:
  $$\sum_{i=1}^{C} \mu_{ik} = 1, \quad k = 1, \dots, N$$
- Problematic if a data point lies far away from all clusters (e.g. outliers)
- Leads to nonconvex clusters
(figure: membership function of a cluster that rises again far from the data,
illustrating the nonconvexity)
- Possibilistic constraint:
  $$\sum_{i=1}^{C} \mu_{ik} > 0, \quad k = 1, \dots, N$$
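Returning to subtractive clustering from above: a sketch that evaluates the mountain function on the data points themselves, using the $\beta = 1.5\,\sigma$ rule of thumb (parameter values and function name are illustrative):

```python
import numpy as np

def subtractive_clustering(X, C, sigma=0.5):
    """Subtractive clustering sketch: the mountain method with the data
    points themselves as the grid."""
    beta = 1.5 * sigma
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)   # (N, N)
    f = np.exp(-d2 / (2 * sigma ** 2)).sum(axis=1)   # potential of each point
    centers = []
    for _ in range(C):
        j = f.argmax()                               # highest remaining peak
        centers.append(X[j])
        f = f - f[j] * np.exp(-d2[j] / (2 * beta ** 2))   # destroy the peak
    return np.array(centers)
```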
Possibilistic clustering
- Minimize the objective function:
  $$J(X, U, V, \eta) = \sum_{i=1}^{C} \sum_{k=1}^{N} \mu_{ik}^m \, d^2(x_k, v_i) + \sum_{i=1}^{C} \eta_i \sum_{k=1}^{N} (1 - \mu_{ik})^m$$
- $m \in (1, \infty)$ is the fuzziness parameter
- The $\eta_i$ determine the size of the clusters; suitable values follow
  from the average intra-cluster distance:
  $$\eta_i = \frac{\sum_{k=1}^{N} \mu_{ik}^m d_{ik}^2}{\sum_{k=1}^{N} \mu_{ik}^m}$$
- The optimization problem can now be decomposed into C independent
  optimization problems.

Possibilistic clustering algorithm
Repeat:
1. Compute the cluster centers:
   $$v_i = \frac{\sum_{k=1}^{N} \mu_{ik}^m x_k}{\sum_{k=1}^{N} \mu_{ik}^m}$$
2. Calculate the distances:
   $$d_{ik}^2 = (x_k - v_i)^T A (x_k - v_i)$$
3. Update the partition matrix:
   $$\mu_{ik} = \frac{1}{1 + \left( d_{ik}^2 / \eta_i \right)^{1/(m-1)}}$$
   The membership value does not depend on the memberships of the other
   clusters.
Until $\|\Delta U\| < \epsilon$.
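A sketch of one possibilistic c-means iteration in NumPy, re-estimating $\eta_i$ from the current partition as suggested above (in practice U would typically be initialized from a fuzzy c-means run; the function name is mine):

```python
import numpy as np

def possibilistic_update(X, U, m=2.0):
    """One possibilistic c-means iteration (sketch), with eta_i re-estimated
    as the weighted mean intra-cluster distance."""
    X = np.asarray(X, dtype=float)
    Um = U ** m
    V = (Um @ X) / Um.sum(axis=1, keepdims=True)              # cluster centers
    d2 = ((X[None, :, :] - V[:, None, :]) ** 2).sum(axis=-1)  # (C, N)
    eta = (Um * d2).sum(axis=1) / Um.sum(axis=1)              # cluster "size"
    # u_ik = 1 / (1 + (d_ik^2 / eta_i)^(1/(m-1))): no sum over clusters, so
    # the memberships of a point are independent across clusters
    U_new = 1.0 / (1.0 + (d2 / eta[:, None]) ** (1.0 / (m - 1.0)))
    return V, U_new
```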
Effect of fuzziness index
- As m increases, the clusters overlap more and the memberships approach
  each other; as m decreases towards 1, the clusters overlap less and the
  partition becomes crisp.
- Often m = 2 is selected.
(figure: membership functions for a small and a large value of m)

Issues in fuzzy clustering
- Normalization
- Initialization
- Cluster volumes
- Number of clusters
- Cluster convexity
- Selection of parameters (e.g. fuzziness m)
- Categorical variables
- Outliers
- Missing values
Normalization
- How do we compare measurements on different scales?
- Data box normalization:
  $$x'_{jl} = \frac{x_{jl} - \min_j x_{jl}}{\max_j x_{jl} - \min_j x_{jl}}, \quad j = 1, \dots, N, \quad l = 1, \dots, n$$
- Standard deviation normalization:
  $$x'_{jl} = \frac{x_{jl} - \bar{x}_l}{\sigma_l}, \quad j = 1, \dots, N, \quad l = 1, \dots, n$$
- Adaptive distance metrics, as in Gustafson-Kessel clustering, are less
  sensitive to normalization.
(a code sketch of both normalizations follows below)

Initialization
- How do we avoid local minima during the optimization?
  - Randomly select a set of cluster prototypes V
  - Randomly select a set of data points as cluster centers V
  - Randomly initialize the partition matrix U
  - Use information (e.g. cluster center locations) from a separate
    clustering step
  - Initialize centers far away from the data
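Both normalizations above are one-liners in NumPy (assuming rows are samples and columns are features; function names are mine):

```python
import numpy as np

def databox_normalize(X):
    """Data box normalization: scale every feature (column) to [0, 1]."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    return (X - lo) / (hi - lo)

def std_normalize(X):
    """Standard deviation normalization: zero mean, unit std per feature."""
    return (X - X.mean(axis=0)) / X.std(axis=0)
```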
Cluster volumes
- How large should the clusters be?
  - Extent of the clusters
  - Data density and distribution
  - Size of the cluster prototypes
- Cluster volume can be a parameter in Gustafson-Kessel clustering
(figure: FCM cluster centres on a data set with clusters of different sizes)

Cluster validity
- How good are the clustering results?
  - Correct number of clusters?
  - Well-separated clusters?
  - Compact clusters?
- Cluster validity measures try to quantify the answers to these questions
  in a formula
- Optimal number of clusters at a local minimum of the validity measure
Validity measures
- Gath and Geva index (fuzzy hypervolume):
  $$S_G = \sum_{i=1}^{C} \sqrt{\det F_i}, \qquad F_i = \frac{\sum_{k=1}^{N} \mu_{ik}^m (x_k - v_i)(x_k - v_i)^T}{\sum_{k=1}^{N} \mu_{ik}^m}$$
- Xie-Beni index:
  $$S_X = \frac{\sum_{i=1}^{C} \sum_{k=1}^{N} \mu_{ik}^m \, d^2(x_k, v_i)}{N \min_{i \neq j} \|v_i - v_j\|^2}$$

Validity measures - example
(figures: the Gath and Geva validity index versus the number of clusters,
with the minimum marking the optimal number, and the corresponding local
models obtained with Gustafson-Kessel clustering)
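The Xie-Beni index above is straightforward to compute from a fuzzy c-means result; a sketch (the function name is mine):

```python
import numpy as np

def xie_beni(X, U, V, m=2.0):
    """Xie-Beni validity index (sketch): compactness divided by separation;
    lower values indicate a better partition."""
    X = np.asarray(X, dtype=float)
    d2 = ((X[None, :, :] - V[:, None, :]) ** 2).sum(axis=-1)   # (C, N)
    compactness = ((U ** m) * d2).sum()
    dv2 = ((V[:, None, :] - V[None, :, :]) ** 2).sum(axis=-1)  # center distances
    np.fill_diagonal(dv2, np.inf)                              # exclude i == j
    return compactness / (len(X) * dv2.min())
```

Running the clustering for C = 2, 3, ... and plotting the index is the usual way to locate the (local) minimum that indicates a sensible number of clusters.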
CLUSTERING FOR IDENTIFICATION

Modeling based on clustering
1. Determine the relevant input and output variables/features and collect
   data.
2. Select the model structure (Mamdani, Takagi-Sugeno, ...).
3. Select the number of clusters and the clustering algorithm.
4. Cluster the data.
5. Obtain the antecedent membership functions (MF).
6. Obtain the consequents (MF or parameters).
7. Simplify the model, if necessary.
8. Validate the model.
Linguistic models from clustering
$$R_i: \ \text{If } x \text{ is } A_i \text{ then } y \text{ is } B_i, \quad i = 1, 2, \dots, K$$
- Use the fuzzy c-means algorithm.
- Cluster the data in the input-output product space.
- Membership functions are obtained by:
  - projection onto the variables,
  - membership function parameterization.
- One rule per cluster.

Membership functions
- After clustering, the membership functions are obtained by projections of
  the partition matrix values.
(figure)
Example of linguistic model
- If income is Low then tax is Low
- If income is High then tax is High
(figure)

Data for clustering
- Inputs and output:
  $$X = \begin{bmatrix} x_1^T \\ \vdots \\ x_N^T \end{bmatrix}, \quad y = \begin{bmatrix} y_1 \\ \vdots \\ y_N \end{bmatrix}, \quad Z = [X \;\; y]$$
- Example of data: what is the structure of the model?
  $$y_1(k+1) = f\big(y_1(k), y_2(k), y_2(k-1), u(k)\big)$$
  $$Z = \begin{bmatrix} y_1(2) & y_2(2) & y_2(1) & u(2) & y_1(3) \\ y_1(3) & y_2(3) & y_2(2) & u(3) & y_1(4) \\ \vdots & \vdots & \vdots & \vdots & \vdots \\ y_1(N-1) & y_2(N-1) & y_2(N-2) & u(N-1) & y_1(N) \end{bmatrix}$$
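A sketch of how the regressor matrix for this model structure could be assembled in NumPy (function and array names are mine; note the index shifts due to 0-based arrays):

```python
import numpy as np

def build_regression_data(y1, y2, u):
    """Build the regressors and target for the model structure
    y1(k+1) = f(y1(k), y2(k), y2(k-1), u(k)); Z = [X y] is then clustered."""
    N = len(y1)
    rows = []
    for k in range(1, N - 1):          # k-1 must exist and k+1 must exist
        rows.append([y1[k], y2[k], y2[k - 1], u[k], y1[k + 1]])
    Z = np.array(rows)
    return Z[:, :-1], Z[:, -1]         # regressors X, target y
```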
TS models from clustering
$$R_i: \ \text{If } x \text{ is } A_i \text{ then } y_i = f_i(x), \quad i = 1, 2, \dots, K$$
- Takagi-Sugeno order zero: $y_i = b_i$
- Takagi-Sugeno order one: $y_i = a_i^T x + b_i$
- Degree of fulfillment: $\beta_i(x) = \mu_{A_i}(x)$
- Model output given by the weighted fuzzy mean:
  $$y = \sum_{k=1}^{K} \gamma_k y_k, \quad \text{with } \gamma_k = \frac{\beta_k}{\sum_{j=1}^{K} \beta_j} \ \text{(normalized } \beta_k\text{)}$$

TS models from clustering
- Use a fuzzy clustering algorithm (one rule per cluster).
- Cluster the data in the input-output product space.
- Project the clusters onto the input variables and fit parametric membership
  functions to the projected clusters.
- Estimate the consequent parameters.
Matrix notation
$$X = \begin{bmatrix} x_1^T \\ \vdots \\ x_N^T \end{bmatrix}, \quad y = \begin{bmatrix} y_1 \\ \vdots \\ y_N \end{bmatrix}, \quad W_k = \begin{bmatrix} \gamma_k(x_1) & & \\ & \ddots & \\ & & \gamma_k(x_N) \end{bmatrix}$$
- Consequent parameters:
  - TS order zero: $\theta_k = b_k$, $k = 1, \dots, K$
  - TS order one: $\theta_k = [a_k^T \;\; b_k]^T$
- Extended regression matrix: $X_e = [X \;\; \mathbf{1}]$

Estimation of consequents
- Global least squares:
  $$W = [W_1 X_e \;\; W_2 X_e \;\; \cdots \;\; W_K X_e], \quad \theta = [\theta_1^T \; \theta_2^T \; \cdots \; \theta_K^T]^T$$
- Resulting least-squares problem: $y = W\theta + \varepsilon$
- Solution: $\theta = (W^T W)^{-1} W^T y$
- Solution of local least squares:
  $$\theta_k = (X_e^T W_k X_e)^{-1} X_e^T W_k \, y$$
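A sketch of the local weighted least-squares solution above, together with the weighted fuzzy-mean output of the resulting TS model. It assumes the normalized degrees of fulfillment Gamma have already been computed (e.g. from the projected clusters); function names are mine:

```python
import numpy as np

def ts_consequents_local(X, y, Gamma):
    """Local weighted least squares for first-order TS consequents (sketch).
    X: (N, n) inputs, y: (N,) outputs, Gamma: (K, N) normalized degrees of
    fulfillment. Returns theta of shape (K, n+1), rows [a_k^T, b_k]."""
    Xe = np.hstack([X, np.ones((len(X), 1))])          # extended regressors
    theta = []
    for gk in Gamma:
        W = np.diag(gk)                                # weighting matrix W_k
        theta.append(np.linalg.solve(Xe.T @ W @ Xe, Xe.T @ W @ y))
    return np.array(theta)

def ts_predict(X, Gamma, theta):
    """Weighted fuzzy-mean output: y = sum_k gamma_k(x) (a_k^T x + b_k)."""
    Xe = np.hstack([X, np.ones((len(X), 1))])
    return (Gamma * (theta @ Xe.T)).sum(axis=0)        # (K,N)*(K,N) -> (N,)
```

Local estimation fits each rule to its own data region; global least squares instead stacks the weighted regressors into one problem and tends to give a better global fit at the cost of less interpretable local models.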
Example: TS order one
- The consequents can approximate local linear models of the system:
  $$y_1 = a_1 x + b_1, \quad y_2 = a_2 x + b_2, \quad y_3 = a_3 x + b_3$$
  with antecedent fuzzy sets $A_1$, $A_2$, $A_3$.
(figure)

Example: TS model
(figure: (a) the local linear models fitted to the data, output y versus
input x; (b) the corresponding cluster membership functions over the input)
Example: TS order zero
(figure: a nonlinear function y = f(x) approximated by constant consequents
b_1, ..., b_4 over the antecedent fuzzy sets A_1, ..., A_4)

Example: pressure control
$$\frac{dP}{dt} = \frac{RT}{22.4\, V_h} \left[ \phi_g - \pi R_H^2 \sqrt{\frac{2 \rho_0 P \ln(P/P_0)}{K_f}} \right]$$
(figure)
Fermenter: parameters
- $R$: gas constant (8.314 J mol^-1 K^-1)
- $T$: temperature (35 K)
- $V_h$: gas volume (0.5 m^3)
- $\phi_g$: gas flow-rate (3.75 m^3 s^-1)
- $R_H$: radius of the outlet pipe (0.78 m)
- $P_0$: reference pressure (1.013 x 10^5 N m^-2)
- $\rho_0$: outside air density (1.2 kg m^-3)
- $P$: pressure in the tank (N m^-2)
- $K_f$: valve friction factor

Takagi-Sugeno fuzzy model
1. If y(k) is LOW and u(k) is OPEN then
   y(k+1) = 0.67 y(k) + 0.7 u(k) + 0.35
2. If y(k) is MEDIUM and u(k) is HALF CLOSED then
   y(k+1) = 0.8 y(k) + 0.28 u(k) + 0.7
3. If y(k) is HIGH and u(k) is CLOSED then
   y(k+1) = 0.9 y(k) + 0.7 u(k) + 0.39
Validation
(figure: validation of the model against measured data)

Example: 3-D representation
- Fuzzy partition of the input-state product space and the corresponding
  local linear models.
- Fuzzy region: if current pressure y(k) is LOW and valve u(k) is OPEN
  then ...
- Fuzzy-linear rules:
  $$y(k+1) = a_k\, y(k) + b_k\, u(k) + c_k$$
(figure: membership grades plotted over the current pressure-valve position
plane, and the resulting new-pressure surface of the local linear models)
Interpretability in fuzzy models
- Interpretability is not obtained automatically.
- Knowledge acquisition ensures transparency; when models are built from
  data, some redundancy is unavoidable.
- Redundancy manifests itself in two ways:
  - A high number of rules. Trade-off between model accuracy and model
    complexity (generalization capability vs. approximating the data).
  - Very similar fuzzy sets:
    - similarity between fuzzy sets
    - similarity to the universal fuzzy set

Redundancy in fuzzy models
- High number of rules
- Overlapping, similar fuzzy sets
- Unnecessary complexity
- Difficult to assign linguistic labels
- Less transparency and generality
Redundancy in fuzzy clustering
- Number of clusters
- Projection of the clusters onto the antecedent variables

Methods to solve redundancy
- Cluster merging
  - Method to determine the best number of clusters
  - Merge clusters that are compatible
  - Cluster again and continue merging until there are no compatible clusters
- Similarity-based simplification
  - Merge similar antecedent fuzzy sets
  - Remove sets similar to the universal set
  - Combine/merge similar consequents
  - Combine rules with equal antecedents
Cluster merging
- Select a number of clusters larger than needed and do the clustering.
- Merge clusters that are compatible.
- Cluster again and continue merging until there are no compatible clusters.
- Cluster compatibility is measured by
  - compatibility criteria: How close are the clusters? How similar are
    their characteristics? Etc.
  - similarity measures

Compatible cluster merging
Select K, the number of clusters, larger than needed. Repeat:
1. Perform the clustering.
2. Evaluate the compatibility criteria:
   - $c_{ij}^1 = |\phi_i^T \phi_j|$, close to 1 (the $\phi_i$ are the
     eigenvectors corresponding to the smallest eigenvalues of the cluster
     covariance matrices: parallel clusters)
   - $c_{ij}^2 = \|v_i - v_j\|$, close to 0 (close cluster centers)
Compatible cluster merging
3. Compute the fuzzy evaluations of parallelism and closeness, $\mu_{ij}^1$
   and $\mu_{ij}^2$ (membership functions "parallel" over the inner product
   and "close" over the distance).
4. Compute the compatibility $s_{ij} = \mu_{ij}^1 \cdot \mu_{ij}^2$.
5. Compute the transitive closure of the compatibility matrix S and
   threshold it with $s^*$.
6. Merge the clusters in each group of mutually compatible clusters if their
   centers are sufficiently close compared to the distances to the remaining
   clusters.
Until no clusters can be merged.
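A sketch of the compatibility computation in steps 2-4, with simple ramp membership functions standing in for "parallel" and "close"; the thresholds and function name are assumptions, not values from the slides:

```python
import numpy as np

def cluster_compatibility(V, F, gamma1=0.95, gamma2=0.05):
    """Pairwise compatibility of clusters from the alignment of their
    smallest eigenvectors and the distance of their centers (sketch).
    V: (C, n) centers, F: list of C (n, n) covariance matrices."""
    C = len(V)
    phi = []
    for Fi in F:                      # eigenvector of the smallest eigenvalue
        w, E = np.linalg.eigh(Fi)     # eigh returns ascending eigenvalues
        phi.append(E[:, 0])
    dmax = max(np.linalg.norm(V[i] - V[j])
               for i in range(C) for j in range(C))   # assumes distinct centers
    s = np.zeros((C, C))
    for i in range(C):
        for j in range(C):
            c1 = abs(phi[i] @ phi[j])                 # parallelism, close to 1
            c2 = np.linalg.norm(V[i] - V[j]) / dmax   # closeness, close to 0
            mu1 = np.clip((c1 - gamma1) / (1 - gamma1), 0, 1)   # "parallel"
            mu2 = np.clip(1 - c2 / gamma2, 0, 1)                # "close"
            s[i, j] = mu1 * mu2                       # compatibility s_ij
    return s
```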
Example
(figures: local models and cluster membership functions for the initial
clustering and after compatible cluster merging with 6 clusters; panels (a)
show the local models over the input, panels (b) the membership degrees)

Example
(figure: (a) the final local models over the input; (b) the corresponding
membership degrees)
Similarity-based simplification
- Reduce redundancy in the rule and term sets:
  - Merge similar antecedent fuzzy sets: creates generalized concepts and
    reduces the number of terms.
  - Remove sets similar to the universal set (which always fires): reduces
    the number of terms.
  - Combine/merge similar consequents: reduces the number of consequent
    values.
  - Combine rules with equal antecedents: reduces the number of rules.

Similarity measures
- Similarity measure:
  $$S(A, B) = \frac{|A \cap B|}{|A \cup B|}$$
- Merging of similar sets gives a generalizing term.
- Merging by aggregating the parameters of the individual sets.
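The similarity measure and the merging step, sketched on a discretized domain; the merging threshold and the triangular set shapes are illustrative:

```python
import numpy as np

def fuzzy_set_similarity(muA, muB):
    """S(A, B) = |A intersect B| / |A union B| for two fuzzy sets given by
    their membership values on a common discretized domain."""
    return np.minimum(muA, muB).sum() / np.maximum(muA, muB).sum()

# Example: merge two similar triangular sets by averaging their parameters
x = np.linspace(0, 10, 201)
muA = np.clip(1 - np.abs(x - 4.0) / 2, 0, 1)
muB = np.clip(1 - np.abs(x - 4.5) / 2, 0, 1)
if fuzzy_set_similarity(muA, muB) > 0.5:      # merging threshold (assumption)
    muMerged = np.clip(1 - np.abs(x - 4.25) / 2, 0, 1)
```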
Simplification and reduction
(figure)

Example: algae growth
- Prediction of the chlorophyll-a concentration in lake ecosystems.
- 998 observations from nine different lakes.
- Inputs: temperature T, nutrients N, P, Si, day length D, light intensity I.
- Output: chlorophyll-a concentration Chl.
- Takagi-Sugeno model:
  If T is A_T, N is A_N, P is A_P, Si is A_Si, D is A_D and I is A_I
  then Chl = p_0 + p_1 T + p_2 N + p_3 P + p_4 Si + p_5 D + p_6 I
- Method: fuzzy clustering and similarity analysis.
Initial rule base
- Seven rules with a total of 42 fuzzy sets.
- Difficult to assign linguistic labels; inspection is virtually impossible.
(figure)

Simplified rule base
(figure)