Multi-label Classification Using Rule-Based Classifier Systems
Shabnam Nazmi (PhD candidate)
Department of Electrical and Computer Engineering, North Carolina A&T State University
Advisor: Dr. A. Homaifar
Outline
- Motivation
- Introduction
- Multi-label classification overview
- Confidence level in prediction
- Multi-label classification using learning classifier systems (LCSs)
- Simulation results
- Conclusion and future work
Motivation
- Data-driven techniques are ubiquitous in applications such as classification, estimation, and modeling
- In some classification applications, samples in the data set belong to more than one class simultaneously
- Multi-label classification methods that solve the problem as a single learning task have an advantage over decompositions into many independent problems
- The level of confidence in the labels assigned to the samples is vital for training an accurate model
- When modeling a dynamical system, the overlap among adjacent sub-models can be handled using multi-label data with appropriate confidence levels
Introduction
- Multi-class classification
- Multi-label classification
Introduction: multi-class classification
- In contrast to simple binary classification, each instance of the data set belongs to one of M > 2 different classes
- The goal is to construct a function which, given a new data point, correctly predicts the class to which the point belongs
- One-vs-all: trains M binary classifiers, one for each class
- One-vs-one: trains M(M - 1)/2 classifiers, one to distinguish each pair of classes
- Common learners: decision trees, naive Bayes, neural networks, etc.
Introduction: multi-label classification
- In contrast to conventional (single-label) classification, the setting of multi-label classification (MLC) allows an instance to belong to several classes simultaneously
- Multi-label classification tasks are ubiquitous in real-world problems:
  - Text categorization: each document may belong to several predefined topics
  - Bioinformatics: one protein may have many effects on a cell when predicting its functional classes
Definitions
- Notation: $D$ is a multi-label data set; $H: X \to 2^Y$; $Y_i \subseteq Y$; $Y = \{y_1, y_2, \ldots, y_l\}$
- Label cardinality of $D$: the average number of labels of the examples in $D$: $LCard(D) = \frac{1}{|D|} \sum_{i=1}^{|D|} |Y_i|$
- Label density of $D$: the label cardinality divided by $|Y|$: $LDen(D) = \frac{1}{|D|} \sum_{i=1}^{|D|} \frac{|Y_i|}{|Y|}$ (see the sketch below)
- Hamming loss: $HL(H, D) = \frac{1}{|D|} \sum_{i=1}^{|D|} \frac{|Y_i \,\Delta\, H(x_i)|}{|Y|}$, where $\Delta$ denotes the symmetric difference
- Ranking loss: $RL(f) = \frac{1}{|D|} \sum_{i=1}^{|D|} \frac{|\{(y_a, y_b) \in Y_i \times \bar{Y}_i : f(x_i, y_a) \le f(x_i, y_b)\}|}{|Y_i| |\bar{Y}_i|}$, the average fraction of relevant/irrelevant label pairs ordered incorrectly
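A minimal sketch of these measures in Python, assuming 0/1 label matrices of shape (n_samples, n_labels); the toy data is illustrative:

```python
import numpy as np

def label_cardinality(Y):
    """Average number of labels per example."""
    return Y.sum(axis=1).mean()

def label_density(Y):
    """Label cardinality divided by the total number of labels |Y|."""
    return label_cardinality(Y) / Y.shape[1]

def hamming_loss(Y_true, Y_pred):
    """Fraction of label assignments that disagree (symmetric difference / |Y|)."""
    return np.not_equal(Y_true, Y_pred).mean()

# Example: 3 samples, 4 labels
Y_true = np.array([[1, 0, 1, 0], [0, 1, 0, 0], [1, 1, 0, 1]])
Y_pred = np.array([[1, 0, 0, 0], [0, 1, 0, 0], [1, 0, 0, 1]])
print(label_cardinality(Y_true), label_density(Y_true), hamming_loss(Y_true, Y_pred))
```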
MLC methods
- Problem transformation methods
- Algorithm adaptation methods
MLC methods: problem transformation
- Select family: discards multi-label data or selects one of the multiple labels for each instance
  - Discards a lot of the information content in the original data set
- Label power-set method: treats each distinct set of labels as a single label
  - May lead to a large number of classes with few examples per class
- Binary relevance: learns $|Y|$ binary classifiers, one for each label (see the sketch below)
  - The most common problem transformation method
- Ranking by pairwise comparison: generates $\binom{|Y|}{2}$ binary data sets, one for each pair of labels
  - Outputs a ranking of labels based on the votes from the binary classifiers
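A minimal binary-relevance sketch; the base learner (scikit-learn's LogisticRegression) and the toy data are illustrative choices, not part of the original slides:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

class BinaryRelevance:
    """Trains one independent binary classifier per label."""
    def fit(self, X, Y):
        # Y is an (n_samples, n_labels) 0/1 relevance matrix
        self.models = [LogisticRegression().fit(X, Y[:, j]) for j in range(Y.shape[1])]
        return self

    def predict(self, X):
        # Stack the per-label predictions back into a relevance matrix
        return np.column_stack([m.predict(X) for m in self.models])

X = np.random.rand(100, 5)                        # toy features
Y = (np.random.rand(100, 3) > 0.5).astype(int)    # toy labels, 3 labels
Y_hat = BinaryRelevance().fit(X, Y).predict(X)
```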
MLC methods: problem transformation (cont.)
- Random k-labelsets (RAkEL): breaks the initial set of labels into small random subsets, either disjoint or overlapping (see the sketch below)
  - Improves label power-set results, but is still challenged by domains with a large number of labels and instances
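A sketch of the subset-drawing step of random k-labelsets; the helper name and parameters are illustrative, and the downstream power-set classifiers are omitted:

```python
import random

def random_k_labelsets(labels, k, m, overlapping=True, seed=0):
    """Draw label subsets of size k; each subset is then handled by one
    label power-set classifier, and predictions are combined by voting."""
    rng = random.Random(seed)
    labels = list(labels)
    if overlapping:
        return [tuple(sorted(rng.sample(labels, k))) for _ in range(m)]
    rng.shuffle(labels)                      # disjoint: a random partition
    return [tuple(labels[i:i + k]) for i in range(0, len(labels), k)]

print(random_k_labelsets(["y1", "y2", "y3", "y4", "y5", "y6"], k=2, m=4))
```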
MLC methods: algorithm adaptation
- Decision trees: C4.5 was adapted to learn multi-label data, producing models that are understandable by humans
- Probabilistic methods: proposed for text classification; a generative model is trained according to which each label generates different words
  - A multi-label document is generated by a mixture of the word distributions of its labels, trained using EM
- Neural networks: the back-propagation algorithm is adapted by introducing a new error function similar to the ranking loss
- Lazy methods: the k-nearest-neighbors algorithm is used to maximize the posterior probability of the labels assigned to new instances (a simplified sketch follows)
  - Outputs a ranking function for the probability of each label
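A simplified lazy-method sketch: a plain neighbor-frequency score rather than the full Bayesian posterior used by methods such as ML-kNN; the toy data is illustrative:

```python
import numpy as np

def knn_label_scores(X_train, Y_train, x, k=5):
    """Score each label by the fraction of the k nearest neighbors that
    carry it; the scores induce a ranking over the labels."""
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]
    return Y_train[nearest].mean(axis=0)

X_train = np.random.rand(50, 4)                       # toy features
Y_train = (np.random.rand(50, 3) > 0.5).astype(int)   # toy labels, 3 labels
print(knn_label_scores(X_train, Y_train, np.random.rand(4)))
```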
MLC methods: algorithm adaptation (cont.)
- Support vector machines: the one-versus-one strategy is used to partition a data set with $|Y|$ labels into $\binom{|Y|}{2}$ double-label subsets
  - Assumes double-label instances are located in the marginal region between positive and negative instances
- Associative classification methods: construct classification rule sets using association rule mining
  - MMAC learns an initial set of rules, removes the examples covered by this rule set, and recursively learns a new rule set from the remaining examples until no frequent items are left
Confidence in prediction
- The AdaBoost algorithm has been extended to generate a confidence degree for the predictions of weak hypotheses
  - Confidence scores give a measure of the reliability of each prediction
- Classification methods such as probabilistic approaches and logistic regression output a value interpreted as the probability of a label being true
- The idea of confidence in prediction can be extended to one step prior to training:
  - Incorporate confidence levels into the training data, provided by the expert
  - The hypothesis then learns these confidence levels and outputs a confidence degree along with its predicted labels for new instances
Notation
- $X$ denotes the instance space and $Y = \{y_1, y_2, \ldots, y_k\}$ is the finite set of class labels
- Each instance $x \in X$ is associated with a subset of labels $Y_x \subseteq Y$
- $D$ is the data set $D = \{(x_1, \lambda_1, C_1), (x_2, \lambda_2, C_2), \ldots, (x_n, \lambda_n, C_n)\}$
- $\lambda_i$ is the binary relevance vector of labels for instance $x_i$: $\lambda_{i,j} = 1$ if $y_j \in Y_{x_i}$ and $0$ otherwise, for $i \in [1, n]$, $j \in [1, k]$
- $C_i$ is the vector of confidence levels associated with the labels of $x_i$
- $H: X \to (\hat{Y}, W)$ outputs a set of predicted labels $\hat{Y}$ along with a vector $W$ of the hypothesis's confidence in each of the labels
LCS structure
- A strength-based, Michigan-style classifier system is used to extract knowledge from multi-label data
- Michigan-style classifier systems are rule-based, supervised learning systems with a fixed rule length
- A genetic algorithm acts as the driving force that helps evolve useful rules
- The classification model consists of a population of rules of the form IF condition THEN action
- Originally structured for learning binary classification problems
- The isolated structure of the action part of the classifiers allows further modifications to adapt to more general classification problems, namely multi-class and multi-label
LCS structure
[Diagram: data set -> training instance -> match set [M] (covering if [M] is empty) -> conflict resolution (CR) -> update rule parameters; the genetic algorithm operates on the population [P], which forms the final model]
- Data set: a set of triples of the form (sample, label, confidence level)
- Training instance: an individual drawn at random from the data set
LCS structure (cont.)
- [P]: the population of rules (classifiers)
- Classifier parameters:
  - Condition
  - Action
  - Strength ($S$)
  - Confidence estimate $W = (w_1, w_2, \ldots, w_k)$
  - Confidence error ($\varepsilon$)
LCS structure (cont.)
- Condition:
  - For binary-valued attributes, composed of symbols from {0, 1, #}
  - For real-valued attributes, takes the form of an ordered list of (center, spread) pairs $(c_i, s_i)$
LCS structure (cont.)
- Action: an ordered list over {0, 1}
  - Example: the label set for a sample drawn from a four-class data set: "0110"
  - Confidence level for this label set: $C = [0, 1, 0.9, 0]$
LCS structure (cont.)
- [M]: the set of classifiers matching the provided instance, i.e., $c_i - s_i < x_i < c_i + s_i$ for every attribute
- Covering: creates a matching classifier if [M] is empty (see the sketch below)
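A minimal sketch of matching and covering for real-valued conditions; the rule representation as a dictionary, the spread initialization, and the initial parameter values are assumptions:

```python
import numpy as np

def matches(center, spread, x):
    """A rule matches when every attribute falls inside (c_i - s_i, c_i + s_i)."""
    return bool(np.all((center - spread < x) & (x < center + spread)))

def cover(x, action, confidence, s0=0.3):
    """Covering: create a rule centered on an instance that no rule matched."""
    rng = np.random.default_rng()
    spread = rng.uniform(0.05, s0, size=x.shape)   # assumed random spread per attribute
    return {"center": x.copy(), "spread": spread,
            "action": np.asarray(action),
            "W": np.asarray(confidence, dtype=float),  # assumed: W seeded from instance
            "strength": 1.0, "eps": 0.0}               # assumed initial values
```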
LCS structure (cont.)
- CR: conflict resolution
  - Uses bidding to identify the classifier that gets to classify the instance: $B = S \mu e^{-\alpha \varepsilon}$ (sketched below)
  - $\mu$ is a function of the specificity (generality) of the classifier
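A sketch of the bid computation, reusing the rule dictionaries from the covering sketch; the slides only state that mu depends on specificity, so its exact form here is an assumption:

```python
import numpy as np

def bid(rule, alpha=1.0):
    """B = S * mu * exp(-alpha * eps)."""
    mu = 1.0 / (1.0 + rule["spread"].mean())   # assumed specificity measure: narrower = higher
    return rule["strength"] * mu * np.exp(-alpha * rule["eps"])
```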
LCS structure (cont.)
- [A]: classifiers in [M] having the same action as the winning classifier; the remaining classifiers form [M] \ [A]
- Genetic algorithm: randomly picks two classifiers from [A] and creates two offspring
  - Offspring are inserted into the population [P]
LCS structure (cont.)
- The genetic algorithm favors classifiers with a higher fitness value and a lower confidence-estimate error simultaneously
LCS structure (cont.)
- Taxes are deducted from the classifiers in both sets
- Confidence error: $\varepsilon_i = \lVert W_i - C \rVert_1$
- Delta-rule update scheme: $W_i \leftarrow W_i + \beta (C - W_i)$
- Fitness- and error-proportionate resource-sharing scheme: $R_i = \dfrac{S_i e^{-\alpha \varepsilon_i}}{\sum_j S_j e^{-\alpha \varepsilon_j}} R_0$ (a sketch of these updates follows)
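A sketch of one update step under these formulas, reusing the rule dictionaries from the earlier sketches; beta, alpha, R0, and how the received resource share is applied to strength are assumptions, and the tax deduction mentioned above is omitted:

```python
import numpy as np

def update_action_set(action_set, C, alpha=1.0, beta=0.2, R0=1.0):
    """Update the rules that share the action chosen by conflict resolution."""
    C = np.asarray(C, dtype=float)
    for r in action_set:
        r["W"] = r["W"] + beta * (C - r["W"])   # delta rule: W <- W + beta (C - W)
        r["eps"] = np.abs(r["W"] - C).sum()     # eps = ||W - C||_1
    # Resource sharing: each rule receives its share R_i of the total R0;
    # adding the share to strength is an assumption of this sketch.
    z = sum(r["strength"] * np.exp(-alpha * r["eps"]) for r in action_set)
    for r in action_set:
        r["strength"] += r["strength"] * np.exp(-alpha * r["eps"]) / z * R0
```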
LCS structure (cont.)
- Model: the population of trained classifiers (rules) that collectively solve the classification problem after a sufficient number of training iterations
Performance measures
- Hamming loss is employed as the measure of accuracy and is plotted against training iterations
- The average confidence-estimate error of the population is also plotted against training iterations
- In the test stage (see the sketch below):
  - The prediction of the model is generated from the votes of the classifiers that match the instance
  - The confidence level of the classification is reported as the weighted average of the confidence estimates of the matching classifiers
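A sketch of the test-stage prediction, assuming the rule dictionaries and the matches() helper from the earlier sketches; the 0.5 vote threshold and strength-weighting of the confidence average are assumptions:

```python
import numpy as np

def predict(population, x, threshold=0.5):
    """Vote among matching rules; return predicted labels and confidence."""
    M = [r for r in population if matches(r["center"], r["spread"], x)]
    if not M:
        return None, None                                  # no rule covers this instance
    votes = np.mean([r["action"] for r in M], axis=0)      # per-label vote share
    weights = [r["strength"] for r in M]
    W = np.average([r["W"] for r in M], axis=0, weights=weights)
    return (votes >= threshold).astype(int), W
```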
Simulation results
- Artificial binary-valued data set: five attributes and two classes
- Artificial real-valued data set: four attributes and two classes; attribute range is (-0.5, 0.5)
Simulation results: Iris data
- A three-class data set with 50 samples per class
- All data are used for training; results are averaged over 10 runs

  Method               Accuracy (%)
  OVO SVM              97.33
  MLP                  99.48
  Logistic Regression  98
  Random Forest        100
  LCS                  98
Conclusion and future work
- A strength-based learning classifier system is employed to design an embedded MLC algorithm
- The classifier structure is adapted to handle confidence levels in the labels provided in the training set
- The model is tested on one real-world data set and two artificial data sets, and results are provided
- Appropriate performance measures for test accuracy need to be implemented
- The MLC method discussed here will be extended to the accuracy-based classifier system (UCS)
Thank you for your attention!
Your questions are welcome and feedback is appreciated!