Advertisement Image Detection

Zhaoxi Li, Yu Wang, Bing Yue
Software Engineering and Computer Science
4/6TF3 Data Mining: Concepts and Algorithms
Course Project: Advertisement Image Detection
Instructor: Dr. Jiming Peng
April 14, 2005

Table of Contents

Introduction
    Goal
    Dataset
Dataset Preprocessing
    Missing Values and Outliers
    Miscellaneous Issues
Learning Algorithms
    Algorithm 1: Naive Bayes
    Algorithm 2: SVM (Support Vector Machine)
    Algorithm 3: Stacking
    Algorithm 4: C4.5 (C4.5 Quinlan)
Result and Post Processing
Conclusion
Appendix A: Figures and Tables
Appendix B: Results
Appendix C: Source Code and Scripts
References

Introduction

Goal

Web browsing is currently the most common way for people to gain information. The content aggregated into web pages consists mostly of text and images, and images advertising products become annoying when one wants to concentrate on the actual content. The goal of this project is to determine whether a given image is an advertisement image or not. The project aims to investigate and apply learning algorithms to perform this classification task.

Dataset

The dataset for the project was created and donated by Dr. Nicholas Kushmerick of the Computer Science Department at University College Dublin. It contains a set of advertisements found on the Internet. These advertisements are described mostly by image geometry and by phrases contained in the page's URL, the image's URL, the alt text, the anchor text, and words occurring near the anchor text. It is a two-class dataset: the class attribute is boolean, either ad or nonad, corresponding to "it is an advertisement" and "it is not an advertisement". There are 3279 instances, of which 2821 are nonads and 458 are ads. There are 1558 attributes, of which 3 are continuous and the rest are binary. The percentage of missing values in the dataset is 28%; these missing values are represented as "unknown". The attributes are listed in Table 1. To illustrate how the data are stored, two examples are shown in the following tables.

Image A (an advertisement image)
    Height: 60            URL_Yahoo.com: No
    Width: 120            URL_AdExpress.com: Yes
    Ratio: 0.5            URL_mcmaster.ca: No
    Local: No             AltText_Free: Yes
    Caption_$: Yes        AltText_Money: Yes
    Caption_Figure: No    AltText_photo: No

Image B (a non-advertisement image)
    Height: 300           URL_Yahoo.com: No
    Width: 400            URL_AdExpress.com: No
    Ratio: 0.75           URL_mcmaster.ca: Yes
    Local: Yes            AltText_Free: No
    Caption_$: No         AltText_Money: No
    Caption_Figure: Yes   AltText_photo: Yes

The attributes shown in the two examples are not the full set of attributes contained in the dataset, as the number of attributes is very large, but the examples illustrate the idea. The first example describes an image whose dimensions have a ratio of 0.5, which is a plausible banner ratio. In addition, the image comes from a remote server, AdExpress.com, instead of the local server, and the keywords "Free" and "Money" as well as the dollar sign "$" appear in its alternative and caption texts. Such evidence gives a strong indication that this image is an advertisement. In contrast, the second example describes an image with a dimension ratio of 0.75. It comes from the local server where the web page is stored, and the keywords in its alternative text and captions are ordinary words rather than the keywords found in the first example. This image is classified as a non-advertisement image.

Dataset Preprocessing

Preparing a data file that the learning programs can read consumes the bulk of the effort invested in the entire data mining process; raw data is usually unformatted and of low quality. When setting the outlier detection boundary, it is important to make the accepted range large enough that valid data instances are not excluded (false positives, FP), while the boundary must still correctly eliminate true outliers (false negatives, FN).

Missing Values and Outliers

Missing Values Strategies:
a) Numeric attributes: We want to preserve the consistency of the data set, so we replace a missing numeric value instead of simply ignoring it. Our strategy is to replace the missing value with the mean of the existing values of that attribute.
b) Nominal attributes: Our strategy is to replace a missing nominal value with the most frequently occurring value of that attribute. This preserves the consistency of the data set.
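To make the two replacement strategies concrete, the short sketch below uses Weka's ReplaceMissingValues filter, which performs exactly this mean (numeric) and mode (nominal) substitution. It is only an illustration: the file name ad.arff and the wrapper class are assumptions, not the code actually used in the project.

import java.io.BufferedReader;
import java.io.FileReader;

import weka.core.Instances;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.ReplaceMissingValues;

// Minimal sketch: replace missing numeric values with the attribute mean and
// missing nominal values with the attribute mode, as described above.
// Assumes the dataset has been converted to ARFF as "ad.arff" (hypothetical name).
public class FillMissing {
    public static void main(String[] args) throws Exception {
        Instances data = new Instances(new BufferedReader(new FileReader("ad.arff")));
        data.setClassIndex(data.numAttributes() - 1);   // class attribute (ad/nonad) is last

        ReplaceMissingValues filter = new ReplaceMissingValues();
        filter.setInputFormat(data);                    // learn the means/modes from the data
        Instances cleaned = Filter.useFilter(data, filter);

        System.out.println("Instances after imputation: " + cleaned.numInstances());
    }
}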

Outliers Strategies:
a) Numeric attributes: Form a range of plausible values for each attribute; any value outside the predefined range is considered an outlier.
b) Nominal attributes: Since this is a fairly large data set compared to the data set we used in Assignment 3, and the missing values are only a small portion of the data, our strategy for nominal attributes is simply to treat any unexpected value as an outlier.

Method: if Cr = -1 or Cr = 1 then remove attribute B, where

    Cr     = sum_i (A_i - avg(A)) * (B_i - avg(B)) / ( (n - 1) * dev(A) * dev(B) )
    dev(A) = SQRT( sum_i (A_i - avg(A))^2 / (n - 1) )
    dev(B) = SQRT( sum_i (B_i - avg(B))^2 / (n - 1) )
    avg(A) = ( sum_i A_i ) / n
    avg(B) = ( sum_i B_i ) / n

Miscellaneous Issues

Data Cleansing: First, we convert nominal attributes to numeric attributes. Then we apply the following rules:
a) R(A,B) > 0.9   ==>  delete B (B is redundant)
b) R(A,B) < 0.01  ==>  delete B (B is irrelevant)
c) dev(A) = 0     ==>  delete A (A is irrelevant)
d) dev(B) = 0     ==>  delete B (B is irrelevant)
where

    R(A,B) = sum_i (A_i - avg(A)) * (B_i - avg(B)) / ( (n - 1) * dev(A) * dev(B) )
    dev(A) = SQRT( sum_i (A_i - avg(A))^2 / (n - 1) )
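The following sketch illustrates the cleansing rules above on plain numeric columns: it computes avg, dev and R(A,B) as defined and flags an attribute for deletion when the thresholds are met. The toy columns and the class name are made up for illustration; the project applied these rules during preprocessing rather than with this exact code.

// Minimal sketch of the attribute-cleansing rules: compute the correlation R(A,B)
// between two (already numeric) attribute columns and flag attribute B for deletion
// when it is redundant or irrelevant. Thresholds follow the rules in the text.
public class AttributeCleansing {

    static double avg(double[] x) {
        double s = 0;
        for (double v : x) s += v;
        return s / x.length;
    }

    static double dev(double[] x) {                      // sample standard deviation
        double m = avg(x), s = 0;
        for (double v : x) s += (v - m) * (v - m);
        return Math.sqrt(s / (x.length - 1));
    }

    static double correlation(double[] a, double[] b) {  // R(A,B) as defined above
        double ma = avg(a), mb = avg(b), s = 0;
        for (int i = 0; i < a.length; i++) s += (a[i] - ma) * (b[i] - mb);
        return s / ((a.length - 1) * dev(a) * dev(b));
    }

    static boolean shouldDeleteB(double[] a, double[] b) {
        if (dev(b) == 0) return true;                    // rule d: constant attribute
        double r = correlation(a, b);
        // rules a/b; |R| also covers the Cr = -1 case from the Method above
        return Math.abs(r) > 0.9 || Math.abs(r) < 0.01;
    }

    public static void main(String[] args) {
        double[] height = {60, 300, 120, 90};            // toy columns, not real data
        double[] width  = {120, 400, 240, 180};
        System.out.println("R = " + correlation(height, width)
                + ", delete width? " + shouldDeleteB(height, width));
    }
}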

Learning Algorithms

We classify our dataset using several algorithms: Naive Bayes, C4.5 (C4.5 Quinlan), Support Vector Machines and Stacking. The reasons for selecting these algorithms are their speed, common usage, accuracy, and ease of application given the structure of our dataset. The dataset has far more attributes than instances, and almost all of the attributes are nominal; these characteristics determined our final selection of classification algorithms. In addition, all algorithm evaluations are based on 10-fold cross validation, which gives a uniform platform for comparison.

Algorithm 1: Naive Bayes

Naive Bayes is a powerful classifier whose performance is directly related to the degree of feature dependence; it can achieve optimality when the attributes are independent. In our dataset, most attributes take only a small, finite number of values, which is the main reason we chose Naive Bayes. Although the classifier's probability estimates are only optimal under quadratic loss if the independence assumption holds, the classifier itself can be optimal under zero-one loss even when this assumption is violated by a wide margin: the region of quadratic-loss optimality of the Naive Bayes classifier is in fact a second-order infinitesimal fraction of its region of zero-one optimality. As a consequence, it has a broad range of applicability in real applications. Another reason for choosing this algorithm is its speed: it makes predictions by multiplying the probabilities of the attributes. The learning algorithm is based on Bayes' theorem and is well suited to our data set, since the dimensionality of the input is very high (over 1500 attributes). It is also practical for other applications such as system performance prediction and text classification. As a result, we predict that it can handle this classification task. Low entropy in the attributes implies good performance for Naive Bayes. Mathematically, the classifier can be represented by

    f_i(x) = P(C = i) * Prod_j P(X_j = x_j | C = i)

where P denotes the probability function. To classify a particular instance x, this expression is evaluated for each class and the class with the largest value is predicted.
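As an illustration of the evaluation protocol, the sketch below runs Weka's NaiveBayes under 10-fold cross validation and prints the summary statistics and confusion matrix, which is essentially what the Appendix B output reports. The wrapper class and the ARFF file name ad.arff are assumptions for the example, not the exact commands used in the project.

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.core.Instances;

// Minimal sketch: evaluate Weka's NaiveBayes with 10-fold cross validation,
// mirroring the evaluation protocol used throughout this project.
// Assumes the dataset is available as "ad.arff" with the ad/nonad class last.
public class EvaluateNaiveBayes {
    public static void main(String[] args) throws Exception {
        Instances data = new Instances(new BufferedReader(new FileReader("ad.arff")));
        data.setClassIndex(data.numAttributes() - 1);

        NaiveBayes nb = new NaiveBayes();
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(nb, data, 10, new Random(1));

        System.out.println(eval.toSummaryString());      // accuracy, error rates, kappa
        System.out.println(eval.toMatrixString());       // confusion matrix (ad vs nonad)
    }
}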

Algorithm 2: SVM (Support Vector Machine)

SVM (Weka's SMO) is one of the most popular classification algorithms in use today because of its robustness and its high overall accuracy compared to other learning algorithms. The user can choose or modify the kernel function to achieve a high classification rate, and the result is displayed using a sparse vector representation. For example, K(x, y) = exp(-||x - y||) is an exponential kernel. SVM implementations can also handle thousands of support vectors as well as hundreds of training examples. There are several ways to optimize classification performance: besides choosing a good kernel function, we can transform the original dataset into a higher-dimensional space to make the classifier more flexible, or maximize the margin. Our dataset is classified into only two classes, ad and nonad, and an SVM is designed to optimally separate two classes: the resulting maximum-margin hyperplane gives the greatest separation between them. We use the SVM to transform the input instances into a new space by a non-linear mapping. SVM is a relatively recent generation of learning algorithm whose computational efficiency has improved considerably in recent years. One of the main reasons we chose SVM is its generalization performance.

Algorithm 3: Stacking (java weka.classifiers.meta.Stacking)

We learned about stacking late in the semester; it belongs to the area of predictive data mining. We use it because we want to identify a statistical model that can be used to predict a response, even though stacking is not as widely used as the previous classification algorithms. It is useful when combining very different types of models, although it is difficult to analyze theoretically. Combining different models built with different methods can yield more accurate predictions, but we have to take into account that all of the algorithms being combined make errors. One problem we considered is that, with more classes, the number of classifiers to build grows quickly, which can lead to less accurate outcomes. As a result, we predicted that stacking would not achieve high accuracy; the comparison of the algorithms' results is discussed later.
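A corresponding sketch for the other two Weka schemes is shown below: SMO with its default kernel settings, and Stacking configured as in the Appendix B run information (ZeroR as both base and meta classifier, 10 stacking folds). Again, the ARFF file name and the wrapper class are assumed for illustration only.

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.Random;

import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.functions.SMO;
import weka.classifiers.meta.Stacking;
import weka.classifiers.rules.ZeroR;
import weka.core.Instances;

// Minimal sketch: evaluate Weka's SMO (support vector machine) and Stacking with
// 10-fold cross validation. The Stacking configuration mirrors the run information
// reported in Appendix B; "ad.arff" is an assumed file name for the dataset.
public class EvaluateSvmAndStacking {
    public static void main(String[] args) throws Exception {
        Instances data = new Instances(new BufferedReader(new FileReader("ad.arff")));
        data.setClassIndex(data.numAttributes() - 1);

        SMO smo = new SMO();                              // default kernel settings
        Evaluation svmEval = new Evaluation(data);
        svmEval.crossValidateModel(smo, data, 10, new Random(1));
        System.out.println("SMO:\n" + svmEval.toSummaryString());

        Stacking stack = new Stacking();
        stack.setNumFolds(10);                            // -X 10, as in Appendix B
        stack.setMetaClassifier(new ZeroR());             // -M weka.classifiers.rules.ZeroR
        stack.setClassifiers(new Classifier[] { new ZeroR() });  // -B weka.classifiers.rules.ZeroR
        Evaluation stackEval = new Evaluation(data);
        stackEval.crossValidateModel(stack, data, 10, new Random(1));
        System.out.println("Stacking:\n" + stackEval.toSummaryString());
    }
}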

Algorithm 4: C4.5 (C4.5 Quinlan)

C4.5 is an extension of ID3 that improves performance and correctness in many respects, such as reduced-error pruning, avoiding overfitting, and handling attributes with different costs. A divide-and-conquer approach is applied, recursively partitioning the data into smaller subsets. The main benefit of the C4.5 algorithm is that it handles both numeric and nominal attributes and performs post-pruning automatically. We chose C4.5 for its robustness and execution speed, as well as for the ease of understanding the descriptions it generates. The entropy of an attribute reflects how well it separates the classes; a greedy algorithm driven by information gain is employed, where information gain can be described as the effective decrease in entropy. We also used post-pruning to simplify the resulting decision tree. In addition, the C4.5 algorithm is widely used in data mining tasks.

There are only three numeric attributes in this data set. We discretized each of them into two intervals, since more than one split would add complexity to the classification. A greedy selection over the resulting gain ratios is employed: for each of the three numeric attributes, the algorithm evaluates the information gain of every possible split point and chooses the best one. By applying the pruning process, the resulting tree is simplified to suppress noise in the data, even though this may reduce the overall correct classification rate on the training data.

Rule generation: We also took into account that our dataset has a huge number of attributes compared to its number of instances. Interpretability and comprehensibility are discussed together with the performance of the Naive Bayes algorithm.

The C4.5 algorithm is implemented in the C4.5 program written by Quinlan¹, which we denote "C4.5 Quinlan" to distinguish it from the C4.5 algorithm itself. The original dataset is already in the format required by C4.5 Quinlan: it consists of two files, ad.names and ad.data, which describe the attributes and the data separately. To be processed by C4.5 Quinlan we also need a testing dataset, so we split a portion of the data out of ad.data into a new file, ad.test. The testing portion is 10% of the training data, because in filtering advertisement images, classifying a non-advertisement image as an advertisement is undesirable; such a dataset is called cost sensitive². We therefore want the training data set to be much larger than the testing dataset, in order to get higher accuracy. The result of the C4.5 decision tree can be found in the following section. As can be seen from the result, there are nine subtrees in total before the decision tree is simplified, each corresponding to a separate branch of the decision. After simplification we get our final decision tree with three subtrees, which C4.5 prints separately because of the limited output width. The result is intuitive: for example, the first attribute the decision tree examines is whether the URL of the tested image contains the keyword "ads" or not. If it does, the image is almost certainly an advertisement; otherwise it is not. It is possible that in some cases a URL contains "ads" for non-advertising purposes, but such a situation is much less likely than the target case that the image is an advertisement. The detailed steps for using C4.5 Quinlan can be found in the Source Code and Scripts section of the appendix.

¹ C4.5 Quinlan can be downloaded from Ross Quinlan's web page.
² Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations [2], page 144.
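The split-point selection described above can be made concrete with a small worked example: for one numeric attribute, every candidate threshold between adjacent sorted values is scored by information gain (the decrease in class entropy), and the best threshold defines the two intervals. The tiny height/label arrays below are invented for illustration only and are not taken from the dataset.

import java.util.Arrays;

// Minimal sketch of binary split-point selection by information gain for one
// numeric attribute, as used when discretizing the three continuous attributes.
public class BestSplitPoint {

    static double entropy(int ads, int nonads) {
        int n = ads + nonads;
        if (n == 0 || ads == 0 || nonads == 0) return 0.0;
        double pa = (double) ads / n, pn = (double) nonads / n;
        return -(pa * Math.log(pa) + pn * Math.log(pn)) / Math.log(2);
    }

    public static void main(String[] args) {
        double[] height = {30, 45, 60, 60, 90, 250, 300, 400};      // toy numeric attribute
        boolean[] isAd  = {true, true, true, false, true, false, false, false};

        double[] sorted = height.clone();
        Arrays.sort(sorted);

        int totalAds = 0;
        for (boolean b : isAd) if (b) totalAds++;
        double baseEntropy = entropy(totalAds, isAd.length - totalAds);

        double bestGain = -1, bestSplit = Double.NaN;
        for (int k = 0; k < sorted.length - 1; k++) {
            double split = (sorted[k] + sorted[k + 1]) / 2;          // candidate threshold
            int leftAds = 0, leftNon = 0, rightAds = 0, rightNon = 0;
            for (int i = 0; i < height.length; i++) {
                if (height[i] <= split) { if (isAd[i]) leftAds++; else leftNon++; }
                else                    { if (isAd[i]) rightAds++; else rightNon++; }
            }
            int left = leftAds + leftNon, right = rightAds + rightNon;
            double remainder = (left * entropy(leftAds, leftNon)
                              + right * entropy(rightAds, rightNon)) / height.length;
            double gain = baseEntropy - remainder;                   // effective decrease in entropy
            if (gain > bestGain) { bestGain = gain; bestSplit = split; }
        }
        System.out.println("best split at " + bestSplit + " with gain " + bestGain);
    }
}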

Result and Post Processing

Knowledge Discovery in Databases (KDD) has become a very attractive discipline, both for research and for industry, within the last few years. Its goal is to extract "pieces" of knowledge from usually very large databases, and it prescribes a robust sequence of procedures that must be carried out in order to derive reasonable and understandable results [2], [3]. By applying appropriate learning algorithms, one obtains decision trees or decision rules from the given data set. Post processing then plays a role in analyzing the results: it applies pruning routines and rule filtering to optimize and simplify the results returned by the different learning algorithms. Compared to ID3, C4.5 Quinlan already includes several post-pruning procedures for the created decision tree, such as rule post-pruning and reduced-error pruning. Post-processing procedures and methods can be categorized into four groups: knowledge filtering, interpretation and explanation, evaluation, and knowledge integration. The pruning methods used in C4.5 Quinlan belong to the knowledge filtering group.

By comparing the classification results from the different algorithms, we select the algorithm with the highest classification accuracy as the appropriate learning algorithm for this advertisement data set. All of the tested learning algorithms use 10-fold cross validation during evaluation. Statistical analysis also helps us determine the relative performance of the different algorithms, namely a t-test and the calculation of confidence intervals for the classification accuracy.

Calculation of the confidence interval (assumptions: c = 0.9, z = 1.65, Pr[X >= z] = 5%), where f is the observed success rate over N test instances:

    upper bound = [ f + z^2/(2N) + z * SQRT( f/N - f^2/N + z^2/(4N^2) ) ] / (1 + z^2/N)
    lower bound = [ f + z^2/(2N) - z * SQRT( f/N - f^2/N + z^2/(4N^2) ) ] / (1 + z^2/N)

Measurement of performance differences between learning algorithms (t-test):

    t = m_d / SQRT( s_d^2 / k )

where m_d is the mean of the per-fold differences between the two algorithms' accuracies (Mean_1 - Mean_2), s_d is the standard deviation of those differences, and k is the number of folds. Comparing t against the critical value z:

    t < -z or t > z  ==>  the performance difference is significant
    otherwise        ==>  the performance difference is insignificant
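A small sketch of both statistical checks is given below: the confidence-interval bounds for an observed accuracy f over N instances, and the paired t-statistic computed from per-fold accuracy differences. The accuracy 0.93 and N = 2359 echo numbers reported elsewhere in this report, while the per-fold accuracies are made-up values for illustration only.

// Minimal sketch of the confidence interval and paired t-test described above.
public class SignificanceChecks {

    // Confidence interval bounds for success rate f over n instances (z = 1.65 for c = 0.9).
    static double[] confidenceInterval(double f, int n, double z) {
        double radius = z * Math.sqrt(f / n - f * f / n + z * z / (4.0 * n * n));
        double denom = 1 + z * z / n;
        double lower = (f + z * z / (2.0 * n) - radius) / denom;
        double upper = (f + z * z / (2.0 * n) + radius) / denom;
        return new double[] { lower, upper };
    }

    // Paired t-statistic from per-fold accuracy differences d_i = acc1_i - acc2_i.
    static double pairedT(double[] acc1, double[] acc2) {
        int k = acc1.length;
        double meanDiff = 0;
        for (int i = 0; i < k; i++) meanDiff += acc1[i] - acc2[i];
        meanDiff /= k;
        double var = 0;
        for (int i = 0; i < k; i++) {
            double d = acc1[i] - acc2[i] - meanDiff;
            var += d * d;
        }
        var /= (k - 1);
        return meanDiff / Math.sqrt(var / k);
    }

    public static void main(String[] args) {
        double[] ci = confidenceInterval(0.93, 2359, 1.65);
        System.out.println("93% accuracy on 2359 instances, 90% CI: ["
                + ci[0] + ", " + ci[1] + "]");

        double[] algo1 = {0.93, 0.92, 0.94, 0.93, 0.92, 0.95, 0.93, 0.91, 0.94, 0.93};
        double[] algo2 = {0.89, 0.90, 0.88, 0.91, 0.89, 0.90, 0.88, 0.90, 0.89, 0.91};
        System.out.println("t = " + pairedT(algo1, algo2) + " (significant if |t| > z)");
    }
}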

Detailed output of the results can be found in Appendix B: Results.

Conclusion

In conclusion, using C4.5 gives us a very high confidence interval and a low error rate. Of the other approaches we attempted, namely Naive Bayes, SVM and stacking, SVM achieves the highest performance. The classification rules developed in this project correctly determined, for 93% of the web images, whether they are advertisements or not, and C4.5 classifiers are the best for advertisement image prediction. It would be possible for a company that filters advertisement spam to employ these classification rules in its products. In the future, if the database grows large enough, the same approach can be used to achieve even higher performance in blocking advertisement images on the Internet.

Appendix A: Figures and Tables

    Attribute Name          Number of Same Attributes    Attribute Type
    height                  1                            Continuous
    width                   1                            Continuous
    aratio                  1                            Continuous
    local                   1                            Boolean (0,1)
    url*images+buttons      457                          Boolean (0,1)
    origurl*labyrinth       495                          Boolean (0,1)
    ancurl*search+direct    472                          Boolean (0,1)
    alt*your                111                          Boolean (0,1)
    caption*and             19                           Boolean (0,1)
    Total:                  1558

Table 1: Attribute List

Appendix B: Results

Weka Naive Bayes Output:

Time taken to build model: 2.44 seconds

=== Stratified cross-validation ===
=== Summary ===

Correctly Classified Instances                     %
Incorrectly Classified Instances                   %
Kappa statistic
Mean absolute error
Root mean squared error
Relative absolute error                            %
Root relative squared error                        %
Total Number of Instances           2359

=== Detailed Accuracy By Class ===

TP Rate   FP Rate   Precision   Recall   F-Measure   Class

                                                      ad
                                                      nonad

=== Confusion Matrix ===

   a   b   <-- classified as
           a = ad
           b = nonad

SVM (Weka SMO) Output:

Time taken to build model:        seconds

=== Evaluation on training set ===
=== Summary ===

Correctly Classified Instances                     %
Incorrectly Classified Instances                   %
Kappa statistic
Mean absolute error
Root mean squared error
Relative absolute error                            %
Root relative squared error                        %
Total Number of Instances           2359

=== Detailed Accuracy By Class ===

TP Rate   FP Rate   Precision   Recall   F-Measure   Class
                                                      ad
                                                      nonad

=== Confusion Matrix ===

   a   b   <-- classified as
           a = ad
           b = nonad

Stacking output:

=== Run information ===

Scheme:       weka.classifiers.meta.Stacking -X 10 -M "weka.classifiers.rules.ZeroR" -S 1 -B "weka.classifiers.rules.ZeroR"
Relation:     ad-weka.filters.supervised.attribute.discretize-rfirst-last
Instances:    2359
Attributes:   1559
              [list of attributes omitted]
Test mode:    10-fold cross-validation

=== Classifier model (full training set) ===

Stacking

Base classifiers

13 ZeroR predicts class value: nonad Meta classifier ZeroR predicts class value: nonad Time taken to build model: 0.28 seconds === Stratified cross-validation === === Summary === Correctly Classified Instances % Incorrectly Classified Instances % Kappa statistic 0 Mean absolute error Root mean squared error Relative absolute error 100 % Root relative squared error 100 % Total Number of Instances 2359 === Detailed Accuracy By Class === TP Rate FP Rate Precision Recall F-Measure Class ad nonad === Confusion Matrix === a b <-- classified as a = ad b = nonad C4.5 Output: wangy22@penguin:shm c4.5 -f ad C4.5 [release 8] decision tree generator Wed Mar 30 18:40: Options: File stem <ad> Read 3279 cases (1558 attributes) from ad.data Decision Tree: url*ads = 0: ancurl*click = 1: ad (103.0/2.0) ancurl*click = 0: ancurl*http+www = 1: ad (43.0) ancurl*http+www = 0: ancurl*nph = 1: ad (22.0) ancurl*nph = 0: url*doubleclick.net = 1: ad (15.0) url*doubleclick.net = 0: alt*visit+our = 1: ad (9.0) 13

14 alt*visit+our = 0: ancurl*adclick = 1: ad (6.0) ancurl*adclick = 0: origurl*home.netscape.com = 1: ad (5.0) origurl*home.netscape.com = 0: origurl*jun = 1: ad (4.0) origurl*jun = 0: ancurl*url+http = 1: ad (3.0) ancurl*url+http = 0: url*memberbanners = 1: ad (10.0/1.0) url*memberbanners = 0: origurl*zdnet.com = 1: ad (2.0) origurl*zdnet.com = 0: ancurl*n+a = 1: ad (2.0) ancurl*n+a = 0: ancurl*plx = 1: ad (2.0) ancurl*plx = 0:[S1] url*ads = 1: origurl* = 0: ad (149.0/4.0) origurl* = 1: nonad (3.0/1.0) Subtree [S1] ancurl*redirect+cgi = 0: alt*ad = 1: ad (4.0/1.0) alt*ad = 0: ancurl*url = 1: ad (3.0/1.0) ancurl*url = 0: url*banner+gif = 1: ad (8.0/4.0) url*banner+gif = 0: ancurl*ad = 0: alt*here+for = 0: origurl*bin = 1: ad (2.0/1.0) origurl*bin = 0: alt*at = 1: ad (2.0/1.0) alt*at = 0: alt*click+here = 0: url*images+home = 0: ancurl*marketing = 0: ancurl*pl = 0: alt*download = 1: nonad (3.0/1.0) alt*download = 0: alt*for+a = 1: nonad (3.0/1.0) alt*for+a = 0:[S2] ancurl*pl = 1: url*images = 1: nonad (3.0) url*images = 0: url*gifs = 0: ad (4.0/1.0) url*gifs = 1: nonad (2.0) ancurl*marketing = 1: url*logo = 1: nonad (5.0) url*logo = 0: height <= 44 : nonad (2.0) height > 44 : ad (4.0) url*images+home = 1: width <= 196 : nonad (10.2) 14

15 width > 196 : ad (6.8/0.8) alt*click+here = 1: url*images = 1: ad (3.0) url*images = 0: alt*here+to = 1: nonad (14.0/1.0) alt*here+to = 0: url*assets = 1: nonad (2.0) url*assets = 0: url*thejeep.com = 1: ad (3.0) url*thejeep.com = 0: height <= 45 : nonad (2.5/1.0) height > 45 : ad (2.5/0.5) alt*here+for = 1: url*thejeep.com = 0: nonad (4.0/1.0) url*thejeep.com = 1: ad (2.0) ancurl*ad = 1: url*ad+gif = 0: ad (3.0) url*ad+gif = 1: nonad (3.0) ancurl*redirect+cgi = 1: origurl*messier = 0: ad (8.0) origurl*messier = 1: nonad (2.0) Subtree [S2] alt*banner = 1: nonad (4.0/1.0) alt*banner = 0: ancurl*site = 1: nonad (5.0/1.0) ancurl*site = 0: url*logo+gif = 0: ancurl*main = 0: caption*click = 1: nonad (7.0/1.0) caption*click = 0: caption*in = 1: nonad (7.0/1.0) caption*in = 0: url*mindspring.com = 1: nonad (7.0/1.0) url*mindspring.com = 0: origurl*football = 1: nonad (7.0/1.0) origurl*football = 0: origurl*contents = 0: ancurl*download = 0: alt*now = 1: nonad (10.0/1.0) alt*now = 0: alt*for = 1: nonad (9.0/1.0) alt*for = 0: url*valley+2539 = 1: nonad (10.0/1.0) url*valley+2539 = 0:[S3] ancurl*download = 1: url*images = 0: nonad (6.0) url*images = 1: aratio <= : nonad (7.0) aratio > : ad (3.0/1.0) origurl*contents = 1: ancurl*members = 0: ad (4.0/1.0) ancurl*members = 1: nonad (20.0) ancurl*main = 1: ancurl*index = 0: nonad (11.0) 15

16 ancurl*index = 1: ad (2.0) url*logo+gif = 1: url*images = 0: nonad (5.0) url*images = 1: origurl*football = 1: nonad (2.0) origurl*football = 0: height <= 67 : ad (2.5/0.5) height > 67 : nonad (2.5) Subtree [S3] origurl*chapel = 1: nonad (11.0/1.0) origurl*chapel = 0: origurl*plains+5257 = 0: ancurl*page = 1: nonad (11.0/1.0) ancurl*page = 0: alt*free = 1: nonad (13.0/1.0) alt*free = 0: caption*and = 1: nonad (12.0/1.0) caption*and = 0: url*martnet = 1: nonad (17.0/1.0) url*martnet = 0: url*link = 1: nonad (20.0/1.0) url*link = 0: width <= 260 : alt*of = 1: nonad (36.8/1.0) alt*of = 0: ancurl*asp = 1: nonad (34.9/1.0) ancurl*asp = 0: alt*site = 1: nonad (43.7/1.0) alt*site = 0: url*com = 1: nonad (2.0) url*com = 0: ancurl*yahoo = 1: nonad (2.0) ancurl*yahoo = 0: alt*from = 1: nonad (2.0) alt*from = 0:[S4] width > 260 : origurl*index+htm = 1: nonad (2.4/1.0) origurl*index+htm = 0: ancurl*globec.com.au = 1: nonad (20.1) ancurl*globec.com.au = 0: ancurl*geocities.com = 1: nonad (13.0) ancurl*geocities.com = 0: url*graphics = 1: nonad (7.3) url*graphics = 0:[S5] origurl*plains+5257 = 1: url*images+geoguideii = 0: ad (2.0) url*images+geoguideii = 1: nonad (22.0) Subtree [S4] alt*find = 1: nonad (2.0) alt*find = 0: alt*out = 1: nonad (2.0) alt*out = 0: 16

17 ancurl*lg = 1: nonad (2.8) ancurl*lg = 0: url*image+navigate = 1: nonad (3.0) url*image+navigate = 0: ancurl*homepage = 1: nonad (3.0) ancurl*homepage = 0: ancurl*exe = 1: nonad (2.9) ancurl*exe = 0: ancurl*magic = 1: nonad (2.8) ancurl*magic = 0: alt*click = 1: nonad (3.0) alt*click = 0: alt*visit+the = 1: nonad (3.0) alt*visit+the = 0: alt*network = 1: nonad (2.0) alt*network = 0: alt*more = 1: nonad (3.0) alt*more = 0: ancurl*thejeep.com = 1: nonad (3.0) ancurl*thejeep.com = 0: url*thejeep.com = 1: nonad (3.0) url*thejeep.com = 0: alt*get = 1: nonad (2.9) alt*get = 0:[S6] Subtree [S5] ancurl*members.accessus.net = 1: nonad (6.0) ancurl*members.accessus.net = 0: ancurl*index = 1: nonad (5.7) ancurl*index = 0: origurl*arvann = 1: nonad (3.6) origurl*arvann = 0: ancurl*home = 1: nonad (3.5) ancurl*home = 0: url*aol.com = 1: nonad (3.1) url*aol.com = 0: url*ball = 1: nonad (3.1) url*ball = 0: width <= 419 : nonad (45.0/2.0) width > 419 : aratio <= : nonad (6.2/0.0) aratio > : aratio <= : ad (3.1/1.1) aratio > : nonad (9.3/4.0) Subtree [S6] alt*join = 1: nonad (2.9) alt*join = 0: alt*internet+explorer = 1: nonad (3.0) alt*internet+explorer = 0: alt*microsoft = 1: nonad (2.0) alt*microsoft = 0: url*express-scripts.com = 1: nonad (2.9) 17

18 url*express-scripts.com = 0: alt*netscape = 1: nonad (2.9) alt*netscape = 0: ancurl*comprod+mirror = 1: nonad (2.0) ancurl*comprod+mirror = 0: ancurl*home.netscape.com = 1: nonad (2.9) ancurl*home.netscape.com = 0: url*ie = 1: nonad (2.9) url*ie = 0: alt*your = 1: nonad (2.9) alt*your = 0: ancurl*ie = 1: nonad (3.0) ancurl*ie = 0: alt*net = 1: nonad (4.0) alt*net = 0: url*cat = 1: nonad (3.9) url*cat = 0: url*button+gif = 1: nonad (3.8) url*button+gif = 0: url*media = 1: nonad (4.0) url*media = 0:[S7] Subtree [S7] alt*online = 1: nonad (4.0) alt*online = 0: ancurl*dejay = 1: nonad (4.0) ancurl*dejay = 0: url*members.accessus.net = 1: nonad (2.9) url*members.accessus.net = 0: ancurl*members = 1: nonad (3.9) ancurl*members = 0: ancurl*cat = 1: nonad (4.0) ancurl*cat = 0: ancurl*default = 1: nonad (4.0) ancurl*default = 0: ancurl*amp = 1: nonad (3.9) ancurl*amp = 0: alt*here = 1: nonad (3.9) alt*here = 0: alt*you = 1: nonad (3.9) alt*you = 0: origurl*hist = 1: nonad (4.7) origurl*hist = 0: origurl*corp = 1: nonad (2.8) origurl*corp = 0: ancurl*mei.co.jp = 1: nonad (4.7) ancurl*mei.co.jp = 0: ancurl*e+html = 1: nonad (2.0) ancurl*e+html = 0:[S8] Subtree [S8] origurl*e+html = 1: nonad (3.0) origurl*e+html = 0: 18

19 ancurl*tii = 1: nonad (3.0) ancurl*tii = 0: alt*book = 1: nonad (4.9) alt*book = 0: origurl*geocities.com = 1: nonad (482.5) origurl*geocities.com = 0: url*tour = 1: nonad (2.0) url*tour = 0: ancurl*heartland+pointe = 1: nonad (2.0) ancurl*heartland+pointe = 0: ancurl*forums = 1: nonad (2.0) ancurl*forums = 0: alt*chat = 1: nonad (2.0) alt*chat = 0: ancurl*enchantedforest = 1: nonad (2.0) ancurl*enchantedforest = 0: ancurl* = 1: nonad (2.0) ancurl* = 0: alt*guestbook = 1: nonad (3.0) alt*guestbook = 0: ancurl*plains = 1: nonad (3.0) ancurl*plains = 0: ancurl*soho = 0: nonad (1653.1/1.9) ancurl*soho = 1: nonad (2.9) Simplified Decision Tree: url*ads = 0: ancurl*click = 1: ad (103.0/3.8) ancurl*click = 0: ancurl*http+www = 1: ad (43.0/1.4) ancurl*http+www = 0: url*doubleclick.net = 1: ad (15.0/1.3) url*doubleclick.net = 0: alt*visit+our = 1: ad (9.0/1.3) alt*visit+our = 0: ancurl*adclick = 1: ad (19.0/1.3) ancurl*adclick = 0: origurl*home.netscape.com = 1: ad (5.0/1.2) origurl*home.netscape.com = 0: origurl*jun = 1: ad (4.0/1.2) origurl*jun = 0: ancurl*url+http = 1: ad (3.0/1.1) ancurl*url+http = 0: url*memberbanners = 1: ad (10.0/2.4) url*memberbanners = 0: origurl*zdnet.com = 1: ad (10.0/1.3) origurl*zdnet.com = 0: ancurl*n+a = 1: ad (2.0/1.0) ancurl*n+a = 0: ancurl*plx = 1: ad (2.0/1.0) ancurl*plx = 0: ancurl*redirect+cgi = 0: alt*ad = 1: ad (4.0/2.2) alt*ad = 0: ancurl*ad = 0:[S1] ancurl*ad = 1:[S2] 19

ancurl*redirect+cgi = 1:[S3] url*ads = 1: origurl* = 0: ad (149.0/6.2) origurl* = 1: nonad (3.0/2.1)

Subtree [S1] alt*click+here = 0: url*images+home = 0: ancurl*marketing = 0: nonad (2820.0/57.7) ancurl*marketing = 1: url*logo = 1: nonad (5.0/1.2) url*logo = 0: height <= 44 : nonad (2.0/1.0) height > 44 : ad (5.0/1.2) url*images+home = 1: width <= 196 : nonad (10.1/1.3) width > 196 : ad (7.9/2.3) alt*click+here = 1: alt*here+to = 0: ad (18.0/8.0) alt*here+to = 1: nonad (14.0/2.5)

Subtree [S2] url*ad+gif = 0: ad (3.0/1.1) url*ad+gif = 1: nonad (3.0/1.1)

Subtree [S3] origurl*messier = 0: ad (8.0/1.3) origurl*messier = 1: nonad (2.0/1.0)

Tree saved

Evaluation on training data (3279 items):

    Before Pruning         After Pruning
    Size    Errors         Size    Errors       Estimate
            ( 8.1%)        53      68( 8.6%)    ( 9.8%)   <<

Appendix C: Source Code and Scripts

#!/bin/csh
# Computer Science 4TF3 - Data Mining
# N-way cross-validation script (Modified by Yu Wang from team 18)

# * All programs are to be executed under a Linux environment.
# * The following steps were tested under Gentoo Linux with C4.5 Quinlan
#   installed in $home/documents/4tf3/r8/bin
#
# Steps:
# 1. Make sure that ad.names and ad.data are in the same directory.
# 2. Check the number of instances in ad.data:
#        $ nl ad.data | tail
# 3. Assuming the number of instances to be 2000, the testing dataset is split
#    out of ad.data:
#        $ split -l 1800 ad.data
#        $ mv xaa ad.data
#        $ mv xab ad.test
# 4. $ xval.sh ad

# sort the options into result suffix and control options for the programs
# Note: for options with values, there must be no space between the option name
# and value; e.g. "-v1", not "-v 1"

set treeopts =
set ruleopts =
set suffix =
set path = ($path $home/documents/4tf3/r8/bin)

foreach i ( $argv[3-] )
    switch ( $i )
    case "+*":
        set suffix = $i
        breaksw
    case "-v*":
    case "-c*":
        set treeopts = ($treeopts $i)
        set ruleopts = ($ruleopts $i)
        breaksw
    case "-p":
    case "-t*":
    case "-w*":
    case "-i*":
    case "-g":
    case "-s":
    case "-m*":
        set treeopts = ($treeopts $i)
        breaksw
    case "-r*":
    case "-F*":
    case "-a":
        set ruleopts = ($ruleopts $i)

        breaksw
    default:
        echo "unrecognised or inappropriate option" $i
        exit
    endsw
end

# prepare the data for cross-validation

cat $1.data $1.test | xval-prep $2 >XDF.data
cp /dev/null XDF.test
ln $1.names XDF.names
rm $1.[rt]o[0-9]*$suffix

set junk = `wc XDF.data`
set examples = $junk[1]
set large = `expr $examples % $2`
set segsize = `expr \( $examples / $2 \) + 1`

# perform the cross-validation trials

set i = 0
while ( $i < $2 )
    if ( $i == $large ) set segsize = `expr $examples / $2`

    cat XDF.test XDF.data | split -`expr $examples - $segsize`
    mv xaa XDF.data
    mv xab XDF.test

    c4.5 -f XDF -u $treeopts >$1.to$i$suffix
    c4.5rules -f XDF -u $ruleopts >$1.ro$i$suffix

    @ i++
end

# remove the temporary files and summarize results

rm -f XDF.*
cat $1.to[0-9]*$suffix | grep "<<" | average >$1.tres$suffix
cat $1.ro[0-9]*$suffix | grep "<<" | average >$1.rres$suffix

References

1. N. Kushmerick. Learning to remove Internet advertisements. 3rd International Conference on Autonomous Agents.
2. Ian H. Witten, Eibe Frank. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Academic Press.
3. Jiming Peng. Data Mining: Concepts and Algorithms, class notes, McMaster University, 2004.


More information

Predictive Analysis: Evaluation and Experimentation. Heejun Kim

Predictive Analysis: Evaluation and Experimentation. Heejun Kim Predictive Analysis: Evaluation and Experimentation Heejun Kim June 19, 2018 Evaluation and Experimentation Evaluation Metrics Cross-Validation Significance Tests Evaluation Predictive analysis: training

More information

INTRODUCTION TO DATA MINING. Daniel Rodríguez, University of Alcalá

INTRODUCTION TO DATA MINING. Daniel Rodríguez, University of Alcalá INTRODUCTION TO DATA MINING Daniel Rodríguez, University of Alcalá Outline Knowledge Discovery in Datasets Model Representation Types of models Supervised Unsupervised Evaluation (Acknowledgement: Jesús

More information

Data Mining and Knowledge Discovery: Practice Notes

Data Mining and Knowledge Discovery: Practice Notes Data Mining and Knowledge Discovery: Practice Notes Petra Kralj Novak Petra.Kralj.Novak@ijs.si 2016/11/16 1 Keywords Data Attribute, example, attribute-value data, target variable, class, discretization

More information

Data Collection, Preprocessing and Implementation

Data Collection, Preprocessing and Implementation Chapter 6 Data Collection, Preprocessing and Implementation 6.1 Introduction Data collection is the loosely controlled method of gathering the data. Such data are mostly out of range, impossible data combinations,

More information

Data Mining. Lesson 9 Support Vector Machines. MSc in Computer Science University of New York Tirana Assoc. Prof. Dr.

Data Mining. Lesson 9 Support Vector Machines. MSc in Computer Science University of New York Tirana Assoc. Prof. Dr. Data Mining Lesson 9 Support Vector Machines MSc in Computer Science University of New York Tirana Assoc. Prof. Dr. Marenglen Biba Data Mining: Content Introduction to data mining and machine learning

More information

Data Analysis and Solver Plugins for KSpread USER S MANUAL. Tomasz Maliszewski

Data Analysis and Solver Plugins for KSpread USER S MANUAL. Tomasz Maliszewski Data Analysis and Solver Plugins for KSpread USER S MANUAL Tomasz Maliszewski tmaliszewski@wp.pl Table of Content CHAPTER 1: INTRODUCTION... 3 1.1. ABOUT DATA ANALYSIS PLUGIN... 3 1.3. ABOUT SOLVER PLUGIN...

More information

CSI5387: Data Mining Project

CSI5387: Data Mining Project CSI5387: Data Mining Project Terri Oda April 14, 2008 1 Introduction Web pages have become more like applications that documents. Not only do they provide dynamic content, they also allow users to play

More information

Association Pattern Mining. Lijun Zhang

Association Pattern Mining. Lijun Zhang Association Pattern Mining Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction The Frequent Pattern Mining Model Association Rule Generation Framework Frequent Itemset Mining Algorithms

More information

Data Mining. 1.3 Input. Fall Instructor: Dr. Masoud Yaghini. Chapter 3: Input

Data Mining. 1.3 Input. Fall Instructor: Dr. Masoud Yaghini. Chapter 3: Input Data Mining 1.3 Input Fall 2008 Instructor: Dr. Masoud Yaghini Outline Instances Attributes References Instances Instance: Instances Individual, independent example of the concept to be learned. Characterized

More information

Data Mining. Practical Machine Learning Tools and Techniques. Slides for Chapter 3 of Data Mining by I. H. Witten, E. Frank and M. A.

Data Mining. Practical Machine Learning Tools and Techniques. Slides for Chapter 3 of Data Mining by I. H. Witten, E. Frank and M. A. Data Mining Practical Machine Learning Tools and Techniques Slides for Chapter 3 of Data Mining by I. H. Witten, E. Frank and M. A. Hall Output: Knowledge representation Tables Linear models Trees Rules

More information

Machine Learning for. Artem Lind & Aleskandr Tkachenko

Machine Learning for. Artem Lind & Aleskandr Tkachenko Machine Learning for Object Recognition Artem Lind & Aleskandr Tkachenko Outline Problem overview Classification demo Examples of learning algorithms Probabilistic modeling Bayes classifier Maximum margin

More information

Data Preprocessing. Why Data Preprocessing? MIT-652 Data Mining Applications. Chapter 3: Data Preprocessing. Multi-Dimensional Measure of Data Quality

Data Preprocessing. Why Data Preprocessing? MIT-652 Data Mining Applications. Chapter 3: Data Preprocessing. Multi-Dimensional Measure of Data Quality Why Data Preprocessing? Data in the real world is dirty incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data e.g., occupation = noisy: containing

More information

Probabilistic Classifiers DWML, /27

Probabilistic Classifiers DWML, /27 Probabilistic Classifiers DWML, 2007 1/27 Probabilistic Classifiers Conditional class probabilities Id. Savings Assets Income Credit risk 1 Medium High 75 Good 2 Low Low 50 Bad 3 High Medium 25 Bad 4 Medium

More information

Performance Evaluation of Various Classification Algorithms

Performance Evaluation of Various Classification Algorithms Performance Evaluation of Various Classification Algorithms Shafali Deora Amritsar College of Engineering & Technology, Punjab Technical University -----------------------------------------------------------***----------------------------------------------------------

More information

SNS College of Technology, Coimbatore, India

SNS College of Technology, Coimbatore, India Support Vector Machine: An efficient classifier for Method Level Bug Prediction using Information Gain 1 M.Vaijayanthi and 2 M. Nithya, 1,2 Assistant Professor, Department of Computer Science and Engineering,

More information