3 Supervised Learning


Preface

The rapid growth of the Web in the last decade has made it the largest publicly accessible data source in the world. Web mining aims to discover useful information or knowledge from Web hyperlinks, page contents, and usage logs. Based on the primary kinds of data used in the mining process, Web mining tasks can be categorized into three main types: Web structure mining, Web content mining and Web usage mining. Web structure mining discovers knowledge from hyperlinks, which represent the structure of the Web. Web content mining extracts useful information/knowledge from Web page contents. Web usage mining mines user access patterns from usage logs, which record the clicks made by every user.

The goal of this book is to present these tasks and their core mining algorithms. The book is intended to be a text with comprehensive coverage, and yet, for each topic, sufficient details are given so that readers can gain a reasonably complete knowledge of its algorithms or techniques without referring to any external materials. Four of the chapters, structured data extraction, information integration, opinion mining, and Web usage mining, make this book unique. These topics are not covered by existing books, yet they are essential to Web data mining. Traditional Web mining topics such as search, crawling and resource discovery, and link analysis are also covered in detail.

Although the book is entitled Web Data Mining, it also includes the main topics of data mining and information retrieval, since Web mining uses their algorithms and techniques extensively. The data mining part mainly consists of chapters on association rules and sequential patterns, supervised learning (or classification), and unsupervised learning (or clustering), which are the three most important data mining tasks. The advanced topic of partially (semi-) supervised learning is included as well. For information retrieval, its core topics that are crucial to Web mining are described. The book is thus naturally divided into two parts. The first part, which consists of Chaps. 2-5, covers data mining foundations. The second part, which contains Chaps. 6-12, covers Web specific mining.

Two main principles have guided the writing of this book. First, the basic content of the book should be accessible to undergraduate students, and yet there are sufficient in-depth materials for graduate students who plan to

pursue Ph.D. degrees in Web data mining or related areas. Few assumptions are made in the book regarding the prerequisite knowledge of readers. One with a basic understanding of algorithms and probability concepts should have no problem with this book. Second, the book should examine the Web mining technology from a practical point of view. This is important because most Web mining tasks have immediate real-world applications. In the past few years, I was fortunate to have worked directly or indirectly with many researchers and engineers in several search engine and e-commerce companies, and also traditional companies that are interested in exploiting the information on the Web in their businesses. During the process, I gained practical experience and first-hand knowledge of real-world problems. I try to pass those non-confidential pieces of information and knowledge along in the book. The book, thus, should have a good balance of theory and practice. I hope that it will not only be a learning text for students, but also a valuable source of information/knowledge and even ideas for Web mining researchers and practitioners.

Acknowledgements

Many researchers have assisted me technically in writing this book. Without their help, this book might never have become reality. My deepest thanks go to Filippo Menczer and Bamshad Mobasher, who were so kind to have helped write two essential chapters of the book. They are both experts in their respective fields. Filippo wrote the chapter on Web crawling and Bamshad wrote the chapter on Web usage mining. I am also very grateful to Wee Sun Lee, who helped a great deal in the writing of Chap. 5 on partially supervised learning. Jian Pei helped with the writing of the PrefixSpan algorithm in Chap. 2, and checked the MS-PS algorithm. Eduard Dragut assisted with the writing of the last section of Chap. 10 and also read the chapter many times. Yuanlin Zhang gave many great suggestions on Chap. 9. I am indebted to all of them.

Many other researchers also assisted in various ways. Yang Dai and Rudy Setiono helped with Support Vector Machines (SVM). Chris Ding helped with link analysis. Clement Yu and ChengXiang Zhai read Chap. 6, and Amy Langville read Chap. 7. Kevin C.-C. Chang, Ji-Rong Wen and Clement Yu helped with many aspects of Chap. 10. Justin Zobel helped clarify some issues related to index compression, and Ion Muslea helped clarify some issues on wrapper induction. Divy Agrawal, Yunbo Cao, Edward Fox, Hang Li, Xiaoli Li, Zhaohui Tan, Dell Zhang and Zijian Zheng helped check various chapters or sections. I am very grateful.

Discussions with many researchers helped shape the book as well: Amir Ashkenazi, Imran Aziz, Roberto Bayardo, Wendell Baker, Ling Bao, Jeffrey Benkler, AnHai Doan, Byron Dom, Michael Gamon, Robert Grossman, Jiawei Han, Wynne Hsu, Ronny Kohavi, David D. Lewis, Ian McAllister, Wei-Ying Ma, Marco Maggini, Llew Mason, Kamel Nigam, Julian Qian, Yan Qu, Thomas M. Tirpak, Andrew Tomkins, Alexander Tuzhilin, Weimin Xiao, Gu Xu, Philip S. Yu, and Mohammed Zaki.

My former and current students, Gao Cong, Minqing Hu, Nitin Jindal, Xin Li, Yiming Ma, Yanhong Zhai and Kaidi Zhao, checked many algorithms and made numerous corrections. Some chapters of the book have been used in my graduate classes at the University of Illinois at Chicago. I thank the students in these classes for implementing several algorithms. Their questions helped me improve and, in some cases, correct the algorithms. It is not possible to list all their names. Here, I would particularly like to thank John Castano, Xiaowen Ding, Murthy Ganapathibhotla, Cynthia Kersey, Hari Prasad Divyakotti, Ravikanth Turlapati, Srikanth Tadikonda, Mako Tamura, Haisheng Wang, and Chad Williams for pointing out errors in texts, examples or algorithms. Michael Bombyk from DePaul University also found several typing errors.

It was a pleasure working with the helpful staff at Springer. I thank my editor Ralf Gerstner, who asked me in early 2005 whether I was interested in writing a book on Web mining. It has been a wonderful experience working with him since. I also thank my copyeditor Mike Nugent for helping me improve the presentation, and my production editor Michael Reinfarth for guiding me through the final production process. Two anonymous reviewers also gave me many insightful comments. The Department of Computer Science at the University of Illinois at Chicago provided computing resources and a supportive environment for this project.

Finally, I thank my parents, brother and sister for their constant support and encouragement. My greatest gratitude goes to my own family: Yue, Shelley and Kate. They have helped me in so many ways. Despite their young ages, Shelley and Kate actually read many parts of the book and caught numerous typing errors. My wife has taken care of almost everything at home and put up with me and the long hours that I have spent on this book. I dedicate this book to them.

Bing Liu

3 Supervised Learning

Supervised learning has been a great success in real-world applications. It is used in almost every domain, including the text and Web domains. Supervised learning is also called classification or inductive learning in machine learning. This type of learning is analogous to human learning from past experiences to gain new knowledge in order to improve our ability to perform real-world tasks. However, since computers do not have experiences, machine learning learns from data, which are collected in the past and represent past experiences in some real-world applications.

There are several types of supervised learning tasks. In this chapter, we focus on one particular type, namely, learning a target function that can be used to predict the values of a discrete class attribute. This type of learning has been the focus of machine learning research and is perhaps also the most widely used learning paradigm in practice. This chapter introduces a number of such supervised learning techniques. They are used in almost every Web mining application, and we will see their uses in many of the later chapters.

3.1 Basic Concepts

A data set used in the learning task consists of a set of data records, which are described by a set of attributes A = {A1, A2, ..., A|A|}, where |A| denotes the number of attributes or the size of the set A. The data set also has a special target attribute C, which is called the class attribute. In our subsequent discussions, we consider C separately from the attributes in A due to its special status, i.e., we assume that C is not in A. The class attribute C has a set of discrete values, i.e., C = {c1, c2, ..., c|C|}, where |C| is the number of classes and |C| >= 2. A class value is also called a class label. A data set for learning is simply a relational table. Each data record describes a piece of past experience. In the machine learning and data mining literature, a data record is also called an example, an instance, a case or a vector. A data set basically consists of a set of examples or instances.

Given a data set D, the objective of learning is to produce a classification/prediction function to relate values of attributes in A and classes in C. The function can be used to predict the class values/labels of the future

data. The function is also called a classification model, a predictive model or simply a classifier. We will use these terms interchangeably in this book. It should be noted that the function/model can be in any form, e.g., a decision tree, a set of rules, a Bayesian model or a hyperplane.

Example 1: Table 3.1 shows a small loan application data set. It has four attributes. The first attribute is Age, which has three possible values, young, middle and old. The second attribute is Has_job, which indicates whether an applicant has a job; its possible values are true (has a job) and false (does not have a job). The third attribute is Own_house, which shows whether an applicant owns a house. The fourth attribute is Credit_rating, which has three possible values, fair, good and excellent. The last column is the Class attribute, which shows whether each loan application was approved (denoted by Yes) or not (denoted by No) in the past.

Table 3.1. A loan application data set

ID   Age     Has_job  Own_house  Credit_rating  Class
1    young   false    false      fair           No
2    young   false    false      good           No
3    young   true     false      good           Yes
4    young   true     true       fair           Yes
5    young   false    false      fair           No
6    middle  false    false      fair           No
7    middle  false    false      good           No
8    middle  true     true       good           Yes
9    middle  false    true       excellent      Yes
10   middle  false    true       excellent      Yes
11   old     false    true       excellent      Yes
12   old     false    true       good           Yes
13   old     true     false      good           Yes
14   old     true     false      excellent      Yes
15   old     false    false      fair           No

We want to learn a classification model from this data set that can be used to classify future loan applications. That is, when a new customer comes into the bank to apply for a loan, after inputting his/her age, whether he/she has a job, whether he/she owns a house, and his/her credit rating, the classification model should predict whether his/her loan application should be approved.

Our learning task is called supervised learning because the class labels (e.g., the Yes and No values of the class attribute in Table 3.1) are provided in

the data. It is as if some teacher tells us the classes. This is in contrast to unsupervised learning, where the classes are not known and the learning algorithm needs to generate classes automatically. Unsupervised learning is the topic of the next chapter.

The data set used for learning is called the training data (or the training set). After a model is learned or built from the training data by a learning algorithm, it is evaluated using a set of test data (or unseen data) to assess the model accuracy. It is important to note that the test data is not used in learning the classification model. The examples in the test data usually also have class labels; that is why the test data can be used to assess the accuracy of the learned model: we can check whether the class predicted by the model for each test case is the same as the actual class of the test case. In order to learn and also to test, the available data (which has classes) is usually split into two disjoint subsets, the training set (for learning) and the test set (for testing). We will discuss this further in Sect. 3.3. The accuracy of a classification model on a test set is defined as:

    Accuracy = Number of correct classifications / Total number of test cases,    (1)

where a correct classification means that the learned model predicts the same class as the original class of the test case. There are also other measures that can be used; we will discuss them in Sect. 3.3.2.

We pause here to raise two important questions:
1. What do we mean by learning by a computer system?
2. What is the relationship between the training and the test data?

We answer the first question first. Given a data set D representing past experiences, a task T and a performance measure M, a computer system is said to learn from the data to perform the task T if, after learning, the system's performance on T improves as measured by M. In other words, the learned model or knowledge helps the system to perform the task better as compared to no learning. Learning is the process of building the model or extracting the knowledge.

We use the data set in Example 1 to explain the idea. The task is to predict whether a loan application should be approved. The performance measure M is the accuracy in Equation (1). With the data set in Table 3.1, if there is no learning, all we can do is to guess randomly or to simply take the majority class (which is the Yes class). Suppose we use the majority class and announce that every future instance or case belongs to the class Yes. If the future data are drawn from the same distribution as the existing training data in Table 3.1, the estimated classification/prediction accuracy

on the future data is 9/15 = 0.6, as there are 9 Yes class examples out of the total of 15 examples in Table 3.1. The question is: can we do better with learning? If the learned model can indeed improve the accuracy, then the learning is said to be effective.

The second question in fact touches the fundamental assumption of machine learning, especially the theoretical study of machine learning. The assumption is that the distribution of training examples is identical to the distribution of test examples (including future unseen examples). In practical applications, this assumption is often violated to a certain degree. Strong violations will clearly result in poor classification accuracy, which is quite intuitive: if the test data behave very differently from the training data, then the learned model will not perform well on the test data. To achieve good accuracy on the test data, training examples must be sufficiently representative of the test data.

We now illustrate the steps of learning in Fig. 3.1 based on the preceding discussion. In step 1, a learning algorithm uses the training data to generate a classification model. This step is also called the training step or training phase. In step 2, the learned model is tested using the test set to obtain the classification accuracy. This step is called the testing step or testing phase. If the accuracy of the learned model on the test data is satisfactory, the model can be used in real-world tasks to predict classes of new cases (which do not have classes). If the accuracy is not satisfactory, we need to go back and choose a different learning algorithm and/or do some further processing of the data (this step is called data pre-processing, not shown in the figure). A practical learning task typically involves many iterations of these steps before a satisfactory model is built. It is also possible that we are unable to build a satisfactory model due to a high degree of randomness in the data or limitations of current learning algorithms.

Fig. 3.1. The basic learning process: training and testing (Step 1, training: a learning algorithm builds a model from the training data; Step 2, testing: the model is applied to the test data to obtain its accuracy)

From the next section onward, we study several supervised learning algorithms, except Sect. 3.3, which focuses on model/classifier evaluation. We note that throughout the chapter we assume that the training and test data are available for learning. However, in many text and Web page related learning tasks, this is not true. Usually, we need to collect raw data,

design attributes and compute attribute values from the raw data. The reason is that the raw data in text and Web applications are often not suitable for learning either because their formats are not right or because there are no obvious attributes in the raw text documents or Web pages.

3.2 Decision Tree Induction

Decision tree learning is one of the most widely used techniques for classification. Its classification accuracy is competitive with other learning methods, and it is very efficient. The learned classification model is represented as a tree, called a decision tree. The techniques presented in this section are based on the C4.5 system from Quinlan [453].

Example 2: Figure 3.2 shows a possible decision tree learnt from the data in Table 3.1. The tree has two types of nodes, decision nodes (which are internal nodes) and leaf nodes. A decision node specifies some test (i.e., asks a question) on a single attribute. A leaf node indicates a class.

Fig. 3.2. A decision tree for the data in Table 3.1
  Age? -- young  -> Has_job?       [true -> Yes (2/2);  false -> No (3/3)]
       -- middle -> Own_house?     [true -> Yes (3/3);  false -> No (2/2)]
       -- old    -> Credit_rating? [fair -> No (1/1);  good -> Yes (2/2);  excellent -> Yes (2/2)]

The root node of the decision tree in Fig. 3.2 is Age, which basically asks the question: what is the age of the applicant? It has three possible answers or outcomes, which are the three possible values of Age. These three values form three tree branches/edges. The other internal nodes have the same meaning. Each leaf node gives a class value (Yes or No). (x/y) below each class means that x out of the y training examples that reach this leaf node have the class of the leaf. For instance, the class of the left-most leaf node is Yes. Two training examples (examples 3 and 4 in Table 3.1) reach here and both of them are of class Yes.

To use the decision tree in testing, we traverse the tree top-down according to the attribute values of the given test instance until we reach a leaf node. The class of the leaf is the predicted class of the test instance.
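To make the traversal concrete, here is a small illustrative sketch in Python (not from the book). The nested-dictionary encoding and the attribute/value strings are assumptions made for this example; the tree hand-codes Fig. 3.2 and classifies the new applicant that Example 3 below will use.

```python
# Classify a test instance by walking a decision tree stored as nested dictionaries.
tree = {
    "attribute": "Age",
    "branches": {
        "young":  {"attribute": "Has_job",
                   "branches": {"true": "Yes", "false": "No"}},
        "middle": {"attribute": "Own_house",
                   "branches": {"true": "Yes", "false": "No"}},
        "old":    {"attribute": "Credit_rating",
                   "branches": {"fair": "No", "good": "Yes", "excellent": "Yes"}},
    },
}

def classify(node, instance):
    """Follow the branch matching the instance's value at each decision node."""
    while isinstance(node, dict):            # internal decision node
        value = instance[node["attribute"]]  # answer the node's question
        node = node["branches"][value]       # move down the chosen branch
    return node                              # a leaf is the predicted class

applicant = {"Age": "young", "Has_job": "false",
             "Own_house": "false", "Credit_rating": "good"}
print(classify(tree, applicant))             # -> "No" (the applicant of Example 3 below)
```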

Example 3: We use the tree to predict the class of the following new instance, which describes a new loan applicant.

Age     Has_job  Own_house  Credit_rating  Class
young   false    false      good           ?

Going through the decision tree, we find that the predicted class is No, as we reach the second leaf node from the left.

A decision tree is constructed by partitioning the training data so that the resulting subsets are as pure as possible. A pure subset is one that contains only training examples of a single class. If we apply all the training data in Table 3.1 to the tree in Fig. 3.2, we will see that the training examples reaching each leaf node form a subset of examples that have the same class as the class of the leaf. In fact, we can see that from the x and y values in (x/y). We will discuss the decision tree building algorithm in Sect. 3.2.1.

An interesting question is: Is the tree in Fig. 3.2 unique for the data in Table 3.1? The answer is no. In fact, there are many possible trees that can be learned from the data. For example, Fig. 3.3 gives another decision tree, which is much smaller and is also able to partition the training data perfectly according to their classes.

Fig. 3.3. A smaller tree for the data set in Table 3.1
  Own_house? -- true  -> Yes (6/6)
             -- false -> Has_job? [true -> Yes (3/3);  false -> No (6/6)]

In practice, one wants to have a small and accurate tree for many reasons. A smaller tree is more general and also tends to be more accurate (we will discuss this later). It is also easier for human users to understand. In many applications, the user's understanding of the classifier is important. For example, in some medical applications, doctors want to understand the model that classifies whether a person has a particular disease. It is not satisfactory to simply produce a classification, because without understanding why the decision is made the doctor may not trust the system and/or may not gain useful knowledge from it.

It is useful to note that in both Fig. 3.2 and Fig. 3.3, the training examples that reach each leaf node all have the same class (see the values of

(x/y) at each leaf node). However, for most real-life data sets, this is usually not the case. That is, the examples that reach a particular leaf node are not all of the same class, i.e., x < y. The value of x/y is, in fact, the confidence (conf) value used in association rule mining, and x is the support count. This suggests that a decision tree can be converted to a set of if-then rules. Yes, indeed. The conversion is done as follows: each path from the root to a leaf forms a rule, in which all the decision nodes along the path form the conditions of the rule and the leaf node (or the class) forms the consequent. For each rule, a support and a confidence can be attached. Note that in most classification systems these two values are not provided; we add them here to see the connection between association rules and decision trees.

Example 4: The tree in Fig. 3.3 generates three rules ("," means "and"):

Own_house = true -> Class = Yes  [sup=6/15, conf=6/6]
Own_house = false, Has_job = true -> Class = Yes  [sup=3/15, conf=3/3]
Own_house = false, Has_job = false -> Class = No  [sup=6/15, conf=6/6]

We can see that these rules are of the same format as association rules. However, the rules above are only a small subset of the rules that can be found in the data of Table 3.1. For instance, the decision tree in Fig. 3.3 does not find the following rule:

Age = young, Has_job = false -> Class = No  [sup=3/15, conf=3/3]

Thus, we say that a decision tree only finds a subset of the rules that exist in the data, which is sufficient for classification. The objective of association rule mining, in contrast, is to find all rules subject to some minimum support and minimum confidence constraints. The two methods thus have different objectives. We will discuss these issues again in Sect. 3.5, where we show that association rules can be used for classification as well.

An interesting and important property of a decision tree and its resulting set of rules is that the tree paths or the rules are mutually exclusive and exhaustive. This means that every data instance is covered by a single rule (a tree path) and a single rule only. By covering a data instance, we mean that the instance satisfies the conditions of the rule.

We also say that a decision tree generalizes the data, as a tree is a smaller (more compact) description of the data, i.e., it captures the key regularities in the data. The problem then becomes building the best tree, one that is small and accurate. It turns out that finding the best tree that models the data is an NP-complete problem [48]. All existing algorithms use heuristic methods for tree building. Below, we study one of the most successful techniques.

Algorithm decisionTree(D, A, T)
1   if D contains only training examples of the same class cj in C then
2       make T a leaf node labeled with class cj;
3   elseif A = {} then
4       make T a leaf node labeled with cj, which is the most frequent class in D
5   else  // D contains examples belonging to a mixture of classes. We select a single
6         // attribute to partition D into subsets so that each subset is purer
7       p0 = impurityEval-1(D);
8       for each attribute Ai in A (= {A1, A2, ..., Ak}) do
9           pi = impurityEval-2(Ai, D)
10      endfor
11      Select Ag in {A1, A2, ..., Ak} that gives the biggest impurity reduction, computed using p0 - pi;
12      if p0 - pg < threshold then  // Ag does not significantly reduce impurity p0
13          make T a leaf node labeled with cj, the most frequent class in D
14      else  // Ag is able to reduce impurity p0
15          Make T a decision node on Ag;
16          Let the possible values of Ag be v1, v2, ..., vm. Partition D into m disjoint subsets D1, D2, ..., Dm based on the m values of Ag;
17          for each Dj in {D1, D2, ..., Dm} do
18              if Dj is not empty then
19                  create a branch (edge) node Tj for vj as a child node of T;
20                  decisionTree(Dj, A - {Ag}, Tj)  // Ag is removed
21              endif
22          endfor
23      endif
24  endif

Fig. 3.4. A decision tree learning algorithm

3.2.1 Learning Algorithm

As indicated earlier, a decision tree T simply partitions the training data set D into disjoint subsets so that each subset is as pure as possible (of the same class). The learning of a tree is typically done using the divide-and-conquer strategy that recursively partitions the data to produce the tree. At the beginning, all the examples are at the root. As the tree grows, the examples are sub-divided recursively. A decision tree learning algorithm is given in Fig. 3.4. For now, we assume that every attribute in D takes discrete values; this assumption is not necessary, as we will see later.

The stopping criteria of the recursion are in lines 1-4 of Fig. 3.4. The algorithm stops when all the training examples in the current data are of the same class, or when every attribute has been used along the current tree

path. In tree learning, each successive recursion chooses the best attribute to partition the data at the current node according to the values of the attribute. The best attribute is selected based on a function that aims to minimize the impurity after the partitioning (lines 7-11); in other words, it maximizes the purity. The key in decision tree learning is thus the choice of the impurity function, which is used in lines 7, 9 and 11 of Fig. 3.4. The recursive call of the algorithm is in line 20, which takes the subset of training examples at the node for further partitioning to extend the tree. This is a greedy algorithm with no backtracking: once a node is created, it will not be revised or revisited no matter what happens subsequently.

3.2.2 Impurity Function

Before presenting the impurity function, we use an example to show intuitively what the impurity function aims to do.

Example 5: Figure 3.5 shows two possible root nodes for the data in Table 3.1.

Fig. 3.5. Two possible root nodes, or two possible attributes for the root node
  (A) Age?        young -> No: 3, Yes: 2    middle -> No: 2, Yes: 3    old -> No: 1, Yes: 4
  (B) Own_house?  true  -> No: 0, Yes: 6    false  -> No: 6, Yes: 3

Fig. 3.5(A) uses Age as the root node, and Fig. 3.5(B) uses Own_house as the root node. Their possible values (or outcomes) are the branches. At each branch, we list the number of training examples of each class (No or Yes) that land or reach there. Fig. 3.5(B) is obviously a better choice for the root. From a prediction or classification point of view, Fig. 3.5(B) makes fewer mistakes than Fig. 3.5(A). In Fig. 3.5(B), when Own_house = true every example has the class Yes. When Own_house = false, if we take the majority class (the most frequent class), which is No, we make three mistakes/errors. If we look at Fig. 3.5(A), the situation is worse: if we take the majority class for each branch, we make five mistakes. Thus, we say that the impurity of the tree in Fig. 3.5(A) is higher than that of the tree in Fig. 3.5(B). To learn a decision tree, we prefer Own_house to Age as the root node.

Instead of counting the number of mistakes or errors, C4.5 uses a more principled approach to perform this evaluation on every attribute in order to choose the best attribute to build the tree.
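Before defining the impurity functions themselves, the following sketch shows how the divide-and-conquer scheme of Fig. 3.4 might look in Python. It is a simplified illustration, not the C4.5 implementation: it assumes discrete attributes, rows stored as dictionaries with a "Class" key, and an impurity scoring function supplied by the caller (for example, the entropy-based measure introduced next, applied to the class labels of the rows).

```python
from collections import Counter

def majority_class(D):
    """Most frequent class among the rows (each row is a dict with a "Class" key)."""
    return Counter(row["Class"] for row in D).most_common(1)[0][0]

def build_tree(D, attributes, impurity, threshold=1e-6):
    """Recursive divide-and-conquer tree building in the spirit of Fig. 3.4."""
    classes = {row["Class"] for row in D}
    if len(classes) == 1:                        # lines 1-2: all examples in one class
        return classes.pop()
    if not attributes:                           # lines 3-4: no attribute left
        return majority_class(D)
    p0 = impurity(D)                             # line 7: impurity before splitting
    def split_score(a):                          # lines 8-10: impurity after splitting on a
        parts = {}
        for row in D:
            parts.setdefault(row[a], []).append(row)
        score = sum(len(Dj) / len(D) * impurity(Dj) for Dj in parts.values())
        return score, parts
    scored = {a: split_score(a) for a in attributes}
    best = min(scored, key=lambda a: scored[a][0])   # line 11: biggest impurity reduction
    if p0 - scored[best][0] < threshold:         # lines 12-13: no significant reduction
        return majority_class(D)
    node = {"attribute": best, "branches": {}}   # lines 15-16: decision node on best
    for value, Dj in scored[best][1].items():    # lines 17-22: recurse on each subset
        remaining = [a for a in attributes if a != best]
        node["branches"][value] = build_tree(Dj, remaining, impurity, threshold)
    return node
```

A caller would supply, say, `impurity = lambda rows: entropy([r["Class"] for r in rows])` once an entropy function is defined.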

The most popular impurity functions used for decision tree learning are information gain and information gain ratio, which are used in C4.5 as two options. Let us first discuss information gain, which can be extended slightly to produce information gain ratio.

The information gain measure is based on the entropy function from information theory [484]:

    entropy(D) = - SUM_{j=1..|C|} Pr(cj) x log2 Pr(cj),    (2)

where Pr(cj) is the probability of class cj in data set D, i.e., the number of examples of class cj in D divided by the total number of examples in D, and the probabilities of all classes sum to 1. In the entropy computation, we define 0 x log2 0 = 0. The unit of entropy is the bit. Let us use an example to get a feel for what this function does.

Example 6: Assume we have a data set D with only two classes, positive and negative. Let us see the entropy values for three different compositions of positive and negative examples:

1. The data set D has 50% positive examples (Pr(positive) = 0.5) and 50% negative examples (Pr(negative) = 0.5):
   entropy(D) = -0.5 x log2 0.5 - 0.5 x log2 0.5 = 1.
2. The data set D has 20% positive examples (Pr(positive) = 0.2) and 80% negative examples (Pr(negative) = 0.8):
   entropy(D) = -0.2 x log2 0.2 - 0.8 x log2 0.8 = 0.722.
3. The data set D has 100% positive examples (Pr(positive) = 1) and no negative examples (Pr(negative) = 0):
   entropy(D) = -1 x log2 1 - 0 x log2 0 = 0.

We can see a trend: as the data becomes purer and purer, the entropy value becomes smaller and smaller. In fact, it can be shown that for this binary case (two classes), the entropy reaches its maximum value of 1 bit when Pr(positive) = Pr(negative) = 0.5, and its minimum value of 0 bit when all the data in D belong to one class. It is clear that the entropy measures the amount of impurity or disorder in the data, which is exactly what we need in decision tree learning. We now describe the information gain measure, which uses the entropy function.
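As a quick check of Equation (2), the following minimal Python sketch (an illustration, not from the book) reproduces the three entropy values of Example 6 from lists of class labels.

```python
import math

def entropy(labels):
    """Equation (2): entropy of a list of class labels (0 x log2 0 is taken as 0)."""
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return 0.0 - sum(p * math.log2(p) for p in probs)

print(entropy(["+"] * 50 + ["-"] * 50))   # 1.0    (case 1: 50% / 50%)
print(entropy(["+"] * 20 + ["-"] * 80))   # ~0.722 (case 2: 20% / 80%)
print(entropy(["+"] * 100))               # 0.0    (case 3: one class only)
```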

Information Gain

The idea is the following:

1. Given a data set D, we first use the entropy function (Equation 2) to compute the impurity value of D, which is entropy(D). The impurityEval-1 function in line 7 of Fig. 3.4 performs this task.
2. Then, we want to know which attribute can reduce the impurity most if it is used to partition D. To find out, every attribute is evaluated (lines 8-10 in Fig. 3.4). Let the number of possible values of the attribute Ai be v. If we are going to use Ai to partition the data D, we will divide D into v disjoint subsets D1, D2, ..., Dv. The entropy after the partition is

       entropy_Ai(D) = SUM_{j=1..v} (|Dj| / |D|) x entropy(Dj).    (3)

   The impurityEval-2 function in line 9 of Fig. 3.4 performs this task.
3. The information gain of attribute Ai is computed with:

       gain(D, Ai) = entropy(D) - entropy_Ai(D).    (4)

Clearly, the gain criterion measures the reduction in impurity or disorder. The gain measure is used in line 11 of Fig. 3.4, which chooses the attribute Ag resulting in the largest reduction in impurity. If the gain of Ag is too small, the algorithm stops for the branch (line 12); normally a threshold is used here. If choosing Ag is able to reduce impurity significantly, Ag is employed to partition the data to extend the tree further, and so on (lines 15-22 in Fig. 3.4). The process goes on recursively by building sub-trees using D1, D2, ..., Dm (line 20). For subsequent tree extensions, we do not need Ag any more, as all training examples in each branch have the same Ag value.

Example 7: Let us compute the gain values for the attributes Age, Own_house and Credit_rating using the whole data set D in Table 3.1, i.e., we evaluate them for the root node of a decision tree.

First, we compute the entropy of D. Since D has 6 No class training examples and 9 Yes class training examples, we have

    entropy(D) = -(6/15) x log2(6/15) - (9/15) x log2(9/15) = 0.971.

We then try Age, which partitions the data into 3 subsets (as Age has three possible values): D1 (with Age=young), D2 (with Age=middle), and D3 (with Age=old). Each subset has five training examples. In Fig. 3.5, we also see the number of No class examples and the number of Yes class examples in each subset (or in each branch).

    entropy_Age(D) = (5/15) x entropy(D1) + (5/15) x entropy(D2) + (5/15) x entropy(D3)
                   = (5/15) x 0.971 + (5/15) x 0.971 + (5/15) x 0.722 = 0.888.

Likewise, we compute for Own_house, which partitions D into two subsets, D1 (with Own_house=true) and D2 (with Own_house=false):

    entropy_Own_house(D) = (6/15) x entropy(D1) + (9/15) x entropy(D2)
                         = (6/15) x 0 + (9/15) x 0.918 = 0.551.

Similarly, we obtain entropy_Has_job(D) = 0.647 and entropy_Credit_rating(D) = 0.608. The gains for the attributes are:

    gain(D, Age)           = 0.971 - 0.888 = 0.083
    gain(D, Own_house)     = 0.971 - 0.551 = 0.420
    gain(D, Has_job)       = 0.971 - 0.647 = 0.324
    gain(D, Credit_rating) = 0.971 - 0.608 = 0.363

Own_house is the best attribute for the root node. Figure 3.5(B) shows the root node using Own_house. Since the left branch has only one class (Yes) of data, it results in a leaf node (lines 1-2 in Fig. 3.4). For Own_house = false, further extension is needed. The process is the same as above, but we only use the subset of the data with Own_house = false, i.e., D2.
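The computation in Example 7 is easy to mechanize. The sketch below (illustrative Python; the tuple encoding of Table 3.1 is an assumption of the example) implements Equations (3) and (4) and reproduces the four gain values above.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return 0.0 - sum(c / n * math.log2(c / n) for c in Counter(labels).values())

rows = [  # (Age, Has_job, Own_house, Credit_rating, Class) for IDs 1-15 of Table 3.1
    ("young", "false", "false", "fair", "No"),       ("young", "false", "false", "good", "No"),
    ("young", "true", "false", "good", "Yes"),       ("young", "true", "true", "fair", "Yes"),
    ("young", "false", "false", "fair", "No"),       ("middle", "false", "false", "fair", "No"),
    ("middle", "false", "false", "good", "No"),      ("middle", "true", "true", "good", "Yes"),
    ("middle", "false", "true", "excellent", "Yes"), ("middle", "false", "true", "excellent", "Yes"),
    ("old", "false", "true", "excellent", "Yes"),    ("old", "false", "true", "good", "Yes"),
    ("old", "true", "false", "good", "Yes"),         ("old", "true", "false", "excellent", "Yes"),
    ("old", "false", "false", "fair", "No"),
]

def gain(rows, i):
    """Equation (4): entropy(D) minus the entropy after partitioning on attribute i."""
    parts = {}
    for r in rows:                               # partition D on the i-th attribute
        parts.setdefault(r[i], []).append(r[-1])
    split = sum(len(p) / len(rows) * entropy(p) for p in parts.values())   # Eq. (3)
    return entropy([r[-1] for r in rows]) - split

for i, name in enumerate(["Age", "Has_job", "Own_house", "Credit_rating"]):
    print(name, round(gain(rows, i), 3))
# Age 0.083, Has_job 0.324, Own_house 0.42, Credit_rating 0.363 -- as in Example 7
```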

Information Gain Ratio

The gain criterion tends to favor attributes with many possible values. An extreme situation is that the data contain an ID attribute that is an identification of each example. If we consider using this ID attribute to partition the data, each training example forms its own subset with only one class, which results in entropy_ID(D) = 0. The gain obtained by using this attribute is therefore maximal. From a prediction point of view, however, such a partition is useless.

The gain ratio (Equation 5) remedies this bias by normalizing the gain using the entropy of the data with respect to the values of the attribute (our previous entropy computations were done with respect to the class attribute):

    gainRatio(D, Ai) = gain(D, Ai) / ( - SUM_{j=1..s} (|Dj| / |D|) x log2(|Dj| / |D|) ),    (5)

where s is the number of possible values of Ai, and Dj is the subset of data that has the jth value of Ai. |Dj| / |D| corresponds to the probability in Equation (2). Using Equation (5), we simply choose the attribute with the highest gainRatio value to extend the tree.

This method works because if Ai has too many values the denominator will be large. For instance, in our above example of the ID attribute, the denominator is log2 |D|. The denominator is called the split info in C4.5. One note is that the split info can be 0 or very small; some heuristic solutions can be devised to deal with it (see [453]).

3.2.3 Handling of Continuous Attributes

It may seem that the decision tree algorithm can only handle discrete attributes. In fact, continuous attributes can be dealt with easily as well. In real-life data sets there are often both discrete and continuous attributes, and handling both types in one algorithm is an important advantage.

To apply the decision tree building method, we can divide the value range of attribute Ai into intervals at a particular tree node. Each interval can then be considered a discrete value, and gain or gainRatio is evaluated in the same way as in the discrete case. Clearly, we can divide Ai into any number of intervals at a tree node, but two intervals are usually sufficient. This binary split is used in C4.5. We need to find a threshold value for the division.

Clearly, we should choose the threshold that maximizes the gain (or gainRatio), so we need to examine all possible thresholds. This is not a problem because, although a continuous attribute Ai can take an infinite number of possible values, the number of actual values that appear in the data is always finite. Let the set of distinct values of attribute Ai that occur in the data be {v1, v2, ..., vr}, sorted in ascending order. Clearly, any threshold value lying between vi and vi+1 has the same effect of dividing the training examples into those whose value of attribute Ai lies in {v1, v2, ..., vi} and those whose value lies in {vi+1, vi+2, ..., vr}. There are thus only r-1 possible splits on Ai, which can all be evaluated. The threshold value can be the mid-point between vi and vi+1, or just on the right side of value vi, which results in two intervals, Ai <= vi and Ai > vi. The latter approach is used in C4.5; its advantage is that the values appearing in the tree actually occur in the data. The threshold value that maximizes the gain (gainRatio) value is selected. We can modify the algorithm in Fig. 3.4 (lines 8-11) easily to accommodate this computation so that both discrete and continuous attributes are considered.
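A small Python sketch of the binary-split search just described: each distinct value occurring in the data is tried as a threshold Ai <= v, and the one giving the largest gain is kept. The helper names are illustrative, not from C4.5.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return 0.0 - sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def best_threshold(values, labels):
    """Try every distinct value v as a split Ai <= v and return the best (v, gain)."""
    pairs = sorted(zip(values, labels))              # one sort per continuous attribute
    base = entropy(labels)
    best_v, best_gain = None, -1.0
    for v in sorted(set(values))[:-1]:               # r-1 candidate thresholds
        left = [c for x, c in pairs if x <= v]
        right = [c for x, c in pairs if x > v]
        split = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if base - split > best_gain:
            best_v, best_gain = v, base - split
    return best_v, best_gain

# A pure split at 2.0 separates the two classes completely in this toy data:
print(best_threshold([1.0, 1.5, 2.0, 3.0, 4.0], ["O", "O", "O", "X", "X"]))  # (2.0, ~0.971)
```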

A change to line 20 of the algorithm in Fig. 3.4 is also needed. For a continuous attribute, we do not remove attribute Ag, because an interval can be further split recursively in subsequent tree extensions. Thus, the same continuous attribute may appear multiple times in a tree path (see Example 9), which does not happen for a discrete attribute.

From a geometric point of view, a decision tree built with only continuous attributes represents a partitioning of the data space. A series of splits from the root node to a leaf node represents a hyper-rectangle, and each side of the hyper-rectangle is an axis-parallel hyperplane.

Example 8: The hyper-rectangular regions in Fig. 3.6(A), which partition the space, are produced by the decision tree in Fig. 3.6(B). There are two classes in the data, represented by empty circles and filled rectangles.

Fig. 3.6. A partitioning of the data space and its corresponding decision tree: (A) a partition of the data space into axis-parallel rectangles; (B) the decision tree, whose internal nodes test conditions such as X <= 2 (vs. X > 2), Y <= 2.5, Y <= 2.6, Y <= 2, X <= 3 and X <= 4.

Handling of continuous (numeric) attributes has an impact on the efficiency of the decision tree algorithm. With only discrete attributes the algorithm grows linearly with the size of the data set D. However, sorting a continuous attribute takes |D| log |D| time, which can dominate the tree learning process. Sorting is important as it ensures that gain or gainRatio can be computed in one pass over the data.

3.2.4 Some Other Issues

We now discuss several other issues in decision tree learning.

Tree Pruning and Overfitting: A decision tree algorithm recursively partitions the data until there is no impurity or there is no attribute left. This process may result in trees that are very deep, with many tree leaves covering very few training examples. If we use such a tree to predict the training set, the accuracy will be very high. However, when it is used to classify the unseen test set, the accuracy may be very low. The learning is thus not effective, i.e., the decision tree does not generalize the data well. This

phenomenon is called overfitting. More specifically, we say that a classifier f overfits the data if there is another classifier f' such that f achieves a higher accuracy on the training data than f', but a lower accuracy on the unseen test data than f' [385].

Overfitting is usually caused by noise in the data, i.e., wrong class values/labels and/or wrong values of attributes, but it may also be due to the complexity and randomness of the application domain. These problems cause the decision tree algorithm to refine the tree by extending it very deeply using many attributes.

To reduce overfitting in the context of decision tree learning, we perform pruning of the tree, i.e., we delete some branches or sub-trees and replace them with leaves of the majority classes. There are two main methods to do this: stopping early in tree building (also called pre-pruning) and pruning the tree after it is built (called post-pruning). Post-pruning has been shown to be more effective. Early stopping can be dangerous because it is not clear what would happen if the tree were extended further without stopping. Post-pruning is more effective because, after we have extended the tree to the fullest, it becomes clearer which branches/sub-trees may not be useful (i.e., overfit the data). The general idea of post-pruning is to estimate the error of each tree node: if the estimated error for a node is less than the estimated error of its extended sub-tree, then the sub-tree is pruned. Most existing tree learning algorithms take this approach. See [453] for a technique called pessimistic error based pruning.

Example 9: In Fig. 3.6(B), the sub-tree representing the rectangular region X <= 2, Y > 2.5, Y <= 2.6 in Fig. 3.6(A) is very likely to be overfitting. The region is very small and contains only a single data point, which may be an error (or noise) in the data collection. If it is pruned, we obtain Fig. 3.7(A) and (B).

Fig. 3.7. The data space partition and the decision tree after pruning: (A) the partition of the data space without the small noisy region; (B) the pruned decision tree, in which the Y <= 2.5 and Y <= 2.6 tests no longer appear.
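The following Python sketch illustrates the post-pruning idea in its simplest form: a sub-tree is replaced by a majority-class leaf when an error estimate for the collapsed leaf is no worse than that of the sub-tree. Here the "estimate" just counts misclassifications among the examples passed in; C4.5's pessimistic error estimate is more involved, so this is only an illustration of the general idea, not C4.5's method. The nested-dict tree format follows the earlier sketches.

```python
from collections import Counter

def _predict(node, row, default):
    """Classify row with the (sub-)tree, falling back to a default for unseen branches."""
    while isinstance(node, dict):
        node = node["branches"].get(row[node["attribute"]], default)
    return node

def prune(node, D):
    """Bottom-up pruning of a nested-dict tree; D holds the examples reaching the node."""
    if not isinstance(node, dict) or not D:
        return node
    for value in list(node["branches"]):                        # prune the children first
        Dv = [row for row in D if row[node["attribute"]] == value]
        node["branches"][value] = prune(node["branches"][value], Dv)
    majority, count = Counter(row["Class"] for row in D).most_common(1)[0]
    leaf_errors = len(D) - count                                # errors if collapsed to a leaf
    subtree_errors = sum(1 for row in D
                         if _predict(node, row, majority) != row["Class"])
    return majority if leaf_errors <= subtree_errors else node  # keep the smaller description
```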

Another common approach to pruning is to use a separate set of data called the validation set, which is used neither in training nor in testing. After a tree is built, it is used to classify the validation set. Then, we can find the errors at each node on the validation set, which tells us what to prune based on the errors at each node.

Rule Pruning: We noted earlier that a decision tree can be converted to a set of rules. In fact, C4.5 also prunes the rules to simplify them and to reduce overfitting. First, the tree (C4.5 uses the unpruned tree) is converted to a set of rules in the way discussed in Example 4. Rule pruning is then performed by removing some conditions to make the rules shorter and fewer (after pruning, some rules may become redundant). In most cases, pruning results in a more accurate rule set, as shorter rules are less likely to overfit the training data. Pruning is also called generalization, as it makes rules more general (with fewer conditions): a rule with more conditions is more specific than a rule with fewer conditions.

Example 10: The sub-tree below X <= 2 in Fig. 3.6(B) produces the following rules, where the filled-rectangle class is written as [filled] and the empty-circle class as O:

Rule 1: X <= 2, Y > 2.5, Y > 2.6 -> [filled]
Rule 2: X <= 2, Y > 2.5, Y <= 2.6 -> O
Rule 3: X <= 2, Y <= 2.5 -> [filled]

Note that Y > 2.5 in Rule 1 is not useful because of Y > 2.6, and thus Rule 1 should be

Rule 1: X <= 2, Y > 2.6 -> [filled]

In pruning, we may be able to delete the condition Y > 2.6 from Rule 1 to produce:

X <= 2 -> [filled]

Then Rule 2 and Rule 3 become redundant and can be removed.

A useful point to note is that after pruning the resulting set of rules may no longer be mutually exclusive and exhaustive. There may be data points that satisfy the conditions of more than one rule and, if inaccurate rules are discarded, of no rules. An ordering of the rules is thus needed to ensure that when classifying a test case only one rule is applied to determine the class of the test case. To deal with the situation in which a test case does not satisfy the conditions of any rule, a default class is used, which is usually the majority class.

Handling Missing Attribute Values: In many practical data sets, some attribute values are missing or not available due to various reasons. There are many ways to deal with the problem. For example, we can fill each

missing value with the special value "unknown" or with the most frequent value of the attribute if the attribute is discrete. If the attribute is continuous, we can use the mean of the attribute for each missing value. The decision tree algorithm in C4.5 takes another approach: at a tree node, it distributes the training example with a missing value for the attribute to each branch of the tree proportionally, according to the distribution of the training examples that do have values for the attribute.

Handling Skewed Class Distribution: In many applications, the proportions of data for different classes can be very different. For instance, in a data set for intrusion detection in computer networks, the proportion of intrusion cases is extremely small (< 1%) compared with normal cases. Directly applying the decision tree algorithm for classification or prediction of intrusions is usually not effective: the resulting decision tree often consists of a single leaf node "normal", which is useless for intrusion detection. One way to deal with the problem is to over-sample the intrusion examples to increase their proportion. Another solution is to rank the new cases according to how likely they are to be intrusions; the human users can then investigate the top-ranked cases.

3.3 Classifier Evaluation

After a classifier is constructed, it needs to be evaluated for accuracy. Effective evaluation is crucial because without knowing the approximate accuracy of a classifier, it cannot be used in real-world tasks.

There are many ways to evaluate a classifier, and there are also many measures. The main measure is the classification accuracy (Equation 1), which is the number of correctly classified instances in the test set divided by the total number of instances in the test set. Some researchers also use the error rate, which is 1 - accuracy. Clearly, if we have several classifiers, the one with the highest accuracy is preferred. Statistical significance tests may be used to check whether one classifier's accuracy is significantly better than that of another given the same training and test data sets. Below, we first present several common methods for classifier evaluation, and then introduce some other evaluation measures.

3.3.1 Evaluation Methods

Holdout Set: The available data D is divided into two disjoint subsets, the training set Dtrain and the test set Dtest, with D = Dtrain union Dtest and Dtrain intersect Dtest = {}.

The test set is also called the holdout set. This method is mainly used when the data set D is large. Note that the examples in the original data set D are all labeled with classes.

As we discussed earlier, the training set is used for learning a classifier while the test set is used for evaluating the resulting classifier. The training set should not be used to evaluate the classifier, because the classifier is biased toward the training set. That is, the classifier may overfit the training set, which results in very high accuracy on the training set but low accuracy on the test set. Using the unseen test set gives an unbiased estimate of the classification accuracy. As for what percentage of the data should be used for training and what percentage for testing, it depends on the data set size; two thirds for training and one third for testing are commonly used.

To partition D into training and test sets, we can use a few approaches:
1. We randomly sample a set of training examples from D for learning and use the rest for testing.
2. If the data is collected over time, then we can use the earlier part of the data for training/learning and the later part of the data for testing. In many applications, this is a more suitable approach because, when the classifier is used in the real world, the data come from the future. This approach thus better reflects the dynamic aspects of applications.

Multiple Random Sampling: When the available data set is small, using the above methods can be unreliable because the test set would be too small to be representative. One approach to deal with the problem is to perform the above random sampling n times. Each time a different training set and a different test set are produced, which gives n accuracies. The final estimated accuracy on the data is the average of the n accuracies.

Cross-Validation: When the data set is small, the n-fold cross-validation method is very commonly used. In this method, the available data is partitioned into n equal-size disjoint subsets. Each subset is then used in turn as the test set, with the remaining n-1 subsets combined as the training set to learn a classifier. This procedure is run n times, which gives n accuracies. The final estimated accuracy of learning from this data set is the average of the n accuracies. 10-fold and 5-fold cross-validations are often used.

A special case of cross-validation is leave-one-out cross-validation. In this method, each fold of the cross-validation has only a single test example and all the rest of the data is used for training. That is, if the original data has m examples, then this is m-fold cross-validation. This method is normally used when the available data is very small. It is not efficient for a large data set, as m classifiers need to be built.
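A minimal sketch of n-fold cross-validation in Python; the train() and predict() callables are placeholders for any learning algorithm and are assumptions of this example, not names used in the book.

```python
def cross_validate(D, train, predict, n=10):
    """Average accuracy over n folds; D is a list of (attributes, class) pairs.

    train(training_rows) -> model and predict(model, attributes) -> class are
    placeholders for any supervised learning algorithm.  The data is normally
    shuffled before the folds are formed.
    """
    folds = [D[i::n] for i in range(n)]          # n roughly equal-size disjoint subsets
    accuracies = []
    for i in range(n):
        test = folds[i]                          # fold i is held out for testing
        training = [row for j, fold in enumerate(folds) if j != i for row in fold]
        model = train(training)                  # learn on the remaining n-1 folds
        correct = sum(1 for x, y in test if predict(model, x) == y)
        accuracies.append(correct / len(test))
    return sum(accuracies) / n                   # the final estimated accuracy
```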

In Sect. 3.2.4, we mentioned that a validation set can be used to prune a decision tree or a set of rules. If a validation set is employed for that purpose, it should not be used in testing; in that case, the available data is divided into three subsets: a training set, a validation set and a test set. Apart from helping tree or rule pruning, a validation set is also used frequently to estimate parameters in learning algorithms. In such cases, the values that give the best accuracy on the validation set are used as the final values of the parameters. Cross-validation can be used for parameter estimation as well; a separate validation set is then not needed, and the whole training set is used in the cross-validation.

3.3.2 Precision, Recall, F-score and Breakeven Point

In some applications, we are only interested in one class. This is particularly true for text and Web applications. For example, we may be interested only in the documents or Web pages of a particular topic. Also, in classification involving skewed or highly imbalanced data, e.g., network intrusion and financial fraud detection, we are typically interested only in the minority class. The class that the user is interested in is commonly called the positive class, and the rest are negative classes (the negative classes may be combined into one negative class). Accuracy is not a suitable measure in such cases because we may achieve a very high accuracy without identifying a single intrusion. For instance, if 99% of the cases in an intrusion detection data set are normal, a classifier can achieve 99% accuracy without doing anything, by simply classifying every test case as "not intrusion". This is, however, useless.

Precision and recall are more suitable in such applications because they measure how precise and how complete the classification is on the positive class. It is convenient to introduce these measures using a confusion matrix (Table 3.2). A confusion matrix contains information about the actual and predicted results given by a classifier.

Table 3.2. Confusion matrix of a classifier

                   Classified positive   Classified negative
Actual positive    TP                    FN
Actual negative    FP                    TN

where
TP: the number of correct classifications of positive examples (true positives),
FN: the number of incorrect classifications of positive examples (false negatives),
FP: the number of incorrect classifications of negative examples (false positives),
TN: the number of correct classifications of negative examples (true negatives).

Based on the confusion matrix, the precision (p) and recall (r) of the positive class are defined as follows:

    p = TP / (TP + FP),        r = TP / (TP + FN).    (6)

In words, precision p is the number of correctly classified positive examples divided by the total number of examples that are classified as positive. Recall r is the number of correctly classified positive examples divided by the total number of actual positive examples in the test set. The intuitive meanings of these two measures are quite obvious.

However, it is hard to compare classifiers based on two measures that are not functionally related. For a test set, the precision may be very high while the recall is very low, and vice versa.

Example 11: A test data set has 100 positive examples and 1000 negative examples. After classification using a classifier, we have the following confusion matrix (Table 3.3):

Table 3.3. Confusion matrix of a classifier

                   Classified positive   Classified negative
Actual positive    1                     99
Actual negative    0                     1000

This confusion matrix gives the precision p = 100% and the recall r = 1%, because we classified only one positive example correctly and classified no negative examples wrongly.

Although in theory precision and recall are not related, in practice high precision is achieved almost always at the expense of recall, and high recall at the expense of precision. Which measure is more important depends on the nature of the application. If we need a single measure to compare different classifiers, the F-score is often used:

    F = 2pr / (p + r).    (7)

The F-score (also called the F1-score) is the harmonic mean of precision and recall:

    F = 2 / (1/p + 1/r).    (8)

The harmonic mean of two numbers tends to be closer to the smaller of the two. Thus, for the F-score to be high, both p and r must be high.
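A small Python sketch of Equations (6)-(8), using the counts from the confusion matrix of Example 11.

```python
def precision_recall_f(tp, fp, fn):
    """Equations (6) and (7) computed from confusion-matrix counts."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0     # harmonic mean of p and r
    return p, r, f

# The classifier of Example 11 / Table 3.3: TP = 1, FP = 0, FN = 99
print(precision_recall_f(tp=1, fp=0, fn=99))      # (1.0, 0.01, ~0.0198): p = 100%, r = 1%
```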

There is also another measure, called the precision and recall breakeven point, which is used in the information retrieval community. The breakeven point is reached when the precision and the recall are equal. This measure assumes that the test cases can be ranked by the classifier based on their likelihoods of being positive. For instance, in decision tree classification, we can use the confidence of each leaf node as the value to rank test cases.

Example 12: Consider a ranking of 10 test documents, where rank 1 is the highest rank, rank 10 is the lowest rank, and + (-) marks an actual positive (negative) document. Assume that the test set has 10 positive examples.

At rank 1:  p = 1/1 = 100%    r = 1/10 = 10%
At rank 2:  p = 2/2 = 100%    r = 2/10 = 20%
...
At rank 9:  p = 6/9 = 66.7%   r = 6/10 = 60%
At rank 10: p = 7/10 = 70%    r = 7/10 = 70%

The breakeven point is p = r = 70%. Note that interpolation is needed if such a point cannot be found.

3.4 Rule Induction

In Sect. 3.2, we showed that a decision tree can be converted to a set of rules. Clearly, the set of rules can be used for classification in the same way as the tree. A natural question is whether it is possible to learn classification rules directly. The answer is yes. The process of learning such rules is called rule induction or rule learning. We study two approaches in this section.

3.4.1 Sequential Covering

Most rule induction systems use an algorithm called sequential covering. A classifier built with this algorithm consists of a list of rules, which is also called a decision list [463]. In the list, the ordering of the rules is significant. The basic idea of sequential covering is to learn a list of rules sequentially, one at a time, to cover the training data. After each rule is learned, the training examples covered by the rule are removed, and only the remaining data are used to learn subsequent rules.


Learning the Kernel Parameters in Kernel Minimum Distance Classifier Learnng the Kernel Parameters n Kernel Mnmum Dstance Classfer Daoqang Zhang 1,, Songcan Chen and Zh-Hua Zhou 1* 1 Natonal Laboratory for Novel Software Technology Nanjng Unversty, Nanjng 193, Chna Department

More information

R s s f. m y s. SPH3UW Unit 7.3 Spherical Concave Mirrors Page 1 of 12. Notes

R s s f. m y s. SPH3UW Unit 7.3 Spherical Concave Mirrors Page 1 of 12. Notes SPH3UW Unt 7.3 Sphercal Concave Mrrors Page 1 of 1 Notes Physcs Tool box Concave Mrror If the reflectng surface takes place on the nner surface of the sphercal shape so that the centre of the mrror bulges

More information

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

CS246: Mining Massive Datasets Jure Leskovec, Stanford University CS46: Mnng Massve Datasets Jure Leskovec, Stanford Unversty http://cs46.stanford.edu /19/013 Jure Leskovec, Stanford CS46: Mnng Massve Datasets, http://cs46.stanford.edu Perceptron: y = sgn( x Ho to fnd

More information

ON SOME ENTERTAINING APPLICATIONS OF THE CONCEPT OF SET IN COMPUTER SCIENCE COURSE

ON SOME ENTERTAINING APPLICATIONS OF THE CONCEPT OF SET IN COMPUTER SCIENCE COURSE Yordzhev K., Kostadnova H. Інформаційні технології в освіті ON SOME ENTERTAINING APPLICATIONS OF THE CONCEPT OF SET IN COMPUTER SCIENCE COURSE Yordzhev K., Kostadnova H. Some aspects of programmng educaton

More information

Outline. Type of Machine Learning. Examples of Application. Unsupervised Learning

Outline. Type of Machine Learning. Examples of Application. Unsupervised Learning Outlne Artfcal Intellgence and ts applcatons Lecture 8 Unsupervsed Learnng Professor Danel Yeung danyeung@eee.org Dr. Patrck Chan patrckchan@eee.org South Chna Unversty of Technology, Chna Introducton

More information

Feature Selection as an Improving Step for Decision Tree Construction

Feature Selection as an Improving Step for Decision Tree Construction 2009 Internatonal Conference on Machne Learnng and Computng IPCSIT vol.3 (2011) (2011) IACSIT Press, Sngapore Feature Selecton as an Improvng Step for Decson Tree Constructon Mahd Esmael 1, Fazekas Gabor

More information

FEATURE EXTRACTION. Dr. K.Vijayarekha. Associate Dean School of Electrical and Electronics Engineering SASTRA University, Thanjavur

FEATURE EXTRACTION. Dr. K.Vijayarekha. Associate Dean School of Electrical and Electronics Engineering SASTRA University, Thanjavur FEATURE EXTRACTION Dr. K.Vjayarekha Assocate Dean School of Electrcal and Electroncs Engneerng SASTRA Unversty, Thanjavur613 41 Jont Intatve of IITs and IISc Funded by MHRD Page 1 of 8 Table of Contents

More information

Helsinki University Of Technology, Systems Analysis Laboratory Mat Independent research projects in applied mathematics (3 cr)

Helsinki University Of Technology, Systems Analysis Laboratory Mat Independent research projects in applied mathematics (3 cr) Helsnk Unversty Of Technology, Systems Analyss Laboratory Mat-2.08 Independent research projects n appled mathematcs (3 cr) "! #$&% Antt Laukkanen 506 R ajlaukka@cc.hut.f 2 Introducton...3 2 Multattrbute

More information

On Some Entertaining Applications of the Concept of Set in Computer Science Course

On Some Entertaining Applications of the Concept of Set in Computer Science Course On Some Entertanng Applcatons of the Concept of Set n Computer Scence Course Krasmr Yordzhev *, Hrstna Kostadnova ** * Assocate Professor Krasmr Yordzhev, Ph.D., Faculty of Mathematcs and Natural Scences,

More information

Determining the Optimal Bandwidth Based on Multi-criterion Fusion

Determining the Optimal Bandwidth Based on Multi-criterion Fusion Proceedngs of 01 4th Internatonal Conference on Machne Learnng and Computng IPCSIT vol. 5 (01) (01) IACSIT Press, Sngapore Determnng the Optmal Bandwdth Based on Mult-crteron Fuson Ha-L Lang 1+, Xan-Mn

More information

Determining Fuzzy Sets for Quantitative Attributes in Data Mining Problems

Determining Fuzzy Sets for Quantitative Attributes in Data Mining Problems Determnng Fuzzy Sets for Quanttatve Attrbutes n Data Mnng Problems ATTILA GYENESEI Turku Centre for Computer Scence (TUCS) Unversty of Turku, Department of Computer Scence Lemmnkäsenkatu 4A, FIN-5 Turku

More information

BioTechnology. An Indian Journal FULL PAPER. Trade Science Inc.

BioTechnology. An Indian Journal FULL PAPER. Trade Science Inc. [Type text] [Type text] [Type text] ISSN : 0974-74 Volume 0 Issue BoTechnology 04 An Indan Journal FULL PAPER BTAIJ 0() 04 [684-689] Revew on Chna s sports ndustry fnancng market based on market -orented

More information

A Deflected Grid-based Algorithm for Clustering Analysis

A Deflected Grid-based Algorithm for Clustering Analysis A Deflected Grd-based Algorthm for Clusterng Analyss NANCY P. LIN, CHUNG-I CHANG, HAO-EN CHUEH, HUNG-JEN CHEN, WEI-HUA HAO Department of Computer Scence and Informaton Engneerng Tamkang Unversty 5 Yng-chuan

More information

BOOSTING CLASSIFICATION ACCURACY WITH SAMPLES CHOSEN FROM A VALIDATION SET

BOOSTING CLASSIFICATION ACCURACY WITH SAMPLES CHOSEN FROM A VALIDATION SET 1 BOOSTING CLASSIFICATION ACCURACY WITH SAMPLES CHOSEN FROM A VALIDATION SET TZU-CHENG CHUANG School of Electrcal and Computer Engneerng, Purdue Unversty, West Lafayette, Indana 47907 SAUL B. GELFAND School

More information

CHAPTER 3 SEQUENTIAL MINIMAL OPTIMIZATION TRAINED SUPPORT VECTOR CLASSIFIER FOR CANCER PREDICTION

CHAPTER 3 SEQUENTIAL MINIMAL OPTIMIZATION TRAINED SUPPORT VECTOR CLASSIFIER FOR CANCER PREDICTION 48 CHAPTER 3 SEQUENTIAL MINIMAL OPTIMIZATION TRAINED SUPPORT VECTOR CLASSIFIER FOR CANCER PREDICTION 3.1 INTRODUCTION The raw mcroarray data s bascally an mage wth dfferent colors ndcatng hybrdzaton (Xue

More information

Parallel matrix-vector multiplication

Parallel matrix-vector multiplication Appendx A Parallel matrx-vector multplcaton The reduced transton matrx of the three-dmensonal cage model for gel electrophoress, descrbed n secton 3.2, becomes excessvely large for polymer lengths more

More information

6.854 Advanced Algorithms Petar Maymounkov Problem Set 11 (November 23, 2005) With: Benjamin Rossman, Oren Weimann, and Pouya Kheradpour

6.854 Advanced Algorithms Petar Maymounkov Problem Set 11 (November 23, 2005) With: Benjamin Rossman, Oren Weimann, and Pouya Kheradpour 6.854 Advanced Algorthms Petar Maymounkov Problem Set 11 (November 23, 2005) Wth: Benjamn Rossman, Oren Wemann, and Pouya Kheradpour Problem 1. We reduce vertex cover to MAX-SAT wth weghts, such that the

More information

S1 Note. Basis functions.

S1 Note. Basis functions. S1 Note. Bass functons. Contents Types of bass functons...1 The Fourer bass...2 B-splne bass...3 Power and type I error rates wth dfferent numbers of bass functons...4 Table S1. Smulaton results of type

More information

Data Representation in Digital Design, a Single Conversion Equation and a Formal Languages Approach

Data Representation in Digital Design, a Single Conversion Equation and a Formal Languages Approach Data Representaton n Dgtal Desgn, a Sngle Converson Equaton and a Formal Languages Approach Hassan Farhat Unversty of Nebraska at Omaha Abstract- In the study of data representaton n dgtal desgn and computer

More information

SI485i : NLP. Set 5 Using Naïve Bayes

SI485i : NLP. Set 5 Using Naïve Bayes SI485 : NL Set 5 Usng Naïve Baes Motvaton We want to predct somethng. We have some text related to ths somethng. somethng = target label text = text features Gven, what s the most probable? Motvaton: Author

More information

Unsupervised Learning

Unsupervised Learning Pattern Recognton Lecture 8 Outlne Introducton Unsupervsed Learnng Parametrc VS Non-Parametrc Approach Mxture of Denstes Maxmum-Lkelhood Estmates Clusterng Prof. Danel Yeung School of Computer Scence and

More information

Implementation Naïve Bayes Algorithm for Student Classification Based on Graduation Status

Implementation Naïve Bayes Algorithm for Student Classification Based on Graduation Status Internatonal Journal of Appled Busness and Informaton Systems ISSN: 2597-8993 Vol 1, No 2, September 2017, pp. 6-12 6 Implementaton Naïve Bayes Algorthm for Student Classfcaton Based on Graduaton Status

More information

Programming in Fortran 90 : 2017/2018

Programming in Fortran 90 : 2017/2018 Programmng n Fortran 90 : 2017/2018 Programmng n Fortran 90 : 2017/2018 Exercse 1 : Evaluaton of functon dependng on nput Wrte a program who evaluate the functon f (x,y) for any two user specfed values

More information

The Greedy Method. Outline and Reading. Change Money Problem. Greedy Algorithms. Applications of the Greedy Strategy. The Greedy Method Technique

The Greedy Method. Outline and Reading. Change Money Problem. Greedy Algorithms. Applications of the Greedy Strategy. The Greedy Method Technique //00 :0 AM Outlne and Readng The Greedy Method The Greedy Method Technque (secton.) Fractonal Knapsack Problem (secton..) Task Schedulng (secton..) Mnmum Spannng Trees (secton.) Change Money Problem Greedy

More information

Skew Angle Estimation and Correction of Hand Written, Textual and Large areas of Non-Textual Document Images: A Novel Approach

Skew Angle Estimation and Correction of Hand Written, Textual and Large areas of Non-Textual Document Images: A Novel Approach Angle Estmaton and Correcton of Hand Wrtten, Textual and Large areas of Non-Textual Document Images: A Novel Approach D.R.Ramesh Babu Pyush M Kumat Mahesh D Dhannawat PES Insttute of Technology Research

More information

A New Approach For the Ranking of Fuzzy Sets With Different Heights

A New Approach For the Ranking of Fuzzy Sets With Different Heights New pproach For the ankng of Fuzzy Sets Wth Dfferent Heghts Pushpnder Sngh School of Mathematcs Computer pplcatons Thapar Unversty, Patala-7 00 Inda pushpndersnl@gmalcom STCT ankng of fuzzy sets plays

More information

Learning to Classify Documents with Only a Small Positive Training Set

Learning to Classify Documents with Only a Small Positive Training Set Learnng to Classfy Documents wth Only a Small Postve Tranng Set Xao-L L 1, Bng Lu 2, and See-Kong Ng 1 1 Insttute for Infocomm Research, Heng Mu Keng Terrace, 119613, Sngapore 2 Department of Computer

More information

Tsinghua University at TAC 2009: Summarizing Multi-documents by Information Distance

Tsinghua University at TAC 2009: Summarizing Multi-documents by Information Distance Tsnghua Unversty at TAC 2009: Summarzng Mult-documents by Informaton Dstance Chong Long, Mnle Huang, Xaoyan Zhu State Key Laboratory of Intellgent Technology and Systems, Tsnghua Natonal Laboratory for

More information

FINDING IMPORTANT NODES IN SOCIAL NETWORKS BASED ON MODIFIED PAGERANK

FINDING IMPORTANT NODES IN SOCIAL NETWORKS BASED ON MODIFIED PAGERANK FINDING IMPORTANT NODES IN SOCIAL NETWORKS BASED ON MODIFIED PAGERANK L-qng Qu, Yong-quan Lang 2, Jng-Chen 3, 2 College of Informaton Scence and Technology, Shandong Unversty of Scence and Technology,

More information

Meta-heuristics for Multidimensional Knapsack Problems

Meta-heuristics for Multidimensional Knapsack Problems 2012 4th Internatonal Conference on Computer Research and Development IPCSIT vol.39 (2012) (2012) IACSIT Press, Sngapore Meta-heurstcs for Multdmensonal Knapsack Problems Zhbao Man + Computer Scence Department,

More information

Machine Learning. Support Vector Machines. (contains material adapted from talks by Constantin F. Aliferis & Ioannis Tsamardinos, and Martin Law)

Machine Learning. Support Vector Machines. (contains material adapted from talks by Constantin F. Aliferis & Ioannis Tsamardinos, and Martin Law) Machne Learnng Support Vector Machnes (contans materal adapted from talks by Constantn F. Alfers & Ioanns Tsamardnos, and Martn Law) Bryan Pardo, Machne Learnng: EECS 349 Fall 2014 Support Vector Machnes

More information

Load Balancing for Hex-Cell Interconnection Network

Load Balancing for Hex-Cell Interconnection Network Int. J. Communcatons, Network and System Scences,,, - Publshed Onlne Aprl n ScRes. http://www.scrp.org/journal/jcns http://dx.do.org/./jcns.. Load Balancng for Hex-Cell Interconnecton Network Saher Manaseer,

More information

Investigating the Performance of Naïve- Bayes Classifiers and K- Nearest Neighbor Classifiers

Investigating the Performance of Naïve- Bayes Classifiers and K- Nearest Neighbor Classifiers Journal of Convergence Informaton Technology Volume 5, Number 2, Aprl 2010 Investgatng the Performance of Naïve- Bayes Classfers and K- Nearest Neghbor Classfers Mohammed J. Islam *, Q. M. Jonathan Wu,

More information

Learning-Based Top-N Selection Query Evaluation over Relational Databases

Learning-Based Top-N Selection Query Evaluation over Relational Databases Learnng-Based Top-N Selecton Query Evaluaton over Relatonal Databases Lang Zhu *, Wey Meng ** * School of Mathematcs and Computer Scence, Hebe Unversty, Baodng, Hebe 071002, Chna, zhu@mal.hbu.edu.cn **

More information

Subspace clustering. Clustering. Fundamental to all clustering techniques is the choice of distance measure between data points;

Subspace clustering. Clustering. Fundamental to all clustering techniques is the choice of distance measure between data points; Subspace clusterng Clusterng Fundamental to all clusterng technques s the choce of dstance measure between data ponts; D q ( ) ( ) 2 x x = x x, j k = 1 k jk Squared Eucldean dstance Assumpton: All features

More information

A Fast Content-Based Multimedia Retrieval Technique Using Compressed Data

A Fast Content-Based Multimedia Retrieval Technique Using Compressed Data A Fast Content-Based Multmeda Retreval Technque Usng Compressed Data Borko Furht and Pornvt Saksobhavvat NSF Multmeda Laboratory Florda Atlantc Unversty, Boca Raton, Florda 3343 ABSTRACT In ths paper,

More information

Problem Set 3 Solutions

Problem Set 3 Solutions Introducton to Algorthms October 4, 2002 Massachusetts Insttute of Technology 6046J/18410J Professors Erk Demane and Shaf Goldwasser Handout 14 Problem Set 3 Solutons (Exercses were not to be turned n,

More information

X- Chart Using ANOM Approach

X- Chart Using ANOM Approach ISSN 1684-8403 Journal of Statstcs Volume 17, 010, pp. 3-3 Abstract X- Chart Usng ANOM Approach Gullapall Chakravarth 1 and Chaluvad Venkateswara Rao Control lmts for ndvdual measurements (X) chart are

More information

Sequential search. Building Java Programs Chapter 13. Sequential search. Sequential search

Sequential search. Building Java Programs Chapter 13. Sequential search. Sequential search Sequental search Buldng Java Programs Chapter 13 Searchng and Sortng sequental search: Locates a target value n an array/lst by examnng each element from start to fnsh. How many elements wll t need to

More information

Insertion Sort. Divide and Conquer Sorting. Divide and Conquer. Mergesort. Mergesort Example. Auxiliary Array

Insertion Sort. Divide and Conquer Sorting. Divide and Conquer. Mergesort. Mergesort Example. Auxiliary Array Inserton Sort Dvde and Conquer Sortng CSE 6 Data Structures Lecture 18 What f frst k elements of array are already sorted? 4, 7, 1, 5, 1, 16 We can shft the tal of the sorted elements lst down and then

More information

Lecture 5: Multilayer Perceptrons

Lecture 5: Multilayer Perceptrons Lecture 5: Multlayer Perceptrons Roger Grosse 1 Introducton So far, we ve only talked about lnear models: lnear regresson and lnear bnary classfers. We noted that there are functons that can t be represented

More information

MATHEMATICS FORM ONE SCHEME OF WORK 2004

MATHEMATICS FORM ONE SCHEME OF WORK 2004 MATHEMATICS FORM ONE SCHEME OF WORK 2004 WEEK TOPICS/SUBTOPICS LEARNING OBJECTIVES LEARNING OUTCOMES VALUES CREATIVE & CRITICAL THINKING 1 WHOLE NUMBER Students wll be able to: GENERICS 1 1.1 Concept of

More information

TN348: Openlab Module - Colocalization

TN348: Openlab Module - Colocalization TN348: Openlab Module - Colocalzaton Topc The Colocalzaton module provdes the faclty to vsualze and quantfy colocalzaton between pars of mages. The Colocalzaton wndow contans a prevew of the two mages

More information

Sorting Review. Sorting. Comparison Sorting. CSE 680 Prof. Roger Crawfis. Assumptions

Sorting Review. Sorting. Comparison Sorting. CSE 680 Prof. Roger Crawfis. Assumptions Sortng Revew Introducton to Algorthms Qucksort CSE 680 Prof. Roger Crawfs Inserton Sort T(n) = Θ(n 2 ) In-place Merge Sort T(n) = Θ(n lg(n)) Not n-place Selecton Sort (from homework) T(n) = Θ(n 2 ) In-place

More information

A Statistical Model Selection Strategy Applied to Neural Networks

A Statistical Model Selection Strategy Applied to Neural Networks A Statstcal Model Selecton Strategy Appled to Neural Networks Joaquín Pzarro Elsa Guerrero Pedro L. Galndo joaqun.pzarro@uca.es elsa.guerrero@uca.es pedro.galndo@uca.es Dpto Lenguajes y Sstemas Informátcos

More information

SLAM Summer School 2006 Practical 2: SLAM using Monocular Vision

SLAM Summer School 2006 Practical 2: SLAM using Monocular Vision SLAM Summer School 2006 Practcal 2: SLAM usng Monocular Vson Javer Cvera, Unversty of Zaragoza Andrew J. Davson, Imperal College London J.M.M Montel, Unversty of Zaragoza. josemar@unzar.es, jcvera@unzar.es,

More information

Chapter 6 Programmng the fnte element method Inow turn to the man subject of ths book: The mplementaton of the fnte element algorthm n computer programs. In order to make my dscusson as straghtforward

More information

A Binarization Algorithm specialized on Document Images and Photos

A Binarization Algorithm specialized on Document Images and Photos A Bnarzaton Algorthm specalzed on Document mages and Photos Ergna Kavalleratou Dept. of nformaton and Communcaton Systems Engneerng Unversty of the Aegean kavalleratou@aegean.gr Abstract n ths paper, a

More information

Edge Detection in Noisy Images Using the Support Vector Machines

Edge Detection in Noisy Images Using the Support Vector Machines Edge Detecton n Nosy Images Usng the Support Vector Machnes Hlaro Gómez-Moreno, Saturnno Maldonado-Bascón, Francsco López-Ferreras Sgnal Theory and Communcatons Department. Unversty of Alcalá Crta. Madrd-Barcelona

More information

Optimizing Document Scoring for Query Retrieval

Optimizing Document Scoring for Query Retrieval Optmzng Document Scorng for Query Retreval Brent Ellwen baellwe@cs.stanford.edu Abstract The goal of ths project was to automate the process of tunng a document query engne. Specfcally, I used machne learnng

More information

Classifying Acoustic Transient Signals Using Artificial Intelligence

Classifying Acoustic Transient Signals Using Artificial Intelligence Classfyng Acoustc Transent Sgnals Usng Artfcal Intellgence Steve Sutton, Unversty of North Carolna At Wlmngton (suttons@charter.net) Greg Huff, Unversty of North Carolna At Wlmngton (jgh7476@uncwl.edu)

More information

Detection of hand grasping an object from complex background based on machine learning co-occurrence of local image feature

Detection of hand grasping an object from complex background based on machine learning co-occurrence of local image feature Detecton of hand graspng an object from complex background based on machne learnng co-occurrence of local mage feature Shnya Moroka, Yasuhro Hramoto, Nobutaka Shmada, Tadash Matsuo, Yoshak Shra Rtsumekan

More information

CAN COMPUTERS LEARN FASTER? Seyda Ertekin Computer Science & Engineering The Pennsylvania State University

CAN COMPUTERS LEARN FASTER? Seyda Ertekin Computer Science & Engineering The Pennsylvania State University CAN COMPUTERS LEARN FASTER? Seyda Ertekn Computer Scence & Engneerng The Pennsylvana State Unversty sertekn@cse.psu.edu ABSTRACT Ever snce computers were nvented, manknd wondered whether they mght be made

More information

5 The Primal-Dual Method

5 The Primal-Dual Method 5 The Prmal-Dual Method Orgnally desgned as a method for solvng lnear programs, where t reduces weghted optmzaton problems to smpler combnatoral ones, the prmal-dual method (PDM) has receved much attenton

More information

Pruning Training Corpus to Speedup Text Classification 1

Pruning Training Corpus to Speedup Text Classification 1 Prunng Tranng Corpus to Speedup Text Classfcaton Jhong Guan and Shugeng Zhou School of Computer Scence, Wuhan Unversty, Wuhan, 430079, Chna hguan@wtusm.edu.cn State Key Lab of Software Engneerng, Wuhan

More information

A Clustering Algorithm for Chinese Adjectives and Nouns 1

A Clustering Algorithm for Chinese Adjectives and Nouns 1 Clusterng lgorthm for Chnese dectves and ouns Yang Wen, Chunfa Yuan, Changnng Huang 2 State Key aboratory of Intellgent Technology and System Deptartment of Computer Scence & Technology, Tsnghua Unversty,

More information

Under-Sampling Approaches for Improving Prediction of the Minority Class in an Imbalanced Dataset

Under-Sampling Approaches for Improving Prediction of the Minority Class in an Imbalanced Dataset Under-Samplng Approaches for Improvng Predcton of the Mnorty Class n an Imbalanced Dataset Show-Jane Yen and Yue-Sh Lee Department of Computer Scence and Informaton Engneerng, Mng Chuan Unversty 5 The-Mng

More information

Classifier Selection Based on Data Complexity Measures *

Classifier Selection Based on Data Complexity Measures * Classfer Selecton Based on Data Complexty Measures * Edth Hernández-Reyes, J.A. Carrasco-Ochoa, and J.Fco. Martínez-Trndad Natonal Insttute for Astrophyscs, Optcs and Electroncs, Lus Enrque Erro No.1 Sta.

More information

Query Clustering Using a Hybrid Query Similarity Measure

Query Clustering Using a Hybrid Query Similarity Measure Query clusterng usng a hybrd query smlarty measure Fu. L., Goh, D.H., & Foo, S. (2004). WSEAS Transacton on Computers, 3(3), 700-705. Query Clusterng Usng a Hybrd Query Smlarty Measure Ln Fu, Don Hoe-Lan

More information

Support Vector Machines

Support Vector Machines Support Vector Machnes Decson surface s a hyperplane (lne n 2D) n feature space (smlar to the Perceptron) Arguably, the most mportant recent dscovery n machne learnng In a nutshell: map the data to a predetermned

More information

Three supervised learning methods on pen digits character recognition dataset

Three supervised learning methods on pen digits character recognition dataset Three supervsed learnng methods on pen dgts character recognton dataset Chrs Flezach Department of Computer Scence and Engneerng Unversty of Calforna, San Dego San Dego, CA 92093 cflezac@cs.ucsd.edu Satoru

More information

Problem Definitions and Evaluation Criteria for Computational Expensive Optimization

Problem Definitions and Evaluation Criteria for Computational Expensive Optimization Problem efntons and Evaluaton Crtera for Computatonal Expensve Optmzaton B. Lu 1, Q. Chen and Q. Zhang 3, J. J. Lang 4, P. N. Suganthan, B. Y. Qu 6 1 epartment of Computng, Glyndwr Unversty, UK Faclty

More information

Classification / Regression Support Vector Machines

Classification / Regression Support Vector Machines Classfcaton / Regresson Support Vector Machnes Jeff Howbert Introducton to Machne Learnng Wnter 04 Topcs SVM classfers for lnearly separable classes SVM classfers for non-lnearly separable classes SVM

More information

Petri Net Based Software Dependability Engineering

Petri Net Based Software Dependability Engineering Proc. RELECTRONIC 95, Budapest, pp. 181-186; October 1995 Petr Net Based Software Dependablty Engneerng Monka Hener Brandenburg Unversty of Technology Cottbus Computer Scence Insttute Postbox 101344 D-03013

More information

Feature Reduction and Selection

Feature Reduction and Selection Feature Reducton and Selecton Dr. Shuang LIANG School of Software Engneerng TongJ Unversty Fall, 2012 Today s Topcs Introducton Problems of Dmensonalty Feature Reducton Statstc methods Prncpal Components

More information

A Unified Framework for Semantics and Feature Based Relevance Feedback in Image Retrieval Systems

A Unified Framework for Semantics and Feature Based Relevance Feedback in Image Retrieval Systems A Unfed Framework for Semantcs and Feature Based Relevance Feedback n Image Retreval Systems Ye Lu *, Chunhu Hu 2, Xngquan Zhu 3*, HongJang Zhang 2, Qang Yang * School of Computng Scence Smon Fraser Unversty

More information

Enhancement of Infrequent Purchased Product Recommendation Using Data Mining Techniques

Enhancement of Infrequent Purchased Product Recommendation Using Data Mining Techniques Enhancement of Infrequent Purchased Product Recommendaton Usng Data Mnng Technques Noraswalza Abdullah, Yue Xu, Shlomo Geva, and Mark Loo Dscplne of Computer Scence Faculty of Scence and Technology Queensland

More information

SVM-based Learning for Multiple Model Estimation

SVM-based Learning for Multiple Model Estimation SVM-based Learnng for Multple Model Estmaton Vladmr Cherkassky and Yunqan Ma Department of Electrcal and Computer Engneerng Unversty of Mnnesota Mnneapols, MN 55455 {cherkass,myq}@ece.umn.edu Abstract:

More information

Today s Outline. Sorting: The Big Picture. Why Sort? Selection Sort: Idea. Insertion Sort: Idea. Sorting Chapter 7 in Weiss.

Today s Outline. Sorting: The Big Picture. Why Sort? Selection Sort: Idea. Insertion Sort: Idea. Sorting Chapter 7 in Weiss. Today s Outlne Sortng Chapter 7 n Wess CSE 26 Data Structures Ruth Anderson Announcements Wrtten Homework #6 due Frday 2/26 at the begnnng of lecture Proect Code due Mon March 1 by 11pm Today s Topcs:

More information

Unsupervised Learning and Clustering

Unsupervised Learning and Clustering Unsupervsed Learnng and Clusterng Why consder unlabeled samples?. Collectng and labelng large set of samples s costly Gettng recorded speech s free, labelng s tme consumng 2. Classfer could be desgned

More information

Content Based Image Retrieval Using 2-D Discrete Wavelet with Texture Feature with Different Classifiers

Content Based Image Retrieval Using 2-D Discrete Wavelet with Texture Feature with Different Classifiers IOSR Journal of Electroncs and Communcaton Engneerng (IOSR-JECE) e-issn: 78-834,p- ISSN: 78-8735.Volume 9, Issue, Ver. IV (Mar - Apr. 04), PP 0-07 Content Based Image Retreval Usng -D Dscrete Wavelet wth

More information

Hierarchical clustering for gene expression data analysis

Hierarchical clustering for gene expression data analysis Herarchcal clusterng for gene expresson data analyss Gorgo Valentn e-mal: valentn@ds.unm.t Clusterng of Mcroarray Data. Clusterng of gene expresson profles (rows) => dscovery of co-regulated and functonally

More information

Kent State University CS 4/ Design and Analysis of Algorithms. Dept. of Math & Computer Science LECT-16. Dynamic Programming

Kent State University CS 4/ Design and Analysis of Algorithms. Dept. of Math & Computer Science LECT-16. Dynamic Programming CS 4/560 Desgn and Analyss of Algorthms Kent State Unversty Dept. of Math & Computer Scence LECT-6 Dynamc Programmng 2 Dynamc Programmng Dynamc Programmng, lke the dvde-and-conquer method, solves problems

More information

NUMERICAL SOLVING OPTIMAL CONTROL PROBLEMS BY THE METHOD OF VARIATIONS

NUMERICAL SOLVING OPTIMAL CONTROL PROBLEMS BY THE METHOD OF VARIATIONS ARPN Journal of Engneerng and Appled Scences 006-017 Asan Research Publshng Network (ARPN). All rghts reserved. NUMERICAL SOLVING OPTIMAL CONTROL PROBLEMS BY THE METHOD OF VARIATIONS Igor Grgoryev, Svetlana

More information