Classification Methods
Aijun An, York University, Canada

INTRODUCTION

Generally speaking, classification is the action of assigning an object to a category according to the characteristics of the object. In data mining, classification refers to the task of analyzing a set of pre-classified data objects to learn a model (or a function) that can be used to classify an unseen data object into one of several predefined classes. A data object, referred to as an example, is described by a set of attributes or variables. One of the attributes describes the class that an example belongs to and is thus called the class attribute or class variable. Other attributes are often called independent or predictor attributes (or variables). The set of examples used to learn the classification model is called the training data set.

Tasks related to classification include regression, which builds a model from training data to predict numerical values, and clustering, which groups examples to form categories. Classification belongs to the category of supervised learning, as distinguished from unsupervised learning: in supervised learning, the training data consists of pairs of input data (typically vectors) and desired outputs, while in unsupervised learning there is no a priori output.

Classification has various applications, such as learning from a patient database to diagnose a disease based on the symptoms of a patient, analyzing credit card transactions to identify fraudulent transactions, automatic recognition of letters or digits based on handwriting samples, and distinguishing highly active compounds from inactive ones based on the structures of compounds for drug discovery.

BACKGROUND

Classification has been studied in statistics and machine learning. In statistics, classification is also referred to as discrimination. Early work on classification focused on discriminant analysis, which constructs a set of discriminant functions, such as linear functions of the predictor variables, based on a set of training examples to discriminate among the groups defined by the class variable.
Modern studies explore more flexible classes of models, such as estimating the joint distribution of the features within each class (e.g., Bayesian classification), classifying an example based on distances in the feature space (e.g., the k-nearest neighbour method), and constructing a classification tree that classifies examples based on tests on one or more predictor variables (i.e., classification tree analysis).

In the field of machine learning, attention has focused more on generating classification expressions that are easily understood by humans. The most popular machine learning technique is decision tree learning, which learns the same tree structure as classification trees but uses different criteria during the learning process; the technique was developed in parallel with classification tree analysis in statistics. Other machine learning techniques include classification rule learning, neural networks, Bayesian classification, instance-based learning, genetic algorithms, the rough set approach, and support vector machines. These techniques mimic human reasoning in different respects to provide insight into the learning process.

The data mining community inherits the classification techniques developed in statistics and machine learning, and applies them to various real-world problems. Most statistical and machine learning algorithms are memory-based, in the sense that the whole training data set is loaded into main memory before learning starts. In data mining, much effort has been spent on scaling up classification algorithms to deal with large data sets. There is also a newer classification technique, called association-based classification, which is based on association rule learning.

MAIN THRUST

The major classification techniques are described below. The techniques differ in the learning mechanism and in the representation of the learned model.

Decision Tree Learning

Decision tree learning is one of the most popular classification methods. It induces a decision tree from data.
A decision tree is a tree-structured prediction model where each internal node denotes a test on an attribute, each outgoing branch represents an outcome of the test, and each leaf node is labeled with a class or a class distribution. A simple decision tree is shown in Figure 1. With a decision tree, an object is classified by following a path from the root to a leaf, taking the edges corresponding to the values of the attributes in the object.

Copyright 2005, Idea Group Inc. Distributing in print or electronic forms without written permission of IGI is prohibited.

Figure 1. A decision tree with tests on attributes X and Y (the internal nodes test Y and X; branches Y=A, Y=B, Y=C and X<1, X>=1 lead to leaves labeled Class 1 or Class 2)

A typical decision tree learning algorithm adopts a top-down, recursive, divide-and-conquer strategy to construct a decision tree. Starting from a root node representing the whole training data, the data is split into two or more subsets based on the values of an attribute chosen according to a splitting criterion. For each subset a child node is created and the subset is associated with the child. The process is then repeated separately on the data in each of the child nodes, and so on, until a termination criterion is satisfied.

Many decision tree learning algorithms exist. They differ mainly in the attribute-selection criteria, such as information gain, gain ratio (Quinlan, 1993) and the Gini index (Breiman, Friedman, Olshen, & Stone, 1984), in the termination criteria, and in the post-pruning strategies. Post-pruning is a technique that removes some branches of the tree after the tree is constructed, to prevent the tree from over-fitting the training data. Representative decision tree algorithms include CART (Breiman et al., 1984) and C4.5 (Quinlan, 1993). There are also studies on fast and scalable construction of decision trees; representative algorithms of this kind include RainForest (Gehrke, Ramakrishnan, & Ganti, 1998) and SPRINT (Shafer, Agrawal, & Mehta, 1996).

Decision Rule Learning

Decision rules are sets of if-then rules. They are the most expressive and human-readable representation of classification models (Mitchell, 1997). An example of a decision rule is "if X<1 and Y=B, then the example belongs to Class 2". Rules of this type are referred to as propositional rules. Rules can be generated by translating a decision tree into a set of rules, one rule for each leaf node in the tree. A second way to generate rules is to learn rules directly from the training data.
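The top-down, divide-and-conquer induction procedure described above can be sketched in a few lines of Python. This is a minimal ID3-style sketch using information gain as the splitting criterion; the toy data set (echoing Figure 1's attributes X and Y) and all function names are illustrative assumptions, not from the original article.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def best_attribute(rows, labels, attrs):
    """Pick the attribute with the highest information gain."""
    def gain(a):
        total = entropy(labels)
        for v in set(r[a] for r in rows):
            sub = [l for r, l in zip(rows, labels) if r[a] == v]
            total -= len(sub) / len(labels) * entropy(sub)
        return total
    return max(attrs, key=gain)

def build_tree(rows, labels, attrs):
    """Top-down, recursive, divide-and-conquer tree induction."""
    if len(set(labels)) == 1:      # pure node: make a leaf
        return labels[0]
    if not attrs:                  # no tests left: majority-class leaf
        return Counter(labels).most_common(1)[0][0]
    a = best_attribute(rows, labels, attrs)
    node = {"test": a, "branches": {}}
    for v in set(r[a] for r in rows):
        idx = [i for i, r in enumerate(rows) if r[a] == v]
        node["branches"][v] = build_tree([rows[i] for i in idx],
                                         [labels[i] for i in idx],
                                         [b for b in attrs if b != a])
    return node

def classify(tree, example):
    """Follow a path from the root to a leaf."""
    while isinstance(tree, dict):
        tree = tree["branches"][example[tree["test"]]]
    return tree

# Toy training data in the spirit of Figure 1 (attributes X and Y).
rows = [{"X": "<1", "Y": "A"}, {"X": ">=1", "Y": "A"},
        {"X": "<1", "Y": "B"}, {"X": ">=1", "Y": "B"},
        {"X": "<1", "Y": "C"}, {"X": ">=1", "Y": "C"}]
labels = ["Class 1", "Class 2", "Class 2", "Class 2", "Class 1", "Class 1"]

tree = build_tree(rows, labels, ["X", "Y"])
print(classify(tree, {"X": "<1", "Y": "B"}))  # -> Class 2
```

Each root-to-leaf path of the resulting tree corresponds to one propositional rule, e.g. "if Y=B then Class 2", which is exactly the tree-to-rules translation mentioned above.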
There is a variety of rule induction algorithms. The algorithms induce rules by searching in a hypothesis space for a hypothesis that best matches the training data. They differ in the search method (e.g., general-to-specific, specific-to-general, or two-way search), in the search heuristics that control the search, and in the pruning method used. The most widespread approach to rule induction is sequential covering, in which a greedy general-to-specific search is conducted to learn a disjunctive set of conjunctive rules. It is called sequential covering because it sequentially learns a set of rules that together cover the set of positive examples for a class. Algorithms belonging to this category include CN2 (Clark & Boswell, 1991), RIPPER (Cohen, 1995) and ELEM2 (An & Cercone, 1998).

Naive Bayesian Classifier

The naive Bayesian classifier is based on Bayes' theorem. Suppose that there are m classes, C_1, C_2, ..., C_m. The classifier predicts an unseen example X as belonging to the class having the highest posterior probability conditioned on X. In other words, X is assigned to class C_i if and only if P(C_i|X) > P(C_j|X) for all 1 <= j <= m, j != i. By Bayes' theorem, we have

P(C_i|X) = P(X|C_i) P(C_i) / P(X).

As P(X) is constant for all classes, only P(X|C_i) P(C_i) needs to be maximized. Given a set of training data, P(C_i) can be estimated by counting how often each class occurs in the training data. To reduce the computational expense of estimating P(X|C_i) for all possible Xs, the classifier makes the naive assumption that the attributes used in describing X are conditionally independent of each other given the class of X. Thus, given the attribute values (x_1, x_2, ..., x_n) that describe X, we have

P(X|C_i) = ∏_{j=1}^{n} P(x_j|C_i).

The probabilities P(x_1|C_i), P(x_2|C_i), ..., P(x_n|C_i) can be estimated from the training data. The naive Bayesian classifier is simple to use and efficient to learn. It requires only one scan of the training data. Despite the fact that the independence assumption is often violated in practice, naive Bayes often competes well with more sophisticated classifiers. Recent theoretical analysis has shown why the naive Bayesian classifier is so robust (Domingos & Pazzani, 1997; Rish, 2001).

Bayesian Belief Networks

A Bayesian belief network, also known as a Bayesian network or a belief network, is a directed acyclic graph whose nodes represent variables and whose arcs represent dependence relations among the variables. If there is an arc from node A to another node B, then A is a parent of B and B is a descendant of A. Each variable is conditionally independent of its nondescendants in the graph, given its parents. The variables may correspond to actual attributes given in the data or to hidden variables believed to form a relationship. A variable in the network can be selected as the class attribute. The classification process can return a probability distribution over the class attribute, based on the network structure and conditional probabilities estimated from the training data, which predicts the probability of each class.

The Bayesian network provides an intermediate approach between naive Bayesian classification and Bayesian classification without any independence assumptions. It describes dependencies among attributes, but allows conditional independence among subsets of attributes. The training of a belief network depends on the scenario. If the network structure is known and the variables are observable, training the network consists only of estimating the conditional probabilities from the training data, which is straightforward. If the network structure is given and some of the variables are hidden, a gradient descent method can be used to train the network (Russell, Binder, Koller, & Kanazawa, 1995). Algorithms also exist for learning the network structure from training data given observable variables (Buntine, 1994; Cooper & Herskovits, 1992; Heckerman, Geiger, & Chickering, 1995).

The k-Nearest Neighbour Classifier

The k-nearest neighbour classifier assigns an unknown example to the most common class among its k nearest neighbours in the training data. It assumes that all the examples correspond to points in an n-dimensional space.
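This neighbourhood rule can be sketched in a few lines of Python. The toy two-dimensional training set and the function name below are illustrative assumptions; distances are Euclidean, as discussed next.

```python
from collections import Counter
from math import dist  # Euclidean distance (Python 3.8+)

def knn_classify(training, query, k=3):
    """Classify `query` by majority vote among its k nearest
    training examples; `training` is a list of (point, label) pairs."""
    neighbours = sorted(training, key=lambda ex: dist(ex[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Toy 2-dimensional training set (all values illustrative).
training = [((0.0, 0.0), "neg"), ((0.2, 0.1), "neg"), ((0.1, 0.3), "neg"),
            ((1.0, 1.0), "pos"), ((0.9, 1.2), "pos"), ((1.1, 0.8), "pos")]

print(knn_classify(training, (0.95, 1.05), k=3))  # -> pos
```

Note that all the work happens inside `knn_classify` at query time; there is no separate training step, which is exactly the lazy-learning behaviour described below.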
A neighbour is deemed nearest if it has the smallest distance, in the Euclidean sense, in the n-dimensional feature space. When k = 1, the unknown example is classified into the class of its closest neighbour in the training set. The k-nearest neighbour method stores all the training examples and postpones learning until a new example needs to be classified. This type of learning is called instance-based or lazy learning.

The k-nearest neighbour classifier is intuitive, easy to implement and effective in practice. It can construct a different approximation to the target function for each new example to be classified, which is advantageous when the target function is very complex but can be described by a collection of less complex local approximations (Mitchell, 1997). However, the cost of classifying new examples can be high, due to the fact that almost all the computation is done at classification time. Refinements to the k-nearest neighbour method include weighting the attributes in the distance computation and weighting the contribution of each of the k neighbours during classification according to their distance from the example to be classified.

Neural Networks

Neural networks, also referred to as artificial neural networks, are studied as a way to simulate the human brain, although brains are much more complex than any artificial neural network developed so far. A neural network is composed of a few layers of interconnected computing units (neurons or nodes). Each unit computes a simple function. The inputs of the units in one layer are the outputs of the units in the previous layer. Each connection between units is associated with a weight. Parallel computing can be performed among the units in each layer. The units in the first layer take the input and are called the input units. The units in the last layer produce the output of the network and are called the output units. When the network is in operation, a value is applied to each input unit, which then passes its given value to the connections leading out from it; on each connection the value is multiplied by the weight associated with that connection.
Each unit in the next layer then receives a value that is the sum of the values produced by the connections leading into it, and in each unit a simple computation is performed on this value; a sigmoid function is typical. This process is repeated, with the results being passed through subsequent layers of nodes, until the output nodes are reached.

Neural networks can be used for both regression and classification. To model a classification function, we can use one output unit per class; an example is then classified into the class corresponding to the output unit with the largest output value. Neural networks differ in the way in which the neurons are connected, in the way the neurons process their input, and in the propagation and learning methods used (Nurnberger, Pedrycz, & Kruse, 2002). Learning in a neural network is usually restricted to modifying the weights based on the training data; the structure of the initial network is usually left unchanged during the learning process. A typical network structure is the multilayer feed-forward neural network, in which none of the connections cycles back to a unit of a previous layer. The most widely used method for training a feed-forward neural network is backpropagation (Rumelhart, Hinton, & Williams, 1986).

Support Vector Machines

The support vector machine (SVM) is a more recently developed technique for multidimensional function approximation. The objective of support vector machines is to determine a classifier or regression function that minimizes the empirical risk (that is, the training set error) and the confidence interval (which corresponds to the generalization or test set error) (Vapnik, 1998).

Given a set of N linearly separable training examples S = {x_i ∈ R^n | i = 1, 2, ..., N}, where each example belongs to one of two classes, represented by y_i ∈ {+1, -1}, the SVM learning method seeks the optimal hyperplane w·x + b = 0 as the decision surface, which separates the positive and negative examples with the largest margin. The decision function for classifying linearly separable data is

f(x) = sign(w·x + b),

where w and b are found from the training set by solving a constrained quadratic optimization problem. The final decision function is

f(x) = sign( Σ_{i=1}^{N} α_i y_i (x_i·x) + b ).

The function depends only on the training examples for which α_i is non-zero. These examples are called support vectors. Often the number of support vectors is only a small fraction of the original data set. The basic SVM formulation can be extended to the nonlinear case by using nonlinear kernels that map the input space into a high-dimensional feature space, in which linear classification can then be performed. The SVM classifier has become very popular due to its high performance in practical applications such as text classification and pattern recognition.

FUTURE TRENDS

Classification is a major data mining task. As data mining becomes more popular, classification techniques are increasingly applied to provide decision support in business, biomedicine, financial analysis, telecommunications and so on.
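Returning to the SVM decision function f(x) = sign(Σ α_i y_i (x_i·x) + b) from the previous section, it can be sketched directly. The support vectors, multipliers, and b below are a hand-worked solution for a tiny two-point training set (giving w = (1, 1), b = -1), not the output of an actual quadratic-programming solver; all names are illustrative.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def svm_decide(support_vectors, x, b):
    """Evaluate f(x) = sign( sum_i alpha_i * y_i * (x_i . x) + b ).
    `support_vectors` is a list of (alpha_i, y_i, x_i) triples."""
    s = sum(alpha * y * dot(xi, x) for alpha, y, xi in support_vectors) + b
    return +1 if s >= 0 else -1

# Hand-worked solution for the training set {((1,1), +1), ((0,0), -1)}:
# w = sum_i alpha_i * y_i * x_i = (1, 1), and b = -1 puts the
# separating hyperplane midway between the two support vectors.
svs = [(1.0, +1, (1.0, 1.0)), (1.0, -1, (0.0, 0.0))]
b = -1.0

print(svm_decide(svs, (2.0, 2.0), b))  # -> 1
print(svm_decide(svs, (0.0, 0.0), b))  # -> -1
```

Here every training example is a support vector only because the training set has just two points; as noted above, on realistic data the support vectors are typically a small fraction of the examples.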
For example, there have been recent applications of classification techniques to identify fraudulent usage of credit cards based on credit card transaction databases, and various classification techniques have been explored to identify highly active compounds for drug discovery. To better solve application-specific problems, there has been a trend toward the development of more application-specific data mining systems (Han & Kamber, 2001).

Traditional classification algorithms assume that the whole training data set can fit into main memory. As automatic data collection becomes daily practice in many businesses, large volumes of data that exceed the memory capacity become available to the learning systems. Scalable classification algorithms become essential. Although some scalable algorithms for decision tree learning have been proposed, there is still a need to develop scalable and efficient algorithms for other types of classification techniques, such as decision rule learning.

Previously, the study of classification techniques focused on exploring various learning mechanisms to improve classification accuracy on unseen examples. However, recent studies on imbalanced data sets have shown that classification accuracy is not an appropriate measure of classification performance when the data set is extremely unbalanced, that is, when almost all the examples belong to one or more larger classes and far fewer examples belong to a smaller, usually more interesting, class. Since many real-world data sets are unbalanced, there has been a trend toward adjusting existing classification algorithms to better identify examples in the rare class.

Another issue that has become more and more important in data mining is privacy protection. As data mining tools are applied to large databases of personal records, privacy concerns are rising. Privacy-preserving data mining is currently one of the hottest research topics in data mining and will remain so in the near future.

CONCLUSION

Classification is a form of data analysis that extracts a model from data to classify future data.
It has been studied in parallel in statistics and machine learning, and is currently a major technique in data mining with a broad application spectrum. Since many application problems can be formulated as classification problems and the volume of available data has become overwhelming, developing scalable, efficient, domain-specific, and privacy-preserving classification algorithms remains essential.
REFERENCES

An, A., & Cercone, N. (1998). ELEM2: A learning system for more accurate classifications. Proceedings of the 12th Canadian Conference on Artificial Intelligence, 426-441.

Breiman, L., Friedman, J., Olshen, R., & Stone, C. (1984). Classification and regression trees. Wadsworth International Group.

Buntine, W.L. (1994). Operations for learning with graphical models. Journal of Artificial Intelligence Research, 2, 159-225.

Castillo, E., Gutiérrez, J.M., & Hadi, A.S. (1997). Expert systems and probabilistic network models. New York: Springer-Verlag.

Clark, P., & Boswell, R. (1991). Rule induction with CN2: Some recent improvements. Proceedings of the 5th European Working Session on Learning, 151-163.

Cohen, W.W. (1995). Fast effective rule induction. Proceedings of the 11th International Conference on Machine Learning, 115-123. Morgan Kaufmann.

Cooper, G., & Herskovits, E. (1992). A Bayesian method for the induction of probabilistic networks from data. Machine Learning, 9, 309-347.

Domingos, P., & Pazzani, M. (1997). On the optimality of the simple Bayesian classifier under zero-one loss. Machine Learning, 29, 103-130.

Gehrke, J., Ramakrishnan, R., & Ganti, V. (1998). RainForest - A framework for fast decision tree construction of large datasets. Proceedings of the 24th International Conference on Very Large Data Bases.

Han, J., & Kamber, M. (2001). Data mining: Concepts and techniques. Morgan Kaufmann.

Heckerman, D., Geiger, D., & Chickering, D.M. (1995). Learning Bayesian networks: The combination of knowledge and statistical data. Machine Learning, 20, 197-243.

Mitchell, T.M. (1997). Machine learning. McGraw-Hill.

Nurnberger, A., Pedrycz, W., & Kruse, R. (2002). Neural network approaches. In Klosgen & Zytkow (Eds.), Handbook of data mining and knowledge discovery. Oxford University Press.

Pearl, J. (1986). Fusion, propagation, and structuring in belief networks. Artificial Intelligence, 29(3), 241-288.

Quinlan, J.R. (1993). C4.5: Programs for machine learning. Morgan Kaufmann.

Rish, I. (2001). An empirical study of the naive Bayes classifier.
Proceedings of the IJCAI 2001 Workshop on Empirical Methods in Artificial Intelligence.

Rumelhart, D.E., Hinton, G.E., & Williams, R.J. (1986). Learning representations by back-propagating errors. Nature, 323, 533-536.

Russell, S., Binder, J., Koller, D., & Kanazawa, K. (1995). Local learning in probabilistic networks with hidden variables. Proceedings of the 14th International Joint Conference on Artificial Intelligence, 2, 1146-1152.

Shafer, J., Agrawal, R., & Mehta, M. (1996). SPRINT: A scalable parallel classifier for data mining. Proceedings of the 22nd International Conference on Very Large Data Bases.

Vapnik, V. (1998). Statistical learning theory. New York: John Wiley & Sons.

KEY TERMS

Backpropagation: A neural network training algorithm for feed-forward networks, in which the errors at the output layer are propagated back to the previous layer to update the connection weights during learning. If the previous layer is not the input layer, then the errors at this hidden layer are propagated back to the layer before.

Disjunctive Set of Conjunctive Rules: A conjunctive rule is a propositional rule whose antecedent consists of a conjunction of attribute-value pairs. A disjunctive set of conjunctive rules consists of a set of conjunctive rules with the same consequent. It is called disjunctive because the rules in the set can be combined into a single disjunctive rule whose antecedent consists of a disjunction of conjunctions.

Genetic Algorithm: An algorithm for optimizing a binary string based on an evolutionary mechanism that uses replication, deletion, and mutation operators carried out over many generations.

Information Gain: Given a set E of classified examples and a partition P = {E_1, ..., E_n} of E, the information gain is defined as

entropy(E) - Σ_{i=1}^{n} (|E_i| / |E|) * entropy(E_i),

where |X| is the number of examples in X, and entropy(X) = -Σ_{j=1}^{m} p_j log_2(p_j) (assuming there are m classes in X and p_j denotes the probability of the jth class in X). Intuitively, the information gain measures the decrease in the weighted average impurity of the partitions E_1, ..., E_n, compared with the impurity of the complete set of examples E.

Machine Learning: The study of computer algorithms that develop new knowledge and improve their performance automatically through past experience.

Rough Set Data Analysis: A method for modeling uncertain information in data by forming lower and upper approximations of a class. It can be used to reduce the feature set and to generate decision rules.

Sigmoid Function: A mathematical function defined by the formula

σ(t) = 1 / (1 + e^(-t)).

Its name is due to the sigmoid shape of its graph. This function is also called the standard logistic function.
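The information gain and sigmoid definitions above can be checked numerically with a short script. The function names and the toy label set are illustrative assumptions; the example splits a balanced two-class set into two pure subsets, which recovers the full 1 bit of entropy as gain.

```python
from math import log2, exp

def entropy(probs):
    """entropy(X) = -sum_j p_j * log2(p_j); terms with p_j = 0 contribute 0."""
    return -sum(p * log2(p) for p in probs if p > 0)

def information_gain(parent, partitions):
    """gain = entropy(E) - sum_i |E_i|/|E| * entropy(E_i);
    `parent` and each partition are lists of class labels."""
    n = len(parent)
    def label_entropy(labels):
        return entropy([labels.count(c) / len(labels) for c in set(labels)])
    return label_entropy(parent) - sum(
        len(part) / n * label_entropy(part) for part in partitions)

def sigmoid(t):
    """The standard logistic function 1 / (1 + e^(-t))."""
    return 1.0 / (1.0 + exp(-t))

# A pure split of a balanced two-class set yields a gain of 1 bit.
E = ["c1", "c1", "c2", "c2"]
print(information_gain(E, [["c1", "c1"], ["c2", "c2"]]))  # -> 1.0
print(sigmoid(0.0))                                       # -> 0.5
```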