CUM: An Efficient Framework for Mining Concept Units

Size: px
Start display at page:

Download "CUM: An Efficient Framework for Mining Concept Units"

Transcription

1 CUM: An Effcent Framework for Mnng Concept Unts P.Santh Thlagam Ananthanarayana V.S Department of Informaton Technology Natonal Insttute of Technology Karnataka - Surathkal Inda santh_soc@yahoo.co.n, anvs@ntk.ac.n Abstract Web s the most mportant repostory of dfferent knds of meda such as text, sound, vdeo, mages etc. Web mnng s the process of applyng data mnng technques to automatcally dscover knowledge from such a dverse, sheer sze data so that t can be more easly browsed, organzed, and catalogued wth mnmal human nterventon. A web ste usually contans a large number of concept enttes, each consstng of one or more web pages connected by hyperlnks. A large porton of web search actvtes ams to locate a set of concept enttes relevant to the user query. The web unt mnng problem s proposed to dscover the concept enttes and classfy these concept enttes nto categores. Web page classfcaton manly assgns one or more concept labels to every web page based on ts own content wthout consderng other neghbourng web pages. The exstng teratve Web Unt Mnng (WUM) algorthms create more than one web unt (ncomplete web unts) from a sngle concept entty. In ths paper, we propose a novel non-teratve web unt mnng algorthm, Concept Unt Mnng (CUM), whch fnds the set of web pages formng each web unt, and assgns the web unt a concept label based on the structure of the web pages so as reduce the later classfcaton errors. Our experments usng the WebKB dataset show that the dsadvantage of WUM algorthms are removed and over all accuracy s sgnfcantly mproved. 1. Introducton The World Wde Web (WWW) s a popular and nteractve medum to dssemnate nformaton today. Web s radcally dfferent from structured databases n schema, volume, topc-coherence [1]. Searchng, comprehendng, and usng the sem structured nformaton stored on the web poses a sgnfcant challenge because ths data s more sophstcated and dynamc than the database systems. Web Mnng s the process of applyng data mnng technques to extract useful nformaton from varous knds of web data such as content data, structure data, and usage data. The Fgure 1 shows the taxonomy of web mnng [2]. Fgure 1: Web mnng taxonomy Internatonal Conference on Management of Data COMAD 2008, Mumba, Inda, December , 2008 Computer Socety of Inda, 2008 Descrbng and organzng ths vast amount of content s essental for realzng the web s full potental as an nformaton resource. Web Classfcaton s mportant 1

2 and ndeed essental for organzng and understandng web content. In web classfcaton research, most efforts are about classfyng ndvdual web pages from one or more web stes nto a set of at categores or categores organzed n a herarchy. Web page classfcaton and web ste classfcaton methods try to classfy the set of web pages and web stes respectvely. However, n realty, web pages from a web ste are often created together wth lnks among them and these lnks carry some semantcs that bnd a set of web pages to form a meanngful nformaton unt. Unfortunately, such an observaton has rarely been consdered by the exstng web classfcaton research. Recently, Sun and Lm [3] proposed the concept of web unt and the web unt mnng problem. Here, the web pages are grouped as unts and classfed. Web unt mnng s dfferent from web page classfcaton as t conssts of both web unt constructon and classfcaton task. Sun et. al [3] proposed teratve Web Unt Mnng (WUM) method whch carres out web unt constructon and web unt classfcaton n an teratve manner. The Iteratve method mght create more than one web unt (ncomplete web unt) from a sngle concept entty and each ncomplete web unt contans ncomplete nformaton about a concept entty. In ths paper, we propose a Concept Unt Mnng (CUM) method explots both structure and content of web stes to mprove the classfcaton accuracy. CUM s not an teratve method thus faster than the WUM method. The rest of the paper s organzed as follows. Secton 2 presents the related work. Secton 3 descrbes the overvew of the problem statement. Secton 4 and secton 5 dscuss the concept unt mnng methodology. Secton 6 presents the expermental study. Secton 7 concludes ths paper. 2. Related Work Searchng nformaton from the web contanng mllons of web pages based on certan prerequste condtons s a very complex task. Ths problem has been a focus of many researchers. Web page classfcaton and web ste classfcaton are ntroduced for ease of searchng and browsng [4,5,6]. Fgure 2 shows the taxonomy of varous exstng web classfcaton algorthms n the lterature. Z.Broder et.al [7] modeled and analyzed the web page classfcaton problem usng lnk nformaton. L.Getoor et.al [8] proposed a unfed probablstc model for both the textual content and the lnk structure of a document collecton. The model s based on the recently ntroduced framework of Probablstc Relatonal Models (PRMs) whch allows us to capture correlatons between lnked documents. M.Craven et.al [6] presented hypertext classfers that combnes a statstcal text learnng method wth a relaton rule learner. J.Furnkranz [9] ntroduced hyperlnk ensembles, a novel type of ensemble classfer for classfyng hypertext documents. H.J.Oh et.al [10] proposed a practcal hypertext categorzaton method bult on bayesan classfer model usng lnks and ncrementally avalable class nformaton. Y.Yang et.al [11] presented a study wth sgnfcant analyss on fve well known text categorzaton methods: Support Vector Machnes (SVM), K-Nearest Neghbor (knn) classfer, a Neural Network (NNet) approach, the Lnear Least Squares Ft (LLSF) mappng and a Naïve Bayes (NB) classfer. L.Terveen et.al [12] ntroduced a novel approach consderng web ste as the basc unt to analyse the web. The mportant factor complcatng the explotaton of structure s the uncertanty of the labels wthn the ste trees. Y.Tan et.al [13] adapted Hdden Markov Models (HMM) to handle the uncertanty wthn the ste classfer. Sun et.al [3] proposed a new problem called the web unt mnng problem, whch s the focus of ths paper. Fgure 2. Taxonomy of Web Classfcaton Algorthms Three methods exst n the lterature to solve the above problem of web unt mnng. They are: a) Baselne method: A straghtforward soluton s to perform web page classfcaton on web pages and group the pages together to construct web unts based on classfcaton results. In ths method, large number of web pages wll be classfed and any mstake made n web page classfcaton wll nevtably cause erroneous web unts to be constructed. b) Baselne wth fragments method: The baselne wth fragments method (denoted by baselne (fragment)) deals wth web fragments nstead of web pages. Ths method only constructs the web unts. It does not classfy the web unts. c) Iteratve Web Unt Mnng (WUM): The web unt mnng has two sub-problems, namely web unt constructon and web unt classfcaton. WUM addresses these two sub-problems n an teratve manner. It frst groups closely-related 2

3 web pages nto small unts (web fragments), whch are then classfed and merged wth one another to form large web unts. Ths classfyngmergng procedure s repeated untl there s no change of category labels assgned to the web unts. The rules of mergng allow a sngle labeled web unt to merge wth neghborng unlabeled web unts, but not labeled web unts. A labeled web unt refers to one that has been assgned a category label; otherwse t s called an unlabeled web unt. Note that those unlabeled web unts are ntermedate results of WUM. They are expected to merge wth neghborng labeled web unts n the next web unt constructon phase. When WUM stops, those remanng unlabeled web unts are dscarded, only labeled web unts are returned. In other words, web unt mnng results are all labeled web unts. Ths s shown as the mprovement over the two baselne methods. The exstence of ncomplete web unts does harmful n usng web unts to other applcatons. One mportant applcaton s to allow search engnes to ndex web nformaton at the web unt level nstead of the web page level. As per the example gven n Fgure 3, the CS100 course entty can be ndexed as a sngle unt (f the CS100 web unt s successful constructed). However, f three ncomplete web unts are created from the CS100 course entty, we need to ndex three unts although all of them represent CS100. It s not effcent for search engnes. Snce each ncomplete web unt contans ncomplete nformaton about CS100, t mght cause search engnes to fal to retreve them for some user queres n the worst case. Fgure 3. Web unt examples: CS100 and Johnson 3. Problem Statement Gven a collecton of web pages and a set of concepts, the web unt mnng problem s to construct web unts from these web pages and assgn them the approprate concepts. Web unt mnng therefore nvolves two man tasks: Fndng the set of web pages that form each web unt, and classfyng the constructed web unts. Web unt of the concept s a web page or a set of web pages from a web ste that jontly provdes nformaton about a concept nstance. A web unt conssts of exactly one key page and zero or more support pages. Key page s a page whch has lnks to the all the support pages havng the supplementary nformaton about the concept. If a web unt conssts of a key page only, we call t a one page web unt and mult-page web unt otherwse. A lnk connectng any two pages from a mult-page web unt s known as an ntra-unt lnk. Two example web unts are llustrated n Fgure 3: a course web unt and a faculty web unt. The CS100 course web unt (refer Fgure 3.a) conssts of 8 pages; the frst page s the course's key page (underlned) and the others provde supplementary course nformaton.e. support pages. Smlarly, among the 6 pages, the frst page n the Johnson faculty web unt (refer Fgure 3.b) s the key page and the remanng pages provde nformaton about research, publcaton, teachng and so on. Every page n a web unt contrbutes a pece of nformaton. Snce a web unt has rcher and more complete content than the ndvdual web pages, so web unt s a more approprate granularty for classfyng, ndexng and organzng web nformaton. 4. Desgn The framework for the overall mnng procedure to solve the web unt mnng problem can be brefly enumerated as follows: 1) Tranng phase: The nput to ths phase s the set of labelled web unts. Ths phase has some set of concept unts for each concept or category. Each unt conssts of one key page as well as zero or more support pages. All the pages are preprocessed and the classfer s traned. 2) Web unt constructon phase: The nput to ths phase s set of test pages. Ths phase takes all the test pages and generates the web drectory. Web drectory s then processed and web unts consstng of one key page as well as zero or more pages wll be generated. 3) Web unt classfcaton phase: The nput to ths phase s all the constructed web unts. Ths phase classfes the generated web unts usng the classfers from the tranng phase. Support vector machne algorthm s used for classfcaton. In all these phases, the web pages have to be preprocessed as per the requrement of classfcaton algorthm. In the proposed concept unt mnng approach, the classfcaton s not an teratve process and the 3

4 method of constructng web unts s dfferent from the exstng WUM method. The overall desgn of the proposed method s shown dagrammatcally n Fgure 4 and explaned n detal n the next subsectons. used here for stemmng. Document count (d) s calculated for each word, whch s the number of documents n whch the word contans. The nverse document frequency (df) of the words s then calculated as follows. Where N s the number of tranng documents. Lst of words and the correspondng nverse document frequency are created whch wll be ndexed and used n the next phases for feature weght calculatons. Fgure 4. Mnng of Concept unts 5. Soluton Methodology Our soluton approach adopts the followng steps to perform concept unt mnng. A dataset consstng of web pages from the four dfferent unverstes s chosen as nput. The steps n the soluton strategy ncludes generaton of word lst of the web pages, preprocessng of all the web pages to extract features, representaton of theses features n the form feature vectors, constructon of web unts usng the feature vectors, classfcaton of the web unts usng SVM classfer and fnally evaluaton of web unts. These steps are explaned n detal n the followng sectons. 5.1 Generaton of word Lst Ths s the frst phase n the soluton approach. The target documents of our applcaton are web pages. The words from the all the tranng documents need to be dentfed properly, n order to enable proper web page feature extracton. In ths step, all the web pages from tranng web page collecton are parsed to get the content, then content s tokensed and transformed to lower case. Some of the words are non nformatve whch wll not contrbute to dstngush the documents, whch are called stop words (ex: s, was, when, etc). These stop words are removed from ths lst. Stemmng s the process of extractng from a word the meanngful part, known as the lemma, and gnorng nflecton, declenson, or plural suffxes,.e. gettng the root of a word. Stemmng s appled after removng the stop words, whch wll reduce the number of words. Porter stemmng [14] algorthm s 5.2 Preprocessng of web pages In ths step, all web pages are scanned and features are extracted. Each concept unt conssts of one key page and zero or more support pages. These web pages have to be preprocessed and features need to be extracted before tranng or testng. Ttle, text, and hyperlnk text are consdered as features n ths paper. Preprocessng of the web page s depcted n Fgure 5. As mentoned n the prevous secton, the web pages are parsed frst to get the ttle, text and hyperlnk text; ths text s tokensed and transformed to the lowercase; stop words are removed; stemmng s appled to get the root words; and these words are ndexed n the wordlst created n the prevous secton and term frequency s calculated. Term frequency s the number of occurrences of that term n the web page. The calculated term frequences are used for calculatng the feature weghts. Preprocessng s used n both tranng as well as testng. Fgure 5. Preprocessng of web page 5.3 Generaton of feature vectors Feature vector s a vector contanng feature ds and correspondng weghts of a web page. These vectors are generated n both tranng as well as testng. In case of tranng, we have a set of labeled web unts where each web unt conssts of one key page and zero or more support pages. The classfer has to be bult for each concept label. The postve tranng examples for a concept C j consst of key pages of the web unts labeled wth concept C j ; the negatve tranng examples nclude the support pages of web unts n C j and both key pages and support pages of the web unts that do not belong to C j. We have now the term frequences (tf) of each term of 4

5 a page and df values. The weght of a term s calculated as follows w = tf * df The constructed web unts are classfed only based on the key pages, so for each key page feature vector s constructed and gven to the classfer n the case of testng. 5.4 Hypertext Categorzaton usng SVM SVM (Support Vector Machne) has been proved to be very effectve n dealng wth hgh-dmensonal feature spaces - the most challengng problem of other machne learnng technques due to the so-called curse of dmensonalty. The basc dea of SVM s to fnd an optmal hyperplane to separate two classes wth the largest margn from pre-classfed data. To construct an optmal hyperplane, SVM uses an teratve tranng algorthm, whch s used to mnmze the error functon. Accordng to the form of the error functon, SVM models can be dvded nto four dstnct groups such as C-SVM classfcaton, nu-svm classfcaton, epslon-svm regresson, and nu-svm regresson [15]. C-SVM s the mostly used SVM method for classfcaton. C-SVM classfcaton s used n ths work also. In rest of the paper SVM means C-SVM only. In SVM, there are four common kernels such as Lnear, Polynomal, Radal Bass Functon (RBF), and Sgmod [15,16]. Support vector machnes use quadratc programmng for fndng the hyperplane. In ths work, Lnear SVM s used, so we consdered only lnear separable case. A more mathematcally rgorous treatment of the geometrc arguments of SVM can be found n [17,18]. Understandng hgher dmensonal s dffcult, so consder the followng example n two-dmensonal spaces [19]. Hypertext s a type of text. SVM has been known to be very effcent n Text Categorzaton (TC) due to a number of followng reasons: Hgh dmensonal nput spaces: Normally, n TC the dmensons of nput spaces are very large, and t s very challengng to other machne learnng technques. However, SVM does not depend on the number of features, and thus t has the potental to handle large feature spaces. Few rrelevant features: In TC, there are very few rrelevant features. Therefore, the uses of feature selecton n other machne learnng approaches to reduce the number of rrelevant features wll also decrease the accuracy of the classfers. In contrast, SVM can handle very large feature space and thus feature selecton s only optonal. Document vectors are sparse: In TC, each document vector has only few entres whch are not zero (.e., sparse). Moreover, there are many theoretcal and emprcal evdence that SVM s well suted for problems wth dense concepts and sparse nstances lke TC. Most TC problems are lnearly separable: In practce, many TC problems are known to be lnearly separable. Therefore, SVM s sutable for these tasks. Ths paper concerns about the hypertext categorzaton whch s a type of text categorzaton, so SVM s best sutable for ths. 5.5 Construct web unts Web unts are generated n ths phase. Ths phase conssts of two steps: 1. Constructon of web drectory and 2. Web unt constructon. Frst, we wll construct the web drectory usng the URLs of the pages and ths wll be gven as nput to the web unt constructon algorthm. Gven a collecton of web pages from a web ste, together wth web folders extracted from them, a web drectory s a tree structure wth web folders as nternal nodes and web pages as leaf nodes [3]. A web folder s usually created to contan a set of closely related web pages. We dentfy sx connecton patterns of a web folder. Fully connected: There exst hyperlnks between any two web pages. Star: There exsts one web page that has hyperlnks pontng to all other web pages. Lnear: There exsts a lnear hyperlnk path to connect all web pages. Herarchcal: There exsts a herarchcal hyperlnk structure to connect all web pages. Isolated: There does not exst any hyperlnk between any two pages. Partally connected: There exst some hyperlnks to connect a few web pages, but cannot fnd a way to connect all web pages. A web folder may satsfy more than one connecton pattern. To make t smple, when analyzng a web folder, we examne connecton patterns n the order of fully connected, star, lnear, herarchcal, solated, and partally connected. If a web folder contans only one web page or satsfes any of the four connecton patterns, namely fully connected, star, lnear, or herarchcal, we consder t as well-connected. In a well-connected web folder there exsts one entry page, these entry pages are very lkely to be the key pages of web unts. Here we wll 5

6 consder web unt and web fragment as same. The web fragment constructon algorthm s shown n Algorthm 1. We create a web drectory and run WebFragGen (root), where root s the root web folder of the web drectory. The web drectory s traversed n a bottom-up manner. A web folder wll be examned only after all ts chld web folders have been examned or t has no chld web folder. If a web folder f has no chld web folder and s well-connected, a canddate web fragment frag s created wth all web pages n f. frag s then checked by a functon named Is_ Fragment, whch determnes whether frag s a web fragment or not. Once Is_ Fragment returns true value, frag s added to the web fragment lst wfl and f s removed from the web drectory. If the currently-vsted web folder f has some chld web folders sub(f), t ndcates that web fragments are not found n sub(f). We frst create a canddate web fragment frag wth all pages n subt (f) and test t. If t fals, we construct another canddate web fragment frag wth all pages n f and test t agan. When constructng a canddate web fragment, we also dentfy one canddate key page based on the connecton pattern of the correspondng web folder and some heurstcs. the web unt and the support pages are not mportant at all. Ths also suggests that the web unt mnng performance s only determned by ts ablty to dentfy and classfy key pages correctly. If α = 1/ u, the key page and support pages of a web unt enjoy equal mportance. Hence, the web unt mnng performance must consder both key and support pages. 5.6 Classfy the web unts Ths s the last step n concept unt mnng. The constructed web unts are classfed based manly on the key page. The feature vector s generated from the key page for each constructed web unt. Ths vector s gven to the SVM classfer, whch wll classfy based on the prevously constructed models. A web unt can be descrbed by two separate sets of features: content features and web ste structure features. The content features refer to the words n the key page, the ttle and the anchor words. Web ste structure features refer those obtaned by studyng how the web unts are organzed n a web ste. Both these features are used n ths classfcaton. 5.7 Web unt evaluaton There are two dffcultes n evaluatng web unt mnng methods. Frstly, web unts are subgraphs and they are dscovered n the process of web unt mnng. Thus t s dffcult and unfar to expect perfect matchng between the labeled web unts and the mned web unts. Secondly, the notons of key and support pages also complcate the performance measurement. Hence, the standard precson and recall measures n text classfcaton cannot be appled drectly. Therefore the author Sun [3] proposed new precson and recall defntons for web unt mnng. To account for the mportance of key pages, a satsfacton varable α s ntroduced to represent the degree of mportance when the key page of a web unt s correctly dentfed. Gven a web unt u, the value of α s n the range [1/ u, 1] where u s the number of web pages n web unt u. If α = 1, the key page completely domnates Let the key page of a web unt u be denoted by u.k and the support pages be denoted by the set u.s. Gven a web unt u constructed by a web unt mnng method, we must frst match t wth an approprate labeled web unt u', also known as the perfect web unt. We defne u' to be the labeled web unt contanng u.k and u' has the same label as u ; u.k can be ether the key page or a support page of u'. The contngency table for matchng a web unt u wth ts perfect web unt u' s shown n Table 1. Each table 6

7 entry represents a set of overlappng web pages between the key/support pages of u and u'. For example. TK = { u. { u. and TS = u. s u. s. The entres n the last column and row carry specal meanngs as follows. FS = u. s ( u. s { u. ). NK = { u. { u. u. s. NS = u. s { u. u. s. Note that TK + KS + NK = 1 and TK + SK = 1. If the perfect web unt for u does not exst, u s consdered nvald and wll be assgned zero precson and recall values. Otherwse, the precson and recall of a web unt, u, are defned as follows. Table 1: Contngency table for web unt u 6. Expermental Study All the experments are performed on a Pentum-IV processor, 2.53 GHz CPU and 512 MB RAM. The operatng system envronment s Fedora Core 3. We used java for mplementng the proposed approach. Real World Wde Knowledge Base (WebKb) [21] dataset s used for testng the performance of our approach. Ths dataset conssts of web pages from four dfferent unverstes Cornell, Texas, Washngton and Wsconsn. All the web pages are from computer scence department of the respectve unverstes. As there are no exstng labeled web unt datasets for our experments, we used UntSet dataset prepared for teratve web unt mnng [3] from WebKB dataset. WebKB data set conssts of 4159 web pages collected from four unverstes were manually classfed nto 7 categores: student, faculty, staff, department, course, project and other. The other s a specal category for pages that not assgned as the man pages n the last sx categores. The UntSet data set conssts of web unts grouped and labeled manually. The pages from frst categores were used as key pages of web unts whle the majorty of pages from other category were used as support pages of the correspondng web unts. Only four concepts student, faculty, course and project were expermented as the number of web unts of other concepts are small ( 20) smlar to the WUM. The web unt statstcs are shown n Table 2 where u and p refer to the number of web unts and pages respectvely.. α. TK + (1 α ). TS Pru = α + (1 α ).( KS + TS + FS ) Table 2: UntSet web unt dstrbuton α. TK + (1 α ). TS Re u = α + (1 α ).( SK + TS + NS ) Suppose M web unts are constructed and assgned wth concept C j, and N web unts are manually labeled wth C j, the precson and recall of the concept, C j, denoted by Prc j and Rec j, are defned as follows. Pr c j Re c j = = u C j M u C j N Pr u Re From the above defnton, the macro/mcro average precson and recall for a set of concepts can be easly derved [20]. u We used three unverstes for tranng and one unversty for testng. UntSet dataset s used for tranng the web unt classfer. Results of WUM method are used for comparng the CUM method. The precson (Pr), recall (Re), and F1 value for each category of these methods are gven. Macro and Mcro averages are also presented. In Table 3 and Table 4 results of concept unt mnng method for α=1, α=0.5 are presented. As α = 1/ u ndcates the key page and support pages of a web unt have the equal mportance, we presented the results of both WUM and CUM methods for α = 1/ u n Table 5. Web unts are extracted wth reasonable accuraces usng WUM although the performance of dfferent 7

8 categores vares. The low precson of WUM s manly due to the exstence of ncomplete web unts. In summary, web ste structure and key page content are the two major factors that affect the performance of WUM. The WUM results could be regarded as a baselne results to evaluate the performance of the proposed method. In Concept unt mnng, web unts are constructed usng the hyperlnks. In ths approach, the ncomplete web unt problem s handled by reducng the possble classfcaton errors. Its nfluence on the web unt mnng performances s mproved for some categores. The mproved precson scores are manly because the number of ncomplete web unts have been greatly reduced through a more effectve web unt constructon step. The reduced recall scores are manly because CUM faled to dentfy some low qualty web unts. Table 3: Web unt mnng results (α = 1) Macro/Mcro average values of both the methods are presented n Fgure 7. Values MaPr MaRe MaF1 MPr MRe MF1 Macro/Mcro measures Fgure 7: Macro/Mcro-averaged results of the two methods WUM CUM 7. Concluson and Future work Table 4: Web unt mnng results (α = 0.5) Table 5: Web unt mnng results (α =1/ u ) From the Table 5, t s clear that the performance of Concept unt mnng method s good for some concepts but the overall Macro/Mcro average values are showng mprovement over WUM method. The comparson of Present days major source of nformaton s the web. The amount of data avalable onlne s ncreasng. Web unt mnng ams to dscover the set of web pages that represent a sngle concept entty from a web ste. Mned web unts are usually objects of nterest for people who search n that web ste, e.g. courses, professors, or students enttes from a unversty web ste. As a result, a major applcaton of web unts s to ncorporate them nto search engnes so as to support better ndexng and searchng. An exstng web unt-mnng algorthm, WUM, creates ncomplete web unts, wth each coverng only a subset of web pages of a sngle concept entty. The exstence of ncomplete web unts does harmful to usng web unts to ndex web nformaton. In order to address ths problem, Concept unt mnng method s proposed. Experments wth the UnKB dataset show that a large porton of ncomplete web unts removed and web unt mnng performance s thus mproved. CUM s not an teratve process thus reduces the tme requred to mne the concept unts. Web unt mnng s a new research feld. There s stll much room for further mprovement. For example, as exstng web unt mnng algorthms heavly rely on web ste structure to construct web unts and key page content to classfy web unts, we need to conduct experments wth other web stes. More research efforts are requred to mprove web unt mnng performance on those web stes not well-structured and on those web unts wth key pages of lttle content. In another drecton of our future work, focus on utlzng mned web unts to enhance web nformaton retreval and explorng ways for modellng web unts and develop web unt-based rankng strateges. 8

9 8. References 1. Da Costa, M.G., Jr.; Zhguo Gong, Web structure mnng: an ntroducton, IEEE Internatonal Conference on Informaton Acquston, pp. 6, July Jadeep srvastava, prasanna Deskan, vpn kumar, Web mnng accomplshment and future drectons, In Natonal scence foundaton workshop 2001, pp Sun A. and E.-P. Lm, Web unt mnng: fndng and classfyng sub graphs of web pages, n Proceedngs of the twelfth nternatonal conference on Informaton and knowledge management, S. Chakrabart, B. Dom, and P. Indyk, Enhanced hypertext categorzaton usng hyperlnks, n Proceedngs of the ACM SIGMOD nternatonal conference on Management of Data, 1998, pp S. T. Dumas and H. Chen, Herarchcal classfcaton of Web content, In Proc. of ACM SIGIR, pages , Athens, Greece, M. Craven and S. Slattery. Relatonal learnng wth statstcal predcate nventon: Better models for hypertext, Journal of Machne Learnng, vol. 43, no. 1-2, pp , A. Z. Broder, R. Krauthgamer, and M. Mtzenmacher, Improved classfcaton va connectvty nformaton, In Proc. of 11th ACM-SIAM Sym. on Dscrete Algo., pages , L. Getoor, E. Segal, B. Taskar, and D. Koller. Probablstc models of text and lnk structure for hypertext classfcaton, In Proc. of Intl Jont Conf. on Artfcal Intellgence Workshop on Text Learnng: beyond Supervson, Seattle, WA, J. Furnkranz. Hyperlnk ensembles: A case study n hypertext classfcaton, Journal of Informaton Fuson, vol. 1, pp , H.J. Oh, S. H. Myaeng, and M.H. Lee. A practcal hypertext categorzaton method usng lnks and ncrementally avalable class nformaton, n Proceedngs of the 23rd annual nternatonal ACM SIGIR conference on Research and development n nformaton retreval, 2000, pp Y. Yang and X. Lu. A re-examnaton of text categorzaton methods, n Proceedngs of the 22nd Annual Internatonal ACM SIGIR Conference on Research and Development n Informaton Retreval, 1999, pp L. Terveen, W. Hll, and B. Amento. Constructng, organzng, and vsualzng collectons of topcally related web resources, ACM Transactons on Computer-Human Interacton, vol. 6, no. 1, pp , Y. Tan, T. Huang, W. Gao, J. Cheng, and P. Kang. Two-phase web ste classfcaton based on hdden markov tree models, n proceedngs of IEEE/WIC Web Intellgence, October 13-17, M.F Porter, An algorthm for suffx strppng, program, 14(3): , C.J.C. Burges, A tutoral on support vector machnes for pattern recognton. Data Mnng and Knowledge Dscovery, 2(2): , Smola, A. and Schölkopf, B. A tutoral on support vector regresson, Techncal Report NC2-TR NC2-TR , NeuroCOLT 2, Bennett K. and Bredenstener E., Dualty and Geometry n SVMs, In P. Langley edtor, Proc. Of 17 th Internatonal Conference on Machne Learnng, Morgan Kaufmann, San Francsco, 65-72, Crsp D. and Burges C., A geometrc nterpretaton of v-svm classfers, Advances n Neural Informaton Processng Systems, 12 ed. S.A. Solla, T.K Leen and K.R. Muller, MIT Press, Shuonan Dong, Support Vector Machnes Appled to Handwrtten Numerals Recognto., machne learnng, Sun A and E.-P. Lm. Herarchcal text classfcaton and evaluaton., In Proc. of the 1st IEEE Int. Conf. on Data Mnng, pages , Calforna, USA, Nov WebKb data from ject/theo-20/www/data/, feb

Term Weighting Classification System Using the Chi-square Statistic for the Classification Subtask at NTCIR-6 Patent Retrieval Task

Term Weighting Classification System Using the Chi-square Statistic for the Classification Subtask at NTCIR-6 Patent Retrieval Task Proceedngs of NTCIR-6 Workshop Meetng, May 15-18, 2007, Tokyo, Japan Term Weghtng Classfcaton System Usng the Ch-square Statstc for the Classfcaton Subtask at NTCIR-6 Patent Retreval Task Kotaro Hashmoto

More information

CUM : An Efficient Framework for Mining Concept Units

CUM : An Efficient Framework for Mining Concept Units COMAD 2008 CUM : An Efficient Framework for Mining Concept Units By Mrs. Santhi Thilagam Department of Computer Engineering NITK - Surathkal Outline Introduction Problem statement Design and solution methodology

More information

The Research of Support Vector Machine in Agricultural Data Classification

The Research of Support Vector Machine in Agricultural Data Classification The Research of Support Vector Machne n Agrcultural Data Classfcaton Le Sh, Qguo Duan, Xnmng Ma, Me Weng College of Informaton and Management Scence, HeNan Agrcultural Unversty, Zhengzhou 45000 Chna Zhengzhou

More information

Support Vector Machines

Support Vector Machines /9/207 MIST.6060 Busness Intellgence and Data Mnng What are Support Vector Machnes? Support Vector Machnes Support Vector Machnes (SVMs) are supervsed learnng technques that analyze data and recognze patterns.

More information

Learning the Kernel Parameters in Kernel Minimum Distance Classifier

Learning the Kernel Parameters in Kernel Minimum Distance Classifier Learnng the Kernel Parameters n Kernel Mnmum Dstance Classfer Daoqang Zhang 1,, Songcan Chen and Zh-Hua Zhou 1* 1 Natonal Laboratory for Novel Software Technology Nanjng Unversty, Nanjng 193, Chna Department

More information

Support Vector Machines

Support Vector Machines Support Vector Machnes Decson surface s a hyperplane (lne n 2D) n feature space (smlar to the Perceptron) Arguably, the most mportant recent dscovery n machne learnng In a nutshell: map the data to a predetermned

More information

Query Clustering Using a Hybrid Query Similarity Measure

Query Clustering Using a Hybrid Query Similarity Measure Query clusterng usng a hybrd query smlarty measure Fu. L., Goh, D.H., & Foo, S. (2004). WSEAS Transacton on Computers, 3(3), 700-705. Query Clusterng Usng a Hybrd Query Smlarty Measure Ln Fu, Don Hoe-Lan

More information

Classifier Selection Based on Data Complexity Measures *

Classifier Selection Based on Data Complexity Measures * Classfer Selecton Based on Data Complexty Measures * Edth Hernández-Reyes, J.A. Carrasco-Ochoa, and J.Fco. Martínez-Trndad Natonal Insttute for Astrophyscs, Optcs and Electroncs, Lus Enrque Erro No.1 Sta.

More information

An Optimal Algorithm for Prufer Codes *

An Optimal Algorithm for Prufer Codes * J. Software Engneerng & Applcatons, 2009, 2: 111-115 do:10.4236/jsea.2009.22016 Publshed Onlne July 2009 (www.scrp.org/journal/jsea) An Optmal Algorthm for Prufer Codes * Xaodong Wang 1, 2, Le Wang 3,

More information

Cluster Analysis of Electrical Behavior

Cluster Analysis of Electrical Behavior Journal of Computer and Communcatons, 205, 3, 88-93 Publshed Onlne May 205 n ScRes. http://www.scrp.org/ournal/cc http://dx.do.org/0.4236/cc.205.350 Cluster Analyss of Electrcal Behavor Ln Lu Ln Lu, School

More information

Course Introduction. Algorithm 8/31/2017. COSC 320 Advanced Data Structures and Algorithms. COSC 320 Advanced Data Structures and Algorithms

Course Introduction. Algorithm 8/31/2017. COSC 320 Advanced Data Structures and Algorithms. COSC 320 Advanced Data Structures and Algorithms Course Introducton Course Topcs Exams, abs, Proects A quc loo at a few algorthms 1 Advanced Data Structures and Algorthms Descrpton: We are gong to dscuss algorthm complexty analyss, algorthm desgn technques

More information

Subspace clustering. Clustering. Fundamental to all clustering techniques is the choice of distance measure between data points;

Subspace clustering. Clustering. Fundamental to all clustering techniques is the choice of distance measure between data points; Subspace clusterng Clusterng Fundamental to all clusterng technques s the choce of dstance measure between data ponts; D q ( ) ( ) 2 x x = x x, j k = 1 k jk Squared Eucldean dstance Assumpton: All features

More information

FINDING IMPORTANT NODES IN SOCIAL NETWORKS BASED ON MODIFIED PAGERANK

FINDING IMPORTANT NODES IN SOCIAL NETWORKS BASED ON MODIFIED PAGERANK FINDING IMPORTANT NODES IN SOCIAL NETWORKS BASED ON MODIFIED PAGERANK L-qng Qu, Yong-quan Lang 2, Jng-Chen 3, 2 College of Informaton Scence and Technology, Shandong Unversty of Scence and Technology,

More information

Classifying Acoustic Transient Signals Using Artificial Intelligence

Classifying Acoustic Transient Signals Using Artificial Intelligence Classfyng Acoustc Transent Sgnals Usng Artfcal Intellgence Steve Sutton, Unversty of North Carolna At Wlmngton (suttons@charter.net) Greg Huff, Unversty of North Carolna At Wlmngton (jgh7476@uncwl.edu)

More information

Deep Classification in Large-scale Text Hierarchies

Deep Classification in Large-scale Text Hierarchies Deep Classfcaton n Large-scale Text Herarches Gu-Rong Xue Dkan Xng Qang Yang 2 Yong Yu Dept. of Computer Scence and Engneerng Shangha Jao-Tong Unversty {grxue, dkxng, yyu}@apex.sjtu.edu.cn 2 Hong Kong

More information

6.854 Advanced Algorithms Petar Maymounkov Problem Set 11 (November 23, 2005) With: Benjamin Rossman, Oren Weimann, and Pouya Kheradpour

6.854 Advanced Algorithms Petar Maymounkov Problem Set 11 (November 23, 2005) With: Benjamin Rossman, Oren Weimann, and Pouya Kheradpour 6.854 Advanced Algorthms Petar Maymounkov Problem Set 11 (November 23, 2005) Wth: Benjamn Rossman, Oren Wemann, and Pouya Kheradpour Problem 1. We reduce vertex cover to MAX-SAT wth weghts, such that the

More information

Outline. Discriminative classifiers for image recognition. Where in the World? A nearest neighbor recognition example 4/14/2011. CS 376 Lecture 22 1

Outline. Discriminative classifiers for image recognition. Where in the World? A nearest neighbor recognition example 4/14/2011. CS 376 Lecture 22 1 4/14/011 Outlne Dscrmnatve classfers for mage recognton Wednesday, Aprl 13 Krsten Grauman UT-Austn Last tme: wndow-based generc obect detecton basc ppelne face detecton wth boostng as case study Today:

More information

Performance Evaluation of Information Retrieval Systems

Performance Evaluation of Information Retrieval Systems Why System Evaluaton? Performance Evaluaton of Informaton Retreval Systems Many sldes n ths secton are adapted from Prof. Joydeep Ghosh (UT ECE) who n turn adapted them from Prof. Dk Lee (Unv. of Scence

More information

A Fast Content-Based Multimedia Retrieval Technique Using Compressed Data

A Fast Content-Based Multimedia Retrieval Technique Using Compressed Data A Fast Content-Based Multmeda Retreval Technque Usng Compressed Data Borko Furht and Pornvt Saksobhavvat NSF Multmeda Laboratory Florda Atlantc Unversty, Boca Raton, Florda 3343 ABSTRACT In ths paper,

More information

CAN COMPUTERS LEARN FASTER? Seyda Ertekin Computer Science & Engineering The Pennsylvania State University

CAN COMPUTERS LEARN FASTER? Seyda Ertekin Computer Science & Engineering The Pennsylvania State University CAN COMPUTERS LEARN FASTER? Seyda Ertekn Computer Scence & Engneerng The Pennsylvana State Unversty sertekn@cse.psu.edu ABSTRACT Ever snce computers were nvented, manknd wondered whether they mght be made

More information

Incremental Learning with Support Vector Machines and Fuzzy Set Theory

Incremental Learning with Support Vector Machines and Fuzzy Set Theory The 25th Workshop on Combnatoral Mathematcs and Computaton Theory Incremental Learnng wth Support Vector Machnes and Fuzzy Set Theory Yu-Mng Chuang 1 and Cha-Hwa Ln 2* 1 Department of Computer Scence and

More information

An Entropy-Based Approach to Integrated Information Needs Assessment

An Entropy-Based Approach to Integrated Information Needs Assessment Dstrbuton Statement A: Approved for publc release; dstrbuton s unlmted. An Entropy-Based Approach to ntegrated nformaton Needs Assessment June 8, 2004 Wllam J. Farrell Lockheed Martn Advanced Technology

More information

Edge Detection in Noisy Images Using the Support Vector Machines

Edge Detection in Noisy Images Using the Support Vector Machines Edge Detecton n Nosy Images Usng the Support Vector Machnes Hlaro Gómez-Moreno, Saturnno Maldonado-Bascón, Francsco López-Ferreras Sgnal Theory and Communcatons Department. Unversty of Alcalá Crta. Madrd-Barcelona

More information

A Unified Framework for Semantics and Feature Based Relevance Feedback in Image Retrieval Systems

A Unified Framework for Semantics and Feature Based Relevance Feedback in Image Retrieval Systems A Unfed Framework for Semantcs and Feature Based Relevance Feedback n Image Retreval Systems Ye Lu *, Chunhu Hu 2, Xngquan Zhu 3*, HongJang Zhang 2, Qang Yang * School of Computng Scence Smon Fraser Unversty

More information

UB at GeoCLEF Department of Geography Abstract

UB at GeoCLEF Department of Geography   Abstract UB at GeoCLEF 2006 Mguel E. Ruz (1), Stuart Shapro (2), June Abbas (1), Slva B. Southwck (1) and Davd Mark (3) State Unversty of New York at Buffalo (1) Department of Lbrary and Informaton Studes (2) Department

More information

Machine Learning 9. week

Machine Learning 9. week Machne Learnng 9. week Mappng Concept Radal Bass Functons (RBF) RBF Networks 1 Mappng It s probably the best scenaro for the classfcaton of two dataset s to separate them lnearly. As you see n the below

More information

Face Recognition Based on SVM and 2DPCA

Face Recognition Based on SVM and 2DPCA Vol. 4, o. 3, September, 2011 Face Recognton Based on SVM and 2DPCA Tha Hoang Le, Len Bu Faculty of Informaton Technology, HCMC Unversty of Scence Faculty of Informaton Scences and Engneerng, Unversty

More information

Announcements. Supervised Learning

Announcements. Supervised Learning Announcements See Chapter 5 of Duda, Hart, and Stork. Tutoral by Burge lnked to on web page. Supervsed Learnng Classfcaton wth labeled eamples. Images vectors n hgh-d space. Supervsed Learnng Labeled eamples

More information

The Greedy Method. Outline and Reading. Change Money Problem. Greedy Algorithms. Applications of the Greedy Strategy. The Greedy Method Technique

The Greedy Method. Outline and Reading. Change Money Problem. Greedy Algorithms. Applications of the Greedy Strategy. The Greedy Method Technique //00 :0 AM Outlne and Readng The Greedy Method The Greedy Method Technque (secton.) Fractonal Knapsack Problem (secton..) Task Schedulng (secton..) Mnmum Spannng Trees (secton.) Change Money Problem Greedy

More information

BOOSTING CLASSIFICATION ACCURACY WITH SAMPLES CHOSEN FROM A VALIDATION SET

BOOSTING CLASSIFICATION ACCURACY WITH SAMPLES CHOSEN FROM A VALIDATION SET 1 BOOSTING CLASSIFICATION ACCURACY WITH SAMPLES CHOSEN FROM A VALIDATION SET TZU-CHENG CHUANG School of Electrcal and Computer Engneerng, Purdue Unversty, West Lafayette, Indana 47907 SAUL B. GELFAND School

More information

Skew Angle Estimation and Correction of Hand Written, Textual and Large areas of Non-Textual Document Images: A Novel Approach

Skew Angle Estimation and Correction of Hand Written, Textual and Large areas of Non-Textual Document Images: A Novel Approach Angle Estmaton and Correcton of Hand Wrtten, Textual and Large areas of Non-Textual Document Images: A Novel Approach D.R.Ramesh Babu Pyush M Kumat Mahesh D Dhannawat PES Insttute of Technology Research

More information

Content Based Image Retrieval Using 2-D Discrete Wavelet with Texture Feature with Different Classifiers

Content Based Image Retrieval Using 2-D Discrete Wavelet with Texture Feature with Different Classifiers IOSR Journal of Electroncs and Communcaton Engneerng (IOSR-JECE) e-issn: 78-834,p- ISSN: 78-8735.Volume 9, Issue, Ver. IV (Mar - Apr. 04), PP 0-07 Content Based Image Retreval Usng -D Dscrete Wavelet wth

More information

Lecture 5: Multilayer Perceptrons

Lecture 5: Multilayer Perceptrons Lecture 5: Multlayer Perceptrons Roger Grosse 1 Introducton So far, we ve only talked about lnear models: lnear regresson and lnear bnary classfers. We noted that there are functons that can t be represented

More information

Keywords - Wep page classification; bag of words model; topic model; hierarchical classification; Support Vector Machines

Keywords - Wep page classification; bag of words model; topic model; hierarchical classification; Support Vector Machines (IJCSIS) Internatonal Journal of Computer Scence and Informaton Securty, Herarchcal Web Page Classfcaton Based on a Topc Model and Neghborng Pages Integraton Wongkot Srura Phayung Meesad Choochart Haruechayasak

More information

Machine Learning. Support Vector Machines. (contains material adapted from talks by Constantin F. Aliferis & Ioannis Tsamardinos, and Martin Law)

Machine Learning. Support Vector Machines. (contains material adapted from talks by Constantin F. Aliferis & Ioannis Tsamardinos, and Martin Law) Machne Learnng Support Vector Machnes (contans materal adapted from talks by Constantn F. Alfers & Ioanns Tsamardnos, and Martn Law) Bryan Pardo, Machne Learnng: EECS 349 Fall 2014 Support Vector Machnes

More information

BAYESIAN MULTI-SOURCE DOMAIN ADAPTATION

BAYESIAN MULTI-SOURCE DOMAIN ADAPTATION BAYESIAN MULTI-SOURCE DOMAIN ADAPTATION SHI-LIANG SUN, HONG-LEI SHI Department of Computer Scence and Technology, East Chna Normal Unversty 500 Dongchuan Road, Shangha 200241, P. R. Chna E-MAIL: slsun@cs.ecnu.edu.cn,

More information

12/2/2009. Announcements. Parametric / Non-parametric. Case-Based Reasoning. Nearest-Neighbor on Images. Nearest-Neighbor Classification

12/2/2009. Announcements. Parametric / Non-parametric. Case-Based Reasoning. Nearest-Neighbor on Images. Nearest-Neighbor Classification Introducton to Artfcal Intellgence V22.0472-001 Fall 2009 Lecture 24: Nearest-Neghbors & Support Vector Machnes Rob Fergus Dept of Computer Scence, Courant Insttute, NYU Sldes from Danel Yeung, John DeNero

More information

Feature Reduction and Selection

Feature Reduction and Selection Feature Reducton and Selecton Dr. Shuang LIANG School of Software Engneerng TongJ Unversty Fall, 2012 Today s Topcs Introducton Problems of Dmensonalty Feature Reducton Statstc methods Prncpal Components

More information

Load Balancing for Hex-Cell Interconnection Network

Load Balancing for Hex-Cell Interconnection Network Int. J. Communcatons, Network and System Scences,,, - Publshed Onlne Aprl n ScRes. http://www.scrp.org/journal/jcns http://dx.do.org/./jcns.. Load Balancng for Hex-Cell Interconnecton Network Saher Manaseer,

More information

FEATURE EXTRACTION. Dr. K.Vijayarekha. Associate Dean School of Electrical and Electronics Engineering SASTRA University, Thanjavur

FEATURE EXTRACTION. Dr. K.Vijayarekha. Associate Dean School of Electrical and Electronics Engineering SASTRA University, Thanjavur FEATURE EXTRACTION Dr. K.Vjayarekha Assocate Dean School of Electrcal and Electroncs Engneerng SASTRA Unversty, Thanjavur613 41 Jont Intatve of IITs and IISc Funded by MHRD Page 1 of 8 Table of Contents

More information

Parallelism for Nested Loops with Non-uniform and Flow Dependences

Parallelism for Nested Loops with Non-uniform and Flow Dependences Parallelsm for Nested Loops wth Non-unform and Flow Dependences Sam-Jn Jeong Dept. of Informaton & Communcaton Engneerng, Cheonan Unversty, 5, Anseo-dong, Cheonan, Chungnam, 330-80, Korea. seong@cheonan.ac.kr

More information

Optimizing Document Scoring for Query Retrieval

Optimizing Document Scoring for Query Retrieval Optmzng Document Scorng for Query Retreval Brent Ellwen baellwe@cs.stanford.edu Abstract The goal of ths project was to automate the process of tunng a document query engne. Specfcally, I used machne learnng

More information

Experiments in Text Categorization Using Term Selection by Distance to Transition Point

Experiments in Text Categorization Using Term Selection by Distance to Transition Point Experments n Text Categorzaton Usng Term Selecton by Dstance to Transton Pont Edgar Moyotl-Hernández, Héctor Jménez-Salazar Facultad de Cencas de la Computacón, B. Unversdad Autónoma de Puebla, 14 Sur

More information

Available online at Available online at Advanced in Control Engineering and Information Science

Available online at   Available online at   Advanced in Control Engineering and Information Science Avalable onlne at wwwscencedrectcom Avalable onlne at wwwscencedrectcom Proceda Proceda Engneerng Engneerng 00 (2011) 15000 000 (2011) 1642 1646 Proceda Engneerng wwwelsevercom/locate/proceda Advanced

More information

A CALCULATION METHOD OF DEEP WEB ENTITIES RECOGNITION

A CALCULATION METHOD OF DEEP WEB ENTITIES RECOGNITION A CALCULATION METHOD OF DEEP WEB ENTITIES RECOGNITION 1 FENG YONG, DANG XIAO-WAN, 3 XU HONG-YAN School of Informaton, Laonng Unversty, Shenyang Laonng E-mal: 1 fyxuhy@163.com, dangxaowan@163.com, 3 xuhongyan_lndx@163.com

More information

MULTISPECTRAL IMAGES CLASSIFICATION BASED ON KLT AND ATR AUTOMATIC TARGET RECOGNITION

MULTISPECTRAL IMAGES CLASSIFICATION BASED ON KLT AND ATR AUTOMATIC TARGET RECOGNITION MULTISPECTRAL IMAGES CLASSIFICATION BASED ON KLT AND ATR AUTOMATIC TARGET RECOGNITION Paulo Quntlano 1 & Antono Santa-Rosa 1 Federal Polce Department, Brasla, Brazl. E-mals: quntlano.pqs@dpf.gov.br and

More information

A Novel Term_Class Relevance Measure for Text Categorization

A Novel Term_Class Relevance Measure for Text Categorization A Novel Term_Class Relevance Measure for Text Categorzaton D S Guru, Mahamad Suhl Department of Studes n Computer Scence, Unversty of Mysore, Mysore, Inda Abstract: In ths paper, we ntroduce a new measure

More information

Compiler Design. Spring Register Allocation. Sample Exercises and Solutions. Prof. Pedro C. Diniz

Compiler Design. Spring Register Allocation. Sample Exercises and Solutions. Prof. Pedro C. Diniz Compler Desgn Sprng 2014 Regster Allocaton Sample Exercses and Solutons Prof. Pedro C. Dnz USC / Informaton Scences Insttute 4676 Admralty Way, Sute 1001 Marna del Rey, Calforna 90292 pedro@s.edu Regster

More information

Learning to Classify Documents with Only a Small Positive Training Set

Learning to Classify Documents with Only a Small Positive Training Set Learnng to Classfy Documents wth Only a Small Postve Tranng Set Xao-L L 1, Bng Lu 2, and See-Kong Ng 1 1 Insttute for Infocomm Research, Heng Mu Keng Terrace, 119613, Sngapore 2 Department of Computer

More information

An Anti-Noise Text Categorization Method based on Support Vector Machines *

An Anti-Noise Text Categorization Method based on Support Vector Machines * An Ant-Nose Text ategorzaton Method based on Support Vector Machnes * hen Ln, Huang Je and Gong Zheng-Hu School of omputer Scence, Natonal Unversty of Defense Technology, hangsha, 410073, hna chenln@nudt.edu.cn,

More information

Tsinghua University at TAC 2009: Summarizing Multi-documents by Information Distance

Tsinghua University at TAC 2009: Summarizing Multi-documents by Information Distance Tsnghua Unversty at TAC 2009: Summarzng Mult-documents by Informaton Dstance Chong Long, Mnle Huang, Xaoyan Zhu State Key Laboratory of Intellgent Technology and Systems, Tsnghua Natonal Laboratory for

More information

Pruning Training Corpus to Speedup Text Classification 1

Pruning Training Corpus to Speedup Text Classification 1 Prunng Tranng Corpus to Speedup Text Classfcaton Jhong Guan and Shugeng Zhou School of Computer Scence, Wuhan Unversty, Wuhan, 430079, Chna hguan@wtusm.edu.cn State Key Lab of Software Engneerng, Wuhan

More information

Detection of an Object by using Principal Component Analysis

Detection of an Object by using Principal Component Analysis Detecton of an Object by usng Prncpal Component Analyss 1. G. Nagaven, 2. Dr. T. Sreenvasulu Reddy 1. M.Tech, Department of EEE, SVUCE, Trupath, Inda. 2. Assoc. Professor, Department of ECE, SVUCE, Trupath,

More information

TN348: Openlab Module - Colocalization

TN348: Openlab Module - Colocalization TN348: Openlab Module - Colocalzaton Topc The Colocalzaton module provdes the faclty to vsualze and quantfy colocalzaton between pars of mages. The Colocalzaton wndow contans a prevew of the two mages

More information

An Application of the Dulmage-Mendelsohn Decomposition to Sparse Null Space Bases of Full Row Rank Matrices

An Application of the Dulmage-Mendelsohn Decomposition to Sparse Null Space Bases of Full Row Rank Matrices Internatonal Mathematcal Forum, Vol 7, 2012, no 52, 2549-2554 An Applcaton of the Dulmage-Mendelsohn Decomposton to Sparse Null Space Bases of Full Row Rank Matrces Mostafa Khorramzadeh Department of Mathematcal

More information

Using Neural Networks and Support Vector Machines in Data Mining

Using Neural Networks and Support Vector Machines in Data Mining Usng eural etworks and Support Vector Machnes n Data Mnng RICHARD A. WASIOWSKI Computer Scence Department Calforna State Unversty Domnguez Hlls Carson, CA 90747 USA Abstract: - Multvarate data analyss

More information

A Binarization Algorithm specialized on Document Images and Photos

A Binarization Algorithm specialized on Document Images and Photos A Bnarzaton Algorthm specalzed on Document mages and Photos Ergna Kavalleratou Dept. of nformaton and Communcaton Systems Engneerng Unversty of the Aegean kavalleratou@aegean.gr Abstract n ths paper, a

More information

Hermite Splines in Lie Groups as Products of Geodesics

Hermite Splines in Lie Groups as Products of Geodesics Hermte Splnes n Le Groups as Products of Geodescs Ethan Eade Updated May 28, 2017 1 Introducton 1.1 Goal Ths document defnes a curve n the Le group G parametrzed by tme and by structural parameters n the

More information

A New Approach For the Ranking of Fuzzy Sets With Different Heights

A New Approach For the Ranking of Fuzzy Sets With Different Heights New pproach For the ankng of Fuzzy Sets Wth Dfferent Heghts Pushpnder Sngh School of Mathematcs Computer pplcatons Thapar Unversty, Patala-7 00 Inda pushpndersnl@gmalcom STCT ankng of fuzzy sets plays

More information

For instance, ; the five basic number-sets are increasingly more n A B & B A A = B (1)

For instance, ; the five basic number-sets are increasingly more n A B & B A A = B (1) Secton 1.2 Subsets and the Boolean operatons on sets If every element of the set A s an element of the set B, we say that A s a subset of B, or that A s contaned n B, or that B contans A, and we wrte A

More information

Classic Term Weighting Technique for Mining Web Content Outliers

Classic Term Weighting Technique for Mining Web Content Outliers Internatonal Conference on Computatonal Technques and Artfcal Intellgence (ICCTAI'2012) Penang, Malaysa Classc Term Weghtng Technque for Mnng Web Content Outlers W.R. Wan Zulkfel, N. Mustapha, and A. Mustapha

More information

On Some Entertaining Applications of the Concept of Set in Computer Science Course

On Some Entertaining Applications of the Concept of Set in Computer Science Course On Some Entertanng Applcatons of the Concept of Set n Computer Scence Course Krasmr Yordzhev *, Hrstna Kostadnova ** * Assocate Professor Krasmr Yordzhev, Ph.D., Faculty of Mathematcs and Natural Scences,

More information

SLAM Summer School 2006 Practical 2: SLAM using Monocular Vision

SLAM Summer School 2006 Practical 2: SLAM using Monocular Vision SLAM Summer School 2006 Practcal 2: SLAM usng Monocular Vson Javer Cvera, Unversty of Zaragoza Andrew J. Davson, Imperal College London J.M.M Montel, Unversty of Zaragoza. josemar@unzar.es, jcvera@unzar.es,

More information

Enhancement of Infrequent Purchased Product Recommendation Using Data Mining Techniques

Enhancement of Infrequent Purchased Product Recommendation Using Data Mining Techniques Enhancement of Infrequent Purchased Product Recommendaton Usng Data Mnng Technques Noraswalza Abdullah, Yue Xu, Shlomo Geva, and Mark Loo Dscplne of Computer Scence Faculty of Scence and Technology Queensland

More information

Feature Selection as an Improving Step for Decision Tree Construction

Feature Selection as an Improving Step for Decision Tree Construction 2009 Internatonal Conference on Machne Learnng and Computng IPCSIT vol.3 (2011) (2011) IACSIT Press, Sngapore Feature Selecton as an Improvng Step for Decson Tree Constructon Mahd Esmael 1, Fazekas Gabor

More information

BioTechnology. An Indian Journal FULL PAPER. Trade Science Inc.

BioTechnology. An Indian Journal FULL PAPER. Trade Science Inc. [Type text] [Type text] [Type text] ISSN : 0974-74 Volume 0 Issue BoTechnology 04 An Indan Journal FULL PAPER BTAIJ 0() 04 [684-689] Revew on Chna s sports ndustry fnancng market based on market -orented

More information

Classification / Regression Support Vector Machines

Classification / Regression Support Vector Machines Classfcaton / Regresson Support Vector Machnes Jeff Howbert Introducton to Machne Learnng Wnter 04 Topcs SVM classfers for lnearly separable classes SVM classfers for non-lnearly separable classes SVM

More information

Alignment Results of SOBOM for OAEI 2010

Alignment Results of SOBOM for OAEI 2010 Algnment Results of SOBOM for OAEI 2010 Pegang Xu, Yadong Wang, Lang Cheng, Tany Zang School of Computer Scence and Technology Harbn Insttute of Technology, Harbn, Chna pegang.xu@gmal.com, ydwang@ht.edu.cn,

More information

Using an Automatic Weighted Keywords Dictionary for Intelligent Web Content Filtering

Using an Automatic Weighted Keywords Dictionary for Intelligent Web Content Filtering Journal of Advances n Computer Research Quarterly pissn: 2345-606x eissn: 2345-6078 Sar Branch, Islamc Azad Unversty, Sar, I.R.Iran (Vol. 6, No. 1, February 2015), Pages: 101-114 www.jacr.ausar.ac.r Usng

More information

Solving two-person zero-sum game by Matlab

Solving two-person zero-sum game by Matlab Appled Mechancs and Materals Onlne: 2011-02-02 ISSN: 1662-7482, Vols. 50-51, pp 262-265 do:10.4028/www.scentfc.net/amm.50-51.262 2011 Trans Tech Publcatons, Swtzerland Solvng two-person zero-sum game by

More information

Learning-Based Top-N Selection Query Evaluation over Relational Databases

Learning-Based Top-N Selection Query Evaluation over Relational Databases Learnng-Based Top-N Selecton Query Evaluaton over Relatonal Databases Lang Zhu *, Wey Meng ** * School of Mathematcs and Computer Scence, Hebe Unversty, Baodng, Hebe 071002, Chna, zhu@mal.hbu.edu.cn **

More information

Smoothing Spline ANOVA for variable screening

Smoothing Spline ANOVA for variable screening Smoothng Splne ANOVA for varable screenng a useful tool for metamodels tranng and mult-objectve optmzaton L. Rcco, E. Rgon, A. Turco Outlne RSM Introducton Possble couplng Test case MOO MOO wth Game Theory

More information

Impact of a New Attribute Extraction Algorithm on Web Page Classification

Impact of a New Attribute Extraction Algorithm on Web Page Classification Impact of a New Attrbute Extracton Algorthm on Web Page Classfcaton Gösel Brc, Banu Dr, Yldz Techncal Unversty, Computer Engneerng Department Abstract Ths paper ntroduces a new algorthm for dmensonalty

More information

NUMERICAL SOLVING OPTIMAL CONTROL PROBLEMS BY THE METHOD OF VARIATIONS

NUMERICAL SOLVING OPTIMAL CONTROL PROBLEMS BY THE METHOD OF VARIATIONS ARPN Journal of Engneerng and Appled Scences 006-017 Asan Research Publshng Network (ARPN). All rghts reserved. NUMERICAL SOLVING OPTIMAL CONTROL PROBLEMS BY THE METHOD OF VARIATIONS Igor Grgoryev, Svetlana

More information

A Deflected Grid-based Algorithm for Clustering Analysis

A Deflected Grid-based Algorithm for Clustering Analysis A Deflected Grd-based Algorthm for Clusterng Analyss NANCY P. LIN, CHUNG-I CHANG, HAO-EN CHUEH, HUNG-JEN CHEN, WEI-HUA HAO Department of Computer Scence and Informaton Engneerng Tamkang Unversty 5 Yng-chuan

More information

Network Intrusion Detection Based on PSO-SVM

Network Intrusion Detection Based on PSO-SVM TELKOMNIKA Indonesan Journal of Electrcal Engneerng Vol.1, No., February 014, pp. 150 ~ 1508 DOI: http://dx.do.org/10.11591/telkomnka.v1.386 150 Network Intruson Detecton Based on PSO-SVM Changsheng Xang*

More information

2x x l. Module 3: Element Properties Lecture 4: Lagrange and Serendipity Elements

2x x l. Module 3: Element Properties Lecture 4: Lagrange and Serendipity Elements Module 3: Element Propertes Lecture : Lagrange and Serendpty Elements 5 In last lecture note, the nterpolaton functons are derved on the bass of assumed polynomal from Pascal s trangle for the fled varable.

More information

ON SOME ENTERTAINING APPLICATIONS OF THE CONCEPT OF SET IN COMPUTER SCIENCE COURSE

ON SOME ENTERTAINING APPLICATIONS OF THE CONCEPT OF SET IN COMPUTER SCIENCE COURSE Yordzhev K., Kostadnova H. Інформаційні технології в освіті ON SOME ENTERTAINING APPLICATIONS OF THE CONCEPT OF SET IN COMPUTER SCIENCE COURSE Yordzhev K., Kostadnova H. Some aspects of programmng educaton

More information

CHAPTER 3 SEQUENTIAL MINIMAL OPTIMIZATION TRAINED SUPPORT VECTOR CLASSIFIER FOR CANCER PREDICTION

CHAPTER 3 SEQUENTIAL MINIMAL OPTIMIZATION TRAINED SUPPORT VECTOR CLASSIFIER FOR CANCER PREDICTION 48 CHAPTER 3 SEQUENTIAL MINIMAL OPTIMIZATION TRAINED SUPPORT VECTOR CLASSIFIER FOR CANCER PREDICTION 3.1 INTRODUCTION The raw mcroarray data s bascally an mage wth dfferent colors ndcatng hybrdzaton (Xue

More information

Sum of Linear and Fractional Multiobjective Programming Problem under Fuzzy Rules Constraints

Sum of Linear and Fractional Multiobjective Programming Problem under Fuzzy Rules Constraints Australan Journal of Basc and Appled Scences, 2(4): 1204-1208, 2008 ISSN 1991-8178 Sum of Lnear and Fractonal Multobjectve Programmng Problem under Fuzzy Rules Constrants 1 2 Sanjay Jan and Kalash Lachhwan

More information

Deep Classifier: Automatically Categorizing Search Results into Large-Scale Hierarchies

Deep Classifier: Automatically Categorizing Search Results into Large-Scale Hierarchies Deep Classfer: Automatcally Categorzng Search Results nto Large-Scale Herarches Dkan Xng 1, Gu-Rong Xue 1, Qang Yang 2, Yong Yu 1 1 Shangha Jao Tong Unversty, Shangha, Chna {xaobao,grxue,yyu}@apex.sjtu.edu.cn

More information

Machine Learning: Algorithms and Applications

Machine Learning: Algorithms and Applications 14/05/1 Machne Learnng: Algorthms and Applcatons Florano Zn Free Unversty of Bozen-Bolzano Faculty of Computer Scence Academc Year 011-01 Lecture 10: 14 May 01 Unsupervsed Learnng cont Sldes courtesy of

More information

A Knowledge Management System for Organizing MEDLINE Database

A Knowledge Management System for Organizing MEDLINE Database A Knowledge Management System for Organzng MEDLINE Database Hyunk Km, Su-Shng Chen Computer and Informaton Scence Engneerng Department, Unversty of Florda, Ganesvlle, Florda 32611, USA Wth the exploson

More information

Query classification using topic models and support vector machine

Query classification using topic models and support vector machine Query classfcaton usng topc models and support vector machne Deu-Thu Le Unversty of Trento, Italy deuthu.le@ds.untn.t Raffaella Bernard Unversty of Trento, Italy bernard@ds.untn.t Abstract Ths paper descrbes

More information

Improving Web Image Search using Meta Re-rankers

Improving Web Image Search using Meta Re-rankers VOLUME-1, ISSUE-V (Aug-Sep 2013) IS NOW AVAILABLE AT: www.dcst.com Improvng Web Image Search usng Meta Re-rankers B.Kavtha 1, N. Suata 2 1 Department of Computer Scence and Engneerng, Chtanya Bharath Insttute

More information

General Vector Machine. Hong Zhao Department of Physics, Xiamen University

General Vector Machine. Hong Zhao Department of Physics, Xiamen University General Vector Machne Hong Zhao (zhaoh@xmu.edu.cn) Department of Physcs, Xamen Unversty The support vector machne (SVM) s an mportant class of learnng machnes for functon approach, pattern recognton, and

More information

Efficient Text Classification by Weighted Proximal SVM *

Efficient Text Classification by Weighted Proximal SVM * Effcent ext Classfcaton by Weghted Proxmal SVM * Dong Zhuang 1, Benyu Zhang, Qang Yang 3, Jun Yan 4, Zheng Chen, Yng Chen 1 1 Computer Scence and Engneerng, Bejng Insttute of echnology, Bejng 100081, Chna

More information

THE CONDENSED FUZZY K-NEAREST NEIGHBOR RULE BASED ON SAMPLE FUZZY ENTROPY

THE CONDENSED FUZZY K-NEAREST NEIGHBOR RULE BASED ON SAMPLE FUZZY ENTROPY Proceedngs of the 20 Internatonal Conference on Machne Learnng and Cybernetcs, Guln, 0-3 July, 20 THE CONDENSED FUZZY K-NEAREST NEIGHBOR RULE BASED ON SAMPLE FUZZY ENTROPY JUN-HAI ZHAI, NA LI, MENG-YAO

More information

Fuzzy Modeling of the Complexity vs. Accuracy Trade-off in a Sequential Two-Stage Multi-Classifier System

Fuzzy Modeling of the Complexity vs. Accuracy Trade-off in a Sequential Two-Stage Multi-Classifier System Fuzzy Modelng of the Complexty vs. Accuracy Trade-off n a Sequental Two-Stage Mult-Classfer System MARK LAST 1 Department of Informaton Systems Engneerng Ben-Guron Unversty of the Negev Beer-Sheva 84105

More information

User Authentication Based On Behavioral Mouse Dynamics Biometrics

User Authentication Based On Behavioral Mouse Dynamics Biometrics User Authentcaton Based On Behavoral Mouse Dynamcs Bometrcs Chee-Hyung Yoon Danel Donghyun Km Department of Computer Scence Department of Computer Scence Stanford Unversty Stanford Unversty Stanford, CA

More information

Face Recognition University at Buffalo CSE666 Lecture Slides Resources:

Face Recognition University at Buffalo CSE666 Lecture Slides Resources: Face Recognton Unversty at Buffalo CSE666 Lecture Sldes Resources: http://www.face-rec.org/algorthms/ Overvew of face recognton algorthms Correlaton - Pxel based correspondence between two face mages Structural

More information

Personalized Concept-Based Clustering of Search Engine Queries

Personalized Concept-Based Clustering of Search Engine Queries IEEE TRANSACTIONS ON JOURNAL NAME, MANUSCRIPT ID 1 Personalzed Concept-Based Clusterng of Search Engne Queres Kenneth Wa-Tng Leung, Wlfred Ng, and Dk Lun Lee Abstract The exponental growth of nformaton

More information

Face Detection with Deep Learning

Face Detection with Deep Learning Face Detecton wth Deep Learnng Yu Shen Yus122@ucsd.edu A13227146 Kuan-We Chen kuc010@ucsd.edu A99045121 Yzhou Hao y3hao@ucsd.edu A98017773 Mn Hsuan Wu mhwu@ucsd.edu A92424998 Abstract The project here

More information

Type-2 Fuzzy Non-uniform Rational B-spline Model with Type-2 Fuzzy Data

Type-2 Fuzzy Non-uniform Rational B-spline Model with Type-2 Fuzzy Data Malaysan Journal of Mathematcal Scences 11(S) Aprl : 35 46 (2017) Specal Issue: The 2nd Internatonal Conference and Workshop on Mathematcal Analyss (ICWOMA 2016) MALAYSIAN JOURNAL OF MATHEMATICAL SCIENCES

More information

CS434a/541a: Pattern Recognition Prof. Olga Veksler. Lecture 15

CS434a/541a: Pattern Recognition Prof. Olga Veksler. Lecture 15 CS434a/541a: Pattern Recognton Prof. Olga Veksler Lecture 15 Today New Topc: Unsupervsed Learnng Supervsed vs. unsupervsed learnng Unsupervsed learnng Net Tme: parametrc unsupervsed learnng Today: nonparametrc

More information

CLASSIFICATION OF ULTRASONIC SIGNALS

CLASSIFICATION OF ULTRASONIC SIGNALS The 8 th Internatonal Conference of the Slovenan Socety for Non-Destructve Testng»Applcaton of Contemporary Non-Destructve Testng n Engneerng«September -3, 5, Portorož, Slovena, pp. 7-33 CLASSIFICATION

More information

Learning Non-Linearly Separable Boolean Functions With Linear Threshold Unit Trees and Madaline-Style Networks

Learning Non-Linearly Separable Boolean Functions With Linear Threshold Unit Trees and Madaline-Style Networks In AAAI-93: Proceedngs of the 11th Natonal Conference on Artfcal Intellgence, 33-1. Menlo Park, CA: AAAI Press. Learnng Non-Lnearly Separable Boolean Functons Wth Lnear Threshold Unt Trees and Madalne-Style

More information

Audio Content Classification Method Research Based on Two-step Strategy

Audio Content Classification Method Research Based on Two-step Strategy (IJACSA) Internatonal Journal of Advanced Computer Scence and Applcatons, Audo Content Classfcaton Method Research Based on Two-step Strategy Sume Lang Department of Computer Scence and Technology Chongqng

More information

Web Document Classification Based on Fuzzy Association

Web Document Classification Based on Fuzzy Association Web Document Classfcaton Based on Fuzzy Assocaton Choochart Haruechayasa, Me-Lng Shyu Department of Electrcal and Computer Engneerng Unversty of Mam Coral Gables, FL 33124, USA charuech@mam.edu, shyu@mam.edu

More information

A Web Site Classification Approach Based On Its Topological Structure

A Web Site Classification Approach Based On Its Topological Structure Internatonal Journal on Asan Language Processng 20 (2):75-86 75 A Web Ste Classfcaton Approach Based On Its Topologcal Structure J-bn Zhang,Zh-mng Xu,Kun-l Xu,Q-shu Pan School of Computer scence and Technology,Harbn

More information