INTL JOURNAL OF ELECTRONICS AND TELECOMMUNICATIONS, 2011, VOL. 57, NO. 3
Manuscript received June 19, 2011; revised September 2011.

Document Clustering Concepts, Metrics and Algorithms

Tomasz Tarczynski

Abstract: Document clustering, which is also referred to as text clustering, is a technique of unsupervised document organisation. Text clustering is used to group documents into subsets, called clusters, that consist of texts similar to each other. Document clustering algorithms are widely used in web search engines to produce results relevant to a query. An example of the practical use of these techniques is the Yahoo! hierarchy of documents [1]. Another application of document clustering is browsing, which is defined as a search session without a well-specified goal. Browsing techniques rely heavily on document clustering. In this article we examine the most important concepts related to document clustering. Besides the algorithms, we present a comprehensive discussion of the representation of documents, the calculation of similarity between documents, and the evaluation of cluster quality.

Keywords: Document clustering, text mining, k-means, hierarchical clustering, vector space model.

I. INTRODUCTION

DOCUMENT clustering is a special version of the data clustering problem. Data clustering is defined as the organisation of a set of objects into disjoint subsets called clusters (see [2]). The assignment should be made in such a way that objects within a cluster are similar to each other. In text clustering, instead of general objects, we use documents.

Clustering algorithms are often used in web search engines to automatically group web pages into categories which can be browsed easily by a user. The main purpose of using document clustering is to find documents relevant to a given query. To obtain such results a query must be formulated first, and it is not always easy to define one. An example of a hard-to-define query: how should a query be formulated if a user wants to retrieve articles describing the most important recent events in the world? Defining the query as "recent most important events" would probably not return anything useful unless there were a similarly titled article, and even that would not guarantee success, because the article could have been written 10 years ago. In such a case the user does not know exactly what he is looking for. This concept is called browsing. More formally, it is defined as a search session without a well-specified goal (see [3]). The goal is usually refined during the browsing process, and the user's purpose might also change along the way. Browsing uses text clustering as the main subroutine of the whole process. Those clustering algorithms usually differ a bit from the other ones because they must meet additional requirements.

Nowadays document clustering is becoming even more important, mostly because the databases containing the documents are growing rapidly. That might cause two kinds of problems. The first of them is the time needed to perform a search: the time complexity of most clustering algorithms depends linearly or quadratically on the number of documents, which makes clustering large collections expensive. The second issue concerns the accuracy of results: an increase in the number of documents increases the probability that a document will be assigned to the wrong cluster.

T. Tarczynski is with the Department of Applied Informatics, Warsaw University of Life Sciences, ul. Nowoursynowska 159, Warsaw, Poland (e-mail: tomek.tarczynski@gmail.com).

Fig. 1. An illustration of clustering results for objects that can be described using two attributes. Each of the ellipses represents a different cluster containing similar objects.

This paper is divided into 6 sections. Section 2 presents the definitions and notations that are used in this paper, while Section 3 presents different representations of documents. Section 4 describes the concept of similarity and the evaluation of cluster quality. Section 5, which is the main part of this paper, covers different algorithms used for clustering. Section 6 presents conclusions.

II.
DEFINITIONS AND NOTATIONS

The following conventions are used throughout the document. Capital letters are used to denote sets, while lowercase letters are used for vectors and scalars. Let D be the set of all documents that are to be clustered. It is assumed that this set is always finite, with cardinality n. Depending on the context, d_j can denote the j-th document or the vector that represents the j-th document (the representation of documents is explained in the next section). C is the set of clusters, while C_i refers to the i-th cluster. The cardinality of the set C is equal to k. Let T be the set of all possible words that can occur in documents. This set is also finite and its cardinality is denoted by t. It is assumed that the size of this set does not depend on the number of documents. Most of the time this set is referred to as the dictionary. The i-th word in the dictionary is denoted by t_i. The number of occurrences of the word t_i in the document d_j is denoted by n_ij, while n_j is the total number of words in the j-th document. The symbol ||x|| refers to the length of the vector x and |A| denotes the cardinality of the set A. A set of clusters produced during a clustering process is called a clustering result.

III. REPRESENTATION OF A DOCUMENT

Computers process numbers much better than documents, and therefore documents are usually represented using numbers. Many different models for representing documents have been proposed. The most widespread representation of documents is called the Vector Space Model (VSM), which was presented by Salton, Wong and Yang in [4]. In this model documents are represented by non-negative sparse vectors in a t-dimensional space. Such vectors will be denoted by d_j, where j denotes the j-th document in the set:

    d_j = (w_1j, w_2j, ..., w_tj)    (1)

where w_ij is the weight of the i-th term in the j-th document. Each term usually refers to a single word from the dictionary, but some algorithms like Phrase-Intersection clustering use phrases instead of single words. In such a case the dictionary would consist of phrases and t would denote the number of possible phrases. There are several ways of calculating those weights, but typically the term frequency-inverse document frequency (tf-idf) method is used (see [5]). In this method each weight is a normalized product of two components: term frequency and inverse document frequency. Term frequency is calculated as the normalized number of occurrences of the term in the document. The term frequency of the i-th term in the j-th document, T_ij, is defined as:

    T_ij = n_ij / n_j    (2)

Term frequency represents how often a word is used in a document: the more often, the higher the weight.
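The term-frequency component of Eq. (2) can be sketched in Python as follows (a toy example; the tokenizer, a plain whitespace split, is an assumption, since the paper does not prescribe one):

```python
from collections import Counter

def term_frequencies(tokens):
    """T_ij = n_ij / n_j: occurrences of each word in a document
    divided by the total number of words in that document."""
    counts = Counter(tokens)
    total = len(tokens)
    return {word: c / total for word, c in counts.items()}

doc = "the cat sat on the mat".split()  # n_j = 6 words
tf = term_frequencies(doc)
print(tf["the"])  # "the" occurs twice in six words -> 0.333...
```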
To distinguish common words, which do not convey anything useful in the context of document clustering, from the important words, inverse document frequency is used. The more documents contain a particular word, the lower the value of this term; the number of occurrences of the word within a single document is not relevant, only whether the word occurred or not. This term is calculated as the logarithm of the inverse of the fraction of documents in which the word occurred. For the i-th word, the inverse document frequency, I_i, is defined as:

    I_i = log( n / |{j : t_i ∈ d_j}| )    (3)

There is a common problem associated with similar equations, namely division by 0. To overcome this problem two approaches are possible. The idea of the first one is to add 1 to the denominator, but it changes the results. The second one, which is more common, is to assume that words that do not occur in any document are removed. Inverse document frequency solves the problem of common words, which should not have any influence on the clustering process. If a word occurs in all documents then the numerator and denominator are equal and the value of the whole term is 0. This removes the influence of words such as: a, the, so, are, etc. According to [4], a model without idf has lower precision and recall¹ by 14% on average. The improvement was measured over the standard term frequency weighting. More about the VSM can be found in [6] and [7].

IV. DOCUMENT SIMILARITY AND EVALUATION OF CLUSTER QUALITY

The first part of this section discusses document similarity measures. The presented similarity measures assume that documents are represented using the VSM. Similarity measures are one of the key parts of almost every clustering algorithm. They can be used not only to determine the similarity between two documents, but also to calculate the similarity between a document and a cluster, and between two clusters. The most popular document similarity measures, according to [8], are Cosine, Dice and Jaccard. They are all based on the dot product; the only difference is the normalization factor.
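These three dot-product-based measures can be sketched as follows (a minimal sketch over plain weight-vector lists; the toy vectors are illustrative only):

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    """Dot product normalized by the product of the vector lengths."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

def dice(u, v):
    """Dot product normalized by the sum of the squared lengths."""
    return 2 * dot(u, v) / (dot(u, u) + dot(v, v))

def jaccard(u, v):
    """Dot product normalized by the squared lengths minus the dot product."""
    return dot(u, v) / (dot(u, u) + dot(v, v) - dot(u, v))

d1 = [0.5, 0.5, 0.0]  # toy tf-idf weight vectors
d2 = [0.5, 0.0, 0.5]
print(cosine(d1, d2), dice(d1, d2), jaccard(d1, d2))  # 0.5 0.5 0.333...
```

Note that for vectors of equal length (in particular, unit-normalized vectors) cosine and Dice coincide, since their denominators become equal.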
It is worth noticing that if the vectors are normalized then the cosine and Dice measures produce the same result.

    Sim_c(d_j, d_q) = Σ_i w_ij w_iq / ( sqrt(Σ_i w_ij²) · sqrt(Σ_i w_iq²) )    (4)

    Sim_d(d_j, d_q) = 2 Σ_i w_ij w_iq / ( Σ_i w_ij² + Σ_i w_iq² )    (5)

    Sim_j(d_j, d_q) = Σ_i w_ij w_iq / ( Σ_i w_ij² + Σ_i w_iq² − Σ_i w_ij w_iq )    (6)

It is common to calculate document dissimilarity (distance) instead of similarity. This is a slightly different approach to the same problem: instead of saying how similar the documents are, we say how different they are. Usually the Euclidean distance is used in this approach (see [9]), although other metrics like the Manhattan distance or the Minkowski distance can be used; they are, however, not as common here as they are in general data clustering.

Document similarity measures can be extended to measure the resemblance of all documents in a cluster. The whole group of such measures is called intra-cluster similarity measures (see [10]). The most widely used intra-cluster similarity measure is group average similarity. It is defined as the average pairwise similarity of documents in the cluster and is given by:

    d(C_i) = Σ_{d_j, d_q ∈ C_i} sim(d_j, d_q) / |C_i|²    (7)

If we use cosine as the document similarity measure then, somewhat surprisingly, this formula can be significantly simplified. To simplify it we need to introduce the concept of a centroid. The centroid of a cluster is a vector that is used to represent the whole cluster. It is defined as the average of all vectors that represent documents in the cluster; in this paper we denote the centroid of the i-th cluster by c_i:

    c_i = (1/|C_i|) Σ_{d_j ∈ C_i} d_j    (8)

The centroid does not have to have normalized length; in fact it almost never does. It can be easily proved that the intra-cluster similarity is equal to the squared length of the centroid vector:

    d(C_i) = ||c_i||²    (9)

The calculation of the similarity between two clusters can be simplified to the calculation of the similarity of their centroids. Another approach was presented in [11], where Olson introduced cluster similarity metrics: single link, average link, complete link and median. They are defined as the minimum distance between documents in two clusters, the average distance, the maximum distance and the median of distances, respectively (see also [3]).

The last part of this section is devoted to external quality measures. In contrast to the previously mentioned measures, these need external knowledge about the real classes. This assumption prevents them from being used in a clustering algorithm as a criterion function. These measures reflect the similarity between the real classes and the ones obtained in the clustering process. One of the most fundamental measures is the entropy, which is based on thermodynamical entropy. The entropy of a single cluster is defined as:

    E_j = − Σ_i p_ij ln(p_ij)    (10)

where p_ij is the probability that a document which is in cluster j belongs to class i. Here we assume that 0 ln(0) = 0. To obtain the quality of all clusters, the weighted sum of the entropies of all clusters must be calculated, where the weight is equal to the size of the cluster:

    E = (1/n) Σ_{j=1}^{k} n_j E_j    (11)

The F-measure, first introduced by C. J. van Rijsbergen in [12], is a different external quality measure. To understand this measure two concepts must be introduced. The first of them is precision.

¹ These terms are explained in the next section.
Precision(i, j) can be interpreted as the probability that a randomly chosen document from the j-th cluster belongs to the i-th class. The second concept is called recall, and it can be interpreted as the probability that a randomly chosen document from the i-th class belongs to the j-th cluster. The F-measure of cluster j and class i is calculated as the harmonic mean of precision and recall, which is given by:

    F(i, j) = 2 · Recall(i, j) · Precision(i, j) / ( Precision(i, j) + Recall(i, j) )    (12)

The F-measure of the whole clustering result is given in the following form, where n_i is the size of the i-th class:

    F = (1/n) Σ_{i=1}^{k} n_i max_j F(i, j)    (13)

Fig. 2. An illustration of recall and precision. Black dots represent documents that are in one category and the white ones represent documents that are in another category. Dots that are inside the ellipse correspond to the documents in one cluster.

More about the F-measure and other quality measures in information retrieval can be found in [13].

V. CLUSTERING ALGORITHMS

This section describes the most important clustering algorithms. Each of the presented algorithms fits into one of two categories: hierarchical clustering and non-hierarchical clustering (see [14]). Non-hierarchical clustering algorithms are often called flat or partitional methods. Hierarchical methods, in contrast to the partitional methods, do not create a single clustering result but a whole hierarchy of clusterings. There are two basic approaches to hierarchical clustering. The first of them is hierarchical agglomerative clustering (HAC) and the second one is hierarchical divisive clustering (HDC).

A. Hierarchical Agglomerative Algorithms

Hierarchical methods are often believed to produce better clustering results (see [15]), but their main disadvantage is the complexity. The lower bound on time complexity in hierarchical clustering is O(n²). The results obtained by a hierarchical clustering algorithm can be viewed at different levels.
Each of the levels can be seen as a separate clustering result with a different number of clusters, although those levels are strongly connected, because the result on each level is based on the result of the previous level. This hierarchy of results is usually presented graphically as a tree called a dendrogram. It is a tree with a single node at the top and n nodes at the bottom. Each of the nodes at the bottom represents a cluster with a single document in it.

At the beginning of the agglomerative approach each document is in its own cluster, so on the first level the number of clusters is equal to n. In each iteration the two most similar clusters are merged. This process continues until only one cluster is left. That strategy is also called the bottom-up approach. An exhaustive survey of agglomerative hierarchical methods can be found in [14]. More about the efficiency of HAC can be found in [16]. The traditional agglomerative algorithm can be summarized as follows:

1) Compute the similarity matrix A, whose entry [a_ij] is the similarity between the i-th and j-th cluster.
2) Merge the two most similar clusters.
3) Update the similarity matrix (only the column that represents the new cluster must be updated).
4) Repeat steps 2 and 3 until all documents are in one cluster.

Fig. 3. A sample dendrogram. Looking from the bottom up, in the first iteration documents D1 and D2 were merged into one cluster; in the second one documents D3 and D4 were grouped into one cluster. In the third iteration the cluster containing D1 and D2 was merged with the cluster containing D3 and D4, and finally in the last iteration all documents were put into one cluster.

B. Bisecting K-means

Bisecting K-means is a hierarchical divisive algorithm which was introduced in [15]. Bisecting K-means should not be confused with the K-means algorithm, which is a partitional algorithm. The divisive strategy is in some sense the reverse of the agglomerative technique. At the beginning of the algorithm there is only one cluster, which contains all documents. In each iteration a chosen cluster is divided into two clusters. The process stops when each cluster contains only a single document. Bisecting K-means is described below:

1) Choose a cluster to split.
2) Apply K-means clustering with k = 2 to this cluster (the bisecting step).
3) Repeat the bisecting step ITER times and choose the split that produces the clustering with the highest overall similarity.
4) Repeat all steps until the desired number of clusters is obtained.

Another huge advantage of this algorithm is that it can be seen as either a flat clustering or a hierarchical clustering method. Using it as a partitional clustering method might even produce better results, because the obtained results can be improved by running the K-means algorithm on the desired number of clusters; in this case the bisecting K-means would just lead to finding centroids for the clusters.

Empirical research in [15] shows that ITER = 5 is a good compromise between quality and time complexity. Increasing the ITER parameter would increase both the quality and the time complexity. This algorithm is very flexible, because steps 1 and 2 can be implemented in many different ways.
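The loop above can be sketched as follows, under some simplifying assumptions: documents are plain coordinate tuples, Euclidean distance stands in for the similarity measure, the largest cluster is split first, and the best of the ITER trial splits is taken to be the one with the lowest within-cluster squared error (a stand-in for "highest overall similarity"):

```python
import random

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def two_means(points, rng, iters=20):
    """One bisecting step: plain 2-means on a list of point tuples."""
    centers = rng.sample(points, 2)
    groups = ([], [])
    for _ in range(iters):
        groups = ([], [])
        for p in points:
            dists = [sq_dist(p, c) for c in centers]
            groups[dists.index(min(dists))].append(p)
        if not groups[0] or not groups[1]:  # degenerate split, discard it
            return groups, float("inf")
        centers = [tuple(sum(x) / len(g) for x in zip(*g)) for g in groups]
    sse = sum(min(sq_dist(p, c) for c in centers) for p in points)
    return groups, sse

def bisecting_kmeans(points, k, ITER=5, seed=0):
    rng = random.Random(seed)
    clusters = [list(points)]
    while len(clusters) < k:
        target = max(clusters, key=len)      # step 1: split the largest cluster
        clusters.remove(target)
        splits = [two_means(target, rng) for _ in range(ITER)]
        best_groups, _ = min(splits, key=lambda s: s[1])  # step 3: keep the best split
        clusters.extend(list(g) for g in best_groups)
    return clusters

data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
print(bisecting_kmeans(data, 2))
```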
The most intuitive way to choose a cluster in step 1 is to choose the largest cluster; that was the method used in [15]. Other possible approaches mentioned in [15] are to take the cluster with the least overall similarity, or to use a criterion based on both size and overall similarity. The bisecting step is even more adjustable, since any non-deterministic partitional clustering algorithm can be used in it, although taking K-means seems reasonable because it is a simple and effective algorithm. The K-means algorithm itself can also be adjusted in many ways, which is described in the next subsection.

C. K-means

K-means clustering (see [17], [18]) is the most widely used partitional clustering technique. Its popularity is due to its simplicity and good results. The general description of the K-means algorithm is as follows:

1) Choose k, the number of clusters.
2) Choose k points as the centers of the clusters.
3) Assign each document to the nearest cluster.
4) If the stop criterion is met, then stop.
5) Recalculate the centers of the clusters.
6) Go to step 3.

Choosing k in the first step depends strongly on the dataset and the motivations behind the clustering, although some systematic approaches for choosing the number of clusters are given in [19]. The points in the second step are usually chosen at random for simplicity, but modifications such as k-means++ [20] might be used. A comprehensive survey of different methods for initializing K-means clustering can be found in [21], where the authors compare 11 different initialization methods. From those experiments it might be concluded that the best initialization methods are the ones proposed in [22], [23] and [24]. In order to assign a document to the nearest cluster, any of the similarity measures mentioned in Section IV can be used. The stop criterion is usually defined as: stop when no documents have been moved from one cluster to another. In the standard K-means algorithm the center of a cluster is calculated as the centroid of the documents in that cluster, but many modifications to this point have been proposed.
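The k-means++ seeding mentioned above can be sketched as follows (a simplified sketch of the idea from [20]: each subsequent center is drawn with probability proportional to its squared distance from the nearest already-chosen center; points as coordinate tuples and Euclidean distance are simplifying assumptions):

```python
import random

def kmeans_pp_init(points, k, rng=None):
    """k-means++ seeding: the first center is chosen uniformly at
    random; every further center is drawn with probability
    proportional to the squared distance to its nearest chosen center."""
    rng = rng or random.Random()
    centers = [rng.choice(points)]
    while len(centers) < k:
        # squared distance of every point to its nearest chosen center
        d2 = [min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers)
              for p in points]
        total = sum(d2)
        if total == 0:               # every point coincides with a center
            centers.append(rng.choice(points))
            continue
        r = rng.uniform(0, total)    # roulette-wheel selection over d2
        acc = 0.0
        for p, w in zip(points, d2):
            if w == 0:
                continue
            acc += w
            if acc >= r:
                centers.append(p)
                break
    return centers

pts = [(0, 0), (0, 1), (9, 9), (10, 10)]
print(kmeans_pp_init(pts, 2, random.Random(42)))
```

Because the sampling weight of a point already chosen as a center is zero, the seeding tends to spread the initial centers far apart, which is exactly what makes the subsequent K-means iterations converge to better partitions.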
One of them is the k-medians algorithm, which in this step calculates the median instead of the mean; this median does not have to be an actual document, because the median of each term is calculated separately. Another modification was presented as the k-medoid algorithm, where one of the documents is chosen to be the center of a cluster. This document is usually chosen as the document with the lowest pairwise dissimilarity.

D. Algorithms Based on Criterion Functions

Many partitional clustering algorithms are based on the optimization of a global criterion function. In some of these algorithms the criterion function can be exchanged easily. Examples of such algorithms are CobWeb [25] and AutoClass [26]. Such algorithms consist of two main components: the criterion function used for the optimization, and the optimization algorithm itself. Many different clustering algorithms can easily be created if a list of criterion functions and optimization algorithms is available, because every combination of an optimization algorithm and a criterion function creates a new clustering algorithm.
In this section we will focus mostly on the criterion functions instead of the algorithms, because according to [9] and [27] a greedy strategy produces results comparable to those obtained using more complicated algorithms. These more complicated algorithms include concepts like iterative schemes based on hill-climbing methods, spectral-based optimizers and so on. The reader interested in these algorithms is referred to [28], [29], [30], [31], [32] and [33]. A great survey of criterion functions can be found in [9]. The criterion functions tested in [9] are listed below:

    f1: maximize Σ_{r=1}^{k} n_r ( (1/n_r²) Σ_{d_i, d_j ∈ C_r} cos(d_i, d_j) )    (14)

    f2: maximize Σ_{r=1}^{k} Σ_{d_i ∈ C_r} cos(d_i, c_r)    (15)

    f3: minimize Σ_{r=1}^{k} n_r cos(c_r, c)    (16)

    f4: maximize ( Σ_{r=1}^{k} n_r ( (1/n_r²) Σ_{d_i, d_j ∈ C_r} cos(d_i, d_j) ) ) / ( Σ_{r=1}^{k} n_r cos(c_r, c) )    (17)

    f5: maximize ( Σ_{r=1}^{k} Σ_{d_i ∈ C_r} cos(d_i, c_r) ) / ( Σ_{r=1}^{k} n_r cos(c_r, c) )    (18)

    f6: minimize Σ_{r=1}^{k} cut(S_r, S − S_r) / Σ_{d_i, d_j ∈ C_r} cos(d_i, d_j)    (19)

    f7: minimize Σ_{r=1}^{k} cut(V_r, V − V_r) / W(V_r)    (20)

where c denotes the centroid of all documents, V_r is the set of vertices assigned to the r-th cluster, and W(V_r) is the sum of the weights of the adjacency lists of V_r. The idea behind each of these functions is as follows:

- f1 is the weighted sum of the average pairwise similarities between the documents in each cluster; the weights are normalized by the size of the cluster.
- f2 is the sum of the cosines between each document and the centroid of its cluster.
- f3 is the sum of the cosines between each cluster centroid and the centroid of all documents.
- f4 is the quotient of f1 and f3.
- f5 is the quotient of f2 and f3.
- f6 is the edge-cut of each partition scaled by the sum of the cluster's internal edges.
- f7 is the normalized edge-cut of the partitioning.

Entropy was used as the measure of the quality of the results. Extensive experiments made on 15 different datasets showed that none of these functions is superior to the others, although some functions performed better. In almost all tests f2, f4 and f5 outperformed all other functions, and the differences in quality between those three were not significant. It is interesting that some functions which seem very similar, like f1 and f2, produced very different results: the results varied by up to twenty percent.

E. Word-Intersection Clustering

Oren Zamir in [34] pointed out that reducing the number of returned documents does not necessarily make browsing easier for the user. He claims that efficient document clustering can be a good method for navigating through a large collection of documents; furthermore, it might reveal patterns that would normally be missed. Zamir also created a six-point list of requirements for a clustering algorithm in the context of document browsing:

1) Ease of browsing: results must be presented in such a way that the user can immediately decide whether some content is useful or not.
2) Speed: the clustering should be fast enough to create results within seconds.
3) Scalability: the method should be able to cluster large sets of documents.
4) No preprocessing: the algorithm should not rely on preprocessing, because it is applied to dynamically generated sets of documents.
5) Client-side execution: the system should be able to run the clustering algorithm on the client side. This is motivated by the fact that the server might not have enough computational power to produce a clustering for each user.
6) Snippet-capable: the algorithm should not require whole documents, but only small snippets of them. Given client-side execution this requirement is necessary, because transferring whole documents could be more time consuming than the algorithm itself.

Word-Intersection Clustering is a hierarchical agglomerative algorithm with a few new concepts. The first of them is that clusters are characterized by the set of words that occur in every document in the cluster. This automatically solves the problem of description, because those common words can be used as a description of the cluster.
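This shared-word characterization can be sketched as follows (a toy example; lowercasing and whitespace tokenization are assumptions):

```python
def common_words(cluster):
    """The set of words that occur in every document of the cluster.
    Word-Intersection Clustering uses this set as the cluster's
    description; its size is the cluster's cohesion h(C_i)."""
    word_sets = [set(doc.lower().split()) for doc in cluster]
    return set.intersection(*word_sets)

cluster = [
    "Apple releases new phone",
    "new Apple phone reviewed",
    "the Apple phone sells well",
]
description = common_words(cluster)
print(sorted(description))  # ['apple', 'phone']
print(len(description))     # cohesion h = 2
```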
The number of words that are shared by all the documents in a cluster is called the cohesion and is denoted by h(C_i). Another new concept is the score function s(C_i), which expresses the decreasing marginal significance of the cohesion:

    s(C_i) = |C_i| · (1 − exp(−β h(C_i))) / (1 + exp(−β h(C_i)))    (21)

The most important new feature of this algorithm is the Global Quality Function, which quantifies the quality of a clustering. It is defined as:

    GQF(C) = f(C) g(|C|) Σ_{C_i ∈ C} s(C_i)    (22)

where f(C) is a function proportional to the normalized number of clustered documents (a document is considered to be clustered when the size of the cluster that contains it is greater than 1), while g(|C|) is an increasing function of the number of clusters. The algorithm itself can be summarized as:

1) Put all documents in different clusters.
2) If there are no two clusters whose merge would increase the GQF, halt.
3) Merge the two clusters whose merge increases the GQF the most.
4) Go to step 2.

The experiments made in [34] showed that this algorithm is faster than the classical hierarchical agglomerative algorithm with the cosine similarity measure, which is referred to in that paper as COS-GAVG. In addition, the quality of the results obtained by Word-Intersection Clustering was significantly better than the results produced by COS-GAVG. These results show that this algorithm is very promising.

VI. CONCLUSIONS

In this article we have presented the most important concepts related to document clustering. Our research confirmed that there is no superior text clustering algorithm. Document clustering algorithms vary in both complexity and the quality of results. Hierarchical algorithms are in general considered to perform better than flat methods, but on the other hand partitional algorithms are much faster. It seems that the most promising algorithms are hybrid methods like bisecting K-means. This algorithm outperformed partitional clustering algorithms without a significant increase in complexity. The clustering algorithm should be chosen depending on the available time: hierarchical algorithms are well suited for precomputation, while partitional algorithms are good for online computation.

The complexity of clustering algorithms will play a crucial role in the future, due to the continuous increase in the number of documents. On the other hand, the speed of computers will increase more slowly, because Moore's law² is considered to be broken. These premises lead to the conclusion that parallel and distributed clustering algorithms should receive more attention in the future.
Many parallel data clustering algorithms have already been designed (see [35], [11], [36]), but not much has been done in the field of parallel document clustering.

Besides the complexity, another potential problem is associated with higher demands on quality. These demands might lead to the development of a more linguistic approach to the document clustering problem. Present algorithms do not take into account the order of the words in a sentence. In some cases the order of words might be relevant in the process of clustering. This task is extremely hard to implement, because each language has its own grammar rules which determine whether the order in a particular case is important or not.

² Moore's law states that the number of transistors that can be placed on integrated circuits, without any additional costs, doubles every two years.

REFERENCES

[1] Y. Labrou and T. Finin, "Yahoo! as an ontology: using Yahoo! categories to describe documents," in Proceedings of the eighth international conference on Information and knowledge management, ser. CIKM '99, New York, NY, USA: ACM, 1999.
[2] A. K. Jain, M. N. Murty, and P. J. Flynn, "Data clustering: a review," ACM Comput. Surv., vol. 31, September.
[3] D. R. Cutting, D. R. Karger, J. O. Pedersen, and J. W. Tukey, "Scatter/gather: a cluster-based approach to browsing large document collections," in Proceedings of the 15th annual international ACM SIGIR conference on Research and development in information retrieval, ser. SIGIR '92, New York, NY, USA, 1992.
[4] G. Salton, A. Wong, and C. S. Yang, "A vector space model for automatic indexing," Commun. ACM, vol. 18, November.
[5] G. Salton and C. Buckley, "Term weighting approaches in automatic text retrieval," Cornell University, Ithaca, NY, USA, Tech. Rep.
[6] S. K. M. Wong, W. Ziarko, V. V. Raghavan, and P. C. N. Wong, "On modeling of information retrieval concepts in vector spaces," ACM Trans. Database Syst., vol. 12, June.
[7] X. Tai, M. Sasaki, Y. Tanaka, and K. Kita, "Improvement of vector space information retrieval model based on supervised learning," in Proceedings of the fifth international workshop on Information retrieval with Asian languages, ser. IRAL '00, New York, NY, USA, 2000.
[8] G. Salton, Ed., Automatic Text Processing. Boston, MA, USA: Addison-Wesley Longman Publishing Co., Inc.
[9] Y. Zhao and G. Karypis, "Empirical and theoretical comparisons of selected criterion functions for document clustering," Mach. Learn., vol. 55, June.
[10] H. J. Zeng, Q. C. He, Z. Chen, W. Y. Ma, and J. Ma, "Learning to cluster web search results," in Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval, ser. SIGIR '04, New York, NY, USA, 2004.
[11] C. F. Olson, "Parallel algorithms for hierarchical clustering," Parallel Comput., vol. 21.
[12] C. J. van Rijsbergen, Information Retrieval, 2nd ed. Newton, MA, USA: Butterworth-Heinemann.
[13] J. Makhoul, F. Kubala, R. Schwartz, and R. Weischedel, "Performance measures for information extraction," in Proceedings of the DARPA Broadcast News Workshop, 1999.
[14] A. El-Hamdouchi and P. Willett, "Comparison of hierarchic agglomerative clustering methods for document retrieval," The Computer Journal, vol. 32, Jan.
[15] M. Steinbach, G. Karypis, and V. Kumar, "A comparison of document clustering techniques."
[16] W. H. Day and H. Edelsbrunner, "Efficient algorithms for agglomerative hierarchical clustering methods," Journal of Classification, vol. 1, pp. 7-24.
[17] G. A. Wilkin and X. Huang, "A practical comparison of two k-means clustering algorithms," BMC Bioinformatics, vol. 9, p. S19.
[18] J. Wu, H. Xiong, and J. Chen, "Adapting the right measures for k-means clustering," in Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, ser. KDD '09, New York, NY, USA, 2009.
[19] M. Chiang and B. Mirkin, "Experiments for the number of clusters in k-means," in Progress in Artificial Intelligence, ser. Lecture Notes in Computer Science, J. Neves, M. Santos, and J. Machado, Eds. Springer Berlin / Heidelberg, 2007, vol. 4874.
[20] D. Arthur and S. Vassilvitskii, "k-means++: the advantages of careful seeding," in Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms, ser. SODA '07, Philadelphia, PA, USA, 2007.
[21] R. Maitra, A. Peterson, and A. Ghosh, "A systematic evaluation of different methods for initializing the k-means clustering algorithm," IEEE Transactions on Knowledge and Data Engineering, 2010.
[22] G. W. Milligan and P. D. Isaac, "The validation of four ultrametric clustering algorithms," Pattern Recognition, vol. 12, no. 2.
[23] P. S. Bradley and U. M. Fayyad, "Refining initial points for k-means clustering," in Proceedings of the Fifteenth International Conference on Machine Learning, ser. ICML '98, San Francisco, CA, USA, 1998.
[24] B. Mirkin, Clustering for Data Mining: A Data Recovery Approach. Chapman and Hall/CRC.
[25] D. H. Fisher, "Knowledge acquisition via incremental conceptual clustering," Mach. Learn., vol. 2, September.
[26] P. Cheeseman and J. Stutz, "Bayesian classification (AutoClass): theory and results," in Advances in Knowledge Discovery and Data Mining, U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, Eds. Menlo Park, CA, USA: American Association for Artificial Intelligence, 1996.
[27] S. Savaresi, D. L. Boley, S. Bittanti, and G. Gazzaniga, "Choosing the cluster to split in bisecting divisive clustering algorithms," in Second SIAM International Conference on Data Mining.
[28] M. Meila and D. Heckerman, "An experimental comparison of model-based clustering methods," Mach. Learn., vol. 42, pp. 9-29, January.
[29] G. Karypis, E. Han, and V. Kumar, "Chameleon: Hierarchical clustering using dynamic modeling," Computer, vol. 32, August.
[30] D. Boley, "Principal direction divisive partitioning," Data Min. Knowl. Discov., vol. 2, December.
[31] H. Zha, X. He, C. Ding, H. Simon, and M. Gu, "Bipartite graph partitioning and data clustering," in Proceedings of the tenth international conference on Information and knowledge management, ser. CIKM '01, New York, NY, USA, 2001.
[32] H. Zha, X. He, C. Ding, H. Simon, and M. Gu, "Spectral relaxation for k-means clustering," in Advances in Neural Information Processing Systems. MIT Press, 2001.
[33] I. S. Dhillon and D. S. Modha, "Concept decompositions for large sparse text data using clustering," Mach. Learn., vol. 42, January.
[34] O. Zamir, O. Etzioni, O. Madani, and R. M. Karp, "Fast and intuitive clustering of web documents," in Proceedings of the 3rd International Conference on Knowledge Discovery and Data Mining, 1997.
[35] M. Dash, S. Petrutiu, and P. Scheuermann, "Efficient parallel hierarchical clustering," in International Euro-Par Conference.
[36] Y. Song, W. Chen, H. Bai, C. Lin, and E. Chang, "Parallel spectral clustering," Machine Learning and Knowledge Discovery in Databases.
[37] Y. Liu, J. Mostafa, and W. Ke, "A fast online clustering algorithm for scatter/gather browsing."
[38] D. R. Cutting, D. R. Karger, and J. O. Pedersen, "Constant interaction-time scatter/gather browsing of very large document collections," in Proceedings of the 16th annual international ACM SIGIR conference on Research and development in information retrieval, ser. SIGIR '93, New York, NY, USA, 1993.