INTL JOURNAL OF ELECTRONICS AND TELECOMMUNICATIONS, 2011, VOL. 57, NO. 3
Manuscript received June 19, 2011; revised September 2011.

Document Clustering Concepts, Metrics and Algorithms

Tomasz Tarczynski

Abstract—Document clustering, which is also referred to as text clustering, is a technique of unsupervised document organisation. Text clustering is used to group documents into subsets that consist of texts that are similar to each other. These subsets are called clusters. Document clustering algorithms are widely used in web search engines to produce results relevant to a query. An example of practical use of those techniques are the Yahoo! hierarchies of documents [1]. Another application of document clustering is browsing, which is defined as a search session without a well-specified goal. Browsing techniques rely heavily on document clustering. In this article we examine the most important concepts related to document clustering. Besides the algorithms, we present a comprehensive discussion of the representation of documents, the calculation of similarity between documents, and the evaluation of cluster quality.

Keywords—Document clustering, text mining, k-means, hierarchical clustering, vector space model.

I. INTRODUCTION

DOCUMENT clustering is a special version of the data clustering problem. Data clustering is defined as the organization of a set of objects into disjoint subsets called clusters (see [2]). The assignment should be made in such a way that objects within a cluster are similar to each other. In text clustering, instead of general objects, we use documents.

Clustering algorithms are often used in web search engines to automatically group web pages into categories which can be browsed easily by a user. The main purpose of using document clustering is to find documents relevant to a given query. To obtain such results a query must be formulated first, and it is not always easy to define one. As an example of a hard-to-define query, consider how to formulate a query if a user wants to retrieve articles describing the most important recent events in the world. Defining the query as "recent most important events" would probably not return anything useful; even an article with a similar title would not guarantee success, because that article could have been written 10 years ago. In such a case the user does not know exactly what he is looking for. This concept is called browsing. More formally, it is defined as a search session without a well-specified goal (see [3]). This goal is usually defined during the browsing process, and the user's purpose might also change during it. Browsing uses text clustering as the main subroutine of the whole process. Those clustering algorithms usually differ a bit from the other ones because they must meet additional requirements.

Nowadays document clustering is becoming even more important, mostly because the databases containing the documents are growing rapidly. This growth causes two kinds of problems. The first of them is the time needed to perform a search: the time complexity of most clustering algorithms depends linearly or quadratically on the number of documents. The second issue is connected with the accuracy of results: an increase in the number of documents increases the probability that a document will be assigned to the wrong cluster.

Fig. 1. An illustration of clustering results for objects that can be described using two attributes. Each of the ellipses represents a different cluster containing similar objects.

This paper is divided into 6 sections. Section 2 presents the definitions and notations that are used in this paper, while Section 3 presents different representations of documents. Section 4 describes the concept of similarity and the evaluation of cluster quality. Section 5, which is the main part of this paper, covers different algorithms used for clustering. Section 6 presents conclusions.

T. Tarczynski is with the Department of Applied Informatics, Warsaw University of Life Sciences, ul. Nowoursynowska 159, Warsaw, Poland (e-mail: tomek.tarczynski@gmail.com).

II. DEFINITIONS AND NOTATIONS

The following conventions are used throughout the document. Capital letters are used to denote sets, while lowercase letters are used for vectors and scalars. Let D be the set of all documents that are to be clustered. It is assumed that this set is always finite, with cardinality n. Depending on the context, d_j can denote the j-th document or the vector that represents the j-th document (the representation of documents is explained in the next section). C is the set of clusters, while C_i refers to the i-th cluster. The cardinality of the set C is equal to k. Let T be the set of all possible words that can occur in

documents. This set is also finite, and its cardinality is denoted by t. It is assumed that the size of this set does not depend on the number of documents. Most of the time this set is referred to as the dictionary. The i-th word in the dictionary is denoted by t_i. The number of occurrences of the word t_i in the document d_j is denoted by n_{ij}, while n_j is the total number of words in the j-th document. The symbol |x| refers to the length of the vector x, and |A| denotes the cardinality of the set A. A set of clusters produced during a clustering process is called a clustering result.

III. REPRESENTATION OF A DOCUMENT

Computers process numbers much better than documents, and therefore documents are usually represented using numbers. Many different models that represent documents have been proposed. The most widespread representation of documents is called the Vector Space Model (VSM), which was presented by Salton, Wong and Yang in [4]. In this model the documents are represented by nonnegative sparse vectors in a t-dimensional space. Such vectors will be denoted by d_j, where j denotes the j-th document in the set:

    d_j = (w_{1j}, w_{2j}, \ldots, w_{tj})    (1)

where w_{ij} is the weight of the i-th term in the j-th document. Each term usually refers to a single word from the dictionary, but some algorithms, like Phrase-Intersection clustering, use phrases instead of single words. In such a case the dictionary would consist of phrases and t would denote the number of possible phrases.

There are several ways of calculating those weights, but typically the term frequency-inverse document frequency (tf-idf) method is used (see [5]). In this method each weight is a normalized product of two components: term frequency and inverse document frequency. Term frequency is calculated as a normalized number of occurrences of the term in the document. The term frequency of the i-th term in the j-th document, T_{ij}, is defined as:

    T_{ij} = \frac{n_{ij}}{n_j}    (2)

Term frequency represents how often a word is used in a document; the more often, the higher the weight.
To distinguish common words, which do not say anything useful in the context of document clustering, from the important words, inverse document frequency is used. The more documents contain a particular word, the lower the value of this term; the number of occurrences of the word within a single document is not relevant, only whether the word occurred or not. This term is calculated as the logarithm of the inverse of the fraction of documents in which the word occurred. For the i-th word, the inverse document frequency, I_i, is defined as:

    I_i = \log \frac{n}{|\{j : t_i \in d_j\}|}    (3)

There is a common problem associated with such equations, namely division by 0. To overcome this problem two approaches are possible. The idea of the first one is to add 1 to the denominator, but this changes the results. The second one, which is more common, is to assume that words that do not occur in any document are removed. Inverse document frequency solves the problem of common words, which should not have any influence on the clustering process. If a word occurs in all documents then the numerator and denominator are equal and the value of the whole term is 0. This removes the influence of words such as: a, the, so, are, etc. According to [4], a model without the idf component has lower precision and recall (these terms are explained in the next section) by 14% on average. The improvement was measured over the standard term frequency weighting. More about the VSM can be found in [6] and [7].

IV. DOCUMENT SIMILARITY AND EVALUATION OF CLUSTER QUALITY

The first part of this section discusses document similarity measures. The presented similarity measures assume that documents are represented using the VSM. Similarity measures are one of the key parts of almost every clustering algorithm. They can be used not only to determine the similarity between two documents, but also to calculate the similarity between a document and a cluster, and between two clusters. The most popular document similarity measures, according to [8], are Cosine, Dice and Jaccard. They are all based on the dot product; the only difference is the normalization factor.
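The tf-idf weighting and the dot-product-based similarity measures described above can be sketched as follows. This is a minimal illustration, not code from the paper; the function names, and the toy corpus, are my own.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """tf-idf weights: tf = n_ij / n_j (Eq. 2), idf = log(n / df_i) (Eq. 3)."""
    n = len(docs)
    counts = [Counter(doc) for doc in docs]
    vocab = sorted({w for c in counts for w in c})
    # Words with document frequency 0 never enter the vocabulary,
    # which is the second (more common) fix for division by zero.
    df = {w: sum(1 for c in counts if w in c) for w in vocab}
    vecs = []
    for c in counts:
        total = sum(c.values())
        vecs.append([(c[w] / total) * math.log(n / df[w]) for w in vocab])
    return vecs

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def dice(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return 2 * dot / (sum(a * a for a in u) + sum(b * b for b in v))

def jaccard(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sum(a * a for a in u) + sum(b * b for b in v) - dot)

docs = [["apple", "banana", "apple"], ["banana", "cherry"], ["cherry", "apple"]]
vecs = tfidf_vectors(docs)
```

Note that a word occurring in all n documents gets idf = log(n/n) = 0 and therefore weight 0, matching the remark about common words above.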
It is worth noticing that if the vectors are normalized then the Cosine and Dice measures produce the same result.

    Sim_c(d_j, d_q) = \frac{\sum_i w_{ij} w_{iq}}{\sqrt{\sum_i w_{ij}^2} \sqrt{\sum_i w_{iq}^2}}    (4)

    Sim_d(d_j, d_q) = \frac{2 \sum_i w_{ij} w_{iq}}{\sum_i w_{ij}^2 + \sum_i w_{iq}^2}    (5)

    Sim_j(d_j, d_q) = \frac{\sum_i w_{ij} w_{iq}}{\sum_i w_{ij}^2 + \sum_i w_{iq}^2 - \sum_i w_{ij} w_{iq}}    (6)

It is common to calculate document dissimilarity (distance) instead of similarity. This is a slightly different approach to the same problem: instead of saying how similar the documents are, we say how different they are. Usually the Euclidean distance is used in this approach (see [9]). Other metrics, like the Manhattan distance or the Minkowski distance, can also be used, but they are not as common here as they are in general data clustering.

Document similarity measures can be extended to measure the resemblance of all documents in a cluster. The whole group of such measures is called intra-cluster similarity measures (see [10]). The most widely used intra-cluster similarity measure is group average similarity. It is defined as the average pairwise similarity of documents in the cluster, and it is given by:

    d(C_i) = \frac{1}{|C_i|^2} \sum_{d_j, d_q \in C_i} sim(d_j, d_q)    (7)

If we use the cosine as the document similarity measure then, somewhat surprisingly, this formula can be significantly simplified. To simplify this formula we need to introduce the concept

of a centroid. The centroid of a cluster is a vector that is used to represent the whole cluster. It is defined as the average of all vectors that represent documents in the cluster; in this paper we denote the centroid of the i-th cluster by c_i:

    c_i = \frac{1}{|C_i|} \sum_{d_j \in C_i} d_j    (8)

The centroid does not have to have normalized length; in fact it almost never does. It can be easily proved that the intra-cluster similarity is equal to the squared length of the centroid vector:

    d(C_i) = \|c_i\|^2    (9)

The calculation of the similarity between two clusters can be simplified to the calculation of the similarity of their centroids. Another approach was presented in [11], where Olson introduced new cluster similarity metrics: single link, average link, complete link and median. They are defined as the minimum distance between documents in two clusters, the average distance, the maximum distance and the median of distances, respectively (see also [3]).

The last part of this section is devoted to external quality measures. In contrast to the previously mentioned measures, these need external knowledge about the real classes. This assumption prevents using them in a clustering algorithm as a criterion function. These measures reflect the similarity between the real classes and the ones obtained in the clustering process. One of the most fundamental measures is the entropy, which is based on the thermodynamical entropy. The entropy of a single cluster is defined as:

    E_j = -\sum_i p_{ij} \ln(p_{ij})    (10)

where p_{ij} is the probability that a document which is in the cluster j belongs to the class i. Here we assume that 0 \ln(0) = 0. To obtain the quality of all clusters, the weighted sum of the entropies of all clusters must be calculated, where the weight of each cluster is its size:

    E = \frac{1}{n} \sum_j n_j E_j    (11)

The F-measure, which was first introduced by C. J. van Rijsbergen in [12], is a different external quality measure. To understand this measure, two concepts must be introduced. The first of them is precision.
Precision(i, j) can be interpreted as the probability that a randomly chosen document from the j-th cluster belongs to the i-th class. The second concept is called recall, and it can be interpreted as the probability that a randomly chosen document from the i-th class belongs to the j-th cluster. The F-measure of the cluster j and class i is calculated as the harmonic mean of precision and recall, which is given by:

    F(i, j) = \frac{2 \cdot Recall(i, j) \cdot Precision(i, j)}{Precision(i, j) + Recall(i, j)}    (12)

The F-measure of the whole clustering result is given in the following form:

    F = \frac{1}{n} \sum_i n_i \max_j F(i, j)    (13)

Fig. 2. An illustration of recall and precision. Black dots represent documents that are in one category and the white ones represent documents that are in another category. Dots that are inside the ellipse correspond to the documents in one cluster.

More about the F-measure and other quality measures in information retrieval can be found in [13].

V. CLUSTERING ALGORITHMS

This section describes the most important clustering algorithms. Each of the presented algorithms fits into one of two categories: hierarchical clustering and non-hierarchical clustering (see [14]). Non-hierarchical clustering algorithms are often called flat methods or partitional methods. Hierarchical methods, in contrast to the partitional methods, do not create a single clustering result but a whole hierarchy of clusterings. There are two basic approaches to hierarchical clustering. The first of them is hierarchical agglomerative clustering (HAC) and the second one is hierarchical divisive clustering (HDC).

A. Hierarchical Agglomerative Algorithms

Hierarchical methods are often believed to produce better clustering results (see [15]), but their main disadvantage is the complexity. The lower bound on time complexity in hierarchical clustering is O(n^2). The results obtained by a hierarchical clustering algorithm can be viewed at different levels.
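The external quality measures just defined can be sketched as follows. This is my own minimal illustration, not code from the paper; `clusters` lists the document ids in each cluster, `labels` gives the true class of each document, and both names and the toy data are assumptions.

```python
import math
from collections import Counter

def entropy_quality(clusters, labels):
    """Weighted cluster entropy (Eqs. 10-11): 0 for a perfect clustering, lower is better."""
    n = sum(len(c) for c in clusters)
    total = 0.0
    for c in clusters:
        counts = Counter(labels[d] for d in c)
        e = -sum((m / len(c)) * math.log(m / len(c)) for m in counts.values())
        total += (len(c) / n) * e
    return total

def f_measure(clusters, labels):
    """Overall F (Eqs. 12-13): size-weighted best harmonic mean over clusters."""
    n = sum(len(c) for c in clusters)
    classes = Counter(labels.values())
    total = 0.0
    for cls, n_i in classes.items():
        best = 0.0
        for c in clusters:
            hits = sum(1 for d in c if labels[d] == cls)
            if hits == 0:
                continue
            precision = hits / len(c)   # P(doc from cluster is in class)
            recall = hits / n_i         # P(doc from class is in cluster)
            best = max(best, 2 * precision * recall / (precision + recall))
        total += (n_i / n) * best
    return total

clusters = [["d1", "d2"], ["d3", "d4"]]
labels = {"d1": "A", "d2": "A", "d3": "B", "d4": "B"}
```

For this perfect clustering the entropy is 0 and the F-measure is 1; mixing the classes across clusters moves both measures away from their optima.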
Each of the levels can be seen as a separate clustering result with a different number of clusters, although those levels are strongly connected, because the result on each level is based on the result of the previous level. This hierarchy of results is usually presented graphically using a tree called a dendrogram. It is a tree with a single node at the top and n nodes at the bottom. Each of the nodes at the bottom represents a cluster with a single document in it.

At the beginning of the agglomerative approach each document is in its own cluster, so on the first level the number of clusters is equal to n. In each iteration the two most similar clusters are merged. This process continues until only one cluster is left. That strategy is also called the bottom-up approach. An exhaustive survey of agglomerative hierarchical methods can be found in [14]. More about the efficiency of HAC can be found in [16]. The traditional agglomerative algorithm can be summarized as follows:

1) Compute the similarity matrix A, whose entry a_{ij} is the similarity between the i-th and j-th cluster.
2) Merge the two most similar clusters.

3) Update the similarity matrix (only the row and column that represent the new cluster must be updated).
4) Repeat steps 2 and 3 until all documents are in one cluster.

Fig. 3. A sample dendrogram. Looking bottom-up, in the first iteration documents D1 and D2 were merged into one cluster; in the second, documents D3 and D4 were grouped into one cluster. In the third iteration the cluster containing D1 and D2 was merged with the cluster containing D3 and D4, and finally in the last iteration all documents were put into one cluster.

B. Bisecting K-means

Bisecting K-means is a hierarchical divisive algorithm, which was introduced in [15]. Bisecting K-means should not be confused with the K-means algorithm, which is a partitional algorithm. The divisive strategy is in some sense the reverse of the agglomerative technique. At the beginning of the algorithm there is only one cluster, which contains all documents. In each iteration a chosen cluster is divided into two clusters. The process stops when each cluster contains only a single document. The bisecting K-means is described below:

1) Choose a cluster to split.
2) Apply K-means clustering with k = 2 to this cluster (the bisecting step).
3) Repeat the bisecting step ITER times and choose the split that produces the clustering with the highest overall similarity.
4) Repeat all steps until the desired number of clusters is obtained.

Empirical research in [15] shows that ITER = 5 is a good compromise between quality and time complexity. Increasing the ITER parameter would increase both the quality and the time complexity. This algorithm is very flexible, because steps 1 and 2 can be implemented in many different ways. The most intuitive way to choose a cluster in step 1 is to choose the largest cluster; that was the method used in [15]. The other possible approaches mentioned in [15] are to take the cluster with the least overall similarity, or to use a criterion based on both size and overall similarity. The bisecting step is even more adjustable, since any nondeterministic partitional clustering algorithm can be used in it, although taking K-means seems reasonable because it is a simple and effective algorithm. The K-means algorithm itself can also be adjusted in many ways, which will be described in the next subsection.

Another huge advantage of this algorithm is that it can be seen as either a flat clustering or a hierarchical clustering method. Using it as a partitional clustering might even produce better results, because the obtained results can be improved by running K-means on the desired number of clusters. In this case the bisecting K-means would just lead to finding the centroids for the clusters.

C. K-means

K-means clustering (see [17], [18]) is the most widely used partitional clustering technique. Its popularity is due to its simplicity and good results. The general description of the K-means algorithm is as follows:

1) Choose k, the number of clusters.
2) Choose k points as the centers of the clusters.
3) Assign each document to the nearest cluster.
4) If the stop criterion is met, then stop.
5) Recalculate the centers of the clusters.
6) Go to step 3.

Choosing k in the first step depends strongly on the dataset and the motivations behind the clustering, although some systematic approaches for choosing the number of clusters are given in [19]. The points in the second step are usually chosen at random for simplicity, but modifications such as k-means++ [20] might be used. A comprehensive survey of different methods for initializing K-means clustering can be found in [21], where the authors compare 11 different initialization methods. From those experiments it might be concluded that the best initialization methods are the ones proposed in [22], [23] and [24]. In order to assign a document to the nearest cluster, any of the similarity measures mentioned in Section IV can be used. The stop criterion is usually defined as: stop when no documents have been moved from one cluster to another. In the standard K-means algorithm the center of a cluster is calculated as the centroid of the documents in that cluster, but many modifications to this step have been proposed.
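The basic K-means loop described above can be sketched as below. This is a minimal Euclidean sketch with random initialization and the "no document moved" stop criterion, not a reference implementation; all names and the toy points are my own.

```python
import math
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Plain K-means: random initial centers, assign to nearest, recompute centroids."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)          # step 2: k random starting centers
    assignment = None
    for _ in range(max_iter):
        new_assignment = [                   # step 3: nearest center per point
            min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points
        ]
        if new_assignment == assignment:     # step 4: stop when nothing moved
            break
        assignment = new_assignment
        for c in range(k):                   # step 5: recompute centroids
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:                      # keep the old center if a cluster empties
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return assignment, centers

points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
assignment, centers = kmeans(points, 2)
```

On this toy data the two points near the origin end up in one cluster and the two distant points in the other, regardless of which initial centers are drawn.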
One of them is the k-medians algorithm, which in this step calculates the median instead of the mean; this median does not have to be an actual document, because the median of each term is calculated separately. Another modification was presented as the k-medoid algorithm, where one of the documents is chosen to be the center of a cluster. This document is usually chosen as the document with the lowest pairwise dissimilarity.

D. Algorithms Based on Criterion Functions

Many partitional clustering algorithms are based on the optimization of a global criterion function. In some of these algorithms the criterion function might be exchanged easily. Examples of such algorithms are CobWeb [25] and AutoClass [26]. Such algorithms consist of two main components: the criterion function used for the optimization, and the optimization algorithm itself. Many different clustering algorithms can easily be created if a list of criterion functions and optimization algorithms is available, because every combination of an optimization algorithm and a criterion function creates a new clustering algorithm.

In this section we will focus mostly on the criterion functions instead of the algorithms, because according to [9] and [27] a greedy strategy produces results comparable to those obtained using more complicated algorithms. These more complicated algorithms include concepts like iterative schemes based on hill-climbing methods, spectral-based optimizers and so on. The reader interested in these algorithms is referred to [28], [29], [30], [31], [32] and [33]. A great survey of criterion functions can be found in [9]; the criterion functions tested in [9] are presented below.

    f_1: maximize \sum_{r=1}^{k} n_r \left( \frac{1}{n_r^2} \sum_{d_i, d_j \in C_r} \cos(d_i, d_j) \right)    (14)

    f_2: maximize \sum_{r=1}^{k} \sum_{d_i \in C_r} \cos(d_i, c_r)    (15)

    f_3: minimize \sum_{r=1}^{k} n_r \cos(c_r, c)    (16)

    f_4: maximize \frac{\sum_{r=1}^{k} n_r \left( \frac{1}{n_r^2} \sum_{d_i, d_j \in C_r} \cos(d_i, d_j) \right)}{\sum_{r=1}^{k} n_r \cos(c_r, c)}    (17)

    f_5: maximize \frac{\sum_{r=1}^{k} \sum_{d_i \in C_r} \cos(d_i, c_r)}{\sum_{r=1}^{k} n_r \cos(c_r, c)}    (18)

    f_6: minimize \sum_{r=1}^{k} \frac{cut(S_r, S - S_r)}{\sum_{d_i, d_j \in S_r} \cos(d_i, d_j)}    (19)

    f_7: minimize \sum_{r=1}^{k} \frac{cut(V_r, V - V_r)}{W(V_r)}    (20)

where c denotes the centroid of all documents, V_r is the set of vertices assigned to the r-th cluster, and W(V_r) is the sum of the weights of the adjacency lists of V_r. The idea behind each of these functions is as follows:

f_1: the weighted sum of the average pairwise similarities between the documents in each cluster; the weights are normalized by the size of the cluster.
f_2: the sum of the cosines between each document and the centroid of its cluster.
f_3: the sum of the cosines between each cluster centroid and the centroid of all documents.
f_4: the quotient of f_1 and f_3.
f_5: the quotient of f_2 and f_3.
f_6: the edge-cut of each partition scaled by the sum of the cluster's internal edges.
f_7: the normalized edge-cut of the partitioning.

The entropy was used as the measure of quality of the results. Extensive experiments made on 15 different datasets showed that none of these functions is superior to the others, although some functions performed better.
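As a concrete illustration of one of these criteria, the f_2 function (the sum of cosines between each document and its cluster centroid) can be evaluated as below. This is my own sketch, not code from [9]; the names and the toy clustering are assumptions.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def f2(clusters):
    """Criterion f_2 (Eq. 15): sum over clusters of cos(document, cluster centroid)."""
    total = 0.0
    for docs in clusters:
        centroid = [sum(x) / len(docs) for x in zip(*docs)]
        total += sum(cosine(d, centroid) for d in docs)
    return total

# Two tight clusters: every document is nearly aligned with its centroid,
# so f_2 approaches the number of documents (its maximum).
clusters = [[(1.0, 0.0), (0.9, 0.1)], [(0.0, 1.0), (0.1, 0.9)]]
```

A greedy optimizer would move documents between clusters as long as the move increases f_2; mixing the two directions in one cluster visibly lowers the score.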
In almost all tests f_2, f_4 and f_5 outperformed all other functions, and the differences in quality between those three were not significant. It is interesting that some functions which seem to be very similar, like f_1 and f_2, produced very different results; the results varied by up to twenty percent.

E. Word-Intersection Clustering

Oren Zamir in [34] pointed out that reducing the number of returned documents does not necessarily make browsing easier for the user. He claims that efficient document clustering might be used as a good method for navigating through a large collection of documents; furthermore, it might find patterns that would normally be missed. Zamir also created a six-point list of requirements for a clustering algorithm in the context of document browsing:

1) Ease of browsing: results must be presented in such a way that the user can immediately decide whether some content is useful or not.
2) Speed: the clustering should be fast enough to create results within seconds.
3) Scalability: the method should be able to cluster large sets of documents.
4) No preprocessing: the algorithm should not rely on preprocessing, because it is applied to a dynamically generated set of documents.
5) Client-side execution: the system should be able to run the clustering algorithm on the client side. This is motivated by the fact that the server might not have enough computational power to produce a clustering for each user.
6) Snippet-capable: the algorithm should not require whole documents, but only small snippets of them. Given client-side execution this requirement is necessary, because transferring whole documents could be more time consuming than the algorithm itself.

Word-Intersection Clustering is a hierarchical agglomerative algorithm with a few new concepts. The first of them is that clusters are characterized by the set of words that occur in every document in the cluster. This automatically solves the problem of cluster description, because those common words can be used as the description of a cluster.
The number of words that are shared by all the documents in the cluster is called the cohesion, and it is denoted by h(C_i). Another new concept is the score function s(C_i), which expresses the decreasing marginal significance of the cohesion:

    s(C_i) = |C_i| \cdot \frac{1 - \exp(-\beta h(C_i))}{1 + \exp(-\beta h(C_i))}    (21)

The most important new feature of this algorithm is the Global Quality Function, which quantifies the quality of a clustering. It is defined as:

    GQF(C) = \frac{f(C)}{g(|C|)} \sum_{C_i \in C} s(C_i)    (22)

where f(C) is a function that is proportional to the normalized number of clustered documents. A document is considered to be clustered when the size of the cluster that

contains it is greater than 1, while g(|C|) is an increasing function of the number of clusters. The algorithm itself can be summarized as:

1) Put each document in its own cluster.
2) If there are no two clusters whose merge would increase GQF, halt.
3) Merge the two clusters whose merge increases GQF the most.
4) Go to step 2.

The experiments made in [34] showed that this algorithm is faster than the classical hierarchical agglomerative algorithm with the cosine similarity measure, which is referred to in that paper as COS-GAVG. In addition, the quality of the results obtained by Word-Intersection Clustering was significantly better than the results produced by COS-GAVG. These results show that this algorithm is very promising.

VI. CONCLUSIONS

In this article we have presented the most important concepts related to document clustering. Our survey confirmed that there is no superior text clustering algorithm. Document clustering algorithms vary in both complexity and quality of results. Hierarchical algorithms are in general considered to perform better than flat methods, but on the other hand partitional algorithms are much faster. It seems that the most promising algorithms are hybrid methods like bisecting k-means; this algorithm outperformed partitional clustering algorithms without a significant increase in complexity. The clustering algorithm should be chosen depending on the available time: hierarchical algorithms are well suited for precomputation, while partitional algorithms are good for online computation.

The complexity of clustering algorithms will play a crucial role in the future, due to the continuous increase in the number of documents. On the other hand, the speed of computers will increase more slowly, because Moore's law (which states that the number of transistors that can be placed on integrated circuits, without any additional costs, doubles every two years) is considered to be broken. These premises lead to the conclusion that parallel and distributed clustering algorithms should receive more attention in the future.
Many parallel data clustering algorithms have already been designed (see [35], [11], [36]), but not much has been done in the field of parallel document clustering.

Besides the complexity, another potential problem is associated with higher demands on quality. These demands might lead to the development of a more linguistic approach to the document clustering problem. Present algorithms do not take into account the order of the words in a sentence. In some cases the order of words might be relevant in the process of clustering. This task is extremely hard to implement, because each language has its own grammar rules, which determine whether the order in a particular case is important or not.

REFERENCES

[1] Y. Labrou and T. Finin, "Yahoo! as an ontology: using Yahoo! categories to describe documents," in Proceedings of the Eighth International Conference on Information and Knowledge Management (CIKM '99), New York, NY, USA: ACM, 1999.
[2] A. K. Jain, M. N. Murty, and P. J. Flynn, "Data clustering: a review," ACM Comput. Surv., vol. 31.
[3] D. R. Cutting, D. R. Karger, J. O. Pedersen, and J. W. Tukey, "Scatter/gather: a cluster-based approach to browsing large document collections," in Proceedings of the 15th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '92), New York, NY, USA, 1992.
[4] G. Salton, A. Wong, and C. S. Yang, "A vector space model for automatic indexing," Commun. ACM, vol. 18.
[5] G. Salton and C. Buckley, "Term weighting approaches in automatic text retrieval," Cornell University, Ithaca, NY, USA, Tech. Rep.
[6] S. K. M. Wong, W. Ziarko, V. V. Raghavan, and P. C. N. Wong, "On modeling of information retrieval concepts in vector spaces," ACM Trans. Database Syst., vol. 12.
[7] X. Tai, M. Sasaki, Y.
Tanaka, and K. Kita, "Improvement of vector space information retrieval model based on supervised learning," in Proceedings of the Fifth International Workshop on Information Retrieval with Asian Languages (IRAL '00), New York, NY, USA, 2000.
[8] G. Salton, Ed., Automatic Text Processing. Boston, MA, USA: Addison-Wesley Longman Publishing Co., Inc.
[9] Y. Zhao and G. Karypis, "Empirical and theoretical comparisons of selected criterion functions for document clustering," Mach. Learn., vol. 55.
[10] H. J. Zeng, Q. C. He, Z. Chen, W. Y. Ma, and J. Ma, "Learning to cluster web search results," in Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '04), New York, NY, USA, 2004.
[11] C. F. Olson, "Parallel algorithms for hierarchical clustering," Parallel Comput., vol. 21.
[12] C. J. van Rijsbergen, Information Retrieval, 2nd ed. Newton, MA, USA: Butterworth-Heinemann.
[13] J. Makhoul, F. Kubala, R. Schwartz, and R. Weischedel, "Performance measures for information extraction," in Proceedings of DARPA Broadcast News Workshop, 1999.
[14] A. El-Hamdouchi and P. Willett, "Comparison of hierarchic agglomerative clustering methods for document retrieval," The Computer Journal, vol. 32.
[15] M. Steinbach, G. Karypis, and V. Kumar, "A comparison of document clustering techniques," 2000.
[16] W. H. Day and H. Edelsbrunner, "Efficient algorithms for agglomerative hierarchical clustering methods," Journal of Classification, vol. 1, pp. 7-24.
[17] G. A. Wilkin and X. Huang, "A practical comparison of two k-means clustering algorithms," BMC Bioinformatics, vol. 9, p. S19.
[18] J. Wu, H. Xiong, and J. Chen, "Adapting the right measures for k-means clustering," in Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '09), New York, NY, USA, 2009.
[19] M. Chiang and B.
Mirkin, "Experiments for the number of clusters in k-means," in Progress in Artificial Intelligence, ser. Lecture Notes in Computer Science, J. Neves, M. Santos, and J. Machado, Eds. Springer Berlin / Heidelberg, 2007, vol. 4874.
[20] D. Arthur and S. Vassilvitskii, "k-means++: the advantages of careful seeding," in Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA '07), Philadelphia, PA, USA, 2007.
[21] R. Maitra, A. Peterson, and A. Ghosh, "A systematic evaluation of different methods for initializing the k-means clustering algorithm," IEEE Transactions on Knowledge and Data Engineering, 2010.

[22] G. W. Milligan and P. D. Isaac, "The validation of four ultrametric clustering algorithms," Pattern Recognition, vol. 12, no. 2.
[23] P. S. Bradley and U. M. Fayyad, "Refining initial points for k-means clustering," in Proceedings of the Fifteenth International Conference on Machine Learning (ICML '98), San Francisco, CA, USA, 1998.
[24] B. Mirkin, Clustering for Data Mining: A Data Recovery Approach. Chapman and Hall/CRC.
[25] D. H. Fisher, "Knowledge acquisition via incremental conceptual clustering," Mach. Learn., vol. 2.
[26] P. Cheeseman and J. Stutz, "Bayesian classification (AutoClass): theory and results," in Advances in Knowledge Discovery and Data Mining, U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, Eds. Menlo Park, CA, USA: American Association for Artificial Intelligence, 1996.
[27] S. Savaresi, D. L. Boley, S. Bittanti, and G. Gazzaniga, "Choosing the cluster to split in bisecting divisive clustering algorithms," in Second SIAM International Conference on Data Mining.
[28] M. Meila and D. Heckerman, "An experimental comparison of model-based clustering methods," Mach. Learn., vol. 42, pp. 9-29.
[29] G. Karypis, E. Han, and V. Kumar, "Chameleon: hierarchical clustering using dynamic modeling," Computer, vol. 32.
[30] D. Boley, "Principal direction divisive partitioning," Data Min. Knowl. Discov., vol. 2.
[31] H. Zha, X. He, C. Ding, H. Simon, and M. Gu, "Bipartite graph partitioning and data clustering," in Proceedings of the Tenth International Conference on Information and Knowledge Management (CIKM '01), New York, NY, USA, 2001.
[32] H. Zha, X. He, C. Ding, H. Simon, and M. Gu, "Spectral relaxation for k-means clustering," in Advances in Neural Information Processing Systems. MIT Press, 2001.
[33] I. S. Dhillon and D. S.
Modha, Concept decompostons for large sparse text data usng clusterng, Mach. Learn., vol. 42, pp , January [Onlne]. Avalable: [34] O. Zamr, O. Etzon, O. Madan, and R. M. Karp, Fast and ntutve clusterng of web documents, n In Proceedngs of the 3rd Internatonal Conference on Knowledge Dscovery and Data Mnng, 1997, pp [35] M. Dash, S. Petrutu, and P. Scheuermann, Effcent parallel herarchcal clusterng, n Internatonal Europar Conference, [36] Y. Song, W. Chen, H. Ba, C. Ln, and E. Chang, Parallel spectral clusterng, Machne Learnng and Knowledge Dscovery n Databases, pp , [Onlne]. Avalable: 25 [37] Y. Lu, J. Mostafa, and W. Ke, A fast onlne clusterng algorthm for scatter/gather browsng, [38] D. R. Cuttng, D. R. Karger, and J. O. Pedersen, Constant nteractontme scatter/gather browsng of very large document collectons, n Proceedngs of the 16th annual nternatonal ACM SIGIR conference on Research and development n nformaton retreval, ser. SIGIR 93, New York, NY, USA, 1993, pp [Onlne]. Avalable:

Machine Learning: Algorithms and Applications

Unsupervised Learning and Clustering

Outline. Type of Machine Learning. Examples of Application. Unsupervised Learning

Hierarchical clustering for gene expression data analysis

CS434a/541a: Pattern Recognition Prof. Olga Veksler. Lecture 15

Term Weighting Classification System Using the Chi-square Statistic for the Classification Subtask at NTCIR-6 Patent Retrieval Task

Tsinghua University at TAC 2009: Summarizing Multi-documents by Information Distance

A Fast Content-Based Multimedia Retrieval Technique Using Compressed Data

Unsupervised Learning

Parallelism for Nested Loops with Non-uniform and Flow Dependences

Machine Learning. Topic 6: Clustering

Support Vector Machines

Learning the Kernel Parameters in Kernel Minimum Distance Classifier

K-means and Hierarchical Clustering

Web Mining: Clustering Web Documents A Preliminary Review

Cluster Analysis of Electrical Behavior

Content Based Image Retrieval Using 2-D Discrete Wavelet with Texture Feature with Different Classifiers

Document Representation and Clustering with WordNet Based Similarity Rough Set Model

The Research of Support Vector Machine in Agricultural Data Classification

Query Clustering Using a Hybrid Query Similarity Measure

UB at GeoCLEF 2006

A Unified Framework for Semantics and Feature Based Relevance Feedback in Image Retrieval Systems

Mathematics 256 a course in differential equations for engineering students

Intelligent Information Acquisition for Improved Clustering

A Binarization Algorithm specialized on Document Images and Photos

An Optimal Algorithm for Prufer Codes

A Webpage Similarity Measure for Web Sessions Clustering Using Sequence Alignment

Performance Evaluation of Information Retrieval Systems

Subspace clustering

A Fast Visual Tracking Algorithm Based on Circle Pixels Matching

Steps for Computing the Dissimilarity, Entropy, Herfindahl-Hirschman and Accessibility (Gravity with Competition) Indices

Determining the Optimal Bandwidth Based on Multi-criterion Fusion

An Application of the Dulmage-Mendelsohn Decomposition to Sparse Null Space Bases of Full Row Rank Matrices

Subsets and the Boolean operations on sets

Sum of Linear and Fractional Multiobjective Programming Problem under Fuzzy Rules Constraints

NUMERICAL SOLVING OPTIMAL CONTROL PROBLEMS BY THE METHOD OF VARIATIONS

Unsupervised Learning and Clustering

Classifier Selection Based on Data Complexity Measures

The Greedy Method

Collaboratively Regularized Nearest Points for Set Based Recognition

Self-Organizing Maps (SOM)

A PATTERN RECOGNITION APPROACH TO IMAGE SEGMENTATION

Compiler Design (Spring 2014): Register Allocation Sample Exercises and Solutions, Prof. Pedro C. Diniz

A Deflected Grid-based Algorithm for Clustering Analysis

Some material adapted from Mohamed Younis, UMBC CMSC 611 Spr 2003 course slides; some material adapted from Hennessy & Patterson / 2003 Elsevier

Load Balancing for Hex-Cell Interconnection Network

Machine Learning 9. week

A MOVING MESH APPROACH FOR SIMULATION BUDGET ALLOCATION ON CONTINUOUS DOMAINS

TECHNIQUE OF FORMATION HOMOGENEOUS SAMPLE SAME OBJECTS

Edge Detection in Noisy Images Using the Support Vector Machines

Detection of an Object by using Principal Component Analysis

Concurrent Apriori Data Mining Algorithms

Parallel matrix-vector multiplication

COSC 320 Advanced Data Structures and Algorithms: Course Introduction

LinkSelector: A Web Mining Approach to Hyperlink Selection for Web Portals

CHAPTER 2 DECOMPOSITION OF GRAPHS

Advanced in Control Engineering and Information Science

Reducing Frame Rate for Object Tracking

User Authentication Based On Behavioral Mouse Dynamics Biometrics

Meta-heuristics for Multidimensional Knapsack Problems

6.854 Advanced Algorithms Petar Maymounkov Problem Set 11 (November 23, 2005)

CS 534: Computer Vision Model Fitting

HCMX: AN EFFICIENT HYBRID CLUSTERING APPROACH FOR MULTI-VERSION XML DOCUMENTS

Classic Term Weighting Technique for Mining Web Content Outliers

Hybridization of Expectation-Maximization and K-Means Algorithms for Better Clustering Performance

Optimizing Document Scoring for Query Retrieval

Solving two-person zero-sum game by Matlab

SLAM Summer School 2006 Practical 2: SLAM using Monocular Vision

Fuzzy C-Means Initialized by Fixed Threshold Clustering for Improving Image Retrieval

FINDING IMPORTANT NODES IN SOCIAL NETWORKS BASED ON MODIFIED PAGERANK

Assignment #2, Farrukh Jabeen, Algorithms 510 (due June 15, 2009)

Discriminative classifiers for image recognition

Information Retrieval

A Modified Fuzzy ART for Soft Document Clustering

Module Management Tool in Software Development Organizations

BRDPHHC: A Balance RDF Data Partitioning Algorithm based on Hybrid Hierarchical Clustering

Description of NTU Approach to NTCIR3 Multilingual Information Retrieval

Smoothing Spline ANOVA for variable screening

Problem Definitions and Evaluation Criteria for Computational Expensive Optimization

Complex Numbers

Analyzing Popular Clustering Algorithms from Different Viewpoints

Hierarchical agglomerative Cluster Analysis

BioTechnology. An Indian Journal

A Two-Stage Algorithm for Data Clustering

MULTISPECTRAL IMAGES CLASSIFICATION BASED ON KLT AND ATR AUTOMATIC TARGET RECOGNITION

Data Representation in Digital Design, a Single Conversion Equation and a Formal Languages Approach

A fast algorithm for color image segmentation

Determining Fuzzy Sets for Quantitative Attributes in Data Mining Problems

Graph-based Clustering

Hermite Splines in Lie Groups as Products of Geodesics

Lecture 5: Multilayer Perceptrons

More information

Querying by sketch geographical databases. Yu Han 1, a *

Querying by sketch geographical databases. Yu Han 1, a * 4th Internatonal Conference on Sensors, Measurement and Intellgent Materals (ICSMIM 2015) Queryng by sketch geographcal databases Yu Han 1, a * 1 Department of Basc Courses, Shenyang Insttute of Artllery,

More information

Optimization Methods: Integer Programming Integer Linear Programming 1. Module 7 Lecture Notes 1. Integer Linear Programming

Optimization Methods: Integer Programming Integer Linear Programming 1. Module 7 Lecture Notes 1. Integer Linear Programming Optzaton Methods: Integer Prograng Integer Lnear Prograng Module Lecture Notes Integer Lnear Prograng Introducton In all the prevous lectures n lnear prograng dscussed so far, the desgn varables consdered

More information

Programming in Fortran 90 : 2017/2018

Programming in Fortran 90 : 2017/2018 Programmng n Fortran 90 : 2017/2018 Programmng n Fortran 90 : 2017/2018 Exercse 1 : Evaluaton of functon dependng on nput Wrte a program who evaluate the functon f (x,y) for any two user specfed values

More information

Lobachevsky State University of Nizhni Novgorod. Polyhedron. Quick Start Guide

Lobachevsky State University of Nizhni Novgorod. Polyhedron. Quick Start Guide Lobachevsky State Unversty of Nzhn Novgorod Polyhedron Quck Start Gude Nzhn Novgorod 2016 Contents Specfcaton of Polyhedron software... 3 Theoretcal background... 4 1. Interface of Polyhedron... 6 1.1.

More information

Non-Split Restrained Dominating Set of an Interval Graph Using an Algorithm

Non-Split Restrained Dominating Set of an Interval Graph Using an Algorithm Internatonal Journal of Advancements n Research & Technology, Volume, Issue, July- ISS - on-splt Restraned Domnatng Set of an Interval Graph Usng an Algorthm ABSTRACT Dr.A.Sudhakaraah *, E. Gnana Deepka,

More information

Recommender System based on Higherorder Logic Data Representation

Recommender System based on Higherorder Logic Data Representation Recommender System based on Hgherorder Logc Data Representaton Lnna L Bngru Yang Zhuo Chen School of Informaton Engneerng, Unversty Of Scence and Technology Bejng Abstract Collaboratve Flterng help users

More information

Deep Classification in Large-scale Text Hierarchies

Deep Classification in Large-scale Text Hierarchies Deep Classfcaton n Large-scale Text Herarches Gu-Rong Xue Dkan Xng Qang Yang 2 Yong Yu Dept. of Computer Scence and Engneerng Shangha Jao-Tong Unversty {grxue, dkxng, yyu}@apex.sjtu.edu.cn 2 Hong Kong

More information

Keywords - Wep page classification; bag of words model; topic model; hierarchical classification; Support Vector Machines

Keywords - Wep page classification; bag of words model; topic model; hierarchical classification; Support Vector Machines (IJCSIS) Internatonal Journal of Computer Scence and Informaton Securty, Herarchcal Web Page Classfcaton Based on a Topc Model and Neghborng Pages Integraton Wongkot Srura Phayung Meesad Choochart Haruechayasak

More information

KOHONEN'S SELF ORGANIZING NETWORKS WITH "CONSCIENCE"

KOHONEN'S SELF ORGANIZING NETWORKS WITH CONSCIENCE Kohonen's Self Organzng Maps and ther use n Interpretaton, Dr. M. Turhan (Tury) Taner, Rock Sold Images Page: 1 KOHONEN'S SELF ORGANIZING NETWORKS WITH "CONSCIENCE" By: Dr. M. Turhan (Tury) Taner, Rock

More information