Efficient structural similarity computation between XML documents

Ali Aïtelhadj
Computer Science Department, Faculty of Electrical Engineering and Computer Science, Mouloud Mammeri University of Tizi-Ouzou (UMMTO), Tizi-Ouzou, Algeria

Abstract

This work describes a new approach for calculating the structural similarity of XML documents. In practice, the majority of existing work on XML document clustering considers the tree structures of these documents as mere vectors and, therefore, does not take into account their hierarchical contexts. Furthermore, in order to calculate the structural similarity of XML documents, most methods encountered in these works perform a depth-first traversal to visit the nodes of the tree structures of these documents. More precisely, the preorder tree walk is usually the most used. Recently, other studies have presented an alternative approach that takes into account the hierarchical contexts of these tree structures, but unfortunately they have a particularly high time complexity in the calculation of structural similarity. In this paper, we propose a new method based on a breadth-first traversal of these tree structures. The goal is to cluster XML documents sharing similar structures more rapidly. Besides being fast, the method also takes into account the hierarchical contexts of XML documents. Reconciling the speed required for clustering XML documents with consideration of the hierarchical contexts of their tree structures ensures a higher reliability of the proposed method. To validate our proposal, experiments were conducted on both real and synthetic XML data. The results clearly demonstrate the viability of our approach.

Keywords: Clustering, Structural similarity, Hierarchical context, Tree level, Ancestor and descendant levels, Depth- and breadth-first traversals.

1. Introduction

XML has now become an unchallenged standard for the representation and exchange of data on the web. This has led to an increase in heterogeneous XML sources.
Furthermore, not only are collections of XML documents reused, but their interchange volume is continuously growing. However, with the currently available means, searching for information in these documents is not a trivial task. XML documents are characterized by content and structure, and such documents cannot be exploited efficiently by conventional information retrieval methods. Indeed, these methods are based on content-oriented models, while the XML format allows adding structural constraints. These models must therefore be adapted to better exploit the available XML data. Similarly, traditional approaches to data processing, such as relational databases, have proven ineffective. They are mainly designed for strongly structured data, whereas XML data are semi-structured [0]. In addition, given the heterogeneity and proliferation of XML documents on the web, it becomes difficult for a user to access the desired information. In this context, many authors propose classification methods to organize and analyze large collections of XML documents. Our work falls within this perspective; we are interested in the clustering of XML documents based on their structures. The idea behind the clustering is that if XML documents share similar structures, they are more likely to correspond to the structural part of the same query. This allows reducing the response time and increasing the accuracy of search engines. In other words, it can substantially improve the process of information retrieval: the search for relevant information in a large collection of documents comes down to interrogating small classes of documents.

The XML clustering task consists in grouping XML documents into clusters containing similar documents. This similarity can be thematic or structural. In this paper, we are particularly interested in XML document clustering using the structural similarity of their descriptions, i.e., the XML ordered labeled tree providing the relations between the document elements.
We will therefore address the structural clustering of XML documents as a problem of clustering tree structures [,, 3, 4, 4]. In other words, an approach to the structural clustering of XML documents can be exploited in various areas that require management of hierarchical structures, such as the discovery of structurally similar web navigational pathways, or tree-like patterns, and the discovery of structurally similar macromolecular tree patterns in bioinformatics [4, 34, 7]. Structural similarity allows grouping documents that share similar structures []. It helps, on the one hand, to better organize XML documents and, on the other hand, to better answer, in terms of efficiency and effectiveness, queries containing structural conditions. We recall that queries in XML information retrieval may contain keywords only, or keywords and structural conditions.

The main question we address in this context is how to cluster XML documents structurally when their DTD is unknown. Our methodology in this paper has two steps. In the first step, each XML document is represented by its tree summary structure, which is used as a representation model to classify the corresponding XML document. In the second step, an efficient structural similarity measure based on a breadth-first traversal of these tree summary structures is proposed. Within this framework, the most important question deals with the way to measure the structural similarity of XML documents. This is the question we attempt to answer in this work.

This paper is organized as follows. Section 2 provides a summary view of related work on the classification of XML documents by structure. Section 3 describes our clustering approach. Section 4 is dedicated to the experimentation. Finally, in Section 5 we conclude and describe future work.

2. State-of-the-Art

Classification approaches are divided into two main variants, called supervised classification and unsupervised classification (or clustering). Existing works on the classification of XML documents can be distinguished by the way they represent the documents, but also by the classification and/or clustering methods used. We focus here on approaches that represent documents by structure only. Within this framework, we distinguish two main categories: document-based clustering, where the clustering is based on the document structure itself, and DTD-based clustering, where documents are clustered according to their DTD. We briefly describe below some of the best-known approaches, highlighting their main features. We focus particularly on approaches that represent the document structures by labeled trees.

In the first category, namely clustering based on document structure, the structure used to classify a document is issued from the document itself. This structure can be either a labeled tree corresponding to the original structure of the XML document (the whole structure of the document) [5,, 9, 30, 3, 40] or a rooted ordered labeled tree summary [,, 3, 4]. The latter has the advantage of reducing the computational complexity of the clustering, but has the drawback of breaking the relationships between XML elements. In the study by [4], a tree summary is obtained by two transformations, as shown in Fig. 1: (i) the first one reduces the depth of the tree so that the children of any node having the same label as one of its ancestors become direct descendants (children) of this ancestor; (ii) the second one eliminates duplication of sibling nodes. In the study by [,, 3], by contrast, the tree summary is obtained only by eliminating duplication of sibling nodes, i.e., hierarchical relationships between XML elements are not completely changed, and so there is no loss of information.

Fig. 1 Tree summary extraction

In the study by [, 5, 39], another way of representing the structure of XML documents is proposed: these approaches are based on the discovery of the sub-trees that occur most frequently in collections of labeled trees representing XML documents. More precisely, in the study by [39], the frequent sub-trees can be unordered, whereas in the other two approaches [, 5] they must be ordered. All sub-trees below the arrow in Fig. 2 are frequent according to the approach of [39]. The last two of them are exact and ordered, so they are also frequent according to the approaches of [, 5]. This indicates that the approach of [39] is more general and easily applied to heterogeneous XML documents, as opposed to the approaches of [, 5], which only apply to XML documents sharing the same DTD or XML Schema. Another proposal [0] consists of linearizing the structure of each XML document by representing it as a numerical sequence and then comparing such sequences through the analysis of their frequencies. The clustering then consists in comparing the extracted structure with a cluster or its representative. This representative, usually called the centroid, is the most representative tree summary of all XML documents in the cluster [,, 3].
The centroid may change, i.e., it can be replaced by a more appropriate tree, depending on the assignment of new documents to the cluster. In the study by [39], a cluster is characterized by the maximal frequent sub-tree, i.e., the frequent sub-tree that has the greatest number of nodes among all sub-trees in this cluster.

Note that with the frequent sub-trees technique, an XML document may belong to several clusters. In the study by [4,, 9, 30, 3, 40], each cluster is represented by a subset of similarly structured XML documents, whereas with the approaches of [,, 4, 39], a cluster has only one representative (the centroid or the maximal frequent sub-tree); this means that, in the process of detecting the appropriate cluster, the representative of a new document is compared only with the representative of the cluster. Note that in some approaches, like [,, 4], a new comparison is undertaken for a possible change of the centroid.

Fig. 2 Frequent sub-trees detection

Concerning the clustering process itself, several approaches have been proposed. The authors in [30, 40] propose an incremental clustering based on common path similarity, taking into account different criteria such as the number of common nodes between the XML tree (that of the XML document to be classified) and the trees of the cluster considered, the number of common node paths, and the order of the nodes of the XML tree. In the study by [,, 3], the authors also perform an incremental clustering, but it is based on the structural similarity between the centroid and the structure of the XML document to be classified. For detecting structural similarity between XML documents, the authors in [0] exploit the theory of the discrete Fourier transform to effectively and efficiently compare the encoded documents (i.e., signals) in the frequency domain. This approach differs significantly from standard methods based on graph-matching algorithms and allows a significant reduction in the required computation costs. Indeed, if N is the maximum number of tags in two documents, their matching complexity is O(N log N), whereas it is O(N^2) with methods based on edit distance, such as Chawathe's algorithm [6] and the one proposed in [4]. In the study by [, 4,, 3], the similarity between two trees is based on their edit distance. Edit distance measures the number of elementary operations needed to transform one tree into another.
Most algorithms for calculating the tree edit distance are based on dynamic programming techniques [6 9, 35, 37, 4, 4]. Note, however, that there may be several sequences of edit operations that transform one tree into another. Therefore, the cost of the operations in each sequence is considered, and the lowest-cost sequence among these defines the edit distance between the trees [3]. Edit distance allows clustering these trees using a bottom-up hierarchical classification method [, 4,, 3]. In the clustering approaches based on frequent sub-trees, the authors of [39] have developed an algorithm to detect the maximal sub-tree. Similarly, the authors in [5] have developed an algorithm very close to the FREQT and TREEMINER algorithms proposed respectively by [4] and [45]. An XML tree can appear in multiple clusters. In other words, an XML document can be shared by several clusters, i.e., it consists of several sub-trees appearing in several different clusters. Given these characteristics, these approaches belong to the family of non-exclusive (or overlapping) clustering.

Concerning the second category of approaches, namely DTD classification, we list below some of the best known. Recall that a DTD is considered as a context-free grammar that generates a potentially infinite number of XML documents. For this reason, instead of directly classifying the XML documents, these approaches classify their DTDs into clusters. Thus, each cluster becomes the representative of a set of structurally similar XML documents. The advantage is that a considerable number of XML documents can be integrated together more quickly. The drawback is that the nodes of DTD trees often denote regular expressions whose handling is not always a trivial task and furthermore causes (in some cases) loss of information [8]. In [8], the authors propose a DTD clustering model named XClust. Each DTD is represented by its tree structure.
The similarity of two nodes is calculated by exploiting different levels of the tree, namely the ontological similarity of the nodes (using a dictionary), the similarity of their immediate descendants (children), the similarity of their ancestors and, finally, that of the leaves of the sub-trees of which they are respectively the roots. In [36], the authors propose a mechanism which syntactically identifies the similarity of DTDs by adopting an ascending clustering strategy. Compared with XClust, the authors in [4] exploit only the immediate descendant context. In [7], the authors develop an algorithm based on a generic scheme of DTD matching. The matching goal is to reach a median scheme corresponding to the DTDs that are similarly structured. Like [8], in order to calculate the DTD similarities, [7] relies on the dictionary, but it only exploits the leaves context. Finally, the approach proposed in [8] is based on learning and inference combined with an instance of a DTD. In fact, it works with supervised classification, i.e., the classes are known before starting this classification.

There exist other methods, such as the approaches of [6, 9, 3, 4, 5, 4, 43, 44], originally dedicated to classifying or clustering XML documents using both structure and content. These methods are in fact more general, since they offer flexible models that can easily be adapted for dealing with structure only. For instance, in the study by [6, 43], the authors propose a network-based stochastic model that is able to describe different kinds of relationships between XML elements. It was shown that this model is easily adapted to the structure alone. The model is based on Bayesian networks [3] to infer the different types of structural relationships in an XML tree.

3. Clustering approach based on structural similarity

Our clustering approach belongs to the first category, namely clustering based on document structure. The structure used to structurally classify a document is issued from the document itself. Specifically, for classifying (clustering in our case) XML documents by structure, we suggest the use of their corresponding labeled tree structural summaries. The labels correspond to tags or attributes. In our approach, attributes are treated as mere tags. A labeled tree summary representing an XML document is automatically extracted from the document by a parser. This extracted tree summary is then used as a model of representation by a classifier to classify the corresponding XML document. We show and explain in Subsection 3.1 how this parser works. We then present the clustering technique in Subsection 3.2 and the proposed similarity measure in Subsection 3.3. Note that the latter is based on the breadth-first traversal of the tree structures of XML documents.

3.1 Tree structural summary extraction

We propose to represent XML documents by their tree structural summaries, which need minimal processing and, above all, avoid the loss of information. The basic idea is that repetitions of tags and/or the possibility of having optional tags (and consequently optional sub-trees) are among the reasons why XML documents can be structurally different even though they share the same DTD. In this context, our tree summary is regarded as a generic structure in the sense that, when sibling tags are duplicated, it is not necessary to keep this duplication in the structure that we wish to extract. Note, however, that to avoid losing information, duplications of nested and/or cousin tags are not removed in the way duplications of sibling tags are. One way to avoid this loss is to consider them as immediate descendants (sub-tags) of the tags in which they appear in XML documents. Fig. 3 gives an overview of this representation approach, focusing on all its features. Indeed, the transformation of the original tree (i) into the tree summary (ii) shows that the attribute t becomes an immediate (direct) descendant (son) node of the element to which it is attached. As for the duplicated sibling nodes, they are merged into a single occurrence that keeps their children; the nodes c, which were originally cousin nodes in (i), thereby become siblings, and their duplication is eliminated as well. However, as recommended, duplications of nested nodes are maintained.

Our algorithm for extracting the tree summary has two steps. The first step is based on the SAX (Simple API for XML) API (Application Programming Interface), which returns all the tags and attributes encountered in an XML document. These tags (or attributes) are intercepted, filtered and then transformed by our parser into an intermediate form, as shown by the parenthesized expression in Fig. 4. In the second step, this intermediate parenthesized expression is transformed by another parser into the corresponding tree summary, according to the principles of our approach, i.e., by eliminating duplication of sibling nodes and considering each attribute as an immediate descendant of the element (tag) to which it is attached in the XML document. Most of the extraction task is performed during this second step. In fact, three essential operations are performed at this level:

a) passage from the linear form of the XML document to its hierarchical representation;
b) removal of repetitions of sibling nodes;
c) transformation of possible attributes into immediate descendants of the elements to which they are attached in the XML document.

Fig. 3 Representation approach of an XML document: (i) original tree, (ii) tree summary

Thus, instead of the original trees, we use tree structural summaries to represent XML documents, but without loss of information, since we remove only the repetitions of sibling nodes. This allows, on the one hand, performing the matching of these trees more quickly and easily and, on the other hand, providing high-quality clustering.
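The three operations above can be illustrated by a minimal Python sketch. It is not the two-step SAX parser described in the text: for brevity it uses the standard `xml.etree.ElementTree` module, and the function names `build_summary` and `tree_summary` as well as the `(label, children)` tuple representation are assumptions made for this illustration. Duplicated siblings are merged (so their children become siblings and are deduplicated in turn), attributes become child labels, and nested duplicates are kept.

```python
import xml.etree.ElementTree as ET


def build_summary(label, elements):
    """Summarize sibling XML elements that all share `label`.

    Returns a (label, children) tuple in which duplicated sibling
    labels appear once; attributes are treated as mere child tags.
    """
    groups = {}  # child label -> child elements, in first-seen order
    for element in elements:
        for attribute in element.attrib:
            groups.setdefault(attribute, [])
        for child in element:
            groups.setdefault(child.tag, []).append(child)
    return (label, [build_summary(lbl, els) for lbl, els in groups.items()])


def tree_summary(xml_text):
    """Parse an XML string and return its tree structural summary."""
    root = ET.fromstring(xml_text)  # comments and text are ignored here
    return build_summary(root.tag, [root])


if __name__ == "__main__":
    document = '<a><b>x<c/></b><b t="v"><c/><d/></b><e/></a>'
    print(tree_summary(document))
```

In this sketch the two sibling `b` elements collapse into one node whose children are the union of the children (and attributes) of both occurrences, which mirrors the merging of sibling nodes described for Fig. 3.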

Fig. 4 An XML document and its corresponding tree summary

3.2 XML documents clustering

3.2.1 Overview of the clustering technique used

For clustering XML documents based on structural similarity, we use well-known techniques of hierarchical agglomerative clustering (although any form of clustering could be used). Hierarchical methods perform mergers between data sets; the pairs of elements (or clusters) are successively merged until there is only one large set containing all elements. The end result can be schematically represented as a tree of clusters named a dendrogram, as shown in Fig. 5.

Fig. 5 Dendrogram of the ascending hierarchical classification

The dendrogram shows the clusters that were merged together, and the minimum similarity between these merged clusters. There are several methods of hierarchical ascending classification. They are all based on the following idea:

a) Initially, each element of the data set to be classified is regarded as a cluster.
b) Clusters separated by a minimum distance (i.e., maximum similarity) are grouped together. The distances between the remaining clusters and the new cluster are recalculated.
c) If there is more than one cluster, or the minimum distance (or maximum similarity) threshold has not yet been reached, go to step b.

With some methods, the distance between two clusters X and Y is defined as the minimum distance (maximum similarity) between all the pairs of elements (x, y) such that x is in X and y is in Y. With other methods, the average distance (average similarity) is the parameter used to separate clusters. We chose a clustering method based on the minimum distance (i.e., the maximum similarity). We then used the single link clustering algorithm, using Prim's algorithm [0] to calculate the MST (Minimum Spanning Tree, or shortest path) of a graph.
Given a graph G = (N, A) with a set of nodes N and a set of weighted edges A, the minimum spanning tree (MST) of the graph is an acyclic subset T ⊆ A that connects all nodes and whose total weight (cost, distance, value, etc.), denoted W(T) (the sum of the weights of T's edges), is minimized. It was shown in [] that the MST contains all the information required to implement single link clustering. Given a set of n rooted labeled ordered trees representing XML documents, we form a complete graph G with n nodes N and weighted edges A. The weight of an edge is the structural distance between the nodes it connects. Nodes represent XML trees in our case. For example, the single link clustering for a threshold l can be carried out by removing from the MST of the graph G all the edges having weight ≥ l. The connected nodes of the remaining graph are the single link clusters. Fig. 6 shows a graph with 7 nodes (corresponding to 7 XML documents) and its weighted edges.

Fig. 6 Graphical representation of the distances between XML trees
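The single link procedure described above (build the MST, drop the heavy edges, read off the connected components) can be sketched in Python as follows. This is an illustrative sketch, not the implementation used in the article: the function names `prim_mst` and `single_link_clusters` are hypothetical, the distance matrix in the usage example is invented, and the sketch assumes that edges with weight greater than or equal to the threshold are the ones removed.

```python
import math


def prim_mst(weights, start=0):
    """Prim's algorithm on a symmetric adjacency matrix, O(N^2).

    `weights[i][j]` is the distance between nodes i and j.
    Returns the MST as a list of (u, v, weight) edges.
    """
    n = len(weights)
    in_tree = [False] * n
    best = [math.inf] * n   # cheapest known edge linking each node to the tree
    parent = [None] * n
    best[start] = 0.0
    edges = []
    for _ in range(n):
        # pick the node outside the tree with the cheapest connecting edge
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] is not None:
            edges.append((parent[u], u, weights[parent[u]][u]))
        for v in range(n):
            if not in_tree[v] and weights[u][v] < best[v]:
                best[v] = weights[u][v]
                parent[v] = u
    return edges


def single_link_clusters(n, mst_edges, threshold):
    """Keep MST edges lighter than `threshold`; return connected components."""
    root = list(range(n))   # union-find forest over the n nodes

    def find(x):
        while root[x] != x:
            root[x] = root[root[x]]   # path halving
            x = root[x]
        return x

    for u, v, weight in mst_edges:
        if weight < threshold:        # edges with weight >= threshold are dropped
            root[find(u)] = find(v)
    clusters = {}
    for node in range(n):
        clusters.setdefault(find(node), []).append(node)
    return sorted(clusters.values())


if __name__ == "__main__":
    # Invented 4-document distance matrix, for illustration only.
    distances = [
        [0.0, 0.1, 0.9, 1.0],
        [0.1, 0.0, 0.3, 1.0],
        [0.9, 0.3, 0.0, 0.7],
        [1.0, 1.0, 0.7, 0.0],
    ]
    mst = prim_mst(distances)
    print(single_link_clusters(len(distances), mst, threshold=0.4))
```

Nodes that end up with no remaining edge form single-node clusters, as in the example discussed around Fig. 8.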

As indicated above, the weight of an edge is the structural distance between XML documents: in Fig. 6, each edge is labeled with the structural distance between the two trees it connects. Missing edges are the additional edges which make the graph complete; their weights are equal to 1. Fig. 7 shows the shortest path (MST) on the graph in Fig. 6, and Fig. 8 shows the parts of the graph remaining after deleting all edges with weight ≥ 0.4.

Fig. 7 The shortest path in the graph of Fig. 6

Fig. 8 Resulting graph after deleting all edges having weight ≥ 0.4

[Tables 1 to 3: matrix associated with the graph of Fig. 6, MST matrix, and matrix after applying the threshold; table contents lost in extraction]

Two new components are formed, containing the nodes (1, 2, 3, 6) and the nodes (5, 7), respectively. This indicates the presence of two new clusters, namely cluster 1 with (1, 2, 3, 6) as members and cluster 2 with (5, 7) as members. Nodes that are not connected to other nodes are considered as single-node clusters.

The graph is represented by a matrix called the associated matrix. To the graph G = (N, A) of order n is associated a square matrix of order n, whose entry (i, j) is the weight of the edge connecting nodes i and j. Tables 1, 2 and 3 show the matrices associated with the graphs of Figs 6, 7 and 8, respectively. It now suffices to use the matrix obtained after applying the threshold 0.4 to deduce the remaining links between nodes (representing XML documents) and then build the corresponding clusters.

3.2.2 Overview of using Prim's algorithm

As announced above, Prim's algorithm [33] allows calculating the shortest path (or MST) in a given weighted graph G.
Informally, we apply the following steps:

a) Create a tree containing a single node, chosen arbitrarily from the graph G.
b) Create a set containing all the edges in the graph G.
c) Loop until every edge in the set connects two nodes in the tree: remove from the set an edge with minimum weight that connects a node in the tree with a node not in the tree, and add that edge to the tree.

Thus, the algorithm continuously increases the size of a tree, one edge at a time, starting with a tree consisting of a single node, until it spans all nodes of the initial graph G. Pseudo-code for Prim's algorithm is given in Fig. 9. To show how to apply Prim's algorithm to find a minimum spanning tree in a weighted graph, we rely on the example graph in Fig. 10, whose nodes are a, b, c, d, e and f. Prim's algorithm proceeds as follows. First we arbitrarily choose to start with the node d, and we add edge {d, e}. Next, we add edge {c, e}, then edge {d, f}, then the edge of weight 3 joining e to one of the two remaining nodes, and finally edge {a, b}. The minimum spanning tree found is given in Fig. 11.

Input: a non-empty connected weighted graph G = (N, A) (the weights can be negative)
Initializations: N_new ← {x}; A_new ← ∅ (where x is an arbitrary node (starting point) from N)
repeat
  choose an edge {u, v} with minimal weight such that u is in N_new and v is not
  (if there are multiple edges with the same weight, any of them may be picked)
  N_new ← N_new ∪ {v}; A_new ← A_new ∪ {{u, v}}
until N_new = N
Output: N_new and A_new describe an MST

Fig. 9 Prim's algorithm pseudo-code

Fig. 10 An example of a weighted connected graph

Fig. 11 Minimum spanning tree (MST) produced by applying Prim's algorithm to the graph in Fig. 10

We could start with any node to determine the MST. In the previous example (Fig. 10), we arbitrarily chose to start with the node d, but any node can be used to start the process. The time complexity of the algorithm depends heavily on how the choice of the edge or node to add to the set at each stage is implemented. With a naive representation, using an adjacency matrix and searching an array of weights to find the minimum-weight edge to add, the running time is O(N^2). Using a simple binary heap data structure and an adjacency list representation, Prim's algorithm can be shown to run in time O(A log N). Using a more sophisticated Fibonacci heap, this can be brought down to O(A + N log N), which is asymptotically faster when the graph is dense enough that A is ω(N), i.e., A dominates N asymptotically. However, for the purposes of our tests in this article, we chose the adjacency matrix for the simplicity of its implementation. At this stage, as previously announced, we turn in Subsection 3.3 to the description of the proposed structural similarity measure.

3.3 Tree structure similarity

Usually, to compare two words, we use a thesaurus or a dictionary.
But when these words correspond to node names (labels) in a tree, it is necessary to take into account their respective tree relationships. The idea is that even though two nodes are represented by the same name, or by synonymous names, this does not mean that they necessarily remain similar in the context of their respective ancestors, descendants, siblings and/or cousins, which can be completely different. Thus, the similarity of two nodes depends not only on their ontological similarity (terms can be similar because they are the same string or because they are semantically related according to a dictionary), but also on their respective tree relationships, which play a crucial role in the similarity calculation.

Most methods for clustering XML documents by structure use the edit distance to measure the similarity between their structures. We recall that the tree edit distance measures the number of elementary operations (insertions, deletions and replacements of nodes) required to transform one tree into another. Moreover, all these methods perform a depth-first traversal to visit the nodes of a tree. We propose a novel method for calculating the similarity. Firstly, instead of performing a depth-first traversal, we perform a breadth-first traversal, also called level-by-level traversal. In other words, we explore the breadth, i.e., the full width of the tree at a given level, before going deeper. Secondly, we take into consideration the hierarchical contexts of XML tree structures. Before describing our method in detail, it is necessary to introduce some fundamental concepts.

3.3.1 Basic preliminary notions

A tree level consists of sibling and/or cousin nodes. As suggested in our approach, repetitions of sibling nodes are eliminated, but not those of cousin nodes. Therefore, it is possible to encounter several duplications of cousin nodes on the same tree level. It is then necessary to take them into account in the similarity calculation. To express that, we can use the concept of weight. Indeed, let $x = (x_1, x_2, \ldots, x_n)$ be a vector of $\mathbb{R}^n$; its norm (Euclidean) is

$$\|x\| = \sqrt{x_1^2 + x_2^2 + \cdots + x_n^2}$$

The norm allows exploiting the concept of weight efficiently. We can extend its use even to objects that are not necessarily vectors of $\mathbb{R}^n$. For example, if L is a tree level in which a occurs once, b twice and c three times, then the weights (or frequencies) of a, b and c are 1, 2 and 3, respectively. Therefore, if these weights are stored in a vector such as (1, 2, 3), the norm associated with L is $\sqrt{1^2 + 2^2 + 3^2} = \sqrt{14} \approx 3.74$. The norm will serve hereafter for the normalization of the similarity values.

Moreover, in order to fully highlight the features of our approach, we should also recall some notions on depth-first and breadth-first traversals of trees. There are essentially two different ways to systematically visit all the nodes of a tree, namely depth-first traversal and breadth-first traversal. Certain depth-first traversal methods occur frequently enough that they are given names of their own: preorder traversal, inorder traversal and postorder traversal. To describe these concepts easily and clearly, it is better to rely on concrete examples. We do not need to dwell on the details of tree traversal; we give only the minimum necessary to distinguish the breadth-first traversal (which characterizes our proposed method) from the depth-first traversal that is used in most existing clustering methods. Thus, for example, given the tree in Fig. 12, a preorder traversal would visit the elements in the order A, B, C, D, E, F, G, H, I. This type of traversal is called a depth-first traversal because it tries to go deeper in the tree before exploring sibling nodes.

Fig. 12 Simple general tree (root A at level 0, followed by levels 1, 2 and 3)

For example, the traversal visits all the descendants of B (i.e., keeps going deeper) before visiting B's sibling D (and any of D's descendants). This kind of traversal can be achieved by the simple recursive algorithm given in Fig. 13.

preorder (tree)
  if (tree not empty)
    visit root of tree
    preorder (left sub_tree)
    preorder (right sub_tree)

Fig. 13 Preorder traversal algorithm

Whereas the depth-first traversals are defined recursively, breadth-first traversal is best understood as a non-recursive traversal. The breadth-first traversal of a tree visits the nodes in the order of their depth in the tree. It first visits all the nodes at level 0 (i.e., the root), then all the nodes at level 1, and so on. At each level the nodes are visited from left to right. Thus, the breadth-first traversal of the tree shown in Fig. 12 visits the nodes in the following order: A, B, D, C, E, H, F, G, I.

3.3.2 Breadth-first tree traversal

To our knowledge, the breadth-first traversal algorithm has not been applied in practice in existing work on clustering of XML documents. We encountered only one approach, in [9], that addressed the similarity computation according to the similarities of the levels of XML tree structures. Recall that in our approach, the representative structures of XML documents are tree structural summaries, structured as general trees, i.e., trees in which each node can have any number of children. The algorithm in Fig. 14 allows exploring a general tree and retrieving its nodes by a breadth-first traversal. Like the depth-first traversal, the breadth-first traversal has linear time complexity O(N) in the worst case.

breadth_traversal (n : Node)
begin
  level ← {n};
  while level ≠ ∅
  { depth_level ← ∅;
    for each node in level
    { store the node in list; depth_level ← depth_level ∪ child_of (node); }
    level ← depth_level; }
end

Fig. 14 Breadth-first traversal algorithm

Indeed, given a tree of N nodes, the algorithm in Fig. 14 clearly shows that the time complexity is linear. At each level, the nodes are visited from left to right and then stored in lists that are used afterwards for calculating similarities. The advantage of storing the nodes in lists is twofold. On the one hand, this allows easy calculation of basic similarities between levels of trees. On the other hand, given two tree levels belonging respectively to two trees, it is possible to know the similarities of their respective ancestor and descendant levels.
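The level-by-level traversal of Fig. 14 can be sketched in Python as follows. This is a minimal illustration under stated assumptions: a general tree is represented as a `(label, children)` tuple, the function name `tree_levels` is hypothetical, and the example tree is one that is consistent with the preorder and breadth-first visiting orders quoted in the text for Fig. 12.

```python
def tree_levels(tree):
    """Return the node labels of a general tree, one list per level.

    `tree` is a (label, children) tuple; children are visited from
    left to right. Each node is handled once, so the cost is O(N).
    """
    levels = []
    level = [tree]
    while level:
        levels.append([label for label, _ in level])
        # the next level is made of the children of the current level
        level = [child for _, children in level for child in children]
    return levels


if __name__ == "__main__":
    # A tree consistent with the visiting orders given for Fig. 12.
    example = ("A", [
        ("B", [("C", [])]),
        ("D", [
            ("E", [("F", []), ("G", [])]),
            ("H", [("I", [])]),
        ]),
    ])
    for depth, labels in enumerate(tree_levels(example)):
        print("level", depth, labels)
```

Keeping one list per level, rather than a single flat list, is what later makes it possible to compare a level of one tree with a level of another and to look up their ancestor and descendant levels by index.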
As suggested above, the ancestor and descendant levels represent the hierarchical contexts to take into account when calculating the similarity of two levels of two given trees. These levels are implicitly linked by hierarchical relationships in trees. The underlying idea is that even though two tree levels are identical, or very similar, this does not mean that they necessarily remain similar in the context of their respective ancestor and descendant levels, which can be completely different. Given all these characteristics, we describe and explain in Subsection 3.3.3 the structural similarity measure that we propose, taking into account the hierarchical relationships between levels in each tree.

3.3.3 Structural similarity measure based on breadth-first tree traversal

Let T1 and T2 be two trees representing respectively two XML documents. We propose to compute their similarity as follows:

[Equation (1): definition of $Sim(T_1, T_2)$ in terms of the level similarities $Sim(l_i, l_j)$, $1 \le i \le n$, $1 \le j \le m$; formula lost in extraction]

Here $Sim(l_i, l_j)$ is the similarity of the levels l_i and l_j, which belong respectively to T1 and T2. The bounds n and m are the numbers of levels of T1 and T2, respectively. Given two levels l_i and l_j, we define their similarity according to their hierarchical context as follows:

$$Sim(l_i, l_j) = w_1 S_1(l_i, l_j) + w_2 S_2(l_i, l_j) + w_3 S_3(l_i, l_j) \qquad (2)$$

where $w_1 \ge 0$, $w_2 \ge 0$ and $w_3 \ge 0$ are weights such that $w_1 + w_2 + w_3 = 1$. S1 is the basic similarity of l_i and l_j. It is expressed as follows:

$$S_1(l_i, l_j) = \frac{\sum_{k=1}^{p} \sum_{l=1}^{q} sim(e_k, e_l)}{\|N_1\| \times \|N_2\|} \qquad (3)$$

The term $sim(e_k, e_l)$ is the ontological similarity of the nodes e_k and e_l (obtained using a dictionary). In other words, it equals 1 if e_k = e_l, takes a dedicated value if e_k and e_l are synonymous, and equals 0 otherwise. The nodes e_k and e_l belong respectively to the levels l_i and l_j. The bounds p and q are the numbers of nodes of l_i and l_j, respectively. The product $\|N_1\| \times \|N_2\|$ normalizes the sum of the $sim(e_k, e_l)$ terms. N1 and N2 are two vectors whose elements are the weights of the nodes belonging respectively to the tree levels l_i and l_j. S1 is calculated for each pair of levels (l_i, l_j), so the result is the basic similarity matrix of the trees T1 and T2. In Subsection 3.3.4, we give an idea of how this matrix is calculated.

S2 and S3 reflect the hierarchical context in calculating the similarity of each pair of levels (l_i, l_j). S2 represents the similarity of the descendant levels of l_i and l_j. It is expressed as follows:

[Equation (4): definition of $S_2(l_i, l_j)$ in terms of the basic similarities $S_1(d_k, d_l)$; formula lost in extraction]

The term $S_1(d_k, d_l)$ represents the basic similarity of the levels d_k and d_l. The levels d_k and d_l belong respectively to desc1 and desc2, the sets of descendant levels of l_i and l_j, respectively.
The bounds $r$ and $s$ are the numbers of levels of $desc_1$ and $desc_2$, respectively. $S_3$ is the similarity of the ancestor levels of $l_i$ and $l_j$ respectively. It is expressed as follows:

[Eq. (5)]

The summed term in Eq. (5) represents the basic similarity of the levels $a_k$ and $a_l$. The levels $a_k$ and $a_l$ belong respectively to $anc_1$ and $anc_2$. The terms $anc_1$ and $anc_2$ represent the sets of ancestor levels of $l_i$ and $l_j$, respectively. The bounds $t$ and $u$ are the numbers of levels of $anc_1$ and $anc_2$, respectively.

3.3.4. Illustrative example

This example shows the different steps followed in computing the similarity of the two trees $T_1$ and $T_2$ in Fig. 5 using the proposed structural similarity measure based on breadth-first tree traversal. The first step is to use Eq. (3) to calculate the similarity matrix of the levels of $T_1$ and $T_2$. As there are three levels in each tree, we obtain a $3 \times 3$ matrix. The calculation gives the following matrix:

$$S_1 = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 0.8 \end{pmatrix}$$

We note that the similarity between the last levels of $T_1$ and $T_2$ is equal to 0.8, while it is equal to 1 between the other levels of the same rank. It is equal to 0 everywhere else.

Fig. 5 Comparison of two XML trees $T_1$ and $T_2$ (levels 0 to 2) using the calculation of the structural similarity based on the breadth-first traversal

Before calculating $S_2$ and $S_3$, it is appropriate to define how to use the weights $w_1$, $w_2$ and $w_3$. If we ignore the hierarchical contexts (descendant levels and ancestor levels), it is not necessary to calculate $S_2$ and $S_3$; in this case we take $w_1 = 1$ with $w_2 = 0$ and $w_3 = 0$. Otherwise, in particular in the case of XML documents, it is more natural to give $S_1$, $S_2$ and $S_3$ the weights $w_1 = 1/3$, $w_2 = 1/3$ and $w_3 = 1/3$, respectively. Thus, in the first case, where we do not consider the hierarchical contexts, the similarity between two levels of two trees is defined by $S_1$.

Indeed, with $w_1 = 1$, $w_2 = 0$ and $w_3 = 0$, we have $Sim(l_i, l_j) = S_1$ because $w_1 = 1$. So the final similarity calculation of the two trees $T_1$ and $T_2$ becomes easy and requires only exploring the $3 \times 3$ matrix calculated above with Eq. (1). We therefore obtain $Sim(T_1, T_2) = 0.94$, which is a relatively good similarity value, but one that we could also get by comparing two vectors, so it does not reflect the tree view of XML documents.

That said, if we consider the case where $w_1 = w_2 = w_3 = 1/3$, the calculation is obviously more complicated, but in principle it yields a more reliable similarity. The elements of the new similarity matrix, before calculating the final similarity, are calculated using Eq. (2) with these weights, so we have $Sim(l_i, l_j) = \frac{1}{3}(S_1 + S_2 + S_3)$. In this case, we must also calculate $S_2$ and $S_3$ using Eqs. (4) and (5). To go faster, we calculate $S_2$ and $S_3$ only for the non-zero entries of the matrix calculated above. The elements concerned are those of the main diagonal, namely (1, 1), (2, 2) and (3, 3), whose values are 1, 1 and 0.8, respectively. Moreover, some elements of the matrix are not concerned by the calculation of $S_2$ or $S_3$, for example those of the last row or those of the first row. So as not to distort the similarity computation, we attribute the value 1 to $S_2$ and $S_3$ in the case of the last row and the first row of the matrix, respectively. Thus, for each non-zero element of the matrix, we calculate the values of $S_2$ and $S_3$ as follows:

The element (1, 1) has no ancestor levels, so $S_3 = 1$, but it has two descendant levels, namely (2, 2) and ([…], 3), whose basic similarities are 1 and 0, that is to say $S_2 = 0.5$. So, by applying Eq. (2), we have […].

The element (2, 2) has only one ancestor level and one descendant level, corresponding respectively to (1, 1) and (3, 3), whose basic similarities are 1 and 0.8, i.e., $S_3 = 1$ and $S_2 = 0.8$. So, by applying Eq. (2), we have […].

The last case concerns the element (3, 3), which has no descendant levels but has four ancestor levels, namely (1, 1), (1, 2), (2, 1) and (2, 2). Regarding the descendant level, we assign the value 1, as expected, to $S_2$, i.e., $S_2 = 1$.
The basic similarities of these four ancestor levels are 1, 0, 0 and 1, respectively. Thus, we have $S_3 =$ […]. Finally, we obtain $Sim(l_3, l_3) =$ […]. The final matrix formed by the elements $Sim(l_i, l_j)$, before calculating the similarity of the two trees $T_1$ and $T_2$, is given as follows:

[…]

Applying Eq. (1), we obtain $Sim(T_1, T_2) = 0.905$. Unlike the first result (namely 0.94), obtained without taking into account the hierarchical contexts of the trees $T_1$ and $T_2$, i.e., with $w_1 = 1$, $w_2 = 0$ and $w_3 = 0$, the latter result (namely 0.905) seems to better reflect the reality of the tree structure of XML documents. This example gives an idea of how to calculate the similarity according to our approach, but to validate our proposal we carry out several tests in the experimental part of this paper.

3.3.5. Complexity of the structural similarity calculation

Consider a general tree of $M$ nodes and height $h$, the latter being equal to the number of tree levels. A tree level other than that of the root contains on average about $M/h$ nodes. Therefore, given two trees having levels containing respectively about $M_1/h_1$ and $M_2/h_2$ nodes, the calculation of their basic similarity matrix is achieved on average in $M_1 \times M_2$ operations, since they have respectively $h_1$ and $h_2$ levels. In other words, it requires a time complexity of order $O(M_1 M_2)$, which is the same as that of calculating the similarity based on edit distance. However, in our approach, unlike approaches based on edit distance, we extend the similarity calculation by taking into account the tree relationships between nodes. It is therefore necessary to add the calculation of the descendant and ancestor level similarities. Based on the basic similarity matrix $S_1[1..h_1, 1..h_2]$, the worst-case time complexity of the additional calculation is on the order of […]. Note, however, that the heights $h_1$ and $h_2$ are usually much smaller than the tree sizes (numbers of nodes) $M_1$ and $M_2$ respectively. We thus obtain a time complexity slightly higher than that of the edit distance, but this is acceptable given the relevance of the proposed similarity measure, which takes into account the hierarchical relationships of nodes.
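To make the level-based computation of Subsection 3.3.3 more concrete, the following minimal sketch shows two of its ingredients: grouping the nodes of a tree into levels with a breadth-first traversal, and combining the three terms $S_1$, $S_2$ and $S_3$ of a pair of levels with weights $w_1$, $w_2$ and $w_3$. It assumes that the level similarity is the weighted sum of the three terms with non-negative weights summing to 1. The function names, the dictionary encoding of the tree and the use of node identifiers are hypothetical choices made for illustration; the normalizations of Eqs. (3) to (5) are not reproduced here.

```python
from collections import deque


def levels_by_bfs(children, root):
    """Group node identifiers by depth with a breadth-first traversal.

    `children` is a hypothetical encoding: it maps a node identifier to the
    list of its child identifiers (leaves may be absent from the mapping).
    """
    levels = []
    queue = deque([(root, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == len(levels):  # first node seen at this depth
            levels.append([])
        levels[depth].append(node)
        for child in children.get(node, []):
            queue.append((child, depth + 1))
    return levels


def level_similarity(s1, s2, s3, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Weighted combination of the basic (s1), descendant (s2) and
    ancestor (s3) similarities of a pair of levels.

    Assumption: the weights are non-negative and sum to 1.
    """
    w1, w2, w3 = weights
    if min(weights) < 0 or abs(w1 + w2 + w3 - 1.0) > 1e-9:
        raise ValueError("weights must be non-negative and sum to 1")
    return w1 * s1 + w2 * s2 + w3 * s3
```

With the weights `(1.0, 0.0, 0.0)` the function returns the basic similarity unchanged, which corresponds to ignoring the hierarchical contexts; with the default equal weights, the ancestor and descendant contexts count as much as the basic similarity.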
Remark: we have proposed a similarity measure rather than a distance for clustering XML documents. On the other hand, we rely on Prim's algorithm, which computes a minimum spanning tree (MST) of a graph; the MST is then exploited for clustering XML documents based on their structural distances (each node of the MST symbolizes the structure of an XML document). It is therefore necessary to adapt our similarity measure. To do this, it suffices to replace the similarity

value calculated on the basis of the proposed similarity measure by the distance value according to the following Eq. (6).

[Eq. (6)]

In the next section, we evaluate the effectiveness and efficiency of our approach. To this end, we conducted our experiments relying on several different tests.

4. Experiment and results

4.1. Implementation of the clustering system

We developed a first program in Java, under the JCreator environment. The developed program consists of two modules. The first one is based on SAX and carries out the first parsing, as announced in Subsection 3.[…]. This module provides, for each XML document processed, an intermediate file that is read by a second module to finalize the extraction of the corresponding tree summary. For clustering the XML summary trees obtained using this parsing program (i.e., the tree structural summaries extraction program), we wrote a second program in C++ that uses the files (representing the tree summaries) generated by the first program.

4.2. Experimental framework

Our experiments are carried out on a Lenovo machine with an Intel Core Duo […] GHz CPU and […].99 GB of RAM. For the data set, the experiments were carried out on both real (ACM SIGMOD Record) and synthetic XML collections. The ACM SIGMOD Record corpus concerns scientific articles published by the ACM SIGMOD conference and is composed of approximately […],000 XML documents sharing 5 DTDs, namely HomePage, IndexTermsPage, OrdinaryIssuePage, ProceedingsPage, and SigmodRecord. These DTDs can, in fact, be considered as target classes against which we can assess our clustering approach. This corpus is distributed as shown in Table 4.

Table 4: ACM SIGMOD Record corpus distribution

DTD                  Number of XML documents
IndexTermsPage       90
OrdinaryIssuePage    30
ProceedingsPage      6
SigmodRecord         […]
HomePage             […]

4.3. Evaluation metrics

The evaluation verifies to what extent the clustering is able to find clusters in agreement with the classes of the labeled corpus, which are considered as target classes. To validate our approach, we used the F-measure, Recall and Precision measures, which are commonly used metrics to assess clustering results.
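The three metrics used in this subsection can be sketched as follows for one cluster and one target class. This is a minimal illustration assuming the usual definitions: Precision is the share of the cluster that belongs to the target class, Recall is the share of the target class found in the cluster, and the F-measure is their harmonic mean. The function name and the convention of returning 0 when a denominator is zero are assumptions made for the example.

```python
def precision_recall_f(n_c, n_d, x_d):
    """Return (precision, recall, f_measure) for one cluster and one class.

    n_c: number of documents in the cluster
    n_d: number of documents in the target class (DTD)
    x_d: number of documents of the target class assigned to the cluster
    Zero denominators yield 0.0 (an assumption of this sketch).
    """
    precision = x_d / n_c if n_c else 0.0
    recall = x_d / n_d if n_d else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure
```

For instance, a cluster of 10 documents that contains 6 of the 8 documents of a target class has a Precision of 0.6 and a Recall of 0.75.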
F (F-measure) [6] is a combination of Precision and Recall. It measures the balance between P (Precision) and R (Recall), expressed respectively by the following Eqs. (7) and (8).

$$P = \frac{X_d}{N_c} \qquad (7)$$

$$R = \frac{X_d}{N_d} \qquad (8)$$

$N_c$ is the number of documents in the cluster $C$, $N_d$ is the number of documents in the target class (DTD) and $X_d$ is the number of documents in the target class assigned to cluster $C$. We recall that each DTD is considered as a target class with which we can evaluate our clustering. So we know these classes a priori, i.e., we know their numbers and the names of the documents they contain. The F-measure F, in turn, is expressed by Eq. (9), representing the harmonic mean of Precision and Recall.

$$F = \frac{2 P R}{P + R} \qquad (9)$$

4.4. Evaluation and discussion

In this phase, we first derive from the previous XML collection the corresponding tree summaries, and we then proceed to their clustering. The first clustering test consists in comparing the proposed similarity measure with the similarity measure based on tree edit distance and the similarity measure proposed in [3]. The second test compares some of our results with those of existing approaches. Finally, the third test confirms the asymptotic time complexity of our similarity measure.

4.4.1. Similarity measure proposed versus other similarity measures

In the first test, as expected, we compared the proposed similarity measure to other measures, namely the edit distance and the similarity measure proposed in [3]. We chose to compare our similarity measure with the edit distance because the latter is a measure of similarity that has been widely used in many clustering approaches. The comparison with the work presented in [3] is justified by the fact that we use exactly the same model for representing XML documents, in this case structural tree summaries. This comparison test is particularly motivated by the response time of our clustering on the one hand and, on the other hand, by the reliability of our similarity

measure. For this, we first replaced in our clustering algorithm the similarity measure described by Eq. (6), respectively, by the edit distance and the similarity measure proposed in [3]. We then performed three series of tests with the same values of the distance threshold in the interval […]. However, given the recurring results (especially for the 90 documents corresponding to IndexTermsPage, which are almost identical and thus structurally very similar), we used only […] of the corpus, namely 348 XML documents, i.e., ([…]) corresponding respectively to IndexTermsPage, OrdinaryIssuePage, ProceedingsPage, SigmodRecord, and HomePage. To have a clear idea about the performance and reliability of our tests, it is appropriate to report the comparative results (for the same type of test) published in [3]. These results are given in Table 5. Some abbreviations are used in Table 5: NC is the number of clusters. The time unit in the column named Time is the second. The abbreviations SM and ED denote respectively Similarity Measure and Edit Distance. Finally, the abbreviation TH represents the similarity threshold.

Table 5: [3] similarity measure versus edit distance

TH    [3] SM: NC, Time    ED: NC, Time

Note that the clustering algorithm in [3] is completely different from the clustering algorithm we propose in this article. In this regard, we recall that our clustering here is based on conventional agglomerative hierarchical classification, while that of the approach [3] is an incremental clustering. As anticipated above, after completing the first test with our similarity measure (based on Eq. (6)), we replace, in our clustering algorithm, our similarity measure successively by the edit distance and the [3] similarity measure. We then perform two new series of tests whose results are collected in Table 6. The other abbreviations used in Table 6 are DT and OSM; they denote respectively the distance threshold and our similarity measure (based on Eq. (6)).
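The role of the distance threshold in an agglomerative clustering that uses the minimum distance as aggregation criterion can be sketched as follows. This is only an illustration, not the program used in the experiments: it assumes a precomputed symmetric distance matrix and merges any two documents whose distance does not exceed the threshold, which gives the same groups as cutting a single-linkage hierarchy (or the edges of a minimum spanning tree) at that threshold. The function name and the union-find helper are hypothetical.

```python
def single_linkage_clusters(dist, threshold):
    """Group document indices so that two documents end up in the same
    cluster when they are linked by a chain of pairs whose distance is
    at most `threshold`.

    `dist` is a symmetric matrix (list of lists) of pairwise distances.
    """
    n = len(dist)
    parent = list(range(n))  # union-find forest over document indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if dist[i][j] <= threshold:
                parent[find(i)] = find(j)  # merge the two clusters

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())
```

In this sketch every pair of documents is compared whatever the threshold, so the number of comparisons does not depend on the threshold value; only the number of resulting clusters does.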
As can be seen (Table 6), in all cases, i.e., with the proposed similarity measure (OSM) or with the other measures (ED and [3] SM), the clustering time remains practically the same when the threshold changes (increases or decreases). Indeed, our clustering is based on the minimum distance as the criterion for aggregation. In other words, the number of comparisons is practically the same for each distance threshold value. As for differences, there is a lag in response times and there are differences between the similarity values obtained. The difference in response times, as expected, is obvious, given the differences between the equations used by the three measures tested. The time parameter is not very restrictive and should not weigh heavily on the feasibility of such applications (clustering is not an interactive application where time is always a critical parameter). Note, however, that the differences in the similarity values are crucial, since it is on the basis of similarity that it is decided whether or not a document is assigned to a cluster. Moreover, these differences have a direct impact on the number of clusters (NC) obtained in each test. Indeed, with these thresholds, some documents are structurally too distant to stay together in the same cluster. This is due to the [3] similarity measure and our similarity measure, which take into account the ancestor and descendant context of nodes, so that only documents having very close hierarchical structures are found in the same cluster. Thus, XML documents that do not satisfy this condition, i.e., that are not sufficiently structurally similar, migrate to other newly created clusters. In fact, these new clusters are considered as not corresponding to any DTD. We recall that each DTD is considered as a target class against which we can assess our clustering.

Table 6: Our similarity measure versus [3] measure and edit distance measure

DT    [3] SM: NC, Time    ED: NC, Time    OSM: NC, Time

When we compare the results in Tables 5 and 6, there are some differences.
Indeed, if we consider the column named [3] SM in the two tables in question, we find that there is a clear difference in the clustering time. This is certainly due to our clustering algorithm, which is faster compared to the algorithm of the study by [3], which is relatively slow. The number of clusters NC does not change rapidly with the distance threshold in Table 5 compared to NC in Table 6. This is due to the clustering algorithms, which are different. The clustering algorithm used in this article is a simple algorithm based on conventional agglomerative hierarchical classification, while the clustering approach by [3] is an incremental clustering. Our clustering

algorithm uses only the minimum distance as aggregation criterion, while the clustering approach by [3] is characterized by the mobility of the centroid representing each cluster. Each time an XML document must be added, its representative is systematically compared with all existing centroids and all the trees in the cluster to which it is assigned. During the comparison process, we can either have a new centroid, which is systematically assigned to a newly created cluster, or an existing centroid can be replaced by another, more representative tree among those of the same cluster. This clearly explains the differences between the two approaches, particularly regarding the time variation in the clustering (SM column) in Table 5. If we consider the column OSM in Table 6, representing our approach, we see that it is somewhat close to the result of the ED column in Table 5, in terms of clustering time and the number of clusters NC. But it is somewhat far from the result of the ED column in Table 6, particularly in terms of clustering time. We conclude that our method is better because it is reliable in terms of clustering time and the quality of clustering.

4.4.2. Comparison of some of our results with those of other clustering methods

The second test compares some of our results with those obtained by the approaches of [3, 6, 3, 3, 38] on a portion of the ACM SIGMOD collection. To this end we used the sample of XML documents in Table 7.

Table 7: Distribution of the ACM SIGMOD Record subset

Name of the DTD      Number of XML documents
IndexTermsPage       5
OrdinaryIssuePage    30
ProceedingsPage      6
SigmodRecord         […]
HomePage             […]

We chose to compare our method with those developed in the [3, 6, 3, 3, 38] approaches for several reasons. First, like our approach, these approaches use a very close representation, namely tree structures, to structurally represent XML documents. Second, they use the same data set, namely the ACM SIGMOD corpus. Third, their clustering methods are all based on an edit distance or a similarity that is different from our measure of similarity.
Recall that these results do not depend only on the similarity measure, but also, and especially, on the model (original XML tree structure or XML tree structure summary) used to represent the structures of XML documents. In Table 8, we can see the results of this comparison. Note that the [3, 6, 3, 3, 38] results were reported in [3, 38]. These values represent the average Precision, Recall, and F-measure, in the interval [0, 1]. The results in Table 8 show that our clustering has slightly lower Precision than those of the [6, 3, 38] approaches, but it is very close to those of the [3, 3] approaches. It nevertheless has better Recall than the majority of the other approaches, with the exception of that of the [3] approach. Finally, the F-measure obtained by our clustering also seems to be higher than all others, with the exception of that of the approach [3]. However, our clustering is better overall, since it has better Precision than that of the approach [3].

Table 8: Comparison of our results with those of other approaches

Approach        Precision    Recall    F-measure
[3]
[6]
[3]
[3]
[38]
Our approach

4.4.3. Time needed to calculate the structural similarity between two XML documents

Finally, as expected, in this third test, we conduct experiments to determine the time required to calculate the structural similarity between two XML documents. To conduct these experiments, we generated a set of […]0 synthetic XML documents whose numbers of nodes vary from 50 to 500. We conducted two sets of tests with the group of XML documents previously generated:

The first one was conducted by setting the values of the weights $w_1$, $w_2$ and $w_3$ to 1/3 in Eq. (2). As we have already considered, it is more natural, in the case of XML documents, to give $S_1$, $S_2$ and $S_3$ the same weight, namely $w_1 = 1/3$, $w_2 = 1/3$ and $w_3 = 1/3$. Recall that $S_2$ and $S_3$ represent respectively the descendant and ancestor contexts. For more details, see the equations for calculating the similarity.

The second one was conducted by setting the values of the weights to $w_1 = 1$ and $w_2 = w_3 = 0$ in the same equation.
In other words, we ignore the hierarchical contexts (descendant levels and ancestor levels). Therefore, it is not necessary to calculate $S_2$ and $S_3$. In this case our similarity measure behaves like the edit distance. Thus, the time complexity of calculating the similarity between two trees $T_1$ and $T_2$ is, in the worst case, $O(N^2)$. What matters in this test is not the quality of clustering, but the time required for comparing the structures of two XML documents. Therefore, it is not necessary to have summaries of XML trees. For this, we slightly modified our parser so as not to remove repetitions of sibling nodes, and thus to obtain the original XML tree structures (the whole structure of the document). For more details about this question, see Subsection 3.[…].

Slides for Data Mining by I. H. Witten and E. Frank

Slides for Data Mining by I. H. Witten and E. Frank Slides for Dt Mining y I. H. Witten nd E. Frnk Simplicity first Simple lgorithms often work very well! There re mny kinds of simple structure, eg: One ttriute does ll the work All ttriutes contriute eqully

More information

Presentation Martin Randers

Presentation Martin Randers Presenttion Mrtin Rnders Outline Introduction Algorithms Implementtion nd experiments Memory consumption Summry Introduction Introduction Evolution of species cn e modelled in trees Trees consist of nodes

More information

10.5 Graphing Quadratic Functions

10.5 Graphing Quadratic Functions 0.5 Grphing Qudrtic Functions Now tht we cn solve qudrtic equtions, we wnt to lern how to grph the function ssocited with the qudrtic eqution. We cll this the qudrtic function. Grphs of Qudrtic Functions

More information

2 Computing all Intersections of a Set of Segments Line Segment Intersection

2 Computing all Intersections of a Set of Segments Line Segment Intersection 15-451/651: Design & Anlysis of Algorithms Novemer 14, 2016 Lecture #21 Sweep-Line nd Segment Intersection lst chnged: Novemer 8, 2017 1 Preliminries The sweep-line prdigm is very powerful lgorithmic design

More information

Lecture 10 Evolutionary Computation: Evolution strategies and genetic programming

Lecture 10 Evolutionary Computation: Evolution strategies and genetic programming Lecture 10 Evolutionry Computtion: Evolution strtegies nd genetic progrmming Evolution strtegies Genetic progrmming Summry Negnevitsky, Person Eduction, 2011 1 Evolution Strtegies Another pproch to simulting

More information

COMP 423 lecture 11 Jan. 28, 2008

COMP 423 lecture 11 Jan. 28, 2008 COMP 423 lecture 11 Jn. 28, 2008 Up to now, we hve looked t how some symols in n lphet occur more frequently thn others nd how we cn sve its y using code such tht the codewords for more frequently occuring

More information

A Tautology Checker loosely related to Stålmarck s Algorithm by Martin Richards

A Tautology Checker loosely related to Stålmarck s Algorithm by Martin Richards A Tutology Checker loosely relted to Stålmrck s Algorithm y Mrtin Richrds mr@cl.cm.c.uk http://www.cl.cm.c.uk/users/mr/ University Computer Lortory New Museum Site Pemroke Street Cmridge, CB2 3QG Mrtin

More information

Fig.25: the Role of LEX

Fig.25: the Role of LEX The Lnguge for Specifying Lexicl Anlyzer We shll now study how to uild lexicl nlyzer from specifiction of tokens in the form of list of regulr expressions The discussion centers round the design of n existing

More information

If you are at the university, either physically or via the VPN, you can download the chapters of this book as PDFs.

If you are at the university, either physically or via the VPN, you can download the chapters of this book as PDFs. Lecture 5 Wlks, Trils, Pths nd Connectedness Reding: Some of the mteril in this lecture comes from Section 1.2 of Dieter Jungnickel (2008), Grphs, Networks nd Algorithms, 3rd edition, which is ville online

More information

What are suffix trees?

What are suffix trees? Suffix Trees 1 Wht re suffix trees? Allow lgorithm designers to store very lrge mount of informtion out strings while still keeping within liner spce Allow users to serch for new strings in the originl

More information

Before We Begin. Introduction to Spatial Domain Filtering. Introduction to Digital Image Processing. Overview (1): Administrative Details (1):

Before We Begin. Introduction to Spatial Domain Filtering. Introduction to Digital Image Processing. Overview (1): Administrative Details (1): Overview (): Before We Begin Administrtive detils Review some questions to consider Winter 2006 Imge Enhncement in the Sptil Domin: Bsics of Sptil Filtering, Smoothing Sptil Filters, Order Sttistics Filters

More information

Algorithm Design (5) Text Search

Algorithm Design (5) Text Search Algorithm Design (5) Text Serch Tkshi Chikym School of Engineering The University of Tokyo Text Serch Find sustring tht mtches the given key string in text dt of lrge mount Key string: chr x[m] Text Dt:

More information

An Algorithm for Enumerating All Maximal Tree Patterns Without Duplication Using Succinct Data Structure

An Algorithm for Enumerating All Maximal Tree Patterns Without Duplication Using Succinct Data Structure , Mrch 12-14, 2014, Hong Kong An Algorithm for Enumerting All Mximl Tree Ptterns Without Dupliction Using Succinct Dt Structure Yuko ITOKAWA, Tomoyuki UCHIDA nd Motoki SANO Astrct In order to extrct structured

More information

In the last lecture, we discussed how valid tokens may be specified by regular expressions.

In the last lecture, we discussed how valid tokens may be specified by regular expressions. LECTURE 5 Scnning SYNTAX ANALYSIS We know from our previous lectures tht the process of verifying the syntx of the progrm is performed in two stges: Scnning: Identifying nd verifying tokens in progrm.

More information

Tries. Yufei Tao KAIST. April 9, Y. Tao, April 9, 2013 Tries

Tries. Yufei Tao KAIST. April 9, Y. Tao, April 9, 2013 Tries Tries Yufei To KAIST April 9, 2013 Y. To, April 9, 2013 Tries In this lecture, we will discuss the following exct mtching prolem on strings. Prolem Let S e set of strings, ech of which hs unique integer

More information

A dual of the rectangle-segmentation problem for binary matrices

A dual of the rectangle-segmentation problem for binary matrices A dul of the rectngle-segmenttion prolem for inry mtrices Thoms Klinowski Astrct We consider the prolem to decompose inry mtrix into smll numer of inry mtrices whose -entries form rectngle. We show tht

More information

CSCI 3130: Formal Languages and Automata Theory Lecture 12 The Chinese University of Hong Kong, Fall 2011

CSCI 3130: Formal Languages and Automata Theory Lecture 12 The Chinese University of Hong Kong, Fall 2011 CSCI 3130: Forml Lnguges nd utomt Theory Lecture 12 The Chinese University of Hong Kong, Fll 2011 ndrej Bogdnov In progrmming lnguges, uilding prse trees is significnt tsk ecuse prse trees tell us the

More information

The Greedy Method. The Greedy Method

The Greedy Method. The Greedy Method Lists nd Itertors /8/26 Presenttion for use with the textook, Algorithm Design nd Applictions, y M. T. Goodrich nd R. Tmssi, Wiley, 25 The Greedy Method The Greedy Method The greedy method is generl lgorithm

More information

COMBINATORIAL PATTERN MATCHING

COMBINATORIAL PATTERN MATCHING COMBINATORIAL PATTERN MATCHING Genomic Repets Exmple of repets: ATGGTCTAGGTCCTAGTGGTC Motivtion to find them: Genomic rerrngements re often ssocited with repets Trce evolutionry secrets Mny tumors re chrcterized

More information

From Dependencies to Evaluation Strategies

From Dependencies to Evaluation Strategies From Dependencies to Evlution Strtegies Possile strtegies: 1 let the user define the evlution order 2 utomtic strtegy sed on the dependencies: use locl dependencies to determine which ttriutes to compute

More information

Definition of Regular Expression

Definition of Regular Expression Definition of Regulr Expression After the definition of the string nd lnguges, we re redy to descrie regulr expressions, the nottion we shll use to define the clss of lnguges known s regulr sets. Recll

More information

Efficient K-NN Search in Polyphonic Music Databases Using a Lower Bounding Mechanism

Efficient K-NN Search in Polyphonic Music Databases Using a Lower Bounding Mechanism Efficient K-NN Serch in Polyphonic Music Dtses Using Lower Bounding Mechnism Ning-Hn Liu Deprtment of Computer Science Ntionl Tsing Hu University Hsinchu,Tiwn 300, R.O.C 886-3-575679 nhliou@yhoo.com.tw

More information

CS481: Bioinformatics Algorithms

CS481: Bioinformatics Algorithms CS481: Bioinformtics Algorithms Cn Alkn EA509 clkn@cs.ilkent.edu.tr http://www.cs.ilkent.edu.tr/~clkn/teching/cs481/ EXACT STRING MATCHING Fingerprint ide Assume: We cn compute fingerprint f(p) of P in

More information

Information Retrieval and Organisation

Information Retrieval and Organisation Informtion Retrievl nd Orgnistion Suffix Trees dpted from http://www.mth.tu.c.il/~himk/seminr02/suffixtrees.ppt Dell Zhng Birkeck, University of London Trie A tree representing set of strings { } eef d

More information

Inference of node replacement graph grammars

Inference of node replacement graph grammars Glley Proof 22/6/27; :6 File: id293.tex; BOKCTP/Hin p. Intelligent Dt Anlysis (27) 24 IOS Press Inference of node replcement grph grmmrs Jcek P. Kukluk, Lwrence B. Holder nd Dine J. Cook Deprtment of Computer

More information

Unit #9 : Definite Integral Properties, Fundamental Theorem of Calculus

Unit #9 : Definite Integral Properties, Fundamental Theorem of Calculus Unit #9 : Definite Integrl Properties, Fundmentl Theorem of Clculus Gols: Identify properties of definite integrls Define odd nd even functions, nd reltionship to integrl vlues Introduce the Fundmentl

More information

CS143 Handout 07 Summer 2011 June 24 th, 2011 Written Set 1: Lexical Analysis

CS143 Handout 07 Summer 2011 June 24 th, 2011 Written Set 1: Lexical Analysis CS143 Hndout 07 Summer 2011 June 24 th, 2011 Written Set 1: Lexicl Anlysis In this first written ssignment, you'll get the chnce to ply round with the vrious constructions tht come up when doing lexicl

More information

GENERATING ORTHOIMAGES FOR CLOSE-RANGE OBJECTS BY AUTOMATICALLY DETECTING BREAKLINES

GENERATING ORTHOIMAGES FOR CLOSE-RANGE OBJECTS BY AUTOMATICALLY DETECTING BREAKLINES GENEATING OTHOIMAGES FO CLOSE-ANGE OBJECTS BY AUTOMATICALLY DETECTING BEAKLINES Efstrtios Stylinidis 1, Lzros Sechidis 1, Petros Ptis 1, Spiros Sptls 2 Aristotle University of Thessloniki 1 Deprtment of

More information

Graphs with at most two trees in a forest building process

Graphs with at most two trees in a forest building process Grphs with t most two trees in forest uilding process rxiv:802.0533v [mth.co] 4 Fe 208 Steve Butler Mis Hmnk Mrie Hrdt Astrct Given grph, we cn form spnning forest y first sorting the edges in some order,

More information

12-B FRACTIONS AND DECIMALS

12-B FRACTIONS AND DECIMALS -B Frctions nd Decimls. () If ll four integers were negtive, their product would be positive, nd so could not equl one of them. If ll four integers were positive, their product would be much greter thn

More information

Suffix trees, suffix arrays, BWT

Suffix trees, suffix arrays, BWT ALGORITHMES POUR LA BIO-INFORMATIQUE ET LA VISUALISATION COURS 3 Rluc Uricru Suffix trees, suffix rrys, BWT Bsed on: Suffix trees nd suffix rrys presenttion y Him Kpln Suffix trees course y Pco Gomez Liner-Time

More information

A New Learning Algorithm for the MAXQ Hierarchical Reinforcement Learning Method

A New Learning Algorithm for the MAXQ Hierarchical Reinforcement Learning Method A New Lerning Algorithm for the MAXQ Hierrchicl Reinforcement Lerning Method Frzneh Mirzzdeh 1, Bbk Behsz 2, nd Hmid Beigy 1 1 Deprtment of Computer Engineering, Shrif University of Technology, Tehrn,

More information

CS321 Languages and Compiler Design I. Winter 2012 Lecture 5

CS321 Languages and Compiler Design I. Winter 2012 Lecture 5 CS321 Lnguges nd Compiler Design I Winter 2012 Lecture 5 1 FINITE AUTOMATA A non-deterministic finite utomton (NFA) consists of: An input lphet Σ, e.g. Σ =,. A set of sttes S, e.g. S = {1, 3, 5, 7, 11,

More information
