Meaningful Change Detection in Structured Data.

Sudarshan S. Chawathe, Hector Garcia-Molina
Computer Science Department, Stanford University, Stanford, California 94305
{chaw,hector}@cs.stanford.edu

Abstract

Detecting changes by comparing data snapshots is an important requirement for difference queries, active databases, and version and configuration management. In this paper we focus on detecting meaningful changes in hierarchically structured data, such as nested-object data. This problem is much more challenging than the corresponding one for relational or flat-file data. In order to describe changes better, we base our work not just on the traditional "atomic" insert, delete, and update operations, but also on operations that move an entire sub-tree of nodes, and that copy an entire sub-tree. These operations allow us to describe changes in a semantically more meaningful way. Since this change detection problem is NP-hard, in this paper we present a heuristic change detection algorithm that yields close to "minimal" descriptions of the changes, and that has fewer restrictions than previous algorithms. Our algorithm is based on transforming the change detection problem to a problem of computing a minimum-cost edge cover of a bipartite graph. We study the quality of the solution produced by our algorithm, as well as the running time, both analytically and experimentally.

1 Introduction

Detection of changes between data structures is an important function in many applications. For example, in the World-Wide Web an analyst may be interested in knowing how a competitor's site has changed since it was last visited. This may be achieved by saving a snapshot of the previous HTML pages at the site (something that most browsers do for efficiency anyway). In a CAD design environment, an engineer may wish to understand the differences between two related but concurrently developed chip designs.
In a distributed file system, an administrator may need to detect differences between two mirror file systems that became partitioned and independently modified. In a warehousing environment, the changes at a site need to be identified so that a materialized view can be incrementally maintained.

Footnote: This work was supported by the Air Force Wright Laboratory Aeronautical Systems Center under DARPA Contract F33615-93-1-1339, by the Department of the Air Force Rome Laboratories under DARPA Contract F30602-95-C-0119, and by equipment grants from IBM Corporation, Digital Equipment Corporation, and Sun Microsystems.

In this paper we present an efficient algorithm, MH-DIFF, for meaningful change detection between two hierarchically structured data snapshots, or trees. The key word here is meaningful (the "M" in the name). That is, our goal is to portray the changes between two trees in a succinct and descriptive way. As is commonly done, we portray the changes as an edit script that gives the sequence of operations needed to transform one tree into another. However, in this paper we use a richer set of operations than has ever been used before, and this leads, we believe, to much higher quality edit scripts. In particular, we use move and copy operations, in addition to the more traditional insert, delete, and update operations. Thus, if a substructure (e.g., a section of text, a shift register) is moved to another location, our algorithm will report it as a single operation. If the substructure is copied (e.g., a second shift register is added which is identical to one already in the circuit), then our algorithm will identify it as such. Traditional change detection algorithms would report such changes as sequences of inserts and deletes (or simply inserts in the case of a copy), which do not convey the true meaning of the change.

Note that detecting moves and copies becomes more important if the moved or copied subtree is large. For instance, if we are comparing file systems, and a large directory with thousands of files is mounted elsewhere, we clearly do not wish to report the change as thousands of file deletes followed by thousands of file creations.
Also note that to detect moves and copies, it is essential that our algorithm understand the structure as well as the content of the data. Thus, our algorithm cannot treat the data as "flat" information, e.g., as files with records or relations with tuples. This means that techniques developed for flat change detection [Mye86, LGM96] are not applicable here.

Algorithm MH-DIFF has two additional important features:

- It does not rely on the existence of node (atomic object) identifiers that can match nodes in one tree to nodes in the other. In many applications such identifiers do not exist. For instance, sentences and paragraphs in text documents do not come with unique identifiers attached. Even when the nodes are stored in a database system (e.g., circuit components), we may be comparing copies with the same content but different identifiers. Thus, for full generality, MH-DIFF does not assume unique identifiers that span the two trees, and instead compares the contents of nodes to determine if they are related. (If the trees have such identifiers, MH-DIFF could easily take advantage of them, but we do not discuss that here.)

- Algorithm MH-DIFF is based on a fairly flexible cost model. Each operation in the repertoire is given a user-defined fixed cost, except for the update operation, whose cost is determined by a user-provided function that compares the values of two nodes. This gives end users great latitude in saying what types of edit scripts are preferable for an application.

There is a good reason why difference algorithms with the features we have described here have not been developed earlier, even though they are clearly desirable. The reason is the inherent complexity of the problem; one can show that the problem is NP-hard. Algorithm MH-DIFF provides a heuristic solution, which is based on transforming the problem to the "edge cover domain." That is, instead of working with edit scripts, the algorithm works with edge covers that represent how one set of nodes matches another set. In this transformation, the costs of the edit operations are translated into costs on the edges of the cover.

In an earlier paper [CRGMW96] we studied a much simpler version of the change detection problem. In that work we did not consider copy operations, we assumed that the number of duplicates of a node was very limited, we assumed ordered trees, and we assumed that nodes had "tags" that reflect the structural constraints on the input trees. (For example, nodes were tagged as, say, "paragraphs" or "sections," making it easier to match nodes.) All these restrictions made it much simpler to find a minimum-cost edit script, and indeed we developed an efficient algorithm that found a minimum-cost script. Here, on the other hand, we drop these restrictions, and introduce copy operations.
This leads to an algorithm that is very different from the one in [CRGMW96], and that yields a heuristic solution in worst-case $O(n^3)$ time, where $n$ is the number of nodes, but most often in roughly $O(n^2)$ time. In Section 7 we compare MH-DIFF in more detail to our earlier work, as well as to other work on change detection.

Footnote: The NP-hardness of the problem follows by reduction from the "exact cover by three-sets" problem.

2 Model and Problem Definition

We use rooted, labeled trees as our model for structured data. These are trees in which each node $n$ has a label $l(n)$ that is chosen from an arbitrary domain $L$. The problem of snapshot change detection in structured data is thus the problem of finding a way to edit the tree representation of one snapshot to that of the other. We denote a tree $T$ by its nodes $N$, the parent function $p$, and the labeling function $l$, and write $T = (N, p, l)$. The children of a node $n \in N$ are denoted by $C(n)$. We begin by defining the tree edit operations that we consider. Since there are many ways to transform one tree to another using these edit operations, we define a cost model for these edit operations, and then define the problem of finding a minimum-cost edit script that transforms one tree to another.

2.1 Edit Operations and Edit Scripts

In the following, we will assume that an edit operation $e$ is applied to $T_1 = (N_1, p_1, l_1)$, and produces the tree $T_2 = (N_2, p_2, l_2)$. We write this as $T_1 \xrightarrow{e} T_2$. We consider the following six edit operations:

Insertion: Intuitively, an insertion operation creates a new tree node with a given label, and places it at a given position in the tree. The position of the new node $n$ in the tree is specified by giving its parent node $p$ and a subset $C$ of the children of $p$. The result of this operation is that $n$ is a child of $p$, and the nodes $C$, which were originally children of $p$, are now children of the newly inserted node $n$. Formally, an insertion operation is denoted by $\mathrm{ins}(n, v, p, C)$, where $n$ is the (unique) identifier of the new node, $v$ is the label of the new node, $p \in N_1$ is the node that is to be the parent of $n$, and $C \subseteq C(p)$ is the set of nodes that are to be the children of $n$.
When applied to $T_1 = (N_1, p_1, l_1)$, we get a tree $T_2 = (N_2, p_2, l_2)$, where $N_2 = N_1 \cup \{n\}$, $p_2(n) = p$, $p_2(c) = n \;\forall c \in C$, $p_2(c) = p_1(c) \;\forall c \in N_1 - C$, $l_2(n) = v$, and $l_2(m) = l_1(m) \;\forall m \in N_1$. Due to space constraints, we describe the remaining edit operations only informally below; the formal definitions are in [CGM97].

Deletion: This operation is the inverse of the insertion operation. Intuitively, $\mathrm{del}(n)$ causes $n$ to disappear from the tree; the children of $n$ are now the children of the (old) parent of $n$. The root of the tree cannot be deleted.

Update: The operation $\mathrm{upd}(n, v)$ changes the label of the node $n$ to $v$.

Move: A move operation $\mathrm{mov}(n, p)$ moves the subtree rooted at $n$ to another position in the tree. The new position is specified by giving the new parent of the node, $p$. The root cannot be moved.

Copy: A copy operation $\mathrm{cpy}(m, p)$ copies the subtree rooted at $m$ to another position. The new position is specified by giving the node $p$ that is to be the parent of the new copy. The root cannot be copied.

Glue: This operation is the inverse of a copy operation. Given two nodes $n_1$ and $n_2$ such that the subtrees rooted at $n_1$ and $n_2$ are isomorphic, $\mathrm{glu}(n_1, n_2)$ causes the subtree rooted at $n_1$ to disappear. (It is conceptually "united" with the subtree rooted at $n_2$.) The root cannot be glued.

Although the glu operation may seem unusual, note that it is a natural choice for an edit operation given the existence of the cpy operation. As we will see in Example 2.1, inverting an edit script containing a cpy operation results in an edit script with a glu operation. This symmetry in the structure of edit operations is useful in the design of our algorithms.

In addition to the above tree edit operations, one may wish to consider operations such as a subtree delete operation that deletes all nodes in a given subtree. Similarly, one could define a subtree merge operation that merges two or more subtrees. We do not consider such more complex edit operations in this paper, but note that some of these operations (e.g., subtree deletes) may be detected by post-processing the output of our algorithm.

We define an edit script to be a sequence of zero or more edit operations that can be applied in the order in which they occur in the sequence. That is, given a tree $T_0$, a sequence of edit operations $E = e_1, e_2, \ldots, e_k$ is an edit script if there exist trees $T_i$, $1 \le i \le k$, such that $T_{i-1} \xrightarrow{e_i} T_i$ for $1 \le i \le k$. We say that the edit script $E$ transforms $T_0$ to $T_k$, and write $T_0 \xrightarrow{E} T_k$.

[Figure 1: Edit operations on labeled trees]

Example 2.1 Consider the tree $T_1$ depicted in Figure 1. We represent the identifier of each node by the number inside the circle representing the node. The label of each node is depicted to the right of the node. Thus, the root of the tree $T_1$ has identifier 1, with its label shown to its right. Figure 1 shows how $T_1$ is transformed by applying the edit script $E_1 = (\mathrm{ins}(11, g, 1, \{9\}), \mathrm{mov}(2, 6), \mathrm{cpy}(7, 1))$ to $T_1$. Similarly, if we start with the tree $T_2$ in the figure, the edit script $E_2 = (\mathrm{glu}(12, 7), \mathrm{mov}(2, 1), \mathrm{del}(11))$ transforms it back to $T_1$. We write $T_1 \xrightarrow{E_1} T_2$ and $T_2 \xrightarrow{E_2} T_1$.

2.2 Cost Model

Given a pair of trees, there are, in general, several edit scripts that transform one tree to the other. For example, there is the trivial edit script that deletes all the nodes of one tree and then inserts all the nodes of the second tree. There are many other edit scripts that, informally, do more work than seems necessary. Formally, we would like to find an edit script that is "minimal" in the sense that it does no more work than what is absolutely required. To this end, we define a cost model for edit operations and edit scripts.

There are two major criteria for choosing a cost model. Firstly, the cost model should accurately capture the domain characteristics of the data being considered.
For example, if we are comparing the schematics for two printed-circuit boards, we may prefer an edit script that has as few inserts as possible, and instead describes changes with moves and copies of the old components. However, if we are comparing text documents, we may prefer to see a paragraph as a new insertion, rather than a description of how it was assembled from bits and pieces of sentences from the old document. Secondly, the cost model should be simple to specify, and should require little effort from the user. For example, a cost model that requires the user to specify dozens of parameters is not desirable by this criterion, even though it may accurately model the domain.

Another issue is the trade-off between generality of the cost model and difficulty in computing a minimum-cost edit script. For example, a very general cost model would have a user-specified function to determine the cost of each edit operation, based on the type of the edit operation, as well as the particular nodes on which it operates. However, such a model is not amenable to the design of efficient algorithms for computing the minimum-cost edit script, since it does not permit us to reason about the relative costs of the possible edit operations.

With the above criteria in mind, we propose a simple cost model in which the costs of insertion, deletion, move, copy, and glue operations are given by constants $c_i$, $c_d$, $c_m$, $c_c$, and $c_g$, respectively. Furthermore, given the symmetry between ins and del, and between cpy and glu, it is reasonable to use $c_i = c_d$ and $c_c = c_g$. Since, intuitively, a mov operation causes a smaller change than either cpy or glu, it is also reasonable to use $c_m < c_c$. Note, however, that our algorithms do not depend on these relationships between the cost parameters. The cost of an update operation depends on the old and new values of the label being updated; that is, $c(\mathrm{upd}(n, v)) = c_u(v', v)$, where $v'$ is the old label of $n$, and $c_u$ is a domain-dependent function that returns a nonnegative real number. Finally, the cost of an edit script $E$, denoted by $c(E)$, is defined as the sum of the costs of the edit operations in $E$.
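The cost model just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tuple encoding of operations, the numeric constants, and the `update_cost` function are hypothetical choices (the text only asks that the fixed costs be constants and that the update cost be a nonnegative, domain-dependent function). An update is encoded here with both its old and new label so that its cost can be evaluated without consulting the tree.

```python
# Fixed per-operation costs c_i, c_d, c_m, c_c, c_g.
# The numeric values are assumptions chosen so that c_i = c_d, c_c = c_g, c_m < c_c.
FIXED_COSTS = {"ins": 1.0, "del": 1.0, "mov": 1.0, "cpy": 2.0, "glu": 2.0}


def update_cost(old_label, new_label):
    """Hypothetical c_u: free if the label is unchanged, one unit otherwise."""
    return 0.0 if old_label == new_label else 1.0


def op_cost(op, fixed=FIXED_COSTS, c_u=update_cost):
    """Cost of one edit operation, encoded as a tuple whose first item names it."""
    kind = op[0]
    if kind == "upd":
        # Assumed encoding: ("upd", node, old_label, new_label).
        _, _node, old_label, new_label = op
        return c_u(old_label, new_label)
    return fixed[kind]


def script_cost(script, fixed=FIXED_COSTS, c_u=update_cost):
    """c(E): the sum of the costs of the operations in the edit script."""
    return sum(op_cost(op, fixed, c_u) for op in script)


# The edit script of Example 2.1 under the assumed constants.
example_script = [("ins", 11, "g", 1, (9,)), ("mov", 2, 6), ("cpy", 7, 1)]
print(script_cost(example_script))  # prints 4.0
```

Passing a different `fixed` table or `c_u` function is how a user would express a domain preference, for instance making inserts expensive for circuit schematics.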
That is, $c(E) = \sum_{d \in E} c(d)$.

Problem Statement: Given two rooted, labeled trees $T_1$ and $T_2$, find an edit script $E$ such that $E$ transforms $T_1$ to a tree that is isomorphic to $T_2$, and such that for every edit script $E'$ with this property, $c(E') \ge c(E)$.

3 Method Overview

In this section, we present an overview of algorithm MH-DIFF for computing a minimum-cost edit script between two trees. We present our algorithm informally using a running example; the details are deferred to later sections.

[Figure 2: The trees for the running example in Section 3]

Consider the two trees depicted in Figure 2. We would like to find a minimum-cost edit script that transforms tree $T_1$ into tree $T_2$. The reader may observe that these trees are isomorphic to the initial and final trees from Example 2.1 in Section 2. Note, however, that there is no correspondence between the node identifiers of $T_1$ and $T_2$ in Figure 2. This is because in Example 2.1 we applied a known edit script to a tree, transforming it to another tree in the process, whereas in this section, we are trying to find an edit script, given two trees with no information on the relationship between their nodes.

Therefore, our first step consists of finding a correspondence between the nodes of the two given trees. For example, consider node 8 in Figure 2. We want to find the node in $T_2$ that corresponds to this node in $T_1$. The dashed lines in Figure 2 represent some of the possibilities. Intuitively, we can see that matching node 8 to node 51 does not seem like a good idea, since not only do the labels of the two nodes differ, but the two nodes also have very different locations in their respective trees; node 8 is a leaf node, while node 51 is the root node. Similarly, we may intuitively argue that matching node 8 to node 62 seems promising, since they are both leaf nodes and their labels match. However, note that matching nodes based simply on their labels ignores the structure of the trees, and thus is not, in general, the best choice. We make this intuitive notion of correspondence between nodes more precise below.

3.1 The Induced Graph

Consider the complete bipartite graph $B$ consisting of the nodes of $T_1$ on one side, and the nodes of $T_2$ on the other, plus the special nodes $\oplus$ (on $T_1$'s side) and $\ominus$ (on $T_2$'s side). We call $B$ the induced graph of $T_1$ and $T_2$. The dashed lines in Figure 2 correspond to a few edges of the induced graph. Intuitively, we would like to find a subset $K$ of the edges of $B$ that tells us the correspondence between the nodes of $T_1$ and $T_2$. If an edge connects a node $m \in T_1$ to a node $n \in T_2$, it means that $n$ was "derived" from $m$. (For example, $n$ may be a copy of $m$.) We say $m$ is matched to $n$. A node matched to the special node $\oplus$ indicates that it was inserted, and a node matched to $\ominus$ indicates that it was deleted. Note that this matching between nodes need not be one-to-one; a node may be matched to more than one other node. (For example, referring to Figure 2, node 7 may be matched to both node 52 and node 61.) The only restriction is that a node be matched to at least one other node.
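The induced graph and the "matched to at least one other node" requirement can be made concrete with a short Python sketch. The names `PLUS`, `MINUS`, `induced_graph`, and `is_edge_cover` are hypothetical, and the tiny node sets in the example are invented for illustration rather than taken from Figure 2.

```python
from itertools import product

# Stand-ins for the two special nodes: PLUS sits on T1's side (a T2 node
# matched to it was inserted), MINUS on T2's side (a T1 node matched to it
# was deleted).
PLUS, MINUS = "(+)", "(-)"


def induced_graph(nodes1, nodes2):
    """Return (U, V, edges) of the complete bipartite induced graph."""
    U = list(nodes1) + [PLUS]
    V = list(nodes2) + [MINUS]
    return U, V, set(product(U, V))


def is_edge_cover(U, V, K):
    """True if every node of U and V is incident on at least one edge of K."""
    covered_u = {u for u, _ in K}
    covered_v = {v for _, v in K}
    return set(U) <= covered_u and set(V) <= covered_v


U, V, edges = induced_graph([1, 2], [51, 52])

# The trivial cover: delete every T1 node, insert every T2 node.
trivial = {(1, MINUS), (2, MINUS), (PLUS, 51), (PLUS, 52)}
# A one-to-one matching; the special nodes are covered by the edge between them.
matching = {(1, 51), (2, 52), (PLUS, MINUS)}
# Not a cover: node 52 and both special nodes are left uncovered.
partial = {(1, 51), (2, 51)}

print(len(edges), is_edge_cover(U, V, trivial),
      is_edge_cover(U, V, matching), is_edge_cover(U, V, partial))
```

Because a node may appear in several edges of K, the same check also accepts covers in which one node of the first tree is matched to two nodes of the second, which is how copies show up.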
Thus, finding the correspondence between the nodes of two trees consists essentially of finding an edge cover of their induced graph. (An edge cover of a graph is a subset $K$ of the edges of the graph such that every node in the graph is incident on at least one edge in $K$.) The induced graph has a large number of edge covers (this number being exponential in the number of nodes). However, we may intuitively observe that most of these possible edge covers of $B$ are undesirable. For example, an edge cover that maps all nodes in $T_1$ to $\ominus$, and all nodes in $T_2$ to $\oplus$, seems like a bad choice, since it corresponds to deleting all the nodes of $T_1$ and then inserting all the nodes of $T_2$. We will define the correspondence between an edge cover of an induced graph and an edit script for the underlying trees formally in Section 4, where we also describe how to compute an edit script corresponding to an edge cover. For now, we simply note that, given an edge cover of the induced graph, we can compute a corresponding edit script for the underlying trees. Hence, we would like to select an edge cover of the induced graph that corresponds to a minimum-cost edit script.

3.2 Pruning the Induced Graph

We noted earlier that many of the potential edge covers of the induced graph are undesirable because they correspond to expensive and undesirable edit scripts. Intuitively, we may therefore expect a substantial number of the edges of the induced graph to be extraneous. Our next step, therefore, consists of removing (pruning) as many of these extraneous edges as possible from the induced graph, by using some pruning rules. The pruning rules that we use are conservative, meaning that they remove only those edges that we can be sure are not needed by a minimum-cost edit script. We discuss pruning rules in detail in Section 5.3, presenting only a simple example here.

As an example of the action of a simple pruning rule, consider the edge $e = [5, 53]$, representing the correspondence between nodes 5 and 53 in Figure 2. Suppose that the cost $c_u(l(5), c)$ of updating the label of node 5 to the label $c$ of node 53 is 3 units.
Furthermore, let the cost of inserting a node and deleting a node be 1 unit each. Then we can safely prune the edge $[5, 53]$ because, intuitively, given any edge cover $K$ that includes the edge $e$, we can generate another edge cover that excludes $e$, and that corresponds to an edit script that is at least as good as the one corresponding to $K$. As an illustration of such pruning, consider the edge cover $K_2 = (K - \{e\}) \cup \{[5, \ominus], [\oplus, 53]\}$. This edge cover corresponds to an edit script that deletes node 5 and inserts node 53. These two operations cost a total of 2 units, which is less than the cost of the update operation suggested by the edge $e$ in edge cover $K$. We therefore conclude that the edge $[5, 53]$ in our running example may safely be pruned. In Section 5.3 we present Pruning Rule 2, which is a generalization of this example.

[Figure 3: The pruned induced graph for the trees in Figure 2]

3.3 Finding an Edge Cover

By applying the pruning rules (Section 5.3) to the induced graph of our running example, say we obtain the pruned induced graph depicted in Figure 3 (ignore for the present the difference between dotted and solid lines in the figure). Although the pruned induced graph typically has far fewer edges than the original induced graph does, it may still contain more edges than needed to form an edge cover. In Section 4.2 we will see that we need only consider edge covers that are minimal; that is, edge covers that are not proper supersets of any edge cover. In other words, we would like to remove from the pruned induced graph those edges that are not needed to cover nodes. For example, in the pruned induced graph shown in Figure 3, having all four of the edges $[7, 61]$, $[7, 63]$, $[9, 61]$, and $[9, 63]$ is unnecessary; we may remove either $[7, 63]$ and $[9, 61]$, or $[7, 61]$ and $[9, 63]$. However, it is not possible to decide a priori which of these options is the better one; that is, it is not obvious which choice would lead to an edit script of lower cost.
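The cost comparison behind the pruning example of Section 3.2 can be written as a small Python check. This is a deliberately simplified sketch: it reproduces only the comparison used in the running example (update cost versus one delete plus one insert) and is not the paper's Pruning Rule 2, which generalizes it. The function names and the second edge in the example are hypothetical.

```python
def can_prune(update_cost, c_i=1.0, c_d=1.0):
    """True if deleting one endpoint and inserting the other costs no more
    than the update the edge stands for, so dropping the edge cannot hurt."""
    return c_d + c_i <= update_cost


def prune(edge_update_costs, c_i=1.0, c_d=1.0):
    """Keep only the edges that the simplified check cannot rule out."""
    return {edge: cost for edge, cost in edge_update_costs.items()
            if not can_prune(cost, c_i, c_d)}


# Edge [5, 53] from the running example costs 3 units to realize as an update;
# the second edge and its cost are made up for contrast.
candidates = {(5, 53): 3.0, (8, 62): 0.0}
print(prune(candidates))  # only the edge (8, 62) survives
```

With unit insert and delete costs, the edge [5, 53] is dropped because 1 + 1 = 2 is no more than 3, which is the argument made in the text.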
With pruning, on the other hand, there was no doubt that certain edges could be removed. One way to decide among these options is to enumerate all possible minimal edge covers of the pruned induced graph, and the edit script corresponding to each one (using the method described later in Section 5), and to pick the one with the least cost. However, given the exponentially large number of edge covers, this is obviously not an efficient algorithm.

To compute an optimal edge cover efficiently, we need to be able to determine how much each edge in the edge cover contributes to the total cost of an edit script corresponding to an edge cover containing it. That is, we need to distribute the cost of the edit script corresponding to an edge cover over the individual edges of the edge cover. Once we have a cost defined for each edge in the pruned induced graph, we can find a minimum-cost edge cover using standard techniques based on reducing the edge cover problem to a weighted matching problem [PS82, Law76]. For example, if the edges $[7, 61]$, $[7, 63]$, $[9, 61]$, and $[9, 63]$ have costs 0, 1.3, 0.2, and 2.4, respectively, then we generate an edge cover that includes $[7, 61]$ and $[9, 61]$, and excludes $[7, 63]$ and $[9, 63]$. Note, however, that such a reduction of the edit script problem to an edge cover (and thus, weighted matching) problem cannot be exact, given the hardness of the edit script problem. Indeed, our method of assigning costs to edges of the induced graph (Section 5.1) is only approximate, and thus the minimum-cost edge cover is not guaranteed to produce the best solution for the edit script problem.

3.4 Generating the Edit Script

Returning to the pruned induced graph of our running example, let us assume that we have gone through the process of determining the cost of each edge, and have computed a minimum-cost edge cover according to these costs, obtaining the edge cover represented by the bold edges in Figure 3. Our next step consists of using this edge cover to compute an edit script that transforms the tree $T_1$ to the tree $T_2$. Our algorithm CtoS (Cover-to-Script) for this purpose is described in Section 5.
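To make the edge-cover step of Section 3.3 concrete, the following Python sketch finds a minimum-cost edge cover once every edge has a fixed cost. It uses exhaustive enumeration, which is a swapped-in technique for illustration only: the text relies on a reduction to weighted matching [PS82, Law76], and enumeration is exponential in the number of edges. The function name and the example graph with its costs are hypothetical.

```python
from itertools import combinations


def min_cost_edge_cover(U, V, cost):
    """Brute-force minimum-cost edge cover of a bipartite graph.

    cost maps each available edge (u, v) to a nonnegative cost.
    Returns (total_cost, set_of_edges), or None if no cover exists.
    """
    edges = sorted(cost)
    best = None
    for size in range(1, len(edges) + 1):
        for K in combinations(edges, size):
            covers_u = {u for u, _ in K} >= set(U)
            covers_v = {v for _, v in K} >= set(V)
            if covers_u and covers_v:
                total = sum(cost[e] for e in K)
                if best is None or total < best[0]:
                    best = (total, set(K))
    return best


# A made-up four-edge graph: two nodes per side, every pair connected.
U, V = ["x1", "x2"], ["y1", "y2"]
cost = {("x1", "y1"): 1, ("x1", "y2"): 4, ("x2", "y1"): 2, ("x2", "y2"): 9}

print(min_cost_edge_cover(U, V, cost))
```

In this example the two cheapest edges share the endpoint y1 and so leave y2 uncovered; the cheapest cover instead pairs x1 with y2 and x2 with y1 for a total of 6, which shows why the choice cannot be made edge by edge.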
Here, we briefly illustrate some of the ideas used by the algorithm by considering its action on an edge in the edge cover for our running example.

Footnote: The reduction of the edit script problem to an edge cover problem cannot be exact unless P = NP, since we are considering a polynomial-time reduction.

[Figure 4: Annotating edges in the edge cover of Figure 3]

Consider the edge $e = [7, 52]$ of the edge cover depicted by the bold lines in Figure 3. In Figure 4, we depict this edge in relation to the original trees. (We also depict two other edges from the edge cover.) The edge cover edges are shown as dashed lines in Figure 4. We observe that there is one other edge in the edge cover that is incident on node 7, viz. $[7, 61]$, suggesting that node 7 was copied either directly, or indirectly (due to one of its ancestors being copied). Furthermore, we note that the parent (node 4) of node 7 is matched to the parent (node 55) of node 61 (i.e., the edge $[4, 55]$ exists in the edge cover), while the parent of node 52 is not matched to the parent of node 7. This matching of the parents suggests that node 61 is the original instance of node 7, while node 52 is the copy. We therefore generate a copy operation that copies the subtree rooted at node 7 to the location of node 52. A convenient way of depicting this copy operation is by annotating the corresponding edge ($[7, 52]$ in our example) with a cpy mark; this scheme allows us to talk about edit operations without having to refer to explicit node identifiers. Edges that do not correspond to any edit operation (e.g., $[6, 57]$ in our example) are annotated with a nil mark. In the sequel, we will use such edge annotations interchangeably with the actual edit operations that they represent.

Consider next the edges $[8, 53]$ and $[8, 62]$. Although both these edge cover edges are incident on node 8, neither of them corresponds to a cpy operation, since the copy 53 of node 8 is generated "for free" when node 7 is copied. Therefore, both these edges are annotated nil.
Proceeding thusly, we nnotte ll the edges in the edge cover of our running exmple, to otin the nnotted edge cover depicted in Figure 5, which shows only the edges with non-nil nnottions, for clrity. These nnottions correspond to the edit script (ins(g; ; f9g); mov(2; 6); cpy(7; )). We see tht this edit script is identicl to the one in Exmple 2., which hppens to e minimum cost edit script for our exmple. Of course, the ove edit opertions my lso e listed in the order (mov(2; 6); cpy(7; ); ins(g; ; f9g)). Both edit scripts hve the sme nl eect, nd hve the sme cost. In generl, ll edit scripts corresponding to set of nnotted edges hve the sme overll eect nd the sme cost. d 3 T 2 e 4 5 6 f mov cc 7 8 c 9 d 0 ins cpy 5 T2 g 52 cc e 55 60 c 53 56 57 f cc 6 63 nil d 59 58 c 62 d 64 Figure 5: Annotted edges of the edge cover of Figure 3 For the ove exmple mh-diff produces minimumcost edit script, ut it my sometimes not nd one with glolly minimum cost. In Section 6 we evlute how often this hppens nd we riey discuss how one could perform dditionl serching in the neighorhood of the script found y mh-diff. This concludes the overview of mh-diff. To summrize, the process consists of constructing n induced grph from the input trees, pruning the induced grph, nding minimum-cost edge cover of the pruned induced grph, nd nlly, using this edge cover to otin n edit script. In the following sections, we descrie these phses in detil. For ese of presenttion, we present these phses in different order thn the order in which they re performed. In prticulr, in Section 4, we egin y formlly dening the correspondence etween nd edit script nd n edge cover of the induced grph. In tht section, we lso descrie the

method for generating an edit script from an edge cover of the induced graph. In Section 5, we describe how the cost of an edit script is distributed over the edges of the corresponding edge cover of the induced graph. In that section, we also describe how this cost function is approximated by deriving upper and lower bounds on the cost of an edge of the induced graph, and how these bounds are used to prune the induced graph. Since finding a minimum-cost edge cover for a bipartite graph with fixed edge costs is a problem that has been previously studied in the literature [PS82, Law76], we do not present the details in this paper.

4 Edge Covers and Edit Scripts

In this section, we describe algorithm CtoS, which generates an edit script between two trees, given an edge cover of their induced graph. Before we can describe this algorithm, we need to understand the relationship between edit scripts between two trees and edge covers of their induced graph. Therefore, we first define the edge cover induced by an edit script. That is, we describe how, given an edit script between two trees, we generate an edge cover of the induced graph. (Note that this process is the reverse of the process the algorithm CtoS performs. However, a definition of this reverse process is needed for the description of the algorithm.)

4.1 Edge Cover Induced by an Edit Script

In Section 3, we introduced the graph induced by two trees $T_1$ and $T_2$ as the complete bipartite graph $B = (U, V, U \times V)$, with $U = N_1 \cup \{\oplus\}$ and $V = N_2 \cup \{\ominus\}$ (where $N_1$ and $N_2$ are the nodes of $T_1$ and $T_2$, respectively). Let $E$ be an edit script that transforms $T_1$ to $T_2$; that is, $T_1 \xrightarrow{E} T_2$. We now define the edge cover $K(E)$ induced by $E$. Intuitively, we obtain $K(E)$ as follows. Create a copy $T_3$ of $T_1$, and introduce an edge between each node in $T_1$ and its copy in $T_3$. Apply the edit script to $T_3$, moving, copying, etc. the end-points of the edges with the nodes they are attached to as nodes are moved, copied, etc. Thus, when a node $n \in T_3$ is copied, producing node $n'$, any edge $[m, n]$ is split to produce a new edge $[m, n']$.
The other edit operations are handled analogously. Furthermore, an edge between the special nodes $\oplus$ and $\ominus$ is added initially, and removed when it is no longer needed to cover either $\oplus$ or $\ominus$. Due to space limitations, we illustrate the definition of the edge cover induced by an edit script informally using an example; the formal definition is in [CGM97].

[Figure 6: Example 4.1: the initial edge cover. All edges [n, n+30] exist implicitly.]

Example 4.1 Consider the edit script from Example 2.1, and the initial tree $T_1$ from Figure 1. As described above, our first step consists of creating a copy $T_3$ of $T_1$, and adding an edge between each node of $T_1$ and its counterpart in $T_3$. We also add the special nodes $\oplus$ and $\ominus$, along with an edge connecting them. The result of this step is depicted in Figure 6. For clarity in presentation, the edges between the nodes of $T_1$ and their counterparts in $T_3$ are not shown in Figure 6; instead, we encode these edges using the node identifiers of $T_1$ and $T_3$. That is, as indicated in the figure, imagine an edge $[n, n+30]$ for all $n = 1, \ldots, 10$.

[Figure 7: Example 4.1: the final edge cover. All edges [n, n+30] exist implicitly.]

Our next step consists of applying the edit script from Example 2.1 to the tree $T_3$. To enable this application of the edit script for $T_1$ to $T_3$, we change the node identifiers in the edit script from the identifiers of the nodes of $T_1$ to those of $T_3$, obtaining $E$ = (ins(41, g, 31, {39}), mov(32, 36), cpy(37, 31)). As a result of the ins operation, a node with identifier 41 and label g is inserted as a child of node 31, and node 39 is made its child. In addition, we add an edge [$\oplus$, 41] to the induced edge cover. Next, consider the action of the mov operation, which moves node 32 to become a child of node 36. This operation does not add any new edges to the edge cover. (The existing edges [2, 32] and [3, 33] continue to exist.)
Finally, the cpy operation creates a copy of the subtree rooted at node 37, and inserts this copy as a child of node 31. In addition, the edges [7, 42] and [8, 43] are added to the edge cover. The result is depicted in Figure 7 (which also omits the edges $[n, n+30]$, $n = 1, \ldots, 10$, for clarity). Note that the transformed tree $T_3$ is now isomorphic to the tree $T_2$ in Example 2.1, so that essentially, we now have an edge cover of the induced graph of $T_1$ and $T_2$.

4.2 Using Edge Covers to Generate Edit Scripts

The goal of using an edge cover is that it should capture the essential aspects of an edit script; that is, no important information should be lost in going from an edit script to the edge cover induced by it. However, there are certain edit scripts for which this property does not hold. For example, consider an edit script $E_2$ that inserts a node $p$ as the parent of ten siblings (children of the same parent) $n_1, \ldots, n_{10}$, then moves $p$ to another location in the tree, and finally deletes $p$. The node $p$ is absent from both the initial tree and the final tree. Therefore, an edge cover of the initial and final trees contains no record of the temporary insertion of node $p$. Thus, we have lost some information in going from $E_2$ to the edge cover.

Is the fact that our edge covers cannot capture edit scripts like $E_2$ a problem? On the one hand, $E_2$ could be the minimum-cost edit script mh-diff is trying to find. For example, say that insert, delete, and move operations all cost one unit. The cost of $E_2$ would then be the cost of one insert, plus the

cost of one move, plus the cost of one delete, for a total cost of 3. If we do not use the "bulk move trick" that $E_2$ uses, we need to move each of $n_1, \ldots, n_{10}$ individually, for a cost of 10. Thus, $E_2$ could be the minimum-cost edit script, and if we rule it out, then mh-diff would miss it. On the other hand, scripts like $E_2$ do not represent transformations that are meaningful or intuitive to an end user. In other words, a user who saw $E_2$ would not understand why node $p$ was inserted, since it really has no function in the application. True, the costs provided by the user are intended to describe the desirability of edit operations, but if we use these numbers we can end up with "tricky" scripts like $E_2$ that are more confusing than helpful.

Another example of a potentially unintuitive edit script is the following: Consider an edit script $E_3$ that moves a node $n_1$ to become a child of another node $n_2$, then makes several copies of the subtree rooted at $n_2$ (thus making copies of $n_1$ as well), and finally deletes the original copy of $n_1$. This edit script moves $n_1$ to a place where it does not need to be (under $n_2$) only to generate free copies of $n_1$.

The cause of the unintuitive nature of the edit scripts described above is an interaction between different edit operations, which gives rise to a "compound" effect. For example, in the edit script $E_2$ above, the effect of the move operation is compounded because it acts on a node that was previously inserted. Similarly, in edit script $E_3$ above, the effects of the copy operations are compounded because they act on a subtree into which a node was previously moved. Our approach is to disallow such unintuitive compound effects.

A simple way of characterizing edit scripts that disallow undesirable compound effects is to require edit operations to occur in phases, and to order the phases appropriately. In the following discussion, we use the names ins, del, etc. to denote phases consisting of, respectively, ins operations, del operations, etc.
First, we require that the ins phase occur after the del phase, so that an edit script cannot first insert a node and then delete it. Next, we require the other edit phases (upd, mov, cpy, and glu) to occur after the del phase (so that nodes operated on by these phases cannot be later deleted), and before the ins phase (so that inserted nodes cannot be operated on by these phases). Furthermore, we require that the upd (respectively, mov) phase occur after the cpy phase and before the glu phase, so that an edit script cannot compound the effect of an upd (respectively, mov) operation by copying the updated node (and similarly for glues). These ordering constraints yield the following order of edit phases: del, cpy, upd, mov, glu, ins. (We chose the relative order of the upd and mov phases arbitrarily.) One additional restriction, not covered by the above ordering constraint, is the following: A node in a subtree operated on by a cpy operation cannot be operated on by a glu operation. We call edit scripts that satisfy these restrictions structured edit scripts. In the sequel, we consider only structured edit scripts.

Structured edit scripts have the following important property that allows us to consider only minimal edge covers in the sequel. (A minimal edge cover is an edge cover that is not a proper superset of any edge cover.)

Lemma 4.1 The edge cover induced by a structured edit script is minimal.

The reader may observe that, in addition to disallowing unintuitive compound effects, the above restrictions also disallow some intuitive sequences of operations. For example, a structured edit script cannot delete a node produced as a result of a cpy operation. Therefore, a structured edit script cannot copy a subtree containing 100 nodes if 99 of them are needed, because it would be unable to delete the unwanted copy of the 100th node. An analogous situation exists for ins and glu operations. Our algorithms [CGM97] actually do permit such deletions (called ghost deletions) after copies, and insertions (called ghost insertions) before glues.
For similar reasons, we also permit certain move operations to occur before the cpy phase. Furthermore, we allow a move or copy operation to a destination that is currently unavailable (e.g., because it is produced by a copy operation) to be "paused" until the destination becomes available. Lemma 4.1 remains true under these weaker restrictions.

We now describe how, given a minimal edge cover $K$ of the graph induced by trees $T_1$ and $T_2$, we compute a minimum-cost edit script corresponding to this edge cover. As explained in Section 3, we also represent the edit operations of such an edit script as annotations on the affected edges. Due to space constraints, we do not present the full details of our algorithm CtoS (cover-to-script) in this paper, and present instead a brief explanation of the basic ideas behind the algorithm. The detailed algorithm is presented in [CGM97].

The algorithm proceeds in phases that roughly reflect the phases of a structured edit script described above. We refer to edges belonging to the given edge cover $K$ as K-edges. We say two nodes are matched to each other if there is a K-edge connecting them.

The first phase of the algorithm is the delete phase, in which we generate an edit operation del($m$) for each node $m$ that is matched to the special node $\ominus$. We claim that any edit script that matches $m$ to $\ominus$ must contain this del operation, due to the following observations: Firstly, any node matched to $\ominus$ is absent from the final tree. Furthermore, there are only two ways in which a node can be made to disappear: either it is deleted explicitly, or it is glued to some other node. (We use here the fact that structured edit scripts cannot first glue a node to another and then delete the second node.) However, the second method will not result in $m$ matching $\ominus$ in the edge cover induced by the script; instead, $m$ will match the node to which it was glued. Therefore we can safely produce a del($m$) operation for all such nodes $m$.

The next phase of the algorithm handles copy operations. In particular, it looks for sets of two or more K-edges incident on a common node $m \in T_1$.
Note that from Lemma 4.1, and the observation that minimal edge covers cannot contain any path of length three, it follows that if $e = [m, n]$ is such an edge, there can be no other K-edge incident on $n$. We call such a set of edges a flower with base $m$. This set of edges represents copies of the node $m$. However, as we have seen in Section 3, some of the copies of $m$ could be produced as a result of some ancestor of $m$ being copied. We call such copies free copies of $m$. Our algorithm considers flowers in preorder of the base nodes. As copy operations are generated for some node $m$, we also keep track of the number of free copies of nodes in the copied subtree. Knowing the number of available free copies allows us to determine exactly which flowers correspond to explicit copy operations and which correspond to implicit (free) copies. Furthermore, any unused free copies are nodes that need to be deleted after the copy operation is performed. These are the ghost deletions we introduced above. Finally, note that a free copy may need to be moved to its final location; this situation is easily detected by checking whether the parents of the affected nodes match.
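The delete phase and the detection of flowers described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the CtoS algorithm of [CGM97]: it does not count free copies or generate ghost deletions. The function names, the string stand-ins for the special nodes, the deletion edge (12, ⊖), and the parent maps are assumptions made for the example; the remaining edges are the ones named in the running example.

```python
from collections import defaultdict

# Stand-ins for the two special nodes of the induced graph (names are assumptions).
OPLUS, OMINUS = "⊕", "⊖"


def delete_phase(cover):
    """Emit del(m) for every T1 node m that a K-edge matches to the special node ⊖."""
    return [("del", m) for m, n in cover if n == OMINUS and m != OPLUS]


def flowers(cover):
    """Group K-edges by their T1 endpoint; a flower is a base with two or more partners."""
    partners = defaultdict(list)
    for m, n in cover:
        if m != OPLUS and n != OMINUS:
            partners[m].append(n)
    return {m: ns for m, ns in partners.items() if len(ns) >= 2}


def parents_match(cover, parent1, parent2, m, n):
    """Check whether the parents of m (in T1) and n (in T2) are joined by a K-edge."""
    pm, pn = parent1.get(m), parent2.get(n)
    return pm is not None and pn is not None and (pm, pn) in set(cover)


# Edges named in the running example, plus an invented deletion edge (12, ⊖).
cover = [(4, 55), (6, 57), (7, 52), (7, 61), (8, 53), (8, 62), (12, OMINUS)]
parent1 = {7: 4, 8: 7}                      # parents in T1
parent2 = {61: 55, 52: 51, 53: 52, 62: 61}  # parents in T2 (assumed where not stated)

print(delete_phase(cover))                            # one del operation, for node 12
print(flowers(cover))                                 # flowers with bases 7 and 8
print(parents_match(cover, parent1, parent2, 7, 61))  # parents 4 and 55 are matched
print(parents_match(cover, parent1, parent2, 7, 52))  # parents 4 and 51 are not
```

In this sketch the parent test singles out [7, 61] as the edge to the original instance of node 7, which agrees with the reasoning given for the running example; deciding which members of the flower at node 8 are free copies would additionally need the bookkeeping of free copies described above.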

The update phase of the algorithm is straightforward, and produces an update operation for each edge $[m, n]$ such that the labels of $m$ and $n$ differ. Since we are considering only structured edit scripts, there is no way to avoid such an update; in particular, "tricks" like updating a node and then copying it are disallowed. The glue and insert phases of the algorithm are analogous to the copy and delete phases, respectively. The details are in [CGM97].

5 Finding the Edge Cover

In this section we describe how mh-diff finds a minimal edge cover of the induced graph. The resulting cover will serve as input to algorithm CtoS (Section 4). Our goal is to find not just any minimal edge cover, but one that corresponds to a minimum-cost edit script. Let us call such a minimal edge cover the target cover.

Consider an edge $e$ in our pruned induced graph. To get to the target cover, mh-diff must decide whether $e$ should be included in the cover. To reach this decision, it would be nice if mh-diff knew the "cost" of $e$. That is, if $e$ remains in the target cover, then it would be annotated (by algorithm CtoS) with some operation, and we could say that the cost of this operation is the cost of $e$. Unfortunately, we have a "chicken and egg" problem here: CtoS cannot run until we have the target cover, and we cannot get the target cover until we know the costs it will imply.

To break the impasse, our approach uses the following idea: Instead of trying to compute the actual cost of $e$, we compute an upper and a lower bound on this cost. These bounds can be computed without the knowledge of which other edges are included in the target cover, and serve two purposes: Firstly, they allow us to design pruning rules that are used to conservatively eliminate unnecessary edges from the induced graph. Secondly, after pruning, the bounds can guide our search for the target cover.

As an enhancement, we actually use a variation on the edge cost suggested above. The following example shows that simply "charging" each annotation to the edge it is on is not entirely "fair."
We are given a tree $T_1$ containing two nodes, $n_1$ and $n_2$, with the same label $l$. Furthermore, $n_1$ has children $n_{11}$ and $n_{12}$ with labels a and b, respectively, and $n_2$ has children $n_{21}$ and $n_{22}$ with labels c and d, respectively. Suppose $T_2$ is a logical copy of $T_1$. (That is, $T_1$ and $T_2$ are isomorphic.) Consider an edge cover that matches each node in $T_1$ to its copy in $T_2$ except that it "cross matches" $n_1$ and $n_2$ across the trees, as shown in Figure 8. Given this edge cover, algorithm CtoS will produce a move operation for each of the nodes $n_{11}$, $n_{12}$, $n_{21}$, and $n_{22}$. However, these move operations were caused not by any mismatching of the nodes $n_{11}$, $n_{12}$, $n_{21}$, or $n_{22}$, but instead, by the mismatching of $n_1$ and $n_2$. Therefore it would be intuitively more fair to charge these move operations to the edges responsible for the mismatch, viz. $[n_1, n_2']$ and $[n_2, n_1']$. To achieve this, we use the following scheme: If $e$ is annotated with ins, del, or upd in the target cover, we do charge $e$ for this operation. However, if $e$ is annotated with mov, cpy, or glu, then the parent of $e$, and not $e$, is charged. We call the edge costs computed in such a fashion fair costs, and define them below.

[Figure 8: Distributing edge costs fairly]

5.1 An Edge-wise Cost Function

Let $K$ be an annotated minimal edge cover. For an edge $e \in K$, if the annotation on $e$ is mov, cpy, or glu, let $c_x(e)$ denote the cost of that operation. If $e$ is annotated with ins, del, or upd, then let $c_s(e)$ denote the cost of the operation. Furthermore, let $E(m)$ be the set of edges in $K$ that are incident on $m$, that is, $E(m) = \{[m, n] \in K\}$. Let $C(m)$ be the set of the children of $m$. We then define the fair cost of each edge $[m, n] \in K$ as follows:

\[
c_K([m, n]) = c_s(m, n)
  + \frac{1}{2|E(m)|} \sum_{\substack{m' \in C(m) \\ [m', n'] \in K}} c_x([m', n'])
  + \frac{1}{2|E(n)|} \sum_{\substack{n' \in C(n) \\ [m', n'] \in K}} c_x([m', n'])
\tag{1}
\]

Note that this cost depends on $K$, and thus is not a function of $e$ alone.
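As a rough check of Equation 1, the following Python sketch computes fair costs for the cross-matched cover of the example above. It is an illustration under stated assumptions, not code from the paper: the function name `fair_cost`, the dictionary-based tree encoding, the root nodes `r` and `r'`, and the unit move cost are all hypothetical choices, and an edge without an annotation of the relevant kind is assumed to contribute a cost of zero.

```python
from fractions import Fraction


def fair_cost(edge, cover, children1, children2, c_s, c_x):
    """Fair cost of a K-edge [m, n]: its own ins/del/upd cost plus half-shares of
    the mov/cpy/glu costs on K-edges incident on the children of m and of n."""
    m, n = edge
    deg_m = sum(1 for a, _ in cover if a == m)  # |E(m)|
    deg_n = sum(1 for _, b in cover if b == n)  # |E(n)|
    kids_m = children1.get(m, ())
    kids_n = children2.get(n, ())
    from_m = sum(c_x.get((a, b), 0) for a, b in cover if a in kids_m)
    from_n = sum(c_x.get((a, b), 0) for a, b in cover if b in kids_n)
    return c_s.get(edge, 0) + Fraction(from_m, 2 * deg_m) + Fraction(from_n, 2 * deg_n)


# The example trees, with assumed roots r and r' above n1, n2 and their copies.
children1 = {"r": ("n1", "n2"), "n1": ("n11", "n12"), "n2": ("n21", "n22")}
children2 = {"r'": ("n1'", "n2'"), "n1'": ("n11'", "n12'"), "n2'": ("n21'", "n22'")}

# Every node is matched to its copy, except that n1 and n2 are cross-matched.
cover = [("r", "r'"), ("n1", "n2'"), ("n2", "n1'"),
         ("n11", "n11'"), ("n12", "n12'"), ("n21", "n21'"), ("n22", "n22'")]

c_m = 1  # assumed unit cost of a move
c_x = {(x, x + "'"): c_m for x in ("n11", "n12", "n21", "n22")}  # four mov annotations
c_s = {}  # no ins, del, or upd annotations in this example

costs = {e: fair_cost(e, cover, children1, children2, c_s, c_x) for e in cover}
total = sum(costs.values())
print(costs[("n1", "n2'")], costs[("n2", "n1'")], total)
```

Under these assumptions the two cross-matching edges each receive the cost of two moves, every other edge receives zero, and the fair costs add up to the cost of the four move operations.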
The following lemma, proved in [CGM97], states that the above scheme of distributing the cost of an edge cover over its component edges is a sound one; that is, adding up the cost edge-wise yields the overall cost of the edge cover (i.e., the cost of the corresponding edit script).

Lemma 5.1 If $K$ is an annotated, minimal edge cover of the graph induced by two trees, then $c(K) = \sum_{e \in K} c_K(e)$.

5.2 Bounds on Edge Costs

Although Lemma 5.1 suggests a method of distributing the cost of an annotated edge cover (and thus an edit script) over the component edges, the cost of each edge depends on the other edges present in the edge cover, and is thus not directly useful for computing a minimum-cost edge cover. However, we use that distribution scheme to derive upper and lower bounds on the fair cost $c_K(e)$ of an edge $e$ over all minimal edge covers $K$.

Intuitively, given that the cost of any upd annotation on an edge is charged to that edge (by Equation 1), a simple choice for the lower bound on the cost of an edge $[m, n]$ is simply the cost $c_u(m, n)$ of updating the label of $m$ to that of $n$. However, we can do a little better. In some cases, selecting an edge $[m, n]$ (as part of the edge cover being constructed) may force some of the children $m'$ of $m$ to be moved to $n$. In particular, this happens for those children $m'$ of $m$ for which there is no edge that could possibly match $m'$ to a child of $n$. We call such moves forced moves. In cases where we can determine that a forced move exists, the cost of a mov is added to the lower bound cost. However, according to Equation 1, not all the cost of a forced move goes to edge $[m, n]$. In the worst

case, the number of edges incident on $m$, $|E(m)|$, is large, leaving $[m, n]$ with an insignificant contribution. However, if $|E(m)|$ is greater than 1, we know by Lemma 4.1 that $|E(n)| = 1$, so forced moves on the $n$ side would contribute to $[m, n]$. Thus, we may add the minimum of the second and the third terms in Equation 1 to the lower bound function.

Formally, let $E$ be the set of edges in the induced graph of $T_1$ and $T_2$ (see footnote 4). We define the forced move cost, $c_{mf}(m', n)$, of a node $m' \in T_1$ with respect to another node $n \in T_2$ as follows: $c_{mf}(m', n) = c_m$ if $\nexists\, n' \in C(n)$ such that $[m', n'] \in E$, and 0 otherwise. The cost $c_{mf}(m, n')$ is defined analogously. We then define the lower bound fair cost, $c_{lb}$, of an edge as follows:

\[
c_{lb}([m, n]) = c_u(m, n) + \frac{1}{2} \min \left\{ \sum_{m' \in C(m)} c_{mf}(m', n),\; \sum_{n' \in C(n)} c_{mf}(m, n') \right\}
\]

To help us compute the upper bound, let us now define a conditional move cost, $c_{mc}$. Intuitively, $c_{mc}(m', n)$ costs one mov cost unless there is a partner of $m'$ that is a child of $n$. Formally, $c_{mc}(m', n) = 0$ if $\exists\, n' \in C(n)$ such that $[m', n'] \in E$, and $c_m$ otherwise. The cost $c_{mc}(n', m)$ is defined analogously. Furthermore, define $c_w(m, n) = c_u(m, n)$ if $m$ and $n$ are regular nodes, 0 if $(m = \oplus) \wedge (n = \ominus)$, $c_i$ if $(m = \oplus) \wedge (n \neq \ominus)$, and $c_d$ if $(m \neq \oplus) \wedge (n = \ominus)$. Using reasoning similar to that used for deriving the lower bound cost above, we arrive at the following definition for the upper bound fair cost, $c_{ub}$, of an edge:

\[
c_{ub}([m, n]) = c_w(m, n)
  + \frac{1}{2} \sum_{m' \in C(m)} \bigl( c_c\,(|E(m')| - 1) + c_{mc}(m', n) \bigr)
  + \frac{1}{2} \sum_{n' \in C(n)} \bigl( c_g\,(|E(n')| - 1) + c_{mc}(n', m) \bigr)
\]

Note that both $c_{ub}(e)$ and $c_{lb}(e)$ can be computed by mh-diff without knowing the target cover. Furthermore, the following lemma, proved in [CGM97], states that the above definitions of $c_{ub}(e)$ and $c_{lb}(e)$ are upper and lower bounds, respectively, on the fair cost contribution $c_K(e)$ of edge $e$ to any minimal edge cover $K$ that contains $e$.

Lemma 5.2 Let $B = (U, V, E)$ be the bipartite graph induced by trees $T_1$ and $T_2$. Let $B' = (U, V, E')$, where $E' \subseteq E$. Let $\mathcal{K}$ denote the collection of all minimal edge covers of $B'$.
We then have the following inequalities:

\[
c_{lb}(e) \le \min_{K \in \mathcal{K}} c_K(e) \quad \text{and} \quad c_{ub}(e) \ge \max_{K \in \mathcal{K}} c_K(e)
\]

[Footnote 4: As we will see later, although $E$ initially includes all edges in the complete bipartite graph, the pruning of edges results in a successive reduction of the size of $E$.]

5.3 Pruning Rules

We now use the upper and lower bound functions for the cost of an edge as defined above to introduce the pruning rules we use to reduce the size of the induced graph of the two trees being compared. Let $e_1 = [m, n]$ be any edge in the induced graph. Let $e_2$ be any edge incident on $m$, and let $e_3$ be any edge incident on $n$. Intuitively, our first pruning rule removes an edge with a lower bound cost that is so high that it is preferable to match each of its nodes using some other edge that has a suitably low upper bound cost.

Pruning Rule 1 Let $C_t = \max\{c_m, c_c, c_g\}$. If $c_{lb}(e_1) \ge c_{ub}(e_2) + c_{ub}(e_3) + 2C_t$ then prune $e_1$.

Example 5.1 To illustrate this rule, consider a tree $T_1$ containing, among others, two childless nodes 1 (label f) and 2 (label g). Similarly, $T_2$ contains childless nodes 3 (label g) and 4 (label f), among others. Say the costs $c_m$, $c_c$, and $c_g$ are one unit each, while the update costs are $c_u(f, g) = 3$, and $c_u(f, f) = c_u(g, g) = 0$. Let us now consider if edge $e_1 = [1, 3]$ can be pruned because edges $e_2 = [1, 4]$ and $e_3 = [2, 3]$ exist. Since the nodes have no children, it is easy to compute $c_{lb}(e_1) = c_u(f, g) = 3$, $c_{ub}(e_2) = c_u(f, f) = 0$, and $c_{ub}(e_3) = c_u(g, g) = 0$. Since $C_t = 1$, we see that Pruning Rule 1 holds and $e_1$ can be safely removed. The intuition is that in the worst case we can replace $e_1$ by edges $e_2$ and $e_3$. Using the latter edges could introduce at most the costs $c_{ub}(e_2)$ and $c_{ub}(e_3)$, plus the cost of two mov, cpy, or glu operations. The last factor can arise, for instance, if node 2 ends up being matched not only to node 3 but to another node in $T_2$. This means that node 2 needs to be copied, which would not have been necessary if we had kept edge $e_1$ and not used $e_2$. Similarly, the removal of edge $e_1$ may cause an extra glue operation for node 4.
However, even in this worst case scenario, the costs would be less than the cost of updating the label of node 1 to that of node 3, so we can safely remove the [1, 3] edge.

Our second pruning rule (already illustrated in Section 3) states that if it is less expensive to delete a node and insert another, we do not need to consider matching the two nodes to each other. More precisely, we state the following:

Pruning Rule 2 If $c_{lb}(e_1) \ge c_d(m) + c_i(n)$ then prune $e_1$.

Note that the above pruning rules are simpler to apply if we let $e_2$ and $e_3$ be the minimum-cost edges incident on $m$ and $n$, respectively. The following lemma, proved in [CGM97], tells us that the pruning rules are conservative:

Lemma 5.3 Let $E_p$ be the set of edges pruned by repeated application of Pruning Rules 1 and 2. Let $K_1$ be any minimal edge cover of the graph $B$. There exists a minimal edge cover $K_2$ such that (1) $K_2 \cap E_p = \emptyset$, and (2) $C(K_2) \le C(K_1)$.

The pruning phase of our algorithm consists of repeatedly applying Pruning Rules 1 and 2. Note that the absence of edges raises the lower bound function, and lowers the upper bound function, thus possibly causing more edges to get pruned. Our algorithm updates the cost bounds for the edges affected by the pruning of an edge whenever the edge is pruned. By maintaining the appropriate data structures, such a cost-update step after an edge is pruned can be performed in $O(\log n)$ time, where $n$ is the number of nodes in the induced graph.

5.4 Computing a Min-Cost Edge Cover

After application of the pruning rules described above, we obtain a pruned induced graph, containing a (typically small)

subset of the edges in the original induced graph. In favorable cases, the remaining edges contain only one minimal edge cover. However, typically, there may be several minimal edge covers possible for the pruned induced graph. We now describe how we select one of these minimal edge covers.

We first approximate the fair cost of every edge $e$ that remains after pruning by its lower bound $c_{lb}(e)$. (We could have also used the upper bound, or an average of both bounds, since this is only an estimate.) Then, given these constant estimated costs, we compute a minimum-cost edge cover by reducing the edge cover problem to a bipartite weighted matching problem, as suggested in [PS82]. Since the weighted matching problem can be solved using standard techniques, we do not present the details in this paper, noting only that given a bipartite graph with $n$ nodes and $e$ edges, the weighted matching problem can be solved in time $O(ne)$. For our application, $e$ is the number of edges that remain in the induced graph after pruning.

6 Implementation and Performance

In this section, we describe our implementation of mh-diff, and discuss its analytical and empirical performance. Figure 9 depicts the overall architecture of our implementation, with rectangles representing the modules (numbered, for reference) of the program, and other shapes representing data. Given two trees $T_1$ and $T_2$ as input, Module 1 constructs the induced graph (Section 3.1). This induced graph is next pruned (Module 2) using the pruning rules of Section 5.3 to give the pruned induced graph. In Module 2, the update cost for each edge in the induced graph is computed using the domain-dependent comparison function for node labels (Section 2.2). The next three modules together compute a minimum-cost edge cover of the pruned induced graph using the reduction of the edge cover problem to a weighted matching problem [PS82]. That is, the pruned induced graph is first translated (by Module 3) into an instance of a weighted matching problem. This weighted matching problem is solved using a package (Module 4) [Rot] based on standard techniques [PS82].
The output of the weighted matching solver is a minimum-cost matching, which is translated by Module 5 into $K_0$, a minimum-cost edge cover of the pruned induced graph. Next, Module 6 uses the minimum-cost edge cover computed to produce the desired edit script, using the method described in Section 4.2.

[Figure 9: System Architecture. Modules: (1) Induced Graph Builder, (2) Pruner, (3) Edge Cover to Weighted Matching Translator, (4) Weighted Matching Solver, (5) Matching to Cover Translator, (6) Cover to Script. Data: $T_1$, $T_2$, induced graph, pruned induced graph, weighted matching problem, min-cost matching, min-cost edge cover $K_0$, edit script.]

Recall that since we use a heuristic cost function to compute a minimum-cost edge cover, the edge cover produced by our program, and hence the edit script, may not be the optimal one. We have also implemented a simple search module that starts with the minimum-cost edge cover $K_0$ (see Figure 9) computed by our program and explores its neighborhood of minimal edge covers in an effort to find a better solution. The search proceeds by first exploring minimal edge covers that contain only one edge not in $K_0$. Next, we explore minimal edge covers containing two edges not in $K_0$, and so on. The intuition is that we expect the optimal solution to be "close" to the initial solution $K_0$. Although, in the worst case, such an exploration may be extremely time-consuming, note that as a result of pruning edges, the search space is typically much smaller than the worst case. Due to space constraints, we do not describe the details of this search phase in this paper.

We have used our implementation to compute the differences between query results as part of the Tsimmis and C3 projects at Stanford [CGMH+94, WU95]. These projects use the OEM data model, which is a simple labeled-object model to represent tree-structured query results. In particular, we have run our system on the output of Tsimmis queries over a bibliographic information source that contains information about database-related publications in a format similar to BibTeX. Since the data in this information source is mainly textual, we treat all labels as strings.
For the domain-dependent label-update cost function, we use a weighted character-frequency histogram difference scheme that compares strings based on the number of occurrences of each character of the alphabet in them. For example, consider comparing the labels "foobar" and "crowbar." The character-frequency histograms are, respectively, (a: 1, b: 1, f: 1, o: 2, r: 1) and (a: 1, b: 1, c: 1, o: 1, r: 2, w: 1). The difference histogram is (c: −1, f: 1, o: 1, r: −1, w: −1). Adding up the magnitudes of the differences gives us 5, which we then normalize by the total number of characters in the strings (13), and scale by a parameter (currently 5), to get the update cost $(5/13) \times 5 = 1.9$.

Let us now analyze the running time of our program. Let $n$ be the total number of nodes in both input trees $T_1$ and $T_2$. Constructing the induced graph (Module 1, in Figure 9) involves building a complete bipartite graph with $O(n)$ nodes on each side. We also evaluate the domain-dependent label-comparison function for each pair of nodes, and store this cost on the corresponding edge. Thus, building the induced graph requires time $O(kn^2)$, where $k$ is the cost of the domain-dependent comparison function. Next, consider the pruning phase (Module 2). By maintaining a priority queue (based on edge costs) of edges incident on each node of the induced graph, the test to determine whether an edge may be pruned can be performed in constant time. If the edge is pruned, removing it from the induced graph requires constant time, while removing it from the priority queues at each of its nodes requires $O(\log n)$ time. When an edge $[m, n]$ is pruned, we also record the changes to the costs $c_{mc}(m, p(n))$, $c_{mc}(n, p(m))$, $c_{mf}(m, p(n))$, and $c_{mf}(n, p(m))$, which can be done in constant time. Thus, pruning an edge requires $O(\log n)$ time. Since at most $O(n^2)$ edges are pruned, the total worst case cost of the pruning phase is $O(n^2 \log n)$. Let $e$ be the number of edges that remain in the induced graph after pruning. The minimum-cost edge cover is computed in time $O(ne)$ by Modules 3, 4, and 5.
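The character-frequency histogram cost described earlier in this section can be sketched as follows. This is a minimal sketch rather than the implementation used in the system: the function name `histogram_update_cost` and the `scale` argument are assumptions, and since the text does not say how individual characters are weighted, every character is given the same weight here.

```python
from collections import Counter


def histogram_update_cost(old_label, new_label, scale=5.0):
    """Sum of absolute per-character count differences, normalized by the total
    number of characters in both labels and multiplied by a scale parameter."""
    old_hist, new_hist = Counter(old_label), Counter(new_label)
    alphabet = old_hist.keys() | new_hist.keys()
    # A Counter returns 0 for characters it has not seen.
    difference = sum(abs(old_hist[ch] - new_hist[ch]) for ch in alphabet)
    total_chars = len(old_label) + len(new_label)
    if total_chars == 0:
        return 0.0  # two empty labels are treated as identical (an assumption)
    return scale * difference / total_chars


cost = histogram_update_cost("foobar", "crowbar")
print(round(cost, 1))
```

For the labels of the example this evaluates to 25/13, which rounds to the update cost of 1.9 quoted above.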
The computation of the edit script from the minimum-cost edge cover can be done in $O(n)$ time by Module 6. (Note that the number of edges