2nd International Workshop on Web Dynamics in conjunction with the 11th International World Wide Web Conference Honolulu, Hawaii, USA, May 7, 2002


Table of Contents

Web Structure and Evolution
  Web Structure, Age, and Page Quality
    Ricardo Baeza-Yates, Felipe Saint-Jean, and Carlos Castillo ... 3
  A Steady State Model for Graph Power Law
    David Eppstein and Joseph Wang ... 15
  A Multi-Layer Model for the Web Graph
    L. Laura, S. Leonardi, G. Caldarelli, and P. De Los Rios ... 25

XML Technologies
  Modeling Adaptive Hypermedia with an Object-Oriented Approach and XML
    Mario Cannataro, Alfredo Cuzzocrea, Carlo Mastroianni, Riccardo Ortale, and Andrea Pugliese ... 35
  A Logico-Categorical Semantics of XML/DOM
    Carlos Henrique Cabral Duarte ... 45
  XML Structure Compression
    Mark Levene and Peter Wood ... 56

Web Information Retrieval
  Criteria for Evaluating Information Retrieval Systems in Highly Dynamic Environments
    Judit Bar-Ilan ... 70
  Query Language for Structural Retrieval of Deep Web Information
    Stefan Muller, Ralf-Dieter Schimkat, and Rudolf Muller ... 78
  Exploration versus Exploitation in Topic Driven Crawlers
    Gautam Pant, Padmini Srinivasan, and Filippo Menczer ... 88

Dynamic Applications

  WebVigil: An Approach to Just-In-Time Information Propagation in Large Network-Centric Environments
    Sharma Chakravarthy, Jyoti Jacob, Naveen Pandrangi, and Anoop Sanka ... 98
  Caching Schema for Mobile Web Information Retrieval
    R. Lee, K. Goshima, Y. Kambayashi, and H. Takakura

Web Structure, Age and Page Quality

Ricardo Baeza-Yates, Felipe Saint-Jean, Carlos Castillo
Computer Science Department, University of Chile, Blanco Encalada 2120, Santiago, Chile

Abstract. This paper studies quantitative measures of the relation between Web structure, age, and quality of Web pages. Quality is studied through different link-based metrics and their relationship with the structure of the Web and the last modification time of a page. We show that, as expected, Pagerank is biased against new pages. As a byproduct we propose a Pagerank variant that takes age into account, and we obtain information on how the rate of change is related to Web structure.

1 Introduction

The purpose of a Web search engine is to provide an infrastructure that supports relationships between publishers of content and readers. In this space, as the numbers involved are very big (500 million users [3] and more than 3 billion pages (1) in 36 million sites [4] at this time), it is critical to provide good measures of quality that allow the user to choose "good" pages. We think this is the main element that explains Google's [1] success. However, the notion of what a "good page" is, and how it is related to different Web characteristics, is not well understood. Therefore, in this paper we address the study of the relationships between the quality of a page, Web structure, and the age of a page or a site. Age is defined as the time since the page was last updated. For Web servers, we use the oldest page in the site as a lower bound on the age of the site. The specific questions we explore are the following: How does the position of a Web site in the structure of the Web depend on the Web site's age? Does the quality of a Web page depend on where it is located in the Web structure? We give some experimental data that sheds some light on these issues. Are link-based ranking schemes providing a fair score to newer pages? We find that the answer is no for Pagerank [12], which is used by Google [1], and we propose alternative ranking schemes that take into account the age of the pages, an important problem according to [11].

Funded by Millennium Project "Center for Web Research", Mideplan, Chile.
(1) This is a lower bound that comes from the coverage of a search engine.

Our study is focused on the Chilean Web, mainly the .cl domain, at two different times: the first half of 2000, when we collected 670 thousand pages in approximately 7,500 Web sites (Set1), and the last half of 2001, when we collected 795 thousand pages, corresponding to a considerably larger number of Web sites (Set2). This data comes from the TodoCL search site, which specializes in the Chilean Web and is part of a family of vertical search engines built using the Akwan search engine [2]. Most statistical studies about the Web are based either on a "random" subset of the complete Web, or on the contents of some Web sites. In our case, the results are based on the analysis of the TodoCL collection, a search engine for Chilean Web pages. As this collection represents a large percentage of the Chilean Web, we think that our sample is coherent, because it represents a well defined cultural context.

The remainder of this paper is organized as follows. Section 2 presents previous work and the main concepts used in the sequel of the paper. Section 3 presents several relations among Web structure, age, and quality of Web pages. Section 4 presents the relation between the quality of Web pages and age, followed by a modified Pagerank that is introduced in Section 5. We end with some conclusions and future work.

2 Previous Work

The most complete study of the Web structure [7] focuses on page connectivity. One problem with this is that a page is not a logical unit (for example, a page can describe several documents, and one document can be stored in several pages). Hence, we decided to study the structure of how Web sites are connected, as Web sites are closer to being real logical units. Not surprisingly, we found in [5] that the structure in Chile at the Web site level was similar to the global Web (2), and hence we use the same notation as [7]. The components are: (a) MAIN, sites that are in the strongly connected component of the connectivity graph of sites; (b) IN, sites that can reach MAIN but cannot be reached from MAIN; (c) OUT, sites that can be reached from MAIN, but there is no path to go back to MAIN; and (d) other sites that can be reached from IN (t.in), sites in paths between IN and OUT (tunnel), sites that only reach OUT (t.out), and unconnected sites (island). In [5] we analyzed Set1 and we extended this notation by dividing the MAIN component into four parts: (a) MAIN-MAIN, which are sites that can be reached directly from the IN component and can reach directly the OUT component; (b) MAIN-IN, which are sites that can be reached directly from the IN component but are not in MAIN-MAIN;

(2) Another example of the self-similarity of the Web, which gives scale invariance.

(c) MAIN-OUT, which are sites that can reach directly the OUT component, but are not in MAIN-MAIN; (d) MAIN-NORM, which are sites not belonging to the previously defined subcomponents.

We also gathered time information (the last-modified date) for each page, as reported by the Web servers. How Web pages change is studied in [9, 6, 8], but here we focus on Web page age, that is, the time elapsed after the last modification. As the Web is young, we use months as the time unit, and our study considers only the last three years, as most Web sites are that young. The distribution of pages and sites for Set1 with respect to age is given in Figure 1.

Figure 1: Cumulative distribution of pages (bottom) and sites (top) as a function of age for Set1.

The two main link-based ranking algorithms known in the literature are Pagerank [12] and the hub and authority measures [10]. Pagerank is based on the probability of a random surfer being on a page. This probability is modeled with two actions: the chance of the surfer getting bored and jumping randomly to any page in the Web (with uniform probability), or choosing randomly one of the links in the page. This defines a Markov chain that converges to a stationary state, where the probabilities are defined as follows:

PR_i = q + (1 - q) \sum_{j=1, j \neq i}^{k} \frac{PR_{m_j}}{L_{m_j}}

where q is the probability of getting bored (typically 0.15), m_j with j \in (1..k) are the pages that point to page i, and L_{m_j} is the number of outgoing links of page m_j.
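As a concrete illustration, the following is a minimal sketch of this iteration on a toy adjacency list; the graph, the page names, and the iteration count are illustrative and not data from the paper, and dangling pages (pages without out-links) are simply skipped.

```python
# Minimal sketch of the Pagerank iteration PR_i = q + (1-q) * sum_j PR_{m_j}/L_{m_j},
# with q = 0.15 as in the text.  The toy graph below is illustrative only.
def pagerank(out_links, q=0.15, iters=50):
    pr = {p: 1.0 for p in out_links}            # start from a uniform score
    for _ in range(iters):
        new = {p: q for p in out_links}         # the "bored surfer" term
        for p, links in out_links.items():
            if not links:
                continue                        # dangling page: nothing to spread
            share = (1.0 - q) * pr[p] / len(links)
            for dest in links:
                if dest in new:
                    new[dest] += share          # follow one of the L_p out-links
        pr = new
    return pr

if __name__ == "__main__":
    toy = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
    for page, score in sorted(pagerank(toy).items()):
        print(page, round(score, 3))
```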

The hub and authority measures are complementary functions. A page will have a high hub rank if it points to good content pages. In a similar way, a page will have a high authority rank if it is referred to by pages with good links. In this way, the authority of a page is defined as the sum of the hub ranks of the pages that point to it, and the hub rank of a page is the sum of the authorities of the pages it points to. When considering the rank of a Web site, we use the sum of the ranks of all the pages in the site, which is equivalent to the probability of being in any page of the site [5].

3 Relations to the Web Structure

One of the initial motivations of our study was to see whether the IN and OUT components were related to Web dynamics or just due to bad Web sites. In fact, Web sites in IN could be considered as new sites which are not yet linked for causality reasons. Similarly, OUT sites could be old sites which have not been updated. Figure 2 shows the relation between the macro-structure of the Web, using the number of Web sites in each component to represent the area of each part of the diagram for Set1. The colors represent Web site age (oldest, average, and newest page), such that a darker color represents older pages. The average case can be considered as the freshness of a site, while the newest page is a measure of the update frequency of a site. Figure 3 plots the cumulative distribution of the oldest page of each site for Set1 in each component of the Web structure versus date on a logarithmic scale (these curves have the same shape as the ones in [7] for pages). The central part is a line and represents the typical power laws that appear in many Web measures.

Figure 2: Visualization of Web structure and Web site age.

These diagrams show that the oldest sites are in MAIN-MAIN, while the sites that are fresher on average are in MAIN-IN and MAIN-MAIN. Finally, the last diagram on the right shows that the update frequency is high in MAIN-MAIN and MAIN-OUT, while sites in IN and OUT are updated less frequently.

Figure 3: Web site age in the different components and page age (rightmost curve).

Here we obtain some confirmation of what could be expected. The newer sites are in the Island component (and that is why they are not linked, yet). The oldest sites are in MAIN, in particular MAIN-MAIN, so the kernel of the Web comes mostly from the past. What is not obvious is that, on average, sites in OUT are also newer than the sites in other components. Finally, IN shows two different parts: there is a group of new sites, but the majority are old sites. Hence, a large fraction of IN are sites that never became popular.

In Table 1 we give the numerical data for the average age as well as the Web quality (summed over all the sites) in each component of the macro-structure of the Web, as well as the percentage change between the two data sets, taken more than a year apart. Although Set1 did not include all the ISLANDS at that time (we estimate that Set1 covered 70% of the sites), we can compare the core. The core has a smaller percentage, but it is larger in absolute terms, as Set2 triples the number of sites of Set1. OUT also has increased, which may imply a degradation of some part of the Web. Inside the core, MAIN-MAIN has increased at the expense of MAIN-NORM. Overall, Set2 represents a Web much more connected than Set1. Several observations can be made from Table 1. First, sites in MAIN have the highest Pagerank, and inside it, MAIN-MAIN is the subcomponent with the highest Pagerank. In a similar way, MAIN-MAIN has the largest authority. This makes MAIN-MAIN a very important segment of the Web. Notice that IN has the highest hub, which is natural because sites in MAIN have the highest authority.
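As an aside, the MAIN/IN/OUT partition used throughout this section can be computed directly from a site-level link graph. The sketch below is an illustration that assumes the networkx library and a made-up edge list; the finer split into tunnels, tentacles, islands, and the MAIN subcomponents is left out.

```python
# Sketch of a site-level bow-tie decomposition: MAIN is the largest strongly
# connected component, IN can reach it, OUT is reachable from it.  Assumes
# networkx; the edge list is invented for illustration.
import networkx as nx

def bow_tie(edges):
    g = nx.DiGraph(edges)
    main = max(nx.strongly_connected_components(g), key=len)
    seed = next(iter(main))
    out = nx.descendants(g, seed) - main
    in_ = nx.ancestors(g, seed) - main
    other = set(g) - main - in_ - out          # tunnels, tentacles, islands
    return {"MAIN": main, "IN": in_, "OUT": out, "other": other}

if __name__ == "__main__":
    toy_edges = [("s1", "s2"), ("s2", "s3"), ("s3", "s1"),   # a small core
                 ("s0", "s1"),                               # an IN site
                 ("s3", "s4"),                               # an OUT site
                 ("s5", "s6")]                               # an island pair
    for part, sites in bow_tie(toy_edges).items():
        print(part, sorted(sites))
```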

Component      size (%, Set1)   size (%, Set2)   age (days)   Pagerank   hub   authority
MAIN           23%              9.25%
IN             15%              5.84%
OUT            45%              20.21%
TUNNEL         1%               0.22%
TENTACLES-IN   3%               3.04%
TENTACLES-OUT  9%               1.68%
ISLANDS        4%               59.73%
MAIN-MAIN      2%               3.43%
MAIN-OUT       6%               2.49%
MAIN-IN        3%               1.16%
MAIN-NORM      12%              2.15%

Table 1: Age and page quality for Set2 in the different components of the macro-structure of the Chilean Web.

ISLANDS have a low score in every rank. Studying age, sites in MAIN are the oldest, and inside it, sites in MAIN-MAIN are the oldest. As MAIN-MAIN also has good ranking, it seems that older sites have the best content. This may be true when evaluating the quality of the content, but the value of the content, we believe, could in many cases be higher for newer pages, as we need to add novelty to the content. Therefore there is a strong relation between the macro-structure of the Web and age/rank characteristics. This makes the macro-structure a valid partition of Web sites.

4 Link-based Ranking and Age

Now we study the correlation of the mentioned ranking algorithms with the age of the pages. In [5] we gave qualitative data showing that link-based ranking algorithms had poor correlation and that Pagerank was biased against new pages. Here we present quantitative data supporting those observations. Web pages were divided into 100 time segments of the same weight (that is, each segment has the same number of pages), and we calculated the standard correlation of the average rank values for each group. Three graphs were obtained: Figure 4 shows the correlation between Pagerank and authority, Figure 5 the correlation between Pagerank and hub, and Figure 6 the correlation between authorities and hubs. The low correlation between Pagerank and authority is surprising because both ranks are based on incoming links. This means that Pagerank and authority are different for almost every age percentile except the ones corresponding to the oldest and newest pages, which have Pagerank and authority rank very close to the minimum. Notice the correlation between hub and authority, which is relatively low but with a higher value for pages about 8 months old. New pages and old pages have a lower correlation. Also notice that hub and authority are not biased with time. It is intuitive that new sites will have low Pagerank, due to the fact that webmasters of other sites take time to get to know the site and refer to it in their sites. We show that this intuition is correct in Figure 7, where Pagerank is plotted against percentiles of page age.
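The percentile bookkeeping described above can be sketched as follows. The exact correlation statistic is not fully specified in the text, so the snippet assumes a per-group Pearson correlation between the two rank vectors, and the data it runs on is random, purely to illustrate the mechanics.

```python
# Sketch: sort pages by age, split them into equal-size groups, and compute a
# Pearson correlation between two rank vectors inside each group.  The fake
# page records are random placeholders, not measurements from the paper.
import random

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def correlation_by_age(pages, n_groups=100):
    pages = sorted(pages, key=lambda p: p["age"])
    size = max(1, len(pages) // n_groups)
    out = []
    for start in range(0, len(pages) - size + 1, size):
        group = pages[start:start + size]
        pr = [p["pagerank"] for p in group]
        auth = [p["authority"] for p in group]
        out.append((group[-1]["age"], pearson(pr, auth)))
    return out

if __name__ == "__main__":
    rng = random.Random(0)
    fake = [{"age": rng.uniform(0, 36), "pagerank": rng.random(),
             "authority": rng.random()} for _ in range(10000)]
    for age, corr in correlation_by_age(fake, n_groups=10):
        print(f"pages up to {age:5.1f} months old: corr = {corr:+.3f}")
```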

Figure 4: Correlation between Pagerank and authority with age.

As can be seen, the newest pages have a very low Pagerank, similar to very old pages. The peak of Pagerank is for pages about 1.6 months old. In a dynamic environment such as the Web, new pages have a high value, so a ranking algorithm should treat an updated or new page as a valuable one. Pages with high Pagerank are usually good pages, but the opposite is not necessarily true (good precision does not imply good recall). So the answer is incomplete, and a missing part of it lies in new pages. In the next section we explore this idea.

5 An Age-Based Pagerank

Pagerank is a good way of ranking pages, and Google is a demonstration of it. But as seen before, it has a tendency to give higher ranks to older pages, giving new pages a very low rank. With that in mind, we present some ideas for variants of Pagerank that give a higher value to new pages. A page that is relatively new and already has links to it should be considered good. Hence, the Pagerank model can be modified such that links to newer pages are chosen with higher probability. So, let f(age) be a decreasing function of age (the present is 0), and define f(x) as the weight of a page of age x. Hence, we can rewrite the Pagerank computation as:

PR_i = q + (1 - q) f(age_i) \sum_{j=1, j \neq i}^{k} \frac{PR_{m_j}}{L_{m_j}}

Figure 5: Correlation between Pagerank and hub with age.

where L_{m_j}, as before, is the number of links in page m_j. At each step, we normalize PR. Figures 8 and 9 show the modified Pagerank obtained by using f(age) = 1 + A e^{-B \cdot age}, q = 0.15, and different values of A and B.

Another possibility would be to take into account the age of the pages pointing to i. That is,

PR_i = q + (1 - q) \sum_{j=1, j \neq i}^{k} \frac{f(age_{m_j}) PR_{m_j}}{F_{m_j}}

where F(m_j) = \sum_{pages\ k\ linked\ by\ m_j} f(age_k) is the total weight of the links in a page. The result does not change too much, but the computation is slower.

Yet another approach would be to evaluate how good the links are based on the modification times of both pages involved in a link. Suppose that page P_1 has an actualization date of t_1, and similarly t_2 and t_3 for P_2 and P_3, such that t_1 < t_2 < t_3. Let us assume that P_1 and P_3 reference P_2. Then, we can make the following two observations:

1. The link (P_3, P_2) has a higher value than (P_1, P_2) because at time t_1, when the first link was made, the content of P_2 may have been different, although usually the content and the links of a page improve with time. It is true that the link (P_3, P_2) could have been created before t_3, but the fact that it was not changed at t_3 validates the quality of that link.

2. For a smaller t_2 - t_1, the reference (P_1, P_2) is fresher, so the link should increase its value.

Figure 6: Correlation between hubs and authorities with age.

On the other hand, the value of the link (P_3, P_2) should not depend on t_3 - t_2 unless the content of P_2 changes.

A problem with the assumptions above is that we do not really know when a link was changed, and that they use information from the servers hosting the pages, which is not always reliable. These assumptions could be strengthened by using the estimated rate of change of each page. Let w(t, s) be the weight of a link from a page with modification time t to a page with modification time s, such that w(t, s) = 1 if t \geq s, and w(t, s) = f(s - t) otherwise, with f a fast decreasing function. Let W_j be the weight of all the out-links of page j; then we can modify Pagerank using:

PR_i = q + (1 - q) \sum_{j=1, j \neq i}^{k} \frac{w(t_{m_j}, t_i)\, PR_{m_j}}{W_{m_j}}

where t_j is the modification time of page j. One drawback of this idea is that changing a page may decrease its Pagerank.
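A minimal sketch of the first age-based variant above, weighting the target page by f(age_i) = 1 + A e^{-B age_i} and renormalizing the scores after every step; the toy graph, the ages (in months), and the values of A and B are illustrative only.

```python
# Sketch of the age-weighted Pagerank PR_i = q + (1-q)*f(age_i)*sum_j PR_{m_j}/L_{m_j},
# with f(age) = 1 + A*exp(-B*age) and renormalization at every step.
import math

def age_pagerank(out_links, age, A=1.0, B=0.5, q=0.15, iters=50):
    f = {p: 1.0 + A * math.exp(-B * age[p]) for p in out_links}
    pr = {p: 1.0 for p in out_links}
    for _ in range(iters):
        incoming = {p: 0.0 for p in out_links}
        for p, links in out_links.items():
            for dest in links:
                if dest in incoming:
                    incoming[dest] += pr[p] / len(links)
        pr = {p: q + (1.0 - q) * f[p] * incoming[p] for p in out_links}
        total = sum(pr.values())
        pr = {p: len(pr) * v / total for p, v in pr.items()}   # normalize PR
    return pr

if __name__ == "__main__":
    toy = {"old": ["new", "mid"], "mid": ["old"], "new": ["old"]}
    months = {"old": 30.0, "mid": 8.0, "new": 0.5}   # months since last update
    print({p: round(v, 3) for p, v in age_pagerank(toy, months).items()})
```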

Figure 7: Pagerank as a function of page age.

Figure 8: Modified Pagerank taking into account the page age (constant B).

Figure 9: Modified Pagerank taking into account the page age (constant A).

6 Conclusions

In this paper we have shown several relations between the macro-structure of the Web, page and site age, and the quality of pages and sites. Based on these results we have presented a modified Pagerank that takes into account the age of the pages. Google might already be doing something similar according to a BBC article (3) pointed out by a reviewer, but they do not say how. We are currently trying other functions, and we are also applying the same ideas to hubs and authorities. There is a lot to do in mining the presented data. Further work includes how to evaluate the real goodness of a link-based ranking of Web pages. Another line of research includes the analysis of search engine logs to study user behavior with respect to time.

(3) In a private communication with Google staff, they said that the journalist had a lot of imagination.

References

[1] Google search engine: Main page.
[2] Akwan search engine: Main page.
[3] Nua Internet - How many online.
[4] Netcraft Web server survey.
[5] Baeza-Yates, R., and Castillo, C. Relating web characteristics with link analysis. In String Processing and Information Retrieval (2001), IEEE Computer Society Press.

[6] Brewington, B., Cybenko, G., Stata, R., Bharat, K., and Maghoul, F. How dynamic is the Web? In 9th World Wide Web Conference (2000).
[7] Broder, A., Kumar, R., Maghoul, F., Raghavan, P., Rajagopalan, S., Stata, R., and Tomkins, A. Graph structure in the Web: Experiments and models. In 9th World Wide Web Conference (2000).
[8] Cho, J., and Garcia-Molina, H. The evolution of the Web and implications for an incremental crawler. In The VLDB Journal (2000).
[9] Douglis, F., Feldmann, A., Krishnamurthy, B., and Mogul, J. Rate of change and other metrics: a live study of the World Wide Web. In USENIX Symposium on Internet Technologies and Systems (1997).
[10] Kleinberg, J. Authoritative sources in a hyperlinked environment. In 9th Symposium on Discrete Algorithms (1998).
[11] Levene, M., and Poulovassilis, A. Report on the International Workshop on Web Dynamics, London, January.
[12] Page, L., Brin, S., Motwani, R., and Winograd, T. The PageRank citation ranking: Bringing order to the Web. In 7th World Wide Web Conference (1998).

A Steady State Model for Graph Power Law

David Eppstein*, Joseph Wang*

Abstract. Power law distribution seems to be an important characteristic of web graphs. Several existing web graph models [8, 21] generate power law graphs by adding new vertices and non-uniform edge connectivities to existing graphs. Researchers [9, 10, 24] have conjectured that preferential connectivity and incremental growth are both required for the power law distribution. In this paper, we propose a different web graph model with power law distribution that does not require incremental growth. We also provide a comparison of our model with several others in their ability to predict web graph clustering behavior.

1 Introduction

The growth of the World Wide Web (WWW) has been explosive and phenomenal. Google [1] has more than 2 billion pages searched as of February 2002. The Internet Archive [2] has 10 billion pages archived as of March. The existing growth-based models [6, 8, 21] are adequate to explain the web's current graph structure. It would be interesting to know if a different model is needed as the growth rate slows down [3] while the link structure continues to evolve.

1.1 Why Power Laws?

Barabási et al. [9, 10] and Medina et al. [24] stated that preferential connectivity and incremental growth are both required for the power law distribution observed in the web. The importance of preferential connectivity has been shown by several researchers [8, 16]. Faloutsos et al. [15] observed that the internet topology exhibits power law distributions of the form y = x^α. When studying web characteristics, the documents can be viewed as vertices in a graph and the hyperlinks as edges between them. Various researchers [7, 8, 19, 22] have independently shown the power law distribution in the degree sequence

* Dept. Inf. & Comp. Sci., UC Irvine, CA, USA, {eppstein,josephw}@ics.uci.edu.

of the web graphs. Huberman and Adamic [5, 16] showed a power law distribution in web site sizes. (See [20] for a summary of work on the web structure.) Medina et al. [24] showed that topologies generated by two widely used generators, the Waxman model [32] and the GT-ITM tool [13], do not have power law distributions in their degree sequences. Palmer and Steffan [27] proposed a power law degree generator that recursively partitions the adjacency matrix. However, it is unclear if their generator actually emulates other web properties. The power law distribution seems to be a ubiquitous property. It occurs in epidemics studies [30], population studies [28], genome distributions [17, 29], various social phenomena [11, 26], and massive graphs [4, 6]. For the power law graphs in biological systems, the connectivity changes appear to be much more important than growth in size.

1.2 Properties for Graph Model Comparison

Another important property that has been looked at is the diameter of web graphs. However, there are conflicting results in the published papers. Albert et al. [7] stated that web graphs exhibit the small world phenomenon [25, 31], in which the diameter is roughly 0.35 + 2.06 lg n, where n is the size of the web graph; for the web graph sizes they considered, this is approximately 19. Lu [23] proved that the diameters of random power law graphs are a logarithmic function of n under the model proposed by Aiello et al. [6]. However, Broder et al. [12] showed that over 75% of the time there is no directed path between two random vertices. If a path exists, the average distance is roughly 16 when viewing the web graph as a directed graph, or 6.83 in the undirected case. Currently, there are few theoretical graph models [6, 8, 21, 27] for generating power law graphs. There are very few comparative studies that would allow us to determine which of these theoretical models are more accurate models of the web. We only know that the model proposed by Kumar et al. [21] generates more bipartite cliques than other models. They believe clustering to be an important part of web graph structure that was insufficiently represented in previous models [6, 8].

1.3 New Contributions

In this paper, we show that power law graphs do not require incremental growth, by developing a graph model which (empirically) results in power laws by evolving a graph according to a Markov process while maintaining constant size and density. We also provide an easily computable graph property that can be used to capture cluster information in a graph without enumerating all possible subgraphs.

2 Steady State Model

Our Steady State (SS) model is very simple in comparison with other web graph models [6, 8, 21, 27]. It consists of repetitively removing and adding edges on a sparse random graph G. Let m be Θ(n). To generate the initial sparse random graph G, we randomly add an edge between vertices with probability 2m/(n(n-1)). If the number of edges in G is still less than m, we then start adding edges between vertices with probability 0.5 until we have m edges. We reiterate the following steps r times on G, where r is a parameter of our model.

1. Pick a vertex v at random. If there is no edge incident upon v, we pick another one.
2. Pick an edge (u, v) in G at random.
3. Pick a vertex x at random.
4. Pick a vertex y with probability proportional to degree.
5. If (x, y) is not an edge in G and x is not equal to y, then remove edge (u, v) and add edge (x, y).

One can view our model as an aperiodic Markov chain with some limiting distribution. If we repeat the above steps long enough, we will get a random graph drawn from that distribution no matter what the initial random sparse graph is. Note that, unlike other models [6, 21], the graphs generated by our model contain neither self-loops nor multiple edges between two vertices. Barabási et al. [9] also proposed a non-growth model, which failed to produce a power law distribution. Both models have preferential connectivity features. However, there are several differences between our model and theirs. First, our edge set is fixed and the initial graph is generated via the classical random graph model [14, 18]. Second, our model has a "rewiring" feature similar to the one in the small world model [9, 25, 31].

2.1 Simulation Results

We simulated our model on graphs of different sizes (500 ≤ n ≤ 5000) and densities m/n (1 ≤ m/n ≤ 3), five times each. We performed r = 10M edge deletion/insertion operations on each graph. The vertices' degree distributions appear to converge to power law distributions as the number of edge deletion/insertion operations increases. Some of our simulation results are shown in Figures 1-4. Figures 1 and 3 show degree distributions at various stages of the simulations. Figures 2 and 4 show degree distributions for graphs with different densities m/n.
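The loop below is one possible reading of this procedure in plain Python. The initialization of the sparse random graph is simplified, and steps 1-2 are read as picking a random edge incident to a random non-isolated vertex, so this is a sketch rather than the authors' implementation.

```python
# Sketch of the Steady State rewiring loop on an undirected graph stored as
# adjacency sets; no self-loops or parallel edges are ever created.
import random

def steady_state(n, m, r, seed=0):
    rng = random.Random(seed)
    adj = {v: set() for v in range(n)}

    def add(a, b):
        adj[a].add(b); adj[b].add(a)

    def remove(a, b):
        adj[a].discard(b); adj[b].discard(a)

    edges = 0
    while edges < m:                                      # simplified initial sparse graph
        a, b = rng.randrange(n), rng.randrange(n)
        if a != b and b not in adj[a]:
            add(a, b); edges += 1

    nodes = list(range(n))
    for _ in range(r):
        v = rng.choice([x for x in nodes if adj[x]])      # step 1: non-isolated vertex
        u = rng.choice(sorted(adj[v]))                    # step 2: an edge (u, v)
        x = rng.randrange(n)                              # step 3: uniform vertex
        weights = [len(adj[w]) for w in nodes]            # step 4: degree-proportional
        y = rng.choices(nodes, weights=weights)[0]
        if x != y and y not in adj[x]:                    # step 5: rewire
            remove(u, v); add(x, y)
    return adj

if __name__ == "__main__":
    adj = steady_state(n=300, m=900, r=20_000)
    print("largest degrees after rewiring:",
          sorted((len(s) for s in adj.values()), reverse=True)[:10])
```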

Figure 1: Initial G(500, 1500), and G after 100K and 10M steps.

(For G(500, 1500), the best lines that fit our log-log plots have slopes between 1.34 and 1.37 and correlation coefficients between 0.808 and 0.877. For G(3000, 9000), the slopes are between 1.51 and 1.62 and the correlation coefficients are between 0.76 and 0.81.)

3 Cluster Information

Given a subgraph S of G, d_S(v) is the degree of vertex v in S. Here we examine the maximum degree d_max over all subgraphs, which is defined as

d_max = \max_S \min_{v \in S} d_S(v).

We use d_max^M to denote the value obtained under graph model M. To compute d_max for a graph G, we perform the following steps until G becomes empty:

1. Select a minimum-degree vertex v from G.
2. Set d_max to d(v) if d(v) > d_max.
3. Remove vertex v and its edges from G.
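A direct sketch of the elimination procedure just listed, on a small illustrative graph (the five-vertex graph of Figure 5 is not reproduced here).

```python
# Sketch of the d_max computation: repeatedly delete a minimum-degree vertex,
# recording the largest minimum degree seen along the way.
def d_max(adj):
    adj = {v: set(ns) for v, ns in adj.items()}   # work on a copy
    best = 0
    while adj:
        v = min(adj, key=lambda u: len(adj[u]))   # 1. minimum-degree vertex
        best = max(best, len(adj[v]))             # 2. update d_max
        for u in adj[v]:                          # 3. remove v and its edges
            adj[u].discard(v)
        del adj[v]
    return best

if __name__ == "__main__":
    # A triangle with two pendant vertices: the triangle is the densest part,
    # so d_max comes out as 2.
    g = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
         "d": {"c", "e"}, "e": {"d"}}
    print(d_max(g))   # -> 2
```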

Figure 2: G(500, 500), G(500, 1000), and G(500, 1500) after 10M steps.

Figure 3: Initial G(3000, 9000), and G after 100K and 10M steps.

Figure 4: G(3000, 3000), G(3000, 6000), and G(3000, 9000) after 10M steps.

Figure 5: Minimal degree vertex elimination.

The above steps correctly compute d_max because we cannot remove any vertex of S until the degree of the current subgraph reaches d_max. The minimal-degree elimination sequence for the graph in Figure 5 is B, C, A, D, and E. The degrees at which those vertices are eliminated are 1, 1, 2, 1, and 1. d_max is 2 since max{1, 1, 2, 1, 1} = 2.

Observation 1. For any model M that constructs a graph by adding one vertex at a time, and for which each newly added vertex has the same degree d = m/n, d_max^M = d. Thus the Barabási and Albert model (BA) [8] and the linear growth copying model in [21] have the same value of d_max for graphs of all sizes once d = m/n is fixed.

Observation 2. The web graph generated by the linear model has a minimum vertex degree

of d = m/n. Hence, the linear model may not encapsulate all the crucial properties of a web graph if there is a significant number of vertices with degree less than m/n.

3.1 Web Crawl and Simulation Data

We performed web crawls on various Computer Science sites. We then used the ACL model [6] to generate new graphs from the degree sequences of the actual web graphs. We also ran the SS model using the n and m values from the actual web graphs, performing edge insertion/deletion steps as above. For each graph, we ran both models 5 times. The following table shows the means μ and the standard deviations σ of the d_max values for the ACL model and the SS model.

Site         n    m    d_max    μ_ACL    σ_ACL    μ_SS    σ_SS
arizona
berkeley
caltech
cmu
cornell
harvard
mit
nd
stanford
ucla
ucsb
ucsd
uiowa
uiuc
unc
washington

Table 1: d_max from actual web crawls and model simulations.

In general, the ACL model and the SS model generate less clustered graphs than what we see in actual web graphs. This implies that we need a more detailed model of web graph clustering behavior.

4 Conclusion and Open Problems

Previously, researchers have conjectured that preferential connectivity and incremental growth are necessary factors in creating power law graphs. In this paper, we provide a model of graph evolution that produces power laws without growth. Our Steady State model is very simple in comparison with other graph models [21]. It also does not require prior degree sequences as the ACL model [6] does. The difficulty in comparing various models [6, 8, 21] is that each model has different parameters and inputs. Here we provide a simple graph property, d_max, that captures the clustering behavior of graphs without a complicated subgraph enumeration algorithm. It can be useful in gauging the accuracy of various models. From our web crawl data, we know that linear models such as Barabási's [8] are not the best ones to use when considering d_max. Both the ACL and SS models do not generate dense-enough subgraphs when compared against the actual web graphs. Thus, we need a better web graph model that mimics actual web graph clustering behavior. Here are some of our open problems:

1. Can one prove theoretically that the SS method actually has a power law distribution?
2. How long does it take for our model to reach a steady state? As time proceeds, the "high" degree vertices will attract more edges whereas all other vertices will have fewer edges connecting to them, until we reach a state after which the degree distribution won't fluctuate much.
3. What are other simple web graph properties that we can use to determine the accuracy of various models?
4. Are there any techniques, such as graph products, that we can use to generate massive web graphs in a relatively short time?

References

[1] Google.
[2] The Internet Archive.
[3] Online Computer Library Center. wcp.oclc.org.
[4] Abello, J., Buchsbaum, A., and Westbrook, J. A functional approach to external graph algorithms. In Proceedings of the 6th European Symposium on Algorithms (1998).

[5] Adamic, L., and Huberman, B. Power-law distribution of the world wide web. Science 287 (2000).
[6] Aiello, W., Chung, F., and Lu, L. A random graph model for massive graphs. In Proceedings on Theory of Computing (2000).
[7] Albert, R., Jeong, H., and Barabási, A. Diameter of the world-wide web. Nature 401 (September 1999).
[8] Barabási, A., and Albert, R. Emergence of scaling in random networks. Science 286, 5439 (1999).
[9] Barabási, A., Albert, R., and Jeong, H. Mean-field theory for scale-free random networks. Physica A 272 (1999).
[10] Barabási, A., Albert, R., and Jeong, H. Scale-free characteristics of random networks: the topology of the world-wide web. Physica A 281 (2000).
[11] Barabási, A., Albert, R., Jeong, H., and Bianconi, G. Power-law distribution of the world wide web. Science 287 (2000).
[12] Broder, A. Z., Kumar, S. R., Maghoul, F., Raghavan, P., Rajagopalan, S., Stata, R., Tomkins, A., and Wiener, J. Graph structure in the web: experiments and models. In Proceedings of the 9th WWW Conference (2000).
[13] Calvert, K., Doar, M., and Zegura, E. Modeling internet topology. IEEE Communications Magazine (June 1997).
[14] Erdős, P., and Rényi, A. On random graphs I. Publ. Math. Debrecen 6 (1959).
[15] Faloutsos, M., Faloutsos, P., and Faloutsos, C. On power-law relationships of the internet topology. In Proceedings of the ACM SIGCOMM Conference (1999).
[16] Huberman, B., and Adamic, L. Growth dynamics of the world-wide web. Nature 401 (September 1999).
[17] Huynen, M. A., and van Nimwegen, E. Power laws in the size distribution of gene families in complete genomes: biological interpretations. Tech. Rep., Santa Fe Institute.
[18] Janson, S., Luczak, T., and Rucinski, A. Random Graphs. John Wiley & Sons.

[19] Kleinberg, J., Kumar, S. R., Raghavan, P., Rajagopalan, S., and Tomkins, A. The web as a graph: measurements, models and methods. In Proceedings on Combinatorics and Computing (1999).
[20] Kleinberg, J., and Lawrence, S. The structure of the web. Science 294 (2001).
[21] Kumar, S. R., Raghavan, P., Rajagopalan, S., Sivakumar, D., Tomkins, A., and Upfal, E. Stochastic models for the web graph. In Proceedings on Foundations of Computer Science (2000).
[22] Kumar, S. R., Raghavan, P., Rajagopalan, S., and Tomkins, A. Trawling the web for emerging cyber-communities. In Proceedings of the 8th WWW Conference (1999).
[23] Lu, L. The diameter of random massive graphs. In Proceedings on Discrete Algorithms (2001).
[24] Medina, A., Matta, I., and Byers, J. On the origin of power laws in internet topologies. ACM Computer Communication Review 30, 2 (2000).
[25] Milgram, S. The small world problem. Psychol. Today 2 (1967).
[26] Ormerod, P., and Smith, L. Power law distribution of lifespans of large firms: breakdown of scaling. Tech. Rep., Volterra Consulting Ltd.
[27] Palmer, C., and Steffan, J. Generating network topologies that obey power laws. In Proceedings of IEEE Globecom (2000).
[28] Palmer, M. W., and White, P. S. Scale dependence and the species-area relationship. American Naturalist 144 (1994).
[29] Qian, J., Luscombe, N. M., and Gerstein, M. Protein family and fold occurrence in genomes: power-law behaviour and evolutionary model. Journal of Mol. Biology 313 (2001).
[30] Rhodes, C. J., and Anderson, R. M. Power laws governing epidemics in isolated populations. Nature 381 (1996).
[31] Watts, D. J. Small Worlds: The Dynamics of Networks between Order and Randomness. Princeton University Press, Princeton, N.J.
[32] Waxman, B. M. Routing of multipoint connections. IEEE Journal on Selected Areas in Communication 6, 9 (December 1988).

A Multi-Layer Model for the Web Graph

L. Laura, S. Leonardi, G. Caldarelli, P. De Los Rios

April 5, 2002

Abstract. This paper studies stochastic graph models of the WebGraph. We present a new model that describes the WebGraph as an ensemble of different regions generated by independent stochastic processes (in the spirit of a recent paper by Dill et al. [VLDB 2001]). Models such as the Copying Model [17] and the Evolving Networks Model [3] are simulated and compared on several relevant measures such as degree and clique distribution.

1 Overview

The WWW can be considered as a graph (the WebGraph) where nodes are static html pages and (directed) edges are hyperlinks between these pages. This graph has been the subject of a variety of recent works aimed at understanding the structure of the World Wide Web [3, 6, 9, 17, 18, 21]. Characterizing the graph structure of the web is motivated by several large scale web applications, primarily mining information on the web. Data mining on the web has greatly benefited from the analysis of the link structure of the Web. One example is represented by the algorithms for ranking pages such as PageRank [8] and HITS [15]. Link analysis is also at the basis of the sociology of content creation, and of the detection of structures hidden in the web (such as bipartite cores of cyber communities and webrings [18]). Developing a realistic and accurate stochastic model of the web graph is a challenging and relevant task for several other reasons:

- Testing web applications on synthetic benchmarks.
- Detecting peculiar regions of the WebGraph, i.e. local subsets that share different statistical properties with the whole structure.
- Predicting the evolution of new phenomena in the Web.
- Dealing more efficiently with large scale computation (e.g. by recognizing the possibility of compressing a graph generated according to such a model [1]).

Dip. di Informatica e Sistemistica, Università di Roma La Sapienza, Via Salaria, Roma, Italy. {laura,leon}@dis.uniroma1.it
Sezione INFM and Dip. Fisica, Università di Roma La Sapienza, P.le A. Moro, Roma, Italy. gcalda@pil.phys.uniroma1.it
Institut de Physique Théorique, Université de Lausanne, BSP 1015 Lausanne, Switzerland, and INFM - Sezione di Torino Politecnico, Dip. di Fisica, Politecnico di Torino, C.so Duca degli Abruzzi 24, Torino, Italy. E-mail: paolodelosrios@ipt.unil.ch

The study of the statistical properties of several observables in large samples of the WebGraph is at the basis of the validation of stochastic graph process models of the Web. Kumar et al. [18] and Barabasi and Albert [5] suggested that both the indegree and outdegree distributions of the nodes of the WebGraph follow a power law. Experiments on a larger scale made by Broder et al. [9] confirmed this as a basic web property. The probability that the indegree of a vertex is i is distributed as a power law, Pr_u[in-degree(u) = i] ∝ 1/i^γ, for γ ≈ 2.1. The outdegree of a vertex is also distributed with a power law, with exponent roughly equal to 2.7. The average number of edges per vertex is about 7. In the same paper [9], Broder et al. presented a fascinating picture of the Web's macroscopic structure: a bow-tie shape, where the major part of the sites can be divided into three sets: a core made by the strongly connected component (SCC), i.e. sites that are mutually connected to each other, and two sets (IN and OUT) made by the sites that can only reach (or be reached by) the sites in the SCC set. They also showed that for a randomly chosen source and destination there is only a 24% probability that any path exists, and, in that case, the average length is about 16 (against the 19 of Barabasi [5]). The authors also found that the WebGraph exhibits the small world phenomenon [24, 16] typical of dynamical social networks only if the hyperlinks are considered undirected: within a few edges, almost all pages are reachable from every other page within a giant central connected component including about 90% of the web documents. A surprising number of specific topological structures, such as bipartite cliques of relatively small size (from 3 to 10), has been recognized in the web [18]. The study of such structures is aimed at tracing the emergence of a large number of still hidden cyber-communities: groups of individuals who share a common interest, together with the web pages most popular among them. A bipartite clique is interpreted as a core of such a community, defined by a set of fans, all pointing to a set of authorities, and the set of authorities, all pointed to by the fans. Many such communities have been recognized on a sample of 200M pages from an Alexa Web crawl. In a more recent paper, Dill et al. [12] explain how the web shows a fractal structure in many different ways. The graph can be viewed as the outcome of a number of similar and independent stochastic processes. At various scales there are cohesive collections of web pages (for example pages on a site, or pages about a topic), and these collections are structurally similar to the whole web (i.e. they exhibit the bow-tie structure and follow power laws for the indegree and outdegree). The central regions of such collections are called Thematically Unified Clusters (TUCs), and they provide a navigational backbone of the Web. The correlation between the distribution of PageRank [8] (as computed in the Google search engine) and the in-degree in the web graph has also been considered [22]. The authors show that PageRank is distributed with a power law of exponent -2.1, but, very surprisingly, there is very little correlation between the PageRank and the in-degree of pages, i.e. pages with high in-degree may have very low PageRank. Most of the statistical properties observed in the WebGraph cannot be found in traditional stochastic graph models.
Firstly, traditional models such as the random graph model of Erdös and Rényi are static models, while a stochastic model for the WebGraph evolves over time as new pages are published on the web or are removed from it. Secondly, the random graph model of Erdös and Rényi fails to capture the self-similar nature of the web graph. A signature of the self-similar behavior of the structure is the ubiquitous presence of power laws. Indeed, power law distributions have also been observed in the popularity (number of clicks) of web pages and in the topological structure of the graph of the Internet [14, 10]. Aiello, Chung and Lu [2] proposed stochastic graphs appropriately customized to reproduce the power law distribution of the degree. They present a model for an undirected graph, meant to represent

the traffic of phone calls, in which the degree of the vertices is drawn from a power law distribution. Albert, Barabasi and Jeong [3] started this study by presenting the first model of Evolving Networks, in which at every discrete time step a new vertex is introduced in the graph and connects to existing vertices with a constant number of edges. A vertex is selected as the end-point of an edge with probability proportional to its in-degree, with an appropriate normalization factor. This model shows a power law distribution over the in-degree of the vertices with exponent roughly -2 when the number of edges that connect every vertex to the graph is 7. The Copying model was later proposed by Kumar et al. [17] to explain other relevant properties of the WebGraph. The first property is the amazing presence of a large number of dense subgraphs such as bipartite cliques. The Copying model is also an evolving model, in which for every new vertex entering the graph one randomly selects a prototype vertex p amongst the ones inserted in previous timesteps. A constant number d of links connect the new vertex to previously inserted vertices. The model is parametric over a copying factor α. The end-point of the lth link, l = 1, ..., d, is either copied with probability α from the corresponding lth out-link of the prototype vertex p, or it is selected at random with probability 1 - α. The copying event tries to model the formation of cyber communities in the web: web documents linking a common set of authoritative pages for a topic of common interest. The model has been analytically studied and shown to yield a power law distribution on both the in-degree and the number of disjoint bipartite cliques. Very recently, Pandurangan, Raghavan and Upfal [22] proposed a model based on the rank values computed by the PageRank algorithm used in search engines such as Google. They propose to complement the Albert, Barabasi and Jeong model in the following manner. There are two parameters a, b ∈ [0, 1] such that a + b ≤ 1. With probability a the end-point of the lth edge is chosen with probability proportional to its in-degree, with probability b it is chosen with probability proportional to its PageRank, and with probability 1 - a - b at random. The authors show by computer simulation that, with an appropriate fitting of the parameters, the generated graphs capture distributional properties of both PageRank and in-degree.

1.1 Our work

As a matter of fact, the models presented so far correctly reproduce only a few observables, such as the degree and PageRank distributions. We cite from the conclusions of the work of Dill et al.: "There are many lacunae in our current understanding of the graph theoretic structure of the web. One of the principal holes deals with developing stochastic models for the evolution of the web graph (extending [18]) that are rich enough to explain the fractal behavior of the web. ..." A model is still missing that is rich enough to explain the self-similar nature of the web and to reproduce more relevant observables, for instance clique distribution, distances, and connected components. The recent study of Dill et al. [12] gives a picture of the web explaining its fractal structure as produced by the presence in the web of multiple regions generated by independent stochastic processes, the different regions differing in size and aggregation criteria, for instance topic, geography or Internet domain.
All these regions are connected together by a connectivity backbone formed by pages that are part of multiple regions. In fact, all previous models present the Web as a flat organism: every page may potentially connect to every other page of the Web. This is indeed far from reality. We propose a Multi-Layer model in which every new page that enters the graph is assigned a constant number of regions it belongs to, and it is allowed to link only to vertices in the same regions. When deciding the end-points of the edges

we adopt a combination of the Copying and Evolving Network mechanisms in the subgraph of the specific region. In particular, if an edge is not copied from the prototype vertex, its end-point is chosen with probability proportional to the in-degree in the existing graph. The final outcome of the stochastic process is the graph obtained by merging the edges inserted between vertices of all layers. This model is explained in detail in Section 2. We then provide the results of an extensive experimental study of the distributional properties of relevant measures on graphs generated by several stochastic models, including the Evolving Network model [3], the Copying Model [17], and the Multi-Layer model introduced in this paper, and we make a comparative analysis with the experimental results presented in earlier papers and with the real data obtained from the Notre Dame University domain, on which the first analyses of the statistical properties of the web were performed [5]. All models have been simulated up to 300,000 vertices and 2,100,000 edges.

1.2 Structure of the paper

This paper is organized as follows: in Section 2 we introduce the Multi-Layer WebGraph model. In Section 3 we describe the models we simulate and the collection of metrics that we have computed. In Section 4 we show the experimental results.

2 The Multi-Layer model

In this section we present our model in detail. The model evolves in discrete time steps. The graph will be formed by the union of L regions, also denoted as layers. At each time step t, a new page x enters the graph and is assigned a fixed number l of regions and d edges connecting it to previously existing pages. Let X_i(t) be the number of pages assigned to region i at time t. Let L(x) be the set of regions assigned to page x. We repeat the following random choice l times: L(x) = L(x) ∪ {i}, where region i is chosen in L \ L(x) with probability proportional to X_i(t), with a suitable normalization factor. The stochastic process above clearly defines a Zipf's distribution over the sizes of the region populations, i.e. the values X_i(t). The d edges are evenly distributed (up to 1) between the l regions. Let c = ⌊d/l⌋ and let α be the copying factor. Consider each region i to which vertex x is assigned. Vertex x will be connected by c or c + 1 edges to other vertices of region i. Denote by X the set of X_i(t) vertices assigned to region i before time t. The layer-i graph, denoted by G_i(t), is formed by the vertices of X and by the edges inserted before time t between vertices of X. We choose a prototype vertex p in X. If we connect vertex x with c edges to region i, then for every l = 1, ..., c, with probability α the lth edge is copied from the lth edge of vertex p in G_i(t). Otherwise, the lth endpoint is chosen amongst those vertices in X not already linked by x, with probability proportional to the in-degree in G_i(t) (plus 1), with a suitable normalization factor. If c + 1 edges need to be inserted and the prototype vertex is connected with only c edges, the (c + 1)th edge is chosen with probability proportional to the in-degree. The resulting graph has edge set given by the union of the edges of all layers.
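A simplified sketch of this generator is given below. Boundary cases (the very first vertices of a region, prototypes with fewer out-links than required, the split between c and c+1 edges) are resolved in the most convenient way, so it illustrates the process rather than reproducing the authors' implementation; the default parameters echo the settings used in Section 3 (25 layers, 3 regions per page, out-degree 7, α = 0.3).

```python
# Simplified sketch of the Multi-Layer generator: each new page picks l regions
# (proportionally to their population), then adds its edges inside each region,
# copying from a prototype with probability alpha and otherwise choosing a
# target proportionally to in-degree + 1.
import random

def multilayer(n_pages, n_layers=25, l=3, d=7, alpha=0.3, seed=0):
    rng = random.Random(seed)
    members = {i: [] for i in range(n_layers)}   # pages assigned to each region
    links = {}                                   # (page, region) -> list of targets
    indeg = {}

    for x in range(n_pages):
        indeg[x] = 0
        regions = []
        for _ in range(l):                       # l distinct regions for page x
            pool = [i for i in range(n_layers) if i not in regions]
            w = [len(members[i]) + 1 for i in pool]   # +1 bootstraps empty regions
            regions.append(rng.choices(pool, weights=w)[0])
        per_region = [d // l + (1 if i < d % l else 0) for i in range(l)]  # e.g. 3, 2, 2
        for region, c in zip(regions, per_region):
            peers = members[region]
            mine = links.setdefault((x, region), [])
            if peers:
                proto = rng.choice(peers)                    # prototype vertex
                proto_links = links.get((proto, region), [])
                for k in range(c):
                    if rng.random() < alpha and k < len(proto_links):
                        target = proto_links[k]              # copy the k-th out-link
                    else:                                    # preferential choice
                        w = [indeg[p] + 1 for p in peers]
                        target = rng.choices(peers, weights=w)[0]
                    if target not in mine:
                        mine.append(target)
                        indeg[target] += 1
            members[region].append(x)
    return links, indeg

if __name__ == "__main__":
    links, indeg = multilayer(5000)
    print("largest in-degrees:", sorted(indeg.values(), reverse=True)[:10])
```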

Figure 1: A Multi-Layer view of a graph.

3 Data Sets and simulated models

We study the graph properties of several data sets of about 300k vertices and about 2.1 million edges, therefore yielding an average degree of 7. In particular, we consider the following data sets:

1. ABJ: A graph generated according to the Evolving Network model of Albert, Barabasi and Jeong [3].

2. ACL: A graph generated according to the Aiello, Chung and Lu model [2]. This model is for undirected graphs. To generate a directed graph we modified the model in the following way: we build two sets, OS and IS. In OS we put 7 copies of each vertex (to simulate the average out-degree of 7), while in IS we put deg(v) copies of each vertex, where deg(v) is chosen according to a power law distribution. Then we randomly match the two sets (i.e. every element in the set OS is matched with one element of IS).

3. CL0N: A set of graphs generated according to the Kumar et al. model [17], with copying factor α = 0.N; e.g. CL03 has α = 0.3.

4. ER: A graph generated according to the Random Graph model of Erdös and Renyi.

5. ML0N-#L: A set of graphs generated according to the Multi-Layer model presented in this paper, with copying factor α = 0.N. #L indicates the number of layers, and, unless otherwise specified, we have 3 layers per page, with out-degrees respectively 3, 2, 2 (which sum up to 7, the average out-degree of the Web). ML03-25 indicates a graph generated with α = 0.3 and 25 layers.

6. SWP: A graph generated according to the small-world model [26]: we considered a bi-dimensional lattice and added three distant random nodes to every node.

7. NotreDame: The nd.edu subnet of the Web [3].

We compare the synthetic data obtained by the models listed above and the NotreDame data set on the following measures:

(a) The degree distribution P(k), giving the frequency of a certain degree k in the graph.

(b) The number of vertices N(d) within distance d from a certain vertex v_0, in the graph obtained by removing the orientation of the edges.

(c) The number of disjoint bipartite cliques in the graph. A bipartite clique (k, c) is formed by k vertices all connected by directed links to each of the c vertices. We implemented the method described in [18] to estimate the number of disjoint bipartite cliques in the graph.

(d) The clustering coefficient [20], which measures to what extent the neighbors of a vertex are connected to each other. It is defined in the following way: consider a vertex v of the graph G = (V, E), and consider the k neighbors of v, i.e. the vertices v_1, ..., v_k such that (v, v_i) is an edge of the graph G. The clustering coefficient is the average, over all vertices of the graph, of the ratio between the number of edges (v_i, v_j), with i, j = 1..k, that belong to E, and its maximum possible value k(k-1)/2, which is the number of edges in the complete graph on the vertices v_1, ..., v_k. We measure both the directed clustering coefficient (C_d) and the undirected clustering coefficient (C_u) of the graph obtained by removing the orientation of the edges.

4 Analysis of results

The simulation results are summarized in Table 1. In the first and second columns we report the number of disjoint bipartite cliques (3, 3) and (4, 3). The number of cliques observed in the NotreDame data set is considerably smaller than that observed in CL07, where the power law distribution with exponent 2.1 is obtained. In [18] Kumar et al. analyze a 200M web page sample from Alexa, and the (4, 3) cores they measure are four times those present in the CL07 set and ten times the number in the NotreDame set (but both CL07 and NotreDame have only 300k nodes). We also notice that the number of cliques in the Copying Model grows by orders of magnitude with α. The Multi-Layer model shows a number of cliques that grows with α. All other models do not show any consistent number of cliques. Next in Table 1 we report the values of the clustering coefficients. It is interesting to notice that, exactly as the number of cliques, the clustering coefficients of the Copying Model also grow by orders of magnitude with α. In the Multi-Layer Model the clustering coefficients grow with α and decrease with L. The last column of Table 1 reports the fit of the distribution of the in-degree on the log-log scale. Figures illustrating the frequency of the indegree in the graphs generated by the various models and in the NotreDame subdomain are reported in the appendix. The NotreDame data set indegree follows a power law with exponent 2.1, as reported by several observations of the web graph. The Evolving Network model exhibits a power law with in-degree exponent 2.0. The ACL model is explicitly constructed with a power law distribution of exponent 2.1. It is interesting to notice that the Copying model follows a power law distribution only when the copying factor α approaches 0.5, and only with a fairly large α do we have a slope close to 2.1.
The Multi-Layer model in-degree follows a power law with exponent 2.1 within a fairly large variation of the copying factor α ∈ [0.3, 0.7] and of the number of layers L ∈ [25, 100].
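As a side note, the slope reported in the γ column can be obtained with a least-squares fit on the log-log in-degree histogram. The sketch below runs on a synthetic degree sequence only to keep the example self-contained; it is not the authors' fitting code.

```python
# Sketch: estimate a power-law exponent by fitting a line to the log-log
# frequency histogram of a degree sequence (ordinary least squares).
import math
import random
from collections import Counter

def loglog_slope(degrees):
    hist = Counter(d for d in degrees if d > 0)
    xs = [math.log(k) for k in hist]
    ys = [math.log(c) for c in hist.values()]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

if __name__ == "__main__":
    rng = random.Random(0)
    # crude heavy-tailed sample via inverse transform (density exponent ~ 2.1)
    degs = [int((1.0 - rng.random()) ** (-1.0 / 1.1)) for _ in range(100000)]
    print("fitted slope:", round(loglog_slope(degs), 2))
```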

In Figures 2 and 3 we see that all the simulated models and the NotreDame data set show a small world phenomenon: all vertices of the graph are reachable within a distance of a few edges when the orientation of the edges is removed. This has already been observed, for instance, in [9].

Model        (3, 3)   (4, 3)   CC_d   CC_u   γ
ABJ
ACL
CL03
CL05
CL07
ER
ML03-25
ML03-50
ML03-100
ML05-25
ML05-50
ML05-100
NotreDame
SWP

Table 1: Data Sets

5 Conclusion and further work

In this work we present a Multi-Layer model of the Web graph and an attempt to compare several models of the web graph and a real data set. We plan to further develop the idea of a Multi-Layer Model for the WebGraph and to compare the simulated models with larger and different samples of the Web. We also plan to compare the data sets on other relevant metrics, such as the size of the strongly connected components. Our effort is aimed at designing a model that resembles the complex nature of the Web Graph.

References

[1] M. Adler and M. Mitzenmacher. Towards compressing Web graphs. U. of Mass. CMPSCI Technical Report.
[2] W. Aiello, F. Chung, L. Lu. A random graph model for massive graphs. Proc. ACM Symp. on Theory of Computing.
[3] R. Albert, H. Jeong and A.L. Barabasi. Nature 401, 130 (1999).
[4] J.R. Banavar, A. Maritan and A. Rinaldo. Nature 399, 130 (1999).
[5] A.L. Barabasi and R. Albert. Emergence of scaling in random networks. Science (1999).
[6] B. Bollobas. Random Graphs. Academic Press, London (1985).
[7] B. Bollobas, F. Chung. The diameter of a cycle plus a random matching. SIAM Journal of Discrete Maths 1 (1988).

Figure 2: Distance of portions of the graph (percentile of the whole graph reached within l links; curves for CL03, CL05, CL07, ACL, ER, SmallWorld, and ABJ).

[8] S. Brin and L. Page. The anatomy of a large-scale hypertextual Web search engine. In Proceedings of the 7th WWW Conference, 1998.
[9] A. Broder, R. Kumar, F. Maghoul, P. Raghavan, S. Rajagopalan, R. Stata, A. Tomkins, J. Wiener. Graph structure in the web.
[10] G. Caldarelli, R. Marchetti, L. Pietronero. Europhys. Lett. 52, 386 (2000).
[11] C. Cooper, A. Frieze. A general model of web graphs. (2001)
[12] S. Dill, R. Kumar, K. McCurley, S. Rajagopalan, D. Sivakumar, A. Tomkins. Self-similarity in the web. (2001)
[13] P. Erdös and R. Renyi. Publ. Math. Inst. Hung. Acad. Sci 5, 17 (1960).
[14] M. Faloutsos, P. Faloutsos and C. Faloutsos. On Power-Law Relationships of the Internet Topology. ACM SIGCOMM (1999).
[15] J. M. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM, vol. 46, n. 5 (1997).
[16] J. Kleinberg. The Small World Phenomenon: an algorithmic perspective.
[17] R. Kumar, P. Raghavan, S. Rajagopalan, D. Sivakumar, A. Tomkins, E. Upfal. Stochastic models for the web graph.
[18] S.R. Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins. Trawling the Web for Emerging Cyber Communities. Proc. of the 8th WWW Conference (1999).

[Figure 3: Distance of portions of the graph. Percentile of the whole graph reached within l links, as a function of the number of links, for the ML variants and NotreDame.]

[19] T.A. McMahon and J.T. Bonner, On Size and Life, Freeman, New York (1983).
[20] M.E.J. Newman, Models of the small world. J. Stat. Phys. 101 (2000).
[21] C.H. Papadimitriou, Algorithms, Games and the Internet. STOC 2001.
[22] G. Pandurangan, P. Raghavan, E. Upfal. Using PageRank to Characterize Web Structure.
[23] I. Rodriguez-Iturbe and A. Rinaldo, Fractal River Basins, Chance and Self-Organization, Cambridge University Press, Cambridge (1997).
[24] D. Watts, S. Strogatz, Collective Dynamics of small-world networks, Nature (1998).
[25] G.B. West, J.H. Brown and B.J. Enquist, Science 276, 122 (1997).
[26] D.J. Watts and S.H. Strogatz, Nature 393, 440 (1998).
[27] A description of the project, together with maps done by B. Cheswick and H. Burch, is available at

[Appendix: log-log plots of the in-degree distributions for the graphs generated by the various models and for the NotreDame data set.]

Modeling Adaptive Hypermedia with an Object-Oriented Approach and XML

Mario Cannataro 1, Alfredo Cuzzocrea 1,2, Carlo Mastroianni 1, Riccardo Ortale 1,2, and Andrea Pugliese 1,2
1 ISI-CNR, Via P. Bucci, Rende, Italy
2 DEIS-Università della Calabria

Abstract. This work presents an Application Domain model for Adaptive Hypermedia Systems and an architecture for its support. For the description of the high-level structure of the application domain we propose an object-oriented model based on the class diagrams of the Unified Modeling Language, extended with (i) a graph-based formalism for capturing navigational properties of the hypermedia and (ii) a logic-based formalism for expressing further semantic properties of the domain. The model makes use of XML for the description of metadata about basic information fragments and of neutral pages to be adapted. Moreover, we propose a three-dimensional approach to model different aspects of the adaptation model, based on different user characteristics: an adaptive hypermedia is modeled with respect to such dimensions, and a view over it corresponds to each potential position of the user in the adaptation space. In particular, a rule-based method is used to determine the generation and delivery process that best fits technological constraints.

1 Introduction

In hypertext-based multimedia systems, the personalization of presentations and contents (i.e. their adaptation to the user's requirements and goals) is becoming a major requirement. Application fields where content personalization is useful are manifold; they comprise on-line advertising, direct web marketing, electronic commerce, on-line learning and teaching, etc. The need for adaptation arises from different aspects of the interaction between users and hypermedia systems. The user classes to be dealt with are increasingly heterogeneous due to different interests and goals, world-wide deployment of information and services, etc. Furthermore, nowadays hypermedia systems must be made accessible from different user terminals, through different kinds of networks, and so on. To face some of these problems, in the last years the concepts of user-model-based adaptive systems and hypermedia user interfaces have come together in the Adaptive Hypermedia (AH) research theme [1, 5, 4]. The basic components of adaptive hypermedia systems are (i) the Application Domain Model, used to describe the hypermedia basic contents and their organization to depict more abstract concepts, (ii) the User Model, which attempts to describe some user characteristics and his/her expectations in the browsing of the hypermedia, and (iii) the Adaptation Model, which describes how to adapt contents, i.e. how to manipulate the basic information fragments and the links. More recently, the capability to deliver a certain content to different kinds of terminals, i.e. the support of multi-channel accessible web systems, is becoming an important requirement. To efficiently allow the realization of user-adaptable content and presentation, a modular and scalable approach to describe and support the adaptation process must be adopted. A number of interesting models, architectures and methodologies have been developed in the last years for describing and supporting (adaptive) hypermedia systems [11, 10, 14, 2, 8, 13, 12, 9]. In this paper we present a model for Adaptive Hypermedia and a flexible architecture for its support.
Our work is specifically concerned with a complete and flexible data-centric support of adaptation; the proposed model has in fact a pervasive orientation towards such issues, which are typically not crucial in hypermedia models. We intend to focus on (i) the description of the structure and contents of an Adaptive Hypermedia in such a way that it is possible to point out the components

on which to perform adaptation (the what), and our belief is that the most promising approach in modeling the application domain is data-centric (in fact many recent researches employ well-known database modeling techniques); (ii) a simple representation of the logic of the adaptation process, distinguishing between adaptation driven by user needs and adaptation driven by technological constraints (the how); (iii) the support of a wide range of adaptation sources.

The logical structure and contents of an adaptive hypermedia are described along two different layers. The aim of the lower layer is to define the content of XML pages and the associated semantics (using an object-oriented model) and the navigational features of the hypermedia (with a directed graph model). The upper layer describes the structure of the hypermedia as a set of views associated to groups of users (i.e. stereotype profiles) and some semantic relationships among profiles (using logical rules). Finally, the adaptation model is based on a multidimensional approach: each part of the hypermedia is described along three different adaptivity dimensions, each related to a different aspect of the user's characteristics (behavior, used technology and external environment). A view over the Application Domain corresponds to each possible position of the user in the adaptation space. The XML pages are independent from such position, and the final pages (e.g. HTML, WML, etc.) to be delivered are obtained through a transformation that is carried out in two distinct phases, the first one driven by the user's profile and environmental conditions, and the second one driven by technological aspects.

2 Adaptive Hypermedia Modeling

In our approach to the modeling of adaptive hypermedia we chose to adopt the classical Object-Oriented paradigm, since it permits a complete high-level description of concepts. Furthermore, we chose to adopt XML as the basic formalism due to its flexibility and data-centric orientation. In fact, XML makes it possible to elegantly describe data access and dynamic data composition functions, allowing the use of pre-existing multimedia basic data (e.g. stored in relational databases and/or file systems) and the description of contents in a completely terminal-independent way. We model the (heterogeneous) data sources by means of XML meta-descriptions (Sec. 2.2). The basic information fragments are extracted from data sources and used to compose descriptions of pages which are neutral with respect to the user's characteristics and preferences; such pages are called Presentation Descriptions (PD). The PDs are organized in a directed graph for navigational aspects, and in an object-oriented structure for semantic purposes, in the Description Layer (Sec. 2.3), while knowledge-related concepts (topics) are associated to PDs and profiles in the Logical Layer (Sec. 2.4). The transformation from the PDs to the delivered final pages is carried out on the basis of the position of the user in an adaptation space (Sec. 2.1). The process is performed in two phases: in the first phase the PD is instantiated with respect to the environmental and user dimensions and a technology-independent PD is generated; in the second phase (Sec. 2.5) the PD is instantiated with respect to the technology dimension.
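As a rough illustration of this two-phase process, the following sketch (plain Python with hypothetical helper names, not part of the proposed architecture) separates the instantiation driven by the user and environment dimensions from the technology-driven formatting.

# Illustrative two-phase adaptation pipeline (hypothetical helper names).
# Phase 1: instantiate the neutral Presentation Description (PD) with respect
#          to the user's behaviour B and the external environment E.
# Phase 2: instantiate the technology-independent result with respect to the
#          technology dimension T (layout, markup, generation method).
def adapt(pd, position):
    b, e, t = position                               # position of the user: [B, E, T]
    neutral_page = instantiate_user_env(pd, b, e)    # still terminal-independent XML
    return instantiate_technology(neutral_page, t)   # final HTML/WML/... page

def instantiate_user_env(pd, behaviour, environment):
    # select fragments and links according to profile and environment (stubbed)
    return {"pd": pd, "behaviour": behaviour, "environment": environment}

def instantiate_technology(page, technology):
    # choose a stylesheet / generation method for the client device (stubbed)
    return {"page": page, "format": technology}

print(adapt("pd-home.xml", ("novice", "night", "wap")))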
2.1 Adaptation Space

In our proposal the application domain is modeled along three abstract, orthogonal adaptivity dimensions:
- User's behavior (browsing activity, preferences, etc.);
- External environment (time-spatial location, language, socio-political issues, etc.);
- Technology (user's terminal, client/server processing power, kind of network, etc.).

The position of the user in the adaptation space can be denoted by a tuple of the form [B, E, T]. The B value captures the user's profile; the E and T values respectively identify the environmental

location and the used technologies. The AHS monitors the different possible parameters that can affect the position of the user in the adaptation space, collecting a set of values called User, Technological and External Variables. On the basis of such variables, the system identifies the position of the user. The user's behavior and external environment dimensions mainly drive the generation of page contents and links. Instead, the technology dimension mainly drives the adaptation of the page layout and the page generation process. For example, an e-commerce web site could show a class of products that fits the user's needs (deduced from his/her behavior), applying a time-dependent price (e.g. night or day), formatting data with respect to the user terminal and sizing data with respect to the network bandwidth.

2.2 XML Metadata about Basic Information Fragments

Information fragments are the atomic elements used to build hypermedia contents; fragments are extracted from data sources that, in the proposed model, are described by XML meta-descriptions. The use of metadata is a key aspect for the support of multidimensional adaptation; for example, an image could be represented using different detail levels, formats or points of view (shots), whereas a text could be organized as a hierarchy of fragments or written in different languages. Each fragment is associated to a different portion of the multidimensional adaptation space. By means of meta-descriptions, data fragments of the same kind can be treated in an integrated way, regardless of their actual sources; in the construction of pages the author refers to metadata, thus avoiding an excessively low-level access to fragments. A number of XML meta-descriptions have been designed making use of XML Schemas [16]. They comprise descriptions of text (hierarchically organized), object-relational database tables, queries against object-relational data, queries against XML data (expressed in XQuery [15]), video sequences, images, XML documents and HTML documents. As an example, consider the following meta-description of a query against XML data, expressed in XQuery:

<xquery alias="authorsquery">
  <statement>
    <![CDATA[
      <authors>
        {for $b in document("...")/bib/book
         where $b/subject = #key
         return
           <author>
             {$b/author/name}
             {$b/author/age}
           </author>
        }
      </authors>
    ]]>
  </statement>
  <result-structure>
    <value path-expression="/authors/author/name" alias="name"/>
    <value path-expression="/authors/author/age" alias="age"/>
    ...
  </result-structure>
</xquery>

In such a meta-description the key elements are the statement (possibly with some parameters, #key in the example) and a description of the resulting XML document. Specific parts of the documents extracted by means of the above-shown query could be described as follows:

<xdoc alias="book-author">
  <instance alias="db" description="database authors" location-type="xquery"
            location="authorsquery(#key=databases).name" schema="..."/>
  <instance alias="comp" description="compression author" location-type="xquery"
            location="authorsquery(#key=compression).name" schema="..."/>
</xdoc>

We emphasize that the meta-description of an XML document is abstracted with respect to its actual source, referred to by means of location attributes; furthermore, many instances of the same document can be differentiated within the same meta-description (in the example above, authors of books having different subjects). Complex meta-descriptions allow direct referencing of single information fragments by means of aliases, dot-notation and parameters. For example, book-author.db in the previous example refers to the db instance of the book-author fragment, where db is in turn an alias for the query authorsquery(#key=databases).name.

2.3 The Description Layer and the Presentation Descriptions

In the description layer the application domain is modelled as a directed graph, where nodes are the presentation descriptions and arcs represent links. Furthermore, we apply the object-oriented paradigm to capture the semantic relationships among the PDs: we define the PDs as objects which are instances of predefined classes. Presentation descriptions are composed of four sections:

The OOStructureInfo section includes information concerning the object-oriented organization of the domain. The interface of each class of PDs is composed of the set of ingoing and outgoing links. Furthermore, a type is associated to each link to express its semantics and to define the compatibility among outgoing and ingoing links of different classes (e.g. a PD A can be linked to another PD B if PD A has an outgoing link whose type is the same as the type of an ingoing link of PD B). With regard to inheritance, a subclass inherits the information fragments and the links of the parent classes. The OOStructureInfo section is edited by means of a tool based on the Unified Modeling Language [17] (specifically on the class diagram), which allows the author to design the overall PD hierarchy of the application domain. Inside each PD, the tool automatically fills in the OOStructureInfo section, writing the O-O relationships among the PDs.

The AdapDimensionsInfo section contains information about the instantiation of PDs with respect to the three adaptivity dimensions; this information describes how to extract fragments on the basis of the user position in the adaptation space, and which XSL stylesheet to apply in order to transform the PDs into the final pages to be delivered to the client. The AdapDimensionsInfo section is generated by means of a simple XML editor, on the basis of the author's domain knowledge.

The ContentLayout section contains references to the information fragments to be used to compose the PDs.

The AuxInfo section contains auxiliary information (e.g. data islands), which is not to be processed by the system itself, but can be used by the client (for example, it can contain embedded code for client applications) or by lower network layers.

In the following, we show the XML Schema structure of the OOStructureInfo section. It includes the name of the PD class (entityname element), a set of parent classes (superentityname

elements) and the set of links that define the PD interface (link elements, each with an associated name and type).

<xs:element name="entityname" type="xs:string"/>
<xs:element name="superentityname" type="xs:string"/>
<xs:element name="linkname" type="xs:string"/>
<xs:element name="linktype" type="xs:string"/>
<xs:attribute name="direction" use="required">
  <xs:simpleType>
    <xs:restriction base="xs:string">
      <xs:enumeration value="in"/>
      <xs:enumeration value="out"/>
    </xs:restriction>
  </xs:simpleType>
</xs:attribute>
<xs:element name="link">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="linkname"/>
      <xs:element ref="linktype"/>
    </xs:sequence>
    <xs:attribute ref="direction"/>
  </xs:complexType>
</xs:element>
<xs:element name="OOStructureInfo">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="entityname"/>
      <xs:element ref="superentityname" minOccurs="0" maxOccurs="unbounded"/>
      <xs:element ref="link" minOccurs="1" maxOccurs="unbounded"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>

2.4 The Logical Layer

The aim of the logical layer is to model the domain adaptivity with respect to stereotype user profiles. The logical model is a set of Profile Views (PV), where each PV is a view of the overall hyperspace domain associated to a user profile. The PV associated to the profile p includes all the PDs accessible by the users belonging to profile p; therefore it represents the hypermedia domain and the navigational space for that profile. Following the two-layer definition of the application domain model, the model design is composed of two design phases, operating at two different abstraction levels. The first phase is related to the definition of the navigational directed graph: the author designs the presentation descriptions and their relations as hyperlinks. In the second phase the author identifies the user profiles and, on the basis of his/her domain knowledge, designs the PVs associated to each profile. The PV design can be carried out incrementally: for each PD, the author determines all the user profiles that can access it. At the end of the process, each PV is composed of all the PDs that have been associated to the corresponding user profile. Each PV (and consequently each associated profile) corresponds to a set of topics that represents the knowledge contained in the PV (knowledge description). More precisely, this correspondence can be formally defined through a function κ(·), which can be applied to a single PD or to a user profile, and returns the corresponding set of topics:

κ(·), applied to the presentation description PD_i, returns the set of topics captured by PD_i;

κ(·), applied to the user profile p, returns the knowledge domain of p: κ(p) = ∪_i κ(PD_i), with PD_i ∈ PV_p, where PV_p is the profile view associated to p.

The instantiation of a particular PD with respect to a given profile B and a given position along the external environment dimension E can be seen as the application of a function δ that transforms a given PD into a technology-independent XML page, called PD′, which will be given as input to the multichannel layer (Sec. 2.5): δ : (PD, B, E) → PD′. Furthermore, we improve the logical description of the hypermedia structure by means of a Semantic Precedence Operator, ≺, which is used to define constraints about profile changes. If applied to two knowledge domains, e.g. κ(p_2) ≺ κ(p_1), the operator points out that a user cannot access topics related to the profile p_2 until he/she has accessed some topics included in the knowledge domain of p_1 (i.e. until he/she accesses the PDs that capture those topics). This constraint can be better specified by means of a semantic precedence matrix. This matrix has as many rows as the topics of κ(p_2) and as many columns as the topics of κ(p_1). Each element (i, j) of the matrix can assume a boolean value; true means that the user must visit a PD containing the topic j of the domain κ(p_1) before his/her profile can change from p_1 to p_2 and the topic i of the domain κ(p_2) can be accessed. So each row specifies which topics of κ(p_1) a user belonging to the profile p_1 must know to change his/her profile to p_2 by entering a particular topic of κ(p_2). For example, a semantic precedence matrix with all values equal to true means that, whatever the entry point to the profile p_2 is, the user must have visited all the topics of κ(p_1). An entry point of a profile p_2 is defined as a node (PD) that, when accessed through a hyperlink, allows the user to change his/her profile to p_2. The semantic precedence operator can be used in general logic rules; e.g. a rule combining κ(p_1) ≺ κ(p_2) and κ(p_3) ≺ κ(p_4) expresses a more complex relationship among the profiles involved.

2.5 Adaptation Model for the Technological Dimension

The technology dimension drives the adaptation of the page layout to the client device (PC, handheld computer, WAP device, etc.) and the page generation method. The technology adaptation is performed by means of a Multichannel Layer. In the following we describe (i) the alternative page generation methods, (ii) the technological variables, and (iii) the multichannel layer.

Page generation methods. The final pages displayed on the client device (written in HTML, WML, etc.) are dynamically generated by performing a transformation of the PD′ using an XSL document/program. Several XSL stylesheets can be used to transform the PD′s, depending on the client device features. Moreover, three different page generation methods have been designed:

a. The page generation takes place entirely on the server. The server picks out the Information Fragments, applies the transformation using the appropriate XSL document, and then sends the page (HTML, WML, etc.) to the client. The main drawback of this method is that the client cannot access the XML content. As an example, if the client is an application, e.g. a workflow or a distributed computing application, it could need to access and process the XML data.

b. Similar to method (a), but the server sends to the client HTML (WML) pages that contain XML data islands.
These data are not processed by the server XSL processor and are not displayed on the client device, but can be accessed by client programs.

c. The page generation is performed entirely on the client: the server sends to the client both the XML document and the XSL document that the client device will use to carry out the transformation.
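For concreteness, a minimal sketch of methods (a) and (c) in Python, using the lxml library purely as an assumed implementation detail (the paper does not prescribe any library, and the file names are hypothetical):

# Sketch of two of the page generation methods described above.
# Method (a): the server applies the XSL stylesheet and ships the final page.
# Method (c): the server ships the technology-independent XML plus the XSL
#             stylesheet, and the client performs the transformation.
from lxml import etree

def generate_server_side(pd_xml_path, stylesheet_path):
    transform = etree.XSLT(etree.parse(stylesheet_path))
    final_page = transform(etree.parse(pd_xml_path))
    return str(final_page)                  # e.g. HTML/WML sent to the client

def generate_client_side(pd_xml_path, stylesheet_path):
    # no server-side formatting: both documents are returned as-is
    return open(pd_xml_path, "rb").read(), open(stylesheet_path, "rb").read()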

The technological variables. The technological variables are used by the multichannel layer to adapt the presentation and the generation process to the client device. In our system we use five groups of technological variables:

1. Variables related to the XML and XSL support on the client device;
2. Variables addressing the client device processing power (client-side XSL formatting may take place only if the device can manage such a complex and time-consuming operation);
3. Variables related to the client device display features (resolution, dimensions, etc.);
4. Variables concerning the kind of client data usage (e.g., it is useful to know whether or not clients need to access and process the pure XML data);
5. Variables related to the server processing workload.

The variables belonging to the first three groups are extracted from the device knowledge base (shown in the next section), while the data usage variables are associated to the client (e.g. they can be associated to the user profile), and the group 5 variables are determined by the server.

The Multichannel Layer. The main components of the multichannel layer are the Device Knowledge Base (DKB) and the Presentation Rules Executor (PRE). The device knowledge base is a repository composed of a set of entries, each describing the features of a specific client device. For each client request, the client device is determined according to the data contained in the User-Agent field of the request, and the corresponding device entry is selected. Each entry is composed of variable-value pairs, where the variables correspond to the technological variables belonging to the first three groups. The presentation rules executor determines the presentation layout and the page generation method based on a set of rules that check the values assumed by the technological variables. Rules are defined in an ad hoc XML syntax, and are modelled according to the well-known event-condition-action (ECA) paradigm. The following XML fragment shows a typical presentation rule.

<rule id="1">
  <conditions>
    <technological-variable group="1" xml-support="yes"/>
    <technological-variable group="2" processing-power="high"/>
    <technological-variable group="3" display-area="small" res-value="medium"/>
    <technological-variable group="4" xmldata-need="no"/>
  </conditions>
  <action>
    <xsl-formatting value="clientside" method="c" stylesheet="adhocstylesheet.xsl"/>
  </action>
</rule>

The rule states that if the client does not need to process the XML data content (xmldata-need, belonging to group 4), the client device fully supports XSL transformation (group 1 and 2 variables), and the display features are appropriate for the data to visualize (group 3 variables), then the PRE chooses the page generation method c (client-side XSL formatting) and the XSL stylesheet AdHocStylesheet.xsl. Note that the above rule does not consider the server-side workload. Another presentation rule may state that XSL formatting has to take place on the server side (even if the client device supports XML and XSL) if the client does not need to access XML data, unless the server workload exceeds a specified threshold. Events of the ECA paradigm are implicitly managed by the adaptive system and correspond to the user choice of a given PD. Conditions correspond to checks on the technological variable

values. Actions are performed when a logical expression (a presentation rule), composed of atomic conditions, is evaluated. The above XML syntax can be used to compose complex presentation rules, based on all the possible combinations of technological variable values. Actions mainly consist of the choice of the page generation method and the choice of the most suitable XSL stylesheet that will drive the XSL formatting.

Fig. 1. The multichannel layer.

Figure 1 shows the multichannel layer architecture. This is a logical layer that uses components belonging to the three architectural tiers (see Section 3). The Request/Response Manager processes client requests (coming either from a wired or from a wireless device) and, after the generation of the PD′s, passes both to the PRE. The PRE extracts the user-agent data from the requests and accesses the device knowledge base to evaluate the technological variables. Then, the PRE executes the presentation rules contained in the presentation rules repository, according to the ECA paradigm. The PRE chooses the page generation method and the XSL stylesheet (picking it out from the Stylesheet Repository). In the case of client-side formatting (page generation method c), the PRE passes both the XML data (the PD′) and the selected stylesheet to the Request/Response Manager, which in turn sends them to the client. In the case of server-side formatting (page generation methods a or b), the PRE activates the Data Formatting Module, which converts the XML data according to the XSL directives, and then passes the formatted data to the Request/Response Manager, which in turn sends it to the client.

3 An Architecture for the Support of the Proposed Model

We have designed and are currently implementing an architecture for the support of the proposed model, comprising an authoring suite and a run-time system. The main tasks of the authoring suite (named Java Adaptive Hypermedia Suite, JAHS) are the following: (i) composition of the UML class diagrams; (ii) composition of the semantic precedence logic rules; (iii) transformation of the class diagrams and semantic precedence rules into an XML formalism; (iv) browsing of the data sources and composition of the XML meta-descriptions; (v) construction of the presentation descriptions; (vi) specification of the event-condition-action rules for the instantiation of the PDs with respect to the technological dimension. For space reasons, we do not further detail the authoring suite; the interested reader can refer to [6].
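Going back to the multichannel layer, the rule evaluation performed by the PRE can be pictured as a simple matching of technological variables against rule conditions. The sketch below uses hypothetical in-memory rule and variable structures, not the actual XML rule syntax shown in Section 2.5.

# Simplified sketch of the Presentation Rules Executor (PRE): match the
# technological variables collected for a request against rule conditions
# and return the chosen generation method and stylesheet (hypothetical data).
RULES = [
    {"conditions": {"xml-support": "yes", "processing-power": "high",
                    "xmldata-need": "no"},
     "action": {"method": "c", "stylesheet": "AdHocStylesheet.xsl"}},
    {"conditions": {"xml-support": "no"},
     "action": {"method": "a", "stylesheet": "ServerSide.xsl"}},
]

def choose_action(tech_vars, rules=RULES):
    for rule in rules:
        if all(tech_vars.get(k) == v for k, v in rule["conditions"].items()):
            return rule["action"]
    return {"method": "a", "stylesheet": "Default.xsl"}   # fallback: server-side

request_vars = {"xml-support": "yes", "processing-power": "high",
                "xmldata-need": "no", "display-area": "small"}
print(choose_action(request_vars))   # -> method "c" with AdHocStylesheet.xsl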

The run-time system has a three-tier architecture (Fig. 2), comprising the User, the Application and the Data tiers. The user tier receives the final pages to be presented and possibly scripts or applets to be executed (e.g. for detecting local time, location, available bandwidth). The user's terminal and the terminal software (operating system, browser, etc.) are communicated by the terminal User Agent (e.g. the browser).

Fig. 2. The run-time system architecture.

The User Modeling Component (UMC) maintains the most recent actions of the user and evaluates them, giving as a result the user's profile. In this paper we do not address this issue, delegating it to an external module; in our previous work we have proposed an approach in which hypermedia links are mapped into graph edges weighted with the probability of following them, and a probabilistic algorithm is used to estimate the user's profile ([7]). The Adaptive Hypermedia Application Server (AHAS) runs together with a Web Server. It executes the following steps: (i) it communicates to the UMC the most recent choices of the user; (ii) it extracts from the XML repository the PD to be instantiated; (iii) it extracts the basic data fragments from the data sources on the basis of the user position; (iv) it interacts with the modules of the multichannel layer in order to generate the final page. The data tier comprises the Data Sources Level, the Repository Level and a Data Access Module. The data sources level is an abstraction of the different kinds of data sources; each data source S_i is also accessed by a Wrapper software component, which generates the XML metadata describing the data fragments stored in S_i. The Repository Level is a common repository for data provided by the Data Source Level or produced by the author. It stores (i) XML documents including the Presentation Descriptions, metadata, Schemas, XSL stylesheets, and the active rules shown in Sections 2 and 2.5; (ii) the UML class diagrams representing the logical structure of the hypermedia. Finally, the data access module implements an abstract interface for accessing the data sources and the repository levels.

4 Concluding Remarks

In this paper we presented a model for adaptive hypermedia systems using the object-oriented paradigm and XML. An adaptive hypermedia is modeled considering a three-dimensional adaptation space, including the user's behavior, technology, and external environment dimensions. The adaptation process is performed by evaluating the proper position of the user in the adaptation space, and transforming neutral XML pages according to that position. We believe that the main contributions of this paper are: a new data-centric model to describe adaptive hypermedia, specifically concerned with a flexible and effective support of the adaptation process; the integration of a graph-based description of navigational properties with an object-oriented semantic description of the hypermedia, and the use of a logical formalism to model knowledge-related aspects; a flexible and modular architecture for the run-time support of adaptive hypermedia systems, with particular regard to the adaptation with respect to technological aspects.

References

1. Adaptive Hypertext and Hypermedia Home Page.
2. M. Bordegoni, G. Faconti, S. Feiner, M.T. Maybury, T. Rist, S. Ruggieri, P. Trahanias, and M. Wilson, A Standard Reference Model for Intelligent Multimedia Presentation Systems, in Computer Standards and Interfaces 18.
3. P. Brusilovsky, Methods and techniques of adaptive hypermedia, in User Modeling and User Adapted Interaction, v. 6, n. 2-3.
4. P. Brusilovsky, A. Kobsa, J. Vassileva (eds.), Adaptive Hypertext and Hypermedia, Kluwer Academic Publishers.
5. P. Brusilovsky, O. Stock, C. Strapparava (eds.), Adaptive Hypermedia and Adaptive Web-Based Systems, Proceedings of the International Conference AH 2000, Trento, Italy.
6. M. Cannataro, A. Cuzzocrea, A. Pugliese, A multidimensional approach for modelling and supporting adaptive hypermedia systems, Proceedings of the International Conference on Electronic Commerce and Web Technologies (Ec-Web 2001), Munich, Germany, LNCS 2115, Springer Verlag.
7. M. Cannataro, A. Cuzzocrea, A. Pugliese, A probabilistic approach to model adaptive hypermedia systems, International Workshop on Web Dynamics, in conjunction with the International Conference on Database Theory.
8. S. Ceri, P. Fraternali, A. Bongio, Web Modelling Language (WebML): a modelling language for designing web sites, WWW9 Conference.
9. P. De Bra, G.J. Houben, H. Wu, AHAM - A Dexter-based reference model for adaptive hypermedia, in Proceedings of the ACM Conference on Hypertext and Hypermedia.
10. F. Garzotto, D. Paolini, D. Schwabe, HDM - A model-based approach to hypermedia application design, ACM Transactions on Information Systems.
11. F. Halasz, M. Schwartz, The Dexter hypertext reference model, CACM v. 37, n. 2.
12. M.F. Fernandez, D. Florescu, A.Y. Levy, D. Suciu, Catching the boat with Strudel: experiences with a web-site management system, in Proceedings of SIGMOD 98.
13. G. Mecca, P. Atzeni, A. Masci, P. Merialdo, G. Sindoni, The Araneus Web-Based Management System, in Exhibits Program of ACM SIGMOD 98.
14. D. Schwabe, G. Rossi, An object-oriented approach to Web-based applications design, in Theory and Practice of Object Systems.
15. World Wide Web Consortium, The XML Query language.
16. World Wide Web Consortium, The XML Schema definition language.
17. The Unified Modeling Language.

A Logico-Categorical Semantics of XML/DOM

Carlos Henrique Cabral Duarte
Universidade Estácio de Sá, Rua do Bispo 83, Rio Comprido, Rio de Janeiro, RJ, Brazil
BNDES, Av. República do Chile 100, Centro, Rio de Janeiro, RJ, Brazil

Abstract. The efforts of the World Wide Web Consortium in defining and recommending the adoption of an extensible markup language, XML, and a document object model, DOM, have been received with enthusiasm by the software development community. These recommendations have been continuously adopted as a practical way to define and realise two-party interaction. Here we describe an attempt at providing a logico-categorical semantics for XML and DOM, which seems to be useful in the rigorous development of open distributed systems.

Keywords: Extensible Markup Language, Document Object Model, Formal Methods, Distributed Systems, Software Engineering.

1 Introduction

The efforts of the World Wide Web Consortium (W3C) in defining and recommending the adoption of an Extensible Markup Language (XML) [8] and a Document Object Model (DOM) [6] have been received with enthusiasm by the software development community. XML is a subset of the Standard Generalised Markup Language (SGML) [14] meant to be used on the World Wide Web (WWW) [4]. Like the Hypertext Markup Language (HTML) [3], XML was designed not only to ease the development of software tools but also to interoperate with other WWW standards. DOM, on the other hand, is a public application programming interface (API) for manipulating HTML and XML documents. This language and model are now becoming de facto standards in the design and implementation of open distributed systems and frameworks. XML documents describe semi-structured data objects possibly associated to some processing instructions. As textual specifications written in a markup language, there is a standard way of writing and reading each document. That is, there is a standard grammar and interpretation for XML documents. In addition, each software client reading an XML document is expected to present the same data to any application, regardless of their final presentation. This means that there is no standard presentation style for XML documents. DOM defines a public interface for programmatically accessing and manipulating XML documents and their parts. The model permits document parts to be created, modified and erased, while allowing applications to navigate from one document element to another, following its current structure. DOM is language independent and implementation neutral, but programming language bindings have been defined so as to enable the use of this model in developing real systems. Both XML and DOM have actually been defined in a rigorous but rather informal manner. The applications of this language in developing distributed systems based on point-to-point communication, wherein the understanding of messages and other exchanged structured documents must be precisely the same at both ends of communication, allied to the necessity of defining language bindings to make effective use of this model, suggest that it would be interesting to count upon a formal semantics as

their alternative definition. The existence of a logical semantics, say, would permit the rigorous verification of DOM-based distributed system properties, as well as facilitate the implementation of automated testing tools. XML and DOM congregate a particular set of characteristics which naturally leads to the development of a logical semantics that is also categorical, in the sense of applying Category Theory in Computer Science as advocated by Joseph Goguen [13]. More specifically, the first of these characteristics is that many XML documents represent the same data object but are in fact different by definition. Since what really matters in their manipulation, due to the independence of presentation, is the object of their representation, this indicates the existence of an equivalence notion. Secondly, documents have structure and are inherently related by containment or inclusion. Their relationships are clearly functional, compositional, admit identity and can be used as a basis for defining equivalence up to abstracting document representation. That is, we have all the ingredients for defining a category of XML document representations. In fact, what one can read out of these observations is that the emphasis in a definition of XML can be placed on capturing relations between objects (documents and their structure) rather than on simply capturing the objects themselves. This is precisely what the application of Category Theory is about. Mutatis mutandis, the same rationale above is valid concerning DOM, programming interfaces and their respective implementations.

Contribution. In this paper, we propose a logico-categorical semantics addressing both XML and DOM. This semantics, which can be regarded as our original contribution here, consists in a particular application of a first-order many-sorted branching-time logical system developed as part of our previous work [10]. This seems to be a relevant contribution since it allows us to specify and reason about the specific class of open distributed systems underlying the web, contributing to a better understanding of their static and dynamic aspects. We are not aware of other formalisms treating both W3C recommendations simultaneously.

Related work. A substantial number of formal models have been developed to clarify the application of XML in particular contexts. These can be roughly classified into mathematical, logical and categorical models. The main purpose of the algebras developed by Frasincar, Houben and Pau [12], among many others, is the mathematical study of XML query formulation and optimisation. A logical notion of satisfiability is developed by Arenas, Fan and Libkin in [2] to verify the consistency of XML documents with respect to type definitions with constraints. A type-theoretic approach is proposed by Brown, Fuchs, Robie and Wadler [5] to check whether or not documents conform with type definitions endowed with keys. Alagic and Bernstein [1] develop a categorical method for schema integration that applies to XML document type definitions. The complexity of the aforementioned models seems to increase with the number of potential applications.

Organisation. Sections 2 and 3 provide brief descriptions of XML and DOM respectively; Section 4 outlines the proposed logico-categorical semantics; Section 5 presents some conclusions and prospects for future research.

2 Brief Description of XML

Physically, each XML document is a textual specification composed of syntactic units called entities.
Entities may contain unparsed and parsed data, the latter being formed by markups or other specific symbols. The logical structure of each document is defined by: declarations, comments, elements with their attributes, character references to the ISO/IEC 10646 character set and processing instructions (PIs) to potential applications. These logical structures are represented using markup tags that are delimited by < and > but which may also internally use other punctuation marks. The

document entity contains the whole textual specification, which necessarily includes a root element. Elements define data objects, which are identified by names (tokens beginning with a letter or some punctuation marks). Names are used as part of tags to delimit the respective data object contents. Elements may have attributes, which can only hold plain values. In turn, element contents may also have a forest-like organisation, meaning that they may also contain lists of disjoint nested elements. An example document appears in Fig. 1.

<?xml version="1.0"?>
<!-- card.xml -->
<CARD>
  <HOLDER>CARLOS H C DUARTE</HOLDER>
  <NUMBER> </NUMBER>
  <BRAND>Supercard</BRAND>
  <EXPIRYDATE>31/02/2003</EXPIRYDATE>
</CARD>
Fig. 1: A credit card XML document.

<?xml version="1.0"?>
<!-- book.xml -->
<!DOCTYPE BOOK SYSTEM "book.dtd">
<BOOK LABEL="BROOKS1995">
  <TITLE>The Mythical Man-Month: Essays in Software Engineering</TITLE>
  <AUTHOR>Frederick P. Brooks Jr.</AUTHOR>
  <EDITION>2nd</EDITION>
  <EDITOR/>
  <PUBLISHER>Addison-Wesley Pub Co.</PUBLISHER>
  <YEAR>1995</YEAR>
</BOOK>
Fig. 2: A book XML document.

The CARD element in Fig. 1 is structured, having HOLDER, NUMBER, BRAND and EXPIRYDATE as members. The first two lines of that document contain a declaration and a comment, respectively to allow proper automated processing and to facilitate human comprehension. Not all XML documents have such a simple structure. Differently from CARD, the XML document in Fig. 2 is more complex, defining a BOOK with an empty EDITOR element, without a specified value. In addition, the BOOK element is labelled by the value of an attribute, LABEL. In general, attributes are used to hold meta-data, as is the case of LABEL. The structure and content of classes of documents may be specified using type definitions. A document type definition (DTD) is named and may have a specification separated from the documents of that type. Such definitions can be referenced in client documents using the DOCTYPE markup. A DTD consists of a list of declarations of notations, fixed entities, entity types or attribute lists. Element contents and attribute values can be defined as blocks of characters or using a variant of regular expression syntax. Element members and their order can be determined. DTDs may be parameterised, with parameters marked by a preceding #, and may have conditional sections, which are considered or not as part of the definition according to the fulfilment of defining conditions. An example DTD, satisfied by a class of book documents including that in Fig. 2, is presented in Fig. 3.

<?xml version="1.0"?>
<!-- book.dtd -->
<!ELEMENT BOOK (TITLE,AUTHOR,EDITION,EDITOR,PUBLISHER,YEAR)>
<!ELEMENT TITLE (#PCDATA)>
<!ELEMENT AUTHOR ANY>
<!ELEMENT EDITION (#PCDATA)>
<!ELEMENT EDITOR ANY>
<!ELEMENT PUBLISHER ANY>
<!ELEMENT YEAR (#PCDATA)>
<!ATTLIST BOOK LABEL CDATA #REQUIRED>
Fig. 3: A DTD defining valid book documents.

Using the ENTITY markup, definitions of entities can be specified, whose names may be placed between the delimiters & and ; to imply a content expansion when the document is processed. These may rely on external definitions, which are specified through the SYSTEM markup followed by a literal (quoted string)

pointing to the actual location of the defining document. An example of a composed document appears in Fig. 4.

<?xml version="1.0"?>
<!-- order.xml -->
<!ENTITY CHCDCARD SYSTEM "card.xml">
<!ENTITY MYTHBOOK SYSTEM "book.xml">
<ORDER>
  &CHCDCARD;
  &MYTHBOOK;
  <PRICE>39.00</PRICE>
  <CURRENCY>USD</CURRENCY>
</ORDER>
Fig. 4: A book order XML document.

An XML document is said to be well-formed only if it satisfies the following conditions:
1. it complies with the XML grammar specified in [8];
2. it satisfies some well-formedness constraints, such as having matching start and end tags in the definition of each element;
3. all referenced entities are well-formed.

As a consequence of a well-formedness constraint, documents cannot be directly or indirectly recursive. To any violation of well-formedness in reading an XML document corresponds a fatal error, which is not recoverable. The document in Fig. 1, for instance, is well-formed. A document is said to be valid if it is well-formed and complies with a given DTD. To each violation of validity in reading an XML document corresponds a recoverable error. For example, the document in Fig. 2 is valid. Although XML was designed to give rise to documents legible by humans, the language definition is based on client document processors and corresponding computer applications. The error handling treatment mentioned above is to be followed by any client and application. When these are implemented as part of a WWW browser, the W3C suggestions concerning the definition of presentation style separated from each document should be taken into account. The W3C has even proposed a stylesheet language (XSL) [9] and style transformations (XSLT) [7] to address this issue, but the study of the respective recommendations is out of the scope of our current work.

3 Brief Description of DOM

DOM defines a set of specifications for representing documents and their composition. Each document is represented as a tree of passive objects, not just as a data structure, so that it can present some observable behaviour. Documents and their components are addressed in an object-based way: they may have attributes representing state and methods that give rise to their behaviour. The respective hierarchy of class interfaces can be regarded as if organised by an inheritance relation, although this view is not mandatory. Note that, viewed in this way, DOM defines an abstract class structure that has to be mapped into a real programming language and refined with the implementation of each interface in order to support real applications. It is in this sense that DOM is considered just an API. The current DOM definition does not specify how entire documents are created [6]. Assuming that a certain document object exists, the creation of its components follows the so-called factory pattern, which is specified as a method in the scope of the document object specification. For instance, to create a CARD element within the scope of an ORDER document, it would be necessary to call the createElement method of ORDER providing CARD as an argument. Each component of a document, and documents themselves, are regarded as complying with a primary specification called Node. It defines querying methods, such as nodeType, a type being that of documents, entities and other constructs; nodeName, which depends on the node type; and nodeValue. The specification also defines the methods nodeParent, childrenNodes and ownerDocument, all with intuitive meaning;

firstChild and lastChild, for recovering the first and the last elements in the list of nodes returned by childrenNodes; previousSibling and nextSibling, for navigating in the structure the node may be connected to; hasChildNodes, to verify if the node is structured; and attributes, returning any node attributes. A set of updating methods with intuitive behaviour is also defined: insertBefore, replaceChild, removeChild, appendChild, and cloneNode. Apart from the features inherited from Node, the Document specification defines doctype, which points to a possibly existing DTD; element, pointing to the root element of the document; and implementation, which is explained below. In addition to those inherited from Node, the specification also defines factory methods for creating the various different types of constructs listed in the previous section, such as elements and declarations. Although the main purpose of XML is the definition of entire documents, in their manipulation it is often found convenient to deal with document parts, which would not strictly satisfy the main production rule of the XML grammar. DOM specifies DocumentFragment as a subtype of Node endowed with copy/modify/paste methods, which allow their use in the role of clipboards. Other auxiliary object types with an intuitive semantics, such as NodeList and NamedNodeMap, are also defined by DOM. The W3C recommendation concerning DOM is divided into three parts addressing the DOM core, XML and HTML. An interface for querying a specific implementation concerning the supported version of either of these languages is supplied as part of the model, DOMImplementation, with the method hasFeature. The distinguished support provided by each kind of implementation derives from the interpretation of tags, which is fixed in HTML and variable in XML. The model also specifies an interface for error handling, DOMException, with a table of exception codes. These two implementation-related issues are out of the scope of the present study. Other standard document components are captured in the DOM core through the specifications Element, Attr, Text and Comment. The XML specifics are represented as part of the interfaces Notation, CDATASection, DocumentType, Entity, EntityReference and ProcessingInstruction. These directly inherit the features of Node, the exceptions being Text, Comment and CDATASection, for which this relation is indirect due to the existence of CharacterData, a virtual interface introduced to support the HTML specifics as well. A diagrammatic representation of the abstract class structure implied by DOM appears in Fig. 5.

4 Semantics of XML/DOM

4.1 Core Semantics of XML

We develop our work based on the insight that documents which are not well-formed do not have a failure-free semantics. To propose a failure semantics that could capture the subtleties of ill-formed documents would be a very complex task, which it is not our purpose to develop here. Therefore, we only deal with well-formed documents in this paper. We also take advantage of the fact that valid documents are well-formed and treat all of them without distinction. That a document is valid, complying with a certain DTD, is reflected here just in the additional obligation to ensure that all type definition restrictions are satisfied by the document interpretation. A careful inspection of the XML notions shows that semantically not all of them have the same status. For instance, PIs do not affect how a document is understood, but just how it is to be dealt with.
Their treatment clearly falls into the domain of pragmatics. The definitions of entities, notations and comments serve just as syntactic sugar to ease human comprehension or automated processing. Consequently, these are ignored in the sequel. The semantically rich part of XML remaining to be treated here is that of elements with their types and contents, attributes with their plain values, and entire documents. These definitions a priori do not have associated behaviour and can be represented through classical first-order theory presentations.

[Fig. 5: Abstract class structure defined by DOM. The diagram relates Node, NodeList, NamedNodeMap, DOMException, DOMImplementation, Document, DocumentFragment, Element, Attr, CharacterData (with Text, Comment and CDATASection), Notation, Entity, EntityReference, DocumentType and ProcessingInstruction.]

We provide in Fig. 6 two theory presentations describing elements and values. These rather simple presentations become important in the interpretation of other XML constructs. The presentation Element specifies a sort symbol to stand for the universe of elements, elem, with ⊥ denoting the bottom element. To each element may be associated a type through a specific function, type. The constants of sort type serve as names for classes of elements, and additional restrictions can be placed on these elements depending on whether or not there is a type definition constraining that particular class. Value specifies not only a sort symbol representing the plain values of attributes and some elements, val, but also a function nil denoting a distinguished bottom value. It is in the interpretation of structured data objects that most of the complexity of XML lies. In our logical view of XML inspired by DOM, only elements and values are called nodes. The defining presentations of these two notions are imported in the interpretation of nodes so that the respective symbols can be used in our axioms. Nodes have a forest-like ordered structure. Due to this fact, we also import a theory presentation defining natural numbers (which is omitted here), Nat, and specify two functions, void and member, denoting respectively a bottom node and an ordered membership function. For each node x and natural number n, member(x, n) returns the n-th member of x, or void if it does not exist.

Each node may either hold an element and possibly define finitely many member nodes through member, or hold a plain value; this is the reason for including in Node the functions node_elem and node_val. The resulting presentation appears in Fig. 7. Nodes may be atomic or have some structure. Axioms (3.1)-(3.3) define the relationships between bottom nodes, elements and values. Non-bottom nodes must be associated either to a value or to an element (3.5) and structured nodes necessarily stand for elements (3.6). In order to capture the structural properties of nodes, we introduce in Node an auxiliary symbol heir and the corresponding defining axioms. For a pair of nodes x and y, heir(x, y) = 1 iff y is a descendant of x in its tree-like structure. Axiom (3.10) says that the members of a node define its direct descendants. Axiom (3.11) specifies that the descendants of a node are also defined by the descendants of its members. These two axioms provide an inductive definition for heir. As constraints, we have (3.9) saying that nodes cannot be directly recursive; (3.12) stating the no-node-sharing property of trees; and (3.13) requiring that only descendants allowed by our inductive definition be considered as such.

Attributes may be associated to elements in a many-to-one relationship. The presentation Attribute in Fig. 8 defines an attr_elem function capturing this relationship. The function attr_val returns a value for each attribute. Concluding our definitions, we formalise documents in Fig. 8. We consider that they are endowed with a function to return each document root node. According to (5.1), documents must define a tree-like structure, even if consisting only in the void node.

Presentation Element (Elem)
  sorts elem, type
  functions ⊥ : → elem
            type : elem → type

Presentation Value (Val)
  sorts val
  functions nil : → val

Fig. 6: Semantics of Elements and Values.

Presentation Node
  imports Nat, Elem, Val
  sorts node
  functions void : → node
            node_val : node → val
            node_elem : node → elem
            member : node × nat → node
            heir : node × node → nat
  axioms  x, y, z : node; e : elem; m, n : nat
    node_val(void) = nil                                              (3.1)
    node_elem(void) = ⊥                                               (3.2)
    member(void, 0) = void                                            (3.3)
    ∃n member(x, n) = void                                            (3.4)
    x ≠ void → (node_elem(x) = ⊥ ↔ node_val(x) ≠ nil)                 (3.5)
    member(x, 1) ≠ void → node_elem(x) ≠ ⊥                            (3.6)
    member(x, m) = void → (∀n m < n → member(x, n) = void)            (3.7)
    heir(x, y) = 0 ∨ heir(x, y) = 1                                   (3.8)
    heir(x, x) = 0                                                    (3.9)
    (∃n member(x, n) = y ∧ y ≠ void) → heir(x, y) = 1                 (3.10)
    (∃n member(x, n) = y ∧ y ≠ void) →
        (∀z heir(y, z) = 1 → heir(x, z) = 1)                          (3.11)
    heir(y, x) = 1 ∧ heir(z, x) = 1 →
        y = z ∨ heir(y, z) = 1 ∨ heir(z, y) = 1                       (3.12)
    heir(x, y) = 1 → (∃n member(x, n) = y) ∨
        (∃z heir(x, z) = 1 ∧ heir(z, y) = 1)                          (3.13)

Fig. 7: Semantics of Nodes.

The semantics of XML can be formalised as an amalgamation of all the previous theory presentations. This construction can also be explained in categorical terms, considering the imports statement in each presentation as a definition of a family of identity morphisms including the named objects into their enclosing presentations. The resulting structure can be better visualised in a diagrammatic manner through the objects and arrows in Fig. 9.

4.2 Semantics of XML Documents

Our core XML semantics can be regarded as a meta-logical construction when it is used as a framework for capturing the semantics of particular documents.

Presentation Attribute (Attr)
  imports Elem, Val
  sorts attr
  functions attr_elem : attr → elem
            attr_val : attr → val

Presentation Document (Doc)
  imports Node, Attr
  sorts doc
  functions doc_node : doc → node
  axioms  d : doc
    member(doc_node(d), 1) = void                                     (5.1)

Fig. 8: Semantics of Attributes/Documents.

In order to illustrate this application, we return below to the book order example proposed in Section 2. We assign a distinct theory presentation to each document, which imports the whole XML semantics. Values mentioned in the document give rise to constants of sort val. Each type tag is represented as a constant of sort type. The elements, attributes, and their hierarchical organisation are captured by setting the functions doc_node, member, node_elem, node_val, attr_elem, attr_val and type accordingly. The result is not a compact presentation, but it is an exact representation. To illustrate this interpretation, we present in Fig. 10 the BOOK document of Section 2 annotated with labels identifying the kind of each construct therein. We assume the existence of constants x_i of sort node, 1 ≤ i ≤ 12; v_j of sort val, 1 ≤ j ≤ 6; t_k of sort type, 1 ≤ k ≤ 7; and a_1 of sort attr. In Fig. 10, value constants are shown in parentheses and node constants in square brackets.

[Fig. 9: Categorical semantics of XML. A diagram of the presentations Val, Elem, Node, Attr and Doc connected by the import morphisms.]

<?xml version="1.0"?>
<!-- book.xml -->
<!DOCTYPE BOOK SYSTEM "book.dtd">
<t_1:BOOK a_1:LABEL="BROOKS1995 (v_1)">                              [x_1]
  <t_2:TITLE>The Mythical Man-Month: Essays in Software Engineering (v_2) [x_3]</TITLE>  [x_2]
  <t_3:AUTHOR>Frederick P. Brooks Jr. (v_3) [x_5]</AUTHOR>           [x_4]
  <t_4:EDITION>2nd (v_4) [x_7]</EDITION>                             [x_6]
  <t_5:EDITOR/>                                                      [x_8]
  <t_6:PUBLISHER>Addison-Wesley Pub Co. (v_5) [x_10]</PUBLISHER>     [x_9]
  <t_7:YEAR>1995 (v_6) [x_12]</YEAR>                                 [x_11]
</BOOK>
Fig. 10: Annotated BOOK document.

Figure 10 comprises a document which gives rise to a Book presentation containing an import Doc statement and axioms specifying as distinct and non-bottom the constants {d_1 : doc; x_i : node; v_j : val; t_k : type; a_1 : attr}. The presentation specifies the root book node:

    doc_node(d_1) = x_1                                               (1)

It is important to mention that we avoid over-constraining doc_node in order to allow the semantic composition of documents, as illustrated below. It would be troublesome, for instance, if above x_1 were required to be the only node in the image of this function. The specified book element e_1 has just one attribute a_1 (LABEL) with value v_1 (BROOKS1995). This is mapped into the following axioms:

    attr_val(a_1) = v_1                                               (2)
    attr_elem(a_1) = e_1                                              (3)
    ∀a_2 : attr  attr_elem(a_2) = e_1 → a_2 = a_1                     (4)

Now we have to deal with the complex structure of BOOK.

Due to the limited graphical resolution of this paper, it was impossible to make explicit in Fig. 10 the elements defined by the document, but we equally need to assume the existence of the respective constants e_l of type elem, 1 ≤ l ≤ 7, in order to formalise these document components. The structure of BOOK is interpreted with the help of the previously introduced functions member and node_elem:

    node_elem(x_1) = e_1                                                   (5)
    member(x_1, 0) = x_2   ...                                             (6)
    member(x_1, 5) = x_11                                                  (7)
    member(x_1, 6) = void                                                  (8)

Note that (8) is required in order to restrict the interpretation of BOOK to the nodes in Fig. 10. Without this kind of equation in the interpretation of each specification, unbound (open) documents would be admissible. Although interesting, this kind of document is not acceptable in XML. Our core semantics, however, regards them as acceptable because the axioms in Fig. 7 only require the existence of an unspecified natural number limiting the elements of each document (3.4). The equations above capture only the first two hierarchical levels of the book document, containing the elements from TITLE to YEAR. To represent the whole document, we have to iterate this process until plain values are found. For example:

    node_elem(x_2) = e_2                                                   (9)
    member(x_2, 0) = x_3                                                   (10)
    member(x_2, 1) = void                                                  (11)
    node_val(x_3) = v_2                                                    (12)
    member(x_3, 0) = void                                                  (13)

These equations say that the node x_2 (TITLE) defines an element e_2, which in turn contains a node x_3 holding value v_2, the book title. Most of what remains to be interpreted of BOOK is uninteresting, a simple repetition of the cases above. The only exception is the interpretation of the empty element EDITOR, which is performed as follows:

    node_elem(x_8) = e_5                                                   (14)
    member(x_8, 0) = void                                                  (15)

The process above may be used to produce an interpretation for any XML document. For instance, CARD can be interpreted as presented in Fig. 11 through a diagrammatic notation, where nodes are represented using squares, elements as empty circles and values as filled ones. Although we have omitted this detail from our examples, the semantics of DTDs can be produced in the same way. This process can be formalised through a functor mapping each XML syntactic construct into new components of a theory presentation, which is initially empty but is gradually augmented by the interpretation of each document component. We omit here the definition of this functor, whose signature is [[ ]] : XML × Pres → Pres.

    Fig. 11: Structure of the card document (a tree of nodes of types CARD, HOLD, NUMBER, BRAND and EXPIRY holding the values CHCD, 1234, SUPER and 2003).

4.3 Compositional Semantics

The semantics above determines a separate interpretation for each given XML document. One may wonder if some sort of compositionality is present in this kind of interpretation. The answer is affirmative. To illustrate this, we take the presentations resulting from the interpretation of CARD and BOOK described in Section 4.2 and attempt to compose these objects using presentation morphisms. We recall from [10, 11, 13] that the presentations and morphisms (functions) adopted here determine a category Pres. This allows us to make use of helpful categorical constructions. For instance, the commutative diagram in Fig. 12 describes the composition of Card and Book resulting in Card ⊗ Book.

The construction of Card ⊗ Book can be explained by analogy with the process of defining Book from BOOK and Doc. We mentioned that Doc is imported into Book through an identity morphism (o_2) such that: (i) a unique new document with its root node is required to exist in the target presentation (creation); (ii) all the constants assumed to exist only in the target presentation are distinguished from the image of the values defined in the source (completion); and (iii) the root of the top-level document becomes a member of the new document root (demotion). The definition of Card is performed based on a morphism of the same family (o_1). Apart from obeying these rules with respect to Doc, Card ⊗ Book must also be constructed in a minimalist way, without equalising the images of the constants in Card and Book. This is ensured by computing a pushout, a categorical construction defined here by the morphisms φ_1 and φ_2 of the same family. These are defined in such a way that the images of Doc imported through Card and Book are collapsed into a single entity when they are put together. So we end up with just one member symbol in Card ⊗ Book, which captures the complex node structure of both documents. Since the interpretation of each composite object is defined by the individual interpretations of its components when combined in the way suggested by our semantics, we have compositionality.

It is interesting to note that we can iterate this construction and obtain an interpretation for another of our example documents, ORDER, which contains elements defining the price and currency adopted in an electronic transaction. These elements are present only in Order, the interpretation of ORDER depicted in Fig. 12.

    Fig. 12: Formal semantics of our example (a commutative diagram with morphisms o_1 : Doc → Card and o_2 : Doc → Book, pushout morphisms φ_1 : Card → Card ⊗ Book and φ_2 : Book → Card ⊗ Book, and further morphisms into Order).

Another issue related to compositionality is the possibility of testing the structural equivalence of documents. It is also feasible to rely on categorical techniques for this purpose. Given two documents, a positive answer to this question is found if it is possible to determine two standard theory morphisms mapping their interpretations into each other in such a way that the resulting diagram commutes.

4.4 Core Semantics of DOM

When we come to capturing the semantics of DOM, the rationale concerning the use of presentations and morphisms is also applicable, but because DOM objects have observable behaviour, we are obliged to make use of the temporal aspects of the adopted logical system. We define a theory presentation for each DOM object type. In each of these presentations, the XML semantics is imported; the attributes of the DOM interface are interpreted as attribute symbols and methods are captured using action symbols, with existing parameters represented as action arguments. Note that these steps are performed to provide a language rich enough for representing the dynamic properties of the respective DOM objects through some presentation axioms. In Fig. 13 we present the interpretation of attributes in DOM (Attr).

    Presentation DOMAttr
      imports     Bool, Doc
      attributes  name, value : val
      actions     getName, isName(val), getVal, isVal(val), getSpec, isSpec(bool), setVal(val)
      axioms      v : val
        beg → value = nil                                                  (6.1)
        setVal(v) ∨ (value = v ∧ X(value = v))                             (6.2)
        setVal(v) → X(value = v)                                           (6.3)
        getName ∧ name = v → X(isName(v))                                  (6.4)
        getVal ∧ value = v → X(isVal(v))                                   (6.5)
        getSpec ∧ value = v → X(isSpec(v ≠ nil))                           (6.6)

    Fig. 13: Semantics of DOM attributes.
The presentation says that attribute values are initially undefined (6.1), only change if requested (6.2), and are modified subsequently after a request (6.3); and that attribute names, values and status are returned immediately after any query (6.4)-(6.6).¹

¹ beg holds initially and Xp says that p holds next.
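The behaviour these temporal axioms describe can also be illustrated operationally. The toy sketch below is our own assumption-laden reading (a request/step discipline standing in for the branching-time semantics), not the paper's formal model:

    # A toy operational reading (assumed, not the paper's semantics): value is nil
    # initially (6.1), changes only via setVal and only in the next state (6.2)-(6.3),
    # and a getVal request is answered in the next state with the value it observed (6.5).

    class DOMAttr:
        NIL = None

        def __init__(self, name):
            self.name = name
            self.value = DOMAttr.NIL      # (6.1): beg -> value = nil
            self._pending = []            # actions requested in the current state

        def set_val(self, v):
            self._pending.append(("set", v))

        def get_val(self):
            self._pending.append(("get", None))

        def step(self):
            """Advance to the next state; return any isVal observation made there."""
            observation = None
            old_value = self.value
            for action, v in self._pending:
                if action == "get":
                    observation = old_value   # (6.5): getVal & value = v -> X(isVal(v))
                elif action == "set":
                    self.value = v            # (6.3): setVal(v) -> X(value = v)
            self._pending = []
            return observation

    a = DOMAttr("LABEL")
    a.set_val("BROOKS1995")
    a.step()                  # the value changes only in the next state
    a.get_val()
    print(a.step())           # observes "BROOKS1995"

Here a call such as set_val only takes effect when step moves to the next state, mirroring the use of the next-time operator X in (6.3) and (6.5).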

5 Final Remarks

In this paper, we have outlined a logico-categorical semantics for both XML and DOM, which may be regarded as a formal alternative to their standard informal semantics. Our semantics is structured in terms of theory presentations of a first-order, many-sorted, branching-time logical system with equality, which were derived from the W3C recommendations and should be particularised in the interpretation of each document or application with their specific details. Using this work, it becomes possible to reason about the static and dynamic aspects of web-based open distributed systems and frameworks, manually or by using automated support. We believe that the most promising directions for applying this research are the study of integrating heterogeneous sources of information over the web and the design of web-based multimedia systems.

Many other formal models of XML have been proposed in the literature with specific purposes. Most of these are based on the use of mathematical or logical constructions to capture the tree-like structure of XML documents (e.g. [2, 5, 9, 12]). These differ from our work due to the additional proof-theoretic treatment given to the XML notions here, which we consider better suited to support the rigorous development of software systems in general. The abstract categorical approach proposed in [1] bears many similarities with our ideas, specifically related to the use of morphisms to describe the structure and collective behaviour of XML-based software systems. In particular, because our work is based on theory presentations and their morphisms, it is not difficult to see that the induced morphisms (functors) between the categories of models of these theories possess the reverse direction of the given morphisms, thus complying with the definitions in that work.

Acknowledgements

The author would like to thank the anonymous referees of this paper for their considerate comments and corrections. The reported work has been partially supported by CNPq research grant number /00-0.

References

[1] S. Alagic and P. A. Bernstein. A model theory for generic schema management. In Proc. 8th International Workshop on Database Programming Languages (DBPL'01), 2001.
[2] M. Arenas, W. Fan, and L. Libkin. On verifying consistency of XML specifications. In Proc. 21st Symposium on Principles of Database Systems (PODS'02), 2002.
[3] T. Berners-Lee and D. Connolly. Hypertext Markup Language (HTML) 2.0. RFC 1866, MIT/W3C, November 1995.
[4] T. Berners-Lee et al. The World-Wide Web (WWW). Communications of the ACM, 37(8):76-82, 1994.
[5] A. Brown, M. Fuchs, J. Robie, and P. Wadler. MSL: a model for W3C XML Schema. In Proc. 10th International World Wide Web Conference (WWW'01). ACM Press, 2001.
[6] World Wide Web Consortium. Document Object Model (DOM) level 1 specification. W3C recommendation, October 1998.
[7] World Wide Web Consortium. XSL transformations. W3C recommendation, November 1999.
[8] World Wide Web Consortium. Extensible Markup Language (XML) 1.0 (second edition). W3C recommendation, October 2000.
[9] World Wide Web Consortium. Extensible Stylesheet Language (XSL) 1.0. W3C recommendation, November.
[10] C. H. C. Duarte and T. Maibaum. A branching-time logical system for open distributed systems development. In Proc. 9th Workshop on Logic, Language, Information and Computation (WOLLIC'02), to appear, 2002.
[11] J. Fiadeiro and T. Maibaum. Temporal theories as modularisation units for concurrent systems specification. Formal Aspects of Computing, 4(3), 1992.
[12] F. Frasincar, G.-J. Houben, and C. Pau. XAL: An algebra for XML query optimization. In Z. Zhou, editor, Proc. 13th Australasian Database Conference (ADC'2002), 2002.
[13] J. A. Goguen. A categorical manifesto. Mathematical Structures in Computer Science, 1(1):49-67, 1991.
[14] International Standards Organization. Information technology: document description and processing languages. Technical Report ISO 8879:1986 TC2.

XML Structure Compression

Mark Levene and Peter Wood
Birkbeck College, University of London
London WC1E 7HX, U.K.

Abstract

XML is becoming the universal language for communicating information on the Web and has gained wide acceptance through its standardisation. As such, XML plays an important enabling role for dynamic computation over the Web. Compression of XML documents is crucial in this process as, in its raw form, XML often contains a sizable amount of redundancy. Several XML compression algorithms have been proposed, but none make use of the DTD when it is available. Here we present a novel compression algorithm for XML documents that conform to a given DTD, which separates the document's structure from its data, taking advantage of the regular structure of XML elements. Our approach seems promising as we are able to show that it minimises the length of encoding under the assumption that document elements are independent of each other. Our presentation is a preliminary investigation; it remains to carry out experiments to validate our approach on real data.

1 Introduction

Extensible Markup Language (XML) [GP01] is the universal format for structured documents and data on the Web. With the vision of the semantic Web [BLHL01] becoming a reality, communication of information on the machine level will ultimately be carried out through XML. As the level of XML traffic grows, so will the demand for compression techniques which take into account the XML structure to increase the compression ratio. The ability to compress XML is useful because XML is a highly verbose language, especially regarding the duplication of meta-data in the form of elements and attributes.

A simple solution would be to use known text compression techniques [BCW90] and pipe XML documents through a standard text compression tool such as gzip or bzip2 (sourceware.cygnus.com/bzip2/index.html). The problems with this approach are twofold: firstly, compression of elements or attributes may be limited by existing tools due to the long-range dependencies between elements and between attributes, i.e. the duplication is not necessarily local, and secondly, to enhance compression, it may be useful to use different compression techniques on different components of XML. A simple idea to improve on just using standard text compression tools is to use a symbol table for XML elements and attributes prior to piping the result through gzip or bzip2. We are aware of two XML compression systems, XMILL [LS00] and XMLPPM [Che01], that attempt to further improve on this idea.

The idea behind XMILL is to transform the XML into three components: (1) elements and attributes, (2) text, and (3) document structure, and then to pipe each of these components through existing text compressors. Another XML compression system, XMLPPM, refines this idea further by using different text compressors for different XML components, i.e. one model for element and attribute compression and another for text compression; in addition, it utilises the hierarchical structure of XML documents to further compress documents.

We are aware of one previous compression algorithm for XML that uses the knowledge encapsulated in a Document Type Definition (DTD) [GP01]. The proposed method, called differential DTD compression [SM01], claims to encode only the information that is present in an XML document but not in its DTD. In this sense their approach is the same as the one we present here, but as opposed to [SM01] we concentrate on a particular algorithm, independently discovered, and its detailed analysis.

To simplify the presentation, in this paper we consider DTDs which define only elements, rather than allowing the definition of attributes and entities as well. The compression techniques and algorithms could be adapted to cater for these additional components. As a result, all XML documents in this paper comprise only occurrences of elements. Prior to explaining our compression algorithm we briefly introduce the XML concepts we use via an example.

Example 1.1 Consider the following DTD D which provides a simplistic representation for the contents of books:

    <!ELEMENT book (author, title, chapter+) >
    <!ELEMENT chapter (title, (paragraph | figure)+) >
    <!ELEMENT author (#PCDATA) >
    <!ELEMENT title (#PCDATA) >
    <!ELEMENT paragraph (#PCDATA) >
    <!ELEMENT figure (EMPTY) >

Each element definition comprises two parts: the left side gives the name of the element, while the right side defines the content model for the element. The content model is defined using a regular expression built from other element names. The six content models defined above are interpreted as follows. A book element has an author element as first child, a title element as second child, and one or more chapter elements as further children. A chapter element must have a title element followed by one or more paragraph or figure elements. Elements author, title and paragraph contain simply text, while figure elements are empty (presumably with data for the figure provided by an attribute, which we do not consider here).

The following XML document d is valid with respect to (or conforms to) the DTD D:

    <book>
      <author>Darrell Huff</author>
      <title>How to Lie with Statistics</title>
      <chapter>
        <title>Introduction</title>
        <figure ... />
        <paragraph>With prospects of ...</paragraph>
        <paragraph>Then a Sunday newspaper ...</paragraph>
        [ 8 more paragraphs ]
      </chapter>
      <chapter>
        <title>The Sample with the Built-in Bias</title>
        [ 53 paragraphs and 7 figures ]
      </chapter>
    </book>

The parse tree of document d with respect to DTD D, denoted by PARSE(d, D), is shown in Figure 1. The nodes in PARSE(d, D) correspond to the elements of the XML document d and to operators from the regular expressions used in content models in D. We call the latter category of nodes structure nodes. In particular, in Figure 1, there are three types of structure nodes: nodes labelled by + are repetition nodes, nodes labelled by , are sequence nodes, and nodes labelled by | are decision nodes.

    Figure 1: The parse tree for the XML document of Example 1.1

From now on we will assume that each XML document we deal with is valid with respect to a given DTD D.

We now give a conceptual overview of the compression algorithm we have devised. We split the compression of an XML document, say d, into two parts. First we obtain the parse tree PARSE(d, D) (see Figure 1) and prune from it all the leaf nodes containing text; we call the sequence of leaf nodes, taken in a left-to-right fashion with a fixed delimiter between them, the data. In the second step, we apply further pruning to the parse tree, maintaining a tree representation only of the structure that needs to be encoded in order that the decoding can reconstruct the document given DTD D; we denote the resulting tree by PRUNE(d, D). Figure 2 shows the pruned tree corresponding to the parse tree of Figure 1 (note that, in this example, the pruned parse tree does not contain sequence nodes since they can be deduced from DTD D). We can then encode PRUNE(d, D) using a breadth-first traversal, in such a way that each repetition node is encoded by a number of bits, say B, encoding the number of children of the repetition node, and each decision node is encoded by a single bit, which may be 0 or 1 according to its child; we call the resulting output the encoding. The algorithm presented in Section 2 does not explicitly construct PRUNE(d, D). Instead it traverses PARSE(d, D) while generating the encoding. However, it is effectively only those nodes of PRUNE(d, D) which cause the algorithm to generate any output.

    Figure 2: The pruned parse tree for the XML document of Example 1.1

The compression of the document thus contains three elements: (1) the DTD, which is fixed, (2) the encoding of the document's structure given the DTD, and (3) the textual data contained in the document given the DTD. These outputs can be compressed further by piping them through standard text compression tools. We now give an example to illustrate our algorithm.

Example 1.2 In order to encode the structure of the document d given in Example 1.1, we observe that every book must have an author and title as its first two children, so there is no need to encode this information, since the DTD D will be known to the decoder. All that needs to be encoded for the children of book is the number of chapter elements present in the document.

This is suggested by the top repetition node labelled with + in Figure 2. In this case, there are two chapter elements. For each chapter element in d, we need to encode the number of paragraph and figure elements which occur. This is suggested by the lower two repetition nodes labelled with + in Figure 2. Since arbitrary sequences of paragraph and figure elements are permitted, all we can do in the encoding is to list the actual sequence which occurs, simply encoding the element names. This is suggested by the decision nodes labelled with | and their children in Figure 2. Such an encoding is no better than if there were no DTD, but on the other hand we cannot possibly do any better (unless we ignore the order of elements). We can encode the occurrence of a paragraph in d by 0 and that of a figure by 1. Thus the encoding for the first chapter would be 11 10000000000, where 11 is the number of paragraph and figure elements, represented in decimal. The second chapter requires 60 bits to represent the sequence of paragraph and figure elements.

In order to provide an analysis of our algorithm we turn to information theory [Rez94] and the concept of entropy (or uncertainty), which implies that a minimum encoding of a message is a function of the likelihood of the message. Our algorithm is in the spirit of the two-part Minimum Description Length (MDL) encoding [HY01], which is based upon the idea of choosing the model that minimises the sum of the lengths of the encodings of (1) the model, and (2) the data encoded given the chosen model. In our case the DTD provides us with a model for the data, i.e. for the XML document, which we then use to encode the document's structure. Then, having this encoding available, we can transmit the document structure and the actual data separately. We observe that if we view the DTD as an equivalence class of models, such that the document is valid with respect to all members of the class, then MDL encoding could be used to choose the preferred model.

The techniques we present here are also inspired by work on using DTDs to optimise queries on XML repositories [Woo00, Woo01]. Constraints present in DTDs can be used to detect redundant subexpressions in queries. For example, assume that a DTD implies that every date element which has a day element as a child must also have month and year elements as children. Now a query (on a set of documents valid with respect to the DTD) which asks for date elements which have both a day element and a month element as children is equivalent to one which asks for date elements which just have a day element as a child.

The rest of the paper is organised as follows. In Section 2 we present the detail of our minimum length encoding algorithm. In Section 3 we provide some analysis of our algorithm, and finally, in Section 4 we give our concluding remarks.

2 Minimum Length Encoding of XML Document Structures

Given an XML document d and DTD D, recall that PARSE(d, D) is the parse tree of d with respect to the DTD D. Now let STRUCT(d) be the tree representation of d with the leaf nodes containing text pruned from it. In this section we define encoding and decoding algorithms for STRUCT(d), assuming that d is valid with respect to DTD D. The encoding algorithm takes PARSE(d, D) as input and produces a minimal length encoding, ENCODING(d, D), of PARSE(d, D).

The decoding algorithm takes ENCODING(d, D) and D as input and reconstructs STRUCT(d). The encoding and decoding algorithms operate in breadth-first order, considering the children of each element from the root in turn. The following example, along with Examples 1.1 and 1.2, illustrates the essence of one such step of the encoding algorithm.

Example 2.1 Consider the following DTD (with simplified syntax):

    bookstore ((book | magazine)+)
    book      (author*, title?, date, isbn)
    author    (((first-name | first-initial), middle-initial?)?, last-name)
    magazine  (title, volume?, issue?, date)
    date      ((day?, month)?, year)

This DTD uses two postfix operators not found in the DTD of Example 1.1: * represents zero or more occurrences of the preceding expression, while ? represents that the preceding expression is optional. Occurrences of the operator * give rise to repetition nodes in the parse tree of a document, while those of operator ? give rise to decision nodes.

The encoding for the sequence of children of an element with name n in the document is based on how that sequence is parsed using the regular expression in the content model for n in the DTD. Consider the children of a date element. If there were only a year element as a child, then the encoding is simply 0, reflecting the fact that the optional subexpression (day?, month)? was not used in parsing. On the other hand, if date has both year and month children, then the encoding is 10, reflecting the fact that the subexpression (day?, month)? was used in the parsing (hence 1), but that the day? subexpression was not used (hence 0). For day, month and year children, the encoding would be 11. Hence the maximum length of the encoding is 2 bits, which is the shortest possible for representing the three possible sequences of children for date: (day, month, year), (month, year) and (year).

At the other extreme, the content model for bookstore allows for any sequence of children (over the alphabet of book and magazine) whatsoever (apart from the empty sequence). Thus, as for paragraph and figure elements in Example 1.1, all we can do in the encoding is to list the actual sequence which occurs, simply encoding the element names using 0 for book and 1 for magazine. For encoding the children of a book element, we encode the number of author occurrences followed by 1 or 0 indicating whether or not title occurs.

The encoding algorithm, called ENCODE-STRUCTURE, is shown in Figure 3. The algorithm makes use of a procedure called ENCODE, which is shown in Figure 4. The algorithm takes as input the parse tree, PARSE(d, D), of a document d with respect to DTD D, and produces ENCODING(d, D), the encoding of STRUCT(d) with respect to D. As shown in Figure 1, nodes labelled with the operator | in PARSE(d, D) have a single child which is the root of a parse tree for either the left-hand or right-hand operand of | in the regular expression in which it appears (unless the operand which is used in the parsing is ɛ, in which case there will be no child). The encoding algorithm assumes that, rather than using the operator | in PARSE(d, D), the operators |L and |R are used, where |L (respectively, |R) indicates that the child of the operator node corresponds to the left-hand (respectively, right-hand) operand in the corresponding regular expression.

    algorithm ENCODE-STRUCTURE:
    Input: PARSE(d, D)
    Output: ENCODING(d, D)
    Method:
    begin
      /* Assume we have a single queue available, with operations
         enqueue(node) and dequeue() */
      let i be the root node of PARSE(d, D);
      enqueue(i);
      while queue not empty do
        ENCODE(dequeue())
    end

    Figure 3: Algorithm for encoding an XML tree.

In the construction of PARSE(d, D), we assume that the operators *, + and ? are left-associative and have the highest precedence. The operators , and | are right-associative, with , having higher precedence than |. The encoding algorithm also assumes that the document type of the document d being encoded corresponds to the first element name defined in the DTD D. Thus the document type of any document being encoded with respect to the DTD of Example 2.1 is assumed to be bookstore. A trivial addition to the encoding algorithm can overcome this restriction.

Example 2.2 Consider the following XML document, which is valid with respect to the DTD given in Example 2.1:

    <bookstore>
      <book>
        <author>
          <first-initial>J</first-initial>
          <middle-initial>M</middle-initial>
          <last-name>Coetzee</last-name>
        </author>
        <date><year>1990</year></date>
        <isbn> </isbn>
      </book>
      <magazine>
        <title>The Economist</title>
        <date><day>24</day><month>June</month><year>2000</year></date>
      </magazine>
      <book>
        <author>
          <first-name>Nadine</first-name>
          <last-name>Gordimer</last-name>
        </author>
        <title>Something Out There</title>
        <date><year>1984</year></date>
        <isbn> </isbn>
      </book>
    </bookstore>

    procedure ENCODE(i):
    begin
    1. if i labelled with n then
         if i has child node j labelled with an operator then
           enqueue(j);
         /* else node is empty or contains only text */
    2. if i labelled with , then
       begin
         let j and k be the first and second child nodes, respectively, of node i;
         ENCODE(j); ENCODE(k)
       end
    3. if i labelled with |L or |R then
       begin
         if i labelled with |L then output 0 else output 1;
         if i has child node j then ENCODE(j)
       end
    4. if i labelled with ? then
         if i has child node j then
           begin output 1; ENCODE(j) end
         else output 0
    5. if i labelled with * or + then
       begin
         let i have m child nodes, j_1, ..., j_m;
         output m;
         for k = 1 to m do ENCODE(j_k)
       end
    6. return
    end

    Figure 4: Procedure used in encoding algorithm.
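The pseudocode of Figures 3 and 4 can also be rendered directly in Python. In the sketch below the parse-tree representation (a label that is either an element name or one of the operator symbols ',', '|L', '|R', '?', '*', '+', together with a list of children) and the emission of repetition counts as decimal tokens are our own assumptions, not part of the paper:

    # A minimal sketch, assuming the node representation described above.

    from collections import deque

    OPERATORS = {",", "|L", "|R", "?", "*", "+"}

    class PNode:
        def __init__(self, label, children=None):
            self.label = label
            self.children = children or []

    def encode_structure(root):
        out, queue = [], deque([root])
        while queue:
            encode(queue.popleft(), queue, out)
        return out

    def encode(i, queue, out):
        if i.label not in OPERATORS:                 # step 1: element node
            if i.children and i.children[0].label in OPERATORS:
                queue.append(i.children[0])
        elif i.label == ",":                         # step 2: sequence node
            encode(i.children[0], queue, out)
            encode(i.children[1], queue, out)
        elif i.label in ("|L", "|R"):                # step 3: decision node
            out.append("0" if i.label == "|L" else "1")
            if i.children:
                encode(i.children[0], queue, out)
        elif i.label == "?":                         # step 4: optional node
            if i.children:
                out.append("1")
                encode(i.children[0], queue, out)
            else:
                out.append("0")
        elif i.label in ("*", "+"):                  # step 5: repetition node
            out.append(str(len(i.children)))
            for j in i.children:
                encode(j, queue, out)

Run on a parse tree built for the document of Example 2.2, this sketch should reproduce the token sequence tabulated in the example below.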

Algorithm ENCODE-STRUCTURE starts by calling the ENCODE procedure with the node labelled bookstore. So ENCODE executes line 1 and, since this node has a child labelled with an operator, the child node is added to the queue. The procedure returns and is then called with the node labelled +. This causes line 5 to be executed and, since there are 3 child nodes, the number 3 is output, followed by ENCODE being called for each of the child nodes, which are labelled with |L, |R and |L, respectively. Processing the first of these decision nodes results in 0 being output from line 3, after which ENCODE is called with the child node labelled book. Line 1 causes this node to be added to the queue. Then 1 is output (line 3) and the node labelled magazine added to the queue (line 1), followed by 0 being output (line 3) and the second node labelled book being added to the queue. At this point the children of bookstore in STRUCT(d) have been encoded and they are on the queue ready for their own structures to be encoded.

The above procedure continues until all the nodes in PARSE(d, D) have been processed. The final encoding is as follows (where the bracketed items are integers representing numbers of occurrences, written in decimal):

    children of:                                                               encoding
    bookstore (3 children: book, magazine, book)                               [3] 0 1 0
    1st book (1 author, no title)                                              [1] 0
    magazine (no volume, no issue)                                             0 0
    2nd book (1 author, title)                                                 [1] 1
    author of 1st book (more than last-name: first-initial, middle-initial)    1 1 1
    date of 1st book (year only)                                               0
    date of magazine (month and day)                                           1 1
    author of 2nd book (more than last-name: first-name, no middle-initial)    1 0 0
    date of 2nd book (year only)                                               0

Note that the length of an encoding for a particular element name can vary: the length of the encoding for the first and third dates above is 1, while that for the second date is 2. Assuming that each integer is encoded using a fixed length of 2 bits, the total encoding length of the structure of d is 23 bits.

We observe that the parenthesisation of regular expressions (either explicit or implicit) in content models can affect the length of the encoding used. For example, let r1 be a | b | c | d and r2 be (a | b) | (c | d). Clearly r1 is equivalent to r2, but the maximum encoding length for r1 is 3 (an occurrence of d is encoded as 111), while that for r2 is 2. DTD designers can use this fact along with knowledge about the expected occurrences of elements to write their DTDs in such a way as to minimise the expected encoding length. For example, r1 above gives the minimum expected encoding length of 1.75, if the probabilities of occurrence for a, b, c and d are, respectively, 0.5, 0.25, 0.125 and 0.125.
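The expected encoding lengths quoted above are easy to check; in the sketch below the occurrence probabilities are those of the example, while the balanced 2-bit code assumed for r2 is our own choice:

    # A small check of the expected encoding lengths for r1 = a|b|c|d
    # (right-associative) and r2 = (a|b)|(c|d).

    probs = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}

    bits_r1 = {"a": 1, "b": 2, "c": 3, "d": 3}   # a -> 0, b -> 10, c -> 110, d -> 111
    bits_r2 = {"a": 2, "b": 2, "c": 2, "d": 2}   # e.g. a -> 00, b -> 01, c -> 10, d -> 11

    expected_r1 = sum(probs[x] * bits_r1[x] for x in probs)   # 1.75
    expected_r2 = sum(probs[x] * bits_r2[x] for x in probs)   # 2.0
    print(expected_r1, expected_r2)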

    algorithm DECODE-STRUCTURE:
    Input: ENCODING(d, D), denoted e for short below, and DTD D
    Output: STRUCT(d)
    Method:
    begin
      /* Assume we have a single queue available, with operations
         enqueue(node) and dequeue() */
      let n be the first element name defined in D (the assumed document type);
      create node i with element name n;
      enqueue(i);
      while queue not empty do
      begin
        i = dequeue();
        let r be the content model in D of the element name of i;
        e = DECODE(e, r, i)
      end
    end

    Figure 5: Algorithm for decoding an encoded XML tree.

We now describe how to decode ENCODING(d, D), that is, how to recover STRUCT(d). The decoding algorithm, called DECODE-STRUCTURE, is shown in Figure 5. The algorithm makes use of a procedure called DECODE, which is shown in Figure 6. The procedure outputs STRUCT(d) while working through the encoding, the remainder of which it returns after each call. Note that because DECODE is guided by the structure of the regular expression for the content model of the element being processed, it knows what structure to expect at each stage, either an integer or a single bit. Furthermore, if DECODE is passed an encoding for more than the elements comprising the children of an element, the extra unused encoding is returned. Content models which can be empty do not cause problems, because zero occurrences of a content model defined using * and instances of EMPTY as an operand of | (or used implicitly in ?) are encoded explicitly.

    procedure DECODE(e, r, i):
    begin
    1. if r = ɛ then return e
    2. if r = n then
       begin
         create node j with element name n as next child of node i;
         enqueue(j);
         return e
       end
    3. if r = (u, v) then return DECODE(DECODE(e, u, i), v, i)
    4. if r = (u | v) then
         if e = 0 e' then return DECODE(e', u, i)
         else /* e = 1 e' */ return DECODE(e', v, i)
    5. if r = (u)* or r = (u)+ then
       begin
         let e = n e';
         if n = 0 then return e'
         else return DECODE(e', (u_1, ( ... (u_{n-1}, u_n))), i), where u_k = u, 1 <= k <= n
       end
    6. if r = (u)? then return DECODE(e, (ɛ | u), i)
    end

    Figure 6: Procedure used in decoding algorithm.
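As with the encoder, DECODE-STRUCTURE and DECODE can be rendered in Python. In the sketch below the content-model representation (tagged tuples) and the in-place consumption of the encoding from a deque (instead of returning the remaining encoding) are our own simplifying assumptions:

    # A minimal sketch.  A DTD is a dict from element name to a content-model AST:
    # ("elem", name) | ("seq", u, v) | ("alt", u, v) | ("rep", u) for * and + |
    # ("opt", u) | ("empty",) for epsilon, EMPTY and #PCDATA.  The encoding is the
    # token list produced by the encoder sketch above.

    from collections import deque

    class TreeNode:
        def __init__(self, name):
            self.name = name
            self.children = []

    def decode_structure(encoding, dtd, root_name):
        e = deque(encoding)
        root = TreeNode(root_name)
        queue = deque([root])
        while queue:
            i = queue.popleft()
            r = dtd.get(i.name, ("empty",))
            decode(e, r, i, queue)
        return root

    def decode(e, r, i, queue):
        kind = r[0]
        if kind == "empty":                       # line 1: epsilon
            return
        if kind == "elem":                        # line 2: element name
            j = TreeNode(r[1])
            i.children.append(j)
            queue.append(j)
        elif kind == "seq":                       # line 3: (u, v)
            decode(e, r[1], i, queue)
            decode(e, r[2], i, queue)
        elif kind == "alt":                       # line 4: (u | v)
            branch = r[1] if e.popleft() == "0" else r[2]
            decode(e, branch, i, queue)
        elif kind == "rep":                       # line 5: (u)* or (u)+
            n = int(e.popleft())
            for _ in range(n):
                decode(e, r[1], i, queue)
        elif kind == "opt":                       # line 6: (u)? treated as (epsilon | u)
            decode(e, ("alt", ("empty",), r[1]), i, queue)

    # e.g. for the date content model of Example 2.1,
    # dtd = {"date": ("seq", ("opt", ("seq", ("opt", ("elem", "day")),
    #                                 ("elem", "month"))), ("elem", "year"))}
    # decode_structure(["1", "0"], dtd, "date") yields a date node with
    # month and year children, matching the encoding 10 described earlier.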

Example 2.3 Assume that the DTD of Example 2.1 and the encoding produced in Example 2.2 are given as input to algorithm DECODE-STRUCTURE. The algorithm starts by creating a node labelled bookstore and calls DECODE with the full encoding, the regular expression (book | magazine)+ and this node. Line 5 of DECODE is executed, n is found to be 3, and DECODE is called with the rest of the encoding after 3, the regular expression ((book | magazine), ((book | magazine), (book | magazine))) and the same node labelled bookstore. This results in line 3 being executed, with u being (book | magazine). DECODE is then called with u, which causes line 4 to be executed. Since the next bit in the encoding is 0, DECODE is called with u being book. Line 2 then creates a new node labelled book and adds it to the queue. The remaining encoding (with the 3 and the 0 removed) is returned to line 3, where it is used as the first argument of a call of DECODE with regular expression ((book | magazine), (book | magazine)). The decoding of this consumes the next 1 and 0 of the encoding, while adding nodes labelled magazine and book to the queue. Control now returns to the loop in DECODE-STRUCTURE, the children of bookstore having been decoded and added to the queue. The decoding continues in this way until STRUCT(d) has been reconstructed. The data can then be added to STRUCT(d), in a straightforward way, to obtain d.

3 Analysis of the Compression Algorithm

The simplicity of the analysis we present for our algorithm hinges on the fact that the length of the encoding of the document's structure is O(n) bits, where n is the number of nodes in the parse tree; see [KM90] for a survey on tree compression methods. Given a parse tree p = PARSE(d, D) of an XML document d with respect to a given DTD D, let m be the number of repetition nodes in p, and q be the number of decision nodes in p. It is evident that the length of the encoding of a document's structure output by our algorithm is given by

    LEN(d, D) = mB + q,                                                    (1)

where B is the number of bits needed to encode the maximum number of children of any repetition node in the parse tree. The encoding length, LEN(d, D), can be viewed as the number of binary choices one needs to make to reach all the leaf nodes in the parse tree p. So, in Example 1.1, in order to reach a leaf we need to know the chapter number, the paragraph or figure number, and finally whether the leaf is a paragraph or a figure. As an example of a special case we have a nested relational database [LL99] (which subsumes the standard relational database model), where we only need mB bits to encode its structure, as in this case there are no decision nodes.

We note that although DTD rules are independent of each other, since DTDs induce a context-free grammar, there may be dependencies within rules that affect the expected value of LEN(d, D). For example, consider the definition of date given by

    date ((day?, month)?, year)

where a maximum of two decisions need to be made, i.e. whether the date has a month and then whether it also has a day. In this case, assuming choices in the parsing process are equally likely, the expected length of an encoding of date is 1.5, since we can encode the case when no month is specified by one bit and the case when a month is specified by two bits, the second bit indicating whether a day is present or not.

There are also situations when we can improve on the use of B bits to encode a repetition node. For example, if the number of children of most repetition nodes is bounded by B_1 and there are only a few nodes having B_2 children, where B_2 is much larger than B_1, then a shorter code can be obtained by using a delimiter to signify the end of the code for the number of children, i.e. using a variable-length code. In practice our technique of using B bits should work well, since the shorter codes, which only require B_1 <= B bits but use B_2 = B bits instead, are likely to be compressed by a standard compressor at a later stage.

Now in what sense is the encoding length given by (1) optimal? For this we can derive a probability distribution over parse trees of documents given a DTD, assuming that their leaf text nodes have been pruned.

Thus for a decision node we assume that the probability of choosing a 0 or a 1 is 1/2, and for a repetition node we assume that the probability of having any given number of repetitions is 1/2^B. On the assumption that all choices are independent of each other, it follows that the probability of a document structure given a DTD is given by

    P(d, D) = 2^-(mB + q)

and therefore

    LEN(d, D) = -log_2 P(d, D)

as required. (Obviously if the decisions are biased in some way we can improve on this bound.) Moreover, we can compute the entropy of a DTD D by

    H(D) = Σ P(d, D) LEN(d, D),

with the sum being over all possible parse trees PARSE(d, D) with respect to D that have had their text leaf nodes pruned. The entropy gives the expected encoding length of a document's structure given a DTD. We observe that, given a DTD D, H(D) may be computed directly from D by computing the probability of DTD rules one by one according to their regular structure, noting, as above, that the rules are independent of each other.

The above analysis assumes that the DTD is not recursive. It remains an open problem to extend the analysis to recursive DTDs.

4 Concluding Remarks

We have presented a novel algorithm to compress XML documents which are valid with respect to a given DTD. Our approach seems promising as we have shown that it minimises the length of encoding under the assumption of an independent distribution of choices. Our approach differs from previous approaches as it utilises the DTD to encode the structure of XML documents, which is separated from its data. We are looking at ways of further improving the encoding length of XML documents by dropping the assumption that the DTD is fixed, and MDL, in particular, may provide such a framework. In order to test the utility of our approach in practice, we plan to carry out experiments to compare it to existing systems for XML compression.

References

[BCW90] T. Bell, J. G. Cleary, and I. H. Witten. Text Compression. Prentice Hall, Englewood Cliffs, NJ, 1990.
[BLHL01] T. Berners-Lee, J. Hendler, and O. Lassila. The semantic Web. Scientific American, 284:35-43, May 2001.
[Che01] J. Cheney. Compressing XML with multiplexed hierarchical models. In Proceedings of the IEEE Data Compression Conference, Snowbird, Utah, 2001.
[GP01] Charles F. Goldfarb and Paul Prescod. The XML Handbook. Prentice-Hall, third edition, 2001.

[HY01] M. H. Hansen and Bin Yu. Model selection and the principle of minimum description length. American Statistical Association Journal, 96, 2001.
[KM90] J. Katajainen and E. Mäkinen. Tree compression and optimization with applications. International Journal of Foundations of Computer Science, 1, 1990.
[LL99] M. Levene and G. Loizou. A Guided Tour of Relational Databases and Beyond. Springer-Verlag, London, 1999.
[LS00] H. Liefke and D. Suciu. XMILL: An efficient compressor for XML data. In Proceedings of the ACM SIGMOD International Conference on Management of Data, Dallas, Tx., 2000.
[Rez94] F. M. Reza. An Introduction to Information Theory. Dover, New York, NY, 1994.
[SM01] N. Sundaresan and R. Moussa. Algorithms and programming methods for efficient representation of XML for internet applications. In Proceedings of the International World Wide Web Conference, Hong Kong, 2001.
[Woo00] Peter T. Wood. Rewriting XQL queries on XML repositories. In Proceedings 17th British National Conference on Databases (University of Exeter, UK, July 3-5), number 1832, Berlin, 2000. Springer-Verlag.
[Woo01] Peter T. Wood. Minimising simple XPath expressions. In Proceedings WebDB 2001: Fourth International Workshop on the Web and Databases (Santa Barbara, Ca., May 24-25), pages 13-18, 2001.

Criteria for Evaluating Information Retrieval Systems in Highly Dynamic Environments

Judit Bar-Ilan
School of Library, Archive and Information Studies
The Hebrew University of Jerusalem
P.O. Box 1255, Jerusalem, 91904, Israel

Abstract

This paper proposes a set of measures to evaluate search engine functionality over time. When coming to evaluate the performance of Web search engines, the evaluation criteria used in traditional information retrieval systems (precision, recall, etc.) are not sufficient. Web search engines operate in a highly dynamic, distributed environment; therefore it becomes necessary to assess search engine performance not just at a single point in time, but over a whole period. The size of a search engine's database is limited, and even if it grows, it grows more slowly than the Web. Thus the search engine has to decide whether and to what extent to include new pages in place of pages that were previously listed in the database. The optimal solution is that all new pages are listed, and no old ones are removed - but this of course is usually unachievable. The proposed metrics that evaluate search engine functionality in the presence of dynamic changes include the percentage of newly added pages, and the percentage of removed pages which still exist on the Web. The percentage of non-existent pages (404 errors, non-existent server, etc.) out of the set of retrieved pages indicates the timeliness of the search engine. The ideas in this paper elaborate on some of the measures introduced in a recently published paper (Bar-Ilan, 2002). I'd like to take advantage of the opportunity to discuss the problem of search engine evaluation in dynamic environments with the participants of the Web Dynamics Workshop.

Introduction

The World Wide Web is a very different environment from the usual setting in which traditional information retrieval (IR) systems operate. Traditional systems operate in a highly controlled, centralized and relatively stable environment. New documents can be added, but in a controlled fashion. Sometimes old documents are removed or moved (for example in case the "current" database of a bibliographic retrieval system contains only documents from the last two years, and the older ones are moved to an archive); and documents or document representations may change - mistakes can be corrected. The major point is that all these processes are controlled. The Web, on the other hand, is uncontrolled, distributed and highly dynamic - in short, total chaos. The situation was aptly expressed by Chakrabarti et al. (1999): the Web "has evolved into a global mess of previously unimagined proportions". On the Web almost anyone can publish almost anything. Later the author of the page or the publishing site can decide to: change the content; remove the page or move it to another directory on the same server or to a different server; or publish the same content at another URL. All these changes occur continuously and the search engines are not notified of any of them. They have to cope not only with the dynamic changes caused by the authors and the publishers (the servers), but also with problems on the way to these pages: communication or server failures.

Web search engines operate in a chaotic environment, which is rather different from the stable, controlled setting of classical IR systems. Still, up till now most studies evaluating search engines have used traditional IR evaluation criteria. The best-known IR evaluation measures are precision and recall. A large number of studies evaluated precision or top-ten precision (e.g. Leighton & Srivastava, 1999 or Gordon & Pathak, 1999), while only a very few attempted to estimate recall (e.g. Clarke & Willett, 1997). Precision and recall are the most widely used measures of effectiveness, but other criteria were also used to assess Web search performance (e.g. Zhu & Gauch, 2000; Singhal & Kaszkiel, 2001). The search engines' coverage of the Web has also been estimated (Bharat & Broder, 1998; Lawrence & Giles, 1998 and 1999). A number of early studies compared the search capabilities of the different search tools, usually based on the documentation provided by the search tools (e.g. Courtois, 1996). User studies assessed satisfaction [e.g. the NDP survey reported by Sullivan (2000a)] and search behavior (e.g. Watson, 1998; Jansen, Spink & Saracevic, 2000; or Holscher & Strube, 2000). Other IR evaluation criteria include studies on output form (e.g. Tombros & Sanderson, 1995; Zamir & Etzioni, 1999) or usability issues like interface design (e.g. Berenci et al., 1999).

Issues related to searching in a dynamic environment have already been addressed, but not from the evaluation perspective. Some works studied the rate of change of Web pages in order to assess the benefits of caching (e.g., Douglis et al., 1997), or to devise refresh schedules for the search engines (e.g., Brewington & Cybenko, 2000; Cho & Garcia-Molina, 2000) so as to be as fresh and timely as possible using available resources, or to characterize changes occurring to Web pages and sites over time (e.g. Koehler, 1999 & 2002; Bar-Ilan & Peritz, 1999). Several works (e.g., Bar-Ilan, 1999; Rousseau, 1999; Bar-Ilan, 2000) reported huge fluctuations over time in the number of results search engines retrieve for given queries. Lawrence and Giles (1999) raised an interesting point: "There may be a point beyond which it is not economical for them [the search engines, J. B.] to improve their coverage and timeliness." Thus in addition to the technical and algorithmic difficulties, the financial aspects must also be taken into account. We are not only facing the question whether the search engines can cope with the growth and changes on the Web, but also whether they want to cope.

The major issues of interest when coming to evaluate search engine performance in a dynamic environment are:

1) Timeliness/freshness
   - Percentage of broken links
   - Percentage of pages where the indexed copy differs from the Web copy
   - Percentage of "recently" created pages in the database

2) Stability over time
   - Are there great fluctuations in the number of results for a given query?
   - Does the search engine "drop" from its database existing URLs relevant to the query?

In the next section we describe the necessary framework for evaluation, then we formally define the metrics and discuss their meaning.

The Framework

In order to evaluate search engine performance over a period of time, the query/queries have to be asked periodically from the search engine. The query/queries are run in search rounds. The search period is the span of time during which the searches were carried out. The search rounds should be equidistant. From our experience it is sufficient to run the query/queries once a month. We experimented with running the query once a week, but the observed changes were not very significant. An exception was an experiment we carried out in September-October 1999, when huge daily fluctuations were observed in the results of HotBot. Notess (n.d.) reports that AltaVista has an ongoing problem with the number of results: because of unreported timeouts, it may retrieve a different number of results each time the search button is pressed. We have not encountered such problems with AltaVista during our searches. However, we made sure that the queries were run at a time when Internet communication is known to be low, on Sunday early mornings (around 5:00-7:00 GMT).

In order to compute the percentage of newly added pages, the percentage of "dropped" pages and the percentage of broken links, all the URLs the search results point to must be visited in every search round. The best solution is to download all the pages the search results point to immediately after the query is run. This way the results can be examined in a more leisurely fashion, and, more importantly, they can be viewed as they were seen at the time the searches were carried out by anyone wishing to inspect the results at a later time.

The above requirement restricts the queries on which the search engine can be evaluated. The entire set of search results must be examined, thus the query has to be such that the search engine presents all the hits for the given query. Most search engines limit the number of displayed results - they usually do not display more than the first 1000 results (AltaVista displays only 200, but this problem can be partially solved by carrying out several searches limited to different dates of creation of the URLs). Further steps must be taken in order to retrieve all the hits for search engines that cluster the search results.

We also need a method to decide which of the retrieved documents are "relevant" to the query. Relevance is a very difficult notion and has been heavily discussed by the IR community [see for example (Saracevic, 1975) or (Mizzaro, 1997)] - there is no general agreement on how to judge relevance, even though relevance is the basis for computing the most widely used IR evaluation measures: precision, recall and coverage. Human relevance judgment in the case of periodic searches with a large number of results is not feasible, thus we defined a more lenient measure, called technical relevance, that can be computed automatically. A document is defined to be technically relevant if it fulfills all the conditions posed by the query: all search terms and phrases that are supposed to appear in the document do appear, and all terms and phrases that are supposed to be missing from the document - terms preceded by a minus sign or a NOT operator - do not appear in the document. A URL is called a technically relevant URL if it contains a technically relevant document (Bar-Ilan, 2002). Lawrence and Giles also took this approach (1999), even though they point out: "search engines can return documents that do not contain the query terms (for example documents with morphological variants or related terms)."
It is advisable to choose query terms with as few morphological variants as possible (Northern Light, for example, did not differentiate between pages in which the query term appears in singular or in plural - in the case of simple plurals). From our experience, currently, related terms or concepts are very rarely substituted for the original query terms. Thus the notion of technical relevance provides a fast and easy method to differentiate between pages "about" the search topics and pages that clearly have nothing to do with the query (including broken links and otherwise inaccessible pages).
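As a rough illustration only (the query representation below - a list of required terms or phrases and a list of excluded ones - is our assumption, not part of the paper), technical relevance can be checked automatically along the following lines:

    # A rough sketch, assuming page_text is the downloaded page: a page is
    # technically relevant if every required term or phrase appears in it
    # and no excluded one does.

    def technically_relevant(page_text, required, excluded=()):
        text = page_text.lower()
        return (all(term.lower() in text for term in required)
                and not any(term.lower() in text for term in excluded))

    # e.g. technically_relevant(downloaded_html, ["aporocactus"]) for the
    # single-term query used in the case study discussed below.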

The Metrics - Definitions

To evaluate the percentage of broken links, we define:

    broken(q, i) = (# broken links) / (total # results retrieved for query q in search round i)

There are temporary communication failures which may result in 404 messages, thus a second attempt must be made (at a slightly later time) to download these pages before deciding that they are really missing or inaccessible.

Next we introduce new, which counts the number of newly added URLs, for i > 1:

    new(q, i) = {technically relevant URLs retrieved in round i} -
                {technically relevant URLs retrieved by the search engine in search round j, where j < i}

This measure is influenced both by the growth of the subject on the Web and by the rate at which the search engine adds new pages to its database. A new page may be added to the search engine's database for two reasons: 1) the page has been created recently or its content was recently changed, so that the page became relevant to the query; 2) the page had already existed and had been relevant to the query for a "long" time, but the search engine only recently discovered it and added it to its database. In order to try to differentiate between the two factors influencing new, we may run the same query on several large search engines in parallel, and try to create an "exhaustive" pool of pages technically relevant to the query for each search round. Then we can partition new(q, i) into totally-new(q, i, s) and newly-discovered(q, i, s), where

    totally-new(q, i, s) = {technically relevant URLs retrieved by search engine s in search round i} -
                           {URLs in the pool of URLs retrieved before round i}

    newly-discovered(q, i, s) = new(q, i) - totally-new(q, i, s)

There are no easy means to decide whether the search engine's information is outdated, except, perhaps, in case the document is totally unrelated to the query (not even the same concept). Some partial conclusions may be drawn from the search engine's summaries. An exception is Google, which caches most of the URLs it visits; thus it is possible to compare the downloaded pages with the cached version. In order to carry out this comparison, the cached documents should also be downloaded - to compare the local and the Web copy as they existed at the given point in time. These suggestions are "work-arounds", thus we have not defined a measure to evaluate the extent to which the search engine's information is outdated. Measures like freshness (Cho & Garcia-Molina, 2000) can also be used for evaluation, in case the evaluator has access to the search engine's database and the page as seen by the crawler can be reconstructed.

In order to evaluate stability, we introduce the following measures for i > 1:

    forgotten(q, i) = {technically relevant URLs retrieved in round (i-1) that exist on the Web and are
                       technically relevant at round i, but are not retrieved in round i}

A dropped URL is a URL that disappeared from the search engine's database, even though it still exists on the Web and continues to be technically relevant; forgotten(q, i) counts the number of dropped URLs in round i. A URL u that was dropped in round i may reappear in the database at some later round j (our experience shows that this does happen). Such URLs are called rediscovered URLs.
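Before turning to the rediscovered URLs, the measures defined so far could be computed roughly as follows (a minimal sketch; the per-round data structures - results[i] and broken_links[i] for all retrieved and broken URLs, relevant[i] for the technically relevant URLs retrieved in round i, and still_relevant(u, i) for checking a URL on the live Web - are our assumptions):

    # A minimal sketch, assuming per-round sets of URLs collected as described in
    # the framework section; rounds are numbered 1..n.

    def broken(results, broken_links, i):
        return len(broken_links[i]) / len(results[i])

    def new(relevant, i):
        seen_before = set().union(*(relevant[j] for j in range(1, i)))
        return len(relevant[i] - seen_before)

    def forgotten(relevant, still_relevant, i):
        # dropped URLs: retrieved in round i-1, still live and technically
        # relevant at round i, but not retrieved in round i
        return len({u for u in relevant[i - 1] - relevant[i] if still_relevant(u, i)})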
Recovered(q, j) counts the number of rediscovered URLs in round j > 1:

    recovered(q, j) = {technically relevant URLs retrieved in round j that were dropped in round i, i < j,
                       AND were not retrieved in round j-1}

If a URL u appeared for the first time in round k, was dropped in round i > k, reappeared in round j > i, and was retrieved again in round j+1, it will be rediscovered in round j but not in round j+1; i.e., a URL is counted in recovered in the first round in which it reappears after being dropped.

It may be the case that the URL was dropped in round i because the server on which it resides was down at the time the crawler tried to visit it. This may account for some (small) part of forgotten. There are two other possible explanations. The first: the search engine has limited resources, and it has to keep the balance between the newly discovered pages and the old pages in its database. The second explanation relates to the crawling policy of the search engine. If the search engine uses shadowing (see Arasu, 2000) - it has a database that serves the queries, and another database that is being built based upon the current crawling, and the "new" database replaces the old one at some point in time - it is possible that the new database covers a substantially different set from the old one.

It is well known that there is a lot of content duplication on the Web. Some of it is intentional (mirror sites), some results from simple copying of Web pages, and some is due to different aliases of the same physical address. Thus it is conceivable that the search engine dropped a given URL u because it located another URL u' in its database with exactly the same content. To evaluate the extent to which content is lost we define:

    lost(q, i) = {dropped URLs in round i for which there is no other URL, retrieved for q in round i,
                  with exactly the same content}

Lost URLs are those URLs which were dropped and for which the search engine did not retrieve any content duplicates in the current search round. These URLs cause real information loss for the users: information that was accessible through the search engine before is not accessible anymore, even though the information is still available and pertinent on the Web. The results of a case study (Bar-Ilan, 2002) show that not only is a high percentage of the URLs dropped and rediscovered, but a significant portion of them is also lost.

A URL can be dropped and then rediscovered several times during the search period. In order to assess the search performance over the whole search period we define:

    well-handled(q) = {technically relevant URLs retrieved for q that were never dropped during the search period}

The URLs counted in well-handled are not necessarily retrieved during the whole search period. Such a URL can first appear in round i > 1, and disappear from the list of retrieved URLs in round j > i, if it disappears from the Web or ceases to be technically relevant to the query. A mishandled URL is a URL that was dropped at least once during the search period. Recall that dropped means that the URL wrongfully disappeared from the list of URLs retrieved for the query.

    mishandled(q) = {union of the dropped URLs in round i, for i > 1}

The set of mishandled URLs can be further partitioned into mishandled-forgotten(q) - the URLs that were not rediscovered at some later time - and mishandled-recovered(q). The last two measures assess the variability of the search results over time, and supplement the measures new and forgotten:

    self-overlap(q, i, j) = {technically relevant URLs that were retrieved both in round i and in round j} /
                            {technically relevant URLs retrieved in round j}

Let All(q) denote the set of all technically relevant URLs that were retrieved for the query q during the whole search period. Note that this is a virtual set, since it may include URLs that never coexisted on the Web at the same time.

self-overlap(q,i) = |{technically relevant URLs retrieved in round i}| / |All(q)|

High self-overlap for all search rounds indicates that the search engine results for the given query are stable. Note that very high values of self-overlap not only indicate stability, but may also be a warning sign that the search engine's database is becoming out of date, i.e. has not changed substantially for a long period of time. Thus for this measure the "optimal" values are neither very high nor very low.

Evaluating Using these Measures and Future Work

An initial study (Bar-Ilan, 1999) was carried out for a period of five months in 1998 with the query "informetrics OR informetric", using the six largest search engines at the time (AltaVista, Excite, HotBot, Infoseek, Lycos and Northern Light). In this study forgotten and recovered were computed, and Excite "forgot" 72% of the technically relevant URLs it retrieved. In each of the search rounds Excite retrieved almost the same number of results (158 URLs on average), but when we compared the sets of URLs we were rather surprised to discover that the overlap between the sets was very small: during the whole search period Excite retrieved a total of 535 technically relevant URLs. This result shows that it is not sufficient to look only at the number of results; the URLs themselves must also be examined.

We carried out a second case study (Bar-Ilan, 2002) evaluating most of the above-defined measures for a whole year. The query in the case study was "aporocactus". This word has few (if any) morphological variants. The query was run on six search engines (AltaVista, Excite, Fast, Google, HotBot and Northern Light) in parallel. The search engines mishandled between 33 and 89 percent of the technically relevant URLs retrieved by them during the whole search period. This time the search engines that mishandled the largest percentages of URLs were Google (89%) and HotBot (51%), even though Google retrieved by far the largest number of technically relevant URLs during the whole period: Google covered more than 70% of the set All. Except for Northern Light, almost all of the forgotten URLs were also lost, i.e. we were unable to locate in the search results another URL with exactly the same content.

Naturally, we cannot draw any definite conclusions about specific search engines based on two queries during two search periods, but the case studies indicate the usefulness of the evaluation criteria defined in this paper. The "optimal search engine" should have high values for new, corresponding to the growth of the subject on the Web. Ideally the number of mishandled URLs should be zero, but as we explained before, the search engine has to decide how to utilize its available resources, and has to compromise between adding new pages and removing old ones. The number of broken links should also approach zero, while self-overlap should be neither very high nor very low. When counting dropped and lost URLs we may also want to look at the ranks of these URLs: are these mostly low-ranked URLs, or is the distribution more or less uniform? Dropping low-ranked URLs would correspond to the policy announced by Inktomi (Sullivan, 2000b) of removing non-popular URLs from the database. The above-mentioned case study did not look at the ranks of the dropped URLs. The criteria introduced here are a first step in defining a set of measures for evaluating search engine performance in dynamic environments.
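To make the definitions concrete, the following is a minimal illustrative sketch (not code from the cited studies) of how the round-based measures could be computed, assuming that rounds[j] holds the set of technically relevant URLs retrieved in round j and that fingerprints maps each URL to a hash of its content; the helper names are hypothetical. A faithful implementation of dropped would additionally verify that a missing URL is still accessible and still technically relevant, i.e. that it disappeared wrongfully.

    def dropped(rounds, j):
        # URLs seen in some earlier round but missing from round j; see the caveat
        # above about checking that such URLs are still on the Web and still relevant.
        return set().union(*rounds[:j]) - rounds[j]

    def recovered(rounds, j):
        # URLs retrieved in round j that were retrieved before round j-1 but not in
        # round j-1, i.e. rediscovered after having been dropped.
        return {u for u in rounds[j] - rounds[j - 1]
                if any(u in rounds[k] for k in range(j - 1))}

    def mishandled(rounds):
        # URLs dropped in at least one round of the search period.
        return set().union(*(dropped(rounds, j) for j in range(1, len(rounds))))

    def well_handled(rounds):
        # Technically relevant URLs that were never dropped during the search period.
        return set().union(*rounds) - mishandled(rounds)

    def self_overlap(rounds, i, j):
        # Fraction of round j's URLs that were also retrieved in round i.
        return len(rounds[i] & rounds[j]) / len(rounds[j])

    def lost(rounds, fingerprints, j):
        # Dropped URLs for which no URL with exactly the same content was retrieved in round j.
        present = {fingerprints[u] for u in rounds[j]}
        return {u for u in dropped(rounds, j) if fingerprints.get(u) not in present}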
Future work should be carried out in both theoretical and practical directions. We need to define additional criteria and to refine existing ones, and to carry out additional, larger scale experiments to study the usefulness and applicability of the measures.

References

Arasu, A., Cho, J., Garcia-Molina, H., Paepcke, A., & Raghavan, S. (2000). Searching the Web. Stanford University Technical Report. [Online].
Bar-Ilan, J. (1999). Search Engine Results over Time - A Case Study on Search Engine Stability. Cybermetrics, 2/3. [Online].
Bar-Ilan, J. (2000). Evaluating the Stability of the Search Tools Hotbot and Snap: A Case Study. Online Information Review, 24(6).
Bar-Ilan, J. (2002). Methods for Measuring Search Engine Performance over Time. JASIST, 54(3).
Bar-Ilan, J., & Peritz, B. C. (1999). The Life Span of a Specific Topic on the Web; the Case of 'Informetrics': a Quantitative Analysis. Scientometrics, 46(3).
Berenci, E., Carpineto, C., Giannini, V., & Mizzaro, S. (1999). Effectiveness of keyword-based display and selection of retrieval results for interactive searches. Lecture Notes in Computer Science, 1696.
Bharat, K., & Broder, A. (1998). A Technique for Measuring the Relative Size and Overlap of Public Web Search Engines. In Proceedings of the 7th International World Wide Web Conference, April 1998. [Also online].
Brewington, B. E., & Cybenko, G. (2000). How Dynamic is the Web? In Proceedings of the 9th International World Wide Web Conference, May 2000. [Also online].
Chakrabarti, S., Dom, B., Kumar, R. S., Raghavan, P., Rajagopalan, S., Tomkins, A., Kleinberg, J. M., & Gibson, D. (1999). Hypersearching the Web. Scientific American, 280(6). [Also online].
Cho, J., & Garcia-Molina, H. (2000). Synchronizing a Database to Improve Freshness. SIGMOD Record, 29(2).
Chu, H., & Rosenthal, M. (1996). Search Engines for the World Wide Web: A Comparative Study and Evaluation Methodology. ASIS 96. [Online].
Clarke, S. J., & Willett, P. (1997). Estimating the Recall Performance of Web Search Engines. Aslib Proceedings, 49(7).
Courtois, M. P. (1996). Cool Tools for Searching the Web - An Update. Online, May/June 1996.
Douglis, F., Feldmann, A., Krishnamurthy, B., & Mogul, J. (1997). Rate of Change and Other Metrics: A Live Study of the World Wide Web. In Proceedings of the Symposium on Internet Technologies and Systems, Monterey, California, December 8-11, 1997. [Online].
Holscher, C., & Strube, G. (2000). Web Search Behavior of Internet Experts and Newbies. In Proceedings of the 9th International World Wide Web Conference, May 2000. [Online].
Gordon, M., & Pathak, P. (1999). Finding Information on the World Wide Web: The Retrieval Effectiveness of Search Engines. Information Processing and Management, 35.

Jansen, B. J., Spink, A., & Saracevic, T. (2000). Real Life, Real Users, and Real Needs: A Study and Analysis of User Queries on the Web. Information Processing and Management, 36.
Koehler, W. (1999). An Analysis of Web Page and Web Site Constancy and Permanence. JASIS, 50(2).
Koehler, W. (2002). Web Page Change and Persistence - A Four Year Longitudinal Study. JASIST, 50(2).
Lawrence, S., & Giles, C. L. (1998). Searching the World Wide Web. Science, 280.
Lawrence, S., & Giles, C. L. (1999). Accessibility and Distribution of Information on the Web. Nature, 400.
Leighton, H. V., & Srivastava, J. (1999). First 20 precision among World Wide Web search services (search engines). JASIS, 50.
Mizzaro, S. (1997). Relevance: The Whole History. JASIS, 48(9).
Notess, G. (no date). AltaVista Inconsistencies. [Online].
Rousseau, R. (1999). Daily Time Series of Common Single Word Searches in AltaVista and NorthernLight. Cybermetrics, 2/3(1), paper 2. [Online].
Saracevic, T. (1975). RELEVANCE: A Review of and a Framework for the Thinking on the Notion in Information Science. JASIS, November-December 1975.
Sherman, C. (1999). The Search Engines Speak. In Web Search. [Online].
Singhal, A., & Kaszkiel, M. (2001). A Case Study in Web Search Using TREC Algorithms. In Proceedings of the 10th International World Wide Web Conference, May 2001. [Online].
Sullivan, D. (2000a). NDP Search and Portal Site Study. In Search Engine Watch Reports. [Online].
Sullivan, D. (2000b). The Search Engine Update, August 2, 2000, Number 82. [Online].
Tombros, A., & Sanderson, M. (1998). The advantages of query-biased summaries in Information Retrieval. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval.
Watson, J. S. (1998). "If You Don't Have It, You Can't Find It." A Close Look at Students' Perceptions of Using Technology. JASIS, 49(11).
Zamir, O., & Etzioni, O. (1999). Grouper: A Dynamic Clustering Interface to Web Search Results. In Proceedings of the 8th International World Wide Web Conference, May 1999. [Online].
Zhu, X., & Gauch, S. (2000). Incorporating Quality Metrics in Centralized/Distributed Information Retrieval on the World Wide Web. In Proceedings of the 23rd International ACM SIGIR Conference, July 2000, Athens, Greece.

Query Language for Structural Retrieval of Deep Web Information

Stefan Müller (1), Ralf-Dieter Schimkat (1), and Rudolf Müller (2)
(1) University of Tübingen, WSI for Computer Science, Sand 13, Tübingen, Germany
(2) University of Maastricht, Department of Quantitative Economics, P.O. Box 616, 6200 MD Maastricht, The Netherlands

Abstract. Information provided on the Web is often hidden in databases and dynamically extracted by script pages according to some user input. Web information systems follow this concept when they must handle a huge amount of fast-changing, related data (e.g. Web shops, route planners, yellow pages). In contrast to static pages, it is very hard for traditional search engines to properly index pages with non-static content in a way that lets Web users perform a precise search. The information that is provided through dynamic script pages is often called the Hidden or Deep Web. We present a retrieval approach that uses the structure of the database schemas and further schema information to calculate a similarity between a structured query and registered Deep Web information systems. Our retrieval process is a combination of structured and keyword-based retrieval. This combination should overcome the drawback of a solely keyword-based indexing method, which is insufficient to exactly describe the content of the Deep Web. Dynamically and individually adjustable retrieval behavior further improves the usability of our approach.

1 Introduction

The World-Wide-Web (WWW) as a global repository for information is a big improvement for the quality of life of people around the world: organizing a journey, buying or selling goods, or just searching for detailed and up-to-date information becomes cheaper, faster, and easier. The main problem is to find suitable pages to perform the individual tasks. The research efforts taken to solve this problem have brought up a lot of different technologies, e.g. to better describe the content of Web pages (RDF [15]) and Web services (UDDI [14]), and elaborated index and search strategies (PageRank, used by Google [13]). The WWW contains a lot of pages that are not explicitly made persistent in a Web-accessible manner. These pages are dynamically constructed by scripts or servlets according to some user input, e.g. travel destinations or product lists of Web shops. The set of those pages is called the Deep Web. Information providers use dynamic pages when they have to handle a mass of related information that is rapidly changing over time. Describing this information in a way that traditional search engines can present it to an interested user is only possible if the Uniform Resource Locator (URL) is used to transmit user input to the Web site.

Search engines fed with keywords like "flight", "from", "Frankfurt", "to", "Hawaii" would probably not deliver a link to a travel agency, because "Frankfurt" and "Hawaii" are Deep Web information and therefore not available for the index generation of search engines. A keyword description of script pages (the entry points to the Deep Web) can only describe the Deep Web information and the real content of a Web site very vaguely. Additionally, a large part of the information provided by keywords lies in the relations between them, and these relations are not expressed. A search engine for Deep Web information should use this structural information, too.

It is interesting to see that the structural information (lost when a user's question is transformed into a set of keywords) is actually used to manage the information of the Deep Web: most of the sites offering dynamic Web pages store their data within relational databases, for which Entity-Relationship models (ER models [4]) accurately describe the content and relations. ER models are abstract enough to deliver information about the stored data without using the data itself. Imagine a search engine that indexes the internal ER models of Deep Web information systems and provides a front end to formulate queries in a structural manner like ER modelling. This paper presents a realization of this approach based on a framework for general model retrieval [12].

The paper is structured as follows: we briefly describe the components of the framework and their application in the domain of ER models in the next section. Section 3 evaluates the structural retrieval approach in a class room experiment. Section 4 shows how the structural query approach can be combined with the necessary keyword-based retrieval. Dynamic control of retrieval behavior is explained in section 5. Before we summarize and conclude this paper, section 6 gives an overview of other approaches to search for Deep Web information.

2 Structural Retrieval of ER models

The management of complex models like ER models is a very popular scientific field. Mapping, matching, or merging models is required in many situations, e.g. when companies want to share data contained in different kinds of databases or documents. There is a need for more abstract data descriptions and operations to automatically perform the given tasks. Bernstein et al. described in [1] an algebra on a generalized view of complex models. Matching of models with respect to their inherent structure is one of the issues that have to be tackled. Firestorm [10] is a framework for the retrieval of models according to their implicit graph structure. The details are presented in [11] and [12]. The core of the framework is a representation of models through three-layer graphs called Structured Service Models (SSMs). This representation is based on the concepts of Structured Modelling [6], which is used to formulate mathematical models.

Three-layer graphs allow the design of algorithms that compute the graph similarity between different SSMs. After stating a query in a graphical way, the system returns an ordered list of the models that are most similar to the query with respect to the graph structures.

Fig. 1. Mapping between ER model and SSM

We extended the framework to the field of ER models by the definition of a one-to-one mapping between ER models and SSMs following the rules of [12]: different kinds of entities become different kinds of nodes on the first layer. Relationships become nodes of the third layer, and the relations between entities and the dependencies between entities and relationships become nodes on the second layer, with types derived from the cardinalities of the relations. Figure 1 shows a simple ER model and its corresponding SSM. The mapping rules enforce that SSMs always decompose into two bipartite graphs. Apart from the mappings, we implemented graphical user interfaces used to state queries and to visualize the stored SSMs as natural ER models.

The server part of Firestorm remains merely unchanged. It stores SSMs in a relational database and provides algorithms to compute the similarity between SSMs. This similarity exploits the adjacency structure of SSM graphs. Given two graphs G = (V, E) and G' = (V', E'), we look at partial mappings π of the nodes of G onto the nodes of G' which only map nodes onto nodes of the same type. The number q expresses the number of edges in G that are implicitly mapped onto edges in G', i.e. edges (v, w) ∈ E such that (π(v), π(w)) ∈ E'. The matching quality realized by the mapping π is defined as d(π) = 2q / (|E| + |E'|). The similarity between two graphs is then defined as the maximum over all matching qualities of mappings from G onto G'. Mappings are always one to one. The matching quality is a rational number between 0 and 1, with a value of 1 indicating that there exists a mapping between the library graph and the query graph that matches exactly all edges. As we can assume w.l.o.g. that there are no isolated nodes in an SSM, this is the case if and only if both graphs are isomorphic.
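To illustrate the measure, the following is a small brute-force sketch (my own code, not the Firestorm implementation, which exploits the three-layer structure instead of exhaustive search) that computes the maximum of d(π) over all type-preserving, one-to-one partial mappings between two tiny typed graphs. Nodes are given as a dictionary from node to type, and edges as a set of node pairs, treated as undirected here.

    from itertools import product

    def ssm_similarity(nodes1, edges1, nodes2, edges2):
        # Allowed images of each node: any node of the same type in the other graph,
        # or None, meaning the node is left unmapped (pi is a partial mapping).
        candidates = [[None] + [w for w in nodes2 if nodes2[w] == nodes1[v]] for v in nodes1]
        best = 0.0
        for image in product(*candidates):
            mapped = [w for w in image if w is not None]
            if len(mapped) != len(set(mapped)):
                continue  # pi must be one-to-one
            pi = dict(zip(nodes1, image))
            # q = number of edges of G implicitly mapped onto edges of G'
            q = sum(1 for v, w in edges1
                    if pi[v] is not None and pi[w] is not None
                    and ((pi[v], pi[w]) in edges2 or (pi[w], pi[v]) in edges2))
            best = max(best, 2.0 * q / (len(edges1) + len(edges2)))
        return best

    # Example: a tiny graph matched against itself yields similarity 1.0.
    nodes = {"a": "Entity", "r": "Relationship"}
    edges = {("a", "r")}
    assert ssm_similarity(nodes, edges, nodes, edges) == 1.0

The enumeration is exponential in the number of nodes, which is why such a naive approach is only feasible for very small graphs; the next paragraph explains how Firestorm avoids it.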

The authors of [11] showed that finding a mapping of optimal quality is NP-hard. But the three-layer structure of SSMs supports the design of heuristic and exact retrieval algorithms as well as a fast filter. The exact algorithm enumerates all feasible mappings of nodes from the second layer and solves for each of them the matching problems on layers 1 and 3 to optimality by using algorithms for weighted bipartite matching. The exact algorithm is polynomial for a fixed number of nodes on layer 2, and a reasonable algorithm for small numbers of nodes on layer 2. Filter and heuristic algorithms speed up the retrieval time tremendously by reducing the set of graphs to which the exact algorithm has to be applied. Further details of the algorithms are explained in [11].

Fig. 2. Applet interface of Firestorm

As an example, figure 2 shows the search for a site that deals with information in the context of online banking. Customers have accounts and perform transactions. There is actually no online banking system registered in Firestorm, and we get a top match for a system that logs car accidents because the graph structure is identical. Obviously this is not an answer we want to achieve. So there is a need to combine the structural retrieval with context-related keyword search.

Section 4 focuses on that task. Furthermore, we added new functionality to Firestorm to fine-tune the retrieval algorithms with respect to the retrieval situation of a user. These functionalities are explained in section 5. But first of all, reporting experiences with a class room exercise will indicate that structural similarity of ER models is a good measure for the closeness of corresponding real world situations (if the context is predefined).

3 Evaluation of Structural Retrieval

This section will show how well the structures of SSMs capture the semantics of the corresponding ER models. Modelling is an art, and one may argue that, e.g., the same real world situation can be modelled in very different ways, leading to different SSMs. The structural retrieval approach reduces the detection of similarity between different ER models to the graph similarity between their SSMs, but the retrieval quality depends on semantic similarity between ER models. The restrictiveness in structure and the available means to create ER models somewhat limit the modelling freedom, but a mathematical quality proof is out of sight. As explained in the previous section, Firestorm was extended to the domain of ER models. This means that the system can report the similarity between two different ER models.

The focus of our working group at the University of Tübingen (Germany) is on database and information systems. Part of a database course is to learn how to create ER models that describe an aspect of the real world to be managed with a database. In this course Firestorm is used by students to create and manage their ER models. The course contains three exercises that require students to design ER models for three different real world cases. These cases are formulated as natural language text. The text implies the naming of entities and relationships, so that we can concentrate only on the structure. Some staff members evaluated the solutions of the students and assigned credit points according to how well the ER models represent the real world cases. For each exercise a staff member created a master model representing the expected solution. This made it possible to evaluate whether high similarity on the SSMs (respectively the corresponding ER models) reflects a certain similarity according to the semantics of the modelled real world cases, expressed by a high grade. A positive answer would promise the applicability of the structural retrieval approach to the domain of ER models. These master models were used as queries to the system, with the corresponding student models forming the library. The system computed the (graph) similarity between each student model and the master model. The similarities were converted into credit points by simply multiplying the similarities by the maximal credit points.

The results are encouraging. Figure 3 relates the credit points assigned by the staff members to the credit points assigned by Firestorm. The differences in credit point assignment are marginal, so it seems that graph similarity between SSMs can be used to determine semantic similarity between ER models (as long as the context is clear).

Fig. 3. Credit point assignment (staff vs. Firestorm)

4 Combining Structural and Keyword-based Retrieval

The results of the previous section are very promising, but as the example usage of section 2 shows, structural retrieval on its own cannot distinguish between different contexts. The application of Firestorm on the Web, with thousands of sites offering heterogeneous information, needs a mechanism to influence the structural retrieval by something describing the context. A meaningful context description is given by the names of the elements (e.g. entities, relationships) that compose an ER model. To calculate the structural similarity, the algorithms map nodes onto nodes and count the number of implicitly overlapping edges. For all algorithms we can define weights per edge without changing the algorithm structure. Structural retrieval can then be combined with some kind of keyword-based retrieval in the following way:

1. The system fills a matrix that keeps the closeness between node names for each combination of nodes. If the system only allows the usage of names from a controlled vocabulary or names with a given semantics (e.g. by RDF [15]), normalized string distances (e.g. edit distance) express the closeness by a value between 0 and 1, with a value of 1 indicating that two names have an identical meaning.

2. The retrieval algorithms use the values of the matrix as the weight of a realized edge mapping, which is the mean value of the corresponding node mapping matrix values. The similarity between structurally identical graphs that have no node names in common thus becomes much lower.
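A minimal sketch of this combination (my own code and naming; difflib's sequence-matcher ratio stands in for a normalized edit distance, which is an assumption, since the paper only requires some normalized string distance):

    import difflib

    def name_closeness(a, b):
        # Normalized string closeness in [0, 1]; 1 means the names are treated as identical.
        return difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()

    def closeness_matrix(names1, names2):
        # Step 1: closeness for every combination of node names.
        return {(n1, n2): name_closeness(n1, n2) for n1 in names1 for n2 in names2}

    def weighted_q(edges1, edges2, pi, matrix):
        # Step 2: instead of counting a mapped edge as 1, weight it by the mean
        # closeness of the two node-name pairs realized by the mapping pi.
        total = 0.0
        for v, w in edges1:
            if pi.get(v) is None or pi.get(w) is None:
                continue
            if (pi[v], pi[w]) in edges2 or (pi[w], pi[v]) in edges2:
                total += 0.5 * (matrix[(v, pi[v])] + matrix[(w, pi[w])])
        return total  # plug into d(pi) = 2 * weighted_q / (|E| + |E'|)

The weighted sum replaces the plain edge count q in d(π), so two structurally identical graphs with unrelated node names receive a much lower similarity, as intended.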

5 Dynamic Recall/Precision Control

As in any other retrieval system, the retrieval quality can be measured in terms of recall and precision. Recall is the quotient of the number of relevant ER models contained in the result set and the number of all relevant ER models in the system with respect to a query. Precision is the quotient of the number of relevant ER models contained in the result set and the size of the result set with respect to a given query. Recall and precision are not objective measures, because each user has an individual definition of relevance. In the following we use recall and precision in a weaker sense, defining that all ER models are relevant with respect to a query if their similarity is more than a certain threshold. The threshold can be defined by a user just before starting a retrieval action. Additionally, we implemented a functionality to control the behavior of the structural retrieval algorithms.

Fig. 4. Node type tree of the second layer

The control functionality relies on a type tree of SSM nodes. As we know from section 2, the elements of ER models are mapped onto SSM nodes of different types. We define a type tree that starts with a generic root type and specializes the types along the tree paths. Figure 4 shows the subtree of the second-layer SSM node types for the domain of ER models. Following the root type, the second level defines the types of all nodes on the second layer. These nodes represent connections between entities and relationships. The third level of the tree distinguishes between (*,N) connections and (*,1) connections, which represent two different kinds of connection cardinalities in (min, max) notation. The fourth level of the tree captures the most specific cardinality distinction ((0,1), (1,1), (0,N), (1,N) and (N,M)).

Assume that the SSM of the query model and the SSMs of all library models only contain nodes of the most specific type tree levels. Before a user triggers a retrieval action, the user specifies for each layer which tree level should be used by the algorithms for the similarity computation, in other words which subtypes can be matched to each other. This means, e.g., that if a user chooses the third tree level for the second layer, the algorithms can map a node of type (0,N) onto a node of type (1,N); this is impossible if the user chooses the fourth tree level for the second layer.
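As an illustration, the type tree of Figure 4 can be encoded as ancestor paths, and the user-chosen level then decides which second-layer node types may be mapped onto each other (a sketch with my own encoding; the type names follow the figure, and level 1, the generic root at which everything matches, is omitted):

    TYPE_PATHS = {
        "(0,1)": ("Connection", "(*,1)", "(0,1)"),
        "(1,1)": ("Connection", "(*,1)", "(1,1)"),
        "(0,N)": ("Connection", "(*,N)", "(0,N)"),
        "(1,N)": ("Connection", "(*,N)", "(1,N)"),
        "(N,M)": ("Connection", "(*,N)", "(N,M)"),
    }

    def generalize(node_type, level):
        # Truncate a most-specific type to the chosen tree level (2, 3 or 4).
        return TYPE_PATHS[node_type][level - 2]

    def can_match(type_a, type_b, level):
        # Two second-layer nodes may be mapped onto each other iff their types
        # agree once generalized to the user-chosen level of the type tree.
        return generalize(type_a, level) == generalize(type_b, level)

    assert can_match("(0,N)", "(1,N)", 3)       # both generalize to (*,N)
    assert not can_match("(0,N)", "(1,N)", 4)   # most specific level: no match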

The effect of this mechanism is that, for a certain threshold, choosing a higher type level reduces the size of the result set. Some of the relevant models may not appear in the result set, but the precision of the remaining models is higher. On the other hand, setting a low type level leads to a large result set with possibly a lot of irrelevant models. But irrelevant models can contain a lot of hints that help to specify the type levels in a way that the results meet a user's expectations. With this functionality a user can adjust the retrieval system towards the individual sense of relevance and quality.

6 Related Work

Searching for information that is hidden in the Deep Web is not a new task. Sites providing information generated by scripts out of data stored in databases have existed nearly as long as the Web itself. There are a number of search engines that index those sites and query them according to some user input. We look at some of those engines and relate them to the structural retrieval approach. One of the most popular commercial Deep Web search engines is LexiBot [3] from BrightPlanet. It searches among 2200 databases and provides a query interface for simple text or boolean queries. The users can adjust the search strategies and result presentation to their own preferences. Also from BrightPlanet is the free search engine CompletePlanet [2]. It contains the addresses of databases organized in a directory structure with categories and subcategories. The search interface allows simple text input. The LII [9] is a free search engine with an annotated, searchable subject directory of about 9000 Web resources. The resources are selected and evaluated by librarians for their usefulness to users of public libraries. Simple text and boolean operators are the means to state queries. The search engines of IntelliSeek (ProFusion [8] and InvisibleWeb [7]) closely resemble the other directory-based search engines mentioned above.

None of the presented search engines uses the database structure of Deep Web information systems. They describe the content of those systems with keywords and organize them (by hand or automatically) within a directory. Query interfaces require simple text and boolean queries. They are easier to use than our query interface, which is based on ER models. But especially the structural dependencies between different information units capture much more semantics and therefore allow a more precise search.

7 Summary and Conclusion

This paper describes a structural retrieval approach to search for Web sites that provide information contained in the Deep Web. To this end we extend the general model retrieval framework Firestorm to the domain of ER models. Influenced by the special requirements of the Web, we describe a technique to combine structural and keyword-based retrieval. A functionality for the dynamic control of retrieval behavior based on type trees provides the users with means to adjust the system to their individual preferences regarding recall and precision.

Users have to state their queries to the system by specifying ER models that describe the structure of the desired information. Many users are unfamiliar with ER modelling, but it is a standardized and powerful way to express dependencies between information objects. The simple graphical user interface should allow users with basic knowledge to specify their own ER models. Another drawback of the approach is the absence of the ER models that Web sites need to register with the system. Most of the target sites store their data within relational databases. If they do not have ER models that describe their databases, a tool could probably create a model out of the database schema. Apart from the sandbox evaluation presented in section 3, the structural retrieval approach has yet to prove its applicability under the real world conditions of the Web. Therefore we hope that many Web sites register with our system to provide Web users a promising way to search for information of the Deep Web.

Looking into the future, the ongoing hype of the Semantic Web [5] brings up many applications that require tools to handle semantic descriptions of information. The semantic description is mostly done following the principles of RDF [15]. RDF definitions resemble graphs that show the dependencies among the individual elements. The management of RDF definitions could benefit a lot from an RDF extension of Firestorm.

References

1. P. A. Bernstein, A. Y. Halevy, and R. Pottinger. A vision of management of complex models. SIGMOD Record, 29(4):55-63.
2. BrightPlanet. CompletePlanet.
3. BrightPlanet. LexiBot.
4. P. P. S. Chen. The entity-relationship model - toward a unified view of data. Proceedings of the 1st Conference on Very Large Databases, Morgan Kaufman pubs. (Los Altos, CA), Kerr (ed.), p. 173.
5. S. Decker. The Semantic Web Community Portal.
6. A. M. Geoffrion. An introduction to structured modeling. Management Science, 33(5).

7. IntelliSeek. InvisibleWeb - The Search Engine of Search Engines.
8. IntelliSeek. ProFusion.
9. Library of California. Librarians' Index to the Internet.
10. S. Müller. FIRESTORM - FIrst a REtrieval SysTem for Operations Research Models.
11. S. Müller and R. Müller. Retrieval of service descriptions using structured service models. In Proceedings of the 10th Annual Workshop on Information Technologies and Systems, pages 55-60.
12. S. Müller and R. Schimkat. A general, web-enabled model retrieval approach. In Proceedings of the International Conference for Object-Oriented Information Systems (OOIS 2001), 2001.
13. L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank Citation Ranking: Bringing Order to the Web.
14. UDDI community. Universal Description, Discovery and Integration of Business for the Web (UDDI).
15. World Wide Web Consortium (W3C). Resource Description Framework.

Exploration versus Exploitation in Topic Driven Crawlers

Gautam Pant, Padmini Srinivasan, and Filippo Menczer
Department of Management Sciences and School of Library and Information Science, The University of Iowa, Iowa City, IA
{gautam-pant,padmini-srinivasan,filippo-menczer}@uiowa.edu

Abstract

The dynamic nature of the Web highlights the scalability limitations of universal search engines. Topic driven crawlers can address the problem by distributing the crawling process across users, queries, or even client computers. The context available to a topic driven crawler allows for informed decisions about how to prioritize the links to be visited. Here we focus on the balance between a crawler's need to exploit this information to focus on the most promising links, and the need to explore links that appear suboptimal but might lead to more relevant pages. We investigate the issue for two different tasks: (i) seeking new relevant pages starting from a known relevant subset, and (ii) seeking relevant pages starting a few links away from the relevant subset. Using a framework and a number of quality metrics developed to evaluate topic driven crawling algorithms in a fair way, we find that a mix of exploitation and exploration is essential for both tasks, in spite of a penalty in the early stage of the crawl.

1 Introduction

A recent projection estimates the size of the visible Web today (March 2002) to be around 7 billion static pages [10]. The largest search engine, Google, claims to be searching about 2 billion pages. The fraction of the Web covered by search engines has not improved much over the past few years [16]. Even with increasing hardware and bandwidth resources at their disposal, search engines cannot keep up with the growth of the Web and with its rate of change [5]. These scalability limitations stem from search engines' attempt to crawl the whole Web, and to answer any query from any user. Decentralizing the crawling process is a more scalable approach, and bears the additional benefit that crawlers can be driven by a rich context (topics, queries, user profiles) within which to interpret pages and select the links to be visited. It comes as no surprise, therefore, that the development of topic driven crawler algorithms has received significant attention in recent years [9, 14, 8, 18, 1, 19].

Topic driven crawlers (also known as focused crawlers) respond to the particular information needs expressed by topical queries or interest profiles. These could be the needs of an individual user (query time or online crawlers) or those of a community with shared interests (topical search engines and portals). Evaluation of topic driven crawlers is difficult due to the lack of known relevant sets for Web searches, the presence of many conflicting page quality measures, and the need for fair gauging of crawlers' time and space algorithmic complexity. In recent research we presented an evaluation framework designed to support the comparison of topic driven crawler algorithms under specified resource constraints [19].

In this paper we further this line of research by investigating the relative merits of exploration versus exploitation as a defining characteristic of the crawling mechanism. The issue of exploitation versus exploration is a universal one in machine learning and artificial intelligence, since it presents itself in any task where search is guided by quality estimations. Under some regularity assumption, one can assume that a measure of quality at one point in the search space provides some information on the quality of nearby points. A greedy algorithm can then exploit this information by concentrating the search in the vicinity of the most promising points. However, this strategy can lead to missing other equally good or even better points, for two reasons: first, the estimates may be noisy; and second, the search space may have local optima that trap the algorithm and keep it from locating global optima. In other words, it may be necessary to visit some bad points in order to arrive at the best ones. At the other extreme, algorithms that completely disregard quality estimates and continue to explore in a uniform or random fashion do not risk getting stuck at local optima, but they do not use the available information to bias the search and thus may spend most of their time exploring suboptimal areas. A balance between exploitation and exploration of clues is obviously called for in heuristic search algorithms, but the optimal compromise point is unknown unless the topology of the search space is well understood, which is typically not the case.

Topic driven crawlers fit into this picture very well if one views the Web as the search space, with pages as points and neighborhoods as defined by hyperlinks. A crawler must decide which pages to visit based on the cues provided by links from nearby pages. If one assumes that a relevant page has a higher probability to be near other relevant pages than to any random page, then quality estimates of pages provide cues that can be exploited to bias the search process. However, given the short range of relevance clues on the Web [17], a very relevant page might be only a few links behind an apparently irrelevant one. Balancing the exploitation of quality estimate information with exploration of suboptimal pages is thus crucial for the performance of topic driven crawlers.

It is a question that we study empirically with respect to two different tasks. In the first, we seek relevant pages starting from a set of relevant links. Applications of such a task are query-time search agents that use the results of a search engine as starting points to provide a user with recent and personalized results. Since we start from relevant links, we may expect an exploratory crawler to perform reasonably well. The second task involves seeking relevant pages while starting the crawl from links that are a few links away from a relevant subset. Such a task may be part of Web mining or competitive intelligence applications (e.g., a search starting from competitors' home pages). If we do not start from a known relevant subset, the appropriate balance of exploration vs. exploitation becomes an empirical question.

2 Evaluation Framework

2.1 Topics, Examples and Neighbors

In order to evaluate crawler algorithms, we need topics, some corresponding relevant examples, and neighbors. The neighbors are URLs extracted from the neighborhood of the examples. We obtain our topics from the Open Directory (DMOZ).
We ran randomized Breadth-First crawls starting from each of the main categories on the DMOZ site. The crawlers identify DMOZ leaves, i.e., pages that have no children category nodes. Leaves with five or more external links are then used to derive topics. We thus collected 100 topics. A topic is represented by three types of information derived from the corresponding leaf page. First, the words in the DMOZ hierarchy form the topic's keywords. Second, up to 10 external links form the topic's examples. Third, we concatenate the text descriptions and anchor text of the target URLs (written by DMOZ human editors) to form a topic description.

The difference between topic keywords and topic descriptions is that we give the former to the crawlers, as models of (short) query-like topics, while we use the latter, which are much more detailed representations of the topics, to gauge the relevance of the crawled pages in our post-hoc analysis. Table 1 shows a sample topic.

Table 1: A sample topic. The description is truncated for space limitations.
Topic keywords: Recreation Hot Air Ballooning Organizations
Topic description: Aerostat Society of Australia - Varied collection of photos and facts about ballooning in Australia, Airships, Parachutes, Balloon Building and more. Includes an article on the Theory of Flight. Albuquerque Aerostat Ascension Association - A comprehensive site covering a range of ballooning topics including the Albuquerque Balloon Fiesta, local education and safety programs, flying events, club activities and committees, and club history. Arizona Hot Air Balloon Club [...]

The neighbors are obtained for each topic through the following process. For each of the examples, we obtain the top 20 inlinks as returned by Google. Next, we get the top 20 inlinks for each of the inlinks obtained earlier. Hence, if we had 10 examples to start with, we may now have a maximum of 4000 unique URLs. A subset of 10 URLs is then picked at random from this set. The links in such a subset are called the neighbors.

2.2 Architecture

We use a previously proposed evaluation framework to compare different crawlers [19]. The framework allows one to easily plug in modules implementing arbitrary crawling algorithms, which share data structures and utilities to optimize efficiency without affecting the fairness of the evaluation. As mentioned before, we use the crawlers for two different tasks. For the first task, the crawlers start from the examples, while for the second the starting points are the neighbors. In either case, as the pages are fetched, their component URLs are added to a list that we call the frontier. A crawler may use the topic's keywords to guide the selection of frontier URLs that are to be fetched at each iteration. For a given topic, a crawler is allowed to crawl up to MAX_PAGES = 2000 pages. However, a crawl may end sooner if the crawler's frontier should become empty. We use a timeout of 10 seconds for Web downloads. Large pages are chopped so that we retrieve only the first 100 KB. The only protocol allowed is HTTP (with redirection allowed), and we also filter out all but static pages with text/html content. Stale links yielding HTTP error codes are removed as they are found (only good links are used in the analysis). We constrain the space resources a crawler algorithm can use by restricting the frontier size to MAX_BUFFER = 256 URLs. If the buffer becomes full, then the crawler must decide which links are to be replaced as new links are added.

3 Crawling Algorithms

In this paper we study the notion of exploration versus exploitation. We begin with a single family of crawler algorithms with a single greediness parameter to control the exploration/exploitation behavior. In our previous experiments [19] we found that a naive Best-First crawler displayed the best performance among the three crawlers considered. Hence, in this study we explore variants of the Best-First crawler. More generally, we examine the Best-N-First family of crawlers, where the parameter N controls the characteristic of interest. Best-First crawlers have been studied before [9, 14]. The basic idea is that, given a frontier of links, the best link according to some estimation criterion is selected for crawling.
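To illustrate how such a bounded, score-ordered frontier can be realized (a sketch of my own, not the authors' implementation; the class and method names are hypothetical but mirror the operations used in the pseudocode of Figure 1 below), one can keep a min-heap of (score, URL) pairs and evict the lowest-scored link when the MAX_BUFFER limit is reached:

    import heapq

    MAX_BUFFER = 256

    class Frontier:
        def __init__(self, max_size=MAX_BUFFER):
            self.max_size = max_size
            self.heap = []     # min-heap of (score, url); the cheapest link to evict is on top
            self.seen = set()  # URLs ever enqueued, to avoid duplicates

        def enqueue(self, url, score=0.0):
            if url in self.seen:
                return
            self.seen.add(url)
            if len(self.heap) < self.max_size:
                heapq.heappush(self.heap, (score, url))
            elif score > self.heap[0][0]:
                heapq.heapreplace(self.heap, (score, url))  # replace the minimum-score link

        def dequeue_top_links(self, n):
            # Remove and return the n highest-scored URLs (the next batch to crawl).
            top = heapq.nlargest(n, self.heap)
            top_set = set(top)
            self.heap = [item for item in self.heap if item not in top_set]
            heapq.heapify(self.heap)
            return [url for _, url in top]

        def __len__(self):
            return len(self.heap)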

Best_N_First(topic, starting_urls, N) {
  foreach link (starting_urls) {
    enqueue(frontier, link);
  }
  while (#frontier > 0 and visited < MAX_PAGES) {
    links_to_crawl := dequeue_top_links(frontier, N);
    foreach link (randomize(links_to_crawl)) {
      doc := fetch_new_document(link);
      score := sim(topic, doc);
      foreach outlink (extract_links(doc)) {
        if (#frontier >= MAX_BUFFER) {
          dequeue_link_with_min_score(frontier);
        }
        enqueue(frontier, outlink, score);
      }
    }
  }
}

Figure 1: Pseudocode of Best-N-First crawlers.

Best-N-First is a generalization in that at each iteration a batch of the top N links to crawl is selected. After completing the crawl of N pages, the crawler decides on the next batch of N, and so on. As mentioned above, the topic's keywords are used to guide the crawl. More specifically, this is done in the link selection process by computing the lexical similarity between a topic's keywords and the source page of the link. Thus the similarity between a page p and the topic is used to estimate the relevance of the pages linked from p. The N URLs with the best estimates are then selected for crawling. Cosine similarity is used by the crawlers, and the links with minimum similarity score are removed from the frontier if necessary in order not to exceed the MAX_BUFFER size. Figure 1 offers a simplified pseudocode of a Best-N-First crawler.

Best-N-First offers an ideal context for our study. The parameter N controls the greedy behavior of the crawler. Increasing N results in crawlers with greater emphasis on exploration and consequently a reduced emphasis on exploitation. Decreasing N reverses this; selecting a smaller set of links is more exploitative of the evidence available regarding the potential merits of the links. In our experiments we test five mutants of the crawler by setting N to 1, 16, 64, 128 and 256. We refer to them as BFSN, where N is one of the above values.
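A minimal sketch of the sim(topic, doc) score used in Figure 1, assuming plain term-frequency vectors and a trivial tokenizer (the exact weighting is my assumption; the paper only states that cosine similarity between the topic's keywords and the page is used):

    import math
    import re
    from collections import Counter

    def tokenize(text):
        return re.findall(r"[a-z0-9]+", text.lower())

    def sim(topic_keywords, page_text):
        # Cosine similarity between the keyword vector and the page's term-frequency vector.
        q = Counter(tokenize(topic_keywords))
        d = Counter(tokenize(page_text))
        dot = sum(q[t] * d[t] for t in q)
        norm = math.sqrt(sum(v * v for v in q.values())) * math.sqrt(sum(v * v for v in d.values()))
        return dot / norm if norm else 0.0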
4 Evaluation Methods

Table 2 depicts our overall methodology for crawler evaluation. The two rows of Table 2 indicate two different methods for gauging page quality. The first is a purely lexical approach wherein similarity to the topic description is used to assess relevance. The second method is primarily linkage based and is an approximation of the retrieval/ranking method used by Google [6]; it uses PageRank to discriminate between pages containing the same number of topic keywords. The columns of the table show that our measures are used both from a static and a dynamic perspective. The static approach examines crawl quality assessed from the full set of (up to 2000) pages crawled for each query. In contrast, the dynamic measures provide a temporal characterization of the crawl strategy, by considering the pages fetched while the crawl is in progress.

More specifically, the static approach measures coverage, i.e., the ability to retrieve good pages, where the quality of a page is assessed in two different ways (corresponding to the rows of the table). Our static plots show the ability of each crawler to retrieve more or fewer highly relevant pages. This is analogous to plotting recall as a function of generality. The dynamic approach examines the quality of retrieval as the crawl progresses. Dynamic plots offer a trajectory over time that displays the dynamic behavior of the crawl. The measures are built on average (quality-based) ranks and are generally inversely related to precision. As the average rank decreases, an increasing proportion of the crawled set can be expected to be relevant. It should be noted that scores and ranks used in each dynamic measure are computed omnisciently, i.e., all calculations for each point in time for a crawler are done using data generated from the full crawl. For instance, all PageRank scores are calculated using the full set of retrieved pages. This strategy is quite reasonable given that we want to use the best possible evidence when judging page quality.
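As an illustration of the two views (my own sketch, not the authors' code), assume rank maps every page in the pooled crawled set S to its omniscient quality rank (1 = best) under either metric, and crawl_sequence lists a crawler's fetched pages in order:

    def static_coverage(crawled_pages, rank, n):
        # Percentage of the top-n ranked pages of S that the crawler retrieved.
        top_n = {p for p, r in rank.items() if r <= n}
        return 100.0 * len(top_n & set(crawled_pages)) / n

    def dynamic_mean_rank(crawl_sequence, rank, t):
        # Mean rank over S_crawler(t), the pages fetched in the first t steps of the crawl.
        visited = crawl_sequence[:t]
        return sum(rank[p] for p in visited) / len(visited)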

Table 2: Evaluation schemes and measures. The static scheme is based on coverage of the top pages (ranked by a quality metric among all crawled pages S); S_crawler is the set of pages visited by a crawler. The dynamic scheme is based on the ranks (by a quality metric among all crawled pages S) averaged over the crawl sets at time t, S_crawler(t).

Static scheme:  lexical: S_crawler ∩ top_rank_SMART(S);  linkage: S_crawler ∩ top_rank_KW,PR(S)
Dynamic scheme: lexical: Σ_{p ∈ S_crawler(t)} rank_SMART(p) / |S_crawler(t)|;  linkage: Σ_{p ∈ S_crawler(t)} rank_KW,PR(p) / |S_crawler(t)|

4.1 Lexical Based Page Quality

We use the SMART system [23] to rank the retrieved pages by their lexical similarity to the topic. The SMART system allows us to pool all the pages crawled by all the crawlers for a topic and then rank these against the topic description. The system utilizes term weighting strategies involving term frequency and inverse document frequency computed from the pooled pages for a topic. SMART computes the similarity between a page and the topic as a dot product of the topic and page vectors. It outputs a ranked set of pages based on their topic similarity scores. That is, for each page we get a rank which we refer to as rank_SMART (cf. Table 2). Thus, given a topic, the percentage of the top n pages ranked by SMART (where n varies) that are retrieved by each crawler may be calculated, yielding the static evaluation metric. For the dynamic view we use the rank_SMART values for pages to calculate the mean rank_SMART at different points of the crawl. If we let S_crawler(t) denote the set of pages retrieved up to time t, then we calculate the mean rank_SMART over S_crawler(t). The set S_crawler(t) of pages increases in size as we proceed in time. We approximate t by the number of pages crawled. The trajectory of mean rank_SMART values over time displays the dynamic behavior of a crawler.

4.2 Linkage Based Page Quality

It has been observed that content alone does not give a fair measure of the quality of a page [15]. Algorithms such as HITS [15] and PageRank [6] use the linkage structure of the Web to rank pages. PageRank in particular estimates the global popularity of a page. The computation of PageRanks can be done through an iterative process. PageRanks are calculated once, after all the crawls are completed. That is, we pool the pages crawled for all the topics by all the crawlers and then calculate the PageRanks according to the algorithm described in [13]. We sort the pages crawled for a given topic, by all crawlers, first based on the number of topic keywords they contain and then sort the pages with the same number of keywords by their PageRank. The process gives us a rank_KW,PR for each page crawled for a topic. Once again, our static evaluation metric measures the percentage of the top n pages (ranked by rank_KW,PR) crawled by a crawler on a topic. In the dynamic metric, the mean rank_KW,PR is plotted over each S_crawler(t), where t is the number of pages crawled.

5 Results

For each of the evaluation schemes and metrics outlined in Table 2, we analyzed the performance of each crawler on the two tasks.

5.1 Task 1: Starting from Examples

For the first task the crawlers start from a relevant subset of links, the examples, and use the hyperlinks to navigate and discover more relevant pages. The results for the task are summarized by the plots in Figure 2. For readability, we are only plotting the performance of a selected subset of the Best-N-First crawlers (N = 1, 256). The behavior of the remaining crawlers (BFS16, BFS64 and BFS128) can be extrapolated between the curves corresponding to BFS1 and BFS256.

Figure 2: Static evaluation (left) and dynamic evaluation (right) of representative crawlers on Task 1. The plots correspond to the lexical (top) and linkage (bottom) quality metrics: (a) static lexical, (b) dynamic lexical, (c) static linkage, (d) dynamic linkage performance. Error bars correspond to ±1 standard error across the 100 topics in this and the following plots.

The most general observation we can draw from the plots is that BFS256 achieves a significantly better performance under the static evaluation schemes, i.e., a superior coverage of the most highly relevant pages, based on both quality metrics and across different numbers of top pages (cf. Figure 2a,c). The difference between the coverage by crawlers for different N increases as one considers fewer highly relevant pages. These results indicate that exploration is important to locate the highly relevant pages when starting from relevant links, whereas too much exploitation is harmful.

The dynamic plots give us a richer picture. (Recall that here the lowest average rank is best.) BFS256 still does significantly better than the other crawlers on the lexical metric (cf. Figure 2b). However, the linkage metric shows that BFS256 pays a large penalty in the early stage of the crawl (cf. Figure 2d), although the crawler regains quality over the longer run. The better coverage of highly relevant pages by this crawler (cf. Figure 2c) may help us interpret the improvement observed in the second phase of the crawl. We conjecture that by exploring suboptimal links early on, BFS256 is capable of eventually discovering paths to highly relevant pages that escape more greedy strategies.

5.2 Task 2: Starting from Neighbors

The success of a more exploratory algorithm on the first task may not come as a surprise, since we start from known relevant pages. However, in the second task we use the links obtained from the neighborhood of the relevant subset as the starting points, with the goal of finding more relevant pages.

Figure 3: Static evaluation (left) and dynamic evaluation (right) of representative crawlers on Task 2. The plots correspond to the lexical (top) and linkage (bottom) quality metrics: (a) static lexical, (b) dynamic lexical, (c) static linkage, (d) dynamic linkage performance.

We take the worst (BFS1) and the best (BFS256) crawlers on Task 1, and use them for Task 2. In addition, we add a simple Breadth-First crawler that uses the limited-size frontier as a FIFO queue. The Breadth-First crawler is added to observe the performance of a blind exploratory algorithm. A summary of the results is shown through the plots in Figure 3. As for Task 1, we find that the more exploratory algorithm, BFS256, performs significantly better than BFS1 under static evaluations for both the lexical and linkage quality metrics (cf. Figure 3a,c). In the dynamic plots (cf. Figure 3b,d), BFS256 seems to bear an initial penalty for exploration but recovers in the long run. The Breadth-First crawler performs poorly on all evaluations. Hence, as a general result, we find that exploration helps an exploitative algorithm, but exploration without guidance goes astray.

Due to the availability of relevant subsets (examples) for each of the topics in the current task, we plot the average recall of the relevant examples against the number of pages crawled (Figure 4). The plot illustrates the target-seeking behavior of the three crawlers if the examples are viewed as the targets. We again find BFS256 outperforming BFS1, while Breadth-First trails behind.

6 Related Research

Research on the design of effective focused crawlers is very vibrant. Many different types of crawling algorithms have been developed. For example, Chakrabarti et al. [8] use classifiers built from training sets of positive and negative example pages to guide their focused crawlers.

Figure 4: Average recall of examples when the crawls start from the neighbors.

Fetuccino [3] and InfoSpiders [18] begin their focused crawling with starting points generated from CLEVER [7] or other search engines. Most crawlers follow fixed strategies, while some can adapt in the course of the crawl by learning to estimate the quality of links [18, 1, 22]. The question of exploration versus exploitation in crawler strategies has been addressed in a number of papers, more or less directly. Fish-Search [11] limited exploration by bounding the depth along any path that appeared suboptimal. Cho et al. [9] found that exploratory crawling behaviors such as implemented in the Breadth-First algorithm lead to efficient discovery of pages with good PageRank. They also discuss the issue of limiting the memory resources (buffer size) of a crawler, which has an impact on the exploitative behavior of the crawling strategy because it forces the crawler to make frequent filtering decisions. Breadth-First crawlers also seem to find popular pages early in the crawl [20]. The exploration versus exploitation issue continues to be studied via variations on the two major classes of Breadth-First and Best-First crawlers. For example, in recent research on Breadth-First focused crawling, Diligenti et al. [12] address the shortsightedness of some crawlers when assessing the potential value of links to crawl. In particular, they look at how to avoid short-term gains at the expense of less obvious but larger long-term gains. Their solution is to build classifiers that can assign pages to different classes based on the expected link distance between the current page and relevant documents.

The area of crawler quality evaluation has also received much attention in recent research [2, 9, 8, 19, 4]. For instance, many alternatives for assessing page importance have been explored, showing a range of sophistication. Cho et al. [9] use the simple presence of a word such as "computer" to indicate relevance. Amento et al. [2] compute the similarity between a page and the centroid of the seeds. In fact, content-based similarity assessments form the basis of relevance decisions in several examples of research [8, 19]. Others exploit link information to estimate page relevance with methods based on in-degree, out-degree, PageRank, hubs and authorities [2, 3, 4, 8, 9, 20]. For example, Cho et al. [9] consider pages with a PageRank score above a threshold as relevant. Najork and Wiener [20] use a crawler that can fetch millions of pages per day; they then calculate the average PageRank of the pages crawled daily, under the assumption that PageRank estimates relevance. Combinations of link and content-based relevance estimators are evident in several approaches [4, 7, 18].

7 Conclusions

In this paper we used an evaluation framework for topic driven crawlers to study the role of exploitation of link estimates versus exploration of suboptimal pages. We experimented with a family of simple crawler algorithms of varying greediness, under limited memory resources, for two different tasks. A number of schemes and quality metrics derived from lexical features and link analysis were introduced and applied to gauge crawler performance. We found consistently that exploration leads to better coverage of highly relevant pages, in spite of a possible penalty during the early stage of the crawl.
7 Conclusions In this paper we used an evaluation framework for topic driven crawlers to study the role of exploitation of link estimates versus exploration of suboptimal pages. We experimented with a family of simple crawler algorithms of varying greediness, under limited memory resources, for two different tasks. A number of schemes and quality metrics derived from lexical features and link analysis were introduced and applied to gauge crawler performance. We found consistently that exploration leads to better coverage of highly relevant pages, in spite of a possible penalty during the early stage of the crawl. An obvious explanation is that exploration allows the crawler to trade off short-term gains for longer-term and potentially larger gains. However, we also found that a blind exploration

when starting from neighbors of relevant pages leads to poor results. Therefore, a mix of exploration and exploitation is necessary for good overall performance. When starting from relevant examples (Task 1), the better performance of crawlers with higher exploration could be attributed to their better coverage of documents close to the relevant subset. The good performance of BFS256 starting away from relevant pages shows that its exploratory nature complements its greedy side in finding highly relevant pages. Extreme exploitation (BFS1) and blind exploration (Breadth-First) impede performance. Nevertheless, any exploitation seems to be better than none. Our results are based on short crawls of 2000 pages. The same may not hold for longer crawls; this is an issue to be addressed in future research. The dynamic evaluations do suggest that for very short crawls it is best to be greedy; this is a lesson that should be incorporated into algorithms for query time (online) crawlers such as MySpiders [21]. The observation that higher exploration yields better results can motivate parallel and/or distributed implementations of topic driven crawlers, since complete orderings of the links in the frontier, as required by greedy crawler algorithms, do not seem to be necessary for good performance. Therefore crawlers based on local decisions seem to hold promise both for the performance of exploratory strategies and for the efficiency and scalability of distributed implementations. In particular, we intend to experiment with variations of crawling algorithms such as InfoSpiders [18], which allow for adaptive and distributed exploratory strategies. Other crawler algorithms that we intend to study in future research include Best-First strategies driven by estimates other than lexical ones. For example, we plan to implement a Best-N-First family using link estimates based on local versions of the rank_{KW,PR} metric used in this paper for evaluation purposes. We also plan to test more sophisticated lexical crawlers such as InfoSpiders and Shark Search [14], which can prioritize over links from a single page. A goal of present research is to identify optimal trade-offs between exploration and exploitation, where either more exploration or more greediness would degrade performance. A large enough buffer size will have to be used so as not to constrain the range of exploration/exploitation strategies as much as happened in the experiments described here due to the small MAX BUFFER. Identifying an optimal exploration/exploitation trade-off would be the first step toward the development of an adaptive crawler that would attempt to adjust the level of greediness during the crawl. Finally, two things that we have not done in this paper are to analyze the time complexity of the crawlers and the topic-specific performance of each strategy. Regarding the former, clearly more greedy strategies require more frequent decisions and this may have an impact on the efficiency of the crawlers. Regarding the latter, we have only considered quality measures in the aggregate (across topics). It would be useful to study how appropriate trade-offs between exploration and exploitation depend on different characteristics such as topic heterogeneity. Both of these issues are the object of ongoing research.
References
[1] C. Aggarwal, F. Al-Garawi, and P. Yu. Intelligent crawling on the World Wide Web with arbitrary predicates. In Proc. 10th Intl. World Wide Web Conference.
[2] B. Amento, L. Terveen, and W. Hill.
Does authority mean quality? Predicting expert quality ratings of web documents. In Proc. 23rd ACM SIGIR Conf. on Research and Development in Information Retrieval.
[3] I. Ben-Shaul, M. Herscovici, M. Jacovi, Y. Maarek, D. Pelleg, M. Shtalhaim, V. Soroka, and S. Ur. Adding support for dynamic and focused search with Fetuccino. Computer Networks, 31(11-16).

[4] K. Bharat and M. Henzinger. Improved algorithms for topic distillation in hyperlinked environments. In Proc. 21st ACM SIGIR Conf. on Research and Development in Information Retrieval.
[5] B. E. Brewington and G. Cybenko. How dynamic is the Web? In Proc. 9th International World-Wide Web Conference.
[6] S. Brin and L. Page. The anatomy of a large-scale hypertextual Web search engine. Computer Networks, 30(1-7).
[7] S. Chakrabarti, B. Dom, P. Raghavan, S. Rajagopalan, D. Gibson, and J. Kleinberg. Automatic resource compilation by analyzing hyperlink structure and associated text. Computer Networks, 30(1-7):65-74.
[8] S. Chakrabarti, M. van den Berg, and B. Dom. Focused crawling: A new approach to topic-specific Web resource discovery. Computer Networks, 31(11-16).
[9] J. Cho, H. Garcia-Molina, and L. Page. Efficient crawling through URL ordering. In Proc. 7th Intl. World Wide Web Conference, Brisbane, Australia.
[10] Cyveillance. Sizing the internet. White paper, July. corporate/white papers.htm.
[11] P. De Bra and R. Post. Information retrieval in the World Wide Web: Making client-based searching feasible. In Proc. 1st Intl. World Wide Web Conference.
[12] M. Diligenti, F. Coetzee, S. Lawrence, C. L. Giles, and M. Gori. Focused crawling using context graphs. In Proc. 26th International Conference on Very Large Databases (VLDB 2000), Cairo, Egypt.
[13] T. Haveliwala. Efficient computation of PageRank. Technical report, Stanford Database Group.
[14] M. Hersovici, M. Jacovi, Y. S. Maarek, D. Pelleg, M. Shtalhaim, and S. Ur. The shark-search algorithm - an application: Tailored Web site mapping. In Proc. 7th Intl. World-Wide Web Conference.
[15] J. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5).
[16] S. Lawrence and C. Giles. Accessibility of information on the Web. Nature, 400.
[17] F. Menczer. Links tell us about lexical and semantic web content. arXiv:cs.IR/.
[18] F. Menczer and R. Belew. Adaptive retrieval agents: Internalizing local context and scaling up to the Web. Machine Learning, 39(2-3).
[19] F. Menczer, G. Pant, M. Ruiz, and P. Srinivasan. Evaluating topic-driven Web crawlers. In Proc. 24th Annual Intl. ACM SIGIR Conf. on Research and Development in Information Retrieval.
[20] M. Najork and J. L. Wiener. Breadth-first search crawling yields high-quality pages. In Proc. 10th International World Wide Web Conference.
[21] G. Pant and F. Menczer. MySpiders: Evolve your own intelligent web crawlers. Autonomous Agents and Multi-Agent Systems, 5(2).
[22] J. Rennie and A. K. McCallum. Using reinforcement learning to spider the Web efficiently. In Proc. 16th International Conf. on Machine Learning. Morgan Kaufmann, San Francisco, CA.
[23] G. Salton. The SMART Retrieval System - Experiments in Automatic Document Processing. Prentice-Hall, Englewood Cliffs, NJ.

WebVigil: An approach to Just-In-Time Information Propagation in Large Network-Centric Environments 1 Sharma Chakravarthy, Jyoti Jacob, Naveen Pandrangi, Anoop Sanka Computer Science and Engineering Department The University of Texas at Arlington, Arlington, TX sharma@cse.uta.edu ABSTRACT Efficient and effective change detection and notification is becoming increasingly important for environments such as the WWW and distributed heterogeneous systems. Change detection for structured data has been studied extensively. Change detection and notification for unstructured data in the form of HTML and XML documents is the goal of this work. The objectives of this work are to investigate the specification, management, and propagation of changes as requested by a user in a timely manner while meeting the quality of service requirements. In this paper, we elaborate on the problem, the issues that need to be addressed, and our preliminary approach. We present an architecture and discuss the functionality that needs to be supported by the various modules in the architecture. We plan on using the active capability in the form of Event-Condition-Action (or ECA) rules developed so far, and a combination of the push and pull paradigms for this problem. 1 Introduction Active rules have been proposed as a paradigm to satisfy the needs of many database and other applications that require a timely response to situations. Event-Condition-Action (or ECA) rules are used to capture the active capability in a system. The utility and functionality of active capability (ECA rules) has been well established in the context of databases. In order for the active capability to be useful for a large class of advanced applications, it is necessary to go beyond what has been proposed/developed in the context of databases. Specifically, extensions beyond the current state of the art in active capability are needed along several dimensions: 1. Make the active capability available for non-database applications, in addition to database applications; 2. Make the active capability available in distributed environments; 3. Make the active capability available for heterogeneous sources of events (whether they are databases or not). In this paper, we address 2) and 3) based on our preliminary architecture. There are a number of situations where one needs to know when changes are made to one or more documents that are stored in a distributed (typically heterogeneous) environment. The number of documents that need to be monitored for changes is large, and they are spread over multiple information repositories. The emphasis here is on selective notification; that is, changes 1 This work was supported, in part, by the Office of Naval Research & the SPAWAR System Center San Diego & by the Rome Laboratory (grant F ), and by NSF (grant IIS ).

99 are notified to appropriate persons/groups based upon interest (or profile/policy) that has been established earlier. Also, there should be a mechanism for establishing the interests/profiles/policies. Currently, change detection is done either manually or by using queries to check whether any document of interest has changed (since the last check). This entails wasted resources and at the same time does not meet the intended timeliness (where important) of change detection and associated notification. Also, quality of service issues cannot be accommodated in this approach. As an example, the above situation is very common in a large software development project where there are a number of documents, such as requirements analysis, design specification, detailed design document, and implementation documents. The life cycle of such projects are in years (and some in decades) and changes to various documents of the project take place throughout the life cycle. Typically, a large number of people are working on the project and managers need to be aware of the changes to any one of the documents to make sure the changes are propagated properly to other relevant documents and appropriate actions are taken. Large software developments happen in distributed environments. Information retrieval in the context of the web is another example that has similar characteristics. Different users may be interested in knowing changes to specific web pages (or even combinations there-of), and want to know when those changes take place. The approach proposed in this paper will avoid periodic polling of the web to see whether the information has changed or not. Some examples are: students want to know when the web contents of the courses they have registered for change; users may want to know when news items are posted with some specific context they are interested in. In general, the ability to specify changes to arbitrary documents and get notified in different ways will be useful for reducing the wasteful navigation of web in this information age. The proposed approach also provides a powerful way to disseminate information efficiently without sending unnecessary or irrelevant information. It also frees the user from having to constantly monitor for changes using the pull paradigm. Today, information retrieval is mostly done using the pull technology where the user is responsible for posing the appropriate query (or queries) to retrieve needed information. The burden of knowing changes to contents of pages in interested web sites is on the user, rather than on the system. Although there are a number of systems that send information to interested users selectively (periodically by airlines, for example), the approach commonly used is to use a mailing list to send compiled information. Other tools that provide real-time updates in the web context (e.g., stock updates) are custom systems that still use the pull technology underneath to refresh the screen periodically. We believe that some of the techniques developed for active databases, when extended appropriately along with new research extensions will provide a solution to the above class of problems. In addition, there is the theoretical foundation for event specification, and its detection in centralized and distributed environments. The main objective of this project is to develop the theory, architecture, and prototype implementation of a selective propagation approach that can be applied to web and other large-scale network-centric environments. 
We will draw upon the techniques developed for Sentinel and re-examine them from a broader, general-purpose context. Some of the issues that will be investigated in this project are:

100 Development of an approach (both language and constraints) to specify (primitive) changes to a hierarchical (XML) document at different level of granularity. Develop a GUI, if needed. Ability to specify combinations of primitive changes using a language such as Snoop which will allow one to specify higher levels of abstractions of changes (such as combinations of changes, sequences of changes, aggregate changes, etc.) Develop techniques for selective propagation between a web server and its browsing clients Extend the above to propagate selective changes from one or more web server to another web server (distributed case) Develop propagation techniques that take into account QoS and other constraints Developing solutions to the above issues will enable us to develop a general-purpose solution to selective information propagation for a large network-centric environment. The remainder of the paper is organized as follows. In section two we give an overview of related work. In section three we discuss the push/pull paradigms and their relevance to the change detection problem on structured documents. In section four, we present architecture and discuss the functionality of the components. Finally, we discuss future work and draw some conclusions in section 5. 2 Related Work Many tools have been developed and are currently available for tracking changes to web pages. AIDE (AT&T Internet Difference Engine) developed by AT&T [1] shows the difference between two html pages. The granularity of change detection is restricted to a page in AIDE. It is not possible to view changes at a finer level of granularity, such as links within a page, keywords, images, table, lists or phrases. Changedetection.com [2] allows users to register their request and notifies them when there is a change. We believe that polling (or timestamp information) is used for detecting changes to a page of interest. When a change is detected, the user is notified. The notification does not include what has changed in the page. The user is not given a choice of specifying the type of changes to be tracked on a particular page. Again, the granularity is a page. Mind-it [3] and WebCQ [4] both support customized change detection and notification. Mind-it formerly known as URL-Minder is commercially available. Both these systems track changes to a finer level of granularity in a page. They do not support change specification on multiple pages and combinations of changes within a page (e.g., phrase change and a link change). They also do not use active capability for either detecting changes or propagating changes. In Xyleme [5, 6], the idea of active paradigm is being used for detecting changes by evaluation of continuous/monitoring queries on XML/HTML documents. The focus is on the subscription language and continuous queries. 3 Push/Pull Paradigms Traditional approach to information management has been through the use of a Database Management System (or a DBMS). Early DBMSs were developed to satisfy the needs of certain classes of business applications (mainly airline and banking industries). The requirements of these industries were to store, retrieve, and manipulate large amounts of data concurrently, and in a consistent manner (plus allow for failure recovery etc.). Data was stored in databases and the user had to perform 100 3

operations explicitly to retrieve data from the system. The burden of retrieving relevant information was on the user. This is the traditional pull paradigm where the user retrieves information by performing an explicit action in the form of a query, application, or transaction execution. Even for traditional business applications, such as inventory control, the pull approach poses certain limitations. For example, in order to keep track of an inventory item (to order additional supplies when the number of widgets falls below a threshold), one has to periodically check (by executing a query) to find out how many widgets are currently present. The traditional DBMS is not capable of automatically informing the user that widgets have fallen below a prespecified threshold. Not surprisingly, this approach is still heavily used in web navigation, search, and retrieval. Figure 1 indicates a different approach to information retrieval and management. In this push paradigm, the user does not have to query or retrieve information as it changes. The system is responsible for accepting user needs (in the form of situations to monitor, business rules, constraints, profiles, continuous search queries) and informs the user (or a set of users) when something of interest happens. For the widget example above, the user indicates the threshold and the notification mechanism. The system monitors the quantity on hand every time a widget is sold or returned (only when a change takes place; not periodically) and informs the user in a timely manner. This paradigm relieves the user from frequently querying the data sources, and shifts the responsibility of situation monitoring from the user to the system. Of course, in order to accomplish this, the system needs to have additional functionality that is not part of traditional DBMSs. Although this mode of operation is recognized as beneficial and results in significantly less data transfers, accomplishing this for various architectures (such as distributed, federated and network-centric) requires enhancements to the underlying system or the incorporation of agents or mediators that can carry this out in a non-intrusive manner. In other words, the system needs to have the capability to selectively push information. This is a paradigm shift from how traditional information systems are architected and implemented. It is also a paradigm shift from the user's viewpoint as well.
Figure 1. Information Retrieval using the Push Paradigm (labels in the figure: business rules, constraints, invariants, situations to monitor; updates, transactions, applications; query, answers; self-monitoring reactive system; repository/web store).
3.1 Push-Based Architectures Push technology can be introduced into a system in a number of ways. The approach primarily depends on the characteristics of the underlying system in terms of its openness. The following options can be inferred based on the underlying system characteristics: Integrated. In this approach the underlying system is actually modified to incorporate the push technology in the form of ECA (event-condition-action) rules. This approach assumes that the source code for the underlying software

is available and the developers have sufficient understanding of the system to make changes at the kernel level. For example, the Sentinel object-oriented active system [7-9] used this approach on the OpenOODB system from Texas Instruments [10]. The sentry mechanism of the underlying system was extended to introduce notifications inside the wrapper for each method to detect primitive events. Once primitive events were detected, more complex composite events were detected and rules executed outside of the underlying system. The primary advantage of the integrated approach is its flexibility to add a minimum amount of code and incorporate many kinds of optimization that result in good performance. The footprint for primitive event detection is small. Some of the functionality needed for selective push technology (such as deferred action execution) can be easily incorporated using the integrated approach. So far, a number of research prototypes of active database systems have been developed, such as HiPAC [11], Ariel [12], Sentinel [7, 13], Starburst [14], Exact [15], Postgres [16], PEARD [17], SAMOS [18, 19], etc. Most of them are developed from scratch or integrated directly into the kernel of the DBMS. The integrated approach provides the following advantages [7]: it does not require any changes to existing applications; the DBMS is responsible for optimizing ECA rules; DBMS functionality is extended; and modularity of applications is better and maintenance is easier. However, the implementation of an integrated approach requires access to the internals of a DBMS into which the active capability is being integrated. This requirement of access to source code makes the cost of the integrated approach very high and requires a long integration time as well. Hence, most integrated systems are research prototypes. Agent-Based/Mediated. The assumption for this approach is that one does not have access to the source code of the underlying system. In fact, this is true in many real-life scenarios where a commercial off-the-shelf (or COTS) system is being used (a relational DBMS is an example). However, the underlying system may provide some hooks using which one can incorporate push capability effectively. We have experimented with this approach in a number of ways and have developed mediators/agents [20] to add full active capability to a relational DBMS. Intelligent agents are introduced between the end user (client) and the system (of course transparently to the user) and the agent provides additional capabilities that are not provided by the underlying system. Wrapper-Based. For this approach, the assumption is that the underlying system is a legacy system and as a result does not support appropriate hooks, and hence it is extremely difficult (and impossible in most cases) to modify the underlying source code. Typically, a wrapper (or a whopper) is built which interfaces to the outside world and push capabilities are added to this wrapper. The wrapper in turn uses the API of the underlying legacy system and may add some additional functionality not provided by the underlying system (sorting, for example). This approach needs a good understanding of the underlying system, and the wrapper has to be developed for each legacy system separately. This approach is not preferred unless it is the only alternative to bring the system on par with other systems, that is, to bring the legacy system into a federation or a distributed environment.
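To make the preceding discussion concrete, the fragment below sketches how a mediator or wrapper might register and fire ECA rules outside the underlying system. The class names and the widget inventory example (taken from the scenario above) are hypothetical illustrations, not the rule language of Sentinel or WebVigil.

# Hypothetical sketch of Event-Condition-Action (ECA) rules fired by a mediator/wrapper.
from dataclasses import dataclass
from typing import Any, Callable, List

@dataclass
class ECARule:
    event: str                           # event name the rule is subscribed to
    condition: Callable[[Any], bool]     # predicate evaluated on the event data
    action: Callable[[Any], None]        # executed only if the condition holds

class Mediator:
    def __init__(self):
        self.rules: List[ECARule] = []

    def register(self, rule: ECARule):
        self.rules.append(rule)

    def signal(self, event: str, data: Any):
        # The wrapper calls signal() whenever it observes a primitive event
        # (e.g., an update on the underlying source); matching rules fire here.
        for rule in self.rules:
            if rule.event == event and rule.condition(data):
                rule.action(data)

# Usage: the inventory example from the text, recast as an ECA rule.
m = Mediator()
m.register(ECARule(
    event="widget_sold",
    condition=lambda stock: stock < 10,
    action=lambda stock: print(f"notify manager: only {stock} widgets left"),
))
m.signal("widget_sold", 7)   # condition holds, so the notification is pushed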

4 WebVigil Architecture WebVigil is a change detection and notification system, which can monitor and detect changes to unstructured documents in general. The current work addresses HTML/XML documents that are part of a web repository. WebVigil aims at investigating the specification, management, and propagation of changes as requested by the user in a timely manner while meeting the quality of service requirements. Figure 2 summarizes the high level architecture of WebVigil. Users specify their interest in the form of a sentinel that is used for change detection and presentation. Information from the sentinel is extracted and stored in a data/knowledge base and is used by the other modules in the system.
Figure 2. Web Vigil Architecture (modules shown in the figure: user specification, presentation/notification, data/knowledge base, ECA rule generation, change detection, caching and management, event-based fetching).
The functionality of each module in the architecture is described briefly in the following sections. 4.1 User specification Users may wish to track changes to a given page with respect to links, words, keywords, phrase, images, table(s), list(s), or any change. We define such a request from the user as a sentinel. The user creates a sentinel to define the changes of interest with respect to a page. A partial syntax of the sentinel is shown in Figure 3. The system generates a unique identifier for every sentinel. The sentinel-target specifies the URL to be monitored for change detection. The sentinel type can be a primitive change (links, images, ...) or a composite change (a combination of primitive changes using options such as AND, OR and NOT). The lifespan of the sentinel can be periodic (from a fixed point of time to another fixed point of time) or aperiodic (from and to activation/termination of other sentinels set by the same user). Once the sentinel is initialized, it becomes active when the condition associated with it becomes true.
Figure 3. User Specification (partial syntax of a sentinel).
The Notify part of a sentinel specifies the frequency with which the user wishes to be informed of changes. The notify options give the users a set of methods for change notification. The sentinel is set with default settings unless stated otherwise by the user. The default settings being: FROM: time at which the sentinel is initiated. NOTIFY: Immediate.

BY: e-mail. The Immediate option indicates that the user should be notified as soon as the page changes. Of course, there may be a small interval between the change occurrence and its detection by WebVigil. We plan on quantifying this difference more formally and validating it through experiments. If an interval is specified, the user is notified using the interval even if the page changes several times during that interval. Consider the following scenario: Jill wants to be notified daily by e-mail of changes in links and images to the page, starting from Feb 2, 2002 to Mar 2, 2002. The sentinel for the above scenario is as follows:
Create Sentinel sen_1 ON MONITOR links AND images FROM Feb 2, 2002 TO Mar 2, 2002 NOTIFY every day BY jill@aol.com
4.2 Data/Knowledge Base (D/KB) The Knowledge Base is a persistent repository containing meta-data about each user, the number and names of sentinels set by each user, and details of the contents of the sentinel (frequency of notification, change type, etc.). User input is parsed and the required information is extracted and stored for later use. For example, for each URL it stores the following parameters: last modified date, last check time, checksum, and frequency of checks. The D/KB may also store the notification method and notification frequency for each <user-url> pair. The D/KB also acts as a persistent store so that all the memory resident information can be regenerated in case of a system crash. The rest of the modules of WebVigil use the D/KB for information needed at run time. AIDE maintains a relational database containing information about each page, each user and the relationship between them. 4.3 ECA Rule Generation We plan on using ECA rules and the event detection approach in two places: i) rules for retrieving pages in an intelligent manner based on the user specification (e.g., user frequency coupled with whether the page has changed in that interval), and ii) for propagating pages to detect higher level changes. ECA rules will help us to propagate changes requested by the user in a timely manner. In WebVigil, the ECA rule generation module uses the concepts defined in [8, 9] to provide the required active capability. This module constructs and maintains change detection graphs which keep track of relationships between the pages and sentinels. Each node specifies the change requested in the sentinel on that page. In a change detection graph, the leaf node represents the page of interest and non-leaf nodes represent operators for various types of changes (e.g., phrase change is an operator).
Figure 4. Change Detection Graphs (the figure shows operator nodes S1, S2, S3 over leaf page nodes).
Figure 4 shows a change detection graph for a page P1 where nodes S1 and S2 represent the type of change detection requested by sentinels present on P1. For every leaf node Pi, a periodic or aperiodic rule Ri is generated with the event part of the rule specifying the frequency and the action part with calls to the fetch procedure followed by a notification to the change detection graph, if necessary.
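A hypothetical sketch of how such a change detection graph could be represented is given below: leaf nodes stand for monitored pages and carry the periodic fetch rule, non-leaf operator nodes stand for change types, and sentinels subscribe to operators. The class names and the toy word-level detector are illustrative assumptions, not WebVigil's implementation.

# Hypothetical sketch of a change detection graph with page (leaf) and operator nodes.
class PageNode:
    def __init__(self, url, poll_interval):
        self.url = url
        self.poll_interval = poll_interval   # event part of the periodic rule
        self.operators = []                  # non-leaf nodes subscribed to this page

    def on_new_version(self, old_version, new_version):
        # Action part: propagate the fetched versions up the graph.
        for op in self.operators:
            op.evaluate(old_version, new_version)

class OperatorNode:
    def __init__(self, change_type, detector, sentinels):
        self.change_type = change_type       # e.g. "links", "images", "phrase"
        self.detector = detector             # detector(old, new) -> list of changes
        self.sentinels = sentinels           # sentinel identifiers to notify

    def evaluate(self, old_version, new_version):
        changes = self.detector(old_version, new_version)
        if changes:
            for sid in self.sentinels:
                print(f"{sid}: {self.change_type} changed: {changes}")

# Usage, loosely following the sen_1 example above (the detector is a toy diff on words).
def word_diff(old, new):
    return sorted(set(new.split()) - set(old.split()))

page = PageNode("http://example.org/page", poll_interval="1 day")
page.operators.append(OperatorNode("links", word_diff, ["sen_1"]))
page.on_new_version("a b c", "a b c d")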

4.4 Change Detection Detection algorithms have been developed to detect changes between two versions of a page with respect to a change type. For a change to be detected, the object of interest is extracted from the given versions of the page depending upon the change type. Figure 5 shows the change types that are identified and supported in the current prototype of WebVigil. Change to links, images, words and keyword(s) is captured in terms of insertion or deletion. Object identification, extraction and change detection are complicated for phrases. For identifying an object (phrase) in a given page we use the words surrounding it as its signature. We assume that these words are relatively stable. WebCQ [4] uses the concept of a bounding box to tackle this problem. Change to a table or list is specified in terms of an update made to their contents. An insertion of a new table or list is not captured under this change type. For a phrase change, an insert or delete indicates appearance or disappearance of the complete phrase in the page. Currently the change detection algorithms are being reviewed for better performance and for scale up. Abiteboul et al. [21] detect changes at the page level and insertions at the node level, which is somewhat different from our focus.
Change Type    Insert   Delete   Update
Link           yes      yes      -
Image          yes      yes      -
Keyword(s)     yes      yes      -
Words          yes      yes      -
Phrase         yes      yes      yes
Table          -        -        yes
List           -        -        yes
Figure 5. List of Change Types
4.5 Caching and Management of pages An important feature of the WebVigil architecture is its centralized, server-based repository service that archives and manages versions of pages. WebVigil retrieves and stores only those pages needed by a sentinel. The primary purpose of the repository service is to reduce the number of network connections to the remote web server, thereby reducing network traffic. When a remote page fetch is initiated, the repository service checks for the existence of the remote page in its cache and, if present, the latest version of the page in the cache is returned. In the case of a cache miss, the repository service requests that the page be fetched from the appropriate remote server. Subsequent requests for the web page can access the page from the cache instead of repeatedly invoking a fetch procedure. The repository service reduces network traffic and latency for obtaining the web page because WebVigil can obtain the target web pages from the cache instead of having to request the page directly from the remote server. The quality of service for the repository service includes managing multiple versions of pages without excessive storage overhead. WebGUIDE [22] manages versions of pages by storing pages in RCS [23] format. 4.6 Page Retrieval WebVigil uses a wrapper for the task of retrieving the pages registered with it. The wrapper is responsible for informing WebVigil about changes in the properties of the pages. By properties, we mean the size of the page and the last modified time stamp. When there is a change in the time stamp of the page with an increase or decrease in page size, the wrapper notifies WebVigil of the change, which then fetches and caches the page. In cases where the time stamp is modified but the page size remains the same, the wrapper reports this as a change. WebVigil

fetches and calculates the checksum of the page. The page is cached only if the calculated checksum differs from the checksum of the cached copy of this page. For dynamically generated pages, WebVigil directly fetches the page without using the wrapper, as page properties are not available. It then checks for change by calculating the checksum of the page. The wrapper may, depending on the paradigm (push/pull), be either located at the web server or be a part of WebVigil. Irrespective of its location, the primary function of the wrapper is to retrieve metadata and inform WebVigil of the change in page properties. WebVigil in turn fetches and caches pages of interest. In the pull approach the wrapper is located at WebVigil. It polls and pulls the properties of the pages from the remote web server. Figure 6 illustrates this approach.
Figure 6. Local Wrapper (steps in the figure: 1. retrieval of page properties, 2. page properties, 3. retrieval of page; between the web server and the wrapper at WebVigil).
In the push approach the wrapper is located at the remote web server. The wrapper is assumed to know all those pages that are registered with WebVigil and belong to the web server on which it resides. It informs (pushes) the change information to WebVigil. Figure 7 illustrates this approach.
Figure 7. Remote Wrapper (steps in the figure: 1. send page properties, 2. retrieval of page; between the wrapper at the web server and WebVigil).
The location of the wrapper is a trade-off between communication, processing and storage. At first glance it may seem obvious that the local wrapper should be used, but the cost of polling and the network cost may be crucial, in which case the remote wrapper will be preferable. We intend to develop both local and remote wrappers, evaluate their performance, and use them appropriately. 4.7 Presentation and Notification The presentation method selected should clearly state the detected differences between two web pages to the user. Therefore, computing and displaying the detected differences is very important. In this section, issues related to displaying and notifying the detected changes are discussed. Presentation. Different methods of displaying changes used by the existing tools are: 1) merging two documents, 2) displaying only the changes, and 3) highlighting the differences in both pages [1, 4]. Summarizing the common and changed data into a single merged document has the advantage of displaying the common portions only once [1]. HTMLdiff [1] and Unixdiff [24] use this style to display detected changes. The disadvantage of this approach is that it is difficult for the user to view the changes when they are large in number. Displaying only the computed differences is a better option when the user is interested in tracking changes to multiple pages or when the number of changes is large. But, highlighting the

differences by displaying both pages side-by-side is preferable for changes like any change and phrase change. In this case, the detected differences can be perceived better if the change in the new page is shown relative to the old page. Because WebVigil will track multiple types of changes on a web page, and eventually notify using different media (e-mail, PDA, laptop, etc.), a combination of all the presentation styles discussed above will be relevant, as the information to be notified will vary depending on factors like the notification method, the number of detected differences and the type of changes. Notification. What, when and how to notify are three important issues for proper notification. These issues are discussed below: Presentation Content. Presentation content should be concise and lucid. Users should be able to clearly perceive the computed differences in the context of their predefined specification. The notification report could contain the following basic information: the change detected in the latest page relative to the reference page; the user-specified type of change (like any change, all words, etc.); the URL for which the change detection module is invoked; and a small summary explaining the detected change. This could include statuses of changes such as Insert, Delete and Changed for certain user-defined types of changes like images, all links and keywords, and/or the different timestamps indicating the modification, polling, change detection and notification dates. The size of the notification report will depend upon the maximum information that can be sent to a user while satisfying the network quality of service requirements. Notification frequency. A detected change can be notified in two ways: notify immediately when the change is detected, or notify after a fixed time interval. The user may want to be notified immediately of changes on particular pages. In such cases, immediate notification should be sent to the user. Alternatively, the frequency of change detection will be very high for web pages that are modified frequently. Since frequent notification of these detected changes will prove to be a bottleneck on the network, it is preferable to send notifications periodically. Thus the user can specify the notification interval in the sentinel. Notification methods. Different notify options, like e-mail, fax, PDA and web page, can be used for notification. Notification can be initiated either by the server or by the client. In WebVigil, server-based push initiation is considered. The server, based on the notification frequency, can push the information to the user, thus propagating the changes just in time (JIT). 5 Conclusions and Future Work 5.1 Conclusions The basic architecture of WebVigil has been designed to track and propagate changes on unstructured documents as requested by the user in a timely manner, meeting the quality of service requirements. The design accommodates specification of multiple types of changes, on multiple web pages (composite events). The

108 existing event specification language SNOOP [8, 9] will be used for specifying composite events. The design and implementation of the system will address the issues regarding scalability and user flexibility. Implementation of WebVigil will augment the current strategy of pulling information periodically and checking for interesting changes. 5.2 Future Work This section describes future extension to the basic functionality of WebVigil. The present method detects changes between the current and the last changed page. This method can be improved upon by giving the user the choice to select the reference page. The user can specify a fixed reference page and must have the flexibility to change the reference. The moving window concept for tracking changes in WebVigil can be improved by allowing a page to be used as reference for detecting changes for the next n pages where user will define n. After changes are detected in n pages, the nth page becomes the reference page. Consider the following scenario: Jill wants to use the first version of the page as reference. He wants to track changes for the next five revisions to the page with this reference. After five changes, the reference page should be the fifth page and the next five changes should be tracked relative to this page. An added feature will be to notify the user of cumulative changes. The user can be given the option of being notified of cumulative n changes where n should be specified in the sentinel. Additional feature like user s personalized change summary page can be provided. The user can lookup this page to get the history of his installed sentinels and the changes tracked till date. 6 References 1. Douglis, F., et al., The AT&T Internet Difference Engine: Tracking and Vie wing Changes on the Web. in World Wide Web. 1998, Baltzer Science Publishers. p Changedetection, 3. Mind-it, 4. Liu, L., C. Pu, and W. Tang. WebCQ: Detecting and Delivering Information Changes on the Web. in the Proceedings of International Conference on Information and Knowledge Management (CIKM) Washington D.C: ACM Press. 5. Xyleme, 6. Nguyen, B., et al. Monitoring XML Data on the Web. in Proceedings of the 2001 ACM SIGMOD International Conference on Management of Data Chakravarthy, S., et al., Design of Sentinel: An Object-Oriented DBMS with Event-Based Rules. Information and Software Technology, (9): p Chakravarthy, S., et al., Composite Events for Active Databases: Semantics, Contexts and Detection, in Proc. Int'l. Conf. on Very Large Data Bases VLDB. 1994: Santiago, Chile. p Chakravarthy, S. and D. Mishra, Snoop: An Expressive Event Specification Language for Active Databases. Data and Knowledge Engineering, (10): p Wells, D., J.A. Blakeley, and C.W. Thompson, Architecture of an Open Object-Oriented Database Management System. IEEE Computer, (10): p Chakravarthy, S., et al., HiPAC: A Research Project in Active, Time -Constrained Database Management (Final Report). 1989, Xerox Advanced Information Technology. 12. Hanson, E., The Ariel Project, in Active Database Systems - Triggers and Rules For Advanced Database Processing. 1996, Morgan Kaufman Publishers Inc. p

109 13. Anwar, E., L. Maugis, and S. Chakravarthy, A New Perspective on Rule Support for Object- Oriented Databases, in 1993 ACM SIGMOD Conf. on Management of Data. 1993: Washington D.C. p J. W. Hunt and M.D.Mcllroy, An algorithm for efficient file comparison. 1995, Bell Laboratories: Murray Hill, N.J. 14. Widom, J., The Starburst Rule System, in Active Database Systems - Triggers and Rules For Advanced Database Processing. 1996, Morgan Kaufman Publishers Inc. p Diaz, O., N. Paton, and P. Gray, Rule Management in Object-Oriented Databases: A Unified Approach, in Proceedings 17th International Conference on Very Large Data Bases. 1991: Barcelona (Catalonia, Spain). 16. Stonebraker, M. and G. Kemnitz, The Postgres Next -Generation Database Management System. Communications of the ACM, (10): p Alexander, S.D. Urban, and S.W. Dietrich, PEARD: A Prototype Environment for Active Rule Debugging. Intelligent Information Systems : Integrating Artificial Intelligence and Database Technologies, (Number 2). 18. Gatziu, S. and K.R. Dittrich, Events in an Active Object-Oriented System, in Rules in Database Systems., N. Paton and M. Williams, Editors. 1993, Springer. p Gatziu.S and K.R.Dittrich, SAMOS: an Active, Object-Oriented Database System, in IEEE Quarterly Bulletin on Data Engineering p Li, L. and S. Chakravarthy. An Agent-Based Approach to Extending the Native Active Capability of Relational Database Systems. in ICDE Australia: IEEE. 21. Cobena, G., S. Abiteboul, and A. Marian, Detecting Changes in XML Documents. Data Engineering, Douglis, F., et al., WebGUIDE: Querying and Navigating Changes in Web Repositories. in Fifth International World Wide Web Conference Paris, France. 23. Tichy, W., RCS: a system for version control, in Software -Practice & Experience p

Caching Schema for Mobile Web Information Retrieval R. Lee†, K. Goshima†, Y. Kambayashi† and H. Takakura‡ †Graduate School of Informatics, Kyoto University ‡Data Processing Center, Kyoto University Sakyo, Kyoto, Japan {ryong, gossy, yahiko}@db.soc.i.kyoto-u.ac.jp, takakura@rd.kudpc.kyoto-u.ac.jp Abstract Web cache management in mobile devices is becoming an important problem. In this paper, we argue that the traditional cache method LRU (Least Recently Used) is not sufficient in mobile environments. Since the data related to the places already visited by the mobile user may not be used again, user behavior in mobile environments should be considered. In another aspect of web data use, we often connect to the web over wired broadband networks. Before traveling with mobile devices, which are limited in storage space and power, carrying web data in the mobile cache will be a good strategy to reduce the effort of mobile-based web information retrieval. In the pre-fetching of web data, we can improve the cache efficiency significantly by dealing with metadata such as selected URL's (with proper keywords) instead of taking out the whole web contents. In this paper, caching algorithms considering web contents, metadata and user behavior history are developed. In order to determine the priority of web pages, word relationships obtained from a volume of web pages are used. 1 Introduction One of the serious bottlenecks of mobile systems is slow communication speed. If required information is stored in the cache of the mobile system, it will be retrieved rapidly. As cache size is limited in mobile systems, we need to develop proper algorithms for such a purpose. In this paper, we will discuss priority computation algorithms for mobile systems to be used for sight-seeing. Although the application is limited in order to identify the problems, it will be rather easy to extend the results to other cases. Compared with the traditional CPU cache and web cache, we have to develop different algorithms to determine the priorities of contents in the cache. For example, in the traditional cache, data used recently will have high priority. In mobile applications, if a user visited a place, information related to the place may not be required later. As the cache size is limited, we will store URL's instead of web contents if the priority is not very high. If a user is located at place A, there is some possibility of visiting place B if both are related. The strength of the relationship is calculated from the number of appearances of the pair (A,B) in one paragraph (or one web page). Usually the pair (A,B) appears frequently if the two places are located a short distance apart. There are, however, cases where they are far apart. In such cases these locations are semantically

related (for example, having a similar nature). In order to find out relationships among words, we first classified words into G-words (geo-words, related to location names) and N-words (non-geo-words). We have collected 2 million web pages related to Kyoto City in Japan, which are the basis for deriving word relationships. In Section 2, we will discuss the characteristics of mobile cache algorithms by comparison with other typical cache algorithms. Section 3 shows the organization of a mobile cache for a travel guide as an example. How to determine the priorities of cache contents using word relationships is discussed in Section 4. Consideration of web environments is discussed in Section 5. Section 6 shows some problems found by simulation. 2 Characteristics of Mobile Cache In this section we will compare the traditional CPU cache, the web cache, and the mobile cache to be discussed in this paper, to identify the problems of the mobile cache. 2.1 Traditional CPU Cache Requirements and characteristics for the CPU cache are as follows. 1. All the data are of equal size. 2. The cache size is very small. 3. Updates of data are performed in the cache. 4. In many cases data are accessed sequentially. As the speed of the CPU is very high, we need to use a very efficient algorithm for cache replacement. LRU (Least Recently Used) is very popular, since the computational overhead is very low and reasonably good results are obtained [7]. 2.2 Web Cache Unlike the CPU cache, the communication speed is very slow in the case of the web cache; thus we can use complicated algorithms for the web cache [1, 2]. 1. Data sizes range from very small to very large (for example, video). 2. We can use disks for the cache, so the cache size can be rather large. 3. Updates are performed at the web pages. Some cache contents will become out-of-date easily. 4. There are some related pages (shown by links). 5. There are several different usage patterns, which will appear periodically. For example, the usage patterns from 9 a.m., just before 5 p.m., and at night may be different. The most serious problem is size [2, 4]. If we put one large data object in the cache, other data cannot be stored in the cache. We need to consider the size besides the recency treated by LRU. Furthermore, because of item 5 above, we need to adjust to usage patterns. A simple way is to use popularity (how many times a page is used in 24 hours). We have developed a complicated web cache algorithm considering size, recency and popularity [3]. 2.3 Cache for Mobile Systems Mobile systems usually have a small amount of cache, and their data usage patterns are quite different from the above two cases [6, 8]. Usually a cache is used to utilize past data. The requirements for mobile systems are quite different. For example, once a person visits a famous sight-seeing spot, the map for approaching the place may not be used again, although it may have been used frequently before. Another example is that lunch information will not be utilized after the user has taken lunch. Since the communication speed is very slow, the cache system has to store data to be used in the near future, by predicting the user's behavior. For example, if a person is going to some place by car, parking information near the place should be stored in advance.
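As an illustration of how such behavior-aware replacement differs from LRU, the sketch below demotes entries about places the user has already visited and promotes entries related to the predicted next destination. The priority formula, field names and weights are purely hypothetical, not the algorithm developed in this paper.

# Illustrative sketch of behavior-aware cache replacement for a mobile travel guide.
from dataclasses import dataclass

@dataclass
class CacheEntry:
    url: str
    place: str                  # G-word (location name) the page is about
    size: int                   # bytes; large objects cost more cache space
    base_score: float = 1.0     # e.g. relation strength to the travel plan

def priority(entry, visited_places, next_place, relation):
    score = entry.base_score
    if entry.place in visited_places:
        score *= 0.1                                              # already visited: unlikely to be reused
    score *= 1.0 + relation.get((next_place, entry.place), 0.0)   # related to the next stop: promote
    return score / max(entry.size, 1)                             # prefer keeping small, useful entries

def evict(cache, visited_places, next_place, relation):
    # Drop the entry with the lowest behavior-aware priority (LRU would ignore all of this).
    return min(cache, key=lambda e: priority(e, visited_places, next_place, relation))

# Usage with hypothetical data.
cache = [CacheEntry("map_station.html", "Kyoto Station", 40_000),
         CacheEntry("kiyomizu_guide.html", "Kiyomizu", 20_000)]
relation = {("Gion", "Kiyomizu"): 0.8}     # strength derived from (A,B) co-occurrence counts
print(evict(cache, visited_places={"Kyoto Station"}, next_place="Gion", relation=relation).url)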

As the cache size is small, we may have to store URL's instead of contents. To find proper URL's, keywords for each URL are required. In summary, the requirements and characteristics of the mobile cache are as follows. 1. The cache size is rather small. 2. The cache algorithm should use prediction information about the owner's behavior. If a data object is used and it is predicted not to be used again, its contents will be erased from the cache. The URL and some meta information (usage information) should be stored in the cache, unless it is certain that it will never be used again. By predicting future user behavior, we can order the URL's stored in the cache using the metadata attached to them. Contents of important URL's are retrieved in advance, since the communication speed is slow. 3. From the current web page used by the user, we can add URL's from its link information if required. For traveling purposes, before the travel we can store URL's which may be required during the travel. Based on the actual usage of web pages, we have to modify the URL list according to item 3 above. 3 Mobile Travel Guide In this section, we will discuss functional requirements for a mobile travel guide support system. The mobile environment has many constraints, such as limited storage, slow communication, poor user interfaces and slow operation speed. To overcome these problems, we will discuss generalized cache algorithms for mobile systems. Unlike a conventional cache, metadata are also cached and user behavior is reflected in the cached contents, as discussed in Section 2. 3.1 Cache Contents As a strategy to use contents in a mobile environment, there are the following three patterns. 1. Storing selected web contents into mobile devices. If all necessary web contents are stored in the cache, we do not need to use the web. The communication cost will be reduced. Since some web contents may be updated frequently, the user cannot get the latest information of such pages by this method. 2. Downloading all necessary web information when requested. Conventional cellular phones use this method, since they cannot store a large amount of data. 3. Storing metadata which helps to retrieve the necessary web pages. Instead of storing all the web contents, we can store only URL's in order to reduce the storage cost. Web page selection is performed before getting the contents. We can store a large number of URL's in a small cache area. According to the travel plan, we can determine the priority of web pages. Web pages with the top-level priority: the contents of these web pages are actually stored in the cache, if they will not change frequently. Otherwise, only URL's are stored. Web pages with the second-level priority: only parts of the web contents are stored with the URL's. These partial contents are used to find proper URL's. Link information may also be stored in the cache. For partial contents, we can use frequently appearing keywords and/or the top parts (including the titles) of pages. Web pages with the third-level priority: only URL's are stored. Figure 1 shows the organization of the system to be

discussed in this paper. According to the user behavior history, the priorities of web pages are dynamically modified. We use word relationships to determine the priority.
Figure 1: System Organization
3.2 System Functions The system has functions of web page cache management and a dynamic travel guide, using metadata such as relationships of geo-words or keywords derived from the web, and user environment parameters. Two phases are considered in the system, the planning phase and the retrieval-and-guide phase. The planning phase is performed at home before the departure of a tour. In this phase, the system works to support surveying of the destination area by retrieving web pages, decision making on target spots and visiting order, and storing metadata (relations, user status) into the mobile devices. On the other hand, the retrieval-and-guide phase takes place outdoors during the travel. This phase works for efficient management of the web page cache based on page priority ranking, and an active dynamic travel guide that suggests closely related web pages to the user. Planning Phase In the planning phase, we assume that the user can use a wired communication environment with enough bandwidth. This phase consists of two kinds of work. One is to decide the visiting places in the target area. The other is to download the data to be used offline which will be needed for the guide. First, a user can get knowledge of the places by browsing web pages, and determine his objective spots and visiting order. Then, the required metadata are stored into the mobile device. The required metadata are: lists of G-words, keywords, and URL's, and their relationships; G-words of the user's destination and keywords of his interests; and initial values of the user status (e.g. money conditions, previously visited, etc.). In addition, web page contents themselves are downloaded to store into the web page cache. That reduces the access frequency of mobile retrieval. Retrieval-and-Guide Phase In the retrieval-and-guide phase, two functions are provided: web page cache management based on page priority ranking, and an active dynamic travel guide with page pre-fetching. The active dynamic travel guide is a function to recommend to the user web pages related to the current location, interesting keywords or his other conditions. This function considers users who have not prepared well or whose plans change unexpectedly. The guide selects pages which have high priority in the cache and shows them to the user, as if it says "Why don't you visit this spot written on this page?" In this way, the user can find the next destination suitable to him. We regard this function as a kind of guide agent. Moreover, the speed of retrieval can be improved by pre-fetching the target pages.
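A toy sketch of two of the ingredients described above, computing relation strengths from paragraph-level co-occurrence of G-words and ranking cached URL metadata for pre-fetching relative to the current location, might look as follows; the scoring scheme, function names and data are illustrative assumptions only.

# Toy sketch: G-word co-occurrence strengths and prefetch ranking over URL metadata.
from collections import Counter
from itertools import combinations

def relation_strengths(paragraphs, g_words):
    counts = Counter()
    for text in paragraphs:
        present = sorted({w for w in g_words if w in text})
        for a, b in combinations(present, 2):
            counts[(a, b)] += 1            # the pair appears together in one paragraph
    return counts

def rank_urls_for_prefetch(url_keywords, current_place, strengths, top_k=3):
    # Score each cached URL by how strongly its keywords relate to the current location.
    def score(item):
        _, keywords = item
        return sum(strengths.get(tuple(sorted((current_place, k))), 0) for k in keywords)
    return [u for u, _ in sorted(url_keywords.items(), key=score, reverse=True)[:top_k]]

# Usage with toy data (the real system derives relationships from about 2 million pages on Kyoto).
paragraphs = ["Gion is close to Kiyomizu temple", "Kiyomizu and Gion in one afternoon",
              "Arashiyama bamboo grove"]
g = {"Gion", "Kiyomizu", "Arashiyama"}
strengths = relation_strengths(paragraphs, g)
urls = {"kiyomizu.html": ["Kiyomizu"], "arashiyama.html": ["Arashiyama"]}
print(rank_urls_for_prefetch(urls, "Gion", strengths))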


CS246: Mining Massive Datasets Jure Leskovec, Stanford University

CS246: Mining Massive Datasets Jure Leskovec, Stanford University CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu HITS (Hypertext Induced Topic Selection) Is a measure of importance of pages or documents, similar to PageRank

More information

A Generating Function Approach to Analyze Random Graphs

A Generating Function Approach to Analyze Random Graphs A Generating Function Approach to Analyze Random Graphs Presented by - Vilas Veeraraghavan Advisor - Dr. Steven Weber Department of Electrical and Computer Engineering Drexel University April 8, 2005 Presentation

More information

Compact Encoding of the Web Graph Exploiting Various Power Laws

Compact Encoding of the Web Graph Exploiting Various Power Laws Compact Encoding of the Web Graph Exploiting Various Power Laws Statistical Reason Behind Link Database Yasuhito Asano, Tsuyoshi Ito 2, Hiroshi Imai 2, Masashi Toyoda 3, and Masaru Kitsuregawa 3 Department

More information

CSE 190 Lecture 16. Data Mining and Predictive Analytics. Small-world phenomena

CSE 190 Lecture 16. Data Mining and Predictive Analytics. Small-world phenomena CSE 190 Lecture 16 Data Mining and Predictive Analytics Small-world phenomena Another famous study Stanley Milgram wanted to test the (already popular) hypothesis that people in social networks are separated

More information

The Establishment Game. Motivation

The Establishment Game. Motivation Motivation Motivation The network models so far neglect the attributes, traits of the nodes. A node can represent anything, people, web pages, computers, etc. Motivation The network models so far neglect

More information

Structural Analysis of Paper Citation and Co-Authorship Networks using Network Analysis Techniques

Structural Analysis of Paper Citation and Co-Authorship Networks using Network Analysis Techniques Structural Analysis of Paper Citation and Co-Authorship Networks using Network Analysis Techniques Kouhei Sugiyama, Hiroyuki Ohsaki and Makoto Imase Graduate School of Information Science and Technology,

More information

Searching the Web What is this Page Known for? Luis De Alba

Searching the Web What is this Page Known for? Luis De Alba Searching the Web What is this Page Known for? Luis De Alba ldealbar@cc.hut.fi Searching the Web Arasu, Cho, Garcia-Molina, Paepcke, Raghavan August, 2001. Stanford University Introduction People browse

More information

Web Structure Mining using Link Analysis Algorithms

Web Structure Mining using Link Analysis Algorithms Web Structure Mining using Link Analysis Algorithms Ronak Jain Aditya Chavan Sindhu Nair Assistant Professor Abstract- The World Wide Web is a huge repository of data which includes audio, text and video.

More information

CS249: SPECIAL TOPICS MINING INFORMATION/SOCIAL NETWORKS

CS249: SPECIAL TOPICS MINING INFORMATION/SOCIAL NETWORKS CS249: SPECIAL TOPICS MINING INFORMATION/SOCIAL NETWORKS Overview of Networks Instructor: Yizhou Sun yzsun@cs.ucla.edu January 10, 2017 Overview of Information Network Analysis Network Representation Network

More information

Searching the Web [Arasu 01]

Searching the Web [Arasu 01] Searching the Web [Arasu 01] Most user simply browse the web Google, Yahoo, Lycos, Ask Others do more specialized searches web search engines submit queries by specifying lists of keywords receive web

More information

CRAWLING THE WEB: DISCOVERY AND MAINTENANCE OF LARGE-SCALE WEB DATA

CRAWLING THE WEB: DISCOVERY AND MAINTENANCE OF LARGE-SCALE WEB DATA CRAWLING THE WEB: DISCOVERY AND MAINTENANCE OF LARGE-SCALE WEB DATA An Implementation Amit Chawla 11/M.Tech/01, CSE Department Sat Priya Group of Institutions, Rohtak (Haryana), INDIA anshmahi@gmail.com

More information

Scale Free Network Growth By Ranking. Santo Fortunato, Alessandro Flammini, and Filippo Menczer

Scale Free Network Growth By Ranking. Santo Fortunato, Alessandro Flammini, and Filippo Menczer Scale Free Network Growth By Ranking Santo Fortunato, Alessandro Flammini, and Filippo Menczer Motivation Network growth is usually explained through mechanisms that rely on node prestige measures, such

More information

Phase Transitions in Random Graphs- Outbreak of Epidemics to Network Robustness and fragility

Phase Transitions in Random Graphs- Outbreak of Epidemics to Network Robustness and fragility Phase Transitions in Random Graphs- Outbreak of Epidemics to Network Robustness and fragility Mayukh Nilay Khan May 13, 2010 Abstract Inspired by empirical studies researchers have tried to model various

More information

Dynamics of the Chilean Web structure

Dynamics of the Chilean Web structure Dynamics of the Chilean Web structure Ricardo Baeza-Yates *, Barbara Poblete Center for Web Research, Department of Computer Science, University of Chile, Blanco Encalada 2120, Santiago, Chile Abstract

More information

Erdős-Rényi Model for network formation

Erdős-Rényi Model for network formation Network Science: Erdős-Rényi Model for network formation Ozalp Babaoglu Dipartimento di Informatica Scienza e Ingegneria Università di Bologna www.cs.unibo.it/babaoglu/ Why model? Simpler representation

More information

Using PageRank to Characterize Web Structure

Using PageRank to Characterize Web Structure Using PageRank to Characterize Web Structure Gopal Pandurangan Prabhakar Raghavan Eli Upfal Abstract Recent work on modeling the Web graph has dwelt on capturing the degree distributions observed on the

More information

CS224W Final Report Emergence of Global Status Hierarchy in Social Networks

CS224W Final Report Emergence of Global Status Hierarchy in Social Networks CS224W Final Report Emergence of Global Status Hierarchy in Social Networks Group 0: Yue Chen, Jia Ji, Yizheng Liao December 0, 202 Introduction Social network analysis provides insights into a wide range

More information

Graph Model Selection using Maximum Likelihood

Graph Model Selection using Maximum Likelihood Graph Model Selection using Maximum Likelihood Ivona Bezáková Adam Tauman Kalai Rahul Santhanam Theory Canal, Rochester, April 7 th 2008 [ICML 2006 (International Conference on Machine Learning)] Overview

More information

Weighted Page Rank Algorithm Based on Number of Visits of Links of Web Page

Weighted Page Rank Algorithm Based on Number of Visits of Links of Web Page International Journal of Soft Computing and Engineering (IJSCE) ISSN: 31-307, Volume-, Issue-3, July 01 Weighted Page Rank Algorithm Based on Number of Visits of Links of Web Page Neelam Tyagi, Simple

More information

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

CS246: Mining Massive Datasets Jure Leskovec, Stanford University CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu SPAM FARMING 2/11/2013 Jure Leskovec, Stanford C246: Mining Massive Datasets 2 2/11/2013 Jure Leskovec, Stanford

More information

TELCOM2125: Network Science and Analysis

TELCOM2125: Network Science and Analysis School of Information Sciences University of Pittsburgh TELCOM2125: Network Science and Analysis Konstantinos Pelechrinis Spring 2015 Figures are taken from: M.E.J. Newman, Networks: An Introduction 2

More information

Lecture 9: I: Web Retrieval II: Webology. Johan Bollen Old Dominion University Department of Computer Science

Lecture 9: I: Web Retrieval II: Webology. Johan Bollen Old Dominion University Department of Computer Science Lecture 9: I: Web Retrieval II: Webology Johan Bollen Old Dominion University Department of Computer Science jbollen@cs.odu.edu http://www.cs.odu.edu/ jbollen April 10, 2003 Page 1 WWW retrieval Two approaches

More information

A Modified Algorithm to Handle Dangling Pages using Hypothetical Node

A Modified Algorithm to Handle Dangling Pages using Hypothetical Node A Modified Algorithm to Handle Dangling Pages using Hypothetical Node Shipra Srivastava Student Department of Computer Science & Engineering Thapar University, Patiala, 147001 (India) Rinkle Rani Aggrawal

More information

Bow-tie Decomposition in Directed Graphs

Bow-tie Decomposition in Directed Graphs 14th International Conference on Information Fusion Chicago, Illinois, USA, July 5-8, 2011 Bow-tie Decomposition in Directed Graphs Rong Yang Dept. of Mathematics and Computer Science Western Kentucky

More information

A STUDY ON THE EVOLUTION OF THE WEB

A STUDY ON THE EVOLUTION OF THE WEB A STUDY ON THE EVOLUTION OF THE WEB Alexandros Ntoulas, Junghoo Cho, Hyun Kyu Cho 2, Hyeonsung Cho 2, and Young-Jo Cho 2 Summary We seek to gain improved insight into how Web search engines should cope

More information

CSE 258 Lecture 12. Web Mining and Recommender Systems. Social networks

CSE 258 Lecture 12. Web Mining and Recommender Systems. Social networks CSE 258 Lecture 12 Web Mining and Recommender Systems Social networks Social networks We ve already seen networks (a little bit) in week 3 i.e., we ve studied inference problems defined on graphs, and

More information

Link Analysis from Bing Liu. Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data, Springer and other material.

Link Analysis from Bing Liu. Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data, Springer and other material. Link Analysis from Bing Liu. Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data, Springer and other material. 1 Contents Introduction Network properties Social network analysis Co-citation

More information

Drawing power law graphs

Drawing power law graphs Drawing power law graphs Reid Andersen Fan Chung Linyuan Lu Abstract We present methods for drawing graphs that arise in various information networks. It has been noted that many realistic graphs have

More information

Topology Generation for Web Communities Modeling

Topology Generation for Web Communities Modeling Topology Generation for Web Communities Modeling György Frivolt and Mária Bieliková Institute of Informatics and Software Engineering Faculty of Informatics and Information Technologies Slovak University

More information

Modeling Query-Based Access to Text Databases

Modeling Query-Based Access to Text Databases Modeling Query-Based Access to Text Databases Eugene Agichtein Panagiotis Ipeirotis Luis Gravano Columbia University {eugene,pirot,gravano}@cs.columbia.edu ABSTRACT Searchable text databases abound on

More information

Mathematics of networks. Artem S. Novozhilov

Mathematics of networks. Artem S. Novozhilov Mathematics of networks Artem S. Novozhilov August 29, 2013 A disclaimer: While preparing these lecture notes, I am using a lot of different sources for inspiration, which I usually do not cite in the

More information

Example for calculation of clustering coefficient Node N 1 has 8 neighbors (red arrows) There are 12 connectivities among neighbors (blue arrows)

Example for calculation of clustering coefficient Node N 1 has 8 neighbors (red arrows) There are 12 connectivities among neighbors (blue arrows) Example for calculation of clustering coefficient Node N 1 has 8 neighbors (red arrows) There are 12 connectivities among neighbors (blue arrows) Average clustering coefficient of a graph Overall measure

More information

A STUDY OF RANKING ALGORITHM USED BY VARIOUS SEARCH ENGINE

A STUDY OF RANKING ALGORITHM USED BY VARIOUS SEARCH ENGINE A STUDY OF RANKING ALGORITHM USED BY VARIOUS SEARCH ENGINE Bohar Singh 1, Gursewak Singh 2 1, 2 Computer Science and Application, Govt College Sri Muktsar sahib Abstract The World Wide Web is a popular

More information

Dynamic network generative model

Dynamic network generative model Dynamic network generative model Habiba, Chayant Tantipathanananandh, Tanya Berger-Wolf University of Illinois at Chicago. In this work we present a statistical model for generating realistic dynamic networks

More information

Review: Searching the Web [Arasu 2001]

Review: Searching the Web [Arasu 2001] Review: Searching the Web [Arasu 2001] Gareth Cronin University of Auckland gareth@cronin.co.nz The authors of Searching the Web present an overview of the state of current technologies employed in the

More information

Lesson 4. Random graphs. Sergio Barbarossa. UPC - Barcelona - July 2008

Lesson 4. Random graphs. Sergio Barbarossa. UPC - Barcelona - July 2008 Lesson 4 Random graphs Sergio Barbarossa Graph models 1. Uncorrelated random graph (Erdős, Rényi) N nodes are connected through n edges which are chosen randomly from the possible configurations 2. Binomial

More information

World Wide Web has specific challenges and opportunities

World Wide Web has specific challenges and opportunities 6. Web Search Motivation Web search, as offered by commercial search engines such as Google, Bing, and DuckDuckGo, is arguably one of the most popular applications of IR methods today World Wide Web has

More information

Science foundation under Grant No , DARPA award N , TCS Inc., and DIMI matching fund DIM

Science foundation under Grant No , DARPA award N , TCS Inc., and DIMI matching fund DIM A Simple Conceptual Model for the Internet Topology Sudhir L. Tauro Christopher Palmer Georgos Siganos Michalis Faloutsos U.C. Riverside C.M.U. U.C.Riverside U.C.Riverside Dept. of Comp. Science Dept.

More information

ROBERTO BATTITI, MAURO BRUNATO. The LION Way: Machine Learning plus Intelligent Optimization. LIONlab, University of Trento, Italy, Apr 2015

ROBERTO BATTITI, MAURO BRUNATO. The LION Way: Machine Learning plus Intelligent Optimization. LIONlab, University of Trento, Italy, Apr 2015 ROBERTO BATTITI, MAURO BRUNATO. The LION Way: Machine Learning plus Intelligent Optimization. LIONlab, University of Trento, Italy, Apr 2015 http://intelligentoptimization.org/lionbook Roberto Battiti

More information

on the WorldWideWeb Abstract. The pages and hyperlinks of the World Wide Web may be

on the WorldWideWeb Abstract. The pages and hyperlinks of the World Wide Web may be Average-clicks: A New Measure of Distance on the WorldWideWeb Yutaka Matsuo 12,Yukio Ohsawa 23, and Mitsuru Ishizuka 1 1 University oftokyo, Hongo 7-3-1, Bunkyo-ku, Tokyo 113-8656, JAPAN, matsuo@miv.t.u-tokyo.ac.jp,

More information

On the Origin of Power Laws in Internet Topologies Λ

On the Origin of Power Laws in Internet Topologies Λ On the Origin of Power Laws in Internet Topologies Λ Alberto Medina Ibrahim Matta John Byers Computer Science Department Boston University Boston, MA 5 famedina, matta, byersg@cs.bu.edu ABSTRACT Recent

More information

An Evolving Network Model With Local-World Structure

An Evolving Network Model With Local-World Structure The Eighth International Symposium on Operations Research and Its Applications (ISORA 09) Zhangjiajie, China, September 20 22, 2009 Copyright 2009 ORSC & APORC, pp. 47 423 An Evolving Network odel With

More information

1 Homophily and assortative mixing

1 Homophily and assortative mixing 1 Homophily and assortative mixing Networks, and particularly social networks, often exhibit a property called homophily or assortative mixing, which simply means that the attributes of vertices correlate

More information

How Do Real Networks Look? Networked Life NETS 112 Fall 2014 Prof. Michael Kearns

How Do Real Networks Look? Networked Life NETS 112 Fall 2014 Prof. Michael Kearns How Do Real Networks Look? Networked Life NETS 112 Fall 2014 Prof. Michael Kearns Roadmap Next several lectures: universal structural properties of networks Each large-scale network is unique microscopically,

More information

Lecture Notes: Social Networks: Models, Algorithms, and Applications Lecture 28: Apr 26, 2012 Scribes: Mauricio Monsalve and Yamini Mule

Lecture Notes: Social Networks: Models, Algorithms, and Applications Lecture 28: Apr 26, 2012 Scribes: Mauricio Monsalve and Yamini Mule Lecture Notes: Social Networks: Models, Algorithms, and Applications Lecture 28: Apr 26, 2012 Scribes: Mauricio Monsalve and Yamini Mule 1 How big is the Web How big is the Web? In the past, this question

More information

MIDTERM EXAMINATION Networked Life (NETS 112) November 21, 2013 Prof. Michael Kearns

MIDTERM EXAMINATION Networked Life (NETS 112) November 21, 2013 Prof. Michael Kearns MIDTERM EXAMINATION Networked Life (NETS 112) November 21, 2013 Prof. Michael Kearns This is a closed-book exam. You should have no material on your desk other than the exam itself and a pencil or pen.

More information

Information Retrieval Lecture 4: Web Search. Challenges of Web Search 2. Natural Language and Information Processing (NLIP) Group

Information Retrieval Lecture 4: Web Search. Challenges of Web Search 2. Natural Language and Information Processing (NLIP) Group Information Retrieval Lecture 4: Web Search Computer Science Tripos Part II Simone Teufel Natural Language and Information Processing (NLIP) Group sht25@cl.cam.ac.uk (Lecture Notes after Stephen Clark)

More information

Information Networks: PageRank

Information Networks: PageRank Information Networks: PageRank Web Science (VU) (706.716) Elisabeth Lex ISDS, TU Graz June 18, 2018 Elisabeth Lex (ISDS, TU Graz) Links June 18, 2018 1 / 38 Repetition Information Networks Shape of the

More information

Graph Types. Peter M. Kogge. Graphs Types. Types of Graphs. Graphs: Sets (V,E) where E= {(u,v)}

Graph Types. Peter M. Kogge. Graphs Types. Types of Graphs. Graphs: Sets (V,E) where E= {(u,v)} Graph Types Peter M. Kogge Please Sir, I want more 1 Types of Graphs Graphs: Sets (V,E) where E= {(u,v)} Undirected: (u,v) = (v,u) Directed: (u,v)!= (v,u) Networks: Graphs with weights Multi-graphs: multiple

More information

ECS 289 / MAE 298, Lecture 9 April 29, Web search and decentralized search on small-worlds

ECS 289 / MAE 298, Lecture 9 April 29, Web search and decentralized search on small-worlds ECS 289 / MAE 298, Lecture 9 April 29, 2014 Web search and decentralized search on small-worlds Announcements HW2 and HW2b now posted: Due Friday May 9 Vikram s ipython and NetworkX notebooks posted Project

More information

Ranking web pages using machine learning approaches

Ranking web pages using machine learning approaches University of Wollongong Research Online Faculty of Informatics - Papers (Archive) Faculty of Engineering and Information Sciences 2008 Ranking web pages using machine learning approaches Sweah Liang Yong

More information

Constructing a G(N, p) Network

Constructing a G(N, p) Network Random Graph Theory Dr. Natarajan Meghanathan Professor Department of Computer Science Jackson State University, Jackson, MS E-mail: natarajan.meghanathan@jsums.edu Introduction At first inspection, most

More information

Constructing a G(N, p) Network

Constructing a G(N, p) Network Random Graph Theory Dr. Natarajan Meghanathan Associate Professor Department of Computer Science Jackson State University, Jackson, MS E-mail: natarajan.meghanathan@jsums.edu Introduction At first inspection,

More information

Summary: What We Have Learned So Far

Summary: What We Have Learned So Far Summary: What We Have Learned So Far small-world phenomenon Real-world networks: { Short path lengths High clustering Broad degree distributions, often power laws P (k) k γ Erdös-Renyi model: Short path

More information

Dynamic Visualization of Hubs and Authorities during Web Search

Dynamic Visualization of Hubs and Authorities during Web Search Dynamic Visualization of Hubs and Authorities during Web Search Richard H. Fowler 1, David Navarro, Wendy A. Lawrence-Fowler, Xusheng Wang Department of Computer Science University of Texas Pan American

More information

Automatic Identification of User Goals in Web Search [WWW 05]

Automatic Identification of User Goals in Web Search [WWW 05] Automatic Identification of User Goals in Web Search [WWW 05] UichinLee @ UCLA ZhenyuLiu @ UCLA JunghooCho @ UCLA Presenter: Emiran Curtmola@ UC San Diego CSE 291 4/29/2008 Need to improve the quality

More information

Math 443/543 Graph Theory Notes 10: Small world phenomenon and decentralized search

Math 443/543 Graph Theory Notes 10: Small world phenomenon and decentralized search Math 443/543 Graph Theory Notes 0: Small world phenomenon and decentralized search David Glickenstein November 0, 008 Small world phenomenon The small world phenomenon is the principle that all people

More information

CS Search Engine Technology

CS Search Engine Technology CS236620 - Search Engine Technology Ronny Lempel Winter 2008/9 The course consists of 14 2-hour meetings, divided into 4 main parts. It aims to cover both engineering and theoretical aspects of search

More information

CS224W Project Write-up Static Crawling on Social Graph Chantat Eksombatchai Norases Vesdapunt Phumchanit Watanaprakornkul

CS224W Project Write-up Static Crawling on Social Graph Chantat Eksombatchai Norases Vesdapunt Phumchanit Watanaprakornkul 1 CS224W Project Write-up Static Crawling on Social Graph Chantat Eksombatchai Norases Vesdapunt Phumchanit Watanaprakornkul Introduction Our problem is crawling a static social graph (snapshot). Given

More information

Graph Mining and Social Network Analysis

Graph Mining and Social Network Analysis Graph Mining and Social Network Analysis Data Mining and Text Mining (UIC 583 @ Politecnico di Milano) References q Jiawei Han and Micheline Kamber, "Data Mining: Concepts and Techniques", The Morgan Kaufmann

More information

Link Structure Analysis

Link Structure Analysis Link Structure Analysis Kira Radinsky All of the following slides are courtesy of Ronny Lempel (Yahoo!) Link Analysis In the Lecture HITS: topic-specific algorithm Assigns each page two scores a hub score

More information

The Structure of Information Networks. Jon Kleinberg. Cornell University

The Structure of Information Networks. Jon Kleinberg. Cornell University The Structure of Information Networks Jon Kleinberg Cornell University 1 TB 1 GB 1 MB How much information is there? Wal-Mart s transaction database Library of Congress (text) World Wide Web (large snapshot,

More information

Lecture 8: Linkage algorithms and web search

Lecture 8: Linkage algorithms and web search Lecture 8: Linkage algorithms and web search Information Retrieval Computer Science Tripos Part II Ronan Cummins 1 Natural Language and Information Processing (NLIP) Group ronan.cummins@cl.cam.ac.uk 2017

More information

Abstract. 1. Introduction

Abstract. 1. Introduction A Visualization System using Data Mining Techniques for Identifying Information Sources on the Web Richard H. Fowler, Tarkan Karadayi, Zhixiang Chen, Xiaodong Meng, Wendy A. L. Fowler Department of Computer

More information

Decentralized Search

Decentralized Search Link Analysis and Decentralized Search Markus Strohmaier, Denis Helic Multimediale l Informationssysteme t II 1 The Memex (1945) The Memex [Bush 1945]: B A mechanized private library for individual use

More information

arxiv:cond-mat/ v1 [cond-mat.dis-nn] 3 Aug 2000

arxiv:cond-mat/ v1 [cond-mat.dis-nn] 3 Aug 2000 Error and attack tolerance of complex networks arxiv:cond-mat/0008064v1 [cond-mat.dis-nn] 3 Aug 2000 Réka Albert, Hawoong Jeong, Albert-László Barabási Department of Physics, University of Notre Dame,

More information

On the Origin of Power Laws in Internet Topologies Λ Alberto Medina Ibrahim Matta John Byers amedina@cs.bu.edu matta@cs.bu.edu byers@cs.bu.edu Computer Science Department Boston University Boston, MA 02215

More information

10/10/13. Traditional database system. Information Retrieval. Information Retrieval. Information retrieval system? Information Retrieval Issues

10/10/13. Traditional database system. Information Retrieval. Information Retrieval. Information retrieval system? Information Retrieval Issues COS 597A: Principles of Database and Information Systems Information Retrieval Traditional database system Large integrated collection of data Uniform access/modifcation mechanisms Model of data organization

More information

Mathematical Analysis of Google PageRank

Mathematical Analysis of Google PageRank INRIA Sophia Antipolis, France Ranking Answers to User Query Ranking Answers to User Query How a search engine should sort the retrieved answers? Possible solutions: (a) use the frequency of the searched

More information

PageRank and related algorithms

PageRank and related algorithms PageRank and related algorithms PageRank and HITS Jacob Kogan Department of Mathematics and Statistics University of Maryland, Baltimore County Baltimore, Maryland 21250 kogan@umbc.edu May 15, 2006 Basic

More information

Wednesday, March 8, Complex Networks. Presenter: Jirakhom Ruttanavakul. CS 790R, University of Nevada, Reno

Wednesday, March 8, Complex Networks. Presenter: Jirakhom Ruttanavakul. CS 790R, University of Nevada, Reno Wednesday, March 8, 2006 Complex Networks Presenter: Jirakhom Ruttanavakul CS 790R, University of Nevada, Reno Presented Papers Emergence of scaling in random networks, Barabási & Bonabeau (2003) Scale-free

More information

The Shape of the Internet. Slides assembled by Jeff Chase Duke University (thanks to Vishal Misra and C. Faloutsos)

The Shape of the Internet. Slides assembled by Jeff Chase Duke University (thanks to Vishal Misra and C. Faloutsos) The Shape of the Internet Slides assembled by Jeff Chase Duke University (thanks to Vishal Misra and C. Faloutsos) The Shape of the Network Characterizing shape : AS-level topology: who connects to whom

More information

Attack Vulnerability of Network with Duplication-Divergence Mechanism

Attack Vulnerability of Network with Duplication-Divergence Mechanism Commun. Theor. Phys. (Beijing, China) 48 (2007) pp. 754 758 c International Academic Publishers Vol. 48, No. 4, October 5, 2007 Attack Vulnerability of Network with Duplication-Divergence Mechanism WANG

More information

CSE 158 Lecture 11. Web Mining and Recommender Systems. Social networks

CSE 158 Lecture 11. Web Mining and Recommender Systems. Social networks CSE 158 Lecture 11 Web Mining and Recommender Systems Social networks Assignment 1 Due 5pm next Monday! (Kaggle shows UTC time, but the due date is 5pm, Monday, PST) Assignment 1 Assignment 1 Social networks

More information

Link Analysis in Web Information Retrieval

Link Analysis in Web Information Retrieval Link Analysis in Web Information Retrieval Monika Henzinger Google Incorporated Mountain View, California monika@google.com Abstract The analysis of the hyperlink structure of the web has led to significant

More information

The Directed Closure Process in Hybrid Social-Information Networks, with an Analysis of Link Formation on Twitter

The Directed Closure Process in Hybrid Social-Information Networks, with an Analysis of Link Formation on Twitter Proceedings of the Fourth International AAAI Conference on Weblogs and Social Media The Directed Closure Process in Hybrid Social-Information Networks, with an Analysis of Link Formation on Twitter Daniel

More information

Information Retrieval Spring Web retrieval

Information Retrieval Spring Web retrieval Information Retrieval Spring 2016 Web retrieval The Web Large Changing fast Public - No control over editing or contents Spam and Advertisement How big is the Web? Practically infinite due to the dynamic

More information

A P2P-based Incremental Web Ranking Algorithm

A P2P-based Incremental Web Ranking Algorithm A P2P-based Incremental Web Ranking Algorithm Sumalee Sangamuang Pruet Boonma Juggapong Natwichai Computer Engineering Department Faculty of Engineering, Chiang Mai University, Thailand sangamuang.s@gmail.com,

More information

Properties of Biological Networks

Properties of Biological Networks Properties of Biological Networks presented by: Ola Hamud June 12, 2013 Supervisor: Prof. Ron Pinter Based on: NETWORK BIOLOGY: UNDERSTANDING THE CELL S FUNCTIONAL ORGANIZATION By Albert-László Barabási

More information

Extracting Information from Complex Networks

Extracting Information from Complex Networks Extracting Information from Complex Networks 1 Complex Networks Networks that arise from modeling complex systems: relationships Social networks Biological networks Distinguish from random networks uniform

More information

Centralities (4) By: Ralucca Gera, NPS. Excellence Through Knowledge

Centralities (4) By: Ralucca Gera, NPS. Excellence Through Knowledge Centralities (4) By: Ralucca Gera, NPS Excellence Through Knowledge Some slide from last week that we didn t talk about in class: 2 PageRank algorithm Eigenvector centrality: i s Rank score is the sum

More information

Degree Distribution: The case of Citation Networks

Degree Distribution: The case of Citation Networks Network Analysis Degree Distribution: The case of Citation Networks Papers (in almost all fields) refer to works done earlier on same/related topics Citations A network can be defined as Each node is a

More information

Algorithms, Games, and Networks February 21, Lecture 12

Algorithms, Games, and Networks February 21, Lecture 12 Algorithms, Games, and Networks February, 03 Lecturer: Ariel Procaccia Lecture Scribe: Sercan Yıldız Overview In this lecture, we introduce the axiomatic approach to social choice theory. In particular,

More information