Effectively Maintaining Multiple View Consistency in Web Warehouses

Yan Zhang
Center for Information Sciences, Peking University, Beijing, China
zhy@cis.pku.edu.cn

Xiangdong Qin
Department of Science and Technology, Hebei University, Baoding, China
qinxd@mail.hbu.edu.cn

Abstract

To make a web warehouse reflect the real web accurately, we must keep its webviews timely, fresh and consistent. This paper focuses on one important part of webview maintenance: mutual consistency between webviews, which we formally call multiple webview consistency (abbreviated as MVC). Although this problem has been well studied for traditional data warehouses, it has never been considered for web warehouses. Unlike the sources of a traditional data warehouse, data sources in a web warehousing environment do not propagate base data changes to information consumers, so it is not feasible to keep complete mutual consistency between the webviews in a web warehouse. In this paper we introduce Interrelated-MVC, a new notion based on the features of the web environment, and show that maintaining Interrelated-MVC is sufficient in web warehouses. We then present algorithms that guarantee Interrelated-MVC and scale to the rapidly changing real web.

1. Introduction

1.1. Motivation

The web has recently become the widest and most heterogeneous set of information sources, accessible to a large spectrum of users; a computer and an Internet connection are the only prerequisites. However, it is difficult to locate specific information on the web because of its instability and the lack of any global structure and organization over the data it contains. The deep web [2], 500 times larger than the surface web, makes things worse.
Web warehouses [5, 8, 10, 13] address these problems by using webviews to collect and integrate different information, which makes them very helpful in many research areas, especially Online Analytical Processing and Decision Support Systems. People have used mediated views to integrate information from heterogeneous sources since Gio Wiederhold proposed the mediated architecture in 1991 [12]. In web warehouses, a webview is a materialized view usually defined by a query (generally a query related to a specified topic). Webviews enable users interested in specified topics to query a single data structure instead of issuing many queries on different structures [13]. Another advantage is that keeping the views materialized can significantly improve query performance [8].

To make the web warehouse reflect the real web more accurately, we must keep the webviews timely, fresh and consistent. This paper focuses on one important part of webview maintenance: mutual consistency between webviews, which we formally call multiple webview consistency (abbreviated as MVC) in our research (we borrow this term from [18]). MVC is actually a familiar notion. For example, suppose we have a web warehouse that integrates the information for all the electronic products on amazon.com, shopping.com, bustbuy.com, yahoo.com, etc. When we shop online, we usually begin by querying related information, such as the maximum price, the minimum price and the average price of a product, and then choose the best offer. However, if the warehouse tells us that the maximum price for a given product is 2400 RMB while its average price is 2500 RMB, we will inevitably be confused. The reason may be very simple: the warehouse did not update the average price together with the maximum price.

The following example illustrates MVC clearly. Suppose a web warehouse system WR is composed of five materialized views: $V_1$, $V_2$, $V_3$, $V_4$ and $V_5$.
$V_1 = R \bowtie S$, $V_2 = R \bowtie U$, $V_3 = T \bowtie Q$, $V_4 = Q$, $V_5 = U \bowtie Y$, as shown in Figure 1. When we detect that $S$ has changed, we should refresh $V_1$. At the same time, we should check whether the base data item $R$ has changed. If so, we should refresh $V_2$ together with $V_1$; otherwise, users who access both $V_1$ and $V_2$ may get conflicting information. For the same reason, we may need to refresh $V_5$ together with $V_1$ and $V_2$. In fact, the base data items $R$, $S$, $U$ and $Y$ constitute one group: a change to any member of this group can cause all of the derived views to be refreshed.

Figure 1: The webviews constitute two disjoint groups. Changes of the base data in one group can lead to all derived views refreshing.

The simplest solution for MVC is to use an integrator process that processes all the updates of materialized views in sequence. A web warehouse is based on a polling pattern: web data sources do not propagate their changes, so the warehouse must poll the data sources to learn what has changed [4]. Therefore, when a change is detected, the web warehouse can find all the correlated views and refresh them one by one. After they have been refreshed, the integrator process submits a transaction to the web warehouse. When the transaction is committed, the integrator process begins to process the next detected change. This solution does guarantee MVC, but it allows no parallelism and is not suitable for the volatile, fast-changing web environment.

1.2. Our Contributions

To solve the MVC problem, we make a tradeoff between data consistency and system parallelism. In a web warehouse as shown in Figure 2, each data source has its own monitor/wrapper, which is responsible for detecting data changes. When the monitor/wrapper finds a change of base data, it sends the change information to the integrator. The integrator assigns each change a number in arrival order and passes the number to the merge process. At the same time, the integrator forwards the change information to the relevant view managers. Each view has its own manager, which handles the delta computation or complete re-computation for the view. When a view manager receives change information for base data, it computes the resulting changes to the view and then sends a list of actions to the merge process. The merge process collects all of the actions and holds them until all affected views finish their processing. It then forwards all the actions to the web warehouse in a single transaction. The merge process makes sure that the transactions are committed in sequence.
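The hold-then-commit behaviour of the merge process described above can be sketched as follows. This is only an illustrative sketch, not the authors' implementation: the class `MergeProcess` and its methods `register` and `receive` are hypothetical names, and it assumes the integrator tells the merge process which views each numbered change affects.

```python
from dataclasses import dataclass, field


@dataclass
class MergeProcess:
    """Hypothetical merge process: buffers action lists per numbered change."""
    pending: dict = field(default_factory=dict)    # seq -> views still computing
    actions: dict = field(default_factory=dict)    # seq -> {view: action list}
    committed: list = field(default_factory=list)  # (seq, actions) in commit order
    next_seq: int = 1                              # next number allowed to commit

    def register(self, seq, affected_views):
        # Called when the integrator numbers a change (assumed interface).
        self.pending[seq] = set(affected_views)
        self.actions[seq] = {}

    def receive(self, seq, view, action_list):
        # Called when a view manager finishes its computation for change `seq`.
        self.actions[seq][view] = action_list
        self.pending[seq].discard(view)
        self._commit_ready()

    def _commit_ready(self):
        # Commit one transaction per change, strictly in the integrator's order,
        # and only once every affected view has delivered its action list.
        while self.next_seq in self.pending and not self.pending[self.next_seq]:
            seq = self.next_seq
            self.committed.append((seq, self.actions.pop(seq)))
            del self.pending[seq]
            self.next_seq += 1


merge = MergeProcess()
merge.register(1, ["V1", "V2"])
merge.register(2, ["V3"])
merge.receive(2, "V3", ["refresh V3"])   # complete, but held behind change 1
held = list(merge.committed)             # nothing has been committed yet
merge.receive(1, "V1", ["refresh V1"])
merge.receive(1, "V2", ["refresh V2"])   # change 1 commits, then change 2
order = [seq for seq, _ in merge.committed]
```

Committing in one global order is the simplest reading of "committed in sequence"; it trades parallelism for simplicity, which is the tension the rest of the paper addresses.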
The remainder of this paper is organized as follows. Section 2 gives a brief description of the related work. Section 3 formally introduces the definitions of multiple view consistency and some related terms. Section 4 provides the algorithms to keep MVC, followed by the details of MVC maintenance discussed in Section 5. Finally, we conclude our paper and outline some future work in Section 6.

Figure 2: The architecture of a web warehouse.

2. Related Work

There are many previous studies of both MVC and SVC (single view consistency, i.e., the consistency between a view and its base data) in traditional data warehouses. For example, Hull and Zhou present Squirrel mediators, which support views that integrate data from multiple data sources [9]. They mainly discuss SVC in their system, but they also note that MVC can be achieved by sequencing the propagation of each source update. Zhuge et al. introduce the concept of queuing the view updates at the data warehouse and demonstrate their method Strobe, which commits the updates to the data warehouse only when the unanswered query set is empty [16]. However, the Strobe algorithm suffers from the potential threat of infinite waiting, i.e., the data warehouse extent may never get updated. Agrawal et al. use special detection methods for concurrent updates that do not need a global time stamp and require no quiescent state before being able to update the data warehouse [1]. Although most algorithms, such as PVM [14], ECA [15], and Strobe [16], are designed for SVC, some of them can be extended to handle MVC [11]. There are also some studies focused specifically on MVC [6, 7, 17]. For example, Zhuge et al. define the multiple view consistency problem as keeping multiple views consistent with each other and present a comprehensive discussion of MVC in the data warehousing environment [17].

Although previous research has studied both MVC and SVC well, these studies do not consider the difference between traditional data warehouses and web warehouses. Traditional data warehouses usually know all the changes of base data, while the data sources in a web environment usually do not propagate their changes to consumers. Therefore, a web warehouse can only use a polling mechanism to detect the changes of base data [4]. This characteristic makes MVC in web warehouses very different from MVC in traditional data warehouses. We believe that the problem discussed in this paper has not been addressed in previous research.

3. Multiple View Consistency and Related Terms

In this section, we first introduce the definition of Complete-MVC. We show that although it works well in a traditional data warehousing environment, it is not feasible in a web warehouse. Therefore, we propose Interrelated-MVC, which is practical and efficient in web warehouses.

Definition 1 (Complete-MVC): When there are multiple views in the web warehouse, a warehouse state $ws$ is a vector with one element for the state of each webview. Each warehouse view maintenance transaction updates one or more views. The warehouse state advances after each warehouse transaction. Assume the sequence of warehouse maintenance transactions yields a sequence of warehouse states $W_{seq} = ws_0, ws_1, \ldots, ws_n$. We say that $ws_j$ is complete-multiple-view-consistent with source state $ss_i$, written $ws_j = ss_i$, if and only if for each view $V$ at the state $ws_j$, its content is the same as $V(ss_i)$, where $V(ss_i)$ represents the result of evaluating the expression of $V$ at source state $ss_i$.

Definition 1 is similar to the definition of MVC in [17]. Both require that all the views in the system be kept consistent. This is reasonable and feasible in a traditional data warehousing environment, because that environment is based on source update propagation and the warehouse is able to learn of every change to the base data.
However, data sources in a web environment are autonomous and independent. Web warehouses can only learn of changes to the base data by detecting them. A web warehouse cannot guarantee that the base data have not changed even if it has not detected any change, so under Complete-MVC it would have to refresh all the views to make sure each affected view is refreshed [3]. It is impossible in practice to refresh all the views in the whole system just because of a minor change of base data. In fact, it is not necessary either. The goal of keeping multiple view consistency is to provide correct and consistent data to the users, so what we need to do is to keep the interrelated views consistent, not all the views. Thus we introduce the term Interrelated-MVC. We believe it is enough for a web warehouse to keep consistency at this level.

Definition 2 (Explicit dependence): In a web warehouse, if view $V_2$ must be updated before the update of view $V_1$ for consistency, we say that $V_1$ explicitly depends on $V_2$.

Definition 3 (Dependent set): In a web warehouse, the dependent set $VE$ of a view $V$ contains all the views it explicitly depends on. Usually the dependent set of a view is assigned explicitly at the time when the view is constructed.

Definition 4 (Implicit dependence): In a web warehouse, if views $V_1$ and $V_2$ share a base data item, we say that $V_1$ and $V_2$ are implicitly interdependent (also called implicitly dependent). The implicit dependence relation is transitive: if $V_1$ and $V_2$ are implicitly interdependent and $V_2$ and $V_3$ are implicitly interdependent, then $V_1$ and $V_3$ are implicitly interdependent.

Definition 5 (Interrelated-MVC): When there are multiple views in a web warehouse, a warehouse state $ws$ is a vector with one element for the state of each view. Each warehouse view maintenance transaction updates one or more views. The warehouse state advances after each warehouse transaction.
Assuming the sequence of warehouse maintenance transactions produces a sequence of warehouse states $W_{seq} = ws_0, ws_1, \ldots, ws_n$, we say that $ws_j$ is interrelated-multiple-view-consistent with source state $ss_i$, written $ws_j = ss_i$, if and only if for each view $V$ at the warehouse state $ws_j$ there is a source state $ss_k$, not later than $ss_i$, such that the content of $V'$ is the same as $V'(ss_k)$ for each view $V'$ explicitly or implicitly dependent on view $V$. In addition, $k = i$ must hold for at least one such $k$.

4. Algorithms for Interrelated-MVC

In the system architecture shown in Figure 2, there are two critical steps for keeping Interrelated-MVC: (1) how the integrator determines the relevant-view set and the base data update operations after it learns of a change of base data; (2) how the merge process organizes the action lists coming from the view managers and submits them to the web warehouse to be committed. We now present the algorithms.

4.1. Determining the Relevant View Set and Base Data Update Set

The algorithm CRU calculates the relevant views and update information, as shown in Table 1.

Table 1: Algorithm CRU.

In the algorithm, iteration 1 always terminates. If most of the views in the web warehouse are constructed from separate base data items, i.e., the base data are not largely shared by views, the iteration stops quickly and the time complexity is not high. However, when the views in the web warehouse are highly interrelated, computing $REL_i$ is expensive. In that case we can reduce the computing complexity by using system partitioning. System partitioning means that the view managers in the system are partitioned into several groups and the action lists generated by the view managers are submitted group by group. User access is prohibited until all correlated transactions are committed. As long as we know the peaks and troughs of user accesses, we can maintain the views smoothly in the troughs (for example, at night or on weekends).

4.2. Organizing and Committing of Action Lists

We use algorithm SADV to organize and commit the action lists of the base data changes. There are three key points in algorithm SADV (shown in Table 2): (1) all the updates of relevant views must be processed at the same time and must be submitted in a single transaction; (2) if two transactions do not share views, their commit orders are independent; (3) if two transactions update the same view(s), they must be committed in the order that the integrator assigns.

4.3. Characteristics of Algorithm CRU and SADV

Theorem 1. The algorithm CRU+SADV is consistent under Interrelated-MVC.

Proof. The algorithm CRU starts to calculate $REL_i$ from a base data change $C_i$, and all the related explicitly and implicitly dependent webviews are included in the final result $REL_i$. According to the algorithm SADV, all the update actions on these webviews are grouped into one transaction by the merge process and committed to the system. If two transactions update the same webview, they are committed strictly in chronological order. Therefore, the algorithm CRU+SADV can guarantee Interrelated-MVC.
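To make the two steps concrete, the following is a minimal sketch of the kind of closure computation and conflict rule described in Sections 4.1 and 4.2. It is not a reproduction of Table 1 or Table 2: the functions `relevant_set` and `conflicts`, the dictionary layout of `view_defs` and `explicit_deps`, and the polling callback `has_changed` are all assumptions made for illustration.

```python
def relevant_set(changed_item, view_defs, explicit_deps, has_changed):
    """Return (views to refresh together, base items known to have changed).

    view_defs:     {view: set of base data items it is built from}
    explicit_deps: {view: set of views it explicitly depends on}
    has_changed:   callable polling a base data item (assumed interface)
    """
    updates = {changed_item}
    relevant = set()
    queue = [v for v, items in view_defs.items() if changed_item in items]
    while queue:
        view = queue.pop()
        if view in relevant:
            continue  # each view is processed once, so the loop terminates
        relevant.add(view)
        # Views this one explicitly depends on are refreshed with it.
        queue.extend(explicit_deps.get(view, ()))
        # Poll the view's other base items; a changed item pulls in
        # every view that shares it.
        for item in view_defs[view]:
            if item not in updates and has_changed(item):
                updates.add(item)
                queue.extend(v for v, its in view_defs.items() if item in its)
    return relevant, updates


def conflicts(rel_a, rel_b):
    """Two transactions need an ordered commit only if they share a view."""
    return bool(set(rel_a) & set(rel_b))


# The five views of the running example.
view_defs = {
    "V1": {"R", "S"}, "V2": {"R", "U"}, "V3": {"T", "Q"},
    "V4": {"Q"}, "V5": {"U", "Y"},
}
# S was detected as changed; polling then finds that R changed as well.
rel, upd = relevant_set("S", view_defs, {}, lambda item: item in {"S", "R"})
```

In this run `rel` contains `V1` and `V2` and `upd` contains `S` and `R`, matching the example in the introduction; a later change touching only `Q` would yield a view set that does not conflict with it.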
Given an initial web warehouse state, which is consistent, if an algorithm can generate a consistent web warehouse state sequence for a specified webview set, this algorithm is completely consistent for this webview set.

Table 2: Algorithm SADV.

Suppose the Monitor detects a base data change $C_i$ that results in updates to some webviews. The Integrator/Monitor then computes the set $U_i$, which contains all of the base data changes related to $C_i$, in order to keep Interrelated-MVC. Let $SS = U_1, U_2, \ldots, U_f$, where the sequence numbers are given by the Integrator. The Merge Process commits the warehouse transactions and changes the warehouse state using the algorithm SADV. One $U_i$ determines one transaction, so the number of warehouse transactions is $f$. Assume the commit sequence for these transactions is $W = WT_{i_1}, WT_{i_2}, \ldots, WT_{i_f}$, and the corresponding warehouse state sequence is $W_{seq} = ws_{i_0}, ws_{i_1}, ws_{i_2}, \ldots, ws_{i_f}$, in which $ws_{i_0}$ is the initial state. Since the Merge Process may reorder the update actions, the sequence $i_1, i_2, \ldots$ may differ from the sequence $1, 2, \ldots$. According to $W_{seq}$, we can construct the base data change execution sequence $S = U_{i_1}, U_{i_2}, \ldots, U_{i_f}$ and the data source state sequence $S_{seq} = ss_{i_0}, ss_{i_1}, ss_{i_2}, \ldots, ss_{i_f}$, in which $ss_{i_0}$ is the initial state of the data sources.

Let us prove that $S_{seq}$ is a consistent state sequence. We know that $S$ and $SS$ contain the same base data change sets. Two sets $U_i$ and $U_j$ that contain conflicting base data changes change the same base data and update the same webviews. Without loss of generality, assume $U_i$ precedes $U_j$ in the sequence $SS$; then $U_i$ also precedes $U_j$ in the sequence $S$, because $WT_j$ cannot be committed before $WT_i$ in algorithm SADV. Since all of the conflicting base data changes keep the same order in $S$ and $SS$, the two sequences are equivalent. Therefore $S_{seq}$ is consistent. Let us define a mapping $m$ from $W_{seq}$ to $S_{seq}$ with $m(ws_{i_k}) = ss_{i_k}$ for $0 \le k \le f$. According to the assumption, $ws_{i_0} = ss_{i_0}$.
Since each view manager is complete and executes the actions in chronological order, for any webview $V$ implicitly or explicitly influenced by $U_{i_k}$, its content is equal to $V(ss_{i_k})$ at the warehouse state $ws_{i_k}$. Therefore, for any $k$ ($0 \le k \le f$), we have $ws_{i_k} = ss_{i_k}$. Since all the interdependent (explicitly or implicitly) webviews are updated together, when $ws_i < ws_j$ we have $m(ws_i) = ss_i < ss_j = m(ws_j)$. Therefore, $W_{seq}$ is consistent under Interrelated-MVC.

Theorem 2. The algorithm is prompt, i.e., it does not introduce any unnecessary delay for any update of the views.

Proof. This is straightforward: non-conflicting transactions are committed in parallel in algorithm SADV.

5. Maintenance of Multiple View Consistency in System Refreshing

During system refreshing, we can partition the views into groups according to their dependence in order to keep MVC while achieving high parallelism. The refreshing tasks of different groups can run in different threads or processes, while the mutual consistency between interrelated views is preserved. The algorithms are shown in Table 3 and Table 4.
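As a rough illustration of running the refreshing tasks of different groups in separate threads, the sketch below refreshes the views of one group sequentially and distinct groups concurrently. It is an assumption-laden sketch rather than algorithm FRV or WWP: `refresh_group`, `refresh_all` and the `refresh_view` callback are hypothetical names, and the groups are taken as already computed.

```python
from concurrent.futures import ThreadPoolExecutor


def refresh_group(group, refresh_view):
    # Views of one group are refreshed one after another, so interrelated
    # views are never refreshed by two competing threads.
    return [refresh_view(view) for view in group]


def refresh_all(groups, refresh_view, max_workers=4):
    # Different groups share no base data, so they can run concurrently.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the order of `groups`.
        return list(pool.map(lambda g: refresh_group(g, refresh_view), groups))


# The two groups of the running example; refresh_view is a stand-in.
groups = [["V1", "V2", "V5"], ["V3", "V4"]]
results = refresh_all(groups, lambda view: view + " refreshed")
```

A real view manager would replace the stand-in callback with its delta computation or complete re-computation.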

Table 3: Algorithm FRV.

Table 4: Algorithm WWP.

Step 2 in algorithm FRV (Table 3) always terminates: the algorithm processes at most all the views in the web warehouse. The result of algorithm WWP (Table 4) is a set of REL(V) partitions. We can regard a REL(V) partition as a V-BD (View-BaseData) hypergraph. The nodes are the views, and the edges connecting the nodes are the base data items. There may be several edges between two nodes. Each partition is internally connected. In fact, partitioning a web warehouse amounts to finding the maximal connected V-BD sub-graphs. As shown in Figure 3, we obtain $k$ partitions of the web warehouse system. Each partition has some view managers, and they share one merge process.

Figure 3: Web warehouse partitioning.

When the system refreshes the webviews, the influence is strictly limited to the partition. Views in different partitions do not affect each other. This gives us two maintenance methods to guarantee MVC. The first method is to process the updates inside a partition sequentially, according to the numbers that the integrator assigns, while updates in different partitions are processed in parallel. Here an update means a change to the base data together with the correlated operations. This method is simple and achieves some parallelism. The second method is to process the views inside a partition according to the algorithms CRU/SADV instead of processing them simply in sequence. This method works for large and complex partitions. The computing complexity is not very high, since the views that need to be processed are limited to one partition rather than the whole warehouse.

When refreshing a view $V_i$, checks only need to be applied inside the partition that contains $V_i$; there is no need to check outside the partition. We can get the view set that needs to be refreshed by checking the V-BD graph of the partition: start from $V_i$, check its edges and prune those that are not affected; then check the nodes connected to it in the same way. Each node is checked only once. After checking all the connected nodes, we get the exact view set that needs to be refreshed along with $V_i$.

When adding a view to the warehouse, we consider the relations between the existing partitions and the new view. When a partition implicitly or explicitly depends on the view, the view should be added to that partition. When two partitions depend on it, they should be merged. If the view has no relation to any existing partition, it becomes a new partition. When purging a view from the warehouse, we should process the partition that contains the view again using Algorithm WWP. Generally we will get some new partitions, but sometimes the original partition remains whole.
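Finding the partitions amounts to computing the connected components of the V-BD graph. The sketch below does this with a union-find structure; it is an illustrative alternative under stated assumptions, not the authors' algorithm WWP. The function `partition_views` and the input layout are hypothetical, and every view named in `explicit_deps` is assumed to appear in `view_defs`.

```python
def partition_views(view_defs, explicit_deps=None):
    """Group views that are connected by shared base items or explicit deps.

    view_defs:     {view: set of base data items}
    explicit_deps: {view: set of views it explicitly depends on} (optional)
    Returns a sorted list of sorted view lists, one per partition.
    """
    parent = {view: view for view in view_defs}

    def find(view):
        # Follow parent links to the representative, halving the path.
        while parent[view] != view:
            parent[view] = parent[parent[view]]
            view = parent[view]
        return view

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    first_user = {}  # base item -> first view seen that uses it
    for view, items in view_defs.items():
        for item in items:
            if item in first_user:
                union(first_user[item], view)  # implicit dependence
            else:
                first_user[item] = view
    for view, deps in (explicit_deps or {}).items():
        for dep in deps:
            union(view, dep)                   # explicit dependence

    groups = {}
    for view in view_defs:
        groups.setdefault(find(view), set()).add(view)
    return sorted(sorted(group) for group in groups.values())


view_defs = {
    "V1": {"R", "S"}, "V2": {"R", "U"}, "V3": {"T", "Q"},
    "V4": {"Q"}, "V5": {"U", "Y"},
}
partitions = partition_views(view_defs)

# Adding a hypothetical view V6 over Y and T relates it to both partitions,
# so they merge; a hypothetical V7 over an unshared item Z stands alone.
merged = partition_views({**view_defs, "V6": {"Y", "T"}})
separate = partition_views({**view_defs, "V7": {"Z"}})
```

For the running example, `partitions` holds the two disjoint groups of Figure 1. Recomputing from scratch on every insertion is the simplest choice; an incremental variant would keep the union-find structure and only union the new view with the views sharing its base items.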

6. Conclusion

In this paper we show how to keep mutual consistency between views in a web warehouse. Since the situation is very different from that of the traditional data warehousing environment, we introduce a new term, Interrelated-MVC, and demonstrate that it is more feasible to keep this form of consistency in a web warehouse. We then present corresponding algorithms that guarantee it. Furthermore, we discuss how to achieve more parallelism at the same time. The algorithms presented in this paper have been validated in our prototype system, which focuses on e-commerce. We expect to examine the performance of our algorithms in a large-scale, practical web integration environment in the near future.

References

[1] D. Agrawal, A. E. Abbadi, A. Singh, and T. Yurek. Efficient view maintenance at data warehouses. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 417-427, 1997.
[2] M. K. Bergman. The deep web: surfacing hidden value. Journal of Electronic Publishing, 7(1), 2001.
[3] J. Cho and H. Garcia-Molina. Synchronizing a database to improve freshness. In SIGMOD '00: Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, pages 117-128, New York, NY, USA, 2000. ACM Press.
[4] J. Cho and H. Garcia-Molina. Estimating frequency of change. ACM Trans. Inter. Tech., 3(3):256-290, 2003.
[5] S. Cluet, P. Veltri, and D. Vodislav. Views in a large scale XML repository. In Proceedings of VLDB '01, 2001.
[6] L. S. Colby, A. Kawaguchi, D. F. Lieuwen, I. S. Mumick, and K. A. Ross. Supporting multiple view maintenance policies. SIGMOD Record, 26(2):405-416, 1997.
[7] L. S. Colby and I. S. Mumick. Staggered maintenance of multiple views. In Workshop on Materialized Views, pages 119-128, 1996.
[8] J. Hirai, S. Raghavan, H. Garcia-Molina, and A. Paepcke. WebBase: A repository of web pages. In Proceedings of the 9th International World Wide Web Conference, pages 277-293, 2000.
[9] R. Hull and G. Zhou. A framework for supporting data integration using the materialized and virtual approaches. In SIGMOD '96: Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data, pages 481-492, New York, NY, USA, 1996. ACM Press.
[10] K. C.-C. Chang, B. He, and Z. Zhang. Metaquerier over the deep web: Shallow integration across holistic sources. In Proceedings of VLDB-IIWeb '04, August 2004.
[11] W. J. Labio, R. Yerneni, and H. Garcia-Molina. Shrinking the warehouse update window. In SIGMOD '99: Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data, pages 383-394, New York, NY, USA, 1999. ACM Press.
[12] G. Wiederhold. Mediators in the architecture of future information systems. IEEE Computer, 25(3):38-49, 1992.
[13] L. Xyleme. A dynamic warehouse for XML data of the web. In IEEE Data Engineering Bulletin, 2001.
[14] X. Zhang, L. Ding, and E. A. Rundensteiner. Parallel multisource view maintenance. The VLDB Journal, 13(1):22-48, 2004.
[15] Y. Zhuge, H. Garcia-Molina, J. Hammer, and J. Widom. View maintenance in a warehousing environment. SIGMOD Rec., 24(2):316-327, 1995.
[16] Y. Zhuge, H. Garcia-Molina, and J. L. Wiener. The Strobe algorithms for multi-source warehouse consistency. In International Conference on Parallel and Distributed Information Systems, pages 146-157, 1996.
[17] Y. Zhuge, H. Garcia-Molina, and J. L. Wiener. Multiple view consistency for data warehousing. In ICDE '97: Proceedings of the Thirteenth International Conference on Data Engineering, pages 289-300, Washington, DC, USA, 1997. IEEE Computer Society.
[18] Y. Zhuge, H. Garcia-Molina, and J. L. Wiener. Consistency algorithms for multi-source warehouse view maintenance. Distrib. Parallel Databases, 6(1):7-40, 1998.