Foundations of Data Warehouse Quality (DWQ)


DWQ
Foundations of Data Warehouse Quality (DWQ)
v.1.1

Document Number: DWQ -- INRIA
Project Name: Foundations of Data Warehouse Quality (DWQ)
Project Number: EP
Title: Designing Data Warehouse Refreshment Systems
Authors: M. Bouzeghoub, F. Fabret, F. Llirbat, M. Matulovic, and E. Simon
Workpackage: WP8
Document Type: Report
Classification: Public
Distribution: DWQ Consortium
Status: Draft
Document file: wp8_refreshment.doc (Word)
Version: 1.0
Date: June 17th, 1997
Number of pages:

Document Change Record

Version    Date          Reason for Change
1.0        Oct. 1997     First Draft

1. Introduction

The refreshment of a data warehouse is an incremental data warehouse process that can be decomposed into four logical activities: (i) extract a source change from a data source, characterizing the changes that have occurred in this source since the last extraction; (ii) clean a source change using some predefined data; (iii) integrate the source changes coming from multiple sources; and (iv) determine which views in the data warehouse need to be updated. This is a logical decomposition whose operational implementation receives many different answers when looking at the state of the data warehouse product market. There is a large variety of data warehousing applications with very different requirements in terms of quality and configuration.

An important problem raised by the study of existing tools for data warehouse refreshment is that they offer few customization facilities with respect to the scheduling of the refreshment process, at both the functionality and performance levels. In fact, each tool provides a fixed set of possibilities intended to cover some «frequent» data warehouse applications. Two cases then occur: (i) the tool cannot handle the requirements of a given application, or (ii) the tool is open enough to accommodate specific user-defined solutions. In the latter case, there are two main problems. First, these tools generally offer poor facilities and almost no methodology to help users engineer their specific implementation of the refreshment system. Typically, the system offers the possibility of defining specific events that will trigger the refreshment process, and one can program an ad hoc solution for the transform and integrate steps. The resulting program can be invoked from within the tool. Thus, although an ad hoc solution can be deployed, no facilities are provided to facilitate its engineering. Second, the tools that are flexible enough to enable the customization of a refreshment system are generally quite complex to use. Therefore, there is a real need for tools that enable a fast and customized development of data warehouse refreshment solutions.

We have already stated that active rules are an appropriate means to implement a data warehouse refreshment process. In the following, we highlight some features of the refreshment process which match features of active rules:

- The refreshment process is an event-driven application; active rules provide a convenient way to specify events, and provide an execution monitor to detect their instances and to compute them whenever they are composite. Different events characterize the refreshment process: we can roughly distinguish data changes (update events) and process checkpoints (monitoring events).

- The refreshment process is a complex system which may be composed of asynchronous activities requiring a certain monitoring task, which is itself an event-driven task.

- The refreshment process evolves frequently, following the evolution of the data sources and of the view definitions; active rules provide a modular specification which makes it easy to modify the refreshment activities in order to adapt them to new requirements or infrastructure modifications.

- There is no single refreshment process which is suitable for all data warehouse applications or all data warehouse configurations, so a specific refreshment process frequently needs to be engineered for a specific application or configuration. Using active rules allows one to benefit from generic execution mechanisms and a high-level language which enable rapid development of the refreshment activities.

- Some of the activities of the refreshment process, such as extraction, cleaning and integration, can be performed by commercial products; their integration into the global refreshment process should be done in a transparent way. Active rules allow the activities handled by these tools to be considered as their action part, which can be executed in an atomic way.

The main original contribution of the work presented in this report is to show how a data warehouse refreshment system can be suitably modeled as an active application. We show how this process takes advantage of the modularity of active rules, and we capitalize on recent advances in the formalization of active rule execution models to better understand application semantics. Our contribution is, first, a general methodology to specify a data warehouse refreshment system, which starts from a conceptual specification and progressively transforms it into logical and physical specifications; and second, a generic active monitor which can be adapted to some refreshment activities.

Apart from this introduction, this report is structured as follows. In Section 2, we present our logical view of a data warehouse architecture, of its initial design and of its refreshment. We particularly point out the refreshment tasks that can be modeled by active rules, and we show how this logical view of the data warehouse conforms to the DWQ framework. Section 3 presents a technical summary of the definitions and principles of the generic active monitor we have defined. Section 4 demonstrates, through simple examples, how this active monitor can be adapted and instantiated in order to handle some active tasks of the refreshment process. Finally, Section 5 concludes and presents our future directions of work.

2. Logical data warehouse architecture, initial design and refreshment tasks

This section presents our logical view of a data warehouse architecture, of its initial design and of its successive refreshments. We particularly point out the refreshment tasks that can be modeled by active rules. We also show how this logical view of the data warehouse conforms to the DWQ framework.

2.1. Logical view of a data warehouse architecture

Data warehouse components

The data warehouse can be defined as a hierarchy of data stores (Figure 1) which goes from the source data to highly aggregated data (the data marts). Between these two extremes, different other stores can be found, depending on the requirements of the OLAP applications. One of these stores is the corporate data warehouse (CDW), which groups all the aggregated views that serve to generate the data marts. The corporate data store can be complemented by an operational data store (ODS), which groups the base data collected and integrated from the sources. The ODS contains the common source data from which the aggregated views are derived. This data, although it may contain some aggregation, is considered as a multi-relational view which synthesizes the source data. Within this ODS, we can maintain for a period of time a history, called ODS-history, of all the data collected from the sources at different moments.

With each source can be associated an intermediate store, called Source-delta, which groups the changes extracted from that source at a given instant. After cleaning this change, we generate a Cleaned-delta. Again, we can maintain a history of this source-delta, called Delta-history, for a certain period of time during which changes can be accumulated. Both ODS-history and Delta-history are optional if they are not required by the semantics of the OLAP applications. There is a difference between a Source-history and an ODS-history. A Source-history is defined when the frequency of the extraction step differs from the frequency of the cleaning or integration step. This can occur when the data source does not maintain its own history, or when the volume of extracted data is not relevant with respect to the aggregation needed by the OLAP application. An ODS-history is defined when OLAP applications need to accumulate data, for statistical processing for example.

Obviously, this hierarchy of data stores is a logical way to represent the data flows which go from the sources to the data marts. Concretely, all the intermediate stores between the sources and the data marts can be represented in the same database. This logical view allows a certain traceability of the design and refreshment processes, leading to a better understanding of their construction and scheduling respectively. In the following paragraphs we describe the main phases and steps of the data warehouse design.
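Before doing so, the store hierarchy just described can be made concrete as a small data structure. The following Python sketch is illustrative only; none of these class or store names are prescribed by this report.

    # A minimal sketch of the logical store hierarchy of Figure 1.
    # All names (Store, the per-source chain, the shared stores) are
    # illustrative, not part of the DWQ specification.

    from dataclasses import dataclass, field

    @dataclass
    class Store:
        name: str
        rows: list = field(default_factory=list)   # logical content only

    def source_chain(source_name):
        """Per-source stores produced by the preparation phase."""
        return {
            "source_delta":  Store(f"{source_name}-delta"),    # raw extracted changes
            "cleaned_delta": Store(f"{source_name}-cleaned"),  # after cleaning rules
            "delta_history": Store(f"{source_name}-history"),  # optional accumulation
        }

    # Shared stores fed by the loading and aggregation phases.
    ods         = Store("ODS")          # integrated base data
    ods_history = Store("ODS-history")  # optional history of the ODS
    cdw         = Store("CDW")          # aggregated views feeding the data marts

    chains = {s: source_chain(s) for s in ("S1", "S2", "S3")}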

Figure 1: Logical data warehouse architecture. The figure shows the hierarchy of stores: the sources S1, S2, S3; extraction (Extractor1-3) into the S-deltas; cleaning (Cleaner1-3) into the C-deltas; historisation (Historisation1-3) into the D-histories (the PREPARATION layer); multi-source integration and ODS historisation into the ODS and ODS-history (the INTEGRATION/LOADING layer); and aggregation into the CDW views V1 to V5 (the AGGREGATION/UPDATING layer).

Data warehouse design

The design of a data warehouse consists in the definition of all the meta data describing the data warehouse objects (data stores), and in the definition of the initial creation of the data warehouse stores as well as of their periodic refreshment. This design starts from the OLAP application requirements on the one hand (expressed as conceptual views), and from the set of sources potentially useful for the computation of these views on the other hand. With respect to the DWQ framework, this definitional meta data is structured at three abstraction levels:

- The conceptual level contains the meta data which characterize the usage of the sources (access rights, quality factors, extraction frequency, etc.), the mapping rules between the source models and the relational model, the definition of the cleaning rules, the definition of the integration assertions, indications on histories (time periods, volumes), and the quality factors expected for the views.

- At the logical level, each data warehouse store has its own description. Sources are described in their respective models, while all other data warehouse stores are described in the relational model. The logical schema of the corporate data warehouse is a calculation graph of all the views, defined either on the sources or on other views of the data warehouse.

- The physical level defines the actual implementation of the data warehouse stores.

As stated in the DWQ framework, mappings are of multiple kinds: structural mappings, data mappings, knowledge mappings and requirements mappings. Mappings between the conceptual level and the logical level mainly correspond to quality function deployment, using the house matrix for example. This house matrix transforms quality factors into technical strategies which allow the quality level described by these factors to be achieved. The matrix is refined in several steps and evolves to represent the mappings between the logical level and the physical level. Other mappings, between the sources and the data warehouse perspectives, are represented as queries.

Definition of the refreshment process

The refreshment process aims to propagate changes raised in the data sources to the data warehouse stores. This propagation follows three phases: (i) a preparation phase, (ii) a loading phase and (iii) an aggregation phase. Each phase is composed of several steps handling different tasks (Figure 1). These phases and steps are the same as in the initial construction of the data warehouse, except that the refreshment process is concerned with an incremental management of the updates. Phases and steps are governed by meta data which describes the source and extractor capabilities, the semantics needed by the OLAP applications, and the moments at which the refreshment is relevant to these applications. In the remainder of this section, we describe the different phases of the refreshment process, then we define the different refreshment strategies and the way to plan them.

The preparation phase

The preparation phase is composed of three steps applied to each data source: (1) extraction of data changes from the source, producing the source-delta stores; (2) cleaning of this data, producing the cleaned-deltas; (3) historisation of this data, producing the delta-histories. The second and third steps are optional, depending on the quality and representation of the source data, and on whether or not a history is needed on a data source.

The cleaning and historisation steps can be done in different orders, depending on performance or on the semantics of the cleaning rules. For example, if we have a cleaning rule which discards duplicated data, it makes more sense to apply it after historisation than before. But if we have a cleaning rule which adapts one format to another, it makes sense to apply it after the extraction, or even on the fly during the extraction. This means that at the operational level the cleaning task can be distributed: some cleaning rules are applied before historisation, others after, as the sketch below illustrates.
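The placement of individual cleaning rules can be captured explicitly. The following Python sketch is one possible reading of the example above; the function and tag names are ours, not DWQ vocabulary.

    # Illustrative only: cleaning rules tagged with their placement in the
    # preparation phase, so that format adaptation runs during extraction
    # while duplicate elimination runs after historisation.

    def to_standard_date(record):
        record["date"] = record["date"].replace("/", "-")   # format adaptation
        return record

    def drop_duplicates(records):
        seen, result = set(), []
        for r in records:
            key = tuple(sorted(r.items()))
            if key not in seen:
                seen.add(key)
                result.append(r)
        return result

    CLEANING_PLAN = [
        ("on_extraction",       to_standard_date),  # applied on the fly, per record
        ("after_historisation", drop_duplicates),   # applied once history is built
    ]

    raw = [{"date": "1997/06/17"}, {"date": "1997/06/17"}]
    staged = [to_standard_date(r) for r in raw]
    history = drop_duplicates(staged)   # one record left, date "1997-06-17"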

The integration/loading phase

The loading phase is composed of two steps: (4) multi-source data integration, producing the operational data store (ODS); (5) data warehouse historisation, producing the ODS-history. Step (4) is optional if the data warehouse is built from a single source, or if the sources are independent, i.e., there are no common objects nor common links. Step (5) is optional if no history is needed on the integrated data; this means that no OLAP application is defined on this history.

The integration activity consists roughly in (i) matching the data coming from the different sources, (ii) detecting multi-source inconsistencies with respect to the integration assertions defined at the schema integration level, (iii) transforming and cleaning the data which does not conform to these assertions, and (iv) inserting the integrated data into the resulting relations. The multi-source integration step should be done with respect to a certain scenario depending, for example, on the duration of the preparation phase of each source, on the semantics of the integration rules (apply some rules before others), on performance considerations (perform some intermediate integrations in parallel), etc.

The data warehouse historisation is a necessary activity when there are OLAP applications which operate on statistical samples elaborated over certain periods of time. There might be OLAP applications interested in short-term or long-term prediction, for which both the factual ODS and the ODS-history are of interest.

The aggregation/propagation phase

The aggregation phase can be composed of as many steps as there are intermediate views in the view definition hierarchy. The aggregation phase is nothing other than a recursive view evaluation from a set of operands. At this level, the refreshment process may consist in an incremental evaluation of certain operands, depending on whether these operands are materialized or derived views, and on whether the materialized views are incrementally updated or completely re-evaluated.

2.2. Refreshment tasks that can be modeled by active rules

A refreshment process can be seen as an application program which monitors a set of activities devoted to data extraction, data cleaning, data integration, etc. This refreshment application is composed of a main activity and of the set of activities identified in the previous section (Figure 2). Some of these activities can be implemented using active rules; others will be implemented using classical programming languages. Our current assumption is that data cleaning, data integration and update propagation are typically the activities which can be implemented using active rules. With regard to its monitoring role, the main activity itself can also be implemented using active rules. In this report we restrict our experiment to the three activities of cleaning, integration and propagation, and we show later how to implement them using active rules.
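As a first, deliberately naive illustration of this assumption, an active rule can be reduced to an (event, condition, action) triple, with one rule per active activity. The sketch below uses a hypothetical API; Section 3 defines the actual rule and monitor semantics used in this report.

    # A naive ECA skeleton, only to fix ideas; all names are illustrative.

    from dataclasses import dataclass
    from typing import Callable

    @dataclass
    class Rule:
        event: str                       # type of triggering event
        condition: Callable[[dict], bool]
        action: Callable[[dict], None]   # atomic from the monitor's point of view

    rules = [
        Rule("source_delta_ready",  lambda e: True, lambda e: print("clean", e)),
        Rule("cleaned_delta_ready", lambda e: True, lambda e: print("integrate", e)),
        Rule("ods_updated",         lambda e: True, lambda e: print("propagate", e)),
    ]

    def signal(event_type, context):
        """Fire every rule triggered by this event whose condition holds."""
        for r in rules:
            if r.event == event_type and r.condition(context):
                r.action(context)

    signal("source_delta_ready", {"source": "S1"})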

Figure 2: Activities of the refreshment process. A refreshment process is composed of a main activity coordinating the extraction, cleaning, integration and propagation activities.

2.3. Engineering a refreshment process

This section describes an informal methodology which helps in building a refreshment process. It also describes the meta data needed by this process, according to the DWQ framework.

Engineering approach

As shown in the previous section, the definition of a refreshment process is a complex activity which should be organized in several phases and tasks, according to a certain strategy, and using different parameters. Thus, engineering a refreshment process must follow a classical life-cycle for process development (Figure 3), which successively generates a conceptual definition of the refreshment process, its logical specification and its physical implementation. Four activities materialize this engineering process:

- Requirement analysis, which helps to acquire all the knowledge necessary to the definition of the refreshment process.

- Conceptual design, which provides a first definition of the refreshment process as a planning scenario of possible strategies.

- Logical design, which transforms this conceptual scenario into a formal specification in terms of a master algorithm, rules and their execution semantics.

- Physical design, which implements the master algorithm as a master program, and the rules with their semantics as an active monitor driven by events generated by the master program.

The following subsections detail these processes and their corresponding inputs and outputs.

Figure 3: Methodology to define a refreshment process. The engineering process chains requirements analysis (taking as inputs the requirements of the OLAP applications, metadata, source definitions, view definitions, and the quality of sources together with the expected quality), the conceptual definition of the refreshment process, its logical definition (a set of active rules together with their execution semantics) and its physical definition (an operational refreshment process built on the active monitor).

Requirement analysis

This informal activity allows one to:

- identify the views concerned by the refreshment process to be built,
- identify the data sources and their corresponding data extractors,
- identify the quality factors attached to the data sources and those that the data warehouse content must achieve (data quality policy),
- define the quality function deployment, that is, associate with each quality factor the corresponding technical strategy.

Conceptual design

Defining the refreshment process is a complex design task which should consider many parameters related to the refreshment strategy, the source features, the task capabilities and the user needs. The following procedures give an intuition of this definition process.

Refreshment(V1, ..., Vn)
  For each view or set of views to refresh do
    Select the sources on which the views are directly or indirectly defined;
    Choose the refreshment strategy;
    Select the relevant meta data to use;
    Define the refreshment window;
    Plan each refreshment phase;
  end.

PlanningPreparation(S1, ..., Sp)
  For each data source Si do
    Select the corresponding extractor;
    Select the other tasks to perform (i.e., cleaning, historisation);
    Organize these tasks into a sequence;
    Define the events that govern the starting and the progress of these tasks;
  end.

PlanningLoading(S1, ..., Sp)
  Select the integration strategy;
  Identify the synchronization points where the integration starts;
  Define the dynamic planning based on the ends of the source preparations;
  Apply transformations with respect to the integration assertions;
  Possibly perform the historisation task;
end.

PlanningAggregation(V1, ..., Vn)
  Select the materialized views;
  Select the derived views;
  Define the computation dependency graph between views;
  Define a computation strategy for this graph;
end.

Planning a refreshment process means choosing an execution strategy for the component activities and for the whole refreshment process (the main program). Defining a refreshment strategy consists in:

- sequencing the preparation tasks,
- deciding on the level of parallelism between the different source preparations,
- deciding on the history definitions,
- defining the graph of computational dependencies between views,
- deciding on the way to evaluate the views in this graph,
- identifying all the major events that trigger the refreshment activities.

The refreshment process should be planned statically or dynamically, depending on the activities, in order to achieve a certain quality of service, for example a high degree of freshness for the aggregated data, or the propagation of changes in an optimal time (availability). Static planning concerns the selection of the tasks that should be performed on each source or on the data warehouse, and the logical sequencing of these tasks. Dynamic planning concerns the identification of parallel tasks, the definition of the synchronization point before the integration, and the mode of detection of all the events that trigger these tasks. Both static and dynamic planning are necessary to define the refreshment process. This planning is bounded by the data freshness time (i.e., the last data state in which the OLAP application is interested) and the data availability time (the deadline at which the aggregated data is significant to the OLAP application). This interval is called the refreshment window; one possible encoding of such a plan is sketched below.
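As an illustration, the output of the conceptual design (a refreshment window plus static and dynamic planning choices) could be recorded as plain data. All field names in the following Python sketch are assumptions, not DWQ vocabulary.

    # A purely illustrative encoding of a conceptual refreshment plan.

    REFRESHMENT_PLAN = {
        "views": ["V1", "V2"],
        "window": {"freshness_time": "08:00", "availability_time": "09:30"},
        "preparation": {            # static planning: task sequence per source
            "S1": ["extract", "clean", "historise"],
            "S2": ["extract", "historise", "clean"],   # cleaning order may differ
        },
        "loading": {"strategy": "wait_all_sources", "historise_ods": True},
        "aggregation": {"materialized": ["V1"], "evaluation": "incremental"},
        "events": {"start": "daily_08:00", "integrate": "all_deltas_cleaned"},
    }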

Logical design

The logical design of the refreshment process consists in:

- the definition of the main activity algorithm and of the checkpoints where the refreshment activities are invoked,
- the selection of the activities that will be specified using active rules (active activities),
- the definition of the active rules related to each active activity,
- the definition of the semantics of each rule,
- the definition of the algorithms of the non-active activities.

The output of the logical design is thus a set of rules with their semantics (see Section 3 on the expression of this semantics) for the active activities, and a set of algorithmic specifications for the non-active activities. Section 4 shows, through simple examples, how some refreshment activities can be expressed with rules and how the semantics of these rules is specified. The rules are Event-Condition-Action rules. The semantics is the definition of the operational execution of a rule: when and how events are detected and composed, when and how conditions are evaluated, when and how actions are executed, what is the context in which a rule is executed, what is the rule effect, how rules are scheduled, etc. Section 3 gives a more formal definition of these notions. The logical design should provide guidance in the definition of the rules and of their semantics, that is: how to translate the refreshment tasks into active rules, how to derive the semantics of these rules from the strategies defined at the conceptual level, what the events, conditions and actions are, and which event initiates the refreshment process.

Physical design

The physical design of the refreshment process consists in the transformation of the logical rules and their semantics into an operational active program. This can be done using the generic operational active monitor. The components of this generic monitor are instantiated in such a way that they implement the semantics defined for the rules, yielding a dedicated active monitor for data warehouse refreshment. Physical design also consists in the programming of all the non-active activities.

Meta data which governs the refreshment process

The metadata used by the refreshment process concerns the source and data warehouse logical schemas, the source and data warehouse physical schemas, and the corresponding mappings between all of these, including the mappings from the conceptual level. The logical schemas of the sources are uniform representations of the parts of the sources which are relevant to the data warehouse goals (Figure 4).

Besides the source definitions and the data warehouse definition, the refreshment process needs some other meta data, such as the frequency of extraction or integration, the time interval of historisation, and all the time points associated with the activation of the different tasks executed in the different steps. There should be a certain coherence between these parameters, to avoid mismatches between the user needs and the capabilities of the data warehouse system. For example, the integration frequency should be consistent with the extraction frequencies of the multiple sources, as well as with the time intervals of the different histories (source history and data warehouse history).

Figure 4: Position of the refreshment process with respect to the DWQ framework. The figure crosses the process dimension (the DW system architecture, in which the refreshment process is one process among others) with the meta-data dimension (the metabase content: source, enterprise and client models at the conceptual level; source, data warehouse and client schemas at the logical level; source, DW and client data stores at the physical level), together with the source, enterprise and client perspectives.

3. Foundations of an active monitor

As explained before, our goal is to model refreshment activities such as cleaning, integration and updating by means of a set of active rules equipped with their execution model. The first step taken in this section is to formally characterize an execution model with a small number of parameters. These parameters form a hierarchy, as shown by Figure 5. We first present some important assumptions. Second, we describe the types of events supported by a refreshment system. In a third part, the parameters are described in four parts that correspond to the highest classification level in Figure 5. Finally, we give the operational rule model that allows the semantic parameters to be mapped to the behavior of the rules. Our presentation of the semantic parameters and of the operational model is a simplified version of the results presented in [Llir97] and [BFLM97].

3.1. General assumptions

We represent a system supporting active refreshment activities by a master program (master, for short) that generates events for an active monitor, and by a set of rules managed by this active monitor.

The master program

The implementation of the master program depends on the data warehouse application. To illustrate this notion, we present several possible implementations of a master program. A master program can manage a set of alarm clocks that trigger calls to the extraction programs; for instance, one alarm clock can be associated with every source extraction program. Every call to an extraction program returns a source change that is then passed as a primitive event to the active monitor. As another possible implementation, the master can manage persistent queues associated with the sources (one queue per source and one queue for the data warehouse). Every extraction program writes into its queue. The master program can then read the queues in some particular ordering, get a message, and pass it as an event to the active monitor.

An important assumption made by our modeling is that when a master program passes an event to an active monitor, it is interrupted until the processing of this event by the active monitor completes. A second assumption is that an instance of a master program communicates with a single instance of an active monitor. However, a parallel implementation of a master program is also possible. For example, an instance of a master program can be associated with each persistent queue. This means that source changes coming from different sources will be processed independently. Of course, such a solution works only if the application guarantees that source changes can be processed in parallel without causing any conflict.
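The blocking contract between the master and the monitor can be made explicit in code. The following Python sketch is illustrative only (the class and method names are ours); the point is that passing an event is a synchronous call, so the master is suspended until the monitor returns.

    # Sketch of the master/monitor contract stated above.

    class ActiveMonitor:
        def __init__(self, rules):
            self.rules = rules
            self.event_history = []   # record of received events

        def process(self, event):
            """Detect triggered rules and run them; returns only when done."""
            self.event_history.append(event)
            for rule in self.rules:
                rule.react(event)

    class MasterProgram:
        def __init__(self, monitor):
            self.monitor = monitor

        def signal(self, event):
            # The master is interrupted here: the call returns only once
            # the monitor has finished processing the event.
            self.monitor.process(event)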

The rules

The refreshment rules are Event-Condition-Action rules that specify the refreshment activities to undertake when certain situations arise. A situation is described by an event and, possibly, a condition. The action part is reduced to a simple call to an application program whose execution is considered as atomic from the active monitor's point of view. In the following, we describe in more detail the situation part and the context in which the rules are executed.

The events

The events that may trigger the rules are associated with event types and event context types. An event type is described by an identifier (its name) and a possible sequence of formal parameters. An event context type is described by an identifier (its name) and a data structure. An event of type t is specified by an instance of t and an instance of the context type associated with t. Events can be primitive or constructed events. Primitive events correspond to: source changes or data warehouse changes received by the refreshment system, data modification operations generated by the rules within the refreshment system, or data warehouse access operations received by the refreshment system. Primitive events are produced by the master program and by the actions of the rules. The instant of a primitive event e is the time at which e is received by the active monitor. This definition deserves some explanation. Suppose that a data modification occurs at a source. In our modeling, the time at which the event corresponding to this data modification is signaled to the active monitor is the instant of the event. The instant at which this data modification actually occurred in the source can, however, be captured in the context associated with the event (if it is necessary for the refreshment system).

A constructed event type t is defined from a set St of primitive or constructed event types by three components: an identifier (its name), a time interval definition, and a synthesis function. Informally, the time interval is defined over the set E of events of types in St that have been signaled to the active monitor, and returns an interval of time whose bounds correspond to event instants in E. The synthesis function is defined over a time interval I over E, and returns a set of events of type t. All these events have the upper bound of I as their event instant. A constructed event type may have parameters, and it is associated with a context type.

To illustrate the notion of constructed events, consider the following example. Suppose that a region is decomposed into districts. In each district, a database registers local measurements about air pollution and air quality. A global data warehouse is defined to store aggregated data about air quality in the region. Suppose that each source sends its changes every hour to the data warehouse. Then a possible refreshment strategy would be to trigger the refreshment of the data warehouse only when there are at least two districts whose level of pollution (computed on a scale of 5 values) has exceeded level 2 in the last 12 hours. In this case, the time interval function associated with this constructed triggering event is [current_time - 12h; current_time], the synthesis function computes the instances of the constructed event, and a possible context type for this event is the number of districts which have communicated a pollution level greater than or equal to 2.
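The two components of this constructed event type can be rendered directly in code. The following Python sketch is a hedged reading of the example (we count districts whose level strictly exceeds 2); the function and field names are illustrative.

    # The time interval and synthesis function of the pollution example.

    HOUR = 3600

    def time_interval(now):
        """Interval over which the signaled events are considered."""
        return (now - 12 * HOUR, now)

    def synthesis(events, now):
        """Return the constructed events derived from the interval, if any."""
        lo, hi = time_interval(now)
        polluted = {e["district"] for e in events
                    if lo <= e["instant"] <= hi and e["level"] > 2}
        if len(polluted) >= 2:
            # Context: how many districts reported a high pollution level.
            # The event instant is the upper bound of the interval.
            return [{"type": "refresh_region", "instant": hi,
                     "context": {"districts": len(polluted)}}]
        return []

    signaled = [{"district": "D1", "instant": 100, "level": 3},
                {"district": "D2", "instant": 200, "level": 4}]
    print(synthesis(signaled, now=10 * HOUR))   # one constructed event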

Note that the notion of constructed event considered in this report is more general than the notion of composite event, defined from a set of primitive events and a set of logical and temporal operators, as in [CKAK94]. For that particular case, however, one can show how to specify the corresponding time interval function and synthesis function.

3.2. Rule execution semantics

The execution semantics of an active system may be decomposed into a set of dimensions. Each dimension, called a semantic parameter, defines one facet of the rules' behavior (see Figure 5).

Figure 5: Semantic dimensions. The local semantics of a rule comprises its triggering (triggering point, interval, synthesis), its evaluation (evaluation point, evaluation plan) and its execution (execution point, execution plan); the global semantics comprises triggering (triggering point, triggering policy) and selection (selection point, selection policy).

At the first level, we distinguish between the local and the global dimensions. The local dimensions describe the behavior of one rule independently from the other rules. Processing a rule is decomposed into three phases: the triggering phase, where the rule is triggered; the evaluation phase, where the condition of the rule is evaluated; and the execution phase, where the action is executed. Describing the local semantics of a rule r consists in specifying when and how each phase is processed for r. The global dimensions describe the global behavior of a set of rules. Indeed, a refreshment application generally uses several rules; so we have to specify how each phase of the processing of each rule, taken individually, is scheduled with respect to the processing of the other rules. Describing the global semantics of a set of rules consists in specifying when each individual phase has to be processed in the global process. In what follows, we present each component of this description.

The triggering phase

The triggering phase of a rule r produces a set S of events triggering the rule and a set of rule instances of r (one per triggering event in S). Here, we use the standard definition of the notion of rule instance: a rule instance of r associated with an event e is the rule r in which the event part is instantiated with e. In our modeling, we assume that only events having a constructed type are able to trigger the refreshment rules. This assumption allows us to consider all the rules in a uniform manner. It does not jeopardize our semantic model; indeed, any primitive event may easily be seen as a constructed event. The description of the semantic dimensions of the triggering phase of a rule r specifies in what situations the phase may begin and what mechanism is used to produce the triggering events and the rule instances of r.

The situations in which the triggering phase may begin are specified by means of synchronization points called (local) triggering points of r. The local triggering points are points where the master program may be interrupted for executing r; they are produced in message form by the master. As we assume that the refreshment rules are triggered by constructed events, describing the triggering mechanism of r consists in providing the formal specification of the constructed triggering events of the rule. This is achieved by specifying a time interval, called the triggering interval, and a synthesis function. The interval is described by the specification of its bounds. The maximal lower bound of the interval is the time at which the master program began. There are many ways to specify the interval boundaries for a rule r: for example, the time at which a triggering phase of r began, or the moment at which a certain primitive (or constructed) event occurred, are possible interval boundaries. The synthesis function for r considers the events that were received by the master and the active monitor and whose event instant is included in the interval, and derives from them a set of constructed events. The events the function has to consider are specified by their types: they may be primitive events or triggering events of the rule. Every event returned by the function is a triggering event for r that ultimately allows a rule instance of r to be produced.

The semantics of the evaluation and execution phases

The description of the semantic dimensions of the evaluation phase and of the execution phase of a rule r specifies in what situations these phases may begin and what mechanisms are used for evaluating and executing the rule instances of r. The situations in which the evaluation phase may begin are specified by means of local evaluation points of r, and the situations in which the execution phase may begin (provided the condition was satisfied) are specified by means of local execution points of r. The local evaluation and local execution points are synchronization points where the master program may be interrupted for, respectively, evaluating the condition and executing the action of the rule instances of r. These points are produced in message form by the master. The need for specifying evaluation and execution plans for a rule r arises whenever the synthesis function may produce several triggering events for r. The evaluation and execution plans specify how the processing of the corresponding rule instances is monitored; for example, the plans for r may specify that all the instances are processed in parallel. Taken together, these local dimensions can be summarized as a small parameter record, as sketched below.
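The following Python sketch groups the local semantic dimensions of Figure 5 into one record; it describes semantics rather than implementing the monitor, and every field name is an assumption of ours.

    # Illustrative grouping of the local semantic dimensions of one rule.

    from dataclasses import dataclass
    from typing import Callable

    @dataclass
    class LocalSemantics:
        triggering_points: set         # points where triggering may begin
        triggering_interval: Callable  # history -> (lower_bound, upper_bound)
        synthesis: Callable            # events in interval -> constructed events
        evaluation_points: set         # points where conditions may be evaluated
        evaluation_plan: str           # e.g. "parallel" or "sequential" instances
        execution_points: set          # points where actions may start
        execution_plan: str

    # Example instantiation, anticipating the extraction rule of Section 4.
    extraction_rule_semantics = LocalSemantics(
        triggering_points={"end_of_period"},
        triggering_interval=lambda history: (None, None),   # undefined interval
        synthesis=lambda events: [{"type": "TRUE"}],        # exactly one instance
        evaluation_points={"end_of_period"},
        evaluation_plan="parallel",
        execution_points={"end_of_period"},
        execution_plan="parallel",
    )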

The global semantics of a set of rules

The global semantics of a set of rules is described by specifying global synchronization points, a global triggering policy and a global selection policy.

The global synchronization points allow each rule to be synchronized with respect to the other rules. Given a rule r, a global synchronization point for r may be posted at the end of the condition or at the end of the action program of r. Such a point is produced in message form during the processing of the rule instances of r. A global synchronization point p for a rule r is associated with a set of rules, say S. Point p may be a triggering point, an evaluation point, or an execution point; that is, a point where the processing of the rule instances of r may be interrupted for, respectively, triggering, evaluating or executing rules in S. Such a point is described by specifying in what rule the point is posted, what its position in this rule is, and what its set of associated rules is.

It may happen that several rules have the same local triggering point or are attached to the same global triggering point. The triggering policy selects the rules for which the point is actually a triggering point. A triggering policy is described by specifying what mechanism is used to perform the selection.

It may also happen that several rules have the same local evaluation point or the same local execution point, or are attached to the same global evaluation point or the same global execution point. The selection policy has two functions: first, it selects the rules for which the point is actually an evaluation point or an execution point; second, it selects which rule instances of the selected rules have to be evaluated or executed. A selection policy is described by specifying what mechanism is used to perform the selection.

3.3. Operational model of the rules

We now present an operational model of the rules, intended to show how the semantic dimensions described above fully determine the behavior of the rules. In this model, we see the execution of the master program and of the rules it triggers as the execution of a set of tasks governed by a synchronizer. A task corresponds either to the execution of the master program or to the execution of a rule instance. The way the synchronizer composes and schedules the orders sent to the tasks is determined by the semantic dimensions of the refreshment application.

Figure 6: Program and rule execution. Tasks send messages to the synchronizer, and the synchronizer answers with execution orders and task-generation orders.

The interactions between the synchronizer and the tasks are shown in Figure 6. The tasks send messages to the synchronizer and receive orders from the synchronizer. There is no inter-task communication. Every time a task sends a message to the synchronizer, it remains inactive until the synchronizer responds to the message. The synchronizer handles the messages one after another, in the order in which they were sent by the tasks. In reaction to a message, the synchronizer may create new tasks or send orders to inactive tasks; this reaction is determined by the semantic dimensions and by the current state of the event history and the task history. The event history (EH) and the task history (TH) respectively contain all the events received by the active monitor since the beginning of the master program, and all the orders sent by the synchronizer to the tasks since the beginning of the master program.

The tasks: state diagram and messages

We describe a rule instance task T by the state diagram shown in Figure 7.

Figure 7: State diagram of a rule instance. The transitions are: Triggered -(R: begin_rule)-> Evaluating; Evaluating -(S: true)-> Evaluated; Evaluating -(S: false)-> End; Evaluated -(R: begin_action)-> Executing; Executing -(S: end)-> End.

In the diagram, gray ovals, dashed ovals and white ovals respectively represent inactive states (i.e., states where T waits for an order sent by the synchronizer), final states, and active states (i.e., states where T evaluates the condition or executes the action of the associated rule). There are two inactive states (triggered and evaluated), one final state (end), and two active states (evaluating and executing). A transition from a state St to a state St' is noted (St, St'); it is annotated with a label of the form R: m or S: m. Given an arc (St, St'), if St' is an active state, then St is an inactive state and the associated label is of the form R: m, with the following meaning: the task is in state St and expects the order m from the synchronizer; when the task receives this order, its state changes from St to St'. Conversely, if St is an active state, then St' is an inactive (or final) state and the associated label is of the form S: m, with the following meaning: the task is in state St, and enters state St' by sending the message m to the synchronizer.

The triggered state is the initial state of task T: it is the state of every rule instance task when it is created by the synchronizer. From this state, the task waits for the begin_rule order. When it receives this order, T enters the active state evaluating, where the condition is evaluated; T then sends a message to the synchronizer to report the evaluation result. The following state of T depends on this result. If the condition is not satisfied, T signals the end of its execution to the synchronizer (message false) and enters the final state end, where the execution of T is abandoned. If, on the contrary, the condition is satisfied, T signals this result to the synchronizer (message true) and enters the inactive state evaluated, where it waits for the begin_action order from the synchronizer. When T receives this order, it enters the active state executing, where the action is executed. At the end of the action execution, T enters the end state and signals to the synchronizer that the execution is done (message end).

Figure 8: State diagram of the master program. The transitions are: Executing -(S: interrupt)-> Interrupted; Interrupted -(R: continue)-> Executing.

The state diagram of the master program is shown in Figure 8. The program may be executing, or interrupted for executing the rules. At certain points of its execution, the master sends the message interrupt to the synchronizer. This message signals a local synchronization point of the rules. On receipt of the message, the synchronizer monitors the execution of the rules. The master is interrupted until the synchronizer sends the continue order. An important assumption of our operational model is that the master program is never in the executing state when some rule instance task is active.

The event history and the task history

The execution of a master program that triggers active rules may be traced using two histories: the event history and the task history.

The event history (EH) contains all the events produced by the master program and by the rule action programs. For every event, the event instant is the time at which the event was registered in EH. We impose two constraints. First, all the events produced by executing the action program of a rule instance must be registered in the event history before the rule instance signals to the synchronizer that the execution is done (message end). Second, all the events produced since the beginning of the master program must have been registered in the event history before the master program sends the interrupt message to the synchronizer.

The task history (TH) records the messages and the orders reporting all the state changes of the tasks during the execution of the master program and of the triggered rules. It also records the orders given by the synchronizer to create new rule instances. Each message contained in the task history mentions which task sent the message and at what time the synchronizer took the message into account. Each order mentions which task received the order and at what time the synchronizer sent the order. Each creation order mentions which task was created and at what time the synchronizer ordered its creation.

The synchronizer

The synchronizer handles the messages coming from the tasks one after another. The kind of order it sends in response to a given message is defined by applying the semantic dimensions. To do so, it uses the information provided by the histories EH and TH. Indeed, each semantic dimension may be expressed as a set of formulas over EH and TH (see [BFLM97] for more details).

The synchronizer behavior is described by the algorithm shown in Figure 9. When the synchronizer receives a message, it operates in two steps. The first step is dedicated to creating new rule instance tasks, while the second step computes the orders that have to be sent to the existing tasks. To create new tasks, the synchronizer computes in D the rules that have reached a triggering point; to do so, it uses the specification of the local and global triggering points and the current state of the task history. Nevertheless, a rule having reached a triggering point is not necessarily triggered, and the synchronizer selects, among the possibly triggered rules, those that have to be ultimately triggered. To do so, it applies the global triggering policy and puts the result in Rdec. For each selected rule r, the synchronizer computes a set of rule instances of r by executing the triggering phase of the rule as specified in the local semantics of r. Finally, for each rule instance produced, the synchronizer orders the creation of the associated task; this order is registered in the task history. To compute the orders that have to be sent to the tasks, the synchronizer computes in Exec the rules that have reached an evaluation or an execution point; to do so, it uses the specification of the evaluation and execution points. Then the synchronizer uses the selection policy to select tasks among the instances of the rules in Exec, and sends the appropriate order to the selected tasks. If there is neither a selected task nor an active task, the synchronizer sends the continue order to the master program.

Synchronizer algorithm
input: a message m, and T the task that sent m

  let EH and TH denote the current state of the event history and the task history;

  step 1:
    let D be the set of rules having reached a triggering point;
    if D is not empty then
      let Rdec denote the set of rules of D that have to be triggered;
      for each rule r in Rdec do
        compute in Inst_r the set of rule instances by applying the triggering phase of r;
        for each rule instance in Inst_r do
          create a rule instance task;

  step 2:
    let Exec be the set of rules having reached an evaluation point or an execution point;
    let TS be the set of rule instances defined as follows: a task T' is in TS if T' is an
      instance of some rule r contained in Exec, and T' may be selected with respect to both
      the global selection semantics and the evaluation plan of r;
    if TS is empty and there is no task in the evaluating state or in the executing state then
      send the order continue to the master task
    else
      for each task T' in TS do
        send the appropriate order to T' with respect to the current state of T';

Figure 9: Synchronizer algorithm
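The rule-instance state diagram of Figure 7, which underlies the orders and messages exchanged in this algorithm, can also be read as a transition table. The following Python sketch is illustrative only and is not part of the formal model of [BFLM97].

    # The state diagram of Figure 7 as a transition table:
    # (state, stimulus) -> next state. "R:" stimuli are orders received
    # from the synchronizer; "S:" stimuli are messages the task sends.

    TRANSITIONS = {
        ("triggered",  "R:begin_rule"):   "evaluating",
        ("evaluating", "S:true"):         "evaluated",
        ("evaluating", "S:false"):        "end",
        ("evaluated",  "R:begin_action"): "executing",
        ("executing",  "S:end"):          "end",
    }

    INACTIVE = {"triggered", "evaluated"}   # the task waits for an order
    ACTIVE   = {"evaluating", "executing"}  # the task works, then sends a message

    def step(state, stimulus):
        return TRANSITIONS[(state, stimulus)]

    assert step("triggered", "R:begin_rule") == "evaluating"
    assert step("evaluating", "S:false") == "end"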

4. How to instantiate a generic active monitor to get an active application

In this section we present an example of a refreshment strategy and we show how it can be expressed using active rules and the semantic parameters presented in the previous section. As described in Section 2.2, we consider that a refreshment process consists of four main activities: the extraction activity, the cleaning activity, the integration activity and the propagation activity. We thus choose a strategy for each activity and show how these strategies can be expressed using rules and semantic parameters.

Activity 1: an example of an extraction strategy

A very simple strategy consists in periodically extracting data from the sources. We associate an extraction program P_S and a time period T_S with each source S. The extraction program P_S produces a source-delta D_S which contains the changes in the source S since its previous execution. We model such a strategy by associating with each source S an active rule r_S whose condition is always true and whose action executes the extraction program P_S and generates an event E_S with a context equal to D_S. The triggering and execution points of r_S correspond to the end of each time period T_S. The synthesis function returns the predefined event TRUE with no associated context, and the triggering interval is undefined (i.e., the triggering phase always triggers one and only one instance of the rule r_S, and no input parameter is passed to the action of r_S). If a same clock tick corresponds to the end of several time periods T_S1, ..., T_Sn, we need a global triggering policy and a global selection policy to select the rules in {r_S1, ..., r_Sn} that are effectively triggered and executed. Intuitively, since each rule is attached to a different source, all the rules can be triggered at the same time and executed in parallel.

Activity 2: an example of a cleaning strategy

A simple strategy consists in periodically cleaning the data that is extracted from the sources. We associate a cleaning program C_S, a time period Tc_S and a function f_S with each source S. We model the cleaning of a source S by a rule rc_S whose condition is always true and whose action executes the cleaning program C_S. The triggering and execution points of rc_S correspond to the end of each time period Tc_S. The upper and lower bounds of the triggering interval I_S are respectively the end and the beginning of the last time period. The synthesis function is a function of the form f_S ∘ select_S, where select_S is a function that selects in I_S all the source-deltas associated with the source S. The function f_S eliminates from these deltas all the redundant or useless information and computes a data structure that can be used as an input parameter to the cleaning program C_S. C_S produces a set of «cleaned events» that convert this information into a format readable by the integration process. If a same clock tick corresponds to the end of several time periods Tc_S1, ..., Tc_Sn, we need a global triggering policy and a global selection policy to select the rules in {rc_S1, ..., rc_Sn} that are effectively triggered and executed. Intuitively, since each rule is attached to a different source, all the rules can be triggered at the same time and executed in parallel.

Activity 3: an example of an integration strategy

We consider an integration strategy that computes the content of the operational data store (we assume that there is no ODS-history). Let S_ODS be the schema of the ODS. We assume that there is no integrity constraint involving more than one relation. In such a case, each relation of S_ODS can be computed independently from the other relations. Let S_R be the set of sources that allow the relation R to be computed, and let P_R be the integration program that computes the new value of R given a set of cleaned events associated with the sources of S_R and the old value of R. We adopt a strategy that applies P_R each time a sufficient amount of untreated cleaned events has been produced. We model the integration of a relation R by a rule r_R whose condition is always true and whose action executes the integration program P_R.
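To close the loop with the naive skeleton of Section 2.2, the extraction rule r_S of Activity 1 could be written as follows. This Python sketch is illustrative only: extract_S stands for the extraction program P_S, and the event and parameter names are placeholders of ours.

    # A hedged sketch of the extraction rule r_S of Activity 1.

    def extract_S():
        """Return the source-delta D_S since the previous execution."""
        return [{"op": "insert", "row": ("a", 1)}]   # placeholder changes

    def make_extraction_rule(signal, period_end_event="end_of_period_S"):
        def action(ctx):
            delta = extract_S()
            # Generate the event E_S whose context is the source-delta D_S.
            signal("E_S", {"delta": delta})
        return {
            "event": period_end_event,      # triggering point: end of period T_S
            "condition": lambda ctx: True,  # condition always true
            "action": action,
        }

    rule_S = make_extraction_rule(signal=lambda ev, ctx: print(ev, ctx))
    rule_S["action"]({})   # simulate the end of one period T_S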


More information

Recommended Practice for Software Requirements Specifications (IEEE)

Recommended Practice for Software Requirements Specifications (IEEE) Recommended Practice for Software Requirements Specifications (IEEE) Author: John Doe Revision: 29/Dec/11 Abstract: The content and qualities of a good software requirements specification (SRS) are described

More information

is easing the creation of new ontologies by promoting the reuse of existing ones and automating, as much as possible, the entire ontology

is easing the creation of new ontologies by promoting the reuse of existing ones and automating, as much as possible, the entire ontology Preface The idea of improving software quality through reuse is not new. After all, if software works and is needed, just reuse it. What is new and evolving is the idea of relative validation through testing

More information

Enterprise Architect. User Guide Series. Time Aware Models. Author: Sparx Systems. Date: 30/06/2017. Version: 1.0 CREATED WITH

Enterprise Architect. User Guide Series. Time Aware Models. Author: Sparx Systems. Date: 30/06/2017. Version: 1.0 CREATED WITH Enterprise Architect User Guide Series Time Aware Models Author: Sparx Systems Date: 30/06/2017 Version: 1.0 CREATED WITH Table of Contents Time Aware Models 3 Clone Structure as New Version 5 Clone Diagram

More information

CS SOFTWARE ENGINEERING QUESTION BANK SIXTEEN MARKS

CS SOFTWARE ENGINEERING QUESTION BANK SIXTEEN MARKS DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING CS 6403 - SOFTWARE ENGINEERING QUESTION BANK SIXTEEN MARKS 1. Explain iterative waterfall and spiral model for software life cycle and various activities

More information

A PRIMITIVE EXECUTION MODEL FOR HETEROGENEOUS MODELING

A PRIMITIVE EXECUTION MODEL FOR HETEROGENEOUS MODELING A PRIMITIVE EXECUTION MODEL FOR HETEROGENEOUS MODELING Frédéric Boulanger Supélec Département Informatique, 3 rue Joliot-Curie, 91192 Gif-sur-Yvette cedex, France Email: Frederic.Boulanger@supelec.fr Guy

More information

Data Warehousing. Ritham Vashisht, Sukhdeep Kaur and Shobti Saini

Data Warehousing. Ritham Vashisht, Sukhdeep Kaur and Shobti Saini Advance in Electronic and Electric Engineering. ISSN 2231-1297, Volume 3, Number 6 (2013), pp. 669-674 Research India Publications http://www.ripublication.com/aeee.htm Data Warehousing Ritham Vashisht,

More information

Handout 9: Imperative Programs and State

Handout 9: Imperative Programs and State 06-02552 Princ. of Progr. Languages (and Extended ) The University of Birmingham Spring Semester 2016-17 School of Computer Science c Uday Reddy2016-17 Handout 9: Imperative Programs and State Imperative

More information

Data Warehouse. Asst.Prof.Dr. Pattarachai Lalitrojwong

Data Warehouse. Asst.Prof.Dr. Pattarachai Lalitrojwong Data Warehouse Asst.Prof.Dr. Pattarachai Lalitrojwong Faculty of Information Technology King Mongkut s Institute of Technology Ladkrabang Bangkok 10520 pattarachai@it.kmitl.ac.th The Evolution of Data

More information

2.0.3 attributes: A named property of a class that describes the range of values that the class or its instances (i.e., objects) may hold.

2.0.3 attributes: A named property of a class that describes the range of values that the class or its instances (i.e., objects) may hold. T0/04-023 revision 2 Date: September 06, 2005 To: T0 Committee (SCSI) From: George Penokie (IBM/Tivoli) Subject: SAM-4: Converting to UML part Overview The current SCSI architecture follows no particular

More information

Ontology based Model and Procedure Creation for Topic Analysis in Chinese Language

Ontology based Model and Procedure Creation for Topic Analysis in Chinese Language Ontology based Model and Procedure Creation for Topic Analysis in Chinese Language Dong Han and Kilian Stoffel Information Management Institute, University of Neuchâtel Pierre-à-Mazel 7, CH-2000 Neuchâtel,

More information

Minsoo Ryu. College of Information and Communications Hanyang University.

Minsoo Ryu. College of Information and Communications Hanyang University. Software Reuse and Component-Based Software Engineering Minsoo Ryu College of Information and Communications Hanyang University msryu@hanyang.ac.kr Software Reuse Contents Components CBSE (Component-Based

More information

10. Replication. CSEP 545 Transaction Processing Philip A. Bernstein Sameh Elnikety. Copyright 2012 Philip A. Bernstein

10. Replication. CSEP 545 Transaction Processing Philip A. Bernstein Sameh Elnikety. Copyright 2012 Philip A. Bernstein 10. Replication CSEP 545 Transaction Processing Philip A. Bernstein Sameh Elnikety Copyright 2012 Philip A. Bernstein 1 Outline 1. Introduction 2. Primary-Copy Replication 3. Multi-Master Replication 4.

More information

Automatic Reconstruction of the Underlying Interaction Design of Web Applications

Automatic Reconstruction of the Underlying Interaction Design of Web Applications Automatic Reconstruction of the Underlying Interaction Design of Web Applications L.Paganelli, F.Paternò C.N.R., Pisa Via G.Moruzzi 1 {laila.paganelli, fabio.paterno}@cnuce.cnr.it ABSTRACT In this paper

More information

21. Document Component Design

21. Document Component Design Page 1 of 17 1. Plan for Today's Lecture Methods for identifying aggregate components 21. Document Component Design Bob Glushko (glushko@sims.berkeley.edu) Document Engineering (IS 243) - 11 April 2005

More information

Petri Nets. Petri Nets. Petri Net Example. Systems are specified as a directed bipartite graph. The two kinds of nodes in the graph:

Petri Nets. Petri Nets. Petri Net Example. Systems are specified as a directed bipartite graph. The two kinds of nodes in the graph: System Design&Methodologies Fö - 1 System Design&Methodologies Fö - 2 Petri Nets 1. Basic Petri Net Model 2. Properties and Analysis of Petri Nets 3. Extended Petri Net Models Petri Nets Systems are specified

More information

Data Warehousing ETL. Esteban Zimányi Slides by Toon Calders

Data Warehousing ETL. Esteban Zimányi Slides by Toon Calders Data Warehousing ETL Esteban Zimányi ezimanyi@ulb.ac.be Slides by Toon Calders 1 Overview Picture other sources Metadata Monitor & Integrator OLAP Server Analysis Operational DBs Extract Transform Load

More information

Towards Quality-Oriented Data Warehouse Usage and Evolution

Towards Quality-Oriented Data Warehouse Usage and Evolution Towards Quality-Oriented Data Warehouse Usage and Evolution 2 Panos Vassiliadis 1, Mokrane Bouzeghoub 2, Christoph Quix 3 1 National Technical University of Athens, Greece, pvassil@dbnet.ece.ntua.gr University

More information

Data Mining. Asso. Profe. Dr. Raed Ibraheem Hamed. University of Human Development, College of Science and Technology Department of CS (1)

Data Mining. Asso. Profe. Dr. Raed Ibraheem Hamed. University of Human Development, College of Science and Technology Department of CS (1) Data Mining Asso. Profe. Dr. Raed Ibraheem Hamed University of Human Development, College of Science and Technology Department of CS 2016 2017 (1) Points to Cover Problem: Heterogeneous Information Sources

More information

Databases and Database Systems

Databases and Database Systems Page 1 of 6 Databases and Database Systems 9.1 INTRODUCTION: A database can be summarily described as a repository for data. This makes clear that building databases is really a continuation of a human

More information

TIMES A Tool for Modelling and Implementation of Embedded Systems

TIMES A Tool for Modelling and Implementation of Embedded Systems TIMES A Tool for Modelling and Implementation of Embedded Systems Tobias Amnell, Elena Fersman, Leonid Mokrushin, Paul Pettersson, and Wang Yi Uppsala University, Sweden. {tobiasa,elenaf,leom,paupet,yi}@docs.uu.se.

More information

White Paper: VANTIQ Digital Twin Architecture

White Paper: VANTIQ Digital Twin Architecture Vantiq White Paper www.vantiq.com White Paper: VANTIQ Digital Twin Architecture By Paul Butterworth November 2017 TABLE OF CONTENTS Introduction... 3 Digital Twins... 3 Definition... 3 Examples... 5 Logical

More information

Foundation of Contract for Things

Foundation of Contract for Things Foundation of Contract for Things C.Sofronis, O.Ferrante, A.Ferrari, L.Mangeruca ALES S.r.l. Rome The Internet of System Engineering INCOSE-IL Seminar, Herzliya, Israel 15 September, 2011 Software Platform

More information

Collage: A Declarative Programming Model for Compositional Development and Evolution of Cross-Organizational Applications

Collage: A Declarative Programming Model for Compositional Development and Evolution of Cross-Organizational Applications Collage: A Declarative Programming Model for Compositional Development and Evolution of Cross-Organizational Applications Bruce Lucas, IBM T J Watson Research Center (bdlucas@us.ibm.com) Charles F Wiecha,

More information

Estimating the Quality of Databases

Estimating the Quality of Databases Estimating the Quality of Databases Ami Motro Igor Rakov George Mason University May 1998 1 Outline: 1. Introduction 2. Simple quality estimation 3. Refined quality estimation 4. Computing the quality

More information

[MS-WSUSOD]: Windows Server Update Services Protocols Overview. Intellectual Property Rights Notice for Open Specifications Documentation

[MS-WSUSOD]: Windows Server Update Services Protocols Overview. Intellectual Property Rights Notice for Open Specifications Documentation [MS-WSUSOD]: Intellectual Property Rights Notice for Open Specifications Documentation Technical Documentation. Microsoft publishes Open Specifications documentation ( this documentation ) for protocols,

More information

6.001 Notes: Section 4.1

6.001 Notes: Section 4.1 6.001 Notes: Section 4.1 Slide 4.1.1 In this lecture, we are going to take a careful look at the kinds of procedures we can build. We will first go back to look very carefully at the substitution model,

More information

Analysis of BPMN Models

Analysis of BPMN Models Analysis of BPMN Models Addis Gebremichael addisalemayehu.gebremichael@student.uantwerpen.be Abstract The Business Process Modeling Notation (BPMN) is a standard notation for capturing business processes,

More information

Component-Based Software Engineering TIP

Component-Based Software Engineering TIP Component-Based Software Engineering TIP X LIU, School of Computing, Napier University This chapter will present a complete picture of how to develop software systems with components and system integration.

More information

UML-Based Conceptual Modeling of Pattern-Bases

UML-Based Conceptual Modeling of Pattern-Bases UML-Based Conceptual Modeling of Pattern-Bases Stefano Rizzi DEIS - University of Bologna Viale Risorgimento, 2 40136 Bologna - Italy srizzi@deis.unibo.it Abstract. The concept of pattern, meant as an

More information

Duration: 5 Days. EZY Intellect Pte. Ltd.,

Duration: 5 Days. EZY Intellect Pte. Ltd., Implementing a SQL Data Warehouse Duration: 5 Days Course Code: 20767A Course review About this course This 5-day instructor led course describes how to implement a data warehouse platform to support a

More information

Introduction to and Aims of the Project : Infocamere and Data Warehousing

Introduction to and Aims of the Project : Infocamere and Data Warehousing Introduction to and Aims of the Project : Infocamere and Data Warehousing Some Background Information Infocamere is the Italian Chambers of Commerce Consortium for Information Technology and as such it

More information

Overview of Reporting in the Business Information Warehouse

Overview of Reporting in the Business Information Warehouse Overview of Reporting in the Business Information Warehouse Contents What Is the Business Information Warehouse?...2 Business Information Warehouse Architecture: An Overview...2 Business Information Warehouse

More information

The International Intelligent Network (IN)

The International Intelligent Network (IN) The International Intelligent Network (IN) Definition In an intelligent network (IN), the logic for controlling telecommunications services migrates from traditional switching points to computer-based,

More information

Editor. Analyser XML. Scheduler. generator. Code Generator Code. Scheduler. Analyser. Simulator. Controller Synthesizer.

Editor. Analyser XML. Scheduler. generator. Code Generator Code. Scheduler. Analyser. Simulator. Controller Synthesizer. TIMES - A Tool for Modelling and Implementation of Embedded Systems Tobias Amnell, Elena Fersman, Leonid Mokrushin, Paul Pettersson, and Wang Yi? Uppsala University, Sweden Abstract. Times is a new modelling,

More information

Comparative Analysis of Architectural Views Based on UML

Comparative Analysis of Architectural Views Based on UML Electronic Notes in Theoretical Computer Science 65 No. 4 (2002) URL: http://www.elsevier.nl/locate/entcs/volume65.html 12 pages Comparative Analysis of Architectural Views Based on UML Lyrene Fernandes

More information

A Tutorial on Agent Based Software Engineering

A Tutorial on Agent Based Software Engineering A tutorial report for SENG 609.22 Agent Based Software Engineering Course Instructor: Dr. Behrouz H. Far A Tutorial on Agent Based Software Engineering Qun Zhou December, 2002 Abstract Agent oriented software

More information

UML- a Brief Look UML and the Process

UML- a Brief Look UML and the Process UML- a Brief Look UML grew out of great variety of ways Design and develop object-oriented models and designs By mid 1990s Number of credible approaches reduced to three Work further developed and refined

More information

Fig 1.2: Relationship between DW, ODS and OLTP Systems

Fig 1.2: Relationship between DW, ODS and OLTP Systems 1.4 DATA WAREHOUSES Data warehousing is a process for assembling and managing data from various sources for the purpose of gaining a single detailed view of an enterprise. Although there are several definitions

More information

Data Modeling and Databases Ch 14: Data Replication. Gustavo Alonso, Ce Zhang Systems Group Department of Computer Science ETH Zürich

Data Modeling and Databases Ch 14: Data Replication. Gustavo Alonso, Ce Zhang Systems Group Department of Computer Science ETH Zürich Data Modeling and Databases Ch 14: Data Replication Gustavo Alonso, Ce Zhang Systems Group Department of Computer Science ETH Zürich Database Replication What is database replication The advantages of

More information

BUSINESS REQUIREMENTS SPECIFICATION (BRS) Documentation Template

BUSINESS REQUIREMENTS SPECIFICATION (BRS) Documentation Template BUSINESS REQUIREMENTS SPECIFICATION (BRS) Documentation Template Approved UN/CEFACT Forum Bonn 2004-03-09 Version: 1 Release: 5 Table of Contents 1 REFERENCE DOCUMENTS...3 1.1 CEFACT/TMWG/N090R10 UN/CEFACTS

More information

CSc33200: Operating Systems, CS-CCNY, Fall 2003 Jinzhong Niu December 10, Review

CSc33200: Operating Systems, CS-CCNY, Fall 2003 Jinzhong Niu December 10, Review CSc33200: Operating Systems, CS-CCNY, Fall 2003 Jinzhong Niu December 10, 2003 Review 1 Overview 1.1 The definition, objectives and evolution of operating system An operating system exploits and manages

More information

Uncertain Data Models

Uncertain Data Models Uncertain Data Models Christoph Koch EPFL Dan Olteanu University of Oxford SYNOMYMS data models for incomplete information, probabilistic data models, representation systems DEFINITION An uncertain data

More information

Chapter 4. Capturing the Requirements. 4th Edition. Shari L. Pfleeger Joanne M. Atlee

Chapter 4. Capturing the Requirements. 4th Edition. Shari L. Pfleeger Joanne M. Atlee Chapter 4 Capturing the Requirements Shari L. Pfleeger Joanne M. Atlee 4th Edition It is important to have standard notations for modeling, documenting, and communicating decisions Modeling helps us to

More information

Building a Data Warehouse step by step

Building a Data Warehouse step by step Informatica Economică, nr. 2 (42)/2007 83 Building a Data Warehouse step by step Manole VELICANU, Academy of Economic Studies, Bucharest Gheorghe MATEI, Romanian Commercial Bank Data warehouses have been

More information

Applying Experiences with Declarative Codifications of Software Architectures on COD

Applying Experiences with Declarative Codifications of Software Architectures on COD Applying Experiences with Declarative Codifications of Software Architectures on COD Position Paper Roel Wuyts Stéphane Ducasse Gabriela Arévalo roel.wuyts@iam.unibe.ch ducasse@iam.unibe.ch arevalo@iam.unibe.ch

More information

1 Dulcian, Inc., 2001 All rights reserved. Oracle9i Data Warehouse Review. Agenda

1 Dulcian, Inc., 2001 All rights reserved. Oracle9i Data Warehouse Review. Agenda Agenda Oracle9i Warehouse Review Dulcian, Inc. Oracle9i Server OLAP Server Analytical SQL Mining ETL Infrastructure 9i Warehouse Builder Oracle 9i Server Overview E-Business Intelligence Platform 9i Server:

More information

After completing this course, participants will be able to:

After completing this course, participants will be able to: Designing a Business Intelligence Solution by Using Microsoft SQL Server 2008 T h i s f i v e - d a y i n s t r u c t o r - l e d c o u r s e p r o v i d e s i n - d e p t h k n o w l e d g e o n d e s

More information

A Case Study for HRT-UML

A Case Study for HRT-UML A Case Study for HRT-UML Massimo D Alessandro, Silvia Mazzini, Francesco Donati Intecs HRT, Via L. Gereschi 32, I-56127 Pisa, Italy Silvia.Mazzini@pisa.intecs.it Abstract The Hard-Real-Time Unified Modelling

More information

Architectural Design

Architectural Design Architectural Design Topics i. Architectural design decisions ii. Architectural views iii. Architectural patterns iv. Application architectures PART 1 ARCHITECTURAL DESIGN DECISIONS Recap on SDLC Phases

More information

3.7 Denotational Semantics

3.7 Denotational Semantics 3.7 Denotational Semantics Denotational semantics, also known as fixed-point semantics, associates to each programming language construct a well-defined and rigorously understood mathematical object. These

More information

HYBRID PETRI NET MODEL BASED DECISION SUPPORT SYSTEM. Janetta Culita, Simona Caramihai, Calin Munteanu

HYBRID PETRI NET MODEL BASED DECISION SUPPORT SYSTEM. Janetta Culita, Simona Caramihai, Calin Munteanu HYBRID PETRI NET MODEL BASED DECISION SUPPORT SYSTEM Janetta Culita, Simona Caramihai, Calin Munteanu Politehnica University of Bucharest Dept. of Automatic Control and Computer Science E-mail: jculita@yahoo.com,

More information

Grid Computing Systems: A Survey and Taxonomy

Grid Computing Systems: A Survey and Taxonomy Grid Computing Systems: A Survey and Taxonomy Material for this lecture from: A Survey and Taxonomy of Resource Management Systems for Grid Computing Systems, K. Krauter, R. Buyya, M. Maheswaran, CS Technical

More information

Training 24x7 DBA Support Staffing. MCSA:SQL 2016 Business Intelligence Development. Implementing an SQL Data Warehouse. (40 Hours) Exam

Training 24x7 DBA Support Staffing. MCSA:SQL 2016 Business Intelligence Development. Implementing an SQL Data Warehouse. (40 Hours) Exam MCSA:SQL 2016 Business Intelligence Development Implementing an SQL Data Warehouse (40 Hours) Exam 70-767 Prerequisites At least 2 years experience of working with relational databases, including: Designing

More information

TDWI Data Modeling. Data Analysis and Design for BI and Data Warehousing Systems

TDWI Data Modeling. Data Analysis and Design for BI and Data Warehousing Systems Data Analysis and Design for BI and Data Warehousing Systems Previews of TDWI course books offer an opportunity to see the quality of our material and help you to select the courses that best fit your

More information

Conceptual Model for a Software Maintenance Environment

Conceptual Model for a Software Maintenance Environment Conceptual Model for a Software Environment Miriam. A. M. Capretz Software Engineering Lab School of Computer Science & Engineering University of Aizu Aizu-Wakamatsu City Fukushima, 965-80 Japan phone:

More information

junit RV Adding Runtime Verification to junit

junit RV Adding Runtime Verification to junit junit RV Adding Runtime Verification to junit Normann Decker, Martin Leucker, and Daniel Thoma Institute for Software Engineering and Programming Languages Universität zu Lübeck, Germany {decker, leucker,

More information

Web Services Annotation and Reasoning

Web Services Annotation and Reasoning Web Services Annotation and Reasoning, W3C Workshop on Frameworks for Semantics in Web Services Web Services Annotation and Reasoning Peter Graubmann, Evelyn Pfeuffer, Mikhail Roshchin Siemens AG, Corporate

More information

Database Systems: Design, Implementation, and Management Tenth Edition. Chapter 1 Database Systems

Database Systems: Design, Implementation, and Management Tenth Edition. Chapter 1 Database Systems Database Systems: Design, Implementation, and Management Tenth Edition Chapter 1 Database Systems Objectives In this chapter, you will learn: The difference between data and information What a database

More information

Embedded Systems Dr. Santanu Chaudhury Department of Electrical Engineering Indian Institute of Technology, Delhi

Embedded Systems Dr. Santanu Chaudhury Department of Electrical Engineering Indian Institute of Technology, Delhi Embedded Systems Dr. Santanu Chaudhury Department of Electrical Engineering Indian Institute of Technology, Delhi Lecture - 13 Virtual memory and memory management unit In the last class, we had discussed

More information

Design patterns of database models as storage systems for experimental information in solving research problems

Design patterns of database models as storage systems for experimental information in solving research problems Design patterns of database models as storage systems for experimental information in solving research problems D.E. Yablokov 1 1 Samara National Research University, 34 Moskovskoe Shosse, 443086, Samara,

More information

EXTENDING THE PRIORITY CEILING PROTOCOL USING READ/WRITE AFFECTED SETS MICHAEL A. SQUADRITO A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE

EXTENDING THE PRIORITY CEILING PROTOCOL USING READ/WRITE AFFECTED SETS MICHAEL A. SQUADRITO A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE EXTENDING THE PRIORITY CEILING PROTOCOL USING READ/WRITE AFFECTED SETS BY MICHAEL A. SQUADRITO A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE IN COMPUTER

More information

11. Architecture of Database Systems

11. Architecture of Database Systems 11. Architecture of Database Systems 11.1 Introduction Software systems generally have an architecture, ie. possessing of a structure (form) and organisation (function). The former describes identifiable

More information

Data Warehousing and OLAP Technologies for Decision-Making Process

Data Warehousing and OLAP Technologies for Decision-Making Process Data Warehousing and OLAP Technologies for Decision-Making Process Hiren H Darji Asst. Prof in Anand Institute of Information Science,Anand Abstract Data warehousing and on-line analytical processing (OLAP)

More information

Configuration Management in the STAR Framework *

Configuration Management in the STAR Framework * 3 Configuration Management in the STAR Framework * Helena G. Ribeiro, Flavio R. Wagner, Lia G. Golendziner Universidade Federal do Rio Grande do SuI, Instituto de Informatica Caixa Postal 15064, 91501-970

More information

5/9/2014. Recall the design process. Lecture 1. Establishing the overall structureof a software system. Topics covered

5/9/2014. Recall the design process. Lecture 1. Establishing the overall structureof a software system. Topics covered Topics covered Chapter 6 Architectural Design Architectural design decisions Architectural views Architectural patterns Application architectures Lecture 1 1 2 Software architecture The design process

More information

1 Executive Overview The Benefits and Objectives of BPDM

1 Executive Overview The Benefits and Objectives of BPDM 1 Executive Overview The Benefits and Objectives of BPDM This is an excerpt from the Final Submission BPDM document posted to OMG members on November 13 th 2006. The full version of the specification will

More information

Course on Database Design Carlo Batini University of Milano Bicocca

Course on Database Design Carlo Batini University of Milano Bicocca Course on Database Design Carlo Batini University of Milano Bicocca 1 Carlo Batini, 2015 This work is licensed under the Creative Commons Attribution NonCommercial NoDerivatives 4.0 International License.

More information

Data Models: The Center of the Business Information Systems Universe

Data Models: The Center of the Business Information Systems Universe Data s: The Center of the Business Information Systems Universe Whitemarsh Information Systems Corporation 2008 Althea Lane Bowie, Maryland 20716 Tele: 301-249-1142 Email: Whitemarsh@wiscorp.com Web: www.wiscorp.com

More information

Data warehouse architecture consists of the following interconnected layers:

Data warehouse architecture consists of the following interconnected layers: Architecture, in the Data warehousing world, is the concept and design of the data base and technologies that are used to load the data. A good architecture will enable scalability, high performance and

More information

INCREMENTAL SOFTWARE CONSTRUCTION WITH REFINEMENT DIAGRAMS

INCREMENTAL SOFTWARE CONSTRUCTION WITH REFINEMENT DIAGRAMS INCREMENTAL SOFTWARE CONSTRUCTION WITH REFINEMENT DIAGRAMS Ralph-Johan Back Abo Akademi University July 6, 2006 Home page: www.abo.fi/~backrj Research / Current research / Incremental Software Construction

More information

Describing the architecture: Creating and Using Architectural Description Languages (ADLs): What are the attributes and R-forms?

Describing the architecture: Creating and Using Architectural Description Languages (ADLs): What are the attributes and R-forms? Describing the architecture: Creating and Using Architectural Description Languages (ADLs): What are the attributes and R-forms? CIS 8690 Enterprise Architectures Duane Truex, 2013 Cognitive Map of 8090

More information

Generalized Document Data Model for Integrating Autonomous Applications

Generalized Document Data Model for Integrating Autonomous Applications 6 th International Conference on Applied Informatics Eger, Hungary, January 27 31, 2004. Generalized Document Data Model for Integrating Autonomous Applications Zsolt Hernáth, Zoltán Vincellér Abstract

More information

Data Mining: Approach Towards The Accuracy Using Teradata!

Data Mining: Approach Towards The Accuracy Using Teradata! Data Mining: Approach Towards The Accuracy Using Teradata! Shubhangi Pharande Department of MCA NBNSSOCS,Sinhgad Institute Simantini Nalawade Department of MCA NBNSSOCS,Sinhgad Institute Ajay Nalawade

More information

Architectural Design

Architectural Design Architectural Design Topics i. Architectural design decisions ii. Architectural views iii. Architectural patterns iv. Application architectures Chapter 6 Architectural design 2 PART 1 ARCHITECTURAL DESIGN

More information

Joint Entity Resolution

Joint Entity Resolution Joint Entity Resolution Steven Euijong Whang, Hector Garcia-Molina Computer Science Department, Stanford University 353 Serra Mall, Stanford, CA 94305, USA {swhang, hector}@cs.stanford.edu No Institute

More information

SQL Server Analysis Services

SQL Server Analysis Services DataBase and Data Mining Group of DataBase and Data Mining Group of Database and data mining group, SQL Server 2005 Analysis Services SQL Server 2005 Analysis Services - 1 Analysis Services Database and

More information

DATA MINING AND WAREHOUSING

DATA MINING AND WAREHOUSING DATA MINING AND WAREHOUSING Qno Question Answer 1 Define data warehouse? Data warehouse is a subject oriented, integrated, time-variant, and nonvolatile collection of data that supports management's decision-making

More information