Data warehousing and data mining are both popular technologies in recent years.


Chapter 1 Introduction

Data warehousing and data mining have both become popular technologies in recent years. Data warehousing is an information infrastructure that stores and integrates different data sources into a consistent repository; through OLAP (On-Line Analytical Processing) tools, business managers can analyze these data from various perspectives to discover information valuable for strategic decisions. Data mining, on the other hand, is the exploration and analysis of data, automatically or semi-automatically, to discover meaningful patterns and rules. From the business viewpoint, integrating these two technologies allows a corporation to understand its customers' behavior and to use this information to gain a competitive advantage in the market.

Among the various patterns of interest to the data mining research community, association rules have attracted great attention recently. An association rule is a rule of the form A => B (sup = s%, conf = c%), which reveals the co-occurrence of two itemsets A and B. An example is PC => Laser Printer (sup = 30%, conf = 80%), which means that 30% of customers buy a PC and a Laser Printer together, and 80% of the customers who buy a PC also buy a Laser Printer.

Mining association rules from a large database is a data- and computation-intensive task. To reduce the complexity of association mining, researchers have proposed the concept of integrating data warehousing systems with association mining algorithms. For example, the DBMiner system [22] developed by J. Han and his research team adopts an OLAP-based association mining approach. A similar paradigm was presented in [22]. The primary problem with the OLAP-based approach is that the OLAP data cube is not well suited to on-line association mining; considerable additional effort is still required to complete the task. For this reason, Lin et al. [15] proposed the concept of the OLAM (On-Line Association Mining) cube, an extension of the iceberg cube [3] used to store frequent multidimensional itemsets. They also proposed a framework for an on-line multidimensional association rule mining system, called OMARS, which gives users an environment for executing OLAP-like queries to mine association rules from data warehouses efficiently.

This thesis is a companion to the implementation of OMARS. In particular, it addresses the problem of selecting appropriate OLAM cubes to materialize and store in OMARS. In accordance with the mining algorithms proposed for OMARS, a suitable model for evaluating the cost of selecting data cubes to materialize is also developed.

1.1 Contributions

The main contributions of this thesis are as follows:

1. We exploit the dependency between OLAM cubes with regard to association queries, thereby devising the structure of the OLAM lattice.
2. We develop a model for evaluating the cost of answering association queries using materialized OLAM cubes, which is a preliminary step for OLAM cube selection.
3. We modify and implement some state-of-the-art heuristic algorithms, and compare these algorithms to evaluate their effectiveness.

1.2 Thesis Organization

This thesis is organized as follows. Chapter 2 describes past research and related work on data warehousing and data mining technologies. Chapter 3 briefly describes the OMARS framework. Chapter 4 formulates our OLAM cube selection problem. The algorithm analysis and cost model are described in Chapter 5. Chapter 6 explains our algorithms, and Chapter 7 presents the experimental results of this research. Finally, Chapter 8 concludes our work and points out some future research directions.

Chapter 2 Background and Related Work

2.1 Data Warehouse and OLAP

2.1.1 Data Warehouse

As coined by W. H. Inmon, the term data warehouse refers to a subject-oriented, integrated, time-variant and nonvolatile collection of data in support of management's decision-making process [11]. In this regard, a data warehouse is a database dedicated to supporting decision making. According to the demands of analysts, data from different databases are extracted and transformed into the data warehouse. When users execute queries, the system only needs to search the data warehouse instead of the source databases, which saves considerable query processing time.

A data warehouse system is composed of three primary parts:

1. The source databases in the back end: The data are collected from various sources, internal or external, legacy or operational, and any change to these sources is continually monitored by several modules called monitors/wrappers.
2. The data warehouse and data marts in the core: The reconciled data are stored in the data warehouse and data marts, which are the central repository for the whole system.
3. The analysis tools in the front end: The analysis tools supported in the front end are usually OLAP, query/tabulation tools, and data mining software.

The typical structure of a data warehouse is illustrated in Figure 2.1.

Figure 2.1. A typical architecture of data warehouse [11]

2.1.2 On-Line Analytical Processing (OLAP)

Although the data stored in a data warehouse have been cleaned, filtered, and integrated, transforming them into useful strategic information still takes much time because of the massive amount of data involved. The concept of On-Line Analytical Processing (OLAP) [4] refers to the process of creating and managing multidimensional data for analysis and visualization. To provide fast, multidimensional analysis of the data in a data warehouse, the OLAP tool precomputes aggregations over the data and organizes the result as a data cube composed of several dimensions, each representing one of the user's analysis perspectives.

The typical operations provided by OLAP include roll-up, drill-down, slice, dice, and pivot [8]. The roll-up operation performs aggregation on a data cube, either by climbing up a concept hierarchy for a dimension or by dimension reduction. Drill-down is the reverse of roll-up: it navigates from less detailed data to more detailed data. The slice operation performs a selection on one dimension of the given cube, resulting in a subcube, while the dice operation defines a subcube by performing a selection on two or more dimensions. The pivot operation, also called rotate, is a visualization operation that rotates the data axes in view in order to provide an alternative presentation of the data. These OLAP operations are illustrated in Figure 2.2.

2.2 Data Warehouse Data Model

Because data warehouse systems require a concise, subject-oriented schema that facilitates on-line data analysis, the entity-relationship data model generally used in relational database systems is not suitable for them. Instead, the most popular data model for a data warehouse is the multidimensional data model. Two common relational models that facilitate multidimensional analysis are the star schema and the snowflake schema.
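As a minimal sketch of the roll-up, slice, and dice operations described above, the following Python fragment works on a handful of fact rows. The rows, the dimension names, and the helper functions `roll_up`, `dice`, and `slice_` are hypothetical illustrations, not part of any OLAP product.

```python
from collections import defaultdict

# Hypothetical fact rows: (product, supplier, customer, sales).
facts = [
    ("P1", "S1", "C1", 10),
    ("P1", "S2", "C1", 5),
    ("P2", "S1", "C2", 7),
    ("P2", "S1", "C1", 3),
]
DIMS = ("product", "supplier", "customer")

def roll_up(rows, keep):
    """Aggregate sales over every dimension that is not listed in `keep`."""
    idx = [DIMS.index(d) for d in keep]
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[i] for i in idx)] += row[3]
    return dict(totals)

def dice(rows, **selection):
    """Keep rows whose value on each named dimension is in the selected set."""
    return [r for r in rows
            if all(r[DIMS.index(d)] in vals for d, vals in selection.items())]

def slice_(rows, dim, value):
    """A slice is a selection on a single dimension."""
    return dice(rows, **{dim: {value}})

by_product = roll_up(facts, ["product"])        # dimension reduction
s1_only = slice_(facts, "supplier", "S1")       # one-dimension selection
sub = dice(facts, product={"P1", "P2"}, customer={"C1"})
```

Drill-down corresponds to calling `roll_up` with more dimensions in `keep`, which returns the finer-grained aggregates again.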

Figure 2.2. The typical operations of OLAP (slice, dice, pivot, roll-up, and drill-down on a cube with Product, Supplier, and Customer dimensions)

2.2.1 Star Schema

The star schema, proposed by Kimball [12], is the most popular dimensional model in the data warehouse community. A star schema consists of a fact table and several dimension tables. The fact table stores a list of foreign keys, each corresponding to a dimension table, together with the numeric measures of interest to users. Each dimension table contains a set of attributes. Moreover, the attributes within a dimension table may form either a hierarchy (total order) or a lattice (partial order). An example of a star schema is depicted in Figure 2.3, and its schema hierarchy is illustrated in Figure 2.4.

  Fact table Sales:    Customer_ID, Time_ID, Product_ID, Quantity
  Dimension Customer:  Customer_ID, City, Education
  Dimension Time:      Time_ID, Date, Month
  Dimension Product:   Product_ID, Category

Figure 2.3. An example of star schema for sales

  Customer:  Customer_ID -> City -> All;  Customer_ID -> Education -> All
  Time:      Time_ID -> Date -> Month -> All
  Product:   Product_ID -> Category -> All

Figure 2.4. An example of schema hierarchy for sales star

2.2.2 Snowflake Data Model

The snowflake schema is a variant of the star schema in which some dimension tables are normalized, thereby further splitting the data into additional hierarchical tables. An example of the snowflake data model is depicted in Figure 2.5.

  Fact table Sales:    Customer_ID, Time_ID, Product_ID, Quantity
  Dimension Customer:  Customer_ID, City, Education
  Dimension Time:      Time_ID, MID, Date
  Dimension Month:     MID, Month
  Dimension Product:   Product_ID, Category

Figure 2.5. An example of snowflake schema for sales

The major difference between the snowflake schema and the star schema is that the dimension tables of the snowflake model may be kept in normalized form to reduce redundancy. As a result, a snowflake schema is easier to maintain and requires less storage space than a star schema. On the other hand, the star schema integrates the schema hierarchies into the dimension tables, so no join operation is needed when traversing the hierarchy of a dimension. Hence, the star schema data model is more popular than the snowflake schema data model.

2.3 Association Rule Mining

2.3.1 Association Rules

Association rule mining is one of the prominent activities in the data mining community. Its aim is to find interesting relationships among items in a given data set. For example, the information that customers who purchase diapers also tend to buy beer at the same time is represented by the association rule below:

  Diaper => Beer [sup = 2%, conf = 60%]

Rule support and confidence are two measures of rule interestingness. A support of 2% means that 2% of customers purchase diapers and beer together. A confidence of 60% means that 60% of the customers who purchase diapers also buy beer. Typically, an association rule is considered interesting if it satisfies a minimum support threshold and a minimum confidence threshold that are set by users or domain experts.

The process of association rule mining can be divided into two steps:

1. Frequent itemset generation: In this step, all itemsets that satisfy the minimum support threshold are discovered.
2. Rule construction: After all frequent itemsets have been generated, the rules derived from them whose confidence satisfies the minimum confidence threshold are output as association rules.
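The two steps above can be sketched as follows. This is a minimal illustration under stated assumptions: the transactions are hypothetical toy data, the thresholds are treated as inclusive, and the level-wise search is a simplified version that omits the candidate pruning of practical algorithms.

```python
from itertools import combinations

# Hypothetical toy transactions, not data from the thesis.
transactions = [
    {"diaper", "beer", "milk"},
    {"diaper", "beer"},
    {"diaper", "milk"},
    {"beer", "bread"},
    {"diaper", "beer", "bread"},
]

def support(itemset):
    """Fraction of transactions that contain every item of `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def frequent_itemsets(minsup):
    """Step 1: level-wise search, extending frequent k-itemsets to (k+1)-candidates."""
    items = sorted({i for t in transactions for i in t})
    level = [frozenset([i]) for i in items if support({i}) >= minsup]
    found = {}
    while level:
        found.update({s: support(s) for s in level})
        candidates = {a | b for a in level for b in level
                      if len(a | b) == len(a) + 1}
        level = [c for c in candidates if support(c) >= minsup]
    return found

def rules(freq, minconf):
    """Step 2: split each frequent itemset into antecedent => consequent."""
    out = []
    for itemset, sup in freq.items():
        for k in range(1, len(itemset)):
            for lhs in map(frozenset, combinations(sorted(itemset), k)):
                conf = sup / freq[lhs]      # conf(A => B) = sup(A ∪ B) / sup(A)
                if conf >= minconf:
                    out.append((set(lhs), set(itemset - lhs), sup, conf))
    return out

freq = frequent_itemsets(minsup=0.4)
found_rules = rules(freq, minconf=0.7)
```

On this toy data the rule {milk} => {diaper} is kept with confidence 1.0, whereas {diaper} => {milk} is dropped because only half of the diaper transactions contain milk.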

The most popular and influential association mining algorithm is Apriori [2], which uses the prior knowledge of frequent k-itemsets to generate candidate (k+1)-itemsets. When the maximum length of frequent itemsets is l, Apriori needs l passes over the database. Because the Apriori algorithm spends much time generating candidate itemsets and counting the support of each itemset, many variants have been proposed to improve the efficiency of the mining process.

2.3.2 Multi-dimensional Association Rules

The concept of multi-dimensional association rules was first proposed by H. Zhu [22] to describe associations between data values from a data warehouse, where the data schema is composed of multiple dimensions and each dimension may contain many attributes. Following the work in [22], we can divide multi-dimensional association rules into three types:

1. Inter-dimensional association rule: This is an association among a set of dimensions. For example, suppose an OLAP cube is composed of three dimensions, Product, Supplier, and Customer, and its data are listed in Table 2.1. An inter-dimensional association rule is:

  Supplier("Hong Kong"), Product("Sport Wear") => Customer("John")

2. Intra-dimensional association rule: This is an association among items coming from one dimension. From Table 2.1, a possible intra-dimensional association rule is:

  Product("Sport Wear") => Product("Tents")

3. Hybrid association rule: This is an association among a set of dimensions in which some items come from the same dimension. It can be regarded as a combination of inter-dimensional and intra-dimensional associations. According to Table 2.1, a hybrid association rule is:

  Product("Sport Wear"), Supplier("Hong Kong") => Product("Tents")

Table 2.1. A relational representation of OLAP cube

  Supplier   Product         Customer  Count
  Hong Kong  Sport Wear      John
  Hong Kong  Sport Wear      Mary
  Hong Kong  Water Purifier  John
  Mexico     Alert Devices   Peter
  Mexico     Carry Bags      Peter
  Mexico     Carry Bags      Bill
  Mexico     Tents           Sue
  Mexico     Tents           Mary
  Seattle    Carry Bags      John
  Seattle    Sport Wear      Peter
  Seattle    Sport Wear      John
  Seattle    Water Purifier  Bill
  Tokyo      Carry Bags      Sue
  Tokyo      Sport Wear      Bill
  Tokyo      Tents           Sue
  Tokyo      Alert Devices   John

2.4 Related Work

2.4.1 Data Cube

The concept of the data cube was first proposed by Gray et al. [6]. It allows analysts to view the data stored in a data warehouse from various aspects and to carry out multidimensional analysis. Each cell in a data cube holds a measure value. For example, consider a sales data cube with three dimensions, Product, Supplier, and Customer, and one measure, Sales_total. This cube is depicted in Figure 2.6 and can be expressed as the following SQL query:

  SELECT Product, Supplier, Customer, SUM(Sales) AS Sales_total
  FROM Sales_Fact
  GROUP BY Product, Supplier, Customer;

Figure 2.6. An example of data cube (dimensions: Product p1 to p6, Supplier s1 to s4, Customer c1 to c4)

2.4.2 Cube Selection Problem

To accelerate query processing, it is important to select the most suitable cubes to materialize. In general, there are three options:

1. Materialize all data cubes: This method gives the lowest query time but needs the most storage space, because every cube has to be materialized.
2. Materialize nothing: This method saves the most storage space but incurs the highest query time, because no cube is materialized.
3. Materialize a fraction of the data cubes: This method selects a part of the data cubes to materialize. However, selecting the most suitable cubes to materialize under a space constraint is difficult; indeed, it has been proved to be an NP-hard problem [9].

In terms of query time, the best option is to materialize all data cubes, but the space limit of the data warehouse usually prevents this. On the other hand, materializing nothing costs too much query time. Therefore, we should try to select the most suitable cubes to materialize, even though this is an NP-hard problem. A substantial body of work in the literature addresses this problem, and it can be classified into three main categories:

1. Heuristic method: This category is mainly based on the greedy paradigm. Harinarayan et al. [9] were the first to consider the problem of selecting materialized views to support multidimensional analysis in OLAP. They proposed a lattice model and provided a greedy algorithm to solve the problem. Gupta et al. [7] further extended their work to include index selection. Ezeife [5] considered the same problem but proposed a uniform approach using a more detailed cost model. Shukla et al. [17] proposed a modified greedy algorithm that selects cubes according to cube size only. Their algorithm was shown to have the same quality as the greedy method of Harinarayan et al. but to be more efficient.
2. Exhaustive method: The work in [19] assumed that all queries should be answered solely from the materialized views, with or without rewriting the users' queries. The authors modeled the problem as a state space optimization problem, and provided exhaustive and heuristic algorithms without concern for the storage constraint. Soutyrina and Fotouhi [18] proposed a dynamic programming algorithm to solve the problem, which can yield the optimal set of cubes.
3. Genetic method: Some work has been devoted to applying genetic algorithms to the view selection problem [10, 20, 21]. Following the AND-OR view graph used in [7], Horng et al. [10] proposed a genetic algorithm that selects an appropriate set of views to minimize the query cost and view maintenance cost. A similar genetic algorithm with a different repair scheme is proposed in [13]; it uses a greedy repair method to correct infeasible solutions instead of using a penalty function to reduce their fitness. Research has shown that the repair scheme handles infeasible solutions better than the penalty function does [16]. Rather than optimizing the view selection for a given query processing plan, the work in [20, 21] focuses on finding an optimal set of processing plans for multiple queries. A solution in their genetic algorithm thus represents a set of processing plans for the given queries.
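To make the greedy paradigm of the heuristic category concrete, the following sketch selects views one at a time, each time taking the view that lowers the total query cost the most. It is written in the spirit of the lattice-based greedy approach and is not the exact algorithm of any cited work. The view sizes are hypothetical, and the cost model (a query on a view costs the row count of the smallest materialized view that covers it, with every view queried equally often) is an assumption made for illustration.

```python
# Hypothetical views: each key is the set of group-by dimensions
# (P = Product, S = Supplier, C = Customer); each value is a row count.
sizes = {
    frozenset("PSC"): 6000,   # base cube, always materialized
    frozenset("PC"): 6000,
    frozenset("PS"): 800,
    frozenset("SC"): 5000,
    frozenset("P"): 200,
    frozenset("S"): 10,
    frozenset("C"): 100,
    frozenset(): 1,
}
TOP = frozenset("PSC")

def query_cost(view, materialized):
    """Assumed linear cost: rows of the smallest materialized view covering `view`."""
    return min(sizes[m] for m in materialized if view <= m)

def total_cost(materialized):
    """Total cost when every view in the lattice is queried once."""
    return sum(query_cost(v, materialized) for v in sizes)

def greedy_select(k):
    """Pick k views beyond the base cube, each giving the largest cost reduction."""
    chosen = {TOP}
    for _ in range(k):
        best = min((v for v in sizes if v not in chosen),
                   key=lambda v: total_cost(chosen | {v}))
        chosen.add(best)
    return chosen

picked = greedy_select(2)
```

This variant limits the number of selected views rather than the storage space; a space-constrained variant would compare the cost reduction per unit of space and stop when the budget is exhausted.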

Chapter 3 The OMARS Framework

In this chapter, we give a brief review of the OMARS framework, because our research deals with the problem of selecting the most suitable OLAM cubes to materialize in this system. The OMARS framework, as illustrated in Figure 3.1, integrates the data warehouse, on-line analytical processing, and the OLAM cube; its objective is to provide an efficient and convenient platform that allows users to perform OLAP-like association explorations. Through the OMARS system, users can perform multidimensional association mining queries, interactively change the dimensions that comprise the associations, and refine constraints such as minimum support and minimum confidence. The functionality of each component is described in the following sections.

Figure 3.1. The OMARS framework [15] (components: Data Warehouse, OLAP Cube, OLAM Cube, Auxiliary Cube, Cube Manager, OLAM Mediator, OLAM Engine)

3.1 OLAM Cube and Auxiliary Cube

The OLAM cube is a new concept proposed by Lin et al. [15]. It stores the frequent itemsets whose supports are greater than or equal to a preset minimum support, denoted as prims. In this regard, the OLAM cube can be regarded as an extension of the iceberg cube. The main difference is that the iceberg cube stores the frequent itemsets derived from inter-dimensional associations only, while the OLAM cube accommodates all three kinds of associations. When the minsup of a user's query is greater than or equal to prims, the OLAM cube accelerates the mining of association rules, because it already stores the frequent itemsets with supports greater than or equal to prims.

Although the OLAM cube can be used to generate association rules efficiently when minsup is greater than or equal to prims, it cannot handle the situation where minsup is lower than prims. To alleviate this problem, the OMARS system embraces another type of data cube, called the auxiliary cube. The auxiliary cube stores the infrequent itemsets of length $K_\alpha$, where $K_\alpha$ denotes the cutting level employed by the mining algorithm CBWon used in OMARS.

3.2 Cube Manager

This component is responsible for three different tasks:

1. Cube selection: This refers to selecting the most appropriate cubes to materialize, in order to minimize the query cost and/or maintenance cost under the constraint of limited storage space.
2. Cube computation: This task deals with efficiently generating the set of materialized cubes chosen by the cube selection module.
3. Cube maintenance: This task concerns how to maintain the materialized cubes when the data in the data warehouse are updated.

Our research in this thesis deals with the implementation of the cube selection task of the Cube Manager. We discuss this in the next chapter.

3.3 OLAM Mediator and OLAM Engine

The OLAM Engine is the interface between the OMARS system and the users. It accepts users' queries and invokes the appropriate algorithm to mine multidimensional association rules. When the OLAM Engine receives a user's query, it analyzes the query and forwards the relevant information to the OLAM Mediator, which then looks for the most relevant cube and returns the result to the OLAM Engine. Here the most relevant cube denotes the materialized OLAM cube that can answer the query at the smallest cost.

There are two possible outcomes of the search performed by the OLAM Mediator, and each is handled in a different way:

1. The OLAM Mediator finds the most relevant cube: In this case, the OLAM Mediator further compares the minsup of the user's query with prims, and handles the following two cases:
   i. minsup ≥ prims: The discovered OLAM cube is capable of answering the query. Return this cube to the OLAM Engine.
   ii. minsup < prims: The discovered OLAM cube cannot answer the query without the aid of the auxiliary cube. Return the OLAM cube and its accompanying auxiliary cube to the OLAM Engine.
2. The OLAM Mediator cannot find such a cube: In this case, the OLAM Mediator searches the OLAP cube repository to determine whether there is an OLAP cube whose data can be used to answer the query. If so, it returns the discovered OLAP cube to the OLAM Engine; otherwise, it notifies the OLAM Engine to execute the mining procedure from the data warehouse afresh.

We discuss the above cases in more detail and devise the cost evaluation for each case in Chapter 5.
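The case analysis above can be summarized in a short dispatch routine. This is only a sketch of the decision logic as described in this section: the class `OlamCube`, the function `mediate`, the cube names, and the cost figures are hypothetical, and the actual OMARS implementation may organize these steps differently.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class OlamCube:
    name: str
    prims: float      # preset minimum support of the materialized cube
    cost: float       # assumed cost of answering the query from this cube
    auxiliary: str    # name of the accompanying auxiliary cube

def mediate(minsup: float, relevant: List[OlamCube],
            olap_cube: Optional[str]) -> Tuple[str, List[str]]:
    """Return the kind of source to use and the cubes handed to the engine."""
    if relevant:
        best = min(relevant, key=lambda c: c.cost)   # most relevant = cheapest
        if minsup >= best.prims:
            return ("olam", [best.name])                         # case 1.i
        return ("olam+auxiliary", [best.name, best.auxiliary])   # case 1.ii
    if olap_cube is not None:
        return ("olap", [olap_cube])                             # case 2, OLAP cube found
    return ("warehouse", [])                                     # case 2, mine afresh

cubes = [OlamCube("m1", 0.10, 5.0, "aux1"), OlamCube("m2", 0.10, 3.0, "aux2")]
high = mediate(0.20, cubes, None)
low = mediate(0.05, cubes, None)
```

Here `relevant` stands for the materialized OLAM cubes that are able to answer the query; how that list is obtained is left outside the sketch.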

Chapter 4 Problem Formulation

In this chapter, we first elaborate on the correspondence between OLAM queries and OLAM cubes, and describe the concept of the OLAM lattice. We then define the problem of OLAM cube selection.

4.1 OLAM Cube and OLAM Query

As described in Chapter 3, an OLAM cube stores frequent itemsets in order to accelerate the mining of association rules. To clarify the structure of the OLAM cube and its relationship to multidimensional associations, we first introduce a four-tuple mining meta-pattern that specifies the form of a multidimensional association query. The definition is as follows:

Definition 4.1. Suppose a star schema $S$ contains a fact table and $m$ dimension tables $\{D_1, D_2, \ldots, D_m\}$. Let $T$ be a jointed table from $S$ composed of attributes $a_1, a_2, \ldots, a_r$, such that for all $a_i, a_j \in Attr(D_k)$ there is no hierarchical relation between $a_i$ and $a_j$, $1 \le i, j \le r$, $1 \le k \le m$. Here $Attr(D_k)$ denotes the attribute set of dimension table $D_k$. A meta-pattern of multidimensional associations from $T$ is defined as follows:

  $MP: \langle t_G, t_M, ms, mc \rangle$,

where $ms$ denotes the minimum support, $mc$ the minimum confidence, $t_G$ the group of transaction attributes, and $t_M$ the group of item attributes, with $t_G, t_M \subseteq \{a_1, a_2, \ldots, a_r\}$ and $t_G \cap t_M = \emptyset$.

This meta-pattern can express the three kinds of multidimensional association rules defined in [22]: intra-association, inter-association, and hybrid association. For example, consider a jointed table T involving three dimensions from the star schema in Figure 2.3. The content of T is shown in Table 4.1. If the item attribute set $t_M$ consists of only one attribute, then the meta-pattern corresponds to an intra-association.

Table 4.1. A jointed table T from star schema

  Tid  City     Education    Date  Month  Product_ID  Category
  1    Taipei   Bachelor     7/12  July   1           A
  2    Taipei   High school  7/12  July   2           A
  3    N.Y.     Master       7/18  July   1           A
  4    Toronto  Master       8/2   Aug.   3           B
  5    Seattle  Master       8/3   Aug.   4           B
  6    N.Y.     High school  8/2   Aug.   1           A
  7    Toronto  High school  7/4   July   1           A
  8    Seattle  Bachelor     7/18  July   5           C
  9    Taipei   Bachelor     8/2   Aug.   2           A
  10   N.Y.     Bachelor     9/1   Sep.   3           B

For instance, let $t_G$ = {City} and $t_M$ = {Category}. We may have the following intra-association rule:

  (Category, "A") => (Category, "B") (sup = 40%, conf = 80%)

Note that to facilitate this mining task, the table T has to be transformed, implicitly or explicitly, into a transaction table as follows:

  City     Category
  Taipei   A
  N.Y.     A, B
  Toronto  A, B
  Seattle  B, C

On the other hand, if $|t_M| \ge 2$, then the resulting associations will be inter-associations or hybrid associations. For example, let $t_G = \emptyset$ and $t_M$ = {Education, Month}. We have an inter-association:

  (Education, "Master") => (Month, "July") (sup = 40%, conf = 80%)

As with the intra-association, the table T has to be transformed into the following form:

  Tid  Education    Month
  1    Bachelor     July
  2    High school  July
  3    Master       July
  4    Master       Aug.
  5    Master       Aug.
  6    High school  Aug.
  7    High school  July
  8    Bachelor     July
  9    Bachelor     Aug.
  10   Bachelor     Sep.

Note that in this case each row of the original table T forms one transaction. But if $t_G$ = {City}, we will have a hybrid association:

  (Education, "Master"), (Month, "July") => (Month, "Aug.") (sup = 40%, conf = 80%)

For this case, the transformed table will be:

  City     Education                      Month
  Taipei   Bachelor, High school          July, Aug.
  N.Y.     Master, High school, Bachelor  July, Aug., Sep.
  Toronto  Master, High school            Aug., July
  Seattle  Master, Bachelor               Aug., July

After explaining the mining patterns, we now clarify the structure of the OLAM cube.

Definition 4.2. Given a meta-pattern MP with transaction attribute set $t_G$ and item attribute set $t_M$, and a preset minimum support prims, the corresponding OLAM cube, $MCube(t_G, t_M)$, is the set of frequent itemsets with supports greater than or equal to prims.

The following examples illustrate the corresponding OLAM cube for the different kinds of multidimensional association rules.

Example 4.1. An intra-dimensional OLAM cube: Let $t_G$ = {City}, $t_M$ = {Category}, and prims = 2. From Table 4.1, the resulting OLAM cube is shown in Table 4.2.

Table 4.2. An example of intra OLAM cube expressed in table

  Category  Support
  A         3
  B         3
  A, B      2

Example 4.2. An inter-dimensional OLAM cube: Let $t_G = \emptyset$, $t_M$ = {Education, Month}, and prims = 2. From Table 4.1, the resulting OLAM cube is shown in Table 4.3.

Table 4.3. An example inter-dimensional OLAM cube expressed in table

  Education    Month  Support
  Bachelor     -      4
  High school  -      3
  Master       -      3
  -            July   5
  -            Aug.   4
  Bachelor     July   2
  High school  July   2
  Master       Aug.   2

Example 4.3. A hybrid-dimensional OLAM cube: Let $t_G$ = {City}, $t_M$ = {Education, Month}, and prims = 3. From Table 4.1, the resulting OLAM cube is shown in Table 4.4.

Table 4.4. An example hybrid-dimensional OLAM cube expressed in table

  Education    Month       Support
  Bachelor     -           3
  High school  -           3
  Master       -           3
  -            July        4
  -            Aug.        4
  -            July, Aug.  4
  Bachelor     July        3
  Bachelor     Aug.        3
  High school  July        3
  High school  Aug.        3
  Master       July        3
  Master       Aug.        3
  Bachelor     July, Aug.  3
  High school  July, Aug.  3
  Master       July, Aug.  3
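The transformation of the jointed table into transactions and the construction of an OLAM cube can be sketched as follows. The code is a brute-force illustration under stated assumptions: the functions `transactions` and `mcube` are hypothetical names, support is counted as the number of transactions containing an itemset, the prims threshold is treated as inclusive, and the education value is spelled "High school" throughout. It is not the mining algorithm used in OMARS.

```python
from itertools import combinations

COLS = ("Tid", "City", "Education", "Date", "Month", "Product_ID", "Category")
ROWS = [  # content of Table 4.1
    (1, "Taipei", "Bachelor", "7/12", "July", 1, "A"),
    (2, "Taipei", "High school", "7/12", "July", 2, "A"),
    (3, "N.Y.", "Master", "7/18", "July", 1, "A"),
    (4, "Toronto", "Master", "8/2", "Aug.", 3, "B"),
    (5, "Seattle", "Master", "8/3", "Aug.", 4, "B"),
    (6, "N.Y.", "High school", "8/2", "Aug.", 1, "A"),
    (7, "Toronto", "High school", "7/4", "July", 1, "A"),
    (8, "Seattle", "Bachelor", "7/18", "July", 5, "C"),
    (9, "Taipei", "Bachelor", "8/2", "Aug.", 2, "A"),
    (10, "N.Y.", "Bachelor", "9/1", "Sep.", 3, "B"),
]
T = [dict(zip(COLS, r)) for r in ROWS]

def transactions(t_g, t_m):
    """Group rows by t_g (one row per transaction if t_g is empty).

    Each transaction is a set of (attribute, value) items drawn from t_m."""
    groups = {}
    for row in T:
        key = tuple(row[a] for a in t_g) if t_g else row["Tid"]
        groups.setdefault(key, set()).update((a, row[a]) for a in t_m)
    return list(groups.values())

def mcube(t_g, t_m, prims):
    """Itemsets whose support (number of transactions) is at least prims."""
    txns = transactions(t_g, t_m)
    items = sorted({i for t in txns for i in t})
    cube = {}
    for k in range(1, len(items) + 1):
        level = {}
        for c in combinations(items, k):
            sup = sum(set(c) <= t for t in txns)
            if sup >= prims:
                level[frozenset(c)] = sup
        if not level:          # no frequent k-itemset, so none of size k+1 either
            break
        cube.update(level)
    return cube

intra = mcube(["City"], ["Category"], prims=2)
inter = mcube([], ["Education", "Month"], prims=2)
hybrid = mcube(["City"], ["Education", "Month"], prims=3)
```

The three calls at the end use the settings of Examples 4.1 to 4.3, so the returned dictionaries can be compared with the corresponding tables.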

4.2 OLAM Lattice

In accordance with the definition of the OLAM cube, we can generate all possible OLAM cubes from the star schema, thereby forming an OLAM lattice. To provide hierarchical navigation and multidimensional exploration, the OMARS system [15] models the OLAM lattice as a three-layer structure. The first layer lattice expresses the combinations of all dimensions. The second layer further exploits the inter-attribute combinations for each dimensional combination in the first layer lattice. The third layer exploits all OLAM cubes corresponding to the meta-patterns derived from each subcube in the second layer. Note that the real OLAM cubes are stored in the third layer.

For example, consider the star schema illustrated in Figure 2.3. The first layer lattice shown in Figure 4.1 is composed of eight possible dimensional combinations. After constructing the first layer lattice, we choose the node composed of the customer and time dimensions, and extend it to form the second layer lattice shown in Figure 4.2. Each node of the second layer lattice is constructed by attaching attributes chosen from the selected dimensions. Finally, we extend the cube <(city, education), (date)> to form the third layer lattice shown in Figure 4.3, where * marks a transaction attribute. It can be observed that there is one OLAM cube corresponding to inter-association, (city, education, date); three OLAM cubes corresponding to hybrid associations, (city, education, *date), (city, *education, date), and (*city, education, date); and three cubes corresponding to intra-associations, (city, *education, *date), (*city, education, *date), and (*city, *education, date). Note that (*city, *education, *date) is shown only to complete the lattice structure; it is useless and will not be materialized.
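The third layer of the lattice can be enumerated mechanically: every subset of the attributes of a second-layer subcube may serve as the transaction attributes, and the remaining attributes become item attributes. The following sketch, in which the function name `third_layer` and the labels are hypothetical, classifies each node in the way described above.

```python
from itertools import combinations

def third_layer(attributes):
    """List (t_G, t_M, kind) for every choice of transaction attributes."""
    nodes = []
    for k in range(len(attributes) + 1):
        for t_g in combinations(attributes, k):
            t_m = tuple(a for a in attributes if a not in t_g)
            if not t_m:
                kind = "useless"    # no item attribute left to mine
            elif len(t_m) == 1:
                kind = "intra"      # a single item attribute
            elif not t_g:
                kind = "inter"      # every row is its own transaction
            else:
                kind = "hybrid"
            nodes.append((t_g, t_m, kind))
    return nodes

nodes = third_layer(("city", "education", "date"))
kinds = [kind for _, _, kind in nodes]
```

For three attributes this yields the eight nodes of the example: one inter, three hybrid, three intra, and the one node that is never materialized.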

  <Customer, Product, Time>
  <Customer, Product, ->   <Customer, -, Time>   <-, Product, Time>
  <Customer, -, ->   <-, Product, ->   <-, -, Time>
  <-, -, ->

Figure 4.1. The 1st layer OLAM lattice for the example star schema in Figure 2.3

  <Customer, -, Time>
  <(city), (date)>   <(education), (date)>   <(education), (month)>   <(city), (month)>
  <(city, education), (date)>   <(city, education), (month)>   <(city), (date, month)>   <(education), (date, month)>
  <(city, education), (date, month)>

Figure 4.2. The 2nd layer lattice derived from <Customer, -, Time> in the 1st layer

  0 transaction attributes (inter-association):  (city, education, date)
  1 transaction attribute (hybrid association):  (city, education, *date)   (city, *education, date)   (*city, education, date)
  2 transaction attributes (intra-association):  (city, *education, *date)   (*city, education, *date)   (*city, *education, date)
  3 transaction attributes:                      (*city, *education, *date)
  (* : transaction attribute)

Figure 4.3. The 3rd layer lattice derived from the subcube <(city, education), (date)> in the 2nd layer

Because the real OLAM cubes are stored in the third layer lattice, we can mine multidimensional association rules efficiently by materializing these OLAM cubes. From this three-layer lattice, we observe an attribute dependency, stated as follows:

Proposition 4.1. Consider two OLAM cubes, $MCube(t_{G_1}, t_{M_1})$ and $MCube(t_{G_2}, t_{M_2})$. If $t_{G_1} = t_{G_2}$ and $t_{M_2} \subseteq t_{M_1}$, then every itemset in $MCube(t_{G_2}, t_{M_2})$ must be a subset of an itemset in $MCube(t_{G_1}, t_{M_1})$, and these two itemsets have the same support value.

Example 4.4. Consider the table T in Table 4.1. Let $MCube(t_{G_1}, t_{M_1})$ be the cube illustrated in Table 4.4 and $MCube(t_{G_2}, t_{M_2})$ the cube illustrated in Table 4.5. Hence $t_{G_1} = t_{G_2}$ = {City}, $t_{M_1}$ = {Education, Month}, $t_{M_2}$ = {Education}, and prims = 3. It can be verified that every frequent itemset stored in $MCube(t_{G_2}, t_{M_2})$ is a subset of a frequent itemset in $MCube(t_{G_1}, t_{M_1})$, and both itemsets have the same support value.

Table 4.5. An OLAM Cube

  Education    Support
  Bachelor     3
  High school  3
  Master       3

According to Proposition 4.1, there is a dependency between OLAM cubes in the third layer lattice, which is formalized below.

Definition 4.3. Consider two OLAM cubes, $MCube(t_{G_1}, t_{M_1})$ and $MCube(t_{G_2}, t_{M_2})$. We say that $MCube(t_{G_2}, t_{M_2})$ is dependent upon $MCube(t_{G_1}, t_{M_1})$ if $t_{G_1} = t_{G_2}$ and $t_{M_2} \subseteq t_{M_1}$, and denote this as $MCube(t_{G_2}, t_{M_2}) \preceq MCube(t_{G_1}, t_{M_1})$.

One important aspect of Definition 4.3 is that if $MCube(t_{G_2}, t_{M_2}) \preceq MCube(t_{G_1}, t_{M_1})$, then all multidimensional queries that can be answered via $MCube(t_{G_2}, t_{M_2})$ can also be answered via $MCube(t_{G_1}, t_{M_1})$.

Furthermore, it should be noted that not all of the OLAM cubes derived in the lattice have to be materialized and stored, because the concept hierarchies defined over the attributes in the star schema make it possible to prune some redundant cubes.
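The dependency of Definition 4.3 is a simple test on attribute sets, which the following sketch expresses directly. The function names `depends_on` and `answering_cubes` and the list of materialized cubes are hypothetical, and each cube is represented only by its pair of transaction and item attribute sets.

```python
def depends_on(cube2, cube1):
    """True if cube2 = (t_G2, t_M2) is dependent upon cube1 = (t_G1, t_M1)."""
    (t_g2, t_m2), (t_g1, t_m1) = cube2, cube1
    return set(t_g1) == set(t_g2) and set(t_m2) <= set(t_m1)

def answering_cubes(query_cube, materialized):
    """Materialized cubes that the queried cube depends upon."""
    return [c for c in materialized if depends_on(query_cube, c)]

# Hypothetical set of materialized cubes, each written as (t_G, t_M).
materialized = [
    (("City",), ("Education", "Month")),
    (("City",), ("Category",)),
    ((), ("Education", "Month")),
]
usable = answering_cubes((("City",), ("Education",)), materialized)
```

In this example only the first cube is returned, which matches the pair of cubes discussed in Example 4.4.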

Consider an OLAM cube, $MCube(t_G, t_M)$. We observe that there are two different types of redundancy.

Proposition 4.2. Schema redundancy: Let $a_i, a_j \in t_G$. If $a_i$ and $a_j$ are in the same dimension and $a_j$ is an ancestor of $a_i$, then $MCube(t_G, t_M)$ is redundant with respect to the cube $MCube(t_G - \{a_j\}, t_M)$.

Example 4.5. Consider the jointed table in Table 4.1. Let $t_M$ = {Category}. The table obtained by grouping Date and Month as transaction attributes is shown in Table 4.6. Note that this table has the same transactions as the one obtained by grouping Date alone as the transaction attribute, shown in Table 4.7. Thus, the resulting cube MCube({Date, Month}, {Category}) is the same as MCube({Date}, {Category}).

Table 4.6. The resulting table by grouping {Date, Month} as transaction attributes for Table 4.1

  Date  Month  Category
  7/4   July   A
  7/12  July   A
  7/18  July   A, C
  8/2   Aug.   A, B
  8/3   Aug.   B
  9/1   Sep.   B

Table 4.7. The resulting table by grouping {Date} as transaction attribute for Table 4.1

Date    Category
7/4     A
7/12    A
7/18    A, C
8/2     A, B
8/3     B
9/1     B

Proposition 4.3. Values redundancy: Let $a_i, a_j \in t_M$. If $a_i$, $a_j$ are in the same dimension and $a_j$ is an ancestor of $a_i$, then MCube($t_G$, $t_M$) is a cube with values redundancy.

Example 4.6. Consider the joined table in Table 4.1. Let $t_G$ = {City}, $t_M$ = {Date, Month} and prims = 2. The resulting OLAM cube is shown in Table 4.8. One can observe that the tuples with dotted lines in this table are redundant patterns, so the cube exhibits values redundancy. Note that when values redundancy holds, the redundant patterns must be pruned during the generation of frequent itemsets.

Table 4.8. The resulting OLAM cube MCube({City}, {Date, Month})

Date Month support
7/18 8/ July Aug /18 July 2 7/18 8/2 Aug. July 2 3 8/2 Aug. 3 July, Aug. 4 7/18 July, Aug. 2 8/2 July, Aug. 3

In addition to the above observations, we observe that an OLAM cube is useless if it satisfies the following property.

Proposition 4.4. Useless property: Let $a_i \in t_G$ and $t_M = \{a_j\}$. If $a_i$, $a_j$ are in the same dimension and $a_j$ is an ancestor of $a_i$, then MCube($t_G$, $t_M$) is a useless cube.

Example 4.7. Let $t_G$ = {City, Date} and $t_M$ = {Month}. The table that results from grouping Table 4.1 by {City, Date} as transactions is shown in Table 4.9. One can observe that the cardinality of every transaction is 1. Therefore, we cannot find any association rule from this table.
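The three pruning conditions (Propositions 4.2 to 4.4) can be checked mechanically once the concept hierarchy is known. The following is a minimal sketch under stated assumptions: the `PARENT` map is a hypothetical one-level hierarchy containing only the Date-to-Month relation used in the examples, and the names `is_ancestor` and `classify` are illustrative rather than taken from OMARS.

```python
# Hypothetical concept hierarchy: each attribute maps to its parent in the
# same dimension. Only Date -> Month is taken from the examples above.
PARENT = {"Date": "Month"}


def is_ancestor(anc, attr):
    """True if `anc` lies above `attr` in the same dimension hierarchy."""
    cur = PARENT.get(attr)
    while cur is not None:
        if cur == anc:
            return True
        cur = PARENT.get(cur)
    return False


def has_ancestor_pair(attrs):
    """True if some attribute in `attrs` is an ancestor of another one."""
    return any(is_ancestor(a, b) for a in attrs for b in attrs)


def classify(t_g, t_m):
    """Label a cube MCube(t_G, t_M) according to Propositions 4.2 to 4.4."""
    if len(t_m) == 1 and any(is_ancestor(next(iter(t_m)), a) for a in t_g):
        return "useless"            # Proposition 4.4
    if has_ancestor_pair(t_g):
        return "schema redundancy"  # Proposition 4.2
    if has_ancestor_pair(t_m):
        return "values redundancy"  # Proposition 4.3
    return "keep"


example_4_5 = classify({"Date", "Month"}, {"Category"})
example_4_6 = classify({"City"}, {"Date", "Month"})
example_4_7 = classify({"City", "Date"}, {"Month"})
```

Note that a cube labelled "values redundancy" is not discarded outright in the text; only its redundant patterns are pruned while frequent itemsets are generated.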

Table 4.9. The resulting table by grouping {City, Date} as transaction attribute for Table 4.1

City      Date    Month
Toronto   7/4     July
Taipei    7/12    July
Taipei    8/2     Aug.
N.Y.      7/18    July
N.Y.      8/2     Aug.
Toronto   8/2     Aug.
Seattle   8/3     Aug.
N.Y.      9/1     Sep.

4.3 OLAM Cube Selection

We now proceed to give a formal definition of the OLAM cube selection problem. To this end, we introduce the symbols listed in the symbol table that follows. Assume that an OLAM lattice L contains n OLAM data cubes $D = \{d_1, d_2, \ldots, d_n\}$, the set of users' queries is $Q = \{q_1, q_2, \ldots, q_m\}$, the set of query frequencies is $F = \{f_{q_1}, f_{q_2}, \ldots, f_{q_m}\}$, and the space constraint is S. The OLAM cube selection problem is denoted as a five-tuple $\theta = \{L, D, Q, F, S\}$. A solution to $\theta$ is a subset of D, say M, that minimizes the following cost function subject to the constraint $\sum_{d \in M} |d| \le S$:

$$\min \sum_{i=1}^{m} f_{q_i} \cdot E(q_i, M).$$

Table: The Symbol Table

Symbol          Definition
L               Lattice
D               Set of data cubes
$d_n$           n-th data cube
Q               Set of user queries
$q_m$           m-th user query
F               Set of user query frequencies
$f_{q_i}$       Frequency of the i-th query
S               Space constraint
M               Set of materialized cubes
$E(q_i, M)$     The total time to respond to the i-th query using the materialized views
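To make the objective concrete, the sketch below evaluates $\sum_i f_{q_i} \cdot E(q_i, M)$ and finds the best feasible M by exhaustive search on a toy instance. Everything in the toy instance is hypothetical: the cube sizes, the query frequencies, the `answers` map, and the stand-in response-time function `toy_E` (the size of the smallest materialized cube that can answer the query, else a fixed base cost). Exhaustive search is used only because the instance is tiny; it is not the selection method studied in this thesis.

```python
from itertools import combinations


def total_cost(M, queries, freq, E):
    """Objective of the selection problem: sum of f_q * E(q, M) over all queries."""
    return sum(freq[q] * E(q, M) for q in queries)


def best_selection(cubes, size, S, queries, freq, E):
    """Try every subset M of cubes whose total size is at most S."""
    best = frozenset()
    best_cost = total_cost(best, queries, freq, E)
    for r in range(1, len(cubes) + 1):
        for combo in combinations(cubes, r):
            M = frozenset(combo)
            if sum(size[d] for d in M) > S:
                continue  # violates the space constraint
            cost = total_cost(M, queries, freq, E)
            if cost < best_cost:
                best, best_cost = M, cost
    return best, best_cost


# Hypothetical toy instance.
size = {"d1": 4, "d2": 3, "d3": 2}
freq = {"q1": 3, "q2": 1}
answers = {"q1": {"d1"}, "q2": {"d2", "d3"}}  # cubes able to answer each query
BASE_COST = 10                                 # cost of going to the warehouse


def toy_E(q, M):
    usable = [size[d] for d in M if d in answers[q]]
    return min(usable) if usable else BASE_COST


selected, cost = best_selection(["d1", "d2", "d3"], size, 5, ["q1", "q2"], freq, toy_E)
```

With these numbers the search settles on materializing only `d1`, because the frequent query `q1` dominates the objective and `d1` nearly exhausts the space budget.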

Chapter 5 Evaluation of OLAM Query Cost

5.1 Query Evaluation Flow

As stated previously, the primary task of OLAM Engine is to generate association rules according to users' queries. After receiving a query, OLAM Engine analyzes the query, transfers the necessary information to OLAM Mediator, and then waits for the best-matching cube from OLAM Mediator. When OLAM Mediator receives the query information from OLAM Engine, it looks for the best-matching cube. First, OLAM Mediator searches for the required OLAM cube. If the cube is found, it further checks whether minsup ≥ prims; if so, it returns the found OLAM cube to OLAM Engine; otherwise it returns the corresponding auxiliary cube of the found OLAM cube and notifies OLAM Engine to perform association mining from the data warehouse with the aid of this auxiliary cube. On the other hand, if OLAM Mediator cannot find any qualified OLAM cube to answer the user's query, it notifies OLAM Engine to perform association mining from the data warehouse afresh. The procedure employed by OLAM Mediator is depicted in Figure 5.1.
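The decision flow described above can be sketched as a small classifier that maps a query to one of three outcomes: answer from the OLAM cube alone, answer from the OLAM cube plus auxiliary cube and warehouse, or mine the warehouse afresh. The sketch makes one assumption loudly: the "best-matching cube" is taken to be the smallest materialized cube with the same $t_G$ whose $t_M$ contains the query's $t_M$. The function names and the tuple representation are hypothetical; the sample cubes and queries mirror those of Example 5.1.

```python
def find_cube(t_g, t_m, materialized):
    """Assumed matching rule: smallest materialized cube with equal t_G
    whose t_M is a superset of the query's t_M; None if there is none."""
    candidates = [(g, m) for g, m in materialized if g == t_g and m >= t_m]
    return min(candidates, key=lambda cube: len(cube[1]), default=None)


def evaluate_case(query, materialized, prims):
    """Return (case, cube): 1 = OLAM cube only, 2 = OLAM cube plus auxiliary
    cube and warehouse, 3 = warehouse only."""
    t_g, t_m, minsup = query
    cube = find_cube(frozenset(t_g), frozenset(t_m), materialized)
    if cube is None:
        return 3, None
    if minsup >= prims:
        return 1, cube
    return 2, cube


materialized = [
    (frozenset({"City"}), frozenset({"Education", "Date"})),
    (frozenset({"City"}), frozenset({"Education", "Date", "Category"})),
    (frozenset({"Date"}), frozenset({"City"})),
]
prims = 3
q1 = ({"City"}, {"Education", "Date"}, 4)
q2 = ({"City"}, {"Education", "Date", "Category"}, 2)
q3 = ({"Date"}, {"City", "Education"}, 3)

cases = [evaluate_case(q, materialized, prims)[0] for q in (q1, q2, q3)]
```

Here `q3` falls through to the warehouse because the only cube grouped by Date does not cover the Education attribute.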

[Figure 5.1. The flow diagram of OLAM query. Start; OLAM query; is the required OLAM cube found? If no, the required OLAM cube does not exist. If yes, test minsup >= prims: if yes, return the OLAM cube; if no, return the OLAM cube and auxiliary cube. End.]

It is worth mentioning that, for simplicity, we do not consider OLAP cubes in this study, although the OMARS system does take this kind of data cube into account in association mining. In accordance with the work flow of OLAM Mediator and OLAM Engine, our paradigm for evaluating OLAM query cost is shown below:

Procedure Evacost_OLAMQ(q)
begin
    Let q = <t_G, t_M, minsup>;
    found = OLAMQ_search(q, CQ);
    if found = TRUE then
        if prims ≤ minsup then
            cost = the cost for evaluating query q using OLAM cube CQ.MCube; /* case 1 */
        else
            cost = the cost for evaluating query q using CQ.MCube, auxiliary cube CQ.XCube and data warehouse; /* case 2 */
        end if
    else
        cost = the cost for evaluating query q using data warehouse; /* case 3 */
    end if
    return cost;
end

Figure 5.2. The procedure to compute the cost of a user's query

In summary, there are three different cases to be dealt with:

Case 1: evaluating the cost via the qualified OLAM cube.
Case 2: evaluating the cost via OLAM cube, auxiliary cube, and data warehouse.
Case 3: evaluating the cost via data warehouse.

The cost evaluation for each case will be elaborated in the following sections. We end this section with the description of OLAMQ_search.

Procedure OLAMQ_search(q, CQ)
begin
    found = FALSE;
    if MCube(q.t_G, q.t_M) is materialized then
        CQ.MCube = MCube(q.t_G, q.t_M);
        CQ.XCube = XCube(q.t_G, q.t_M);
        found = TRUE;
    end if
    CurQ = φ;
    for each MCube in the OLAM lattice do
        if MCube is materialized and MCube.t_G = q.t_G and MCube.t_M ⊇ q.t_M and (MCube.t_M ⊂ CurQ.t_M or CurQ = φ) then
            CurQ = MCube;
        end if
    end for
    if found then
        CQ.MCube = CurQ;
        CQ.XCube = XCube(q.t_G, CurQ.t_M);
    end if
    return found
end

Figure 5.3. Procedure OLAMQ_search

Example 5.1. Suppose the OMARS system stores the following three materialized OLAM cubes: MCube($t_{G_1}$, $t_{M_1}$), where $t_{G_1}$ = {City} and $t_{M_1}$ = {Education, Date}; MCube($t_{G_2}$, $t_{M_2}$), where $t_{G_2}$ = {City} and $t_{M_2}$ = {Education, Date, Category}; and MCube($t_{G_3}$, $t_{M_3}$), where $t_{G_3}$ = {Date} and $t_{M_3}$ = {City}; and prims = 3. We have three users' queries $q_1$, $q_2$, $q_3$, where $q_1.t_G$ = {City}, $q_1.t_M$ = {Education, Date}, and $q_1.ms$ = 4; $q_2.t_G$ = {City}, $q_2.t_M$ = {Education, Date, Category}, and $q_2.ms$ = 2; $q_3.t_G$ = {Date}, $q_3.t_M$ = {City, Education}, and $q_3.ms$ = 3. These three queries give rise to the following three conditions:

1. When the user's query is $q_1$, the condition is the same as Case 1 described above. Because the corresponding OLAM cube can be found in the OMARS system and the minsup of the user's query is higher than prims, we can use MCube($t_{G_1}$, $t_{M_1}$) to respond to the user's query immediately.

2. When the user's query is $q_2$, the condition is the same as Case 2 described above. Because the minsup of the user's query is lower than prims, we need to utilize the corresponding auxiliary cube of the found OLAM cube MCube($t_{G_2}$, $t_{M_2}$) and the data warehouse to answer query $q_2$.

3. When the user's query is $q_3$, the condition is the same as Case 3 described above. Because we cannot find any matching OLAM cube in the OMARS system, we must utilize the data warehouse to answer query $q_3$.

5.2 Cost Evaluation for Case 1

In this case, the OLAM cube returned from OLAM Mediator can be utilized to respond to users' queries. The CBW_on algorithm [15] is employed to mine association rules. To facilitate the analysis, we replicate the CBW_on algorithm in Figure 5.4. Because the qualified frequent itemsets have been stored in the found OLAM cube, and minsup ≥ prims, there is no need to generate the frequent itemsets via an Apriori-like algorithm. All we have to do is scan the frequent itemsets in the OLAM cube and perform the association_gen procedure in Figure 5.7 to generate qualified association rules.

Algorithm CBW_on
Input: relevant cube MCube(t_G, t_M), minsup and prims;
Output: The set of frequent itemsets F;
    if minsup < prims then
        AF = {X | sup(X) ≥ minsup, X ∈ Auxiliary Cube} ∪ {Y | Y ∈ MCube(t_G, t_M) and |Y| = K_α};
        DF = Dwnsearch_on(T, AF, K_α, minsup);
        UF = Upsearch(AF, minsup);
        F = DF ∪ UF;
    else
        F = {X | X ∈ MCube(t_G, t_M) and sup(X) ≥ minsup};
    end if
    return F;

Figure 5.4. Algorithm CBW_on

Procedure Dwnsearch_on
    for i = 1 to |D| do
        scan the i-th transaction t_i;
        delete those items in t_i but not in AF;
        for each subset X of t_i and 2 ≤ |X| ≤ K_α do
            sup(X)++;
        end for
    end for
    DF = {X | sup(X) ≥ minsup} ∪ AF;

Figure 5.5. Procedure Dwnsearch_on

Procedure Upsearch
    transform horizontal data format T into t_id lists;
    F_{K_α} = frequent K_α-itemsets;
    k = K_α, F_k = F_{K_α};
    repeat
        k++;
        C_k = new candidate k-itemsets generated from F_{k-1};
        for each X ∈ C_k do
            perform bit-vector intersection on X;
            count the support of X;
        end for
        F_k = {X | sup(X) ≥ prims, X ∈ C_k};
        UF = UF ∪ F_k;
    until F_k = ∅

Figure 5.6. Procedure Upsearch
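The subset-counting loop at the heart of Dwnsearch_on can be illustrated with a short Python sketch. It is a simplification under stated assumptions: `allowed_items` stands in for the items that occur in AF, minsup is treated as an absolute count, and the final union with AF is omitted. The sample transactions are the Category column of Table 4.7.

```python
from collections import Counter
from itertools import combinations


def dwnsearch_count(transactions, allowed_items, k_alpha, minsup):
    """Count every subset X of each transaction with 2 <= |X| <= k_alpha,
    after dropping items outside `allowed_items`, and keep those whose
    support count reaches minsup."""
    sup = Counter()
    for t in transactions:
        items = sorted(set(t) & set(allowed_items))
        for size in range(2, k_alpha + 1):
            for subset in combinations(items, size):
                sup[frozenset(subset)] += 1
    return {x: c for x, c in sup.items() if c >= minsup}


# Category values per Date transaction, as in Table 4.7.
transactions = [{"A"}, {"A"}, {"A", "C"}, {"A", "B"}, {"B"}, {"B"}]

pairs_minsup_1 = dwnsearch_count(transactions, {"A", "B", "C"}, k_alpha=2, minsup=1)
pairs_minsup_2 = dwnsearch_count(transactions, {"A", "B", "C"}, k_alpha=2, minsup=2)
```

The nested loop over `combinations` is exactly the step whose cost is later estimated as a sum of binomial coefficients in the transaction length.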

Procedure association_gen(F: set of all frequent itemsets; min_conf: minimum confidence threshold)
begin
    for each l ∈ F do
        generate P(l); // P(l): power set of l
        for each s ∈ P(l) with s ≠ φ and s ≠ l do
            if support_count(l) / support_count(s) ≥ min_conf then
                output s ⇒ l − s;
end

Figure 5.7. Procedure association_gen

The cost thus can be divided into two parts:

1. Frequent itemsets discovery: This involves searching the frequent itemsets stored in the OLAM cube for those with support no lower than the minsup of the user's query, which costs $\alpha |D_M|$, where $D_M$ denotes the OLAM cube.

2. Rule generation: For each discovered frequent itemset, we construct all possible rules from it, compute their confidence, and keep those that satisfy the minimum confidence.

The key point for the complexity analysis thus lies in the number of candidate rules to be generated and inspected. Our first step in this direction is to consider the number of rules that can be generated from a frequent k-itemset and all of its subsets.

Lemma 1. The number of rules that can be constructed from a k-itemset is $2^k - 2$.

Proof. Recall that each rule that can be constructed from an itemset X has the form $A \Rightarrow X - A$ for $A \subset X$ and $A \neq \varphi$. Thus, the number of different A's determines the number of rules, which is

$$\sum_{i=1}^{k-1} \binom{k}{i} = 2^k - 2.$$

Lemma 2. For a k-itemset X, the total number of rules that can be generated from X and its subsets is $3^k - 2^{k+1} + 1$.

Proof. From Lemma 1, we can derive

$$\binom{k}{2}(2^2 - 2) + \binom{k}{3}(2^3 - 2) + \cdots + \binom{k}{k}(2^k - 2) = \sum_{i=2}^{k} \binom{k}{i} 2^i - 2 \sum_{i=2}^{k} \binom{k}{i}$$

$$= \left( \sum_{i=0}^{k} \binom{k}{i} 2^i - 1 - 2k \right) - 2 \left( \sum_{i=0}^{k} \binom{k}{i} - 1 - k \right) = (3^k - 1 - 2k) - 2(2^k - 1 - k) = 3^k - 2^{k+1} + 1.$$

Now, if we knew the set of maximal frequent itemsets, we could complete the analysis. Unfortunately, the exact set is unobtainable without a priori knowledge of the user-specified minsup. We thus resort to an estimation that takes prims in place of minsup. We then apply sampling to obtain a random subset of the warehouse data, and we can either

1. compute the maximal frequent itemsets for each OLAM cube using any maximal pattern mining algorithm, or

2. apply the CBW_off algorithm to estimate $K_\alpha$ (the cutting level), compute the frequent itemsets with cardinality $K_\alpha$, and regard these itemsets as the maximal frequent itemsets.
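Both lemmas are easy to confirm by brute-force enumeration for small k. The sketch below is only a sanity check; the helper names `rules_from` and `rules_from_all_subsets` are illustrative and do not come from the thesis.

```python
from itertools import combinations


def rules_from(itemset):
    """All rules A => X - A with A a non-empty proper subset of X (Lemma 1)."""
    items = sorted(itemset)
    full = frozenset(items)
    return [(frozenset(a), full - frozenset(a))
            for r in range(1, len(items))
            for a in combinations(items, r)]


def rules_from_all_subsets(itemset):
    """Number of rules obtainable from X and all of its subsets of size >= 2 (Lemma 2)."""
    items = sorted(itemset)
    return sum(len(rules_from(sub))
               for r in range(2, len(items) + 1)
               for sub in combinations(items, r))


# Compare the enumeration against the closed forms for k = 1..6.
lemma_1_holds = all(len(rules_from(range(k))) == 2 ** k - 2 for k in range(1, 7))
lemma_2_holds = all(rules_from_all_subsets(range(k)) == 3 ** k - 2 ** (k + 1) + 1
                    for k in range(1, 7))
```

For instance, a 3-itemset yields 6 rules by itself and 12 rules once its three 2-item subsets are included, matching $2^3 - 2$ and $3^3 - 2^4 + 1$.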

Let MF denote the set of maximal patterns. If the first approach is adopted, the computation spent on rule generation will be

$$\sum_{X \in MF} \left( 3^{|X|} - 2^{|X|+1} + 1 \right),$$

or

$$|F_{K_\alpha}| \left( 3^{K_\alpha} - 2^{K_\alpha + 1} + 1 \right)$$

if the second approach is used. Here, for simplicity, we adopt the second approach. Finally, combining the cost of frequent itemsets discovery and rule generation, we have

$$|F_{K_\alpha}| \, 3^{K_\alpha} + \alpha |D_M|.$$

5.3 Cost Evaluation for Case 2

In this case, algorithm CBW_on illustrated in Figure 5.4 executes the minsup < prims part of the if clause, which comprises three different steps.

1. Generate AF, i.e., $F_{K_\alpha}$. This requires scanning the auxiliary cube and the OLAM cube. The cost is $\alpha(|D_X| + |D_M|)$, where $D_X$ denotes the auxiliary cube and $D_M$ denotes the OLAM cube.

2. Execute procedure Dwnsearch_on illustrated in Figure 5.5. Note that this procedure presumes the availability of the corresponding joined table, and ignores the preprocessing step that generates the joined table. To account for this task and simplify the discussion, we assume this cost is w and the table is T. As illustrated in Figure 5.5, the Dwnsearch_on procedure needs to scan all the transactions in the database. The I/O cost is $\alpha |T|$. Next we estimate the cost of the most expensive step: counting itemset support. Let l denote the average length of each transaction. This step costs

$$|T| \left( \binom{l}{2} + \binom{l}{3} + \cdots + \binom{l}{K_\alpha} \right), \quad \text{or} \quad |T| \sum_{i=2}^{K_\alpha} \binom{l}{i} \text{ in brief.}$$

Finally, the total cost consumed by the Dwnsearch_on procedure equals

$$\alpha |T| + |T| \sum_{i=2}^{K_\alpha} \binom{l}{i}.$$

3. Execute procedure Upsearch illustrated in Figure 5.6. To minimize the I/O cost and avoid combinatorial decomposition, the Upsearch procedure first transforms the transaction data into a vertical data format called transaction-id lists, then utilizes this structure to count the supports of itemsets. The cost lies in three main steps.

(1) Data transformation. This requires one data scan, costing $\alpha |T|$.

(2) Candidate generation. The dominant operation is itemset join. If the largest itemset cardinality is $K_{max}$, this task consumes at most

$$\sum_{k=K_\alpha+1}^{K_{max}} \binom{|F_{k-1}|}{2}.$$

(3) Counting candidate support. For each k-itemset, counting involves k-1 bit-vector intersections and one bit-vector accumulation. Summing this cost over all candidate itemsets, we have

$$\sum_{i=K_\alpha+1}^{K_{max}} |C_i| \cdot i \cdot |T|.$$

Finally, the total cost for procedure Upsearch is

$$\alpha |T| + \sum_{i=K_\alpha+1}^{K_{max}} \left( |C_i| \cdot i \cdot |T| + \binom{|F_{i-1}|}{2} \right).$$

Combining all of the analysis, we have

$$\alpha \left( |D_X| + |D_M| + 2|T| \right) + |T| \sum_{i=2}^{K_\alpha} \binom{l}{i} + \sum_{i=K_\alpha+1}^{K_{max}} \left( |C_i| \cdot i \cdot |T| + \binom{|F_{i-1}|}{2} \right) + |F_{K_\alpha}| \, 3^{K_\alpha}.$$

5.4 Cost Evaluation for Case 3

In this case, we must generate table T according to the user's query, which costs $|D| \log |D|$. After this, the CBW_off algorithm shown in Figure 5.8 is performed. It can be observed that, except for step 1, the steps employed by CBW_off are quite similar to those of CBW_on in Case 2. Since step 1 costs $\alpha |T| + K_\alpha |T|$, the total cost for this case is

$$|D| \log |D| + 3\alpha |T| + K_\alpha |T| + |T| \sum_{i=2}^{K_\alpha} \binom{l}{i} + \sum_{i=K_\alpha+1}^{K_{max}} \left( |C_i| \cdot i \cdot |T| + \binom{|F_{i-1}|}{2} \right) + |F_{K_\alpha}| \, 3^{K_\alpha}.$$

Algorithm CBW_off(T, prims)
Input: Table T and prims;
Output: The set of frequent itemsets F;
1 scan T to compute K_α and generate all frequent 1-itemsets F_1;
2 DF = Dwnsearch(T, K_α, F_1, prims);
3 UF = Upsearch(DF, prims);
4 return F = DF ∪ UF;

Figure 5.8. Algorithm CBW_off

Procedure Dwnsearch
    for i = 1 to |D| do
        scan the i-th transaction t_i;
        delete the items in t_i that are not in F_1;
        for each subset X of t_i and 2 ≤ |X| ≤ K_α do
            sup(X)++;
        end for
    end for
    store all X in Auxiliary cube for |X| = K_α and sup(X) < prims;
    DF = {X | sup(X) ≥ prims};

Figure 5.9. Procedure Dwnsearch

To sum up, we list the cost functions for the three cases below:

Case 1:

$$|F_{K_\alpha}| \, 3^{K_\alpha} + \alpha |D_M|.$$

Case 2:

$$\alpha \left( |D_X| + |D_M| + 2|T| \right) + |T| \sum_{i=2}^{K_\alpha} \binom{l}{i} + \sum_{i=K_\alpha+1}^{K_{max}} \left( |C_i| \cdot i \cdot |T| + \binom{|F_{i-1}|}{2} \right) + |F_{K_\alpha}| \, 3^{K_\alpha}.$$

Case 3:

$$|D| \log |D| + 3\alpha |T| + K_\alpha |T| + |T| \sum_{i=2}^{K_\alpha} \binom{l}{i} + \sum_{i=K_\alpha+1}^{K_{max}} \left( |C_i| \cdot i \cdot |T| + \binom{|F_{i-1}|}{2} \right) + |F_{K_\alpha}| \, 3^{K_\alpha}.$$
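The three cost functions can be turned into a small calculator. This is a sketch under assumptions, not the thesis implementation: the parameter dictionary layout is hypothetical, every numeric value is made up for illustration, and the logarithm is taken to base 2 because the text does not state a base.

```python
from math import comb, log2


def rule_generation(p):
    """|F_{K_alpha}| * 3^{K_alpha}: simplified rule-generation cost."""
    return p["F"][p["K_alpha"]] * 3 ** p["K_alpha"]


def downward_count(p):
    """|T| * sum_{i=2}^{K_alpha} C(l, i): support counting in Dwnsearch."""
    return p["T"] * sum(comb(p["l"], i) for i in range(2, p["K_alpha"] + 1))


def upward_work(p):
    """sum_{i=K_alpha+1}^{K_max} (|C_i| * i * |T| + C(|F_{i-1}|, 2))."""
    return sum(p["C"][i] * i * p["T"] + comb(p["F"][i - 1], 2)
               for i in range(p["K_alpha"] + 1, p["K_max"] + 1))


def cost_case1(p):
    return rule_generation(p) + p["alpha"] * p["D_M"]


def cost_case2(p):
    io = p["alpha"] * (p["D_X"] + p["D_M"] + 2 * p["T"])
    return io + downward_count(p) + upward_work(p) + rule_generation(p)


def cost_case3(p):
    build_table = p["D"] * log2(p["D"])  # assumption: log base 2
    first_scan = p["alpha"] * p["T"] + p["K_alpha"] * p["T"]
    io = 2 * p["alpha"] * p["T"]         # Dwnsearch and Upsearch scans
    return (build_table + first_scan + io + downward_count(p)
            + upward_work(p) + rule_generation(p))


# Hypothetical parameter values, chosen only to exercise the formulas.
params = {"alpha": 2, "K_alpha": 2, "K_max": 3, "l": 3, "T": 10, "D": 16,
          "D_M": 5, "D_X": 4, "F": {2: 3}, "C": {3: 1}}

costs = (cost_case1(params), cost_case2(params), cost_case3(params))
```

With these made-up numbers the ordering Case 1 < Case 2 < Case 3 emerges, which is the intuition behind preferring a materialized OLAM cube whenever one qualifies.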

Chapter 6 OLAM Cube Selection Methods

In this chapter, we describe three typical heuristic algorithms proposed for the OLAP cube selection problem, and elaborate how to modify each method and combine it with the cost models of the last chapter to select the most suitable OLAM cubes. The methods include the forward greedy selection (FGS) method proposed by Harinarayan et al. [9], the pick by size (PBS) selection method proposed by Shukla et al. [17], and the backward greedy selection (BGS) method proposed by Lin and Kuo [13].

6.1 Forward Greedy Selection Method (FGS)

The forward greedy selection method was proposed by Harinarayan et al. [19]. A greedy algorithm chooses the locally optimal option at each step, subject to the given constraint. For this purpose, we define a benefit function $B(d_i, M)$ as follows:

$$B(d_i, M) = \frac{1}{|d_i|} \sum_{q \in Q} \left( E(q, M) - E(q, M \cup \{d_i\}) \right) \qquad (6.1)$$

We use this benefit function to compute the benefit of every unselected OLAM cube, and apply forward selection to choose the most suitable OLAM cubes to materialize one by one, starting from the empty set, until no further cube can be added. The forward selection method is described below:

Algorithm 1. Forward greedy selection (FGS)
Step 0. Let M = φ.
Step 1. While $\sum_{d \in M} |d| < S$, repeat Step 2 to Step 5.
Step 2. According to equation (6.1), calculate the benefit of all unselected OLAM cubes $d_i$, for $1 \le i \le n$ and $d_i \notin M$.
Step 3. Select the OLAM cube with the maximal benefit according to the results of Step 2, and set it as $d_j$.
Step 4. $M \leftarrow M \cup \{d_j\}$.
Step 5. Go to Step 1.

Figure 6.1. Forward Greedy Selection Method

Example 6.1. Suppose that we select three attributes, city c, education e, and date d, from the sales star schema illustrated in Figure 2.3. Figure 6.2 depicts all possible OLAM cubes formed with these three attributes as well as their dependencies, where all OLAM cubes with the same transaction attributes $t_G$ are packed into a metacube. The dotted lines between metacubes are for clarification only; they complete the lattice structure of metacubes in terms of $t_G$. Note that according to Proposition 4.1, the dependency exists only among OLAM cubes within the same metacube. For simplicity, let us consider how to select the most suitable OLAM cubes to materialize from three OLAM cubes, ced*, cd*, and ed*, under a space constraint. The symbols used in this example are shown in Table 6.1, and the required parameter settings are shown in Table 6.2. In addition, we assume that the base relation size is 64 and prims is 3. Table 6.3 shows the first two selection steps using FGS.

Table 6.1. The symbols used in cost model

Symbol        Definition
$t_G$         the set of transaction attributes
$t_M$         the set of mining attributes
$\alpha$      I/O to computation ratio
$K_\alpha$    the cardinality of maximal frequent itemset
$K_{max}$     the cardinality of the largest itemset
$|C_i|$       number of candidate i-itemsets
l             average length of each transaction
$|F_i|$       number of frequent i-itemsets
$|D_M|$       size of OLAM cube
$|D_X|$       size of auxiliary cube
f             frequency of OLAM cube
$|T|$         size of the table composed of attributes $t_G$ and $t_M$
$|D|$         size of base relation

Table 6.2. The required parameter settings

Columns: subcubes, $\alpha$, $K_\alpha$, $K_{max}$, $|C_3|$, $|C_4|$, l, $|F_2|$, $|F_3|$, $|D_M|$, $|D_X|$, $|T|$, minsup, f
Rows: d*ce, d*c, d*e
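The forward greedy loop of Algorithm 1, driven by the benefit function of equation (6.1), can be sketched as follows. Several points are assumptions made for the sketch: only cubes that still fit in the remaining space are considered, selection stops when no candidate has positive benefit, and the response-time function `toy_E`, the cube sizes, and the `answers` map are hypothetical stand-ins for the cost models of Chapter 5.

```python
def benefit(d, M, size, queries, E):
    """Equation (6.1): total query-time saving from adding d, per unit of space."""
    return sum(E(q, M) - E(q, M | {d}) for q in queries) / size[d]


def forward_greedy(cubes, size, S, queries, E):
    """Add the cube with maximal benefit until nothing useful fits in S."""
    M, used, order = frozenset(), 0, []
    while used < S:
        fitting = [d for d in cubes if d not in M and used + size[d] <= S]
        if not fitting:
            break
        best = max(fitting, key=lambda d: benefit(d, M, size, queries, E))
        if benefit(best, M, size, queries, E) <= 0:
            break  # assumed stopping rule: no remaining cube helps
        M, used = M | {best}, used + size[best]
        order.append(best)
    return order


# Hypothetical toy instance.
size = {"d1": 4, "d2": 3, "d3": 2}
answers = {"q1": {"d1"}, "q2": {"d2", "d3"}, "q3": {"d2"}}
BASE_COST = 10  # cost of answering from the data warehouse


def toy_E(q, M):
    usable = [size[d] for d in M if d in answers[q]]
    return min(usable) if usable else BASE_COST


selection = forward_greedy(["d1", "d2", "d3"], size, 5, ["q1", "q2", "q3"], toy_E)
```

In this toy run `d2` is chosen first because it serves two queries, `d3` follows because it still shaves a little off `q2`, and `d1` is left out for lack of space.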
