Data warehousing and data mining are both popular technologies in recent years.
- Whitney Bridges
Chapter 1 Introduction

Data warehousing and data mining have both become popular technologies in recent years. A data warehouse is an information infrastructure that integrates different data sources into a consistent repository; through OLAP (On-Line Analytical Processing) tools, business managers can analyze these data from various perspectives to discover information that is valuable for strategic decisions. Data mining, on the other hand, is the exploration and analysis of data, automatically or semi-automatically, to discover meaningful patterns and rules. From the business viewpoint, integrating these two technologies allows a corporation to understand its customers' behaviors and to use this information to gain a competitive advantage in the market.

Among the various patterns of interest to the data mining research community, association rules have attracted great attention recently. An association rule is a rule of the form A => B (sup = s%, conf = c%), which reveals the co-occurrence of two itemsets A and B. An example is PC => Laser Printer (sup = 30%, conf = 80%), which means that 30% of customers buy a PC and a Laser Printer together, and 80% of the customers who buy a PC also buy a Laser Printer.

Mining association rules from a large database is a data- and computation-intensive task. To reduce the complexity of association mining, researchers have proposed the
concept of integrating data warehousing systems with association mining algorithms. For example, the DBMiner system [22], developed by J. Han and his research team, adopts an OLAP-based association mining approach. A similar paradigm was presented in [22]. The primary problem of the OLAP-based approach is that the OLAP data cube is not well suited to on-line association mining: considerable effort is still required to complete the task. As such, Lin et al. [15] proposed the concept of the OLAM (On-Line Association Mining) cube, an extension of the iceberg cube [3] used to store frequent multidimensional itemsets. They also proposed a framework for an on-line multidimensional association rule mining system, called OMARS, which provides users with an environment for executing OLAP-like queries to mine association rules from data warehouses efficiently.

This thesis is a companion work toward the implementation of OMARS. In particular, it addresses the problem of selecting appropriate OLAM cubes to materialize and store in OMARS. In accordance with the mining algorithms proposed in OMARS, a suitable model for evaluating the cost of selecting data cubes to materialize is also developed.

1.1 Contributions

The main contributions of this thesis are as follows:
1. We exploit the dependency between OLAM cubes with regard to association queries, thereby devising the structure of the OLAM lattice.
2. We develop a model for evaluating the cost of answering association queries using materialized OLAM cubes, which is a preliminary step for
OLAM cube selection.
3. We modify and implement some state-of-the-art heuristic algorithms, and compare these algorithms to evaluate their effectiveness.

1.2 Thesis Organization

This thesis is organized as follows. Chapter 2 describes past research and related work on data warehousing and data mining technologies. Chapter 3 briefly describes the OMARS framework. Chapter 4 formulates our OLAM cube selection problem. The algorithm analysis and cost model are described in Chapter 5. Chapter 6 explains our algorithms, and Chapter 7 presents the experimental results of this research. Finally, Chapter 8 concludes our work and points out some future research directions.
Chapter 2 Background and Related Work

2.1 Data Warehouse and OLAP

2.1.1 Data Warehouse

As coined by W. H. Inmon, the term data warehouse refers to a "subject-oriented, integrated, time-variant and nonvolatile collection of data in support of management's decision-making process" [11]. In this regard, a data warehouse is a database dedicated to supporting decision making. According to the demands of analysts, data from different databases are extracted and transformed into the data warehouse. When users execute queries, the system only needs to search the data warehouse instead of the source databases, which saves considerable query processing time.

A data warehouse system is composed of three primary parts:
1. The source databases in the back end: The data are collected from various sources, internal or external, legacy or operational, and any change to these sources is continually monitored by several modules called
monitors/wrappers.
2. The data warehouse and data marts in the core: The reconciled data are stored in the data warehouse and data marts, which form the central repository for the whole system.
3. The analysis tools in the front end: The analysis tools supported in the front end are usually OLAP, query/tabulation tools, and data mining software.

The typical structure of a data warehouse is illustrated in Figure 2.1.

Figure 2.1. A typical architecture of a data warehouse [11] (components shown: operational databases and external sources; monitors/wrappers; extract, clean, transform, load and refresh; data warehouse, data marts and metadata; OLAP servers; analysis, query/reporting and data mining tools; monitoring and administration)

2.1.2 On-Line Analytical Processing (OLAP)

Although the data stored in a data warehouse have been cleaned, filtered, and integrated, transforming them into useful strategic information still requires much time owing to the massive amount of data involved. The concept of On-Line Analytical Processing (OLAP) [4] refers to the process of creating and managing multidimensional data for analysis and visualization. To provide fast
and multidimensional analysis of data in a data warehouse, the OLAP tool precomputes aggregations over the data and organizes the result as a data cube composed of several dimensions, each representing one of the user's analysis perspectives.

The typical operations provided by OLAP include roll-up, drill-down, slice and dice, and pivot [8]. The roll-up operation performs aggregation on a data cube, either by climbing up a concept hierarchy for a dimension or by dimension reduction. Drill-down is the reverse of roll-up: it navigates from less detailed data to more detailed data. The slice operation performs a selection on one dimension of the given cube, resulting in a subcube, while the dice operation defines a subcube by performing a selection on two or more dimensions. The pivot operation, also called rotate, is a visualization operation that rotates the data axes in view in order to provide an alternative presentation of the data. These OLAP operations are illustrated in Figure 2.2.

2.2 Data Warehouse Data Model

Because data warehouse systems require a concise, subject-oriented schema that facilitates on-line data analysis, the entity-relationship data model generally used in relational database systems is not suitable for them. Instead, the most popular data model for a data warehouse is the multidimensional data model. Two common relational models that facilitate multidimensional analysis are the star schema and the snowflake schema.
Figure 2.2. The typical operations of OLAP (panels: slice, dice, pivot, roll-up and drill-down, shown on a cube with Product, Customer and Supplier dimensions)

2.2.1 Star Schema

The star schema, proposed by Kimball [12], is the most popular dimensional model used in the data warehouse community. A star schema consists of a fact table and several dimension tables. The fact table stores a list of foreign keys that correspond to the dimension tables, together with the numeric measures of interest to users. Each dimension table contains a set of attributes. Moreover, the attributes within a dimension table may form either a hierarchy (total order) or a lattice (partial order). An example of a star schema is depicted in Figure 2.3, and its schema hierarchy is illustrated in Figure 2.4.
Figure 2.3. An example of star schema for sales (fact table Sales: Customer_ID, Time_ID, Product_ID, Quantity; dimension tables Customer: Customer_ID, City, Education; Time: Time_ID, Date, Month; Product: Product_ID, Category)

Figure 2.4. An example of schema hierarchy for the sales star schema (hierarchies shown: Customer_ID to City to All, and Customer_ID to Education to All; Time_ID to Date to Month to All; Product_ID to Category to All)
2.2.2 Snowflake Data Model

The snowflake schema is a variant of the star schema model in which some dimension tables are normalized, thereby further splitting the data into additional individual and hierarchical tables. An example of the snowflake data model is depicted in Figure 2.5.

Figure 2.5. An example of snowflake schema for sales (fact table Sales: Customer_ID, Time_ID, Product_ID, Quantity; dimension tables Customer: Customer_ID, City, Education; Time: Time_ID, MID, Date; Month: MID, Month; Product: Product_ID, Category)

The major difference between the snowflake schema and the star schema is that the dimension tables of the snowflake model may be kept in normalized form to reduce redundancy. This makes the tables easier to maintain and saves storage space compared with the star schema. On the other hand, the star schema can integrate schema hierarchies into a single dimension table, so no join operation is incurred during hierarchical traversal of the dimensions. Hence, the star schema data
model is more popular than the snowflake schema data model.

2.3 Association Rule Mining

2.3.1 Association Rules

Association rule mining is one of the prominent activities in the data mining community. The idea of association rule mining is to search for interesting relationships among items in a given data set. For example, the information that customers who purchase diapers also tend to buy beer at the same time is represented by the association rule below:

Diaper => Beer [sup = 2%, conf = 60%]

Rule support and confidence are two measures of rule interestingness. A support of 2% means that 2% of customers purchase diapers and beer together. A confidence of 60% means that 60% of the customers who purchase diapers also buy beer. Typically, an association rule is considered interesting if it satisfies a minimum support threshold and a minimum confidence threshold, both set by users or domain experts.

The process of association rule mining can be divided into two steps:
1. Frequent itemset generation: In this step, all itemsets whose support satisfies the minimum support threshold are discovered.
2. Rule construction: After all frequent itemsets have been generated, the rules derived from them whose confidence satisfies the minimum confidence threshold are output as association rules.
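The two steps above can be illustrated with a minimal Python sketch. The transactions, the thresholds, and the function names (`frequent_itemsets`, `rules`) are hypothetical and chosen only for illustration; the search is a simple level-wise enumeration rather than any particular published algorithm, and "satisfies the threshold" is assumed to mean "greater than or equal to".

```python
from itertools import combinations

# Hypothetical market-basket data: each set is one customer's purchase.
transactions = [
    {"diaper", "beer", "milk"},
    {"diaper", "beer"},
    {"diaper", "bread"},
    {"beer", "chips"},
    {"diaper", "beer", "chips"},
]

def count(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    return sum(itemset <= t for t in transactions)

def frequent_itemsets(transactions, min_sup):
    """Step 1: level-wise search; returns {itemset: support count}."""
    n = len(transactions)
    items = sorted({i for t in transactions for i in t})
    level = [frozenset([i]) for i in items]
    frequent = {}
    while level:
        kept = [s for s in level if count(s, transactions) / n >= min_sup]
        for s in kept:
            frequent[s] = count(s, transactions)
        # Candidates of size k+1 are unions of two frequent k-itemsets.
        level = list({a | b for a in kept for b in kept
                      if len(a | b) == len(a) + 1})
    return frequent

def rules(frequent, n, min_conf):
    """Step 2: split each frequent itemset into lhs => rhs and keep confident rules."""
    out = []
    for itemset, cnt in frequent.items():
        for r in range(1, len(itemset)):
            for lhs in map(frozenset, combinations(sorted(itemset), r)):
                conf = cnt / frequent[lhs]  # every subset of a frequent itemset is frequent
                if conf >= min_conf:
                    out.append((lhs, itemset - lhs, cnt / n, conf))
    return out

freq = frequent_itemsets(transactions, min_sup=0.4)
for lhs, rhs, sup, conf in rules(freq, len(transactions), min_conf=0.7):
    print(sorted(lhs), "=>", sorted(rhs), f"sup={sup:.0%} conf={conf:.0%}")
```

Support counts are kept as integers so that confidence is computed from exact counts; fractions are formed only when compared with the thresholds.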
The most popular and influential association mining algorithm is Apriori [2], which uses prior knowledge of frequent k-itemsets to generate candidate (k+1)-itemsets. When the maximum length of frequent itemsets is l, Apriori needs l passes over the database. Since the Apriori algorithm spends much time generating the candidate itemsets and counting the support of each itemset, many variant algorithms have been proposed to improve the efficiency of the mining process.

2.3.2 Multi-dimensional Association Rules

The concept of multi-dimensional association rules was first proposed by H. Zhu [22]. It is used to describe associations between data values from a data warehouse, where the data schema is composed of multiple dimensions and each dimension may contain many attributes. Following the work in [22], we can divide multi-dimensional association rules into three different types:
1. Inter-dimensional association rule: This is an association among a set of dimensions. For example, suppose an OLAP cube is composed of three dimensions, Product, Supplier and Customer, and its data are listed in Table 2.1. An inter-dimensional association rule is:
   Supplier("Hong Kong"), Product("Sport Wear") => Customer("John")
2. Intra-dimensional association rule: This is an association among items coming from one dimension. From Table 2.1, a possible intra-dimensional association rule is:
   Product("Sport Wear") => Product("Tents")
3. Hybrid association rule: This is an association among a set of dimensions in which some items in the rule come from the same dimension. It can be regarded as a combination of inter-dimensional and intra-dimensional associations. According to Table 2.1, a hybrid association rule is:
   Product("Sport Wear"), Supplier("Hong Kong") => Product("Tents")

Table 2.1. A relational representation of OLAP cube

Supplier    Product          Customer   Count
HongKong    Sport Wear       John
HongKong    Sport Wear       Mary
HongKong    Water Purifier   John
Mexico      Alert Devices    Peter
Mexico      Carry Bags       Peter
Mexico      Carry Bags       Bill
Mexico      Tents            Sue
Mexico      Tents            Mary
Seattle     Carry Bags       John
Seattle     Sport Wear       Peter
Seattle     Sport Wear       John
Seattle     Water Purifier   Bill
Tokyo       Carry Bags       Sue
Tokyo       Sport Wear       Bill
Tokyo       Tents            Sue
Tokyo       Alert Devices    John

2.4 Related Work

2.4.1 Data Cube
The concept of the data cube was first proposed by Gray et al. [6]. It allows analysts to view the data stored in a data warehouse from various aspects and to perform multidimensional analysis. Each cell in a data cube represents a measured value. For example, consider a sales data cube with three dimensions, Product, Supplier and Customer, and one measure value, Sales_total. This cube is depicted in Figure 2.6 and can be expressed as an SQL query as follows:

    SELECT Product, Supplier, Customer, SUM(Sales) AS Sales_total
    FROM Sales_Fact
    GROUP BY Product, Supplier, Customer;

Figure 2.6. An example of data cube (dimensions: Product p1 to p6, Supplier s1 to s4, Customer c1 to c4)

2.4.2 Cube Selection Problem

In order to accelerate query processing, it is important to select the most suitable cubes to materialize. In general, there are three options for selecting the cubes to materialize:
1. Materialize all data cubes: This method gives the lowest query time but
needs the largest storage space, because every cube has to be materialized.
2. Materialize nothing: This method needs the least storage space but the longest query time, because no cube is materialized.
3. Materialize a fraction of the data cubes: This method selects a part of the data cubes to materialize. However, selecting the most suitable cubes to materialize under a space constraint is difficult; indeed, it has been proved to be an NP-hard problem [9].

From the viewpoint of query time alone, the best choice is to materialize all data cubes, but the space limit of the data warehouse usually prevents this. On the other hand, if we materialize nothing, queries cost too much time. Therefore, we should try to select the most suitable cubes to materialize, even though this is an NP-hard problem. The literature contains a substantial body of work on this problem, which can be classified into three main categories:
1. Heuristic method: This category is mainly based on the greedy paradigm. Harinarayan et al. [9] were the first to consider the problem of materialized view selection for supporting multidimensional analysis in OLAP. They proposed a lattice model and provided a greedy algorithm to solve this problem. Gupta et al. [7] further extended their work to include index selection. Ezeife [5] also considered the same problem but proposed a uniform approach using a more detailed cost model. Shukla et al. [17] proposed a modified greedy algorithm that selects cubes according to cube size only. Their algorithm was shown to have the same quality as Harinarayan's greedy method while being more efficient.
2. Exhaustive method: The work in [19] supposed that all queries should be
answered solely by the materialized views, with or without rewriting the users' queries. They modeled the problem as a state space optimization problem, and provided exhaustive and heuristic algorithms without concern for the storage constraint. Soutyrina and Fotouhi [18] proposed a dynamic programming algorithm to solve the problem, which can yield the optimal set of cubes.
3. Genetic method: Some work has been devoted to applying genetic algorithms to the view selection problem [10, 20, 21]. Following the AND-OR view graph used in [7], Horng et al. [10] proposed a genetic algorithm to select the appropriate set of views so as to minimize the query cost and view maintenance cost. A similar genetic algorithm with a different repair scheme is proposed in [13], which uses a greedy repair method to correct infeasible solutions instead of using a penalty function to reduce their fitness. Research has shown that the repair scheme deals with infeasible solutions better than the penalty function does [16]. Rather than optimizing the view selection for a given query processing plan, the work in [20, 21] focuses on finding an optimal set of processing plans for multiple queries. A solution in their genetic algorithm thus represents a set of processing plans for the given queries.
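To make the greedy paradigm of the heuristic category more concrete, the following Python sketch shows one common formulation of benefit-based greedy view selection over a lattice. It is only an illustration under stated assumptions: the lattice, the view sizes, the function name `greedy_select`, and the linear cost model (answering a query costs the size of the smallest materialized view it can be computed from) are hypothetical choices made here, not details taken from the cited papers or from OMARS.

```python
def greedy_select(sizes, children, top, k):
    """Greedily pick k views (besides `top`) with the largest benefit.

    sizes:    {view: number of rows}
    children: {view: views directly computable from it}
    top:      the base view, always materialized
    """
    def descendants(v):
        # Every view that can be answered from v, including v itself.
        seen, stack = {v}, [v]
        while stack:
            for c in children.get(stack.pop(), ()):
                if c not in seen:
                    seen.add(c)
                    stack.append(c)
        return seen

    desc = {v: descendants(v) for v in sizes}
    # Current cost of answering each view: initially everything uses `top`.
    cost = {v: sizes[top] for v in sizes}
    selected = [top]
    for _ in range(k):
        def benefit(v):
            # Total saving over all views that v could answer more cheaply.
            return sum(max(0, cost[w] - sizes[v]) for w in desc[v])
        best = max((v for v in sizes if v not in selected), key=benefit)
        selected.append(best)
        for w in desc[best]:
            cost[w] = min(cost[w], sizes[best])
    return selected

# Hypothetical eight-view lattice with row counts.
sizes = {"a": 100, "b": 50, "c": 75, "d": 20, "e": 30, "f": 40, "g": 1, "h": 10}
children = {"a": ["b", "c"], "b": ["d", "e"], "c": ["e", "f"],
            "d": ["g"], "e": ["g", "h"], "f": ["h"]}
print(greedy_select(sizes, children, top="a", k=3))
```

Each round recomputes the benefit of every remaining view against the views already chosen, which is why the second and third picks differ from a simple ranking by first-round benefit.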
Chapter 3 The OMARS Framework

In this chapter, we give a brief review of the OMARS framework, because our research deals with the problem of selecting the most suitable OLAM cubes to materialize in this system. The OMARS framework, illustrated in Figure 3.1, integrates the data warehouse, on-line analytical processing, and the OLAM cube. Its objective is to provide an efficient and convenient platform that allows users to perform OLAP-like association explorations. Through the OMARS system, users can issue multidimensional association mining queries, interactively change the dimensions that comprise the associations, and refine constraints such as minimum support and minimum confidence. The functionality of each component is described in the following sections.

Figure 3.1. The OMARS framework [15] (components shown: Data Warehouse, OLAP Cube, Auxiliary Cube, OLAM Cube, Cube Manager, OLAM Mediator, OLAM Engine)
3.1 OLAM Cube and Auxiliary Cube

The OLAM cube is a concept proposed by Lin et al. [15]. It is used to store the frequent itemsets whose supports are greater than or equal to a preset minimum support, denoted as prims. In this regard, the OLAM cube can be regarded as an extension of the iceberg cube. The main difference is that the iceberg cube stores the frequent itemsets derived from inter-dimensional associations only, while the OLAM cube is applicable to all three kinds of associations. When the minsup of a user's query is greater than or equal to prims, the OLAM cube accelerates the mining of association rules, because it already stores the frequent itemsets whose supports are greater than or equal to prims.

Although the OLAM cube can be used to generate association rules efficiently when minsup is greater than or equal to prims, it cannot handle the situation where minsup is lower than prims. To alleviate this problem, the OMARS system embraces another type of data cube, called the auxiliary cube. The auxiliary cube stores the infrequent itemsets of length K_α, where K_α denotes the cutting level employed by the mining algorithm CBW_on used in OMARS.

3.2 Cube Manager

This component is responsible for three different tasks:
1. Cube selection: This refers to selecting the most appropriate cubes to materialize, in order to minimize the query cost and/or maintenance cost under the constraint of limited storage space.
2. Cube computation: This task deals with efficiently generating the set of materialized cubes produced by the cube selection
module.
3. Cube maintenance: This task concerns how to maintain the materialized cubes when the data in the data warehouse are updated.

Our research in this thesis deals with the implementation of the cube selection task of the Cube Manager. We will discuss this in the next chapter.

3.3 OLAM Mediator and OLAM Engine

OLAM Engine is the interface between the OMARS system and the users. It accepts users' queries and invokes the appropriate algorithm to mine multidimensional association rules. When OLAM Engine receives a user's query, it analyzes the query and forwards the relevant information to OLAM Mediator, which then looks for the most relevant cube and returns the result to OLAM Engine. Here the most relevant cube denotes the materialized OLAM cube that can answer the query at the smallest cost. There are two possible search results returned by OLAM Mediator, and each is handled in a different way.
1. OLAM Mediator finds the most relevant cube: In this case, OLAM Mediator further compares the minsup of the user's query with prims, and proceeds according to the following two cases:
   i. minsup ≥ prims: The discovered OLAM cube is capable of answering the query. Return this cube to OLAM Engine.
   ii. minsup < prims: The discovered OLAM cube cannot answer the query without the aid of the auxiliary cube. Return the OLAM cube and its accompanying auxiliary cube to OLAM Engine.
2. OLAM Mediator cannot find such a cube: In this case, OLAM Mediator searches the OLAP cube repository to determine whether there is an OLAP cube whose data can be used to answer the query. If so, it returns the discovered OLAP cube to OLAM Engine; otherwise, it notifies OLAM Engine to execute the mining procedure on the data warehouse from scratch.

We will discuss the above cases in more detail and develop the cost evaluation of each case in Chapter 5.
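The case analysis performed by OLAM Mediator can be summarized as a small decision procedure. The Python sketch below is only an illustration of the branching described above; the `Plan` type, the function name `mediate`, the string labels, and the cube names are hypothetical and do not come from the OMARS implementation.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Plan:
    source: str           # "olam", "olam+auxiliary", "olap" or "warehouse"
    cube: Optional[str]   # name of the cube handed to OLAM Engine, if any

def mediate(minsup: float, prims: float,
            olam_cube: Optional[str] = None,
            olap_cube: Optional[str] = None) -> Plan:
    """Choose the data source used to answer an association query.

    olam_cube: most relevant materialized OLAM cube, or None if none exists
    olap_cube: an OLAP cube whose data can answer the query, or None
    """
    if olam_cube is not None:
        if minsup >= prims:
            # Case 1.i: the OLAM cube alone answers the query.
            return Plan("olam", olam_cube)
        # Case 1.ii: the auxiliary cube is needed as well.
        return Plan("olam+auxiliary", olam_cube)
    if olap_cube is not None:
        # Case 2, first branch: fall back to an OLAP cube.
        return Plan("olap", olap_cube)
    # Case 2, second branch: mine from the data warehouse.
    return Plan("warehouse", None)

print(mediate(0.05, 0.02, olam_cube="MCube({City}, {Category})"))
print(mediate(0.01, 0.02, olam_cube="MCube({City}, {Category})"))
print(mediate(0.05, 0.02))
```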
Chapter 4 Problem Formulation

In this chapter, we first elaborate on the correspondence between OLAM queries and OLAM cubes, and describe the concept of the OLAM lattice. After this, we define the problem of OLAM cube selection.

4.1 OLAM Cube and OLAM Query

As described in Chapter 3, an OLAM cube is used to store frequent itemsets, with the aim of accelerating the mining of association rules. To clarify the structure of the OLAM cube and its relationship to multidimensional associations, we first introduce a four-tuple mining meta-pattern that specifies the form of a multidimensional association query. The definition is as follows:

Definition 4.1. Suppose a star schema S contains a fact table and m dimension tables $\{D_1, D_2, \ldots, D_m\}$. Let T be a joined table from S composed of attributes $a_1, a_2, \ldots, a_k$, such that for all $a_i, a_j \in Attr(D_k)$ there is no hierarchical relation between $a_i$ and $a_j$, $1 \le i, j \le r$, $1 \le k \le m$. Here $Attr(D_k)$ denotes the attribute set of dimension table $D_k$. A meta-pattern of multidimensional associations from T is defined as follows:
MP: $\langle t_G, t_M, ms, mc \rangle$,

where ms denotes the minimum support, mc the minimum confidence, $t_G$ the group of transaction attributes, and $t_M$ the group of item attributes, with $t_G, t_M \subseteq \{a_1, a_2, \ldots, a_k\}$ and $t_G \cap t_M = \emptyset$.

This meta-pattern specification of multidimensional association queries can express the three different kinds of multidimensional association rules defined in [22]: intra-association, inter-association, and hybrid association. For example, consider a joined table T involving three dimensions from the star schema in Figure 2.3. The content of T is shown in Table 4.1. If the item attribute set $t_M$ consists of only one attribute, then the meta-pattern corresponds to an intra-association.

Table 4.1. A joined table T from the star schema

Tid   City      Education     Date   Month   Product_ID   Category
1     Taipei    Bachelor      7/12   July    1            A
2     Taipei    High school   7/12   July    2            A
3     N.Y.      Master        7/18   July    1            A
4     Toronto   Master        8/2    Aug.    3            B
5     Seattle   Master        8/3    Aug.    4            B
6     N.Y.      High school   8/2    Aug.    1            A
7     Toronto   High school   7/4    July    1            A
8     Seattle   Bachelor      7/18   July    5            C
9     Taipei    Bachelor      8/2    Aug.    2            A
10    N.Y.      Bachelor      9/1    Sep.    3            B

For instance, let $t_G$ = {City} and $t_M$ = {Category}. We may have the following intra-association rule:
(Category, "A") => (Category, "B") (sup = 40%, conf = 80%)

Note that to facilitate this mining task, the table T has to be, implicitly or explicitly, transformed into a transaction table as follows:

City      Category
Taipei    A
N.Y.      A, B
Toronto   A, B
Seattle   B, C

On the other hand, if $|t_M| \ge 2$, then the resulting associations will be inter-associations or hybrid associations. For example, let $t_G = \emptyset$ and $t_M$ = {Education, Month}. We have an inter-association:

(Education, "Master") => (Month, "July") (sup = 40%, conf = 80%)

As with the intra-association, the table T has to be transformed into the following form:

Tid   Education     Month
1     Bachelor      July
2     High school   July
3     Master        July
4     Master        Aug.
5     Master        Aug.
6     High school   Aug.
7     High school   July
8     Bachelor      July
9     Bachelor      Aug.
10    Bachelor      Sep.

Note that in this case each tuple of the original table T forms one transaction. But if $t_G$ = {City}, we will have a hybrid association:

(Education, "Master"), (Month, "July") =>
(Month, "Aug.") (sup = 40%, conf = 80%)

For this case, the transformed table will be:

City      Education                       Month
Taipei    Bachelor, High school           July, Aug.
N.Y.      Master, High school, Bachelor   July, Aug., Sep.
Toronto   Master, High school             Aug., July
Seattle   Master, Bachelor                Aug., July

After explaining the mining patterns, we now clarify the structure of the OLAM cube.

Definition 4.2. Given a meta-pattern MP with transaction attribute set $t_G$ and item attribute set $t_M$, and a preset minsup, prims, the corresponding OLAM cube, $MCube(t_G, t_M)$, is the set of frequent itemsets with supports greater than or equal to prims.

The following examples illustrate the corresponding OLAM cube for different kinds of multidimensional association rules.

Example 4.1. An intra-dimensional OLAM cube: Let $t_G$ = {City}, $t_M$ = {Category}, and prims = 2. From Table 4.1, the resulting OLAM cube is shown in Table 4.2.

Table 4.2. An example of intra-dimensional OLAM cube expressed in table

Category   Support
A          3
B          3
A, B       2

Example 4.2. An inter-dimensional OLAM cube: Let $t_G = \emptyset$, $t_M$ = {Education, Month}, and prims = 2. From Table 4.1, the resulting OLAM cube is shown in
Table 4.3.

Table 4.3. An example of inter-dimensional OLAM cube expressed in table

Education     Month   Support
Bachelor      -       4
High school   -       3
Master        -       3
-             July    5
-             Aug.    4
Bachelor      July    2
High school   July    2
Master        Aug.    2

Example 4.3. A hybrid-dimensional OLAM cube: Let $t_G$ = {City}, $t_M$ = {Education, Month}, and prims = 3. From Table 4.1, the resulting OLAM cube is shown in Table 4.4.

Table 4.4. An example of hybrid-dimensional OLAM cube expressed in table

Education     Month        Support
Bachelor      -            3
High school   -            3
Master        -            3
-             July         4
-             Aug.         4
-             July, Aug.   4
Bachelor      July         3
Bachelor      Aug.         3
High school   July         3
High school   Aug.         3
Master        July         3
Master        Aug.         3
Bachelor      July, Aug.   3
High school   July, Aug.   3
Master        July, Aug.   3
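The construction in Definition 4.2 can be reproduced directly from Table 4.1. The following Python sketch is an illustration only: the names `TABLE_4_1`, `transactions` and `mcube` are hypothetical, support is assumed to be the number of transactions containing an itemset, an empty t_G is taken to mean that every tuple is its own transaction, and the exhaustive enumeration is practical only for a table this small.

```python
from itertools import combinations

COLUMNS = ("City", "Education", "Date", "Month", "Product_ID", "Category")
TABLE_4_1 = [
    ("Taipei",  "Bachelor",    "7/12", "July", 1, "A"),
    ("Taipei",  "High school", "7/12", "July", 2, "A"),
    ("N.Y.",    "Master",      "7/18", "July", 1, "A"),
    ("Toronto", "Master",      "8/2",  "Aug.", 3, "B"),
    ("Seattle", "Master",      "8/3",  "Aug.", 4, "B"),
    ("N.Y.",    "High school", "8/2",  "Aug.", 1, "A"),
    ("Toronto", "High school", "7/4",  "July", 1, "A"),
    ("Seattle", "Bachelor",    "7/18", "July", 5, "C"),
    ("Taipei",  "Bachelor",    "8/2",  "Aug.", 2, "A"),
    ("N.Y.",    "Bachelor",    "9/1",  "Sep.", 3, "B"),
]

def transactions(rows, t_g, t_m):
    """Group rows by t_g; each group becomes a set of (attribute, value) items."""
    idx = {c: i for i, c in enumerate(COLUMNS)}
    groups = {}
    for n, row in enumerate(rows):
        # With an empty t_g every row is a transaction of its own.
        key = tuple(row[idx[a]] for a in t_g) if t_g else n
        groups.setdefault(key, set()).update((a, row[idx[a]]) for a in t_m)
    return list(groups.values())

def mcube(rows, t_g, t_m, prims):
    """Return {itemset: support} for itemsets with support >= prims."""
    txs = transactions(rows, t_g, t_m)
    items = sorted({i for t in txs for i in t})
    cube = {}
    for k in range(1, len(items) + 1):
        found = False
        for cand in combinations(items, k):
            sup = sum(set(cand) <= t for t in txs)
            if sup >= prims:
                cube[frozenset(cand)] = sup
                found = True
        if not found:  # no frequent k-itemset, so no larger one either
            break
    return cube

intra = mcube(TABLE_4_1, ("City",), ("Category",), prims=2)
for itemset, sup in intra.items():
    print(sorted(v for _, v in itemset), sup)
```

Calling `mcube` with `t_g=()` and `t_m=("Education", "Month")` gives the inter-dimensional case, and `t_g=("City",)` with the same item attributes gives the hybrid case.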
4.2 OLAM Lattice

In accordance with the definition of the OLAM cube, we can generate all possible OLAM cubes from the star schema, thereby forming an OLAM lattice. In order to provide hierarchical navigation and multidimensional exploration, the OMARS system [15] models the OLAM lattice as a three-layer structure. The first layer lattice expresses the combinations of all dimensions. The second layer further exploits the inter-attribute combinations for each dimensional combination in the first layer lattice. The third layer exploits all OLAM cubes corresponding to the meta-patterns derived from each subcube in the second layer. Note that the real OLAM cubes are stored in the third layer.

For example, consider the star schema illustrated in Figure 2.3. The first layer lattice, shown in Figure 4.1, is composed of eight possible dimensional combinations. After constructing the first layer lattice, we choose the node composed of the customer and time dimensions, and extend it to form the second layer lattice shown in Figure 4.2. Each node of the second layer lattice is constructed by attaching any attribute chosen from the selected dimensions. Finally, we extend the cube <(city, education), (date)> to form the third layer lattice shown in Figure 4.3, where an asterisk marks a transaction attribute. It can be observed that there is one OLAM cube corresponding to inter-association, (city, education, date); three OLAM cubes corresponding to hybrid associations, (date*, city, education), (education*, city, date) and (city*, education, date); and three cubes corresponding to intra-associations, (education*, date*, city), (city*, date*, education) and (city*, education*, date). Note that (city*, education*, date*) is shown only to complete the lattice structure; it is useless and will not be materialized.
Figure 4.1. The 1st layer OLAM lattice for the example star schema in Figure 2.3 (nodes: <Customer, Product, Time>; <Customer, Product, ->, <Customer, -, Time>, <-, Product, Time>; <Customer, -, ->, <-, Product, ->, <-, -, Time>; <-, -, ->)

Figure 4.2. The 2nd layer lattice derived from <Customer, -, Time> in the 1st layer (nodes: <(city), (date)>, <(education), (date)>, <(education), (month)>, <(city), (month)>; <(city, education), (date)>, <(city, education), (month)>, <(city), (date, month)>, <(education), (date, month)>; <(city, education), (date, month)>)
Figure 4.3. The 3rd layer lattice derived from the subcube <(city, education), (date)> in the 2nd layer, where * marks a transaction attribute (0 transaction attributes, inter-association: (city, education, date); 1 transaction attribute, hybrid association: (city, education, *date), (city, *education, date), (*city, education, date); 2 transaction attributes, intra-association: (city, *education, *date), (*city, education, *date), (*city, *education, date); 3 transaction attributes: (*city, *education, *date))

Because the real OLAM cubes are stored in the third layer lattice, we can mine multidimensional association rules efficiently by materializing these OLAM cubes. From this three-layer lattice, we observe an attribute dependency, stated as follows:

Proposition 4.1. Consider two OLAM cubes, $MCube(t_{G_1}, t_{M_1})$ and $MCube(t_{G_2}, t_{M_2})$. If $t_{G_1} = t_{G_2}$ and $t_{M_2} \subseteq t_{M_1}$, then every itemset in $MCube(t_{G_2}, t_{M_2})$ must be a subset of an itemset in $MCube(t_{G_1}, t_{M_1})$, and these two itemsets have the same support value.
Example 4.4. Consider the table T in Table 4.1. Let $MCube(t_{G_1}, t_{M_1})$ be the cube illustrated in Table 4.4 and $MCube(t_{G_2}, t_{M_2})$ the cube illustrated in Table 4.5. Hence $t_{G_1} = t_{G_2}$ = {City}, $t_{M_1}$ = {Education, Month}, $t_{M_2}$ = {Education}, and prims = 3. It can be verified that every frequent itemset stored in $MCube(t_{G_2}, t_{M_2})$ is a subset of a frequent itemset in $MCube(t_{G_1}, t_{M_1})$, and both itemsets have the same support value.

Table 4.5. An OLAM cube

Education     Support
Bachelor      3
High school   3
Master        3

According to Proposition 4.1, there is a dependency between OLAM cubes in the third layer lattice, which is formalized below.

Definition 4.3. Consider two OLAM cubes, $MCube(t_{G_1}, t_{M_1})$ and $MCube(t_{G_2}, t_{M_2})$. We say that $MCube(t_{G_2}, t_{M_2})$ is dependent upon $MCube(t_{G_1}, t_{M_1})$ if $t_{G_1} = t_{G_2}$ and $t_{M_2} \subseteq t_{M_1}$, denoted as $MCube(t_{G_2}, t_{M_2}) \preceq MCube(t_{G_1}, t_{M_1})$.

One important consequence of Definition 4.3 is that if $MCube(t_{G_2}, t_{M_2}) \preceq MCube(t_{G_1}, t_{M_1})$, then all multidimensional queries that can be answered via $MCube(t_{G_2}, t_{M_2})$ can also be answered via $MCube(t_{G_1}, t_{M_1})$. Furthermore, it should be noted that not all of the OLAM cubes derived in the lattice have to be materialized and stored, because the concept hierarchies defined over the attributes in the star schema make it possible to prune some redundant cubes.
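The dependency test of Definition 4.3 reduces to two set comparisons. The short Python sketch below is illustrative only; the function name `depends_on` and the representation of a cube as a pair of attribute collections (t_G, t_M) are assumptions made here.

```python
def depends_on(cube2, cube1):
    """True if cube2 = (t_G2, t_M2) is dependent upon cube1 = (t_G1, t_M1).

    The cubes must share the same transaction attributes, and the item
    attributes of cube2 must be a subset of those of cube1.
    """
    t_g2, t_m2 = cube2
    t_g1, t_m1 = cube1
    return set(t_g1) == set(t_g2) and set(t_m2) <= set(t_m1)

# The two cubes of Example 4.4.
table_4_4 = ({"City"}, {"Education", "Month"})
table_4_5 = ({"City"}, {"Education"})
print(depends_on(table_4_5, table_4_4))  # True
print(depends_on(table_4_4, table_4_5))  # False
```

A query addressed to the cube of Table 4.5 could therefore be served from the cube of Table 4.4 if only the latter is materialized, but not the other way round.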
Consider an OLAM cube, $MCube(t_G, t_M)$. We observe two different types of redundancy.

Proposition 4.2. Schema redundancy: Let $a_i, a_j \in t_G$. If $a_i$ and $a_j$ are in the same dimension and $a_j$ is an ancestor of $a_i$, then $MCube(t_G, t_M)$ is redundant with respect to the cube $MCube(t_G - \{a_j\}, t_M)$.

Example 4.5. Consider the joined table in Table 4.1. Let $t_M$ = {Category}. The table obtained by grouping Date and Month as transaction attributes is shown in Table 4.6. Note that this table has the same transactions as the one obtained by grouping Date alone as the transaction attribute, shown in Table 4.7. Thus, the resulting cube MCube({Date, Month}, {Category}) is the same as MCube({Date}, {Category}).

Table 4.6. The resulting table by grouping {Date, Month} as transaction attributes for Table 4.1

Date   Month   Category
7/4    July    A
7/12   July    A
7/18   July    A, C
8/2    Aug.    A, B
8/3    Aug.    B
9/1    Sep.    B
Table 4.7. The resulting table by grouping {Date} as transaction attribute for Table 4.1

Date    Category
7/4     A
7/12    A
7/18    A, C
8/2     A, B
8/3     B
9/1     B

Proposition 4.3. Values redundancy: Let $a_i, a_j \in t_M$. If $a_i$, $a_j$ are in the same dimension and $a_j$ is an ancestor of $a_i$, then $MCube(t_G, t_M)$ is a cube with values redundancy.

Example 4.6. Consider the joined table in Table 4.1. Let $t_G = \{City\}$, $t_M = \{Date, Month\}$ and prims = 2. The resulting OLAM cube is shown in Table 4.8. One can observe that the tuples with dotted lines in this table are redundant patterns. Therefore, this cube exhibits values redundancy. Note that when values redundancy holds, the redundant patterns must be pruned during the generation of frequent itemsets.
Table 4.8. The resulting OLAM cube MCube({City}, {Date, Month})

Date Month support
7/18 8/ July Aug /18 July 2 7/18 8/2 Aug. July 2 3 8/2 Aug. 3 July, Aug. 4 7/18 July, Aug. 2 8/2 July, Aug. 3

In addition to the above observations, we observe that any OLAM cube is useless if it satisfies the following property.

Proposition 4.4. Useless property: Let $a_i \in t_G$ and $t_M = \{a_j\}$. If $a_i$, $a_j$ are in the same dimension and $a_j$ is an ancestor of $a_i$, then $MCube(t_G, t_M)$ is a useless cube.

Example 4.7. Let $t_G = \{City, Date\}$ and $t_M = \{Month\}$. The table obtained from Table 4.1 by grouping {City, Date} as transactions is shown in Table 4.9. One can observe that the cardinality of every transaction is 1. Therefore, we cannot find any association rule from this table.
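Propositions 4.2 to 4.4 can all be tested from the concept hierarchies alone, before any cube is computed. The sketch below assumes each attribute belongs to one dimension and has an integer level in a linear hierarchy (a higher level is coarser); the `HIERARCHY` table and all function names are hypothetical and only mirror the attributes used in Examples 4.5 to 4.7.

```python
# Hypothetical concept hierarchies: attribute -> (dimension, level).
HIERARCHY = {
    "Date": ("Time", 0),
    "Month": ("Time", 1),
    "City": ("Location", 0),
    "Category": ("Product", 0),
    "Education": ("Customer", 0),
}


def is_ancestor(a_j, a_i):
    """True if a_j is an ancestor of a_i, i.e. same dimension and coarser level."""
    dim_j, level_j = HIERARCHY[a_j]
    dim_i, level_i = HIERARCHY[a_i]
    return dim_j == dim_i and level_j > level_i


def has_ancestor_pair(attrs):
    return any(is_ancestor(a, b) for a in attrs for b in attrs)


def schema_redundant(t_g, t_m):
    """Proposition 4.2: t_G contains an attribute together with one of its ancestors."""
    return has_ancestor_pair(t_g)


def values_redundant(t_g, t_m):
    """Proposition 4.3: t_M contains an attribute together with one of its ancestors."""
    return has_ancestor_pair(t_m)


def useless(t_g, t_m):
    """Proposition 4.4: the single mining attribute is an ancestor of a transaction attribute."""
    if len(t_m) != 1:
        return False
    (a_j,) = t_m
    return any(is_ancestor(a_j, a_i) for a_i in t_g)


print(schema_redundant({"Date", "Month"}, {"Category"}))  # Example 4.5
print(values_redundant({"City"}, {"Date", "Month"}))      # Example 4.6
print(useless({"City", "Date"}, {"Month"}))               # Example 4.7
```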
Table 4.9. The resulting table by grouping {City, Date} as transaction attribute for Table 4.1

City      Date    Month
Toronto   7/4     July
Taipei    7/12    July
Taipei    8/2     Aug.
N.Y.      7/18    July
N.Y.      8/2     Aug.
Toronto   8/2     Aug.
Seattle   8/3     Aug.
N.Y.      9/1     Sep.

4.3 OLAM Cube Selection

We now proceed to give a formal definition of the OLAM cube selection problem. To this end, we introduce the symbols shown in the symbol table. Assume that an OLAM lattice L contains n OLAM data cubes $D = \{d_1, d_2, \ldots, d_n\}$, the set of users' queries is $Q = \{q_1, q_2, \ldots, q_m\}$, the set of query frequencies is $F = \{f_{q_1}, f_{q_2}, \ldots, f_{q_m}\}$, and the space constraint is S. The OLAM cube selection problem is denoted as a five-tuple $\theta = \{L, D, Q, F, S\}$. A solution to $\theta$ is a subset of D, say M, that minimizes the following cost function

\[ \min \sum_{i=1}^{m} f_{q_i} \cdot E(q_i, M) \]

subject to the constraint $\sum_{d \in M} |d| \le S$.
Table. The Symbol Table

Symbol        Definition
L             Lattice
D             Set of data cubes
d_n           n-th data cube
Q             Set of user queries
q_m           m-th user query
F             Set of user query frequencies
f_{q_i}       Frequency of the i-th query
S             Space constraint
M             Set of materialized cubes
E(q_i, M)     The total time to respond to the i-th query using the materialized views M
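To make the selection problem concrete, the objective and the space constraint can be evaluated for any candidate set M as sketched below. The cube sizes, queries, frequencies and the cost function `toy_cost` are a hypothetical toy instance, not data from the thesis; in practice E(q, M) would come from the cost model of Chapter 5.

```python
def total_query_cost(queries, freqs, materialized, eval_cost):
    """Objective of the selection problem: sum over i of f_qi * E(q_i, M)."""
    return sum(f * eval_cost(q, materialized) for q, f in zip(queries, freqs))


def fits(materialized, size, space):
    """Space constraint: the total size of the cubes in M must not exceed S."""
    return sum(size[d] for d in materialized) <= space


# Hypothetical toy instance.
SIZE = {"d1": 4, "d2": 3}
ANSWERED_BY = {"q1": "d1", "q2": "d2"}


def toy_cost(q, materialized):
    # 1 time unit if the cube answering q is materialized, otherwise 10.
    return 1 if ANSWERED_BY[q] in materialized else 10


cost = total_query_cost(["q1", "q2"], [2, 1], {"d1"}, toy_cost)
print(cost, fits({"d1"}, SIZE, 5), fits({"d1", "d2"}, SIZE, 5))
```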
Chapter 5 Evaluation of OLAM Query Cost

5.1 Query Evaluation Flow

As stated previously, the primary task of the OLAM Engine is to generate association rules according to users' queries. After receiving a query, the OLAM Engine analyzes the query, transfers the necessary information to the OLAM Mediator, and then waits for the best-matching cube from the OLAM Mediator. When the OLAM Mediator receives the query information from the OLAM Engine, it looks for the best-matching cube. First, the OLAM Mediator searches for the required OLAM cube. If the cube is found, the Mediator further checks whether minsup ≥ prims. If so, it returns the found OLAM cube to the OLAM Engine; otherwise, it returns the corresponding auxiliary cube of the found OLAM cube and notifies the OLAM Engine to perform association mining from the data warehouse with the aid of this auxiliary cube. On the other hand, if the OLAM Mediator cannot find any qualified OLAM cube to answer the user's query, it notifies the OLAM Engine to perform association mining from the data warehouse afresh. The procedure employed by the OLAM Mediator is depicted in Figure 5.1.
[Figure 5.1. The flow diagram of OLAM query. Start, OLAM query, then the test "Is the required OLAM cube found?". No: the required OLAM cube does not exist. Yes: test "minsup >= prims". Yes: return the OLAM cube. No: return the OLAM cube and the auxiliary cube. End.]

It is worth mentioning that, for simplicity, we do not consider OLAP cubes in this study, although the OMARS system does take this kind of data cube into account in association mining. In accordance with the work flow of the OLAM Mediator and the OLAM Engine, our paradigm for evaluating OLAM query cost is shown below:
Procedure Evacost_OLAMQ(q)
begin
    Let q = <t_G, t_M, minsup>;
    found = OLAMQ_search(q, CQ);
    if found = TRUE then
        if prims ≤ minsup then
            cost = the cost for evaluating query q using OLAM cube CQ.MCube;   /* case 1 */
        else
            cost = the cost for evaluating query q using CQ.MCube, auxiliary cube CQ.XCube and data warehouse;   /* case 2 */
        end if
    else
        cost = the cost for evaluating query q using data warehouse;   /* case 3 */
    end if
    return cost;
end

Figure 5.2. The procedure to compute the cost of a user's query

In summary, there are three different cases to be dealt with:

Case 1: evaluating the cost via the qualified OLAM cube.
Case 2: evaluating the cost via the OLAM cube, the auxiliary cube, and the data warehouse.
Case 3: evaluating the cost via the data warehouse.

The cost complexity evaluation for each case will be elaborated in the following
sections. We end this section with the description of OLAMQ_search.

Procedure OLAMQ_search(q, CQ)
begin
    found = FALSE;
    if MCube(q.t_G, q.t_M) is materialized then
        CQ.MCube = MCube(q.t_G, q.t_M);
        CQ.XCube = XCube(q.t_G, q.t_M);
        found = TRUE;
    end if
    CurQ = φ;
    for each MCube in the OLAM lattice do
        if MCube is materialized and MCube.t_G = q.t_G and MCube.t_M ⊇ q.t_M
           and (MCube.t_M ⊂ CurQ.t_M or CurQ = φ) then
            CurQ = MCube;
        end if
    end for
    if found then
        CQ.MCube = CurQ;
        CQ.XCube = XCube(q.t_G, CurQ.t_M);
    end if
    return found
end

Figure 5.3. Procedure OLAMQ_search

Example 5.1. Suppose the OMARS system stores the following three materialized OLAM cubes: $MCube(t_{G_1}, t_{M_1})$, where $t_{G_1} = \{City\}$ and $t_{M_1} = \{Education, Date\}$; $MCube(t_{G_2}, t_{M_2})$, where $t_{G_2} = \{City\}$ and $t_{M_2} = \{Education, Date, Category\}$; and $MCube(t_{G_3}, t_{M_3})$, where $t_{G_3} = \{Date\}$ and $t_{M_3} = \{City\}$. Let prims = 3. We have three users' queries $q_1$, $q_2$, $q_3$, where $q_1.t_G = \{City\}$, $q_1.t_M = \{Education, Date\}$, and $q_1.ms = 4$; $q_2.t_G = \{City\}$, $q_2.t_M = \{Education, Date, Category\}$, and $q_2.ms = 2$; $q_3.t_G = \{Date\}$, $q_3.t_M = \{City, Education\}$, and $q_3.ms = 3$. These three queries lead to the following three conditions:

1. When the user's query is $q_1$, the condition is the same as Case 1 described above. Because the corresponding OLAM cube can be found in the OMARS system and the minsup of the user's query is higher than prims, we can use $MCube(t_{G_1}, t_{M_1})$ to respond to the user's query immediately.

2. When the user's query is $q_2$, the condition is the same as Case 2 described above. Because the minsup of the user's query is lower than prims, we need to utilize the corresponding auxiliary cube of the found OLAM cube $MCube(t_{G_2}, t_{M_2})$ and the data warehouse to answer query $q_2$.

3. When the user's query is $q_3$, the condition is the same as Case 3 described above. Because we cannot find any matching OLAM cube in the OMARS system, we should utilize the data warehouse to answer query $q_3$.

5.2 Cost Evaluation for Case 1

In this case, the OLAM cube returned from the OLAM Mediator can be utilized to respond to users' queries. The CBW_on algorithm [15] is employed to mine association rules. For convenience and to facilitate the analysis, we replicate the CBW_on algorithm in Figure 5.4. Because the qualified frequent itemsets have been stored in the found
OLAM cube, and minsup ≥ prims, there is no need to generate the frequent itemsets via an Apriori-like algorithm. All we have to do is scan the frequent itemsets in the OLAM cube and perform the association_gen procedure in Figure 5.7 to generate the qualified association rules.

Algorithm CBW_on
Input: relevant cube MCube(t_G, t_M), minsup and prims;
Output: The set of frequent itemsets F;
    if minsup < prims then
        AF = {X | sup(X) ≥ minsup, X ∈ Auxiliary Cube} ∪ {Y | Y ∈ MCube(t_G, t_M) and |Y| = K_α};
        DF = Dwnsearch_on(T, AF, K_α, minsup);
        UF = Upsearch(AF, minsup);
        F = DF ∪ UF;
    else
        F = {X | X ∈ MCube(t_G, t_M) and sup(X) ≥ minsup};
    end if
    return F;

Figure 5.4. Algorithm CBW_on
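The three-way case split of Section 5.1, which decides which branch of the algorithm above is executed, can be sketched on the data of Example 5.1 as follows. This is a simplified sketch rather than the OMARS implementation: it assumes that a query counts as answerable whenever some materialized cube has the same transaction attributes and a superset of the mining attributes, and the names `olamq_search` and `classify` are illustrative.

```python
PRIMS = 3

# The three materialized cubes of Example 5.1 as (t_G, t_M) pairs.
MATERIALIZED = [
    (frozenset({"City"}), frozenset({"Education", "Date"})),
    (frozenset({"City"}), frozenset({"Education", "Date", "Category"})),
    (frozenset({"Date"}), frozenset({"City"})),
]


def olamq_search(t_g, t_m, cubes):
    """Return a smallest materialized cube with equal t_G and t_M containing the query's, or None."""
    best = None
    for cube_g, cube_m in cubes:
        if cube_g == t_g and cube_m >= t_m:
            # `<` on frozensets is the proper-subset test used in the pseudocode.
            if best is None or cube_m < best[1]:
                best = (cube_g, cube_m)
    return best


def classify(t_g, t_m, minsup, cubes=MATERIALIZED, prims=PRIMS):
    """Return 1, 2 or 3 according to the cases distinguished by Evacost_OLAMQ."""
    if olamq_search(frozenset(t_g), frozenset(t_m), cubes) is None:
        return 3  # mine from the data warehouse afresh
    return 1 if prims <= minsup else 2


q1 = classify({"City"}, {"Education", "Date"}, 4)
q2 = classify({"City"}, {"Education", "Date", "Category"}, 2)
q3 = classify({"Date"}, {"City", "Education"}, 3)
print(q1, q2, q3)
```

The three queries fall into Cases 1, 2 and 3 respectively, in agreement with Example 5.1.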
Procedure Dwnsearch_on
    for i = 1 to |D| do
        scan the i-th transaction t_i;
        delete those items in t_i but not in AF;
        for each subset X of t_i and 2 ≤ |X| ≤ K_α do
            sup(X)++;
    end for
    DF = {X | sup(X) ≥ minsup} ∪ AF;

Figure 5.5. Procedure Dwnsearch_on

Procedure Upsearch
    transform horizontal data format T into t_id lists;
    F_{K_α} = frequent K_α-itemsets;
    k = K_α, F_k = F_{K_α};
    repeat
        k++;
        C_k = new candidate k-itemsets generated from F_{k-1};
        for each X ∈ C_k do
            perform bit-vector intersection on X;
            count the support of X;
        end for
        F_k = {X | sup(X) ≥ prims, X ∈ C_k};
        UF = UF ∪ F_k;
    until F_k = ∅

Figure 5.6. Procedure Upsearch
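The support counting step of Upsearch relies on a vertical data format. As an illustration only, the sketch below stores each item's transaction-id list as a Python integer used as a bit vector, so that the support of a k-itemset takes k-1 intersections and one bit count. The transactions are the Category values of Table 4.7; the function names are hypothetical.

```python
def to_bitvectors(transactions):
    """Vertical format: item -> int whose bit i is set iff transaction i contains the item."""
    vectors = {}
    for tid, items in enumerate(transactions):
        for item in items:
            vectors[item] = vectors.get(item, 0) | (1 << tid)
    return vectors


def support(itemset, vectors):
    """Support of a non-empty itemset: k-1 intersections followed by one accumulation."""
    items = list(itemset)
    acc = vectors.get(items[0], 0)
    for item in items[1:]:
        acc &= vectors.get(item, 0)
    return bin(acc).count("1")


# Transactions of Table 4.7 (grouped by Date, mining attribute Category).
transactions = [{"A"}, {"A"}, {"A", "C"}, {"A", "B"}, {"B"}, {"B"}]
vectors = to_bitvectors(transactions)

print(support({"A"}, vectors), support({"B"}, vectors), support({"A", "B"}, vectors))
```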
Procedure association_gen(F: set of all frequent itemsets; min_conf: minimum confidence threshold)
begin
    for each l ∈ F do
        generate P(l) = 2^l − {∅};   // P(l): power set of l
        for each s ⊂ l (s ∈ P(l)) forming the rule s ⇒ l − s do
            if support_count(l) / support_count(s) ≥ min_conf then
                output s ⇒ l − s;
end

Figure 5.7. Procedure association_gen

The cost thus can be divided into two parts:

1. Frequent itemsets discovery: This involves searching the frequent itemsets stored in the OLAM cube for those with support no lower than the minsup of the user's query, which costs $\alpha |D_M|$, where $D_M$ denotes the OLAM cube.

2. Rule generation: For each discovered frequent itemset, we construct all possible rules from it, compute their confidence, and keep those that satisfy the minimum confidence. The key point of the complexity analysis thus lies in the number of candidate rules to be generated and inspected. Our first step in this direction is to consider the number of rules that can be generated from a frequent k-itemset and all of its subsets.

Lemma 1. The number of rules that can be constructed from a k-itemset is $2^k - 2$.

Proof. Recall that each rule that can be constructed from an itemset X has the form $A \Rightarrow X - A$ for $A \subset X$ and $A \neq \emptyset$. Thus, the number of different A's determines the number of rules, which is

\[ \sum_{i=1}^{k-1} \binom{k}{i} = 2^k - 2. \]

Lemma 2. For a k-itemset X, the total number of rules that can be generated from X and its subsets is $3^k - 2^{k+1} + 1$.

Proof. From Lemma 1, we can derive

\[
\begin{aligned}
\binom{k}{2}(2^2 - 2) + \binom{k}{3}(2^3 - 2) + \cdots + \binom{k}{k}(2^k - 2)
&= \sum_{i=2}^{k} \binom{k}{i} 2^i - 2 \sum_{i=2}^{k} \binom{k}{i} \\
&= \left( \sum_{i=0}^{k} \binom{k}{i} 2^i - 2k - 1 \right) - 2 \left( \sum_{i=0}^{k} \binom{k}{i} - k - 1 \right) \\
&= (3^k - 2k - 1) - 2(2^k - k - 1) \\
&= 3^k - 2^{k+1} + 1.
\end{aligned}
\]

Now, if we know the set of maximal frequent itemsets, then we can complete the analysis. Unfortunately, the exact set is unobtainable without a priori knowledge of the user-specified minsup. We thus resort to an estimation that proceeds by taking prims in place of minsup. Then we apply sampling to obtain a random subset of the warehouse data, and we can either

1. compute the maximal frequent itemsets for each OLAM cube using any maximal pattern mining algorithm, or

2. apply the CBW_off algorithm to estimate $K_\alpha$ (the cutting level), compute the frequent itemsets with cardinality $K_\alpha$, and regard these itemsets as the maximal frequent itemsets.
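The counts stated in Lemmas 1 and 2 can be cross-checked by brute-force enumeration for small k. The sketch below does so; it is only a sanity check on the formulas, and the function names are illustrative.

```python
from itertools import combinations


def rules_from(itemset):
    """All rules A => X - A where A is a non-empty proper subset of X."""
    items = sorted(itemset)
    whole = frozenset(items)
    return [
        (frozenset(a), whole - frozenset(a))
        for r in range(1, len(items))
        for a in combinations(items, r)
    ]


def rules_from_all_subsets(itemset):
    """Number of rules generated from X and from every subset of X with at least two items."""
    items = sorted(itemset)
    return sum(
        len(rules_from(sub))
        for size in range(2, len(items) + 1)
        for sub in combinations(items, size)
    )


for k in range(1, 8):
    x = range(k)
    assert len(rules_from(x)) == 2**k - 2                        # Lemma 1
    assert rules_from_all_subsets(x) == 3**k - 2**(k + 1) + 1    # Lemma 2
```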
Let MF denote the set of maximal patterns. If the first approach is adopted, the computation spent on rule generation will be

\[ \sum_{X \in MF} \left( 3^{|X|} - 2^{|X|+1} + 1 \right), \]

or

\[ |F_{K_\alpha}| \left( 3^{K_\alpha} - 2^{K_\alpha+1} + 1 \right) \]

if the second approach is used. Here, for simplicity, we adopt the second approach. Finally, combining the cost of frequent itemsets discovery and rule generation, and keeping the dominant term of the rule-generation cost, we have

\[ |F_{K_\alpha}| \, 3^{K_\alpha} + \alpha |D_M|. \]

5.3 Cost Evaluation for Case 2

In this case, algorithm CBW_on illustrated in Figure 5.4 will execute the minsup < prims part of the if clause, which comprises three different steps.

1. Generate AF, i.e., $F_{K_\alpha}$. This requires scanning the auxiliary cube and the OLAM cube. The cost is $\alpha(|D_X| + |D_M|)$, where $D_X$ denotes the auxiliary cube and $D_M$ denotes the OLAM cube.

2. Execute procedure Dwnsearch_on illustrated in Figure 5.5. Note that this procedure presumes the availability of the corresponding joined table, and ignores the preprocessing step that generates the joined table. To account for this task and simplify the discussion, we assume this cost is w and the table is T. As illustrated in Figure 5.5, the Dwnsearch_on procedure needs to scan all the
transactions in the database. The I/O cost is $\alpha |T|$. Next we estimate the cost of the most expensive step: counting itemset support. Let l denote the average length of each transaction. This step costs

\[ |T| \left( \binom{l}{2} + \binom{l}{3} + \cdots + \binom{l}{K_\alpha} \right), \quad \text{or } |T| \sum_{i=2}^{K_\alpha} \binom{l}{i} \text{ in brief.} \]

Finally, the total cost consumed by the Dwnsearch_on procedure equals

\[ \alpha |T| + |T| \sum_{i=2}^{K_\alpha} \binom{l}{i}. \]

3. Execute procedure Upsearch illustrated in Figure 5.6. To minimize the I/O cost and avoid combinatorial decomposition, the Upsearch procedure first transforms the transaction data into a vertical data format called transaction-id lists, then utilizes this structure to count the supports of itemsets. The cost lies in three main steps.

(1) Data transformation. This requires an $\alpha |T|$ data scan.

(2) Candidate generation. The dominant operation is the itemset join. If the largest itemset cardinality is $K_{max}$, this task consumes at most

\[ \sum_{k=K_\alpha+1}^{K_{max}} \binom{|F_{k-1}|}{2}. \]

(3) Counting candidate support. For each k-itemset, counting involves k-1 bit-vector intersections and one bit-vector accumulation. Summing this cost over all candidate itemsets, we have

\[ \sum_{i=K_\alpha+1}^{K_{max}} |C_i| \cdot i \cdot |T|. \]

Finally, the total cost for procedure Upsearch is

\[ \alpha |T| + \sum_{i=K_\alpha+1}^{K_{max}} \left( |C_i| \cdot i \cdot |T| + \binom{|F_{i-1}|}{2} \right). \]

Combining all of the above analysis, we have

\[ \alpha \left( |D_X| + |D_M| + 2|T| \right) + |T| \sum_{i=2}^{K_\alpha} \binom{l}{i} + \sum_{i=K_\alpha+1}^{K_{max}} \left( |C_i| \cdot i \cdot |T| + \binom{|F_{i-1}|}{2} \right) + |F_{K_\alpha}| \, 3^{K_\alpha}. \]

5.4 Cost Evaluation for Case 3

In this case, we should generate table T according to the user's query, which costs $|D| \log |D|$. After this, the CBW_off algorithm shown in Figure 5.8 is performed. It can be observed that, except for step 1, the steps employed by CBW_off are quite similar to those employed by CBW_on in Case 2. Since step 1 costs $\alpha |T| + K_\alpha |T|$, the total cost for this case is

\[ |D| \log |D| + 3\alpha |T| + K_\alpha |T| + |T| \sum_{i=2}^{K_\alpha} \binom{l}{i} + \sum_{i=K_\alpha+1}^{K_{max}} \left( |C_i| \cdot i \cdot |T| + \binom{|F_{i-1}|}{2} \right) + |F_{K_\alpha}| \, 3^{K_\alpha}. \]
Algorithm CBW_off(T, prims)
Input: Table T and prims;
Output: The set of frequent itemsets F;
1 scan T to compute K_α and generate all frequent 1-itemsets F_1;
2 DF = Dwnsearch(T, K_α, F_1, prims);
3 UF = Upsearch(DF, prims);
4 return F = DF ∪ UF;

Figure 5.8. Algorithm CBW_off

Procedure Dwnsearch
    for i = 1 to |D| do
        scan the i-th transaction t_i;
        delete the items in t_i that are not in F_1;
        for each subset X of t_i and 2 ≤ |X| ≤ K_α do
            sup(X)++;
    end for
    store all X in Auxiliary cube for |X| = K_α and sup(X) < prims;
    DF = {X | sup(X) ≥ prims};

Figure 5.9. Procedure Dwnsearch

To sum up, we list the cost functions for the three cases below:

Case 1:
\[ |F_{K_\alpha}| \, 3^{K_\alpha} + \alpha |D_M|. \]

Case 2:
\[ \alpha \left( |D_X| + |D_M| + 2|T| \right) + |T| \sum_{i=2}^{K_\alpha} \binom{l}{i} + \sum_{i=K_\alpha+1}^{K_{max}} \left( |C_i| \cdot i \cdot |T| + \binom{|F_{i-1}|}{2} \right) + |F_{K_\alpha}| \, 3^{K_\alpha}. \]

Case 3:
\[ |D| \log |D| + 3\alpha |T| + K_\alpha |T| + |T| \sum_{i=2}^{K_\alpha} \binom{l}{i} + \sum_{i=K_\alpha+1}^{K_{max}} \left( |C_i| \cdot i \cdot |T| + \binom{|F_{i-1}|}{2} \right) + |F_{K_\alpha}| \, 3^{K_\alpha}. \]
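The three cost functions share most of their terms, which becomes visible when they are written as code. The sketch below is one possible transcription under stated assumptions: the average transaction length l is taken as an integer so that the binomial coefficient is defined, the logarithm is taken to base 2 (the text does not state the base), `cand[i]` and `freq[i]` stand for |C_i| and |F_i|, and all numeric inputs in the usage lines are made up for illustration.

```python
from math import comb, log2


def rule_gen_cost(f_ka, k_alpha):
    """Rule generation term |F_Ka| * 3^Ka."""
    return f_ka * 3**k_alpha


def mining_cost(t, l, k_alpha, k_max, cand, freq):
    """Terms shared by Cases 2 and 3: subset counting plus candidate generation and counting."""
    down = t * sum(comb(l, i) for i in range(2, k_alpha + 1))
    up = sum(cand[i] * i * t + comb(freq[i - 1], 2)
             for i in range(k_alpha + 1, k_max + 1))
    return down + up


def cost_case1(alpha, d_m, f_ka, k_alpha):
    return rule_gen_cost(f_ka, k_alpha) + alpha * d_m


def cost_case2(alpha, d_x, d_m, t, l, k_alpha, k_max, cand, freq):
    return (alpha * (d_x + d_m + 2 * t)
            + mining_cost(t, l, k_alpha, k_max, cand, freq)
            + rule_gen_cost(freq[k_alpha], k_alpha))


def cost_case3(alpha, d, t, l, k_alpha, k_max, cand, freq):
    return (d * log2(d) + 3 * alpha * t + k_alpha * t
            + mining_cost(t, l, k_alpha, k_max, cand, freq)
            + rule_gen_cost(freq[k_alpha], k_alpha))


# Made-up parameter values, for illustration only.
cand, freq = {3: 2}, {2: 5}
c1 = cost_case1(alpha=2, d_m=10, f_ka=5, k_alpha=2)
c2 = cost_case2(2, 4, 10, 8, 3, 2, 3, cand, freq)
c3 = cost_case3(2, 64, 8, 3, 2, 3, cand, freq)
print(c1, c2, c3)
```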
Chapter 6 OLAM Cube Selection Methods

In this chapter, we describe three typical heuristic algorithms proposed for the OLAP cube selection problem, and elaborate on how to modify each method and combine it with the cost models described in the last chapter to select the most suitable OLAM cubes. The methods include the forward greedy selection (FGS) method proposed by Harinarayan et al. [9], the pick by size (PBS) selection method proposed by Shukla et al. [17], and the backward greedy selection (BGS) method proposed by Lin and Kuo [13].

6.1 Forward Greedy Selection Method (FGS)

The forward greedy selection method was proposed by Harinarayan et al. [19]. A greedy algorithm always chooses the locally optimal solution in each step under some constraint. For this purpose, we define a benefit function $B(d_i, M)$ as follows:

\[ B(d_i, M) = \frac{1}{|d_i|} \sum_{q \in Q} \bigl( E(q, M) - E(q, M \cup \{d_i\}) \bigr) \tag{6.1} \]
We use this benefit function to compute the benefit of every unselected OLAM cube, and combine it with the forward selection method to choose the most suitable OLAM cubes to materialize one by one, starting from the empty set, until no cube can be added. The forward selection method is described below:

Algorithm 1. Forward greedy selection (FGS)
Step 0. Let M = φ.
Step 1. While $\sum_{d \in M} |d| < S$, repeat Step 2 to Step 5.
Step 2. According to equation (6.1), calculate the benefit of all unselected OLAM cubes $d_i$, for $1 \le i \le n$ and $d_i \notin M$.
Step 3. Select the OLAM cube with the maximal benefit according to the results of Step 2, and set it as $d_j$.
Step 4. $M \leftarrow M \cup \{d_j\}$.
Step 5. Go to Step 1.

Figure 6.1. Forward Greedy Selection Method

Example 6.1. Suppose that we select three attributes, city c, education e, and date d, from the sales star schema illustrated in Figure 2.3. Figure 6.2 depicts all possible OLAM cubes formed with these three attributes as well as their dependencies, where all OLAM cubes with the same transaction attributes $t_G$ are packed into a metacube. The dotted line between any two metacubes is drawn only for clarity; it completes the lattice structure of metacubes in terms of $t_G$. Note that according to Proposition 4.1, the dependency exists only among OLAM cubes within the same metacube. For simplicity, let us consider how to select the most suitable OLAM cubes to materialize from three OLAM cubes ced*, cd*, and ed* under a space constraint. The symbols
used in this example are shown in Table 6.1, and the required parameter settings are shown in Table 6.2. In addition, we assume that the base relation size is 64 and prims is 3. Table 6.3 shows the first two selection steps using FGS.

Table 6.1. The symbols used in cost model

Symbol     Definition
t_G        the set of transaction attributes
t_M        the set of mining attributes
α          I/O to computation ratio
K_α        the cardinality of maximal frequent itemset
K_max      the cardinality of the largest itemset
|C_i|      number of candidate i-itemsets
l          average length of each transaction
|F_i|      number of frequent i-itemsets
|D_M|      size of OLAM cube
|D_X|      size of auxiliary cube
f          frequency of OLAM cube
|T|        size of the table composed of attributes t_G ∪ t_M
|D|        size of base relation

Table 6.2. The required parameter settings (columns: subcubes, α, K_α, K_max, C_3, C_4, l, F_2, F_3, D_M, D_X, T, minsup, f; rows: d*ce, d*c, d*e)
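The FGS loop of Figure 6.1 together with the benefit function (6.1) can be sketched as follows. This is a simplified sketch, not the thesis implementation: it only considers cubes that still fit into the remaining space, the cube sizes are invented, and `toy_cost` is a stand-in for E(q, M) that charges the size of the smallest materialized cube able to answer the query, or the base relation size 64 when there is none.

```python
def benefit(d, materialized, queries, cost, size):
    """Benefit per unit of space of adding cube d to M, following equation (6.1)."""
    gain = sum(cost(q, materialized) - cost(q, materialized | {d}) for q in queries)
    return gain / size[d]


def forward_greedy_selection(cubes, queries, cost, size, space):
    selected, used = set(), 0
    while True:
        # Assumption of this sketch: only cubes that still fit are candidates.
        candidates = [d for d in cubes if d not in selected and used + size[d] <= space]
        if not candidates:
            return selected
        best = max(candidates, key=lambda d: benefit(d, selected, queries, cost, size))
        selected.add(best)
        used += size[best]


# Hypothetical instance: cubes identified by their mining attributes, made-up sizes.
BASE = 64
SIZE = {frozenset("ce"): 30, frozenset("c"): 10, frozenset("e"): 12}
QUERIES = list(SIZE)  # one query per cube


def toy_cost(q, materialized):
    usable = [SIZE[d] for d in materialized if q <= d]
    return min(usable, default=BASE)


chosen = forward_greedy_selection(list(SIZE), QUERIES, toy_cost, SIZE, space=40)
print(sorted("".join(sorted(d)) for d in chosen))
```

With a space limit of 40 the sketch first picks the cube on {c}, then the cube on {e}, and stops because the cube on {c, e} no longer fits.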
More informationData Mining. Data warehousing. Hamid Beigy. Sharif University of Technology. Fall 1396
Data Mining Data warehousing Hamid Beigy Sharif University of Technology Fall 1396 Hamid Beigy (Sharif University of Technology) Data Mining Fall 1396 1 / 31 Table of contents 1 Introduction 2 Data warehousing
More informationINSTITUTE OF AERONAUTICAL ENGINEERING (Autonomous) Dundigal, Hyderabad
INSTITUTE OF AERONAUTICAL ENGINEERING (Autonomous) Dundigal, Hyderabad - 500 043 INFORMATION TECHNOLOGY DEFINITIONS AND TERMINOLOGY Course Name : DATA WAREHOUSING AND DATA MINING Course Code : AIT006 Program
More informationCarnegie Mellon Univ. Dept. of Computer Science /615 DB Applications. Data mining - detailed outline. Problem
Faloutsos & Pavlo 15415/615 Carnegie Mellon Univ. Dept. of Computer Science 15415/615 DB Applications Lecture # 24: Data Warehousing / Data Mining (R&G, ch 25 and 26) Data mining detailed outline Problem
More informationImproving the Performance of OLAP Queries Using Families of Statistics Trees
Improving the Performance of OLAP Queries Using Families of Statistics Trees Joachim Hammer Dept. of Computer and Information Science University of Florida Lixin Fu Dept. of Mathematical Sciences University
More informationData warehouse and Data Mining
Data warehouse and Data Mining Lecture No. 14 Data Mining and its techniques Naeem A. Mahoto Email: naeemmahoto@gmail.com Department of Software Engineering Mehran Univeristy of Engineering and Technology
More informationA Multi-Dimensional Data Model
A Multi-Dimensional Data Model A Data Warehouse is based on a Multidimensional data model which views data in the form of a data cube A data cube, such as sales, allows data to be modeled and viewed in
More informationWKU-MIS-B10 Data Management: Warehousing, Analyzing, Mining, and Visualization. Management Information Systems
Management Information Systems Management Information Systems B10. Data Management: Warehousing, Analyzing, Mining, and Visualization Code: 166137-01+02 Course: Management Information Systems Period: Spring
More informationOLAP2 outline. Multi Dimensional Data Model. A Sample Data Cube
OLAP2 outline Multi Dimensional Data Model Need for Multi Dimensional Analysis OLAP Operators Data Cube Demonstration Using SQL Multi Dimensional Data Model Multi dimensional analysis is a popular approach
More informationRocky Mountain Technology Ventures
Rocky Mountain Technology Ventures Comparing and Contrasting Online Analytical Processing (OLAP) and Online Transactional Processing (OLTP) Architectures 3/19/2006 Introduction One of the most important
More informationChapter 6: Basic Concepts: Association Rules. Basic Concepts: Frequent Patterns. (absolute) support, or, support. (relative) support, s, is the
Chapter 6: What Is Frequent ent Pattern Analysis? Frequent pattern: a pattern (a set of items, subsequences, substructures, etc) that occurs frequently in a data set frequent itemsets and association rule
More informationDecision Support Systems 2012/2013. MEIC - TagusPark. Homework #5. Due: 15.Apr.2013
Decision Support Systems 2012/2013 MEIC - TagusPark Homework #5 Due: 15.Apr.2013 1 Frequent Pattern Mining 1. Consider the database D depicted in Table 1, containing five transactions, each containing
More informationData Warehouse and Data Mining
Data Warehouse and Data Mining Lecture No. 02 Introduction to Data Warehouse Naeem Ahmed Email: naeemmahoto@gmail.com Department of Software Engineering Mehran Univeristy of Engineering and Technology
More informationData Mining. Part 2. Data Understanding and Preparation. 2.4 Data Transformation. Spring Instructor: Dr. Masoud Yaghini. Data Transformation
Data Mining Part 2. Data Understanding and Preparation 2.4 Spring 2010 Instructor: Dr. Masoud Yaghini Outline Introduction Normalization Attribute Construction Aggregation Attribute Subset Selection Discretization
More informationCHAPTER 3 Implementation of Data warehouse in Data Mining
CHAPTER 3 Implementation of Data warehouse in Data Mining 3.1 Introduction to Data Warehousing A data warehouse is storage of convenient, consistent, complete and consolidated data, which is collected
More informationCS614 - Data Warehousing - Midterm Papers Solved MCQ(S) (1 TO 22 Lectures)
CS614- Data Warehousing Solved MCQ(S) From Midterm Papers (1 TO 22 Lectures) BY Arslan Arshad Nov 21,2016 BS110401050 BS110401050@vu.edu.pk Arslan.arshad01@gmail.com AKMP01 CS614 - Data Warehousing - Midterm
More informationData Warehousing and OLAP Technologies for Decision-Making Process
Data Warehousing and OLAP Technologies for Decision-Making Process Hiren H Darji Asst. Prof in Anand Institute of Information Science,Anand Abstract Data warehousing and on-line analytical processing (OLAP)
More informationThis tutorial will help computer science graduates to understand the basic-to-advanced concepts related to data warehousing.
About the Tutorial A data warehouse is constructed by integrating data from multiple heterogeneous sources. It supports analytical reporting, structured and/or ad hoc queries and decision making. This
More informationData Warehousing 2. ICS 421 Spring Asst. Prof. Lipyeow Lim Information & Computer Science Department University of Hawaii at Manoa
ICS 421 Spring 2010 Data Warehousing 2 Asst. Prof. Lipyeow Lim Information & Computer Science Department University of Hawaii at Manoa 3/30/2010 Lipyeow Lim -- University of Hawaii at Manoa 1 Data Warehousing
More informationData Preprocessing. Slides by: Shree Jaswal
Data Preprocessing Slides by: Shree Jaswal Topics to be covered Why Preprocessing? Data Cleaning; Data Integration; Data Reduction: Attribute subset selection, Histograms, Clustering and Sampling; Data
More informationData mining - detailed outline. Carnegie Mellon Univ. Dept. of Computer Science /615 DB Applications. Problem.
Faloutsos & Pavlo 15415/615 Carnegie Mellon Univ. Dept. of Computer Science 15415/615 DB Applications Data Warehousing / Data Mining (R&G, ch 25 and 26) C. Faloutsos and A. Pavlo Data mining detailed outline
More information5. MULTIPLE LEVELS AND CROSS LEVELS ASSOCIATION RULES UNDER CONSTRAINTS
5. MULTIPLE LEVELS AND CROSS LEVELS ASSOCIATION RULES UNDER CONSTRAINTS Association rules generated from mining data at multiple levels of abstraction are called multiple level or multi level association
More informationOn-Line Application Processing
On-Line Application Processing WAREHOUSING DATA CUBES DATA MINING 1 Overview Traditional database systems are tuned to many, small, simple queries. Some new applications use fewer, more time-consuming,
More information2 CONTENTS
Contents 4 Data Cube Computation and Data Generalization 3 4.1 Efficient Methods for Data Cube Computation............................. 3 4.1.1 A Road Map for Materialization of Different Kinds of Cubes.................
More informationDHANALAKSHMI COLLEGE OF ENGINEERING, CHENNAI
DHANALAKSHMI COLLEGE OF ENGINEERING, CHENNAI Department of Information Technology IT6702 Data Warehousing & Data Mining Anna University 2 & 16 Mark Questions & Answers Year / Semester: IV / VII Regulation:
More informationData Warehouse Logical Design. Letizia Tanca Politecnico di Milano (with the kind support of Rosalba Rossato)
Data Warehouse Logical Design Letizia Tanca Politecnico di Milano (with the kind support of Rosalba Rossato) Data Mart logical models MOLAP (Multidimensional On-Line Analytical Processing) stores data
More information1 DATAWAREHOUSING QUESTIONS by Mausami Sawarkar
1 DATAWAREHOUSING QUESTIONS by Mausami Sawarkar 1) What does the term 'Ad-hoc Analysis' mean? Choice 1 Business analysts use a subset of the data for analysis. Choice 2: Business analysts access the Data
More informationDecision Support, Data Warehousing, and OLAP
Decision Support, Data Warehousing, and OLAP : Contents Terminology : OLAP vs. OLTP Data Warehousing Architecture Technologies References 1 Decision Support and OLAP Information technology to help knowledge
More informationAssociation mining rules
Association mining rules Given a data set, find the items in data that are associated with each other. Association is measured as frequency of occurrence in the same context. Purchasing one product when
More informationData Warehouse and Data Mining
Data Warehouse and Data Mining Lecture No. 07 Terminologies Naeem Ahmed Email: naeemmahoto@gmail.com Department of Software Engineering Mehran Univeristy of Engineering and Technology Jamshoro Database
More informationMining Association Rules in OLAP Cubes
Mining Association Rules in OLAP Cubes Riadh Ben Messaoud, Omar Boussaid, and Sabine Loudcher Rabaséda Laboratory ERIC University of Lyon 2 5 avenue Pierre Mès-France, 69676, Bron Cedex, France rbenmessaoud@eric.univ-lyon2.fr,
More informationValue Added Association Rules
Value Added Association Rules T.Y. Lin San Jose State University drlin@sjsu.edu Glossary Association Rule Mining A Association Rule Mining is an exploratory learning task to discover some hidden, dependency
More informationMarket baskets Frequent itemsets FP growth. Data mining. Frequent itemset Association&decision rule mining. University of Szeged.
Frequent itemset Association&decision rule mining University of Szeged What frequent itemsets could be used for? Features/observations frequently co-occurring in some database can gain us useful insights
More informationUsing Pattern-Join and Purchase-Combination for Mining Web Transaction Patterns in an Electronic Commerce Environment
Using Pattern-Join and Purchase-Combination for Mining Web Transaction Patterns in an Electronic Commerce Environment Ching-Huang Yun and Ming-Syan Chen Department of Electrical Engineering National Taiwan
More informationChapter 4 Data Mining A Short Introduction
Chapter 4 Data Mining A Short Introduction Data Mining - 1 1 Today's Question 1. Data Mining Overview 2. Association Rule Mining 3. Clustering 4. Classification Data Mining - 2 2 1. Data Mining Overview
More informationApriori Algorithm. 1 Bread, Milk 2 Bread, Diaper, Beer, Eggs 3 Milk, Diaper, Beer, Coke 4 Bread, Milk, Diaper, Beer 5 Bread, Milk, Diaper, Coke
Apriori Algorithm For a given set of transactions, the main aim of Association Rule Mining is to find rules that will predict the occurrence of an item based on the occurrences of the other items in the
More informationFrequent Pattern Mining. Based on: Introduction to Data Mining by Tan, Steinbach, Kumar
Frequent Pattern Mining Based on: Introduction to Data Mining by Tan, Steinbach, Kumar Item sets A New Type of Data Some notation: All possible items: Database: T is a bag of transactions Transaction transaction
More informationPESIT- Bangalore South Campus Hosur Road (1km Before Electronic city) Bangalore
Data Warehousing Data Mining (17MCA442) 1. GENERAL INFORMATION: PESIT- Bangalore South Campus Hosur Road (1km Before Electronic city) Bangalore 560 100 Department of MCA COURSE INFORMATION SHEET Academic
More informationChapter 5, Data Cube Computation
CSI 4352, Introduction to Data Mining Chapter 5, Data Cube Computation Young-Rae Cho Associate Professor Department of Computer Science Baylor University A Roadmap for Data Cube Computation Full Cube Full
More information2. Discovery of Association Rules
2. Discovery of Association Rules Part I Motivation: market basket data Basic notions: association rule, frequency and confidence Problem of association rule mining (Sub)problem of frequent set mining
More informationAC-Close: Efficiently Mining Approximate Closed Itemsets by Core Pattern Recovery
: Efficiently Mining Approximate Closed Itemsets by Core Pattern Recovery Hong Cheng Philip S. Yu Jiawei Han University of Illinois at Urbana-Champaign IBM T. J. Watson Research Center {hcheng3, hanj}@cs.uiuc.edu,
More informationData Warehouse and Data Mining
Data Warehouse and Data Mining Lecture No. 04-06 Data Warehouse Architecture Naeem Ahmed Email: naeemmahoto@gmail.com Department of Software Engineering Mehran Univeristy of Engineering and Technology
More informationCHAPTER 8: ONLINE ANALYTICAL PROCESSING(OLAP)
CHAPTER 8: ONLINE ANALYTICAL PROCESSING(OLAP) INTRODUCTION A dimension is an attribute within a multidimensional model consisting of a list of values (called members). A fact is defined by a combination
More informationA MAS Based ETL Approach for Complex Data
A MAS Based ETL Approach for Complex Data O. Boussaid, F. Bentayeb, J. Darmont Abstract : In a data warehousing process, the phase of data integration is crucial. Many methods for data integration have
More information