Data warehousing and data mining are both popular technologies in recent years.


Chapter 1 Introduction

Data warehousing and data mining have both become popular technologies in recent years. Data warehousing is an information infrastructure that stores and integrates different data sources into a consistent repository; through OLAP (On-Line Analytical Processing) tools, business managers can analyze these data from various perspectives to discover information that is valuable for strategic decisions. Data mining, on the other hand, is the exploration and analysis of data, automatically or semi-automatically, to discover meaningful patterns and rules. From the business viewpoint, integrating these two technologies allows a corporation to understand its customers' behaviors and to use this information to gain a competitive advantage in the market.

Among the various patterns of interest to the data mining research community, association rules have attracted great attention recently. An association rule is a rule of the form A => B (sup = s%, conf = c%), which reveals the co-occurrence of two itemsets A and B. An example is PC => Laser Printer (sup = 30%, conf = 80%), which means that 30% of customers buy a PC and a laser printer together, and that 80% of the customers who buy a PC also buy a laser printer.

Mining association rules from a large database is a data-intensive and computation-intensive task. To reduce the complexity of association mining, researchers have proposed the concept of integrating data warehousing systems and association mining algorithms. For example, the DBMiner system [22] developed by J. Han and his research team adopts an OLAP-based association mining approach. A similar paradigm was presented in [22]. The primary problem of the OLAP-based approach is that the OLAP data cube is not feasible for on-line association mining: excessive effort is still required to complete the task. For this reason, Lin et al. [15] proposed the concept of the OLAM (On-Line Association Mining) cube, an extension of the iceberg cube [3] that stores frequent multidimensional itemsets. They also proposed a framework for an on-line multidimensional association rule mining system, called OMARS, which gives users an environment for executing OLAP-like queries to mine association rules from data warehouses efficiently.

This thesis is a companion work toward the implementation of OMARS. In particular, it addresses the problem of selecting appropriate OLAM cubes to materialize and store in OMARS. In accordance with the mining algorithms proposed in OMARS, it also develops a suitable model for evaluating the cost of selecting data cubes to materialize.

1.1 Contributions

The main contributions of this thesis are as follows:

1. We explore the dependency between OLAM cubes with regard to association queries, thereby devising the structure of the OLAM lattice.

2. We develop a model for evaluating the cost of answering association queries using materialized OLAM cubes, which is a preliminary step for OLAM cube selection.

3. We modify and implement some state-of-the-art heuristic algorithms, and compare these algorithms to evaluate their effectiveness.

1.2 Thesis Organization

This thesis is organized as follows. Chapter 2 describes past research and related work on data warehousing and data mining technologies. Chapter 3 briefly describes the OMARS framework. Chapter 4 formulates our OLAM cube selection problem. Chapter 5 describes the algorithm analysis and the cost model. Chapter 6 explains our algorithms, and Chapter 7 presents the experimental results of this research. Finally, Chapter 8 concludes our work and points out some future research directions.

Chapter 2 Background and Related Work

2.1 Data Warehouse and OLAP

Data Warehouse

As coined by W. H. Inmon, the term data warehouse refers to a subject-oriented, integrated, time-variant and nonvolatile collection of data in support of management's decision-making process [11]. In this regard, a data warehouse is a database dedicated to supporting decision making. According to the demands of analysts, data from different databases are extracted and transformed into the data warehouse. When users execute queries, the system only needs to search the data warehouse instead of the source databases, which saves users considerable query processing time.

A data warehouse system is composed of three primary parts:

1. The source databases in the back end: The data are collected from various sources, internal or external, legacy or operational, and any change to these sources is continually monitored by several modules called monitors/wrappers.

2. The data warehouse and data marts in the core: The reconciled data are stored in the data warehouse and data marts, which are the central repository for the whole system.

3. The analysis tools in the front end: The analysis tools supported in the front end are usually OLAP, query/tabulation tools, and data mining software.

The typical structure of a data warehouse is illustrated in Figure 2.1.

[Figure 2.1. A typical architecture of data warehouse [11]. Components shown: data sources (operational databases, external sources); monitors/wrappers; extract, clean, transform, load, refresh; data warehouse, data mart, metadata; monitoring and administration; OLAP servers (serve); tools for analysis, query/reporting, and data mining.]

On-Line Analytical Processing (OLAP)

Although the data stored in a data warehouse have been cleaned, filtered, and integrated, transforming them into useful strategic information still takes much time because of the massive amount of data involved. The concept of On-Line Analytical Processing (OLAP) [4] refers to the process of creating and managing multidimensional data for analysis and visualization. To provide fast, multidimensional analysis of the data in a data warehouse, the OLAP tool precomputes aggregations over the data and organizes the result as a data cube composed of several dimensions, each representing one of the user's analysis perspectives.

The typical operations provided by OLAP include roll-up, drill-down, slice and dice, and pivot [8]. The roll-up operation performs aggregation on a data cube, either by climbing up a concept hierarchy for a dimension or by dimension reduction. Drill-down is the reverse of roll-up: it navigates from less detailed data to more detailed data. The slice operation performs a selection on one dimension of the given cube, resulting in a subcube, while the dice operation defines a subcube by performing a selection on two or more dimensions. The pivot operation, also called rotate, is a visualization operation that rotates the data axes in view in order to provide an alternative presentation of the data. These OLAP operations are illustrated in Figure 2.2.

2.2 Data Warehouse Data Model

Because data warehouse systems require a concise, subject-oriented schema that facilitates on-line data analysis, the entity-relationship data model generally used in relational database systems is not suitable for them. Instead, the most popular data model for a data warehouse is the multidimensional data model. Two common relational schemas that facilitate multidimensional analysis are the star schema and the snowflake schema.

[Figure 2.2. The typical operations of OLAP. Panels show slice, dice, pivot, roll-up, and drill-down applied to a cube with dimensions Product (P1 to P6), Customer (C1 to C4), and Supplier (S1 to S4).]

2.2.1 Star Schema

The star schema, proposed by Kimball [12], is the most popular dimensional model used in the data warehouse community. A star schema consists of a fact table and several dimension tables. The fact table stores a list of foreign keys that correspond to the dimension tables, together with the numeric measures of interest to the user. Each dimension table contains a set of attributes. Moreover, the attributes within a dimension table may form either a hierarchy (total order) or a lattice (partial order). An example of a star schema is depicted in Figure 2.3, and its schema hierarchy is illustrated in Figure 2.4.

[Figure 2.3. An example of star schema for sales. Fact table Sales (Customer_ID, Time_ID, Product_ID, Quantity); dimension tables Customer (Customer_ID, City, Education), Time (Time_ID, Date, Month), and Product (Product_ID, Category).]

[Figure 2.4. An example of schema hierarchy for sales star. Node levels from top to bottom: All, All, All; Category, Month, City, Education; Product_ID, Date, Customer_ID; Time_ID.]

2.2.2 Snowflake Data Model

The snowflake schema is a variant of the star schema in which some dimension tables are normalized, thereby further splitting the data into additional individual and hierarchical tables. An example of the snowflake data model is depicted in Figure 2.5.

[Figure 2.5. An example of snowflake schema for sales. Fact table Sales (Customer_ID, Time_ID, Product_ID, Quantity); dimension tables Customer (Customer_ID, City, Education), Time (Time_ID, MID, Date), Month (MID, Month), and Product (Product_ID, Category).]

The major difference between the snowflake schema and the star schema is that the dimension tables of the snowflake model may be kept in normalized form to reduce redundancy. As a result, a snowflake schema is easier to maintain and takes less storage space than a star schema. On the other hand, the star schema can integrate schema hierarchies into a single dimension table, so that no join operation is incurred when traversing the hierarchy of a dimension. Hence, the star schema data model is more popular than the snowflake schema data model.

2.3 Association Rule Mining

Association Rules

Association rule mining is one of the prominent activities in the data mining community. Its goal is to search for interesting relationships among items in a given data set. For example, the information that customers who purchase diapers also tend to buy beer at the same time is represented by the association rule below:

Diaper => Beer [sup = 2%, conf = 60%]

Rule support and confidence are two measures of rule interestingness. A support of 2% means that 2% of customers purchase diapers and beer together. A confidence of 60% means that 60% of the customers who purchase diapers also buy beer. Typically, an association rule is considered interesting if it satisfies a minimum support threshold and a minimum confidence threshold that are set by users or domain experts.

The process of association rule mining can be divided into two steps:

1. Frequent itemset generation: In this step, all itemsets with support greater than the minimum support threshold are discovered.

2. Rule construction: After all frequent itemsets have been generated, candidate rules are formed from them. The confidence of a rule must be greater than the minimum confidence threshold; the rules that pass this test are the discovered association rules.
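The two-step process above can be sketched in a few lines of Python. This is only a minimal illustration: it enumerates itemsets by brute force rather than with any optimized algorithm, the function names and the toy transactions are hypothetical, and the thresholds are assumed to be inclusive (>=).

```python
from itertools import combinations


def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def mine_rules(transactions, min_sup, min_conf):
    """Brute-force two-step mining: frequent itemsets first, then rules."""
    items = sorted(set().union(*transactions))

    # Step 1: frequent itemset generation (exhaustive enumeration).
    frequent = {}
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            sup = support(combo, transactions)
            if sup >= min_sup:  # assumption: inclusive threshold
                frequent[frozenset(combo)] = sup

    # Step 2: rule construction. Every subset of a frequent itemset is
    # itself frequent, so its support is already in `frequent`.
    rules = []
    for itemset, sup in frequent.items():
        for k in range(1, len(itemset)):
            for lhs in combinations(sorted(itemset), k):
                lhs = frozenset(lhs)
                conf = sup / frequent[lhs]
                if conf >= min_conf:
                    rules.append((set(lhs), set(itemset - lhs), sup, conf))
    return rules


# Hypothetical toy data, not taken from the thesis.
toy = [
    frozenset({"diaper", "beer"}),
    frozenset({"diaper", "beer", "milk"}),
    frozenset({"diaper", "milk"}),
    frozenset({"beer"}),
    frozenset({"milk"}),
]
rules = mine_rules(toy, min_sup=0.4, min_conf=0.6)
```

With these toy transactions, the rule diaper => beer has support 2/5 and confidence (2/5)/(3/5) = 2/3, so it passes both thresholds.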

The most popular and influential association mining algorithm is Apriori [2], which uses the prior knowledge of frequent k-itemsets to generate candidate (k+1)-itemsets. When the maximum length of the frequent itemsets is l, Apriori needs l passes over the database. Since the Apriori algorithm spends much time generating candidate itemsets and counting the support of each itemset, many variant algorithms have been proposed to improve the efficiency of the mining process.

Multi-dimensional Association Rules

The concept of multi-dimensional association rules was first proposed by H. Zhu [22]. It is used to describe associations between data values in a data warehouse, where the data schema is composed of multiple dimensions and each dimension may contain many attributes. Following the work in [22], we can divide multi-dimensional association rules into three types:

1. Inter-dimensional association rule: This is an association among a set of dimensions. For example, suppose an OLAP cube is composed of three dimensions, Product, Supplier, and Customer, with the data listed in Table 2.1. An inter-dimensional association rule is:

Supplier("Hong Kong"), Product("Sport Wear") => Customer("John")

2. Intra-dimensional association rule: This is an association among items that come from one dimension. From Table 2.1, a possible intra-dimensional association rule is:

Product("Sport Wear") => Product("Tents")

3. Hybrid association rule: This is an association among a set of dimensions in which some items in the rule come from the same dimension. It can be regarded as a combination of inter-dimensional and intra-dimensional associations. According to Table 2.1, a hybrid association rule is:

Product("Sport Wear"), Supplier("Hong Kong") => Product("Tents")

Table 2.1. A relational representation of OLAP cube

Supplier   Product          Customer   Count
HongKong   Sport Wear       John
HongKong   Sport Wear       Mary
HongKong   Water Purifier   John
Mexico     Alert Devices    Peter
Mexico     Carry Bags       Peter
Mexico     Carry Bags       Bill
Mexico     Tents            Sue
Mexico     Tents            Mary
Seattle    Carry Bags       John
Seattle    Sport Wear       Peter
Seattle    Sport Wear       John
Seattle    Water Purifier   Bill
Tokyo      Carry Bags       Sue
Tokyo      Sport Wear       Bill
Tokyo      Tents            Sue
Tokyo      Alert Devices    John

Related Work

Data Cube

The concept of the data cube was first proposed by Gray et al. [6]. It allows analysts to view the data stored in a data warehouse from various aspects and to perform multidimensional analysis. Each cell in a data cube represents a measured value. For example, consider a sales data cube with three dimensions, Product, Supplier, and Customer, and one measure, Sales_total. This cube is depicted in Figure 2.6 and can be expressed as the following SQL query:

    SELECT Product, Supplier, Customer, SUM(Sales) AS Sales_total
    FROM Sales_Fact
    GROUP BY Product, Supplier, Customer;

[Figure 2.6. An example of data cube. Axes: Product (p1 to p6), Supplier (s1 to s4), Customer (c1 to c4).]

Cube Selection Problem

To accelerate query processing, it is important to select the most suitable cubes to materialize. In general, there are three options:

1. Materialize all data cubes: This method gives the lowest query time but needs the largest storage space, because every cube has to be materialized.

2. Materialize nothing: This method needs the least storage space but gives the largest query time, because no cube is materialized.

3. Materialize a fraction of the data cubes: This method selects a part of the data cubes to materialize. Selecting the most suitable cubes to materialize under a space constraint is difficult, however; indeed, it has been proved to be an NP-hard problem [9].

In terms of query time alone, the best option is to materialize all data cubes, but the space limit of the data warehouse prevents us from doing so. On the other hand, if we materialize nothing, queries take too much time. Therefore, we should try to select the most suitable cubes to materialize, even though this problem is NP-hard. The literature contains substantial contributions on this problem, which can be classified into three main categories:

1. Heuristic method: This category is mainly based on the greedy paradigm. Harinarayan et al. [9] were the first to consider the problem of selecting materialized views to support multidimensional analysis in OLAP. They proposed a lattice model and provided a greedy algorithm to solve the problem. Gupta et al. [7] further extended their work to include index selection. Ezeife [5] considered the same problem but proposed a uniform approach using a more detailed cost model. Shukla et al. [17] proposed a modified greedy algorithm that selects cubes according to cube size only. Their algorithm was shown to achieve the same quality as Harinarayan's greedy method while being more efficient.

2. Exhaustive method: The work in [19] assumed that all queries should be answered solely by the materialized views, with or without rewriting the users' queries. The authors modeled the problem as a state space optimization problem and provided exhaustive and heuristic algorithms that do not take the storage constraint into account. Soutyrina and Fotouhi [18] proposed a dynamic programming algorithm to solve the problem, which can yield the optimal set of cubes.

3. Genetic method: Some work has been devoted to applying genetic algorithms to the view selection problem [10, 20, 21]. Following the AND-OR view graph used in [7], Horng et al. [10] proposed a genetic algorithm that selects an appropriate set of views to minimize the query cost and the view maintenance cost. A similar genetic algorithm with a different repair scheme was proposed in [13]; it uses a greedy repair method to correct infeasible solutions instead of using a penalty function to reduce their fitness. Research has shown that the repair scheme deals with infeasible solutions better than the penalty function does [16]. Rather than optimizing the view selection for a given query processing plan, the work in [20, 21] focuses on finding an optimal set of processing plans for multiple queries. A solution in their genetic algorithm thus represents a set of processing plans for the given queries.
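To make the greedy paradigm of the heuristic category more concrete, the following Python sketch selects views under a space budget. It is written in the spirit of the lattice-based greedy approach but is not the exact algorithm of any cited work: the benefit measure (total reduction in answering cost), the function name `greedy_select`, and the cube sizes are all illustrative assumptions.

```python
def greedy_select(sizes, answers, top, budget):
    """Greedily pick views to materialize under a space budget.

    sizes   : view name -> size (also used as the cost of scanning it)
    answers : view name -> set of views whose queries it can answer
    top     : the base view, assumed to be always available
    budget  : maximum total size of the additionally chosen views
    """
    chosen = {top}
    # cost[w]: size of the cheapest chosen view that can answer w
    cost = {w: sizes[top] for w in sizes}
    used = 0
    while True:
        best, best_gain = None, 0
        for v in sizes:
            if v in chosen or used + sizes[v] > budget:
                continue
            # Benefit of v: total cost reduction over the views it answers.
            gain = sum(max(cost[w] - sizes[v], 0) for w in answers[v])
            if gain > best_gain:
                best, best_gain = v, gain
        if best is None:  # nothing fits or nothing helps any more
            return chosen
        chosen.add(best)
        used += sizes[best]
        for w in answers[best]:
            cost[w] = min(cost[w], sizes[best])


# Hypothetical lattice over three dimensions a, b, c; a view named by a
# set of letters can answer every view named by a subset of those letters.
sizes = {"abc": 100, "ab": 50, "ac": 75, "bc": 20,
         "a": 30, "b": 10, "c": 15, "": 1}
answers = {v: {w for w in sizes if set(w) <= set(v)} for v in sizes}
selected = greedy_select(sizes, answers, top="abc", budget=70)
```

In this toy lattice the first pick is "bc" (it lowers the cost of four views from 100 to 20) and the second is "ab", after which the budget of 70 is exhausted.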

Chapter 3 The OMARS Framework

In this chapter, we give a brief review of the OMARS framework, because our research deals with the problem of selecting the most suitable OLAM cubes to materialize in this system. The OMARS framework, illustrated in Figure 3.1, integrates the data warehouse, on-line analytical processing, and the OLAM cube. Its objective is to provide an efficient and convenient platform on which users can perform OLAP-like association explorations. Through the OMARS system, users can perform multidimensional association mining queries, interactively change the dimensions that comprise the associations, and refine constraints such as minimum support and minimum confidence. The functionality of each component is described in the following sections.

[Figure 3.1. The OMARS framework [15]. Components shown: Data Warehouse, OLAP Cube, Auxiliary Cube, Cube Manager, OLAM Mediator, OLAM Engine, OLAM Cube.]

3.1 OLAM Cube and Auxiliary Cube

The OLAM cube is a concept proposed by Lin et al. [15]. It stores the frequent itemsets whose supports are greater than or equal to a preset minimum support, denoted as prims. In this regard, the OLAM cube can be regarded as an extension of the iceberg cube. The main difference is that the iceberg cube stores information on frequent itemsets derived from inter-dimensional associations only, while the OLAM cube is feasible for all three types of associations. When the minsup of a user's query is greater than or equal to prims, the OLAM cube accelerates the mining of association rules, because it already stores the frequent itemsets with supports greater than or equal to prims.

Although the OLAM cube can be used to generate association rules efficiently when minsup is at least prims, it cannot handle the situation in which minsup is lower than prims. To alleviate this problem, the OMARS system embraces another type of data cube, called the auxiliary cube. The auxiliary cube stores the infrequent itemsets of length $K_\alpha$, where $K_\alpha$ denotes the cutting level employed by the mining algorithm CBW on used in OMARS.

3.2 Cube Manager

This component is responsible for three different tasks:

1. Cube selection: This refers to selecting the most suitable cubes to materialize, in order to minimize the query cost and/or maintenance cost under the constraint of limited storage space.

2. Cube computation: This task deals with efficiently generating the set of materialized cubes produced by the cube selection module.

3. Cube maintenance: This task concerns how to maintain the materialized cubes when the data in the data warehouse are updated.

Our research in this thesis deals with the implementation of the cube selection task of Cube Manager. We discuss this in the next chapter.

3.3 OLAM Mediator and OLAM Engine

OLAM Engine is the interface between the OMARS system and the users. It accepts users' queries and invokes the appropriate algorithm to mine multidimensional association rules. When OLAM Engine receives a user's query, it analyzes the query and forwards the relevant information to OLAM Mediator, which then looks for the most relevant cube and returns the result to OLAM Engine. Here the most relevant cube denotes the materialized OLAM cube that can answer the query at the smallest cost.

The search performed by OLAM Mediator has two possible outcomes, each of which is handled in a different way:

1. OLAM Mediator finds the most relevant cube: In this case, OLAM Mediator further compares the minsup of the user's query with prims and distinguishes the following two cases:

i. minsup ≥ prims: The discovered OLAM cube is capable of answering the query. Return this cube to OLAM Engine.

ii. minsup < prims: The discovered OLAM cube cannot answer the query without the aid of the auxiliary cube. Return the OLAM cube and its accompanying auxiliary cube to OLAM Engine.

2. OLAM Mediator cannot find the cube: In this case, OLAM Mediator searches the OLAP cube repository to determine whether there is an OLAP cube whose data can be used to answer the query. If so, it returns the discovered OLAP cube to OLAM Engine; otherwise, it notifies OLAM Engine to execute the mining procedure afresh from the data warehouse.

We discuss these cases in more detail and devise the cost evaluation for each case in Chapter 5.
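The case analysis above amounts to a small dispatch routine. The Python sketch below restates it under stated assumptions: the function name `mediate`, the action labels, and the convention that a missing cube is passed as `None` are all hypothetical and are not part of the OMARS interface.

```python
def mediate(minsup, prims, olam_cube=None, auxiliary_cube=None, olap_cube=None):
    """Decide what OLAM Mediator hands back to OLAM Engine.

    Each cube argument is the result of a repository lookup, or None if no
    suitable cube was found. Returns (action, list_of_cubes).
    """
    if olam_cube is not None:
        if minsup >= prims:
            # Case 1.i: the OLAM cube alone can answer the query.
            return "use_olam_cube", [olam_cube]
        # Case 1.ii: the auxiliary cube is needed as well.
        return "use_olam_and_auxiliary_cube", [olam_cube, auxiliary_cube]
    if olap_cube is not None:
        # Case 2, first branch: fall back to a usable OLAP cube.
        return "use_olap_cube", [olap_cube]
    # Case 2, second branch: mine afresh from the data warehouse.
    return "mine_from_data_warehouse", []
```

For instance, `mediate(0.05, 0.02, olam_cube="C1")` falls under case 1.i, while the same call with `minsup=0.01` also returns the auxiliary cube slot.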

Chapter 4 Problem Formulation

In this chapter, we first elaborate on the correspondence between OLAM queries and OLAM cubes, and describe the concept of the OLAM lattice. We then define the problem of OLAM cube selection.

4.1 OLAM Cube and OLAM Query

As described in Chapter 3, an OLAM cube stores frequent itemsets in order to accelerate the mining of association rules. To clarify the structure of an OLAM cube and its relationship to multidimensional associations, we first introduce a four-tuple mining meta-pattern that specifies the form of a multidimensional association query.

Definition 4.1. Suppose a star schema S contains a fact table and m dimension tables $\{D_1, D_2, \ldots, D_m\}$. Let T be a joined table from S composed of attributes $a_1, a_2, \ldots, a_r$, such that for any $a_i, a_j \in Attr(D_k)$ there is no hierarchical relation between $a_i$ and $a_j$, $1 \le i, j \le r$, $1 \le k \le m$. Here $Attr(D_k)$ denotes the attribute set of dimension table $D_k$. A meta-pattern of multidimensional associations from T is defined as follows:

MP: $\langle t_G, t_M, ms, mc \rangle$,

where ms denotes the minimum support, mc the minimum confidence, $t_G$ the group of transaction attributes, and $t_M$ the group of item attributes, with $t_G, t_M \subseteq \{a_1, a_2, \ldots, a_r\}$ and $t_G \cap t_M = \emptyset$.

This meta-pattern can express the three kinds of multidimensional association rules defined in [22]: intra-association, inter-association, and hybrid association. For example, consider a joined table T involving three dimensions from the star schema in Figure 2.3. The content of T is shown in Table 4.1. If the item attribute set $t_M$ consists of only one attribute, then the meta-pattern corresponds to an intra-association.

Table 4.1. A joined table T from star schema

Tid  City     Education    Date  Month  Product_ID  Category
1    Taipei   Bachelor     7/12  July   1           A
2    Taipei   High school  7/12  July   2           A
3    N.Y.     Master       7/18  July   1           A
4    Toronto  Master       8/2   Aug.   3           B
5    Seattle  Master       8/3   Aug.   4           B
6    N.Y.     High school  8/2   Aug.   1           A
7    Toronto  High school  7/4   July   1           A
8    Seattle  Bachelor     7/18  July   5           C
9    Taipei   Bachelor     8/2   Aug.   2           A
10   N.Y.     Bachelor     9/1   Sep.   3           B

For instance, let $t_G$ = {City} and $t_M$ = {Category}. We may have the following intra-association rule:

(Category, "A") => (Category, "B") (sup = 40%, conf = 80%)

Note that to facilitate this mining task, the table T has to be transformed, implicitly or explicitly, into a transaction table as follows:

City     Category
Taipei   A
N.Y.     A, B
Toronto  A, B
Seattle  B, C

On the other hand, if $|t_M| \ge 2$, then the resulting associations are inter-associations or hybrid associations. For example, let $t_G = \emptyset$ and $t_M$ = {Education, Month}. We have an inter-association:

(Education, "Master") => (Month, "July") (sup = 40%, conf = 80%)

As with intra-association, the table T has to be transformed into the following form:

Tid  Education    Month
1    Bachelor     July
2    High school  July
3    Master       July
4    Master       Aug.
5    Master       Aug.
6    High school  Aug.
7    High school  July
8    Bachelor     July
9    Bachelor     Aug.
10   Bachelor     Sep.

Note that in this case the transactions are the same as the tuples of the original table T. If instead $t_G$ = {City}, we have a hybrid association:

(Education, "Master"), (Month, "July") => (Month, "Aug.") (sup = 40%, conf = 80%)

For this case, the transformed table is:

City     Education                       Month
Taipei   Bachelor, High school           July, Aug.
N.Y.     Master, High school, Bachelor   July, Aug., Sep.
Toronto  Master, High school             Aug., July
Seattle  Master, Bachelor                Aug., July

Having explained the mining patterns, we now clarify the structure of the OLAM cube.

Definition 4.2. Given a meta-pattern MP with transaction attribute set $t_G$ and item attribute set $t_M$, and a preset minimum support prims, the corresponding OLAM cube, $MCube(t_G, t_M)$, is the set of frequent itemsets with supports greater than or equal to prims.

The following examples illustrate the corresponding OLAM cube for different kinds of multidimensional association rules.

Example 4.1. An intra-dimensional OLAM cube: Let $t_G$ = {City}, $t_M$ = {Category}, and prims = 2. From Table 4.1, the resulting OLAM cube is shown in Table 4.2.

Table 4.2. An example of intra OLAM cube expressed in table

Category  Support
A
B
A, B

Example 4.2. An inter-dimensional OLAM cube: Let $t_G = \emptyset$, $t_M$ = {Education, Month}, and prims = 2. From Table 4.1, the resulting OLAM cube is shown in Table 4.3.

Table 4.3. An example inter-dimensional OLAM cube expressed in table

Education    Month  Support
Bachelor     -
High school  -
Master       -
-            July
-            Aug.
Bachelor     July
High school  July
Master       Aug.

Example 4.3. A hybrid-dimensional OLAM cube: Let $t_G$ = {City}, $t_M$ = {Education, Month}, and prims = 3. From Table 4.1, the resulting OLAM cube is shown in Table 4.4.

Table 4.4. An example hybrid-dimensional OLAM cube expressed in table

Education    Month       Support
Bachelor     -
High school  -
Master       -
-            July
-            Aug.
-            July, Aug.
Bachelor     July
Bachelor     Aug.
High school  July
High school  Aug.
Master       July
Master       Aug.
Bachelor     July, Aug.
High school  July, Aug.
Master       July, Aug.
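The construction used in Examples 4.1 to 4.3 can be sketched in Python as follows. This is an illustrative reading of Definition 4.2, not the mining algorithm used in OMARS: it groups the rows of Table 4.1 by $t_G$, counts every itemset by brute force, and assumes that support is an absolute transaction count with an inclusive threshold (count >= prims). The names `to_transactions` and `olam_cube` are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations

COLUMNS = ("Tid", "City", "Education", "Date", "Month", "Product_ID", "Category")
TABLE_4_1 = [dict(zip(COLUMNS, row)) for row in [
    (1, "Taipei", "Bachelor", "7/12", "July", 1, "A"),
    (2, "Taipei", "High school", "7/12", "July", 2, "A"),
    (3, "N.Y.", "Master", "7/18", "July", 1, "A"),
    (4, "Toronto", "Master", "8/2", "Aug.", 3, "B"),
    (5, "Seattle", "Master", "8/3", "Aug.", 4, "B"),
    (6, "N.Y.", "High school", "8/2", "Aug.", 1, "A"),
    (7, "Toronto", "High school", "7/4", "July", 1, "A"),
    (8, "Seattle", "Bachelor", "7/18", "July", 5, "C"),
    (9, "Taipei", "Bachelor", "8/2", "Aug.", 2, "A"),
    (10, "N.Y.", "Bachelor", "9/1", "Sep.", 3, "B"),
]]


def to_transactions(rows, t_g, t_m):
    """Group rows by the transaction attributes t_g; items come from t_m.

    With an empty t_g every row (Tid) is its own transaction.
    """
    groups = defaultdict(set)
    for row in rows:
        key = tuple(row[a] for a in t_g) if t_g else row["Tid"]
        groups[key].update((a, row[a]) for a in t_m)
    return list(groups.values())


def olam_cube(rows, t_g, t_m, prims):
    """Return {itemset: count} for itemsets with count >= prims (assumption)."""
    counts = Counter()
    for items in to_transactions(rows, t_g, t_m):
        for k in range(1, len(items) + 1):
            for subset in combinations(sorted(items), k):
                counts[frozenset(subset)] += 1
    return {s: c for s, c in counts.items() if c >= prims}


intra = olam_cube(TABLE_4_1, ["City"], ["Category"], prims=2)
inter = olam_cube(TABLE_4_1, [], ["Education", "Month"], prims=2)
hybrid = olam_cube(TABLE_4_1, ["City"], ["Education", "Month"], prims=3)
```

Under these assumptions the three calls yield 3, 8, and 15 itemsets respectively, which agrees with the number of rows listed in Tables 4.2, 4.3, and 4.4.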

4.2 OLAM Lattice

In accordance with the definition of the OLAM cube, we can generate all possible OLAM cubes from the star schema, thereby forming an OLAM lattice. To provide hierarchical navigation and multidimensional exploration, the OMARS system [15] models the OLAM lattice as a three-layer structure. The first-layer lattice expresses the combinations of all dimensions. The second layer further exploits the inter-attribute combinations for each dimensional combination in the first-layer lattice. The third layer exploits all OLAM cubes corresponding to the meta-patterns derived from each subcube in the second layer. Note that the real OLAM cubes are stored in the third layer.

For example, consider the star schema illustrated in Figure 2.3. The first-layer lattice shown in Figure 4.1 is composed of eight possible dimensional combinations. After constructing the first-layer lattice, we choose the node composed of the customer and time dimensions and extend it to form the second-layer lattice shown in Figure 4.2. Each node of the second-layer lattice is constructed by attaching any attribute chosen from the selected dimensions. Finally, we extend the cube <(city, education), (date)> to form the third-layer lattice shown in Figure 4.3, in which * marks a transaction attribute. It can be observed that there is one OLAM cube corresponding to inter-association, (city, education, date); three OLAM cubes corresponding to hybrid associations, (city, education, *date), (city, *education, date), and (*city, education, date); and three cubes corresponding to intra-associations, (city, *education, *date), (*city, education, *date), and (*city, *education, date). Note that (*city, *education, *date) is shown only to complete the lattice structure; it is useless and will not be materialized.
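The third-layer nodes described above can be enumerated mechanically. The Python sketch below, with the hypothetical helper name `third_layer`, splits the attributes of a second-layer subcube into transaction attributes $t_G$ and item attributes $t_M$ in every possible way and labels each split using the rules stated in Section 4.1 (one item attribute gives an intra-association; two or more give an inter-association when $t_G$ is empty and a hybrid association otherwise).

```python
from itertools import combinations


def third_layer(attributes):
    """Yield (t_G, t_M, kind) for every node of the third-layer lattice."""
    attrs = tuple(attributes)
    for k in range(len(attrs) + 1):
        for t_g in combinations(attrs, k):
            t_m = tuple(a for a in attrs if a not in t_g)
            if not t_m:
                kind = "none"    # all attributes are transaction attributes
            elif len(t_m) == 1:
                kind = "intra"
            elif not t_g:
                kind = "inter"
            else:
                kind = "hybrid"
            yield t_g, t_m, kind


for t_g, t_m, kind in third_layer(("city", "education", "date")):
    print(kind, "t_G =", t_g, "t_M =", t_m)
```

For the subcube <(city, education), (date)> this produces eight nodes: one inter, three hybrid, three intra, and the node with no item attributes, which is never materialized.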

[Figure 4.1. The 1st layer OLAM lattice for the example star schema in Figure 2.3. Nodes: <Customer, Product, Time>; <Customer, Product, ->, <Customer, -, Time>, <-, Product, Time>; <Customer, -, ->, <-, Product, ->, <-, -, Time>; <-, -, ->.]

[Figure 4.2. The 2nd layer lattice derived from <Customer, -, Time> in the 1st layer. Nodes: <(city), (date)>, <(education), (date)>, <(education), (month)>, <(city), (month)>; <(city, education), (date)>, <(city, education), (month)>, <(city), (date, month)>, <(education), (date, month)>; <(city, education), (date, month)>.]

[Figure 4.3. The 3rd layer lattice derived from the subcube <(city, education), (date)> in the 2nd layer; * marks a transaction attribute. 0 transaction attributes (inter-association): (city, education, date). 1 transaction attribute (hybrid association): (city, education, *date), (city, *education, date), (*city, education, date). 2 transaction attributes (intra-association): (city, *education, *date), (*city, education, *date), (*city, *education, date). 3 transaction attributes: (*city, *education, *date).]

Because the real OLAM cubes are stored in the third-layer lattice, we can mine multidimensional association rules efficiently by materializing these OLAM cubes. From this three-layer lattice, we observe an attribute dependency, which is stated as follows:

Proposition 4.1. Consider two OLAM cubes, $MCube(t_{G_1}, t_{M_1})$ and $MCube(t_{G_2}, t_{M_2})$. If $t_{G_1} = t_{G_2}$ and $t_{M_2} \subseteq t_{M_1}$, then every itemset in $MCube(t_{G_2}, t_{M_2})$ must be a subset of an itemset in $MCube(t_{G_1}, t_{M_1})$, and these two itemsets have the same support value.

Example 4.4. Consider the table T in Table 4.1. Let $MCube(t_{G_1}, t_{M_1})$ be the cube illustrated in Table 4.4 and $MCube(t_{G_2}, t_{M_2})$ the cube illustrated in Table 4.5. Hence $t_{G_1} = t_{G_2}$ = {City}, $t_{M_1}$ = {Education, Month}, $t_{M_2}$ = {Education}, and prims = 3. It can be verified that every frequent itemset stored in $MCube(t_{G_2}, t_{M_2})$ is a subset of a frequent itemset in $MCube(t_{G_1}, t_{M_1})$, and that both itemsets have the same support value.

Table 4.5. An OLAM Cube

Education    Support
Bachelor
High school
Master

According to Proposition 4.1, there is a dependency between OLAM cubes in the third-layer lattice, which is formalized below.

Definition 4.3. Consider two OLAM cubes, $MCube(t_{G_1}, t_{M_1})$ and $MCube(t_{G_2}, t_{M_2})$. We say that $MCube(t_{G_2}, t_{M_2})$ is dependent upon $MCube(t_{G_1}, t_{M_1})$ if $t_{G_1} = t_{G_2}$ and $t_{M_2} \subseteq t_{M_1}$, and denote this as $MCube(t_{G_2}, t_{M_2}) \preceq MCube(t_{G_1}, t_{M_1})$.

One important consequence of Definition 4.3 is that if $MCube(t_{G_2}, t_{M_2}) \preceq MCube(t_{G_1}, t_{M_1})$, then all multidimensional queries that can be answered via $MCube(t_{G_2}, t_{M_2})$ can also be answered via $MCube(t_{G_1}, t_{M_1})$.

Furthermore, it should be noted that not all of the OLAM cubes derived in the lattice have to be materialized and stored, because the concept hierarchies defined over the attributes in the star schema make it possible to prune some redundant cubes.
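The dependency test of Definition 4.3 is a pair of set comparisons, as the Python sketch below shows. The second function, `project`, illustrates one way to read the consequence stated above: because both cubes group transactions by the same $t_G$, the itemsets of the dependent cube can be obtained by keeping only the itemsets of the larger cube whose attributes all lie in $t_{M_2}$. Both function names and the support values in the toy cube are illustrative assumptions.

```python
def depends_on(cube2, cube1):
    """True if cube2 = (t_G2, t_M2) is dependent upon cube1 = (t_G1, t_M1)."""
    (t_g2, t_m2), (t_g1, t_m1) = cube2, cube1
    return set(t_g2) == set(t_g1) and set(t_m2) <= set(t_m1)


def project(itemsets, t_m2):
    """Keep the itemsets whose items only use attributes in t_m2.

    `itemsets` maps frozensets of (attribute, value) pairs to supports.
    """
    keep = set(t_m2)
    return {s: sup for s, sup in itemsets.items()
            if all(attr in keep for attr, _ in s)}


# Illustrative fragment of a cube with t_G = {City}, t_M = {Education, Month}.
big = {
    frozenset({("Education", "Master")}): 3,
    frozenset({("Month", "July")}): 4,
    frozenset({("Education", "Master"), ("Month", "July")}): 3,
}
small = project(big, ["Education"])
```

Here `depends_on((["City"], ["Education"]), (["City"], ["Education", "Month"]))` holds, and `small` keeps only the itemset over Education with its support unchanged.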

Consider an OLAM cube $MCube(t_G, t_M)$. We observe that there are two different types of redundancy.

Proposition 4.2. Schema redundancy: Let $a_i, a_j \in t_G$. If $a_i$ and $a_j$ are in the same dimension and $a_j$ is an ancestor of $a_i$, then $MCube(t_G, t_M)$ is a redundancy of the cube $MCube(t_G - \{a_j\}, t_M)$.

Example 4.5. Consider the joined table in Table 4.1. Let $t_M$ = {Category}. The table that results from grouping by Date and Month as transaction attributes is shown in Table 4.6. Note that this table has the same transactions as the one obtained by grouping by Date alone, as shown in Table 4.7. Thus, the resulting cube MCube({Date, Month}, {Category}) is the same as MCube({Date}, {Category}).

Table 4.6. The resulting table by grouping {Date, Month} as transaction attributes for Table 4.1

Date  Month  Category
7/4   July   A
7/12  July   A
7/18  July   A, C
8/2   Aug.   A, B
8/3   Aug.   B
9/1   Sep.   B

Table 4.7. The resulting table by grouping {Date} as transaction attribute for Table 4.1
Date    Category
7/4     A
7/12    A
7/18    A, C
8/2     A, B
8/3     B
9/1     B

Proposition 4.3. Values redundancy: Let $a_i, a_j \in t_M$. If $a_i$ and $a_j$ are in the same dimension and $a_j$ is an ancestor of $a_i$, then $MCube(t_G, t_M)$ is a cube with values redundancy.

Example 4.6. Consider the joined table in Table 4.1. Let $t_G = \{City\}$, $t_M = \{Date, Month\}$ and prims = 2. The resulting OLAM cube is shown in Table 4.8. One can observe that the tuples marked with dotted lines in this table are redundant patterns, so the cube exhibits values redundancy. Note that when a cube exhibits values redundancy, the redundant patterns must be pruned during the generation of frequent itemsets.

Table 4.8. The resulting OLAM cube MCube({City}, {Date, Month})
Date Month support 7/18 8/ July Aug /18 July 2 7/18 8/2 Aug. July 2 3 8/2 Aug. 3 July, Aug. 4 7/18 July, Aug. 2 8/2 July, Aug. 3

In addition to the above observations, we observe that an OLAM cube is useless if it satisfies the following property.

Proposition 4.4. Useless property: Let $a_i \in t_G$ and $t_M = \{a_j\}$. If $a_i$ and $a_j$ are in the same dimension and $a_j$ is an ancestor of $a_i$, then $MCube(t_G, t_M)$ is a useless cube.

Example 4.7. Let $t_G = \{City, Date\}$ and $t_M = \{Month\}$. The table that results from Table 4.1 by grouping {City, Date} as transactions is shown in Table 4.9. One can observe that the cardinality of every transaction is 1. Therefore, we cannot find any association rule from this table.
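The three pruning conditions (Propositions 4.2 to 4.4) all reduce to looking for an attribute and one of its ancestors in the same dimension. The following Python sketch illustrates this under an assumed encoding: the `ANCESTORS` mapping and all function names are hypothetical, and the hierarchy contains only the Date/Month relation used in the examples.

```python
from typing import Dict, Set

# Hypothetical concept hierarchy: attribute -> its ancestors within the same dimension.
ANCESTORS: Dict[str, Set[str]] = {
    "Date": {"Month"},
    "Month": set(),
    "City": set(),
    "Category": set(),
}


def has_ancestor_pair(attrs: Set[str]) -> bool:
    """True if attrs contains some attribute together with one of its ancestors."""
    return any(ANCESTORS.get(a, set()) & attrs for a in attrs)


def schema_redundant(t_g: Set[str], t_m: Set[str]) -> bool:
    """Proposition 4.2: an attribute and its ancestor both appear in t_G."""
    return has_ancestor_pair(t_g)


def values_redundant(t_g: Set[str], t_m: Set[str]) -> bool:
    """Proposition 4.3: an attribute and its ancestor both appear in t_M."""
    return has_ancestor_pair(t_m)


def useless(t_g: Set[str], t_m: Set[str]) -> bool:
    """Proposition 4.4: t_M is a single attribute that is an ancestor of some attribute in t_G."""
    if len(t_m) != 1:
        return False
    (a_j,) = t_m
    return any(a_j in ANCESTORS.get(a_i, set()) for a_i in t_g)
```

Applied to the examples, MCube({Date, Month}, {Category}) is flagged as schema redundant, MCube({City}, {Date, Month}) as values redundant, and MCube({City, Date}, {Month}) as useless.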

Table 4.9. The resulting table by grouping {City, Date} as transaction attribute for Table 4.1
City       Date    Month
Toronto    7/4     July
Taipei     7/12    July
Taipei     8/2     Aug.
N.Y.       7/18    July
N.Y.       8/2     Aug.
Toronto    8/2     Aug.
Seattle    8/3     Aug.
N.Y.       9/1     Sep.

4.3 OLAM Cube Selection

We now proceed to give a formal definition of the OLAM cube selection problem. To this end, we introduce the symbols listed in the symbol table. Assume that an OLAM lattice L contains n OLAM data cubes $D = \{d_1, d_2, \ldots, d_n\}$, the set of users' queries is $Q = \{q_1, q_2, \ldots, q_m\}$, the set of query frequencies is $F = \{f_{q_1}, f_{q_2}, \ldots, f_{q_m}\}$, and the space constraint is S. The OLAM cube selection problem is denoted as a five-tuple $\theta = \{L, D, Q, F, S\}$. A solution to $\theta$ is a subset of D, say M, that minimizes the following cost function subject to the space constraint:

$$\min \sum_{i=1}^{m} f_{q_i} \cdot E(q_i, M) \quad \text{subject to} \quad \sum_{d \in M} |d| \le S.$$

The Symbol Table
Symbol         Definition
L              Lattice
D              Set of data cubes
d_n            n-th data cube
Q              Set of user queries
q_m            m-th user query
F              Set of user query frequencies
f_{q_i}        Frequency of the i-th query
S              Space constraint
M              Set of materialized cubes
E(q_i, M)      The total time to respond to the i-th query using the materialized cubes M
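As a minimal sketch of the selection objective, the Python below evaluates the frequency-weighted query cost of a candidate set M and checks the space constraint. The function names, the toy cost function `toy_E`, and every number in the example are assumptions made for illustration; the actual E(q, M) is the cost model developed in Chapter 5.

```python
from typing import Callable, Dict, FrozenSet

CostFn = Callable[[str, FrozenSet[str]], float]


def total_query_cost(freq: Dict[str, float], E: CostFn, M: FrozenSet[str]) -> float:
    """Objective: sum over all queries of f_q * E(q, M)."""
    return sum(f * E(q, M) for q, f in freq.items())


def is_feasible(M: FrozenSet[str], size: Dict[str, float], S: float) -> bool:
    """Space constraint: the total size of the materialized cubes must not exceed S."""
    return sum(size[d] for d in M) <= S


# Toy instance (all values hypothetical).
SIZE = {"d1": 40.0, "d2": 30.0}
FREQ = {"q1": 3.0, "q2": 1.0}
ANSWERS = {"q1": {"d1"}, "q2": {"d1", "d2"}}  # cubes able to answer each query


def toy_E(q: str, M: FrozenSet[str]) -> float:
    """Assumed cost: 10 if some materialized cube answers q, otherwise 100."""
    return 10.0 if ANSWERS[q] & M else 100.0
```

For this toy instance, materializing only `d1` answers both queries cheaply, while the empty set forces every query back to the expensive path.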

Chapter 5 Evaluation of OLAM Query Cost

5.1 Query Evaluation Flow

As stated previously, the primary task of the OLAM Engine is to generate association rules according to users' queries. After receiving a query, the OLAM Engine analyzes it, transfers the necessary information to the OLAM Mediator, and then waits for the best-matching cube from the OLAM Mediator. When the OLAM Mediator receives the query information from the OLAM Engine, it looks for the best-matching cube. First, the OLAM Mediator searches for the required OLAM cube. If the cube is found, the Mediator checks whether minsup ≥ prims. If so, it returns the found OLAM cube to the OLAM Engine; otherwise it returns the corresponding auxiliary cube of the found OLAM cube and notifies the OLAM Engine to perform association mining from the data warehouse with the aid of this auxiliary cube. On the other hand, if the OLAM Mediator cannot find any qualified OLAM cube to answer the user's query, it notifies the OLAM Engine to perform association mining from the data warehouse afresh. The procedure employed by the OLAM Mediator is depicted in Figure 5.1.

[Figure 5.1. The flow diagram of OLAM query. Flow: Start; OLAM query; is the required OLAM cube found? If no, the required OLAM cube does not exist. If yes, test minsup >= prims: if yes, return the OLAM cube; if no, return the OLAM cube and the auxiliary cube; End.]

It is worth mentioning that, for simplicity, we do not consider OLAP cubes in this study, although the OMARS system does take this kind of data cube into account in association mining. In accordance with the work flow of the OLAM Mediator and the OLAM Engine, our paradigm for evaluating OLAM query cost is shown below:

Procedure Evacost_OLAMQ(q)
begin
    Let q = <t_G, t_M, minsup>;
    found = OLAMQ_search(q, CQ);
    if found = TRUE then
        if prims ≤ minsup then
            cost = the cost for evaluating query q using OLAM cube CQ.MCube; /* case 1 */
        else
            cost = the cost for evaluating query q using CQ.MCube, auxiliary cube CQ.XCube and data warehouse; /* case 2 */
        end if
    else
        cost = the cost for evaluating query q using data warehouse; /* case 3 */
    end if
    return cost;
end
Figure 5.2. The procedure to compute the cost of a user's query

In summary, there are three different cases to be dealt with:
Case 1: evaluating the cost via the qualified OLAM cube.
Case 2: evaluating the cost via the OLAM cube, the auxiliary cube, and the data warehouse.
Case 3: evaluating the cost via the data warehouse.
The cost complexity evaluation for each case will be elaborated in the following

sections. We end this section with the description of OLAMQ_search.

Procedure OLAMQ_search(q, CQ)
begin
    found = FALSE;
    if MCube(q.t_G, q.t_M) is materialized then
        CQ.MCube = MCube(q.t_G, q.t_M);
        CQ.XCube = XCube(q.t_G, q.t_M);
        found = TRUE;
    end if
    CurQ = φ;
    for each MCube in the OLAM lattice do
        if MCube is materialized and MCube.t_G = q.t_G and MCube.t_M ⊇ q.t_M and (MCube.t_M ⊂ CurQ.t_M or CurQ = φ) then
            CurQ = MCube;
        end if
    end for
    if found then
        CQ.MCube = CurQ;
        CQ.XCube = XCube(q.t_G, CurQ.t_M);
    end if
    return found
end
Figure 5.3. Procedure OLAMQ_search

Example 5.1. Suppose the OMARS system stores the following three materialized OLAM cubes: $MCube(t_{G_1}, t_{M_1})$, where $t_{G_1} = \{City\}$ and $t_{M_1} = \{Education, Date\}$; $MCube(t_{G_2}, t_{M_2})$, where $t_{G_2} = \{City\}$ and $t_{M_2} = \{Education, Date, Category\}$;

$MCube(t_{G_3}, t_{M_3})$, where $t_{G_3} = \{Date\}$ and $t_{M_3} = \{City\}$; and prims = 3. We have three users' queries $q_1$, $q_2$, $q_3$, where $q_1.t_G = \{City\}$, $q_1.t_M = \{Education, Date\}$, and $q_1.ms = 4$; $q_2.t_G = \{City\}$, $q_2.t_M = \{Education, Date, Category\}$, and $q_2.ms = 2$; $q_3.t_G = \{Date\}$, $q_3.t_M = \{City, Education\}$, and $q_3.ms = 3$. These three queries lead to the following three conditions:

1. When the user's query is $q_1$, the condition is the same as Case 1 described above. Because the corresponding OLAM cube can be found in the OMARS system and the minsup of the query is higher than prims, we can use $MCube(t_{G_1}, t_{M_1})$ to respond to the query immediately.
2. When the user's query is $q_2$, the condition is the same as Case 2 described above. Because the minsup of the query is lower than prims, we need to utilize the corresponding auxiliary cube of the found OLAM cube $MCube(t_{G_2}, t_{M_2})$ and the data warehouse to answer query $q_2$.
3. When the user's query is $q_3$, the condition is the same as Case 3 described above. Because we cannot find any matching OLAM cube in the OMARS system, we must use the data warehouse to answer query $q_3$.

5.2 Cost Evaluation for Case 1

In this case, the OLAM cube returned from the OLAM Mediator can be utilized to respond to users' queries. The CBW_on algorithm [15] is employed to mine association rules. To facilitate the analysis, we replicate the CBW_on algorithm in Figure 5.4. Because the qualified frequent itemsets have been stored in the found

OLAM cube, and minsup ≥ prims, there is no need to generate the frequent itemsets via an Apriori-like algorithm. All we have to do is scan the frequent itemsets in the OLAM cube and perform the association_gen procedure in Figure 5.7 to generate qualified association rules.

Algorithm CBW_on
Input: relevant cube MCube(t_G, t_M), minsup and prims;
Output: The set of frequent itemsets F;
1  if minsup < prims then
2      AF = {X | sup(X) ≥ minsup, X ∈ Auxiliary Cube} ∪ {Y | Y ∈ MCube(t_G, t_M) and |Y| = K_α};
3      DF = Dwnsearch_on(T, AF, K_α, minsup);
4      UF = Upsearch(AF, minsup);
5      F = DF ∪ UF;
6  else
7      F = {X | X ∈ MCube(t_G, t_M) and sup(X) ≥ minsup};
8  end if
9  return F;
Figure 5.4. Algorithm CBW_on

Procedure Dwnsearch_on
1  for i = 1 to |D| do
2      scan the i-th transaction t_i;
3      delete those items in t_i but not in AF;
4      for each subset X of t_i and 2 ≤ |X| ≤ K_α do
5          sup(X)++;
6  end for
7  DF = {X | sup(X) ≥ minsup} ∪ AF;
Figure 5.5. Procedure Dwnsearch_on

Procedure Upsearch
1  transform horizontal data format T into t_id lists;
2  F_{K_α} = frequent K_α-itemsets;
3  k = K_α, F_k = F_{K_α};
4  repeat
5      k++;
6      C_k = new candidate k-itemsets generated from F_{k-1};
7      for each X ∈ C_k do
8          perform bit-vector intersection on X;
9          count the support of X;
10     end for
11     F_k = {X | sup(X) ≥ prims, X ∈ C_k};
12     UF = UF ∪ F_k;
13 until F_k = φ
Figure 5.6. Procedure Upsearch
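The two counting styles used by these procedures can be sketched in Python as follows. This is not the OMARS implementation: the function names are hypothetical, plain Python sets stand in for the bit vectors of Figure 5.6, and the filter in `dwnsearch_counts` takes an explicit set of allowed items rather than deriving it from AF.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, FrozenSet, Iterable, List, Set


def dwnsearch_counts(transactions: List[Set[str]], allowed_items: Set[str],
                     k_alpha: int) -> Counter:
    """Downward counting (cf. Figure 5.5, lines 1-6): enumerate every subset of
    size 2..k_alpha of each filtered transaction and increment its support."""
    sup: Counter = Counter()
    for t in transactions:
        items = sorted(set(t) & allowed_items)
        for size in range(2, k_alpha + 1):
            for x in combinations(items, size):
                sup[frozenset(x)] += 1
    return sup


def tid_lists(transactions: List[Set[str]]) -> Dict[str, Set[int]]:
    """Vertical format (cf. Figure 5.6, line 1): item -> ids of transactions containing it."""
    lists: Dict[str, Set[int]] = {}
    for tid, t in enumerate(transactions):
        for item in t:
            lists.setdefault(item, set()).add(tid)
    return lists


def support_by_intersection(itemset: Iterable[str], lists: Dict[str, Set[int]]) -> int:
    """Upward counting (cf. Figure 5.6, lines 8-9): intersect the id lists of the items."""
    return len(set.intersection(*(lists[i] for i in itemset)))


# Toy transactions (hypothetical data).
TRANSACTIONS = [{"A", "B", "C"}, {"A", "B"}, {"B", "C"}]
```

Both routes give the same supports on the toy data; the vertical format simply avoids enumerating subsets of each transaction once the itemsets grow beyond the cutting level.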

Procedure association_gen (F: set of all frequent itemsets; min_conf: minimum confidence threshold)
begin
    for each l ∈ F do
        generate P(l) = 2^l − {φ}; // P(l): power set of l
        for each s ∈ P(l) with s ⊂ l do
            if support_count(l) / support_count(s) ≥ min_conf then
                output s ⇒ l − s;
end
Figure 5.7. Procedure association_gen

The cost can thus be divided into two parts:
1. Frequent itemset discovery: This involves searching the frequent itemsets stored in the OLAM cube for those with support no lower than the minsup of the user's query, which costs $|D_M|$, where $D_M$ denotes the OLAM cube.
2. Rule generation: For each discovered frequent itemset, we construct all possible rules from it, compute their confidence, and keep those that satisfy the minimum confidence.

The key point of the complexity analysis thus lies in the number of candidate rules to be generated and inspected. Our first step in this direction is to consider the number of rules that can be generated from a frequent k-itemset and all of its subsets.

Lemma 1. The number of rules that can be constructed from a k-itemset is $2^k - 2$.

Proof. Recall that each rule that can be constructed from an itemset X has the form $A \Rightarrow X - A$, for $A \subset X$ and $A \neq \varphi$. Thus, the number of different A's determines the number of rules, which is

$$\sum_{i=1}^{k-1} \binom{k}{i} = 2^k - 2.$$

Lemma 2. For a k-itemset X, the total number of rules that can be generated from X and its subsets is $3^k - 2^{k+1} + 1$.

Proof. From Lemma 1, we can derive

$$\begin{aligned}
\binom{k}{2}(2^2 - 2) + \binom{k}{3}(2^3 - 2) + \cdots + \binom{k}{k}(2^k - 2)
&= \sum_{i=2}^{k} \binom{k}{i} 2^i - 2 \sum_{i=2}^{k} \binom{k}{i} \\
&= \left( \sum_{i=0}^{k} \binom{k}{i} 2^i - 2k - 1 \right) - 2 \left( \sum_{i=0}^{k} \binom{k}{i} - k - 1 \right) \\
&= (3^k - 2k - 1) - 2(2^k - k - 1) \\
&= 3^k - 2^{k+1} + 1.
\end{aligned}$$

Now, if we know the set of maximal frequent itemsets, then we can complete the analysis. Unfortunately, the exact set is unobtainable without a priori knowledge of the user-specified minsup. We thus resort to an estimation that takes prims in place of minsup. We then apply sampling to obtain a random subset of the warehouse data, and we can either
1. compute the maximal frequent itemsets for each OLAM cube using any maximal pattern mining algorithm, or
2. apply the CBW_off algorithm to estimate $K_\alpha$ (the cutting level), compute the frequent itemsets with cardinality $K_\alpha$, and regard these itemsets as the maximal frequent itemsets.
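The two counting lemmas can be checked by brute force for small k. The Python below is a sketch written for this purpose only; the function names are hypothetical, and the enumeration considers every nonempty proper subset A of an itemset as the antecedent of a rule A ⇒ X − A.

```python
from itertools import combinations
from typing import FrozenSet, Iterable, Iterator, Tuple


def rules_from(itemset: Iterable[int]) -> Iterator[Tuple[FrozenSet[int], FrozenSet[int]]]:
    """Yield every rule A => X - A with A a nonempty proper subset of X (Lemma 1)."""
    items = frozenset(itemset)
    ordered = sorted(items)
    for r in range(1, len(ordered)):
        for a in combinations(ordered, r):
            antecedent = frozenset(a)
            yield antecedent, items - antecedent


def rules_from_all_subsets(itemset: Iterable[int]) -> int:
    """Count the rules generated from X and from all of its subsets of size >= 2 (Lemma 2)."""
    ordered = sorted(set(itemset))
    total = 0
    for size in range(2, len(ordered) + 1):
        for sub in combinations(ordered, size):
            total += sum(1 for _ in rules_from(sub))
    return total


# Compare the enumeration with the closed forms for k = 2..6.
for k in range(2, 7):
    x = range(k)
    assert sum(1 for _ in rules_from(x)) == 2 ** k - 2
    assert rules_from_all_subsets(x) == 3 ** k - 2 ** (k + 1) + 1
```

For example, a 3-itemset yields 6 rules by itself and 12 rules once its three 2-item subsets are included, in agreement with $2^3 - 2$ and $3^3 - 2^4 + 1$.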

Let MF denote the set of maximal patterns. If the first approach is adopted, the computation spent on rule generation will be

$$\sum_{X \in MF} \left( 3^{|X|} - 2^{|X|+1} + 1 \right),$$

or

$$|F_{K_\alpha}| \left( 3^{K_\alpha} - 2^{K_\alpha + 1} + 1 \right)$$

if the second approach is used. Here, for simplicity, we adopt the second approach. Finally, combining the cost of frequent itemset discovery and rule generation, keeping only the dominant term $3^{K_\alpha}$ of the rule count and weighting the cube scan by the I/O to computation ratio $\alpha$, we have

$$|F_{K_\alpha}| \, 3^{K_\alpha} + \alpha |D_M|.$$

5.3 Cost Evaluation for Case 2

In this case, algorithm CBW_on illustrated in Figure 5.4 executes the minsup < prims part of the if clause, which comprises three different steps.

1. Generate AF, i.e., $F_{K_\alpha}$. This requires scanning the auxiliary cube and the OLAM cube. The cost is $|D_X| + |D_M|$, where $D_X$ denotes the auxiliary cube and $D_M$ denotes the OLAM cube.

2. Execute procedure Dwnsearch_on illustrated in Figure 5.5. Note that this procedure presumes the availability of the corresponding joined table and ignores the preprocessing step that generates it. To account for this task and simplify the discussion, we assume this cost is w and the table is T. As illustrated in Figure 5.5, the Dwnsearch_on procedure needs to scan all the

transactions in the database. The I/O cost is $\alpha |T|$. Next we estimate the cost of the most expensive step: counting itemset support. Let $l$ denote the average length of each transaction. This step costs

$$|T| \left[ \binom{l}{2} + \binom{l}{3} + \cdots + \binom{l}{K_\alpha} \right], \quad \text{or} \quad |T| \sum_{i=2}^{K_\alpha} \binom{l}{i} \text{ in brief.}$$

Finally, the total cost consumed by the Dwnsearch_on procedure equals

$$\alpha |T| + |T| \sum_{i=2}^{K_\alpha} \binom{l}{i}.$$

3. Execute procedure Upsearch illustrated in Figure 5.6. To minimize the I/O cost and avoid combinatorial decomposition, the Upsearch procedure first transforms the transaction data into a vertical data format called transaction-id lists, then uses this structure to count the supports of itemsets. The cost lies in three main steps.
(1) Data transformation. This requires a data scan costing $\alpha |T|$.
(2) Candidate generation. The dominant operation is the itemset join. If the largest itemset cardinality is $K_{max}$, this task consumes at most
$$\sum_{k=K_\alpha+1}^{K_{max}} \binom{|F_{k-1}|}{2}.$$
(3) Counting candidate support. For each k-itemset, counting involves k−1 bit-vector intersections and one bit-vector accumulation. Summing this cost over all candidate itemsets, we have
$$\sum_{i=K_\alpha+1}^{K_{max}} |C_i| \cdot i \cdot |T|.$$

Finally, the total cost for procedure Upsearch is

$$\alpha |T| + \sum_{i=K_\alpha+1}^{K_{max}} \left( |C_i| \cdot i \cdot |T| + \binom{|F_{i-1}|}{2} \right).$$

Combining all of the analysis, we have

$$\alpha \left( |D_X| + |D_M| + 2|T| \right) + |T| \sum_{i=2}^{K_\alpha} \binom{l}{i} + \sum_{i=K_\alpha+1}^{K_{max}} \left( |C_i| \cdot i \cdot |T| + \binom{|F_{i-1}|}{2} \right) + |F_{K_\alpha}| \, 3^{K_\alpha}.$$

5.4 Cost Evaluation for Case 3

In this case, we must generate table T according to the user's query, which costs $|D| \log |D|$. After this, the CBW_off algorithm shown in Figure 5.8 is performed. It can be observed that, except for step 1, the steps employed by CBW_off are quite similar to those employed by CBW_on in Case 2. Since step 1 costs $\alpha |T| + K_\alpha |T|$, the total cost for this case is

$$|D| \log |D| + 3\alpha |T| + K_\alpha |T| + |T| \sum_{i=2}^{K_\alpha} \binom{l}{i} + \sum_{i=K_\alpha+1}^{K_{max}} \left( |C_i| \cdot i \cdot |T| + \binom{|F_{i-1}|}{2} \right) + |F_{K_\alpha}| \, 3^{K_\alpha}.$$

Algorithm CBW_off(T, prims)
Input: Table T and prims;
Output: The set of frequent itemsets F;
1  scan T to compute K_α and generate all frequent 1-itemsets F_1;
2  DF = Dwnsearch(T, K_α, F_1, prims);
3  UF = Upsearch(DF, prims);
4  return F = DF ∪ UF;
Figure 5.8. Algorithm CBW_off

Procedure Dwnsearch
1  for i = 1 to |D| do
2      scan the i-th transaction t_i;
3      delete the items in t_i that are not in F_1;
4      for each subset X of t_i and 2 ≤ |X| ≤ K_α do
5          sup(X)++;
6  end for
7  store all X in Auxiliary cube for |X| = K_α and sup(X) < prims;
8  DF = {X | sup(X) ≥ prims};
Figure 5.9. Procedure Dwnsearch

To sum up, we list the cost functions for the three cases below:

Case 1: $|F_{K_\alpha}| \, 3^{K_\alpha} + \alpha |D_M|$.

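Case 2: $\alpha \left( |D_X| + |D_M| + 2|T| \right) + |T| \sum_{i=2}^{K_\alpha} \binom{l}{i} + \sum_{i=K_\alpha+1}^{K_{max}} \left( |C_i| \cdot i \cdot |T| + \binom{|F_{i-1}|}{2} \right) + |F_{K_\alpha}| \, 3^{K_\alpha}$.

Case 3: $|D| \log |D| + 3\alpha |T| + K_\alpha |T| + |T| \sum_{i=2}^{K_\alpha} \binom{l}{i} + \sum_{i=K_\alpha+1}^{K_{max}} \left( |C_i| \cdot i \cdot |T| + \binom{|F_{i-1}|}{2} \right) + |F_{K_\alpha}| \, 3^{K_\alpha}$.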
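The three cost functions translate directly into code. The Python sketch below is an illustration under stated assumptions: all function and parameter names are hypothetical, `C` and `F` are dictionaries mapping an itemset cardinality i to the counts |C_i| and |F_i|, the average transaction length `l` is taken as an integer so that the binomial coefficient is defined, and the logarithm is taken base 2 because the text does not state a base.

```python
from math import comb, log2
from typing import Dict


def rule_gen_cost(f_kalpha: int, k_alpha: int) -> int:
    """Rule generation term |F_{K_alpha}| * 3^{K_alpha}."""
    return f_kalpha * 3 ** k_alpha


def mining_cost(t: int, l: int, k_alpha: int, k_max: int,
                C: Dict[int, int], F: Dict[int, int]) -> int:
    """Computation shared by Cases 2 and 3: downward subset counting plus upward
    candidate generation and support counting."""
    down = t * sum(comb(l, i) for i in range(2, k_alpha + 1))
    up = sum(C[i] * i * t + comb(F[i - 1], 2) for i in range(k_alpha + 1, k_max + 1))
    return down + up


def cost_case1(alpha: float, d_m: int, f_kalpha: int, k_alpha: int) -> float:
    return rule_gen_cost(f_kalpha, k_alpha) + alpha * d_m


def cost_case2(alpha: float, d_x: int, d_m: int, t: int, l: int, k_alpha: int,
               k_max: int, C: Dict[int, int], F: Dict[int, int]) -> float:
    return (alpha * (d_x + d_m + 2 * t)
            + mining_cost(t, l, k_alpha, k_max, C, F)
            + rule_gen_cost(F[k_alpha], k_alpha))


def cost_case3(alpha: float, d: int, t: int, l: int, k_alpha: int,
               k_max: int, C: Dict[int, int], F: Dict[int, int]) -> float:
    return (d * log2(d) + 3 * alpha * t + k_alpha * t
            + mining_cost(t, l, k_alpha, k_max, C, F)
            + rule_gen_cost(F[k_alpha], k_alpha))


def classify_case(found: bool, minsup: float, prims: float) -> int:
    """Case selection as in Figure 5.2: 1 = cube only, 2 = cube + auxiliary cube
    + warehouse, 3 = warehouse only."""
    if not found:
        return 3
    return 1 if prims <= minsup else 2
```

Splitting out `mining_cost` makes the relationship between the cases visible: Cases 2 and 3 share the same computation terms and differ only in their I/O and table-construction terms.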

Chapter 6 OLAM Cube Selection Methods

In this chapter, we describe three typical heuristic algorithms proposed for the OLAP cube selection problem, and elaborate on how to modify each method and combine it with the cost models described in the last chapter to select the most suitable OLAM cubes. The methods are the forward greedy selection (FGS) method proposed by Harinarayan et al. [9], the pick by size (PBS) selection method proposed by Shukla et al. [17], and the backward greedy selection (BGS) method proposed by Lin and Kuo [13].

6.1 Forward Greedy Selection Method (FGS)

The forward greedy selection method was proposed by Harinarayan et al. [19]. A greedy algorithm chooses the locally optimal option at each step, subject to the given constraint. For this purpose, we define a benefit function $B(d_i, M)$ as follows:

$$B(d_i, M) = \frac{1}{|d_i|} \sum_{q \in Q} \left( E(q, M) - E(q, M \cup \{d_i\}) \right) \qquad (6.1)$$
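Equation (6.1) can be read as the total reduction in query cost per unit of storage that materializing $d_i$ would bring. A minimal Python sketch follows; the `benefit` signature, the toy cost function `toy_E`, and all sizes are assumptions for illustration, with the fallback cost of 64 borrowed from the base relation size used later in Example 6.1.

```python
from typing import Callable, Dict, FrozenSet, Iterable

CostFn = Callable[[str, FrozenSet[str]], float]


def benefit(d_i: str, M: FrozenSet[str], queries: Iterable[str],
            E: CostFn, size: Dict[str, float]) -> float:
    """Equation (6.1): cost saved over all queries by adding d_i to M, divided by |d_i|."""
    gain = sum(E(q, M) - E(q, M | {d_i}) for q in queries)
    return gain / size[d_i]


# Toy instance (all values hypothetical).
SIZE = {"cd*": 10.0, "ed*": 12.0}
ANSWERS = {"q1": {"cd*"}, "q2": {"ed*"}}  # cubes able to answer each query


def toy_E(q: str, M: FrozenSet[str]) -> float:
    """Assumed cost: size of the smallest materialized cube answering q, else 64."""
    usable = [SIZE[d] for d in ANSWERS[q] if d in M]
    return min(usable) if usable else 64.0
```

Dividing by the cube size favors small cubes that still remove a large amount of query cost, which is the criterion the greedy selection relies on.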

We use our benefit function to compute the benefit of all unselected OLAM cubes, and apply the forward selection method to choose the most suitable OLAM cubes to materialize one by one, starting from the empty set, until no more cubes can be added. The forward selection method is described below:

Algorithm 1. Forward greedy selection (FGS)
Step 0. Let $M = \varphi$.
Step 1. While $\sum_{d \in M} |d| < S$, repeat Step 2 to Step 5.
Step 2. According to equation (6.1), calculate the benefit of all unselected OLAM cubes $d_i$, for $1 \le i \le n$ and $d_i \notin M$.
Step 3. Select the OLAM cube with the maximal benefit according to the results of Step 2, and set it as $d_j$.
Step 4. $M \leftarrow M \cup \{d_j\}$.
Step 5. Go to Step 1.
Figure 6.1. Forward Greedy Selection Method

Example 6.1. Suppose that we select three attributes, city c, education e, and date d, from the sales star schema illustrated in Figure 2.3. Figure 6.2 depicts all possible OLAM cubes formed with these three attributes as well as their dependencies, where all OLAM cubes with the same transaction attributes $t_G$ are packed into a metacube. The dotted line between any two metacubes is drawn for clarification and completes the lattice structure of metacubes in terms of $t_G$. Note that, according to Proposition 4.1, dependencies exist only among OLAM cubes within the same metacube. For simplicity, let us consider how to select the most suitable OLAM cubes to materialize from the three OLAM cubes ced*, cd*, and ed* under a space constraint. The symbols

used in this example are shown in Table 6.1, and the required parameter settings are shown in Table 6.2. Besides, we assume that the base relation size is 64, and prims is 3. Table 6.3 shows the first two selection steps using FGS.

Table 6.1. The symbols used in cost model
t_G       the set of transaction attributes
t_M       the set of mining attributes
α         I/O to computation ratio
K_α       the cardinality of maximal frequent itemset
K_max     the cardinality of the largest itemset
C_i       number of candidate i-itemsets
l         average length of each transaction
F_i       number of frequent i-itemsets
D_M       size of OLAM cube
D_X       size of auxiliary cube
f         frequency of OLAM cube
T         size of the table composed of attributes t_G and t_M
D         size of base relation

Table 6.2. The required parameter settings
Columns: subcubes, α, K_α, K_max, C_3, C_4, l, F_2, F_3, D_M, D_X, T, minsup, f
Rows: d*ce, d*c, d*e
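The forward greedy loop of Algorithm 1 can be sketched as below. This is a hedged illustration, not the procedure used to produce Table 6.3: the parameter values are invented because the settings of Table 6.2 are not reproduced here, the cost function `toy_E` is a stand-in for the Chapter 5 cost model, and the sketch assumes that a cube is only considered when it still fits in the remaining space.

```python
from typing import Callable, Dict, FrozenSet, Iterable, List

CostFn = Callable[[str, FrozenSet[str]], float]


def benefit(d_i: str, M: FrozenSet[str], queries: Iterable[str],
            E: CostFn, size: Dict[str, float]) -> float:
    """Equation (6.1): cost saved by adding d_i to M, per unit of storage."""
    return sum(E(q, M) - E(q, M | {d_i}) for q in queries) / size[d_i]


def forward_greedy_selection(cubes: List[str], size: Dict[str, float],
                             queries: List[str], E: CostFn, S: float) -> FrozenSet[str]:
    """Repeatedly add the unselected cube with maximal benefit until none fits."""
    M: FrozenSet[str] = frozenset()
    used = 0.0
    while True:
        # Assumption: only cubes that fit in the remaining space are candidates.
        candidates = [d for d in cubes if d not in M and used + size[d] <= S]
        if not candidates:
            return M
        best = max(candidates, key=lambda d: benefit(d, M, queries, E, size))
        M = M | {best}
        used += size[best]


# Toy instance (sizes and query/cube relationships are hypothetical).
CUBES = ["ced*", "cd*", "ed*"]
SIZE = {"ced*": 30.0, "cd*": 10.0, "ed*": 12.0}
ANSWERS = {"q_cd": {"cd*", "ced*"}, "q_ed": {"ed*", "ced*"}, "q_ced": {"ced*"}}
QUERIES = list(ANSWERS)


def toy_E(q: str, M: FrozenSet[str]) -> float:
    """Assumed cost: size of the smallest materialized cube answering q, else 64."""
    usable = [SIZE[d] for d in ANSWERS[q] if d in M]
    return min(usable) if usable else 64.0
```

With a space limit of 25 the sketch selects cd* first (benefit 5.4) and then ed*, since ced* never fits; raising the limit lets the larger cube compete on benefit per unit of space.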

Database design View Access patterns Need for separate data warehouse:- A multidimensional data model:-

Database design View Access patterns Need for separate data warehouse:- A multidimensional data model:- UNIT III: Data Warehouse and OLAP Technology: An Overview : What Is a Data Warehouse? A Multidimensional Data Model, Data Warehouse Architecture, Data Warehouse Implementation, From Data Warehousing to

More information

Basics of Dimensional Modeling

Basics of Dimensional Modeling Basics of Dimensional Modeling Data warehouse and OLAP tools are based on a dimensional data model. A dimensional model is based on dimensions, facts, cubes, and schemas such as star and snowflake. Dimension

More information

Data Mining Concepts & Techniques

Data Mining Concepts & Techniques Data Mining Concepts & Techniques Lecture No. 01 Databases, Data warehouse Naeem Ahmed Email: naeemmahoto@gmail.com Department of Software Engineering Mehran Univeristy of Engineering and Technology Jamshoro

More information

Decision Support Systems aka Analytical Systems

Decision Support Systems aka Analytical Systems Decision Support Systems aka Analytical Systems Decision Support Systems Systems that are used to transform data into information, to manage the organization: OLAP vs OLTP OLTP vs OLAP Transactions Analysis

More information

Data Warehousing & Mining. Data integration. OLTP versus OLAP. CPS 116 Introduction to Database Systems

Data Warehousing & Mining. Data integration. OLTP versus OLAP. CPS 116 Introduction to Database Systems Data Warehousing & Mining CPS 116 Introduction to Database Systems Data integration 2 Data resides in many distributed, heterogeneous OLTP (On-Line Transaction Processing) sources Sales, inventory, customer,

More information

Data Warehousing and Data Mining

Data Warehousing and Data Mining Data Warehousing and Data Mining Lecture 3 Efficient Cube Computation CITS3401 CITS5504 Wei Liu School of Computer Science and Software Engineering Faculty of Engineering, Computing and Mathematics Acknowledgement:

More information

DATA WAREHOUING UNIT I

DATA WAREHOUING UNIT I BHARATHIDASAN ENGINEERING COLLEGE NATTRAMAPALLI DEPARTMENT OF COMPUTER SCIENCE SUB CODE & NAME: IT6702/DWDM DEPT: IT Staff Name : N.RAMESH DATA WAREHOUING UNIT I 1. Define data warehouse? NOV/DEC 2009

More information

Lectures for the course: Data Warehousing and Data Mining (IT 60107)

Lectures for the course: Data Warehousing and Data Mining (IT 60107) Lectures for the course: Data Warehousing and Data Mining (IT 60107) Week 1 Lecture 1 21/07/2011 Introduction to the course Pre-requisite Expectations Evaluation Guideline Term Paper and Term Project Guideline

More information

This tutorial will help computer science graduates to understand the basic-to-advanced concepts related to data warehousing.

This tutorial will help computer science graduates to understand the basic-to-advanced concepts related to data warehousing. About the Tutorial A data warehouse is constructed by integrating data from multiple heterogeneous sources. It supports analytical reporting, structured and/or ad hoc queries and decision making. This

More information

An Overview of Data Warehousing and OLAP Technology

An Overview of Data Warehousing and OLAP Technology An Overview of Data Warehousing and OLAP Technology CMPT 843 Karanjit Singh Tiwana 1 Intro and Architecture 2 What is Data Warehouse? Subject-oriented, integrated, time varying, non-volatile collection

More information

Data Warehousing and Data Mining. Announcements (December 1) Data integration. CPS 116 Introduction to Database Systems

Data Warehousing and Data Mining. Announcements (December 1) Data integration. CPS 116 Introduction to Database Systems Data Warehousing and Data Mining CPS 116 Introduction to Database Systems Announcements (December 1) 2 Homework #4 due today Sample solution available Thursday Course project demo period has begun! Check

More information

Novel Materialized View Selection in a Multidimensional Database

Novel Materialized View Selection in a Multidimensional Database Graphic Era University From the SelectedWorks of vijay singh Winter February 10, 2009 Novel Materialized View Selection in a Multidimensional Database vijay singh Available at: https://works.bepress.com/vijaysingh/5/

More information

DATA WAREHOUSE EGCO321 DATABASE SYSTEMS KANAT POOLSAWASD DEPARTMENT OF COMPUTER ENGINEERING MAHIDOL UNIVERSITY

DATA WAREHOUSE EGCO321 DATABASE SYSTEMS KANAT POOLSAWASD DEPARTMENT OF COMPUTER ENGINEERING MAHIDOL UNIVERSITY DATA WAREHOUSE EGCO321 DATABASE SYSTEMS KANAT POOLSAWASD DEPARTMENT OF COMPUTER ENGINEERING MAHIDOL UNIVERSITY CHARACTERISTICS Data warehouse is a central repository for summarized and integrated data

More information

A Novel Approach of Data Warehouse OLTP and OLAP Technology for Supporting Management prospective

A Novel Approach of Data Warehouse OLTP and OLAP Technology for Supporting Management prospective A Novel Approach of Data Warehouse OLTP and OLAP Technology for Supporting Management prospective B.Manivannan Research Scholar, Dept. Computer Science, Dravidian University, Kuppam, Andhra Pradesh, India

More information

Fig 1.2: Relationship between DW, ODS and OLTP Systems

Fig 1.2: Relationship between DW, ODS and OLTP Systems 1.4 DATA WAREHOUSES Data warehousing is a process for assembling and managing data from various sources for the purpose of gaining a single detailed view of an enterprise. Although there are several definitions

More information

collection of data that is used primarily in organizational decision making.

collection of data that is used primarily in organizational decision making. Data Warehousing A data warehouse is a special purpose database. Classic databases are generally used to model some enterprise. Most often they are used to support transactions, a process that is referred

More information

Summary of Last Chapter. Course Content. Chapter 2 Objectives. Data Warehouse and OLAP Outline. Incentive for a Data Warehouse

Summary of Last Chapter. Course Content. Chapter 2 Objectives. Data Warehouse and OLAP Outline. Incentive for a Data Warehouse Principles of Knowledge Discovery in bases Fall 1999 Chapter 2: Warehousing and Dr. Osmar R. Zaïane University of Alberta Dr. Osmar R. Zaïane, 1999 Principles of Knowledge Discovery in bases University

More information

OLAP Introduction and Overview

OLAP Introduction and Overview 1 CHAPTER 1 OLAP Introduction and Overview What Is OLAP? 1 Data Storage and Access 1 Benefits of OLAP 2 What Is a Cube? 2 Understanding the Cube Structure 3 What Is SAS OLAP Server? 3 About Cube Metadata

More information

DATA MINING AND WAREHOUSING

DATA MINING AND WAREHOUSING DATA MINING AND WAREHOUSING Qno Question Answer 1 Define data warehouse? Data warehouse is a subject oriented, integrated, time-variant, and nonvolatile collection of data that supports management's decision-making

More information

Data Warehousing and Decision Support. Introduction. Three Complementary Trends. [R&G] Chapter 23, Part A

Data Warehousing and Decision Support. Introduction. Three Complementary Trends. [R&G] Chapter 23, Part A Data Warehousing and Decision Support [R&G] Chapter 23, Part A CS 432 1 Introduction Increasingly, organizations are analyzing current and historical data to identify useful patterns and support business

More information

ANU MLSS 2010: Data Mining. Part 2: Association rule mining

ANU MLSS 2010: Data Mining. Part 2: Association rule mining ANU MLSS 2010: Data Mining Part 2: Association rule mining Lecture outline What is association mining? Market basket analysis and association rule examples Basic concepts and formalism Basic rule measurements

More information

IT DATA WAREHOUSING AND DATA MINING UNIT-2 BUSINESS ANALYSIS

IT DATA WAREHOUSING AND DATA MINING UNIT-2 BUSINESS ANALYSIS PART A 1. What are production reporting tools? Give examples. (May/June 2013) Production reporting tools will let companies generate regular operational reports or support high-volume batch jobs. Such

More information

Data Mining. Data warehousing. Hamid Beigy. Sharif University of Technology. Fall 1394

Data Mining. Data warehousing. Hamid Beigy. Sharif University of Technology. Fall 1394 Data Mining Data warehousing Hamid Beigy Sharif University of Technology Fall 1394 Hamid Beigy (Sharif University of Technology) Data Mining Fall 1394 1 / 22 Table of contents 1 Introduction 2 Data warehousing

More information

Data warehouses Decision support The multidimensional model OLAP queries

Data warehouses Decision support The multidimensional model OLAP queries Data warehouses Decision support The multidimensional model OLAP queries Traditional DBMSs are used by organizations for maintaining data to record day to day operations On-line Transaction Processing

More information

CompSci 516 Data Intensive Computing Systems

CompSci 516 Data Intensive Computing Systems CompSci 516 Data Intensive Computing Systems Lecture 20 Data Mining and Mining Association Rules Instructor: Sudeepa Roy CompSci 516: Data Intensive Computing Systems 1 Reading Material Optional Reading:

More information

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SHRI ANGALAMMAN COLLEGE OF ENGINEERING & TECHNOLOGY (An ISO 9001:2008 Certified Institution) SIRUGANOOR,TRICHY-621105. DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING Year / Semester: IV/VII CS1011-DATA

More information

Efficient Remining of Generalized Multi-supported Association Rules under Support Update

Efficient Remining of Generalized Multi-supported Association Rules under Support Update Efficient Remining of Generalized Multi-supported Association Rules under Support Update WEN-YANG LIN 1 and MING-CHENG TSENG 1 Dept. of Information Management, Institute of Information Engineering I-Shou

More information

What is a Data Warehouse?

What is a Data Warehouse? What is a Data Warehouse? COMP 465 Data Mining Data Warehousing Slides Adapted From : Jiawei Han, Micheline Kamber & Jian Pei Data Mining: Concepts and Techniques, 3 rd ed. Defined in many different ways,

More information

An Improved Apriori Algorithm for Association Rules

An Improved Apriori Algorithm for Association Rules Research article An Improved Apriori Algorithm for Association Rules Hassan M. Najadat 1, Mohammed Al-Maolegi 2, Bassam Arkok 3 Computer Science, Jordan University of Science and Technology, Irbid, Jordan

More information

Question Bank. 4) It is the source of information later delivered to data marts.

Question Bank. 4) It is the source of information later delivered to data marts. Question Bank Year: 2016-2017 Subject Dept: CS Semester: First Subject Name: Data Mining. Q1) What is data warehouse? ANS. A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile

More information

Chapter 4, Data Warehouse and OLAP Operations

Chapter 4, Data Warehouse and OLAP Operations CSI 4352, Introduction to Data Mining Chapter 4, Data Warehouse and OLAP Operations Young-Rae Cho Associate Professor Department of Computer Science Baylor University CSI 4352, Introduction to Data Mining

More information

ETL and OLAP Systems

ETL and OLAP Systems ETL and OLAP Systems Krzysztof Dembczyński Intelligent Decision Support Systems Laboratory (IDSS) Poznań University of Technology, Poland Software Development Technologies Master studies, first semester

More information

Data Warehousing and Decision Support

Data Warehousing and Decision Support Data Warehousing and Decision Support Chapter 23, Part A Database Management Systems, 2 nd Edition. R. Ramakrishnan and J. Gehrke 1 Introduction Increasingly, organizations are analyzing current and historical

More information

Data Warehousing and Decision Support

Data Warehousing and Decision Support Data Warehousing and Decision Support [R&G] Chapter 23, Part A CS 4320 1 Introduction Increasingly, organizations are analyzing current and historical data to identify useful patterns and support business

More information
