Fragmentation of Multidimensional Databases

Derek Munneke, Kirsten Wahlstrom, Mukesh Mohania

Advanced Computing Research Centre, School of Computer and Information Science, University of South Australia, The Levels Campus, Mawson Lakes 595 SA, Australia
{D.Munneke, K.Wahlstrom,

Abstract. As diverse demands are placed on On-Line Analytical Processing (OLAP) systems, distribution of the supporting multidimensional database (MDDB), and thus fragmentation, becomes an issue. We consider a recent multidimensional data model, and show that two basic strategies of fragmentation are possible: slice 'n dice and sever. The operators that realise these methods are composed from the basic set given in the data model, and the strategies are shown to satisfy the fragmentation correctness rules.

1 Introduction

Multidimensional databases (MDDB) exploit the multidimensional nature of On-Line Analytical Processing (OLAP) [Codd et al., 1993]. As the popularity of OLAP grows, an increasingly diverse group of users is making demands of the systems, possibly with conflicting goals. In order to meet the performance requirements of the systems, implementation as distributed databases will become an important consideration. Fragmentation is an important part of distributed design [Özsu and Valduriez, 1991].

On-Line Analytical Processing is a decision support tool which enables the sophisticated analysis of an organisation's performance by providing access to views of data which characterise the multidimensional nature of the enterprise [Codd et al., 1993]. Contemporary relational database systems are designed for On-Line Transaction Processing (OLTP), whose operational demands make them unsuitable for OLAP applications [Chaudhuri and Dayal, 1997]. A multidimensional database stores data as hypercubes which enable the OLAP operations to be implemented directly over the data structure.
Consequently, several multidimensional database systems have been developed commercially, for example [ARBOR, 1998; IRI, 1997], and multidimensional databases have become a topic of active research [Colliat, 1995; Li and Wang, 1996; Gyssens and Lakshmanan, 1996; Agrawal et al., 1997; Hacid and Sattler, 1997; Lehner et al., 1998].

Proceedings of the Tenth Australasian Database Conference, Auckland, New Zealand, January 18– Copyright Springer-Verlag, Singapore. Permission to copy this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or personal advantage; and this copyright notice, the title of the publication, and its date appear. Any other use or copying of this document requires specific prior permission from Springer-Verlag.

One issue which has seen little investigation is the fragmentation of a multidimensional database. Fragmentation plays an important role in the design of a distributed database system: it enables the definition of appropriate units of distribution, which enhance query performance by enabling concurrent and parallel execution of queries and by minimising communication costs. The requirement for parallel processing of OLAP operations is argued in [Datta et al., 1998; Liang and Orlowska, 1998].

An existing MDDB fragmentation strategy applies the common OLAP operation 'slice 'n dice' to partition the hypercube into smaller subcubes. This strategy is currently employed by vendors of multidimensional databases that support fragmentation, for example the Essbase OLAP Server Partitioning option [ARBOR, 1998]. We propose an additional strategy called sever that fragments the original data by removing dimensions from a hypercube, producing fragments with fewer dimensions than the original data cube.

In the next section we give an example data set that will help to illustrate the following discussion, and then in Section 3 we present a multidimensional model recently proposed by Agrawal, Gupta and Sarawagi [Agrawal et al., 1997]. We proceed in Section 4 to present the two fragmentation strategies, slice 'n dice and sever, and their derived forms, by drawing analogies with relational fragmentation. We present the operations that realise the two fragmentation methods, and both are shown to obey the correctness rules.

2 An example data set

Year   Product   Supplier   Sales
1997   a         X
1997   a         Y
       b         X
       b         Y
       a         X
       a         Y
       b         X
       b         Y          4

Table 1. A small subset of data for a retail company.

Consider a retail company that buys products from suppliers to sell them in its own stores. A data warehouse is constructed to collect sales figures for each product over time, including information about the supplier of the products sold. The information may be used by store managers to track store sales over time, by the company marketing department to assess campaign strategies, and by inventory to calculate when orders need to be made. A small subset of example data is shown in Table 1. This data may be represented equivalently by the three-dimensional cube shown in Figure 1.

Fig. 1. A data cube with dimensions 'Year', 'Product' and 'Supplier', and measure 'Sales' representing the example data set given in Table 1.

3 Multidimensional databases

3.1 Terminology

A multidimensional database uses a hypercube representation of the data, often referred to as a data cube. Determining attributes, such as 'Year', 'Product' and 'Supplier' in Figure 1, are referred to as dimensions whose values denote positions on the dimension. The determined attributes, such as 'Sales', are called measures. The values of a measure are the cell values in the data cube, called elements. The choice of which attributes should be determined and which are determining is a database design decision [Agrawal et al., 1997].

Operations are defined which make use of the visual image of the data cube. Pivot rotates the cube to 'show' a particular face, and slice 'n dice selects some subset of the data by 'cutting up' the cube. In addition, OLAP operations such as roll up and drill down change the granularity of the dimensions by aggregating over the defined hierarchies and expanding a hierarchy to display further detail, respectively [Chaudhuri and Dayal, 1997].

In addition to the above terminology, we define the notion of a cube's degree and introduce the concept of key dimensions:

Degree of a data cube $C$ is the number of dimensions in the cube, expressed as $\deg(C)$.

Key dimensions is a dimension or a set of dimensions whose positions map to a unique element, with a defined value, on every other dimension.

The data cube in Figure 1 has degree 3, and the set {Year, Supplier, Product} is the key, as positions on these dimensions uniquely map to elements containing Sales information. There may be multiple choices of key dimensions so, as in the Entity-Relationship (ER) model, the possible choices are called candidate keys, of which one may be chosen as the primary key.

3.2 A data model

Agrawal, Gupta and Sarawagi [Agrawal et al., 1997] recently proposed a data model to act as a framework for research in multidimensional databases. A pertinent feature of the model is the symmetric treatment of dimensions and measures, which allows the operators to act on both equivalently. This permits a general specification of the fragmentation strategies without restricting their applicability.

In the logical model the data is organised in one or more data cubes $C_j$, each with the following components:

– $k$ dimensions, each with a name $D_i$ and a domain $dom_i = dom(D_i)$ over which the position values are defined. We denote $dims(C) = \langle D_1, \ldots, D_k \rangle$ as a $k$-tuple containing the dimension names of the cube in the order specified by the element mapping function $E(C)$.

– Elements, defined as a mapping $E(C)$ from $dom_1 \times \cdots \times dom_k$ to either $0$, $1$ or an $n$-tuple $\langle e_1, \ldots, e_n \rangle$. $E(C)(d_1, \ldots, d_k)$ refers to the element in cell location $(d_1, \ldots, d_k)$ of cube $C$. The $d_i$ refer to values, not positions per se, so the dimensions are not required to have ranked or discrete domains.

– An $n$-tuple of names that describes the attributes of the $n$-tuple elements of the cube. We denote this as $attributes(C) = \langle A_1, \ldots, A_n \rangle$.
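The components listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the representation used by [Agrawal et al., 1997]: the class name `Cube`, its fields, and the sales figures are hypothetical, and the element mapping is stored sparsely, so any position absent from the dictionary is read as the element 0.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple, Union

# An element is 0, 1, or an n-tuple of attribute values.
Element = Union[int, Tuple]


@dataclass
class Cube:
    """Hypothetical container for the logical model's components."""
    dims: Tuple[str, ...]          # dims(C): dimension names, in mapping order
    attributes: Tuple[str, ...]    # attributes(C): names of the n-tuple members
    cells: Dict[Tuple, Element] = field(default_factory=dict)  # sparse E(C)

    def degree(self) -> int:
        # deg(C): the number of dimensions of the cube
        return len(self.dims)

    def element(self, *position) -> Element:
        # E(C)(d_1, ..., d_k); positions not stored map to the element 0
        return self.cells.get(tuple(position), 0)


# Illustrative data only: these sales figures are made up for the sketch.
cube = Cube(
    dims=("Year", "Product", "Supplier"),
    attributes=("Sales",),
    cells={
        (1997, "a", "X"): (10,),
        (1997, "b", "Y"): (4,),
    },
)

print(cube.degree())                 # number of dimensions
print(cube.element(1997, "a", "X"))  # a stored 1-tuple
print(cube.element(1998, "a", "X"))  # no stored value for this position
```

Keying the dictionary on position values rather than on array indices reflects the remark that dimensions need not have ranked or discrete domains.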
The cell values of the cube may be either $0$, $1$ or an $n$-tuple. The value $0$ for element $E(C)(d_1, \ldots, d_k)$ indicates that no value exists in the database for the combination of dimension positions $(d_1, \ldots, d_k)$. If the combination is present, then the element may be either a $1$ to indicate that a value exists, or an $n$-tuple that provides the additional data available for that combination. If any element of a cube is an $n$-tuple then none of the elements can be a $1$, and vice versa. If all elements of a cube are $0$ then the cube is empty, and if the domain $dom_i$ of a dimension contains no values then no element can be defined, so the cube is also considered empty.

The operators support the symmetric treatment of dimensions and measures to provide a minimal set of operators that provides all the functionality of current multidimensional products [Agrawal et al., 1997]. We refer the reader to [Agrawal et al., 1997] for a description of the operators. The operators in this model support a query model, as opposed to the 'one-operation-at-a-time' computational model presently used by many existing products. A query model allows complex queries to be composed without having to generate the often meaningless intermediate cubes, and supports optimisation of complex multidimensional queries.

As there is no distinction in the data model between measures and dimensions, the 'Sales' measure in Figure 1 may be transformed into just another dimension, as illustrated in Figure 2.

Fig. 2. The data cube from Figure 1 may be represented as a 4-dimensional data cube in which 'Sales' is also a dimension. The positions on the 'Sales' dimension consist of all possible sales figures. Representative lines of the binary data are shown corresponding to Supplier 'X' and Product 'a', but other values cannot be shown in this representation of 4-dimensional space.

Other models have also been proposed by Li and Wang [Li and Wang, 1996], and Gyssens and Lakshmanan [Gyssens and Lakshmanan, 1996], but neither of these treats dimensions and measures symmetrically. Note that this is a logical model and does not force or imply any storage mechanism.

4 Multidimensional fragmentation

Fragmentation of a multidimensional database divides a global data cube into fragment cubes, each containing a subset of the data. The slice 'n dice operation is a method of fragmentation for multidimensional databases. Slice 'n dice is analogous to horizontal fragmentation in the relational model, as it partitions the data into subsets without altering the structure of the database. The dimensions remain unchanged by slice 'n dice, just as attributes are not altered by horizontal fragmentation of a relational database.

We define an additional fragmentation strategy called sever. Sever is analogous to relational vertical fragmentation in that it reduces the degree of a data cube, just as vertical fragmentation reduces the degree of a relation.

The methods for fragmentation and reconstruction for these two strategies are given using the data model and operators described in the previous section. Derived and hybrid forms follow naturally from the basic schemes.

Three rules, referred to as the correctness rules, ensure the semantic validity of a fragmentation strategy. They verify completeness, to ensure no data is lost during fragmentation; check that the original data representation can be reconstructed; and test for disjointness, so that replication is controlled explicitly by allocation [Özsu and Valduriez, 1991; Ceri and Pelagatti, 1983]. We show that slice 'n dice and sever obey these rules.

4.1 Slice 'n dice fragmentation

Fig. 3. Slice 'n dice fragmentation is realised by the restrict operator.

The restrict operator realises the multidimensional slice 'n dice operation, acting to partition the data into smaller subcubes. Figure 3 shows an example of slice 'n dice fragmentation where the data cube has been partitioned into eight subcubes; in general $l$ dimensions, with dimension $j$ divided $m_j$ ways ($1 \le j \le l$), will produce $m_1 \times m_2 \times \cdots \times m_l$ subcubes. The general expressions to fragment a data cube $C$ into $m$ subcubes $C_j$, each satisfying the simple predicates $P_i^j$, may be specified as:

$$
\begin{aligned}
C_1 &= \mathrm{restrict}(C, \{[D_1, P_1^1], \ldots, [D_l, P_l^1]\}) \\
&\;\;\vdots \\
C_m &= \mathrm{restrict}(C, \{[D_1, P_1^m], \ldots, [D_l, P_l^m]\})
\end{aligned}
\tag{1}
$$

allowing restrictions on multiple dimensions¹ $D_i$ ($1 \le i \le l$) and multiple restrictions across a dimension. A simpler scheme allows restrictions on only a single dimension $D_i$ at a time, corresponding to just slice, so that $m$ restrictions produce $m$ partitions:

$$
\begin{aligned}
C_1 &= \mathrm{restrict}(C, \{[D_i, P^1]\}) \\
&\;\;\vdots \\
C_m &= \mathrm{restrict}(C, \{[D_i, P^m]\})
\end{aligned}
\tag{2}
$$

The more general case can then be built up by iteratively applying this operation across all the dimensions to be fragmented on.

To guarantee completeness, the predicates $P^j$ that define the restrictions must cover the entire domain of the dimension:

$$P^1(x) \lor \cdots \lor P^m(x) = \mathit{true} \quad \forall x \in dom_i. \tag{3}$$

To ensure disjointness the predicates must be mutually exclusive:

$$P^1(x) \land \cdots \land P^m(x) = \mathit{false} \quad \forall x \in dom_i. \tag{4}$$

The data cube may be reconstructed by applying the join operator along all dimensions, using the identity function $id$ as the mapping function and the function $f_{unique}$ as the element merging function:

$$C = \mathrm{join}(C_i, \{[D_1, id], \ldots, [D_k, id]\}, f_{unique}) \quad \text{for } 1 \le i \le m \tag{5}$$

where $id(d) = d$ and

$$
f_{unique}(e_1, e_2) =
\begin{cases}
e_1 & \text{if } e_2 = 0 \text{ or } e_1 = e_2, \\
e_2 & \text{if } e_1 = 0, \\
\text{error} & \text{otherwise.}
\end{cases}
$$

The function $f_{unique}$ checks whether each element is uniquely determined by its position in the cube, that is, it determines whether data inconsistencies exist. For disjoint fragments $f_{unique}$ will always return a value; if the fragments are not disjoint, $f_{unique}$ effectively tests for violation of referential integrity.

By using the identity mapping function, the join operator is analogous to the relational union operator. To join in this way the subcubes must be 'union compatible', that is, they must have the same degree and each corresponding domain $dom_i$ must be type compatible. Fragments produced by slice 'n dice are intrinsically union compatible, and thus this constraint is always satisfied.
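The single-dimension scheme of equation (2), the checks of equations (3) and (4), and the reconstruction of equation (5) can be sketched as follows. This is a minimal illustration, not the operators of [Agrawal et al., 1997]: the function names, the dictionary representation (absent positions stand for the element 0), and the sales figures are assumptions made for the example. The disjointness check requires that at most one predicate holds per position, which is the pairwise reading of mutual exclusivity.

```python
# Hypothetical sparse cube: position tuple -> n-tuple element.
# A position missing from the dict stands for the element 0.
DIMS = ("Year", "Product", "Supplier")


def restrict(cells, dim, predicate):
    """Keep only the cells whose position on `dim` satisfies `predicate`."""
    i = DIMS.index(dim)
    return {pos: e for pos, e in cells.items() if predicate(pos[i])}


def f_unique(e1, e2):
    """Element merging function from equation (5)."""
    if e2 == 0 or e1 == e2:
        return e1
    if e1 == 0:
        return e2
    raise ValueError("inconsistent elements at the same position")


def join_identity(fragments):
    """Join on all dimensions with the identity mapping (a union of cells)."""
    result = {}
    for fragment in fragments:
        for pos, e in fragment.items():
            result[pos] = f_unique(result.get(pos, 0), e)
    return result


def is_complete(domain, predicates):
    """Equation (3): every position satisfies at least one predicate."""
    return all(any(p(x) for p in predicates) for x in domain)


def is_disjoint(domain, predicates):
    """Mutual exclusivity: no position satisfies more than one predicate."""
    return all(sum(bool(p(x)) for p in predicates) <= 1 for x in domain)


# Illustrative data only: the sales figures are made up.
cube = {
    (1997, "a", "X"): (10,),
    (1997, "a", "Y"): (20,),
    (1997, "b", "X"): (30,),
    (1997, "b", "Y"): (4,),
}
preds = [lambda p: p == "a", lambda p: p == "b"]

c1, c2 = (restrict(cube, "Product", p) for p in preds)
print(is_complete(["a", "b"], preds), is_disjoint(["a", "b"], preds))
print(join_identity([c1, c2]) == cube)
```

If two fragments overlapped and disagreed on an element, `f_unique` would raise instead of silently choosing one value, which mirrors the integrity check described in the text.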
Fragmentation on one dimension followed by fragmentation on another dimension, corresponding to a slice then dice, is analogous to derived horizontal fragmentation of a relational database.

¹ The action of restrict on multiple dimensions is an extension we have added that differs from the original specification by [Agrawal et al., 1997], but is similar to how the merge operator is expressed.

Example

Fragmentation of the example data from Section 2 by slicing on the 'Product' dimension is performed by the following operations:

$$
\begin{aligned}
C_1 &= \mathrm{restrict}(C, \{[\text{Product}, position = \text{'a'}]\}) \\
C_2 &= \mathrm{restrict}(C, \{[\text{Product}, position = \text{'b'}]\}).
\end{aligned}
$$

This produces two subcubes $C_1$ and $C_2$ defined by the predicates $position = \text{'a'}$ and $position = \text{'b'}$ respectively. The fragmentation is complete as the two predicates cover the entire domain of 'Product':

$$(position = \text{'a'}) \lor (position = \text{'b'}) = \mathit{true} \quad \forall position \in dom_{Product},$$

and the fragmentation is disjoint as the predicates are exclusive:

$$(position = \text{'a'}) \land (position = \text{'b'}) = \mathit{false} \quad \forall position \in dom_{Product}.$$

To reconstruct the cube we join $C_1$ and $C_2$ with the functions specified in equation (5):

$$C = \mathrm{join}(C_1, \{[\text{Year}, id], [\text{Supplier}, id], [\text{Product}, id]\}, C_2, \{[\text{Year}, id], [\text{Supplier}, id], [\text{Product}, id]\}, f_{unique}).$$

Fragmentation by slice 'n dice can be useful for organisations that are geographically distributed, in which case the data cube can be divided into the subcubes of data most appropriate for each location. For example, the two different products 'a' and 'b' may only be sold in locations 'A' and 'B' respectively, so location B does not require the figures for product a, and vice versa.

4.2 Sever fragmentation

Fig. 4. Sever fragmentation reduces the dimensionality of the data cube.

Sever produces fragments with fewer dimensions than the original cube by merging a dimension and then destroying it. This is analogous to relational projection in that the new data set has fewer determining attributes, that is, the degree is reduced.

In order to perform fragmentation by sever, more than one non-key dimension must exist. This is necessary in order to produce severed fragments from which the original data cube can be reconstructed.

To sever the cube $C$, first a set of $l$ non-key dimensions $\{D_1^1, \ldots, D_l^1\}$ is merged by the merge operator, using the function $f_{exists}$ that maps each domain to a single position, labelled '∃', and the function $f_{unique}$, as defined by equation (5), to combine the elements:

$$C_1 = \mathrm{merge}(C, \{[D_1^1, f_{exists}], \ldots, [D_l^1, f_{exists}]\}, f_{unique}) \quad \text{where } f_{exists}(d_i) = \text{'∃'} \;\; \forall d_i \in dom_i. \tag{6}$$

Once the dimensions to be severed are merged successfully to a single position '∃', which indicates whether a value existed corresponding to that position, the dimensions can be destroyed:

$$
\begin{aligned}
C_1 &= \mathrm{destroy}(C_1, D_1^1) \\
&\;\;\vdots \\
C_1 &= \mathrm{destroy}(C_1, D_l^1).
\end{aligned}
\tag{7}
$$

This results in a fragment $C_1$ of degree $k - l$, where $\deg(C) = k$. If the primary key, that is the key maintained across all fragments, consists of a set of $j$ dimensions $\{D_1^{key}, \ldots, D_j^{key}\}$, then $C_1$ contains a set of $k - l - j$ non-key dimensions $\mathcal{D}^1$:

$$\mathcal{D}^1 = \{D_1, \ldots, D_k\} - \left(\{D_1^1, \ldots, D_l^1\} \cup \{D_1^{key}, \ldots, D_j^{key}\}\right)$$

where

$$\{D_1^1, \ldots, D_l^1\} \cap \{D_1^{key}, \ldots, D_j^{key}\} = \emptyset.$$

Repeat the above steps for $m$ disjoint sets $\mathcal{D}^i$, $1 \le i \le m$, of non-key dimensions to produce $m$ fragments $C_1, \ldots, C_m$.

The fragmentation is complete if every dimension is contained in at least one fragment. That is guaranteed if

$$\mathcal{D}^1 \cup \cdots \cup \mathcal{D}^m \cup \{D_1^{key}, \ldots, D_j^{key}\} = \{D_1, \ldots, D_k\}. \tag{8}$$

Disjointness is defined only on the non-key dimensions, as the key dimensions are necessarily replicated to satisfy reconstruction.
Disjointness is ensured if each fragment's set of non-key dimensions $\mathcal{D}^i$ is chosen to be disjoint from the others:

$$\mathcal{D}^1 \cap \cdots \cap \mathcal{D}^m = \emptyset. \tag{9}$$

Reconstruction of the cube is achieved by joining the fragments on the key dimensions, using the identity mapping function and the element merging function $f_{unique}$ to check referential integrity:

$$C = \mathrm{join}(C_i, \{[D_1^{key}, id], \ldots, [D_j^{key}, id]\}, f_{unique}) \quad \text{for } 1 \le i \le m. \tag{10}$$
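The sever steps (merge to a single position, then destroy) and the key-based reconstruction can be sketched as below. This is a simplified stand-in for the merge, destroy and join operators of [Agrawal et al., 1997], not their definition: the function names `sever` and `join_on_key`, the dictionary representation, and the data are all hypothetical. The cube is assumed to be binary (every stored element is 1), with 'Sales' already treated as a dimension and each product supplied by exactly one supplier, so that 'Year' and 'Product' form the key.

```python
DIMS = ("Year", "Product", "Supplier", "Sales")
KEY = ("Year", "Product")


def f_unique(e1, e2):
    """Element merging function from equation (5)."""
    if e2 == 0 or e1 == e2:
        return e1
    if e1 == 0:
        return e2
    raise ValueError("inconsistent elements at the same position")


def sever(dims, cells, drop):
    """Collapse dimension `drop` to one position, then remove it."""
    keep = tuple(d for d in dims if d != drop)
    idx = [dims.index(d) for d in keep]
    out = {}
    for pos, e in cells.items():
        new_pos = tuple(pos[i] for i in idx)
        # Cells that collapse onto the same position are combined by f_unique.
        out[new_pos] = f_unique(out.get(new_pos, 0), e)
    return keep, out


def join_on_key(dims1, cells1, dims2, cells2, key, out_dims):
    """Pair cells that agree on the key dimensions and rebuild full positions."""
    result = {}
    for p1, e1 in cells1.items():
        r1 = dict(zip(dims1, p1))
        for p2, e2 in cells2.items():
            r2 = dict(zip(dims2, p2))
            if all(r1[k] == r2[k] for k in key):
                merged = {**r1, **r2}
                pos = tuple(merged[d] for d in out_dims)
                result[pos] = f_unique(result.get(pos, 0), f_unique(e1, e2))
    return result


# Illustrative binary cube: the sales figures are made up.
cells = {
    (1997, "a", "X", 10): 1,
    (1997, "b", "Y", 4): 1,
    (1998, "a", "X", 12): 1,
    (1998, "b", "Y", 7): 1,
}

dims1, frag1 = sever(DIMS, cells, "Sales")      # Year, Product, Supplier
dims2, frag2 = sever(DIMS, cells, "Supplier")   # Year, Product, Sales
rebuilt = join_on_key(dims1, frag1, dims2, frag2, KEY, DIMS)
print(dims1, dims2)
print(rebuilt == cells)
```

The reconstruction is lossless here only because the key positions determine a single position on each non-key dimension; without that property, joining on the key would pair up combinations that never occurred in the original cube.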

If the fragmentation is complete, then the data cube can be properly reconstructed.

Slice 'n dice and sever may be used in combination to produce hybrid fragmentation, which forms a tree structure, as illustrated in Figure 5. Completeness and disjointness follow naturally if the constituent fragmentation schemes are complete and disjoint. Reconstruction of the original or intermediate data cubes is achieved by reconstructing fragments from the leaves of the tree and then progressively reconstructing up the tree.

Fig. 5. Hybrid fragmentation produces a tree structure.

Example

The data cube $C$ from Figure 1 cannot be fragmented by sever as only one non-key dimension exists, namely 'Sales'. If we introduce an enterprise constraint that "each product has a unique supplier", then 'Supplier' becomes a determined attribute, and 'Year' and 'Product' are the key dimensions. A new example data cube $C'$ illustrates this in Figure 6.

Before performing fragmentation of $C'$ by sever, the 'Sales' measure in Figure 6 is converted to a dimension, as in Figure 2, by using the pull operator. Then the 4-dimensional cube is severed into two 3-dimensional cubes by the following operations, where the functions are defined in equations (5) and (6).

$$
\begin{aligned}
C_1 &= \mathrm{merge}(C', \{[\text{Sales}, f_{exists}]\}, f_{unique}) \\
C_1 &= \mathrm{destroy}(C_1, \text{Sales})
\end{aligned}
$$

produces a subcube $C_1$ with dimensions Year, Product, and Supplier.

$$
\begin{aligned}
C_2 &= \mathrm{merge}(C', \{[\text{Supplier}, f_{exists}]\}, f_{unique}) \\
C_2 &= \mathrm{destroy}(C_2, \text{Supplier})
\end{aligned}
$$

produces a subcube $C_2$ with dimensions Year, Product, and Sales.

Fig. 6. The data cube from Figure 1 with the enterprise constraint "each product has a unique supplier". Note that the data cube now contains zero elements (0) that indicate no corresponding value, rather than a zero value.

The fragmentation is complete as every dimension (Year, Product, Supplier and Sales) is in at least one of the cubes, and disjoint as the non-key dimensions of the cubes (Supplier in $C_1$ and Sales in $C_2$) are disjoint. To reconstruct the data cube the subcubes are joined on the key dimensions, Year and Product:

$$C' = \mathrm{join}(C_1, \{[\text{Year}, id], [\text{Product}, id]\}, C_2, \{[\text{Year}, id], [\text{Product}, id]\}, f_{unique}).$$

The sever fragmentation strategy is useful when departments within an organisation require different information. For example, the marketing department is concerned with the sales figures and does not need information about suppliers, whilst the ordering department needs to know the suppliers' details but does not require sales figures.

5 Conclusion and further work

Two basic fragmentation strategies are possible in multidimensional databases: slice 'n dice and sever. Slice 'n dice partitions the data cube into subcubes by dividing the domains on the dimensions, in a manner analogous to relational horizontal fragmentation. Sever fragments the data cube by reducing its degree, which is comparable to vertical fragmentation of a relational database. Fragmentation by slice 'n dice is well known, though perhaps not formally stated, but fragmentation by sever is a new strategy. The relevance of sever will emerge as demands on multidimensional OLAP systems diverge.

We have only considered fragmentation strategies and have not discussed algorithms that determine how the data should be partitioned. Algorithms similar to those developed for relational systems [Özsu and Valduriez, 1991] can be applied to a multidimensional system; however, we believe additional algorithms that exploit the multidimensional characteristics of the system may be possible.

References

Agrawal, Rakesh, Gupta, Ashish, and Sarawagi, Sunita (1997). Modeling multidimensional databases. In Proc. 13th Int'l Conference on Data Engineering, pages 232–243, Los Alamitos, CA. IEEE Comput. Soc. Press.

ARBOR (1998). Arbor Essbase OLAP Server. Arbor Software Corporation, Sunnyvale CA. available via (25 July 1998).

Ceri, Stefano and Pelagatti, Giuseppe (1983). Distributed Databases: Principles and Systems. McGraw-Hill Book Co., Singapore.

Chaudhuri, Surajit and Dayal, Umeshwar (1997). An overview of data warehousing and OLAP technology. SIGMOD Record, 26(1):65–74.

Codd, E. F., Codd, S. B., and Salley, C. T. (1993). Providing OLAP (on-line analytical processing) to user-analysts: An IT mandate. Technical report, E.F.Codd & Associates.

Colliat, George (1995). OLAP, relational and multidimensional database systems. Technical report, Arbor Software Corporation, Sunnyvale, CA.

Datta, Anindya, Moon, Bongki, and Thomas, Helen (1998). A case for parallelism in data warehousing and OLAP. In Proc. of IEEE First Int'l Workshop on Data Warehouse Design and OLAP Technology.

Gyssens, Marc and Lakshmanan, Laks V.S. (1996). A foundation for multidimensional databases. In Proc. of 22nd VLDB Conference.

Hacid, Mohand-Said and Sattler, Ulrike (1997). An object-centered multidimensional data model with hierarchically structured dimensions. In Proc. IEEE Knowledge and Data Engineering Exchange Workshop, pages 65–72.

IRI (1997). QScan. IRI Software Information Resources, Inc., Waltham MA. available via (25 July 1998).

Lehner, W., Albrecht, J., and Wedekind, H. (1998). Normal forms for multidimensional databases. In 10th Int'l Conf. on Scientific and Statistical Database Management.

Li, Chang and Wang, X. Sean (1996). A data model for supporting on-line analytical processing. In Proc. 5th Int'l Conf. on Information and Knowledge Management, pages 81–88.

Liang, Weifa and Orlowska, Maria (1998). Computing multidimensional aggregates in parallel. In International Conf. on Parallel and Distributed Systems. IEEE Comput. Soc. Press.

Özsu, M. Tamer and Valduriez, Patrick (1991). Principles of Distributed Database Systems. Prentice Hall, Englewood Cliffs, NJ.
