OLAP2 outline. Multi Dimensional Data Model. A Sample Data Cube

OLAP2 outline

Multi Dimensional Data Model
Need for Multi Dimensional Analysis
OLAP Operators
Data Cube Demonstration Using SQL

Multi Dimensional Data Model

Multi dimensional analysis is a popular approach for extending the features available for reporting. Instead of submitting multiple queries, the data is structured for fast and easy access, so that the questions posed by different kinds of users can be answered interactively. This type of analysis is generally performed on large corporate warehouses or data marts.

The view of data as a multi dimensional array can be generalized to more than three dimensions. In OLAP applications the bulk of the data can be represented in such a multi dimensional array. Systems that store multi dimensional data in this way are termed MOLAP systems. A multi dimensional array can also be represented as a table, as shown below. The relation that relates the dimensions to the measure of interest is called the Fact Table.

The multi dimensional data model focuses mainly on a collection of numerical measures, which are termed facts, and each measure depends on a number of associated dimensions. In the data warehousing literature the Data Cube is one of the popular structures widely used for representing the multi dimensional model. We have represented the multi dimensional data model in two ways: one in the form of a table and the other in the form of an array. The dimension names are not shown in the above diagram, but the Ids associated with each dimension are: PID for Product ID, TimeId for the Time dimension and LocId for Location ID. Sales is the numeric measure.

A Sample Data Cube

The representation given below shows the annual sales of various categories of products in different quarters over different cities. For example, the annual sales of TVs for 3 cities are shown in the form of an array, and the total sales in the form of a simple number. So if we observe the diagram below, different kinds of aggregation are stored in this physical structure. One is the total sales over all cities and over all products; another kind of aggregation is the total sales over all quarters for each product.
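To make the two kinds of aggregation just mentioned concrete, here is a minimal sketch in Python. The fact rows, the city names and the helper names `total_sales` and `sales_per_product` are all made up for illustration; they are not taken from the slide.

```python
from collections import defaultdict

# Hypothetical fact rows: (product, city, quarter, sales). All values are invented.
FACTS = [
    ("TV", "Delhi", "Q1", 10),
    ("TV", "Delhi", "Q2", 20),
    ("TV", "Mumbai", "Q1", 30),
    ("PC", "Delhi", "Q1", 5),
    ("PC", "Mumbai", "Q2", 15),
]


def total_sales(facts):
    """Aggregate over all products, all cities and all quarters (a single number)."""
    return sum(sales for _product, _city, _quarter, sales in facts)


def sales_per_product(facts):
    """Aggregate over all cities and all quarters, keeping one total per product."""
    totals = defaultdict(int)
    for product, _city, _quarter, sales in facts:
        totals[product] += sales
    return dict(totals)


grand_total = total_sales(FACTS)
per_product = sales_per_product(FACTS)
print(grand_total)
print(per_product)
```

Both results come from the same fact rows; a data cube simply stores such aggregations side by side so that they do not have to be recomputed for every question.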

If we observe this diagram, it shows different kinds of aggregations in the form of a physical structure called the Data Cube. Here we have 3 dimensions; with more than 3 dimensions there is no physical structure to draw, but we can always generalize by using a concept called the Hyper Cube.

Cuboids Corresponding to the Cube

The slide shows 3 dimensions called Product, Date and Country. That is to say, we have computations across combinations of these dimensions. The diagram is an integration of the different kinds of computations over combinations of these three dimensions, and each combination is nothing but a particular type of cuboid. Here there are 4 types of cuboids, from the 0-dimensional cuboid, which is the Apex cuboid and gives the grand sales, to the Base or 3-D cuboid, which gives sales for each product, date and country. Between these two aggregations we have different kinds of computations. The property of these computations is that, for a non-negative measure such as sales, the aggregated values at an intermediate level are never smaller than the values at the level below and never larger than the values at the level above. In mathematics this type of structure is termed a Lattice. So the data cube imitates the behavior of a lattice, because we are taking all the combinations with some ordering, and this ordering is nothing but a partial ordering.

Data Cube involving four Dimensions

This slide talks about the extension of a three dimensional cuboid into a collection of 4-D cuboids. If we have 4 dimensions we have 2^4 = 16 cuboids: one cuboid of 0 dimensions, 4 cuboids of 1 dimension, 6 cuboids of 2 dimensions, 4 cuboids of 3 dimensions and one cuboid of 4 dimensions. In a similar way we can extend the combinations of computations over several dimensions in the form of a structure. However, no physical structure exists to show this when the number of dimensions is more than three; with three or fewer we can always show it physically. With more than three dimensions we need to think in an abstract sense, in the form of a cube called the Hyper cube.

Questions on Multi Dimension Model

Q 1) What is Hyper cube?

As I mentioned earlier, a Hyper cube is a generic metaphor for representing multi dimensional data: a group of data cells arranged by the dimensions of the data. For example, a spreadsheet exemplifies a two dimensional array, with data cells arranged in rows and columns, each being a dimension: the row is one dimension and the column is another. In a similar way, a database table is also represented in the form of rows and columns. So spreadsheets and database tables are both metaphors, for representing data in Excel sheets and data in database tables. In a similar way, the Hyper cube is a generic metaphor for representing multi dimensional data.

Since it is a complex topic, I will now go into the concept of the hyper cube in more detail. Just as spreadsheets use worksheets and databases use tables, hyper cubes are used to understand multi dimensional views. For example, consider a two dimensional cross tabulation report which models location and product to measure sales. This two dimensional grid provides all possible combinations of locations and products. That means if it has K locations and P products, sales are computed for each of the K x P combinations of location and product. How is it viewed?

In fact we can view the hyper cube in an abstract manner by understanding how the cube is built in one dimension, how it is viewed in two dimensions and how it is viewed in three dimensions. I will explain each step by step. If we have data in one dimension, we can view it as a single row or column. If we have two dimensions, like product and location in our case, it is a matrix or a table. With three dimensions it is a cube. If the number of dimensions is more than 3, imagining the hyper cube is quite difficult, in the sense that we can't represent it physically in the form of a structure; you should understand it in an abstract sense.

Imagine the control panel of a stereo sound system. Each slider controls one aspect of the sound, such as balance, volume, bass and treble. All of these are knobs that can be adjusted. In a similar way, if we have more than three controls, we can add this higher dimensional data to a two dimensional grid in the same manner. What happens if we add more and more dimensions, such as payment methods, coupons etc.? The grid becomes a cube. If I add the payment method dimension to the location and product grid, it becomes a three dimensional cube; if coupons is added as well, it becomes a hyper cube. Any number of dimensions can be added, but no physical metaphor exists for more than 3 dimensions. This is how we should interpret the generalization to two dimensions, three dimensions and beyond three dimensions.

Q 2) Can you explain the cube by taking measures and dimensions?

If we look at common commercial tools like Cognos and Business Objects, there is a concept of a cube. The OLAP software packages support this transformation with the concept of the Power Cube, which is available as a part of the Cognos tool. Cognos provides a component called Transformer, whose job is to transform the data into the form of a cube. This is nothing but a hyper cube, but Cognos names it a power cube; in this sense a power cube is the same as a hyper cube. It is basically used to organize the data into selected business perspectives. For example, the power cube shown in the slide involves measures and dimensions: Time is a dimension, Status is a dimension, and performance indicators and salaries are measures. When you take a particular category of values in each dimension you get a combination, and for that combination we calculate the aggregate values of all these measures. The cell contains these aggregated values.

As I mentioned earlier, no physical metaphor exists for the hyper cube, but I can always map it onto a two dimensional plane. For example, here there are 4 dimensions, and I can add as many dimensions as I want. If I define the measures across the different combinations of dimensions, the measures are automatically calculated based on the functions defined. Once the functions are triggered, a cube is generated and shown here. So a hyper cube is simulated by the facility that Cognos provides under the name Transformer.

Extending the answer: Representing Multi Dimensional Data

Since this is an abstract concept, I would like to extend the discussion by physically representing the data in both forms. In a two dimensional representation, a cube represents the data in an array, whereas a relational table represents multi dimensional data only in two dimensions, as rows and columns. Suppose you take a two dimensional array and ask: what is the total revenue generated by sales in each city in the year 2009? If there are 4 quarters in a year, that means: for each quarter and in each city, what are the total sales?
So this is a two dimensional grid, and the measure here is total sales. The choice of representation always depends on the types of queries that the end user asks. There are two different kinds of representation: array based and table based. In the table representation, to represent all the sales in each city for each quarter of 2009 we need three columns: the first column for the city, the second for the quarter name and the third for storing the measure. So storing the data in a relational table requires three fields. If we represent the same data in a matrix, it requires only two dimensions. For example, in the slide we have city values, time values and total revenue measure values, so in this representation there are two dimensions and one measure. The same data is transformed into a matrix representation by taking the database values stored in the rows: the dimension values, which are nothing but the database values, become the headings for the rows and columns. For example Q1, Q2, Q3 and Q4 are the headings for the quarters, and Glasgow, London etc. are the headings for the cities. If we take a combination of these, the value 45677 is the total revenue in quarter 4 in the city of Aberdeen. As I mentioned, both schemes use the concept of cells; in the two dimensional matrix the database values become the row and column headings.

Now consider the query: what is the total revenue generated by property sales for each type of property (flat or house), in each city, in each quarter of 2009? Four columns are required to represent this in a relational table, while the matrix needs a third dimension. Depending on the combination of attributes, the aggregate operation is applied and the cell for that combination holds the measure. That is why measures are associated with dimensions.

Q) What are the different kinds of functional support provided for data cubes in commercial databases?

Several very popular commercial databases exist in the market today, such as DB2, SQL Server, Oracle, Ingres and Postgres, and all of these packages now provide functional support for the data cube. That means the functionality of traditional SQL has been extended with additional features for multi dimensional operations. As I mentioned, there are different kinds of multi dimensional operations, like slicing, dicing, pivoting, rollup and drill down. All these operations require a lot of aggregation and computation at the back end of the database, so most commercial databases have pushed this functionality into the database level by incorporating various powerful operators. Oracle, for example, provides two powerful operators in OLAP for aggregating over combinations of dimensions: Cube and Grouping Sets. Cube is used to find the aggregations across all combinations of k dimensions, and Grouping Sets is used to compute selected combinations of aggregations.

Q) Explain the lattice concept with example.
There are different kinds of subsets of a given set. If you have k elements in a set, you can have 2^k subsets, including the empty subset. What relationship exists among these subsets? Any k element subset always contains its k-1 element subsets. For instance, a 2 element subset is always contained in some 3 element subset. What is the meaning of this containment? It is nothing but a kind of subset ordering. I can always arrange these 2^k subsets in a partial order, and the relationship is called containment. Under this order, any subset sits above the smaller subsets it contains, such as its one element subsets. In a similar way, any subset is always contained in the original set. If you observe this, we have a partial ordering and also a greatest lower bound and a least upper bound. This is nothing but a lattice: a partially ordered set with bounds. The same concept is widely used in representing multi dimensional data analysis.

I will explain how the lattice is represented. Suppose we have 3 dimensions: Product, City and Date. We consider these as our 3 elements, which are nothing but the perspectives or dimensions. From these we get three 2 element combinations, {Product, City}, {Product, Date} and {Date, City}, together with the 3 element set {Product, City, Date}, and these form a lattice of cuboids. Once we have the 2 dimensional computations, we can also compute the total sales for each product over all cities and over all dates. From a one dimensional cuboid we can compute the grand sales over all products, over all cities and over all dates. If you observe the grand sales and the base cuboid, the base cuboid is the greatest lower bound and the apex cuboid is the least upper bound.
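The lattice of cuboids can be enumerated directly as the subsets of the dimension set. The following is a small sketch under that reading; the function names `cuboids` and `finer_cuboids` are hypothetical and chosen only for this illustration.

```python
from itertools import combinations

DIMENSIONS = ("product", "city", "date")


def cuboids(dimensions):
    """Return every subset of the dimensions, from the apex (no dimensions)
    to the base cuboid (all dimensions)."""
    return [
        frozenset(combo)
        for size in range(len(dimensions) + 1)
        for combo in combinations(dimensions, size)
    ]


def finer_cuboids(cuboid, dimensions):
    """Cuboids one level more detailed: the given cuboid plus one extra dimension."""
    return [cuboid | {dim} for dim in dimensions if dim not in cuboid]


lattice = cuboids(DIMENSIONS)
apex = frozenset()            # grand total over everything
base = frozenset(DIMENSIONS)  # one cell per (product, city, date)

for cuboid in lattice:
    neighbours = [sorted(c) for c in finer_cuboids(cuboid, DIMENSIONS)]
    print(sorted(cuboid), "->", neighbours)
```

With 3 dimensions the list has 2^3 = 8 cuboids, and every one of them lies between the apex and the base under set containment, which is the partial order described above. Calling `cuboids` with four dimension names gives 16 cuboids.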

That is why a cube is nothing but a collection of cuboids, and each cuboid is nothing but an aggregation. If we integrate the one dimensional, two dimensional and three dimensional aggregations, that is a data cube.

Q) Why should we take the cuboid in multi dimensional model and not any polygon, when there are 4 or more dimensions?

No physical metaphor exists for more than 3 dimensions. Physically we can't view it; we need to think in an abstract sense. Cuboid is a mathematical term, brought in from discrete mathematics, for representing certain types of aggregations, and the representation of aggregations always follows the containment principle. Just as the relational model has a strong mathematical base, which is nothing but the set concept, the same holds here. What does polygon mean? Polygon is a general term; poly means many. In a way a cuboid is a kind of polygon. The technical name given to the 4 dimensional cube is the Tesseract, and there is no direct visual representation for a tesseract.

If you observe the above slide, all is the 0-dimensional cuboid; product, city and date are the one-dimensional cuboids, and so forth. What happens when we add one more dimension? We just discussed that the visual representation for 4-D is messy. Even though the diagram is messy, we can understand the concept by representing it in the form of a lattice.

Q) What is the difference between Relational DBMS and Multi Dimensional DBMS?

From the functionality point of view, both are used for certain kinds of activities. I will illustrate with an example that has two dimensions and one measure, which is also shown in a multi dimensional representation. In the multi dimensional database the data is transformed into a square, because here we have only two dimensions. The values of the columns are used as the column and row headings in the multi dimensional model. If each dimension has three values, the relational table needs 27 cells in total (9 rows of 3 columns), whereas the multi dimensional database requires only 9 cells for the same data. If you extend the complexity by adding one more column, say Dealership, we have 4 columns in the relation. If I select a particular dealer, I get a matrix for that dealer. Observe the complexity here: when one dimension is added, the multi dimensional data model always requires fewer cells than the relational table representation. Operations on a relational table are much slower than operations on the multi dimensional cube. Suppose I add one more dimension, say Time; then each time value corresponds to a 3-dimensional cube. Hence multi dimensional structures are much faster than relational tables, because they need less storage space and accommodate more values in less storage.

Establish the need for Multi Dimensional Analysis

Generally we ask one dimensional queries, for example: how many units of item A did we sell in the store located in Delhi? The other query shown on the slide is: how much revenue did the new item X generate during the last six months, broken down by individual months, in AP state, by individual stores, broken down by promotions (p1, p2), compared to estimates and compared to the previous version of the product? For efficient analysis, the decision maker should be equipped with an easy way of carrying out complex analysis along different business dimensions. We can establish such an environment using a representation of data called the multi dimensional model. The basic advantage of this data model is that it provides easy and flexible access to information: decision makers are able to analyze the data along any number of dimensions, at any level of aggregation, with the capability of viewing the results in a varying number of ways. They must also be able to navigate the results from one level of summarization to the next.
Such power does not exist in 1-D queries. Without a solid system offering this kind of facility, the purpose of using a data warehouse is incomplete. That is the reason why multi dimensional analysis is so widely used. Of course, time is also an important dimension in any system. Every analytical query is executed with time as one of the dimensions, and an analytical system must recognize the sequential nature of time. Because of these factors, traditional systems are inappropriate for answering complex queries.

The concept hierarchy is very important and widely used in data warehouses. A concept hierarchy defines a sequence of mappings from a set of low level concepts to higher level concepts which are more general in nature. There are different kinds of hierarchies, for Industry, Region and Time. Here Time has two different kinds of hierarchies; in one, a collection of days is a Month, a collection of months is a Quarter and a collection of quarters is a Year. Another hierarchy, over location, is office > city > country > region. The purpose of using hierarchies in OLAP is basically to provide a better navigation facility for decision makers.

Q) Explain why multi dimensional analysis is important?

From the analysis point of view this is a very important structure, because it gives easy and flexible access, in the sense that you can retrieve any kind of aggregation just by querying. Any kind of ad-hoc query can be answered. Analysis of the data is easy, and we can also show the results at varying levels. That means we do the computations bottom to top or top to bottom, which in OLAP we call drill up and drill down.

Q) What are the Constraints applied on the OLAP?

Constraints can be seen from a theoretical point of view and from an implementation point of view. From the implementation point of view, as I mentioned earlier, there are various extensions available in SQL with operator names such as Rollup and Cube. The disadvantage of these operators is that there is no way to compute just a desired set of aggregations using them. If you use rollup with k dimensions, you get k+1 aggregations. If you use cube with k dimensions, you get 2^k aggregations. So if analysts want to view only a desired set of combinations of aggregations, there is no automatic way to do so with the rollup and cube operators. To support this, packages enrich the functionality of SQL with an analytical concept called the Grouping function (grouping sets). This grouping function is used to compute the desired combinations of aggregates, which gives analysts the flexibility to specify the combinations they want. In addition, many packages also provide partial rollup and partial cube computations by extending the grouping function operator.
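The difference between the three operators is easiest to see by listing which groupings each one produces. The sketch below only generates the lists of grouping columns in Python; it does not run any SQL, and the function names `rollup_groupings` and `cube_groupings` and the `wanted` list are assumptions made for this illustration.

```python
from itertools import combinations


def rollup_groupings(dims):
    """Groupings of a rollup over d1..dk: the k+1 prefixes,
    from all dimensions down to the grand total ()."""
    return [tuple(dims[:n]) for n in range(len(dims), -1, -1)]


def cube_groupings(dims):
    """Groupings of a cube over d1..dk: all 2**k subsets of the dimensions."""
    return [
        combo
        for size in range(len(dims), -1, -1)
        for combo in combinations(dims, size)
    ]


dims = ("product", "city", "date")
rollup = rollup_groupings(dims)
cube = cube_groupings(dims)

# With grouping sets, the analyst names only the combinations that are wanted.
wanted = [("product", "city"), ("date",), ()]

print(len(rollup), rollup)
print(len(cube), cube)
print(wanted)
```

For k = 3 the rollup yields 4 groupings and the cube yields 8. The `wanted` list is neither of the two: it contains `("date",)`, which the rollup never produces, and it leaves out most of the cube, which is exactly the flexibility that grouping sets add.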
The same question can also be answered theoretically. There are many ways to compute cubes in theory, for example with Iceberg queries and the BUC algorithm.

Q) What is Navigation and what are the OLAP operations that provide navigation?

Navigation is basically used to move from one level of concept to another, for example from low level concepts to higher level concepts. In the lattice structure the low level concept is the n-dimensional cuboid and the 0-dimensional cuboid is the higher level structure. Once we know the real world data, that is, once we have the low level aggregations, we can compute all the higher level aggregations. In OLAP this is possible using two powerful operators, namely rollup and drill down. Drill down is the reverse of rollup and vice versa: drill down moves from a high level cuboid to a low level cuboid, and roll up moves from a low level cuboid to a high level cuboid.

Roll Up: {product, city, date} → {product, city} → {all}
Drill Down: {all} → {product, city} → {product, city, date}

OLAP Operators

The summary of all the OLAP operators is shown in the slide below. These five operators give decision makers easy and flexible access for computing different kinds of aggregations. They are a very convenient set of operators for end users.

Data Cube Demonstration using SQL

We can easily simulate the Data Cube using SQL operators. As you all know, the popular clause used in SQL for aggregation is the Group by clause. By writing different queries using the group by clause, we can combine the different kinds of aggregations by applying the union operator to the various sub queries. For example, if we have two dimensions, as I mentioned, we have 2^2 = 4 combinations of aggregations. One query is for the grand sum, another is for the base cuboid, and the other two queries are aggregations on each of the remaining dimensions. When you have these 4 queries, one for each cuboid, the union operator of SQL combines the results of all 4 queries into a single result set. So whatever results the data cube gives over any number of dimensions, those results can be simulated in a single query using SQL: we write one sub query per cuboid and unite them using the union operator. In this way we can easily get the full set of aggregated values required to represent the data cube.
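As a rough illustration of the four-query approach for two dimensions, the sketch below runs the SQL through Python's built-in sqlite3 module. The `sales` table, its columns and its rows are invented for the example, and NULL is used here to mark an aggregated-away dimension; UNION ALL is used instead of plain UNION simply to keep every row as it is.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (product TEXT, city TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("TV", "Delhi", 10),
        ("TV", "Mumbai", 30),
        ("PC", "Delhi", 5),
        ("PC", "Mumbai", 15),
    ],
)

# One sub query per cuboid of the 2-D cube, combined into a single result set.
CUBE_QUERY = """
SELECT product, city, SUM(amount) FROM sales GROUP BY product, city  -- base cuboid
UNION ALL
SELECT product, NULL, SUM(amount) FROM sales GROUP BY product        -- per product
UNION ALL
SELECT NULL, city, SUM(amount) FROM sales GROUP BY city              -- per city
UNION ALL
SELECT NULL, NULL, SUM(amount) FROM sales                            -- grand sum
"""

cube_rows = conn.execute(CUBE_QUERY).fetchall()
for row in cube_rows:
    print(row)
```

The result has one row per cell of every cuboid: 4 base cells, 2 product totals, 2 city totals and 1 grand total. With k dimensions the same pattern needs 2^k sub queries, which is why databases offer a cube operator instead.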