Decision Support Systems 2012/2013
MEIC - TagusPark
Homework #1
Due: 11.Mar.2013

1 Data Description and Pre-processing

1. A hospital is conducting a study on obesity in adult men and, as part of that study, tested the age and body fat for 18 randomly selected adults, with the following results:

    Age     38    27    48    17    33    32    38    38    26
    % Fat   17.1  20.7  25.2  13.6  13.4  19.8  11.7  17.3  15.6

    Age     34    33    41    45    46    26    35    22    23
    % Fat   20.5  22.3  20.2  17.5  22.7  7.5   27.2  9.2   16.1

(a) (1/2 val.) Compute the mean, median and standard deviation for Age and % Fat. Include the expressions you used in your calculations and any additional elements you find relevant.

Note: In your calculations, take into consideration that the data above concerns a sample of the population, not the whole population.

The mean can be computed using the expression:

\[ \bar{X}_{\text{Age}} = \frac{1}{N} \sum_{i=1}^{N} x_i = 33.44, \qquad \bar{X}_{\text{Fat}} = \frac{1}{N} \sum_{i=1}^{N} x_i = 17.64. \]

The median can be computed by sorting the data and determining the middle element. In this case, since we have an even number of data-points, the median is the average of the two middle elements of the sorted data:

\[ \text{median}_{\text{Age}} = \frac{x_{N/2} + x_{N/2+1}}{2} = 33.5, \qquad \text{median}_{\text{Fat}} = \frac{x_{N/2} + x_{N/2+1}}{2} = 17.4. \]

Finally, the (sample) variance can be computed as:

\[ S^2_{\text{Age}} = \frac{1}{N-1} \sum_{i=1}^{N} (x_i - \bar{X}_{\text{Age}})^2 = 75.91, \qquad S^2_{\text{Fat}} = \frac{1}{N-1} \sum_{i=1}^{N} (x_i - \bar{X}_{\text{Fat}})^2 = 27.95, \]

and we get

\[ s_{\text{Age}} = \sqrt{S^2_{\text{Age}}} = 8.71, \qquad s_{\text{Fat}} = \sqrt{S^2_{\text{Fat}}} = 5.29. \]
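As a cross-check of the expressions in (a), the following is a minimal Python sketch that recomputes the four statistics from the data table. It assumes only the standard `statistics` module, whose `variance` and `stdev` functions use the sample (N - 1) denominator; the helper name `describe` is an illustrative choice and not part of the homework.

```python
import statistics

# Data copied from the table in Question 1.
age = [38, 27, 48, 17, 33, 32, 38, 38, 26,
       34, 33, 41, 45, 46, 26, 35, 22, 23]
fat = [17.1, 20.7, 25.2, 13.6, 13.4, 19.8, 11.7, 17.3, 15.6,
       20.5, 22.3, 20.2, 17.5, 22.7, 7.5, 27.2, 9.2, 16.1]


def describe(values):
    """Return (mean, median, sample variance, sample standard deviation)."""
    return (
        statistics.mean(values),
        statistics.median(values),    # averages the two middle values when N is even
        statistics.variance(values),  # sample variance: divides by N - 1
        statistics.stdev(values),     # square root of the sample variance
    )


age_stats = describe(age)
fat_stats = describe(fat)

for name, (mean, median, var, sd) in (("Age", age_stats), ("% Fat", fat_stats)):
    print(f"{name}: mean={mean:.2f} median={median:.2f} var={var:.2f} sd={sd:.2f}")
```

Using `statistics.pvariance` and `statistics.pstdev` instead would divide by N, which corresponds to treating the 18 adults as the whole population rather than a sample.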
(b) (1/2 val.) Draw a scatter plot and a q-q plot based on the two variables. Include a brief explanation of the plots.

The scatter plot is obtained by plotting each pair of data-points $(x_{\text{Age}}, x_{\text{Fat}})$ as they appear in the original table. The resulting plot is:

[Figure: Scatter plot of Age vs. % of body fat. Horizontal axis: Age (years); vertical axis: Body fat (%).]

The q-q plot, on the other hand, can be obtained by pairing the quantiles of the two attributes. In this case, since both attributes have the same number of data-points, the q-q plot can easily be obtained by sorting the values in both attributes and plotting the resulting pairs, to yield:

[Figure: Q-Q plot of Age vs. % of body fat. Horizontal axis: Age (years); vertical axis: Body fat (%).]

(c) (1 val.) Normalize the two variables using min-max normalization, so that the data fits in the interval [0, 1]. Include the expressions you used in your calculations and any additional elements you find relevant.
To normalize the two variables, we use, in each dataset,

\[ x_{i,\text{norm}} = \frac{x_i - \min_j x_j}{\max_j x_j - \min_j x_j} \]

to get:

    Age norm     0.68  0.32  1.00  0.00  0.52  0.48  0.68  0.68  0.29
    % Fat norm   0.49  0.67  0.90  0.31  0.30  0.62  0.21  0.50  0.41

    Age norm     0.55  0.52  0.77  0.90  0.94  0.29  0.58  0.16  0.19
    % Fat norm   0.66  0.75  0.64  0.51  0.77  0.00  1.00  0.09  0.44

(d) (1 val.) Calculate the correlation coefficient (Pearson's product moment coefficient) between the two attributes. Based on this computation, justify whether the two variables are positively or negatively correlated.

The (sample) correlation coefficient between Age and % Fat can be computed as

\[ \operatorname{corr}(X, Y) = \frac{1}{N-1} \sum_{i=1}^{N} \frac{x_i - \bar{X}}{s_X} \cdot \frac{y_i - \bar{Y}}{s_Y}. \]

In our case, we have corr(Age, % Fat) = 0.54 and we can conclude that, since corr(Age, % Fat) > 0, Age and % Fat are positively correlated.

2. (3 val.) Suppose that you want to predict the value of some (discrete) variable Y knowing that some other variable, X, takes some given value, x. For example, returning to the setup of Question 1, you could be interested in predicting the value of % Fat for a person with Age = 35. Show that, in terms of expected squared error:

\[ \mathcal{E} = \mathbb{E}\left[ (Y - c)^2 \mid X = x \right], \]

the value $c = \mathbb{E}[Y \mid X = x]$ is the best possible prediction for Y. Indicate all relevant computations.

Suggestion: Compute the minimum of the above expression with respect to c. Recall that $\mathbb{E}[f(Y) \mid X = x]$ denotes the expected value of $f(Y)$ conditioned on the random variable X taking the value x, and is analytically given by

\[ \mathbb{E}[f(Y) \mid X = x] = \sum_y f(y) \, \mathbb{P}[Y = y \mid X = x]. \]

In order to compute the value of c that minimizes the squared error $\mathcal{E}$, we differentiate it with respect to c, to yield:

\[
\begin{aligned}
\frac{d\mathcal{E}}{dc} &= \frac{d}{dc} \mathbb{E}\left[ (Y - c)^2 \mid X = x \right] \\
&= \mathbb{E}\left[ \frac{d}{dc} (Y - c)^2 \mid X = x \right] && \text{(due to the linearity of } \mathbb{E}[\cdot]\text{)} \\
&= \mathbb{E}\left[ -2(Y - c) \mid X = x \right] \\
&= -2\,\mathbb{E}[Y \mid X = x] + 2c.
\end{aligned}
\]

Equating the above expression to 0 and solving for c, we finally get:

\[ c = \mathbb{E}[Y \mid X = x]. \]

Since $d^2\mathcal{E}/dc^2 = 2 > 0$, this stationary point is indeed a minimum.

2 OLAP Queries

3. Suppose that a data warehouse for a technical support company is structured around three dimensions, Time, Technician and Client, and the measures Count and Charge. The measure Count keeps track of the number of times that a client/technician was assisted/called upon. The measure Charge keeps track of the payments charged to each customer upon a visit by a technician.

(a) (1 val.) Starting with the cuboid [Day, Technician, Client], what specific OLAP operations should be performed in order to list the total fee collected by each technician in 2012?

Note: In your answer, you are free to assume any (reasonable) hierarchy for each of the different dimensions in the DW. You should indicate them in your answer.

We assume that each of the three dimensions is organized according to the following hierarchies:

- Time is organized in the hierarchy Day-Month-Semester-Year-all;
- Technician is organized in Technician-Expertise-Area-all;
- Client is organized in Client-Neighborhood-City-State-all.

Given these hierarchies, we would require the following OLAP operations:

1. A roll-up on the dimension Time from Day to Year.
2. A roll-up on the dimension Client from Client to all, to aggregate on this dimension.
3. A slice on the dimension Time to select Year = 2012.

(b) (1 val.) Write a Transact-SQL query to obtain the same result, assuming that the data are stored in a relational database with the scheme Fee(Day, Month, Year, Technician, Store, Client, Charge). You should use the relevant OLAP operators you practiced in the lab session (indicate only the query).

A possible T-SQL query would be:

    SELECT Technician, SUM(Charge)
    FROM Fee
    WHERE Year = 2012
    GROUP BY Technician WITH ROLLUP

where we used ROLLUP to also include the total amount charged in 2012 (aggregated over all technicians).

4. (2 val.) Give an example of a query that uses grouping with ROLLUP that cannot be expressed by a single SELECT clause. Besides your query, you should indicate the SELECT clauses necessary to obtain the same information. In your example, you can use the JoBS database used in the lab session (indicate only the queries, not the result).

Resorting to the JoBS database, as suggested, the query

    SELECT CASE WHEN GROUPING(E.EngineerName) = 1
                THEN 'All Engineers' ELSE E.EngineerName END,
           CASE WHEN GROUPING(S.PartNumber) = 1
                THEN 'All Parts' ELSE S.PartNumber END,
           SUM(S.UnitsHeld)
    FROM EngineerStock S
         INNER JOIN Engineers E ON S.EngineerId = E.EngineerId
    GROUP BY E.EngineerName, S.PartNumber WITH ROLLUP

would require three SELECT clauses if the ROLLUP clause were not used,

    -- Units held per engineer and part
    SELECT E.EngineerName, S.PartNumber, SUM(S.UnitsHeld)
    FROM EngineerStock S
         INNER JOIN Engineers E ON S.EngineerId = E.EngineerId
    GROUP BY E.EngineerName, S.PartNumber
    UNION
    -- Subtotal per engineer, over all parts
    SELECT E.EngineerName, 'All Parts', SUM(S.UnitsHeld)
    FROM EngineerStock S
         INNER JOIN Engineers E ON S.EngineerId = E.EngineerId
    GROUP BY E.EngineerName
    UNION
    -- Grand total, over all engineers and parts
    SELECT 'All Engineers', 'All Parts', SUM(S.UnitsHeld)
    FROM EngineerStock S
         INNER JOIN Engineers E ON S.EngineerId = E.EngineerId

in order to compute the total parts per Engineer and the total number of parts (overall).

2.1 Practical Questions (Using SQL Server 2008)

For the following questions, you should use the AdventureWorksDW2012 database. To this purpose, at the beginning of your code you should include the following SQL statement:

    USE AdventureWorksDW2012;
    GO

You can use the Object Explorer in the MS SQL Server Management Studio to explore the tables in this database as well as the attributes in each table. For completeness, Fig. 1 includes a (simplified) representation of the relevant tables and attributes for this homework, where the attributes in italic correspond to primary keys.

5. (3 val.) Write down the SQL query necessary to obtain the relation described in Fig. 2 from the table dbo.factinternetsales (see Fig. 1). This relation describes, for each order in dbo.factinternetsales, the first and last name of the corresponding customer, the postal code, state and country associated with the customer, the name of the product in the order, its shipping date, and the total amount paid by the customer. You need only to indicate the query, not the results.

In terms of the relations in Fig. 1,

- The attribute OrderNumber in the new table corresponds to the attribute SalesOrderNumber in dbo.factinternetsales;
- The attribute State in the new table corresponds to the attribute StateProvinceCode in the table dbo.dimgeography;

    dbo.dimcustomer:       CustomerKey, GeographyKey, FirstName, LastName, BirthDate, ...
    dbo.factinternetsales: SalesOrderNumber, ProductKey, ShipDateKey, CustomerKey, SalesAmount, ...
    dbo.dimgeography:      GeographyKey, EnglishCountryRegionName, City, StateProvinceCode, CountryRegionCode, PostalCode, ...
    dbo.dimproduct:        ProductKey, EnglishProductName, ModelName, ProductLine, ...
    dbo.dimdate:           DateKey, FullDateAlternateKey, CalendarYear, ...

Figure 1: Simplified schema of the AdventureWorksDW2008 that includes only the relevant tables and attributes.

    (NoName): OrderNumber | FirstName | LastName | PostalCode | State | Country | Product | ShipDate | Total

Figure 2: Table for question 5.

- The attribute Country in the new table corresponds to the attribute CountryRegionCode in the table dbo.dimgeography;
- The attribute Product in the new table corresponds to the attribute EnglishProductName in dbo.dimproduct;
- The attribute ShipDate in the new table corresponds to the attribute FullDateAlternateKey in dbo.dimdate;
- The attribute Total in the new table corresponds to the attribute SalesAmount in the table dbo.factinternetsales.

In your query you should use adequate JOIN operations. In particular, note that there may be orders for which not all information above may be available, but which should still be included in the results of your query.

The SQL query would be:

    SELECT F.SalesOrderNumber AS OrderNumber,
           C.FirstName,
           C.LastName,
           G.PostalCode,
           G.StateProvinceCode AS State,
           G.CountryRegionCode AS Country,
           P.EnglishProductName AS Product,
           D.FullDateAlternateKey AS ShipDate,
           F.SalesAmount AS Total
    FROM dbo.factinternetsales F
         -- LEFT JOINs keep orders even when dimension data is missing
         LEFT JOIN dbo.dimcustomer C ON F.CustomerKey = C.CustomerKey
         LEFT JOIN dbo.dimgeography G ON C.GeographyKey = G.GeographyKey
         LEFT JOIN dbo.dimproduct P ON F.ProductKey = P.ProductKey
         LEFT JOIN dbo.dimdate D ON F.ShipDateKey = D.DateKey

6. From the table dbo.factinternetsales (see Fig. 1),

(a) (2 val.) Write down the SQL query necessary to determine the total sales amount per calendar year. You should include both the query and the obtained results.

The SQL query is:

    SELECT D.CalendarYear AS Year,
           SUM(F.SalesAmount) AS Total
    FROM dbo.factinternetsales F
         INNER JOIN dbo.dimdate D ON F.ShipDateKey = D.DateKey
    GROUP BY D.CalendarYear

The corresponding values are:

    Year    Total
    2007     9,517,548.53
    2008    10,158,562.38
    2005     3,105,587.33
    2006     6,576,978.98

(b) (2 val.) With a single query, determine both the global and the per-year total sales amount (Suggestion: Use a ROLLUP clause). You should include both the SQL query and the obtained results.

The SQL query is:

    SELECT D.CalendarYear AS Year,
           SUM(F.SalesAmount) AS Total
    FROM dbo.factinternetsales F
         INNER JOIN dbo.dimdate D ON F.ShipDateKey = D.DateKey
    GROUP BY D.CalendarYear WITH ROLLUP

The corresponding values are:

    Year    Total
    2005     3,105,587.33
    2006     6,576,978.98
    2007     9,517,548.53
    2008    10,158,562.38
    All     29,358,677.22

(c) (3 val.) Using the CUBE clause, determine the total sales amount across the two dimensions: Year, corresponding to the CalendarYear attribute in table dbo.dimdate (associated with the shipping date), and Country, corresponding to the EnglishCountryRegionName attribute in table dbo.dimgeography. Write down the adequate SQL query and express the results as a cross-tabulation.

The SQL query is:

    SELECT D.CalendarYear AS Year,
           G.EnglishCountryRegionName AS Country,
           SUM(F.SalesAmount) AS Total
    FROM dbo.factinternetsales F
         INNER JOIN dbo.dimdate D ON F.ShipDateKey = D.DateKey
         INNER JOIN dbo.dimcustomer C ON F.CustomerKey = C.CustomerKey
         INNER JOIN dbo.dimgeography G ON C.GeographyKey = G.GeographyKey
    GROUP BY D.CalendarYear, G.EnglishCountryRegionName WITH CUBE

The corresponding cross-tabulation is:

                      2005          2006          2007          2008           All
    Australia         1,251,388.1   2,166,222.5   3,002,149.1    2,641,240.9    9,061,000.6
    Canada              143,251.5     618,206.8     507,224.8      709,161.8    1,977,844.9
    France              172,716.1     508,910.0     992,681.6      969,710.0    2,644,017.7
    Germany             219,372.8     528,003.6   1,021,797.2    1,125,138.8    2,894,312.3
    United Kingdom      280,855.7     583,463.6   1,271,712.8    1,255,680.0    3,391,712.2
    United States     1,038,003.1   2,172,172.5   2,721,983.1    3,457,630.8    9,389,789.5
    All               3,105,587.3   6,576,979.0   9,517,548.5   10,158,562.4   29,358,677.2
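
The "All" row and column produced by CUBE are the marginal sums of the inner cells, so the cross-tabulation can be sanity-checked without access to the database. The following is a minimal Python sketch of such a check, using the figures copied from the cross-tabulation above. The function name `margins_consistent` and the tolerance of 0.5 (chosen to absorb the rounding of each cell to one decimal place) are illustrative assumptions.

```python
# Inner cells of the cross-tabulation, one list per country,
# ordered by year: 2005, 2006, 2007, 2008.
CELLS = {
    "Australia":      [1251388.1, 2166222.5, 3002149.1, 2641240.9],
    "Canada":         [143251.5, 618206.8, 507224.8, 709161.8],
    "France":         [172716.1, 508910.0, 992681.6, 969710.0],
    "Germany":        [219372.8, 528003.6, 1021797.2, 1125138.8],
    "United Kingdom": [280855.7, 583463.6, 1271712.8, 1255680.0],
    "United States":  [1038003.1, 2172172.5, 2721983.1, 3457630.8],
}

# "All" column (one total per country) and "All" row (one total per year).
ROW_TOTALS = {
    "Australia": 9061000.6,
    "Canada": 1977844.9,
    "France": 2644017.7,
    "Germany": 2894312.3,
    "United Kingdom": 3391712.2,
    "United States": 9389789.5,
}
COLUMN_TOTALS = [3105587.3, 6576979.0, 9517548.5, 10158562.4]
GRAND_TOTAL = 29358677.2


def margins_consistent(cells, row_totals, column_totals, grand_total, tol=0.5):
    """Check that every marginal total matches the sum of its cells within tol."""
    rows_ok = all(
        abs(sum(values) - row_totals[country]) <= tol
        for country, values in cells.items()
    )
    cols_ok = all(
        abs(sum(values[j] for values in cells.values()) - column_totals[j]) <= tol
        for j in range(len(column_totals))
    )
    total_ok = abs(sum(column_totals) - grand_total) <= tol
    return rows_ok and cols_ok and total_ok


print(margins_consistent(CELLS, ROW_TOTALS, COLUMN_TOTALS, GRAND_TOTAL))
```

A check of this kind only confirms that the table is internally consistent; it does not confirm that the query was run against the intended data.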