Decision Support Systems 2012/2013. MEIC - TagusPark. Homework #1. Due: 11.Mar.2013

1 Data Description and Pre-processing

1. A hospital is conducting a study on obesity in adult men and, as part of that study, tested the age and body fat for 18 randomly selected adults, with the following results:

    Age      38    27    48    17    33    32    38    38    26
    % Fat    17.1  20.7  25.2  13.6  13.4  19.8  11.7  17.3  15.6

    Age      34    33    41    45    46    26    35    22    23
    % Fat    20.5  22.3  20.2  17.5  22.7  7.5   27.2  9.2   16.1

(a) (1/2 val.) Compute the mean, median and standard deviation for Age and % Fat. Include the expressions you used in your calculations and any additional elements you find relevant.

Note: In your calculations, take into consideration that the data above concerns a sample of the population, not the whole population.

The mean can be computed using the expression:

$$\bar{X}_{Age} = \frac{1}{N} \sum_{i=1}^{N} x_i = 33.44, \qquad \bar{X}_{Fat} = \frac{1}{N} \sum_{i=1}^{N} x_i = 17.64.$$

The median can be computed by sorting the data and determining the middle element. In this case, since we have an even number of data points, the median is the average of the two middle elements of the sorted data:

$$\mathrm{median}_{Age} = \frac{x_{N/2} + x_{N/2+1}}{2} = 33.5, \qquad \mathrm{median}_{Fat} = \frac{x_{N/2} + x_{N/2+1}}{2} = 17.4.$$

Finally, the (sample) variance can be computed as:

$$S^2_{Age} = \frac{1}{N-1} \sum_{i=1}^{N} (x_i - \bar{X}_{Age})^2 = 75.90, \qquad S^2_{Fat} = \frac{1}{N-1} \sum_{i=1}^{N} (x_i - \bar{X}_{Fat})^2 = 27.95,$$

and we get

$$s_{Age} = \sqrt{S^2_{Age}} = 8.71, \qquad s_{Fat} = \sqrt{S^2_{Fat}} = 5.29.$$
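As a quick numerical check of the values above, the following Python sketch recomputes the three statistics with the standard library. The names `age`, `fat` and `describe` are chosen here for illustration and are not part of the homework; `statistics.variance` and `statistics.stdev` use the N - 1 denominator, which matches the sample formulas.

```python
import statistics

# Data from the table in Question 1 (18 adults).
age = [38, 27, 48, 17, 33, 32, 38, 38, 26,
       34, 33, 41, 45, 46, 26, 35, 22, 23]
fat = [17.1, 20.7, 25.2, 13.6, 13.4, 19.8, 11.7, 17.3, 15.6,
       20.5, 22.3, 20.2, 17.5, 22.7, 7.5, 27.2, 9.2, 16.1]


def describe(values):
    """Return (mean, median, sample variance, sample standard deviation)."""
    return (
        statistics.mean(values),
        statistics.median(values),    # averages the two middle values when N is even
        statistics.variance(values),  # divides by N - 1 (sample variance)
        statistics.stdev(values),     # square root of the sample variance
    )


for name, values in (("Age", age), ("% Fat", fat)):
    mean, median, var, std = describe(values)
    print(f"{name}: mean={mean:.2f} median={median:.1f} "
          f"variance={var:.2f} std={std:.2f}")
```

Using `statistics.pvariance` or `statistics.pstdev` instead would divide by N, which is the choice the note in the question warns against.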

(b) (1/2 val.) Draw a scatter plot and a q-q plot based on the two variables. Include a brief explanation of the plots.

The scatter plot is obtained by plotting each pair of data points $(x_{Age}, x_{Fat})$ as they appear in the original table. The resulting plot is:

[Figure: Scatter plot of Age vs. % of body fat. Horizontal axis: Age (years); vertical axis: Body fat (%).]

The q-q plot, on the other hand, can be obtained by pairing the quantiles of the two attributes. In this case, since both attributes have the same number of data points, the q-q plot can easily be obtained by sorting the values of each attribute separately and plotting the resulting pairs, to yield:

[Figure: Q-Q plot of Age vs. % of body fat. Horizontal axis: Age (years); vertical axis: Body fat (%).]

(c) (1 val.) Normalize the two variables using min-max normalization, so that the data fits in the interval [0, 1]. Include the expressions you used in your calculations and any additional elements you find relevant.
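Min-max normalization maps each value x to (x - min) / (max - min), so that the smallest value becomes 0 and the largest becomes 1. A minimal Python sketch of this rescaling for the data of Question 1 follows; the function name `min_max_normalize` is chosen here for illustration only.

```python
def min_max_normalize(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # The formula divides by (max - min), which is zero for constant data.
        raise ValueError("min-max normalization is undefined for constant data")
    return [(v - lo) / (hi - lo) for v in values]


# Data from the table in Question 1.
age = [38, 27, 48, 17, 33, 32, 38, 38, 26,
       34, 33, 41, 45, 46, 26, 35, 22, 23]
fat = [17.1, 20.7, 25.2, 13.6, 13.4, 19.8, 11.7, 17.3, 15.6,
       20.5, 22.3, 20.2, 17.5, 22.7, 7.5, 27.2, 9.2, 16.1]

age_norm = min_max_normalize(age)
fat_norm = min_max_normalize(fat)

print("Age norm:  ", " ".join(f"{v:.2f}" for v in age_norm))
print("% Fat norm:", " ".join(f"{v:.2f}" for v in fat_norm))
```

Each attribute is normalized with its own minimum and maximum, so the two normalized columns are directly comparable on the [0, 1] scale.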

To normalize the two variables, we use, in each dataset,

$$x_{i,norm} = \frac{x_i - \min_j \{x_j\}}{\max_j \{x_j\} - \min_j \{x_j\}}$$

to get:

    Age norm      0.68  0.32  1.00  0.00  0.52  0.48  0.68  0.68  0.29
    % Fat norm    0.49  0.67  0.90  0.31  0.30  0.62  0.21  0.50  0.41

    Age norm      0.55  0.52  0.77  0.90  0.94  0.29  0.58  0.16  0.19
    % Fat norm    0.66  0.75  0.64  0.51  0.77  0.00  1.00  0.09  0.44

(d) (1 val.) Calculate the correlation coefficient (Pearson's product moment coefficient) between the two attributes. Based on this computation, justify whether the two variables are positively or negatively correlated.

The (sample) correlation coefficient between Age and % Fat can be computed as

$$\mathrm{corr}(X, Y) = \frac{1}{N-1} \sum_{i=1}^{N} \left( \frac{x_i - \bar{X}}{s_X} \right) \left( \frac{y_i - \bar{Y}}{s_Y} \right).$$

In our case, we have corr(Age, % Fat) = 0.54 and we can conclude that, since corr(Age, % Fat) > 0, Age and % Fat are positively correlated.

2. (3 val.) Suppose that you want to predict the value of some (discrete) variable Y knowing that some other variable, X, takes some given value, x. For example, returning to the setup of Question 1, you could be interested in predicting the value of % Fat for a person with Age = 35. Show that, in terms of expected squared error,

$$\mathcal{E} = \mathbb{E}\left[ (Y - c)^2 \mid X = x \right],$$

the value $c = \mathbb{E}[Y \mid X = x]$ is the best possible prediction for Y. Indicate all relevant computations.

Suggestion: Compute the minimum of the above expression with respect to c. Recall that $\mathbb{E}[f(Y) \mid X = x]$ denotes the expected value of $f(Y)$ conditioned on the random variable X taking the value x, and is analytically given by

$$\mathbb{E}[f(Y) \mid X = x] = \sum_y f(y) \, \mathbb{P}[Y = y \mid X = x].$$

In order to compute the value of c that minimizes the squared error $\mathcal{E}$, we differentiate it with respect to c, to yield:

$$\frac{d\mathcal{E}}{dc} = \frac{d}{dc} \mathbb{E}\left[ (Y - c)^2 \mid X = x \right] = \mathbb{E}\left[ \frac{d}{dc} (Y - c)^2 \mid X = x \right] = \mathbb{E}\left[ -2(Y - c) \mid X = x \right] = -2 \, \mathbb{E}[Y \mid X = x] + 2c,$$

where we used the linearity of $\mathbb{E}[\cdot]$. Equating the above expression to 0 and solving for c, we finally get:

$$c = \mathbb{E}[Y \mid X = x].$$

Since $d^2\mathcal{E}/dc^2 = 2 > 0$, this stationary point is indeed a minimum.

2 OLAP Queries

3. Suppose that a data warehouse for a technical support company is structured around three dimensions, Time, Technician and Client, and the measures Count and Charge. The measure Count keeps track of the number of times that a client/technician was assisted/called upon. The measure Charge keeps track of the payments charged to each customer upon a visit by a technician.

(a) (1 val.) Starting with the cuboid [Day, Technician, Client], what specific OLAP operations should be performed in order to list the total fee collected by each technician in 2012?

Note: In your answer, you are free to assume any (reasonable) hierarchy for each of the different dimensions in the DW. You should indicate them in your answer.

We assume that each of the three dimensions is organized according to the following hierarchies:

- Time is organized in the hierarchy Day-Month-Semester-Year-all
- Technician is organized in Technician-Expertise-Area-all
- Client is organized in Client-Neighborhood-City-State-all.

Given these hierarchies, we would require the following OLAP operations:

1. A roll-up on the dimension Time from Day to Year.
2. A roll-up on the dimension Client from Client to all, to aggregate on this dimension.
3. A slice on the dimension Time to select Year = 2012.

(b) (1 val.) Write a Transact-SQL query to obtain the same result, assuming that the data are stored in a relational database with the scheme Fee(Day, Month, Year, Technician, Store, Client, Charge). You should use the relevant OLAP operators you practiced in the lab session (indicate only the query).

A possible T-SQL query would be:

    SELECT Technician, SUM(Charge)
    FROM Fee
    WHERE Year = 2012
    GROUP BY Technician WITH ROLLUP

where we used ROLLUP to also include the total amount charged in 2012 (aggregated over all technicians).

4. (2 val.) Give an example of a query that uses grouping with ROLLUP that cannot be expressed by a single SELECT clause. Besides your query, you should indicate the SELECT clauses necessary to obtain the same information. In your example, you can use the JoBS database used in the lab session (indicate only the queries, not the result).

Resorting to the JoBS database, as suggested, the query

    SELECT
        CASE WHEN GROUPING(E.EngineerName) = 1 THEN 'All Engineers' ELSE E.EngineerName END,
        CASE WHEN GROUPING(S.PartNumber) = 1 THEN 'All Parts' ELSE S.PartNumber END,
        SUM(S.UnitsHeld)
    FROM EngineerStock S
    INNER JOIN Engineers E ON S.EngineerId = E.EngineerId
    GROUP BY E.EngineerName, S.PartNumber WITH ROLLUP

would require three SELECT clauses if the ROLLUP clause were not used,

    SELECT E.EngineerName, S.PartNumber, SUM(S.UnitsHeld)
    FROM EngineerStock S
    INNER JOIN Engineers E ON S.EngineerId = E.EngineerId
    GROUP BY E.EngineerName, S.PartNumber
    UNION
    SELECT E.EngineerName, 'All Parts', SUM(S.UnitsHeld)
    FROM EngineerStock S
    INNER JOIN Engineers E ON S.EngineerId = E.EngineerId
    GROUP BY E.EngineerName
    UNION
    SELECT 'All Engineers', 'All Parts', SUM(S.UnitsHeld)
    FROM EngineerStock S
    INNER JOIN Engineers E ON S.EngineerId = E.EngineerId

in order to compute the total parts per Engineer and the total number of parts (overall).

2.1 Practical Questions (Using SQL Server 2008)

For the following questions, you should use the AdventureWorksDW2012 database. To this purpose, at the beginning of your code you should include the following SQL statement:

    USE AdventureWorksDW2012;
    GO

You can use the Object Explorer in the MS SQL Server Management Studio to explore the tables in this database as well as the attributes in each table. For completeness, Fig. 1 includes a (simplified) representation of the relevant tables and attributes for this homework, where the attributes in italic correspond to primary keys.

5. (3 val.) Write down the SQL query necessary to obtain the relation described in Fig. 2 from the table dbo.factinternetsales (see Fig. 1). This relation describes, for each order in dbo.factinternetsales, the first and last name of the corresponding customer, the postal code, state and country associated with the customer, the name of the product in the order, its shipping date, and the total amount paid by the customer. You need only to indicate the query, not the results.

In terms of the relations in Fig. 1,

- The attribute OrderNumber in the new table corresponds to the attribute SalesOrderNumber in dbo.factinternetsales;
- The attribute State in the new table corresponds to the attribute StateProvinceCode in the table dbo.dimgeography;

- The attribute Country in the new table corresponds to the attribute CountryRegionCode in the table dbo.dimgeography;
- The attribute Product in the new table corresponds to the attribute EnglishProductName in dbo.dimproduct;
- The attribute ShipDate in the new table corresponds to the attribute FullDateAlternateKey in dbo.dimdate;
- The attribute Total in the new table corresponds to the attribute SalesAmount in the table dbo.factinternetsales.

In your query you should use adequate JOIN operations. In particular, note that there may be orders for which not all information above may be available, but which should still be included in the results of your query.

    dbo.dimcustomer:        CustomerKey, GeographyKey, FirstName, LastName, BirthDate, ...
    dbo.factinternetsales:  SalesOrderNumber, ProductKey, ShipDateKey, CustomerKey, SalesAmount, ...
    dbo.dimgeography:       GeographyKey, EnglishCountryRegionName, City, StateProvinceCode, CountryRegionCode, PostalCode, ...
    dbo.dimproduct:         ProductKey, EnglishProductName, ModelName, ProductLine, ...
    dbo.dimdate:            DateKey, FullDateAlternateKey, CalendarYear, ...

Figure 1: Simplified schema of the AdventureWorksDW2008 that includes only the relevant tables and attributes.

    (NoName):  OrderNumber, FirstName, LastName, PostalCode, State, Country, Product, ShipDate, Total

Figure 2: Table for question 5.

The SQL query would be:

    SELECT
        F.SalesOrderNumber AS OrderNumber,
        C.FirstName,
        C.LastName,
        G.PostalCode,
        G.StateProvinceCode AS State,
        G.CountryRegionCode AS Country,
        P.EnglishProductName AS Product,
        D.FullDateAlternateKey AS ShipDate,
        F.SalesAmount AS Total
    FROM dbo.factinternetsales F
    LEFT JOIN dbo.dimcustomer C ON F.CustomerKey = C.CustomerKey
    LEFT JOIN dbo.dimgeography G ON C.GeographyKey = G.GeographyKey
    LEFT JOIN dbo.dimproduct P ON F.ProductKey = P.ProductKey
    LEFT JOIN dbo.dimdate D ON F.ShipDateKey = D.DateKey

6. From the table dbo.factinternetsales (see Fig. 1),

(a) (2 val.) Write down the SQL query necessary to determine the total sales amount per calendar year. You should include both the query and the obtained results.

The SQL query is:

    SELECT D.CalendarYear AS Year, SUM(F.SalesAmount) AS Total
    FROM dbo.factinternetsales F
    INNER JOIN dbo.dimdate D ON F.ShipDateKey = D.DateKey
    GROUP BY D.CalendarYear

The corresponding values are:

    Year    Total
    2007     9,517,548.53
    2008    10,158,562.38
    2005     3,105,587.33
    2006     6,576,978.98
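The join-and-aggregate pattern of 6(a) can be tried out without a SQL Server installation. The sketch below runs the same query shape against an in-memory SQLite database from Python. It is only an illustration under stated assumptions: SQLite stands in for SQL Server, the two tables are cut down to the columns the query needs, and the rows are made up. An ORDER BY is added because, without it, the row order of a grouped result is not guaranteed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Cut-down versions of the two tables from Fig. 1 (assumption: only
    -- the columns used by the query are kept).
    CREATE TABLE dimdate (DateKey INTEGER PRIMARY KEY, CalendarYear INTEGER);
    CREATE TABLE factinternetsales (ShipDateKey INTEGER, SalesAmount REAL);

    -- Hypothetical rows, for illustration only.
    INSERT INTO dimdate VALUES
        (20050701, 2005), (20050702, 2005), (20060101, 2006);
    INSERT INTO factinternetsales VALUES
        (20050701, 10.0), (20050702, 20.0), (20060101, 30.0), (20060101, 40.0);
""")

# Same shape as the query in 6(a), plus ORDER BY for a deterministic order.
rows = conn.execute("""
    SELECT D.CalendarYear AS Year, SUM(F.SalesAmount) AS Total
    FROM factinternetsales F
    INNER JOIN dimdate D ON F.ShipDateKey = D.DateKey
    GROUP BY D.CalendarYear
    ORDER BY D.CalendarYear
""").fetchall()
conn.close()

for year, total in rows:
    print(year, total)
```

Note that SQLite does not support WITH ROLLUP or WITH CUBE, so this sketch only covers the plain GROUP BY case.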

(b) (2 val.) With a single query, determine the total sales amount both globally and per year (Suggestion: Use a ROLLUP clause). You should include both the SQL query and the obtained results.

The SQL query is:

    SELECT D.CalendarYear AS Year, SUM(F.SalesAmount) AS Total
    FROM dbo.factinternetsales F
    INNER JOIN dbo.dimdate D ON F.ShipDateKey = D.DateKey
    GROUP BY D.CalendarYear WITH ROLLUP

The corresponding values are:

    Year    Total
    2005     3,105,587.33
    2006     6,576,978.98
    2007     9,517,548.53
    2008    10,158,562.38
    All     29,358,677.22

(c) (3 val.) Using the CUBE clause, determine the total sales amount across the two dimensions: Year, corresponding to the CalendarYear attribute in table dbo.dimdate (associated with the shipping date), and Country, corresponding to the EnglishCountryRegionName attribute in table dbo.dimgeography. Write down the adequate SQL query and express the results as a cross-tabulation.

The SQL query is:

    SELECT
        D.CalendarYear AS Year,
        G.EnglishCountryRegionName AS Country,
        SUM(F.SalesAmount) AS Total
    FROM dbo.factinternetsales F
    INNER JOIN dbo.dimdate D ON F.ShipDateKey = D.DateKey
    INNER JOIN dbo.dimcustomer C ON F.CustomerKey = C.CustomerKey
    INNER JOIN dbo.dimgeography G ON C.GeographyKey = G.GeographyKey
    GROUP BY D.CalendarYear, G.EnglishCountryRegionName WITH CUBE

The corresponding cross-tabulation is:

                      2005          2006          2007          2008           All
    Australia         1,251,388.1   2,166,222.5   3,002,149.1    2,641,240.9    9,061,000.6
    Canada              143,251.5     618,206.8     507,224.8      709,161.8    1,977,844.9
    France              172,716.1     508,910.0     992,681.6      969,710.0    2,644,017.7
    Germany             219,372.8     528,003.6   1,021,797.2    1,125,138.8    2,894,312.3
    United Kingdom      280,855.7     583,463.6   1,271,712.8    1,255,680.0    3,391,712.2
    United States     1,038,003.1   2,172,172.5   2,721,983.1    3,457,630.8    9,389,789.5
    All               3,105,587.3   6,576,979.0   9,517,548.5   10,158,562.4   29,358,677.2
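To make explicit what the CUBE clause adds, the following Python sketch starts from the 24 detail cells of the cross-tabulation (one per country and year) and recomputes every "All" entry by aggregating over each subset of the two grouping attributes. It is an illustration, not the way SQL Server evaluates the query: the function name `cube` and the use of `None` as the "All" marker are assumptions made here, and the inputs are the values rounded to one decimal, so the recomputed margins can differ from the table in the last digit.

```python
from collections import defaultdict
from itertools import product

years = [2005, 2006, 2007, 2008]

# Detail cells copied from the cross-tabulation (rounded to one decimal).
cells = {
    "Australia":      [1251388.1, 2166222.5, 3002149.1, 2641240.9],
    "Canada":         [143251.5, 618206.8, 507224.8, 709161.8],
    "France":         [172716.1, 508910.0, 992681.6, 969710.0],
    "Germany":        [219372.8, 528003.6, 1021797.2, 1125138.8],
    "United Kingdom": [280855.7, 583463.6, 1271712.8, 1255680.0],
    "United States":  [1038003.1, 2172172.5, 2721983.1, 3457630.8],
}


def cube(cells, years):
    """Sum the detail cells over every subset of {Country, Year}.

    Keys of the result are (country, year) pairs in which None plays the
    role of 'All' for that attribute.
    """
    totals = defaultdict(float)
    for country, amounts in cells.items():
        for year, amount in zip(years, amounts):
            # Each detail cell contributes to four groupings:
            # (country, year), (country, All), (All, year) and (All, All).
            for c, y in product((country, None), (year, None)):
                totals[(c, y)] += amount
    return dict(totals)


totals = cube(cells, years)

print("Australia, all years:", round(totals[("Australia", None)], 1))
print("All countries, 2005: ", round(totals[(None, 2005)], 1))
print("Grand total:         ", round(totals[(None, None)], 1))
```

A ROLLUP over (Year, Country) would produce only a subset of these groupings, namely (Year, Country), (Year, All) and (All, All); the per-country totals across all years are what CUBE adds.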