Written Exam Data Warehousing and Data Mining


Course code: 232020
30 January 2008 (13:30-17:00)

Remarks:

- The exercises are clearly marked as DM for data mining and DW for data warehousing to allow you to start with the topic you feel most confident about.
- Answer each exercise on a different sheet, so that the correction can take place in parallel. If the exam paper is in booklet form, you can try to separate the sheets. Do not forget to put your name and student number on every sheet.
- Motivate your answers. The motivation / argumentation plays an important role in grading the exercise.
- You are allowed to use the study material and notes for the written exam.
- The practicum has to be completed satisfactorily before one is admitted to the written exam. The grade for the written exam is immediately the grade for the course. In case of doubt, the result of the practicum may be taken into account.
- There are 4 exercises. For each assignment, the number of points is given. In total, there are 40 points.

Assignment 1 (DM): Classification (15 pts)

A retailer wants, for marketing purposes, to distinguish between customers younger than 35 and customers older than 35. The following table summarizes the data set in the retailer's database in an abstract form. The relevant attributes, determined by domain knowledge, are for convenience denoted by A, B and C. The values for A are a1, a2 and a3. The values for B are b1 and b2. The values for C are c1 and c2.

                    Number of instances
    A     B     C     Y     O
    a1    b1    c1    14    0
    a2    b1    c1    0     4
    a3    b1    c1    6     2
    a1    b2    c1    0     12
    a2    b2    c1    6     4
    a3    b2    c1    0     6
    a1    b1    c2    0     8
    a2    b1    c2    8     0
    a3    b1    c2    2     0
    a1    b2    c2    0     4
    a2    b2    c2    2     2
    a3    b2    c2    4     0

Assume that the retailer wants to use Decision Trees to classify the customers into the classes young, denoted by Y, and old, denoted by O.

Part 1a
Compute the Classification error (pg. 150 handout Ch. 4) for the A attribute.

Part 1b
According to the Classification error, which attribute would be chosen as the first splitting attribute? For each attribute, show the contingency table and the corresponding Classification error.
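The bookkeeping behind Parts 1a and 1b can be sketched in a few lines of Python. This is only an illustration, not the handout's definition: it assumes that the classification error of a node is the fraction of its instances outside the majority class, and that the error of a split is the instance-weighted average over the child nodes. The names `DATA`, `class_counts` and `split_error` are hypothetical.

```python
from collections import defaultdict

# Rows of the table above: (A, B, C, number of Y instances, number of O instances).
DATA = [
    ("a1", "b1", "c1", 14, 0),
    ("a2", "b1", "c1", 0, 4),
    ("a3", "b1", "c1", 6, 2),
    ("a1", "b2", "c1", 0, 12),
    ("a2", "b2", "c1", 6, 4),
    ("a3", "b2", "c1", 0, 6),
    ("a1", "b1", "c2", 0, 8),
    ("a2", "b1", "c2", 8, 0),
    ("a3", "b1", "c2", 2, 0),
    ("a1", "b2", "c2", 0, 4),
    ("a2", "b2", "c2", 2, 2),
    ("a3", "b2", "c2", 4, 0),
]


def class_counts(rows, attr_index):
    """Contingency table: [Y count, O count] per value of one attribute."""
    counts = defaultdict(lambda: [0, 0])
    for row in rows:
        counts[row[attr_index]][0] += row[3]
        counts[row[attr_index]][1] += row[4]
    return dict(counts)


def split_error(rows, attr_index):
    """Assumed definition: instances outside the majority class of their
    child node, divided by the total number of instances."""
    counts = class_counts(rows, attr_index)
    total = sum(y + o for y, o in counts.values())
    wrong = sum(min(y, o) for y, o in counts.values())
    return wrong / total


for name, index in (("A", 0), ("B", 1), ("C", 2)):
    print(name, class_counts(DATA, index), round(split_error(DATA, index), 3))
```

Passing a filtered list of rows (for example only the rows with a given value of the root attribute) applies the same computation to the children of the root.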

Part 1c
Draw the resulting Decision Tree of depth 1, based on your outcome of Part 1b. Repeat Part 1b for the children of the root node, i.e. the nodes on level 1. Draw the resulting Decision Tree of depth 2.

Part 1d
Compute the error rate of your Decision Tree of depth 2, using the resubstitution error (pg. 180 handout Ch. 4).

Part 1e
One could also use Naive Bayes as a classification approach. Assume a new customer nc comes in and has attribute values A = a2 and C = c1. How will this customer nc be classified if one uses:
- the partially unfolded Decision Tree of Part 1c;
- a Naive Bayes classifier.

Part 1f
Explain the main differences between a Decision Tree classifier and a Naive Bayes classifier.
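The Naive Bayes computation of Part 1e can be sketched as follows. The sketch rests on assumptions that the exam does not state: probabilities are estimated as plain relative frequencies (no Laplace correction), and the unobserved attribute B is simply left out of the product. The names `DATA` and `naive_bayes_scores` are hypothetical.

```python
from fractions import Fraction

# Rows of the Assignment 1 table: (A, B, C, number of Y, number of O).
DATA = [
    ("a1", "b1", "c1", 14, 0), ("a2", "b1", "c1", 0, 4), ("a3", "b1", "c1", 6, 2),
    ("a1", "b2", "c1", 0, 12), ("a2", "b2", "c1", 6, 4), ("a3", "b2", "c1", 0, 6),
    ("a1", "b1", "c2", 0, 8), ("a2", "b1", "c2", 8, 0), ("a3", "b1", "c2", 2, 0),
    ("a1", "b2", "c2", 0, 4), ("a2", "b2", "c2", 2, 2), ("a3", "b2", "c2", 4, 0),
]


def naive_bayes_scores(rows, evidence):
    """Unnormalised P(class) * product of P(value | class) for Y and O.

    `evidence` maps a column index (0 = A, 1 = B, 2 = C) to the observed value;
    attributes that are not observed are left out of the product.
    """
    totals = {"Y": sum(r[3] for r in rows), "O": sum(r[4] for r in rows)}
    n = totals["Y"] + totals["O"]
    scores = {}
    for label, column in (("Y", 3), ("O", 4)):
        score = Fraction(totals[label], n)  # class prior
        for attr_index, value in evidence.items():
            matching = sum(r[column] for r in rows if r[attr_index] == value)
            score *= Fraction(matching, totals[label])  # P(value | class)
        scores[label] = score
    return scores


scores = naive_bayes_scores(DATA, {0: "a2", 2: "c1"})
print(scores, "->", max(scores, key=scores.get))
```

Exact fractions are used so that the two scores can be compared without rounding effects.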

Assignment 2: Association Rules (6 pts)

A supermarket stores all its transactions in a large database. This transaction database can be used for basket analysis. For the sake of simplicity and time we focus only on a small part of the database and of all the items:

    transaction   items
    t1            {bread, cheese, milk}
    t2            {bread, cheese, jelly, peanut butter}
    t3            {cheese, jelly, milk}
    t4            {bread, cheese, jelly}
    t5            {milk, peanut butter}
    t6            {bread, cheese, milk, peanut butter}
    t7            {jelly, milk}
    t8            {jelly, milk, peanut butter}
    t9            {bread, cheese, milk, peanut butter}
    t10           {jelly, peanut butter}
    t11           {bread, cheese}
    t12           {bread, jelly, peanut butter}
    t13           {cheese, milk}
    t14           {bread, cheese, jelly, milk}

    Part of the transaction database.

Part 2a
Compute the support and confidence of the following association rules:
1. {cheese} ⇒ {bread}
2. {bread} ⇒ {cheese}
3. ∅ ⇒ {peanut butter}, with ∅ the empty set.

Part 2b
Compute all the association rules of the form X ⇒ {bread} with support s ≥ 50% and confidence α ≥ 60%.
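The two measures asked for in Part 2a can be sketched in Python as below. The sketch uses the usual definitions, which are assumed here rather than quoted from the course material: the support of X ⇒ Y is the fraction of transactions containing X ∪ Y, and the confidence is support(X ∪ Y) / support(X). The names `TRANSACTIONS`, `support`, `confidence` and `rule_stats` are hypothetical.

```python
from fractions import Fraction

# Transactions t1..t14 from the table above.
TRANSACTIONS = [
    {"bread", "cheese", "milk"},
    {"bread", "cheese", "jelly", "peanut butter"},
    {"cheese", "jelly", "milk"},
    {"bread", "cheese", "jelly"},
    {"milk", "peanut butter"},
    {"bread", "cheese", "milk", "peanut butter"},
    {"jelly", "milk"},
    {"jelly", "milk", "peanut butter"},
    {"bread", "cheese", "milk", "peanut butter"},
    {"jelly", "peanut butter"},
    {"bread", "cheese"},
    {"bread", "jelly", "peanut butter"},
    {"cheese", "milk"},
    {"bread", "cheese", "jelly", "milk"},
]


def support(itemset):
    """Fraction of transactions that contain every item of `itemset`."""
    hits = sum(1 for t in TRANSACTIONS if itemset <= t)
    return Fraction(hits, len(TRANSACTIONS))


def confidence(antecedent, consequent):
    """support(antecedent ∪ consequent) / support(antecedent)."""
    return support(antecedent | consequent) / support(antecedent)


def rule_stats(antecedent, consequent):
    """(support, confidence) of the rule antecedent ⇒ consequent."""
    return support(antecedent | consequent), confidence(antecedent, consequent)


for lhs, rhs in (({"cheese"}, {"bread"}), ({"bread"}, {"cheese"}), (set(), {"peanut butter"})):
    print(lhs or "∅", "⇒", rhs, rule_stats(lhs, rhs))
```

With the empty set as antecedent, every transaction matches, so the confidence of the rule equals the support of the consequent.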

Part 2c
Suppose one wants to compute only association rules of the form X ⇒ {bread} with a certain support s and confidence α. How must the Apriori algorithm be adapted in order to generate, in an efficient way, only association rules of the above form? Describe clearly only what must be adapted and how.
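One possible adaptation, given here as a rough sketch and not as the intended model answer, is to run the level-wise search only over itemsets that contain the fixed consequent: an antecedent X is kept when X ∪ {target} is frequent, and candidates are pruned when one of their smaller antecedents did not survive. The function `rules_for_target` and its parameters are hypothetical, and a minimum support greater than zero is assumed.

```python
# Transactions t1..t14 from Assignment 2.
TRANSACTIONS = [
    {"bread", "cheese", "milk"},
    {"bread", "cheese", "jelly", "peanut butter"},
    {"cheese", "jelly", "milk"},
    {"bread", "cheese", "jelly"},
    {"milk", "peanut butter"},
    {"bread", "cheese", "milk", "peanut butter"},
    {"jelly", "milk"},
    {"jelly", "milk", "peanut butter"},
    {"bread", "cheese", "milk", "peanut butter"},
    {"jelly", "peanut butter"},
    {"bread", "cheese"},
    {"bread", "jelly", "peanut butter"},
    {"cheese", "milk"},
    {"bread", "cheese", "jelly", "milk"},
]


def rules_for_target(transactions, target, min_support, min_confidence):
    """Rules X => {target}, found level by level over antecedents X.

    Assumes min_support > 0. Returns (antecedent, support, confidence) tuples.
    """
    n = len(transactions)

    def count(itemset):
        return sum(1 for t in transactions if itemset <= t)

    others = sorted(set().union(*transactions) - {target})
    if count({target}) < min_support * n:
        return []  # the consequent itself is infrequent, so no rule qualifies
    level = {frozenset()}  # antecedents X for which X ∪ {target} is frequent
    rules = []
    while level:
        next_level = set()
        for x in level:
            for item in others:
                if x and item <= max(x):
                    continue  # extend in sorted order to avoid duplicates
                candidate = x | {item}
                # Apriori pruning: every smaller antecedent must have survived.
                if any(candidate - {i} not in level for i in candidate):
                    continue
                joint = count(candidate | {target})
                if joint < min_support * n:
                    continue
                next_level.add(candidate)
                conf = joint / count(candidate)
                if conf >= min_confidence:
                    rules.append((set(candidate), joint / n, conf))
        level = next_level
    return rules


print(rules_for_target(TRANSACTIONS, "bread", 0.5, 0.6))
```

The sketch still needs one extra count per surviving antecedent (the support of X without the target) to obtain the confidence; itemsets that do not contain the target are never generated as candidates.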

Assignment 3 (DW): Case Eniac (12 pts)

The alumni association [1] of Computer Science, called Eniac, wants to analyze how strong the relationship is between the company where students do their final master project and the company of the student's first job. They suspect that students often stay at the same company, i.e., have their first job at the same company as their master project, but it is unknown how often this occurs. Eniac is also interested in the degree to which the topic of the master project influences a student's first job, and in whether there is a significant difference in how long someone stays in his/her first job depending on whether or not that first job is at the same company as his/her master project.

Eniac would therefore like to set up a data warehouse in which its own data on members is merged with data from the ASAS system of the faculty. Eniac was founded in 1992, so data on its members has been collected since then. ASAS contains information on open, running, and finished internships (Dutch: "stages") and master projects. ASAS has been running since 2002, hence it contains data since 2002. For this exam question, you may assume that ASAS has complete data on all internships and master projects of the whole faculty since 2002 (which is not true in reality). The data warehouse project needs to be rather cost efficient, so priority lies with a data warehouse focused on the above questions rather than on extensibility for other questions.

Eniac (fictitious)
    Member: name, studentnumber, studyprogramme, startyear, dateofmasterdefense, masterprojectcompany id, currentjob id, address, emailaddress
    Company: company id, name
    Job: member id, company id, nrofjob, function

ASAS (simplified)
    Project: project id, kind, studentname, studentnumber, study id, supervisor id, projecttitle, projecttopic, description, status (open, running, or finished), company id, startdate, enddate
    Company: company id, name
    Studyprogramme: study id, name
    Supervisor: supervisor id, name, emailaddress

Figure 1: Databases

[1] According to the Merriam-Webster dictionary, alumnus means (1) a person who has attended or has graduated from a particular school, college, or university, or (2) a person who is a former member, employee, contributor, or inmate. In other words, alumni are former students of, in this case, Computer Science.

Part 3a (2 pts)
i) Does Figure 1 contain metadata or not? Explain your answer.
ii) Figure 1 contains many ambiguities that have to be clarified before a data warehouse can be set up. For example, Member.studyprogramme: does it contain a code like "CS" or the full name "Computer Science"? Moreover, in the past, study programmes have had different names and there was a time when there was no separation between bachelor and master. Choose 2 attributes, other than studyprogramme and Company.name, that you consider the most ambiguous and describe as accurately as possible which ambiguities have to be clarified for them.

Part 3b (4 pts)
i) The data is far from complete. Not all former students are members of Eniac (although many are), not every student does his/her master project externally at a company, etc. Discuss how problematic this is and advise how to deal with it in the data warehousing project.
ii) Both databases have a table with companies. You cannot simply compare them on company name or id, while it is evident that this table plays a vital role in determining whether students have their first job with the same company as their master project. Describe as accurately as possible which problems or complexities you foresee with the conversion and comparison of these tables. Also explain how you would advise approaching those problems and complexities.

Part 3c (5 pts)
i) Which attributes and/or tables are not needed in the data warehouse? Explain your answer.
ii) Give a design for the data structure of the data warehouse by means of a star schema with table names and attributes.
iii) Give an estimate of the number of rows of your fact table. Mention your assumptions and explicitly provide the calculation.
iv) With this data warehouse, can all business questions be fully answered? Explain as accurately as possible to what degree the questions can be answered and which considerations the analysts need to take into account when looking at the results.

Part 3d (1 pt)
Eniac would like to repeat the analysis every year with fresh data. Discuss how you would approach this. Involve as many aspects as possible in your discussion and use proper data warehousing terminology where appropriate.

Assignment 4 (DW): Advanced Topics (7 pts)

Part 4a (3 pts)

                   Year
    City           2003    2004    2005    Total
    Gotham City    120     130     140     (b)
    Metropolis     90      80      70
    Total          (a)

    Table 1: Number of Cars per Year per City.

Assume the numbers in Table 1 are the number of cars per city per year. What are the conditions that have to be true in order to calculate the total number of cars in cell (a) from the data given in Table 1? What are the conditions for calculating the total number of cars in cell (b) from the data given in Table 1? How could you discover whether these conditions are met? What could you do if these conditions are not met?

Part 4b (4 pts)
The Mail Order Company used a data warehouse for analyzing mail campaigns. Tables 3, 4, and 5 show different cross tables. Assume all differences in the cross tables are statistically significant. The three Figures i), ii), and iii) in Table 2 show different causal graphs, which encode alternative beliefs about the causal influences between the variables. Assume that each graph shows the complete causal model. State for all nine combinations of the three cross tables and the three causal graphs: given the data table, would you reject the causal model (yes, no)? In other words, which causal graph is inconsistent with which data table? Explain why you think that causal graph iii) is consistent or inconsistent with Table 3.

[Figure: three causal graphs, labelled i), ii), and iii), each over the variables Mailing, Rich, and Order, with edges marked "+".]

Table 2: Causal Graphs i), ii), and iii). Each graph shows alternative beliefs about the causal influences between the variables. E.g., Graph i) means that if a person is Rich, this has a positive causal influence on whether he/she creates an Order. However, the fact that he/she got a Mailing does not cause a higher chance of an Order.

             Mailing
    Order    Yes     No      Total
    Yes      800     200     1000
    No       200     800     1000
    Total    1000    1000    2000

    Table 3: Order reactions (yes, no) from the customers after mailing campaign (yes, no).

             Rich
    Order    Yes     No      Total
    Yes      1600    400     2000
    No       400     1600    2000
    Total    2000    2000    4000

    Table 4: Order reactions (yes, no) from the customers depending on their wealth (Rich (yes, no)).

             Mailing: Yes             Mailing: No              Total
             Rich                     Rich
    Order    Yes     No      Total    Yes     No      Total
    Yes      1000    1000    2000     1000    1000    2000     4000
    No       1000    1000    2000     1000    1000    2000     4000
    Total    2000    2000    4000     2000    2000    4000     8000

    Table 5: Order reactions (yes, no) from the customers depending on their wealth (Rich (yes, no)) and the mailing campaign (yes, no).