Written Exam Data Warehousing and Data Mining
Course code: 232020
30 January 2008 (13:30-17:00)

Remarks:
- The exercises are clearly marked as DM for data mining and DW for data warehousing, so that you can start with the topic you feel most confident about.
- Answer each exercise on a different sheet, so that the correction can take place in parallel. If the exam paper is in booklet form, you can try to separate the sheets.
- Do not forget to put your name and student number on every sheet.
- Motivate your answers. The motivation / argumentation plays an important role in grading the exercise.
- You are allowed to use the study material and notes for the written exam.
- The practicum has to be completed satisfactorily before one is admitted to the written exam.
- The grade for the written exam is immediately the grade for the course. In case of doubt, the result of the practicum may be taken into account.
- There are 4 exercises. For each assignment, the number of points is given. In total, there are 40 points.

Assignment 1 (DM): Classification (15 pts)

A retailer wants, for marketing purposes, to distinguish between customers younger than 35 and customers older than 35. The following table summarizes the data set in the database of the retailer in an abstract form. The relevant attributes, determined by domain knowledge, are for convenience denoted by A, B and C. The values for A are a1, a2 and a3. The values for B are b1 and b2. The values for C are c1 and c2. Assume that the retailer wants to use Decision Trees to classify the customers into the classes young, denoted by Y, and old, denoted by O.

                  Number of Instances
A    B    C       Y      O
a1   b1   c1      14     0
a2   b1   c1      0      4
a3   b1   c1      6      2
a1   b2   c1      0      12
a2   b2   c1      6      4
a3   b2   c1      0      6
a1   b1   c2      0      8
a2   b1   c2      8      0
a3   b1   c2      2      0
a1   b2   c2      0      4
a2   b2   c2      2      2
a3   b2   c2      4      0

Part 1a
Compute the Classification error (pg. 150 handout Ch. 4) for the A attribute.

Part 1b
According to the Classification error, which attribute would be chosen as the first splitting attribute? For each attribute, show the contingency table and the corresponding Classification error.

Part 1c
Draw the resulting Decision Tree of depth 1, based on your outcome of Part 1b. Repeat Part 1b for the children of the root node, i.e. the nodes on level 1. Draw the resulting Decision Tree of depth 2.

Part 1d
Compute the error rate of your Decision Tree of depth 2, using the resubstitution error (pg. 180 handout Ch. 4).

Part 1e
One could also use Naive Bayes as a classification approach. Assume a new customer nc comes in and has attribute values A = a2 and C = c1. How will this customer nc be classified if one uses:
- the partially unfolded Decision Tree of Part 1c;
- a Naive Bayes classifier.

Part 1f
Explain the main differences between a Decision Tree classifier and a Naive Bayes classifier.
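
As a reminder of the measure used in Parts 1a and 1b: the classification error of a node is commonly defined as 1 minus the fraction of its majority class, and the error of a split as the average over the child nodes, weighted by their size. The following is a minimal Python sketch under that assumption. The function names `node_error` and `split_error` and the counts in `toy_split` are hypothetical and are deliberately not the exam data.

```python
def node_error(counts):
    """Classification error of one node: 1 - (majority class fraction)."""
    total = sum(counts)
    if total == 0:
        return 0.0  # an empty node contributes no error
    return 1.0 - max(counts) / total


def split_error(children):
    """Size-weighted classification error of a split.

    children: list of per-child class-count tuples, e.g. [(Y, O), ...].
    """
    n = sum(sum(c) for c in children)
    return sum(sum(c) / n * node_error(c) for c in children)


# Hypothetical (Y, O) counts for a three-valued attribute; not the exam data.
toy_split = [(8, 2), (3, 7), (5, 5)]
print(split_error(toy_split))
```

The weighted error equals the number of minority-class instances over all children divided by the total number of instances, which is a convenient check when working by hand.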

Assignment 2: Association Rules (6 pts)

A supermarket stores all the transactions in a large database. This transaction database can be used for basket analysis. For the sake of simplicity and time, we focus only on a small part of the database and of all the items:

transaction   items
t1            {bread, cheese, milk}
t2            {bread, cheese, jelly, peanut butter}
t3            {cheese, jelly, milk}
t4            {bread, cheese, jelly}
t5            {milk, peanut butter}
t6            {bread, cheese, milk, peanut butter}
t7            {jelly, milk}
t8            {jelly, milk, peanut butter}
t9            {bread, cheese, milk, peanut butter}
t10           {jelly, peanut butter}
t11           {bread, cheese}
t12           {bread, jelly, peanut butter}
t13           {cheese, milk}
t14           {bread, cheese, jelly, milk}

Part of the transaction database.

Part 2a
Compute the support and confidence of the following association rules:
1. {cheese} ⇒ {bread}
2. {bread} ⇒ {cheese}
3. ∅ ⇒ {peanut butter}, with ∅ the empty set.

Part 2b
Compute all the association rules of the form X ⇒ {bread} with support s ≥ 50% and confidence α ≥ 60%.

Part 2c
Suppose one wants to compute only association rules of the form X ⇒ {bread} with a certain support s and confidence α. How must the Apriori algorithm be adapted in order to efficiently generate only association rules of the above form? Describe clearly only what must be adapted and how.
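
For reference, the support of a rule X ⇒ Y is the fraction of transactions containing all items of X and Y, and its confidence is that support divided by the support of X. The following is a minimal Python sketch of these two definitions. The function names and the baskets in `toy` are hypothetical and are deliberately not the exam transactions.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """support(X union Y) / support(X) for the rule X => Y.

    Assumes the antecedent occurs in at least one transaction;
    otherwise the division below raises ZeroDivisionError.
    """
    x = frozenset(antecedent)
    xy = x | frozenset(consequent)
    return support(xy, transactions) / support(x, transactions)


# Hypothetical toy baskets; not the exam transactions.
toy = [
    frozenset({"tea", "sugar"}),
    frozenset({"tea", "sugar", "lemon"}),
    frozenset({"tea"}),
    frozenset({"sugar", "lemon"}),
]
print(support({"tea", "sugar"}, toy))
print(confidence({"tea"}, {"sugar"}, toy))
```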

Assignment 3 (DW): Case Eniac (12 pts)

The alumni association (see footnote 1) of Computer Science, called Eniac, wants to analyze how strong the relationship is between the company where students do their final master project and the company of the student's first job. They suspect that students often stay at the same company, i.e., have their first job at the same company as their master project, but it is unknown how often this occurs. Eniac is also interested in the degree to which the topic of the master project influences a student's first job, and in whether there is a significant difference in how long someone stays in his/her first job depending on whether or not that first job is at the same company as his/her master project.

Eniac would therefore like to set up a data warehouse in which their own data on members is merged with data from the ASAS system of the faculty. Eniac was founded in 1992, so data on their members has been collected since then. ASAS contains information on open, running, and finished internships (Dutch: "stages") and master projects. ASAS has been running since 2002, and hence contains data since 2002. For this exam question, you may assume that ASAS has complete data on all internships and master projects of the whole faculty since 2002 (which is not true in reality). The data warehouse project needs to be rather cost efficient, so priority lies with a data warehouse focussed on the above questions rather than on extensibility for other questions.

Eniac (fictitious)
  Member: name, studentnumber, studyprogramme, startyear, dateofmasterdefense, masterprojectcompany id, currentjob id, address, emailaddress
  Company: company id, name
  Job: member id, company id, nrofjob, function

ASAS (simplified)
  Project: project id, kind, studentname, studentnumber, study id, supervisor id, projecttitle, projecttopic, description, status (open, running, or finished), company id, startdate, enddate
  Company: company id, name
  Studyprogramme: study id, name
  Supervisor: supervisor id, name, emailaddress

Figure 1: Databases

Footnote 1: According to the Merriam-Webster dictionary, alumnus means (1) a person who has attended or has graduated from a particular school, college, or university, or (2) a person who is a former member, employee, contributor, or inmate. In other words, alumni are former students of, in this case, Computer Science.

Part 3a (2 pts)
i) Does Figure 1 contain metadata or not? Explain your answer.
ii) Figure 1 contains many ambiguities that have to be clarified before a data warehouse can be set up. For example, Member.studyprogramme: does it contain a code like "CS" or the full name "Computer Science"? Moreover, in the past, study programmes have had different names, and there was a time when there was no separation between bachelor and master. Choose 2 attributes, other than studyprogramme and Company.name, that you consider the most ambiguous and describe as accurately as possible which ambiguities have to be clarified for them.

Part 3b (4 pts)
i) The data is far from complete. Not all former students are members of Eniac (although many are), not every student does his master project externally at a company, etc. Discuss how problematic this is and advise how to deal with it in the data warehousing project.
ii) Both databases have a table with companies. You can't simply compare them on company name or id, while it is evident that this table plays a vital role in determining whether students have their first job with the same company as their master project. Describe as accurately as possible which problems or complexities you foresee with the conversion and comparison of these tables. Also explain how you would advise approaching these problems and complexities.

Part 3c (5 pts)
i) Which attributes and/or tables are not needed in the data warehouse? Explain your answer.
ii) Give a design for the data structure of the data warehouse by means of a star schema with table names and attributes.
iii) Give an estimate of the number of rows of your fact table. Mention your assumptions and explicitly provide the calculation.
iv) With this data warehouse, can all business questions be fully answered? Explain as accurately as possible to what degree the questions can be answered and which considerations the analysts need to take into account when looking at the results.

Part 3d (1 pt)
Eniac would like to repeat the analysis every year with fresh data. Discuss how you would approach this. Involve as many aspects as possible in your discussion and use proper data warehousing terminology where appropriate.

Assignment 4 (DW): Advanced Topics (7 pts)

Part 4a (3 pts)

               Year
City           2003   2004   2005   Total
Gotham City    120    130    140    (b)
Metropolis     90     80     70
Total          (a)

Table 1: Number of Cars per Year per City.

Assume the numbers in Table 1 are the number of cars per city per year. What conditions have to be true in order to calculate the total number of cars in cell (a) from the data given in Table 1? What are the conditions for calculating the total number of cars in cell (b) from the data given in Table 1? How could you discover whether these conditions are met? What could you do if these conditions are not met?

Part 4b (4 pts)
The Mail Order Company used a data warehouse for analyzing mail campaigns. Tables 3, 4, and 5 show three different cross tables. Assume all differences in the cross tables are statistically significant. The three graphs i), ii), and iii) in Table 2 are different causal graphs, which encode alternative beliefs about the causal influences between the variables. Assume that each graph shows the complete causal model. State for all nine combinations of the three cross tables and the three causal graphs: given the data table, would you reject the causal model (yes, no)? That is, which causal graph is inconsistent with which data table? Explain why you think that causal graph iii) is consistent or inconsistent with Table 3.

[Figure: three causal graphs, i), ii), and iii), each over the variables Mailing, Rich, and Order, with edges marked +.]

Table 2: Causal Graphs i), ii), and iii). Each graph shows alternative beliefs about the causal influences between the variables. For example, graph i) means that being Rich has a positive causal influence on a person placing an Order, whereas having received a Mailing does not cause a higher chance of an Order.

                 Mailing
                 Yes     No      Total
Order   Yes      800     200     1000
        No       200     800     1000
Total            1000    1000    2000

Table 3: Order reactions (yes, no) from the customers after mailing campaign (yes, no).

                 Rich
                 Yes     No      Total
Order   Yes      1600    400     2000
        No       400     1600    2000
Total            2000    2000    4000

Table 4: Order reactions (yes, no) from the customers depending on their wealth (Rich (yes, no)).

                 Mailing: Yes             Mailing: No              Total
                 Rich                     Rich
                 Yes     No      Total    Yes     No      Total
Order   Yes      1000    1000    2000     1000    1000    2000     4000
        No       1000    1000    2000     1000    1000    2000     4000
Total            2000    2000    4000     2000    2000    4000     8000

Table 5: Order reactions (yes, no) from the customers depending on their wealth (Rich (yes, no)) and the mailing campaign (yes, no).