Two hours UNIVERSITY OF MANCHESTER SCHOOL OF COMPUTER SCIENCE

COMP 60711

Two hours

UNIVERSITY OF MANCHESTER
SCHOOL OF COMPUTER SCIENCE

Data Engineering

Date: Thursday 25th January 2018
Time: 09:45-11:45

Please answer ONE Question from Section A and ONE Question from Section B.
Use a SEPARATE answer book for each SECTION.

This is a CLOSED book examination.

The use of electronic calculators is permitted provided they are not programmable and do not store text.

Section A

1. a) Consider the data sample shown in the table below, which describes user actions when accessing the commercial Web site of a company. Each table row describes an action performed by a client of the company when accessing the Web site. Suggest four data reduction techniques that could be used to avoid the storing of the thousands of actions collected daily by the company and that could facilitate analysis of the data as well as the gathering of Business Intelligence. State any assumptions you make about how the data is to be analysed. (8 marks)

Date        Time      Username  IPaddress                  URL                                    Action                Object
13-06-2017  14:22:56  sramos    2001:db8:0:1234:0:567:8:1  https://www.amretailers.com            Login
13-06-2017  14:23:20  sramos    2001:db8:0:1234:0:567:8:1  https://www.amretailers.com/b/ref=...  Search
13-06-2017  14:24:01  dverd     2010:db9:0:3492:0:534:6:2  https://www.amretailers.com/b/ref=...  choose_product        prod_010
13-06-2017  14:27:12  sramos    2001:db8:0:1234:0:567:8:1  https://www.amretailers.com/b/ref=...  choose_product_class  prod_class_30
13-06-2017  14:29:44  sramos    2001:db8:0:1234:0:567:8:1  https://www.amretailers.com/b/ref=...  choose_product        prod_120
13-06-2017  14:27:12  dverd     2010:db9:0:3492:0:534:6:2  https://www.amretailers.com/b/ref=...  add_to_basked         prod_010
13-06-2017  14:31:21  sramos    2001:db8:0:1234:0:567:8:1  https://www.amretailers.com/b/ref=...  add_to_basked         prod_120
13-06-2017  14:33:50  sramos    2001:db8:0:1234:0:567:8:1  https://www.amretailers.com/b/ref=...  pay_basked

b) Data acquisition is one of the phases in the big data lifecycle where the employed data collection methods produce very large sets of data that can vary in format, speed, and quality (e.g., different types of sensors). In this context, discuss at least four issues that typically arise in data acquisition. (8 marks)

c) Define data profiling and discuss four challenges involved in profiling very large data sets. (9 marks)

d) Explain the steps involved in the process of cleaning data, discussing the issues that can arise at each step and the limitations of data cleaning as a process.

e) Explain why and how integration of Big data from heterogeneous sources can impact on the quality of the integrated data and make data cleaning more difficult.

2. a) Apply the general lifecycle management principles of Identification of Objectives in Maintaining Data, Minimalism and Information Security in the context of Amazon.com, Inc. (the largest Internet-based retailer in the world). This will require you to propose specific management procedures for Amazon's data sets. Consider, for example, its commercial website, which collects user behaviour data on a daily basis. You should also make assumptions about other datasets relevant to Amazon's businesses. Your procedures should dictate how the torrent of data (for example, data describing user actions when accessing the Amazon Web site) should be managed by Amazon, as well as other relevant datasets. State any assumptions you make. (10 marks)

b) Considering parallel processing as a solution to the problem of performance of Big data applications for data profiling, answer the following questions:

(i) Explain how the Weighted Average of a numerical column in a table can be calculated in a distributed manner (i.e., if the column values are distributed across a number of nodes of a cluster, and the overall Weighted Average of the values is to be calculated).

(ii) Explain how the Median of the values in the same column can be calculated in a distributed manner. (6 marks)

c) Recent technological advances (e.g., network and communication infrastructures, service oriented programming paradigms, etc.) have had an impact on the organization and functionality of modern Information Systems, changing their architecture into network-based structures. In this context, explain how these changes have had an impact on the quality of the data that runs through the processes of these systems, and discuss how data quality can be assessed and improved in these systems.

d) Describe Data Lakes and discuss how Data Lake architectures can help integrate data from different sources.
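As an illustration of the kind of computation part b) (i) refers to, the sketch below shows one possible way to obtain a weighted average when the column is split across nodes: each node reduces its partition to a pair (sum of weight times value, sum of weights), and a coordinator adds the pairs and divides once. This is a minimal sketch and not a model answer; the function names `local_summary` and `combine`, the (value, weight) layout and the sample data are all assumptions made for the example.

```python
from typing import Iterable, List, Tuple

# Assumed layout: each node holds a list of (value, weight) pairs.
Partition = List[Tuple[float, float]]


def local_summary(partition: Partition) -> Tuple[float, float]:
    """Runs on each node: return (sum of weight*value, sum of weights)."""
    weighted_sum = sum(value * weight for value, weight in partition)
    total_weight = sum(weight for _, weight in partition)
    return weighted_sum, total_weight


def combine(summaries: Iterable[Tuple[float, float]]) -> float:
    """Runs on the coordinator: merge per-node summaries and divide once."""
    weighted_sum = 0.0
    total_weight = 0.0
    for node_weighted_sum, node_weight in summaries:
        weighted_sum += node_weighted_sum
        total_weight += node_weight
    if total_weight == 0:
        raise ValueError("total weight is zero; weighted average is undefined")
    return weighted_sum / total_weight


# Hypothetical data: three nodes, each holding part of the column.
nodes = [
    [(10.0, 1.0), (20.0, 3.0)],
    [(30.0, 2.0)],
    [(40.0, 4.0), (50.0, 0.0)],
]

distributed = combine(local_summary(p) for p in nodes)

# Reference value computed centrally over all rows, for comparison.
all_rows = [row for p in nodes for row in p]
centralised = sum(v * w for v, w in all_rows) / sum(w for _, w in all_rows)

print(distributed, centralised)
```

Only two numbers per node cross the network, whatever the partition size. The same trick does not carry over directly to the median, because a median cannot be rebuilt from per-node medians alone.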

e) Consider the following relational database schemas that store information about hotels, room bookings and guests, and which need to be integrated. Aiming at reconciling the conflicts that need to be resolved before integration is possible, answer the following questions. Use the notation Schema_i.TableName.AttributeName to refer to an attribute in any of the two schemas, and Schema_i.TableName to refer to a table in any of the two schemas.

Schema_1:
  Hotel(hotelNo, hotelname, address)
  Room(roomNo, hotelno, type, price, features)
  Booking(hotelNo, guestno, datefrom, dateto, roomno)
  HolidayGuest(guestNo, guestname, guestaddress, packageno)
  BusinessGuest(guestNo, guestname, guestaddress)
  Package(packageNo, packagename, Company, discount)

Schema_2:
  Hotel(hotelNo, hotelname, classification, streetno, streetname, postcode, city, country)
  HotelFacilities(hotelNo, facilityno, numofunits)
  Facility(facilityNo, facilityname, description)
  Bedroom(roomNo, hotelno, type, additionals, price)
  Reservation(roomNo, hotelno, guestno, datefrom, dateto, discountguesttype)
  Guest(guestNo, guestname, guestaddress, type)

i. Describe two different one-to-one table structure conflicts, one of them being the case of a missing, but implicit attribute.

ii. Describe one table inclusion conflict. (1 marks)

iii. Produce an SQL view that derives a table with the following structure from the Schema_2 database, where durationofstay can be obtained by simply subtracting attribute datefrom from attribute dateto, discount can be null for some guests, and totaltopay can be obtained using durationofstay, roomprice and discount.

  HotelBill(guestNo, roomprice, durationofstay, discount, totaltopay)

Section B

3. a) In the context of big data and business intelligence:

(i) Characterise the 5 Vs of big data.

(ii) Consider the differences between OLTP and OLAP. Why might different database systems be used for each type?

(iii) Compare and contrast the use of row store and column store; in particular, consider the advantages of their use for data warehousing and OLTP. Illustrate your answer.

b) In the context of classification:

(i) Outline a decision tree classification algorithm.

(ii) Explain the potential effects of unbalanced data on the usefulness of a classifier. Suggest the advantages of various methods to handle unbalanced data. Give examples to support your answer.

c) In the context of association rule (itemset) mining:

The following contingency matrix shows a breakdown of transactions for coffee and tea drinkers (assume 1000 transactions).

            coffee   not coffee   Total
tea           150        50         200
not tea       650       150         800
Total         800       200        1000

(i) Calculate support, confidence and lift for the association rule {tea} -> {coffee}.

(ii) Explain why lift better represents the relationship between tea and coffee drinkers.

(iii) Consider ways in which association rule (itemset) mining might be extended. Discuss in your answer adding multi-dimensionality, quantitative intervals and hierarchies to the standard approach.

d) In the context of the analytics landscape:

Discuss and contextualise the four purposes of analytics in the analytics landscape and their role in insight and decision making; in particular, consider the temporal and business value factors. Illustrate your answer. (6 marks)
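For reference, the three measures named in Question 3 c) (i) can be computed from the counts in the contingency matrix using their standard definitions. The sketch below is illustrative only; the function name `rule_metrics` and its parameter names are assumptions introduced for the example, and the counts are read from the tea/coffee matrix given in the question.

```python
from typing import Tuple


def rule_metrics(
    n_both: int, n_antecedent: int, n_consequent: int, n_total: int
) -> Tuple[float, float, float]:
    """Return (support, confidence, lift) for a rule X -> Y.

    n_both:       transactions containing both X and Y
    n_antecedent: transactions containing X
    n_consequent: transactions containing Y
    n_total:      all transactions
    """
    support = n_both / n_total                    # P(X and Y)
    confidence = n_both / n_antecedent            # P(Y | X)
    lift = confidence / (n_consequent / n_total)  # P(Y | X) / P(Y)
    return support, confidence, lift


# Counts for {tea} -> {coffee}: 150 tea-and-coffee, 200 tea, 800 coffee, 1000 total.
support, confidence, lift = rule_metrics(150, 200, 800, 1000)
print(f"support={support:.4f} confidence={confidence:.4f} lift={lift:.4f}")
```

With these counts the rule has support 0.15, confidence 0.75 and lift 0.9375. Confidence is compared against the baseline share of coffee transactions (0.8), which is what lift makes explicit.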

4. a) In the context of association rule (itemset) mining:

(i) Discuss the terms correlation and causation.

(ii) Outline the working of the Apriori algorithm. Explain the importance of the subset property. (5 marks)

(iii) Using Apriori, suppose that L3 is the list:

{ {a,b,c}, {a,b,d}, {a,c,d}, {b,c,d}, {b,c,w}, {b,c,x}, {p,q,r}, {p,q,s}, {p,q,t}, {p,r,s}, {q,r,s} }.

a. At the join step of the algorithm, which itemsets are placed in C4 (the candidate set)?

b. Which itemsets are discarded by the prune step of the algorithm?

Provide your working as appropriate.

b) In the context of classification:

(i) Two classifiers designed to predict patients' susceptibility to allergy are being designed and tested, independently of one another. Each of the classifiers predicts that a patient is either positive (allergic) or negative (normal) based on a combination of observable factors. The tests result in the following two confusion matrices, one for each classifier:

Table A:
                         predicted
                   allergic   normal   Total
actual  allergic       30        70      100
        normal         20       500      520
        Total          50       570      620

Table B:
                         predicted
                   allergic   normal   Total
actual  allergic       70        30      100
        normal        200       320      520
        Total         270       350      620

Calculate the accuracy, recall, and F-measure for each classifier based on these tables. Based on their values, can you recommend one classifier over the other, given this type of application? Justify your answer.

(ii) Compare and contrast information gain, gain ratio and the GINI index.

(iii) Some classification algorithms run out of memory in trying to fit all data in memory to create the classification model. Discuss ways in which you might address the issue of memory capacity.

c) Discuss the trade-offs between privacy of the individual on the one hand and the utility of data on the other. Use well-known case studies of data disclosure (AOL, Netflix, Barabasi) as appropriate to illustrate these trade-offs. How do we better guarantee that data is private? (6 marks)

END OF EXAMINATION
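The measures requested in Question 4 b) (i) follow mechanically from the four cells of each confusion matrix. The sketch below is one way to compute them, under two assumptions that the paper does not state: "allergic" is treated as the positive class, and "F-measure" is read as the balanced F1 score (the harmonic mean of precision and recall). The function name `binary_metrics` is hypothetical.

```python
from typing import Dict


def binary_metrics(tp: int, fn: int, fp: int, tn: int) -> Dict[str, float]:
    """Accuracy, precision, recall and F1 from confusion-matrix counts.

    Assumes 'allergic' is the positive class and that every denominator
    is non-zero, which holds for both tables in the question.
    """
    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total
    recall = tp / (tp + fn)       # share of actual positives that were found
    precision = tp / (tp + fp)    # share of predicted positives that were right
    f1 = 2 * precision * recall / (precision + recall)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Cell counts read from Table A and Table B (rows = actual, columns = predicted).
table_a = binary_metrics(tp=30, fn=70, fp=20, tn=500)
table_b = binary_metrics(tp=70, fn=30, fp=200, tn=320)

for name, metrics in (("A", table_a), ("B", table_b)):
    print(name, {key: round(value, 3) for key, value in metrics.items()})
```

The two classifiers rank differently depending on the measure, so the choice of measure has to be argued from the cost of a missed allergic patient versus a false alarm.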