Section A. 1. a) Explain the evolution of information systems into today s complex information ecosystems and its consequences.

Similar documents
Tribhuvan University Institute of Science and Technology MODEL QUESTION

Two hours UNIVERSITY OF MANCHESTER SCHOOL OF COMPUTER SCIENCE

R07. FirstRanker. 7. a) What is text mining? Describe about basic measures for text retrieval. b) Briefly describe document cluster analysis.

DATA WAREHOUING UNIT I

List of Exercises: Data Mining 1 December 12th, 2015

Lectures for the course: Data Warehousing and Data Mining (IT 60107)

Table Of Contents: xix Foreword to Second Edition

DATA MINING TRANSACTION

PESIT- Bangalore South Campus Hosur Road (1km Before Electronic city) Bangalore

Time: 3 hours. Full Marks: 70. The figures in the margin indicate full marks. Answers from all the Groups as directed. Group A.

Data Mining. Data preprocessing. Hamid Beigy. Sharif University of Technology. Fall 1395

Data Mining. Data preprocessing. Hamid Beigy. Sharif University of Technology. Fall 1394

Code No: R Set No. 1

This tutorial has been prepared for computer science graduates to help them understand the basic-to-advanced concepts related to data mining.

CS435 Introduction to Big Data Spring 2018 Colorado State University. 3/21/2018 Week 10-B Sangmi Lee Pallickara. FAQs. Collaborative filtering

Data Preprocessing. Slides by: Shree Jaswal

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

COMPUTER SCIENCE AND ENGINEERING TUTORIAL QUESTION BANK

Contents. Foreword to Second Edition. Acknowledgments About the Authors

Basics of Dimensional Modeling

UNIT -1 UNIT -II. Q. 4 Why is entity-relationship modeling technique not suitable for the data warehouse? How is dimensional modeling different?

CHAPTER-13. Mining Class Comparisons: Discrimination between DifferentClasses: 13.4 Class Description: Presentation of Both Characterization and

CS377: Database Systems Data Warehouse and Data Mining. Li Xiong Department of Mathematics and Computer Science Emory University

INSTITUTE OF AERONAUTICAL ENGINEERING (Autonomous) Dundigal, Hyderabad

INTRODUCTION TO DATA MINING. Daniel Rodríguez, University of Alcalá

Analytical model A structure and process for analyzing a dataset. For example, a decision tree is a model for the classification of a dataset.

Data Mining: Data. Lecture Notes for Chapter 2. Introduction to Data Mining

Lecture 6 K- Nearest Neighbors(KNN) And Predictive Accuracy

2. (a) Briefly discuss the forms of Data preprocessing with neat diagram. (b) Explain about concept hierarchy generation for categorical data.

International Journal of Data Mining & Knowledge Management Process (IJDKP) Vol.7, No.3, May Dr.Zakea Il-Agure and Mr.Hicham Noureddine Itani

Machine Learning - Clustering. CS102 Fall 2017

UNIT -1 UNIT -II. Q. 4 Why is entity-relationship modeling technique not suitable for the data warehouse? How is dimensional modeling different?

2. On classification and related tasks

Classification. Vladimir Curic. Centre for Image Analysis Swedish University of Agricultural Sciences Uppsala University

Data Science Course Content

Question Bank. 4) It is the source of information later delivered to data marts.

Introduction to Data Mining

Business Intelligence Roadmap HDT923 Three Days

Data Mining: Classifier Evaluation. CSCI-B490 Seminar in Computer Science (Data Mining)

Machine Learning Techniques for Data Mining

Slides for Data Mining by I. H. Witten and E. Frank

1. Inroduction to Data Mininig

CT75 (ALCCS) DATA WAREHOUSING AND DATA MINING JUN

Data Mining. Part 2. Data Understanding and Preparation. 2.4 Data Transformation. Spring Instructor: Dr. Masoud Yaghini. Data Transformation

Machine Learning: Algorithms and Applications Mockup Examination

Intro to Artificial Intelligence

Database design View Access patterns Need for separate data warehouse:- A multidimensional data model:-

Data Mining: Concepts and Techniques. (3 rd ed.) Chapter 3. Chapter 3: Data Preprocessing. Major Tasks in Data Preprocessing

SCHEME OF COURSE WORK. Data Warehousing and Data mining

Data Preprocessing Yudho Giri Sucahyo y, Ph.D , CISA

Development of datamining software for the city water supply company

An Overview of various methodologies used in Data set Preparation for Data mining Analysis

Evaluation Measures. Sebastian Pölsterl. April 28, Computer Aided Medical Procedures Technische Universität München

Data Mining: Exploring Data. Lecture Notes for Chapter 3

K-Nearest-Neighbours with a Novel Similarity Measure for Intrusion Detection

What is Data Mining? Data Mining. Data Mining Architecture. Illustrative Applications. Pharmaceutical Industry. Pharmaceutical Industry

CS4445 Data Mining and Knowledge Discovery in Databases. A Term 2008 Exam 2 October 14, 2008

What is Data Mining? Data Mining. Data Mining Architecture. Illustrative Applications. Pharmaceutical Industry. Pharmaceutical Industry

The Comparative Study of Machine Learning Algorithms in Text Data Classification*

Data Mining: Exploring Data. Lecture Notes for Chapter 3. Introduction to Data Mining

CS6375: Machine Learning Gautam Kunapuli. Mid-Term Review

Data Mining: Exploring Data. Lecture Notes for Data Exploration Chapter. Introduction to Data Mining

IT DATA WAREHOUSING AND DATA MINING UNIT-2 BUSINESS ANALYSIS

DATA WAREHOUSE EGCO321 DATABASE SYSTEMS KANAT POOLSAWASD DEPARTMENT OF COMPUTER ENGINEERING MAHIDOL UNIVERSITY

K- Nearest Neighbors(KNN) And Predictive Accuracy

Evaluating Classifiers

Contents. Part I Setting the Scene

Model s Performance Measures

Decision Support Systems 2012/2013. MEIC - TagusPark. Homework #1. Due: 11.Mar.2013

Data Set. What is Data Mining? Data Mining (Big Data Analytics) Illustrative Applications. What is Knowledge Discovery?

Evaluating Classifiers

Machine Learning: Symbol-based

INFORMATION-THEORETIC OUTLIER DETECTION FOR LARGE-SCALE CATEGORICAL DATA

Sql Fact Constellation Schema In Data Warehouse With Example

On-Line Analytical Processing (OLAP) Traditional OLTP

ECLT 5810 Evaluation of Classification Quality

Evaluating Classifiers

DATA MINING AND WAREHOUSING

OLAP2 outline. Multi Dimensional Data Model. A Sample Data Cube


Classification with Decision Tree Induction

Unit 7: Basics in MS Power BI for Excel 2013 M7-5: OLAP

Data Warehousing and Data Mining. Announcements (December 1) Data integration. CPS 116 Introduction to Database Systems

Machine Learning and Bioinformatics 機器學習與生物資訊學

COMP 465: Data Mining Classification Basics

Building Intelligent Learning Database Systems

Network Traffic Measurements and Analysis

Summary of Last Chapter. Course Content. Chapter 3 Objectives. Chapter 3: Data Preprocessing. Dr. Osmar R. Zaïane. University of Alberta 4

INTRODUCTION TO MACHINE LEARNING. Measuring model performance or error

CMPUT 391 Database Management Systems. Data Mining. Textbook: Chapter (without 17.10)

Data mining: concepts and algorithms

Data Mining: STATISTICA

Unsupervised learning in Vision

COMP 465 Special Topics: Data Mining

CT75 DATA WAREHOUSING AND DATA MINING DEC 2015

Assignment 1. Question 1: Brock Wilcox CS

Data Warehousing & Mining. Data integration. OLTP versus OLAP. CPS 116 Introduction to Database Systems

University of Florida CISE department Gator Engineering. Data Preprocessing. Dr. Sanjay Ranka

ISLEWORTH & SYON BOYS SCHOOL

CHAPTER 4 STOCK PRICE PREDICTION USING MODIFIED K-NEAREST NEIGHBOR (MKNN) ALGORITHM

Transcription:

Section A 1. a) Explain the evolution of information systems into today s complex information ecosystems and its consequences. b) Discuss the reasons behind the phenomenon of data retention, its disadvantages, and how data retention can be overcome. (10 marks) c) Discuss Web Services as a means for processing data. In your discussion, mention at least two issues that would: i. Prevent you from using Web Services ii. Drive you to use Web Services in a business context. (11 marks) d) Define the term data quality methodology and describe the three common phases of a data quality methodology. Page 2 of 7

2. a) Explain the following: Nowadays, Business deals with data, not with documents. (6 marks) b) Give an example of a company, describing its business. Provide a data lifecycle management solution that can help the company keep high data accessibility and performance without increase in costs. (8 marks) c) Enumerate and explain three advantages and three disadvantages of the Web- DBMS integration approach, providing examples. (11 marks) d) Explain the reason why most data quality methodologies focus on structured and semi-structured data. [PTO] Page 3 of 7

Section B 3. a) Suppose a market shopping data warehouse consists of four dimensions: customer, date, product, and store, and two measures: count, and avg sales, where avg sales stores the real sales in pounds at the lowest level but the corresponding average sales at other levels. (i) (ii) (iii) Draw a snowflake schema diagram (you do not have to mark every possible level, but make clear your implicit assumptions on the levels of a dimension). (3 marks) Starting with the base cuboid [customer, date, product, store], what specific OLAP operations (e.g., roll-up student to department (level)) should be performed in order to list the average sales of each cosmetic product since January 2005? If each dimension has 5 levels (excluding all), such as store-city-stateregion-country, how many cuboids does this cube contain (including base and apex cuboids)? (2 marks) b) (i) Two classifiers designed to predict patients' susceptibility to allergy are being designed and tested, independently of one another. Each of the classifiers predict that a patient is either positive (allergic) or negative (normal) based on a combination of observable factors. The tests result in the following two confusion matrices, one for each classifier: Matrix A: actual predicted allergic normal total allergic 20 80 100 normal 30 490 520 total 50 570 620 Page 4 of 7

Matrix B: actual predicted allergic normal total allergic 80 20 100 normal 190 330 520 total 270 350 620 Calculate the accuracy, recall, and F-measure for each of the classifiers based on these tables. Based on their values, can you recommend one classifier over the other, considering the type of application they are used for? Justify your answer (ii) A new classifier that can associate probabilities to class assignments is being tested, and it has produced the following table of class assignments: instance class probability distribution predicted error? c1 c2 c3 1 0.1 0.55 0.35 c2 Y 2 0.2 0.7 0.1 c2 N Quadratic and information loss functions are two ways to assess the prediction accuracy of this classifier. Calculate both of them for this table (the table is partial but, for the sake of the exercise, we may assume it is complete), and briefly discuss why you would use one of them rather then the other to report on the classifier accuracy. c) Suppose that you are employed as a data mining consultant for an Internet search engine company. Describe how data mining can help the company by giving specific examples of how techniques, such as clustering, classification, association rule mining and anomaly detection can be applied. Give examples to support your answer. (12 marks) Page 5 of 7

4. a) Consider the problem of finding the K nearest neighbours of a data object. A programmer designs the following algorithm for this task. 1: for i = 1 to number of data objects do 2: Find the distances of the ith object to all other objects. 3: Sort these distances in decreasing order. (Keep track of which object is associated with each distance.) 4: return the objects associated with the first K distances of the sorted list 5: end for Describe the potential problems with this algorithm if there are duplicate objects in the data set. Assume the distance function will only return a distance of 0 for objects that are the same. How would you address this problem? Make clear any assumptions you make. b) The following contingency shows a breakdown of transactions for two products, A and B. B not B sum(row) A 6500 1000 7500 not A 1000 1500 2500 sum(col) 7500 2500 10000 Use the table and the lift metric to determine whether attributes A and B are negatively correlated, positively correlated, or independent. If you were the shop manager, what would you conclude regarding the joint sales or products A and B? Page 6 of 7

c) Given the following dataset: A B class a x c1 b x c1 a x c2 a y c2 b y c2 b y c2 Use the information gain metric to determine which of the two attributes is the better one to split on when constructing a decision tree. Show your working. (6 marks) d) Both decision-tree induction and associative classification may generate rules for classification. What are their major differences? Why is it that in many cases an associative induction may lead to better accuracy in prediction? e) Distinguish between noise and outliers. In your answer address the following questions (justify your answers): Is noise ever interesting or desirable? Can noise objects be outliers? Are noise objects always outliers? Are outliers always noise objects? Can noise make a typical value into an unusual one, or vice versa? (10 marks) END OF EXAMINATION Page 7 of 7