CS220 Data Analytics Qualification Exam, May 18, 2014, 9am to 12 noon


CS220 Data Analytics
Qualification Exam, May 18, 2014, 9am to 12 noon

Number assigned to you: ____

Note: DO NOT write any information related to your name or KAUST student ID.

1. There should be 12 pages including this cover page.
2. Closed-book exam. No books, notes, computers, phones, or internet access.
3. A calculator with basic functions is allowed.
4. If you need more room to work out your answer to a question, please use the back of the page and clearly indicate that we should look there.
5. You have 180 minutes. No extension will be given.
6. Good luck!

Grading (for instructor use only):

Question | Topic                           | Max. Score | Score
1        | Local search                    | 12         |
2        | Constraint satisfaction problem | 8          |
3        | Principal component analysis    | 10         |
4        | Data preprocessing              | 8          |
5        | ROC curves and AUC              | 10         |
6        | Counting                        | 12         |
7        | Maximum likelihood estimation   | 8          |
8        | Feature selection               | 12         |
9        | A* search                       | 10         |
10       | Decision tree                   | 10         |
Total    |                                 | 100        |

1. (12 points) Local search

Suppose you are given a problem and are asked to solve it with a genetic algorithm. List six components of the genetic algorithm that need to be defined/specified before you can apply it to solve the problem.
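
For reference, here is a minimal sketch of how such components fit together in code. The bit-string encoding, one-max fitness function, and all numeric settings are illustrative assumptions, not part of the exam.

```python
# A minimal genetic-algorithm skeleton (illustrative; the bit-string encoding,
# one-max fitness, and all numeric settings are assumptions, not exam content).
# The design choices the question asks about appear as named pieces below:
# (1) chromosome encoding, (2) fitness function, (3) population size/initialization,
# (4) selection scheme, (5) crossover operator, (6) mutation operator and rate,
# plus a termination condition.
import random

LENGTH = 20           # (1) encoding: fixed-length bit strings
POP_SIZE = 50         # (3) population size
MUTATION_RATE = 0.01  # (6) per-gene mutation rate
GENERATIONS = 100     # termination: fixed generation budget

def fitness(chrom):   # (2) fitness: toy "one-max" objective (count of 1-bits)
    return chrom.count("1")

def select(pop):      # (4) selection: tournament of size 2
    a, b = random.sample(pop, 2)
    return a if fitness(a) >= fitness(b) else b

def crossover(p1, p2):  # (5) one-point crossover
    point = random.randrange(1, LENGTH)
    return p1[:point] + p2[point:]

def mutate(chrom):    # (6) bit-flip mutation, applied gene by gene
    return "".join(("1" if c == "0" else "0") if random.random() < MUTATION_RATE else c
                   for c in chrom)

population = ["".join(random.choice("01") for _ in range(LENGTH))
              for _ in range(POP_SIZE)]
for _ in range(GENERATIONS):
    population = [mutate(crossover(select(population), select(population)))
                  for _ in range(POP_SIZE)]
print(max(population, key=fitness))
```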

2. (8 points) Constraint satisfaction problem

Fill in the following table for the arc consistency check. [The constraint graph for this question appeared as a figure in the original exam and is not reproduced in this transcription.]

Arc examined | Value deleted
             |
             |
             |

Note: "Value deleted" should be answered with the value and the corresponding node.

3. (10 points) Principal component analysis

Given three data points in three-dimensional space, (1,1,1), (2,2,4), and (3,3,7), show how to use PCA to reduce the dimensionality of the data.
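
For checking a hand-worked solution, here is a minimal numerical PCA sketch on these three points (center, eigendecompose the covariance matrix, project onto the leading eigenvector). This is one standard route, not necessarily the derivation the graders expect:

```python
# A worked PCA sketch on the three exam points (for self-checking only).
import numpy as np

X = np.array([[1., 1., 1.],
              [2., 2., 4.],
              [3., 3., 7.]])

X_centered = X - X.mean(axis=0)           # step 1: center the data
C = np.cov(X_centered, rowvar=False)      # step 2: sample covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)      # step 3: eigendecomposition (ascending)
order = np.argsort(eigvals)[::-1]         # sort components by explained variance
W = eigvecs[:, order[:1]]                 # step 4: keep the top principal component
Z = X_centered @ W                        # step 5: project onto that component

print(eigvals[order])  # the points lie on one line, so a single eigenvalue dominates
print(Z.ravel())       # 1-D coordinates of the three points
```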

4. (8 points) Data preprocessing

You are given a classification dataset that consists of 8 data samples, each of which is represented by three features. A blank entry in the Label column means the sample is unlabeled.

Sample index | Feature 1 | Feature 2 | Feature 3 | Label
S1           | 5         | 0.2       | 800       | 0
S2           | 8         | 0.3       | 300       | 1
S3           | 2         | 0.5       | 800       |
S4           | 6         | 0.5       | 150       | 1
S5           | 7         | 0.4       | 250       |
S6           | 4         | 0.3       | 750       |
S7           | 1         | 0.4       | 750       | 0
S8           | 9         | 0.5       | 200       |

Now you are going to solve a supervised learning problem on this dataset. Normalize the training data so that each feature value after normalization lies between 0 and 1, and give the training data after normalization. (Hint: the training data may not necessarily be the entire dataset.)
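
A minimal min-max normalization sketch, assuming the reading that only the labeled samples (S1, S2, S4, S7) constitute the training data. Deciding which samples to normalize over is the point of the question, so treat this as illustrative:

```python
# Min-max normalization of the labeled samples (assumed training set: S1, S2, S4, S7).
import numpy as np

# Rows are samples; columns are Feature 1, Feature 2, Feature 3.
train = np.array([[5, 0.2, 800],
                  [8, 0.3, 300],
                  [6, 0.5, 150],
                  [1, 0.4, 750]])

mins = train.min(axis=0)
maxs = train.max(axis=0)
normalized = (train - mins) / (maxs - mins)   # each feature mapped onto [0, 1]
print(normalized)
```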

5. (10 points) ROC curves and AUC

You are given a dataset that contains five data samples, each represented by one feature. The feature values and the corresponding labels are given in the table below. Draw the ROC curve for this dataset and calculate the AUC. (Hint: there is no need to smooth the ROC curve.)

Sample index | Feature value | Label
S1           | 0.8           | 0
S2           | 0.8           | 1
S3           | 0.9           | 1
S4           | 0.9           | 0
S5           | 0.6           | 1
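
The curve itself is meant to be drawn by hand, but scikit-learn can sanity-check the step-curve points and the AUC, assuming the feature value is used directly as the ranking score:

```python
# Cross-checking the hand-drawn ROC curve and AUC with scikit-learn.
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

scores = np.array([0.8, 0.8, 0.9, 0.9, 0.6])  # feature value used as the score
labels = np.array([0, 1, 1, 0, 1])

fpr, tpr, thresholds = roc_curve(labels, scores)
print(list(zip(fpr, tpr)))            # vertices of the (unsmoothed) step curve
print(roc_auc_score(labels, scores))  # area under that curve
```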

6. (12 points) Counting (This question has THREE subquestions.)

Suppose you want to train a classification model, P(X | Y), where X is a feature vector of length n (n features) and Y is the class label. Assume that each feature has two possible discrete values and there are three possible classes.

a. (4 points) How many independent parameters do you need to train in order to directly learn P(X | Y)?
b. (4 points) If we use naive Bayes for P(X | Y), what assumption do you need to make?
c. (4 points) If we use naive Bayes and suppose your assumption holds, how many independent parameters do you need to learn?
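
For reference, a sketch of the standard counting argument (the usual textbook reasoning, not the official answer key): for each of the 3 classes, a full table over the $2^n$ joint feature values has $2^n - 1$ free entries, while naive Bayes stores one Bernoulli parameter per feature per class:

\[
\text{direct: } 3\,(2^n - 1), \qquad \text{naive Bayes: } 3n .
\]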

7. (8 points) Maximum likelihood estimation

In DNA, also known as the "code of life", there are four possible bases: adenine (abbreviated A), cytosine (C), guanine (G), and thymine (T). You are given an organism with an unknown set of DNA base frequencies; let p_A, p_C, p_G, and p_T denote those unknown frequencies. Suppose you obtain a strand of DNA and want to infer the unknown frequencies from it. Let n_A, n_C, n_G, and n_T be the corresponding numbers of bases that you observe for A, C, G, and T. Derive the maximum likelihood estimates of the unknown parameters p_A, p_C, p_G, and p_T.
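
A sketch of the standard multinomial MLE derivation, using a Lagrange multiplier for the constraint that the frequencies sum to one (the usual textbook route; other derivations may be acceptable):

\[
\ell = \sum_{b \in \{A,C,G,T\}} n_b \log p_b + \lambda \Big( 1 - \sum_b p_b \Big), \qquad
\frac{\partial \ell}{\partial p_b} = \frac{n_b}{p_b} - \lambda = 0
\;\Longrightarrow\;
\hat{p}_b = \frac{n_b}{\lambda},
\]

and the constraint forces $\lambda = n_A + n_C + n_G + n_T$, so each estimate is the observed relative frequency, $\hat{p}_b = n_b / (n_A + n_C + n_G + n_T)$.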

8. (12 points) Feature selection (This question has TWO subquestions.)

a. (6 points) What is the main difference between filter methods and wrapper methods for feature selection?
b. (6 points) List the advantages and disadvantages of filter methods and wrapper methods for feature selection.
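
The distinction is conceptual, but it can be made concrete in code: a filter scores features independently of any classifier, while a wrapper searches subsets by repeatedly training the target classifier. A small scikit-learn illustration (the synthetic dataset and model choices are assumptions, not from the exam):

```python
# Filter vs. wrapper feature selection, side by side (illustrative only).
from sklearn.datasets import make_classification
from sklearn.feature_selection import (SelectKBest, mutual_info_classif,
                                       SequentialFeatureSelector)
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=10, n_informative=3,
                           random_state=0)

# Filter: ranks each feature against the label, with no classifier in the loop.
filt = SelectKBest(mutual_info_classif, k=3).fit(X, y)
print("filter keeps:", filt.get_support(indices=True))

# Wrapper: forward selection that retrains the target classifier on each candidate subset.
wrap = SequentialFeatureSelector(LogisticRegression(max_iter=1000),
                                 n_features_to_select=3, cv=3).fit(X, y)
print("wrapper keeps:", wrap.get_support(indices=True))
```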

9. (10 points) A* search

Use A* search to solve the following problem, where S is the starting node and G is the goal node. The heuristic function values for each node and the edge weights are known. [The search graph appeared as a figure in the original exam and is not reproduced in this transcription.]

At each step, specify: the nodes that have been expanded; the nodes in the queue; the next node selected to be expanded at this step; and the evaluation value, i.e., f, for this selected node.

Step | Nodes expanded | Nodes in queue | Next node to expand | f for the next node
1    | None           | S              | S                   | 10

Fill in the table by running standard A* search until the algorithm terminates. Then list the final path from S to G selected by A* search and the final cost of that path.
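
A minimal A* sketch for reference. Since the exam's graph is only available as a figure, the graph and heuristic below are hypothetical stand-ins; only the algorithmic skeleton carries over:

```python
# A* search skeleton (the graph and heuristic are hypothetical, not the exam's).
import heapq

graph = {            # hypothetical edge weights
    "S": {"A": 2, "B": 5},
    "A": {"G": 9},
    "B": {"G": 4},
    "G": {},
}
h = {"S": 10, "A": 7, "B": 4, "G": 0}   # hypothetical admissible heuristic

def a_star(start, goal):
    frontier = [(h[start], 0, start, [start])]   # entries: (f, g, node, path)
    best_g = {start: 0}
    while frontier:
        f, g, node, path = heapq.heappop(frontier)   # expand the lowest-f node
        if node == goal:
            return path, g
        for nbr, w in graph[node].items():
            ng = g + w
            if ng < best_g.get(nbr, float("inf")):
                best_g[nbr] = ng
                heapq.heappush(frontier, (ng + h[nbr], ng, nbr, path + [nbr]))
    return None, float("inf")

print(a_star("S", "G"))   # (['S', 'B', 'G'], 9) on this stand-in graph
```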

10. (10 points) Decision tree

Consider the following training data and the following decision tree learned from this data using the ID3 algorithm (without any post-pruning); the last column of the data is the class label. [The training data table and the learned tree appeared as figures in the original exam and are not reproduced in this transcription.]

Show that the choice of the Wind attribute at the second level of the tree is correct, by showing that its information gain is superior to that of the alternative choices. Information gain is defined as

\[
\mathrm{Gain}(S, A) = \mathrm{Entropy}(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\,\mathrm{Entropy}(S_v),
\]

where S is the set of examples at the node being split and S_v is the subset of S for which attribute A takes value v.
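
Generic helpers for the entropy and information-gain computations. The exam's training data lives in the omitted figure, so the labels and attribute values in the usage line are hypothetical:

```python
# Entropy and information gain for ID3 (generic sketch; the usage example is
# hypothetical because the exam's training data is in the omitted figure).
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(labels, attribute_values):
    """Gain(S, A) = Entropy(S) - sum_v |S_v|/|S| * Entropy(S_v)."""
    n = len(labels)
    gain = entropy(labels)
    for v in set(attribute_values):
        subset = [lab for lab, a in zip(labels, attribute_values) if a == v]
        gain -= len(subset) / n * entropy(subset)
    return gain

# Hypothetical usage: class labels and one candidate attribute, aligned by example.
print(information_gain(["+", "+", "-", "-"],
                       ["Weak", "Weak", "Strong", "Strong"]))  # prints 1.0
```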
