CS220 Data Analytics                          Number assigned to you: ______
Qualification Exam, May 18, 2014, 9am to 12 noon

Note: DO NOT write any information related to your name or KAUST student ID.

1. There should be 12 pages including this cover page.
2. Closed-book exam. No books, notes, computers, phones, or internet access.
3. A calculator with basic functions is allowed.
4. If you need more room to work out your answer to a question, please use the back of the page and clearly indicate that we should look there.
5. You have 180 minutes. No extension will be given.
6. Good luck!

Grading: (for instructor use only)

Question  Topic                              Max. Score   Score
1         Local search                       12
2         Constraint satisfaction problem     8
3         Principal component analysis       10
4         Data preprocessing                  8
5         ROC curves and AUC                 10
6         Counting                           12
7         Maximum likelihood estimation       8
8         Feature selection                  12
9         A* Search                          10
10        Decision tree                      10
Total                                       100
1. (12 points) Local search

Suppose you are given a problem and are asked to solve it with a genetic algorithm. List six components of the genetic algorithm that must be defined/specified before you can apply it to the problem.
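For reference, a minimal sketch of how such components fit together in code. Every concrete choice below (bit-string encoding, one-max fitness, tournament selection, one-point crossover, bit-flip mutation, fixed generation budget) is an illustrative assumption, not part of the exam problem.

```python
import random

# Minimal genetic-algorithm sketch. The components that must be specified
# before a GA can run: encoding, fitness function, population initialization,
# selection scheme, crossover operator, mutation operator, and a termination
# condition. All concrete choices here are illustrative assumptions.

GENOME_LEN, POP_SIZE, GENERATIONS, MUT_RATE = 20, 30, 50, 0.02

def fitness(ind):                       # fitness function (here: count of 1s)
    return sum(ind)

def random_individual():                # encoding: fixed-length bit string
    return [random.randint(0, 1) for _ in range(GENOME_LEN)]

def select(pop):                        # selection: tournament of size 2
    a, b = random.sample(pop, 2)
    return a if fitness(a) >= fitness(b) else b

def crossover(p1, p2):                  # crossover: one-point
    cut = random.randrange(1, GENOME_LEN)
    return p1[:cut] + p2[cut:]

def mutate(ind):                        # mutation: independent bit flips
    return [1 - g if random.random() < MUT_RATE else g for g in ind]

pop = [random_individual() for _ in range(POP_SIZE)]    # initialization
for _ in range(GENERATIONS):                            # termination condition
    pop = [mutate(crossover(select(pop), select(pop))) for _ in range(POP_SIZE)]
print(max(fitness(ind) for ind in pop))
```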
2. (8 points) Constraint satisfaction problem

Please fill in the following form for the arc consistency check.

Arc examined        Value deleted

Note: "Value deleted" should be answered with the value and the corresponding node.
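As background, a hedged sketch of the AC-3-style arc consistency procedure that such a form traces. The `domains`, `constraints`, and `arcs` arguments are hypothetical placeholders, since the exam's actual constraint graph is not reproduced above.

```python
from collections import deque

# Sketch of AC-3 arc consistency. `domains` maps each variable to its
# candidate values; `constraints[(x, y)]` is a function returning True when
# a pair of values is allowed; `arcs` lists the directed arcs to check.
# All three are hypothetical stand-ins for the exam's constraint graph.
def ac3(domains, constraints, arcs):
    queue = deque(arcs)
    while queue:
        x, y = queue.popleft()                 # "Arc examined" column
        removed = [vx for vx in domains[x]
                   if not any(constraints[(x, y)](vx, vy) for vy in domains[y])]
        if removed:                            # "Value deleted" column
            domains[x] = [v for v in domains[x] if v not in removed]
            if not domains[x]:
                return None                    # domain wiped out: inconsistent
            queue.extend((z, x) for (z, w) in arcs if w == x and z != y)
    return domains
```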
3. (10 points) Principal Component Analysis

You are given three data points in three-dimensional space: (1,1,1), (2,2,4), and (3,3,7). Please show how to use PCA to reduce the dimensionality of the data.
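A numerical check of what the by-hand derivation should produce, sketched with NumPy. Keeping a single component is an assumption based on the three points lying on a line (the covariance matrix has rank one).

```python
import numpy as np

# PCA on the three exam points: center the data, eigendecompose the
# sample covariance matrix, and project onto the top principal component.
X = np.array([[1., 1., 1.],
              [2., 2., 4.],
              [3., 3., 7.]])

Xc = X - X.mean(axis=0)                  # center each feature
cov = Xc.T @ Xc / (X.shape[0] - 1)       # sample covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order

top = eigvecs[:, np.argsort(eigvals)[::-1][:1]]   # leading eigenvector
Z = Xc @ top                             # 1-D representation of the data
print(eigvals, Z, sep="\n")
```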
4. (8 points) Data Preprocessing

You are given a classification dataset that consists of 8 data samples, each of which is represented by three features.

Sample index  Feature 1  Feature 2  Feature 3  Label
S1            5          0.2        800        0
S2            8          0.3        300        1
S3            2          0.5        800
S4            6          0.5        150        1
S5            7          0.4        250
S6            4          0.3        750
S7            1          0.4        750        0
S8            9          0.5        200

Now you are going to solve a supervised learning problem on this dataset. Please normalize the training data so that each feature value after normalization lies between 0 and 1, and give the training data after normalization. Hint: the training data may not necessarily be the entire dataset.
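A sketch of min-max normalization as it would apply here. Treating the labeled samples (S1, S2, S4, S7) as the training set follows the hint but is my reading, not a statement from the exam.

```python
import numpy as np

# Min-max normalization: x' = (x - min) / (max - min), with min and max
# computed per feature over the training rows only. Using the labeled
# samples S1, S2, S4, S7 as the training set is an assumption based on
# the exam's hint.
train = np.array([[5, 0.2, 800],    # S1
                  [8, 0.3, 300],    # S2
                  [6, 0.5, 150],    # S4
                  [1, 0.4, 750]])   # S7

mins, maxs = train.min(axis=0), train.max(axis=0)
train_norm = (train - mins) / (maxs - mins)
print(train_norm)
```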
5. (10 points) ROC curves and AUC

You are given a dataset that contains five data samples, each of which is represented by one feature. The feature values and the corresponding labels are given in the table below. Please draw the ROC curve for this dataset and calculate the AUC. (Hint: there is no need to smooth the ROC curve.)

Sample index  Feature value  Label
S1            0.8            0
S2            0.8            1
S3            0.9            1
S4            0.9            0
S5            0.6            1
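A hedged numeric sketch of how the curve can be traced by sweeping a threshold over the feature values, with the AUC obtained by the trapezoidal rule. Treating a higher feature value as more positive is an assumption about the intended scoring direction.

```python
import numpy as np

# Trace the ROC curve by sweeping a decision threshold over the unique
# scores (predict positive when score >= threshold), then integrate with
# the trapezoidal rule. "Higher score = more likely positive" is assumed.
scores = np.array([0.8, 0.8, 0.9, 0.9, 0.6])
labels = np.array([0, 1, 1, 0, 1])

thresholds = np.r_[np.inf, np.unique(scores)[::-1]]
tpr = [np.mean(scores[labels == 1] >= t) for t in thresholds]
fpr = [np.mean(scores[labels == 0] >= t) for t in thresholds]

auc = sum((f2 - f1) * (t1 + t2) / 2
          for f1, f2, t1, t2 in zip(fpr, fpr[1:], tpr, tpr[1:]))
print(list(zip(fpr, tpr)), auc)
```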
6. (12 points) Counting (This question has THREE subquestions)

Suppose you want to train a classification model, P(X|Y), where X is a feature vector of length n (n features) and Y is the class label. Assume that each feature has two possible discrete values and there are three possible classes.

a. (4 points) How many independent parameters do you need to train in order to directly learn P(X|Y)?
b. (4 points) If we use Naïve Bayes to model P(X|Y), what assumption do you need to make?
c. (4 points) If we use Naïve Bayes and your assumption holds, how many independent parameters do you need to learn?
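For reference, one standard counting argument sketched in LaTeX; this is a common decomposition, not necessarily the exam's intended answer key.

```latex
% Counting sketch: n binary features, 3 classes. Compare the full
% conditional table against the Naive Bayes factorization.
\begin{align*}
\text{Direct: } & P(X \mid Y = y) \text{ is a distribution over } 2^n
  \text{ feature vectors} \\
  & \Rightarrow (2^n - 1) \text{ free parameters per class, i.e. }
  3(2^n - 1) \text{ in total.} \\[4pt]
\text{Naive Bayes: } & P(X \mid Y) = \prod_{i=1}^{n} P(X_i \mid Y)
  \quad \text{(features conditionally independent given } Y\text{)} \\
  & \Rightarrow 1 \text{ free parameter per binary } X_i
  \text{ per class, i.e. } 3n \text{ in total.}
\end{align*}
```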
7. (8 points) Maximum likelihood estimation

In DNA, also known as the Code of Life, there are four possible bases: adenine (abbreviated A), cytosine (C), guanine (G), and thymine (T). You are given an organism with an unknown set of DNA base frequencies. Let p_A, p_C, p_G, and p_T be those unknown frequencies. Assume that you obtain a strand of DNA and want to infer the unknown frequencies. Let n_A, n_C, n_G, and n_T be the corresponding number of bases that you observe for A, C, G, and T. Please infer the maximum likelihood estimates of the unknown parameters p_A, p_C, p_G, and p_T.
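A compact sketch of the standard multinomial MLE derivation via a Lagrange multiplier; other routes (e.g., substituting the constraint directly into the log-likelihood) lead to the same estimates.

```latex
% Multinomial log-likelihood with constraint sum_b p_b = 1, b in {A,C,G,T}.
\begin{align*}
\ell(p) &= \sum_{b} n_b \log p_b, \qquad
L(p, \lambda) = \ell(p) + \lambda \Big( 1 - \sum_b p_b \Big) \\
\frac{\partial L}{\partial p_b} &= \frac{n_b}{p_b} - \lambda = 0
  \;\Rightarrow\; p_b = \frac{n_b}{\lambda}, \qquad
\sum_b p_b = 1 \;\Rightarrow\; \lambda = \sum_b n_b \\
\hat{p}_b &= \frac{n_b}{n_A + n_C + n_G + n_T}
\end{align*}
```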
8. (12 points) Feature Selection (This question has TWO subquestions)

a. (6 points) What is the main difference between filter methods and wrapper methods for feature selection?
b. (6 points) List the advantages and disadvantages of filter methods and wrapper methods for feature selection.
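For context, a minimal sketch contrasting the two families. The correlation score and the greedy forward search below are illustrative choices, and `evaluate` is a hypothetical callback; neither is the only instance of its family.

```python
import numpy as np

# Filter: score each feature independently of any model (here, absolute
# correlation with the label) and keep the top-k.
def filter_select(X, y, k):
    scores = [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])]
    return np.argsort(scores)[::-1][:k]

# Wrapper: search feature subsets, scoring each by actually training and
# evaluating a model. `evaluate(subset)` is a hypothetical callback that
# returns, e.g., cross-validated accuracy; greedy forward selection shown.
def wrapper_select(n_features, k, evaluate):
    chosen = []
    while len(chosen) < k:
        best = max((j for j in range(n_features) if j not in chosen),
                   key=lambda j: evaluate(chosen + [j]))
        chosen.append(best)
    return chosen
```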
9. (10 points) A* Search

Please use A* search to solve the following problem, where S is the starting node and G is the goal node. The heuristic function values for each node and the edge weights are known. At each step, specify: the nodes that have been expanded; the nodes in the queue; the node selected to be expanded next; and the evaluation value, i.e., f, for this selected node.

Step  Nodes expanded  Nodes in queue  Next node to expand  f for the next node
1     None            S               S                    10

Please fill in the table by running the standard A* search until the algorithm terminates. Then list the final path from S to G selected by the A* search and the final cost of the path below:
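A hedged sketch of the standard A* loop that the table traces. The `graph` and `h` arguments are placeholders standing in for the exam's edge weights and heuristic values, which are not reproduced above.

```python
import heapq

# Standard A*: repeatedly expand the frontier node with smallest f = g + h.
# `graph[u]` maps to {neighbor: edge_weight}; `h[u]` is the heuristic value.
# Both are hypothetical stand-ins for the exam's figure.
def a_star(graph, h, start, goal):
    frontier = [(h[start], 0, start, [start])]    # entries: (f, g, node, path)
    expanded = set()
    while frontier:
        f, g, node, path = heapq.heappop(frontier)
        if node == goal:
            return path, g                        # final path and its cost
        if node in expanded:
            continue
        expanded.add(node)                        # "Nodes expanded" column
        for nbr, w in graph[node].items():
            if nbr not in expanded:               # "Nodes in queue" column
                heapq.heappush(frontier, (g + w + h[nbr], g + w, nbr, path + [nbr]))
    return None, float("inf")
```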
10. (10 points) Decision tree

Consider the following training data and the following decision tree learned from this data using the ID3 algorithm (without any post-pruning). The last column is the class label. Show that the choice of the Wind attribute at the second level of the tree is correct, by showing that its information gain is superior to that of the alternative choices. Information gain is defined as

    Gain = Entropy(parent) - sum_i (n_i / n) * Entropy(child i)
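Since the exam's training table is not reproduced here, a generic sketch of the information-gain computation; the PlayTennis-style row layout in the usage comment is an assumption about the omitted data.

```python
from collections import Counter
from math import log2

# Information gain of splitting `rows` (last column = class label) on the
# attribute at index `attr`: parent entropy minus the weighted average
# entropy of the children produced by the split.
def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, attr):
    parent = entropy([r[-1] for r in rows])
    children = Counter(r[attr] for r in rows)
    weighted = sum((cnt / len(rows)) *
                   entropy([r[-1] for r in rows if r[attr] == v])
                   for v, cnt in children.items())
    return parent - weighted

# Hypothetical usage: rows = [("Sunny", "Weak", "No"), ...]; compare
# info_gain(rows, attr) for Wind against the other candidate attributes.
```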