Fall 2018 CSE 482 Big Data Analysis: Exam 1 Total: 36 (+3 bonus points)


Name: ________________________

This exam is open book and notes. You may use a calculator, but no laptops, cell phones, or other electronic devices are allowed.

1. [5 points] Consider the following tables named Student, Transcript, and Course:

   Student
   ID  Name        Status
   1   Mary Doe    Senior
   2   Tom Thumb   Senior
   3   Bill Brown  Junior

   Transcript
   StudentID  CourseID  Semester  Year  Grade
   1          1         Spring    ...   ...
   1          2         Fall      ...   ...
   2          1         Spring    ...   ...
   ...        ...       Fall      ...   ...

   Course
   ID  Name      Credits
   1   CSE 480   3
   2   CSE 482   ...
   3   CSE ...   ...

(a) Write SQL to create the schema for the Course table. Assume ID and Credits are integers, and Name is a string with a maximum length of 10 characters. Make sure you specify ID as the primary key.

    CREATE TABLE Course (
        ID INTEGER,
        Name VARCHAR(10),
        Credits INTEGER,
        PRIMARY KEY (ID)
    );

(b) Write an SQL statement to insert the first row as seen above (i.e., insert the row having ID=1, Name='CSE 480', and Credits=3).

    INSERT INTO Course VALUES (1, 'CSE 480', 3);

(c) Write an SQL query that returns the IDs of students who took CSE 482 in the Spring 2016 semester and had at least a 3.5 GPA (i.e., GPA >= 3.5).

    SELECT StudentID
    FROM Transcript, Course
    WHERE CourseID = ID
      AND Grade >= 3.5
      AND Name = 'CSE 482'
      AND Semester = 'Spring'
      AND Year = 2016;
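
As a quick sanity check of the answers above, here is a minimal sketch using Python's built-in sqlite3 module. The Transcript schema and the rows inserted below are assumptions for illustration, since the transcript values in the exam tables are incomplete.

    import sqlite3

    # Minimal sketch: run the CREATE/INSERT/SELECT answers from question 1
    # against an in-memory SQLite database. The Transcript row inserted
    # below is hypothetical, chosen only so the query returns something.
    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()

    cur.execute("""CREATE TABLE Course (
                       ID INTEGER,
                       Name VARCHAR(10),
                       Credits INTEGER,
                       PRIMARY KEY (ID))""")
    cur.execute("""CREATE TABLE Transcript (
                       StudentID INTEGER, CourseID INTEGER,
                       Semester VARCHAR(10), Year INTEGER, Grade REAL)""")

    cur.execute("INSERT INTO Course VALUES (1, 'CSE 480', 3)")
    cur.execute("INSERT INTO Course VALUES (2, 'CSE 482', 3)")  # assumed row
    cur.execute("INSERT INTO Transcript VALUES (1, 2, 'Spring', 2016, 3.8)")

    cur.execute("""SELECT StudentID
                   FROM Transcript, Course
                   WHERE CourseID = ID AND Grade >= 3.5
                     AND Name = 'CSE 482' AND Semester = 'Spring'
                     AND Year = 2016""")
    print(cur.fetchall())  # [(1,)]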

2. [6 points] Consider the table below:

   User ID  Hours Spent Online  Class
   ...      ...                 ...

Discretize the hours-spent attribute using the equal-width, equal-frequency, and entropy-based approaches (with the number of bins equal to 3). Show the bin number for each user ID in the table given below. Assume bin #1 covers the lowest range of values, followed by bin #2, then bin #3. Fill in each entry in the table below with bin number #1, #2, or #3.

   User ID  Equal width  Equal frequency  Entropy-based
   ...      ...          ...              ...
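
For reference, a small sketch of the equal-width and equal-frequency rules with 3 bins. The hours values are hypothetical, since the data column above is blank in this transcription.

    import numpy as np

    # Hypothetical hours-spent-online values for 9 users.
    hours = np.array([1, 2, 3, 5, 8, 13, 21, 34, 55])

    # Equal width: split [min, max] into 3 intervals of equal length.
    edges_width = np.linspace(hours.min(), hours.max(), 4)
    bins_width = np.digitize(hours, edges_width[1:-1], right=True) + 1

    # Equal frequency: split at the 1/3 and 2/3 quantiles so each bin
    # holds (roughly) the same number of points.
    edges_freq = np.quantile(hours, [1/3, 2/3])
    bins_freq = np.digitize(hours, edges_freq, right=True) + 1

    print(bins_width)  # [1 1 1 1 1 1 2 2 3]
    print(bins_freq)   # [1 1 1 2 2 2 3 3 3]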

3. [5 points] Consider the Misra-Gries algorithm for finding frequent items in a data stream.

(a) Assume the stream consists of the following sequence of Twitter hashtags arriving one after another:

   #cdc, #cdc, #music, #movie, #cdc, #music, #sport, #cdc, #cdc, #music, #cdc

Suppose the size of the buffer used by the Misra-Gries algorithm is equal to 3. Show the content of the buffer after processing the first ten hashtags of the sequence. Make sure you list the items along with their frequency values.

The content of the buffer after processing the first ten hashtags is shown below (when #sport arrives with all three buffer slots occupied, every counter is decremented by 1 and #movie is evicted):

   #cdc    4
   #music  2

(b) The following table shows the frequencies of Twitter hashtags that appear in 400 streaming tweets from CDC (assuming each tweet has only 1 hashtag).

   Hashtag    Actual Frequency
   #cdc       100
   #health    93
   #zika      88
   #cancer    78
   #diabetes  20
   #obesity   17
   #listeria  3
   #ecoli     1

What is the minimum number of buffers needed to ensure that the hashtag #cdc is kept in one of the buffers after processing the 400 tweets?

All items whose frequency is larger than m/k, where m is the number of streaming tweets and k - 1 is the buffer size, are guaranteed to be kept in the buffer. Since the frequency of #cdc is 100, we need m/k = 400/k < 100; k = 4 gives 400/4 = 100, which is not strictly less than 100, so we need k = 5. Therefore, the buffer size needed is k - 1 = 4.
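
The update rule in part (a) is easy to check mechanically. Below is a short sketch of the Misra-Gries update (at most buffer_size counters; when a new item arrives with the buffer full, decrement every counter and evict the zeros):

    from collections import Counter

    def misra_gries(stream, buffer_size):
        # Misra-Gries frequent-items sketch with at most buffer_size counters.
        counters = Counter()
        for item in stream:
            if item in counters:
                counters[item] += 1
            elif len(counters) < buffer_size:
                counters[item] = 1
            else:
                # Buffer full and item not tracked: decrement every counter
                # and evict any counter that reaches zero.
                for key in list(counters):
                    counters[key] -= 1
                    if counters[key] == 0:
                        del counters[key]
        return counters

    stream = ["#cdc", "#cdc", "#music", "#movie", "#cdc", "#music",
              "#sport", "#cdc", "#cdc", "#music", "#cdc"]
    print(misra_gries(stream[:10], buffer_size=3))
    # Counter({'#cdc': 4, '#music': 2})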

4. [6 points] Consider the following training set for classifying patients.

   Weight Loss  Diarrhea  #Patients with Disease  #Patients without Disease
   Yes          Yes       ...                     ...
   No           Yes       ...                     ...
   No           No        5                       25
   Yes          No        ...                     ...

Suppose we would like to construct a 1-level decision tree from the training data above. Assume the tree is constructed using the gini index as the impurity measure.

Figure 1: Candidate 1-level decision trees for patient classification.

(a) Calculate the overall gini index using Weight Loss as the splitting attribute. The class distribution of the data is summarized below:

   Weight Loss  With Disease  Without Disease
   Yes          ...           ...
   No           ...           ...

   Gini(Weight Loss = Yes) = 1 - (.../...)^2 - (.../...)^2 = ...
   Gini(Weight Loss = No)  = 1 - (.../...)^2 - (.../...)^2 = ...
   Overall Gini = weighted average of the two branch values = ...

(b) Calculate the overall gini index using Diarrhea as the splitting attribute. The class distribution of the data is summarized below:

   Diarrhea  With Disease  Without Disease
   Yes       ...           ...
   No        ...           ...

   Gini(Diarrhea = Yes) = 1 - (.../...)^2 - (.../...)^2 = ...
   Gini(Diarrhea = No)  = 1 - (.../...)^2 - (.../...)^2 = ...
   Overall Gini = weighted average of the two branch values = ...

(c) Which tree, (a) Weight Loss or (b) Diarrhea, has the lower gini index? Calculate the training error of the tree with the lower gini index.

Tree (a), Weight Loss, has the lower gini index. Its training error is 40/110 = 36.4%.
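
Once the contingency counts are known, the computations in (a) and (b) reduce to a few lines. Here is a sketch with hypothetical branch counts, since the exam's numbers are blank above:

    def gini(counts):
        # Gini impurity of a node: 1 - sum of squared class proportions.
        n = sum(counts)
        return 1.0 - sum((c / n) ** 2 for c in counts)

    def overall_gini(branches):
        # Weighted average of branch impurities for a 1-level split.
        total = sum(sum(b) for b in branches)
        return sum(sum(b) / total * gini(b) for b in branches)

    # Hypothetical (with disease, without disease) counts per branch.
    weight_loss = [(40, 20), (10, 40)]      # branch Yes, branch No
    print(gini(weight_loss[0]))             # 0.444...
    print(gini(weight_loss[1]))             # 0.32
    print(overall_gini(weight_loss))        # about 0.388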

5. [6 points]

Figure 2: Performance on the train and test sets for decision trees of varying depth used for classification. The x-axis is the depth of the tree, ranging over {2, 3, ..., 9, 10, 15, 20, ..., 50}, and the y-axis is the accuracy.

Figure 3: The 2D real-valued feature points (i.e., their x and y coordinates), where x and o are the two class labels and the triangle is the point we are looking to predict using the nearest-neighbor method.

(a) Which of the following are TRUE statements in relation to Figure 2? Circle ALL the correct answers.

   i.   Using a decision tree with depth 20 has desirable performance.
   ii.  ***Using a decision tree with depth 2 is underfitting.***
   iii. Using a decision tree with depth 4 is overfitting.
   iv.  ***Using a decision tree with depth 50 is overfitting.***

   Note: The answers are marked with ***s above.

(b) True or False: Accuracy is not a good measure for model evaluation if the classes are highly imbalanced. True

(c) True or False: The gini index attains its best value (i.e., the one you would choose for a split when creating a new branch of the decision tree) when all the data instances in the leaves have the same class. True

(d) What is the predicted class for the triangle in Figure 3 when the number of neighbors is set to k = 1, 3, and 5 (using Euclidean distance)?

   k = 1: o
   k = 3: o
   k = 5: x
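
A minimal majority-vote k-nearest-neighbor sketch. Since Figure 3 is not reproduced here, the coordinates below are hypothetical, arranged so that the three predictions match the answer above:

    import math
    from collections import Counter

    def knn_predict(points, labels, query, k):
        # Majority vote among the k nearest neighbors (Euclidean distance).
        order = sorted(range(len(points)),
                       key=lambda i: math.dist(points[i], query))
        votes = Counter(labels[i] for i in order[:k])
        return votes.most_common(1)[0][0]

    # Hypothetical layout loosely mimicking Figure 3: the nearest point is
    # an 'o', two of the three nearest are 'o', but three of the five
    # nearest are 'x'.
    points = [(1.0, 1.0), (1.5, 1.2), (2.0, 2.0), (2.2, 1.9), (2.4, 2.1)]
    labels = ["o", "o", "x", "x", "x"]
    query = (1.2, 1.1)

    for k in (1, 3, 5):
        print(k, knn_predict(points, labels, query, k))
    # 1 o / 3 o / 5 x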

6. [5 points]

(a) Given the following support information:

   Itemset               Support
   {Butter}              45%
   {Bread}               50%
   {Tea}                 35%
   {Bread, Butter}       40%
   {Bread, Tea}          25%
   {Butter, Tea}         30%
   {Bread, Butter, Tea}  5%

Determine the confidence of the following 3 rules:

   i.   {Bread} -> {Butter}:      0.4/0.5 = 0.8
   ii.  {Bread, Tea} -> {Butter}: 0.05/0.25 = 0.2
   iii. {Butter, Tea} -> {Bread}: 0.05/0.3 = 0.167

(b) Please circle ALL true statements, given the following: the confidence of the rule {A} -> {C} is 30%, {A} -> {B} is 100%, and {B} -> {A} is 100%.

   i.   The confidence of the rule {A,B} -> {C} must be strictly less than 30%.
   ii.  ***The confidence of the rule {A,B} -> {C} must be equal to 30%.***
   iii. ***The confidence of {B} -> {C} must be equal to 30%.***
   iv.  From the above we can infer nothing about the confidence of the rule {A,B} -> {C}.

   Note: The answers are marked with ***s above. Since the confidences of {A} -> {B} and {B} -> {A} are both 100%, A and B occur in exactly the same transactions, so {A}, {B}, and {A,B} all have the same support and both marked rules have the same confidence as {A} -> {C}.
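
The confidences in part (a) follow directly from conf(X -> Y) = supp(X u Y) / supp(X); a quick sketch using the support table above:

    # Rule confidence from the support table in question 6(a).
    support = {
        frozenset({"Butter"}): 0.45,
        frozenset({"Bread"}): 0.50,
        frozenset({"Tea"}): 0.35,
        frozenset({"Bread", "Butter"}): 0.40,
        frozenset({"Bread", "Tea"}): 0.25,
        frozenset({"Butter", "Tea"}): 0.30,
        frozenset({"Bread", "Butter", "Tea"}): 0.05,
    }

    def confidence(lhs, rhs):
        # conf(lhs -> rhs) = supp(lhs union rhs) / supp(lhs)
        lhs, rhs = frozenset(lhs), frozenset(rhs)
        return support[lhs | rhs] / support[lhs]

    print(confidence({"Bread"}, {"Butter"}))         # 0.8
    print(confidence({"Bread", "Tea"}, {"Butter"}))  # 0.2
    print(confidence({"Butter", "Tea"}, {"Bread"}))  # about 0.167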

7. [6 points]

(a) Suppose you are given the following dataset, which contains 8 dense groups of points (assume each group has the same number of points). Suppose you apply 2 different clustering methods to partition the dataset into two clusters. The first solution (shown in (a)) partitions the data into top and bottom halves, while the second solution (shown in (b)) partitions the data into left and right halves. Which of the two clustering solutions, (a) or (b), has the lower sum-of-squared error (SSE)?

Solution (b) has the lower SSE, since the distance from each point to its nearest centroid is, on average, smaller than in solution (a).

(b) In the figure below, assume hierarchical clustering has merged the 15 data points until there are 3 remaining clusters.

Figure 4: Intermediate clustering results.

Suppose the 3 clusters were obtained using the complete-link (MAX) method, resulting in the following 3 x 3 distance matrix (the rows and columns are sorted in increasing order of their cluster IDs):

         C1   C2   C3
   C1  [ 0    ...  ... ]
   C2  [ ...  0    ... ]
   C3  [ ...  ...  0   ]

Which two clusters will be merged to produce 2 clusters? Show the resulting 2 x 2 distance matrix after the merging.

Clusters 2 and 3 will be merged, giving:

             C1   C{2,3}
   C1      [ 0    ...    ]
   C{2,3}  [ ...  0      ]
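
For reference, one complete-link merge step can be computed as below. The matrix entries are hypothetical stand-ins for the lost values, chosen so that clusters 2 and 3 are the closest pair:

    import numpy as np

    # One complete-link (MAX) merge step on a 3x3 cluster-distance matrix.
    # D[i, j] is the distance between clusters i+1 and j+1.
    D = np.array([[0.0, 0.8, 0.9],
                  [0.8, 0.0, 0.3],
                  [0.9, 0.3, 0.0]])

    # Merge the closest pair: d(2,3) = 0.3 is the smallest off-diagonal
    # entry, so clusters 2 and 3 merge. Under complete link, the distance
    # from the merged cluster {2,3} to cluster 1 is max(d(1,2), d(1,3)).
    d_new = max(D[0, 1], D[0, 2])
    D2 = np.array([[0.0, d_new],
                   [d_new, 0.0]])
    print(D2)  # [[0.  0.9] [0.9 0. ]]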

(c) Based on the dendrogram below, if we wanted to obtain 3 clusters, which points would be grouped together? Example answer format: {1,2,3,4}, {5}, {6}, meaning the first four points form one cluster while points 5 and 6 each define their own cluster, which together compose the 3 clusters for the data.

Figure 5: Dendrogram for 6 data points.

{1,3}, {2,4,5}, and {6}
