Fall 2018 CSE 482 Big Data Analysis: Exam 1 Total: 36 (+3 bonus points)


Name: ________________________

This exam is open book and notes. You may use a calculator, but no laptops, cell phones, or other electronic devices are allowed.

1. [5 points] Consider the following tables named Student, Transcript, and Course:

   Student
   ID  Name        Status
   1   Mary Doe    Senior
   2   Tom Thumb   Senior
   3   Bill Brown  Junior

   Transcript
   StudentID  CourseID  Semester  Year  Grade
   1          1         Spring    ...   ...
   1          2         Fall      ...   ...
   2          1         Spring    ...   ...
   ...        ...       Fall      ...   ...

   Course
   ID  Name      Credits
   1   CSE 480   3
   2   CSE 482   ...
   3   CSE ...   ...

(a) Write SQL to create the schema for the Course table. Assume ID and Credits are integers, and Name is a string with a maximum length of 10 characters. Make sure you specify ID as the primary key.

    CREATE TABLE Course (
        ID INTEGER,
        Name VARCHAR(10),
        Credits INTEGER,
        PRIMARY KEY (ID)
    );

(b) Write an SQL statement to insert the first row as seen above (i.e., insert the row having ID=1, Name='CSE 480', and Credits=3).

    INSERT INTO Course VALUES (1, 'CSE 480', 3);

(c) Write an SQL query that returns the IDs of students who took CSE 482 in the Spring 2016 semester and had at least a 3.5 GPA (i.e., GPA >= 3.5).

    SELECT StudentID
    FROM Transcript, Course
    WHERE CourseID = ID
      AND Grade >= 3.5
      AND Name = 'CSE 482'
      AND Semester = 'Spring'
      AND Year = 2016;
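
As a quick sanity check of the answers above, here is a minimal sketch using Python's built-in sqlite3 module. The Transcript schema and the rows inserted below are assumptions for illustration, since the transcript values in the exam tables are incomplete.

    import sqlite3

    # Minimal sketch: run the CREATE/INSERT/SELECT answers from question 1
    # against an in-memory SQLite database. The Transcript row inserted
    # below is hypothetical, chosen only so the query returns something.
    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()

    cur.execute("""CREATE TABLE Course (
                       ID INTEGER,
                       Name VARCHAR(10),
                       Credits INTEGER,
                       PRIMARY KEY (ID))""")
    cur.execute("""CREATE TABLE Transcript (
                       StudentID INTEGER, CourseID INTEGER,
                       Semester VARCHAR(10), Year INTEGER, Grade REAL)""")

    cur.execute("INSERT INTO Course VALUES (1, 'CSE 480', 3)")
    cur.execute("INSERT INTO Course VALUES (2, 'CSE 482', 3)")  # assumed row
    cur.execute("INSERT INTO Transcript VALUES (1, 2, 'Spring', 2016, 3.8)")

    cur.execute("""SELECT StudentID
                   FROM Transcript, Course
                   WHERE CourseID = ID AND Grade >= 3.5
                     AND Name = 'CSE 482' AND Semester = 'Spring'
                     AND Year = 2016""")
    print(cur.fetchall())  # [(1,)]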

2. [6 points] Consider the table below:

   User ID  Hours Spent Online  Class
   ...      ...                 ...

Discretize the hours-spent attribute using the equal-width, equal-frequency, and entropy-based approaches (with the number of bins equal to 3). Show the bin number for each user ID in the table given below. Assume bin #1 covers the lowest range of values, followed by bin #2, then bin #3. Fill in each entry in the table below with bin number #1, #2, or #3.

   User ID  Equal width  Equal frequency  Entropy-based
   ...      ...          ...              ...
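
For reference, a small sketch of the equal-width and equal-frequency rules with 3 bins. The hours values are hypothetical, since the data column above is blank in this transcription.

    import numpy as np

    # Hypothetical hours-spent-online values for 9 users.
    hours = np.array([1, 2, 3, 5, 8, 13, 21, 34, 55])

    # Equal width: split [min, max] into 3 intervals of equal length.
    edges_width = np.linspace(hours.min(), hours.max(), 4)
    bins_width = np.digitize(hours, edges_width[1:-1], right=True) + 1

    # Equal frequency: split at the 1/3 and 2/3 quantiles so each bin
    # holds (roughly) the same number of points.
    edges_freq = np.quantile(hours, [1/3, 2/3])
    bins_freq = np.digitize(hours, edges_freq, right=True) + 1

    print(bins_width)  # [1 1 1 1 1 1 2 2 3]
    print(bins_freq)   # [1 1 1 2 2 2 3 3 3]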

3. [5 points] Consider the Misra-Gries algorithm for finding frequent items in a data stream.

(a) Assume the stream consists of the following sequence of Twitter hashtags arriving one after another:

   #cdc, #cdc, #music, #movie, #cdc, #music, #sport, #cdc, #cdc, #music, #cdc

Suppose the size of the buffer used by the Misra-Gries algorithm is equal to 3. Show the content of the buffer after processing the first ten hashtags of the sequence. Make sure you list the items along with their frequency values.

The content of the buffer after processing the first ten hashtags is shown below (when #sport arrives with all three buffer slots occupied, every counter is decremented by 1 and #movie is evicted):

   #cdc    4
   #music  2

(b) The following table shows the frequencies of Twitter hashtags that appear in 400 streaming tweets from CDC (assuming each tweet has only 1 hashtag).

   Hashtag    Actual Frequency
   #cdc       100
   #health    93
   #zika      88
   #cancer    78
   #diabetes  20
   #obesity   17
   #listeria  3
   #ecoli     1

What is the minimum number of buffers needed to ensure that the hashtag #cdc is kept in one of the buffers after processing the 400 tweets?

All items whose frequency is larger than m/k, where m is the number of streaming tweets and k - 1 is the buffer size, are guaranteed to be kept in the buffer. Since the frequency of #cdc is 100, we need m/k = 400/k < 100; k = 4 gives 400/4 = 100, which is not strictly less than 100, so we need k = 5. Therefore, the buffer size needed is k - 1 = 4.
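
The update rule in part (a) is easy to check mechanically. Below is a short sketch of the Misra-Gries update (at most buffer_size counters; when a new item arrives with the buffer full, decrement every counter and evict the zeros):

    from collections import Counter

    def misra_gries(stream, buffer_size):
        # Misra-Gries frequent-items sketch with at most buffer_size counters.
        counters = Counter()
        for item in stream:
            if item in counters:
                counters[item] += 1
            elif len(counters) < buffer_size:
                counters[item] = 1
            else:
                # Buffer full and item not tracked: decrement every counter
                # and evict any counter that reaches zero.
                for key in list(counters):
                    counters[key] -= 1
                    if counters[key] == 0:
                        del counters[key]
        return counters

    stream = ["#cdc", "#cdc", "#music", "#movie", "#cdc", "#music",
              "#sport", "#cdc", "#cdc", "#music", "#cdc"]
    print(misra_gries(stream[:10], buffer_size=3))
    # Counter({'#cdc': 4, '#music': 2})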

4. [6 points] Consider the following training set for classifying patients.

   Weight Loss  Diarrhea  #Patients with Disease  #Patients without Disease
   Yes          Yes       ...                     ...
   No           Yes       ...                     ...
   No           No        5                       25
   Yes          No        ...                     ...

Suppose we would like to construct a 1-level decision tree from the training data above. Assume the tree is constructed using the gini index as the impurity measure.

Figure 1: Candidate 1-level decision trees for patient classification.

(a) Calculate the overall gini index using Weight Loss as the splitting attribute. The class distribution of the data is summarized below:

   Weight Loss  With Disease  Without Disease
   Yes          ...           ...
   No           ...           ...

   Gini(Weight Loss = Yes) = 1 - (.../...)^2 - (.../...)^2 = ...
   Gini(Weight Loss = No)  = 1 - (.../...)^2 - (.../...)^2 = ...
   Overall Gini = weighted average of the two branch values = ...

(b) Calculate the overall gini index using Diarrhea as the splitting attribute. The class distribution of the data is summarized below:

   Diarrhea  With Disease  Without Disease
   Yes       ...           ...
   No        ...           ...

   Gini(Diarrhea = Yes) = 1 - (.../...)^2 - (.../...)^2 = ...
   Gini(Diarrhea = No)  = 1 - (.../...)^2 - (.../...)^2 = ...
   Overall Gini = weighted average of the two branch values = ...

(c) Which tree, (a) Weight Loss or (b) Diarrhea, has the lower gini index? Calculate the training error of the tree with the lower gini index.

Tree (a), Weight Loss, has the lower gini index. Its training error is 40/110 = 36.4%.
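
Once the contingency counts are known, the computations in (a) and (b) reduce to a few lines. Here is a sketch with hypothetical branch counts, since the exam's numbers are blank above:

    def gini(counts):
        # Gini impurity of a node: 1 - sum of squared class proportions.
        n = sum(counts)
        return 1.0 - sum((c / n) ** 2 for c in counts)

    def overall_gini(branches):
        # Weighted average of branch impurities for a 1-level split.
        total = sum(sum(b) for b in branches)
        return sum(sum(b) / total * gini(b) for b in branches)

    # Hypothetical (with disease, without disease) counts per branch.
    weight_loss = [(40, 20), (10, 40)]      # branch Yes, branch No
    print(gini(weight_loss[0]))             # 0.444...
    print(gini(weight_loss[1]))             # 0.32
    print(overall_gini(weight_loss))        # about 0.388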

5. [6 points]

Figure 2: Performance on the train and test sets for decision trees of varying depth used for classification. The x-axis is the depth of the tree, ranging over {2, 3, ..., 9, 10, 15, 20, ..., 50}, and the y-axis is the accuracy.

Figure 3: The 2D real-valued feature points (i.e., their x and y coordinates), where x and o are the two class labels and the triangle is the point we are looking to predict using the nearest-neighbor method.

(a) Which of the following are TRUE statements in relation to Figure 2? Circle ALL the correct answers.

   i.   Using a decision tree with depth 20 has desirable performance.
   ii.  ***Using a decision tree with depth 2 is underfitting.***
   iii. Using a decision tree with depth 4 is overfitting.
   iv.  ***Using a decision tree with depth 50 is overfitting.***

   Note: The answers are marked with ***s above.

(b) True or False: Accuracy is not a good measure for model evaluation if the classes are highly imbalanced. True

(c) True or False: The gini index attains its best value (i.e., the one you would choose for a split when creating a new branch of the decision tree) when all the data instances in the leaves have the same class. True

(d) What is the predicted class for the triangle in Figure 3 when the number of neighbors is set to k = 1, 3, and 5 (using Euclidean distance)?

   k = 1: o
   k = 3: o
   k = 5: x
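
A minimal majority-vote k-nearest-neighbor sketch. Since Figure 3 is not reproduced here, the coordinates below are hypothetical, arranged so that the three predictions match the answer above:

    import math
    from collections import Counter

    def knn_predict(points, labels, query, k):
        # Majority vote among the k nearest neighbors (Euclidean distance).
        order = sorted(range(len(points)),
                       key=lambda i: math.dist(points[i], query))
        votes = Counter(labels[i] for i in order[:k])
        return votes.most_common(1)[0][0]

    # Hypothetical layout loosely mimicking Figure 3: the nearest point is
    # an 'o', two of the three nearest are 'o', but three of the five
    # nearest are 'x'.
    points = [(1.0, 1.0), (1.5, 1.2), (2.0, 2.0), (2.2, 1.9), (2.4, 2.1)]
    labels = ["o", "o", "x", "x", "x"]
    query = (1.2, 1.1)

    for k in (1, 3, 5):
        print(k, knn_predict(points, labels, query, k))
    # 1 o / 3 o / 5 x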

6. [5 points]

(a) Given the following support information:

   Itemset               Support
   {Butter}              45%
   {Bread}               50%
   {Tea}                 35%
   {Bread, Butter}       40%
   {Bread, Tea}          25%
   {Butter, Tea}         30%
   {Bread, Butter, Tea}  5%

Determine the confidence of the following 3 rules:

   i.   {Bread} -> {Butter}:      0.4/0.5 = 0.8
   ii.  {Bread, Tea} -> {Butter}: 0.05/0.25 = 0.2
   iii. {Butter, Tea} -> {Bread}: 0.05/0.3 = 0.167

(b) Please circle ALL true statements, given the following: the confidence of the rule {A} -> {C} is 30%, {A} -> {B} is 100%, and {B} -> {A} is 100%.

   i.   The confidence of the rule {A,B} -> {C} must be strictly less than 30%.
   ii.  ***The confidence of the rule {A,B} -> {C} must be equal to 30%.***
   iii. ***The confidence of {B} -> {C} must be equal to 30%.***
   iv.  From the above we can infer nothing about the confidence of the rule {A,B} -> {C}.

   Note: The answers are marked with ***s above. Since the confidences of {A} -> {B} and {B} -> {A} are both 100%, A and B occur in exactly the same transactions, so {A}, {B}, and {A,B} all have the same support and both marked rules have the same confidence as {A} -> {C}.
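
The confidences in part (a) follow directly from conf(X -> Y) = supp(X u Y) / supp(X); a quick sketch using the support table above:

    # Rule confidence from the support table in question 6(a).
    support = {
        frozenset({"Butter"}): 0.45,
        frozenset({"Bread"}): 0.50,
        frozenset({"Tea"}): 0.35,
        frozenset({"Bread", "Butter"}): 0.40,
        frozenset({"Bread", "Tea"}): 0.25,
        frozenset({"Butter", "Tea"}): 0.30,
        frozenset({"Bread", "Butter", "Tea"}): 0.05,
    }

    def confidence(lhs, rhs):
        # conf(lhs -> rhs) = supp(lhs union rhs) / supp(lhs)
        lhs, rhs = frozenset(lhs), frozenset(rhs)
        return support[lhs | rhs] / support[lhs]

    print(confidence({"Bread"}, {"Butter"}))         # 0.8
    print(confidence({"Bread", "Tea"}, {"Butter"}))  # 0.2
    print(confidence({"Butter", "Tea"}, {"Bread"}))  # about 0.167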

7. [6 points]

(a) Suppose you are given the following dataset, which contains 8 dense groups of points (assume each group has the same number of points). Suppose you apply 2 different clustering methods to partition the dataset into two clusters. The first solution (shown in (a)) partitions the data into top and bottom halves, while the second solution (shown in (b)) partitions the data into left and right halves. Which of the two clustering solutions, (a) or (b), has the lower sum-of-squared error (SSE)?

Solution (b) has the lower SSE, since the distance from each point to its nearest centroid is, on average, smaller than in solution (a).

(b) In the figure below, assume hierarchical clustering has merged the 15 data points until there are 3 remaining clusters.

Figure 4: Intermediate clustering results.

Suppose the 3 clusters were obtained using the complete-link (MAX) method, resulting in the following 3 x 3 distance matrix (the rows and columns are sorted in increasing order of their cluster IDs):

         C1   C2   C3
   C1  [ 0    ...  ... ]
   C2  [ ...  0    ... ]
   C3  [ ...  ...  0   ]

Which two clusters will be merged to produce 2 clusters? Show the resulting 2 x 2 distance matrix after the merging.

Clusters 2 and 3 will be merged, giving:

             C1   C{2,3}
   C1      [ 0    ...    ]
   C{2,3}  [ ...  0      ]
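
For reference, one complete-link merge step can be computed as below. The matrix entries are hypothetical stand-ins for the lost values, chosen so that clusters 2 and 3 are the closest pair:

    import numpy as np

    # One complete-link (MAX) merge step on a 3x3 cluster-distance matrix.
    # D[i, j] is the distance between clusters i+1 and j+1.
    D = np.array([[0.0, 0.8, 0.9],
                  [0.8, 0.0, 0.3],
                  [0.9, 0.3, 0.0]])

    # Merge the closest pair: d(2,3) = 0.3 is the smallest off-diagonal
    # entry, so clusters 2 and 3 merge. Under complete link, the distance
    # from the merged cluster {2,3} to cluster 1 is max(d(1,2), d(1,3)).
    d_new = max(D[0, 1], D[0, 2])
    D2 = np.array([[0.0, d_new],
                   [d_new, 0.0]])
    print(D2)  # [[0.  0.9] [0.9 0. ]]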

(c) Based on the dendrogram below, if we wanted to obtain 3 clusters, which points would be grouped together? Example answer format: {1,2,3,4}, {5}, {6}, meaning the first four points form one cluster while points 5 and 6 each define their own cluster, which together compose the 3 clusters for the data.

Figure 5: Dendrogram for 6 data points.

{1,3}, {2,4,5}, and {6}
