Fall 2018 CSE 482 Big Data Analysis: Exam 1 Total: 36 (+3 bonus points)
Name: ____________________

This exam is open book and notes. You can use a calculator, but no laptops, cell phones, or other electronic devices are allowed.

1. [5 points] Consider the following tables named Student, Transcript, and Course (some data values were lost in extraction and are shown as "…"):

Student
ID  Name        Status
1   Mary Doe    Senior
2   Tom Thumb   Senior
3   Bill Brown  Junior

Transcript
StudentID  CourseID  Semester  Year  Grade
1          1         Spring    …     …
1          2         Fall      …     …
2          1         Spring    …     …
…          …         Fall      …     …

Course
ID  Name     Credits
1   CSE 480  3
2   CSE 482  …
3   CSE …    …

(a) Write SQL to create the schema for the Course table. Assume ID and Credits are integers, and Name is a string with a maximum length of 10 characters. Make sure you specify ID as the primary key.

CREATE TABLE Course (
    ID INTEGER,
    Name VARCHAR(10),
    Credits INTEGER,
    PRIMARY KEY (ID)
);

(b) Write an SQL statement to insert the first row as seen above (i.e., insert the row having ID=1, Name='CSE 480', and Credits=3).

INSERT INTO Course VALUES (1, 'CSE 480', 3);

(c) Write an SQL query that returns the IDs of students who took CSE 482 in the Spring 2016 semester and had at least a 3.5 GPA (i.e., GPA ≥ 3.5).

SELECT StudentID
FROM Transcript, Course
WHERE CourseID = ID
  AND Grade >= 3.5
  AND Name = 'CSE 482'
  AND Semester = 'Spring'
  AND Year = 2016;
2. [6 points] Consider the table below:

[Table with columns ID, Hours Spent Online, and Class; the data values were not recovered.]

Discretize the hours-spent attribute using the equal-width, equal-frequency, and entropy-based approaches (with the number of bins equal to 3). Show the bin number for each user ID in the table given below. Assume bin #1 has the lowest range of values, followed by bin #2 and then bin #3. Fill in each entry in the table below with bin number #1, #2, or #3.

User ID  Equal width  Equal frequency  Entropy-based
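Although the original data column was not recovered, the mechanics of the first two methods can be illustrated with a short Python sketch on hypothetical hours values (equal width splits the value range into three equal-length intervals; equal frequency places cut points at the 1/3 and 2/3 quantiles; entropy-based discretization would additionally use the Class labels to choose cut points that minimize class entropy):

```python
import numpy as np

# Hypothetical hours values; the exam's actual column was lost in extraction.
hours = np.array([0.5, 1.0, 2.0, 2.5, 9.0, 10.0, 11.0, 28.0, 30.0])

# Equal width: split [min, max] into 3 intervals of equal length.
edges_w = np.linspace(hours.min(), hours.max(), 4)[1:-1]  # two interior cut points
bins_w = np.digitize(hours, edges_w, right=True) + 1       # bin numbers 1..3

# Equal frequency: cut points at the 1/3 and 2/3 quantiles, so each bin
# receives roughly the same number of records.
edges_f = np.quantile(hours, [1 / 3, 2 / 3])
bins_f = np.digitize(hours, edges_f, right=True) + 1

print(bins_w)  # [1 1 1 1 1 1 2 3 3]
print(bins_f)  # [1 1 1 2 2 2 3 3 3]
```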
3. [5 points] Consider the Misra-Gries algorithm for finding frequent items in a data stream.

(a) Assume the stream consists of the following sequence of Twitter hashtags arriving one after another:

#cdc, #cdc, #music, #movie, #cdc, #music, #sport, #cdc, #cdc, #music, #cdc

Suppose the size of the buffer used by the Misra-Gries algorithm is equal to 3. Show the content of the buffer after processing the first ten hashtags of the sequence. Make sure you list the items along with their frequency values.

When the seventh hashtag (#sport) arrives, the buffer already holds {#cdc: 3, #music: 2, #movie: 1} and #sport is not among them, so every counter is decremented and #movie (reaching 0) is dropped. The content of the buffer after processing the first ten hashtags is therefore:

#cdc 4
#music 2

(b) The following table shows the frequencies of Twitter hashtags that appear in 400 streaming tweets from CDC (assuming each tweet has only 1 hashtag).

Hashtag    Actual Frequency
#cdc       100
#health    93
#zika      88
#cancer    78
#diabetes  20
#obesity   17
#listeria  3
#ecoli     1

What is the minimum buffer size needed to ensure that the hashtag #cdc is kept in the buffer after processing the 400 tweets?

All items whose frequency is larger than m/k, where m is the number of streaming tweets and k − 1 is the buffer size, will be kept in the buffer. Since the frequency of #cdc is 100, we need m/k = 400/k < 100; with k = 4, 400/4 = 100 is not strictly less than 100, so we need k = 5. Therefore, the buffer size needed is k − 1 = 4.
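A minimal Python sketch of the Misra-Gries update rule (increment a matching counter, otherwise take a free buffer slot, otherwise decrement every counter) confirms the trace in part (a):

```python
from collections import Counter

def misra_gries(stream, k):
    """Misra-Gries summary with at most k counters (buffer slots)."""
    counters = Counter()
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k:
            counters[item] = 1
        else:
            # Buffer full and item absent: decrement every counter
            # and drop any counter that reaches zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return dict(counters)

tags = ["#cdc", "#cdc", "#music", "#movie", "#cdc", "#music",
        "#sport", "#cdc", "#cdc", "#music", "#cdc"]
print(misra_gries(tags[:10], 3))  # {'#cdc': 4, '#music': 2}
```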
4. [6 points] Consider the following training set for classifying patients.

[Table with columns Weight Loss, Diarrhea, #Patients with Disease, and #Patients without Disease; the counts were not recovered.]

Suppose we would like to construct a 1-level decision tree from the training data above. Assume the tree is constructed using the gini index as the impurity measure.

Figure 1: Candidate 1-level decision trees for patient classification. [figure not recovered]

(a) Calculate the overall gini index using Weight Loss as the splitting attribute. The distribution of the data is summarized below:

[2 × 2 table of (With Disease, Without Disease) counts for Weight Loss = Yes/No; values not recovered]

Gini(Weight Loss = Yes) = 1 − p(disease | Yes)² − p(no disease | Yes)² = …
Gini(Weight Loss = No) = 1 − p(disease | No)² − p(no disease | No)² = …
Overall Gini = (n_Yes/n) × Gini(Yes) + (n_No/n) × Gini(No) = …

(b) Calculate the overall gini index using Diarrhea as the splitting attribute. The distribution of the data is summarized below:

[2 × 2 table of (With Disease, Without Disease) counts for Diarrhea = Yes/No; values not recovered]

Gini(Diarrhea = Yes) = … , Gini(Diarrhea = No) = … , Overall Gini = …

(c) Which tree, (a) Weight Loss or (b) Diarrhea, has the lower gini index? Calculate the training error of the tree with the lower gini index.

Tree (a), Weight Loss, has the lower gini index. The training error is 40/110 = 36.4%.
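The arithmetic in (a) and (b) follows a fixed pattern even though the exam's counts were lost. Below is a minimal Python sketch; the [with disease, without disease] pairs are hypothetical placeholders, not the exam's numbers:

```python
def gini(counts):
    """Gini impurity of a node: 1 minus the sum of squared class proportions."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def overall_gini(children):
    """Weighted gini of a split: sum over children of (n_i / n) * gini_i."""
    n = sum(sum(child) for child in children)
    return sum(sum(child) / n * gini(child) for child in children)

# Hypothetical [with disease, without disease] counts per branch
# (Yes branch, No branch); the exam's actual values were lost.
weight_loss_split = [[30, 25], [20, 35]]
print(overall_gini(weight_loss_split))  # ~0.479
```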
5. [6 points]

Figure 2: Performance on train and test sets for decision trees of varying depth used for classification. The x-axis is the depth of the tree, ranging over {2, 3, ..., 9, 10, 15, 20, ..., 50}, and the y-axis is the accuracy. [figure not recovered]

Figure 3: The 2D real-valued feature rows (i.e., their x and y coordinates), where x and o are the two class labels and the triangle is the point we are looking to predict using the nearest-neighbor method. [figure not recovered]

(a) Which of the following are TRUE statements in relation to Figure 2? Circle ALL the correct answers.
i. Using a decision tree with depth 20 has desirable performance.
ii. ***Using a decision tree with depth 2 is underfitting.***
iii. Using a decision tree with depth 4 is overfitting.
iv. ***Using a decision tree with depth 50 is overfitting.***
Note: The correct answers are marked with ***s above.

(b) True or False: Accuracy is not a good measure for model evaluation if the classes are highly imbalanced. True

(c) True or False: The gini index attains its best value (i.e., the one you would choose for a split when creating a new branch in the decision tree) when all the data instances in the leaves have the same class. True
(d) What is the predicted class for the triangle in Figure 3 when the number of neighbors is set to k = 1, 3, and 5 (using Euclidean distance)?

k = 1 → o
k = 3 → o
k = 5 → x
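Figure 3's coordinates were lost, but the nearest-neighbor vote itself is mechanical. The sketch below uses hypothetical points, chosen so they reproduce the three answers above, and plain Euclidean distances:

```python
import numpy as np

# Hypothetical coordinates standing in for Figure 3 (the actual
# points were lost with the image).
points = np.array([[1.0, 0.0], [0.0, 1.5], [10.0, 0.0],   # class o
                   [2.0, 0.0], [0.0, 2.5], [3.0, 0.0]])   # class x
labels = np.array(["o", "o", "o", "x", "x", "x"])
query = np.array([0.0, 0.0])  # the triangle

for k in (1, 3, 5):
    # Indices of the k points closest to the query.
    nearest = np.argsort(np.linalg.norm(points - query, axis=1))[:k]
    # Majority vote among the k nearest labels.
    classes, votes = np.unique(labels[nearest], return_counts=True)
    print(f"k={k}: {classes[np.argmax(votes)]}")
# k=1: o, k=3: o, k=5: x
```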
6. [5 points] (a) Given the following support information:

Itemset               Support
{Butter}              45%
{Bread}               50%
{Tea}                 35%
{Bread, Butter}       40%
{Bread, Tea}          25%
{Butter, Tea}         30%
{Bread, Butter, Tea}  5%

Determine the confidence of the following 3 rules:
i. {Bread} → {Butter}: 0.4/0.5 = 0.8
ii. {Bread, Tea} → {Butter}: 0.05/0.25 = 0.2
iii. {Butter, Tea} → {Bread}: 0.05/0.3 ≈ 0.17

(b) Please circle ALL true statements, given that the confidence of the rule {A} → {C} is 30%, of {A} → {B} is 100%, and of {B} → {A} is 100%.

i. The confidence of the rule {A, B} → {C} must be strictly less than 30%.
ii. ***The confidence of the rule {A, B} → {C} must be equal to 30%.***
iii. ***The confidence of {B} → {C} must be equal to 30%.***
iv. From the above we can infer nothing about the confidence of the rule {A, B} → {C}.

Note: The correct answers are marked with ***s above. Since {A} → {B} and {B} → {A} both have 100% confidence, every transaction containing A also contains B and vice versa, so support(A) = support(B) = support(A, B); hence the confidences of {A, B} → {C} and {B} → {C} both equal that of {A} → {C}, i.e., 30%.
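The three confidences in (a) can be checked with a few lines of Python, using confidence(X → Y) = support(X ∪ Y) / support(X):

```python
# Support table from the question, as fractions.
support = {
    frozenset({"Butter"}): 0.45,
    frozenset({"Bread"}): 0.50,
    frozenset({"Tea"}): 0.35,
    frozenset({"Bread", "Butter"}): 0.40,
    frozenset({"Bread", "Tea"}): 0.25,
    frozenset({"Butter", "Tea"}): 0.30,
    frozenset({"Bread", "Butter", "Tea"}): 0.05,
}

def confidence(lhs, rhs):
    """confidence(lhs -> rhs) = support(lhs | rhs) / support(lhs)."""
    lhs, rhs = frozenset(lhs), frozenset(rhs)
    return support[lhs | rhs] / support[lhs]

print(confidence({"Bread"}, {"Butter"}))         # 0.8
print(confidence({"Bread", "Tea"}, {"Butter"}))  # 0.2
print(confidence({"Butter", "Tea"}, {"Bread"}))  # ~0.167
```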
7. [6 points] (a) Suppose you are given the following dataset, which contains 8 dense groups of points (assume each group has the same number of points). Suppose you apply 2 different clustering methods to partition the dataset into two clusters. The first solution (shown in (a)) partitions the data into top and bottom halves, while the second solution (shown in (b)) partitions the data into left and right halves. Which of the two clustering solutions, (a) or (b), has the lower sum-of-squared error?

Solution (b) has the lower sum-of-squared error, since the distance from each point to its nearest centroid is, on average, smaller than in solution (a).

(b) In the figure below, assume hierarchical clustering has merged the 15 data points until there are 3 remaining clusters.

Figure 4: Intermediate clustering results. [figure not recovered]

Suppose the 3 clusters were obtained using the complete-link (MAX) method, resulting in the following 3 × 3 distance matrix (the rows and columns are sorted in increasing order of their cluster IDs):

[distance matrix not recovered]

Which two clusters will be merged to produce 2 clusters? Show the resulting 2 × 2 distance matrix after the merging.

Clusters 2 and 3 will be merged. The resulting 2 × 2 matrix has 0s on the diagonal; under complete link, its off-diagonal entry is the larger of d(cluster 1, cluster 2) and d(cluster 1, cluster 3). [The numeric entries were not recovered.]
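Since the distance matrix itself was lost, here is a sketch of the merge step with a hypothetical 3 × 3 matrix in which clusters 2 and 3 are the closest pair, matching the exam's answer:

```python
import numpy as np

# Hypothetical 3x3 complete-link distance matrix (the exam's actual
# values were lost); clusters 2 and 3 are the closest pair.
D = np.array([[0.0, 0.7, 0.9],
              [0.7, 0.0, 0.3],
              [0.9, 0.3, 0.0]])

# Merge clusters 2 and 3 (indices 1 and 2). Under complete link (MAX),
# the distance from the merged cluster to cluster 1 is the larger of
# d(1,2) and d(1,3).
d_new = max(D[0, 1], D[0, 2])
D2 = np.array([[0.0, d_new],
               [d_new, 0.0]])
print(D2)  # [[0.  0.9]
           #  [0.9 0. ]]
```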
(c) Based on the dendrogram below, if we wanted to obtain 3 clusters, which points would be grouped together? Example answer format: {1,2,3,4}, {5}, {6}, meaning the first four points form one cluster and points 5 and 6 each define their own cluster, composing the 3 clusters for the data.

Figure 5: Dendrogram for 6 data points. [figure not recovered]

{1,3}, {2,4,5}, and {6}