VECTOR SPACE CLASSIFICATION

Save this PDF as:
 WORD  PNG  TXT  JPG

Size: px
Start display at page:

Download "VECTOR SPACE CLASSIFICATION"

Transcription

1 VECTOR SPACE CLASSIFICATION Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval, Cambridge University Press. Chapter 14 Wei Wei Lecture series 1

2 RecALL: Naïve Bayes classifiers Classify based on prior weight of class and conditional parameter for what each word says: c NB argmax log P(c j ) log P(x i c j ) c j C i positions Training is done by counting and dividing: P(c j ) N c j N Don t forget to smooth P(x k c j ) T c j x k xi V [T c j x i ] 2

3 Vector SPACE text classification Today: Vector space methods for Text Classification Rocchio classification K Nearest Neighbors Linear classifier and non-linear classifier Classification with more than two classes 3

4 Vector SPACE text classification Vector space methods for Text Classification 4

5 VECTOR SPACE CLASSIFICATION Vector Space Representation Each document is a vector, one component for each term (= word). Normally normalize vectors to unit length. High-dimensional vector space: Terms are axes 10, dimensions, or even 100, Docs are vectors in this space How can we do classification in this space? 5

6 VECTOR SPACE CLASSIFICATION As before, the training set is a set of documents, each labeled with its class (e.g., topic) In vector space classification, this set corresponds to a labeled set of points (or, equivalently, vectors) in the vector space Hypothesis 1: Documents in the same class form a contiguous region of space Hypothesis 2: Documents from different classes don t overlap We define surfaces to delineate classes in the space 6

7 Documents in a vector space Government Science Arts 7

8 Test document: which class? Government Science Arts 8

9 Test document = government Government Science Arts Our main topic today is how to find good separators 9 9

10 Vector SPACE text classification Rocchio text classification i 10

11 Rocchio text classification Rocchio Text Classification Use standard tf-idf weighted vectors to represent text documents For training documents in each category, compute a prototype vector by summing the vectors of the training documents in the category Prototype = centroid of fmembers m of class Assign test documents to the category with the closest prototype vector based on cosine similarity 11

12 DEFINITION OF CENTROID (c) 1 v (d) D c d Dc Where D c is the set of all documents that belong to class c and v (d) is the vector space representation of d. Note that centroid will in general not be a unit vector even when the inputs are unit vectors. 12

13 ROCCHIO TEXT CLASSIFICATION r r1 r2 r3 b1 b2 t=b b 13

14 ROCCHIO PROPERTIES Forms a simple generalization of the examples in each class (a prototype). Prototype vector does not need to be averaged or otherwise normalized for length since cosine similarity is insensitive to vector length. Classification is based on similarity to class prototypes. Does not guarantee classifications are consistent with the given training data. 14

15 ROCCHIO ANOMALY Prototype models have problems with polymorphic (disjunctive) categories. r r1 r2 t=r b b1 b2 r3 r4 15

16 Vector SPACE text classification k Nearest Neighbor Classification i 16

17 K NEAREST NEIGHBOR CLASSIFICATION knn = k Nearest Neighbor To classify document d into class c: Define k-neighborhood N as k nearest neighbors of d Count number of documents i in N that belong to c Estimate P(c d) as i/k Choose as class argmax c P(c d) [ = majority class] 17

18 K NEAREST NEIGHBOR CLASSIFICATION Unlike Rocchio, knn classification determines the decision boundary locally. For 1NN (k=1), we assign each document to the class of its closest neighbor. For knn, we assign each document to the majority class of its k closest neighbors. K here is a parameter. The rationale of knn: contiguity hypothesis. We expect a test document d to have the same label as the training documents located nearly. 18

19 Knn: k=1 19

20 Knn: k=1 1,5,

21 KNN: weighted sum voting 21

22 K NEAREST NEIGHBOR CLASSIFICATION Test Government Science Arts 22

23 K NEAREST NEIGHBOR CLASSIFICATION Learning is just storing the representations of the training examples in D. Testing instance x (under 1NN): Compute similarity between x and all examples in D. Assign x the category of the most similar example in D. Does not explicitly compute a generalization or category prototypes. Also called: Case-based learning Memory-based learning Lazy learning Rationale of knn: contiguity hypothesis 23

24 SIMILARITY METRICS Nearest neighbor method depends on a similarity (or distance) metric. Simplest for continuous m-dimensional m instance space is Euclidean distance. Simplest for m-dimensional m binary instance space is Hamming distance (number of feature values that differ). For text, cosine similarity of tf.idf weighted vectors is typically most effective. 24

25 An Example: consine similarity r1 r2 r3 t=b b2 b1 25

26 Knn discusion Functional definition of similarity e.g. cos, Euclidean, kernel functions,... How many neighbors do we consider? Value of k determined empirically (normally 3 or 5 ) Does each neighbor get the same weight? Weighted-sum or not 26

27 Knn discusion (cont.) No feature selection necessary Scales well with large number of classes Dont n t need to train n classifiers s for n classes s Classes can influence each other Small changes to one class can have ripple effect Scores can be hard to convert to probabilities No training i necessary Actually: perhaps not true. (Data editing, etc.) May be more expensive at test time 27

28 Text Classification Linear classifier and non-linear classifier 28

29 Linear Classifier Many common text classifiers are linear classifiers Naïve Bayes Perceptron Rocchio Logistic regression Support vector machines (with linear kernel) Linear regression 29

30 Linear Classifier: 2d In two dimensions, a linear classifier is a line. These lines have functional form w 1 x 1 w2x2 b The classification rules: if w1 x1 w2x2 b, => c if w x w x b, => not-c Here: T ( x 1, x2) : 2D vector representation of the document T ( w 1, w2 ) : the parameter vector that t defines the boundary 30

31 Linear Classifier We can generalize this 2D linear classifier to higher dimensions by defining a hyperplane: The classification rules is then: Why Rocchio and Naive Bayes classifiers are linear classfiers. 31

32 Non-linear classifier Non-linear classifier: k-nn A linear classifier e.g. Naive Bayes does badly on the task: knn will do very well (assuming enough training data) 32

33 Text classification Classification i with more than two classes 33

34 Classification with more than two classes Classification for classes that are not mutually exclusive is called any-of classification problem. Classification for classes that are mutually exclusive is called one-of classification problem. We have learned two-class linear classifiers. linear classifier that can classify d to c or not-c. How can we extend the two-class linear classifiers to J>2 classes. to classify a document d to one of or any of classes c1, c2, c3 34

35 Classification with more than two classes For one-of of classification tasks: 1. Build a classifier for each class, where the training set consists of the set of documents in the class and its complement. 2. Given the test document, apply each classifier separately. 3. Assign the document to the class with the maximum score the maximum confidence value or the maximum probability 35

36 Classification with more than two classes For any-of classification tasks: 1. Build a classifier for each class, where the training set consists of the set of documents in the class and its compement. 2. Given the test document, apply each classifier separately. The decision of one classifier has no influence on the decisions of the other classfier. 36

37 THE TEXT CLASSIFICATION PROBLEM An example: Document d with only a sentance: London is planning to organize the 2012 Olympics. We have six classes: <UK>, <China>, <car>, <coffee>, <elections>, <sports> Determined: <UK> and <sports> p(uk)p(d UK) > t1 p(sports)p(d sports) > t2 Naive Bayes Text Classification 37

38 summary Vector space methods for Text Classification Rocchio classification K Nearest Neighbors Linear classifier and non-linear classifier Classification with more than two classes 38

Information Retrieval

Information Retrieval Introduction to Information Retrieval CS276: Information Retrieval and Web Search Text Classification 1 Chris Manning and Pandu Nayak Ch. 13 Standing queries The path from IR to text classification: You

More information

Distribution-free Predictive Approaches

Distribution-free Predictive Approaches Distribution-free Predictive Approaches The methods discussed in the previous sections are essentially model-based. Model-free approaches such as tree-based classification also exist and are popular for

More information

Lecture 9: Support Vector Machines

Lecture 9: Support Vector Machines Lecture 9: Support Vector Machines William Webber (william@williamwebber.com) COMP90042, 2014, Semester 1, Lecture 8 What we ll learn in this lecture Support Vector Machines (SVMs) a highly robust and

More information

Multi-Stage Rocchio Classification for Large-scale Multilabeled

Multi-Stage Rocchio Classification for Large-scale Multilabeled Multi-Stage Rocchio Classification for Large-scale Multilabeled Text data Dong-Hyun Lee Nangman Computing, 117D Garden five Tools, Munjeong-dong Songpa-gu, Seoul, Korea dhlee347@gmail.com Abstract. Large-scale

More information

Multi-Class Logistic Regression and Perceptron

Multi-Class Logistic Regression and Perceptron Multi-Class Logistic Regression and Perceptron Instructor: Wei Xu Some slides adapted from Dan Jurfasky, Brendan O Connor and Marine Carpuat MultiClass Classification Q: what if we have more than 2 categories?

More information

Classification & Clustering. Hadaiq Rolis Sanabila

Classification & Clustering. Hadaiq Rolis Sanabila Classification & Clustering Hadaiq Rolis Sanabila hadaiq@cs.ui.ac.id Natural Language Processing and Text Mining Pusilkom UI 22 26 Maret 2016 CLASSIFICATION 2 Categorization/Classification Given: A description

More information

Machine Learning Classifiers and Boosting

Machine Learning Classifiers and Boosting Machine Learning Classifiers and Boosting Reading Ch 18.6-18.12, 20.1-20.3.2 Outline Different types of learning problems Different types of learning algorithms Supervised learning Decision trees Naïve

More information

MS1b Statistical Data Mining Part 3: Supervised Learning Nonparametric Methods

MS1b Statistical Data Mining Part 3: Supervised Learning Nonparametric Methods MS1b Statistical Data Mining Part 3: Supervised Learning Nonparametric Methods Yee Whye Teh Department of Statistics Oxford http://www.stats.ox.ac.uk/~teh/datamining.html Outline Supervised Learning: Nonparametric

More information

Machine Learning: Think Big and Parallel

Machine Learning: Think Big and Parallel Day 1 Inderjit S. Dhillon Dept of Computer Science UT Austin CS395T: Topics in Multicore Programming Oct 1, 2013 Outline Scikit-learn: Machine Learning in Python Supervised Learning day1 Regression: Least

More information

Support Vector Machines

Support Vector Machines Support Vector Machines About the Name... A Support Vector A training sample used to define classification boundaries in SVMs located near class boundaries Support Vector Machines Binary classifiers whose

More information

Support Vector Machines + Classification for IR

Support Vector Machines + Classification for IR Support Vector Machines + Classification for IR Pierre Lison University of Oslo, Dep. of Informatics INF3800: Søketeknologi April 30, 2014 Outline of the lecture Recap of last week Support Vector Machines

More information

PV211: Introduction to Information Retrieval

PV211: Introduction to Information Retrieval PV211: Introduction to Information Retrieval http://www.fi.muni.cz/~sojka/pv211 IIR 15-1: Support Vector Machines Handout version Petr Sojka, Hinrich Schütze et al. Faculty of Informatics, Masaryk University,

More information

Clustering & Classification (chapter 15)

Clustering & Classification (chapter 15) Clustering & Classification (chapter 5) Kai Goebel Bill Cheetham RPI/GE Global Research goebel@cs.rpi.edu cheetham@cs.rpi.edu Outline k-means Fuzzy c-means Mountain Clustering knn Fuzzy knn Hierarchical

More information

Text Categorization (I)

Text Categorization (I) CS473 CS-473 Text Categorization (I) Luo Si Department of Computer Science Purdue University Text Categorization (I) Outline Introduction to the task of text categorization Manual v.s. automatic text categorization

More information

These slides follow closely the (English) course textbook Pattern Recognition and Machine Learning by Christopher Bishop

These slides follow closely the (English) course textbook Pattern Recognition and Machine Learning by Christopher Bishop Machine Learning Algorithms (IFT6266 A7) Prof. Douglas Eck, Université de Montréal These slides follow closely the (English) course textbook Pattern Recognition and Machine Learning by Christopher Bishop

More information

INF4820, Algorithms for AI and NLP: Evaluating Classifiers Clustering

INF4820, Algorithms for AI and NLP: Evaluating Classifiers Clustering INF4820, Algorithms for AI and NLP: Evaluating Classifiers Clustering Erik Velldal University of Oslo Sept. 18, 2012 Topics for today 2 Classification Recap Evaluating classifiers Accuracy, precision,

More information

Classification. 1 o Semestre 2007/2008

Classification. 1 o Semestre 2007/2008 Classification Departamento de Engenharia Informática Instituto Superior Técnico 1 o Semestre 2007/2008 Slides baseados nos slides oficiais do livro Mining the Web c Soumen Chakrabarti. Outline 1 2 3 Single-Class

More information

Text Analytics. Text Clustering. Ulf Leser

Text Analytics. Text Clustering. Ulf Leser Text Analytics Text Clustering Ulf Leser Text Classification Given a set D of docs and a set of classes C. A classifier is a function f: D C Problem: Finding a good classifier A good classifier assigns

More information

Tour-Based Mode Choice Modeling: Using An Ensemble of (Un-) Conditional Data-Mining Classifiers

Tour-Based Mode Choice Modeling: Using An Ensemble of (Un-) Conditional Data-Mining Classifiers Tour-Based Mode Choice Modeling: Using An Ensemble of (Un-) Conditional Data-Mining Classifiers James P. Biagioni Piotr M. Szczurek Peter C. Nelson, Ph.D. Abolfazl Mohammadian, Ph.D. Agenda Background

More information

Model Selection Introduction to Machine Learning. Matt Gormley Lecture 4 January 29, 2018

Model Selection Introduction to Machine Learning. Matt Gormley Lecture 4 January 29, 2018 10-601 Introduction to Machine Learning Machine Learning Department School of Computer Science Carnegie Mellon University Model Selection Matt Gormley Lecture 4 January 29, 2018 1 Q&A Q: How do we deal

More information

Supervised vs unsupervised clustering

Supervised vs unsupervised clustering Classification Supervised vs unsupervised clustering Cluster analysis: Classes are not known a- priori. Classification: Classes are defined a-priori Sometimes called supervised clustering Extract useful

More information

COMPARATIVE ANALYSIS OF PARTICLE SWARM OPTIMIZATION ALGORITHMS FOR TEXT FEATURE SELECTION

COMPARATIVE ANALYSIS OF PARTICLE SWARM OPTIMIZATION ALGORITHMS FOR TEXT FEATURE SELECTION San Jose State University SJSU ScholarWorks Master's Projects Master's Theses and Graduate Research Spring 5-13-2015 COMPARATIVE ANALYSIS OF PARTICLE SWARM OPTIMIZATION ALGORITHMS FOR TEXT FEATURE SELECTION

More information

CS570: Introduction to Data Mining

CS570: Introduction to Data Mining CS570: Introduction to Data Mining Classification Advanced Reading: Chapter 8 & 9 Han, Chapters 4 & 5 Tan Anca Doloc-Mihu, Ph.D. Slides courtesy of Li Xiong, Ph.D., 2011 Han, Kamber & Pei. Data Mining.

More information

Case-Based Reasoning. CS 188: Artificial Intelligence Fall Nearest-Neighbor Classification. Parametric / Non-parametric.

Case-Based Reasoning. CS 188: Artificial Intelligence Fall Nearest-Neighbor Classification. Parametric / Non-parametric. CS 188: Artificial Intelligence Fall 2008 Lecture 25: Kernels and Clustering 12/2/2008 Dan Klein UC Berkeley Case-Based Reasoning Similarity for classification Case-based reasoning Predict an instance

More information

Statistical Methods for NLP

Statistical Methods for NLP Statistical Methods for NLP Text Similarity, Text Categorization, Linear Methods of Classification Sameer Maskey Announcement Reading Assignments Bishop Book 6, 62, 7 (upto 7 only) J&M Book 3, 45 Journal

More information

Classification: Linear Discriminant Functions

Classification: Linear Discriminant Functions Classification: Linear Discriminant Functions CE-725: Statistical Pattern Recognition Sharif University of Technology Spring 2013 Soleymani Outline Discriminant functions Linear Discriminant functions

More information

Jarek Szlichta

Jarek Szlichta Jarek Szlichta http://data.science.uoit.ca/ Approximate terminology, though there is some overlap: Data(base) operations Executing specific operations or queries over data Data mining Looking for patterns

More information

Introduction to Clustering

Introduction to Clustering Introduction to Clustering Ref: Chengkai Li, Department of Computer Science and Engineering, University of Texas at Arlington (Slides courtesy of Vipin Kumar) What is Cluster Analysis? Finding groups of

More information

CSE 494/598 Lecture-11: Clustering & Classification

CSE 494/598 Lecture-11: Clustering & Classification CSE 494/598 Lecture-11: Clustering & Classification LYDIA MANIKONDA HT TP://WWW.PUBLIC.ASU.EDU/~LMANIKON / **With permission, content adapted from last year s slides and from Intro to DM dmbook@cs.umn.edu

More information

Evaluation. Evaluate what? For really large amounts of data... A: Use a validation set.

Evaluation. Evaluate what? For really large amounts of data... A: Use a validation set. Evaluate what? Evaluation Charles Sutton Data Mining and Exploration Spring 2012 Do you want to evaluate a classifier or a learning algorithm? Do you want to predict accuracy or predict which one is better?

More information

K-Nearest Neighbour Classifier. Izabela Moise, Evangelos Pournaras, Dirk Helbing

K-Nearest Neighbour Classifier. Izabela Moise, Evangelos Pournaras, Dirk Helbing K-Nearest Neighbour Classifier Izabela Moise, Evangelos Pournaras, Dirk Helbing Izabela Moise, Evangelos Pournaras, Dirk Helbing 1 Reminder Supervised data mining Classification Decision Trees Izabela

More information

Machine Learning: Algorithms and Applications Mockup Examination

Machine Learning: Algorithms and Applications Mockup Examination Machine Learning: Algorithms and Applications Mockup Examination 14 May 2012 FIRST NAME STUDENT NUMBER LAST NAME SIGNATURE Instructions for students Write First Name, Last Name, Student Number and Signature

More information

Content-based Recommender Systems

Content-based Recommender Systems Recuperação de Informação Doutoramento em Engenharia Informática e Computadores Instituto Superior Técnico Universidade Técnica de Lisboa Bibliography Pasquale Lops, Marco de Gemmis, Giovanni Semeraro:

More information

Part 12: Advanced Topics in Collaborative Filtering. Francesco Ricci

Part 12: Advanced Topics in Collaborative Filtering. Francesco Ricci Part 12: Advanced Topics in Collaborative Filtering Francesco Ricci Content Generating recommendations in CF using frequency of ratings Role of neighborhood size Comparison of CF with association rules

More information

DATA MINING INTRODUCTION TO CLASSIFICATION USING LINEAR CLASSIFIERS

DATA MINING INTRODUCTION TO CLASSIFICATION USING LINEAR CLASSIFIERS DATA MINING INTRODUCTION TO CLASSIFICATION USING LINEAR CLASSIFIERS 1 Classification: Definition Given a collection of records (training set ) Each record contains a set of attributes and a class attribute

More information

Linear Regression: One-Dimensional Case

Linear Regression: One-Dimensional Case Linear Regression: One-Dimensional Case Given: a set of N input-response pairs The inputs (x) and the responses (y) are one dimensional scalars Goal: Model the relationship between x and y (CS5350/6350)

More information

6.867 Machine Learning

6.867 Machine Learning 6.867 Machine Learning Problem set - solutions Thursday, October What and how to turn in? Turn in short written answers to the questions explicitly stated, and when requested to explain or prove. Do not

More information

IBL and clustering. Relationship of IBL with CBR

IBL and clustering. Relationship of IBL with CBR IBL and clustering Distance based methods IBL and knn Clustering Distance based and hierarchical Probability-based Expectation Maximization (EM) Relationship of IBL with CBR + uses previously processed

More information

A Dendrogram. Bioinformatics (Lec 17)

A Dendrogram. Bioinformatics (Lec 17) A Dendrogram 3/15/05 1 Hierarchical Clustering [Johnson, SC, 1967] Given n points in R d, compute the distance between every pair of points While (not done) Pick closest pair of points s i and s j and

More information

Data Mining. Practical Machine Learning Tools and Techniques. Slides for Chapter 3 of Data Mining by I. H. Witten, E. Frank and M. A.

Data Mining. Practical Machine Learning Tools and Techniques. Slides for Chapter 3 of Data Mining by I. H. Witten, E. Frank and M. A. Data Mining Practical Machine Learning Tools and Techniques Slides for Chapter 3 of Data Mining by I. H. Witten, E. Frank and M. A. Hall Output: Knowledge representation Tables Linear models Trees Rules

More information

CLUSTERING. JELENA JOVANOVIĆ Web:

CLUSTERING. JELENA JOVANOVIĆ   Web: CLUSTERING JELENA JOVANOVIĆ Email: jeljov@gmail.com Web: http://jelenajovanovic.net OUTLINE What is clustering? Application domains K-Means clustering Understanding it through an example The K-Means algorithm

More information

A Feature Selection Method to Handle Imbalanced Data in Text Classification

A Feature Selection Method to Handle Imbalanced Data in Text Classification A Feature Selection Method to Handle Imbalanced Data in Text Classification Fengxiang Chang 1*, Jun Guo 1, Weiran Xu 1, Kejun Yao 2 1 School of Information and Communication Engineering Beijing University

More information

MIT 801. Machine Learning I. [Presented by Anna Bosman] 16 February 2018

MIT 801. Machine Learning I. [Presented by Anna Bosman] 16 February 2018 MIT 801 [Presented by Anna Bosman] 16 February 2018 Machine Learning What is machine learning? Artificial Intelligence? Yes as we know it. What is intelligence? The ability to acquire and apply knowledge

More information

CS 8520: Artificial Intelligence. Machine Learning 2. Paula Matuszek Fall, CSC 8520 Fall Paula Matuszek

CS 8520: Artificial Intelligence. Machine Learning 2. Paula Matuszek Fall, CSC 8520 Fall Paula Matuszek CS 8520: Artificial Intelligence Machine Learning 2 Paula Matuszek Fall, 2015!1 Regression Classifiers We said earlier that the task of a supervised learning system can be viewed as learning a function

More information

An Improvement of Centroid-Based Classification Algorithm for Text Classification

An Improvement of Centroid-Based Classification Algorithm for Text Classification An Improvement of Centroid-Based Classification Algorithm for Text Classification Zehra Cataltepe, Eser Aygun Istanbul Technical Un. Computer Engineering Dept. Ayazaga, Sariyer, Istanbul, Turkey cataltepe@itu.edu.tr,

More information

CS145: INTRODUCTION TO DATA MINING

CS145: INTRODUCTION TO DATA MINING CS145: INTRODUCTION TO DATA MINING 08: Classification Evaluation and Practical Issues Instructor: Yizhou Sun yzsun@cs.ucla.edu October 24, 2017 Learnt Prediction and Classification Methods Vector Data

More information

Instance-Based Learning.

Instance-Based Learning. Instance-Based Learning www.biostat.wisc.edu/~dpage/cs760/ Goals for the lecture you should understand the following concepts k-nn classification k-nn regression edited nearest neighbor k-d trees for nearest

More information

Nearest Neighbor Classification

Nearest Neighbor Classification Nearest Neighbor Classification Professor Ameet Talwalkar Professor Ameet Talwalkar CS260 Machine Learning Algorithms January 11, 2017 1 / 48 Outline 1 Administration 2 First learning algorithm: Nearest

More information

Knowledge Discovery and Data Mining

Knowledge Discovery and Data Mining Knowledge Discovery and Data Mining Computer Science 591Y Department of Computer Science University of Massachusetts Amherst February 3, 2005 Topics Tasks (Definition, example, and notes) Classification

More information

Information Retrieval. Lecture 5 - The vector space model. Introduction. Overview. Term weighting. Wintersemester 2007

Information Retrieval. Lecture 5 - The vector space model. Introduction. Overview. Term weighting. Wintersemester 2007 Information Retrieval Lecture 5 - The vector space model Seminar für Sprachwissenschaft International Studies in Computational Linguistics Wintersemester 2007 1/ 28 Introduction Boolean model: all documents

More information

Introduction to Information Retrieval

Introduction to Information Retrieval Introduction to Information Retrieval http://informationretrieval.org IIR 6: Flat Clustering Hinrich Schütze Center for Information and Language Processing, University of Munich 04-06- /86 Overview Recap

More information

Performance Analysis of Data Mining Classification Techniques

Performance Analysis of Data Mining Classification Techniques Performance Analysis of Data Mining Classification Techniques Tejas Mehta 1, Dr. Dhaval Kathiriya 2 Ph.D. Student, School of Computer Science, Dr. Babasaheb Ambedkar Open University, Gujarat, India 1 Principal

More information

Chapter 6: Information Retrieval and Web Search. An introduction

Chapter 6: Information Retrieval and Web Search. An introduction Chapter 6: Information Retrieval and Web Search An introduction Introduction n Text mining refers to data mining using text documents as data. n Most text mining tasks use Information Retrieval (IR) methods

More information

An Empirical Performance Comparison of Machine Learning Methods for Spam Categorization

An Empirical Performance Comparison of Machine Learning Methods for Spam  Categorization An Empirical Performance Comparison of Machine Learning Methods for Spam E-mail Categorization Chih-Chin Lai a Ming-Chi Tsai b a Dept. of Computer Science and Information Engineering National University

More information

Information Management course

Information Management course Università degli Studi di Milano Master Degree in Computer Science Information Management course Teacher: Alberto Ceselli Lecture 20: 10/12/2015 Data Mining: Concepts and Techniques (3 rd ed.) Chapter

More information

CS 179 Lecture 16. Logistic Regression & Parallel SGD

CS 179 Lecture 16. Logistic Regression & Parallel SGD CS 179 Lecture 16 Logistic Regression & Parallel SGD 1 Outline logistic regression (stochastic) gradient descent parallelizing SGD for neural nets (with emphasis on Google s distributed neural net implementation)

More information

Density estimation. In density estimation problems, we are given a random from an unknown density. Our objective is to estimate

Density estimation. In density estimation problems, we are given a random from an unknown density. Our objective is to estimate Density estimation In density estimation problems, we are given a random sample from an unknown density Our objective is to estimate? Applications Classification If we estimate the density for each class,

More information

Feature Extractors. CS 188: Artificial Intelligence Fall Nearest-Neighbor Classification. The Perceptron Update Rule.

Feature Extractors. CS 188: Artificial Intelligence Fall Nearest-Neighbor Classification. The Perceptron Update Rule. CS 188: Artificial Intelligence Fall 2007 Lecture 26: Kernels 11/29/2007 Dan Klein UC Berkeley Feature Extractors A feature extractor maps inputs to feature vectors Dear Sir. First, I must solicit your

More information

Object and Action Detection from a Single Example

Object and Action Detection from a Single Example Object and Action Detection from a Single Example Peyman Milanfar* EE Department University of California, Santa Cruz *Joint work with Hae Jong Seo AFOSR Program Review, June 4-5, 29 Take a look at this:

More information

Region-based Segmentation

Region-based Segmentation Region-based Segmentation Image Segmentation Group similar components (such as, pixels in an image, image frames in a video) to obtain a compact representation. Applications: Finding tumors, veins, etc.

More information

CHAPTER 6 EXPERIMENTS

CHAPTER 6 EXPERIMENTS CHAPTER 6 EXPERIMENTS 6.1 HYPOTHESIS On the basis of the trend as depicted by the data Mining Technique, it is possible to draw conclusions about the Business organization and commercial Software industry.

More information

Information Retrieval. Techniques for Relevance Feedback

Information Retrieval. Techniques for Relevance Feedback Information Retrieval Techniques for Relevance Feedback Introduction An information need may be epressed using different keywords (synonymy) impact on recall eamples: ship vs boat, aircraft vs airplane

More information

A Survey on Postive and Unlabelled Learning

A Survey on Postive and Unlabelled Learning A Survey on Postive and Unlabelled Learning Gang Li Computer & Information Sciences University of Delaware ligang@udel.edu Abstract In this paper we survey the main algorithms used in positive and unlabeled

More information

Summarizing Public Opinion on a Topic

Summarizing Public Opinion on a Topic Summarizing Public Opinion on a Topic 1 Abstract We present SPOT (Summarizing Public Opinion on a Topic), a new blog browsing web application that combines clustering with summarization to present an organized,

More information

Accelerated Machine Learning Algorithms in Python

Accelerated Machine Learning Algorithms in Python Accelerated Machine Learning Algorithms in Python Patrick Reilly, Leiming Yu, David Kaeli reilly.pa@husky.neu.edu Northeastern University Computer Architecture Research Lab Outline Motivation and Goals

More information

CSCI567 Machine Learning (Fall 2014)

CSCI567 Machine Learning (Fall 2014) CSCI567 Machine Learning (Fall 2014) Drs. Sha & Liu {feisha,yanliu.cs}@usc.edu September 9, 2014 Drs. Sha & Liu ({feisha,yanliu.cs}@usc.edu) CSCI567 Machine Learning (Fall 2014) September 9, 2014 1 / 47

More information

KTH ROYAL INSTITUTE OF TECHNOLOGY. Lecture 14 Machine Learning. K-means, knn

KTH ROYAL INSTITUTE OF TECHNOLOGY. Lecture 14 Machine Learning. K-means, knn KTH ROYAL INSTITUTE OF TECHNOLOGY Lecture 14 Machine Learning. K-means, knn Contents K-means clustering K-Nearest Neighbour Power Systems Analysis An automated learning approach Understanding states in

More information

Exam in course TDT4215 Web Intelligence - Solutions and guidelines - Wednesday June 4, 2008 Time:

Exam in course TDT4215 Web Intelligence - Solutions and guidelines - Wednesday June 4, 2008 Time: English Student no:... Page 1 of 14 Contact during the exam: Geir Solskinnsbakk Phone: 735 94218/ 93607988 Exam in course TDT4215 Web Intelligence - Solutions and guidelines - Wednesday June 4, 2008 Time:

More information

Introduction to Artificial Intelligence

Introduction to Artificial Intelligence Introduction to Artificial Intelligence COMP307 Machine Learning 2: 3-K Techniques Yi Mei yi.mei@ecs.vuw.ac.nz 1 Outline K-Nearest Neighbour method Classification (Supervised learning) Basic NN (1-NN)

More information

QED Q: Why is it called the triangle inequality? A: Analogue with euclidean distance in the plane: picture Defn: Minimum Distance of a code C:

QED Q: Why is it called the triangle inequality? A: Analogue with euclidean distance in the plane: picture Defn: Minimum Distance of a code C: Lecture 3: Lecture notes posted online each week. Recall Defn Hamming distance: for words x = x 1... x n, y = y 1... y n of the same length over the same alphabet, d(x, y) = {1 i n : x i y i } i.e., d(x,

More information

modern database systems lecture 4 : information retrieval

modern database systems lecture 4 : information retrieval modern database systems lecture 4 : information retrieval Aristides Gionis Michael Mathioudakis spring 2016 in perspective structured data relational data RDBMS MySQL semi-structured data data-graph representation

More information

Classification Algorithms in Data Mining

Classification Algorithms in Data Mining August 9th, 2016 Suhas Mallesh Yash Thakkar Ashok Choudhary CIS660 Data Mining and Big Data Processing -Dr. Sunnie S. Chung Classification Algorithms in Data Mining Deciding on the classification algorithms

More information

Weight adjustment schemes for a centroid based classifier

Weight adjustment schemes for a centroid based classifier Weight adjustment schemes for a centroid based classifier Shrikanth Shankar and George Karypis University of Minnesota, Department of Computer Science Minneapolis, MN 55455 Technical Report: TR 00-035

More information

Semi-supervised Learning

Semi-supervised Learning Semi-supervised Learning Piyush Rai CS5350/6350: Machine Learning November 8, 2011 Semi-supervised Learning Supervised Learning models require labeled data Learning a reliable model usually requires plenty

More information

Semi-supervised learning and active learning

Semi-supervised learning and active learning Semi-supervised learning and active learning Le Song Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012 Combining classifiers Ensemble learning: a machine learning paradigm where multiple learners

More information

Document Clustering: Comparison of Similarity Measures

Document Clustering: Comparison of Similarity Measures Document Clustering: Comparison of Similarity Measures Shouvik Sachdeva Bhupendra Kastore Indian Institute of Technology, Kanpur CS365 Project, 2014 Outline 1 Introduction The Problem and the Motivation

More information

Introduction to Machine Learning. Xiaojin Zhu

Introduction to Machine Learning. Xiaojin Zhu Introduction to Machine Learning Xiaojin Zhu jerryzhu@cs.wisc.edu Read Chapter 1 of this book: Xiaojin Zhu and Andrew B. Goldberg. Introduction to Semi- Supervised Learning. http://www.morganclaypool.com/doi/abs/10.2200/s00196ed1v01y200906aim006

More information

Nearest Cluster Classifier

Nearest Cluster Classifier Nearest Cluster Classifier Hamid Parvin, Moslem Mohamadi, Sajad Parvin, Zahra Rezaei, and Behrouz Minaei Nourabad Mamasani Branch, Islamic Azad University, Nourabad Mamasani, Iran hamidparvin@mamasaniiau.ac.ir,

More information

Lecture 7: Relevance Feedback and Query Expansion

Lecture 7: Relevance Feedback and Query Expansion Lecture 7: Relevance Feedback and Query Expansion Information Retrieval Computer Science Tripos Part II Ronan Cummins Natural Language and Information Processing (NLIP) Group ronan.cummins@cl.cam.ac.uk

More information

Cluster Analysis. Prof. Thomas B. Fomby Department of Economics Southern Methodist University Dallas, TX April 2008 April 2010

Cluster Analysis. Prof. Thomas B. Fomby Department of Economics Southern Methodist University Dallas, TX April 2008 April 2010 Cluster Analysis Prof. Thomas B. Fomby Department of Economics Southern Methodist University Dallas, TX 7575 April 008 April 010 Cluster Analysis, sometimes called data segmentation or customer segmentation,

More information

Artificial Neural Networks (Feedforward Nets)

Artificial Neural Networks (Feedforward Nets) Artificial Neural Networks (Feedforward Nets) y w 03-1 w 13 y 1 w 23 y 2 w 01 w 21 w 22 w 02-1 w 11 w 12-1 x 1 x 2 6.034 - Spring 1 Single Perceptron Unit y w 0 w 1 w n w 2 w 3 x 0 =1 x 1 x 2 x 3... x

More information

This lecture: IIR Sections Ranked retrieval Scoring documents Term frequency Collection statistics Weighting schemes Vector space scoring

This lecture: IIR Sections Ranked retrieval Scoring documents Term frequency Collection statistics Weighting schemes Vector space scoring This lecture: IIR Sections 6.2 6.4.3 Ranked retrieval Scoring documents Term frequency Collection statistics Weighting schemes Vector space scoring 1 Ch. 6 Ranked retrieval Thus far, our queries have all

More information

Data Science Tutorial

Data Science Tutorial Eliezer Kanal Technical Manager, CERT Daniel DeCapria Data Scientist, ETC Software Engineering Institute Carnegie Mellon University Pittsburgh, PA 15213 2017 SEI SEI Data Science in in Cybersecurity Symposium

More information

SVM cont d. Applications face detection [IEEE INTELLIGENT SYSTEMS]

SVM cont d. Applications face detection [IEEE INTELLIGENT SYSTEMS] SVM cont d A method of choice when examples are represented by vectors or matrices Input space cannot be readily used as attribute-vector (e.g. too many attrs) Kernel methods: map data from input space

More information

Mining Web Data. Lijun Zhang

Mining Web Data. Lijun Zhang Mining Web Data Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction Web Crawling and Resource Discovery Search Engine Indexing and Query Processing Ranking Algorithms Recommender Systems

More information

Clustering. Bruno Martins. 1 st Semester 2012/2013

Clustering. Bruno Martins. 1 st Semester 2012/2013 Departamento de Engenharia Informática Instituto Superior Técnico 1 st Semester 2012/2013 Slides baseados nos slides oficiais do livro Mining the Web c Soumen Chakrabarti. Outline 1 Motivation Basic Concepts

More information

Machine Learning for NLP

Machine Learning for NLP Machine Learning for NLP Support Vector Machines Aurélie Herbelot 2018 Centre for Mind/Brain Sciences University of Trento 1 Support Vector Machines: introduction 2 Support Vector Machines (SVMs) SVMs

More information

Instance-Based Learning. Goals for the lecture

Instance-Based Learning. Goals for the lecture Instance-Based Learning Mar Craven and David Page Computer Sciences 760 Spring 2018 www.biostat.wisc.edu/~craven/cs760/ Some of the slides in these lectures have been adapted/borrowed from materials developed

More information

Unsupervised Learning. Presenter: Anil Sharma, PhD Scholar, IIIT-Delhi

Unsupervised Learning. Presenter: Anil Sharma, PhD Scholar, IIIT-Delhi Unsupervised Learning Presenter: Anil Sharma, PhD Scholar, IIIT-Delhi Content Motivation Introduction Applications Types of clustering Clustering criterion functions Distance functions Normalization Which

More information

Automatic Summarization

Automatic Summarization Automatic Summarization CS 769 Guest Lecture Andrew B. Goldberg goldberg@cs.wisc.edu Department of Computer Sciences University of Wisconsin, Madison February 22, 2008 Andrew B. Goldberg (CS Dept) Summarization

More information

Chapter 4: Non-Parametric Techniques

Chapter 4: Non-Parametric Techniques Chapter 4: Non-Parametric Techniques Introduction Density Estimation Parzen Windows Kn-Nearest Neighbor Density Estimation K-Nearest Neighbor (KNN) Decision Rule Supervised Learning How to fit a density

More information

Function Algorithms: Linear Regression, Logistic Regression

Function Algorithms: Linear Regression, Logistic Regression CS 4510/9010: Applied Machine Learning 1 Function Algorithms: Linear Regression, Logistic Regression Paula Matuszek Fall, 2016 Some of these slides originated from Andrew Moore Tutorials, at http://www.cs.cmu.edu/~awm/tutorials.html

More information

Cluster Evaluation and Expectation Maximization! adapted from: Doug Downey and Bryan Pardo, Northwestern University

Cluster Evaluation and Expectation Maximization! adapted from: Doug Downey and Bryan Pardo, Northwestern University Cluster Evaluation and Expectation Maximization! adapted from: Doug Downey and Bryan Pardo, Northwestern University Kinds of Clustering Sequential Fast Cost Optimization Fixed number of clusters Hierarchical

More information

Using a genetic algorithm for editing k-nearest neighbor classifiers

Using a genetic algorithm for editing k-nearest neighbor classifiers Using a genetic algorithm for editing k-nearest neighbor classifiers R. Gil-Pita 1 and X. Yao 23 1 Teoría de la Señal y Comunicaciones, Universidad de Alcalá, Madrid (SPAIN) 2 Computer Sciences Department,

More information

Louis Fourrier Fabien Gaie Thomas Rolf

Louis Fourrier Fabien Gaie Thomas Rolf CS 229 Stay Alert! The Ford Challenge Louis Fourrier Fabien Gaie Thomas Rolf Louis Fourrier Fabien Gaie Thomas Rolf 1. Problem description a. Goal Our final project is a recent Kaggle competition submitted

More information

Spam Classification Documentation

Spam Classification Documentation Spam Classification Documentation What is SPAM? Unsolicited, unwanted email that was sent indiscriminately, directly or indirectly, by a sender having no current relationship with the recipient. Objective:

More information

Clustering. Unsupervised Learning

Clustering. Unsupervised Learning Clustering. Unsupervised Learning Maria-Florina Balcan 04/06/2015 Reading: Chapter 14.3: Hastie, Tibshirani, Friedman. Additional resources: Center Based Clustering: A Foundational Perspective. Awasthi,

More information

Information Retrieval (IR) Introduction to Information Retrieval. Lecture Overview. Why do we need IR? Basics of an IR system.

Information Retrieval (IR) Introduction to Information Retrieval. Lecture Overview. Why do we need IR? Basics of an IR system. Introduction to Information Retrieval Ethan Phelps-Goodman Some slides taken from http://www.cs.utexas.edu/users/mooney/ir-course/ Information Retrieval (IR) The indexing and retrieval of textual documents.

More information

Query Answering Using Inverted Indexes

Query Answering Using Inverted Indexes Query Answering Using Inverted Indexes Inverted Indexes Query Brutus AND Calpurnia J. Pei: Information Retrieval and Web Search -- Query Answering Using Inverted Indexes 2 Document-at-a-time Evaluation

More information

Countering Spam Using Classification Techniques. Steve Webb Data Mining Guest Lecture February 21, 2008

Countering Spam Using Classification Techniques. Steve Webb Data Mining Guest Lecture February 21, 2008 Countering Spam Using Classification Techniques Steve Webb webb@cc.gatech.edu Data Mining Guest Lecture February 21, 2008 Overview Introduction Countering Email Spam Problem Description Classification

More information