Case Study 1: Estimating Click Probabilities
Tackling an Unknown Number of Features with Sketching
Machine Learning for Big Data CSE547/STAT548, University of Washington
Emily Fox, January 16th, 2014

Problem 1: Complexity of Update Rules for Logistic Regression

Logistic regression update:
$w_i^{(t+1)} \leftarrow w_i^{(t)} + \eta_t \left\{ -\lambda w_i^{(t)} + x_i^{(t)} \left[ y^{(t)} - P(Y=1 \mid x^{(t)}, w^{(t)}) \right] \right\}$

Complexity of updates:
- Constant in the number of data points
- In the number of features? A problem both in terms of computational complexity and sample complexity
- What can we do with very high-dimensional feature spaces?
- Kernels are not always appropriate, or scalable
- What else?
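
To make the per-feature cost of this update concrete, here is a minimal sketch of the stochastic-gradient step above over an explicit d-dimensional weight vector; the function names and the regularization constant are illustrative, not from the slides.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lr_sgd_update(w, x, y, eta, lam):
    """One SGD step for L2-regularized logistic regression.

    w   : (d,) weight vector
    x   : (d,) feature vector for one example
    y   : label in {0, 1}
    eta : learning rate
    lam : L2 regularization strength
    """
    p = sigmoid(w @ x)                          # P(Y = 1 | x, w)
    return w + eta * (-lam * w + x * (y - p))
```

Each step touches every feature of x, so the cost is O(d) per update, which is exactly what becomes problematic when d is huge or unknown.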

Problem 2: Unknown Number of Features

For example, bag-of-words features for text data: "Mary had a little lamb, little lamb..."
- What's the dimensionality of x?
- What if we see a new word that was not in our vocabulary? E.g., "Obamacare"
- Theoretically, just keep going in your learning and initialize w_Obamacare = 0
- In practice, you need to re-allocate memory, fix indices, ... A big problem for Big Data

Bloom Filter: Multiple Hash Tables

- A single hash table -> many false positives
- Use multiple hash tables with independent hash functions
- Every time you see element i: apply h_1(i), ..., h_d(i) and set all of those bits to 1
- Query Q(i)? Return true only if all of the bits h_1(i), ..., h_d(i) are set
- Significantly decreases the probability of false positives
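
A minimal Bloom-filter sketch along these lines; the salted-hash construction, bit-array size, and number of hashes below are illustrative choices, not from the lecture.

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits, num_hashes):
        self.m = num_bits
        self.d = num_hashes
        self.bits = bytearray(num_bits)          # one byte per bit, for simplicity

    def _hashes(self, item):
        # Derive d "independent" hash values by salting a single hash function.
        for j in range(self.d):
            digest = hashlib.sha256(f"{j}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item):
        for idx in self._hashes(item):
            self.bits[idx] = 1

    def query(self, item):
        # True may be a false positive; False is always correct.
        return all(self.bits[idx] for idx in self._hashes(item))

bf = BloomFilter(num_bits=10_000, num_hashes=5)
bf.add("obamacare")
print(bf.query("obamacare"), bf.query("lamb"))   # True, (almost certainly) False
```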

Count-Min Sketch: general case

- Keep a p-by-m Count matrix
- p hash functions: just like in the Bloom filter, decrease errors with multiple hashes
- Every time we see string i (see the sketch below):
  $\forall j \in \{1, \ldots, p\}: \ \mathrm{Count}[j, h_j(i)] \leftarrow \mathrm{Count}[j, h_j(i)] + 1$

Finally, Sketching for LR

$w_i^{(t+1)} \leftarrow w_i^{(t)} + \eta_t \left\{ -\lambda w_i^{(t)} + x_i^{(t)} \left[ y^{(t)} - P(Y=1 \mid x^{(t)}, w^{(t)}) \right] \right\}$

- Never need to know the size of the vocabulary!
- At every iteration, update the Count-Min matrix:
- Making a prediction:
- Scales to huge problems, great practical implications
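
A minimal Count-Min sketch along the lines described above; the dimensions and the salted-hash trick are illustrative assumptions.

```python
import hashlib
import numpy as np

class CountMinSketch:
    def __init__(self, p, m):
        self.p, self.m = p, m
        self.count = np.zeros((p, m), dtype=np.int64)

    def _h(self, j, item):
        digest = hashlib.sha256(f"{j}:{item}".encode()).hexdigest()
        return int(digest, 16) % self.m

    def update(self, item, delta=1):
        # Count[j, h_j(i)] <- Count[j, h_j(i)] + delta, for every row j
        for j in range(self.p):
            self.count[j, self._h(j, item)] += delta

    def estimate(self, item):
        # For non-negative counts, every row over-estimates; take the minimum.
        return min(self.count[j, self._h(j, item)] for j in range(self.p))

cms = CountMinSketch(p=5, m=1000)
for word in ["lamb", "lamb", "mary"]:
    cms.update(word)
print(cms.estimate("lamb"), cms.estimate("mary"))   # 2, 1 (up to collisions)
```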

Hash Kernels

- The Count-Min sketch is not designed for negative updates, and it gives biased estimates of dot products
- Hash kernels: a very simple but powerful idea to remove the bias
- Pick 2 hash functions:
  - h: just like in Count-Min hashing
  - ξ: a sign hash function; removes the bias found in Count-Min hashing (see homework)
- Define a kernel, a projection ϕ for x:

Hash Kernels Preserve Dot Products

$\phi_i(x) = \sum_{j : h(j) = i} \xi(j)\, x_j$

- Hash kernels provide an unbiased estimate of dot products!
- Variance decreases as O(1/m)
- Choosing m? For ε > 0, if $m = O\!\left(\frac{\log N}{\epsilon^2}\right)$ (under certain conditions), then with probability at least 1 - δ:
  $(1-\epsilon)\, \lVert x - x' \rVert_2^2 \;\le\; \lVert \phi(x) - \phi(x') \rVert_2^2 \;\le\; (1+\epsilon)\, \lVert x - x' \rVert_2^2$
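
A minimal sketch of the hashed feature map ϕ defined above, deriving both h and ξ from salted hashes; the specific hash construction and key format are illustrative assumptions.

```python
import hashlib
import numpy as np

def _hash_int(s):
    return int(hashlib.sha256(s.encode()).hexdigest(), 16)

def hash_kernel_features(x, m):
    """Project a sparse feature dict {feature_name: value} into R^m.

    h(j)  : bucket index in {0, ..., m-1}
    xi(j) : sign in {-1, +1}, which makes the dot-product estimate unbiased.
    """
    phi = np.zeros(m)
    for j, x_j in x.items():
        h = _hash_int("bucket:" + j) % m
        xi = 1 if _hash_int("sign:" + j) % 2 == 0 else -1
        phi[h] += xi * x_j
    return phi

phi = hash_kernel_features({"mary": 1, "lamb": 2, "little": 2}, m=2**18)
```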

Learning With Hash Kernels

- Given a hash kernel of dimension m, specified by h and ξ, learn an m-dimensional weight vector
- Observe a data point x; the dimension does not need to be specified a priori!
- Compute ϕ(x): initialize ϕ(x), then for each non-zero entry x_j of x:
- Use the normal update as if the observation were ϕ(x), e.g., for LR using SGD (see the sketch below):
  $w_i^{(t+1)} \leftarrow w_i^{(t)} + \eta_t \left\{ -\lambda w_i^{(t)} + \phi_i(x^{(t)}) \left[ y^{(t)} - P(Y=1 \mid \phi(x^{(t)}), w^{(t)}) \right] \right\}$

Interesting Application of Hash Kernels: Multi-Task Learning

- Personalized click estimation for many users:
  - One global click prediction vector w: but ...
  - A click prediction vector w_u per user u: but ...
- Multi-task learning: simultaneously solve multiple related learning problems, using information from one learning problem to inform the others
- In our simple example, learn both a global w and one w_u per user:
  - Prediction for user u:
  - If we know little about user u:
  - After a lot of data from user u:
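
Putting the two previous sketches together, here is a minimal sketch of SGD logistic regression trained directly in the hashed space; it reuses the illustrative sigmoid and hash_kernel_features helpers above, and the stream format and hyperparameter values are assumptions.

```python
def train_hashed_lr(stream, m, eta=0.1, lam=1e-4):
    """stream yields (x, y) pairs with x a sparse dict and y in {0, 1}."""
    w = np.zeros(m)
    for x, y in stream:
        phi = hash_kernel_features(x, m)         # project on the fly; vocabulary size never needed
        p = sigmoid(w @ phi)                     # P(Y = 1 | phi(x), w)
        w += eta * (-lam * w + phi * (y - p))
    return w
```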

Problems with Simple Multi-Task Learning

- Dealing with a new user is annoying, just like dealing with new words in the vocabulary
- The dimensionality of the joint parameter space is HUGE; e.g., for personalized email spam classification from Weinberger et al.:
  - 3.2M emails
  - 40M unique tokens in the vocabulary
  - 430K users
  - 16T parameters needed for personalized classification!

Hash Kernels for Multi-Task Learning

- A simple, pretty solution with hash kernels:
- View multi-task learning as a (sparse) learning problem with a (huge) joint data point z for point x and user u:
- Estimate the click probability as desired:
- Address the huge dimensionality, new words, and new users using hash kernels:
- The desired effect is achieved if the hashed features j include both the word alone (for the global w) and the (word, user) pair (for the personalized w_u)

Simple Trick for Forming Projection ϕ(x,u)

- Observe a data point x for user u; the dimension does not need to be specified a priori, and the user can be previously unknown!
- Compute ϕ(x,u): initialize ϕ(x,u), then for each non-zero entry x_j of x (e.g., j = "Obamacare"), add two contributions to ϕ:
  - a global contribution
  - a personalized contribution
- Simply (see the sketch below):
- Learn as usual, using ϕ(x,u) instead of ϕ(x) in the update function

Results from Weinberger et al. on Spam Classification: Effect of m
(Figure from the slides.)
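
A minimal sketch of the multi-task projection referenced above: each active word j contributes once under its own key (for the global w) and once under a (word, user) key (for the personalized w_u). It reuses the illustrative _hash_int helper from the earlier hash-kernel sketch, and the key format is an assumption, not from the slides.

```python
def hash_kernel_features_multitask(x, user, m):
    """Hash both global and per-user versions of every feature into one m-dim vector."""
    phi = np.zeros(m)
    for j, x_j in x.items():
        for key in (j, f"{j}|user={user}"):      # global term, then personalized term
            h = _hash_int("bucket:" + key) % m
            xi = 1 if _hash_int("sign:" + key) % 2 == 0 else -1
            phi[h] += xi * x_j
    return phi
```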

Results from Weinberger et al. on Spam Classification: Illustrating Multi-Task Effect
(Figure from the slides.)

What you need to know

- Hash functions
- Bloom filter: test membership with some false positives, but very few bits per element
- Count-Min sketch:
  - Positive counts: upper bound with nice rates of convergence
  - General case
  - Application to logistic regression
- Hash kernels:
  - Sparse representation for feature vectors
  - Very simple: use two hash functions (or one hash function, taking its least significant bit to define ξ)
  - Quickly generate the projection ϕ(x); learn in the projected space
- Multi-task learning:
  - Solve many related learning problems simultaneously
  - Very easy to implement with hash kernels
  - Significantly improves accuracy in some problems (if there is enough data from individual users)

Case Study 2: Document Retrieval
Task Description: Finding Similar Documents
Machine Learning for Big Data CSE547/STAT548, University of Washington
Emily Fox, January 16th, 2014

Document Retrieval

- Goal: retrieve documents of interest
- Challenges:
  - Tons of articles out there
  - How should we measure similarity?

Task 1: Find Similar Documents

- To begin...
- Input: query article
- Output: set of k similar articles

Document Representation

- Bag-of-words model

1-Nearest Neighbor

- Articles:
- Query:
- 1-NN Goal:
- Formulation:

k-Nearest Neighbor

- Articles: $X = \{x_1, \ldots, x_N\}$, $x_i \in \mathbb{R}^d$
- Query: $x \in \mathbb{R}^d$
- k-NN Goal:
- Formulation:
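
As a baseline for this formulation, a minimal brute-force k-NN sketch over an N-by-d matrix of documents; the names are illustrative.

```python
import numpy as np

def knn_brute_force(X, query, k):
    """X: (N, d) array of documents, query: (d,) vector. Returns indices of the k nearest."""
    dists = np.linalg.norm(X - query, axis=1)    # N Euclidean distances, one per document
    return np.argsort(dists)[:k]
```

This is exactly the O(N)-distance-computations-per-query scan that the later slides set out to avoid.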

Distance Metrics

Euclidean:
$d(u, v) = \sqrt{\sum_{i=1}^{d} (u_i - v_i)^2}$

Or, more generally,
$d(u, v) = \sqrt{\sum_{i=1}^{d} \sigma_i^2 (u_i - v_i)^2}$

Equivalently,
$d(u, v) = \sqrt{(u - v)^{\top} \Sigma (u - v)}$, where $\Sigma = \mathrm{diag}(\sigma_1^2, \sigma_2^2, \ldots, \sigma_d^2)$

Other metrics: Mahalanobis, rank-based, correlation-based, cosine similarity, ...

Notable Distance Metrics (and their level sets)

- Scaled Euclidean (L2)
- L1 norm (absolute)
- Mahalanobis (Σ is a general symmetric positive definite matrix; on the previous slide Σ was diagonal)
- L∞ (max) norm
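
A minimal numpy sketch of the three variants just listed; the function names are illustrative.

```python
import numpy as np

def euclidean(u, v):
    return np.sqrt(np.sum((u - v) ** 2))

def weighted_euclidean(u, v, sigma):
    # sigma: per-dimension scale factors sigma_i
    return np.sqrt(np.sum((sigma ** 2) * (u - v) ** 2))

def mahalanobis(u, v, Sigma):
    # Sigma: symmetric positive definite matrix (a diagonal Sigma recovers the weighted case)
    diff = u - v
    return np.sqrt(diff @ Sigma @ diff)
```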

Euclidean Distance + Document Retrieval

Recall the distance metric:
$d(u, v) = \sqrt{\sum_{i=1}^{d} (u_i - v_i)^2}$

- What if each document were many times longer? What happens to the measure of similarity?
- Good to normalize vectors: scale the word-count vectors

Issues with Document Representation

- Word counts are bad for standard similarity metrics
- Term Frequency - Inverse Document Frequency (tf-idf): increase the importance of rare words

TF-IDF

- Term frequency: $\mathrm{tf}(t, d) = f(t, d)$, the count of term t in document d
  (could also use {0, 1}, or 1 + log f(t, d), ...)
- Inverse document frequency: $\mathrm{idf}(t, D) = \log \frac{|D|}{|\{d \in D : t \in d\}|}$
- tf-idf: $\mathrm{tfidf}(t, d, D) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t, D)$
- High for a document d with a high frequency of term t (high term frequency) and few documents containing term t in the corpus (high inverse document frequency); see the sketch below

Issues with Search Techniques

- Naïve approach: brute-force search
  - Given a query point x, scan through each point x_i
  - O(N) distance computations per 1-NN query!
  - O(N log k) per k-NN query!
- What if N is huge??? (and there are many queries)
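
As referenced above, a minimal tf-idf sketch matching those definitions (raw-count term frequency, logarithmic inverse document frequency); all names are illustrative.

```python
import math
from collections import Counter

def tfidf(docs):
    """docs: list of token lists. Returns one {term: tf-idf weight} dict per document."""
    N = len(docs)
    df = Counter()                               # number of documents containing each term
    for doc in docs:
        df.update(set(doc))
    idf = {t: math.log(N / df[t]) for t in df}
    return [{t: count * idf[t] for t, count in Counter(doc).items()} for doc in docs]

weights = tfidf([["mary", "had", "a", "little", "lamb", "little", "lamb"],
                 ["old", "macdonald", "had", "a", "farm"]])
# Terms appearing in every document ("had", "a") get weight 0; rare terms get boosted.
```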

KD-Trees

- Smarter approach: kd-trees
  - A structured organization of documents: recursively partition the points into axis-aligned boxes
  - Enables more efficient pruning of the search space: examine nearby points first, and ignore any points that are farther than the nearest point found so far
- kd-trees work well in low to medium dimensions (we'll get back to this)

KD-Tree Construction

  Pt   X     Y
  1    0.00  0.00
  2    1.00  4.31
  3    0.13  2.85

- Start with a list of d-dimensional points.

KD-Tree Construction

(Figure: first split on "X > 0.5?" - NO branch: Pt 1 (0.00, 0.00) and Pt 3 (0.13, 2.85); YES branch: Pt 2 (1.00, 4.31).)

- Split the points into 2 groups by:
  - choosing a dimension d_j and value V (methods to be discussed ...)
  - separating the points into those with $x_i[d_j] > V$ and those with $x_i[d_j] \le V$
- Consider each group separately and possibly split again (along the same or a different dimension).
- Stopping criterion to be discussed ...

KD-Tree Construction

(Figure: second split of the NO branch on "Y > 0.1?" - NO: Pt 1 (0.00, 0.00); YES: Pt 3 (0.13, 2.85); the YES branch of the first split still holds Pt 2 (1.00, 4.31).)

- Consider each group separately and possibly split again (along the same or a different dimension).
- Continue splitting the points in each set; this creates a binary tree structure.
- Each leaf node contains a list of points.

KD-Tree Construction

- Keep one additional piece of information at each node: the (tight) bounds of the points at or below this node.
- Use heuristics to make splitting decisions:
  - Which dimension do we split along?
  - Which value do we split at?
  - When do we stop?
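
A minimal construction sketch following these rules; the median split along the widest dimension and the leaf-size stopping rule are illustrative heuristic choices, not prescribed by the slides.

```python
import numpy as np

class Node:
    def __init__(self, points, bounds, dim=None, value=None, left=None, right=None):
        self.points = points          # only stored at leaves
        self.bounds = bounds          # (mins, maxs): tight bounding box of all points below
        self.dim, self.value = dim, value
        self.left, self.right = left, right

def build_kdtree(points, leaf_size=10):
    bounds = (points.min(axis=0), points.max(axis=0))
    if len(points) <= leaf_size:
        return Node(points, bounds)
    dim = np.argmax(bounds[1] - bounds[0])        # split along the widest dimension
    value = np.median(points[:, dim])             # median heuristic
    mask = points[:, dim] <= value
    if mask.all() or (~mask).all():               # degenerate split: stop here
        return Node(points, bounds)
    return Node(None, bounds, dim, value,
                left=build_kdtree(points[mask], leaf_size),
                right=build_kdtree(points[~mask], leaf_size))
```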

Many heuristics ...

- median heuristic
- center-of-range heuristic

Nearest Neighbor with KD Trees

- Traverse the tree looking for the nearest neighbor of the query point.

Nearest Neighbor with KD Trees

- Examine nearby points first: explore the branch of the tree closest to the query point first.

Nearest Neighbor with KD Trees

- When we reach a leaf node: compute the distance to each point in the node.

Nearest Neighbor with KD Trees

- Then backtrack and try the other branch at each node visited.
- Each time a new closest point is found, update the distance bound.

Nearest Neighbor with KD Trees

- Using the distance bound and the bounding box of each node: prune parts of the tree that could NOT include the nearest neighbor.
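
A minimal sketch of this search procedure, built on the illustrative Node/build_kdtree sketch above: descend into the closest branch first, then backtrack, pruning any node whose bounding box lies farther away than the best distance found so far.

```python
def nn_search(node, query, best=(np.inf, None)):
    """Returns (best_dist, best_point); prunes boxes farther away than the current best."""
    mins, maxs = node.bounds
    gap = np.maximum(0, np.maximum(mins - query, query - maxs))
    if np.linalg.norm(gap) > best[0]:             # box cannot contain a closer point: prune
        return best
    if node.points is not None:                   # leaf: check every point in the node
        for p in node.points:
            d = np.linalg.norm(p - query)
            if d < best[0]:
                best = (d, p)
        return best
    near, far = ((node.left, node.right) if query[node.dim] <= node.value
                 else (node.right, node.left))
    best = nn_search(near, query, best)           # explore the closest branch first
    best = nn_search(far, query, best)            # then backtrack to the other branch
    return best
```

For example: tree = build_kdtree(X), followed by dist, point = nn_search(tree, query).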

Complexity

For (nearly) balanced binary trees...
- Construction
  - Size:
  - Depth:
  - Median + send points left/right:
  - Construction time:
- 1-NN query
  - Traverse down the tree to the starting point:
  - Maximum backtrack and traverse:
  - Complexity range:
- Under some assumptions on the distribution of points, we get O(log N), but exponential in d (see the citations in the reading)

Complexity
(Figure from the slides.)

Complexity for N Queries

- Ask for the nearest neighbor to each document:
  - Brute force 1-NN:
  - kd-trees:

Inspections vs. N and d
(Plots from the slides: number of inspections as a function of N, and as a function of d.)

K-NN with KD Trees

- Exactly the same algorithm, but maintain the distance bound as the distance to the farthest of the current k nearest neighbors.
- Complexity is:

Approximate K-NN with KD Trees

- Before: prune when the distance to the bounding box > r
- Now: prune when the distance to the bounding box > r/α (a code sketch of this change follows the wrap-up below)
- This will prune more than allowed, but we can guarantee that if we return a neighbor at distance r, then there is no neighbor closer than r/α.
- In practice this bound is loose, and the returned neighbor can be closer to optimal.
- Saves lots of search time at little cost in the quality of the nearest neighbor.

Wrapping Up - Important Points

kd-trees
- Tons of variants: on the construction of trees (heuristics for splitting, stopping, representing branches, ...)
- Other representational data structures for fast NN search (e.g., ball trees, ...)

Nearest Neighbor Search
- The distance metric and data representation are crucial to the answer returned.

For both
- High-dimensional spaces are hard!
  - The number of kd-tree searches can be exponential in the dimension: rule of thumb is N >> 2^d; otherwise the tree is typically useless.
  - Distances are sensitive to irrelevant features: most dimensions are just noise, so everything becomes equidistant (i.e., everything is far away). Need a technique to learn which features are important for your task.
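
As referenced above, the only change relative to the exact search sketched earlier is the pruning test; α > 1 is an illustrative parameter, and the Node structure is the one from the earlier construction sketch.

```python
def ann_search(node, query, alpha, best=(np.inf, None)):
    """Approximate NN: prune boxes farther than best_dist / alpha (alpha > 1)."""
    mins, maxs = node.bounds
    gap = np.maximum(0, np.maximum(mins - query, query - maxs))
    if np.linalg.norm(gap) > best[0] / alpha:     # the only change vs. the exact search
        return best
    if node.points is not None:
        for p in node.points:
            d = np.linalg.norm(p - query)
            if d < best[0]:
                best = (d, p)
        return best
    near, far = ((node.left, node.right) if query[node.dim] <= node.value
                 else (node.right, node.left))
    best = ann_search(near, query, alpha, best)
    best = ann_search(far, query, alpha, best)
    return best
```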

What you need to know

- Document retrieval task:
  - Document representation (bag of words)
  - tf-idf
- Nearest neighbor search:
  - Formulation
  - Different distance metrics and sensitivity to the choice
  - Challenges with large N
- kd-trees for nearest neighbor search:
  - Construction of the tree
  - NN search algorithm using the tree
  - Complexity of construction and query
  - Challenges with large d

Acknowledgment

This lecture contains some material from Andrew Moore's excellent collection of ML tutorials: http://www.cs.cmu.edu/~awm/tutorials
In particular, see: http://grist.caltech.edu/sc4devo/.../files/sc4devo_scalable_datamining.ppt