Exam Review Session. William Cohen

Similar documents
Randomized Algorithms Wrap-Up. William Cohen

Problem 1: Complexity of Update Rules for Logistic Regression

ML from Large Datasets

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

Practice Questions for Midterm

CS369G: Algorithmic Techniques for Big Data Spring

INTRO TO SEMI-SUPERVISED LEARNING (SSL)

Announcements: projects

Nearest Neighbor with KD Trees

Finding Similar Sets. Applications Shingling Minhashing Locality-Sensitive Hashing

Review: Memory, Disks, & Files. File Organizations and Indexing. Today: File Storage. Alternative File Organizations. Cost Model for Analysis

Midterm Oct 18, Name: Fall 2016 Midterm Exam Andrew ID: Time Limit: 80 Minutes

Big Data Analytics CSCI 4030

CSE547: Machine Learning for Big Data Spring Problem Set 1. Please read the homework submission policies.

Instance-Based Learning: Nearest neighbor and kernel regression and classification

Feature Extractors. CS 188: Artificial Intelligence Fall Some (Vague) Biology. The Binary Perceptron. Binary Decision Rule.

Mining Data Streams. Outline [Garofalakis, Gehrke & Rastogi 2002] Introduction. Summarization Methods. Clustering Data Streams

CSE 5243 INTRO. TO DATA MINING

Map-Reduce With Hadoop!

Sublinear Time and Space Algorithms 2016B Lecture 7 Sublinear-Time Algorithms for Sparse Graphs

University of Washington Department of Computer Science and Engineering / Department of Statistics

6 Randomized rounding of semidefinite programs

Density estimation. In density estimation problems, we are given a random sample from an unknown density. Our objective is to estimate

Snowball Sampling a Large Graph

The exam is closed book, closed notes except your one-page (two-sided) cheat sheet.

Week - 04 Lecture - 01 Merge Sort. (Refer Slide Time: 00:02)

Locality- Sensitive Hashing Random Projections for NN Search

Coding for Random Projections

Topic: Duplicate Detection and Similarity Computing

Compact Data Representations and their Applications. Moses Charikar Princeton University

COSC160: Detection and Classification. Jeremy Bolton, PhD Assistant Teaching Professor

Near Neighbor Search in High Dimensional Data (1) Dr. Anwar Alhenshiri

Treaps. 1 Binary Search Trees (BSTs) CSE341T/CSE549T 11/05/2014. Lecture 19

CPSC 311: Analysis of Algorithms (Honors) Exam 1 October 11, 2002

CS 6604: Data Mining Large Networks and Time-Series

5 Learning hypothesis classes (16 points)

Introduction to Information Retrieval (Manning, Raghavan, Schutze)

Research Students Lecture Series 2015

MLlib and Distributing the Singular Value Decomposition. Reza Zadeh

Hashing with Graphs. Sanjiv Kumar (Google), and Shih Fu Chang (Columbia) June, 2011

Practice EXAM: SPRING 2012 CS 6375 INSTRUCTOR: VIBHAV GOGATE

CS 179 Lecture 16. Logistic Regression & Parallel SGD

Lecture 5: Data Streaming Algorithms

Data Structures and Algorithm Analysis (CSC317) Hash tables (part2)

CSE 573: Artificial Intelligence Autumn 2010

To earn the extra credit, one of the following has to hold true. Please circle and sign.

Homework 3: Map-Reduce, Frequent Itemsets, LSH, Streams (due March 16 th, 9:30am in class hard-copy please)

Algorithms for Nearest Neighbors

Context. File Organizations and Indexing. Cost Model for Analysis. Alternative File Organizations. Some Assumptions in the Analysis.

The exam is closed book, closed notes except your one-page cheat sheet.

Graphs (Part II) Shannon Quinn

CSE 546 Machine Learning, Autumn 2013 Homework 2

Calculus II. Step 1 First, here is a quick sketch of the graph of the region we are interested in.

Paul's Online Math Notes Calculus III (Notes) / Applications of Partial Derivatives / Lagrange Multipliers

ECS289: Scalable Machine Learning

/633 Introduction to Algorithms Lecturer: Michael Dinitz Topic: Sorting lower bound and Linear-time sorting Date: 9/19/17

Logistic Regression: Probabilistic Interpretation

10-701/15-781, Fall 2006, Final

Perceptron Introduction to Machine Learning. Matt Gormley Lecture 5 Jan. 31, 2018

Summarizing and mining inverse distributions on data streams via dynamic inverse sampling

Query Processing. Introduction to Databases CompSci 316 Fall 2017

Conflict Graphs for Parallel Stochastic Gradient Descent

One-Pass Streaming Algorithms

Approximation Algorithms for Clustering Uncertain Data

Data Streams. Everything Data CompSci 216 Spring 2018

Scaling up: Naïve Bayes on Hadoop. What, How and Why?

Predictive Indexing for Fast Search

Linear methods for supervised learning

CS224W: Analysis of Networks Jure Leskovec, Stanford University

Bloom filters and their applications

We have used both of the last two claims in previous algorithms and therefore their proof is omitted.

Behavioral Data Mining. Lecture 18 Clustering

CPSC 340: Machine Learning and Data Mining. Non-Parametric Models Fall 2016

15-451/651: Design & Analysis of Algorithms October 11, 2018 Lecture #13: Linear Programming I last changed: October 9, 2018

BigTable. CSE-291 (Cloud Computing) Fall 2016

Equations of planes in

Distributed Computing with Spark

Kernels and Clustering

Kathleen Durant PhD Northeastern University CS Indexes

Final Exam. Introduction to Artificial Intelligence. CS 188 Spring 2010 INSTRUCTIONS. You have 3 hours.

Link Analysis in the Cloud

Discrete Mathematics Course Review 3

Network Traffic Measurements and Analysis

CS249: ADVANCED DATA MINING

/633 Introduction to Algorithms Lecturer: Michael Dinitz Topic: Priority Queues / Heaps Date: 9/27/17

Image Compression With Haar Discrete Wavelet Transform

Lecture outline. Graph coloring Examples Applications Algorithms

CS 4349 Lecture August 21st, 2017

Chapter 1 - The Spark Machine Learning Library

Combine the PA Algorithm with a Proximal Classifier

Transcription:

Exam Review Session William Cohen

General hints in studying: Understand what you've done and why. There will be questions that test your understanding of the techniques you implemented: why will/won't this shortcut work? What does the analysis say about the method? We're mostly about "learning meets computation" here, so there won't be many pure 601 questions.

General hints in studying: For techniques covered in class but not in the assignments, know when/where/how to use them. That usually includes understanding the analytic results presented in class. E.g.: is the lazy regularization update an approximation or not? When does it help? When does it not help?

General hints in studying: What about assignments you haven't done? You should read through the assignments and be familiar with the algorithms being implemented. There won't be questions about programming details that you could look up online, but you should know how architectures like Hadoop work (e.g., when and where they communicate), and you should be able to sketch out simple map-reduce algorithms. No Spark, but you should be able to read workflow operators and discuss how they'd be implemented: how would you do a groupByKey in Hadoop? (See the sketch below.)
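For example, here is a minimal sketch of groupByKey as a Hadoop-streaming job in Python. The script name and the map/reduce invocation are assumptions for illustration, not course code; the point is that the framework's sort phase supplies the grouping.

# groupbykey.py -- hypothetical sketch: run "map" mode as the mapper and
# "reduce" mode as the reducer of a Hadoop streaming job. Input lines
# are assumed to be "key<TAB>value".
import sys
from itertools import groupby

def mapper():
    # identity map: emit "key<TAB>value"; the shuffle sorts lines by key
    for line in sys.stdin:
        sys.stdout.write(line)

def reducer():
    # reducer input arrives sorted by key, so each group is a consecutive run
    pairs = (line.rstrip("\n").split("\t", 1) for line in sys.stdin)
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        values = [v for _, v in group]
        print("%s\t%s" % (key, values))

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()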

General hints in studying: There are no detailed questions on the guest speakers or the student projects (this year). If you use a previous year's exam as guidance, keep in mind that the topics are a little different each year.

General hints in exam taking: You can bring in one 8½ by 11 sheet (front and back). Look over everything quickly and skip around; probably nobody will know everything on the test. If you're not sure what we've got in mind, state your assumptions clearly in your answer; there's room for this even on true/false questions. If you look at a question and don't know the answer: we probably haven't told you the answer, but we've told you enough to work it out. Imagine arguing for some answer and see if you like it.

Outline of major topics: Hadoop (stream-and-sort is how I ease you into that, not really a topic on its own); parallelizing learners (perceptron, LDA, ...); hash kernels and streaming SGD; distributed SGD for matrix factorization; randomized algorithms; graph algorithms (PR, PPR, SSL on graphs). No assignment != not on the test. Some of these are easier to ask questions about than others.

HADOOP

What data gets lost if the job tracker is rebooted? If I have a 1 TB file and shard it 1000 ways, will it take longer than sharding it 10 ways? Where should a combiner run?

$ hadoop fs -ls rcv1/small/sharded
Found 10 items
-rw-r--r--  3   606405  2013-01-22 16:28  /user/wcohen/rcv1/small/sharded/part-00000
-rw-r--r--  3  1347611  2013-01-22 16:28  /user/wcohen/rcv1/small/sharded/part-00001
-rw-r--r--  3   939307  2013-01-22 16:28  /user/wcohen/rcv1/small/sharded/part-00002
-rw-r--r--  3  1284062  2013-01-22 16:28  /user/wcohen/rcv1/small/sharded/part-00003
-rw-r--r--  3  1009890  2013-01-22 16:28  /user/wcohen/rcv1/small/sharded/part-00004
-rw-r--r--  3  1206196  2013-01-22 16:28  /user/wcohen/rcv1/small/sharded/part-00005
-rw-r--r--  3  1384658  2013-01-22 16:28  /user/wcohen/rcv1/small/sharded/part-00006
-rw-r--r--  3  1299698  2013-01-22 16:28  /user/wcohen/rcv1/small/sharded/part-00007
-rw-r--r--  3   928752  2013-01-22 16:28  /user/wcohen/rcv1/small/sharded/part-00008
-rw-r--r--  3   806030  2013-01-22 16:28  /user/wcohen/rcv1/small/sharded/part-00009

$ hadoop fs -tail rcv1/small/sharded/part-00005
weak as the arrival of arbitraged cargoes from the West has put the local market under pressure
M14,M143,MCAT The Brent crude market on the Singapore International

Where is this data? How many disks is it on? If I set up a directory on /afs that looks the same, will it work the same? What about a local disk?


Why do I see this same error over and over again?

PARALLEL LEARNERS

Parallelizing perceptrons, take 2: Start from the previous weight vector w. Split the instances/labels into example subsets (instances/labels 1, instances/labels 2, instances/labels 3). Compute local vectors v_k on each subset (w_1, w_2, w_3), then combine them into a new w by some sort of weighted averaging. (A sketch follows.)
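A minimal sketch of that scheme, with uniform averaging and the shards simulated sequentially (in a real job each shard would be a separate worker):

# Parameter-mixing perceptron: a minimal sketch (uniform weighting).
import numpy as np

def local_perceptron(w, X, y, epochs=1):
    w = w.copy()
    for xi, yi in zip(X, y):            # labels yi in {-1, +1}
        if yi * np.dot(w, xi) <= 0:     # mistake -> perceptron update
            w += yi * xi
    return w

def mixed_perceptron(X, y, n_shards=3, rounds=5):
    w = np.zeros(X.shape[1])
    shards = np.array_split(np.arange(len(y)), n_shards)
    for _ in range(rounds):
        # each shard computes a local weight vector starting from the shared w...
        local = [local_perceptron(w, X[idx], y[idx]) for idx in shards]
        # ...and the new w is the (here: uniform) average of the local vectors
        w = np.mean(local, axis=0)
    return w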

A theorem: I probably won't ask about the proof, but I could definitely ask about the theorem. Corollary: if we weight the vectors uniformly, then the number of mistakes is still bounded; i.e., this is enough communication to guarantee convergence.

NAACL 2010: What does the word "structured" mean here? Why is it important? Would the results be better or worse with a regular perceptron?

STREAMING SGD

Learning as optimization for regularized logistic regression. Algorithm: what if you had a different update for the regularizer? With a naive regularizer update, the time goes from O(nT) to O(mVT), where n = number of non-zero entries, m = number of examples, V = number of features, and T = number of passes over the data. (A sketch of the lazy update that avoids this follows.)
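A minimal sketch of the lazy-update idea that keeps the cost at O(nT). The dictionary-of-weights representation and the (features, label) example format are assumptions for illustration:

# Lazy L2-regularized SGD for logistic regression: a minimal sketch.
# A naive implementation shrinks all V weights on every example; here we
# record when each weight was last touched and apply the accumulated
# shrinkage only when its feature reappears.
import math

def lazy_sgd(examples, lam=0.01, eta=0.1, epochs=1):
    # examples: list of (dict of non-zero features, label in {0, 1})
    w, last, t = {}, {}, 0
    for _ in range(epochs):
        for feats, y in examples:
            t += 1
            for j in feats:
                # catch up on the skipped shrinkage steps for feature j
                w[j] = w.get(j, 0.0) * (1.0 - eta * lam) ** (t - last.get(j, t))
                last[j] = t
            z = sum(w[j] * v for j, v in feats.items())
            p = 1.0 / (1.0 + math.exp(-max(min(z, 30.0), -30.0)))  # clamped
            for j, v in feats.items():
                w[j] += eta * (y - p) * v
    # (a full implementation would apply one final catch-up to every weight)
    return w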

Formalization of the "Hash Trick": First, a review of kernels. What is it? How does it work? What aspects of performance does it help? What did we say about it formally?
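As a reminder of the mechanics, a minimal sketch of the hash trick itself; md5-based hashing and the signed variant are implementation choices, not part of the formal statement from class:

# The hash trick: a minimal sketch. Feature strings are hashed into a
# fixed number of buckets, so the parameter vector has constant size no
# matter how many distinct feature strings appear.
import hashlib
import numpy as np

def hash_features(feats, b=2**18):
    # feats: dict mapping feature string -> value
    x = np.zeros(b)
    for name, value in feats.items():
        h = int(hashlib.md5(name.encode("utf8")).hexdigest(), 16)
        sign = 1.0 if (h >> 20) % 2 == 0 else -1.0  # signed variant, so that
        x[h % b] += sign * value                    # collisions cancel in expectation
    return x

With the signed variant, inner products between hashed vectors are unbiased estimates of the original inner products, which is roughly the formal story covered in class.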

RANDOMIZED ALGORITHMS

Randomized Algorithms: hash kernels, count-min sketch, Bloom filters, LSH. What are they, and what are they used for? When would you use which one? Why do they work, i.e., what analytic results have we looked at?

Bloom filters - review. An implementation: Allocate M bits, bit[0], ..., bit[M-1]. Pick K hash functions hash(1,s), hash(2,s), ..., e.g., hash(i,s) = hash(s + randomstring[i]). To add string s: for i = 1 to K, set bit[hash(i,s)] = 1. To check contains(s): for i = 1 to K, test bit[hash(i,s)]; return true if they're all set, otherwise return false. We'll discuss how to set M and K soon, but for now: let M = 1.5*maxSize (less than two bits per item!) and let K = 2*log(1/p) (about right with this M).
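A runnable version of that pseudocode, as a minimal sketch; salting a single md5 hash stands in for the K independent hash functions:

# Bloom filter: a minimal sketch of the slide's pseudocode.
import hashlib

class BloomFilter(object):
    def __init__(self, m, k):
        self.m, self.k = m, k
        self.bits = [False] * m

    def _hashes(self, s):
        # K hash functions via salting, as on the slide:
        # hash(i, s) = hash(s + randomstring[i])
        for i in range(self.k):
            h = hashlib.md5(("%d:%s" % (i, s)).encode("utf8")).hexdigest()
            yield int(h, 16) % self.m

    def add(self, s):
        for pos in self._hashes(s):
            self.bits[pos] = True

    def contains(self, s):
        return all(self.bits[pos] for pos in self._hashes(s))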

Bloom filters - an example application: discarding rare features from a classifier seldom hurts much, and can speed up experiments. Scan through the data once and check each word w:

if bf1.contains(w):
    if bf2.contains(w):
        bf3.add(w)
    else:
        bf2.add(w)
else:
    bf1.add(w)

Now bf2.contains(w) <=> w appears >= 2x, and bf3.contains(w) <=> w appears >= 3x. Then train, ignoring words not in bf3. Which needs more storage, bf1 or bf3 (at the same false positive rate)?

Bloom filters: Here are two ideas for using Bloom filters for learning from sparse binary examples: compress every example with a BF, then either use each bit of the BF as a feature for a classifier, or reconstruct the example at training time and train an ordinary classifier. Pros and cons?

from: Minos Garofalakis. CM Sketch Structure: An update <s, +c> maps string s to one bucket per row, via hash functions h_1(s), ..., h_d(s), and adds c to each of those buckets. The array has d = log(1/δ) rows and w = 2/ε columns.
- Each string is mapped to one bucket per row.
- Estimate A[j] by taking min_k { CM[k, h_k(j)] }.
- Errors are always over-estimates.
- Sizes: d = log(1/δ), w = 2/ε => the error is usually less than ε·||A||_1.

from: Minos Garofalakis. CM Sketch Guarantees: [Cormode & Muthukrishnan 04] The CM sketch guarantees approximation error on point queries less than ε·||A||_1 in space O(1/ε · log 1/δ); the probability of more error than that is less than δ. This is sometimes enough:
- Estimating a multinomial: if A[s] = Pr(s), then ||A||_1 = 1.
- Multiclass classification: if A_x[s] = Pr(x in class s), then ||A_x||_1 is probably small, since most x's will be in only a few classes.
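A minimal sketch of the structure with those sizes; salted md5 stands in for the pairwise-independent hash functions used in the actual analysis:

# Count-min sketch: a minimal sketch of the structure above.
import hashlib
import math

class CountMin(object):
    def __init__(self, eps, delta):
        self.w = int(math.ceil(2.0 / eps))               # w = 2/eps columns
        self.d = int(math.ceil(math.log(1.0 / delta)))   # d = log(1/delta) rows
        self.table = [[0] * self.w for _ in range(self.d)]

    def _h(self, k, s):
        h = hashlib.md5(("%d:%s" % (k, s)).encode("utf8")).hexdigest()
        return int(h, 16) % self.w

    def add(self, s, c=1):
        # the update <s, +c>: add c to one bucket per row
        for k in range(self.d):
            self.table[k][self._h(k, s)] += c

    def estimate(self, s):
        # min over rows; always an over-estimate for non-negative counts
        return min(self.table[k][self._h(k, s)] for k in range(self.d))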

An Application of a Count-Min Sketch. Problem: find the semantic orientation of a word (positive or negative) using a large corpus. Idea: positive words co-occur more frequently than expected near positive words, and likewise for negative words, so pick a few pos/neg seed words and count how often "x appears near y".
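With the CountMin class sketched above, the counting step might look like the following; the toy windows and the seed words are hypothetical stand-ins for a real corpus and lexicon:

# Counting "x appears near y" with a count-min sketch: a minimal sketch.
toy_windows = [["good", "excellent", "movie"], ["bad", "poor", "plot"]]
cm = CountMin(eps=1e-3, delta=0.01)
for window in toy_windows:            # in reality: stream windows from a corpus
    for x in window:
        for y in window:
            if x != y:
                cm.add(x + "|" + y)   # count the co-occurrence of x near y
# Then compare counts near positive vs. negative seeds to score a word w:
score = cm.estimate("movie|excellent") - cm.estimate("movie|poor")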

LSH: key ideas. Goal: map a feature vector x to a bit vector bx, and ensure that bx preserves similarity. Basic idea: use random projections of x. Repeat many times: pick a random hyperplane r by picking a random weight for each feature (say, from a Gaussian); compute the inner product of r with x; record whether x is on the positive side of r (r·x >= 0) as the next bit in bx. Theory says that if x and x' have small cosine distance, then bx and bx' will have small Hamming distance.
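A minimal numpy sketch of that construction:

# Random-projection LSH: a minimal sketch. Each output bit records which
# side of a random Gaussian hyperplane the input vector falls on.
import numpy as np

def make_lsh(n_features, n_bits, seed=0):
    rng = np.random.RandomState(seed)
    R = rng.randn(n_bits, n_features)   # one random hyperplane per output bit
    def signature(x):
        return np.dot(R, x) >= 0        # bit i: is x on the positive side of r_i?
    return signature

For unit vectors, the probability that a given bit differs between x and x' is angle(x, x')/π, which is why Hamming distance between signatures tracks cosine similarity.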

LSH applications: Compact storage of data, while still being able to compute similarities. LSH also gives very fast approximations: an approximate nearest-neighbor method (just look at other items with bx' = bx), other very fast nearest-neighbor methods based on Hamming distance, and very fast clustering (cluster = all things with the same bx vector).


LSH: What are some other ways of using LSH? What are some other ways of using count-min sketches?

GRAPH ALGORITHMS

Graph Algorithms: The standard way of computing PageRank iteratively, using the power method (a sketch follows). Properties of large data graphs. The subsampling problem. The APR (approximate PageRank) algorithm: how and why it works, when we'd use it, and what the analysis says. SSL on graphs: example algorithms, and an example use of CM sketches.
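For the first bullet, a minimal dense sketch of the power method; the row-stochastic transition matrix and uniform teleport vector are assumptions of this toy version:

# Power-method PageRank: a minimal dense sketch. W is assumed to be a
# row-stochastic transition matrix (row i spreads node i's weight).
import numpy as np

def pagerank(W, alpha=0.85, iters=50):
    n = W.shape[0]
    v = np.ones(n) / n                  # uniform starting distribution
    teleport = np.ones(n) / n
    for _ in range(iters):
        v = (1.0 - alpha) * teleport + alpha * np.dot(v, W)
    return v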


Degree distribution: Plot the cumulative degree distribution: the x axis is the degree k, and the y axis is the number of nodes that have degree at least k. Typically use a log-log scale. A straight line indicates a power law; a normal curve dives to zero at some point. Left: the trust network in the Epinions web site, from Richardson & Domingos.

Homophily: Another definition: excess edges between common neighbors of v.

CC(v) = (# triangles connected to v) / (# pairs connected to v)

CC(V, E) = (1/|V|) * sum over v of CC(v)

CC'(V, E) = (# triangles in graph) / (# length-3 paths in graph)

Approximate PageRank: Algorithm

Approximate PageRank: Key Idea. By definition, PageRank is the fixed point of pr(α, s) = α·s + (1-α)·pr(α, sW). Claim: recursively compute the PageRank of the neighbors of s (= sW), then adjust. Key idea in APR: do this recursive step repeatedly, focusing on nodes where finding PageRank from neighbors will be useful.

Analysis: by linearity, and then re-grouping by linearity again:

pr(α, r - r(u)·χ_u) + (1-α)·pr(α, r(u)·χ_u·W) = pr(α, r - r(u)·χ_u + (1-α)·r(u)·χ_u·W)
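These identities justify the "push" operation; here is a minimal sketch of the resulting algorithm (non-lazy-walk variant, dict-based adjacency lists, α as the teleport probability, and every node assumed to have at least one out-neighbor):

# Approximate personalized PageRank via repeated "push": a minimal sketch.
# p is the running estimate and r the residual; pushing at node u moves
# alpha*r(u) into p(u) and spreads (1-alpha)*r(u) across u's out-neighbors.
def approx_pagerank(graph, seed, alpha=0.15, eps=1e-4):
    p, r = {}, {seed: 1.0}
    work = [seed]
    while work:
        u = work.pop()
        ru = r.get(u, 0.0)
        if ru < eps * len(graph[u]):
            continue                      # stale work item; nothing to push
        r[u] = 0.0
        p[u] = p.get(u, 0.0) + alpha * ru
        share = (1.0 - alpha) * ru / len(graph[u])
        for v in graph[u]:
            r[v] = r.get(v, 0.0) + share
            if r[v] >= eps * len(graph[v]):
                work.append(v)            # v's residual is now worth pushing
    return p

Each push removes at least alpha*eps of residual mass, so the loop terminates, and the total work depends on eps and alpha rather than on the size of the graph.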

Q/A