Mining The Web. Anwar Alhenshiri (PhD)

1 Mining The Web Anwar Alhenshiri (PhD)

2 Mining Data Streams
In many data mining situations, we know the entire data set in advance.
Sometimes, however, the input rate is controlled externally, e.g., Google queries or Twitter and Facebook status updates.

3 The Stream Model
Input tuples enter at a rapid rate, at one or more input ports.
The system cannot store the entire stream accessibly.
How do you make critical calculations about the stream using a limited amount of (secondary) memory?

5 Applications
Mining query streams: Google wants to know what queries are more frequent today than yesterday.
Mining click streams: Yahoo wants to know which of its pages are getting an unusual number of hits in the past hour.
Mining social network news feeds: e.g., look for trending topics on Twitter and Facebook.

6 Data Stream Problems
Sampling data from a stream
Filtering a data stream
Queries over sliding windows
Counting distinct elements
Estimating moments
Finding frequent elements
Frequent itemsets

7 Sampling from a Data Stream
Since we can't store the entire stream, one obvious approach is to store a sample.
Two different problems:
Sample a fixed proportion of elements in the stream (say 1 in 10).
Maintain a random sample of fixed size over a potentially infinite stream.

8 Sampling a Fixed Proportion
Scenario: search engine query stream.
Tuples: (user, query, time).
Answer questions such as: how often did a user run the same query on two different days?
We have space to store 1/10th of the query stream.
Naïve solution: generate a random integer in [0..9] for each query; store the query if the integer is 0, otherwise discard it.
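The naïve solution can be sketched in a few lines. This is only an illustration of the per-query coin flip described on the slide: the function name `sample_one_in_ten`, the seeded random generator, and the synthetic (user, query, time) tuples are assumptions made for the example.

```python
import random


def sample_one_in_ten(stream, rng):
    """Yield roughly 1/10th of the stream, deciding independently per tuple."""
    for item in stream:
        # Random integer in [0..9]; keep the tuple only when it is 0.
        if rng.randrange(10) == 0:
            yield item


# Synthetic stream of (user, query, time) tuples, made up for illustration.
stream = [(f"user{i % 50}", f"query{i % 7}", i) for i in range(10_000)]
sample = list(sample_one_in_ten(stream, random.Random(0)))
print(len(sample), "of", len(stream), "tuples kept")
```

Because the decision is made per query, the sample needs no state beyond the random generator, which is what makes it usable on an unbounded stream.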

9 Filtering Data Streams
Each element of the data stream is a tuple.
Given a list of keys S, determine which elements of the stream have keys in S.
Obvious solution: a hash table.
But suppose we don't have enough memory to store all of S in a hash table, e.g., we might be processing millions of filters on the same stream.

10 Applications
Example: email spam filtering. We know 1 billion "good" email addresses. If an email comes from one of these, it is NOT spam.
Publish-subscribe: people express interest in certain sets of keywords. Determine whether each message matches a user's interest.

11 First Cut Solution
Create a bit array B of m bits, initially all 0s.
Choose a hash function h with range [0, m).
Hash each member of S to one of the bits, which is then set to 1.
Hash each element of the stream and output only those that hash to a 1.
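A minimal sketch of the first-cut filter, under stated assumptions: the hash function `bit_index` is a hypothetical choice built on SHA-256, the array uses one byte per bit for readability rather than compactness, and the addresses are invented. Every member of S passes the filter; an element outside S can still pass if it happens to hash to a bit that was set.

```python
import hashlib


def bit_index(key, m):
    """Hypothetical hash function h with range [0, m), built on SHA-256."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % m


def build_bit_array(keys, m):
    """Set B[h(key)] = 1 for every key in S."""
    bits = bytearray(m)  # one byte per bit, for simplicity
    for key in keys:
        bits[bit_index(key, m)] = 1
    return bits


def passes_filter(bits, key):
    """True when the key hashes to a bit that is set to 1."""
    return bits[bit_index(key, len(bits))] == 1


# Invented addresses, for illustration only.
good = ["alice@example.com", "bob@example.com", "carol@example.com"]
B = build_bit_array(good, m=64)
stream = ["bob@example.com", "mallory@example.org", "alice@example.com"]
accepted = [addr for addr in stream if passes_filter(B, addr)]
print(accepted)
```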

12 Ranking
Imagine a browser doing a random walk on web pages:
Start at a random page.
At each step, go out of the current page along one of the links on that page, equiprobably.
In the steady state, each page has a long-term visit rate; use this as the page's score.

13 Not Quite Enough
The web is full of dead ends.
A random walk can get stuck in dead ends.
It then makes no sense to talk about long-term visit rates.

14 Teleporting
At a dead end, jump to a random web page.
At any non-dead end, with probability 10%, jump to a random web page.
With the remaining probability (90%), go out on a random link.
The 10% is a parameter.

15 Result of Teleporting
Now the walk cannot get stuck locally.
How do we compute this visit rate?

16 Markov Chains
A Markov chain consists of n states, plus an n×n transition probability matrix P.
At each step, we are in exactly one of the states.
For $1 \le i, j \le n$, the matrix entry $P_{ij}$ tells us the probability of j being the next state, given we are currently in state i.

17 Markov Chains
Clearly, for all i, $\sum_{j=1}^{n} P_{ij} = 1$.
Markov chains are abstractions of random walks.
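As a sketch of how the teleporting random walk becomes a Markov chain, the code below builds the transition matrix P for a tiny link graph. The function name `transition_matrix`, the three-page graph, and the page names are assumptions for illustration; the 10% teleport probability is the parameter from the Teleporting slide.

```python
def transition_matrix(links, pages, teleport=0.1):
    """Row i holds the probabilities of moving from page i to every page."""
    n = len(pages)
    index = {page: i for i, page in enumerate(pages)}
    P = []
    for page in pages:
        out = links.get(page, [])
        if not out:
            # Dead end: jump to a random page.
            row = [1.0 / n] * n
        else:
            # Teleport with probability 10%, otherwise follow a random link.
            row = [teleport / n] * n
            for target in out:
                row[index[target]] += (1.0 - teleport) / len(out)
        P.append(row)
    return P


pages = ["A", "B", "C"]
links = {"A": ["B", "C"], "B": ["C"], "C": []}  # C is a dead end
P = transition_matrix(links, pages)
for row in P:
    print([round(p, 3) for p in row])
```

Each row sums to 1, as a transition probability matrix requires.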

18 Ergodic Markov Chains
A Markov chain is ergodic if you have a path from any state to any other state.
Let the transition matrix of a Markov chain be defined by (matrix not reproduced here). Then this is an ergodic chain.
For any start state, after a finite transient time $T_0$, the probability of being in any state at a fixed time $T > T_0$ is nonzero.

19 Ergodic Markov Chains
For any ergodic Markov chain, there is a unique long-term visit rate for each state: the steady-state probability distribution.
Over a long time period, we visit each state in proportion to this rate.
It doesn't matter where we start.
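One way to see the steady state numerically is to apply the transition matrix repeatedly to a starting distribution (power iteration). This is a sketch, not a method given on the slides: the 3×3 matrix below is a made-up ergodic chain, and the function names and the iteration count are assumptions.

```python
def step(dist, P):
    """One step of the chain: new_dist[j] = sum_i dist[i] * P[i][j]."""
    n = len(P)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]


def steady_state(P, start, iterations=200):
    """Apply the chain repeatedly to a starting distribution."""
    dist = list(start)
    for _ in range(iterations):
        dist = step(dist, P)
    return dist


# Made-up transition matrix; every row sums to 1 and every entry is positive.
P = [[0.1, 0.6, 0.3],
     [0.5, 0.2, 0.3],
     [0.3, 0.3, 0.4]]

from_first = steady_state(P, [1.0, 0.0, 0.0])
from_last = steady_state(P, [0.0, 0.0, 1.0])
print([round(x, 4) for x in from_first])
print([round(x, 4) for x in from_last])
```

Both starting points end at the same distribution, which illustrates that it doesn't matter where we start.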

20 PageRank: Formula
Given page A, and pages $T_1$ through $T_n$ linking to A, PageRank is defined as:
$PR(A) = (1-d) + d \left( \frac{PR(T_1)}{C(T_1)} + \dots + \frac{PR(T_n)}{C(T_n)} \right)$
C(P) is the cardinality (out-degree) of page P.
d is the damping factor.

21 PageRank: Intuition
Calculation is iterative: $PR_{i+1}$ is based on $PR_i$.
Each page distributes its $PR_i$ to all pages it links to.
Linkees add up their awarded rank fragments to find their $PR_{i+1}$.
d is a tunable parameter (usually 0.85) encapsulating the random jump factor.
$PR(A) = (1-d) + d \left( \frac{PR(T_1)}{C(T_1)} + \dots + \frac{PR(T_n)}{C(T_n)} \right)$
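The iteration can be sketched directly from the formula. The three-page graph, the starting value of 1.0 per page, and the fixed number of iterations are assumptions for illustration; the sketch also assumes every page has at least one out-link and appears as a key in `links`.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iterate PR(A) = (1 - d) + d * sum(PR(T) / C(T)) over pages T linking to A."""
    rank = {page: 1.0 for page in links}
    for _ in range(iterations):
        new_rank = {page: 1.0 - d for page in links}
        for page, targets in links.items():
            # Each page distributes its current rank evenly over its out-links.
            for target in targets:
                new_rank[target] += d * rank[page] / len(targets)
        rank = new_rank
    return rank


# Hypothetical link graph: A links to B and C, B links to C, C links to A.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
PR = pagerank(links)
for page in sorted(PR):
    print(page, round(PR[page], 4))
```

With this unnormalized form of the formula and no dead ends, the ranks sum to the number of pages rather than to 1.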

22 PageRank

23 PageRank Summary
Query processing: retrieve pages meeting the query, then rank them by their PageRank.
The order is query-independent.

24 The Reality
PageRank is used in Google, but is hardly the full story of ranking.
Many sophisticated features are used.
Some address specific query classes.
Machine-learned ranking is heavily used.
PageRank is still very useful for things like crawl policy.

25 Topic-Specific PageRank
Goal: PageRank values that depend on the query topic.
Conceptually, we use a random surfer who teleports, with say 10% probability, using the following rule:
Select a topic (say, one of the 16 top-level ODP categories) based on a query- and user-specific distribution over the categories.
Teleport to a page uniformly at random within the chosen topic.
Sounds hard to implement: we can't compute PageRank at query time!

26 ODP = Open Directory Project

27 Topic-Specific PageRank
Offline: compute PageRank for individual topics. This is query-independent as before. Each page has multiple PageRank scores, one for each ODP category, with teleportation only to that category.
Online: the query context is classified into (a distribution of weights over) topics. Generate a dynamic PageRank score for each page: a weighted sum of topic-specific PageRanks.

28 Interpretation

29 Interpretation

30 Interpretation
$pr = 0.9 \, PR_{sports} + 0.1 \, PR_{health}$ gives you: 9% sports teleportation, 1% health teleportation.
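The interpretation can be checked numerically: a weighted sum of topic-specific PageRanks equals the PageRank computed with the correspondingly weighted teleport distribution. The sketch below is an illustration under assumptions: the four-page graph, the page names, the normalized form of the iteration (ranks sum to 1), and the function name `topic_pagerank` are all made up for the example, and the graph has no dead ends.

```python
def topic_pagerank(links, teleport, follow=0.9, iterations=100):
    """PageRank where the 10% teleport step lands according to `teleport`."""
    rank = dict(teleport)  # start from the teleport distribution itself
    for _ in range(iterations):
        new_rank = {page: (1.0 - follow) * teleport[page] for page in links}
        for page, targets in links.items():
            for target in targets:
                new_rank[target] += follow * rank[page] / len(targets)
        rank = new_rank
    return rank


# Hypothetical graph: s1, s2 are sports pages; h1, h2 are health pages.
links = {"s1": ["s2", "h1"], "s2": ["s1"], "h1": ["h2", "s1"], "h2": ["h1"]}
sports = {"s1": 0.5, "s2": 0.5, "h1": 0.0, "h2": 0.0}
health = {"s1": 0.0, "s2": 0.0, "h1": 0.5, "h2": 0.5}
mixed = {p: 0.9 * sports[p] + 0.1 * health[p] for p in links}

pr_sports = topic_pagerank(links, sports)
pr_health = topic_pagerank(links, health)
pr_mixed = topic_pagerank(links, mixed)
combined = {p: 0.9 * pr_sports[p] + 0.1 * pr_health[p] for p in links}

for p in sorted(links):
    print(p, round(combined[p], 6), round(pr_mixed[p], 6))
```

With a 10% teleport step split 0.9/0.1 between the topics, the surfer teleports into sports pages 9% of the time and into health pages 1% of the time.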

31 Hyperlink-Induced Topic Search (HITS)
In response to a query, instead of an ordered list of pages each meeting the query, find two sets of inter-related pages:
Hub pages are good lists of links on a subject, e.g., "Bob's list of cancer-related links."
Authority pages occur recurrently on good hubs for the subject.
Best suited for "broad topic" queries rather than for page-finding queries.
Gets at a broader slice of common opinion.

32 Hubs and Authorities
Thus, a good hub page for a topic points to many authoritative pages for that topic.
A good authority page for a topic is pointed to by many good hubs for that topic.
This is a circular definition; we will turn it into an iterative computation.
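The slides in this excerpt do not spell out the iteration, so the following is a sketch of the standard HITS update: a page's authority score is the sum of the hub scores of pages pointing to it, and its hub score is the sum of the authority scores of pages it points to. The toy graph, the normalization by the sum of scores, and the iteration count are assumptions.

```python
def hits(links, iterations=20):
    """Return (hub, authority) scores for every page in a small link graph."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority update: add up the hub scores of pages linking in.
        auth = {p: 0.0 for p in pages}
        for page, targets in links.items():
            for target in targets:
                auth[target] += hub[page]
        # Hub update: add up the authority scores of pages linked to.
        hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
        # Normalize so the scores stay bounded.
        for scores in (hub, auth):
            total = sum(scores.values()) or 1.0
            for p in scores:
                scores[p] /= total
    return hub, auth


# Hypothetical graph: three hub-like pages pointing at two candidate authorities.
links = {"h1": ["a1", "a2"], "h2": ["a1"], "h3": ["a1", "a2"]}
hub, auth = hits(links)
print(max(auth, key=auth.get), max(hub, key=hub.get))
```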

33 The hope

34 High-level Scheme
Extract from the web a base set of pages that could be good hubs or authorities.
From these, identify a small set of top hub and authority pages, using an iterative algorithm.

35 Base Set
Given a text query (say "browser"), use a text index to get all pages containing "browser". Call this the root set of pages.
Add in any page that either points to a page in the root set, or is pointed to by a page in the root set. Call this the base set.
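Expanding a root set into a base set can be sketched as follows. The link graph and page names are invented, and the root set is given directly rather than produced by a text index.

```python
def base_set(root, links):
    """Root set plus every page that links to it or is linked from it."""
    base = set(root)
    for page, targets in links.items():
        if page in root:
            base.update(targets)  # pages pointed to by a root page
        if any(target in root for target in targets):
            base.add(page)  # pages pointing to a root page
    return base


# Hypothetical link graph: p1 -> r1 -> p2, and p3 -> p4 is unrelated.
links = {"p1": ["r1"], "r1": ["p2"], "r2": [], "p3": ["p4"], "p4": []}
root = {"r1", "r2"}
base = base_set(root, links)
print(sorted(base))
```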

37 Mining the Web for Structured Data

38 Our view of the web so far
Web pages as atomic units.
Great for some applications, e.g., conventional web search.
But not always the right model.

39 Going beyond web pages
Question answering: What is the height of Mt Everest? Who killed Abraham Lincoln?
Relation extraction: find all <company, ceo> pairs.
Virtual databases: answer database-like queries over web data, e.g., find all software engineering jobs in Fortune 500 companies.

40 Question Answering
E.g., Who killed Abraham Lincoln?
Naïve algorithm:
Find all web pages containing the terms "killed" and "Abraham Lincoln" in close proximity.
Extract k-grams from a small window around the terms.
Find the most commonly occurring k-grams.
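A rough sketch of the naïve algorithm on a handful of text snippets. Several things here are assumptions for illustration: the snippets stand in for retrieved web pages, "close proximity" is approximated by requiring all query words in the same snippet, the window is measured in words, and k-grams that contain a query word are skipped.

```python
import re
from collections import Counter


def top_kgrams(snippets, terms, k=3, window=5, top=1):
    """Count k-grams of non-query words that occur near the query words."""
    query_words = {w.lower() for term in terms for w in term.split()}
    counts = Counter()
    for text in snippets:
        tokens = [w.lower() for w in re.findall(r"[A-Za-z]+", text)]
        if not query_words <= set(tokens):
            continue  # the snippet must contain every query word
        hits = [i for i, w in enumerate(tokens) if w in query_words]
        for start in range(len(tokens) - k + 1):
            gram = tokens[start:start + k]
            near = all(
                any(abs(i - h) <= window for h in hits)
                for i in range(start, start + k)
            )
            if near and not query_words & set(gram):
                counts[" ".join(gram)] += 1
    return counts.most_common(top)


# Stand-in snippets, written for this example.
snippets = [
    "John Wilkes Booth killed Abraham Lincoln in 1865.",
    "Abraham Lincoln was killed by John Wilkes Booth.",
    "The actor John Wilkes Booth killed President Abraham Lincoln.",
]
answer = top_kgrams(snippets, ["killed", "Abraham Lincoln"])
print(answer)
```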

41 Question Answering, cont'd
The naïve algorithm works fairly well!
Some improvements:
Use sentence structure, e.g., restrict to noun phrases only.
Rewrite questions before matching: "What is the height of Mt Everest" becomes "The height of Mt Everest is <blank>".
For simple questions, the number of pages analyzed is more important than the sophistication of the NLP.

42 Relation Extraction
Find pairs (title, author), where title is the name of a book, e.g., (Foundation, Isaac Asimov).
Find pairs (company, hq), e.g., (Microsoft, Redmond).
Find pairs (abbreviation, expansion), e.g., (ADA, American Dental Association).
Can also have tuples with more than 2 components.

43 Relation Extraction
Assumptions:
No single source contains all the tuples.
Each tuple appears on many web pages.
Components of a tuple appear "close" together, e.g., "Foundation, by Isaac Asimov" or "Isaac Asimov's masterpiece, the Foundation trilogy".
There are repeated patterns in the way tuples are represented on web pages.

44 Naïve Approach
Study a few websites and come up with a set of patterns, e.g., regular expressions:
letter = [A-Za-z. ]
title = letter{5,40}
author = letter{10,30}
(title) by (author)
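The pattern on this slide translates almost directly into a regular expression. The sketch below uses Python's `re` module; the helper name `extract` and the second input sentence are assumptions, and surrounding whitespace is stripped from the captured groups.

```python
import re

LETTER = r"[A-Za-z. ]"
# (title) by (author), with title = letter{5,40} and author = letter{10,30}
PATTERN = re.compile(rf"({LETTER}{{5,40}}) by ({LETTER}{{10,30}})")


def extract(text):
    """Return a (title, author) pair if the pattern matches, else None."""
    match = PATTERN.search(text)
    if match is None:
        return None
    return match.group(1).strip(), match.group(2).strip()


print(extract("<li> Foundation by Isaac Asimov (1951)"))
# A made-up sentence from a different kind of page also matches:
print(extract("Posted by the site administrator"))
```

The second call returns a pair that is not a book and its author, which shows how loosely such a hand-written pattern constrains its matches.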

45 Problems with naïve approach
A pattern that works on one web page might produce nonsense when applied to another.
So patterns need to be page-specific, or at least site-specific.
It is impossible for a human to exhaustively enumerate patterns for every relevant website.
This will result in low coverage.

46 Better approach (Brin)
Exploit the duality between patterns and tuples:
Find tuples that match a set of patterns.
Find patterns that match a lot of tuples.
DIPRE (Dual Iterative Pattern Relation Extraction).

47 DIPRE Algorithm
1. R ← SampleTuples, e.g., a small set of <title, author> pairs.
2. O ← FindOccurrences(R): occurrences of tuples on web pages; keep some surrounding context.
3. P ← GenPatterns(O): look for patterns in the way tuples occur; make sure patterns are not too general!
4. R ← MatchingTuples(P)
5. Return or go back to Step 2.

48 Occurrences
E.g., titles and authors. Restrict to cases where author and title appear in close proximity on a web page.
<li> Foundation by Isaac Asimov (1951)
url =
order = [title, author] (or [author, title]); denote as 0 or 1
prefix = <li> (limit to e.g., 10 characters)
middle = by
suffix = (1951)
occurrence = (Foundation, Isaac Asimov, url, order, prefix, middle, suffix)
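A minimal sketch of recording one occurrence with its surrounding context. The `Occurrence` type, the function name `find_occurrence`, the `max_gap` proximity limit, and the example.com URL are all assumptions for illustration; the 10-character context limit follows the slide.

```python
from typing import NamedTuple


class Occurrence(NamedTuple):
    title: str
    author: str
    url: str
    order: int  # 0 = title before author, 1 = author before title
    prefix: str
    middle: str
    suffix: str


def find_occurrence(text, url, title, author, context=10, max_gap=20):
    """Locate a known (title, author) pair in text and keep its context."""
    t, a = text.find(title), text.find(author)
    if t < 0 or a < 0:
        return None
    if t < a:
        order, first, first_len, second, second_len = 0, t, len(title), a, len(author)
    else:
        order, first, first_len, second, second_len = 1, a, len(author), t, len(title)
    middle_start = first + first_len
    # Require the two components to be close together and not overlapping.
    if second < middle_start or second - middle_start > max_gap:
        return None
    end = second + second_len
    return Occurrence(
        title, author, url, order,
        prefix=text[max(0, first - context):first],
        middle=text[middle_start:second],
        suffix=text[end:end + context],
    )


occ = find_occurrence(
    "<li> Foundation by Isaac Asimov (1951)",
    "http://example.com/books",  # hypothetical URL
    "Foundation",
    "Isaac Asimov",
)
print(occ)
```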

49 Patterns
Nightfall by Isaac Asimov (1941)
order = [title, author] (say 0)
shared prefix =
shared middle = by
shared suffix = (19
pattern = (order, shared prefix, shared middle, shared suffix)

50 URL Prefix
Patterns may be specific to a website, or even parts of it.
Add a urlprefix component to the pattern.
occurrence: <li> Foundation by Isaac Asimov (1951)
occurrence: Nightfall by Isaac Asimov (1941)
shared urlprefix =
pattern = (urlprefix, order, prefix, middle, suffix)

51 Generating Patterns
1. Group occurrences by order and middle.
2. Let O = set of occurrences with the same order and middle.
pattern.order = O.order
pattern.middle = O.middle
pattern.urlprefix = longest common prefix of all urls in O
pattern.prefix = longest common prefix of occurrences in O
pattern.suffix = longest common suffix of occurrences in O
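The grouping step can be sketched as below. One interpretation is assumed here and should be read as such: the shared prefix is taken as the part of the prefix contexts adjacent to the tuple (their longest common ending), and the shared suffix as the part of the suffix contexts adjacent to the tuple (their longest common beginning), which is consistent with the earlier example where the shared suffix of "(1951)" and "(1941)" is "(19". The dictionary layout, URLs, and the "<p> " prefix are made up.

```python
import os
from collections import defaultdict


def common_start(strings):
    """Longest string that every input starts with."""
    return os.path.commonprefix(list(strings))


def common_end(strings):
    """Longest string that every input ends with."""
    return common_start([s[::-1] for s in strings])[::-1]


def generate_patterns(occurrences):
    groups = defaultdict(list)
    for occ in occurrences:
        groups[(occ["order"], occ["middle"])].append(occ)
    patterns = []
    for (order, middle), group in groups.items():
        patterns.append({
            "order": order,
            "middle": middle,
            "urlprefix": common_start(o["url"] for o in group),
            "prefix": common_end(o["prefix"] for o in group),
            "suffix": common_start(o["suffix"] for o in group),
        })
    return patterns


# Hypothetical occurrences; URLs and the "<p> " prefix are invented.
occurrences = [
    {"url": "http://example.com/books/foundation.html", "order": 0,
     "prefix": "<li> ", "middle": " by ", "suffix": " (1951)"},
    {"url": "http://example.com/books/nightfall.html", "order": 0,
     "prefix": "<p> ", "middle": " by ", "suffix": " (1941)"},
]
patterns = generate_patterns(occurrences)
print(patterns)
```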

52 Example
occurrence: <li> Foundation by Isaac Asimov (1951)
occurrence: Nightfall by Isaac Asimov (1941)
order = [title, author]
middle = by
urlprefix =
prefix =

53 Categorizing Matches
Books and authors, one possibility:
A tuple that matches a known tuple is positive.
A tuple that matches the title of a known tuple but has a different author is negative (assume title is the key for the relation).
All other tuples are unknown.
Can come up with other schemes if we have more information, e.g., a list of possible legal people names.

54 Example
Suppose we know the tuples:
(Foundation, Isaac Asimov)
(Startide Rising, David Brin)
Suppose pattern p matches:
(Foundation, Isaac Asimov)
(Startide Rising, David Brin)
(Foundation, Doubleday)
(Rendezvous with Rama, Arthur C. Clarke)
Then p.positive = 2, p.negative = 1, p.unknown = 1.
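The counts in this example can be reproduced with a short sketch of the categorization scheme, treating the title as the key of the relation. The function name `categorize` and the dictionary representation of known tuples are assumptions.

```python
from collections import Counter


def categorize(matches, known):
    """Count positive, negative and unknown matches of a pattern."""
    counts = Counter(positive=0, negative=0, unknown=0)
    for title, author in matches:
        if title not in known:
            counts["unknown"] += 1
        elif known[title] == author:
            counts["positive"] += 1
        else:
            counts["negative"] += 1  # known title, different author
    return counts


# Known tuples, keyed by title.
known = {"Foundation": "Isaac Asimov", "Startide Rising": "David Brin"}
matches = [
    ("Foundation", "Isaac Asimov"),
    ("Startide Rising", "David Brin"),
    ("Foundation", "Doubleday"),
    ("Rendezvous with Rama", "Arthur C. Clarke"),
]
p = categorize(matches, known)
print(p["positive"], p["negative"], p["unknown"])
```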

55 Snowball Algorithm
1. Start with a seed set R of tuples.
2. Generate a set P of patterns from R. Compute support and confidence for each pattern in P. Discard patterns with low support or confidence.
3. Generate a new set T of tuples matching patterns P. Compute the confidence of each tuple in T.
4. Add to R the tuples t in T with conf(t) > threshold.
5. Go back to step 2.
