7. Mining Text and Web Data

Contents of this Chapter
- 7.1 Introduction
- 7.2 Data Preprocessing
- 7.3 Text and Web Clustering
- 7.4 Text and Web Classification
- 7.5 References
[Han & Kamber 2006, Sections 10.4 and 10.5]
SFU, CMPT 459 / 741, 06-3, Martin Ester

7.1 The Web and Web Search
[Figure: architecture of a web search engine. Components: web server, crawler, repository on a storage server, clustering, classification, indexer, topic hierarchy, inverted index. The topic hierarchy shows the nodes Root, News, Science, Business, Plants, Animals, Computers and Automobiles. Documents in the repository: "The jaguar has a 4 liter engine" and "The jaguar, a cat, can run at speeds reaching 50 mph". Inverted index terms: engine, jaguar, cat. Search query: jaguar.]

7.1 Web Search Engines: Keyword Search
[Screenshot: keyword search for "data mining", 519,000 results.]

7.1 Web Search Engines
- Boolean search
  » care AND NOT old
- Phrases and proximity
  » "new care"
  » loss NEAR/5 care
  » <SENTENCE>
- Indexing: inverted lists
  » D1: My care is loss of care with old care done
  » D2: Your care is gain of care with new care won
  » Word positions are counted from 0 (My = 0, care = 1, ...).
  » care: D1: 1, 5, 8; D2: 1, 5, 8
  » new: D2: 7
  » old: D1: 7
  » loss: D1: 3
- How to rank the result lists?
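The inverted lists for the two example documents can be reproduced with a few lines of Python. This is only a sketch: the function names `build_index` and `near` are made up for illustration, tokenization is a plain lowercase whitespace split, and NEAR/k is interpreted as "some pair of positions at most k words apart", which is an assumption about the operator's semantics.

```python
from collections import defaultdict

docs = {
    "D1": "My care is loss of care with old care done",
    "D2": "Your care is gain of care with new care won",
}

def build_index(docs):
    """Map each word to {document id: list of 0-based word positions}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            index[word][doc_id].append(pos)
    return {word: dict(postings) for word, postings in index.items()}

def near(index, t1, t2, k):
    """Documents in which t1 and t2 occur at most k positions apart."""
    p1, p2 = index.get(t1, {}), index.get(t2, {})
    return {d for d in set(p1) & set(p2)
            if any(abs(i - j) <= k for i in p1[d] for j in p2[d])}

index = build_index(docs)

# Boolean query "care AND NOT old": set difference of the document sets.
care_not_old = set(index.get("care", {})) - set(index.get("old", {}))

# Proximity query "loss NEAR/5 care".
loss_near_care = near(index, "loss", "care", 5)
```

Storing positions, not just document ids, is what makes phrase and proximity queries answerable from the index alone.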

7.1 Ranking Web Pages: PageRank [Brin & Page 98]
- Idea: the more highly ranked pages link to a web page, the higher its rank.
  $$\mathrm{PageRank}(v) = p + (1 - p) \sum_{u \to v} \frac{\mathrm{PageRank}(u)}{\mathrm{OutDegree}(u)}$$
  where the sum runs over all pages u that link to v.
- Interpretation by random walk: PageRank is the probability that a random surfer visits a page.
  » Parameter p is the probability that the surfer gets bored and starts on a new random page.
  » (1 - p) is the probability that the random surfer follows a link on the current page.
- PageRanks correspond to the principal eigenvector of the normalized link matrix.
- They can be calculated using an efficient iterative algorithm.
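A minimal sketch of the iterative computation, using the update PageRank(v) = p + (1 - p) * sum over pages u linking to v of PageRank(u) / OutDegree(u). The function name, the default p = 0.15, the fixed iteration count and the three-page graph are assumptions chosen for illustration. In this unnormalized form the ranks sum to the number of pages when every page has at least one outgoing link; dividing by the number of pages would turn them into probabilities.

```python
def pagerank(links, p=0.15, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = sorted(set(links) | {v for targets in links.values() for v in targets})
    rank = {page: 1.0 for page in pages}
    for _ in range(iterations):
        # Every page receives the "bored surfer" share p ...
        new_rank = {page: p for page in pages}
        # ... plus (1 - p) times the rank passed along each incoming link.
        for u in pages:
            targets = links.get(u, [])
            for v in targets:
                new_rank[v] += (1 - p) * rank[u] / len(targets)
        rank = new_rank
    return rank

graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(graph)
```

In this toy graph, C is linked from both A and B, so it ends up with a higher rank than B, which is linked only from A.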

7.1 Ranking Web Pages: Hyperlink-Induced Topic Search (HITS) [Kleinberg 98]
- Definitions
  » Authorities: highly referenced pages on a topic.
  » Hubs: pages that point to authorities.
- A good authority is pointed to by many good hubs; a good hub points to many good authorities.
[Figure: hubs pointing to authorities.]

7.1 Ranking Web Pages: HITS Method
- Collect a seed set of pages S (e.g., returned by a search engine).
- Expand the seed set to contain pages that point to or are pointed to by pages in the seed set.
- Initialize all hub/authority weights to 1.
- Iteratively update hub weight h(p) and authority weight a(p) for each page:
  $$a(p) = \sum_{q \to p} h(q), \qquad h(p) = \sum_{p \to q} a(q)$$
- Stop when the hub/authority weights converge.
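The update rules a(p) = sum of h(q) over pages q pointing to p, and h(p) = sum of a(q) over pages q that p points to, can be sketched as follows. The normalization to unit length after each round is an assumption added here so that the weights stay bounded; the fixed iteration count stands in for a convergence test, and the page names are invented.

```python
import math

def hits(links, iterations=50):
    """links maps each page to the list of pages it points to."""
    pages = sorted(set(links) | {q for targets in links.values() for q in targets})
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority weight: sum of the hub weights of pages pointing to p.
        auth = {p: 0.0 for p in pages}
        for q, targets in links.items():
            for p in targets:
                auth[p] += hub[q]
        # Hub weight: sum of the authority weights of pages p points to.
        hub = {p: sum(auth[q] for q in links.get(p, [])) for p in pages}
        # Assumption: rescale both weight vectors to unit Euclidean length.
        for weights in (auth, hub):
            norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
            for p in weights:
                weights[p] /= norm
    return hub, auth

links = {"h1": ["a1", "a2"], "h2": ["a1"], "a1": [], "a2": []}
hub, auth = hits(links)
```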

7.1 Ranking Web Pages: Comparison
- PageRank
  » PageRanks are computed up front for all web pages, independent of the search query.
  » PageRank is used by web search engines such as Google.
- HITS
  » Hub and authority weights are computed for different root sets in the context of a particular search query.
  » HITS is applicable, e.g., for ranking the crawl front of a focused web crawler.
  » The relevance of a web page for the given topic can be used to weight the edges of the web (sub)graph.

7.1 Directory Services: Browsing
[Screenshot: directory browsing for "data mining", about 200 results.]

7.1 Directory Services: Topic Hierarchy
- Provides a hierarchical classification of documents / web pages (e.g., Yahoo!).
[Figure: topic hierarchy rooted at the Yahoo home page, with the nodes Recreation, Business, Science, News, Travel, Sports, Companies, Finance and Jobs.]
- The topic hierarchy can be browsed interactively to find relevant web pages.
- Keyword searches are performed in the context of a topic: they return only a subset of web pages related to the topic.

7.1 Mining Text and Web Data: Shortcomings of the Current Web Search Methods
- Low precision
  » Thousands of irrelevant documents are returned by a web search engine.
  » 99% of the information is of no interest to 99% of people.
- Low recall
  » In particular for directory services (due to manual acquisition).
  » Even the largest crawlers cover less than 50% of all web pages.
- Low quality
  » Many results are outdated, contain broken links, etc.

7.1 Mining Text and Web Data: Data Mining Tasks
- Classification of text documents / web pages
  » To insert into a topic hierarchy, as feedback for focused web crawlers, ...
- Clustering of text documents / web pages
  » To create a topic hierarchy, as front-end for web search engines, ...
- Resource discovery
  » Discovery of relevant web pages using a focused web crawler, in particular ranking of the links at the crawl front.

7.1 Mining Text and Web Data: Challenges
- Ambiguity of natural language: synonyms, homonyms, ...
- Different natural languages: English, Chinese, ...
- Very high dimensionality of the feature spaces: thousands of relevant terms.
- Multiple datatypes: text, images, videos, ...
- The web is extremely dynamic: > 1 million pages added each day.
- The web is a very large distributed database: huge number of objects, very high access costs.

7.2 Text Representation: Preprocessing
- Remove HTML tags, punctuation, etc.
- Define terms (single-word / multi-word terms).
- Remove stopwords.
- Perform stemming.
- Count term frequencies.
- Some words are more important than others: smooth the frequencies, e.g., weight by inverse document frequency.

7.2 Text Representation: Transformation
- n(d, t): number of occurrences of term t in document d.
- Different definitions of the frequency transformation (such as inverse document frequency weighting) are in use, for example:
  $$n(d,t), \qquad \frac{n(d,t)}{\sum_{t'} n(d,t')}, \qquad \frac{n(d,t)}{\max_{t'} n(d,t')}$$
- Select a significant subset of all occurring terms: vocabulary V with terms t_i.
- Vector Space Model: document d is represented as
  $$\mathrm{rep}(d) = \{\, n(d, t_i) \mid t_i \in V \,\}$$
- Most n's are zero for a single document.

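7.2 Text Representation: Similarity Function
- Cosine similarity:
  $$\mathrm{similarity}(d_1, d_2) = \frac{\langle \mathrm{rep}(d_1), \mathrm{rep}(d_2) \rangle}{\lVert \mathrm{rep}(d_1) \rVert \cdot \lVert \mathrm{rep}(d_2) \rVert}, \qquad \langle \cdot , \cdot \rangle = \text{inner product}$$
[Figure: document vectors in the term space with axes data, mining and document; a small angle marks similar documents, a large angle dissimilar ones.]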
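As a small worked example of the vector space model and the cosine similarity, the sketch below represents documents by raw term counts over a fixed vocabulary. The three-term vocabulary mirrors the axis labels of the figure; the example documents, the whitespace tokenization and the function names are assumptions for illustration, and no stopword removal, stemming or frequency weighting is applied.

```python
import math
from collections import Counter

def rep(text, vocabulary):
    """Term-count vector of a document over the given vocabulary."""
    counts = Counter(text.lower().split())
    return [counts[t] for t in vocabulary]

def cosine(x, y):
    """Inner product of x and y divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y) if norm_x and norm_y else 0.0

vocabulary = ["data", "mining", "document"]
d1 = rep("data mining data", vocabulary)   # [2, 1, 0]
d2 = rep("mining data", vocabulary)        # [1, 1, 0]
d3 = rep("document", vocabulary)           # [0, 0, 1]

sim_12 = cosine(d1, d2)   # similar: 3 / sqrt(10)
sim_13 = cosine(d1, d3)   # dissimilar: no shared terms
```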

7.2 Text Representation: Semantic Similarity
- Query: "Where can I fix my scooter?"
- Document: "A great garage to repair your 2-wheeler is ..."
- A scooter is a 2-wheeler. Fix and repair are synonyms.
- Two basic approaches for semantic similarity:
  » Hand-made thesaurus (e.g., WordNet)
  » Co-occurrence and associations
[Figure: documents in which the terms "auto" and "car" co-occur.]

7.3 Text and Web Clustering: Overview
- Standard (Text) Clustering Methods
  » Bisecting k-means
  » Agglomerative Hierarchical Clustering
- Specialised Text Clustering Methods
  » Suffix Tree Clustering
  » Frequent-Termset-Based Clustering
- Joint Cluster Analysis
  » Attribute data: text content
  » Relationship data: hyperlinks

7.3 Text and Web Clustering: Bisecting k-means [Steinbach, Karypis & Kumar 2000]
- Builds on k-means.
- Bisecting k-means:
  » Partition the database into 2 clusters.
  » Repeat: partition the largest cluster into 2 clusters ...
  » Until k clusters have been discovered.

7.3 Text and Web Clustering: Bisecting k-means
- Two types of clusterings
  » Hierarchical clustering
  » Flat clustering: any cut of this hierarchy
- Distance function: cosine measure (a similarity measure)
  $$s(\alpha, \beta) = \frac{\langle g(c(\alpha)), g(c(\beta)) \rangle}{\lVert g(c(\alpha)) \rVert \cdot \lVert g(c(\beta)) \rVert}$$
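The bisecting loop (split the database in two, then keep splitting the largest cluster until k clusters exist) can be sketched as below. Several choices here are assumptions made to keep the example short and deterministic: squared Euclidean distance is used instead of the cosine measure, the 2-means step is seeded with the first point and the point farthest from it, and k is assumed not to exceed the number of distinct points.

```python
def dist2(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    """Component-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def two_means(points, iterations=20):
    """Split points into two clusters with a basic 2-means loop."""
    c1 = points[0]
    c2 = max(points, key=lambda q: dist2(q, c1))
    for _ in range(iterations):
        left = [q for q in points if dist2(q, c1) <= dist2(q, c2)]
        right = [q for q in points if dist2(q, c1) > dist2(q, c2)]
        if not left or not right:
            break
        c1, c2 = mean(left), mean(right)
    return left, right

def bisecting_kmeans(points, k):
    clusters = [list(points)]
    while len(clusters) < k:
        largest = max(clusters, key=len)   # always split the largest cluster
        clusters.remove(largest)
        clusters.extend(two_means(largest))
    return clusters

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (25, 0)]
clusters = bisecting_kmeans(points, 3)
```

Recording the sequence of splits instead of only the final list would give the hierarchical clustering; the returned list is one flat cut of it.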

7.3 Text and Web Clustering: Agglomerative Hierarchical Clustering
1. Form initial clusters, each consisting of a single object, and compute the distance between each pair of clusters.
2. Merge the two clusters having minimum distance.
3. Calculate the distance between the new cluster and all other clusters.
4. If there is only one cluster containing all objects: stop; otherwise go to step 2.
Representation of a cluster C:
  $$\mathrm{rep}(C) = \Big\{ \sum_{d \in C} n(d, t_i) \;\Big|\; t_i \in V \Big\}$$
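The four steps can be sketched with clusters represented by their summed term vectors rep(C). Using 1 minus the cosine similarity of the cluster representations as the cluster distance is an assumption, as are the `target` parameter (stop early at a given number of clusters) and the toy vectors. All pairwise distances are recomputed in every round for brevity, which is less efficient than step 3 suggests.

```python
import math

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0

def agglomerate(vectors, target=1):
    """Merge clusters until only `target` clusters remain."""
    # Step 1: one singleton cluster per document: (member indices, rep(C)).
    clusters = [([i], list(v)) for i, v in enumerate(vectors)]
    merges = []
    while len(clusters) > target:
        # Steps 2 and 3: find the pair of clusters with minimum distance.
        pairs = [(1 - cosine(clusters[i][1], clusters[j][1]), i, j)
                 for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        _, i, j = min(pairs)
        members = clusters[i][0] + clusters[j][0]
        summed = [a + b for a, b in zip(clusters[i][1], clusters[j][1])]
        merges.append(sorted(members))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append((members, summed))
    return clusters, merges

vectors = [[2, 1, 0], [1, 1, 0], [0, 0, 3], [0, 1, 4]]
clusters, merges = agglomerate(vectors, target=2)
```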

7.3 Text and Web Clustering: Experimental Comparison [Steinbach, Karypis & Kumar 2000]
- Clustering quality
  » Measured as entropy on a prelabeled test data set.
  » Using several text and web data sets.
  » Bisecting k-means outperforms k-means.
  » Bisecting k-means outperforms agglomerative hierarchical clustering.
- Efficiency
  » Bisecting k-means is much more efficient than agglomerative hierarchical clustering: O(n) vs. O(n^2).

7.3 Text and Web Clustering: Suffix Tree Clustering [Zamir & Etzioni 1998]
- Forming clusters
  » Not by similar feature vectors
  » But by common terms
- Strengths of Suffix Tree Clustering (STC)
  » Efficiency: runtime O(n) for n text documents
  » Overlapping clusters
- Method
  1. Identification of basic clusters
  2. Combination of basic clusters

7.3 Text and Web Clustering: Identification of Basic Clusters
- Basic cluster: set of documents sharing one specific phrase.
- Phrase: multi-word term.
- Efficient identification of basic clusters using a suffix tree.
[Figure: suffix tree after insertion of (1) "cat ate cheese", with the suffixes "cat ate cheese", "ate cheese" and "cheese" labeled 1,1 / 1,2 / 1,3, and after insertion of (2) "mouse ate cheese too", which adds the labels 2,1 to 2,4.]

7.3 Text and Web Clustering: Combination of Basic Clusters
- Basic clusters are highly overlapping.
- Merge basic clusters having too much overlap.
- Basic clusters graph: nodes represent basic clusters.
  » Edge between A and B iff |A ∩ B| / |A| > 0.5 and |A ∩ B| / |B| > 0.5.
- Composite cluster: a component of the basic clusters graph.
- Drawbacks of this approach:
  » Distant members of the same component need not be similar.
  » No evaluation on standard test data.
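Both STC steps can be illustrated on a toy collection. Note the simplification: the sketch enumerates all contiguous word sequences in a dictionary instead of building a suffix tree, so it does not have the linear runtime of the real method. The third document, the `min_docs` parameter and the function names are assumptions for illustration; the overlap threshold 0.5 follows the edge rule above.

```python
from collections import defaultdict

def basic_clusters(docs, min_docs=2):
    """Map each phrase shared by at least min_docs documents to those documents."""
    phrase_docs = defaultdict(set)
    for doc_id, text in docs.items():
        words = text.lower().split()
        for start in range(len(words)):
            for end in range(start + 1, len(words) + 1):
                phrase_docs[" ".join(words[start:end])].add(doc_id)
    return {ph: ds for ph, ds in phrase_docs.items() if len(ds) >= min_docs}

def composite_clusters(basic, threshold=0.5):
    """Connected components of the basic clusters graph, as document sets."""
    phrases = list(basic)
    parent = {ph: ph for ph in phrases}

    def find(x):
        while parent[x] != x:
            x = parent[x]
        return x

    for i, a in enumerate(phrases):
        for b in phrases[i + 1:]:
            common = len(basic[a] & basic[b])
            # Edge iff the overlap exceeds the threshold relative to both clusters.
            if common / len(basic[a]) > threshold and common / len(basic[b]) > threshold:
                parent[find(a)] = find(b)
    groups = defaultdict(set)
    for ph in phrases:
        groups[find(ph)] |= basic[ph]
    return list(groups.values())

docs = {1: "cat ate cheese", 2: "mouse ate cheese too", 3: "cat ate mouse too"}
basic = basic_clusters(docs)
composite = composite_clusters(basic)
```

Here the basic clusters for "cat" and "cheese" share only one of their two documents, so there is no edge between them, yet both are linked to "ate" and land in the same composite cluster. This is the drawback named above in miniature.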

7.3 Text and Web Clustering: Example from the Grouper System
[Screenshot: clustered search results in the Grouper system.]

7.3 Text and Web Clustering: Frequent-Term-Based Clustering [Beil, Ester & Xu 2002]
- Frequent term set: description of a cluster.
- Set of documents containing all terms of the frequent term set: cluster.
- Clustering: subset of the set of all frequent term sets covering the DB with a low mutual overlap.

Example lattice of frequent term sets and their covers:

Term set            Documents containing all terms
{}                  {D1, ..., D16}
{sun}               {D1, D2, D4, D5, D6, D8, D9, D10, D11, D13, D15}
{fun}               {D1, D3, D4, D6, D7, D8, D10, D11, D14, D15, D16}
{beach}             {D2, D7, D8, D9, D10, D12, D13, D14, D15}
{sun, fun}          {D1, D4, D6, D8, D10, D11, D15}
{sun, beach}        {D2, D8, D9, D10, D11, D15}
{fun, surf}         (cover not shown)
{beach, surf}       (cover not shown)
{sun, fun, surf}    {D1, D6, D10, D11}
{sun, beach, fun}   {D8, D10, D11, D15}

7.3 Text and Web Clustering: FTC Method
- Task: efficient calculation of the overlap of a given cluster (description) F_i with the union of the other cluster (description)s.
- F: set of all frequent term sets in D.
- f_j: the number of all frequent term sets supported by document D_j:
  $$f_j = \left| \{ F_i \in F \mid F_i \subseteq D_j \} \right|$$
- Standard overlap of a cluster C_i:
  $$SO(C_i) = \frac{\sum_{D_j \in C_i} (f_j - 1)}{|C_i|}$$

7.3 Text and Web Clustering: Algorithm FTC

FTC(database D, float minsup)
  SelectedTermSets := {};
  n := |D|;
  RemainingTermSets := DetermineFrequentTermsets(D, minsup);
  while |cov(SelectedTermSets)| ≠ n do
    for each set in RemainingTermSets do
      Calculate overlap for set;
    BestCandidate := element of RemainingTermSets with minimum overlap;
    SelectedTermSets := SelectedTermSets ∪ {BestCandidate};
    RemainingTermSets := RemainingTermSets - {BestCandidate};
    Remove all documents in cov(BestCandidate) from D and from the coverage of all of the RemainingTermSets;
  return SelectedTermSets;
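A runnable sketch of the greedy FTC loop on a four-document toy database follows. It makes several assumptions that the slides do not fix: frequent term sets are found by brute-force enumeration up to `max_size` terms (no Apriori-style pruning), minsup is a fraction of the database size, f_j is computed once over all frequent term sets, ties in overlap are broken alphabetically, and the loop also stops when no term sets remain, so documents without any frequent term stay unclustered.

```python
from itertools import combinations

def frequent_term_sets(docs, minsup, max_size=3):
    """Return {term set: cover} for term sets contained in >= minsup * |D| documents."""
    terms = sorted(set().union(*docs.values()))
    result = {}
    for size in range(1, max_size + 1):
        for combo in combinations(terms, size):
            cover = {d for d, words in docs.items() if set(combo) <= words}
            if len(cover) >= minsup * len(docs):
                result[frozenset(combo)] = cover
    return result

def ftc(docs, minsup):
    remaining = frequent_term_sets(docs, minsup)
    # f[j]: number of frequent term sets supported by document j.
    f = {d: sum(d in cover for cover in remaining.values()) for d in docs}

    def overlap(cover):
        # Standard overlap: average of (f_j - 1) over the documents in the cover.
        return sum(f[d] - 1 for d in cover) / len(cover)

    selected, covered = [], set()
    while len(covered) < len(docs) and remaining:
        best = min(remaining, key=lambda ts: (overlap(remaining[ts]), sorted(ts)))
        cluster = remaining.pop(best)
        selected.append((set(best), cluster))
        covered |= cluster
        # Remove the newly covered documents from all remaining covers.
        remaining = {ts: cov - cluster for ts, cov in remaining.items() if cov - cluster}
    return selected

docs = {
    1: {"sun", "fun"},
    2: {"sun", "beach"},
    3: {"sun", "fun", "surf"},
    4: {"beach", "surf"},
}
selected = ftc(docs, minsup=0.5)
```

Each selected pair is a cluster description (the term set) together with its cluster (the documents), which is why FTC yields readable cluster labels as a by-product.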

7.3 Text and Web Clustering: Experimental Evaluation
[Figure: results on the Classic dataset.]

7.3 Text and Web Clustering: Experimental Evaluation
[Figure: results on the Reuters dataset.]
- FTC achieves comparable clustering quality more efficiently.
- Better cluster description.

7.3 Text and Web Clustering: Joint Cluster Analysis [Ester, Ge, Gao et al 2006]
- Attribute data: intrinsic properties of entities.
- Relationship data: extrinsic properties of entities.
- Existing clustering algorithms use either attribute or relationship data.
- Often, attribute and relationship data are somewhat related but contain complementary information.
- Informative graph: edges = relationships, 2D location = attributes.
- Goal: joint cluster analysis of attribute and relationship data.

7.3 Text and Web Clustering: The Connected k-center Problem
- Given
  » Dataset D as informative graph
  » k: number of clusters
- Connectivity constraint
  » Clusters need to be connected graph components.
- Objective function
  » Maximum (attribute) distance (radius) between nodes and cluster center.
- Task
  » Find k centers (nodes) and a corresponding partitioning of D into k clusters such that each cluster satisfies the connectivity constraint and the objective function is minimized.

7.3 Text and Web Clustering: The Hardness of the Connected k-center Problem
- Given k centers and a radius threshold r, finding the optimal assignment for the remaining nodes is NP-hard.
- The reason is articulation nodes (e.g., a): the assignment of a implies the assignment of b.
[Figure: clusters C1 and C2 with articulation node a and dependent node b; assignment of a to C1 minimizes the radius.]

7.3 Text and Web Clustering: Example: CS Publications
- 1073: Kempe, Dobra, Gehrke: Gossip-based computation of aggregate information, FOCS
- Fagin, Kleinberg, Raghavan: Query strategies for priced information, STOC'02
- True cluster labels are based on the conference.

7.3 Text and Web Clustering: Application for Web Pages
- Attribute data: text content
- Relationship data: hyperlinks
- Connectivity constraint: clusters must be connected
- New challenges
  » Clusters can be overlapping.
  » Not all web pages must belong to a cluster (noise).
  » ...

7.4 Text and Web Classification: Overview
- Standard Text Classification Methods
  » Naive Bayes Classifier
  » Bag of Words Model
- Advanced Methods
  » Use EM to exploit non-labeled training documents
  » Exploit hyperlinks

7.4 Text and Web Classification: Naive Bayes Classifier
- Decision rule of the Bayes Classifier:
  $$\operatorname*{argmax}_{c_j \in C} \; P(d \mid c_j)\, P(c_j)$$
- Assumptions of the Naive Bayes Classifier:
  » d = (d_1, ..., d_{|d|})
  » The attribute values d_i are independent of each other (given the class).
- Decision rule of the Naive Bayes Classifier:
  $$\operatorname*{argmax}_{c_j \in C} \; P(c_j) \prod_{i=1}^{|d|} P(d_i \mid c_j)$$

7.4 Text and Web Classification: Bag of Words Model
- Allows estimating the P(d_i | c_j) from the training documents.
- Neglects the position (order) of terms in the document (bag!).
- Model for generating a document d of class c consisting of n terms.
- P(d_i | c_j): conditional probability of observing term t_i, given that the document belongs to class c_j.
- Naive approach:
  $$P(d_i \mid c_j) = \frac{\sum_{d \in c_j} n(d, t_i)}{\sum_{d \in c_j} \sum_{t_k \in V} n(d, t_k)}$$

7.4 Text and Web Classification: Bag of Words Model
- Problem
  » Term t_i appears in no training document of class c_j:
    $$\sum_{d \in c_j} n(d, t_i) = 0$$
  » t_i appears in a document d to be classified.
  » Document d also contains terms which strongly indicate class c_j, but P(d_i | c_j) = 0 and hence P(d | c_j) = 0.
- Solution: smoothing of the relative frequencies in the training documents:
  $$P(d_i \mid c_j) = \frac{1 + \sum_{d \in c_j} n(d, t_i)}{|V| + \sum_{d \in c_j} \sum_{t_k \in V} n(d, t_k)}$$
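A minimal sketch of a Naive Bayes text classifier under the bag of words model with the add-one smoothing above, i.e. P(t_i | c_j) = (1 + count of t_i in class c_j) / (|V| + total term count in class c_j). The training texts, class names and function names are invented for illustration; working with log probabilities to avoid underflow and skipping words outside the vocabulary are implementation choices assumed here.

```python
import math
from collections import Counter, defaultdict

def train(labeled_docs):
    """labeled_docs: list of (text, class label) pairs."""
    class_docs = Counter()              # number of training documents per class
    term_counts = defaultdict(Counter)  # per class: summed n(d, t) over its documents
    vocabulary = set()
    for text, label in labeled_docs:
        words = text.lower().split()
        class_docs[label] += 1
        term_counts[label].update(words)
        vocabulary.update(words)
    return class_docs, term_counts, vocabulary

def classify(text, model):
    class_docs, term_counts, vocabulary = model
    total_docs = sum(class_docs.values())
    scores = {}
    for c in class_docs:
        total_terms = sum(term_counts[c].values())
        score = math.log(class_docs[c] / total_docs)      # log P(c_j)
        for w in text.lower().split():
            if w in vocabulary:                            # ignore unknown words
                # Smoothed estimate: never zero, even for unseen (term, class) pairs.
                score += math.log((1 + term_counts[c][w]) / (len(vocabulary) + total_terms))
        scores[c] = score
    return max(scores, key=scores.get)

training = [
    ("jaguar engine liter", "cars"),
    ("engine car liter", "cars"),
    ("jaguar cat speed", "animals"),
    ("cat animal run", "animals"),
]
model = train(training)
label_1 = classify("jaguar engine", model)
label_2 = classify("jaguar cat", model)
```

Without smoothing, "engine" would have probability 0 in class animals and "cat" probability 0 in class cars, so a single unseen term would veto a class regardless of the other evidence.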

7.4 Text and Web Classification: Experimental Evaluation [Craven et al. 1998]
- Training data
  » 4127 web pages of CS departments
  » Classes: department, faculty, staff, student, course, ..., other
- Major results
  » Classification accuracy of 70% to 80% for most classes.
  » Only 9% accuracy for class staff, but 80% correct in superclass person.
  » Low classification accuracy for class other.
  » Web pages are harder to classify than professional text documents.

7.4 Text and Web Classification: Exploiting Unlabeled Documents [Nigam et al. 1999]
- Labeling is laborious, but many unlabeled documents are available.
- Exploit unlabeled documents to improve classification accuracy.
[Figure: initial training on the labeled docs yields a classifier; the classifier assigns weighted class labels to the unlabeled docs (Estimation) and is then re-trained (Maximization).]

7.4 Text and Web Classification: Exploiting Unlabeled Documents
- Let training documents d belong to classes in a graded manner Pr(c | d).
- Initially labeled documents have 0/1 membership.

Expectation Maximization (EM)
  repeat
    calculate class model parameters P(d_i | c);
    determine membership probabilities P(c | d);
  until classification accuracy no longer improves.
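The EM loop can be sketched with a smoothed Naive Bayes model whose counts are weighted by the membership probabilities. The sketch makes assumptions that the slides leave open: a fixed number of rounds replaces the accuracy-based stopping criterion, class priors are smoothed as well, labeled documents keep their 0/1 memberships throughout, and the tiny document collection is invented for illustration.

```python
import math
from collections import defaultdict

def m_step(docs, memberships, classes, vocabulary):
    """Estimate class priors and smoothed term probabilities from weighted documents."""
    prior = {c: (1 + sum(m[c] for m in memberships)) / (len(classes) + len(docs))
             for c in classes}
    term_prob = {}
    for c in classes:
        counts = defaultdict(float)
        for words, m in zip(docs, memberships):
            for w in words:
                counts[w] += m[c]          # each occurrence counts with weight P(c | d)
        total = sum(counts.values())
        term_prob[c] = {w: (1 + counts[w]) / (len(vocabulary) + total)
                        for w in vocabulary}
    return prior, term_prob

def e_step(words, prior, term_prob):
    """Membership probabilities P(c | d) of one document under the current model."""
    log_scores = {c: math.log(prior[c]) + sum(math.log(term_prob[c][w]) for w in words)
                  for c in prior}
    top = max(log_scores.values())
    weights = {c: math.exp(s - top) for c, s in log_scores.items()}
    norm = sum(weights.values())
    return {c: w / norm for c, w in weights.items()}

def em(labeled, unlabeled, classes, rounds=10):
    lab_docs = [text.split() for text, _ in labeled]
    unl_docs = [text.split() for text in unlabeled]
    vocabulary = {w for d in lab_docs + unl_docs for w in d}
    lab_m = [{c: float(c == label) for c in classes} for _, label in labeled]
    prior, term_prob = m_step(lab_docs, lab_m, classes, vocabulary)  # initial training
    for _ in range(rounds):
        unl_m = [e_step(d, prior, term_prob) for d in unl_docs]
        prior, term_prob = m_step(lab_docs + unl_docs, lab_m + unl_m, classes, vocabulary)
    return prior, term_prob

labeled = [("engine liter", "cars"), ("cat run", "animals")]
unlabeled = ["engine jaguar car", "jaguar car liter", "cat jaguar speed", "speed run cat"]
prior, term_prob = em(labeled, unlabeled, ["cars", "animals"])
p_car = e_step(["car"], prior, term_prob)
p_speed = e_step(["speed"], prior, term_prob)
```

The words "car" and "speed" never occur in a labeled document; the model learns their class association only through the unlabeled documents in which they co-occur with labeled vocabulary.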

7.4 Text and Web Classification: Exploiting Unlabeled Documents
[Figure: classification accuracy with and without unlabeled documents.]
- Small labeled set: large accuracy boost.

7.4 Text and Web Classification: Exploiting Hyperlinks [Chakrabarti, Dom & Indyk 1998]
- Logical documents are often fragmented into several web pages.
- Neighboring web pages often belong to the same class.
- Web documents are extremely diverse.
- Standard text classifiers perform poorly on web pages.
- For classification of web pages, use
  » the text of the page itself,
  » the text and class labels of neighboring pages.

7.4 Text and Web Classification: Exploiting Hyperlinks
- c = class, t = text, N = neighbors
- Text-only model: P(t | c)
- Using neighbors' text to judge my topic: P(t, t(N) | c)
- Using neighbors' class labels to judge my topic: P(t, c(N) | c)
- Most of the neighbors' class labels are unknown.
- Method: iterative relaxation labeling.

7.4 Text and Web Classification: Experimental Evaluation
- Using neighbors' text increases the classification error, even when tagging non-local texts.
- Using neighbors' class labels reduces the classification error.
- Using text and class of neighbors yields the lowest classification error.
[Figure: %Error versus %Neighborhood known, with the curves Text, Class and Text+Class.]

7.5 References
- Beil F., Ester M., Xu X.: "Frequent Term-Based Text Clustering", Proc. KDD 02.
- Brin S., Page L.: "The anatomy of a large-scale hypertextual Web search engine", Proc. WWW7.
- Chakrabarti S., Dom B., Indyk P.: "Enhanced hypertext categorization using hyperlinks", Proc. ACM SIGMOD.
- Craven M., DiPasquo D., Freitag D., McCallum A., Mitchell T., Nigam K., Slattery S.: "Learning to Extract Symbolic Knowledge from the World Wide Web", Proc. AAAI.
- Deerwester S., Dumais S. T., Furnas G. W., Landauer T. K., Harshman R.: "Indexing by latent semantic analysis", Journal of the American Society for Information Science, 41(6).
- Ester M., Ge R., Gao B. J., Hu Z., Ben-Moshe B.: "Joint Cluster Analysis of Attribute Data and Relationship Data: the Connected k-center Problem", Proc. SDM.
- Kleinberg J.: "Authoritative sources in a hyperlinked environment", Proc. SODA.
- Nigam K., McCallum A., Thrun S., Mitchell T.: "Text Classification from Labeled and Unlabeled Documents using EM", Machine Learning, 39(2/3).
- Steinbach M., Karypis G., Kumar V.: "A comparison of document clustering techniques", Proc. KDD Workshop on Text Mining.
- Zamir O., Etzioni O.: "Web Document Clustering: A Feasibility Demonstration", Proc. SIGIR.
