7. Mining Text and Web Data

Size: px
Start display at page:

Download "7. Mining Text and Web Data"

Transcription

1 7. Mining Text and Web Data Contents of this Chapter 7.1 Introduction 7.2 Data Preprocessing 7.3 Text and Web Clustering 7.4 Text and Web Classification 7.5 References [Han & Kamber 2006, Sections 10.4 and 10.5] SFU, CMPT 459 / 741, 06-3, Martin Ester 367

2 7.1 The Web and Web Search Repository Web Server Crawler Storage Server Clustering Classification Indexer The jaguar has a 4 liter engine Topic Hierarchy Root News Science Plants Animals Business Computers Inverted Index engine jaguar cat Automobiles The jaguar, a cat, can run at speeds reaching 50 mph Documents in repository Search Query jaguar SFU, CMPT 459 / 741, 06-3, Martin Ester 368

3 7.1 Web Search Engines Keyword Search data mining 519,000 results SFU, CMPT 459 / 741, 06-3, Martin Ester 369

4 7.1 Web Search Engines Boolean search care AND NOT old Phrases and proximity new care loss NEAR/5 care <SENTENCE> Indexing Inverted Lists How to rank the result lists? My 0 care 1 is loss of care with old care done Your care is gain of care with new care won care D1: 1, 5, 8 D2: 1, 5, 8 new D2: 7 old D1: 7 loss D1: 3 D1 D2 SFU, CMPT 459 / 741, 06-3, Martin Ester 370

5 7.1 Ranking Web Pages Page Rank [Brin & Page 98] Idea: the more high ranked pages link to a web page, the higher its rank. PageRank( v) = p + (1 p) Interpretation by random walk: PageRank is the probability that a random surfer visits a page» Parameter p is probability that the surfer gets bored and starts on a new random page.»(1-p) is the probability that the random surfer follows a link on current page. PageRanks correspond to principal eigenvector of the normalized link matrix. Can be calculated using an efficient iterative algorithm. u v PageRank( u) OutDegree( u) SFU, CMPT 459 / 741, 06-3, Martin Ester 371

6 7.1 Ranking Web Pages Hyperlink-Induced Topic Search (HITS) [Kleinberg 98] Definitions Authorities: highly-referenced pages on a topic. Hubs: pages that point to authorities. A good authority is pointed to by many good hubs; a good hub points to many good authorities. Hubs Authorities SFU, CMPT 459 / 741, 06-3, Martin Ester 372

7 Method 7.1 Ranking Web Pages HITS Collect seed set of pages S (e.g., returned by search engine). Expand seed set to contain pages that point to or are pointed to by pages in seed set. Initialize all hub/authority weights to 1. Iteratively update hub weight h(p) and authority weight a(p) for each page: q p Stop, when hub/authority weights converge. a ( p) = h( q) h( p) = a( q) p q SFU, CMPT 459 / 741, 06-3, Martin Ester 373

8 7.1 Ranking Web Pages Comparison PageRanks computed initially for web pages independent of search query. PageRank is used by web search engines such as Google. HITS: Hub and authority weights computed for different root sets in the context of a particular search query. HITS is applicable, e.g., for ranking the crawl front of a focused web crawler. Can use the relevance of a web page for the given topic to weight the edges of the web (sub) graph. SFU, CMPT 459 / 741, 06-3, Martin Ester 374

9 7.1 Directory Services Browsing data mining ~ 200 results SFU, CMPT 459 / 741, 06-3, Martin Ester 375

10 7.1 Directory Services Topic Hierarchy Provides a hierarchical classification of documents / web pages (e.g., Yahoo!) Yahoo home page Recreation Business Science News Travel Sports Companies Finance Jobs Topic hierarchy can be browsed interactively to find relevant web pages. Keyword searches performed in the context of a topic: return only a subset of web pages related to the topic. SFU, CMPT 459 / 741, 06-3, Martin Ester 376

11 7.1 Mining Text and Web Data Shortcomings of the Current Web Search Methods Low precision Thousands of irrelevant documents returned by web search engine 99% of information of no interest to 99% of people. Low recall In particular, for directory services (due to manual acquisition). Even largest crawlers cover less than 50% of all web pages. Low quality Many results are out-dated, broken links etc. SFU, CMPT 459 / 741, 06-3, Martin Ester 377

12 7.1 Mining Text and Web Data Data Mining Tasks Classification of text documents / web pages To insert into topic hierarchy, as feedback for focused web crawlers,... Clustering of text documents / web pages To create topic hierarchy, as front-end for web search engines... Resource discovery Discovery of relevant web pages using a focused web crawler, in particular ranking of the links at the crawl front SFU, CMPT 459 / 741, 06-3, Martin Ester 378

13 7.1 Mining Text and Web Data Ambiguity of natural language synonyms, homonyms,... Challenges Different natural languages English, Chinese,... Very high dimensionality of the feature spaces thousands of relevant terms Multiple datatypes text, images, videos,... Web is extremely dynamic > 1 million pages added each day Web is a very large distributed database huge number of objects, very high access costs SFU, CMPT 459 / 741, 06-3, Martin Ester 379

14 7.2 Text Representation Preprocessing Remove HTML tags, punctuation etc. Define terms (single-word / multi-word terms) Remove stopwords Perform stemming Count term frequencies Some words are more important than others smooth the frequencies, e.g. weight by inverse document frequency SFU, CMPT 459 / 741, 06-3, Martin Ester 380

15 Transformation Different definitions of inverse document frequency n(d,t): number of occurrences of term t in document d n( d, t) n( d, t) Select significant subset of all occurring terms Vocabulary V, term t i, document d represented as { n( d, } t V rep( d) = ) 7.2 Text Representation d t, t i i n( d, t), n( d, t) n( d, t) max n( d, t) Vector Space Model Most n s are zeroes for a single document t SFU, CMPT 459 / 741, 06-3, Martin Ester 381

16 7.2 Text Representation Similarity Function similarity( d 1, d mining 2 ) rep( d1), rep( d 2) =, = inner product rep( d1) rep( d 2) similar dissimilar Cosine Similarity data document SFU, CMPT 459 / 741, 06-3, Martin Ester 382

17 7.2 Text Representation Semantic Similarity Where can I fix my scooter? A great garage to repair your 2-wheeler is auto car auto car car auto auto car car auto car auto A scooter is a 2-wheeler. Fix and repair are synonyms. car auto Two basic approaches for semantic similarity Hand-made thesaurus (e.g., WordNet) Co-occurrence and associations auto car SFU, CMPT 459 / 741, 06-3, Martin Ester 383

18 7.3 Text and Web Clustering Standard (Text) Clustering Methods Bisecting k-means Agglomerative Hierarchical Clustering Overview Specialised Text Clustering Methods Suffix Tree Clustering Frequent-Termset-Based Clustering Joint Cluster Analysis Attribute data: text content Relationship data: hyperlinks SFU, CMPT 459 / 741, 06-3, Martin Ester 384

19 7.3 Text and Web Clustering Bisecting k-means [Steinbach, Karypis & Kumar 2000] K-means Bisecting k-means Partition the database into 2 clusters Repeat: partition the largest cluster into 2 clusters... Until k clusters have been discovered SFU, CMPT 459 / 741, 06-3, Martin Ester 385

20 7.3 Text and Web Clustering Bisecting k-means Two types of clusterings Hierarchical clustering Flat clustering: any cut of this hierarchy Distance Function Cosine measure (similarity measure) s( α, β ) = g( c( α)), g( c( β )) g( c( α)) g( c( β )) SFU, CMPT 459 / 741, 06-3, Martin Ester 386

21 7.3 Text and Web Clustering Agglomerative Hierarchical Clustering 1. Form initial clusters consisting of a singleton object, and compute the distance between each pair of clusters. 2. Merge the two clusters having minimum distance. 3. Calculate the distance between the new cluster and all other clusters. 4. If there is only one cluster containing all objects: Stop, otherwise go to step 2. Representation of a Cluster C rep( C) = n( d, t ) d C i t V i SFU, CMPT 459 / 741, 06-3, Martin Ester 387

22 7.3 Text and Web Clustering Experimental Comparison [Steinbach, Karypis & Kumar 2000] Clustering Quality Measured as entropy on a prelabeled test data set Using several text and web data sets Bisecting k-means outperforms k-means. Bisecting k-means outperforms agglomerative hierarchical clustering. Efficiency Bisecting k-means is much more efficient than agglomerative hierarchical clustering. O(n) vs. O(n 2 ) SFU, CMPT 459 / 741, 06-3, Martin Ester 388

23 7.3 Text and Web Clustering Suffix Tree Clustering [Zamir & Etzioni 1998] Forming Clusters Not by similar feature vectors But by common terms Strengths of Suffix Tree Clustering (STC) Method Efficiency: runtime O(n) for n text documents Overlapping clusters 1. Identification of basic clusters 2. Combination of basic clusters SFU, CMPT 459 / 741, 06-3, Martin Ester 389

24 7.3 Text and Web Clustering Identification of Basic Clusters Basic Cluster: set of documents sharing one specific phrase Phrase: multi-word term Efficient identification of basic clusters using a suffix-tree cat ate cheese ate cheese cheese Insertion of (1) cat ate cheese too 2, 4 1, 1 1, 2 1, 3 cat ate cheese ate cheese cheese mouse ate cheese too 1, 1 1, 2 too 1, 3 too 2, 1 Insertion of (2) mouse ate cheese too 2, 2 2, 3 SFU, CMPT 459 / 741, 06-3, Martin Ester 390

25 7.3 Text and Web Clustering Combination of Basic Clusters Basic clusters are highly overlapping Merge basic clusters having too much overlap Basic clusters graph: nodes represent basic clusters Edge between A and B iff A B / A > 0,5 and A B / B > 0,5 Composite cluster: a component of the basic clusters graph Drawback of this approach: Distant members of the same component need not be similar No evaluation on standard test data SFU, CMPT 459 / 741, 06-3, Martin Ester 391

26 7.3 Text and Web Clustering Example from the Grouper System SFU, CMPT 459 / 741, 06-3, Martin Ester 392

27 7.3 Text and Web Clustering Frequent-Term-Based Clustering [Beil, Ester & Xu 2002] Frequent term set: description of a cluster Set of documents containing all terms of the frequent term set: cluster Clustering: subset of set of all frequent term sets covering the DB with a low mutual overlap {} {D 1,..., D 16 } {sun} {fun} {beach} {D 1, D 2, D 4, D 5, {D 1, D 3, D 4, D 6, {D 2, D 7, D 8, D 6, D 8, D 9, D 10, D 7, D 8, D 10, D 11, D 9, D 10, D 12, D 11, D 13, D 15 } D 14, D 15, D 16 } D 13,D 14, D 15 } {sun, fun} {sun, beach} {fun, surf} {beach, surf} {D 1, D 4, D 6, D 8, {D 2, D 8, D 9, D 10, D 11, D 15 } D 10, D 11, D 15 } {sun, fun, surf} {sun, beach, fun} {D 1, D 6, D 10, D 11 } {D 8, D 10, D 11, D 15 } SFU, CMPT 459 / 741, 06-3, Martin Ester 393

28 7.3 Text and Web Clustering Method Task: efficient calculation of the overlap of a given cluster (description) F i with the union of the other cluster (description)s F: set of all frequent term sets in D f j : the number of all frequent term sets supported by document D j f j = { Fi F Fi D j} Standard overlap of a cluster C i : SO( C i ) = Dj Ci ( f C i j 1) SFU, CMPT 459 / 741, 06-3, Martin Ester 394

29 7.3 Text and Web Clustering Algorithm FTC FTC(database D, float minsup) SelectedTermSets:= {}; n:= D ; RemainingTermSets:= DetermineFrequentTermsets(D, minsup); while cov(selectedtermsets) n do for each set in RemainingTermSets do Calculate overlap for set; BestCandidate:=element of Remaining TermSets with minimum overlap; SelectedTermSets:= SelectedTermSets {BestCandidate}; RemainingTermSets:= RemainingTermSets - {BestCandidate}; Remove all documents in cov(bestcandidate) from D and from the coverage of all of the RemainingTermSets; return SelectedTermSets; SFU, CMPT 459 / 741, 06-3, Martin Ester 395

30 7.3 Text and Web Clustering Experimental Evaluation Classic Dataset SFU, CMPT 459 / 741, 06-3, Martin Ester 396

31 7.3 Text and Web Clustering Experimental Evaluation Reuters Dataset FTC achieves comparable clustering quality more efficiently Better cluster description SFU, CMPT 459 / 741, 06-3, Martin Ester 397

32 7.3 Text and Web Clustering Joint Cluster Analysis [Ester, Ge, Gao et al 2006] Attribute data: intrinsic properties of entities Relationship data: extrinsic properties of entities Existing clustering algorithms use either attribute or relationship data Often: attribute and relationship data somewhat related, but contain complementary information Informative Graph Edges = relationships 2D location = attributes Joint cluster analysis of attribute and relationship data SFU, CMPT 459 / 741, 06-3, Martin Ester 398

33 7.3 Text and Web Clustering The Connected k-center Problem Given Dataset D as informative graph k -# of clusters Connectivity constraint Clusters need to be connected graph components Objective function maximum (attribute) distance (radius) between nodes and cluster center Task Find k centers (nodes) and a corresponding partitioning of D into k clusters such that each clusters satisfies the connectivity constraint and the objective function is minimized SFU, CMPT 459 / 741, 06-3, Martin Ester 399

34 7.3 Text and Web Clustering The Hardness of the Connected k-center Problem Given k centers and a radius threshold r, finding the optimal assignment for the remaining nodes is NP-hard Because of articulation nodes (e.g., a) b Assignment of a implies the assignment of b C 1 C 2 a Assignment of a to C 1 minimizes the radius SFU, CMPT 459 / 741, 06-3, Martin Ester 400

35 7.3 Text and Web Clustering Example: CS Publications 1073: Kempe, Dobra, Gehrke: Gossip-based computation of aggregate information, FOCS' : Fagin, Kleinberg, Raghavan: Query strategies for priced information, STOC'02 True cluster labels based on conference SFU, CMPT 459 / 741, 06-3, Martin Ester 401

36 7.3 Text and Web Clustering Application for Web Pages Attribute data text content Relationship data hyperlinks Connectivity constraint clusters must be connected New challenges clusters can be overlapping not all web pages must belong to a cluster (noise)... SFU, CMPT 459 / 741, 06-3, Martin Ester 402

37 7.4 Text and Web Classification Overview Standard Text Classification Methods Naive Bayes Classifier Bag of Words Model Advanced Methods Use EM to exploit non-labeled training documents Exploit hyperlinks SFU, CMPT 459 / 741, 06-3, Martin Ester 403

38 7.4 Text and Web Classification Naive Bayes Classifier Decision rule of the Bayes Classifier: argmax c C Assumptions of the Naive Bayes Classifier: j P( d c j ) P( c j ) d = (d 1,..., d d ) the attribute values d i are independent from each other Decision rule of the Naive Bayes Classifier : argmax P( c c C j ) d i= j 1 P( d i c j ) SFU, CMPT 459 / 741, 06-3, Martin Ester 404

39 7.4 Text and Web Classification Bag of Words Model Allows to estimate the P(d j c) from the training documents Neglects position (order) of terms in document (bag!) Model for generating a document d of class c consisting of n terms P( d i c conditional probability of observing term t i, j ) given that the document belongs to class c Naiveapproach : P( d i c j ) = d c d c j j, t n( d, t k V i n( d, t ) k SFU, CMPT 459 / 741, 06-3, Martin Ester 405 )

40 SFU, CMPT 459 / 741, 06-3, Martin Ester Text and Web Classification Bag of Words Model Problem Term t i appears in no training document of class c j t i appears in a document d to be classified Document d also contains terms which strongly indicate class c j, but: P(d j c) = 0 and P(d c) = 0 Solution Smoothing of the relative frequencies in the training documents = c j d t i d n 0 ), ( ), ( 1 ), ( ) (, V t d n t d n c d P V t c d k c d i j i k j j + + =

41 7.4 Text and Web Classification Training data Experimental Evaluation [Craven et al. 1998] 4127 web pages of CS departments Classes: department, faculty, staff, student, course,..., other Major Results Classification accuracy of 70% to 80 % for most classes Only 9% accuracy for class staff, but 80% correct in superclass person Low classification accuracy for class other Web pages are harder to classify than professional text documents SFU, CMPT 459 / 741, 06-3, Martin Ester 407

42 7.4 Text and Web Classification Exploiting Unlabeled Documents [Nigam et al. 1999] Labeling is laborious, but many unlabeled documents available. Exploit unlabeled documents to improve classification accuracy: Initial training Labeled Docs Assign weighted class labels (Estimation) re-train (Maximization) Classifier Unlabeled Docs SFU, CMPT 459 / 741, 06-3, Martin Ester 408

43 7.4 Text and Web Classification Exploiting Unlabeled Documents Let training documents d belong to classes in a graded manner Pr(c d). Initially labeled documents have 0/1 membership. Expectation Maximization (EM) repeat calculate class model parameters P(d i c); determine membership probabilities P(c d); until classification accuracy does no longer improve. SFU, CMPT 459 / 741, 06-3, Martin Ester 409

44 7.4 Text and Web Classification Exploiting Unlabeled Documents Small labeled set large accuracy boost SFU, CMPT 459 / 741, 06-3, Martin Ester 410

45 7.4 Text and Web Classification Exploiting Hyperlinks [Chakrabarti, Dom & Indyk 1998] Logical documents are often fragmented into several web pages. Neighboring web pages often belong to the same class. Web documents are extremely diverse. Standard text classifiers perform poorly on web pages. For classification of web pages, use text of the page itself text and class labels of neighboring pages. SFU, CMPT 459 / 741, 06-3, Martin Ester 411

46 7.4 Text and Web Classification Exploiting Hyperlinks c = class, t = text, N = neighbors Text-only model: P(t c) Using neighbors text to judge my topic: P(t, t(n) c) Using neighbors class label to judge my topic: P(t, c(n) c) Most of the neighbor s class labels are unknown. Method: iterative relaxation labeling.? SFU, CMPT 459 / 741, 06-3, Martin Ester 412

47 7.4 Text and Web Classification Experimental Evaluation Using neighbors text increases classification error. Even when tagging non-local texts. Using neighbors class labels reduces classification error. Using text and class of neighbors yields the lowest classification error. %Error %Neighborhood known Text Class Text+Class SFU, CMPT 459 / 741, 06-3, Martin Ester 413

48 7.5 References Beil F., Ester M., Xu X.: "Frequent Term-Based Text Clustering", Proc. KDD 02, Brin S., Page L.: The anatomy of a large-scale hypertextual Web search engine, Proc. WWW7, Chakrabarti S., Dom B., Indyk P.: Enhanced hypertext categorization using hyperlinks, Proc. ACM SIGMOD Craven M., DiPasquo D., Freitag D., McCallum A., Mitchell T., Nigam K., Slattery S.: Learning to Extract Symbolic Knowledge from the World Wide Web, Proc. AAAI Deerwater S., Dumais S. T., Furnas G. W., Landauer T. K., Harshman R.: Indexing by latent semantic analysis, Journal of the Society for Information Science, 41(6), Ester M., Ge R., Gao B. J., Hu Z., Ben-Moshe B.: Joint Cluster Analysis of Attribute Data and Relationship Data: the Connected k-center Problem, Proc. SDM Kleinberg J.: Authoritative sources in a hyperlinked environment, Proc. SODA Nigam K., McCallum A., Thrun S., Mitchell T.: Text Classification from Labeled and Unlabeled Documents using EM, Machine Learning, 39(2/3), pp , Steinbach M., Karypis G., Kumar V.: A comparison of document clustering techniques, Proc. KDD Workshop on Text Mining, Zamir O., Etzioni O.: Web Document Clustering: A Feasibility Demonstration, Proc. SIGIR SFU, CMPT 459 / 741, 06-3, Martin Ester 414

Focused crawling: a new approach to topic-specific Web resource discovery. Authors

Focused crawling: a new approach to topic-specific Web resource discovery. Authors Focused crawling: a new approach to topic-specific Web resource discovery Authors Soumen Chakrabarti Martin van den Berg Byron Dom Presented By: Mohamed Ali Soliman m2ali@cs.uwaterloo.ca Outline Why Focused

More information

Introduction p. 1 What is the World Wide Web? p. 1 A Brief History of the Web and the Internet p. 2 Web Data Mining p. 4 What is Data Mining? p.

Introduction p. 1 What is the World Wide Web? p. 1 A Brief History of the Web and the Internet p. 2 Web Data Mining p. 4 What is Data Mining? p. Introduction p. 1 What is the World Wide Web? p. 1 A Brief History of the Web and the Internet p. 2 Web Data Mining p. 4 What is Data Mining? p. 6 What is Web Mining? p. 6 Summary of Chapters p. 8 How

More information

Chapter 27 Introduction to Information Retrieval and Web Search

Chapter 27 Introduction to Information Retrieval and Web Search Chapter 27 Introduction to Information Retrieval and Web Search Copyright 2011 Pearson Education, Inc. Publishing as Pearson Addison-Wesley Chapter 27 Outline Information Retrieval (IR) Concepts Retrieval

More information

Chapter 6: Information Retrieval and Web Search. An introduction

Chapter 6: Information Retrieval and Web Search. An introduction Chapter 6: Information Retrieval and Web Search An introduction Introduction n Text mining refers to data mining using text documents as data. n Most text mining tasks use Information Retrieval (IR) methods

More information

Part I: Data Mining Foundations

Part I: Data Mining Foundations Table of Contents 1. Introduction 1 1.1. What is the World Wide Web? 1 1.2. A Brief History of the Web and the Internet 2 1.3. Web Data Mining 4 1.3.1. What is Data Mining? 6 1.3.2. What is Web Mining?

More information

COMP5331: Knowledge Discovery and Data Mining

COMP5331: Knowledge Discovery and Data Mining COMP5331: Knowledge Discovery and Data Mining Acknowledgement: Slides modified based on the slides provided by Lawrence Page, Sergey Brin, Rajeev Motwani and Terry Winograd, Jon M. Kleinberg 1 1 PageRank

More information

LET:Towards More Precise Clustering of Search Results

LET:Towards More Precise Clustering of Search Results LET:Towards More Precise Clustering of Search Results Yi Zhang, Lidong Bing,Yexin Wang, Yan Zhang State Key Laboratory on Machine Perception Peking University,100871 Beijing, China {zhangyi, bingld,wangyx,zhy}@cis.pku.edu.cn

More information

Bing Liu. Web Data Mining. Exploring Hyperlinks, Contents, and Usage Data. With 177 Figures. Springer

Bing Liu. Web Data Mining. Exploring Hyperlinks, Contents, and Usage Data. With 177 Figures. Springer Bing Liu Web Data Mining Exploring Hyperlinks, Contents, and Usage Data With 177 Figures Springer Table of Contents 1. Introduction 1 1.1. What is the World Wide Web? 1 1.2. A Brief History of the Web

More information

Information Retrieval

Information Retrieval Information Retrieval CSC 375, Fall 2016 An information retrieval system will tend not to be used whenever it is more painful and troublesome for a customer to have information than for him not to have

More information

An Adaptive Agent for Web Exploration Based on Concept Hierarchies

An Adaptive Agent for Web Exploration Based on Concept Hierarchies An Adaptive Agent for Web Exploration Based on Concept Hierarchies Scott Parent, Bamshad Mobasher, Steve Lytinen School of Computer Science, Telecommunication and Information Systems DePaul University

More information

Information Retrieval. (M&S Ch 15)

Information Retrieval. (M&S Ch 15) Information Retrieval (M&S Ch 15) 1 Retrieval Models A retrieval model specifies the details of: Document representation Query representation Retrieval function Determines a notion of relevance. Notion

More information

Mining Web Data. Lijun Zhang

Mining Web Data. Lijun Zhang Mining Web Data Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction Web Crawling and Resource Discovery Search Engine Indexing and Query Processing Ranking Algorithms Recommender Systems

More information

Clustering Results. Result List Example. Clustering Results. Information Retrieval

Clustering Results. Result List Example. Clustering Results. Information Retrieval Information Retrieval INFO 4300 / CS 4300! Presenting Results Clustering Clustering Results! Result lists often contain documents related to different aspects of the query topic! Clustering is used to

More information

Representation/Indexing (fig 1.2) IR models - overview (fig 2.1) IR models - vector space. Weighting TF*IDF. U s e r. T a s k s

Representation/Indexing (fig 1.2) IR models - overview (fig 2.1) IR models - vector space. Weighting TF*IDF. U s e r. T a s k s Summary agenda Summary: EITN01 Web Intelligence and Information Retrieval Anders Ardö EIT Electrical and Information Technology, Lund University March 13, 2013 A Ardö, EIT Summary: EITN01 Web Intelligence

More information

Hierarchical Document Clustering

Hierarchical Document Clustering Hierarchical Document Clustering Benjamin C. M. Fung, Ke Wang, and Martin Ester, Simon Fraser University, Canada INTRODUCTION Document clustering is an automatic grouping of text documents into clusters

More information

Link Based Clustering of Web Search Results

Link Based Clustering of Web Search Results Link Based Clustering of Web Search Results Yitong Wang and Masaru Kitsuregawa stitute of dustrial Science, The University of Tokyo {ytwang, kitsure}@tkl.iis.u-tokyo.ac.jp Abstract. With information proliferation

More information

Dynamic Visualization of Hubs and Authorities during Web Search

Dynamic Visualization of Hubs and Authorities during Web Search Dynamic Visualization of Hubs and Authorities during Web Search Richard H. Fowler 1, David Navarro, Wendy A. Lawrence-Fowler, Xusheng Wang Department of Computer Science University of Texas Pan American

More information

Automatic Web Page Categorization using Principal Component Analysis

Automatic Web Page Categorization using Principal Component Analysis Automatic Web Page Categorization using Principal Component Analysis Richong Zhang, Michael Shepherd, Jack Duffy, Carolyn Watters Faculty of Computer Science Dalhousie University Halifax, Nova Scotia,

More information

CS473: Course Review CS-473. Luo Si Department of Computer Science Purdue University

CS473: Course Review CS-473. Luo Si Department of Computer Science Purdue University CS473: CS-473 Course Review Luo Si Department of Computer Science Purdue University Basic Concepts of IR: Outline Basic Concepts of Information Retrieval: Task definition of Ad-hoc IR Terminologies and

More information

Concept-Based Document Similarity Based on Suffix Tree Document

Concept-Based Document Similarity Based on Suffix Tree Document Concept-Based Document Similarity Based on Suffix Tree Document *P.Perumal Sri Ramakrishna Engineering College Associate Professor Department of CSE, Coimbatore perumalsrec@gmail.com R. Nedunchezhian Sri

More information

Information Retrieval. hussein suleman uct cs

Information Retrieval. hussein suleman uct cs Information Management Information Retrieval hussein suleman uct cs 303 2004 Introduction Information retrieval is the process of locating the most relevant information to satisfy a specific information

More information

A Survey Of Different Text Mining Techniques Varsha C. Pande 1 and Dr. A.S. Khandelwal 2

A Survey Of Different Text Mining Techniques Varsha C. Pande 1 and Dr. A.S. Khandelwal 2 A Survey Of Different Text Mining Techniques Varsha C. Pande 1 and Dr. A.S. Khandelwal 2 1 Department of Electronics & Comp. Sc, RTMNU, Nagpur, India 2 Department of Computer Science, Hislop College, Nagpur,

More information

Search Results Clustering in Polish: Evaluation of Carrot

Search Results Clustering in Polish: Evaluation of Carrot Search Results Clustering in Polish: Evaluation of Carrot DAWID WEISS JERZY STEFANOWSKI Institute of Computing Science Poznań University of Technology Introduction search engines tools of everyday use

More information

Information Retrieval CS Lecture 01. Razvan C. Bunescu School of Electrical Engineering and Computer Science

Information Retrieval CS Lecture 01. Razvan C. Bunescu School of Electrical Engineering and Computer Science Information Retrieval CS 6900 Razvan C. Bunescu School of Electrical Engineering and Computer Science bunescu@ohio.edu Information Retrieval Information Retrieval (IR) is finding material of an unstructured

More information

Abstract. 1. Introduction

Abstract. 1. Introduction A Visualization System using Data Mining Techniques for Identifying Information Sources on the Web Richard H. Fowler, Tarkan Karadayi, Zhixiang Chen, Xiaodong Meng, Wendy A. L. Fowler Department of Computer

More information

TERM BASED WEIGHT MEASURE FOR INFORMATION FILTERING IN SEARCH ENGINES

TERM BASED WEIGHT MEASURE FOR INFORMATION FILTERING IN SEARCH ENGINES TERM BASED WEIGHT MEASURE FOR INFORMATION FILTERING IN SEARCH ENGINES Mu. Annalakshmi Research Scholar, Department of Computer Science, Alagappa University, Karaikudi. annalakshmi_mu@yahoo.co.in Dr. A.

More information

A hybrid method to categorize HTML documents

A hybrid method to categorize HTML documents Data Mining VI 331 A hybrid method to categorize HTML documents M. Khordad, M. Shamsfard & F. Kazemeyni Electrical & Computer Engineering Department, Shahid Beheshti University, Iran Abstract In this paper

More information

Mining Web Data. Lijun Zhang

Mining Web Data. Lijun Zhang Mining Web Data Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction Web Crawling and Resource Discovery Search Engine Indexing and Query Processing Ranking Algorithms Recommender Systems

More information

Home Page. Title Page. Page 1 of 14. Go Back. Full Screen. Close. Quit

Home Page. Title Page. Page 1 of 14. Go Back. Full Screen. Close. Quit Page 1 of 14 Retrieving Information from the Web Database and Information Retrieval (IR) Systems both manage data! The data of an IR system is a collection of documents (or pages) User tasks: Browsing

More information

CS377: Database Systems Text data and information. Li Xiong Department of Mathematics and Computer Science Emory University

CS377: Database Systems Text data and information. Li Xiong Department of Mathematics and Computer Science Emory University CS377: Database Systems Text data and information retrieval Li Xiong Department of Mathematics and Computer Science Emory University Outline Information Retrieval (IR) Concepts Text Preprocessing Inverted

More information

9. Conclusions. 9.1 Definition KDD

9. Conclusions. 9.1 Definition KDD 9. Conclusions Contents of this Chapter 9.1 Course review 9.2 State-of-the-art in KDD 9.3 KDD challenges SFU, CMPT 740, 03-3, Martin Ester 419 9.1 Definition KDD [Fayyad, Piatetsky-Shapiro & Smyth 96]

More information

Weighted Suffix Tree Document Model for Web Documents Clustering

Weighted Suffix Tree Document Model for Web Documents Clustering ISBN 978-952-5726-09-1 (Print) Proceedings of the Second International Symposium on Networking and Network Security (ISNNS 10) Jinggangshan, P. R. China, 2-4, April. 2010, pp. 165-169 Weighted Suffix Tree

More information

Introduction to Information Retrieval

Introduction to Information Retrieval Introduction to Information Retrieval Mohsen Kamyar چهارمین کارگاه ساالنه آزمایشگاه فناوری و وب بهمن ماه 1391 Outline Outline in classic categorization Information vs. Data Retrieval IR Models Evaluation

More information

10/10/13. Traditional database system. Information Retrieval. Information Retrieval. Information retrieval system? Information Retrieval Issues

10/10/13. Traditional database system. Information Retrieval. Information Retrieval. Information retrieval system? Information Retrieval Issues COS 597A: Principles of Database and Information Systems Information Retrieval Traditional database system Large integrated collection of data Uniform access/modifcation mechanisms Model of data organization

More information

Information Retrieval Lecture 4: Web Search. Challenges of Web Search 2. Natural Language and Information Processing (NLIP) Group

Information Retrieval Lecture 4: Web Search. Challenges of Web Search 2. Natural Language and Information Processing (NLIP) Group Information Retrieval Lecture 4: Web Search Computer Science Tripos Part II Simone Teufel Natural Language and Information Processing (NLIP) Group sht25@cl.cam.ac.uk (Lecture Notes after Stephen Clark)

More information

Searching the Web [Arasu 01]

Searching the Web [Arasu 01] Searching the Web [Arasu 01] Most user simply browse the web Google, Yahoo, Lycos, Ask Others do more specialized searches web search engines submit queries by specifying lists of keywords receive web

More information

Web Structure Mining using Link Analysis Algorithms

Web Structure Mining using Link Analysis Algorithms Web Structure Mining using Link Analysis Algorithms Ronak Jain Aditya Chavan Sindhu Nair Assistant Professor Abstract- The World Wide Web is a huge repository of data which includes audio, text and video.

More information

Recent Researches on Web Page Ranking

Recent Researches on Web Page Ranking Recent Researches on Web Page Pradipta Biswas School of Information Technology Indian Institute of Technology Kharagpur, India Importance of Web Page Internet Surfers generally do not bother to go through

More information

CS 6320 Natural Language Processing

CS 6320 Natural Language Processing CS 6320 Natural Language Processing Information Retrieval Yang Liu Slides modified from Ray Mooney s (http://www.cs.utexas.edu/users/mooney/ir-course/slides/) 1 Introduction of IR System components, basic

More information

Database and Knowledge-Base Systems: Data Mining. Martin Ester

Database and Knowledge-Base Systems: Data Mining. Martin Ester Database and Knowledge-Base Systems: Data Mining Martin Ester Simon Fraser University School of Computing Science Graduate Course Spring 2006 CMPT 843, SFU, Martin Ester, 1-06 1 Introduction [Fayyad, Piatetsky-Shapiro

More information

Domain Specific Search Engine for Students

Domain Specific Search Engine for Students Domain Specific Search Engine for Students Domain Specific Search Engine for Students Wai Yuen Tang The Department of Computer Science City University of Hong Kong, Hong Kong wytang@cs.cityu.edu.hk Lam

More information

DATA MINING - 1DL105, 1DL111

DATA MINING - 1DL105, 1DL111 1 DATA MINING - 1DL105, 1DL111 Fall 2007 An introductory class in data mining http://user.it.uu.se/~udbl/dut-ht2007/ alt. http://www.it.uu.se/edu/course/homepage/infoutv/ht07 Kjell Orsborn Uppsala Database

More information

Automatically Constructing a Directory of Molecular Biology Databases

Automatically Constructing a Directory of Molecular Biology Databases Automatically Constructing a Directory of Molecular Biology Databases Luciano Barbosa Sumit Tandon Juliana Freire School of Computing University of Utah {lbarbosa, sumitt, juliana}@cs.utah.edu Online Databases

More information

Collaborative filtering based on a random walk model on a graph

Collaborative filtering based on a random walk model on a graph Collaborative filtering based on a random walk model on a graph Marco Saerens, Francois Fouss, Alain Pirotte, Luh Yen, Pierre Dupont (UCL) Jean-Michel Renders (Xerox Research Europe) Some recent methods:

More information

Information Retrieval. Lecture 11 - Link analysis

Information Retrieval. Lecture 11 - Link analysis Information Retrieval Lecture 11 - Link analysis Seminar für Sprachwissenschaft International Studies in Computational Linguistics Wintersemester 2007 1/ 35 Introduction Link analysis: using hyperlinks

More information

Roadmap. Roadmap. Ranking Web Pages. PageRank. Roadmap. Random Walks in Ranking Query Results in Semistructured Databases

Roadmap. Roadmap. Ranking Web Pages. PageRank. Roadmap. Random Walks in Ranking Query Results in Semistructured Databases Roadmap Random Walks in Ranking Query in Vagelis Hristidis Roadmap Ranking Web Pages Rank according to Relevance of page to query Quality of page Roadmap PageRank Stanford project Lawrence Page, Sergey

More information

60-538: Information Retrieval

60-538: Information Retrieval 60-538: Information Retrieval September 7, 2017 1 / 48 Outline 1 what is IR 2 3 2 / 48 Outline 1 what is IR 2 3 3 / 48 IR not long time ago 4 / 48 5 / 48 now IR is mostly about search engines there are

More information

Building Search Applications

Building Search Applications Building Search Applications Lucene, LingPipe, and Gate Manu Konchady Mustru Publishing, Oakton, Virginia. Contents Preface ix 1 Information Overload 1 1.1 Information Sources 3 1.2 Information Management

More information

Information Retrieval. Information Retrieval and Web Search

Information Retrieval. Information Retrieval and Web Search Information Retrieval and Web Search Introduction to IR models and methods Information Retrieval The indexing and retrieval of textual documents. Searching for pages on the World Wide Web is the most recent

More information

Evaluation Methods for Focused Crawling

Evaluation Methods for Focused Crawling Evaluation Methods for Focused Crawling Andrea Passerini, Paolo Frasconi, and Giovanni Soda DSI, University of Florence, ITALY {passerini,paolo,giovanni}@dsi.ing.unifi.it Abstract. The exponential growth

More information

Finding Hubs and authorities using Information scent to improve the Information Retrieval precision

Finding Hubs and authorities using Information scent to improve the Information Retrieval precision Finding Hubs and authorities using Information scent to improve the Information Retrieval precision Suruchi Chawla 1, Dr Punam Bedi 2 1 Department of Computer Science, University of Delhi, Delhi, INDIA

More information

Auto-assemblage for Suffix Tree Clustering

Auto-assemblage for Suffix Tree Clustering Auto-assemblage for Suffix Tree Clustering Pushplata, Mr Ram Chatterjee Abstract Due to explosive growth of extracting the information from large repository of data, to get effective results, clustering

More information

Bruno Martins. 1 st Semester 2012/2013

Bruno Martins. 1 st Semester 2012/2013 Link Analysis Departamento de Engenharia Informática Instituto Superior Técnico 1 st Semester 2012/2013 Slides baseados nos slides oficiais do livro Mining the Web c Soumen Chakrabarti. Outline 1 2 3 4

More information

In = number of words appearing exactly n times N = number of words in the collection of words A = a constant. For example, if N=100 and the most

In = number of words appearing exactly n times N = number of words in the collection of words A = a constant. For example, if N=100 and the most In = number of words appearing exactly n times N = number of words in the collection of words A = a constant. For example, if N=100 and the most common word appears 10 times then A = rn*n/n = 1*10/100

More information

[Gidhane* et al., 5(7): July, 2016] ISSN: IC Value: 3.00 Impact Factor: 4.116

[Gidhane* et al., 5(7): July, 2016] ISSN: IC Value: 3.00 Impact Factor: 4.116 IJESRT INTERNATIONAL JOURNAL OF ENGINEERING SCIENCES & RESEARCH TECHNOLOGY AN EFFICIENT APPROACH FOR TEXT MINING USING SIDE INFORMATION Kiran V. Gaidhane*, Prof. L. H. Patil, Prof. C. U. Chouhan DOI: 10.5281/zenodo.58632

More information

Incorporating Hyperlink Analysis in Web Page Clustering

Incorporating Hyperlink Analysis in Web Page Clustering Incorporating Hyperlink Analysis in Web Page Clustering Michael Chau School of Business The University of Hong Kong Pokfulam, Hong Kong +852 2859-1014 mchau@business.hku.hk Patrick Y. K. Chau School of

More information

Towards Breaking the Quality Curse. AWebQuerying Web-Querying Approach to Web People Search.

Towards Breaking the Quality Curse. AWebQuerying Web-Querying Approach to Web People Search. Towards Breaking the Quality Curse. AWebQuerying Web-Querying Approach to Web People Search. Dmitri V. Kalashnikov Rabia Nuray-Turan Sharad Mehrotra Dept of Computer Science University of California, Irvine

More information

2. Data Preprocessing

2. Data Preprocessing 2. Data Preprocessing Contents of this Chapter 2.1 Introduction 2.2 Data cleaning 2.3 Data integration 2.4 Data transformation 2.5 Data reduction Reference: [Han and Kamber 2006, Chapter 2] SFU, CMPT 459

More information

Clustering. Bruno Martins. 1 st Semester 2012/2013

Clustering. Bruno Martins. 1 st Semester 2012/2013 Departamento de Engenharia Informática Instituto Superior Técnico 1 st Semester 2012/2013 Slides baseados nos slides oficiais do livro Mining the Web c Soumen Chakrabarti. Outline 1 Motivation Basic Concepts

More information

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University http://cs224w.stanford.edu How to organize the Web? First try: Human curated Web directories Yahoo, DMOZ, LookSmart Second

More information

Link Analysis and Web Search

Link Analysis and Web Search Link Analysis and Web Search Moreno Marzolla Dip. di Informatica Scienza e Ingegneria (DISI) Università di Bologna http://www.moreno.marzolla.name/ based on material by prof. Bing Liu http://www.cs.uic.edu/~liub/webminingbook.html

More information

SCUBA DIVER: SUBSPACE CLUSTERING OF WEB SEARCH RESULTS

SCUBA DIVER: SUBSPACE CLUSTERING OF WEB SEARCH RESULTS SCUBA DIVER: SUBSPACE CLUSTERING OF WEB SEARCH RESULTS Fatih Gelgi, Srinivas Vadrevu, Hasan Davulcu Department of Computer Science and Engineering, Arizona State University, Tempe, AZ fagelgi@asu.edu,

More information

DATA MINING II - 1DL460. Spring 2014"

DATA MINING II - 1DL460. Spring 2014 DATA MINING II - 1DL460 Spring 2014" A second course in data mining http://www.it.uu.se/edu/course/homepage/infoutv2/vt14 Kjell Orsborn Uppsala Database Laboratory Department of Information Technology,

More information

Classification. 1 o Semestre 2007/2008

Classification. 1 o Semestre 2007/2008 Classification Departamento de Engenharia Informática Instituto Superior Técnico 1 o Semestre 2007/2008 Slides baseados nos slides oficiais do livro Mining the Web c Soumen Chakrabarti. Outline 1 2 3 Single-Class

More information

A NEW CLUSTER MERGING ALGORITHM OF SUFFIX TREE CLUSTERING

A NEW CLUSTER MERGING ALGORITHM OF SUFFIX TREE CLUSTERING A NEW CLUSTER MERGING ALGORITHM OF SUFFIX TREE CLUSTERING Jianhua Wang, Ruixu Li Computer Science Department, Yantai University, Yantai, Shandong, China Abstract: Key words: Document clustering methods

More information

SEARCH ENGINE INSIDE OUT

SEARCH ENGINE INSIDE OUT SEARCH ENGINE INSIDE OUT From Technical Views r86526020 r88526016 r88526028 b85506013 b85506010 April 11,2000 Outline Why Search Engine so important Search Engine Architecture Crawling Subsystem Indexing

More information

Learning to Rank Networked Entities

Learning to Rank Networked Entities Learning to Rank Networked Entities Alekh Agarwal Soumen Chakrabarti Sunny Aggarwal Presented by Dong Wang 11/29/2006 We've all heard that a million monkeys banging on a million typewriters will eventually

More information

Web Mining. Data Mining and Text Mining (UIC Politecnico di Milano) Daniele Loiacono

Web Mining. Data Mining and Text Mining (UIC Politecnico di Milano) Daniele Loiacono Web Mining Data Mining and Text Mining (UIC 583 @ Politecnico di Milano) References q Jiawei Han and Micheline Kamber, "Data Mining: Concepts and Techniques", The Morgan Kaufmann Series in Data Management

More information

Search Engines. Information Retrieval in Practice

Search Engines. Information Retrieval in Practice Search Engines Information Retrieval in Practice All slides Addison Wesley, 2008 Classification and Clustering Classification and clustering are classical pattern recognition / machine learning problems

More information

Slides based on those in:

Slides based on those in: Spyros Kontogiannis & Christos Zaroliagis Slides based on those in: http://www.mmds.org A 3.3 B 38.4 C 34.3 D 3.9 E 8.1 F 3.9 1.6 1.6 1.6 1.6 1.6 2 y 0.8 ½+0.2 ⅓ M 1/2 1/2 0 0.8 1/2 0 0 + 0.2 0 1/2 1 [1/N]

More information

Web Mining. Data Mining and Text Mining (UIC Politecnico di Milano) Daniele Loiacono

Web Mining. Data Mining and Text Mining (UIC Politecnico di Milano) Daniele Loiacono Web Mining Data Mining and Text Mining (UIC 583 @ Politecnico di Milano) References Jiawei Han and Micheline Kamber, "Data Mining: Concepts and Techniques", The Morgan Kaufmann Series in Data Management

More information

Chapter 9. Classification and Clustering

Chapter 9. Classification and Clustering Chapter 9 Classification and Clustering Classification and Clustering Classification and clustering are classical pattern recognition and machine learning problems Classification, also referred to as categorization

More information

Text clustering based on a divide and merge strategy

Text clustering based on a divide and merge strategy Available online at www.sciencedirect.com ScienceDirect Procedia Computer Science 55 (2015 ) 825 832 Information Technology and Quantitative Management (ITQM 2015) Text clustering based on a divide and

More information

Domain-specific Concept-based Information Retrieval System

Domain-specific Concept-based Information Retrieval System Domain-specific Concept-based Information Retrieval System L. Shen 1, Y. K. Lim 1, H. T. Loh 2 1 Design Technology Institute Ltd, National University of Singapore, Singapore 2 Department of Mechanical

More information

Information Retrieval and Web Search

Information Retrieval and Web Search Information Retrieval and Web Search Introduction to IR models and methods Rada Mihalcea (Some of the slides in this slide set come from IR courses taught at UT Austin and Stanford) Information Retrieval

More information

Semi-Supervised Clustering with Partial Background Information

Semi-Supervised Clustering with Partial Background Information Semi-Supervised Clustering with Partial Background Information Jing Gao Pang-Ning Tan Haibin Cheng Abstract Incorporating background knowledge into unsupervised clustering algorithms has been the subject

More information

Minoru SASAKI and Kenji KITA. Department of Information Science & Intelligent Systems. Faculty of Engineering, Tokushima University

Minoru SASAKI and Kenji KITA. Department of Information Science & Intelligent Systems. Faculty of Engineering, Tokushima University Information Retrieval System Using Concept Projection Based on PDDP algorithm Minoru SASAKI and Kenji KITA Department of Information Science & Intelligent Systems Faculty of Engineering, Tokushima University

More information

5/30/2014. Acknowledgement. In this segment: Search Engine Architecture. Collecting Text. System Architecture. Web Information Retrieval

5/30/2014. Acknowledgement. In this segment: Search Engine Architecture. Collecting Text. System Architecture. Web Information Retrieval Acknowledgement Web Information Retrieval Textbook by Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schutze Notes Revised by X. Meng for SEU May 2014 Contents of lectures, projects are extracted

More information

Context Based Indexing in Search Engines: A Review

Context Based Indexing in Search Engines: A Review International Journal of Computer (IJC) ISSN 2307-4523 (Print & Online) Global Society of Scientific Research and Researchers http://ijcjournal.org/ Context Based Indexing in Search Engines: A Review Suraksha

More information

Introduction to Web Clustering

Introduction to Web Clustering Introduction to Web Clustering D. De Cao R. Basili Corso di Web Mining e Retrieval a.a. 2008-9 June 26, 2009 Outline Introduction to Web Clustering Some Web Clustering engines The KeySRC approach Some

More information

Improving Suffix Tree Clustering Algorithm for Web Documents

Improving Suffix Tree Clustering Algorithm for Web Documents International Conference on Logistics Engineering, Management and Computer Science (LEMCS 2015) Improving Suffix Tree Clustering Algorithm for Web Documents Yan Zhuang Computer Center East China Normal

More information

International Journal of Advanced Research in Computer Science and Software Engineering

International Journal of Advanced Research in Computer Science and Software Engineering Volume 3, Issue 3, March 2013 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Special Issue:

More information

Application of rough ensemble classifier to web services categorization and focused crawling

Application of rough ensemble classifier to web services categorization and focused crawling With the expected growth of the number of Web services available on the web, the need for mechanisms that enable the automatic categorization to organize this vast amount of data, becomes important. A

More information

Information Retrieval and Organisation

Information Retrieval and Organisation Information Retrieval and Organisation Chapter 16 Flat Clustering Dell Zhang Birkbeck, University of London What Is Text Clustering? Text Clustering = Grouping a set of documents into classes of similar

More information

IJREAT International Journal of Research in Engineering & Advanced Technology, Volume 1, Issue 5, Oct-Nov, ISSN:

IJREAT International Journal of Research in Engineering & Advanced Technology, Volume 1, Issue 5, Oct-Nov, ISSN: IJREAT International Journal of Research in Engineering & Advanced Technology, Volume 1, Issue 5, Oct-Nov, 20131 Improve Search Engine Relevance with Filter session Addlin Shinney R 1, Saravana Kumar T

More information

Web Search Using a Graph-Based Discovery System

Web Search Using a Graph-Based Discovery System From: FLAIRS-01 Proceedings. Copyright 2001, AAAI (www.aaai.org). All rights reserved. Structural Web Search Using a Graph-Based Discovery System Nifish Manocha, Diane J. Cook, and Lawrence B. Holder Department

More information

Natural Language Processing

Natural Language Processing Natural Language Processing Machine Learning Potsdam, 26 April 2012 Saeedeh Momtazi Information Systems Group Introduction 2 Machine Learning Field of study that gives computers the ability to learn without

More information

Exam IST 441 Spring 2011

Exam IST 441 Spring 2011 Exam IST 441 Spring 2011 Last name: Student ID: First name: I acknowledge and accept the University Policies and the Course Policies on Academic Integrity This 100 point exam determines 30% of your grade.

More information

Text Document Clustering Using DPM with Concept and Feature Analysis

Text Document Clustering Using DPM with Concept and Feature Analysis Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 2, Issue. 10, October 2013,

More information

High Performance Text Document Clustering

High Performance Text Document Clustering Wright State University CORE Scholar Browse all Theses and Dissertations Theses and Dissertations 2007 High Performance Text Document Clustering Yanjun Li Wright State University Follow this and additional

More information

Reading group on Ontologies and NLP:

Reading group on Ontologies and NLP: Reading group on Ontologies and NLP: Machine Learning27th infebruary Automated 2014 1 / 25 Te Reading group on Ontologies and NLP: Machine Learning in Automated Text Categorization, by Fabrizio Sebastianini.

More information

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

CS246: Mining Massive Datasets Jure Leskovec, Stanford University CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu SPAM FARMING 2/11/2013 Jure Leskovec, Stanford C246: Mining Massive Datasets 2 2/11/2013 Jure Leskovec, Stanford

More information

Clustering CS 550: Machine Learning

Clustering CS 550: Machine Learning Clustering CS 550: Machine Learning This slide set mainly uses the slides given in the following links: http://www-users.cs.umn.edu/~kumar/dmbook/ch8.pdf http://www-users.cs.umn.edu/~kumar/dmbook/dmslides/chap8_basic_cluster_analysis.pdf

More information

Based on Raymond J. Mooney s slides

Based on Raymond J. Mooney s slides Instance Based Learning Based on Raymond J. Mooney s slides University of Texas at Austin 1 Example 2 Instance-Based Learning Unlike other learning algorithms, does not involve construction of an explicit

More information

CS-C Data Science Chapter 9: Searching for relevant pages on the Web: Random walks on the Web. Jaakko Hollmén, Department of Computer Science

CS-C Data Science Chapter 9: Searching for relevant pages on the Web: Random walks on the Web. Jaakko Hollmén, Department of Computer Science CS-C3160 - Data Science Chapter 9: Searching for relevant pages on the Web: Random walks on the Web Jaakko Hollmén, Department of Computer Science 30.10.2017-18.12.2017 1 Contents of this chapter Story

More information

How to organize the Web?

How to organize the Web? How to organize the Web? First try: Human curated Web directories Yahoo, DMOZ, LookSmart Second try: Web Search Information Retrieval attempts to find relevant docs in a small and trusted set Newspaper

More information

CSE 5243 INTRO. TO DATA MINING

CSE 5243 INTRO. TO DATA MINING CSE 5243 INTRO. TO DATA MINING Cluster Analysis: Basic Concepts and Methods Huan Sun, CSE@The Ohio State University 09/28/2017 Slides adapted from UIUC CS412, Fall 2017, by Prof. Jiawei Han 2 Chapter 10.

More information

A Comparison of Document Clustering Techniques

A Comparison of Document Clustering Techniques A Comparison of Document Clustering Techniques M. Steinbach, G. Karypis, V. Kumar Present by Leo Chen Feb-01 Leo Chen 1 Road Map Background & Motivation (2) Basic (6) Vector Space Model Cluster Quality

More information

Clustering. CE-717: Machine Learning Sharif University of Technology Spring Soleymani

Clustering. CE-717: Machine Learning Sharif University of Technology Spring Soleymani Clustering CE-717: Machine Learning Sharif University of Technology Spring 2016 Soleymani Outline Clustering Definition Clustering main approaches Partitional (flat) Hierarchical Clustering validation

More information

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

CS246: Mining Massive Datasets Jure Leskovec, Stanford University CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu HITS (Hypertext Induced Topic Selection) Is a measure of importance of pages or documents, similar to PageRank

More information