Tag-based Social Interest Discovery

Similar documents
Information Retrieval. (M&S Ch 15)

In = number of words appearing exactly n times N = number of words in the collection of words A = a constant. For example, if N=100 and the most

Chapter 6: Information Retrieval and Web Search. An introduction

Information Retrieval (IR) Introduction to Information Retrieval. Lecture Overview. Why do we need IR? Basics of an IR system.

Information Retrieval

Mining Web Data. Lijun Zhang

Searching the Deep Web

Introduction to Information Retrieval

Chapter 27 Introduction to Information Retrieval and Web Search

Mining Web Data. Lijun Zhang

Natural Language Processing

Information Retrieval

Developing Focused Crawlers for Genre Specific Search Engines

CS473: Course Review CS-473. Luo Si Department of Computer Science Purdue University

International Journal of Advanced Research in Computer Science and Software Engineering

Instructor: Stefan Savev

CHAPTER 3 ASSOCIATON RULE BASED CLUSTERING

CS 6320 Natural Language Processing

How SPICE Language Modeling Works

Searching the Deep Web

Clustering (COSC 416) Nazli Goharian. Document Clustering.

Static Pruning of Terms In Inverted Files

Web Information Retrieval using WordNet

Today s topic CS347. Results list clustering example. Why cluster documents. Clustering documents. Lecture 8 May 7, 2001 Prabhakar Raghavan

Introduction to Information Retrieval

Search Engines. Information Retrieval in Practice

Similarity search in multimedia databases

Chapter 2. Architecture of a Search Engine

Designing and Building an Automatic Information Retrieval System for Handling the Arabic Data

Information Retrieval

Document Clustering: Comparison of Similarity Measures

Domain Specific Search Engine for Students

Clustering (COSC 488) Nazli Goharian. Document Clustering.

A BFS-BASED SIMILAR CONFERENCE RETRIEVAL FRAMEWORK

Obtaining Language Models of Web Collections Using Query-Based Sampling Techniques

Ruslan Salakhutdinov and Geoffrey Hinton. University of Toronto, Machine Learning Group IRGM Workshop July 2007

modern database systems lecture 4 : information retrieval

Knowledge Discovery and Data Mining 1 (VO) ( )

Basic techniques. Text processing; term weighting; vector space model; inverted index; Web Search

Hierarchical Link Analysis for Ranking Web Data

Implementation of a High-Performance Distributed Web Crawler and Big Data Applications with Husky

Introduction p. 1 What is the World Wide Web? p. 1 A Brief History of the Web and the Internet p. 2 Web Data Mining p. 4 What is Data Mining? p.

Representation/Indexing (fig 1.2) IR models - overview (fig 2.1) IR models - vector space. Weighting TF*IDF. U s e r. T a s k s

Outline. Possible solutions. The basic problem. How? How? Relevance Feedback, Query Expansion, and Inputs to Ranking Beyond Similarity

June 15, Abstract. 2. Methodology and Considerations. 1. Introduction

Administrative. Machine learning code. Supervised learning (e.g. classification) Machine learning: Unsupervised learning" BANANAS APPLES

UNIT-V WEB MINING. 3/18/2012 Prof. Asha Ambhaikar, RCET Bhilai.

arxiv: v3 [cs.ni] 3 May 2017

Lecture 8 May 7, Prabhakar Raghavan

Information Retrieval

Learning Ontology-Based User Profiles: A Semantic Approach to Personalized Web Search

Information Retrieval CS Lecture 06. Razvan C. Bunescu School of Electrical Engineering and Computer Science

Automatically Constructing a Directory of Molecular Biology Databases

EXTRACTION OF RELEVANT WEB PAGES USING DATA MINING

Using Semantic Similarity in Crawling-based Web Application Testing. (National Taiwan Univ.)

Building Web Annotation Stickies based on Bidirectional Links

Exam IST 441 Spring 2011

Text Analytics (Text Mining)

BBS654 Data Mining. Pinar Duygulu. Slides are adapted from Nazli Ikizler

Part I: Data Mining Foundations

Home Page. Title Page. Page 1 of 14. Go Back. Full Screen. Close. Quit

Basic Tokenizing, Indexing, and Implementation of Vector-Space Retrieval

Clustering Analysis of Collaborative Tagging Systems By Using The Graph Model

Text Analytics (Text Mining)

Feature LDA: a Supervised Topic Model for Automatic Detection of Web API Documentations from the Web

Clustering CS 550: Machine Learning

A Comparison of Document Clustering Techniques

Social Interaction Based Video Recommendation: Recommending YouTube Videos to Facebook Users

University of Virginia Department of Computer Science. CS 4501: Information Retrieval Fall 2015

Information Retrieval: Retrieval Models

Improving Relevance Prediction for Focused Web Crawlers

Design and Implementation of Search Engine Using Vector Space Model for Personalized Search

CS377: Database Systems Text data and information. Li Xiong Department of Mathematics and Computer Science Emory University

Web Page Recommender System based on Folksonomy Mining for ITNG 06 Submissions

Exploratory Analysis: Clustering

Overview of Web Mining Techniques and its Application towards Web

Reddit Recommendation System Daniel Poon, Yu Wu, David (Qifan) Zhang CS229, Stanford University December 11 th, 2011

Bing Liu. Web Data Mining. Exploring Hyperlinks, Contents, and Usage Data. With 177 Figures. Springer

Relevance Feedback and Query Reformulation. Lecture 10 CS 510 Information Retrieval on the Internet Thanks to Susan Price. Outline

Web Page Similarity Searching Based on Web Content

SEO and UAEX.EDU GETTING YOUR WEB PAGES FOUND IN GOOGLE

DATA MINING - 1DL105, 1DL111

MSA220 - Statistical Learning for Big Data

FOUNDATIONAL SEO Search Engine Optimization. Adam Napolitan

Utilizing Folksonomy: Similarity Metadata from the Del.icio.us System CS6125 Project

Theme Identification in RDF Graphs

Similarity by Metadata

Monte Carlo Integration

CSE 5243 INTRO. TO DATA MINING

How Does a Search Engine Work? Part 1

Administrative. Web crawlers. Web Crawlers and Link Analysis!

Information Retrieval. Lecture 5 - The vector space model. Introduction. Overview. Term weighting. Wintersemester 2007

AMAZON.COM RECOMMENDATIONS ITEM-TO-ITEM COLLABORATIVE FILTERING PAPER BY GREG LINDEN, BRENT SMITH, AND JEREMY YORK

Knowledge Discovery and Data Mining 1 (KU)

Impact of Term Weighting Schemes on Document Clustering A Review

CS54701: Information Retrieval

Indexing: Part IV. Announcements (February 17) Keyword search. CPS 216 Advanced Database Systems

Predictive Analysis: Evaluation and Experimentation. Heejun Kim

Semi-Supervised Clustering with Partial Background Information

Part 12: Advanced Topics in Collaborative Filtering. Francesco Ricci

Transcription:

Tag-based Social Interest Discovery Xin Li / Lei Guo / Yihong (Eric) Zhao Yahoo!Inc 2008 Presented by: Tuan Anh Le (aletuan@vub.ac.be) 1

Outline Introduction Data set collection & Pre-processing Architecture for (I) Social Interest Discovery Analysis of Tags Evaluation results Conclusions 2

Introduction Social network systems are becoming more successful popular, and generate challenges Discovering social interests shared by group of users is very important Detecting and representing user s interests Two types of existing approaches: User-centric: based on social connections among users Object-centric: based on the common objects fetched by users 3

Introduction Paper s approach : discover social interests by utilizing user-generated tags Statistical analyse the real-word traces of tags and web content (delicious.com) User-generated tags are consistent with the content they are being attached Develop the Social Interest Discovery system Discovering the common user interests Clustering users and their saved URLs by topic (set of tags) 4

Data set Data is collect from delicious.com database. Each post has form: p = {user, URL, tags} How many data they collected? 4.3 millions bookmarks, 0.2 millions users, and 1.4m URLs After pre-processing: ~ 0.3m tags and 4m keywords Data collection & pre-processing 1. Crawl the URLs and download pages 2. Discard all non-html object 3. Coding to UTF 8 & removing non English paper 4. Stopword List (i.e. a, an, the etc ) 5. Porter Stemming algorithm* (i.e. fishing, fisher, fished fish ) 6. Analysis distributions of frequencies (Tags, URLs and User) over the Bookmarks 5

Data set Statistical view Distributions follow power law* (linear graph in log-log scale) Distributions have long tails! ( ~ Pareto principle: 80/20 rule) Remarks Most documents are unpopular Most users are inactive Top popular tags connect most of the users 6

Architecture of SID Discovery Social Interest by Tags Idea: Set of tags are frequently used by many users can give a hint that such users may spontaneously from a community of an interest even though they may not have any physical connection or online connection. SID is proposed based on association rules algorithm Finding frequently co-occurring tags (topics of interest) Building URLs and users clusters for such tag-based topic (clustering) Importing topics and clusters into indexing system for application queries (indexing) 7

Architecture of SID 8

Architecture of SID Clustering For each topic and all the posts contain the tag set, insert URLs and uses into two clusters. Naïve clustering algorithm: 9

Architecture of SID Indexing Clusters types: Url & user clusters are identified by topics. Topic & url clusters are identified by users. Indexing cluster supports some queries: For a given topic, list all URLs contain this topic (have been tagged with all the tags in the topic). For a given topic, list all users who are interest in that topic (have used all the tags in the topic). For a given tag, list all topics contain that tag. 10

Analysis of Tags Statistical model: Use vector space model (VSM) to describe a URL (i.e. book index) Each URL: two vectors One in the space of all tags, one in the space of all document keywords In VSM, matrix with t terms and d documents: Term-document matrix A = (a ij ) R t*d Column vector a j is a set of terms belong to document j a ij : importance of term i in document j (or weight ) 11

Analysis of Tags Statistical model Weight (a ij ) measurement Tf-based (term frequency based) a ij : importance of term i in document j f ij : frequency of term i in document j Tf-Idf based (term frequency inverse document frequency) b ij : inverse document frequency D i : number of documents contains term i d : total number of documents 12

Analysis of Tags Tags vs. document keywords An intuitive example: URL Top Tf keywords Top Tf idf keyword All tags http://ka1fsb.home.att.net/resolve.html domain,name,file,resolver,server,conf,network,nameserver,ip,org,ampr ampr,domain,jnos,nameserver,conf,ka1fsb,resolver,ip,file,name,server linux,howto,network,sysadmin,dns Tags & keywords reflect the content, and differ only literally Tags are closer to people s understanding of content than keywords Some keywords are unrelated to the content / replaced without changing meaning 13

Analysis of Tags Tags vs. document keywords Vocabulary of tags and keywords: Is vocabulary of important Keywords covered by Tags? (YES) Statistical method: 7000 randomly English web document Plot cumulative distribution function : ( x is percentages Keywords missed by Tags ) 14

Analysis of Tags Convergence of User s Tag Selection Proportions of tags in the bookmarks are quite stable for popular URLs Measure the concentration & convergence of distinct tags used by different user Y = F(X) Y: Number of distinct tags X: Popularity of URLs (#saves of URLs) 15

Analysis of Tags Tags matched by documents How well do user s tags capture the main concepts of documents? Solutions Human reviews Statistical analysis about correlation between the tags of a URL and the content of its document. T = { t i } : set of tags attached an URL U w(t) : weight of tag t (frequency of tag in data) e(t,u) : tag match ratio 16 Numerator is total weight of tags which also appear in document keyword

Evaluation results Effectiveness of SID URL clusters by computing URL similarity within & cross the clusters. Compute the similarity between pair of documents with inner product (cosine similarity) of their tf * idf keywords vectors. Select 500 interest topics, each contains > 30 bookmarked URLs that share 5-6 co-occurring user tags. Each topic: compute average cosine similarity of all URL pairs in its cluster (intra-topic) Randomly select 10,000 topic pairs, compute average pairwise document similarity between every two topics (inter-topic) 17

Evaluation results Evaluate how the topic discovered by SID cover the individual interests of users. The more frequently an user uses a tag, the higher interests he has on the corresponding topic represented by the tags. Checking if top used tags of each users are in any topic discover can be discovered by SID 80% users has 100% top 5 tags 10% user has 90% top 5 tags Over 90% user has at least 90% top 5 tags SID has correctly identified & cluster over 90% interest for more than 90% users 18

Evaluation results General properties of topic clusters. The number of clusters with a given cluster size (threshold is 30) follows a power-law distribution. Users tend to use a small number of words to summarize the contents for themselves Distributions of the number of topics as the functions of number of users & number of URLs are also follow power law 19

Conclusion Advance Propose new approach for interest discovery base on user s tags with cost effective* User tags is closer with human understanding, and capture precisely web contents. Applicable system to discover common interest topic in social networks. Don t require online or physical connection between users Disadvantage Depend on user / group of user s characteristics. (talk about same thing in # ways) Community culture Users understand about something in different levels Should combine with another approach (user-centric) to give flexibility (users can selforganize their group / or get connection s recommendation) 20