Telling Experts from Spammers Expertise Ranking in Folksonomies

Similar documents
SPEAR: SPAMMING-RESISTANT EXPERTISE ANALYSIS AND RANKING IN COLLABORATIVE TAGGING SYSTEMS

PERSONALIZED TAG RECOMMENDATION

Link Analysis and Web Search

Web Science Your Business, too!

COMP5331: Knowledge Discovery and Data Mining

COMP 4601 Hubs and Authorities

Detecting Tag Spam in Social Tagging Systems with Collaborative Knowledge

Functionality, Challenges and Architecture of Social Networks

Exploring Social Annotations for Web Document Classification

Semantic Web and Web2.0. Dr Nicholas Gibbins

Chapter 3: Google Penguin, Panda, & Hummingbird

How Social Is Social Bookmarking?

Image Credit: Photo by Lukas from Pexels

5 Choosing keywords Initially choosing keywords Frequent and rare keywords Evaluating the competition rates of search

Introduction to Data Mining

Lecture Notes: Social Networks: Models, Algorithms, and Applications Lecture 28: Apr 26, 2012 Scribes: Mauricio Monsalve and Yamini Mule

SEO: SEARCH ENGINE OPTIMISATION

A Review Paper on Page Ranking Algorithms

Slides based on those in:

Experimental study of Web Page Ranking Algorithms

Final Exam Search Engines ( / ) December 8, 2014

Jianyong Wang Department of Computer Science and Technology Tsinghua University

Information Retrieval and Organisation

Furl Furled Furling. Social on-line book marking for the masses. Jim Wenzloff Blog:

Extracting Information from Social Networks

Learning Ontology-Based User Profiles: A Semantic Approach to Personalized Web Search

Bruno Martins. 1 st Semester 2012/2013

10/10/13. Traditional database system. Information Retrieval. Information Retrieval. Information retrieval system? Information Retrieval Issues

Information Retrieval Lecture 4: Web Search. Challenges of Web Search 2. Natural Language and Information Processing (NLIP) Group

Digital Marketing Proposal

TODAY S LECTURE HYPERTEXT AND

Latent Space Model for Road Networks to Predict Time-Varying Traffic. Presented by: Rob Fitzgerald Spring 2017

Big Data Analytics CSCI 4030

Tag Recommendation for Photos

3 announcements: Thanks for filling out the HW1 poll HW2 is due today 5pm (scans must be readable) HW3 will be posted today

<is web> Information Systems & Semantic Web University of Koblenz Landau, Germany

ihits: Extending HITS for Personal Interests Profiling

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

WEB STRUCTURE MINING USING PAGERANK, IMPROVED PAGERANK AN OVERVIEW

Demystifying movie ratings 224W Project Report. Amritha Raghunath Vignesh Ganapathi Subramanian

Network Centrality. Saptarshi Ghosh Department of CSE, IIT Kharagpur Social Computing course, CS60017

How to organize the Web?

An Efficient Methodology for Image Rich Information Retrieval

CS425: Algorithms for Web Scale Data

Web Structure Mining using Link Analysis Algorithms

Analysis of Large Graphs: TrustRank and WebSpam

Towards a hybrid approach to Netflix Challenge

F. Aiolli - Sistemi Informativi 2007/2008. Web Search before Google

A Survey of various Web Page Ranking Algorithms

Advances in Natural and Applied Sciences. Information Retrieval Using Collaborative Filtering and Item Based Recommendation

6 TOOLS FOR A COMPLETE MARKETING WORKFLOW

An Approach To Improve Website Ranking Using Social Networking Site

Exploiting noisy web data for largescale visual recognition

Information Retrieval. (M&S Ch 15)

Tags, Folksonomies, and How to Use Them in Libraries

Lecture 27: Learning from relational data

Extracting Information from Complex Networks

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

TempWeb rd Temporal Web Analytics Workshop

Web Search Ranking. (COSC 488) Nazli Goharian Evaluation of Web Search Engines: High Precision Search

Understanding the Semantics of Ambiguous Tags in Folksonomies

Method to Study and Analyze Fraud Ranking In Mobile Apps

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

Multimedia Information Extraction and Retrieval Term Frequency Inverse Document Frequency

Repositorio Institucional de la Universidad Autónoma de Madrid.

ROBERTO BATTITI, MAURO BRUNATO. The LION Way: Machine Learning plus Intelligent Optimization. LIONlab, University of Trento, Italy, Apr 2015

Analytical survey of Web Page Rank Algorithm

State of the Art and Trends in Search Engine Technology. Gerhard Weikum

Survey on Community Question Answering Systems

FACILITATING VIDEO SOCIAL MEDIA SEARCH USING SOCIAL-DRIVEN TAGS COMPUTING

Web 2.0: Crowdsourcing:

Gary Viray Founder, Search Opt Media Inc. Search.Rank.Convert.

Semantic Web Systems Ontologies Jacques Fleuriot School of Informatics

Tag-based Social Interest Discovery

Design and Implementation of Search Engine Using Vector Space Model for Personalized Search

Social Search Networks of People and Search Engines. CS6200 Information Retrieval

Bibliometrics: Citation Analysis

Personalized Information Retrieval

Author Keywords Search behavior, domain expertise, exploratory search.

A web directory lists web sites by category and subcategory. Web directory entries are usually found and categorized by humans.

Einführung in Web und Data Science Community Analysis. Prof. Dr. Ralf Möller Universität zu Lübeck Institut für Informationssysteme

An Empirical Analysis of Communities in Real-World Networks

Create an Account... 2 Setting up your account... 2 Send a Tweet... 4 Add Link... 4 Add Photo... 5 Delete a Tweet...

1 Starting around 1996, researchers began to work on. 2 In Feb, 1997, Yanhong Li (Scotch Plains, NJ) filed a

The Web, Semantics and Data Mining

Getting the most from your websites SEO. A seven point guide to understanding SEO and how to maximise results

Coursework Completion

How to turn. Reviews into Revenue & Trends in Online Reputation

Browsing System for Weblog Articles based on Automated Folksonomy

Comment Extraction from Blog Posts and Its Applications to Opinion Mining

Kristina Lerman University of Southern California. This lecture is partly based on slides prepared by Anon Plangprasopchok

WebSite Grade For : 97/100 (December 06, 2007)

Ranking in a Domain Specific Search Engine

Finding Topic-centric Identified Experts based on Full Text Analysis

Tag Based Image Search by Social Re-ranking

Internet Search. (COSC 488) Nazli Goharian Nazli Goharian, 2005, Outline

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

SEO WEB DESIGN BRANDING PHOTOGRAPHY SOCIAL MEDIA

SOCIAL MEDIA. Charles Murphy

Search Computing: Business Areas, Research and Socio-Economic Challenges

Transcription:

32 nd Annual ACM SIGIR 09 Boston, USA, Jul 19-23 2009 Telling Experts from Spammers Expertise Ranking in Folksonomies Michael G. Noll (Albert) Ching-Man Au Yeung Christoph Meinel Nicholas Gibbins Nigel Shadbolt Hasso Plattner Institute Uni Southampton

2 Introduction

Background 3 Folksonomies and Collaborative Tagging Large and still increasing popularity in the WWW today Delicious.com social bookmarking service by Yahoo! with 5+ million users Web pages photos music books videos Idea: Freely annotating resources with keywords aka tags Result: bottom-up categorization by end users, aka folksonomy Used for organizing resources, sharing, self-promotion, Additional effect: new means of resource discovery

Motivation 4 Two related goals for our work on expertise in folksonomies: 11. Identifying and promoting experts for a given topic Weighting user input, giving (better) recommendations, identify trendsetters for marketing/advertising/product promotion, etc. Topic := conjunction or disjunction of one or more tags 22. Demoting spammers Reduce impact of spam and junk input thereby improving system quality, performance, operation

5 Models

Model of expert users 6 What makes an expert an expert? Postulation of two assumptions of expertise for resource discovery, grounded on literature from computer science (that s you) and psychology 11. Mutual reinforcement of user expertise and document quality Expert users tend to have many high quality documents, and high quality documents are tagged by users of high expertise. 22. Discoverers vs. followers Expert users are discoverers they tend to be the first to bookmark and tag high quality documents, thereby bringing them to the attention of the user community. Think: researchers in academia.

Model of expert users 7 user network (social graph) document network (Web graph) user tags timestamp doc Context of social bookmarking / collaborative tagging

Model of expert users 8 user network (social graph) document network (Web graph) user tags timestamp doc Our Focus

Model of expert users 9 Bookmarking history of a Web page on Delicious.com Web page Timeline Users

Model of expert users 10 Bookmarking history of a Web page Discoverers Followers john.smith December January February March April May 2009

Model of expert users 11 Credit score function C(t) earlier discovery = more credit Discoverers Followers December January February March April May 2009

12 SPEAR Algorithm

Proposed algorithm: SPEAR 13 SPEAR SPamming-resistant Expertise Analysis and Ranking Based on the HITS (Hypertext Induced Topic Search) algorithm Hubs: pages that points to good pages Authorities: pages that are pointed to by good pages Expertise and Quality (SPEAR) similar to Hub and Authority (HITS) Users are hubs we find useful pages through them Pages are authorities provide relevant information Difference: only users can point (link) to pages but not vice versa

Proposed algorithm: SPEAR 14 page page HITS / WWW user page SPEAR / Folksonomy

Proposed algorithm: SPEAR 15 Input Number of users M Number of pages N Set of taggings R tag = { (user, page, tag, timestamp) tag = tag } Credit score function C() Number of iterations k Output: Ranked list L of users by expertise in topic tag Algorithm: Set E to be the vector (1, 1,, 1) Set Q to be the vector (1, 1,, 1) A Generate_Adjacency_Matrix(R tag, C) for i = 1 to k do E Q x A T Q E x A Normalize E Normalize Q endfor L Sort users by their expertise score in E return L E: expertise of users Q: quality of pages A: user page incl. credits mutual reinforcement until convergence

Proposed algorithm: SPEAR 16 U1 U2 1 2 2 1 D1 Adjacency matrix, credits applied D1 D2 D3 U1 1.4 1.0 1.7 1.0 0.0 U2 1.0 1.4 1.0 0.0 U3 0.0 1.0 1.4 1.0 U4 0.0 0.0 1.0 U3 3 D2 U4 1 2 D3 Steve Bill Sergey Larry Rank Score U1 1 0.422 U2 2 0.328 U3 3 0.212 U4 4 0.038 Folksonomy (simplified) Ranked list of users by expertise

17 Evaluation

Evaluation 18 Experimental Setup Problem: lack of a proper ground truth for expertise Who is the best researcher in this room? Workaround: Inserting simulated users into real-world data from Delicious.com and check where they end up after ranking Real-world data set from Delicious.com comprising 50 tags with 515,000 real users (and real spammers) 71,300 real Web pages 2,190,000 real social bookmarks

Evaluation 19 Experimental Setup Probabilistic simulation, simulated users generated with four parameters P1: Number of user s bookmarks active or inactive user? P2: Newness fraction of Web pages not already in data set P3: Time preference discoverer or follower? P4: Document preference high quality or low quality?

Evaluation 20 Experimental Setup Simulation of 6 different user types Profiles (parameter values) based on recent studies + characteristics of our real-world data sets Experts Geek Veteran lots of high quality documents, discoverer (Distinguished Researcher) high quality documents, discoverer (Professor) Newcomer high quality documents, follower (PhD student) Spammers Flooder Promoter Trojan lots of random documents, follower (found in Delicious) some documents (most are his own), discoverer (found in Delicious) some documents, follower (next-gen spammer)

Evaluation 21 Performance baselines FREQ(UENCY) Most popular approach simple frequency count, looks only at quantity. Seems to be the dominant algorithm in use in practice. HITS Algorithm on which SPEAR is based. Uses mutual reinforcement but does not analyze temporal dimension of user activity.

22 Experimental Results

23 Experts

Evaluation 24 Experts: Ideal result rank 1 Geeks Veterans Newcomers Overlaps expected due to probabilistic simulation setup

Evaluation 25 Experimental Results Promoting Experts SPEAR differentiated all expert types better than its competitors SPEAR kept expected order of geeks > veterans > newcomers SPEAR was less dependent on user activity (quality before quantity)

Evaluation 26 Qualitative analysis: manual examination of Top 10 experts for three tags photography, semanticweb, javascript programming No spammers found ( phew ) These users seemed to be more involved or serious about their Delicious usage, e.g. provided optional profile information such as real name, links to their Flickr photos or microblog on Twitter Their number of bookmarks: from 100 s to 10,000 s semanticweb : Semantic Web researcher among the experts javascript programming : Top 2 experts were professional software developers

27 Spammers

Evaluation 28 Spammers: Ideal result rank #1 Trojans Promoters Flooders Trojans expected to score higher because they mimic regular users for most of the time

Evaluation 29 Experimental Results Demoting Spammers SPEAR demoted all spammer types significantly more than its competitors Only SPEAR demoted all trojans from the TOP 100 ranks FREQ completely failed to demote any spammers

Evaluation 30 Qualitative analysis: manual examination of Top 50 users for the heavily spammed tag mortgage (without inserting simulated users) Ranked users by their number of bookmarks = FREQ strategy 30 out of 50 were (real) spammers, either flooders or promoters Compared to FREQ, both SPEAR and HITS were able to remove these spammers from the Top 50 SPEAR demoted spammers significantly more than HITS

31 Summary

Summary 32 Conclusions SPEAR demoted all spammer types while still ranking experts on top SPEAR was much less vulnerable to spammers due to its reduced dependence on the activeness of users: quality >> quantity Future Work Quality score of Web pages deserve more investigation Transfer to new problem domains, e.g. blogosphere or music Follow-up with user & item recommendation, trend detection Michael G. Noll michael.noll@hpi.uni-potsdam.de Hasso Plattner Institute, LIASIT Albert Au Yeung cmay06r@ecs.soton.ac.uk University of Southampton