Social Data Exploration

Similar documents
New Perspectives in Social Data Management

arxiv: v1 [cs.db] 1 Aug 2012

EXPLORATORY MINING OF COLLABORATIVE SOCIAL CONTENT MAHASHWETA DAS. Presented to the Faculty of the Graduate School of

Big Data Analytics Influx of data pertaining to the 4Vs, i.e. Volume, Veracity, Velocity and Variety

Holistic Analysis of Multi-Source, Multi- Feature Data: Modeling and Computation Challenges

Graph Cube: On Warehousing and OLAP Multidimensional Networks

Exploiting Group Recommendation Functions for Flexible Preferences

CS249: ADVANCED DATA MINING

Interactive Summarization and Exploration of Top Aggregate Query Answers

Using Data Mining to Determine User-Specific Movie Ratings

COSC160: Data Structures Hashing Structures. Jeremy Bolton, PhD Assistant Teaching Professor

3 No-Wait Job Shops with Variable Processing Times

Data Preprocessing. Why Data Preprocessing? MIT-652 Data Mining Applications. Chapter 3: Data Preprocessing. Multi-Dimensional Measure of Data Quality

Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University Infinite data. Filtering data streams

Personalized Entity Recommendation: A Heterogeneous Information Network Approach

Security Control Methods for Statistical Database

Theorem 2.9: nearest addition algorithm

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

Algorithms for Nearest Neighbors

Understanding Clustering Supervising the unsupervised

Thanks to Jure Leskovec, Anand Rajaraman, Jeff Ullman

Near Neighbor Search in High Dimensional Data (1) Dr. Anwar Alhenshiri

Locality- Sensitive Hashing Random Projections for NN Search

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

Fast Contextual Preference Scoring of Database Tuples

A Hybrid Peer-to-Peer Recommendation System Architecture Based on Locality-Sensitive Hashing

A PROPOSED HYBRID BOOK RECOMMENDER SYSTEM

Towards a hybrid approach to Netflix Challenge

Machine Learning for Software Engineering

Geometric data structures:

Mining di Dati Web. Lezione 3 - Clustering and Classification

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

Ranking Clustered Data with Pairwise Comparisons

International Journal of Computer Engineering and Applications, Volume IX, Issue X, Oct. 15 ISSN

Part 12: Advanced Topics in Collaborative Filtering. Francesco Ricci

Finding Similar Sets. Applications Shingling Minhashing Locality-Sensitive Hashing

Overview of Clustering

Understanding policy intent and misconfigurations from implementations: consistency and convergence

CS435 Introduction to Big Data Spring 2018 Colorado State University. 3/21/2018 Week 10-B Sangmi Lee Pallickara. FAQs. Collaborative filtering

The complement of PATH is in NL

Implementation of a High-Performance Distributed Web Crawler and Big Data Applications with Husky

Clustering Aggregation

Genetic Algorithm for Circuit Partitioning

Clustering. SC4/SM4 Data Mining and Machine Learning, Hilary Term 2017 Dino Sejdinovic

Advanced Data Management Technologies Written Exam

Recommendation with Differential Context Weighting

A Study on Reverse Top-K Queries Using Monochromatic and Bichromatic Methods

Machine Learning (BSMC-GA 4439) Wenke Liu

Finding Similar Items:Nearest Neighbor Search

Leveraging Set Relations in Exact Set Similarity Join

An Unsupervised Technique for Statistical Data Analysis Using Data Mining

Clustering. (Part 2)

Preprocessing Short Lecture Notes cse352. Professor Anita Wasilewska

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2016)

Summary of Last Chapter. Course Content. Chapter 3 Objectives. Chapter 3: Data Preprocessing. Dr. Osmar R. Zaïane. University of Alberta 4

Joint Entity Resolution

Community-Based Recommendations: a Solution to the Cold Start Problem

Efficient integration of data mining techniques in DBMSs

Proposed System. Start. Search parameter definition. User search criteria (input) usefulness score > 0.5. Retrieve results

ECLT 5810 Data Preprocessing. Prof. Wai Lam

Viral Marketing and Outbreak Detection. Fang Jin Yao Zhang

Instructor: Dr. Mehmet Aktaş. Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University

ECLT 5810 Clustering

Complexity Results on Graphs with Few Cliques

Big Data Analytics CSCI 4030

val(y, I) α (9.0.2) α (9.0.3)

Semi-Supervised Clustering with Partial Background Information

Polynomial-Time Approximation Algorithms

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017)

Learning independent, diverse binary hash functions: pruning and locality

Hybrid Recommendation System Using Clustering and Collaborative Filtering

Content-based Dimensionality Reduction for Recommender Systems

Introduction to Approximation Algorithms

How to explore big networks? Question: Perform a random walk on G. What is the average node degree among visited nodes, if avg degree in G is 200?

Solving a Challenging Quadratic 3D Assignment Problem

Introduction to Spatial Database Systems

Horizontal Aggregations in SQL to Prepare Data Sets Using PIVOT Operator

UML CS Algorithms Qualifying Exam Fall, 2003 ALGORITHMS QUALIFYING EXAM

Propagate the Right Thing: How Preferences Can Speed-Up Constraint Solving

Assignment 5: Collaborative Filtering

ECS289: Scalable Machine Learning

Movie Recommender System - Hybrid Filtering Approach

COMP6237 Data Mining Making Recommendations. Jonathon Hare

ECLT 5810 Clustering

Multidimensional Indexing The R Tree

University of Florida CISE department Gator Engineering. Clustering Part 2

Spatial Data Management

Efficient Top-K Count Queries over Imprecise Duplicates

SOCIAL MEDIA MINING. Data Mining Essentials

Data Preprocessing. Slides by: Shree Jaswal

On the Max Coloring Problem

Knowledge Discovery and Data Mining

ECT7110. Data Preprocessing. Prof. Wai Lam. ECT7110 Data Preprocessing 1

Measures of Clustering Quality: A Working Set of Axioms for Clustering

Multidimensional Data and Modelling - DBMS

CHAPTER 4 K-MEANS AND UCAM CLUSTERING ALGORITHM

High Dimensional Clustering

Spatial Data Management

Edge-exchangeable graphs and sparsity

Transcription:

Social Data Exploration Sihem Amer-Yahia DR CNRS @ LIG Sihem.Amer-Yahia@imag.fr Big Data & Optimization Workshop 12ème Séminaire POC LIP6 Dec 5 th, 2014

Collaborative data model User space (with attributes) Item space (with attributes) user1 boolean, rating, tag, sentiment Item 1 Item 2 user3 user2 Item 3 Item 4 user4 user5 Item 5 Item 6 Item 7 User6 2

MovieLens instances ID Title Genre Director Name Gender Location Rating 1 Titanic Drama James Cameron 2 Schindler s List Drama Steven Speilberg Amy Female New York 8.5 John Male New York 7.0 ID Title Genre Director Name Gender Location Tags 1 Titanic Drama James Cameron 2 Schindler s List Drama Steven Speilberg Amy Female New York love, Oscar John Male New York history, Oscar 3

More on MovieLens datasets http://grouplens.org/datasets/movielens/ 4

Social Data Exploration Rating exploration q Meaningful Interpretations of Collaborative Ratings Tag exploration q Who Tags What? An Analysis Framework Perspectives 5

Meaningful Interpretations of Collaborative Ratings 6

Data Model Collaborative rating site: Set of Items, Set of Users, Ratings q Rating tuple: <item attributes, user attributes, rating> ID Title Genre Director Name Gender Location Rating 1 Titanic Drama James Cameron Amy Female New York 8.5 2 Schindler s List Drama Steven Speilberg John Male New York 7.0 Group: Set of ratings describable by a set of attribute values Notion of group based on data cube q OLAP literature for mining multidimensional data 7

Exploration Space Each node/cuboid in lattice is a group A = Gender: Male B = Age: Young C = Location: CA D = Occupation: Student Task Quickly identify good groups in the lattice that help analysts understand ratings effectively Partial Rating Lattice for a Movie (M:Male, Y:Young, CA:California, S:Student) 8

DEM: Meaningful Description Mining For an input item covering R I ratings, return set C of groups, such that: description error is minimized, subject to: C k; coverage α Description Error Measures how well a group average rating approximates each individual rating belonging to it Coverage: measures percentage of ratings covered by returned groups DEM is NP-Hard: proof details in [1] [1] MRI: Meaningful Interpretations of Collaborative Ratings, S. Amer-Yahia, Mahashweta Das, Gautam Das and Cong Yu. In the Proceedings of the International Conference on Very Large Databases (PVLDB), 2011. 9

DEM: Meaningful Description Mining q Identify groups of reviewers who consistently share similar ratings on items 10

DEM: Meaningful Description Mining To verify NP-completeness, we reduce the Exact 3-Set Cover problem (EC3) to the decision version of our problem. EC3 is the problem of finding an exact cover for a finite set U, where each of the subsets available for use contain exactly 3 elements. The EC3 problem is proved to be NP-Complete by a reduction from the Three Dimensional Matching problem in computational complexity theory 11

DEM Algorithms Exact Algorithm (E-DEM) q Brute-force enumerating all possible combinations of cuboids in lattice to return the exact (i.e., optimal) set as rating descriptions Random Restart Hill Climbing Algorithm q Often fails to satisfy Coverage constraint; Large number of restarts required q Need an algorithm that optimizes both Coverage and Description Error constraints simultaneously Randomized Hill Exploration Algorithm (RHE-DEM) 12

RHE-DEM Algorithm Satisfy Coverage Minimize Error C= {Male, Student} {California, Student} 13

RHE-DEM Algorithm Satisfy Coverage Minimize Error C= {Male, Student} {California, Student} Say, C does not satisfy Coverage Constraint 14

RHE-DEM Algorithm Satisfy Coverage Minimize Error C= {Male, Student} {California, Student} C= {Male} {California,Student} C= {Student} {California,Student} 15

RHE-DEM Algorithm Satisfy Coverage Minimize Error C= {Male} {California, Student} Say, C satisfies Coverage Constraint 16

RHE-DEM Algorithm Satisfy Coverage Minimize Error C= {Male} {California, Student} 17

RHE-DEM Algorithm Satisfy Coverage Minimize Error C= {Male} {California, Student} 18

RHE-DEM Algorithm Satisfy Coverage Minimize Error C= {Male} {Student} 19

DIM: Meaningful Difference Mining For an input item covering R I + R I - ratings, return set C of cuboids, such that: q difference balance is minimized, subject to: C k; α α 20

DIM: Meaningful Difference Mining Difference Balance Measures whether the positive and negative ratings are mingled together" (high balance) or separated apart" (low balance) Coverage Measures the percentage of +/- ratings covered by returned groups DIM is NP-Hard: proof details in [1] [1] S. Amer-Yahia, Mahashweta Das, Gautam Das, Cong Yu: MRI: Meaningful Interpretations of Collaborative Ratings,. In PVLDB 2011. 21

DIM: Meaningful Difference Mining q Identify groups of reviewers who consistently disagree on item ratings 22

DIM: Meaningful Difference Mining NP-Completeness: reduction of the Exact 3-Set Cover problem (EC3). 24

DIM Algorithms Exact Algorithm (E-DIM) Randomized Hill Exploration Algorithm (RHE-DIM) q Unlike DEM error, DIM balance computation is expensive Quadratic computation scanning all possible positive and negative ratings for each set of cuboids q Introduce the concept of fundamental regions to aid faster balance computation Each rating tuple is a k-bit vector where a bit is 1 if the tuple is covered by a group A fundamental region is the set of rating tuples that share the same signature Partition space of all ratings and aggregate rating tuples in each region 25

DIM Algorithms: Fundamental Region C 1 = {Male, Student} C 2 = {California, Student} set of k=2 cuboids having 75 ratings (46+, 33-) balance = 26

Summary of Rating Exploration DEM and DIM are hard problems q q Leverage the lattice structure to improve coverage Exploit properties of rating function for faster error computation Explore other rating aggregation functions Explore other constraints: e.g., group size Explore other optimization dimensions: group diversity 27

Social Data Exploration Rating exploration q MRI: Meaningful Interpretations of Collaborative Ratings Tag exploration q Who Tags What? An Analysis Framework Perspectives 28

Collaborative Tagging Site (Amazon) 29

Collaborative Tagging Site (LastFM) 30

Exploring Collaborative Tagging in MovieLens Tag Signature for all Users Tag Signature for all CA Users 31

Exploring Collaborative Tagging Exploration considers three dimensions q User, Item, Tag and two alternative measures q Similarity, Diversity 32

Data Model q Tagging action tuple: <user attributes, item attributes, tags> ID Title Genre Director Name Gender Location Tags 1 Titanic Drama James Cameron Amy Female New York love, Oscar 2 Schindler s List Drama Steven Speilberg John Male New York history, Oscar 33

Tagging Behavior Dual Mining Problem (TagDM) 34

Tagging Behavior Dual mining Problem Instance 35

Problem: Tagging Behavior Dual Mining (TagDM) Identify similar groups of reviewers who share similar tagging behavior for diverse set of items Male Young comedy, drama, friendship drama, friendship 36

Tagging Behavior Dual Mining Instance 37

Tagging Behavior Dual Mining Instance Identify diverse groups of reviewers who share diverse tagging behavior for similar items Male Teen gun, special effects Female Teen violence, gory 38

TagDM is NP-Hard (proof details in [2]) [2] Mahashweta Das, Saravanan Thirumuruganathan, Sihem AmerYahia, Gautam Das, Cong Yu: Who Tags What? An Analysis Framework, In PVLDB 2012. 39

Two algorithms LSH (Locality Sensitive Hashing) based algorithm to handle TagDM problem instances optimizing similarity FDP (Facility Dispersion Problem) based algorithm handles TagDM problem instances optimizing diversity Both rely on computing tag signatures for groups q q Latent Dirichlet Allocation to aggregate tags Comparison between signatures based on cosine 40

Algorithm: LSH Based LSH (Locality Sensitive Hashing) based algorithm handles TagDM problem instances optimizing similarity LSH is popular to solve nearest neighbor search problems in high dimensions LSH hashes similar input items into same bucket with high probability q We hash group tag signature vectors into buckets, and then rank the buckets based on the strength of their (tag) similarity SM-LSH q Returns a set of groups, k having maximum similarity in tagging behavior, measured by comparing distances between group tag signature vectors SM-LSH-Fi: Handles hard constraints by Filtering result of SM-LSH SM-LSH-Fo: Handles hard constraints by Folding them to SM-LSH 41

Algorithm: LSH Based Hashing function for SM-LSH q We use LSH scheme in [3] that employs a family of hashing functions based on cosine similarity where Trep(g) is the tag signature vector for group g q Probability of finding the optimal result set by SM-LSH is bounded by: (proof details in paper) q where d is the dimensionality of hash signatures (buckets) We employ iterative relaxation to tune d in each iteration (Monte Carlo randomized algorithm) so that post-processing of hash tables yields nonnull result set [3]: M. Charikar. Similarity estimation techniques from rounding algorithms. In STOC, 2002 42

Algorithm: LSH Based SM-LSH-Fi: Dealing with constraints by Filtering q For each hash table, check for satisfiability of hard constraints in each bucket, and then rank filtered buckets on tagging similarity Often yields null results SM-LSH-Fo: Dealing with constraints by Folding q Fold hard constraints maximizing similarity as soft constraints into SM-LSH q Hash similar input tagging action groups (similar with respect to group tag signature vector and user and/or item attributes) into the same bucket with high probability However, it is non-obvious how LSH hash functions may be inversed to account for dissimilarity while preserving LSH properties 43

Algorithm: FDP Based FDP (Facility Dispersion Problem) based algorithm handles TagDM problem instances optimizing diversity FDP problem locates facilities on a network in order to maximize distance between facilities q We find tagging groups maximizing diversity (distance) betweeen tag signature vectors We intialize a pair of facilities with maximum weight, and then add nodes with maximum distance to those selected, in each subsequent iteration [4] [4]: S. S. Ravi, D. J. Rosenkrantz, and G. K. Tayi. Facility dispersion problems: Heuristics and special cases. In WADS, 2002 44

Algorithm: FDP Based DV-FDP q Returns a set of groups, k having maximum diversity in tagging behavior, measured by maximizing average pairwise distance between group tag signature vectors q If G opt and G app represent the set of k ( k 2) tagging action groups returned by optimal and our approximate DV-FDP algorithm, and tag signature vectors satisfy triangular inequality: (Proof details in paper) DV-FDP-Fi q Handles hard constraints by Filtering result of DV-FDP DV-FDP-Fo q Handles hard constraints by Folding them to DV-FDP 45

Some anecdotal evidence on analysts prefs Users prefer TagDM Problems 2 (find similar user sub-populations who agree most on their tagging behavior for a diverse set of items), 3 (find diverse user subpopulations who agree most on their tagging behavior for a similar set of items) and 6 (find similar user sub-populations who disagree most on their tagging behavior for a similar set of items), having diversity as the measure for exactly one of the tagging component: item, user and tag respectively. 46

Summary and Perspectives The notion of group is central to social data exploration q q Because it is meaningful to analysts: groups are describable Because group relationships can be explored for efficient space exploration 47

Perspective 1 Rating exploration q q A single dimension was optimized at a time (error or balance) We could formulate a problem that seeks k most uniform (minimize error) and most diverse groups (least overlapping) 48

Perspective 2 There is a total of 112 concrete problem instances that our TagDM framework captures! And those optimize the tagging dimensions only 49

Perspective 3 Social data exploration over time 50