Data integration for customer preference learning

Similar documents
Kripke style Dynamic model for Web Annotation with Similarity and Reliability

Repeatable Web Data Extraction and Interlinking

Transformation and aggregation preprocessing for top-k recommendation GAP rules induction

Detecting Anomalous Trajectories and Traffic Services

Assignment 5: Collaborative Filtering

Using EasyMiner API for Financial Data Analysis in the OpenBudgets.eu Project

Influence of Word Normalization on Text Classification

Web Personalization & Recommender Systems

Using Linked Open Data to Improve Recommending on E-Commerce

Fig 1. Overview of IE-based text mining framework

Web Personalization & Recommender Systems

XML Data in (Object-) Relational Databases

Movie Recommender System - Hybrid Filtering Approach

International Journal of Advance Engineering and Research Development. A Facebook Profile Based TV Shows and Movies Recommendation System

Taking Advantage of Semantics in Recommendation Systems

Lecture 3: Linear Classification

Orange3 Data Fusion Documentation. Biolab

A PERSONALIZED RECOMMENDER SYSTEM FOR TELECOM PRODUCTS AND SERVICES

Ontology as Knowledge Base for Spatial Data Harmonization

Various aspects of user preference learning and recommender systems

Fuzzy Cognitive Maps application for Webmining

Use of KNN for the Netflix Prize Ted Hong, Dimitris Tsamis Stanford University

Flight Systems are Cyber-Physical Systems

UnifiedViews: An ETL Framework for Sustainable RDF Data Processing

SDMX self-learning package No. 3 Student book. SDMX-ML Messages

K-Nearest-Neighbours with a Novel Similarity Measure for Intrusion Detection

Payola: Collaborative Linked Data Analysis and Visualization Framework

Top-N Recommendations from Implicit Feedback Leveraging Linked Open Data

Music Recommendation with Implicit Feedback and Side Information

Thanks to Jure Leskovec, Anand Rajaraman, Jeff Ullman

Schema-aware feature selection in Linked Data-based recommender systems (Extended Abstract)

EFFICIENT INTEGRATION OF SEMANTIC TECHNOLOGIES FOR PROFESSIONAL IMAGE ANNOTATION AND SEARCH

SQL: Data De ni on. B0B36DBS, BD6B36DBS: Database Systems. h p:// Lecture 3

Key differentiating technologies for mobile search

Singular Value Decomposition, and Application to Recommender Systems

NDBI040: Big Data Management and NoSQL Databases. h p:// svoboda/courses/ ndbi040/

Computational Intelligence Meets the NetFlix Prize

Web Applications Usability Testing With Task Model Skeletons

Master Project. Various Aspects of Recommender Systems. Prof. Dr. Georg Lausen Dr. Michael Färber Anas Alzoghbi Victor Anthony Arrascue Ayala

arxiv: v4 [cs.ir] 28 Jul 2016

Towards a hybrid approach to Netflix Challenge

Definition and Uses of the i* Metamodel 1

Creating Large-scale Training and Test Corpora for Extracting Structured Data from the Web

A Content Based Image Retrieval System Based on Color Features

PrefWork - a framework for the user preference learning methods testing

Automatic Lemmatizer Construction with Focus on OOV Words Lemmatization

Annotation for the Semantic Web During Website Development

On Distributed Querying of Linked Data

Sparse Estimation of Movie Preferences via Constrained Optimization

Example : Write a Hadoop MapReduce program for Movie Recommendation System.

NDBI040: Big Data Management and NoSQL Databases. h p://

Specification and Generation of Environment for Model Checking of Software Components *

A Capacity Planning Methodology for Distributed E-Commerce Applications

Overview of Web Mining Techniques and its Application towards Web

Problems in Reputation based Methods in P2P Networks

An FCA Framework for Knowledge Discovery in SPARQL Query Answers

Spemmet - A Tool for Modeling Software Processes with SPEM

Experience with Change-oriented SCM Tools

Graaasp: a Web 2.0 Research Platform for Contextual Recommendation with Aggregated Data

Predict the box office of US movies

Modeling with Uncertainty Interval Computations Using Fuzzy Sets

VISUALIZATION OF USERS ACTIVITIES IN A SPECIFIC ENVIRONMENT. Ivo Maly

Ranking by Alternating SVM and Factorization Machine

Shape fitting and non convex data analysis

Using Data Mining to Determine User-Specific Movie Ratings

IJREAT International Journal of Research in Engineering & Advanced Technology, Volume 1, Issue 5, Oct-Nov, ISSN:

Abstract Path Planning for Multiple Robots: An Empirical Study

By Atul S. Kulkarni Graduate Student, University of Minnesota Duluth. Under The Guidance of Dr. Richard Maclin

1. CONCEPTUAL MODEL 1.1 DOMAIN MODEL 1.2 UML DIAGRAM

Ontology Transformation in Multiple Domains

Search engine in a class of academic digital libraries

Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University Infinite data. Filtering data streams

CEA LIST s participation to the Scalable Concept Image Annotation task of ImageCLEF 2015

UserMap an Exploitation of User-Specified XML-to-Relational Mapping Requirements and Related Problems

MetaData for Database Mining

Content-based Dimensionality Reduction for Recommender Systems

Keyword Search over RDF Graphs. Elisa Menendez

Design and Management of Semantic Web Services using Conceptual Model

Zurich Open Repository and Archive. Private Cross-page Movie Recommendations with the Firefox add-on OMORE

WEB PAGE RE-RANKING TECHNIQUE IN SEARCH ENGINE

Ontology Extraction from Heterogeneous Documents

Ontology-based Web Recommendation from Tags

Maintaining Mutual Consistency for Cached Web Objects

5/13/2009. Introduction. Introduction. Introduction. Introduction. Introduction

Web 2.0 Lecture 1: Motivation and Course Overview

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

SQL Lab: Assignment 2 (due to )

Leveraging Knowledge Graphs for Web-Scale Unsupervised Semantic Parsing. Interspeech 2013

ApplMath Lucie Kárná; Štěpán Klapka Message doubling and error detection in the binary symmetrical channel.

InBeat: News Recommender System as a CLEF-NEWSREEL 14

CS 124/LINGUIST 180 From Languages to Information

A NEW PERFORMANCE EVALUATION TECHNIQUE FOR WEB INFORMATION RETRIEVAL SYSTEMS

Document Databases: MongoDB

News Recommender System based on Association CLEF NewsREEL 2017

Feature Selection Technique to Improve Performance Prediction in a Wafer Fabrication Process

Hybrid Recommendation System Using Clustering and Collaborative Filtering

Detecting ads in a machine learning approach

Analyses of RDF Triples in Sample Datasets

Generic Environment for Full Automation of Benchmarking

NON-CENTRALIZED DISTINCT L-DIVERSITY

Transcription:

M. Kopecký 1, M. Vomlelová 2, P. Vojtáš 1 Faculty of Mathematics and Physics Charles University Malostranske namesti 25, Prague, Czech Republic 1 {kopecky vojtas}@ksi.mff.cuni.cz 2 marta@ktiml.mff.cuni.cz Abstract. We describe the process and challenges of integration of movie data from Movie Lens, Netflix and RecSys Challenge 2014 with IMDB and DBPedia. Thanks of this integration we can enhance information by semantic data and improve prediction of customer preferences and recommendation. These data were collected in different situation by different methodologies. We want to use these data to be able to extend and further enhance our machine learning approaches developed for individual datasets to other datasets. Keywords: Applications using data extracted from web, computer annotation, data, experiments and metrics 1 Introduction, motivation, recent work. No human can comprehend any large collection of multi-dimensional data in his/her mind and choose the optimal item according to complex and often difficult to formulate criterion. For this purpose can be helpful recommender systems, that can learn user s preferences from his/her both explicit and implicit actions. The goal of the recommender system is then suggest suitable and often surprising proposals. Different collections of the otherwise similar data can often require different approaches simply due to different semantic data available about items and users in datasets. Because these approaches cannot be directly executed on all datasets, they can be compared only with complications. In this ongoing research report we thus concentrate on synergy effect of annotation and integration of data for user preference learning, and consequently for recommendation. The optimal are such domains where individual items can be identified and where additional data are publicly available. As a basic domain we choose the domain of movies. J. Steinberger, M. Zíma, D. Fiala, M. Dostal, M. Nykl (eds.) Data a znalosti 2017, Plzeň, 5. - 6. října 2017, pp. 120-124.

Work-in-progress paper 2 Extracting and integrating data from movie domain In this chapter we first describe data creation, interchanging annotation and data integration. We use Flix data i.e. enriched Netflix competition data, RecSys 2014 challenge data [3] and RuleML Challenge data [1]. We started with three available independent datasets: MovieLens 20M dataset, Twitter dataset and Flix dataset Sizes of all datasets are summarized in Table 1. Table 1 original datasets Dataset Ratings Rated /all movies MovieLens 20M 20 000 263 26 744 27 278 Twitter dataset 168 880 13 616 14 542 Flix dataset 90 217 939 12 031 17 770 Rating users 137 493 22 073 479 870 The datasets are quite different. Still they have few things in common. Movies have their title and usually also the year of their production. Ratings are equipped by timestamp that allows us to order ratings from individual users chronologically. To be able to map movies from different datasets, we wanted to enhance every movie record by the corresponding IMDb 1 identifier TT with format ttnnnnnnn. We observed that the Twitter dataset uses as their internal MOVIEID the numeric part of the IMDb identifier. So the movie Midnight Cowboy with MOVIEID=64665 corresponds to the IMDb record with ID equal to tt0064665. To be able to assign IMDb identifiers to movies from other datasets, we had to use the search capabilities of the IMDb database. For both of them we used an HTTP interface for searching movies according to their name. The HTTP response then among others contains a table in form: <table><tr> <td><a href="/title/ttnnnnnnn/?ref_=fn_ft_tt_1" ><img src="..."></a></td> <td><a href="/title/ttnnnnnnn /?ref_=fn_ft_tt_1">title of the movie</a> (YEAR)...</td> </tr></table> To be able to maintain both MovieLens and Flix dataset equally regardless different formats of movie titles in them and potentially in other future datasets, we needed to transform each movie title to the proper form expected by the IMDb interface. The basic algorithm can be described in steps: 1 http://www.imdb.com/ 121

Convert all letters in movie title to lower case. If the movie title contains year of production at its end in brackets remove it. If the movie title still contains text in brackets at its end, remove it. This text usually contained original name of movie in original language. Move word the, respectively a / an from the end of the title to the beginning. Translate characters _,.,? and, to spaces Translate & and & in titles to word and For example, the transformation changes title Official Story, The (La Historia Oficial) (1985) from the MovieLens dataset to its canonical form the official story which can be identified as movie with the ID= tt0089276. Similarly the title Seventh Seal, The (Sjunde inseglet, Det) (1957) from the same dataset is transformed to the form the seventh seal with ID= tt0050976. The successfulness of this approach to map movies from both MovieLens and Flix datasets is in first line of Table 2. In optimal case, the table returning from the IMDb search contains exactly one row with the requested record. For this situation the algorithm behaves well and is able to retrieve the correct IMDb identifier. In many other cases the result contained more rows and the correct one or the best possible one had to be identified. For this purpose we enhanced the algorithm by additional steps: The correct record should be from the requested year, so the returned table should be searched only for records from this year and other records should be ignored The IMDb search provides more levels of tolerance in title matching. Try to use them from the most exact one to the most general. If the matching record from requested year cannot be found using stricter search, the other search level is used. Currently, we have 13 081 out of all 17 770 Flix movies mapped onto the IMDb database. Even all 27 278 out of 27 278 movies from the MovieLens set are mapped to the equivalent IMDb records. So the current results provided by the combination of most advanced versions of algorithms are promising. The diagram in the Figure 1 shows the amount of movies associated to the IMDb record in different intersections after the integration. For each movie registered in the IMDb database we then retrieved XML data from the URL address http://www.omdbapi.com/?i=ttnnnnnnn&plot=full&r=xml and then from the XML data we retrieved following movie attributes. Among others title, rating, avards, year, country, language, genres, director and actors. Another source of semantic data we use is the DbPedia. For this purpose we implemented the mapping technique described in [K] and assigned DbPedia 2 identifiers and associated semantic data to IMDb movies. 2 http://wiki.dbpedia.org/ 122

Work-in-progress paper The DbPedia identifier of movie is a string, for example The_Official_Story or The_Seventh_Seal. This identifier can then be used to access directly the DbPedia graph database or retrieve data in an XML format through the URL address in form http://dbpedia.org/page/dbpediaidentifier. Table 2 IMDb search by title name the successfulness of IMDb title search for original seven steps algorithm and the final enhanced version. MovieLens Flix Twitter IMDb search by title name 45,4% 70,9% Not needed Final enhanced version 100.0% 73.6% Not needed Figure 1 Integration of movies in datasets based on the IMDb mapping 3 Conclusions, future work We illustrated our approach to integration of five datasets three movie datasets and two movie databases containing semantics data. The future challenge is twofold: provide deeper analysis of data mining and use interconnection of datasets and their semantic enhancements for identifying and using possible dataset similarities. In future research we would like to continue in approaches in [2]. extend this approach to other domains Acknowledgement: Research was supported by Czech project Progres Q48. 123

References 1. Kuchar J.: Augmenting a Feature Set of Movies Using Linked Open Data, Proceedings of the RuleML 2015 Challenge, Berlin, Germany. Published by CEUR Workshop Proceedings, 2015 2. L. Peska, I. Lasek, A. Eckhardt, J. Dedek, P. Vojtas, D. Fiser: Towards web semantization and user understanding. In EJC 2012, Y. Kiyoki et al Eds. Frontiers in Artificial Intelligence and Applications 251, IOS Press 2013, pp 63-81 3. Twitter data from RecSys 2014 challenge http://2014.recsyschallenge.com/ 124