Entry Name: "INRIA-Perin-MC1" VAST 2013 Challenge Mini-Challenge 1: Box Office VAST


Team Members: Charles Perin, INRIA, Univ. Paris-Sud, CNRS-LIMSI, charles.perin@inria.fr PRIMARY
Student Team: YES
Analytic Tools Used: None
May we post your submission in the Visual Analytics Benchmark Repository after VAST Challenge 2013 is complete? YES
Video: http://youtu.be/7274cwlhrtq

Figure 1: CinemAviz interface: (a) IMDb unique id input; (b) dimensions of the explored movie; (c) additional dimensions input; (d) matrix view; (e) matrix view visualization options; (f) sliders to weight each selected dimension; (g) sliders to weight movies according to the number of dimensions they have in common with the explored movie; (h) opening box office prediction view; (i) rating prediction view.

Description

Data

We used only one data source for the challenge: the IMDb database. We did not consider social data because we wanted an independent system: relying only on the IMDb data lets us predict any movie at any time, without being constrained by the temporality of social media data. Our predictions rest on objective data and do not depend on Twitter data, for example, whose quality may vary a lot from one released movie to another.

The database we built consists of two tables. The first one contains all the movies released in a specific time interval. We chose to keep all movies from 1990 to today, because older movies would not be pertinent to compare to new ones and because this limits the size of the database. However, a longer period of time can easily be parsed. For each movie entry we store several pieces of information, such as its budget (converted to US dollars) and, of course, its opening weekend box office and user rating. The second table is the list of all people (actors, actresses, directors, music composers, etc.) involved in at least one movie of the movies table. Each movie has the list of people involved, and each person has the list of movies he or she was involved in. Overall, the data consists of 2713 movies and 236982 people.

Application

The tool is a client-side web application. Once the data files have been downloaded by the client, everything is computed locally and runs offline in any modern browser. CinemAviz is built with JavaScript and the D3 library. The core characteristic of our tool is that it helps compare similar movies, using what we call the dimensions of movies.

Dimensions

We first select a movie by typing its IMDb unique id (Fig. 1(a)). This makes the dimensions of the movie appear, each with the number of movies it is involved in (Fig. 1(b)). We call a dimension every attribute of a movie (actors, directors, budget, etc.). We can also manually add dimensions through a text entry (Fig. 1(c)).
Although we were not supposed to enter additional information ourselves, one of the analysts we got feedback from found this feature very interesting. Once the dimensions are set up, we can select or unselect them, making the dimensions appear in an adjacency matrix view (Fig. 1(d)). Each cell is the intersection of two dimensions, meaning it represents all the movies of the database having these two dimensions. We also propose a Budget dimension, which is associated with all the movies whose budget lies in an interval around the budget of the currently analyzed movie.

Cell visualizations

It is possible to switch between different visualizations for the cells. With the linechart, barchart and stripped chart views, both the opening box office (in red in the figure on the right) and the ratings (in blue) are shown, although at different scales. The color and scale of the visualizations can be set using the widgets shown in Figure 1(e). Using these visualizations, one may observe the distribution through different views. These views are used to analyze the dimensions and find outliers or trends for the next steps of the analysis.

The last cell visualization is a scatterplot where each dot represents a movie, turning the adjacency matrix into a scatterplot matrix. In this view, the x axis is the rating and the y axis is the opening box office. Once again, the distribution of the movies with these two dimensions is visualized. We also assign each dot a gray level according to the number of dimensions the associated movie has in common with the movie we explore. The darkest dots are very similar to the target movie, while the lightest ones share only a few dimensions with it. For instance, when exploring a movie such as Wolverine, the series of X-Men movies appears in a very dark color.

Weighting dimensions

Once we have explored the different dimensions using the matrix view, we select the dimensions we estimate to be of interest by clicking on the dimension headers. For each selected dimension, a slider is created on the right of the interface (Fig. 1(f)). This slider sets the weight of the dimension. Sliders range from 0 to 1000 and their initial value is 500. The analyst's expertise is crucial for this weighting process. The weightings, and therefore the results, may vary a lot from one user to another. This process is subjective, and an analyst with limited knowledge of the dimensions would obtain bad results from the weightings.

Six other sliders are available. These sliders, named [1-6]D, weight the movies differently according to the number of dimensions they have in common with the explored movie (Fig. 1(g)). Note that the 6D slider is actually a 6D+ slider. These sliders are very useful when the movie we explore is, for instance, the new installment of a series, and more generally whenever a movie looks very similar to several others in terms of dimensions.
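To make the notions of matrix cell, Budget dimension and shared dimensions more concrete, here is a minimal Python sketch (CinemAviz itself is written in JavaScript, and this is not its code). The `Movie` class, the `"kind:name"` dimension labels, the 25% budget tolerance and the normalization of the gray level are all assumptions made for illustration; the text above does not specify them.

```python
from dataclasses import dataclass, field


@dataclass
class Movie:
    # Hypothetical record; the real database stores more fields.
    imdb_id: str
    title: str
    budget_usd: float
    dimensions: set = field(default_factory=set)  # e.g. {"actor:A", "genre:action"}


def cell_movies(movies, dim_a, dim_b):
    """Movies in the matrix cell at the intersection of two dimensions."""
    return [m for m in movies if dim_a in m.dimensions and dim_b in m.dimensions]


def budget_dimension(movies, target, tolerance=0.25):
    """Movies whose budget lies in an interval around the target's budget.

    The interval width (+/- 25%) is an assumed value.
    """
    low = target.budget_usd * (1 - tolerance)
    high = target.budget_usd * (1 + tolerance)
    return [m for m in movies if m is not target and low <= m.budget_usd <= high]


def shared_dimensions(movie, target):
    """Number of dimensions a movie has in common with the explored movie."""
    return len(movie.dimensions & target.dimensions)


def gray_level(movie, target):
    """Gray level of a scatterplot dot: 0.0 is lightest, 1.0 is darkest.

    Dividing by the target's dimension count is an assumed normalization.
    """
    if not target.dimensions:
        return 0.0
    return shared_dimensions(movie, target) / len(target.dimensions)
```

With this representation, a sequel that shares its cast, director and genre with the explored movie gets the darkest dots, which matches the Wolverine and X-Men observation above.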
Estimation views

When the slider values are changed, two estimation views are updated: the opening box office view and the rating view (Fig. 1(h,i)). These views consist of one or several linecharts, depending on the current mode, which can be either the plot of each dimension (Fig. 3(a)) or the average of the dimensions (Fig. 3(b)). The weights of the sliders change the shape of each associated dimension linechart as well as the average linechart. The x axis goes from 0 to 10 for the rating and from 0 to the maximum value in the dataset for the opening box office. The y axis is the number of movies for each value on the x axis. Because an actor will have fewer movies than a genre, for instance, the views will often be stretched by dimensions with many movies (Fig. 2(1), brown linechart). To give less weight to these dimensions, we visually adjust the sliders by observing the feedback in the views until we are satisfied with the importance of the dimensions (Fig. 2(2-3)). The linecharts in the estimation views are colored according to their type. In Figure 2, the brown linechart is the genre and the blue ones are the actors. The colors are consistent with the ones used for the dimension selection and the headers of the matrix.

Figure 2: visually weighting a dimension
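The text does not give the formulas behind the estimation views, so the following Python sketch is only one plausible reading, not the CinemAviz implementation. It assumes that each dimension curve is a histogram of its movies, that each movie counts with the weight of its [1-6]D slider, and that the average curve is a slider-weighted mean of the dimension curves. All function names, the bin count and the division of slider values by 1000 are hypothetical.

```python
def movie_weight(shared, kd_sliders, slider_max=1000):
    """Weight of a movie from the [1-6]D sliders (the last one acts as 6D+)."""
    index = max(1, min(shared, 6)) - 1
    return kd_sliders[index] / slider_max


def dimension_curve(values, movie_weights, n_bins, x_max):
    """Weighted histogram: y is the (weighted) number of movies per x bin."""
    curve = [0.0] * n_bins
    for value, weight in zip(values, movie_weights):
        index = min(int(value / x_max * n_bins), n_bins - 1)
        curve[index] += weight
    return curve


def average_curve(curves, sliders, slider_max=1000):
    """Slider-weighted mean of the dimension curves (assumed combination rule)."""
    weights = [s / slider_max for s in sliders]
    total = sum(weights)
    n_bins = len(curves[0])
    if total == 0:
        return [0.0] * n_bins
    return [
        sum(w * c[i] for w, c in zip(weights, curves)) / total
        for i in range(n_bins)
    ]


def curve_mean(curve, x_max):
    """Mean x value under a curve, computed from bin centers."""
    mass = sum(curve)
    if mass == 0:
        return None
    width = x_max / len(curve)
    return sum((i + 0.5) * width * y for i, y in enumerate(curve)) / mass
```

Under these assumptions, lowering the slider of a dimension with many movies (a genre, say) shrinks its contribution to the average curve, which is the effect the visual weighting in Figure 2 aims for.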

Estimations are performed using the two focus views, going back and forth with the dimension weighting step. The interactions available in the focus views are illustrated in Figure 3. Moving the mouse triggers the inspector, which displays the value at the current mouse position (Fig. 3(c)); the average value of all the weighted dimensions is shown as a red line (Fig. 3(d)); brushing in the area makes a selection rectangle appear (Fig. 3(e)); and the average value within the brushed area is shown as the orange line at its center, with the associated value (Fig. 3(f)).

Figure 3: exploring the focus views

Discussion and conclusion

CinemAviz is based only on visual exploration and visual decision. As the challenge requested that the user have an important role, we did not use advanced mathematical models to automate the process and focused on a tool where the user's expertise is crucial and the interface is there only to present the data. Consequently, to obtain good results with the tool, the analyst needs to have a very good knowledge of cinema. We found that making accurate predictions was difficult when the dimensions of the explored movie were shared by only a few movies in the database. Because the tool is based on previous movies, it is, for instance, not well suited to predicting independent movies and movies with unknown actors or directors.

We think we did quite a good job overall, judging by the results and recognition we got during the challenge, and we are globally satisfied with the results we obtained, in particular for the viewer rating prediction. We explain this in detail in question 5 of the next part of the document.

Several improvements could benefit the analysis. One example is being able to brush each cell of the matrix, in the scatterplot mode or any other, to filter only some movies and remove outliers with higher precision, as is the case with ScatterDice for instance. We may also take into account other dimensions such as the length of a movie, the period in which it was released (we know that box office results are often higher during summer), and the production company.

Finally, we really enjoyed the challenge and the development of CinemAviz. We also realized that our tool is really helpful for finding movies similar to others, although this was not its original purpose. We used it a lot for personal research and discovered movies we loved, based on their similarities with our favorite movies, actors or directors.

Questions

1) What data factors, alone or in combination, were most useful for predicting possible outcomes?

The most useful factors were star actors and directors, because their presence in a movie may attract spectators or, on the contrary, put them off. Less pertinent dimensions were cinematographer, composer, or costume designer. Either these people were involved in only a few movies, or they were involved in a huge number of movies of various genres, with widely varying opening box office and ratings, making them unreliable. The same remark applies to the genres of the movies, because each genre has top movies and very bad ones. Another very important dimension was the budget. We quickly realized that the budget highly impacts the opening box office, while it is far less the case for the rating. This is easily explained by the fact that a blockbuster movie will be heavily advertised, will have star actors in its cast, and will often show incredible special effects, making for a very good trailer. However, although spectators can be misled into paying for a movie that is not worth it (they have not seen it yet), they rate the movie only once they have seen it, and many high-budget movies end up with a bad viewer rating. Finally, a crucial data factor for predicting a movie's opening box office and rating was the number of dimensions it shares with other movies. The scatterplot matrix was very useful for this purpose, as well as the [1-6]D sliders.

2) How did you combine factors from the structured data with factors in unstructured data and what was the impact on the results? Did you see correlations? How can a user of your system explore this combination?

We made the choice not to use so-called unstructured data (although we would not call the IMDb data structured, given the effort required to parse the entire, and not really consistent, database).

3) Do the important factors vary by class, such as movie genre?

We did not find that the important factors differ between movie genres. We did notice that horror movies or comedy movies often had a lower viewer rating, but we did not modify our analysis accordingly. Because we base our predictions on similar movies, a comedy or horror movie will be closer to movies of the same genre and will therefore be impacted by their scores. However, as explained before, a very important factor is the budget for the opening box office, and the actors and directors, when famous, for both the opening box office and the viewer rating.

4) Did you use data on previous movies to help analyze/predict outcomes for later movies? If so, how?

We are not sure whether the question is about using results from our previous analyses for the later ones, or about using data on previous movies. If the question is about using data on previous movies, then of course we did: it is the core of CinemAviz, which is based only on the IMDb database. If the question is about our previous analyses, then the answer is also yes. When we started the challenge, it was already in its second phase and we had a lot of past results to exploit. Once again, because we use only the IMDb data, we can predict a movie at any time, without being constrained by the temporality of social media data. We therefore trained ourselves and iteratively developed our tool using the previously released movies of the challenge, comparing our predictions with the results. This was our training phase, and this step was essential for our later estimations.

5) For any prediction that you had a significant margin of error (for our challenge, this would be a high mean relative absolute error), explain possible sources of error.

We are quite satisfied with our results for the viewer rating, and we believe CinemAviz is reliable for this estimation. This indicator is highly dependent on the dimensions of the movie we analyze, and we finally got an average absolute error of 0.7 for the rating, with some very precise predictions. The opening box office estimation was less accurate. Although, for the July 12 results, we had the best opening box office prediction made at that date, the opening box office prediction was not always that accurate and we also had several bad predictions. We partly explain this by the fact that the tool relies heavily on the analyst's knowledge and expertise, and we have to admit that we are not experts in all kinds of movies. Our predictions were often very wrong for types of movies that we are not interested in (e.g., horror, family comedy). We truly think that CinemAviz can give accurate results, as long as the user knows the topic very well. Only the analyst knows which weight to give to a dimension, and this may vary a lot depending on the context, the analyst, and his or her subjective preferences for actors, genres or directors. We also think that the opening box office would be easier to predict using social media data, and this is one of the limitations of our tool. The opening box office is influenced not only by the dimensions of the movie, but also, for example, by its release date, other movies released the same week, and social events occurring at the same time (vacations, sport events, etc.), which we did not consider.

6) What data trends if any were you able to identify? How did the identification of trends affect / shape predictions? Did you see instances where early data about a movie was contradicted by later data/factors?

This question is once again about the unstructured data, which we chose not to use.
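For readers unfamiliar with the error measures mentioned in question 5, the short Python sketch below shows how a mean absolute error and a mean relative absolute error can be computed. The relative form used here (absolute error divided by the actual value, then averaged) is an assumption about the challenge's definition, and the function names are hypothetical.

```python
def mean_absolute_error(predicted, actual):
    """Average of |prediction - actual|, in the unit of the predicted quantity."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


def mean_relative_absolute_error(predicted, actual):
    """Average of |prediction - actual| / |actual| (assumed definition)."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(abs(p - a) / abs(a) for p, a in zip(predicted, actual)) / len(actual)
```

The absolute form suits the rating, which lives on a fixed 0 to 10 scale, while the relative form makes opening box office errors comparable between small and large releases.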