Entry Name: "INRIA-Perin-MC1"
VAST 2013 Challenge Mini-Challenge 1: Box Office VAST
Team Members: Charles Perin, INRIA, Univ. Paris-Sud, CNRS-LIMSI, charles.perin@inria.fr PRIMARY
Student Team: YES
Analytic Tools Used: None
May we post your submission in the Visual Analytics Benchmark Repository after VAST Challenge 2013 is complete? YES
Video: http://youtu.be/7274cwlhrtq
Figure 1: CinemAviz interface: (a) IMDb unique id input; (b) dimensions of the explored movie; (c) additional dimensions input; (d) matrix view; (e) matrix view visualization options; (f) sliders to weight each selected dimension; (g) sliders to weight movies according to the number of dimensions they have in common with the explored movie; (h) opening box office prediction view; (i) rating prediction view.
Description

Data

We used only one data source for the challenge: the IMDb database. We did not consider social data because we wanted an independent system: using only the IMDb data allows us to predict the outcome of any movie at any time, without being constrained by the temporality of social media data. Predictions rely on objective data rather than, for example, Twitter data, whose quality may vary a lot from one released movie to another.

The database we built consists of two tables. The first contains all movies released in a given time interval. We kept all movies from 1990 to today, both to limit the size of the database and because older movies are not pertinent to compare with new ones; a longer period could, however, easily be parsed. For each movie entry we store several pieces of information, such as its budget (converted to US dollars) and, of course, its opening weekend box office and user rating. The second table lists all the people (actors, actresses, directors, music composers, etc.) involved in at least one movie of the movies table. Each movie holds the list of people involved in it, and each person holds the list of movies he or she was involved in. Overall, the data consists of 2713 movies and 236982 people.

Application

The tool is a client-side web application. Once the data files have been downloaded by the client, everything is computed locally and runs offline in any modern browser. CinemAviz is built with JavaScript and the D3 library. The core characteristic of our tool is that it helps comparing similar movies, using what we call the dimensions of movies.

Dimensions

We first select a movie by typing its IMDb unique id (Fig. 1(a)). This makes the dimensions of the movie appear, along with the number of movies each is involved in (Fig. 1(b)). We call a dimension every attribute of a movie (actors, directors, budget, etc.). We can also manually add dimensions with a text entry (Fig. 1(c)).
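The two-table structure described above can be sketched as follows; the field names, ids, and sample values are purely illustrative assumptions, not the actual CinemAviz schema.

```javascript
// Hypothetical sketch of the two cross-referencing tables.
// Field names and ids are made up for illustration.
const movies = {
  "tt0000001": {             // IMDb unique id (fake entry)
    title: "Example Movie",
    year: 1995,
    budget: 25000000,        // converted to US dollars
    openingBoxOffice: 10000000,
    rating: 7.2,
    people: ["nm0000001"]    // ids pointing into the people table
  }
};

const people = {
  "nm0000001": {
    name: "Example Actor",
    role: "actor",
    movies: ["tt0000001"]    // ids pointing back into the movies table
  }
};

// The cross-references let us walk from a person to all their movies,
// and symmetrically from a movie to all its people.
function moviesOf(personId) {
  return people[personId].movies.map(id => movies[id]);
}
```

Storing the relation in both directions trades storage for lookup speed, which matters for a fully client-side tool where everything is computed in the browser.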
Although we were not supposed to enter additional information ourselves, this feature was pointed out as very interesting by one of the analysts we got feedback from. Once the dimensions are set up, we can select/unselect them, making the selected dimensions appear in an adjacency matrix view (Fig. 1(d)). Each cell is the intersection of two dimensions: it represents all the movies of the database having these two dimensions. We also propose a Budget dimension, with which are associated all the movies whose budget lies in an interval around the budget of the currently analyzed movie.

Cell visualizations

It is possible to switch between different visualizations for the cells. With the linechart, barchart, and striped chart views, both the opening box office (in red in the figure on the right) and the ratings (in blue) are shown, but obviously at different scales. The color and scale of the visualizations can be set using the widgets shown in Figure 1(e). Using these visualizations, one may observe the distribution through different views. These views are used to analyze the
dimensions and find outliers or trends for the next steps of the analysis. The last cell visualization is a scatterplot in which each dot represents a movie, turning the adjacency matrix into a scatterplot matrix. In this view, the x axis is the rating and the y axis the opening box office. Once again, the distribution of the movies along these two dimensions is visualized. We also assign each dot a grayscale value according to the number of dimensions the associated movie has in common with the explored movie. The darkest dots are very similar to the target movie, while the lightest share only a few dimensions with it. For instance, when exploring a movie such as Wolverine, the X-Men series of movies appears in a very dark color.

Weighting dimensions

Once we have explored the different dimensions using the matrix view, we select the dimensions we estimate to be of interest by clicking on the dimension headers. For each selected dimension, a slider is created on the right of the interface (Fig. 1(f)). This slider sets the weight of the dimension: sliders range from 0 to 1000 and their initial value is 500. The analyst's expertise is crucial for this weighting process: depending on the user, the weightings may vary a lot, and so does the result. The process is subjective, and limited knowledge of the dimensions on the analyst's part would lead to bad weightings and results. Six other sliders are available. These sliders, named [1-6]D, weight movies according to the number of dimensions they have in common with the explored movie (Fig. 1(g)); note that the 6D slider is actually a 6D+ slider. They are very useful when the explored movie is, for instance, the new opus of a series, and more generally whenever a movie looks very similar to several others in terms of dimensions.
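The weighting scheme above can be sketched as follows, assuming a simple combination rule (the text does not give the exact formula, so both the rule and the names below are our assumptions): each selected dimension has a slider weight in [0, 1000] with default 500, and the [1-6]D sliders scale a movie according to how many dimensions it shares with the explored movie, capped at 6.

```javascript
// Hypothetical weights; dimension names are made up.
const dimensionWeights = {          // per-dimension sliders (Fig. 1(f))
  "actor:Example Actor": 500,
  "genre:Action": 250
};
// [1-6]D sliders (Fig. 1(g)); index = number of shared dimensions,
// index 6 standing for "6 or more".
const commonDimWeights = [0, 500, 500, 700, 900, 1000, 1000];

function movieWeight(movieDims) {
  // Sum the slider weights of the selected dimensions the movie shares...
  const shared = movieDims.filter(d => d in dimensionWeights);
  const base = shared.reduce((sum, d) => sum + dimensionWeights[d], 0);
  // ...then scale by the [1-6]D weight for its number of shared dimensions.
  const k = Math.min(shared.length, 6);
  return base * commonDimWeights[k] / 1000;
}
```

Under this rule a movie sharing both dimensions outweighs one sharing only the genre, mirroring the grayscale encoding of the scatterplot dots.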
Estimation views

When the slider values are changed, two estimation views are updated: the opening box office view and the rating view (Fig. 1(h,i)). These views consist of one or several linecharts, depending on the current mode, which plots either each dimension separately (Fig. 3(a)) or the average of the dimensions (Fig. 3(b)). The slider weights change the shape of each associated dimension linechart as well as of the average linechart. The x axis goes from 0 to 10 for the rating and from 0 to the maximum value in the dataset for the opening box office. The y scale is the number of movies for each value on the x scale. Because an actor has fewer movies than a genre, for instance, the views are often stretched by dimensions with many movies (Fig. 2(1), brown linechart). To give less weight to these dimensions, we adjust the sliders while observing the feedback in the views, until we are satisfied with the importance of each dimension (Fig. 2(2-3)). The linecharts in the estimation views are colored according to their type: in Figure 2, the brown linechart is the genre and the blue ones are the actors. The colors are consistent with those used for the dimension selection and the matrix headers.

Figure 2: visually weighting a dimension
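The averaged estimation curve, and the brushed-area readout used in the focus views, could be sketched as follows. This is a minimal interpretation under stated assumptions: the function names are ours, the average curve is taken as the slider-weighted mean of per-dimension histograms, and the brushed-area value is read as a count-weighted mean of the x values inside the brush; none of this is the actual CinemAviz implementation.

```javascript
// Each selected dimension contributes a histogram (movie counts per x
// bin, x being the rating or the opening box office); the average curve
// is the slider-weighted mean of those histograms.
function weightedAverageCurve(histograms, weights) {
  // histograms: { dimName: number[] }  counts per x bin
  // weights:    { dimName: number }    slider value in [0, 1000]
  const dims = Object.keys(histograms);
  const nBins = histograms[dims[0]].length;
  const totalWeight = dims.reduce((sum, d) => sum + weights[d], 0);
  return Array.from({ length: nBins }, (_, i) =>
    dims.reduce((sum, d) => sum + weights[d] * histograms[d][i], 0) / totalWeight
  );
}

// Reading a prediction from a brushed x-range: the count-weighted mean
// of the x values whose bins fall inside the brush.
function brushedAverage(xs, counts, x0, x1) {
  let sum = 0, total = 0;
  xs.forEach((x, i) => {
    if (x >= x0 && x <= x1) { sum += x * counts[i]; total += counts[i]; }
  });
  return total > 0 ? sum / total : NaN;
}
```

Lowering the slider weight of a dominant dimension (a genre with many movies, say) flattens its contribution to the average curve, which is the visual weighting process illustrated in Figure 2.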
Estimations are performed using the two focus views, going back and forth with the dimension weighting step. The interactions available in the focus views are illustrated in Figure 3. Moving the mouse triggers the inspector and displays the value at the current mouse position (Fig. 3(c)); the average value of all the weighted dimensions is shown as a red line (Fig. 3(d)); brushing in the area makes a selection rectangle appear (Fig. 3(e)); and the average value within the brushed area is the orange line at its center, with the associated value (Fig. 3(f)).

Figure 3: exploring the focus views

Discussion and conclusion

CinemAviz is based only on visual exploration and visual decision. As the challenge requested that the user have an important role, we did not use advanced mathematical models to automate the process, and instead focused on a tool where the user's expertise is crucial and the interface only presents the data. Consequently, to obtain good results with the tool, the analyst needs a very good knowledge of cinematography. We found that making accurate predictions was difficult when the dimensions of the explored movie were dimensions of only a few movies in the database. Because the tool is based on previous movies, it is, for instance, not well suited to predicting independent movies or movies with unknown actors or directors. We think we did quite a good job overall, according to the different results and recognitions we got during the challenge, and we are globally satisfied with the results we obtained, in particular for the viewer rating prediction. We explain this in detail in question 5 of the next part of the document. Several improvements could enhance the analysis. For example, being able to brush each cell of the matrix, in the scatterplot mode or any other, would allow filtering movies and removing outliers with higher precision, as is for instance the case with ScatterDice.
We could also take into account other dimensions such as the length of a movie, the period of its release (box offices are often higher during summer), and the production company. Finally, we really enjoyed the challenge and the development of CinemAviz. Besides, we realized that our tool is really helpful for finding movies similar to others, although that was not its original purpose. We used it a lot for personal research and discovered movies we loved, based on their similarities with our favorite movies, actors, or directors.
Questions

1) What data factors, alone or in combination, were most useful for predicting possible outcomes?

The most useful factors were star actors and directors, because their casting in a movie may attract or, on the contrary, repel spectators. Less pertinent dimensions were cinematographer, composer, or costume designer: either they were involved in only a few movies, or they were involved in huge numbers of movies of various genres, with various opening box offices and ratings, making them unreliable. The genres of the movies are subject to the same remark, because each genre has top movies and very bad ones. Another very important dimension was the budget. Indeed, we quickly realized that the budget highly impacts the opening box office, while this is far less the case for the rating. This is easily explained by the fact that a blockbuster movie is extremely advertised, that star actors are in the casting, and often that incredible special effects are shown, making a very good trailer. However, while spectators can be misled into spending money on a movie even if it is not worth it (they have not seen the movie yet), they rate the movie once they have seen it, and many high-budget movies end up with a bad viewer rating. Finally, a crucial data factor for predicting a movie's opening box office and rating was the number of dimensions it shares with similar movies. The scatterplot matrix was very useful for this purpose, as were the [1-6]D sliders.

2) How did you combine factors from the structured data with factors in unstructured data and what was the impact on the results? Did you see correlations? How can a user of your system explore this combination?

We made the choice not to use so-called unstructured data (although we would not call the IMDb data structured, given the effort it required to parse the entire, and not really consistent, database).

3) Do the important factors vary by class, such as movie genre?

We did not find any difference between the movie genres.
For instance, we realized that horror or comedy movies often had a lower viewer rating, but we did not modify our analysis accordingly. Indeed, because we base our predictions on similar movies, a comedy or horror movie will be closer to movies of the same genre and will thus be impacted by their scores. However, a very important factor, as explained before, is the budget for the opening box office, and the actors and directors, when famous, for both the opening box office and the viewer rating.

4) Did you use data on previous movies to help analyze/predict outcomes for later movies? If so, how?

We are not sure whether the question is about using results from our previous analyses for the later ones, or about using data on previous movies. If the question is about using data on previous movies, then of course we did: it is the core of CinemAviz, which is based only on the IMDb database. If the question is about our previous analyses, then the answer is also yes. When we started the challenge, it was already in its second phase and we had a lot of past results to exploit. Once again, because we use only the IMDb data, we can predict a movie at any time, without being constrained by the temporality of social media data. We trained ourselves and iteratively developed our tool using the previously released movies of the challenge, comparing our predictions with the actual results. This was our training phase, and it was essential for our later estimations.
5) For any prediction that you had a significant margin of error (for our challenge, this would be a high mean relative absolute error), explain possible sources of error.

We are quite satisfied with our results for the viewer rating, and we believe CinemAviz is reliable for this estimation. This indicator is highly dependent on the dimensions of the analyzed movie, and we finally obtained an average absolute error of 0.7 for the rating, with some very precise predictions. The opening box office estimation was less accurate. Although we had, for the July 12 results, the best opening box office prediction made at that date, our opening box office predictions were not always that accurate, and we also had several bad ones. We partly explain this by the fact that the tool relies heavily on the analyst's knowledge and expertise, and we have to admit that we are not experts in all kinds of movies. Our predictions were often very wrong for types of movies that we are not interested in (e.g., horror, family comedy). We truly think that CinemAviz can give accurate results as long as the user knows the topic very well. Only they will know which weight to give to a dimension, and this may vary a lot depending on the context, the analyst, and their subjective preferences for actors, genres, or directors. We also think that the opening box office would be easier to predict using social media data; this is one of the limitations of our tool. The opening box office is not influenced only by the dimensions of the movie, but also, for example, by its release date, by other movies released the same week, and by social events occurring at the same time (vacations, sport events, etc.) that we did not consider.

6) What data trends if any were you able to identify? How did the identification of trends affect / shape predictions? Did you see instances where early data about a movie was contradicted by later data/factors?
This question, once again, concerns the unstructured data we chose not to use.