How to predict IMDb score


How to predict IMDb score
Jiawei Li, A53226117
Computational Science, Mathematics and Engineering
University of California San Diego
jil206@ucsd.edu

Abstract. This report is based on a dataset provided by IMDb (the Internet Movie Database). The dataset contains various features of each movie along with its IMDb score. We first analyze these basic features, looking for patterns and relationships among them. We then use those patterns to predict the scores of movies, compare the predictions with the real scores, and finally compare our model with existing models to see how well it performs.

I. Analysis of the Dataset

A. Description

Normally we have to wait until enough people have voted on a movie to see how good it is, which can take some time. Is there a way to predict a movie's score before enough people have voted for it? That is the question this report investigates.

The data analyzed here was downloaded from [1] //ftp.fu-berlin.de/pub/misc/movies/database/, an FTP site hosting a subset of the IMDb plain-text data files. IMDb is an online database of information related to films, television programs, and video games, including cast, production crew, fictional characters, biographies, plot summaries, trivia, and reviews; it is operated by IMDb.com, Inc., a subsidiary of Amazon.com. The data is spread across several files, but I rearranged them into a single CSV file to make it easier to work with. Thanks to Hadley on GitHub [2] for some help on how to rearrange the data; part of my code for constructing the data file was inspired by his original code at https://github.com/hadley/data-movies. The whole package is very large, and some of the provided files contain nothing worth exploring, for instance the movie_link file.
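To give a concrete picture of the rearranged file, here is a minimal sketch of reading such a CSV with Python's standard csv module. The two rows below are made-up stand-ins (hypothetical titles, budgets, and scores), not records from the real 50,000-movie file, and the "|" genre separator is an assumption for illustration:

```python
import csv
import io

# A tiny made-up stand-in for the rearranged data.csv (hypothetical values).
sample = io.StringIO(
    "movie_title,title_year,genres,budget,num_voted_users,imdb_score\n"
    "Example Movie A,1994,Action|Sci-Fi,25000000,800000,8.7\n"
    "Example Movie B,2005,Comedy,3000000,1200,5.9\n"
)

reader = csv.DictReader(sample)
movies = list(reader)

# Each record carries the movie's features together with its IMDb score.
for row in movies:
    print(row["movie_title"], row["imdb_score"])
```

With the real file, replacing the io.StringIO object by open("data.csv") gives the same record-per-movie access pattern used throughout the analysis below.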
In the end, the main data.csv file contains the following fields:

1: 'genres': the genre(s) of the movie
2: 'director_name'
3: 'actor_3_name', 'actor_2_name', 'actor_1_name'
4: 'num_voted_users'
6: 'num_user_for_reviews'
7: 'color', 'language'
8: 'gross', 'plot_keywords'
9: 'duration', 'budget'
10: 'content_rating': the PG rating of the movie
11: 'movie_facebook_likes', 'cast_total_facebook_likes'
12: 'title_year': the year the movie was released
13: 'country': the country the movie was made in
14: 'movie_title'

15: 'imdb_score'

These are the basic pieces of information we have for each movie. Some of them are not used in this report; they were included in case further analysis needs them. The dataset contains 50,000 movies in total. The original files are much larger than that, but I chose this subset on purpose, since my computer may not be able to handle a dataset of the full size efficiently.

B. Interesting Discoveries

For the analysis, I separated the data into three parts, train/validation/test; the training part contains 15,000 movies, and I made some very interesting findings on this data.

1: Movie year and movie score

The IMDb score may be the most important part of this dataset: most people visit IMDb precisely because they want to see how good or bad a movie is, and that is conveyed by the IMDb score. Most scores come from fans, though some experts rate the movies as well. To make the pattern more visible, I used a scatter plot of score against the year the movie was released:

Figure 1.1

The x-axis of Figure 1.1 is the year and the y-axis is the score the movie got. We can see that the number of movies grows over time, especially after the 1980s, and the number of low scores grows with it. Looking further back, however, old films seldom received low scores: only one movie from before the 1960s scored below 5. This suggests that most users are inclined to give high scores to old movies, perhaps because some of those movies are seen as representative of their era; they were more than just films, they were art, and that could earn them this kind of high reward.

2: The relationship between budget and movie score

The common assumption is that a higher budget means higher quality, as with Titanic or Avatar. But is that true? Analyzing the relationship between budget and IMDb score gives the scatter plot below.

Figure 1.2

Figure 1.2 may look a little strange: the budget numbers are so large that matplotlib rescaled the x-axis automatically, and I later found that some movies have a budget of 0, which was probably caused by update mistakes on the original website. The overall pattern is clear, though. Along the lower edge, the score increases as the budget increases; along the upper edge, the score actually decreases as the budget rises. The scores eventually converge to somewhere near 7, and beyond that point further increases in budget change the score very little. This leads to an interesting conclusion: a rising budget can guarantee that a movie will not score very low, but a budget can at most carry a movie to a score near 7; after that, something else is needed to raise the score.

3: The number of votes

The score of a movie is determined by its voters, but a score of 8 from 100 users is more convincing than a score of 9 from a single user. This made me curious whether the number of votes itself affects the IMDb score. The result is shown below:

Figure 1.3

The x-axis is the number of votes and the y-axis is the score. Figure 1.3 shows a very clear increasing curve: the more people vote for a movie, the higher its IMDb score tends to be. After all, if a movie is terrible, no one will care about it, and no one will vote for it.

4: Which genres do people like?

Although different people have different tastes, there may be kinds of movies that most people like. To analyze this I made the two figures below: Figure 1.4 shows the number of films of each genre, and Figure 1.5 shows the average score each genre got.
Figure 1.4 Figure 1.5

From Figures 1.4 and 1.5 we can see there are 26 genres in total in the data. Although there are not many Film-Noir movies in Figure 1.4, the genre gets a fairly high score in Figure 1.5; considering that most Film-Noir movies

were produced a long time ago, this actually matches the pattern we found in Figure 1.1. Most genres show no big differences from each other, but a few, such as Game-Show and Reality-TV, did not attract people.

5: The number of movies a director made and the number of movies an actor appeared in

The IMDb score is set by voters; IMDb is not a prediction website and does not predict what kind of movie will be good or bad. But quality is strongly connected to the director and the actors: the more movies a director has made, the higher their scores should be, since experience matters a great deal, and the same goes for actors. So I plotted the number of movies each director made (or each actor appeared in) against the score, producing Figures 1.6 and 1.7:

Figure 1.6 Figure 1.7

The x-axis of both figures is the number of movies a director made, or an actor/actress appeared in as actor_1, 2, or 3; the y-axis is the score. The first figure shows that as the number of movies a director has made increases, the average score of their movies gets higher and higher; although some are not stable and may drop at times, the overall pattern is quite obvious. For actors the trend is not as clear as for directors, since actors appear in many more movies than directors make. Still, in the early part of the figure, where most actors have not appeared in many movies, their scores jump around in a huge range; as an actor/actress appears in more movies, his or her average score converges to some point. This may reflect that their acting skill becomes more stable with more movies, which will be very useful information for us later in this study.

II. Predictive Task

IMDb scores are based mostly on user votes. As a result, when not many users have voted on a movie yet, its score may not represent its quality precisely. So how do we know whether a movie is good or bad before it gets enough votes?
There are indeed features that help define the quality of a movie, such as the ones discussed in Chapter I, and these are what we will use to build our model.

A. Description of the Prediction Task

We are going to use the features we have to predict the score of each movie, and to see whether the score on IMDb reflects the quality of the movie.

B. Preparation of the Data

I separated the data into three parts, train/validation/test: the training part contains the 15,000 movies shown in Chapter I (and analyzed further below), the validation set contains 15,000 movies, and the test set contains 20,000 movies. The original dataset may come in some particular order, and this causes the problem that some

actors or directors who appeared in the early movies do not show up in the rest of the data, so I shuffled the data to reduce this error. There were also some odd records where the score or the name of an actor or director was null; these would cause comparison errors and could affect the results, so I removed them from the dataset. It turned out that 323 of the 50,000 records had a null field, which, as noted earlier, was probably caused by update errors on the source website.

C. Features

The features I use are the ones described in Chapter I, but they are used in different ways: I apply different models to different features, as briefly described in Chapter III. The features are:

1: Genres. From Figure 1.5 we can see that although most genres get similar scores, some genres attract people strongly and get high scores while others do not and can only get lower scores. This may not play a significant part in the main model, but genre does affect how people vote. In the dataset a movie's genre is not single-valued; most entries look like genres={ Action Adventure Fantasy Sci-Fi }, which can bring in lots of fans from different areas: some may like Action movies, some may like Sci-Fi, and this can potentially increase the movie's score.

2: The budget. As discussed in Chapter I, the budget does only limited work on a film's score, but that is actually what we need, since it sets a baseline for a movie. Even if a movie is not as good as others, a large budget still guarantees that its score will not drop below some baseline.
3: Scores for the actors/actresses and the director. These are the most important features in my model. Since the prediction task is mainly to predict the score before a movie gets enough votes, we need something that defines the quality of the movie; our model is not intelligent enough to tell whether the story is good or bad, so we can only predict from the director and the actors it has. From Chapter I we know that a director's average score rises with the number of movies they have made, which means a movie made by a better director tends to get a better score, and the same holds for actors. There is one more subtlety for actors: the dataset gives actor_1, 2, 3 for each movie, that is, the main actor and the supporting actors, and the role an actor takes can also affect the score. For instance, if a highly skilled actor takes the actor_3 role and shows up for only 10 minutes while weaker actors get the main parts, the movie may not get a high score. We take this into account when building the model.

4: The number of votes. Although our model is meant to predict the score before a movie gets enough votes, the vote count plays a very important part in the IMDb score, as discussed in Chapter I, so I use it as a feature to adjust the model and reduce the error of the result.

III. Prediction Models

A. Linear Regression

The first model we use is linear regression [3], applied to the genres. Since a movie can have multiple genres, we use formula (1) below:

Y = α + θ(1)*X(1) + θ(2)*X(2) + θ(3)*X(3) + ... + θ(25)*X(25) + θ(26)*X(26)  (1)

There are 26 genre types, and each θ(n) in the formula corresponds to one type: X(n) is 1 if the movie's genres contain that type and 0 otherwise. For example, the X vector for genres={Action, Biography} has a 1 in the positions for Action and Biography and 0 everywhere else, like {1, 1, 0, 0, ..., 0}.
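As a sketch of formula (1), the toy Python snippet below one-hot encodes genres and fits α and the θ weights by plain gradient descent on the squared error. The four genres and four (genres, score) pairs are made up for illustration; the real model uses all 26 genres and the 15,000-movie training set:

```python
# Toy sketch of formula (1): score = alpha + sum over n of theta(n) * X(n),
# where X(n) is 1 if the movie has genre n and 0 otherwise.
genre_list = ["Action", "Biography", "Comedy", "Drama"]  # 26 in the real data

def one_hot(genres):
    return [1.0 if g in genres else 0.0 for g in genre_list]

# Hypothetical (genres, imdb_score) pairs, not real training data.
movies = [({"Action", "Biography"}, 7.2),
          ({"Comedy"}, 5.9),
          ({"Drama"}, 7.8),
          ({"Action", "Comedy"}, 6.1)]

rows = [[1.0] + one_hot(g) for g, s in movies]   # leading 1.0 is the alpha term
targets = [s for g, s in movies]

params = [0.0] * (1 + len(genre_list))           # [alpha, theta(1), theta(2), ...]
for _ in range(20000):                           # gradient descent on the squared error
    grads = [0.0] * len(params)
    for x, t in zip(rows, targets):
        err = sum(p * xi for p, xi in zip(params, x)) - t
        for j, xi in enumerate(x):
            grads[j] += 2.0 * err * xi / len(rows)
    params = [p - 0.05 * g for p, g in zip(params, grads)]

def predict(genres):
    x = [1.0] + one_hot(genres)
    return sum(p * xi for p, xi in zip(params, x))
```

Since this toy system is tiny and consistent, the fitted weights reproduce the toy scores almost exactly; on the real 26-genre data, genres alone leave a much larger residual error.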
I first implemented this model on the training set of my dataset, without any other features, to see how it works, and got the following result:

Table 3.1
'Action'       -0.228698
'Family'       -0.32113794
'Reality-TV'    0.12676543
'Biography'     0.3074218
'Musical'       0.07295079
'Game-Show'    -3.50044351
'Drama'         0.5036412
'Adventure'     0.27333117
'Documentary'   0.76480744
'War'           0.27005743
'Thriller'     -0.23288836
'Film-Noir'     0.94627338
'Sci-Fi'        0.09841337
'Comedy'       -0.30408267
'Horror'       -0.48745732
'Music'        -0.22143861
'News'          0.16036248
'Sport'        -0.02754539
'Mystery'       0.14913146
'Animation'     0.5141327
'History'       0.08222791
'Short'         0.1698352
'Romance'      -0.07519327
'Western'       0.0835315
'Fantasy'       0.07070337
'Crime'         0.19230911

These are the θ values for the different genres, which represent the weight each genre carries when it appears in a movie's genre list. I then used this model directly on the validation set and computed the MSE:

MSE = (1/N) * Σ(predicted score - actual score)^2  (2)

The MSE for this simple model is 1.3758544115905825, which is not very accurate, since the genres of a movie alone cannot judge whether it is a great movie. One thing I was initially afraid of was that some genre might appear in the validation set but not in the training set; testing showed that the first 15,000 records contain all the genres we need, and since I eliminated all records with null fields while processing the data, this simple test runs without problems.

B. ABV

This model is used to adjust the result of the main model: when I use the basic features of a movie (actors, director) to predict a base score, a movie with more votes may deserve a noticeably higher score than that base score. I first tried an SVM for this part, since the pattern in the budget is not easy to pin down, but the result turned out to be terrible: the validation MSE was 1.4897632561, far worse than I expected, so I decided to switch to another model.
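For reference, the MSE quoted throughout is the usual mean squared error of formula (2); a minimal sketch, with made-up predictions and scores rather than real model output:

```python
# Formula (2): mean squared error between predicted and actual IMDb scores.
def mse(predicted, actual):
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)

# Hypothetical predictions against hypothetical true scores.
error = mse([7.0, 6.5, 8.0], [7.4, 6.0, 8.1])  # average of the squared errors
print(error)
```

All the model comparisons below use this same quantity on held-out data.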
The reason this model may fail, I think, is the crowding of the data: in Figure 1.2 most of the data is crowded into one small area, while a few points, like those in the 10~20 range, lie far away from the main part. This can raise the error of an SVM model by making the pattern even less clear. I then decided to use ABV based on the budget and the number of votes. These two features behave very similarly: the score increases as the budget or the vote count increases. The only difference is that in the budget figure the score finally converges to a certain point, while in the vote figure the score keeps increasing in the pattern of a logarithmic function. The basic formula is:

Y = α + Y(genres) + log(budget) + log(vote_number)  (3)

Y(genres) is the model from Chapter III A; I add it to this model to reduce the potential error. After the first round of testing I got MSE = 1.36789354688, a little smaller than the Y(genres) model alone, so I changed the formula to:

Y = α + Y(genres) + log(budget^2) + log(vote_number)  (3')

The MSE dropped a little further, to 1.3297732145. Although this is better than the genres model, the reduction in error is still not impressive, and I hoped for a better model. Checking the graphs again, I noticed that in Figure 1.3 the curve does not really match a logarithmic function: the score is

starting from somewhere near 0, while the value of a typical logarithmic function starts from negative infinity. So I adjusted my model to:

Y = α + Y(genres) + log(budget^2) + log(vote_number + θ)  (4)

I changed the value of θ several times, and the MSE dropped, surprisingly, to 1.29953672 at θ = 250000. Although I kept adjusting the value afterwards, the MSE had no further reduction and just converged at around 1.30, so I suppose this is the best this model can do to predict the score. Since this is not the main part of my model, I moved on to the next one.

C. Latent-Factor Models

This is the most important part of my model: it predicts the base score of a movie from its director and its actors. The scores of the director and the actors are in turn based on the scores of the films they took part in. The basic formula of the model is:

F(d, a) = α + β(d) + β(a)

F is the score, β(d) is the weight of the director, and β(a) is the weight of an actor. However, the dataset provides actor_1, 2, 3 for each movie, which means an actor/actress may be the main actor in one movie but a supporting actor in another, and the weights for these roles should differ: if a great actor plays a very simple role and shows up for only a short time, his or her talent may not fully show in the movie, and his or her fans may still feel upset and give the movie a low score.
To make this more obvious, Figure 3.1 shows the relationship between the score and the role an actor/actress takes:

Figure 3.1

The red points represent actor_1, the green points actor_2, and the blue points actor_3. The x-axis runs over the different actors/actresses in the data, sorted by the average score they get, and the y-axis is the score. The figure shows that most people take the actor_3 role, especially in the middle part, where those actors/actresses appear in many more movies than others and the scores of their movies span a really big range, which makes it impossible to give a single unique weight to each actor/actress. That is why I changed my model to:

F(d, a) = α + β(d) + β_1(a) + β_2(a) + β_3(a)  (5)

With three different β values for the different roles an actor/actress can take in a movie, we can produce a more specific base score. To fit these parameters, I modified the objective we learned in class as follows:

argmin Σ(α + β(d) + β_1(a) + β_2(a) + β_3(a) + 0.5*Y - Y(s))^2 + λ[Σβ(d)^2 + Σβ_1(a)^2 + Σβ_2(a)^2 + Σβ_3(a)^2 + (0.5*Y)^2]
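An objective of this shape can be minimized by alternating closed-form updates, holding all parameters but one group fixed at each step. The toy sketch below illustrates the idea with only an intercept and per-director offsets; the director names and scores are made up, and the real model adds the three per-role actor terms and the 0.5*Y adjustment, which are updated in exactly the same way:

```python
# Coordinate-descent sketch for a regularized bias model:
#   score ~ alpha + beta(d), with lambda penalizing each beta(d)^2.
movies = [("Nolan", 8.5), ("Nolan", 8.8),
          ("Bay", 6.5), ("Bay", 6.0), ("Bay", 6.7)]  # hypothetical data
lam = 1.0  # regularization strength lambda

alpha = 0.0
beta = {d: 0.0 for d, _ in movies}
for _ in range(200):  # iterate until the values converge
    # Update alpha: average residual over all movies.
    alpha = sum(s - beta[d] for d, s in movies) / len(movies)
    # Update each beta(d): regularized average residual over that
    # director's movies, divided by (lambda + movie count for d).
    for d in beta:
        own = [s for dd, s in movies if dd == d]
        beta[d] = sum(s - alpha for s in own) / (lam + len(own))

base_score = alpha + beta["Nolan"]  # predicted base score for a Nolan movie
```

Each update solves exactly for one parameter group given the others, so the objective never increases, and the loop stops changing once the parameters reach the fixed point; the full model iterates its five update equations in the same round-robin fashion.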

Y(s) is the score of each movie, and Y is the result from our former model; since it is only used for adjustment, I give it a lower weight of 0.5 to reduce its effect on the main prediction. We can then find the weights for each actor role and for the director by solving the following equations:

α = Σ(Y(s) - (β(d) + β_1(a) + β_2(a) + β_3(a) + 0.5*Y)) / N  (6)
β(d) = Σ(Y(s) - (α + β_1(a) + β_2(a) + β_3(a) + 0.5*Y)) / (λ + i(d))  (7)
β_1(a) = Σ(Y(s) - (α + β(d) + β_2(a) + β_3(a) + 0.5*Y)) / (λ + i(a_1))  (8)
β_2(a) = Σ(Y(s) - (α + β(d) + β_1(a) + β_3(a) + 0.5*Y)) / (λ + i(a_2))  (9)
β_3(a) = Σ(Y(s) - (α + β(d) + β_1(a) + β_2(a) + 0.5*Y)) / (λ + i(a_3))  (10)

Here i(d) is the number of movies by that director in the training set, and i(a_k) is the number of movies in which that actor/actress appears in role k. We then iterate over equations (6)-(10) until every β and α converges; the stopping condition I set is that the difference between the former MSE and the current MSE is less than 0.0000001. To make the MSE as small as possible I also had to find an appropriate value for λ. Searching λ in the range 0 to 20, I found the lowest MSE at λ = 9; testing further around 9, I found that λ = 8.67 gives the lowest MSE of 1.08987635423. Applying my model to the validation set gives MSE = 1.102678867, and MSE = 1.105185634 on the test set, which is actually within my expectations. This is the model I built to predict the imdb_score of a movie.

IV. Literature

The data I analyzed was downloaded from //ftp.fu-berlin.de/pub/misc/movies/database/, an FTP site hosting a subset of the IMDb plain-text data files. Researching related work, I found several analyses of IMDb scores; there is even a movie dataset on Kaggle, but it contains only 5000+ movies, far fewer than the 50,000 movies I have.
One piece of their research interested me: they analyze the possibility that the number of faces on a movie's poster affects its score [4]. They also use features very similar to mine, such as the budget, and they have other interesting data, such as the Facebook likes of a director or an actor/actress, which unfortunately I lack. The amount of data they have is very limited, though, so I wonder what could be done with Facebook likes for 50,000+ movies; that could play a very important part in rating an actor/actress or director, and it would certainly increase the accuracy of my third model in Chapter III C. They also use dimensionality reduction to present a three-dimensional PCA plot, which I failed to implement, since my features are of mixed types that are hard to reduce automatically, so I just chose my features manually. They also reach a conclusion about the budget very similar to mine.

However, one of their conclusions collides with my dataset. In my model the most important parts are the rating of the director and the role each actor takes, with perhaps a small contribution from the type of the movie and the genres, but they consider the duration of the movie important. After analyzing the data, my data shows a very different result:

Figure 4.1

The x-axis is the duration of the movie and the y-axis is its score. The figure shows that some movies with short durations do much better than some long ones, and, just like the budget, increasing duration can only guarantee that a movie will not get a low score; it does not mean the movie will get a high score (at least that is what Figure 4.1 tells us). They do offer considerable evidence on this point, though, so I may need to do further research to find out which is true.

V. Conclusion

With the model built here, the MSE on the test dataset is 1.105185634, which means the score we predict from the ratings of the actors and the director is close to the score on IMDb, and the predicted score can more or less represent the quality of the movie; so we can say that most scores on IMDb do reflect the quality of a movie. In this project I used linear regression, SVM, ABV, and a latent-factor model to analyze the data, combining several of them to establish my model. Luckily, I have not run into overfitting yet. There is still room for improvement: as noted in Chapter IV, there is useful data I could not obtain, such as the Facebook likes of a director or an actor. That could take this work to the next level, since the number of fans is usually related to the popularity of an actor/actress or director, which in turn relates to their skill or talent.
I should also include more features in the model, such as country or color, or even download data from other websites such as Rotten Tomatoes, so that there are multiple scores to analyze rather than a single score from IMDb. There is plenty of future work to do if I keep analyzing this dataset.

[1] //ftp.fu-berlin.de/pub/misc/movies/database/
[2] GitHub: https://github.com/hadley/data-movies
[3] https://www.r-bloggers.com/predicting-movie-ratings-with-imdb-data-and-r/
[4] https://blog.nycdatascience.com/student-works/machine-learning/movie-rating-prediction/