Predict the Box Office of US Movies

Group members: Hanqing Ma, Jin Sun, Zeyu Zhang

1. Introduction

Our task is to predict the box office of upcoming movies from their properties, such as director, actors, and genres. The prediction is based on a dataset of movies released during the last 5 years. Box office prediction is meaningful to the producer of a movie as well as to movie theaters. The producer can estimate the box office when deciding on the director and actors for the movie. As for the theaters, a higher predicted box office means people are interested in the movie, so they can schedule more screens for it.

First, we collect a movie dataset from the Internet. Then we analyze the properties of the movies and preprocess the dataset. The next step is to fit Naïve Bayes classification and a multilayer perceptron to our dataset. Based on feedback from these trials, we adjust the split points of each movie attribute. Finally, since the box office is actually a continuous attribute, we use linear regression and compare its results with those given by the classification methods.

2. Data collection and preprocessing

1) Data collection

To get useful and convincing information about movies, we use Python to scrape all the US movie information from the Internet Movie Database (a.k.a. IMDB, www.imdb.com). We only take movies released from 2008 to 2014, which are the most useful and reasonable for predicting the box office now. More than 3000 films have an exact box office figure, of which we successfully scraped 2109. After removing duplicates of movies released in the last century, the final dataset contains 1820 films. We scraped many details about each movie, but after consideration we found that a few key properties largely decide the box office. We choose director, writer, actors, genre, release date, and producer as the features of a movie.
This keeps the learning process fast and avoids over-fitting to unessential factors.

2) Data preprocessing

Data preprocessing extracts a feature from each movie attribute. It requires some consideration of the attribute itself as well as feedback from the training model. The director, writer, and producer attributes are handled the same way. Take the director as an example: we calculate the average box office of each director appearing in our dataset, then divide the directors into three types, I, II, and III, where type I contains the group of directors with the highest average box office. At first we used a uniform division, but it did not work well; we finally let the proportions of types I, II, and III be 2:3:5.

The actors are more complex, since each movie has more than one star. Our method is to classify the actors into 3 levels as well, but for each movie the actor feature is the sum of the levels of its three stars, so the actor feature of a movie ranges from 3 to 9.

We treat the release date and the genre of a movie as nominal attributes. The release date is reduced to the month, January to December. Since there are many combinations of genres (more than 400 unique ones), we only take the first two genres into consideration. The details are shown in Table 1.
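As a rough illustration of the leveling scheme just described (average a person's box office over their films, rank them, and split in the proportion 2:3:5), here is a minimal Python sketch; all director names and box-office numbers below are made up for illustration, not taken from our dataset:

```python
# Sketch of the 2:3:5 leveling scheme for directors (hypothetical data):
# average the box office over each director's films, rank the directors,
# and assign the top 20% to level I, the next 30% to level II, and the
# remaining 50% to level III.
from collections import defaultdict

films = [  # (director, box office) -- made-up example records
    ("A", 300e6), ("A", 200e6), ("B", 150e6), ("C", 90e6),
    ("D", 40e6), ("E", 25e6), ("F", 10e6), ("G", 5e6),
    ("H", 2e6), ("I", 1e6), ("J", 0.5e6),
]

totals = defaultdict(list)
for director, box in films:
    totals[director].append(box)
avg = {d: sum(v) / len(v) for d, v in totals.items()}  # average box office

ranked = sorted(avg, key=avg.get, reverse=True)  # best director first
n = len(ranked)
cut1, cut2 = n * 2 // 10, n * 5 // 10            # 2:3:5 split points
level = {}
for i, d in enumerate(ranked):
    level[d] = "I" if i < cut1 else ("II" if i < cut2 else "III")

print(level)  # e.g. {'A': 'I', 'B': 'I', 'C': 'II', ...}
```

The same procedure is applied to writers and producers; the actor feature is then the sum of the three stars' levels per movie.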
Feature        Type        Number of values   Category method                                     Final number of values
Producer       discrete    1080               Average box office of each name, categorized        3 (levels I, II, III)
Director       discrete    1440               into levels I, II and III                           3
Writer         discrete    1558                                                                   3
Actor stars    discrete    3223               Sum of the levels of the three stars of one         6 levels
                                              movie
Genres         discrete    17                 Combination of the two genres                       25
Release date   discrete    365                Only the month is kept                              12
Box office     continuous  --                 Divided into 10 ranges                              10

Table 1. Preprocessing of the movie features

3. Algorithm

For classification we use Naïve Bayes and a multilayer perceptron.

1) Multi-class Naïve Bayes

The multi-class Naïve Bayes model is similar to the binary case:

    y* = argmax_{c in {1, ..., 10}}  P(Y = c) * prod_j P(X_j = x_j | Y = c)

We use a multivariate multinomial distribution to fit the Naïve Bayes model. It assumes each individual predictor follows a multinomial distribution within a class; the parameters for a predictor are the probabilities of all possible values that the corresponding feature can take. We obtain the parameters by (smoothed) maximum likelihood estimates:

    P(X_j = v | Y = c) = ( #{i : x_j(i) = v and y(i) = c} + 1 ) / ( #{i : y(i) = c} + m_j ),   c in {1, ..., 10}

where m_j is the number of distinct values of feature j.

2) Multilayer perceptron

A multilayer perceptron (MLP) is a feed-forward artificial neural network model that maps sets of input data onto a set of appropriate outputs. An MLP consists of multiple layers of nodes in a directed graph, with each layer fully connected to the next one. We represent the error at output node j for the n-th data point by

    e_j(n) = d_j(n) - y_j(n),

where d_j(n) is the target value and y_j(n) is the value produced by the perceptron. We then make corrections to the weights of the nodes so as to minimize the error over the entire output, given by

    E(n) = (1/2) * sum_j e_j(n)^2

Using gradient descent, the change in each weight is

    Δw_ji(n) = -η * ∂E(n)/∂v_j(n) * y_i(n),

where y_i(n) is the output of the previous neuron and η is the learning rate.

Since the box office is actually a continuous attribute, instead of classifying it we also try regression methods.

3) Linear regression

Linear regression is a relatively simple regression algorithm: it fits the box office as a linear function of the input features by minimizing the squared error.

4. Realization

1) Model

The input attributes are producer, director, writer, stars, genres, and release date. The output is the box office. For classification, the 10 classes for the estimated box office are obtained from some statistics on the training data; the result is shown in Table 2.

    Class:  1      2      3     4     5     6      7      8      9      10
    Range:  <100k  <500k  <1m   <4m   <7m   <10m   <50m   <100m  <200m  >200m

Table 2. The 10 classes of the box office

2) Cross-validation

We divide our 1819-movie dataset into 10 folds randomly, then train and predict 10 times, each time using one of the 10 folds as the testing data and the other 9 folds as the training data. This is how we evaluate how well each algorithm works.

3) Error evaluation

We use two methods to evaluate the prediction results. The first is the error rate, i.e. the fraction of test examples with 1{h(x) != y}, which tells how many test examples are predicted incorrectly. The second is the confusion matrix: in MATLAB, C = confusionmat(group, grouphat) returns the confusion matrix C determined by the known and predicted groups in group and grouphat, respectively. For example:

    C = 50  0  0
         0 47  3
         0  3 47
where C(i, j) is the count of observations known to be in group i but predicted to be in group j. It provides not only the error rate but also how far the predictions are from the actual box office.

5. Result

1) Multi-class Naïve Bayes

In the case where the producer, director, writer, and actors are divided uniformly, the average confusion matrix over the 10-fold cross-validation is

    E1 = 12 10  0  0  0  0  0  0  0
         10 16  0  0  0  0  0  0  0
          7  4  0  0  0  0  0  0  0
          2  2  0  1  0  0  0  0  0
          0  1  0  0  3  3  0  0  0
          0  0  0  1  1 14  0  1  0
          0  0  0  0  1  3  0  6  0
          0  0  0  0  0  4  0 23  3
          0  0  0  0  0  0  0  2 47

From the matrix we can see that the elements on the diagonal are the correct predictions, and most of the wrong predictions are within a distance of 2 classes from the true box-office class.

In the case where the producer, director, writer, and actors are divided in the proportion 2:3:5, the average confusion matrix over the 10-fold cross-validation is

    E2 =  7  6  2  0  0  0  0  0  0
          3 12  0  0  0  0  0  0  0
          3  2 22  0  1  2  0  0  0
          0  0  3  0  1  1  0  0  0
          0  0  3  0  1  3  0  0  0
          0  0  4  0  0 14  0  0  0
          0  0  1  0  0  4  2  3  0
          0  0  0  0  0  1  2 16 13
          0  0  0  0  0  0  0  5 43

Comparing E2 with E1, the trace of E2 is larger than that of E1, which means the error rate decreases.

2) Multilayer perceptron

For the multilayer perceptron, we use 11 nodes to build the model. The relation between the predicted box office and the true box office is shown in Figure 1.
Figure 1. The predicted box office versus the true box office using the multilayer perceptron

An advantage of the multilayer perceptron is that we can inspect the weight of each attribute and determine its influence on the output. From the trained perceptron, we find that director and writer are the most influential properties.

3) Linear regression

Instead of classification, we try a regression method on the continuous box office, and find that director and writer have the closest-to-linear relation to the box office.

Figure 2. (a) Approximate linear relation between writer and box office. (b) Approximate linear relation between director and box office.

Finally, to compare the three methods, we draw a table showing the mean error rate and the mean absolute error over the 10-fold cross-validation.
                      Multi-class Naïve Bayes              Multilayer    Linear
                      uniform classes    2:3:5 classes     perceptron    regression
Mean error rate       35.6%              33.5%             --            --
Mean absolute error   --                 --                31.5%         72.8%

Table 3. Comparison of the three different methods

From the results above, it seems that the multilayer perceptron has the best performance.

6. Conclusion and future work

In this project, we learned how to scrape data from the Internet, how to prepare the data for basic machine learning algorithms, and how to decide which algorithm is suitable. It was a meaningful project. In the future, we will use our model to predict some upcoming movies and check whether our predictions are precise. What's more, we will consider dependencies among the movie features. For example, the release date and genre may be related, some actors may have more influence in particular genres, comedy movies may be more popular at Christmas, and so on. We will try to use these dependencies to improve our model.
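To make the classification pipeline of Sections 3–5 concrete, the following is a minimal sketch of a multi-class Naïve Bayes classifier with add-one (Laplace) smoothing, evaluated by 10-fold cross-validation with an error rate and a confusion matrix. It runs on synthetic stand-in data, not our movie dataset: the feature generator, the number of classes, and the noise level are all invented for illustration.

```python
# Sketch (synthetic data, NOT the movie dataset) of the pipeline in
# Sections 3-5: categorical Naive Bayes with add-one (Laplace) smoothing,
# evaluated by 10-fold cross-validation (error rate + confusion matrix).
import math
import random
from collections import Counter, defaultdict

random.seed(0)
NUM_CLASSES = 3  # stand-in for the 10 box-office ranges

def make_movie():
    """Return (features, class); the class correlates with the director level."""
    director = random.choice(["I", "II", "III"])
    month = random.randrange(12)                 # uninformative second feature
    cls = {"I": 2, "II": 1, "III": 0}[director]
    if random.random() < 0.3:                    # label noise
        cls = random.randrange(NUM_CLASSES)
    return (director, month), cls

data = [make_movie() for _ in range(500)]
# Number of distinct values per feature (the m_j in the smoothing formula).
vocab_size = [len({x[j] for x, _ in data}) for j in range(2)]

def train(rows):
    prior = Counter(c for _, c in rows)          # class counts
    counts = defaultdict(Counter)                # (feature j, class) -> value counts
    for x, c in rows:
        for j, v in enumerate(x):
            counts[(j, c)][v] += 1
    return prior, counts

def predict(model, x):
    prior, counts = model
    n = sum(prior.values())
    def log_posterior(c):
        s = math.log(prior[c] / n)
        for j, v in enumerate(x):
            # P(X_j = v | Y = c) with add-one smoothing
            s += math.log((counts[(j, c)][v] + 1) / (prior[c] + vocab_size[j]))
        return s
    return max(prior, key=log_posterior)

# 10-fold cross-validation.
random.shuffle(data)
folds = [data[i::10] for i in range(10)]
confusion = [[0] * NUM_CLASSES for _ in range(NUM_CLASSES)]
errors = total = 0
for i in range(10):
    model = train([r for j in range(10) if j != i for r in folds[j]])
    for x, y in folds[i]:
        yhat = predict(model, x)
        confusion[y][yhat] += 1                  # known class y, predicted yhat
        errors += yhat != y
        total += 1

print("10-fold error rate: %.3f" % (errors / total))
for row in confusion:
    print(row)
```

The trace of the confusion matrix counts the correct predictions, which is exactly how the report compares E1 and E2 above; swapping in the real movie features and the 10 box-office ranges would reproduce the reported experiment.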