Predict the box office of US movies


Group members: Hanqing Ma, Jin Sun, Zeyu Zhang

1. Introduction

Our task is to predict the box office of upcoming movies from their properties, such as director, actors, and genres. The prediction is based on a dataset of movies released during the last five years. Box office prediction is useful to a movie's producer as well as to movie theaters: the producer can estimate the box office when choosing the director and actors for the movie, while for theaters a higher predicted box office suggests strong audience interest, so they can allocate more screens to the film.

First, we collect a movie dataset from the Internet. Then we analyze the properties of the movies and preprocess the dataset. The next step is to fit the dataset with Naïve Bayes classification and a multilayer perceptron. Based on feedback from these trials, we adjust the split points of the movie attributes. Finally, since the box office is actually a continuous attribute, we also use linear regression and compare its results to those of the classification methods.

2. Data collection and preprocessing

1) Data collection

To obtain useful and reliable information about movies, we use Python to scrape US movie information from the Internet Movie Database (IMDb, www.imdb.com). We only collect movies released from 2008 to 2014, which are recent enough to be useful for predicting box office today. More than 3000 films have exact box office figures, of which we successfully scrape 2109. This set contains some duplicates of movies released in the last century; after removing them, the final dataset contains 1820 films. We scrape many details about each movie, but after consideration we find that a few key properties decide the box office. We choose director, writer, actors, genre, release date, and producer as the features of the movie.
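The de-duplication step above (reducing the 2109 scraped records to 1820 unique films) can be sketched as follows; this is a minimal illustration, and the record fields (title, year) are hypothetical placeholders for whatever the scraper actually stores:

```python
# De-duplicate scraped movie records by (title, year).
# Field names are illustrative, not the actual scraper output.
def dedupe_movies(records):
    seen = set()
    unique = []
    for movie in records:
        key = (movie["title"].strip().lower(), movie["year"])
        if key not in seen:  # keep only the first copy of each film
            seen.add(key)
            unique.append(movie)
    return unique

films = [
    {"title": "Movie A", "year": 2010},
    {"title": "Movie B", "year": 2012},
    {"title": "Movie A", "year": 2010},  # duplicate entry
]
print(len(dedupe_movies(films)))  # 2
```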
Restricting the model to these features makes the learning process much faster and avoids overfitting to unessential factors.

2) Data preprocessing

Preprocessing turns each movie attribute into a feature. It requires some consideration of the property itself together with feedback from the training model. Director, writer, and producer are handled with the same method; take the director as an example. We compute the average box office of each director in our dataset, then divide the directors into three types, I, II, and III, where type I contains the directors with the highest average box office. At first we used a uniform division, but it did not work well; we finally set the proportions of types I, II, and III to 2:3:5.

The actors are more complex, since each movie has more than one star. Our method also classifies the actors into three levels, but the actor feature of a movie is the sum of the levels of its three stars, so it ranges from 3 to 9.

We treat the release date and the genre of the movie as nominal attributes. The release date is reduced to the month, January to December. Since there are too many combinations of genres (more than 400 unique ones), we only take the first two genres into consideration. The details are shown in Table 1.
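The 2:3:5 leveling described above can be sketched as follows. This is a minimal version, assuming the average box office per director has already been computed; the director names and the rounding rule are illustrative:

```python
# Assign level I to the top 20% of directors by average box office,
# level II to the next 30%, and level III to the bottom 50% (2:3:5 split).
def assign_levels(avg_box_office):
    ranked = sorted(avg_box_office, key=avg_box_office.get, reverse=True)
    n = len(ranked)
    cut1, cut2 = round(0.2 * n), round(0.5 * n)  # 20% / 30% / 50% boundaries
    levels = {}
    for i, name in enumerate(ranked):
        levels[name] = "I" if i < cut1 else ("II" if i < cut2 else "III")
    return levels

# 10 hypothetical directors with decreasing average box office.
avgs = {"d%d" % i: 1000 - i for i in range(10)}
print(assign_levels(avgs))
```

The same function applies unchanged to writers and producers, since the report uses one method for all three.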

Feature       | Type       | Number of values | Category method                                         | Final number of values
producer      | discrete   | 1080             | average box office per name, categorized into I, II, III | 3
director      | discrete   | 1440             | average box office per name, categorized into I, II, III | 3
writer        | discrete   | 1558             | average box office per name, categorized into I, II, III | 3
Actor stars   | discrete   | 3223             | sum of the levels of the movie's three stars             | levels 3 to 9
Genres        | discrete   | 400+ combinations| use the combination of the first two genres              | 25
Release date  | discrete   | 365              | keep only the month                                      | 12
Box office    | continuous | --               | divided into 10 ranges                                   | 10

Table 1: preprocessing of the movie features

3. Algorithms

For classification we use Naïve Bayes and a multilayer perceptron.

1) Multi-class Naïve Bayes

The multi-class Naïve Bayes model is similar to the binary case: for a movie with features x_1, ..., x_n, predict

  y = argmax_{c in {1, ..., 10}} P(y = c) * prod_j P(x_j | y = c)

We use a multivariate multinomial distribution to fit the Naïve Bayes model: it assumes each individual predictor follows a multinomial distribution within a class, so the parameters for a predictor are the probabilities of all possible values that the corresponding feature can take. We obtain the parameters by maximum likelihood estimation with Laplace smoothing:

  P(x_j = k | y = c) = ( sum_i 1{x_j^(i) = k and y^(i) = c} + 1 ) / ( sum_i 1{y^(i) = c} + K_j )

where K_j is the number of values that feature j can take.

2) Multilayer perceptron

A multilayer perceptron (MLP) is a feed-forward artificial neural network model that maps sets of input data onto a set of appropriate outputs. An MLP consists of multiple layers of nodes in a directed graph, with each layer fully connected to the next one. We represent the error at output node j on the n-th data point by e_j(n) = d_j(n) - y_j(n), where d_j(n) is the target value and y_j(n) is the value produced by the perceptron. We then make corrections to the weights of the nodes so as to minimize the error in the entire output,

  E(n) = (1/2) * sum_j e_j(n)^2

Using gradient descent, the change in each weight is

  Δw_ji(n) = -η * (∂E(n)/∂v_j(n)) * y_i(n)

where y_i is the output of the previous neuron, v_j is the induced local field of neuron j, and η is the learning rate.

Since the box office is actually a continuous attribute, instead of classifying the box office we also try regression.

3) Linear regression

Linear regression is a relatively simple regression algorithm; it fits the box office as a linear function of the features.

4. Realization

1) Model

The input attributes are producer, director, writer, stars, genres, and release date; the output is the box office. For classification, the 10 box office classes are derived from statistics of the training data. The result is shown in Table 2.

  <100k | <500k | <1m | <4m | <7m | <10m | <50m | <100m | <200m | >200m

Table 2: the 10 classes of the box office

2) Cross-validation

We divide our 1819-film dataset into 10 folds at random and run training and prediction 10 times, each time using one fold as the test data and the other 9 folds as the training data. This is how we evaluate how well each algorithm works.

3) Error evaluation

We use two methods to evaluate the predictions. The first is the error rate, i.e. the average of 1{h(x) != y}, which tells how many test examples are predicted incorrectly. The second is the confusion matrix: C = confusionmat(group, grouphat) returns the confusion matrix C determined by the known and predicted groups in group and grouphat, respectively. For example:

  C = [ 50  0  0
         0 47  3
         0  3 47 ]

Here C(i, j) is a count of observations known to be in group i but predicted to be in group j. The confusion matrix not only provides the error rate but also shows how far the predictions are from the actual box office class.

5. Results

1) Multi-class Naïve Bayes

When the producer, director, writer, and actors are divided uniformly, the average confusion matrix over the 10-fold cross-validation is:

  E1 = [ 12 10  0  0  0  0  0  0  0
         10 16  0  0  0  0  0  0  0
          7  4  0  0  0  0  0  0  0
          2  2  0  1  0  0  0  0  0
          0  1  0  0  3  3  0  0  0
          0  0  0  1  1 14  0  1  0
          0  0  0  0  1  3  0  6  0
          0  0  0  0  0  4  0 23  3
          0  0  0  0  0  0  0  2 47 ]

The elements on the diagonal of the matrix are the correct predictions, and most of the incorrect predictions fall within 2 classes of the true box office class.

When the producer, director, writer, and actors are instead divided in the proportion 2:3:5, the average confusion matrix over the 10-fold cross-validation is:

  E2 = [  7  6  2  0  0  0  0  0  0
          3 12  0  0  0  0  0  0  0
          3  2 22  0  1  2  0  0  0
          0  0  3  0  1  1  0  0  0
          0  0  3  0  1  3  0  0  0
          0  4  0  0 14  0  0  0  0
          0  1  0  0  4  2  3  0  0
          0  0  0  0  0  1  2 16 13
          0  0  0  0  0  0  0  5 43 ]

Compared with E1, the trace of E2 is larger, which means the error rate decreases.

2) Multilayer perceptron

For the multilayer perceptron we use 11 nodes to build the model. Figure 1 shows the relation between the predicted box office and the true box office.
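The two evaluation views used here, overall error rate and the confusion matrix, are connected through the trace: the diagonal counts the correct predictions, so the error rate is the off-diagonal mass divided by the total. A minimal sketch, using the 3x3 example matrix from the error-evaluation section:

```python
# Error rate from a confusion matrix: off-diagonal mass / total count.
def error_rate(C):
    total = sum(sum(row) for row in C)
    correct = sum(C[i][i] for i in range(len(C)))  # the trace
    return (total - correct) / total

# 3x3 example from the error-evaluation section.
C = [[50, 0, 0],
     [0, 47, 3],
     [0, 3, 47]]
print(error_rate(C))  # 0.04
```

A larger trace with the same total therefore always means a lower error rate, which is the comparison made between E1 and E2 above.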

Figure 1: the predicted box office vs. the true box office using the multilayer perceptron

An advantage of the multilayer perceptron is that we can inspect the weight of each attribute and determine its influence on the output. From this perceptron model, we find that director and writer are the most influential properties.

3) Linear regression

Instead of classification, we try regression on the continuous box office values, and find that director and writer have the strongest linear relation to the box office.

Figure 2: (a) approximate linear relation between writer and box office; (b) approximate linear relation between director and box office

Finally, to compare the three methods, we draw a table showing their mean error rate and mean absolute error over the 10-fold cross-validation.

                      Naïve Bayes (uniform) | Naïve Bayes (2:3:5) | Multilayer perceptron | Linear regression
  Mean error rate     35.6%                 | 33.5%               | --                    | --
  Mean absolute error --                    | --                  | 31.5%                 | 72.8%

Table 3: comparison of the three methods

From the results above, the multilayer perceptron has the best performance.

6. Conclusion and future work

In this project we learned how to collect data from the Internet, how to prepare the data for basic machine learning algorithms, and how to decide which algorithm is suitable. It was a meaningful project. In the future, we will use our model to predict some upcoming movies and check whether our predictions are accurate. What's more, we will consider dependencies between the movie features: for example, the release date and genre may be related, some actors may have more influence in particular genres, comedy movies may be more popular at Christmas, and so on. We will try to improve our model accordingly.