A Random Forest based Learning Framework for Tourism Demand Forecasting with Search Queries

Similar documents
Forecasting London Museum Visitors Using Google Trends Data

The Comparative Study of Machine Learning Algorithms in Text Data Classification*

Prioritizing the Links on the Homepage: Evidence from a University Website Lian-lian SONG 1,a* and Geoffrey TSO 2,b

A Feature Selection Method to Handle Imbalanced Data in Text Classification

Analysing Search Trends

Exploring Econometric Model Selection Using Sensitivity Analysis

Hybrid Models Using Unsupervised Clustering for Prediction of Customer Churn

Forward Feature Selection Using Residual Mutual Information

Robust Face Recognition Based on Convolutional Neural Network

Classification with Class Overlapping: A Systematic Study

A BFS-BASED SIMILAR CONFERENCE RETRIEVAL FRAMEWORK

Open Access Research on the Prediction Model of Material Cost Based on Data Mining

Hybrid Feature Selection for Modeling Intrusion Detection Systems

Published by: PIONEER RESEARCH & DEVELOPMENT GROUP ( 1

Information Push Service of University Library in Network and Information Age

Comparative analysis of data mining methods for predicting credit default probabilities in a retail bank portfolio

MIXED VARIABLE ANT COLONY OPTIMIZATION TECHNIQUE FOR FEATURE SUBSET SELECTION AND MODEL SELECTION

Performance Degradation Assessment and Fault Diagnosis of Bearing Based on EMD and PCA-SOM

Random Forest A. Fornaser

The Establishment of Large Data Mining Platform Based on Cloud Computing. Wei CAI

Video annotation based on adaptive annular spatial partition scheme

An Effective Feature Selection Approach Using the Hybrid Filter Wrapper

Data Mining. Part 2. Data Understanding and Preparation. 2.4 Data Transformation. Spring Instructor: Dr. Masoud Yaghini. Data Transformation

Performance Analysis of Data Mining Classification Techniques

Data mining with Support Vector Machine

Learning to Match. Jun Xu, Zhengdong Lu, Tianqi Chen, Hang Li

Predicting Library OPAC Users Cross-device Transitions

Effect of Principle Component Analysis and Support Vector Machine in Software Fault Prediction

Noise-based Feature Perturbation as a Selection Method for Microarray Data

Big Data: From Transactions, To Interactions

Research on Applications of Data Mining in Electronic Commerce. Xiuping YANG 1, a

Image and Video Quality Assessment Using Neural Network and SVM

MIT Samberg Center Cambridge, MA, USA. May 30 th June 2 nd, by C. Rea, R.S. Granetz MIT Plasma Science and Fusion Center, Cambridge, MA, USA

Imputation for missing data through artificial intelligence 1

Review of feature selection techniques in bioinformatics by Yvan Saeys, Iñaki Inza and Pedro Larrañaga.

Open Access Apriori Algorithm Research Based on Map-Reduce in Cloud Computing Environments

REMOVAL OF REDUNDANT AND IRRELEVANT DATA FROM TRAINING DATASETS USING SPEEDY FEATURE SELECTION METHOD

FOUNDATIONS OF A CROSS-DISCIPLINARY PEDAGOGY FOR BIG DATA *

FEATURE EXTRACTION TECHNIQUES USING SUPPORT VECTOR MACHINES IN DISEASE PREDICTION

Sandeep Kharidhi and WenSui Liu ChoicePoint Precision Marketing

Study on Building and Applying Technology of Multi-level Data Warehouse Lihua Song, Ying Zhan, Yebai Li

Gender Classification Technique Based on Facial Features using Neural Network

The Forecast of PM10 Pollutant by Using a Hybrid Model

Random Forests for Big Data

Rainfall forecasting using Nonlinear SVM based on PSO

ORT EP R RCH A ESE R P A IDI! " #$$% &' (# $!"

Fast or furious? - User analysis of SF Express Inc

Robustness of Selective Desensitization Perceptron Against Irrelevant and Partially Relevant Features in Pattern Classification

CLASSIFICATION FOR SCALING METHODS IN DATA MINING

Review on Methods of Selecting Number of Hidden Nodes in Artificial Neural Network

Improving the Random Forest Algorithm by Randomly Varying the Size of the Bootstrap Samples for Low Dimensional Data Sets

A Supervised Method for Multi-keyword Web Crawling on Web Forums

Improving Classification Accuracy for Single-loop Reliability-based Design Optimization

Relevance Feature Discovery for Text Mining

Research on Design and Application of Computer Database Quality Evaluation Model

Feature Selection. CE-725: Statistical Pattern Recognition Sharif University of Technology Spring Soleymani

Handling Missing Values via Decomposition of the Conditioned Set

Improving Results and Performance of Collaborative Filtering-based Recommender Systems using Cuckoo Optimization Algorithm

Face Recognition Technology Based On Image Processing Chen Xin, Yajuan Li, Zhimin Tian

Video Inter-frame Forgery Identification Based on Optical Flow Consistency

A Survey On Different Text Clustering Techniques For Patent Analysis

Fraud Detection Using Random Forest Algorithm

Multisource Remote Sensing Data Mining System Construction in Cloud Computing Environment Dong YinDi 1, Liu ChengJun 1

Team Members: Yingda Hu (yhx640), Brian Zhan (bjz002) James He (jzh642), Kevin Cheng (klc954)

Real Estate Investment Advising Using Machine Learning

An Empirical Study on feature selection for Data Classification

A *69>H>N6 #DJGC6A DG C<>C::G>C<,8>:C8:H /DA 'D 2:6G, ()-"&"3 -"(' ( +-" " " % '.+ % ' -0(+$,

An improved PageRank algorithm for Social Network User s Influence research Peng Wang, Xue Bo*, Huamin Yang, Shuangzi Sun, Songjiang Li

DETECTION OF SMOOTH TEXTURE IN FACIAL IMAGES FOR THE EVALUATION OF UNNATURAL CONTRAST ENHANCEMENT

International Journal of Computer Engineering and Applications, ICCSTAR-2016, Special Issue, May.16

From Building Better Models with JMP Pro. Full book available for purchase here.

Destination Travel Ad Spend and Trends: Now and Later

The Effects of Outliers on Support Vector Machines

Fuzzy Entropy based feature selection for classification of hyperspectral data

Insights JiWire Mobile Audience Insights Report Q4 2012

MINI-PAPER A Gentle Introduction to the Analysis of Sequential Data

The Solutions to Some Key Problems of Solar Energy Output in the Belt and Road Yong-ping GAO 1,*, Li-li LIAO 2 and Yue-shun HE 3

Research on Social Relationship Network System based on MongoDB

BENCHMARKING ATTRIBUTE SELECTION TECHNIQUES FOR MICROARRAY DATA

A Recommender System Based on Improvised K- Means Clustering Algorithm

A reversible data hiding based on adaptive prediction technique and histogram shifting

STATISTICS (STAT) Statistics (STAT) 1

Imputation for missing observation through Artificial Intelligence. A Heuristic & Machine Learning approach

Machine Learning - Clustering. CS102 Fall 2017

Analysis on the technology improvement of the library network information retrieval efficiency

VIDEO SEARCHING AND BROWSING USING VIEWFINDER

On Classification: An Empirical Study of Existing Algorithms Based on Two Kaggle Competitions

Ontology Molecule Theory-based Information Integrated Service for Agricultural Risk Management

Application of Fuzzy Classification in Bankruptcy Prediction


Rolling element bearings fault diagnosis based on CEEMD and SVM

A Lazy Approach for Machine Learning Algorithms

A Hybrid Genetic Algorithm for the Distributed Permutation Flowshop Scheduling Problem Yan Li 1, a*, Zhigang Chen 2, b

Outlook for Lodging. Amherst. University of Massachusetts Amherst. Charlie Ballard TripAdvisor

Correlation Based Feature Selection with Irrelevant Feature Removal

An Improved KNN Classification Algorithm based on Sampling

Multi-dimensional database design and implementation of dam safety monitoring system

A mining method for tracking changes in temporal association rules from an encoded database

On User-centric QoE Prediction for VoIP & Video Streaming based on Machine-Learning

Optimized Watermarking Using Swarm-Based Bacterial Foraging

Transcription:

University of Massachusetts Amherst ScholarWorks@UMass Amherst Tourism Travel and Research Association: Advancing Tourism Research Globally 2016 ttra International Conference A Random Forest based Learning Framework for Tourism Demand Forecasting with Search Queries Xin Li Dr. Beijing Union University, Beijing, China, leexin111@163.com Follow this and additional works at: http://scholarworks.umass.edu/ttra Li, Xin Dr., "A Random Forest based Learning Framework for Tourism Demand Forecasting with Search Queries" (2016). Tourism Travel and Research Association: Advancing Tourism Research Globally. 9. http://scholarworks.umass.edu/ttra/2016/academic_papers_oral/9 This Event is brought to you for free and open access by ScholarWorks@UMass Amherst. It has been accepted for inclusion in Tourism Travel and Research Association: Advancing Tourism Research Globally by an authorized administrator of ScholarWorks@UMass Amherst. For more information, please contact scholarworks@library.umass.edu.

A Random Forest based Learning Framework for Tourism Demand Forecasting with Search Queries Tony Chen, Ph.D. Candidate School of Economics and Management Beihang University, Beijing 100191, China Telephone: 86-156-003-36797 E-mail: tonychen1989@gmail.com and Xin Li (corresponding author), Ph.D. Institute of Tourism Beijing Union University, Beijing 100101, China Telephone: 86-159-011-60410 E-mail: leexin111@163.com Abstract This study proposes a novel framework for tourism demand forecasting, which combines search queries generated on the Internet, advanced feature selection methods, and machine learning based forecasting technique. This new methodology is applied to forecast tourism demand in two popular destinations in China: Beijing and Haikou. The study evaluates the performances of various feature selection approaches in tourism forecasting. We further show that the random forest feature selection and support vector regression with radial basis function can be extremely useful for the accurate forecasts of tourist volumes. This research highlights the advantage of big data and machine learning algorithms in the tourism forecasting. Keywords: random forest, feature selection, machine learning, tourism demand forecasting Introduction Search queries have emerged as popular and significant sources for tourism demand forecasting, and they have been documented to predict tourist volumes pretty accurately (Choi & Varian, 2012; Bangwayo-Skeete & Skeete, 2015; Yang et al., 2015). In general, search queries have high dimensions because they are generated by the public on the Internet (Wu & Brynjolfsson, 2013; Cho & Tomkins, 2007). To improve the forecasting accuracy, researchers need to select appropriate search queries, and model them with optimal approaches. Machine learning based methods can model large data sets with relatively superiority compared with econometric models (Liao, Chu, & Hsiao, 2012; Pai, Huang, & Lin, 2014). To the best of our knowledge, single models are unlikely to perform well, especially with large data sets. Therefore, a key challenge to bridge the gap is to propose an integrated methodology that makes the best of machine learning based approaches. The goal of this study was to provide an integrated framework for tourism demand forecasting, which combines search queries, feature selection process, and machine learning approaches.

Search queries are considered as features, and feature selection methods can eliminate irrelevant features and keep the ones that are helpful for the forecasting. Then, machine learning methods are applied to modeling these selected search queries. Compared to econometric models, machine learning based models can process and modeling large date sets with higher accuracy. Moreover, empirical findings suggest that random forest based feature selection performs best among other filter and wrapper based methods. Our study extends the existing studies about tourism demand forecasting with search queries by using machine learning framework. Literature Review Tourism demand forecasting with search queries is first reviewed. Then, the commonly used feature selection methods including both filter-based and wrapper-based ones are illustrated. Afterwards, machine learning based approaches and their advantages are briefly reviewed. Finally, the research gaps are addressed at the end of the section. Search queries reflect how people show interests and attention on specific topics on the Internet. Existing studies have demonstrated that these data can predict the future trends (Choi and Varian 2012). Researchers have incorporated these unique data sources into the forecasting in the fields of tourism and hospitality (Pan, Wu, & Song, 2012; Yang, et al., 2015). However, the approaches used to model the search queries are relatively simple, such as autoregressive model, vector autoregressive model, and autoregressive distributed lag models (Ettredge, Gerdes, & Karuga, 2005; Song & Witt, 2006; Chu, 2009; McLaren, 2011; Vosen & Schmidt, 2012; Akın, 2015; Hassani, Webster & Heravi, 2015). These econometric models are superior to depict the linear correlation between the dependent and independent variables, especially when the dimensions of search queries are low. However, these models may fail to process large, nonlinear data sets with exceptional performances. Compared to econometric models, machine learning based approaches have advantages of modeling large data sets. In machine learning, support vector machines are supervised learning models with associated learning algorithms that analyze data used for classification and regression (Cortes & Vapnik, 1995; Chen, Chiang & Storey, 2012). From the perspective of machine learning, each search query can be considered as a feature, so the major problem is to select most appropriate features. In general, there are mainly three types of feature selection methods: filter-based, wrapper-based ones and embedded based ones (Kohavi et al., 1997; Guyon & Elisseff, 2003; Tong & Mintram, 2010). In particular, Guyon and Elisseff (2003) suggested that feature selection can improve the prediction performances, provide faster and more cost-effective predictors, and better understand the data generation process. A filter-based and three wrapper-base feature selection methods are illustrated as well. In particular, random forest combines several binary decision tress that are from learning samples and randomly selected variables (Genuer, Poggi, & Tuleau-Malot, 2010). To date, scarce literature adopted machine learning based approaches to forecast tourism demand with search queries. It is necessary to propose a new methodology to improve the forecasting accuracy of tourism demand. In this study, different feature selection and support vector machines are combined to model search queries. Methodology

An integrated machine learning based forecasting framework is proposed, which combines random forest feature selection and support vector machine. To evaluate the performances of different feature selection methods, we use the following methods, including filter based feature selection (FBFS), backwards feature selection (BFS), genetic algorithm feature selection (GAFS), and random forest feature selection (RFFS). Then, a support vector machine with radial basis function is used to train and test the forecasting model. This framework has four integrated components: search queries collection, feature selection, machine learning based modeling, and forecasting evaluation. First, weekly search queries are collected from Baidu trends following specific search terms. Then, the feature selection is used to eliminate the most irrelevant features. Afterwards, the selected series are modeled with support vector machines. Finally, the performances of the models are evaluated. Results We conduct rigorous experiments to verify the effectiveness of the proposed framework in the forecasting of Beijing and Haikou tourism volumes. Different feature selection methods generate various subsets of features, which are shown in Table 1. Table 1. Selected Features Beijing: Feature Selection Methods FBFS BFS GAFS RFFS Haikou: Feature Selection Methods FBFS BFS GAFS RFFS Number of Selected Features Top 11 of 46 features Top 20 of 46 features Top 13 of 46 features Top 9 of 46 features Number of Selected Features Top 7 of 20 features Top 6 of 20 features Top 4 of 20 features Top 7 of 20 features The selected feature sets using different methods, are shown in the above table. As shown in Table 1, 9 out of 46 features of Beijing tourism and 7 out of 20 features of Haikou tourism are selected using random forest feature selections. Then, for the robustness of the empirical study, the samples are randomly split into three-folds, two samples are used to train and the other one is used to test the model. The average forecasting accuracy with support vector machines are also computed. Table 2 presents the average forecasting accuracy of Beijing and Haikou tourist volumes.

Table 2. Average Forecasting Accuracy of Beijing and Haikou Tourist Volumes City: Beijing Machine learning SVM with Radial Basis Function City: Haikou Machine learning SVM with Radial Basis Function Average forecasting accuracy Feature selection Mean Absolute Percentage Error (MAPE) FBFS 0.18 BFS 0.22 GAFS 0.25 RFFS 0.17 All features 0.29 Average forecasting accuracy Feature selection Mean Absolute Percentage Error (MAPE) FBFS 0.09 BFS 0.08 GAFS 0.09 RFFS 0.08 All features 0.12 Interesting findings are shown from Table 2. First, the models with feature selection are superior those without feature selection process. The first four SVMs are constructed with the selected features sets, which are obtained with a filter based and three wrapper based feature selection methods. The last SVM model handles all features without the selection process. From the whole forecasting accuracy of Beijing tourism demand, SVM with the selected features perform better than the model without any feature selection process. Second, random forest based feature selection has the lowest MAPEs, compared to other feature selection methods. Filter based feature selection method also has exceptional accuracy in this forecasting task. Random forest feature selection improves the forecasting accuracy by 5.56%, 22.73%, 32%, and 41.38% compared to filter, backwards, genetic algorithm based feature selection and all features. In addition, from the empirical study of Haikou forecasts, random forest feature selection and support vector machine perform best, with the lowest MAPE. The worst performance is the combination of all features and support vector machine. In particular, features using random forest improve 33.33% forecasting accuracy of tourism demand compared to all features. In average, wrapper based feature selection methods outperform filter based ones in the forecasting of tourist volumes, with the reduction of forecasting error being 18.52%. Feature selection can eliminate unimportant search queries and significantly improve the forecasting accuracy of tourism demand. Therefore, it is convinced that the forecasting framework that combines search queries, feature selection, and machine learning approaches can significantly improve the tourism demand forecasting accuracy. Conclusion and Discussion Search queries generated on the Internet reflect tourists attention on the destinations. By incorporating search queries into the forecasting of tourism demand, researchers can update their

forecasts timely and accurately. We propose an integrated methodology that effectively analyze and modeling large search queries. This study considers search queries as features, and aims to apply machine learning based approaches into the forecasting of tourism demand. We hope to provide new insights for the timely and accurate forecasts of tourism demand in the Big Data era. Empirical results show that random forest based feature selection and support vector machines have the superiority to improve the forecasting accuracy of tourism demand. The proposed methodology is useful for analyzing large search query sets, because it can select most important variables for the forecasting. In the future research, we need to examine whether this methodology can accurately forecast tourist volumes to other destinations. In particular, the future research should emphasize on the adoption of machine learning based approaches so as to analyze more user-generated, multi-types, and big data. Acknowledgements The authors would like to thank the Chair, co-chairs and reviewers of ttra conference for these valuable help, insightful suggestions and comments, which has dramatically improved the quality of this research. This research is supported by grants from the National Natural Science Foundation of China (NSFC No. 71373023), National Key Technology Support Program (No. 2012BAZ03744), Beijing Higher Education Young Elite Teacher Project (No. YETP1750) and support from Talents Program in Beijing Union University (RK100201509). References Akın, M. (2015). "A novel approach to model selection in tourism demand modeling." Tourism management 48: 64-72. Bangwayo-Skeete, P. F. and R. W. Skeete (2015). "Can Google data improve the forecasting performance of tourist arrivals? Mixed-data sampling approach." Tourism management 46: 454-464. Chen, H., R. H. L. Chiang., and V. C. Storey (2012). "Business intelligence and analytics: From big data to big impact." MIS Quarterly 36(4). Cho, J. and A. Tomkins (2007). "Social media and search." IEEE Internet Computing 11(6): 13-15. Choi, H. and H. Varian (2012). "Predicting the present with google trends." Economic Record 88(s1): 2-9. Chu, F. L. (2009). "Forecasting tourism demand with ARMA-based methods." Tourism management 30(5): 740-751. Cortes, C. and V. Vapnik (1995). "Support-vector networks". Machine Learning 20(3): 273. doi:10.1007/bf00994018. Ettredge, M., J. Gerdes, and G. Karuga (2005). "Using web-based search data to predict macroeconomic statistics." Communications of the ACM 48(11): 87-92. Hassani, H., A. Webster, E. S. Silva, and S. Heravi (2015). "Forecasting U.S. tourist arrivals using optimal Singular Spectrum Analysis." Tourism management 46: 322-335.

Genuer, R., J. M. Poggi, and C. Tuleau-Malot (2010). Variable selection using Random Forests. Pattern Recognition Letters, 31 (14): 2225-2236. Kohavi, R., and G.H. John (1997). Wrappers for feature subset selection. Artificial Intelligence. 97(1-2): 273-324. Guyon, I., and A. Elisseff (2003). An introduction to variable and feature selection. Journal of Machine Learning Research 3: 1157-1182. Liao, S. H., Chu, P, H., and Hsiao, P. Y (2012). "Data mining techniques and applications A decade review from 2000 to 2011." Expert Systems with Applications 39(12): 11303-11311. McLaren, N. (2011). "Using Internet search data as economic indicators." Bank of England Quarterly Bulletin 51(2). Pai, P. F., K. C. Huang, and K. P. Lin (2014). "Tourism demand forecasting using novel hybrid system." Expert Systems with Applications 41(8): 3691-3702. Pan, B., C. Wu., & H. Song (2012). "Forecasting hotel room demand using search engine data." Journal of Hospitality and Tourism Technology 3(3): 196-210. Song, H. and S. F. Witt (2006). "Forecasting international tourist flows to Macau." Tourism management 27(2): 214-224. Tong, D. L. and R. Mintram (2010). "Genetic algorithm-neural network (GANN): a study of neural network activation functions and depth of genetic algorithm search applied to feature selection." International Journal of Machine Learning and Cybernetics 1(1-4): 75-87. Vosen, S. and T. Schmidt (2012). "A monthly consumption indicator for Germany based on Internet search query data." Applied Economics Letters 19(7): 683-687. Wu, L. and E. Brynjolfsson (2013). The future of prediction: How Google searches foreshadow housing prices and sales. Economics of Digitization, University of Chicago Press.