Sentiment Analysis for Mobile SNS Data


SeonHwan Kim, Il-Kyu Ha, Bong-Hyun Back and Byoungchul Ahn
Department of Computer Engineering, Yeungnam University, Gyeongsan, Gyeongbuk, Korea

Abstract

Every day, a large amount of diverse data is generated on Social Network Services (SNS) regarding individual opinions and preferences. These data can greatly affect various fields of society, such as politics, public opinion, economics, services and entertainment. It is therefore necessary to extract new information from SNS data and to understand the true intentions of users or customers. Doing so requires several techniques: analyzing a large amount of SNS data, extracting meaningful data from it, and generating new information. This paper presents an efficient method that processes various unstructured big data on social networks, extracts sentiment information, and derives user preferences from that sentiment information. The proposed method shows O(n) processing time as the number of data items increases.

Keywords: Big data, SNS, Sentiment

1 Introduction

Social Networking Services (SNS) are widely used on smart phones, and the number of users has increased very rapidly in recent years. In addition, the amount of data expressing a variety of personal opinions and interests is growing exponentially. Some critical information in SNS data can have a great impact on the formation of public opinion in various fields such as politics, economy, service, and entertainment. It is necessary to develop methods or algorithms that extract and process meaningful information from the large amount of data generated by SNS. It is also necessary to capture opinions in real time, to use this information in various application fields, and to present it visually. We propose a big data processing method that can efficiently handle various unstructured data collected from a large amount of SNS data.
Further, we propose sentiment analysis algorithms that extract sentiment information and classify customers' preferences about a particular issue, together with their changes over time.

2 Related work

Most data generated on SNS are unstructured, because the data are not standardized and their structure and shape are complex, unlike video, image and document data [1]. To extract meaningful information from a large amount of unstructured SNS data, the unstructured data must first be processed. Various technologies for processing unstructured data have been studied, focusing on morphological analysis. However, barriers to data analysis exist, such as symbol words and new buzzwords used by young people. For this reason, big data processing and sentiment analysis by computer have become more difficult. Thus, text mining research, which extracts information from semi-structured or unstructured text data using natural language processing techniques, has been developed [5-7]. It uses statistical and periodic algorithms based on machine learning to extract meaningful information from large volumes of text and to refine it. In addition, research on opinion mining, which determines whether a text expresses a positive, negative or neutral preference, has also been carried out [8-9].

Currently, a variety of open source projects for processing big data are in progress under the name of the Hadoop ecosystem [10]. Databases used to process big data are typically NoSQL (Not Only SQL) databases, which store and retrieve data under consistency models less restrictive than those of traditional relational databases [11]. As with relational databases, the choice of a NoSQL database depends on the situation. Many studies on NoSQL databases are currently underway in academia and industry. Representative examples are Google BigTable, Amazon DynamoDB, and the open source projects Apache HBase, Cassandra and MongoDB [10][12][13][14][16].
In particular, MongoDB, which is used in this study, is classified as a CP-type database, providing Consistency and Partition tolerance under the CAP theorem (Consistency, Availability, Partition tolerance). It has been promoted as an open source project.

Sentiment is the emotion we feel in response to some work or phenomenon [16]. Sentiment analysis is a process that discovers and extracts subjective information from the original data by using computational linguistics, natural language processing and text analytics [16]. Studies that analyze sentiment in big data have been developed [17-19]. The work of analyzing and classifying the type of sentiment can be divided into three main stages. In the first step, the sentences that contain sentiment information expressing subjective thoughts and feelings are extracted. In the next step, the polarity of the sentence or document is classified as positive,

negative or neutral. In the final step, an intensity classification determines the strength of subjectivity of the text documents [20-21].

3 Sentiment Analysis of Unstructured SNS Data

3.1 System Model

We propose a big data processing system that can efficiently handle various unstructured SNS data. The proposed system consists of a parallel HDFS (Hadoop Distributed File System) and MapReduce. The parallel HDFS, which is based on the Hadoop ecosystem, is used to collect and store data reliably from a large variety of SNS data. MapReduce [22] is used to analyze large amounts of unstructured data effectively for user sentiment. The configuration of the proposed system is shown in Figure 1.

Figure 1. The proposed system

3.2 Composition of HDFS

HDFS is a file system with a distributed processing structure. It is configured as a set of parallel servers, as shown in Figure 1. The system connects four Linux-based servers in parallel, and the chunk size each node uses to store data is set to 64 MB. The name server is duplicated using NFS for disaster recovery. The functions of the servers are described in Table 1.

Table 1. HDFS Servers
Server | Components | Functions
PrimaryServer (Master Node) | Namenode, MapReduce, Crawler | Main server for parallel distribution process; name node (controlling other servers)
SecondaryServer (Slave Node 1) | Secondary NameNode | Backup server of main server
DataServer1 (Slave Node 2) | |
DataServer2 (Slave Node 3) | |

3.3 MapReduce Functions

MapReduce is a software framework developed by Google to support distributed computing and parallel programming, built around the concept of a function called map. In this paper, the analysis is divided into four specialized map functions. They perform positive/negative context analysis, morphological analysis, token analysis, and prohibited word analysis, respectively. Table 2 shows the four proposed functions and their operations.

Table 2. Functions of the proposed sentiment analysis
Sentiment analysis function | Operations | Referenced dictionary
Positive/negative context analysis function | context analysis using sentence pattern matching | positive/negative context dictionary
Morphological analysis function | elimination of needless elements, calculation of the result count | positive/negative word dictionary
Token analysis function | creation of tokens, calculation of the result count | positive/negative word dictionary
Prohibited words analysis function | calculation of the prohibited word score | prohibited word dictionary

First, the system performs the positive/negative context analysis function. To improve accuracy, it examines the context sentence by sentence and matches each sentence against the patterns in the positive context dictionary and the negative context dictionary. It then counts the positive and negative contexts. If the number of positive contexts equals the number of negative contexts, the sentence is treated as positive. If the context analysis cannot classify the sentence, the sentence is passed to the morphological analysis. The algorithm for context analysis is shown in Figure 2.

Second, it performs the morphological analysis function. This function uses a morphological analyzer to remove components that are unnecessary for the analysis, such as special symbols. It then counts matches by comparing the sentence with the positive and negative word dictionaries. If the value of the positive counter equals that of the negative counter, the sentence is treated as positive. If the morphological analysis cannot classify the polarity, the sentence is passed to the token analysis.

Third, it performs the token analysis. After splitting the source sentence into tokens at spaces, the function counts positive and negative words by comparing the tokens with the positive and negative word dictionaries. If the value of the positive counter equals that of the negative counter, the sentence is treated as positive. If the token analysis cannot classify the polarity, the sentence is passed to the prohibited word analysis.

Fourth, it performs the prohibited word analysis. It calculates the prohibited word score based on the prohibited word dictionary. The algorithm for morphological analysis, token analysis and prohibited word analysis is described in Figure 3.

//Context Analysis
//input: keyword, source
//keyword: target word for deciding positive or negative sentiment
//source: source data in text form that has been processed by HDFS
Input keyword and source
Initialize result   //criterion for the sentiment decision

//pre-processing
Change the keyword to lower-case
Change the source to lower-case
Eliminate the needless characters in source text
Initialize positive_count and negative_count

//Context Analysis
Repeat until there is no minimum sentence unit left:
    Get the next minimum sentence unit from the source
    //Computation of positive_count and negative_count
    if (minimum sentence unit == positive) then positive_count++
    if (minimum sentence unit == negative) then negative_count++

//Computation of the result from positive_count and negative_count
if (positive_count == 0 and negative_count == 0) then result = 0   //undecidable
else if (positive_count == negative_count) then result = 1   //positive
else result = positive_count - negative_count

Figure 2. Context analysis function

3.4 Dictionaries for Sentiment Analysis

The proposed MapReduce functions use five dictionaries: a positive context dictionary, a negative context dictionary, a positive word dictionary, a negative word dictionary and a prohibited word dictionary. Each entry of the prohibited word dictionary consists of a polarity and a score. The role of each dictionary is shown in Table 3.
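To make the role of the dictionaries concrete, the four analysis functions described above can be sketched as a cascade in Python. This is a minimal illustration under stated assumptions, not the authors' MapReduce implementation: the dictionary contents, the function names, and the regular-expression stand-in for a morphological analyzer are all hypothetical, and the keyword input of Figure 2 is omitted. The counter rule follows Figure 2 (0 means undecided, a tie counts as positive).

```python
import re

# Toy dictionaries, invented for illustration only (the paper's are not given).
POSITIVE_CONTEXT = ("so happy", "love it")
NEGATIVE_CONTEXT = ("not happy", "hate it")
POSITIVE_WORDS = {"good", "great", "happy", ":)"}
NEGATIVE_WORDS = {"bad", "sad", "awful", ":("}
PROHIBITED_WORDS = {"damn": -1}  # word -> signed score (polarity and score)


def decide(pos, neg):
    """Counter rule: 0 means undecided, a tie is treated as positive."""
    if pos == 0 and neg == 0:
        return 0
    if pos == neg:
        return pos  # Figure 2 uses the constant 1 here; both are positive
    return pos - neg


def clean(text):
    """Lower-case the text and blank out needless characters."""
    return re.sub(r"[^a-z0-9'\s.!?]", " ", text.lower())


def context_analysis(text):
    """Count sentences that match a positive or negative context pattern."""
    sentences = [s.strip() for s in re.split(r"[.!?]+", clean(text)) if s.strip()]
    pos = sum(any(p in s for p in POSITIVE_CONTEXT) for s in sentences)
    neg = sum(any(p in s for p in NEGATIVE_CONTEXT) for s in sentences)
    return decide(pos, neg)


def count_words(units):
    """Compare units with the positive/negative word dictionaries."""
    pos = sum(u in POSITIVE_WORDS for u in units)
    neg = sum(u in NEGATIVE_WORDS for u in units)
    return decide(pos, neg)


def morphological_analysis(text):
    # Assumption: a plain word extractor stands in for a real morphological analyzer.
    return count_words(re.findall(r"[a-z0-9']+", clean(text)))


def token_analysis(text):
    # Whitespace tokens of the raw text, so symbol words such as ":)" survive.
    return count_words(text.lower().split())


def prohibited_word_analysis(text):
    """Sum the signed scores of prohibited words found in the text."""
    return sum(PROHIBITED_WORDS.get(t, 0) for t in text.lower().split())


def classify(text):
    """Run the stages in order; a stage that returns 0 passes the text on."""
    stages = (
        ("context", context_analysis),
        ("morphological", morphological_analysis),
        ("token", token_analysis),
    )
    for name, analyze in stages:
        result = analyze(text)
        if result != 0:
            return result, name
    return prohibited_word_analysis(text), "prohibited"
```

For example, `classify("I am so happy today!")` is decided by the context stage, while `classify("meh :)")` falls through to the token stage because the symbol word is removed before the earlier stages see it. A positive result indicates positive sentiment, a negative result negative sentiment, and 0 after the last stage leaves the text undecided.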
//Morphological Analysis
if (result == 0) in previous stage
    Input source
    Initialize result-s   //criterion for the sentiment decision
    //pre-processing of source
    Eliminate the needless characters in source text
    Initialize positive_count_s and negative_count_s
    //Computation of positive_count_s and negative_count_s using
    //the positive/negative word dictionary
    Repeat until there is no morpheme unit left:
        Compute positive_count_s, negative_count_s
    if (positive_count_s == 0 and negative_count_s == 0) then result-s = 0
    else if (positive_count_s == negative_count_s) then result-s = positive_count_s
    else result-s = positive_count_s - negative_count_s

//Token Analysis
if (result-s == 0) in previous stage
    Create tokens
    Initialize positive_count_s and negative_count_s
    //Computation of positive_count_s and negative_count_s using
    //the positive/negative word dictionary
    Repeat until there is no token left:
        Compute positive_count_s, negative_count_s
    if (positive_count_s == 0 and negative_count_s == 0) then result-s = 0
    else if (positive_count_s == negative_count_s) then result-s = positive_count_s
    else result-s = positive_count_s - negative_count_s

//Prohibited word Analysis
if (result-s == 0) in previous stage
    //Computation of positive_count_s and negative_count_s using
    //the prohibited word dictionary
    Compute positive_count_s, negative_count_s
    result-s = positive_count_s - negative_count_s

Figure 3. Morphological, token and prohibited word analysis

Table 3. Dictionaries for sentiment analysis
Dictionary | Role / application
Positive Context | compute the number of positive contexts in the source sentence / set of positive context patterns
Negative Context | compute the number of negative contexts in the source sentence / set of negative context patterns
Positive Word | compute the number of positive words in the source sentence / set of positive word patterns
Negative Word | compute the number of negative words in the source sentence / set of negative word patterns
Prohibited Word | compute the number of prohibited words in the source sentence / set of prohibited words

4 Experiment and Results

4.1 Data Collection and Experimental Environment

The data collection performance of the proposed system is analyzed using Twitter and Topsy. Topsy analyzes the activity of users on SNS services such as Google Plus and Twitter, and provides analyzed data derived from about 500 million data items per day. After the historical data have been acquired, Twitter4j is used to collect the continuously arriving incremental data. Twitter provides only one week of data, and an API key allows 450 queries per 15 minutes. In this study, the data collection module is run every 4 hours using the crawler. The experimental environment used for performance analysis is described in Table 4. The proposed system consists of four Hadoop-based parallel servers and uses CentOS 6.3 x64 as the operating system.

Table 4. Experimental environment
Components | Roles
OS, RE | Use of Hadoop for distributed storage; supporting the Java environment for processing some business logic
Crawler, HDFS Layer | Crawler: gathering the source data from various SNSs; HDFS: distributed file system, data storage
MapReduce Layer | Sentence analysis, text mining, sentiment analysis
MongoDB | Storing the results analyzed by MapReduce in MongoDB
WAS, Web Server | Supporting Web applications that use the analyzed results
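The rate limit and crawl interval stated in Section 4.1 imply a simple query budget, which the following sketch works out. It is only arithmetic on the two figures given in the text (450 queries per 15 minutes, one crawl every 4 hours); the constant and function names are hypothetical, and the bound assumes the crawler could use every rate-limit window between two runs, which the paper does not state.

```python
WINDOW_SECONDS = 15 * 60      # length of one rate-limit window (from the text)
QUERIES_PER_WINDOW = 450      # queries allowed per window (from the text)
CRAWL_INTERVAL_HOURS = 4      # period of the data collection module (from the text)


def min_seconds_between_queries():
    """Even pacing that never exceeds the per-window limit."""
    return WINDOW_SECONDS / QUERIES_PER_WINDOW


def max_queries_per_cycle():
    """Upper bound on queries between two crawler runs (assumes full use of every window)."""
    windows = CRAWL_INTERVAL_HOURS * 3600 // WINDOW_SECONDS
    return windows * QUERIES_PER_WINDOW
```

Under these assumptions, an evenly paced crawler would wait 2 seconds between queries, and at most 16 windows, or 7,200 queries, fit between two consecutive runs.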

4.2 Analysis and Evaluation

The following four tests have been carried out to analyze the performance of the proposed system.

The first experiment measures system performance as a function of the number of data items. The system load and acquisition time are tested using the seven Twitter data sets in Table 5. Each data set is collected using the Topsy API.

Table 5. Data sets for experiment and analysis
Data set number | Number of data | Extraction period (day) | API
1 | 2,106 | 1 | Topsy API
2 | 11,… | … | …
3-7 | … | … | …

Figure 4 compares the HDFS loading time and the crawling time for each data set. Figures 5 and 6 show the memory load and CPU load of each HDFS node while each data set is crawled and loaded. For the data set of 2,106 items, the crawl time is 6 seconds and the HDFS loading time is 1 second. For the data set of 100,497 items, the crawl time is 70 seconds and the HDFS loading time is 10 seconds, as shown in Figure 4. Both the HDFS loading time and the crawl time increase in proportion to the number of data items. Thus, collecting and loading data places only a small network and system load on the proposed system, and data collection and loading are processed stably in seconds.

Figure 4. Crawling Time and HDFS Loading Time

The memory usage of slave nodes SN1 to SN3 ranges from a minimum of 0.03% to a maximum of 3.93%. The memory usage of the master node M ranges from a minimum of 0.6% to a maximum of 7.31%, as shown in Figure 5. The slave nodes use little memory because the data loading is distributed among them. The master node uses more memory than the slave nodes.

Figure 5. Memory Usage for Data Crawling and HDFS Loading

In Figure 6, the CPU usage of slave nodes SN1 and SN2 ranges from a minimum of 0.0% to a maximum of 2.8%, whereas the CPU usage of slave node SN3 ranges from 0.0% up to 11.4%. The reason is that slave node SN3 loads data in parallel and distributed processing. The CPU usage of the master node ranges from 5.0% up to 7.9%. Therefore, the proposed system provides a stable environment when it collects and loads data.

Figure 6. CPU Usage for Data Crawling and HDFS Loading

Figure 7. Time of MapReduce Processing and Sentiment Analysis
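As a rough check of the scaling reported for crawling and HDFS loading, the two data points quoted in the text for Figure 4 can be turned into per-item costs. This is a sketch on those two points only; the helper names are hypothetical, and two points cannot by themselves confirm linear growth.

```python
# (number of data items, crawl seconds, HDFS loading seconds) as quoted in the text
SMALL = (2_106, 6, 1)
LARGE = (100_497, 70, 10)


def per_item_ms(items, seconds):
    """Average cost per data item in milliseconds."""
    return 1000 * seconds / items


def marginal_ms(small, large, column):
    """Extra milliseconds per additional item between the two data sets."""
    return 1000 * (large[column] - small[column]) / (large[0] - small[0])


crawl_small = per_item_ms(SMALL[0], SMALL[1])   # about 2.85 ms per item
crawl_large = per_item_ms(LARGE[0], LARGE[1])   # about 0.70 ms per item
crawl_marginal = marginal_ms(SMALL, LARGE, 1)   # about 0.65 ms per extra item
load_marginal = marginal_ms(SMALL, LARGE, 2)    # about 0.09 ms per extra item
```

The average crawl cost per item is lower for the larger data set, which would be consistent with a fixed start-up overhead plus a cost that grows linearly with the number of items. The five intermediate data sets in Figure 4 would be needed to verify this.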

Sentiment analysis time and system load are tested while increasing the number of data items. The experiment measures the system load and the analysis time for sentiment analysis. Figure 7 compares the sentiment analysis time for each data set. Figures 8 and 9 show the CPU load and memory load of each node. The sentiment analysis time takes from 68 seconds to 35 seconds for each of the 7 data sets. The analysis time increases linearly with the number of data items, as shown in Figure 7.

In Figure 8, the master node does not perform the actual analysis but manages the slave nodes. Its CPU usage is low, while the slave nodes use most of the CPU resources. When a data set contains fewer than 40,000 items, each slave node processes data in parallel. When a data set contains more than 40,000 items, all slave nodes use the maximum CPU resources according to the number of data items. Therefore, the proposed system performs stably as the number of data items increases. This is because the proposed system engages in parallel mode when the CPU load increases.

Figure 8. CPU Consumption of MapReduce Processing and Sentiment Analysis

In Figure 9, the memory usage of the master node is low, while the memory load of the analysis is distributed across the slave nodes and balanced among them. Therefore, the proposed system distributes the workload equally to the slave nodes and maintains load balancing. The system and algorithm of the proposed method show O(n) processing time. The system provides a stable distributed analysis environment in which processing is not done by a single node.

Figure 9. Memory Consumption of MapReduce Processing and Sentiment Analysis

Finally, the accuracy of the sentiment analysis is measured. The word "happy" is used as the keyword for the sentiment analysis. Figure 10 compares the results of the proposed system with those of manual classification. In Figure 10, the error ratio for neutral sentiment is relatively high, and the error ratio for positive and negative sentiment is relatively small. The sentiment analysis results of the proposed system are very close to those of the manual work.

Figure 10. Comparison between the results of the proposed functions and the results of manual sorting

5 Conclusions

A big data processing system and algorithms are proposed to analyze the sentiment of users from the large amounts of unstructured data generated by SNS. The proposed system is composed of a parallel HDFS based on the Hadoop ecosystem and four specialized MapReduce functions. In addition, it uses five types of dictionary for sentiment analysis. The proposed system processes data with a small loading time as the number of data items increases. The analysis is not processed by one node but distributed to all nodes for load balancing. When the proposed sentiment analysis functions process the data, the system load is distributed equally to all slave nodes, so that load balancing is maintained. The sentiment analysis results of the proposed system are very close to those of manual work.

Please address any questions about this paper to Byoungchul Ahn (b.ahn@yu.ac.kr).

6 Acknowledgement

This work (Grants No. C ) was supported by Business for Cooperative R&D between Industry, Academy, and Research Institute, funded by the Korea Small and Medium Business Administration.

References

[1] McKinsey, Big Data: The Next Frontier for Innovation, Competition, and Productivity, McKinsey & Company, 2011 [Online]
[2] Chang-Shing Lee, Mei-Hui Wang, Automated ontology construction for unstructured text documents, Data & Knowledge Engineering, Vol.60, Iss.3, 2007
[3] B. Lee, J. Lim, J. Yoo, Utilization of Social Media Analysis using Big Data, Jour. of the Korea Contents Association, Vol.13, No.2, 2013
[4] M. Song, S. Kim, A Study of improving on prediction model by analyzing method Big data, The Journal of Digital Policy & Management, Vol.11, No.6, 2013
[5] Ah Tan, Text mining: The state of the art and the challenges, Proc. of the PAKDD 1999, 1999
[6] Q. Mei, C. Zhai, Discovering evolutionary theme patterns from text: an exploration of temporal text mining, Proc. of the 11th ACM SIGKDD international conference on knowledge discovery in data mining, 2005
[7] K. Park, K. Hwang, A Bio-Text Mining System Based on Natural Language Processing, Jour. of KISS: computing practices, Vol.17, No.4, 2011
[8] B. Pang, L. Lee, Opinion Mining and Sentiment Analysis, Foundations and Trends in Information Retrieval, Vol.2, No.1-2, pp.1-135, 2008
[9] B. Kang, M. Song, A Study on Opinion Mining of Newspaper Texts based on Topic Modeling, Jour. of the Korean Library and Information Science Society, Vol.47, No.4, 2013
[10]
[11] Jing Han, Kian Du, Survey on NoSQL database, Proc. of 6th International Conference on Pervasive Computing and Applications (ICPCA), 2011
[12] Fay Chang, R.E. Gruber, Bigtable: A Distributed Storage System for Structured Data, ACM Transactions on Computer Systems, Vol.26, Iss.2, 2008
[13] S. Sivasubramanian, Amazon DynamoDB: a seamlessly scalable non-relational database service, Proc. of the 2012 ACM SIGMOD, 2012
[14] Lars George, HBase: The Definitive Guide, O'Reilly, 2011
[15] Kristina Chodorow, MongoDB: The Definitive Guide, 2nd Edition, O'Reilly, 2013
[16] B. Pang, L. Lee, Opinion Mining and Sentiment Analysis, Foundations and Trends in Information Retrieval, Vol.2, No.1-2, pp.1-135, 2008
[17] S. Mukherjee, P. Bhattacharyya, Sentiment Analysis in Twitter with Lightweight Discourse Analysis, Proc. of COLING 2012, 2012
[18] N. Godbole, S. Skiena, Large-Scale Sentiment Analysis for News and Blogs, Proc. of the ICWSM 2007, 2007
[19] A. Pak, P. Paroubek, Twitter as a Corpus for Sentiment Analysis and Opinion Mining, Proc. of the LREC 2010, 2010
[20] H. Tang, S. Tan, X. Cheng, A survey on sentiment detection of reviews, Expert Systems with Applications, Vol.36, 2009
[21] Seth Gilbert, Nancy Lynch, Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services, ACM SIGACT News 33(2)
[22] J. Dean, S. Ghemawat, MapReduce: Simplified Data Processing on Large Clusters, Communications of the ACM, Vol.51, No.1, 2008
