An Efficient Informal Data Processing Method by Removing Duplicated Data


Jaejeong Lee 1, Hyeongrak Park and Byoungchul Ahn *
Dept. of Computer Engineering, Yeungnam University, Gyeongsan, Korea. *Corresponding author
1 Orcid ID: 0000-0001-6637-3502

Abstract

Useful information can be extracted from many social networking services (SNS). The data uploaded by users include preferences and comments on specific topics, which can be used for sentiment analysis, and this sentiment information can in turn support various social and personal services. However, SNS data include a large amount of duplicated data and spam, which slow down processing and reduce the accuracy of sentiment analysis. This paper presents an efficient informal big data processing method that filters out duplicated data and spam. The Hadoop Distributed File System (HDFS) and MapReduce are used to extract sentiment information through machine learning. Experimental results show improvements in both processing performance and the accuracy of sentiment analysis. When duplication and spam are filtered out, the upload time is reduced by about 4 seconds for 133,500 data items. The analysis time after duplicated data and spam are removed is reduced by 36 percent and 41.26 percent, respectively, for 133,500 data items. The incorrect spam detection rate is 20.53 percent for 35,320 data items.

Keywords: Big data; SNS; Sentiment analysis; Spam filtering; Informal data

INTRODUCTION

As social networking services (SNS) have become popular, they connect people with the same interests, such as hobbies, sports and pets. Typically, short sentences are uploaded to social networks such as Twitter and Facebook. The uploaded postings are highly valuable since they contain many comparatively subjective opinions. With an analysis system, the emotions of many users can be estimated quickly from the uploaded sentences.
It is easy to obtain new information from SNS, which can be widely used to identify public opinions and preferences on specific topics. To investigate the preferences for a specific topic, the polarity of opinions must be extracted after collecting the SNS data [1][2]. As more data are generated, duplicated data and spam grow as well. Spammers create large amounts of spam using automation tools and also upload postings manually to avoid filtering. These data slow down the analysis and distort its results. Therefore, a method is needed to extract information quickly and accurately by removing spam from SNS data.

Data uploaded to SNS follow no particular writing constraints and are therefore unconstrained informal data. SNS data types can be classified as formal data, semi-structured data or informal data. Informal data include video, images, document record files and so on. Many studies have analyzed informal data using techniques such as natural language processing, morpheme analysis and Support Vector Machines (SVM) [3][4][5][6][7][8].

This paper presents an improved informal data processing method that filters out spam data. Section 2 discusses related work. Section 3 describes the analysis model and the configuration of HDFS and MapReduce. Section 4 evaluates the performance of the proposed method, and Section 5 concludes.

RELATED WORK

Pak and Paroubek proposed a sentiment analysis method that uses a sentiment classifier to distinguish positive, neutral and negative messages [13]. Their analysis method is efficient, but they did not consider the collection and storage of the data to be analyzed. Back et al. configured parallel HDFS and MapReduce to analyze informal sentiment data [14]. They analyzed sentiment data with this method, but did not consider removing unnecessary data before analysis.
There has been little research on removing unnecessary data efficiently in order to analyze subjective opinions efficiently; most studies concentrate on detecting or analyzing spam generated in e-mail, SMS, the web and so on. After spam messages are classified by their contents, e-mail addresses, phone numbers and user identifications, they are stored on blacklists to filter them out. Wang proposed a spam detection prototype system to identify suspicious accounts on Twitter and compared several detection methods based on message contents [16]. Chu et al. proposed a method to filter out spam based on tweet contents using the Naive Bayesian classification algorithm [17]. Kim et al. proposed to categorize accounts as general accounts, tweet bots or cyborgs by analyzing features such as the entropy of tweet intervals and spam content [18]. These studies suggested how to distinguish malicious accounts but did not consider the handling of the original texts. They also analyzed sentiment by referring to Twitter hash tags.

It is necessary to filter out spam before polarity classification in order to improve the accuracy of the sentiment analysis.

PROPOSED METHOD

A. Configuration of HDFS and MapReduce

In this paper, we configured HDFS to collect and store Twitter data. We organized three Linux servers in parallel and configured the name node with a redundancy server for reliable operation. The proposed HDFS consists of a primary server, a secondary server and a data server to collect, store and process Twitter data, as shown in Figure 1. The primary server is the main server for distributed parallel processing and controls the other servers. The secondary server backs up the main name node. All servers run a data node and process data.

MapReduce is a programming model for processing big data sets with a parallel, distributed algorithm on a cluster. In this paper, it performs the sentiment analysis shown in Figure 1. The analysis sequence starts by extracting words from tweet data and passing them to the Mapper. The extracted data are tokenized, converted to vector values and weighted by their TF-IDF (Term Frequency-Inverse Document Frequency) values by Mahout. These values are used for sentiment analysis. The classified data are passed to the Reducer, where polarity values are added, and stored in MongoDB to calculate statistics.
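As a rough illustration of the Mapper/Reducer flow described above, the following Python sketch tokenizes tweets, weights terms by TF-IDF and aggregates per-tweet polarity labels. It runs without Hadoop or Mahout; the function names, toy tweets and the simplified weighting are stand-ins for the actual Mahout components, not the authors' implementation.

```python
import math
from collections import Counter

def mapper(tweet_id, text):
    """Map step: tokenize a tweet and emit (term, tweet_id) pairs."""
    for term in text.lower().split():
        yield term, tweet_id

def tf_idf(docs):
    """Weight terms per document by TF-IDF (a stand-in for Mahout's vectorizer)."""
    n = len(docs)
    df = Counter()   # document frequency per term
    tfs = {}         # term frequencies per document
    for doc_id, text in docs.items():
        tf = Counter(text.lower().split())
        tfs[doc_id] = tf
        df.update(tf.keys())
    return {doc_id: {t: c * math.log(n / df[t]) for t, c in tf.items()}
            for doc_id, tf in tfs.items()}

def reducer(polarity_by_tweet):
    """Reduce step: aggregate per-tweet polarity labels into statistics."""
    return dict(Counter(polarity_by_tweet.values()))

# Hypothetical toy tweets; 'today' occurs in every document, so its weight is zero.
docs = {1: "great match today", 2: "bad service today", 3: "sad loss today"}
weights = tf_idf(docs)
assert weights[1]["great"] > weights[1]["today"] == 0.0
print(reducer({1: "positive", 2: "negative", 3: "negative"}))  # → {'positive': 1, 'negative': 2}
```

In a real deployment the same three roles are played by the Hadoop Mapper, Mahout's vectorization job and the Hadoop Reducer, respectively.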
Figure 1: Proposed Configuration of HDFS

B. Informal Data Collection

For sentiment analysis, we collect Twitter data for up to one week using the Twitter4j API. A Twitter token is used to obtain authorization to collect data through the streaming API or the REST API. We used the REST API with an additional Bearer token to extend the query limit. As a result, the number of queries per 15 minutes was expanded from 180 to 450. The data collection process using the Twitter4j API is shown in Figure 2.

Figure 2: Data Collection Process using Twitter4j

C. Spam and duplicated data removal and sentiment classification

The data collected through the Twitter4j API contain far more data than are needed for sentiment analysis. By filtering out spam and duplicated data, we can improve the accuracy of sentiment analysis as well as the processing speed. Figure 3 shows pseudocode to index and remove spam. The spam training model trains the spam classifier by calculating a spam index. The index value identifies spam words or paragraphs in Twitter sentences. If the spam index value is three or higher, the sentence is considered spam; otherwise, it is considered regular data. The spam filtering module stores common words trained with spam and common labels. Figure 4 shows pseudocode to remove duplicated words after checking each page for duplication. It removes blanks, unimportant words and duplicated words, so that only non-duplicated words and sentences are used for sentiment analysis.

// Spam Removal
Inputs:
  T - training model for Mahout
  K - keyword
  S = {S1, S2, ..., Sk} - sentences (source data)
  si - spam index, set to zero
Outputs:
  R - result of common data
Set up T
foreach (Sn in S)
  Calculate si(K, Sn)
  if si is at or above the spam-index criterion then skip
  else add Sn to R
end
return R

Figure 3: Pseudocode for the spam removal module
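A minimal Python sketch of the Figure 3 logic follows. A fixed spam vocabulary stands in for the Mahout-trained model, so the terms and helper names are hypothetical; only the threshold of three follows the paper.

```python
# Hypothetical spam vocabulary standing in for the trained model T in Figure 3.
SPAM_TERMS = {"free", "click", "follow", "win", "prize"}

def spam_index(sentence):
    """Count spam-vocabulary hits in a sentence; the paper's trained
    classifier is replaced by this simple lookup for illustration."""
    return sum(1 for w in sentence.lower().split() if w in SPAM_TERMS)

def remove_spam(sentences, threshold=3):
    """Keep only sentences whose spam index is below the threshold
    (the paper treats an index of three or higher as spam)."""
    return [s for s in sentences if spam_index(s) < threshold]

tweets = [
    "win free prize click now follow",  # spam index 5 -> dropped
    "great game by the home team",      # spam index 0 -> kept
]
print(remove_spam(tweets))  # → ['great game by the home team']
```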

// Duplication removal
Inputs:
  S = {S1, S2, ..., Sk} - sentences
  T - temporary space
Outputs:
  U - unique data
foreach (Sn in S)
  T = RemoveUnnecessaryCharacters(Sn)
  if U contains T then skip
  else add T to U
end
return U

Figure 4: Pseudocode for the duplication removal module

D. Proposed sentiment analysis method

The polarity classification process using Mahout is divided into three phases, namely the preparatory phase, the training phase and the classification phase, as shown in Figure 5. The preparatory phase collects tweets randomly to build a training set. The collected tweet data are transferred to the training phase after their polarity is classified. These data are converted to sequential files, vectorized and generated as a training model. Finally, the polarity classification result is derived in the classification phase.

Figure 5: Polarity classification process using Mahout

Figure 6 shows pseudocode to classify emotional polarity. It uses a sentiment training model to calculate sentiment polarity. If the sentiment index is 0, the sentence is neutral. If the index value is in the upper 40 percent of the score range, the sentence is positive; if it is in the lower 40 percent, it is negative. The polarity is decided by calculating a sentence score through the training model.

// Sentiment classifier
Inputs:
  T - training model for Mahout
  K - keyword
  S = {S1, S2, ..., Sk} - sentences (source data)
  ps - positive minimum score, set to the upper 40%
  ns - negative maximum score, set to the lower 40%
  ss - sentence score, set to zero
Outputs:
  R - result
Set up T
foreach (Sn in S)
  Calculate ss(K, Sn)
end
if ss is zero then return neutral
if ps contains ss then R = positive
else if ns contains ss then R = negative
else R = neutral
return R

Figure 6: Pseudocode for the polarity classification module

EXPERIMENTAL EVALUATION

We built a sentiment analysis system, configured as shown in Table 1, to measure the performance of the proposed method. The operating system of all servers is Ubuntu 14.04.4 LTS x64, and the servers form a Hadoop parallel system.

Table 1: System configuration for the sentiment analysis

| Configuration           | OS                     | CPU                                      | Memory | HDD   |
| Primary Server (main)   | Ubuntu 14.04.4 LTS x64 | Intel Core(TM)2 Duo E8400 (2 cores, 3.00GHz) | 4GB  | 300GB |
| Secondary Server (sub1) | Ubuntu 14.04.4 LTS x64 | Intel Core(TM)2 Duo E8400 (2 cores, 3.00GHz) | 4GB  | 300GB |
| Data Server (sub2)      | Ubuntu 14.04.4 LTS x64 | Intel Core(TM)2 Duo E8300 (2 cores, 2.83GHz) | 6GB  | 500GB |
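The Figure 6 logic can be sketched in Python as follows. The polarity lexicon is hypothetical, and reading the paper's "upper/lower 40 percent" as fractions of a fixed score range is an assumption made for illustration, not the authors' trained model.

```python
# Hypothetical polarity lexicon; stands in for the Mahout training model T.
POSITIVE = {"good", "great", "love", "win"}
NEGATIVE = {"bad", "terrible", "hate", "loss"}

def sentence_score(sentence):
    """Signed count of lexicon hits, a stand-in for the trained model's score ss."""
    words = sentence.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

def classify(sentence, lo=-1.0, hi=1.0):
    """Figure 6 logic: zero -> neutral; scores in the upper 40% of the
    assumed [lo, hi] range -> positive; lower 40% -> negative; else neutral."""
    ss = sentence_score(sentence)
    if ss == 0:
        return "neutral"
    span = hi - lo
    if ss >= hi - 0.4 * span:
        return "positive"
    if ss <= lo + 0.4 * span:
        return "negative"
    return "neutral"

print(classify("great win today"))  # → positive
print(classify("terrible loss"))    # → negative
print(classify("match at noon"))    # → neutral
```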

A. Data collection

To measure the performance, we configured a total of 21 datasets: seven collections, each processed by three data collection methods. The configured datasets contain a total of 989,148 data items, as shown in Table 2. The training sets consist of 15,364 and 11,591 items for separating spam and polarity, respectively. Since SNS content changes in real time, the number of data items varies slightly with the collection time.

Table 2: Datasets for the experiments

| Dataset | Default | Duplication removal (DR) | DR+Spam Filtering |
| 1       | 6,082   | 6,097   | 6,106   |
| 2       | 6,952   | 6,953   | 6,954   |
| 3       | 29,584  | 29,552  | 29,564  |
| 4       | 35,226  | 35,263  | 35,320  |
| 5       | 39,808  | 37,687  | 37,125  |
| 6       | 83,346  | 83,690  | 83,855  |
| 7       | 127,712 | 128,682 | 133,590 |

B. Duplication and spam data removal

Because SNS postings are made freely, we confirmed that the rates of duplication and spam are very high, owing to advertising data and retweets without additional personal opinions. Duplicated data can be treated differently depending on the purpose of use; here, postings made without additional comments are removed. The numbers of common, spam and deduplicated data items are shown in Table 3.

Table 3: Results of duplication and spam data removal

| Method              | Category       | 1     | 2     | 3      | 4      | 5      | 6      | 7       |
| Default             | All data       | 6,082 | 6,952 | 29,584 | 35,226 | 39,808 | 83,346 | 127,712 |
| Duplication removal | All data       | 6,097 | 6,953 | 29,552 | 35,263 | 37,687 | 83,690 | 128,682 |
|                     | Duplicate data | 2,937 | 5,400 | 16,055 | 14,743 | 26,389 | 33,036 | 63,214  |
|                     | Elapsed data   | 3,160 | 1,553 | 13,497 | 20,520 | 11,298 | 50,654 | 65,468  |
| DR+Spam Filtering   | All data       | 6,106 | 6,954 | 29,564 | 35,320 | 37,125 | 83,855 | 133,590 |
|                     | Duplicate data | 2,956 | 5,402 | 16,066 | 14,754 | 25,911 | 33,019 | 65,113  |
|                     | Elapsed data   | 3,150 | 1,552 | 13,498 | 20,566 | 11,214 | 50,836 | 68,477  |
|                     | Common data    | 3,060 | 1,300 | 10,904 | 15,121 | 10,714 | 49,802 | 66,606  |
|                     | Spam data      | 90    | 252   | 2,594  | 5,445  | 500    | 1,034  | 1,871   |

C. Data collection and uploading performance test

Figure 7 compares the collection times for the datasets shown in Table 2. Figure 8 compares the HDFS upload times for the default datasets, the datasets after duplication removal, and the datasets after spam filtering. Although the collection times of the three dataset groups are similar, the upload times differ because duplicated data and spam have been eliminated. Both the collection time and the HDFS upload time increase as the amount of data grows. The upload times are reduced by about 1 to 4 seconds.

Figure 7: Data Collection Time

Figure 8: HDFS Upload Time

D. Analysis time

For dataset 1, the analysis time decreases from 12.11 seconds to 10.67 seconds with duplication removal, and to 10.29 seconds when both duplicated data and spam are removed. For dataset 6, the analysis time after duplication removal and after spam removal is reduced from 33.62 seconds to 24.72 seconds and 23.8 seconds, respectively; that is, the analysis time is reduced by 36 percent and 41.26 percent, respectively. The analysis time therefore grows stably with the amount of data. The results for each dataset are shown in Figure 9.

Figure 9: Analysis Time

E. Detection accuracy for spam and duplicated data removal

The detection accuracy of duplication and spam removal was evaluated on datasets 3 and 4, since those datasets contain a large amount of duplicated data and spam. Duplication removal performs well, since spaces and web link addresses are easy to recognize; however, meaningless character sequences added by users, such as "..." and "!!!", are not removed. For spam filtering, the classifier detected 2,594 spam items out of 29,564 for dataset 3, with 481 wrong results, an incorrect spam detection rate of 18.45 percent. For dataset 4, it detected 5,445 spam items out of 35,320, with 1,118 wrong results, an incorrect spam detection rate of 20.53 percent. The results for datasets 3 and 4 are shown in Figure 10.

Figure 10: Practical Assessment of Spam Filtering

CONCLUSION

We have proposed an efficient method that removes large amounts of duplicated data and spam before sentiment analysis of SNS data. The proposed method is built on HDFS and MapReduce to collect data from Twitter and to eliminate duplicated data and spam. When duplication and spam are removed, the upload time is reduced by about 4 seconds for 133,500 data items. The analysis time after duplicated data and spam are filtered out is reduced by 36 percent and 41.26 percent, respectively, for 133,500 data items. The incorrect spam detection rate is 20.53 percent for 35,320 data items. The accuracy of spam filtering still needs improvement. To reduce spam filter errors, further language-specific studies of real-life usage, such as new words and slang, are needed, and the spam training algorithms should be improved.

REFERENCES

[1] Jae-Young Chang, SinYoung Lee, JongBin Han. "Machine-Learned Classification Technique for Opinion Documents Retrieval in Social Network Services." Proceedings of KISS, pp. 245-247, June 2013.
[2] Joa-Sang Lim, Jin-Man Kim. "An Empirical Comparison of Machine Learning Models for Classifying Emotions in Korean Twitter." Journal of Korea Multimedia Society, 17.2 (2014): 232-239.
[3] Jinju Hong, Sehan Kim, Jeawon Park, Jaehyun Choi. "A Malicious Comments Detection Technique on the Internet using Sentiment Analysis and SVM." Journal of the Korea Institute of Information and Communication Engineering, 20.2 (2016): 260-267.
[4] Phil-Sik Jang. "Study on Principal Sentiment Analysis of Social Data." Journal of the Korea Society of Computer and Information, 19.12 (2014): 49-56.

[5] An, Jungkook, and Hee-Woong Kim. "Building a Korean Sentiment Lexicon Using Collective Intelligence." Journal of Intelligence and Information Systems, 21.2 (2015): 49-67.
[6] Choi, Sukjae, and Ohbyung Kwon. "The Study of Developing Korean SentiWordNet for Big Data Analytics: Focusing on Anger Emotion." Journal of Society for e-Business Studies, 19.4 (2014).
[7] Thoits, Peggy A. "The Sociology of Emotions." Annual Review of Sociology (1989): 317-342.
[8] Baccianella, Stefano, Andrea Esuli, and Fabrizio Sebastiani. "SentiWordNet 3.0: An Enhanced Lexical Resource for Sentiment Analysis and Opinion Mining." LREC, Vol. 10, 2010.
[9] Apache Hadoop, http://hadoop.apache.org (accessed Dec 02, 2016).
[10] D. Borthakur, Apache Software Foundation. "The Hadoop Distributed File System: Architecture and Design." 2007.
[11] C. M. Bishop. Pattern Recognition and Machine Learning, Vol. 1. New York: Springer, 2006.
[12] J. Dean and S. Ghemawat. "MapReduce: Simplified Data Processing on Large Clusters." Communications of the ACM, Vol. 51, Iss. 1, pp. 107-113, 2008.
[13] A. Pak and P. Paroubek. "Twitter as a Corpus for Sentiment Analysis and Opinion Mining." Proc. of LREC 2010, 2010.
[14] Bong-Hyun Back, Ilkyu Ha, ByoungChul Ahn. "An Extraction Method of Sentiment Information from Unstructured Big Data on SNS." Journal of Korea Multimedia Society, 17.6 (2014): 671-680.
[15] Mahmoud, Tarek M., and Ahmed M. Mahfouz. "SMS Spam Filtering Technique Based on Artificial Immune System." IJCSI International Journal of Computer Science Issues, 9.1 (2012): 589-597.
[16] Wang, Alex Hai. "Don't Follow Me: Spam Detection in Twitter." Proceedings of the 2010 International Conference on Security and Cryptography (SECRYPT). IEEE, 2010.
[17] Z. Chu, S. Gianvecchio, H. Wang, and S. Jajodia. "Detecting Automation of Twitter Accounts: Are You a Human, Bot, or Cyborg?" IEEE Transactions on Dependable and Secure Computing, Vol. 9, No. 6, pp. 811-824, Dec. 2012.
[18] J. Kim, S. Lee and H. Yong. "Automatic Classification Scheme of Opinions Written in Korean." Journal of KIISE: Database, Vol. 38, No. 6, pp. 423-428, 2011.