A Fast Social-user Reaction Analysis using Hadoop and SPARK Platform

Kiejin Park
Professor, Department of Integrative Systems Engineering, Ajou University, Suwon, South Korea

Limei Peng
Assistant Professor, Department of Industrial Engineering, Ajou University, Suwon, South Korea

Abstract

Social data such as comments are massive and unstructured, so the existing relational data model falls short in processing this kind of data. Moreover, it is difficult to analyze users' responses in real time from dynamically growing social data. In this paper, to analyze social users' comments quickly, we design a fast social-user reaction analysis system based on Hadoop for storing big data and on distributed in-memory Spark for data processing. In the experiments, about one terabyte of social data, composed of around 1.6 billion records, is first stored and pre-processed. Then the n-gram algorithm is used to analyze the comment responses. When this algorithm runs, the big data are loaded not to cluster disk but directly to memory, which makes it possible to process the social users' responses in real time.¹

Keywords: Social Data, Hadoop, Spark, SparkSQL, n-gram

¹ This work is (partially) supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (2015R1C1A1A02036536) and in part by the Ajou University Research Fund.

INTRODUCTION

Social network service (SNS) continues to expand as a space for sharing online information. Moreover, with the development of IT technologies, the number of social users is growing sharply, and the amount of generated data is increasing geometrically. In particular, social comment data, which reflect social users' opinions, have become very important material, since they let us gather the responses and opinions of users from different strata. In other words, the large amount of generated informal string data, which can reflect various social phenomena and predict online trends, has become a very important source. If we collect users' comments on news, products, communities, SNS, etc., and analyze them, we can obtain valuable information that cannot be obtained by existing methodologies. For example, we can score users' feelings according to certain keywords. In this way, even without a survey, we can predict users' feelings, analyze the reasons behind those feelings, and find the aspects that need improvement.

However, compared with existing RDBMS (Relational Database Management System) transaction data, such social media-based data are much larger in volume and heterogeneous in structure. For this reason, we adopt HDFS (Hadoop Distributed File System) with YARN (Yet Another Resource Negotiator) as the data storage system in order to process massive social data [1][2]. To quickly process the data stored in HDFS, the in-memory Spark platform is applied instead of the existing MapReduce computing method, loading data from distributed disks into distributed memory.

The rest of the paper is organized as follows. Chapter 2 introduces in-memory distributed computing and the n-gram algorithm. Chapter 3 describes the design of the distributed processing system for social big data and the data processing procedure. Chapter 4 shows the analysis results of social users' responses obtained with the prototype system. Chapter 5 concludes the paper.

RELATED RESEARCH

In-memory based distributed computing

In HDFS of Hadoop, a representative platform for analyzing big data, data are stored in a distributed way, and the read-only functions of Map and Reduce are then combined in various ways to analyze and process the data in parallel. However, because of the bottleneck of reading from the hard disk drive every time, its ability to process big data at high speed in real time is limited. To solve this, we propose a data structure based on Spark and maintain data in cluster memory so as to enable in-memory data processing. That is, even for a complicated query, intermediate data are loaded into memory for execution, and a lineage is set in advance for the data to be executed, so that the data can be processed in an optimized way.
In particular, for interactive analyses, the DataFrame API (Application Programming Interface) of SparkSQL [6] is used to satisfy both the sequential processing of functional programming and the declarative processing of SQL, which enables analyses based on existing SQL-like commands. This means that informal data can be handled while distributed batch processing of big data remains possible [7][8].

Text data index analysis: n-gram

There are various methods for analyzing word bunches in continuous string data. In this paper, we apply the n-gram algorithm, which can analyze data features according to the frequency with which a word appears in a sentence, without a process of selecting data features according to the state of the data structure [9]. Here, n is the number of words that appear consecutively, and n-gram is one of the representative index-analysis algorithms that use conditional probabilities. The reason for selecting n-gram to analyze text data is that a relevant word bunch that appears frequently in the whole data, and plays the same role as some specific words, can express the whole content well [10]. When applying the syllable unit-based n-gram algorithm to character strings, we first make windows of the same size as the character strings, and then collect the character item sets in units of syllables from left to right of the windows. Using this method, we can collect the character item sets for two respective character strings, compare the appearance frequencies of the two character strings, and finally show the comparison results numerically. Values of n = 1, 2, and 3 correspond to the Unigram, Bigram, and Trigram algorithms, respectively. The chain rule of probability for n-gram is shown in Equation (1).

$$ P(w_1^n) = P(w_1)\,P(w_2 \mid w_1)\,P(w_3 \mid w_1^2) \cdots P(w_n \mid w_1^{n-1}) = \prod_{k=1}^{n} P(w_k \mid w_1^{k-1}) \tag{1} $$

where w_1^n denotes the word sequence w_1 ... w_n. In the generalized form, each factor is conditioned only on the preceding N-1 words, as shown in Equation (2).

$$ P(w_1^n) \approx \prod_{k=1}^{n} P(w_k \mid w_{k-N+1}^{k-1}) \tag{2} $$

For example, when applying the Bigram (2-gram) to the sequence w_1, w_2, ..., the equation becomes (3).

$$ P(w_1^n) \approx \prod_{k=1}^{n} P(w_k \mid w_{k-1}) \tag{3} $$

REACTION ANALYSIS SYSTEMS

System structure

We design the architecture of the reaction analysis system for high-speed analysis of big social data. Based on this architecture, we can realize an interactive experiment environment. Fig. 1 shows the proposed architecture. Within one cluster, multiple applications can be executed in a pipeline. More precisely, real-time streaming data processing based on Spark Streaming, linkage with results obtained from machine learning (MLlib), analysis work, and visualization of analysis results can be done simultaneously.

Figure 1. The architecture of reaction analysis system

The n-gram algorithm needs a lot of memory to maintain the intermediate computing results. For this reason, the specification of the experiment environment of the proposed reaction analysis system is shown in Table 1. The memory size of the total cluster is 384GB and the HDD is 64TB.

Table 1. Prototype System Environment

H/W & S/W          | Specification
Data Node (Slave)  | CPU: Intel Core i7 or i5, 3.2GHz; Memory: 384GB (64GB * 6 nodes); HDD: 48TB (8TB * 6 nodes)
O.S.               | Ubuntu 14.04 LTS
Big Data Platform  | Hadoop 2.7.2 / Spark 1.6.0

Cluster management and data pre-processing

In the experiments, the cluster system uses Hadoop and YARN to manage the whole cluster, and on top of that the interactive Spark platform is executed for analysis, as shown in Fig. 2. The big-data query system consists of one master node (Driver Program) and six slave nodes (Workers). Since Spark uses resilient distributed datasets (RDDs) as its data units, based on a directed acyclic graph (DAG), they can use the memory of all nodes in the cluster. This significantly alleviates the hard-disk bottleneck that occurs with existing big-data analysis platforms such as MapReduce.

Figure 2. In-Memory Cluster Management

Fig. 3 shows how the loaded text data change. String data (i.e., users' comments) are tokenized into words and then filtered by StopWords. In the column named body of Fig. 3, comments of social data are stored as strings; in the column named words, every string is split into its words, which are assigned IDs and stored in arrays. The column named words1 shows the words after removing the StopWords. Finally, the words in words1 are processed by n-gram, and word bunches are generated according to the value of n.

Analysts can now use a declarative programming language, such as SQL, and functional programming, such as the DataFrame of SparkSQL, to easily achieve high-speed query processing for a huge amount of informal data. This means that it is possible to switch over from existing SQL-based big data analyses.
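The preprocessing steps described for Fig. 3 (tokenizing, stop-word removal, n-gram generation) can be sketched in plain Python on a single machine. This is only an illustration under stated assumptions, not the paper's Spark implementation: the function names, the regular expression, and the small stop-word set are hypothetical, and the sample comment is modeled on the third row of Fig. 3.

```python
import re

# Illustrative stop-word set (an assumption; the paper does not list its StopWords).
STOP_WORDS = {"i", "a", "an", "the", "is", "it", "that", "they", "my", "his"}

def tokenize(text):
    """Lower-case a comment and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens that appear in the stop-word set, keeping the original order."""
    return [t for t in tokens if t not in stop_words]

def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined by single spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

body = "That is pretty smooth"          # raw comment ("body" column)
words = tokenize(body)                   # "words" column
words1 = remove_stop_words(words)        # "words1" column
print(words)                             # ['that', 'is', 'pretty', 'smooth']
print(words1)                            # ['pretty', 'smooth']
print(ngrams(words1, 2))                 # ['pretty smooth']
```

A comment shorter than n words after stop-word removal simply yields an empty list, so no special handling is needed for very short comments.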

body (Raw data)      | words (Tokenized)    | words1 (StopWord)
I just wish Googl... | [i, just, wish, g... | [just, wish, goog...
Yeah, they've jus... | [yeah, they, ve,...  | [yeah, ve, just,...
That is pretty sm... | [that, is, pretty... | [pretty, smooth]
Might have been s... | [might, have, bee... | [program, use, mi...
My dad turns his...  | [my, dad, turns,...  | [dad, turns, phon...
It is literally i... | [it, is, literall... | [literally, insta...
I really hoped th... | [i, really, hoped... | [really, hoped, h...

Figure 3. Examples of dataset changes through preprocessing

EXPERIMENTAL RESULTS

Input dataset

Fig. 4 shows samples of the social data, in the form of JSON files, that are used in the experiments. Every item consists of a name, a colon (:), and a value, and is separated from the next item by a comma (,). The total number of entry properties is 22, and any record may have missing values. In this experiment, the size of the social data used is about 1TB, and the total number of records is around 1.6 billion.

[1st Record]
{"score_hidden":false,"name":"t1_cnas8zv","link_id":"t3_2qyr1a","body":"most of us have some family members like this. *Most* of my family is like this. ","downs":0,"created_utc":"1420070400","score":14,"author":"youngmodern","distinguished":null,"id":"cnas8zv","archived":false,"parent_id":"t3_2qyr1a","subreddit":"exmormon","author_flair_css_class":null,"author_flair_text":null,"gilded":0,"retrieved_on":1425124282,"ups":14,"controversiality":0,"subreddit_id":"t5_2r0gj","edited":false}

[2nd Record]
{"distinguished":null,"id":"cnas8zw","archived":false,"author":"redcoatsforever","score":3,"created_utc":"1420070400","downs":0,"body":"but Mill's career was way better. Bentham is like, the Joseph Smith to Mill's Brigham Young.","link_id":"t3_2qv6c6","name":"t1_cnas8zw","score_hidden":false,"controversiality":0,"subreddit_id":"t5_2s4gt","edited":false,"retrieved_on":1425124282,"ups":3,"author_flair_css_class":"on","gilded":0,"author_flair_text":"Ontario","subreddit":"CanadaPolitics","parent_id":"t1_cnas2b6"}
...

Figure 4. Sample Records in Form of JSON [11]

In this experiment, we analyze social users' responses through the body data of the records classified under Android. At the beginning, social users' comments appear as the contents of the body column and form a corpus; the corpus is then pre-processed with the Tokenizer and stop words. Finally, the words selected by the n-gram algorithm are used. Fig. 5 shows the results of processing users' comments with 2-gram. After pre-processing, every comment is divided into character item sets of two consecutive bunched words.

Figure 5. Results after applying 2-gram

User-reaction analysis results

Fig. 6 shows the analysis of "What are social users interested in?" obtained by applying the 3-gram to all the social data records. Fig. 6(a) shows the results of applying the 3-gram to 500 million Android users' comments. The 3-gram "play google com" appears 86,674 times, which is the highest count. Fig. 6(b) shows the items of interest for Bitcoin users. Among a total of about 370 million comments, "www reddit com" appears 71,166 times, which is the highest count. Through this method, we can learn which sites and items of interest social users prefer.

Figure 6. Social user analysis results under different cases: (a) Android users, (b) Bitcoin users

Through the experiments we can see that the tremendous amount of informal social data can all be read in and analyzed using in-memory Spark, and that the processing speed can be significantly improved. Using the n-gram to pre-process the huge number of comments, we can figure out the items that social users are interested in. In fact, for about 1TB of data (1.6 billion records), the average query processing time is 25 minutes.
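The counting step behind Fig. 6 can be sketched in plain Python: parse JSON records, keep those from one subreddit, and count the 3-grams of the body field. This is a single-machine illustration rather than the distributed Spark job used in the paper; the function name top_trigrams and the four sample records are hypothetical, while the field names subreddit and body follow Fig. 4. Stop-word removal is omitted here for brevity.

```python
import json
import re
from collections import Counter

def top_trigrams(json_lines, subreddit, k=3):
    """Return the k most frequent 3-grams in the 'body' of one subreddit's records."""
    counts = Counter()
    for line in json_lines:
        record = json.loads(line)
        # Any property may be missing in a record, so read fields defensively.
        if record.get("subreddit") != subreddit:
            continue
        tokens = re.findall(r"[a-z0-9]+", (record.get("body") or "").lower())
        counts.update(" ".join(tokens[i:i + 3]) for i in range(len(tokens) - 2))
    return counts.most_common(k)

# Hypothetical sample records, for illustration only.
sample = [
    '{"subreddit": "Android", "body": "Get it from play.google.com today"}',
    '{"subreddit": "Android", "body": "See play.google.com for details"}',
    '{"subreddit": "Bitcoin", "body": "Posted on www.reddit.com"}',
    '{"subreddit": "Android"}',
]
print(top_trigrams(sample, "Android", k=1))   # [('play google com', 2)]
```

Because URLs are split on punctuation during tokenizing, a host name such as play.google.com naturally becomes the 3-gram "play google com", which is the form of the top items reported for Fig. 6.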
To process the same amount of data with the existing RDBMS-based query processing system, the SQL querying time is about 3 to 4 hours. At an average of 25 minutes per query, the proposed system is therefore much faster.

Fig. 7 counts and compares the results of applying the 2-gram algorithm to all the records in different time periods. Fig. 7(a) shows the results for buildapc users: the site names that social users use frequently and the computer products of interest. Much of the content is related to computer products, and information on users assembling computers themselves is shared. Over the three years, the products that users were most interested in ranked as power supply, video card, hard drive, and internal hard, and this ranking did not change. Fig. 7(b) shows the results for pcmasterrace users. There is almost no data in January 2013, and the data are generated later than those of buildapc. Rather than computer products, these results mainly show the websites that users are interested in.

Figure 7. Comparison on items of interest per year: (a) Android users, (b) Bitcoin users

CONCLUSION

In this paper, to achieve real-time analyses of massive informal SNS data, we proposed a social-user reaction analysis system architecture based on HDFS and distributed in-memory Spark. The proposed architecture has the advantage of simultaneously using a declarative programming language such as SQL and a platform based on a distributed batch procedural programming language, which provides an interactive processing environment. In the experiments, about 1TB of social data consisting of about 1.6 billion records was read and pre-processed. The data generated during query processing were loaded directly into memory instead of onto the cluster disks, and thus the analysis time was significantly reduced. In the proposed system architecture, the read-only big data are optimized before program execution, and thus high-speed query results can be obtained. In the future, we will design various distributed algorithms for platforms that process social string data in real time.

REFERENCES

[1] K. Shvachko, et al., "The Hadoop Distributed File System," In Proceedings of the 26th IEEE Transactions on Computing Symposium on Mass Storage Systems and Technologies, pp. 1-10, 2010.
[2] V. K. Vavilapalli, A. C. Murthy, et al., "Apache Hadoop YARN: Yet Another Resource Negotiator," In Proceedings of the 4th annual Symposium on Cloud Computing, ACM, pp. 5:1-5:16, 2013.
[3] J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," In Proceedings of the 6th Symposium on Operating System Design and Implementation, pp. 137-150, 2004.
[4] M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. Franklin, S. Shenker, and I. Stoica, "Resilient Distributed Datasets: A fault-tolerant abstraction for in-memory cluster computing," NSDI, Apr. 2012.
[5] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica, "Spark: Cluster Computing with Working Sets," In HotCloud, p. 10, 2010.
[6] M. Armbrust, R. S. Xin, C. Lian, Y. Huai, D. Liu, J. K. Bradley, X. Meng, T. Kaftan, M. J. Franklin, A. Ghodsi, and M. Zaharia, "Spark SQL: Relational data processing in Spark," In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, pp. 1383-1394, 2015.
[7] Kiejin Park and Limei Peng, "A Design of High-speed Big Data Query Processing System for Social Data Analysis: Using SparkSQL," International Journal of Applied Engineering Research, 11(14), pp. 8221-8225, 2016.
[8] Kiejin Park, Changwon Baek and Limei Peng, "A Development of Streaming Big Data Analysis System using In-memory Cluster Computing Framework: Spark," LNEE, Vol. 393, pp. 157-163, 2016.
[9] P. F. Brown, P. V. deSouza, R. L. Mercer, V. J. D. Pietra, and J. C. Lai, "Class-Based N-gram Models of Natural Language," Computational Linguistics, 18(4), pp. 467-479, 1992.
[10] Li, Y. H. and Jain, A. K., "Classification of Text Documents," The Computer Journal, 41(8), pp. 537-546, 1998.
[11] Reddit, https://www.reddit.com/