A Fast Social-User Reaction Analysis Using Hadoop and Spark Platform

Kiejin Park
Professor, Department of Integrative Systems Engineering, Ajou University, Suwon, South Korea

Limei Peng
Assistant Professor, Department of Industrial Engineering, Ajou University, Suwon, South Korea

Abstract

Social data such as comments is massive and unstructured, so existing relational data models fall short of processing this kind of data. Moreover, it is difficult to analyze user responses in a real-time manner from dynamically growing social data. In this paper, to quickly analyze social users' comments, we design a fast social-user reaction analysis system based on Hadoop for storing big data and the distributed in-memory Spark platform for data processing. In the experiments, about one terabyte of social data, composed of around 1.6 billion records, is first stored and pre-processed. An algorithm called n-gram is then used to analyze the comment responses. While running this algorithm, the big data is loaded not onto cluster disks but directly into memory, which makes it possible to process social users' responses in a real-time manner.

Keywords: Social Data, Hadoop, Spark, SparkSQL, n-gram

INTRODUCTION

Social network services (SNS) continue to expand as a space for sharing online information. Moreover, with the development of IT technologies, the number of social users grows sharply, and the amount of generated data increases geometrically. Social comment data, which reflect social users' opinions, become especially important material, since we can gather the responses and opinions of users from different strata. In other words, the large amount of generated informal string data, which can reflect various social phenomena and predict online trends, becomes a very important source. In particular, if we collect users' comments on news, products, communities, SNS, etc., and analyze them, we can obtain valuable information that cannot be obtained by existing methodologies. For example, we can score users' feelings according to specific keywords.
In this way, even without a survey, we can predict users' feelings, analyze the reasons behind those feelings, and find the aspects that need improvement. However, compared with existing RDBMS (Relational Database Management System) transaction data, such social media-based data is much larger in volume and has a heterogeneous structure. For this reason, we adopt HDFS (Hadoop Distributed File System) running on YARN (Yet Another Resource Negotiator) as the data storage system, in order to process massive social data [1][2]. To quickly process the data stored in HDFS, the in-memory Spark platform is applied instead of the existing MapReduce computing method, so that data are loaded from distributed disks into distributed memories.

(This work is partially supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (2015R1C1A1A02036536) and in part by the Ajou University Research Fund.)

The rest of the paper is organized as follows. Section 2 introduces in-memory distributed computing and the n-gram algorithm. Section 3 describes the design of the distributed processing system for social big data and the data processing procedure. Section 4 shows the analysis results of social users' responses obtained with the prototype system. Section 5 concludes the paper.

RELATED RESEARCH

In-memory based distributed computing

In HDFS of Hadoop, a representative platform for analyzing big data, data are stored in a distributed way, and then the read-only functions Map and Reduce are combined in various ways to analyze and process data in parallel [3]. However, because of the bottleneck of reading from hard disk drives every time, processing big data at high speed in a real-time way is limited. To solve this, we adopt a data structure based on Spark and keep data in cluster memory so as to enable in-memory data processing [4][5]. That is to say, even for complicated query processing, intermediate data are loaded into memory for execution, and a lineage for the executable data is set in advance, so that data can be processed in an optimized way.
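The lineage idea above can be pictured with a toy example. The following plain-Python class is purely our own illustration (it is not the Spark API): transformations are merely recorded, and nothing is computed until an action replays the whole recorded lineage over the in-memory data.

```python
class LazyDataset:
    """A toy imitation of RDD lineage: record transformations, run them only on an action."""

    def __init__(self, data, lineage=None):
        self.data = data
        self.lineage = lineage or []          # recorded operations, not yet executed

    def map(self, fn):                        # transformation: only extends the lineage
        return LazyDataset(self.data, self.lineage + [("map", fn)])

    def filter(self, pred):                   # transformation: only extends the lineage
        return LazyDataset(self.data, self.lineage + [("filter", pred)])

    def collect(self):                        # action: executes the whole lineage now
        out = self.data
        for op, fn in self.lineage:
            out = list(map(fn, out)) if op == "map" else [x for x in out if fn(x)]
        return out

ds = LazyDataset([1, 2, 3, 4]).map(lambda x: x * 10).filter(lambda x: x > 15)
# Nothing has been computed yet; collect() replays the recorded lineage in memory.
print(ds.collect())  # -> [20, 30, 40]
```

Real Spark additionally partitions the data across nodes and uses the lineage to recompute only lost partitions after a failure; this sketch only conveys the deferred-execution aspect.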
Especially for interactive analyses, to simultaneously satisfy the sequential processing style of functional programming and the declarative processing style of SQL, the DataFrame API (Application Programming Interface) of SparkSQL [6] is used to enable analyses based on existing SQL-like commands. This means that informal data can be handled while distributed batch processing of big data remains possible [7][8].

Text data index analysis: n-gram

Various methods exist for analyzing word bunches in continuous string data. In this paper, we apply the n-gram algorithm, which can analyze data features according to the frequency with which a word appears in a sentence, without a separate feature-selection process that depends on the data structure [9]. Here, n means the number of words that appear contiguously, and n-gram is one of the representative index-analysis algorithms that use conditional probabilities. The reason for selecting n-gram to analyze text data is that it can express the whole contents well when a relevant word bunch appears frequently and plays the same role as some specific words [10]. When applying the syllable-unit-based n-gram algorithm to character strings, we first slide windows of a fixed size over the character strings, and then we collect the
character item sets in units of syllables from left to right of the windows. Using this method, we can collect the character item sets for two respective character strings, compare the appearance frequencies of the two character strings, and finally show the comparison results numerically. Here, n values of 1, 2, and 3 correspond to the Unigram, Bigram, and Trigram algorithms, respectively. The chain rule of probability for n-gram is shown in Equation (1):

P(w_1^n) = P(w_1) P(w_2 | w_1) P(w_3 | w_1^2) \cdots P(w_n | w_1^{n-1}) = \prod_{k=1}^{n} P(w_k | w_1^{k-1})    (1)

where w_1^n denotes the sequential word enumeration w_1 ... w_n. The generalized N-gram approximation is shown in Equation (2):

P(w_1^n) \approx \prod_{k=1}^{n} P(w_k | w_{k-N+1}^{k-1})    (2)

For example, when applying the Bigram (2-gram) over the consecutive words w_1, w_2, ..., the equation becomes (3):

P(w_1^n) \approx \prod_{k=1}^{n} P(w_k | w_{k-1})    (3)

Table 1. Prototype System Environment

  Item                   | Specification
  -----------------------|------------------------------------
  H/W Data Node (Slave)  | CPU: Intel Core i7 or i5, 3.2GHz
                         | Memory: 384GB (64GB * 6 nodes)
                         | HDD: 48TB (8TB * 6 nodes)
  O.S.                   | Ubuntu 14.04 LTS
  Big Data Platform      | Hadoop 2.7.2 / Spark 1.6.0

Cluster management and data pre-processing

In the experiments, the cluster system uses Hadoop and YARN to manage the whole cluster, and on top of that, the interactive Spark platform is executed for analysis, as shown in Fig. 2. The big-data query system consists of one master node (Driver Program) and six slave nodes (Workers). Since Spark handles data in units of resilient distributed datasets (RDD) organized as a directed acyclic graph (DAG) [4][5], it can use the memory of all nodes in the cluster. This significantly alleviates the bottleneck that occurs at the hard disk when using existing big-data analysis platforms such as MapReduce [3].

REACTION ANALYSIS SYSTEMS

System structure

We design the architecture of the reaction analysis system for high-speed analysis of big social data. Based on this system architecture, we can realize an interactive experiment environment. Fig. 1 shows the proposed architecture. Within one cluster, multiple applications can be executed in a pipeline.
More exactly, real-time streaming data processing based on Spark Streaming, linkage with results obtained from machine learning (MLlib), analysis work, and visualization of analysis results can all be done simultaneously.

Figure 1. The architecture of the reaction analysis system
Figure 2. In-Memory Cluster Management

In n-gram, a lot of memory is needed to maintain the intermediate computing results. For this reason, the specification of the proposed reaction analysis system's experiment environment is shown in Table 1. The total memory of the cluster is 384GB and the total HDD capacity is 48TB.

Fig. 3 shows the changing process of the loaded text data. String data (i.e., users' comments) is tokenized word by word and then filtered against a StopWords list. In the column named body in Fig. 3, comments of social data are stored as strings; in the column named words, every string is split into individual words, assigned IDs, and stored in arrays; the column named words1 shows the words after the StopWords are removed. Finally, word-bunch outputs are generated from words1 by the n-gram step, according to the value of n. Analyzers can now use a declarative language such as SQL, together with functional-style APIs such as the DataFrame API of SparkSQL, to easily achieve high-speed query processing of huge amounts of informal data. This means it is possible to switch over from existing SQL-based big data analyses.
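The column-by-column transformation of Fig. 3 can be sketched in a few lines of plain Python. The function and the toy stop-word list below are our own illustration; in the actual system these steps run as distributed Spark stages over DataFrame columns rather than on a single string.

```python
def preprocess(body, stopwords, n=2):
    """Mimic the column changes of Fig. 3:
    body -> words (tokenized) -> words1 (stop words removed) -> n-gram word bunches."""
    words = body.lower().split()                       # 'words' column: tokenized
    words1 = [w for w in words if w not in stopwords]  # 'words1' column: stop words removed
    ngrams = [" ".join(words1[i:i + n]) for i in range(len(words1) - n + 1)]
    return words, words1, ngrams

# Toy stop-word list (illustrative only).
stop = {"i", "is", "that", "it", "a", "the"}
words, words1, bigrams = preprocess("That is pretty smooth", stop, n=2)
# words  -> ['that', 'is', 'pretty', 'smooth']
# words1 -> ['pretty', 'smooth']
# bigrams -> ['pretty smooth']
```

Changing the `n` argument switches between Unigram (n=1), Bigram (n=2), and Trigram (n=3) word bunches, matching the role of n described in Section 2.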
Figure 3. Examples of dataset changes through preprocessing (columns: body — raw comment strings, e.g. "I just wish Googl..."; words — tokenized arrays, e.g. [i, just, wish, g...]; words1 — after stop-word removal, e.g. [just, wish, goog...])

EXPERIMENTAL RESULTS

Input dataset

Fig. 4 shows the social data samples, in the form of JSON files, that are used in the experiments. Every item is represented as a key followed by a colon (:) and a string value, and is separated from the next item by a comma (,). The total number of entry properties is 22, and any record may have missing values. In this experiment, the size of the social data used is about 1TB, and the total number of records is around 1.65 billion.

[1st Record]
{"score_hidden":false,"name":"t1_cas8zv","link_id":"t3_2qyr1a","body":"most of us have some family members like this. *Most* of my family is like this.","downs":0,"created_utc":"1420070400","score":14,"author":"youngmodern","distinguished":null,"id":"cas8zv","archived":false,"parent_id":"t3_2qyr1a","subreddit":"exmormon","author_flair_css_class":null,"author_flair_text":null,"gilded":0,"retrieved_on":1425124282,"ups":14,"controversiality":0,"subreddit_id":"t5_2r0gj","edited":false}

[2nd Record]
{"distinguished":null,"id":"cas8zw","archived":false,"author":"redcoatsforever","score":3,"created_utc":"1420070400","downs":0,"body":"But Mill's career was way better. Bentham is like, the Joseph Smith to Mill's Brigham Young.","link_id":"t3_2qv6c6","name":"t1_cas8zw","score_hidden":false,"controversiality":0,"subreddit_id":"t5_2s4gt","edited":false,"retrieved_on":1425124282,"ups":3,"author_flair_css_class":"on","gilded":0,"author_flair_text":"Ontario","subreddit":"CanadaPolitics","parent_id":"t1_cas2b6"} ...

Figure 4. Sample records in the form of JSON [11]

In this experiment, we analyze social users' responses through the body data classified under Android. First, social users' comments in the body column are taken as a corpus; the corpus is then pre-processed via a tokenizer and stop-word removal. Finally, the word bunches selected by the n-gram algorithm are used. Fig. 5 shows the processing results of users' comments with 2-gram. We can see that after pre-processing, every comment is divided into character item sets of two continuous bunched words.

Figure 5. Results after applying 2-gram

User-reaction analysis results

Fig. 6 shows the analysis of "What are social users interested in?", obtained by applying 3-gram to all the social data records. Fig. 6(a) shows the results after applying 3-gram to about 500 million Android users' comments. From the results, we can see that the count of "play google com" is 86,674, which is the highest. Fig. 6(b) shows the items of interest for Bitcoin users. Among a total of about 370 million comments, the count of "www reddit com" is 71,166, which is the highest. Through this method, we can learn which sites and items social users prefer.

Figure 6. Social user analysis results under different cases: (a) Android users; (b) Bitcoin users

Through the experiments, we can see that tremendous amounts of informal social data can all be read in and analyzed using the in-memory-based Spark platform, and the processing speed can be significantly improved. Using n-gram to pre-process the huge number of comments, we can figure out the items that social users are interested in. In fact, for about one TB of data comprising 1.6 billion records, the average query processing time is 25 minutes.
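Counts of the kind shown in Fig. 6 (e.g., the frequency of the bunch "play google com") amount to a frequency count over all generated n-grams. A minimal single-machine sketch, using toy data of our own rather than the actual Reddit corpus, looks like this:

```python
from collections import Counter

def top_ngrams(token_lists, n=3, k=3):
    """Count n-gram word bunches across all comments and return the k most frequent."""
    counts = Counter()
    for tokens in token_lists:
        counts.update(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return counts.most_common(k)

# Toy pre-processed comments (each already tokenized and stop-word filtered).
comments = [["play", "google", "com", "app"],
            ["play", "google", "com"],
            ["www", "reddit", "com"]]
print(top_ngrams(comments, n=3, k=2))
# The bunch "play google com" appears twice, so it ranks first.
```

In the actual system this counting runs as a distributed aggregation over the cluster memory, which is what keeps the query time at minutes rather than hours for 1.6 billion records.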
To process the same amount of data with an existing RDBMS-based query processing system, the SQL querying time is about 3~4 hours; the processing time of our proposed system is thus much shorter. Fig. 7 counts and compares the analyzed results after applying the 2-gram algorithm to all the records at different
time periods. Fig. 7(a) shows the results for buildapc users. It shows the site names frequently used by social users and the computer products of interest. A lot of the content is related to computer products, and information from users who assemble their own computers is shared. Over the three years, the products users were most interested in ranked as power supply, video card, hard drive, and internal hard drive, and the ranking did not change during that period. Fig. 7(b) shows the results for pcmasterrace users. There is almost no data in January 2013, and the data are generated later than those of buildapc. In contrast to the computer products, here we can see the websites that users are interested in.

Figure 7. Comparison of items of interest per year

CONCLUSION

In this paper, to achieve real-time analysis of massive informal SNS data, we proposed a social-user reaction analysis system architecture based on HDFS and the distributed in-memory Spark platform. Moreover, the proposed architecture has the advantage of simultaneously using a declarative programming language, such as SQL, and a platform based on distributed batch processing with a procedural programming style, which provides an interactive processing environment. In the experiments, about one TB of social data consisting of about 1.6 billion records was read and pre-processed. The data generated during query processing were loaded directly into memory instead of onto the cluster disks, and thus the analysis time was significantly reduced. In the proposed system architecture, the read-only big data were optimized before program execution, so high-speed query results can be obtained. In the future, we will design various distributed algorithms for platforms that process social string data in a real-time manner.

REFERENCES
[1] K. Shvachko, et al., "The Hadoop Distributed File System," in Proceedings of the 26th IEEE Symposium on Mass Storage Systems and Technologies (MSST), pp. 1-10, 2010.
[2] V. K. Vavilapalli, A. C. Murthy, et al., "Apache Hadoop YARN: Yet Another Resource Negotiator," in Proceedings of the 4th Annual Symposium on Cloud Computing (SoCC), ACM, pp. 5:1-5:16, 2013.
[3] J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," in Proceedings of the 6th Symposium on Operating Systems Design and Implementation (OSDI), pp. 137-150, 2004.
[4] M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. Franklin, S. Shenker, and I. Stoica, "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing," in NSDI, Apr. 2012.
[5] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica, "Spark: Cluster Computing with Working Sets," in HotCloud, p. 10, 2010.
[6] M. Armbrust, R. S. Xin, C. Lian, Y. Huai, D. Liu, J. K. Bradley, X. Meng, T. Kaftan, M. J. Franklin, A. Ghodsi, and M. Zaharia, "Spark SQL: Relational Data Processing in Spark," in Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, pp. 1383-1394, 2015.
[7] Kiejin Park and Limei Peng, "A Design of High-speed Big Data Query Processing System for Social Data Analysis: Using SparkSQL," International Journal of Applied Engineering Research, 11(14), pp. 8221-8225, 2016.
[8] Kiejin Park, Changwon Baek, and Limei Peng, "A Development of Streaming Big Data Analysis System using In-memory Cluster Computing Framework: Spark," LNEE, Vol. 393, pp. 157-163, 2016.
[9] P. F. Brown, P. V. deSouza, R. L. Mercer, V. J. D. Pietra, and J. C. Lai, "Class-Based N-gram Models of Natural Language," Computational Linguistics, 18(4), pp. 467-479, 1992.
[10] Y. H. Li and A. K. Jain, "Classification of Text Documents," The Computer Journal, 41(8), pp. 537-546, 1998.
[11] Reddit, https://www.reddit.com/