Twitter Data Analytics Using Distributed Computing

Uma Narayanan, Research Scholar, Dept. of IT, SOE, CUSAT, Kerala
Athrira Unnikrishnan, M.Tech Student, Dept. of IT, SOE, CUSAT, Kerala
Dr. Varghese Paul, Professor, Dept. of IT, Rajagiri School of Engineering and Technology
Dr. Shelbi Joseph, Assistant Professor, Dept. of IT, SOE, CUSAT, Kerala

ABSTRACT
Twitter is one of the trending social media platforms, and its data is a typical example of unstructured data. Analyzing unstructured data is difficult, and extracting meaningful information from such a huge amount of data is a tedious task. To analyze Twitter data we use distributed computing techniques; here, Spark and MapReduce are used. Apache Spark is a framework for distributed, in-memory computing, and its distributed nature makes it well suited to big data analysis. The Twitter data is collected using the Twitter APIs, and the analysis is performed with Spark and MapReduce. To convert the unstructured data into meaningful information, a classification technique is applied; among the many available classifiers, we use the Support Vector Machine. The results of our experiment show that Spark is the better method for big data classification.

Keywords
Twitter, Big data, Spark, Distributed computing, MapReduce

INTRODUCTION
A huge amount of data is generated by human activity in the digital space; for example, Facebook stores more than 30 petabytes of data [1]. Such large volumes of data containing useful information are called Big Data. One of the most popular tools for Big Data mining is Spark. Apache Spark [2] is a fast cluster computing technology designed for rapid computation. It is based on the Hadoop MapReduce model and extends that model to support more types of computation efficiently, including interactive queries and stream processing.
The main feature of Spark [3] is its in-memory cluster computing [4], which increases the processing speed of an application. Spark can be used for many kinds of workloads, such as batch applications, iterative algorithms, interactive queries, and streaming data. Besides supporting all these workloads in a single system, it reduces the management burden of maintaining separate tools. Spark provides high-level APIs in Java, Scala [5], Python, and R [6], along with an optimized engine that supports general execution graphs. Spark also supports higher-level tools such as GraphX for graph processing, Spark SQL for SQL queries, Spark Streaming for real-time data processing, and MLlib for machine learning [7]. SQL queries can be run easily through Spark SQL, which offers data compatibility and performance optimization. Spark Streaming provides rich APIs that can be used for streaming, batch applications, and interactive queries. GraphX, the Spark API for charts and graphs, reduces memory overhead through parallel processing. MLlib (Machine Learning library) is Spark's scalable machine learning library; it includes relevant tests and data generators. The performance of machine learning algorithms improves considerably when run on Spark rather than on MapReduce. In general, the core technology and foundational abstraction of Spark is the RDD [8]. Spark SQL, MLlib, GraphX, and Spark Streaming are the core members of the Spark ecosystem, as shown in Fig 1. Section 2 covers the background work, Section 3 the proposed methodology, Section 4 the experimental results, and Section 5 the conclusion.
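Since Spark generalizes the MapReduce model introduced above, a plain-Python sketch of that model's map, shuffle, and reduce phases helps fix ideas. This is an illustrative emulation over a few sample tweets, not any framework's actual API:

```python
from collections import defaultdict
from functools import reduce

def map_phase(tweets):
    # Map: emit a (word, 1) pair for every word in every tweet
    return [(word.lower(), 1) for tweet in tweets for word in tweet.split()]

def shuffle_phase(pairs):
    # Shuffle: group all values by key, as the framework does between phases
    groups = defaultdict(list)
    for word, count in pairs:
        groups[word].append(count)
    return groups

def reduce_phase(groups):
    # Reduce: sum the counts for each word
    return {word: reduce(lambda a, b: a + b, counts)
            for word, counts in groups.items()}

tweets = ["spark is fast", "spark beats mapreduce", "mapreduce is batch"]
counts = reduce_phase(shuffle_phase(map_phase(tweets)))
print(counts["spark"])  # 2
```

In a real cluster the map and reduce tasks run in parallel on different nodes; the single-process version above only mirrors the data flow.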
Fig 1: The ecosystem of Spark

BACKGROUND WORK
Big data analytics frameworks require very large-scale algorithms. Advances in computational power have helped more computing frameworks emerge [9, 10, 11, 12, 13, 15]. The first was MapReduce [9], which revolutionized the way analytics is done; then came Apache Flink [10], which bridges the gap between non-parallel database systems and MapReduce. The HDFS of Hadoop [11] enabled faster execution of algorithms than traditional approaches. Apache Mahout [12] is an open source project primarily used for creating scalable machine learning algorithms; it uses the Apache Hadoop library to scale effectively in the cloud. i2MapReduce [13] is an incremental processing extension to MapReduce, a widely used framework for mining big data [14]. However, MapReduce does not perform well for iterative algorithms. Apache Spark [15] is an open source framework developed to optimize large-scale interactive computation. The basic terminology of Spark is shown in Fig 2. Among the frameworks above, Apache Spark is fault tolerant and works well for iterative algorithms because of its in-memory computations, while retaining scalability.

Fig 2: Basic terminology of Spark [15]

The execution of non-interactive jobs all at one time is called batch processing [16]. Batch processing is particularly useful for operations that require the computer or a peripheral device for an extended period of time; note that it implies there is no interaction with the user while the program is being executed. To overcome the difficulties of batch processing, we use stream processing with Apache Spark. Apache Spark applications run as independent sets of processes on a cluster, coordinated by the SparkContext object in the main program, called the driver program.
Specifically, to run on a cluster, the SparkContext can connect to several types of cluster managers (Spark's own standalone cluster manager, Mesos, or YARN), which allocate resources across applications.
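As an illustration, the choice of cluster manager shows up only in the --master value passed to spark-submit; the host names, ports, and script name below are placeholders, not the paper's actual deployment:

```shell
# Run the driver program locally on 4 cores (useful for development)
spark-submit --master local[4] classify_tweets.py

# Spark's own standalone cluster manager (host/port are placeholders)
spark-submit --master spark://master-host:7077 classify_tweets.py

# Hadoop YARN (cluster location is read from HADOOP_CONF_DIR)
spark-submit --master yarn --deploy-mode cluster classify_tweets.py

# Apache Mesos (host/port are placeholders)
spark-submit --master mesos://mesos-host:5050 classify_tweets.py
```

The application code itself is unchanged across all four; only resource allocation differs.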
PROPOSED METHODOLOGY
Classifying unstructured data is a difficult task. We use Spark, coded in Python, to classify the Twitter data and compare the results with MapReduce. Spark provides a simple, secure method of submitting jobs without many of the complex setup requirements of MapReduce. Among the supervised machine learning algorithms available for classification, such as Decision Tree, Support Vector Machine, Naive Bayes, Neural Network, and k-Nearest Neighbor, we used the Support Vector Machine, which gave more accurate results than the others.

EXPERIMENTAL RESULT
The data needed for the research is collected from Twitter using the Twitter Streaming API; Twitter has created its own API for tweet retrieval. The implementation was done on Google Cloud, where Spark and MapReduce were installed. We used this Twitter API in our Python code to retrieve the tweets needed for our research. The experiment was conducted with data sets of different sizes, listed in Table 1, and the results are shown in Figs 5, 6, 7, 8, and 9. The results show that Spark performs significantly better than MapReduce and is also easier to implement and use. The average accuracy for MapReduce is 79.2%, whereas for Spark it is about 86.8%. The implementation and running of Apache Spark and MapReduce are discussed, and the time taken to run the same data set on MapReduce and on Spark is also shown.

Table 1: Test Data
Group         Data Size
Test Data 1   10 MB
Test Data 2   100 MB
Test Data 3   500 MB
Test Data 4   700 MB
Test Data 5   1 GB

Fig 5: Comparison of the different test data items run in MapReduce and Spark
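For intuition, the linear SVM classifier used in the methodology reduces at prediction time to the decision rule sign(w . x + b) over a feature vector x. The sketch below shows that rule on bag-of-words features; the vocabulary, weights, and bias here are illustrative placeholders, not values learned from the paper's Twitter data set:

```python
# Hypothetical vocabulary and learned parameters, for illustration only
vocab = ["good", "great", "bad", "awful"]
weights = [1.0, 1.5, -1.2, -2.0]
bias = 0.1

def features(tweet):
    # Bag-of-words feature vector: count of each vocabulary word in the tweet
    words = tweet.lower().split()
    return [words.count(w) for w in vocab]

def classify(tweet):
    # Linear SVM decision function: positive class if w . x + b > 0
    x = features(tweet)
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    return "positive" if score > 0 else "negative"

print(classify("great camera good battery"))  # positive
print(classify("awful service bad screen"))   # negative
```

Training the SVM finds the weights and bias that maximize the margin between the two classes; in the actual experiments that step runs distributed over the cluster, while prediction is the cheap dot product shown here.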
Fig 6: Time taken to run versus the number of files per node

Fig 7: CPU utilization of MapReduce and Spark

Fig 8: Average classification error for MapReduce and Spark
Fig 9: Accuracy rate of MapReduce and Spark for different data sets

CONCLUSION
Big data is a hot research topic, and researchers are realizing that important predictions can be made by processing and analyzing big data. Most of the data is unstructured or semi-structured; it must be formatted in a way that makes it suitable for data mining and subsequent analysis. The main purpose of this work was to build a distributed framework targeted at Big Data applications. The experimental results show that the proposed method yields a significant reduction in run time and CPU utilization, with minimal misclassification. The results show that Spark is better for big data analysis, with 86.8% accuracy compared to 79.2% for MapReduce.

REFERENCES
[1] J. Manyika, M. Chui, B. Brown, J. Bughin, R. Dobbs, C. Roxburgh, and A. H. Byers, Big data: The next frontier for innovation, competition, and productivity, McKinsey Global Inst., Tech. Rep., pp. 1-137, May 2011.
[2] http://spark.apache.org/
[3] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica, Spark: Cluster Computing with Working Sets, in HotCloud, 2010.
[4] http://www.tutorialspoint.com/spark_sql/spark_sql_quick_guide.htm
[5] J. A. Miller, C. Bowman, V. G. Harish, and S. Quinn, Open Source Big Data Analytics Frameworks Written in Scala, in 2016 IEEE International Congress on Big Data, pp. 389-393.
[6] A. Malviya, A. Udhani, and S. Soni, R-Tool: Data Analytic Framework for Big Data, in 2016 Symposium on Colossal Data Analysis and Networking (CDAN).
[7] C. Chen, J. Zhang, Y. Xie, and Y. Xiang, A Performance Evaluation of Machine Learning-Based Streaming Spam Tweets Detection, IEEE Transactions on Computational Social Systems, vol. 2, no. 3, pp. 65-76, Sep. 2015.
[8] M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. J. Franklin, S. Shenker, and I. Stoica, Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing, in NSDI, 2012, pp. 15-28.
[9] J. Dean and S. Ghemawat, MapReduce: Simplified data processing on large clusters, Commun. ACM, vol. 51, no. 1, pp. 107-113, 2008.
[10] P. Mika, Flink: Semantic Web technology for the extraction and analysis of social networks, Web Semantics: Sci. Services Agents World Wide Web, vol. 3, no. 2, pp. 211-223, 2005.
[11] Y. Bu, B. Howe, M. Balazinska, and M. D. Ernst, HaLoop: Efficient iterative data processing on large clusters, Proc. VLDB Endowment, vol. 3, no. 1/2, pp. 285-296, 2010.
[12] A. M. Team, Apache Mahout: Scalable machine-learning and data-mining library, 2011. [Online]. Available: http://mahout.apache.org/
[13] Y. Zhang, S. Chen, Q. Wang, and G. Yu, i2MapReduce: Incremental MapReduce for mining evolving big data, IEEE Trans. Knowl. Data Eng., vol. 27, no. 7, pp. 1906-1919, Jul. 2015.
[14] M. Sanjay and B. H. Alamma, An Insight into Big Data Analytics Methods and Application, IEEE Conference.
[15] K. Akhmedov, Y. Cui, and H. Lee, Spark Based Distributed Deep Learning Framework for Big Data Applications, IEEE, 2016.
[16] http://www.webopedia.com/term/b/batch_processing.html