
Twitter Data Analytics using Distributed Computing

Uma Narayanan, Research Scholar, Dept. of IT, SOE, CUSAT, Kerala
Athrira Unnikrishnan, M.Tech Student, Dept. of IT, SOE, CUSAT, Kerala
Dr. Varghese Paul, Professor, Dept. of IT, Rajagiri School of Engineering and Technology
Dr. Shelbi Joseph, Assistant Professor, Dept. of IT, SOE, CUSAT, Kerala

ABSTRACT

Twitter is one of the trending social media platforms in social networking, and Twitter data is a typical example of unstructured data. Analyzing unstructured data is difficult, and extracting meaningful information from such a huge amount of data is a tedious task. To analyze the Twitter data we use distributed computing techniques; here, Spark and MapReduce are used. Apache Spark is a framework for distributed, in-memory computing, and this distributed nature makes it suitable for big data analysis. The Twitter data is collected using the Twitter APIs, and the analysis is performed using Spark and MapReduce. To convert the unstructured data into meaningful information, a classification technique is used. Among the many available classification techniques, we use the Support Vector Machine. The results of our experiment show that Spark is the better method for big data classification.

Keywords: Twitter, Big data, Spark, Distributed computing, MapReduce

INTRODUCTION

A huge amount of data is generated by human activity in the digital space; for example, Facebook stores more than 30 petabytes of data [1]. Such huge amounts of data containing useful information are called Big Data, and the most popular tool for Big Data mining is Spark. Apache Spark [2] is a lightning-fast cluster computing technology designed for fast computation. It is based on Hadoop MapReduce and extends the MapReduce model to efficiently support more types of computation, including interactive queries and stream processing. The main feature of Spark [3] is its in-memory cluster computing [4], which increases the processing speed of an application. Spark can be used for batch applications, iterative algorithms, interactive queries, and even streaming data. Apart from supporting all these workloads in a single system, it reduces the management burden of maintaining separate tools. Spark provides high-level APIs in Java, Scala [5], Python, and R [6], together with an optimized engine that supports general execution graphs. Spark also supports higher-level tools such as GraphX for graph processing, Spark SQL for SQL queries, Spark Streaming for real-time data processing, and MLlib for machine learning [7]. SQL queries can be run easily through Spark SQL, which provides data compatibility and performance optimization. Spark Streaming offers a rich set of APIs that can be used for streaming, batch, and interactive workloads. GraphX, Spark's API for graph processing, reduces memory overhead through parallel processing. MLlib is Spark's scalable machine learning library and includes relevant tests and data generators. The performance of machine learning algorithms increases considerably when they run on Spark rather than on MapReduce. The core technology and foundational abstraction of Spark is the RDD [8]; Spark SQL, MLlib, GraphX, and Spark Streaming are the core members of the Spark ecosystem, as shown in Fig 1. Section 2 covers the background work, Section 3 the proposed methodology, Section 4 the experimental results, and Section 5 the conclusion.
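To make the in-memory model concrete, the following is a minimal PySpark sketch in which an RDD is cached so that the second action is served from memory instead of being recomputed from disk; the input file name and the word-count logic are illustrative, not taken from the paper:

    from pyspark import SparkContext

    sc = SparkContext("local[*]", "InMemoryDemo")

    # Build an RDD of words from a text file of tweets (illustrative file name)
    words = sc.textFile("tweets.txt").flatMap(lambda line: line.split())

    # cache() marks the RDD for in-memory storage; it is materialized by the
    # first action and reused by every later action
    words.cache()

    print(words.count())                                      # computed from the file
    print(words.filter(lambda w: w.startswith("#")).count())  # served from memory

    sc.stop()

This reuse of a cached RDD across actions is what gives Spark its advantage over MapReduce for iterative workloads, where MapReduce would re-read the data from disk on every pass.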

Fig 1: The ecosystem of Spark

BACKGROUND WORK

Big data analytics frameworks require algorithms that work at very large scale, and advances in computational power have helped many computing frameworks emerge [9, 10, 11, 12, 13, 15]. The first was MapReduce [9], which revolutionized the way analytics is done; then came Apache Flink [10], which bridges the gap between non-parallel database systems and MapReduce. HaLoop [11], built on Hadoop and its HDFS, helped execute iterative algorithms faster than the traditional approach. Apache Mahout [12] is an open-source project primarily used for creating scalable machine learning algorithms; it uses the Apache Hadoop library to scale effectively in the cloud. i2MapReduce [13] is an incremental processing extension to MapReduce, which is a widely used framework for mining big data [14]. However, MapReduce does not perform well for iterative algorithms. Apache Spark [15] is an open-source framework developed to optimize large-scale interactive computation; the basic terminology of Spark is shown in Fig 2. Among the frameworks listed above, Apache Spark is fault tolerant and works well for iterative algorithms because of its in-memory computation, while retaining scalability.

Fig 2: Basic terminology of Spark [15]

Executing a series of non-interactive jobs all at one time is called batch processing [16]. Batch processing is particularly useful for operations that require the computer or a peripheral device for an extended period of time; note that batch processing implies there is no interaction with the user while the program is being executed. To overcome the limitations of batch processing, we use the stream processing of Apache Spark. Apache Spark applications run as independent sets of processes on a cluster, coordinated by the SparkContext object in the main program, called the driver program. Specifically, to run on a cluster, the SparkContext can connect to several types of cluster managers (Spark's standalone cluster manager, Mesos, or YARN), which allocate resources across applications.
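As a minimal sketch of this setup, the driver program below creates the SparkContext and selects the cluster manager through the master URL; the application name and host names are placeholders:

    from pyspark import SparkConf, SparkContext

    # The master URL selects the cluster manager:
    #   "spark://master-host:7077"  -- Spark's standalone cluster manager
    #   "mesos://master-host:5050"  -- Mesos
    #   "yarn"                      -- YARN
    #   "local[*]"                  -- no cluster manager, all cores of one machine
    conf = (SparkConf()
            .setAppName("TwitterAnalytics")
            .setMaster("spark://master-host:7077"))

    # The SparkContext is the driver-side handle that coordinates the
    # executor processes the cluster manager allocates to the application
    sc = SparkContext(conf=conf)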

PROPOSED METHODOLOGY

The classification of unstructured data is a difficult task. We use Spark, coded in Python, to classify the Twitter data and compare the results with MapReduce. Spark provides a simple, secure method of submitting jobs without many of the complex setup requirements of MapReduce. Among the supervised machine learning algorithms available for classification (Decision Tree, Support Vector Machine, Naive Bayes, Neural Network, and k-Nearest Neighbor), we use the Support Vector Machine, which gives more accurate results than the others; a sketch of this classification step is given after Fig 5.

EXPERIMENTAL RESULT

The data needed for the research is collected from Twitter using the Twitter Streaming API; Twitter provides its own API for tweet retrieval, and we use this API in our Python code to retrieve the tweets needed for our research. The implementation is done on Google Cloud, where Spark and MapReduce were installed. The experiment was conducted with data sets of different sizes, listed in Table 1, and the results are shown in Figs 5, 6, 7, 8, and 9. The results show that Spark performs significantly better than MapReduce and is also easier to implement and use: the average accuracy rate for MapReduce is 79.2%, whereas for Spark it is about 86.8%. The implementation and running of Apache Spark and MapReduce are discussed, and the time taken to run the same data set on MapReduce and on Spark is also shown.

Table 1: Test Data

Group         Data Size
Test Data 1   10 MB
Test Data 2   100 MB
Test Data 3   500 MB
Test Data 4   700 MB
Test Data 5   1 GB

Fig 5: Comparison of the different test data items run in MapReduce and Spark
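The paper does not reproduce its source code, so the following is only a sketch of how the classification step described in the methodology could be written with Spark's MLlib in Python; the input file name, the tab-separated label-and-tweet format, and the feature dimension are assumptions made for illustration:

    from pyspark import SparkContext
    from pyspark.mllib.classification import SVMWithSGD
    from pyspark.mllib.feature import HashingTF
    from pyspark.mllib.regression import LabeledPoint

    sc = SparkContext("local[*]", "TweetSVM")

    # Assumed input: one tweet per line in the form "label<tab>text", label 0 or 1
    raw = sc.textFile("labeled_tweets.tsv")

    hashing_tf = HashingTF(numFeatures=10000)  # feature dimension is illustrative

    def to_point(line):
        label, text = line.split("\t", 1)
        # Hash the tweet's tokens into a sparse term-frequency vector
        return LabeledPoint(float(label), hashing_tf.transform(text.lower().split()))

    data = raw.map(to_point)
    train, test = data.randomSplit([0.8, 0.2], seed=42)
    train.cache()  # the SVM trainer makes repeated passes over the training set

    model = SVMWithSGD.train(train, iterations=100)

    # Accuracy on the held-out tweets
    pairs = test.map(lambda p: (model.predict(p.features), p.label))
    accuracy = pairs.filter(lambda pl: pl[0] == pl[1]).count() / float(test.count())
    print("accuracy: %.3f" % accuracy)

    sc.stop()

Because the feature hashing and the SVM training both run as RDD operations, the same script scales from a single machine to a cluster simply by changing the master URL.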

Fig 6: Comparison of run time against the number of files per node

Fig 7: CPU utilization of MapReduce and Spark

Fig 8: Average classification error for MapReduce and Spark

Fig 9: Accuracy rate of MapReduce and Spark for different data sets

CONCLUSION

Big data is a hot topic of research, and researchers are realizing that important predictions can be made by processing and analyzing big data. Most of this data is unstructured or semi-structured, so it must be formatted in a way that makes it suitable for data mining and subsequent analysis. The main purpose of this work was to build a distributed framework targeted at Big Data applications. The experimental results show that the proposed method yields a significant reduction in run time and CPU utilization with minimal misclassification. The results show that Spark is the better choice for big data analysis, with 86.8% accuracy compared to 79.2% for MapReduce.

REFERENCES

[1] J. Manyika, M. Chui, B. Brown, J. Bughin, R. Dobbs, C. Roxburgh, and A. H. Byers, "Big data: The next frontier for innovation, competition, and productivity," McKinsey Global Inst., Tech. Rep., pp. 1-137, May 2011.
[2] http://spark.apache.org/
[3] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica, "Spark: Cluster Computing with Working Sets," in HotCloud, 2010.
[4] http://www.tutorialspoint.com/spark_sql/spark_sql_quick_guide.htm
[5] J. A. Miller, C. Bowman, V. G. Harish, and S. Quinn, "Open Source Big Data Analytics Frameworks Written in Scala," 2016 IEEE International Congress on Big Data, pp. 389-393.
[6] A. Malviya, A. Udhani, and S. Soni, "R-Tool: Data Analytic Framework for Big Data," 2016 Symposium on Colossal Data Analysis and Networking (CDAN).
[7] C. Chen, J. Zhang, Y. Xie, and Y. Xiang, "A Performance Evaluation of Machine Learning-Based Streaming Spam Tweets Detection," IEEE Transactions on Computational Social Systems, vol. 2, no. 3, pp. 65-76, September 2015.
[8] M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. J. Franklin, S. Shenker, and I. Stoica, "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing," in NSDI, 2012, pp. 15-28.
[9] J. Dean and S. Ghemawat, "MapReduce: Simplified data processing on large clusters," Commun. ACM, vol. 51, no. 1, pp. 107-113, 2008.

[10] P. Mika, "Flink: Semantic Web technology for the extraction and analysis of social networks," Web Semantics: Sci. Services Agents World Wide Web, vol. 3, no. 2, pp. 211-223, 2005.
[11] Y. Bu, B. Howe, M. Balazinska, and M. D. Ernst, "HaLoop: Efficient iterative data processing on large clusters," Proc. VLDB Endowment, vol. 3, no. 1/2, pp. 285-296, 2010.
[12] Apache Mahout Team, "Apache Mahout: Scalable machine-learning and data-mining library," 2011. [Online]. Available: http://mahout.apache.org/
[13] Y. Zhang, S. Chen, Q. Wang, and G. Yu, "i2MapReduce: Incremental MapReduce for mining evolving big data," IEEE Trans. Knowl. Data Eng., vol. 27, no. 7, pp. 1906-1919, Jul. 2015.
[14] M. Sanjay and B. H. Alamma, "An Insight into Big Data Analytics Methods and Application," IEEE Conference.
[15] A. Khumoyun, Y. Cui, and H. Lee, "Spark Based Distributed Deep Learning Framework For Big Data Applications," IEEE, 2016, ISBN 978-1-5090-3546-5.
[16] http://www.webopedia.com/term/b/batch_processing.html