Election Analysis and Prediction Using Big Data Analytics

Similar documents
Big Data with Hadoop Ecosystem

Blended Learning Outline: Cloudera Data Analyst Training (171219a)

Big Data com Hadoop. VIII Sessão - SQL Bahia. Impala, Hive e Spark. Diógenes Pires 03/03/2018

Overview. : Cloudera Data Analyst Training. Course Outline :: Cloudera Data Analyst Training::

Big Data Hadoop Stack

Department of Information Technology, St. Joseph s College (Autonomous), Trichy, TamilNadu, India

Big Data Architect.

We are ready to serve Latest Testing Trends, Are you ready to learn.?? New Batches Info

International Journal of Advance Engineering and Research Development. A Study: Hadoop Framework

Microsoft Big Data and Hadoop

A Review Approach for Big Data and Hadoop Technology

Big Data. Big Data Analyst. Big Data Engineer. Big Data Architect

Big Data Specialized Studies

Automatic Voting Machine using Hadoop

The Hadoop Ecosystem. EECS 4415 Big Data Systems. Tilemachos Pechlivanoglou

Delving Deep into Hadoop Course Contents Introduction to Hadoop and Architecture

Introduction to Big-Data

This is a brief tutorial that explains how to make use of Sqoop in Hadoop ecosystem.

Processing Unstructured Data. Dinesh Priyankara Founder/Principal Architect dinesql Pvt Ltd.

LOG FILE ANALYSIS USING HADOOP AND ITS ECOSYSTEMS

Online Bill Processing System for Public Sectors in Big Data

Big Data and Hadoop. Course Curriculum: Your 10 Module Learning Plan. About Edureka

Big Data Analytics using Apache Hadoop and Spark with Scala

Topics. Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples

Introduction to BigData, Hadoop:-

Hadoop Online Training

File Inclusion Vulnerability Analysis using Hadoop and Navie Bayes Classifier

STUDENT GRADE IMPROVEMENT INHIGHER STUDIES

Twitter data Analytics using Distributed Computing

DHANALAKSHMI COLLEGE OF ENGINEERING, CHENNAI

Obtaining Rough Set Approximation using MapReduce Technique in Data Mining

What is the maximum file size you have dealt so far? Movies/Files/Streaming video that you have used? What have you observed?

Modelling Structures in Data Mining Techniques

BEST BIG DATA CERTIFICATIONS


Juxtaposition of Apache Tez and Hadoop MapReduce on Hadoop Cluster - Applying Compression Algorithms

Oracle Big Data SQL brings SQL and Performance to Hadoop

Big Data Prediction on Crime Detection

Apache Spark and Hadoop Based Big Data Processing System for Clinical Research

Nowcasting. D B M G Data Base and Data Mining Group of Politecnico di Torino. Big Data: Hype or Hallelujah? Big data hype?

SQT03 Big Data and Hadoop with Azure HDInsight Andrew Brust. Senior Director, Technical Product Marketing and Evangelism

Hadoop, Yarn and Beyond

Big Data Hadoop Developer Course Content. Big Data Hadoop Developer - The Complete Course Course Duration: 45 Hours

Certified Big Data and Hadoop Course Curriculum

Security and Performance advances with Oracle Big Data SQL

CERTIFICATE IN SOFTWARE DEVELOPMENT LIFE CYCLE IN BIG DATA AND BUSINESS INTELLIGENCE (SDLC-BD & BI)

Data Science Training

Stages of Data Processing

Embedded Technosolutions

International Journal of Advanced Engineering and Management Research Vol. 2 Issue 5, ISSN:

Transaction Analysis using Big-Data Analytics

International Journal of Computer Engineering and Applications, BIG DATA ANALYTICS USING APACHE PIG Prabhjot Kaur

BIG DATA COURSE CONTENT

An Introduction to Big Data Formats

BIG DATA & HADOOP: A Survey

We are ready to serve Latest Testing Trends, Are you ready to learn?? New Batches Info

Big Data Analytics: What is Big Data? Stony Brook University CSE545, Fall 2016 the inaugural edition

<Insert Picture Here> Introduction to Big Data Technology

Chase Wu New Jersey Institute of Technology

Big Data on AWS. Peter-Mark Verwoerd Solutions Architect

Apache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context

FOUNDATIONS OF A CROSS-DISCIPLINARY PEDAGOGY FOR BIG DATA *

BIG DATA ANALYTICS USING HADOOP TOOLS APACHE HIVE VS APACHE PIG

A Review Paper on Big data & Hadoop

Hadoop. Introduction / Overview

A Survey on Big Data

Hierarchy of knowledge BIG DATA 9/7/2017. Architecture

International Journal of Advance Engineering and Research Development. A study based on Cloudera's distribution of Hadoop technologies for big data"

Management Information Systems Review Questions. Chapter 6 Foundations of Business Intelligence: Databases and Information Management

Activator Library. Focus on maximizing the value of your data, gain business insights, increase your team s productivity, and achieve success.

Oracle Big Data Fundamentals Ed 2

Chapter 3. Foundations of Business Intelligence: Databases and Information Management

Comparison of FP tree and Apriori Algorithm

Overview of Web Mining Techniques and its Application towards Web

Things Every Oracle DBA Needs to Know about the Hadoop Ecosystem. Zohar Elkayam

Spotfire Data Science with Hadoop Using Spotfire Data Science to Operationalize Data Science in the Age of Big Data

ANNUAL REPORT Visit us at project.eu Supported by. Mission

PDA Trainer Certification Process V 1.0

Blended Learning Outline: Developer Training for Apache Spark and Hadoop (180404a)

Guide to Yammer For Education Filed. There have been an increase in use of the social network applications in education system

In the recent past, the World Wide Web has been witnessing an. explosive growth. All the leading web search engines, namely, Google,


Innovatus Technologies

Abstract. The Challenges. ESG Lab Review InterSystems IRIS Data Platform: A Unified, Efficient Data Platform for Fast Business Insight

Prototyping Data Intensive Apps: TrendingTopics.org

Big Data Analytics. Description:

CSC 261/461 Database Systems. Fall 2017 MW 12:30 pm 1:45 pm CSB 601

Data Storage Infrastructure at Facebook

ResponseTek Listening Platform Release Notes Q4 16

Design and Implement of Bigdata Analysis Systems

Databases 2 (VU) ( / )

New Approaches to Big Data Processing and Analytics

Classifying Twitter Data in Multiple Classes Based On Sentiment Class Labels

Facebook data extraction using R & process in Data Lake

Scalable Tools - Part I Introduction to Scalable Tools

A REVIEW PAPER ON BIG DATA ANALYTICS

Introduction to Hadoop and MapReduce

Introduction to Data Science Day 2

Comparative Analysis of Range Aggregate Queries In Big Data Environment

SERVICE OPERATION ITIL INTERMEDIATE TRAINING & CERTIFICATION

Transcription:

Election Analysis and Prediction Using Big Data Analytics Omkar Sawant, Chintaman Taral, Roopak Garbhe Students, Department Of Information Technology Vidyalankar Institute of Technology, Mumbai, India omkar.sawant6@gmail.com Deepali Nayak Assistant Professor, Department. Of Information Technology Vidyalankar Institute of Technology, Mumbai, India deepali.nayak@vit.edu.in Abstract The aim of this paper is to propose an alternate way to conduct elections by highlighting the fundamental loopholes present in current system. The current election process considers the number of votes polled by a candidate as the sole parameter to choose the winner. Our proposal aims at highlighting the discrepancy between the most deserving candidate to win the election and the candidate who is predicted to win on the basis of his or her popularity. We will be choosing the deserved candidate by taking into account parameters such as the criminal records, educational qualifications, past social work, previous term record etc. We want the election process to not be a contest of popularity only. We aim to bring about a change in the election process to make it better and unprejudiced. Keywords-Election Analysis, Prediction, Hadoop Framework, Pig Programming, Text mining, Big Data Analytics. ***** I. INTRODUCTION The current Election system that is being followed only considers the number of votes gained by the candidates to decide the winner ignoring all other vital aspects. Also, a significant section of the population chooses not to vote as they believe that the entire system is flawed. As a result, wrong candidates get selected for high & important positions in Government offices and other organizations. We believe that votes given to a candidate just on the basis of the spurious claims made during their Election campaigns without knowing his or her qualifications, previous records and eligibility are not justifiable Hence, we came up with an altogether innovative idea that exploits the power of Hadoop for an in-depth analysis of elections for overcoming the flaws present in the current system. Our proposed system for conducting elections is designed to increase people s participation in elections and to make it more stringent and impeccable. A. Proposed System This project will provide a compact and lightweight distributed task execution framework that has been divided into two stages: 1: Analysis stage 2: Prediction stage 107

B. Analysis stage The detailed description of the analysis stage is as follows: The cardinal goal of analysis stage is to determine the deserving candidate to win the election from the given list of candidates. The candidates will be judged according to following parameters: Educational Qualifications Criminal records Past social work Previous term record Speaking, motivational skills & personality Social status and popularity To use the above parameters to judge the candidates it is important to give each parameter appropriate weightage. To do this we have performed analysis on two huge datasets, 1: Common people s opinion about these parameters. This data has been collected by conducting physical surveys as well as using Google Forms. Now, after analyzing this dataset, we have given a preliminary weightage to each of the parameters. 2: Data about the candidates who have or who are serving their term. This dataset contains various fields like tenure, number of useful projects done for benefit of people, track record, consistency, and data about the above mentioned parameters for each candidate etc. We conducted physical surveys and collected maximum data possible; however, as the political data is confidential, we were forced to make our own dataset. Hence, we formed our own data set by coding which consisted of data about thousands of candidates serving their term. Hence, after analyzing this dataset, we got a brief idea about the importance each of the parameter and a secondary 108

weightage was given to each of them irrespective of the weightage given in the first step. The final weightage given to parameters is the average of both of them. This has drastically enhanced the accuracy of analysis as we are using public opinion as well as actual data. After we found out the final weightage about the parameters, we accepted data about the candidates presently contesting elections through a user interface and found out the deserving person among them by using the parameters. A score was generated for each candidate depending on his qualifications on each of the parameter and the result about the deserving candidate was stored in the database. C. Prediction stage The detailed description of the prediction stage is as follows: Data about the candidates (or their party) will be extracted out from social media [1]. E.g.: Twitter, Facebook, etc. Based on analysis of this data a candidate who has more chances of winning the election irrespective of whether he is deserves to win will be predicted. This phase is very similar to the exit polls that are conducted during any major elections. Figure 1. Flow diagram of Analysis phase Figure 2. Flow diagram of Prediction phase the candidate for people s benefit, clear track record and 109 II. ALGORITHM/PSEUDO-CODE Step 1: Conduct surveys and gather information about people s opinions about the parameters. Step 2: Perform pre-processing activities & bring the data into suitable form for transferring it to Hadoop ecosystem. Step 3: Using Sqoop tool, import the dataset containing people s views into Hive tables. Step 4: Make appropriate set of positive, negative and neutral keywords that will be used for performing Text mining on the dataset. Step 5: Using pig script, perform Text mining on the dataset and evaluate a preliminary weightage to each parameter. Step 6: Store this result into Hadoop Distributed File System (HDFS) for future use. Step 7: Collect data about the politicians currently serving or has served their term. Step 8: Perform data cleaning & integration activities and add required fields useful for analysis. Step 9: Import this data set using Sqoop into Hive tables. Step 10: Using fields like total number of projects done by

overall satisfaction score find out the top 10% successful hardware [2]. It also provided various tools like Hive, Pig, politicians. Step 11: Use Pig programming to analyze the parameters of the top candidates to again give a secondary weightage to the parameters. Step 12: Store this result into HDFS. Step 13: Integrate the two weightages and give a final weightage to each of the parameter. Step 14: Accept data about the candidates now contesting elections. Step 15: Using the weightage, judge the candidates on the basis of the parameters and generate score for each candidate. Step 16: The candidate with the highest score is the deserving candidate. Step 17: Direct this output into a.txt file and add necessary explanations. Step 18: Extract data from social media and other sources about the candidates contesting elections to have a brief idea about their popularity. Step 19: Extract this data into Hive tables. Step 20: Using pig programming, Batch processing and the collected data generate polls to predict the winner among those by creating graphs and charts. Step 21: Direct this output to the same.txt file. Step 22: The.txt file will show the difference between the deserved candidate & the candidate predicted to win thus highlighting the flaws of current system. Sqoop that were perfectly suitable for our project. B. Apache Pig Apache Pig is a high-level platform for creating programs that run on Apache Hadoop [3]. As mentioned earlier, in the analysis phase, there is a need to analyze people s opinions about the parameters as well as actual data about the politicians in order to give those parameters a proper weightage. This we have achieved by writing Pig scripts. For scrutinizing people s opinions we have used Sentiment Analysis. Sentiment analysis is the process of using text analytics to mine various sources of data for opinions. Also for prediction phase, pig scripts & batch processing is used to generate polls. For real time & dynamic prediction analysis the Pig Scripts can be replaced by Apache Spark. C. Apache Hive The Apache Hive provides data warehouse software which facilitates reading, writing and managing large datasets residing in distributed storage using SQL. The data that we collected by conducting physical surveys and online has been imported into Hive tables using Sqoop. Sqoop is a tool which is used to transfer data between Hadoop Distributed File System (HDFS) and external databases and data warehouses. We have used it as a repository to store the datasets Also, we have fired ad hoc queries on the data stored in Hive tables in order to create statistics, graphs and charts. III. TECHNOLOGIES & CONCEPTS USED IV. CONCLUSIONS A. Hadoop Ecosystem As we started planning on our project, we realized that to do this kind of analysis, there will be a need for performing analysis on a huge amount of data. Moreover, the data will be available in an unstructured form as there is no predefined approach of collecting people s opinion in a uniform way. Hence, as we explored our options and decided to use Apache Hadoop as it is flexible and there is no need to pre-process data unlike Relational Databases. Also, Hadoop offers a convenient way to handle unstructured data and is an open source software platform for distributed storage and distributed processing of very large data sets on computer clusters built from commodity The discrepancy and the loopholes in the present system are clearly highlighted with the help of this innovative project. After examining the result, it was clear that the current election process is unable to select the most suitable candidate of the available people. Hence, pertaining to the current quality of politicians, our proposed system should be implemented by the Government as early as possible by making required changes. This type of analytics based approach can be used for any type of election. This approach also highlights the power and scope of data analytics as a whole in the Computer Science field. 110

REFERENCES [2] Dhruba Borthakur, "The Hadoop Distributed File System: [1] A Review of Sentiment Analysis in Twitter Data Using Hadoop (paper published in International Journal of Database Theory and Application)J. Clerk Maxwell, A Treatise on Electricity and Magnetism, 3rd ed., vol. 2. Oxford: Clarendon, 1892, pp.68 73. Architecture and Design", The apache Software Foundation, 2007.K. Elissa, Title of paper if known, unpublished. [3] Processing performance on Apache Pig, Apache Hive and MySQL cluster. ( Date Added to IEEE Xplore: 15 January 2015 INSPEC Accession Number: 14853176 ) 111