Facebook data extraction using R & process in Data Lake

An approach to understand how retail companies can perform Facebook data mining to analyze customers' behavioral patterns
By Gautam Goswami, OnlineGuwahati Team

Abstract

Without visibility on social media, it is extremely tough for brands to interact with the grassroots influencers who shape the future. Nowadays, almost 94 per cent of buying decisions are influenced by the exponential growth of user participation on social media (mainly on Facebook). Facebook plays a critical role in increasing brand awareness. Businesses need to transform digital-native customers into brand advocates, and this can only be done if the relationship has been nurtured. The key for brands is to encourage consumers to endorse the brand and play a real part in the business. Facebook data mining is becoming a major factor in taking accurate business decisions through the analysis of user posts, comments, likes, shares, etc., as well as sentiment analysis on the business page. To analyze data, we need a proper mechanism to extract it from the business page. This e-book describes in detail how we can perform Facebook data mining by effectively combining R, a programming language for statistical analysis, with the Hadoop Distributed File System (HDFS) using an ELT (Extraction, Loading and Transformation) approach.

Part 1 Facebook application creation and R installation

Before creating a Facebook application, we need a fair knowledge of the Facebook platform. Facebook has developed a set of application programming interfaces (APIs) and tools for third-party developers so that they can create applications that leverage and interact with core Facebook features. This entire set of APIs and tools is collectively denoted the Facebook platform. The following high-level components are consolidated in the Facebook platform:

- Graph API can be utilized by application developers to read data from and write data to Facebook. The Graph API provides a view of the Facebook social graph and the relationships between the entities in it.
- Authentication allows applications to interact with Facebook and lets users sign on to various applications through Facebook via a PC, mobile phone or desktop app.
- Social plug-ins, like the "Like" button, allow application developers to give their users a social experience through Facebook without gaining access to Facebook users' information.

- The Open Graph protocol can be utilized by application developers to integrate their pages with Facebook.
- IFrames can be used to create applications that are accessed via Facebook login but are hosted separately from Facebook.
- Microformats allow Facebook users to move details such as events to their own calendars or to mapping applications.

Facebook application creation

A Facebook application is a small application developed for Facebook profiles. As a first step we need to have an account with Facebook. Log in to developers.facebook.com if you already have an account. After a successful login, the dropdown box at the top right corner changes to "My Apps". On expanding it, we get the option to add a new application, as shown in the picture below.

Now we can create a new application by adding a few inputs. Application categories are already built internally by Facebook. Based on our category and display-name selection, an ID is generated and assigned to the application. Since we are going to extract page data, the category selection should be "Apps for the Page" in the dropdown box. After clicking on "Create App ID", a new App ID is generated with options to execute multiple operations on it. To get all the mandatory information about the application, we need to navigate to the "Dashboard" that appears on the left side below the application name. The App ID and App Secret are mandatory and sensitive pieces of information and hence need to be copied into a safe file to avoid disclosure to others. Now we proceed to the "Settings" panel to add a platform, which should be "Website".

We are almost done with the application creation, except for the Website Site URL input. Leave the browser open without signing out from developers.facebook.com. The next step is R installation and RStudio setup.

R installation and RStudio setup

R is a free software environment for statistical computing and graphics. It compiles and runs on a wide variety of platforms, viz. UNIX, Windows and macOS. R provides effective data handling and storage facilities along with a suite of operators for calculations on arrays, in particular matrices. It is also a simple and effective programming language which includes conditionals, loops, user-defined recursive functions, and input and output facilities. To download R, we can choose a preferred CRAN mirror from the list at https://cran.r-project.org/mirrors.html. If we choose Windows as the operating system, the installer appears as R-3.3.3-win.exe after download. The picture below shows how the R console looks after a successful installation.

RStudio is an integrated development environment (IDE) for R that makes R easier to use. RStudio is a consolidated platform that includes a code editor and debugging and visualization tools. Besides these, a console, syntax highlighting with support for direct code execution, tools for plotting, and history and workspace management are key components of RStudio. RStudio is available in open-source and commercial editions and runs on the desktop (Windows, Mac, and Linux) or in a browser connected to RStudio Server or RStudio Server Pro (Debian/Ubuntu, RedHat/CentOS, and SUSE Linux). As a beginner/learner, you can go for the open-source edition and get it from https://www.rstudio.com/products/rstudio/download/. Note: without a proper installation of R, RStudio doesn't work, because RStudio runs on top of the R environment. We can compare this with the Eclipse IDE, where a Java runtime must be installed first in order to run it.

The picture above shows how RStudio looks in a Windows environment. Even though several basic packages come by default with RStudio, we need to install the Rfacebook package from CRAN separately. Additional packages can be installed by navigating to the top menu "Tools" -> "Install Packages". Here we need "Rfacebook". After a successful installation, it's mandatory to check whether all the required packages are available. They are visible in the bottom-right pane, and here is the list of packages.
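The same installation can be performed from the R console instead of the menu. A minimal sketch (the package name is as published on CRAN; Rfacebook pulls in its dependencies such as httr automatically):

```r
# Install the Rfacebook package from CRAN (run once);
# dependencies such as httr are installed automatically.
install.packages("Rfacebook")

# Load the package into the current session and confirm it is attached.
library(Rfacebook)
"Rfacebook" %in% loadedNamespaces()   # should print TRUE
```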

Finally we can see a confirmation message in the RStudio editor.

Part 2 Generate and assign an OAuth token to the Facebook R session

R internally makes calls to the Facebook API to get all the relevant information, but every call to the Facebook API has to be authenticated first. That means that, before releasing any data from its repository, the Facebook API verifies whether its methods are being invoked from a trusted source. Once we generate an OAuth token using the App ID and App Secret explained in Part 1, R internally uses it during the session to invoke various methods of the Facebook API. The following are the steps for OAuth token creation; the token will subsequently be assigned to the R session. Start RStudio and load the library "Rfacebook".

Load the library in the RStudio editor with > library("Rfacebook"). Pass the values of the App ID and App Secret to the fbOAuth() method as parameters, and store the returned value in a local variable, say og_oauth. The values of the App ID and App Secret were already noted in Part 1. After entering the command, a site URL is generated as http://localhost:1410, which we need to add to the Facebook application created in Part 1 before clicking the "Save changes" button on the Facebook application page.
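In code, the token generation described above looks roughly like this (the App ID and App Secret values are hypothetical placeholders; fbOAuth() is the Rfacebook function that drives the browser-based login):

```r
library(Rfacebook)

# Placeholder credentials copied from the app's Dashboard in Part 1
# (both values below are hypothetical).
app_id     <- "123456789012345"
app_secret <- "0123456789abcdef"

# fbOAuth() prints the http://localhost:1410 redirect URL that must be
# added to the app's settings, then waits for a key press before
# opening the browser to complete authentication.
og_oauth <- fbOAuth(app_id, app_secret)
```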

After that, press any key or Enter in the RStudio editor. Immediately a new browser tab opens with a Facebook page asking for the password again. Once it is entered, we see the message "Authentication complete. Please close this page and return to R." on the page. We can save the variable "og_oauth" to a file so that it can be reused as the token in functions in future sessions, using the save command in the RStudio editor. For testing, we can extract information about the logged-in user, viz. the name of the user, total likes, etc. To do that, we need to invoke the getUsers() method, passing the User Token of the created Facebook application. To access the User Token, we need to navigate to the "Tools & Support" menu on developers.facebook.com where the application was created.
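A sketch of saving and reusing the token, and of the test query described above (the file name is illustrative; og_oauth is the token created in the previous step, and getUsers() with "me" returns the profile of the authenticated user):

```r
library(Rfacebook)

# Persist the token (created earlier with fbOAuth) so future sessions
# can skip the browser flow.
save(og_oauth, file = "og_oauth")

# In a later session, restore it instead of calling fbOAuth() again.
load("og_oauth")

# Test call: fetch the profile of the logged-in user.
myself <- getUsers("me", token = og_oauth)
myself$name   # prints the name of the logged-in user
```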

Copy the User Token and add it as a parameter in the RStudio editor. Here "myself" is a local variable/object returned by the getUsers() method; if we type "myself$name" in the editor, the name of the logged-in user is displayed. As the connection between the Facebook API and R is now established, we are ready to extract the data from any company's Facebook page for analysis.

Part 3 Extraction of posts/data from a public Facebook page

Here we consider an electronic commerce company's pages in order to precisely analyze customers' feedback, sentiments and other information. In real-world business, a Facebook page gives a company a convenient way to keep a group of customers or other interested individuals informed about products and services and to share information. Besides, posts and comments in groups on a Facebook page help companies understand customers'/buyers' expectations. To attract customers from rival companies and to retain their own customers, companies need to enhance and improve products and services in innovative ways by referring to the criticism and complaints posted on rival companies' Facebook pages.

Once those posts are extracted and analyzed, it becomes very easy to decide on the additional steps to be performed, namely those which the rival companies have overlooked. For example, Snapdeal's public page is https://www.facebook.com/snapdeal (page ID 471784335392). The data we have extracted from Facebook is unstructured: it can't be processed in a traditional RDBMS (databases like Oracle, MySQL, IBM's DB2, etc.), nor can it be mined there to understand customers' behavior. In a similar way we can consolidate huge volumes of Facebook data from companies' pages. Besides this, Twitter streaming data is another source from which we can analyze customers' sentiments, views, etc. The data produced by Twitter streaming is semi-structured, namely JSON (JavaScript Object Notation), which is lightweight compared to XML.

Part 4 Process data by adopting ELT in a distributed manner on a Data Lake

The basic concept of a Data Lake is that we can use the ELT (Extraction, Loading and then Transformation) approach as opposed to the traditional ETL (Extraction, Transformation and then Loading) process. The ETL process applies to traditional data warehousing systems, where a structured data format (rows and columns) is followed. The main characteristic of a Data Lake is that data is not classified when it gets persisted, and consequently the data preparation, cleansing and subsequent transformation activities are eliminated at load time. In a Data Warehouse, these activities consume the major part of the whole process. Storing Facebook comments, reviews, shares, likes, etc. in their rawest form enables the business analysts of retail organizations with multichannel e-commerce as a separate vertical to find answers in the data to questions they do not yet know. By leveraging HDFS (Hadoop Distributed File System), we can build a Data Lake to store data of any format for processing and analysis. Data can be loaded into the lake directly without transformation; transformation can be performed later on demand. A few components are available with which we can ingest data (independent of format) into the lake, such as Flume, Sqoop, and efficient file-transfer mechanisms.

(Diagram: an HDFS cluster consisting of multiple nodes)
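A sketch of the extract-and-load steps described above, using Rfacebook's getPage() and the hdfs CLI (the page name, file name and HDFS directory are illustrative; og_oauth is the token from Part 2, and the `hdfs dfs` calls assume a configured Hadoop client on the same machine):

```r
library(Rfacebook)

# Pull the most recent 500 posts from a public page. getPage() returns a
# data frame with columns such as message, created_time, type,
# likes_count, comments_count and shares_count.
page_posts <- getPage("snapdeal", token = og_oauth, n = 500)

# Persist the raw extract locally, untransformed (ELT: load first,
# transform later on demand).
write.csv(page_posts, file = "snapdeal_posts.csv", row.names = FALSE)

# Ingest the raw file into the Data Lake on HDFS
# (the target directory is illustrative).
system("hdfs dfs -mkdir -p /datalake/facebook/raw")
system("hdfs dfs -put -f snapdeal_posts.csv /datalake/facebook/raw/")
```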

Since we are not going to analyze runtime streaming data, we need to follow a batch-processing approach where the data first gets persisted in the lake. As a next step, all the junk, invalid and noisy data should be removed, and that can be implemented using a chain of multiple MapReduce jobs, also denoted a data pipeline. Data quality is a major challenge in a data lake implementation, even though the lake makes it possible to store all the data, ask complex and radically bigger business questions, and find hidden patterns and relationships in the data. After the data has been crunched successfully in an optimized way, responsibility passes to the data scientists to perform ad-hoc queries and to build advanced models iteratively at any time. By adopting machine learning techniques, we can start to explore the data as well as build predictive models for customers' interest in buying products. If our objective is to build a model that can predict the products which a customer will buy, we should proceed with a supervised approach. An unsupervised approach like clustering or PCA will likely mix the information that customers are interested in with other, unrelated products and generate a worse predictor. By leveraging supervised machine learning techniques over all available accumulated historical information, we can predict trends and the effects of seasonality for planning and forecasting. On top of that, natural language processing (NLP) can be effectively utilized to analyze consumer sentiments and apply them as inputs to the planning and forecasting process, for example to organize products in the "Related Products" or "You may also like" channel of an e-commerce portal. The retailer can gain insight into its customers as well as visitors by performing forecasting and planning through machine learning on the extracted Facebook data, which in turn helps to improve performance and profitability.
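As an illustration of the supervised approach mentioned above, here is a minimal sketch on synthetic data (the engagement features and the choice of logistic regression are assumptions for illustration; in practice the features would be engineered from the cleansed posts in the lake):

```r
# Synthetic training data: engagement features per customer and whether
# the customer bought the product (all values are made up for illustration).
set.seed(42)
n <- 200
train <- data.frame(
  likes_count    = rpois(n, 5),
  comments_count = rpois(n, 2),
  shares_count   = rpois(n, 1)
)
# Simulate purchases loosely correlated with engagement.
train$bought <- rbinom(n, 1, plogis(-2 + 0.3 * train$likes_count))

# Supervised model: logistic regression predicting purchase probability.
model <- glm(bought ~ likes_count + comments_count + shares_count,
             family = binomial, data = train)

# Score a new (hypothetical) customer.
new_customer <- data.frame(likes_count = 8, comments_count = 3,
                           shares_count = 2)
predict(model, new_customer, type = "response")  # probability of buying
```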
By applying automated complex tasks such as NLP to Facebook's massive unstructured data sets, actionable intelligence can be developed to achieve a better understanding of customers. In addition, the information obtained from the data sets through machine learning can be leveraged to channel business and operational reorganization and improve business decisions. In this e-book, we have only discussed the approach to analyzing Facebook data in a Data Lake using machine learning, so it can't be taken as a cookbook for implementation in real scenarios. A large number of technical steps, with complete descriptions and configuration, are involved which we have not covered here. This e-book is presented to give a flavor of mining Facebook's unstructured data using R and Hadoop (a popular Big Data processing framework).