Facebook data extraction using R & process in Data Lake

An approach to understand how retail companies can perform Facebook data mining to analyze customers' behavioral patterns
By Gautam Goswami, OnlineGuwahati Team

Abstract

Without visibility on social media, it is extremely tough for brands to interact with the grassroots influencers who shape the future. Nowadays, almost 94 per cent of buying decisions are influenced by the exponential growth of user participation on social media (mainly on Facebook). Facebook plays a critical role in increasing brand awareness. Businesses need to transform digital-native customers into brand advocates, and this can only be done if the relationship has been nurtured. The key for brands is to encourage consumers to endorse the brand and play a real part in the business. Facebook data mining is becoming a major factor in taking accurate business decisions through the analysis of user posts, comments, likes, shares, etc., as well as sentiment analysis on the business page. To analyze data, we need a proper mechanism to extract it from the business page. This e-book describes in detail how we can perform Facebook data mining by effectively combining R, a programming language for statistical analysis, with the Hadoop Distributed File System (HDFS) using an ELT (Extraction, Loading and Transformation) approach.

Part 1 Facebook application creation and R installation

Before creating a Facebook application, we need a fair knowledge of the Facebook platform. Facebook has developed a set of application programming interfaces (APIs) and tools for third-party developers so that they can create applications that leverage and interact with core Facebook features. This entire set of APIs and tools is collectively denoted the Facebook platform. The following high-level components are consolidated in the Facebook platform:

- Graph API can be utilized by application developers to read data from and write data to Facebook. The Graph API provides a view of the Facebook social graph and the relationships between the entities in it.
- Authentication allows applications to interact with Facebook and lets users sign on to various applications through Facebook via a PC, mobile phone or desktop app.
- Social plug-ins, like the "Like" button, allow application developers to give their users a social experience through Facebook without gaining access to Facebook users' information.

- The Open Graph protocol can be utilized by application developers to integrate their pages with Facebook.
- IFrames can be used to create applications that are accessed via Facebook login but are hosted separately from Facebook.
- Microformats allow Facebook users to move details such as events to their own calendars or to mapping applications.

Facebook application creation

A Facebook application is a small application developed for Facebook profiles. As a first step we need to have an account with Facebook. Log in to developers.facebook.com if you already have an account. After a successful login, the dropdown box at the top right corner changes to "My Apps". On expanding it, we get the option to add a new application, as shown in the picture below.

Now we can create a new application by adding a few inputs. Application categories are already built internally by Facebook. Based on our category and display-name selection, an ID is generated and assigned to the application. Since we are going to extract page data, the category selection should be "Apps for the Page" in the dropdown box. After clicking on "Create App ID", a new App ID is generated with options to execute multiple operations on it. To get all the mandatory information about the application, we need to navigate to the "Dashboard" that appears on the left side below the application name. The App ID and App Secret are mandatory and sensitive pieces of information and hence need to be copied into a safe file to avoid disclosure to others. Now we proceed to the "Settings" panel to add a platform, which should be "Website".

We are almost done with the application creation, except for the Website Site URL input. Leave the browser open without signing out from developers.facebook.com. The next step is R installation and RStudio setup.

R installation and RStudio setup

R is a free software environment for statistical computing and graphics. It compiles and runs on a wide variety of platforms, viz. UNIX, Windows and macOS. R provides effective data handling and storage facilities along with a suite of operators for calculations on arrays, in particular matrices. It is also a simple and effective programming language which includes conditionals, loops, user-defined recursive functions, and input and output facilities. To download R, we can choose a preferred CRAN mirror from the list at https://cran.r-project.org/mirrors.html. If we choose Windows as the operating system, the installer appears as R-3.3.3-win.exe after download. The picture below shows how the R console looks after a successful installation.

RStudio is an integrated development environment (IDE) for R that makes R easier to use. RStudio is a consolidated platform that includes a code editor and debugging and visualization tools. Besides these, a console, syntax highlighting with support for direct code execution, tools for plotting, and history and workspace management are key components of RStudio. RStudio is available in open-source and commercial editions and runs on the desktop (Windows, Mac, and Linux) or in a browser connected to RStudio Server or RStudio Server Pro (Debian/Ubuntu, RedHat/CentOS, and SUSE Linux). As a beginner/learner, you can go for the open-source edition and get it from https://www.rstudio.com/products/rstudio/download/. Note: without a proper installation of R, RStudio doesn't work, because RStudio runs on top of the R environment. We can compare this with the Eclipse IDE, where a Java runtime must be installed first in order to run it.

The picture above shows how RStudio looks in a Windows environment. Even though several basic packages come by default with RStudio, we need to install the Rfacebook package from CRAN separately. Additional packages can be installed by navigating to the top menu "Tools" -> "Install Packages". Here we need "Rfacebook". After a successful installation, it's mandatory to check whether all the required packages are available. They are visible in the bottom-right pane, and here is the list of packages.
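The same installation can be performed from the R console instead of the menu. A minimal sketch (the package name is as published on CRAN; Rfacebook pulls in its dependencies such as httr automatically):

```r
# Install the Rfacebook package from CRAN (run once);
# dependencies such as httr are installed automatically.
install.packages("Rfacebook")

# Load the package into the current session and confirm it is attached.
library(Rfacebook)
"Rfacebook" %in% loadedNamespaces()   # should print TRUE
```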

Finally we can see a confirmation message in the RStudio editor.

Part 2 Generate and assign an OAuth token to the Facebook R session

R internally makes calls to the Facebook API to get all the relevant information, but every call to the Facebook API has to be authenticated first. That means that, before releasing any data from its repository, the Facebook API verifies whether its methods are being invoked from a trusted source. Once we generate an OAuth token using the App ID and App Secret explained in Part 1, R internally uses it during the session to invoke various methods of the Facebook API. The following are the steps for OAuth token creation; the token will subsequently be assigned to the R session. Start RStudio and load the library "Rfacebook".

Load the library in the RStudio editor with > library("Rfacebook"). Pass the values of the App ID and App Secret to the fbOAuth() method as parameters, and store the returned value in a local variable, say og_oauth. The values of the App ID and App Secret were already noted in Part 1. After entering the command, a site URL is generated as http://localhost:1410, which we need to add to the Facebook application created in Part 1 before clicking the "Save changes" button on the Facebook application page.
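In code, the token generation described above looks roughly like this (the App ID and App Secret values are hypothetical placeholders; fbOAuth() is the Rfacebook function that drives the browser-based login):

```r
library(Rfacebook)

# Placeholder credentials copied from the app's Dashboard in Part 1
# (both values below are hypothetical).
app_id     <- "123456789012345"
app_secret <- "0123456789abcdef"

# fbOAuth() prints the http://localhost:1410 redirect URL that must be
# added to the app's settings, then waits for a key press before
# opening the browser to complete authentication.
og_oauth <- fbOAuth(app_id, app_secret)
```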

After that, press any key or Enter in the RStudio editor. Immediately a new browser tab opens with a Facebook page asking for the password again. Once it is entered, we see the message "Authentication complete. Please close this page and return to R." on the page. We can save the variable "og_oauth" to a file so that it can be reused as the token in functions in future sessions, using the save command in the RStudio editor. For testing, we can extract information about the logged-in user, viz. the name of the user, total likes, etc. To do that, we need to invoke the getUsers() method, passing the User Token of the created Facebook application. To access the User Token, we need to navigate to the "Tools & Support" menu on developers.facebook.com where the application was created.
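A sketch of saving and reusing the token, and of the test query described above (the file name is illustrative; og_oauth is the token created in the previous step, and getUsers() with "me" returns the profile of the authenticated user):

```r
library(Rfacebook)

# Persist the token (created earlier with fbOAuth) so future sessions
# can skip the browser flow.
save(og_oauth, file = "og_oauth")

# In a later session, restore it instead of calling fbOAuth() again.
load("og_oauth")

# Test call: fetch the profile of the logged-in user.
myself <- getUsers("me", token = og_oauth)
myself$name   # prints the name of the logged-in user
```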

Copy the User Token and add it as a parameter in the RStudio editor. Here "myself" is a local variable/object returned by the getUsers() method; if we type "myself$name" in the editor, the name of the logged-in user is displayed. As the connection between the Facebook API and R is now established, we are ready to extract the data from any company's Facebook page for analysis.

Part 3 Extraction of posts/data from a public Facebook page

Here we consider an electronic commerce company's pages in order to precisely analyze customers' feedback, sentiments and other information. In real-world business, a Facebook page gives a company a convenient way to keep a group of customers or other interested individuals informed about products and services and to share information. Besides, posts and comments in groups on a Facebook page help companies understand customers'/buyers' expectations. To attract customers from rival companies and to retain their own customers, companies need to enhance and improve products and services in innovative ways by referring to the criticism and complaints posted on rival companies' Facebook pages.

Once those posts are extracted and analyzed, it becomes very easy to decide on the additional steps to be performed, namely those which the rival companies have overlooked. For example, Snapdeal's public page is https://www.facebook.com/snapdeal (page ID 471784335392). The data we have extracted from Facebook is unstructured: it can't be processed in a traditional RDBMS (databases like Oracle, MySQL, IBM's DB2, etc.), nor can it be mined there to understand customers' behavior. In a similar way we can consolidate huge volumes of Facebook data from companies' pages. Besides this, Twitter streaming data is another source from which we can analyze customers' sentiments, views, etc. The data produced by Twitter streaming is semi-structured, namely JSON (JavaScript Object Notation), which is lightweight compared to XML.

Part 4 Process data by adopting ELT in a distributed manner on a Data Lake

The basic concept of a Data Lake is that we can use the ELT (Extraction, Loading and then Transformation) approach as opposed to the traditional ETL (Extraction, Transformation and then Loading) process. The ETL process applies to traditional data warehousing systems, where a structured data format (rows and columns) is followed. The main characteristic of a Data Lake is that data is not classified when it gets persisted, and consequently the data preparation, cleansing and subsequent transformation activities are eliminated at load time. In a Data Warehouse, these activities consume the major part of the whole process. Storing Facebook comments, reviews, shares, likes, etc. in their rawest form enables the business analysts of retail organizations with multichannel e-commerce as a separate vertical to find answers in the data to questions they do not yet know. By leveraging HDFS (Hadoop Distributed File System), we can build a Data Lake to store data of any format for processing and analysis. Data can be loaded into the lake directly without transformation; transformation can be performed later on demand. A few components are available with which we can ingest data (independent of format) into the lake, such as Flume, Sqoop, and efficient file-transfer mechanisms.

(Diagram: an HDFS cluster consisting of multiple nodes)
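A sketch of the extract-and-load steps described above, using Rfacebook's getPage() and the hdfs CLI (the page name, file name and HDFS directory are illustrative; og_oauth is the token from Part 2, and the `hdfs dfs` calls assume a configured Hadoop client on the same machine):

```r
library(Rfacebook)

# Pull the most recent 500 posts from a public page. getPage() returns a
# data frame with columns such as message, created_time, type,
# likes_count, comments_count and shares_count.
page_posts <- getPage("snapdeal", token = og_oauth, n = 500)

# Persist the raw extract locally, untransformed (ELT: load first,
# transform later on demand).
write.csv(page_posts, file = "snapdeal_posts.csv", row.names = FALSE)

# Ingest the raw file into the Data Lake on HDFS
# (the target directory is illustrative).
system("hdfs dfs -mkdir -p /datalake/facebook/raw")
system("hdfs dfs -put -f snapdeal_posts.csv /datalake/facebook/raw/")
```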

Since we are not going to analyze runtime streaming data, we need to follow a batch-processing approach where the data first gets persisted in the lake. As a next step, all the junk, invalid and noisy data should be removed, and that can be implemented using a chain of multiple MapReduce jobs, also denoted a data pipeline. Data quality is a major challenge in a data lake implementation, even though the lake makes it possible to store all the data, ask complex and radically bigger business questions, and find hidden patterns and relationships in the data. After the data has been crunched successfully in an optimized way, responsibility passes to the data scientists to perform ad-hoc queries and to build advanced models iteratively at any time. By adopting machine learning techniques, we can start to explore the data as well as build predictive models for customers' interest in buying products. If our objective is to build a model that can predict the products which a customer will buy, we should proceed with a supervised approach. An unsupervised approach like clustering or PCA will likely mix the information that customers are interested in with other, unrelated products and generate a worse predictor. By leveraging supervised machine learning techniques over all available accumulated historical information, we can predict trends and the effects of seasonality for planning and forecasting. On top of that, natural language processing (NLP) can be effectively utilized to analyze consumer sentiments and apply them as inputs to the planning and forecasting process, for example to organize products in the "Related Products" or "You may also like" channel of an e-commerce portal. The retailer can gain insight into its customers as well as visitors by performing forecasting and planning through machine learning on the extracted Facebook data, which in turn helps to improve performance and profitability.
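As an illustration of the supervised approach mentioned above, here is a minimal sketch on synthetic data (the engagement features and the choice of logistic regression are assumptions for illustration; in practice the features would be engineered from the cleansed posts in the lake):

```r
# Synthetic training data: engagement features per customer and whether
# the customer bought the product (all values are made up for illustration).
set.seed(42)
n <- 200
train <- data.frame(
  likes_count    = rpois(n, 5),
  comments_count = rpois(n, 2),
  shares_count   = rpois(n, 1)
)
# Simulate purchases loosely correlated with engagement.
train$bought <- rbinom(n, 1, plogis(-2 + 0.3 * train$likes_count))

# Supervised model: logistic regression predicting purchase probability.
model <- glm(bought ~ likes_count + comments_count + shares_count,
             family = binomial, data = train)

# Score a new (hypothetical) customer.
new_customer <- data.frame(likes_count = 8, comments_count = 3,
                           shares_count = 2)
predict(model, new_customer, type = "response")  # probability of buying
```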
By applying automated complex tasks such as NLP to Facebook's massive unstructured data sets, actionable intelligence can be developed to achieve a better understanding of customers. In addition, the information obtained from the data sets through machine learning can be leveraged to channel business and operational reorganization and improve business decisions. In this e-book, we have only discussed the approach to analyzing Facebook data in a Data Lake using machine learning, so it can't be taken as a cookbook for implementation in real scenarios. A large number of technical steps, with complete descriptions and configuration, are involved which we have not covered here. This e-book is presented to give a flavor of mining Facebook's unstructured data using R and Hadoop (a popular Big Data processing framework).