DATA ANALYTICS ON AMAZON PRODUCT REVIEW USING NOSQL HIVE AND MACHINE LEARNING ON SPARKS ON HADOOP FILE SYSTEM.

Emily Carter
6 years ago
1 DATA ANALYTICS ON AMAZON PRODUCT REVIEW USING NOSQL HIVE AND MACHINE LEARNING ON SPARKS ON HADOOP FILE SYSTEM. PRESENTED BY: DEVANG PATEL ( ) SONAL DESHMUKH ( )

2 INTRODUCTION: Online shopping grows more significant every day because a purchase takes just one click. Amazon is one such world-renowned e-commerce website. It was initially known for its huge collection of books and was later expanded to other items. The business is about making money, so customer satisfaction and opinion are an important part of e-commerce websites. This gave rise to user reviews. User reviews are customer opinions that help other customers decide whether to buy a product.

3 HOW TO GET THE AMAZON REVIEW DATASET?: We emailed the dataset's providers to request access to the Amazon review dataset, and they provided a link from which we could download it. The data was in JSON format.

4 SOFTWARE AND TOOLS:

5 REVIEW SAMPLE AND DATASET DESCRIPTION: Rating (1-5 stars); review text; summary; number of people who found the review helpful; product ID; reviewer ID.

6 CONTINUED The JSON dataset contains the following fields:
reviewerid: ID of the reviewer.
asin: ID of the product.
reviewername: Name of the reviewer.
helpful: Helpfulness rating of the review.
reviewtext: Text of the review.
overall: Rating of the product.
summary: Summary of the review.
unixreviewtime: Time of the review (Unix time).
reviewtime: Time of the review.

7 IMPLEMENTATION: The data was in JSON format and we were using Hive. One option was to convert the JSON to a CSV file, but we chose to use a JSON SerDe instead. We downloaded the JSONSerDe jar file (hive-serdes-1.0-snapshot.jar) and copied it into the hive/lib folder.

8 CONTINUED We uploaded the Music_Instruments.JSON file to HDFS. After uploading it, we loaded it into a Hive table.

9 CONTINUED Table: MI_table. Row format: JSONSerDe.class. Location: HDFS path where the JSON file is stored.

10 PROBLEMS FACED DURING TABLE CREATION IN HIVE SELECT * FROM MI_table; fetched all 10,261 rows, but the values were NULL. The NULL values appeared because the key names in the JSON contained capital letters. Solution: SERDEPROPERTIES ("case.insensitive" = "false")

11 PROBLEM WITH THE DATA FORMAT The metadata file of the Amazon reviews has its key-value pairs in single quotes, which is not valid JSON. We tried every type of JSON SerDe available, but none worked. Python's json.dumps function converted the data into correctly formatted JSON.
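The conversion described on this slide can be sketched roughly as follows. This is a minimal illustration, not the presenters' actual script: it assumes each line of the metadata file is a Python-style dict literal, and it uses ast.literal_eval (our choice, not named in the slides) to parse the line before json.dumps re-serializes it. The function names and file paths are hypothetical.

```python
import ast
import json


def python_literal_to_json(line: str) -> str:
    """Convert one single-quoted, dict-style record into a strict JSON string."""
    # Assumption: the record is a valid Python literal. ast.literal_eval only
    # parses literals, so it is safer than calling eval() on file contents.
    record = ast.literal_eval(line)
    # json.dumps emits double-quoted keys and strings, as JSON parsers expect.
    return json.dumps(record)


def convert_file(src_path: str, dst_path: str) -> None:
    """Rewrite a file of one-record-per-line literals as JSON lines."""
    # Both paths are hypothetical; point them at the real metadata file.
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            if line.strip():
                dst.write(python_literal_to_json(line.strip()) + "\n")
```

The converted file has one JSON object per line, which is the layout a Hive JSON SerDe typically reads.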

12 SELECT * DESERIALIZES

13 CAN WE USE THIS TABLE AT PRODUCTION LEVEL?:

14 MOST REVIEWED PRODUCT:

15 AVERAGE RATING OF MUSIC INSTRUMENTS:

16 AVERAGE RATINGS ON PRODUCTS

17 AVERAGE RATING OF 5 DIFFERENT AMAZON PRODUCT CATEGORIES: Automotive (4.18), Cellphones (4.12), Lawn_Garden (4.18), Musical Instruments (4.48), Pet_Supplies (4.22)

18 REVIEWS WERE POSITIVE OR NEGATIVE?:

24 WHICH YEAR PRODUCTS WERE REVIEWED MOST?
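The slides do not show how the per-year counts were produced (the deck uses Hive for its queries). As a rough sketch of the underlying arithmetic, the unixreviewtime field can be turned into a calendar year and counted. The function names here are hypothetical, and the timestamps are assumed to be seconds since the Unix epoch, interpreted in UTC.

```python
from collections import Counter
from datetime import datetime, timezone


def review_year(unix_review_time):
    """Convert a Unix timestamp in seconds to its calendar year in UTC."""
    return datetime.fromtimestamp(unix_review_time, tz=timezone.utc).year


def reviews_per_year(timestamps):
    """Count reviews per year and return (year, count) pairs, busiest first."""
    return Counter(review_year(ts) for ts in timestamps).most_common()
```

The first pair returned by reviews_per_year is the year with the most reviews.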

26 COST OF MOST REVIEWED (TOP 5) PRODUCTS

27 HDFS IN THE BROWSER :

28 TRUST AND HELPFULNESS IN AMAZON PRODUCT REVIEWS The helpful column contains values that look like [56, 63]. The first value is the number of helpful votes and the second is the total number of votes. From these we derived a helpfulness percentage and a binary column that states whether the review is helpful or not.
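A minimal sketch of deriving the two columns from a helpful value such as [56, 63] might look like this. The 0.5 cutoff for the binary column and the handling of reviews with zero votes are assumptions; the slides do not state either.

```python
import ast

HELPFUL_THRESHOLD = 0.5  # assumption: the slides do not state a cutoff


def helpfulness(helpful_field):
    """Return (ratio, is_helpful) for a value such as "[56, 63]" or [56, 63]."""
    if isinstance(helpful_field, str):
        # The field may arrive as text, e.g. when read from a CSV export.
        votes = ast.literal_eval(helpful_field)
    else:
        votes = helpful_field
    helpful_votes, total_votes = votes
    if total_votes == 0:
        # Assumption: with no votes there is no ratio, so treat as not helpful.
        return None, False
    ratio = helpful_votes / total_votes
    return ratio, ratio >= HELPFUL_THRESHOLD
```

Multiplying the ratio by 100 gives the percentage column. For [56, 63], 56 of the 63 voters found the review helpful, so it falls above the assumed cutoff.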

29 OUTPUT FILE AND TABLEAU DATA SOURCE.

30 Helpful Ratings and Distribution

31 PIPELINE MODEL OF SPARK'S MLLIB: Cylinders indicate DataFrames. The Tokenizer.transform() method splits the raw text documents into words, adding a new column with the words to the DataFrame. The HashingTF.transform() method converts the words column into feature vectors, adding a new column with those vectors to the DataFrame. Logistic regression is the machine learning algorithm.
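To make the first two pipeline stages concrete, here is a plain-Python sketch of what tokenizing and term-frequency hashing do conceptually. It is not Spark's implementation: the hash function (zlib.crc32) and the vector length of 16 are stand-ins chosen for illustration, and the logistic regression stage is left out.

```python
import zlib

NUM_FEATURES = 16  # hypothetical size, kept tiny so the vector is easy to read


def tokenize(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()


def hashing_tf(words, num_features=NUM_FEATURES):
    """Map words to a fixed-length term-frequency vector via the hashing trick."""
    vector = [0] * num_features
    for word in words:
        # crc32 is a stand-in chosen for determinism; it is not Spark's hash.
        index = zlib.crc32(word.encode("utf-8")) % num_features
        vector[index] += 1
    return vector


words = tokenize("Great guitar great price")
features = hashing_tf(words)
```

Each review becomes a vector of the same length regardless of its vocabulary, which is what lets a classifier such as logistic regression train on raw text. Different words can collide in the same slot, so a larger vector length reduces collisions.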

32 PROBLEMS FACED IN RUNNING THE SPARK MACHINE LEARNING PROGRAM Creating a DataFrame for the Amazon data. We downloaded sklearn, but PySpark has its own classification library, MLlib. The execution threw an error stating that it requires NumPy 1.4 or higher. We solved it by correcting a bug in MLlib's __init__.py file.

33 SPARK EXECUTION: BIN/PYSPARK SHELL COMMAND Start Spark on top of Hadoop.

34 TRAINING SET AND TEST SET RESULTS
