
Data Exploration

Data Set Overview
The table below lists each of the files available for analysis with a short description of what is found in each one.

ad-clicks.csv
This CSV file corresponds to the AdClicks ERD table. A line is added to this file when a player clicks on an advertisement in the Flamingo app.
Fields: timestamp (when the click occurred), txid (a unique id within adclicks.log for the click), usersessionid (the id of the user session for the user who made the click), teamid (the current team id of the user who made the click), userid (the user id of the user who made the click), adid (the id of the ad clicked on), adcategory (the category/type of ad clicked on).

buy-clicks.csv
This CSV file corresponds to the InAppPurchases ERD table. A line is added to this file when a player makes an in-app purchase in the Flamingo app.
Fields: timestamp (when the purchase was made), txid (a unique id within buyclicks.log for the purchase), usersessionid (the id of the user session for the user who made the purchase), team (the current team id of the user who made the purchase), userid (the user id of the user who made the purchase), buyid (the id of the item purchased), price (the price of the item purchased).

users.csv
This CSV file corresponds to the User ERD table. It contains a line for each user playing the game.
Fields: timestamp (when the user first played the game), id (the user id assigned to the user), nick (the nickname chosen by the user), twitter (the twitter handle of the user), dob (the date of birth of the user), country (the two-letter country code where the user lives).

team.csv
This CSV file corresponds to the Team ERD table. It contains a line for each team terminated in the game.
Fields: teamid (the id of the team), name (the name of the team), teamcreationtime (the timestamp when the team was created), teamendtime (the timestamp when the last member left the team), strength (a measure of team strength, roughly corresponding to the success of the team), currentlevel (the current level of the team).

teamassignments.csv
This CSV file corresponds to the TeamAssignment ERD table. A line is added to this file each time a user joins a team. A user can be in at most one team at a time.
Fields: time (when the user joined the team), team (the id of the team), userid (the id of the user), assignmentid (a unique id for this assignment).

level-events.csv
This CSV file corresponds to the LevelEvent ERD table.
Fields: time (when the event occurred), eventid (a unique id for the event), teamid (the id of the team), level (the level started or completed), eventtype (the type of event, either start or end).

user-session.csv
This CSV file corresponds to the User_Sessions ERD table. Each line in this file describes a user session, which denotes when a user starts and stops playing the game. Additionally, when a team goes to the next level in the game, the session is ended for each user in the team and a new one is started.
Fields: timestamp (when the event occurred), usersessionid (a unique id for the session), userid (the current user's id), teamid (the current user's team), assignmentid (the team assignment id for the user to the team), sessiontype (whether the event is the start or end of a session), teamlevel (the level of the team during this session), platformtype (the type of platform of the user during this session).

game-clicks.csv
This CSV file corresponds to the GameClicks ERD table. A line is added to this file each time a user performs a click in the game.
Fields: time (when the click occurred), clickid (a unique id for the click), userid (the id of the user performing the click), usersessionid (the id of the session of the user when the click is performed), ishit (1 if the click was on a flamingo, 0 if it missed), teamid (the id of the team of the user), teamlevel (the current level of the team of the user).

Aggregation

Amount spent buying items:

buyid  Sum
0      592.0
1      538.0
2      2142.0
3      1685.0
4      4250.0
5      12200.0

Query: source="buy-clicks.csv" | stats sum(price) by buyid

Number of unique items available to be purchased: 30
Query: source="ad-clicks.csv" | stats count by adid | stats count
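For reference, a minimal pandas sketch of the same two aggregations (the report itself used Splunk-style queries; the file and column names are taken from the schemas above):

import pandas as pd

buy_clicks = pd.read_csv("buy-clicks.csv")
ad_clicks = pd.read_csv("ad-clicks.csv")

# Amount spent buying items (stats sum(price) by buyid)
print(buy_clicks.groupby("buyid")["price"].sum())

# Number of distinct ad ids clicked on (stats count by adid | stats count)
print(ad_clicks["adid"].nunique())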

A histogram showing how many times each item is purchased:
A histogram showing how much money was made from each item:
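A minimal matplotlib sketch of the two charts above, assuming buy-clicks.csv with the buyid and price columns listed earlier:

import pandas as pd
import matplotlib.pyplot as plt

buy_clicks = pd.read_csv("buy-clicks.csv")

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
# how many times each item is purchased
buy_clicks["buyid"].value_counts().sort_index().plot.bar(ax=ax1, title="Purchases per item")
# how much money was made from each item
buy_clicks.groupby("buyid")["price"].sum().plot.bar(ax=ax2, title="Revenue per item")
plt.tight_layout()
plt.show()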

Filtering

A histogram showing the total amount of money spent by the top ten users (ranked by how much money they spent):

Query: source="buy-clicks.csv" | stats sum(price) by userid | sort -sum(price) | head 10

The following table shows the user id, platform, and hit-ratio percentage for the top three buying users:

Rank  User Id  Platform  Hit-Ratio (%)
1     2229     iphone    61/526*100 = 11.6%
2     12       iphone    92/704*100 = 13.1%
3     471      iphone    76/524*100 = 14.5%

Sample query used to find the platform:
source="user-session.csv" userid=2229 | stats count by platformtype

Sample query used to find the hit-ratio (%):
source="game-clicks.csv" userid=2229 | stats sum(ishit) count
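A pandas sketch of the same filtering steps, under the assumption that the column names match the schemas above (userid, price, ishit, platformtype):

import pandas as pd

buy_clicks = pd.read_csv("buy-clicks.csv")
game_clicks = pd.read_csv("game-clicks.csv")
sessions = pd.read_csv("user-session.csv")

# Top ten users by total amount spent
top_spenders = buy_clicks.groupby("userid")["price"].sum().nlargest(10)

# Hit-ratio (%) per user: hits / total game clicks * 100
hit_ratio = game_clicks.groupby("userid")["ishit"].agg(["sum", "count"])
hit_ratio["hit_ratio_pct"] = 100 * hit_ratio["sum"] / hit_ratio["count"]

# Platform(s) recorded for a given user, e.g. user 2229
platform = sessions.loc[sessions["userid"] == 2229, "platformtype"].unique()

print(top_spenders.head(3))
print(hit_ratio["hit_ratio_pct"].reindex(top_spenders.head(3).index))
print(platform)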

Analysis of combined_data.csv

Data Preparation

Sample Selection
Item                          Amount
# of Samples                  4619
# of Samples with Purchases   1411

Attribute Creation
A new categorical attribute was created to enable analysis of players in two categories (HighRollers and PennyPinchers). A screenshot of the attribute follows:
<Fill In: Describe the design of your attribute in 1-3 sentences.>
A KNIME Rule Engine node was used, with the following logic:
$avg_price$ > 5.0 => "HighRollers"
$avg_price$ <= 5.0 => "PennyPinchers"
The creation of this new categorical attribute was necessary because we want to aggregate the users into a small number of groups; avg_price, being a continuous value, does not enable this.
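The same rule expressed as a small pandas sketch (the report used a KNIME Rule Engine node; the file and the avg_price column name are assumed from the text above):

import pandas as pd

data = pd.read_csv("combined_data.csv")
# keep only the samples with purchases (avg_price present), as in the Sample Selection table
data = data.dropna(subset=["avg_price"])
data["buyer_type"] = data["avg_price"].apply(
    lambda p: "HighRollers" if p > 5.0 else "PennyPinchers"
)
print(data["buyer_type"].value_counts())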

Attribute Selection
The following attributes were filtered from the dataset for the following reasons:

Attribute       Rationale for Filtering
avg_price       The type of customer (PennyPinchers or HighRollers) is derived from avg_price.
userid          This attribute cannot give any prediction about a customer's spending. It is a random number; it is realistically impossible for there to be any pattern between userid and a customer's spending.
usersessionid   Same as above. This does not influence a customer's spending.

Data Partitioning and Modeling
The data was partitioned into train and test datasets. The train dataset was used to create the decision tree model, and the trained model was then applied to the test dataset. This is important because the training data is used to define the model, and the model can then be verified on the test data; the accuracy on the test data provides a reasonable measure of performance for actual predictions. When partitioning the data using sampling, it is important to set the random seed because it ensures the partitioning node produces the same partition each time it is executed, thereby producing the same result. A screenshot of the resulting decision tree can be seen below:
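A scikit-learn sketch of the same partition-and-train step (the report used KNIME Partitioning and Decision Tree Learner nodes; the identifier column names and the 80/20 split ratio are assumptions, and the fixed random_state plays the role of the random seed discussed above):

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

data = pd.read_csv("combined_data.csv").dropna(subset=["avg_price"])
data["buyer_type"] = data["avg_price"].apply(
    lambda p: "HighRollers" if p > 5.0 else "PennyPinchers")

# Drop the label source and the identifier columns, as in Attribute Selection above;
# one-hot encode any categorical columns such as the platform type
X = pd.get_dummies(data.drop(columns=["buyer_type", "avg_price", "userid", "usersessionid"]))
y = data["buyer_type"]

# Fixed random_state makes the partition reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=42, stratify=y)

model = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
print(accuracy_score(y_test, model.predict(X_test)))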

Evaluation
A screenshot of the confusion matrix can be seen below:
As seen in the screenshot above, the overall accuracy of the model is 88.496%.
<Fill In: Write one sentence for each of the values of the confusion matrix indicating what has been correctly or incorrectly predicted.>
308 PennyPinchers and 192 HighRollers have been correctly classified. However, 38 HighRollers have been wrongly classified as PennyPinchers, and 27 PennyPinchers have been wrongly classified as HighRollers. The overall accuracy is good (88.496%).
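A quick check that the reported accuracy follows from the four confusion-matrix counts above:

correct = 308 + 192      # correctly classified PennyPinchers + HighRollers
incorrect = 38 + 27      # HighRollers and PennyPinchers classified the wrong way round
print(correct / (correct + incorrect))   # 0.88496 -> the 88.496% reported above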

Analysis Conclusions
The final KNIME workflow is shown below:
What makes a HighRoller vs. a PennyPincher? Users on the iphone platform are most likely to be HighRollers, whereas linux, windows and android users are most likely to be PennyPinchers. Mac platform users are heterogeneous (high entropy), tilted more towards PennyPinchers.
Specific Recommendations to Increase Revenue
1. Eglence should focus firmly on the iphone platform, as its users are more likely to purchase.
2. Incentivise non-iphone users to move to the iphone platform.

Clustering Analysis

Attribute Selection
Summary: The idea of this clustering is to find whether any pattern exists between a user's game clicks, ad clicks and the revenue generated by that user.

Attribute          Rationale for Selection
Total game clicks  Total number of clicks by each user (both hits and misses).
Total ad-clicks    Total number of clicks on ads.
Revenue per user   Amount of money earned from each user's spending on in-app purchases; the key source of revenue.
userid of user     This is the key used to bind the data.

Training Data Set Creation
The training data set used for this analysis is shown below (first 10 lines):
Dimensions of the training data set (rows x columns): 543 x 3
# of clusters created: 3 (a few other values were checked; the data is most meaningful with three clusters).
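A scikit-learn sketch of the clustering step (the report used a KNIME k-Means node; the file name and column names for the 543 x 3 training frame are hypothetical, based on the Attribute Selection table above):

import pandas as pd
from sklearn.cluster import KMeans

train = pd.read_csv("cluster_training_data.csv")   # hypothetical file name
features = train[["total_game_clicks", "total_ad_clicks", "revenue"]]

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(features)
print(kmeans.cluster_centers_)   # compare with the cluster-centre table below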

Cluster Centers

Cluster #   Cluster Center (gameclicks, adclicks, revenue)
1           (2412.26, 32.26, 41.89)
2           (959.52, 36.37, 46.03)
3           (363.42, 25.17, 35.36)

These clusters can be differentiated from each other as follows: Cluster 1 differs from the others in that its gameclicks value is much higher. Cluster 2 differs in that it has the highest revenue per user; it also has the highest ad-clicks per user. Cluster 3 has the fewest game clicks and the fewest ad clicks per user. Cluster 2 is the desired type of user: the best revenue, fuelled by high ad-clicks and game clicks (underpinned by in-app purchases).

Recommended Actions

Action Recommended: Negotiate better rates for ad clicks with advertisers.
Rationale: Ad clicks are a key source of revenue, so Eglence should negotiate better deals with advertisers.

Action Recommended: Do a further study to understand the background and rationale of the users who generate the highest revenue.
Rationale: Users in cluster #2 are the preferred users for Eglence Inc.; they are moderate users (in game clicks) but generate the best revenue. Eglence should do further analytics to understand who these users are (i.e. whether they belong to a particular country, age group or platform).

Graph Analytics

Modelling Chat Data using a Graph Data Model
Chat data created by Eglence Inc. users is represented as a graph data model. There are 4 different types of nodes: User, Team, TeamChatSession and ChatItem. An instance of TeamChatSession is created when a User creates a new chat session for a team. A ChatItem is an individual chat message. The relationships are as follows:
CreateSession - from a User node to a TeamChatSession node
CreateChat - from a User node to a ChatItem node
ResponseTo - from a ChatItem node to a ChatItem node (during a chat conversation)
OwnedBy - from a TeamChatSession node to a Team node
PartOf - from a ChatItem node to a TeamChatSession node
Mentioned - from a ChatItem node to a User node
Joins - from a User node to a TeamChatSession node
Leaves - from a User node to a TeamChatSession node

Creation of the Graph Database for Chats
Describe the steps you took for creating the graph database. As part of these steps:
i) Write the schema of the 6 CSV files.
Schema of the 6 CSV files:
chat_create_team_chat.csv: UserId, TeamId, TeamChatSessionId, CreatedTimestamp
chat_join_team_chat.csv: UserId, TeamChatSessionId, JoinsEdgeTimestamp
chat_leave_team_chat.csv: UserId, TeamChatSessionId, LeaveTimestamp
chat_item_team_chat.csv: UserId, TeamChatSessionId, ChatItemId, CreateChatTimestamp
chat_mention_team_chat.csv: ChatItemId, UserId, MentionedTimestamp
chat_respond_team_chat.csv: ChatItemId, ResponseChatItemId, ResponseTimestamp
ii) Explain the loading process and include a sample LOAD command.
Loading process:

Extracted the CSV files and placed them in the directory location Neo4j refers to (in my case, Neo4j refers to a specific location on the D: drive). Used the sample scripts and made the necessary amendments. Below is a sample LOAD command (a sketch of running the same load through the official Python driver follows at the end of this subsection):

LOAD CSV FROM "file:///d:/chat_item_team_chat.csv" AS row
MERGE (u:user {id: toInt(row[0])})
MERGE (c:teamchatsession {id: toInt(row[1])})
MERGE (i:chatitem {id: toInt(row[2])})
MERGE (i)-[:partof {timestamp: row[3]}]->(c)
MERGE (u)-[:createchat {timestamp: row[3]}]->(i)

iii) Present a screenshot of some part of the graph you have generated. The graph must include clearly visible examples of most node and edge types. Below are two acceptable examples. The first example is rendered in the default Neo4j distribution; the second has had some nodes moved to expose the edges more clearly. Both include examples of most node and edge types.
Screenshots of the generated graph. The first used the command match ()-[r]->() return r limit 70 with the scale adjusted to 0.8; the second used match ()-[r]->() return r limit 25.
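For reference, a sketch of running the same LOAD command through the official Neo4j Python driver instead of the browser; the bolt URL and credentials are placeholders:

from neo4j import GraphDatabase

load_chat_items = """
LOAD CSV FROM "file:///d:/chat_item_team_chat.csv" AS row
MERGE (u:user {id: toInt(row[0])})
MERGE (c:teamchatsession {id: toInt(row[1])})
MERGE (i:chatitem {id: toInt(row[2])})
MERGE (i)-[:partof {timestamp: row[3]}]->(c)
MERGE (u)-[:createchat {timestamp: row[3]}]->(i)
"""

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
with driver.session() as session:
    session.run(load_chat_items)
driver.close()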

Finding the Longest Conversation Chain and its Participants
Report the results, including the length of the conversation (path length) and how many unique users were part of the conversation chain. Describe your steps. Write the query that produces the correct answer.

To find the number of chats in the longest conversation chain:
match p=(a)-[:responseto*]->(c) return size(nodes(p)) order by size(nodes(p)) desc limit 1
The above returns 10.

To find the unique users in that conversation chain:
match p=(i:chatitem)-[:responseto*]->(j:chatitem) where size(nodes(p)) = 10
with p match (u:user)-[:createchat]->(i:chatitem) where i in nodes(p)
return distinct u
There are 5 unique users in the chain; their ids are 1192, 1978, 1153, 853 and 1514.

Analyzing the Relationship Between the Top 10 Chattiest Users and the Top 10 Chattiest Teams
Describe your steps from Question 2. In the process, create the following two tables. You only need to include the top 3 for each table. Identify and report whether any of the chattiest users were part of any of the chattiest teams.

Chattiest Users
match q=(a:user)-[r:createchat]->(b:chatitem) return a.id, count(a) order by count(a) desc limit 10

User   Number of Chats
394    115
2067   111
209    109

Chattiest Teams
match p=(c:chatitem)-[r1:partof]->(s:teamchatsession)-[r2:ownedby]->(t:team) return t.id, count(c) order by count(c) desc limit 10

Team   Number of Chats
82     1324
185    1036
112    957

Finally, present your answer, i.e. whether or not any of the chattiest users are part of any of the chattiest teams.
match (u:user)-[:createchat]->(i:chatitem)-[:partof]->(s:teamchatsession)-[:ownedby]->(t)
where t.id in [82, 185, 112, 18, 194, 129, 52, 136, 146, 81]
and u.id in [394, 2067, 209, 1087, 554, 516, 1627, 999, 668, 461]
return distinct u.id, t.id
The above query returns that user 999 is the only one of the top 10 chattiest users who chatted in one of the top 10 chattiest teams (team 52); apart from this, none of the top 10 chattiest users belongs to any of the top 10 chattiest teams.

How Active Are Groups of Users?
Describe your steps for performing this analysis. Be as clear, concise, and as brief as possible. Finally, report the top 3 most active users in the table below.

Created the interactswith edges:
match (u1:user)-[:createchat]->(i:chatitem)-[:mentioned]->(u2:user) create (u1)-[:interactswith]->(u2)
match (u1:user)-[:createchat]->(i:chatitem)-[:responseto]->(i2:chatitem)<-[:createchat]-(u2:user) create (u1)-[:interactswith]->(u2)

Deleted self-loops:
match (u1)-[r:interactswith]->(u1) delete r

To get, for each of the 10 chattiest users, the number of neighbour nodes and the number of edges between those neighbours, and to compute the clustering coefficient as (number of edges between neighbour nodes) / [k * (k - 1)], where k is the number of neighbour nodes:
match (u)-[cc:createchat]->()
with u, count(u) as chatcount order by chatcount desc limit 10
with collect(u.id) as users
match (u:user)-[:interactswith]-(u1:user) where u.id in users
with u, count(distinct u1) as k, collect(distinct u1.id) as neighbours
match (n:user)-[:interactswith]-(n1:user), (m:user)-[:interactswith]-(m1:user)
where n.id = u.id and n1.id in neighbours and m1.id in neighbours
with distinct u, n1, m1, k
with u, sum(case when (n1)--(m1) then 1 else 0 end) as edges, k
return u.id, edges, k, k*(k-1), toFloat(edges)/toFloat(k*(k-1)) as coefficient
order by coefficient desc

The above query returns the list below:

u.id   edges  k   k*(k-1)  coefficient
461    6      3   6        1
394    12     4   12       1
209    40     7   42       0.9523809523809523
516    40     7   42       0.9523809523809523
554    38     7   42       0.9047619047619048
999    78     10  90       0.8666666666666667
1087   24     6   30       0.8
1627   44     8   56       0.7857142857142857
2067   44     8   56       0.7857142857142857
668    14     5   20       0.7
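For reference, a small self-contained Python sketch of the coefficient used here, edges between a user's neighbours divided by k * (k - 1), with k the number of distinct neighbours; the directed interactswith edges below are made up for illustration:

from itertools import permutations

# hypothetical directed "interactswith" edges: (from_user, to_user)
edges = {(1, 2), (2, 1), (1, 3), (3, 2), (2, 3), (4, 1)}

def clustering_coefficient(user):
    # neighbours = users connected to `user` in either direction
    neighbours = {b for a, b in edges if a == user} | {a for a, b in edges if b == user}
    k = len(neighbours)
    if k < 2:
        return 0.0
    # count the directed edges that exist between pairs of neighbours
    linked = sum(1 for a, b in permutations(neighbours, 2) if (a, b) in edges)
    return linked / (k * (k - 1))

print(clustering_coefficient(1))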

Most Active Users (based on Cluster Coefficients)

User ID   Coefficient
461       1
394       1
209       0.95