Data Exploration. The table below lists each of the files available for analysis with a short description of what is found in each one.

Size: px

Start display at page:

Download "Data Exploration. The table below lists each of the files available for analysis with a short description of what is found in each one."

Paulina Snow
5 years ago
Views:

1 Data Exploration Data Set Overview The table below lists each of the files available for analysis with a short description of what is found in each one. File Name Description Fields ad-clicks.csv This csv file corresponds to ERD table Adclicks. A line is added to this file when a player clicks on an advertisement in the Flamingo app timestamp: when the click occurred txid: a unique id (within adclicks.log) for the click usersessionid: the id of the user session for the user who made the click teamid: the current team id of the user who made the click userid: the user id of the user who made the click adid: the id of the ad clicked on adcategory: the category/type of ad clicked on buy-clicks.csv This csv file corresponds to InAppPurchases ERD table. A line is added to this file when a player makes an in-app purchase in the Flamingo app. timestamp: when the purchase was made txid: a unique id (within buyclicks.log) for the purchase usersessionid: the id of the user session for the user who made the purchase team: the current team id of the user who made the purchase userid: the user id of the user who made the purchase buyid: the id of the item purchased price: the price of the item purchased users.csv This csv file corresponds to User ERD table. This file contains a line for each user playing the game. timestamp: when user first played the game id: the user id assigned to the user nick: the nickname chosen by the user twitter: the twitter handle of the

2 user dob: the date of birth of the user country: the two-letter country code where the user lives team.csv This csv file corresponds to Team ERD table. This file contains a line for each team terminated in the game. teamid: the id of the team name: the name of the team teamcreationtime: the timestamp when the team was created teamendtime: the timestamp when the last member left the team strength: a measure of team strength, roughly corresponding to the success of a team currentlevel: the current level of the team teamassignments.csv This csv file corresponds to TeamAssignment ERD table. A line is added to this file each time a user joins a team. A user can be in at most a single team at a time. time: when the user joined the team team: the id of the team userid: the id of the user assignmentid: a unique id for this assignment level-events.csv This csv file corresponds to LevelEvent ERD table. time: when the event occurred. eventid: a unique id for the event teamid: the id of the team level: the level started or completed eventtype: the type of event, either start or end user-session.csv This csv file corresponds to User_Sessions ERD table. Each line in this file describes a user session, which denotes when a user starts and stops playing the game. Additionally, when a team goes to the next level in the game, the session is ended for each user in the team and a new one started. timestamp: a timestamp denoting when the event occurred usersessionid: a unique id for the session userid: the current user's ID teamid: the current user's team assignmentid: the team assignment id for the user to the team sessiontype: whether the event is the start or end of a

3 session teamlevel: the level of the team during this session platformtype: the type of platform of the user during this session game-clicks.csv This csv file corresponds to GameClicks ERD table. A line is added to this file each time a user performs a click in the game. time: when the click occurred clickid: a unique id for the click userid: the id of the user performing the click usersessionid: the id of the session of the user when the click is performed ishit: denotes if the click was on a flamingo (value is 1) or missed the flamingo (value is 0) teamid: the id of the team of the user teamlevel: the current level of the team of the user Aggregation Amount spent buying items buyid Sum [source="buy-clicks.csv" stats sum(price) by buyid] # Unique items available to be purchased 30 [source="ad-clicks.csv" stats count by adid stats count]

4 A histogram showing how many times each item is purchased: A histogram showing how much money was made from each item:

5 Filtering A histogram showing total amount of money spent by the top ten users (ranked by how much money they spent). source="buy-clicks.csv" stats sum(price) by userid sort -sum(price) head 10 The following table shows the user id, platform, and hit-ratio percentage for the top three buying users: Rank User Id Platform Hit-Ratio (%) iphone 61/526*100=11.6% 2 12 iphone 92/704*100=13.1% iphone 76/524*100=14.5% Sample query used to find Platform: source="user-session.csv" userid=2229 stats count by platformtype Sample query used to find Hit-Ratio(%) source="game-clicks.csv" userid=2229 stats sum(ishit) count

6 Analysis of combined_data.csv Data Preparation Sample Selection Item Amount # of Samples 4619 # of Samples with Purchases 1411 Attribute Creation A new categorical attribute was created to enable analysis of players as broken into 2 categories (HighRollers and PennyPinchers). A screenshot of the attribute follows: <Fill In: Describe the design of your attribute in 1-3 sentences.> Used Rule Engine Knime node. Logic as below: $avg_price$ > 5.0 => "HighRollers" $avg_price$ <= 5.0 => "PennyPinchers" The creation of this new categorical attribute was necessary because we want to aggregate the users into small number of groups. Avg_price being continuous value does not enable this.

7 Attribute Selection The following attributes were filtered from the dataset for the following reasons: Attribute Avg_price UserId usersessionid Rationale for Filtering The type of customer (PennyPinchers or HighRollers) is derived from avg_price. This attribute cannot give any prediction about customer s spending. This is a random number. It is realistically impossible to have any pattern between userid to customer s spending. Same as above. This does not influence customer s spending.

8 Data Partitioning and Modeling The data was partitioned into train and test datasets. The train data set was used to create the decision tree model. The trained model was then applied to the test dataset. This is important because train data enables to define the model and this model can be verified on test data. The accuracy on test data will provide an reasonable measure of performance for actual predictions. When partitioning the data using sampling, it is important to set the random seed because it ensures partitioning node produces same partition each time it is executed thereby producing the same result. A screenshot of the resulting decision tree can be seen below:

Evaluation A screenshot of the confusion matrix can be seen below: As seen in the screenshot above, the overall accuracy of the model is 88.

9 Evaluation A screenshot of the confusion matrix can be seen below: As seen in the screenshot above, the overall accuracy of the model is % <Fill In: Write one sentence for each of the values of the confusion matrix indicating what has been correctly or incorrectly predicted.> 308 PennyPinchers and 192 HighRollers have been correctly classified. However, 38 HighRollers have been wrongly classified as PennyPinchers. Similarly, 27 PennyPinchers have been wrongly classified as HighRollers. Overall accuracy is good (88.496%).

10 Analysis Conclusions The final KNIME workflow is shown below: What makes a HighRoller vs. a PennyPincher? iphone platformtypes users are most likely to be HighRollers whereas linux, windows and android are most likely to to PennyPinchers. Mac platform users are heterogeneous (high entropy) with more tilted towards PennyPinchers. Specific Recommendations to Increase Revenue 1. Eglence should firmly focus on iphone platform as users of this are more likely to purchase. 2. Incentivise non-iphone users to join iphone platform.

11 Clustering Analysis Attribute Selection Summary: The idea of creating this cluster is to find whether any pattern exists between user game clicks, ad-clicks and revenue generated by a user. Attribute Total game clicks Total ad-clicks Revenue per user userid of user Rationale for Selection Total number of clicks by each user (both hit and misses) Total number of clicks on ads. Amount of money earned from each user s spending on their purchase of apps key source of revenue This is the key to bind the data

Training Data Set Creation The training data set used for this analysis is shown below (first 10 lines): Dimensions of the training data set (rows x

12 Training Data Set Creation The training data set used for this analysis is shown below (first 10 lines): Dimensions of the training data set (rows x columns) : 543 x 3 # of clusters created: 3 (checked with a few others it looks like the data is more meaningful when we create two different clusters).

Cluster Centers Cluster # Cluster Center 1 (gameclicks, adclicks, revenue)=(2412.26, 32.26, 41.89) 2 (gameclicks, adclicks, revenue)=(959,52, 36.37, 46.03) 3 (gameclicks, adclicks, revenue)=(363.

13 Cluster Centers Cluster # Cluster Center 1 (gameclicks, adclicks, revenue)=( , 32.26, 41.89) 2 (gameclicks, adclicks, revenue)=(959,52, 36.37, 46.03) 3 (gameclicks, adclicks, revenue)=(363.42, 25.17, 35.36) These clusters can be differentiated from each other as follows: Cluster 1 is different from the others in that its gameclicks is much higher compared to others. Cluster 2 is different from the others it has the highest revenue per user. It also has highest adclicks per user. Cluster 3 has least gameclicks and least ad clicks per user. Cluster 2 is desired type for user best revenue fuelled by high ad-clicks and gameclicks (underpinned by in-game purchase).

14 Recommended Actions Action Recommended Negotiate better rates for ad clicks with advertiser Do further study to understand background and rationale of the users who generate highest revenue. Rationale for the action Ad clicks is a key source for revenue. Eglence should negotiate better deals with advertisers. Users in cluster #2 are the preferred for Eglence Inc. They are moderate users (gameclicks) but generate the best revenue. Eglence should do further analytics to understand who these users are (i.e. whether they belong to any country, age group, platform).

15 Graph Analytics Modelling Chat Data using a Graph Data Model Chat data created by Eglence Inc. users is represented as a graph data model. There are 4 different types of nodes: User, Team, TeamChatSession and ChatItem. An instance of TeamChatSession is created when a User creates a new chat session for a team. ChatItem is individual instance of chat. Below are different relationships: CreateSession from User to TeamChatSession node CreateChat from User to ChatItem node ResponseTo ChatItem to ChatItem node (during a chat conversation) OwnedBy TeamChatSession to Team PartOf ChatItem to TeamChatSession Mentioned ChatItem to User Joins User to TeamChatSession Leaves User to TeamChatSession Creation of the Graph Database for Chats Describe the steps you took for creating the graph database. As part of these steps i) Write the schema of the 6 CSV files Schema of the 6 CSV files: chat_create_team_chat.csv UserId, TeamId, TeamChatSessionId, CreatedTimestamp chat_join_team_chat.csv UserId, TeamChatSessionId and JoinsEdgeTimestamp chat_leave_team_chat.csv UserId, TeamChatSessionId and LeaveTimestamp chat_leave_team_chat.csv UserId, TeamChatSessionId, ChatItemId, CreateChatTimestamp chat_mention_team_chat.csv ChatItemId, UserId, MentionedTimestamp chat_respond_team_chat.csv ChatItemId, ResponseChatItemId, ResponseTimestamp ii) Explain the loading process and include a sample LOAD command Loading process

Extracted the csv files and placed in the directory location Neo4j referred to. In my case, Neo4j refers to a specific location in D: drive. Used the sample scripts and made necessary amendments.

16 Extracted the csv files and placed in the directory location Neo4j referred to. In my case, Neo4j refers to a specific location in D: drive. Used the sample scripts and made necessary amendments. Below is a sample Load command: LOAD CSV FROM "file:///d:/chat_item_team_chat.csv" AS row MERGE (u:user {id: toint(row[0])}) MERGE (c:teamchatsession {id: toint(row[1])}) MERGE (i:chatitem {id: toint(row[2])}) MERGE (i)-[:partof{timestamp: row[3]}]->(c) MERGE (u)-[:createchat{timestamp: row[3]}]->(i) iii) Present a screenshot of some part of the graph you have generated. The graphs must include clearly visible examples of most node and edge types. Below are two acceptable examples. The first example is a rendered in the default Neo4j distribution, the second has had some nodes moved to expose the edges more clearly. Both include examples of most node and edge types. Screen of graph generated. Used match ()-[r]->() return r limit 70 command and adjusted scale to 0.8. match ()-[r]->() return r limit 25

Finding the longest conversation chain and its participants Report the results including the length of the conversation (path length) and how many unique users were part of the

To find the number of chats in longest conversation chain: match p=(a)-[:responseto*]->(c) return size(nodes(p)) order by size(nodes(p)) desc limit 1 Above returns 10.

18 Finding the longest conversation chain and its participants Report the results including the length of the conversation (path length) and how many unique users were part of the conversation chain. Describe your steps. Write the query that produces the correct answer. To find the number of chats in longest conversation chain: match p=(a)-[:responseto*]->(c) return size(nodes(p)) order by size(nodes(p)) desc limit 1 Above returns 10. To find unique users of the conversation chain: match p=(i:chatitem)-[:responseto*]->(j:chatitem) where size(nodes(p)) = 10 with p match (u:user)-[:createchat]->(i:chatitem) where i in nodes(p) return distinct u 5 unique users in the chain. Ids are 1192, 1978, 1153, 853, 1514.

19 Analyzing the relationship between top 10 chattiest users and top 10 chattiest teams Describe your steps from Question 2. In the process, create the following two tables. You only need to include the top 3 for each table. Identify and report whether any of the chattiest users were part of any of the chattiest teams. Chattiest Users match q=(a:user)-[r:createchat]->(b:chatitem) return a.id,count(a) order by count(a) desc limit 10 Users Number of Chats Chattiest Teams match p=(c:chatitem)-[r1:partof]->(s:teamchatsession)-[r2:ownedby]->(t:team) return t.id, count(c) order by count(c) desc limit 10 Teams Number of Chats Finally, present your answer, i.e. whether or not any of the chattiest users are part of any of the chattiest teams. match (u:user)-[:createchat]->(i:chatitem)-[:partof]->(s:teamchatsession)-[:ownedby]->(t) where t.id in [82, 185, 112, 18, 194, 129, 52, 136, 146, 81] and u.id in [394, 2067, 209, 1087, 554, 516, 1627, 999, 668, 461] return distinct u.id, t.id Above query returns user 999 is the only one of the top 10 chattiest users belong to one of the top 10 chattiest teams (52). None among the top 10 chattiest users belongs to the top 10 chattiest teams.

20 How Active Are Groups of Users? Describe your steps for performing this analysis. Be as clear, concise, and as brief as possible. Finally, report the top 3 most active users in the table below. Created the edge InteractsWith: Match (u1:user)-[:createchat]->(i:chatitem)-[:mentioned]->(u2:user) create (u1)- [:InteractsWith]->(u2) Match (u1:user)-[:createchat]->(i:chatitem)-[:responseto]->(i2:chatitem)<-[:createchat]- (u2:user) create (u1)-[:interactswith]->(u2) Deleted self loops: Match (u1)-[r:interactswith]->(u1) delete r To get number of nodes and the number of edges between their neighbour s nodes, and to get clustering coefficients by using this formula: (number of neighbour nodes)/[number of edges*(number of edges 1)]: Match (u)-[cc:createchat]->() With u,count(u) as chatcount order by chatcount desc limit 10 With collect(u.id) as users match(u:user)-[:interactswith]-(u1:user) where u.id in users With u,count(distinct u1) as k, collect(distinct u1.id) as neighbours Match (n:user)-[:interactswith]-(n1:user),(m:user)-[:interactswith]-(m1:user) where n.id=u.id and n1.id in neighbours and m1.id in neighbours With distinct u,n1,m1,k With u,sum(case when (n1)--(m1) then 1 else 0 end) as edges, k return u.id,edges,k,k*(k-1), tofloat(edges)/tofloat(k*(k-1)) as coefficient order by coefficient desc Above query returns the list below: u.id edges k k*(k-1) coefficient

21 Most Active Users (based on Cluster Coefficients) User ID Coefficient

Acquiring, Exploring and Preparing the Data

Technical Appendix Catch the Pink Flamingo Analysis Produced by: Prabhat Tripathi Acquiring, Exploring and Preparing the Data Data Exploration Data Set Overview The table below lists each of the files