Acquiring, Exploring and Preparing the Data

Save this PDF as:

Size: px
Start display at page:

Download "Acquiring, Exploring and Preparing the Data"


1 Technical Appendix Catch the Pink Flamingo Analysis Produced by: Prabhat Tripathi Acquiring, Exploring and Preparing the Data Data Exploration Data Set Overview The table below lists each of the files available for analysis with a short description of what is found in each one. File Name Description Fields ad-clicks.csv Contain records of all clicks on ads by players Timestamp: the date-time when the event (click) occurred txid: a unique id within this file usersessionid: the id of the user session where this click occurred teamid: id of the team of the player who clicked on the ad adcategory: the category of the ad clicked buy-clicks.csv All in-app purchase entries Timestamp: the time when the purchase occurred txid: a unique if within this data set usersessionid: the id of the user session where this purchase occurred buyid: the id of the item bought userid: user that purchased the item price: price of the item purchased users.csv Contains all unique users in the game Timestamp: the date-time when each user first played the game (when the record was created) id: unique id given to each user nick: nick name chosen by the user

2 twitter: twitter handle of the user dob: date of birth country: two letter country code team.csv All teams terminated in the game teamid: unique id assigned to each team name: name of the team teamcreationtime: when team record was created teamendtime: timestamp when the last member left the team currentlevel: current level reached by the team team-assignments.csv Player-team mapping table time: when the user joined the team team: team id userid: user id Assignmentid: a unique id assigned to each user-team mapping level-events.csv user-session.csv Logs of teams completing game levels Each line in this file describes a user session, which denotes when a user starts and stops playing the game. Additionally, when a team goes to the next level in the game, the session is ended for each user in the team and a new one started. time: when a level start or end occurred eventid: a uniqueid assigned to each record teamid: team id level: level started or completed eventtype: start or end usersessionid: a unique id for the session. assignmentid: the team assignment id for the user to the team. starttimestamp: a timestamp denoting when the session started. endtimestamp: a timestamp denoting when the session ended. team_level: the level of the team during this session. platformtype: the type of platform of the user during this session.

3 game-clicks.csv Contains all clicks by users time: when the click occurred. clickid: a unique id for the click. userid: the id of the user performing the click. usersessionid: the id of the session of the user when the click is performed. ishit: denotes if the click was on a flamingo (value is 1) or missed the flamingo (value is 0) Aggregation Amount spent buying items # Unique items available to be purchased 6 A histogram showing how many times each item is purchased: A histogram showing how much money was made from each item:

4 Filtering A histogram showing total amount of money spent by the top ten users (ranked by how much money they spent). The following table shows the user id, platform, and hit-ratio percentage for the top three buying users: Rank User Id Platform Hit-Ratio (%) iphone %

5 2 12 iphone % iphone %

6 Data Classification Analysis Data Preparation Analysis of combined_data.csv Sample Selection Item Amount # of Samples 4619 # of Samples with Purchases 1411 Attribute Creation A new categorical attribute was created to enable analysis of players as broken into 2 categories (HighRollers and PennyPinchers). A screenshot of the attribute follows: <Fill In: Describe the design of your attribute in 1-3 sentences.> Here price_categories attribute is added as numeric binner data manimulation node. This attribute partitions rows into two bins HighRollers (AvgPrice > 5.0) and PennyPichers (AvgPrice <= 5.0). The new attribute, price_categories is appended at the end.

7 The creation of this new categorical attribute was necessary because <Fill in 1-2 sentences>. We are creating a model to predict some characteristics of highroller users. In order to create this model and learn this behavior using train data-set, we would need to create this new attribute. Attribute Selection The following attributes were filtered from the dataset for the following reasons: Attribute userid usersessionid Avg(price) Rationale for Filtering This id attribute should not have any impact on the prediction model. This Id attribute (column) should not have any influence on whether a user is highroller or pennypincher. We have derived attribute (price-category) from this attribute. IT is therefore no longer needed for analysis. Data Partitioning and Modeling The data was partitioned into train and test datasets. The train data set was used to create the decision tree model. The trained model was then applied to the test dataset. This is important because we would want to test our model on data that was not used to train it. When partitioning the data using sampling, it is important to set the random seed because it ensures that you will get the same partitions every time you execute this node. This is important to get reproducible results. A screenshot of the resulting decision tree can be seen below:

8 Evaluation A screenshot of the confusion matrix can be seen below:

9 Confusion matrix provides comparison between actual and predicted values of the test attribute (price-categories in this case). Out of the test data, which is 40% of the total 1411 or ~565 rows, we see that our model predicted: PennyPinchers were correctly predicted but PennyPinchers were incorrectly predicted as HighRollers HighRollers were correctly predicted but HighRollers were incorrectly predicted as PennyPinchers Analysis Conclusions The final KNIME workflow is shown below: What makes a HighRoller vs. a PennyPincher? From the analysis, it appears that platformtype predicts HighRoller vs PennyPincher. iphone users are more likely to be HighRollers. Specific Recommendations to Increase Revenue 1. Promote game more to iphone users (ios platform) 2. There are still android high-rollers so do not stop android platform development just the focus and marketing energy should be on iphone platform.

10 Yes, the analysis clearly identifies that iphone users are HighRollers and other platformtype users are PennyPinchers. Yes, and at least one is related to marketing to iphone users. Few HighRollers are seen outside of iphone users. This maybe due to lack of useful application or hardness for use from extra-iphone-platforms. We should inspect the application or interface problem for extra-iphone-users.

11 Clustering Analysis Attribute Selection Training Data Set Creation The training data set used for this analysis is shown below (first 5 lines): Dimensions of the training data set (rows x columns) : 531 x 4 # of clusters created: 2 Cluster Centers These clusters can be differentiated from each other as follows: First number (field1) is each array refers to number of sessions, second number (field 2) is the hit ratio whereas the third number (field 3) is the revenue per user.

12 Therefore: compare first number in two clusters to find our how users in two clusters differ in terms of their game engagements? compare second number to find out how users differ in terms of game expertise levels in the two clusters? compare the third number to find out how much revenue users in two clusters generate? Cluster 1 is different from the others in that In the first cluster (cluster 1) players play the game more often (almost 50% more) but they generate revenue more than 5 times. The hit ratio in two clusters appears to be similar and therefore does not have much differentiation. Cluster 2 is different from the others in that Players in clusters are less engaged and generate less revenue compared to players in cluster 1. Recommended Actions We are assuming that Eglence Inc. gets paid for showing ads and for hosting in-app purchase items.

13 Graph Analytics Analysis Modeling Chat Data using a Graph Data Model (Describe the graph model for chats in a few sentences. Try to be clear and complete.) ChatItem CreateChat PartOf ResponseTo User Joins/Leaves TeamChatSession Joins/Leaves User CreatesSession CreatesSession OwnedBy PartOf Team ChatItem Men=oned The above model shows the nodes and relationships on the Chat-Data graph model. In the following description, nodes are marked bold. User creates/joins/leaves TeamChatSessions Every TeamChatSessions are owned by a Team User creates ChatItems which are always part of an existing TeamChatSession a ChatItem can be in ResponseTo another ChatItem and may Mention a User

14 Creation of the Graph Database for Chats Describe the steps you took for creating the graph database. As part of these steps i) Write the schema of the 6 CSV files 1) File: chat_create_team_chat.csv Description: A line is added to this file when a player creates a new chat with their team. User -> CreatesSession -> TeamChatSession Example: userid, teamid, TeamChatSessionID, timestamp 559,48,6288, ,15,6289, ,68,6290, ) File: chat_item_team_chat.csv Description: User creates a ChatItem which as a part of an existing TeamChatSession User -> CreateChat -> ChatItem ChatItem -> PartOF -> ChatSessionId Example: userid, teamid, timestamp 1956,6299, ,6296, ,6290,6316 3) File: chat_join_team_chat.csv Creates an edge labeled "Joins" from User to TeamChatSession. The columns are the User id, TeamChatSession id and the timestamp of the Joins edge.

15 User -> Joins -> TeamChatSession Example: userid, TeamChatSessionID, teamstamp 559,6288, ,6289, ,6290, ) File: chat_leave_team_chat.csv User -> Leaves -> TeamChatSession Creates an edge labeled "Leaves" from User to TeamChatSession. The columns are the User id, TeamChatSession id and the timestamp of the Leaves edge. Example: userid, chatid, timestamp 1244,6821, ,6838, ,6777, ) File: chat_mention_team_chat.csv ChatItem -> Mentioned -> User Each row represent a chat item that mentioned a user Example: ChatItem, userid, timestamp 6349, ,2491

16 6371,104 6) File: chat_respond_team_chat.csv A line is added to this file when a player responds to a chat post. ChatItem -> ResponseTo -> ChatItem Example: chatid1, chatid2,timestamp 6326,6305, ,6326, ,6366,54567 ii) Explain the loading process and include a sample LOAD command Loading process includes: - Put all.csv files into the neo4j database folder (when you launch neo4j it shows its database dir, the files need to be inside that) - One my one load each csv file and create the node edges as per the schema of the file The following example LOAD command, for instance, loads the data file that contains relationship between two ChatITems. Each line describes the ChatItem id (Column 1, index 0) which is in ResponseTo another ChatItem id given in column 2 (index 1). LOAD CSV FROM "file:///chat_respond_team_chat.csv" AS row MERGE (ci:chatitem {id: toint(row[0])}) MERGE (ci2:chatitem {id: toint(row[1])}) MERGE (ci)-[:responseto{timestamp: row[2]}]->(ci2) iii) Present a screenshot of some part of the graph you have generated. The graphs must include clearly visible examples of most node and edge types. Below are two acceptable examples. The first example is a rendered in the default Neo4j distribution, the second has had some nodes moved to expose the edges more clearly. Both include examples of most node and edge types.

17 Finding the longest conversation chain and its participants Report the results including the length of the conversation (path length) and how many unique users were part of the conversation chain. Describe your steps. Write the query that produces the correct answer. 1) How many chats are involved in it? The longest path length with relation ResponseTo Query: match p=(a:chatitem)-[:responseto*]->(c:chatitem) return length(p) order by length(p) desc limit 1; Result: 9 That means, 10 chatitems are involved in the longest chat path. 2) How many users participated in this chain? Cypher command: match p=(ci:chatitem)-[:responseto*]->(ci2:chatitem) where length(p) = 9 with p match (d:chatitem)<-[r:createchat]-(u:user) where d in nodes(p)

18 return count(distinct u); - note that the path length 9 is from the part 1. * Result: 5 users Analyzing the relationship between top 10 chattiest users and top 10 chattiest teams Describe your steps from Question 2. In the process, create the following two tables. You only need to include the top 3 for each table. Identify and report whether any of the chattiest users were part of any of the chattiest teams. Chattiest Users match (u:user)-[:createchat]->(ci:chatitem) return u, count(distinct ci) as chats order by chats desc limit 3; Users Number of Chats Chattiest Teams match (ci:chatitem)-[:partof]->(s:teamchatsession)-[:ownedby]->(t:team) return t, count(distinct ci) as chatitems order by chatitems desc limit 3; Teams Number of Chats

19 Finally, present your answer, i.e. whether or not any of the chattiest users are part of any of the chattiest teams. Top 10 Chattiest teams: t chatitems {id: 82} 1324 {id: 185} 1036 {id: 112} 957 {id: 18} 844 {id: 194} 836 {id: 129} 814 {id: 52} 788 {id: 136} 783 {id: 146} 746 {id: 81} 736 Top 10 chattiest Users: u chats {id: 394} 115 {id: 2067} 111 {id: 209} 109 {id: 1087} 109 {id: 554} 107

20 {id: 1627} 105 {id: 516} 105 {id: 999} 105 {id: 461} 104 {id: 668} 104 Once we find team for all these 10 users, we see that: - Only userid 999 belongs to a team that is in top 10 chattiest team. How Active Are Groups of Users? Creating InteractsWith Relationship: 1. One user mentioned another user in a chat match (u1:user)-[cs:createchat]->(ci:chatitem)-[:mentioned]->(u2:user) create (u1)- [:InteractsWith]->(u2) 2. One user created a chatitem in response to another user s chatitem match (u1:user)-[cs:createchat]->(ci1:chatitem)-[:responseto]-(ci2:chatitem) with u1,ci1,ci2 match (u2)-[:createchat]-(ci2) create (u1)-[:interactswith]->(u2)

21 3. Delete Self relationships Match (u1)-[r:interactswith]->(u1) delete r Most Active Users (based on Cluster Coefficients) User ID Coefficient