Graph Analytics. Modeling Chat Data using a Graph Data Model. Creation of the Graph Database for Chats

Graph Analytics Modeling Chat Data using a Graph Data Model This we will be using a graph analytics approach to chat data from the Catch the Pink Flamingo game. Currently this chat data is purely numeric, no text. Analytically, it can still serve a useful purpose in revealing certain types of behaviors which can only be observed within a graph analytics context. This data captures team chat sessions that occur while users play the Catch the Pink Flamingo game. The database schema tracks various actions that occur such as creating a team session, joining a session, leaving a session, and responding to or mentioning items in a chat. Creation of the Graph Database for Chats Database for Schema Chat Data chat_create_team_chat.csv User INT Identifies a unique user Team INT Identifies the team a User belongs to TeamChatSession INT Identifies a chat session the User created Timestamp FLOAT Time that the chat session was created. chat_join_team_chat.csv User INT Identifies a unique user TeamChatSession INT Identifies a chat session the User created Timestamp FLOAT Time that the User joined the chat session chat_leave_team_chat.csv User INT Identifies a unique user TeamChatSession INT Identifies a chat session the

User created Timestamp FLOAT Time that the User left the chat session chat_item_team_chat.csv User INT Identifies a unique user TeamChatSession INT Identifies a chat session the User created ChatItem INT Identifier for a Chat Item. Timestamp FLOAT Time that the ChatItem was created. chat_mention_team_chat.csv ChatItem INT Identifier for a Chat Item. User INT Identifies a unique user Timestamp FLOAT Time that the User was mentioned in the ChatItem. chat_respond_team_chat.csv ChatItem INT Identifier for the responding Chat Item. ChatItem INT Identifier for a Chat Item that was responded to. Timestamp FLOAT Time that the ChatItem was responded to. Loading the Chat Data into Neo4J chat_create_team_chat.csv LOAD CSV FROM "file:///chat_create_team_chat.csv" AS row MERGE (u:user {id: toint(row[0])}) MERGE (t:team {id: toint(row[1])}) MERGE (c:teamchatsession {id: toint(row[2])}) MERGE (u)-[:createssession{timestamp: row[3]}]->(c) MERGE (c)-[:ownedby{timestamp: row[3]}]->(t)

chat_join_team_chat.csv LOAD CSV FROM "file:///chat_join_team_chat.csv" AS row MERGE (u:user {id: toint(row[0])}) MERGE (c:teamchatsession {id: toint(row[1])}) MERGE (u)-[:joins{timestamp: row[2]}]->(c) chat_leave_team_chat.csv LOAD CSV FROM "file:///chat_leave_team_chat.csv" AS row MERGE (u:user {id: toint(row[0])}) MERGE (c:teamchatsession {id: toint(row[1])}) MERGE (u)-[:leaves{timestamp: row[2]}]->(c) chat_item_team_chat.csv LOAD CSV FROM "file:///chat_item_team_chat.csv" AS row MERGE (u:user {id: toint(row[0])}) MERGE (c:teamchatsession {id: toint(row[1])}) MERGE (i:chatitem {id: toint(row[2])}) MERGE (u)-[:createchat{timestamp: row[3]}]->(i) MERGE (i)-[:partof{timestamp: row[3]}]->(c) chat_mention_team_chat.csv LOAD CSV FROM "file:///chat_mention_team_chat.csv" AS row MERGE (i:chatitem {id: toint(row[0])}) MERGE (u:user {id: toint(row[1])}) MERGE (i)-[:mentioned{timestamp: row[2]}]->(u) chat_respond_team_chat.csv LOAD CSV FROM "file:///chat_respond_team_chat.csv" AS row MERGE (i:chatitem {id: toint(row[0])})

MERGE (j:chatitem {id: toint(row[1])}) MERGE (i)-[:responseto{timestamp: row[2]}]->(j) Screen Shots of Graph Data This graph shows the edges between Chat Items. match (n:chatitem)-[r]->(m) return n, r, m LIMIT 50 This graph shows the all of the ResponseTo edges. match (n)-[r:responseto*]->(m) return n, r, m LIMIT 50

Finding the longest conversation chain and its participants Description Find the longest conversation chain in the chat data using the "ResponseTo" edge label. This question has two parts, 1) How many chats are involved in it?, and 2) How many users participated in this chain? Results Path length: 9 Number of Unique Users: 10 Unique Users: 52694, 16478, 13704, 11312, 9744, 9435, 8194, 8044, 7816, 7803 Query START n=node(*) MATCH p=(n)-[:responseto*]->(m) WITH COLLECT(p) AS paths, MAX(length(p)) AS maxlength RETURN FILTER(path IN paths WHERE length(path)= maxlength) AS longestpaths Screenshot

Relevance Long chats can be used to identify users that have a lot of influence. These Maven users can be targeted with incentives to help sway opinions of users to certain purchases, etc. Identifying highly influential users is far more cost effective then appealing to all users and arguably as effective. Analyzing the relationship between top 10 chattiest users and top 10 chattiest teams Description Find the longest conversation chain in the chat data using the "ResponseTo" edge label. This question has two parts, 1) How many chats are involved in it?, and 2) How many users participated in this chain? Results Chattiest Users User 394 115 2067 111 209 109 Number of Chats Chattiest Teams Teams 82 1324 185 1036 112 967 Number of Chats

Users / Teams User Team Top 10 Team? 394 54 No 2067 93 No 209 97 No 1087 135 No 554 32 No 516 95 No 1627 18 Yes 999 86 No 668 134 No 451 82 Yes The data show there is only about a 20% probability that a chatty user belongs to a chatty team. This could be due to the fact that teams with more communication will have a more equal distribution of chats (like the San Antonio Spurs), where the super chatty team member will drown out the others (think Kobe on the Lakers). Queries Chatty Users START n=node(*) MATCH (n)-[:createchat*]->(m) return n.id as User, count(m) as Total ORDER BY Total DESC LIMIT 10 Chatty Teams START n=node(*) MATCH ()-[:PartOf]-()-[:OwnedBy]-(n) return n.id as Team, count(n) as Total ORDER BY Total DESC LIMIT 10 Screenshot Chatty Users

Chatty Teams Relevance This data shows that chatty users are not necessarily popular or influential. When targeting groups of users this should be remembered. You might get better cost efficiency but targeting tightly coupled groups rather than individuals. How Active Are Groups of Users? Description In this question, we will compute an estimate of how dense the neighborhood of a node is. In the context of chat that translates to how mutually interactive a certain group of users are. If we can

identify these highly interactive neighborhoods, we can potentially target some members of the neighborhood for direct advertising. We will do this in a series of steps. The degree of connectedness is measured by a connectedness coefficient that is a ratio of the number of relationships a user has compared to the maximum relationships that are possible. Results Most Active Users (based on Cluster Coefficients) User ID Coefficient 554 0.11904761904761904 461 0.5 668 0.15 2067 0.11904761904761904 516 0.2 394 0.3333333333333333 209 0.16666666666666666 999 0.09722222222222222 1627 0.14285714285714285 1087 0.2 Queries Graph Creation We will construct the neighborhood of users. In this neighborhood, we will connect two users if: One mentioned another user in a chat: Match (u1:user)-[:createchat]->()-[:mentioned]->(u2:user) create (u1)-[:interactswith]->(u2) One created a chatitem in response to another user s chatitem Match (u1:user)-[:createchat]->()-[:responseto]->()-[:mentioned]- >(u2:user) create (u1)-[:interactswith]->(u2) Then we delete self loops. Match (u1)-[r:interactswith]->(u1) delete r And finally delete all duplicate relationships. MATCH (a)-[r:interactswith]->(b) WITH a, b, TAIL (COLLECT (r)) as rr FOREACH (r IN rr DELETE r) Connectedness Coefficient Calculation

After much struggling, this was the final query I came up with. This one did not come easy. I found Cypher quite difficult. I can see the power of it, but it was not a natural fit for me. I did learn a great deal though and feel much more comfortable. MATCH (n)-[:createchat*]->(m) with {u: n.id, t: count(m)} as f ORDER BY f.t DESC LIMIT 10 with collect (f.u) as chatty match (d)-[:interactswith]-(b) with {u: d.id, n: collect(distinct b.id)} as row, chatty with collect (row) as neighbors, chatty match (n)-[r:interactswith]->(m) UNWIND neighbors as row with row.u as id, SUM(CASE WHEN (n.id = row.u and m.id in row.n) or (n.id in row.n and m.id = row.id) THEN 1 ELSE 0 END) as numedges, length(row.n) as k, chatty with {u: id, n: numedges, k: k} as row, chatty with collect(row) as rows, chatty with FILTER(row in rows WHERE row.u in chatty) as frows UNWIND frows as frow with frow.u as id, frow.n as numedges, frow.k as k return id as userid, tofloat(numedges)/tofloat(k * (k - 1)) as coefficient Screenshot