Graph Analytics

Modeling Chat Data using a Graph Data Model

(Describe the graph model for chats in a few sentences. Try to be clear and complete.)

Creation of the Graph Database for Chats

Describe the steps you took for creating the graph database. As part of these steps:
i) Write the schema of the 6 CSV files.
ii) Explain the loading process and include a sample LOAD command.
iii) Present a screenshot of some part of the graph you have generated. The graph must include clearly visible examples of most node and edge types. Below are two acceptable examples. The first is rendered in the default Neo4j distribution; the second has had some nodes moved to expose the edges more clearly. Both include examples of most node and edge types.

i) Schema of the six CSV files

Overall Schema

None of the files has a header row; each file is a plain matrix of values arranged in rows (lines). Every value is read as a string.

The chat model has:
4 node types: User, Team, ChatItem, and TeamChatSession.
8 edge types: CreatesSession, OwnedBy (a Team owns the TeamChatSession), Joins, Leaves, CreateChat, PartOf, Mentioned, and ResponseTo.

1) The chat_create_team_chat.csv file contains the following columns:
column1 (index 0): id of the user/player in the Pink Flamingo game.
column2 (index 1): id of the Team the player is from.
column3 (index 2): id of the TeamChatSession.
column4 (index 3): timestamp when the TeamChatSession was created.

2) The chat_join_team_chat.csv file contains the following columns:
column1 (index 0): id of the user/player in the Pink Flamingo game.
column2 (index 1): id of the TeamChatSession.
column3 (index 2): timestamp of the Joins edge from the user to the TeamChatSession.

3) The chat_leave_team_chat.csv file contains the following columns:
column1 (index 0): id of the user/player in the Pink Flamingo game.
column2 (index 1): id of the TeamChatSession.
column3 (index 2): timestamp of the Leaves edge from the user to the TeamChatSession.

4) The chat_item_team_chat.csv file contains the following columns:
column1 (index 0): id of the user/player in the Pink Flamingo game.
column2 (index 1): id of the TeamChatSession.
column3 (index 2): id of the ChatItem.
column4 (index 3): timestamp of the CreateChat and PartOf edges.

5) The chat_mention_team_chat.csv file contains the following columns:
column1 (index 0): id of the user/player in the Pink Flamingo game.
column2 (index 1): id of the ChatItem.
column3 (index 2): timestamp when the Mentioned edge was created.

6) The chat_respond_team_chat.csv file contains the following columns:
column1 (index 0): id of a ChatItem.
column2 (index 1): id of another ChatItem.
column3 (index 2): timestamp when the ResponseTo edge between the two ChatItems was created.

Conceptually, the data from each file maps onto the graph as follows:

A) Data from the chat_create_team_chat.csv file is used to:
1) Create three nodes, Team, TeamChatSession, and User, each with the relevant id property. The id string is converted to an integer in the process.
2) Create an OwnedBy edge, with a timestamp property, connecting the TeamChatSession and Team nodes. A Team owns a TeamChatSession.
3) Create a CreatesSession edge, with a timestamp property, connecting the User and TeamChatSession nodes.

B) Data from the chat_join_team_chat.csv file is used to:
1) Create two nodes, TeamChatSession and User, each with the relevant id property. The id string is converted to an integer in the process.
2) Create a Joins edge, with a timestamp property, connecting the User and TeamChatSession nodes when a user joins a TeamChatSession.

C) Data from the chat_leave_team_chat.csv file is used to:
1) Create two nodes, TeamChatSession and User, each with the relevant id property. The id string is converted to an integer in the process.
2) Create a Leaves edge, with a timestamp property, connecting the User and TeamChatSession nodes when a user leaves a TeamChatSession.

D) Data from the chat_item_team_chat.csv file is used to:
1) Create three nodes, TeamChatSession, ChatItem, and User, each with the relevant id property. The id string is converted to an integer in the process.
2) Create a CreateChat edge, with a timestamp property, connecting the User and ChatItem nodes when a user posts a chat item in a TeamChatSession, and a PartOf edge, with a timestamp property, connecting the ChatItem and TeamChatSession nodes.

E) Data from the chat_mention_team_chat.csv file is used to:
1) Create two nodes, ChatItem and User, each with the relevant id property. The id string is converted to an integer in the process.
2) Create a Mentioned edge, with a timestamp property, connecting the ChatItem and User nodes when a chat item mentions a user.

F) Data from the chat_respond_team_chat.csv file is used to:
1) Create two nodes, ChatItem i and ChatItem j, each with the relevant id property. The id string is converted to an integer in the process.
2) Create a ResponseTo edge, with a timestamp property, connecting ChatItem i to ChatItem j.

ii) Explain the loading process and include a sample load.

The data is loaded with Cypher, Neo4j's query language, which resembles SQL. Each file is read with a LOAD CSV command that takes the path to the .csv file; the files were stored in my computer's home folder. The AS row clause binds each line of the file to the variable row, so the rest of the command runs once per line. The files are loaded one at a time.
Sample commands: the script that loads the chat_respond_team_chat.csv file.

LOAD CSV FROM "file:///chat_respond_team_chat.csv" AS row
// The file:/// URL prefix is needed to import the file from the local computer into Neo4j.
MERGE (i:ChatItem {id: toInt(row[0])})
// Creates node i, labeled ChatItem, with its id property taken from column index 0 of the file.
MERGE (j:ChatItem {id: toInt(row[1])})
// Creates node j, labeled ChatItem, with its id property taken from column index 1 of the file.
MERGE (i)-[:ResponseTo {timestamp: row[2]}]->(j)
// Creates an edge labeled ResponseTo, directed from node i to node j, with its timestamp property taken from column index 2 of the file.

Snapshot of sample load.
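The other five files follow the same pattern. As a minimal sketch, the following loads chat_create_team_chat.csv using the column layout and edge names given in the schema. The edge directions (User to TeamChatSession, TeamChatSession to Team) and the variable names u, t, and c are assumptions made for illustration, and toInteger is the current name of the integer-conversion function used in the sample above.

```cypher
// Sketch only: columns are user id, team id, session id, timestamp (indexes 0 to 3).
// Edge directions are assumed; the report names only the node pairs they connect.
LOAD CSV FROM "file:///chat_create_team_chat.csv" AS row
MERGE (u:User {id: toInteger(row[0])})
MERGE (t:Team {id: toInteger(row[1])})
MERGE (c:TeamChatSession {id: toInteger(row[2])})
MERGE (u)-[:CreatesSession {timestamp: row[3]}]->(c)
MERGE (c)-[:OwnedBy {timestamp: row[3]}]->(t)
```

MERGE is used instead of CREATE so that a User, Team, or TeamChatSession already created while loading another file is reused rather than duplicated.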
Two sample displays of the loaded graph are shown below:
Finding the longest conversation chain and its participants

Report the results including the length of the conversation (path length) and how many unique users were part of the conversation chain. Describe your steps. Write the query that produces the correct answer.

The longest path involves 10 chat items, as seen below. The following script counts the unique users involved in that longest conversation.

// Find every ResponseTo chain between two chat items.
MATCH p = (i:ChatItem)-[:ResponseTo*]->(j:ChatItem)
// Keep only the longest chain.
WITH p ORDER BY length(p) DESC LIMIT 1
// Find the users who created the chat items on that chain.
MATCH (u:User)-[:CreateChat]->(i:ChatItem)
WHERE i IN nodes(p)
// Count the unique users involved in the chain.
RETURN count(DISTINCT u)

The result is 5.
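The path length itself can be read off with a query along the following lines. This is a sketch that assumes the ChatItem label and ResponseTo edge type as spelled in the schema; pathLength and chatItems are illustrative column names. Because length(p) counts the ResponseTo edges on the path, it is always one less than the number of chat items on it.

```cypher
// Sketch: report the longest ResponseTo chain by edge count and by node count.
MATCH p = (i:ChatItem)-[:ResponseTo*]->(j:ChatItem)
RETURN length(p) AS pathLength, size(nodes(p)) AS chatItems
ORDER BY pathLength DESC
LIMIT 1
```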
Chattiest Users: Top 10

a) The out-degree of a node is the number of edges directed out of it. In Neo4j I used it to count each user's chats and recorded it, for convenience, as NumberOfChats.
b) The script and its output are in the snapshot below.

Chattiest Teams: Top 10

The script and its output are in the snapshot below.
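For readers without the snapshots, rankings of this kind can be produced by queries of roughly the following shape. This is a sketch under the schema described earlier, not necessarily the script that was run: the edge directions, and the path from ChatItem through TeamChatSession to Team, are assumptions, and userId and teamId are illustrative column names. The two statements are meant to be run separately.

```cypher
// Sketch: ten users with the most outgoing CreateChat edges.
MATCH (u:User)-[:CreateChat]->(i:ChatItem)
RETURN u.id AS userId, count(i) AS NumberOfChats
ORDER BY NumberOfChats DESC
LIMIT 10;

// Sketch: ten teams whose chat sessions contain the most chat items.
MATCH (i:ChatItem)-[:PartOf]->(c:TeamChatSession)-[:OwnedBy]->(t:Team)
RETURN t.id AS teamId, count(i) AS NumberOfChats
ORDER BY NumberOfChats DESC
LIMIT 10;
```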
Analyzing the relationship between top 10 chattiest users and top 10 chattiest teams

Describe your steps from Question 2. In the process, create the following two tables. You only need to include the top 3 for each table. Identify and report whether any of the chattiest users were part of any of the chattiest teams.

Top 3 chattiest users
User id    Number of Chats
394        115
2067       111
209        109

Top 3 chattiest teams
Team id    Number of Chats
82         1324
185        1036
112        957

Is any chattiest user part of a chattiest team? Yes, exactly one: user id 999, in team id 52, with a total chat count (MaxChat) of 105. The script appears in the command line of the snapshot shown here.
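One way to express this cross-check in a single statement is sketched below. It is an illustration rather than the script in the snapshot: it assumes a user belongs to a team when the user has posted a chat item in a session owned by that team, it assumes the edge directions shown, and the names topTeams, teamChats, and userChats are introduced here.

```cypher
// Sketch: collect the ten teams with the most chat items.
MATCH (:ChatItem)-[:PartOf]->(:TeamChatSession)-[:OwnedBy]->(t:Team)
WITH t, count(*) AS teamChats
ORDER BY teamChats DESC
LIMIT 10
WITH collect(t) AS topTeams
// Keep the ten users with the most chat items.
MATCH (u:User)-[:CreateChat]->(:ChatItem)
WITH topTeams, u, count(*) AS userChats
ORDER BY userChats DESC
LIMIT 10
// Report the users who posted in a session owned by one of the top teams.
MATCH (u)-[:CreateChat]->(:ChatItem)-[:PartOf]->(:TeamChatSession)-[:OwnedBy]->(t:Team)
WHERE t IN topTeams
RETURN DISTINCT u.id AS userId, t.id AS teamId, userChats
```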
How Active Are Groups of Users?

Describe your steps for performing this analysis. Be as clear, concise, and as brief as possible. Finally, report the top 3 most active users in the table below.

To assess how active a group is, we compute a cluster coefficient, which measures how highly interconnected the group is. Groups can then be compared on this index. For each chattiest user, the coefficient is calculated as follows:

1) We created an edge called InteractsWith between two users whenever they communicate, and we call users joined by this edge neighbors.
2) The 10 chattiest users, and their ids, are known from Question 2.
3) For each of these users, identified by id, we find the out-degree over the InteractsWith edge. This out-degree, the number of neighbors of that chattiest user, is k.
4) We then count how many connections exist among those neighbors and call that N. In the figure below N = 1: Neighbor 1 is connected to Neighbor 2, and no other pair of neighbors is connected. Although there is only one such edge, it counts once in each direction, so N is multiplied by 2 in the coefficient.
5) The cluster coefficient is given by the formula 2N / (k*(k-1)).
6) Here k*(k-1) is the maximum possible number of connections among the neighbors.

Example: in the figure below, a fictitious chattiest user has a direct InteractsWith edge to Neighbor 1, Neighbor 2, and Neighbor 3. The out-degree of the chattiest user is therefore 3, which is the value of k.

[Figure: the Chattiest user node linked to Neighbor 1, Neighbor 2, and Neighbor 3, with one further edge between Neighbor 1 and Neighbor 2.]
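Written out for the example figure, where k = 3 and N = 1, the formula from step 5 evaluates as follows. This is only a worked instance of that formula; the symbol C for the coefficient is introduced here for convenience.

```latex
\[
C \;=\; \frac{2N}{k\,(k-1)} \;=\; \frac{2 \cdot 1}{3 \cdot (3-1)} \;=\; \frac{2}{6} \;=\; \frac{1}{3} \;\approx\; 0.33
\]
```

A coefficient of 1 would mean that every pair of neighbors is connected, and a coefficient of 0 that no pair is.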
In Cypher: the snapshot below shows the query and its output for one chattiest user. The same has to be done for every chattiest user to obtain the values needed for the computation. The snapshot shows that the chattiest user with id 394 has 5 neighbors, colored green and labeled with their ids, each connected to that user by an InteractsWith edge.
Here is the script that gets the neighbors of all the chattiest users in one shot. For each of the top 10 chattiest users, it returns the neighbors and the number of neighbors:

MATCH (u1:User)-[i:InteractsWith]-(u2:User)
WHERE u1.id IN [394, 2067, 209, 1087, 554, 516, 1627, 999, 668, 461]
WITH u1, collect(DISTINCT u2.id) AS neighbors
MATCH (u3)-[i2:InteractsWith]-(u4)
WHERE u3.id IN neighbors AND u4.id IN neighbors
RETURN DISTINCT u1, size(neighbors) AS k, neighbors

I then obtained the N figure for each chattiest user. I combined all the above queries into a single query and ran it in Neo4j; the result is in the snapshot below.

Most Active Users (based on Cluster Coefficient)
User ID    Coefficient
461        1
394        1
209        1
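The combined query is only visible in the snapshot, so the following is a hedged sketch of how k, N, and the coefficient 2N / (k*(k-1)) could be computed in one statement. It is not the author's script. It assumes the User label and InteractsWith edge type as spelled above, treats InteractsWith as undirected, and introduces the names a, b, links, n, and coefficient for illustration. Users with fewer than two neighbors produce no neighbor pairs and therefore drop out, which also avoids a division by zero.

```cypher
// Sketch: cluster coefficient for each of the ten chattiest users.
MATCH (u1:User)-[:InteractsWith]-(u2:User)
WHERE u1.id IN [394, 2067, 209, 1087, 554, 516, 1627, 999, 668, 461]
  AND NOT (u2.id = u1.id)
WITH u1, collect(DISTINCT u2) AS neighbors
// Enumerate each unordered pair of neighbors exactly once.
UNWIND neighbors AS a
UNWIND neighbors AS b
WITH u1, size(neighbors) AS k, a, b
WHERE b.id > a.id
// Check whether the two neighbors are themselves connected.
OPTIONAL MATCH (a)-[r:InteractsWith]-(b)
WITH u1, k, a, b, count(r) AS links
// n is the number of connected neighbor pairs (N in the text).
WITH u1, k, sum(CASE WHEN links > 0 THEN 1 ELSE 0 END) AS n
RETURN u1.id AS userId, k, n, 2.0 * n / (k * (k - 1)) AS coefficient
ORDER BY coefficient DESC
LIMIT 3
```

Multiplying by 2.0 rather than 2 forces floating-point division, so the coefficient is not truncated to an integer.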