I) write schema of the six files.


Graph Analytics

Modeling Chat Data using a Graph Data Model

(Describe the graph model for chats in a few sentences. Try to be clear and complete.)

Creation of the Graph Database for Chats

Describe the steps you took for creating the graph database. As part of these steps:

i) Write the schema of the 6 CSV files.
ii) Explain the loading process and include a sample LOAD command.
iii) Present a screenshot of some part of the graph you have generated. The graphs must include clearly visible examples of most node and edge types. Below are two acceptable examples. The first example is rendered in the default Neo4j distribution; the second has had some nodes moved to expose the edges more clearly. Both include examples of most node and edge types.

i) Write the schema of the six files.

Overall Schema

None of the files has a header row; each file is simply a matrix of data arranged in rows (also called lines). Every value in the files is a string.

The chat model has:
4 node types: User, Team, ChatItem, and TeamChatSession.
8 edge types: CreatesSession, OwnedBy (a Team owns a TeamChatSession), Joins, Leaves, CreateChat, PartOf, Mentioned, and ResponseTo.

1) In the chat_create_team_chat.csv file, the columns contain the following data:
   column1 (index 0): id of the user/player in the Pink Flamingo game.
   column2 (index 1): id of the Team the player is from.
   column3 (index 2): id of the TeamChatSession.
   column4 (index 3): timestamp when the TeamChatSession was created.

2) In the chat_join_team_chat.csv file, the columns contain the following data:
   column1 (index 0): id of the user/player in the Pink Flamingo game.
   column2 (index 1): id of the TeamChatSession.
   column3 (index 2): timestamp of the Joins edge (user joins the TeamChatSession).

3) In the chat_leave_team_chat.csv file, the columns contain the following data:
   column1 (index 0): id of the user/player in the Pink Flamingo game.
   column2 (index 1): id of the TeamChatSession.
   column3 (index 2): timestamp of the Leaves edge (user leaves the TeamChatSession).

4) In the chat_item_team_chat.csv file, the columns contain the following data:
   column1 (index 0): id of the user/player in the Pink Flamingo game.
   column2 (index 1): id of the TeamChatSession.
   column3 (index 2): timestamp of the CreateChat and PartOf edges.

5) In the chat_mention_team_chat.csv file, the columns contain the following data:
   column1 (index 0): id of the user/player in the Pink Flamingo game.
   column2 (index 1): id of the ChatItem.
   column3 (index 2): timestamp when the Mentioned edge was created.

6) In the chat_respond_team_chat.csv file, the columns contain the following data:
   column1 (index 0): id of a ChatItem.
   column2 (index 1): id of another ChatItem.
   column3 (index 2): timestamp when the ResponseTo edge was created.

Conceptually, each file is turned into nodes and edges as follows:

A) Data from the chat_create_team_chat.csv file is used to:
   1) Create three nodes labeled Team, TeamChatSession, and User, each with the relevant id property. The id strings are converted to integers in the process.
   2) Create an OwnedBy edge with a timestamp property, connecting the TeamChatSession and Team nodes. A Team owns a TeamChatSession.
   3) Create a CreatesSession edge with a timestamp property, connecting the User and TeamChatSession nodes.

B) Data from the chat_join_team_chat.csv file is used to:
   1) Create two nodes labeled TeamChatSession and User, each with the relevant id property. The id strings are converted to integers in the process.
   2) Create a Joins edge with a timestamp property, connecting the User and TeamChatSession nodes when a user joins a TeamChatSession.

C) Data from the chat_leave_team_chat.csv file is used to:
   1) Create two nodes labeled TeamChatSession and User, each with the relevant id property. The id strings are converted to integers in the process.
   2) Create a Leaves edge with a timestamp property, connecting the User and TeamChatSession nodes when a user leaves a TeamChatSession.

D) Data from the chat_item_team_chat.csv file is used to:
   1) Create three nodes labeled TeamChatSession, ChatItem, and User, each with the relevant id property. The id strings are converted to integers in the process.
   2) Create a CreateChat edge with a timestamp property, connecting the User and ChatItem nodes when a user posts a chat item in a TeamChatSession. A PartOf edge with a timestamp property is also created, connecting the ChatItem and TeamChatSession nodes.

E) Data from the chat_mention_team_chat.csv file is used to:
   1) Create two nodes labeled ChatItem and User, each with the relevant id property. The id strings are converted to integers in the process.
   2) Create a Mentioned edge with a timestamp property, connecting the ChatItem and User nodes when a chat item mentions a user.

F) Data from the chat_respond_team_chat.csv file is used to:
   1) Create two nodes, ChatItem i and ChatItem j, each with the relevant id property. The id strings are converted to integers in the process.
   2) Create a ResponseTo edge with a timestamp property, connecting ChatItem i to ChatItem j.

ii) Explain the loading process and include a sample LOAD command.

The data is loaded with Cypher scripts; Cypher is a query language quite similar to SQL. Each file is read with the LOAD CSV command, which takes the path to the .csv file (the files were stored in my computer's home folder). The 'AS row' clause in the command binds each line of the file to the variable row, so the rest of the script runs once per line. The files are loaded and executed one at a time.

Sample script to load the chat_respond_team_chat.csv file:

    // The file:/// URL prefix is needed to read the file from the local computer into Neo4j.
    LOAD CSV FROM "file:///chat_respond_team_chat.csv" AS row
    // Create node i, labeled ChatItem, with its id property taken from column index 0.
    // (toInteger is spelled toInt in older Neo4j releases.)
    MERGE (i:ChatItem {id: toInteger(row[0])})
    // Create node j, labeled ChatItem, with its id property taken from column index 1.
    MERGE (j:ChatItem {id: toInteger(row[1])})
    // Create a ResponseTo edge with its timestamp property taken from column index 2.
    // The edge is directed from node i to node j.
    MERGE (i)-[:ResponseTo {timestamp: row[2]}]->(j)

Snapshot of the sample load:
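The effect of MERGE on this file can be mimicked in plain Python. This is only a minimal sketch, assuming the three-column layout described for chat_respond_team_chat.csv; the function name load_responses and the sample rows are hypothetical, and sets stand in for MERGE's "create only if absent" behavior.

```python
import csv
import io

def load_responses(csv_text):
    """Build ChatItem nodes and ResponseTo edges from headerless CSV text."""
    nodes = set()   # ChatItem ids; a set keeps each id once, like MERGE
    edges = set()   # (from_id, to_id, timestamp) triples, also deduplicated
    for row in csv.reader(io.StringIO(csv_text)):
        if not row:          # skip blank lines
            continue
        # Columns 0 and 1 are ids (converted to int); column 2 is the timestamp.
        i, j, timestamp = int(row[0]), int(row[1]), row[2]
        nodes.update((i, j))
        edges.add((i, j, timestamp))
    return nodes, edges

# Hypothetical sample rows; the third line repeats the first.
sample = "10,11,100\n12,11,101\n10,11,100\n"
nodes, edges = load_responses(sample)
print(sorted(nodes))   # [10, 11, 12]
print(len(edges))      # 2
```

The repeated row adds nothing, which is why a script built on MERGE can be re-run without duplicating nodes or edges.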

Two sample displays of the loaded graph are shown below:

Finding the longest conversation chain and its participants

Report the results including the length of the conversation (path length) and how many unique users were part of the conversation chain. Describe your steps. Write the query that produces the correct answer.

The number of chat items involved in the longest path is 10, as seen below. The following script obtains the 5 unique users involved in the longest conversation.

    // Match every ResponseTo chain between two ChatItem nodes.
    MATCH p = (i:ChatItem)-[:ResponseTo*]->(j:ChatItem)
    // Keep only the longest chain.
    WITH p ORDER BY length(p) DESC LIMIT 1
    // Find the users who created chat items...
    MATCH (u:User)-[:CreateChat]->(i:ChatItem)
    // ...that are nodes of the longest chain p.
    WHERE i IN nodes(p)
    // Return the number of unique users involved in that chain.
    RETURN count(DISTINCT u)

The result is 5.
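The same two steps (find the longest ResponseTo chain, then count the distinct creators of its chat items) can be sketched in Python on a toy graph. This is an illustration under stated assumptions, not the query used above: the edge list and the creators mapping are hypothetical, and the response graph is assumed to be acyclic.

```python
from functools import lru_cache

def longest_chain(edges):
    """Return the longest path, as a tuple of ChatItem ids, in an acyclic graph."""
    successors = {}
    for src, dst in edges:
        successors.setdefault(src, []).append(dst)

    @lru_cache(maxsize=None)
    def best_from(node):
        # Longest chain that starts at this node.
        best = (node,)
        for nxt in successors.get(node, ()):
            candidate = (node,) + best_from(nxt)
            if len(candidate) > len(best):
                best = candidate
        return best

    all_nodes = {n for edge in edges for n in edge}
    return max((best_from(n) for n in all_nodes), key=len, default=())

# Hypothetical data: (i, j) means ChatItem i is a response to ChatItem j.
edges = [(4, 3), (3, 2), (2, 1), (6, 5)]
creators = {1: "a", 2: "b", 3: "a", 4: "c", 5: "d", 6: "d"}  # ChatItem -> user

chain = longest_chain(edges)
unique_users = {creators[item] for item in chain}
print(chain, len(chain) - 1, len(unique_users))
```

Here len(chain) counts chat items, while len(chain) - 1 counts the ResponseTo edges between them.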

Chattiest Users: Top 10

a) The outdegree of a node is the number of edges directed out of that node. For convenience, I recorded each user's outdegree under the name NumberOfChats.
b) The script and the output of the command are in the snapshot below.

Chattiest Teams: Top 10

The script and its output are in the snapshot below.
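Counting outdegree and keeping the top entries can be sketched as follows. This is a hedged illustration: it assumes the outdegree of interest is the number of CreateChat edges leaving each user, and the function name top_chattiest and the edge list are hypothetical.

```python
from collections import Counter

def top_chattiest(create_chat_edges, n=10):
    """Rank users by outdegree, i.e. by number of outgoing CreateChat edges."""
    out_degree = Counter(user for user, _item in create_chat_edges)
    return out_degree.most_common(n)

# Hypothetical (user id, chat item id) pairs, one per CreateChat edge.
create_chat_edges = [(1, 100), (1, 101), (2, 102), (1, 103), (3, 104), (3, 105)]
print(top_chattiest(create_chat_edges, n=2))   # [(1, 3), (3, 2)]
```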

Analyzing the relationship between top 10 chattiest users and top 10 chattiest teams

Describe your steps from Question 2. In the process, create the following two tables. You only need to include the top 3 for each table. Identify and report whether any of the chattiest users were part of any of the chattiest teams.

Top 3 Chattiest Users
User id    Number of Chats
394        115
2067       111
209        109

Top 3 Chattiest Teams
Team id    Number of Chats
82         1324
185        1036
112        957

Is any chattiest user part of a chattiest team? Yes, exactly one: user id 999, on team id 52, with a total MaxChat count of 105. The script is in the command line and its output is in the snapshot here.
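The cross-check between the two rankings can be sketched in Python. This is only a sketch under assumptions: a team's chat count is taken to be the number of chat items that are PartOf a session OwnedBy that team, and the function names, the team_of_user mapping, and all ids below are hypothetical toy data rather than values from the report.

```python
from collections import Counter

def chattiest_teams(part_of, owned_by, n=10):
    """part_of: (chat item, session) pairs; owned_by: session -> team."""
    counts = Counter(owned_by[session] for _item, session in part_of)
    return counts.most_common(n)

def users_in_top_teams(top_users, top_teams, team_of_user):
    """Return (user, team) pairs where a top user belongs to a top team."""
    top_team_ids = {team for team, _count in top_teams}
    return [(user, team_of_user[user])
            for user, _count in top_users
            if team_of_user.get(user) in top_team_ids]

# Hypothetical toy data.
part_of = [(1, "s1"), (2, "s1"), (3, "s2"), (4, "s3"), (5, "s3"), (6, "s3")]
owned_by = {"s1": "t1", "s2": "t2", "s3": "t3"}
top_teams = chattiest_teams(part_of, owned_by, n=2)
top_users = [("u1", 5), ("u2", 4)]
team_of_user = {"u1": "t2", "u2": "t3"}
overlap = users_in_top_teams(top_users, top_teams, team_of_user)
print(top_teams, overlap)
```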

How Active Are Groups of Users?

Describe your steps for performing this analysis. Be as clear, concise, and as brief as possible. Finally, report the top 3 most active users in the table below.

To assess how active a group is, we compute a cluster coefficient, which measures how highly interconnected the group is. Groups can then be compared using this measurement. The coefficient for each chattiest user is calculated as follows:

1) We created an edge called interactswith between two nodes when they communicate, and we call such nodes neighbors.
2) The 10 chattiest users were identified while answering Question 2, so we know their ids.
3) For each of these chattiest users, identified by id, we find the outdegree over the interactswith edge. This outdegree is k, the number of neighbors of that chattiest user.
4) We then count how many connections exist among those neighbors and call that count N. In the figure below N = 1: Neighbor 1 is connected to Neighbor 2, and no other pair of neighbors is connected. Although there is only one such edge, it is counted once in each direction (Neighbor 1 to Neighbor 2, and Neighbor 2 to Neighbor 1), so N is multiplied by 2 in the coefficient calculation.
5) The cluster coefficient is given by the formula 2N / (k * (k - 1)).
6) Here k * (k - 1) is the maximum possible number of connections among the k neighbors, counting each direction separately.

Example: in the figure below, a fictitious chattiest user has a direct interactswith edge to Neighbor 1, Neighbor 2, and Neighbor 3. The outdegree of the chattiest user is therefore 3, which is the value of k we are looking for.

[Figure: node Chattiest connected to Neighbor 1, Neighbor 2, and Neighbor 3; Neighbor 1 connected to Neighbor 2.]
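As a worked instance of the formula, using only the values of the fictitious example (k = 3 neighbors and N = 1 connection among them):

```latex
C = \frac{2N}{k\,(k-1)} = \frac{2 \cdot 1}{3 \cdot (3-1)} = \frac{2}{6} = \frac{1}{3}
```

A coefficient of 1 would require every pair of neighbors to be connected, which for k = 3 means N = 3.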

In Cypher: the snapshot below shows the query and its output for one chattiest user. The same must be done for every chattiest user to obtain the values needed for the computation. The snapshot shows that the chattiest user with id 394 has 5 neighbors, colored green and labeled with their ids. They are connected to the user by interactswith edges.

Here is the script to get the neighbors of all chattiest users in one shot. For each of the top 10 chattiest users, it returns the neighbors and the number of neighbors:

    MATCH (u1:User)-[i:interactswith]-(u2:User)
    WHERE u1.id IN [394,2067,209,1087,554,516,1627,999,668,461]
    WITH u1, collect(DISTINCT u2.id) AS neighbors
    MATCH (u3)-[i2:interactswith]-(u4)
    WHERE (u3.id IN neighbors AND u4.id IN neighbors)
    RETURN DISTINCT u1, size(neighbors) AS k, neighbors

I then obtained the value of N for all chattiest users. I combined all the above queries into one single query and ran it in Neo4j. The result is in the snapshot below.

Most Active Users (based on Cluster Coefficient)
User id    Coefficient
461        1
394        1
209        1
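The whole computation (k, N, and the coefficient) can also be sketched in Python from a list of undirected interactswith edges. This is a minimal sketch under assumptions, not the combined Cypher query mentioned above: the function name cluster_coefficient is hypothetical, and the edge list reproduces the fictitious figure example rather than the real data.

```python
from itertools import combinations

def cluster_coefficient(user, edges):
    """Compute 2N / (k * (k - 1)) for one user from undirected edges."""
    adjacency = {}
    for a, b in edges:
        if a == b:                      # ignore self-loops
            continue
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    neighbors = adjacency.get(user, set())
    k = len(neighbors)
    if k < 2:                           # coefficient is undefined; report 0
        return 0.0
    # N: number of connected pairs among the user's neighbors.
    n_links = sum(1 for a, b in combinations(neighbors, 2) if b in adjacency[a])
    return 2 * n_links / (k * (k - 1))

# Fictitious example: three neighbors, only Neighbor 1 and Neighbor 2 connected.
edges = [("Chattiest", "N1"), ("Chattiest", "N2"), ("Chattiest", "N3"), ("N1", "N2")]
print(cluster_coefficient("Chattiest", edges))
```

Returning 0.0 for users with fewer than two neighbors is a design choice of this sketch, made to avoid dividing by zero.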