Collection Management Tweet


CS5604, Information Storage & Retrieval, Fall 2017
Farnaz Khaghani, Junkai Zeng, Momen Bhuiyan, Anika Tabassum, Payel Bandyopadhyay
Professor: Dr. Edward Fox

Purpose of CMT
- Processing tweets of two events:
  - Solar Eclipse (6M tweets)
  - Las Vegas Shooting (~0.18M tweets)
- Creating a social network database based on the relationships between Twitter users and tweets

Tweet Processing Overview

Previous Architecture: JSON to HBase

Current Architecture: JSON to HBase

Parsing
- json4s: a JSON library in Scala
- For the Las Vegas Shooting dataset (~180k tweets), parsing took less than 2 minutes
- Changes:
  - Removal of multiple steps: minimize data pre-processing
  - Overhead: copying the JSON file

Cleaning
Data cleaning:
- NER, POS, tokenization, lemmatization: Stanford CoreNLP
- Hashtags, mentions, retweets: Matthew's framework
- Stopword removal: Spark MLlib
- Cleaning punctuation, removing profanity, formatting: Scala code
For the Las Vegas Shooting dataset, data cleaning took less than 2 hours.
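The hashtag, mention, retweet, punctuation, and stopword steps listed above can be illustrated with a minimal Python sketch. This is not the team's pipeline (which used Stanford CoreNLP, Spark MLlib, and Scala code); the function name `clean_tweet`, the regular expressions, and the tiny stopword list are assumptions made only for illustration, and lemmatization, NER, POS tagging, and profanity removal are left out.

```python
import re
import string

# Tiny illustrative stopword list (assumption); the slides use Spark MLlib instead.
STOPWORDS = {"a", "an", "the", "on", "in", "of", "to", "and", "rt"}

HASHTAG_RE = re.compile(r"#(\w+)")
MENTION_RE = re.compile(r"@(\w+)")
URL_RE = re.compile(r"https?://\S+")

# Map every ASCII punctuation character to a space.
PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def clean_tweet(text):
    """Return hashtags, mentions, a retweet flag, and cleaned tokens for one tweet."""
    # Remove URLs first so that URL fragments are not mistaken for hashtags.
    without_urls = URL_RE.sub(" ", text)
    hashtags = [tag.lower() for tag in HASHTAG_RE.findall(without_urls)]
    mentions = MENTION_RE.findall(without_urls)
    is_retweet = text.lstrip().lower().startswith("rt @")

    # Drop mention handles, replace punctuation with spaces, lowercase, split.
    body = MENTION_RE.sub(" ", without_urls).translate(PUNCT_TABLE)
    tokens = [tok for tok in body.lower().split() if tok not in STOPWORDS]
    return {
        "hashtags": hashtags,
        "mentions": mentions,
        "rt": is_retweet,
        "tokens": tokens,
    }
```

Calling `clean_tweet` on the sample tweet text keeps hashtag words as ordinary tokens, which matches the shape of the clean-tokens column, although without lemmatization the token forms differ (for example `vegas` rather than `vega`).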

Schemas Provided in HBase

Column Family | Column Name | Example
clean-tweet | NER | Shooting a Chrome <em class='number'>.50</em> Cal Machine Gun on the <em class='location'>vegas</em> <em class='location'>strip</em> #lasvegas #vegas #shooting #SaturdayMotivation https://t.co/zromary7un
clean-tweet | POS | <em class='nn'>rt</em> <em class='nn'>@troyglidden</em> : <em class='nn'>scanner</em>...
clean-tweet | clean-text-cla | security guard shot leg 32nd floor unk hotel vegas shooting
clean-tweet | clean-text-cta | security guard shot leg 32nd floor unk hotel vegas shooting
clean-tweet | clean-text-solr | security guard shot leg 32nd floor unk hotel vegas shooting
clean-tweet | clean-tokens | shooting;chrome;50;cal;machine;gun;vega;strip;lasvega;vega;shooting;saturdaymotivation;
clean-tweet | geom-type |
clean-tweet | hashtags | #lasvegas,#vegas,#shooting,#saturdaymotivation
clean-tweet | long-url | http://freebeacon.com/culture/shooting-a-chrome-50-cal-machine-gun-onthe-vegas-strip/
clean-tweet | mentions | troyglidden
clean-tweet | rt | false
clean-tweet | sner-locations | Vegas;Strip;
clean-tweet | sner-organizations |
clean-tweet | sner-people |
clean-tweet | solr-gemo |
clean-tweet | spatial-bounding |
clean-tweet | spatial-coord |
clean-tweet | tweet-importance |
clean-tweet | url_visited_cmw |
metadata | collection-id | 1024
metadata | collection-name | #shooting #LasVegas
metadata | doc-type | tweet
metadata | dummy-data | false
tweet | archive-source | twitter-search
tweet | comment-count | -1
tweet | contributor-enabled | false
tweet | created-time | Sat Sep 23 20:08:16 +0000 2017
tweet | created-timestamp |
tweet | geo-0 |
tweet | geo-1 |
tweet | geo-type |
tweet | language | en
tweet | like-count | 5
tweet | place-country-code | US
tweet | profile-img-url | http://pbs.twimg.com/profile_images/894753143057137666/3u9y6di2_normal.jpg
tweet | retweet-count | 1
tweet | screen-name | pepesgrandma
tweet | source | <a href="http://twitter.com" rel="nofollow">twitter Web Client</a>
tweet | text | Shooting a Chrome .50 Cal Machine Gun on the Vegas Strip \xf0\x9f\x98\x8d\x0a#lasvegas #vegas #shooting #SaturdayMotivation https://t.co/zromary7un
tweet | to-user-id |
tweet | tweet-deleted | false
tweet | tweet-id | 911683653868113920
tweet | url | https://t.co/zromary7un
tweet | user-deleted | false
tweet | user-id | 116384038
tweet | user-name | Babushka\xE5\xA5\xB3\xE5\xA3\xAB
tweet | user_favourites_count | 42111
tweet | user_followers_count | 5569
tweet | user_friends_count | 357
tweet | user_lang | en
tweet | user_location | Siberia China
tweet | user_mentions_id_str | Dahboo7
tweet | user_mentions_name | 1411455757
tweet | user_statuses_count | 31996

Social Network

Overview

Initial Data: JSON

Pre-processing data for social network
- Used shell scripts to pre-process the data
- Converted the tweets from JSON to CSV format
- Created a full CSV file with all fields

Challenges of working with JSON files
- Difficult to interpret (JSON formatter)
- Large files to process
- Inconsistency in the fields
JSON to CSV

Commands to convert JSON to CSV
- Used the jq library
- Sample usage:
  cat Eclipse.json | jq -r '. | [.user.id_str, .retweeted_status.id_str, .in_reply_to_user_id, .entities.user_mentions[].id] | @csv' > ./Eclipse/Eclipse.csv
- The above did not work when more than 2 fields had array elements. For those cases, we processed the fields separately, joined the array elements with a semicolon (;), and then merged the files.
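The idea of joining multi-valued fields with a semicolon so that every tweet yields a fixed set of columns can be sketched in Python. This is only an illustration, not the shell workflow described above: the function names and the column list are assumptions, the field paths are taken from the jq command, and the input is assumed to hold one JSON object per line.

```python
import csv
import json

# Column names are an assumption, modeled on the pruned CSV in these slides.
COLUMNS = ["user_id", "retweeted_status_id", "in_reply_to_user_id", "entities_user_mentions"]


def tweet_to_row(tweet):
    """Flatten one tweet dict into a fixed-width row; arrays are joined with ';'."""
    user = tweet.get("user") or {}
    retweeted = tweet.get("retweeted_status") or {}
    entities = tweet.get("entities") or {}
    mentions = entities.get("user_mentions") or []
    return [
        user.get("id_str", ""),
        retweeted.get("id_str", ""),
        tweet.get("in_reply_to_user_id") or "",
        ";".join(str(m["id"]) for m in mentions if "id" in m),
    ]


def json_lines_to_csv(lines, out):
    """Write a header plus one CSV row per non-empty JSON line to the file-like `out`."""
    writer = csv.writer(out)
    writer.writerow(COLUMNS)
    for line in lines:
        if line.strip():
            writer.writerow(tweet_to_row(json.loads(line)))
```

Missing fields become empty cells, which keeps the column count constant even when the JSON records are inconsistent.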

Sample pruned CSV file
Columns: id, favourite_count, full_text, user_id, retweeted_status_id, in_reply_to_user_id, entities_user_mentions
888201064817860613 5 There's going to be a. 19199743 2 I gotta buy some solar eclipse. 2762027475 0 Cellphone service could be spotty. 224233529 0 Anyone else notice how. 103167711 889882842242707456 15102849 713741422000807937 264792278 889941327202455553 125485258 2470058834 466665274 889874789611048960 15102849 124197346 101144034 889898800411688960 11348282 11348282

Social Network
Objective: Build a social network that connects tweets and users through their relationships
Nodes:
1) Users
2) Tweets
Edges (existence of a relationship):
- Retweet
- Mention
- In reply to
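As a rough illustration of the node and edge model above, the following Python sketch turns one CSV row into a list of (source, relation, target) edges. The function name `row_to_edges` and the dictionary keys are assumptions; the relation labels follow the relation names used elsewhere in these slides (in_retweet_to, in_reply_to, mentions).

```python
def row_to_edges(row):
    """Return (source, relation, target) tuples for one CSV row given as a dict.

    Empty cells produce no edge, so an edge only exists when the
    relationship itself exists.
    """
    edges = []
    user = row["user_id"]
    if row.get("retweeted_status_id"):
        edges.append((user, "in_retweet_to", row["retweeted_status_id"]))
    if row.get("in_reply_to_user_id"):
        edges.append((user, "in_reply_to", row["in_reply_to_user_id"]))
    # Mentions are assumed to be stored as a ';'-separated list in one cell.
    for mentioned in row.get("entities_user_mentions", "").split(";"):
        if mentioned:
            edges.append((user, "mentions", mentioned))
    return edges
```

A row with no retweet, reply, or mention yields an empty list, so isolated users add no edges to the graph.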

RDF triplestore
An RDF (Resource Description Framework) triplestore is a graph database for storing semantic facts. It:
- Formally describes the semantics, or meaning, of information
- Represents metadata
- Consists of triples, which are based on an Entity-Attribute-Value (EAV) model
Example triple: Selena Gomez follows Coach

What is a triplestore?
- A social network is a graph of nodes and edges (every node is a user and every edge is a relationship)
- A triplestore stores every edge (a user-user relationship) in simple sentence form
- Simple sentence: <subject> <predicate> <object>
- Subject: user, predicate: relationship, object: user
- We identify each user by their Twitter ID

Why a triplestore?
- Faster than relational databases
- Supports optional schema models, called ontologies
- Improves search and analytics power
- Supports SPARQL queries

Convert CSV to RDF N-Triple file
- Apache Jena library in Java to convert the CSV file to an N-Triple (.nt) file
- Apache Jena Fuseki server to store the social network (N-Triple) data

N-Triple file sample
<http://example.org/898620093059534848> <http://xmlns.com/snr/0.1/mentions> "1021074122" .
- Subject: URI of the user ID
- Predicate: URI of the predicate
- Object: user ID (string)
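The sample line above can be reproduced with a short Python sketch. The slides state that the actual conversion used the Apache Jena library in Java, so this is only an illustration of the line format; the helper names are assumptions, and the two base URIs are the ones shown in the slides.

```python
SUBJECT_BASE = "http://example.org/"
PREDICATE_BASE = "http://xmlns.com/snr/0.1/"


def escape_literal(value):
    """Escape backslashes, double quotes, and line breaks inside a string literal."""
    return (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def to_ntriple(subject_id, relation, object_id):
    """Format one relationship as an N-Triples line with a plain string object."""
    literal = escape_literal(str(object_id))
    return f'<{SUBJECT_BASE}{subject_id}> <{PREDICATE_BASE}{relation}> "{literal}" .'
```

For example, `to_ntriple("898620093059534848", "mentions", "1021074122")` yields the same subject, predicate, and object as the sample triple.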

Triplestore Database

Front End Team Interface
Datasets (both are persistent in the Fuseki server):
- Solar Eclipse event: /eclipse
- Las Vegas Shooting event: /shooting
URI prefixes:
- Subject: <http://example.org/>
- Predicate: <http://xmlns.com/snr/0.1/>
Relations:
- in_reply_to
- mentions
- in_retweet_to
- followedby

Sample for fetching a query result in JSON:
http://mule.dlib.vt.edu:3030/eclipse/query?query=prefix%20sub:%20%3chttp://example.org/%3e%20prefix%20pred:%20%3chttp://xmlns.com/snr/0.1/%3e%20selecT%20?y%20WHERE{sub:2351245436%20pred:mentions|pred:in_reply_to|pred:in_retweet_to%20?y.}&wt=json&json.wrf=my_callback
- This fetches all mentions, in_reply_to, and in_retweet_to IDs of user ID 2351245436
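A query URL of this kind can be assembled programmatically instead of by hand. The Python sketch below is an illustration under assumptions: the function name `build_query_url` is hypothetical, the endpoint and prefixes are the ones given in the slides, and the `wt` and `json.wrf` parameters from the sample are omitted. The relations are combined with the SPARQL property path alternation operator (`|`); clients that need JSON commonly request it with an `Accept: application/sparql-results+json` header.

```python
from urllib.parse import urlencode

# Endpoint and prefixes as given in the slides.
DEFAULT_ENDPOINT = "http://mule.dlib.vt.edu:3030/eclipse/query"
DEFAULT_RELATIONS = ("mentions", "in_reply_to", "in_retweet_to")


def build_query_url(user_id, relations=DEFAULT_RELATIONS, endpoint=DEFAULT_ENDPOINT):
    """Build a GET URL asking for every ?y linked to user_id by any given relation."""
    # "a|b|c" is a SPARQL property path that matches any one of the predicates.
    path = "|".join(f"pred:{name}" for name in relations)
    sparql = (
        "PREFIX sub: <http://example.org/> "
        "PREFIX pred: <http://xmlns.com/snr/0.1/> "
        f"SELECT ?y WHERE {{ sub:{user_id} {path} ?y . }}"
    )
    # urlencode percent-encodes reserved characters such as <, >, {, }, | and ?.
    return endpoint + "?" + urlencode({"query": sparql})
```

Encoding the query with `urlencode` avoids the manual `%20` and `%3c` escaping used in the hand-written sample.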

Time to upload data
- The largest Solar Eclipse N-Triple (.nt) file (~373 MB) takes ~4 min to upload
- Time to upload the whole Solar Eclipse core: ~12 min
- Time to upload the Las Vegas Shooting core: ~2 min

Challenges and Future Work
- Fetching Twitter followers and friends takes time and was not feasible for ~4M users
- Converting directly from JSON to an N-Triple file
- Parallelizing the conversion to N-Triple
- Storing user names, screen names, followers, and friends in the social network
- Calculating followers and friends for the top N users who have the highest number of followers, friends, or tweets posted

Acknowledgment
First, we would like to thank Dr. Fox for his constructive comments and guidance during this project. Our thanks are also due to the US National Science Foundation for supporting Global Event and Trend Archive Research (GETAR) through IIS-1619028.

Questions?