Processing Twitter Data with MongoDB. Xiaoxiao Liu


1 Processing Twitter Data with MongoDB Xiaoxiao Liu

2 Issue with Facebook Data Originally, I planned to do this project with Facebook data. - Facebook Graph API - Third-party Java library: restfb I was interested in social network analysis, so the information I needed included users' information, their friends' information, and the relationships between these users.

3 However... Limitation of Graph API: As stated by Facebook: This will only return any friends who have used (via Facebook Login) the app making the request. (In this case, the app is graph API itself).

4 Only one friend showed up :(

5 Only myself showed up!

6 [Diagram: User, with friends Friend1, Friend2, ... Friend n; each friend's own friends (friends of friends); requesting friends of friends ends in an authorization exception]

7 What else can I do? Twitter! - Midterm election - Tweets related to voting

8 Data Source: Twitter Twitter REST APIs - The REST APIs provide programmatic access to read and write Twitter data. Author a new Tweet, read author profile and follower data, and more. The REST API identifies Twitter applications and users using OAuth; responses are available in JSON. Rate Limits: - Search will be rate limited at 180 queries per 15 minute window for the time being, but we may adjust that over time.
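As a quick sanity check on the rate limit quoted above, the arithmetic below works out how far apart search queries must be spaced to stay under 180 queries per 15 minute window. This is only a sketch in Python (the project itself used Java); the function name is made up for illustration.

```python
# Rate limit quoted on the slide: 180 search queries per 15-minute window.
WINDOW_SECONDS = 15 * 60
MAX_QUERIES = 180


def min_seconds_between_queries(max_queries=MAX_QUERIES,
                                window_seconds=WINDOW_SECONDS):
    """Return the average spacing (in seconds) that keeps a steady
    stream of queries at or under the limit."""
    return window_seconds / max_queries


if __name__ == "__main__":
    print(min_seconds_between_queries())
```

Spacing queries evenly is one simple way to respect the limit; a collector could also send queries in bursts and then wait for the window to reset.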

9 The Search API The Twitter Search API is part of Twitter's v1.1 REST API. It allows queries against the indices of recent or popular Tweets and behaves similarly to, but not exactly like, the Search feature available in Twitter mobile or web clients.

10 Geolocalization: The search operator near isn't available in the API, but there is a more precise way to restrict your query by a given location: the geocode parameter, specified with the template latitude,longitude,radius.
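To make the latitude,longitude,radius template concrete, here is a minimal Python sketch that assembles the search parameters. The helper name and the example coordinates (roughly downtown Chicago) are assumptions for illustration, not values from the project. The radius carries a unit suffix, "mi" or "km", as the Search API expects.

```python
from urllib.parse import urlencode


def build_search_params(keyword, latitude, longitude, radius, unit="mi"):
    """Build the q and geocode parameters for a search request.

    The geocode value follows the template latitude,longitude,radius,
    where the radius is followed by its unit ("mi" or "km").
    """
    if unit not in ("mi", "km"):
        raise ValueError("unit must be 'mi' or 'km'")
    geocode = f"{latitude},{longitude},{radius}{unit}"
    return {"q": keyword, "geocode": geocode}


# Hypothetical example: tweets about "election" within 50 miles of Chicago.
params = build_search_params("election", 41.8781, -87.6298, 50)
query_string = urlencode(params)  # commas are percent-encoded as %2C

if __name__ == "__main__":
    print(params["geocode"])
    print(query_string)
```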

11 Twitter4J I used a third-party Java library called Twitter4J. This library makes it easier to integrate a Java application with the Twitter service. To use this library, simply download it and add the .jar file to the class path.

12 QueryString: the search keyword. QueryDate: search for Tweets sent on a certain day. Report back how many Tweets were gathered.

13 Search Keywords 11/1/ /4/2014(Election Day) - quinn (Democrat Candidate's Lastname) - rauner (Republican Candidate's Lastname) - democrat - republican - governor 11/3/ /4/2014(Election Day) - election

14 I stored the data in a txt file with a weird format

15 Why MongoDB
My needs:
- My input data is basically tweets.
- I need to run a word count.
- I need to query the tweets with different keywords.
- I do not want to separate one tweet into several columns.
MongoDB is great for modeling many of the entities:
- Form data: MongoDB makes it easy to evolve the structure of form data over time.
- Blogs / user-generated content: can keep data with complex relationships together in one object.
- Messaging: vary message meta-data easily per message or message type without needing to maintain separate collections or schemas.
- System configuration: just a nice object graph of configuration values, which is very natural in MongoDB.
- Log data of any kind: structured log data is the future.
- Graphs: just objects and pointers, a perfect fit.
- Location based data: MongoDB understands geo-spatial coordinates and natively supports geospatial indexing.

16 MongoDB
- Document-Oriented Storage: JSON-style documents with dynamic schemas offer simplicity and power.
- Full Index Support: Index on any attribute, just like you're used to.
- Querying: Rich, document-based queries.
- Map/Reduce: Flexible aggregation and data processing.

17 - I wanted to re-run my Java code to gather the tweets again, this time storing them in JSON format. - Unfortunately, that did not work out: you cannot use the Search API to find Tweets older than about a week. - So I wrote another Java application to convert the txt file to a JSON file.


18 {'user_name': 'xyz', 'tweet': 'whatever tweet text'}
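The conversion step can be sketched as follows. The original converter was a Java application and the layout of the txt file is not shown, so this Python sketch assumes a hypothetical layout of one tweet per line with the user name and tweet text separated by a tab. It emits one JSON document per line, which is the layout mongoimport reads by default.

```python
import json


def line_to_document(line):
    """Turn one 'user_name<TAB>tweet' line (assumed layout) into a dict."""
    user_name, _, tweet = line.rstrip("\n").partition("\t")
    return {"user_name": user_name, "tweet": tweet}


def convert(lines):
    """Return one JSON string per non-empty input line."""
    return [json.dumps(line_to_document(line))
            for line in lines if line.strip()]


if __name__ == "__main__":
    sample = ["xyz\twhatever tweet text\n"]
    for json_line in convert(sample):
        print(json_line)
```

Writing the result of `convert` to a file, one string per line, would give a file in the shape expected by the import step.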

19 Import Data to MongoDB: mongoimport --db mydb --collection tweets --file tweets.json

20 { "user_name" : "xyz", "tweet" : "whatever tweet text" }

21 Run mongo shell Structure/Schema

22 Run mapreduce to count words
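The slide's own map and reduce functions are not reproduced in this transcript. As an illustration of the idea only, here is a pure-Python sketch of a word-count map/reduce over tweet documents: the map step emits (word, 1) pairs and the reduce step sums the values for each word. The tokenizing rule (lowercase, keeping a leading # or @) is an assumption.

```python
import re
from collections import defaultdict


def map_tweet(doc):
    """Map step: emit (word, 1) for every word in the tweet text."""
    for word in re.findall(r"[#@]?\w+", doc["tweet"].lower()):
        yield word, 1


def reduce_counts(word, counts):
    """Reduce step: combine all values emitted for one word."""
    return sum(counts)


def map_reduce(docs):
    """Group the mapped pairs by key, then reduce each group."""
    grouped = defaultdict(list)
    for doc in docs:
        for key, value in map_tweet(doc):
            grouped[key].append(value)
    return {key: reduce_counts(key, values)
            for key, values in grouped.items()}


if __name__ == "__main__":
    tweets = [{"tweet": "Vote vote #rockthevote"}, {"tweet": "go vote"}]
    print(map_reduce(tweets))
```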

23 Relevant keywords: Voting, Vote, Wage, Citizens, #democrats, #politics, #rockthevote. Possibly relevant keywords: shit, Stupid, Protect, fuck.

24 Interesting Finds Robert Quinn kisses the bicep after that quarterback sack. (keywords: bicep, quarterback)

25 Interesting Finds @EliseStefanik REPUBLICAN WOMEN Set to Make History Tonight

26 @m_silverberg -Wifi for media at the Bruce Rauner party is $50 a pop... -Every TV station in Illinois about to go live at 5 from Bruce Rauner's election night party.

Code User who sent most tweets Relevant Users: @grammy620: Vote for @JeanneShaheen and this will continue!http://t.co/alejxtqs1n CLOSE OUR BORDERS!

27 Code User who sent most tweets Relevant Vote and this will continue! CLOSE OUR BORDERS! #NHsen Stop the Obama Agenda for Quinn tomorrow!!!!!!!!!!!!!!!!!!! Why I'm NOT drinking the Rauner You're ready to vote, and we're ready to help you find out where!
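The code on this slide did not survive in the transcript. One way to find the user who sent the most tweets, sketched here in Python over documents shaped like the imported ones, is to count documents per user_name. The function name and sample data are illustrative only.

```python
from collections import Counter


def top_users(docs, n=1):
    """Return the n most frequent user_name values with their tweet counts."""
    return Counter(doc["user_name"] for doc in docs).most_common(n)


if __name__ == "__main__":
    sample = [
        {"user_name": "a", "tweet": "first"},
        {"user_name": "b", "tweet": "second"},
        {"user_name": "a", "tweet": "third"},
    ]
    print(top_users(sample))
```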

28 Word Count for Keyword democrat code
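The code for this slide is likewise missing from the transcript. A minimal Python sketch of a per-keyword word count, assuming the keyword is matched as a case-insensitive substring of the tweet text, might look like this:

```python
import re
from collections import Counter


def word_count_for_keyword(docs, keyword):
    """Count words only in tweets whose text contains the keyword."""
    keyword = keyword.lower()
    counts = Counter()
    for doc in docs:
        text = doc["tweet"].lower()
        if keyword in text:
            counts.update(re.findall(r"[#@]?\w+", text))
    return counts


if __name__ == "__main__":
    sample = [
        {"tweet": "Vote democrat tomorrow"},
        {"tweet": "Nothing to see here"},
    ]
    print(word_count_for_keyword(sample, "democrat"))
```

Substring matching also catches forms such as "democrats"; matching whole words only would be a stricter alternative.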

29 Result

30 Word Count for Keyword republican Pres. #Obama Brings The Jobless Rate From 10.1% to 5.9% despite republican obstacles #TheyMad #news #p2 #TFB Obama

31 The Big Data Ecosystem at LinkedIn Roshan Sumbaly, Jay Kreps, and Sam Shah

32 This paper describes the systems that engender effortless ingress and egress out of the Hadoop system and presents case studies of how data mining applications are built at LinkedIn. Kafka, Azkaban Ingress, egress

33 For egress, three main mechanisms are necessary:
- Key-value access (70%): Voldemort
- Stream-oriented access (20%): Kafka
- Multidimensional or OLAP access: Avatara
Given the high velocity of feature development and the difficulty in accurately gauging capacity needs, these systems are all horizontally scalable. These systems are run as a multitenant service where no real stringent capacity planning needs to be done: rebalancing data is a relatively cheap operation, engendering rapid capacity changes as needs arise.

35 Ingress Kafka is a distributed publish-subscribe system that persists messages in a write-ahead log, partitioned and distributed over multiple brokers. It allows data publishers to add records to a log. Each of these logs is referred to as a topic. Example: search. The search service would produce these records and publish them to a topic named SearchQueries where any number of subscribers might read these messages. All Kafka topics support multiple subscribers as it is common to have many different subscribing systems. Kafka supports distributing data consumption within each of these subscribing systems, because many of these feeds are too large to be handled by a single machine
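The topic-as-log idea can be illustrated with a toy in-memory model. This is not the Kafka client API; the class and method names are invented for this sketch. Publishers append records to a topic's log, and each subscriber keeps its own read offset, so any number of subscribers can consume the same records independently.

```python
class TopicLog:
    """Toy append-only log standing in for a single topic."""

    def __init__(self):
        self._records = []

    def publish(self, record):
        """Append a record and return its offset in the log."""
        self._records.append(record)
        return len(self._records) - 1

    def read(self, offset, max_records=10):
        """Return up to max_records records starting at offset."""
        return self._records[offset:offset + max_records]


class Subscriber:
    """Each subscriber tracks its own position in the log."""

    def __init__(self, log):
        self.log = log
        self.offset = 0

    def poll(self, max_records=10):
        batch = self.log.read(self.offset, max_records)
        self.offset += len(batch)
        return batch


if __name__ == "__main__":
    search_queries = TopicLog()  # plays the role of the SearchQueries topic
    search_queries.publish({"query": "data engineer"})
    search_queries.publish({"query": "hadoop"})
    first, second = Subscriber(search_queries), Subscriber(search_queries)
    print(first.poll(1), second.poll())
```

The sketch leaves out partitioning across brokers and persistence, which are what let the real system spread large feeds over many machines.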

36 Ingress: Data Evolution Two solutions: 1. Simply load data streams in whatever form they appear. 2. Manually map the source data into a stable, well-thought-out schema and perform whatever transformations are necessary to support this. LinkedIn's solution: retain the same structure throughout the data pipeline and enforce compatibility and other correctness conventions on changes to this structure. Maintain a schema with each topic in a single consolidated schema registry. If data is published to a topic with an incompatible schema, it is rejected. If it is published with a new backwards compatible schema, it evolves automatically. Each schema also goes through a review process to help ensure consistency with the rest of the activity data model.
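The reject-or-evolve behaviour of the schema registry can be sketched with a toy model. Everything here is an assumption made for illustration: the class names, the dictionary representation of a schema, and in particular the compatibility rule, which is deliberately simplified (fields that already exist must keep their type, and newly added fields must declare a default). The real system's rules are not described on the slide.

```python
class IncompatibleSchemaError(Exception):
    """Raised when a topic is offered a schema that breaks compatibility."""


def is_backward_compatible(old, new):
    """Simplified rule: shared fields keep their type, new fields need a default."""
    for name, spec in new.items():
        if name in old:
            if old[name]["type"] != spec["type"]:
                return False
        elif "default" not in spec:
            return False
    return True


class SchemaRegistry:
    """Toy registry holding the current schema for each topic."""

    def __init__(self):
        self._schemas = {}

    def publish(self, topic, schema):
        current = self._schemas.get(topic)
        if current is not None and not is_backward_compatible(current, schema):
            raise IncompatibleSchemaError(topic)
        self._schemas[topic] = schema  # the topic's schema evolves

    def current(self, topic):
        return self._schemas[topic]


if __name__ == "__main__":
    registry = SchemaRegistry()
    registry.publish("SearchQueries", {"query": {"type": "string"}})
    registry.publish("SearchQueries", {
        "query": {"type": "string"},
        "locale": {"type": "string", "default": "en"},
    })
    print(sorted(registry.current("SearchQueries")))
```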

37 Ingress: Hadoop Load The activity data generated and stored on Kafka is pulled into Hadoop using a map-only job that runs every 10 minutes on a dedicated ETL Hadoop cluster as part of an Azkaban workflow. First, the job reads the Kafka log offsets and checks for any new topics. Then, it starts a fixed number of mapper tasks to pull data into HDFS partition files, and finally registers the data with LinkedIn's various systems. The ETL workflow also runs an aggregator job every day to combine and dedup data saved throughout the day into another HDFS location and run predefined retention policies on a per-topic basis. (This combining and cleanup prevents having many small files.)

38 Egress The results of workflows are usually pushed to other systems, either back for online serving or as a derived data set for further consumption. Workflows append an extra job at the end of their pipeline for data delivery out of Hadoop.

39 Egress: Key-Value Voldemort is a distributed key-value store akin to Amazon's Dynamo with a simple get(key) and put(key, value) interface. Tuples are grouped together into logical stores. Each key is replicated to multiple nodes depending on the preconfigured replication factor of its corresponding store. Every node is further split into logical partitions.
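To illustrate how a key might be mapped to several nodes through logical partitions, here is a toy Python sketch. It is not Voldemort's actual routing code: the hash function, the partition layout and the rule of walking the partition ring until enough distinct nodes are found are all assumptions chosen to keep the example small.

```python
import hashlib


def partition_for(key, num_partitions):
    """Hash the key to one of the logical partitions (stable across runs)."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_partitions


def replica_nodes(key, partition_to_node, replication_factor):
    """Walk the partition ring from the key's partition and collect
    distinct nodes until the replication factor is reached."""
    total = len(partition_to_node)
    start = partition_for(key, total)
    nodes = []
    for step in range(total):
        node = partition_to_node[(start + step) % total]
        if node not in nodes:
            nodes.append(node)
        if len(nodes) == replication_factor:
            break
    return nodes


if __name__ == "__main__":
    # Hypothetical layout: six logical partitions spread over three nodes.
    layout = [0, 1, 2, 0, 1, 2]
    print(replica_nodes("member:42", layout, replication_factor=2))
```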

40 Egress: Stream The ability to publish data to Kafka is implemented as a Hadoop OutputFormat. Each MapReduce slot acts as a Kafka producer that emits messages, throttling as necessary to avoid overwhelming the Kafka brokers. As Kafka is a pull-based queue, the consuming application can read messages at its own pace.

41 Egress: OLAP Avatara is a system that moves the cube generation to a high-throughput offline system and the query serving to a low-latency system. By separating the two systems, we lose some freshness of data, but are able to scale them independently. This independence also shields the query layer from the performance impact that concurrent cube computation would otherwise cause.

42 Applications: Key-value - People You May Know - Collaborative Filtering - Skill Endorsements - Related Searches

43 Applications: Stream - News Feed Updates - Relationship Strength

44 Applications: OLAP - Who's Viewed My Profile? - Who's Viewed This Job?

45 Thank you!
