Moodify
W205-1 Rock Baek, Saru Mehta, Vincent Chio, Walter Erquingo Pezo
1. Introduction

Moodify is a music web application that recommends songs to users based on mood. There are two ways a user can interact with the application. First, users can select a mood supported by the system, and the application displays a list of songs classified with the highest probability for that mood. Second, users can browse trending moods through an interactive Google map that displays the current most popular mood in each state or country around the world. The rest of the paper discusses the technical details of the proposed system. Section 2 covers the system architecture, Section 3 the data retrieval strategies, Section 4 the implementation details and possible improvements, and Section 5 concludes the paper.

2. System Architecture

The system involves two major components: 1) a backend system that fetches music metadata and generates a mood categorization for each song; 2) a frontend user-facing web application that accepts user mood queries and responds with a list of songs matching the user input. The backend builds all the necessary, properly indexed data that is then consumed by the frontend web application. The frontend imposes two types of requirements: 1) search songs by mood; 2) browse moods by location. To support the first requirement, the backend needs to build an index on the probability of a song falling into a certain mood category. The probability is then multiplied by the song hotness to calculate a mood frequency-song hotness (mf-sh) score. The details of calculating the mf-sh score are elaborated in Section 2.5. To support the second requirement, the backend needs to associate a location with each mood occurrence for a song, aggregate all moods by location, and build an index on the location-mood counts to be consumed by the frontend.
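To make the scoring concrete, a minimal sketch of the computation just described (the function and variable names are ours, not the project's code; values are illustrative): the probability of a song belonging to a mood is the fraction of its mood vectors that flag that mood, and the mf-sh score multiplies this probability by the Echonest hotness.

```python
def mood_probability(mood_vectors, mood):
    """Fraction of a song's tweet/comment mood vectors that flag the given mood."""
    flags = [vector[mood] for vector in mood_vectors]
    return sum(flags) / len(flags)

def mf_sh(mood_vectors, mood, hotness):
    """Mood frequency - song hotness score: P(mood | song) * hotness."""
    return mood_probability(mood_vectors, mood) * hotness

# Illustrative data: three mood vectors for one song, hotness 0.8.
vectors = [{"joy": 1, "sad": 0}, {"joy": 1, "sad": 0}, {"joy": 0, "sad": 1}]
print(mf_sh(vectors, "joy", 0.8))  # 2/3 * 0.8, i.e. about 0.533
```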
To support the frontend requirements, the backend needs to: 1) compile a list of trending songs (trending means songs that are most listened to or mentioned, with no correlation to mood); 2) associate moods with each song; 3) associate a location with each mood; 4) build the indexes consumed by the frontend. Sections 2.1 to 2.4 discuss the technical implementation of the backend system. Section 2.5 discusses the frontend system. Section 2.6 shows the system diagram.

2.1 Data Fetching Component

To tackle requirement 1, the system refers to Echonest for the list of trending songs. Echonest hosts one of the most versatile music databases in the world, with over 30 million songs and 3 million artist records. The data set includes not only basic metadata such as song title, artist, album and genre, but also intelligent attributes such as energy, danceability, tempo and hotness. We utilized the hotness attribute as the sorting criterion to gather the list of trending songs. To tackle requirement 2, the system requires supplementary human behavioral information in order to accurately predict the moods associated with a song. Using human behavioral data has an advantage over static data such as lyrics for tagging a song with moods: people's mood toward a song may change over time, whereas its lyrics stay the same. The dynamic nature of the mood analysis provides real-time music recommendations that more accurately reflect the current trend. This data is obtained from two social media sites: 1) Twitter and 2) YouTube. Using song title and artist as filter criteria, we can fetch the most relevant tweets and YouTube comments for each song and associate moods with each text. Requirement 3 depends on the location data for each text gathered under requirement 2. Twitter supports location-based tweets; YouTube comments, however, currently carry no location data. Thus, the system only uses geo-enabled tweets to associate a location with a mood, and only moods with an associated location are aggregated and displayed in the interactive mood map. Note that the process of aggregating location moods has no effect on the process of aggregating moods for a song, and thus has no impact on frontend requirement 1 (search songs by mood). These two ETL processes are discussed later in Section 2.3.

2.2 Mood Analysis

The system supports the following mood categories: anger, disgust, fear, joy, love, sad, surprise. Our main guideline for building our corpora is the paper "EmpaTweet: Annotating and Detecting Emotions on Twitter", which describes how to tag tweets with similar categories. The main approach for tweets is to manually categorize about 1500 tweets (200 per category) and use them as corpora for a series of Multinomial Naive Bayes classifiers, one for each category.
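The project trains these classifiers with scikit-learn (Section 4.2); to make the idea concrete, here is a minimal plain-Python sketch of what one such per-mood binary classifier computes. The toy training set and class design are our illustration, not the project's code.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Keep alphabetic tokens only, mirroring the non-alphabetic cleanup step.
    return re.findall(r"[a-z]+", text.lower())

class BinaryMultinomialNB:
    """One binary classifier per mood: does this text express the mood (1) or not (0)?"""

    def fit(self, texts, labels):
        self.counts = {0: Counter(), 1: Counter()}
        priors = {0: 0, 1: 0}
        for text, y in zip(texts, labels):
            priors[y] += 1
            self.counts[y].update(tokenize(text))
        self.vocab = set(self.counts[0]) | set(self.counts[1])
        self.log_prior = {y: math.log(priors[y] / len(labels)) for y in (0, 1)}
        return self

    def predict(self, text):
        scores = {}
        for y in (0, 1):
            n = sum(self.counts[y].values())
            score = self.log_prior[y]
            for token in tokenize(text):
                # Laplace smoothing so unseen words do not zero out the score.
                score += math.log((self.counts[y][token] + 1) / (n + len(self.vocab)))
            scores[y] = score
        return max(scores, key=scores.get)

# Hypothetical miniature training set for a "joy" classifier.
texts = ["this song makes me so happy", "pure happiness and smiles",
         "this track is dull and boring", "i hate this noise"]
labels = [1, 1, 0, 0]
clf = BinaryMultinomialNB().fit(texts, labels)
print(clf.predict("so happy with this song"))  # 1
```

Running one such classifier per mood over a tweet or comment yields exactly the 0/1 mood vector described below.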
In this categorization, non-alphabetic characters, stop words and hashtags are removed. Non-mood-related hashtags, such as event and topic hashtags, are removed because they appear frequently due to a trend rather than for their sentiment value. A Porter stemmer is also applied. The same process is repeated for YouTube comments, whose language differs from tweets: a tweet is limited to 140 characters and many tweets are simply hashtags, while YouTube comments have no such restrictions. Thus, 14 classifiers are needed in total: 7 categories for YouTube and 7 for Twitter. The system utilizes the NLTK library to clean up the comments and tweets and scikit-learn for the classifiers. The classifiers are then used to tag moods for each tweet and YouTube comment, producing a vector of moods for each text. The vector contains one 0/1 entry per mood category, indicating whether the text falls into that category. These vectors are consumed by the ETL processes to aggregate all the moods associated with a song.

2.3 ETL Process
The system involves two aggregation processes required to generate the indexes for frontend consumption. The first ETL process aggregates all the mood vectors for each song. This is accomplished by a MapReduce job: the mapper reads the mood vectors for each song and emits the song id as key and the mood vector as value; the reducer sums the values for each mood of a song and divides the aggregate by the total number of references to obtain a probability. The result of the MapReduce job is a probability mood vector for each song, where each entry indicates the probability of that mood occurring across all the corpora for the song. The second ETL process aggregates all the moods for each state/country across the entire database. This is also accomplished by a MapReduce job: the mapper reads only the mood vectors associated with a location and emits (location, mood) as key; the reducer simply counts the keys. The result of this MapReduce job is used to build a location-mood model. The root level of the location-mood model is keyed by location; the second level is keyed by mood and sorted by the total count of each mood at that location.

2.4 Data Storage Component

MongoDB is the primary data storage component for the whole system. MongoDB has several advantages over a traditional SQL database. First, the data fetching component consumes multiple data sources with different schemas. Using MongoDB avoids the overhead of schema definition and a potential schema migration should we decide to add more attributes or data sources. This allows us to implement the data fetching component efficiently.
Second, the schema of the system is relatively simple, considering that the frontend web application only requires the two indexes plus song title and artist (finding the video of a song can be done on demand using title and artist once the user selects a song). There is no need for data normalization. Third, the flexibility of document-oriented storage allows us to augment the data structures for added functionality, such as mood analysis, without modifying a database schema to accommodate the new model. Data from the data fetching component is stored directly into MongoDB. The data can then be exported into a CSV-formatted file to be consumed by the ETL. Similarly, the mood vectors for each tweet and comment generated by the mood analysis component are stored directly into MongoDB. The results of the MapReduce jobs from the ETL processes, however, are first stored in the file system; a process is then triggered to transform the aggregated result for each key into the corresponding indexes in MongoDB.

2.5 Data Presentation

A web frontend application presents users with two major functions: 1) search songs by mood category; 2) explore moods by region in an interactive map. The first type of request is answered by the ETL process that builds the mood vectors for each song. The probability of a song falling into a specific mood category is calculated by dividing the number of vector references for that mood by the total number of mood vectors for the song. This probability is then multiplied by the hotness score fetched from the Echonest data source. Multiplying by hotness balances the scenario where less popular songs have fewer mood vectors, which increases their chance of falling into a specific mood category; it also reflects that more popular songs should have a higher chance of being shown. The mood frequency-song hotness (mf-sh) score is used as the sorting criterion to display the list of songs for a specific category. The second type of request is answered by the ETL process that builds the location-mood model. Each region in the map displays its top referenced mood, which is simply the mood with the most references in that region across all songs.

2.6 System Diagram

3. Data Retrieval Strategy

There are three major data sources consumed by the system. Each subsection discusses the challenges we faced with each data source and the corresponding data retrieval strategies. For each data source, we stored only the necessary attributes in the database and dropped everything else. Thus, we were able to keep the final database size down to 1.37 GB even though we crawled a large amount of data.

3.1 Echonest

Table 1 in Appendix 3 provides a summary of the collected Echonest data. A script was written to fetch trending songs from Echonest sorted by the hotness attribute. Hotness is a fraction between 0 and 1. Due to API constraints, the search parameter for hotness only accepts up to 2 decimal places and the API returns at most 1000 records per search. To fetch as many songs as possible, the following strategy is used: 1) specify both a lower and an upper limit for hotness; 2) search up to 1000 results for each hotness range of width 0.01. Using this strategy, we were able to fetch the trending songs from Echonest, and the fetching process finished within half an hour. However, the trending songs returned by this strategy contain duplicates: the duplicated songs have different song ids but the same artist name and song title. Another script was written to eliminate the duplicates, leaving us with the unique songs.

3.2 Twitter

Table 2 in Appendix 3 provides a summary of the collected Twitter data. A script was written to fetch up to 500 tweets for each song using the Twitter API, with song title and artist name as the search filter criteria. Using this strategy, we were able to fetch tweets for the unique songs, and the fetching process finished within 1 day under the search API rate limit of tweets per 15 minutes. However, shortly after we used a sample of the tweet corpus as the training dataset for the classifier, we noticed a severe quality issue in the retrieved tweets. Most of the tweets followed formats such as "Listening to" and "Now playing", and many of them were promotional tweets for marketing purposes. Any mood analysis based on such a set of low-quality tweets would be irrelevant.
Most of the tweets ended up with probability 0 for every mood under this classifier. A second iteration of tweet acquisition was run to fetch more relevant tweets using different criteria: we transformed the song title into a hashtag and used it as the filter criterion, and we filtered out non-English tweets as well as tweets containing the words "watch", "now playing" and "video". Since we recognized this issue at a very late stage of the project, we fetched only up to 100 tweets per song in order to speed up the process. With this approach the tweet corpus contains much higher quality text, and the distribution of the probability mood vectors improved significantly. We were able to fetch tweets for the unique songs.

3.3 Tweet Location

In order to obtain state/country data for geo-enabled tweets, a program was written to perform reverse geocoding, transforming location coordinates into state and country for higher-level aggregation. For countries that support an administrative region such as a state, the state is used as the aggregation point; otherwise, the country is used. Using the Nominatim service, we were able to fetch locations for the unique tweets. We then used the Google Maps service to transform each state/country back into coordinates, which are used to display mood icons on the Google Map in the frontend application. This process finished in a day.

3.4 YouTube

Table 3 in Appendix 3 provides a summary of the collected YouTube data. In addition to using tweets to analyze the moods of each top song retrieved from our Echonest corpus, we use YouTube comments as an equal measure to gauge the moods of a song. Similar to the first tweet-fetching strategy, song title and artist name were used as search filter criteria, here to fetch related videos. For each unique song, 5 YouTube videos were used as comment references and 20 comments were fetched per video, so a maximum of 100 comments were fetched for each unique song. Retrieving comments from a variety of videos ensures a large random sample of user comments for the sentiment analysis. Using this strategy, we were able to fetch comments for the unique songs. The rate limit of the YouTube API is 50,000 requests/day, so we could search 500 songs a day. The songs were split into three sections assigned to three teammates, who performed the data retrieval concurrently. The fetching process finished within 10 days.

4. Implementation

4.1 Data Storage

Since the data acquisition process was split between teammates, the initial datasets were stored in two major ways. One way is to store the intermediate data in a local MongoDB; we used this strategy for the Echonest and Twitter data. Since only one teammate was responsible for that retrieval process, storing in a local MongoDB removes the unnecessary complexity of migrating data to the system's MongoDB server. The other way is to store the intermediate data in AWS S3.
We used this strategy for the YouTube data because that process was split between three teammates. The YouTube data was later fetched from S3 and stored into a local MongoDB. After the data acquisition was complete, we created an EC2 instance in AWS and ran a MongoDB server on it to serve as the main data storage for the system. We reused the database backup and restore programs written for Assignment 3 to migrate all the data from local to remote MongoDB. We preferred this strategy over writing to the remote MongoDB during data acquisition for performance reasons: the remote MongoDB runs on a t2.micro EC2 instance under the AWS free tier, whose processing power is very limited. Performing database insertions for millions of records would take hours to days, whereas the backup and restore strategy takes only a few minutes. The remote MongoDB server serves as the backbone for all subsequent system processes.
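A dump-and-restore migration of this kind can be driven from Python with MongoDB's standard tools; a minimal sketch, in which the database name, remote host, and dump directory are illustrative assumptions rather than the project's actual values:

```python
import shlex

LOCAL_DB = "moodify"            # assumed database name
REMOTE_HOST = "ec2-host:27017"  # placeholder for the t2.micro instance
DUMP_DIR = "/tmp/dump"

def dump_command():
    # mongodump writes a BSON snapshot of every collection to DUMP_DIR.
    return shlex.split(f"mongodump --db {LOCAL_DB} --out {DUMP_DIR}")

def restore_command():
    # mongorestore replays the snapshot against the remote server.
    return shlex.split(f"mongorestore --host {REMOTE_HOST} --db {LOCAL_DB} {DUMP_DIR}/{LOCAL_DB}")

if __name__ == "__main__":
    # In practice these command lists would be passed to subprocess.run(...).
    print(" ".join(dump_command()))
    print(" ".join(restore_command()))
```

Because the snapshot is written locally and replayed in bulk, the slow remote instance never has to service millions of individual inserts.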
4.2 Mood Analysis

NLTK is used for the Porter stemmer, the English stop words, and tokenization that removes non-alphabetic characters. We use sklearn for the Multinomial Naive Bayes algorithm and the CountVectorizer over documents; sklearn has a performance advantage over NLTK for Naive Bayes classification. We took a random sample of the tweet and YouTube comment corpora and manually tagged it to train the 14 classifiers. A program was also written to automatically tag all the tweets and comments, generate the mood vectors, and store the vectors in MongoDB.

4.3 ETL

After the mood vectors for the tweets and YouTube comments were generated, we ran the two MapReduce jobs, implemented with MRJob, for the two ETL processes described earlier. We also wrote a program that calculates the mf-sh score for each song from the aggregated mood vector in the MRJob output and stores the score back into the system MongoDB. The location-mood aggregation result from the MRJob output was stored in a new MongoDB collection that is consumed directly by the frontend application.

4.4 Data Modeling

The following data models are used to store the results from each of the components described earlier in MongoDB.

echonest_songs: id, title, artists_name, song_hotttnesss, youtube_mf_sh, tweet_mf_sh
tweets_v2: id, song_id, text, coordinates, user, love, joy, sad, disgust, anger, surprise, fear, Geolocation
youtube_comments: id, song_id, text, love, joy, sad, disgust, anger, surprise, fear
location_moods: location, love, joy, sad, disgust, anger, surprise, fear, longitude, latitude

echonest_songs is the collection that holds information for each individual song. id, title, artists_name and song_hotttnesss are attributes crawled from the Echonest data source; they are stored into MongoDB without any modification.
youtube_mf_sh and tweet_mf_sh are objects that hold the mf-sh score for each mood. Each object holds 7 attributes keyed by mood name, whose values are the calculated mf-sh scores. For example, youtube_mf_sh may look like:
{ "love": 0.11, "joy": 0.019, "sad": 0.037, "disgust": 0.028, "anger": 0.037, "surprise": 0.084, "fear": 0.009 }

tweets_v2 is the collection that holds information for each tweet. id, text, coordinates and user are attributes crawled from the Twitter data source. song_id is the corresponding Echonest song id for the tweet; it links the tweet to the echonest_songs collection and also serves as the aggregation key for the ETL processes. love, joy, sad, disgust, anger, surprise and fear are 0/1-valued attributes storing the mood analysis result; these 7 attributes form the mood vector discussed throughout the paper. Geolocation stores the result of reverse geocoding the coordinates. For example, the attribute may look like:

{ "city": "West Hollywood", "house_number": "463", "country": "United States of America", "county": "Los Angeles County", "state": "California", "postcode": "90036", "country_code": "us" }

This attribute is used to construct the location key consumed by the location-mood ETL process, which generates the location key in the location_moods collection. youtube_comments is the collection that holds information for each YouTube comment. text is the comment crawled from the YouTube data source. id is generated by the system because we did not record that information during data acquisition; as it turns out, we never needed the id to fetch more data from YouTube, so the attribute is never reused. As with the tweets_v2 collection, song_id and the mood attributes have exactly the same meaning and function. location_moods is the collection that holds the aggregated mood vector for each location. location is a text representation of the state/country discussed earlier for the second ETL process, and is also used as the key for each document. An example value looks like british columbia,canada.
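Deriving such a key from the Geolocation attribute might look roughly like the following; the exact field handling (lowercasing, falling back to country alone when no state is present) is our illustration, not the project's code:

```python
def build_location_key(geolocation):
    """Build a location_moods key like "british columbia,canada" from a
    reverse-geocoding result; use the country alone when no state exists."""
    country = geolocation["country"].lower()
    state = geolocation.get("state")
    return f"{state.lower()},{country}" if state else country

geo = {"state": "British Columbia", "country": "Canada"}
print(build_location_key(geo))  # british columbia,canada
```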
longitude and latitude are the coordinates of the location attribute; these two attributes are used to place mood tags on the Google Map in the web application. The remaining mood attributes are aggregation results from the ETL process, so they may have values greater than 1. Summary statistics for each collection are reported in Appendix 3.

4.5 Web Application

The web application serves as a pure presentation layer over the mood recommendation system. It exposes two main RESTful endpoints, one serving mood category queries and the other serving location mood queries. We used the Ruby on Rails framework to develop the application: two of the teammates already had experience with the technology, and Ruby on Rails has the outstanding advantage of making web development efficient, so we chose it for productivity reasons. The web application can be accessed through
4.6 Scalability

The architecture of the system was planned to scale to millions of songs. Thanks to the simplicity of the data modeling, MongoDB serves the system's functionality extremely well while keeping the system design simple. Since MongoDB supports database sharding out of the box, the database should handle the two main types of queries from the web application without performance degradation. However, each component of the system currently requires a manual, human trigger. Ideally, a scheduler should be implemented to pipeline the whole process: based on a refresh period, it would automatically fetch data from Echonest and then from YouTube and Twitter. We could also request higher API rate limits from the data source providers when the existing limits severely impact the performance of data acquisition, though this should not be considered a scalability issue. Since the mood classifiers can be reused, constructing them imposes no scalability drawback. But as more tweets and YouTube comments are logged into the system, the classifying process can become the bottleneck. Instead of tagging the texts sequentially, we can split the whole corpora by database page, id space or shard id, and the system can then run the classifiers concurrently. The performance of the two ETL processes can also be improved significantly as the corpus for each data source grows or as more data sources are added: since the MapReduce jobs are implemented with MRJob, they can easily be configured to run on Amazon EMR clusters and take advantage of the extra computing power.

4.7 Improvement

One major challenge during data acquisition was aggregating relevant human behavioral data. Fetching relevant YouTube comments is fairly straightforward.
Simply by searching YouTube with the song title and artist name, the returned list of videos, especially those ranked at the top, is likely to include the official video of the song, and the comments on YouTube videos closely represent the emotions of the commenters. However, as discussed in Section 3.2, a similar strategy for Twitter yielded a large list of low-quality tweets that scored 0 on all the mood classifiers. Although the second acquisition iteration significantly improved the quality, the distribution of moods in the tweet corpus is still heavily skewed toward the love and joy moods. The following two diagrams demonstrate the issue:
Figure 1: YouTube comments mood distribution
Figure 2: Tweet mood distribution

Because we can only fetch location information from tweets, most of the moods on the mood map are either love or joy. Several improvements can be made to the system: 1) add more data sources, e.g. 8tracks and SoundCloud, where user comments are directly linked to songs much as on YouTube; 2) diversify the sampling of the training dataset for the classifiers to include multiple languages; 3) include more features in the classifiers, such as punctuation and training text from other corpora; 4) weight and combine the mf-sh scores from the different data sources into a single mf-sh score to be used as the sole sorting criterion, which would improve the user experience of the web application.

5. Conclusion

Moodify exposes a new way of exploring music using real-time user behavioral information. As opposed to traditional mood classification based on static song attributes, the real-time behavioral information adds a dynamic layer to the mood classification algorithm that creates more accurate predictions based on the current trend. In addition, the mood map allows users to explore the current mood of the world through the lens of music appetite. Once the improvements discussed in Section 4.7 are achieved, we anticipate that users would use Moodify to dynamically construct a music playlist based on their current mood or a location of interest, an experience similar to clicking one of the many playlists in existing streaming music services.

Appendix 1: Project Repository

Appendix 2: AWS S3

Appendix 3: Data Summary

Table 1: Echonest data summary

Steps | # of songs | Size | MongoDB collection
Echonest data before cleaning | 31, | MB | echonest_songs
Echonest data after cleaning duplicates | 14, | MB | echonest_songs
Table 2: Twitter data summary

Steps | # of tweets | # of unique referenced songs | Size | MongoDB collection
First iteration | 1,785, | | MB | tweets
Second iteration | 575, | | MB | tweets_v2

Table 3: YouTube data summary

# of comments | # of unique referenced songs | Size | MongoDB collection
739, | | MB | youtube_comments

Table 4: Location mood data summary

# of unique locations (state,country) | # of tweets with geolocation | Size | MongoDB collection
| | MB | location_moods

Appendix 4: Tools and Libraries

pyechonest: fetch song metadata from the Echonest API service
tweepy: search tweets related to songs fetched from Echonest
pymongo: manage MongoDB
apiclient, oauth2client: fetch YouTube comments from the YouTube API service
nltk: Porter stemmer, English stop words, and tokenization removing non-alphabetic characters
sklearn: Multinomial Naive Bayes algorithm and CountVectorizer over documents; its metrics are also used to plot the performance of the algorithm
geopy: fetch geolocation information using the Nominatim and Google Maps services
boto: store and retrieve data in AWS S3
matplotlib: plot data
pandas: read and parse CSV files
mrjob: MapReduce job implementation
Ruby on Rails: web framework for the frontend application
More informationUsing the Force of Python and SAS Viya on Star Wars Fan Posts
SESUG Paper BB-170-2017 Using the Force of Python and SAS Viya on Star Wars Fan Posts Grace Heyne, Zencos Consulting, LLC ABSTRACT The wealth of information available on the Internet includes useful and
More informationETL Testing Concepts:
Here are top 4 ETL Testing Tools: Most of the software companies today depend on data flow such as large amount of information made available for access and one can get everything which is needed. This
More informationPrototyping Data Intensive Apps: TrendingTopics.org
Prototyping Data Intensive Apps: TrendingTopics.org Pete Skomoroch Research Scientist at LinkedIn Consultant at Data Wrangling @peteskomoroch 09/29/09 1 Talk Outline TrendingTopics Overview Wikipedia Page
More informationIBM Best Practices Working With Multiple CCM Applications Draft
Best Practices Working With Multiple CCM Applications. This document collects best practices to work with Multiple CCM applications in large size enterprise deployment topologies. Please see Best Practices
More informationThe main website for Henrico County, henrico.us, received a complete visual and structural
Page 1 1. Program Overview The main website for Henrico County, henrico.us, received a complete visual and structural overhaul, which was completed in May of 2016. The goal of the project was to update
More informationTransformer Looping Functions for Pivoting the data :
Transformer Looping Functions for Pivoting the data : Convert a single row into multiple rows using Transformer Looping Function? (Pivoting of data using parallel transformer in Datastage 8.5,8.7 and 9.1)
More informationTITLE: PRE-REQUISITE THEORY. 1. Introduction to Hadoop. 2. Cluster. Implement sort algorithm and run it using HADOOP
TITLE: Implement sort algorithm and run it using HADOOP PRE-REQUISITE Preliminary knowledge of clusters and overview of Hadoop and its basic functionality. THEORY 1. Introduction to Hadoop The Apache Hadoop
More informationCSE 544 Principles of Database Management Systems. Alvin Cheung Fall 2015 Lecture 10 Parallel Programming Models: Map Reduce and Spark
CSE 544 Principles of Database Management Systems Alvin Cheung Fall 2015 Lecture 10 Parallel Programming Models: Map Reduce and Spark Announcements HW2 due this Thursday AWS accounts Any success? Feel
More information1
1 2 3 6 7 8 9 10 Storage & IO Benchmarking Primer Running sysbench and preparing data Use the prepare option to generate the data. Experiments Run sysbench with different storage systems and instance
More informationRedPoint Data Management for Hadoop Trial
RedPoint Data Management for Hadoop Trial RedPoint Global 36 Washington Street Wellesley Hills, MA 02481 +1 781 725 0258 www.redpoint.net Copyright 2014 RedPoint Global Contents About the Hadoop sample
More informationThe Road to a Complete Tweet Index
The Road to a Complete Tweet Index Yi Zhuang Staff Software Engineer @ Twitter Outline 1. Current Scale of Twitter Search 2. The History of Twitter Search Infra 3. Complete Tweet Index 4. Search Engine
More informationApplied Machine Learning
Applied Machine Learning Lab 3 Working with Text Data Overview In this lab, you will use R or Python to work with text data. Specifically, you will use code to clean text, remove stop words, and apply
More informationMachine Learning in Action
Machine Learning in Action PETER HARRINGTON Ill MANNING Shelter Island brief contents PART l (~tj\ssification...,... 1 1 Machine learning basics 3 2 Classifying with k-nearest Neighbors 18 3 Splitting
More informationUsing the VMware vcenter Orchestrator Client. vrealize Orchestrator 5.5.1
Using the VMware vcenter Orchestrator Client vrealize Orchestrator 5.5.1 You can find the most up-to-date technical documentation on the VMware website at: https://docs.vmware.com/ If you have comments
More informationCS : Final Project Report
CS 294-16: Final Project Report Team: Purple Paraguayans Michael Ball Nishok Chetty Rohan Roy Choudhury Alper Vural Problem Statement and Background Music has always been a form of both personal expression
More informationMapReduce Design Patterns
MapReduce Design Patterns MapReduce Restrictions Any algorithm that needs to be implemented using MapReduce must be expressed in terms of a small number of rigidly defined components that must fit together
More informationTopics. Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples
Hadoop Introduction 1 Topics Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples 2 Big Data Analytics What is Big Data?
More informationETL Transformations Performance Optimization
ETL Transformations Performance Optimization Sunil Kumar, PMP 1, Dr. M.P. Thapliyal 2 and Dr. Harish Chaudhary 3 1 Research Scholar at Department Of Computer Science and Engineering, Bhagwant University,
More informationCreating a Classifier for a Focused Web Crawler
Creating a Classifier for a Focused Web Crawler Nathan Moeller December 16, 2015 1 Abstract With the increasing size of the web, it can be hard to find high quality content with traditional search engines.
More informationCS229 Final Project: Predicting Expected Response Times
CS229 Final Project: Predicting Expected Email Response Times Laura Cruz-Albrecht (lcruzalb), Kevin Khieu (kkhieu) December 15, 2017 1 Introduction Each day, countless emails are sent out, yet the time
More informationUsing the VMware vrealize Orchestrator Client
Using the VMware vrealize Orchestrator Client vrealize Orchestrator 7.0 This document supports the version of each product listed and supports all subsequent versions until the document is replaced by
More informationSearch Engines and Time Series Databases
Università degli Studi di Roma Tor Vergata Dipartimento di Ingegneria Civile e Ingegneria Informatica Search Engines and Time Series Databases Corso di Sistemi e Architetture per Big Data A.A. 2017/18
More informationPagely.com implements log analytics with AWS Glue and Amazon Athena using Beyondsoft s ConvergDB
Pagely.com implements log analytics with AWS Glue and Amazon Athena using Beyondsoft s ConvergDB Pagely is the market leader in managed WordPress hosting, and an AWS Advanced Technology, SaaS, and Public
More informationP2P Applications. Reti di Elaboratori Corso di Laurea in Informatica Università degli Studi di Roma La Sapienza Canale A-L Prof.ssa Chiara Petrioli
P2P Applications Reti di Elaboratori Corso di Laurea in Informatica Università degli Studi di Roma La Sapienza Canale A-L Prof.ssa Chiara Petrioli Server-based Network Peer-to-peer networks A type of network
More informationMovieRec - CS 410 Project Report
MovieRec - CS 410 Project Report Team : Pattanee Chutipongpattanakul - chutipo2 Swapnil Shah - sshah219 Abstract MovieRec is a unique movie search engine that allows users to search for any type of the
More informationWelcome to the New Era of Cloud Computing
Welcome to the New Era of Cloud Computing Aaron Kimball The web is replacing the desktop 1 SDKs & toolkits are there What about the backend? Image: Wikipedia user Calyponte 2 Two key concepts Processing
More informationSOURCERER: MINING AND SEARCHING INTERNET- SCALE SOFTWARE REPOSITORIES
SOURCERER: MINING AND SEARCHING INTERNET- SCALE SOFTWARE REPOSITORIES Introduction to Information Retrieval CS 150 Donald J. Patterson This content based on the paper located here: http://dx.doi.org/10.1007/s10618-008-0118-x
More informationIncluvie: Actor Data Collection Ada Gok, Dana Hochman, Lucy Zhan
Incluvie: Actor Data Collection Ada Gok, Dana Hochman, Lucy Zhan {goka,danarh,lucyzh}@bu.edu Figure 0. Our partner company: Incluvie. 1. Project Task Incluvie is a platform that promotes and celebrates
More informationOracle Endeca Information Discovery
Oracle Endeca Information Discovery Glossary Version 2.4.0 November 2012 Copyright and disclaimer Copyright 2003, 2013, Oracle and/or its affiliates. All rights reserved. Oracle and Java are registered
More informationBig Data Analytics CSCI 4030
High dim. data Graph data Infinite data Machine learning Apps Locality sensitive hashing PageRank, SimRank Filtering data streams SVM Recommen der systems Clustering Community Detection Queries on streams
More informationMICROSOFT BUSINESS INTELLIGENCE
SSIS MICROSOFT BUSINESS INTELLIGENCE 1) Introduction to Integration Services Defining sql server integration services Exploring the need for migrating diverse Data the role of business intelligence (bi)
More informationPROJECT REPORT. TweetMine Twitter Sentiment Analysis Tool KRZYSZTOF OBLAK C
PROJECT REPORT TweetMine Twitter Sentiment Analysis Tool KRZYSZTOF OBLAK C00161361 Table of Contents 1. Introduction... 1 1.1. Purpose and Content... 1 1.2. Project Brief... 1 2. Description of Submitted
More informationA U T O M A T E D C O N T E NT P R O T E C T I O N, A N A L Y T I C S A N D M O N E T I Z A T I O N A C R O S S S O C I A L P L A T F O R M S
Presenting: Eyal Arad VIDEOCITES 1 ID LTD. 2018 A U T O M A T E D C O N T E NT P R O T E C T I O N, A N A L Y T I C S A N D M O N E T I Z A T I O N A C R O S S S O C I A L P L A T F O R M S VIDEOCITES
More informationBusiness Club. Decision Trees
Business Club Decision Trees Business Club Analytics Team December 2017 Index 1. Motivation- A Case Study 2. The Trees a. What is a decision tree b. Representation 3. Regression v/s Classification 4. Building
More informationS E N T I M E N T A N A L Y S I S O F S O C I A L M E D I A W I T H D A T A V I S U A L I S A T I O N
S E N T I M E N T A N A L Y S I S O F S O C I A L M E D I A W I T H D A T A V I S U A L I S A T I O N BY J OHN KELLY SOFTWARE DEVELOPMEN T FIN AL REPOR T 5 TH APRIL 2017 TABLE OF CONTENTS Abstract 2 1.
More informationCA ERwin Data Modeler
CA ERwin Data Modeler Implementation Guide Service Pack 9.5.2 This Documentation, which includes embedded help systems and electronically distributed materials, (hereinafter referred to only and is subject
More informationSocial Network Analytics on Cray Urika-XA
Social Network Analytics on Cray Urika-XA Mike Hinchey, mhinchey@cray.com Technical Solutions Architect Cray Inc, Analytics Products Group April, 2015 Agenda 1. Introduce platform Urika-XA 2. Technology
More informationClustering to Reduce Spatial Data Set Size
Clustering to Reduce Spatial Data Set Size Geoff Boeing arxiv:1803.08101v1 [cs.lg] 21 Mar 2018 1 Introduction Department of City and Regional Planning University of California, Berkeley March 2018 Traditionally
More informationR-Store: A Scalable Distributed System for Supporting Real-time Analytics
R-Store: A Scalable Distributed System for Supporting Real-time Analytics Feng Li, M. Tamer Ozsu, Gang Chen, Beng Chin Ooi National University of Singapore ICDE 2014 Background Situation for large scale
More informationAn Introduction to Big Data Formats
Introduction to Big Data Formats 1 An Introduction to Big Data Formats Understanding Avro, Parquet, and ORC WHITE PAPER Introduction to Big Data Formats 2 TABLE OF TABLE OF CONTENTS CONTENTS INTRODUCTION
More informationJava Archives Search Engine Using Byte Code as Information Source
Java Archives Search Engine Using Byte Code as Information Source Oscar Karnalim School of Electrical Engineering and Informatics Bandung Institute of Technology Bandung, Indonesia 23512012@std.stei.itb.ac.id
More informationQlik Sense Enterprise architecture and scalability
White Paper Qlik Sense Enterprise architecture and scalability June, 2017 qlik.com Platform Qlik Sense is an analytics platform powered by an associative, in-memory analytics engine. Based on users selections,
More informationDATA MINING TRANSACTION
DATA MINING Data Mining is the process of extracting patterns from data. Data mining is seen as an increasingly important tool by modern business to transform data into an informational advantage. It is
More informationVolunteerMatters Wordpress Web Platform Calendar Admin Guide. Version 1.1
VolunteerMatters Wordpress Web Platform Calendar Admin Guide Version 1.1 VolunteerMatters Wordpress Web: Admin Guide This VolunteerMatters Wordpress Web Platform administrative guide is broken up into
More informationYour First Hadoop App, Step by Step
Learn Hadoop in one evening Your First Hadoop App, Step by Step Martynas 1 Miliauskas @mmiliauskas Your First Hadoop App, Step by Step By Martynas Miliauskas Published in 2013 by Martynas Miliauskas On
More informationData Analytics Framework and Methodology for WhatsApp Chats
Data Analytics Framework and Methodology for WhatsApp Chats Transliteration of Thanglish and Short WhatsApp Messages P. Sudhandradevi Department of Computer Applications Bharathiar University Coimbatore,
More informationNosDB vs DocumentDB. Comparison. For.NET and Java Applications. This document compares NosDB and DocumentDB. Read this comparison to:
NosDB vs DocumentDB Comparison For.NET and Java Applications NosDB 1.3 vs. DocumentDB v8.6 This document compares NosDB and DocumentDB. Read this comparison to: Understand NosDB and DocumentDB major feature
More informationStager. A Web Based Application for Presenting Network Statistics. Arne Øslebø
Stager A Web Based Application for Presenting Network Statistics Arne Øslebø Keywords: Network monitoring, web application, NetFlow, network statistics Abstract Stager is a web based
More informationConclusions. Chapter Summary of our contributions
Chapter 1 Conclusions During this thesis, We studied Web crawling at many different levels. Our main objectives were to develop a model for Web crawling, to study crawling strategies and to build a Web
More informationIdentifying Important Communications
Identifying Important Communications Aaron Jaffey ajaffey@stanford.edu Akifumi Kobashi akobashi@stanford.edu Abstract As we move towards a society increasingly dependent on electronic communication, our
More informationFreegal emusic PC user guide
Freegal emusic PC user guide What is Freegal? Freegal is a free music streaming and downloading service. Freegal offers access to about 7 million songs including the Sony Music catalogue. In total the
More informationParts of Speech, Named Entity Recognizer
Parts of Speech, Named Entity Recognizer Artificial Intelligence @ Allegheny College Janyl Jumadinova November 8, 2018 Janyl Jumadinova Parts of Speech, Named Entity Recognizer November 8, 2018 1 / 25
More informationData for linguistics ALEXIS DIMITRIADIS. Contents First Last Prev Next Back Close Quit
Data for linguistics ALEXIS DIMITRIADIS Text, corpora, and data in the wild 1. Where does language data come from? The usual: Introspection, questionnaires, etc. Corpora, suited to the domain of study:
More informationAI Dining Suggestion App. CS 297 Report Bao Pham ( ) Advisor: Dr. Chris Pollett
AI Dining Suggestion App CS 297 Report Bao Pham (009621001) Advisor: Dr. Chris Pollett Abstract Trying to decide what to eat can be challenging and time-consuming. Google or Yelp are two popular search
More information/ Cloud Computing. Recitation 9 March 15th, 2016
15-319 / 15-619 Cloud Computing Recitation 9 March 15th, 2016 Overview Administrative issues Office Hours, Piazza guidelines Last week s reflection Project 3.2, OLI Unit 4, Module 14, Quiz 7 This week
More informationHow can you implement this through a script that a scheduling daemon runs daily on the application servers?
You ve been tasked with implementing an automated data backup solution for your application servers that run on Amazon EC2 with Amazon EBS volumes. You want to use a distributed data store for your backups
More informationSitecore Experience Platform 8.0 Rev: September 13, Sitecore Experience Platform 8.0
Sitecore Experience Platform 8.0 Rev: September 13, 2018 Sitecore Experience Platform 8.0 All the official Sitecore documentation. Page 1 of 455 Experience Analytics glossary This topic contains a glossary
More informationCS November 2018
Bigtable Highly available distributed storage Distributed Systems 19. Bigtable Built with semi-structured data in mind URLs: content, metadata, links, anchors, page rank User data: preferences, account
More informationQlik Sense Performance Benchmark
Technical Brief Qlik Sense Performance Benchmark This technical brief outlines performance benchmarks for Qlik Sense and is based on a testing methodology called the Qlik Capacity Benchmark. This series
More informationKaggle See Click Fix Model Description
Kaggle See Click Fix Model Description BY: Miroslaw Horbal & Bryan Gregory LOCATION: Waterloo, Ont, Canada & Dallas, TX CONTACT : miroslaw@gmail.com & bryan.gregory1@gmail.com CONTEST: See Click Predict
More informationOptimizing Testing Performance With Data Validation Option
Optimizing Testing Performance With Data Validation Option 1993-2016 Informatica LLC. No part of this document may be reproduced or transmitted in any form, by any means (electronic, photocopying, recording
More informationA Study of the Correlation between the Spatial Attributes on Twitter
A Study of the Correlation between the Spatial Attributes on Twitter Bumsuk Lee, Byung-Yeon Hwang Dept. of Computer Science and Engineering, The Catholic University of Korea 3 Jibong-ro, Wonmi-gu, Bucheon-si,
More informationCS November 2017
Bigtable Highly available distributed storage Distributed Systems 18. Bigtable Built with semi-structured data in mind URLs: content, metadata, links, anchors, page rank User data: preferences, account
More informationMachine Learning using MapReduce
Machine Learning using MapReduce What is Machine Learning Machine learning is a subfield of artificial intelligence concerned with techniques that allow computers to improve their outputs based on previous
More informationUSING THE MUSICBRAINZ DATABASE IN THE CLASSROOM. Cédric Mesnage Southampton Solent University United Kingdom
USING THE MUSICBRAINZ DATABASE IN THE CLASSROOM Cédric Mesnage Southampton Solent University United Kingdom Abstract Musicbrainz is a crowd-sourced database of music metadata. The level 6 class of Data
More informationMedia wrangling in the car with GENIVI reqs
Media wrangling in the car with GENIVI reqs Collecting all your music in one place Jonatan Pålsson February 2, 2014 Jonatan Pålsson Media wrangling in the car with GENIVI reqs February 2, 2014 1 / 22 Outline
More informationPython Certification Training
Introduction To Python Python Certification Training Goal : Give brief idea of what Python is and touch on basics. Define Python Know why Python is popular Setup Python environment Discuss flow control
More informationTowards a hybrid approach to Netflix Challenge
Towards a hybrid approach to Netflix Challenge Abhishek Gupta, Abhijeet Mohapatra, Tejaswi Tenneti March 12, 2009 1 Introduction Today Recommendation systems [3] have become indispensible because of the
More informationSEO: SEARCH ENGINE OPTIMISATION
SEO: SEARCH ENGINE OPTIMISATION SEO IN 11 BASIC STEPS EXPLAINED What is all the commotion about this SEO, why is it important? I have had a professional content writer produce my content to make sure that
More information/ Cloud Computing. Recitation 8 October 18, 2016
15-319 / 15-619 Cloud Computing Recitation 8 October 18, 2016 1 Overview Administrative issues Office Hours, Piazza guidelines Last week s reflection Project 3.2, OLI Unit 3, Module 13, Quiz 6 This week
More informationCADIAL Search Engine at INEX
CADIAL Search Engine at INEX Jure Mijić 1, Marie-Francine Moens 2, and Bojana Dalbelo Bašić 1 1 Faculty of Electrical Engineering and Computing, University of Zagreb, Unska 3, 10000 Zagreb, Croatia {jure.mijic,bojana.dalbelo}@fer.hr
More informationA Fast and High Throughput SQL Query System for Big Data
A Fast and High Throughput SQL Query System for Big Data Feng Zhu, Jie Liu, and Lijie Xu Technology Center of Software Engineering, Institute of Software, Chinese Academy of Sciences, Beijing, China 100190
More informationOrchestrating Music Queries via the Semantic Web
Orchestrating Music Queries via the Semantic Web Milos Vukicevic, John Galletly American University in Bulgaria Blagoevgrad 2700 Bulgaria +359 73 888 466 milossmi@gmail.com, jgalletly@aubg.bg Abstract
More informationGR Reference Models. GR Reference Models. Without Session Replication
, page 1 Advantages and Disadvantages of GR Models, page 6 SPR/Balance Considerations, page 7 Data Synchronization, page 8 CPS GR Dimensions, page 9 Network Diagrams, page 12 The CPS solution stores session
More informationMCSA SQL SERVER 2012
MCSA SQL SERVER 2012 1. Course 10774A: Querying Microsoft SQL Server 2012 Course Outline Module 1: Introduction to Microsoft SQL Server 2012 Introducing Microsoft SQL Server 2012 Getting Started with SQL
More information