Moodify
W205-1 Rock Baek, Saru Mehta, Vincent Chio, Walter Erquingo Pezo
1. Introduction

Moodify is a music web application that recommends songs to users based on mood. There are two ways a user can interact with the application. First, users can select a mood supported by the system, and the application displays a list of songs classified with the highest probability for that mood. Second, users can browse trending moods through an interactive Google map that displays the current most popular mood in each state or country around the world. The rest of the paper discusses the technical details of the proposed system. Section 2 covers the system architecture, Section 3 the data retrieval strategies, Section 4 the implementation details and possible improvements, and Section 5 concludes the paper.

2. System Architecture

The system involves two major components: 1) a backend system that fetches music metadata and generates a mood categorization for each song; 2) a frontend user-facing web application that accepts user mood queries and responds with a list of songs matching the user input. The backend builds all the necessary, properly indexed data that is then consumed by the frontend web application. The frontend imposes two types of requirements: 1) search songs by mood; 2) browse moods by location. To support the first requirement, the backend needs to build an index on the probability of a song falling into a certain mood category. The probability is then multiplied by the song hotness to calculate a mood frequency-song hotness (mf-sh) score. The details of calculating the mf-sh score are elaborated in Section 2.5. To support the second requirement, the backend needs to associate a location with each mood occurrence for a song, aggregate all moods by location, and build an index on the location-mood counts to be consumed by the frontend.
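To make the scoring concrete, a minimal sketch of the computation just described (the function and variable names are ours, not the project's code; values are illustrative): the probability of a song belonging to a mood is the fraction of its mood vectors that flag that mood, and the mf-sh score multiplies this probability by the Echonest hotness.

```python
def mood_probability(mood_vectors, mood):
    """Fraction of a song's tweet/comment mood vectors that flag the given mood."""
    flags = [vector[mood] for vector in mood_vectors]
    return sum(flags) / len(flags)

def mf_sh(mood_vectors, mood, hotness):
    """Mood frequency - song hotness score: P(mood | song) * hotness."""
    return mood_probability(mood_vectors, mood) * hotness

# Illustrative data: three mood vectors for one song, hotness 0.8.
vectors = [{"joy": 1, "sad": 0}, {"joy": 1, "sad": 0}, {"joy": 0, "sad": 1}]
print(mf_sh(vectors, "joy", 0.8))  # 2/3 * 0.8, i.e. about 0.533
```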
To support the frontend requirements, the backend needs to: 1) compile a list of trending songs (trending means songs that are most listened to or mentioned, with no correlation to mood); 2) associate moods with each song; 3) associate a location with each mood; 4) build the indexes consumed by the frontend. Sections 2.1 to 2.4 discuss the technical implementation of the backend system. Section 2.5 discusses the frontend system. Section 2.6 shows the system diagram.

2.1 Data Fetching Component

To tackle requirement 1, the system refers to Echonest for the list of trending songs. Echonest hosts one of the most versatile music databases in the world, with over 30 million songs and 3 million artist records. The data set includes not only basic metadata such as song title, artist, album and genre, but also intelligent attributes such as energy, danceability, tempo and hotness. We utilized the hotness attribute as the sorting criterion to gather the list of trending songs. To tackle requirement 2, the system requires supplementary human behavioral information in order to accurately predict the moods associated with a song. Using human behavioral data has an advantage over static data such as lyrics for tagging a song with moods: people's mood toward a song may change over time, whereas its lyrics stay the same. The dynamic nature of the mood analysis provides real-time music recommendations that more accurately reflect the current trend. This data is obtained from two social media sites: 1) Twitter and 2) YouTube. Using song title and artist as filter criteria, we can fetch the most relevant tweets and YouTube comments for each song and associate moods with each text. Requirement 3 depends on the location data for each text gathered under requirement 2. Twitter supports location-based tweets; YouTube comments, however, currently carry no location data. Thus, the system only uses geo-enabled tweets to associate a location with a mood, and only moods with an associated location are aggregated and displayed in the interactive mood map. Note that the process of aggregating location moods has no effect on the process of aggregating moods for a song, and thus has no impact on frontend requirement 1 (search songs by mood). These two ETL processes are discussed later in Section 2.3.

2.2 Mood Analysis

The system supports the following mood categories: anger, disgust, fear, joy, love, sad, surprise. Our main guideline for building our corpora is the paper "EmpaTweet: Annotating and Detecting Emotions on Twitter", which describes how to tag tweets with similar categories. The main approach for tweets is to manually categorize about 1500 tweets (200 per category) and use them as corpora for a series of Multinomial Naive Bayes classifiers, one for each category.
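The project trains these classifiers with scikit-learn (Section 4.2); to make the idea concrete, here is a minimal plain-Python sketch of what one such per-mood binary classifier computes. The toy training set and class design are our illustration, not the project's code.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Keep alphabetic tokens only, mirroring the non-alphabetic cleanup step.
    return re.findall(r"[a-z]+", text.lower())

class BinaryMultinomialNB:
    """One binary classifier per mood: does this text express the mood (1) or not (0)?"""

    def fit(self, texts, labels):
        self.counts = {0: Counter(), 1: Counter()}
        priors = {0: 0, 1: 0}
        for text, y in zip(texts, labels):
            priors[y] += 1
            self.counts[y].update(tokenize(text))
        self.vocab = set(self.counts[0]) | set(self.counts[1])
        self.log_prior = {y: math.log(priors[y] / len(labels)) for y in (0, 1)}
        return self

    def predict(self, text):
        scores = {}
        for y in (0, 1):
            n = sum(self.counts[y].values())
            score = self.log_prior[y]
            for token in tokenize(text):
                # Laplace smoothing so unseen words do not zero out the score.
                score += math.log((self.counts[y][token] + 1) / (n + len(self.vocab)))
            scores[y] = score
        return max(scores, key=scores.get)

# Hypothetical miniature training set for a "joy" classifier.
texts = ["this song makes me so happy", "pure happiness and smiles",
         "this track is dull and boring", "i hate this noise"]
labels = [1, 1, 0, 0]
clf = BinaryMultinomialNB().fit(texts, labels)
print(clf.predict("so happy with this song"))  # 1
```

Running one such classifier per mood over a tweet or comment yields exactly the 0/1 mood vector described below.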
In this categorization, non-alphabetic characters, stop words and hashtags are removed. Non-mood-related hashtags, such as event and topic hashtags, are removed because they appear frequently due to a trend rather than for their sentiment value. A Porter stemmer is also applied. The same process is repeated for YouTube comments, whose language differs from tweets: a tweet is limited to 140 characters and many tweets are simply hashtags, while YouTube comments have no such restrictions. Thus, 14 classifiers are needed in total: 7 categories for YouTube and 7 for Twitter. The system utilizes the NLTK library to clean up the comments and tweets and scikit-learn for the classifiers. The classifiers are then used to tag moods for each tweet and YouTube comment, producing a vector of moods for each text. The vector contains one 0/1 entry per mood category, indicating whether the text falls into that category. These vectors are consumed by the ETL processes to aggregate all the moods associated with a song.

2.3 ETL Process
The system involves two aggregation processes required to generate the indexes for frontend consumption. The first ETL process aggregates all the mood vectors for each song. This is accomplished by a MapReduce job: the mapper reads the mood vectors for each song and emits the song id as key and the mood vector as value; the reducer sums the values for each mood of a song and divides the aggregate by the total number of references to obtain a probability. The result of the MapReduce job is a probability mood vector for each song, where each entry indicates the probability of that mood occurring across all the corpora for the song. The second ETL process aggregates all the moods for each state/country across the entire database. This is also accomplished by a MapReduce job: the mapper reads only the mood vectors associated with a location and emits (location, mood) as key; the reducer simply counts the keys. The result of this MapReduce job is used to build a location-mood model. The root level of the location-mood model is keyed by location; the second level is keyed by mood and sorted by the total count of each mood at that location.

2.4 Data Storage Component

MongoDB is the primary data storage component for the whole system. MongoDB has several advantages over a traditional SQL database. First, the data fetching component consumes multiple data sources with different schemas. Using MongoDB avoids the overhead of schema definition and a potential schema migration should we decide to add more attributes or data sources. This allows us to implement the data fetching component efficiently.
Second, the schema of the system is relatively simple, considering that the frontend web application only requires the two indexes plus song title and artist (finding the video of a song can be done on demand using title and artist once the user selects a song). There is no need for data normalization. Third, the flexibility of document-oriented storage allows us to augment the data structures for added functionality, such as mood analysis, without modifying a database schema to accommodate the new model. Data from the data fetching component is stored directly into MongoDB. The data can then be exported into a CSV-formatted file to be consumed by the ETL. Similarly, the mood vectors for each tweet and comment generated by the mood analysis component are stored directly into MongoDB. The results of the MapReduce jobs from the ETL processes, however, are first stored in the file system; a process is then triggered to transform the aggregated result for each key into the corresponding indexes in MongoDB.

2.5 Data Presentation

A web frontend application presents users with two major functions: 1) search songs by mood category; 2) explore moods by region in an interactive map. The first type of request is answered by the ETL process that builds the mood vectors for each song. The probability of a song falling into a specific mood category is calculated by dividing the number of vector references for that mood by the total number of mood vectors for the song. This probability is then multiplied by the hotness score fetched from the Echonest data source. Multiplying by hotness balances the scenario where less popular songs have fewer mood vectors, which increases their chance of falling into a specific mood category; it also reflects that more popular songs should have a higher chance of being shown. The mood frequency-song hotness (mf-sh) score is used as the sorting criterion to display the list of songs for a specific category. The second type of request is answered by the ETL process that builds the location-mood model. Each region in the map displays its top referenced mood, which is simply the mood with the most references in that region across all songs.

2.6 System Diagram

3. Data Retrieval Strategy

There are three major data sources consumed by the system. Each subsection discusses the challenges we faced with each data source and the corresponding data retrieval strategies. For each data source, we stored only the necessary attributes in the database and dropped everything else. Thus, we were able to keep the final database size down to 1.37 GB even though we crawled a large amount of data.

3.1 Echonest

Table 1 in Appendix 3 provides a summary of the collected Echonest data. A script was written to fetch trending songs from Echonest sorted by the hotness attribute. Hotness is a fraction between 0 and 1. Due to API constraints, the search parameter for hotness only accepts up to 2 decimal places and the API returns at most 1000 records per search. To fetch as many songs as possible, the following strategy is used: 1) specify both a lower and an upper limit for hotness; 2) search up to 1000 results for each hotness range of width 0.01. Using this strategy, we were able to fetch the trending songs from Echonest, and the fetching process finished within half an hour. However, the trending songs returned by this strategy contain duplicates: the duplicated songs have different song ids but the same artist name and song title. Another script was written to eliminate the duplicates, leaving us with the unique songs.

3.2 Twitter

Table 2 in Appendix 3 provides a summary of the collected Twitter data. A script was written to fetch up to 500 tweets for each song using the Twitter API, with song title and artist name as the search filter criteria. Using this strategy, we were able to fetch tweets for the unique songs, and the fetching process finished within 1 day under the search API rate limit of tweets per 15 minutes. However, shortly after we used a sample of the tweet corpus as the training dataset for the classifier, we noticed a severe quality issue in the retrieved tweets. Most of the tweets followed formats such as "Listening to" and "Now playing", and many of them were promotional tweets for marketing purposes. Any mood analysis based on such a set of low-quality tweets would be irrelevant.
Most of the tweets ended up with probability 0 for every mood under this classifier. A second iteration of tweet acquisition was run to fetch more relevant tweets using different criteria: we transformed the song title into a hashtag and used it as the filter criterion, and we filtered out non-English tweets as well as tweets containing the words "watch", "now playing" and "video". Since we recognized this issue at a very late stage of the project, we fetched only up to 100 tweets per song in order to speed up the process. With this approach the tweet corpus contains much higher quality text, and the distribution of the probability mood vectors improved significantly. We were able to fetch tweets for the unique songs.

3.3 Tweet Location

In order to obtain state/country data for geo-enabled tweets, a program was written to perform reverse geocoding, transforming location coordinates into state and country for higher-level aggregation. For countries that support an administrative region such as a state, the state is used as the aggregation point; otherwise, the country is used. Using the Nominatim service, we were able to fetch locations for the unique tweets. We then used the Google Maps service to transform each state/country back into coordinates, which are used to display mood icons on the Google Map in the frontend application. This process finished in a day.

3.4 YouTube

Table 3 in Appendix 3 provides a summary of the collected YouTube data. In addition to using tweets to analyze the moods of each top song retrieved from our Echonest corpus, we use YouTube comments as an equal measure to gauge the moods of a song. Similar to the first tweet-fetching strategy, song title and artist name were used as search filter criteria, here to fetch related videos. For each unique song, 5 YouTube videos were used as comment references and 20 comments were fetched per video, so a maximum of 100 comments were fetched for each unique song. Retrieving comments from a variety of videos ensures a large random sample of user comments for the sentiment analysis. Using this strategy, we were able to fetch comments for the unique songs. The rate limit of the YouTube API is 50,000 requests/day, so we could search 500 songs a day. The songs were split into three sections assigned to three teammates, who performed the data retrieval concurrently. The fetching process finished within 10 days.

4. Implementation

4.1 Data Storage

Since the data acquisition process was split between teammates, the initial datasets were stored in two major ways. One way is to store the intermediate data in a local MongoDB; we used this strategy for the Echonest and Twitter data. Since only one teammate was responsible for that retrieval process, storing in a local MongoDB removes the unnecessary complexity of migrating data to the system's MongoDB server. The other way is to store the intermediate data in AWS S3.
We used this strategy for the YouTube data because that process was split between three teammates. The YouTube data was later fetched from S3 and stored into a local MongoDB. After the data acquisition was complete, we created an EC2 instance in AWS and ran a MongoDB server on it to serve as the main data storage for the system. We reused the database backup and restore programs written for Assignment 3 to migrate all the data from local to remote MongoDB. We preferred this strategy over writing to the remote MongoDB during data acquisition for performance reasons: the remote MongoDB runs on a t2.micro EC2 instance under the AWS free tier, whose processing power is very limited. Performing database insertions for millions of records would take hours to days, whereas the backup and restore strategy takes only a few minutes. The remote MongoDB server serves as the backbone for all subsequent system processes.
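A dump-and-restore migration of this kind can be driven from Python with MongoDB's standard tools; a minimal sketch, in which the database name, remote host, and dump directory are illustrative assumptions rather than the project's actual values:

```python
import shlex

LOCAL_DB = "moodify"            # assumed database name
REMOTE_HOST = "ec2-host:27017"  # placeholder for the t2.micro instance
DUMP_DIR = "/tmp/dump"

def dump_command():
    # mongodump writes a BSON snapshot of every collection to DUMP_DIR.
    return shlex.split(f"mongodump --db {LOCAL_DB} --out {DUMP_DIR}")

def restore_command():
    # mongorestore replays the snapshot against the remote server.
    return shlex.split(f"mongorestore --host {REMOTE_HOST} --db {LOCAL_DB} {DUMP_DIR}/{LOCAL_DB}")

if __name__ == "__main__":
    # In practice these command lists would be passed to subprocess.run(...).
    print(" ".join(dump_command()))
    print(" ".join(restore_command()))
```

Because the snapshot is written locally and replayed in bulk, the slow remote instance never has to service millions of individual inserts.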
4.2 Mood Analysis

NLTK is used for the Porter stemmer, the English stop words, and tokenization that removes non-alphabetic characters. We use sklearn for the Multinomial Naive Bayes algorithm and the CountVectorizer over documents; sklearn has a performance advantage over NLTK for Naive Bayes classification. We took a random sample of the tweet and YouTube comment corpora and manually tagged it to train the 14 classifiers. A program was also written to automatically tag all the tweets and comments, generate the mood vectors, and store the vectors in MongoDB.

4.3 ETL

After the mood vectors for the tweets and YouTube comments were generated, we ran the two MapReduce jobs, implemented with MRJob, for the two ETL processes described earlier. We also wrote a program that calculates the mf-sh score for each song from the aggregated mood vector in the MRJob output and stores the score back into the system MongoDB. The location-mood aggregation result from the MRJob output was stored in a new MongoDB collection that is consumed directly by the frontend application.

4.4 Data Modeling

The following data models are used to store the results from each of the components described earlier in MongoDB.

echonest_songs: id, title, artists_name, song_hotttnesss, youtube_mf_sh, tweet_mf_sh
tweets_v2: id, song_id, text, coordinates, user, love, joy, sad, disgust, anger, surprise, fear, Geolocation
youtube_comments: id, song_id, text, love, joy, sad, disgust, anger, surprise, fear
location_moods: location, love, joy, sad, disgust, anger, surprise, fear, longitude, latitude

echonest_songs is the collection that holds information for each individual song. id, title, artists_name and song_hotttnesss are attributes crawled from the Echonest data source; they are stored into MongoDB without any modification.
youtube_mf_sh and tweet_mf_sh are objects that hold the mf-sh score for each mood. Each object holds 7 attributes keyed by mood name, whose values are the calculated mf-sh scores. For example, youtube_mf_sh may look like:
{ "love": 0.11, "joy": 0.019, "sad": 0.037, "disgust": 0.028, "anger": 0.037, "surprise": 0.084, "fear": 0.009 }

tweets_v2 is the collection that holds information for each tweet. id, text, coordinates and user are attributes crawled from the Twitter data source. song_id is the corresponding Echonest song id for the tweet; it links the tweet to the echonest_songs collection and also serves as the aggregation key for the ETL processes. love, joy, sad, disgust, anger, surprise and fear are 0/1-valued attributes storing the mood analysis result; these 7 attributes form the mood vector discussed throughout the paper. Geolocation stores the result of reverse geocoding the coordinates. For example, the attribute may look like:

{ "city": "West Hollywood", "house_number": "463", "country": "United States of America", "county": "Los Angeles County", "state": "California", "postcode": "90036", "country_code": "us" }

This attribute is used to construct the location key consumed by the location-mood ETL process, which generates the location key in the location_moods collection. youtube_comments is the collection that holds information for each YouTube comment. text is the comment crawled from the YouTube data source. id is generated by the system because we did not record that information during data acquisition; as it turns out, we never needed the id to fetch more data from YouTube, so the attribute is never reused. As with the tweets_v2 collection, song_id and the mood attributes have exactly the same meaning and function. location_moods is the collection that holds the aggregated mood vector for each location. location is a text representation of the state/country discussed earlier for the second ETL process, and is also used as the key for each document. An example value looks like british columbia,canada.
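Deriving such a key from the Geolocation attribute might look roughly like the following; the exact field handling (lowercasing, falling back to country alone when no state is present) is our illustration, not the project's code:

```python
def build_location_key(geolocation):
    """Build a location_moods key like "british columbia,canada" from a
    reverse-geocoding result; use the country alone when no state exists."""
    country = geolocation["country"].lower()
    state = geolocation.get("state")
    return f"{state.lower()},{country}" if state else country

geo = {"state": "British Columbia", "country": "Canada"}
print(build_location_key(geo))  # british columbia,canada
```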
longitude and latitude are the coordinates of the location attribute; these two attributes are used to place mood tags on the Google Map in the web application. The remaining mood attributes are aggregation results from the ETL process, so they may have values greater than 1. Summary statistics for each collection are reported in Appendix 3.

4.5 Web Application

The web application serves as a pure presentation layer over the mood recommendation system. It exposes two main RESTful endpoints, one serving mood category queries and the other serving location mood queries. We used the Ruby on Rails framework to develop the application: two of the teammates already had experience with the technology, and Ruby on Rails has the outstanding advantage of making web development efficient, so we chose it for productivity reasons. The web application can be accessed through
4.6 Scalability

The architecture of the system was planned to scale to millions of songs. Thanks to the simplicity of the data modeling, MongoDB serves the system's functionality extremely well while keeping the system design simple. Since MongoDB supports database sharding out of the box, the database should handle the two main types of queries from the web application without performance degradation. However, each component of the system currently requires a manual, human trigger. Ideally, a scheduler should be implemented to pipeline the whole process: based on a refresh period, it would automatically fetch data from Echonest and then from YouTube and Twitter. We could also request higher API rate limits from the data source providers when the existing limits severely impact the performance of data acquisition, though this should not be considered a scalability issue. Since the mood classifiers can be reused, constructing them imposes no scalability drawback. But as more tweets and YouTube comments are logged into the system, the classifying process can become the bottleneck. Instead of tagging the texts sequentially, we can split the whole corpora by database page, id space or shard id, and the system can then run the classifiers concurrently. The performance of the two ETL processes can also be improved significantly as the corpus for each data source grows or as more data sources are added: since the MapReduce jobs are implemented with MRJob, they can easily be configured to run on Amazon EMR clusters and take advantage of the extra computing power.

4.7 Improvement

One major challenge during data acquisition was aggregating relevant human behavioral data. Fetching relevant YouTube comments is fairly straightforward.
Simply by searching YouTube with the song title and artist name, the returned list of videos, especially those ranked at the top, is likely to include the official video of the song, and the comments on YouTube videos closely represent the emotions of the commenters. However, as discussed in Section 3.2, a similar strategy for Twitter yielded a large list of low-quality tweets that scored 0 on all the mood classifiers. Although the second acquisition iteration significantly improved the quality, the distribution of moods in the tweet corpus is still heavily skewed toward the love and joy moods. The following two diagrams demonstrate the issue:
Figure 1: YouTube comments mood distribution
Figure 2: Tweet mood distribution

Because we can only fetch location information from tweets, most of the moods on the mood map are either love or joy. Several improvements can be made to the system: 1) add more data sources, e.g. 8tracks and SoundCloud, where user comments are directly linked to songs much as on YouTube; 2) diversify the sampling of the training dataset for the classifiers to include multiple languages; 3) include more features in the classifiers, such as punctuation and training text from other corpora; 4) weight and combine the mf-sh scores from the different data sources into a single mf-sh score to be used as the sole sorting criterion, which would improve the user experience of the web application.

5. Conclusion

Moodify exposes a new way of exploring music using real-time user behavioral information. As opposed to traditional mood classification based on static song attributes, the real-time behavioral information adds a dynamic layer to the mood classification algorithm that creates more accurate predictions based on the current trend. In addition, the mood map allows users to explore the current mood of the world through the lens of music appetite. Once the improvements discussed in Section 4.7 are achieved, we anticipate that users would use Moodify to dynamically construct a music playlist based on their current mood or a location of interest, an experience similar to clicking one of the many playlists in existing streaming music services.

Appendix 1: Project Repository

Appendix 2: AWS S3

Appendix 3: Data Summary

Table 1: Echonest data summary

Steps | # of songs | Size | MongoDB collection
Echonest data before cleaning | 31, | MB | echonest_songs
Echonest data after cleaning duplicates | 14, | MB | echonest_songs
Table 2: Twitter data summary

Steps | # of tweets | # of unique referenced songs | Size | MongoDB collection
First iteration | 1,785, | | MB | tweets
Second iteration | 575, | | MB | tweets_v2

Table 3: YouTube data summary

# of comments | # of unique referenced songs | Size | MongoDB collection
739, | | MB | youtube_comments

Table 4: Location mood data summary

# of unique locations (state,country) | # of tweets with geolocation | Size | MongoDB collection
| | MB | location_moods

Appendix 4: Tools and Libraries

pyechonest: fetch song metadata from the Echonest API service
tweepy: search tweets related to songs fetched from Echonest
pymongo: manage MongoDB
apiclient, oauth2client: fetch YouTube comments from the YouTube API service
nltk: Porter stemmer, English stop words, and tokenization removing non-alphabetic characters
sklearn: Multinomial Naive Bayes algorithm and CountVectorizer over documents; its metrics are also used to plot the performance of the algorithm
geopy: fetch geolocation information using the Nominatim and Google Maps services
boto: store and retrieve data in AWS S3
matplotlib: plot data
pandas: read and parse CSV files
mrjob: MapReduce job implementation
Ruby on Rails: web framework for the frontend application
More informationUsing the Force of Python and SAS Viya on Star Wars Fan Posts
SESUG Paper BB-170-2017 Using the Force of Python and SAS Viya on Star Wars Fan Posts Grace Heyne, Zencos Consulting, LLC ABSTRACT The wealth of information available on the Internet includes useful and
More informationETL Testing Concepts:
Here are top 4 ETL Testing Tools: Most of the software companies today depend on data flow such as large amount of information made available for access and one can get everything which is needed. This
More informationPrototyping Data Intensive Apps: TrendingTopics.org
Prototyping Data Intensive Apps: TrendingTopics.org Pete Skomoroch Research Scientist at LinkedIn Consultant at Data Wrangling @peteskomoroch 09/29/09 1 Talk Outline TrendingTopics Overview Wikipedia Page
More informationIBM Best Practices Working With Multiple CCM Applications Draft
Best Practices Working With Multiple CCM Applications. This document collects best practices to work with Multiple CCM applications in large size enterprise deployment topologies. Please see Best Practices
More informationThe main website for Henrico County, henrico.us, received a complete visual and structural
Page 1 1. Program Overview The main website for Henrico County, henrico.us, received a complete visual and structural overhaul, which was completed in May of 2016. The goal of the project was to update
More informationTransformer Looping Functions for Pivoting the data :
Transformer Looping Functions for Pivoting the data : Convert a single row into multiple rows using Transformer Looping Function? (Pivoting of data using parallel transformer in Datastage 8.5,8.7 and 9.1)
More informationTITLE: PRE-REQUISITE THEORY. 1. Introduction to Hadoop. 2. Cluster. Implement sort algorithm and run it using HADOOP
TITLE: Implement sort algorithm and run it using HADOOP PRE-REQUISITE Preliminary knowledge of clusters and overview of Hadoop and its basic functionality. THEORY 1. Introduction to Hadoop The Apache Hadoop
More informationCSE 544 Principles of Database Management Systems. Alvin Cheung Fall 2015 Lecture 10 Parallel Programming Models: Map Reduce and Spark
CSE 544 Principles of Database Management Systems Alvin Cheung Fall 2015 Lecture 10 Parallel Programming Models: Map Reduce and Spark Announcements HW2 due this Thursday AWS accounts Any success? Feel
More information1
1 2 3 6 7 8 9 10 Storage & IO Benchmarking Primer Running sysbench and preparing data Use the prepare option to generate the data. Experiments Run sysbench with different storage systems and instance
More informationRedPoint Data Management for Hadoop Trial
RedPoint Data Management for Hadoop Trial RedPoint Global 36 Washington Street Wellesley Hills, MA 02481 +1 781 725 0258 www.redpoint.net Copyright 2014 RedPoint Global Contents About the Hadoop sample
More informationThe Road to a Complete Tweet Index
The Road to a Complete Tweet Index Yi Zhuang Staff Software Engineer @ Twitter Outline 1. Current Scale of Twitter Search 2. The History of Twitter Search Infra 3. Complete Tweet Index 4. Search Engine
More informationApplied Machine Learning
Applied Machine Learning Lab 3 Working with Text Data Overview In this lab, you will use R or Python to work with text data. Specifically, you will use code to clean text, remove stop words, and apply
More informationMachine Learning in Action
Machine Learning in Action PETER HARRINGTON Ill MANNING Shelter Island brief contents PART l (~tj\ssification...,... 1 1 Machine learning basics 3 2 Classifying with k-nearest Neighbors 18 3 Splitting
More informationUsing the VMware vcenter Orchestrator Client. vrealize Orchestrator 5.5.1
Using the VMware vcenter Orchestrator Client vrealize Orchestrator 5.5.1 You can find the most up-to-date technical documentation on the VMware website at: https://docs.vmware.com/ If you have comments
More informationCS : Final Project Report
CS 294-16: Final Project Report Team: Purple Paraguayans Michael Ball Nishok Chetty Rohan Roy Choudhury Alper Vural Problem Statement and Background Music has always been a form of both personal expression
More informationMapReduce Design Patterns
MapReduce Design Patterns MapReduce Restrictions Any algorithm that needs to be implemented using MapReduce must be expressed in terms of a small number of rigidly defined components that must fit together
More informationTopics. Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples
Hadoop Introduction 1 Topics Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples 2 Big Data Analytics What is Big Data?
More informationETL Transformations Performance Optimization
ETL Transformations Performance Optimization Sunil Kumar, PMP 1, Dr. M.P. Thapliyal 2 and Dr. Harish Chaudhary 3 1 Research Scholar at Department Of Computer Science and Engineering, Bhagwant University,
More informationCreating a Classifier for a Focused Web Crawler
Creating a Classifier for a Focused Web Crawler Nathan Moeller December 16, 2015 1 Abstract With the increasing size of the web, it can be hard to find high quality content with traditional search engines.
More informationCS229 Final Project: Predicting Expected Response Times
CS229 Final Project: Predicting Expected Email Response Times Laura Cruz-Albrecht (lcruzalb), Kevin Khieu (kkhieu) December 15, 2017 1 Introduction Each day, countless emails are sent out, yet the time
More informationUsing the VMware vrealize Orchestrator Client
Using the VMware vrealize Orchestrator Client vrealize Orchestrator 7.0 This document supports the version of each product listed and supports all subsequent versions until the document is replaced by
More informationSearch Engines and Time Series Databases
Università degli Studi di Roma Tor Vergata Dipartimento di Ingegneria Civile e Ingegneria Informatica Search Engines and Time Series Databases Corso di Sistemi e Architetture per Big Data A.A. 2017/18
More informationPagely.com implements log analytics with AWS Glue and Amazon Athena using Beyondsoft s ConvergDB
Pagely.com implements log analytics with AWS Glue and Amazon Athena using Beyondsoft s ConvergDB Pagely is the market leader in managed WordPress hosting, and an AWS Advanced Technology, SaaS, and Public
More informationP2P Applications. Reti di Elaboratori Corso di Laurea in Informatica Università degli Studi di Roma La Sapienza Canale A-L Prof.ssa Chiara Petrioli
P2P Applications Reti di Elaboratori Corso di Laurea in Informatica Università degli Studi di Roma La Sapienza Canale A-L Prof.ssa Chiara Petrioli Server-based Network Peer-to-peer networks A type of network
More informationMovieRec - CS 410 Project Report
MovieRec - CS 410 Project Report Team : Pattanee Chutipongpattanakul - chutipo2 Swapnil Shah - sshah219 Abstract MovieRec is a unique movie search engine that allows users to search for any type of the
More informationWelcome to the New Era of Cloud Computing
Welcome to the New Era of Cloud Computing Aaron Kimball The web is replacing the desktop 1 SDKs & toolkits are there What about the backend? Image: Wikipedia user Calyponte 2 Two key concepts Processing
More informationSOURCERER: MINING AND SEARCHING INTERNET- SCALE SOFTWARE REPOSITORIES
SOURCERER: MINING AND SEARCHING INTERNET- SCALE SOFTWARE REPOSITORIES Introduction to Information Retrieval CS 150 Donald J. Patterson This content based on the paper located here: http://dx.doi.org/10.1007/s10618-008-0118-x
More informationIncluvie: Actor Data Collection Ada Gok, Dana Hochman, Lucy Zhan
Incluvie: Actor Data Collection Ada Gok, Dana Hochman, Lucy Zhan {goka,danarh,lucyzh}@bu.edu Figure 0. Our partner company: Incluvie. 1. Project Task Incluvie is a platform that promotes and celebrates
More informationOracle Endeca Information Discovery
Oracle Endeca Information Discovery Glossary Version 2.4.0 November 2012 Copyright and disclaimer Copyright 2003, 2013, Oracle and/or its affiliates. All rights reserved. Oracle and Java are registered
More informationBig Data Analytics CSCI 4030
High dim. data Graph data Infinite data Machine learning Apps Locality sensitive hashing PageRank, SimRank Filtering data streams SVM Recommen der systems Clustering Community Detection Queries on streams
More informationMICROSOFT BUSINESS INTELLIGENCE
SSIS MICROSOFT BUSINESS INTELLIGENCE 1) Introduction to Integration Services Defining sql server integration services Exploring the need for migrating diverse Data the role of business intelligence (bi)
More informationPROJECT REPORT. TweetMine Twitter Sentiment Analysis Tool KRZYSZTOF OBLAK C
PROJECT REPORT TweetMine Twitter Sentiment Analysis Tool KRZYSZTOF OBLAK C00161361 Table of Contents 1. Introduction... 1 1.1. Purpose and Content... 1 1.2. Project Brief... 1 2. Description of Submitted
More informationA U T O M A T E D C O N T E NT P R O T E C T I O N, A N A L Y T I C S A N D M O N E T I Z A T I O N A C R O S S S O C I A L P L A T F O R M S
Presenting: Eyal Arad VIDEOCITES 1 ID LTD. 2018 A U T O M A T E D C O N T E NT P R O T E C T I O N, A N A L Y T I C S A N D M O N E T I Z A T I O N A C R O S S S O C I A L P L A T F O R M S VIDEOCITES
More informationBusiness Club. Decision Trees
Business Club Decision Trees Business Club Analytics Team December 2017 Index 1. Motivation- A Case Study 2. The Trees a. What is a decision tree b. Representation 3. Regression v/s Classification 4. Building
More informationS E N T I M E N T A N A L Y S I S O F S O C I A L M E D I A W I T H D A T A V I S U A L I S A T I O N
S E N T I M E N T A N A L Y S I S O F S O C I A L M E D I A W I T H D A T A V I S U A L I S A T I O N BY J OHN KELLY SOFTWARE DEVELOPMEN T FIN AL REPOR T 5 TH APRIL 2017 TABLE OF CONTENTS Abstract 2 1.
More informationCA ERwin Data Modeler
CA ERwin Data Modeler Implementation Guide Service Pack 9.5.2 This Documentation, which includes embedded help systems and electronically distributed materials, (hereinafter referred to only and is subject
More informationSocial Network Analytics on Cray Urika-XA
Social Network Analytics on Cray Urika-XA Mike Hinchey, mhinchey@cray.com Technical Solutions Architect Cray Inc, Analytics Products Group April, 2015 Agenda 1. Introduce platform Urika-XA 2. Technology
More informationClustering to Reduce Spatial Data Set Size
Clustering to Reduce Spatial Data Set Size Geoff Boeing arxiv:1803.08101v1 [cs.lg] 21 Mar 2018 1 Introduction Department of City and Regional Planning University of California, Berkeley March 2018 Traditionally
More informationR-Store: A Scalable Distributed System for Supporting Real-time Analytics
R-Store: A Scalable Distributed System for Supporting Real-time Analytics Feng Li, M. Tamer Ozsu, Gang Chen, Beng Chin Ooi National University of Singapore ICDE 2014 Background Situation for large scale
More informationAn Introduction to Big Data Formats
Introduction to Big Data Formats 1 An Introduction to Big Data Formats Understanding Avro, Parquet, and ORC WHITE PAPER Introduction to Big Data Formats 2 TABLE OF TABLE OF CONTENTS CONTENTS INTRODUCTION
More informationJava Archives Search Engine Using Byte Code as Information Source
Java Archives Search Engine Using Byte Code as Information Source Oscar Karnalim School of Electrical Engineering and Informatics Bandung Institute of Technology Bandung, Indonesia 23512012@std.stei.itb.ac.id
More informationQlik Sense Enterprise architecture and scalability
White Paper Qlik Sense Enterprise architecture and scalability June, 2017 qlik.com Platform Qlik Sense is an analytics platform powered by an associative, in-memory analytics engine. Based on users selections,
More informationDATA MINING TRANSACTION
DATA MINING Data Mining is the process of extracting patterns from data. Data mining is seen as an increasingly important tool by modern business to transform data into an informational advantage. It is
More informationVolunteerMatters Wordpress Web Platform Calendar Admin Guide. Version 1.1
VolunteerMatters Wordpress Web Platform Calendar Admin Guide Version 1.1 VolunteerMatters Wordpress Web: Admin Guide This VolunteerMatters Wordpress Web Platform administrative guide is broken up into
More informationYour First Hadoop App, Step by Step
Learn Hadoop in one evening Your First Hadoop App, Step by Step Martynas 1 Miliauskas @mmiliauskas Your First Hadoop App, Step by Step By Martynas Miliauskas Published in 2013 by Martynas Miliauskas On
More informationData Analytics Framework and Methodology for WhatsApp Chats
Data Analytics Framework and Methodology for WhatsApp Chats Transliteration of Thanglish and Short WhatsApp Messages P. Sudhandradevi Department of Computer Applications Bharathiar University Coimbatore,
More informationNosDB vs DocumentDB. Comparison. For.NET and Java Applications. This document compares NosDB and DocumentDB. Read this comparison to:
NosDB vs DocumentDB Comparison For.NET and Java Applications NosDB 1.3 vs. DocumentDB v8.6 This document compares NosDB and DocumentDB. Read this comparison to: Understand NosDB and DocumentDB major feature
More informationStager. A Web Based Application for Presenting Network Statistics. Arne Øslebø
Stager A Web Based Application for Presenting Network Statistics Arne Øslebø Keywords: Network monitoring, web application, NetFlow, network statistics Abstract Stager is a web based
More informationConclusions. Chapter Summary of our contributions
Chapter 1 Conclusions During this thesis, We studied Web crawling at many different levels. Our main objectives were to develop a model for Web crawling, to study crawling strategies and to build a Web
More informationIdentifying Important Communications
Identifying Important Communications Aaron Jaffey ajaffey@stanford.edu Akifumi Kobashi akobashi@stanford.edu Abstract As we move towards a society increasingly dependent on electronic communication, our
More informationFreegal emusic PC user guide
Freegal emusic PC user guide What is Freegal? Freegal is a free music streaming and downloading service. Freegal offers access to about 7 million songs including the Sony Music catalogue. In total the
More informationParts of Speech, Named Entity Recognizer
Parts of Speech, Named Entity Recognizer Artificial Intelligence @ Allegheny College Janyl Jumadinova November 8, 2018 Janyl Jumadinova Parts of Speech, Named Entity Recognizer November 8, 2018 1 / 25
More informationData for linguistics ALEXIS DIMITRIADIS. Contents First Last Prev Next Back Close Quit
Data for linguistics ALEXIS DIMITRIADIS Text, corpora, and data in the wild 1. Where does language data come from? The usual: Introspection, questionnaires, etc. Corpora, suited to the domain of study:
More informationAI Dining Suggestion App. CS 297 Report Bao Pham ( ) Advisor: Dr. Chris Pollett
AI Dining Suggestion App CS 297 Report Bao Pham (009621001) Advisor: Dr. Chris Pollett Abstract Trying to decide what to eat can be challenging and time-consuming. Google or Yelp are two popular search
More information/ Cloud Computing. Recitation 9 March 15th, 2016
15-319 / 15-619 Cloud Computing Recitation 9 March 15th, 2016 Overview Administrative issues Office Hours, Piazza guidelines Last week s reflection Project 3.2, OLI Unit 4, Module 14, Quiz 7 This week
More informationHow can you implement this through a script that a scheduling daemon runs daily on the application servers?
You ve been tasked with implementing an automated data backup solution for your application servers that run on Amazon EC2 with Amazon EBS volumes. You want to use a distributed data store for your backups
More informationSitecore Experience Platform 8.0 Rev: September 13, Sitecore Experience Platform 8.0
Sitecore Experience Platform 8.0 Rev: September 13, 2018 Sitecore Experience Platform 8.0 All the official Sitecore documentation. Page 1 of 455 Experience Analytics glossary This topic contains a glossary
More informationCS November 2018
Bigtable Highly available distributed storage Distributed Systems 19. Bigtable Built with semi-structured data in mind URLs: content, metadata, links, anchors, page rank User data: preferences, account
More informationQlik Sense Performance Benchmark
Technical Brief Qlik Sense Performance Benchmark This technical brief outlines performance benchmarks for Qlik Sense and is based on a testing methodology called the Qlik Capacity Benchmark. This series
More informationKaggle See Click Fix Model Description
Kaggle See Click Fix Model Description BY: Miroslaw Horbal & Bryan Gregory LOCATION: Waterloo, Ont, Canada & Dallas, TX CONTACT : miroslaw@gmail.com & bryan.gregory1@gmail.com CONTEST: See Click Predict
More informationOptimizing Testing Performance With Data Validation Option
Optimizing Testing Performance With Data Validation Option 1993-2016 Informatica LLC. No part of this document may be reproduced or transmitted in any form, by any means (electronic, photocopying, recording
More informationA Study of the Correlation between the Spatial Attributes on Twitter
A Study of the Correlation between the Spatial Attributes on Twitter Bumsuk Lee, Byung-Yeon Hwang Dept. of Computer Science and Engineering, The Catholic University of Korea 3 Jibong-ro, Wonmi-gu, Bucheon-si,
More informationCS November 2017
Bigtable Highly available distributed storage Distributed Systems 18. Bigtable Built with semi-structured data in mind URLs: content, metadata, links, anchors, page rank User data: preferences, account
More informationMachine Learning using MapReduce
Machine Learning using MapReduce What is Machine Learning Machine learning is a subfield of artificial intelligence concerned with techniques that allow computers to improve their outputs based on previous
More informationUSING THE MUSICBRAINZ DATABASE IN THE CLASSROOM. Cédric Mesnage Southampton Solent University United Kingdom
USING THE MUSICBRAINZ DATABASE IN THE CLASSROOM Cédric Mesnage Southampton Solent University United Kingdom Abstract Musicbrainz is a crowd-sourced database of music metadata. The level 6 class of Data
More informationMedia wrangling in the car with GENIVI reqs
Media wrangling in the car with GENIVI reqs Collecting all your music in one place Jonatan Pålsson February 2, 2014 Jonatan Pålsson Media wrangling in the car with GENIVI reqs February 2, 2014 1 / 22 Outline
More informationPython Certification Training
Introduction To Python Python Certification Training Goal : Give brief idea of what Python is and touch on basics. Define Python Know why Python is popular Setup Python environment Discuss flow control
More informationTowards a hybrid approach to Netflix Challenge
Towards a hybrid approach to Netflix Challenge Abhishek Gupta, Abhijeet Mohapatra, Tejaswi Tenneti March 12, 2009 1 Introduction Today Recommendation systems [3] have become indispensible because of the
More informationSEO: SEARCH ENGINE OPTIMISATION
SEO: SEARCH ENGINE OPTIMISATION SEO IN 11 BASIC STEPS EXPLAINED What is all the commotion about this SEO, why is it important? I have had a professional content writer produce my content to make sure that
More information/ Cloud Computing. Recitation 8 October 18, 2016
15-319 / 15-619 Cloud Computing Recitation 8 October 18, 2016 1 Overview Administrative issues Office Hours, Piazza guidelines Last week s reflection Project 3.2, OLI Unit 3, Module 13, Quiz 6 This week
More informationCADIAL Search Engine at INEX
CADIAL Search Engine at INEX Jure Mijić 1, Marie-Francine Moens 2, and Bojana Dalbelo Bašić 1 1 Faculty of Electrical Engineering and Computing, University of Zagreb, Unska 3, 10000 Zagreb, Croatia {jure.mijic,bojana.dalbelo}@fer.hr
More informationA Fast and High Throughput SQL Query System for Big Data
A Fast and High Throughput SQL Query System for Big Data Feng Zhu, Jie Liu, and Lijie Xu Technology Center of Software Engineering, Institute of Software, Chinese Academy of Sciences, Beijing, China 100190
More informationOrchestrating Music Queries via the Semantic Web
Orchestrating Music Queries via the Semantic Web Milos Vukicevic, John Galletly American University in Bulgaria Blagoevgrad 2700 Bulgaria +359 73 888 466 milossmi@gmail.com, jgalletly@aubg.bg Abstract
More informationGR Reference Models. GR Reference Models. Without Session Replication
, page 1 Advantages and Disadvantages of GR Models, page 6 SPR/Balance Considerations, page 7 Data Synchronization, page 8 CPS GR Dimensions, page 9 Network Diagrams, page 12 The CPS solution stores session
More informationMCSA SQL SERVER 2012
MCSA SQL SERVER 2012 1. Course 10774A: Querying Microsoft SQL Server 2012 Course Outline Module 1: Introduction to Microsoft SQL Server 2012 Introducing Microsoft SQL Server 2012 Getting Started with SQL
More information