News Article Matcher
Team: Rohan Sehgal, Arnold Kao, Nithin Kunala

Abstract: The News Article Matcher is a search engine that lets a user input an entire news article and returns articles that are similar to it in nature. Achieving this requires a few components that merge together: a crawler that crawls various news websites and gathers data to index, an inverted index generator that creates an inverted index for searching, a search tool that calculates the BM25 score for each document, and a front-end query parser that breaks an input article down into its key words, which then form the query string. A user-selected feedback mechanism is also employed: it allows the user to select relevant articles, which modifies the query and returns results that are more relevant to the user.

Related Work: News searches are typically done with generic search engines, where users input keywords in the query and the search engine returns articles relevant to those keywords. During our research, we found no tool that lets a user input the literal text of a news article in order to find articles that are similar to it in both nature and content. This is where we feel our tool, the News Article Matcher, stands out. Often, a user who wants to explore an event or news piece in greater detail does not fully know in advance what they are looking for. Reading a single article often does not give the user a holistic enough view to summarize the article into a query that can be used to continue exploring the topic. The News Article Matcher allows them to simply copy-paste the text of the article; the tool then summarizes it, extracts the main keywords, and finds similar articles for the user to explore.
It can be used for a variety of purposes, such as finding historical news articles about events, finding out how similar events played out in the past, and, as discussed above, getting more information about news events without having to craft one's own queries.

Problem: The problem to be solved is taking a user-given article and returning a list of time-sensitive articles matching it, then modifying the results when feedback is taken into account. We can break the News Article Matcher down into four distinct components, as mentioned briefly in the abstract. These components interact with each other, solving certain mini-problems along the way:

1) Crawler: This is the main source of data for the search engine. After searching for an extended time, we were unable to find a news article database that was current and in the format we desired. Hence the crawler became a necessity that would
allow articles to remain current and give us articles against which we could match the user-inputted articles.

2) Inverted Index: This component creates and maintains the inverted index structure for our database. We needed an inverted index component that could run on demand when we added news articles manually for testing, as well as one that could run when new articles were added by the crawler. This way we always have the freshest possible inverted index to work with.

3) Search Mechanism: For returning search results we needed a component that could calculate the BM25 score for documents and return the top-n documents. This component also had to handle a date input, weighing documents by a user-given date in addition to the regular BM25 weight.

4) Query Parser and Generator: This component reads an input article provided by the user, finds the key words that summarize the article, and generates a query from those terms. It is also used for the feedback portion: it reassesses word weights based on the feedback given by users and re-weighs the query terms appropriately to return more relevant results in the next run of the query.

Methods: For the crawler, we decided to crawl the feed of Reuters Top News, as we felt this would give us a good overview of all the important news events in the world. The crawler continuously crawled the top news headlines at http://us.mobile.reuters.com/category/topnews every 3 hours and added the text and headlines of new articles to our database. The XML of the pages was loaded by the crawler, and the data was scraped using XPath queries and stored in our MySQL database. Once the news articles were in our database, we then needed to add them to the inverted index table. We had 3 tables storing a variety of information about an article.
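The scraping step described above can be sketched as follows. This is a minimal illustration using Python's standard library rather than the project's actual PHP/XPath code; the element names (`article`, `headline`, `body`) are hypothetical stand-ins for the real Reuters page markup.

```python
import xml.etree.ElementTree as ET

def extract_articles(page_xml):
    """Pull (headline, body) pairs out of a crawled page using XPath-style
    queries. Element names are hypothetical; the real Reuters markup would
    need its own paths."""
    root = ET.fromstring(page_xml)
    articles = []
    for item in root.findall(".//article"):
        headline = item.findtext("headline", default="").strip()
        body = item.findtext("body", default="").strip()
        if headline and body:
            articles.append((headline, body))
    return articles
```

In the real crawler this extraction would run every 3 hours against the live feed, with new (headline, body) pairs inserted into the MySQL database.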
The schema of the tables that the InvertedIndex component affects is below:

Lexicon (TermID, DocFreq, OverallFreq)
InvertedIndex (DocID, TermID, Freq_in_doc)
DocInfo (DocID, Length)

We felt that this schema gave us all the information needed to generate a BM25 score with simple table lookups. This ensures that queries return quickly and do not require large calculations at run time. The inverted index method is invoked after each crawled document. The document is first cleaned, that is, stripped of all punctuation and of trailing and leading whitespace, and is then stemmed using a Porter stemmer. The method then ensures that the counts in the Lexicon table are either initialized or updated, depending on whether the term already exists in the database, that the length of the document is in the DocInfo table, and
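The indexing step can be sketched with in-memory dictionaries standing in for the three MySQL tables. This is an illustrative sketch, not the project's actual code; the `stem` parameter is a placeholder for the Porter stemmer the project used.

```python
import re
from collections import Counter

def clean_and_tokenize(text, stem=lambda w: w):
    # Strip punctuation, lower-case, split on whitespace, then stem.
    # `stem` is an identity placeholder for the project's Porter stemmer.
    words = re.sub(r"[^\w\s]", " ", text.lower()).split()
    return [stem(w) for w in words]

def index_document(doc_id, text, lexicon, inverted_index, doc_info):
    """Update the three tables for one crawled document.
    lexicon:        term -> [doc_freq, overall_freq]   (Lexicon table)
    inverted_index: (doc_id, term) -> freq_in_doc      (InvertedIndex table)
    doc_info:       doc_id -> length                   (DocInfo table)
    """
    terms = clean_and_tokenize(text)
    doc_info[doc_id] = len(terms)
    for term, freq in Counter(terms).items():
        df, of = lexicon.get(term, [0, 0])
        # one new document containing the term, `freq` new occurrences
        lexicon[term] = [df + 1, of + freq]
        inverted_index[(doc_id, term)] = freq
```

With the counts maintained this way, a BM25 score really does reduce to table lookups at query time, as the text argues.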
that the InvertedIndex table of per-document counts is populated for all unique terms in the stemmed document.

On the query side, the first step was to build the query for computation. We used a script to find the Porter stem of every word and counted the frequency of each stem in the input article. We then normalized each stemmed term by dividing this count by the overall frequency of the term in our database (similar to MP1):

Frequency_normalized = Frequency_query / Frequency_background

After sorting the normalized list, we took the top X terms and passed a map of the terms and their (non-normalized) term frequencies in the query article to our search function.

For our search, we implemented the BM25 function. For each document in the database, we iterated through the generated query, consisting of the previously mentioned top X terms. If we found a query term in the document, we computed the product of the term frequency, inverse document frequency, and query frequency and added it to the BM25 score. The scoring function was:

TF(t, d) = (k + 1) * c(t, d) / (c(t, d) + k * (1 - b + b * doclen / avgdoclen))
IDF(t) = log((n + 1) / df(t))
Score = Σ TF * IDF * QF        (QF = frequency of the term in the query article)

After summing the BM25 scores, we sorted them and displayed the top Y results to the user.

We also offered an optional date input and a date weight. If the user entered a date, then for each article with a non-zero score we calculated the difference in days between that article and the input date. We then multiplied the BM25 score by the entered weight and an exponentially decaying function of the difference in days:

DateMult = DateDecay ^ (# of Days)

The output of this equation decays fairly aggressively, since we crawled enough articles to have a large date range in our database.
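The query-building, BM25, and date-decay steps above can be sketched as follows. This is a minimal illustration under assumed parameter values (k = 1.2, b = 0.75, DateDecay = 0.9); the report does not state the constants actually used.

```python
import math
from collections import Counter

def build_query(article_terms, background_freq, top_x=25):
    """Pick the top-X terms by frequency normalized against the background
    collection frequency; return {term: raw_count} for the chosen terms."""
    counts = Counter(article_terms)
    ranked = sorted(counts,
                    key=lambda t: counts[t] / background_freq.get(t, 1),
                    reverse=True)
    return {t: counts[t] for t in ranked[:top_x]}

def bm25_score(query, doc_counts, doc_len, avg_doc_len, n_docs, doc_freq,
               k=1.2, b=0.75):
    """Sum of TF * IDF * QF over the query terms, as in the formulas above.
    k and b are assumed values, not the project's actual settings."""
    score = 0.0
    for term, qf in query.items():
        c = doc_counts.get(term, 0)
        if c == 0:
            continue  # term absent from this document contributes nothing
        tf = (k + 1) * c / (c + k * (1 - b + b * doc_len / avg_doc_len))
        idf = math.log((n_docs + 1) / doc_freq[term])
        score += tf * idf * qf
    return score

def date_mult(days_apart, date_decay=0.9):
    # Exponential decay in the day gap between article date and query date.
    return date_decay ** days_apart
```

Sorting the documents by `bm25_score` and taking the top Y reproduces the ranking step described in the text.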
The value of DateDecay is hard-coded, but it can be modified to change how strongly a difference in date is punished. We also had an optional date weight parameter, which represents how much of the scoring is based on date. For instance, if the user entered 50%, then half of the score is based on the query match and half on the date match. It is important to note that the DateScore (whose formula is given below) is just the DateMult and the BM25 score multiplied together. This way, we do not run into any problems normalizing the date score around article size or other factors that might affect a BM25 score. It also means that if the BM25 score is 0, then the overall score will be 0 regardless of how well the
date matches. This is desirable because we want similar articles with a similar date, and we never want the date parameter to overpower the article similarity ranking. We also made sure to always normalize our scores regardless of the percentage and date parameters. We did this because, for a specific query, we want the user to be able to look at the scores and see how well the articles match, without inflation due to the date parameters.

DateScore = BM25Score * DateMult * (UserEnteredPerc / (1 - UserEnteredPerc))
TotalScore = BM25Score * DateScore / (1 - (UserEnteredPerc / (1 - UserEnteredPerc)))

In case the first iteration of results returned by our search engine is not to the user's satisfaction, we offer the option of providing feedback. We chose not to keep permanent records of any feedback; it is used only on a per-usage basis. Given a set of results for a query article, the user can select which ones are most relevant and re-submit the query. These choices are then analyzed in addition to the original input query. One way to understand our implementation of feedback is as an analysis of the concatenation of the original query and the selected relevant articles. In other words, we sum the term counts of the original input with weighted term counts of all relevant articles, normalize, and return the top X words with the highest normalized scores, each word mapped to its total term count (which may not be an integer at this point). These words and term counts are then plugged into the BM25 calculations as before:

TF_feedback = TF_query + w * TF_relevant,    generally w ∈ (0, 1]

Evaluation and Sample Results: The end product looked like the following:
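The feedback combination described above can be sketched as a simple merge of term-count maps. This is an illustrative sketch; the weight value 0.5 is an assumed default, since the report only says w is generally in (0, 1].

```python
from collections import Counter

def feedback_counts(query_counts, relevant_docs_counts, w=0.5):
    """Combine the original query's term counts with weighted counts from
    the user-selected relevant articles:
        TF_feedback = TF_query + w * TF_relevant
    Returns {term: combined_count}; counts may be non-integer."""
    combined = Counter(query_counts)
    for doc_counts in relevant_docs_counts:
        for term, c in doc_counts.items():
            combined[term] += w * c  # Counter defaults missing terms to 0
    return dict(combined)
```

The combined counts would then be renormalized against the background frequencies and the top X terms re-selected, exactly as in the original query-generation step.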
In the current page, an article from BBC about Boko Haram in Nigeria is input (http://www.bbc.com/news/world-africa-13809501), and the results visible in the image are all related to the topic of Boko Haram in Nigeria. When a date is entered, we can see the weights change for the same query, reflecting the weight of the date parameter in the document scores: the article "Islamist attack kills 125 in northeast Nigeria", published on the 7th of May, is now given the highest score compared to the other articles published on later dates. For a more formal result analysis, we tracked a sample article and the number of relevant articles it returned, using Precision@5 documents, since our database consisted of only a hundred or so articles. For our sample test article, we manually added 5 relevant articles and 5 non-relevant articles that had similar words but were not on the same topic. These articles were in addition to the other articles in the database. We plotted the Precision@5 documents as a function of the number of documents in the database.
[Figure: Precision@5 as a function of the number of documents in the database (10, 17, 24, 35, 55, 77 documents)]

What we saw was an improvement in precision as the number of documents in the database increased. This made sense, as initially the database was skewed and the query generator had trouble identifying unique words through frequency normalization. As more articles were added, words that were rarer in the dataset became easier to identify, so the query generated for the article improved, and the precision of the results improved with it. Using our relevance feedback, we were always able to improve our precision scores, resulting in all 5 relevant documents being displayed each time except when there were only 10 documents in the database. We also measured the Precision@5 against the number of keywords used for the query. We saw an improvement as the number of query terms increased, which again made sense, since a larger query is a more accurate representation of the article. However, this increased computation time, so we compromised and ran our final version on 25 terms.

[Figure: Precision@5 as a function of query size (2, 5, 10, 15, 25, 35 terms)]
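For reference, the Precision@5 measure used in this evaluation is just the fraction of the top 5 returned documents that are relevant (the report's plots show the equivalent raw count out of 5). A minimal sketch:

```python
def precision_at_k(returned_ids, relevant_ids, k=5):
    """Fraction of the top-k returned documents that are in the
    relevant set; multiply by k to get the raw count plotted above."""
    top_k = returned_ids[:k]
    return sum(1 for doc in top_k if doc in relevant_ids) / k
```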
Conclusions and Future Work: Overall we were pleased with the performance of the News Article Matcher. It consistently matched news articles accurately against random articles drawn from the internet. The date function worked well, and the feedback, which was a little tricky to implement, worked well too. Overall we learnt how to implement a search engine from scratch, got familiar with a practical use of stemming, and learnt a bit about how to summarize large text pieces effectively and efficiently.

For future work, inspired by the project presentations, we would probably want to migrate to the Apache Solr framework instead of maintaining the inverted index ourselves in the MySQL database. This would be a major performance enhancement, as our queries were definitely slowing down as more query terms were used and more articles were added to the database. We would also have liked to crawl a larger variety of websites so that we could match a larger range of articles. Since we were only crawling Top News, entertainment and sporting events were rarely added to the database, leading to poor matches in those areas. Another change would be adding our crawled data to an already existing database of news articles. This would allow us to test our date functionality more accurately, as well as test our search functionality on a larger dataset. We would also get more accurate matches for input queries, since matches would be more likely in a larger dataset.

Contributions:
Nithin Kunala: Designed the database schema; designed the feedback process; designed the query generation process; implemented BM25 search; implemented part of the Inverted Index.
Rohan Sehgal: Designed the database schema; designed the feedback process; implemented the Crawler; implemented the Inverted Index; created the UI for the tool.
Arnold Kao: Designed the database schema; designed the query generation process; implemented the feedback process;
implemented the query generator; created the UI for the tool.

References:
1) CS 410 lecture notes
2) http://en.wikipedia.org/wiki/Okapi_BM25
3) http://tartarus.org/~martin/porterstemmer/php.txt