News Article Matcher. Team: Rohan Sehgal, Arnold Kao, Nithin Kunala


Abstract:
The News Article Matcher is a search engine that allows you to input an entire news article and returns articles that are similar to it in nature. To achieve this, several components merge together: a crawler that crawls various news websites and gathers data to index, an inverted index generator that creates an inverted index for searching, a search tool that calculates the BM25 score for each document, and a front-end query parser that breaks an inputted article down into its key words, which then form the query string. A user-selected feedback mechanism is also employed: it allows the user to select relevant articles, which modifies the query and returns results that are more relevant to the user.

Related Work:
Typically, news searches are done by generic search engines, where users input keywords in a query and the search engine returns articles relevant to those keywords. During our research, we found no tool that accepts the literal text of a news article in order to find articles that are similar to it in both nature and content. This is where we feel our tool, the News Article Matcher, stands out. Often, a user who wants to explore an event or news piece in greater detail does not fully know in advance what they are looking for. Reading a single article often does not give the user a holistic enough view to summarize the article into a query that can be used to continue exploring the topic. The News Article Matcher allows them to simply copy-paste the text of the article; the tool then summarizes it, extracts the main keywords, and finds similar articles for the user to explore.
It can be used for a variety of purposes, such as finding historical news articles about events, finding out how similar events played out in the past, and, as discussed above, getting more information about news events without having to craft queries by hand.

Problem:
The problem to be solved is taking a user-given article and returning a list of time-sensitive articles matching it, then modifying the results when feedback is taken into account. We can break the News Article Matcher down into four distinct components, as mentioned briefly in the abstract. These four components all interact with each other, solving certain mini-problems along the way. They are:

1) Crawler: This is the main source of data for the search engine. After looking and searching for an extended time, we were unable to find a news article database that was current and in the format we desired. Hence the crawler became a necessity that would

allow articles to remain current and give us articles that we could match the user-inputted articles to.

2) Inverted Index: This component is required to create and maintain the inverted index structure for our database. We needed an inverted index component that could run on demand when we added news articles manually for testing, as well as one that could run when new articles were added by the crawler. In this way we always have the freshest possible inverted index to work with.

3) Search Mechanism: For returning search results we needed a component that could calculate the BM25 score for documents and return the top-n documents. This component also had to handle an optional date input, weighting documents by the user-given date in addition to the regular BM25 weight.

4) Query Parser and Generator: This component needed to read an input article provided by the user, find the key words that summarize the article, and then generate a query using those terms. This component is also used for the feedback portion: it reassesses the weights on words based on the feedback given by users and re-weighs the query terms appropriately to return more relevant results in the next run of the query.

Methods:
For the crawler, we decided to crawl the Reuters Top News feed, as we felt this would give us a good overview of the important news events in the world. The crawler was implemented by continuously crawling the top news headlines on the page http://us.mobile.reuters.com/category/topnews every 3 hours and adding the text and headlines of new articles to our database. The XML of the pages was loaded by the crawler, and the data was scraped using XPath queries and stored in our MySQL database. Once the news articles were in our database, we then needed to add the new articles to our inverted index table. We had 3 tables storing a variety of information about an article.
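The crawl-and-store step described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the XPath expression, the `Articles` table layout, and the function names are assumptions, and SQLite stands in for the MySQL database used in the report.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical XPath for headline links; the report does not give the exact one.
HEADLINE_XPATH = ".//a[@class='headline']"


def init_db(conn):
    # "Articles" is an assumed table name standing in for the MySQL schema.
    conn.execute("CREATE TABLE IF NOT EXISTS Articles "
                 "(DocID INTEGER PRIMARY KEY, Url TEXT UNIQUE, Title TEXT, Body TEXT)")


def scrape_headlines(page_xml):
    """Extract (url, title) pairs from the crawled page's XML via an XPath query."""
    root = ET.fromstring(page_xml)
    return [(a.get("href"), (a.text or "").strip())
            for a in root.findall(HEADLINE_XPATH)]


def store_article(conn, url, title, body):
    """Insert only articles not seen before, so re-crawling the page every
    3 hours adds just the new headlines. Returns True for a new article."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO Articles (Url, Title, Body) VALUES (?, ?, ?)",
        (url, title, body))
    return cur.rowcount == 1
```

In the real crawler this loop would fetch the page every 3 hours and, after each newly stored article, invoke the inverted index method described next.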
The schema of the tables that the InvertedIndex component affects is below:

Lexicon (TermID, DocFreq, OverallFreq)
InvertedIndex (DocID, TermID, Freq_in_doc)
DocInfo (DocID, Length)

We felt that this schema gave us all the information needed to generate a BM25 score with simple table lookups. This ensures that queries return quickly and do not require large calculations at run-time. The inverted index method is invoked after each crawled document. The document is first cleaned, that is, stripped of all punctuation and of trailing and leading whitespace, and is then stemmed using Porter's stemmer. The method then ensures that the counts in the Lexicon table are either initialized or updated, depending on whether the term already exists in the database, that the length of the document is recorded in the DocInfo table, and

the InvertedIndex rows of per-document counts are created for all unique terms in the stemmed document.

On the query side, the first step was to build the query for computation. To do this, we used a script to find the Porter stem of every word and counted the frequency of each stem in the input article. We then normalized each stemmed term by dividing this count by the overall frequency of the term in our database (similar to MP1):

Frequency_normalized = Frequency_query / Frequency_background

After sorting the normalized list, we took the top X terms and passed a map of the terms and their (non-normalized) term frequencies in the query article to our search function.

For our search, we implemented the BM25 function. For each document in the database, we iterated through our generated query, consisting of the previously mentioned top X terms. If we found a query term in the document, we added the product of the term frequency, inverse document frequency, and query frequency to the document's BM25 score. The scoring function specifically was:

TF(t, d) = (k + 1) * c(t, d) / (c(t, d) + k * (1 - b + b * doclen / avgdoclen))
IDF(t) = log((n + 1) / df(t))
Score = sum over query terms of TF * IDF * QF

where c(t, d) is the count of term t in document d, df(t) is the number of documents containing t, n is the number of documents in the database, and QF is the frequency of the term in the query article. After summing up the BM25 scores, we sort them and display the top Y results to the user.

We also offered an optional date input and a date weight. If the user entered a date, then for each article with a non-zero score we calculated the difference in days between the scored article's date and the input date. We then multiplied the BM25 score by the entered weight and an exponentially decaying function of the difference in days:

DateMult = DateDecay ^ (# of days)

The output of this equation decays fairly aggressively, since we crawled enough articles to have a large date range in our database.
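The query-generation and scoring steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's code: the BM25 parameters k = 1.2 and b = 0.75 are assumed values (the report does not state them), tokens are assumed to be already Porter-stemmed, and the function names are illustrative.

```python
import math
from collections import Counter

K = 1.2   # assumed BM25 k parameter; not stated in the report
B = 0.75  # assumed BM25 b parameter; not stated in the report


def build_query(article_tokens, background_freq, top_x=25):
    """Count each (stemmed) term in the input article, normalize by its
    overall frequency in the database, and keep the top X terms.
    Returns {term: non-normalized frequency in the query article}."""
    counts = Counter(article_tokens)
    normalized = {t: c / background_freq.get(t, 1) for t, c in counts.items()}
    top = sorted(normalized, key=normalized.get, reverse=True)[:top_x]
    return {t: counts[t] for t in top}


def bm25_score(query, doc_counts, doc_len, avg_doc_len, doc_freq, n_docs):
    """Score = sum over query terms of TF * IDF * QF, per the formulas above."""
    score = 0.0
    for term, qf in query.items():
        c = doc_counts.get(term, 0)
        if c == 0:
            continue  # only terms present in the document contribute
        tf = (K + 1) * c / (c + K * (1 - B + B * doc_len / avg_doc_len))
        idf = math.log((n_docs + 1) / doc_freq[term])
        score += tf * idf * qf
    return score


def date_multiplier(date_decay, days_apart):
    """DateMult = DateDecay ^ (# of days between article date and input date)."""
    return date_decay ** days_apart
```

In the real system the per-document counts, document lengths, and document frequencies come straight from the InvertedIndex, DocInfo, and Lexicon tables, so scoring reduces to table lookups as intended.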
The value of DateDecay is hardcoded, but it can be modified to change how heavily a difference in dates is punished. We also had an optional date weight parameter, which represents how much of the scoring is based on the date. For instance, if the user entered 50%, then half of the score is based on the query match and half on the date match. It is important to note that the DateScore (whose formula is given below) is just the DateMult and the BM25 score multiplied together. This way, we do not run into problems with normalizing the date score around article size or any other factors that might affect a BM25 score. It also means that if the BM25 score is 0, the overall score will be 0 regardless of how well the

date matches. This is desirable because we want similar articles with a similar date; we never want the date parameter to overpower the article-similarity ranking. We also made sure to always normalize our scores regardless of the percentage and date parameters. We did this because, for a specific query, we want the user to be able to look at the scores and see how well documents match, without inflation due to the date parameters.

DateScore = BM25Score * DateMult * (UserEnteredPerc / (1 - UserEnteredPerc))
TotalScore = BM25Score * DateScore / (1 - (UserEnteredPerc / (1 - UserEnteredPerc)))

In the case that the first iteration of results returned by our search engine is not to the user's satisfaction, we offer the option of providing feedback. However, we chose not to keep permanent records of any feedback; it is applied only on a per-usage basis. Given a set of results for a query article, the user can select which ones are most relevant and re-submit the query. These choices are then analyzed in addition to the original input query. One way to understand our implementation of feedback is as an analysis of the concatenation of the original query and the selected relevant articles. In other words, we sum the term counts of the original input with the weighted term counts of all relevant articles, normalize them, and return the top X words with the highest normalized scores, with each word mapped to its total term count (which may not be an integer at this point). These words and term counts are then plugged into the BM25 calculations as before.

TF_feedback = TF_query + w * TF_relevant,  generally w ∈ (0, 1]

Evaluation and Sample Results:
The end product looked like the following:

In the current page, an article from the BBC about Boko Haram in Nigeria is input (http://www.bbc.com/news/world-africa-13809501), and the results seen in the image are all related to the topic of Boko Haram in Nigeria. When a date is entered, we can see the weights changing for the same query, reflecting the weight of the date parameter in the document scores: the article "Islamist attack kills 125 in northeast Nigeria", published on the 7th of May, is now given the highest score compared to the other articles published on later dates.

For a more formal result analysis, we kept track of a sample article and the number of relevant articles it returned, using Precision@5 documents, since our database consisted of only a hundred or so articles. For our sample test article, we manually added 5 relevant articles and 5 non-relevant articles that had similar words but were not on the same topic. These articles were in addition to the other articles in the database. We plotted the Precision@5 documents as a function of the number of documents in the database.
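The per-session relevance feedback described in the Methods section (TF_feedback = TF_query + w * TF_relevant, followed by re-normalization and top-X selection) can be sketched as follows; the function name and the background-frequency map are illustrative assumptions.

```python
from collections import Counter


def feedback_query(query_counts, relevant_doc_counts, background_freq,
                   w=0.5, top_x=25):
    """Merge the original query's term counts with w-weighted counts from the
    user-selected relevant articles, then keep the top-X terms by
    background-normalized score. The merged counts may be non-integer."""
    merged = Counter()
    merged.update(query_counts)
    for doc in relevant_doc_counts:
        for term, count in doc.items():
            merged[term] += w * count  # TF_feedback = TF_query + w * TF_relevant
    # Same normalization as the original query generation step.
    normalized = {t: c / background_freq.get(t, 1) for t, c in merged.items()}
    top = sorted(normalized, key=normalized.get, reverse=True)[:top_x]
    return {t: merged[t] for t in top}
```

The returned map feeds straight back into the BM25 search exactly as the first-pass query did, which is why no permanent feedback state is needed.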

[Figure: Precision@5 as a function of the number of documents in the database (10, 17, 24, 35, 55, 77 documents)]

What we saw was an improvement in precision as the number of documents in the database increased. This made sense: initially the database was skewed, and the query generator had trouble identifying unique words based on frequency normalization. As more articles were added, words that were rarer in the dataset became easier to identify, so the query generated for the article improved and hence the precision of the results improved. Using our relevance feedback, we were always able to improve our precision scores, resulting in all 5 relevant documents being displayed each time, except when there were only 10 documents in the database.

We also measured Precision@5 against the number of keywords used for the query. We saw an improvement as the number of query terms increased, which again made sense, since a larger query is a more accurate representation of the article. However, this increased computation time, so we compromised and ran our final version with 25 terms.

[Figure: Precision@5 as a function of query size (2, 5, 10, 15, 25, 35 terms)]

Conclusions and Future Work:
Overall, we were pleased with the performance of the News Article Matcher. It consistently matched news articles accurately against random articles drawn from the internet. The date function worked well, and the feedback, while a little tricky to implement, also worked well. Overall, we learnt how to implement a search engine from scratch, got familiar with a practical use of stemming, and learnt a bit about how to summarize large text pieces effectively and efficiently.

For future work, inspired by the project presentations, we would want to migrate to the Apache Solr framework instead of maintaining the inverted index ourselves in the MySQL database. This would be a major performance enhancement, as our queries were definitely slowing down as more query terms were used and more articles were added to the database. We would also have liked to crawl a larger variety of websites so that we could match a larger range of articles. Since we were only crawling Top News, entertainment and sporting events were rarely added to the database, leading to poor matches in those areas. Another change would be adding our crawled data to an already-existing database of news articles. This would allow us to test our date functionality more accurately, as well as test our search functionality on a larger dataset. We would also get more accurate matches for input queries, since matches would be more likely in a larger dataset.

Contributions:
Nithin Kunala:
- Designed the database schema
- Designed the feedback process
- Designed the query generation process
- Implemented BM25 search
- Implemented part of the Inverted Index

Rohan Sehgal:
- Designed the database schema
- Designed the feedback process
- Implemented the Crawler
- Implemented the Inverted Index
- Created the UI for the tool

Arnold Kao:
- Designed the database schema
- Designed the query generation process
- Implemented the feedback process

- Implemented the query generator
- Created the UI for the tool

References:
1) CS 410 lecture notes
2) http://en.wikipedia.org/wiki/okapi_bm25
3) http://tartarus.org/~martin/porterstemmer/php.txt