Context Based Content Extraction of HTML Documents


Context Based Content Extraction of HTML Documents

Thesis Proposal

Suhit Gupta
Dept. of Comp. Sci. / 450 CS Bldg
500 W. 120th Street
New York, NY

Advisor: Dr. Gail E. Kaiser

Dec 10th, 2004


Abstract

Web pages often contain clutter (such as unnecessary images and extraneous links) around the body of an article that distracts a user from the actual content. Extraction of useful and relevant content from web pages has many applications, including cell phone and PDA browsing, speech rendering for the visually impaired, and text summarization. Most approaches to making content more readable involve changing font size or removing HTML and data components such as images, which may take away from a webpage's inherent look and feel. Unlike content reformatting, which aims to reproduce the entire webpage in a more convenient form, my solution directly addresses content extraction. My goal is to demonstrate that DOM tree based content extraction is an easier and superior approach compared to processing flat HTML files directly. I will test several hypotheses, including whether implementing the extraction engine as a transparent proxy, built as a framework that allows programmers and administrators to add heuristics, is a versatile solution. I further propose that an adaptable approach will produce a more satisfactory experience than current approaches, which are pre-packaged, one-size-fits-all and entirely programmer controlled. Finally, I will show that inferring tree patterns and using context for content extraction, especially content genre and physical layout, will allow Crunch to contend with a wide range of websites.

My experiment will include creating Crunch, a framework that employs an easily extensible set of techniques for enabling and integrating heuristics concerned with content extraction from HTML web pages, and that is practically usable as a web proxy by the end user. I will try to demonstrate the ability to reduce human involvement in applying heuristic settings for websites, and instead automate the job by detecting and utilizing the physical layout and content genre of a given website. Crunch uses a frequency distribution of the words associated with the webpage to extract its genre. These results are used to improve the extraction process by comparing them to previously known results that work well for certain genres of sites and utilizing those settings.

In order to analyze the success of my solution, I will measure the usability and performance of the content extraction proxy in terms of the quality of the output generated by the heuristics that act as filters for inferring the context of a webpage. I will focus on Crunch being used to render webpages in a screen reader-friendly format for visually impaired and blind users. However, I will also test its ability to produce pages for constrained devices like PDAs and cell phones.


Table of Contents

1 Statement of Problem
2 Motivation
3 Related Work
4 Approach
5 Feasibility
  5.1 Crunch 1.0
  5.2 Crunch 2.0
  5.3 Crunch 3.0
6 Evaluation
7 Plan/Schedule
8 Contributions
9 Appendix A (Figures)
Appendix B (Tables)
Appendix C (Screenshots)
Acknowledgements
References


1. Statement of Problem

Web pages are often cluttered with features around the body of an article that distract the user from the actual content they are interested in. These features may include pop-up ads, flashy banner advertisements, unnecessary images, or links scattered around the screen. Navigation features like menus and quick links often interrupt a user's workflow. Automatic extraction of useful and relevant content from web pages has many applications, ranging from enabling end users to access the web more easily over constrained devices like Personal Digital Assistants (PDAs) and cellular phones, to providing better access to the Web for the disabled.

Most traditional approaches to removing clutter or making content more readable involve either a blanket removal/addition of a feature on all webpages or time-consuming natural language processing techniques, with the end user having little control over the output they wish to see. Some approaches include increasing the font size, removing images, disabling JavaScript, etc., or a combination of these methods, all of which eliminate the webpage's inherent look-and-feel. Examples include WPAR [18], Webwiper [19] and JunkBusters [20]. All of these products involve hardcoded techniques for certain common web page designs as well as fixed blacklists of advertisers. This can produce inaccurate results if the software encounters a layout that it hasn't been programmed to handle. Other approaches involve content reformatting, which reorganizes the data so that it fits on a PDA; however, this does not eliminate clutter but merely reorganizes it. Opera [21], for example, utilizes its proprietary Small Screen Rendering technology that reformats web pages to fit inside the screen width. I propose a Content Extraction technique as an alternative that can remove clutter without destroying webpage layout, making more of a page's content viewable at once. These techniques should also work on web pages made up of multiple content bodies, even if the bodies are separated by distracting features or have such features interspersed within them.

1.1 Thesis Statement

I propose a framework called Crunch for enabling and integrating heuristics concerned with content extraction from HTML web pages. The goal of content extraction is to find the (likely) content of a given web page and remove what is deemed non-content, by utilizing a combination of several heuristic-based filters. I propose to work on the DOM tree of a webpage rather than the flat HTML file. I further intend to extrapolate the context of a website, namely its genre and its physical layout, by computing a frequency distribution of the words associated with the webpage. These results will be used to improve the extraction process by comparing them to previously known results that work well for certain genres of sites and utilizing those settings. I will measure the usability and performance of the content extraction proxy in terms of the quality and accuracy of the output generated (for target blind users and/or constrained devices) by the heuristics.

To summarize, web pages contain clutter, such as unnecessary images and extraneous links, around the body of an article that distracts a user from the actual content. Most current approaches do not give the end user enough control over exactly what they would like to see, or are too specific in their implementation to support a wide range of websites. Extraction of useful and relevant content from web pages has many applications, including rendering webpages in a screen reader-friendly format for visually impaired and blind users, displaying pages on constrained devices like PDAs and cell phones, producing less noisy data for text summarization, and reducing interruptions due to extraneous content in a user's workflow when browsing the web. My solution, Crunch, directly addresses content extraction: it is an extensible framework of heuristic-based filters that use a webpage's genre and layout information to extract content from a webpage.

2. Motivation

Computer users are spending more and more time on the Internet in today's world of online shopping and banking; meanwhile, webpages are getting more complex in design and content. Web pages are cluttered with guides and menus that attempt to improve the user's efficiency, but these features often end up distracting from the actual content of interest. They may include script- and Flash-driven animation, menus, pop-up ads, obtrusive banner advertisements, unnecessary images, or links scattered around the screen.

2.1 Problem

As explained in the next section (Related Work), content extraction typically takes a one-size-fits-all approach, where one technique is used for all users and all websites. This problem is further exacerbated by the fact that the target applications can vary widely: speech rendering, a constrained device, or a summarization algorithm.

2.2 Applications

Web browsing for blind/visually impaired users

Content extraction is particularly useful for the visually impaired and blind [48]. A common practice for improving web page accessibility for the visually impaired is to increase font size [76] and decrease screen resolution; however, this also increases the size of the clutter, reducing effectiveness. Screen readers for the blind, like Hal Screen Reader by Dolphin Computer Access [46] or Microsoft's Narrator [47], do not usually remove such clutter automatically either, and often read out full raw HTML. Webaim Screen Reader [49], IBM Homepage Reader [50] and JAWS [75] do attempt to enhance usability by pruning out duplicate pieces of information; however, they tend to be slow and again fall into the one-size-fits-all pitfall [48]. Therefore, both groups (visually impaired and blind users) benefit from extraction, as less material must be read to obtain the desired results.

Web browsing on handheld constrained devices

A variety of approaches have been suggested for formatting web pages to fit on the small screens of constrained devices (including the Opera browser [16] and Bitstream ThunderHawk [17]); however, the reformatting approaches generally do not distinguish significant from subsidiary content (that is, clutter), nor remove the latter. Browsers built for cell phones (Pocket Internet Explorer and the Opera cell phone browser) provide similar reformatting behavior. They either depend on website developers to provide a low-bandwidth, constrained-device version of the site, or use text resizing and content reformatting to try to fit as much data as possible on the tiny screen. Some browsers will even attempt to detect content by simply rendering the middle of the page as the primary view, in an attempt to catapult the user over the banners and menus that are typically found at the top of the page. Again, we see the non-adaptable one-size-fits-all strategy being followed, which is not always satisfactory, as constrained devices vary widely in their constraints (screen size, bandwidth, etc.).

Information retrieval algorithms

Natural Language Processing (NLP) and information retrieval (IR) algorithms can also benefit from content extraction, as they rely on the relevance of content and the reduction of the standard word error rate to produce accurate results [13], where the error rate is the number of words incorrectly processed from the original format. Content extraction would be a preprocessor to these NLP algorithms, allowing them to process only the extracted content as input, as opposed to cluttered data coming directly from the web [14]. Currently, most NLP-based algorithms require writing specialized extractors for each web domain [14][15]. Generalized content extraction may be sufficient, but determining whether or not it is sufficient is outside the scope of this thesis and would be appropriate future work.

Improving user workflow

Finally, an abstract but noteworthy goal is to improve the web browsing workflow for all users. Users, whether novices or experienced, often get distracted by the extraneous links and other non-content on a page that are completely tangential to the task at hand (even if of some extracurricular interest). Giving them more direct access to the data they are interested in is therefore an important goal. One caveat is that when a user is surfing the web with no particular goal in mind, removing extraneous content may not be a desired feature.

Crunch will be useful for the other applications as well. However, in order to limit the scope of the proposed thesis, I will focus my experiments primarily on the blind/visually impaired and secondarily on small-screen devices, and not directly address information retrieval or general user workflow.

In Section 3, I will further motivate the thesis topic by discussing existing solutions. In Section 4, I will describe my approach at an abstract level and pose hypotheses that could improve the current body of research in the field. In Section 5, I will describe the three versions of the Crunch proxy, each of which was designed to further address the different problems outlined in this proposal: Crunch 1 being simply the framework; Crunch 2 adding substantial improvements to the framework with respect to the plug-in API and the extensibility of the administrative interface; and Crunch 3 being the context and genre based content extraction proxy. In Section 6, I will briefly describe how I plan to evaluate my approach to content extraction. I will describe the plan for the remainder of the thesis in Section 7 and finally conclude with contributions in Section 8. The appendices present extra materials for interested readers.

3. Related Work

3.1 Content Extraction

The basic approach that I am going to take towards content extraction is DOM tree based. I am going to build a framework that employs an easily extensible set of techniques for enabling and integrating heuristics concerned with content extraction from HTML web pages, and that is practically usable as a web proxy by the end user. There is a large body of related work in content identification, information retrieval and context extraction that attempts to solve similar problems using various other techniques. I will describe the related work in three subsections: content extraction, which deals with the actual identification and extraction of content; content reformatting, which deals with resizing webpages to fit on constrained devices; and finally genre-based content detection, which first tries to extrapolate the context of the document and then uses that information for content extraction.

Rahman et al. [2] propose techniques that use structural analysis, contextual analysis, and summarization. The structure of an HTML document is first analyzed and then decomposed into smaller subsections. The content of the individual sections is then extracted and summarized. While the paper describes prerequisites for content extraction, it does not propose methods to do so. The solution is meant for constrained devices like cell phones, but the implementation is one-size-fits-all, as the technique is not adjustable; an administrator therefore has little flexibility to retrieve removed content. Limitations of their work include the lack of ways to generate a good summary from documents that have multiple main themes or specific constructs such as bullets and lists, to cross-compare section headings with text, and to detect relationships among sections for safe merging.

Finn et al. [1] discuss methods for content extraction from single-article sources, where content is presumed to be in a single body. The algorithm tokenizes a page into either words or tags; the page is then sectioned into three contiguous regions, placing boundaries to partition the document such that most tags are placed into the outside regions and word tokens into the center region. This approach works well for single-body documents, but does not produce good results for multi-body documents, i.e., where content is segmented into multiple smaller pieces, as is common on Web logs ("blogs") like Slashdot.

McKeown et al. [8][15], in the NLP group at Columbia University, detect the largest body of text on a webpage (by counting the number of words in the body of the HTML) and classify that as content. This method works well with simple pages. However, since this algorithm counts all the words on the page (content and clutter), it produces noisy or inaccurate results when handling multi-body documents, especially those with random advertisement and image placement. In order to compensate for such limitations, several individual website-specific extraction scripts have been written. This conflicts with one of my goals, namely being able to dynamically support a wide range of sites without writing specialized code for each, as it is not a scalable solution.

Buyukkokten et al. [3][10] define accordion summarization as a strategy where a page can be shrunk or expanded much like the instrument. They also discuss a method to transform a web page into a hierarchy of individual content units called Semantic Textual Units, or STUs. First, STUs are built by analyzing syntactic features of an HTML document, such as text contained within paragraph (<P>), table cell (<TD>), and frame component (<FRAME>) tags. These features are then arranged into a hierarchy based on the HTML formatting of each STU. STUs that contain HTML header tags (<H1>, <H2>, and <H3>) or bold text (<B>) are given a higher level in the hierarchy than plain text. This hierarchical structure is finally displayed on PDAs and cellular phones. While Buyukkokten's hierarchy is similar to our DOM tree-based model, DOM trees remain highly editable and can easily be reconstructed back into a complete webpage. DOM trees are also a widely adopted W3C standard, easing support and integration of our technology. From a results standpoint, STUs produce results very similar to those obtained when processing DOM trees. The main problem with the STU approach is that once the STUs have been identified, Buyukkokten et al. [3][4] perform complex summarization on them to produce the content that is then displayed on PDAs and cell phones.

Kaasinen et al. [5] discuss methods to divide a web page into individual units likened to cards in a deck. A web page is divided into a series of hierarchical cards that are placed into a deck. This deck of cards is presented to the user one card at a time for easy browsing. The paper also suggests a simple conversion of HTML content to WML (Wireless Markup Language), resulting in the removal of simple information such as images and bitmaps from the web page so that scrolling is minimized for small displays. While this reduction has advantages, the proposed method works with flat HTML files rather than a tree-based approach, which creates problems in differentiating layout tables from other tables. The problem with the deck-of-cards model is that it relies on splitting a page into tiny sections that can then be browsed as windows, which means it is up to the user to determine on which cards the actual contents are located.

Chen et al. [56] propose an approach similar to the deck-of-cards method, except that they use the DOM tree for organizing and dividing up the document. They propose showing an overview of the desired page so the user can select the portion of the page he/she is truly interested in. When selected, that portion of the page is zoomed into full view. One of their key insights is that the overview page is actually a collection of semantic blocks that the original page has been broken up into, each one color-coded to show the different blocks to the user. This effectively provides the user with a table of contents from which to select the desired section. While this is an excellent idea, it still involves the user clicking on the block of choice, and then going back and forth between the overview and the full view. Since improving a user's workflow is one of my target applications, my goal is to extract and provide the relevant content right away instead of making the user go back and forth between the content and the table of contents.

3.2 Content Reformatting

Multiple approaches have been suggested for formatting web pages to fit on the small screens of cellular phones and PDAs (including the Opera browser [16] and its use of the handheld CSS media type, and Bitstream ThunderHawk [17]); however, the reformatting approaches generally do not distinguish significant from subsidiary content (that is, clutter), nor remove the latter. Opera [21], for example, utilizes its proprietary Small Screen Rendering technology that reformats web pages to fit inside the screen width. Most of these technologies are closed source, as they are commercial products; therefore the exact details of their implementations are unknown. However, since they focus on reformatting a webpage, my system should do at least as well as theirs, as my proxy could either simulate their reformatting (or use their algorithms as plug-ins) on the original page, or introduce a precursor extraction phase prior to the reformatting. Browsers built for cell phones (Pocket Internet Explorer and the Opera cell phone browser) provide similar reformatting behavior. They either depend on website developers to provide a low-bandwidth, constrained-device version of the site, or use text resizing and content reformatting to try to fit as much data as possible on the tiny screen. Some browsers will even attempt to detect content by simply rendering the middle of the page as the primary view, in an attempt to move past the banners and menus that are typically found at the top of the page. This approach has two problems: first, constrained devices usually have constrained web connections, and jumping to the middle of the page still requires the device to download all of the non-content; second, the middle of the page is not guaranteed to be the beginning of the content section, and a first-time visitor to such a page would not know which direction to scroll first.

3.3 Genre Classification

My approach to reducing human involvement in applying heuristic settings will be to automatically detect and utilize the physical layout and content genre of a given website. Crunch uses a frequency distribution of the words associated with the webpage to infer its genre. These results are used to improve the extraction process by comparing them to previously known results that work well for certain genres of sites and utilizing those settings. There is some related work that tries to classify webpages by genre.

Roussinov et al. [64] define genre as a group of documents with similar form, topic or purpose, a distinctive type of communicative action, characterized by a socially recognized communicative purpose and common aspects of form. They show the advantages of browsing the web by genre, but their application is only designed to help categorize documents so users can see similar pages. Their work does not address content extraction; however, they have shown several results on improving web search engine results.

Karlgren et al. [65][66] showed that the texts that were judged relevant to a set of queries differ substantially from the texts that were not relevant. They used topical clustering in conjunction with stylistics-based genre prediction to build an interactive information retrieval engine that facilitates multidimensional presentation of search results. They built a genre palette by interviewing users, identifying several genre classes that are useful for web filtering. The system was evaluated by users given particular search tasks. The subjects did not do well on the search tasks, but all but one reported that they liked

the genre-enhanced search interface. Subjects used the genres in the search interface to filter the search results.

Stamatatos et al. [61] show that the frequency of word occurrence is very useful in automatic text genre classification. I have used these results in my research, as they are domain independent and require minimal computation. Kessler et al. [62] and Argamon et al. [63] have shown good results with a technique that detects genre based on surface cues, where comparisons are made between the performances of function words and stop words (stop words include pronouns, articles and other words that do not add to the genre information). However, their approach depends on substantial bodies of text, and the domain of their classification is limited to the sites it was trained on. Since they do not learn new sites dynamically, their approach cannot be applied to arbitrary websites. Rauber et al. [68] infer document genre from structural features and display documents as though grouped onto book shelves. They use the age of a document and the frequency of lookups as important distinguishing features, and present a method of automatic analysis based on various surface-level features of the text. Most of this work is aimed at extracting content from materials observed over a long period of time; this would not scale for my goals, as users often visit sites that they have never gone to before.

While all these approaches are valid, I find them all to be lacking for my problem domain. Most fall into the one-size-fits-all approach, which means they would not scale dynamically to websites not previously trained on. I also find that most approaches try to extract a single word or phrase to describe the context of a given page. I instead compute the genre from a large set of words with individual weights, as this allows for greater accuracy when comparing across an extremely varied range of websites.
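To make the word-frequency idea concrete, the following is a minimal Java sketch of building a weighted term-frequency profile for a page's text and comparing it against a stored genre profile using a cosine similarity. The class and method names, the tokenization rule, and the similarity measure are illustrative assumptions, not the actual Crunch code:

    import java.util.HashMap;
    import java.util.Map;

    // Illustrative sketch: build a word-frequency profile for a page's text
    // and compare it against a stored genre profile. Names are hypothetical.
    public class GenreProfile {

        // Count word occurrences in the visible text of a page.
        public static Map<String, Integer> wordFrequencies(String text) {
            Map<String, Integer> freq = new HashMap<>();
            for (String token : text.toLowerCase().split("[^a-z]+")) {
                if (token.length() > 2) {                 // skip very short tokens
                    freq.merge(token, 1, Integer::sum);
                }
            }
            return freq;
        }

        // Cosine similarity between two frequency profiles; higher means
        // the page looks more like the stored genre.
        public static double similarity(Map<String, Integer> a, Map<String, Integer> b) {
            double dot = 0, normA = 0, normB = 0;
            for (Map.Entry<String, Integer> e : a.entrySet()) {
                dot += e.getValue() * b.getOrDefault(e.getKey(), 0);
                normA += e.getValue() * e.getValue();
            }
            for (int v : b.values()) {
                normB += v * v;
            }
            return (normA == 0 || normB == 0) ? 0 : dot / Math.sqrt(normA * normB);
        }
    }

Under this sketch, the genre whose stored profile scores highest against a page's profile would determine which preset filter settings to apply.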

4. Approach

My solution employs a series of techniques that address the problems discussed in Sections 2 and 3. One of the key insights of my thesis is that, in order to analyze a web page for content extraction, the page should be passed through an HTML parser that creates a Document Object Model (DOM) tree. The Document Object Model is a standard for creating and manipulating in-memory representations of HTML (and XML) content. By parsing a webpage's HTML into a DOM tree, I can not only extract information from large logical units similar to Buyukkokten's Semantic Textual Units (STUs, see [3][4]), but can also manipulate smaller units such as specific links within the structure of the DOM tree. In addition, DOM trees are highly transformable and can easily be used to reconstruct a complete webpage. Finally, increasing support for the Document Object Model makes my solution widely portable. Most information retrieval algorithms and content reformatting solutions work with plain, flat HTML; working with a tree gives access to a hierarchical data structure, something that is lost when working with a flat file.

Most existing algorithms are implemented as plug-ins for a particular browser. Others involve processing data offline and can only utilize the results on a specific and limited set of sites. In order to make the extraction as transparent to the user as possible, Crunch operates as a web proxy, which can be used together with most web browsers and other HTTP clients, and filters the web page on the return trip so that only the extracted content is sent to the browser. Additionally, such a system should be able to perform on-line context detection and learn new results, adding the new knowledge to its repertoire.

One of the critical problems most other solutions suffer from is that they follow the one-size-fits-all approach: content is extracted for all websites and for all users in exactly the same manner. Furthermore, none of these approaches take the target application into account; their algorithms produce the same results whether targeted at screen readers or at constrained devices, even though those applications call for different output. I have therefore implemented Crunch as a plug-in framework where a programmer or administrator has the ability to implement and insert their own plug-ins and then use them individually or in conjunction with the other plug-ins available. The plug-ins are adjusted or toggled on/off from a GUI that is accessible to the end user, thus allowing the actual user to decide what he/she sees. It was important to see whether this heuristics/plug-in API was truly implementable by someone not intimately familiar with the code, so effort was put into making the API as easy to implement as possible.

The data output by most content extraction systems is not adaptive to different applications (for example, screen readers vs. small-screen devices). Since, as pointed out earlier, the applications for content extraction are many, my goal was to create an engine that adapts to the various end uses. Therefore, one of my hypotheses is that the data output by Crunch can be different for different applications. I intend to demonstrate this by showing how the proxy, despite having the same settings, produces different results depending on whether it is targeted at a blind user or at a constrained device like a PDA.

Once data has been extracted, I have found that if a user is not satisfied with the output, it is hard for them to access the deleted content without reloading the page after tweaking the proxy settings. Learning user preferences and preserving otherwise-removed content is therefore a goal. Moreover, reorganizing some of the non-content into a different section of the webpage is a possibility.

One problem with content extraction in general is that it is impossible to determine the intention of the author and the desires of the reader. Therefore my goal is to approach the problem heuristically and work towards making the heuristics as accurate as possible. Crunch extracts the content, with filters customizable by an administrator and/or by a savvy user. No one-size-fits-all algorithm could achieve this goal. In particular, I will not attempt to model either author or user tasks, nor their corresponding context or intentions; any non-intrusive approach to doing so would also likely be heuristic and thus imprecise, although proving this is outside the scope of this thesis. Therefore, one of the limitations of my framework is that Crunch can potentially remove items from the web page that the user may be interested in, and may present content that the user is not particularly interested in. My strategy to address this is to employ a multi-pass filtering system where the resulting DOM tree produced after each stage of filtering is compared to the original copy. If too much or too little has been removed given the settings (the percentage is settable by a programmer or administrator), I assume that the settings were set incorrectly and ignore them during the next pass over the DOM.

I also try to determine the classification of a given website, both in its physical layout and in the context of its content. I have found that, given a manually created, frequently updated database of preset heuristic settings for different genres of websites, I can use this contextual information to dynamically apply matching settings from the database and produce better extraction results.

Finally, inferring tree patterns should give rise to highly accurate results in content extraction. Tree pattern inference [70] has been used to identify common parts in the DOM tree of a highly structured webpage and to develop a system for inferring reusable patterns from positive examples given by a user. Using the inverse of this approach, I will be able to identify the non-similar elements from different pages belonging to the same site. My goal is to show that the dissimilar elements will prove to be content, while the similar elements will turn out to be the common navigation elements, links and advertisements that wrap all the webpages of that site. This technique should work especially well for sites that are built with a content management system (CMS), since most non-content remains the same from page to page. Therefore, when two or more pages from a CMS website are compared, the similar elements from such pages, i.e., the framework around the webpage itself (the menus, links, tables and other layout-related HTML), will be found as similar patterns in the DOM tree, and the rest, i.e., the dissimilar elements, will yield the content.
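As a simplified illustration of this cross-page comparison (not the tree pattern inference algorithm of [70] itself), the following Java sketch against the standard org.w3c.dom API collects a crude signature for every subtree of one page and treats nodes of a second page whose signatures never appear in the first as candidate content; all names here are illustrative assumptions:

    import org.w3c.dom.Document;
    import org.w3c.dom.Element;
    import org.w3c.dom.Node;
    import org.w3c.dom.NodeList;
    import java.util.ArrayList;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    // Illustrative sketch of DOM-tree comparison across two pages of one site:
    // subtrees that appear on both pages are treated as template/non-content,
    // while subtrees unique to the second page are candidate content.
    public class TreePatternDiff {

        // A crude subtree signature: tag name plus normalized text.
        static String signature(Node n) {
            String text = n.getTextContent() == null ? "" : n.getTextContent();
            return n.getNodeName() + "|" + text.replaceAll("\\s+", " ").trim();
        }

        static void collectSignatures(Node n, Set<String> out) {
            out.add(signature(n));
            NodeList kids = n.getChildNodes();
            for (int i = 0; i < kids.getLength(); i++) {
                collectSignatures(kids.item(i), out);
            }
        }

        // Return elements of 'page' whose subtrees do not occur in 'other'.
        public static List<Element> candidateContent(Document page, Document other) {
            Set<String> common = new HashSet<>();
            collectSignatures(other.getDocumentElement(), common);
            List<Element> result = new ArrayList<>();
            findUnique(page.getDocumentElement(), common, result);
            return result;
        }

        static void findUnique(Node n, Set<String> common, List<Element> result) {
            if (n instanceof Element && !common.contains(signature(n))) {
                result.add((Element) n);
                return;          // whole subtree is unique; no need to descend further
            }
            NodeList kids = n.getChildNodes();
            for (int i = 0; i < kids.getLength(); i++) {
                findUnique(kids.item(i), common, result);
            }
        }
    }

In a CMS-backed site, the menus, link tables and advertisement wrappers would match across pages and fall out of the candidate list, leaving the article body behind.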

4.1 Hypotheses

To summarize, my hypotheses can be stated as follows:

a) Analysis of a web page using its Document Object Model (DOM) tree structure rather than the flat HTML makes content extraction easier, as parsing the page into its DOM decomposes the page into its constituent parts. DOM trees are highly transformable and can easily be used to reconstruct a complete webpage. Working with a tree gives access to a hierarchical data structure, something that is lost when working with a flat file. Since this is an integral part of my approach, this is not a hypothesis I will be testing through my experiments.

b) Implementing Crunch as a web proxy, which can be used together with most web browsers and other HTTP clients and which filters the web page on the return trip so that only the extracted content is sent to the browser, will make the content extraction framework independent of any specific browser.

c) Implementing Crunch as a plug-in based infrastructure allows the end user to have full control over what the resulting HTML page will look like. The plug-ins are adjusted or toggled on/off from a GUI that is accessible to the end user, thus allowing the actual user to decide what he/she sees.

d) Making Crunch adaptive to the target application will allow the proxy to produce different output despite having identical settings. Therefore, Crunch would produce different results depending on whether it is targeted at a blind user or at a constrained device like a PDA.

e) Preserving links and other non-content slated for removal, and giving the user the option to display this data at the bottom of the page in case of interest, will allow a user to retain access to what Crunch deems non-content.

f) Implementing a multi-pass filtering system, in which the resulting DOM tree produced after each stage of filtering is compared to the original copy so that settings can be adjusted dynamically if too much or too little content is removed in a filter pass, will eliminate potentially bad results produced by Crunch (a minimal sketch of such an acceptance check follows this list).

g) Utilizing the classification of a given website, both its physical layout and the context of its content, will allow Crunch to automatically apply settings and extract webpage content without human intervention.

h) Inferring tree patterns [70], which have been used to identify common parts in the DOM tree of a highly structured webpage and to develop a system for inferring reusable patterns, will give rise to highly accurate results in content extraction.
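The following is a minimal Java sketch of the acceptance check behind hypothesis (f), assuming the original and filtered DOM trees are both available in memory; the thresholds, the node-count measure, and the class and method names are illustrative rather than taken from the Crunch source:

    import org.w3c.dom.Document;
    import org.w3c.dom.Node;

    // Illustrative acceptance check for a multi-pass filter: if a pass removed
    // too much or too little of the page, its result is rejected and the
    // settings that produced it are ignored on the next pass over the DOM.
    public class PassEvaluator {
        // Hypothetical bounds; in Crunch these would be settable by a
        // programmer or administrator.
        private final double minKeepRatio = 0.05;
        private final double maxKeepRatio = 0.95;

        static int countNodes(Node n) {
            int count = 1;
            for (int i = 0; i < n.getChildNodes().getLength(); i++) {
                count += countNodes(n.getChildNodes().item(i));
            }
            return count;
        }

        // True if the filtered tree is an acceptable reduction of the original.
        public boolean acceptable(Document original, Document filtered) {
            double ratio = (double) countNodes(filtered.getDocumentElement())
                         / countNodes(original.getDocumentElement());
            return ratio >= minKeepRatio && ratio <= maxKeepRatio;
        }
    }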

5. Feasibility

My solution employs multiple extensible techniques that incorporate the advantages of previous work on content extraction, like accordion summarization and content discovery, and attempts to avoid common pitfalls like noisy results and slow performance. Since a content extraction algorithm can be applied to many different applications, for example in the fields of NLP and IR as well as in assistive technologies like those that help the visually impaired, I implemented my solution so that it can easily be used in this variety of cases. Through an extensive set of preferences, the extraction algorithm can be highly customized for different uses. These settings are easily editable through the GUI, through method calls exposed through a simple API, or by direct manipulation of the settings file on disk. The GUI itself can also easily be integrated (as a Swing JPanel for Crunch 1.0, or as a standard widget for Crunch 2.0 and 3.0) into any Java project, or one can customize it directly. The content extraction algorithm is also implemented as an interface for easy incorporation into other programs. The content extractor's broad set of features and customizability allow others to easily add their own version of the algorithm to any product. Further discussion of Crunch as a framework can be found in Section 5.1.

In order to analyze a web page for content extraction, the page is first passed through an HTML parser that corrects HTML errors and then creates a DOM tree representation of the web page. (HTML on the Internet can be extremely malformed, and most popular browsers like Internet Explorer and Mozilla handle incorrect HTML by making the closest guess to what the HTML should be.) Once parsed, the resulting DOM document can be seamlessly shown as a webpage to the end user by flattening the tree and producing the HTML again. This process accomplishes the steps of structural analysis and structural decomposition analogous to those done by several other techniques (see Section 3). The DOM tree is hierarchically arranged and can be analyzed in sections or as a whole, providing a wide range of flexibility for the extraction algorithm. Just as the approach by Kaasinen et al. modifies the HTML to restructure the content of the page, my content extractor navigates the DOM tree recursively, using a series of different filtering techniques to remove and adjust specific nodes and leave only the content behind.

In my first attempt, Crunch 1.0, I designed a one-pass system that extracted content by running a series of filters one after the other, i.e., the selected filters simply ran sequentially on the output produced by the previous filters. This at times caused parts of a webpage that the user wanted to be removed. In Crunch 2.0, this was amended by making it a multi-pass system: multiple copies of a webpage are kept in memory, and a filter checks for the optimal copy to work on. Further discussion can be found in Section 5.2, and a large number of examples demonstrating the results of different filter settings are shown in Appendix B.

Crunch as a framework handles the webpage, but the filters that are plugged into the framework make it dynamic and customizable. The framework defines a standard API, which a programmer implements when creating a plug-in. The programmer also decides the order in which the filters are run in order to maximize the benefit of each one.
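As a rough indication of what such a plug-in contract might look like, here is a minimal hypothetical sketch in Java; the interface and class names are invented for illustration and are not the actual Crunch extension API (the real Crunch 1.0 and 2.0 APIs are compared in Appendix B, Table 3):

    import org.w3c.dom.Document;
    import org.w3c.dom.Node;
    import org.w3c.dom.NodeList;

    // Hypothetical sketch of a filter plug-in contract: the proxy hands each
    // plug-in the parsed DOM tree and the plug-in edits it in place.
    public interface ContentFilter {
        // Human-readable name shown in the GUI's plug-in list.
        String getName();

        // Apply the filter to the parsed page; implementations may remove,
        // rewrite, or annotate nodes of the DOM tree.
        void filter(Document page);
    }

    // A trivial example plug-in that strips all <img> elements.
    class ImageRemover implements ContentFilter {
        public String getName() { return "Image remover"; }

        public void filter(Document page) {
            NodeList images = page.getElementsByTagName("img");
            // Walk backwards because the NodeList is live and shrinks as we remove.
            for (int i = images.getLength() - 1; i >= 0; i--) {
                Node img = images.item(i);
                img.getParentNode().removeChild(img);
            }
        }
    }

Under this kind of contract, the framework would simply invoke each enabled plug-in in the order chosen by the programmer or administrator.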

An example construction of a Crunch plug-in is given later in Section 5.1. Each of the filters can easily be turned on and off by the user, the administrator or the programmer, and can be customized to a certain degree through a GUI, if one is provided by the programmer. There are two sets of filters that have been implemented, with different levels of granularity, in both Crunch 1.0 and 2.0/3.0. The first set of filters simply ignores tags or specific attributes within tags but keeps track of them in memory. With these filters, images, links, scripts, styles, and many other HTML elements can be quickly removed from the web page. This process of filtering is similar to Kaasinen's conversion of HTML to WML. The second set of filters is more complex and algorithmic, providing a higher level of content extraction. This set, which can be extended, currently consists of the advertisement remover, the link list remover, the removed link retainer and the empty table remover. In Crunch 2.0, I also added filters that allow the user to control the font size and word wrapping of the output, as well as heuristic functions guiding the multi-pass processor to evaluate the acceptability of the output as each filter pass edits the DOM tree. This ensures that Crunch does not suffer from some of the pitfalls of version 1.0, where occasionally pages returned null output after passing through Crunch, e.g., link-heavy pages, as shown later in Figures 17 and 18. Finally, I have attempted to allow for greater control over most of the filters by adding supplementary options. For example, users now have the ability to control, at a finer granularity, complex web pages where certain HTML structures are embedded within others, e.g., within table cells.

The advertisement remover uses a common and efficient technique to remove advertisements. As the DOM tree is parsed, the values of the src and href attributes throughout the page are surveyed to determine the servers to which the links refer. If an address matches against a list of common advertisement servers, the node of the DOM tree that contained the link is removed. This process is similar to the use of an operating-system-level hosts file to prevent a computer from connecting to advertiser hosts. Hanzlik [6] examines this technique and cites a list of hosts, which I use for my advertisement remover. In order to avoid the common pitfall of deploying a fixed blacklist of advertisers, the software also periodically updates the list from a site that specializes in creating such blacklists; this is a technique employed by most ad-blocking software.

The link list remover employs a filtering technique that removes all link lists, which are bodies of content, either in the page or within table cells, for which the ratio of the number of links to the number of non-linked words is greater than a specific threshold (known as the link/text removal ratio). When the DOM parser encounters a table cell, the link list remover tallies the number of links and non-linked words. The number of non-linked words is determined by taking the number of letters not contained in a link and dividing it by the average number of characters per word, which is preset to 5 (although it may be overridden by the user and could, in principle, be derived from the specific web page or web domain). If the ratio is greater than the user-determined link/text removal ratio (the default is 0.35), the content of the table cell (and, optionally, the cell itself) is removed. This algorithm succeeds in removing most long link lists that tend to reside along the sides of web pages, while leaving the text-intensive portions of the page intact.

After these steps, I have found that numerous tables that are either completely empty or have several empty cells remain on the webpage and take up large swaths of space. The empty table remover removes tables that are empty of any substantive information. The user determines, through settings, which HTML tags should be considered substance and how many characters within a table are needed for it to be viewed as substantive, set much like the word-size or link-to-text ratio settings described earlier. This does not require much prior knowledge of HTML, since the syntax of the markup language is simple and matches words from the English language closely, e.g., table, form, etc. The table remover checks a table for substance after it has been parsed through the filter. If a table has either no substance or less than some user-defined threshold, it is removed from the tree. This algorithm effectively removes any tables left over from previous filters that contain small amounts of unimportant information. This filter is typically run towards the end to maximize its benefit.

While the above filters remove non-content from the page, the removed link retainer adds link information back at the end of the document to keep the page browsable. The removed link retainer keeps track of all the text links that are removed throughout the filtering process. After the DOM tree is completely parsed, the list of removed links is added to the bottom of the page. In this way, any important navigational links that were previously removed remain accessible, and since the parser had initially parsed them as separate units, each menu or navigational link is kept intact and they can all be viewed without any loss of the original setup or style.

After the entire page is parsed and modified appropriately, it can be output either as HTML or as plain text (filters could be added to translate to another output format such as WML). The plain text output removes all the tags and retains only the text of the site, while eliminating most white space. The result is a text document that contains the main content of the page in a format suitable for summarization, speech rendering or storage. This technique is significantly different from Rahman et al. [2], which states that a decomposed webpage should be analyzed using NLP techniques to find the content. It is true that NLP techniques may produce better results, but at the cost of far more time-consuming processing. My system does not technically find the content but instead eliminates likely non-content; in this manner, I can still process and return results for sites that do not have an explicit main body.

Crunch, however, does have some limitations:

1) Crunch cannot filter non-HTML content like Flash. It allows a boolean choice of whether to keep or remove such structures, but it cannot help edit or filter within the animation itself.

2) Crunch does not distinguish between different users. There is only one set of options, whether an individual is using the proxy or whether it is set up as groupware.
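As a concrete illustration of the link list remover's ratio test described above, the per-cell decision might look like the following simplified Java sketch; the class and helper names are hypothetical and this is not the production filter:

    import org.w3c.dom.Element;
    import org.w3c.dom.NodeList;

    // Simplified sketch of the link list remover's decision for one table cell:
    // estimate non-linked words from character counts and compare the
    // link-to-text ratio against the removal threshold (default 0.35).
    public class LinkListHeuristic {
        private double linkTextRemovalRatio = 0.35;  // user-adjustable threshold
        private int charsPerWord = 5;                // preset average word length

        public boolean shouldRemove(Element tableCell) {
            NodeList anchors = tableCell.getElementsByTagName("a");
            int linkCount = anchors.getLength();

            // Characters of text inside links vs. total text in the cell.
            int linkedChars = 0;
            for (int i = 0; i < linkCount; i++) {
                linkedChars += anchors.item(i).getTextContent().length();
            }
            int totalChars = tableCell.getTextContent().length();

            int nonLinkedWords = Math.max(0, (totalChars - linkedChars) / charsPerWord);
            if (nonLinkedWords == 0) {
                return linkCount > 0;   // cell is all links and no prose: remove it
            }
            return (double) linkCount / nonLinkedWords > linkTextRemovalRatio;
        }
    }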

5.1 Crunch 1.0

Overview

In order to make the extractor easy to use, I implemented it as a web proxy (the program and instructions are available from the project website). The proxy can be used as a personal filter by individual users as well as a central system for groups of people. In the case where Crunch is set up as groupware, users can access the proxy by simply setting their browser to do so, as most modern browsers can point to external proxies for filtering content. This allows an administrator to set up the extractor and provide content extraction services for a group. The proxy is coupled with a graphical user interface (GUI) to customize its behavior. The separate screens of the GUI are shown in Appendix A. Figure 1 shows the very broad options that can be turned on or off to ignore certain tags completely, Figure 2 has more advanced options that give more granular control, and Figure 3 shows controls on the output. The current implementation of the proxy is in Java for cross-platform support, and it has been successfully tested on Windows, MacOS, Linux and Solaris.

The Content Extraction framework itself has a complexity of O(N + P), where N is the number of nodes in the DOM tree after the HTML page is parsed and P is the sum of the complexities of the plug-ins; therefore the overall complexity is O(N) without plug-ins. Crunch 1.0 is implemented as a one-pass system, so it is the plug-ins that truly determine the running time of the system. For example, the plug-in that edits tables has an algorithm whose worst-case running time is O(M^2) for complex nested tables; without such nesting, the typical running time is O(M), where M is the number of elements composing the table. The overall running time of the system therefore works out to O(N + M^2) with the table plug-in. During tests, the algorithm performs quickly and efficiently following proxy customization. The proxy can handle most web pages, including those with badly formatted HTML, because of the corrections automatically applied while the page is parsed into a DOM tree. However, sites that are extremely link-heavy produce bad results; when the link-to-text ratio approaches 100%, anomalous behavior was experienced.

Depending on the type and complexity of the web page, the content extraction suite can produce a wide variety of output. The algorithm performs well on pages with large blocks of text such as news articles and mid-size to long informational passages. Most navigational bars and extraneous elements of web pages, such as advertisements and side panels, are removed or reduced in size. Figure 4 and Figure 5 show an example before and after Crunch is applied, respectively. When printed out in text format, most of the resulting text is directly related to the content of the page, making it possible to use summarization and keyword extraction algorithms efficiently and accurately. Text-only output for this example is shown in Figure 6.

The initial implementation of the proxy was designed for simplicity in order to test and design content extraction algorithms. It spawns a new thread to handle each new connection, limiting its scalability. Most of the performance drop from using the proxy

originates from the proxy's need to download the entire page before sending it to the client.

More examples

Figure 7 and Figure 8 show the front page of a website dedicated to a first-person shooter game, before and after content extraction, respectively. Despite producing results that are rich in text, in this particular Crunch configuration the screenshots of the game are also removed, which the user might deem relevant content; this is a case where the user might want to tweak the default settings. Figure 9 and Figure 10 show a link-heavy page in its pre- and post-filtered state, respectively. Since the site is a portal that contains links and little else, the proxy does not find any coherent content to keep. I investigated heuristics that would leave such pages either untouched, or alternatively employ only the most basic filters that remove only advertisements and banners, and implemented such techniques in Crunch 2.0.

From these examples one may get the impression that input fields are affected irregularly by the proxy; this is because the run-time decision of leaving them in or removing them from the page depends on the tables or frames they are contained in. Forms are handled as one semantic unit: either a form is displayed or not, based on the user setting. Additionally, it should be mentioned that there is no preservation of objects that may be lost after the HTML is passed through the parser, except that links can be retained as explained above. The user would have to change the settings of the proxy and reload the page to see the previously removed content. However, a different set of filters could be developed to move rather than just remove content, for forms or other identifiable HTML elements or data.

Implementation details

The life cycle of the process that gets a page to the client's browser through the proxy is, at a very high level, as follows: the client passes a request for the webpage to the proxy, which opens a socket, fetches the original content of the page, and parses the page to create a DOM tree representation. The page is then passed through the different filters based on the settings chosen by the user. The edited DOM tree is then either flattened into HTML form, to be sent back to the client's browser, or stripped of all HTML tags so that only the text content is sent to the client for rendering. An architectural diagram of Crunch 1.0 is shown in Figure 11.

In more detail, in order to analyze a web page for content extraction, the page is passed through an HTML parser that creates a Document Object Model tree. The algorithm begins at the root node of the DOM tree (the <HTML> tag) and proceeds by parsing through its children using a recursive depth-first search function called filterNode(). The function initializes a Boolean variable (mCheckChildren) to true to allow filterNode() to check the children. The currently selected node is then passed through a filter method called passThroughFilters() that analyzes and modifies the node based on a series of user-selected preferences. At any time within passThroughFilters(),

the mCheckChildren variable can be set to false, which allows the individual filter to prevent specific subtrees from being filtered. That is, certain filters can elect to produce the final result at a given node and not allow any other filters to edit the content after that. After the node is filtered accordingly, filterNode() is recursively called on the children if the mCheckChildren variable is still true.

The filtering method, passThroughFilters(), performs the majority of the content extraction. It begins by examining the node it is passed to see if it is a text node (data) or an element node (HTML tag). Element nodes are examined and modified in a series of passes. First, any filters that edit an element node but do not delete it are applied. For example, the user can enable a preference that will remove all table cell widths, and it would be applied in the first phase because it modifies the attributes of table cell nodes without deleting them. The second phase in examining element nodes is to apply all filters that delete nodes from the DOM tree. Most of these filters prevent the filterNode() method from recursively checking the children by setting mCheckChildren to false. A few of the filters in this subset set mCheckChildren to true so as to continue with a modified version of the original filterNode() method. For example, the empty table remover filter sets mCheckChildren to false so that it can itself recursively search through the <TABLE> tag using a bottom-up depth-first search, while filterNode() uses a top-down depth-first search. Finally, if the node is a text node, text filters are applied, if any.

I have implemented several basic filters in order to demonstrate the capabilities of Crunch. The pseudo-code for the text size modifier filter looks like:

    procedure wrapTextNodes(n: node)
        if n is a text node then
            f := new font node
            f.sizeAttribute := settings.size
            n.parent.replaceChild(n, f)
            f.addChild(n)
        else
            if n.name = "TITLE" then
                return
            nodes := n.getChildren()
            foreach node m in nodes
                wrapTextNodes(m)

And the empty table removing plug-in looks like:

    procedure removeEmptyTables(iNode: node)
        if iNode.hasChildNodes() then
            next := iNode.getFirstChild()
            while next != Ø
            begin
                current := next
                next := current.getNextSibling()
                filterNode(current)
            end
        lengthForTableRemover := 0
        empty := processEmptyTable(iNode)
        if empty then
            iNode.getParentNode().removeChild(iNode)
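For concreteness, a minimal Java rendering of the filterNode()/passThroughFilters() traversal described above, written against the standard org.w3c.dom API, might look like the following. The method and field names (filterNode, passThroughFilters, mCheckChildren) come from the description above, but the surrounding class structure is an illustrative sketch rather than the actual Crunch source:

    import org.w3c.dom.Node;
    import org.w3c.dom.NodeList;

    // Sketch of the recursive depth-first traversal driving the filters.
    public class ContentExtractor {
        private boolean mCheckChildren;

        void filterNode(Node node) {
            mCheckChildren = true;        // allow descent by default
            passThroughFilters(node);     // a filter may set mCheckChildren = false
            if (mCheckChildren) {
                NodeList children = node.getChildNodes();
                // Iterate backwards because filters may remove children as we go.
                for (int i = children.getLength() - 1; i >= 0; i--) {
                    filterNode(children.item(i));
                }
            }
        }

        void passThroughFilters(Node node) {
            if (node.getNodeType() == Node.ELEMENT_NODE) {
                // Phase 1: filters that modify attributes but keep the node.
                // Phase 2: filters that may delete the node; these typically set
                //          mCheckChildren = false and handle their own subtree.
            } else if (node.getNodeType() == Node.TEXT_NODE) {
                // Text filters, if any.
            }
        }
    }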

Feasibility demonstration

The work on Crunch 1.0 shows that using a DOM-based content extraction model is feasible and produces excellent results. Furthermore, it shows the validity of building a content extraction engine as a framework with multiple pluggable heuristics that work as plug-ins.

5.2 Crunch 2.0

Overview

Crunch 1.0 nicely demonstrated the proof-of-concept design of the system as a framework, but certain problems needed to be addressed in order for Crunch to be widely used. After releasing Crunch 1.0 in September 2002, I received several suggestions from early users for additions and improvements. An informal user study of blindfolded students was conducted in May 2003, followed by a formal user study with blind and visually impaired users begun in December 2003 by Dr. Michael Chiang, an ophthalmologist at Columbia University; the first results from the latter are discussed in our journal paper to be published in WWWJ [72]. The NLP group at Columbia University tried using Crunch briefly for their Newsblaster [8][9] project, a system that automatically tracks, clusters and summarizes each day's news programmatically. They used Crunch as their input mechanism in order to run their natural language processing algorithms on content extracted by Crunch rather than on noisy data streams coming straight from the web.

As indicated in Section 2, Crunch 2.0 is similar to its predecessor. However, I spent time on improving its performance and user interface, and several changes were made in the supplied set of heuristic filters, e.g., to show more useful results for link-heavy pages. Additional filters were added that allow the user to control the font size and word wrapping of the output. Perhaps most importantly, heuristic functions were added in the form of a multi-pass system that evaluates the output of the DOM tree passed through each filter. This prevents link-heavy pages from returning blank pages as output; with the new result-checking heuristics of Crunch 2.0, I instead got

better results. Figure 12 shows the original form of a link-heavy page; Figure 13 and Figure 14 show the outputs from Crunch 1.0 and Crunch 2.0, respectively. Finally, in the newer version, I attempted to allow for greater flexibility in most of the filters by adding supplementary options to each. For example, users now have the ability to control, at a finer granularity, complex web pages where certain HTML structures are embedded within others, such as having the ability to control not only the content on the entire page but also within table cells. Appendix B shows several suites of screenshots with different sets of Crunch 2.0 filters applied.

Like Crunch 1.0, the complexity of the newer version remains O(N + P); however, the worst-case running time increases to O(N*P), where N is the number of nodes in the DOM tree after the HTML page is parsed and P is the complexity of the plug-in with the highest running time. The increase in worst-case complexity is due to the switch to a multi-pass system: in the case of a bad result, a filtered webpage may have to revert to a previous state and re-run through the proxy with a different set of options, and this may happen for any number of nodes in the DOM tree.

Technical improvements in version 2

Even though the basic architecture of the system is the same as shown in Figure 13, there are some notable changes.

1) Replaced OpenXML with NekoHTML

The original version of Crunch used OpenXML [25] as its HTML parser. OpenXML, however, has efficiency problems that are unlikely to be fixed, since OpenXML is apparently no longer an active project. Instead, I switched to using NekoHTML [35], an HTML scanner, tag balancer and parser for the Apache open-source project Xerces [35]. This new parser has many benefits, most notably increased parsing speed and robust correction of buggy HTML/XML. One of the key longer-term benefits is that I am now using a parser that is under active development. NekoHTML currently has some problems parsing certain pages; in particular, the output is not always rendered the same as the input, e.g., for certain complex nested tables and some CSS-enabled pages [17]. However, most of these errors are minor cosmetic ones that Crunch usually manages to fix in its new multi-pass scheme. Additionally, the developers of NekoHTML are working on its deficiencies.

NekoHTML assists in the handling of multiple versions of HTML. Much like my work with the previous parser, Crunch downloads the appropriate HTML stream, sends it to NekoHTML, and gets back a DOM tree upon which it applies filters. It then uses an HTML serializer to send data to the client. With this architecture, Crunch can handle any version of HTML that NekoHTML supports, including all current versions of HTML and XHTML [17].

2) Performance tuning

Since users often get frustrated with latency in the loading of webpages, I spent a substantial amount of time tuning the performance of the proxy to produce more near-real

2) Performance tuning

Since users often get frustrated with the latency of loading webpages, I spent a substantial amount of time tuning the performance of the proxy to produce near-real-time results. Some speed improvement was achieved by switching to NekoHTML. The other major contributor to increased speed was the optimization of Crunch's networking code, originally written using Java's blocking I/O API. By collapsing multiple writes and reads, dealing with timeouts more efficiently, and removing unnecessary or redundant calls in the transfer loops, server performance and bandwidth utilization now seem adequate.

In order to deal with large workgroup loads, Crunch was migrated to a staged event architecture using Java's non-blocking I/O API, and I now use asynchronous callbacks to avoid threading scalability issues. The concept of a staged event architecture was introduced formally by Welsh [37] to achieve performance gains in highly concurrent server applications so that they can support massive concurrency demands. The related concept of thread pools helps large-scale systems like the Apache web server deal with load spikes efficiently. I took the same concept and extended it in my framework so that Crunch can meet the demands of several parallel requests in a groupware setup. I have not yet conducted a performance study with large loads, but I have outlined scenarios that should provide performance statistics in the Evaluation section of this proposal.
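The selector-plus-worker-pool structure this refers to can be sketched as follows (illustrative only; the pool size is arbitrary and handleRequest is a placeholder for the fetch, parse, filter and respond stage):

    import java.net.InetSocketAddress;
    import java.nio.channels.SelectionKey;
    import java.nio.channels.Selector;
    import java.nio.channels.ServerSocketChannel;
    import java.nio.channels.SocketChannel;
    import java.util.Iterator;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    // Sketch of a staged, non-blocking accept loop: a single selector thread
    // accepts connections, and sockets that become readable are handed off to
    // a worker pool (the "stage") that runs the filtering pipeline.
    public class StagedProxyLoop {

        private final ExecutorService workers = Executors.newFixedThreadPool(16);

        public void run(int port) throws Exception {
            Selector selector = Selector.open();
            ServerSocketChannel server = ServerSocketChannel.open();
            server.configureBlocking(false);
            server.socket().bind(new InetSocketAddress(port));
            server.register(selector, SelectionKey.OP_ACCEPT);

            while (true) {
                selector.select();
                Iterator<SelectionKey> keys = selector.selectedKeys().iterator();
                while (keys.hasNext()) {
                    SelectionKey key = keys.next();
                    keys.remove();
                    if (key.isAcceptable()) {
                        SocketChannel client = server.accept();
                        client.configureBlocking(false);
                        client.register(selector, SelectionKey.OP_READ);
                    } else if (key.isReadable()) {
                        // Detach the channel from the selector thread and let a
                        // worker fetch, filter and answer the request.
                        key.cancel();
                        SocketChannel client = (SocketChannel) key.channel();
                        workers.submit(() -> handleRequest(client));
                    }
                }
            }
        }

        private void handleRequest(SocketChannel client) {
            // Placeholder: read the request, fetch and filter the page, write the reply.
        }
    }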

3) Switch to SWT

One of the biggest changes made in recent versions of Crunch was to change the basis of the user interface from Java Swing to IBM's SWT (Standard Widget Toolkit) [16]. The original UI made the proxy sluggish and user-unfriendly. SWT has an extremely clean interface, allows the creation of attractive UIs, and is highly responsive, partially due to its use of JNI and native routine calls that can take advantage of the operating system's built-in optimizations. It also uses native GUI widgets to provide a look and feel consistent with the operating system, while remaining operating-system independent. As an added benefit, SWT allows the program to be compiled into a binary executable, resulting in a faster startup time, a smaller distribution, less memory utilization, and an easier installation for novice users. The latest version of Crunch can be downloaded in executable form from our website. Screenshots of the new proxy GUI are shown in Appendices A and B, where we see the basic settings and the available plug-ins as well as the advanced functions.

4) Accessibility

Accessibility was another focus, and switching to SWT helped maximize accessibility, as described below. One of Crunch's main goals is to assist disabled persons in browsing the web, yet the previous version of Crunch's UI was highly inaccessible; visually impaired users were dependent on an administrator to adjust their settings. There are millions of blind or visually impaired people in the US alone, and only a small fraction of them are currently able to surf the web [17]. Worse, visually impaired users often spend several minutes finding the content they are seeking on a given website. We conducted a small user study with Dr. Michael Chiang, M.D., Instructor in Clinical Ophthalmology and a Research Master's Candidate in the Department of Medical Informatics at Columbia University. The goal was to find the common causes of visually impaired users' browsing latency so that we could address them directly and use Crunch to reduce this time to something more in line with that of regular users.

There are three basic categories of accessibility support: mobility enablement, visual enhancement, and screen readers [17]. Crunch provides mobility enablement, as all settings can be easily accessed using a keyboard in lieu of a mouse; SWT provides keyboard accelerators in the API and supports intelligent tabbing through GUI components. SWT leverages the operating system's accessibility support [17]; therefore Windows' ability to use large fonts and high-contrast themes works with Crunch. SWT also supports Microsoft Active Accessibility (MSAA) and its associated widgets, so Crunch automatically supports screen readers that read content from the window with focus. An example screenshot of such changes is in Appendix A (Figure 15) and Appendix B (Table 2).

Example plug-ins for PDAs

One very important requirement is that Crunch be able to support existing and new heuristics invented by others (that is, by persons and organizations other than the Crunch developers), following a modular approach. One popular application for content reformatting and filtering heuristics is retargeting conventional web pages to the small screens of PDAs. Most web pages are designed for resolutions upwards of 800x600, while a majority of PDAs support only 240x320 (newer PDAs support 480x640). Numerous utilities and tools have been developed attempting to solve this problem, some of which were discussed in Sections 2 and 3. I believe that many of the algorithms and heuristics underlying these tools (as well as other web page filters and reformatters devised for other purposes by third parties) could easily be integrated with Crunch via the plug-in interface. New filtering and reformatting approaches oriented towards a variety of applications could also be easily prototyped by third parties using our plug-in approach, particularly the improved API developed for Crunch 2.0. See Appendix B (Table 3) to compare the Crunch 1.0 and 2.0 extension APIs.

For instance, I found it very easy to re-implement the functionality of both the ThunderHawk PDA browser [27] and the Skweezer proxy [28] (both discussed in more detail in Section 2) as Crunch plug-ins, giving results essentially identical to the original utilities. If the source code of either system had been available, the relevant code could instead have been integrated directly, also via a plug-in. Part of the GUI for the ThunderHawk-like plug-in was shown in Figure 15 (there in the high-contrast format). The plug-in that simulates the functionality of ThunderHawk is on the order of 150 lines of code; similarly, the Skweezer plug-in is about 140 lines. I have found that most of our plug-ins average about 200 lines of code.

To implement any such plug-in, one creates a class that extends ProxyFilter. The method that does the actual processing is the process(Document, Document, Document) method. It should be thread-safe because multiple threads can be accessing it at the same time.

    public abstract Document process(
        Document originalDocument,
        Document previousDocument,
        Document currentDocument);

Crunch 2.0 can be told to load the plug-in at initialization by editing the constructor of Crunch2.java to have a line like

    proxy.registerPlugin(new SkweezerPlugin());

appended to the already existing plug-in registrations:

    proxy.registerPlugin(new ContentExtractor());
    proxy.registerPlugin(new SamplePlugin());
    proxy.registerPlugin(new SizeModifier());

The order in which these lines appear is the order in which the plug-ins are applied to filtered content. In this manner, one can add any number of plug-ins.
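To make this concrete, here is a small hypothetical plug-in (not one of the filters shipped with Crunch) written against this API; it simply strips script elements from the page being filtered:

    package psl.memento.pervasive.crunch2.plugins;

    import org.w3c.dom.Document;
    import org.w3c.dom.Node;
    import org.w3c.dom.NodeList;

    // Hypothetical plug-in: strips <script> elements from the filtered page.
    // Only the abstract methods of the Crunch 2.0 ProxyFilter need to be filled in.
    public class ScriptStripperPlugin extends ProxyFilter {

        public String getName() {
            return "Script Stripper";
        }

        public String getDescription() {
            return "Removes all script elements from the filtered page";
        }

        public Document process(Document originalDocument,
                                Document previousDocument,
                                Document currentDocument) {
            // NekoHTML upper-cases element names by default, hence "SCRIPT".
            NodeList scripts = currentDocument.getElementsByTagName("SCRIPT");
            // The node list is live, so remove elements from the end.
            for (int i = scripts.getLength() - 1; i >= 0; i--) {
                Node script = scripts.item(i);
                script.getParentNode().removeChild(script);
            }
            return currentDocument;
        }
    }

It would then be registered in the Crunch2 constructor with proxy.registerPlugin(new ScriptStripperPlugin());, exactly like the plug-ins listed above.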

Feasibility demonstration

The work done in Crunch 2.0 was a major step forward in helping validate other hypotheses. Having already shown that using a DOM-based content extraction model is feasible, I worked on showing how easily other heuristics could be added to Crunch through the plug-in API. The potential improvement in results produced by the multi-pass filtering system versus a single-pass mechanism was also demonstrated. I successfully showed the ability to make the proxy itself accessible to blind users, and the feasibility of the goal of moving data slated for removal to a non-significant part of the webpage.

5.3 CRUNCH 3.0

The DOM-based content extraction of the earlier versions of Crunch showed the ability to achieve high levels of content extraction accuracy for a wide range of websites. Example output screenshots shown in my original paper [71] and the subsequent journal paper [72] demonstrated that while the results that our proxy produced (with optimal settings) were quite accurate, the settings for the various filters often had to be tweaked by hand by the administrator or the end user for websites that differed in context and layout. For example, if a user were to browse a typical news-based site, e.g., CNN, then their settings would remove heavy links, images with links (menus), advertisements, and forms, since they would typically be looking for only the articles contained in each page. The results produced, as shown in Figure 16, would be as expected. However, if the user chose to switch workflow contexts and browse Amazon.com instead, a link-heavy page with lots of advertisements, images and forms, those important links would be lost due to the news-optimized settings, producing an undesirable result as shown in Figure 17. This could easily be fixed by adjusting the settings in the Content Extraction filter of our proxy; however, it would require the user to switch application focus and do so manually. The user would have to throttle the link/text ratio higher, perhaps even toggle off the advertisement remover and the script remover (Figure 17), for a better shopping experience. As explained before, this workflow results in poor user productivity.

Since creating a set of filter settings that produces good results for known webpages is straightforward, I focused on classifying websites. Ideally, a website's classification would be matched against known classes and an appropriate configuration would be selected. For example, if the user browsing CNN were to then browse Amazon.com, for which a classification didn't already exist, Crunch would detect that Amazon.com is a shopping site, similar to others already categorized, and those settings would be adopted automatically. This eliminates the need for human intervention every time there is a context switch on the part of the user. I have chosen two types of classification to categorize websites: physical layout and content genre.

My approach to classifying websites based on genre involves a one-time initial processing stage in which 200 of the top sites visited by our group on a daily basis were pre-classified. I combined the textual content of the site itself with the contents of search results returned by searching for the site's domain name on three of the top Internet search engines (Google, Yahoo and Dogpile). Adding text from search engine results enables us to leverage the small blurb describing the site in the engine's results, which assists in classifying the site. Combining results per domain also improves the frequency of occurrence of words that describe the function and content of the site. I then removed all stop words and counted the frequency of the words in the remaining text. Since the resulting 200 sets of frequency graphs contain both words that repeat across graphs and words that occur too infrequently to convey content information, I picked all unique words that occurred in at least one of the graphs more than five times, creating a master set of Key Words. A re-graph against this new set of Key Words produces an accurate content genre identifier for each of these websites (examples of the results for Amazon and eBay are shown in Figures 18 and 19, respectively).

The classifications can now be used in conjunction with a database containing filter settings for each of the 200 sites. When Crunch encounters a website that it hasn't seen before, it again extracts the text from the site and the corresponding search engine blurbs, and performs a frequency match against the Key Word set. Using the Manhattan histogram distance measure, it computes the distance between the website in question and each of our original classifications. The formula is defined as

    D(h_1, h_2) = \sum_{i=0}^{n-1} |h_1[i] - h_2[i]|

where the histograms h_1 and h_2 are represented as vectors and n is the number of bins in the histogram (i.e., the number of words in our Key Word set). h_1 and h_2 must first be normalized in order to satisfy the above distance function requirements; in Crunch, the sum of each histogram's bins is normalized to 1 before computing the distance.
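A compact sketch of this matching step (not the production Crunch code; it assumes the word lists have already had stop words removed and that the Key Word set has already been built) is:

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Sketch of the genre-matching step: build the normalized Key Word
    // histogram for a site and compare it to a stored histogram with the
    // Manhattan distance defined above.
    public class GenreMatcher {

        // Normalized histogram of the Key Words over the site's extracted words.
        public static double[] histogram(List<String> siteWords, List<String> keyWords) {
            Map<String, Integer> counts = new HashMap<>();
            for (String w : siteWords) {
                counts.merge(w.toLowerCase(), 1, Integer::sum);
            }
            double[] h = new double[keyWords.size()];
            double total = 0;
            for (int i = 0; i < keyWords.size(); i++) {
                h[i] = counts.getOrDefault(keyWords.get(i).toLowerCase(), 0);
                total += h[i];
            }
            if (total > 0) {
                for (int i = 0; i < h.length; i++) {
                    h[i] /= total;   // bins sum to 1, as required by the distance
                }
            }
            return h;
        }

        // Manhattan distance D(h1, h2) = sum over i of |h1[i] - h2[i]|.
        public static double manhattan(double[] h1, double[] h2) {
            double d = 0;
            for (int i = 0; i < h1.length; i++) {
                d += Math.abs(h1[i] - h2[i]);
            }
            return d;
        }
    }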

Crunch then uses the settings associated with the website whose distance is closest to the one being accessed. In the previous example with CNN, Amazon and eBay, the user might have already defined preferences for CNN and Amazon. If the user navigates to eBay, the histogram matching algorithm finds it to be most similar to Amazon among our pool of categorized sites and picks the equivalent settings. I found that the resulting page (Crunch 3.0 result, Figure 20) is quite acceptable.

I have tried other approaches and compared their accuracy and speed to the Manhattan distance. For example, Euclidean distance measuring gave equivalent results with slightly greater computational overhead:

    D(h_1, h_2) = \sqrt{\sum_{i=0}^{n-1} (h_1[i] - h_2[i])^2}

I have also tried using a simplified version of the Mahalanobis histogram distance formula

    d^2(x, y) = (x - y)^T C^{-1} (x - y)

where x and y are two feature vectors and each element of a vector is a variable: x is the feature vector of the new observation, y is the averaged feature vector computed from the training examples, and C^{-1} is the inverse covariance matrix, where C_{ij} = Cov(y_i, y_j) and y_i, y_j are the i-th and j-th elements of the training vector. The advantage of the Mahalanobis distance is that it takes into account not only the average value but also the variance and the covariance of the variables measured. I expected the Mahalanobis formula to be the most accurate; however, this was not the case, probably due to the lack of a large amount of variance in our training data. If I had results from more than three different search engines, I could potentially improve the variance in our data and the Mahalanobis histogram distance might give more reliable results. The current system uses the simple and efficient Manhattan histogram distance measure.

In Figures 21 and 22, I show the frequency charts for CNN and Spacer.com (an astronomy news related site) respectively. Crunch determines that these two sites are similar enough to warrant the same set of settings; the results viewed in the browser can be seen in the Crunch 3.0 results of Figure 23. As shown in Figures 19-23, I do not classify websites exclusively into predefined genres: based on the word frequency histogram, we either find an already-defined genre with a similar histogram or create a new genre centered on the new website. Currently, all pages within a domain are assumed to be of the same genre, which works well for homogeneous sites created with a content management system [39], but less well for heterogeneous sites such as geocities.com, where a single genre is assumed for the entire domain.

In order to classify websites based on their physical characteristics, I manually created a database of the physical layouts of the same 200 top visited sites, and classified them into several basic categories.

Categorizations include the number of columns, the link density, the type of site (news, shopping, banking, weblog, etc.) and the percentage of content data contained in the various columns. Future work includes automating this process; my approach will be to analyze the constituent parts of the HTML in the DOM tree using tree pattern inference [38]. The user has the option of overriding our automatic settings through Crunch's user interface. Ultimately, I hope to use the data produced by physical classification to add to the information gained through content and genre classification, to produce even more accurate results.
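As an illustration of the kind of physical-layout features an automated classifier could read directly off the DOM tree (the feature set below is hypothetical, not the exact set recorded in the manual database), consider:

    import org.w3c.dom.Document;
    import org.w3c.dom.Element;
    import org.w3c.dom.NodeList;

    // Hypothetical layout features read off a parsed page: the widest table row
    // approximates the column count, and links per kilobyte of text approximates
    // the link density of the page.
    // NekoHTML upper-cases element names by default, hence "TR"/"TD"/"A".
    public class LayoutFeatures {

        public final int columns;
        public final double linkDensity;

        public LayoutFeatures(Document doc) {
            int maxCells = 0;
            NodeList rows = doc.getElementsByTagName("TR");
            for (int i = 0; i < rows.getLength(); i++) {
                Element row = (Element) rows.item(i);
                maxCells = Math.max(maxCells, row.getElementsByTagName("TD").getLength());
            }
            columns = maxCells;

            int links = doc.getElementsByTagName("A").getLength();
            int textLength = doc.getDocumentElement() == null ? 0
                    : doc.getDocumentElement().getTextContent().length();
            linkDensity = textLength == 0 ? 0 : links / (textLength / 1024.0);
        }
    }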

Feasibility demonstration

Crunch 3.0 demonstrates the feasibility of utilizing contextual data about a website, namely genre analysis and physical layout information, for the application of filter settings to a website never seen before by Crunch.

6. Evaluation

In order to evaluate the content extraction framework, I will compare the results output by Crunch to those obtained without Crunch. In the evaluation cases below, I will perform three sets of experiments: the first comparing Crunch vs. no Crunch, the second comparing Crunch to competitors, and the third showing the advantages of genre matching by comparing Crunch with and without said matching.

6.1 Evaluation methods

In order to see whether I have made effective progress compared to other solutions, I will compare results produced by Crunch with those produced by applications representative of those that reformat content for constrained devices and those that cater to blind users. For the constrained-devices scenario, I will use Bitstream ThunderHawk and Opera's PDA browser as the two applications. For those that cater to blind users, I will use ZoomText, manufactured by AI Squared, and for the screen reader I will use JAWS [75], the most popular commercial screen reader on the market. I have picked these applications as they are the best in their areas of content extraction, formatting and screen reading. I will perform the evaluation in the following ways.

1) Screen Reader testing

The goal of this test is to check how well a webpage generated by Crunch works with screen readers. The results of this test will show the amount of time a blind or visually impaired user can save when using Crunch vs. not using Crunch. Furthermore, I will use Crunch separately with default settings, hand-tuned settings and automatic settings (which in essence also tests the different versions of Crunch) in order to show the success of genre-based content filtering.

a) I will choose a number (e.g., 10) of websites and take the original page as well as the outputs produced by Crunch (pages generated with the best settings set by me vs. automatic settings vs. default settings) and by the applications listed above, and pass them through a screen reader. I will then measure the amount of time it takes, in seconds, to read the entire webpage. This will help evaluate hypotheses b, c, f and g.

b) I will denote a particular section of text within the webpage as the target content. I will choose this section/paragraph of text arbitrarily, as long as it is not on the first rendered page on the screen. I will then measure how long it takes for a user (me in this case) to adjust the Crunch settings vs. all the other applications to exactly extract the prose containing the chosen section. I will then repeat step 1a on this output. This will not only help evaluate the hypotheses tested in step 1a, but also show how long it takes to achieve the desired output.

2) Constrained Device testing

The goal of this test is to check how well a webpage generated by Crunch is rendered on a PDA. The results of this test will demonstrate the increase in relevant data displayed on a small screen when using Crunch vs. not using Crunch. Furthermore, I will use Crunch separately with default settings, hand-tuned settings and automatic settings in order to show the success of genre-based content filtering.

a) Like step 1, I will display the original page, the output of Crunch (pages generated with the best settings set by me vs. automatic settings vs. default settings) and the output produced by the applications listed above on a 320x240 as well as a 640x480 resolution PDA, and determine, by percentage, how much content vs. non-content is on the first loaded page. This will test hypotheses b, c, d, f and g.

b) Again, as in step 1b, I will select a certain section of a given page as the target content, ask the user (again, me) to adjust the settings, and measure the amount of time it takes. I will then repeat step 2a on the resulting outputs and measure the percentage of content vs. non-content. This again will help evaluate how long it takes to tweak Crunch to achieve the desired results.

3) Performance and Scalability testing

I worked at increasing the performance of Crunch through every iteration, and this test is designed to measure the scalability of Crunch when multiple pages are requested by the user. The results will show the latency one can expect from Crunch and will demonstrate that it will work well as a web proxy for groupware environments. I will use multiple browsers that support tabbed browsing (AvantBrowser and Firefox) and load several websites simultaneously in each browser, each with and without Crunch. These browsers will load the given websites, each in a separate tab, issuing HTTP requests and then rendering the resulting responses into webpages simultaneously. I will report data on how much longer it takes the browsers to render a given page with and without Crunch. This will help test the performance of Crunch.

The tests map onto the hypotheses as follows:

    Test    b    c    d    e    f    g    Perf
    1a      X    X              X    X
    1b      X    X              X    X
    2a      X    X    X         X    X
    2b      X    X    X         X    X
    3                                     X

7. Plan/Schedule

7.1 Schedule

Crunch 1.0 (done): Demonstrate that the DOM object model works well for content extraction; implement Crunch as a transparent web proxy; demonstrate the success of the content extraction solution as a pluggable framework.

Crunch 2.0 (done): Improve the plug-in API so that it can be easily implemented against; support scripted pages as well as cascading style sheets (CSS); make the proxy itself accessible by using the Standard Widget Toolkit (SWT); improve performance; allow the user to retain lost links at the bottom of the extracted page.

Crunch 3.0 (done): Demonstrate the feasibility of using genre and physical layout to automatically extract content from websites.

Crunch 3.0+ (2 months): Pre-process several websites to generate content-based clusters; utilize tree pattern inference to better extract content from pages from the same website; differentiate the output produced by identical filter settings when the target application is a screen reader vs. a constrained device (PDA).

Evaluation (3 months): Evaluate the success of Crunch as laid out in Section 6.

Write Thesis (3 months): Write the thesis.

Total: 8 months.

I will pre-process several hundred websites, classify them by genre as well as physical layout, and demonstrate that there exist naturally occurring clusters of websites based on these two modes of classification.

I would also like to create physical layout heuristics by inferring tree patterns [38], i.e., finding patterns within the DOM trees of multiple pages from the same domain, and use the dissimilarity information to extract content from webpages. A feasibility study for finding such patterns has already been conducted; I intend to use the inverse of this approach to extract content from a webpage.

One of the limitations pointed out earlier is the inability to differentiate between a user following a workflow and a user surfing the web with no particular goal in mind, because removing extraneous content for the latter may not be a desired feature. I would like to identify and differentiate between the two cases and utilize genre information about the different webpages being loaded in order to relax the proxy settings if random web surfing is detected. This could potentially lead to further future work, outside the scope of this thesis, where I could use machine learning to model such behavior and learn a user's browsing habits in order to apply extraction settings more accurately.

I will also continue to work on improving the latency and scalability of Crunch, especially since tabbed browsing is becoming increasingly popular and users often open up several tens of pages at a time.

7.2 Future Work

Detecting bad output, and especially quantifying what makes bad output bad, is a hard problem. Training the multi-pass filter to learn to detect bad output is one of my major goals for the future.

I have found that sites that require a large amount of form-based input, like most shopping and/or banking sites, often need to be rendered in a specific format in order to be usable. Currently, I only have a binary option of removing forms or leaving them in. I plan to work on detecting related form items and showing those that are most relevant to the task at hand.

Finally, one of my goals was to expose a simple API for programmers to extend, so that current and future natural language processing and information retrieval algorithms can easily be added to Crunch. This would allow users to truly customize the content they would like to view on visited web pages. Full evaluation of the API and plug-in framework will not be possible until sufficient outside developers have worked with Crunch. Ideally, I would like to select programmers from the Computer Science department here at Columbia University, show them the API of Crunch, and ask them to implement simple filters. I would then be able to measure and report the time it takes them to create such filters and add their controls to the GUI. I would also ask them to quantify the difficulty of adding a filter, given that they had no prior knowledge of the inner workings of Crunch. However, this is outside the scope of this thesis.

While NekoHTML works well, I would also like to test a commercial HTML parser, like the ones bundled with Mozilla or Internet Explorer, and compare their efficiency. Again, this is outside the scope of this thesis.

I have further found that few systems use referrer data, i.e., information about the site that the user came from, to influence the content extraction or display of a given page. I propose to build a test filter that will demonstrate the usefulness of such a concept by highlighting, in a given page, the string matching the domain name from which the user has been referred. This could eventually be replaced with a natural language processing system that would instruct Crunch to highlight a few keywords representative of the prose from which the user came, rather than just the domain name. This also is outside the scope of this thesis.
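A rough sketch of what such a referrer-highlighting filter might look like follows (illustrative only, and not an implemented Crunch filter); it wraps the first occurrence of the referring domain found in each text node in a bold element:

    import org.w3c.dom.Document;
    import org.w3c.dom.Element;
    import org.w3c.dom.Node;
    import org.w3c.dom.NodeList;
    import org.w3c.dom.Text;

    // Hypothetical referrer-highlighting filter: wraps the first occurrence of
    // the referring domain inside each text node in a <B> element.
    public class ReferrerHighlighter {

        public void highlight(Document doc, String referrerDomain) {
            if (doc.getDocumentElement() != null) {
                highlightUnder(doc.getDocumentElement(), doc, referrerDomain);
            }
        }

        private void highlightUnder(Node node, Document doc, String domain) {
            String name = node.getNodeName();
            if ("SCRIPT".equalsIgnoreCase(name) || "STYLE".equalsIgnoreCase(name)) {
                return;   // never rewrite script or style content
            }
            NodeList children = node.getChildNodes();
            for (int i = children.getLength() - 1; i >= 0; i--) {
                Node child = children.item(i);
                if (child.getNodeType() == Node.TEXT_NODE) {
                    Text text = (Text) child;
                    int at = text.getData().indexOf(domain);
                    if (at >= 0) {
                        // Split the text node around the match and wrap the match.
                        Text match = text.splitText(at);
                        Text rest = match.splitText(domain.length());
                        Element bold = doc.createElement("B");
                        node.insertBefore(bold, rest);
                        bold.appendChild(match);
                    }
                } else {
                    highlightUnder(child, doc, domain);
                }
            }
        }
    }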

I would also like to differentiate random surfing from workflow-based browsing, relaxing filter settings for the former, and to be able to handle the index/top-level page of a site differently from the remaining child/linked pages of that same website.

8. Contributions

Many web pages contain excessive clutter around the body of an article. There are many special-case solutions to remove advertising (particularly pop-ups) or to reformat pages for small screens; however, content extraction is still a relatively new field in which few general-purpose tools are available, so most researchers must construct their content extractors from scratch. In this proposal I have described three versions of the Crunch proxy: Crunch 1, which is simply the framework; Crunch 2, with substantial improvements to the framework with respect to the plug-in API and the extensibility of the administrative interface; and Crunch 3, which classifies webpages based on genre and physical layout and then utilizes this information to extract content in similar ways from similar sites.

My main contribution, working with the Document Object Model tree as opposed to raw HTML markup, enables me to apply in tandem an extensible collection of Content Extraction filters, and potentially other kinds of filters such as format translators and NLP summarizers. The heuristic filters that I have developed to date, though simple, are quite effective.

My second contribution will be to demonstrate that implementing the extraction engine as a transparent proxy, built as a framework that allows programmers, administrators and end users to add plug-ins, gives the system a flexibility that has not been seen before. Furthermore, I will avoid the one-size-fits-all approach and adapt the output to different webpages and target applications.

I have presented the changes made to the Crunch proxy's user interface and performance. I have found that while the results that my original proxy produced were quite accurate, they had to be adjusted by hand for various websites that differ drastically in content and form. I have started to address that problem by dynamically detecting the context of the website, in terms of both physical layout and content genre. Using this information and comparing it to previously known results that work well for certain genres of sites, I am able to select settings for our heuristics and automatically achieve the same results that previously required a human administrator. My contribution here will be to show that inferring tree patterns and using context for content extraction, especially content genre and physical layout, will produce extremely accurate results for a wide range of websites.

The Crunch framework provides the basis for additional research in content extraction and accessibility. Crunch also leaves room for a lot of further research in machine learning, from learning a user's browsing habits in order to apply proxy settings more effectively, to learning to detect and classify a wider range of pages more accurately.

Appendix A (Figures)

Figure 1
Figure 2
Figure 3

Figure 4 - Before
Figure 5 - After
Figure 6 - Text Only

Figure 7 - Before
Figure 8 - After
Figure 9 - Before
Figure 10 - After

Figure 11
Figure 12 - Link-Heavy Page
Figure 13 - Crunch 1.0 Output
Figure 14 - Crunch 2.0 Output
Figure 15

Figure 16 - CNN (original, through Crunch 2.0 with shopping settings and through Crunch 3.0 with auto-settings)
Figure 17 - Amazon (original, through Crunch 2.0 with news settings and through Crunch 3.0 with auto-settings)

Figure 18 - Frequency chart for Amazon
Figure 19 - Frequency chart for eBay
Figure 20 - eBay (original, through Crunch 2.0 with news settings and through Crunch 3.0 with auto-settings)

Figure 21 - Frequency chart for CNN
Figure 22 - Frequency chart for Spacer
Figure 23 - Spacer.com (original, through Crunch 2.0 with shopping settings and through Crunch 3.0 with auto-settings)

Appendix B (Tables)

Table 1

Table 2

Table 3 - Crunch 1.0 vs. Crunch 2.0 Plug-in APIs

Crunch 1.0 ProxyFilter.java:

    package psl.memento.pervasive.crunch;

    import java.io.*;

    public interface ProxyFilter {
        public File process(File in) throws IOException;
        public ProxyFilterSettings getSettingsGUI();
        public String getContentType();
    }

Crunch 2.0 ProxyFilter.java:

    package psl.memento.pervasive.crunch2.plugins;

    import org.w3c.dom.Document;

    public abstract class ProxyFilter {
        private boolean enabled = true;

        public void getSettingsGUI() {
            // no settings GUI is required
        }

        public boolean hasSettingsGUI() {
            return false;
        }

        public abstract String getName();

        public abstract String getDescription();

        public void setEnabled(boolean b) {
            enabled = b;
        }

        public boolean isEnabled() {
            return enabled;
        }

        public abstract Document process(
            Document originalDocument,
            Document previousDocument,
            Document currentDocument);
    }

Crunch 1.0 ProxyFilterSettings.java:

    package psl.memento.pervasive.crunch;

    import javax.swing.JPanel;

    public abstract class ProxyFilterSettings extends JPanel {
        public abstract void commitSettings();
        public abstract void revertSettings();
        public abstract String getTabName();
    }

Crunch 2.0 ProxyFilterSettings.java:

    package psl.memento.pervasive.crunch2.plugins;

    public interface ProxyFilterSettings {
        public void set(String key, String value);
        public String get(String key);
        public void commitSettings();
        public void revertSettings();
    }

Appendix C (Example Screenshots)

We show some examples of typical websites with different Crunch 2.0 options turned on. The point is to give the reader an idea of the degree of control a user can have over what he/she wants to see on a webpage. The pages we chose are:

1) A typical article from
2) An article from a link- and script-heavy site,
3) An article from
4) The entry page to the WWW2004 website, www2004.org

Each of the four following pages shows the corresponding set of images. The images start with a screenshot of the original site, followed by a gradual increase in the number of filters used, continuing to a screenshot of the site in text-only mode. We have created this anthology of images to help the user get an idea of how Crunch and its filters work on a given webpage.

Screenshot set 1:
Original
Remove link lists
Remove advertisements
All advanced filters on
Ignore scripts
Ignore everything, text only

Screenshot set 2:
Original
All advanced filters on
Ignore scripts
Ignore everything, text only

Screenshot set 3:
Original
Remove forms
All advanced filters on
Ignore everything, text only

Screenshot set 4:
Original
All advanced filters turned on
Ignore everything, text only


More information

Web Site Design Principles

Web Site Design Principles Web Site Design Principles Objectives: Understand the Web design environment Design for multiple screen resolutions Craft the look and feel of the site Create a unified site design Design for the user

More information

How to set up a local root folder and site structure

How to set up a local root folder and site structure Activity 2.1 guide How to set up a local root folder and site structure The first thing to do when creating a new website with Adobe Dreamweaver CS3 is to define a site and identify a root folder where

More information

Basic Intro to ETO Results

Basic Intro to ETO Results Basic Intro to ETO Results Who is the intended audience? Registrants of the 8 hour ETO Results Orientation (this training is a prerequisite) Anyone who wants to learn more but is not ready to attend the

More information

Using the VMware vcenter Orchestrator Client. vrealize Orchestrator 5.5.1

Using the VMware vcenter Orchestrator Client. vrealize Orchestrator 5.5.1 Using the VMware vcenter Orchestrator Client vrealize Orchestrator 5.5.1 You can find the most up-to-date technical documentation on the VMware website at: https://docs.vmware.com/ If you have comments

More information

Administrative Training Mura CMS Version 5.6

Administrative Training Mura CMS Version 5.6 Administrative Training Mura CMS Version 5.6 Published: March 9, 2012 Table of Contents Mura CMS Overview! 6 Dashboard!... 6 Site Manager!... 6 Drafts!... 6 Components!... 6 Categories!... 6 Content Collections:

More information

SCHULICH MEDICINE & DENTISTRY Website Updates August 30, Administrative Web Editor Guide v6

SCHULICH MEDICINE & DENTISTRY Website Updates August 30, Administrative Web Editor Guide v6 SCHULICH MEDICINE & DENTISTRY Website Updates August 30, 2012 Administrative Web Editor Guide v6 Table of Contents Chapter 1 Web Anatomy... 1 1.1 What You Need To Know First... 1 1.2 Anatomy of a Home

More information

MURDOCH RESEARCH REPOSITORY

MURDOCH RESEARCH REPOSITORY MURDOCH RESEARCH REPOSITORY http://researchrepository.murdoch.edu.au/ This is the author s final version of the work, as accepted for publication following peer review but without the publisher s layout

More information

4. You should provide direct links to the areas of your site that you feel are most in demand.

4. You should provide direct links to the areas of your site that you feel are most in demand. Chapter 2: Web Site Design Principles TRUE/FALSE 1. Almost every Web site has at least one flaw. T PTS: 1 REF: 49 2. Not only should you plan for a deliberate look and feel for your Web site, but you must

More information

Advanced Multidimensional Reporting

Advanced Multidimensional Reporting Guideline Advanced Multidimensional Reporting Product(s): IBM Cognos 8 Report Studio Area of Interest: Report Design Advanced Multidimensional Reporting 2 Copyright Copyright 2008 Cognos ULC (formerly

More information

Quark XML Author October 2017 Update for Platform with Business Documents

Quark XML Author October 2017 Update for Platform with Business Documents Quark XML Author 05 - October 07 Update for Platform with Business Documents Contents Getting started... About Quark XML Author... Working with the Platform repository...3 Creating a new document from

More information

Drupal 8 THE VIDER ITY APPR OACH

Drupal 8 THE VIDER ITY APPR OACH Drupal 8 THE VIDER ITY APPROACH Introduction DR UPAL 8: THE VIDER ITY APPROACH Viderity focuses on designing the Total User Experience for Drupal sites, using a user-centered design approach Traditionally,

More information

What s New in QuarkXPress 2018

What s New in QuarkXPress 2018 What s New in QuarkXPress 2018 Contents What s New in QuarkXPress 2018...1 Digital publishing...2 Export as Android App...2 HTML5 enhancements...3 Configuration changes...5 Graphics...7 Transparency blend

More information

Yonghyun Whang a, Changwoo Jung b, Jihong Kim a, and Sungkwon Chung c

Yonghyun Whang a, Changwoo Jung b, Jihong Kim a, and Sungkwon Chung c WebAlchemist: A Web Transcoding System for Mobile Web Access in Handheld Devices Yonghyun Whang a, Changwoo Jung b, Jihong Kim a, and Sungkwon Chung c a School of Computer Science & Engineering, Seoul

More information

Voluntary Product Accessibility Template (VPAT) Applicable Sections

Voluntary Product Accessibility Template (VPAT) Applicable Sections Voluntary Product Accessibility Template (VPAT) Name of Product Reaxys Date January 6, 2014 Completed by Jack Bellis, Elsevier UCD, Philadelphia Contact for more Information Product Version Number 2.15859.10

More information

ABOUT THIS COURSE... 3 ABOUT THIS MANUAL... 4 LESSON 1: MANAGING LISTS... 5

ABOUT THIS COURSE... 3 ABOUT THIS MANUAL... 4 LESSON 1: MANAGING LISTS... 5 Table of Contents ABOUT THIS COURSE... 3 ABOUT THIS MANUAL... 4 LESSON 1: MANAGING LISTS... 5 TOPIC 1A: SORT A LIST... 6 Sort a list in A-Z or Z-A Order... 6 TOPIC 1B: RENUMBER A LIST... 7 Renumber a List

More information

Implementing a Numerical Data Access Service

Implementing a Numerical Data Access Service Implementing a Numerical Data Access Service Andrew Cooke October 2008 Abstract This paper describes the implementation of a J2EE Web Server that presents numerical data, stored in a database, in various

More information

Dreamweaver MX Overview. Maintaining a Web Site

Dreamweaver MX Overview. Maintaining a Web Site Dreamweaver MX Overview Maintaining a Web Site... 1 The Process... 1 Filenames... 1 Starting Dreamweaver... 2 Uploading and Downloading Files... 6 Check In and Check Out Files... 6 Editing Pages in Dreamweaver...

More information

User Guide. Version 1.5 Copyright 2006 by Serials Solutions, All Rights Reserved.

User Guide. Version 1.5 Copyright 2006 by Serials Solutions, All Rights Reserved. User Guide Version 1.5 Copyright 2006 by Serials Solutions, All Rights Reserved. Central Search User Guide Table of Contents Welcome to Central Search... 3 Starting Your Search... 4 Basic Search & Advanced

More information