
ABSTRACT

The purpose of this project is to develop an algorithm to extract information from semi-structured Web pages. Many Web applications that use information retrieval, information extraction, and automatic page adaptation can benefit from this structure. This project presents an automatic, top-down, tag-tree-independent approach to detecting Web content structure. It simulates how a user understands Web layout structure based on visual perception, and it segments the Web page around the data records, which are the most important information in the whole structure. Compared to other existing techniques, our approach is independent of the underlying document representation (such as HTML) and works well even when the HTML structure is far different from the layout structure. The method works on a large set of Web pages. This project uses the VBS algorithm to extract information from most semi-structured Web pages, and it recommends some rules that improve the performance of the algorithm.

TABLE OF CONTENTS

Abstract
Table of Contents
List of Figures
List of Tables
1. Background and Rationale
   1.1 Document Object Model
   1.2 Color Code Model
   1.3 Vision Based Page Segmentation
   1.4 Visual Based Segmentation
2. Narrative
   2.1 Visual Block Extraction
3. System Design
   3.1 Analysis
   3.2 User Interface
   3.3 Vision-based Content Structure for Web Pages
4. Evaluation and Results
Future Work
Conclusion
Bibliography and References
APPENDIX A. Websites Testing Results

LIST OF FIGURES

Figure 1.1 A semi-structured Web page from buy.com
Figure 1.2 An example of how VIPS segments a Web page into blocks
Figure 1.3 VB 1-1-1(8)
Figure 1.4 VB 1-1-2(4)
Figure 1.5 A sample Web page segmented
Figure 1.5(a) Segmentation of a Web page [Cai 2003]
Figure 1.5(b) An example of Web-based content structure [Cai 2003]
Figure 3.1 Overall dataflow
Figure 3.2 A sample input Web page
Figure 3.3 A sample VBS-partitioned Web page
Figure 3.4 User interface with segmented data records
Figure 3.5 A segmented Web page mapped to visual blocks
Figure 3.6 VIPS segments data records as a leaf node
Figure 3.7 A Web page analyzed by VIPS which identifies noise as a node
Figure 3.8 The drop-down list which acts as noise
Figure 3.9 A Web page segmented by VIPS that identifies noise as data records
Figure 3.10 Complex Visual Structure
Figure 3.11 A Web page analyzed by VIPS which identifies noise as a node
Figure 3.12 VBS segments data records
Figure 3.13 A Web page segmented by VBS that ignores noise

LIST OF TABLES

Table 1 Evaluation Results

1. BACKGROUND AND RATIONALE

Today the Web has become the largest information source for many people. Most information retrieval systems on the Web treat Web pages as the smallest, undividable units, but a Web page as a whole may not be appropriate to represent a single semantic idea. A Web page usually contains various contents, such as items for navigation, decoration, interaction, and contact information, that are not related to the topic of the page. Furthermore, a Web page often contains multiple topics that are not necessarily relevant to each other. Therefore, detecting the semantic content structure of a Web page could potentially improve the performance of Web information retrieval [Cai 2003].

Web pages often contain clutter (such as unnecessary images and extraneous links) around the body of an article that distracts the user from the actual content. These features may include pop-up ads, flashy banner advertisements, unnecessary images, or links scattered around the screen. Extraction of useful and relevant content from Web pages has many applications, including cell phone and PDA browsing, speech rendering for the visually impaired, and text summarization [Gupta 2005].

People view a Web page through a Web browser and get a 2-D presentation image, which provides many visual cues that help distinguish different parts of the page, such as lines, blanks, images, font sizes, and colors. For the purpose of easy browsing and understanding, a closely packed region within a Web page is usually about a single topic. This observation motivates us to segment a Web page from its visual presentation [Embley 1999].

Figure 1.1 A semi-structured Web page from buy.com

In terms of human perception, people always view a Web page as different semantic objects rather than as a single object. Some research efforts show that users expect certain functional parts of a Web page (e.g., navigation links, an advertisement bar) to appear at certain positions of that page. In fact, when a Web page is presented to the user, the spatial and visual cues help the user unconsciously divide the Web page into several semantic parts. Therefore, it might be possible to automatically segment Web pages by using these spatial and visual cues [Cai 2003].

Many Web applications can utilize the semantic content structure of Web pages. For example, in Web information access, to overcome the limitations of browsing and keyword searching, some researchers have been trying to use database techniques and build wrappers to

structure the Web data. Wrappers are interfaces to data sources that translate data into a common data model used by a mediator: a wrapper integrates data from different databases and other data sources through a middleware virtual database called a mediator. Wrappers facilitate access to Web-based information sources by providing uniform querying and data extraction capabilities. They support fast and efficient data extraction and are domain independent. In building wrappers, it is necessary to divide Web documents into different information chunks. If we can get a semantic content structure of the Web page, wrappers can be built more easily and information can be extracted more easily.

Moreover, link analysis has received much attention in recent years. Traditionally, different links in a page are treated identically. The basic assumption of link analysis is that if there is a link between two pages, there is some relationship between the two whole pages. But in most cases, a link from page A to page B just indicates that there might be some relationship between some certain part of page A and some certain part of page B [Ashish 1997].

1.1 Document Object Model

In order to analyze a Web page for content extraction, we pass Web pages through an open-source HTML parser, which creates a Document Object Model (DOM) tree, an approach also adopted by Chen [Chen 2003]. The DOM is a standard for creating and manipulating in-memory representations of HTML (and XML) content. By parsing a Web page's HTML into a DOM tree, we can not only extract information from large logical units similar to Semantic Textual Units (STUs) but also manipulate smaller units, such as specific links, within the structure of the DOM tree. In

addition, DOM trees are highly transformable and can easily be used to reconstruct complete Web pages. Finally, increasing support for the DOM makes our solution widely portable [Buyukkokten 2001].

There is a large body of related work in content identification and information retrieval that attempts to solve similar problems using various other techniques. Finn et al. discussed methods for content extraction from single-article sources, where the content is presumed to be in a single body [Finn 2001]. The algorithm tokenizes a page into either words or tags; the page is then sectioned into three contiguous regions, placing boundaries to partition the document such that most tags fall into the two outside regions and most word tokens into the center region. This approach works well for single-body documents, but it destroys the structure of the HTML and does not produce good results for multi-body documents, where content is segmented into multiple smaller pieces, as is common on Web logs. For the content of multi-body documents to be successfully extracted, the running time of the algorithm would become polynomial with a degree equal to the number of separate bodies; i.e., extraction of a document containing 8 different bodies would run in O(N^8), N being the number of tokens in the document.

Kan et al. similarly used semantic boundaries to detect the largest body of text on a Web page (by counting the number of words) and classify that as content [Kan 1998]. This method worked well with simple pages, but it produced noisy or inaccurate results on multi-body documents, especially those with random advertisement and image placement. Rahman et al. proposed another technique that uses structural analysis, contextual analysis, and summarization [Rahman 2001].
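The three-region partitioning described above can be sketched as a search over the two boundary positions; this is our own illustration under an assumed token representation, not Finn et al.'s code:

```python
def extract_single_body(tokens):
    """Finn et al.-style single-body extraction (sketch): pick boundaries
    i <= j that maximize tag tokens in the two outer regions plus word
    tokens in the middle region.  tokens: list of (kind, value) pairs,
    with kind in {"tag", "word"}."""
    n = len(tokens)
    # prefix_tags[k] = number of tag tokens among tokens[:k]
    prefix_tags = [0] * (n + 1)
    for k, (kind, _) in enumerate(tokens):
        prefix_tags[k + 1] = prefix_tags[k] + (kind == "tag")

    def score(i, j):
        tags_outside = prefix_tags[i] + (prefix_tags[n] - prefix_tags[j])
        words_inside = (j - i) - (prefix_tags[j] - prefix_tags[i])
        return tags_outside + words_inside

    i, j = max(((i, j) for i in range(n + 1) for j in range(i, n + 1)),
               key=lambda ij: score(*ij))
    return [value for kind, value in tokens[i:j] if kind == "word"]

tokens = [("tag", "<div>"), ("tag", "<a>"), ("word", "hello"),
          ("word", "world"), ("tag", "</div>")]
print(extract_single_body(tokens))  # → ['hello', 'world']
```

The exhaustive search over (i, j) makes the single-body case quadratic in the token count; repeating it per body is what drives the O(N^8) cost mentioned above for eight bodies.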

The structure of an HTML document is first analyzed and then decomposed into smaller subsections. The content of the individual sections is then extracted and summarized. Contextual analysis is performed with proximity and HTML structure analysis in addition to natural language processing involving contextual grammar and vector modeling. However, this proposal has yet to be implemented [Rahman 2001].

Kaasinen et al. discussed methods to divide a Web page into individual units likened to cards in a deck. Like STUs, a Web page is divided into a series of hierarchical cards that are placed into a deck, which is presented to the user one card at a time for easy browsing. They also suggest a simple conversion of HTML content to WML (Wireless Markup Language), resulting in the removal of simple information such as images and bitmaps from the Web page so that scrolling is minimized for small displays. The cards are created by this HTML-to-WML conversion proxy. While this reduction has advantages, the method proposed in that paper shares problems with STUs. The problem with the deck-of-cards model is that it relies on splitting a page into tiny sections that can then be browsed as windows. However, this means that it is up to the user to determine which cards hold the actual content, and since this system was used primarily on cell phones, scrolling through the different cards in the entire deck soon became tedious [Kaasinen 2000].

1.2 Color Code Model

Chen et al. proposed an approach similar to the deck-of-cards method, except that in their case the DOM tree is used for organizing and dividing the document. They propose showing an overview of the desired page; the user can select the portion of the page he/she is truly interested in, and when selected, that portion of the page is zoomed into full view. One of the key

insights is that the overview page is actually a collection of semantic blocks that the original page has been broken into, each one color-coded to distinguish the different blocks for the user. This provides the user with a table of contents from which the user selects the desired section. While this is an excellent idea, it still involves the user clicking on the block of choice and then going back and forth between the overview and the full view.

None of these concepts solved the problem of automatically extracting just the content, although they do provide simpler means by which the content can be found. These approaches performed limited analysis of the Web pages themselves, and in some cases information was lost in the analysis process. By parsing a Web page into a DOM tree, we found that we not only get better results but also have more control over the exact pieces of information that can be manipulated while extracting content [Chen 2003].

1.3 Vision Based Page Segmentation

The Vision-based Page Segmentation (VIPS) algorithm aims to extract the semantic structure of a Web page based on its visual presentation. This semantic structure is a tree; each node in the tree corresponds to a block. Each node is assigned a value, the Degree of Coherence (DoC), to indicate how coherent the content in the block is based on visual perception: the bigger the DoC value, the more coherent the block. The VIPS algorithm makes full use of the page layout structure. It first extracts all the suitable blocks from the HTML DOM tree, and then it finds the separators between these blocks. Here, separators denote the horizontal or vertical lines in a Web page that visually cross no blocks. Based on these separators, the semantic tree of the Web page is constructed; thus, a Web page is represented as a set of blocks (the leaf nodes of the semantic tree). Compared with DOM-based

methods, the segments obtained by VIPS are much more semantically aggregated. Noisy information such as navigation, advertisements, and decoration is easily removed because it is often placed in certain positions of a page, and contents with different topics are distinguished as separate blocks. The vision-based content structure of a page is obtained by combining the DOM structure and the visual cues. Block extraction, separator detection, and content structure construction are regarded as one round. The algorithm is top-down: the Web page is first segmented into several big blocks and the hierarchical structure of this level is recorded; for each big block, the same segmentation process is carried out recursively until we get sufficiently small blocks whose DoC values are greater than a threshold.

In VIPS, the data records in each segment are supposed to have the same degree of coherence and the same depth. But real Web pages may not place same-level contents at the same depth; for example, when querying with the keyword "automobile" (see Figure 1.2), the returned results are at different levels. In addition, VIPS may partition the components of a data record into different neighboring blocks.

Figure 1.2 An example of how VIPS segments a Web page into blocks

In Figure 1.2, the Web page is divided into two blocks, VB1-1(4) and VB1-2(10). (VB stands for Visual Block, and 1-1 is the block ID assigned by VIPS; the number inside the parentheses is the degree of coherence of the block.) The DoC is assigned based on the block's visual properties. VB1-1(4) is further divided into VB1-1-1(8) and VB1-1-2(4) because its DoC (4) is less than the Permitted Degree of Coherence (PDoC) of 10. The block VB1-1-1(8) is further segmented until the DoC is greater than 10; this action is performed recursively until the condition (DoC > PDoC) is satisfied. VB1-1-1(8) is shown in Figure 1.3; it mainly has a dropdown menu and a text form for search query input. Figure 1.4 shows VB1-1-2(4), which consists of data records and external links.
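The recursive divide-until-coherent loop can be sketched as follows; the block representation is our own, and the DoC values below Figure 1.2's top level are invented for illustration:

```python
def segment(block, pdoc):
    """VIPS-style recursive segmentation (sketch): split a block into its
    children while its DoC is below the PDoC; return the leaf block IDs."""
    # Stop once the block is coherent enough or cannot be divided further.
    if block["doc"] >= pdoc or not block["children"]:
        return [block["id"]]
    leaves = []
    for child in block["children"]:
        leaves.extend(segment(child, pdoc))
    return leaves

# The page from Figure 1.2; the DoC values of the deeper nodes are assumptions.
page = {"id": "VB1", "doc": 1, "children": [
    {"id": "VB1-1", "doc": 4, "children": [
        {"id": "VB1-1-1", "doc": 8, "children": [
            {"id": "VB1-1-1-1", "doc": 10, "children": []}]},
        {"id": "VB1-1-2", "doc": 4, "children": []}]},
    {"id": "VB1-2", "doc": 10, "children": []}]}

print(segment(page, 10))  # → ['VB1-1-1-1', 'VB1-1-2', 'VB1-2']
```

With PDoC = 10, VB1-2(10) stops immediately, VB1-1(4) is split, and VB1-1-2(4) survives only because it has no children left to split, mirroring the figure.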

Figure 1.3 VB 1-1-1(8)

Figure 1.4 VB 1-1-2(4)

The basic model of vision-based content structure for Web pages is described as follows. A Web page is represented as a triple Ω = (O, Φ, δ). O = {Ω1, Ω2, ..., ΩN} is a finite set of blocks; these blocks do not overlap, and each block can be recursively viewed as a sub-Web-page associated with a sub-structure induced from the whole page structure. Φ = {φ1, φ2, ..., φT} is a finite set of separators, including horizontal separators and vertical separators. Every separator has a weight indicating its visibility, and all the separators in the same Φ have the same weight. δ is the relationship between every two blocks in O and can be expressed as δ: O × O → Φ ∪ {NULL}. For example, suppose

Ωi and Ωj are two objects in O; δ(Ωi, Ωj) ≠ NULL indicates that Ωi and Ωj are exactly separated by the separator δ(Ωi, Ωj), or we can say that the two objects are adjacent to each other; otherwise, there are other objects between the two blocks Ωi and Ωj [Cai 2003]. Figure 1.5 shows an example of vision-based content structure for a Web page of Yahoo! Auctions. It illustrates the layout structure and the vision-based content structure of the page. We can then further construct a sub content structure for each sub Web page.

Figure 1.5 A sample Web page segmented
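The triple (O, Φ, δ) can be rendered as a small data structure; this is a sketch, and the class and method names are our own, not from the project:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Separator:
    """A separator φ with the visibility weight described in the text."""
    sid: str
    weight: int

class ContentStructure:
    """A minimal rendering of the triple (O, Φ, δ)."""
    def __init__(self, blocks):
        self.blocks = set(blocks)   # O: the finite set of blocks
        self.delta = {}             # δ: unordered block pairs -> separator

    def separate(self, a, b, sep):
        """Record that blocks a and b are exactly separated by sep."""
        self.delta[frozenset((a, b))] = sep

    def between(self, a, b):
        """δ(a, b): the separator between a and b, or None (NULL in the
        text) when other objects lie between the two blocks."""
        return self.delta.get(frozenset((a, b)))

cs = ContentStructure(["VB1", "VB2", "VB3"])
cs.separate("VB1", "VB2", Separator("phi1", 5))
print(cs.between("VB2", "VB1"))  # → Separator(sid='phi1', weight=5)
```

Using an unordered pair as the key keeps δ symmetric, matching the definition above.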

Figure 1.5(a) Segmentation of a Web page [Cai 2003]

Figure 1.5(b) An example of Web-based content structure [Cai 2003]

For each visual block, the DoC is defined to measure how coherent it is. The DoC has the following properties: the greater the DoC value, the more consistent the content within the block; and in the hierarchy tree, the DoC of a child is not smaller than that of its parent. In the VIPS algorithm, DoC values are integers ranging from 1 to 10, although different ranges (e.g., real numbers) could alternatively be used. We can pre-define the PDoC to achieve

different granularities of content structure for different applications: the smaller the PDoC, the coarser the content structure. For example, in Figure 1.5(a) the visual block VB2_1 may not be further partitioned with an appropriate PDoC. Different applications can use VIPS to segment a Web page to a different granularity with a proper PDoC [Robertson 1997].

The vision-based content structure is more likely to provide a semantic partitioning of the page; every node of the structure is likely to convey certain semantics [Gupta 2005]. For instance, in Figure 1.5(a) we can see that VB2_1_1 denotes the category links of Yahoo! Shopping auctions, and that VB2_2_1 and VB2_2_2 show details of the two different comics.

1.4 Visual Based Segmentation

In the Vision Based Segmentation (VBS) algorithm, various visual cues, such as position, font, color, and size, are taken into account to achieve a more accurate content structure on the semantic level. These visual cues are adapted from the VIPS algorithm. Not all current segmentation algorithms can determine the data regions or data record boundaries, because they were not developed for this purpose, but they do provide important semantic partition information about a Web page. VBS is a top-down algorithm: it first extracts all the suitable nodes from the HTML DOM tree and then builds visual blocks from these nodes. The following observations were made from an analysis of the algorithm over various Web pages.

1. Similar data records are typically presented in one or more contiguous regions of a page, with one major region containing most data records and several other minor regions. Although there may be some noise, such as sponsored links or paid commercials, in the middle of a contiguous region, this type of noise usually has a very different visual

structure from the data records. In addition, there is usually more than one data record before and after the noise.

2. Similar data records are usually siblings, and a leaf or terminal node is not a data record, because a data record can be further partitioned into more than one sub-block in the block tree. Although there are cases where similar data records have different degrees of coherence (DoC) and sit at different depths in the block tree, the depth gap is usually as small as 1.

3. In block trees, a data record is usually self-contained in a subtree and contains at least two different types of blocks.

4. Data records located in different block subtrees (or data regions) usually have different block tree structures. In most cases, there is no need to compare data records that are not siblings.
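Observations 2 through 4 suggest locating data records by comparing the subtree structures of sibling blocks. The following is a sketch; the block representation and function names are our own:

```python
def tree_shape(block):
    """The canonical shape of a block subtree: its tag plus the shapes
    of its children, ignoring content."""
    return (block["tag"], tuple(tree_shape(c) for c in block["children"]))

def sibling_record_groups(parent, min_group=2):
    """Groups sibling blocks that share an identical subtree shape
    (observation 2: similar data records are usually siblings).  Leaf
    siblings are excluded, since a leaf node is not a data record."""
    groups = {}
    for child in parent["children"]:
        groups.setdefault(tree_shape(child), []).append(child)
    return [g for g in groups.values()
            if len(g) >= min_group and g[0]["children"]]
```

For example, three sibling blocks each shaped as an image plus a text line form one group, while a lone noise block with a different shape is left out.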

2. NARRATIVE

The VBS algorithm used in this project works on the DOM structure of a Web page: visual blocks are created from the DOM structure using visual cues and heuristics. The interface is written in C# on the Microsoft .NET 3.6 platform. The input to the application is the HTML address of any Web page; the interface segments the Web page into visual blocks that contain data records. The process of visual block extraction is explained in the following section.

2.1 Visual Block Extraction

In this step, we aim to find all appropriate visual blocks contained in the current sub-page. In general, every node in the DOM tree can represent a visual block. However, some huge nodes such as <TABLE> and <P> are used only for organization purposes and are not appropriate to represent a single visual block; in these cases, the current node should be further divided and replaced by its children. Due to the flexibility of HTML grammar, many Web pages do not fully obey the W3C HTML specification, so the DOM tree cannot always reflect the true relationships among DOM nodes. For each extracted node that may represent a visual block, we judge whether the DOM node can be divided based on the following considerations:

- DOM node properties: for example, the HTML tag of the node, its background color, and the size and shape of the block corresponding to the node.
- The properties of the children of the DOM node: for example, the HTML tags of the children, their background colors, and their sizes. The number of different kinds of children is also a consideration [Cai 2003].

Based on the HTML 4.01 specification, we classify DOM nodes into two categories, inline nodes and line-break nodes:

Inline node: a DOM node with an inline-text HTML tag. These tags affect the appearance of text and can be applied to a string of characters without introducing a line break; they include <B>, <BIG>, <EM>, <FONT>, <I>, <STRONG>, <U>, etc.

Line-break node: a node with a tag other than the inline-text tags.

Based on the appearance of the node in the browser and the properties of its children, we give some further definitions:

Valid node: a node that can be seen through the browser; its width and height are not equal to zero.

Text node: the DOM node corresponding to free text, which does not have an HTML tag.

Virtual text node (recursive definition): an inline node with only text node children, or an inline node whose children are only text nodes and virtual text nodes.

Some important cues used to produce the heuristic rules in the algorithm are:

Tag cue:
1. Tags such as <HR> are often used to separate different topics visually; hence, we prefer to divide a DOM node if it contains these tags.
2. If an inline node has a child that is a line-break node, we divide the inline node.

Color cue: We prefer to divide a DOM node if its background color differs from that of one of its children. At the same time, the child node with the different background color is not divided in this round.

Text cue: If most of the children of a DOM node are text nodes or virtual text nodes, we prefer not to divide it.

Size cue: We predefine a relative size threshold (the node size compared with the size of the whole page or sub-page) for different tags (the threshold varies among DOM nodes with different HTML tags). If the relative size of the node is smaller than the threshold, we prefer not to divide the node.

Based on these cues, we can produce heuristic rules to judge whether a node should be divided. If a node should not be divided, a block is extracted.
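The node categories and cues above can be combined into a single divide-or-keep decision. This is a sketch: the node fields, the "most children" ratio, and the default threshold are assumptions, the per-tag thresholds are collapsed into one value, and the second tag-cue rule is omitted for brevity:

```python
INLINE_TAGS = {"b", "big", "em", "font", "i", "strong", "u"}

def is_text(node):
    return node["tag"] == "#text"

def is_virtual_text(node):
    """Recursive definition from the text: an inline node whose children
    are all text nodes or virtual text nodes."""
    if node["tag"] not in INLINE_TAGS:
        return False
    return all(is_text(c) or is_virtual_text(c) for c in node["children"])

def should_divide(node, page_size, size_threshold=0.1):
    """Divide-or-keep decision for one DOM node, combining the cues."""
    children = node["children"]
    # Tag cue: a separator tag such as <HR> means the node spans topics.
    if any(c["tag"] == "hr" for c in children):
        return True
    # Color cue: a child with a different background color splits the node.
    if any(c.get("bgcolor", node.get("bgcolor")) != node.get("bgcolor")
           for c in children):
        return True
    # Text cue: mostly text / virtual-text children -> keep the node whole.
    text_like = sum(is_text(c) or is_virtual_text(c) for c in children)
    if children and text_like * 2 > len(children):
        return False
    # Size cue: a node small relative to the page is kept whole.
    if node["size"] / page_size < size_threshold:
        return False
    return True
```

A node that passes every "keep" test (no separator child, uniform color, few text children, large relative size) is divided and its children examined in the next round, matching the extraction loop described above.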

3. SYSTEM DESIGN

3.1 Analysis

This project developed an algorithm to extract the semantic structure of Web pages using the VBS algorithm. A DOM tree is created, and the Web page is segmented based on the heuristics. Then noise such as advertisements and pop-ups is removed, and the data records are displayed. The data flow is illustrated in Figure 3.1.

Web Pages -> Generate DOM Tree -> VBS -> Segmented Web pages

Figure 3.1 Overall dataflow

The VBS algorithm partitions the Web page using a set of heuristic rules that exceed the performance of the VIPS algorithm and offer better page segmentation. The Web page in Figure 3.2 is given as input; a DOM structure is generated and then given as input to the VBS algorithm.

Figure 3.2 A sample input Web page

3.2 User Interface

The user interface shown in Figure 3.3 has two input fields. The first is the address bar, where the Website's address is entered; the other is a numeric field that specifies a text-count threshold for visual blocks. Visual blocks with less text than this value are considered noise and are eliminated. Following the algorithm, the interface shows the nodes and branches of the tree it creates.
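The threshold filter behind the numeric field can be sketched as a recursive prune over the block tree; the block representation here is our own:

```python
def filter_noise(block, min_text):
    """Prunes visual blocks whose text count falls below the threshold
    entered in the interface; a block survives if it has enough text or
    retains at least one surviving child."""
    kept = [c for c in (filter_noise(ch, min_text)
                        for ch in block["children"])
            if c is not None]
    if block["text_len"] < min_text and not kept:
        return None  # noise: too little text and nothing worth keeping below
    return {"id": block["id"], "text_len": block["text_len"],
            "children": kept}

body = {"id": "VB(6587)", "text_len": 6587, "children": [
    {"id": "VB(297)", "text_len": 297, "children": []},
    {"id": "ad", "text_len": 8, "children": []}]}
print([c["id"] for c in filter_noise(body, 50)["children"]])  # → ['VB(297)']
```

With a threshold of 50, the small "ad" block is dropped while the data record block survives.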

Figure 3.3 A sample VBS-partitioned Web page

Figure 3.3 shows the analysis of a Web page. It contains many data blocks, each holding information about books: a data block is a collection of an image and various texts representing price, title, ISBN, etc. Initially, the whole Web page is partitioned into visual blocks. In Figure 3.3, VB denotes a visual block, and the value in parentheses is the count of text in the respective block. In Figure 3.4, VB(3184) represents the header of the Web page. VB(6587) contains the body of the Web page, including the data records; it is further segmented into the data blocks containing information about the books. VB(1730) represents the footer of the Web page, which mostly contains noise.

Figure 3.4 User interface with segmented data records

Figure 3.4 shows the data records segmented from the Web page. VB(297) is the first data record in the Web page; it consists of an image and text lines that give information about the book. Similarly, VB(116) is a segmented data record. The visual cues employed in VBS are based on the following tag groups:

- Format tags: "B", "I", "A", "U", "STRONG", "BR", "EM", "CITE", "VAR", "ABBR", "Q"
- Special tags: "DFN", "CODE", "SUB", "SUP", "SAMP", "KBD", "ACRONYM", "FONT", "HR"
- Text tags: "P", "PRE", "SPAN"
- List tags: "UL", "OL", "LI", "DL", "DT", "DD"
- Image tags: "IMG", "MAP", "AREA"

- Heading tags: "H1", "H2", "H3"

Figure 3.5 shows the visual blocks of a sample Web page segmented using the VBS algorithm. The analysis of the HTML code starts at the head of the Web page and traverses toward the bottom of the page. The VBS algorithm starts building visual blocks from the top of the Web page. It first encounters a table with a single row and two columns. Since the table is labeled as visible and contains visual cues such as text cues, format cues, and list cues, it is segmented as a single visual block named VB(837), and it is then further subdivided by the VBS algorithm into two visual blocks named VB(734) and VB(819), where VB(819) represents the table column that contains department links. VB(734) represents a table column that contains data elements as rows. The table rows that contain data records are further subdivided into visual blocks named VB(299), VB(116), and VB(312). The conditions used to determine a visual block from visible elements are explained below.

Figure 3.5 A segmented Web page mapped to visual blocks

This project uses new heuristics in page segmentation to achieve better data extraction. The new heuristics improve on the performance of the VIPS algorithm, producing more efficient data extraction results. The procedure that VBS uses to segment a Web page into visual blocks is as follows:

1. Start the page examination from the body element and obtain the suitable nodes.
2. Build a tree of the elements.
3. Recursively walk through the structure.
4. Define a current visual block that represents the current node.

5. Walk through all visible child elements.
6. For each visible element, try to retrieve a visible block.
7. If a block is found and its text length is less than or equal to the supplied threshold value, add it to the current visual block.
8. For each child visual block, perform operations 3 through 7.

3.3 Vision-based Content Structure for Web Pages

This project identifies the basic object as a leaf node in the DOM tree that cannot be decomposed any further. It uses the vision-based content structure, where every node, called a block, is a basic object or a set of basic objects. It is important to note that the nodes in the vision-based content structure do not necessarily correspond to the nodes in the DOM tree [Tang 1999]. The reasons causing incorrect data extraction and unacceptable segmentation of Web pages using VIPS are as follows.

1. VIPS translates data records as terminal nodes (leaf nodes) in the visual block tree; this includes cases where multiple data records end up in a single leaf node. In Figure 3.6, all the data records are translated into one leaf node that cannot be further divided, which severely limits the capability of segmenting all data records. VB(11) is the leaf node that has more data records inside it.

Figure 3.6 VIPS segments data records as a leaf node

2. VIPS puts wrong attributes in a node (e.g., link length = 0 in a link node) or puts incomplete information in a node, which may cause wrong leaf node reduction. For the same reason, a noise node may not be identified and removed. Part of the problem also lies with Web page designers who do not pay enough attention to specifying information about elements in their tags. Figure 3.7 shows a Web page segmented by VIPS, highlighting a node that is empty.

Figure 3.7 A Web page analyzed by VIPS which identifies noise as a node

3. It is observed that the major contribution of noise comes from the edges of Web pages; most such noise consists of drop-down lists, action buttons, or text boxes. Figure 3.8 shows a sample Web page that has a drop-down list at the edge of a node.

Figure 3.8 The drop-down list which acts as noise

4. When identifying data records, the node type with the highest number of occurrences can be a non-data record: a node that occurs frequently may be an ad or a text that is repeated to draw attention, and VIPS mistakes it for a data record. Figure 3.9 shows a Web page segmented by VIPS that identifies noise blocks as data records, because the keyword "books" is found in them and the blocks are of similar size.

Figure 3.9 A Web page segmented by VIPS that identifies noise as data records

5. Data records may scatter across more than two levels in the block tree (a complex visual structure). Figure 3.10 shows a complex visual structure, where C denotes a category, D denotes a data record, and R denotes related content.

Figure 3.10 Complex Visual Structure

6. Occasionally, similar blocks are not data records at all. Figure 3.8 shows that even though the noise blocks are similar, they are not data records.

7. In rare cases, VIPS splits a data record into different visual blocks. Figure 3.11 shows a Web page on which VIPS splits one data record into two.

Figure 3.11 A Web page analyzed by VIPS which identifies noise as a node

After evaluating the performance of the VIPS algorithm, the following heuristics are proposed to improve the performance of the algorithm with respect to visual segmentation:

- A data record should be considered a block by default, which eliminates the case of multiple data records in a single leaf node. This is achieved in the VBS algorithm by assuming that all elements are visual blocks and then filtering the elements out based on the visual cues. Figure 3.6 shows a Web page segmented by VIPS where multiple

data records are in a single leaf node; Figure 3.12 shows the same Web page segmented by the VBS algorithm, which achieves better segmentation.

Figure 3.12 VBS segments data records

- A node with a high number of occurrences should be eliminated as a non-data record, since it is more likely to be a noise component that repeats itself. Accordingly, the VBS algorithm does not use the number of occurrences to determine whether a node is a data record: nearly every Web page built in recent years features ads that contribute to the income of the company or the individual, and these ads are embedded in the Web page repeatedly for maximum visibility, so relying on occurrence counts leads to irregular data segmentation. The Web page from Figure 3.8 is segmented using VBS in Figure 3.13, which shows that the noise blocks are not segmented as data records; the whole noise block is treated as a single leaf node.

Figure 3.13 A Web page segmented by VBS that ignores noise

Finally, data records should not be translated into terminal nodes; instead, a separator should serve as the terminal node, which more accurately depicts the visual blocks that contain the data records. The VBS algorithm does not implement separators, so this suggestion applies to algorithms that achieve segmentation using separators.

4. EVALUATION AND RESULTS

In this experiment, three data sets are used to compare the performance of our VBS algorithm against the VIPS algorithm. The three data sets come from different sources. The first data set (Data 1) is the dataset used by ViNTs. The second data set (Data 2) comes from the manually labeled Testbed for Information Extraction from the Deep Web (TBDW). TBDW holds query results from 51 search engines, with five query result pages for each search engine; we collect only the first result page (1.html) of each engine. The third data set (Data 3) is gathered from the home pages listed in the MDR paper, which does not provide the URLs of the real data it tested [Zhai 2005]. The number of Web pages in each of the three data sets is shown in Table 1. The performance measures used to compare the two algorithms are recall = Ec/Nt and precision = Ec/Et, where Ec is the total number of correctly extracted data records, Et the total number of records extracted, and Nt the total number of data records contained in all the Web pages of a data set.
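The two measures can be written out directly. The functions below simply restate recall = Ec/Nt and precision = Ec/Et; the numbers in the usage example are illustrative only and are not results from Table 1.

```python
# recall = Ec / Nt and precision = Ec / Et, where Ec is the number of
# correctly extracted data records, Et the number of records extracted,
# and Nt the number of data records actually present in the pages.

def recall(correct, total_true):
    return correct / total_true

def precision(correct, total_extracted):
    return correct / total_extracted

# Illustrative numbers only: 90 correct extractions out of 95 extracted,
# with 100 true records on the pages.
print(recall(90, 100))    # 0.9
print(precision(90, 95))  # ~0.947
```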

Table 1 Evaluation Results

                    Data 1          Data 2          Data 3
                  VIPS    VBS     VIPS    VBS     VIPS    VBS
# Web pages
# DRs
# Extracted DRs
# Correct DRs
Recall
Precision

Table 1 shows the values of recall and precision achieved by both the VIPS and VBS algorithms. VBS achieves higher precision and recall on Dataset 1 and Dataset 2, whereas the results are close on Dataset 3. The evaluation is based on the number of data records segmented by each algorithm and the correctness of the DOM tree it generates; an additional comment about their performance has also been recorded. The value of DoC for the VIPS algorithm was set to 10 for evaluation purposes. The correctness of a data record is based on the following rules:

1. A data record is correctly extracted only if it contains everything belonging to it and nothing else. If some part of the data record is missing, or the data record contains irrelevant content (e.g., a part of another data record), the data record is incorrectly extracted. Therefore, a nested data record is considered incorrect in our experiment.

2. The suggested search results, the most popular search results, and the sponsored links, which are often listed at the top of the result page, are not counted, because they can usually be found in the full results list or are irrelevant to the query. In addition, banners, advertisement-like item images, and item categories are not considered data records.

3. Data records may come from multiple data regions in a result page rather than from just one major region.
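The first correctness rule amounts to an exact-match test between the extracted record and its ground-truth record. The predicate below is an illustrative restatement of that rule (the part names `title`, `price`, etc. are hypothetical), showing that both a missing part and an extra part make the extraction incorrect.

```python
# Rule 1 as a predicate: an extracted record is correct only if it matches
# its ground-truth record exactly -- nothing missing, nothing extra.

def correctly_extracted(extracted_parts, true_parts):
    return set(extracted_parts) == set(true_parts)

truth = {"title", "price", "image"}                                  # hypothetical record parts
print(correctly_extracted({"title", "price", "image"}, truth))       # True  (exact match)
print(correctly_extracted({"title", "price"}, truth))                # False (part missing)
print(correctly_extracted({"title", "price", "image", "ad"}, truth)) # False (irrelevant content)
```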

5. FUTURE WORK

The vision-based algorithm can be further improved by adding heuristics that address newer HTML elements, such as embedded Flash players and streaming videos, which also hold a considerable amount of data. The user interface could likewise be extended with another panel indicating the characteristics of each block and its elements; this would come in handy when determining whether a block is noise or a data record with a large amount of data. More emphasis on block segmentation will also contribute to more accurate data extraction. Modern Web developers use technologies such as Ajax, JavaScript, and Adobe Flex, which produce layouts quite different from the traditional HTML layout.

6. CONCLUSION

An automatic, top-down, tag-tree-independent, and scalable algorithm to detect Web content structure has been presented. It simulates how a user understands the layout structure of a Web page based on its visual representation. Compared with traditional DOM-based segmentation methods, our scheme utilizes visual cues from the VIPS algorithm to obtain a better partition of a page at the semantic level. It is also independent of the physical realization and works well even when the physical structure differs greatly from the visual presentation. The resulting Web content structure is helpful for applications such as Web adaptation, information retrieval, and information extraction. By identifying the logical relationships of Web content based on visual layout information, the content structure can effectively represent the semantic structure of the Web page. Using the proposed rules, the visual segmentation capability of VBS exceeds that of VIPS, and noise in complex data structures is also reduced.

BIBLIOGRAPHY AND REFERENCES

[Ashish 1997] Ashish, N. and Knoblock, C. A., "Semi-Automatic Wrapper Generation for Internet Information Sources," In Proceedings of the Conference on Cooperative Information Systems, 1997.

[Buyukkokten 2001] Buyukkokten, O., Garcia-Molina, H., and Paepcke, A., "Accordion summarization for end-game browsing on PDAs and cellular phones," In Proceedings of the Conference on Human Factors in Computing Systems (CHI '01), 2001.

[Buyukkokten] Buyukkokten, O., Garcia-Molina, H., and Paepcke, A., "Seeing the whole in parts: text summarization for Web browsing on handheld devices," In Proceedings of the 10th International World-Wide Web Conference, 2001.

[Cai 2003] Cai, D., "VIPS: a Vision-based Page Segmentation Algorithm," Microsoft Technical Report MSR-TR-2003-79, 2003.

[Chen 2003] Chen, Y., Ma, W. Y., and Zhang, H. J., "Detecting Web page structure for adaptive viewing on small form factor devices," In Proceedings of WWW '03, Budapest, Hungary, May 2003.

[Embley 1999] Embley, D. W., Jiang, Y., and Ng, Y.-K., "Record-boundary discovery in Web documents," In Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data, Philadelphia, PA, 1999.

[Finn 2001] Finn, A., Kushmerick, N., and Smyth, B., "Fact or fiction: content classification for digital libraries," In Proceedings of the Joint DELOS-NSF Workshop on Personalization and Recommender Systems in Digital Libraries (Dublin), 2001.

[Gupta 2005] Gupta, S. and Kaiser, G., "Automating Content Extraction of HTML Documents," World Wide Web: Internet and Web Information Systems, 8, 2005.

[Kaasinen 2000] Kaasinen, E., Aaltonen, M., Kolari, J., Melakoski, S., and Laakko, T., "Two approaches to bringing Internet services to WAP devices," In Proceedings of the 9th International World-Wide Web Conference, 2000.

[Kan 1998] Kan, M.-Y., Klavans, J. L., and McKeown, K. R., "Linear segmentation and segment relevance," In Proceedings of the 6th International Workshop of Very Large Corpora (WVLC-6), 1998.

[McKeown 2001] McKeown, K. R., Barzilay, R., Evans, D., Hatzivassiloglou, V., Kan, M.-Y., Schiffman, B., and Teufel, S., "Columbia multi-document summarization: approach and evaluation," In Proceedings of the Document Understanding Conference, 2001.

[Rahman 2001] Rahman, A. F. R., Alam, H., and Hartono, R., "Content extraction from HTML documents," In Proceedings of the 1st International Workshop on Web Document Analysis (WDA2001), 2001.

[Robertson 1997] Robertson, S. E., "Overview of the Okapi projects," Journal of Documentation, Vol. 53, No. 1, 1997.

[Tang 1999] Tang, Y. Y., Cheriet, M., Liu, J., Said, J. N., and Suen, C. Y., "Document Analysis and Recognition by Computers," Handbook of Pattern Recognition and Computer Vision, edited by C. H. Chen, L. F. Pau, and P. S. P. Wang, World Scientific Publishing Company, 1999.

[Zhai 2005] Zhai, Y. and Liu, B., "Web Data Extraction Based on Partial Tree Alignment," WWW 2005, Chiba, Japan, May 10-14, 2005.

APPENDIX A. WEBSITES TESTING RESULTS

* For a detailed description, please refer to Section 4: Evaluation and Results.


Dataset 1 (MDR_DATA)

Web page                   VIPS records found   VBS records found   Comments
Advanced Travel Portal     2                    2                   VIPS doesn't detect the tool bar in the individual records
Amazon top sellers
Asia travels                                                        Both detect the same nodes in a different order
Barnes and Nobles
Bookpool
Buy HP                     0                    3                   VIPS cannot display the web page
Buy Products Online
Codys Books
Comp Usa                   8                    8                   VBS segmented records as rows
Computers-Mama
Costarica tours
Discount cheap software
Ebay Plasma
Find video games
Fragrance Cosmetics        9                    9
Gaming                     8                    8
Gifts under                                                         VIPS finds a blank space as a record, which is not desirable
GPS Navigation             6                    6
Kadys Books
Kids footlocker
Kodak Easy Share           9                    9                   Same functionality
Low cost Domain            6                    6
Mapquest                   0                    0                   No records available
Lycos search               0                    0                   Script would not allow analysis
New Egg
Overstock Product List                                              Similar analysis
Radioshack                 1                    1                   Treats all the data records as one record
Sos store
Shop lycos                 9                    9
Software outlet
Summer Jobs                                                         VIPS segmentation not accurate because of the large amount of text involved
U BID
Waffles
Eqarl The EMU              5                    5
Welcome to Streets
World wide airport         0                    0                   No data records in the page, but similar data blocks are recognized
Yahoo Auctions

Dataset 2 Testing Results (TBDW_Testbed)

Per-page comments (VIPS vs. VBS):

- VIPS detects a link as text, causing neighboring edge noise to be detected as data records; VBS detects them accurately
- In VBS all the data records are under the same parent, whereas VIPS has the following errors. Missing 1: complicated situation, just one data record under its parent. Missing 5: actually under an incorrect data record. Missing 4: actually under an incorrect data record
- Missing 1: complicated situation, just one data record under its parent
- VIPS splits data records into different sub-trees
- In VIPS all data record nodes are links; after reduction only a leaf node remains, so no comparison is possible, whereas in VBS the links can be further analyzed
- VIPS detects a link as text, causing neighboring edge noise to be detected as data records
- Wrong node type from VIPS. Complex tree
- VIPS segments all the data as leaves under a single node; VBS further segments the node
- Only 1 data record
- VIPS segments all the data as leaves under a single node; in VBS each data record is a separate node
- Wrong node type from VIPS
- Wrong node type from VIPS. Tree is wrong. VBS tree structure shows all nodes on the same level
- Wrong node type from VIPS. Tree is wrong. VBS tree structure shows all nodes on the same level
- Wrong node type from VIPS. Tree is wrong
- VIPS puts all the data in a single leaf; VBS partitions all the records
- Wrong node type from VIPS
- Please double check the page, I think there are > 10 DR
- Wrong node type from VIPS
- Most data record nodes are links; after reduction only a leaf node remains, so no comparison
- All similar blocks are treated as data blocks in VIPS
- VIPS doesn't detect blocks
- Data records cannot be extracted because all the records are on the same level under a single node
- All similar blocks are treated as data blocks in VIPS
- Wrong node type from VIPS. Tree is wrong
- Wrong node type from VIPS. Tree is wrong
- First 3 data records found by VIPS are very different
- Wrong node type from VIPS. Tree is wrong
- Wrong node type from VIPS. Tree is wrong
- Data records are leaves, not internal nodes
- All data records are similar to noise
- VIPS puts all data records in a single leaf
- Large amount of noise undetected in VIPS


Dataset 3 (VINTS_DATA)

Web page          VIPS   VBS   Comments
Agents                         VIPS treats each data record as a leaf
Alphabets                      Not enough information from VIPS
Alpha works       9      9     Each data record itself is a leaf node
Amazoid           6      6     VIPS did not treat a data record as less than one internal node.
Amazon
AW
Barnes            9      10    Incorrect 2: noise on the edge. Missing 17: nearly every level has one data record.
Book Buyer
Bookpool                       Missing 1: it is on a different level from the other 24 data records.
Borders                        Not enough information from VIPS
Canoe2            9      9     VIPS treats each data record as several link-type leaf nodes
Canoe                          The similarity threshold is not high enough
Cbc customers     7      7
Chapters                       VIPS treats all 20 data records as a leaf
Cnet
Cnet games        5      5     Missing 3: each of these 3 data records is itself a link-type leaf node.
Cnet tech                      VIPS did not mark the data records on the same level.
Cody
Dwjava                         Missing 1: it is under a different parent node from the other data records, with only one data record under its parent (the complicated situation). Incorrect 6: leaf node reduction makes dissimilar data records seem similar
Dwxml
Ebay              3      3     Not enough information from VIPS
Etoys             9      9     VIPS did not mark right, so the big link noise cannot be deleted
Excite            0      15    Incorrect 2: noise on the edge. Missing 10: these 10 data records are actually under the 2 incorrect records, 5 under each; both have a leaf node, so they are marked as data records. Missing 5: each data record is itself a leaf node.
Fat brain         0      25    VIPS marks link type as text type and then vsdr uses leaf reduction. Incorrect 5: noise on the edge; the VIPS marking is not detailed enough, making 1-1 and 1-2 similar.
Gamecenter                     Incorrect 1: leaf reduction makes dissimilar records seem similar
Gamelan           2      10    Missing 10: VIPS marks link type as text type and then vsdr uses leaf reduction. Incorrect 2: noise on the edge
Google            0      10    Missing 10: VIPS marks link type as text type and treats each data record as one leaf. Incorrect 10: noise on the edge
Goto                           VIPS does not mark right, and the similarity threshold is not high enough.
Hotbot
Ibm               4      4     Invalid characters
Infoseek
Itn               0      10    Missing 19: VIPS marks link type as text type
King              0      19
Lc
Lycos                          Missing 4: VIPS does not even generate corresponding IDs for these four data records. Missing 10: VIPS marks link type as text type, and vsdr leaf reduction turns each data record into a text leaf. Missing 3: each of these 3 is actually a combination of two of the incorrect records. Incorrect 4: noise on the edge
Magazine outlet
Msn               0      50    Missing 10: VIPS marks other types as text type.
Powells           7      8     Incorrect 4: noise on the edge. Missing 1: it is under a different parent node from the other data records, with only one data record under its parent (the complicated situation)
Quote
Rubylane                       VIPS does not give enough information
Signpost                       VIPS does not give enough information
Thestar
Vancouverson      0      4     VIPS treats each data record as a leaf node
Vunet             0      10    VIPS marks link type as text type. Incorrect 2: noise on the edge
Wine
Yahoo                          Missing 16: VIPS marks link type as text type, and vsdr leaf reduction turns each data record into a text leaf. Incorrect 5: noise on the edge
Yahoo             0      17    Most data records are leaves; the others are reduced during leaf node reduction.
Yahoo Auction
Zbooks                         Missing 2: VIPS treats each of them as a leaf node. Incorrect 1: VIPS does not give enough information; VBS detects all the data records
Zshop


More information

WML2.0 TUTORIAL. The XHTML Basic defined by the W3C is a proper subset of XHTML, which is a reformulation of HTML in XML.

WML2.0 TUTORIAL. The XHTML Basic defined by the W3C is a proper subset of XHTML, which is a reformulation of HTML in XML. http://www.tutorialspoint.com/wml/wml2_tutorial.htm WML2.0 TUTORIAL Copyright tutorialspoint.com WML2 is a language, which extends the syntax and semantics of the followings: XHTML Basic [ XHTMLBasic ]

More information

Object Extraction. Output Tagging. A Generated Wrapper

Object Extraction. Output Tagging. A Generated Wrapper Wrapping Data into XML Wei Han, David Buttler, Calton Pu Georgia Institute of Technology College of Computing Atlanta, Georgia 30332-0280 USA fweihan, buttler, calton g@cc.gatech.edu Abstract The vast

More information

Deep Web Crawling and Mining for Building Advanced Search Application

Deep Web Crawling and Mining for Building Advanced Search Application Deep Web Crawling and Mining for Building Advanced Search Application Zhigang Hua, Dan Hou, Yu Liu, Xin Sun, Yanbing Yu {hua, houdan, yuliu, xinsun, yyu}@cc.gatech.edu College of computing, Georgia Tech

More information

Web Scraping Framework based on Combining Tag and Value Similarity

Web Scraping Framework based on Combining Tag and Value Similarity www.ijcsi.org 118 Web Scraping Framework based on Combining Tag and Value Similarity Shridevi Swami 1, Pujashree Vidap 2 1 Department of Computer Engineering, Pune Institute of Computer Technology, University

More information

COMP519 Web Programming Lecture 3: HTML (HTLM5 Elements: Part 1) Handouts

COMP519 Web Programming Lecture 3: HTML (HTLM5 Elements: Part 1) Handouts COMP519 Web Programming Lecture 3: HTML (HTLM5 Elements: Part 1) Handouts Ullrich Hustadt Department of Computer Science School of Electrical Engineering, Electronics, and Computer Science University of

More information

Table Basics. The structure of an table

Table Basics. The structure of an table TABLE -FRAMESET Table Basics A table is a grid of rows and columns that intersect to form cells. Two different types of cells exist: Table cell that contains data, is created with the A cell that

More information

I R UNDERGRADUATE REPORT. Information Extraction Tool. by Alex Lo Advisor: S.K. Gupta, Edward Yi-tzer Lin UG

I R UNDERGRADUATE REPORT. Information Extraction Tool. by Alex Lo Advisor: S.K. Gupta, Edward Yi-tzer Lin UG UNDERGRADUATE REPORT Information Extraction Tool by Alex Lo Advisor: S.K. Gupta, Edward Yi-tzer Lin UG 2001-1 I R INSTITUTE FOR SYSTEMS RESEARCH ISR develops, applies and teaches advanced methodologies

More information

Intermediate Code Generation

Intermediate Code Generation Intermediate Code Generation In the analysis-synthesis model of a compiler, the front end analyzes a source program and creates an intermediate representation, from which the back end generates target

More information

Visualizing Etymology: A Radial Graph Displaying Derivations and Origins

Visualizing Etymology: A Radial Graph Displaying Derivations and Origins Visualizing Etymology: A Radial Graph Displaying Derivations and Origins Chinmayi Dixit Stanford University cdixit@stanford.edu Filippa Karrfelt Stanford University filippak@stanford.edu ABSTRACT Study

More information

Information Extraction from Web Pages Using Automatic Pattern Discovery Method Based on Tree Matching

Information Extraction from Web Pages Using Automatic Pattern Discovery Method Based on Tree Matching Information Extraction from Web Pages Using Automatic Pattern Discovery Method Based on Tree Matching Sigit Dewanto Computer Science Departement Gadjah Mada University Yogyakarta sigitdewanto@gmail.com

More information

Objectives. Introduction to HTML. Objectives. Objectives

Objectives. Introduction to HTML. Objectives. Objectives Objectives Introduction to HTML Developing a Basic Web Page Review the history of the Web, the Internet, and HTML. Describe different HTML standards and specifications. Learn about the basic syntax of

More information

MURDOCH RESEARCH REPOSITORY

MURDOCH RESEARCH REPOSITORY MURDOCH RESEARCH REPOSITORY http://researchrepository.murdoch.edu.au/ This is the author s final version of the work, as accepted for publication following peer review but without the publisher s layout

More information

Quark XML Author October 2017 Update for Platform with Business Documents

Quark XML Author October 2017 Update for Platform with Business Documents Quark XML Author 05 - October 07 Update for Platform with Business Documents Contents Getting started... About Quark XML Author... Working with the Platform repository...3 Creating a new document from

More information

(Refer Slide Time: 01:41) (Refer Slide Time: 01:42)

(Refer Slide Time: 01:41) (Refer Slide Time: 01:42) Internet Technology Prof. Indranil Sengupta Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur Lecture No #14 HTML -Part II We continue with our discussion on html.

More information

Web Data mining-a Research area in Web usage mining

Web Data mining-a Research area in Web usage mining IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661, p- ISSN: 2278-8727Volume 13, Issue 1 (Jul. - Aug. 2013), PP 22-26 Web Data mining-a Research area in Web usage mining 1 V.S.Thiyagarajan,

More information

CSC Web Programming. Introduction to HTML

CSC Web Programming. Introduction to HTML CSC 242 - Web Programming Introduction to HTML Semantic Markup The purpose of HTML is to add meaning and structure to the content HTML is not intended for presentation, that is the job of CSS When marking

More information

Web Data Extraction and Alignment Tools: A Survey Pranali Nikam 1 Yogita Gote 2 Vidhya Ghogare 3 Jyothi Rapalli 4

Web Data Extraction and Alignment Tools: A Survey Pranali Nikam 1 Yogita Gote 2 Vidhya Ghogare 3 Jyothi Rapalli 4 IJSRD - International Journal for Scientific Research & Development Vol. 3, Issue 01, 2015 ISSN (online): 2321-0613 Web Data Extraction and Alignment Tools: A Survey Pranali Nikam 1 Yogita Gote 2 Vidhya

More information

Trees, Part 1: Unbalanced Trees

Trees, Part 1: Unbalanced Trees Trees, Part 1: Unbalanced Trees The first part of this chapter takes a look at trees in general and unbalanced binary trees. The second part looks at various schemes to balance trees and/or make them more

More information

COPYRIGHTED MATERIAL. Contents. Introduction. Chapter 1: Structuring Documents for the Web 1

COPYRIGHTED MATERIAL. Contents. Introduction. Chapter 1: Structuring Documents for the Web 1 Introduction Chapter 1: Structuring Documents for the Web 1 A Web of Structured Documents 1 Introducing HTML and XHTML 2 Tags and Elements 4 Separating Heads from Bodies 5 Attributes Tell Us About Elements

More information

Quark XML Author October 2017 Update with Business Documents

Quark XML Author October 2017 Update with Business Documents Quark XML Author 05 - October 07 Update with Business Documents Contents Getting started... About Quark XML Author... Working with documents... Basic document features... What is a business document...

More information

A BFS-BASED SIMILAR CONFERENCE RETRIEVAL FRAMEWORK

A BFS-BASED SIMILAR CONFERENCE RETRIEVAL FRAMEWORK A BFS-BASED SIMILAR CONFERENCE RETRIEVAL FRAMEWORK Qing Guo 1, 2 1 Nanyang Technological University, Singapore 2 SAP Innovation Center Network,Singapore ABSTRACT Literature review is part of scientific

More information

CSI 3140 WWW Structures, Techniques and Standards. Markup Languages: XHTML 1.0

CSI 3140 WWW Structures, Techniques and Standards. Markup Languages: XHTML 1.0 CSI 3140 WWW Structures, Techniques and Standards Markup Languages: XHTML 1.0 HTML Hello World! Document Type Declaration Document Instance Guy-Vincent Jourdan :: CSI 3140 :: based on Jeffrey C. Jackson

More information

STD 7 th Paper 1 FA 4

STD 7 th Paper 1 FA 4 STD 7 th Paper 1 FA 4 Choose the correct option from the following 1 HTML is a. A Data base B Word Processor C Language D None 2 is a popular text editor in MS window A Notepad B MS Excel C MS Outlook

More information

A Simple Syntax-Directed Translator

A Simple Syntax-Directed Translator Chapter 2 A Simple Syntax-Directed Translator 1-1 Introduction The analysis phase of a compiler breaks up a source program into constituent pieces and produces an internal representation for it, called

More information

Chapter 3 Style Sheets: CSS

Chapter 3 Style Sheets: CSS WEB TECHNOLOGIES A COMPUTER SCIENCE PERSPECTIVE JEFFREY C. JACKSON Chapter 3 Style Sheets: CSS 1 Motivation HTML markup can be used to represent Semantics: h1 means that an element is a top-level heading

More information

Salesforce1 - ios App (Phone)

Salesforce1 - ios App (Phone) Salesforce1 - ios App (Phone) Web Content Accessibility Guidelines 2.0 Level A and AA Voluntary Product Accessibility Template (VPAT) This Voluntary Product Accessibility Template, or VPAT, is a tool that

More information

Quark XML Author for FileNet 2.8 with BusDocs Guide

Quark XML Author for FileNet 2.8 with BusDocs Guide Quark XML Author for FileNet.8 with BusDocs Guide Contents Getting started... About Quark XML Author... System setup and preferences... Logging on to the repository... Specifying the location of checked-out

More information

How to create a prototype

How to create a prototype Adobe Fireworks Guide How to create a prototype In this guide, you learn how to use Fireworks to combine a design comp and a wireframe to create an interactive prototype for a widget. A prototype is a

More information

A Survey Of Different Text Mining Techniques Varsha C. Pande 1 and Dr. A.S. Khandelwal 2

A Survey Of Different Text Mining Techniques Varsha C. Pande 1 and Dr. A.S. Khandelwal 2 A Survey Of Different Text Mining Techniques Varsha C. Pande 1 and Dr. A.S. Khandelwal 2 1 Department of Electronics & Comp. Sc, RTMNU, Nagpur, India 2 Department of Computer Science, Hislop College, Nagpur,

More information

Cascading Style Sheet

Cascading Style Sheet Extra notes - Markup Languages Dr Nick Hayward CSS - Basics A brief introduction to the basics of CSS. Contents Intro CSS syntax rulesets comments display Display and elements inline block-level CSS selectors

More information

Gestão e Tratamento da Informação

Gestão e Tratamento da Informação Gestão e Tratamento da Informação Web Data Extraction: Automatic Wrapper Generation Departamento de Engenharia Informática Instituto Superior Técnico 1 o Semestre 2010/2011 Outline Automatic Wrapper Generation

More information

DATA MINING II - 1DL460. Spring 2014"

DATA MINING II - 1DL460. Spring 2014 DATA MINING II - 1DL460 Spring 2014" A second course in data mining http://www.it.uu.se/edu/course/homepage/infoutv2/vt14 Kjell Orsborn Uppsala Database Laboratory Department of Information Technology,

More information

CSE 214 Computer Science II Introduction to Tree

CSE 214 Computer Science II Introduction to Tree CSE 214 Computer Science II Introduction to Tree Fall 2017 Stony Brook University Instructor: Shebuti Rayana shebuti.rayana@stonybrook.edu http://www3.cs.stonybrook.edu/~cse214/sec02/ Tree Tree is a non-linear

More information

Comp 336/436 - Markup Languages. Fall Semester Week 2. Dr Nick Hayward

Comp 336/436 - Markup Languages. Fall Semester Week 2. Dr Nick Hayward Comp 336/436 - Markup Languages Fall Semester 2017 - Week 2 Dr Nick Hayward Digitisation - textual considerations comparable concerns with music in textual digitisation density of data is still a concern

More information

COMSC-030 Web Site Development- Part 1. Part-Time Instructor: Joenil Mistal

COMSC-030 Web Site Development- Part 1. Part-Time Instructor: Joenil Mistal COMSC-030 Web Site Development- Part 1 Part-Time Instructor: Joenil Mistal Chapter 9 9 Working with Tables Are you looking for a method to organize data on a page? Need a way to control our page layout?

More information

Quark XML Author for FileNet 2.5 with BusDocs Guide

Quark XML Author for FileNet 2.5 with BusDocs Guide Quark XML Author for FileNet 2.5 with BusDocs Guide CONTENTS Contents Getting started...6 About Quark XML Author...6 System setup and preferences...8 Logging in to the repository...8 Specifying the location

More information

Chapter 1 Introduction to HTML, XHTML, and CSS

Chapter 1 Introduction to HTML, XHTML, and CSS Chapter 1 Introduction to HTML, XHTML, and CSS MULTIPLE CHOICE 1. The world s largest network is. a. the Internet c. Newsnet b. the World Wide Web d. both A and B A PTS: 1 REF: HTML 2 2. ISPs utilize data

More information

Form Identifying. Figure 1 A typical HTML form

Form Identifying. Figure 1 A typical HTML form Table of Contents Form Identifying... 2 1. Introduction... 2 2. Related work... 2 3. Basic elements in an HTML from... 3 4. Logic structure of an HTML form... 4 5. Implementation of Form Identifying...

More information

Web Usage Mining: A Research Area in Web Mining

Web Usage Mining: A Research Area in Web Mining Web Usage Mining: A Research Area in Web Mining Rajni Pamnani, Pramila Chawan Department of computer technology, VJTI University, Mumbai Abstract Web usage mining is a main research area in Web mining

More information

Introduction to Web Technologies

Introduction to Web Technologies Introduction to Web Technologies James Curran and Tara Murphy 16th April, 2009 The Internet CGI Web services HTML and CSS 2 The Internet is a network of networks ˆ The Internet is the descendant of ARPANET

More information

MURDOCH RESEARCH REPOSITORY

MURDOCH RESEARCH REPOSITORY MURDOCH RESEARCH REPOSITORY http://researchrepository.murdoch.edu.au/ This is the author s final version of the work, as accepted for publication following peer review but without the publisher s layout

More information

YuJa Enterprise Video Platform WCAG 2.0 Checklist

YuJa Enterprise Video Platform WCAG 2.0 Checklist Platform Accessibility YuJa Enterprise Video Platform WCAG 2.0 Checklist Updated: December 15, 2017 Introduction YuJa Corporation strives to create an equal and consistent media experience for all individuals.

More information

Microsoft Excel 2010 Handout

Microsoft Excel 2010 Handout Microsoft Excel 2010 Handout Excel is an electronic spreadsheet program you can use to enter and organize data, and perform a wide variety of number crunching tasks. Excel helps you organize and track

More information

Web Mining. Data Mining and Text Mining (UIC Politecnico di Milano) Daniele Loiacono

Web Mining. Data Mining and Text Mining (UIC Politecnico di Milano) Daniele Loiacono Web Mining Data Mining and Text Mining (UIC 583 @ Politecnico di Milano) References q Jiawei Han and Micheline Kamber, "Data Mining: Concepts and Techniques", The Morgan Kaufmann Series in Data Management

More information