Semantic HTML Page Segmentation using Type Analysis


Xin Yang, Peifeng Xiang, Yuanchun Shi
Department of Computer Science and Technology, Tsinghua University, Beijing, P.R. China
{yang-x02,

Abstract

Semantic information is necessary for Semantic Web processing and is useful to Web adaptation services such as the personalization of users' browsing activities on small-screen devices. However, in most existing HTML documents semantic information is only implicitly encoded. This paper describes a page segmentation method that parses Web pages into rectangular segments containing some semantic information, namely blocks. Existing page segmentation techniques are mainly built on the HTML DOM structure or are purely vision based, and are not accurate enough in either the visual or the semantic sense. Our approach is automatic and based on a refined typing system that tightly couples type analysis with indispensable visual cues to organize blocks into a tree structure, aiming at a high degree of coherence in both the semantic and the visual view. Experimental results show that our method is more accurate and more complete than existing ones.

Keywords: Page Segmentation, Block, Visual Cues, Type Recognition, Pattern Discovery, Semantic Structural Tree.

1. Introduction

The Semantic Web is likely to be the next-generation Web. Its basic infrastructure encompasses both online and offline databases filled with an enormous number of semantic objects. However, most existing pages, which are a necessary part of online resources, are encoded as HTML documents, in which semantic information is implicit but is presented visually in a structured way. For example, in Figure 1, the information in the red rectangle together represents the topic of Headline News, and the information in the blue rectangle represents a subtopic, each associated with a piece of news.
Essentially, Web pages are composed of several such rectangular areas, each of which contains some useful semantic information on the same topic, namely a block as in [3]. Semantic page segmentation is a preliminary step for advanced Semantic Web processing, and [7][13][14][15] show its great potential. For example, information retrieval and extraction can achieve much better results by treating sets of blocks, instead of the whole page, as the basic processing objects, e.g., [13][15]. Besides, specific Web adaptation services such as the personalization of users' browsing activities on small-screen devices can also benefit considerably from using semantic blocks directly as input units, e.g., [7][14].

Figure 1. A fragment of a news front page

Existing solutions to page segmentation fall into two categories. The first class is based on non-visual cues such as HTML DOM tags, content and links, e.g., [1][2][4][5][6][8][9][11]. Methods of this class often achieve low accuracy because they overlook visual cues. The second class takes the opposite approach, e.g., the purely vision-based method proposed in [3], but often achieves only a limited degree of semantic coherence because it relies too heavily on visual cues and yet fails to make full use of them.

From a human point of view, each Web page is a set of semantic blocks that are separate in visual presentation but semantically related to each other. [12] stresses the simple observation that semantically related items exhibit consistency in presentation style and spatial locality. This is especially useful for template-based Web pages such as news front pages and e-commerce sites. We make three further observations (Section 3.1) and use them to guide type analysis throughout the semantic page segmentation process, with both non-visual and visual cues taking effect. Additionally, Web pages may contain semantic-free items besides blocks, such as blank tables and white separators.
We filter them out to tidy the tree structure and simplify the algorithm. We use the idea of pattern discovery in [12] but implement it in an essentially different way, mainly by taking into account some indispensable visual cues. The contributions include:
- Defining a refined typing system built on basic types.
- Filtering out semantic-free items through type recognition.
- Coupling type analysis with visual cues by dynamically inserting and removing separator items and adjusting the relationship between adjacent items.

Section 2 presents a brief overview of related work. Section 3 describes our technique in detail, followed by experimental results in Section 4. Finally, Section 5 gives a discussion.

2. Related Work

Recently the Semantic Web has drawn more and more attention from researchers. Many contributions have emerged in areas such as Web page segmentation and information extraction, both of which are related to this issue.

On the one hand, many approaches have been proposed for Web page segmentation. [8] and [11] both use HTML tag information as cues, while [2][4][5][9] focus on content and link information. [1] tries to detect specific templates by making use of link information. In [6] a model called FOM (Function-based Object Model) is proposed to construct hierarchical structures for Web pages. All of the methods above try to extract semantics directly from Web pages but ignore the actual visual presentation style. [3] discusses the limitations of each and presents a vision-based algorithm, the so-called VIPS (Vision-based Page Segmentation), to extract the semantic structure of Web pages. It is based on the assumption that humans unconsciously divide Web pages into semantic segments by means of visual cues. Being independent of the tag tree, it works well even when the HTML structure is quite different from the actual layout structure. However, no semantic cues are taken into account, and visual cues are not fully exploited, which leads to a limited degree of semantic coherence within blocks.

On the other hand, some information extraction techniques involve algorithms related to page segmentation.
In [10], a flexible algorithm called MDR (Mining Data Records in Web Pages) is used to mine data records in Web pages. Data records are lists of regularly structured objects containing some information, and are somewhat similar to blocks. Compared with earlier automatic techniques, this algorithm works more accurately and effectively and can discover non-contiguous data records. However, because its original objective is to fill database tables, it overlooks the structural relationship among different data records and is therefore not suitable for general use. In [12], a framework coupling structural analysis of documents with semantic analysis using a domain ontology is developed to partition HTML documents into unlabeled partition trees by grouping together elements with related semantics. It exploits the key observation that semantically related items exhibit consistency in presentation style and spatial locality, and it tries to discover structural recurrence patterns for semantically related items under each subtree through a bottom-up process. However, it has two inherent limitations. First, it uses the specified HTML tag path as the type of each node, which makes it time consuming and unsuitable for real-time processing. Second, it relies on pattern discovery but overlooks visual cues, so it is not accurate enough and can hardly achieve completeness.

Our approach differs in that it borrows the idea of pattern discovery and makes it work in parallel with visual cues during the type analysis process. In addition, we filter out semantic-free items through the type recognition process. As a result, page segmentation can be achieved more accurately and comprehensively in both the visual and the semantic sense.

3. Semantic Segmentation

3.1. The Basic Idea

Our technique is based on the simple observation mentioned in Section 1.
Looking more closely at the relationship between the HTML DOM structure and the actual presentation style, we make three further observations, which give rise to the basic idea:
- Items with similar semantics usually have similar HTML tags. This gives rise to a refined typing system built on basic types. Each item is bound to a basic Atomic Type according to its tag and location in the HTML DOM tree. Semantic-free items can then be filtered out through a type recognition process.
- Similar semantic blocks usually contain items with similar HTML tag sequences. The typing system can therefore be extended by binding each semantic block to a sequence of atomic types, namely a Composite Type. This is done in parallel with the pattern discovery algorithm.
- Similar semantic blocks are usually located in the same subtree structure and have the same parent. This leads to the idea of using visual cues as an aid in our type analysis. We take two measures, both of which work effectively:
  - Dynamically inserting and removing separator items during the pattern discovery process.
  - Adjusting the relationship between adjacent items.

Given an HTML document, we obtain its DOM tree and then parse it into a semantic structural tree through a two-step strategy, with each node denoting a block, as shown in Figure 2(b):

Figure 2. (a) A fragment of a tag tree; (b) the semantic structural tree of the corresponding fragment

Step 1: Tracing the original DOM structure, type analysis is performed bottom-up to assign an atomic type to each leaf node, filter out semantic-free nodes, and generate a composite type for each internal node. In this process, type recognition and pattern discovery work in parallel, and separator nodes are dynamically inserted or removed depending on indispensable visual cues.

Step 2: Tracing the tree structure produced by Step 1, a top-down refinement process is performed to adjust the relationship between adjacent nodes according to visual position cues.

Note that visual cues assist semantic cues in Step 1, while in Step 2 they act as the guide. The detailed techniques are described below.

3.2. Type Recognition

The HTML DOM tree is structured in presentation style but disordered in the semantic sense. For example, Figure 2(a) presents the HTML DOM structure generated from the corresponding fragment in Figure 1. Note that several leaf nodes are invisible (e.g., the nodes enclosed in dashed lines) and carry no semantic cues. Based on the first observation in Section 3.1, we define a refined typing system by classifying nodes into ten categories. Seven priorities are pre-defined as the rule for nodes that fit multiple categories, which ensures that each node belongs to exactly one category, as shown in Table 1 (a smaller number denotes a higher priority).

Table 1. The Refined Typing System

Priority   Type Categories
0          ROOT
1          FONT, [...]
2          LINK
3          [...], PATTERN, NOTSURE
4          SEPARATOR
5          STAG
6          PLAIN

In the bottom-up type analysis algorithm, each node is assigned a specific Type through type recognition.
Leaf nodes belonging to the PLAIN category are then filtered out, as they hardly provide any semantic information, e.g., blank tables and separators. Type recognition can be done effectively by following several heuristic rules. Some visual cues are taken into account, such as the minimum width (MinWidth) and the minimum height (MinHeight) of a semantic item in the HTML document. Given a node, let us denote its HTML tag, width and height as Tag, Width and Height, respectively. Seven rules are listed below by priority:
- Rule 1: If Tag = body, then Type = ROOT.
- Rule 2: If Height < MinHeight, then Type = PLAIN.
- Rule 3: If Tag = font or the Tag of one of its ancestors = font, then Type = FONT+[fontstyle]. Here fontstyle denotes the typeface and presentation style (e.g., the first leaf node in Figure 2(b)).
- Rule 4: If the node or one of its ancestors has internal text between its tag pairs, then Type = [...].
- Rule 5: If Width < MinWidth, then Type = PLAIN.
- Rule 6: If the node or one of its ancestors has link information and it is not the source URL, then Type = LINK.
- Rule 7: If Tag is probably visible (e.g., iframe, input, object), then Type = STAG.

Note that the nodes submitted to these rules include not only all leaf nodes in the HTML DOM tree, but also those internal nodes whose children have all been filtered out. The types not mentioned above appear in the next phase, as they are only useful for internal nodes with more than one child.

3.3. Pattern Discovery

Pattern discovery works together with type recognition during the type analysis process. It contributes a great deal to transforming the DOM tree into the semantic structural tree by generating new [...] and PATTERN nodes and marking existing ones as [...], PATTERN, NOTSURE or PLAIN (e.g., Figure 2(b)). Following [12], we adopt the basic idea of discovering sequential patterns in the type sequence of all child nodes under an internal node, which is especially useful for template-based Web pages. We also introduce several improvements.

First, the refined typing system separates the notions of Type and Type String, and Atomic Type and Composite Type are defined to describe primitive and compound types. Note that the type sequence is really a sequence of Type Strings. Each node is assigned a Type String using the function below:

$$\text{Type String} = \begin{cases} \text{HTML tag name}, & \text{if Type} \in \{\text{STAG}, \text{NOTSURE}\} \\ \text{Type name}, & \text{if Type is Atomic and Type} \neq \text{STAG} \\ \text{string sequence}, & \text{if Type is Composite} \end{cases}$$

Second, visual cues play an assisting role in the algorithm. SEPARATOR nodes are inserted between adjacent nodes A and B when they are visually apart from each other, or formally when both of the following conditions are satisfied:
- Condition 1: $B.right \le A.left$ or $B.left \ge A.right$.
- Condition 2: $B.bottom \le A.top$ or $B.top \ge A.bottom$.
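The separator test can be illustrated with a minimal sketch. It assumes that the two conditions read B.right ≤ A.left or B.left ≥ A.right, and B.bottom ≤ A.top or B.top ≥ A.bottom, in screen coordinates where y grows downward. The names `Box`, `visually_apart` and `insert_separators` are hypothetical helpers introduced only for this illustration and are not taken from the authors' implementation.

```python
from dataclasses import dataclass

SEPARATOR = "SEPARATOR"


@dataclass
class Box:
    """Rendered rectangle of a node, in screen coordinates (y grows downward)."""
    left: float
    top: float
    right: float
    bottom: float


def visually_apart(a: Box, b: Box) -> bool:
    """Return True when both separation conditions hold for adjacent nodes A and B."""
    cond1 = b.right <= a.left or b.left >= a.right    # Condition 1 (assumed operators)
    cond2 = b.bottom <= a.top or b.top >= a.bottom    # Condition 2 (assumed operators)
    return cond1 and cond2


def insert_separators(children):
    """children: list of (type_string, Box) pairs in DOM order.

    Returns the type-string sequence with a SEPARATOR entry between every
    pair of adjacent children that are visually apart.
    """
    sequence = []
    for index, (type_string, box) in enumerate(children):
        if index > 0 and visually_apart(children[index - 1][1], box):
            sequence.append(SEPARATOR)
        sequence.append(type_string)
    return sequence


row = [
    ("FONT", Box(200, 0, 300, 50)),    # last cell of the first row
    ("LINK", Box(0, 60, 100, 110)),    # wraps to the start of the next row
    ("LINK", Box(100, 60, 200, 110)),  # same row as the previous node
]
print(insert_separators(row))
```

Because both conditions must hold, a separator is produced only when adjacent nodes overlap neither horizontally nor vertically, as when the second node wraps to a new row in this example; two nodes stacked directly above each other satisfy Condition 2 but not Condition 1.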
Besides, SEPARATOR nodes are inserted at both ends of the children sequence of an internal PLAIN node when it is expanded during the pattern discovery process under its parent.

Third, the core notion in pattern discovery, namely Maximal Repeating Substrings, is replaced by Maximal Repeating Continuous Substrings, in which the type string of SEPARATOR acts as a real separator, so that the resulting string contains no SEPARATOR type string. Given a string $S$ and a support threshold value $\theta$, a substring $\alpha$ that repeats $k$ times in $S$ is a Maximal Repeating Continuous Substring if and only if:
(i) $k \ge 2$ and $|\alpha| \cdot k \ge \theta \cdot |S|$;
(ii) $\text{SEPARATOR} \notin \alpha$;
(iii) $|\alpha| \cdot k$ is the maximum;
(iv) $k$ is the maximum.

Additionally, we introduce the NOTSURE type to denote internal nodes without any obvious pattern. Such a node is assigned a temporary Type String during the pattern discovery process under its parent. As in the type recognition process, related heuristic rules are integrated into the algorithm to improve its performance, such as:
- Rule 8: If it is a leaf node, then Type is Atomic.
- Rule 9: If it is a [...] node and all its children have the same Atomic Type, then Type is Atomic.
- Rule 10: If it has only [...] and leaf children and they all have the same Atomic Type, then Type = [...].
- Rule 11: If it has only two children and they are not SEPARATOR nodes, then Type = PATTERN.
- Rule 12: [...]

Note that pattern discovery is only performed on nodes with multiple children, and PLAIN nodes marked during this process are not filtered out like their leaf peers. Meanwhile, SEPARATOR nodes may be dynamically removed when they are too dense, for example when the maximum number of non-SEPARATOR nodes between two adjacent SEPARATOR nodes is 1.

Importantly, the efficiency of pattern discovery is the bottleneck of type analysis, and it depends mostly on the efficiency of finding Maximal Repeating Continuous Substrings. The time complexity is $O(n^2)$ in the worst case, where $n$ denotes the length of the original string.
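The search for a Maximal Repeating Continuous Substring can be sketched as follows. This is a brute-force illustration rather than the authors' algorithm, and it does not achieve the quadratic bound stated above. It assumes the four conditions read: (i) k ≥ 2 and |α|·k ≥ θ·|S|, (ii) α contains no SEPARATOR, (iii) |α|·k is maximal, (iv) k is maximal among the remaining candidates. It further assumes that repetitions are counted as non-overlapping occurrences scanned left to right and that |S| includes SEPARATOR entries. The function names are hypothetical.

```python
SEPARATOR = "SEPARATOR"


def count_non_overlapping(seq, pattern):
    """Count left-to-right, non-overlapping occurrences of pattern in seq."""
    count, i, m = 0, 0, len(pattern)
    while i + m <= len(seq):
        if tuple(seq[i:i + m]) == pattern:
            count += 1
            i += m
        else:
            i += 1
    return count


def maximal_repeating_continuous_substring(seq, theta=0.0):
    """Return (pattern, k) for the best candidate, or None if nothing repeats.

    Candidates are ranked by coverage len(pattern) * k first (condition iii)
    and by the repeat count k second (condition iv).
    """
    n = len(seq)
    best_key, best_pattern = None, None
    seen = set()
    for start in range(n):
        for end in range(start + 1, n + 1):
            if seq[end - 1] == SEPARATOR:
                break  # condition (ii): a candidate never spans a separator
            pattern = tuple(seq[start:end])
            if pattern in seen:
                continue
            seen.add(pattern)
            k = count_non_overlapping(seq, pattern)
            coverage = len(pattern) * k
            if k < 2 or coverage < theta * n:
                continue  # condition (i)
            key = (coverage, k)
            if best_key is None or key > best_key:
                best_key, best_pattern = key, pattern
    if best_pattern is None:
        return None
    return best_pattern, best_key[1]


types = ["FONT", "LINK", "FONT", "LINK", SEPARATOR, "IMG", "FONT", "LINK"]
print(maximal_repeating_continuous_substring(types))
```

In this example the pair (FONT, LINK) occurs three times and covers six of the eight entries, so it is preferred over the single-item candidates FONT and LINK, each of which covers only three.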
Compared with the tag path string used in [12], the Type String is now much shorter. We go one step further and assign each Type String a unique integer, so that $n$ denotes the number of children. Thanks to the filtering process in type recognition, the algorithm can potentially be sped up considerably.

3.4. Visual Refinement

At this point we have a rough semantic structural tree in which each node denotes a semantic block. However, further refinement is needed to make sure that its structure is in accordance with the actual presentation style. For example, a node may be completely covered by its neighbor. This may be caused by the dynamic removal of SEPARATOR nodes in the previous steps. A top-down algorithm is performed to find such visual faults and adjust the relationship among the related nodes. Note that sometimes no refinement takes place, as is the case for the tree fragment in Figure 2(b).
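One possible form of such a top-down pass is sketched below. The text only states that a node may be completely covered by its neighbor and that the relationship among related nodes is then adjusted; the specific adjustment used here, re-parenting the covered node under the covering sibling, is an assumption made for illustration, and `Block`, `covers` and `refine` are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class Block:
    """A node of the semantic structural tree with its rendered rectangle."""
    name: str
    left: float
    top: float
    right: float
    bottom: float
    children: list = field(default_factory=list)


def covers(outer: Block, inner: Block) -> bool:
    """True when inner's rectangle lies completely inside outer's rectangle."""
    return (outer.left <= inner.left and outer.top <= inner.top
            and outer.right >= inner.right and outer.bottom >= inner.bottom)


def refine(node: Block) -> None:
    """Top-down pass: move each child that is covered by an adjacent sibling
    under that sibling (assumed adjustment), then recurse into the result."""
    kids = node.children
    absorbed = set()
    for i, child in enumerate(kids):
        for j in (i - 1, i + 1):  # previous and next neighbor
            if 0 <= j < len(kids) and j not in absorbed and covers(kids[j], child):
                kids[j].children.append(child)
                absorbed.add(i)
                break
    node.children = [c for i, c in enumerate(kids) if i not in absorbed]
    for child in node.children:
        refine(child)


root = Block("root", 0, 0, 100, 200, [
    Block("a", 0, 0, 100, 100),
    Block("b", 10, 10, 50, 50),    # lies entirely inside "a"
    Block("c", 0, 120, 100, 200),
])
refine(root)
print([c.name for c in root.children], [c.name for c in root.children[0].children])
```

Only immediate neighbors are compared, which matches the wording "covered by its neighbor"; comparing all sibling pairs would be a different, more expensive design.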

4. Experimental Results

We implemented the algorithm in both C# and C++. The support threshold value $\theta$, which limits the relative minimum length of Maximal Repeating Continuous Substrings (the same as the $\theta$ used in [12]), is set to 0. The visual threshold values MinWidth and MinHeight are both set to 13 pixels, in accordance with the minimum font size on most Web pages. We use 4 metrics, namely:
- NT: Number of nodes in an HTML DOM tree.
- NS: Number of nodes in a semantic structural tree.
- NF: Number of nodes filtered out.
- Recall: The number of semantic blocks recognized by the algorithm divided by the number of standard blocks marked manually.

The system was tested on 24 HTML documents from different Web sites, including template-generated pages such as well-known news portals and e-commerce home pages. We obtained the standard blocks by asking 5 volunteers to manually segment each page into blocks according to their own judgment. The corresponding semantic structural trees were then generated automatically by the system. We also ran VIPS on these pages and computed Recall on each page for both methods. The statistics are collected in Table 2 (N denotes the number of blocks). NT, NS and NF are related as follows:

$$NS = NT - NF - NU + NN$$

NU denotes the number of nodes having only one child, while NN denotes the number of new internal nodes created by the pattern discovery process. The differences among NT, NS and NF show that a large number of semantic-free items are eliminated and that the DOM structure changes considerably during type analysis. The filtering is worth doing, as it makes the whole algorithm more efficient and simplifies the following phases.

We use Recall to evaluate the performance of both methods. Figure 3 shows that our algorithm always reaches a higher level than VIPS as the number of blocks increases.
This is essentially because repeated patterns seldom exist under the root node of a page, so our algorithm tends to break down first-level blocks such as those presented as page headlines. We observe that our algorithm achieves comprehensive completeness, with all small blocks generated, while VIPS often fails to generate sub-blocks for small blocks and sometimes generates only the root block for a page, e.g., for pages that use images as the background. Thus our algorithm proves to be more flexible. In addition, our algorithm also works well when VIPS fails by grouping together sub-blocks with little semantic relation. There are cases in which visual cues are not precise enough, e.g., the distance between a subtitle and its related sub-content may be larger than the distance between the same subtitle and the previous sub-content. Visual cues are evidently sometimes misleading, so it is better to take both non-visual and visual cues into account, as our algorithm does. Note that the standard block sets are constructed from human views, possibly with some bias, thus our technique outperforms VIPS with more flexibility.

Table 2. Experimental results with comparison of Semantic Segmentation (SS) and VIPS

Figure 3. Comparison between SS and VIPS

5. Discussions

We propose a new approach that automatically parses HTML documents into a semantic structural tree through semantic page segmentation using type analysis. Although it borrows the idea of pattern discovery, it is more generally useful and potentially less time consuming than the related information extraction technique in [12]. Besides, our algorithm is more flexible and more accurate than VIPS in both the semantic and the visual sense, while VIPS itself has been shown to perform better than other page segmentation methods, as discussed in [3].

However, more adjustment is needed during visual refinement. Besides, the efficiency of our prototype system has not been tested, and we believe that further optimization of the core algorithm is needed to achieve real-time processing. We observe that blocks with similar semantics often share similar subtree structures in our semantic structural trees, whether or not they are extracted from different HTML documents. In the future we would like to exploit the essential semantic features within and between blocks and move into the active area of Web service personalization on small-screen devices.

Acknowledgement

Supported by the Program for New Century Excellent Talents in University, NCET, and the Specialized Research Fund for the Doctoral Program of Higher Education, No

References

[1] Z. Bar-Yossef and S. Rajagopalan, Template Detection via Data Mining and Its Applications, Proceedings of the 11th International Conference on World Wide Web, 2002, pp
[2] D. Buttler, L. Liu and C. Pu, A Fully Automated Object Extraction System for the World Wide Web, Proceedings of the 21st International Conference on Distributed Computing Systems, 2001, pp
[3] D. Cai, S. Yu, J.R. Wen and W.Y. Ma, VIPS: A VIsion based Page Segmentation Algorithm, Microsoft Technical Report, MSR-TR
[4] S. Chakrabarti, Integrating the Document Object Model with Hyperlinks for Enhanced Topic Distillation and Information Extraction, Proceedings of the 10th International Conference on World Wide Web, 2001, pp
[5] S. Chakrabarti, M. Joshi and V. Tawde, Enhanced Topic Distillation using Text, Markup Tags, and Hyperlinks, Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2001, pp
[6] J.L. Chen, B.Y. Zhou, J. Shi, H.J. Zhang and Q.F. Wu, Function-based Object Model towards Website Adaptation, Proceedings of the 10th International Conference on World Wide Web, 2001, pp
[7] Y. Chen, W.Y. Ma and H.J. Zhang, Detecting Web Page Structure for Adaptive Viewing on Small Form Factor Devices, Proceedings of the 12th International Conference on World Wide Web, 2003, pp
[8] S.T. Chen, Y.L. Diao, H.J. Lu and Z.P. Tian, FACT: A Learning based Web Query Processing System, Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, 2000, pp
[9] D.W. Embley, Y. Jiang and Y.K. Ng, Record-Boundary Discovery in Web Documents, Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data, 1999, pp
[10] B. Liu, R. Grossman and Y.H. Zhai, Mining Data Records in Web Pages, Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2003, pp
[11] S.H. Lin and J.M. Ho, Discovering Informative Content Blocks from Web Documents, Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2002, pp
[12] S. Mukherjee, G. Yang and I.V. Ramakrishnan, Automatic Annotation of Content-rich HTML Documents: Structural and Semantic Analysis, Proceedings of the 2nd International Semantic Web Conference, 2003, pp
[13] S. Mukherjee, I.V. Ramakrishnan and A. Singh, Bootstrapping Semantic Annotation for Content-Rich HTML Documents, Proceedings of the 21st International Conference on Data Engineering, 2005, pp
[14] S. Mukherjee and I.V. Ramakrishnan, Browsing Fatigue in Handhelds: Semantic Bookmarking Spells Relief, Proceedings of the 14th International Conference on World Wide Web, 2005, pp
