Web page classification using visual layout analysis


Miloš Kovačević (1), Michelangelo Diligenti (2), Marco Gori (2), Marco Maggini (2), Veljko Milutinović (3)

(1) School of Civil Engineering, University of Belgrade, Serbia, milos@grf.bg.ac.yu
(2) Dipartimento di Ingegneria dell'Informazione, University of Siena, Italy, {diligmic, maggini, marco}@dii.unisi.it
(3) School of Electrical Engineering, University of Belgrade, Serbia, vm@etf.bg.ac.yu

Abstract

Automatic processing of Web documents is an important issue in the design of search engines, Web mining tools, and applications for Web information extraction. Simple text-based approaches are typically used, in which most of the information provided by the visual layout of the page is discarded. Only a few visual features, such as the font face and size, are effectively used to weigh the importance of the words in the page. In this paper, we propose a hierarchical representation which includes the visual screen coordinates of every HTML object in the page. The use of the visual layout allows us to identify common page components such as the header, the navigation bars, the left and right menus, the footer, and the informative parts of the page. The recognition of the functional role of each object is performed by a set of heuristic rules. The experimental results show that page areas are correctly classified in 73% of the cases. The identification of different functional areas on the page allows the definition of a more accurate method for representing the page text contents, which splits the text features into different subsets according to the area they belong to. We show that this approach can improve the classification accuracy for page topic categorization by more than 10% with respect to a flat bag-of-words representation.

1. Introduction

The target of Web page design is to organize the information to make it attractive and usable by humans. Web authors organize the page contents in order to facilitate the users' navigation and to give a nice and readable look to the page. The complexity of the Web page structure has increased in the last few years with the diffusion of sophisticated Web authoring tools, which make it easy to use different styles, to position text areas anywhere in the page, and to automatically include menus or navigation bars. Even if the new trends in Web page design allow authors to enrich the information contents of published documents and to improve the presentation and usability for human users, the automatic processing of Web documents is becoming increasingly difficult. In fact, most of the features which are attractive for humans are just noise for machine learning and information retrieval techniques when they rely only on a flat text-based representation of the page. Since the page layout is used by Web authors to carry important information, visual analysis can introduce many benefits in the design of systems for the automatic processing of Web pages. The visual information associated with an HTML page consists essentially of the spatial position of each HTML object in a browser window.

Visual information can be useful in many tasks. Consider the problem of feature selection for Web page classification. There are several methods, based on word-frequency analysis, to perform feature selection, for example using Information Gain [1] or TF-IDF (term frequency-inverse document frequency) [2]. In both cases we try to estimate which are the most relevant words that describe a document D, i.e. the best vector representation of D to be used in the classification process. It is evident that some words may represent noise with respect to the real page contents if they belong to a menu, a navigation bar, a banner link, or a page footer. Also, we can suppose that the words in the central part of the page (screen) carry more information than the words in the bottom right corner. Hence, there should be a way to weight words from different layout contexts differently. Moreover, visual analysis can allow us to identify areas related to different page contents.
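For reference, a common formulation of the TF-IDF weight used for such frequency-based selection (the exact variant adopted in [2] may differ) is

w(t, D) = tf(t, D) \cdot \log \frac{N}{df(t)}

where tf(t, D) is the number of occurrences of term t in document D, N is the number of documents in the collection, and df(t) is the number of documents containing t; the terms with the highest weights are retained as features.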

The analysis of the page layout can also be the basis for designing efficient crawling strategies in focused search engines. Given a specific topic T and a set of seed pages S, a focused crawler visits the Web graph starting from the pages in S, choosing the links to follow in order to retrieve only on-topic pages. The policies for focused crawlers require estimating whether an outgoing link is promising or not [3,4,5]. Usually link selection is based on the analysis of the anchor text. However, the position of the link can improve the selection accuracy: links that belong to menus or navigation bars can be less important than links in the center of the page; links that are surrounded by more text are probably more relevant to the topic than links positioned in groups; in any case, groups of links can identify hub areas on the page.

Finally, many tricks have been used recently for cheating search engines in order to gain visibility on the Web. A widely used technique consists in inserting irrelevant keywords into the HTML source. While it is relatively easy to detect and reject false keywords whose foreground color is the same as the background color, there is no way to detect keywords of regular color that are covered by images. Moreover, topology-based rankings, such as the PageRank used in the Google search engine [6], and indexing techniques based on the keywords contained in the anchor text of the links to a given page can suffer from link spamming. The target page can gain positions in the ranking list by constructing a promoting Web substructure which links the target page with fake on-topic anchors and an ad hoc topology.

Visual page analysis is also motivated by the common design patterns of Web pages which result from usability studies. Users expect to find certain objects in predefined areas of the browser screen [7]. Figure 1 shows the areas where users expect to find internal and external links. These simple observations allowed us to define a set of heuristics for the recognition of specific page areas (menus, header, footer and center of a page).

Figure 1: User expectations (in percent) concerning the positions of internal (left) and external (right) links in the browser window [6].

It is clear that menus are expected to lie either inside the left or the right margin of a page.

We used the visual layout analysis and the classification of the page areas into functional components to build a modular classifier for page categorization. The visual partition of the page into components produces subsets of features which are processed by different Naïve Bayes classifiers. The outputs of the classifiers are combined to obtain the class score for the page. We used a feed-forward two-layer neural net to estimate the optimal weights of the mixture. The resulting classification system shows a significantly better accuracy than a single Naïve Bayes classifier processing the page as a flat bag of words.

The paper is organized as follows. Section 2 defines the M-Tree representation of an HTML document used to render the page on the virtual screen, i.e. to obtain the screen coordinates of every HTML object. Section 3 describes the heuristics for the recognition of some functional components (header, footer, left and right menu, and center of the page). In Section 4, the experimental results for component recognition on a predefined dataset are reported. In Section 5, we present the page classification system based on visual analysis. In Section 6, the experimental results for the classification of HTML pages in a predefined dataset are shown. Finally, the conclusions are drawn in Section 7.

2. Extracting visual information from an HTML source

We introduce a virtual screen (VS) that defines a coordinate system for specifying the positions of HTML objects inside pages. The VS is a rectangle with a predefined width measured in pixels. The VS width is set to 1000 pixels, which corresponds to the page display area in a maximized browser window on a standard monitor with a resolution of 1024x768 pixels. Of course, any other resolution could be chosen, but this one is compatible with most of the common Web page patterns. The actual VS height depends on the page at hand. The top left corner of the VS is the origin of the VS coordinate system.

Figure 2: Constructing the M-Tree (main steps): the page parser produces the (TE,DE) sequence, the tree builder constructs the m-tree, and the rendering module produces the final M-Tree.

Figure 2 shows the steps of the visual analysis algorithm. In the first step the page is parsed using an HTML parser that extracts two different types of elements: tags and data. Tag elements (TE) are delimited by the characters < and >, while data elements (DE) are contained between two consecutive tags. Each TE is labeled by the name of the corresponding tag and a list of attribute-value pairs. A DE is represented as a list of text tokens (words), which are extracted using the space characters as separators. A DE contains the empty list if no token is placed between the two consecutive tags. The parser skips the <SCRIPT> and comment (<!-- -->) tags.
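The paper does not provide parser code; a minimal sketch of this (TE,DE) extraction step, using Python's standard html.parser module and whitespace tokenization as described above, could look like the following:

```python
from html.parser import HTMLParser

class TEDEParser(HTMLParser):
    """Extracts a flat sequence of (TE, DE) pairs from an HTML source.

    TE: (tag name, list of (attribute, value) pairs).
    DE: list of whitespace-separated tokens found after that tag.
    Script contents are skipped; comments never reach handle_data.
    """
    def __init__(self):
        super().__init__()
        self.sequence = []        # list of [TE, DE] pairs
        self.in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True
        self.sequence.append([(tag, attrs), []])      # new TE with an empty DE

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False
        self.sequence.append([(tag + "/", []), []])   # closing tags also delimit data

    def handle_data(self, data):
        if self.in_script or not self.sequence:
            return
        self.sequence[-1][1].extend(data.split())     # tokens between consecutive tags

parser = TEDEParser()
parser.feed("<p>Hello <b>visual</b> world</p>")
print(parser.sequence)
```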

In the second step, the (TE,DE) sequence extracted from the HTML code is processed by a tree builder. The output of this module is a structure we named m-tree. Many different algorithms to construct the parsing tree of an HTML page are described in the literature [8,9]. We adopted a single-pass technique based on a set of rules which allow us to properly nest TEs into the hierarchy according to the HTML 4.01 specification [10]. Additional effort was made to make the tree builder robust to malformed HTML sources.

Definition 1: an m-tree (mt) is a directed n-ary tree defined by a set of nodes N and a set of edges E with the following characteristics:

1. N = N_desc ∪ N_cont ∪ N_data, where:
- N_desc (description nodes) is the set of nodes which correspond to the TEs of the HTML tags {<TITLE>, <META>};
- N_cont (container nodes) is the set of nodes which correspond to the TEs of the HTML tags {<TABLE>, <CAPTION>, <TH>, <TD>, <TR>, <P>, <CENTER>, <DIV>, <BLOCKQUOTE>, <ADDRESS>, <PRE>, <H1>, <H2>, <H3>, <H4>, <H5>, <H6>, <OL>, <UL>, <LI>, <MENU>, <DIR>, <DL>, <DT>, <DD>, <A>, <IMG>, <BR>, <HR>};
- N_data (data nodes) is the set of nodes which correspond to the DEs.

Each node n ∈ N has the following attributes: name, which is the name of the corresponding tag, except for the nodes in N_data where name = TEXT; and attval, which is a list of attribute-value pairs extracted from the corresponding tag and which can be empty (e.g. for the nodes in N_data). Additionally, each node in N_data has four more attributes: value, which contains the tokens from the corresponding DE; fsize, which describes the font size of the tokens; emph, which defines the text appearance as specified by the tags {<B>, <I>, <U>, <STRONG>, <EM>, <SMALL>, <BIG>}; and align, which describes the alignment of the text (left, right or centered). In the following, we denote a node n which corresponds to tag X as n_<X>.

2. The root of mt, n_ROOT ∈ N_cont, represents the whole page; its name is set to ROOT, while attval contains only the pair (URL, url of the page).

3. The set of edges E = {(n_x, n_y) | n_x, n_y ∈ N} contains:
a) (n_ROOT, n_desc), for every n_desc ∈ N_desc;
b) (n_cont1, n_cont2), with n_cont1 ∈ N_cont \ {n_<IMG>} and n_cont2 ∈ N_cont \ {n_ROOT}, iff n_cont2 belongs to the context of n_cont1 according to the nesting rules of the HTML 4.01 specification;
c) (n_cont, n_data), with n_cont ∈ N_cont \ {n_<IMG>} and n_data ∈ N_data, iff the node n_data belongs to the context of the node n_cont.

From the definition it follows that image and text nodes can only be leaves in an mt. Figure 3 shows an example of a simple page and its corresponding mt.

Figure 3: An HTML source (right) and its mt (left).

The coordinates of every object of interest are computed by the rendering module using the mt. We followed some recommendations from W3C [10] and the visual behavior of one of the most popular browsers (Microsoft Internet Explorer) to design the renderer. In order to simplify and to speed up the rendering process, we made some simplifications which do not significantly influence the page representation for our specific task. The simplifications are the following:

1) The rendering module (RM) calculates coordinates only for the nodes in mt, i.e. some HTML tags are not considered.
2) The RM does not support layered HTML documents.
3) The RM does not support frames.
4) The RM does not support style sheets.

The rendering module produces the final representation of a page as an M-Tree (MT), which extends the concept of mt by incorporating the VS coordinates for each node n ∈ N \ N_desc.

Definition 2: An MT is the extension of an mt in which every node n ∈ N \ N_desc has two additional attributes, X and Y. These are arrays which contain the x and y coordinates of the corresponding object polygon on the VS, with the following characteristics:

1. If n ∈ N_cont \ {n_<A>}, then it is assumed that the corresponding object occupies a rectangular area on the VS. Thus X and Y have dimension 4. The margins of the rectangle are:
- the bottom margin is equal to the top margin of the left sibling node if it exists. If n does not have a left sibling or n = n_<TD>, then the bottom margin is equal to the bottom margin of its parent node. If n = n_ROOT, then the bottom margin is the x-axis of the VS coordinate system.
- the top margin is equal to the top margin of the rightmost leaf node of the sub-tree having n as the root node.
- the left margin is equal to the left margin of the parent node of n, shifted to the right by a correction factor. This factor depends on the name of the node (e.g. if name = LI this factor is set to 5 times the current font width because of the indentation of list items). If n = n_<TD> and n has a left sibling, then the left margin is equal to the right margin of this left sibling. If n = n_ROOT, then the left margin is the y-axis of the VS coordinate system.
- the right margin is equal to the right margin of the parent of node n. If n = n_<TABLE> or n = n_<TD>, then the right margin is set to correspond to the table/cell width.

2. If n ∈ N_data or n = n_<A>, then X and Y can have dimension from 4 to 8, depending on the area of the VS occupied by the corresponding text/link (see Figure 4). Coordinates are calculated considering the number of characters contained in the value attribute and the current font width. Text flows up to the right margin of the parent node and then a new line is started. The height of the line is determined by the current font height.

Figure 4: Some examples of TEXT polygons.

The definition of the M-Tree covers most of the aspects of the rendering process, but not all of them, because of the complexity of the process. For example, if the page contains tables, then the RM implements a modified auto-layout algorithm [11] for calculating table/column/cell widths. When an n_<TABLE> node is encountered, the RM goes down the mt to calculate the cell/column/table widths in order to determine the table width. If there are other n_<TABLE> nodes down on the path (nesting of tables), the width computation is performed recursively. Before resolving a table, artificial cells (nodes) are inserted in order to simplify the cases where cell spanning is present (colspan and rowspan attributes in the tag <TD>).
3. Defining heuristics for the recognition of the page components

Given the MT of a page and assuming the common Web page design patterns, it is possible to define a set of heuristics for the recognition of standard components of a page, such as menus or footers. We considered five different types of components: header (H), footer (F), left menu (LM), right menu (RM), and center of the page (C). We focused our attention on these particular components because they are frequently found in Web pages regardless of the page topic. We adopted intuitive definitions for each class, which rely exclusively on the VS coordinates of logical groups of objects in the page. After a careful examination of many different Web pages, we restricted the areas in which components of type H, F, LM, RM, and C can be found. We introduced a specific partition of a page into locations, as shown in Figure 5.

Figure 5: Partition of the page into locations (the bands H, F, LM, RM and the central area C, delimited by the heights H_1 and H_2 and the widths W_1 and W_2).

We set W_1 = W_2 to be 30% of the page width in pixels, where the page width is determined by the rightmost margin among the nodes in the MT. W_1 (W_2) defines the location LM (RM) where the LM (RM) components can exclusively be found. We set H_1 = 200 pixels and H_2 = 150 pixels. H_1 and H_2 define the locations H and F respectively, where the components H and F can exclusively be found. These values were found by a statistical analysis on a sample of Web pages. Components are recognized using the following heuristics:

Heuristic 1: H consists of the nodes in each sub-tree S of the MT whose root r_S lies in H and satisfies one or more of the following conditions:
1. r_S is of type n_<TABLE> and the corresponding table lies completely in H (i.e. the upper margin of the table is less than or equal to H_1).
2. the upper margin of the object associated to r_S is less than or equal to the maximum upper bound of all the n_<TABLE> nodes which satisfy condition 1, and r_S is not in a sub-tree satisfying condition 1.

Heuristic 2: LM consists of the nodes in each sub-tree S of the MT whose root r_S lies in LM, is not contained in H, and satisfies one or more of the following conditions:
1. r_S is of type n_<TABLE> and the table lies completely in LM (i.e. the right bound of the table is less than or equal to W_1).
2. r_S is of type n_<TD>, it lies completely in LM, and the node n_<TABLE> to which it belongs has a lower bound less than or equal to H_1 and an upper bound greater than or equal to H_2.

Heuristic 3: RM consists of the nodes in each sub-tree S of the MT whose root r_S lies in RM, is not contained in H or LM, and satisfies one or more of the two conditions obtained from the conditions of Heuristic 2 by substituting LM with RM and W_1 with W_2.

Heuristic 4: F consists of the nodes in each sub-tree S of the MT whose root r_S lies in F, is not contained in H, LM, or RM, and satisfies one or more of the following conditions:
1. r_S is of type n_<TABLE> and the table lies completely in F (i.e. the bottom margin of the table is greater than or equal to H_2).
2. the lower bound of r_S is greater than or equal to the maximum lower bound of all the n_<TABLE> nodes which satisfy condition 1, and r_S does not belong to any sub-tree satisfying condition 1.
3. the lower bound of r_S is greater than or equal to the upper bound of the lowest of all nodes n completely contained in F, where n ∈ {n_<BR>, n_<HR>} or n is in the scope of a centered text alignment, and r_S does not belong to any sub-tree satisfying one of the two previous conditions.

Heuristic 5: C consists of all nodes in the MT that are not in H, LM, RM, or F.

These heuristics are strongly dependent on table objects. In fact, tables are commonly used (in about 88% of the cases) to organize the layout of the page and the alignment of other objects. Thus pages usually include a lot of tables, and every table cell often represents a small amount of logically grouped information. Tables are often used to group menu objects, footers, search and input forms, and text areas.
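As an illustration, a simplified sketch of Heuristic 1 (header detection) is given below; the Node structure is hypothetical and only geometric attributes are modeled, since the paper specifies the heuristics declaratively rather than as code:

```python
from dataclasses import dataclass
from typing import List

H1 = 200  # header band height in pixels, as in Section 3

@dataclass
class Node:
    """A rendered M-Tree sub-tree root (hypothetical, reduced to what Heuristic 1 needs)."""
    name: str    # e.g. "TABLE", "TD", "TEXT"
    top: int     # upper margin (y coordinate) on the virtual screen
    bottom: int  # lower margin (y coordinate) on the virtual screen

def header_roots(candidates: List[Node]) -> List[Node]:
    """Return the sub-tree roots classified as header by a simplified Heuristic 1."""
    # Condition 1: tables whose upper margin falls within the header band [0, H1].
    tables = [n for n in candidates if n.name == "TABLE" and n.top <= H1]
    header = list(tables)
    # Condition 2: other roots whose upper margin does not exceed the largest
    # upper margin among the tables selected by condition 1.
    if tables:
        limit = max(t.top for t in tables)
        header += [n for n in candidates if n not in tables and n.top <= limit]
    return header
```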

4. Experimental results for component recognition

We constructed a complete dataset by downloading nearly 1000 files from the first level of each root category of the open source directory. In order to test the performance of the component recognition algorithm, we extracted a subset D of the complete dataset by randomly choosing 515 files, uniformly distributed among the categories and the file sizes. Two experts collaboratively labeled the areas in the pages that could be considered of type H, F, LM, RM, and C. The areas of the pages in D were automatically labeled by using the Siena Tree (1) tool, which includes an MT builder and the logic for applying the component recognition heuristics. In these experiments, we selected the values for the margins H_1, H_2, W_1, and W_2 according to the statistics from [6]. The performance of the region classifier was evaluated by comparing the labels assigned by the experts and the labels produced automatically by the recognition algorithm. The results are shown in Table 1.

Table 1: Recognition rate (in %) for the header, footer, left menu, right menu, and overall, at the four accuracy levels (not recognized, bad, good, excellent). Shaded rows represent successful recognition.

(1) Siena Tree is written in Java 1.3 and can be used to visualize objects of interest from a Web page. One can enter any sequence of HTML tags to obtain a picture (visualization) of their positions. To obtain a demo version contact milos@grf.bg.ac.yu.

The possible outcomes of the recognition process were summarized in 4 different categories. A component X is not recognized if the component X exists but is not detected, or if X does not exist but some part of the page is labeled as X. If less than 50% of the objects in X are labeled, or if some objects outside X are labeled too, then the recognition is considered bad. A good recognition is obtained if more than 50% but less than 90% of the objects in X are labeled and no objects outside X are labeled. Finally, the recognition is considered excellent if more than 90% of the objects from X are labeled and no objects outside X are labeled. The recognition of components of type C is the complement of the recognition of the other areas according to Heuristic 5, so we did not include it in the performance measurements.

The overall column is obtained by introducing a total score S for a given page as the sum of the scores assigned for the recognition of all areas of interest. If X ∈ {H, F, LM, RM} is not recognized, then the corresponding score is 0. The categories bad, good, and excellent are mapped to the score values 1, 2, and 3 respectively. Hence, if S = 12 we considered the overall recognition for a particular page as excellent. Similarly, good refers to the case 8 ≤ S < 12, bad stands for the case 4 ≤ S < 8, and not recognized represents the cases in which S < 4.

By analyzing the pages for which the overall recognition was bad or not recognized, we found that in nearly 20% of the cases the MT was not completely correct because of the approximations used in the rendering process. A typical error is that the single subparts of the page are internally well rendered, but they are scrambled as a whole. For the remaining 80% of the cases, the bad performance was due to the heuristic rules used in the classifier. We are currently investigating the possibility of writing more accurate rules.
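As a small illustration of the scoring scheme just described, the overall category can be computed from the four per-component outcomes as follows (a sketch, not the authors' code):

```python
SCORES = {"not recognized": 0, "bad": 1, "good": 2, "excellent": 3}

def overall_category(outcomes):
    """Combine the outcomes for H, F, LM and RM into the overall recognition category."""
    s = sum(SCORES[o] for o in outcomes)   # total score S over the four components
    if s == 12:
        return "excellent"
    if 8 <= s < 12:
        return "good"
    if 4 <= s < 8:
        return "bad"
    return "not recognized"

print(overall_category(["excellent", "good", "excellent", "bad"]))  # -> "good" (S = 9)
```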

5. Page classification using visual information

The rendering module provides an enhanced document representation which can be used in all the tasks (e.g. page ranking, crawling, document clustering and classification) where it is important to preserve the complex structure of a Web page, which is not captured by the traditional bag-of-words representation. In this paper, we have performed some document classification experiments using an MT representation for each page. In this section, we first describe the employed feature selection algorithm. Then, we describe the architecture of a novel classification system dealing with visual information. In particular, the proposed system is composed of a pool of Naïve Bayes classifiers whose outputs are combined by a neural network (NN).

5.1. Feature selection process

Let D be a set of pages and d ∈ D be a generic document in the collection. We assume that each document in D belongs to one of n mutually exclusive categories. We indicate with c_i the set containing all documents belonging to the i-th category. In order to use any classification technique, d must be represented as a feature vector relative to a D-specific vocabulary V. A common approach is to construct a feature vector v representing d that has |V| coordinates; the i-th entry of v is equal to 1 iff the i-th feature from V is in d, and equal to 0 otherwise. In order to reduce the dimensionality of the feature space, and thus to improve classification accuracy, numerous techniques have been developed [13].

Our feature selection methodology starts by dividing each page into 6 parts: header, footer, left menu, right menu, central part, and a set of features from meta tags such as the <TITLE> tag. When this procedure is applied to a dataset of documents, 6 datasets are obtained, where each one contains only a portion of the initial documents. For each dataset, we construct a vocabulary of features and we compute the Information Gain of each feature. In the reduced document representation we want to maintain the same percentage of features belonging to a specific region. For example, suppose that a page contains 100 different features and that we want to reduce the page to contain only 50 features. If the header, footer, left and right menus, central part and title respectively contain 20, 6, 20, 4 and 40 features, then we pick the 20·50/100 = 10 most informative features (the features providing the highest Information Gain) from the header, 3 from the footer, and so on.

5.2. Classification system architecture

We decided to use the Naïve Bayes (NB) classification technique for classifying each element of the page. The Naïve Bayes classifier [11] is the simplest instance of a probabilistic classifier. The output p(c|d) of a probabilistic classifier is the probability that a pattern d belongs to a class c after observing the data d (posterior probability). It assumes that text data come from a set of parametric models (each single model is associated with a class). Training data are used to estimate the unknown model parameters. During the operative phase, the classifier computes for each model the probability p(d|c), expressing the probability that the document is generated by the model. The Bayes theorem allows the inversion of the generative model and the computation of the posterior probabilities (the probability that a model generated the pattern). The final classification is performed by selecting the model yielding the maximum posterior probability. In spite of its simplicity, a Naïve Bayes classifier is almost as accurate as state-of-the-art learning algorithms for text categorization tasks [12]. The Naïve Bayes classifier is also the most used classifier in many different Web applications such as focused crawling, recommender systems, etc. For all these reasons, we selected this classifier to measure the accuracy improvement provided by taking into account visual information.
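Before moving to the classifier details, a minimal sketch of the proportional selection described in Section 5.1, assuming an Information Gain score is already available for every feature of every region, could be:

```python
def select_features(region_features, target_size):
    """Keep a number of features per region proportional to the region's share of the page.

    region_features: dict mapping region name -> list of (feature, information_gain) pairs.
    target_size: total number of features to keep for the whole page.
    """
    total = sum(len(feats) for feats in region_features.values())
    selected = {}
    for region, feats in region_features.items():
        # Quota proportional to the region's share of the page vocabulary.
        quota = round(len(feats) * target_size / total)
        ranked = sorted(feats, key=lambda fw: fw[1], reverse=True)
        selected[region] = [f for f, _ in ranked[:quota]]
    return selected

page = {"header": [("home", 0.1), ("news", 0.4)], "center": [("bayes", 0.9), ("crawler", 0.7)]}
print(select_features(page, 2))   # one feature kept from each region
```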

Figure 6: Page representation used as input for the 6 Naïve Bayes classifiers (header, footer, left menu, right menu, center, and title plus meta tags).

A Naïve Bayes classifier computes the conditional probability P(c_j|d) that a given d belongs to c_j. According to the Bayes theorem such probability is:

P(c_j | d) = \frac{P(c_j) \, P(d | c_j)}{P(d)}    (3)

The Naïve Bayes assumption states that the features are independent of each other given a category c_j:

P(c_j | d) \propto P(c_j) \prod_{i=1}^{|v|} P(w_i | c_j)    (4)

where w_i ∈ V is the i-th feature of v. In the learning phase we estimate P(c_j) and P(w_k|c_j) from the set of training examples. The estimates are given by the following equations:

P(c_j) = \frac{1 + \sum_{i=1}^{|D|} y_{ij}}{n + |D|}    (5)

P(w_k | c_j) = \frac{1 + \sum_{i=1}^{|D|} y_{ij} \, N(w_k, d_i)}{|V| + \sum_{s=1}^{|V|} \sum_{i=1}^{|D|} y_{ij} \, N(w_s, d_i)}    (6)

where y_{ij} = 1 iff d_i ∈ c_j, else y_{ij} = 0, and N(w, d_i) is the frequency of the feature w in d_i. In (5) and (6) we used Laplace smoothing to prevent zero probabilities for infrequently occurring features.

After splitting a page into 6 constituents, each constituent is classified by an NB classifier. Therefore, 6 distinct class memberships are obtained for each class, yielding a vector of 6·n probability components for each processed page (n is the number of distinct classes in the dataset). In a first prototype, we decided to calculate a linear combination of the probability estimates for each class and then to assign the final class label c* to the class with the maximum value of that combination:

c^{*} = \arg\max_{j} \sum_{i=1}^{6} k_i \, P(c_j | d_i)    (7)

The probabilities P(c_j | d_i) are the outputs of the i-th classifier, and d_i denotes the i-th constituent of page d. In our experiments, i = 1, 2, 3, 4, 5, 6 corresponds respectively to the left menu, right menu, header, footer, center, and meta tags. The weight k_i (with 0 ≤ k_i ≤ 1 for i = 1, ..., 6 and \sum_{i=1}^{6} k_i = 1) denotes the influence of the i-th classifier on the final decision. Such influence should be proportional to the expected relevance of the information stored in the d_i part of the page. In particular, after some tuning, we assigned the following weights to each classifier: header 0.1, footer 0.01, left menu 0.05, right menu 0.04, center 0.5, title and meta tags 0.3. The results of the classification will be shown in Section 6.

In a second prototype we employed an artificial neural network to estimate the optimal weights of the mixture. Such an approach allows the system to adapt automatically to different datasets without any manual tuning of the parameters.
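The estimation and mixture steps above can be made concrete with a short sketch. This is not the authors' implementation; it assumes tokenized region texts, the fixed weights reported above, and vocabularies restricted to the selected features:

```python
import math
from collections import Counter

def train_region_nb(docs, labels, vocab, n_classes):
    """Laplace-smoothed estimates of eqs. (5) and (6) for one page region.

    docs: list of token lists, labels: list of class indices, vocab: set of features.
    Returns per-class log priors and per-class log conditional probabilities.
    """
    log_prior, log_cond = [], []
    for j in range(n_classes):
        class_docs = [d for d, y in zip(docs, labels) if y == j]
        log_prior.append(math.log((1 + len(class_docs)) / (n_classes + len(docs))))
        counts = Counter(w for d in class_docs for w in d if w in vocab)
        denom = len(vocab) + sum(counts.values())
        log_cond.append({w: math.log((1 + counts[w]) / denom) for w in vocab})
    return log_prior, log_cond

def region_posteriors(doc, log_prior, log_cond):
    """Posterior P(c_j | d_i) for one region, via eq. (4) and normalization."""
    logs = [p + sum(c[w] for w in doc if w in c) for p, c in zip(log_prior, log_cond)]
    m = max(logs)
    unnorm = [math.exp(l - m) for l in logs]
    z = sum(unnorm)
    return [u / z for u in unnorm]

def classify_page(regions, models, weights):
    """Linear mixture of eq. (7): regions maps region name -> token list."""
    n_classes = len(next(iter(models.values()))[0])
    score = [0.0] * n_classes
    for name, doc in regions.items():
        post = region_posteriors(doc, *models[name])
        score = [s + weights[name] * p for s, p in zip(score, post)]
    return max(range(n_classes), key=lambda j: score[j])

# Manually tuned weights from the first prototype.
WEIGHTS = {"header": 0.1, "footer": 0.01, "left_menu": 0.05,
           "right_menu": 0.04, "center": 0.5, "meta": 0.3}
```

Each region model would be trained on the corresponding constituents of the training pages, using the region-specific vocabulary produced by the feature selection step.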

Figure 7: Architecture of the classification system based on the visual representation of a page: the M-Tree builder and area recognizer feed a feature selection unit, whose output goes to six Naïve Bayes classifiers (H, F, LM, RM, C, M), each producing n class probabilities that are combined by the neural net.

We used a two-layer feed-forward neural net with sigmoidal units, trained with the Backpropagation learning algorithm [14]. The basic idea of Backpropagation learning is to implement a gradient descent search through the space of possible network weights w_i, iteratively reducing the error between the training example target values and the network outputs. The 6·n dimensional output of the Naïve Bayes classifiers is the input of the network. In our experimental setup, the neural network featured 20 hidden units and 15 output units, one per class. Each network output represents the probability of belonging to the corresponding class. The ordinal number of the output with the maximum value corresponds to the winning class. The learning rate and the momentum term were both set to 0.5. A validation set was used to learn the network parameters.
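The paper does not publish the network code; a minimal NumPy sketch of the mixing network described above (6·n inputs, 20 sigmoidal hidden units, n outputs), with the Backpropagation training loop omitted, might look like this:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class MixtureNet:
    """Two-layer feed-forward network mixing the 6*n Naive Bayes outputs (sketch).

    Sizes follow Section 5.2: 6*n inputs, 20 hidden sigmoidal units, n outputs.
    Training (Backpropagation, learning rate and momentum 0.5) is omitted here.
    """
    def __init__(self, n_classes, n_hidden=20, seed=0):
        rng = np.random.default_rng(seed)
        self.w1 = rng.normal(scale=0.1, size=(6 * n_classes, n_hidden))
        self.b1 = np.zeros(n_hidden)
        self.w2 = rng.normal(scale=0.1, size=(n_hidden, n_classes))
        self.b2 = np.zeros(n_classes)

    def forward(self, nb_outputs):
        """nb_outputs: vector of 6*n posterior probabilities from the region classifiers."""
        hidden = sigmoid(nb_outputs @ self.w1 + self.b1)
        return sigmoid(hidden @ self.w2 + self.b2)    # one score per class

net = MixtureNet(n_classes=15)
scores = net.forward(np.full(6 * 15, 1.0 / 15))       # uniform posteriors as dummy input
print(int(np.argmax(scores)))                          # index of the winning class
```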

6. Classification results

At the time of writing, there was no dataset of Web pages that had been commonly accepted as a standard reference for classification tasks. Maybe the best known dataset is WebKB (2). After some inspection of the pages in this dataset we concluded that it is not appropriate, because its pages have a very simple layout without complex HTML structures. This dataset originates from January 1997 and collected pages only from educational domains. Educational pages often feature a much simpler structure than commercial pages, which moreover cover more than 80% of the Web [16]. Since 1997 Web design has developed significantly; nowadays, many popular software packages allow a fast and easy design of complex Web pages. Real pages are often very complex in their presentational concept (just look at the CNN or BBC home pages and compare them with the pages from the WebKB dataset). Thus, we decided to create our own dataset.

After extracting all the URLs provided by the first 5 levels of the DMOZ topic taxonomy, we selected the 15 topics at the first level of the hierarchy (the first-level topic Regional was discarded since it mostly contains non-English documents). Each URL was associated to the class (topic) from which it was discovered. Finally, all classes were randomly pruned, keeping only 1000 URLs for each class. Using a Web crawler, we downloaded all the documents associated to the URLs. Many links were broken (server down or pages not available anymore), thus only about 10,000 pages could be effectively retrieved (an average of 668 pages for each class). These pages were used to create the dataset. Such a dataset (3) can be easily replicated, enlarged and updated (the continuous changing of Web formats and styles does not allow to employ a frozen dataset, since after a few months it would not be representative of the real documents that can be found on the Internet).

(2) Can be downloaded from
(3) The dataset can be downloaded from

The dataset was split into a training set, a test set, and a validation set, the latter used for training the neural network. In order to compare the results provided by our method, we used a standard NB classifier (as described in Section 5) processing all the words filtered by a feature selection method. In particular, we employed a feature selection algorithm based either on TF-IDF or on Information Gain. The results are shown in Figure 8.

Figure 8: Classifier performance (percentage of correctly classified pages versus number of features, with 100 and 200 training examples per category). NN and LM indicate the methods taking into account visual information, with respectively a neural network and a linear combination performing the final mixture. IG indicates the result of an NB classifier dealing with the entire page (no visual information is considered), filtered using the Information Gain of the words. IG stop-list also uses a stop-list of features. Finally, TF-IDF indicates the results of an NB classifier whose input is filtered by a TF-IDF based algorithm.

Visual-based methods clearly outperform methods based on a non-visual representation. Feature selection based on TF-IDF is worse when compared to Information Gain. Classification accuracy can be improved using feature selection based on Information Gain in combination with a stop-list containing common terms such as articles, pronouns, etc. However, it is clear that methods using visual information clearly outperform methods purely based on text analysis (the improvement was around 10% even where the classical methods reach their maximum). Increasing the number of training examples from 100 to 200 per group helps in improving the performance of all the methods. An interesting effect is that if we increase the number of features, after some limit (around 75) the precision of the classic approaches decreases. Our methods are more robust to this phenomenon, since noisy features are likely to lie in marginal areas of the page and are then filtered out by the final mixture. Even if the manually tuned mixture provides slightly better performance than the mixture performed by the neural network, the difference is always less than 2%, clearly showing that the network estimates the intrinsic importance of a region well.

7. Conclusion

This paper describes a possible representation of a Web page, called M-Tree, in which objects are placed into a well-defined tree hierarchy according to where they belong in the HTML structure of the page. Further, each object (node of the M-Tree) carries information about its position in a browser window. This visual information enables us to define heuristics for the recognition of common areas such as the header, footer, left and right menus, and center of a page. The crucial difficulty was to develop a sufficiently good rendering algorithm, i.e. to imitate the behavior of popular user agents such as Internet Explorer. Unfortunately, HTML sources often do not conform to the standard, posing additional problems in the rendering process. After applying some techniques for error recovery in the construction of the parsing tree and introducing some rendering simplifications (we do not deal with frames, layers and style sheets), we defined recognition heuristics based only on visual information. The overall result in recognizing the targeted areas was around 73%. In the future we plan to improve the rendering process and the recognition heuristics. We also plan to recognize logical groups of objects that are not necessarily in the same area, such as titles of figures, titles and subtitles in text, commercial add-ins, etc.

Our experimental results provide evidence to claim that spatial information is of crucial importance for classifying Web documents. The classification accuracy of a Naïve Bayes classifier increased by more than 10% when taking into account the visual information. In particular, we constructed a mixture of classifiers, each one trained to recognize words appearing in a specific portion of the page. In the future, we plan to use our system to improve link selection in focused crawling sessions [4] by estimating the importance of hyperlinks using their position and neighborhood. However, we believe that our visual page representation can find applications in many other areas related to search engines, information retrieval and data mining on the Web.

Acknowledgements

We would like to thank Nicola Baldini for fruitful discussions on the Web page representations adopted in the focuseek project, which inspired some of the ideas proposed in this paper.

References

[1] Quinlan, J.R., Induction of decision trees, Machine Learning, 1986.
[2] Salton, G., McGill, M.J., An Introduction to Modern Information Retrieval, McGraw-Hill.
[3] Chakrabarti S., van den Berg M., Dom B., Focused crawling: A new approach to topic-specific Web resource discovery, in Proceedings of the 8th Int. World Wide Web Conference, Toronto, Canada.
[4] Diligenti M., Coetzee F., Lawrence S., Giles C., Gori M., Focused crawling using context graphs, in Proceedings of the 26th Int. Conf. on Very Large Databases, Cairo, Egypt.

[5] Rennie J., McCallum A., Using reinforcement learning to spider the Web efficiently, in Proceedings of the Int. Conf. on Machine Learning, Bled, Slovenia.
[6] Brin S., Page L., The anatomy of a large-scale hypertextual Web search engine, in Proceedings of the 7th International World Wide Web Conference, Brisbane, Australia, 1997, vol. 3, ACM Press.
[7] Bernard L.M., Criteria for optimal web design (designing for usability).
[8] Embley D.W., Jiang Y.S., Ng Y.K., Record-Boundary Discovery in Web Documents, in Proceedings of SIGMOD, Philadelphia, USA.
[9] Lim S.J., Ng Y.K., Extracting Structures of HTML Documents Using a High-Level Stack Machine, in Proceedings of the 12th International Conference on Information Networking (ICOIN), Tokyo, Japan, 1998.
[10] World Wide Web Consortium (W3C), HTML 4.01 Specification, December.
[11] James F., Representing Structured Information in Audio Interfaces: A Framework for Selecting Audio Marking Techniques to Represent Document Structures, Ph.D. thesis, Stanford University, available online.
[12] Mitchell T., Machine Learning, McGraw-Hill.
[13] Sebastiani F., Machine learning in automated text categorization, ACM Computing Surveys, 34(1).
[14] Yang, Y., Pedersen, J.P., A Comparative Study on Feature Selection in Text Categorization, in Proceedings of the Fourteenth International Conference on Machine Learning (ICML'97), 1997.

[15] Chauvin Y., Rumelhart D., Backpropagation: Theory, Architectures, and Applications, Lawrence Erlbaum Associates, Hillsdale, NJ.
[16] Lawrence, S., Giles, L., Accessibility of Information on the Web, Nature, 400, July.


A genetic algorithm based focused Web crawler for automatic webpage classification

A genetic algorithm based focused Web crawler for automatic webpage classification A genetic algorithm based focused Web crawler for automatic webpage classification Nancy Goyal, Rajesh Bhatia, Manish Kumar Computer Science and Engineering, PEC University of Technology, Chandigarh, India

More information

Performance Analysis of Data Mining Classification Techniques

Performance Analysis of Data Mining Classification Techniques Performance Analysis of Data Mining Classification Techniques Tejas Mehta 1, Dr. Dhaval Kathiriya 2 Ph.D. Student, School of Computer Science, Dr. Babasaheb Ambedkar Open University, Gujarat, India 1 Principal

More information

Automated Online News Classification with Personalization

Automated Online News Classification with Personalization Automated Online News Classification with Personalization Chee-Hong Chan Aixin Sun Ee-Peng Lim Center for Advanced Information Systems, Nanyang Technological University Nanyang Avenue, Singapore, 639798

More information

Data Preprocessing. Slides by: Shree Jaswal

Data Preprocessing. Slides by: Shree Jaswal Data Preprocessing Slides by: Shree Jaswal Topics to be covered Why Preprocessing? Data Cleaning; Data Integration; Data Reduction: Attribute subset selection, Histograms, Clustering and Sampling; Data

More information

1 of 7 8/27/2014 2:26 PM Units: Teacher: WebPageDesignI, CORE Course: WebPageDesignI Year: 2012-13 Designing & Planning Web Pages This unit will give students a basic understanding of core design principles

More information

Large-Scale Networks. PageRank. Dr Vincent Gramoli Lecturer School of Information Technologies

Large-Scale Networks. PageRank. Dr Vincent Gramoli Lecturer School of Information Technologies Large-Scale Networks PageRank Dr Vincent Gramoli Lecturer School of Information Technologies Introduction Last week we talked about: - Hubs whose scores depend on the authority of the nodes they point

More information

Layout Segmentation of Scanned Newspaper Documents

Layout Segmentation of Scanned Newspaper Documents , pp-05-10 Layout Segmentation of Scanned Newspaper Documents A.Bandyopadhyay, A. Ganguly and U.Pal CVPR Unit, Indian Statistical Institute 203 B T Road, Kolkata, India. Abstract: Layout segmentation algorithms

More information

CHAPTER 2 MARKUP LANGUAGES: XHTML 1.0

CHAPTER 2 MARKUP LANGUAGES: XHTML 1.0 WEB TECHNOLOGIES A COMPUTER SCIENCE PERSPECTIVE CHAPTER 2 MARKUP LANGUAGES: XHTML 1.0 Modified by Ahmed Sallam Based on original slides by Jeffrey C. Jackson reserved. 0-13-185603-0 HTML HELLO WORLD! Document

More information

recall: a Web page is a text document that contains additional formatting information in the HyperText Markup Language (HTML)

recall: a Web page is a text document that contains additional formatting information in the HyperText Markup Language (HTML) HTML & Web Pages recall: a Web page is a text document that contains additional formatting information in the HyperText Markup Language (HTML) HTML specifies formatting within a page using tags in its

More information

A Framework for adaptive focused web crawling and information retrieval using genetic algorithms

A Framework for adaptive focused web crawling and information retrieval using genetic algorithms A Framework for adaptive focused web crawling and information retrieval using genetic algorithms Kevin Sebastian Dept of Computer Science, BITS Pilani kevseb1993@gmail.com 1 Abstract The web is undeniably

More information

Finding Neighbor Communities in the Web using Inter-Site Graph

Finding Neighbor Communities in the Web using Inter-Site Graph Finding Neighbor Communities in the Web using Inter-Site Graph Yasuhito Asano 1, Hiroshi Imai 2, Masashi Toyoda 3, and Masaru Kitsuregawa 3 1 Graduate School of Information Sciences, Tohoku University

More information

1. INTRODUCTION. AMS Subject Classification. 68U10 Image Processing

1. INTRODUCTION. AMS Subject Classification. 68U10 Image Processing ANALYSING THE NOISE SENSITIVITY OF SKELETONIZATION ALGORITHMS Attila Fazekas and András Hajdu Lajos Kossuth University 4010, Debrecen PO Box 12, Hungary Abstract. Many skeletonization algorithms have been

More information

Improving Recognition through Object Sub-categorization

Improving Recognition through Object Sub-categorization Improving Recognition through Object Sub-categorization Al Mansur and Yoshinori Kuno Graduate School of Science and Engineering, Saitama University, 255 Shimo-Okubo, Sakura-ku, Saitama-shi, Saitama 338-8570,

More information

Focused Crawling Using Context Graphs

Focused Crawling Using Context Graphs Focused Crawling Using Context Graphs M. Diligenti, F. M. Coetzee, S. Lawrence, C. L. Giles and M. Gori Λ NEC Research Institute, 4 Independence Way, Princeton, NJ 08540-6634 USA Dipartimento di Ingegneria

More information

Data Mining with Oracle 10g using Clustering and Classification Algorithms Nhamo Mdzingwa September 25, 2005

Data Mining with Oracle 10g using Clustering and Classification Algorithms Nhamo Mdzingwa September 25, 2005 Data Mining with Oracle 10g using Clustering and Classification Algorithms Nhamo Mdzingwa September 25, 2005 Abstract Deciding on which algorithm to use, in terms of which is the most effective and accurate

More information

All Adobe Digital Design Vocabulary Absolute Div Tag Allows you to place any page element exactly where you want it Absolute Link Includes the

All Adobe Digital Design Vocabulary Absolute Div Tag Allows you to place any page element exactly where you want it Absolute Link Includes the All Adobe Digital Design Vocabulary Absolute Div Tag Allows you to place any page element exactly where you want it Absolute Link Includes the complete URL of the linked document, including the domain

More information

Customer Clustering using RFM analysis

Customer Clustering using RFM analysis Customer Clustering using RFM analysis VASILIS AGGELIS WINBANK PIRAEUS BANK Athens GREECE AggelisV@winbank.gr DIMITRIS CHRISTODOULAKIS Computer Engineering and Informatics Department University of Patras

More information

Improving Latent Fingerprint Matching Performance by Orientation Field Estimation using Localized Dictionaries

Improving Latent Fingerprint Matching Performance by Orientation Field Estimation using Localized Dictionaries Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 3, Issue. 11, November 2014,

More information

Dynamic Routing Between Capsules

Dynamic Routing Between Capsules Report Explainable Machine Learning Dynamic Routing Between Capsules Author: Michael Dorkenwald Supervisor: Dr. Ullrich Köthe 28. Juni 2018 Inhaltsverzeichnis 1 Introduction 2 2 Motivation 2 3 CapusleNet

More information

CS299 Detailed Plan. Shawn Tice. February 5, The high-level steps for classifying web pages in Yioop are as follows:

CS299 Detailed Plan. Shawn Tice. February 5, The high-level steps for classifying web pages in Yioop are as follows: CS299 Detailed Plan Shawn Tice February 5, 2013 Overview The high-level steps for classifying web pages in Yioop are as follows: 1. Create a new classifier for a unique label. 2. Train it on a labelled

More information

Adaptive Supersampling Using Machine Learning Techniques

Adaptive Supersampling Using Machine Learning Techniques Adaptive Supersampling Using Machine Learning Techniques Kevin Winner winnerk1@umbc.edu Abstract Previous work in adaptive supersampling methods have utilized algorithmic approaches to analyze properties

More information

International Journal of Advanced Research in Computer Science and Software Engineering

International Journal of Advanced Research in Computer Science and Software Engineering Volume 3, Issue 3, March 2013 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Special Issue:

More information

HTML TAG SUMMARY HTML REFERENCE 18 TAG/ATTRIBUTE DESCRIPTION PAGE REFERENCES TAG/ATTRIBUTE DESCRIPTION PAGE REFERENCES MOST TAGS

HTML TAG SUMMARY HTML REFERENCE 18 TAG/ATTRIBUTE DESCRIPTION PAGE REFERENCES TAG/ATTRIBUTE DESCRIPTION PAGE REFERENCES MOST TAGS MOST TAGS CLASS Divides tags into groups for applying styles 202 ID Identifies a specific tag 201 STYLE Applies a style locally 200 TITLE Adds tool tips to elements 181 Identifies the HTML version

More information

Best First and Greedy Search Based CFS and Naïve Bayes Algorithms for Hepatitis Diagnosis

Best First and Greedy Search Based CFS and Naïve Bayes Algorithms for Hepatitis Diagnosis Best First and Greedy Search Based CFS and Naïve Bayes Algorithms for Hepatitis Diagnosis CHAPTER 3 BEST FIRST AND GREEDY SEARCH BASED CFS AND NAÏVE BAYES ALGORITHMS FOR HEPATITIS DIAGNOSIS 3.1 Introduction

More information

Quark XML Author October 2017 Update for Platform with Business Documents

Quark XML Author October 2017 Update for Platform with Business Documents Quark XML Author 05 - October 07 Update for Platform with Business Documents Contents Getting started... About Quark XML Author... Working with the Platform repository...3 Creating a new document from

More information

Extracting Algorithms by Indexing and Mining Large Data Sets

Extracting Algorithms by Indexing and Mining Large Data Sets Extracting Algorithms by Indexing and Mining Large Data Sets Vinod Jadhav 1, Dr.Rekha Rathore 2 P.G. Student, Department of Computer Engineering, RKDF SOE Indore, University of RGPV, Bhopal, India Associate

More information

A study of classification algorithms using Rapidminer

A study of classification algorithms using Rapidminer Volume 119 No. 12 2018, 15977-15988 ISSN: 1314-3395 (on-line version) url: http://www.ijpam.eu ijpam.eu A study of classification algorithms using Rapidminer Dr.J.Arunadevi 1, S.Ramya 2, M.Ramesh Raja

More information

HTML Summary. All of the following are containers. Structure. Italics Bold. Line Break. Horizontal Rule. Non-break (hard) space.

HTML Summary. All of the following are containers. Structure. Italics Bold. Line Break. Horizontal Rule. Non-break (hard) space. HTML Summary Structure All of the following are containers. Structure Contains the entire web page. Contains information

More information

1/6/ :28 AM Approved New Course (First Version) CS 50A Course Outline as of Fall 2014

1/6/ :28 AM Approved New Course (First Version) CS 50A Course Outline as of Fall 2014 1/6/2019 12:28 AM Approved New Course (First Version) CS 50A Course Outline as of Fall 2014 CATALOG INFORMATION Dept and Nbr: CS 50A Title: WEB DEVELOPMENT 1 Full Title: Web Development 1 Last Reviewed:

More information

International Journal of Scientific Research & Engineering Trends Volume 4, Issue 6, Nov-Dec-2018, ISSN (Online): X

International Journal of Scientific Research & Engineering Trends Volume 4, Issue 6, Nov-Dec-2018, ISSN (Online): X Analysis about Classification Techniques on Categorical Data in Data Mining Assistant Professor P. Meena Department of Computer Science Adhiyaman Arts and Science College for Women Uthangarai, Krishnagiri,

More information

Term-Frequency Inverse-Document Frequency Definition Semantic (TIDS) Based Focused Web Crawler

Term-Frequency Inverse-Document Frequency Definition Semantic (TIDS) Based Focused Web Crawler Term-Frequency Inverse-Document Frequency Definition Semantic (TIDS) Based Focused Web Crawler Mukesh Kumar and Renu Vig University Institute of Engineering and Technology, Panjab University, Chandigarh,

More information

A Survey on Postive and Unlabelled Learning

A Survey on Postive and Unlabelled Learning A Survey on Postive and Unlabelled Learning Gang Li Computer & Information Sciences University of Delaware ligang@udel.edu Abstract In this paper we survey the main algorithms used in positive and unlabeled

More information

Co. Louth VEC & Co. Monaghan VEC. Programme Module for. Web Authoring. leading to. Level 5 FETAC. Web Authoring 5N1910

Co. Louth VEC & Co. Monaghan VEC. Programme Module for. Web Authoring. leading to. Level 5 FETAC. Web Authoring 5N1910 Co. Louth VEC & Co. Monaghan VEC Programme Module for Web Authoring leading to Level 5 FETAC Web Authoring 5N1910 Web Authoring 5N1910 1 Introduction This programme module may be delivered as a standalone

More information

E-Companion: On Styles in Product Design: An Analysis of US. Design Patents

E-Companion: On Styles in Product Design: An Analysis of US. Design Patents E-Companion: On Styles in Product Design: An Analysis of US Design Patents 1 PART A: FORMALIZING THE DEFINITION OF STYLES A.1 Styles as categories of designs of similar form Our task involves categorizing

More information

An Improved Document Clustering Approach Using Weighted K-Means Algorithm

An Improved Document Clustering Approach Using Weighted K-Means Algorithm An Improved Document Clustering Approach Using Weighted K-Means Algorithm 1 Megha Mandloi; 2 Abhay Kothari 1 Computer Science, AITR, Indore, M.P. Pin 453771, India 2 Computer Science, AITR, Indore, M.P.

More information

Quark XML Author October 2017 Update with Business Documents

Quark XML Author October 2017 Update with Business Documents Quark XML Author 05 - October 07 Update with Business Documents Contents Getting started... About Quark XML Author... Working with documents... Basic document features... What is a business document...

More information

Machine Learning in Biology

Machine Learning in Biology Università degli studi di Padova Machine Learning in Biology Luca Silvestrin (Dottorando, XXIII ciclo) Supervised learning Contents Class-conditional probability density Linear and quadratic discriminant

More information

An Efficient Approach for Color Pattern Matching Using Image Mining

An Efficient Approach for Color Pattern Matching Using Image Mining An Efficient Approach for Color Pattern Matching Using Image Mining * Manjot Kaur Navjot Kaur Master of Technology in Computer Science & Engineering, Sri Guru Granth Sahib World University, Fatehgarh Sahib,

More information

Equation to LaTeX. Abhinav Rastogi, Sevy Harris. I. Introduction. Segmentation.

Equation to LaTeX. Abhinav Rastogi, Sevy Harris. I. Introduction. Segmentation. Equation to LaTeX Abhinav Rastogi, Sevy Harris {arastogi,sharris5}@stanford.edu I. Introduction Copying equations from a pdf file to a LaTeX document can be time consuming because there is no easy way

More information

A Web Page Segmentation Method by using Headlines to Web Contents as Separators and its Evaluations

A Web Page Segmentation Method by using Headlines to Web Contents as Separators and its Evaluations IJCSNS International Journal of Computer Science and Network Security, VOL.13 No.1, January 2013 1 A Web Page Segmentation Method by using Headlines to Web Contents as Separators and its Evaluations Hiroyuki

More information