Automatic Web pages author extraction


Sahar Changuel, Nicolas Labroche, and Bernadette Bouchon-Meunier

Laboratoire d'Informatique de Paris 6 (LIP6), DAPA, LIP6, 104, Avenue du Président Kennedy, 75016, Paris, France
{Sahar.Changuel, Nicolas.Labroche, Bernadette.Bouchon-Meunier}@lip6.fr

Abstract. This paper addresses the problem of automatically extracting the author from heterogeneous HTML resources, as a subproblem of automatic metadata extraction from (Web) documents. We take a supervised machine learning approach to the problem, using the C4.5 Decision Tree algorithm. The particularity of our approach is that it focuses on both structural and contextual information. A semi-automatic approach was used for corpus expansion in order to annotate the dataset with less human effort. This paper shows that our method can achieve good results (more than 80% in terms of F1-measure) despite the heterogeneity of our corpus.

1 Introduction

The Web has become the major source of information, disseminating news and documents at an incredible speed. With this rapid increase of information, locating the relevant resources is becoming more and more difficult. One approach to making the Web more understandable to machines is the Semantic Web [1], where resources are enriched with descriptive information called metadata. Metadata are commonly known as a kind of structured data about data that can describe the contents, semantics and services of data, playing a central role in supporting resource description and discovery. Basic metadata about a document are its title, its author, its date of publication, its keywords and its description [2].

Although manual annotations are considered the main source of information for the Semantic Web, the majority of existing HTML pages are still poorly equipped with any kind of metadata. Hence automatic metadata extraction is an attractive alternative for building the Semantic Web.
The three main existing methods to generate metadata automatically are [14]:
- Deriving metadata: creating metadata based on system properties.
- Harvesting metadata: gathering existing metadata, e.g. META tags found in the header source code of an HTML resource.
- Extracting metadata: pulling metadata from resource content, which may employ sophisticated indexing and classification algorithms.
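To make the harvesting option concrete, the following is a minimal sketch of reading the author META tag from an HTML page. It is not the authors' implementation: the class and function names are illustrative, and it assumes the author is declared as `<meta name="author" content="...">` and relies only on Python's standard `html.parser` module.

```python
from html.parser import HTMLParser


class MetaAuthorParser(HTMLParser):
    """Collects the content of <meta name="author" ...> tags (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.authors = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values can be None (e.g. <meta name>), so normalise them.
        attributes = {name.lower(): (value or "") for name, value in attrs}
        if attributes.get("name", "").strip().lower() == "author":
            content = attributes.get("content", "").strip()
            if content:
                self.authors.append(content)


def harvest_meta_author(html):
    """Return the first non-empty META author value, or None if there is none."""
    parser = MetaAuthorParser()
    parser.feed(html)
    return parser.authors[0] if parser.authors else None
```

A page whose META author field is missing or empty yields `None`, which is the case in which harvesting cannot help and extraction from the page content is needed.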

This paper focuses on automatic author extraction from HTML documents as part of a broader application for automatic metadata extraction from learning resources. The author of a resource is the entity responsible for its creation; it can be a person or an organization. The author allows users to judge the credibility of the resource content [3] and can also serve as a searchable record for browsing digital libraries: a user can look for a course by choosing the professor's name as a query. Hence, automatically annotating the author field can be of great interest in helping to find appropriate information.

In HTML documents, people can explicitly specify the author in the META author tag; however, they seldom do it carefully. In our work, the author field was evaluated on our dataset, which contains 354 HTML pages: we found that only 15% of the META author fields are filled. An alternative method should therefore be adopted for author extraction.

This paper proposes a machine learning technique for automatic author extraction from web documents based on the Decision Tree (C4.5) algorithm. For each HTML page, person names are extracted and features are generated based on spatial and contextual information. Features corresponding to the same person are then combined in a disjunctive manner; this combination considerably improves the extraction results.

The rest of the paper is organized as follows. In section 2, previous related works are described, and in section 3, we give specifications on the HTML page author. Section 4 describes our method of author extraction as well as the features construction method. Section 5 presents the experimental results. We make concluding remarks in section 6.

2 Related work

Web information extraction has become a popular research area and many issues have been intensively investigated.
Automatic extraction of web information has many applications, such as cell phone and PDA browsing [4], automatic annotation of web pages with semantic information [5], and text summarization [6]. There are two main approaches to web information extraction (IE): the rule-based approach [7], which induces a set of rules from a training set, and the machine learning based approach, which learns statistical models or classifiers. The machine learning based approach is more widely employed, and systems differ from each other mainly in the features that they use: some use only basic features such as token string, capitalization and token type (word, number, etc.) [8]; others use linguistic features such as part-of-speech, semantic information from gazetteer lists and the outputs of other IE systems (e.g. named entity recognizers) [9, 10].

In [11], the authors proposed a machine learning method for title extraction from HTML pages. They use format information such as font size, position, and font weight as features for title extraction. While most Web pages have their titles placed at the beginning of the document with a conspicuous color and

size, the author of the page does not have special visual properties, which makes the visual method unsuitable for author extraction. To the best of our knowledge, the only work on author extraction from HTML documents is that of the authors of [3], who proposed a method for ranking author name candidates from Web pages in Japanese. They use features derived from document structure as well as linguistic knowledge, and rank candidates using the Ranking SVM model. As their approach relies on the distance from the main content, the method fails when the author name occurs inside the main content. Our approach resolves this problem by merging the features of the different occurrences of a person name in a page into a single representative one. Merging the features remarkably improves the extraction results.

A well-known drawback of supervised machine learning methods is the manual annotation of the input data set. In this paper, a semi-automatic annotation method is adopted. It consists of iteratively expanding the corpus by extracting the authors from new HTML pages. It uses a learning model constructed from a few manually annotated pages; the main human task then consists of verifying the suggested annotations.

3 Web page author

This paper focuses on the problem of automatically extracting authors from the bodies of HTML documents, assuming that the extraction is independent of the Web page structure. In this paper, the Web page author is considered to be the person responsible for the content of the document. Authors can have common characteristics:
- The author of an HTML document can be composed of a first name and/or a last name.
- It is generally placed at the beginning of the page or after the main content of the document.
- Next to the author name, we can find a mention of the document creation date, the author's e-mail address for contacting him or her, and even the name of the organization he or she belongs to.
- Some vocabulary can be used to help recognize the author's name, such as: author, created by, written by, etc.; a list of words of interest was constructed.

For example, in the pages shown in figure 1, Richard Fitzpatrick is the author of page A and Jason W. Hinson is the author of page B. We assume that there is only one author for a Web page.

Fig. 1. Examples of Web pages authors

4 Author extraction method

In this paper, a machine learning approach is used to address the problem of extracting authors from Web documents. Before the training phase, a preprocessing phase is needed to prepare the input data. The overall scheme of the features construction is illustrated in figure 2.

4.1 HTML page parsing

In order to analyze an HTML page for content extraction, it is first transformed into a well-formed XML document using the open source HTML syntax checker JTidy. The resulting document is then parsed using the Cobra toolkit, which creates its Document Object Model tree (DOM tree) representation. We consider the <body> HTML element as the root node of the tree. Our content extractor navigates the DOM tree and gets the text content from the leaf nodes.

4.2 Person names extraction

Person names (PNs) are extracted from the textual nodes of the DOM tree. For this purpose, some named entity recognition systems such as Balie (baseline information extraction) and LingPipe [12] were tried first. While these systems give good results when trained and applied to a particular genre of text, they make many more errors with the heterogeneous text found on the Web. Instead, our method is based on the gazetteer approach to extract PNs from web pages, using the list of frequently occurring US first names. Using this list

results in a simpler algorithm and makes it possible to extract names more accurately than with sophisticated named entity extraction systems. We use only the US first names list because the US last names list contains some common words such as White, Young, Price, etc., which can cause common words to be extracted as PNs and hence generate labeling noise. We then created regular expressions based on capitalization to extract the full name starting from the first name.

Fig. 2. Features construction

4.3 Context window extraction

Since in IE the context of a word is usually as important as the word itself, our approach needs to take into account the words neighboring each person name occurrence. Each PN is considered with 15 words on either side as a window of context around the candidate object. The size 15 was chosen experimentally (see section 5.3 for more details). However, in an HTML document, the context window does not only depend on the number of tokens but also on the page layout and structure: the text in an HTML document is generally divided into different visual blocks. If a PN is situated in a given block of the page, its context window should not contain tokens from other blocks.

In the example in figure 3 (mircea/teaching/cpe201/), the context window of the author name Mircea NICOLESCU is composed of the highlighted tokens. The window should contain only words that are in the same block as the PN; hence the left window of the current example contains only the phrase "Created by".

In this paper, a DOM-based method was adopted to construct the contextual window. Our approach exploits the tags generally used to represent blocks in

HTML pages, such as HR, TABLE, H1, P, and DIV. They are used to refine the context window as follows: for the left window, text nodes situated before one of these tags are excluded, and likewise for the right window, nodes situated after one of these tags are not taken into account. Hence, for each PN occurrence, its visual block is detected and its context window is constructed. The context window is then used to extract the contextual information required for the construction of the features.

Fig. 3. Context window extraction

4.4 Features construction

Spatial information. The author name is generally placed either before the main content of the page or in its footer part. The position of the PN relative to the page (PR) can therefore be considered an important piece of information for extracting the author. We define $PR = \frac{position}{maxPageDepth}$, where position is the position of the current node in the DOM tree and maxPageDepth is the total number of textual nodes in the tree. Two thresholds were fixed experimentally: a beginning threshold equal to 0.2 and an end threshold equal to 0.9. This paper supposes that when PR is below 0.2, the text is situated at the beginning of the page; when it is above 0.9, it is placed at the end of the page; and when it is between the two thresholds, the text is located in the main content of the page. The principal issue is how to fix both thresholds. This was done by running experiments with different threshold values and retaining those that give the best results (more details are given in section 5.3).

Contextual information. To get contextual information related to each PN in an HTML document, features are extracted from the context window based on the following information:

- Date: To detect whether there is a date in the contextual window, special regular expressions were created.
- E-mail: Regular expressions are created for e-mail detection; moreover, hyperlinks are also exploited by using the mailto links.
- Organization: The named entity recognition system Balie was applied to detect the occurrence of organization entities in the context window.
- Vocabulary: Two features indicating the existence of words from the author gazetteer were also constructed. The author gazetteer contains two lists: the first includes words like author, creator, writer, contact..., and the second contains a list of verbs such as created, realized, founded.... These words are semantically interesting for Web page author recognition.
- An additional feature was created to indicate the presence of the preposition "by" preceding the PN. This feature is kept apart since in some pages it can be the only information provided next to the author.

9 binary features are created for each PN: 3 spatial features and 6 contextual features.

4.5 Merging features

For each PN in a Web document, a feature vector is created. One of the problems encountered is that the author name can occur more than once in the document, and each occurrence carries more or less rich contextual information. The authors of [3] proposed a ranking method that assigns a rank to each author name candidate. As our aim is to extract the author from a web page and not to rank its occurrences, the solution proposed in this paper consists of merging the feature vectors using a disjunction (OR operator).

An example is given in figure 4. Suppose that John Smith is the author of a Web page and that his name occurs 3 times in the page; we then have three feature vectors for this candidate, one per occurrence: V1 = [1,0,1,0,1,0,0,0,1,1], V2 = [0,1,0,0,0,0,1,0,0,1], and V3 = [1,0,1,0,0,0,0,0,0,1].
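As a minimal sketch of the disjunctive merge described in this section, the three vectors of the example can be combined with an element-wise OR. The function name is illustrative and not taken from the paper; the sketch only assumes that every occurrence of a person name yields a binary vector of the same length.

```python
def merge_feature_vectors(vectors):
    """Element-wise OR over binary feature vectors of equal length."""
    if not vectors:
        raise ValueError("at least one feature vector is required")
    length = len(vectors[0])
    if any(len(vector) != length for vector in vectors):
        raise ValueError("all feature vectors must have the same length")
    # zip(*vectors) groups the values of one feature across all occurrences;
    # the merged feature is 1 as soon as any occurrence has it set.
    return [int(any(column)) for column in zip(*vectors)]


# The three occurrences of the same candidate name from the example.
v1 = [1, 0, 1, 0, 1, 0, 0, 0, 1, 1]
v2 = [0, 1, 0, 0, 0, 0, 1, 0, 0, 1]
v3 = [1, 0, 1, 0, 0, 0, 0, 0, 0, 1]

merged = merge_feature_vectors([v1, v2, v3])
print(merged)  # [1, 1, 1, 0, 1, 0, 1, 0, 1, 1]
```

Each person name therefore contributes a single training example per page, however many times it occurs.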
The key idea is to construct a feature vector V representing all the occurrences of John Smith in the page. V is the disjunction of the three vectors: V = [V1 OR V2 OR V3] = [1,1,1,0,1,0,1,0,1,1]. This method gives richer and more powerful information for each person name in a page; moreover, it eliminates poor examples that can affect the training model. Section 5.3 shows the effect of merging the features on the extraction result.

5 Algorithms and evaluations

5.1 Algorithms

This paper uses a supervised learning technique to extract authors from HTML documents. The algorithm learns a model from a set of examples. Let $\{(x_1, y_1), \ldots, (x_n, y_n)\}$ be a two-class training data set, with $x_i$ a training feature vector constructed for

each person name and $y_i$ its label (1 for the class "author" and -1 for the class "non author"). A PN is labeled as author if it approximately matches the annotated page author; the approximate match is true when at least 70% of the tokens in the annotated author are contained in the PN.

Fig. 4. Features merging

We used supervised learning methods implemented in Weka [13] to train our classifier. Through experimentation, we found that the Decision Tree implementation (C4.5) provided the best classification performance. As baseline methods, we used the extraction of the author from the META author tag (MetaAuthor) and the OneR (one rule) classifier. The OneR model seeks to generate classification rules using a single attribute only; we use the implementation of the one-rule classifier provided in the Weka toolkit. To evaluate the author extraction method, Precision, Recall and F1-measure are used as evaluation metrics.

5.2 Data

Data was first collected manually by sending queries to the Web; among the resulting pages, a human annotator selected those which contain their authors in their contents. As human annotation is time consuming and, in this case, requires looking through numerous Web pages to find a few interesting ones, annotation was stopped when 100 annotated pages were obtained. However, a dataset of 100 pages is not representative enough, especially since the accuracy of a learned model usually increases with the number of training examples.

In order to expand the dataset, we adopt a semi-automatic technique that can be explained as follows:
- A Decision Tree model is first trained on the features extracted from the existing annotated pages.

- New pages are acquired by sending different queries to a Web search engine using the Google API.
- PNs are extracted from the resulting pages and their feature vectors are constructed as explained in section 4.
- The learning model is then applied to these features, and pages that contain instances classified as author are retained. These pages are then labeled by the human annotator. This phase relies heavily on the system's suggestions, and the annotator's main task is correcting and integrating the suggested annotations. The annotator now has only a small number of pages to go through.
- Features created from the newly annotated pages are added to the previous training examples.

The process is repeated a certain number of times, and each time different queries are chosen in order to get different HTML pages. Our corpus has grown from 100 to 354 pages more quickly and with less human effort.

Within the context of our global application of automatic metadata extraction from learning resources, we are especially interested in extracting information from the education domain; thus, the words of the queries belong to the education lexicon, e.g. Analog electronics, Molecular Biology, Operating systems, Human Sciences, etc. Even if the new pages are obtained using a model trained on already annotated ones, our corpus contains heterogeneous HTML documents which are content and structure independent.

5.3 Experiments

This section summarizes the results of the different experiments on author extraction. In the experiments, we conducted 10-fold cross validation, and thus all the results are averaged over 10 trials. The input examples of our experiments are the binary feature vectors related to the person names found in the annotated HTML documents. Each example is labeled with the class "author" or "non author".

The results are summarized in table 1 and indicate that our method significantly outperforms the baseline methods. Our learning-based method can make effective use of various types of information for author extraction.
The MetaAuthor method can correctly extract only about 15% of the authors from the META tags of the HTML pages.

Table 1. Performances of baseline methods for author extraction

Method        Precision   Recall   F1-measure
C4.5
OneR
MetaAuthor
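For reference, the three evaluation metrics reported in the tables can be computed from counts for the "author" class as in the sketch below. The function name and the counts in the example are purely illustrative; they are not results from the paper.

```python
def precision_recall_f1(true_positives, false_positives, false_negatives):
    """Precision, recall and F1-measure for the positive ("author") class."""
    predicted = true_positives + false_positives   # examples classified as author
    actual = true_positives + false_negatives      # examples annotated as author
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    total = precision + recall
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical counts, for illustration only.
precision, recall, f1 = precision_recall_f1(
    true_positives=8, false_positives=2, false_negatives=2
)
```

With these hypothetical counts, 8 of the 10 predicted authors are correct and 8 of the 10 annotated authors are found, so all three measures equal 0.8.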

Effectiveness of merging features: Table 2 shows the results obtained before and after applying the disjunctive combination of features for the different PN candidates.

Table 2. Evaluation results before and after features combination (Positive = the number of examples labeled as author, Negative = the number of examples labeled as non author).

Method                  Precision   Recall   F1-measure   Positive   Negative
Non combined features
Combined features

The results indicate that combining the features notably enhances the results; in particular, recall improves by about 21%. This can be explained by the fact that recall is affected by the number of items incorrectly classified as non author. Without merging the features, this number is high, since a page can contain more than one occurrence of the author name and some candidates have poor contextual information, which causes them to be incorrectly classified as non authors by the model.

Parameters effectiveness: Figure 5 shows the experimental results in terms of F1-measure obtained with different parameters. Curve C shows the results when changing the size of the context window. With a small window size we can miss some useful information, and larger sizes can introduce more noise into the dataset. A window size of 15 seems to enclose the relevant context information and to give better results. Curve A shows how the results of the trained model change when varying the beginning threshold while fixing the end threshold to 0.9; in curve B, we fix the beginning threshold to 0.2 and vary the end threshold. These curves support the choice of 0.2 and 0.9 as beginning and end thresholds to delimit the main content of a Web page: both values give the best results in terms of F1-measure.

Dataset size effectiveness: During the dataset expansion, the evolution of the performance of our system is evaluated.
Curve D in figure 5 summarizes the results and shows that the performance of the model improves when the number of annotated HTML pages increases.

Feature contribution: We further investigate the contribution of each feature type to author extraction. Experiments were done using each category of features separately (the spatial features and the contextual features). Table 3 summarizes the results.
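As a reminder of what the spatial category contains, the sketch below derives three binary features from the position ratio PR of section 4.4 (position of the node divided by the total number of textual nodes) using the thresholds 0.2 and 0.9. The mapping of the three page zones to three binary features, the treatment of values exactly on a threshold, and all names are assumptions made for illustration.

```python
BEGIN_THRESHOLD = 0.2  # below this ratio, the text is at the beginning of the page
END_THRESHOLD = 0.9    # above this ratio, the text is at the end of the page


def spatial_features(position, max_page_depth,
                     begin=BEGIN_THRESHOLD, end=END_THRESHOLD):
    """Return [at_beginning, in_main_content, at_end] as binary features."""
    if max_page_depth <= 0:
        raise ValueError("max_page_depth must be positive")
    pr = position / max_page_depth  # relative position of the text node
    # Assumption: a ratio exactly equal to a threshold counts as main content.
    return [int(pr < begin), int(begin <= pr <= end), int(pr > end)]


print(spatial_features(5, 100))   # [1, 0, 0]
print(spatial_features(50, 100))  # [0, 1, 0]
print(spatial_features(95, 100))  # [0, 0, 1]
```

Exactly one of the three features is set for any given occurrence, which is consistent with position alone being too coarse to identify every author.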

Fig. 5. Parameters effectiveness (A: beginning threshold; B: end threshold; C: window size; D: data set size)

The results indicate that one type of feature alone is not sufficient for accurate author extraction. With the spatial features we obtain better precision, whereas with the contextual features we get better recall. Information on the position is insufficient for extracting all the authors from the HTML pages; the contextual information makes the result more complete.

Table 3. Contribution of each feature type

Feature subset        Precision   Recall   F1-measure
Spatial features
Contextual features

6 Conclusion

This paper provides a new approach to automatically extracting the author from a set of heterogeneous Web documents. The author is an essential component for judging the credibility of a resource's content. To address the problem, our method uses a machine learning approach based on the HTML structure as well as on contextual information. The method adopted in this paper extracts the author name from the body of

the HTML document; if this information is absent from the content of the page, other methods should be adopted, such as the stylometry approach, which is often used to attribute authorship to anonymous documents. Future directions include discovering other fields of metadata from HTML pages so as to enrich resources and to make them more accessible.

References

1. Berners-Lee, T., Hendler, J., Lassila, O.: The semantic web. Scientific American (2001)
2. Romero, C., Ventura, S.: Educational data mining: A survey from 1995 to 2005. Expert Syst. Appl. 33 (2007)
3. Kato, Y., Kawahara, D., Inui, K., Kurohashi, S., Shibata, T.: Extracting the author of web pages. In: WICOW '08: Proceeding of the 2nd ACM workshop on Information credibility on the web, New York, NY, USA, ACM (2008)
4. Gupta, S., Kaiser, G., Neistadt, D., Grimm, P.: DOM-based content extraction of HTML documents. In: WWW '03: Proceedings of the 12th international conference on World Wide Web, New York, NY, USA, ACM (2003)
5. Etzioni, O., Cafarella, M., Downey, D., Popescu, A.M., Shaked, T., Soderland, S., Weld, D.S., Yates, A.: Unsupervised named-entity extraction from the web: an experimental study. Artif. Intell. 165 (2005)
6. Evans, D., Klavans, J.L., McKeown, K.R.: Columbia Newsblaster: Multilingual news summarization on the web. In: Proceedings of Human Language Technology conference of the North American. (2004)
7. Ciravegna, F.: (LP)2, an adaptive algorithm for information extraction from web-related texts. In: Proceedings of the IJCAI-2001 Workshop on Adaptive Text Extraction and Mining. (2001)
8. Freitag, D., Kushmerick, N.: Boosted wrapper induction. In: Proceedings of the Seventeenth National Conference on Artificial Intelligence and Twelfth Conference on Innovative Applications of Artificial Intelligence, AAAI Press/The MIT Press (2000)
9. Amitay, E., Harel, N., Sivan, R., Soffer, A.: Web-a-where: geotagging web content.
In: SIGIR '04: Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval, New York, NY, USA, ACM Press (2004)
10. Nadeau, D., Turney, P., Matwin, S.: Unsupervised named-entity recognition: Generating gazetteers and resolving ambiguity. (2006)
11. Changuel, S., Labroche, N., Bouchon-Meunier, B.: A general learning method for automatic title extraction from HTML pages. In: International Conference on Machine Learning and Data Mining (MLDM '09), Leipzig, Germany. (2009)
12. Alias-i: LingPipe Natural Language Toolkit.
13. Witten, I.H., Frank, E.: Data Mining: Practical Machine Learning Tools and Techniques (Second Edition). Diane Cerra (2005)
14. Greenberg, J., Spurgin, K., Crystal, A.: Functionalities for automatic metadata generation applications: a survey of metadata experts' opinions. Int. J. Metadata Semant. Ontologies 1, 320 (2006)
