AUTOMATIC ACQUISITION OF DIGITIZED NEWSPAPERS VIA INTERNET

Ismael Sanz, Rafael Berlanga, María José Aramburu and Francisco Toledo
Departament d'Informàtica, Campus Penyeta Roja, Universitat Jaume I, E-12071 Castellón, SPAIN
e-mail: {berlanga,aramburu,toledo}@inf.uji.es

Keywords: Internet, Digital Libraries, Document Recognition, Logic Programming.

ABSTRACT

Following our previous work on modelling a database of newspapers and designing a specially suited retrieval language, we are now developing an application to automatically acquire, summarize and store newspaper documents published at distinct web resources. This paper describes the current implementation of the acquisition process, which includes the recognition of document types and the abstraction of the recognised document values. The network agents in charge of this process are called gatherers, following the terminology used in successful web retrieval systems such as Harvest. To implement gatherers we have combined context-free grammars with web traversal techniques that are available in most current PROLOG systems (e.g. SICStus with the PiLLoW library).
1 Introduction

The soaring availability of periodical publications on the Internet makes necessary new methods for the management of this kind of document [Ara96], as well as large specialized digital libraries that provide sophisticated indexing and retrieval over them [Ara97a]. Following our previous work on modelling databases of newspapers and designing a specially suited retrieval language, we are now developing an application to automatically acquire, summarize and store newspaper documents published at distinct web servers. This paper describes the current implementation of the acquisition process, which includes the recognition of document types and the abstraction of the recognised document values. The network agents in charge of this process are called gatherers, following the terminology used in successful web retrieval systems such as Harvest. To implement our gatherers we have combined context-free grammars with web traversal techniques available in most current PROLOG systems (e.g. SICStus and the PiLLoW library). This approach to the gathering problem is possible thanks to the regular structure of newspaper documents, which can be described using the Document Type System (DTS) [Ara97a].

1.1 Overall structure

Gatherers take as input sets of web-accessible documents [Tbl94] and return metadata descriptions for each recognised publication. Figure 1 illustrates how DTS descriptions, which are internally transformed into grammars, are used to represent the structure of individual documents as well as the relationships among them. In this way, the system is able to classify different kinds of documents and extract relevant information from them.
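The overall pipeline just described can be sketched as a two-stage PROLOG predicate. The predicate names `traverse/2` and `recognise/2` below are illustrative assumptions of ours, not part of the actual implementation:

```prolog
% Hypothetical top level of a gatherer: fetch a set of web documents,
% then map each recognised one to a metadata description.
gather(SeedURLs, Metadata) :-
    traverse(SeedURLs, Documents),      % web traversal (section 2)
    findall(Meta,
            ( member(Doc, Documents),
              recognise(Doc, Meta) ),   % DTS-grammar parsing (section 4)
            Metadata).
```

Using findall/3 here means that documents matching no DTS grammar are silently skipped rather than causing the whole run to fail.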
[Figure 1: Architecture of the gathering system. DTS document descriptions (layout classes, rules, schema views) are compiled into a context-free grammar used by the gatherer, which navigates the web server and outputs metadata objects (data types, meta-attributes, complex values).]

1.2 The DTS type system

Newspaper documents in a web server are regulated by a set of types and layout rules which define their structure and contents. These documents can be represented using an object-oriented data model whose underlying type system supports flexibility and optionality when representing complex document structures [Ara96]. The gathering process is also regulated by this type system, called DTS (Document Type System), which can be viewed as a formal object-oriented type system with a syntax similar to the SGML DTD language. Briefly, DTS types follow the syntax [Ara97b]:

    τ ::= rawdata | class | (τ1 | ... | τm) | τ+ | τ* | τ? | [ A1:τ1, ..., Am:τm ]

where rawdata is any basic multimedia data type (e.g. Text and Graphic); class is the name of a document class (e.g. Article, Paper, etc.) whose type is in turn a DTS type; the type constructor "|" expresses the union of DTS types; the constructor [..] expresses a (possibly nested) tuple over a set of document attributes Ai; and the suffixes +, * and ? express the different optionality degrees
of a document component, namely: at least one occurrence, zero or more occurrences, and zero or one occurrence, respectively.

2 Gathering HTML files

We assume that the publications to be analyzed are available in HTML format through an HTTP server. The details of HTTP requests are handled by the PiLLoW library [Cab96], a publicly available package that implements such low-level routines for PROLOG systems. On top of this layer, a set of services is provided for traversing web servers gracefully; in particular, the Standard for Robot Exclusion [Fah96] is fully supported. Despite these robot-like features, the gathering system does not behave blindly, as common web-indexing programs do. Instead, it can advantageously exploit any available information on the internal web server structure, which is modelled with DTS descriptions. As an example, consider a typical tree-like web site structure for a newspaper publication:

[Figure 2: Sample structure of an electronically published newspaper: a front page links to section indexes, which in turn link to articles; other documents on the server are irrelevant.]

In this case, the server root (front page) contains a set of hypertext links to section indexes, which in turn point to their corresponding articles. A set of DTS classes expressing these relationships may look like the following:

    FrontPage    := [ date: Date, sections: SectionIndexRef+ ]
    SectionIndex := [ name: SectionName, articles: ArticleRef+ ]
    Article      := (Report | Chronicle | Interview)

By using appropriate semantic specifications for the SectionIndexRef and ArticleRef tokens (see section 3.1), it is possible to express that the corresponding links point to documents of the SectionIndex and Article classes, respectively. In this way, the gatherer is instructed to traverse the web server using logical, well-known paths. As a consequence, no irrelevant documents are requested, and network traffic is thereby minimized.
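As a sketch of how such a semantic specification might look, the following hedged example assumes a DCG token rule for SectionIndexRef that records the target URL together with its expected class. The `paragraph/2` and `Name$Attributes` notation follows the convention of section 3.1, but the predicate `pending/2` is our own invention:

```prolog
:- dynamic pending/2.

% Hypothetical SectionIndexRef token: an anchor paragraph whose target
% URL is scheduled to be fetched and parsed as a SectionIndex document.
section_index_ref(ref(URL)) -->
    [paragraph([a$[href=URL]], _AnchorText)],
    { assertz(pending(URL, 'SectionIndex')) }.
```

Because only URLs recorded this way are ever requested, traversal stays confined to the logical paths declared in the DTS classes.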
Apart from this, the gatherer supports other methods of web site traversal. The most important one implements a traditional breadth-first search of the full target site and stores the fetched documents in a simple database. This is mainly useful for learning about a site's structure locally, in an off-line fashion. In fact, the implementation allows the addition of new traversal methods. For this purpose, it is only necessary to create a PROLOG module that implements a fixed set of predicates. The most important ones are summarised in Table 1. Of course, the underlying features of the gathering system are available to these predicates via a set of well-defined interfaces.

    Predicate        Description
    robot_start/0    Performs any necessary initializations.
    robot_action/1   The argument is the full information on the current HTML
                     document, represented as an association list that contains
                     not only the HTML source, but also the returned HTTP
                     headers. It is the responsibility of this predicate to
                     specify further URLs to be fetched.
    robot_finish/0   Cleans up if necessary.

    Table 1: Basic predicates for traversing web publication sites
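A minimal traversal module implementing the interface of Table 1 might look as follows. The association-list keys `url` and `links`, and the predicate `request_url/1`, are assumptions about the system's internal interfaces, not documented parts of it:

```prolog
:- module(bfs_robot, [robot_start/0, robot_action/1, robot_finish/0]).
:- dynamic visited/1.

robot_start :-                       % initialize the visited-URL table
    retractall(visited(_)).

robot_action(Doc) :-                 % Doc: association list (see Table 1)
    member(url-URL, Doc),
    assertz(visited(URL)),
    member(links-Links, Doc),        % assumed key holding outgoing links
    forall(( member(L, Links), \+ visited(L) ),
           request_url(L)).          % ask the gatherer to fetch L

robot_finish :-                      % clean up after the run
    retractall(visited(_)).
```

Keeping the visited set in the dynamic database is the simplest choice for a sketch; a production module would likely use the gatherer's own document database instead.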
3 Type system implementation

In order to use the DTS descriptions that regulate the server's newspapers, it is necessary to translate them into a format that gatherers, which are implemented in PROLOG, can easily incorporate. For this purpose, some minor syntactic additions need to be made to the DTS statements. We introduce the modified syntax by means of the following example:

    class BodyAndPhoto uses Rawdata
    BodyAndPhoto := [ body: Paper+, photo: Photograph ]

Here, the clause class declares the name of the DTS type to be defined; the clause uses specifies the PROLOG file that comprises the set of rawdata types involved in this type (details about this file are given in the next section); finally, the DTS description itself is given. The statement above is then compiled into a PROLOG file that contains a Definite Clause Grammar (DCG) version of the class definition, as well as some glue code necessary for dynamically loading the code into the final gathering system.

3.1 Rawdata specification and markup styles

The DTS implementation distinguishes two different kinds of rawdata classes: primitive types and tokens. The former are generic multimedia types (Text and Graphic are currently recognized), whereas the latter are class-specific formats which usually correspond to markup styles. Let us explain this concept in the following paragraphs.

A document can be viewed as a sequence of marked-up texts¹. Usually, the function of each text within the document is expressed by a characteristic style; for instance, a headline and an image caption will probably have different visual appearances. Since we are dealing with HTML files, we consider that the markup for each text is defined by a set of tags.
For example, the headlines at the top of a piece of news might look as follows:

    Troublesome European Summit
    EU governments disagree about the starting date for the new euro

Table 2 presents their markup-text pairs:

    Markup                        Text
    center, bold, font size=+2    "Troublesome European Summit"
    center, font size=+1          "EU governments disagree about the starting
                                   date for the new euro"

    Table 2: Example of markup-text pairs

A token is just a set of markup tags. In this case, we could define the token Headline as the set {center, bold, font size=+2} and the token SecondaryHeadline as the set {center, font size=+1}. In publication sites, only a few such combinations are used for the overall layout. These are usually specified by an internal book of style, similar to those of printed newspapers, and they constitute the distinctive graphic vocabulary of the publication server.

In order to use the layout descriptions within gatherers, a PROLOG representation must be established for them, as for DTS type descriptions. For easy interfacing with the compiled DTS classes,

¹ Provided that a strict sequential order can be obtained for every element in the document. For HTML documents, this is always possible.
Definite Clause Grammars are also used here. In this case, they must assume they are parsing a list of atoms of the following form (as described in section 3.2):

    paragraph(Tags, Text)

where Tags is a list of HTML tags with the format Name $ ListOfAttributes, and Text is the associated string of characters (see Table 2). For instance, the following grammar rule defines the Headline token introduced above:

    headline('Text'#T) --> [paragraph([center$[], b$[], font$[size='+2']], T)].

Of course, arbitrary PROLOG code may be added to these grammar rules. This means that powerful extensions may be programmed on top of this basic mechanism; for instance, this capability is used in the gatherer to implement the DTS-directed site traversal.

3.2 Obtaining styles

In order to transform the HTML source into lists of paragraph/2 terms, a special routine is used that attempts to exploit any knowledge about the characteristics of the site. This routine performs the following steps:

1. It transforms the HTML source into PROLOG terms, using facilities provided by the PiLLoW library. The tree structure of the tags in the source file is preserved by nesting terms appropriately.
2. The resulting structure is simplified using rules defined in an external file. These rules basically specify which tags are the relevant ones, and which ones should be removed together with their entire subtrees.
3. Finally, the simplified tree is flattened into a list of paragraph/2 terms.

4 Identification of document classes

Each grammar associated with a DTS type, together with its corresponding token specifications, is capable of parsing a document represented as a list of styles and extracting all the relevant information from it. Specifically, each document is recognised by trying a set of candidate grammars one by one until the document conforms to one of them.
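This try-each-grammar strategy can be sketched as follows; the predicate `classify/3` and the convention that each candidate grammar is named by an atom are our own illustrative assumptions:

```prolog
% classify/3: try candidate grammars in turn until one parses the whole
% list of paragraph/2 terms; phrase/2 backtracks out cheaply when a
% grammar does not conform to the document.
classify(Paragraphs, [Grammar|_], Value) :-
    Goal =.. [Grammar, Value],       % e.g. article(Value)
    phrase(Goal, Paragraphs), !.
classify(Paragraphs, [_|Grammars], Value) :-
    classify(Paragraphs, Grammars, Value).
```

The cut commits to the first conforming grammar, so the order of candidates within a cluster determines which class wins when several could match.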
Experience shows that non-conforming grammars tend to fail very soon, so the search procedure remains efficient. In order to limit the number of grammars attempted for each document, DTS descriptions may be clustered by putting related groups into separate directories. Each cluster shares a common style definition file (see section 3.1). For instance, for identifying the documents in the tree-like structure shown in Figure 2, it would be reasonable to separate into different clusters the classes that represent the site structure (FrontPage, SectionIndex) and those for the articles (Report, Chronicle, Interview).

The result of this process is a PROLOG term with a structure that corresponds to the original DTS class description and contains the extracted value of each attribute. The format, as returned by the compiled DTS descriptions, may be informally described as:

    ClassName # ListOfAttributes

or, for primitive types,

    ClassName # PrimitiveValue²

² This value is specific to the particular primitive class. For a Text class it is a string, and for a Graphic class it is a URL.
where each attribute has one of the following syntaxes:

    AttributeName : ClassDescription
    AttributeName : ListOfValues

For instance, given the following DTS descriptions

    Body    := Paragraph+
    Article := [ headline: Headline, body: Body ]

where Headline and Paragraph are tokens that return Text values, a term representing a conforming document could be the following:

    'Article'#[headline: 'Text'#'PrologScript Revisited',
               body: 'Body'#['Text'#'First paragraph text',
                             'Text'#'Second paragraph text']]

This format is devised to allow easy insertion of the data into an object-oriented database.

5 Conclusions

This work has described an implementation of gatherers for web repositories of digitized newspapers. Furthermore, we have shown the usefulness of formal data description methods in the retrieval and classification of structured sets of web documents. Specifically, DTS types have been used to effectively manage complex collections of documents. In this work, PROLOG has played a relevant role in the design of gatherers, by providing grammar rules for type recognition and HTTP protocol primitives for performing web traversal. Future work will focus on extending the technique presented here to other kinds of publications, such as journals, patents, and so on.

6 References

[Ara96] Aramburu, M.J. and Berlanga, R. "Object-oriented modelling of periodicals". Proceedings of the 7th Workshop on Database and Expert Systems Applications (DEXA'96), IEEE, Zurich, 1996.

[Ara97a] Aramburu, M.J. and Berlanga, R. "An approach to a digital library of newspapers". To appear in Information Processing & Management, Special Issue on Electronic News, 1997.

[Ara97b] Aramburu, M.J. and Berlanga, R. "Metadata in a Digital Library of Periodicals". Technical Report DI 01-01/97, Departamento de Informática, Universitat Jaume I, January 1997.

[Tbl94] Berners-Lee, T., Cailliau, R., Nielsen, H.F., Luotonen, A. and Secret, A. "The World-Wide Web". Communications of the ACM, 37(8):76-82, August 1994.
[Fah96] Cheong, F.-C. "Internet Agents". New Riders Publishing, Indianapolis, 1996.

[Cab96] Cabeza, D., Hermenegildo, M. and Varma, S. "The PiLLoW/CIAO library for Internet/WWW programming using computational logic systems". Proceedings of the 1st Workshop on Logic Programming Tools for INTERNET Applications, JICSLP'96, Bonn, 1996.