
AUTOMATIC ACQUISITION OF DIGITIZED NEWSPAPERS VIA INTERNET

Ismael Sanz, Rafael Berlanga, María José Aramburu and Francisco Toledo
Departament d'Informàtica, Campus Penyeta Roja, Universitat Jaume I, E-12071 Castellón, SPAIN
e-mail: {berlanga,aramburu,toledo}@inf.uji.es

Keywords: Internet, Digital Libraries, Document Recognition and Logic Programming.

ABSTRACT

Following our previous work on modelling a database of newspapers and designing a specially suited retrieval language, we are now developing an application to automatically acquire, summarize and store newspaper documents published on distinct web resources. This paper describes the current implementation of the acquisition process, which includes the recognition of document types and the abstraction of the recognised document values. The network agents in charge of this process are called gatherers, according to the terminology used in successful web retrieval systems such as Harvest. To implement gatherers we have combined a context-free grammar with web traversing techniques, which are available in most current PROLOG systems (e.g. SICStus with the PiLLoW library).

1 Introduction

The soaring availability of periodical publications on the Internet is making necessary new methods for the management of these kinds of documents [Ara96], as well as large specific digital libraries that provide sophisticated indexing and retrieval over them [Ara97a]. Following our previous work on modelling databases of newspapers and designing a specially suited retrieval language, we are now developing an application to automatically acquire, summarize and store newspaper documents published at distinct web servers. This paper describes the current implementation of the acquisition process, which includes the recognition of document types and the abstraction of the recognised document values. The network agents in charge of this process are called gatherers, according to the terminology used in successful web retrieval systems such as Harvest. To implement our gatherers we have combined a context-free grammar with web traversing techniques, which are available in most current PROLOG systems (e.g. SICStus and the PiLLoW library). This approach to the gathering problem has been possible thanks to the regular structure of newspaper documents, which can be described using the Document Type System (DTS) [Ara97a].

1.1 Overall structure

Gatherers take as input sets of web-accessible documents [Tbl94] and return metadata descriptions for each recognised publication. Figure 1 illustrates how DTS descriptions, which are internally transformed into grammars, are used to represent the structure of individual documents as well as the relationships among them. In this way, the system is able to classify different kinds of documents and extract relevant information from them.
Figure 1: Architecture of the gathering system

1.2 The DTS type system

Newspaper documents in a web server are regulated by a set of types and layout rules which define their structure and contents. These documents can be represented using an object-oriented data model whose underlying type system supports flexibility and optionality when representing complex document structures [Ara96]. The gathering process is also regulated by this type system, called DTS (Document Type System), which can be viewed as a formal object-oriented type system with a syntax similar to the SGML DTD language. Briefly, DTS types follow the syntax [Ara97b]:

    τ := rawdata | class | (τ1 | ... | τm) | τ+ | τ* | τ? | [ A1:τ1, ..., Am:τm ]

where rawdata is any basic multimedia data type (e.g. Text and Graphic); class is the name of a document class (e.g. Article, Paper, etc.) whose type is in turn a DTS type; the type constructor "|" expresses the union of DTS types; the constructor [ .. ] expresses a (possibly nested) tuple by using a set of document attributes Ai; and finally, the suffixes +, * and ? express the different optionality degrees
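The DTS constructors above can be mirrored by a small set of type-building values. The following is an illustrative sketch in Python, not the authors' PROLOG implementation; all class names here (RawData, ClassRef, Union, Repeat, TupleType) are hypothetical:

```python
from dataclasses import dataclass

# Hypothetical Python mirror of the DTS constructors: rawdata types,
# class references, unions, repetition suffixes (+, *, ?) and tuples.

@dataclass(frozen=True)
class RawData:            # a basic multimedia type, e.g. Text or Graphic
    name: str

@dataclass(frozen=True)
class ClassRef:           # a reference to a named document class
    name: str

@dataclass(frozen=True)
class Union:              # (t1 | ... | tm)
    alternatives: tuple

@dataclass(frozen=True)
class Repeat:             # suffixes: '+' = one or more, '*' = zero or more, '?' = zero or one
    inner: object
    suffix: str

@dataclass(frozen=True)
class TupleType:          # [ A1: t1, ..., Am: tm ]
    attributes: tuple     # tuple of (attribute name, DTS type) pairs

# Article := (Report | Chronicle | Interview), as used later in the paper
article = Union((ClassRef("Report"), ClassRef("Chronicle"), ClassRef("Interview")))

# FrontPage := [ date: Date, sections: SectionIndexRef+ ]
front_page = TupleType((("date", ClassRef("Date")),
                        ("sections", Repeat(ClassRef("SectionIndexRef"), "+"))))
```

Such a representation makes the later compilation of each class into a grammar a straightforward structural traversal.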

of a document component, namely: at least one occurrence, zero or more occurrences, and zero or one occurrence, respectively.

2 Gathering HTML files

We assume that the publications to be analyzed are available in HTML format through an HTTP server. The details of HTTP requests are handled using the PiLLoW library [Cab96], a publicly available package that implements such low-level routines for PROLOG systems. On top of this layer, a set of services is provided for traversing web servers gracefully; in particular, the Standard for Robot Exclusion [Fah96] is fully supported. Despite these robot-like features, the gathering system does not behave blindly, as common web-indexing programs do. Instead, it is able to exploit any available information on the internal web server structure, which is modelled with DTS descriptions. As an example, let us consider a typical tree-like web site structure for a newspaper publication:

Figure 2: Sample structure of an electronically published newspaper

In this case, the server root (front page) contains a set of hypertext links to section indexes, which in turn point to their corresponding articles. A set of DTS classes expressing these relationships may look like the following:

    FrontPage := [ date: Date, sections: SectionIndexRef+ ]
    SectionIndex := [ name: SectionName, articles: ArticleRef+ ]
    Article := (Report | Chronicle | Interview)

By using appropriate semantic specifications for the SectionIndexRef and ArticleRef tokens (see section 0), it is possible to express that the corresponding links point to documents of the SectionIndex and Article classes, respectively. In this way, the gatherer is instructed to traverse the web server using logical, well-known paths. As a consequence, no irrelevant documents are requested, and the network traffic is thereby minimized.
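The effect of such typed link declarations can be sketched as a traversal that only follows links whose class appears in the schema. This is a minimal illustration in Python (the real gatherer is written in PROLOG; the SITE map, link classes and function names below are hypothetical):

```python
from collections import deque

# Fake web site: url -> (DTS class of the document, list of (link class, target url)).
# "AdRef" is deliberately absent from the schema, standing in for an irrelevant link.
SITE = {
    "/front":  ("FrontPage",    [("SectionIndexRef", "/sec1"), ("AdRef", "/banner")]),
    "/sec1":   ("SectionIndex", [("ArticleRef", "/art1")]),
    "/art1":   ("Article",      []),
    "/banner": ("Ad",           []),
}

# Link classes declared in the DTS descriptions; only these are traversed.
FOLLOWED = {"SectionIndexRef", "ArticleRef"}

def gather(root):
    """Traverse the site breadth-first, requesting only schema-declared links."""
    seen, queue, fetched = set(), deque([root]), []
    while queue:
        url = queue.popleft()
        if url in seen:
            continue
        seen.add(url)
        cls, links = SITE[url]
        fetched.append((url, cls))
        for link_cls, target in links:
            if link_cls in FOLLOWED:      # irrelevant documents are never requested
                queue.append(target)
    return fetched
```

Here the banner page is never fetched, which is exactly the traffic-saving behaviour the DTS-directed traversal aims for.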
Apart from this, the gatherer supports other methods for web site traversal. The most important one implements a traditional breadth-first search of the full target site, and stores the fetched documents in a simple database. This is mainly useful for learning about a site's structure locally, in an off-line fashion. In fact, the implementation allows the addition of new traversal methods. For this purpose, it is only necessary to create a PROLOG module that implements a predefined set of predicates. The most important ones are summarised in Table 1. Of course, the underlying features of the gathering system are available to these predicates via a set of well-defined interfaces.

    Predicate        Description
    robot_start/0    Performs any necessary initializations.
    robot_action/1   The argument is the full information on the current HTML document,
                     represented as an association list that contains not only the HTML
                     source, but also the returned HTTP headers. It is the responsibility
                     of this predicate to specify further URLs to be fetched.
    robot_finish/0   Cleans up if necessary.

Table 1: Basic predicates for traversing web publication sites
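The three predicates of Table 1 amount to a callback interface around a fetch loop. A hedged Python analogue (hypothetical names; the paper's actual interface is a PROLOG module) might look like this:

```python
# Hypothetical analogue of the robot_start/0, robot_action/1, robot_finish/0
# interface: a traversal method is an object with three callbacks, and the
# driver feeds each fetched document to robot_action, which returns the
# further URLs to fetch.

class BreadthFirstRobot:
    def robot_start(self):
        # Perform any necessary initializations.
        self.visited = []

    def robot_action(self, document):
        # `document` bundles the HTML source, URL and HTTP headers,
        # mirroring the association list of the PROLOG version.
        self.visited.append(document["url"])
        return document.get("links", [])      # URLs to be fetched next

    def robot_finish(self):
        # Clean up if necessary.
        pass

def run_robot(robot, fetch, start_url):
    """Drive a robot over a site, deduplicating URLs along the way."""
    robot.robot_start()
    pending, seen = [start_url], set()
    while pending:
        url = pending.pop(0)
        if url in seen:
            continue
        seen.add(url)
        pending.extend(robot.robot_action(fetch(url)))
    robot.robot_finish()
    return robot
```

Plugging in a different robot object corresponds to dropping a new PROLOG module into the gatherer.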

3 Type system implementation

In order to use the DTS descriptions that regulate the server's newspapers, it is necessary to translate them into a format that gatherers, which are implemented in PROLOG, can easily incorporate. In this respect, some minor syntactic additions need to be made to the DTS statements. We introduce the modified syntax by means of an example, which looks as follows:

    class BodyAndPhoto uses Rawdata
    BodyAndPhoto := [ body: Paper+, photo: Photograph ]

Here, the clause class declares the name of the DTS type that is to be defined; the clause uses specifies the PROLOG file that comprises the set of rawdata types involved in this type (details about this file are given in the next section); and finally, the DTS description itself is specified. The statement above is then compiled into a PROLOG file that contains a Definite Clause Grammar (DCG) version of the class definition, as well as some glue code necessary for the dynamic loading of the code into the final gathering system.

3.1 Rawdata specification and markup styles

The DTS implementation distinguishes two different kinds of rawdata classes: primitive types and tokens. The former are generic multimedia types (Text and Graphic are currently recognized), whereas the latter are class-specific formats which usually correspond to markup styles. Let us explain this concept in the following paragraphs.

A document can be viewed as a sequence of marked-up texts¹. Usually, the function of each text within the document is expressed by a characteristic style; for instance, a headline and an image caption will probably have different visual appearances. Since we are dealing with HTML files, we consider that the markup for each text is defined by a set of tags.
For example, the headlines at the top of a piece of news can look as follows:

    Troublesome European Summit
    UE governments disagree about the starting date for the new euro

Table 2 presents their markup-text pairs:

    Markup                        Text
    center, bold, font size=+2    "Troublesome European Summit"
    center, font size=+1          "UE governments disagree about the starting date for the new euro"

Table 2: Example of markup-text pairs

A token is just a set of markup tags. In this case, we could define the token Headline as the set {center, bold, font size=+2} and the token Secondary headline as the set {center, font size=+1}. In publication sites, only a few combinations are used for the overall layout. These are usually specified by an internal book of style, similar to those of printed newspapers, and constitute the distinctive graphic vocabulary of the publication server. In order to use the layout descriptions within gatherers, a PROLOG representation must be established for them, as for the DTS type descriptions. For easy interfacing with the compiled DTS classes,

¹ Provided that a strict sequential order can be obtained for every element in the document. For HTML documents, this is always possible.
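Since a token is simply a set of markup tags, recognising one reduces to a set-equality check on a paragraph's tags. A minimal sketch, in Python rather than the paper's PROLOG, using the token names and tag spellings from Table 2 (the function name `classify` is hypothetical):

```python
# Tokens are sets of markup tags (cf. Table 2); a paragraph is assigned
# the token whose tag set matches its markup exactly.

TOKENS = {
    "Headline":          frozenset({"center", "bold", "font size=+2"}),
    "SecondaryHeadline": frozenset({"center", "font size=+1"}),
}

def classify(tags):
    """Return the name of the token whose tag set equals `tags`, or None."""
    tagset = frozenset(tags)
    for name, token in TOKENS.items():
        if token == tagset:
            return name
    return None
```

Because a publication's book of style uses only a few tag combinations, this table of tokens stays small for any given server.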

Definite Clause Grammars are also used here. In this case, they are assumed to be parsing a list of atoms of the following form (as described later in section 0):

    paragraph(Tags, Text)

where Tags is a list of HTML tags with the format Name $ List-of-attributes, and Text is the associated string of characters (see Table 2). For instance, the following grammar rule defines the Headline token introduced above:

    headline('Text'#T) --> [paragraph([center$[], b$[], font$[size='+2']], T)].

Of course, arbitrary PROLOG code may be added to these grammar rules. This means that powerful extensions may be programmed by building on this basic mechanism; for instance, this capability is used in the gatherer for the implementation of the DTS-directed site traversal.

3.2 Obtaining styles

In order to transform the HTML source into lists of paragraph/2 terms, a special routine is used that attempts to exploit any knowledge about the characteristics of the site. This routine performs the following steps:

1. It transforms the HTML source into PROLOG terms, using facilities provided by the PiLLoW library. The tree structure of the tags in the source file is preserved by nesting terms appropriately.
2. The resulting structure is simplified using rules defined in an external file. These rules basically specify which tags are the relevant ones, and which ones should be removed together with their entire subtrees.
3. Finally, the simplified tree is flattened into a list of paragraph/2 terms.

4 Identification of document classes

Each grammar associated with a DTS type, together with its corresponding token specifications, is capable of parsing a document represented as a list of styles and extracting all the relevant information from it. Specifically, each document is recognised by using a set of candidate grammars, which are checked one by one until the document conforms to one of them.
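The try-each-candidate-grammar scheme can be sketched compactly. In this hedged Python illustration (the real grammars are compiled DCGs; the parsers, tag sets and the `identify` function below are hypothetical stand-ins), each "grammar" is a function that either returns an extracted value or fails by raising:

```python
# A document is a list of (tag set, text) pairs, i.e. the paragraph/2 terms.
# Each candidate "grammar" parses it or raises ValueError; classification
# tries the candidates one by one, and non-conforming ones fail early.

def parse_front_page(paragraphs):
    if not paragraphs or paragraphs[0][0] != {"center", "bold"}:
        raise ValueError("does not conform to FrontPage")
    return ("FrontPage", paragraphs[0][1])

def parse_article(paragraphs):
    if not paragraphs or "font size=+2" not in paragraphs[0][0]:
        raise ValueError("does not conform to Article")
    return ("Article", paragraphs[0][1])

CANDIDATES = [parse_front_page, parse_article]

def identify(paragraphs):
    """Return the first candidate grammar's result, or None if none conforms."""
    for parser in CANDIDATES:
        try:
            return parser(paragraphs)
        except ValueError:
            continue
    return None
```

Clustering DTS descriptions into directories, as described next, corresponds to keeping the CANDIDATES list short for each kind of document.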
Experience shows that non-conforming grammars tend to fail very early, so the search procedure remains efficient. In order to limit the number of grammars that are attempted for each document, DTS descriptions may be clustered by putting related groups into separate directories. Each cluster shares a common style definition file (see section 0). For instance, for identifying the documents in the tree-like structure shown in Figure 2, it would be reasonable to separate into different clusters the classes that represent site structures (FrontPage, SectionIndex) and those for the articles (Report, Chronicle, Interview). The result of this process is a PROLOG term with a structure that corresponds to the original DTS class description, and contains the extracted value of each attribute. The format, as returned by the compiled DTS descriptions, may be informally described as:

    Class name # List of attributes

or, for primitive types,

    Class name # Primitive value²

² This value is specific to the particular primitive class. For a Text class it is a string, and for a Graphic class it is a URL.

where each attribute has one of the following syntaxes:

    Attribute name : Class description
    Attribute name : List of values

For instance, given the following DTS descriptions

    Body := Paragraph+
    Article := [ headline: Headline, body: Body ]

where Headline and Paragraph are tokens that return Text values, a term representing a conforming document could be the following one:

    Article#[headline: Text#'PrologScript Revisited',
             body: Body#[Text#'First paragraph text', Text#'Second paragraph text']]

This format is devised to allow easy insertion of the data into an object-oriented database.

5 Conclusions

This work has described an implementation of gatherers for web repositories of digitized newspapers. Furthermore, we have shown the usefulness of formal data description methods in the retrieval and classification of structured sets of web documents. Specifically, DTS types have been used to effectively manage complex collections of documents. In this work, PROLOG has played a relevant role in the design of gatherers, by providing grammar rules for type recognition and HTTP protocol primitives for performing web traversal. Future work focuses on extending the technique presented here to other kinds of publications, such as journals, patents and so on.

6 References

[Ara96] Aramburu, M.J. and Berlanga, R. "Object-oriented modelling of periodicals". Proceedings of the 7th Workshop on Databases and Expert System Applications (DEXA'96), IEEE, Zurich, 1996.
[Ara97a] Aramburu, M.J. and Berlanga, R. "An approach to a digital library of newspapers". To appear in Information Processing & Management, Special Issue on Electronic News, 1997.
[Ara97b] Aramburu, M.J. and Berlanga, R. "Metadata in a Digital Library of Periodicals". Technical Report DI 01-01/97, Departamento de Informática, Universitat Jaume I, January 1997.
[Tbl94] Berners-Lee, T., Cailliau, R., Nielsen, H.F., Luotonen, A. and Secret, A. "The World-Wide Web". Communications of the ACM, 37(8):76-82, August 1994.
[Fah96] Cheong, F.-C. "Internet Agents". New Riders Publishing, Indianapolis, 1996.
[Cab96] Cabeza, D., Hermenegildo, M. and Varma, S. "The PiLLoW/CIAO library for Internet/WWW programming using computational logic systems". Proceedings of the 1st Workshop on Logic Programming Tools for INTERNET Applications, IJCSLP'96, Bonn, 1996.