AUTOMATIC ACQUISITION OF DIGITIZED NEWSPAPERS VIA INTERNET

Ismael Sanz, Rafael Berlanga, María José Aramburu and Francisco Toledo
Departament d'informàtica, Campus Penyeta Roja, Universitat Jaume I, E Castellón, SPAIN

Keywords: Internet, Digital Libraries, Document Recognition and Logic Programming.

ABSTRACT

After our previous work on modelling a database of newspapers and designing a specially suited retrieval language, we are now developing an application to automatically acquire, summarize and store newspaper documents published in distinct web resources. This paper describes the current implementation of the acquisition process, which includes the recognition of document types and the abstraction of the recognised document values. The network agents in charge of this process are called gatherers, following the terminology used in successful web retrieval systems such as Harvest. To implement gatherers we have combined a context-free grammar with web traversal techniques, both of which are available in most current PROLOG systems (e.g. SICStus with the PiLLoW library).

1 Introduction

The soaring availability of periodical publications on the Internet calls for new methods for managing this kind of document [Ara96], as well as large, specialised digital libraries that provide sophisticated indexing and retrieval over them [Ara97a]. After our previous work on modelling databases of newspapers and designing a specially suited retrieval language, we are now developing an application to automatically acquire, summarize and store newspaper documents published on distinct web servers. This paper describes the current implementation of the acquisition process, which includes the recognition of document types and the abstraction of the recognised document values. The network agents in charge of this process are called gatherers, following the terminology used in successful web retrieval systems such as Harvest. To implement our gatherers we have combined a context-free grammar with web traversal techniques, both of which are available in most current PROLOG systems (e.g. SICStus with the PiLLoW library). This approach to the gathering problem is possible thanks to the regular structure of newspaper documents, which can be described using the Document Type System (DTS) [Ara97a].

1.1 Overall structure

Gatherers take as input sets of web-accessible documents [Tbl94] and return metadata descriptions for each recognised publication. Figure 1 illustrates how DTS descriptions, which are internally transformed into grammars, are used to represent the structure of individual documents as well as the relationships among them. In this way, the system is able to classify different kinds of documents and extract relevant information from them.
Figure 1: Architecture of the gathering system
[Diagram not reproduced. Labels: document descriptions (DTS classes, layout rules), schema view, gatherer, context-free grammar, web server files, output metadata object.]

1.2 The DTS type system

Newspaper documents on a web server are governed by a set of types and layout rules that define their structure and contents. These documents can be represented using an object-oriented data model whose underlying type system supports flexibility and optionality when representing complex document structures [Ara96]. The gathering process is governed by the same type system, called DTS (Document Type System), which can be viewed as a formal object-oriented type system with a syntax similar to the SGML DTD language. Briefly, DTS types follow the syntax [Ara97b]:

τ := rawdata | class | (τ1 | ... | τm) | τ+ | τ* | τ? | [ A1: τ1, ..., Am: τm ]

where:

- rawdata is any basic multimedia data type (e.g. Text and Graphic);
- class is the name of a document class (e.g. Article, Paper, etc.) whose type is in turn a DTS type;
- the type constructor "|" expresses the union of DTS types;
- the constructor [ ... ] expresses a (possibly nested) tuple over a set of document attributes Ai;
- the suffixes +, * and ? express the optionality of a document component, namely: at least one occurrence, zero or more occurrences, and at most one occurrence, respectively.

2 Gathering HTML files

We assume that the publications to be analyzed are available in HTML format through an HTTP server. The details of HTTP requests are handled by the PiLLoW library [Cab96], a publicly available package that implements such low-level routines for PROLOG systems. On top of this layer, a set of services is provided for traversing web servers gracefully; in particular, the Standard for Robot Exclusion [Fah96] is fully supported. Despite these robot-like features, the gathering system does not behave blindly, as common web-indexing programs do. Instead, it exploits any available information on the internal structure of the web server, which is modelled with DTS descriptions. As an example, let us consider a typical tree-like web site structure for a newspaper publication:

Figure 2: Sample structure of an electronically published newspaper
[Diagram not reproduced. Labels: front page; index of section #1; index of section #2; irrelevant documents; articles #1 to #4 and further articles.]

In this case, the server root (front page) contains a set of hypertext links to section indexes, which in turn point to their corresponding articles. A set of DTS classes expressing these relationships may be similar to the following:

FrontPage := [ date: Date, sections: SectionIndexRef+ ]
SectionIndex := [ name: SectionName, articles: ArticleRef+ ]
Article := (Report | Chronicle | Interview)

By using appropriate semantic specifications for the SectionIndexRef and ArticleRef tokens (see section 3.1), it is possible to express that the corresponding links point to documents of the SectionIndex and Article classes respectively. In this way, the gatherer is instructed to traverse the web server along logical, well-known paths. As a consequence, no irrelevant documents are requested, and network traffic is thereby minimized.
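The idea of DTS-directed traversal can be illustrated with a minimal sketch. It is written in Python rather than the PROLOG of the actual gatherer, and everything in it is an assumption made for illustration: the `LINK_TARGETS` table stands in for the semantic specifications of the SectionIndexRef and ArticleRef tokens, and the `SITE` dictionary stands in for a web server whose links have already been labelled with the token that matched them.

```python
from collections import deque

# Hypothetical schema table: for each document class, the class that each
# link token points to (mirrors FrontPage -> SectionIndex -> Article).
LINK_TARGETS = {
    "FrontPage": {"SectionIndexRef": "SectionIndex"},
    "SectionIndex": {"ArticleRef": "Article"},
    "Article": {},
}

# Toy stand-in for a web server: URL -> list of (token, target URL) links.
# Links labelled "Other" match no token known to the schema.
SITE = {
    "/": [("SectionIndexRef", "/sports"), ("SectionIndexRef", "/world"),
          ("Other", "/ads")],
    "/sports": [("ArticleRef", "/sports/1"), ("ArticleRef", "/sports/2")],
    "/world": [("ArticleRef", "/world/1"), ("Other", "/subscribe")],
    "/sports/1": [], "/sports/2": [], "/world/1": [],
    "/ads": [], "/subscribe": [],
}


def dts_directed_traversal(root, root_class="FrontPage"):
    """Return (url, class) pairs reached by following only schema-typed links."""
    visited = []
    seen = {root}
    queue = deque([(root, root_class)])
    while queue:
        url, doc_class = queue.popleft()
        visited.append((url, doc_class))
        for token, target in SITE[url]:
            # A link is followed only if the schema says where it leads.
            target_class = LINK_TARGETS[doc_class].get(token)
            if target_class is not None and target not in seen:
                seen.add(target)
                queue.append((target, target_class))
    return visited
```

In this toy site, `dts_directed_traversal("/")` never requests `/ads` or `/subscribe`, because no class in the schema declares a link token that leads to them.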
Apart from this, the gatherer supports other methods of web site traversal. The most important one implements a traditional breadth-first search of the full target site and stores the fetched documents in a simple database. This is mainly useful for learning about a site structure locally, in an off-line fashion.

The implementation also allows new traversal methods to be added. For this purpose, it is only necessary to create a PROLOG module that implements a fixed set of predicates. The most important ones are summarised in Table 1. The underlying features of the gathering system are available to these predicates via a set of well-defined interfaces.

Predicate         Description
robot_start/0     Performs any necessary initializations.
robot_action/1    The argument is the full information on the current HTML document, represented as an association list that contains not only the HTML source, but also the returned HTTP headers. It is the responsibility of this predicate to specify further URLs to be fetched.
robot_finish/0    Cleans up if necessary.

Table 1: Basic predicates for traversing web publication sites
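The division of labour between the gathering system and a traversal module can be sketched as follows. This is a Python analogue, not the PROLOG interface itself: the three methods borrow the predicate names from Table 1, while the driver `run_robot`, the in-memory `PAGES` site, the `fake_fetch` function and the regular expression used to find links are all hypothetical. The sketch also omits the Standard for Robot Exclusion handling that the real system provides.

```python
import re
from collections import deque


class BreadthFirstArchiver:
    """Traversal module that stores every fetched page and follows every link."""

    def robot_start(self):
        # Initialization: open the (here, in-memory) document database.
        self.database = {}

    def robot_action(self, document):
        # `document` carries the URL, the HTTP headers and the HTML source.
        self.database[document["url"]] = document["html"]
        # The module decides which URLs should be fetched next.
        return re.findall(r'href="([^"]+)"', document["html"])

    def robot_finish(self):
        # Clean-up: hand the collected documents back to the caller.
        return self.database


def run_robot(module, fetch, root):
    """Hypothetical driver: fetches pages breadth-first and calls the module."""
    module.robot_start()
    seen, queue = {root}, deque([root])
    while queue:
        document = fetch(queue.popleft())
        if document is None:          # unreachable page: skip it
            continue
        for link in module.robot_action(document):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return module.robot_finish()


# Toy site used instead of real HTTP requests.
PAGES = {
    "/": '<a href="/a">A</a> <a href="/b">B</a>',
    "/a": '<a href="/a/1">1</a> <a href="/">home</a>',
    "/b": '<a href="/missing">x</a>',
    "/a/1": "",
}


def fake_fetch(url):
    html = PAGES.get(url)
    if html is None:
        return None
    return {"url": url, "headers": {"Content-Type": "text/html"}, "html": html}
```

Calling `run_robot(BreadthFirstArchiver(), fake_fetch, "/")` returns the stored pages keyed by URL. A different traversal strategy would only need to change what `robot_action` returns.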

3 Type system implementation

In order to use the DTS descriptions that govern the newspapers on a server, it is necessary to translate them into a format that gatherers, which are implemented in PROLOG, can easily incorporate. To this end, some minor syntactic additions need to be made to the DTS statements. We introduce the modified syntax by means of the following example:

class BodyAndPhoto
uses Rawdata
BodyAndPhoto := [ body: Paper+, photo: Photograph ]

Here, the clause class declares the name of the DTS type to be defined; the clause uses specifies the PROLOG file that contains the set of rawdata types used by this type (details about this file are given in the next section); finally, the DTS description itself is given. The statement above is then compiled into a PROLOG file that contains a Definite Clause Grammar (DCG) version of the class definition, as well as some glue code necessary for dynamically loading the code into the final gathering system.

3.1 Rawdata specification and markup styles

The DTS implementation distinguishes two different kinds of rawdata classes: primitive types and tokens. The former are generic multimedia types (Text and Graphic are currently recognized), whereas the latter are class-specific formats which usually correspond to markup styles. The following paragraphs explain this concept.

A document can be viewed as a sequence of marked-up texts¹. Usually, the function of each text within the document is expressed by a characteristic style; for instance, a headline and an image caption will probably have different visual appearances. Since we are dealing with HTML files, we consider that the markup for each text is defined by a set of tags.
For example, some headlines at the top of a piece of news can look as follows:

Troublesome European Summit
EU governments disagree about the starting date for the new euro

Table 2 presents their markup-text pairs:

Markup                          Text
center, bold, font size=+2      "Troublesome European Summit"
center, font size=+1            "EU governments disagree about the starting date for the new euro"

Table 2: Example of markup-text pairs.

A token is just a set of markup tags. In this case, we could define the token Headline as the set {center, bold, font size = +2} and the token Secondary headline as the set {center, font size = +1}. In publication sites, only a few combinations are used for the overall layout. These are usually specified by an internal book of style, similar to those of printed newspapers, and constitute the distinctive graphic vocabulary of the publication server.

In order to use the layout descriptions within gatherers, a PROLOG representation must be established for them, as for DTS type descriptions. For easy interfacing with the compiled DTS classes, Definite Clause Grammars are also used here. In this case, they must assume that they are parsing a list of terms of the following form (as described later in section 3.2):

paragraph(Tags, Text)

where Tags is a list of HTML tags with the format Name $ List-of-attributes, and Text is the associated string of characters (see Table 2). For instance, the following grammar rule defines the Headline token introduced above:

'Headline'('Text'#T) --> [paragraph([center$[], b$[], font$[size='+2']], T)].

Arbitrary PROLOG code may be added to these grammar rules. This means that powerful extensions may be programmed by building on this basic mechanism; for instance, this capability is used in the gatherer to implement the DTS-directed site traversal.

¹ Provided that a strict sequential order can be obtained for every element in the document. For HTML documents, this is always possible.

3.2 Obtaining styles

In order to transform the HTML source into lists of paragraph/2 terms, a special routine is used that attempts to exploit any knowledge about the characteristics of the site. This routine performs the following steps:

1. It transforms the HTML source into PROLOG terms, using facilities provided by the PiLLoW library. The tree structure of the tags in the source file is preserved by nesting terms appropriately.
2. The resulting structure is simplified using rules defined in an external file. These rules basically specify which tags are relevant, and which ones should be removed together with their entire subtree.
3. Finally, the simplified tree is flattened into a list of paragraph/2 terms.

4 Identification of document classes

Each grammar associated with a DTS type, together with its corresponding token specifications, is capable of parsing a document represented as a list of styles and extracting all the relevant information from it. Specifically, each document is recognised by using a set of candidate grammars, which are checked one by one until the document conforms to one of them.
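The three steps of section 3.2 and the token matching of section 3.1 can be approximated in a few lines. The sketch below is a Python stand-in built on the standard `html.parser` module, not the PiLLoW-based PROLOG routine. The `RELEVANT` and `DROP_SUBTREE` sets play the role of the external simplification rules and are invented for the example, as are the names `ParagraphFlattener`, `flatten`, `TOKENS` and `classify`. Each output pair `(tags, text)` corresponds to one `paragraph(Tags, Text)` term.

```python
from html.parser import HTMLParser

RELEVANT = {"center", "b", "font"}    # tags that carry style (assumed rule set)
DROP_SUBTREE = {"script", "table"}    # tags removed with their whole subtree (assumed)


class ParagraphFlattener(HTMLParser):
    """Flattens HTML into a list of (tags, text) pairs in document order."""

    def __init__(self):
        super().__init__()
        self.open_tags = []    # currently open relevant tags: (name, attributes)
        self.dropping = 0      # nesting depth inside a dropped subtree
        self.paragraphs = []   # resulting list of (tags, text) pairs

    def handle_starttag(self, tag, attrs):
        if tag in DROP_SUBTREE:
            self.dropping += 1
        elif tag in RELEVANT and not self.dropping:
            self.open_tags.append((tag, tuple(attrs)))

    def handle_endtag(self, tag):
        if tag in DROP_SUBTREE:
            self.dropping = max(0, self.dropping - 1)
        elif tag in RELEVANT and not self.dropping:
            # Close the innermost open tag with this name, if any.
            for i in range(len(self.open_tags) - 1, -1, -1):
                if self.open_tags[i][0] == tag:
                    del self.open_tags[i]
                    break

    def handle_data(self, data):
        text = " ".join(data.split())
        if text and not self.dropping:
            self.paragraphs.append((tuple(self.open_tags), text))


def flatten(html):
    parser = ParagraphFlattener()
    parser.feed(html)
    parser.close()
    return parser.paragraphs


# Tokens as sets of markup tags, following Table 2.
TOKENS = {
    "Headline": frozenset({("center", ()), ("b", ()), ("font", (("size", "+2"),))}),
    "SecondaryHeadline": frozenset({("center", ()), ("font", (("size", "+1"),))}),
}


def classify(paragraph):
    """Return the name of the token whose markup set equals the paragraph's tags."""
    tags, _text = paragraph
    for name, markup in TOKENS.items():
        if frozenset(tags) == markup:
            return name
    return None
```

Comparing tag sets rather than tag lists follows the statement that a token is a set of markup tags. The DCG rule in the text matches an ordered list instead, so a faithful port would have to decide which of the two behaviours is intended.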
Experience shows that non-conforming grammars tend to fail very early, so the search procedure remains efficient. In order to limit the number of grammars that are attempted for each document, DTS descriptions may be clustered by putting related groups into separate directories. Each cluster shares a common style definition file (see section 3.1). For instance, for identifying the documents in the tree-like structure shown in Figure 2, it would be reasonable to separate into different clusters the classes that represent site structures (FrontPage, SectionIndex) and the ones for the articles (Report, Chronicle, Interview).

The result of this process is a PROLOG term with a structure that corresponds to the original DTS class description and contains the extracted value of each attribute. The format, as returned by the compiled DTS descriptions, may be informally described as:

Class name # List of attributes

or, for primitive types,

Class name # Primitive value²

where each attribute has the following syntax:

Attribute name : Class description
Attribute name : List of values

² This value is specific to the particular primitive class. For a Text class it is a string, and for a Graphic class it is a URL.

For instance, given the following DTS descriptions

Body := Paragraph+
Article := [ headline: Headline, body: Body ]

where Headline and Paragraph are tokens that return Text values, a term representing a conforming document could be the following:

Article#[headline: Text#"PrologScript Revisited", body: Body#[Text#"First paragraph text", Text#"Second paragraph text"]]

This format is devised to allow easy insertion of data into an object-oriented database.

5 Conclusions

This work has described an implementation of gatherers for web repositories of digitized newspapers. Furthermore, we have shown the usefulness of formal data description methods in the retrieval and classification of structured sets of web documents. Specifically, DTS types have been used to effectively manage complex collections of documents. In this work, PROLOG has played a relevant role in the design of gatherers, by providing grammar rules for type recognition and HTTP protocol primitives for performing web traversal. Future work focuses on extending the technique presented here to other kinds of publications, such as journals and patents.

6 References

[Ara96] Aramburu, M.J. and Berlanga, R. "Object-oriented modelling of periodicals". Proceedings of the 7th Workshop on Databases and Expert System Applications (DEXA'96), Ed. IEEE, Zurich.
[Ara97a] Aramburu, M.J. and Berlanga, R. "An approach to a digital library of newspapers". To appear in Information Processing & Management, Special Issue on Electronic News, 1997.
[Ara97b] Aramburu, M.J. and Berlanga, R. "Metadata in a Digital Library of Periodicals". Informe Técnico DI 01-01/97, Departamento de Informática, Universitat Jaume I, January.
[Tbl94] Berners-Lee, T., Cailliau, R., Nielsen, H.F., Luotonen, A. and Secret, A. "The world wide web". Communications of the ACM, 37(8):76-82, August.
[Fah96] Fah-Chun Cheong. "Internet Agents". New Riders Publishing, Indianapolis, 1996.
[Cab96] Cabeza, D., Hermenegildo, M. and Varma, S. "The PiLLoW/CIAO library for Internet/WWW programming using computational logic systems". Proceedings of the 1st Workshop on Logic Programming Tools for INTERNET Applications, JICSLP'96, Bonn, 1996.

Adaptable and Adaptive Web Information Systems. Lecture 1: Introduction

Adaptable and Adaptive Web Information Systems. Lecture 1: Introduction Adaptable and Adaptive Web Information Systems School of Computer Science and Information Systems Birkbeck College University of London Lecture 1: Introduction George Magoulas gmagoulas@dcs.bbk.ac.uk October

More information

Semantic Web Lecture Part 1. Prof. Do van Thanh

Semantic Web Lecture Part 1. Prof. Do van Thanh Semantic Web Lecture Part 1 Prof. Do van Thanh Overview of the lecture Part 1 Why Semantic Web? Part 2 Semantic Web components: XML - XML Schema Part 3 - Semantic Web components: RDF RDF Schema Part 4

More information

Research and implementation of search engine based on Lucene Wan Pu, Wang Lisha

Research and implementation of search engine based on Lucene Wan Pu, Wang Lisha 2nd International Conference on Advances in Mechanical Engineering and Industrial Informatics (AMEII 2016) Research and implementation of search engine based on Lucene Wan Pu, Wang Lisha Physics Institute,

More information

- What we actually mean by documents (the FRBR hierarchy) - What are the components of documents

- What we actually mean by documents (the FRBR hierarchy) - What are the components of documents Purpose of these slides Introduction to XML for parliamentary documents (and all other kinds of documents, actually) Prof. Fabio Vitali University of Bologna Part 1 Introduce the principal aspects of electronic

More information

An Approach To Web Content Mining

An Approach To Web Content Mining An Approach To Web Content Mining Nita Patil, Chhaya Das, Shreya Patanakar, Kshitija Pol Department of Computer Engg. Datta Meghe College of Engineering, Airoli, Navi Mumbai Abstract-With the research

More information

Markup Languages SGML, HTML, XML, XHTML. CS 431 February 13, 2006 Carl Lagoze Cornell University

Markup Languages SGML, HTML, XML, XHTML. CS 431 February 13, 2006 Carl Lagoze Cornell University Markup Languages SGML, HTML, XML, XHTML CS 431 February 13, 2006 Carl Lagoze Cornell University Problem Richness of text Elements: letters, numbers, symbols, case Structure: words, sentences, paragraphs,

More information

Seaglex Software, Inc. Web Harvesting White Paper. Overview

Seaglex Software, Inc. Web Harvesting White Paper. Overview Seaglex Software, Inc. Web Harvesting White Paper Overview Seaglex Software is a leading developer of products for systematic location, identification, classification, and extraction of structured and

More information

Yonghyun Whang a, Changwoo Jung b, Jihong Kim a, and Sungkwon Chung c

Yonghyun Whang a, Changwoo Jung b, Jihong Kim a, and Sungkwon Chung c WebAlchemist: A Web Transcoding System for Mobile Web Access in Handheld Devices Yonghyun Whang a, Changwoo Jung b, Jihong Kim a, and Sungkwon Chung c a School of Computer Science & Engineering, Seoul

More information

SEARCH SEMI-STRUCTURED DATA ON WEB

SEARCH SEMI-STRUCTURED DATA ON WEB SEARCH SEMI-STRUCTURED DATA ON WEB Sabin-Corneliu Buraga 1, Teodora Rusu 2 1 Faculty of Computer Science, Al.I.Cuza University of Iaşi, Romania Berthelot Str., 16 6600 Iaşi, Romania, tel: +40 (32 201529,

More information

Semantic Web and Electronic Information Resources Danica Radovanović

Semantic Web and Electronic Information Resources Danica Radovanović D.Radovanovic: Semantic Web and Electronic Information Resources 1, Infotheca journal 4(2003)2, p. 157-163 UDC 004.738.5:004.451.53:004.22 Semantic Web and Electronic Information Resources Danica Radovanović

More information

Introduction to XML. XML: basic elements

Introduction to XML. XML: basic elements Introduction to XML XML: basic elements XML Trying to wrap your brain around XML is sort of like trying to put an octopus in a bottle. Every time you think you have it under control, a new tentacle shows

More information

A network is a group of two or more computers that are connected to share resources and information.

A network is a group of two or more computers that are connected to share resources and information. Chapter 1 Introduction to HTML, XHTML, and CSS HTML Hypertext Markup Language XHTML Extensible Hypertext Markup Language CSS Cascading Style Sheets The Internet is a worldwide collection of computers and

More information

Knowledge Representation, Ontologies, and the Semantic Web

Knowledge Representation, Ontologies, and the Semantic Web Knowledge Representation, Ontologies, and the Semantic Web Evimaria Terzi 1, Athena Vakali 1, and Mohand-Saïd Hacid 2 1 Informatics Dpt., Aristotle University, 54006 Thessaloniki, Greece evimaria,avakali@csd.auth.gr

More information

The Nature of the Web

The Nature of the Web The Nature of the Web Agenda Code The Internet The Web Useful References 2 CODE is King (or Queen) The language of the Web: Hypertext Markup Language - HTML Cascading Style Sheets - CSS Build over successive

More information

Device Independent Principles for Adapted Content Delivery

Device Independent Principles for Adapted Content Delivery Device Independent Principles for Adapted Content Delivery Tayeb Lemlouma 1 and Nabil Layaïda 2 OPERA Project Zirst 655 Avenue de l Europe - 38330 Montbonnot, Saint Martin, France Tel: +33 4 7661 5281

More information

ISSN: (Online) Volume 3, Issue 6, June 2015 International Journal of Advance Research in Computer Science and Management Studies

ISSN: (Online) Volume 3, Issue 6, June 2015 International Journal of Advance Research in Computer Science and Management Studies ISSN: 2321-7782 (Online) Volume 3, Issue 6, June 2015 International Journal of Advance Research in Computer Science and Management Studies Research Article / Survey Paper / Case Study Available online

More information

Applying Java for the Retrieval of Multimedia Knowledge Distributed on High Performance Clusters on the Internet

Applying Java for the Retrieval of Multimedia Knowledge Distributed on High Performance Clusters on the Internet Proceedings of the International Conference on Practical Applications of JAVA, London, UK, (1999), 193 203 Applying Java for the Retrieval of Multimedia Knowledge Distributed on High Performance Clusters

More information

International Journal of Computer Science Trends and Technology (IJCST) Volume 3 Issue 4, Jul-Aug 2015

International Journal of Computer Science Trends and Technology (IJCST) Volume 3 Issue 4, Jul-Aug 2015 RESEARCH ARTICLE OPEN ACCESS Multi-Lingual Ontology Server (MOS) For Discovering Web Services Abdelrahman Abbas Ibrahim [1], Dr. Nael Salman [2] Department of Software Engineering [1] Sudan University

More information

Automatic Metadata Extraction for Archival Description and Access

Automatic Metadata Extraction for Archival Description and Access Automatic Metadata Extraction for Archival Description and Access WILLIAM UNDERWOOD Georgia Tech Research Institute Abstract: The objective of the research reported is this paper is to develop techniques

More information

M2-R4: INTERNET TECHNOLOGY AND WEB DESIGN

M2-R4: INTERNET TECHNOLOGY AND WEB DESIGN M2-R4: INTERNET TECHNOLOGY AND WEB DESIGN NOTE: 1. There are TWO PARTS in this Module/Paper. PART ONE contains FOUR questions and PART TWO contains FIVE questions. 2. PART ONE is to be answered in the

More information

Aspects of an XML-Based Phraseology Database Application

Aspects of an XML-Based Phraseology Database Application Aspects of an XML-Based Phraseology Database Application Denis Helic 1 and Peter Ďurčo2 1 University of Technology Graz Insitute for Information Systems and Computer Media dhelic@iicm.edu 2 University

More information

Information Retrieval

Information Retrieval Information Retrieval CSC 375, Fall 2016 An information retrieval system will tend not to be used whenever it is more painful and troublesome for a customer to have information than for him not to have

More information

ISO. International Organization for Standardization. ISO/IEC JTC 1/SC 32 Data Management and Interchange WG4 SQL/MM. Secretariat: USA (ANSI)

ISO. International Organization for Standardization. ISO/IEC JTC 1/SC 32 Data Management and Interchange WG4 SQL/MM. Secretariat: USA (ANSI) ISO/IEC JTC 1/SC 32 N 0736 ISO/IEC JTC 1/SC 32/WG 4 SQL/MM:VIE-006 January, 2002 ISO International Organization for Standardization ISO/IEC JTC 1/SC 32 Data Management and Interchange WG4 SQL/MM Secretariat:

More information

Interoperability for Digital Libraries

Interoperability for Digital Libraries DRTC Workshop on Semantic Web 8 th 10 th December, 2003 DRTC, Bangalore Paper: C Interoperability for Digital Libraries Michael Shepherd Faculty of Computer Science Dalhousie University Halifax, NS, Canada

More information

JISC PALS2 PROJECT: ONIX FOR LICENSING TERMS PHASE 2 (OLT2)

JISC PALS2 PROJECT: ONIX FOR LICENSING TERMS PHASE 2 (OLT2) JISC PALS2 PROJECT: ONIX FOR LICENSING TERMS PHASE 2 (OLT2) Functional requirements and design specification for an ONIX-PL license expression drafting system 1. Introduction This document specifies a

More information

Designing a Semantic Ground Truth for Mathematical Formulas

Designing a Semantic Ground Truth for Mathematical Formulas Designing a Semantic Ground Truth for Mathematical Formulas Alan Sexton 1, Volker Sorge 1, and Masakazu Suzuki 2 1 School of Computer Science, University of Birmingham, UK, A.P.Sexton V.Sorge@cs.bham.ac.uk,

More information

Transformation of structured documents with the use of grammar

Transformation of structured documents with the use of grammar ELECTRONIC PUBLISHING, VOL. 6(4), 373 383 (DECEMBER 1993) Transformation of structured documents with the use of grammar EILA KUIKKA MARTTI PENTTONEN University of Kuopio University of Joensuu P. O. Box

More information

M359 Block5 - Lecture12 Eng/ Waleed Omar

M359 Block5 - Lecture12 Eng/ Waleed Omar Documents and markup languages The term XML stands for extensible Markup Language. Used to label the different parts of documents. Labeling helps in: Displaying the documents in a formatted way Querying

More information

5/19/2015. Objectives. JavaScript, Sixth Edition. Introduction to the World Wide Web (cont d.) Introduction to the World Wide Web

5/19/2015. Objectives. JavaScript, Sixth Edition. Introduction to the World Wide Web (cont d.) Introduction to the World Wide Web Objectives JavaScript, Sixth Edition Chapter 1 Introduction to JavaScript When you complete this chapter, you will be able to: Explain the history of the World Wide Web Describe the difference between

More information

Proving the validity and accessibility of dynamic web pages

Proving the validity and accessibility of dynamic web pages Loughborough University Institutional Repository Proving the validity and accessibility of dynamic web pages This item was submitted to Loughborough University's Institutional Repository by the/an author.

More information

[MS-PICSL]: Internet Explorer PICS Label Distribution and Syntax Standards Support Document

[MS-PICSL]: Internet Explorer PICS Label Distribution and Syntax Standards Support Document [MS-PICSL]: Internet Explorer PICS Label Distribution and Syntax Standards Support Document Intellectual Property Rights Notice for Open Specifications Documentation Technical Documentation. Microsoft

More information

CIS 1.5 Course Objectives. a. Understand the concept of a program (i.e., a computer following a series of instructions)

CIS 1.5 Course Objectives. a. Understand the concept of a program (i.e., a computer following a series of instructions) By the end of this course, students should CIS 1.5 Course Objectives a. Understand the concept of a program (i.e., a computer following a series of instructions) b. Understand the concept of a variable

More information

Web Ontology for Software Package Management

Web Ontology for Software Package Management Proceedings of the 8 th International Conference on Applied Informatics Eger, Hungary, January 27 30, 2010. Vol. 2. pp. 331 338. Web Ontology for Software Package Management Péter Jeszenszky Debreceni

More information

Enabling Grids for E-sciencE ISSGC 05. XML documents. Richard Hopkins, National e-science Centre, Edinburgh June

Enabling Grids for E-sciencE ISSGC 05. XML documents. Richard Hopkins, National e-science Centre, Edinburgh June ISSGC 05 XML documents Richard Hopkins, National e-science Centre, Edinburgh June 2005 www.eu-egee.org Overview Goals General appreciation of XML Sufficient detail to understand WSDLs Structure Philosophy

More information

Stylus Studio Case Study: FIXML Working with Complex Message Sets Defined Using XML Schema

Stylus Studio Case Study: FIXML Working with Complex Message Sets Defined Using XML Schema Stylus Studio Case Study: FIXML Working with Complex Message Sets Defined Using XML Schema Introduction The advanced XML Schema handling and presentation capabilities of Stylus Studio have valuable implications

More information

HTML+ CSS PRINCIPLES. Getting started with web design the right way

HTML+ CSS PRINCIPLES. Getting started with web design the right way HTML+ CSS PRINCIPLES Getting started with web design the right way HTML : a brief history ❶ 1960s : ARPANET is developed... It is the first packet-switching network using TCP/IP protocol and is a precursor

More information

Part VII. Querying XML The XQuery Data Model. Marc H. Scholl (DBIS, Uni KN) XML and Databases Winter 2005/06 153

Part VII. Querying XML The XQuery Data Model. Marc H. Scholl (DBIS, Uni KN) XML and Databases Winter 2005/06 153 Part VII Querying XML The XQuery Data Model Marc H. Scholl (DBIS, Uni KN) XML and Databases Winter 2005/06 153 Outline of this part 1 Querying XML Documents Overview 2 The XQuery Data Model The XQuery

More information

A Prolog-based Proof Tool for Type Theory TA λ and Implicational Intuitionistic-Logic

A Prolog-based Proof Tool for Type Theory TA λ and Implicational Intuitionistic-Logic for Type Theory TA λ and Implicational Intuitionistic-Logic L. Yohanes Stefanus University of Indonesia Depok 16424, Indonesia yohanes@cs.ui.ac.id and Ario Santoso Technische Universität Dresden Dresden

More information

XML and information exchange. XML extensible Markup Language XML
