AUTOMATIC ACQUISITION OF DIGITIZED NEWSPAPERS VIA INTERNET
Ismael Sanz, Rafael Berlanga, María José Aramburu and Francisco Toledo
Departament d'Informàtica, Campus Penyeta Roja, Universitat Jaume I, E Castellón, SPAIN

Keywords: Internet, Digital Libraries, Document Recognition and Logic Programming.

ABSTRACT

Following our previous work on modelling a database of newspapers and designing a specially suited retrieval language, we are now developing an application that automatically acquires, summarizes and stores newspaper documents published in distinct web resources. This paper describes the current implementation of the acquisition process, which includes the recognition of document types and the abstraction of the recognised document values. The network agents in charge of this process are called gatherers, following the terminology used in successful web retrieval systems such as Harvest. To implement gatherers we have combined a context-free grammar with web traversal techniques, which are available in most current PROLOG systems (e.g. SICStus with the PiLLoW library).
1 Introduction

The soaring availability of periodical publications on the Internet calls for new methods for managing this kind of document [Ara96], as well as for large, specific digital libraries that provide sophisticated indexing and retrieval over them [Ara97a]. Following our previous work on modelling databases of newspapers and designing a specially suited retrieval language, we are now developing an application that automatically acquires, summarizes and stores newspaper documents published at distinct web servers. This paper describes the current implementation of the acquisition process, which includes the recognition of document types and the abstraction of the recognised document values. The network agents in charge of this process are called gatherers, following the terminology used in successful web retrieval systems such as Harvest. To implement our gatherers we have combined a context-free grammar with web traversal techniques, which are available in most current PROLOG systems (e.g. SICStus and the PiLLoW library). This approach to the gathering problem is possible thanks to the regular structure of newspaper documents, which can be described using the Document Type System (DTS) [Ara97a].

1.1 Overall structure

Gatherers take as input sets of web-accessible documents [Tbl94] and return a metadata description for each recognised publication. Figure 1 illustrates how DTS descriptions, which are internally transformed into grammars, are used to represent the structure of individual documents as well as the relationships among them. In this way, the system is able to classify different kinds of documents and extract relevant information from them.
[Figure 1: Architecture of the gathering system. Diagram labels: document descriptions (DTS classes, layout rules, schema view), gatherer, context-free grammar, web server, output metadata.]

1.2 The DTS type system

Newspaper documents in a web server are regulated by a set of types and layout rules which define their structure and contents. These documents can be represented using an object-oriented data model whose underlying type system supports flexibility and optionality when representing complex document structures [Ara96]. The gathering process is also regulated by this type system, called DTS (Document Type System), which can be viewed as a formal object-oriented type system with a syntax similar to the SGML DTD language. Briefly, DTS types follow the syntax [Ara97b]:

    τ := rawdata | class | (τ1 | ... | τm) | τ+ | τ* | τ? | [ A1:τ1, ..., Am:τm ]

where:

- rawdata is any basic multimedia data type (e.g. Text and Graphic);
- class is the name of a document class (e.g. Article, Paper, etc.) whose type is in turn a DTS type;
- the type constructor "|" expresses the union of DTS types;
- the constructor [ ... ] expresses a (possibly nested) tuple by using a set of document attributes Ai;
- the suffixes +, * and ? express the different optionality degrees of a document component, namely: at least one occurrence, zero or more occurrences, and at most one occurrence, respectively.

2 Gathering HTML files

We assume that the publications to be analyzed are available in HTML format through an HTTP server. The details of HTTP requests are handled by the PiLLoW library [Cab96], a publicly available package that implements such low-level routines for PROLOG systems. On top of this layer, a set of services is provided for traversing web servers gracefully; in particular, the Standard for Robot Exclusion [Fah96] is fully supported. Despite these robot-like features, the gathering system does not behave blindly, as common web-indexing programs do. Instead, it exploits any available information on the internal structure of the web server, which is modelled with DTS descriptions. As an example, let us consider a typical tree-like web site structure for a newspaper publication:

[Figure 2: Sample structure of an electronically published newspaper. Tree nodes: front page; index of section #1; index of section #2; irrelevant documents; articles #1, #2, #3, #4, ...]

In this case, the server root (front page) contains a set of hypertext links to section indexes, which in turn point to their corresponding articles. A set of DTS classes expressing these relationships may be similar to the following:

    FrontPage := [ date: Date, sections: SectionIndexRef+ ]
    SectionIndex := [ name: SectionName, articles: ArticleRef+ ]
    Article := (Report | Chronicle | Interview)

By giving appropriate semantic specifications for the SectionIndexRef and ArticleRef tokens (see Section 3.1), it is possible to express that the corresponding links point to documents of the SectionIndex and Article classes respectively. In this way, the gatherer is instructed to traverse the web server along logical, well-known paths. As a consequence, no irrelevant documents are requested, and network traffic is thereby minimized.
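The DTS-directed traversal described above can be sketched in PROLOG as follows. This is only a minimal illustration under stated assumptions, not the authors' implementation: the site is a hypothetical in-memory set of page/2 facts rather than real HTTP requests, and the names page/2, ref_class/2, traverse/3 and follow/2 are introduced here for the example. The sketch also assumes the tree-like structure of Figure 2, so it does not guard against cycles.

```prolog
:- use_module(library(lists)).

% Hypothetical stand-in for a web site: page(Url, Links), where each
% link is a pair Token-TargetUrl. Real pages would be fetched via HTTP.
page('/',            [section_index_ref-'/sports',
                      section_index_ref-'/world',
                      other-'/ads']).
page('/sports',      [article_ref-'/sports/a1', article_ref-'/sports/a2']).
page('/world',       [article_ref-'/world/a3', other-'/world/forum']).
page('/sports/a1',   []).
page('/sports/a2',   []).
page('/world/a3',    []).
page('/ads',         []).
page('/world/forum', []).

% Assumed mapping from reference tokens to the DTS class they point to.
ref_class(section_index_ref, section_index).
ref_class(article_ref,       article).

% traverse(+Url, +Class, -Visited): Visited is the list of Class-Url
% pairs fetched when starting at Url, which is expected to be of Class.
traverse(Url, Class, [Class-Url|Rest]) :-
    page(Url, Links),
    follow(Links, Rest).

% follow(+Links, -Visited): follow only links whose token maps to a
% known DTS class; every other link is ignored and never requested.
follow([], []).
follow([Token-Target|Links], Visited) :-
    (   ref_class(Token, Class)
    ->  traverse(Target, Class, V1)
    ;   V1 = []
    ),
    follow(Links, V2),
    append(V1, V2, Visited).
```

With these facts, the query `?- traverse('/', front_page, V).` visits the front page, the two section indexes and the three articles, while '/ads' and '/world/forum' are never requested.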
Apart from this, the gatherer supports other methods for traversing web sites. The most important one implements a traditional breadth-first search of the full target site and stores the fetched documents in a simple database. This is mainly useful for studying the structure of a site locally, off-line. The implementation also allows new traversal methods to be added. For this purpose, it is only necessary to create a PROLOG module that implements a predefined set of predicates. The most important ones are summarised in Table 1. The underlying features of the gathering system are available to these predicates through a set of well-defined interfaces.

    Predicate        Description
    robot_start/0    Performs any necessary initializations.
    robot_action/1   The argument is the full information on the current HTML
                     document, represented as an association list that contains
                     not only the HTML source, but also the returned HTTP headers.
                     It is the responsibility of this predicate to specify further
                     URLs to be fetched.
    robot_finish/0   Cleans up if necessary.

Table 1: Basic predicates for traversing web publication sites
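A traversal module implementing the three predicates of Table 1 might look like the sketch below. Several details are assumptions made only for illustration: the paper does not state how robot_action/1 tells the gatherer which URLs to fetch, so the pending/1 queue and the next_url/1 hook are hypothetical, as are the keys url and links in the association list (the paper only says that it holds the HTML source and the HTTP headers).

```prolog
:- module(bfs_robot,
          [robot_start/0, robot_action/1, robot_finish/0, next_url/1]).
:- use_module(library(lists)).

:- dynamic seen/1, pending/1.

% robot_start/0: performs any necessary initializations.
robot_start :-
    retractall(seen(_)),
    retractall(pending(_)).

% robot_action(+Doc): Doc is an association list describing the current
% document. The keys url and links are assumptions of this sketch.
robot_action(Doc) :-
    memberchk(url-Url, Doc),
    memberchk(links-Links, Doc),
    assertz(seen(Url)),
    enqueue_new(Links).

% enqueue_new(+Links): queue every link that is neither visited nor
% already waiting in the queue.
enqueue_new([]).
enqueue_new([L|Ls]) :-
    (   ( seen(L) ; pending(L) )
    ->  true
    ;   assertz(pending(L))
    ),
    enqueue_new(Ls).

% next_url(-Url): hypothetical hook through which a driver takes the
% next URL to fetch. assertz/1 plus retract/1 gives FIFO order, which
% yields a breadth-first traversal.
next_url(Url) :-
    retract(pending(Url)),
    !.

% robot_finish/0: cleans up if necessary.
robot_finish :-
    retractall(seen(_)),
    retractall(pending(_)).
```

For example, after `?- robot_start, robot_action([url-'/', links-['/a', '/b', '/']]).` a call to `next_url(U)` yields `U = '/a'`; the self-link '/' is not queued because it has already been marked as seen.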
3 Type system implementation

In order to use the DTS descriptions that regulate the newspapers of a server, it is necessary to translate them into a format that gatherers, which are implemented in PROLOG, can easily incorporate. In this respect, some minor syntactic additions need to be made to the DTS statements. We introduce the modified syntax by means of the following example:

    class BodyAndPhoto
    uses Rawdata
    BodyAndPhoto := [ body: Paper+, photo: Photograph ]

Here, the clause class declares the name of the DTS type that is to be defined; the clause uses specifies the PROLOG file that contains the set of rawdata types involved in this type (details about this file are given in the next section); finally, the DTS description itself is given. The statement above is then compiled into a PROLOG file that contains a Definite Clause Grammar (DCG) version of the class definition, as well as some glue code necessary for dynamically loading the code into the final gathering system.

3.1 Rawdata specification and markup styles

The DTS implementation distinguishes two different kinds of rawdata classes: primitive types and tokens. The former are generic multimedia types (Text and Graphic are currently recognized), whereas the latter are class-specific formats which usually correspond to markup styles. We explain this concept in the following paragraphs.

A document can be viewed as a sequence of marked-up texts (see footnote 1). Usually, the function of each text within the document is expressed by a characteristic style; for instance, a headline and an image caption will probably have a different visual appearance. Since we are dealing with HTML files, we consider that the markup for each text is defined by a set of tags. For example, some headlines at the top of a piece of news may look as follows:

    Troublesome European Summit
    UE governments disagree about the starting date for the new euro

Table 2 presents their markup-text pairs:

    Markup                        Text
    center, bold, font size=+2    "Troublesome European Summit"
    center, font size=+1          "UE governments disagree about the starting
                                  date for the new euro"

Table 2: Example of markup-text pairs

A token is just a set of markup tags. In this case, we could define the token Headline as the set {center, bold, font size=+2} and the token Secondary headline as the set {center, font size=+1}. In publication sites, only a few combinations are used for the overall layout. These are usually specified by an internal book of style, similar to those of printed newspapers, and constitute the distinctive graphic vocabulary of the publication server.

Footnote 1: Provided that a strict sequential order can be obtained for every element in the document. For HTML documents, this is always possible.

In order to use the layout descriptions within gatherers, a PROLOG representation for them must be established, as for DTS type descriptions. For easy interfacing with the compiled DTS classes, Definite Clause Grammars are also used here. In this case, they must assume that they are parsing a list of terms of the following form (as described later in Section 3.2):

    paragraph(Tags, Text)

where Tags is a list of HTML tags with the format Name $ List-of-attributes, and Text is the associated string of characters (see Table 2). For instance, the following grammar rule states the Headline token defined above:

    'Headline'('Text'#T) --> [paragraph([center$[], b$[], font$[size='+2']], T)].

Of course, arbitrary PROLOG code may be added to these grammar rules. This means that powerful extensions may be programmed by building on this basic mechanism; for instance, this capability is used in the gatherer to implement the DTS-directed site traversal.

3.2 Obtaining styles

In order to transform the HTML source into lists of paragraph/2 terms, a special routine is used that attempts to exploit any knowledge about the characteristics of the site. This routine performs the following steps:

1. It transforms the HTML source into PROLOG terms, using facilities provided by the PiLLoW library. The tree structure of the tags in the source file is preserved by nesting terms appropriately.
2. The resulting structure is simplified using rules defined in an external file. These rules basically specify which tags are the relevant ones, and which ones should be removed together with their entire subtree.
3. Finally, the simplified tree is flattened into a list of paragraph/2 terms.

4 Identification of document classes

Each grammar associated with a DTS type, together with its corresponding token specifications, is capable of parsing a document represented as a list of styles and extracting all the relevant information from it. Specifically, each document is recognised by using a set of candidate grammars, which are checked one by one until the document conforms to one of them.
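The recognition procedure can be illustrated with a small DCG sketch. It is not the code generated by the DTS compiler: the operator priorities, the lower-case nonterminal names, the candidate/1 table and the assumption that a plain paragraph carries an empty tag list are all choices made here for the example.

```prolog
:- use_module(library(lists)).

:- op(150, xfx, $).   % Name$Attributes inside paragraph/2 terms
:- op(200, xfx, #).   % Class#Value in the extracted terms (assumed priority)

% Token rules: each one consumes a single paragraph(Tags, Text) term.
headline('Text'#T) -->
    [paragraph([center$[], b$[], font$[size='+2']], T)].
plain_paragraph('Text'#T) -->
    [paragraph([], T)].

% Body := Paragraph+
body('Body'#[P|Ps]) --> plain_paragraph(P), more_paragraphs(Ps).

more_paragraphs([P|Ps]) --> plain_paragraph(P), more_paragraphs(Ps).
more_paragraphs([])     --> [].

% Article := [ headline: Headline, body: Body ]
article('Article'#[headline:H, body:B]) --> headline(H), body(B).

% Candidate grammars, tried in this order.
candidate(body).
candidate(article).

% recognise(+Paragraphs, -Term): Term is the value extracted by the
% first candidate grammar that parses the whole paragraph list.
recognise(Paragraphs, Term) :-
    candidate(Name),
    Grammar =.. [Name, Term],
    phrase(Grammar, Paragraphs),
    !.
```

Given a list made of one headline paragraph followed by two plain paragraphs, the body grammar fails on the very first element and recognise/2 then succeeds with the article grammar, binding Term to a term of the shape `'Article'#[headline:'Text'#..., body:'Body'#[...]]`.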
Experience shows that non-conforming grammars tend to fail very soon, and thus the search procedure remains efficient. In order to limit the number of grammars that are attempted for each document, DTS descriptions may be clustered by putting related groups into separate directories. Each cluster shares a common style definition file (see Section 3.1). For instance, for identifying the documents in the tree-like structure shown in Figure 2, it would be reasonable to separate into different clusters the classes that represent site structures (FrontPage, SectionIndex) and the ones for the articles (Report, Chronicle, Interview).

The result of this process is a PROLOG term with a structure that corresponds to the original DTS class description and contains the extracted value of each attribute. The format, as returned by the compiled DTS descriptions, may be informally described as:

    Class name # List of attributes

or, for primitive types,

    Class name # Primitive value

The primitive value is specific to the particular primitive class: for a Text class it is a string, and for a Graphic class it is a URL. Each attribute has the following syntax:

    Attribute name : Class description
    Attribute name : List of values

For instance, given the following DTS descriptions

    Body := Paragraph+
    Article := [ headline: Headline, body: Body ]

where Headline and Paragraph are tokens that return Text values, a term representing a conforming document could be the following one:

    Article#[headline: Text#"PrologScript Revisited",
             body: Body#[Text#"First paragraph text", Text#"Second paragraph text"]]

This format is devised to allow easy insertion of data into an object-oriented database.

5 Conclusions

This work has described an implementation of gatherers for web repositories of digitized newspapers. Furthermore, we have shown the usefulness of formal data description methods in the retrieval and classification of structured sets of web documents. Specifically, DTS types have been used to effectively manage complex collections of documents. In this work, PROLOG has played a relevant role in the design of gatherers, by providing grammar rules for type recognition and HTTP protocol primitives for performing web traversal. Future work focuses on extending the technique presented here to other kinds of publications, such as journals and patents.

6 References

[Ara96] Aramburu, M.J. and Berlanga, R. "Object-oriented modelling of periodicals". Proceedings of the 7th Workshop on Databases and Expert System Applications (DEXA'96), Ed. IEEE, Zurich.

[Ara97a] Aramburu, M.J. and Berlanga, R. "An approach to a digital library of newspapers". To appear in Information Processing & Management, Special Issue on Electronic News, 1997.

[Ara97b] Aramburu, M.J. and Berlanga, R. "Metadata in a Digital Library of Periodicals". Informe Técnico DI 01-01/97, Departamento de Informática, Universitat Jaume I, January.

[Tbl94] Berners-Lee, T., Cailliau, R., Nielsen, H.F., Luotonen, A. and Secret, A. "The world wide web". Communications of the ACM, 37(8):76-82, August.

[Fah96] Fah-Chun Cheong. "Internet Agents". New Riders Publishing, Indianapolis, 1996.

[Cab96] Cabeza, D., Hermenegildo, M. and Varma, S. "The PiLLoW/CIAO library for Internet/WWW programming using computational logic systems". Proceedings of the 1st Workshop on Logic Programming Tools for INTERNET Applications, JICSLP'96, Bonn, 1996.
More informationEPiServer s Compliance to WCAG and ATAG
EPiServer s Compliance to WCAG and ATAG An evaluation of EPiServer s compliance to the current WCAG and ATAG guidelines elaborated by the World Wide Web Consortium s (W3C) Web Accessibility Initiative
More informationDesign and Implementation of an RDF Triple Store
Design and Implementation of an RDF Triple Store Ching-Long Yeh and Ruei-Feng Lin Department of Computer Science and Engineering Tatung University 40 Chungshan N. Rd., Sec. 3 Taipei, 04 Taiwan E-mail:
More informationUR what? ! URI: Uniform Resource Identifier. " Uniquely identifies a data entity " Obeys a specific syntax " schemename:specificstuff
CS314-29 Web Protocols URI, URN, URL Internationalisation Role of HTML and XML HTTP and HTTPS interacting via the Web UR what? URI: Uniform Resource Identifier Uniquely identifies a data entity Obeys a
More informationThe Application Research of Semantic Web Technology and Clickstream Data Mart in Tourism Electronic Commerce Website Bo Liu
International Conference on Education Technology, Management and Humanities Science (ETMHS 2015) The Application Research of Semantic Web Technology and Clickstream Data Mart in Tourism Electronic Commerce
More informationPublishing Technology 101 A Journal Publishing Primer. Mike Hepp Director, Technology Strategy Dartmouth Journal Services
Publishing Technology 101 A Journal Publishing Primer Mike Hepp Director, Technology Strategy Dartmouth Journal Services mike.hepp@sheridan.com Publishing Technology 101 AGENDA 12 3 EVOLUTION OF PUBLISHING
More informationDistributed Database System. Project. Query Evaluation and Web Recognition in Document Databases
74.783 Distributed Database System Project Query Evaluation and Web Recognition in Document Databases Instructor: Dr. Yangjun Chen Student: Kang Shi (6776229) August 1, 2003 1 Abstract A web and document
More informationInformation Retrieval (IR) through Semantic Web (SW): An Overview
Information Retrieval (IR) through Semantic Web (SW): An Overview Gagandeep Singh 1, Vishal Jain 2 1 B.Tech (CSE) VI Sem, GuruTegh Bahadur Institute of Technology, GGS Indraprastha University, Delhi 2
More informationCHAPTER 1: GETTING STARTED WITH HTML CREATED BY L. ASMA RIKLI (ADAPTED FROM HTML, CSS, AND DYNAMIC HTML BY CAREY)
CHAPTER 1: GETTING STARTED WITH HTML EXPLORING THE HISTORY OF THE WORLD WIDE WEB Network: a structure that allows devices known as nodes or hosts to be linked together to share information and services.
More informationPART. Oracle and the XML Standards
PART I Oracle and the XML Standards CHAPTER 1 Introducing XML 4 Oracle Database 10g XML & SQL E xtensible Markup Language (XML) is a meta-markup language, meaning that the language, as specified by the
More informationService Oriented Architectures (ENCS 691K Chapter 2)
Service Oriented Architectures (ENCS 691K Chapter 2) Roch Glitho, PhD Associate Professor and Canada Research Chair My URL - http://users.encs.concordia.ca/~glitho/ The Key Technologies on Which Cloud
More informationWhat is a web site? Web editors Introduction to HTML (Hyper Text Markup Language)
What is a web site? Web editors Introduction to HTML (Hyper Text Markup Language) What is a website? A website is a collection of web pages containing text and other information, such as images, sound
More informationCopyright 2007 Ramez Elmasri and Shamkant B. Navathe. Slide 27-1
Slide 27-1 Chapter 27 XML: Extensible Markup Language Chapter Outline Introduction Structured, Semi structured, and Unstructured Data. XML Hierarchical (Tree) Data Model. XML Documents, DTD, and XML Schema.
More informationDigital Asset Management 2. Introduction to Digital Media Format
Digital Asset Management 2. Introduction to Digital Media Format 2009-09-24 Outline Image format and coding methods Audio format and coding methods Video format and coding methods Introduction to HTML
More informationEMERGING TECHNOLOGIES. XML Documents and Schemas for XML documents
EMERGING TECHNOLOGIES XML Documents and Schemas for XML documents Outline 1. Introduction 2. Structure of XML data 3. XML Document Schema 3.1. Document Type Definition (DTD) 3.2. XMLSchema 4. Data Model
More informationFrom administrivia to what really matters
From administrivia to what really matters Questions about the syllabus? Logistics Daily lectures, quizzes and labs Two exams and one long project My teaching philosophy...... is informed by my passion
More informationComp 336/436 - Markup Languages. Fall Semester Week 4. Dr Nick Hayward
Comp 336/436 - Markup Languages Fall Semester 2017 - Week 4 Dr Nick Hayward XML - recap first version of XML became a W3C Recommendation in 1998 a useful format for data storage and exchange config files,
More informationintroduction to XHTML
introduction to XHTML XHTML stands for Extensible HyperText Markup Language and is based on HTML 4.0, incorporating XML. Due to this fusion the mark up language will remain compatible with existing browsers
More informationLecture Telecooperation. D. Fensel Leopold-Franzens- Universität Innsbruck
Lecture Telecooperation D. Fensel Leopold-Franzens- Universität Innsbruck First Lecture: Introduction: Semantic Web & Ontology Introduction Semantic Web and Ontology Part I Introduction into the subject
More informationDevelopment of an Ontology-Based Portal for Digital Archive Services
Development of an Ontology-Based Portal for Digital Archive Services Ching-Long Yeh Department of Computer Science and Engineering Tatung University 40 Chungshan N. Rd. 3rd Sec. Taipei, 104, Taiwan chingyeh@cse.ttu.edu.tw
More informationWebbed Documents 1- Malcolm Graham and Andrew Surray. Abstract. The Problem and What We ve Already Tried
Webbed Documents 1- Malcolm Graham and Andrew Surray WriteDoc Inc. Northern Telecom malcolm@writedoc.com surray@bnr.ca Abstract This paper describes the work currently being done within Northern Telecom
More informationextensible Markup Language
extensible Markup Language XML is rapidly becoming a widespread method of creating, controlling and managing data on the Web. XML Orientation XML is a method for putting structured data in a text file.
More informationAn Integrated Framework to Enhance the Web Content Mining and Knowledge Discovery
An Integrated Framework to Enhance the Web Content Mining and Knowledge Discovery Simon Pelletier Université de Moncton, Campus of Shippagan, BGI New Brunswick, Canada and Sid-Ahmed Selouani Université
More informationCHAPTER-23 MINING COMPLEX TYPES OF DATA
CHAPTER-23 MINING COMPLEX TYPES OF DATA 23.1 Introduction 23.2 Multidimensional Analysis and Descriptive Mining of Complex Data Objects 23.3 Generalization of Structured Data 23.4 Aggregation and Approximation
More informationMODULE 2 HTML 5 FUNDAMENTALS. HyperText. > Douglas Engelbart ( )
MODULE 2 HTML 5 FUNDAMENTALS HyperText > Douglas Engelbart (1925-2013) Tim Berners-Lee's proposal In March 1989, Tim Berners- Lee submitted a proposal for an information management system to his boss,
More informationSDPL : XML Basics 2. SDPL : XML Basics 1. SDPL : XML Basics 4. SDPL : XML Basics 3. SDPL : XML Basics 5
2 Basics of XML and XML documents 2.1 XML and XML documents Survivor's Guide to XML, or XML for Computer Scientists / Dummies 2.1 XML and XML documents 2.2 Basics of XML DTDs 2.3 XML Namespaces XML 1.0
More informationA Meta-Model for Fact Extraction from Delphi Source Code
Electronic Notes in Theoretical Computer Science 94 (2004) 9 28 www.elsevier.com/locate/entcs A Meta-Model for Fact Extraction from Delphi Source Code Jens Knodel and G. Calderon-Meza 2 Fraunhofer Institute
More informationWeb Page Classification using FP Growth Algorithm Akansha Garg,Computer Science Department Swami Vivekanad Subharti University,Meerut, India
Web Page Classification using FP Growth Algorithm Akansha Garg,Computer Science Department Swami Vivekanad Subharti University,Meerut, India Abstract - The primary goal of the web site is to provide the
More informationThe XML Metalanguage
The XML Metalanguage Mika Raento mika.raento@cs.helsinki.fi University of Helsinki Department of Computer Science Mika Raento The XML Metalanguage p.1/442 2003-09-15 Preliminaries Mika Raento The XML Metalanguage
More informationStructured documents
Structured documents An overview of XML Structured documents Michael Houghton 15/11/2000 Unstructured documents Broadly speaking, text and multimedia document formats can be structured or unstructured.
More informationI&R SYSTEMS ON THE INTERNET/INTRANET CITES AS THE TOOL FOR DISTANCE LEARNING. Andrii Donchenko
International Journal "Information Technologies and Knowledge" Vol.1 / 2007 293 I&R SYSTEMS ON THE INTERNET/INTRANET CITES AS THE TOOL FOR DISTANCE LEARNING Andrii Donchenko Abstract: This article considers
More information1. true / false By a compiler we mean a program that translates to code that will run natively on some machine.
1. true / false By a compiler we mean a program that translates to code that will run natively on some machine. 2. true / false ML can be compiled. 3. true / false FORTRAN can reasonably be considered
More informationUse of Mobile Agents for IPR Management and Negotiation
Use of Mobile Agents for Management and Negotiation Isabel Gallego 1, 2, Jaime Delgado 1, Roberto García 1 1 Universitat Pompeu Fabra (UPF), Departament de Tecnologia, La Rambla 30-32, E-08002 Barcelona,
More informationA new generation of tools for SGML
Article A new generation of tools for SGML R. W. Matzen Oklahoma State University Department of Computer Science EMAIL rmatzen@acm.org Exceptions are used in many standard DTDs, including HTML, because
More informationDefining an Abstract Core Production Rule System
WORKING PAPER!! DRAFT, e.g., lacks most references!! Version of December 19, 2005 Defining an Abstract Core Production Rule System Benjamin Grosof Massachusetts Institute of Technology, Sloan School of
More informationAchitectural specification: Base
Contents Introduction to DITA... 5 DITA terminology and notation...5 Basic concepts...9 File extensions...10 Producing different deliverables from a single source...11 DITA markup...12 DITA topics...12
More informationPart III: Survey of Internet technologies
Part III: Survey of Internet technologies Content (e.g., HTML) kinds of objects we re moving around? References (e.g, URLs) how to talk about something not in hand? Protocols (e.g., HTTP) how do things
More information