XML-based production of Eurostat publications

Doc. Eurostat/ITDG/October 2007/2.3.1
IT Directors Group
15 and 16 October 2007
BECH Building, 5, rue Alphonse Weicker, Luxembourg-Kirchberg
Room QUETELET
9.30 a.m. - 5.30 p.m.
9.00 a.m. - 1.00 p.m.
XML-based production of Eurostat publications
Item 2.3.1 of the agenda

1. BACKGROUND

Eurostat's yearly publication programme includes approximately 100 larger publications (collections Statistical books, Pocketbooks, Methodologies and Working papers, Methods and nomenclatures, Detailed tables) and more than 200 shorter publications (collection Statistics in Focus and the new collection Data in Focus). While the shorter publications are produced using an MS Word add-on called SIF/DIF-Kit, no comparable tool exists for the larger ones that would offer satisfactory graphical quality. Instead, the decision about which tool or standard to use is usually left to contractors, which makes it difficult to exchange or reuse the source files when the contractor changes. There is also no common Eurostat tool for producing different output formats such as web PDF, print PDF or HTML. This might be acceptable for some ad-hoc publications, but the majority of Eurostat titles are produced on a regular basis over many years, and a common tool and standard could bring many benefits, in particular if the layout stays more or less stable for several editions of the same publication.

An MS Word add-on like SIF-Kit would limit the layout possibilities too much. First, it would not allow for high-quality graphical solutions. Second, the publication's layout would need to be hard-coded in the software tool itself, so it would be impossible to build a generic tool usable for more than one publication. Therefore, this approach was not investigated further.

Another option is to use XML (Extensible Markup Language), a widely used standard for sharing documents and data between different information systems, in particular via the Internet. The manuscript is either created by the author directly in XML (e.g. as a database extraction) or converted into XML from another format (e.g. an MS Word or Excel document). As a next step, the XML source is transformed into a presentational form: web- or print-optimized PDF, HTML or other structured document formats.

Eurostat unit B6 "Dissemination" conducted a survey on the usage of XML in the dissemination process among the Statistical Institutes of the Member States (NSIs). The results were presented at the Dissemination Working Group held on 4-5 May 2006. Out of 17 NSIs that replied to a questionnaire distributed by Eurostat, more than half (10) are using XML at various stages of the production process of their data, publications and websites. The most common uses of XML are data exchange (as a common format to share data) and producing content for their websites. Moreover, two NSIs (Statistics Finland and Statec Luxembourg) are using XML-based solutions for the production of publications.

At the request of the Dissemination Working Group, Eurostat organized a Task Force on using XML in dissemination, and particularly in publishing, in October 2006. During this Task Force, five Member States, the ECB and OPOCE presented their experiences; the Finnish and Luxembourgish NSIs presented their operational solutions.

In November 2006 Eurostat launched a feasibility and implementation study on using XML for producing Eurostat publications. This paper summarizes the results of this study. The results of both the Task Force and the study were made available to the NSIs via Circa.

2. XML-BASED PUBLISHING SYSTEM

2.1. Project goals and high level requirements

The demand that triggered the search for alternatives to today's traditional production processes was the reduction of production times. Fast and seamless assembly of periodically produced Eurostat publications is the main objective. However, an XML-based system gives the opportunity to cover other objectives as well.

Today, all publications can be downloaded (as a PDF) from the Eurostat website, but no browsable web version exists, although many publications would deserve one. The XML-based publishing system should therefore also allow easy dissemination in multiple output formats (paper, PDF, HTML).

All Eurostat publications are based on teamwork, yet no system exists in Eurostat that supports collaboration between multiple authors. Keeping track of changes and versions can easily become a daunting task. The new system should therefore assist the authors who contribute to the collective work.

Besides the main objectives, the XML format should also allow format-independent archiving of all publications and improve consistency in layout. Another possible application might be the common use of one XML format in order to make browsable versions of publications easily available on the websites of the NSIs.

As the system is intended for non-specialists, it should be designed so that authors do not need to understand the technology behind it. That means that neither familiarity with XML nor extensive training should be necessary for basic use.

2.2. Basic concept

The high-level concept of the system to be developed relies on a single format (XML) into which various other inputs are transformed. As the schema below shows, data from Eurostat dissemination databases (New Cronos, Comext) are combined with data that are not in these databases (coming as CSV or Excel files), text (coming as an MS Word document) and graphics (e.g. maps). The individual content fragments are then assembled into the final document and transformed into the required output format.

2.3. XML Schema

A publishing process based on XML has a number of significant advantages. Content defined in XML is platform and software independent. It is also independent of a particular display format, since XML separates content from presentational information. This simplifies the generation of multiple formats from a single source using technologies like XSLT. In addition, it keeps the content compatible with emerging publication formats: supporting a new format only requires defining an appropriate transformation to it.

Best practice suggests that, before designing a new format, designers should look for existing XML vocabularies for similar data. Ideally these can be reused, in which case many of the supporting tools, such as DTDs, schemas and style sheets, may already be available.

The Eurostat study concluded that Open Document Format (ODF) is the leading format with the biggest potential and flexibility. Its advantages clearly outweigh its disadvantages. As an ISO standard, ODF has been widely adopted by the industry and is well supported by most common office tools. Furthermore, from a technical point of view, the schema is clear, well structured and easily convertible into any other format using XSLT.

ODF is an office format and cannot represent the semantics of Eurostat publications. Since the format will be used internally by the XML-based publishing solution and users (or administrators) will not be exposed to the format directly, this does not impose any limitation or cause any potential problem. It is important to mention that the application will be responsible for mapping the semantics of Eurostat publications to the internal format chosen.

To extend the selected XML format, which persists the documents themselves (basically typographical data), with the extra metadata required to drive the publication process, an object model will be created. It will be composed of:
- data (the actual representation of the document in ODF format);
- metadata (data describing the document, like publication ID, author, title).

The proposed model will be defined so as to take advantage of the Content Management capabilities of Alfresco, including metadata management, versioning and a clear distinction between content (data) and properties (metadata). This will allow us to extend the ODF format in order to support the process, eliminating the requirement of an ad-hoc format. Following this approach, an object will be constructed for each publication (or publication component) as seen below:

[Figure: publication object consisting of the ODF document and its process metadata: Title, Publication ID, Author(s), Publication date, Reviewer(s)]

In the scenario presented above, the ODF document would store the contents of the publication, while the process metadata would represent information required to drive the business process.
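
The data/metadata split described above can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the class names, field names and sample values are hypothetical, the ODF document is treated as opaque bytes, and nothing here uses or represents the Alfresco API.

```python
from dataclasses import dataclass, replace
from datetime import date
from typing import Optional, Tuple


@dataclass(frozen=True)
class ProcessMetadata:
    """Process metadata that drives the workflow (hypothetical field names)."""
    publication_id: str
    title: str
    authors: Tuple[str, ...] = ()
    reviewers: Tuple[str, ...] = ()
    publication_date: Optional[date] = None


@dataclass(frozen=True)
class PublicationObject:
    """One publication (or publication component): content plus metadata."""
    metadata: ProcessMetadata
    odf_content: bytes  # the ODF document, kept as opaque data in this sketch

    def with_reviewer(self, name: str) -> "PublicationObject":
        """Return a copy with one more reviewer; the content is left untouched."""
        new_meta = replace(
            self.metadata, reviewers=self.metadata.reviewers + (name,)
        )
        return replace(self, metadata=new_meta)


# Illustrative values only; none of them come from the study.
draft = PublicationObject(
    metadata=ProcessMetadata(
        publication_id="PUB-0001",
        title="Example pocketbook",
        authors=("A. Author",),
    ),
    odf_content=b"<placeholder for ODF bytes>",
)
reviewed = draft.with_reviewer("R. Reviewer")
```

Because the metadata lives outside the document bytes, a workflow step such as assigning a reviewer produces a new version of the object without rewriting the ODF content, which is the separation the proposed model relies on.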

2.4. System architecture

The Eurostat XML-based publishing solution will be built upon the Alfresco architecture. Alfresco uses state-of-the-art core components that, assembled together, provide a powerful, scalable and reliable Content Management foundation.

The system will use a file system to store documents and a relational database (like Oracle) to persist metadata and internal business-related information. It will also be linked to an LDAP directory to manage user rights. Both user actions and system management will be performed via the Alfresco web interface.

The Alfresco framework will be integrated with custom-made components. The role of these pieces of software will be to perform actions not supported directly by the framework. The preliminary list of components to be developed includes:
- User interface components, built to integrate seamlessly with Statpub.
- Workflow custom components, to facilitate the integration with Statpub and support the collaborative business process (creation, authoring, proofreading, translation, publication).
- ODF custom generators, to facilitate the construction of ODF fragments containing tabular data and charts generated from external data sources.
- Digesters, to process and homogenise information coming from different data sources.
- Custom transformers, to produce publications compatible with the different output channels (PDF, mini web sites, etc.).
- Metadata extractors, to extract metadata contained in ODF documents and populate the Content Management System.
- Metadata assemblers, to stamp metadata on exported documents in order to facilitate tracking and control.

2.5. XML authoring solution

A large number of Eurostat users create, review, translate, proofread and approve publications as part of their daily activities. In order to minimize the impact in terms of re-training, the XML-based publishing solution to be put in place will let users operate on documents with the help of their usual tools (for example Microsoft Word for editing text documents).

Even though the internal representation of documents will be based on XML (ODF), users will not be required to deal with XML authoring directly. The users will use the existing tools (mainly MS Office) and the publishing solution will convert the documents into an ODF representation.

A key finding of the study is that carefully designed templates can maximize both conversion accuracy and metadata synchronization (properties in documents that must reflect the value of metadata stored in the content repository). Alfresco ODF support proved to be the least intrusive alternative (zero installation) while providing proper conversions in most of the tested cases (a set of Eurostat publications was used to test each analyzed alternative).

3. CONCLUSIONS

As the next step, this project will continue with a pilot. The main goal of the pilot is to implement the automated production of a selected Eurostat publication: Eurostatistics. It was chosen because it is a periodical and part of the core publications programme, so automation would bring long-lasting benefits, both in the effort needed to produce it and in a substantial reduction of production time.
Last but not least, most of the content can be database-generated. This should simplify the implementation of the pilot.

Provided that the pilot project is successful, implementation of other publications will follow. The work on the pilot should start during October and, after the initial specifications phase (lasting three months), several incremental prototypes will be developed. This development phase is foreseen to last 7 months. The last four months of the project will be devoted to testing and putting the system into actual production.

NSIs are welcome to participate actively in the XML task force, which is expected to meet on an annual basis. NSIs are also invited to use the study conducted by Eurostat for their own purposes, as well as any further results made available (all relevant documents are or will be made available on the Circa site of the Dissemination Working Group).

Since the Eurostat system will be integrated into the existing infrastructure (mainly the Eurostat workflow management tool called Statpub), simple re-use of the full application will be difficult. However, the overall concept could be of use. Further, provided that other projects are based on the same architecture, some of the custom-developed components (like templates or transformation style sheets) could be re-used. Eurostat will gradually make all future developments freely available, and those who are interested are encouraged to contact Eurostat Unit B6 for further details.