Doc. Eurostat/ITDG/October 2007/2.3.1

IT Directors Group
15 and 16 October 2007
BECH Building, 5, rue Alphonse Weicker, Luxembourg-Kirchberg
Room QUETELET
9.30 a.m. - 5.30 p.m.
9.00 a.m. - 1.00 p.m.

XML-based production of Eurostat publications
Item 2.3.1 of the agenda
1. BACKGROUND

Eurostat's yearly publication programme includes approximately 100 larger publications (collections Statistical books, Pocketbooks, Methodologies and Working papers, Methods and nomenclatures, Detailed tables) and more than 200 shorter publications (collection Statistics in Focus and the new collection Data in Focus).

While the shorter publications are produced using an MS Word add-on called SIF/DIF-Kit, no similar technical tool offering satisfactory graphical quality exists for the larger ones. Instead, the choice of tool or standard is usually left to contractors, which makes it difficult to exchange or reuse the source files when the contractor changes. Also, there is no common Eurostat tool for producing different output formats such as web PDF, print PDF or HTML. This might be acceptable for some ad hoc publications, but the majority of Eurostat titles are produced on a regular basis over many years, and a common tool and standard could bring many benefits, in particular if the layout stays more or less stable for several editions of the same publication.

An MS Word add-on like SIF-Kit would limit the layout possibilities too much. First, it would not allow for graphically high-quality solutions. Second, the publication's layout would need to be hard-coded in the software tool itself, so it would be impossible to build a generic tool usable for more than one publication. Therefore, this approach was not investigated further.

Another option is to use XML (Extensible Markup Language), a widely used standard for sharing documents and data between different information systems, in particular via the Internet. The manuscript is either created by the author directly in XML (e.g. as a database extraction) or converted into XML from another format (e.g. an MS Word or Excel document).
As a next step, the XML source is transformed into a presentational form: web- or print-optimized PDF, HTML or other structured document formats.

Eurostat unit B6 "Dissemination" conducted a survey on the usage of XML in the dissemination process among the National Statistical Institutes of the Member States (NSIs). The results were presented at the Dissemination Working Group meeting held on 4-5 May 2006. Of the 17 NSIs that replied to a questionnaire distributed by Eurostat, more than half (10) use XML at various stages of the production process of their data, publications and websites. The most common uses of XML are data exchange (as a common format to share data) and the production of website content. Moreover, two NSIs (Statistics Finland and Statec Luxembourg) use XML-based solutions for the production of publications.

At the request of the Dissemination Working Group, Eurostat organized a Task Force on using XML in dissemination, and particularly in publishing, in October 2006. During this Task Force, five Member States, the ECB and OPOCE presented their experiences; the Finnish and Luxembourgish NSIs presented their operational solutions.
In November 2006 Eurostat launched a feasibility and implementation study on using XML for producing Eurostat publications. This paper summarizes the results of this study. The results of both the Task Force and the study were made available to the NSIs via Circa.

2. XML-BASED PUBLISHING SYSTEM

2.1. Project goals and high level requirements

The demand that triggered the search for alternatives to today's traditional production processes was the reduction of production times. Fast and seamless assembly of periodically produced Eurostat publications is the main objective. However, an XML-based system gives the opportunity to cover other objectives as well.

Today, all publications can be downloaded (as PDF) from the Eurostat website, but no browsable web version exists, although many publications would deserve one. The XML-based publishing system should therefore also allow easy dissemination in multiple output formats (paper, PDF, HTML).

All Eurostat publications are based on teamwork, yet no system exists in Eurostat that supports collaboration among multiple authors. Keeping track of changes and versions can easily become a daunting task. The new system should therefore assist the authors who contribute to the collective work.

Besides the main objectives, the XML format should also allow format-independent archiving of all publications and improve consistency in layout. Another possible application might be the common use of one XML format in order to make browsable versions of publications easily available on the websites of the NSIs.

As the system is intended for non-specialists, it should be designed so that authors do not need to understand the technology behind it. That means that neither familiarity with XML nor extensive training should be necessary for basic use.
2.2. Basic concept

The high-level concept of the system to be developed relies on a single format (XML) into which various other inputs are transformed. As the schema below shows, data from Eurostat dissemination databases (New Cronos, Comext) are combined with data that are not in these databases (coming as CSV or Excel files), text (coming as an MS Word document) and graphics (e.g. maps). The individual content fragments are then assembled into the final document and transformed into the required output format.

2.3. XML Schema

A publishing process based on XML has a number of significant advantages. Content defined in XML is platform- and software-independent. It is also independent of a particular display format, since XML separates content from presentational information. This simplifies the generation of multiple formats from a single source using technologies like XSLT. In addition, it allows the content to remain compatible with emerging publication formats, by defining an appropriate transformation to those formats.

Best practices suggest that, before designing a new format, designers should look for existing XML vocabularies for similar data. Ideally these can be reused, in which case many supporting artefacts such as DTDs, schemas and style sheets may already be available.

The Eurostat study concluded that Open Document Format (ODF) is the leading format with the biggest potential and flexibility. Its advantages clearly outweigh its disadvantages. As an ISO standard, ODF has been widely adopted by the industry and is well supported by most common office tools. Furthermore, from a technical point of view, the schema is clear, well structured and easily convertible into any other format using XSLT.

ODF is an office format and cannot represent the semantics of Eurostat publications.
Since the format will be used internally by the XML-based publishing solution and users (or administrators) will not be exposed to the format directly, this does not impose any limitation or cause any potential problem. It is important to mention that the application will be responsible for mapping the semantics of Eurostat publications to the internal format chosen.

To extend the selected XML format used to persist documents (basically typographical data) with the extra metadata required to drive the publication process, an object model will be created. It will be composed of:

- data (the actual representation of the document in ODF format); and
- metadata (data describing the document, like publication ID, author, title).

The proposed model will be defined so as to take advantage of the content management capabilities of Alfresco, including metadata management, versioning and a clear distinction between content (data) and properties (metadata). This will allow the ODF format to be extended to support the process, eliminating the need for an ad hoc format.

Following this approach, an object will be constructed for each publication (or publication component) as seen below:

[Figure: publication object combining an ODF document with process metadata: Title, Publication ID, Author(s), Publication date, Reviewer(s)]

In the scenario presented above, the ODF document would store the contents of the publication, while the process metadata would represent the information required to drive the business process.
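The split between data and metadata described above can be illustrated with a minimal Python sketch. It is not the Alfresco content model and not the design produced by the study: the class names ProcessMetadata and PublicationObject, the helper make_odf_fragment and the identifier PUB-001 are hypothetical, and the fragment is a simplified office:text element rather than a complete ODF package. Only the two namespace URIs are taken from the OpenDocument specification.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional
import xml.etree.ElementTree as ET

# Namespace URIs defined by the OpenDocument specification.
OFFICE_NS = "urn:oasis:names:tc:opendocument:xmlns:office:1.0"
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
ET.register_namespace("office", OFFICE_NS)
ET.register_namespace("text", TEXT_NS)


@dataclass
class ProcessMetadata:
    """Hypothetical metadata record that drives the publication process."""
    publication_id: str
    title: str
    authors: List[str]
    reviewers: List[str] = field(default_factory=list)
    publication_date: Optional[date] = None


@dataclass
class PublicationObject:
    """Hypothetical object pairing ODF content (data) with metadata."""
    metadata: ProcessMetadata
    odf_content: str  # serialized XML; a real system would hold a full ODF package


def make_odf_fragment(paragraphs: List[str]) -> str:
    """Build a simplified office:text element with one text:p per paragraph."""
    root = ET.Element(f"{{{OFFICE_NS}}}text")
    for content in paragraphs:
        ET.SubElement(root, f"{{{TEXT_NS}}}p").text = content
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    publication = PublicationObject(
        metadata=ProcessMetadata(
            publication_id="PUB-001",  # hypothetical identifier
            title="Eurostatistics",
            authors=["A. Author"],     # placeholder name
        ),
        odf_content=make_odf_fragment(["First paragraph.", "Second paragraph."]),
    )
    print(publication.metadata.title)
    print(publication.odf_content)
```

Keeping the metadata outside the ODF content mirrors the distinction between content and properties made above: the process fields can be changed or versioned without touching the document itself.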
2.4. System architecture

The Eurostat XML-based publishing solution will be built on the Alfresco architecture. Alfresco uses state-of-the-art core components that, assembled together, provide a powerful, scalable and reliable content management foundation.

The system will use a file system to store documents and a relational database (like Oracle) to persist metadata and internal business-related information. It will also be linked to an LDAP directory to manage user rights. Both user actions and system management will be performed via the Alfresco web interface.

The Alfresco framework will be integrated with custom-made components, whose role will be to perform actions not supported directly by the framework. The preliminary list of components to be developed includes:

- User interface components, built to integrate seamlessly with Statpub.
- Workflow custom components, to facilitate the integration with Statpub and support the collaborative business process (creation, authoring, proofreading, translation, publication).
- ODF custom generators, to facilitate the construction of ODF fragments containing tabular data and charts generated from external data sources.
- Digesters, to process and homogenise information coming from different data sources.
- Custom transformers, to produce publications compatible with the different output channels (PDF, mini web sites, etc.).
- Metadata extractors, to extract metadata contained in ODF documents and populate the Content Management System.
- Metadata assemblers, to stamp metadata on exported documents in order to facilitate tracking and control.

2.5. XML authoring solution

A large number of Eurostat users create, review, translate, proofread and approve publications as part of their daily activities. To minimize the need for retraining, the XML-based publishing solution will let users work on documents with their usual tools (for example Microsoft Word for editing text documents). Even though the internal representation of documents will be based on XML (ODF), users will not be required to deal with XML authoring directly. They will use the existing tools (mainly MS Office) and the publishing solution will convert the documents into an ODF representation.

A key finding of the study is that, by using carefully designed templates, conversion accuracy and metadata synchronization (properties in documents that must reflect the value of metadata stored in the content repository) can be maximized. Alfresco ODF support proved to be the least intrusive alternative (zero installation) while providing proper conversions in most of the tested cases (a set of Eurostat publications was used to test each analyzed alternative).

3. CONCLUSIONS

As the next step, this project will continue with a pilot. The main goal of the pilot is to implement the automated production of a selected Eurostat publication: Eurostatistics. It was chosen because it is a periodical and part of the core publications programme, so automation would bring long-lasting benefits both in the effort needed for production and in a substantial reduction of production time.
Last but not least, most of the content can be database-generated, which should simplify the implementation of the pilot. Provided that the pilot project is successful, implementation of other publications will follow.

Work on the pilot should start during October. After the initial specification phase (lasting three months), several incremental prototypes will be developed. This development phase is foreseen to last seven months. The last four months of the project will be devoted to testing and putting the system into actual production.

NSIs are welcome to participate actively in the XML Task Force, which is expected to meet on an annual basis. NSIs are also invited to use the study conducted by Eurostat for their own purposes, as well as any further results made available (all relevant documents are or will be made available on the Circa site of the Dissemination Working Group).

Since the Eurostat system will be integrated into the existing infrastructure (mainly the Eurostat workflow management tool called Statpub), simple reuse of the full application will be difficult. However, the overall concept could be of use. Further, provided that other projects are based on the same architecture, some of the custom-developed components (like templates or transformation style sheets) could be reused. Eurostat will gradually make all future developments freely available, and those who are interested are encouraged to contact Eurostat Unit B6 for further details.