ESSnet CORA (COmmon Reference Architecture). WP3: Definition of the Layered Architecture. Deliverable 3.2: Technical Annex. Statistics Netherlands


WP3: Definition of the Layered Architecture
Deliverable 3.2: Technical Annex

Partner in charge: Statistics Netherlands
Version: 1.0
Date: October 2010

Version  Changes                                          Date
0.1      Creation                                         08/09/2010
0.2      Peer Review by Statistics Netherlands            24/09/2010
0.3      Example implementation from Stat Sweden added    06/10/2010
1.0      Final version                                    October 2010

Contents

1. Introduction
   1.1. Scope
   1.2. Basic principle
   1.3. Organisation of this document
   1.4. References
2. The Common Generic Service Interface
3. Implementation guidelines
4. Example implementations
   4.1. Event Driven Architecture at Statistics Sweden
        4.1.1. Introduction
        4.1.2. Processes, business objects and IT-tools
        4.1.3. Value chains
        4.1.4. Event-driven architecture and orchestrations
        4.1.5. Conclusion
   4.2. Generic Interface Architecture at Statistics Netherlands
        4.2.1. Introduction
        4.2.2. Generic Information Exchange Model
        4.2.3. Generic functional specifications of service data exchange
        4.2.4. Mapping of information models
        4.2.5. The resulting generic statistical process
        4.2.6. Conclusion

This document is distributed under the Creative Commons licence "Attribution-Share Alike 3.0", available at: http://creativecommons.org/licenses/by-sa/3.0

1. Introduction

The goal of the CORA project has been to define a platform for exchanging implementations of statistical processes at an EU / international level. The project has covered issues such as licensing, the current use of potentially shareable tools, and a layered model to enable communication about statistical processes on a common level. All these deliverables are important for reaching CORA's goals. They are, however, still at a theoretical and experimental stage. This document describes a starting point for implementing the CORA layered model for the production of statistics by means of common generic service interfaces.

1.1. Scope

CORA is a platform-independent framework: it does not offer ready-made implementation solutions. This is why the scope of this document is limited to the activity of modelling statistical services.

1.2. Basic principle

CORA allows you to produce a model that describes a statistical process in statistical terms. This is not a technical model telling you how the work will be executed. It is a logical model telling you how to define statistical services, what kind of data they expect as input, and what kind of data they will deliver as output. It gives a picture of the statistically relevant characteristics of services.

1.3. Organisation of this document

The description of CORA's common generic service interface is followed by some implementation guidelines on how to put it to work. Finally, projects within Statistics Sweden and Statistics Netherlands are described that can be seen as partial implementations of the ideas developed within CORA.

1.4. References

This document draws upon the description of the CORA model given in Deliverable 3.1 of this project, Description of the Layered Architecture. Chapter 3 of that document provides a complete description of the model used here. The reader of this document is kindly requested to refer to it whenever needed.

2. The Common Generic Service Interface

The CORA requirement that implementations of statistical processes should be shareable between national statistical agencies, together with the idea of a layered model of the production of statistics based on the Generic Statistical Business Process Model (GSBPM), led to the idea that these implementations should be built on common generic service interfaces. These interfaces encapsulate the statistical functionality of actors in the statistical process, both human actors and tool-based actors.

The following diagrams represent these generic service interfaces. The first one describes the interface of design services used to design a statistical process.

Figure 1: design time common generic service interface. (The diagram shows a Service Core wrapping a Tool, with user input, in- and output model management against the core's internal data model(s), and parameter management covering tool-specific parameter definitions and tool-coded fixed parameter values; the prescript generated by the core is passed to a Service Generator.)

The second diagram describes the interface of run time services that manipulate data at run time to produce the desired output of the statistical process.

Figure 2: run time common generic service interface. (The diagram shows the Service Core wrapping a Tool; it receives integrated input data according to the core's internal data model from a data transformer, tool-coded parameter values from a parameter supplier, and the prescript generated by the design time core from design storage, and it delivers output data according to the core's internal data model to a data transformer.)
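To make the two interfaces more concrete, the sketch below gives one possible, hypothetical rendering of them in Java. CORA is platform independent, so all type and method names here (DataModel, Prescript, DataSet and so on) are illustrative assumptions rather than part of the deliverable.

    // Illustrative sketch only: hypothetical Java declarations for the design time and
    // run time common generic service interfaces of Figures 1 and 2. All names are
    // assumptions; CORA does not prescribe an implementation language or API.

    // Minimal marker types standing in for the concepts used in the diagrams.
    interface DataModel {}            // an internal or external data model
    interface ParameterDefinition {}  // a tool-specific parameter definition
    interface Prescript {}            // the rules generated by a design time core
    interface DataSet {}              // data conforming to some data model

    // Figure 1: the design time core wraps a tool, manages in- and output models and
    // parameters on the basis of user input, and generates a prescript.
    interface DesignTimeServiceCore {
        void manageInputModel(DataModel internalModel);
        void manageOutputModel(DataModel internalModel);
        void defineParameter(ParameterDefinition definition);
        void fixParameterValue(String parameterName, String toolCodedValue);
        Prescript generatePrescript();   // handed to the service generator
    }

    // Figure 2: the run time core receives integrated input data, tool-coded parameter
    // values and the design time prescript, and produces output data conforming to the
    // core's internal data model.
    interface RunTimeServiceCore {
        void acceptInput(DataSet integratedInput);
        void supplyParameterValue(String parameterName, String toolCodedValue);
        DataSet execute(Prescript designTimePrescript);
    }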

The following diagram shows these two interfaces in connection with each other. First of all, the run time services are generated from designs created with design time services. Run time prescripts (rules), parameter values known at design time and internal data models are encapsulated within the generated run time services.

Figure 3: relationship between design time and run time services. (The diagram shows the design time side, where user input drives transformation design against external in-/output transformation models, management of the internal data model(s), tool-specific parameter definitions and tool-coded fixed parameter values, and the generation of the prescript, the service definition and the design time parameter values; and the run time side, where the generated service core receives input data from multiple sources according to external input data models, integrates it according to the core's internal data model, applies the prescript together with run time parameter values, and delivers output data to multiple targets according to external output data models.)

This diagram also introduces the concept of transformation design: the design of transformations that read data produced by one service and transform it to data according to a model that a second service understands.
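Continuing the hypothetical sketch above, transformation design can be expressed as the activity that produces a data transformer between two external models; the names below are again illustrative assumptions, not part of the CORA deliverable.

    // Illustrative sketch of transformation design: a transformer reads data produced
    // by one service according to an external output model and rewrites it according to
    // the external input model understood by a second service. Hypothetical names only.

    interface ExternalDataModel {}      // an external (in/out)put data model
    interface ExternalDataSet {}        // data conforming to such an external model

    interface DataTransformer {
        ExternalDataModel sourceModel();                  // the model read from
        ExternalDataModel targetModel();                  // the model written to
        ExternalDataSet transform(ExternalDataSet data);  // rewrite source data into the target model
    }

    // A transformation design couples two services by producing such a transformer.
    interface TransformationDesign {
        DataTransformer design(ExternalDataModel producedByFirstService,
                               ExternalDataModel expectedBySecondService);
    }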

3. Implementation guidelines

The following table contains implementation guidelines that should be kept in mind when implementing a CORA-compliant architecture. These guidelines are based on experimentation within the participating statistical agencies.

Nr   Guideline
01   Statistical design processes should be separated from statistical production processes. Therefore, statistical design functionality should be encapsulated in service interfaces that are separated from run-time service interfaces.
02   A run-time service interface that can be designed (which means it can work with different data models and/or configurable rules) should have a related design-time service interface.
03   Data that is communicated between service interfaces corresponds to CORA's layered model of statistical processes.
04   All services share the same formal interface definitions; runtime optimisations are permitted provided they are transparent.
05   At their interfaces, services are implementation independent. Service cores, however, are inherently implementation dependent.
06   Services cannot extend beyond their own layer (max layers = 1, max GSBPM subprocesses = 1).
07   Operational data stores cannot exist outside services.
08   Workflow cannot exist outside services; orchestration might.
09   Prime movers are always at the top level (demand based) but may end in event based services at low levels which wait for input or events.

4. Example implementations

As mentioned earlier, CORA only sets out general principles about how to model statistical processes and services. Some of the participating countries, however, have implemented aspects of this general idea in their own organizations. This chapter describes two example implementations, from Statistics Sweden and Statistics Netherlands. The first is an example of how to orchestrate processes using an event driven architecture and business objects. The second is an example of how to implement CORA's common generic service interface based on SDMX.

4.1. Event Driven Architecture at Statistics Sweden

4.1.1. Introduction

In recent years, Statistics Sweden has realized that a substantial part of our development budget has gone to the data collection and data processing processes and the underlying IT-systems. The focus has therefore been on modernizing these processes and IT-systems. To get a coordinated view of the development in these processes, the development has been gathered under the project name Triton.

The vision for the Triton project is to provide the production staff with the possibility of selecting and configuring the needed services without any knowledge of the underlying IT-structure or programming skills. The Triton project has therefore strived to provide a set of integrated IT-tools and services for the design, data collection and data processing processes of the GSBPM.

4.1.2. Processes, business objects and IT-tools

To provide IT-services for the processes, the Triton project began by looking at the GSBPM but shifted viewpoint somewhat and focused instead on which information (data and metadata) is necessary for executing a process and which information is created as the output of a completed process. Since information created in earlier processes is used in later processes, the Triton project tries to make the processes speak the same language. This does not mean that the same programming language is used in the IT-services for each process, but rather that the same business language is used. If the process for designing a web questionnaire uses the same business language, and means the same thing when describing a question, as the process of presenting this question to the respondent, these processes can talk to each other. The project has therefore spent much time creating this Business Information Model, which is used for communicating information (data and metadata) between GSBPM-processes.
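As a purely hypothetical illustration of what speaking the same business language can look like, the sketch below shows a questionnaire question and its answer expressed as shared business objects that both a design service and a collection service could understand. The class and field names are assumptions and are not taken from the actual Triton Business Information Model.

    // Hypothetical sketch: two business objects from a shared business information model.
    // If the questionnaire design process and the web collection process both use these
    // definitions, the two processes can talk to each other.

    import java.util.List;

    // Created by the design process, consumed by the collection process.
    record Question(
            String questionId,        // stable identifier of the question
            String text,              // the text presented to the respondent
            String variableId,        // the statistical variable the answer populates
            List<String> codeList) {} // allowed answer codes, if any

    // Created by the collection process, consumed by later processing services.
    record Answer(
            String questionId,        // refers back to the designed question
            String respondentId,      // who supplied the value
            String value) {}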

Figure 4: Each process needs data and metadata input and results in new or refined data and/or metadata.

The Business Information Model does not only cover information regarding the data, such as Variable, Sample, Classification and so on, but also covers business objects such as Staff, Customer, Process, Cell in questionnaire and Reminder to respondents. This means that the configuration (metadata) made in the design process is communicated as a business object to create a metadata driven production process. This approach therefore does not distinguish much between the services in the design process and the services in the subsequent processes, since all processes in this viewpoint create some form of Business Object.

4.1.3. Value chains

By shifting the GSBPM viewpoint from the end user view to the Business Object view, it is possible to observe where each type of Business Object is created and to follow which subsequent processes add value/information to the Business Object. In the case of the Answers, which the respondent enters in the web questionnaire, we describe the value chain: the route that this Business Object should take, given the configuration made by the production staff in the design process. Each instance of the Business Object Answers will therefore be transported between the preconfigured services in the value chain.

4.1.4. Event-driven architecture and orchestrations

To transport the data/metadata (in Business Object form) between the preconfigured services, a communication platform is used. This communication platform does not contain any data processing logic and is only responsible for transporting Business Objects; the data processing is done in the services. The communication is built using a middleware product.
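The following is a hedged sketch of how such a value chain and its event driven transport might look in code; the platform API shown is invented for illustration and is not the middleware product used at Statistics Sweden.

    // Hypothetical sketch of an event driven communication platform that transports
    // business objects along preconfigured value chains. The API is illustrative only;
    // the platform holds no data processing logic, the services do the processing.

    import java.util.List;
    import java.util.Map;
    import java.util.function.UnaryOperator;

    final class CommunicationPlatform {

        // Per business object type: the ordered services the object should visit,
        // as configured by the production staff in the design process.
        private final Map<String, List<UnaryOperator<Object>>> valueChains;

        CommunicationPlatform(Map<String, List<UnaryOperator<Object>>> valueChains) {
            this.valueChains = valueChains;
        }

        // Called as soon as a new business object is created: the object is passed
        // on to the preconfigured services in its value chain, each of which adds
        // value/information to it.
        Object publish(String businessObjectType, Object businessObject) {
            Object current = businessObject;
            for (UnaryOperator<Object> service : valueChains.getOrDefault(businessObjectType, List.of())) {
                current = service.apply(current);
            }
            return current;
        }
    }

A real platform would transport the objects asynchronously and, as described below, would also route them conditionally depending on the outcome of a service such as the data checking step.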

Figure 5: Each instance of a business object is transported between preconfigured services.

Since the production staff should be able to select the relevant services in the design process, the communication platform needs to be able to orchestrate (route) each message (Business Object) in a specific way. Another reason why the communication platform needs orchestration capability is that the route a message takes may depend on the outcome of a specific service. In the example of the Answers, the service for manual investigation should only be called if the Check data service finds any problems with the answers that need manual investigation.

The form of communication in the platform is event driven. This means that as soon as a new business object is created and sent to the communication platform, it is immediately transported to the next service in the value chain. This architecture was selected to overcome some of the performance issues that a SOA-based architecture can give rise to.

4.1.5. Conclusion

The architecture in the Triton project has many similarities with the efforts and ideas in the CORA model and, since this architecture was created before the CORA model, also some differences. Both projects do, however, try to minimize the gap between the GSBPM model and the underlying IT-systems. The Triton project relies heavily on the Business Information Model to enable the processes to communicate with each other, and our conclusion is that this will be a necessity in a common reference architecture and environment.

4.2. Generic Interface Architecture at Statistics Netherlands

4.2.1. Introduction

Statistics Netherlands has made a first attempt to define and implement a small aspect of the statistical data processing process in terms of statistical services with a generic statistical service interface. This was done in the scope of the SPIES project. The Project Initiation Document of SPIES stated the following product definition: to build a limited set of standard combinations of tools that are commonly used to implement statistics; in other words, to define a couple of production lines. The main goal is to optimize those combinations and to develop integration components specifically for these tool combinations.

As a pilot, generic service interfaces were built around two tools:

- Digros: a standardized set of tables and procedures to store data under version control according to a data vault structure in a common SQL database management system (SQL Server).
- R: an open source, internationally developed statistical programming language with a vast library of statistical functions.

The goal of the pilot was to wrap those two tools with the same generic service interfaces and then let them exchange data using these interfaces.

4.2.2. Generic Information Exchange Model

The first step in accomplishing the SPIES goals was to establish a generic information exchange model, i.e. an intermediate data model by which Digros and R and, in the future, other tools could communicate. This model had to hide the specifics of Digros, R and any other specific tool to be truly CORA compliant, as CORA defines services as statistically meaningful and independent of specific implementations and tools. To maintain a link with the statistical process and international developments, it was decided that the generic information exchange model should be based on SDMX, the statistical data exchange format. To keep things simple, only a subset of SDMX was chosen. The following diagram shows which SDMX concepts are used in SPIES.

Figure 6: the SDMX subset used by SPIES. (The class model covers the cross-sectional data structures XSDataSet (reporting period, data extraction date), Group, Section, XSObservation, XSComponent and KeyValue, together with the structural metadata classes DataflowDefinition, KeyFamily, GroupKeyDescriptor, Dimension, MeasureTypeDimension, XSMeasure, CodeList and Code.)

4.2.3. Generic functional specifications of service data exchange

The requirements for SPIES have been modelled in a very generic way: they just state that any "Tool X" should conform to generic (CORA inspired) interface requirements. If it does, it can be integrated into a SPIES environment and talk to other SPIES compliant services. The abstract functionality of the SPIES design environment is shown in the use case model below.

Figure 7: abstract design functionality in SPIES. (Use case model: a statistical designer working with the X design environment uses "Design X Output to SPIES" and "Design X Input from SPIES", together with the included use cases "Import SPIES-Metadata into X" and "Generate SPIES-Metadata from X", the latter drawing on the X run-time environment.)

Any tool that is made SPIES compliant might implement this use case model. As indicated before, in the pilot performed at Statistics Netherlands the tools R and Digros were "SPIES-E-FIED". Please note that SPIES complies with the guideline that design functionality should be separated from runtime functionality. The diagram below shows the abstract SPIES run time functionality.

Figure 8: abstract run time functionality in SPIES. (Use case model: the SPIES run-time and the X run-time environment use "Export X Output to SPIES" and "Import X Input from SPIES".)

4.2.4. Mapping of information models

An important aspect of the services defined in the CORA project is the notion of mappings between the information domain external to statistical services and the internal information domain. Within SPIES, both the internal information model of a service ("tool X") and the information model of the service environment (the SPIES model) are exposed in design time functionality, where users can define the mappings between them. This requires that each tool that complies with SPIES has a formally defined internal data model.
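A minimal, hypothetical sketch of such a mapping facility is given below; the class and concept names are invented for illustration and do not correspond to the actual SPIES design environment.

    // Hypothetical sketch of design time mapping functionality: the internal model of
    // "tool X" and the SPIES exchange model are both formally described, and the user
    // defines how elements of the one correspond to concepts of the other.

    import java.util.ArrayList;
    import java.util.List;

    final class ModelMapping {

        enum SpiesRole { DIMENSION, MEASURE }   // roles taken from the SDMX subset of Figure 6

        // One user-defined correspondence between an internal element and a SPIES concept.
        record Entry(String internalElement, String spiesConcept, SpiesRole role) {}

        private final List<Entry> entries = new ArrayList<>();

        void map(String internalElement, String spiesConcept, SpiesRole role) {
            entries.add(new Entry(internalElement, spiesConcept, role));
        }

        List<Entry> entries() {
            return List.copyOf(entries);
        }
    }

    // Example (hypothetical element names): mapping two columns of a tool's internal table.
    //   mapping.map("RegionCode", "REGION", ModelMapping.SpiesRole.DIMENSION);
    //   mapping.map("Income", "INCOME", ModelMapping.SpiesRole.MEASURE);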

The model below was created to translate data in R (in so called data frames) to the SPIES environment and vice versa.

Figure 9: mapping of R data onto the common generic service interface. (An R data.frame is described by a Spies structure consisting of globals (schemaid, datasetid, creationdate, versiondate, filename), codelists (id, name, isexternal, and coderecords with code and description) and columns (id, name, texttype, length, isdimension, codelist).)

The same was done for the internal data model of Digros; it was translated to the common generic service interface's model as well. This is shown in the diagram below.

Figure 10: mapping of Digros data onto the common generic service interface. (The Digros data model, with its global and Digros invariants, dataset and data source identifiers, sources, data levels, matrices and variables, is mapped onto the same interface model.)
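To give an impression of what such a translation involves, the sketch below shows simplified structures in the spirit of the SPIES/SDMX subset of Figure 6 into which a mapped data frame could be expressed; the record and field names are simplifications and assumptions, not the actual SPIES schema.

    // Hypothetical sketch: simplified structures in the spirit of the SPIES/SDMX subset
    // of Figure 6 into which a mapped R data frame could be translated. Names simplified.

    import java.util.List;
    import java.util.Map;

    // Structural metadata: dimensions (with code lists) and measures, roughly
    // corresponding to KeyFamily, Dimension, XSMeasure and CodeList in Figure 6.
    record CodeList(String id, Map<String, String> codes) {}      // code -> description
    record Dimension(String id, CodeList codeList) {}
    record Measure(String id) {}
    record KeyFamily(String id, List<Dimension> dimensions, List<Measure> measures) {}

    // Data: a cross-sectional data set of observations keyed by dimension values,
    // roughly corresponding to XSDataSet and XSObservation in Figure 6.
    record Observation(Map<String, String> key, String measureId, String value) {}
    record XSDataSet(String reportingPeriod, String dataExtractionDate,
                     List<Observation> observations) {}

In terms of Figure 9, a column flagged as isdimension would become a dimension with its code list, while the remaining columns would become measures whose cell values end up as observations.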

4.2.5. The resulting generic statistical process

As a final step, the statistical services described above, consisting of the proprietary, custom built data storage tool Digros and the international open source language R, both encapsulated behind the generic common service interface, were combined into a miniature statistical process as shown below.

Figure 11: a miniature process built from generic services. (An Archive service (GSBPM 8.3, Digros with Digros models) feeds an Aggregate service (GSBPM 5.7, R with an R script), which feeds a second Archive service (GSBPM 8.3, Digros with Digros models); the services exchange data via SDMX.)

This process performs the simple task of reading income data at the person level (entity level data according to the CORA layered model) from an archive: a generic service with Digros as the service core, defined by several Digros storage models. This data is then communicated through SPIES/SDMX to an Aggregation service, which is a combination of the R environment with a specific R script. This service aggregates the personal incomes to a regional level, thus creating population data according to the layered model. The resulting data is, through SPIES/SDMX, stored in an Archive service (again based on Digros).

4.2.6. Conclusion

Within Statistics Netherlands, some important principles regarding a service-based structure, generic service interfaces and abstract data levels were applied in a concrete situation where two independent tools were encapsulated in such a way that they are able to interchange data without knowing about each other's implementation (technical) details. In the CORE project this idea might allow for the exchange of such services on the European level.
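Purely as an illustration, and building on the earlier hypothetical interfaces rather than on actual SPIES code, the miniature process of Figure 11 can be thought of as three run time services wired together through SDMX:

    // Hypothetical sketch of the miniature process of Figure 11: an Archive service,
    // an Aggregation service and a second Archive service exchanging data via SDMX.
    // All names and identifiers are invented for illustration.

    final class MiniatureProcess {

        interface SdmxDataSet {}                                 // data in the SPIES/SDMX exchange form

        interface ArchiveService {                               // Digros behind the generic interface
            SdmxDataSet read(String datasetId);                  // GSBPM 8.3: retrieve archived data
            void store(String datasetId, SdmxDataSet data);      // GSBPM 8.3: archive the result
        }

        interface AggregationService {                           // R plus a specific R script
            SdmxDataSet aggregate(SdmxDataSet entityLevelData);  // GSBPM 5.7: person to regional level
        }

        // Read person level income data, aggregate it to regional level, and archive
        // the resulting population data, all communicated as SDMX.
        static void run(ArchiveService source, AggregationService aggregator, ArchiveService target) {
            SdmxDataSet personLevelIncome = source.read("income-person-level");     // hypothetical id
            SdmxDataSet regionalIncome = aggregator.aggregate(personLevelIncome);
            target.store("income-regional-level", regionalIncome);                  // hypothetical id
        }
    }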