CORA COmmon Reference Architecture
1 CORA COmmon Reference Architecture Monica Scannapieco (Istat), Carlo Vaccari (Università di Camerino), Antonino Virgillito (Istat)
2 Outline Introduction (90 mins) CORE Design (60 mins) CORE Architectural Components (90 mins) Illustration of CORE Platform (135 mins) Case studies (90 mins) CORE Follow-up (60 mins)
3 Introduction
4 CORE Generalities Principal Outcome: Environment for the definition and execution of standard statistical processes Definition of a process in terms of available services Execution of the composed workflow
5 CORE Generalities Plug and play approach to process execution Process View Service Repository
6 CORE Generalities Plug and play approach to process execution Process View Service Repository Data View
7 Why CORE? START Allocation (MAUSS R) Estimation (ReGenesees) Selection (SAS Script) STOP
9 Why CORE? START Technological Heterogeneity Different technologies Different formats Allocation (MAUSS R) Estimation (ReGenesees) Selection (SAS Script) STOP
10 Why CORE? START Technological Heterogeneity: different technologies, different formats Data Heterogeneity: different names for variables, variables as combinations of other variables... Allocation (MAUSS R) Estimation (ReGenesees) Selection (SAS Script) STOP
11 Why CORE? Technological heterogeneity can be solved by solutions available on the market CORE makes it possible to solve both technological and data heterogeneity in a single environment
12 CORE Vision 1. Abstract services: well-defined, technology-independent functionalities implemented by different IT tools; 2. Statistical process: workflow defined in terms of available services; 3. Data model: standardization of the semantics/format of services data, i.e. definition of the domain entities involved as input/output between services.
13 CORE Vision 1. Abstract services: well-defined, technology-independent functionalities implemented by IT tools Allocation (MAUSS R) Estimation (ReGenesees) Selection (SAS Script)
14 CORE Vision 3. Data model: standardization of the semantics/format of services data 3.1 Domain descriptor (DD) Allocation (MAUSS R) DD <schema name="demo_dd"> <entity name="sampleplan"> <property name="var"/> <property name="size"/>... </entity> </schema> Selection (SAS Script) 3.2 Mapping to/from DD
15 CORE Design Tasks - 1 Design of services Definition of integration APIs (IAPIs) Data conversion between the CORA model and tool-specific formats Graphical front ends for designing schemas and mappings
16 CORE Design Tasks - 2 Design of processes How to define and execute processes within CORE Modelling language Execution Visual interfaces design Design of a service repository
17 CORE Design Tasks - 3 Design of exchanged data Definition of data models and formats (plain XML/XSD, SDMX ) to be used for data exchanges Definition of metadata necessary for process execution SDMX Relationships
18 CORE Design
19 CORE Design: Services Abstract services: specify a well-defined functionality in a technology-independent way An abstract service can be implemented by one or more concrete services, i.e. IT tools Examples: sample allocation, record linkage, estimates and errors computation, etc.
20 CORE Design: Services GSBPM classification Documentation purpose Provided that a CORE service can be linked to IT tools, GSBPM tagging enables searches, e.g. retrieving all the IT tools implementing the 5.4 Impute sub-process of the GSBPM proposal
21 CORE Design: Services Service inputs and outputs Specified by logical names Characterized with respect to their role in data exchange Non-CORE: if they are not provided by/to other services of the process, but are only local to a specific service CORE: they are passed by/to other services and hence they do need to undergo CORE transformations
22 CORE Design: Data and Metadata They are specified as service inputs and outputs Logical names link them to previously specified services Non-CORE data only need the file system path where they can be retrieved
23 CORE Design: CORE Data The specification of CORE data is provided by 3 elements: Domain descriptor CORE data model Mapping model
24 Domain Descriptor: Model Entity: like entities in the Entity-Relationship model Entity properties: like attributes in the Entity-Relationship model Very simple (meta-)model
25 Domain Descriptor: Example <schema name="demo_domain_descriptor"> <entity name="sampleplan"> <property name="stratification_var"/> <property name="stratum_sample_size"/> <property name="stratum_population_size"/> </entity> <entity name="enterprise"> <property name="identifier"/> <property name="stratification_var"/> <property name="weight"/> <property name="sampling_fraction"/> <property name="enterprise_flag"/> <property name="employees_num"/> <property name="value_added"/> <property name="area"/> </entity> </schema>
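The domain descriptor above is plain XML, so it can be loaded in a few lines; the sketch below (Python standard library only, entity list truncated for brevity, and purely illustrative rather than part of CORE) simply indexes entities and their properties:

```python
import xml.etree.ElementTree as ET

DD = """<schema name="demo_domain_descriptor">
  <entity name="sampleplan">
    <property name="stratification_var"/>
    <property name="stratum_sample_size"/>
    <property name="stratum_population_size"/>
  </entity>
  <entity name="enterprise">
    <property name="identifier"/>
    <property name="stratification_var"/>
  </entity>
</schema>"""

root = ET.fromstring(DD)
# Build {entity_name: [property names]} from the descriptor
entities = {e.get("name"): [p.get("name") for p in e.findall("property")]
            for e in root.findall("entity")}
print(entities["sampleplan"])
# → ['stratification_var', 'stratum_sample_size', 'stratum_population_size']
```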
26 Domain Descriptor Role Role of the Domain Descriptor (DD): from service-to-service data mapping to service-to-global data mapping
27 CORE Data Model: Role Specified once and valid for all processes Extensible, i.e. core tag, data set kind, column kind can be modified Adds more semantics to data Example of usage: mapping to other models
28 CORE Data Model Rectangular data set CORE tag: Data set level (mandatory) Column level (optional) Rows level (optional) Data set kind Column kind
30 Mapping Model Rectangular data assumption Mapping is intended to be specified with respect to the Domain Descriptor Columns are to be mapped to properties of an entity It contains the specification of how CORE data model concepts are associated to data
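A minimal way to picture the mapping model is a table that sends tool-specific column names to (entity, property) pairs of the Domain Descriptor. The sketch below is purely illustrative: the column names and the dictionary format are invented, not CORE's actual mapping file syntax:

```python
import csv, io

# Hypothetical mapping: tool-specific CSV column -> (entity, property) of the DD
mapping = {
    "ent_id":  ("enterprise", "identifier"),
    "stratum": ("enterprise", "stratification_var"),
    "w":       ("enterprise", "weight"),
}

raw = "ent_id,stratum,w\nE01,A,12.5\nE02,B,3.0\n"

def to_core(csv_text, mapping):
    """Rename tool-specific columns to Domain Descriptor entity properties."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return [{f"{ent}.{prop}": row[col]
             for col, (ent, prop) in mapping.items()} for row in rows]

records = to_core(raw, mapping)
print(records[0]["enterprise.identifier"])  # → E01
```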
31 CORE Logical Architecture
32 CORE GUIs Process design Ad-hoc customization of an existing tool (Oryx) Service data flow Service design Set of interfaces for the definition of services and related data flow Data design Set of interfaces for the specification of domain descriptors and mapping files
33 Use Case Specification CORE (Principal) Users
34 Use Case Specification: Tool Management
35 Use Case Specification: Service Management (use-case diagram: the Statistical User can Add, Modify, Delete and Select Services, Show the Services' List, Show the Tools' List and Select a Tool)
36 Use Case Specification: Process
37 Process design: Oryx Oryx is an academic open source framework for graphical process modeling Based on web technology Extensible via a plugin mechanism and new stencil sets Supports BPMN and other process modeling languages Programming languages: JavaScript and Java; internal data format based on RDF
38 Stencil Set Set of graphical objects and rules that specify how to relate those graphical objects to others Additional properties that can later be used by other applications or Oryx extensions (e.g. setting element colors and visibility) Can be used to build process models
39 The CORE Stencil Set Graphical representation of CORE processes Easy-to-use editor (desktop feeling) Easy-to-extend source (JSON) Defined from BPMN Guarantees complete BPMN compliance
40 Integration APIs Purpose: wrapping a tool by a CORE service Translates inputs and outputs of the tool in a completely transparent and automatic way
41 Repository Processes and their instances Services with their GSBPM and CORE classifications Tools and their runtime features Data with their logical classification within CORE processes
42 Database design: Overview
43 Database Design: Principal Entities Service & Tool
44 Database Design: Principal Entities Service & Process (ER diagram: service [id, name, GSBPMtag, coretag, version, namespace] related 0..* to 0..* with process [id, name, definition])
45 Database Design: Principal Entities Operational Data
46 Process Engine Official statistics processes can be viewed from two perspectives: Functional: they are data-oriented, reflecting a common feature of scientific workflows Organizational: they are workflow-oriented, have the complexity of real production lines, with the need for harmonizing the work of different actors
47 Process Engine Hence our process engine has two layers WF ENGINE DATA FLOW CONTROL SYSTEM Complex control flows Synchronizing constructs, cycles, conditions, etc. E.g.: interactive multi-user editing/imputation Simple control flows Sequence of tasks composed by connecting the output of one task to the input of another Data intensive operations
48 Workflow Engine Selection Process CORE work package led by INSEE Business Process Management (BPM) platforms: Bonita, Activiti, ActiveVOS
49 Workflow Engine Selection Process
50 SDMX Relationships Both propose an information model The CORE information model explicitly takes the process dimension into account Data dimension spanning the whole statistical process The SDMX information model is focused on data exchange (though processes are also considered)
51 SDMX Relationships CORE information model Deals with both microdata and macrodata SDMX information model Mainly deals with macrodata
52 SDMX Relationships 1. Can we use SDMX for micro and macro data exchanges in a CORE process? Need for mapping of information models 2. What about metadata? CORE: data and metadata managed in the same way SDMX: distinction between structural metadata and reference metadata. Possibility of having domain knowledge codified through concepts
53 SDMX Relationships Choices and steps: Conversion from CORE XML to CSV in order to use the SDMX conversion tools An SDMX DSD (Data Structure Definition) was created starting from the CORE file structure SDMX data format: cross-sectional Once the DSD was prepared, we proceeded to convert the CORE file using the SDMX Converter tool
54 SDMX Relationships CORE-to-SDMX Conversion Proof-of-Concept Setting: Italian Time-Use Survey Data Structure Wizard and the SDMX Converter Compute Estimates and Sampling Errors (as the aggregated data dissemination phase) Choices and steps: Conversion from CORE XML to CSV in order to use the SDMX conversion tools An SDMX DSD (Data Structure Definition) was created starting from the CORE file structure
55 SDMX Relationships The experiment has shown the feasibility of converting a data file obtained as a CORE output to SDMX format The conversion is not automated: manual mapping of the CORE output's fields to the dimensions and attributes of the SDMX DSD Since the SDMX cross-sectional format does not manage more than one measure, the CORE output file had to be verticalized before conversion
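The verticalization step mentioned here is essentially a wide-to-long reshape: each measure column of the CORE output becomes its own row carrying a single observation value, as the SDMX cross-sectional format expects. A toy sketch with invented column names:

```python
def verticalize(rows, id_cols, measure_cols):
    """Turn one row with N measure columns into N rows with a single OBS_VALUE."""
    out = []
    for row in rows:
        for m in measure_cols:
            rec = {c: row[c] for c in id_cols}  # keep identifying dimensions
            rec["MEASURE"] = m                  # name of the original column
            rec["OBS_VALUE"] = row[m]           # single measure per row
            out.append(rec)
    return out

wide = [{"area": "North", "estimate": 10.2, "sampling_error": 0.4}]
long = verticalize(wide, ["area"], ["estimate", "sampling_error"])
print(len(long))  # → 2 (one row per measure)
```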
56 Architecture Deployment Web-based architecture centered on a centralized component CORE Environment Different CORE deployments can co-exist Intra- or inter-organization Services can be remotely executed Support is needed in the form of a distributed component for tool execution and data transfer
57 Types of service runtime Batch: tool executed by a command-line call; can be automated Interactive: user interacts with the tool through a tool-provided GUI; cannot be automated Web service: no local tool, a procedure exposed as a web service and activated by a programming-language call; can be automated
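For the batch case, the runtime agent ultimately issues a command-line call and collects the tool's output; a minimal sketch of that pattern (the agent's real protocol is not described in the slides, and the invoked "tool" here is a stand-in one-liner):

```python
import subprocess, sys

def run_batch_tool(cmd):
    """Execute a tool by command line, capture its output, fail on non-zero exit."""
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout

# Stand-in for a real statistical tool: a Python one-liner
out = run_batch_tool([sys.executable, "-c", "print('allocation done')"])
print(out.strip())  # → allocation done
```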
58 CORE Technical Architecture CORE Environment Batch-Interactive runtime Runtime GUI Definition Repository Process Engine Integration APIs Runtime agent: runs on the machine on which the tool is deployed and is responsible for: preparing the input, gathering the output, activating the tool Web service runtime Runtime Remote activation Web service client Web container
59 CORE Technical Architecture CORE Environment Batch-Interactive runtime Runtime GUI Definition Repository Runtime agent Process Engine The process engine signals that a service must be executed Integration APIs Runtime Remote activation Web service client
60 CORE Technical Architecture CORE Environment Batch-Interactive runtime Runtime GUI Definition Repository Runtime agent Process Engine Integration APIs Runtime Remote activation Web service client Service definition is extracted from the repository, as well as the required datasets and the corresponding mappings
61 CORE Technical Architecture CORE Environment Batch-Interactive runtime Runtime GUI Definition Repository Runtime agent Process Engine Integration APIs Runtime Remote activation Web service client Datasets are converted according to the mapping
62 CORE Technical Architecture CORE Environment Batch-Interactive runtime Runtime GUI Definition Repository Runtime agent Process Engine Integration APIs Runtime Remote activation Web service client Converted datasets are transferred to the remote runtime
63 CORE Technical Architecture CORE Environment Batch-Interactive runtime Runtime GUI Definition Repository Runtime agent Process Engine Integration APIs The tool is activated by the runtime agent Runtime Remote activation Web service client
64 CORE Technical Architecture CORE Environment Batch-Interactive runtime Runtime GUI Definition Repository Runtime agent Process Engine The output datasets are gathered and sent back to the CORE environment Integration APIs Runtime Remote activation Web service client
65 CORE Technical Architecture CORE Environment Batch-Interactive runtime Runtime GUI Definition Repository Runtime agent Process Engine Integration APIs Runtime Remote activation Web service client Datasets are converted back to CORE format according to the mapping
66 CORE Technical Architecture CORE Environment Batch-Interactive runtime Runtime GUI Definition Repository Runtime agent Process Engine Integration APIs Runtime Remote activation Converted datasets are stored in the repository Web service client
67 CORE Technical Architecture CORE Environment Batch-Interactive runtime Runtime GUI Definition Repository Runtime agent Process Engine Integration APIs The process continues its execution Runtime Remote activation Web service client
68 Scenario 1 Remote execution command line/GUI Physical layers: CORE env, Service AGENT
69 Scenario 2 Remote execution web service Physical layers: CORE env, Service
70 CORE Scenario
71 Why a Process Scenario? Helps to clarify ideas and to assess their feasibility Forces newly proposed solutions to be made concrete Can/will be used as empirical test-bed during the whole implementation cycle of the CORE environment
72 How did we build the Scenario? Rationale for our Scenario: Naturality: involves typical processing steps performed by NSIs for sample surveys Minimality: very easy workflow (no conditionals, no cycles), can be run without a Workflow Engine Appropriateness: incorporates as much heterogeneity as possible: heterogeneity is precisely what CORE must be able to get rid of
73 Spreading Heterogeneity over the Scenario The Scenario incorporates both: Data Heterogeneity Via data exchanged by CORE services belonging to the scenario process Technological Heterogeneity Via IT tools implementing scenario sub-processes
74 Data Heterogeneity The Scenario entails different levels of data heterogeneity: Format Heterogeneity: CSV files, relational DB tables, SDMX XML files involved Statistical Heterogeneity: both Micro and Aggregated Data involved Model Heterogeneity: some data refer to ordinary real-world concepts (e.g. enterprise, individual, ...), others to concepts arising from the statistical domain (e.g. stratum, variance, sampling weight, ...)
75 Technological Heterogeneity The Scenario requires wrapping very different IT tools inside CORE-compliant services: simple SQL statements executed on a relational DB, batch jobs based on SAS or R scripts, full-fledged R-based systems requiring human-computer interaction through a GUI layer
76 The Scenario at a glance ALLOCATION START Compute Strata Statistics Allocate the Sample Select the Sample Collect Survey Data Check and Correct Survey Data Calibrate Survey Data Compute Estimates and Sampling Errors Store Estimates and Sampling Errors ESTIMATION Convert to SDMX STOP
77 Sample Allocation Subprocess ALLOCATION START Compute Strata Statistics Allocate the Sample Overall Goal: determine the minimum number of units to be sampled inside each stratum, when lower bounds are imposed on the expected level of precision of the estimates the survey has to deliver Two statistical services are needed: Compute Strata Statistics Allocate the Sample
78 Compute Strata Statistics Service ALLOCATION START Compute Strata Statistics Allocate the Sample Goal: compute, for each stratum, the population mean and standard deviation of a set of auxiliary variables IT tool: a simple SQL aggregated query with a group-by clause NSIs usually maintain their sampling frame(s) as Relational DB tables Integration API: must support Relational/CORE transformations CORA tag: Statistics
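The group-by query this service wraps can be sketched with SQLite on toy data; since SQLite has no built-in standard-deviation aggregate, the population standard deviation is derived from the first and second moments (table and column names here are invented):

```python
import sqlite3, math

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE frame (stratum TEXT, employees_num REAL)")
con.executemany("INSERT INTO frame VALUES (?, ?)",
                [("A", 10), ("A", 14), ("B", 100), ("B", 120)])

# Per-stratum mean; the second moment lets us derive the population std. dev.
rows = con.execute("""
    SELECT stratum, AVG(employees_num), AVG(employees_num * employees_num)
    FROM frame GROUP BY stratum ORDER BY stratum
""").fetchall()

stats = {s: (mean, math.sqrt(m2 - mean * mean)) for s, mean, m2 in rows}
print(stats["A"])  # → (12.0, 2.0)
```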
79 Allocate the Sample Service ALLOCATION START Compute Strata Statistics Allocate the Sample Goal: solve a constrained optimization problem to find and return the optimal sample allocation across strata IT tool: Istat MAUSS-R system implemented in R and Java, can be run either in batch mode or interactively via a GUI Integration API: must support CSV/CORE transformations MAUSS handles I/O via CSV files CORA tag: Statistics
80 Sample Selection Subprocess Select the Sample Goal: draw a stratified random sample of units from the sampling frame, according to the previously computed optimal allocation IT tool: a simple SAS script to be executed in batch mode Integration API: CSV/CORE transformation SAS datasets have a proprietary, closed format, so we'll not support direct SAS/CORE conversions CORA tag: Population The output stores the identifiers of the units to be later surveyed + basic information needed to contact them
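Stratified simple random sampling according to a precomputed allocation, which the SAS script performs here, can be sketched as follows (toy frame and allocation, not the actual script):

```python
import random

def stratified_sample(frame, allocation, seed=0):
    """Draw n_h units without replacement from each stratum h."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    sample = []
    for stratum, n_h in allocation.items():
        units = [u for u in frame if u["stratum"] == stratum]
        sample.extend(rng.sample(units, n_h))
    return sample

frame = [{"id": i, "stratum": "A" if i < 6 else "B"} for i in range(10)]
picked = stratified_sample(frame, {"A": 2, "B": 1})
print(len(picked))  # → 3
```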
81 Estimation Subprocess Calibrate Survey Data Compute Estimates and Sampling Errors ESTIMATION Overall Goal: compute the estimates the survey must deliver, and assess their precision as well Two statistical services are needed: Calibrate Survey Data Compute Estimates and Sampling Errors
82 Calibrate Survey Data Service Calibrate Survey Data Compute Estimates and Sampling Errors ESTIMATION Goal: provide a new set of weights (the "calibrated weights") to be used for estimation purposes IT tool: Istat ReGenesees system implemented in R, can be run either in batch mode or interactively via a GUI Integration API: can use both CSV/CORE and Relational/CORE transformations CORA tag: Variable
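Real calibration in ReGenesees solves a constrained distance-minimization problem; as a deliberately simplified illustration only, ratio calibration rescales all design weights by a single factor so that the weighted total of one auxiliary variable matches a known population total:

```python
def ratio_calibrate(weights, aux, known_total):
    """Scale all weights by one factor g so that sum(g*w*x) equals the known total."""
    estimated = sum(w * x for w, x in zip(weights, aux))
    g = known_total / estimated
    return [g * w for w in weights]

# Toy design weights and auxiliary variable; known population total is invented
w_cal = ratio_calibrate([10, 10, 20], [1.0, 2.0, 1.5], 120.0)
print(w_cal)  # → [20.0, 20.0, 40.0]
```

After calibration the weighted auxiliary total reproduces the benchmark exactly, which is the defining property of calibrated weights.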
83 Estimates and Errors Service Calibrate Survey Data Compute Estimates and Sampling Errors ESTIMATION Goal: use the calibrated weights to compute the estimates the survey has to provide (typically for different subpopulations of interest) along with the corresponding confidence intervals IT tool: Istat ReGenesees system Integration API: can use both CSV/CORE and Relational/CORE transformations CORA tag: Statistics
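A weighted point estimate with a normal-approximation confidence interval, the kind of output this service delivers, can be illustrated with toy numbers (the weights, study variable, and standard error below are all invented):

```python
# Toy weighted estimate of a total, with a 95% normal-approximation interval.
y = [1.0, 0.0, 1.0]      # study variable
w = [20.0, 20.0, 40.0]   # calibrated weights
est = sum(wi * yi for wi, yi in zip(w, y))
se = 8.0                 # in practice, estimated from the sampling design
lo, hi = est - 1.96 * se, est + 1.96 * se
print(est)  # → 60.0
```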
84 Store Estimates Subprocess Store Estimates and Sampling Errors Goal: persistently store the previously computed survey estimates in a relational DB, e.g. in order to subsequently feed a data warehouse for online publication IT tool: a set of SQL statements Integration API: Relational/CORE transformation again CORA tag: Statistics
85 Convert to SDMX Service Convert to SDMX STOP Goal: retrieve the aggregated data from the relational DB and directly convert them into SDMX XML format, e.g. to later send them to IT tool: ??? Integration API: must support SDMX/CORE transformations CORA tag: Statistics
86 Scenario Open Issues Besides I/O data, CORE must be able to handle service behaviour parameters. How? E.g. to analyze a complex survey, ReGenesees needs a lot of sampling design metadata, namely information about strata, stages, cluster identifiers, sampling weights, calibration models, and so on Enabling the CORE environment to support interactive service execution is still a challenging problem We plan to exploit MAUSS-R and/or ReGenesees to test the technical feasibility of any forthcoming solution How to implement an SDMX/CORE converter?
87 Demo Scenario Involves 3 typical processing steps performed by NSIs for sample surveys: Sample Allocation Sample Selection Estimation It has been used as empirical test-bed during the whole implementation cycle of the CORE environment
88 Rationale for the Scenario Minimality: very easy workflow (no conditionals, no cycles), can be run without a Workflow Engine Appropriateness: addresses heterogeneity issues heterogeneity is precisely what CORE must be able to get rid of
89 Spreading Heterogeneity over the Scenario The Scenario incorporates both: Data Heterogeneity: via data exchanged by CORE services belonging to the scenario process Technological Heterogeneity: via IT tools implementing scenario services A batch job based on a SAS script Two full-fledged R-based systems
90 The Scenario at a glance START ALLOCATION MAUSS-R SELECTION SAS SCRIPT ESTIMATION ReGenesees System STOP
91 Sample Allocation Service ALLOCATION START MAUSS-R Overall Goal: determine the minimum number of units to be sampled inside each stratum, when lower bounds are imposed on the expected level of precision of the estimates the survey has to deliver IT tool: Istat MAUSS-R system implemented in R and Java CORA tag: Statistics
92 Sample Selection Service SELECTION SAS SCRIPT Goal: draw a stratified random sample of units from the sampling frame, according to the previously computed optimal allocation IT tool: a simple SAS script to be executed in batch mode CORA tag: Population
93 Estimates and Errors Service ESTIMATION ReGenesees System STOP Goal: compute the estimates the survey has to provide (typically for different subpopulations of interest) along with the corresponding confidence intervals IT tool: Istat ReGenesees System (R-based) CORA tag: Statistics
94 CORE Follow up
95 CORE in Istat CORE is an Action of the Istat strategic plan Stat2015 Period Objective: usage of the CORE platform in production scenarios of Istat Plan for 2013: Implementation of engineering activities Usage of CORE to support sharing of generalized software functionalities (currently studying how) Usage of CORE in the dissemination flow of the corporate architecture in conjunction with an ETL tool (Kettle) (currently studying how)
96 Development of CORE Services for ESS: Issues CORE is strictly related to the Shared Services technical cross-cutting issue of the ESS VIP (Vision Infrastructure Project) Programme Period Role: Supporting standardisation of the communication protocol among standard statistical services
97 Issue 1: Relationship between CORE and SOA Hints for answering issue 1: CORE adopts a SOA design approach CORE services can be deployed as Web Services CORE does imply / include SOA technologies SOA technologies do not imply / include CORE
98 Issue 2: Relationship between CORE and GSIM Hints for answering issue 2: CORE did not have the purpose of defining yet another information model CORE takes into account the need for an information model, introduced only for demonstration purposes Hence from a design perspective CORE is open to adopt a full-fledged information model like GSIM CORE Model slot / CORE Domain Descriptor slot
99 Issue 3: Relationship between CORE and DDI/SDMX Hints for answering issue 3: DDI/SDMX provide logical information models GSIM serves a documentation purpose DDI/SDMX serve (mainly) a representation purpose CORE could be integrated with DDI/SDMX by: Mapping the rectangular dataset representation of CORE data to such models The mapping is in principle feasible as the CORE model is less expressive
100 Issue 4: CORE Deployment Issues in the ESS SOA supporting platform Hints for answering issue 4: Need for designing a CORE deployment for the ESS Service repositories Data exchanges Security issues Performance issues...
More informationOpenBudgets.eu: Fighting Corruption with Fiscal Transparency. Project Number: Start Date of Project: Duration: 30 months
OpenBudgets.eu: Fighting Corruption with Fiscal Transparency Project Number: 645833 Start Date of Project: 01.05.2015 Duration: 30 months Deliverable 4.1 Specification of services' Interfaces Dissemination
More informationESPRIT Project N Work Package H User Access. Survey
ESPRIT Project N. 25 338 Work Package H User Access Survey ID: User Access V. 1.0 Date: 28.11.97 Author(s): A. Sinderman/ E. Triep, Status: Fast e.v. Reviewer(s): Distribution: Change History Document
More informationECLIPSE PERSISTENCE PLATFORM (ECLIPSELINK) FAQ
ECLIPSE PERSISTENCE PLATFORM (ECLIPSELINK) FAQ 1. What is Oracle proposing in EclipseLink, the Eclipse Persistence Platform Project? Oracle is proposing the creation of the Eclipse Persistence Platform
More informationTeiid Designer User Guide 7.5.0
Teiid Designer User Guide 1 7.5.0 1. Introduction... 1 1.1. What is Teiid Designer?... 1 1.2. Why Use Teiid Designer?... 2 1.3. Metadata Overview... 2 1.3.1. What is Metadata... 2 1.3.2. Editing Metadata
More informationUCT Application Development Lifecycle. UCT Business Applications
UCT Business Applications Page i Table of Contents Planning Phase... 1 Analysis Phase... 2 Design Phase... 3 Implementation Phase... 4 Software Development... 4 Product Testing... 5 Product Implementation...
More informationExtending the Scope of Custom Transformations
Paper 3306-2015 Extending the Scope of Custom Transformations Emre G. SARICICEK, The University of North Carolina at Chapel Hill. ABSTRACT Building and maintaining a data warehouse can require complex
More informationto-end Solution Using OWB and JDeveloper to Analyze Your Data Warehouse
An End-to to-end Solution Using OWB and JDeveloper to Analyze Your Data Warehouse Presented at ODTUG 2003 Dan Vlamis dvlamis@vlamis.com Vlamis Software Solutions, Inc. (816) 781-2880 http://www.vlamis.com
More informationMinsoo Ryu. College of Information and Communications Hanyang University.
Software Reuse and Component-Based Software Engineering Minsoo Ryu College of Information and Communications Hanyang University msryu@hanyang.ac.kr Software Reuse Contents Components CBSE (Component-Based
More informationD2.5 Data mediation. Project: ROADIDEA
D2.5 Data mediation Project: ROADIDEA 215455 Document Number and Title: D2.5 Data mediation How to convert data with different formats Work-Package: WP2 Deliverable Type: Report Contractual Date of Delivery:
More informationDEV-33: Get to Know Your Data Open Source Data Integration, Business Intelligence and more Marian Edu
DEV-33: Get to Know Your Data Open Source, Business Intelligence and more IT Consultant Agenda Take Ownership of Your Data. Data Discovery Reporting Analysis 2 DEV-33: Get to Know Your Data Data Discovery
More informationTeiid Designer User Guide 7.7.0
Teiid Designer User Guide 1 7.7.0 1. Introduction... 1 1.1. What is Teiid Designer?... 1 1.2. Why Use Teiid Designer?... 2 1.3. Metadata Overview... 2 1.3.1. What is Metadata... 2 1.3.2. Editing Metadata
More informationSampling Error Estimation SORS practice
Sampling Error Estimation SORS practice Rudi Seljak, Petra Blažič Statistical Office of the Republic of Slovenia 1. Introduction Assessment of the quality in the official statistics has faced significant
More informationFederated XDMoD Requirements
Federated XDMoD Requirements Date Version Person Change 2016-04-08 1.0 draft XMS Team Initial version Summary Definitions Assumptions Data Collection Local XDMoD Installation Module Support Data Federation
More informationA standardized approach to editing: Statistics Finland s metadata-driven editing and imputation service
Working Paper. UNITED NATIONS ECONOMIC COMMISSION FOR EUROPE CONFERENCE OF EUROPEAN STATISTICIANS Work Session on Statistical Data Editing (Neuchâtel, Switzerland, 18-20 September 2018) A standardized
More informationBusiness microdata dissemination at Istat
Business microdata dissemination at Istat Daniela Ichim Luisa Franconi ichim@istat.it franconi@istat.it Outline - Released products - Microdata dissemination - Business microdata dissemination - Documentation
More informationActiveVOS Technologies
ActiveVOS Technologies ActiveVOS Technologies ActiveVOS provides a revolutionary way to build, run, manage, and maintain your business applications ActiveVOS is a modern SOA stack designed from the top
More informationOptimizing Testing Performance With Data Validation Option
Optimizing Testing Performance With Data Validation Option 1993-2016 Informatica LLC. No part of this document may be reproduced or transmitted in any form, by any means (electronic, photocopying, recording
More informationThinProway A Java client to a SAS application. A successful story. Exactly what you need?
ThinProway A Java client to a SAS application. A successful story. Exactly what you need? Author: Riccardo Proni TXT Ingegneria Informatica Abstract ThinProway is a software solution dedicated to the manufacturing
More informationAn Oracle White Paper October Release Notes - V Oracle Utilities Application Framework
An Oracle White Paper October 2012 Release Notes - V4.2.0.0.0 Oracle Utilities Application Framework Introduction... 2 Disclaimer... 2 Deprecation of Functionality... 2 New or Changed Features... 4 Native
More informationSunday, May 1,
1 Governing Services, Data, Rules, Processes and more Randall Hauch Project Lead, ModeShape Kurt Stam Project Lead, Guvnor @rhauch @modeshape @guvtalk 2 Scenario 1 Build business processes using existing
More informationCopy Data From One Schema To Another In Sql Developer
Copy Data From One Schema To Another In Sql Developer The easiest way to copy an entire Oracle table (structure, contents, indexes, to copy a table from one schema to another, or from one database to another,.
More informationThe CASPAR Finding Aids
ABSTRACT The CASPAR Finding Aids Henri Avancini, Carlo Meghini, Loredana Versienti CNR-ISTI Area dell Ricerca di Pisa, Via G. Moruzzi 1, 56124 Pisa, Italy EMail: Full.Name@isti.cnr.it CASPAR is a EU co-funded
More informationOptimizer Challenges in a Multi-Tenant World
Optimizer Challenges in a Multi-Tenant World Pat Selinger pselinger@salesforce.come Classic Query Optimizer Concepts & Assumptions Relational Model Cost = X * CPU + Y * I/O Cardinality Selectivity Clustering
More informationData integration made easy with Talend Open Studio for Data Integration. Dimitar Zahariev BI / DI Consultant
Data integration made easy with Talend Open Studio for Data Integration Dimitar Zahariev BI / DI Consultant dimitar@zahariev.pro @shekeriev Disclaimer Please keep in mind that: 2 I m not related in any
More informationLupin: from Web Services to Web-based Problem Solving Environments
Lupin: from Web Services to Web-based Problem Solving Environments K. Li, M. Sakai, Y. Morizane, M. Kono, and M.-T.Noda Dept. of Computer Science, Ehime University Abstract The research of powerful Problem
More informationPredicting impact of changes in application on SLAs: ETL application performance model
Predicting impact of changes in application on SLAs: ETL application performance model Dr. Abhijit S. Ranjekar Infosys Abstract Service Level Agreements (SLAs) are an integral part of application performance.
More informationWHY WE NEED AN XML STANDARD FOR REPRESENTING BUSINESS RULES. Introduction. Production rules. Christian de Sainte Marie ILOG
WHY WE NEED AN XML STANDARD FOR REPRESENTING BUSINESS RULES Christian de Sainte Marie ILOG Introduction We are interested in the topic of communicating policy decisions to other parties, and, more generally,
More informationEnterprise Data Catalog for Microsoft Azure Tutorial
Enterprise Data Catalog for Microsoft Azure Tutorial VERSION 10.2 JANUARY 2018 Page 1 of 45 Contents Tutorial Objectives... 4 Enterprise Data Catalog Overview... 5 Overview... 5 Objectives... 5 Enterprise
More informationMetadata and classification system development in Bosnia and Herzegovina
>> Metadata and classification system development in Bosnia and Herzegovina 23. april 2012 Mogens Grosen Nielsen Statistics Denmark Outline of introduction to metadata project in Bosnia and Hercegovina
More informationAn Approach to Evaluate and Enhance the Retrieval of Web Services Based on Semantic Information
An Approach to Evaluate and Enhance the Retrieval of Web Services Based on Semantic Information Stefan Schulte Multimedia Communications Lab (KOM) Technische Universität Darmstadt, Germany schulte@kom.tu-darmstadt.de
More informationMetadata. Frauke Kreuter BLS 2018 University of Maryland (JPSM), University of Mannheim & IAB
Metadata Frauke Kreuter BLS 2018 University of Maryland (JPSM), University of Mannheim & IAB Metadata? Process data whiskey Spaghetti Metadata Paradata and metadata Paradata capture information about
More informationNew Features in Oracle Data Miner 4.2. The new features in Oracle Data Miner 4.2 include: The new Oracle Data Mining features include:
Oracle Data Miner Release Notes Release 4.2 E64607-03 March 2017 This document provides late-breaking information and information that is not yet part of the formal documentation. This document contains
More informationDatabase of historical places, persons, and lemmas
Database of historical places, persons, and lemmas Natalia Korchagina Outline 1. Introduction 1.1 Swiss Law Sources Foundation as a Digital Humanities project 1.2 Data to be stored 1.3 Final goal: how
More informationGrid Computing Systems: A Survey and Taxonomy
Grid Computing Systems: A Survey and Taxonomy Material for this lecture from: A Survey and Taxonomy of Resource Management Systems for Grid Computing Systems, K. Krauter, R. Buyya, M. Maheswaran, CS Technical
More informationAlignment of Business and IT - ArchiMate. Dr. Barbara Re
Alignment of Business and IT - ArchiMate Dr. Barbara Re What is ArchiMate? ArchiMate is a modelling technique ("language") for describing enterprise architectures. It presents a clear set of concepts within
More informationSemantic Web Company. PoolParty - Server. PoolParty - Technical White Paper.
Semantic Web Company PoolParty - Server PoolParty - Technical White Paper http://www.poolparty.biz Table of Contents Introduction... 3 PoolParty Technical Overview... 3 PoolParty Components Overview...
More informationSDK USE CASES Topic of the Month FusionBanking Loan IQ
SDK USE CASES Topic of the Month FusionBanking Loan IQ Lorenzo Cerutti SAG Specialist Vishal Chandgude MSDC Principal Consultant January 2018 Finastra WELCOME TO THE FINASTRA TOPIC OF THE MONTH! Format
More informationBusiness Process Model Repositories - Framework and Survey
Business Process Model Repositories - Framework and Survey Zhiqiang Yan, Remco Dijkman, Paul Grefen Eindhoven University of Technology, PO Box 513, 5600 MB Eindhoven, The Netherlands Abstract Large organizations
More information"Out of the Box" Workflow Simplicity Data Access Using PPDM in a Multi-Vendor Environment
"Out of the Box" Workflow Simplicity Data Access Using PPDM in a Multi-Vendor Environment W. Brian Boulmay Director, Business Partners Fall 2010 Abstract Proper conversion of unit and coordinate reference
More informationBest ETL Design Practices. Helpful coding insights in SAS DI studio. Techniques and implementation using the Key transformations in SAS DI studio.
SESUG Paper SD-185-2017 Guide to ETL Best Practices in SAS Data Integration Studio Sai S Potluri, Synectics for Management Decisions; Ananth Numburi, Synectics for Management Decisions; ABSTRACT This Paper
More informationIntroduction to Federation Server
Introduction to Federation Server Alex Lee IBM Information Integration Solutions Manager of Technical Presales Asia Pacific 2006 IBM Corporation WebSphere Federation Server Federation overview Tooling
More informationInfoSphere Warehouse V9.5 Exam.
IBM 000-719 InfoSphere Warehouse V9.5 Exam TYPE: DEMO http://www.examskey.com/000-719.html Examskey IBM 000-719 exam demo product is here for you to test the quality of the product. This IBM 000-719 demo
More informationAutomated Bundling and Other New Features in IBM License Metric Tool 7.5 Questions & Answers
ILMT Central Team Automated Bundling and Other New Features in IBM License Metric Tool 7.5 Questions & Answers Information These slides were just the deck of the ILMT Central Team Questions&Answers session.
More informationESSnet. Common Reference Architecture. WP number and name: WP2 Requirements collection & State of the art. Questionnaire
Partner s name: Statistics Norway WP number and name: WP2 Requirements collection & State of the art Deliverable number and name: 2.1 Questionnaire Questionnaire Partner in charge Statistics Norway Version
More informationUnifying Big Data Workloads in Apache Spark
Unifying Big Data Workloads in Apache Spark Hossein Falaki @mhfalaki Outline What s Apache Spark Why Unification Evolution of Unification Apache Spark + Databricks Q & A What s Apache Spark What is Apache
More informationMCSA SQL SERVER 2012
MCSA SQL SERVER 2012 1. Course 10774A: Querying Microsoft SQL Server 2012 Course Outline Module 1: Introduction to Microsoft SQL Server 2012 Introducing Microsoft SQL Server 2012 Getting Started with SQL
More informationStreamlining Data Compilation and Dissemination at ILO Department of Statistics Lessons Learned and Current Status
Distr. GENERAL Working Paper 11 April 2013 ENGLISH ONLY UNITED NATIONS ECONOMIC COMMISSION FOR EUROPE (ECE) CONFERENCE OF EUROPEAN STATISTICIANS ORGANISATION FOR ECONOMIC COOPERATION AND DEVELOPMENT (OECD)
More information1. Analytical queries on the dimensionally modeled database can be significantly simpler to create than on the equivalent nondimensional database.
1. Creating a data warehouse involves using the functionalities of database management software to implement the data warehouse model as a collection of physically created and mutually connected database
More informationQuestion: 1 What are some of the data-related challenges that create difficulties in making business decisions? Choose three.
Question: 1 What are some of the data-related challenges that create difficulties in making business decisions? Choose three. A. Too much irrelevant data for the job role B. A static reporting tool C.
More informationGamma Data Warehouse Studio
Gamma Data Warehouse Studio Streamlined Implementation of Data Warehouses Data Marts Data Integration Projects www.gamma-sys.com Data Warehouse Studio Gamma Data Warehouse Studio Feature Highlights Slide
More informationBusiness Architecture concepts and components: BA Process Flow
Business Architecture concepts and components: BA Process Flow Giulio Barcaroli Directorate for Methodology and Statistical Process Design Istat ESTP Training Course Enterprise Architecture and the different
More informationMetadata Based Impact and Lineage Analysis Across Heterogeneous Metadata Sources
Metadata Based Impact and Lineage Analysis Across Heterogeneous Metadata Sources Presentation at the THE 9TH ANNUAL Wilshire Meta-Data Conference AND THE 17TH ANNUAL DAMA International Symposium by John
More informationOracle 1Z0-640 Exam Questions & Answers
Oracle 1Z0-640 Exam Questions & Answers Number: 1z0-640 Passing Score: 800 Time Limit: 120 min File Version: 28.8 http://www.gratisexam.com/ Oracle 1Z0-640 Exam Questions & Answers Exam Name: Siebel7.7
More informationEuropean Conference on Quality and Methodology in Official Statistics (Q2008), 8-11, July, 2008, Rome - Italy
European Conference on Quality and Methodology in Official Statistics (Q2008), 8-11, July, 2008, Rome - Italy Metadata Life Cycle Statistics Portugal Isabel Morgado Methodology and Information Systems
More informationIntegrated Data Processing System (EAR)
Integrated Data Processing System (EAR) Session number: 8 Date: June 1 th 2016 Hajnalka Debreceni Hungarian Central Statistical Office Hajnalka.Debreceni@ksh.hu Outline Methodological aspects Hungarian
More informationFrom business need to implementation Design the right information solution
From business need to implementation Design the right information solution Davor Gornik (dgornik@us.ibm.com) Product Manager Agenda Relational design Integration design Summary Relational design Data modeling
More informationMetaMatrix Enterprise Data Services Platform
MetaMatrix Enterprise Data Services Platform MetaMatrix Overview Agenda Background What it does Where it fits How it works Demo Q/A 2 Product Review: Problem Data Challenges Difficult to implement new
More informationmicrosoft
70-775.microsoft Number: 70-775 Passing Score: 800 Time Limit: 120 min Exam A QUESTION 1 Note: This question is part of a series of questions that present the same scenario. Each question in the series
More informationEfficient Object-Relational Mapping for JAVA and J2EE Applications or the impact of J2EE on RDB. Marc Stampfli Oracle Software (Switzerland) Ltd.
Efficient Object-Relational Mapping for JAVA and J2EE Applications or the impact of J2EE on RDB Marc Stampfli Oracle Software (Switzerland) Ltd. Underestimation According to customers about 20-50% percent
More informationUsing the IMS Universal Drivers and QMF to Access Your IMS Data Hands-on Lab
Using the IMS Universal Drivers and QMF to Access Your IMS Data Hands-on Lab 1 Overview QMF for Workstation is an Eclipse-based, rich client desktop Java application, that uses JDBC to connect to data
More informationImplicit BPM Business Process Platform for Transparent Workflow Weaving
Implicit BPM Business Process Platform for Transparent Workflow Weaving Rubén Mondéjar, Pedro García, Carles Pairot, and Enric Brull BPM Round Table Tarragona Contents Context Introduction 01/27 Building
More informationBusiness Intelligence Roadmap HDT923 Three Days
Three Days Prerequisites Students should have experience with any relational database management system as well as experience with data warehouses and star schemas. It would be helpful if students are
More informationData Stage ETL Implementation Best Practices
Data Stage ETL Implementation Best Practices Copyright (C) SIMCA IJIS Dr. B. L. Desai Bhimappa.desai@capgemini.com ABSTRACT: This paper is the out come of the expertise gained from live implementation
More informationAccessibility Features in the SAS Intelligence Platform Products
1 CHAPTER 1 Overview of Common Data Sources Overview 1 Accessibility Features in the SAS Intelligence Platform Products 1 SAS Data Sets 1 Shared Access to SAS Data Sets 2 External Files 3 XML Data 4 Relational
More informationApplication Discovery and Enterprise Metadata Repository solution Questions PRIEVIEW COPY ONLY 1-1
Application Discovery and Enterprise Metadata Repository solution Questions 1-1 Table of Contents SECTION 1 ENTERPRISE METADATA ENVIRONMENT...1-1 1.1 TECHNICAL ENVIRONMENT...1-1 1.2 METADATA CAPTURE...1-1
More information