Towards Open Innovation with Open Data Service Platform

Marut Buranarach
Data Science and Analytics Research Group, National Electronics and Computer Technology Center (NECTEC), Thailand
The 44th Congress on Science and Technology of Thailand, October 30, 2018, BITEC, Bangkok, Thailand

Outline
- Introduction to Open Data
- 5-Star Open Data Model
- Research Challenges
- Open Data Service Platform (Open-D)
- Problem and Motivation
- Framework
- Case Studies
- Experiments with Data.go.th Datasets
- Open-D Public Service
- Conclusion and Future Work

Introduction to Open Data

Big Data, Personal Data, Open Data
Source: https://github.com/theodi/data-definitions

Open Government Data
- Open government data (OGD) is a global initiative to promote transparency, service innovation and citizen participation.
- OGD is usually published as datasets made available on OGD portals (open data catalogs), e.g., Data.gov and Data.gov.uk.
- Datasets are typically available in spreadsheet form, i.e., Excel and CSV formats.

Open Government Data Initiatives
- Data.gov
- Data.gov.uk
- Open-data.europa.eu
- Data.go.jp

5-Star Open Data Model

5-Star Open Data Model (2)
Open data as spreadsheets (2-3 stars) has disadvantages:
- No global identifiers
- Not queryable
Open data as RDF (4-5 stars) has advantages:
- Supports global identifiers (URIs)
- Queryable using SPARQL
- Linked data technology for data integration
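A toy sketch may help show why global identifiers and a queryable data model matter. It is plain Python, not an RDF library or SPARQL engine. The `http://example.org/` URIs, the sample data, and the `match` helper are all assumptions made for illustration. Two separately published triple sets become joinable simply because they use the same URI for the province.

```python
# Toy illustration only: triples are (subject, predicate, object) tuples.
# All URIs and values below are hypothetical.
EX = "http://example.org/"

schools = [
    (EX + "school/1", EX + "name", "School A"),
    (EX + "school/1", EX + "province", EX + "province/pathum-thani"),
]
provinces = [
    (EX + "province/pathum-thani", EX + "label", "Pathum Thani"),
]


def match(triples, s=None, p=None, o=None):
    """Return the triples that fit a pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


# Both datasets share the province URI, so integrating them is a plain merge.
graph = schools + provinces

# Follow the link from the school to its province, then read the label.
province_uri = match(graph, s=EX + "school/1", p=EX + "province")[0][2]
province_label = match(graph, s=province_uri, p=EX + "label")[0][2]
```

With spreadsheets, the same join would depend on matching free-text province names, which is where identifiers without global scope tend to break down.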

Some Research Challenges
- Data Publishing: how to automatically publish open data from different data sources, e.g., databases and sensor data.
- Data Quality: how to measure and improve the quality of open data.
- Data Cataloging: how to catalogue and index open datasets so that they can be discovered effectively. Example: Google Dataset Search (https://toolbox.google.com/datasetsearch).

Some Research Challenges (2)
- Data Access and Utilization: tools for querying and analyzing data in the datasets; data APIs.
- Data Integration: how to combine data from different datasets; how to move from open data to linked open data (5-star open data).

Open Data Service Platform (Open-D)

Problems and Motivations
- Although publishing open data as spreadsheets is straightforward and requires minimal technological skills, it is not ideal for users who want to use the data in a more dynamic fashion.
- Datasets must be downloaded in full.
- Updated datasets must be retrieved again.
- Such constraints on data consumption could limit the proliferation of OGD usage.

Problem
Open data in spreadsheets is not easy to consume.
[Figure: Publish spreadsheets -> Gap -> Not easy to consume]
Proposed solution: automatically transform datasets in spreadsheets into data APIs and data visualizations.

Data API
- An Application Programming Interface (API) for data, or data API, offers access to data in an abstract, dynamic and queryable fashion.
- Updates to the data are transparent to the data consumers.
- Users can retrieve portions of the data by means of data querying interfaces.
- Although a data API is convenient for data consumers, building an API is typically costly for data publishers, so APIs are usually developed only for some high-value datasets.
- RESTful Web APIs are commonly preferred by application developers.
[Figure: an application developer sends a request URL to a Web API run by government agencies; the Web API reads from a database and returns JSON or XML]
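The request/response flow on this slide can be sketched as follows. This is a minimal in-process illustration, not the Open-D implementation. The URL, the field names, the sample rows, and the `handle_request` function are hypothetical. A list of dictionaries stands in for the database, and the query string selects the portion of the data to return as JSON.

```python
import json
from urllib.parse import parse_qs, urlparse

# Stand-in for the publisher's database (hypothetical sample rows).
ROWS = [
    {"province": "Pathum Thani", "type": "general", "name": "School A"},
    {"province": "Bangkok", "type": "general", "name": "School B"},
    {"province": "Pathum Thani", "type": "vocational", "name": "School C"},
]


def handle_request(url, rows=ROWS):
    """Answer a GET-style request URL with the matching rows as JSON text."""
    query = parse_qs(urlparse(url).query)
    # Keep the first value of each query parameter as an equality filter.
    filters = {key: values[0] for key, values in query.items()}
    matches = [
        row for row in rows
        if all(str(row.get(key)) == value for key, value in filters.items())
    ]
    return json.dumps({"count": len(matches), "results": matches})


# The consumer only sees the URL and the JSON; the storage stays hidden.
response = json.loads(handle_request(
    "http://example.org/api/schools?province=Pathum+Thani&type=general"
))
```

Because the consumer depends only on the URL contract, the publisher can update the underlying rows without the consumer re-downloading a file.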

Our project: Open Data Service Platform (Open-D)
- We propose a framework for building data services from existing datasets available in OGD catalogs.
- Automatically extract tabular data from datasets.
- Convert the tabular data to RDF (Resource Description Framework) data and provide data services on top of the RDF data.
- System-generated data services: data API and data visualization.

Open Data Service Platform (Open-D)
[Figure: government agencies upload datasets; Open-D automatically transforms them into data APIs and visualizations; application developers access the APIs and data scientists visualize the data]

OGD Service Ecosystem
- Government agencies: automatic publishing of RDF and data APIs from existing OGD datasets via RDB-RDF direct mapping (4-star open data).
- Application developers: data access provided in two forms, RDF and data API.
- Data scientists: self-service data visualization for individual datasets. The services were created on top of the data APIs.
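The direct-mapping idea can be sketched in a few lines: each table row becomes a subject, each column a predicate, and each cell a literal. This is a simplified illustration in the spirit of the W3C Direct Mapping, not the output of D2RQ or of Open-D. The base URI, the row-number subject scheme, and the `table_to_ntriples` function are assumptions.

```python
import csv
import io
from urllib.parse import quote

BASE = "http://example.org/data/"  # hypothetical base URI
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def escape_literal(value):
    """Escape characters that may not appear raw in an N-Triples literal."""
    return (value.replace("\\", "\\\\").replace('"', '\\"')
                 .replace("\n", "\\n").replace("\r", "\\r"))


def table_to_ntriples(table, csv_text):
    """Map each CSV row to N-Triples lines: one subject per row, one predicate per column."""
    lines = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for row_number, row in enumerate(reader, start=1):
        # Assumption: rows are identified by position, since no key is declared.
        subject = f"<{BASE}{quote(table)}/row={row_number}>"
        lines.append(f"{subject} <{RDF_TYPE}> <{BASE}{quote(table)}> .")
        for column, value in row.items():
            if column is None or not value:
                continue  # surplus or empty cells produce no triple
            predicate = f"<{BASE}{quote(table)}#{quote(column)}>"
            lines.append(f'{subject} {predicate} "{escape_literal(value)}" .')
    return lines


triples = table_to_ntriples(
    "schools", "name,province\nSchool A,Pathum Thani\nSchool B,\n"
)
```

Since the mapping needs no hand-written schema, it fits the goal of publishing RDF automatically from existing datasets. The trade-off is that all values stay plain literals until a later linking step.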

Open-D Framework
[Figure: layered architecture]
- Application Layer: built-in OGD services (form-based data query UI, form-based data visualization UI, report generation) and third-party applications
- API Layer: data query and aggregation query API
- RDF Data Layer: dataset collector, dataset validator, tabular data to RDF converter (D2RQ), dataset to RDF graph mapping, SPARQL query templates and generator, triplestore (Virtuoso)
- Data Source Layer: OGD catalog and datasets

Open-D Framework (2)
- Data source layer: the OGD catalog provides metadata of the datasets and links to download the dataset files (Excel or CSV).
- RDF data layer: the dataset collector and validator test the validity and well-formedness of each dataset. The RDF converter uses the D2RQ system in direct mapping mode and stores the data in a Virtuoso triplestore.
- API layer: data querying and aggregation APIs. An API request is transformed into a SPARQL query using a query template.
- Application layer: built-in data services (data query and visualization UI) and third-party applications.
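The step from API request to SPARQL query can be sketched with a string template. This is only an assumed shape. The template text, the `build_query` function, the `#column` predicate scheme, and the example URI are hypothetical and are not taken from Open-D.

```python
import re
from string import Template

# Hypothetical query template; $filters receives one triple pattern per parameter.
QUERY_TEMPLATE = Template("""\
SELECT ?row ?column ?value
WHERE {
  ?row a <$dataset_class> .
$filters
  ?row ?column ?value .
}
LIMIT $limit""")


def sparql_literal(value):
    """Quote a string as a SPARQL literal, escaping backslashes and quotes."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def build_query(dataset_class, params, limit=100):
    """Turn API parameters (column -> value) into a SPARQL query string."""
    for column in params:
        # Reject column names that could break out of the IRI brackets.
        if not re.fullmatch(r"\w+", column):
            raise ValueError(f"invalid column name: {column!r}")
    filters = "\n".join(
        f"  ?row <{dataset_class}#{column}> {sparql_literal(value)} ."
        for column, value in sorted(params.items())
    )
    return QUERY_TEMPLATE.substitute(
        dataset_class=dataset_class, filters=filters, limit=int(limit)
    )


query = build_query(
    "http://example.org/data/schools",
    {"province": "Pathum Thani", "type": "general"},
)
```

Keeping the SPARQL in a template means API users never write SPARQL themselves. Validating column names and escaping values keeps user input from altering the query structure.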

Dataset Validator
- A dataset is marked as valid only if no problem is found at the table and header levels.
- Table-level problems: metadata around the table (T-Metadata), multiple tables in one sheet (T-Multiple) and whitespace around the table (T-Whitespace).
- Header-level problems: empty header cells (H-Incomplete), header cells spanning multiple columns (H-Multiple-column-cell), repeated headers (H-Duplicate) and a number of header columns that does not match the number of row columns (H-Cardinality).
Reference: Ermilov, I., Auer, S. and Stadler, C. 2013. User-driven semantic mapping of tabular data. In Proc. of the 9th International Conference on Semantic Systems (I-SEMANTICS '13), 105-112.

[Figure: examples of well-formed table data and of the T-Multiple, H-Multiple-column-cell, T-Metadata and T-Whitespace problems]

Dataset Validation Steps
1. The number of columns is counted for each row in the dataset.
2. An inequality in the number of columns indicates a table-level problem. The table-level problem type is distinguished based on the location of the occurrences and on empty rows.
3. The dataset is additionally examined for irregularities in its header (first row). The header-level problem type is distinguished based on the number and location of missing columns in the header row and its successive rows.
4. When no problem is detected in a dataset, the dataset is marked as normal, i.e., well-formed tabular data.
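The steps above can be sketched as a small CSV checker. It is a reduced illustration, not the validator evaluated in the talk. The detection rules are assumed heuristics: a leading blank row counts as T-Whitespace and an interior blank row as T-Multiple. T-Metadata and H-Multiple-column-cell are not covered. The `validate` function name is hypothetical.

```python
import csv
import io


def is_blank(row):
    """True when every cell in the row is empty or whitespace."""
    return all(not cell.strip() for cell in row)


def validate(csv_text):
    """Return the set of detected problem labels; an empty set means well-formed."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    filled = [i for i, row in enumerate(rows) if not is_blank(row)]
    if not filled:
        raise ValueError("dataset contains no data")
    first, last = filled[0], filled[-1]
    problems = set()

    # Table level (assumed heuristics based on where blank rows occur).
    if first > 0:
        problems.add("T-Whitespace")   # blank rows before the table
    if len(filled) < last - first + 1:
        problems.add("T-Multiple")     # blank rows splitting the data

    # Header level: inspect the first non-blank row.
    header = [cell.strip() for cell in rows[first]]
    if any(cell == "" for cell in header):
        problems.add("H-Incomplete")
    names = [cell for cell in header if cell]
    if len(names) != len(set(names)):
        problems.add("H-Duplicate")
    # Compare each data row's column count with the header's.
    if any(len(rows[i]) != len(header) for i in filled[1:]):
        problems.add("H-Cardinality")
    return problems


ok = validate("name,province\nA,Pathum Thani\nB,Bangkok\n")
bad = validate("\nname,name\nA\n")
split = validate("a,b\n1,2\n\nc,d\n3,4\n")
```

Here `ok` is empty, `bad` reports T-Whitespace, H-Duplicate and H-Cardinality, and `split` reports T-Multiple. Returning a set of labels rather than a single verdict keeps table-level and header-level findings separate for reporting.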

System Architecture
[Figure: sources, tools, input data formats and output data formats; the Dataset Validator takes the dataset catalog and dataset files as input and outputs valid datasets (well-formed tabular data)]

Case Studies and Evaluation

Creating OGD services for Data.go.th
- A total of 1,366 dataset files were retrieved from the data catalog of Data.go.th (as of October 31, 2016) and used as inputs for the system.
- A total of 318 files (23.3%) passed the dataset validation as valid (well-formed) tabular data.
- After data cleansing, 267 of these files (84%) were successfully converted to OGD services.
- Unsuccessful conversions were mostly due to very large file sizes and false positives from the validator.
- The total number of RDF triples generated for the datasets was over 20 million.

Dataset validator evaluation

                         Precision   Recall   Accuracy
Well-formed dataset      0.98        0.96     0.97
Table-level problems     0.85        0.78     0.94
Header-level problems    0.25        1        0.99

- A preliminary evaluation of the dataset validator was conducted using 120 datasets randomly selected from Data.go.th.
- A confusion matrix was created for the detection of each problem type.
- The overall precision and recall of identifying well-formed tabular datasets were 0.98 and 0.96.
- Some confusion was detected between the table-level and header-level problems: some metadata above tables was mistakenly detected as incomplete table headers.
- The confusion did not affect the overall performance of service generation.
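For reference, the three measures in the table follow from a per-class confusion matrix in the usual way. The sketch below uses the standard definitions. The counts are hypothetical and chosen only to total 120. They are not the counts behind the reported figures.

```python
def metrics(tp, fp, fn, tn):
    """Precision, recall and accuracy from one binary confusion matrix."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, accuracy


# Hypothetical counts for one problem type over 120 datasets.
precision, recall, accuracy = metrics(tp=45, fp=5, fn=15, tn=55)
```

A class with few true instances can show low precision alongside high recall and high accuracy, which is the pattern in the header-level row: nearly every real case is caught, but most alarms are false.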

Demo OGD services for Data.go.th
Demo URL: http://api.data.go.th/demo/
[Screenshot: browse datasets and APIs, with links to the original dataset and to the dataset service page]

OGD services for a dataset
Thai Schools Dataset
Query: Find schools in Pathum Thani province whose school type is general education.
[Screenshot: forming the API request, query results, and the API result in JSON format]
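From the consumer side, forming such a request amounts to encoding the query conditions into a URL. The sketch below only builds the URL. The endpoint, the parameter names (`province`, `school_type`, `format`) and the `build_request` function are hypothetical, since the slide does not show the actual request syntax.

```python
from urllib.parse import urlencode

# Hypothetical endpoint; the real service URL and parameters are not shown on the slide.
BASE_URL = "http://example.org/api/dataset/thai-schools"


def build_request(filters, fmt="json"):
    """Encode equality filters and the desired result format into a request URL."""
    params = dict(filters)
    params["format"] = fmt
    return BASE_URL + "?" + urlencode(params)


url = build_request({
    "province": "Pathum Thani",
    "school_type": "general education",
})
```

The resulting URL could then be fetched with any HTTP client and the JSON body parsed, with no need to download the whole dataset file.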

Open-D Public Service
- An initiative to provide a repository of datasets for data scientists and developers.
- Open-D Public Service focuses on improving open data in two aspects:
- Data quality: only well-formed table data are allowed.
- Data accessibility: access to each dataset is provided as a data API. Users can analyze and visualize the data in a dataset via the provided tool and share the results via social media.

Open-D vs. other data services
Columns: Data.go.th, Datahub.io, Tableau/Power BI, Open-D Public Service
Gov only X X X
Well-formed table not required required required required
Data API Export to BI Tools X (query/aggregation)
Data Visualization X (advanced) X (basic)
Dataset Joins X under development
Graph Sharing/Embedding X (dashboard) X

Open-D Public Service Website
http://opend.openservice.in.th/

Open Data Hackathon Event
- Over 20 teams joined the 3-day hackathon event in September 2018.
- The event was powered by the Open-D platform.
- More than 500,000 cultural objects were made available as open datasets.
- Data access was provided through data APIs created by Open-D.

e-culture Open Data Hackathon 2018
[Photos: the event, the winner and the runners-up]

Conclusions
- We introduced open data concepts and some research challenges.
- We presented Open-D, a software platform for creating data services from open datasets.
- Our framework is unique in that it does not require technical skills from users to create open data APIs from datasets. Its key components are:
- a dataset validator for tabular data in spreadsheets;
- RDB-to-RDF data mapping and SPARQL query templates.

Conclusions (2)
- The Open-D platform was demonstrated and evaluated using datasets from Data.go.th. Over 80 percent of the identified well-formed datasets were successfully transformed into data services.
- The Open-D public service provides a Web portal for open data publishing and consumption for public use.
- The e-culture Open Data Hackathon is a showcase of creating innovation from open data via Open-D.

Future Work
- Improving the performance of the validator and applying automatic data cleansing to the datasets.
- Additional data services, such as integrating (linking) data across datasets.
- Data.go.th 2.0 is planned to be integrated with the Open-D platform (DGA-NECTEC collaboration).

Acknowledgment
This project was partially supported by the National Science and Technology Development Agency (NSTDA) and the Digital Government Agency (DGA), Thailand.