Towards Open Innovation with Open Data Service Platform
Marut Buranarach
Data Science and Analytics Research Group
National Electronics and Computer Technology Center (NECTEC), Thailand
The 44th Congress on Science and Technology of Thailand, October 30, 2018, BITEC, Bangkok, Thailand
Outline
  Introduction to Open Data
    5-Star Open Data Model
    Research Challenges
  Open Data Service Platform (Open-D)
    Problem and Motivation
    Framework
  Case Studies
    Experiments with Data.go.th Datasets
    Open-D Public Service
  Conclusion and Future Work
Introduction to Open Data
Big Data, Personal Data, Open Data
Source: https://github.com/theodi/data-definitions
Open Government Data
  Open government data (OGD) is a global initiative to promote transparency, service innovation and citizen participation.
  OGD is usually published as datasets on OGD portals (open data catalogs), e.g., Data.gov and Data.gov.uk.
  Datasets are available in spreadsheet form, i.e., in Excel and CSV formats.
Open Government Data Initiatives
  Data.gov
  Data.gov.uk
  Open-data.europa.eu
  Data.go.jp
5-Star Open Data Model
5-Star Open Data Model (2)
  Open data as spreadsheets (2-3 stars) have disadvantages:
    No global identifiers
    Not queryable
  Open data as RDF (4-5 stars) have advantages:
    Supports global identifiers (URIs)
    Queryable using SPARQL
    Linked data technology for data integration
Some Research Challenges
  Data Publishing
    How to automatically publish open data from different data sources, e.g., databases and sensor data.
  Data Quality
    How to measure and improve the quality of open data.
  Data Cataloging
    How to catalogue and index open datasets so that they can be discovered effectively.
    Google Dataset Search: https://toolbox.google.com/datasetsearch
Some Research Challenges (2)
  Data Access and Utilization
    Tools for querying and analyzing data in the datasets
    Data API
  Data Integration
    How to combine data from different datasets
    How to move from open data to linked open data (5-star open data)
Open Data Service Platform (Open-D)
Problems and Motivations
  Although publishing open data as spreadsheets is straightforward and requires minimal technical skills, it is not ideal for users who want to use the data more dynamically.
    Datasets must be downloaded in full.
    Updated datasets must be retrieved again.
  Such constraints on data consumption could limit the growth of OGD usage.
Problem
  Open data in spreadsheets is not easy to consume.
  (Diagram: publishers publish spreadsheets; a gap separates them from consumers, for whom spreadsheets are not easy to consume.)
  Proposed Solution: Automatically transform a dataset in spreadsheet form into a Data API and Data Visualization.
Data API
  An Application Programming Interface (API) for data, or Data API, offers access to data in an abstract, dynamic and queryable fashion.
    Updates to the data are transparent to the data consumers.
    Users can retrieve portions of the data through data querying interfaces.
  Although a data API is convenient for data consumers, building an API is typically costly for data publishers.
    APIs are usually developed only for some high-value datasets.
  A RESTful Web API is commonly preferred by application developers.
  (Diagram: an Application Developer sends a request URL to the Web API of Government Agencies, which is backed by a database and returns JSON or XML.)
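The request/response pattern described on this slide can be sketched as follows. This is only an illustration: the base URL, the parameter names (`province`, `limit`) and the `results` key in the JSON body are assumptions, not the actual Open-D or Data.go.th API.

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint, used only for illustration.
BASE_URL = "https://example.org/api/dataset/schools/query"


def build_request_url(base_url, filters, limit=10):
    """Build a query URL that asks for only the rows matching `filters`."""
    params = dict(filters)
    params["limit"] = limit
    return base_url + "?" + urlencode(params)


def parse_response(body):
    """Decode a JSON response body into a list of row dictionaries.

    The top-level "results" key is an assumed response shape.
    """
    return json.loads(body)["results"]


# The consumer asks for a portion of the data instead of downloading the file.
url = build_request_url(BASE_URL, {"province": "Pathum Thani"}, limit=5)

# A canned response stands in for the network call in this sketch.
sample_body = '{"results": [{"school": "A", "province": "Pathum Thani"}]}'
rows = parse_response(sample_body)
print(url)
print(rows)
```

Because the consumer always queries the service rather than a downloaded copy, an update on the publisher side requires no change in this client code.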
Our project: Open Data Service Platform (Open-D)
  We propose a framework for building data services from existing datasets available in OGD catalogs.
    Automatically extract tabular data from the datasets.
    Convert the tabular data to RDF (Resource Description Framework) data and provide data services on top of the RDF data.
  The system-generated data services:
    Data API
    Data Visualization
Open Data Service Platform (Open-D)
  (Diagram: Government Agencies upload datasets; Open-D automatically transforms the datasets into data APIs and visualizations; Application Developers access the APIs; Data Scientists visualize the data.)
OGD Service Ecosystem
  Government Agencies: Automatic publishing of RDF and data APIs from existing OGD datasets via RDB-RDF direct mapping (4-star open data).
  Application Developers: Data access is provided in two forms, RDF and Data API.
  Data Scientists: Self-service data visualization for individual datasets. The services are built on top of the data APIs.
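The idea behind direct mapping can be sketched in a few lines: each table row becomes a subject, each column a predicate, and each cell a literal. The sketch below is a simplified illustration, not the D2RQ implementation used by Open-D; the base URI and the URI layout are assumptions.

```python
from urllib.parse import quote

# Hypothetical base URI for generated resources.
BASE = "http://example.org/dataset/"


def escape_literal(value):
    """Apply minimal escaping for a plain N-Triples string literal."""
    text = str(value)
    text = text.replace("\\", "\\\\").replace('"', '\\"')
    return text.replace("\n", "\\n").replace("\r", "\\r")


def row_to_triples(dataset_id, row_number, row):
    """Map one table row (column name -> cell value) to N-Triples lines."""
    subject = f"<{BASE}{dataset_id}/row/{row_number}>"
    triples = []
    for column, value in row.items():
        # Column names are percent-encoded so they are safe inside a URI.
        predicate = f"<{BASE}{dataset_id}#{quote(column)}>"
        triples.append(f'{subject} {predicate} "{escape_literal(value)}" .')
    return triples


for line in row_to_triples("schools", 1, {"name": "A", "school type": "general"}):
    print(line)
```

Since every row and column receives a URI, the resulting data has global identifiers and can be loaded into a triplestore and queried with SPARQL.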
Open-D Framework
  (Architecture diagram, top to bottom)
  Application Layer: built-in OGD services (form-based data query UI, form-based data visualization UI, report generation) and third-party applications
  API Layer: data query and aggregation query APIs
  RDF Data Layer: Dataset Collector, Dataset Validator, tabular data to RDF converter (D2RQ), dataset to RDF graph mapping, SPARQL query templates and generator, triplestore (Virtuoso)
  Data Source Layer: OGD catalog and datasets
Open-D Framework (2)
  Data source layer:
    The OGD catalog holds metadata of the datasets and links for downloading the dataset files (Excel or CSV).
  RDF data layer:
    The dataset collector and validator test the validity and well-formedness of each dataset.
    The RDF converter uses the D2RQ system in direct mapping mode and stores the data in a Virtuoso triplestore.
  API layer:
    Data querying and aggregation APIs.
    Each API request is transformed into a SPARQL query using a query template.
  Application layer:
    Built-in data services: data query and visualization UI.
    Third-party applications.
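The step from an API request to a SPARQL query can be illustrated with a simple template. This is a minimal sketch under stated assumptions: the template text, the parameter names and the graph/predicate URIs are hypothetical and do not reproduce Open-D's actual templates.

```python
from string import Template

# Hypothetical query template: select rows whose value for one column
# (predicate) equals a requested literal, within one dataset graph.
QUERY_TEMPLATE = Template("""SELECT ?row ?value
WHERE {
  GRAPH <$graph> {
    ?row <$predicate> ?value .
    FILTER (STR(?value) = "$literal")
  }
}
LIMIT $limit""")


def build_sparql(graph, predicate, literal, limit=100):
    """Fill the template from API request parameters."""
    # Escape backslashes and quotes so the value stays inside the string literal.
    safe_literal = literal.replace("\\", "\\\\").replace('"', '\\"')
    return QUERY_TEMPLATE.substitute(
        graph=graph,
        predicate=predicate,
        literal=safe_literal,
        limit=int(limit),  # int() rejects non-numeric limits
    )


print(build_sparql(
    "http://example.org/dataset/schools",
    "http://example.org/dataset/schools#province",
    "Pathum Thani",
    limit=5,
))
```

Keeping the query shape fixed and substituting only validated parameters means API users never have to write SPARQL themselves.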
Dataset Validator
  A dataset is marked as valid only if no problem is found at either the table level or the header level.
  Table-level problems: metadata around the table (T-Metadata), multiple tables in one sheet (T-Multiple) and whitespace around the table (T-Whitespace).
  Header-level problems: empty header cells (H-Incomplete), header cells spanning multiple columns (H-Multiple-column-cell), repeated headers (H-Duplicate) and a number of header columns that does not match the number of row columns (H-Cardinality).
  Ermilov, I., Auer, S. and Stadler, C. 2013. User-driven semantic mapping of tabular data. In Proc. of the 9th International Conference on Semantic Systems (I-SEMANTICS '13), 105-112.
(Figure: examples of well-formed table data and of the T-Multiple, H-Multiple-column-cell, T-Metadata and T-Whitespace problems.)
Dataset Validation Steps
  1. The number of columns is counted for each row in the dataset.
  2. Unequal column counts indicate a table-level problem. The table-level problem type is distinguished by the location of the occurrences and of empty rows.
  3. The dataset is additionally examined for irregularities in its header (first row). The header-level problem type is distinguished by the number and location of missing columns in the header row and its successive rows.
  4. When no problem is detected, the dataset is marked as normal, i.e., well-formed tabular data.
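The validation steps can be sketched as below. This is a simplified illustration rather than the Open-D validator: it reports a generic "table-level" label instead of distinguishing T-Metadata, T-Multiple and T-Whitespace, and it checks only two of the header-level problems. The function names are hypothetical.

```python
import csv
import io


def read_rows(text):
    """Parse CSV text into a list of rows (each row is a list of cells)."""
    return list(csv.reader(io.StringIO(text)))


def validate_table(rows):
    """Return a list of detected problem labels; an empty list means well-formed."""
    if not rows:
        raise ValueError("dataset has no rows")
    problems = []

    # Steps 1-2: count the columns of every row; unequal counts signal
    # a table-level problem.
    widths = {len(row) for row in rows}
    if len(widths) > 1:
        problems.append("table-level")

    # Step 3: examine the header (first row) for irregularities.
    header = [cell.strip() for cell in rows[0]]
    if any(cell == "" for cell in header):
        problems.append("H-Incomplete")
    named = [cell for cell in header if cell]
    if len(named) != len(set(named)):
        problems.append("H-Duplicate")

    # Step 4: no problems found means the dataset is well-formed.
    return problems


good = read_rows("name,province\nA,Pathum Thani\nB,Bangkok\n")
bad = read_rows("name,name,\nA,B,C\n")
print(validate_table(good))
print(validate_table(bad))
```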
System Architecture
  (Diagram: sources, tools, input data formats and output data formats. The Dataset Validator takes the dataset catalog and dataset files as input and outputs valid datasets, i.e., well-formed tabular data.)
Case Studies and Evaluation
Creating OGD services for Data.go.th
  A total of 1,366 dataset files were retrieved from the data catalog of Data.go.th (as of October 31, 2016) as inputs for the system.
  A total of 318 files (23.3%) passed the dataset validation as valid (well-formed) tabular data.
  After data cleansing, 267 of the 318 valid files (84%) were successfully converted to OGD services.
  Unsuccessful conversions were mostly due to very large file sizes and false positives of the validator.
  Over 20 million RDF triples were generated for the datasets.
Dataset validator evaluation

                          Precision   Recall   Accuracy
  Well-formed dataset     0.98        0.96     0.97
  Table-level problems    0.85        0.78     0.94
  Header-level problems   0.25        1        0.99

  A preliminary evaluation of the dataset validator was conducted using 120 datasets randomly selected from Data.go.th.
  A confusion matrix was created for the detection of each problem type.
  The overall precision and recall for identifying well-formed tabular datasets were 0.98 and 0.96.
  Some confusion was detected between the table-level and header-level problems.
    Some metadata above tables was mistakenly detected as incomplete table headers.
    The confusion did not affect the overall performance of service generation.
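For reference, the three measures in the table are derived from a confusion matrix as sketched below. The counts in the example are hypothetical and chosen only to show the calculation; they are not the counts from this evaluation.

```python
def detection_metrics(tp, fp, fn, tn):
    """Compute precision, recall and accuracy from confusion-matrix counts.

    tp/fp/fn/tn are the numbers of true positives, false positives,
    false negatives and true negatives for one problem type.
    """
    total = tp + fp + fn + tn
    if total == 0:
        raise ValueError("confusion matrix is empty")
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / total
    return precision, recall, accuracy


# Hypothetical counts for 120 datasets (not the study's figures).
precision, recall, accuracy = detection_metrics(tp=45, fp=5, fn=10, tn=60)
print(f"precision={precision:.2f} recall={recall:.2f} accuracy={accuracy:.2f}")
```

Accuracy can stay high while precision is low when a problem type is rare, because the many true negatives dominate the accuracy figure.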
Demo OGD services for Data.go.th
  Demo URL: http://api.data.go.th/demo/
  (Screenshot: browse datasets and APIs, with links to the original dataset and to the dataset service page.)
OGD services for a dataset
  Example: Thai Schools Dataset
  Query: Find schools in Pathum Thani province whose school type is general education.
  (Screenshots: forming the API request, the query results, and the API result in JSON format.)
Open-D Public Service
  An initiative to provide a repository of datasets for data scientists and developers.
  Open-D Public Service focuses on improving open data in two aspects:
    Data Quality
      Only well-formed table data are allowed.
    Data Accessibility
      Access to each dataset is provided as a Data API.
      Users can analyze and visualize the data in a dataset with the provided tool and share the results via social media.
Open-D vs. other data services
  Services compared: Data.go.th, Datahub.io, Tableau/Power BI, Open-D
  Public service: Gov only, X, X, X
  Well-formed table: not required, required, required, required
  Data API: Export to BI Tools, X (query/aggregation)
  Data Visualization: X (advanced), X (basic)
  Dataset Joins: X, under development
  Graph Sharing/Embedding: X (dashboard), X
Open-D Public Service Website
  http://opend.openservice.in.th/
Open-D Public Service Website (2)
Open-D Public Service Website (3)
Open Data Hackathon Event
  Over 20 teams joined the 3-day hackathon event in September 2018.
  The event was powered by the Open-D platform.
    More than 500,000 cultural objects were made available as open datasets.
    Data access was provided as Data APIs created by Open-D.
e-culture Open Data Hackathon 2018
e-culture Open Data Hackathon 2018: Winner and Runners-up
Conclusions
  We introduced open data concepts and some research challenges.
  We presented Open-D, a software platform for creating data services from open datasets.
  Our framework is unique in that it does not require technical skills from users to create open data APIs from datasets. It relies on:
    A dataset validator for tabular data in spreadsheets.
    RDB-to-RDF data mapping and SPARQL query templates.
Conclusions (2)
  The Open-D platform was demonstrated and evaluated using datasets from Data.go.th.
    Over 80 percent of the datasets identified as well-formed were successfully transformed into data services.
  The Open-D public service provides a Web portal for open data publishing and consumption for public use.
  The e-culture Open Data Hackathon is a showcase of creating innovation from open data via Open-D.
Future Work
  Improving the performance of the validator and applying automatic data cleansing to the datasets.
  Additional data services, such as integrating (linking) data across datasets.
  Data.go.th 2.0 is planned to be integrated with the Open-D platform (DGA-NECTEC collaboration).
Acknowledgment
  This project was partially supported by the National Science and Technology Development Agency (NSTDA) and the Digital Government Agency (DGA), Thailand.