Data-Transformation on historical data using the RDF Data Cube Vocabulary

Similar documents
DATA WAREHOUSE- MODEL QUESTIONS

This tutorial will help computer science graduates to understand the basic-to-advanced concepts related to data warehousing.

ETL and OLAP Systems

Data Warehouse and Data Mining

DATA MINING AND WAREHOUSING

DATA WAREHOUSE EGCO321 DATABASE SYSTEMS KANAT POOLSAWASD DEPARTMENT OF COMPUTER ENGINEERING MAHIDOL UNIVERSITY

Chapter 13 Business Intelligence and Data Warehouses The Need for Data Analysis Business Intelligence. Objectives

CS614 - Data Warehousing - Midterm Papers Solved MCQ(S) (1 TO 22 Lectures)

FedDW Global Schema Architect

Data Warehousing and Decision Support. Introduction. Three Complementary Trends. [R&G] Chapter 23, Part A

CSE 544 Principles of Database Management Systems. Alvin Cheung Fall 2015 Lecture 8 - Data Warehousing and Column Stores

Data Warehousing and Decision Support

How to Deploy Enterprise Analytics Applications With SAP BW and SAP HANA

1 Dulcian, Inc., 2001 All rights reserved. Oracle9i Data Warehouse Review. Agenda

Data Warehousing and Decision Support

SAMPLE. Preface xi 1 Introducting Microsoft Analysis Services 1

UNIT -1 UNIT -II. Q. 4 Why is entity-relationship modeling technique not suitable for the data warehouse? How is dimensional modeling different?

Oracle BI 11g R1: Build Repositories Course OR102; 5 Days, Instructor-led

Oracle BI 11g R1: Build Repositories

This tutorial has been prepared for computer science graduates to help them understand the basic-to-advanced concepts related to data mining.

Liberate, a component-based service orientated reporting architecture

This download file shows detailed view for all updates from BW 7.5 SP00 to SP05 released from SAP help portal.

Big Data 13. Data Warehousing

Expanding Open Access to Your OLAP Data

Data Warehousing. Adopted from Dr. Sanjay Gunasekaran

Business Intelligence Roadmap HDT923 Three Days

Data processing. Intelligent Data Analysis Fault Tolerant Systems Research Group

Automated Visualization Support for Linked Research Data

1. Analytical queries on the dimensionally modeled database can be significantly simpler to create than on the equivalent nondimensional database.

Data Warehousing. Ritham Vashisht, Sukhdeep Kaur and Shobti Saini

Implementing Data Models and Reports with Microsoft SQL Server Exam Summary Syllabus Questions

Informatica Power Center 10.1 Developer Training

Datameer for Data Preparation:

CHAPTER 8: ONLINE ANALYTICAL PROCESSING(OLAP)

OPAX - An Open Peer-to-Peer Architecture for XML Message Exchange

Why Was Arbil Written

Data warehouses Decision support The multidimensional model OLAP queries

Cloud Computing & Visualization

Historization and Versioning of DDI-Lifecycle Metadata Objects

Data Warehousing and Decision Support (mostly using Relational Databases) CS634 Class 20

Aggregating Knowledge in a Data Warehouse and Multidimensional Analysis

INDEPTH Network. Introduction to ETL. Tathagata Bhattacharjee ishare2 Support Team

Chapter 3. Architecture and Design

Improving the Performance of OLAP Queries Using Families of Statistics Trees

Introduction to Data Science

Data Warehousing ETL. Esteban Zimányi Slides by Toon Calders

Automated Classification. Lars Marius Garshol Topic Maps

Implementing the RDA Data Citation Recommendations for Long Tail Research Data. Stefan Pröll

IBM Cognos Framework Manager: Design Metadata Models (V10.2)

IT DATA WAREHOUSING AND DATA MINING UNIT-2 BUSINESS ANALYSIS

Migrate from Netezza Workload Migration

Q1) Describe business intelligence system development phases? (6 marks)

Logi Info v12.5 WHAT S NEW

Tessera Rapid Modeling Environment: Production-Strength Data Mining Solution for Terabyte-Class Relational Data Warehouses

Use Table Styles to format an entire table. Format a table. What do you want to do? Hide All

Evolution of Database Systems

Database design View Access patterns Need for separate data warehouse:- A multidimensional data model:-

Index COPYRIGHTED MATERIAL. Symbols and Numerics

Data Warehouse and Data Mining

Commercially Empowered Linked Open Data Ecosystems in Research

After completing this course, participants will be able to:

MicroStrategy Desktop MicroStrategy 10.2: New features overview. microstrategy.com 1

Decision Support. Chapter 25. CS 286, UC Berkeley, Spring 2007, R. Ramakrishnan 1

C_TBI30_74

D3.3 Refined Federated Querying and Aggregation

<Insert Picture Here> Oracle SQL Developer Data Modeler 3.0: Technical Overview

OLAP Introduction and Overview

Interactive PMCube Explorer

EUDAT B2FIND A Cross-Discipline Metadata Service and Discovery Portal

CHAPTER 8 DECISION SUPPORT V2 ADVANCED DATABASE SYSTEMS. Assist. Prof. Dr. Volkan TUNALI

6 SSIS Expressions SSIS Parameters Usage Control Flow Breakpoints Data Flow Data Viewers

Data Warehousing and OLAP

Proceedings of the IE 2014 International Conference AGILE DATA MODELS

Guide Users along Information Pathways and Surf through the Data

Vector Issue Tracker and License Manager - Administrator s Guide. Configuring and Maintaining Vector Issue Tracker and License Manager

Release notes for version 3.7.1

Course Contents: 1 Business Objects Online Training

Filtering, Sorting and Ranking

Constructing Object Oriented Class for extracting and using data from data cube

Foundations of SQL Server 2008 R2 Business. Intelligence. Second Edition. Guy Fouche. Lynn Lang it. Apress*

Big Trend in Business Intelligence: Data Mining over Big Data Web Transaction Data. Fall 2012

The Emergence of Application Logic Compilers Stefan Dipper, SAP BW Development Sept, Public

A Benchmarking Criteria for the Evaluation of OLAP Tools

An Overview of Data Warehousing and OLAP Technology

Capability White Paper Straight-Through-Processing (STP)

Chapter 18: Data Analysis and Mining

Financial Dataspaces: Challenges, Approaches and Trends

SAS Web Report Studio 3.1

Jet Data Manager 2014 SR2 Product Enhancements

Appliances and DW Architecture. John O Brien President and Executive Architect Zukeran Technologies 1

Teradata Aggregate Designer

E(xtract) T(ransform) L(oad)

Moving to a Data Warehouse

RAPIDMINER FREE SOFTWARE FOR DATA MINING, ANALYTICS AND BUSINESS INTELLIGENCE

In-memory Analytics Guide

GUJARAT TECHNOLOGICAL UNIVERSITY MASTER OF COMPUTER APPLICATIONS (MCA) Semester: IV

Summary. RapidMiner Project 12/13/2011 RAPIDMINER FREE SOFTWARE FOR DATA MINING, ANALYTICS AND BUSINESS INTELLIGENCE

alteryx training courses

Overview. Introduction to Data Warehousing and Business Intelligence. BI Is Important. What is Business Intelligence (BI)?

Release notes for version 3.7.2

Transcription:

Data-Transformation on historical data using the RD Data Cube Vocabulary Sebastian Bayerl, Michael Granitzer Department of Media Computer Science University of Passau SWIB15 Semantic Web in Libraries 22.10.2015

2 Overview Motivation Vocabulary and Dataset Problem Setting and Approach Workflow Contributions

3 Motivation Statistical and historical data source Statistics of the German Reich (Digitalized) Access the encapsulated knowledge Data Analytics and Recommendation Using Linked Data (RD Data Cube Vocabulary) But first: Data Integration Data Cleaning, -Transformation and -usion

Source structure 4

Target structure D3 a 1 2 3 D3 b 5 6 7 D1 a D1 b D1 c 4 8 D1 d D2 a D2 a D2 b D2 b 3 D1 c D2 b 1 D1 a D2 a D3 a D3 a 2 D1 b D2 a D3 a

6 Data Cubes and OLAP Cube: Multi-dimensional data structure Observation: measures and dimensions Measure: numerical fact Dimension: describes the fact(s) Enables Data Analytics OLAP: Online Analytical Processing Slicing, Dicing, Roll-Up,

The RD Data Cube Vocabulary RD based vocabulary Models an OLAP Data Cube Interlink components with existing concepts http://www.w3.org/tr/vocab-data-cube/

Examples 1 8

Examples 2 9

10

11 Problem Setting Data is encapsulated in multiple files Unusable for sophisticated Data Analysis Normalization of complex structured data Dirty and faulty data, structure or annotations Lots of similar problems in a huge dataset

12 Approach Use the RD Data Cube Vocabulary Enables: Interlinking, merging and analytics Use an incremental workflow Identify fine-granular transformations Implement the research prototype with GUI Select, configure and chain transformations (save/load) HTML preview

13 Workflow Java objects 4.Iterate transformations RD 2. Parse files 3. Merge into single table 5. Apply transformation 7. Convert to RD 8. Persist RD TEI HTML Relational database or Data-Warehouse 1. Load link group Transformations 6. Produce HTML visualisation Convert to SQL Statements Persist Data

14 Transformations 1. Pre-Normalization Sanity checks Data Cleaning ix structure (e.g. spans), data and annotations Delete row (e.g. repeating headers) 30 more 2. Normalization Normalization Compound normalization: Horizontal or vertical partitions 3. Post-Normalization Add/merge/delete columns Add headers/disambiguation Add metadata

15 Advanced transformations Compound transformations Combine multiple transformation ix more complex problems E.g. find problematic cells and fix with existing transformation Transformation suggestions ind common problems: Repeat symbol, annotation patterns A step towards automation

16 Contributions Modular workflow for the Data Integration process Definition of fine granular transformation steps Reusable within the same or for other data sources Lift and enrich historical statistical data Ready for Data Analytics Current datset contains 32169 files > 10% converted 10 conversion chains https://github.com/bayerls/statistics2cubes

17 Thank you for your attention! Question? RD Data Cube Vocabulary: http://www.w3.org/tr/vocab-data-cube/ https://github.com/bayerls/statistics2cubes Sebastian Bayerl Department of Media Computer Science University of Passau bayerl@dimis.fim.uni-passau.de

BACKUP

Publication Bayerl, Sebastian, and Michael Granitzer. "Data-transformation on historical data using the RD data cube vocabulary." Proceedings of the 15th International Conference on Knowledge Technologies and Data-driven Business. ACM, 2015.

Abstract This work describes how XML-based TEI documents, containing statistical data, can be normalized, converted and enriched using the RD Data Cube Vocabulary. In particular we focus on a statistical real world data set, namely the statistics of the German Reich around the year 1880, which are available in the TEI format. The data is embedded in complex structured tables, which are relatively easy to understand for humans but they are not suitable for automated processing and data analysis, without heavy pre-processing, due to their varying structural properties and differing table layouts. Therefore, the complex structured tables must be validated, modified and transformed, until they are suitable for the standardized multi-dimensional data structure - the data cube. This work especially focuses on the transformations necessary to normalize the structure of the tables. Performing validation- and cleaning-steps, resolving row- and column-spans and reordering slices are available transformations among multiple others. By combining exiting transformations, compound operators are implemented, which can handle specific and complex problems. The identification of structural similarities or properties can be used to automatically suggest sequences of transformations. A second focus is on the advantages, which come by using the RD Data Cube Vocabulary. Also, a research prototype was implemented to execute the workflow and convert the statistical data into data cubes.