Interactive Data Exploration Related works

Similar documents
Explore-by-Example: An Automatic Query Steering Framework for Interactive Data Exploration

DATA MINING AND WAREHOUSING

Time: 3 hours. Full Marks: 70. The figures in the margin indicate full marks. Answers from all the Groups as directed. Group A.

Data Mining. Part 2. Data Understanding and Preparation. 2.4 Data Transformation. Spring Instructor: Dr. Masoud Yaghini. Data Transformation

DATA WAREHOUSE EGCO321 DATABASE SYSTEMS KANAT POOLSAWASD DEPARTMENT OF COMPUTER ENGINEERING MAHIDOL UNIVERSITY

Discovering the Association Rules in OLAP Data Cube with Daily Downloads of Folklore Materials *

Basics of Dimensional Modeling

DATA WAREHOUING UNIT I

Efficient Cube Construction for Smart City Data

Code No: R Set No. 1

An Overview of Data Warehousing and OLAP Technology

Question Bank. 4) It is the source of information later delivered to data marts.

Summary of Last Chapter. Course Content. Chapter 3 Objectives. Chapter 3: Data Preprocessing. Dr. Osmar R. Zaïane. University of Alberta 4

Guide Users along Information Pathways and Surf through the Data

OLAP Introduction and Overview

Data Preprocessing. Slides by: Shree Jaswal

Decision Support Systems aka Analytical Systems

Data Mining: Exploring Data. Lecture Notes for Chapter 3

Data Analysis and Data Science

R07. FirstRanker. 7. a) What is text mining? Describe about basic measures for text retrieval. b) Briefly describe document cluster analysis.

Lectures for the course: Data Warehousing and Data Mining (IT 60107)

International Journal of Scientific & Engineering Research, Volume 7, Issue 11, November ISSN

INSTITUTE OF AERONAUTICAL ENGINEERING (Autonomous) Dundigal, Hyderabad

CS 1655 / Spring 2013! Secure Data Management and Web Applications

Data Mining: Exploring Data. Lecture Notes for Chapter 3. Introduction to Data Mining

Chapter 18: Data Analysis and Mining

Data Warehousing. Data Warehousing and Mining. Lecture 8. by Hossen Asiful Mustafa

Data Mining: Exploring Data. Lecture Notes for Data Exploration Chapter. Introduction to Data Mining

Overview of Web Mining Techniques and its Application towards Web

Evolution of Database Systems

Grid Computing Systems: A Survey and Taxonomy

CS377: Database Systems Data Warehouse and Data Mining. Li Xiong Department of Mathematics and Computer Science Emory University

Introduction to Data Mining and Data Analytics

1. Inroduction to Data Mininig

An Overview of various methodologies used in Data set Preparation for Data mining Analysis

CT75 (ALCCS) DATA WAREHOUSING AND DATA MINING JUN

20466C - Version: 1. Implementing Data Models and Reports with Microsoft SQL Server

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

INTRODUCTION TO DATA MINING. Daniel Rodríguez, University of Alcalá

Table Of Contents: xix Foreword to Second Edition

The strategic advantage of OLAP and multidimensional analysis

Interactive Data Exploration via Machine Learning Models

1 (eagle_eye) and Naeem Latif

Variety-Aware OLAP of Document-Oriented Databases

Data Mining Concepts & Techniques

GUJARAT TECHNOLOGICAL UNIVERSITY MASTER OF COMPUTER APPLICATIONS (MCA) Semester: IV

Graph Mining and Social Network Analysis

2. (a) Briefly discuss the forms of Data preprocessing with neat diagram. (b) Explain about concept hierarchy generation for categorical data.

Oracle 1Z0-515 Exam Questions & Answers

Exam Datawarehousing INFOH419 July 2013

Dta Mining and Data Warehousing

Best First and Greedy Search Based CFS and Naïve Bayes Algorithms for Hepatitis Diagnosis

Data Preprocessing. Why Data Preprocessing? MIT-652 Data Mining Applications. Chapter 3: Data Preprocessing. Multi-Dimensional Measure of Data Quality

Analytical model A structure and process for analyzing a dataset. For example, a decision tree is a model for the classification of a dataset.

Information Management Fundamentals by Dave Wells

SCHEME OF COURSE WORK. Data Warehousing and Data mining

PESIT- Bangalore South Campus Hosur Road (1km Before Electronic city) Bangalore

Data Warehousing and Data Mining

On-Line Analytical Processing (OLAP) Traditional OLTP

TIM 50 - Business Information Systems

CSE 544 Principles of Database Management Systems. Alvin Cheung Fall 2015 Lecture 8 - Data Warehousing and Column Stores

Contents. Foreword to Second Edition. Acknowledgments About the Authors

An Introduction to Big Data Formats

OLAP2 outline. Multi Dimensional Data Model. A Sample Data Cube

Knowledge Discovery and Data Mining

MICROSOFT BUSINESS INTELLIGENCE

SQL Server Analysis Services


SQL Server and MSBI Course Content SIDDHARTH PATRA

ETL and OLAP Systems

Advanced Data Management Technologies

Data Preprocessing Yudho Giri Sucahyo y, Ph.D , CISA

Information Management course

DATA MINING TRANSACTION

CT75 DATA WAREHOUSING AND DATA MINING DEC 2015

Data Mining and Analytics. Introduction

Data warehouses Decision support The multidimensional model OLAP queries

Top Five Reasons for Data Warehouse Modernization Philip Russom

CSCI 599: Applications of Natural Language Processing Information Retrieval Evaluation"

Database design View Access patterns Need for separate data warehouse:- A multidimensional data model:-

Compressed Aggregations for mobile OLAP Dissemination

OPTIMIZING ACCESS ACROSS HIERARCHIES IN DATA WAREHOUSES QUERY REWRITING ICS 624 FINAL PROJECT MAY 成玉 Cheng, Yu. Supervisor: Lipyeow Lim, PhD

DATA MINING - 1DL105, 1DL111

IT DATA WAREHOUSING AND DATA MINING UNIT-2 BUSINESS ANALYSIS

This tutorial will help computer science graduates to understand the basic-to-advanced concepts related to data warehousing.

A Benchmarking Criteria for the Evaluation of OLAP Tools

Information Retrieval

SIDDHARTH GROUP OF INSTITUTIONS :: PUTTUR Siddharth Nagar, Narayanavanam Road QUESTION BANK (DESCRIPTIVE)

UNIT 2 Data Preprocessing

Chapter 3. The Multidimensional Model: Basic Concepts. Introduction. The multidimensional model. The multidimensional model

Evolving To The Big Data Warehouse

Data Warehouse Logical Design. Letizia Tanca Politecnico di Milano (with the kind support of Rosalba Rossato)

Chapter 3. Foundations of Business Intelligence: Databases and Information Management

Unifying Big Data Workloads in Apache Spark

Data Warehousing and Data Mining. Announcements (December 1) Data integration. CPS 116 Introduction to Database Systems

Advanced Data Management Technologies Written Exam

Product Clustering for Online Crawlers

2 CONTENTS

Knowledge Discovery in Data Bases

UNIT 2. DATA PREPROCESSING AND ASSOCIATION RULES

Transcription:

Interactive Data Exploration Related works Ali El Adi Bruno Rubio Deepak Barua Hussain Syed Databases and Information Retrieval Integration Project

Recap Smart-Drill

AlphaSum: Size constrained table summarization using value lattices K. Selc uk Candan, Huiping Cao, Yan Qi Arizona State Univ. Tempe, USA Maria Luisa Sapino Univ. di Torino Torino, Italy mlsapino@di.unito.it

Size-Constrained Table Summarization AlphaSum: Size-Constrained Table Summarization using Value Lattices Table summarization can benefit from knowledge about acceptable value clustering alternatives for clustering the values in the database. Summarization of large data tables is required in many scenarios where it is hard to display complete data sets.

Size-Constrained Table Summarization For example, table summarization for mobile commerce applications over PDAs, which cannot effectively present a large table of results with their small screens. Table summarization is also useful in various other scenarios, where it is hard or highly-impractical to visualize large data sets

Size-Constrained Table Summarization Consider, for example, a scientist exploring the Digi- tal Archaeologial Record (tdar/ficsr). A digital library which archives and provides access to a large number of (and diverse) data sets, collected by different researchers.

Example User is interested in summarizing the given table based on the attribute pair, Age,Location. The summarized table can be visualized in a space that can hold at most 2 tuples

Example Can be achieved by clustering rows such that each row in the summary corresponds to at least 3 (= 6/2 ) rows in the original table. It is 3-summarization.

Example

Value hierarchies (such as the one in example) are commonly used for user-driven data exploration within large data sets. For example, OLAP operators (such as drill- down, navigating from less specific to more detailed data by stepping down on a given hierarchy, and roll-up, which performs aggregation on by climbing up a hierarchy of concepts).

OLAP-based navigation (using drill-down and roll-up operations), on the other hand, does not take into account the visualization real estate (e.g., the number of tuples to be dis- played) available for exploration.

AlphaSum the gist As described above, the main goal of AlphaSum is to obtain OLAP-like navigable summaries from large tables. Its goals, however, differ from OLAP in two significant ways: First of all, unlike traditional OLAP, AlphaSum takes the visualization into account. the maximum number of tuples to be visualized to the user as a highest level summary is an input parameters to AlphaSum

Secondly, AlphaSum recognizes that in many applications (even in OLAP), systems may need to take into account the existence of multiple acceptable value clustering alternatives. For example, value lattice shown on next slide provides more clustering alternatives than the simpler hierarchy in the initial example.

Cube Data Structure for Smart Drill Scientific Paper Efficient Cube Construction for Smart City Data Michael Scriney & Mark Roantree Insight Centre for Data Analytics, School of Computing, Dublin City University, Dublin 9, Ireland michael.scriney@insightcentre.org, mark.roantree@dcu.ie

What is DWARF Dwarf is a patented (US Patent 7,133,876) highly compressed structure for computing, storing, and querying Data Cubes. It is a highly compressed structure with reduction reaching 1:1,000,000 depending on the data distribution. The most important aspect of this patented Dwarf technology is that its data fusion (prefix and suffix redundancy elimination) is discovered and eliminated BEFORE the cube is computed and this explains the dramatic reduction in compute time The term Dwarf is used in analogy to dwarf stars that have a very large condensed mass but occupy very small space.

Sample Input Data

DWARF Cube Data Structure

DWARF Schema

Storage and Time Performance MySQL-Min has best storage performance of all schemas

Storage and Time Performance NoSQL-DWARF has best time performance of all schemas

Query Steering One of the Query Steering modes: Auto Steering System automatically recommends queries through the user profile. A new framework is proposed to address how the system should discover relevant data. Automatic Interactive Data Exploration (AIDE) Dimitriadou, K., Papaemmanouil, O., & Diao, Y. (2016). AIDE: An Active Learning-Based Approach for Interactive Data Exploration. IEEE Transactions on Knowledge and Data Engineering, 28(11), 2842-2856.

AIDE An iterative exploration model that: Retrieves objects to be extracted and labeled by the user Minimizes the number of samples presented to the user How to do it? By integrating machine learning techniques with classification algorithms and decision tree systems.

AIDE Framework Overview

AIDE Framework steps with examples CLINICAL TRIALS 1. Sample Extraction Records of clinical trials are extracted, along with its attributes (e.g. year, outcome, patience age, medication dosage, etc). 2. User Relevance Feedback Each sample is labeled as interesting or not. Users can also mark similar samples and modify their feedback on previously seen samples.

AIDE Framework steps with examples CLINICAL TRIALS 3. Data Classification Samples are used to train a classification model that characterizes user s interests and predicts which clinical trials are relevant based on the feedback collected so far. 4. Space Exploration Feedbacks and current user model are used to identify promising sample areas and retrieve next sample from dataset.

AIDE Framework steps with examples CLINICAL TRIALS 5. Query Formulation By utilizing decision tree classifiers to identify linear patterns of user interests, a classification model is produced and then translated into a query expression. Let s assume a decision tree classifier that predicts relevant and irrelevant clinical trials objects based on the attributes age and dosage:

AIDE Framework decision tree to query CLINICAL TRIALS Query Formulation select * from table where (age <= 20 and dosage >10 and dosage <= 15) or (age > 20 and age <= 40 and dosage >= 0 and dosage <= 10)

AIDE s effectiveness Goal: maximize F-measure of the final decision tree C on the total data space T, defined as: where : The perfect precision value of 1.0 means that every object characterized as relevant by the decision tree is indeed relevant.

Space Exploration Overview AIDE Framework Goal: - Discover relevant area(s) - Formulate user queries to select relevant area(s)

Space Exploration Overview 3 Exploration phases: Assumption: 1. Relevant Object Discovery 2. Misclassified Exploitation 3. Boundary Exploitation The user interests are captured by range queries, i.e., relevant objects are clustered in one or more areas in the data space.

Space Exploration Overview 1. Relevant Object Discovery Output: A sample from diverse data areas for reviewing by the user. Steps: Each normalized attribute domain is divided into equal width ranges. Grid cells One data object is retrieved from each non empty cells (near the center of the cell). Sample After the review: A Classifier is created/ updated

Space Exploration Overview 2. Misclassified Exploitation "Irrelevant" area Classifier Irrelevant areas Relevant areas! may contain false negative objects!

Space Exploration Overview 2. Misclassified Exploitation Aim: Examine further «irrelevant» area(s) "Irrelevant" area Classifier Irrelevant areas Relevant areas! may contain false negative objects!

Space Exploration Overview 2. Misclassified Exploitation Aim: Examine further «irrelevant» area(s) In a lower exploration level Classifier Irrelevant areas Relevant areas! may contain false negative objects!

Space Exploration Overview 2. Misclassified Exploitation Aim: Examine further «irrelevant» area(s) In a lower exploration level Classifier Relevant areas Irrelevant areas Relevant areas Next iteration: the classifier is updated and includes new relevant areas

Space Exploration Overview 3. Boundary Exploitation Classifier Irrelevant areas Relevant areas Need to redefine boundaries (remove false positives) Varying the sampling area