A Data Explorer System and Rulesets of Table Functions


Kunihiko KANEKO a*, Ashir AHMED b*, Seddiq ALABBASI c*
* Department of Advanced Information Technology, Kyushu University, Motooka 744, Fukuoka-Shi, 819-0395, Japan
a E-mail address: kaneko@ait.kyushu-u.ac.jp
b E-mail address: ashir@ait.kyushu-u.ac.jp
c E-mail address: seddiq@f.ait.kyushu-u.ac.jp

ABSTRACT

In this paper, we present a data analysis and visualization system named "Data Explorer". The system reads a data table and produces analysis and visualization results interactively. The system includes many types of table functions, and each table function has its own number and types of options. The problem we tackle is the difficulty of setting the option values of the table functions, where users are likely to make many mistakes. To solve the problem, we propose a ruleset that decides a candidate set of option values for each table function. The data-description data (i.e. metadata) of the data table is used to decide the candidate set. We also use the metadata to decide the applicability of table functions. The feasibility of the idea is evaluated using two datasets: the iris dataset and a hospital dataset.

Keywords: data analysis, visualization, relational database, metadata.

1. INTRODUCTION

Recently, more and more data are being collected. They may be measurement data from sensors, or they may be collected using Web forms. If all elements of a dataset are data records of the same record type (i.e. each data element is a list of fields and each field has an attribute name), the dataset is a data table. If there are several record types, there will be several kinds of data tables.

2. DATA ANALYSIS AND VISUALIZATION

A data table is a set of data records. Figure 1 depicts an example data table taken from Edgar Anderson's Iris dataset [1]. The dataset gives the measurement values of the sepal length, sepal width, petal length and petal width, respectively, for 50 flowers from each of three species of iris. The line at the top of the data table is called the header. A header is the list of attribute names of a data table. In Figure 1, there are five attribute names: (Sepal_Length, Sepal_Width, Petal_Length, Petal_Width, Species). Each row of the data table is a data record. In Figure 1, each data record consists of measurement values and a species name.

Fig. 1: Data table example. This illustrates the header and the top 6 lines of Edgar Anderson's Iris dataset [1].

Given one data table, we can define the several types of table functions described below.

Table operation
Table operations produce a new data table Y from a data table X. The numbers of rows and columns of Y are the same as those of X. The record type of each row of Y is equal to the record type of each row of X. The attribute names of Y may differ from those of X.

Analysis function
Analysis functions produce a new data table Y from a data table X. The record type of each row of Y need not be equal to the record type of X. The attribute names of Y may differ from those of X.

Displaying function
Displaying functions display a result using a data table.

The R system [2] is free, open-source software for statistical analysis and visualization. There are more than 2,000 packages for the R system, and the packages include many types of table functions. We implemented our data explorer system using the R system.

Fig. 2 depicts the data explorer system. In the system, there is one chain from the source data to the final result, and the results are provided to the end users using displaying functions. The internal nodes of the chain are intermediate results. The final result can be represented using an expression such as the one below:

op_n(op_n-1( ... op_1(X, <option values>) ... ))

Here, X is the source data table, and op_i represents a table operation or an analysis function.

Table 1 depicts an implementation of the table functions using the R system. The input of a table function contains one data table and, optionally, attribute names given as string values and other numeric values or condition expressions. Fig. 3 and Fig. 4 depict examples of the table functions.

Fig. 2: Data Explorer. [Diagram: a source data table is passed through functions for data tables inside the Data Explorer System, producing intermediate data tables and a final result (tabular form, scatter plot, diagram) as the data analysis and visualization results.]

Table 1: Data operations. The variable X is a data table. The variables A, Ai, Aj and Ak are attribute names. The variable Alist is a list of attribute names. The variable cond is a condition expression. Implementations that call sqldf require library(sqldf).

(a) Table operations

Principal component analysis (PCA): pca(X)
    library(stringr)
    pc <- princomp(X, cor=TRUE)
    Y <- as.data.frame(data.matrix(X) %*% unclass(loadings(pc)))
    names(Y) <- str_replace_all(names(Y),"[.]","_")

(b) Analysis functions

Selection of rows: selection(X,cond)
    sqldf(paste("select * from X where ",cond,";"))

Projection of columns: projection(X,Alist)
    sqldf(paste("select",Alist,"from X;"))

Frequency table: frequency(X,A)
    R <- sqldf(paste("select",A,"from X;"))
    data.frame(table(R))

Cross table: cross_table(X,Ai,Aj)
    R <- sqldf(paste("select",Ai,",",Aj," from X;"))
    return(table(R))

(c) Displaying functions

Display first part: head(X,lines)
    head(X, n=lines)

Two-dimensional scatter plot: plot2d(X,Ai,Aj)
    library(ggplot2)
    R <- sqldf(paste("select",Ai,"as x,",Aj,"as y from X;"))
    ggplot(R, aes(x=x, y=y)) + geom_point() + xlab(Ai) + ylab(Aj)

Three-dimensional scatter plot: plot3d(X,Ai,Aj,Ak)
    library(scatterplot3d)
    R <- sqldf(paste("select",Ai,"as x,",Aj,"as y, ",Ak,"as z from X;"))
    scatterplot3d(R)

Histogram: histogram(X,A)
    R <- sqldf(paste("select",A,"from X;"))
    hist(as.matrix(R))

Two-dimensional histogram: histogram2d(X,Ai,Aj,n)
    library(gregmisc)
    R <- sqldf(paste("select",Ai,"as x,",Aj,"as y from X;"))
    h <- hist2d(x=R$x,y=R$y,nbins=c(n,n))
    persp(h$x,h$y,h$counts,shade=0.2)

Cluster dendrogram: cluster(X,m)
    plot(hclust(dist(X), method=m))

Gaussian Mixture Model (GMM) classification: GMM_classification(X,Ai,Aj)
    library(mclust)
    R <- sqldf(paste("select",Ai,",",Aj,"from X;"))
    plot(Mclust(R), what="classification")

Conditional Inference Tree: plot_ctree(X,A)
    library(party)
    ct <- ctree(formula(paste(A," ~ .")),data=X)
    plot(ct)
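The chain expression op_n( ... op_1(X, <option values>) ... ) can be illustrated with a minimal sketch. This is not the authors' implementation: selection and projection are re-implemented here in base R without sqldf, so the condition is written in R syntax rather than SQL, and apply_chain is a hypothetical helper introduced only for this illustration.

```r
# Sketch of the chain op_n( ... op_1(X, <option values>) ... ) in base R.
# Assumption: selection/projection are base-R stand-ins for the sqldf
# versions in Table 1; apply_chain is a hypothetical helper.

# Built-in iris data, with columns renamed to the underscore style of Fig. 1.
X <- iris
names(X) <- gsub("[.]", "_", names(X))

# selection: keep the rows that satisfy a condition given as a string.
selection <- function(X, cond) {
  X[eval(parse(text = cond), envir = X), , drop = FALSE]
}

# projection: keep only the listed columns.
projection <- function(X, alist) {
  X[, alist, drop = FALSE]
}

# apply_chain: apply a list of one-argument steps to X, in order.
apply_chain <- function(X, steps) {
  Reduce(function(tbl, step) step(tbl), steps, X)
}

result <- apply_chain(X, list(
  function(tbl) selection(tbl, "Species == 'setosa'"),
  function(tbl) projection(tbl, c("Sepal_Length", "Sepal_Width")),
  function(tbl) head(tbl, n = 6)
))
print(result)
```

Each element of the list plays the role of one op_i, and every intermediate value passed along by Reduce corresponds to an intermediate data table in Fig. 2.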

(a) Two-dimensional plot of PCA:
    plot2d(pca(projection(iris, "Sepal_Length, Sepal_Width, Petal_Length, Petal_Width")),"Comp_1", "Comp_2")
(b) Three-dimensional plot of PCA:
    plot3d(pca(projection(iris, "Sepal_Length, Sepal_Width, Petal_Length, Petal_Width")),"Comp_1", "Comp_2", "Comp_3")
(c) Histogram:
    histogram(iris, "Sepal_Length")
(d) Two-dimensional histogram:
    histogram2d(iris, "Sepal_Length", "Sepal_Width", 10)
(e) Cluster dendrogram produced by the Ward clustering [3]:
    cluster(projection(iris, "Sepal_Length, Sepal_Width"), "ward")
(f) Gaussian Mixture Model (GMM) classification:
    GMM_classification(iris, "Sepal_Length", "Sepal_Width")
(g) Conditional Inference Tree:
    plot_ctree(iris, "Petal_Width")

Fig. 3: Commands by users and results produced by the Data Explorer. The variable iris is Edgar Anderson's Iris dataset.
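The pca operation used in panels (a) and (b) can be tried on its own. The sketch below follows the Table 1 code but substitutes base R's gsub for stringr, and it uses R's built-in iris data with renamed columns in place of the database table; both substitutions are assumptions made so that the example runs without extra packages.

```r
# Sketch of the pca table operation from Table 1, base R only.
# Assumption: gsub replaces stringr::str_replace_all, and the built-in
# iris data stands in for the table read from the database.

X <- iris[, 1:4]                          # the four numeric attributes
names(X) <- gsub("[.]", "_", names(X))    # Sepal.Length -> Sepal_Length

pca <- function(X) {
  pc <- princomp(X, cor = TRUE)
  # Project the raw data onto the loadings, as the Table 1 code does.
  Y <- as.data.frame(data.matrix(X) %*% unclass(loadings(pc)))
  # princomp names the components Comp.1, Comp.2, ...; use underscores.
  names(Y) <- gsub("[.]", "_", names(Y))
  Y
}

Y <- pca(X)
print(head(Y, n = 6))
```

The output has the same number of rows and columns as the input, which is what makes pca a table operation in the sense of Section 2, and its attribute names are the Comp_1, Comp_2, ... names used in the commands of Fig. 3.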

(a) Selection of rows:
    head(selection(iris, "Species='setosa'"),n=6)
(b) Projection of columns:
    head(projection(iris, "Sepal_Length, Sepal_Width, Petal_Length, Petal_Width"),n=6)
(c) Frequency table:
    head(frequency(iris,"Sepal_Length"),n=6)
(d) Cross table:
    head(cross_table(iris,"Sepal_Length","Sepal_Width"),n=6)

Fig. 4: Analysis function examples. The variable iris is Edgar Anderson's Iris dataset.

3. PROBLEM DEFINITION AND IMPLEMENTATION DETAILS

3.1 Relational Database Interface

The data explorer system reads a table from a relational database system. We employ the SQLite3 relational database management system to manage the relational database. The source code to open an SQLite3 database named mydb from the R system is below.

    library(RSQLite)
    drv <- dbDriver("SQLite", max.con = 1)
    conn <- dbConnect(drv, dbname="mydb")

The source code to read a table named T from the database into the variable X in the R system is below.

    c <- dbSendQuery(conn, "SELECT * from T;")
    X <- fetch(c, n=-1)

3.2 Metadata

Metadata is a data table that stores information about record types. For example, the iris dataset has one record type, and the record has five attributes. Table 2 depicts the metadata of Edgar Anderson's Iris dataset.

The attribute name is string valued. The attribute type has a value in the set {integer, real, text, boolean, date, time, datetime}.

The attribute is_ordered is either TRUE or FALSE. If the value of type is integer, real, date, time or datetime, then the value of is_ordered is TRUE. If the value of type is text and the text value represents an order (for example "A", "B", "C", and "D"), then the value of is_ordered is TRUE. Otherwise the value of is_ordered is FALSE.

The attribute is_categorical is either TRUE or FALSE. It is TRUE if the value of type is text and the value represents a category.

3.3 Ruleset of table functions

Each table function has its own rules. For example, the table operation pca has the following two rules.

1. All values of the input data table must be ordered.
2. All values of the output data table are ordered.

Table 3 depicts the rulesets of the table functions in Table 1.

3.4 Candidate set of attribute names in options

We define the following rules for the attribute names that can be specified as option values in table functions.

1. For a table function for which "all values of the input data table must be ordered" is FALSE: an arbitrary attribute name of the input table can be specified as the option value of the table function. The system can suggest the list of attribute names using the metadata of the input data table.
2. For a table function for which "all values of the input data table must be ordered" is TRUE: if the input data table contains a non-ordered attribute, the table function cannot be evaluated. The system can suggest that users apply the projection operation to eliminate all the non-ordered attributes from the input table.

The system estimates the types of intermediate data tables using Table 3.
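The two rules of Section 3.4 can be sketched as a small lookup over the metadata. This is an illustration under stated assumptions, not the system's actual code: suggest_options and requires_ordered_input are hypothetical names, the TRUE entry for pca comes from Section 3.3, and the FALSE entry for histogram is assumed for the sake of the example.

```r
# Sketch of the candidate-set rules in Section 3.4.
# Assumption: suggest_options and requires_ordered_input are hypothetical;
# pca = TRUE follows Section 3.3, histogram = FALSE is assumed.

# Metadata of the iris table (name, type, is_ordered), following Table 2.
metadata <- data.frame(
  name = c("Sepal_Length", "Sepal_Width", "Petal_Length",
           "Petal_Width", "Species"),
  type = c("real", "real", "real", "real", "text"),
  is_ordered = c(TRUE, TRUE, TRUE, TRUE, FALSE),
  stringsAsFactors = FALSE
)

# Rule "all values of the input data table must be ordered", per function.
requires_ordered_input <- c(pca = TRUE, histogram = FALSE)

suggest_options <- function(fname, metadata) {
  non_ordered <- metadata$name[!metadata$is_ordered]
  if (requires_ordered_input[[fname]] && length(non_ordered) > 0) {
    # Rule 2: not applicable; advise a projection onto ordered attributes.
    ordered <- metadata$name[metadata$is_ordered]
    advice <- paste0("projection(X, \"",
                     paste(ordered, collapse = ", "), "\")")
    return(list(applicable = FALSE, candidates = character(0),
                advice = advice))
  }
  # Rule 1 (or rule 2 with an all-ordered input): offer every attribute.
  list(applicable = TRUE, candidates = metadata$name,
       advice = NA_character_)
}

print(suggest_options("histogram", metadata))
print(suggest_options("pca", metadata))
```

For pca on the full iris table the sketch reports that the function is not applicable and proposes the same projection that appears in the commands of Fig. 3.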

Table 2: The metadata of Edgar Anderson's Iris dataset.

name           type   is_ordered   is_categorical
Sepal_Length   real   TRUE         FALSE
Sepal_Width    real   TRUE         FALSE
Petal_Length   real   TRUE         FALSE
Petal_Width    real   TRUE         FALSE
Species        text   FALSE        TRUE

Table 3: The ruleset of the table functions. *1: if all values of the input data tab

function name        All values of the input data    All values of the output data
                     table must be ordered           table are all ordered
pca                  TRUE                            TRUE
selection                                            *1
projection                                           *1
frequency                                            *1
cross_table                                          *1
head
plot2d
plot3d
histogram
histogram2d
cluster
GMM_classification   TRUE
plot_ctree

Table 4: The metadata of the patient data table.

name          type       is_ordered   comment
id_patient    integer                 Unique patient ID
user_id       text                    Reg. no
first_name    text                    Patient First Name
middle_name   text                    Patient Middle Name
last_name     text                    Patient Last Name
address       text                    Patient Address
email         text                    Patient email (if any)
mobile_no     text                    Contact Number
password      text                    Password
sex           text                    Patient Sex, enum('male','female')
height        text                    Height when registered
weight        text                    Weight when registered
religion      text                    Patient Religion
birth_date    date       TRUE         Date of Birth
birth_place   text                    Place of Birth
reg_date      datetime   TRUE         Reg. Date
l_id          integer                 Site ID
last_login    datetime   TRUE         Last login date time
is_active     boolean                 Active or inactive
blood_group   text                    Blood Group
reg_issuer    text                    Operator who Registered
age           integer    TRUE         Patient Age

Table 5: The metadata of the prescription data table.

name                 type       is_ordered   comment
prescription_id      integer                 Unique Checkup Id
patient_checkup_id   integer                 Checkup Date
prescription_body    text                    Blood Sugar
doctor_id            integer                 PBS/ FBS
prescription_date    datetime   TRUE         Blood Hemoglobin
symptoms             text                    Blood Pressure Systolic
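A metadata table in the style of Table 2 could be drafted automatically from an R data frame and then corrected by hand. The sketch below is one possible way to do that, not something the paper describes: infer_metadata is a hypothetical helper, and the mapping from R column classes to the attribute types of Section 3.2 (factor and character to text, ordered factor to ordered text, factor to categorical) is an assumption.

```r
# Sketch: draft Table 2 style metadata from an R data frame.
# Assumption: infer_metadata and the class-to-type mapping are hypothetical.

infer_metadata <- function(X) {
  type_of <- function(col) {
    if (is.factor(col) || is.character(col)) "text"
    else if (is.logical(col)) "boolean"
    else if (inherits(col, "Date")) "date"
    else if (inherits(col, "POSIXt")) "datetime"
    else if (is.integer(col)) "integer"
    else if (is.numeric(col)) "real"
    else NA_character_
  }
  types <- unname(vapply(X, type_of, character(1)))
  ordered_types <- c("integer", "real", "date", "time", "datetime")
  data.frame(
    name = names(X),
    type = types,
    # Ordered if the type is ordered, or the column is an ordered factor.
    is_ordered = types %in% ordered_types |
      unname(vapply(X, is.ordered, logical(1))),
    # Factors are treated as categorical text.
    is_categorical = unname(vapply(X, is.factor, logical(1))),
    stringsAsFactors = FALSE
  )
}

X <- iris
names(X) <- gsub("[.]", "_", names(X))
m <- infer_metadata(X)
print(m)
```

Columns such as id_patient in Table 4 show why a manual pass would still be needed: an identifier is stored as an integer, but whether it should count as ordered is a modelling decision that column classes alone cannot settle.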

4. EVALUATION

We are developing a hospital database system. It contains patient information, which is stored in data tables. The metadata of the patient data table and the prescription data table is defined as shown in Table 4 and Table 5.

5. CONCLUSION

In this paper, we presented a data explorer system. In the system, the data flows from a source data table to the data analysis and visualization results take the form of chains of functions for data tables. Each function can be implemented easily using the R system because the R system already has many packages for data analysis and visualization. We have already collected patient information in the hospital database system. Statistical analysis of the hospital database is future work.

Acknowledgment

This work was partially executed under the consignment agreement with NEDO.

References

[1] Becker, R. A., Chambers, J. M. and Wilks, A. R., "The New S Language," Wadsworth & Brooks/Cole, 1988.
[2] http://www.r-project.org/index.html
[3] Ward, J. H. Jr., "Hierarchical Grouping to Optimize an Objective Function," Journal of the American Statistical Association, 58, pp. 236-244, 1963.