A Data Explorer System and Rulesets of Table Functions


Kunihiko KANEKO a*, Ashir AHMED b*, Seddiq ALABBASI c*

* Department of Advanced Information Technology, Kyushu University, Motooka 744, Fukuoka-Shi, Japan
a E-mail address: kaneko@ait.kyushu-u.ac.jp
b E-mail address: ashir@ait.kyushu-u.ac.jp
c E-mail address: seddiq@f.ait.kyushu-u.ac.jp

ABSTRACT

In this paper, we present a data analysis and visualization system named "Data Explorer". The system reads a data table and produces analysis and visualization results interactively. The system includes many types of table functions, and each table function has its own number and types of options. The problem we tackle is that the option values of the table functions are difficult to set, so users are likely to make mistakes in them. To solve the problem, we propose a ruleset that determines a candidate set of option values for the table functions. The data-description data (i.e. metadata) of the data table is used to determine the candidate set. We also use the metadata to decide whether a table function is applicable. The feasibility of the idea is evaluated using two datasets: the iris dataset and a hospital dataset.

Keywords: data analysis, visualization, relational database, metadata.

1. INTRODUCTION

Recently, more and more data are being collected. They may be measurement data from sensors, or they may be collected using Web forms. If all elements of a dataset are data records of the same record type (a data record is a list of fields, and each field has an attribute name), the dataset is a data table. If there are many types of records, there will be a different data table for each record type.

2. DATA ANALYSIS AND VISUALIZATION

A data table is a set of data records. Figure 1 depicts an example data table, the Edgar Anderson's Iris dataset [1]. The dataset contains the measured sepal length, sepal width, petal length, and petal width for 50 flowers from each of three species of iris.
The line at the top of the data table is called the header. A header is the list of attribute names of a data table. In Figure 1, there are five attribute names: (Sepal_Length, Sepal_Width, Petal_Length, Petal_Width, Species). Each row of a data table is a data record. In Figure 1, each data record consists of four measurement values and a species name.

Fig. 1: Data table example. This illustrates the header and top 6 lines of the Edgar Anderson's Iris dataset [1].

Given one data table, we can define the several types of table functions described below.

Table operation
Table operations produce a new data table Y from a data table X. The numbers of rows and columns of Y are the same as those of X. The record type of each row of Y is equal to the record type of each row of X. The attribute names of Y may differ from those of X.

Analysis function
Analysis functions produce a new data table Y from a data table X. The record type of each row of Y is not equal to the record type of X. The attribute names of Y may differ from those of X.

Displaying function
Displaying functions display a result using a data table.

The R system [2] is free, open-source software for statistical analysis and visualization. There are more than 2,000 packages for the R system, and the packages include many types of table functions. We implemented our data explorer system using the R system.

Fig. 2 depicts the data explorer system. In the system, there is one chain from the source data to the final result, and the results are provided to the end-users by displaying functions. The internal nodes of the chain are intermediate results. The final result can be represented by the expression below:

op_n(op_{n-1}( ... op_1(X, <option values>) ... ))

Here, X is the source data table, and op_i represents a table operation or an analysis function. Table 1 lists the implementation of the table functions using the R system. The input of a table function contains one data table and, optionally, attribute names as string values and other numeric values or condition expressions. Fig. 3 and Fig. 4 depict examples of the table functions.

Fig. 2: Data Explorer. [Diagram: Source Data (Data Table) -> Functions for data table, with intermediate data tables (Data Explorer System) -> final result -> Data Analysis and Visualization Results (tabular form, scatter plot diagram)]

Table 1: Data Operations. The variable X is a data table. The variables A, Ai, Aj and Ak are attribute names. The variable Alist is a list of attribute names. The variable cond is a condition expression. Each entry gives the function, its name and parameters, and its R system implementation (R source code). All sqldf calls assume that library(sqldf) has been loaded.

(a) Table operations

Principal component analysis (PCA): pca(X)
    library(stringr)
    pc <- princomp(X, cor=TRUE)
    Y <- as.data.frame(data.matrix(X) %*% unclass(loadings(pc)))
    names(Y) <- str_replace_all(names(Y), "[.]", "_")

(b) Analysis functions

Selection of rows: selection(X, cond)
    sqldf(paste("select * from X where", cond, ";"))

Projection of columns: projection(X, Alist)
    sqldf(paste("select", Alist, "from X;"))

Frequency table: frequency(X, A)
    R <- sqldf(paste("select", A, "from X;"))
    data.frame(table(R))

Cross table: cross_table(X, Ai, Aj)
    R <- sqldf(paste("select", Ai, ",", Aj, "from X;"))
    return(table(R))

(c) Displaying functions

Display first part: head(X, lines)
    head(X, n=lines)

Two-dimensional scatter plot: plot2d(X, Ai, Aj)
    library(ggplot2)
    R <- sqldf(paste("select", Ai, "as x,", Aj, "as y from X;"))
    ggplot(R, aes(x=x, y=y)) + geom_point() + xlab(Ai) + ylab(Aj)

Three-dimensional scatter plot: plot3d(X, Ai, Aj, Ak)
    library(scatterplot3d)
    R <- sqldf(paste("select", Ai, "as x,", Aj, "as y,", Ak, "as z from X;"))
    scatterplot3d(R)

Histogram: histogram(X, A)
    R <- sqldf(paste("select", A, "from X;"))
    hist(as.matrix(R))

Two-dimensional histogram: histogram2d(X, Ai, Aj, n)
    library(gplots)  # provides hist2d; the original loads the older gregmisc bundle
    R <- sqldf(paste("select", Ai, "as x,", Aj, "as y from X;"))
    h <- hist2d(x=R$x, y=R$y, nbins=c(n,n))
    persp(h$x, h$y, h$counts, shade=0.2)

Cluster dendrogram: cluster(X, m)
    plot(hclust(dist(X), method=m))

Gaussian Mixture Model (GMM) classification: GMM_classification(X, Ai, Aj)
    library(mclust)
    R <- sqldf(paste("select", Ai, ",", Aj, "from X;"))
    plot(Mclust(R), what="classification")

Conditional Inference Tree: plot_ctree(X, A)
    library(party)
    ct <- ctree(formula(paste(A, "~ .")), data=X)
    plot(ct)


(a) Two-dimensional plot of PCA
    plot2d(pca(projection(iris, "Sepal_Length, Sepal_Width, Petal_Length, Petal_Width")), "Comp_1", "Comp_2")

(b) Three-dimensional plot of PCA
    plot3d(pca(projection(iris, "Sepal_Length, Sepal_Width, Petal_Length, Petal_Width")), "Comp_1", "Comp_2", "Comp_3")

(c) Histogram
    histogram(iris, "Sepal_Length")

(d) Two-dimensional histogram
    histogram2d(iris, "Sepal_Length", "Sepal_Width", 10)

(e) Cluster dendrogram produced by the Ward clustering [3]
    cluster(projection(iris, "Sepal_Length, Sepal_Width"), "ward")

(f) Gaussian Mixture Model (GMM) classification
    GMM_classification(iris, "Sepal_Length", "Sepal_Width")

(g) Conditional Inference Tree
    plot_ctree(iris, "Petal_Width")

Fig. 3: Commands by Users and Results produced by the Data Explorer. The variable iris is the Edgar Anderson's Iris dataset.
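The nested commands in Fig. 3 are instances of the chain op_n(... op_1(X, <option values>) ...). The following is a minimal sketch of such a chain in base R only. It assumes simplified stand-ins for projection and pca (no sqldf or stringr), and apply_chain is a hypothetical helper that does not appear in the paper.

```r
# Sketch of a function chain over a data table, using base R only.
# projection() and pca() are simplified stand-ins for the Table 1 functions;
# apply_chain() is a hypothetical helper introduced for illustration.

# The built-in iris uses names such as "Sepal.Length"; rename to the paper's style.
X <- iris
names(X) <- gsub("[.]", "_", names(X))

# Projection of columns: Alist is a comma-separated string of attribute names.
projection <- function(X, Alist) {
  cols <- trimws(strsplit(Alist, ",")[[1]])
  X[, cols, drop = FALSE]
}

# PCA as a table operation: same numbers of rows and columns, new attribute names.
pca <- function(X) {
  pc <- princomp(X, cor = TRUE)
  Y <- as.data.frame(data.matrix(X) %*% unclass(loadings(pc)))
  names(Y) <- gsub("[.]", "_", names(Y))
  Y
}

# Apply the steps left to right: op_1 first, op_n last.
apply_chain <- function(X, steps) {
  Reduce(function(tbl, op) op(tbl), steps, X)
}

steps <- list(
  function(tbl) projection(tbl, "Sepal_Length, Sepal_Width, Petal_Length, Petal_Width"),
  pca
)
Y <- apply_chain(X, steps)
print(names(Y))
print(dim(Y))
```

Each step takes and returns a data table, so a displaying function can be applied to the final result Y or to any intermediate table.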

(a) Selection of rows:
    head(selection(iris, "Species='setosa'"), n=6)

(b) Projection of columns:
    head(projection(iris, "Sepal_Length, Sepal_Width, Petal_Length, Petal_Width"), n=6)

(c) Frequency table:
    head(frequency(iris, "Sepal_Length"), n=6)

(d) Cross table:
    head(cross_table(iris, "Sepal_Length", "Sepal_Width"), n=6)

Fig. 4: Analysis Function Examples. The variable iris is the Edgar Anderson's Iris dataset.

3. PROBLEM DEFINITION AND IMPLEMENTATION DETAILS

3.1 Relational Database Interface

The data explorer system reads a table from a relational database system. We employ the SQLite3 relational database management system to manage the relational database. The source code to open an SQLite3 database named mydb from the R system is below.

    library(RSQLite)
    drv <- dbDriver("SQLite", max.con = 1)
    conn <- dbConnect(drv, dbname="mydb")

The source code to read a table named T from the database into the variable X in the R system is below.

    res <- dbSendQuery(conn, "SELECT * from T;")
    X <- fetch(res, n=-1)

3.2 Metadata

Metadata is a data table that stores the information about record types. For example, the iris dataset has one record type, and the record has five attributes. Table 2 depicts the metadata of the Edgar Anderson's Iris dataset.

The attribute name is string valued. The attribute type takes a value in the set {integer, real, text, boolean, date, time, datetime}.

The attribute is_ordered is either TRUE or FALSE. If the value of type is integer, real, date, time or datetime, then the value of is_ordered is TRUE. If the value of type is text and the text value represents an order (for example "A", "B", "C", and "D"), then the value of is_ordered is TRUE. Otherwise the value of is_ordered is FALSE.

The attribute is_categorical is either TRUE or FALSE. It is TRUE if the value of type is text and the value represents a category.

3.3 Ruleset of table functions

Each table function has its own rules. For example, the table operation pca has the following two rules.

1. All values of the input data table must be ordered.
2. All values of the output data table are ordered.

Table 3 depicts the rulesets of the table functions in Table 1.

3.4 Candidate set of attribute names in options

We define the following rules for the attribute names that can be specified as option values in table functions.

1. For a table function for which "all values of the input data table must be ordered" is FALSE: an arbitrary attribute name of the input table can be specified as the option value of the table function. The system can suggest the list of attribute names using the metadata of the input data table.
2. For a table function for which "all values of the input data table must be ordered" is TRUE: if the input data table contains a "non-ordered" attribute, the table function cannot be evaluated. The system can suggest that users apply the projection operation to eliminate all the non-ordered attributes from the input table.

The system estimates the types of intermediate data tables using Table 3.
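The two candidate-set rules can be sketched as a lookup over the metadata. The following is only an illustration under stated assumptions: the metadata rows follow Table 2 with FALSE assumed wherever no TRUE is shown, the requires_ordered vector covers three functions only (the FALSE for frequency is an assumption), and suggest_options is a hypothetical helper, not the system's actual code.

```r
# Metadata of the iris table (after Table 2). FALSE is assumed wherever the
# table shows no TRUE.
meta <- data.frame(
  name = c("Sepal_Length", "Sepal_Width", "Petal_Length", "Petal_Width", "Species"),
  type = c("real", "real", "real", "real", "text"),
  is_ordered = c(TRUE, TRUE, TRUE, TRUE, FALSE),
  is_categorical = c(FALSE, FALSE, FALSE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

# Part of the ruleset: must all values of the input data table be ordered?
# The FALSE entry for frequency is an assumption made for this sketch.
requires_ordered <- c(pca = TRUE, GMM_classification = TRUE, frequency = FALSE)

# Hypothetical helper: returns whether the function can be evaluated, the
# candidate attribute names, and a projection hint when it cannot.
suggest_options <- function(fname, meta) {
  non_ordered <- meta$name[!meta$is_ordered]
  if (!requires_ordered[[fname]] || length(non_ordered) == 0) {
    # Rule 1: any attribute name of the input table is a candidate.
    return(list(applicable = TRUE, candidates = meta$name, hint = NA_character_))
  }
  # Rule 2: suggest a projection that drops the non-ordered attributes.
  keep <- paste(meta$name[meta$is_ordered], collapse = ", ")
  list(applicable = FALSE,
       candidates = character(0),
       hint = sprintf('projection(X, "%s")', keep))
}

print(suggest_options("frequency", meta)$candidates)
print(suggest_options("pca", meta)$hint)
```

For the iris metadata, pca is reported as not directly applicable because Species is non-ordered, and the hint is the same projection that the commands in Fig. 3 apply before pca.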

Table 2: The metadata of the Edgar Anderson's Iris dataset.

name          type  is_ordered  is_categorical
Sepal_Length  real  TRUE        FALSE
Sepal_Width   real  TRUE        FALSE
Petal_Length  real  TRUE        FALSE
Petal_Width   real  TRUE        FALSE
Species       text  FALSE       TRUE

Table 3: The ruleset of the table functions. *1: if all values of the input data tab...

function name       All values of the input data  All values of the output data
                    table must be ordered         table are all ordered
pca                 TRUE                          TRUE
selection           FALSE                         *1
projection          FALSE                         *1
frequency           FALSE                         *1
cross_table         FALSE                         *1
head                FALSE
plot2d              FALSE
plot3d              FALSE
histogram           FALSE
histogram2d         FALSE
cluster             FALSE
GMM_classification  TRUE
plot_ctree          FALSE

Table 4: The metadata of the patient data table.

name         type      is_ordered  comment
id_patient   integer   FALSE       Unique patient ID
user_id      text      FALSE       Reg. no
first_name   text      FALSE       Patient First Name
middle_name  text      FALSE       Patient Middle Name
last_name    text      FALSE       Patient Last Name
address      text      FALSE       Patient Address
             text      FALSE       Patient (if any)
mobile_no    text      FALSE       Contact Number
password     text      FALSE       Password
sex          text      FALSE       Patient Sex, enum('male','female')
height       text      FALSE       Height when registered
weight       text      FALSE       Weight when registered
religion     text      FALSE       Patient Religion
birth_date   date      TRUE        Date of Birth
birth_place  text      FALSE       Place of Birth
reg_date     datetime  TRUE        Reg. Date
l_id         integer   FALSE       Site ID
last_login   datetime  TRUE        Last login date time
is_active    boolean   FALSE       Active or inactive
blood_group  text      FALSE       Blood Group
reg_issuer   text      FALSE       Operator who Registered
age          integer   TRUE        Patient Age

Table 5: The metadata of the prescription data table.

name                type      is_ordered  comment
prescription_id     integer   FALSE       Unique Checkup Id
patient_checkup_id  integer   FALSE       Checkup Date
prescription_body   text      FALSE       Blood Sugar
doctor_id           integer   FALSE       PBS/ FBS
prescription_date   datetime  TRUE        Blood Hemoglobin
symptoms            text      FALSE       Blood Pressure Systolic
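The default rule of Section 3.2 (is_ordered is TRUE for integer, real, date, time and datetime) can be sketched as a function of the type column. Table 4 does not mark identifier attributes such as id_patient as ordered, so this sketch assumes a manual override is allowed; default_is_ordered and its override argument are hypothetical names introduced for illustration.

```r
# Types whose values are ordered by default (Section 3.2).
ordered_types <- c("integer", "real", "date", "time", "datetime")

# Hypothetical helper: guess is_ordered from the type, unless an explicit
# override (TRUE or FALSE) is given for that attribute. NA means "no override".
default_is_ordered <- function(type, override = rep(NA, length(type))) {
  guess <- type %in% ordered_types
  ifelse(is.na(override), guess, override)
}

# A few rows taken from Table 4 (patient data table).
meta <- data.frame(
  name = c("id_patient", "birth_date", "sex", "age"),
  type = c("integer", "date", "text", "integer"),
  stringsAsFactors = FALSE
)

# id_patient is an identifier, so it is overridden to FALSE.
meta$is_ordered <- default_is_ordered(meta$type, override = c(FALSE, NA, NA, NA))
print(meta)
```

Text attributes whose values represent an order (for example "A", "B", "C", "D") would need the same kind of override to TRUE, since the type alone does not reveal it.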

4. EVALUATION

We are developing a hospital database system. It contains patient information, which is stored in data tables. The metadata of the patient and prescription data tables is defined as shown in Table 4 and Table 5.

5. CONCLUSION

In this paper, we presented a data explorer system. In the system, data flows from the source data to the data analysis and visualization results take the form of chains of functions for data tables. Each function can be implemented easily using the R system, because the R system already has many packages for data analysis and visualization. We have already collected patient information in the hospital database system. Statistical analysis of the hospital database is future work.

Acknowledgment

This work was partially executed under the consignment agreement with NEDO.

References

[1] Becker, R. A., Chambers, J. M. and Wilks, A. R., "The New S Language," Wadsworth & Brooks/Cole.
[2]
[3] Ward, J. H. Jr., "Hierarchical Grouping to Optimize an Objective Function," Journal of the American Statistical Association, 58, pp , 1963.
