A Data Explorer System and Rulesets of Table Functions


Kunihiko KANEKO a*, Ashir AHMED b*, Seddiq ALABBASI c*

* Department of Advanced Information Technology, Kyushu University, Motooka 744, Fukuoka-Shi, Japan
a E-mail address: kaneko@ait.kyushu-u.ac.jp
b E-mail address: ashir@ait.kyushu-u.ac.jp
c E-mail address: seddiq@f.ait.kyushu-u.ac.jp

ABSTRACT

In this paper, we present a data analysis and visualization system named "Data Explorer". The system reads a data table and produces analysis and visualization results interactively. The system includes many types of table functions, and each table function has a different number and different types of options. The problem we tackle is the difficulty of setting the option values of the table functions: users are likely to make many mistakes in the option values. To solve the problem, we propose a ruleset that decides a candidate set of option values for the table functions. The data-description data (i.e., metadata) of the data table is employed to decide the candidate set. We also use the metadata to decide the applicability of table functions. The feasibility of the idea is evaluated using two datasets: the iris dataset and a hospital dataset.

Keywords: data analysis, visualization, relational database, metadata.

1. INTRODUCTION

Recently, more and more data are being collected. They may be measurement data from sensors, or they may be collected using Web forms. If all elements of a dataset are data records (i.e., each data element is a list of fields and each field has an attribute name) of the same record type, the dataset is a data table. If there are many types of records, there will be different kinds of data tables.

2. DATA ANALYSIS AND VISUALIZATION

A data table is a set of data records. Figure 1 depicts an example data table, the Edgar Anderson's Iris dataset [1]. The dataset gives the measurements of sepal length, sepal width, petal length, and petal width for 50 flowers from each of three species of iris.
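The Iris dataset described above ships with R as the built-in data frame iris, so its shape can be checked directly. The following is a minimal sketch, not part of the original system; the variable names X, header and species_counts are chosen here for illustration, and the renaming step assumes that the underscore attribute names used in this paper correspond to R's dotted column names (e.g. Sepal.Length).

```r
# Built-in iris data frame; R names its columns Sepal.Length, Sepal.Width, ...
X <- iris

# Assumption: rename to the underscore style used in this paper.
names(X) <- gsub(".", "_", names(X), fixed = TRUE)

# The attribute names of the data table.
header <- names(X)
print(header)

# The first six data records.
print(head(X, n = 6))

# Number of data records per species.
species_counts <- table(X$Species)
print(species_counts)
```

The underscore names matter later because attribute names are pasted into SQL statements, where a dot would be read as a table qualifier.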
The line added at the top of the data table is called the header. A header is the list of attribute names of a data table. In Figure 1, there are five attribute names: (Sepal_Length, Sepal_Width, Petal_Length, Petal_Width, Species). Each row of the data table is a data record. In Figure 1, each data record consists of measurement values and a species name.

Fig. 1: Data table example. This illustrates the header and the top 6 lines of the Edgar Anderson's Iris dataset [1].

Given one data table, we can define the several types of table functions described below.

Table operation
Table operations produce a new data table Y from a data table X. The numbers of rows and columns of Y are the same as those of X. The record type of each row of Y is equal to the record type of each row of X. The attribute names of Y may differ from those of X.

Analysis function
Analysis functions produce a new data table Y from a data table X. The record type of each row of Y is not equal to the record type of X. The attribute names of Y may differ from those of X.

Displaying function
Displaying functions display a result using a data table.

The R system [2] is open-source, free software for statistical analysis and visualization. There are more than 2,000 packages for the R system, and the packages include many types of table functions. We implemented our data explorer system using the R system. Fig. 2 depicts the data explorer system. In the system, there is one chain from the source data to the final result, and the results are provided to the end users by displaying functions. The internal nodes of the chains are intermediate results. The final result can be represented by an expression of the following form:

op_n(op_{n-1}( ... op_1(X, <option values>) ... ))

Here, X is the source data table, and op_i represents a table operation or an analysis function. Table 1 depicts an implementation of the table functions using the R system. The input of a table function contains one data table and, optionally, attribute names as string values and other numeric values or condition expressions. Fig. 3 and Fig. 4 depict examples of the table functions.

Fig. 2: Data Explorer. [Diagram: a source data table is processed by functions for data table inside the Data Explorer System, producing intermediate data tables and a final result; the data analysis and visualization results are shown in tabular form, as a scatter plot, or as a diagram.]

Table 1: Data Operations. The variable X is a data table. The variables A, A_i, A_j and A_k are attribute names. The variable Alist is a list of attribute names. The variable cond is a condition expression. Implementations that call sqldf assume that library(sqldf) has been loaded.

(a) Table operations

Principal component analysis (PCA): pca(X)
    library(stringr)
    pc <- princomp(X, cor=TRUE)
    Y <- as.data.frame(data.matrix(X) %*% unclass(loadings(pc)))
    names(Y) <- str_replace_all(names(Y), "[.]", "_")

(b) Analysis functions

Selection of rows: selection(X, cond)
    sqldf(paste("select * from X where", cond, ";"))

Projection of columns: projection(X, Alist)
    sqldf(paste("select", Alist, "from X;"))

Frequency table: frequency(X, A)
    R <- sqldf(paste("select", A, "from X;"))
    data.frame(table(R))

Cross table: cross_table(X, A_i, A_j)
    R <- sqldf(paste("select", Ai, ",", Aj, "from X;"))
    return(table(R))

(c) Displaying functions

Display first part: head(X, lines)
    head(X, n=lines)

Two-dimensional scatter plot: plot2d(X, A_i, A_j)
    library(ggplot2)
    R <- sqldf(paste("select", Ai, "as x,", Aj, "as y from X;"))
    ggplot(R, aes(x=x, y=y)) + geom_point() + xlab(Ai) + ylab(Aj)

Three-dimensional scatter plot: plot3d(X, A_i, A_j, A_k)
    library(scatterplot3d)
    R <- sqldf(paste("select", Ai, "as x,", Aj, "as y,", Ak, "as z from X;"))
    scatterplot3d(R)

Histogram: histogram(X, A)
    R <- sqldf(paste("select", A, "from X;"))
    hist(as.matrix(R))

Two-dimensional histogram: histogram2d(X, A_i, A_j, n)
    library(gregmisc)
    R <- sqldf(paste("select", Ai, "as x,", Aj, "as y from X;"))
    h <- hist2d(x=R$x, y=R$y, nbins=c(n,n))
    persp(h$x, h$y, h$counts, shade=0.2)

Cluster dendrogram: cluster(X, m)
    plot(hclust(dist(X), method=m))

Gaussian Mixture Model (GMM) classification: GMM_classification(X, A_i, A_j)
    library(mclust)
    R <- sqldf(paste("select", Ai, ",", Aj, "from X;"))
    plot(Mclust(R), what="classification")

Conditional Inference Tree: plot_ctree(X, A)
    library(party)
    ct <- ctree(formula(paste(A, "~ .")), data=X)
    plot(ct)
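The analysis functions of Table 1 are built on sqldf. For readers without that package, the same four functions can be approximated in base R. This is only a sketch under stated assumptions, not the authors' implementation: SQL is swapped for base R indexing, so cond is an R expression string (for example "Species == 'setosa'") rather than an SQL condition, and Alist is assumed to be a comma-separated string of attribute names.

```r
# Iris with underscore attribute names (assumed to match the paper's naming).
X <- iris
names(X) <- gsub(".", "_", names(X), fixed = TRUE)

# Selection of rows: keep the rows for which cond is TRUE.
# Assumption: cond is an R expression string, not an SQL condition.
selection <- function(X, cond) {
  keep <- eval(parse(text = cond), envir = X)
  X[keep, , drop = FALSE]
}

# Projection of columns: Alist is a comma-separated list of attribute names.
projection <- function(X, Alist) {
  cols <- trimws(strsplit(Alist, ",")[[1]])
  X[, cols, drop = FALSE]
}

# Frequency table of one attribute (columns Var1 and Freq).
# Note: this name masks stats::frequency; it is kept to match Table 1.
frequency <- function(X, A) {
  as.data.frame(table(X[[A]]))
}

# Cross table of two attributes.
cross_table <- function(X, Ai, Aj) {
  table(X[[Ai]], X[[Aj]])
}

setosa <- selection(X, "Species == 'setosa'")
sepal  <- projection(X, "Sepal_Length, Sepal_Width")
freq   <- frequency(X, "Species")
cross  <- cross_table(X, "Species", "Species")
print(head(setosa, n = 6))
print(freq)
```

Keeping the same function names and argument order as Table 1 means the chained expressions of the paper can be tried unchanged, apart from the condition syntax.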

(a) Two-dimensional plot of PCA
    plot2d(pca(projection(iris, "Sepal_Length, Sepal_Width, Petal_Length, Petal_Width")), "Comp_1", "Comp_2")

(b) Three-dimensional plot of PCA
    plot3d(pca(projection(iris, "Sepal_Length, Sepal_Width, Petal_Length, Petal_Width")), "Comp_1", "Comp_2", "Comp_3")

(c) Histogram
    histogram(iris, "Sepal_Length")

(d) Two-dimensional histogram
    histogram2d(iris, "Sepal_Length", "Sepal_Width", 10)

(e) Cluster dendrogram produced by the Ward clustering [3]
    cluster(projection(iris, "Sepal_Length, Sepal_Width"), "ward")

(f) Gaussian Mixture Model (GMM) classification
    GMM_classification(iris, "Sepal_Length", "Sepal_Width")

(g) Conditional Inference Tree
    plot_ctree(iris, "Petal_Width")

Fig. 3: Commands by Users and Results produced by the Data Explorer. The variable iris is the Edgar Anderson's Iris dataset.
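The commands in Fig. 3 are chains of table functions, where the output data table of one function is the input of the next. The sketch below reproduces the non-graphical part of command (a) in base R. It assumes a base R stand-in for projection (the paper uses sqldf) and replaces the stringr call in pca with gsub; the final displaying function is head instead of plot2d so that the result can be printed as text.

```r
# Iris with underscore attribute names (assumed to match the paper's naming).
X <- iris
names(X) <- gsub(".", "_", names(X), fixed = TRUE)

# Base R stand-in for the projection function of Table 1.
projection <- function(X, Alist) {
  cols <- trimws(strsplit(Alist, ",")[[1]])
  X[, cols, drop = FALSE]
}

# pca as in Table 1, with gsub in place of stringr::str_replace_all.
pca <- function(X) {
  pc <- princomp(X, cor = TRUE)
  Y <- as.data.frame(data.matrix(X) %*% unclass(loadings(pc)))
  names(Y) <- gsub(".", "_", names(Y), fixed = TRUE)
  Y
}

# Chain: projection removes the text attribute Species, pca transforms
# the remaining attributes, head displays the first part of the result.
result <- head(
  pca(projection(X, "Sepal_Length, Sepal_Width, Petal_Length, Petal_Width")),
  n = 6
)
print(result)
```

The projection step is required here because princomp accepts only numeric columns, which is the situation the ruleset of Section 3 is meant to detect.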

(a) Selection of rows:
    head(selection(iris, "Species='setosa'"), n=6)

(b) Projection of columns:
    head(projection(iris, "Sepal_Length, Sepal_Width, Petal_Length, Petal_Width"), n=6)

(c) Frequency table:
    head(frequency(iris, "sepal_length"), n=6)

(d) Cross table:
    head(cross_table(iris, "sepal_length", "sepal_width"), n=6)

Fig. 4: Analysis Function Examples. The variable iris is the Edgar Anderson's Iris dataset.

3. PROBLEM DEFINITION AND IMPLEMENTATION DETAILS

3.1 Relational Database Interface

The data explorer system reads a table from a relational database system. We employ the SQLite3 relational database management system to manage the relational database. The source code to open an SQLite3 database named mydb from the R system is below.

    library(RSQLite)
    drv <- dbDriver("SQLite", max.con = 1)
    conn <- dbConnect(drv, dbname="mydb")

The source code to read a table named T from the database into the variable X in the R system is below.

    c <- dbSendQuery(conn, "SELECT * from T;")
    X <- fetch(c, n=-1)

3.2 Metadata

Metadata is a data table that stores information about record types. For example, the iris dataset has one record type, and the record has five attributes. Table 2 depicts the metadata of the Edgar Anderson's Iris dataset. The attribute name is string valued. The attribute type takes a value in the set {integer, real, text, boolean, date, time, datetime}. The attribute is_ordered is either TRUE or FALSE. If the value of type is integer, real, date, time or datetime, then the value of is_ordered is TRUE. If the value of type is text and the text value represents an order (for example "A", "B", "C", and "D"), then the value of is_ordered is TRUE. Otherwise the value of is_ordered is FALSE. The attribute is_categorical is either TRUE or FALSE. It is TRUE if the value of type is text and the value represents a category.

3.3 Ruleset of table functions

Each table function has its own rules. For example, the table operation pca has the following two rules.

1. All values of the input data table must be ordered.
2. All values of the output data table are ordered.

Table 3 depicts the rulesets of the table functions in Table 1.

3.4 Candidate set of attribute names in options

We define the following rules for the attribute names that can be specified as option values of table functions.

1. For a table function for which "all values of the input data table must be ordered" is FALSE: an arbitrary attribute name of the input table can be specified as the option value of the table function. The system can suggest the list of attribute names using the metadata of the input data table.
2. For a table function for which "all values of the input data table must be ordered" is TRUE: if the input data table contains a non-ordered attribute, the table function cannot be evaluated. The system can suggest that users apply the projection operation to eliminate all the non-ordered attributes from the input table.

The system estimates the types of intermediate data tables using Table 3.
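The two rules above can be sketched as a small lookup over the metadata table. The code below is an illustration under assumptions, not the system's actual implementation: the function names candidate_attributes, is_applicable and suggest_projection are hypothetical, the FALSE entries of the metadata are inferred from the rules in Section 3.2, and the ruleset excerpt contains only pca (TRUE, as stated in Section 3.3) and head (assumed FALSE for illustration).

```r
# Metadata of the iris data table, following Table 2.
# Assumption: FALSE entries are inferred from the rules in Section 3.2.
metadata <- data.frame(
  name = c("Sepal_Length", "Sepal_Width", "Petal_Length", "Petal_Width", "Species"),
  type = c("real", "real", "real", "real", "text"),
  is_ordered = c(TRUE, TRUE, TRUE, TRUE, FALSE),
  is_categorical = c(FALSE, FALSE, FALSE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

# Ruleset excerpt: must all values of the input data table be ordered?
# pca = TRUE comes from Section 3.3; head = FALSE is an assumption.
requires_ordered_input <- c(pca = TRUE, head = FALSE)

# Rule 1: any attribute name of the input table is a candidate option value.
candidate_attributes <- function(meta) {
  meta$name
}

# Rule 2: a function that requires ordered input cannot be evaluated
# if the input data table contains a non-ordered attribute.
is_applicable <- function(fname, meta) {
  !requires_ordered_input[[fname]] || all(meta$is_ordered)
}

# Suggested option value for projection: keep only the ordered attributes.
suggest_projection <- function(meta) {
  paste(meta$name[meta$is_ordered], collapse = ", ")
}

print(candidate_attributes(metadata))
print(is_applicable("pca", metadata))
print(suggest_projection(metadata))
```

For the iris metadata, the suggested attribute list is exactly the projection option used before pca in Fig. 3, which is the kind of candidate value the ruleset is intended to produce.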

Table 2: The metadata of the Edgar Anderson's Iris dataset.

name           type   is_ordered   is_categorical
Sepal_Length   real   TRUE         FALSE
Sepal_Width    real   TRUE         FALSE
Petal_Length   real   TRUE         FALSE
Petal_Width    real   TRUE         FALSE
Species        text   FALSE        TRUE

Table 3: The ruleset of the table functions. Columns: function name; all values of the input data table must be ordered; all values of the output data table are all ordered. *1: if all values of the input data tab

pca TRUE TRUE
selection *1
projection *1
frequency *1
cross_table *1
head
plot2d
plot3d
histogram
histogram2d
cluster
GMM_classification TRUE
plot_ctree

Table 4: The metadata of the patient data table.

name          type       is_ordered   comment
id_patient    integer                 Unique patient ID
user_id       text                    Reg. no
first_name    text                    Patient First Name
middle_name   text                    Patient Middle Name
last_name     text                    Patient Last Name
address       text                    Patient Address
              text                    Patient (in any)
mobile_no     text                    Contact Number
password      text                    Password
sex           text                    Patient Sex, enum('male','female')
height        text                    Height when registered
weight        text                    Weight when registered
religion      text                    Patient Religion
birth_date    date       TRUE         Date of Birth
birth_place   text                    Place of Birth
reg_date      datetime   TRUE         Reg. Date
l_id          integer                 Site ID
last_login    datetime   TRUE         Last login date time
is_active     boolean                 Active or inactive
blood_group   text                    Blood Group
reg_issuer    text                    Operator who Registered
age           integer    TRUE         Patient Age

Table 5: The metadata of the prescription data table.

name                 type       is_ordered   comment
prescription_id      integer                 Unique Checkup Id
patient_checkup_id   integer                 Checkup Date
prescription_body    text                    Blood Sugar
doctor_id            integer                 PBS/ FBS
prescription_date    datetime   TRUE         Blood Hemoglobin
symptoms             text                    Blood Pressure Systolic

4. EVALUATION

We are developing a hospital database system. It contains patient information, which is stored in data tables. The metadata of the patient and prescription data tables is defined as shown in Table 4 and Table 5.

5. CONCLUSION

In this paper, we presented a data explorer system. In the system, data flows from the source data to the data analysis and visualization results take the form of chains of functions for data tables. Each function can be implemented easily using the R system because the R system already has many packages for data analysis and visualization. We have already collected patient information in the hospital database system. Statistical analysis of the hospital database is future work.

Acknowledgment

This work was partially executed under the consignment agreement with NEDO.

References

[1] Becker, R. A., Chambers, J. M. and Wilks, A. R., "The New S Language," Wadsworth & Brooks/Cole.
[2]
[3] Ward, J. H. Jr., "Hierarchical Grouping to Optimize an Objective Function," Journal of the American Statistical Association, 58, pp , 1963.


More information
