Beer, Farms, and Peas

1 Beer, Farms, and Peas
Applied Statistics &
Master's student Lee Sang-yeol (석사과정 이상열)
Sir Francis Galton, Karl Pearson, William Sealy Gosset, Sir Ronald Fisher

2 Descriptive statistics is the discipline of quantitatively describing the main features of a collection of data, or the quantitative description itself. Descriptive statistics are distinguished from inferential statistics (Wikipedia).
1. Central tendency (arithmetic mean, median, mode, geometric mean)
2. Dispersion (sample standard deviation, range, variance, mean difference)
3. Minimum and maximum values
4. Kurtosis, skewness
R graphics: low-level graphics functions, high-level graphics functions (R packages ggplot2, igraph)
FF / Bigmemory: packages for processing large data
Contents: 01 Challenge, 02 Descriptive statistics, 03 R graphics, 04 FF / Bigmemory
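The four groups of measures listed above can be sketched in base R as follows. The sample vector is made up for illustration, and the skewness and kurtosis lines use the simple moment-based definitions (an assumption; other definitions with bias corrections exist).

```r
# Hypothetical sample; any numeric vector works.
x <- c(2, 4, 4, 5, 7, 9, 12)

# 1. Central tendency
mean_x   <- mean(x)
median_x <- median(x)
mode_x   <- as.numeric(names(which.max(table(x))))  # most frequent value
geo_mean <- exp(mean(log(x)))                        # requires x > 0

# 2. Dispersion
sd_x    <- sd(x)           # sample standard deviation (n - 1 denominator)
var_x   <- var(x)
range_x <- diff(range(x))  # max - min

# 3. Minimum and maximum values
min_x <- min(x)
max_x <- max(x)

# 4. Shape: moment-based skewness and kurtosis (base R has no built-in for these)
m2       <- mean((x - mean_x)^2)
skewness <- mean((x - mean_x)^3) / m2^1.5
kurtosis <- mean((x - mean_x)^4) / m2^2
```

A positive `skewness` indicates a longer right tail, which is what this sample has because of the value 12.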

3 Challenge

4 Pareto distribution
Generate 51 random numbers. Create a histogram of these random numbers.
    hist(rnorm(51, , ))
    hist(rpareto(51, 1, 1))

5 Random numbers: Pareto distribution code
    # CDF
    ppareto <- function(x, scale, shape) {
      ifelse(x > scale, 1 - (scale / x) ^ shape, 0)
    }
    # Inverse CDF (quantile function)
    qpareto <- function(y, scale, shape) {
      ifelse(y >= 0 & y <= 1, scale * ((1 - y) ^ (-1 / shape)), NaN)
    }
    # Random numbers: draw uniforms between the CDF values of the bounds, then apply the inverse CDF
    rpareto <- function(n, scale, shape, lower_bound = scale, upper_bound = Inf) {
      quantiles <- ppareto(c(lower_bound, upper_bound), scale, shape)
      uniform_random_numbers <- runif(n, quantiles[1], quantiles[2])
      qpareto(uniform_random_numbers, scale, shape)
    }
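The generator above is an instance of inverse-transform sampling: uniform draws are pushed through the inverse CDF. A minimal sanity check, assuming the three functions exactly as defined on the slide (repeated here so the block runs on its own), compares the empirical and theoretical CDF at one point. The seed, sample size, and parameter values are arbitrary choices for illustration.

```r
ppareto <- function(x, scale, shape) {
  ifelse(x > scale, 1 - (scale / x) ^ shape, 0)
}
qpareto <- function(y, scale, shape) {
  ifelse(y >= 0 & y <= 1, scale * ((1 - y) ^ (-1 / shape)), NaN)
}
rpareto <- function(n, scale, shape, lower_bound = scale, upper_bound = Inf) {
  quantiles <- ppareto(c(lower_bound, upper_bound), scale, shape)
  uniform_random_numbers <- runif(n, quantiles[1], quantiles[2])
  qpareto(uniform_random_numbers, scale, shape)
}

set.seed(1)  # arbitrary seed, only for reproducibility
x <- rpareto(10000, scale = 1, shape = 2)

# The quantile function should invert the CDF.
p <- c(0.1, 0.5, 0.9)
round_trip <- ppareto(qpareto(p, 1, 2), 1, 2)

# Empirical vs. theoretical probability of X <= 2.
empirical   <- mean(x <= 2)
theoretical <- ppareto(2, scale = 1, shape = 2)  # 1 - (1/2)^2 = 0.75

# Truncated sampling: every draw stays between the two bounds.
x_trunc <- rpareto(1000, scale = 1, shape = 2, lower_bound = 2, upper_bound = 4)
```

If the sampler is correct, `empirical` should sit close to `theoretical`, and `range(x_trunc)` should fall within 2 and 4.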

6 Descriptive statistics

7 1. Data acquisition
Seoul Open Data Plaza (서울열린데이터광장): daily necessities prices (생필품가격)
2. Descriptive statistics
    pp <- read.csv("c:/users/user/desktop/korea/2013년 2학기/datascience/발표1/생필품가격.csv", header=TRUE)
    dim(pp)
    [1]
    # Split price (가격) by item name (품목.이름), year-month (년도.월), and market type (시장유형)
    ppta <- split(pp$가격, list(pp$품목.이름, pp$년도.월, pp$시장유형.구분.시장.마트..이름))
    # Mackerel (fresh, domestic), April 2013: large supermarket (대형마트) and traditional market (전통시장)
    ppta$`고등어(생물, 국산).Apr-13.대형마트`
    [1] [24]
    ppta$`고등어(생물, 국산).Apr-13.전통시장`
    [1]
    # Apple (Fuji, 300g), September 2013: 5% trimmed mean and median by market type
    mean(ppta$`사과(부사, 300g).Sep-13.전통시장`, trim=0.05)
    [1]
    mean(ppta$`사과(부사, 300g).Sep-13.대형마트`, trim=0.05)
    [1]
    median(ppta$`사과(부사, 300g).Sep-13.전통시장`)
    [1] 2000
    median(ppta$`사과(부사, 300g).Sep-13.대형마트`)
    [1] 2175

8 2. Descriptive statistics
    # Beef (Hanwoo, bulgogi cut), September 2013: median and standard deviation by market type
    median(ppta$`쇠고기(한우, 불고기).Sep-13.대형마트`)
    [1]
    median(ppta$`쇠고기(한우, 불고기).Sep-13.전통시장`)
    [1]
    sd(ppta$`쇠고기(한우, 불고기).Sep-13.전통시장`)
    [1]
    sd(ppta$`쇠고기(한우, 불고기).Sep-13.대형마트`)
    [1]
    (한우, 불고기).Sep-13.전통시장`)
    [1]
    # Summary of every group: napa cabbage (medium), apple, apple (Fuji) at traditional markets
    lapply(ppta, summary)
    $`배추(중간).Sep-13.전통시장`
    Min. 1st Qu. Median Mean 3rd Qu. Max.
    $`사과.Sep-13.전통시장`
    Min. 1st Qu. Median Mean 3rd Qu. Max.
    $`사과(부사).Sep-13.전통시장`
    Min. 1st Qu. Median Mean 3rd Qu. Max.
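The split-then-summarise pattern used on these two slides can be tried on a small made-up data frame. This is only a sketch: the column names `item`, `market`, and `price` and all values are hypothetical stand-ins, not the Seoul dataset.

```r
# Hypothetical toy data; names and values are illustrative only.
prices <- data.frame(
  item   = c("apple", "apple", "apple", "apple", "beef", "beef", "beef", "beef"),
  market = c("traditional", "traditional", "supermarket", "supermarket",
             "traditional", "traditional", "supermarket", "supermarket"),
  price  = c(2000, 2100, 2150, 2200, 9000, 9500, 10000, 11000)
)

# split() returns a named list with one price vector per item/market combination.
# Element names join the factor levels with ".", for example "apple.supermarket".
groups <- split(prices$price, list(prices$item, prices$market))

# Apply a statistic to every group at once.
group_medians <- sapply(groups, median)
group_trimmed <- sapply(groups, mean, trim = 0.05)
group_sd      <- sapply(groups, sd)
```

Individual groups can then be pulled out by name, as in `groups[["apple.traditional"]]`, which mirrors the backtick-quoted access on the slides.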

9 # Grouping methods other than split
1. by (plays a role similar to tapply, but operates on objects instead of vectors)
2. aggregate (calls tapply once for each variable in the group)
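A short side-by-side sketch of the grouping functions named above, again on hypothetical data (the `item`, `market`, and `price` columns are assumptions for illustration):

```r
# Hypothetical toy data; names and values are illustrative only.
prices <- data.frame(
  item   = c("apple", "apple", "apple", "apple", "beef", "beef", "beef", "beef"),
  market = c("traditional", "traditional", "supermarket", "supermarket",
             "traditional", "traditional", "supermarket", "supermarket"),
  price  = c(2000, 2100, 2150, 2200, 9000, 9500, 10000, 11000)
)

# tapply: takes a vector plus grouping factors, returns an item-by-market matrix.
tab <- tapply(prices$price, list(prices$item, prices$market), median)

# by: passes each group's rows as a whole data frame to the function.
by_market <- by(prices, prices$market, function(d) median(d$price))

# aggregate: returns a data frame with one row per group.
agg <- aggregate(price ~ item + market, data = prices, FUN = median)
```

`tab` is convenient for cross-tabulated lookups such as `tab["apple", "supermarket"]`, while `agg` is easier to feed into further data-frame operations or plotting.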

10 3. Histogram 4. Boxplot

11 R graphics

12 1. High-level graphics functions: plot (generic graph function), boxplot, hist, qqnorm, curve
2. Low-level graphics functions: points, lines, abline, segments, polygon, text
Calling a high-level graphics function creates a new graph. Low-level graphics functions are called afterwards to add to a graph that a high-level function has already drawn.
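The division of labour between the two kinds of function can be sketched like this. The data are simulated and the colours and labels are arbitrary choices.

```r
set.seed(42)  # arbitrary seed, only for reproducibility
x <- rnorm(50)
y <- 2 * x + rnorm(50)
fit <- lm(y ~ x)

# High-level call: opens a new plot with axes, labels, and the points.
plot(x, y, main = "High-level plot, low-level additions")

# Low-level calls: each one draws on top of the existing plot.
abline(fit, col = "red")                        # fitted regression line
abline(h = mean(y), lty = 2)                    # horizontal reference line
points(mean(x), mean(y), pch = 19, col = "blue")
text(mean(x), mean(y), labels = "mean", pos = 4)
```

Calling `abline()` or `points()` before any high-level function fails because there is no plot to draw on, which is the ordering constraint the slide describes.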

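13 3. September price differences for daily necessities (traditional markets vs. large supermarkets)
Using the plyr and ggplot2 packages, we could see the price differences by item and by market type. (The code is taken from an external reference.)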

14 4. ggplot2
ggplot example with the iris data
An R-based graphics package.
Based on The Grammar of Graphics (Wilkinson, 2005); uses graph objects.
Provides a flexible plotting environment and makes graphs programmable.
qplot: a function for quick plotting
ggplot: a grammar-based function for detailed settings
    ggplot(iris, aes(Sepal.Length, Sepal.Width)) + geom_point(aes(colour=Species)) + geom_smooth(aes(colour=Species), method=lm)
Reference

15 5. Network graph
Use case: building a network graph from the correlation coefficients between time series.
    library(igraph)
    gd <- graph(c(1,2, 2,3, 2,4, 1,4, 5,5, 3,6))
    plot(gd)
You can add names to the labels and change the distance, color, and size. A detailed explanation of network graphs is available in the following reference: R Graphics Cookbook.
Legend: individual (개별), grouped (그룹화)

16 FF / Bigmemory

17 Big data at the limits of R
R works only in RAM: R has a limit on object size in bytes, because objects are stored in memory.
File.txt (big data)
    Data = read.table("File.txt", )
1. Even if the file loads, it takes a long time, or
2. the file is too large to fit in memory.
Solutions
1. Chunking
2. Memory-management R packages (ff, bigmemory, RevoScaleR)
3. Parallel execution

18 1. Chunking
    read.table(file=, header = FALSE, nrow =?, skip =?, )
File.txt
    Data1 = read.table("File.txt", nrow = , skip = 0, )
    Stat1 = summary(Data1)
    Data2 = read.table("File.txt", nrow = , skip = , )
    Stat2 = summary(Data2)
    Data3 = read.table("File.txt", nrow = , skip = , )
    Stat3 = summary(Data3)
    Data4 = read.table("File.txt", nrow = , skip = , )
    Stat4 = summary(Data4)
    Aggregate(Stat1, Stat2, Stat3, Stat4)   # pseudocode: combine the per-chunk statistics
The data can be analysed in pieces this way, but if even a single chunk is too large it cannot be processed, and splitting the data and computing sequentially can slow the calculation down.
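A minimal runnable sketch of the chunking idea, under a few assumptions: the file is a small temporary file created here for illustration, the statistic is a mean (which combines exactly from per-chunk sums and counts, unlike a full `summary()`), and the chunks are read from an open connection rather than with `skip`, so that earlier rows are not rescanned on every call.

```r
# Hypothetical demo file: a single numeric column with 1,000 rows.
path <- tempfile(fileext = ".txt")
write.table(data.frame(V1 = 1:1000), path, row.names = FALSE, col.names = FALSE)

chunk_size <- 300   # rows per chunk; chosen arbitrarily for the demo
con <- file(path, open = "r")
total <- 0
n <- 0
repeat {
  chunk <- tryCatch(
    read.table(con, nrows = chunk_size, header = FALSE),
    error = function(e) NULL   # read.table signals an error once no lines are left
  )
  if (is.null(chunk) || nrow(chunk) == 0) break
  # Keep only the running aggregates, not the chunk itself.
  total <- total + sum(chunk$V1)
  n <- n + nrow(chunk)
  if (nrow(chunk) < chunk_size) break   # a short chunk means end of file
}
close(con)

chunked_mean <- total / n
```

Only one chunk is in memory at a time. Statistics such as medians or quantiles do not combine this simply, which is one reason the dedicated packages on the following slides exist.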

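19 2. Memory-management packages
Package: description; advantages; disadvantages
ff: stores data on disk; advantages: clean system; disadvantages: few examples
bigmemory: stores data in main memory as well as on disk; advantages: easy to use, more widely used than ff, works with parallel packages (SNOW); disadvantages: not usable on Windows, data containing characters needs preprocessing, all data must share one type (such as double, integer, short, char)
RevoScaleR: commercial Revolution R Enterprise software that uses the XDF data format to overcome memory limits and provides an extensible framework in which C++ programmers can write external-memory algorithms; advantages: fast computation; disadvantages: as a commercial product, only the academic version (for students) is available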

20 2. ff package (related packages: ff, ffbase, biglm, biglars, speedglm)

21 2. ff example
    library(ff)
    rff <- read.table.ffdf(file="e:/dataset/3d_spatial_network.txt", sep=",", header=FALSE)
    rff
Data (3D Road Network Data Set): V1: ID, V2: LONGITUDE, V3: LATITUDE, V4: ALTITUDE
Data source: UCI Machine Learning Repository

22 2. ff example
    library(ffbase)   # basic statistical functions for ff
    dim(rff)
    [1]
    sum(rff$V2)
    [1]
    mean(rff$V2)
    [1]
    min(rff$V2)
    [1]
    max(rff$V2)
    [1]
    range(rff$V2)
    [1]
    quantile(rff$V2)
    0% 25% 50% 75% 100%
    hist(rff$V2)

23 2. ff example
    library(biglm)   # bounded-memory linear regression
    dim(rff)
    [1]
    rfflm <- biglm(V1 ~ V2 + V3 + V4, data=rff)
    rfflm
    Large data regression model: biglm(V1 ~ V2 + V3 + V4, data = rff)
    Sample size =
    summary(rfflm)
    Large data regression model: biglm(V1 ~ V2 + V3 + V4, data = rff)

24 2. Bigmemory package
bigalgebra (linear algebra functions)
biganalytics (analytics for big.matrix objects, such as GLM and bigkmeans)
bigmemory
bigtabulate (provides functions such as table, tapply, and split)
synchronicity (data streaming, shared-memory capabilities)
    library(bigmemory)
    X <- read.big.matrix("/media/604e93df4e93ac72/dataset/3d_spatial_network.txt", header = FALSE, sep=",", type = "double", backingfile ="BigMem.bin", descriptorfile = "BigMem.desc", shared = TRUE)
    X
    An object of class "big.matrix"
    Slot "address":
    <pointer: 0xad5c470>
    > summary(X)
    Length Class Mode
           big.matrix S4
Data source: UCI Machine Learning Repository

25 2. Bigmemory example
    > head(x)
         y x1 x2 x3
    [1,]
    [2,]
    [3,]
    [4,]
    [5,]
    [6,]
    > dim(x)
    [1]
    > library(biganalytics)
    > sum(x)
    [1]
    > mean(x)
    [1]
    > colmax(x)
    y x1 x2 x3
    > colsd(x)
    y x1 x2 x3
    > colrange(x)
       min max
    y
    x1
    x2
    x3
    > colsum(x)
    y x1 x2 x3

26 2. Bigmemory example
    library(biganalytics)
    xlm <- biglm.big.matrix(y ~ x1 + x2 + x3, data=x)
    xlm
    Large data regression model: biglm(formula = formula, data = data,...)
    Sample size =
    summary(xlm)
    Large data regression model: biglm(formula = formula, data = data,...)
    Sample size =
                Coef (95% CI) SE p
    (Intercept)
    x1
    x2
    x3
    bigkmeans(x, 3, iter.max=0, nstart=1)
    K-means clustering with 3 clusters of sizes , ,
    Cluster means:
         [,1] [,2] [,3] [,4]
    [1,]
    [2,]
    [3,]
    Clustering vector:
    [1] [67] [133]


More information

Data Mining: Exploring Data. Lecture Notes for Chapter 3

Data Mining: Exploring Data. Lecture Notes for Chapter 3 Data Mining: Exploring Data Lecture Notes for Chapter 3 1 What is data exploration? A preliminary exploration of the data to better understand its characteristics. Key motivations of data exploration include

More information

A (very) brief introduction to R

A (very) brief introduction to R A (very) brief introduction to R You typically start R at the command line prompt in a command line interface (CLI) mode. It is not a graphical user interface (GUI) although there are some efforts to produce

More information

Themes in the Texas CCRS - Mathematics

Themes in the Texas CCRS - Mathematics 1. Compare real numbers. a. Classify numbers as natural, whole, integers, rational, irrational, real, imaginary, &/or complex. b. Use and apply the relative magnitude of real numbers by using inequality

More information

Practice in R. 1 Sivan s practice. 2 Hetroskadasticity. January 28, (pdf version)

Practice in R. 1 Sivan s practice. 2 Hetroskadasticity. January 28, (pdf version) Practice in R January 28, 2010 (pdf version) 1 Sivan s practice Her practice file should be (here), or check the web for a more useful pointer. 2 Hetroskadasticity ˆ Let s make some hetroskadastic data:

More information

Sections 2.3 and 2.4

Sections 2.3 and 2.4 Sections 2.3 and 2.4 Shiwen Shen Department of Statistics University of South Carolina Elementary Statistics for the Biological and Life Sciences (STAT 205) 2 / 25 Descriptive statistics For continuous

More information

Description/History Objects/Language Description Commonly Used Basic Functions. More Specific Functionality Further Resources

Description/History Objects/Language Description Commonly Used Basic Functions. More Specific Functionality Further Resources R Outline Description/History Objects/Language Description Commonly Used Basic Functions Basic Stats and distributions I/O Plotting Programming More Specific Functionality Further Resources www.r-project.org

More information

Table Of Contents. Table Of Contents

Table Of Contents. Table Of Contents Statistics Table Of Contents Table Of Contents Basic Statistics... 7 Basic Statistics Overview... 7 Descriptive Statistics Available for Display or Storage... 8 Display Descriptive Statistics... 9 Store

More information

Lecture 6: Chapter 6 Summary

Lecture 6: Chapter 6 Summary 1 Lecture 6: Chapter 6 Summary Z-score: Is the distance of each data value from the mean in standard deviation Standardizes data values Standardization changes the mean and the standard deviation: o Z

More information

Lampiran 6 HASIL STATISTIK

Lampiran 6 HASIL STATISTIK Lampiran 6 HASIL STATISTIK Usia 11.37 of.450 Median 12.00 Mode 12 Std. Deviation 3.488 Minimum 2 Maximum 16 usia Frequency Valid Valid 2 2 3.3 3.3 3.3 4 2 3.3 3.3 6.7 6 2 3.3 3.3 10.0 7 4 6.7 6.7 16.7

More information

CS6220: DATA MINING TECHNIQUES

CS6220: DATA MINING TECHNIQUES CS6220: DATA MINING TECHNIQUES 2: Data Pre-Processing Instructor: Yizhou Sun yzsun@ccs.neu.edu September 10, 2013 2: Data Pre-Processing Getting to know your data Basic Statistical Descriptions of Data

More information

CITS4009 Introduction to Data Science

CITS4009 Introduction to Data Science School of Computer Science and Software Engineering CITS4009 Introduction to Data Science SEMESTER 2, 2017: CHAPTER 4 MANAGING DATA 1 Chapter Objectives Fixing data quality problems Organizing your data

More information

Package bayesdp. July 10, 2018

Package bayesdp. July 10, 2018 Type Package Package bayesdp July 10, 2018 Title Tools for the Bayesian Discount Prior Function Version 1.3.2 Date 2018-07-10 Depends R (>= 3.2.3), ggplot2, survival, methods Functions for data augmentation

More information

Data Mining: Exploring Data. Lecture Notes for Chapter 3. Introduction to Data Mining

Data Mining: Exploring Data. Lecture Notes for Chapter 3. Introduction to Data Mining Data Mining: Exploring Data Lecture Notes for Chapter 3 Introduction to Data Mining by Tan, Steinbach, Kumar What is data exploration? A preliminary exploration of the data to better understand its characteristics.

More information

Data Mining: Exploring Data. Lecture Notes for Data Exploration Chapter. Introduction to Data Mining

Data Mining: Exploring Data. Lecture Notes for Data Exploration Chapter. Introduction to Data Mining Data Mining: Exploring Data Lecture Notes for Data Exploration Chapter Introduction to Data Mining by Tan, Steinbach, Karpatne, Kumar 02/03/2018 Introduction to Data Mining 1 What is data exploration?

More information

050 0 N 03 BECABCDDDBDBCDBDBCDADDBACACBCCBAACEDEDBACBECCDDCEA

050 0 N 03 BECABCDDDBDBCDBDBCDADDBACACBCCBAACEDEDBACBECCDDCEA 050 0 N 03 BECABCDDDBDBCDBDBCDADDBACACBCCBAACEDEDBACBECCDDCEA 55555555555555555555555555555555555555555555555555 NYYNNYNNNYNYYYYYNNYNNNNNYNYYYYYNYNNNNYNNYNNNYNNNNN 01 CAEADDBEDEDBABBBBCBDDDBAAAECEEDCDCDBACCACEECACCCEA

More information

Chapter 5: Joint Probability Distributions and Random

Chapter 5: Joint Probability Distributions and Random Chapter 5: Joint Probability Distributions and Random Samples Curtis Miller 2018-06-13 Introduction We may naturally inquire about collections of random variables that are related to each other in some

More information

Integrated Math 1. Integrated Math, Part 1

Integrated Math 1. Integrated Math, Part 1 Integrated Math 1 Course Description: This Integrated Math course will give students an understanding of the foundations of Algebra and Geometry. Students will build on an an understanding of variables,

More information

Lecture Notes 3: Data summarization

Lecture Notes 3: Data summarization Lecture Notes 3: Data summarization Highlights: Average Median Quartiles 5-number summary (and relation to boxplots) Outliers Range & IQR Variance and standard deviation Determining shape using mean &

More information

Univariate Data - 2. Numeric Summaries

Univariate Data - 2. Numeric Summaries Univariate Data - 2. Numeric Summaries Young W. Lim 2018-08-01 Mon Young W. Lim Univariate Data - 2. Numeric Summaries 2018-08-01 Mon 1 / 36 Outline 1 Univariate Data Based on Numerical Summaries R Numeric

More information

CHAPTER-13. Mining Class Comparisons: Discrimination between DifferentClasses: 13.4 Class Description: Presentation of Both Characterization and

CHAPTER-13. Mining Class Comparisons: Discrimination between DifferentClasses: 13.4 Class Description: Presentation of Both Characterization and CHAPTER-13 Mining Class Comparisons: Discrimination between DifferentClasses: 13.1 Introduction 13.2 Class Comparison Methods and Implementation 13.3 Presentation of Class Comparison Descriptions 13.4

More information

K-means Clustering. customers.data <- read.csv(file = "wholesale_customers_data1.csv") str(customers.data)

K-means Clustering. customers.data <- read.csv(file = wholesale_customers_data1.csv) str(customers.data) K-means Clustering Dataset Wholesale Customer dataset contains data about clients of a wholesale distributor. It includes the annual spending in monetary units (m.u.) on diverse product categories.the

More information

R Programming for Computational Linguists and Similar Creatures

R Programming for Computational Linguists and Similar Creatures R Programming for Computational Linguists and Similar Creatures Marco Baroni 1 and Stefan Evert 2 1 Center for Mind/Brain Sciences University of Trento 2 Cognitive Science Institute University of Onsabrück

More information