In this tutorial we will see some of the basic operations on data frames in R. We begin by first importing the data into an R object called train.


Data frames in R

In this tutorial we will see some of the basic operations on data frames in R:

- Understand the structure
- Indexing
- Column names
- Add a column/row
- Delete a column/row
- Subset
- Summarize

We will again use the Titanic data set available at Kaggle.

Understand the structure

We begin by importing the data into an R object called train.

    train <- read.csv("train.csv", na.strings = "")

Once the csv file is read into our workspace, it is stored as an object of class data.frame. Everything in R is an object, and every object belongs to a particular class. We can check the class of any R object using the class() function.

    class(train)
    ## [1] "data.frame"

A data frame is a two-dimensional structure, the dimensions being the rows and columns. A column holds the information for one variable and hence can contain data of one type only, e.g., numeric, character, factor, or date. A column may hold both numbers and strings, but they will share a single storage type: if the first row has the entry '1234' and the second row has the entry 'a word', the column will be classified as character (or factor), not numeric.

To find out how the columns in our Titanic data are classified, we can use the str() function, which displays the internal structure of an R object. (The output below shows text columns stored as factors, which was the read.csv() default before R 4.0.0; in later versions, pass stringsAsFactors = TRUE to get the same result.)

    str(train)
    ## 'data.frame': 891 obs. of 11 variables:
    ## $ survived: int ...
    ## $ pclass  : int ...
    ## $ name    : Factor w/ 891 levels "Abbing, Mr. Anthony",..: 109 191 358 ...
    ## $ sex     : Factor w/ 2 levels "female","male": ...
    ## $ age     : num ... NA ...
    ## $ sibsp   : int ...
    ## $ parch   : int ...
    ## $ ticket  : Factor w/ 681 levels "110152","110413",..: ...
    ## $ fare    : num ...
    ## $ cabin   : Factor w/ 147 levels "A10","A14","A16",..: NA 82 NA 56 NA NA 130 NA NA NA ...
    ## $ embarked: Factor w/ 3 levels "C","Q","S": ...

The output from the str() function tells us that our data frame has 891 observations (rows) and 11 variables (columns). The details of each column are provided along with the column name. Note that each column name is preceded by a '$' sign. This sign has a special meaning in R, which we will come to shortly.

To understand the output, consider the first column listed, 'survived'. This column has class integer, and its first few values are shown. Now consider the third column, 'name'. This column is of class factor and has 891 levels, i.e., 891 unique values. The first of these levels is 'Abbing, Mr. Anthony'. This is not the first observation in the data for this column. It is the first level (category) of the factor (categorical) variable 'name'. Unless manually specified, the levels are chosen by R automatically in alphabetical order. The first observation for the variable in the data belongs to level 109, followed by level 191, and then 358. Note again that R is not showing the actual value that the field holds, but the level number corresponding to that value.

We have seen the structure of our data set. Now let's look at the data itself. To get a quick snapshot of the data frame, we can use the head() function, which displays the first few observations of all the variables in the data.

    head(train)
    ## survived pclass name
    ## Braund, Mr. Owen Harris
    ## Cumings, Mrs. John Bradley (Florence Briggs Thayer)
    ## Heikkinen, Miss. Laina
    ## Futrelle, Mrs. Jacques Heath (Lily May Peel)
    ## Allen, Mr. William Henry
    ## Moran, Mr. James
    ## sex age sibsp parch ticket fare cabin embarked
    ## 1 male A/ <NA> S
    ## 2 female PC C85 C
    ## 3 female STON/O <NA> S
    ## 4 female C123 S
    ## 5 male <NA> S
    ## 6 male NA <NA> Q

There is also an analogous function called tail() that displays the last few observations.
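The structure checks described above can be tried without the Titanic file. The following is a minimal sketch on a made-up data frame; the object toy and its values are assumptions for illustration only and are not part of the tutorial's data.

```r
# Toy data frame with made-up values (not the Titanic data).
toy <- data.frame(
  name = c("Cruz", "Adams", "Baker", "Adams"),
  age  = c(31, 45, NA, 52),
  stringsAsFactors = TRUE  # store text as a factor, as in the str() output above
)

toy_class  <- class(toy)            # class of the whole object
name_class <- class(toy$name)       # class of a single column
name_lvls  <- levels(toy$name)      # factor levels, sorted alphabetically
name_codes <- as.integer(toy$name)  # level number behind each observation

# Mixing a number and a string in one vector forces a single storage type.
mixed       <- c(1234, "a word")
mixed_class <- class(mixed)

str(toy)       # one line per column: type and first few values
head(toy, 2)   # first two rows
tail(toy, 2)   # last two rows
```

Here the first observation of name is "Cruz", yet its stored code is the position of "Cruz" among the sorted levels, which mirrors the level-number behaviour described for the 'name' column.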

Indexing

If there are too many columns in the data frame, using the head() function straight away might not be a good idea. In that case, we can select the columns (and rows) that we want to see using the [m, n] notation, where m corresponds to rows and n corresponds to columns. Indexing in R starts from 1, as opposed to Python, where it starts from 0.

To view the observations of the first column use

    head(train[, 1])
    ## [1]

To view the observations of the first three columns use

    head(train[, c(1, 2, 3)])
    ## survived pclass name
    ## Braund, Mr. Owen Harris
    ## Cumings, Mrs. John Bradley (Florence Briggs Thayer)
    ## Heikkinen, Miss. Laina
    ## Futrelle, Mrs. Jacques Heath (Lily May Peel)
    ## Allen, Mr. William Henry
    ## Moran, Mr. James

To view the observations of columns 3 and 7 use

    head(train[, c(3, 7)])
    ## name parch
    ## 1 Braund, Mr. Owen Harris 0
    ## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) 0
    ## 3 Heikkinen, Miss. Laina 0
    ## 4 Futrelle, Mrs. Jacques Heath (Lily May Peel) 0
    ## 5 Allen, Mr. William Henry 0
    ## 6 Moran, Mr. James 0

We can also view selected rows in the same way. To view the first row for all columns use

    train[1, ]
    ## survived pclass name sex age sibsp parch ticket
    ## Braund, Mr. Owen Harris male A/
    ## fare cabin embarked
    ## <NA> S

To view the first three rows for all columns use

    train[c(1, 2, 3), ]
    ## survived pclass name
    ## Braund, Mr. Owen Harris
    ## Cumings, Mrs. John Bradley (Florence Briggs Thayer)
    ## Heikkinen, Miss. Laina
    ## sex age sibsp parch ticket fare cabin embarked
    ## 1 male A/ <NA> S
    ## 2 female PC C85 C
    ## 3 female STON/O <NA> S

To view rows 3 and 7 for all columns use

    train[c(3, 7), ]
    ## survived pclass name sex age sibsp parch
    ## Heikkinen, Miss. Laina female
    ## McCarthy, Mr. Timothy J male
    ## ticket fare cabin embarked
    ## 3 STON/O <NA> S
    ## E46 S

We do not need the head() function here, since we are explicitly telling R which observations to show.

We can combine the two sets of examples and view any desired combination of rows and columns. For example, to view the first row for columns 4, 5, and 6 use

    train[1, c(4, 5, 6)]
    ## sex age sibsp
    ## 1 male 22 1

To view the first ten rows for columns 2 to 6 use

    train[c(1:10), c(2:6)]
    ## pclass name sex age
    ## 1 3 Braund, Mr. Owen Harris male 22
    ## 2 1 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38
    ## 3 3 Heikkinen, Miss. Laina female 26
    ## 4 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35
    ## 5 3 Allen, Mr. William Henry male 35
    ## 6 3 Moran, Mr. James male NA
    ## 7 1 McCarthy, Mr. Timothy J male 54
    ## 8 3 Palsson, Master. Gosta Leonard male 2
    ## 9 3 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) female 27
    ## 10 2 Nasser, Mrs. Nicholas (Adele Achem) female 14
    ## sibsp
    ## 1 1
    ## 2 1
    ## 3 0
    ## 4 1
    ## 5 0
    ## 6 0
    ## 7 0
    ## 8 3
    ## 9 0
    ## 10 1

The a:b notation produces a vector of integers ranging from a to b. If a < b, a vector with increasing values is created, and if a > b, a vector with decreasing values is created.

To view rows 50 to 60 and 110 to 115 for columns 2, 3, and 6 use

    train[c(50:60, 110:115), c(2, 3, 6)]

Column names

A data.frame object has two attributes attached to it by default: column names and row names. Given these, any column or row can be identified and manipulated using its name. The column and row names of a data frame can be retrieved using

    colnames(train)
    ## [1] "survived" "pclass" "name" "sex" "age" "sibsp"
    ## [7] "parch" "ticket" "fare" "cabin" "embarked"
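The [m, n] patterns from the Indexing section can be checked on a small data frame whose contents are easy to verify by eye. This is a sketch only; toy and its columns are hypothetical names chosen for illustration.

```r
# Toy data frame with 6 rows and 4 columns (made-up values).
toy <- data.frame(
  id  = 1:6,
  grp = c("a", "b", "a", "c", "b", "a"),
  x   = c(2.5, 3.1, 4.7, 1.2, 9.9, 6.0),
  y   = c(10, 20, 30, 40, 50, 60)
)

first_col  <- toy[, 1]                   # one column comes back as a plain vector
two_cols   <- toy[, c(1, 3)]             # columns 1 and 3, all rows
row_block  <- toy[2:4, ]                 # rows 2 to 4, all columns
mixed_rows <- toy[c(1:2, 5:6), c(2, 4)]  # two row ranges combined with c()
down       <- 5:2                        # a > b gives a decreasing vector

col_names <- colnames(mixed_rows)        # names survive the selection
row_names <- rownames(row_block)         # row names are stored as character
```

Note that selecting a single column with [, n] drops the data frame structure and returns a vector, while selecting several columns keeps it.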

    head(rownames(train))
    ## [1] "1" "2" "3" "4" "5" "6"

Note: we used the head() function on rownames() only to restrict the size of the output.

We mentioned above that everything in R is an object. This means that every function call also returns an object. Calling the function colnames() returns an object of class character. How do we know this? Simply pass the output of the function call to the class() function.

    variables <- colnames(train)
    class(variables)
    ## [1] "character"

Since the output is an R object, it can be manipulated as required. For example, to change the names of the columns use

    colnames(train) <- c("col1", "col2", "col3", "col4", "col5", "col6", "col7", "col8", "col9", "col10", "col11")
    colnames(train)
    ## [1] "col1" "col2" "col3" "col4" "col5" "col6" "col7" "col8"
    ## [9] "col9" "col10" "col11"

The number of names on the right-hand side should equal the number of variables in the data frame. To change the name of a particular column, say column no. 4, use

    colnames(train)[4] <- "newname4"

[ ] is the same indexing operator we used above. To change the names of a few columns, say column nos. 5, 8, and 11, use

    colnames(train)[c(5, 8, 11)] <- c("newname5", "newname8", "newname11")
    colnames(train)
    ## [1] "col1" "col2" "col3" "newname4" "newname5"
    ## [6] "col6" "col7" "newname8" "col9" "col10"
    ## [11] "newname11"

[ ] can also take negative values. By using a negative integer, we select all the values from the object except the one(s) stored at the given location(s). For example, to rename all the columns except columns 4, 5, 8, and 11 use

    colnames(train)[-c(4, 5, 8, 11)] <- c("newname1", "newname2", "newname3", "newname6", "newname7", "newname9", "newname10")
    colnames(train)
    ## [1] "newname1" "newname2" "newname3" "newname4" "newname5"
    ## [6] "newname6" "newname7" "newname8" "newname9" "newname10"
    ## [11] "newname11"

In the past couple of examples, we replaced the original names in our data set. We can get these back by using the 'variables' vector we created above.

    colnames(train) <- variables
    colnames(train)
    ## [1] "survived" "pclass" "name" "sex" "age" "sibsp"
    ## [7] "parch" "ticket" "fare" "cabin" "embarked"

We mentioned the '$' sign above. This sign is a very convenient utility and can be used to retrieve named elements from an R object. For example, to view the columns 'survived' and 'sex' in our data, do

    head(train$survived)
    ## [1]
    head(train$sex)
    ## [1] male female female female male male
    ## Levels: female male

Add a column/row

The '$' can also be used to create an element within an object. For example, to create a column that contains the squared values of the 'fare' column use

    train$fare.sq <- train$fare * train$fare
    head(train$fare.sq)
    ## [1]

We can confirm that the squared values have been correctly calculated by using the [ ] operator in a different way. Instead of giving the index values, we can provide the column names directly.

    head(train[, c("fare", "fare.sq")])
    ## fare fare.sq

We can add a row to the data set as well. Let's add one below the last row. As a simple example, we will take the first row and make a copy of it at the end. For this, we will use the indexing operator [ ] and the nrow() function, which gives the total number of rows present in a data frame.

    nrow(train)
    ## [1] 891
    train[nrow(train), ]
    ## survived pclass name sex age sibsp parch ticket fare
    ## Dooley, Mr. Patrick male
    ## cabin embarked fare.sq
    ## 891 <NA> Q
    train[nrow(train) + 1, ] <- train[1, ]
    nrow(train)
    ## [1] 892
    train[nrow(train), ]
    ## survived pclass name sex age sibsp parch ticket
    ## Braund, Mr. Owen Harris male A/
    ## fare cabin embarked fare.sq
    ## <NA> S

Delete a column/row

Deleting a column/row is as easy as creating one. Simply use a negative index for the column/row that needs to be deleted. For example, to delete the 'fare.sq' column calculated above, use

    train <- train[, -12]
    train$fare.sq
    ## NULL

To delete the last row created above, use

    train <- train[-892, ]
    train[892, ]
    ## survived pclass name sex age sibsp parch ticket fare cabin embarked
    ## NA NA NA <NA> <NA> NA NA NA <NA> NA <NA> <NA>

Subset

A data frame can be subset using different conditions. For example, we can subset the train data to include observations only for females using the subset() function.

    train.female <- subset(train, sex == "female")

To check whether the subset worked properly, we can look at the frequency table of the 'sex' variable in both data sets.

    table(train$sex)
    ##
    ## female male
    ##    314  577
    table(train.female$sex)
    ##
    ## female male
    ##    314    0

Consider another example where we subset the data by taking only those cases for which 'fare' is between 100 and 500.

    train.sub1 <- subset(train, fare >= 100 & fare <= 500)
    dim(train.sub1)
    ## [1]

We can also subset using two different variables. Let's take the cases where passenger class is 3 and sex is male.

    train.sub2 <- subset(train, pclass == 3 & sex == "male")
    dim(train.sub2)
    ## [1]

The above example used an 'and' condition, written '&', while subsetting the data. The example below uses the same two variables with an 'or' condition between them. The 'or' condition in R is specified using '|'.

    train.sub3 <- subset(train, pclass == 3 | sex == "male")
    dim(train.sub3)
    ## [1]

The exact same process can be carried out using the indexing operator [ ]. For example, to replicate the previous example with [ ] use

    train.sub4 <- train[train$pclass == 3 | train$sex == "male", ]
    dim(train.sub4)
    ## [1]

Summarize

Summarizing a data set is extremely easy and can be done using a simple function called summary().

    summary(train)
    ## survived pclass
    ## Min. :0.000 Min. :1.00
    ## 1st Qu.:... 1st Qu.:2.00
    ## Median :0.000 Median :3.00
    ## Mean :0.384 Mean :2.31
    ## 3rd Qu.:... 3rd Qu.:3.00
    ## Max. :1.000 Max. :3.00
    ##
    ## name sex age
    ## Abbing, Mr. Anthony : 1 female:314 Min. : 0.42
    ## Abbott, Mr. Rossmore Edward : 1 male :577 1st Qu.:20.12
    ## Abbott, Mrs. Stanton (Rosa Hunt) : 1 Median :28.00
    ## Abelson, Mr. Samuel : 1 Mean :29.70
    ## Abelson, Mrs. Samuel (Hannah Wizosky): 1 3rd Qu.:38.00
    ## Adahl, Mr. Mauritz Nils Martin : 1 Max. :80.00
    ## (Other) :885 NA's :177
    ## sibsp parch ticket fare
    ## Min. :0.000 Min. :... ... : 7 Min. : 0.0
    ## 1st Qu.:... 1st Qu.:... ... : 7 1st Qu.: 7.9
    ## Median :0.000 Median :0.000 CA. 2343: 7 Median : 14.5
    ## Mean :0.523 Mean :... ... : 6 Mean : 32.2
    ## 3rd Qu.:... 3rd Qu.:... ... : 6 3rd Qu.: 31.0
    ## Max. :8.000 Max. :6.000 CA 2144 : 6 Max. :512.3
    ## (Other) :852
    ## cabin embarked
    ## B96 B98 : 4 C :168
    ## C23 C25 C27: 4 Q : 77
    ## G6 : 4 S :644
    ## C22 C26 : 3 NA's: 2
    ## D : 3
    ## (Other) :186
    ## NA's :687
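What summary() reports depends on the type of each column: quartiles and the mean for numeric columns, counts per level for factors, and a separate count of missing values. A minimal sketch on a made-up data frame (toy is a hypothetical object, not the Titanic data):

```r
# Toy data frame with one numeric column (containing an NA) and one factor.
toy <- data.frame(
  score = c(10, 20, NA, 40),
  grp   = factor(c("a", "b", "a", "a"))
)

num_summary <- summary(toy$score)     # Min., quartiles, Mean, Max., and NA's
fac_summary <- summary(toy$grp)       # number of observations per level
n_missing   <- sum(is.na(toy$score))  # missing values counted directly

summary(toy)                          # both columns summarized side by side
```

The NA's entry in the numeric summary should agree with the direct count from is.na(), which is a quick way to cross-check missing values.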

The summary of a data frame gives a clear snapshot of the values each variable holds, including the missing ones. The '$' sign and the indexing operator can be used to summarize a single variable or a group of variables, as shown below.

    summary(train$pclass)
    ## Min. 1st Qu. Median Mean 3rd Qu. Max.

    summary(train[, c("pclass", "sex", "cabin")])
    ## pclass sex cabin
    ## Min. :1.00 female:314 B96 B98 : 4
    ## 1st Qu.:2.00 male :577 C23 C25 C27: 4
    ## Median :3.00 G6 : 4
    ## Mean :2.31 C22 C26 : 3
    ## 3rd Qu.:3.00 D : 3
    ## Max. :3.00 (Other) :186
    ## NA's :687

The above examples are just a representative sample of the functions available in R to process data frames. They are intended to serve as a starting point and a quick reference guide for those who have just started working with R. In the next tutorial, we will learn about data manipulation in R.
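As a recap, the add, delete, and subset steps from the earlier sections can be replayed end to end on a small data frame. This is a sketch under assumed data: toy, its column names, and its values are made up so that the results can be checked by hand.

```r
# Toy data frame (made-up values) with a class, a sex, and a fare column.
toy <- data.frame(
  id   = 1:5,
  cls  = c(3, 1, 3, 2, 3),
  sex  = c("male", "female", "female", "male", "male"),
  fare = c(7, 71, 8, 13, 8)
)

toy$fare.sq <- toy$fare * toy$fare   # add a column with $
toy[nrow(toy) + 1, ] <- toy[1, ]     # append a copy of the first row
n_after_add <- nrow(toy)

toy <- toy[, -5]                     # delete column 5 (fare.sq)
toy <- toy[-nrow(toy), ]             # delete the row appended above

both   <- subset(toy, cls == 3 & sex == "male")    # 'and': both conditions hold
either <- subset(toy, cls == 3 | sex == "male")    # 'or': at least one holds
same   <- toy[toy$cls == 3 | toy$sex == "male", ]  # the 'or' filter with [ ]
```

Using -nrow(toy) instead of a hard-coded row number is a small design choice that keeps the delete step correct even if the number of rows changes.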


More information

LISP. Everything in a computer is a string of binary digits, ones and zeros, which everyone calls bits.

LISP. Everything in a computer is a string of binary digits, ones and zeros, which everyone calls bits. LISP Everything in a computer is a string of binary digits, ones and zeros, which everyone calls bits. From one perspective, sequences of bits can be interpreted as a code for ordinary decimal digits,

More information

Introduction to R. Nishant Gopalakrishnan, Martin Morgan January, Fred Hutchinson Cancer Research Center

Introduction to R. Nishant Gopalakrishnan, Martin Morgan January, Fred Hutchinson Cancer Research Center Introduction to R Nishant Gopalakrishnan, Martin Morgan Fred Hutchinson Cancer Research Center 19-21 January, 2011 Getting Started Atomic Data structures Creating vectors Subsetting vectors Factors Matrices

More information

Functions and data structures. Programming in R for Data Science Anders Stockmarr, Kasper Kristensen, Anders Nielsen

Functions and data structures. Programming in R for Data Science Anders Stockmarr, Kasper Kristensen, Anders Nielsen Functions and data structures Programming in R for Data Science Anders Stockmarr, Kasper Kristensen, Anders Nielsen Objects of the game In R we have objects which are functions and objects which are data.

More information

STENO Introductory R-Workshop: Loading a Data Set Tommi Suvitaival, Steno Diabetes Center June 11, 2015

STENO Introductory R-Workshop: Loading a Data Set Tommi Suvitaival, Steno Diabetes Center June 11, 2015 STENO Introductory R-Workshop: Loading a Data Set Tommi Suvitaival, tsvv@steno.dk, Steno Diabetes Center June 11, 2015 Contents 1 Introduction 1 2 Recap: Variables 2 3 Data Containers 2 3.1 Vectors................................................

More information

Input/Output Data Frames

Input/Output Data Frames Input/Output Data Frames Statistics 135 Autumn 2005 Copyright c 2005 by Mark E. Irwin Input/Output Importing text files Rectangular (n rows, c columns) Usually you want to use read.table read.table(file,

More information

PSS718 - Data Mining

PSS718 - Data Mining Lecture 5 - Hacettepe University October 23, 2016 Data Issues Improving the performance of a model To improve the performance of a model, we mostly improve the data Source additional data Clean up the

More information

Extremely short introduction to R Jean-Yves Sgro Feb 20, 2018

Extremely short introduction to R Jean-Yves Sgro Feb 20, 2018 Extremely short introduction to R Jean-Yves Sgro Feb 20, 2018 Contents 1 Suggested ahead activities 1 2 Introduction to R 2 2.1 Learning Objectives......................................... 2 3 Starting

More information

Python for Data Analysis. Prof.Sushila Aghav-Palwe Assistant Professor MIT

Python for Data Analysis. Prof.Sushila Aghav-Palwe Assistant Professor MIT Python for Data Analysis Prof.Sushila Aghav-Palwe Assistant Professor MIT Four steps to apply data analytics: 1. Define your Objective What are you trying to achieve? What could the result look like? 2.

More information

You will learn: The structure of the Stata interface How to open files in Stata How to modify variable and value labels How to manipulate variables

You will learn: The structure of the Stata interface How to open files in Stata How to modify variable and value labels How to manipulate variables Jennie Murack You will learn: The structure of the Stata interface How to open files in Stata How to modify variable and value labels How to manipulate variables How to conduct basic descriptive statistics

More information

Basic matrix math in R

Basic matrix math in R 1 Basic matrix math in R This chapter reviews the basic matrix math operations that you will need to understand the course material and how to do these operations in R. 1.1 Creating matrices in R Create

More information

int64 : 64 bits integer vectors

int64 : 64 bits integer vectors int64 : 64 bits integer vectors Romain François - romain@r-enthusiasts.com int64 version 1.1.2 Abstract The int64 package adds 64 bit integer vectors to R. The package provides the int64 and uint64 classes

More information

Week 4: Describing data and estimation

Week 4: Describing data and estimation Week 4: Describing data and estimation Goals Investigate sampling error; see that larger samples have less sampling error. Visualize confidence intervals. Calculate basic summary statistics using R. Calculate

More information

The basic arrangement of numeric data is called an ARRAY. Array is the derived data from fundamental data Example :- To store marks of 50 student

The basic arrangement of numeric data is called an ARRAY. Array is the derived data from fundamental data Example :- To store marks of 50 student Organizing data Learning Outcome 1. make an array 2. divide the array into class intervals 3. describe the characteristics of a table 4. construct a frequency distribution table 5. constructing a composite

More information

CS 33. Data Representation, Part 1. CS33 Intro to Computer Systems VII 1 Copyright 2017 Thomas W. Doeppner. All rights reserved.

CS 33. Data Representation, Part 1. CS33 Intro to Computer Systems VII 1 Copyright 2017 Thomas W. Doeppner. All rights reserved. CS 33 Data Representation, Part 1 CS33 Intro to Computer Systems VII 1 Copyright 2017 Thomas W. Doeppner. All rights reserved. Number Representation Hindu-Arabic numerals developed by Hindus starting in

More information

ARTIFICIAL INTELLIGENCE AND PYTHON

ARTIFICIAL INTELLIGENCE AND PYTHON ARTIFICIAL INTELLIGENCE AND PYTHON DAY 1 STANLEY LIANG, LASSONDE SCHOOL OF ENGINEERING, YORK UNIVERSITY WHAT IS PYTHON An interpreted high-level programming language for general-purpose programming. Python

More information

Data 8 Final Review #1

Data 8 Final Review #1 Data 8 Final Review #1 Topics we ll cover: Visualizations Arrays and Table Manipulations Programming constructs (functions, for loops, conditional statements) Chance, Simulation, Sampling and Distributions

More information

MBV4410/9410 Fall Bioinformatics for Molecular Biology. Introduction to R

MBV4410/9410 Fall Bioinformatics for Molecular Biology. Introduction to R MBV4410/9410 Fall 2018 Bioinformatics for Molecular Biology Introduction to R Outline Introduce R Basic operations RStudio Bioconductor? Goal of the lecture Introduce you to R Show how to run R, basic

More information

IS5 in R: Relationships Between Categorical Variables Contingency Tables (Chapter 3)

IS5 in R: Relationships Between Categorical Variables Contingency Tables (Chapter 3) IS5 in R: Relationships Between Categorical Variables Contingency Tables (Chapter 3) Margaret Chien and Nicholas Horton (nhorton@amherst.edu) July 17, 2018 Introduction and background This document is

More information

No Name What it does? 1 attach Attach your data frame to your working environment. 2 boxplot Creates a boxplot.

No Name What it does? 1 attach Attach your data frame to your working environment. 2 boxplot Creates a boxplot. No Name What it does? 1 attach Attach your data frame to your working environment. 2 boxplot Creates a boxplot. 3 confint A metafor package function that gives you the confidence intervals of effect sizes.

More information

A new recommended way of dealing with multiple missing values: Using missforest for all your imputation needs.

A new recommended way of dealing with multiple missing values: Using missforest for all your imputation needs. A new recommended way of dealing with multiple missing values: Using missforest for all your imputation needs. As published in Benchmarks RSS Matters, July 2014 http://web3.unt.edu/benchmarks/issues/2014/07/rss-matters

More information

Outline. 1 If Statement. 2 While Statement. 3 For Statement. 4 Nesting. 5 Applications. 6 Other Conditional and Loop Constructs 2 / 19

Outline. 1 If Statement. 2 While Statement. 3 For Statement. 4 Nesting. 5 Applications. 6 Other Conditional and Loop Constructs 2 / 19 Control Flow 1 / 19 Outline 1 If Statement 2 While Statement 3 For Statement 4 Nesting 5 Applications 6 Other Conditional and Loop Constructs 2 / 19 If Statement Most computations require different actions

More information

k-nn classification with R QMMA

k-nn classification with R QMMA k-nn classification with R QMMA Emanuele Taufer file:///c:/users/emanuele.taufer/google%20drive/2%20corsi/5%20qmma%20-%20mim/0%20labs/l1-knn-eng.html#(1) 1/16 HW (Height and weight) of adults Statistics

More information

Package csvread. August 29, 2016

Package csvread. August 29, 2016 Title Fast Specialized CSV File Loader Version 1.2 Author Sergei Izrailev Package csvread August 29, 2016 Maintainer Sergei Izrailev Description Functions for loading large

More information

A Brief Introduction to R

A Brief Introduction to R A Brief Introduction to R Babak Shahbaba Department of Statistics, University of California, Irvine, USA Chapter 1 Introduction to R 1.1 Installing R To install R, follow these steps: 1. Go to http://www.r-project.org/.

More information

Introduction to R. Adrienn Szabó. DMS Group, MTA SZTAKI. Aug 30, /62

Introduction to R. Adrienn Szabó. DMS Group, MTA SZTAKI. Aug 30, /62 Introduction to R Adrienn Szabó DMS Group, MTA SZTAKI Aug 30, 2014 1/62 1 What is R? What is R for? Who is R for? 2 Basics Data Structures Control Structures 3 ExtRa stuff R packages Unit testing in R

More information

Statistical Software Camp: Introduction to R

Statistical Software Camp: Introduction to R Statistical Software Camp: Introduction to R Day 1 August 24, 2009 1 Introduction 1.1 Why Use R? ˆ Widely-used (ever-increasingly so in political science) ˆ Free ˆ Power and flexibility ˆ Graphical capabilities

More information

Programming for Engineers Arrays

Programming for Engineers Arrays Programming for Engineers Arrays ICEN 200 Spring 2018 Prof. Dola Saha 1 Array Ø Arrays are data structures consisting of related data items of the same type. Ø A group of contiguous memory locations that

More information

Chapter 6: DESCRIPTIVE STATISTICS

Chapter 6: DESCRIPTIVE STATISTICS Chapter 6: DESCRIPTIVE STATISTICS Random Sampling Numerical Summaries Stem-n-Leaf plots Histograms, and Box plots Time Sequence Plots Normal Probability Plots Sections 6-1 to 6-5, and 6-7 Random Sampling

More information

Elements of a programming language 3

Elements of a programming language 3 Elements of a programming language 3 Marcin Kierczak 21 September 2016 Contents of the lecture variables and their types operators vectors numbers as vectors strings as vectors matrices lists data frames

More information

Introduction to R and R-Studio Getting Data Into R. 1. Enter Data Directly into R...

Introduction to R and R-Studio Getting Data Into R. 1. Enter Data Directly into R... Introduction to R and R-Studio 2017-18 02. Getting Data Into R 1. Enter Data Directly into R...... 2. Import Excel Data (.xlsx ) into R..... 3. Import Stata Data (.dta ) into R...... a) From a folder on

More information

Introduction to computer science C language Homework 4 Due Date: Save the confirmation code that will be received from the system

Introduction to computer science C language Homework 4 Due Date: Save the confirmation code that will be received from the system Introduction to computer science C language Homework 4 Due Date: 20.12.2017 Save the confirmation code that will be received from the system Submission Instructions : Electronic submission is individual.

More information

ECOLOGY OF AMPHIBIANS & REPTILES. A Statistical R Primer

ECOLOGY OF AMPHIBIANS & REPTILES. A Statistical R Primer ECOLOGY OF AMPHIBIANS & REPTILES A Statistical R Primer Jeffrey Row, Stephen Lougheed & Grégory Bulté Queen s University Biological Station May, 2017 Row, Lougheed & Bulté; Ecology of Amphibians and Reptiles

More information

Reading data into R. 1. Data in human readable form, which can be inspected with a text editor.

Reading data into R. 1. Data in human readable form, which can be inspected with a text editor. Reading data into R There is a famous, but apocryphal, story about Mrs Beeton, the 19th century cook and writer, which says that she began her recipe for rabbit stew with the instruction First catch your

More information

Introduction to R Commander

Introduction to R Commander Introduction to R Commander 1. Get R and Rcmdr to run 2. Familiarize yourself with Rcmdr 3. Look over Rcmdr metadata (Fox, 2005) 4. Start doing stats / plots with Rcmdr Tasks 1. Clear Workspace and History.

More information

MYcsvtu Notes LECTURE 34. POINTERS

MYcsvtu Notes LECTURE 34.  POINTERS LECTURE 34 POINTERS Pointer Variable Declarations and Initialization Pointer variables Contain memory addresses as their values Normal variables contain a specific value (direct reference) Pointers contain

More information

(c) What is the result of running the following program? x = 3 f = function (y){y+x} g = function (y){x =10; f(y)} g (7) Solution: The result is 10.

(c) What is the result of running the following program? x = 3 f = function (y){y+x} g = function (y){x =10; f(y)} g (7) Solution: The result is 10. Statistics 506 Exam 2 December 17, 2015 1. (a) Suppose that li is a list containing K arrays, each of which consists of distinct integers that lie between 1 and n. That is, for each k = 1,..., K, li[[k]]

More information

Ordered Pairs, Products, Sets versus Lists, Lambda Abstraction, Database Query

Ordered Pairs, Products, Sets versus Lists, Lambda Abstraction, Database Query Ordered Pairs, Products, Sets versus Lists, Lambda Abstraction, Database Query Jan van Eijck May 2, 2003 Abstract Ordered pairs, products, from sets to lists, from lists to sets. Next, we take a further

More information

Eric Pitman Summer Workshop in Computational Science

Eric Pitman Summer Workshop in Computational Science Eric Pitman Summer Workshop in Computational Science 2. Data Structures: Vectors and Data Frames Jeanette Sperhac Data Objects in R These objects, composed of multiple atomic data elements, are the bread

More information

Programming for Chemical and Life Science Informatics

Programming for Chemical and Life Science Informatics Programming for Chemical and Life Science Informatics I573 - Week 7 (Statistical Programming with R) Rajarshi Guha 24 th February, 2009 Resources Download binaries If you re working on Unix it s a good

More information

Security Control Methods for Statistical Database

Security Control Methods for Statistical Database Security Control Methods for Statistical Database Li Xiong CS573 Data Privacy and Security Statistical Database A statistical database is a database which provides statistics on subsets of records OLAP

More information

An introduction to R WS 2013/2014

An introduction to R WS 2013/2014 An introduction to R WS 2013/2014 Dr. Noémie Becker (AG Metzler) Dr. Sonja Grath (AG Parsch) Special thanks to: Dr. Martin Hutzenthaler (previously AG Metzler, now University of Frankfurt) course development,

More information

Relational Algebra Part I. CS 377: Database Systems

Relational Algebra Part I. CS 377: Database Systems Relational Algebra Part I CS 377: Database Systems Recap of Last Week ER Model: Design good conceptual models to store information Relational Model: Table representation with structures and constraints

More information

COP 3223 Introduction to Programming with C - Study Union - Fall 2017

COP 3223 Introduction to Programming with C - Study Union - Fall 2017 COP 3223 Introduction to Programming with C - Study Union - Fall 2017 Chris Marsh and Matthew Villegas Contents 1 Code Tracing 2 2 Pass by Value Functions 4 3 Statically Allocated Arrays 5 3.1 One Dimensional.................................

More information

Pointers. 1 Background. 1.1 Variables and Memory. 1.2 Motivating Pointers Massachusetts Institute of Technology

Pointers. 1 Background. 1.1 Variables and Memory. 1.2 Motivating Pointers Massachusetts Institute of Technology Introduction to C++ Massachusetts Institute of Technology ocw.mit.edu 6.096 Pointers 1 Background 1.1 Variables and Memory When you declare a variable, the computer associates the variable name with a

More information

A Short Introduction to the caret Package

A Short Introduction to the caret Package A Short Introduction to the caret Package Max Kuhn max.kuhn@pfizer.com February 12, 2013 The caret package (short for classification and regression training) contains functions to streamline the model

More information

BEGINNING PROBLEM-SOLVING CONCEPTS FOR THE COMPUTER. Chapter 2

BEGINNING PROBLEM-SOLVING CONCEPTS FOR THE COMPUTER. Chapter 2 1 BEGINNING PROBLEM-SOLVING CONCEPTS FOR THE COMPUTER Chapter 2 2 3 Types of Problems that can be solved on computers : Computational problems involving some kind of mathematical processing Logical Problems

More information