In this tutorial we will see some of the basic operations on data frames in R. We begin by first importing the data into an R object called train.

Data frames in R

In this tutorial we will see some of the basic operations on data frames in R:

- Understand the structure
- Indexing
- Column names
- Add a column/row
- Delete a column/row
- Subset
- Summarize

We will again use the Titanic data set available at Kaggle.

Understand the structure

We begin by first importing the data into an R object called train.

    train <- read.csv("train.csv", na.strings = "")

Once the csv file is in our workspace, it is stored as an object of class data.frame. Everything in R is an object and every object belongs to a particular class. We can check the class of any R object using the class() function.

    class(train)
    ## [1] "data.frame"

A data frame is a two-dimensional structure, the dimensions being the rows and columns. A column contains information for a particular variable and hence can contain data of one type only, e.g., numeric, character, factor, or date. The raw entries of a column may mix numbers and strings, but the column is stored as a single type: if the first row has the entry '1234' and the second row has the entry 'a word', then the column will be classified as character (or factor) but not numeric.

To find out how the columns in our Titanic data are classified, we can use the str() function, which displays the internal structure of an R object. Note that since R 4.0.0, read.csv() keeps strings as character vectors by default; the factor columns shown below correspond to the older default, which can be requested with stringsAsFactors = TRUE.

    str(train)
    ## 'data.frame':    891 obs. of  11 variables:
    ##  $ survived: int  0 1 1 1 0 0 0 0 1 1 ...
    ##  $ pclass  : int  3 1 3 1 3 3 1 3 3 2 ...
    ##  $ name    : Factor w/ 891 levels "Abbing, Mr. Anthony",..: 109 191 358 277 16 559 520 629 417 581 ...
    ##  $ sex     : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
    ##  $ age     : num  22 38 26 35 35 NA 54 2 27 14 ...
    ##  $ sibsp   : int  1 1 0 1 0 0 0 3 0 1 ...
    ##  $ parch   : int  0 0 0 0 0 0 0 1 2 0 ...
    ##  $ ticket  : Factor w/ 681 levels "110152","110413",..: 524 597 670 50 473 276 86 396 345 133 ...
    ##  $ fare    : num  7.25 71.28 7.92 53.1 8.05 ...
    ##  $ cabin   : Factor w/ 147 levels "A10","A14","A16",..: NA 82 NA 56 NA NA 130 NA NA NA ...
    ##  $ embarked: Factor w/ 3 levels "C","Q","S": 3 1 3 3 3 2 3 3 3 1 ...

The output from the str() function tells us that our data frame has 891 observations (rows) and 11 variables (columns). The details of each column are provided along with the column name. Note that each column name is preceded by a '$' sign. This sign has a special meaning in R, which we will come to shortly.

To understand the output, consider the first column mentioned in the result box, 'survived'. This column has class integer and the first few values are shown. Now consider the third column, 'name'. This column is of class factor and has 891 levels, i.e., 891 unique values. The first of these levels is 'Abbing, Mr. Anthony'. This is not the first observation in the data for this column. It is the first level (category) of the factor (categorical) variable 'name'. Unless manually specified, the levels are chosen by R automatically in alphabetical order. The first observation for the variable in the data is for level 109, followed by level 191, and then 358. Again, note that R is not showing the actual value that the field holds, but rather the category number corresponding to that value.

We have seen the structure of our data set. Now let's look at the actual data itself. To get a quick snapshot of the data frame, we can use the head() function, which displays the first few observations of all the variables in the data.

    head(train)
    ##   survived pclass                                                name
    ## 1        0      3                             Braund, Mr. Owen Harris
    ## 2        1      1 Cumings, Mrs. John Bradley (Florence Briggs Thayer)
    ## 3        1      3                              Heikkinen, Miss. Laina
    ## 4        1      1        Futrelle, Mrs. Jacques Heath (Lily May Peel)
    ## 5        0      3                            Allen, Mr. William Henry
    ## 6        0      3                                    Moran, Mr. James
    ##      sex age sibsp parch           ticket   fare cabin embarked
    ## 1   male  22     1     0        A/5 21171  7.250  <NA>        S
    ## 2 female  38     1     0         PC 17599 71.283   C85        C
    ## 3 female  26     0     0 STON/O2. 3101282  7.925  <NA>        S
    ## 4 female  35     1     0           113803 53.100  C123        S
    ## 5   male  35     0     0           373450  8.050  <NA>        S
    ## 6   male  NA     0     0           330877  8.458  <NA>        Q

There is also an analogous function called tail() that displays the last few observations.
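The relationship between factor levels and the integer codes that str() prints can be checked on a small example. The sketch below uses a hypothetical data frame called toy (it is not part of the Titanic data) and assumes the strings are stored as factors, as in the str() output above.

```r
# Hypothetical toy data, not taken from the Titanic file
toy <- data.frame(
  name = c("Moran", "Allen", "Braund", "Allen"),
  age  = c(NA, 35, 22, 35),
  stringsAsFactors = TRUE  # store strings as factors, as in the str() output above
)

# Levels are the sorted unique values of the column
name_levels <- levels(toy$name)

# Each observation is stored as the position of its value among the levels
name_codes <- as.integer(toy$name)

# Converting back to character recovers the original strings
name_values <- as.character(toy$name)

print(name_levels)
print(name_codes)
print(name_values)
```

Here "Moran" is the first observation but the third level, so its code is 3, in the same way that the first Titanic passenger has code 109 in the 'name' column.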

Indexing

If there are too many columns in the data frame, then using the head() function straight away might not be a very good idea. In that case, we can select the columns (and rows) that we want to see using the [m, n] notation, where m corresponds to rows and n corresponds to columns. Indexing in R starts from 1, as opposed to Python, where it starts from 0.

To view the observations of the first column use

    head(train[, 1])
    ## [1] 0 1 1 1 0 0

To view the observations of the first three columns use

    head(train[, c(1, 2, 3)])
    ##   survived pclass                                                name
    ## 1        0      3                             Braund, Mr. Owen Harris
    ## 2        1      1 Cumings, Mrs. John Bradley (Florence Briggs Thayer)
    ## 3        1      3                              Heikkinen, Miss. Laina
    ## 4        1      1        Futrelle, Mrs. Jacques Heath (Lily May Peel)
    ## 5        0      3                            Allen, Mr. William Henry
    ## 6        0      3                                    Moran, Mr. James

To view the observations of columns 3 and 7 use

    head(train[, c(3, 7)])
    ##                                                  name parch
    ## 1                             Braund, Mr. Owen Harris     0
    ## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer)     0
    ## 3                              Heikkinen, Miss. Laina     0
    ## 4        Futrelle, Mrs. Jacques Heath (Lily May Peel)     0
    ## 5                            Allen, Mr. William Henry     0
    ## 6                                    Moran, Mr. James     0

We can also have the corresponding view for selected rows. To view the first row for all columns use

    train[1, ]
    ##   survived pclass                    name  sex age sibsp parch    ticket
    ## 1        0      3 Braund, Mr. Owen Harris male  22     1     0 A/5 21171
    ##   fare cabin embarked
    ## 1 7.25  <NA>        S

To view the first three rows for all columns use

    train[c(1, 2, 3), ]
    ##   survived pclass                                                name
    ## 1        0      3                             Braund, Mr. Owen Harris
    ## 2        1      1 Cumings, Mrs. John Bradley (Florence Briggs Thayer)
    ## 3        1      3                              Heikkinen, Miss. Laina
    ##      sex age sibsp parch           ticket   fare cabin embarked
    ## 1   male  22     1     0        A/5 21171  7.250  <NA>        S
    ## 2 female  38     1     0         PC 17599 71.283   C85        C
    ## 3 female  26     0     0 STON/O2. 3101282  7.925  <NA>        S

To view rows 3 and 7 for all columns use

    train[c(3, 7), ]
    ##   survived pclass                    name    sex age sibsp parch
    ## 3        1      3  Heikkinen, Miss. Laina female  26     0     0
    ## 7        0      1 McCarthy, Mr. Timothy J   male  54     0     0
    ##             ticket   fare cabin embarked
    ## 3 STON/O2. 3101282  7.925  <NA>        S
    ## 7            17463 51.862   E46        S

We do not need to use the head() function here since we are explicitly telling R to show us a few observations by specifying the ones we would like to see.

We can combine the two sets of examples and view any desired combination of rows and columns. For example, to view the first row for columns 4, 5, and 6 use

    train[1, c(4, 5, 6)]
    ##    sex age sibsp
    ## 1 male  22     1

To view the first ten rows for columns 2 to 6 use

    train[c(1:10), c(2:6)]
    ##    pclass                                                name    sex age
    ## 1       3                             Braund, Mr. Owen Harris   male  22
    ## 2       1 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female  38
    ## 3       3                              Heikkinen, Miss. Laina female  26
    ## 4       1        Futrelle, Mrs. Jacques Heath (Lily May Peel) female  35
    ## 5       3                            Allen, Mr. William Henry   male  35
    ## 6       3                                    Moran, Mr. James   male  NA
    ## 7       1                             McCarthy, Mr. Timothy J   male  54
    ## 8       3                      Palsson, Master. Gosta Leonard   male   2
    ## 9       3   Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) female  27
    ## 10      2                 Nasser, Mrs. Nicholas (Adele Achem) female  14
    ##    sibsp
    ## 1      1
    ## 2      1
    ## 3      0
    ## 4      1
    ## 5      0
    ## 6      0
    ## 7      0
    ## 8      3
    ## 9      0
    ## 10     1

The a:b notation produces a vector of integers ranging from a to b. If a < b, then a vector with increasing values is created, and if a > b, then a vector with decreasing values is created.

To view the rows 50 to 60 and 110 to 115 for columns 2, 3, and 6 use

    train[c(50:60, 110:115), c(2, 3, 6)]

Column names

A data.frame object has two attributes attached to it by default: column names and row names. Given these, any column or row can be identified and manipulated using its name. The column and row names of a data frame can be identified using

    colnames(train)
    ##  [1] "survived" "pclass"   "name"     "sex"      "age"      "sibsp"
    ##  [7] "parch"    "ticket"   "fare"     "cabin"    "embarked"

    head(rownames(train))
    ## [1] "1" "2" "3" "4" "5" "6"

Note: we used the head() function on rownames() only to restrict the size of the output.

We mentioned above that everything in R is an object. This means that every function call also returns an object. Calling the function colnames() returns an object of class character. How do we know this? Simple: just pass the output of the function call through the class() function.

    variables <- colnames(train)
    class(variables)
    ## [1] "character"

Since the output is an R object, it can be manipulated as required. For example, to change the names of the columns use

    colnames(train) <- c("col1", "col2", "col3", "col4", "col5", "col6", "col7", "col8", "col9", "col10", "col11")
    colnames(train)
    ##  [1] "col1"  "col2"  "col3"  "col4"  "col5"  "col6"  "col7"  "col8"
    ##  [9] "col9"  "col10" "col11"

The number of names on the right-hand side should equal the number of variables in the data frame.

To change the name of a particular column, say column no. 4, use

    colnames(train)[4] <- "newname4"

[ ] is the same indexing operator we used above. To change the names of a few columns, say column nos. 5, 8, and 11, use

    colnames(train)[c(5, 8, 11)] <- c("newname5", "newname8", "newname11")
    colnames(train)
    ##  [1] "col1"      "col2"      "col3"      "newname4"  "newname5"
    ##  [6] "col6"      "col7"      "newname8"  "col9"      "col10"
    ## [11] "newname11"

[ ] can also take negative values. By using a negative integer, we select all the values from the object except the one(s) stored at the given location(s). For example, to rename all the columns except columns 4, 5, 8, and 11 use

    colnames(train)[-c(4, 5, 8, 11)] <- c("newname1", "newname2", "newname3", "newname6", "newname7", "newname9", "newname10")
    colnames(train)
    ##  [1] "newname1"  "newname2"  "newname3"  "newname4"  "newname5"
    ##  [6] "newname6"  "newname7"  "newname8"  "newname9"  "newname10"
    ## [11] "newname11"

In the past couple of examples, we manipulated and replaced the original names in our data set. We can get these back by using the 'variables' vector we created above.

    colnames(train) <- variables
    colnames(train)
    ##  [1] "survived" "pclass"   "name"     "sex"      "age"      "sibsp"
    ##  [7] "parch"    "ticket"   "fare"     "cabin"    "embarked"

We mentioned the '$' sign above. This sign is a very convenient utility and can be used to retrieve named elements from an R object. For example, to view the columns 'survived' and 'sex' in our data, do

    head(train$survived)
    ## [1] 0 1 1 1 0 0

    head(train$sex)
    ## [1] male   female female female male   male
    ## Levels: female male

Add a column/row

The '$' sign can also be used to create an element within an object. For example, to create a column that contains the squared values of the 'fare' column use

    train$fare.sq <- train$fare * train$fare
    head(train$fare.sq)
    ## [1]   52.56 5081.31   62.81 2819.61   64.80   71.54

We can confirm that the squared values have been correctly calculated by using the [ ] operator in a different way. Instead of giving the index value, we can also provide the column names directly.

    head(train[, c("fare", "fare.sq")])
    ##     fare fare.sq
    ## 1  7.250   52.56
    ## 2 71.283 5081.31
    ## 3  7.925   62.81
    ## 4 53.100 2819.61
    ## 5  8.050   64.80
    ## 6  8.458   71.54

We can add a row to the data set as well. Let's add one below the last row. As a simple example, we will just take the first row and make a copy of it at the end. For this, we will use the indexing operator [ ] and the nrow() function, which gives the total number of rows present in a data frame.

    nrow(train)
    ## [1] 891

    train[nrow(train), ]
    ##     survived pclass                name  sex age sibsp parch ticket fare
    ## 891        0      3 Dooley, Mr. Patrick male  32     0     0 370376 7.75
    ##     cabin embarked fare.sq
    ## 891  <NA>        Q   60.06

    train[nrow(train) + 1, ] <- train[1, ]
    nrow(train)
    ## [1] 892

    train[nrow(train), ]
    ##     survived pclass                    name  sex age sibsp parch    ticket
    ## 892        0      3 Braund, Mr. Owen Harris male  22     1     0 A/5 21171
    ##     fare cabin embarked fare.sq
    ## 892 7.25  <NA>        S   52.56

Delete a column/row

Deleting a column/row is as easy as creating one. Simply use a negative index for the column/row that needs to be deleted. For example, to delete the 'fare.sq' column calculated above, use

    train <- train[, -12]
    train$fare.sq
    ## NULL

To delete the last row created above, use

    train <- train[-892, ]
    train[892, ]
    ##    survived pclass name  sex age sibsp parch ticket fare cabin embarked
    ## NA       NA     NA <NA> <NA>  NA    NA    NA   <NA>   NA  <NA>     <NA>

Subset

A data frame can be subset using different conditions. For example, we can subset the train data to include observations only for females using the subset() function.

    train.female <- subset(train, sex == "female")

To check whether the subset worked properly, we can look at the frequency table of the 'sex' variable in both data sets.

    table(train$sex)
    ##
    ## female   male
    ##    314    577

    table(train.female$sex)
    ##
    ## female   male
    ##    314      0

Consider another example where we subset the data by taking only those observations for which 'fare' is between 100 and 500.

    train.sub1 <- subset(train, fare >= 100 & fare <= 500)
    dim(train.sub1)
    ## [1] 50 11

We can also subset using two different variables. Let's take the cases where passenger class is 3 and sex is male.

    train.sub2 <- subset(train, pclass == 3 & sex == "male")
    dim(train.sub2)
    ## [1] 347  11

The above example used an 'and' condition while subsetting the data. The example below uses the same two variables with an 'or' condition between them. The 'or' condition in R is specified using '|'.

    train.sub3 <- subset(train, pclass == 3 | sex == "male")
    dim(train.sub3)
    ## [1] 721  11

The exact same process can be executed using the indexing [ ] operator. For example, to replicate the previous example with [ ] use

    train.sub4 <- train[train$pclass == 3 | train$sex == "male", ]
    dim(train.sub4)
    ## [1] 721  11

Summarize

Summarizing a data set can be done with a single function called summary().

    summary(train)
    ##     survived         pclass
    ##  Min.   :0.000   Min.   :1.00
    ##  1st Qu.:0.000   1st Qu.:2.00
    ##  Median :0.000   Median :3.00
    ##  Mean   :0.384   Mean   :2.31
    ##  3rd Qu.:1.000   3rd Qu.:3.00
    ##  Max.   :1.000   Max.   :3.00
    ##
    ##                                     name         sex           age
    ##  Abbing, Mr. Anthony                  :  1   female:314   Min.   : 0.42
    ##  Abbott, Mr. Rossmore Edward          :  1   male  :577   1st Qu.:20.12
    ##  Abbott, Mrs. Stanton (Rosa Hunt)     :  1                Median :28.00
    ##  Abelson, Mr. Samuel                  :  1                Mean   :29.70
    ##  Abelson, Mrs. Samuel (Hannah Wizosky):  1                3rd Qu.:38.00
    ##  Adahl, Mr. Mauritz Nils Martin       :  1                Max.   :80.00
    ##  (Other)                              :885                NA's   :177
    ##      sibsp           parch            ticket         fare
    ##  Min.   :0.000   Min.   :0.000   1601    :  7   Min.   :  0.0
    ##  1st Qu.:0.000   1st Qu.:0.000   347082  :  7   1st Qu.:  7.9
    ##  Median :0.000   Median :0.000   CA. 2343:  7   Median : 14.5
    ##  Mean   :0.523   Mean   :0.382   3101295 :  6   Mean   : 32.2
    ##  3rd Qu.:1.000   3rd Qu.:0.000   347088  :  6   3rd Qu.: 31.0
    ##  Max.   :8.000   Max.   :6.000   CA 2144 :  6   Max.   :512.3
    ##                                  (Other) :852
    ##          cabin     embarked
    ##  B96 B98    :  4   C   :168
    ##  C23 C25 C27:  4   Q   : 77
    ##  G6         :  4   S   :644
    ##  C22 C26    :  3   NA's:  2
    ##  D          :  3
    ##  (Other)    :186
    ##  NA's       :687
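The NA's entries in the summary above are counts of missing values per column. The same counts can be obtained directly, which may be handy when only the missingness is of interest. The sketch below is a minimal illustration on a hypothetical data frame called toy rather than on the Titanic data.

```r
# Hypothetical toy data with some missing values
toy <- data.frame(
  age   = c(22, NA, 26, NA),
  cabin = c(NA, "C85", "E46", NA),
  fare  = c(7.25, 71.28, 7.92, 53.10)
)

# is.na() marks each cell as TRUE/FALSE; colSums() counts the TRUEs per column
na_counts <- colSums(is.na(toy))

# Number of rows with no missing value in any column
n_complete <- sum(complete.cases(toy))

print(na_counts)
print(n_complete)
```

Applied to train, colSums(is.na(train)) should report the same missing-value counts that summary() lists under NA's.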

The summary of a data frame gives a clear snapshot of the values each variable holds, including the missing ones. The '$' sign and the indexing operator can be used to summarize a single variable or a group of variables, as shown below.

    summary(train$pclass)
    ##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    ##    1.00    2.00    3.00    2.31    3.00    3.00

    summary(train[, c("pclass", "sex", "cabin")])
    ##      pclass         sex              cabin
    ##  Min.   :1.00   female:314   B96 B98    :  4
    ##  1st Qu.:2.00   male  :577   C23 C25 C27:  4
    ##  Median :3.00                G6         :  4
    ##  Mean   :2.31                C22 C26    :  3
    ##  3rd Qu.:3.00                D          :  3
    ##  Max.   :3.00                (Other)    :186
    ##                              NA's       :687

The above examples are just a representative sample of the functions available in R to process data frames. They are intended to serve as a starting point and a quick reference guide for those who have just started playing with R. In the next tutorial, we will learn about data manipulation in R.
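For readers who want to try these operations without downloading the Titanic file, the following sketch repeats them on a small made-up data frame. The object toy and its values are hypothetical and chosen only for illustration; the function calls are the same ones used in the tutorial.

```r
# Hypothetical toy data standing in for train.csv
toy <- data.frame(
  pclass = c(3, 1, 3, 1, 3),
  sex    = c("male", "female", "female", "female", "male"),
  fare   = c(7.25, 71.28, 7.92, 53.10, 8.05)
)

# Indexing: rows 1 to 3 of columns 1 and 3
first_rows <- toy[1:3, c(1, 3)]

# Column names: rename column 3, then restore the saved names
variables <- colnames(toy)
colnames(toy)[3] <- "ticket.price"
colnames(toy) <- variables

# Add a column with $, then delete it again with a negative index
toy$fare.sq <- toy$fare * toy$fare
toy <- toy[, -4]

# Subset with an 'or' condition, once with subset() and once with [ ]
sub_a <- subset(toy, pclass == 3 | sex == "male")
sub_b <- toy[toy$pclass == 3 | toy$sex == "male", ]

print(first_rows)
print(dim(sub_a))
print(dim(sub_b))
print(summary(toy$fare))
```

One point to keep in mind when moving from this sketch to real data: if the subsetting condition evaluates to NA for some rows, subset() drops those rows, whereas [ ] returns them as rows filled with NA, so the two forms only agree when the columns in the condition have no missing values.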