Data frames in R

In this tutorial we will look at some of the basic operations on data frames in R:

- Understand the structure
- Indexing
- Column names
- Add a column/row
- Delete a column/row
- Subset
- Summarize

We will again use the Titanic data set available at Kaggle.

Understand the structure

We begin by importing the data into an R object called train.

train <- read.csv("train.csv", na.strings = "")

Once the csv file has been read into our workspace, it is stored as an object of class data.frame. Everything in R is an object, and every object belongs to a particular class. We can check the class of any R object using the class() function.

class(train)
## [1] "data.frame"

A data frame is a two-dimensional table; the dimensions being the rows and columns. A column contains information for a particular variable and hence can contain data of one type only, e.g., numeric, character, factor, or date. The raw data in a column may mix numbers and strings, but the whole column is stored under a single type: if the first row has the entry '1234' and the second row has the entry 'a word', then the column will be classified as character (or factor), not numeric. To find out how the columns in our Titanic data are classified, we can use the str() function, which displays the internal structure of an R object.

str(train)
## 'data.frame':	891 obs. of  11 variables:
##  $ survived: int  0 1 1 1 0 0 0 0 1 1...
##  $ pclass  : int  3 1 3 1 3 3 1 3 3 2...
##  $ name    : Factor w/ 891 levels "Abbing, Mr. Anthony",..: 109 191 358 277 16 559 520 629 417 581...
##  $ sex     : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1...
##  $ age     : num  22 38 26 35 35 NA 54 2 27 14...
##  $ sibsp   : int  1 1 0 1 0 0 0 3 0 1...
##  $ parch   : int  0 0 0 0 0 0 0 1 2 0...
##  $ ticket  : Factor w/ 681 levels "110152","110413",..: 524 597 670 50 473 276 86 396 345 133...
##  $ fare    : num  7.25 71.28 7.92 53.1 8.05...
##  $ cabin   : Factor w/ 147 levels "A10","A14","A16",..: NA 82 NA 56 NA NA 130 NA NA NA...
##  $ embarked: Factor w/ 3 levels "C","Q","S": 3 1 3 3 3 2 3 3 3 1...

The output from the str() function tells us that our data frame has 891 observations (rows) and 11 variables (columns). The details of each column are provided along with the column name. Note that each column name is preceded by a '$' sign. This sign has a special meaning in R, which we will come to shortly.

To understand the output, consider the first column in the result box, 'survived'. This column has class integer, and its first few values are shown. Now consider the third column, 'name'. This column is of class factor and has 891 levels, i.e., 891 unique values. The first of these levels is 'Abbing, Mr. Anthony'. This is not the first observation in the data for this column. It is the first level (category) of the factor (categorical) variable 'name'. Unless manually specified, the levels are ordered by R automatically in alphabetical order. The first observation for the variable in the data is level 109, followed by level 191, and then 358. Again, note that R is not showing the actual value that the field holds, but rather the category number corresponding to that value.

We have seen the structure of our data set. Now let's look at the data itself. To get a quick snapshot of the data frame, we can use the head() function, which displays the first few observations of all the variables in the data.

head(train)
##   survived pclass                                                name
## 1        0      3                             Braund, Mr. Owen Harris
## 2        1      1 Cumings, Mrs. John Bradley (Florence Briggs Thayer)
## 3        1      3                              Heikkinen, Miss. Laina
## 4        1      1        Futrelle, Mrs. Jacques Heath (Lily May Peel)
## 5        0      3                            Allen, Mr. William Henry
## 6        0      3                                    Moran, Mr. James
##      sex age sibsp parch           ticket   fare cabin embarked
## 1   male  22     1     0        A/5 21171  7.250  <NA>        S
## 2 female  38     1     0         PC 17599 71.283   C85        C
## 3 female  26     0     0 STON/O2. 3101282  7.925  <NA>        S
## 4 female  35     1     0           113803 53.100  C123        S
## 5   male  35     0     0           373450  8.050  <NA>        S
## 6   male  NA     0     0           330877  8.458  <NA>        Q

There is also an analogous function called tail() that displays the last few observations.
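The difference between a factor's levels and the integer codes printed by str() can be checked on a small example. The sketch below uses a made-up three-row data frame (the toy object and its values are hypothetical, not taken from train.csv). It sets stringsAsFactors = TRUE explicitly because, from R 4.0.0 onward, data.frame() and read.csv() keep strings as character by default; on such versions str(train) would report chr rather than Factor for columns like name unless that argument is supplied.

```r
# Hypothetical toy data, used only to illustrate factor levels vs. codes
toy <- data.frame(
  name = c("Moran", "Allen", "Braund"),
  fare = c(8.46, 8.05, 7.25),
  stringsAsFactors = TRUE  # store character columns as factors
)

class(toy$name)         # "factor"
levels(toy$name)        # levels in alphabetical order: "Allen" "Braund" "Moran"
as.integer(toy$name)    # codes pointing into the levels: 3 1 2
as.character(toy$name)  # the values as entered: "Moran" "Allen" "Braund"

str(toy)                # shows the codes, not the underlying strings
```

Here the first observation is "Moran", yet its code is 3 because "Moran" is the third level alphabetically, which mirrors how the first value of 'name' in the Titanic data appears as 109.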
Indexing

If the data frame has many columns, calling head() on the whole object might not be a very good idea. In that case, we can select the columns (and rows) that we want to see using the [m, n] notation, where m corresponds to rows and n corresponds to columns. Indexing in R starts from 1, as opposed to Python, where it starts from 0. To view the observations of the first column use

head(train[, 1])
## [1] 0 1 1 1 0 0

To view the observations of the first three columns use

head(train[, c(1, 2, 3)])
##   survived pclass                                                name
## 1        0      3                             Braund, Mr. Owen Harris
## 2        1      1 Cumings, Mrs. John Bradley (Florence Briggs Thayer)
## 3        1      3                              Heikkinen, Miss. Laina
## 4        1      1        Futrelle, Mrs. Jacques Heath (Lily May Peel)
## 5        0      3                            Allen, Mr. William Henry
## 6        0      3                                    Moran, Mr. James

To view the observations of columns 3 and 7 use

head(train[, c(3, 7)])
##                                                  name parch
## 1                             Braund, Mr. Owen Harris     0
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer)     0
## 3                              Heikkinen, Miss. Laina     0
## 4        Futrelle, Mrs. Jacques Heath (Lily May Peel)     0
## 5                            Allen, Mr. William Henry     0
## 6                                    Moran, Mr. James     0

We can also view selected rows in the same way. To view the first row for all columns use

train[1, ]
##   survived pclass                    name  sex age sibsp parch    ticket
## 1        0      3 Braund, Mr. Owen Harris male  22     1     0 A/5 21171
##   fare cabin embarked
## 1 7.25  <NA>        S

To view the first three rows for all columns use
train[c(1, 2, 3), ]
##   survived pclass                                                name
## 1        0      3                             Braund, Mr. Owen Harris
## 2        1      1 Cumings, Mrs. John Bradley (Florence Briggs Thayer)
## 3        1      3                              Heikkinen, Miss. Laina
##      sex age sibsp parch           ticket   fare cabin embarked
## 1   male  22     1     0        A/5 21171  7.250  <NA>        S
## 2 female  38     1     0         PC 17599 71.283   C85        C
## 3 female  26     0     0 STON/O2. 3101282  7.925  <NA>        S

To view rows 3 and 7 for all columns use

train[c(3, 7), ]
##   survived pclass                    name    sex age sibsp parch
## 3        1      3  Heikkinen, Miss. Laina female  26     0     0
## 7        0      1 McCarthy, Mr. Timothy J   male  54     0     0
##             ticket   fare cabin embarked
## 3 STON/O2. 3101282  7.925  <NA>        S
## 7            17463 51.862   E46        S

We do not need to use the head() function here since we are explicitly telling R to show us a few observations by specifying the ones we would like to see.

We can combine the two sets of examples and view any desired combination of rows and columns. For example, to view the first row for columns 4, 5, and 6 use

train[1, c(4, 5, 6)]
##    sex age sibsp
## 1 male  22     1

To view the first ten rows for columns 2 to 6 use

train[c(1:10), c(2:6)]
##    pclass                                                name    sex age
## 1       3                             Braund, Mr. Owen Harris   male  22
## 2       1 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female  38
## 3       3                              Heikkinen, Miss. Laina female  26
## 4       1        Futrelle, Mrs. Jacques Heath (Lily May Peel) female  35
## 5       3                            Allen, Mr. William Henry   male  35
## 6       3                                    Moran, Mr. James   male  NA
## 7       1                             McCarthy, Mr. Timothy J   male  54
## 8       3                      Palsson, Master. Gosta Leonard   male   2
## 9       3   Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) female  27
## 10      2                 Nasser, Mrs. Nicholas (Adele Achem) female  14
##    sibsp
## 1      1
## 2      1
## 3      0
## 4      1
## 5      0
## 6      0
## 7      0
## 8      3
## 9      0
## 10     1

The a:b notation produces a vector of integers ranging from a to b. If a < b, a vector with increasing values is created, and if a > b, a vector with decreasing values is created. Several ranges can be combined inside c(). To view rows 50 to 60 and 110 to 115 for columns 2, 3, and 6 use

train[c(50:60, 110:115), c(2, 3, 6)]

Column names

A data.frame object has two attributes attached to it by default: column names and row names. Given these, any column or row can be identified and manipulated using its name. The column and row names of a data frame can be retrieved using

colnames(train)
##  [1] "survived" "pclass"   "name"     "sex"      "age"      "sibsp"
##  [7] "parch"    "ticket"   "fare"     "cabin"    "embarked"
head(rownames(train))
## [1] "1" "2" "3" "4" "5" "6"

Note: we used the head() function on rownames() only to restrict the size of the output.

We mentioned above that everything in R is an object. This means that every function call also returns an object. Calling the function colnames() returns an object of class character. How do we know this? Simple: just pass the result of the function call to the class() function.

variables <- colnames(train)
class(variables)
## [1] "character"

Since the output is an R object, it can be manipulated as required. For example, to change the names of the columns use

colnames(train) <- c("col1", "col2", "col3", "col4", "col5", "col6", "col7", "col8", "col9", "col10", "col11")
colnames(train)
##  [1] "col1"  "col2"  "col3"  "col4"  "col5"  "col6"  "col7"  "col8"
##  [9] "col9"  "col10" "col11"

The number of names on the right-hand side should equal the number of variables in the data frame. To change the name of a particular column, say column no. 4, use

colnames(train)[4] <- "newname4"

[ ] is the same indexing operator we used above. To change the names of a few columns, say column nos. 5, 8 and 11, use

colnames(train)[c(5, 8, 11)] <- c("newname5", "newname8", "newname11")
colnames(train)
##  [1] "col1"      "col2"      "col3"      "newname4"  "newname5"
##  [6] "col6"      "col7"      "newname8"  "col9"      "col10"
## [11] "newname11"

[ ] can also take negative values. By using a negative integer, we select all the values from the object except the one(s) stored at the given location(s). For example, to rename all the columns except columns 4, 5, 8, and 11 use
colnames(train)[-c(4, 5, 8, 11)] <- c("newname1", "newname2", "newname3", "newname6", "newname7", "newname9", "newname10")
colnames(train)
##  [1] "newname1"  "newname2"  "newname3"  "newname4"  "newname5"
##  [6] "newname6"  "newname7"  "newname8"  "newname9"  "newname10"
## [11] "newname11"

In the past couple of examples, we manipulated and replaced the original names in our data set. We can get these back by using the 'variables' vector we created above.

colnames(train) <- variables
colnames(train)
##  [1] "survived" "pclass"   "name"     "sex"      "age"      "sibsp"
##  [7] "parch"    "ticket"   "fare"     "cabin"    "embarked"

We had mentioned the '$' sign above. This sign is a very convenient utility and can be used to retrieve named elements from an R object. For example, to view the column 'survived' in our data, do

head(train$survived)
## [1] 0 1 1 1 0 0

head(train$sex)
## [1] male   female female female male   male
## Levels: female male

Add a column/row

The '$' can also be used to create an element within an object. For example, to create a column that contains the squared values of the 'fare' column use

train$fare.sq <- train$fare * train$fare
head(train$fare.sq)
## [1]   52.56 5081.31   62.81 2819.61   64.80   71.54

We can confirm that the squared values have been correctly calculated by using the [ ] operation in a different way. Instead of giving the index value, we can also provide the column names directly.

head(train[, c("fare", "fare.sq")])
##     fare fare.sq
## 1  7.250   52.56
## 2 71.283 5081.31
## 3  7.925   62.81
## 4 53.100 2819.61
## 5  8.050   64.80
## 6  8.458   71.54

We can add a row to the data set as well. Let's add one below the last row. As a simple example, we will just take the first row and make a copy of it at the end. For this, we will use the indexing operator [ ] and the nrow() function, which gives the total number of rows present in a data frame.

nrow(train)
## [1] 891

train[nrow(train), ]
##     survived pclass                name  sex age sibsp parch ticket fare
## 891        0      3 Dooley, Mr. Patrick male  32     0     0 370376 7.75
##     cabin embarked fare.sq
## 891  <NA>        Q   60.06

train[nrow(train) + 1, ] <- train[1, ]
nrow(train)
## [1] 892

train[nrow(train), ]
##     survived pclass                    name  sex age sibsp parch    ticket
## 892        0      3 Braund, Mr. Owen Harris male  22     1     0 A/5 21171
##     fare cabin embarked fare.sq
## 892 7.25  <NA>        S   52.56

Delete a column/row

Deleting a column/row is as easy as creating one. Simply use a negative index for the column/row that needs to be deleted. For example, to delete the 'fare.sq' column calculated above (column 12), use

train <- train[, -12]
train$fare.sq
## NULL

To delete the last row created above, use
train <- train[-892, ]
train[892, ]
##    survived pclass name  sex age sibsp parch ticket fare cabin embarked
## NA       NA     NA <NA> <NA>  NA    NA    NA   <NA>   NA  <NA>     <NA>

Subset

A data frame can be subset using different conditions. For example, we can subset the train data to include observations only for females using the subset() function.

train.female <- subset(train, sex == "female")

To check whether the subset worked properly, we can look at the frequency table of the 'sex' variable in both data sets.

table(train$sex)
##
## female   male
##    314    577

table(train.female$sex)
##
## female   male
##    314      0

Consider another example where we subset the data by taking only those observations for which 'fare' is between 100 and 500.

train.sub1 <- subset(train, fare >= 100 & fare <= 500)
dim(train.sub1)
## [1] 50 11

We can also subset using two different variables. Let's take the cases where passenger class is 3 and sex is male.

train.sub2 <- subset(train, pclass == 3 & sex == "male")
dim(train.sub2)
## [1] 347  11

The above example used an 'and' condition while subsetting the data. The example below uses the same two variables with an 'or' condition between them. The 'or' condition in R is specified using '|'.
train.sub3 <- subset(train, pclass == 3 | sex == "male")
dim(train.sub3)
## [1] 721  11

The exact same process can be executed using the indexing [ ] operator. For example, to replicate the previous example with [ ] use

train.sub4 <- train[train$pclass == 3 | train$sex == "male", ]
dim(train.sub4)
## [1] 721  11

Summarize

Summarizing a data set is extremely easy and can be done using a simple function called summary().

summary(train)
##     survived         pclass
##  Min.   :0.000   Min.   :1.00
##  1st Qu.:0.000   1st Qu.:2.00
##  Median :0.000   Median :3.00
##  Mean   :0.384   Mean   :2.31
##  3rd Qu.:1.000   3rd Qu.:3.00
##  Max.   :1.000   Max.   :3.00
##
##                                     name         sex           age
##  Abbing, Mr. Anthony                  :  1   female:314   Min.   : 0.42
##  Abbott, Mr. Rossmore Edward          :  1   male  :577   1st Qu.:20.12
##  Abbott, Mrs. Stanton (Rosa Hunt)     :  1                Median :28.00
##  Abelson, Mr. Samuel                  :  1                Mean   :29.70
##  Abelson, Mrs. Samuel (Hannah Wizosky):  1                3rd Qu.:38.00
##  Adahl, Mr. Mauritz Nils Martin       :  1                Max.   :80.00
##  (Other)                              :885                NA's   :177
##      sibsp           parch            ticket         fare
##  Min.   :0.000   Min.   :0.000   1601    :  7   Min.   :  0.0
##  1st Qu.:0.000   1st Qu.:0.000   347082  :  7   1st Qu.:  7.9
##  Median :0.000   Median :0.000   CA. 2343:  7   Median : 14.5
##  Mean   :0.523   Mean   :0.382   3101295 :  6   Mean   : 32.2
##  3rd Qu.:1.000   3rd Qu.:0.000   347088  :  6   3rd Qu.: 31.0
##  Max.   :8.000   Max.   :6.000   CA 2144 :  6   Max.   :512.3
##                                  (Other) :852
##          cabin     embarked
##  B96 B98    :  4   C   :168
##  C23 C25 C27:  4   Q   : 77
##  G6         :  4   S   :644
##  C22 C26    :  3   NA's:  2
##  D          :  3
##  (Other)    :186
##  NA's       :687
The summary of a data frame gives a clear snapshot of the values each variable holds, including the missing ones. The '$' sign and the indexing operator can be used to summarize a single variable or a group of variables, as shown below.

summary(train$pclass)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    1.00    2.00    3.00    2.31    3.00    3.00

summary(train[, c("pclass", "sex", "cabin")])
##      pclass         sex            cabin
##  Min.   :1.00   female:314   B96 B98    :  4
##  1st Qu.:2.00   male  :577   C23 C25 C27:  4
##  Median :3.00                G6         :  4
##  Mean   :2.31                C22 C26    :  3
##  3rd Qu.:3.00                D          :  3
##  Max.   :3.00                (Other)    :186
##                              NA's       :687

The above examples are just a representative sample of the functions available in R to process data frames. They are intended to serve as a starting point and a quick reference guide for those who have just started playing with R. In the next tutorial, we will learn about data manipulation in R.
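One caveat on the Subset section, for readers who want to experiment further: subset() and the [ ] operator gave the same result there because 'pclass' and 'sex' have no missing values. When the condition involves a column with NAs (such as 'age', which has 177 of them according to the summary), the two approaches can differ. The sketch below illustrates this on a made-up four-row data frame; the toy object and its values are hypothetical and not taken from train.csv.

```r
# Hypothetical toy data with one missing age
toy <- data.frame(
  sex = c("male", "female", "male", "female"),
  age = c(22, NA, 54, 38)
)

# subset() drops rows for which the condition evaluates to NA
older.subset <- subset(toy, age > 30)

# [ ] keeps a row of NAs for every NA in the logical condition
older.index <- toy[toy$age > 30, ]

# wrapping the condition in which() discards the NAs, matching subset()
older.which <- toy[which(toy$age > 30), ]

nrow(older.subset)  # 2
nrow(older.index)   # 3 (two matches plus one all-NA row)
nrow(older.which)   # 2
```

Comparing dim() or nrow() of the two results, as done in the Subset section, is a quick way to spot this difference on real data.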