Data Structures STAT 133. Gaston Sanchez. Department of Statistics, UC Berkeley

Similar documents
Control Flow Structures

Data types and structures

Stat 579: Objects in R Vectors

ITS Introduction to R course

Lecture 09: Feb 13, Data Oddities. Lists Coercion Special Values Missingness and NULL. James Balamuta STAT UIUC

Introduction to the R Language

Numeric Vectors STAT 133. Gaston Sanchez. Department of Statistics, UC Berkeley

Entering and Outputting Data 2 nd best TA ever: Steele H. Valenzuela February 2-6, 2015

SML 201 Week 2 John D. Storey Spring 2016

Introduction to R. Nishant Gopalakrishnan, Martin Morgan January, Fred Hutchinson Cancer Research Center

Introduction to R: Data Types

Package LSDinterface

lecture 2: a crash course in r

MBV4410/9410 Fall Bioinformatics for Molecular Biology. Introduction to R

Statistical Programming with R

Sub-setting Data. Tzu L. Phang

Package tibble. August 22, 2017

Package slam. February 15, 2013

A Brief Introduction to R

Package slam. December 1, 2016

Description/History Objects/Language Description Commonly Used Basic Functions. More Specific Functionality Further Resources

Reading and wri+ng data

JME Language Reference Manual

the R environment The R language is an integrated suite of software facilities for:

R and parallel libraries. Introduction to R for data analytics Bologna, 26/06/2017

Introducion to R and parallel libraries. Giorgio Pedrazzi, CINECA Matteo Sartori, CINECA School of Data Analytics and Visualisation Milan, 09/06/2015

6 Subscripting. 6.1 Basics of Subscripting. 6.2 Numeric Subscripts. 6.3 Character Subscripts

Data Classes. Introduction to R for Public Health Researchers

The Warhol Language Reference Manual

First Steps with R. 27 January, Install R 1. 2 Atomic Vectors R s Basic Types 2. 4 Matrix, data.frame, and 2-D Subsetting 7

MAT 275 Laboratory 2 Matrix Computations and Programming in MATLAB

Physics 326G Winter Class 2. In this class you will learn how to define and work with arrays or vectors.

SCHEME 8. 1 Introduction. 2 Primitives COMPUTER SCIENCE 61A. March 23, 2017

Chapter 2: Basic Elements of C++ Objectives. Objectives (cont d.) A C++ Program. Introduction

Functions and data structures. Programming in R for Data Science Anders Stockmarr, Kasper Kristensen, Anders Nielsen

CLEANING DATA IN R. Type conversions

Basic R Part 1 BTI Plant Bioinformatics Course

Introduction to R. Adrienn Szabó. DMS Group, MTA SZTAKI. Aug 30, /62

Logical operators: R provides an extensive list of logical operators. These include

R basics workshop Sohee Kang

Basic matrix math in R

II.Matrix. Creates matrix, takes a vector argument and turns it into a matrix matrix(data, nrow, ncol, byrow = F)

Package extdplyr. February 27, 2017

OUTLINES. Variable names in MATLAB. Matrices, Vectors and Scalar. Entering a vector Colon operator ( : ) Mathematical operations on vectors.

Introduction to R. UCLA Statistical Consulting Center R Bootcamp. Irina Kukuyeva September 20, 2010

MATLAB for beginners. KiJung Yoon, 1. 1 Center for Learning and Memory, University of Texas at Austin, Austin, TX 78712, USA

ARTIFICIAL INTELLIGENCE AND PYTHON

Basic R QMMA. Emanuele Taufer. 2/19/2018 Basic R (1)

limma: A brief introduction to R

Session 1 Nick Hathaway;

Fall 2018 Discussion 8: October 24, 2018 Solutions. 1 Introduction. 2 Primitives

R:If, else and loops

The SPL Programming Language Reference Manual

Introduction to Programming, Aug-Dec 2006

Mails : ; Document version: 14/09/12

Objectives. Chapter 2: Basic Elements of C++ Introduction. Objectives (cont d.) A C++ Program (cont d.) A C++ Program

Chapter 2: Basic Elements of C++

Extremely short introduction to R Jean-Yves Sgro Feb 20, 2018

An Introduction to the Fundamentals & Functionality of the R Programming Language. Part II: The Nuts & Bolts

Summer 2017 Discussion 10: July 25, Introduction. 2 Primitives and Define

An Introduction to R- Programming

MAT 275 Laboratory 2 Matrix Computations and Programming in MATLAB

Ph3 Mathematica Homework: Week 1

Create formulas in Excel

An Introduction to R for Epidemiologists using RStudio

seq(), seq_len(), min(), max(), length(), range(), any(), all() Comparison operators: <, <=, >, >=, ==,!= Logical operators: &&,,!

R: BASICS. Andrea Passarella. (plus some additions by Salvatore Ruggieri)

Implementing S4 objects in your package: Exercises

Where s the spreadsheet?

String Computation Program

IMPORTING DATA IN R. Introduction read.csv

Spring 2018 Discussion 7: March 21, Introduction. 2 Primitives

Chapter 2 Working with Data Types and Operators

Package filematrix. R topics documented: February 27, Type Package

CS Introduction to Computational and Data Science. Instructor: Renzhi Cao Computer Science Department Pacific Lutheran University Spring 2017

Fall 2017 Discussion 7: October 25, 2017 Solutions. 1 Introduction. 2 Primitives

Chapter 2. Designing a Program. Input, Processing, and Output Fall 2016, CSUS. Chapter 2.1

MAT 275 Laboratory 2 Matrix Computations and Programming in MATLAB

Lecture 1: Getting Started and Data Basics

R Basics / Course Business

Constraint-based Metabolic Reconstructions & Analysis H. Scott Hinton. Matlab Tutorial. Lesson: Matlab Tutorial

Maltepe University Computer Engineering Department. BİL 133 Algorithms and Programming. Chapter 8: Arrays

Package tester. R topics documented: February 20, 2015

MATLAB BASICS. Macro II. February 25, T.A.: Lucila Berniell. Macro II (PhD - uc3m) MatLab Basics 02/25/ / 15

1 Matrices and Vectors and Lists

Programming for Engineers Arrays

Lecture 06: Feb 04, Transforming Data. Functions Classes and Objects Vectorization Subsets. James Balamuta STAT UIUC

DSC 201: Data Analysis & Visualization

command.name(measurement, grouping, argument1=true, argument2=3, argument3= word, argument4=c( A, B, C ))

Basics of R. > x=2 (or x<-2) > y=x+3 (or y<-x+3)

C++ Programming: From Problem Analysis to Program Design, Third Edition

Armstrong State University Engineering Studies MATLAB Marina 2D Arrays and Matrices Primer

Full file at

13. Section 9 Exercises

Sketchpad Graphics Language Reference Manual. Zhongyu Wang, zw2259 Yichen Liu, yl2904 Yan Peng, yp2321

A Fast Review of C Essentials Part I

Strings Basics STAT 133. Gaston Sanchez. Department of Statistics, UC Berkeley

B E C Y. Reference Manual

Text University of Bolton.

Lab #10 Multi-dimensional Arrays

Transcription:

Data Structures STAT 133 Gaston Sanchez Department of Statistics, UC Berkeley gastonsanchez.com github.com/gastonstat/stat133 Course web: gastonsanchez.com/stat133

Data Types and Structures To make the best of the R language, you ll need a strong understanding of the basic data types and data structures and how to operate on them. 2

Data Structures There are various data structures in R (we ll describe them in detail later): vectors matrices (2d arrays) arrays (in general) factors lists data frames 3

Vectors 4

Vectors A vector is the most basic data structure in R Vectors are contiguous cells containing data Can be of any length (including zero) R has five basic type of vectors: integer, double, complex, logical, character 5

Vectors The most simple type of vectors are scalars or single values: # integer x <- 1L # double (real) y <- 5 # complex z <- 3 + 5i # logical a <- TRUE # character b <- "yosemite" 6

Data modes A double vector stores regular (i.e. real) numbers An integer vector stores integers (no decimal component) A character vector stores text A logical vector stores TRUE s and FALSE s values A complex vector stores complex numbers 7

Data Types (or modes) value example mode storage integer 1L, 2L numeric integer real 1, -0.5 numeric double complex 3 + 5i complex complex logical TRUE, FALSE logical logical character "hello" character character 8

Special Values There are some special values NULL is the null object (it has length zero) Missing values are referred to by the symbol NA Inf indicates positive infinite -Inf indicates negative infinite NaN indicates Not a Number 9

Vectors The function to create a vector from individual values is c(), short for concatenate: # some vectors x <- c(1, 2, 3, 4, 5) y <- c("one", "two", "three") z <- c(true, FALSE, FALSE) Separate each element by a comma 10

Atomic Vectors vectors are atomic structures the values in a vector must be ALL of the same type either all integers, or reals, or complex, or characters, or logicals you cannot have a vector of different data types 11

Atomic Vectors If you mix different data values, R will implicitly coerce them so they are all of the same type # mixing numbers and characters x <- c(1, 2, 3, "four", "five") x ## [1] "1" "2" "3" "four" "five" # mixing numbers and logical values y <- c(true, FALSE, 3, 4) y ## [1] 1 0 3 4 12

Atomic Vectors # mixing numbers and logical values z <- c(true, FALSE, "TRUE", "FALSE") z ## [1] "TRUE" "FALSE" "TRUE" "FALSE" # mixing integer, real, and complex numbers w <- c(1l, -0.5, 3 + 5i) w ## [1] 1.0+0i -0.5+0i 3.0+5i 13

How does R coerce data types? R follows two basic rules of implicit coercion If a character is present, R will coerce everything else to characters If a vector contains logicals and numbers, R will convert the logicals to numbers (TRUE to 1, FALSE to 0) 14

Coercion functions Coercion Functions R provides a set of explicit coercion functions that allow us to convert one type of data into another as.character() as.numeric() as.logical() 15

Conversion between types from to function conversions logical numeric as.numeric FALSE 0 TRUE 1 logical character as.character FALSE "FALSE" TRUE "TRUE" character numeric as.numeric "1", "2" 1, 2 "A" NA character logical as.logical "FALSE" FALSE "TRUE" TRUE other NA numeric logical as.logical 0 FALSE other 1 numeric character as.character 1, 2 "1", "2" 16

Properties of Vectors all vectors have a length vector elements can have associated names vectors are objects of class "vector" vectors have a mode (storage mode) 17

Properties of Vectors # vector with named elements x <- c(a = 1, b = 2.5, c = 3.7, d = 10) x ## a b c d ## 1.0 2.5 3.7 10.0 length(x) ## [1] 4 mode(x) ## [1] "numeric" 18

Matrices and Arrays 19

From Vectors to Arrays We can transform a vector in an n-dimensional array by giving it a dimensions attribute dim # positive: from 1 to 8 x <- 1:8 # adding 'dim' attribute dim(x) <- c(2, 4) x ## [,1] [,2] [,3] [,4] ## [1,] 1 3 5 7 ## [2,] 2 4 6 8 20

From Vectors to Arrays a vector can be given a dim attribute a dim attribute is a numeric vector of length n R will reorganize the elements of the vector into n dimensions each dimension will have as many rows (or columns, etc.) as the n-th value of the dim vector 21

From Vectors to Arrays # dim attribute with 3 dimensions dim(x) <- c(2, 2, 2) x ##,, 1 ## ## [,1] [,2] ## [1,] 1 3 ## [2,] 2 4 ## ##,, 2 ## ## [,1] [,2] ## [1,] 5 7 ## [2,] 6 8 22

From Vector to Matrix A dim attribute of length 2 will convert a vector into a matrix # vector to matrix A <- 1:8 class(a) ## [1] "integer" dim(a) <- c(2, 4) class(a) ## [1] "matrix" When using dim(), R always fills up each matrix by rows. 23

From Vector to Matrix To have more control about how a matrix is filled, (by rows or columns), we use the matrix() function: # vector to matrix (by rows) A <- 1:8 matrix(a, nrow = 2, ncol = 4, byrow = TRUE) ## [,1] [,2] [,3] [,4] ## [1,] 1 2 3 4 ## [2,] 5 6 7 8 24

Matrices Matrices store values in a two-dimensional array To create a matrix, give a vector to matrix() and specify number of rows and columns you can also assign row and column names to a matrix 25

Creating a Matrix Exercise: create the following matrix ## [,1] [,2] ## [1,] "Harry" "Potter" ## [2,] "Ron" "Weasley" ## [3,] "Hermione" "Granger" 26

Creating a Matrix Solution: # vector of names hp <- c("harry", "Potter", "Ron", "Weasley", "Hermione", "Granger") # matrix filled up by rows matrix(hp, nrow = 3, byrow = TRUE) ## [,1] [,2] ## [1,] "Harry" "Potter" ## [2,] "Ron" "Weasley" ## [3,] "Hermione" "Granger" 27

Arrays Arrays store values in an n-dimensional array To create an array, give a vector to array() and specify number of dimensions you can also assign dim-names to an array 28

Creating an Array ar <- array(c(1:4, 5:8, 9:12), dim = c(2, 2, 3)) ar ##,, 1 ## ## [,1] [,2] ## [1,] 1 3 ## [2,] 2 4 ## ##,, 2 ## ## [,1] [,2] ## [1,] 5 7 ## [2,] 6 8 ## ##,, 3 ## ## [,1] [,2] ## [1,] 9 11 ## [2,] 10 12 29

Factors 30

Factors A similar structure to vectors are factors factors are used for handling categorical (i.e. qualitative) data they are represented as objects of class "factor" internally, factors are stored as integers factors behave much like vectors (but they are not vectors) 31

Factors To create a factor we use the function factor() # factor cols <- c("blue", "red", "blue", "gray", "red") cols <- factor(cols) cols ## [1] blue red blue gray red ## Levels: blue gray red The different values in a factor are called levels 32

So far... Vectors, matrices, and arrays are atomic structures (they can only store one type of data) Many operations in R need atomic structures to make sure all values are of the same mode In real life, however, many datasets contain multiple types of information R provides other data structures to store different types of data 33

Lists 34

Lists Lists are the most general class of data container Like vectors, lists group data into a one-dimensional set Unlike vectors, lists can store all kinds of objects Lists can be of any length Elements of a list can be named, or not 35

Lists To create a list we use the function list(). The list() function creates a list the same way c() creates a vector: lfriends <- list("harry", "Ron", "Hermione") lfriends ## [[1]] ## [1] "Harry" ## ## [[2]] ## [1] "Ron" ## ## [[3]] ## [1] "Hermione" 36

Lists Lists can contain any type of data object (even other lists): lst <- list( c("harry", "Ron", "Hermione"), matrix(1:6, nrow = 2, ncol = 3), factor(c("yes", "no", "no", "no", "yes")), lfriends ) 37

Lists Elements in a list can be named: list( first = c("harry", "Ron", "Hermione"), second = matrix(1:6, nrow = 2, ncol = 3), third = factor(c("yes", "no", "no", "no", "yes")), fourth = lfriends ) 38

Creating a List Exercise: create a list with your first name, middle name, and last name ## $first ## [1] "Gaston" ## ## $middle ## NULL ## ## $last ## [1] "Sanchez" 39

Data Frames 40

Data Frames Data Frame A data.frame is the primary data structure that R provides for handling tabular data sets (e.g. spreadsheet like). Function data.frame() The data.frame() function allows us to create data frames 41

Lists Data frames are the two-dimensional version of a list They are the conventional data storage structure for data analysis A data frame is displayed like a table (or matrix) A data frame is stored as a list 42

Creating a Data Frame # creating a data frame (manually) elements <- data.frame( name = c('hydrogen', 'nitrogen', 'oxygen'), symbol = c('h', 'N', 'O'), number = c(1, 7, 8) ) elements ## name symbol number ## 1 hydrogen H 1 ## 2 nitrogen N 7 ## 3 oxygen O 8 (vectors defined inside the data frame function) 43

Creating a Data Frame Give data.frame() any number of vectors, each separated with a comma Each vector should be set equal to a name that describes the vector data.frame() will turn each vector into a column of the new data frame All vectors should be of the same length By default, data.frame() converts strings into factors 44

Creating a Data Frame Create the following data frame: ## state abbreviation capital area ## 1 California CA Sacramento 163707 ## 2 New York NY Albany 54475 ## 3 Texas TX Austin 268601 45

Creating a Data Frame Solution: states <- data.frame( state = c('california', 'New York', 'Texas'), abbreviation = c('ca', 'NY', 'TX'), capital = c('sacramento', 'Albany', 'Austin'), area = c(163707, 54475, 268601), stringsasfactors = FALSE ) states ## state abbreviation capital area ## 1 California CA Sacramento 163707 ## 2 New York NY Albany 54475 ## 3 Texas TX Austin 268601 46

R common data structures 1D single data type Vector Matrix multiple data types List Data Frame 2D Array nd 47

Selecting Values 48

Selecting Values Data manipulation requires you to learn how to select and retrieve values from each of the data structures 49

Notation System Notation system to extract values from R objects to extract values use brackets: [ ] inside the brackets specify indices use as many indices as dimensions in the object each index is separated by comma indices can be numbers, logicals, or names 50

Values of Vectors # some vector vec <- 1:5 # adding names names(vec) <- letters[1:5] vec ## a b c d e ## 1 2 3 4 5 51

# first element vec[1] ## a ## 1 # third element vec[3] ## c ## 3 # fifth element vec[5] ## e ## 5 Extracting values with numbers 52

Extracting values with positive vectors # range 1 to 3 vec[1:3] ## a b c ## 1 2 3 # vector of positive numbers vec[c(1, 3, 4)] ## a c d ## 1 3 4 53

Extracting values with negative numbers # all values except the first one vec[-1] ## b c d e ## 2 3 4 5 # all values except 2nd and 4th vec[-c(2, 4)] ## a c e ## 1 3 5 54

Extracting values with logicals # first element vec[c(true, FALSE, FALSE, FALSE, FALSE)] ## a ## 1 # 4th and 5th elements vec[c(false, FALSE, FALSE, TRUE, TRUE)] ## d e ## 4 5 # logical negation (2nd and 4th) vec[!c(true, FALSE, TRUE, FALSE, TRUE)] ## b d ## 2 4 55

Extracting values with names Since vec has names, we can use characters to extract named values # element 'a' vec['a'] ## a ## 1 # elements 'b' and 'e' vec[c('b', 'e')] ## b e ## 2 5 56

Extracting values with names You can use repeated indices: # element 2 (3-times) vec[c(2, 2, 2)] ## b b b ## 2 2 2 # element 'a' (four times) vec[c('a', 'a', 'a', 'a')] ## a a a a ## 1 1 1 1 57

Extracting values of two-dimensional objects states ## state abbreviation capital area ## 1 California CA Sacramento 163707 ## 2 New York NY Albany 54475 ## 3 Texas TX Austin 268601 58

Extracting values of two-dimensional objects Extracting a single cell # value in row 1, and column 1 states[1,1] ## [1] "California" # value in row 3, and column 3 states[3,3] ## [1] "Austin" 59

Extracting values of two-dimensional objects To extract just rows, leave the column index empty # values in row 1 states[1, ] ## state abbreviation capital area ## 1 California CA Sacramento 163707 # values in rows 1 to 2 states[1:2, ] ## state abbreviation capital area ## 1 California CA Sacramento 163707 ## 2 New York NY Albany 54475 60

Extracting values of two-dimensional objects To extract just columns, leave the row index empty # values in column 1 states[, 1] ## [1] "California" "New York" "Texas" # values in columns 2 to 4 states[, 2:4] ## abbreviation capital area ## 1 CA Sacramento 163707 ## 2 NY Albany 54475 ## 3 TX Austin 268601 61

Extracting values of two-dimensional objects Extracting rows with negative indices # excluding 3rd row states[-3, ] ## state abbreviation capital area ## 1 California CA Sacramento 163707 ## 2 New York NY Albany 54475 # excluding 1st and 3rd row states[-c(1, 3), ] ## state abbreviation capital area ## 2 New York NY Albany 54475 62

Extracting values of two-dimensional objects Extracting columns with negative indices # excluding 2nd column states[, -2] ## state capital area ## 1 California Sacramento 163707 ## 2 New York Albany 54475 ## 3 Texas Austin 268601 # excluding 1st and 2dn columns states[, -c(1, 2)] ## capital area ## 1 Sacramento 163707 ## 2 Albany 54475 ## 3 Austin 268601 63

Extracting values of two-dimensional objects You can use repetition of indices # values in column 1 states[, c(1, 1)] ## state state.1 ## 1 California California ## 2 New York New York ## 3 Texas Texas # values in columns 2 to 4 states[c(2, 2, 3), ] ## state abbreviation capital area ## 2 New York NY Albany 54475 ## 2.1 New York NY Albany 54475 ## 3 Texas TX Austin 268601 64

Extracting values of two-dimensional objects Extracting values with logicals # exclude 3rd row states[c(true, TRUE, FALSE), ] ## state abbreviation capital area ## 1 California CA Sacramento 163707 ## 2 New York NY Albany 54475 # exclude 3rd column states[, c(true, TRUE, FALSE, TRUE)] ## state abbreviation area ## 1 California CA 163707 ## 2 New York NY 54475 ## 3 Texas TX 268601 65

Extracting values of two-dimensional objects Extracting columns by name # column 'state' states[, 'state'] ## [1] "California" "New York" "Texas" # columns 'state' and 'abbreviation' states[, c('state', 'abbreviation')] ## state abbreviation ## 1 California CA ## 2 New York NY ## 3 Texas TX 66

Extracting values of two-dimensional objects When you select one column from a two-dimensional array, R will return a vector. To get a column output, use the argument drop = FALSE # columns 'state' and 'abbreviation' states[, 1, drop = FALSE] ## state ## 1 California ## 2 New York ## 3 Texas 67

Dollar signs and Double Brackets 68

Dollar signs R lists and data frames obey an optional second system of notation for extracting values: using the dollar sign $ 69

Dollar signs with data frames The dollar sign $ notation works for selecting a column of a data frame using its name states$state ## [1] "California" "New York" "Texas" states$abbreviation ## [1] "CA" "NY" "TX" states$capital ## [1] "Sacramento" "Albany" "Austin" 70

Dollar signs with data frames You don t need to use quote marks, but you can if you want states$'state' ## [1] "California" "New York" "Texas" states$"abbreviation" ## [1] "CA" "NY" "TX" states$`capital` ## [1] "Sacramento" "Albany" "Austin" 71

Dollar signs with lists Elements in a named list can also be extracted with the dollar sign: lst <- list(numbers = 1:3, letter = c('a', 'B', 'C')) lst$numbers ## [1] 1 2 3 lst$letters ## NULL 72

Double brackets In addition, lists (and data frames) accept a third type of notation that uses double brackets: [[ ]] lst[[1]] ## [1] 1 2 3 lst[[2]] ## [1] "A" "B" "C" 73

Double brackets Double brackets are used when we want to get access to the individual elements; use double brackets followed by single brackets lst[[2]] ## [1] "A" "B" "C" lst[[2]][3] ## [1] "C" 74

Elements-Extraction Notation System object notation example vector [ ] v[1:5] factor [ ] g[1:5] matrix [, ] m[1:5, 1:3] array [,, ] arr[1, 2, 3] [,,, ] arr[1, 2, 3, 4] list [ ] lst[3] [[ ]] lst[[3]] $ lst$name data frame [, ] df[1, 2] $ df$name 75