R packages for this chapter (for later)

Russell McBride

1 R packages for this chapter (for later)

    library(ggplot2)
    library(ggrepel)
    library(tidyr)
    library(dplyr)
    ##
    ## Attaching package: 'dplyr'
    ## The following objects are masked from 'package:stats':
    ##
    ##     filter, lag
    ## The following objects are masked from 'package:base':
    ##
    ##     intersect, setdiff, setequal, union

1 / 166

2 SAS stuff

3 Reading data from a file
This:

    a 20
    a 21
    a 16
    b 11
    b 14
    b 17
    b 15
    c 13
    c 9
    c 12
    c 13

got read in like this:

    data groups;
      infile '/home/ken/threegroups.dat';
      input group $ y;

4 More than one observation per line
The foregoing worked with data that had one observation per line, with values separated by whitespace.
Suppose instead that there are several values on each line, e.g. for one variable x. Then:

    data xonly;
      infile '/home/ken/one.dat';
      input x @@;
    proc print;

@@ means: keep reading on the same line until done.

5 The output
[proc print output: columns Obs and x]

6 If you leave off the @@
Code and output; this doesn't get everything:

    data xonly;
      infile '/home/ken/one.dat';
      input x;
    proc print;

[proc print output: columns Obs and x]

7 Two variables using @@
Suppose the values in the data file are an x, then a y, repeated:

    data xonly;
      infile '/home/ken/one.dat';
      input x y @@;
    proc print;

[proc print output: columns Obs, x and y]

8 Skipping over header lines
The data file has a header line (x y) above the data values.
In SAS, you supply the variable names yourself (on the input line), so skip over header lines like this:

    data xy;
      infile '/home/ken/two.dat' firstobs=2;
      input xx yy;
    proc print;

Can put any number on firstobs, depending on how many lines you want to skip: reading starts at that line, so firstobs=2 skips one header line.

9 Data as read in
Note the variable names:
[proc print output: columns Obs, xx and yy]

10 Data separated by other things
Might have data like this:

    3,4
    5,6
    7,7
    8,9
    3,4

E.g. from a spreadsheet saved as .csv.

11 Code and output
Values are separated by commas, so read in like this:

    data xy;
      infile '/home/ken/three.dat' dlm=',';
      input x y;
    proc print;

[proc print output: columns Obs, x and y]

12 The singers: reading in text
Spreadsheet of female singer names, saved as .csv:

    1,Bessie Smith
    2,Peggy Lee
    3,Aretha Franklin
    4,Diana Ross
    5,Dolly Parton
    6,Tina Turner
    7,Madonna
    8,Mary J Blige
    9,Salt n Pepa
    10,Aaliyah
    11,Beyonce

Try reading in:

    data singers;
      infile '/home/ken/singers.csv' dlm=',';
      input number name $;

13 What we got

    Obs  number  name
      1       1  Bessie S
      2       2  Peggy Le
      3       3  Aretha F
      4       4  Diana Ro
      5       5  Dolly Pa
      6       6  Tina Tur
      7       7  Madonna
      8       8  Mary J B
      9       9  Salt n P
     10      10  Aaliyah
     11      11  Beyonce

The names got cut off!

14 Reading the whole names
Only got the first 8 characters of each singer's name (the SAS default for text).
Tell SAS that the names are 20 characters long:

    data singers;
      infile '/home/ken/singers.csv' dlm=',';
      input number name $20.;
    proc print;

    Obs  number  name
      1       1  Bessie Smith
      2       2  Peggy Lee
      3       3  Aretha Franklin
      4       4  Diana Ross
      5       5  Dolly Parton
      6       6  Tina Turner
      7       7  Madonna
      8       8  Mary J Blige
      9       9  Salt n Pepa
     10      10  Aaliyah
     11      11  Beyonce

15 Why this worked
In input number name $20.; the 20. after the dollar sign, specifying the length of the text, is called an informat.
Singers' names have spaces, but this is no problem, since the delimiter is a comma.
Possible trouble: commas inside the names, as in Robert Downey, Jr. Get around this by adding dsd to the infile line.
singers2.csv has Mr. Downey on the end:

    data singers2;
      infile '/home/ken/singers2.csv' dlm=',' dsd;
      input number name $20.;
    proc print;

16 Singers as read in

    Obs  number  name
      1       1  Bessie Smith
      2       2  Peggy Lee
      3       3  Aretha Franklin
      4       4  Diana Ross
      5       5  Dolly Parton
      6       6  Tina Turner
      7       7  Madonna
      8       8  Mary J Blige
      9       9  Salt n Pepa
     10      10  Aaliyah
     11      11  Beyonce
     12      12  Robert Downey, Jr.

17 A gotcha
If you tried this for yourself, it might not have worked.
Issue: the singer names in the file must be at least 20 characters long. If not, you have to add spaces to make them so.
singers3.csv has the additional spaces removed. With this code:

    data singers3;
      infile '/home/ken/singers3.csv' dlm=',' dsd;
      input number name $20.;
    proc print;

you get the output shown on the next page.

18 Output from previous commands, data file below

    Obs  number  name
      1       1  Bessie Smith
      2       2  Peggy Lee
      3       3  Aretha Franklin
      4       4  Diana Ross
      5       5  Dolly Parton
      6       6  7,Madonna
      7       8  Mary J Blige
      8       9  Salt n Pepa
      9      10  11,Beyonce
     10      12  Robert Downey, Jr.

Data file (*** marks the actual end of each line):

    1,Bessie Smith        ***
    2,Peggy Lee           ***
    3,Aretha Franklin     ***
    4,Diana Ross          ***
    5,Dolly Parton        ***
    6,Tina Turner***
    7,Madonna             ***
    8,Mary J Blige        ***
    9,Salt n Pepa         ***
    10,Aaliyah***
    11,Beyonce            ***

19 Reading spreadsheet data into SAS
Two quick ways:
  Save the data as .csv and transfer to SAS Studio.
  Copy and paste into the Program Editor (quick and dirty). Save in singsing.dat and read in like this:

    data sing;
      infile "/home/ken/singsing.dat" expandtabs;
      input singer $20. value;

Or read in the actual spreadsheet using proc import:

    proc import
      out=singers
      datafile='/home/ken/sing.xlsx'
      dbms=xlsx
      replace;
      sheet="sheet1";
      getnames=yes;

The proc import line is spread out for clarity, with a ; only at its end.
  out=: name of SAS data set
  datafile=: Excel spreadsheet
  sheet=: which sheet in workbook

20 Did it work?

    Obs  singer            number
      1  Bessie Smith           1
      2  Peggy Lee              2
      3  Aretha Franklin        3
      4  Diana Ross             4
      5  Dolly Parton           5
      6  Tina Turner            6
      7  Madonna                7
      8  Mary J Blige           8
      9  Salt n Pepa            9
     10  Aaliyah               10
     11  Beyonce               11

Yes! And without any issues about lengths of names.

21 Permanent data sets
Can we read in a data set once and not every time? Yes: use a filename (in single quotes) when creating it:

    data '/home/ken/cars';
      infile '/home/ken/cars.txt' firstobs=2;
      input car $25. mpg weight cylinders hp country $;

Car names are at most 25 characters long. Country names are at most 8, so they need no special treatment.
SAS stores a file called /home/username/cars.sas7bdat (!) on SAS Studio.
Whenever you need it, add data='/home/username/cars' to a proc line (replacing username with your username).
Can use subfolders, with / (forward slash) syntax.
Closing SAS breaks the connection with temporary (i.e. non-permanent) data sets. To get those back, you need to run the data step lines again.

22 Means, without data step!

    proc means data='/home/ken/cars';
      var mpg weight cylinders hp;

[Output of The MEANS Procedure: N, Mean, Std Dev, Minimum and Maximum for mpg, weight, cylinders and hp]

23 Mean MPG by country

    proc means data='/home/ken/cars';
      var mpg;
      class country;

[Output of The MEANS Procedure, analysis variable mpg: N Obs, N, Mean, Std Dev, Minimum and Maximum for each country (France, Germany, Italy, Japan, Sweden, U.S.)]

This kind of thing is SAS's strength.

24 How does SAS know which data set to use?
Two rules:
1. Any proc can have data= on it. This tells SAS to use that data set. It can be:
     an unquoted data set name (created by a data step)
     a quoted data set name (a permanent one on disk, created as above).
2. Without data=, SAS uses the most recently created data set. Typically this is one created by a data step, though it could also be a spreadsheet read in via proc import. A data set created by out= counts as well.
Does a permanent data set count as most recently created? No, or at least not always. If unsure, use data=.

25 SAS: creating new data sets from old ones

26 Selecting individuals/observations
Singers, original data step:

    data singers;
      infile '/home/ken/singers.csv' dlm=',';
      input number name $20.;

To select only singers 1 through 6:

    data singers;
      infile '/home/ken/singers.csv' dlm=',';
      input number name $20.;
      if number<=6;

Select individuals with if: "choose only these".

27 Did it work?

    proc print;

    Obs  number  name
      1       1  Bessie Smith
      2       2  Peggy Lee
      3       3  Aretha Franklin
      4       4  Diana Ross
      5       5  Dolly Parton
      6       6  Tina Turner

The most recently created data set has only the singers with numbers 6 or less.

28 Omitting individuals
Sometimes it is easier to focus on the observations to leave out:

    data singers;
      infile '/home/ken/singers.csv' dlm=',';
      input number name $20.;
      if number<4 then delete;
    proc print;

    Obs  number  name
      1       4  Diana Ross
      2       5  Dolly Parton
      3       6  Tina Turner
      4       7  Madonna
      5       8  Mary J Blige
      6       9  Salt n Pepa
      7      10  Aaliyah
      8      11  Beyonce

29 Selecting on text variable
"Less than" means "earlier alphabetically". Singers before M:

    data singers;
      infile '/home/ken/singers.csv' dlm=',';
      input number name $20.;
      if name<'M';
    proc print;

    Obs  number  name
      1       1  Bessie Smith
      2       3  Aretha Franklin
      3       4  Diana Ross
      4       5  Dolly Parton
      5      10  Aaliyah
      6      11  Beyonce
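Why the capital M matters: text comparisons like this go character by character using the character codes, and (assuming an ASCII system, which is an assumption here) every capital letter has a smaller code than every lower-case letter. A small R sketch of that ordering, using a made-up vector of initials rather than the singers file:

```r
# Character codes: in ASCII, capitals (65-90) come before lower case (97-122)
codes <- c(M = utf8ToInt("M"), m = utf8ToInt("m"), Z = utf8ToInt("Z"))
codes

# Hypothetical initials, for illustration only
first_letters <- c("B", "P", "A", "D", "T", "M", "S")
letter_codes <- sapply(first_letters, utf8ToInt)

before_M <- letter_codes < utf8ToInt("M")   # only B, A, D are before "M"
before_m <- letter_codes < utf8ToInt("m")   # every capital is before "m"
rbind(before_M, before_m)
```

So comparing against a lower-case letter would let every capitalized name through.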

30 Equality
Selecting singer #7, i.e. the singer whose number is equal to 7. Note that SAS uses = while R uses == for "logical equals".

    data singers;
      infile '/home/ken/singers.csv' dlm=',';
      input number name $20.;
      if number=7;
    proc print;

    Obs  number  name
      1       7  Madonna

31 Either/Or

    data singers;
      infile '/home/ken/singers.csv' dlm=',';
      input number name $20.;
      if number=7 or name='Diana Ross';
    proc print;

    Obs  number  name
      1       4  Diana Ross
      2       7  Madonna

32 Both/And
Have multiple if lines:

    data singers;
      infile '/home/ken/singers.csv' dlm=',';
      input number name $20.;
      if number<7;
      if name<'C';
    proc print;

    Obs  number  name
      1       1  Bessie Smith
      2       3  Aretha Franklin

33 Selecting variables
if and delete select/omit individuals/observations. To select variables, use keep or drop:

    data singers;
      infile '/home/ken/singers.csv' dlm=',';
      input number name $20.;
      keep name;
    proc print;

    Obs  name
      1  Bessie Smith
      2  Peggy Lee
      3  Aretha Franklin
      4  Diana Ross
      5  Dolly Parton
      6  Tina Turner
      7  Madonna
      8  Mary J Blige
      9  Salt n Pepa
     10  Aaliyah

34 Getting rid of variables

    data singers;
      infile '/home/ken/singers.csv' dlm=',';
      input number name $20.;
      drop number;
    proc print;

    Obs  name
      1  Bessie Smith
      2  Peggy Lee
      3  Aretha Franklin
      4  Diana Ross
      5  Dolly Parton
      6  Tina Turner
      7  Madonna
      8  Mary J Blige
      9  Salt n Pepa
     10  Aaliyah
     11  Beyonce

35 Cloning a data set (pointless!)
Use set to bring in all the variables and individuals from another data set:

    data singers;
      infile '/home/ken/singers.csv' dlm=',';
      input number name $20.;

    data singers2;
      set singers;

singers2 is exactly the same as singers.
set is usually the first step to doing something else with the data.

36 A less pointless cloning
There is a point in combining set with keep, drop or if to copy only the individuals/variables you want.
Example: cars data, keep only those cars with mpg bigger than 30:

    data mycars;
      set '/home/ken/cars';
      if mpg>30;
    proc print;

37 High-gas-mileage cars
[proc print output: columns Obs, car, mpg, weight, cylinders, hp and country; 10 cars: Dodge Omni (U.S.), Fiat Strada (Italy), VW Rabbit (Germany), Plymouth Horizon (U.S.), Mazda GLC (Japan), VW Dasher (Germany), Dodge Colt (Japan), VW Scirocco (Germany), Datsun (Japan), Pontiac Phoenix (U.S.)]

38 Keep only car name and gas mileage

    data mycars;
      set '/home/ken/cars';
      keep car mpg;
    proc print;

39 Just two variables
[proc print output: columns Obs, car and mpg only, one row per car, starting Buick Skylark, Dodge Omni, Mercury Zephyr, Fiat Strada, ...]

40 Get rid of cylinders and hp

    data mycars;
      set '/home/ken/cars';
      drop cylinders hp;
    proc print;

41 Those two variables gone
[proc print output: columns Obs, car, mpg, weight and country, one row per car, starting Buick Skylark (U.S.), Dodge Omni (U.S.), Mercury Zephyr (U.S.), Fiat Strada (Italy), Peugeot 694 SL (France), ...]

42 Keeping only some individuals and variables
Any variables not keep-ed are dropped. Any variables not drop-ed are kept. So you only need one of keep and drop.
But you can combine either with if:

    data mycars;
      set '/home/ken/cars';
      keep car mpg;
      if weight>4;
    proc print;

[proc print output: columns Obs, car and mpg; two cars, Buick Estate Wagon and Ford Country Squire Wagon (mpg 15.5)]

Keeps only the car name and gas mileage for those cars weighing over 4 tons.

43 Kernel density curve on histogram
A kernel density curve smooths out a histogram and gives a sense of the shape of the distribution. Car mpgs:

    proc sgplot data='/home/ken/cars';
      histogram mpg;
      density mpg / type=kernel;

44 Histogram of MPGs with kernel density
[Figure: histogram of mpg with kernel density curve]

45 Comments
The kernel density has a wobble in the middle, suggesting that the data might be bimodal rather than unimodal.
This is pretty clear from the hole in the middle of the histogram.

46 Kernel density for car weights

    proc sgplot data='/home/ken/cars';
      histogram weight;
      density weight / type=kernel;

47 Comments
For MPGs, clear evidence of a bimodal shape. Cars seem to divide into low-mpg and high-mpg groups.
For weights, not so much evidence of bimodality. Looks more right-skewed.

48 Loess curve
Loess curve (note spelling) in SAS. Code like this:

    proc sgplot data='/home/ken/cars';
      scatter x=weight y=mpg;
      loess x=weight y=mpg;

49 Loess curve on plot
[Figure: scatterplot of mpg against weight with loess curve]

50 Distinguishing points by colours or symbols
Say we want to plot mpg by weight, with the points in different colours and symbols according to the number of cylinders. sgplot takes group= as an option, similar to ggplot:

    proc sgplot data='/home/ken/cars';
      scatter x=weight y=mpg / group=cylinders;

51 The plot
[Figure: scatterplot of mpg against weight, points grouped by cylinders]

52 Comments
Cars with different numbers of cylinders are distinguished by colour and shape.
Blue circles denote 4-cylinder cars, ... green crosses 8-cylinder.
Legend at the bottom, so you can see which colour/symbol is which.
Cars with more cylinders are heavier and have worse gas mileage.

53 Multiple series on one plot: the oranges data
Data file has a header line (row ages A B C D E), then the circumferences of 5 trees, each at 7 times.
Skip over the first line of the file; create a permanent data set:

    data '/home/ken/oranges';
      infile '/home/ken/oranges.txt' firstobs=2;
      input row age a b c d e;

54 Multiple series
Growth curve for each tree, joined by lines.
series joins points by lines.
markers displays the actual data points too.
Do each series one at a time.

    proc sgplot;
      series x=age y=a / markers;
      series x=age y=b / markers;
      series x=age y=c / markers;
      series x=age y=d / markers;
      series x=age y=e / markers;

55 The growth curves
[Figure: growth curves for the five trees]

56 Labelling points on a plot
The magic word here is datalabel. For example, to label each car on a scatterplot of MPG vs. weight with the name of the car:

    proc sgplot data='/home/ken/cars';
      scatter y=mpg x=weight / datalabel=car;

57 The plot
[Figure: scatterplot of mpg against weight, each point labelled with the car name]

58 Comments
Each car is labelled with its name, either left, right, above or below, whichever makes it clearest. (Some intelligence is applied to placement.)
Cars top left are "nimble": light in weight, good gas mileage.
Cars bottom right are "boats": heavy, with terrible gas mileage.

59 Labelling by country
Same idea:

    proc sgplot data='/home/ken/cars';
      scatter x=weight y=mpg / datalabel=country;

60 Labelled by country
[Figure: scatterplot of mpg against weight, each point labelled with the country]

61 Labelling only some of the observations
Create a new data set with all the old variables plus a new one that contains the text to plot.
For example, label the most fuel-efficient car (#4) and the heaviest car (#9).
The observation number is given by the SAS special variable _n_.
Note the syntax: if ... then do, followed by end.

    data cars2;
      set '/home/ken/cars';
      if (_n_=4 or _n_=9) then do;
        newtext=car;
      end;

For any cars not selected, newtext will be blank.
Then, using the new data set that we just created:

    proc sgplot;
      scatter x=weight y=mpg / datalabel=newtext;

62 The plot
[Figure: scatterplot of mpg against weight with cars #4 and #9 labelled]

63 Or label cars with mpg greater than 34

    data cars3;
      set '/home/ken/cars';
      if mpg>34 then do;
        newtext=car;
      end;

    proc sgplot;
      scatter x=weight y=mpg / datalabel=newtext;

64 High-mpg cars
[Figure: scatterplot of mpg against weight with the cars over 34 mpg labelled]

65 R stuff

66 More R stuff
R has a thousand tiny parts, all working together, but to use them, you need to know their names.
Sometimes you do know the name, but you forget how it works. Then (at the Console) type e.g. ?median or help(median). Help appears in R Studio at the bottom right.
Read in the cars data to use for examples later:

    cars=read.csv("cars.csv")
    str(cars)
    ## 'data.frame': 38 obs. of 6 variables:
    ## $ Car       : Factor w/ 38 levels "AMC Concord D/L",..: 7 18
    ## $ MPG       : num
    ## $ Weight    : num
    ## $ Cylinders : int
    ## $ Horsepower: int
    ## $ Country   : Factor w/ 6 levels "France","Germany",..:

67 Structure of help file
All R's help files are laid out the same way:
  Purpose: what the function does.
  Usage: how you make it go.
  Arguments: what you need to feed in. Arguments with a = have default values. If the default is OK (it often is), you don't need to specify it.
  Details: more information about how the function works.
  Value: what comes back from the function.
  References to the literature, so that you can find out exactly how everything was calculated.
  Examples. Run these using e.g. example(median).

68 If you don't know the name
Then you have to find it out!
If you know what it might be, use apropos(name):

    apropos("read")
    ##  [1] "readBin"          "readChar"         "readCitationFile"
    ##  [4] "read.csv"         "read.csv2"        "read.dcf"
    ##  [7] "read.delim"       "read.delim2"      "read.DIF"
    ## [10] "read.fortran"     "read.ftable"      "read.fwf"
    ## [13] "readline"         "readLines"        "readRDS"
    ## [16] ".readRDS"         "readRenviron"     "read.socket"
    ## [19] "read.table"       "spread"           "spread_"
    ## [22] "Sys.readlink"

and then you investigate more via help().
Google-searching, e.g.: r ggplot add horizontal line. This often turns up questions on stackexchange.com, which might be adapted to your needs.

69 That Google search
Looks like geom_hline(). Look it up in help as ?ggplot2::geom_hline.

70 Gas mileage against weight, basic

    g=ggplot(cars,aes(x=Weight,y=MPG))+geom_point() ; g

[Figure: scatterplot of MPG against Weight]

71 Add regression line

    g+geom_smooth(method="lm",se=F)

[Figure: scatterplot of MPG against Weight with regression line]

72 Calculate and plot means

    mean.weight=mean(cars$Weight)
    mean.mpg=mean(cars$MPG)
    g2=g+geom_smooth(method="lm",se=F)+
      geom_hline(yintercept=mean.mpg,colour="red")+
      geom_vline(xintercept=mean.weight,colour="darkgreen")

73 The plot

    g2

[Figure: scatterplot of MPG against Weight with regression line and lines marking the two means]

74 With title

    g+ggtitle("Gas mileage against weight")

[Figure: scatterplot of MPG against Weight, titled "Gas mileage against weight"]

75 Axis labels

    g+xlab("Weight (tons)")+ylab("MPG (miles per US gallon)")

[Figure: scatterplot with axes labelled "Weight (tons)" and "MPG (miles per US gallon)"]

76 Adding text to plot

    g+geom_text(aes(label=Car),hjust=-0.1,size=2)+
      xlim(1.8,5.0)

[Figure: scatterplot of MPG against Weight with each point labelled by car name; many labels overlap]

77 Comments
geom_text needs a label aesthetic to say what text to plot. It inherits the x and y from the ggplot.
hjust says where to put the labels relative to the points: 0.5 is centred over them, negative is on the right, greater than 1 is on the left.
vjust is similar, to move labels up and down (less than 0 or greater than 1 for above or below the points).
size controls the size of the text: 5 is the default (so this is smaller).
There is no obvious way to stop labels overlapping! But see over for a solution.
xlim changes the limits of the x-axis (to stop labels going off the side). Likewise ylim.

78 Non-overlapping labels
The key is to use package ggrepel, and geom_text_repel from that package instead of geom_text:

    library(ggrepel) # if not done already
    g+geom_text_repel(aes(label=Car),size=2)

[Figure: scatterplot of MPG against Weight with each point labelled by car name; labels moved so that they do not overlap]

79 Labelling only some points
Same idea as SAS: create a new variable in the data frame with the labels to plot, or empty, e.g. using mutate from dplyr:

    cars2=dplyr::mutate(cars,
      newlabel=ifelse(MPG>34,as.character(Car),""))
    g2=ggplot(cars2,aes(x=Weight,y=MPG))+geom_point()+
      geom_text(aes(label=newlabel),size=2,hjust=-0.1)

ifelse takes three things: something that can be true or false, the value if true, the value if false (like IF in a spreadsheet).
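To see what ifelse produces without needing the cars file, here is a minimal base-R sketch. The data frame toy and its values are made up for illustration; only the MPG>34 rule comes from the slide.

```r
# Hypothetical mini data set: car names and MPG values are invented
toy <- data.frame(Car = c("Car A", "Car B", "Car C"),
                  MPG = c(36.5, 18.2, 34.0),
                  stringsAsFactors = FALSE)

# Label is the car name when MPG exceeds 34, otherwise an empty string
toy$newlabel <- ifelse(toy$MPG > 34, toy$Car, "")
toy
```

Note that a car with MPG of exactly 34 gets no label, since the condition is strictly "greater than".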

80 The plot

    g2

[Figure: scatterplot of MPG against Weight with only Fiat Strada, Dodge Colt, Mazda GLC and Plymouth Horizon labelled]

81 Labelling points by group

    g3=ggplot(cars,aes(x=Weight,y=MPG,colour=Cylinders))+
      geom_point() ; g3

[Figure: scatterplot of MPG against Weight, coloured by Cylinders on a continuous colour scale]

82 Fixing it up
Only that isn't right: Cylinders isn't really on a continuous scale; it should be treated as a factor:

    g3=ggplot(cars,aes(x=Weight,y=MPG,
      colour=as.factor(Cylinders)))+
      geom_point() ; g3

[Figure: scatterplot of MPG against Weight, one colour for each value of as.factor(Cylinders)]

83 Adding new data: averages by cylinders
First make a data frame of the new data to add:

    tmp1=group_by(cars,Cylinders)
    summ=summarize(tmp1,mw=mean(Weight),mm=mean(MPG)) ; summ
    ## # A tibble: 4 x 3
    ##   Cylinders    mw    mm
    ##       <int> <dbl> <dbl>

then, to plot the averages on the graph, add a new geom_point with a new data frame:

    g4=g3+geom_point(data=summ,aes(x=mw,y=mm,
      colour=as.factor(Cylinders)),shape=3)

84 The plot, group mean marked by +

    g4

[Figure: scatterplot of MPG against Weight coloured by as.factor(Cylinders), with each group mean marked by +]

85 Multiple series on one plot
Oranges data frame:

    oranges=read.table("oranges.txt",header=T)
    oranges
    ##   row ages A B C D E

Each column is the circumference at a given time.
Want to plot each column against time, labelled.

86 Organizing the data
The ggplot way is to put all the circumferences in one column, labelled by which tree they come from, and then plot them using tree as the group. This uses gather from tidyr:

    orange.long=gather(oranges,tree,circum,A:E)
    head(orange.long,8)
    ##   row ages tree circum
    ## A 30
    ## A 51
    ## A 75
    ## A 108
    ## A 115
    ## A 139
    ## A 140
    ## B
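What "putting everything in one column" means can be spelled out by hand in base R. This is only a sketch of the reshaping idea, not what gather does internally; the data frame wide and its numbers are made up, with two trees rather than five.

```r
# Hypothetical wide data: one row per time, one column per tree (values invented)
wide <- data.frame(ages = c(100, 200, 300),
                   A = c(10, 20, 30),
                   B = c(11, 22, 33))

# Long format: the ages repeat once per tree, tree says which column
# each value came from, circum holds the stacked values
long <- data.frame(ages = rep(wide$ages, times = 2),
                   tree = rep(c("A", "B"), each = nrow(wide)),
                   circum = c(wide$A, wide$B))
long
```

Three rows and two trees give six rows in long format; gather does the same bookkeeping in one call.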

87 The plot, joining points by lines

    g5=ggplot(orange.long,aes(x=ages,y=circum,colour=tree))+
      geom_point()+geom_line(); g5

[Figure: circum against ages, one coloured line per tree (A to E)]

88 Faceting
Another way to plot the orange tree growth curves is to put each on a separate plot.
In ggplot the separate graphs are called facets, and to get them, you add facet_wrap to the plot, with, inside, what distinguishes the facets, thus:

    g6=g5+facet_wrap(~tree)

Or, for the car data, plot gas mileage against weight for each country separately:

    g7=ggplot(cars,aes(x=Weight,y=MPG))+geom_point()+
      facet_wrap(~Country)

89 Growth curves by tree

    g6

[Figure: circum against ages, one facet per tree (A to E)]

90 Car MPG by weight for each country

    g7

[Figure: MPG against Weight, one facet per country (France, Germany, Italy, Japan, Sweden, U.S.)]

91 Plotting against several variables
Another use for faceting is to plot one y-variable (say MPG) against several x-variables at once. (Did this for the asphalt data before.) Here we plot MPG against Weight, Cylinders and Horsepower.
Strategy: put all the x's in one column (using gather) and keep another column with the names of the x's. Plot y against the combined x's, faceted by the names of the x's.
The x's will be on different scales; account for this:

    cars.3=gather(cars,xname,x,Weight:Horsepower)
    g8=ggplot(cars.3,aes(x=x,y=MPG))+geom_point()+
      facet_wrap(~xname,scales="free_x")

92 The plot(s): all negative correlations

    g8

[Figure: MPG against x, with facets Cylinders, Horsepower and Weight]

93 With regression lines

    g8+geom_smooth(method="lm",se=F)

[Figure: MPG against x with regression lines, with facets Cylinders, Horsepower and Weight]

94 A last variation: separate graphs for levels of a factor
The same faceting idea allows us to produce an array of plots, one for each combination of levels of factors. Here we plot MPG against Weight for each combination of Country (across) and number of Cylinders (up, treated as a factor):

    g9=ggplot(cars,aes(x=Weight,y=MPG))+geom_point()+
      facet_grid(Cylinders~Country)

Can also put only one factor in facet_grid to arrange the facets up and down or across. With facet_wrap, you don't control the structure of the display.

95 The plot

    g9

[Figure: grid of plots of MPG against Weight, countries (France, Germany, Italy, Japan, Sweden, U.S.) across and numbers of cylinders down]

96 Kernel density curve
As we saw in SAS, this is a way of smoothing a histogram to understand the underlying shape.
The ggplot histogram has the bin width all wrong:

    ggplot(cars,aes(x=MPG))+geom_histogram()
    ## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

[Figure: histogram of MPG (count) with the default 30 bins]

Fix up the bin width, and add a kernel density with geom_density(). Also note that the y-scale will be the (computed) density, not the count.

97 Histogram of MPG with kernel density

    ggplot(cars,aes(x=MPG))+
      geom_histogram(aes(y=..density..),
        binwidth=2.5)+geom_density()

[Figure: histogram of MPG on the density scale with kernel density curve]

98 Histogram of weight with density curve
The histogram of MPG is clearly bimodal. What about weight? Not so much.

    ggplot(cars,aes(x=Weight))+
      geom_histogram(aes(y=..density..),
        binwidth=0.5)+geom_density()

[Figure: histogram of Weight on the density scale with kernel density curve]

99 Normal quantile plot
A histogram or (especially) a boxplot doesn't give a focused assessment of whether a distribution is normal. Need a normal quantile plot.
Plot the data values against what you'd expect if a normal distribution were correct.
If normal is correct, get a straight line. If not, get a curve.
ggplot has stat_qq for this, which goes this way (car weights):

    qq=ggplot(cars,aes(sample=Weight))+stat_qq()

100 Normal quantile plot for Weight

    qq

[Figure: normal quantile plot of Weight, sample against theoretical]

101 Comments
The plot has no qqline!
If the data were perfectly normal, the values would lie exactly on a straight line.
The data stray off straight a bit at the ends: the low values especially are too big/bunched up for normal.
Weights are not normal.
But a line makes it much easier to judge. How might we draw one?

102 Figuring out qqline
The qqline on R's other normal quantile plot goes through the observed and theoretical quartiles.
quantile gets percentiles of data, for example:

    y=quantile(cars$Weight,c(0.25,0.75)) ; y
    ## 25% 75%

qnorm gets percentiles of the standard normal:

    x=qnorm(c(0.25,0.75)) ; x
    ## [1] -0.6744898  0.6744898

I used y for data and x for theoretical since that's how they appear on the graph.

103 Figuring out qqline (2)
The slope of the line joining these is

    slope=(y[2]-y[1])/(x[2]-x[1]) ; slope
    ## 75%

The intercept is

    int=y[1]-slope*x[1] ; int
    ## 25%

geom_abline() draws a line with specified intercept and slope.
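A quick check that this slope and intercept really give the line through the two quartile points. A sketch only: vals is simulated data (the seed, mean and SD are arbitrary choices), standing in for the car weights.

```r
set.seed(1)                            # arbitrary seed, for reproducibility
vals <- rnorm(50, mean = 10, sd = 2)   # made-up data, not the car weights

y <- quantile(vals, c(0.25, 0.75))     # observed quartiles
x <- qnorm(c(0.25, 0.75))              # theoretical quartiles

slope <- unname((y[2] - y[1]) / (x[2] - x[1]))
int <- unname(y[1] - slope * x[1])

# Height of the line at the two theoretical quartiles:
# should reproduce the two observed quartiles
fitted <- int + slope * x
all.equal(fitted, unname(y))
```

unname strips the "25%" and "75%" labels that the results would otherwise inherit from quantile.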

104 Making this into a function
Make this into a function so that we can use it repeatedly. Generous use of copy/paste!

    qqplot=function(vals) {
      y=quantile(vals,c(0.25,0.75))
      x=qnorm(c(0.25,0.75))
      slope=(y[2]-y[1])/(x[2]-x[1])
      int=y[1]-slope*x[1]
      d=data.frame(vals=vals)
      ggplot(d,aes(sample=vals))+stat_qq()+
        geom_abline(slope=slope,intercept=int)
    }

Make sure you understand what each line of the function does, and why it's there.

105 Testing on car weights

    qqplot(cars$Weight)

[Figure: normal quantile plot of car weights with line, sample against theoretical]

106 Making normal quantile plot of actually normal data
How much deviation from the line might there be if the data really are normal? Generate some random normal data and find out:

    z=rnorm(100)
    qq=qqplot(z)

See (over) that:
  the overall pattern of points is straight, not curved
  the points at the extremes are not drifting away from the line.

107 Normal quantile plot for genuinely normal data

    qq

[Figure: normal quantile plot of the simulated normal data with line]

108 Right-skewed data
The gamma distribution is skewed to the right:

    g=rgamma(1000,2,2)
    gam=data.frame(g=g)
    ggplot(gam,aes(x=g))+geom_histogram(binwidth=0.2)

[Figure: histogram of g (count)]

Assess normality thus:

    qq=qqplot(g)

109 Normal quantile plot for gamma data

    qq

[Figure: normal quantile plot of the gamma data with line]

110 Comments
Seriously non-normal!
Big-time curve on the plot; the points don't follow a line at all.
Observations at the top end are too spread out for normal.
Observations at the bottom end are too bunched up for normal.
Skewness is in the direction of the spread-out values: skewed right.

111 Car MPGs
The distribution had a hole in the middle, with some low MPGs and some high ones: not normal.
How does this show up on a normal quantile plot?

    qq=qqplot(cars$MPG)

112 Normal quantile plot for car MPG

    qq

[Figure: normal quantile plot of car MPG with line]

113 Comments
The hole shows up as a vertical gap.
Almost an S-bend in the data values.
High ones not high enough. Low ones not low enough.
Data too bunched up to be normal (short tails).

114 Functions: the geometric distribution
Recall the binomial distribution, e.g. toss a coin 10 times and count how many heads (W). In general, prob. of success = p on every independent trial. Fixed # trials, W is # successes.
Another angle: how many trials to get my first success?
The random variable is now # trials (denote it X); # successes is fixed (= 1). This is the geometric distribution.
  P(X = 1) = p (success first time).
  P(X = 2) = (1 - p) p (fail, then succeed).
  P(X = 3) = (1 - p)^2 p (fail 2 times, then succeed).
  P(X = n) = (1 - p)^{n-1} p (fail n - 1 times, then succeed).
Implement in R.

115 Writing a geometric probability function
Input: # trials whose prob. we want, x; single-trial success prob. p.
Output: probability of succeeding for the 1st time after exactly x trials (a number).
One-liner:

    geometric=function(x,p) p*(1-p)^(x-1)

Or with curly brackets:

    geometric=function(x,p) {
      p*(1-p)^(x-1)
    }

Testing:

    geometric(1,0.4)
    ## [1] 0.4

Prob. of succeeding first time is the same as p: good.

116 Errors
Chance of first success on the second trial? Fail, then succeed:

    geometric(2,0.4)
    ## [1] 0.24

(0.6)(0.4) = 0.24.
What if the user gives p outside of [0, 1], or x less than 1? The function should die with an error. As written, it gives a nonsense answer instead:

    geometric(0,0.5)
    ## [1] 1
    geometric(2,1.1)
    ## [1] -0.11

Ugh! Catch that first.

117 Catching errors
stopifnot: feed it some logical conditions, and it stops operation of the function if any condition is false. (If all are true, nothing happens.)
If any condition is false, R tells you which one.
3 things to check: p 0 or bigger, p 1 or smaller, x 1 or bigger:

    geometric=function(x,p) {
      stopifnot(p>=0,p<=1,x>=1)
      p*(1-p)^(x-1)
    }
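To look at the error that stopifnot raises without halting a script, one option is to wrap the call in tryCatch. This is a sketch, not part of the original slides; the variable name msg is an illustrative choice.

```r
geometric <- function(x, p) {
  stopifnot(p >= 0, p <= 1, x >= 1)
  p * (1 - p)^(x - 1)
}

# Capture the error message instead of stopping
msg <- tryCatch(geometric(0, 0.5),
                error = function(e) conditionMessage(e))
msg

# Valid input passes all three checks and returns the probability
ok <- geometric(2, 0.5)
ok
```

The message names the first condition that failed, which is how you find out what went wrong.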

118 Testing
Test:

    geometric(2,0.5)
    ## [1] 0.25
    geometric(0,0.5)
    ## Error: x >= 1 is not TRUE
    geometric(2,1.1)
    ## Error: p <= 1 is not TRUE

The last two fail, and stopifnot tells you why.

119 Calling geometric with vector x
What happens? Try it and see.

    geometric(1:5,0.5)
    ## [1] 0.50000 0.25000 0.12500 0.06250 0.03125

Probabilities of the first success taking 1, 2, 3, ... trials.
Works because of how R handles vector arithmetic.
R freebie: you often get vector output from vector input with no extra coding.
The above gives the ingredients for "first success in 5 trials or less": calculate the probs of 1 to 5, then add up:

    sum(geometric(1:5,0.5))
    ## [1] 0.96875

120 Function input
If we use the function as above, we have to get the inputs in the right order:

    geometric(2,0.8)
    ## [1] 0.16
    geometric(0.8,2)
    ## Error: p <= 1 is not TRUE

The second one fails because it thinks 2 is the success probability.
But if we use the names, we can use any order:

    geometric(x=2,p=0.8)
    ## [1] 0.16
    geometric(p=0.8,x=2)
    ## [1] 0.16

121 Defaults
What if I write the function like this?

    geometric=function(x,p=0.5) {
      stopifnot(p>=0,p<=1,x>=1)
      p*(1-p)^(x-1)
    }

If I call it without a value of p, shouldn't I get an error?

    geometric(x=3)
    ## [1] 0.125

It works, because if I don't give a value for p, it uses the one in the function line, a default.
Many R functions have defaults that give reasonable behaviour without your having to worry about details.

122 Cumulative probabilities as function
Might be useful to have a function for cumulative probabilities.
Strategy: get the individual probs as far as you wish to go, then add up.
E.g. probability of 4 or less: need 1 through 4.
In general, x or less with success prob. p:

    c.geometric=function(x,p) {
      probs=geometric(1:x,p)
      sum(probs)
    }

Easy to write, using our geometric function and stuff in R.

123 Testing c.geometric
Try the one we just did:

    c.geometric(5,0.5)
    ## [1] 0.96875

The answer we had before. How about this:

    c.geometric(20,0.1)
    ## [1] 0.8784233

If the success probability is only 0.1, it might even take longer than 20 trials to get the first success. So this is reasonable.
The mean number of trials until the 1st success is 1/p:
  p = 0.5: mean # trials is 1/0.5 = 2.
  p = 0.1: mean # trials is 1/0.1 = 10.
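The 1/p claim can be checked numerically with the same geometric function: the mean is the sum of x times P(X = x). A sketch, assuming it is good enough to cut the infinite sum off at 1000 terms (an arbitrary choice; the terms beyond that are negligible for these p).

```r
geometric <- function(x, p) p * (1 - p)^(x - 1)

x <- 1:1000   # truncation point chosen for illustration
mean_half <- sum(x * geometric(x, 0.5))    # should be close to 1/0.5
mean_tenth <- sum(x * geometric(x, 0.1))   # should be close to 1/0.1
c(mean_half, mean_tenth)
```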

Using R's geometric calculator

Called pgeom:

    c.geometric(5,0.5)
    ## [1] 0.96875
    c.geometric(20,0.1)
    ## [1] 0.8784233
    pgeom(5,0.5)
    ## [1] 0.984375
    pgeom(20,0.1)
    ## [1] 0.890581

Oh. Not the same. Look in help for pgeom: this is other version of geometric, where you count how many failures happened before 1st success (#trials minus 1). So we need (compare the c.geometric results above):

    pgeom(4,0.5)
    ## [1] 0.96875
    pgeom(19,0.1)
    ## [1] 0.8784233
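The "minus 1" relationship holds for every x, and for the individual probabilities (dgeom) as well as the cumulative ones (pgeom). A short sketch that checks it over a range of values, with our two functions restated as an assumption:

```r
# Restated definitions (assumed to match the slides)
geometric <- function(x, p) {
  stopifnot(p >= 0, p <= 1, x >= 1)
  p * (1 - p)^(x - 1)
}
c.geometric <- function(x, p) sum(geometric(1:x, p))

x <- 1:20
p <- 0.1

# Ours counts trials; dgeom and pgeom count failures before the first success
all.equal(geometric(x, p), dgeom(x - 1, p))
all.equal(sapply(x, c.geometric, p = p), pgeom(x - 1, p))
```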

Another way of writing cumulative geometric

Suppose we hadn't thought to try a vector for x. What then? Calculate each probability in turn, add on to a running total, return total at end. Uses a loop:

    c2.geometric=function(x,p) {
      total=0
      for (i in 1:x) {
        prob=geometric(i,p)
        total=total+prob
      }
      total
    }

Checking

    c2.geometric(5,0.5)
    ## [1] 0.96875
    c.geometric(5,0.5)
    ## [1] 0.96875
    c2.geometric(20,0.1)
    ## [1] 0.8784233
    c.geometric(20,0.1)
    ## [1] 0.8784233

Same as before.
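A third route, not on the slides, is cumsum, which keeps the running total at every step instead of only the final one. A sketch, with the loop version restated so the two can be compared:

```r
# Restated definitions (assumed to match the slides)
geometric <- function(x, p) {
  stopifnot(p >= 0, p <= 1, x >= 1)
  p * (1 - p)^(x - 1)
}
c2.geometric <- function(x, p) {
  total <- 0
  for (i in 1:x) {
    total <- total + geometric(i, p)
  }
  total
}

# Running totals: element k is P(first success in k trials or less)
cum <- cumsum(geometric(1:20, 0.1))
cum[5]
cum[20]
all.equal(cum[20], c2.geometric(20, 0.1))
```

This gives all 20 cumulative probabilities from one pass, which is handy if several of them are needed.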

Selecting stuff in R

Use dplyr

Easiest way to select parts of data frame is to use dplyr tools. Use cars data for example:

    str(cars)
    ## 'data.frame': 38 obs. of 6 variables:
    ## $ Car : Factor w/ 38 levels "AMC Concord D/L",..: 7
    ## $ MPG : num
    ## $ Weight : num
    ## $ Cylinders : int
    ## $ Horsepower: int
    ## $ Country : Factor w/ 6 levels "France","Germany",..:

Selecting columns

The base R way:

    cars$Cylinders

With select:

    select(cars,Cylinders)
    ## Cylinders
    ## 1 4
    ## 2 4
    ## 3 6
    ## 4 4
    ## 5 6
    ## 6 4
    ## 7 4
    ## 8 4
    ## 9 8
    ## 10 5
    ## 11 8
    ## 12 6
    ## 13 4
    ## 14 4
    ## 15 4
    ## 16 6
    ## 17 4
    ## 18 6
    ## 19 6
    ## 20 8

Columns by number

select also takes a column number. For example, Cylinders is column number 4:

    select(cars,4)
    ## Cylinders
    ## 1 4
    ## 2 4
    ## 3 6
    ## 4 4
    ## 5 6
    ## 6 4
    ## 7 4
    ## 8 4
    ## 9 8
    ## 10 5
    ## 11 8
    ## 12 6
    ## 13 4
    ## 14 4
    ## 15 4

Selecting rows

By logical condition using filter, eg. cars with MPG greater than 34:

    filter(cars,MPG>34)
    ## Car MPG Weight Cylinders Horsepower Country
    ## 1 Fiat Strada Italy
    ## 2 Plymouth Horizon U.S.
    ## 3 Mazda GLC Japan
    ## 4 Dodge Colt Japan

By row number(s) using slice, eg. Fiat Strada, row 4:

    slice(cars,4)
    ## Car MPG Weight Cylinders Horsepower Country
    ## 1 Fiat Strada Italy

or rows 3 and 5:

    slice(cars,c(3,5))
    ## Car MPG Weight Cylinders Horsepower Country
    ## 1 Mercury Zephyr U.S.
    ## 2 Peugeot 694 SL France

Rows and columns, the base R way

Use an empty row or column number to select a whole row or column (by number):

4th row:

    cars[4,]
    ## Car MPG Weight Cylinders Horsepower Country
    ## 4 Fiat Strada Italy

2nd column (all the MPG values):

    cars[,2]
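The cars data frame on these slides is the lecturer's own file; the cars that ships with R is a different data set (speed and stopping distance), so the sketch below uses the built-in mtcars instead. That substitution is an assumption made only so the code runs anywhere; the indexing rules are the same.

```r
# mtcars is built into R; column 1 is mpg, column 2 is cyl
mtcars[4, ]            # 4th row, all columns
head(mtcars[, 1])      # 1st column, all rows (first few values)
mtcars[c(3, 5), ]      # rows 3 and 5

# Rows by logical condition and columns by name in one step
high_mpg <- mtcars[mtcars$mpg > 30, c("mpg", "cyl")]
high_mpg
```

Leaving the slot before or after the comma empty means "everything" in that direction, which is what makes cars[4,] and cars[,2] work.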

Multiple selections

For example, names and MPGs of cars with MPG over 34:

    tmp=filter(cars,MPG>34)
    select(tmp,c(Car,MPG))
    ## Car MPG
    ## 1 Fiat Strada 37.3
    ## 2 Plymouth Horizon 34.2
    ## 3 Mazda GLC 34.1
    ## 4 Dodge Colt 35.1

(two selections one after the other, with first stored in temporary data frame)

Order here does not matter, but if we wanted name and MPG of cars with 6 cylinders, must do filter first; else, after select, no column called Cylinders left.

Or, this way

or like this (same selection):

    cars %>% filter(MPG>34) %>% select(c(Car,MPG))
    ## Car MPG
    ## 1 Fiat Strada 37.3
    ## 2 Plymouth Horizon 34.2
    ## 3 Mazda GLC 34.1
    ## 4 Dodge Colt 35.1

Symbol %>% is called the pipe. Read above as "take cars, and then take the rows where MPG bigger than 34, and then take columns called Car and MPG".

Comparing code with and without pipe

Without pipe (original way):

    tmp=filter(cars,MPG>34)
    select(tmp,c(Car,MPG))

With pipe:

    cars %>% filter(MPG>34) %>% select(c(Car,MPG))

In a pipe, the first (data frame) argument of each function disappears. Data frame used is whatever came out of the previous step. Code with pipe is more concise and uses no temporary variables.
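That the two forms really are the same thing can be checked directly. A sketch using the built-in mtcars as a stand-in for the cars file (an assumption, so the block is self-contained); it needs the dplyr package installed.

```r
library(dplyr)

# Without pipe: temporary data frame between the two steps
tmp <- filter(mtcars, mpg > 30)
without_pipe <- select(tmp, mpg, cyl)

# With pipe: each step receives the previous result as its first argument
with_pipe <- mtcars %>% filter(mpg > 30) %>% select(mpg, cyl)

identical(without_pipe, with_pipe)
```

In other words, x %>% f(y) is just another way of writing f(x, y).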

Another example

Pipe way of selecting gas mileage (column 2) of Fiat Strada (row 4):

    cars %>% select(2) %>% slice(4)
    ## MPG
    ## 1 37.3

Pipe comes with dplyr and can be used with any function that takes a data frame first:

    cars %>% filter(MPG<30) %>% head()
    ## Car MPG Weight Cylinders Horsepower Country
    ## 1 Buick Skylark U.S.
    ## 2 Mercury Zephyr U.S.
    ## 3 Peugeot 694 SL France
    ## 4 Buick Estate Wagon U.S.
    ## 5 Audi Germany
    ## 6 Chevy Malibu Wagon U.S.

And, or

Combine multiple conditions in filter using & for "and" and | for "or". Cars that weigh more than 4 tons and have gas mileage less than 20:

    filter(cars,Weight>4 & MPG<20)
    ## Car MPG Weight Cylinders Horsepower Country
    ## 1 Buick Estate Wagon U.S.
    ## 2 Ford Country Squire Wagon U.S.

Can also do "and" as two filters, one after the other:

    cars %>% filter(Weight>4) %>% filter(MPG<20)
    ## Car MPG Weight Cylinders Horsepower Country
    ## 1 Buick Estate Wagon U.S.
    ## 2 Ford Country Squire Wagon U.S.

Or example

Cars that either weigh more than 4 tons or have gas mileage less than 20:

    filter(cars,Weight>4 | MPG<20)
    ## Car MPG Weight Cylinders Horsepower Country
    ## 1 Peugeot 694 SL France
    ## 2 Buick Estate Wagon U.S.
    ## 3 Chevy Malibu Wagon U.S.
    ## 4 Dodge Aspen U.S.
    ## 5 Chrysler LeBaron Wagon U.S.
    ## 6 AMC Concord D/L U.S.
    ## 7 Ford LTD U.S.
    ## 8 Volvo 240 GL Sweden
    ## 9 Dodge St Regis U.S.
    ## 10 Ford Country Squire Wagon U.S.
    ## 11 Mercury Grand Marquis U.S.
    ## 12 Chevy Caprice Classic U.S.
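The difference between "and" and "or" is easy to check by counting rows. A sketch on the built-in mtcars (a stand-in for the cars file, with cutoffs chosen only for illustration), needing dplyr:

```r
library(dplyr)

# "and": both conditions must hold (wt is in thousands of pounds in mtcars)
both <- mtcars %>% filter(wt > 4 & mpg < 15)

# Conditions separated by commas inside filter are also combined with "and"
both_comma <- mtcars %>% filter(wt > 4, mpg < 15)

# "or": at least one condition holds, so never fewer rows than "and"
either <- mtcars %>% filter(wt > 4 | mpg < 15)

identical(both, both_comma)
c(and = nrow(both), or = nrow(either))
```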

More selections

Which countries do the 8-cylinder cars come from?

    cars %>% filter(Cylinders==8) %>% select(Country)
    ## Country
    ## 1 U.S.
    ## 2 U.S.
    ## 3 U.S.
    ## 4 U.S.
    ## 5 U.S.
    ## 6 U.S.
    ## 7 U.S.
    ## 8 U.S.

All from the US.

Yet more selections

Gas mileages of 8-cylinder cars?

    cars %>% filter(Cylinders==8) %>% select(MPG)
    ## MPG

All bad.

How many cylinders do the high-MPG cars have?

Define "high" as 30 or more:

    cars %>% filter(MPG>=30) %>% select(Cylinders)
    ## Cylinders
    ## 1 4
    ## 2 4
    ## 3 4
    ## 4 4
    ## 5 4
    ## 6 4
    ## 7 4
    ## 8 4
    ## 9 4
    ## 10 4
    ## 11 4

All 4. Not a surprise. (Conditional distribution of number of cylinders given that MPG is 30 or more.)

Not

How many cars not from the US? This is a filter too, but we have an extra step to count them:

    cars %>% filter(Country!="U.S.") %>% summarize(n=n())
    ## n
    ## 1 16

16 of 38 cars are not from US. Or see which other countries we have, and how many of each:

    cars %>% filter(Country!="U.S.") %>% group_by(Country) %>% summarize(n=n())
    ## # A tibble: 5 x 2
    ## Country n
    ## <fctr> <int>
    ## 1 France 1
    ## 2 Germany 5
    ## 3 Italy 1
    ## 4 Japan 7
    ## 5 Sweden 2

Doing things all at once using dplyr

Doing things all at once

R very good at applying things to entire data frames, vectors. For example, calculating means by rows or columns. If you're a programmer, might do these tasks using loops. But no need in R: dplyr has all you need.

The orange trees again

Go back to orange tree circumferences:

    oranges
    ## row ages A B C D E

Row means

Row means: dplyr, group by rows (there are n() of them), then calculate the means of columns A through E for each group (row):

    oranges %>% group_by(1:n()) %>% mutate(m=mean(c(A,B,C,D,E)))
    ## Source: local data frame [7 x 9]
    ## Groups: 1:n() [7]
    ##
    ## row ages A B C D E 1:n() m
    ## <int> <int> <int> <int> <int> <int> <int> <int> <dbl>

Extra column m contains row means (mean circumference at each time).
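Two other ways to get row means are sketched below: base R's rowMeans, and dplyr's rowwise with c_across (which assumes dplyr 1.0 or later). The small trees data frame is made up for illustration and is not the orange tree data. One thing to watch: inside a calculation on a single row, A:C means the integer sequence from the value of A to the value of C, so the values have to be collected with c(A, B, C) or c_across(A:C) before taking the mean.

```r
library(dplyr)

# Hypothetical measurements, three trees at three ages
trees <- data.frame(
  age = c(100, 200, 300),
  A = c(30, 58, 87),
  B = c(33, 69, 111),
  C = c(30, 51, 75)
)

# Base R: mean across the chosen columns for every row
trees$m_base <- rowMeans(trees[, c("A", "B", "C")])

# dplyr: treat each row as its own group, then collect A through C
trees <- trees %>%
  rowwise() %>%
  mutate(m_dplyr = mean(c_across(A:C))) %>%
  ungroup()

trees
```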

Column medians

Column medians: use summarize_each thus:

    oranges %>% summarize_each(funs(median),A:E)
    ## A B C D E

The function to calculate for each column goes inside funs, and the columns to find the median for go after that. Column medians are actually the 4th number in each column, since values are in order. Same method for column-anything.
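In current versions of dplyr, summarize_each and funs are deprecated, and the same calculation is written with across. A sketch, assuming dplyr 1.0 or later and using a made-up trees data frame rather than the orange tree data:

```r
library(dplyr)

# Hypothetical measurements, three trees at three ages
trees <- data.frame(
  age = c(100, 200, 300),
  A = c(30, 58, 87),
  B = c(33, 69, 111),
  C = c(30, 51, 75)
)

# Median of each of the columns A through C
med <- trees %>% summarize(across(A:C, median))
med

# Base R check: apply median to each selected column
sapply(trees[, c("A", "B", "C")], median)
```

Swapping median for mean, sd and so on gives the other "column-anything" summaries.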

A more tricky one

The first quartile Q1 for each row:

    oranges %>% group_by(1:n()) %>% mutate(q1=quantile(c(A,B,C,D,E),probs=0.25))
    ## Source: local data frame [7 x 9]
    ## Groups: 1:n() [7]
    ##
    ## row ages A B C D E 1:n() q1
    ## <int> <int> <int> <int> <int> <int> <int> <int> <dbl>

Feed all the variables you want quartiles for into quantile, and then say which quantile you want.

Means etc. by groups

Back to cars: mean MPG (quantitative) for each Country (categorical). aggregate will do this, but so will dplyr:

    cars %>% group_by(Country) %>% summarize(m=mean(MPG), s=sd(MPG))
    ## # A tibble: 6 x 3
    ## Country m s
    ## <fctr> <dbl> <dbl>
    ## 1 France NA
    ## 2 Germany
    ## 3 Italy NA
    ## 4 Japan
    ## 5 Sweden
    ## 6 U.S.
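For comparison, here is the aggregate route next to the dplyr one, sketched on the built-in mtcars with cyl as the grouping variable (a stand-in for Country, so the block is self-contained). It also shows where an NA standard deviation comes from: sd of a single value is NA, which is what happens for any group containing only one car.

```r
library(dplyr)

# Base R: mean mpg for each value of cyl
by_base <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)

# dplyr: several summaries at once, one row per group
by_dplyr <- mtcars %>%
  group_by(cyl) %>%
  summarize(n = n(), m = mean(mpg), s = sd(mpg))

by_base
by_dplyr

# A group of size 1 has no standard deviation
sd(5)
```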

Means by groups (2)

For combination of categorical variables, put them all in the group_by, eg. by Country and Cylinders:

    cars %>% group_by(Country,Cylinders) %>%
      summarize(n=n(),m=mean(MPG),s=sd(MPG))
    ## Source: local data frame [11 x 5]
    ## Groups: Country [?]
    ##
    ## Country Cylinders n m s
    ## <fctr> <int> <int> <dbl> <dbl>
    ## 1 France NA
    ## 2 Germany
    ## 3 Germany NA
    ## 4 Italy NA
    ## 5 Japan
    ## 6 Japan NA
    ## 7 Sweden NA
    ## 8 Sweden NA
    ## 9 U.S.
    ## 10 U.S.
    ## 11 U.S.

What happens with function returning several values?

Function quantile returns 5-number summary by default:

    quantile(cars$MPG)
    ## 0% 25% 50% 75% 100%

What happens with summarize then?

    cars %>% group_by(Country) %>% summarize(q=quantile(MPG))
    ## Error in eval(expr, envir, enclos): expecting a single value

We have to work around this, as shown on next page.

Handling function returning several values

This arcane code:

    cars %>% group_by(Country) %>%
      do(q=quantile(.$MPG)) %>%
      do(data.frame(
        ctry=.$Country,which=names(.$q),value=.$q
      ))
    ## Source: local data frame [30 x 3]
    ## Groups: <by row>
    ##
    ## # A tibble: 30 x 3
    ## ctry which value
    ## * <fctr> <fctr> <dbl>
    ## 1 France 0% 16.2
    ## 2 France 25% 16.2
    ## 3 France 50% 16.2
    ## 4 France 75% 16.2
    ## 5 France 100% 16.2
    ## 6 Germany 0% 20.3
    ## 7 Germany 25% 21.5
    ## 8 Germany 50% 30.5
    ## 9 Germany 75% 31.5
    ## 10 Germany 100% 31.9
    ## #... with 20 more rows

Comments

Key part of code is to use do twice: first time to construct a variable holding all the quantiles (5 of them), which does this:

    cars %>% group_by(Country) %>%
      do(q=quantile(.$MPG))
    ## Source: local data frame [6 x 2]
    ## Groups: <by row>
    ##
    ## # A tibble: 6 x 2
    ## Country q
    ## * <fctr> <list>
    ## 1 France <dbl [5]>
    ## 2 Germany <dbl [5]>
    ## 3 Italy <dbl [5]>
    ## 4 Japan <dbl [5]>
    ## 5 Sweden <dbl [5]>
    ## 6 U.S. <dbl [5]>

Comments (2)

Second time to pull out those values, by constructing a data frame containing their names (which percentile) and their values, labelled by country, producing this (summary):

    cars %>% group_by(Country) %>%
      do(q=quantile(.$MPG)) %>%
      do(data.frame(
        ctry=.$Country,which=names(.$q),value=.$q
      )) %>% str()
    ## Classes 'rowwise_df', 'tbl_df', 'tbl' and 'data.frame': 30 obs. of
    ## $ ctry : Factor w/ 6 levels "France","Germany",..:
    ## $ which: Factor w/ 5 levels "0%","100%","25%",..:
    ## $ value: num
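Newer versions of dplyr offer a shorter route to the same long table. The sketch below assumes dplyr 1.1 or later, where reframe allows a summary to return several rows per group; it uses the built-in mtcars grouped by cyl as a stand-in for cars grouped by Country.

```r
library(dplyr)

# One row per group per percentile: names in which, values in value
q_long <- mtcars %>%
  group_by(cyl) %>%
  reframe(
    which = names(quantile(mpg)),
    value = unname(quantile(mpg))
  )

q_long
```

With 3 groups and 5 percentiles each, the result has 15 rows, in the same layout as the two-step do version.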

Displaying it better

Not the clearest display. We could put the percentiles in columns. The inverse of gather, which is spread, does this:

    cars %>% group_by(Country) %>%
      do(q=quantile(.$MPG)) %>%
      do(data.frame(
        ctry=.$Country,which=names(.$q),value=.$q
      )) %>% spread(which,value)
    ## # A tibble: 6 x 6
    ## ctry 0% 100% 25% 50% 75%
    ## * <fctr> <dbl> <dbl> <dbl> <dbl> <dbl>
    ## 1 France
    ## 2 Germany
    ## 3 Italy
    ## 4 Japan
    ## 5 Sweden
    ## 6 U.S.

spread seems to have put the percentiles in the wrong order. This is more trouble than it's worth to fix!
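The odd column order arises because which is a factor whose levels are sorted as text, so "100%" sorts before "25%". If the order matters, one way around it is to build the wide table in base R, where the percentile names keep the order quantile gave them. A sketch on the built-in mtcars grouped by cyl (a stand-in for cars grouped by Country):

```r
# Split mpg by group, take the 5-number summary of each piece,
# and transpose so that groups are rows and percentiles are columns
q_wide <- t(sapply(split(mtcars$mpg, mtcars$cyl), quantile))
q_wide

# Percentile names in their natural order, and as sorted text
colnames(q_wide)
sort(colnames(q_wide))
```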

And even...

Five-number summary of MPG by Country-Cylinders combo:

    cars %>% group_by(Country,Cylinders) %>%
      do(q=quantile(.$MPG)) %>%
      do(data.frame(
        ctry=.$Country,
        cyl=.$Cylinders,
        which=names(.$q),
        value=.$q)) %>%
      spread(which,value)
    ## # A tibble: 11 x 7
    ## ctry cyl 0% 100% 25% 50% 75%
    ## * <fctr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
    ## 1 France
    ## 2 Germany
    ## 3 Germany
    ## 4 Italy
    ## 5 Japan
    ## 6 Japan
    ## 7 Sweden
    ## 8 Sweden
    ## 9 U.S.
    ## 10 U.S.
    ## 11 U.S.

Vector and matrix algebra in R

Vector addition

Define a vector, then add 2 to it:

    u=c(2,3,6,5,7)
    k=2
    u+k
    ## [1] 4 5 8 7 9

Adds 2 to each element. Adding vectors:

    u
    ## [1] 2 3 6 5 7
    v=c(1,8,3,4,2)
    u+v
    ## [1]  3 11  9  9  9

Elementwise addition. (MAT A23: vector addition.)
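Adding a single number works because R repeats ("recycles") the shorter vector until it is as long as the longer one, and then adds elementwise. A small sketch that makes the repetition explicit, using the same u and v:

```r
u <- c(2, 3, 6, 5, 7)
v <- c(1, 8, 3, 4, 2)

# Adding the number 2 is the same as adding a vector of five 2s
scalar_sum <- u + 2
explicit_sum <- u + rep(2, length(u))
identical(scalar_sum, explicit_sum)

# Two vectors of the same length: element 1 plus element 1, and so on
vector_sum <- u + v
vector_sum
```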
