R packages for this chapter (for later)


1 R packages for this chapter (for later)
  library(ggplot2)
  library(ggrepel)
  library(tidyr)
  library(dplyr)
  ##
  ## Attaching package: 'dplyr'
  ## The following objects are masked from 'package:stats':
  ##
  ##     filter, lag
  ## The following objects are masked from 'package:base':
  ##
  ##     intersect, setdiff, setequal, union

2 SAS stuff

3 Reading data from a file
This:
  a 20
  a 21
  a 16
  b 11
  b 14
  b 17
  b 15
  c 13
  c 9
  c 12
  c 13
got read in like this:
  data groups;
    infile '/home/ken/threegroups.dat';
    input group $ y;

4 More than one observation per line
Foregoing worked with:
  One obs. per line
  Separated by whitespace.
Suppose you have several values on each line, eg. one variable x. Then:
  data xonly;
    infile '/home/ken/one.dat';
    input x @@;
  proc print;
@@ means keep reading on same line until done.

5 The output Obs x / 166

6 If you leave off the @@
Code and output, doesn't get everything:
  data xonly;
    infile '/home/ken/one.dat';
    input x;
  proc print;
  Obs x

7 Two variables using @@
Suppose values in data file are an x then a y, repeated:
  data xonly;
    infile '/home/ken/one.dat';
    input x y @@;
  proc print;
  Obs x y

8 Skipping over header lines
Data file like this:
  x y
In SAS, supply variable names (on input line), so skip over header lines like this:
  data xy;
    infile '/home/ken/two.dat' firstobs=2;
    input xx yy;
  proc print;
firstobs gives the line number of the first line to read, so set it to one more than the number of header lines you want to skip.

9 Data as read in Note variable names: Obs xx yy / 166

10 Data separated by other things
Might have data like this:
  3,4
  5,6
  7,7
  8,9
  3,4
Eg. from spreadsheet saved as .csv.

11 Code and output
Separated by commas, so read in like this:
  data xy;
    infile '/home/ken/three.dat' dlm=',';
    input x y;
  proc print;
  Obs x y

12 The singers: reading in text
Spreadsheet of female singer names, saved as .csv:
  1,Bessie Smith
  2,Peggy Lee
  3,Aretha Franklin
  4,Diana Ross
  5,Dolly Parton
  6,Tina Turner
  7,Madonna
  8,Mary J Blige
  9,Salt n Pepa
  10,Aaliyah
  11,Beyonce
Try reading in:
  data singers;
    infile '/home/ken/singers.csv' dlm=',';
    input number name $;

13 What we got
  Obs  number  name
    1       1  Bessie S
    2       2  Peggy Le
    3       3  Aretha F
    4       4  Diana Ro
    5       5  Dolly Pa
    6       6  Tina Tur
    7       7  Madonna
    8       8  Mary J B
    9       9  Salt n P
   10      10  Aaliyah
   11      11  Beyonce
The names got cut off!

14 Reading the whole names
Only got 1st 8 characters of each singer's name (SAS default for text).
Tell SAS that the names are 20 characters long:
  data singers;
    infile '/home/ken/singers.csv' dlm=',';
    input number name $20.;
  proc print;
  Obs  number  name
    1       1  Bessie Smith
    2       2  Peggy Lee
    3       3  Aretha Franklin
    4       4  Diana Ross
    5       5  Dolly Parton
    6       6  Tina Turner
    7       7  Madonna
    8       8  Mary J Blige
    9       9  Salt n Pepa
   10      10  Aaliyah
   11      11  Beyonce

15 Why this worked
On input number name $20.;, the $20. (dollar sign, then 20 and a dot) specifies the length of the text. It is called an informat.
Singers' names have spaces, but this is no problem, since the delimiter is a comma.
Possible trouble: commas inside the names, as in Robert Downey, Jr. Get around this by adding dsd to infile line. singers2.csv has Mr. Downey on the end:
  data singers2;
    infile '/home/ken/singers2.csv' dlm=',' dsd;
    input number name $20.;
  proc print;
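For comparison on the R side of the course, here is a minimal sketch of the same comma-inside-a-name situation. It assumes the name is wrapped in double quotes in the file (the slides do not show the raw line); the text in csv_text is hypothetical stand-in data, not the real singers2.csv.

```r
# Hypothetical two-line file contents; the quoted name contains a comma.
csv_text <- '11,Beyonce\n12,"Robert Downey, Jr."'

# read.csv treats text inside double quotes as a single field by default,
# so the comma inside the quotes is not taken as a delimiter.
singers <- read.csv(text = csv_text, header = FALSE,
                    col.names = c("number", "name"),
                    stringsAsFactors = FALSE)
print(singers)
```

R has no fixed-width default for text here, so no length needs to be declared for the name column.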

16 Singers as read in Obs number name 1 1 Bessie Smith 2 2 Peggy Lee 3 3 Aretha Franklin 4 4 Diana Ross 5 5 Dolly Parton 6 6 Tina Turner 7 7 Madonna 8 8 Mary J Blige 9 9 Salt n Pepa Aaliyah Beyonce Robert Downey, Jr. 16 / 166

17 A gotcha
If you tried this for yourself, this might not have worked.
Issue: Singer names must be at least 20 characters long. If not, you have to add spaces to make them so.
singers3.csv has additional spaces removed. With code:
  data singers3;
    infile '/home/ken/singers3.csv' dlm=',' dsd;
    input number name $20.;
  proc print;
you get output shown on next page.

18 Output from previous commands, data file below Obs number name 1 1 Bessie Smith 2 2 Peggy Lee 3 3 Aretha Franklin 4 4 Diana Ross 5 5 Dolly Parton 6 6 7,Madonna 7 8 Mary J Blige 8 9 Salt n Pepa ,Beyonce Robert Downey, Jr. 1,Bessie Smith *** <- actual end of line 2,Peggy Lee *** 3,Aretha Franklin *** 4,Diana Ross *** 5,Dolly Parton *** 6,Tina Turner*** 7,Madonna *** 8,Mary J Blige *** 9,Salt n Pepa *** 10,Aaliyah*** 11,Beyonce *** 18 / 166

19 Reading spreadsheet data into SAS
Two quick ways:
  Save data to .csv, transfer to SAS Studio.
  Copy and paste into Program Editor (quick and dirty). Save in singsing.dat, read in like this:
    data sing;
      infile "/home/ken/singsing.dat" expandtabs;
      input singer $20. value;
Read in actual spreadsheet using proc import:
  proc import out=singers
    datafile='/home/ken/sing.xlsx'
    dbms=xlsx replace;
    sheet="sheet1";
    getnames=yes;
Notes:
  ; only at end (for clarity)
  out=: name of SAS data set
  datafile=: Excel spreadsheet
  sheet=: which sheet in workbook

20 Did it work? Obs singer number 1 Bessie Smith 1 2 Peggy Lee 2 3 Aretha Franklin 3 4 Diana Ross 4 5 Dolly Parton 5 6 Tina Turner 6 7 Madonna 7 8 Mary J Blige 8 9 Salt n Pepa 9 10 Aaliyah Beyonce 11 Yes! And without any issues about lengths of names. 20 / 166

21 Permanent data sets
Can we read in data set once and not every time? Yes, use filename (in single quotes) when creating:
  data '/home/ken/cars';
    infile '/home/ken/cars.txt' firstobs=2;
    input car $25. mpg weight cylinders hp country $;
Car names max of 25 chars long. Country names max of 8, so no special treatment needed.
SAS stores file called /home/username/cars.sas7bdat (!) on SAS Studio.
Whenever you need it, add data='/home/username/cars' to a proc line (replacing username with your username).
Can use subfolders, using / forward slash syntax.
Closing SAS breaks connection with temporary (ie. non-permanent) data sets. To get those back, need to run data step lines again.

22 Means, without data step! proc means data='/home/ken/cars'; var mpg weight cylinders hp; The MEANS Procedure Variable N Mean Std Dev Minimum Maximum mpg weight cylinders hp / 166

23 Mean MPG by country proc means data='/home/ken/cars'; var mpg; class country; The MEANS Procedure Analysis Variable : mpg N country Obs N Mean Std Dev Minimum Maximum France Germany Italy Japan Sweden U.S This kind of thing is SAS s strength. 23 / 166

24 How does SAS know which data set to use?
Two rules:
  1. Any proc can have data= on it. Tells SAS to use that data set. Can be
       unquoted data set name (created by data step)
       quoted data set name (permanent one on disk created as above)
  2. Without data=, most recently created data set. Typically data set created by data step, though could also be spreadsheet via proc import. Also, data set created by out= counts.
Does permanent data set count as most recently created? No, or at least not always. If unsure, use data=.

25 SAS: creating new data sets from old ones

26 Selecting individuals/observations
Singers original data step:
  data singers;
    infile '/home/ken/singers.csv' dlm=',';
    input number name $20.;
To select singers only 1 through 6:
  data singers;
    infile '/home/ken/singers.csv' dlm=',';
    input number name $20.;
    if number<=6;
Select individuals with if: choose only these.

27 Did it work? proc print; Obs number name 1 1 Bessie Smith 2 2 Peggy Lee 3 3 Aretha Franklin 4 4 Diana Ross 5 5 Dolly Parton 6 6 Tina Turner Most recently created data set has only singers with numbers 6 or less. 27 / 166

28 Omitting individuals Sometimes easier to focus on obs to leave out: data singers; infile '/home/ken/singers.csv' dlm=','; input number name $20.; if number<4 then delete; proc print; Obs number name 1 4 Diana Ross 2 5 Dolly Parton 3 6 Tina Turner 4 7 Madonna 5 8 Mary J Blige 6 9 Salt n Pepa 7 10 Aaliyah 8 11 Beyonce 28 / 166

29 Selecting on text variable
Less than means earlier alphabetically. Comparison is case-sensitive, so compare with capital 'M'. Singers before M:
  data singers;
    infile '/home/ken/singers.csv' dlm=',';
    input number name $20.;
    if name<'M';
  proc print;
  Obs  number  name
    1       1  Bessie Smith
    2       3  Aretha Franklin
    3       4  Diana Ross
    4       5  Dolly Parton
    5      10  Aaliyah
    6      11  Beyonce

30 Equality
Selecting singer #7, ie. singer whose number is equal to 7. Note that SAS uses = while R uses == for logical equals.
  data singers;
    infile '/home/ken/singers.csv' dlm=',';
    input number name $20.;
    if number=7;
  proc print;
  Obs  number  name
    1       7  Madonna

31 Either/Or
  data singers;
    infile '/home/ken/singers.csv' dlm=',';
    input number name $20.;
    if number=7 or name='Diana Ross';
  proc print;
  Obs  number  name
    1       4  Diana Ross
    2       7  Madonna

32 Both/And
Have multiple if lines:
  data singers;
    infile '/home/ken/singers.csv' dlm=',';
    input number name $20.;
    if number<7;
    if name<'C';
  proc print;
  Obs  number  name
    1       1  Bessie Smith
    2       3  Aretha Franklin

33 Selecting variables if, delete selects/omits individuals/observations. To select variables, use keep or drop: data singers; infile '/home/ken/singers.csv' dlm=','; input number name $20.; keep name; proc print; Obs name 1 Bessie Smith 2 Peggy Lee 3 Aretha Franklin 4 Diana Ross 5 Dolly Parton 6 Tina Turner 7 Madonna 8 Mary J Blige 9 Salt n Pepa 10 Aaliyah 33 / 166

34 Getting rid of variables data singers; infile '/home/ken/singers.csv' dlm=','; input number name $20.; drop number; proc print; Obs name 1 Bessie Smith 2 Peggy Lee 3 Aretha Franklin 4 Diana Ross 5 Dolly Parton 6 Tina Turner 7 Madonna 8 Mary J Blige 9 Salt n Pepa 10 Aaliyah 11 Beyonce 34 / 166

35 Cloning a data set (pointless!) Use set to bring in all the variables and individuals from another data set: data singers; infile '/home/ken/singers.csv' dlm=','; input number name $20.; data singers2; set singers; singers2 exactly same as singers. set usually first step to doing something else with data. 35 / 166

36 A less pointless cloning There is point in combining set with keep or drop or if to copy only individuals/variables you want. Example: cars data, keep only those cars with mpg bigger than 30: data mycars; set '/home/ken/cars'; if mpg>30; proc print; 36 / 166

37 High-gas-mileage cars Obs car mpg weight cylinders hp country 1 Dodge Omni U.S. 2 Fiat Strada Italy 3 VW Rabbit Germany 4 Plymouth Horizon U.S. 5 Mazda GLC Japan 6 VW Dasher Germany 7 Dodge Colt Japan 8 VW Scirocco Germany 9 Datsun Japan 10 Pontiac Phoenix U.S. 37 / 166

38 Keep only car name and gas mileage data mycars; set '/home/ken/cars'; keep car mpg; proc print; 38 / 166

39 Just two variables Obs car mpg 1 Buick Skylark Dodge Omni Mercury Zephyr Fiat Strada Peugeot 694 SL VW Rabbit Plymouth Horizon Mazda GLC Buick Estate Wagon Audi Chevy Malibu Wagon Dodge Aspen VW Dasher Ford Mustang Dodge Colt Datsun VW Scirocco Chevy Citation Olds Omega Chrysler LeBaron Wagon Datsun AMC Concord D/L Buick Century Special Saab 99 GLE Datsun Ford LTD Volvo 240 GL Dodge St Regis Toyota Corona Chevette Ford Mustang Ghia / 166

40 Get rid of cylinders and hp data mycars; set '/home/ken/cars'; drop cylinders hp; proc print; 40 / 166

41 Those two variables gone Obs car mpg weight country 1 Buick Skylark U.S. 2 Dodge Omni U.S. 3 Mercury Zephyr U.S. 4 Fiat Strada Italy 5 Peugeot 694 SL France 6 VW Rabbit Germany 7 Plymouth Horizon U.S. 8 Mazda GLC Japan 9 Buick Estate Wagon U.S. 10 Audi Germany 11 Chevy Malibu Wagon U.S. 12 Dodge Aspen U.S. 13 VW Dasher Germany 14 Ford Mustang U.S. 15 Dodge Colt Japan 16 Datsun Japan 17 VW Scirocco Germany 18 Chevy Citation U.S. 19 Olds Omega U.S. 20 Chrysler LeBaron Wagon U.S. 21 Datsun Japan 22 AMC Concord D/L U.S. 23 Buick Century Special U.S. 24 Saab 99 GLE Sweden 25 Datsun Japan 26 Ford LTD U.S. 27 Volvo 240 GL Sweden 28 Dodge St Regis U.S. 29 Toyota Corona Japan 30 Chevette U.S. 31 Ford Mustang Ghia U.S. 41 / 166

42 Keeping only some individuals and variables Any variables not keep-ed are dropped. Any variables not drop-ed are kept. So only need one of keep and drop. But can combine with if: data mycars; set '/home/ken/cars'; keep car mpg; if weight>4; proc print; Obs car mpg 1 Buick Estate Wagon Ford Country Squire Wagon 15.5 Keeps only car name and gas mileage for those cars weighing over 4 tons. 42 / 166

43 Kernel density curve on histogram
A kernel density curve smooths out a histogram and gives sense of shape of distribution.
Car mpgs:
  proc sgplot data='/home/ken/cars';
    histogram mpg;
    density mpg / type=kernel;

44 Histogram of MPGs with kernel density 44 / 166

45 Comments The kernel density has a wobble in the middle, suggesting that the data might be bimodal rather than unimodal. This is pretty clear from the hole in the middle of the histogram. 45 / 166

46 Kernel density for car weights proc sgplot data='/home/ken/cars'; histogram weight; density weight / type=kernel; 46 / 166

47 Comments For MPGs, clear evidence of bimodal shape. Cars seem to divide into low-mpg and high-mpg groups. For weights, not so much evidence of bimodality. Looks more right-skewed. 47 / 166

48 Loess curve
Loess curve (note spelling) in SAS. Code like this:
  proc sgplot data='/home/ken/cars';
    scatter x=weight y=mpg;
    loess x=weight y=mpg;

49 Loess curve on plot 49 / 166

50 Distinguishing points by colours or symbols
Say we want to plot mpg by weight, with the points different colours and symbols according to what number of cylinders they are.
sgplot takes a group= as option, similar to ggplot:
  proc sgplot data='/home/ken/cars';
    scatter x=weight y=mpg / group=cylinders;

51 The plot 51 / 166

52 Comments Cars with different numbers of cylinders distinguished by colour and shape. Blue circles denote 4-cylinder cars... green crosses 8-cylinder. Legend at the bottom, so you can see which colour/symbol is which. Cars with more cylinders are heavier and have worse gas mileage. 52 / 166

53 Multiple series on one plot: the oranges data
Data file like this (circumferences of 5 trees each at 7 times):
  row ages A B C D E
Skip over first line of file; create permanent data set:
  data '/home/ken/oranges';
    infile '/home/ken/oranges.txt' firstobs=2;
    input row age a b c d e;

54 Multiple series
Growth curve for each tree, joined by lines.
series joins points by lines.
markers displays actual data points too.
Do each series one at a time.
  proc sgplot;
    series x=age y=a / markers;
    series x=age y=b / markers;
    series x=age y=c / markers;
    series x=age y=d / markers;
    series x=age y=e / markers;

55 The growth curves 55 / 166

56 Labelling points on a plot The magic word here is datalabel. For example, to label each car on a scatterplot of MPG vs. weight with the name of the car: proc sgplot data='/home/ken/cars'; scatter y=mpg x=weight / datalabel=car; 56 / 166

57 The plot 57 / 166

58 Comments Each car labelled with its name, either left, right, above or below, whichever makes it clearest. (Some intelligence applied to placement.) Cars top left are nimble : light in weight, good gas mileage. Cars bottom right are boats : heavy, with terrible gas mileage. 58 / 166

59 Labelling by country Same idea: proc sgplot data='/home/ken/cars'; scatter x=weight y=mpg / datalabel=country; 59 / 166

60 Labelled by country 60 / 166

61 Labelling only some of the observations
Create a new data set with all the old variables plus a new one that contains the text to plot.
For example, label most fuel-efficient car (#4) and heaviest car (#9). Observation number given by SAS special variable _n_. Note the syntax: if then do followed by end.
  data cars2;
    set '/home/ken/cars';
    if (_n_=4 or _n_=9) then do;
      newtext=car;
    end;
For any cars not selected, newtext will be blank. Then, using the new data set that we just created:
  proc sgplot;
    scatter x=weight y=mpg / datalabel=newtext;

62 The plot 62 / 166

63 Or label cars with mpg greater than 34
  data cars3;
    set '/home/ken/cars';
    if mpg>34 then do;
      newtext=car;
    end;
  proc sgplot;
    scatter x=weight y=mpg / datalabel=newtext;

64 High-mpg cars 64 / 166

65 R stuff

66 More R stuff
R has a thousand tiny parts, all working together, but to use them, need to know their names.
Sometimes you do know the name, but you forget how it works. Then (at Console) type eg. ?median or help(median). Help appears in R Studio bottom right.
Read in the cars data to use for examples later:
  cars=read.csv("cars.csv")
  str(cars)
  ## 'data.frame': 38 obs. of 6 variables:
  ## $ Car       : Factor w/ 38 levels "AMC Concord D/L",..: 7 18
  ## $ MPG       : num
  ## $ Weight    : num
  ## $ Cylinders : int
  ## $ Horsepower: int
  ## $ Country   : Factor w/ 6 levels "France","Germany",..:

67 Structure of help file
All R's help files laid out the same way:
  Purpose: what the function does.
  Usage: how you make it go.
  Arguments: what you need to feed in. Arguments with a = have default values. If the default is OK (it often is), you don't need to specify it.
  Details: more information about how the function works.
  Value: what comes back from the function.
  References to the literature, so that you can find out exactly how everything was calculated.
  Examples. Run these using eg. example(median).

68 If you don't know the name
Then you have to find it out!
If you know what it might be, apropos(name):
  apropos("read")
  ##  [1] "readBin"          "readChar"         "readCitationFile"
  ##  [4] "read.csv"         "read.csv2"        "read.dcf"
  ##  [7] "read.delim"       "read.delim2"      "read.DIF"
  ## [10] "read.fortran"     "read.ftable"      "read.fwf"
  ## [13] "readline"         "readLines"        "readRDS"
  ## [16] ".readRDS"         "readRenviron"     "read.socket"
  ## [19] "read.table"       "spread"           "spread_"
  ## [22] "Sys.readlink"
and then you investigate more via help().
Google-searching, eg: r ggplot add horizontal line. Often turns up questions on stackexchange.com, which might be adapted to your needs.

69 That Google search
Looks like geom_hline(). Look up in help as ?ggplot2::geom_hline.

70 Gas mileage against weight, basic
  g=ggplot(cars,aes(x=Weight,y=MPG))+geom_point() ; g
[Scatterplot of MPG against Weight]

71 Add regression line
  g+geom_smooth(method="lm",se=F)
[Scatterplot of MPG against Weight with regression line]

72 Calculate and plot means
  mean.weight=mean(cars$Weight)
  mean.mpg=mean(cars$MPG)
  g2=g+geom_smooth(method="lm",se=F)+
    geom_hline(yintercept=mean.mpg,colour="red")+
    geom_vline(xintercept=mean.weight,colour="darkgreen")

73 The plot
  g2
[Scatterplot of MPG against Weight with regression line, red horizontal line at mean MPG, dark green vertical line at mean Weight]

74 With title
  g+ggtitle("Gas mileage against weight")
[Scatterplot of MPG against Weight, titled "Gas mileage against weight"]

75 Axis labels
  g+xlab("Weight (tons)")+ylab("MPG (miles per US gallon)")
[Scatterplot with axes labelled "Weight (tons)" and "MPG (miles per US gallon)"]

76 Adding text to plot
  g+geom_text(aes(label=Car),hjust=-0.1,size=2)+
    xlim(1.8,5.0)
[Scatterplot of MPG against Weight, each point labelled with its car name; many labels overlap]

77 Comments
  geom_text needs a label aesthetic to say what text to plot. It inherits the x and y from the ggplot.
  hjust says where to put the labels relative to the points: 0.5 is centred over them, negative is on the right, greater than 1 is on the left.
  vjust similar to move labels up and down (less than 0, greater than 1 for above or below points).
  size controls size of text: 5 is default (so this is smaller).
  Not an obvious way to stop labels overlapping! But see over for a solution.
  xlim changes limits of x-axis (to stop labels going off side). Likewise ylim.

78 Non-overlapping labels
Key is to use package ggrepel and geom_text_repel from that package instead of geom_text:
  library(ggrepel) # if not done already
  g+geom_text_repel(aes(label=Car),size=2)
[Scatterplot of MPG against Weight, each point labelled with its car name; labels moved so that they do not overlap]

79 Labelling only some points
Same idea as SAS: create a new variable in the data frame with the labels to plot, or empty, eg. using mutate from dplyr:
  cars2=dplyr::mutate(cars,
    newlabel=ifelse(MPG>34,as.character(Car),""))
  g2=ggplot(cars2,aes(x=Weight,y=MPG))+geom_point()+
    geom_text(aes(label=newlabel),size=2,hjust=-0.1)
ifelse takes three things: something that can be true or false, the value if true, the value if false (like IF in a spreadsheet).
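A minimal sketch of how ifelse behaves on whole vectors, using a hypothetical four-car data set (the values below are made up for illustration and are not read from cars.csv):

```r
# Hypothetical mini data set, not the real cars file.
mpg <- c(28.4, 37.3, 20.3, 34.2)
car <- c("Buick Skylark", "Fiat Strada", "Ford LTD", "Mazda GLC")

# ifelse works element by element: the car name where the condition
# is TRUE, an empty string where it is FALSE.
newlabel <- ifelse(mpg > 34, car, "")
print(newlabel)
```

The result has one entry per car, which is why it can be stored as a new column and handed to geom_text as the label.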

80 The plot
  g2
[Scatterplot of MPG against Weight; labelled points: Fiat Strada, Dodge Colt, Mazda GLC, Plymouth Horizon]

81 Labelling points by group
  g3=ggplot(cars,aes(x=Weight,y=MPG,colour=Cylinders))+
    geom_point() ; g3
[Scatterplot of MPG against Weight, points coloured by Cylinders on a continuous colour scale]

82 Fixing it up
Only that isn't right: Cylinders isn't really on a continuous scale; it should be treated as factor:
  g3=ggplot(cars,aes(x=Weight,y=MPG,
    colour=as.factor(Cylinders)))+
    geom_point() ; g3
[Scatterplot of MPG against Weight, points coloured by as.factor(Cylinders) with a discrete legend]

83 Adding new data: averages by cylinders
First make data frame of new data to add:
  tmp1=group_by(cars,Cylinders)
  summ=summarize(tmp1,mw=mean(Weight),mm=mean(MPG)) ; summ
  ## # A tibble: 4 x 3
  ##   Cylinders    mw    mm
  ##       <int> <dbl> <dbl>
then to plot averages on graph, add a new geom_point with a new data frame:
  g4=g3+geom_point(data=summ,aes(x=mw,y=mm,
    colour=as.factor(Cylinders)),shape=3)

84 The plot, group mean marked by + g4 35 MPG as.factor(cylinders) Weight 84 / 166

85 Multiple series on one plot
Oranges data frame:
  oranges=read.table("oranges.txt",header=T)
  oranges
  ## row ages A B C D E
Each column is circumference at given time. Want to plot each column against time, labelled.

86 Organizing the data
ggplot way is to put all the circumferences in one column, labelled by which tree they come from, and then plot them using tree as group. This uses gather from tidyr:
  orange.long=gather(oranges,tree,circum,A:E)
  head(orange.long,8)
  ## row ages tree circum
  ##          A    30
  ##          A    51
  ##          A    75
  ##          A    108
  ##          A    115
  ##          A    139
  ##          A    140
  ##          B
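A minimal sketch of the wide-to-long reshaping on a cut-down, hypothetical version of the oranges layout (two trees, three times; the ages and the B column are made up). It also shows pivot_longer, which newer versions of tidyr offer for the same job; that function is not used in the slides.

```r
library(tidyr)

# Hypothetical two-tree, three-time version of the oranges layout.
oranges_small <- data.frame(ages = c(118, 484, 664),
                            A = c(30, 51, 75),
                            B = c(30, 58, 87))

# gather(): new key column "tree", new value column "circum",
# built from columns A through B.
long1 <- gather(oranges_small, tree, circum, A:B)

# pivot_longer() is the newer tidyr spelling of the same reshaping.
long2 <- pivot_longer(oranges_small, cols = A:B,
                      names_to = "tree", values_to = "circum")
print(long1)
```

Both results have one row per tree-and-time combination; only the row order differs (gather stacks tree by tree, pivot_longer goes time by time).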

87 The plot, joining points by lines g5=ggplot(orange.long,aes(x=ages,y=circum,colour=tree))+ geom_point()+geom_line(); g5 200 circum tree A B C D 50 E ages 87 / 166

88 Faceting
Another way to plot the orange tree growth curves is each on a separate plot.
In ggplot the separate graphs are called facets, and to get them, you add facet_wrap to the plot, with, inside, what distinguishes the facets, thus:
  g6=g5+facet_wrap(~tree)
Or, for the car data, plot gas mileage against weight for each country separately:
  g7=ggplot(cars,aes(x=Weight,y=MPG))+geom_point()+
    facet_wrap(~Country)

89 Growth curves by tree g6 A B C 200 circum D E tree A B C D E ages 89 / 166

90 Car MPG by weight for each country g7 France Germany Italy MPG Japan Sweden U.S Weight 90 / 166

91 Plotting against several variables
Another use for faceting is to plot one y-variable (say MPG) against several x-variables at once. (Did this for asphalt data before.)
Here we plot MPG against Weight, Cylinders and Horsepower.
Strategy: put all x's in one column (using gather) and keep another column with names of x's.
Plot y against combined x's, faceted by names of x's.
x's will be on different scales; account for this:
  cars.3=gather(cars,xname,x,Weight:Horsepower)
  g8=ggplot(cars.3,aes(x=x,y=MPG))+geom_point()+
    facet_wrap(~xname,scales="free_x")

92 The plot(s): all negative correlations g8 Cylinders Horsepower Weight x MPG 92 / 166

93 With regression lines
  g8+geom_smooth(method="lm",se=F)
[Three facets (Cylinders, Horsepower, Weight): MPG against x, each with a regression line]

94 A last variation: separate graphs for levels of a factor
The same faceting idea allows us to produce an array of plots, one for each combination of levels of factors.
Here we plot MPG against Weight for each combo of Country (across) and number of Cylinders (up, treated as factor):
  g9=ggplot(cars,aes(x=Weight,y=MPG))+geom_point()+
    facet_grid(Cylinders~Country)
Can also put only one factor in facet_grid to arrange facets up and down or across.
With facet_wrap, don't control structure of display.

95 The plot g9 MPG France Germany Italy Japan Sweden U.S Weight / 166

96 Kernel density curve
As we saw in SAS, this is a way of smoothing a histogram to understand underlying shape.
ggplot histogram has bin width all wrong:
  ggplot(cars,aes(x=MPG))+geom_histogram()
  ## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
[Histogram of MPG with the default 30 bins]
Fix up bin width, add kernel density with geom_density(). Also note that y-scale will be the (computed) density, not count.

97 Histogram of MPG with kernel density
  ggplot(cars,aes(x=MPG))+
    geom_histogram(aes(y=..density..),
    binwidth=2.5)+geom_density()
[Histogram of MPG on the density scale, with kernel density curve]
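A minimal sketch of the same histogram-plus-density idea using after_stat(density), which recent ggplot2 releases document as the preferred spelling of ..density... The data frame d below is hypothetical (made-up MPG-like values), and this assumes a ggplot2 version that provides after_stat.

```r
library(ggplot2)

# Hypothetical MPG-like values, made up for illustration.
d <- data.frame(MPG = c(16, 17, 18, 19, 20, 22, 27, 30, 31, 32, 34, 37))

# after_stat(density) asks for the computed density on the y-axis,
# the same role ..density.. plays in the slide code.
p <- ggplot(d, aes(x = MPG)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 2.5) +
  geom_density()
p
```

Putting the histogram on the density scale is what lets the bars and the smooth curve share one y-axis.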

98 Histogram of weight with density curve
Histogram of MPG is clearly bimodal. What about weight? Not so much.
  ggplot(cars,aes(x=Weight))+
    geom_histogram(aes(y=..density..),
    binwidth=0.5)+geom_density()
[Histogram of Weight on the density scale, with kernel density curve]

99 Normal quantile plot
Histogram, (especially) boxplot don't give focused assessment of whether a distribution is normal. Need normal quantile plot.
Plot data values against what you'd expect if normal distribution correct.
If normal is correct, get straight line. If not, get a curve.
ggplot has stat_qq for this, which goes this way (car weights):
  qq=ggplot(cars,aes(sample=Weight))+stat_qq()

100 Normal quantile plot for Weight qq 4.0 sample theoretical 100 / 166

101 Comments Plot has no qqline! If data were perfectly normal, values exactly straight. Data stray off straight a bit at the ends: low values especially are too big/bunched up for normal. Weights are not normal. But line makes it much easier to judge. How might we draw one? 101 / 166

102 Figuring out qqline
The qqline on R's other normal quantile plot goes through observed and theoretical quartiles.
quantile gets percentiles of data, for example:
  y=quantile(cars$Weight,c(0.25,0.75)) ; y
  ## 25% 75%
  ##
qnorm gets percentiles of standard normal:
  x=qnorm(c(0.25,0.75)) ; x
  ## [1] -0.6744898  0.6744898
I used y for data and x for theoretical since that's how they appear on the graph.

103 Figuring out qqline (2)
Slope of line joining these is
  slope=(y[2]-y[1])/(x[2]-x[1]) ; slope
  ## 75%
  ##
Intercept is
  int=y[1]-slope*x[1] ; int
  ## 25%
  ##
geom_abline() draws a line with specified intercept and slope.
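A minimal base-R sketch of the slope and intercept arithmetic, run on a hypothetical sample (vals is made up, not the car weights), checking that the resulting line really does pass through both pairs of quartiles:

```r
# Hypothetical sample, made up for illustration.
vals <- c(2.1, 2.3, 2.6, 2.7, 3.0, 3.4, 3.6, 3.9, 4.1)

y <- quantile(vals, c(0.25, 0.75))   # observed quartiles
x <- qnorm(c(0.25, 0.75))            # theoretical quartiles
slope <- (y[2] - y[1]) / (x[2] - x[1])
int <- y[1] - slope * x[1]

# The line int + slope * x should reproduce both observed quartiles.
fitted <- int + slope * x
print(unname(fitted))
print(unname(y))
```

The two printed vectors should agree, which is the defining property of a line through the two quartile points.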

104 Making this into a function
Make this into function so that we can use repeatedly. Generous use of copy/paste!
  qqplot=function(vals)
  {
    y=quantile(vals,c(0.25,0.75))
    x=qnorm(c(0.25,0.75))
    slope=(y[2]-y[1])/(x[2]-x[1])
    int=y[1]-slope*x[1]
    d=data.frame(vals=vals)
    ggplot(d,aes(sample=vals))+stat_qq()+
      geom_abline(slope=slope,intercept=int)
  }
Make sure you understand what each line of the function does, and why it's there.

105 Testing on car weights
  qqplot(cars$Weight)
[Normal quantile plot of Weight (sample against theoretical) with line]

106 Making normal quantile plot of actually normal data How much deviation from the line might there be if data really normal? Generate some random normal data and find out: z=rnorm(100) qq=qqplot(z) See (over) that: overall pattern of points is straight, not curved points at extremes are not drifting away from line. 106 / 166

107 Normal quantile plot for genuinely normal data qq 2 sample theoretical 107 / 166

108 Right-skewed data The gamma distribution is skewed to right: g=rgamma(1000,2,2) gam=data.frame(g=g) ggplot(gam,aes(x=g))+geom_histogram(binwidth=0.2) count g Assess normality thus: qq=qqplot(g) 108 / 166

109 Normal quantile plot for gamma data qq 4 3 sample theoretical 109 / 166

110 Comments Seriously non-normal! Big-time curve on plot; points don t follow a line at all. Observations at top end too spread out for normal. Observations at bottom end bunched up for normal. Skewness in direction of spread-out values: skewed right. 110 / 166

111 Car MPGs
Distribution had hole in middle (some low MPGs, and some high ones): not normal.
How does this show up on normal quantile plot?
  qq=qqplot(cars$MPG)

112 Normal quantile plot for car MPG qq 35 sample theoretical 112 / 166

113 Comments Hole shows up as vertical gap. Almost S-bend in data values. High ones not high enough. Low ones not low enough. Data too bunched up to be normal (short tails). 113 / 166

114 Functions: the geometric distribution
Recall binomial distribution, eg. toss coin 10 times and count how many heads (W). In general, prob. of success = p on every independent trial. Fixed # trials, W is #successes.
Another angle: how many trials to get my first success? Random variable now #trials (denote X); #successes fixed (= 1). Geometric distribution.
  P(X = 1) = p (success first time).
  P(X = 2) = (1-p) p (fail, then succeed).
  P(X = 3) = (1-p)^2 p (fail 2 times, then succeed).
  P(X = n) = (1-p)^{n-1} p (fail n-1 times, then succeed).
Implement in R.
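As a check on the formula for P(X = n), the probabilities should add up to 1 over all possible n. A short derivation using the geometric series, assuming 0 < p <= 1:

```latex
\sum_{n=1}^{\infty} P(X = n)
  = \sum_{n=1}^{\infty} (1-p)^{n-1} p
  = p \sum_{k=0}^{\infty} (1-p)^{k}
  = p \cdot \frac{1}{1-(1-p)}
  = 1
```

The substitution k = n - 1 turns the sum into a standard geometric series with ratio 1 - p.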

115 Writing a geometric probability function
Input: #trials whose prob. we want x, single-trial success prob. p.
Output: probability of succeeding for 1st time after exactly x trials (number).
One-liner:
  geometric=function(x,p) p*(1-p)^(x-1)
Or with curly brackets:
  geometric=function(x,p)
  {
    p*(1-p)^(x-1)
  }
Testing:
  geometric(1,0.4)
  ## [1] 0.4
Prob. of succeeding first time same as p: good.

116 Errors
Chance of first success on second trial? Fail, then succeed:
  geometric(2,0.4)
  ## [1] 0.24
(0.6)(0.4) = 0.24.
What if user gives p outside of [0, 1], or x less than 1? Function dies with error. Or gives nonsense answer. Catch that first:
  geometric(0,0.5)
  ## [1] 1
  geometric(2,1.1)
  ## [1] -0.11
Ugh!

117 Catching errors
stopifnot: feed it some logical conditions, stops operation of function if any condition false. (If all true, nothing happens.) If any condition false, R tells you which one.
3 things to check: p 0 or bigger, p 1 or smaller, x 1 or bigger:
  geometric=function(x,p)
  {
    stopifnot(p>=0,p<=1,x>=1)
    p*(1-p)^(x-1)
  }

118 Testing
Test:
  geometric(2,0.5)
  ## [1] 0.25
  geometric(0,0.5)
  ## Error: x >= 1 is not TRUE
  geometric(2,1.1)
  ## Error: p <= 1 is not TRUE
Last two fail, and stopifnot tells you why.
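When a check like this sits inside a longer script, the error can be intercepted instead of halting everything. A minimal sketch using base R's tryCatch (not used in the slides; the name msg is just for illustration):

```r
geometric <- function(x, p) {
  stopifnot(p >= 0, p <= 1, x >= 1)
  p * (1 - p)^(x - 1)
}

# tryCatch() intercepts the error raised by stopifnot() so that the
# message can be inspected instead of halting the script.
msg <- tryCatch(geometric(2, 1.1),
                error = function(e) conditionMessage(e))
print(msg)
```

Here msg holds the text of the failed condition, so the script can decide what to do next.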

119 Calling geometric with vector x
What happens? Try it and see.
  geometric(1:5,0.5)
  ## [1] 0.50000 0.25000 0.12500 0.06250 0.03125
Probabilities of first success taking 1, 2, 3,... trials.
Works because of how R handles vector arithmetic.
R freebie: often get vector output from vector input with no extra coding.
Above gives ingredients for "first success in 5 trials or less": calculate prob of 1 to 5, then add up:
  sum(geometric(1:5,0.5))
  ## [1] 0.96875

120 Function input
If we use function as above, have to get inputs in right order:
  geometric(2,0.8)
  ## [1] 0.16
  geometric(0.8,2)
  ## Error: p <= 1 is not TRUE
Second one fails because it thinks 2 is success probability.
But if we use the names, can do any order:
  geometric(x=2,p=0.8)
  ## [1] 0.16
  geometric(p=0.8,x=2)
  ## [1] 0.16

121 Defaults
What if I write the function like this?
  geometric=function(x,p=0.5)
  {
    stopifnot(p>=0,p<=1,x>=1)
    p*(1-p)^(x-1)
  }
If I call it without a value of p, shouldn't I get an error?
  geometric(x=3)
  ## [1] 0.125
It works, because if I don't give a value for p, it uses the one in the function line, a default.
Many R functions have defaults, that give reasonable behaviour without having to worry about details.

122 Cumulative probabilities as function
Might be useful to have function for cumulative probabilities.
Strategy: get individual probs as far as you wish to go, then add up.
Eg. probability of 4 or less: need 1 through 4.
In general, x or less with success prob. p:
  c.geometric=function(x,p)
  {
    probs=geometric(1:x,p)
    sum(probs)
  }
Easy to write, using our geometric function and stuff in R.
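One way to sanity-check a cumulative function like this: "first success within x trials" is the opposite of "all of the first x trials fail", which has probability (1-p)^x. A minimal sketch comparing the sum with that closed form (closed is a hypothetical helper name, not from the slides):

```r
geometric <- function(x, p) p * (1 - p)^(x - 1)
c.geometric <- function(x, p) sum(geometric(1:x, p))

# Closed form: 1 minus the chance that all x trials fail.
closed <- function(x, p) 1 - (1 - p)^x

print(c.geometric(4, 0.5))
print(closed(4, 0.5))
```

The two numbers should match up to rounding error for any valid x and p.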

123 Testing c.geometric
Try the one we just did:
  c.geometric(5,0.5)
  ## [1] 0.96875
Answer we had before. How about this:
  c.geometric(20,0.1)
  ## [1] 0.8784233
If success probability only 0.1, might even take longer than 20 trials to get first success. So this is reasonable.
Mean number of trials until 1st success is 1/p:
  p = 0.5, mean #trials is 1/0.5 = 2.
  p = 0.1, mean #trials is 1/0.1 = 10.
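The 1/p mean can be checked numerically from the probability function itself. A minimal sketch, assuming a cut-off of 1000 trials is far enough out that the remaining tail is negligible (the cut-off is an arbitrary choice for illustration):

```r
geometric <- function(x, p) p * (1 - p)^(x - 1)

# Approximate the mean by summing x * P(X = x) over a long range;
# the cut-off of 1000 trials is an arbitrary choice for illustration.
x <- 1:1000
approx_mean <- sum(x * geometric(x, 0.1))
print(approx_mean)
```

For p = 0.1 the result should be very close to 1/0.1 = 10.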

124 Using R's geometric calculator
Called pgeom:
c.geometric(5,0.5)
## [1] 0.96875
c.geometric(20,0.1)
## [1] 0.8784233
pgeom(5,0.5)
## [1] 0.984375
pgeom(20,0.1)
## [1] 0.890581
Oh. Not the same. Look in help for pgeom: this is the other version of geometric, where you count how many failures happened before 1st success (#trials minus 1). So we need (compare the c.geometric values above):
pgeom(4,0.5)
## [1] 0.96875
pgeom(19,0.1)
## [1] 0.8784233
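A sketch that checks the "trials minus 1" relationship over a whole range of x rather than two spot values. The function definitions are assumed from the earlier slides; the use of sapply to call c.geometric once per x is my choice, since c.geometric itself expects a single x.

```r
# Definitions assumed from the slides
geometric <- function(x, p) {
  stopifnot(p >= 0, p <= 1, x >= 1)
  p * (1 - p)^(x - 1)
}
c.geometric <- function(x, p) {
  sum(geometric(1:x, p))
}

x <- 1:20
p <- 0.1
ours <- sapply(x, c.geometric, p = p)   # P(first success within x trials)
theirs <- pgeom(x - 1, p)               # P(at most x-1 failures before first success)

print(all.equal(ours, theirs))
print(all.equal(geometric(x, p), dgeom(x - 1, p)))   # same shift for individual probabilities
```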

125 Another way of writing cumulative geometric
Suppose we hadn't thought to try a vector for x. What then?
Calculate each probability in turn, add on to a running total, return total at end. Uses a loop:
c2.geometric=function(x,p) {
  total=0
  for (i in 1:x) {
    prob=geometric(i,p)
    total=total+prob
  }
  total
}

126 Checking
c2.geometric(5,0.5)
## [1] 0.96875
c.geometric(5,0.5)
## [1] 0.96875
c2.geometric(20,0.1)
## [1] 0.8784233
c.geometric(20,0.1)
## [1] 0.8784233
Same as before.

127 Selecting stuff in R

128 Use dplyr
Easiest way to select parts of data frame is to use dplyr tools. Use cars data for example:
str(cars)
## 'data.frame': 38 obs. of 6 variables:
## $ Car : Factor w/ 38 levels "AMC Concord D/L",..: 7
## $ MPG : num
## $ Weight : num
## $ Cylinders : int
## $ Horsepower: int
## $ Country : Factor w/ 6 levels "France","Germany",..:

129 Selecting columns
The base R way (gives a vector of the 38 values):
cars$Cylinders
The dplyr way (gives a one-column data frame):
select(cars,Cylinders)
## Cylinders
## 1 4
## 2 4
## 3 6
## 4 4
## 5 6
## 6 4
## 7 4
## 8 4
## 9 8
## 10 5
## 11 8
## 12 6
## 13 4
## 14 4
## 15 4
## 16 6
## 17 4
## 18 6
## 19 6
## 20 8

130 Columns by number
select also takes a column number. For example, Cylinders is column number 4:
select(cars,4)
## Cylinders
## 1 4
## 2 4
## 3 6
## 4 4
## 5 6
## 6 4
## 7 4
## 8 4
## 9 8
## 10 5
## 11 8
## 12 6
## 13 4
## 14 4
## 15 4

131 Selecting rows
By logical condition using filter, eg. cars with MPG greater than 34:
filter(cars,MPG>34)
## Car MPG Weight Cylinders Horsepower Country
## 1 Fiat Strada Italy
## 2 Plymouth Horizon U.S.
## 3 Mazda GLC Japan
## 4 Dodge Colt Japan
By row number(s) using slice, eg. Fiat Strada, row 4:
slice(cars,4)
## Car MPG Weight Cylinders Horsepower Country
## 1 Fiat Strada Italy
or rows 3 and 5:
slice(cars,c(3,5))
## Car MPG Weight Cylinders Horsepower Country
## 1 Mercury Zephyr U.S.
## 2 Peugeot 694 SL France

132 Rows and columns, the base R way
Leave the row or column index empty to select every row or every column:
4th row (all columns):
cars[4,]
## Car MPG Weight Cylinders Horsepower Country
## 4 Fiat Strada Italy
2nd column (all the MPG values):
cars[,2]
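A minimal base R sketch of the same indexing. The cars data frame from the slides is not available here, so this assumes the built-in mtcars data as a stand-in; the column names mpg and cyl belong to mtcars, not to cars.

```r
# mtcars ships with R; it stands in for the cars data used on the slides
row4 <- mtcars[4, ]     # 4th row, every column (still a data frame)
col1 <- mtcars[, 1]     # 1st column (mpg), every row (drops to a plain vector)

# Rows by logical condition and columns by name, in one step
high <- mtcars[mtcars$mpg > 30, c("mpg", "cyl")]

print(rownames(row4))
print(high)
```

Note the asymmetry: picking one row keeps a data frame, but picking one column gives back a vector.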

133 Multiple selections
For example, names and MPGs of cars with MPG over 34:
tmp=filter(cars,MPG>34)
select(tmp,c(Car,MPG))
## Car MPG
## 1 Fiat Strada 37.3
## 2 Plymouth Horizon 34.2
## 3 Mazda GLC 34.1
## 4 Dodge Colt 35.1
(two selections one after the other, with first stored in temporary data frame)
Order here does not matter, but if we wanted name and MPG of cars with 6 cylinders, must do filter first; else, after select, no column called Cylinders left.

134 Or, this way
or like this (same selection):
cars %>% filter(MPG>34) %>% select(c(Car,MPG))
## Car MPG
## 1 Fiat Strada 37.3
## 2 Plymouth Horizon 34.2
## 3 Mazda GLC 34.1
## 4 Dodge Colt 35.1
Symbol %>% called "pipe". Read above as "take cars, and then take the rows where MPG bigger than 34, and then take columns called Car and MPG".

135 Comparing code with and without pipe
Without pipe (original way):
tmp=filter(cars,MPG>34)
select(tmp,c(Car,MPG))
With pipe:
cars %>% filter(MPG>34) %>% select(c(Car,MPG))
In a pipe, the first (data frame) argument of each function disappears. Data frame used is whatever came out of the previous step.
Code with pipe more concise and uses no temporary variables.
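A runnable sketch of the same comparison, assuming the built-in mtcars data in place of cars (so the columns are mpg and cyl rather than MPG and Car) and a cutoff of 30 chosen only so that a few rows survive.

```r
library(dplyr)

# Without pipe: store the intermediate result in a temporary data frame
tmp <- filter(mtcars, mpg > 30)
no_pipe <- select(tmp, mpg, cyl)

# With pipe: each step receives the previous result as its first argument
with_pipe <- mtcars %>% filter(mpg > 30) %>% select(mpg, cyl)

print(with_pipe)
print(identical(no_pipe, with_pipe))
```

Both routes run exactly the same two function calls, so the results should match; the pipe only changes how the code reads.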

136 Another example
Pipe way of selecting gas mileage (column 2) of Fiat Strada (row 4):
cars %>% select(2) %>% slice(4)
## MPG
## 1 37.3
Pipe comes with dplyr and can be used with any function that takes a data frame first:
cars %>% filter(MPG<30) %>% head()
## Car MPG Weight Cylinders Horsepower Country
## 1 Buick Skylark U.S.
## 2 Mercury Zephyr U.S.
## 3 Peugeot 694 SL France
## 4 Buick Estate Wagon U.S.
## 5 Audi Germany
## 6 Chevy Malibu Wagon U.S.

137 And, or
Combine multiple conditions in filter using & for "and" and | for "or".
Cars that weigh more than 4 tons and have gas mileage less than 20:
filter(cars,Weight>4 & MPG<20)
## Car MPG Weight Cylinders Horsepower Country
## 1 Buick Estate Wagon U.S.
## 2 Ford Country Squire Wagon U.S.
Can also do "and" as two filters, one after the other:
cars %>% filter(Weight>4) %>% filter(MPG<20)
## Car MPG Weight Cylinders Horsepower Country
## 1 Buick Estate Wagon U.S.
## 2 Ford Country Squire Wagon U.S.

138 Or example
Cars that either weigh more than 4 tons or have gas mileage less than 20:
filter(cars,Weight>4 | MPG<20)
## Car MPG Weight Cylinders Horsepower Country
## 1 Peugeot 694 SL France
## 2 Buick Estate Wagon U.S.
## 3 Chevy Malibu Wagon U.S.
## 4 Dodge Aspen U.S.
## 5 Chrysler LeBaron Wagon U.S.
## 6 AMC Concord D/L U.S.
## 7 Ford LTD U.S.
## 8 Volvo 240 GL Sweden
## 9 Dodge St Regis U.S.
## 10 Ford Country Squire Wagon U.S.
## 11 Mercury Grand Marquis U.S.
## 12 Chevy Caprice Classic U.S.
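A sketch of "and" versus "or" that can be run without the cars data. It assumes mtcars as a stand-in, with its wt (weight) and mpg columns and cutoffs picked purely for illustration.

```r
library(dplyr)

# "and": both conditions must hold
both <- mtcars %>% filter(wt > 4 & mpg < 15)

# "and" again, written as two filters one after the other
chained <- mtcars %>% filter(wt > 4) %>% filter(mpg < 15)

# "or": at least one condition must hold, so this can only keep more rows
either <- mtcars %>% filter(wt > 4 | mpg < 15)

print(isTRUE(all.equal(both, chained)))
print(c(and = nrow(both), or = nrow(either)))
```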

139 More selections
Which countries do the 8-cylinder cars come from?
cars %>% filter(Cylinders==8) %>% select(Country)
## Country
## 1 U.S.
## 2 U.S.
## 3 U.S.
## 4 U.S.
## 5 U.S.
## 6 U.S.
## 7 U.S.
## 8 U.S.
All from the US.

140 Yet more selections
Gas mileages of 8-cylinder cars?
cars %>% filter(Cylinders==8) %>% select(MPG)
## MPG
All bad.

141 How many cylinders do the high-MPG cars have?
Define "high" as "30 or more":
cars %>% filter(MPG>=30) %>% select(Cylinders)
## Cylinders
## 1 4
## 2 4
## 3 4
## 4 4
## 5 4
## 6 4
## 7 4
## 8 4
## 9 4
## 10 4
## 11 4
All 4. Not a surprise. (Conditional distribution of number of cylinders given that MPG 30 or more.)

142 Not
How many cars not from the US? This is a filter too, but we have an extra step to count them:
cars %>% filter(Country!="U.S.") %>% summarize(n=n())
## n
## 1 16
16 of 38 cars are not from US. Or see which other countries we have, and how many of each:
cars %>% filter(Country!="U.S.") %>% group_by(Country) %>% summarize(n=n())
## # A tibble: 5 x 2
## Country n
## <fctr> <int>
## 1 France 1
## 2 Germany 5
## 3 Italy 1
## 4 Japan 7
## 5 Sweden 2
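A sketch of the filter-then-count pattern on data that ships with R. It assumes mtcars in place of cars, with "not 8 cylinders" playing the role of "not from the US". The count() shortcut shown second is not on the slides; it is a dplyr function that does the group_by and summarize(n=n()) in one step.

```r
library(dplyr)

# The way shown on the slide: filter, group, then count rows per group
long_way <- mtcars %>% filter(cyl != 8) %>% group_by(cyl) %>% summarize(n = n())

# Shortcut (an addition, not from the slides): count() groups and counts in one call
short_way <- mtcars %>% filter(cyl != 8) %>% count(cyl)

print(long_way)
print(short_way)
```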

143 Doing things all at once using dplyr

144 Doing things all at once
R very good at applying things to entire data frames, vectors. For example, calculating means by rows or columns.
If you're a programmer, might do these tasks using loops. But no need in R: dplyr has all you need.

145 The orange trees again
Go back to orange tree circumferences:
oranges
## row ages A B C D E

146 Row means
Row means: in dplyr, group by rows (there are n() of them), then calculate the means of columns A through E for each group (row):
oranges %>% group_by(1:n()) %>% mutate(m=mean(A:E))
## Source: local data frame [7 x 9]
## Groups: 1:n() [7]
##
## row ages A B C D E 1:n() m
## <int> <int> <int> <int> <int> <int> <int> <int> <dbl>
Extra column m contains row means (mean circumference at each time).
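One thing worth checking before relying on this pattern: as far as I can tell, inside mutate the expression A:E is R's ordinary colon operator, so for a single row it builds the sequence from the value of A up to the value of E rather than collecting columns A through E. The sketch below shows the difference on a made-up data frame d (hypothetical numbers, not the orange tree data) and compares with rowwise() plus c_across(), which I am assuming is available (dplyr 1.0.0 or later), and with base R rowMeans.

```r
library(dplyr)

# Hypothetical toy data, chosen so the two calculations visibly disagree
d <- data.frame(A = c(1, 10), B = c(2, 20), C = c(9, 60))

# Colon operator on the row's values: mean of 1:9 and of 10:60
by_colon <- d %>% rowwise() %>% mutate(m = mean(A:C)) %>% ungroup()

# c_across() gathers the values of columns A through C for the current row
by_c_across <- d %>% rowwise() %>% mutate(m = mean(c_across(A:C))) %>% ungroup()

# Base R check
base_means <- unname(rowMeans(d[, c("A", "B", "C")]))

print(by_colon$m)
print(by_c_across$m)
print(base_means)
```

If the first and second lines of output differ on your own data, the c_across or rowMeans version is the one that matches "mean of the columns".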

147 Column medians
Column medians: use summarize_each thus:
oranges %>% summarize_each(funs(median),A:E)
## A B C D E
The function to calculate for each column goes inside funs, and the columns to find the median for go after that.
Column medians are actually 4th number in each column, since values in order.
Same method for column-anything.
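summarize_each and funs have since been deprecated in dplyr. A sketch of what I believe is the current equivalent, using across() (assumes dplyr 1.0.0 or later), on a made-up data frame d rather than the orange tree data:

```r
library(dplyr)

# Hypothetical toy data standing in for the oranges columns
d <- data.frame(A = c(3, 1, 2), B = c(30, 10, 20))

# across(columns, function): apply the function to each selected column
meds <- d %>% summarize(across(A:B, median))
print(meds)
```

The order of the pieces is swapped compared with summarize_each: columns first, then the function.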

148 A more tricky one
The first quartile Q1 for each row:
oranges %>% group_by(1:n()) %>% mutate(q1=quantile(A:E,probs=0.25))
## Source: local data frame [7 x 9]
## Groups: 1:n() [7]
##
## row ages A B C D E 1:n() q1
## <int> <int> <int> <int> <int> <int> <int> <int> <dbl>
Feed all the variables you want quartiles for into quantile, and then say which quantile you want.

149 Means etc. by groups
Back to cars: mean MPG (quantitative) for each Country (categorical). aggregate will do this, but so will dplyr:
cars %>% group_by(Country) %>% summarize(m=mean(MPG), s=sd(MPG))
## # A tibble: 6 x 3
## Country m s
## <fctr> <dbl> <dbl>
## 1 France NA
## 2 Germany
## 3 Italy NA
## 4 Japan
## 5 Sweden
## 6 U.S.

150 Means by groups (2)
For combination of categorical variables, put them all in the group_by, eg. by Country and Cylinders:
cars %>% group_by(Country,Cylinders) %>% summarize(n=n(),m=mean(MPG),s=sd(MPG))
## Source: local data frame [11 x 5]
## Groups: Country [?]
##
## Country Cylinders n m s
## <fctr> <int> <int> <dbl> <dbl>
## 1 France NA
## 2 Germany
## 3 Germany NA
## 4 Italy NA
## 5 Japan
## 6 Japan NA
## 7 Sweden NA
## 8 Sweden NA
## 9 U.S.
## 10 U.S.
## 11 U.S.

151 What happens with function returning several values?
Function quantile returns 5-number summary by default:
quantile(cars$MPG)
## 0% 25% 50% 75% 100%
What happens with summarize then?
cars %>% group_by(Country) %>% summarize(q=quantile(MPG))
## Error in eval(expr, envir, enclos): expecting a single value
We have to work around this, as shown on next page.

152 Handling function returning several values
This arcane code:
cars %>% group_by(Country) %>%
  do(q=quantile(.$MPG)) %>%
  do(data.frame(
    ctry=.$Country,which=names(.$q),value=.$q
  ))
## Source: local data frame [30 x 3]
## Groups: <by row>
##
## # A tibble: 30 x 3
## ctry which value
## * <fctr> <fctr> <dbl>
## 1 France 0% 16.2
## 2 France 25% 16.2
## 3 France 50% 16.2
## 4 France 75% 16.2
## 5 France 100% 16.2
## 6 Germany 0% 20.3
## 7 Germany 25% 21.5
## 8 Germany 50% 30.5
## 9 Germany 75% 31.5
## 10 Germany 100% 31.9
## # ... with 20 more rows

153 Comments
Key part of code is to use do twice: first time to construct a variable holding all the quantiles (5 of them), which does this:
cars %>% group_by(Country) %>%
  do(q=quantile(.$MPG))
## Source: local data frame [6 x 2]
## Groups: <by row>
##
## # A tibble: 6 x 2
## Country q
## * <fctr> <list>
## 1 France <dbl [5]>
## 2 Germany <dbl [5]>
## 3 Italy <dbl [5]>
## 4 Japan <dbl [5]>
## 5 Sweden <dbl [5]>
## 6 U.S. <dbl [5]>

154 Comments (2)
Second time to pull out those values, by constructing a data frame containing their names (which percentile) and their values, labelled by country, producing this (summary):
cars %>% group_by(Country) %>%
  do(q=quantile(.$MPG)) %>%
  do(data.frame(
    ctry=.$Country,which=names(.$q),value=.$q
  )) %>% str()
## Classes 'rowwise_df', 'tbl_df', 'tbl' and 'data.frame': 30 obs. of
## $ ctry : Factor w/ 6 levels "France","Germany",..:
## $ which: Factor w/ 5 levels "0%","100%","25%",..:
## $ value: num

155 Displaying it better
Not the clearest display. We could put the percentiles in columns. This is the inverse of gather, which is spread:
cars %>% group_by(Country) %>%
  do(q=quantile(.$MPG)) %>%
  do(data.frame(
    ctry=.$Country,which=names(.$q),value=.$q
  )) %>% spread(which,value)
## # A tibble: 6 x 6
## ctry 0% 100% 25% 50% 75%
## * <fctr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 France
## 2 Germany
## 3 Italy
## 4 Japan
## 5 Sweden
## 6 U.S.
spread seems to have put the percentiles in the wrong order (the factor levels of which are sorted alphabetically, so 100% comes before 25%). This is more trouble than it's worth to fix!
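An alternative sketch in base R that sidesteps both the double do and the column-order problem. This is not the slides' method: it assumes mtcars as a stand-in for cars (mpg by cyl instead of MPG by Country) and uses split and sapply, which keep the percentiles in the order quantile returns them.

```r
# Split the mpg values into one vector per group (here: number of cylinders)
groups <- split(mtcars$mpg, mtcars$cyl)

# quantile returns 5 values per group; sapply binds them as columns,
# and t() flips the result so that each group is a row
five <- t(sapply(groups, quantile))
print(five)
```

The result is a matrix rather than a data frame, with one row per group and the columns 0%, 25%, 50%, 75%, 100% in their natural order.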

156 And even...
Five-number summary of MPG by Country-Cylinders combo:
cars %>% group_by(Country,Cylinders) %>%
  do(q=quantile(.$MPG)) %>%
  do(data.frame(
    ctry=.$Country,
    cyl=.$Cylinders,
    which=names(.$q),
    value=.$q)) %>%
  spread(which,value)
## # A tibble: 11 x 7
## ctry cyl 0% 100% 25% 50% 75%
## * <fctr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 France
## 2 Germany
## 3 Germany
## 4 Italy
## 5 Japan
## 6 Japan
## 7 Sweden
## 8 Sweden
## 9 U.S.
## 10 U.S.
## 11 U.S.

157 Vector and matrix algebra in R

158 Vector addition
Define a vector, then add 2 to it:
u=c(2,3,6,5,7)
k=2
u+k
## [1] 4 5 8 7 9
Adds 2 to each element.
Adding vectors:
u
## [1] 2 3 6 5 7
v=c(1,8,3,4,2)
u+v
## [1]  3 11  9  9  9
Elementwise addition. (MAT A23: vector addition.)


More information

LAB 1 INSTRUCTIONS DESCRIBING AND DISPLAYING DATA

LAB 1 INSTRUCTIONS DESCRIBING AND DISPLAYING DATA LAB 1 INSTRUCTIONS DESCRIBING AND DISPLAYING DATA This lab will assist you in learning how to summarize and display categorical and quantitative data in StatCrunch. In particular, you will learn how to

More information

ECONOMICS 351* -- Stata 10 Tutorial 1. Stata 10 Tutorial 1

ECONOMICS 351* -- Stata 10 Tutorial 1. Stata 10 Tutorial 1 TOPIC: Getting Started with Stata Stata 10 Tutorial 1 DATA: auto1.raw and auto1.txt (two text-format data files) TASKS: Stata 10 Tutorial 1 is intended to introduce (or re-introduce) you to some of the

More information

Chapter 2 Assignment (due Thursday, April 19)

Chapter 2 Assignment (due Thursday, April 19) (due Thursday, April 19) Introduction: The purpose of this assignment is to analyze data sets by creating histograms and scatterplots. You will use the STATDISK program for both. Therefore, you should

More information

The name of our class will be Yo. Type that in where it says Class Name. Don t hit the OK button yet.

The name of our class will be Yo. Type that in where it says Class Name. Don t hit the OK button yet. Mr G s Java Jive #2: Yo! Our First Program With this handout you ll write your first program, which we ll call Yo. Programs, Classes, and Objects, Oh My! People regularly refer to Java as a language that

More information
