Lab 1 overview

In this course we will rely heavily on the R programming language for statistical computing and graphics (e.g., this website is actually built in R using ‘rmarkdown’ and rshiny). The purpose of this first laboratory exercise is to develop the level of familiarity with R that is needed to succeed in this course – and ultimately, to establish a foundation for you to develop your data analysis skills using R throughout your scientific career.

Lab 1 details

Specifically, this lab will provide a basic introduction to the R programming Language and the use of R to perform statistics and programming tasks.

As with all lab reports, your answers will either take the form of R functions or short written responses. The R functions should be stored in an R script file (‘.R’ extension). To allow me to evaluate your work more efficiently, please name your R script using the following convention: "[your first name]_[your last name]_lab1.R“. So my submission would be”kevin_shoemaker_lab1.R". Please include only the requested functions in your submitted R script. You will probably complete the lab exercises using a main script (that includes all your work), and then save a reduced copy (with only the requested functions) for submission.

Before submitting your code, please clear your environment, run your script and make sure that the script only defines functions within your global environment. Your script should generate no additional objects (eg, data frames).

Please submit the R script and the Word document via WebCampus by midnight on the due date (one week after the final lab session allocated for this topic). You can work in groups but submit the materials individually. One student in each group should be an ‘R guru’ if at all possible!

Please provide any requested written answers as a Word document in WebCampus- keep your responses as brief as possible! Also, I will not be asgining yoru scroes bsased on sepelling, grammmer or the propreiety orf yur senstence cosnstrujctin.

This lab exercise will extend over two laboratory periods (Aug. 24 and 31), with your R script and Word document due Sept. 7.

Let’s get started!

  1. Go to website http://cran.r-project.org/. This is the source for downloading the free, public-domain R software and where you can access R packages, find help, access the user community, etc. If you haven’t already installed R, do so now!
  2. Open the Rstudio software (https://www.rstudio.com– and install if you haven’t done so already. Change the working directory to a convenient directory (e.g., a subfolder called ‘Lab 1’ in your main course directory). NOTE- if you set up an Rstudio “project” (.Rproj extension) in this directory, the working directory is set automatically each time you load the project! I recommend you set up an R studio project for this course (and maybe a separate one for your final project, when that comes around).
  3. Open a new, blank R script window in Rstudio. Using comments (anything preceded by a pound sign is not interpreted by R and can be used to provide human-readable notes and commentary so your code is more readable), add a header to the script to indicate that this is lab 1- include your name and the course number in the header.
  4. Save the script to your working directory, using the naming convention: "[your first name]_[your last name]_lab1.R“, all lower-case. So my script would be named:”kevin_shoemaker_lab1.R". You will be submitting this script when you have finished all the exercises. Again, please include only the requested functions in your submitted R script.
  5. While you’re at it, also start a new Word document to record your written responses to the lab exercises (I will indicate where I expect a written response in Word vs. an R function). There is no specific naming convention for your Word documents.

Take some time to get more familiar with R

From the R manual, ‘Introduction to R’ you will implement all the steps in Appendix A, located here. This takes you far but without much explanation– it is a way to jump into the deep end of the pool. Alternatively, or in addition, go to the Links page and pick a tutorial to run through (e.g., the Datacamp introductory R course). You can also check out the materials from UNR’s R “bootcamp”. Before you move on, make sure you have a basic understanding of how to work with R and RStudio. If you are working with someone who has used R before, don’t be afraid to ask questions! Also, take your time with this- you might want to use the entire lab period on this if you have never used R before.

Another useful introductory R tutorial can be found here, courtesy of NCEAS. Please take the time to complete this tutorial, including going through the exercises.

If you already have basic R expertise, this is your opportunity to help your peers to develop the level of comfort and familiarity with R that they will need to perform data analysis and programming tasks in this course.

Depending on whether you are already familiar with R, you may also find the remainder of this document useful as you work your way through the course (and there are many other good introductory R resources available online… let me know if there is one you particularly like and I will add it to the course website (Links page). As you work your way through the tutorial(s) (on your own pace), please ask the instructor or your peers if you are uncertain about anything.

Interactive lab exercises!

Since there is no TA in this class, I have tried wherever possible to embed interactive text windows (which serve as a ‘virtual TA’ and were created using ‘R shiny’) in the course website where you can test your code and get instant feedback as to whether your answers are correct! Here’s an example:

Write a function that takes a numeric input vector and computes the sum of all its elements.

Basically, all you need to do here is write ‘sum(vector)’ where it says ‘[add code here]’

NOTE: you don’t need to submit this as part of your lab report! This is just practice…

myfunc <- function(vector){
    # [add code here!]
}

Hint: You may want to use the sum() function.

Demonstration: Central Limit Theorem (CLT)

To gain some familiarity with using R scripts and developing algorithms, complete the following exercise (you don’t need to submit this demo as part of your lab write-up BUT you will be asked to use this code as part of exercise 1 (below).

Remember that you may want to have two scripts for this and other labs. The first script contains all the code you built to learn new concepts, develop and test code for answering the questions, etc. The second script is the one you will submit, and only contains the functions you defined as part of each lab. I recommend putting the second script (for submission) together after you finish the lab (just before submitting).

Complete the following steps:

Review the meaning of the Central Limit Theorem, which implies that the mean of a sufficiently large number of independent random variables (small samples from a much larger population of interest) will have a sampling distribution that is approximately normally distributed, regardless of whether the underlying data distribution is normally distributed.

Type (or paste) the following code into your R script window (or RStudio script window):

####################
# CENTRAL LIMIT THEOREM demonstration
####################

infinity <- 100000       # number approximating infinity      

N_SAMPLES <- 1000       # number of independent random samples to draw from the specified distribution
SAMPLESIZE <- 10          # sample size of each independent random sample


########
#  Define the random number distribution. Here, we will use a "uniform" random number generator with a min of 10 and max of 20.   
TRUEMIN <- 10            
TRUEMAX <- 20      
data_population <- runif(infinity,TRUEMIN,TRUEMAX)      # here we define the full set of possible random numbers to draw random samples from (the population of potential data). We assume this is uniformly distributed

#######
# Draw multiple samples from the pool (population) of possible data.  

samplemean <- numeric(N_SAMPLES)     # set up storage vector
for(i in 1:N_SAMPLES){      # for each replicate (independent random sample)
  sample <- sample(data_population,SAMPLESIZE)   # draw an independent random sample from the population of interest
  samplemean[i] <- mean(sample)    # compute and record the sample mean
}

hist(data_population,freq=F,ylim=c(0,1),main="",xlab="Value")      # plot out the distribution of sample means
hist(samplemean,freq=F,add=T,col="red")    # overlay the distribution of the underlying data from which we are drawing samples.
legend("topleft",fill=c("white","red"),legend=c("data population","sample means"),bty="n")     # add a legend to the plot

  1. Experiment with executing this code in the following four ways:
    • copy and paste from the script window directly into the R console;
    • use to execute line by line from within the script window in RStudio; (or on Macs, do something else!… sorry…)
    • use to select the whole code block, then to execute all at once; (or on Macs, do something else!)
    • save the script to your working directory as a text file with .R extension, and then run the script using the “source()” function, e.g.:

source("CentralLimitTheorem.R")     # load the R script!

NOTE: The pound (#) sign used above allows the R programmer to insert comments adjacent to snippets of code, which facilitates readability of code (and lets the programmer remember later on what he/she was thinking when coding things a certain way!). It is good practice to comment every line so as to describe its precise meaning, including all variables and functions. Make sure you fully understand the commented code for the central limit theorem demonstration above!

Here is an interactive window you can use to play around with the CLT code (or even better, just open another RStudio instance and explore the code there!):

# [Add code here] 

Hint: Try running with different values for the ‘sample.size’ parameter.

  1. Now modify your R script to see how closely the distribution of sample means follows a normal distribution. Use a “quantile-quantile” (q-q) plot to visualize how closely the quantiles of the sampling distribution resemble the quantiles of a normal distribution. Use the “qqnorm()” function. To learn more about this function, type:
?qqnorm   # learn more about the "qqnorm()" function

Plot the q-q plot next to the histograms. The plot on the left should be the comparison of histograms (for population distribution and distribution of sample means) shown in the original script (above). The plot on the right should be the q-q plot. To produce side-by-side plots, you will need to add this line of code to the appropriate place in your script:

########
# Set graphical parameters for side by side plotting

par(mfrow=c(1,2)) # sets up two side by side plots as one row and two columns

## or alternatively...

layout(matrix(1:2,nrow=1))

In addition, run a Shapiro-Wilk normality test, which tests the null hypothesis that a set of numbers (in this case the vector of sample means) indeed comes from a normal distribution (so what does a low p-value mean??). Use the “shapiro.test()” function:

?shapiro.test

Again, here is an interactive window you can use to play around with the CLT code (or just use another instance of RStudio).

# [Add code here] 

Hint: Try running with different values for the ‘sample.size’ parameter.

So… what can you conclude from these tests? Can you conclude that the distribution of sample means is NOT normal/Gaussian??

Exercise 1: home-made functions!

Finally we have arrived at the ACTUAL lab exercises. Remember you don’t need to submit any of the work up to this point…

For the first part of this lab, you are asked to write several functions (and submit them as part of your lab script, of course!). The functions increase in order of complexity!

You should now know how to construct functions in R. If you don’t, go back to the NCEAS tutorial and review the section on writing functions.

Exercise 1a (for R script)

Write an R function called “CoefVar()” that takes a numeric vector (named ‘vector’ within the function but could have any name outside the function) as input, and computes (and returns) its coefficient of variation (CV; standard deviation as a proportion of the mean). To make sure it works, apply your function to the ‘Height’ vector in the ‘trees’ dataset that installs with R as sample data:

You can use this “sandbox” (below) to develop and test your function! You can test your function in the ‘sandbox’ below, but when you submit, make sure you delete all code except for the function itself (but comments are okay).

CoefVar <- function(vector){
    # [add code here!]
}

Hint: You probably want to use the ‘sd()’ and ‘mean()’ functions.

You can use the following code to test your function using the built-in ‘trees’ dataset.

############
# Explore the "trees" dataset

?trees
## starting httpd help server ... done
summary(trees)    # learn more about the data
##      Girth           Height       Volume     
##  Min.   : 8.30   Min.   :63   Min.   :10.20  
##  1st Qu.:11.05   1st Qu.:72   1st Qu.:19.40  
##  Median :12.90   Median :76   Median :24.20  
##  Mean   :13.25   Mean   :76   Mean   :30.17  
##  3rd Qu.:15.25   3rd Qu.:80   3rd Qu.:37.30  
##  Max.   :20.60   Max.   :87   Max.   :77.00
trees$Height    # extract the "Height" column from the trees dataset.
##  [1] 70 65 63 72 81 83 66 75 80 75 79 76 76 69 75 74 85 86 71 64 78 80 74 72 77
## [26] 81 82 80 80 80 87
CoefVar(trees$Height)    # run your new function!
## [1] 0.08383964

* Exercise 1b (for R script)

Write a function called “DrawLine()” for drawing a regression line through a scatter plot. This function should be specified as follows:

  • input:
    • x = a numeric vector specifying the x-coordinates of the scatter plot
    • y = a numeric vector specifying the y-coordinates of the scatter plot
  • suggested algorithm:
    • with the x and y coordinates, first produce a scatterplot (HINT: use the “plot()” function)
    • use the “lm()” function to regress the y variable on the x variable.
    • record the intercept and slope of the linear relationship between x and y (HINT: use the “coef()” function)
    • add a regression line to the scatter plot (HINT: use the “abline()” function)
  • return:
    • coefs = a vector of length 2, storing the intercept and slope of the linear relationship (in that order)

You can use this “sandbox” (below) to develop and test your function! Again, remember to comment out everything except for your “DrawLine” function definition prior to submitting (you can test the function using the “run” button).

DrawLine <- function(x,y){
    # [add code here!]
}
#DrawLine(trees$Height,trees$Volume)

Hint: You may want to use the coef() function.

As a test, apply this function to the ‘Height’ (x axis) and ‘Volume’ (yaxis) vectors in the ‘trees’ dataset, and then to the ‘waiting’ (x axis) and ‘eruptions’ (y axis) vectors in the ‘faithful’ dataset.

# ?faithful
# summary(faithful)

DrawLine(faithful$waiting,faithful$eruptions)    # test your function using the old faithful eruptions data

## (Intercept)           x 
## -1.87401599  0.07562795

* Exercise 1c (for R script)

Write a function called “DrawLine2()” for (optionally) drawing a “smoothed” regression line through a scatter plot, making the smoothing span (degree of smoothness, or non-wiggliness of the line) a user-defined option. This function should be specified as follows:

  • input:
    • x = a numeric vector specifying the x-coordinates of the scatter plot
    • y = a numeric vector specifying the y-coordinates of the scatter plot
    • smooth = a logical (TRUE/FALSE) value defining whether or not to add a smoothed line or a simple regression line
    • span = a number indicating the degree of smoothness, or “non-wiggliness” of the smoothed line (only applies if smooth=TRUE)
  • suggested algorithm:
    • with the x and y coordinates, first produce a scatterplot (HINT: use the “plot()” function)
    • if smooth is FALSE, then proceed to draw a straight line as before (and return the coefficients)
    • if smooth is TRUE, use the “scatter.smooth()” function to plot a smoothed, locally-weighted regression of the y variable on the x variable. Make sure you use the “span” argument!
    • if smooth is TRUE, use the “loess()” function to record the same smoothed, locally-weighted regression of the y variable on the x variable. Again, make sure you use the “span” argument!
  • return:
    • out = the loess model (the output produced by running “loess()” (or the slope and intercept from the linear regression, if smooth=FALSE)

You can use this “sandbox” (below) to develop and test your function!

xvec <- c(1:10)
yvec <- rnorm(length(xvec),c(2:6,7:3),2)     # you can use these vectors for testing the function
DrawLine2 <- function(x,y,smooth=F,span=1){    # note the default values (the ones set using equal sign)
    # [add code here!]
}
# DrawLine2(faithful$waiting,faithful$eruptions,smooth=T,span=.5)    # test using the old faithful eruptions data

Hint: You may want to use the scatter.smooth() function to draw the curve and the ‘loess()’ function to build the model.

Try testing your function using the trees and faithful datasets!

xvec <- c(1:10)
yvec <- rnorm(length(xvec),c(2:6,7:3),2)
DrawLine2(xvec,yvec,smooth=T,span=0.5)    # run your new function!

## Call:
## loess(formula = y ~ x, span = span)
## 
## Number of Observations: 10 
## Equivalent Number of Parameters: 8.49 
## Residual Standard Error: 2.423
DrawLine2(x=trees$Height,y=trees$Volume,smooth=F,span=NA)

## (Intercept)           x 
##   -87.12361     1.54335
DrawLine2(faithful$waiting,faithful$eruptions,smooth=T,span=.5)    # test using the old faithful eruptions data

## Call:
## loess(formula = y ~ x, span = span)
## 
## Number of Observations: 272 
## Equivalent Number of Parameters: 6.71 
## Residual Standard Error: 0.3731
DrawLine2(faithful$waiting,faithful$eruptions,smooth=T,span=.1)
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : pseudoinverse used at 81
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : neighborhood radius 1
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : reciprocal condition number 0
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : There are other near singularities as well. 1
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = FALSE, :
## pseudoinverse used at 81
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = FALSE, :
## neighborhood radius 1
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = FALSE, :
## reciprocal condition number 0
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = FALSE, :
## There are other near singularities as well. 1
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = FALSE, :
## pseudoinverse used at 81
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = FALSE, :
## neighborhood radius 1
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = FALSE, :
## reciprocal condition number 0
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = FALSE, :
## There are other near singularities as well. 1
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = FALSE, :
## pseudoinverse used at 81
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = FALSE, :
## neighborhood radius 1
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = FALSE, :
## reciprocal condition number 0
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = FALSE, :
## There are other near singularities as well. 1
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = FALSE, :
## pseudoinverse used at 81
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = FALSE, :
## neighborhood radius 1
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = FALSE, :
## reciprocal condition number 0
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = FALSE, :
## There are other near singularities as well. 1
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = FALSE, :
## pseudoinverse used at 81
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = FALSE, :
## neighborhood radius 1
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = FALSE, :
## reciprocal condition number 0
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = FALSE, :
## There are other near singularities as well. 1

## Call:
## loess(formula = y ~ x, span = span)
## 
## Number of Observations: 272 
## Equivalent Number of Parameters: 29.97 
## Residual Standard Error: 0.3762

Exercise 1d (for R script)

Write a function called “CLTdemo()” based on the central limit theorem (CLT) demonstration code above. This function should be specified as follows:

  • input:
    • n.samples = number of independent random samples to draw from the specified distribution (default = 1000)
    • sample.size = sample size (length of each independent random sample) (default = 10)
    • min = lower bound of the uniform distribution to draw from (default=10)
    • max = upper bound of the uniform distribution (default=20)
  • suggested algorithm:
    • see demonstration above!
    • generate side-by-side plots of the histogram of sample means (left) and a quantile-quantile plot to test for normality.
  • return:
    • out = the Shapiro-Wilks normality test results (the output produced by running “shapiro.test()”)

You can use this “sandbox” (below) to develop and test your function! As always, remember to comment out everything except for the requested function prior to submitting.

CLTdemo <- function(n.samples=1000,sample.size=10,min=10,max=20){
    # [add code here!]
}
# CLTdemo(n.samples=5000,sample.size=4,min=10,max=20)

Hint: Use the CLT Demo code from earlier in Lab 1!

And you can test your code with the following command:

answer1d <- CLTdemo(n.samples=5000,sample.size=4,min=10,max=20)    # run your new function!

Exercise 1e (for Word doc)

Finally, test the CLT function out for different parameter combinations to make sure it works! See if you can use this function to develop a reasonable rule for how large a sample size is necessary to ensure that the sample mean is normally distributed given that the underlying data population is a uniform distribution. Please include your answer in your Word document- and please justify your answer! [note: I am just looking for a thoughtful response, not a definitive answer! And just a few sentences is fine.]

You can use the following interactive R window for testing if that is helpful:

   # use this space if it's helpful

Exercise 1f (optional, not for submission)

Optionally, modify your CLTdemo() function (e.g., call it “CLTdemo2()”) to try a different underlying data distribution (other than uniform). If you really want to test the limits of the CLT, try creating your own highly non-standard distribution and seeing if the result you obtained in exercise 1e still holds. For example:

rlocodist <- function(n){
  vals <- c(1,7,10,35)         # possible data values
  probs <- c(1,2,5,0.5)     # relative probability of each data values
  probs <- probs/sum(probs)    
  vals[apply(rmultinom(n,1,probs),2,function(t) which(t==1))]     # sample from this made-up distribution
}

lots=10000
datafountain <- rlocodist(lots)

hist(datafountain, main="non-standard made-up distribution")

You can use this “sandbox” (below) to develop and test your function! Remember, you do not need to submit this function as part of your submitted r script.

CLTdemo2 <- function(n.samples=1000,sample.size=10,...){
    # [add code here!]
}
#CLTdemo2(n.samples=5000,sample.size=4,...)

Hint: Use the CLT Demo code from earlier in Lab 1!

Aside: default values in functions

NOTE: to set default values, just use the equals sign when specifying your function. For example, say I wanted to write a function that adds numbers in a vector. It might look something like this:

newsum <- function(x=c(1,2,4)){
  sm <- sum(x)
  return(sm)
}

newsum(x=c(5:10))   # specify x manually
## [1] 45
newsum()    # use default value!
## [1] 7

Try setting some alternative default values and re-running the function with and without arguments until you are sure you understand how default values work!

Multiple Regression 1: Air Quality Data

The following is a refresher on performing multiple regression analyses in R:

  1. Type the following for a list of sample datasets that come with the core R package (some of these you have already encountered).
library(help = "datasets")    # list of sample datasets that come with R

?airquality
  1. Examine the ‘airquality’ dataset (use the ‘head’ and ‘summary’ functions). Note that there are missing values where ozone concentration data and solar radiation data were not collected.

  2. We could ignore the missing values and just go ahead with our regression analysis, since the default response of the “lm()” (‘linear model’) function is to omit cases with missing values in any of the specified parameters. However, to avoid problems later, we will omit them explicitly by constructing a new, ‘cleaned’ dataset as follows:

air.cleaned <- na.omit(airquality)    # remove rows with missing data
  1. Conduct a multiple linear regression of ozone concentration as a function of solar radiation, wind and temperature. Use the ‘lm()’ function to conduct an ordinary least squares (OLS) regression analysis.

  2. Explore the regression outputs using the ‘summary’ function, and explore regression diagnostics using, e.g. (depending on what you named the regression model object):

par(mfrow=c(3,2))
plot(model1, which=c(1:4))   # diagnostic plots (NOTE: the plot function returns these plots by default when the input is a linear regression model)


hist(residuals(model1), breaks=10)   # histogram of residuals
plot(predict(model1) ~ air.cleaned$Ozone)     # plot predicted vs observed- should follow 1:1 line. Examine this for model biases.
abline(0,1)

NOTE: see this website for more information on the diagnostic plots produced by lm().

If no one in your group knows why you are doing any of this or what it all means, ask the instructor! That’s why he’s hanging around the lab…

Here is a practice R window in case it is helpful:

# practice space for regression analysis!
  1. Consider the possibility that there may be an important interaction effect between solar radiation and temperature on influencing ozone concentrations. Explore that with a simple scatter plot where symbol size is scaled to ozone concentration:
symbols(air.cleaned$Temp, air.cleaned$Solar.R, circles=air.cleaned$Ozone/100, ylab="Solar Radiation", xlab="Temperature", main="Interaction Plot", inches=FALSE)

       # alternatively...
coplot(air.cleaned$Ozone~air.cleaned$Temp|air.cleaned$Solar.R,rows=1)  # the pipe operator can be read "conditional on"

       # alternatively, you can use ggplot- which you probably already know better than your instructor! 
  1. Now fit a second model that includes the interaction between solar radiation and temperature. Use the following formula to fit the interaction:
formula2 <- "Ozone ~ Wind + Solar.R * Temp"   # you can name formulas...
  1. Explore regression outputs for the second model in the same way as you did for the first model without the interaction term.

  2. Conduct an ‘F Test’ (or a Likelihood Ratio Test, LRT, if you prefer…) to formally test whether the more complex model (including the interaction term) fits the data significantly better than the reduced model (with fewer parameters) that lacks the interaction term. Recall that the \(R^2\) value is inadequate for this purpose because \(R^2\) will always increase with additional parameters! Use the following syntax,

anova(model1, model2, test="F")   # how would you run an LRT test instead?

You can use this “sandbox” (below) to play around with multiple regression in R! You don’t need to include any of this code in your submission.

# your code here

Exercise 2: regression in R (written responses)

Very briefly (but in complete sentences) answer the following questions in your Word document:

  • Exercise 2a By how much (and in what direction) is ozone concentration expected to change if temperature increased from 26 to 30 degrees Celsius? Assume that solar radiation stays constant at 200 lang and wind speed stays constant at 9 mph. Be careful with the units for temperature! Make sure to use the ‘interaction’ model (model2) from part 7 above to answer this question. Please briefly explain (in your Word document) how you got your answer.

  • Exercise 2b What is the null hypothesis that the p-values for the individual regression coefficients are designed to test?

Multiple Regression 2: Noble fir data

For this exercise, we will use tree data from a forest in the western Oregon Cascades.

We will fit a multiple linear regression model that predicts forest tree biomass as a function of environmental variables (including a mix of continuous and categorical predictors) and stand age.

Obtain the TreeData.csv file from the “Data Sets” tab (or just download here)- save it to your working directory.

This describes a subset of forest inventory data from the Douglas-fir forests of western Oregon (n = 90, 0.1-ha sites).

Arranged in columns from left to right, variables are:

  • Site: site identifier
  • Biomass: tree biomass (for all species) in Mg/ha, the response variable for Part 1 of the lab.
  • ABPR: Presence/absence of Abies procera (noble fir) on a given site (coded 1 for presence).
  • StandAge: Maximum tree age in the 0.1-ha plot. This variable will be used as a proxy for successional stage. We assume that stand-replacing fires are the dominant form of disturbance and that stand age is a reasonable proxy variable for time since the last fire.
  • X, Y: geographic coordinates – UTM easting and northing, respectively
  • Elev: elevation (m)
  • Northeastness: slope aspect that has been linearized using a cosine transformation so that the aspect of 45 degrees has value 1 and aspect of 225 degrees has value -1. In this study area, this variable is expected to reflect a moisture gradient from moister (NE) to drier (SW) aspects.
  • Slope: slope steepness (degrees)
  • SlopePos: slope position, a categorical variable (i.e. factor) with three values: Valley, Slope and Ridge.

This is a comma-delimited (.csv) file, which is a common file format for importing data into R. Import the data into R as a data frame (R’s version of an excel spreadsheet), using the following command:

NobleFir.df <- read.csv("TreeData.csv")

Inspect the resulting data object. Summarize it using the ‘summary()’ and ‘plot()’ functions.

Obtain a correlation matrix for biomass and the four numeric predictor variables using the ‘cor()’ function and by subscripting column locations on the data frame (ask instructor for explanation of syntax if needed):

cor(NobleFir.df[,c(2,4,7:9)])

Are any of the predictor variables highly correlated?

Calculate Box Plots for the continuous predictor variables (excluding x and y coordinates) according to sites with or without noble fir. Use the ‘boxplot()’ function. What clear relationships, if any, emerge for how sites with and without noble fir differ with regard to their environmental setting? For example:

Use multiple linear regression to model tree biomass as a function of predictor variables (excluding spatial coordinates), using the same approach for regression fitting and diagnostics as we did previously.

Re-run the regression to obtain standardized regression coefficients, allowing direct comparison of effect sizes for the continuous predictor variables (since all variables are then transformed to standard deviate units, i.e. mean centered on zero with standard deviation of one). The ‘scale’ function provides an easy way to implement this.

Biomass_std.lm <- with(NobleFir.df,     # using the "with()" statement, we don't need to keep referencing the name of the data frame.
  lm(scale(Biomass) ~ scale(elev) + scale(Northeastness) + scale(Slope) + SlopePos + scale(StandAge))
)

Visually assess whether regression errors (residuals) are spatially autocorrelated using the ‘symbols’ function:

with(NobleFir.df,
  symbols(x,y,circles=abs(residuals(Biomass_std.lm)), inches=0.3, ylab="Northing", xlab="Easting", main="Errors from Biomass Regression Model")
)

Note the ‘with’ function above- this function essentially makes all columns in a data frame part of the global environment- that is, the commands you enclose in a ‘with’ function can refer to the columns in the data frame as if they were part of the global environment. This is a safe way to replicate the ‘attach’ function in R (which I recommend you never use!).

You can use this “sandbox” (below) to play around with this example in R!

# your code here

Exercise 3: noble fir regression

Answer the following questions (with brief justification) in your Word document:

  • Exercise 3a Can forest biomass be reliably predicted by topographic variables and stand age? Explain your reasoning.
  • Exercise 3b Is there spatial variation in model goodness of fit (i.e., on the residual error)? Use your visual assessment of regression errors (residuals) across space to answer this question. Explain your reasoning.
  • Exercise 3c Which of the environmental influences you included in your model are most important in predicting forest biomass? Explain your reasoning.

Exercise 4: Algorithmic (brute force) z-test

Review the “brute-force z-test” code from the “Why focus on algorithms” lecture. Then complete the following exercises:

Exercise 4a

What if we wanted to run a two-tailed z-test? That is, what if our alternative hypothesis were that salmon fed on the new vegetarian diet could plausibly be larger or smaller than those fed on the conventional diet after one year? Modify the function (“z.test.algorithm()”) with a new argument that allows for both one and two-tailed tests! Name your new function “z.test.q4a()”.
To convince yourself that your new function works, try running your function for a (made-up) case where the observed body mass for those fed on the new diet are generally higher than the expected body mass for those fed on the conventional diet after one year – the opposite of your (alternative) hypothesis!

NOTE: you may get tangled up with the null hypothesis/p-value concept, which is admittedly a difficult concept! A p-value always assumes the null hypothesis is true (given the null hypothesis is true, a p-value gives the probability of obtaining a test statistic as or more extreme than the observed test statistic). The alternative hypothesis for the 1-tailed test is that the population mean for the treatment group is less than the population mean for conventional farm-raised salmon. The alternative hypothesis for the 2-tailed test is that the population mean for the treatment group is less than or greater than the population mean for conventional farm-raised salmon. The null hypothesis, as always, is that there is no difference.

Include your function in your submitted r script!

This function should be specified as follows:

  • input:
    • sample = a vector of observed sample values
    • pop.mean = a scalar value representing the expected mean value under the null hypothesis
    • pop.sd = a scalar value representing the population standard deviation under the null hypothesis
    • onetail = a logical TRUE or FALSE to indicate whether or not to perform a one-tailed test or a two-tailed test
  • suggested algorithm:
    • See lecture for example z-test function
    • To implement the two-sided test, you need to define what ‘extreme’ means in both directions. You might first define the absolute difference between the population mean and the sample mean. Then define ‘more extreme’ as any sample mean that falls further away from the population mean in either direction (larger or smaller than the population mean).
    • Use an “if-else” statement to accommodate both a one-tailed and a two-tailed test.
  • return:
    • to_return = a list with three elements: “null_dist”, representing the sampling distribution for sample means under the null hypothesis,
      “p_value”, representing the p-value, and
      “observed_mean”, representing the mean of the observed sample.

You can use this “sandbox” (below) to develop and test your function!

population.mean = 4.5
population.sd = 0.9
my.sample = c(3.14,3.27,2.56,3.77,3.34,4.32,3.84,2.19,5.24,3.09)
# ztest <- z.test.algorithm(sample = my.sample, pop.mean=population.mean, pop.sd=population.sd )  # use function from class..

z.test.q4a <- function(sample, pop.mean, pop.sd, onetail=T){
    # [add code here!]
}
#z.test.q4a(sample = my.sample, pop.mean=population.mean, pop.sd=population.sd,onetail=F )   # test your function

Hint: No hints … yet!

Test your function using some alternative sample data values. For example:

population.mean = 4.5
population.sd = 0.9
my.sample = c(5.14,3.27,4.56,3.77,3.34,4.32,3.84,3.19,5.24,4.09)
z.test.q4a(sample = my.sample, pop.mean=population.mean, pop.sd=population.sd,onetail=F )
## $null_dist
##    [1] 4.316747 4.770561 4.032500 4.603852 4.450013 4.328502 4.296983 4.511876
##    [9] 4.465084 4.639269 4.585778 4.522653 4.354704 4.465592 4.729727 4.335968
##   [17] 4.208391 4.830332 4.594505 4.703342 4.498340 4.658716 4.588669 4.741817
##   [25] 3.986019 4.734259 4.630257 4.802767 4.287979 4.648972 4.695111 4.785717
##   [33] 4.208295 4.541756 4.891147 4.477356 4.976918 4.801133 4.151884 4.290882
##   [41] 4.632383 4.590038 4.561054 4.403268 4.554950 4.788865 4.860299 4.321902
##   [49] 4.420150 4.289798 4.472190 4.099439 4.619850 4.451294 4.167545 4.782852
##   [57] 4.436746 3.987455 3.802015 4.622455 4.433434 4.721921 4.686328 4.379001
##   [65] 4.935248 4.052356 4.242759 4.547473 4.701919 4.416579 4.489660 4.469970
##   [73] 4.473287 4.440283 5.164047 4.555043 4.358827 4.221100 4.434869 4.280401
##   [81] 4.989331 4.394092 4.364043 4.335263 4.566799 4.930959 4.422621 4.367727
##   [89] 4.076260 4.196748 4.720624 4.930314 4.625350 4.311998 4.258147 4.875953
##   [97] 4.399998 4.711004 4.326041 4.261689 4.388493 4.564290 4.253488 4.690348
##  [105] 4.466371 4.509491 4.501854 4.293426 4.459171 4.349939 4.281777 4.765475
##  [113] 4.321327 4.920713 4.551723 4.204353 4.203020 4.449682 4.393013 4.633849
##  [121] 4.427271 4.445130 4.771630 3.593764 4.339265 4.701981 4.589546 4.395617
##  [129] 4.357728 5.124562 4.561166 4.637635 4.108844 4.608899 4.442772 4.754452
##  [137] 4.295925 4.492359 5.366682 4.653879 4.404673 4.510038 4.525972 4.751443
##  [145] 4.292190 4.812269 4.558240 4.778236 4.691260 4.514345 4.759007 4.572163
##  [153] 4.094375 4.149142 4.466727 4.412338 4.685000 4.640478 4.563273 4.940709
##  [161] 3.852339 4.991853 4.773284 4.371979 3.782515 4.560329 4.684194 4.574346
##  [169] 4.979941 4.437966 4.370146 4.482622 4.559052 4.205272 4.615152 4.639217
##  [177] 4.488914 4.181249 4.548420 4.759753 4.223196 4.525338 4.672106 4.289015
##  [185] 4.686932 4.661335 4.326158 4.219182 4.170768 4.755146 4.771968 4.487745
##  [193] 4.409578 3.921235 4.495015 4.674483 4.753118 4.052900 4.339214 4.022378
##  [201] 4.622523 4.454893 4.874894 4.825225 4.664857 4.859208 4.282902 4.335131
##  [209] 4.589804 4.027944 4.981772 4.732389 4.184094 4.473663 4.716402 4.437493
##  [217] 4.414365 4.481012 4.333433 4.581765 4.544169 4.153525 4.659882 4.490575
##  [225] 4.210870 4.813934 5.048715 4.796340 4.151161 4.512131 4.752260 4.465896
##  [233] 4.313751 3.932716 4.347864 4.818638 4.808917 4.140665 4.607157 4.404889
##  [241] 4.892227 4.748179 4.530640 4.646433 3.744763 4.026190 4.236494 4.691026
##  [249] 4.313480 4.113420 4.111184 4.991594 4.550630 5.037122 4.466484 3.898635
##  [257] 4.615216 4.313010 4.390403 4.283788 4.278831 4.314763 4.784068 4.762094
##  [265] 4.454299 4.796310 4.814192 4.757928 3.987556 4.277108 4.308022 4.409343
##  [273] 4.796586 4.443572 4.828134 3.735836 4.658617 3.994708 3.975210 4.567820
##  [281] 4.148388 4.544937 4.579417 4.991803 4.231855 4.476849 4.311346 4.244715
##  [289] 4.169289 4.391779 4.530481 4.885366 4.649666 4.723987 4.477360 3.720949
##  [297] 4.721756 4.751210 4.032115 4.206875 4.632252 4.298961 4.244539 4.436334
##  [305] 4.293852 4.606964 4.515290 4.665953 4.144912 4.534390 4.790463 4.331167
##  [313] 5.119804 4.746275 4.456400 4.441744 4.905767 4.210535 4.091231 4.177029
##  [321] 4.997503 4.294935 4.316000 5.031446 4.461462 4.490683 4.775761 4.404896
##  [329] 4.726645 4.503734 4.223112 4.670536 4.936354 4.060269 4.769946 4.275483
##  [337] 4.638672 4.063436 4.623539 4.755251 4.612749 4.553176 4.271294 4.378308
##  [345] 4.550206 4.451060 4.752670 4.331555 4.310893 4.338153 4.634420 4.383652
##  [353] 4.178441 4.955960 4.557162 4.513034 4.407934 4.786962 4.524878 4.612746
##  [361] 4.487379 4.172155 4.000951 4.226952 4.471185 3.782460 4.542027 4.681080
##  [369] 4.884899 4.328840 4.238976 4.994919 4.272135 4.183073 4.651960 4.834813
##  [377] 4.243298 3.899413 4.279913 4.423270 4.562731 4.492908 4.094315 4.764144
##  [385] 4.378211 4.204666 4.628369 4.784138 4.597098 4.370079 4.493915 4.832151
##  [393] 4.945185 4.189597 4.473125 4.540711 4.498540 4.400466 4.919241 4.449074
##  [401] 4.445461 4.533550 4.482217 4.376421 4.636864 4.553398 4.171159 4.404498
##  [409] 4.826868 3.962051 4.410676 4.288133 4.417517 4.452341 4.560798 4.138757
##  [417] 4.721164 4.689469 4.700975 4.974486 4.119262 5.064047 4.294205 4.737391
##  [425] 4.478121 4.173968 3.960976 4.397688 4.178113 3.784869 4.830469 4.803432
##  [433] 4.108872 4.656094 4.482926 4.730997 4.581534 4.318312 4.645291 4.513398
##  [441] 4.507244 4.639196 4.722597 4.338639 4.304570 4.252998 4.174941 4.262794
##  [449] 4.553821 4.329573 4.307321 4.737875 4.351659 4.541076 4.411459 4.109951
##  [457] 4.586195 4.439645 4.181221 4.606183 4.793824 4.626975 3.980394 5.101195
##  [465] 4.486015 4.033368 4.770831 4.495588 4.848306 4.462237 4.047291 4.559679
##  [473] 4.371495 4.215994 3.931421 4.564310 4.713681 4.492941 4.714866 4.661232
##  [481] 4.661357 4.326866 4.593385 4.764523 4.106499 4.640550 4.247274 4.739586
##  [489] 4.654888 4.468896 4.880006 4.815734 4.629406 4.577215 4.916019 4.190692
##  [497] 4.584513 4.285632 4.387866 4.574898 4.602442 4.316646 4.415032 4.294094
##  [505] 4.347858 4.909982 4.083851 4.267031 4.249453 4.624663 4.628857 5.032759
##  [513] 4.391531 4.898153 4.709326 4.735837 4.420464 4.662111 5.135883 4.521564
##  [521] 4.413254 4.788576 4.389444 4.572999 4.509638 4.519257 4.363705 4.943347
##  [529] 4.029989 4.285777 4.527885 4.505762 4.731283 4.782663 4.560392 4.180042
##  [537] 4.882546 4.213538 4.040280 4.178087 4.818003 4.562252 5.011877 4.487945
##  [545] 4.214280 4.897499 4.753589 4.128855 4.592912 4.766774 4.677803 4.071719
##  [553] 4.426421 4.442024 4.291847 4.480919 4.167798 4.433340 4.629236 4.764497
##  [561] 4.217569 4.866338 4.142858 3.952288 4.569929 4.100053 4.134121 4.753006
##  [569] 4.757046 4.103823 4.645994 4.384498 4.672280 4.491453 4.369014 4.590655
##  [577] 4.422548 4.612126 4.371357 4.378083 4.490364 4.697058 4.305055 4.611329
##  [585] 4.415590 4.057715 4.569984 4.925360 4.532121 4.643315 4.407150 4.159804
##  [593] 4.528593 4.406334 4.532533 4.233003 4.604701 4.597216 4.398766 4.558031
##  [601] 4.950845 4.498870 4.541300 4.636191 4.054203 5.079401 4.503323 4.198760
##  [609] 4.572710 4.127734 4.233839 4.441439 4.499130 3.678562 4.356375 4.307783
##  [617] 4.438272 4.359819 4.434958 4.278042 4.121143 4.684430 4.530078 4.645960
##  [625] 4.765422 4.728693 4.473531 3.984340 3.912301 4.058381 4.477691 4.685864
##  [633] 4.227445 4.576469 4.426526 4.466599 5.024264 4.302791 5.118676 4.251611
##  [641] 4.178850 4.123792 4.261268 4.425229 4.414916 4.547607 4.764350 3.941916
##  [649] 4.476154 4.701915 4.233648 4.477376 4.650705 4.463849 4.864249 4.792595
##  [657] 4.678310 4.405516 4.349482 4.481614 4.900713 4.645163 4.707827 4.262593
##  [665] 4.358857 4.396133 4.428179 4.910978 4.040920 4.563401 4.386326 4.362462
##  [673] 4.253115 4.247269 4.008814 4.496097 3.833050 4.655642 4.889529 4.536016
##  [681] 5.029261 4.300746 4.841502 4.149076 4.758941 4.664485 4.392807 4.867762
##  [689] 4.304098 4.253419 4.561112 4.307664 4.883979 4.002298 4.582731 4.663505
##  [697] 4.348648 4.917695 4.902282 4.213924 4.312817 4.838686 4.320497 4.244020
##  [705] 4.305479 4.323423 4.575700 4.263000 4.634939 4.390390 4.779058 4.386552
##  [713] 4.281292 4.359480 4.493412 4.344609 5.216580 4.545535 4.463812 4.607675
##  [721] 4.886878 3.806920 4.896332 4.500346 4.652223 5.028455 3.977530 3.840059
##  [729] 4.427781 4.606428 4.270668 4.349639 4.508685 4.420136 4.534446 4.680305
##  [737] 4.165394 4.755535 4.225843 4.710444 4.485627 4.583335 4.560034 4.235407
##  [745] 4.379618 4.871849 4.600983 4.471173 4.300754 4.667488 4.379810 4.604521
##  [753] 4.693217 4.669413 4.556853 4.224799 4.303881 4.314547 4.043208 4.704336
##  [761] 4.198941 4.418416 4.729176 4.438874 4.501512 4.890298 4.450245 4.279057
##  [769] 4.806912 4.472319 4.571102 4.549040 4.610101 4.707971 4.845058 4.181404
##  [777] 4.641833 4.437509 4.489471 4.009210 4.620221 3.841737 4.847796 4.194299
##  [785] 4.334230 4.200573 4.255521 4.180587 4.315617 4.757015 4.624459 4.711589
##  [793] 3.617069 4.322011 4.755832 4.065846 4.395793 4.344992 4.375103 4.518435
##  [801] 4.074912 4.383825 4.299655 4.366433 4.916015 4.646937 4.335940 4.386183
##  [809] 4.583114 4.480587 4.365883 4.534916 4.704998 4.759672 4.498924 4.551884
##  [817] 4.700013 4.679001 4.521834 4.751578 3.992182 4.766310 4.386933 4.349992
##  [825] 4.918631 4.246220 4.836436 5.100124 4.719299 4.526750 4.753320 4.815131
##  [833] 4.821955 4.686607 4.407839 4.940871 4.908128 4.451257 4.574496 4.130842
##  [841] 4.545531 4.750484 4.298851 4.518935 4.561029 4.548418 4.819964 4.289677
##  [849] 4.122220 4.449402 4.290607 4.551399 4.425597 5.117913 5.029843 4.686406
##  [857] 4.499324 4.927163 4.544927 4.599271 4.093079 4.248442 4.631660 4.618854
##  [865] 4.727979 4.976289 4.170559 4.513495 4.496263 3.944228 5.225250 4.157764
##  [873] 4.686310 4.645423 4.127868 4.349805 4.201819 4.780636 4.625905 5.055880
##  [881] 4.790236 4.177532 4.388084 4.741211 4.303211 4.175298 4.098124 4.181008
##  [889] 4.515698 4.977493 4.529489 5.015807 4.685965 4.744109 4.870566 4.441928
##  [897] 4.670682 4.759402 4.847670 4.821146 4.654368 3.813624 4.152606 4.459807
##  [905] 4.155822 4.063588 4.617168 4.597298 3.967168 4.475822 4.017342 4.336467
##  [913] 4.051133 3.932611 4.249138 4.636442 4.884996 4.544601 3.880613 4.347868
##  [921] 4.200803 4.984713 5.104904 3.983879 4.231326 4.804119 4.185800 5.136738
##  [929] 4.557633 4.551566 4.431674 4.580336 4.440891 4.545496 4.214823 5.074161
##  [937] 4.770897 4.501691 4.368651 4.396240 5.023392 4.310800 4.426661 4.433583
##  [945] 4.329128 3.868268 4.356343 4.561522 4.574470 4.813581 4.251138 4.795501
##  [953] 4.145448 4.356443 4.201690 4.430242 4.426506 3.953691 4.712272 4.497190
##  [961] 4.963926 4.030154 4.323355 4.390788 4.509719 4.583821 4.708235 4.326422
##  [969] 4.616452 4.317063 4.879864 4.369537 4.781301 4.293374 4.679306 4.205416
##  [977] 4.702429 4.264887 4.231266 4.684812 5.153215 4.171854 4.042909 4.818366
##  [985] 4.972043 4.405789 4.705868 5.052251 4.150549 4.466766 4.393466 4.289429
##  [993] 4.223133 4.400417 4.685691 4.668115 4.405018 4.579411 4.060233 4.360351
## 
## $p_value
## [1] 0.13
## 
## $observed_mean
## [1] 4.076
   # also try using the code from lecture to compare against a "real" z-test!
population.mean = 4.5
population.sd = 0.9
my.sample = c(5.14,3.27,4.56,3.77,3.34,4.32,3.84,3.19,5.24,4.09)

Exercise 4b

Now, what if we had access to a large set of samples under the null hypothesis AND that we would like to relax the assumption that the data follow a normal distribution under the null hypothesis?

In this exercise, you are asked to modify the brute-force z-test function so that you generate your sampling distribution by simply sampling from the known data set under the null hypothesis. To do this you can use the ‘sample’ function in R. For this purpose you should sample with replacement. Here is an example:

null.data=c(2.2,3.86,6.39,4.6,3.43,5.16,4.36,4.22,6.31,4.61,5.13,4.12,4.64,4.03,5.01,7.33,5.35,4.7,2.82,4.87,3.87,5.95,5.28,4.02,3.58,4.03,5.38,5.5,3.07,3.29,3.45,5.25,5.7,1.26,5.28,4.19,4.76,4.2,4.81,2.5)
sample(null.data,10,replace=T)    # use the sample function to sample from a large vector with replacement

For this problem, let’s go back to the original null hypothesis: that salmon raised on the new diet will be smaller (lower mass) than those raised on conventional diet after one year.

Include your function in your submitted r script!

This function should be named ‘z.test.q4b’ and should be specified as follows:

  • input:
    • sample = a vector of observed sample values
    • null.data = a vector of observed sample values under the null hypothesis
  • suggested algorithm:
    • Instead of sampling from a normal distribution to simulate samples under the null hypothesis, use the ‘sample’ function to sample directly from the null data distribution.
  • return:
    • to_return = a list with three elements: “null_dist”, representing the sampling distribution for sample means under the null hypothesis, “p_value”, representing the p-value, and “observed_mean”, representing the mean of the observed sample.

You can use this “sandbox” (below) to develop and test your function!

null.data=c(2.2,3.86,6.39,4.6,3.43,5.16,4.36,4.22,6.31,4.61,5.13,4.12,4.64,4.03,5.01,7.33,5.35,4.7,2.82,4.87,3.87,5.95,5.28,4.02,3.58,4.03,5.38,5.5,3.07,3.29,3.45,5.25,5.7,1.26,5.28,4.19,4.76,4.2,4.81,2.5)
my.sample = c(3.14,3.27,2.56,3.77,3.34,4.32,3.84,2.19,5.24,3.09)
# ztest <- z.test.algorithm(sample = my.sample, pop.mean=population.mean, pop.sd=population.sd )  # use function from class..

z.test.q4b <- function(sample, null.data){
    # [add code here!]
}
#z.test.q4b(sample = my.sample, null.data )   # test your function

Hint: No hints … yet!

Test your function using some alternative sample data vectors. For example:

null.data=c(2.2,3.86,6.39,4.6,3.43,5.16,4.36,4.22,6.31,4.61,5.13,4.12,4.64,4.03,5.01,7.33,5.35,4.7,2.82,4.87,3.87,5.95,5.28,4.02,3.58,4.03,5.38,5.5,3.07,3.29,3.45,5.25,5.7,1.26,5.28,4.19,4.76,4.2,4.81,2.5)
my.sample = c(3.14,3.27,2.56,3.77,3.34,4.32,3.84,2.19,5.24,3.09)
z.test.q4b(sample = my.sample, null.data )
## $null_dist
##    [1] 4.241 5.111 4.432 4.840 4.600 3.953 4.054 4.032 4.165 4.560 4.566 3.544
##   [13] 3.940 4.613 5.492 4.986 4.522 4.731 4.214 5.196 5.070 4.698 3.855 4.494
##   [25] 4.909 4.100 4.670 4.639 5.093 4.416 3.592 4.005 3.996 4.328 4.753 4.248
##   [37] 4.607 4.539 3.651 4.179 4.500 3.916 4.243 3.808 4.519 4.119 4.359 4.739
##   [49] 3.850 4.669 4.639 4.399 4.341 4.348 4.403 4.177 3.759 4.558 4.052 4.871
##   [61] 3.476 4.345 4.252 4.647 4.760 4.197 4.348 4.332 3.406 4.590 4.890 4.601
##   [73] 4.533 4.598 4.205 4.468 4.399 4.285 4.211 4.510 4.386 4.368 4.324 4.269
##   [85] 4.930 4.792 4.368 4.161 4.681 5.152 4.342 4.302 5.088 5.183 4.709 3.862
##   [97] 4.575 4.646 4.085 4.369 4.627 3.683 4.845 4.346 4.696 4.840 5.329 4.116
##  [109] 4.043 3.674 4.025 4.288 4.497 5.071 4.126 4.282 4.560 5.404 4.453 3.723
##  [121] 5.113 5.014 4.457 4.636 4.887 4.398 3.941 4.437 4.661 4.114 4.182 4.148
##  [133] 5.411 3.427 4.779 4.892 4.045 5.324 4.862 3.877 4.660 4.279 4.913 4.022
##  [145] 4.345 4.664 4.072 4.865 4.262 4.431 4.069 4.248 4.543 4.376 4.414 4.481
##  [157] 3.825 4.550 4.711 4.368 4.803 4.358 4.887 4.758 4.820 4.043 4.955 3.942
##  [169] 4.719 5.216 4.630 3.981 4.217 3.477 4.677 4.835 4.063 4.006 4.190 4.437
##  [181] 4.680 5.023 4.300 5.425 5.106 5.017 4.579 4.315 4.371 4.599 4.473 4.830
##  [193] 4.654 5.127 4.899 4.374 4.257 4.273 4.545 3.748 5.059 4.287 5.094 4.220
##  [205] 4.342 4.277 4.368 4.589 4.221 4.700 4.177 4.392 4.679 3.926 4.691 4.689
##  [217] 4.270 4.624 4.318 4.731 4.702 4.390 5.000 4.625 3.561 4.050 5.025 4.168
##  [229] 5.132 4.379 4.579 4.641 4.649 4.368 4.339 4.361 3.858 5.050 4.756 4.723
##  [241] 4.330 4.895 3.749 3.553 3.955 4.682 4.255 4.314 4.288 4.720 4.500 3.706
##  [253] 4.518 4.930 4.976 3.387 4.820 4.120 5.109 5.247 4.412 4.324 4.568 4.150
##  [265] 4.324 3.962 4.231 4.849 3.921 4.451 4.605 4.728 3.871 4.423 4.199 4.788
##  [277] 4.721 4.552 4.098 3.897 4.790 4.330 4.358 4.816 5.113 4.808 4.282 4.955
##  [289] 4.463 4.359 4.389 4.725 4.812 4.731 4.432 4.177 4.389 4.685 3.790 3.830
##  [301] 4.872 4.597 3.913 4.541 4.207 4.651 4.173 4.181 5.014 4.424 4.714 4.235
##  [313] 4.747 4.831 4.489 5.060 4.417 4.253 4.641 4.057 3.974 4.232 4.587 4.767
##  [325] 4.216 4.239 4.816 4.273 4.226 4.614 4.630 4.405 4.120 4.704 4.707 4.571
##  [337] 3.891 4.877 4.361 4.766 4.669 4.979 4.383 4.507 5.093 4.215 4.821 3.773
##  [349] 5.019 4.652 4.648 4.521 4.531 4.708 4.513 4.402 4.307 4.455 4.495 4.515
##  [361] 4.293 4.082 4.425 4.270 5.020 4.450 4.777 4.366 4.230 4.572 4.122 4.303
##  [373] 4.792 4.516 4.640 3.960 4.774 4.503 3.009 3.984 4.682 5.098 4.661 4.154
##  [385] 4.770 4.859 4.547 4.318 4.386 4.536 4.469 3.728 4.379 3.799 4.084 4.708
##  [397] 4.326 4.639 4.598 4.524 4.202 4.723 4.386 3.672 4.750 4.548 4.433 4.437
##  [409] 4.618 4.547 4.121 3.959 4.860 4.384 4.250 3.925 4.426 4.363 4.572 4.292
##  [421] 4.026 4.330 4.396 4.290 4.522 3.579 4.225 4.419 4.495 4.338 5.027 3.762
##  [433] 4.339 4.924 4.608 4.809 4.452 3.654 4.006 4.704 4.830 4.974 4.698 3.817
##  [445] 4.479 4.425 4.612 4.468 4.187 4.034 5.204 4.262 4.238 4.702 3.881 4.548
##  [457] 4.726 4.579 4.008 4.173 3.897 4.320 4.101 4.308 5.086 4.514 4.288 4.569
##  [469] 4.467 4.898 3.550 4.829 3.942 4.170 4.390 5.085 4.295 4.683 4.571 3.841
##  [481] 4.567 4.407 4.152 4.730 4.145 4.062 4.237 4.263 4.145 3.937 4.140 4.408
##  [493] 4.571 4.508 4.602 4.339 4.371 4.659 4.608 4.598 4.495 3.552 4.355 3.689
##  [505] 4.451 4.942 4.372 4.217 4.529 5.266 4.915 3.969 4.762 4.557 4.135 4.964
##  [517] 4.703 4.464 4.069 4.699 4.851 4.260 4.554 4.155 4.323 4.434 4.419 4.352
##  [529] 4.499 5.113 4.459 4.737 4.737 4.808 4.884 4.786 3.614 4.192 4.397 4.745
##  [541] 4.148 4.370 4.385 4.580 4.426 4.225 4.533 5.264 4.842 4.116 4.498 4.501
##  [553] 3.727 4.685 3.712 4.726 4.692 4.155 3.788 5.135 4.360 4.279 4.499 4.686
##  [565] 4.686 4.520 4.002 4.556 4.173 4.496 4.872 4.521 4.575 4.592 5.153 4.353
##  [577] 5.268 4.370 4.885 4.340 4.098 4.535 4.595 4.429 4.294 4.312 4.636 4.644
##  [589] 4.771 4.487 4.637 4.493 4.146 3.945 4.119 4.614 4.464 5.266 4.855 5.034
##  [601] 4.076 4.554 3.923 4.633 4.689 4.621 4.630 4.113 4.380 3.662 3.504 4.887
##  [613] 4.213 3.924 4.237 4.431 4.727 4.896 4.698 4.310 4.872 4.170 4.960 4.309
##  [625] 4.569 4.685 4.671 4.336 4.539 4.716 4.381 4.712 4.061 3.918 4.117 4.435
##  [637] 5.041 5.445 4.084 4.188 4.724 4.788 4.023 4.503 4.357 4.093 4.011 4.018
##  [649] 4.292 4.126 5.047 4.031 4.390 4.215 4.388 5.189 4.547 4.622 3.920 4.159
##  [661] 4.655 4.956 4.116 4.102 4.238 4.611 3.709 4.893 4.104 4.404 4.425 4.869
##  [673] 4.511 3.865 4.015 4.301 4.787 4.132 4.669 4.492 4.613 4.058 4.446 4.628
##  [685] 4.653 4.485 4.600 4.166 4.966 4.978 4.305 4.462 4.752 4.700 3.475 4.496
##  [697] 4.659 4.437 4.450 4.416 4.775 4.915 4.855 5.021 5.303 2.621 4.327 4.870
##  [709] 4.785 4.557 5.604 4.374 4.620 4.881 4.882 4.964 4.370 4.132 4.591 4.457
##  [721] 4.367 4.095 4.535 4.765 4.794 4.649 4.595 4.695 4.966 4.524 3.986 4.612
##  [733] 4.481 4.025 4.652 4.912 5.036 3.941 4.357 3.872 5.052 4.661 4.475 4.293
##  [745] 4.666 4.428 4.372 5.282 5.059 4.137 4.681 4.751 4.716 4.441 4.756 3.757
##  [757] 4.134 5.073 4.251 3.671 4.570 4.341 4.654 3.889 4.473 4.889 5.392 4.636
##  [769] 4.354 4.616 4.515 4.969 4.101 4.097 4.625 4.689 4.418 4.626 4.563 4.296
##  [781] 4.620 4.596 4.636 4.028 4.630 4.714 4.771 4.512 4.767 4.262 4.935 4.879
##  [793] 4.313 4.417 5.198 4.507 4.492 4.467 4.365 4.515 4.867 4.389 4.246 4.436
##  [805] 4.411 4.744 5.156 4.211 4.617 4.109 4.686 4.151 4.143 4.154 4.338 4.288
##  [817] 4.717 4.815 4.477 4.116 4.925 4.584 4.077 4.733 4.796 4.549 4.144 4.481
##  [829] 4.225 4.128 4.310 4.298 4.661 4.280 4.807 5.202 5.510 4.841 4.288 4.509
##  [841] 4.032 4.786 4.671 4.400 4.828 5.092 3.912 4.559 4.728 4.774 4.686 4.359
##  [853] 4.154 4.454 4.393 4.243 3.751 4.291 4.812 3.901 4.844 4.110 4.977 4.077
##  [865] 4.861 4.570 5.177 3.881 4.469 4.052 4.148 5.112 4.701 4.744 4.676 4.385
##  [877] 4.739 4.239 4.068 4.700 4.249 4.370 4.135 4.304 4.132 4.262 4.007 4.472
##  [889] 4.010 4.341 4.445 4.727 4.784 4.290 4.231 5.256 4.876 4.439 4.494 4.144
##  [901] 4.674 4.257 4.581 4.268 4.741 4.107 4.621 4.764 3.998 3.392 4.870 4.829
##  [913] 4.164 4.612 4.479 5.451 3.961 5.047 4.370 4.212 3.842 4.869 4.858 5.167
##  [925] 4.699 4.958 4.270 4.299 4.744 4.144 4.221 4.450 4.426 4.577 5.061 4.714
##  [937] 4.249 4.774 3.899 4.770 4.484 5.047 4.015 3.910 4.355 4.678 4.210 4.040
##  [949] 4.019 4.328 4.661 4.728 4.416 4.593 4.866 3.998 4.845 4.347 4.526 4.211
##  [961] 4.868 4.459 4.011 5.041 3.873 4.837 5.029 4.390 4.678 4.419 4.505 4.486
##  [973] 4.385 4.451 4.418 4.431 4.528 5.276 4.548 3.916 4.663 4.557 4.871 4.145
##  [985] 4.776 4.478 4.212 4.122 5.075 4.620 4.890 4.140 4.409 4.593 4.241 5.082
##  [997] 4.239 4.472 4.081 4.777
## 
## $p_value
## [1] 0.008
## 
## $observed_mean
## [1] 3.476
null.data=c(2.2,3.86,6.39,4.6,3.43,5.16,4.36,4.22,6.31,4.61,5.13,4.12,4.64,4.03,5.01,7.33,5.35,4.7,2.82,4.87,3.87,5.95,5.28)
my.sample = c(3.14,3.27,2.56,3.77,3.34,4.32,3.84,2.19,5.24,3.09)

Exercise 5: Bootstrapping algorithm

Review the bootstrapping code from the “Why focus on algorithms” lecture (to generate confidence intervals for arbitrary test statistics). Then complete the following exercise:

Exercise 5a

Generate a new R function, called “RegressionCoefs()” that takes a data frame as the first input parameter and the name of the response variable as the second input parameter, and returns the (univariate) regression coefficients (\(\beta\)) produced by regressing the response variable on each predictor variable (returning a vector of regression coefficients). You can use the “Rsquared()” function (from the lecture) as a reference!

More specifically, this function should be specified as follows:

  • input:
    • df = a data frame containing with one response variable and one or more predictor variables. All columns that are not the response variable are assumed to be predictor variables.
    • responsevar = the name of the column that represents the response variable.
  • suggested algorithm:
    • Follow the example from lecture. Store the regression coefficients instead of the R-squared values.
  • return:
    • coefs = a vector of the regression coefficients (slope terms) for all predictor variables. This should be a vector with one element (slope term) for each predictor variable. The number of elements in your vector should be one fewer than the number of columns in your dataset.

Include your function in your submitted r script!

You can use this “sandbox” (below) to develop and test your function! Remember to comment out everything but the function prior to submitting.

df <- mtcars[,c(1,3,4,6)]
   
RegressionCoefs <- function(df,responsevar){
    # [add code here!]
}
# RegressionCoefs(df,"mpg")

Hint: No hints … yet!

You can test your function using the following code:

RegressionCoefs(df=trees,responsevar="Volume")   # should return two regression coefficients
##    Girth   Height 
## 5.065856 1.543350

Exercise 5b:

Generate a new R function, called “BootCoefs()” that meets the following specifications:

  • inputs:
    • “df” = a data frame that includes the response variable and all possible predictor variables
    • “statfunc” = a function for generating summary statistics (in this case, regression coefficients) from a data frame (which you already developed in part 1 of this challenge problem)
    • “n_samples” = the number of bootstrapped samples to generate
    • “responsevar” = the name of the response variable
  • algorithm:
    • with the data frame, use the “boot_sample()” function provided in the lecture to generate summary statistics for multiple bootstrap samples. You do not need to modify the ‘boot_sample()’ function for this problem
    • Then, generate confidence intervals for each variable as the 2.5%, 50% and 97.5% quantile of the summary statistic for each predictor variable.
  • return: a matrix (rows=predictor vars, cols=2.5%, 50%, and 97.5% quantiles). Try to give the rows and columns proper labels!

Include your function in your submitted r script!

You can use this “sandbox” (below) to develop and test your function!

df <- mtcars[,c(1,3,4,6)]
responsevar="mpg"

 # note: this is copied from the lecture code...
boot_sample <- function(df,statfunc,n_samples,responsevar="Volume"){
  indices <- c(1:nrow(df))
  output <- matrix(NA,nrow=n_samples,ncol=ncol(df)-1)        # storage object- to store a single bootstrapped sample from the original data
  
  for(i in 1:n_samples){              # for each bootstrap replicate:
    boot_rows <- sample(indices,size=nrow(df),replace=T)         # randomly sample observations with replacement
    newdf <- df[boot_rows,]                       # dataframe of bootstrapped observations
    output[i,] <- statfunc(newdf,responsevar)                 # generate statistics from the bootstrapped sample  (e.g., compute Rsquared after regressing y on all possible x variables)
  }
  return(output)
}

BootCoefs <- function(df,statfunc,n_samples,responsevar){
    # [add code here!]
}
BootCoefs(df=df,statfunc=RegressionCoefs,
              n_samples=1000,responsevar=responsevar)

Hint: No hints … yet!

You can test your function using the following code:

BootCoefs(df=trees,statfunc=RegressionCoefs,n_samples=1000,responsevar="Volume")
##            2.5%      50%    97.5%
## stat1 4.3568227 5.070591 5.618382
## stat2 0.7304976 1.558113 2.442687
df <- mtcars[,c(1,3,4,6)]
responsevar="mpg"
BootCoefs(df=df,
          statfunc=RegressionCoefs,
          n_samples=1000,
          responsevar=responsevar
)
##              2.5%         50%       97.5%
## stat1 -0.05153572 -0.04122000 -0.03118659
## stat2 -0.10086452 -0.06948531 -0.04703679
## stat3 -6.95561525 -5.37038579 -4.13929210
RegressionCoefs <- function(df=trees,responsevar="Volume"){    # univariate models only- interaction and multiple regression not implemented here
  response <- df[,responsevar]       # extract the response variable
  names <- names(df)                  
  coefs <- numeric(length(names))        # named storage vector
  names(coefs) <- names(df)               
  coefs <- coefs[names(coefs)!=responsevar]           # assume that all columns that are not the response variable are possible predictor variables
  i=names(coefs)[1]
  for(i in names(coefs)){         # loop through predictors
      predictor <- df[,i]                  # extract this predictor
      model <- lm(response~predictor)       # regress response on predictor
      coefs[i] <- coefficients(model)["predictor"]       # extract slope term
  }
  return(coefs)     
}

Exercise 5c:

To test your new function(s), generate bootstrapped confidence intervals around the regression parameters for your tree biomass regression. Compare these bootstrapped confidence intervals with the standard confidence intervals given by R in the “lm()” function. Are there any important differences? Explain your answer in your Word document. As always, please use your Word document submission to raise any points of confusion or ask me questions!

You can use this “sandbox” (below) to develop and test your function!

BootCoefs(df=trees,statfunc=RegressionCoefs,n_samples=1000,responsevar="Volume")

confint.lm(lm(Volume~Girth,trees))    # compare with lm() 

End of Lab 1!

NRES 746

Fall 2021