Lab 1 overview

In this course we use the R programming language for statistical computing and graphics. The purpose of this laboratory exercise is to develop familiarity with programming in R.

Lab 1 details

This lab provides an introduction to the R programming language and the use of R to perform statistical and programming tasks. As with all lab reports, your answers will either take the form of R functions or short written responses. The R functions should be stored in an R script file (‘.R’ extension). To allow me to evaluate your work more efficiently, please name your R script using the following convention: “[your first name]_[your last name]_lab1.R”. So my submission would be “kevin_shoemaker_lab1.R” (all lower case).

Please include only the requested functions in your submitted R script. That is, the R script you turn in should only include the custom R functions you constructed to answer the questions, and none of the ancillary code you used to develop and test your functions. You will probably want to complete the lab exercises using a ‘full working version’ of your script (that includes all your work, including code for testing your solutions), and then save a reduced copy (containing only the requested functions and nothing more) for submission. Ask your instructor if you’d like a demonstration of how this works!

Before submitting your code using WebCampus, please clear your environment and run the script you will submit (containing only the requested functions) from start to finish. Check the ‘Environment’ tab in RStudio after running your script, and make sure that the script ONLY defines functions within your global environment- no other objects (data, values, etc.). The script you submit should generate no additional objects (e.g., data frames), and it should not read in any files. Don’t put a ‘clear workspace’ line (e.g., ‘rm(list=ls())’) anywhere in your submitted code- unless you want to make life difficult for the instructor…

Submit your R script (functions only) using WebCampus. Follow the above instructions carefully!

Submit short-answer responses using WebCampus. In addition to submitting your script, you will need to respond to additional questions using the Lab 1 “quiz” interface on WebCampus.

Please submit your R script and complete the WebCampus quiz by midnight on the due date (one week after the final lab session allocated for this topic). You can work in groups but submit the materials individually. Refer to WebCampus for due dates.

This lab exercise (Lab 1) will extend over two laboratory periods.

Let’s get started!

  1. Go to the website http://cran.r-project.org/. This is the source for downloading the free, open-source R software, and where you can access R packages, find help, access the user community, etc. If you haven’t already installed R, do so now!
  2. Open the RStudio software (https://www.rstudio.com), and install it if you haven’t done so already. Change the working directory to a convenient directory (e.g., a subfolder called ‘Lab 1’ in your main course directory). NOTE- if you set up an RStudio “project” (.Rproj extension) in this directory, the working directory is set automatically each time you load the project! I recommend you set up an RStudio project for this course.
  3. Open a new, blank R script window in RStudio. Using comments (anything preceded by a hash mark is not interpreted by R, and can be used to provide human-readable notes and commentary to make your code more readable), add a header to the script to indicate that this is Lab 1- include your name and the course number in the header.
  4. Save the script to your working directory. You will be submitting a reduced version of this script (requested functions only) via WebCampus when you have finished all the exercises.
  5. While you’re at it, I recommend starting a new Word document or Google Doc to record your written responses to the lab exercises. When you’re ready, you can copy and paste your responses from this document directly into WebCampus (rather than writing your responses into WebCampus directly).

Take some time to get more familiar with R (optional)

This is the only lab period we will devote to learning R- after this we will assume you have proficiency with defining variables and functions, reading in datasets, visualizing data, etc. in R.

From the R manual ‘An Introduction to R’, you can work through all the steps in Appendix A, located here. This takes you far but without much explanation- it is a way to jump into the deep end of the pool.

Alternatively, or in addition, go to the Links page and pick a tutorial to run through (e.g., the Datacamp introductory R course, which uses the ‘tidyverse’ packages). You can also check out the materials from my R “bootcamp” archive.

Before you move on, make sure you have a basic understanding of how to work with R and RStudio. If you are working with someone who has used R before, don’t be afraid to ask questions! Take your time with this- you might want to use the entire lab period on this if you have never used R before - it will be time well spent!

Another useful introductory R tutorial can be found here, courtesy of NCEAS. If you haven’t used R before, consider working through this tutorial before going through the lab exercises.

If you already have basic R expertise, you can move ahead!

R practice

Here we will get warmed up for the lab exercises.

Part 1: define a function

Write a function that takes a numeric input vector and computes the sum of all its elements.

Basically, all you need to do here is write ‘sum(x)’ where it says ‘[add code here]’

NOTE: you don’t need to submit this as part of your lab report! This is just practice…

 # demo function (not part of lab)

myfunc <- function(x){
    # [add code here!]
}
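For reference, here is the completed practice function, following the hint above (again, this is practice only and not part of your submitted script):

```r
# completed demo function (practice only- not part of the lab submission)
myfunc <- function(x){
  sum(x)          # add up all elements of the input vector
}

myfunc(c(1, 2, 3))   # returns 6
```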

Part 2: Central Limit Theorem (CLT)

To gain some familiarity with using R scripts and developing algorithms, complete the following exercise (you don’t need to submit this demo as part of your lab write-up, BUT you will be asked to use this code as part of Exercise 1, below).

Complete the following steps:

Review the meaning of the Central Limit Theorem, which states that the sum (or mean) of a sufficiently large number of independent and identically distributed random variables will form another random variable that is asymptotically normally distributed (as the number of elements added together approaches infinity), regardless of the distribution of the underlying data. One of the key implications is that the sample mean (sum of iid random variables divided by the sample size) will be normally distributed with mean equal to the mean of the data distribution and variance equal to the variance of the data distribution divided by the sample size:

\[X \sim \mathrm{any\ distribution}(\mathrm{mean}=\mu,\ \mathrm{var}=\sigma^2)\]

Note that, although the normal distribution is fully determined by its mean and variance, all probability distributions have a mean and variance (although some distributions ‘misbehave’ and have infinite mean and/or variance!).

\[ \bar{X}= \frac{1}{N}\sum_{i=1}^N X_i\] \[\bar{X} \xrightarrow{D} \mathrm{Normal}\left(\mu,\frac{\sigma^2}{N}\right)\]

Type (or paste) the following code into your R script window (or RStudio script window):

# CENTRAL LIMIT THEOREM demonstration ------------------------

N <- 10          # sample size
lots <- 1000       # placeholder representing infinity 

##  Define the random number distribution. ------------------

runif(N,10,20)   # try uniform random number generator with min of 10 and max of 20. 

many_samples <- replicate(lots,runif(N,10,20))  # do it many times!
many_xbars <- colMeans(many_samples)            # compute the mean of each sample

hist(many_samples,freq=F,ylim=c(0,1),main="",xlab="Value")      # plot the distribution of the underlying data from which we are drawing samples
hist(many_xbars,freq=F,add=T,col="red")    # overlay the distribution of the sample means
legend("topleft",fill=c("gray","red"),legend=c("data population","sample means"),bty="n")     # add a legend to the plot

  1. Experiment with executing this code in the following five ways:
    • copy and paste from the script window directly into the R console
    • use Ctrl+Enter to execute line by line from within the script window in RStudio (or on Macs, use Cmd+Return)
    • use Ctrl+A to select the whole code block, then Ctrl+Enter to execute it all at once
    • highlight any block of code and run only that block using Ctrl+Enter
    • save the script to your working directory as a text file with a .R extension, open a new script, and then run the saved script from the new one using the “source()” function, e.g.:
source("CentralLimitTheorem.R")     # run the R script from a different script!

NOTE: The hash (#) used in the above code allows you to insert comments adjacent to snippets of code, which facilitates readability of code (and lets the programmer remember later on what they were thinking when coding things a certain way!). It is good practice to comment every line of code to describe what it does in plain English, including all variables and functions. Make sure you fully understand the commented code for the CLT demonstration above!

  2. Now modify your R script to see how closely the distribution of sample means follows a normal distribution. Use a “quantile-quantile” (q-q) plot to visualize how closely the quantiles of the sampling distribution resemble the quantiles of a normal distribution. Use the “qqnorm()” function. To learn more about this function, type:
?qqnorm   # learn more about the "qqnorm()" function

Plot the q-q plot next to the histograms. The plot on the left should be the comparison of histograms (for population distribution and distribution of sample means) shown in the original script (above). The plot on the right should be the q-q plot. To produce side-by-side plots, you will need to add this line of code to the appropriate place in your script:

# Set graphical parameters for side by side plotting -------------

par(mfrow=c(1,2)) # sets up two side by side plots as one row and two columns

## or alternatively...

layout(matrix(1:2,nrow=1))

In addition, run a Shapiro-Wilk normality test, which tests the null hypothesis that a set of numbers (in this case the vector of sample means) indeed comes from a normal distribution (so what does a low p-value mean??). Use the “shapiro.test()” function:

?shapiro.test

So… what can you conclude from these tests??
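If you’d like a concrete starting point, here is a minimal sketch of these normality checks (variable names follow the demo above; this is practice code, not part of your submission):

```r
# Sketch: visual and formal normality checks for the CLT demo
set.seed(1)                                   # for reproducibility
N <- 10                                       # sample size
lots <- 1000                                  # number of replicate samples
many_xbars <- colMeans(replicate(lots, runif(N, 10, 20)))  # vector of sample means

par(mfrow = c(1, 2))                          # two plots side by side
hist(many_xbars, freq = FALSE, main = "Sample means", xlab = "Value")
qqnorm(many_xbars)                            # sample quantiles vs. normal quantiles
qqline(many_xbars)                            # reference line: points near it suggest normality

shapiro.test(many_xbars)                      # H0: the sample means come from a normal distribution
```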

Exercise 1: home-made functions!

Finally we have arrived at the ACTUAL lab exercises. Remember you don’t need to submit any of the work up to this point.

For the first part of this lab, you are asked to write several functions (and submit them as part of your lab script, of course!). The functions increase quickly in complexity!

You should now know how to construct functions in R. If you don’t, go back to the NCEAS tutorial and review the section on writing functions.

Exercise 1a

Write an R function called “CoefVar()” that takes a numeric vector (named ‘x’ within the function but could have any name outside the function) as input, and computes (and returns) its coefficient of variation (CV; standard deviation as a proportion of the mean). To make sure it works, apply your function to the ‘Height’ vector in the ‘trees’ dataset that installs with R as sample data:

You can use the following code to test your function using the built-in ‘trees’ dataset.

# Testing your code ------------------------ 
#   Explore the "trees" dataset 

#?trees

summary(trees)    # learn more about the data
##      Girth           Height       Volume     
##  Min.   : 8.30   Min.   :63   Min.   :10.20  
##  1st Qu.:11.05   1st Qu.:72   1st Qu.:19.40  
##  Median :12.90   Median :76   Median :24.20  
##  Mean   :13.25   Mean   :76   Mean   :30.17  
##  3rd Qu.:15.25   3rd Qu.:80   3rd Qu.:37.30  
##  Max.   :20.60   Max.   :87   Max.   :77.00
trees$Height    # extract the "Height" column from the trees dataset.
##  [1] 70 65 63 72 81 83 66 75 80 75 79 76 76 69 75 74 85 86 71 64 78 80 74 72 77
## [26] 81 82 80 80 80 87
CoefVar(trees$Height)    # run your new function!
## [1] 0.08383964

Exercise 1b

Write a function called “DrawLine()” for drawing a regression line through a scatter plot. This function should be specified as follows:

  • input:
    • x = a numeric vector specifying the x-coordinates of the scatter plot
    • y = a numeric vector specifying the y-coordinates of the scatter plot
  • suggested algorithm (base R):
    • with the x and y coordinates, first produce a scatterplot (HINT: use the “plot()” function)
    • use the “lm()” function to regress the y variable on the x variable.
    • record the intercept and slope of the linear relationship between x and y (HINT: use the “coef()” function)
    • add a regression line to the scatter plot (HINT: use the “abline()” function)
  • suggested algorithm (ggplot):
    • use the “lm()” function to regress the y variable on the x variable, and store the coefs
    • make a data frame containing the x and y coordinates
    • use ggplot to produce a scatterplot (HINT: use geom_point)
    • add a regression line to the scatter plot (HINT: use the “geom_smooth” function with method “lm”)
    • NOTE: you need to use the ‘print’ function to make sure your plot shows up when you run the function (e.g., ‘print(myplot)’)
  • return:
    • coefs = a vector of length 2, storing the intercept and slope of the linear relationship (in that order)

As a test, apply this function to the ‘Height’ (x axis) and ‘Volume’ (y axis) vectors in the ‘trees’ dataset, and then to the ‘waiting’ (x axis) and ‘eruptions’ (y axis) vectors in the ‘faithful’ dataset.

#DrawLine(trees$Height,trees$Volume)

# ?faithful
# summary(faithful)

## ggplot version:
DrawLine(faithful$waiting,faithful$eruptions)    # test your function using the old faithful eruptions data
## `geom_smooth()` using formula = 'y ~ x'

## (Intercept)           x 
## -1.87401599  0.07562795
## base R version:
DrawLine_base(faithful$waiting,faithful$eruptions)

## (Intercept)           x 
## -1.87401599  0.07562795

Exercise 1c (for R script)

Write a function called “DrawLine2()” for drawing a “smoothed” regression line through a scatter plot, making the smoothing span (degree of smoothness, or non-wiggliness of the line) a user-defined option. This function should be specified as follows:

  • input:
    • x = a numeric vector specifying the x-coordinates of the scatter plot
    • y = a numeric vector specifying the y-coordinates of the scatter plot
    • smooth = a logical (TRUE/FALSE) value defining whether to add a smoothed line (TRUE) or a linear regression line (FALSE)
    • span = a number indicating the degree of smoothness, or “un-wiggliness” of the smoothed line (only applies if smooth=TRUE)
  • suggested algorithm (base R):
    • with the x and y coordinates, first produce a scatterplot (HINT: use the “plot()” function)
    • if smooth is FALSE, then proceed to draw a straight line as before (and return the coefficients)
    • if smooth is TRUE, plot a smoothed, locally-weighted regression of the y variable on the x variable (e.g., using the “scatter.smooth()” function). Make sure you use the “span” argument!
    • if smooth is TRUE, use the “loess()” function to record the same smoothed, locally-weighted regression of the y variable on the x variable. Again, make sure you use the “span” argument!
  • suggested algorithm (ggplot):
    • if smooth=TRUE, use the “loess()” function to regress the y variable on the x variable, and store the coefs. Otherwise store the ‘lm’ coefficients as before.
    • make a data frame containing the x and y coordinates
    • use ggplot to produce a scatterplot (HINT: use ‘geom_point’)
    • add a regression line to the scatter plot (HINT: use the “geom_smooth” function with method “lm”). If smooth is TRUE, use the ‘loess’ method with specified span, otherwise use ‘lm’.
    • NOTE: you need to use the ‘print’ function to make sure your plot shows up when you run the function (e.g., ‘print(myplot)’)
  • return:
    • out = the loess model (the output produced by running “loess()”) if smooth=TRUE, or the slope and intercept from the linear regression if smooth=FALSE

Try testing your function using the trees and faithful datasets!

xvec <- c(1:10)
yvec <- rnorm(length(xvec),c(2:6,7:3),2)
DrawLine2(xvec,yvec,smooth=T,span=0.5)    # run your new function!
## `geom_smooth()` using formula = 'y ~ x'
## Warning in stats::qt(level/2 + 0.5, pred$df): NaNs produced
## Warning in max(ids, na.rm = TRUE): no non-missing arguments to max; returning
## -Inf

## Call:
## loess(formula = y ~ x, span = span)
## 
## Number of Observations: 10 
## Equivalent Number of Parameters: 8.49 
## Residual Standard Error: 5.028
DrawLine2(x=trees$Height,y=trees$Volume,smooth=F)
## `geom_smooth()` using formula = 'y ~ x'

## (Intercept)           x 
##   -87.12361     1.54335
DrawLine2(faithful$waiting,faithful$eruptions,smooth=T,span=.5)    # test using the old faithful eruptions data
## `geom_smooth()` using formula = 'y ~ x'

## Call:
## loess(formula = y ~ x, span = span)
## 
## Number of Observations: 272 
## Equivalent Number of Parameters: 6.71 
## Residual Standard Error: 0.3731
DrawLine2(faithful$waiting,faithful$eruptions,smooth=T,span=1.9)
## `geom_smooth()` using formula = 'y ~ x'

## Call:
## loess(formula = y ~ x, span = span)
## 
## Number of Observations: 272 
## Equivalent Number of Parameters: 3.08 
## Residual Standard Error: 0.4639

Exercise 1d (for R script)

Write a function called “CLTdemo()” based on the central limit theorem (CLT) demonstration code above. This function should be specified as follows:

  • input:
    • N = sample size (size of random sample) (default = 10)
    • min = lower bound of the uniform distribution to draw from (default=0)
    • max = upper bound of the uniform distribution (default=1)
    • visualize = logical indicator of whether or not to produce visual tests for normality
  • suggested algorithm:
    • see CLT demonstration above! Note that you shouldn’t generate more than 5000 sample means- this is because the Shapiro-Wilk test only works for vectors of length 5000 or less.
    • if visualize=TRUE, generate side-by-side plots of the histogram of sample means (left) and a quantile-quantile plot for visual tests of normality.
  • return:
    • out = the p-value from a Shapiro-Wilk normality test (extracted from the output of running “shapiro.test()” on the sample means)

And you can test your code with the following command:

CLTdemo(3,10,20)    # run your new function!

## [1] 0.006422794
 ## you might want to replicate this many times...

replicate(5,CLTdemo(3,10,20,visualize=F))   # for example, replicate 5 times
## [1] 0.02666252 0.22219402 0.14816431 0.04354325 0.01335300

Exercise 1e (answer in WebCampus!)

Finally, test the CLT function out for different parameter combinations to make sure it works! See if you can use this function to develop a reasonable rule for how large a sample size is necessary to ensure that the sample mean statistic (\(\bar{X}\)) is normally distributed given that the underlying data population is a uniform distribution.

Please submit your answer in WebCampus- and please justify your answer! [note: I am just looking for a thoughtful response, not a definitive mathematical proof! Just a few sentences is fine.]

Aside: default values in functions

NOTE: to set default values, just use the equals sign when defining your function. For example, say I wanted to write a function that adds the numbers in vector input ‘x’ OR adds a scalar input ‘y’ to each element of vector ‘x’. With only 1 input (‘x’) the function will sum up the elements in ‘x’. If the user specifies two arguments (x and y) then the function will add y to each element of x. It might look something like this:

newsum <- function(x,y=NULL){
  if(is.null(y)) sum(x) else x+y
}

newsum(5:10)   # use default value of y
## [1] 45
newsum(5:10, 2)    
## [1]  7  8  9 10 11 12

Try setting some alternative default values and re-running the function with and without arguments until you are sure you understand how default values work!

Exercise 2: regression in R (written responses in WebCampus)

Before we start with the exercise, here is some background:

  1. Type the following for a list of sample datasets that come with the core R package (some of these you have already encountered).
library(help = "datasets")    # list of sample datasets that come with R
?airquality    # this is the dataset we'll work with in this exercise
  2. Examine the ‘airquality’ dataset (use the ‘head’ and ‘summary’ functions). Note that there are missing values where ozone concentration data and solar radiation data were not collected.

  3. We could ignore the missing values and just go ahead with our regression analysis, since the default behavior of the “lm()” (‘linear model’) function is to omit cases with missing values in any of the specified variables. However, to avoid problems later, we will omit them explicitly by constructing a new, ‘cleaned’ dataset as follows:

air_cleaned <- na.omit(airquality)    # remove rows with missing data
  4. Conduct a multiple linear regression of ozone concentration as a function of solar radiation, wind and temperature. Use the ‘lm()’ function to conduct an ordinary least squares (OLS) regression analysis.

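One way to carry out this step is sketched below; note that the object name ‘model1’ is an assumption on my part, chosen to match the diagnostic code in the next step:

```r
# Sketch: ordinary least squares regression of ozone on three predictors.
# 'model1' is the name assumed by the diagnostic code in the next step.
air_cleaned <- na.omit(airquality)   # remove rows with missing data (as above)

model1 <- lm(Ozone ~ Solar.R + Wind + Temp, data = air_cleaned)
summary(model1)                      # coefficients, standard errors, R-squared
```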
  5. Explore the regression outputs using the ‘summary’ function, and explore regression diagnostics:

par(mfrow=c(3,2))
plot(model1, which=c(1:4))   # diagnostic plots (NOTE: the 'plot()' function returns these plots by default when the input is a linear regression model)


hist(residuals(model1), breaks=10)   # histogram of residuals
plot(predict(model1) ~ air_cleaned$Ozone)     # plot predicted vs observed- should follow 1:1 line. Examine this for model biases.
abline(0,1)

NOTE: see this website for more information on the diagnostic plots produced by lm().

  6. Consider the possibility that there may be an important interaction effect between solar radiation and temperature in influencing ozone concentrations. Explore this graphically, e.g., with a conditioning plot, or a scatterplot where symbol size is scaled to ozone concentration:
coplot(air_cleaned$Ozone~air_cleaned$Temp|air_cleaned$Solar.R,rows=1)  # the "|" operator can be read "conditional on"

       # alternatively, you can use ggplot
library(ggplot2)
ggplot(air_cleaned,aes(Temp,Solar.R)) +
  geom_point(aes(size=Ozone))

  7. Now fit a second model that includes the interaction between solar radiation and temperature. Use the following formula to fit the interaction:
formula2 <- "Ozone ~ Wind + Solar.R * Temp"   # you can name formulas...
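The second model can then be fit with ‘lm()’ using the named formula (a sketch; ‘model2’ is my assumed name, chosen to match the ‘anova()’ comparison below):

```r
# Sketch: fit the interaction model using the named formula.
# lm() coerces the character string to a formula object.
air_cleaned <- na.omit(airquality)               # cleaned data (as above)
formula2 <- "Ozone ~ Wind + Solar.R * Temp"      # includes the Solar.R-by-Temp interaction

model2 <- lm(formula2, data = air_cleaned)
summary(model2)                                  # five coefficients, incl. Solar.R:Temp
```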
  8. Explore regression diagnostics and coefficients for the second model in the same way as you did for the first model without the interaction term.

  9. Conduct a formal test of whether the more complex model (including the interaction term) fits the data significantly better than the reduced model (with fewer parameters) that lacks the interaction term. Recall that the \(R^2\) value is inadequate for this purpose because \(R^2\) will always increase with additional parameters! Use the following syntax:

anova(model1, model2, test="Chisq")   

Very briefly (but in complete sentences) answer the following questions in WebCampus:

Exercise 2a

By how much (and in what direction) is ozone concentration expected to change if temperature increased from 26 to 30 degrees Celsius? Assume that solar radiation stays constant at 200 lang and wind speed stays constant at 9 mph. Be careful with the units for temperature! Make sure to use the ‘interaction’ model (model2) from part 7 above to answer this question. Please briefly explain (in WebCampus) how you got your answer.

Exercise 2b

What is the null hypothesis that the p-values for the individual regression coefficients are designed to test?

Exercise 2c

Based on your diagnostic plots, do you think this is an appropriate model for these data? Explain your reasoning.

Exercise 2d

Which of the predictor variables you included in your model is the most important in predicting ozone concentration? Explain your reasoning.

For reference, here is some example output for this exercise: an analysis-of-variance table for the interaction model, single-term deletion tests, and a summary of the interaction model fit with scaled (standardized) predictors.

## Analysis of Variance Table
## 
## Response: Ozone
##               Df Sum Sq Mean Sq F value    Pr(>F)    
## Wind           1  45694   45694 112.708 < 2.2e-16 ***
## Solar.R        1   9055    9055  22.335 7.072e-06 ***
## Temp           1  19050   19050  46.988 4.901e-10 ***
## Solar.R:Temp   1   5028    5028  12.402 0.0006337 ***
## Residuals    106  42975     405                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Single term deletions
## 
## Model:
## Ozone ~ Wind + Solar.R * Temp
##              Df Sum of Sq   RSS    AIC  Pr(>Chi)    
## <none>                    42975 671.43              
## Wind          1   11679.6 54654 696.12 2.393e-07 ***
## Solar.R:Temp  1    5028.2 48003 681.71 0.0004573 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Call:
## lm(formula = Ozone ~ scale(Wind) + scale(Solar.R) * scale(Temp), 
##     data = air_cleaned)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -39.634 -12.505  -2.300   8.956  94.162 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  39.996      2.002  19.976  < 2e-16 ***
## scale(Wind)                 -11.879      2.213  -5.367 4.74e-07 ***
## scale(Solar.R)                9.006      2.248   4.006 0.000115 ***
## scale(Temp)                  15.844      2.297   6.898 3.98e-10 ***
## scale(Solar.R):scale(Temp)    7.216      2.049   3.522 0.000634 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20.14 on 106 degrees of freedom
## Multiple R-squared:  0.6472, Adjusted R-squared:  0.6339 
## F-statistic: 48.61 on 4 and 106 DF,  p-value: < 2.2e-16

Exercise 3: Algorithmic (brute force) z-test

Review the “brute-force z-test” code from the “Why focus on algorithms” lecture. Then complete the following exercises:

## $Xbar
## [1] 3.476
## 
## $nulldist
##    [1] 3.669712 3.254229 3.652084 3.564189 3.815889 3.322832 3.544433 3.780472
##   ... (remaining values of the simulated null distribution omitted)
##  [769] 3.853108 3.739052 3.838245 3.391717 3.353016 3.671532 3.488584 3.859941
##  [777] 3.840269 3.893935 3.455139 3.680344 3.734856 4.023911 3.370704 3.769769
##  [785] 4.030189 3.217629 3.499357 3.515342 3.662378 3.858520 3.466157 3.567453
##  [793] 4.364967 3.590460 3.796150 3.699831 3.749671 4.012507 3.780891 3.922269
##  [801] 4.015283 4.069824 4.284173 3.768712 3.654887 3.283878 3.797197 3.707830
##  [809] 3.680316 3.349784 4.025706 3.647882 3.771443 3.681148 3.503823 3.982204
##  [817] 4.250547 3.831860 3.352016 4.241141 3.760202 4.073392 4.079094 3.719913
##  [825] 3.871260 3.529131 4.221389 3.941065 3.949339 3.605163 3.730293 3.885367
##  [833] 3.957696 3.668533 4.087480 4.022460 3.637710 3.835459 3.511295 3.832677
##  [841] 3.715321 3.106249 3.789469 3.880412 4.268301 3.158838 3.782177 4.034974
##  [849] 3.318012 4.624176 3.737130 3.561247 3.856091 3.735064 3.736720 3.633212
##  [857] 3.416923 3.636491 3.708283 3.831880 3.331963 3.581716 3.464599 3.714818
##  [865] 3.854737 3.317567 3.636175 4.193014 3.631022 3.560107 3.769544 3.759770
##  [873] 3.779539 3.898179 3.819345 4.135210 4.030307 3.774007 3.372521 3.971223
##  [881] 4.117598 3.635927 4.223461 3.954491 3.898440 3.828967 4.448175 4.239947
##  [889] 3.721966 3.962725 3.965589 3.786715 3.912367 3.760352 4.381862 3.681194
##  [897] 4.082125 4.077398 3.901845 3.338881 3.971142 4.242881 3.817311 3.875704
##  [905] 3.663259 3.344638 3.877076 3.406985 3.623787 3.943737 4.089110 3.474239
##  [913] 3.449526 3.180511 3.623781 3.569456 3.542664 3.765224 3.861564 3.374576
##  [921] 3.669876 3.423626 3.731189 4.198433 3.599247 3.534196 3.855143 3.447806
##  [929] 3.598925 3.904091 3.527479 4.104526 4.176183 3.533536 4.145619 3.905576
##  [937] 3.417029 4.477369 4.213680 3.944303 3.614795 3.433314 4.175776 4.423641
##  [945] 3.776315 3.799216 3.741634 4.178415 3.556338 3.884394 4.059725 3.750199
##  [953] 3.586445 4.099992 3.528260 4.249757 3.630984 4.016958 4.240332 4.053856
##  [961] 3.398993 3.638947 3.832877 3.982203 3.835949 3.363868 4.080127 3.540767
##  [969] 3.186824 3.499104 3.748342 4.280133 4.160958 3.657116 3.957788 4.334231
##  [977] 3.900107 3.969395 3.879396 3.896647 3.926107 3.601455 3.508952 3.808556
##  [985] 3.581609 3.648261 3.695890 3.651412 3.663927 4.013916 4.253409 3.676702
##  [993] 3.807242 3.843213 3.835338 3.330389 3.754193 3.850299 3.913826 3.471838
## 
## $p_value
## [1] 0.145
## [1] 0.1202629

Exercise 3a

What if we wanted to run a two-tailed z-test? That is, what if our alternative hypothesis were that salmon fed on the new vegetarian diet could plausibly be larger or smaller than those fed on the conventional diet after one year?

Modify the function (“ztest_bruteforce()”) by adding a new argument that allows for both one- and two-tailed tests! Name your new function “my_ztest()”.

To convince yourself that your new function works, try running it on a (made-up) case where the observed body masses for salmon fed on the new diet are generally higher than the expected body mass for those fed on the conventional diet after one year – the opposite of your hypothesis!

NOTE: you may get tangled up with the null hypothesis/p-value concept, which is admittedly a difficult one! A p-value always assumes the null hypothesis is true (what is the probability that a random sample under the null hypothesis could give as much evidence for your alternative hypothesis as the data you observed?).

  • In the case of the salmon example, our original alternative hypothesis (one-tailed: lower) is that the treatment mean (veg. diet) is less than the null mean (conventional diet).
  • We could have specified a different one-tailed hypothesis: that the treatment mean should be greater than the null mean (one-tailed: higher).
  • Under a two-tailed hypothesis, the direction of the difference doesn’t matter – we just think that the treatment mean should be different from the null mean.

Include your function in your submitted R script!

This function should be specified as follows:

  • input:
    • d = a vector of observed sample values (data)
    • mu = a scalar value representing the expected mean value under the null hypothesis
    • sigma = a scalar value representing the population standard deviation under the null hypothesis
    • alternative = a character string with one of three possible values: ‘l’ to indicate a one-tailed “less than” alternative hypothesis, ‘g’ to indicate a one-tailed “greater than” alternative, and ‘b’ to indicate a two-tailed test (the default should be ‘b’)
  • suggested algorithm:
    • See the lecture notes for the example brute-force z-test function (“ztest_bruteforce()”).
    • To implement the two-tailed test, you need to define what ‘extreme’ means in both directions. You might first define the absolute difference between the population mean (mu) and the sample mean. Then define ‘more extreme’ as any simulated sample mean that falls further away from the population mean in either direction (larger or smaller than mu) than the observed sample mean.
    • Use an “if-else” or “case_when” statement to accommodate all three possible alternative hypotheses.
  • return:
    • to_return = a named list with three elements: “x_bar”, representing the sample mean (scalar); “nulldist”, representing many samples from the sampling distribution for x_bar under the null hypothesis (vector); and “p_value”, representing the p-value (scalar)
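If you get stuck, here is one possible shape for the function – a minimal sketch, not the required solution (your simulation approach and coding style may differ; the ‘reps’ argument for the number of simulated samples is an extra convenience not required by the spec):

```r
# Minimal sketch (one of many valid solutions). Simulates the sampling
# distribution of the mean under the null hypothesis, then computes the
# p-value according to the requested alternative hypothesis.
my_ztest <- function(d, mu, sigma, alternative = 'b', reps = 10000){
  n <- length(d)
  x_bar <- mean(d)                                   # observed sample mean
  # simulate many sample means under the null hypothesis
  nulldist <- replicate(reps, mean(rnorm(n, mean = mu, sd = sigma)))
  p_value <- switch(alternative,
    l = mean(nulldist <= x_bar),                     # one-tailed: less than
    g = mean(nulldist >= x_bar),                     # one-tailed: greater than
    b = mean(abs(nulldist - mu) >= abs(x_bar - mu))  # two-tailed: more extreme in either direction
  )
  to_return <- list(x_bar = x_bar, nulldist = nulldist, p_value = p_value)
  to_return
}
```

Note how the two-tailed case defines “more extreme” as any simulated sample mean falling further from mu (in either direction) than the observed sample mean did.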

Test your function using some alternative sample data values. For example:

mu = 4     # testing code
sigma = 2
mysample = c(3.14,3.27,2.56,3.77,3.34,4.32,3.84,2.19,5.24,3.09)
test = my_ztest(mysample, mu, sigma,alternative = 'l' )
str(test)
## List of 3
##  $ Xbar    : num 3.48
##  $ nulldist: num [1:10000] 4.84 3.72 5.25 3.43 4.23 ...
##  $ p_value : num 0.207
test$p_value
## [1] 0.2072

Also try using the code from lecture to compare against a “real” z-test – for example:

mu = 4     # testing code
sigma = 2
mysample = c(3.14,3.27,2.56,3.77,3.34,4.32,3.84,2.19,5.24,3.09)

samplemean = mean(mysample)    # real z-test
stderr <- sigma/sqrt(length(mysample))  # population standard error
zstat = (samplemean-mu)/stderr
pnorm(zstat)
## [1] 0.203689
test = my_ztest(mysample, mu, sigma,alternative = 'l' )   # compare your function with the real answer
test$p_value
## [1] 0.204
library(BSDA)
## Loading required package: lattice
## 
## Attaching package: 'BSDA'
## The following object is masked from 'package:datasets':
## 
##     Orange
z.test(x=mysample,mu=mu, sigma.x=sigma,alternative = "less")    # 'canned' z-test
## 
##  One-sample z-Test
## 
## data:  mysample
## z = -0.82852, p-value = 0.2037
## alternative hypothesis: true mean is less than 4
## 95 percent confidence interval:
##        NA 4.516297
## sample estimates:
## mean of x 
##     3.476

Exercise 3b: non-parametric test of sample mean vs null

Now, what if we had access to a large set of independent samples under the null hypothesis AND we would like to relax the assumption that the data follow a normal distribution under the null hypothesis?

In this exercise, you are asked to modify the brute-force z-test function so that you generate your sampling distribution by simply sampling from the known data set under the null hypothesis. To do this, you can use the ‘sample()’ function in R; for this purpose, you should sample from the null data with replacement. Here is an example:

nulldata=c(2.2,3.86,6.39,4.6,3.43,5.16,4.36,4.22,6.31,4.61,5.13,4.12,4.64,4.03,5.01,7.33,5.35,4.7,2.82,4.87,3.87,5.95,5.28,4.02,3.58,4.03,5.38,5.5,3.07,3.29,3.45,5.25,5.7,1.26,5.28,4.19,4.76,4.2,4.81,2.5)
sample(nulldata,10,replace=T)    # use the sample function to sample from a large vector with replacement

For this exercise, we will return to the original (one-tailed) alternative hypothesis: that salmon raised on the new diet will be smaller (lower mass) than those raised on the conventional diet after one year.

Include your function in your submitted R script!

This function should be named ‘x_vs_null()’ and should be specified as follows:

  • input:
    • x = a vector containing the observed sample
    • nulldata = a vector of observations under the null hypothesis
  • suggested algorithm:
    • Instead of sampling from a normal distribution to simulate samples under the null hypothesis, use the ‘sample’ function to sample directly from the null data distribution.
  • return:
    • to_return = a named list with three elements: “xbar”, representing the sample mean, “nulldist”, representing the sampling distribution for sample means under the null hypothesis, and “p_value”, representing the p-value of the test
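As a hedged sketch (one possible solution, assuming the same one-tailed “less than” logic as the original brute-force z-test; the ‘reps’ argument is an added convenience):

```r
# Minimal sketch: the null sampling distribution is built by resampling the
# null data with replacement, rather than drawing from a normal distribution.
x_vs_null <- function(x, nulldata, reps = 10000){
  n <- length(x)
  xbar <- mean(x)                                    # observed sample mean
  # resample the null data (with replacement) to simulate sample means
  nulldist <- replicate(reps, mean(sample(nulldata, n, replace = TRUE)))
  p_value <- mean(nulldist <= xbar)                  # one-tailed: less than
  to_return <- list(xbar = xbar, nulldist = nulldist, p_value = p_value)
  to_return
}
```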

Test your function using some alternative sample data vectors. For example:

nulldata=c(2.2,3.86,6.39,4.6,3.43,5.16,4.36,4.22,6.31,4.61,5.13,4.12,4.64,4.03,5.01,7.33,5.35,4.7,2.82,4.87,3.87,5.95,5.28,4.02,3.58,4.03,5.38,5.5,3.07,3.29,3.45,5.25,5.7,1.26,5.28,4.19,4.76,4.2,4.81,2.5)
mysample = c(3.14,3.27,2.56,3.77,3.34,4.32,3.84,2.19,5.24,3.09)
test = x_vs_null(mysample, nulldata )
test$p_value
## [1] 0.008

Exercise 4: Bootstrapping

Review the bootstrapping code from the first lecture topic (to generate confidence intervals for arbitrary test statistics). Then complete the following exercise:

Exercise 4a

Generate a new R function, called “RegressionCoefs()”, that takes a data frame as the first input parameter and the name of the response variable as the second input parameter, and returns the regression coefficients (β) produced by regressing the response variable on each predictor variable one at a time (i.e., a vector of slope coefficients, one per predictor). You can use the “Rsquared()” function (from the lecture) as a reference!

More specifically, this function should be specified as follows:

  • input:
    • df = a data frame containing the response variable and one or more predictor variables. All columns that are not the response variable are assumed to be predictor variables
    • responsevar = the name of the column that represents the response variable.
  • suggested algorithm:
    • Follow the R squared example from lecture. Store the regression coefficients instead of the R-squared values.
  • return:
    • coefs = a vector of the regression coefficients (slope terms) for all predictor variables. This should be a vector with one element (slope term) for each predictor variable. The number of elements in your vector should be one fewer than the number of columns in your dataset. If possible, try to make this a named vector- with the name of each predictor variable associated with its respective coefficient.
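One possible sketch (a loop over the predictor columns; ‘reformulate()’ is a base-R helper that builds a formula like Volume ~ Girth from character strings):

```r
# Minimal sketch: regress the response on each predictor, one at a time,
# and collect the slope coefficients in a named vector.
RegressionCoefs <- function(df, responsevar){
  predictors <- setdiff(names(df), responsevar)   # all non-response columns
  coefs <- numeric(length(predictors))
  names(coefs) <- predictors                      # named vector, per the spec
  for(p in predictors){
    fit <- lm(reformulate(p, response = responsevar), data = df)
    coefs[p] <- coef(fit)[[2]]                    # slope term (2nd coefficient)
  }
  coefs
}
```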

Include your function in your submitted R script!

You can test your function using the following code:

RegressionCoefs(trees,"Volume")   # should return two regression coefficients
##    Girth   Height 
## 5.065856 1.543350

Exercise 4b:

Generate a new R function, called “BootCoefs()” that meets the following specifications:

  • inputs:
    • “df” = a data frame that includes the response variable and all possible predictor variables
    • “statfunc” = a function for generating summary statistics (in this case, regression coefficients) from a data frame (which you already developed in part 1 of this challenge problem)
    • “n_samples” = the number of bootstrapped samples to generate
    • “responsevar” = the name of the response variable
  • algorithm:
    • With the data frame, use the “boot_sample()” function provided in the lecture to generate summary statistics for multiple bootstrap samples. You do not need to modify the ‘boot_sample()’ function for this problem.
    • Then, generate confidence intervals for each predictor variable as the 2.5%, 50%, and 97.5% quantiles of its summary statistic.
  • return: a matrix of the 2.5%, 50%, and 97.5% quantiles of the regression coefficient for each predictor variable (as in the example output below: rows = quantiles, cols = predictor variables). Try to give the rows and columns proper labels!
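One possible sketch. Since the lecture’s ‘boot_sample()’ is not reproduced here, this version assumes it returns a single bootstrap resample of the rows of a data frame and simply resamples the rows directly with ‘sample()’; in your own version, substitute the boot_sample() call from lecture:

```r
# Minimal sketch. Resamples rows with replacement in place of the lecture's
# boot_sample() helper, computes the summary statistic for each resample,
# then summarizes each predictor's statistic with three quantiles.
BootCoefs <- function(df, statfunc, n_samples = 1000, responsevar){
  boot_stats <- replicate(n_samples, {
    resample <- df[sample(nrow(df), replace = TRUE), ]  # one bootstrap sample
    statfunc(resample, responsevar)
  })               # matrix: rows = predictors, cols = bootstrap replicates
  # quantiles across replicates; result: rows = quantiles, cols = predictors
  apply(boot_stats, 1, quantile, probs = c(0.025, 0.5, 0.975))
}
```

For example, BootCoefs(df=trees, statfunc=RegressionCoefs, responsevar="Volume") should reproduce (up to Monte Carlo error) the output shown below.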

Include your function in your submitted R script!

You can test your function using the following code:

BootCoefs(df=trees,statfunc=RegressionCoefs,responsevar="Volume")
##          Girth    Height
## 2.5%  4.362612 0.7690758
## 50%   5.027451 1.5358234
## 97.5% 5.632498 2.3628580
df <- mtcars[,c(1,3,4,6)]
responsevar="mpg"
BootCoefs(df,RegressionCoefs,responsevar)
##              disp          hp        wt
## 2.5%  -0.05132901 -0.10073301 -7.100329
## 50%   -0.04119646 -0.06940640 -5.377683
## 97.5% -0.03146126 -0.04591232 -4.166714

Exercise 4c:

To test your new function(s), generate bootstrapped confidence intervals around the regression parameters for your tree biomass regression. Compare these bootstrapped confidence intervals with the standard confidence intervals given by R in the “lm()” function. Are there any important differences? Explain your answer in WebCampus. As always, please use your WebCampus submission to raise any points of confusion or ask me questions!

##          Girth    Height
## 2.5%  4.343355 0.7827071
## 50%   5.054131 1.5375925
## 97.5% 5.665798 2.4448150
##           Girth   Height
## 2.5 %  4.559914 0.758249
## 97.5 % 5.571799 2.328451
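For reference, the standard (parametric) confidence intervals in the second table above can be pulled from the single-predictor ‘lm()’ fits using base R’s ‘confint()’; a minimal sketch:

```r
# Standard 95% confidence intervals for the single-predictor slopes
confint(lm(Volume ~ Girth,  data = trees))["Girth", ]
confint(lm(Volume ~ Height, data = trees))["Height", ]
```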

End of Lab 1!