In this course we will rely heavily on the R programming language for statistical computing and graphics (e.g., this website is built in R using ‘rmarkdown’ and ‘shiny’). The purpose of this first laboratory exercise is to develop the level of familiarity with R needed to succeed in this course – and ultimately, to establish a foundation for developing your data analysis skills in R throughout your scientific career.
Specifically, this lab provides a basic introduction to the R programming language and the use of R to perform statistical and programming tasks.
As with all lab reports, your answers will either take the form of R functions or short written responses. The R functions should be stored in an R script file (‘.R’ extension). To allow me to evaluate your work more efficiently, please name your R script using the following convention: "[your first name]_[your last name]_lab1.R". So my submission would be "kevin_shoemaker_lab1.R". Please include only the requested functions in your submitted R script. You will probably complete the lab exercises using a main script (that includes all your work), and then save a reduced copy (with only the requested functions) for submission.
Before submitting your code, please clear your environment and run your script to make sure that the script only defines functions within your global environment. Your script should generate no additional objects (e.g., data frames).
Please submit the R script and the Word document via WebCampus by midnight on the due date (one week after the final lab session allocated for this topic). You can work in groups but submit the materials individually. One student in each group should be an ‘R guru’ if at all possible!
Please provide any requested written answers as a Word document in WebCampus, and keep your responses as brief as possible! Also, I will not be assigning your scores based on spelling, grammar, or the propriety of your sentence construction.
This lab exercise will extend over two laboratory periods (Aug. 24 and 31), with your R script and Word document due Sept. 7.
From the R manual ‘Introduction to R’, you will implement all the steps in Appendix A, located here. This takes you far but without much explanation – it is a way to jump into the deep end of the pool. Alternatively, or in addition, go to the Links page and pick a tutorial to run through (e.g., the Datacamp introductory R course). You can also check out the materials from UNR’s R “bootcamp”. Before you move on, make sure you have a basic understanding of how to work with R and RStudio. If you are working with someone who has used R before, don’t be afraid to ask questions! Also, take your time with this – you might want to use the entire lab period on this if you have never used R before.
Another useful introductory R tutorial can be found here, courtesy of NCEAS. Please take the time to complete this tutorial, including going through the exercises.
If you already have basic R expertise, this is your opportunity to help your peers to develop the level of comfort and familiarity with R that they will need to perform data analysis and programming tasks in this course.
Depending on whether you are already familiar with R, you may also find the remainder of this document useful as you work your way through the course (and there are many other good introductory R resources available online… let me know if there is one you particularly like and I will add it to the course website (Links page)). As you work your way through the tutorial(s) at your own pace, please ask the instructor or your peers if you are uncertain about anything.
Since there is no TA in this class, I have tried wherever possible to embed interactive text windows (which serve as a ‘virtual TA’ and were created using ‘R shiny’) in the course website where you can test your code and get instant feedback as to whether your answers are correct! Here’s an example:
Write a function that takes a numeric input vector and computes the sum of all its elements.
Basically, all you need to do here is write ‘sum(vector)’ where it says ‘[add code here]’
NOTE: you don’t need to submit this as part of your lab report! This is just practice…
myfunc <- function(vector){
# [add code here!]
}
Hint: You may want to use the sum() function.
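For reference, a completed version of this practice function (following the note above) just wraps the built-in sum() function:

myfunc <- function(vector){
  sum(vector)        # return the sum of all elements of the input vector
}

myfunc(c(1,2,3,4))   # test: should return 10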
To gain some familiarity with using R scripts and developing algorithms, complete the following exercise (you don’t need to submit this demo as part of your lab write-up, BUT you will be asked to use this code as part of exercise 1, below).
Remember that you may want to have two scripts for this and other labs. The first script contains all the code you built to learn new concepts, develop and test code for answering the questions, etc. The second script is the one you will submit, and only contains the functions you defined as part of each lab. I recommend putting the second script (for submission) together after you finish the lab (just before submitting).
Review the meaning of the Central Limit Theorem, which implies that the mean of a sufficiently large number of independent random values (a sample drawn from a much larger population of interest) will have a sampling distribution that is approximately normal, regardless of whether the underlying data distribution is normally distributed.
Type (or paste) the following code into your R script window (or RStudio script window):
####################
# CENTRAL LIMIT THEOREM demonstration
####################
infinity <- 100000 # number approximating infinity
N_SAMPLES <- 1000 # number of independent random samples to draw from the specified distribution
SAMPLESIZE <- 10 # sample size of each independent random sample
########
# Define the random number distribution. Here, we will use a "uniform" random number generator with a min of 10 and max of 20.
TRUEMIN <- 10
TRUEMAX <- 20
data_population <- runif(infinity,TRUEMIN,TRUEMAX) # here we define the full set of possible random numbers to draw random samples from (the population of potential data). We assume this is uniformly distributed
#######
# Draw multiple samples from the pool (population) of possible data.
samplemean <- numeric(N_SAMPLES) # set up storage vector
for(i in 1:N_SAMPLES){ # for each replicate (independent random sample)
sample <- sample(data_population,SAMPLESIZE) # draw an independent random sample from the population of interest
samplemean[i] <- mean(sample) # compute and record the sample mean
}
hist(data_population,freq=F,ylim=c(0,1),main="",xlab="Value") # plot the distribution of the underlying data from which we are drawing samples
hist(samplemean,freq=F,add=T,col="red") # overlay the distribution of sample means
legend("topleft",fill=c("white","red"),legend=c("data population","sample means"),bty="n") # add a legend to the plot
source("CentralLimitTheorem.R") # load the R script!
NOTE: The pound (#) sign used above allows the R programmer to insert comments adjacent to snippets of code, which facilitates readability of code (and lets the programmer remember later on what he/she was thinking when coding things a certain way!). It is good practice to comment every line so as to describe its precise meaning, including all variables and functions. Make sure you fully understand the commented code for the central limit theorem demonstration above!
Here is an interactive window you can use to play around with the CLT code (or even better, just open another RStudio instance and explore the code there!):
# [Add code here]
Hint: Try running with different values for the ‘sample.size’ parameter.
?qqnorm # learn more about the "qqnorm()" function
Use the ‘qqnorm()’ function to produce a quantile-quantile (q-q) plot of the sample means, and plot it next to the histograms. The plot on the left should be the comparison of histograms (for the population distribution and the distribution of sample means) shown in the original script (above). The plot on the right should be the q-q plot. To produce side-by-side plots, you will need to add this line of code at the appropriate place in your script:
########
# Set graphical parameters for side by side plotting
par(mfrow=c(1,2)) # sets up two side by side plots as one row and two columns
## or alternatively...
layout(matrix(1:2,nrow=1))
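For example, a minimal sketch of the full side-by-side figure (assuming you have already run the CLT demonstration code above, so that ‘data_population’ and ‘samplemean’ exist in your workspace) might look like this:

par(mfrow=c(1,2))   # one row, two columns of plots

# left panel: the histogram comparison from the original script
hist(data_population,freq=F,ylim=c(0,1),main="",xlab="Value")
hist(samplemean,freq=F,add=T,col="red")
legend("topleft",fill=c("white","red"),legend=c("data population","sample means"),bty="n")

# right panel: q-q plot of the sample means against a theoretical normal distribution
qqnorm(samplemean)
qqline(samplemean)  # reference line: points should fall near this line if the sample means are approximately normal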
In addition, run a Shapiro-Wilk normality test, which tests the null hypothesis that a set of numbers (in this case the vector of sample means) indeed comes from a normal distribution (so what does a low p-value mean??). Use the “shapiro.test()” function:
?shapiro.test
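For example (again assuming the ‘samplemean’ vector from the CLT demonstration is in your workspace):

shapiro.test(samplemean)   # tests the null hypothesis that the sample means come from a normal distribution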
Again, here is an interactive window you can use to play around with the CLT code (or just use another instance of RStudio).
# [Add code here]
Hint: Try running with different values for the ‘sample.size’ parameter.
So… what can you conclude from these tests? Can you conclude that the distribution of sample means is NOT normal/Gaussian??
Finally we have arrived at the ACTUAL lab exercises. Remember you don’t need to submit any of the work up to this point…
For the first part of this lab, you are asked to write several functions (and submit them as part of your lab script, of course!). The functions are listed in order of increasing complexity!
You should now know how to construct functions in R. If you don’t, go back to the NCEAS tutorial and review the section on writing functions.
Write an R function called “CoefVar()” that takes a numeric vector (named ‘vector’ within the function but could have any name outside the function) as input, and computes (and returns) its coefficient of variation (CV; standard deviation as a proportion of the mean). To make sure it works, apply your function to the ‘Height’ vector in the ‘trees’ dataset that installs with R as sample data:
You can use this “sandbox” (below) to develop and test your function! When you submit, make sure you delete all code except for the function itself (comments are okay).
CoefVar <- function(vector){
# [add code here!]
}
Hint: You probably want to use the ‘sd()’ and ‘mean()’ functions.
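If you get stuck, the definition above (standard deviation as a proportion of the mean) translates almost directly into code; a minimal sketch might look like this:

CoefVar <- function(vector){
  sd(vector)/mean(vector)   # coefficient of variation: standard deviation divided by the mean
}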
You can use the following code to test your function using the built-in ‘trees’ dataset.
############
# Explore the "trees" dataset
?trees
summary(trees) # learn more about the data
## Girth Height Volume
## Min. : 8.30 Min. :63 Min. :10.20
## 1st Qu.:11.05 1st Qu.:72 1st Qu.:19.40
## Median :12.90 Median :76 Median :24.20
## Mean :13.25 Mean :76 Mean :30.17
## 3rd Qu.:15.25 3rd Qu.:80 3rd Qu.:37.30
## Max. :20.60 Max. :87 Max. :77.00
trees$Height # extract the "Height" column from the trees dataset.
## [1] 70 65 63 72 81 83 66 75 80 75 79 76 76 69 75 74 85 86 71 64 78 80 74 72 77
## [26] 81 82 80 80 80 87
CoefVar(trees$Height) # run your new function!
## [1] 0.08383964
Write a function called “DrawLine()” for drawing a regression line through a scatter plot. This function should be specified as follows:
You can use this “sandbox” (below) to develop and test your function! Again, remember to comment out everything except for your “DrawLine” function definition prior to submitting (you can test the function using the “run” button).
DrawLine <- function(x,y){
# [add code here!]
}
#DrawLine(trees$Height,trees$Volume)
Hint: You may want to use the coef() function.
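One possible sketch (assuming the function should draw the scatter plot, add the fitted ordinary least squares line, and return the regression coefficients, as in the example output below):

DrawLine <- function(x,y){
  model <- lm(y~x)   # fit an ordinary least squares regression of y on x
  plot(x,y)          # scatter plot of the raw data
  abline(model)      # add the fitted regression line to the plot
  coef(model)        # return the intercept and slope
}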
As a test, apply this function to the ‘Height’ (x axis) and ‘Volume’ (y axis) vectors in the ‘trees’ dataset, and then to the ‘waiting’ (x axis) and ‘eruptions’ (y axis) vectors in the ‘faithful’ dataset.
# ?faithful
# summary(faithful)
DrawLine(faithful$waiting,faithful$eruptions) # test your function using the old faithful eruptions data
## (Intercept) x
## -1.87401599 0.07562795
Write a function called “DrawLine2()” for (optionally) drawing a “smoothed” regression line through a scatter plot, making the smoothing span (degree of smoothness, or non-wiggliness of the line) a user-defined option. This function should be specified as follows:
You can use this “sandbox” (below) to develop and test your function!
xvec <- c(1:10)
yvec <- rnorm(length(xvec),c(2:6,7:3),2) # you can use these vectors for testing the function
DrawLine2 <- function(x,y,smooth=F,span=1){ # note the default values (the ones set using the equals sign)
# [add code here!]
}
# DrawLine2(faithful$waiting,faithful$eruptions,smooth=T,span=.5) # test using the old faithful eruptions data
Hint: You may want to use the scatter.smooth() function to draw the curve and the ‘loess()’ function to build the model.
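One possible sketch, assuming the function should return the fitted loess model when smooth=TRUE and the ordinary regression coefficients when smooth=FALSE (as in the example output below):

DrawLine2 <- function(x,y,smooth=F,span=1){
  if(smooth){
    model <- loess(y~x,span=span)    # fit a loess smoother with the user-specified span
    scatter.smooth(x,y,span=span)    # scatter plot with the smoothed curve overlaid
    model                            # return the fitted loess model
  }else{
    model <- lm(y~x)                 # fit an ordinary least squares regression
    plot(x,y)
    abline(model)                    # add the fitted straight line
    coef(model)                      # return the intercept and slope
  }
}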
Try testing your function using the trees and faithful datasets!
xvec <- c(1:10)
yvec <- rnorm(length(xvec),c(2:6,7:3),2)
DrawLine2(xvec,yvec,smooth=T,span=0.5) # run your new function!
## Call:
## loess(formula = y ~ x, span = span)
##
## Number of Observations: 10
## Equivalent Number of Parameters: 8.49
## Residual Standard Error: 2.423
DrawLine2(x=trees$Height,y=trees$Volume,smooth=F,span=NA)
## (Intercept) x
## -87.12361 1.54335
DrawLine2(faithful$waiting,faithful$eruptions,smooth=T,span=.5) # test using the old faithful eruptions data
## Call:
## loess(formula = y ~ x, span = span)
##
## Number of Observations: 272
## Equivalent Number of Parameters: 6.71
## Residual Standard Error: 0.3731
DrawLine2(faithful$waiting,faithful$eruptions,smooth=T,span=.1)
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = FALSE, :
## pseudoinverse used at 81
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = FALSE, :
## neighborhood radius 1
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = FALSE, :
## reciprocal condition number 0
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = FALSE, :
## There are other near singularities as well. 1
## (the same set of warnings is repeated several times)
## Call:
## loess(formula = y ~ x, span = span)
##
## Number of Observations: 272
## Equivalent Number of Parameters: 29.97
## Residual Standard Error: 0.3762
Write a function called “CLTdemo()” based on the central limit theorem (CLT) demonstration code above. This function should be specified as follows:
You can use this “sandbox” (below) to develop and test your function! As always, remember to comment out everything except for the requested function prior to submitting.
CLTdemo <- function(n.samples=1000,sample.size=10,min=10,max=20){
# [add code here!]
}
# CLTdemo(n.samples=5000,sample.size=4,min=10,max=20)
Hint: Use the CLT Demo code from earlier in Lab 1!
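As one hedged sketch (assuming the function should reproduce the histogram comparison for the user-supplied settings and return the vector of sample means; the exact requirements are given in the specification above), the CLT demonstration code can be wrapped like this:

CLTdemo <- function(n.samples=1000,sample.size=10,min=10,max=20){
  infinity <- 100000                                   # number approximating infinity
  data_population <- runif(infinity,min,max)           # population of potential data (uniform distribution)
  samplemean <- numeric(n.samples)                     # storage vector for the sample means
  for(i in 1:n.samples){
    samplemean[i] <- mean(sample(data_population,sample.size))   # mean of one independent random sample
  }
  hist(data_population,freq=F,ylim=c(0,1),main="",xlab="Value")  # distribution of the underlying data (you may need to adjust ylim for some settings)
  hist(samplemean,freq=F,add=T,col="red")              # overlay the distribution of sample means
  legend("topleft",fill=c("white","red"),legend=c("data population","sample means"),bty="n")
  samplemean                                           # return the vector of sample means
}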
And you can test your code with the following command:
answer1d <- CLTdemo(n.samples=5000,sample.size=4,min=10,max=20) # run your new function!
Finally, test the CLT function out for different parameter combinations to make sure it works! See if you can use this function to develop a reasonable rule for how large a sample size is necessary to ensure that the sample mean is normally distributed given that the underlying data population is a uniform distribution. Please include your answer in your Word document- and please justify your answer! [note: I am just looking for a thoughtful response, not a definitive answer! And just a few sentences is fine.]
You can use the following interactive R window for testing if that is helpful:
# use this space if it's helpful
Optionally, modify your CLTdemo() function (e.g., call it “CLTdemo2()”) to try a different underlying data distribution (other than uniform). If you really want to test the limits of the CLT, try creating your own highly non-standard distribution and seeing if the result you obtained in exercise 1e still holds. For example:
rlocodist <- function(n){
vals <- c(1,7,10,35) # possible data values
probs <- c(1,2,5,0.5) # relative probability of each data value
probs <- probs/sum(probs)
vals[apply(rmultinom(n,1,probs),2,function(t) which(t==1))] # sample from this made-up distribution
}
lots=10000
datafountain <- rlocodist(lots)
hist(datafountain, main="non-standard made-up distribution")
You can use this “sandbox” (below) to develop and test your function! Remember, you do not need to include this function in your submitted R script.
CLTdemo2 <- function(n.samples=1000,sample.size=10,...){
# [add code here!]
}
#CLTdemo2(n.samples=5000,sample.size=4,...)
Hint: Use the CLT Demo code from earlier in Lab 1!
NOTE: to set default values, just use the equals sign when specifying your function. For example, say I wanted to write a function that adds numbers in a vector. It might look something like this:
newsum <- function(x=c(1,2,4)){
sm <- sum(x)
return(sm)
}
newsum(x=c(5:10)) # specify x manually
## [1] 45
newsum() # use default value!
## [1] 7
Try setting some alternative default values and re-running the function with and without arguments until you are sure you understand how default values work!
The following is a refresher on performing multiple regression analyses in R:
library(help = "datasets") # list of sample datasets that come with R
?airquality
Examine the ‘airquality’ dataset (use the ‘head’ and ‘summary’ functions). Note that there are missing values where ozone concentration data and solar radiation data were not collected.
We could ignore the missing values and just go ahead with our regression analysis, since the default behavior of the “lm()” (‘linear model’) function is to omit cases with missing values in any of the specified variables. However, to avoid problems later, we will omit them explicitly by constructing a new, ‘cleaned’ dataset as follows:
air.cleaned <- na.omit(airquality) # remove rows with missing data
Conduct a multiple linear regression of ozone concentration as a function of solar radiation, wind and temperature. Use the ‘lm()’ function to conduct an ordinary least squares (OLS) regression analysis.
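For example, an additive model with all three predictors could be fit as follows (the object name ‘model1’ is just a suggestion, but it matches the diagnostic code below):

model1 <- lm(Ozone ~ Solar.R + Wind + Temp, data=air.cleaned)   # OLS regression of ozone on solar radiation, wind and temperature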
Explore the regression outputs using the ‘summary’ function, and explore regression diagnostics using, e.g. (depending on what you named the regression model object):
par(mfrow=c(3,2))
plot(model1, which=c(1:4)) # diagnostic plots (NOTE: the plot function returns these plots by default when the input is a linear regression model)
hist(residuals(model1), breaks=10) # histogram of residuals
plot(predict(model1) ~ air.cleaned$Ozone) # plot predicted vs observed- should follow 1:1 line. Examine this for model biases.
abline(0,1)
NOTE: see this website for more information on the diagnostic plots produced by lm().
If no one in your group knows why you are doing any of this or what it all means, ask the instructor! That’s why he’s hanging around the lab…
Here is a practice R window in case it is helpful:
# practice space for regression analysis!
symbols(air.cleaned$Temp, air.cleaned$Solar.R, circles=air.cleaned$Ozone/100, ylab="Solar Radiation", xlab="Temperature", main="Interaction Plot", inches=FALSE)
# alternatively...
coplot(air.cleaned$Ozone~air.cleaned$Temp|air.cleaned$Solar.R,rows=1) # the vertical bar in the formula can be read "conditional on"
# alternatively, you can use ggplot- which you probably already know better than your instructor!
formula2 <- "Ozone ~ Wind + Solar.R * Temp" # you can name formulas...
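For example, the interaction model could then be fit using the named formula (the name ‘model2’ matches the anova() call below):

model2 <- lm(as.formula(formula2), data=air.cleaned)   # regression model including the Solar.R by Temp interaction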
Explore regression outputs for the second model in the same way as you did for the first model without the interaction term.
Conduct an ‘F Test’ (or a Likelihood Ratio Test, LRT, if you prefer…) to formally test whether the more complex model (including the interaction term) fits the data significantly better than the reduced model (with fewer parameters) that lacks the interaction term. Recall that the \(R^2\) value is inadequate for this purpose because \(R^2\) will always increase with additional parameters! Use the following syntax,
anova(model1, model2, test="F") # how would you run an LRT test instead?
You can use this “sandbox” (below) to play around with multiple regression in R! You don’t need to include any of this code in your submission.
# your code here
Very briefly (but in complete sentences) answer the following questions in your Word document:
Exercise 2a By how much (and in what direction) is ozone concentration expected to change if temperature increased from 26 to 30 degrees Celsius? Assume that solar radiation stays constant at 200 lang and wind speed stays constant at 9 mph. Be careful with the units for temperature! Make sure to use the ‘interaction’ model (model2) from part 7 above to answer this question. Please briefly explain (in your Word document) how you got your answer.
Exercise 2b What is the null hypothesis that the p-values for the individual regression coefficients are designed to test?
For this exercise, we will use tree data from a forest in the western Oregon Cascades.
We will fit a multiple linear regression model that predicts forest tree biomass as a function of environmental variables (including a mix of continuous and categorical predictors) and stand age.
Obtain the TreeData.csv file from the “Data Sets” tab (or just download here) and save it to your working directory.
This describes a subset of forest inventory data from the Douglas-fir forests of western Oregon (n = 90, 0.1-ha sites).
Arranged in columns from left to right, variables are:
This is a comma-delimited (.csv) file, which is a common file format for importing data into R. Import the data into R as a data frame (R’s version of an Excel spreadsheet), using the following command:
NobleFir.df <- read.csv("TreeData.csv")
Inspect the resulting data object. Summarize it using the ‘summary()’ and ‘plot()’ functions.
Obtain a correlation matrix for biomass and the four numeric predictor variables using the ‘cor()’ function and by subscripting column locations on the data frame (ask instructor for explanation of syntax if needed):
cor(NobleFir.df[,c(2,4,7:9)])
Are any of the predictor variables highly correlated?
Create box plots for the continuous predictor variables (excluding x and y coordinates) according to sites with or without noble fir. Use the ‘boxplot()’ function. What clear relationships, if any, emerge for how sites with and without noble fir differ with regard to their environmental setting? For example:
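Here is a minimal sketch (NOTE: the name of the presence/absence column is an assumption here – check names(NobleFir.df) and substitute the actual column name from your data frame):

# boxplots of each continuous predictor, split by sites with vs. without noble fir
# 'NobleFir' below is a placeholder for the presence/absence column in TreeData.csv
par(mfrow=c(2,2))
boxplot(elev ~ NobleFir, data=NobleFir.df, ylab="Elevation")
boxplot(Northeastness ~ NobleFir, data=NobleFir.df, ylab="Northeastness")
boxplot(Slope ~ NobleFir, data=NobleFir.df, ylab="Slope")
boxplot(StandAge ~ NobleFir, data=NobleFir.df, ylab="Stand age")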
Use multiple linear regression to model tree biomass as a function of predictor variables (excluding spatial coordinates), using the same approach for regression fitting and diagnostics as we did previously.
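For example (using the predictor names that appear in the standardized model below):

Biomass.lm <- lm(Biomass ~ elev + Northeastness + Slope + SlopePos + StandAge, data=NobleFir.df)
summary(Biomass.lm)              # regression coefficients, R-squared, etc.
par(mfrow=c(2,2))
plot(Biomass.lm, which=c(1:4))   # regression diagnostics, as in the airquality example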
Re-run the regression to obtain standardized regression coefficients, allowing direct comparison of effect sizes for the continuous predictor variables (since all variables are then transformed to standard deviate units, i.e. mean centered on zero with standard deviation of one). The ‘scale’ function provides an easy way to implement this.
Biomass_std.lm <- with(NobleFir.df, # using the "with()" statement, we don't need to keep referencing the name of the data frame.
lm(scale(Biomass) ~ scale(elev) + scale(Northeastness) + scale(Slope) + SlopePos + scale(StandAge))
)
Visually assess whether regression errors (residuals) are spatially autocorrelated using the ‘symbols’ function:
with(NobleFir.df,
symbols(x,y,circles=abs(residuals(Biomass_std.lm)), inches=0.3, ylab="Northing", xlab="Easting", main="Errors from Biomass Regression Model")
)
Note the ‘with()’ function above: rather than literally adding columns to your global environment, it evaluates the enclosed commands in an environment built from the data frame, so those commands can refer to the columns by name as if they were standalone objects. This is a safe alternative to the ‘attach()’ function (which I recommend you never use!).
You can use this “sandbox” (below) to play around with this example in R!
# your code here
Answer the following questions (with brief justification) in your Word document:
Review the “brute-force z-test” code from the “Why focus on algorithms” lecture. Then complete the following exercises:
What if we wanted to run a two-tailed z-test? That is, what if our alternative hypothesis were that salmon fed on the new vegetarian diet could plausibly be larger or smaller than those fed on the conventional diet after one year? Modify the function (“z.test.algorithm()”) with a new argument that allows for both one- and two-tailed tests! Name your new function “z.test.q4a()”.
To convince yourself that your new function works, try running your function for a (made-up) case where the observed body masses for those fed on the new diet are generally higher than the expected body mass for those fed on the conventional diet after one year – the opposite of your (alternative) hypothesis!
NOTE: you may get tangled up with the null hypothesis/p-value concept, which is admittedly a difficult concept! A p-value always assumes the null hypothesis is true (given the null hypothesis is true, a p-value gives the probability of obtaining a test statistic as or more extreme than the observed test statistic). The alternative hypothesis for the 1-tailed test is that the population mean for the treatment group is less than the population mean for conventional farm-raised salmon. The alternative hypothesis for the 2-tailed test is that the population mean for the treatment group is less than or greater than the population mean for conventional farm-raised salmon. The null hypothesis, as always, is that there is no difference.
Include your function in your submitted R script!
This function should be specified as follows:
You can use this “sandbox” (below) to develop and test your function!
population.mean = 4.5
population.sd = 0.9
my.sample = c(3.14,3.27,2.56,3.77,3.34,4.32,3.84,2.19,5.24,3.09)
# ztest <- z.test.algorithm(sample = my.sample, pop.mean=population.mean, pop.sd=population.sd ) # use function from class..
z.test.q4a <- function(sample, pop.mean, pop.sd, onetail=T){
# [add code here!]
}
#z.test.q4a(sample = my.sample, pop.mean=population.mean, pop.sd=population.sd,onetail=F ) # test your function
Hint: No hints … yet!
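The brute-force algorithm from lecture is not reproduced here, so the sketch below makes assumptions about its internals: it simulates many sample means under the null hypothesis (a normal distribution with the supplied population mean and SD) and computes the p-value as the proportion of simulated means at least as extreme as the observed sample mean. The number of simulations and the return format may differ from the class code.

z.test.q4a <- function(sample, pop.mean, pop.sd, onetail=T){
  n.sims <- 1000                                     # number of simulated samples under the null hypothesis
  null_dist <- numeric(n.sims)                       # storage for simulated sample means
  for(i in 1:n.sims){
    null_dist[i] <- mean(rnorm(length(sample),pop.mean,pop.sd))   # mean of one simulated sample under the null
  }
  obs <- mean(sample)                                # observed sample mean
  if(onetail){
    p <- mean(null_dist <= obs)                      # one-tailed: how often is a null sample mean as small as the observed mean?
  }else{
    p <- mean(abs(null_dist-pop.mean) >= abs(obs-pop.mean))   # two-tailed: at least as far from the null mean in either direction
  }
  list(null_dist=null_dist, p_value=p, observed_mean=obs)
}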
Test your function using some alternative sample data values. For example:
population.mean = 4.5
population.sd = 0.9
my.sample = c(5.14,3.27,4.56,3.77,3.34,4.32,3.84,3.19,5.24,4.09)
z.test.q4a(sample = my.sample, pop.mean=population.mean, pop.sd=population.sd,onetail=F )
## $null_dist
## [1] 4.316747 4.770561 4.032500 4.603852 4.450013 4.328502 4.296983 4.511876
##  ... (output truncated: the full $null_dist vector contains 1000 simulated sample means generated under the null hypothesis)
##
## $p_value
## [1] 0.13
##
## $observed_mean
## [1] 4.076
# also try using the code from lecture to compare against a "real" z-test!
population.mean = 4.5
population.sd = 0.9
my.sample = c(5.14,3.27,4.56,3.77,3.34,4.32,3.84,3.19,5.24,4.09)
Now, what if we had access to a large set of samples under the null hypothesis AND we wanted to relax the assumption that the data follow a normal distribution under the null hypothesis?
In this exercise, you are asked to modify the brute-force z-test function so that you generate your sampling distribution by simply sampling from the known data set under the null hypothesis. To do this you can use the ‘sample’ function in R. For this purpose you should sample with replacement. Here is an example:
null.data=c(2.2,3.86,6.39,4.6,3.43,5.16,4.36,4.22,6.31,4.61,5.13,4.12,4.64,4.03,5.01,7.33,5.35,4.7,2.82,4.87,3.87,5.95,5.28,4.02,3.58,4.03,5.38,5.5,3.07,3.29,3.45,5.25,5.7,1.26,5.28,4.19,4.76,4.2,4.81,2.5)
sample(null.data,10,replace=T) # use the sample function to sample from a large vector with replacement
For this problem, let’s go back to the original one-tailed setup, with the alternative hypothesis that salmon raised on the new diet will be smaller (lower mass) than those raised on a conventional diet after one year.
Include your function in your submitted R script!
This function should be named ‘z.test.q4b’ and should be specified as follows:
You can use this “sandbox” (below) to develop and test your function!
null.data=c(2.2,3.86,6.39,4.6,3.43,5.16,4.36,4.22,6.31,4.61,5.13,4.12,4.64,4.03,5.01,7.33,5.35,4.7,2.82,4.87,3.87,5.95,5.28,4.02,3.58,4.03,5.38,5.5,3.07,3.29,3.45,5.25,5.7,1.26,5.28,4.19,4.76,4.2,4.81,2.5)
my.sample = c(3.14,3.27,2.56,3.77,3.34,4.32,3.84,2.19,5.24,3.09)
# ztest <- z.test.algorithm(sample = my.sample, pop.mean=population.mean, pop.sd=population.sd ) # use function from class..
z.test.q4b <- function(sample, null.data){
# [add code here!]
}
#z.test.q4b(sample = my.sample, null.data ) # test your function
Hint: No hints … yet!
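Again, a minimal sketch under the same assumptions as above, except that the null distribution is generated by resampling ‘null.data’ with replacement (rather than drawing from a normal distribution):

z.test.q4b <- function(sample, null.data){
  n.sims <- 1000                                     # number of resampled 'null' samples
  null_dist <- numeric(n.sims)                       # storage for resampled sample means
  for(i in 1:n.sims){
    # note: R still finds the base sample() function even though the first argument is also named 'sample'
    null_dist[i] <- mean(sample(null.data,length(sample),replace=T))
  }
  obs <- mean(sample)                                # observed sample mean
  p <- mean(null_dist <= obs)                        # one-tailed p-value: new diet hypothesized to produce smaller fish
  list(null_dist=null_dist, p_value=p, observed_mean=obs)
}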
Test your function using some alternative sample data vectors. For example:
null.data=c(2.2,3.86,6.39,4.6,3.43,5.16,4.36,4.22,6.31,4.61,5.13,4.12,4.64,4.03,5.01,7.33,5.35,4.7,2.82,4.87,3.87,5.95,5.28,4.02,3.58,4.03,5.38,5.5,3.07,3.29,3.45,5.25,5.7,1.26,5.28,4.19,4.76,4.2,4.81,2.5)
my.sample = c(3.14,3.27,2.56,3.77,3.34,4.32,3.84,2.19,5.24,3.09)
z.test.q4b(sample = my.sample, null.data )
## $null_dist
## [1] 4.241 5.111 4.432 4.840 4.600 3.953 4.054 4.032 4.165 4.560 4.566 3.544
## [13] 3.940 4.613 5.492 4.986 4.522 4.731 4.214 5.196 5.070 4.698 3.855 4.494
##  ... (output truncated: the full $null_dist vector contains 1000 resampled sample means generated under the null hypothesis)
##
## $p_value
## [1] 0.008
##
## $observed_mean
## [1] 3.476
null.data=c(2.2,3.86,6.39,4.6,3.43,5.16,4.36,4.22,6.31,4.61,5.13,4.12,4.64,4.03,5.01,7.33,5.35,4.7,2.82,4.87,3.87,5.95,5.28)
my.sample = c(3.14,3.27,2.56,3.77,3.34,4.32,3.84,2.19,5.24,3.09)
Review the bootstrapping code from the “Why focus on algorithms” lecture (to generate confidence intervals for arbitrary test statistics). Then complete the following exercise:
Generate a new R function, called “RegressionCoefs()”, that takes a data frame as the first input parameter and the name of the response variable as the second input parameter, and returns a vector of the (univariate) regression coefficients (\(\beta\)) produced by regressing the response variable separately on each predictor variable. You can use the “Rsquared()” function (from the lecture) as a reference!
More specifically, this function should be specified as follows:
Include your function in your submitted R script!
You can use this “sandbox” (below) to develop and test your function! Remember to comment out everything but the function prior to submitting.
df <- mtcars[,c(1,3,4,6)]
RegressionCoefs <- function(df,responsevar){
# [add code here!]
}
# RegressionCoefs(df,"mpg")
Hint: No hints … yet!
You can test your function using the following code:
RegressionCoefs(df=trees,responsevar="Volume") # should return two regression coefficients
## Girth Height
## 5.065856 1.543350
Generate a new R function, called “BootCoefs()” that meets the following specifications:
Include your function in your submitted R script!
You can use this “sandbox” (below) to develop and test your function!
df <- mtcars[,c(1,3,4,6)]
responsevar="mpg"
# note: this is copied from the lecture code...
boot_sample <- function(df,statfunc,n_samples,responsevar="Volume"){
indices <- c(1:nrow(df))
output <- matrix(NA,nrow=n_samples,ncol=ncol(df)-1) # storage matrix: one row per bootstrap replicate, one column per statistic returned by statfunc
for(i in 1:n_samples){ # for each bootstrap replicate:
boot_rows <- sample(indices,size=nrow(df),replace=T) # randomly sample observations with replacement
newdf <- df[boot_rows,] # dataframe of bootstrapped observations
output[i,] <- statfunc(newdf,responsevar) # generate statistics from the bootstrapped sample (e.g., compute Rsquared after regressing y on all possible x variables)
}
return(output)
}
BootCoefs <- function(df,statfunc,n_samples,responsevar){
# [add code here!]
}
# BootCoefs(df=df,statfunc=RegressionCoefs,
#           n_samples=1000,responsevar=responsevar) # test your function (uncomment once you have filled in the function body)
Hint: No hints … yet!
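A hedged sketch, assuming BootCoefs() should call boot_sample() (above) and then summarize each bootstrapped statistic by its 2.5%, 50% and 97.5% quantiles (as in the example output below):

BootCoefs <- function(df,statfunc,n_samples,responsevar){
  boot <- boot_sample(df,statfunc,n_samples,responsevar)        # matrix of bootstrapped statistics (one row per replicate)
  CIs <- t(apply(boot,2,quantile,probs=c(0.025,0.5,0.975)))     # median and 95% bootstrap confidence limits for each statistic
  rownames(CIs) <- paste0("stat",1:nrow(CIs))                   # label rows (one per regression coefficient)
  CIs
}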
You can test your function using the following code:
BootCoefs(df=trees,statfunc=RegressionCoefs,n_samples=1000,responsevar="Volume")
## 2.5% 50% 97.5%
## stat1 4.3568227 5.070591 5.618382
## stat2 0.7304976 1.558113 2.442687
df <- mtcars[,c(1,3,4,6)]
responsevar="mpg"
BootCoefs(df=df,
statfunc=RegressionCoefs,
n_samples=1000,
responsevar=responsevar
)
## 2.5% 50% 97.5%
## stat1 -0.05153572 -0.04122000 -0.03118659
## stat2 -0.10086452 -0.06948531 -0.04703679
## stat3 -6.95561525 -5.37038579 -4.13929210
RegressionCoefs <- function(df=trees,responsevar="Volume"){ # univariate models only- interaction and multiple regression not implemented here
response <- df[,responsevar] # extract the response variable
names <- names(df)
coefs <- numeric(length(names)) # named storage vector
names(coefs) <- names(df)
coefs <- coefs[names(coefs)!=responsevar] # assume that all columns that are not the response variable are possible predictor variables
for(i in names(coefs)){ # loop through predictors
predictor <- df[,i] # extract this predictor
model <- lm(response~predictor) # regress response on predictor
coefs[i] <- coefficients(model)["predictor"] # extract slope term
}
return(coefs)
}
To test your new function(s), generate bootstrapped confidence intervals around the regression parameters for your tree biomass regression. Compare these bootstrapped confidence intervals with the standard confidence intervals given by R in the “lm()” function. Are there any important differences? Explain your answer in your Word document. As always, please use your Word document submission to raise any points of confusion or ask me questions!
You can use this “sandbox” (below) to develop and test your function!
BootCoefs(df=trees,statfunc=RegressionCoefs,n_samples=1000,responsevar="Volume")
confint.lm(lm(Volume~Girth,trees)) # compare with lm()