In this course we use the R programming language for statistical computing and graphics. The purpose of this laboratory exercise is to develop familiarity with programming in R.
This lab provides an introduction to the R programming language and the use of R to perform statistical and programming tasks. As with all lab reports, your answers will either take the form of R functions or short written responses. The R functions should be stored in an R script file (‘.R’ extension). To allow me to evaluate your work more efficiently, please name your R script using the following convention: “[your first name]_[your last name]_lab1.R”. So my submission would be “kevin_shoemaker_lab1.R” (all lower case).
Please include only the requested functions in your submitted R script. That is, the R script you turn in should only include the custom R functions you constructed to answer the questions, and none of the ancillary code you used to develop and test your functions. You will probably want to complete the lab exercises using a ‘full working version’ of your script (that includes all your work, including code for testing your solutions), and then save a reduced copy (containing only the requested functions and nothing more) for submission. Ask your instructor if you’d like a demonstration of how this works!
Before submitting your code using WebCampus, please clear your environment and run the script you will submit (containing only the requested functions) from start to finish. Check the ‘Environment’ tab in RStudio after running your script, and make sure that the script ONLY defines functions within your global environment, with no other objects (data, values, etc.). The script you submit should generate no additional objects (e.g., data frames), and it should not read in any files. Don’t put a ‘clear workspace’ line (e.g., ‘rm(list=ls())’) anywhere in your submitted code, unless you want to make life difficult for the instructor…
Submit your R script (functions only) using WebCampus. Follow the above instructions carefully!
Submit short-answer responses using WebCampus. In addition to submitting your script, you will need to respond to additional questions using the Lab 1 “quiz” interface on WebCampus.
Please submit your R script and complete the WebCampus quiz by midnight on the due date (one week after the final lab session allocated for this topic). You can work in groups but submit the materials individually. Refer to WebCampus for due dates.
This lab exercise (Lab 1) will extend over two laboratory periods.
This is the only lab period we will devote to learning R; after this, we will assume you are proficient at defining variables and functions, reading in datasets, visualizing data, etc. in R.
From the R manual ‘An Introduction to R’, you can implement all the steps in Appendix A, located here. This takes you far, but without much explanation; it is a way to jump into the deep end of the pool.
Alternatively, or in addition, go to the Links page and pick a tutorial to run through (e.g., the Datacamp introductory R course, which uses the ‘tidyverse’ packages). You can also check out the materials from my R “bootcamp” archive.
Before you move on, make sure you have a basic understanding of how to work with R and RStudio. If you are working with someone who has used R before, don’t be afraid to ask questions! Take your time with this; if you have never used R before, you might want to use the entire lab period on it, and it will be time well spent!
Another useful introductory R tutorial can be found here, courtesy of NCEAS. If you haven’t used R before, consider working through this tutorial before going through the lab exercises.
If you already have basic R expertise, you can move ahead!
Here we will get warmed up for the lab exercises.
Write a function that takes a numeric input vector and computes the sum of all its elements.
Basically, all you need to do here is write ‘sum(x)’ where it says ‘[add code here]’
NOTE: you don’t need to submit this as part of your lab report! This is just practice…
# demo function (not part of lab)
myfunc <- function(x){
# [add code here!]
}
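For reference, the completed practice function might look like this (a minimal sketch; not part of your submission):
# demo function, completed (not part of lab)
myfunc <- function(x){
  sum(x)       # return the sum of all elements of the input vector
}
myfunc(1:10)   # should return 55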
To gain some familiarity with using R scripts and developing algorithms, complete the following exercise (you don’t need to submit this demo as part of your lab write-up, BUT you will be asked to use this code as part of Exercise 1, below).
Review the meaning of the Central Limit Theorem, which states that the sum (or mean) of a sufficiently large number of independent and identically distributed random variables will form another random variable that is asymptotically normally distributed (as the number of elements added together approaches infinity), regardless of the distribution of the underlying data. One of the key implications is that the sample mean (sum of iid random variables divided by the sample size) will be normally distributed with mean equal to the mean of the data distribution and variance equal to the variance of the data distribution divided by the sample size:
\[X_i \sim \text{any distribution}(mean=\mu,\; var=\sigma^2)\]
Note that, although the normal distribution is parameterized directly by its mean and variance, all probability distributions have a mean and a variance (although some distributions ‘misbehave’ and have an infinite mean and/or variance!).
\[ \bar{X}= \frac{1}{N}\sum_{i=1}^N X_i\] \[\bar{X} \xrightarrow{D} \text{Normal}\left(mean=\mu,\; var=\frac{\sigma^2}{N}\right)\]
Type (or paste) the following code into your R script window (or RStudio script window):
# CENTRAL LIMIT THEOREM demonstration ------------------------
N <- 10 # sample size
lots <- 1000 # placeholder representing infinity
## Define the random number distribution. ------------------
runif(N,10,20) # try uniform random number generator with min of 10 and max of 20.
many_samples = replicate(lots,runif(N,10,20)) # do it many times!
many_xbars = colMeans(many_samples)
hist(many_samples,freq=F,ylim=c(0,1),main="",xlab="Value") # plot the distribution of the underlying data from which we are drawing samples
hist(many_xbars,freq=F,add=T,col="red") # overlay the distribution of sample means
legend("topleft",fill=c("gray","red"),legend=c("data population","sample means"),bty="n") # add a legend to the plot
source("CentralLimitTheorem.R") # run the R script from a different script!
NOTE: The hash (#) used in the above code allows you to insert comments adjacent to snippets of code, which facilitates readability of code (and lets the programmer remember later on what he/she was thinking when coding things a certain way!). It is good practice to comment every line of code so as to describe what it does in plain English, including all variables and functions. Make sure you fully understand the commented code for the CLT demonstration above!
?qqnorm # learn more about the "qqnorm()" function
Plot the q-q plot next to the histograms. The plot on the left should be the comparison of histograms (for population distribution and distribution of sample means) shown in the original script (above). The plot on the right should be the q-q plot. To produce side-by-side plots, you will need to add this line of code to the appropriate place in your script:
# Set graphical parameters for side by side plotting -------------
par(mfrow=c(1,2)) # sets up two side by side plots as one row and two columns
## or alternatively...
layout(matrix(1:2,nrow=1))
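Then, after the histogram and legend code, produce the q-q plot of the sample means. A minimal example (object names follow the demo code above):
qqnorm(many_xbars)   # q-q plot: do the sample means fall along a straight line?
qqline(many_xbars)   # add the theoretical reference line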
In addition, run a Shapiro-Wilk normality test, which tests the null hypothesis that a set of numbers (in this case the vector of sample means) indeed comes from a normal distribution (so what does a low p-value mean??). Use the “shapiro.test()” function:
?shapiro.test
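For example, applied to the vector of sample means from the demo code above:
shapiro.test(many_xbars)   # null hypothesis: these values were drawn from a normal distribution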
So… what can you conclude from these tests??
Finally we have arrived at the ACTUAL lab exercises. Remember you don’t need to submit any of the work up to this point.
For the first part of this lab, you are asked to write several functions (and submit them as part of your lab script, of course!). The functions increase quickly in complexity!
You should now know how to construct functions in R. If you don’t, go back to the NCEAS tutorial and review the section on writing functions.
Write an R function called “CoefVar()” that takes a numeric vector (named ‘x’ within the function but could have any name outside the function) as input, and computes (and returns) its coefficient of variation (CV; standard deviation as a proportion of the mean). To make sure it works, apply your function to the ‘Height’ vector in the ‘trees’ dataset that installs with R as sample data:
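If you get stuck, a minimal sketch might look like this (assuming the CV is expressed as a simple proportion, the standard deviation divided by the mean):
CoefVar <- function(x){
  sd(x)/mean(x)   # coefficient of variation: standard deviation as a proportion of the mean
}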
You can use the following code to test your function using the built-in ‘trees’ dataset.
# Testing your code ------------------------
# Explore the "trees" dataset
#?trees
summary(trees) # learn more about the data
## Girth Height Volume
## Min. : 8.30 Min. :63 Min. :10.20
## 1st Qu.:11.05 1st Qu.:72 1st Qu.:19.40
## Median :12.90 Median :76 Median :24.20
## Mean :13.25 Mean :76 Mean :30.17
## 3rd Qu.:15.25 3rd Qu.:80 3rd Qu.:37.30
## Max. :20.60 Max. :87 Max. :77.00
trees$Height # extract the "Height" column from the trees dataset.
## [1] 70 65 63 72 81 83 66 75 80 75 79 76 76 69 75 74 85 86 71 64 78 80 74 72 77
## [26] 81 82 80 80 80 87
CoefVar(trees$Height) # run your new function!
## [1] 0.08383964
Write a function called “DrawLine()” for drawing a regression line through a scatter plot. This function should be specified as follows:
As a test, apply this function to the ‘Height’ (x axis) and ‘Volume’ (y axis) vectors in the ‘trees’ dataset, and then to the ‘waiting’ (x axis) and ‘eruptions’ (y axis) vectors in the ‘faithful’ dataset.
#DrawLine(trees$Height,trees$Volume)
# ?faithful
# summary(faithful)
## ggplot version:
DrawLine(faithful$waiting,faithful$eruptions) # test your function using the old faithful eruptions data
## `geom_smooth()` using formula = 'y ~ x'
## (Intercept) x
## -1.87401599 0.07562795
## base R version:
DrawLine_base(faithful$waiting,faithful$eruptions)
## (Intercept) x
## -1.87401599 0.07562795
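The detailed specification for this function is not reproduced here; if you need a starting point, one possible base R sketch of DrawLine_base() (an assumption about the intended interface: inputs x and y, a scatterplot with a fitted regression line, and the fitted coefficients returned) is:
DrawLine_base <- function(x,y){
  plot(y~x)          # scatterplot of y against x
  model <- lm(y~x)   # fit an ordinary least squares regression
  abline(model)      # add the fitted regression line to the plot
  coef(model)        # return the fitted intercept and slope
}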
Write a function called “DrawLine2()” for drawing a “smoothed” regression line through a scatter plot, making the smoothing span (degree of smoothness, or non-wiggliness of the line) a user-defined option. This function should be specified as follows:
Try testing your function using the trees and faithful datasets!
xvec <- c(1:10)
yvec <- rnorm(length(xvec),c(2:6,7:3),2)
DrawLine2(xvec,yvec,smooth=T,span=0.5) # run your new function!
## `geom_smooth()` using formula = 'y ~ x'
## Warning in stats::qt(level/2 + 0.5, pred$df): NaNs produced
## Warning in max(ids, na.rm = TRUE): no non-missing arguments to max; returning
## -Inf
## Call:
## loess(formula = y ~ x, span = span)
##
## Number of Observations: 10
## Equivalent Number of Parameters: 8.49
## Residual Standard Error: 5.028
DrawLine2(x=trees$Height,y=trees$Volume,smooth=F)
## `geom_smooth()` using formula = 'y ~ x'
## (Intercept) x
## -87.12361 1.54335
DrawLine2(faithful$waiting,faithful$eruptions,smooth=T,span=.5) # test using the old faithful eruptions data
## `geom_smooth()` using formula = 'y ~ x'
## Call:
## loess(formula = y ~ x, span = span)
##
## Number of Observations: 272
## Equivalent Number of Parameters: 6.71
## Residual Standard Error: 0.3731
DrawLine2(faithful$waiting,faithful$eruptions,smooth=T,span=1.9)
## `geom_smooth()` using formula = 'y ~ x'
## Call:
## loess(formula = y ~ x, span = span)
##
## Number of Observations: 272
## Equivalent Number of Parameters: 3.08
## Residual Standard Error: 0.4639
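Again, the detailed specification is not reproduced here; a possible base R sketch (assuming the interface suggested by the tests above: inputs x and y, a logical ‘smooth’ argument, and a ‘span’ argument controlling the loess smoother) is:
DrawLine2_base <- function(x,y,smooth=TRUE,span=0.75){
  plot(y~x)                                   # scatterplot of y against x
  if(smooth){
    model <- loess(y~x,span=span)             # locally weighted smoother; span controls wiggliness
    xs <- seq(min(x),max(x),length.out=100)   # grid of x values for drawing the smooth curve
    lines(xs,predict(model,newdata=data.frame(x=xs)))   # add the smoothed line
  } else {
    model <- lm(y~x)                          # ordinary least squares line
    abline(model)
  }
  model                                       # return the fitted model object
}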
Write a function called “CLTdemo()” based on the central limit theorem (CLT) demonstration code above. This function should be specified as follows:
And you can test your code with the following command:
CLTdemo(3,10,20) # run your new function!
## [1] 0.006422794
## you might want to replicate this many times...
replicate(5,CLTdemo(3,10,20,visualize=F)) # for example, replicate 5 times
## [1] 0.02666252 0.22219402 0.14816431 0.04354325 0.01335300
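If the specification isn't clear from the tests above, here is one possible sketch (assumptions: the arguments are the sample size and the minimum and maximum of the uniform data distribution, and the function returns the Shapiro-Wilk p-value for the simulated sample means, optionally plotting them):
CLTdemo <- function(N,min,max,visualize=TRUE){
  lots <- 1000                                      # number of replicate samples
  many_samples <- replicate(lots,runif(N,min,max))  # draw many samples of size N
  many_xbars <- colMeans(many_samples)              # compute the sample means
  if(visualize){
    par(mfrow=c(1,2))
    hist(many_xbars,freq=F,main="",xlab="Sample mean")   # distribution of sample means
    qqnorm(many_xbars); qqline(many_xbars)               # q-q plot against a normal distribution
  }
  shapiro.test(many_xbars)$p.value                  # return the normality-test p-value
}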
Finally, test the CLT function out for different parameter combinations to make sure it works! See if you can use this function to develop a reasonable rule for how large a sample size is necessary to ensure that the sample mean statistic (\(\bar{X}\)) is approximately normally distributed, given that the underlying data population is a uniform distribution.
Please submit your answer in WebCampus- and please justify your answer! [note: I am just looking for a thoughtful response, not a definitive mathematical proof! Just a few sentences is fine.]
NOTE: to set default values, just use the equals sign when defining your function. For example, say I wanted to write a function that adds the numbers in vector input ‘x’ OR adds a scalar input ‘y’ to each element of vector ‘x’. With only 1 input (‘x’) the function will sum up the elements in ‘x’. If the user specifies two arguments (x and y) then the function will add y to each element of x. It might look something like this:
newsum <- function(x,y=NULL){
if(is.null(y)) sum(x) else x+y
}
newsum(5:10) # use default value of y
## [1] 45
newsum(5:10, 2)
## [1] 7 8 9 10 11 12
Try setting some alternative default values and re-running the function with and without arguments until you are sure you understand how default values work!
Before we start with the exercise, here is some background:
library(help = "datasets") # list of sample datasets that come with R
?airquality # this is the dataset we'll work with in this exercise
Examine the ‘airquality’ dataset (use the ‘head’ and ‘summary’ functions). Note that there are missing values where ozone concentration data and solar radiation data were not collected.
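For example:
head(airquality)      # first few rows; note the NA values in the Ozone and Solar.R columns
summary(airquality)   # summary statistics, including the number of NA's per column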
We could ignore the missing values and just go ahead with our regression analysis, since the default behavior of the “lm()” (‘linear model’) function is to omit cases with missing values in any of the specified variables. However, to avoid problems later, we will omit them explicitly by constructing a new, ‘cleaned’ dataset as follows:
air_cleaned <- na.omit(airquality) # remove rows with missing data
Conduct a multiple linear regression of ozone concentration as a function of solar radiation, wind and temperature. Use the ‘lm()’ function to conduct an ordinary least squares (OLS) regression analysis.
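The model-fitting code is not shown here; one way to fit the additive model (assuming the cleaned data frame from above and the object name ‘model1’ used in the diagnostics code below) is:
model1 <- lm(Ozone ~ Solar.R + Wind + Temp, data=air_cleaned)   # additive multiple regression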
Explore the regression outputs using the ‘summary’ function, and explore regression diagnostics:
par(mfrow=c(3,2))
plot(model1, which=c(1:4)) # diagnostic plots (NOTE: the 'plot()' function produces diagnostic plots like these when the input is a fitted linear regression model)
hist(residuals(model1), breaks=10) # histogram of residuals
plot(predict(model1) ~ air_cleaned$Ozone) # plot predicted vs observed- should follow 1:1 line. Examine this for model biases.
abline(0,1)
NOTE: see this website for more information on the diagnostic plots produced by lm().
coplot(air_cleaned$Ozone~air_cleaned$Temp|air_cleaned$Solar.R,rows=1) # the "|" operator can be read "conditional on"
# alternatively, you can use ggplot
library(ggplot2)
ggplot(air_cleaned,aes(Temp,Solar.R)) +
geom_point(aes(size=Ozone))
formula2 <- "Ozone ~ Wind + Solar.R * Temp" # you can name formulas...
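The corresponding model fit is not shown; one way to do it (assuming the object name ‘model2’ used in the anova() comparison below) is:
model2 <- lm(as.formula(formula2), data=air_cleaned)   # model including the Solar.R-by-Temp interaction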
Explore regression diagnostics and coefficients for the second model in the same way as you did for the first model without the interaction term.
Conduct a formal test of whether the more complex model (including the interaction term) fits the data significantly better than the reduced model (with fewer parameters) that lacks the interaction term. Recall that the \(R^2\) value is inadequate for this purpose because \(R^2\) will always increase with additional parameters! Use the following syntax:
anova(model1, model2, test="Chisq")
Very briefly (but in complete sentences) answer the following questions in WebCampus:
By how much (and in what direction) is ozone concentration expected to change if temperature increases from 26 to 30 degrees Celsius? Assume that solar radiation stays constant at 200 lang and wind speed stays constant at 9 mph. Be careful with the units for temperature! Make sure to use the ‘interaction’ model (model2) from part 7 above to answer this question. Please briefly explain (in WebCampus) how you got your answer (see the code sketch after this list of questions for one way to set up the prediction).
What is the null hypothesis that the p-values for the individual regression coefficients are designed to test?
Based on your diagnostic plots, do you think this is an appropriate model for these data? Explain your reasoning.
Which of the predictor variables you included in your model is the most important in predicting ozone concentration? Explain your reasoning.
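For the temperature question above, a sketch of one way to set up the prediction (assuming the ‘model2’ object from above; note that the airquality temperatures are recorded in degrees Fahrenheit):
newtemps <- c(26,30)*9/5 + 32                              # 26 and 30 deg C expressed in deg F
newdat <- data.frame(Solar.R=200, Wind=9, Temp=newtemps)   # hold solar radiation and wind constant
preds <- predict(model2, newdata=newdat)                   # predicted ozone at each temperature
diff(preds)                                                # expected change in ozone concentration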
## Analysis of Variance Table
##
## Response: Ozone
## Df Sum Sq Mean Sq F value Pr(>F)
## Wind 1 45694 45694 112.708 < 2.2e-16 ***
## Solar.R 1 9055 9055 22.335 7.072e-06 ***
## Temp 1 19050 19050 46.988 4.901e-10 ***
## Solar.R:Temp 1 5028 5028 12.402 0.0006337 ***
## Residuals 106 42975 405
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Single term deletions
##
## Model:
## Ozone ~ Wind + Solar.R * Temp
## Df Sum of Sq RSS AIC Pr(>Chi)
## <none> 42975 671.43
## Wind 1 11679.6 54654 696.12 2.393e-07 ***
## Solar.R:Temp 1 5028.2 48003 681.71 0.0004573 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Call:
## lm(formula = Ozone ~ scale(Wind) + scale(Solar.R) * scale(Temp),
## data = air_cleaned)
##
## Residuals:
## Min 1Q Median 3Q Max
## -39.634 -12.505 -2.300 8.956 94.162
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.996 2.002 19.976 < 2e-16 ***
## scale(Wind) -11.879 2.213 -5.367 4.74e-07 ***
## scale(Solar.R) 9.006 2.248 4.006 0.000115 ***
## scale(Temp) 15.844 2.297 6.898 3.98e-10 ***
## scale(Solar.R):scale(Temp) 7.216 2.049 3.522 0.000634 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 20.14 on 106 degrees of freedom
## Multiple R-squared: 0.6472, Adjusted R-squared: 0.6339
## F-statistic: 48.61 on 4 and 106 DF, p-value: < 2.2e-16
Review the “brute-force z-test” code from the “Why focus on algorithms” lecture. Then complete the following exercises:
## $Xbar
## [1] 3.476
##
## $nulldist
## [1] 3.669712 3.254229 3.652084 3.564189 3.815889 3.322832 3.544433 3.780472
## ... (values [9] through [1000] omitted for brevity; 1000 simulated sample means in total) ...
##
## $p_value
## [1] 0.145
## [1] 0.1202629
What if we wanted to run a two-tailed z-test? That is, what if our alternative hypothesis were that salmon fed on the new vegetarian diet could plausibly be larger or smaller than those fed on the conventional diet after one year?
Modify the function (“ztest_bruteforce()”) with a new argument that allows for both one and two-tailed tests! Name your new function “my_ztest()”.
To convince yourself that your new function works, try running your function for a (made-up) case where the observed body mass for those fed on the new diet are generally higher than the expected body mass for those fed on the conventional diet after one year – the opposite of your hypothesis!
NOTE: you may get tangled up with the null hypothesis/p-value concept, which is admittedly a difficult concept! A p-value always assumes the null hypothesis is true (what is the probability that a random sample under the null hypothesis could give as much evidence for your alternative hypothesis as the data you observed?).
Include your function in your submitted R script!
This function should be specified as follows:
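The detailed specification table is not reproduced here, but based on the test code below, one possible sketch looks like this (assumptions: the arguments are the observed sample, the null mean and standard deviation, and an ‘alternative’ argument beginning with ‘l’, ‘g’, or ‘t’; the null distribution is built from 10,000 simulated sample means):
my_ztest <- function(x, mu, sigma, alternative="less"){
  n <- length(x)
  Xbar <- mean(x)                                          # observed sample mean
  nulldist <- replicate(10000, mean(rnorm(n, mu, sigma)))  # simulated sample means under the null
  alt <- substr(alternative, 1, 1)                         # accept 'l', 'g', or 't' (two-sided)
  p_value <- switch(alt,
    l = mean(nulldist <= Xbar),                            # lower tail: is the observed mean unusually small?
    g = mean(nulldist >= Xbar),                            # upper tail: is the observed mean unusually large?
    t = mean(abs(nulldist - mu) >= abs(Xbar - mu))         # two-tailed: unusually far from the null mean in either direction?
  )
  list(Xbar = Xbar, nulldist = nulldist, p_value = p_value)
}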
Test your function using some alternative sample data values. For example:
mu = 4 # testing code
sigma = 2
mysample = c(3.14,3.27,2.56,3.77,3.34,4.32,3.84,2.19,5.24,3.09)
test = my_ztest(mysample, mu, sigma,alternative = 'l' )
str(test)
## List of 3
## $ Xbar : num 3.48
## $ nulldist: num [1:10000] 4.84 3.72 5.25 3.43 4.23 ...
## $ p_value : num 0.207
test$p_value
## [1] 0.2072
# also try using the code from lecture to compare against a "real" z-test!
Also try using the code from lecture to compare against a “real” z-test– for example:
mu = 4 # testing code
sigma = 2
mysample = c(3.14,3.27,2.56,3.77,3.34,4.32,3.84,2.19,5.24,3.09)
samplemean = mean(mysample) # real z-test
stderr <- sigma/sqrt(length(mysample)) # population standard error
zstat = (samplemean-mu)/stderr
pnorm(zstat)
## [1] 0.203689
test = my_ztest(mysample, mu, sigma,alternative = 'l' ) # compare your function with the real answer
test$p_value
## [1] 0.204
library(BSDA)
## Loading required package: lattice
##
## Attaching package: 'BSDA'
## The following object is masked from 'package:datasets':
##
## Orange
z.test(x=mysample,mu=mu, sigma.x=sigma,alternative = "less") # 'canned' z-test
##
## One-sample z-Test
##
## data: mysample
## z = -0.82852, p-value = 0.2037
## alternative hypothesis: true mean is less than 4
## 95 percent confidence interval:
## NA 4.516297
## sample estimates:
## mean of x
## 3.476
Now, what if we had access to a large set of independent samples under the null hypothesis AND we wanted to relax the assumption that the data follow a normal distribution under the null hypothesis?
In this exercise, you are asked to modify the brute-force z-test function so that you generate your sampling distribution by simply sampling from the known data set under the null hypothesis. To do this you can use the ‘sample()’ function in R. For this purpose you should sample from the null with replacement. Here is an example:
nulldata=c(2.2,3.86,6.39,4.6,3.43,5.16,4.36,4.22,6.31,4.61,5.13,4.12,4.64,4.03,5.01,7.33,5.35,4.7,2.82,4.87,3.87,5.95,5.28,4.02,3.58,4.03,5.38,5.5,3.07,3.29,3.45,5.25,5.7,1.26,5.28,4.19,4.76,4.2,4.81,2.5)
sample(nulldata,10,replace=T) # use the sample function to sample from a large vector with replacement
For this exercise, we will return to the original null hypothesis: that salmon raised on the new diet will be smaller (lower mass) than those raised on conventional diet after one year.
Include your function in your submitted R script!
This function should be named ‘x_vs_null()’ and should be specified as follows:
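The specification table is not reproduced here; one possible sketch, based on the test code below (assumptions: the inputs are the sample vector and the null data vector, the test is lower-tailed as in the original hypothesis, and the null distribution is built from resampled means):
x_vs_null <- function(x, nulldata, reps=10000){
  n <- length(x)
  Xbar <- mean(x)                                                        # observed sample mean
  nulldist <- replicate(reps, mean(sample(nulldata, n, replace=TRUE)))   # resampled means under the null
  p_value <- mean(nulldist <= Xbar)                                      # lower-tailed p-value
  list(Xbar = Xbar, nulldist = nulldist, p_value = p_value)
}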
Test your function using some alternative sample data vectors. For example:
nulldata=c(2.2,3.86,6.39,4.6,3.43,5.16,4.36,4.22,6.31,4.61,5.13,4.12,4.64,4.03,5.01,7.33,5.35,4.7,2.82,4.87,3.87,5.95,5.28,4.02,3.58,4.03,5.38,5.5,3.07,3.29,3.45,5.25,5.7,1.26,5.28,4.19,4.76,4.2,4.81,2.5)
mysample = c(3.14,3.27,2.56,3.77,3.34,4.32,3.84,2.19,5.24,3.09)
test = x_vs_null(mysample, nulldata )
test$p_value
## [1] 0.008
Review the bootstrapping code from the first lecture topic (to generate confidence intervals for arbitrary test statistics). Then complete the following exercise:
Generate a new R function, called “RegressionCoefs()”, that takes a data frame as the first input parameter and the name of the response variable as the second input parameter, and returns the slope coefficients (\(\beta\)) produced by regressing the response variable on each predictor variable, one at a time, as a named vector. You can use the “Rsquared()” function (from the lecture) as a reference!
More specifically, this function should be specified as follows:
Include your function in your submitted R script!
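One possible sketch (assuming the output is a named vector with one slope per predictor, as in the test below):
RegressionCoefs <- function(df, responsevar){
  predictors <- setdiff(names(df), responsevar)    # all columns except the response
  sapply(predictors, function(p){
    model <- lm(df[[responsevar]] ~ df[[p]])       # simple regression on one predictor at a time
    unname(coef(model)[2])                         # keep only the slope coefficient
  })                                               # returns a named vector of slopes
}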
You can test your function using the following code:
RegressionCoefs(trees,"Volume") # should return two regression coefficients
## Girth Height
## 5.065856 1.543350
Generate a new R function, called “BootCoefs()” that meets the following specifications:
Include your function in your submitted R script!
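One possible sketch (assumptions: resample rows of the data frame with replacement, recompute the coefficients each time, and summarize each coefficient with its 2.5%, 50%, and 97.5% bootstrap quantiles, as in the test output below):
BootCoefs <- function(df, fun, responsevar, reps=1000){
  boot <- replicate(reps, {
    bootrows <- sample(1:nrow(df), nrow(df), replace=TRUE)   # resample rows with replacement
    fun(df[bootrows, ], responsevar)                         # recompute the statistic(s) on the bootstrap sample
  })
  apply(boot, 1, quantile, probs=c(0.025, 0.5, 0.975))       # bootstrap quantiles for each coefficient
}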
You can test your function using the following code:
BootCoefs(df=trees,fun=RegressionCoefs,responsevar="Volume")
## Girth Height
## 2.5% 4.362612 0.7690758
## 50% 5.027451 1.5358234
## 97.5% 5.632498 2.3628580
df <- mtcars[,c(1,3,4,6)]
responsevar="mpg"
BootCoefs(df,RegressionCoefs,responsevar)
## disp hp wt
## 2.5% -0.05132901 -0.10073301 -7.100329
## 50% -0.04119646 -0.06940640 -5.377683
## 97.5% -0.03146126 -0.04591232 -4.166714
To test your new function(s), generate bootstrapped confidence intervals around the regression parameters for your tree biomass regression. Compare these bootstrapped confidence intervals with the standard confidence intervals given by R in the “lm()” function. Are there any important differences? Explain your answer in WebCampus. As always, please use your WebCampus submission to raise any points of confusion or ask me questions!
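The comparison code is not shown here; one way to generate it, matching the output below (which uses the one-predictor-at-a-time regressions), is:
BootCoefs(df=trees, fun=RegressionCoefs, responsevar="Volume")   # bootstrapped confidence intervals
sapply(c("Girth","Height"), function(p){                         # standard lm() confidence intervals,
  confint(lm(trees$Volume ~ trees[[p]]))[2,]                     #   keeping only the slope interval for each predictor
})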
## Girth Height
## 2.5% 4.343355 0.7827071
## 50% 5.054131 1.5375925
## 97.5% 5.665798 2.4448150
## Girth Height
## 2.5 % 4.559914 0.758249
## 97.5 % 5.571799 2.328451