Regression Analysis in R

Today’s exercises provide hands-on experience working with linear regression in R. By running through some R code and interpreting statistical outputs, as a class and individually (or in small groups), we will gain some valuable initial exposure to the regression modeling problem (as well as more practice coding in R).

I encourage you to work in small groups. This way, one group member can focus on a subset of problems, and then all team members can review their solutions and results with other team members.

Please submit your responses individually on WebCampus. If you worked with a team, please include the names of your team members at the top of your submission.

Reports must be uploaded to WebCampus by the date indicated (see WebCampus).

You are of course welcome to ask for help outside of your groups by using the Statistical Concepts Discussion Forum, or by asking the instructor for hints.

Good luck!!

Problem 1. Initial data visualizations.

Objectives: explore the airquality built-in dataset in R, through its documentation, and using descriptive functions such as head() and summary().

1.1)

Use the pairs() function to display a matrix of scatterplots, useful for exploring relationships among several predictor variables, as well as bivariate relationships between each of the predictors and the response. Note that we will use the Ozone variable in the airquality dataframe as our response when fitting our regression models.

IN YOUR RESPONSE: please include your code, followed by a one sentence description of an important pattern you observe from the pairs() plot.

1.2)

Re-run the pairs() function, this time using the lower.panel=panel.smooth argument (explore the help for the pairs function for more details) to display a loess() function (i.e. a local weighted regression to fit a smoothed curve), useful for exploring possible nonlinear relationships among variables. In addition, the help documentation for the pairs() function also provides an example of an argument that can be used for plotting histograms along the diagonal of the scatterplot matrix (using the argument diag.panel = panel.hist) – see if you can get this to work!

IN YOUR RESPONSE: please include your code, along with a short description of any potential violations of the following assumptions: normality of the residuals (looking at the histogram of the response variable can indicate potential issues with the assumption of normally distributed residuals), and non-linear relationships between the response and one of the predictor variables.

1.3)

Use the cor() function to generate correlation matrix for all predictor variables (all columns except for Ozone). You will note that there are many missing values in the correlation matrix because of missing observations in the data. Specify the argument use=”pairwise.complete.obs” so that correlations are calculated using all complete pairs of observations between each pair of variables.

IN YOUR RESPONSE: include your code, along with a one-sentence description of whether you have any potential issues with multicollinearity: as a general rule, pairwise correlation coefficients above 0.7 or 0.8 can lead to issues with multicollinearity.

Problem 2. Data exploration

By now you’ve noticed that the dataset has missing data values! We will deal with this bluntly and in one fell swoop by removing all cases for which any one of the variables has a missing value prior to running our analyses. To do this, we can use the syntax:

air.cleaned <- na.omit(airquality)

For all subsequent analyses, use the air.cleaned dataframe.

Before running any statistical analyses it’s a good idea to generate some summary statistics for your response and predictor variables. We have been through many of this already, so this should not take too long! Note that you can use the summary() function to accomplish most of the following tasks in one quick line of code!

2.1)

Please report the data types for the “Ozone”, “Wind”, and “Temp” columns in the airquality dataframe. You can use the str(), typeof(), and class() functions to help with this.

IN YOUR RESPONSE: include your code, in addition to a brief description of an appropriate analysis for modeling the effect of “Temp” on “Ozone” based on the data type of the response and predictor variable (identify which variable is the predictor and which is the response).

2.2)

Report the range (min and max values) for the “Ozone” and “Temp” columns in the airquality dataframe. You can use the range() or summary() function to help with this.

IN YOUR RESPONSE: include your code, followed by a brief statement about whether it is likely that either of these variables has a constraint – e.g., it cannot take values above 1 (e.g., a probability).

2.3)

Report the mean and median for the “Ozone” and “Temp” columns in the airquality dataframe. You can use the summary() function to do this!

IN YOUR RESPONSE: include your code, and comment briefly on whether either of these variables may be “right-skewed”, such that the mean is greater than the median value. This can indicate a potential deviation from the normality assumption for a potential response variable. You may also wish to use the hist() function to visualize a histogram and further evaluate the skewness (optional).

Problem 3. Simple linear regression.

Simple linear regression involves testing for a relationship between a continuous response variable (dependent variable) and a continuous predictor variable (independent variable).

The null hypothesis is that there is no relationship between the response variable and the predictor variable in your focal population.

Your alternative hypothesis is that there is a linear relationship between the mean value of our response variable (y) and our predictor variable (x) in our focal population:

\(E(y) = \beta_0 + \beta_1*x\)

To interpret this equation: the true population mean of our response variable \(E(y)\) is computed by taking the true intercept (\(\beta_0\)) and adding the product of the true slope term (\(\beta_1\)) and our predictor variable \(x\). β0 and β1 are both parameters that we wish to estimate.

We also assume that there is some “noise” in the focal population that can lead to sampling error. Specifically, we assume that the noise (error) is normally distributed with mean of zero and standard deviation of \(\epsilon\):

\(y = \beta_0 + \beta_1*x + \epsilon\)

OR:

\(y \equiv E(y) + \epsilon\)

WHERE:

\(\epsilon = Normal(0, \sigma)\)

As with a t-test, where we can only approximate the true population mean by computing the sample mean, we can only approximate the linear relationship between our response and predictor variables. That is, we have to deal with sampling uncertainty!

Just like any other statistical test, we assume that our observed linear relationship (defined by estimated coefficients \(\hat{beta_0}\) and \(\hat{beta_1}\)) is just one of many such possible relationships that could have been derived from random sampling. If we collected a different sample, we would estimate a different linear relationship.

NOTE: in linear regression we are generally far more interested in the slope of the linear relationship (\(\beta_1\)) rather than the intercept. So for now, we assume \(\hat{\beta_1}\) is the main test statistic of interest.

So.. what is the sampling distribution for our test statistic \(\hat{\beta_1}\) under the null hypothesis? The answer is that it (when converted to units of standard error to generate a t-statistic) is t-distributed! The degrees of freedom is equal to the sample size minus the number of estimated coefficients (in this case, intercept and B1; so df = n-2).

Let’s go back to the built-in airquality dataset.

3.1)

Calculate a simple linear regression model of ozone concentration as a function of air temperature. Use the lm() function and assign the results to an R object. Use the summary() function on that R object to extract and display key information about this model. The syntax will look something like this:

Ozone.temp.lm <- lm(Ozone ~ Temp, data=air.cleaned) 
summary(Ozone.temp.lm)

IN YOUR RESPONSE: include your code, and write a brief description of the relationship between Ozone concentration and air temperature on the basis of your results. In your answer, use and interpret both the regression coefficient for the effect of “Temp” (slope term) and the R-squared statistic. Our knowledge of atmospheric chemistry tells us that ozone concentration should increase with increasing air temperature. Is that what we find?

3.2)

Explore the relationship between ozone and temperature graphically using a scatter plot. Then overlay your regression results using the abline() function. You can syntax something like this (feel free to use ggplot instead!):

plot(y ~ x)
abline(mymod)   # 'mymod' is an lm() object modeling y as a function of x

IN YOUR RESPONSE: include your code, followed by a brief description of whether you think it is justified to model this relationship as linear on the basis of your scatterplot and regression line.

3.3)

In your model summary, take a look at the confidence interval on the regression coefficient (slope term) linking ozone concentration to temperature. You can use the confint() function to help with this.

IN YOUR RESPONSE: report the confidence interval, along with a concise description of what this confidence interval represents.

3.4)

We can use the predict function to generate a 95% confidence interval around our predictions. The shape of the confidence bounds should be intuitive if you consider that there is sampling error associated with both the intercept and the slope terms). Using the following template code (in this case, I want everyone to gain practice using the predict function, so don’t just use ggplot to generate the figure!), generate a scatterplot of Ozone vs Temp, and overlaying the regression line and a confidence interval:

newdata <- data.frame(    # make a data frame containing the temperature values we want to make predictions for, spanning the range of temperature values in our data
  Temp = seq(50,100,1)
)

my.predict <- predict(Ozone.temp.lm, newdata = newdata, interval = "confidence")  # 95% conf int by default

plot(Ozone ~ Temp, data=air.cleaned)
abline(Ozone.temp.lm, col="blue")   # change the model name to match your own
lines(newdata$Temp,my.predict[,"upr"],col="red",lty=2)   # add upper bound
lines(newdata$Temp,my.predict[,"lwr"],col="red",lty=2)   # add lower bound

IN YOUR RESPONSE: include your code, along with a concise description of what the 95% confidence interval in this plot represents!

3.5)

Finally, let’s examine some regression diagnostics (goodness of fit tests). These tests help us to evaluate if the linear regression model is in fact an adequate representation of our data. If it is not adequate, we should not really be taking the results very seriously!! You can use the following code template to help generate two key diagnostic plots:

layout(matrix(1:2, nrow=1,byrow = T))
hist(residuals(Ozone.temp.lm))
qqnorm(residuals(Ozone.temp.lm))
qqline(Ozone.temp.lm)

Or use this code to generate four diagnostic plots:

layout(matrix(1:4, nrow=2, byrow = T))
plot(Ozone.temp.lm)    # default diagnostic plots

IN YOUR RESPONSE: include your code, followed by a brief description of key takeaways from the diagnostic plots. Specifically, please evaluate whether the residuals are normally distributed and whether the homoskedasticity assumption is violated? Optionally, you may comment on whether there appear to be any outliers or high-influence points.

Problem 4. Multiple linear regression.

4.1)

Now build a similar regression model, keeping Ozone as the response variable and temperature as one of the predictor variables, but adding one additional predictor variable to the model. You pick which variable to add! Use the summary() function to evaluate the key model results.

IN YOUR RESPONSE: include your code, followed by a brief description of the effect of the new variable on ozone concentration, and whether (and how much) the regression coefficient linking Ozone to Temperature changed after including the new variable. Finally, briefly describe whether (and how much) the R-squared statistic changed after adding the new variable.

4.2)

Finally, run a test to see if your new model with two predictors is better than (outperforms) the model with only temperature as a predictor. Use the following code as a template:

anova(model 1, model 2) 

IN YOUR RESPONSE: include your code, followed by a brief description of which of these models is best on the basis of the above result. (try to do a Google search to help in interpretation- or you can ask your instructor!)

Optional additional questions

These questions are optional but will give you additional practice with regression analysis!

First, you can use the effects package in R to effortlessly visualize the fitted effects, along with partial residuals.

First you must install the effects package, in R studio, if you have not already installed it on your computer.

library(effects)   

plot(allEffects(Ozone.full.lm, partial.residuals=FALSE))
plot(allEffects(Ozone.full.lm, partial.residuals=TRUE), partial.residual=list(cex=0.4, col="red"))

You can ask the instructor to explain the graphs.

From the graphs showing fitted relationships along with partial residuals, does it appear that your model is well specified?

And here are some additional things to try!

  1. Z-standardize (normalize) your predictor variables prior to fitting the model, to obtain standardized regression coefficients that are all in standard deviate units, and hence comparable. You can use the scale() function to accomplish this. From there you can determine which of your variables have stronger vs. weaker effects.
  2. Transform predictor variables (e.g. log transform) to try to improve model fit.
  3. Transform your response variable (e.g. log transform) to try to improve model fit.
  4. Fit interaction terms as appropriate to try to improve model fit.
  5. Fit nonlinear relationships by including polynomial terms in your model, as appropriate to improve model fit.

—End exercise 3—