Exercise 4: multiple linear regression

Today’s exercises provide hands-on experience working with multiple linear regression in R, with an emphasis on drawing valid statistical interpretations from your results.

I encourage you to work in teams.

Please submit your responses individually on webcampus. If you worked with a team, please include the names of your team members at the top of your submission.

Reports must be uploaded to Webcampus by the date indicated (see WebCampus).

It is not expected that you will complete the whole process during the class period. Good luck!!

A new bird species has recently been discovered, the Pinyon Roadrunner(Geococcyx fictus). Like its famous cousin the Greater Roadrunner (Geococcyx californianus), this species prefers open country, is predatory on a wide variety of reptiles, insects, arthropods, and rodents; and forages primarily through rapid movements on the ground (or sometimes running in slow motion through the air), continuously vocalizing with its “beep beep” or “hmeep hmeep” call.

The new species occurs in cooler Great Basin environments where the Greater Roadrunner is not found, preferring montane sagebrush communities or sparse woodlands dominated by pinyon pine and juniper tree species. Not much is known yet about the specific habitat requirements of this newly discovered species, although data from recent surveys are available from several scattered mountain ranges where it has been observed to be surprisingly abundant.

You are provided (see file PJRoadRunner.csv) with the following data from 100 site locations where extensive point count surveys were implemented, including roadrunner density (response variable) as well as predictor variables representing topographic, edaphic (soil-related) and woodland vegetation structure:

RR: Pinyon Roadrunner density (birds/km2). This is your response variable
Steep: slope steepness (%)
Elev: elevation (feet)
TreeDens: Tree Density (trees/ha), all stems > 2-cm diameter at breast height (dbh)
CosAsp: Slope aspect, linearized from degrees using a cosine transformation such that 1.0 represents northeast-facing (45 degree) slopes (i.e., cooler slopes that receive less solar insolation) and -1.0 represents southwest-facing (225 degree) slopes.
PatchSize: Area (ha) of the habitat patch where observations were made (or the closest patch to the observation point), estimated from aerial photography
SoilpH: Surface soil pH from an aggregate sample (to 8-cm depth) obtained from numerous locations around the observation point.
SoilD: Soil depth (m) averaged from numerous locations around the observation point.
CanCov: Tree canopy cover (% aerial cover) estimated from aerial photography through image analysis.

Of course, this is actually a simulated data set. The true habitat relationships for this species are perfectly known as the parameters in a linear equation used to calculate the abundances.

You should therefore be able to reconstruct the true parameters through a multiple linear equation analysis!

However, some challenges that may make this a nontrivial exercise in regression modeling:

Some random noise (uncorrelated error) has been added to the observations
You are provided with a random sample of data from the underlying population of observations that was calculated
The predictor variables may be correlated with one another, as one expects for variables representing topography, soils and vegetation
There may be nonlinear relationships between predictor and response variables
There may be interaction effects in the relationship between predictor and response variables
There may be other confounding influences at play.

Your group’s assignment is to reconstruct the true parameters of the Pinyon Roadrunner species-habitat relationship, and to outline the process by which you arrived at your result.

Some hints:

Explore the correlation structure of your dataset carefully.
Use the vif function (car package) to explore possible multicollinearity, and refit your regression models accordingly.
Use regression diagnostic plots
You can use the poly() function to add polynomial terms to a regression model. Alternatively, use the I() operator: eg, to add a squared term use y ~ x1 + I(x1^2).
Try using the effects package to easily visualize the effects of predictor variables modeled from your regression models, including interaction and polynomial terms
Try using the coplot() function to generate graphs of suspected interaction terms empirically (from the data, not from a modeled relationship)
Use the anova() function on multiple, nested models to statistically test model parsimony using an F-Test or likelihood ratio test (increase in variance explained vs. number of parameters added). You could also use the AIC function to calculate AIC coefficients (which we will cover later in the course).
Adopt (and consistently follow) an approach for addressing the multicollinearity that is present in the dataset.

The assignment will be due on the deadline specified in WebCampus.

Your report should include:

R code
Key outputs (tables, figures) for describing important elements of your workflow and for supporting your conclusions.
Your regression model parameters for the final model that you believe to be the “true model” (see above for explanation)
A concluding section outlining your main findings, describing analysis steps or approaches that you considered especially helpful, and any remaining uncertainties.

– End of exercise 4 —

Exercise 4: multiple linear regression

NRES 710

Fall 2022