If you’d like to follow along in R, you can download the code here: SEM R code
Multivariate technique used to map and analyze complex networks of causal relationships between observed and latent variables. Most frequently used in social sciences.
SEMs are ideal for describing complex systems. Liu et al. 2016 used SEMs to model the relationship between growth rates, functional traits, environmental factors, and neighborhood competition effects in a subtropical forest. They hypothesized that the abiotic and biotic environment indirectly affect tree growth via their effect on plant traits. They defined latent variables based on observed data. For example, the environmental latent variable included six soil nutrients and pH. The functional trait latent variable included tree height (Height), specific leaf area (SLA), hydraulic conductivity (Ks), and stomatal density (SD). They kept conspecific basal area (BA-c) separate from the trait latent construct. By combining many observed variables into thoughtful latent variables, they could create models that make more intuitive sense. They found that there were strong positive relationships between functional traits and growth rates. The environmental latent variable had a weaker but still significant relationship with growth rate. Below is a path analysis diagram illustrating the SEM.
Path diagrams can be used to better visualize the matrix equations in SEMs. The figure below illustrates the many types of variables and relationships that SEM can encompass.
Simple regression and multiple regression both involve one endogenous variable. In simple linear regression, there is one exogenous variable that predicts one endogenous variable. In multiple regressions, multiple exogenous variables (with covariance) predict a single endogenous variable. The exogenous variables each have variance. The endogenous variable has an intercept (the triangle) and residual variance.
Multivariate regression and path analysis both have two or more observed endogenous variables. The main difference is that multivariate regression has only exogenous variables predicting endogenous variables. However, in path analysis, endogenous variables can also predict other endogenous variables.
Factor analysis and structural regression both have two or more endogenous variables. The main difference is that factor analysis looks at how a latent variable can predict observed variables. Structural regression can use latent variables to predict other latent variables.
These examples and data are from an introductory SEM seminar by the Institute for Digital Research and Education Statistical Consulting. The data set is from the social sciences. However, the same steps and concepts could apply to ecological research. The data has six observed variables ascribed to 3 hypothetical latent variables describing the effects of student background on academic achievement. Nine observed variables are clustered into hypothetical latent constructs based on prior knowledge or theory. In this example, the three hypothesized latent constructs are adjustment, risk, and achievement. Adjustment is defined by the observed variables motivation, harmony, and stability. Similarly, risk is defined by negative parental psychology, socioeconomic status, and verbal IQ. Finally, achievement is composed of reading, arithmetic, and spelling. Variable abbreviations are:
We will be using the lavaan() package. There are SEM programs outside of R that some people prefer, but they are not free to use.
library(lavaan)
## This is lavaan 0.6-9
## lavaan is FREE software! Please report any bugs.
dat <- read.csv("https://stats.idre.ucla.edu/wp-content/uploads/2021/02/worland5.csv")
For this example, we will perform a simple regression that determines the relationship of two observed variables (one exogenous and one endogenous). This analysis is the first building block in SEM, as an SEM is a linear model framework. This analysis could be done using the lm() function. However, we will use lavaan(). The primary difference is that lavaan() uses maximum likelihood instead of the least square estimator. While the coefficients will be the same, the residual variance will be a little different.
Here we will choose to look at how motivation predicts reading. The notation is similar to lm(), but we must add 1 to the predictor side to include the intercept. We may choose to include the variance. For simple regression, it would only be the variance of the predictor.
We can illustrate this analysis using a path diagram such as the one below.
m1 <- '
# regressions
read ~ 1 + motiv
# variance (optional)
motiv ~~ motiv
'
Next, we fit the model to the data using sem() from lavaan. Followed by a summary, including model fit.
fit1 <- sem(m1, data=dat)
summary(fit1, fit.measures=TRUE)
## lavaan 0.6-9 ended normally after 14 iterations
##
## Estimator ML
## Optimization method NLMINB
## Number of model parameters 5
##
## Number of observations 500
##
## Model Test User Model:
##
## Test statistic 0.000
## Degrees of freedom 0
##
## Model Test Baseline Model:
##
## Test statistic 164.877
## Degrees of freedom 1
## P-value 0.000
##
## User Model versus Baseline Model:
##
## Comparative Fit Index (CFI) 1.000
## Tucker-Lewis Index (TLI) 1.000
##
## Loglikelihood and Information Criteria:
##
## Loglikelihood user model (H0) -3638.084
## Loglikelihood unrestricted model (H1) -3638.084
##
## Akaike (AIC) 7286.168
## Bayesian (BIC) 7307.241
## Sample-size adjusted Bayesian (BIC) 7291.371
##
## Root Mean Square Error of Approximation:
##
## RMSEA 0.000
## 90 Percent confidence interval - lower 0.000
## 90 Percent confidence interval - upper 0.000
## P-value RMSEA <= 0.05 NA
##
## Standardized Root Mean Square Residual:
##
## SRMR 0.000
##
## Parameter Estimates:
##
## Standard errors Standard
## Information Expected
## Information saturated (h1) model Structured
##
## Regressions:
## Estimate Std.Err z-value P(>|z|)
## read ~
## motiv 0.530 0.038 13.975 0.000
##
## Intercepts:
## Estimate Std.Err z-value P(>|z|)
## .read -0.000 0.379 -0.000 1.000
## motiv 0.000 0.447 0.000 1.000
##
## Variances:
## Estimate Std.Err z-value P(>|z|)
## motiv 99.800 6.312 15.811 0.000
## .read 71.766 4.539 15.811 0.000
For this example, we will perform a path analysis, which is broader than a multivariate regression and allows endogenous variables to explain other endogenous variables.
First, we will use negative parental psychology and motivation as predictors of reading and arithmetic. We will also include reading as a predictor of arithmetic (an endogenous variable explaining another endogenous variable).
We can illustrate this analysis using a path diagram such as the one below.
m2 <- '
# regressions
read ~ 1 + ppsych + motiv
arith ~ 1 + motiv + read + ppsych
# covariance and variance
read~~read
ppsych~~ppsych
motiv~~motiv
arith~~arith
motiv ~~ ppsych
'
Next, we fit the model to the data using sem() from lavaan. Followed by a summary, including model fit.
fit2 <- sem(m2, data=dat)
summary(fit2, fit.measures=TRUE)
## lavaan 0.6-9 ended normally after 42 iterations
##
## Estimator ML
## Optimization method NLMINB
## Number of model parameters 14
##
## Number of observations 500
##
## Model Test User Model:
##
## Test statistic 0.000
## Degrees of freedom 0
##
## Model Test Baseline Model:
##
## Test statistic 707.017
## Degrees of freedom 6
## P-value 0.000
##
## User Model versus Baseline Model:
##
## Comparative Fit Index (CFI) 1.000
## Tucker-Lewis Index (TLI) 1.000
##
## Loglikelihood and Information Criteria:
##
## Loglikelihood user model (H0) -7087.537
## Loglikelihood unrestricted model (H1) -7087.537
##
## Akaike (AIC) 14203.074
## Bayesian (BIC) 14262.078
## Sample-size adjusted Bayesian (BIC) 14217.641
##
## Root Mean Square Error of Approximation:
##
## RMSEA 0.000
## 90 Percent confidence interval - lower 0.000
## 90 Percent confidence interval - upper 0.000
## P-value RMSEA <= 0.05 NA
##
## Standardized Root Mean Square Residual:
##
## SRMR 0.000
##
## Parameter Estimates:
##
## Standard errors Standard
## Information Expected
## Information saturated (h1) model Structured
##
## Regressions:
## Estimate Std.Err z-value P(>|z|)
## read ~
## ppsych -0.275 0.037 -7.385 0.000
## motiv 0.461 0.037 12.404 0.000
## arith ~
## motiv 0.300 0.033 8.993 0.000
## read 0.597 0.035 17.004 0.000
## ppsych 0.068 0.031 2.212 0.027
##
## Covariances:
## Estimate Std.Err z-value P(>|z|)
## ppsych ~~
## motiv -24.950 4.601 -5.423 0.000
##
## Intercepts:
## Estimate Std.Err z-value P(>|z|)
## .read -0.000 0.360 -0.000 1.000
## .arith -0.000 0.283 -0.000 1.000
## ppsych -0.000 0.447 -0.000 1.000
## motiv 0.000 0.447 0.000 1.000
##
## Variances:
## Estimate Std.Err z-value P(>|z|)
## .read 64.708 4.092 15.811 0.000
## ppsych 99.800 6.312 15.811 0.000
## motiv 99.800 6.312 15.811 0.000
## .arith 39.923 2.525 15.811 0.000
For the last example, we will create a structural model that relates multiple latent variables. First, we create a measurement model by defining each latent variable. Adjustment will be motivation, harmony, and stability. Risk will be verbal, negative parent psychology, and socioeconomic status. Achievement will be reading, arithmetic, and spelling. Next, we define the regression path. Achievement will be the combination of adjustment and risk.
We can illustrate this analysis using a path diagram such as the one below.
m3 <- '
# measurement model
adjust =~ motiv + harm + stabi
risk =~ verbal + ppsych + ses
achieve =~ read + arith + spell
# regressions
achieve ~ adjust + risk
'
Next, we fit the model to the data using sem() from lavaan. Followed by a summary, including model fit.
fit3 <- sem(m3, data=dat)
summary(fit3, fit.measures=TRUE)
## lavaan 0.6-9 ended normally after 130 iterations
##
## Estimator ML
## Optimization method NLMINB
## Number of model parameters 21
##
## Number of observations 500
##
## Model Test User Model:
##
## Test statistic 148.982
## Degrees of freedom 24
## P-value (Chi-square) 0.000
##
## Model Test Baseline Model:
##
## Test statistic 2597.972
## Degrees of freedom 36
## P-value 0.000
##
## User Model versus Baseline Model:
##
## Comparative Fit Index (CFI) 0.951
## Tucker-Lewis Index (TLI) 0.927
##
## Loglikelihood and Information Criteria:
##
## Loglikelihood user model (H0) -15517.857
## Loglikelihood unrestricted model (H1) -15443.366
##
## Akaike (AIC) 31077.713
## Bayesian (BIC) 31166.220
## Sample-size adjusted Bayesian (BIC) 31099.565
##
## Root Mean Square Error of Approximation:
##
## RMSEA 0.102
## 90 Percent confidence interval - lower 0.087
## 90 Percent confidence interval - upper 0.118
## P-value RMSEA <= 0.05 0.000
##
## Standardized Root Mean Square Residual:
##
## SRMR 0.041
##
## Parameter Estimates:
##
## Standard errors Standard
## Information Expected
## Information saturated (h1) model Structured
##
## Latent Variables:
## Estimate Std.Err z-value P(>|z|)
## adjust =~
## motiv 1.000
## harm 0.884 0.041 21.774 0.000
## stabi 0.695 0.043 15.987 0.000
## risk =~
## verbal 1.000
## ppsych -0.770 0.075 -10.223 0.000
## ses 0.807 0.076 10.607 0.000
## achieve =~
## read 1.000
## arith 0.837 0.034 24.437 0.000
## spell 0.976 0.028 34.338 0.000
##
## Regressions:
## Estimate Std.Err z-value P(>|z|)
## achieve ~
## adjust 0.375 0.046 8.085 0.000
## risk 0.724 0.078 9.253 0.000
##
## Covariances:
## Estimate Std.Err z-value P(>|z|)
## adjust ~~
## risk 32.098 4.320 7.431 0.000
##
## Variances:
## Estimate Std.Err z-value P(>|z|)
## .motiv 12.870 2.852 4.512 0.000
## .harm 31.805 2.973 10.698 0.000
## .stabi 57.836 3.990 14.494 0.000
## .verbal 46.239 4.788 9.658 0.000
## .ppsych 68.033 5.068 13.425 0.000
## .ses 64.916 4.975 13.048 0.000
## .read 11.372 1.608 7.074 0.000
## .arith 37.818 2.680 14.109 0.000
## .spell 15.560 1.699 9.160 0.000
## adjust 86.930 6.830 12.727 0.000
## risk 53.561 6.757 7.927 0.000
## .achieve 30.685 3.449 8.896 0.000
Structural equation modeling is an effective form of analysis for ecology as it can describe complex networks of observed and unobserved (latent) variables and their relationships. This was a broad crash course in SEM. If you are interested in learning more about SEMs, I recommend “taking a look under the hood.” While I have given diagrams that illustrate how SEMs work, Lavaan() uses matrices to run these analyses.
Introduction to structural equation modeling (SEM) in R with Lavaan. UCLA: Statistical Consulting Group. from https://stats.idre.ucla.edu/r/seminars/rsem/#s2h
Liu, X., Swenson, N. G., Lin, D., Mi, X., Umaña, M. N., Schmid, B., & Ma, K. (2016). Linking individual-level functional traits to tree growth in a subtropical forest. Ecology (Durham), 97(9), 2396-2405. https://doi.org/10.1002/ecy.1445
UCLA Office of Advanced Research Computing. (2021) Introduction to Structural Equation Modeling [video]. YouTube. https://www.youtube.com/watch?v=sKVFkVoYfbs