Structural Equation Modeling (SEM)

If you’d like to follow along in R, you can download the code here: SEM R code

What is SEM

Multivariate technique used to map and analyze complex networks of causal relationships between observed and latent variables. Most frequently used in social sciences.

Why is it useful for ecologists?

Theory oriented
Causal network hypothesis testing
Testing multiple models
More intuitive interpretation

Example

SEMs are ideal for describing complex systems. Liu et al. 2016 used SEMs to model the relationship between growth rates, functional traits, environmental factors, and neighborhood competition effects in a subtropical forest. They hypothesized that the abiotic and biotic environment indirectly affect tree growth via their effect on plant traits. They defined latent variables based on observed data. For example, the environmental latent variable included six soil nutrients and pH. The functional trait latent variable included tree height (Height), specific leaf area (SLA), hydraulic conductivity (Ks), and stomatal density (SD). They kept conspecific basal area (BA-c) separate from the trait latent construct. By combining many observed variables into thoughtful latent variables, they could create models that make more intuitive sense. They found that there were strong positive relationships between functional traits and growth rates. The environmental latent variable had a weaker but still significant relationship with growth rate. Below is a path analysis diagram illustrating the SEM.

(Liu 2016)

Definitions

Observed variable - Exists in data
Latent variable - Constructed in model
Exogenous - Independent that explains endogenous (observed or latent) (predictor)
Endogenous - Dependent that has a causal path leading to it (observed or latent) (response)
Measurement model - Links observed and latent variables
- Indicator - Observed (exogenous or endogenous)
- Factor - Latent (exogenous or endogenous)
- Loading - Path between indicator and factor
Structural model - specifies causal relationships between exogenous and endogenous variables
Regression model - Path between exogenous and endogenous variables

Types of Structural Equation Models

Path Diagram Symbols

Path Diagram

Path diagrams can be used to better visualize the matrix equations in SEMs. The figure below illustrates the many types of variables and relationships that SEM can encompass.

image credit: Johnny Lin 2021

SEMs with one endogenous variable

Simple regression and multiple regression both involve one endogenous variable. In simple linear regression, there is one exogenous variable that predicts one endogenous variable. In multiple regressions, multiple exogenous variables (with covariance) predict a single endogenous variable. The exogenous variables each have variance. The endogenous variable has an intercept (the triangle) and residual variance.

image credit: Johnny Lin 2021

Two or more endogenous variables

Observed

Multivariate regression and path analysis both have two or more observed endogenous variables. The main difference is that multivariate regression has only exogenous variables predicting endogenous variables. However, in path analysis, endogenous variables can also predict other endogenous variables.

image credit: Johnny Lin 2021

Latent

Factor analysis and structural regression both have two or more endogenous variables. The main difference is that factor analysis looks at how a latent variable can predict observed variables. Structural regression can use latent variables to predict other latent variables.

image credit: Johnny Lin 2021

Steps of an SEM

Model Specification – defines hypothetical relationships
Model Identification
- Over identified – more knowns than free parameters
- Just-identified – the number of unknowns equals the number of free parameters
- Under-identified – the number of unknowns is greater than the number of parameters (model coefficients cannot be estimated)
Parameter Estimation – comparing actual and estimated covariance (i.e., maximum likelihood estimate)
Model Evaluation- goodness of fit (i.e., Chi-square, Akaike Information Criterion (AIC), Comparative Fit Index (CFI))
Model Modification – post hoc model modification

Examples

Data

These examples and data are from an introductory SEM seminar by the Institute for Digital Research and Education Statistical Consulting. The data set is from the social sciences. However, the same steps and concepts could apply to ecological research. The data has six observed variables ascribed to 3 hypothetical latent variables describing the effects of student background on academic achievement. Nine observed variables are clustered into hypothetical latent constructs based on prior knowledge or theory. In this example, the three hypothesized latent constructs are adjustment, risk, and achievement. Adjustment is defined by the observed variables motivation, harmony, and stability. Similarly, risk is defined by negative parental psychology, socioeconomic status, and verbal IQ. Finally, achievement is composed of reading, arithmetic, and spelling. Variable abbreviations are:

Motivation = motiv
Harmony = harm
Stability = stabi
Negative parental psychology = ppsych
Socioeconomic status = ses
Verbal IQ = verbal
Reading = read
Arithmetic = arith
Spelling = spell

We will be using the lavaan() package. There are SEM programs outside of R that some people prefer, but they are not free to use.

library(lavaan)

## This is lavaan 0.6-9
## lavaan is FREE software! Please report any bugs.

dat <- read.csv("https://stats.idre.ucla.edu/wp-content/uploads/2021/02/worland5.csv")

Simple Linear Regression

For this example, we will perform a simple regression that determines the relationship of two observed variables (one exogenous and one endogenous). This analysis is the first building block in SEM, as an SEM is a linear model framework. This analysis could be done using the lm() function. However, we will use lavaan(). The primary difference is that lavaan() uses maximum likelihood instead of the least square estimator. While the coefficients will be the same, the residual variance will be a little different.

Here we will choose to look at how motivation predicts reading. The notation is similar to lm(), but we must add 1 to the predictor side to include the intercept. We may choose to include the variance. For simple regression, it would only be the variance of the predictor.

We can illustrate this analysis using a path diagram such as the one below.

m1 <-   '
  # regressions
    read ~ 1 + motiv
  # variance (optional)
    motiv ~~ motiv
'

Next, we fit the model to the data using sem() from lavaan. Followed by a summary, including model fit.

fit1 <- sem(m1, data=dat)
summary(fit1, fit.measures=TRUE)

## lavaan 0.6-9 ended normally after 14 iterations
## 
##   Estimator                                         ML
##   Optimization method                           NLMINB
##   Number of model parameters                         5
##                                                       
##   Number of observations                           500
##                                                       
## Model Test User Model:
##                                                       
##   Test statistic                                 0.000
##   Degrees of freedom                                 0
## 
## Model Test Baseline Model:
## 
##   Test statistic                               164.877
##   Degrees of freedom                                 1
##   P-value                                        0.000
## 
## User Model versus Baseline Model:
## 
##   Comparative Fit Index (CFI)                    1.000
##   Tucker-Lewis Index (TLI)                       1.000
## 
## Loglikelihood and Information Criteria:
## 
##   Loglikelihood user model (H0)              -3638.084
##   Loglikelihood unrestricted model (H1)      -3638.084
##                                                       
##   Akaike (AIC)                                7286.168
##   Bayesian (BIC)                              7307.241
##   Sample-size adjusted Bayesian (BIC)         7291.371
## 
## Root Mean Square Error of Approximation:
## 
##   RMSEA                                          0.000
##   90 Percent confidence interval - lower         0.000
##   90 Percent confidence interval - upper         0.000
##   P-value RMSEA <= 0.05                             NA
## 
## Standardized Root Mean Square Residual:
## 
##   SRMR                                           0.000
## 
## Parameter Estimates:
## 
##   Standard errors                             Standard
##   Information                                 Expected
##   Information saturated (h1) model          Structured
## 
## Regressions:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   read ~                                              
##     motiv             0.530    0.038   13.975    0.000
## 
## Intercepts:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .read             -0.000    0.379   -0.000    1.000
##     motiv             0.000    0.447    0.000    1.000
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##     motiv            99.800    6.312   15.811    0.000
##    .read             71.766    4.539   15.811    0.000

Path Analysis

For this example, we will perform a path analysis, which is broader than a multivariate regression and allows endogenous variables to explain other endogenous variables.

First, we will use negative parental psychology and motivation as predictors of reading and arithmetic. We will also include reading as a predictor of arithmetic (an endogenous variable explaining another endogenous variable).

We can illustrate this analysis using a path diagram such as the one below.

m2 <- '
  # regressions
    read ~ 1 + ppsych + motiv
    arith ~ 1 + motiv + read + ppsych
  # covariance and variance
    read~~read
    ppsych~~ppsych
    motiv~~motiv
    arith~~arith
    motiv ~~ ppsych
    
'

Next, we fit the model to the data using sem() from lavaan. Followed by a summary, including model fit.

fit2 <- sem(m2, data=dat)
summary(fit2, fit.measures=TRUE)

## lavaan 0.6-9 ended normally after 42 iterations
## 
##   Estimator                                         ML
##   Optimization method                           NLMINB
##   Number of model parameters                        14
##                                                       
##   Number of observations                           500
##                                                       
## Model Test User Model:
##                                                       
##   Test statistic                                 0.000
##   Degrees of freedom                                 0
## 
## Model Test Baseline Model:
## 
##   Test statistic                               707.017
##   Degrees of freedom                                 6
##   P-value                                        0.000
## 
## User Model versus Baseline Model:
## 
##   Comparative Fit Index (CFI)                    1.000
##   Tucker-Lewis Index (TLI)                       1.000
## 
## Loglikelihood and Information Criteria:
## 
##   Loglikelihood user model (H0)              -7087.537
##   Loglikelihood unrestricted model (H1)      -7087.537
##                                                       
##   Akaike (AIC)                               14203.074
##   Bayesian (BIC)                             14262.078
##   Sample-size adjusted Bayesian (BIC)        14217.641
## 
## Root Mean Square Error of Approximation:
## 
##   RMSEA                                          0.000
##   90 Percent confidence interval - lower         0.000
##   90 Percent confidence interval - upper         0.000
##   P-value RMSEA <= 0.05                             NA
## 
## Standardized Root Mean Square Residual:
## 
##   SRMR                                           0.000
## 
## Parameter Estimates:
## 
##   Standard errors                             Standard
##   Information                                 Expected
##   Information saturated (h1) model          Structured
## 
## Regressions:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   read ~                                              
##     ppsych           -0.275    0.037   -7.385    0.000
##     motiv             0.461    0.037   12.404    0.000
##   arith ~                                             
##     motiv             0.300    0.033    8.993    0.000
##     read              0.597    0.035   17.004    0.000
##     ppsych            0.068    0.031    2.212    0.027
## 
## Covariances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   ppsych ~~                                           
##     motiv           -24.950    4.601   -5.423    0.000
## 
## Intercepts:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .read             -0.000    0.360   -0.000    1.000
##    .arith            -0.000    0.283   -0.000    1.000
##     ppsych           -0.000    0.447   -0.000    1.000
##     motiv             0.000    0.447    0.000    1.000
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .read             64.708    4.092   15.811    0.000
##     ppsych           99.800    6.312   15.811    0.000
##     motiv            99.800    6.312   15.811    0.000
##    .arith            39.923    2.525   15.811    0.000

Structural Regression

For the last example, we will create a structural model that relates multiple latent variables. First, we create a measurement model by defining each latent variable. Adjustment will be motivation, harmony, and stability. Risk will be verbal, negative parent psychology, and socioeconomic status. Achievement will be reading, arithmetic, and spelling. Next, we define the regression path. Achievement will be the combination of adjustment and risk.

We can illustrate this analysis using a path diagram such as the one below.

m3 <- '
# measurement model
adjust =~ motiv + harm + stabi
risk =~ verbal + ppsych + ses
achieve =~ read + arith + spell
# regressions
achieve ~ adjust + risk
'

Next, we fit the model to the data using sem() from lavaan. Followed by a summary, including model fit.

fit3 <- sem(m3, data=dat)
summary(fit3, fit.measures=TRUE)

## lavaan 0.6-9 ended normally after 130 iterations
## 
##   Estimator                                         ML
##   Optimization method                           NLMINB
##   Number of model parameters                        21
##                                                       
##   Number of observations                           500
##                                                       
## Model Test User Model:
##                                                       
##   Test statistic                               148.982
##   Degrees of freedom                                24
##   P-value (Chi-square)                           0.000
## 
## Model Test Baseline Model:
## 
##   Test statistic                              2597.972
##   Degrees of freedom                                36
##   P-value                                        0.000
## 
## User Model versus Baseline Model:
## 
##   Comparative Fit Index (CFI)                    0.951
##   Tucker-Lewis Index (TLI)                       0.927
## 
## Loglikelihood and Information Criteria:
## 
##   Loglikelihood user model (H0)             -15517.857
##   Loglikelihood unrestricted model (H1)     -15443.366
##                                                       
##   Akaike (AIC)                               31077.713
##   Bayesian (BIC)                             31166.220
##   Sample-size adjusted Bayesian (BIC)        31099.565
## 
## Root Mean Square Error of Approximation:
## 
##   RMSEA                                          0.102
##   90 Percent confidence interval - lower         0.087
##   90 Percent confidence interval - upper         0.118
##   P-value RMSEA <= 0.05                          0.000
## 
## Standardized Root Mean Square Residual:
## 
##   SRMR                                           0.041
## 
## Parameter Estimates:
## 
##   Standard errors                             Standard
##   Information                                 Expected
##   Information saturated (h1) model          Structured
## 
## Latent Variables:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   adjust =~                                           
##     motiv             1.000                           
##     harm              0.884    0.041   21.774    0.000
##     stabi             0.695    0.043   15.987    0.000
##   risk =~                                             
##     verbal            1.000                           
##     ppsych           -0.770    0.075  -10.223    0.000
##     ses               0.807    0.076   10.607    0.000
##   achieve =~                                          
##     read              1.000                           
##     arith             0.837    0.034   24.437    0.000
##     spell             0.976    0.028   34.338    0.000
## 
## Regressions:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   achieve ~                                           
##     adjust            0.375    0.046    8.085    0.000
##     risk              0.724    0.078    9.253    0.000
## 
## Covariances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   adjust ~~                                           
##     risk             32.098    4.320    7.431    0.000
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .motiv            12.870    2.852    4.512    0.000
##    .harm             31.805    2.973   10.698    0.000
##    .stabi            57.836    3.990   14.494    0.000
##    .verbal           46.239    4.788    9.658    0.000
##    .ppsych           68.033    5.068   13.425    0.000
##    .ses              64.916    4.975   13.048    0.000
##    .read             11.372    1.608    7.074    0.000
##    .arith            37.818    2.680   14.109    0.000
##    .spell            15.560    1.699    9.160    0.000
##     adjust           86.930    6.830   12.727    0.000
##     risk             53.561    6.757    7.927    0.000
##    .achieve          30.685    3.449    8.896    0.000

Conclusion

Structural equation modeling is an effective form of analysis for ecology as it can describe complex networks of observed and unobserved (latent) variables and their relationships. This was a broad crash course in SEM. If you are interested in learning more about SEMs, I recommend “taking a look under the hood.” While I have given diagrams that illustrate how SEMs work, Lavaan() uses matrices to run these analyses.

Resources

Introduction to structural equation modeling (SEM) in R with Lavaan. UCLA: Statistical Consulting Group. from https://stats.idre.ucla.edu/r/seminars/rsem/#s2h
Liu, X., Swenson, N. G., Lin, D., Mi, X., Umaña, M. N., Schmid, B., & Ma, K. (2016). Linking individual-level functional traits to tree growth in a subtropical forest. Ecology (Durham), 97(9), 2396-2405. https://doi.org/10.1002/ecy.1445
UCLA Office of Advanced Research Computing. (2021) Introduction to Structural Equation Modeling [video]. YouTube. https://www.youtube.com/watch?v=sKVFkVoYfbs

Structural Equation Modeling (SEM)

Jeremy Adkins

What is SEM

Why is it useful for ecologists?

Example

Definitions

Types of Structural Equation Models

Path Diagram Symbols

Path Diagram

SEMs with one endogenous variable

Two or more endogenous variables

Observed

Latent

Steps of an SEM

Examples

Data

Simple Linear Regression

Path Analysis

Structural Regression

Conclusion

Resources