This is a practical introduction to exploratory facotr analysis (EFA) and confirmatory factor analysis (CFA) in R. EFA is letting the data tell you what the latent structure could be, while CFA is to verify if the proposed latent structure fits the data well. I am not going to talk about the mathematical formulas. This could be a good but simple tutorial if you want to learn some basic knowledge about factor analysis.
Preliminary Work
I will use the SAQ-8 data set to illustrate how to perform exploratory factor analysis and confirmatory factor analysis in R, which is download from UCLA Statistical Consulting’s Factor Analysis seminar. In the SAQ-8 data set, the first 8 columns are 8 items, and we want to explore the latent structure of these 8 items.
library(foreign)
dat <- read.spss("SAQ.sav", to.data.frame = TRUE, use.value.labels = FALSE)
In psychometrics, we expect all items are in the same direction, especially for composite scores, so we can sum them up and get a total score. Therefore, we need to check the internal consistency, and we can use Cronbach’s \(\alpha\) to measure internal consistency. The calcualted value below is 0.46, which is not good. Usually, a value greater than 0.7 or 0.8 indicates good internal consistency. There is a warning message saying “Some items were negatively correlated with the total scale and probably should be reversed”. After reversing the items that are negatively correlated with the total score using option check.keys = TRUE
, the calculated value is 0.76 which is okay if we choose 0.7 as the benchmark. We can basically reverse these items, but since all the items are in the range 1-5, we also want to keep the range. In the following, I use 6 minus the original scores as the reversed scores.
library(psych)
alpha(dat[, 1:8])$total
## Warning in alpha(dat[, 1:8]): Some items were negatively correlated with the total scale and probably
## should be reversed.
## To do this, run the function again with the 'check.keys=TRUE' option
## Some items ( q02 q03 ) were negatively correlated with the total scale and
## probably should be reversed.
## To do this, run the function again with the 'check.keys=TRUE' option
alpha(dat[, 1:8], check.keys = TRUE)$total
## Warning in alpha(dat[, 1:8], check.keys = TRUE): Some items were negatively correlated with total scale and were automatically reversed.
## This is indicated by a negative sign for the variable name.
alpha(dat[, 1:8], check.keys = TRUE)$keys
## Warning in alpha(dat[, 1:8], check.keys = TRUE): Some items were negatively correlated with total scale and were automatically reversed.
## This is indicated by a negative sign for the variable name.
## q01 q02 q03 q04 q05 q06 q07 q08
## 1 -1 -1 1 1 1 1 1
# reverse items q02 and q03
dat[, c("q02", "q03")] <- 6 - dat[, c("q02", "q03")]
Exploratory factor analysis
At first, we want to figure out how many latent factors we need, and we can use the scree plot. The number of eigenvalues great than 1 is the maxinum number of factors we should try. For this data set, we should have at most 2 factors. I like to use the scree.plot
function in the psy
package for scree plots.
library(psy)
scree.plot(dat[, 1:8])
We hope 1) the factors can explain enough variance, 2) all items can be well explained, and 3) each item can only be explained by one facotr. For 3), we need to do rotation. There are different types of rotation, and the default oblimin ratation in function fa
in the psych
package works well. For both of the one-factor model and the two-factor model, item q02 is not well explained since the corresponding loadings in the two models are small. We can exclude q02 in the factor analysis.
fit_efa_1f <- fa(dat[, 1:8], nfactors = 1)
fit_efa_1f
## Factor Analysis using method = minres
## Call: fa(r = dat[, 1:8], nfactors = 1)
## Standardized loadings (pattern matrix) based upon correlation matrix
## MR1 h2 u2 com
## q01 0.58 0.342 0.66 1
## q02 0.23 0.055 0.95 1
## q03 0.57 0.330 0.67 1
## q04 0.67 0.443 0.56 1
## q05 0.57 0.328 0.67 1
## q06 0.49 0.237 0.76 1
## q07 0.66 0.434 0.57 1
## q08 0.48 0.232 0.77 1
##
## MR1
## SS loadings 2.4
## Proportion Var 0.3
##
## Mean item complexity = 1
## Test of the hypothesis that 1 factor is sufficient.
##
## The degrees of freedom for the null model are 28 and the objective function was 1.62 with Chi Square of 4157.28
## The degrees of freedom for the model are 20 and the objective function was 0.22
##
## The root mean square of the residuals (RMSR) is 0.06
## The df corrected root mean square of the residuals is 0.07
##
## The harmonic number of observations is 2571 with the empirical chi square 568.33 with prob < 1.3e-107
## The total number of observations was 2571 with Likelihood Chi Square = 553.96 with prob < 1.4e-104
##
## Tucker Lewis Index of factoring reliability = 0.819
## RMSEA index = 0.102 and the 90 % confidence intervals are 0.095 0.109
## BIC = 396.92
## Fit based upon off diagonal values = 0.96
## Measures of factor score adequacy
## MR1
## Correlation of (regression) scores with factors 0.89
## Multiple R square of scores with factors 0.79
## Minimum correlation of possible factor scores 0.58
fit_efa_2f <- fa(dat[, 1:8], nfactors = 2, rotate = "oblimin")
fit_efa_2f
## Factor Analysis using method = minres
## Call: fa(r = dat[, 1:8], nfactors = 2, rotate = "oblimin")
## Standardized loadings (pattern matrix) based upon correlation matrix
## MR1 MR2 h2 u2 com
## q01 0.71 -0.09 0.434 0.566 1.0
## q02 0.18 0.07 0.052 0.948 1.3
## q03 0.48 0.12 0.319 0.681 1.1
## q04 0.64 0.06 0.461 0.539 1.0
## q05 0.57 0.03 0.345 0.655 1.0
## q06 0.11 0.47 0.295 0.705 1.1
## q07 -0.01 0.96 0.912 0.088 1.0
## q08 0.45 0.05 0.236 0.764 1.0
##
## MR1 MR2
## SS loadings 1.80 1.25
## Proportion Var 0.23 0.16
## Cumulative Var 0.23 0.38
## Proportion Explained 0.59 0.41
## Cumulative Proportion 0.59 1.00
##
## With factor correlations of
## MR1 MR2
## MR1 1.00 0.58
## MR2 0.58 1.00
##
## Mean item complexity = 1.1
## Test of the hypothesis that 2 factors are sufficient.
##
## The degrees of freedom for the null model are 28 and the objective function was 1.62 with Chi Square of 4157.28
## The degrees of freedom for the model are 13 and the objective function was 0.08
##
## The root mean square of the residuals (RMSR) is 0.04
## The df corrected root mean square of the residuals is 0.06
##
## The harmonic number of observations is 2571 with the empirical chi square 247.69 with prob < 1.9e-45
## The total number of observations was 2571 with Likelihood Chi Square = 199.6 with prob < 1.7e-35
##
## Tucker Lewis Index of factoring reliability = 0.903
## RMSEA index = 0.075 and the 90 % confidence intervals are 0.066 0.084
## BIC = 97.52
## Fit based upon off diagonal values = 0.98
## Measures of factor score adequacy
## MR1 MR2
## Correlation of (regression) scores with factors 0.88 0.96
## Multiple R square of scores with factors 0.77 0.92
## Minimum correlation of possible factor scores 0.53 0.84
# exclude item q02
fit_efa_2f_v2 <- fa(dat[, c(1, 3:8)], nfactors = 2, rotate = "oblimin")
fit_efa_2f_v2
## Factor Analysis using method = minres
## Call: fa(r = dat[, c(1, 3:8)], nfactors = 2, rotate = "oblimin")
## Standardized loadings (pattern matrix) based upon correlation matrix
## MR1 MR2 h2 u2 com
## q01 0.72 -0.09 0.45 0.554 1.0
## q03 0.45 0.13 0.29 0.710 1.2
## q04 0.65 0.06 0.47 0.530 1.0
## q05 0.57 0.04 0.35 0.654 1.0
## q06 0.12 0.46 0.29 0.705 1.1
## q07 -0.01 0.97 0.93 0.072 1.0
## q08 0.46 0.05 0.24 0.756 1.0
##
## MR1 MR2
## SS loadings 1.76 1.26
## Proportion Var 0.25 0.18
## Cumulative Var 0.25 0.43
## Proportion Explained 0.58 0.42
## Cumulative Proportion 0.58 1.00
##
## With factor correlations of
## MR1 MR2
## MR1 1.00 0.57
## MR2 0.57 1.00
##
## Mean item complexity = 1.1
## Test of the hypothesis that 2 factors are sufficient.
##
## The degrees of freedom for the null model are 21 and the objective function was 1.51 with Chi Square of 3870.06
## The degrees of freedom for the model are 8 and the objective function was 0.01
##
## The root mean square of the residuals (RMSR) is 0.01
## The df corrected root mean square of the residuals is 0.02
##
## The harmonic number of observations is 2571 with the empirical chi square 13.29 with prob < 0.1
## The total number of observations was 2571 with Likelihood Chi Square = 17.13 with prob < 0.029
##
## Tucker Lewis Index of factoring reliability = 0.994
## RMSEA index = 0.021 and the 90 % confidence intervals are 0.006 0.035
## BIC = -45.69
## Fit based upon off diagonal values = 1
## Measures of factor score adequacy
## MR1 MR2
## Correlation of (regression) scores with factors 0.87 0.97
## Multiple R square of scores with factors 0.76 0.93
## Minimum correlation of possible factor scores 0.53 0.86
Confimatory factor analysis
Based on the results of EFA, we propose a one-factor model, and we want to use CFA to verify the latent structure. =~
means the latent factor can explain the following items.
library(lavaan)
mod_cfa_1f <- "f1 =~ q01 + q03 + q04 + q05 + q06 + q07 + q08"
fit_cfa_1f <- cfa(mod_cfa_1f, data = dat)
summary(fit_cfa_1f, standardize = TRUE, fit.measures = TRUE, rsquare = TRUE)
## lavaan 0.6-8 ended normally after 27 iterations
##
## Estimator ML
## Optimization method NLMINB
## Number of model parameters 14
##
## Number of observations 2571
##
## Model Test User Model:
##
## Test statistic 376.321
## Degrees of freedom 14
## P-value (Chi-square) 0.000
##
## Model Test Baseline Model:
##
## Test statistic 3876.345
## Degrees of freedom 21
## P-value 0.000
##
## User Model versus Baseline Model:
##
## Comparative Fit Index (CFI) 0.906
## Tucker-Lewis Index (TLI) 0.859
##
## Loglikelihood and Information Criteria:
##
## Loglikelihood user model (H0) -23451.722
## Loglikelihood unrestricted model (H1) -23263.562
##
## Akaike (AIC) 46931.445
## Bayesian (BIC) 47013.374
## Sample-size adjusted Bayesian (BIC) 46968.892
##
## Root Mean Square Error of Approximation:
##
## RMSEA 0.100
## 90 Percent confidence interval - lower 0.092
## 90 Percent confidence interval - upper 0.109
## P-value RMSEA <= 0.05 0.000
##
## Standardized Root Mean Square Residual:
##
## SRMR 0.049
##
## Parameter Estimates:
##
## Standard errors Standard
## Information Expected
## Information saturated (h1) model Structured
##
## Latent Variables:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## f1 =~
## q01 1.000 0.489 0.590
## q03 1.216 0.057 21.275 0.000 0.594 0.553
## q04 1.304 0.054 24.167 0.000 0.637 0.672
## q05 1.137 0.052 21.910 0.000 0.556 0.576
## q06 1.140 0.058 19.652 0.000 0.557 0.497
## q07 1.461 0.062 23.654 0.000 0.714 0.648
## q08 0.877 0.045 19.491 0.000 0.429 0.491
##
## Variances:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## .q01 0.447 0.015 30.543 0.000 0.447 0.652
## .q03 0.802 0.025 31.495 0.000 0.802 0.694
## .q04 0.493 0.018 27.636 0.000 0.493 0.548
## .q05 0.621 0.020 30.920 0.000 0.621 0.668
## .q06 0.948 0.029 32.619 0.000 0.948 0.753
## .q07 0.705 0.025 28.641 0.000 0.705 0.580
## .q08 0.577 0.018 32.710 0.000 0.577 0.758
## f1 0.239 0.016 14.561 0.000 1.000 1.000
##
## R-Square:
## Estimate
## q01 0.348
## q03 0.306
## q04 0.452
## q05 0.332
## q06 0.247
## q07 0.420
## q08 0.242
Comparative Fit Index (CFI), Tucker-Lewis Index (TLI), Root Mean Square Error of Approximation (RMSEA), and Standardized Root Mean Square Residual (SRMR) are the model fit indicices. A CFI > 0.95, a TLI > 0.90, a RMSEA < 0.05, and a SRMR < 0.05 indicate a good fit. According to the calculated model fit indices above, only SRMR is satisfied. In order to improve model fit, we can check the modification indices.
modificationindices(fit_cfa_1f, sort. = TRUE, minimum.value = 10)
Based on the modification indices, the first option is to correlate factor item q06 and item q07 by q06 ~~ q07
.
mod_cfa_1f_v2 <- "f1 =~ q01 + q03 + q04 + q05 + q06 + q07 + q08
q06 ~~ q07"
fit_cfa_1f_v2 <- cfa(mod_cfa_1f_v2, data = dat)
summary(fit_cfa_1f_v2, standardize = TRUE, fit.measures = TRUE, rsquare = TRUE)
## lavaan 0.6-8 ended normally after 27 iterations
##
## Estimator ML
## Optimization method NLMINB
## Number of model parameters 15
##
## Number of observations 2571
##
## Model Test User Model:
##
## Test statistic 66.768
## Degrees of freedom 13
## P-value (Chi-square) 0.000
##
## Model Test Baseline Model:
##
## Test statistic 3876.345
## Degrees of freedom 21
## P-value 0.000
##
## User Model versus Baseline Model:
##
## Comparative Fit Index (CFI) 0.986
## Tucker-Lewis Index (TLI) 0.977
##
## Loglikelihood and Information Criteria:
##
## Loglikelihood user model (H0) -23296.945
## Loglikelihood unrestricted model (H1) -23263.562
##
## Akaike (AIC) 46623.891
## Bayesian (BIC) 46711.672
## Sample-size adjusted Bayesian (BIC) 46664.013
##
## Root Mean Square Error of Approximation:
##
## RMSEA 0.040
## 90 Percent confidence interval - lower 0.031
## 90 Percent confidence interval - upper 0.050
## P-value RMSEA <= 0.05 0.952
##
## Standardized Root Mean Square Residual:
##
## SRMR 0.021
##
## Parameter Estimates:
##
## Standard errors Standard
## Information Expected
## Information saturated (h1) model Structured
##
## Latent Variables:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## f1 =~
## q01 1.000 0.513 0.619
## q03 1.169 0.054 21.714 0.000 0.599 0.557
## q04 1.284 0.051 25.011 0.000 0.658 0.694
## q05 1.107 0.049 22.572 0.000 0.567 0.588
## q06 0.882 0.054 16.480 0.000 0.452 0.403
## q07 1.251 0.056 22.371 0.000 0.641 0.582
## q08 0.848 0.043 19.927 0.000 0.435 0.498
##
## Covariances:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## .q06 ~~
## .q07 0.345 0.022 15.515 0.000 0.345 0.375
##
## Variances:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## .q01 0.423 0.014 29.157 0.000 0.423 0.617
## .q03 0.796 0.026 31.025 0.000 0.796 0.689
## .q04 0.466 0.018 25.824 0.000 0.466 0.518
## .q05 0.608 0.020 30.173 0.000 0.608 0.654
## .q06 1.054 0.031 33.558 0.000 1.054 0.838
## .q07 0.803 0.027 30.300 0.000 0.803 0.661
## .q08 0.572 0.018 32.332 0.000 0.572 0.752
## f1 0.263 0.017 15.230 0.000 1.000 1.000
##
## R-Square:
## Estimate
## q01 0.383
## q03 0.311
## q04 0.482
## q05 0.346
## q06 0.162
## q07 0.339
## q08 0.248
After correlating q06 and q07, the model fit is quite well. If we want to fit a two-factor CFA model, we can use the following code for example. f1 ~~ 0*f2
means the two latent factors are uncorrelated. If we want to correlate the two factors, we can change f1 ~~ 0*f2
to f1 ~~ f2
, which is also the default.
mod_cfa_2f <- "f1 =~ q01 + q03 + q04 + q05
f2 =~ q06 + q07 + q08
f1 ~~ 0*f2"
fit_cfa_2f <- cfa(mod_cfa_2f, data = dat)
After model fitting, we can plot diagrams to show the fitted latent structure. The following code provide a choice.
library(semPlot)
semPaths(fit_cfa_1f_v2, whatLabels = "std", residuals = FALSE, sizeMan = 3)