Factor Analysis in R

Posted by Jinjian Mu on April 16, 2021

This is a practical introduction to exploratory facotr analysis (EFA) and confirmatory factor analysis (CFA) in R. EFA is letting the data tell you what the latent structure could be, while CFA is to verify if the proposed latent structure fits the data well. I am not going to talk about the mathematical formulas. This could be a good but simple tutorial if you want to learn some basic knowledge about factor analysis.

Preliminary Work

I will use the SAQ-8 data set to illustrate how to perform exploratory factor analysis and confirmatory factor analysis in R, which is download from UCLA Statistical Consulting’s Factor Analysis seminar. In the SAQ-8 data set, the first 8 columns are 8 items, and we want to explore the latent structure of these 8 items.

library(foreign)
dat <- read.spss("SAQ.sav", to.data.frame = TRUE, use.value.labels = FALSE)

In psychometrics, we expect all items are in the same direction, especially for composite scores, so we can sum them up and get a total score. Therefore, we need to check the internal consistency, and we can use Cronbach’s \(\alpha\) to measure internal consistency. The calcualted value below is 0.46, which is not good. Usually, a value greater than 0.7 or 0.8 indicates good internal consistency. There is a warning message saying “Some items were negatively correlated with the total scale and probably should be reversed”. After reversing the items that are negatively correlated with the total score using option check.keys = TRUE, the calculated value is 0.76 which is okay if we choose 0.7 as the benchmark. We can basically reverse these items, but since all the items are in the range 1-5, we also want to keep the range. In the following, I use 6 minus the original scores as the reversed scores.

library(psych)
alpha(dat[, 1:8])$total
## Warning in alpha(dat[, 1:8]): Some items were negatively correlated with the total scale and probably 
## should be reversed.  
## To do this, run the function again with the 'check.keys=TRUE' option
## Some items ( q02 q03 ) were negatively correlated with the total scale and 
## probably should be reversed.  
## To do this, run the function again with the 'check.keys=TRUE' option
alpha(dat[, 1:8], check.keys = TRUE)$total
## Warning in alpha(dat[, 1:8], check.keys = TRUE): Some items were negatively correlated with total scale and were automatically reversed.
##  This is indicated by a negative sign for the variable name.
alpha(dat[, 1:8], check.keys = TRUE)$keys
## Warning in alpha(dat[, 1:8], check.keys = TRUE): Some items were negatively correlated with total scale and were automatically reversed.
##  This is indicated by a negative sign for the variable name.
## q01 q02 q03 q04 q05 q06 q07 q08 
##   1  -1  -1   1   1   1   1   1
# reverse items q02 and q03
dat[, c("q02", "q03")] <- 6 - dat[, c("q02", "q03")]

Exploratory factor analysis

At first, we want to figure out how many latent factors we need, and we can use the scree plot. The number of eigenvalues great than 1 is the maxinum number of factors we should try. For this data set, we should have at most 2 factors. I like to use the scree.plot function in the psy package for scree plots.

library(psy)
scree.plot(dat[, 1:8])

We hope 1) the factors can explain enough variance, 2) all items can be well explained, and 3) each item can only be explained by one facotr. For 3), we need to do rotation. There are different types of rotation, and the default oblimin ratation in function fa in the psych package works well. For both of the one-factor model and the two-factor model, item q02 is not well explained since the corresponding loadings in the two models are small. We can exclude q02 in the factor analysis.

fit_efa_1f <- fa(dat[, 1:8], nfactors = 1)
fit_efa_1f
## Factor Analysis using method =  minres
## Call: fa(r = dat[, 1:8], nfactors = 1)
## Standardized loadings (pattern matrix) based upon correlation matrix
##      MR1    h2   u2 com
## q01 0.58 0.342 0.66   1
## q02 0.23 0.055 0.95   1
## q03 0.57 0.330 0.67   1
## q04 0.67 0.443 0.56   1
## q05 0.57 0.328 0.67   1
## q06 0.49 0.237 0.76   1
## q07 0.66 0.434 0.57   1
## q08 0.48 0.232 0.77   1
## 
##                MR1
## SS loadings    2.4
## Proportion Var 0.3
## 
## Mean item complexity =  1
## Test of the hypothesis that 1 factor is sufficient.
## 
## The degrees of freedom for the null model are  28  and the objective function was  1.62 with Chi Square of  4157.28
## The degrees of freedom for the model are 20  and the objective function was  0.22 
## 
## The root mean square of the residuals (RMSR) is  0.06 
## The df corrected root mean square of the residuals is  0.07 
## 
## The harmonic number of observations is  2571 with the empirical chi square  568.33  with prob <  1.3e-107 
## The total number of observations was  2571  with Likelihood Chi Square =  553.96  with prob <  1.4e-104 
## 
## Tucker Lewis Index of factoring reliability =  0.819
## RMSEA index =  0.102  and the 90 % confidence intervals are  0.095 0.109
## BIC =  396.92
## Fit based upon off diagonal values = 0.96
## Measures of factor score adequacy             
##                                                    MR1
## Correlation of (regression) scores with factors   0.89
## Multiple R square of scores with factors          0.79
## Minimum correlation of possible factor scores     0.58
fit_efa_2f <- fa(dat[, 1:8], nfactors = 2, rotate = "oblimin")
fit_efa_2f
## Factor Analysis using method =  minres
## Call: fa(r = dat[, 1:8], nfactors = 2, rotate = "oblimin")
## Standardized loadings (pattern matrix) based upon correlation matrix
##       MR1   MR2    h2    u2 com
## q01  0.71 -0.09 0.434 0.566 1.0
## q02  0.18  0.07 0.052 0.948 1.3
## q03  0.48  0.12 0.319 0.681 1.1
## q04  0.64  0.06 0.461 0.539 1.0
## q05  0.57  0.03 0.345 0.655 1.0
## q06  0.11  0.47 0.295 0.705 1.1
## q07 -0.01  0.96 0.912 0.088 1.0
## q08  0.45  0.05 0.236 0.764 1.0
## 
##                        MR1  MR2
## SS loadings           1.80 1.25
## Proportion Var        0.23 0.16
## Cumulative Var        0.23 0.38
## Proportion Explained  0.59 0.41
## Cumulative Proportion 0.59 1.00
## 
##  With factor correlations of 
##      MR1  MR2
## MR1 1.00 0.58
## MR2 0.58 1.00
## 
## Mean item complexity =  1.1
## Test of the hypothesis that 2 factors are sufficient.
## 
## The degrees of freedom for the null model are  28  and the objective function was  1.62 with Chi Square of  4157.28
## The degrees of freedom for the model are 13  and the objective function was  0.08 
## 
## The root mean square of the residuals (RMSR) is  0.04 
## The df corrected root mean square of the residuals is  0.06 
## 
## The harmonic number of observations is  2571 with the empirical chi square  247.69  with prob <  1.9e-45 
## The total number of observations was  2571  with Likelihood Chi Square =  199.6  with prob <  1.7e-35 
## 
## Tucker Lewis Index of factoring reliability =  0.903
## RMSEA index =  0.075  and the 90 % confidence intervals are  0.066 0.084
## BIC =  97.52
## Fit based upon off diagonal values = 0.98
## Measures of factor score adequacy             
##                                                    MR1  MR2
## Correlation of (regression) scores with factors   0.88 0.96
## Multiple R square of scores with factors          0.77 0.92
## Minimum correlation of possible factor scores     0.53 0.84
# exclude item q02
fit_efa_2f_v2 <- fa(dat[, c(1, 3:8)], nfactors = 2, rotate = "oblimin")
fit_efa_2f_v2
## Factor Analysis using method =  minres
## Call: fa(r = dat[, c(1, 3:8)], nfactors = 2, rotate = "oblimin")
## Standardized loadings (pattern matrix) based upon correlation matrix
##       MR1   MR2   h2    u2 com
## q01  0.72 -0.09 0.45 0.554 1.0
## q03  0.45  0.13 0.29 0.710 1.2
## q04  0.65  0.06 0.47 0.530 1.0
## q05  0.57  0.04 0.35 0.654 1.0
## q06  0.12  0.46 0.29 0.705 1.1
## q07 -0.01  0.97 0.93 0.072 1.0
## q08  0.46  0.05 0.24 0.756 1.0
## 
##                        MR1  MR2
## SS loadings           1.76 1.26
## Proportion Var        0.25 0.18
## Cumulative Var        0.25 0.43
## Proportion Explained  0.58 0.42
## Cumulative Proportion 0.58 1.00
## 
##  With factor correlations of 
##      MR1  MR2
## MR1 1.00 0.57
## MR2 0.57 1.00
## 
## Mean item complexity =  1.1
## Test of the hypothesis that 2 factors are sufficient.
## 
## The degrees of freedom for the null model are  21  and the objective function was  1.51 with Chi Square of  3870.06
## The degrees of freedom for the model are 8  and the objective function was  0.01 
## 
## The root mean square of the residuals (RMSR) is  0.01 
## The df corrected root mean square of the residuals is  0.02 
## 
## The harmonic number of observations is  2571 with the empirical chi square  13.29  with prob <  0.1 
## The total number of observations was  2571  with Likelihood Chi Square =  17.13  with prob <  0.029 
## 
## Tucker Lewis Index of factoring reliability =  0.994
## RMSEA index =  0.021  and the 90 % confidence intervals are  0.006 0.035
## BIC =  -45.69
## Fit based upon off diagonal values = 1
## Measures of factor score adequacy             
##                                                    MR1  MR2
## Correlation of (regression) scores with factors   0.87 0.97
## Multiple R square of scores with factors          0.76 0.93
## Minimum correlation of possible factor scores     0.53 0.86

Confimatory factor analysis

Based on the results of EFA, we propose a one-factor model, and we want to use CFA to verify the latent structure. =~ means the latent factor can explain the following items.

library(lavaan)
mod_cfa_1f <- "f1 =~ q01 + q03 + q04 + q05 + q06 + q07 + q08"
fit_cfa_1f <- cfa(mod_cfa_1f, data = dat)
summary(fit_cfa_1f, standardize = TRUE, fit.measures = TRUE, rsquare = TRUE)
## lavaan 0.6-8 ended normally after 27 iterations
## 
##   Estimator                                         ML
##   Optimization method                           NLMINB
##   Number of model parameters                        14
##                                                       
##   Number of observations                          2571
##                                                       
## Model Test User Model:
##                                                       
##   Test statistic                               376.321
##   Degrees of freedom                                14
##   P-value (Chi-square)                           0.000
## 
## Model Test Baseline Model:
## 
##   Test statistic                              3876.345
##   Degrees of freedom                                21
##   P-value                                        0.000
## 
## User Model versus Baseline Model:
## 
##   Comparative Fit Index (CFI)                    0.906
##   Tucker-Lewis Index (TLI)                       0.859
## 
## Loglikelihood and Information Criteria:
## 
##   Loglikelihood user model (H0)             -23451.722
##   Loglikelihood unrestricted model (H1)     -23263.562
##                                                       
##   Akaike (AIC)                               46931.445
##   Bayesian (BIC)                             47013.374
##   Sample-size adjusted Bayesian (BIC)        46968.892
## 
## Root Mean Square Error of Approximation:
## 
##   RMSEA                                          0.100
##   90 Percent confidence interval - lower         0.092
##   90 Percent confidence interval - upper         0.109
##   P-value RMSEA <= 0.05                          0.000
## 
## Standardized Root Mean Square Residual:
## 
##   SRMR                                           0.049
## 
## Parameter Estimates:
## 
##   Standard errors                             Standard
##   Information                                 Expected
##   Information saturated (h1) model          Structured
## 
## Latent Variables:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##   f1 =~                                                                 
##     q01               1.000                               0.489    0.590
##     q03               1.216    0.057   21.275    0.000    0.594    0.553
##     q04               1.304    0.054   24.167    0.000    0.637    0.672
##     q05               1.137    0.052   21.910    0.000    0.556    0.576
##     q06               1.140    0.058   19.652    0.000    0.557    0.497
##     q07               1.461    0.062   23.654    0.000    0.714    0.648
##     q08               0.877    0.045   19.491    0.000    0.429    0.491
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##    .q01               0.447    0.015   30.543    0.000    0.447    0.652
##    .q03               0.802    0.025   31.495    0.000    0.802    0.694
##    .q04               0.493    0.018   27.636    0.000    0.493    0.548
##    .q05               0.621    0.020   30.920    0.000    0.621    0.668
##    .q06               0.948    0.029   32.619    0.000    0.948    0.753
##    .q07               0.705    0.025   28.641    0.000    0.705    0.580
##    .q08               0.577    0.018   32.710    0.000    0.577    0.758
##     f1                0.239    0.016   14.561    0.000    1.000    1.000
## 
## R-Square:
##                    Estimate
##     q01               0.348
##     q03               0.306
##     q04               0.452
##     q05               0.332
##     q06               0.247
##     q07               0.420
##     q08               0.242

Comparative Fit Index (CFI), Tucker-Lewis Index (TLI), Root Mean Square Error of Approximation (RMSEA), and Standardized Root Mean Square Residual (SRMR) are the model fit indicices. A CFI > 0.95, a TLI > 0.90, a RMSEA < 0.05, and a SRMR < 0.05 indicate a good fit. According to the calculated model fit indices above, only SRMR is satisfied. In order to improve model fit, we can check the modification indices.

modificationindices(fit_cfa_1f, sort. = TRUE, minimum.value = 10)

Based on the modification indices, the first option is to correlate factor item q06 and item q07 by q06 ~~ q07.

mod_cfa_1f_v2 <- "f1 =~ q01 + q03 + q04 + q05 + q06 + q07 + q08
                  q06 ~~ q07"
fit_cfa_1f_v2 <- cfa(mod_cfa_1f_v2, data = dat)
summary(fit_cfa_1f_v2, standardize = TRUE, fit.measures = TRUE, rsquare = TRUE)
## lavaan 0.6-8 ended normally after 27 iterations
## 
##   Estimator                                         ML
##   Optimization method                           NLMINB
##   Number of model parameters                        15
##                                                       
##   Number of observations                          2571
##                                                       
## Model Test User Model:
##                                                       
##   Test statistic                                66.768
##   Degrees of freedom                                13
##   P-value (Chi-square)                           0.000
## 
## Model Test Baseline Model:
## 
##   Test statistic                              3876.345
##   Degrees of freedom                                21
##   P-value                                        0.000
## 
## User Model versus Baseline Model:
## 
##   Comparative Fit Index (CFI)                    0.986
##   Tucker-Lewis Index (TLI)                       0.977
## 
## Loglikelihood and Information Criteria:
## 
##   Loglikelihood user model (H0)             -23296.945
##   Loglikelihood unrestricted model (H1)     -23263.562
##                                                       
##   Akaike (AIC)                               46623.891
##   Bayesian (BIC)                             46711.672
##   Sample-size adjusted Bayesian (BIC)        46664.013
## 
## Root Mean Square Error of Approximation:
## 
##   RMSEA                                          0.040
##   90 Percent confidence interval - lower         0.031
##   90 Percent confidence interval - upper         0.050
##   P-value RMSEA <= 0.05                          0.952
## 
## Standardized Root Mean Square Residual:
## 
##   SRMR                                           0.021
## 
## Parameter Estimates:
## 
##   Standard errors                             Standard
##   Information                                 Expected
##   Information saturated (h1) model          Structured
## 
## Latent Variables:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##   f1 =~                                                                 
##     q01               1.000                               0.513    0.619
##     q03               1.169    0.054   21.714    0.000    0.599    0.557
##     q04               1.284    0.051   25.011    0.000    0.658    0.694
##     q05               1.107    0.049   22.572    0.000    0.567    0.588
##     q06               0.882    0.054   16.480    0.000    0.452    0.403
##     q07               1.251    0.056   22.371    0.000    0.641    0.582
##     q08               0.848    0.043   19.927    0.000    0.435    0.498
## 
## Covariances:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##  .q06 ~~                                                                
##    .q07               0.345    0.022   15.515    0.000    0.345    0.375
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##    .q01               0.423    0.014   29.157    0.000    0.423    0.617
##    .q03               0.796    0.026   31.025    0.000    0.796    0.689
##    .q04               0.466    0.018   25.824    0.000    0.466    0.518
##    .q05               0.608    0.020   30.173    0.000    0.608    0.654
##    .q06               1.054    0.031   33.558    0.000    1.054    0.838
##    .q07               0.803    0.027   30.300    0.000    0.803    0.661
##    .q08               0.572    0.018   32.332    0.000    0.572    0.752
##     f1                0.263    0.017   15.230    0.000    1.000    1.000
## 
## R-Square:
##                    Estimate
##     q01               0.383
##     q03               0.311
##     q04               0.482
##     q05               0.346
##     q06               0.162
##     q07               0.339
##     q08               0.248

After correlating q06 and q07, the model fit is quite well. If we want to fit a two-factor CFA model, we can use the following code for example. f1 ~~ 0*f2 means the two latent factors are uncorrelated. If we want to correlate the two factors, we can change f1 ~~ 0*f2 to f1 ~~ f2, which is also the default.

mod_cfa_2f <- "f1 =~ q01 + q03 + q04 + q05
               f2 =~ q06 + q07 + q08
               f1 ~~ 0*f2"
fit_cfa_2f <- cfa(mod_cfa_2f, data = dat)

After model fitting, we can plot diagrams to show the fitted latent structure. The following code provide a choice.

library(semPlot)
semPaths(fit_cfa_1f_v2, whatLabels = "std", residuals = FALSE, sizeMan = 3)