Exploratory Factor Analysis: Techniques, Missing Data, and Diagnostics, Schemes and Mind Maps of Statistics

A step-by-step guide to conducting Exploratory Factor Analysis (EFA) using SPSS. EFA is a statistical technique used to identify underlying factors that explain the correlations among a set of variables. The guide covers checking for missing data, dealing with multicollinearity and singularity, and assessing normality and linearity. It also includes instructions for conducting a principal factor analysis.

Typology: Schemes and Mind Maps

2021/2022

Uploaded on 07/05/2022

carol_78


EXPLORATORY FACTOR ANALYSIS

ORIGINALLY PRESENTED BY: DAWN HUBER FOR THE COE FACULTY RESEARCH CENTER

MODIFIED AND UPDATED FOR EPS 624/725 BY: ROBERT A. HORN

The purpose of this lesson on Exploratory Factor Analysis is to understand and apply statistical techniques to a single set of variables when the researcher is interested in discovering which variables in the set form coherent subsets that are relatively independent of one another. Variables that are correlated with one another but largely independent of other subsets of variables are combined into factors. Factors are thought to reflect underlying processes that have created the correlations among variables.

INTRODUCTION

The dataset (FACTOR.sav) that we will be using is part of a larger data set from Tabachnick and Fidell (2007). The study involved 369 middle-class, English-speaking women between the ages of 21 and 60 who completed the Bem Sex Role Inventory (BSRI). Respondents attribute traits to themselves by assigning numbers between 1 (never or almost never true of me) and 7 (always or almost always true of me) to each of the items. Forty-four items from the BSRI were selected for this research example.

DATA SCREENING

SAMPLE SIZE

A general rule of thumb is to have at least 300 cases for factor analysis. “Solutions that have several high loading marker variables (> .80) do not require such large sample sizes (about 150 cases should be sufficient) as solutions with lower loadings” (Tabachnick & Fidell, 2007, p. 613). Our data set has an adequate sample size of 369 cases.

Bryant and Yarnold (1995) state that “one’s sample should be at least five times the number of variables. The subjects-to-variables ratio should be 5 or greater. Furthermore, every analysis should be based on a minimum of 100 observations regardless of the subjects-to-variables ratio” (p. 100).
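The rules of thumb above can be checked mechanically. The following is a minimal sketch in Python (not part of the SPSS workflow); the function name `check_sample_size` and the dictionary keys are illustrative assumptions, while the thresholds (300 cases, a ratio of 5, a minimum of 100) come from the text.

```python
def check_sample_size(n_cases, n_variables):
    """Apply the sample-size rules of thumb quoted in the handout."""
    ratio = n_cases / n_variables
    return {
        "at_least_300_cases": n_cases >= 300,        # general rule of thumb
        "subjects_to_variables_ratio": round(ratio, 2),
        "ratio_at_least_5": ratio >= 5,              # Bryant and Yarnold (1995)
        "at_least_100_cases": n_cases >= 100,        # Bryant and Yarnold (1995)
    }


# 369 respondents and 44 BSRI items, as described in the introduction.
result = check_sample_size(n_cases=369, n_variables=44)
for rule, value in result.items():
    print(f"{rule}: {value}")
```

For this data set all three checks pass, which agrees with the handout's statement that the sample size is adequate.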

MISSING DATA

To check for missing data:

  1. Click Analyze, then Descriptive Statistics.
  2. Click Frequencies.
  3. Click over all 44 items to the Variable(s): box (except Subno).
  4. De-select [ ] Display frequency tables. This will produce a warning message; simply click OK.
  5. Click OK.

The first table of the output identifies missing values for each item. Scrolling across the output, you will notice that there are no missing values for this set of data. If there were missing data, you would choose one option: estimate the missing values, delete the affected cases or variables, or analyze a missing data (pairwise) correlation matrix. If the missing values follow a nonrandom pattern or the sample size is small, consider estimation, but be aware that it can lead to overfitting the data, resulting in correlations that are too high. Please refer to Tabachnick and Fidell (2007) to obtain more information about deleting and dealing with missing data.
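Outside SPSS, the same check amounts to counting missing entries per item. Here is a minimal Python sketch; the toy rows, the use of `None` as the missing-value marker, and the item names other than helpful are made-up assumptions for illustration only.

```python
def count_missing(rows, missing_marker=None):
    """Return {item name: number of missing values} for a list of dict rows."""
    counts = {}
    for row in rows:
        for item, value in row.items():
            counts.setdefault(item, 0)
            if value is missing_marker:
                counts[item] += 1
    return counts


def complete_cases(rows, missing_marker=None):
    """Keep only rows with no missing values (listwise deletion)."""
    return [row for row in rows
            if all(value is not missing_marker for value in row.values())]


# Hypothetical toy data on the 1-7 response scale; not the FACTOR.sav values.
rows = [
    {"helpful": 5, "item_b": 6, "item_c": 7},
    {"helpful": None, "item_b": 4, "item_c": 6},
    {"helpful": 6, "item_b": None, "item_c": None},
    {"helpful": 7, "item_b": 5, "item_c": 5},
]
missing = count_missing(rows)
kept = complete_cases(rows)
print(missing)
print(f"{len(kept)} of {len(rows)} cases are complete")
```

Listwise deletion is shown only because it is the simplest of the options listed above; it is not a recommendation over estimation or a pairwise matrix.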

DETECTING MULTIVARIATE OUTLIERS

For the sake of this training, we will start with an assessment of multivariate outliers. However, we would usually begin by screening for univariate outliers and checking assumptions. Many statistical methods are sensitive to outliers, so it is important to identify outliers and make decisions about what to do with them. Recall that a univariate outlier is an extreme score on one variable, whereas a multivariate outlier is an unusual combination of scores on two or more variables.

REASONS FOR OUTLIERS (TABACHNICK & FIDELL, 2007)

  1. Incorrect data entry.
  2. Failure to specify missing values in the computer syntax, so missing values are read as real data.
  3. The outlier is not a member of the population that you intended to sample.
  4. The outlier is representative of the population you intended to sample, but the population has more extreme scores than a normal distribution.

To check for multivariate outliers:

  1. Click Analyze, then Regression.
  2. Click Linear.
  3. Dependent: subno
  4. Independent(s): all remaining 44 items
  5. Click Save.
  6. Under Distances, [√] Mahalanobis.
  7. Click Continue.
  8. Click OK.
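To make the saved Mahalanobis values less of a black box, the sketch below computes a squared Mahalanobis distance for each case from the item means and the sample covariance matrix. It is an illustration in plain Python under stated assumptions: the toy data are made up, the helper names are hypothetical, and it is not claimed to reproduce SPSS's internal computation step for step.

```python
def mean_vector(data):
    n = len(data)
    return [sum(col) / n for col in zip(*data)]


def covariance_matrix(data):
    """Sample covariance matrix (divisor n - 1)."""
    n = len(data)
    means = mean_vector(data)
    p = len(means)
    cov = [[0.0] * p for _ in range(p)]
    for row in data:
        dev = [x - m for x, m in zip(row, means)]
        for i in range(p):
            for j in range(p):
                cov[i][j] += dev[i] * dev[j] / (n - 1)
    return cov


def solve(matrix, vector):
    """Solve matrix * x = vector by Gaussian elimination with partial pivoting."""
    n = len(vector)
    aug = [row[:] + [v] for row, v in zip(matrix, vector)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def mahalanobis_squared(data):
    """Squared Mahalanobis distance of every case from the centroid."""
    means = mean_vector(data)
    cov = covariance_matrix(data)
    distances = []
    for row in data:
        dev = [x - m for x, m in zip(row, means)]
        weighted = solve(cov, dev)  # equals inverse(cov) * dev
        distances.append(sum(d * w for d, w in zip(dev, weighted)))
    return distances


# Hypothetical two-variable toy data; the last case has an unusual combination.
data = [[1, 2], [2, 1], [3, 4], [4, 3], [5, 6], [9, 1]]
d2 = mahalanobis_squared(data)
most_extreme = max(range(len(d2)), key=lambda i: d2[i])
print([round(value, 2) for value in d2], "most extreme case index:", most_extreme)
```

Solving a linear system for each case avoids forming the inverse covariance matrix explicitly, which keeps the sketch short; with 44 items a numerical library would be the more practical choice.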

  1. If cases with extreme scores are considered part of the population you sampled then a way to reduce the influence of a univariate outlier is to transform the variable to change the shape of the distribution to be more normal. Tukey said you are merely reexpressing what the data have to say in other terms (Howell, 2007).
  2. Another strategy for dealing with a univariate outlier is to “assign the outlying case(s) a raw score on the offending variable that is one unit larger (or smaller) than the next most extreme score in the distribution” (Tabachnick & Fidell, 2007, p. 77).
  3. Univariate transformations and score alterations often help reduce the impact of multivariate outliers but they can still be a problem. These cases are usually deleted (Tabachnick & Fidell, 2007). All transformations, changes to scores, and deletions are reported in the results section with the rationale and with citations.

MULTICOLLINEARITY AND SINGULARITY

Multicollinearity occurs when the IVs are highly correlated. Singularity occurs when you have redundant variables.

To test for multicollinearity and singularity, use the following SPSS commands:

  1. Click Analyze, then Regression.
  2. Click Linear.
  3. Click Reset.
  4. Dependent: subno
  5. Independent(s): all 44 items. Be sure not to include MAH_.
  6. Click Statistics.
  7. [√] Collinearity diagnostics
  8. Click Continue.
  9. Click OK.

This will produce an output page. If the determinant of R and the eigenvalues associated with some factors approach 0, multicollinearity or singularity may be present. “To investigate further, look at the SMCs for each variable where it serves as DV with all other variables as IVs” (Tabachnick & Fidell, 2007, p. 614).

On the output page that follows, under Collinearity Statistics, look at the Tolerance values for each item on the test. We want the Tolerance values to be high, closer to 1.0.

Next, we want to explore the SMCs (squared multiple correlations) of each variable where it serves as DV with the rest as IVs in multiple correlation (Tabachnick & Fidell, 2007). Many programs, including SPSS, convert the SMC value for each variable to tolerance (1 – SMC) and deal with tolerance instead of SMC. Thus, we have to calculate the SMCs ourselves. Turn to the next page of this handout and, next to the tolerance values, calculate the SMCs for the first ten items (1 – Tolerance). We want the SMCs to be low, closer to .00.

If any of the SMCs are one (1), then singularity is present. If any of the SMCs are very large (i.e., near one), then multicollinearity is present (Tabachnick & Fidell, 2007).
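The conversion from tolerance to SMC is simple arithmetic, sketched below in Python. The tolerance values are hypothetical, and the `near_one` cutoff of .90 is an assumption made only so the example can label a value as "very large"; the handout does not give a numeric cutoff.

```python
def smc_from_tolerance(tolerance):
    """SMC = 1 - Tolerance, the reverse of the conversion SPSS applies."""
    return 1.0 - tolerance


def classify_smc(smc, near_one=0.90):
    """Label an SMC; near_one is an illustrative cutoff, not a published rule."""
    if smc >= 1.0:
        return "singularity"
    if smc >= near_one:
        return "possible multicollinearity"
    return "ok"


# Hypothetical tolerance values as they might be read off the Coefficients table.
tolerances = {"helpful": 0.62, "item_b": 0.55, "item_c": 0.04}
for item, tolerance in tolerances.items():
    smc = smc_from_tolerance(tolerance)
    print(f"{item}: tolerance = {tolerance:.2f}, SMC = {smc:.2f}, {classify_smc(smc)}")
```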

The tolerance and SMC values were fine for this group of data. However, if the tolerance values were too low, we would scroll down to the next table and examine the Condition Index for each dimension (row). According to Tabachnick and Fidell (2007), we do not want the Condition Index values to be greater than 30. Examine the Condition Index for all rows. As you can see, the last 25 have Condition Indexes that are greater than 30.

Because of these high Condition Indexes, you would next examine the Variance Proportions for the dimensions with a high Condition Index; these are located next to the Condition Index column. According to Tabachnick and Fidell (2007), we do not want two or more Variance Proportions greater than .50 on the same dimension.

To explain further, look at the Variance Proportions for Dimension 45. Scroll across the page and see if there are two items with Variance Proportions that are greater than .50 for Dimension 45.

Next, you have to make some decisions about multicollinearity. Because we did not find any Variance Proportions that are greater than .50, we may decide that we do not have evidence of multicollinearity. However, one can also combine evidence (explore the SMC, Tolerance values, Condition Index, and Variance Proportions) and decide if there is combined evidence of multicollinearity.

Generally, if the Condition Index and Variance Proportion values are high, then there is evidence of multicollinearity.
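The combined rule described above (a Condition Index greater than 30 together with at least two Variance Proportions greater than .50 on the same dimension) can be written as a small check. This is a sketch in Python under stated assumptions: the data structure, the function name, and every number in the example are hypothetical and are not taken from the handout's output.

```python
def flag_collinear_dimensions(dimensions, max_condition_index=30.0,
                              min_proportion=0.50):
    """Return (dimension, items) pairs that meet both criteria."""
    flagged = []
    for dim in dimensions:
        high = [item for item, proportion in dim["variance_proportions"].items()
                if proportion > min_proportion]
        if dim["condition_index"] > max_condition_index and len(high) >= 2:
            flagged.append((dim["dimension"], high))
    return flagged


# Hypothetical rows from a Collinearity Diagnostics table.
dimensions = [
    {"dimension": 44, "condition_index": 41.2,
     "variance_proportions": {"helpful": 0.31, "item_b": 0.12}},
    {"dimension": 45, "condition_index": 58.7,
     "variance_proportions": {"helpful": 0.62, "item_b": 0.57}},
]
flagged = flag_collinear_dimensions(dimensions)
print(flagged)
```

In the made-up example only the second row is flagged: the first has a high Condition Index but no pair of large Variance Proportions, which is the situation the handout reports for the real data.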

For this set of data, we have no evidence that multicollinearity or singularity exists.

Save the output as “MULTICOLLINEARITY”

NORMALITY

If Principal Factor Analysis is used descriptively, then assumptions about distributions are not essential. However, normality of variables enhances the solution (Tabachnick & Fidell, 2007).

When the number of factors is determined using statistical inference, multivariate normality is assumed. “Normality among single variables is assessed by skewness and kurtosis” (Tabachnick & Fidell, 2007, p. 613), and as such, the distributions of the 44 variables need to be examined for skewness and kurtosis.

To obtain the skewness and kurtosis of the 44 variables:

  1. Click Analyze, then Descriptive Statistics.
  2. Click Frequencies.
  3. Click Reset.
  4. Click over all 44 items to the Variable(s): box. Be sure not to include Subno and MAH_.
  5. Click Statistics.
  6. Under Dispersion, [√] all.
  7. Under Central Tendency, [√] all.
  8. Under Distribution, [√] all.
  9. Click Continue.
  10. Click Charts and select Histograms, then [√] With normal curve.
  11. Click Continue.
  12. De-select [ ] Display frequency tables.
  13. Click OK.

An output will be produced. Scroll to the top of the output to Frequencies. You will see the skewness values and their standard error values for all 44 items.

Skewness: A distribution that is not symmetric but has more cases (more of a “tail”) toward one end of the distribution than the other is said to be skewed (Norusis, 1994).

  • Value of 0 = normal
  • Positive Value = positive skew (tail going out to right)
  • Negative Value = negative skew (tail going out to left)

Divide the skewness statistic by its standard error. We want to know if this standard score value significantly departs from normality. Concern arises when the skewness statistic divided by its standard error exceeds z = ±3.29 (p < .001, two-tailed test) (Tabachnick & Fidell, 2007). To illustrate, calculate the standardized skewness of the item labeled helpful and provide the information asked for below. Keep in mind that you would do this for each of the 44 items.

“helpful”

Skewness Standard Score = Skewness Value / Std. Error =

Direction of the Skewness:

Significant Departure? (yes, no):

Scroll to the top of the output to Frequencies. You will see the kurtosis values and their standard error values for all 44 items.

Kurtosis: The relative concentration of scores in the center, the upper and lower ends (tails) and the shoulders (between the center and the tails) of a distribution (Norusis, 1994).

  • Value of 0 = mesokurtic (normal, symmetric)
  • Positive Value = leptokurtic (shape is more narrow, peaked)
  • Negative Value = platykurtic (shape is more broad, widely dispersed, flat)

Divide the kurtosis statistic by its standard error. We want to know if this standard score value significantly departs from normality. Concern arises when the kurtosis statistic divided by its standard error exceeds z = ±3.29 (p < .001, two-tailed test) (Tabachnick & Fidell, 2007). To illustrate, calculate the standardized kurtosis of the item labeled helpful and provide the information asked for below. Keep in mind that you would do this for each of the 44 items.

“helpful”

Kurtosis Standard Score = Kurtosis Value / Std. Error =

Direction of the Kurtosis:

Significant Departure? (yes, no):
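The standardized skewness and kurtosis checks described above follow the same pattern, so one Python sketch covers both. The skewness and kurtosis values for helpful used here are hypothetical placeholders, not the values from the output. The standard-error functions use the common textbook formulas that depend only on N; treating them as matching the standard errors SPSS prints is an assumption of this sketch.

```python
import math


def se_skewness(n):
    """Standard error of skewness for a sample of size n."""
    return math.sqrt(6.0 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))


def se_kurtosis(n):
    """Standard error of kurtosis for a sample of size n."""
    return 2.0 * se_skewness(n) * math.sqrt((n * n - 1) / ((n - 3) * (n + 5)))


def standard_score(statistic, std_error):
    """Statistic divided by its standard error."""
    return statistic / std_error


def significant_departure(z, critical=3.29):
    """True when |z| exceeds the two-tailed p < .001 cutoff used in the text."""
    return abs(z) > critical


n = 369
# Hypothetical statistics for the item "helpful"; read the real ones off the output.
skewness_value, kurtosis_value = -0.50, 0.30

z_skew = standard_score(skewness_value, se_skewness(n))
z_kurt = standard_score(kurtosis_value, se_kurtosis(n))
print(f"skewness z = {z_skew:.2f}, significant: {significant_departure(z_skew)}")
print(f"kurtosis z = {z_kurt:.2f}, significant: {significant_departure(z_kurt)}")
```

Because the standard errors depend only on N, they are identical for all 44 items when there are no missing values, so only the statistic changes from item to item.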

CONDUCTING A PRINCIPAL FACTOR ANALYSIS

  1. Click Analyze, then Data Reduction.
  2. Click Factor.
  3. Highlight all 44 items and click them over to the Variable(s): box. Be sure not to include Subno and MAH_.
  4. Click Descriptives.
  5. Under Statistics, [√] Univariate descriptives and [√] Initial solution (default).
  6. Under Correlation Matrix, [√] Coefficients, [√] Determinant, and [√] KMO and Bartlett’s test of sphericity.
  7. Click Continue.
  8. Click Extraction.
  9. Change Method to Principal axis factoring.
  10. Under Display, [√] Unrotated factor solution (default) and [√] Scree plot.
  11. Click Continue.
  12. Click OK.

An output will then be produced…

FACTORABILITY OF R:

Look at the Correlation Matrix produced on the output page. “A matrix that is factorable should include several sizable correlations. The expected size depends, to some extent, on N (larger sample sizes tend to produce smaller correlations), but if no correlation exceeds .30, use of FA is questionable because there is probably nothing to factor analyze” (Tabachnick & Fidell, 2007, p. 614). We want to see several correlations between items that are greater than .30.

“High bivariate correlations, however, are not ironclad proof that the correlation matrix contains factors. It is possible that the correlations are between only two variables and do not reflect underlying processes that are simultaneously affecting several variables. For this reason, it is helpful to examine matrices of partial correlations where pairwise correlations are adjusted for effects of all other variables” (Tabachnick & Fidell, 2007, p. 614).

To examine partial correlations, look on the output page and scroll down to KMO and Bartlett’s Test.

The Kaiser-Meyer-Olkin Measure of Sampling Adequacy (KMO) is an index for comparing the magnitudes of the observed correlation coefficients to the magnitudes of the partial correlation coefficients.
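The comparison that KMO makes can be sketched as the sum of squared observed correlations divided by the sum of squared observed correlations plus the sum of squared partial correlations, taken over all pairs of different variables. The Python below is a minimal illustration: the three-variable correlation matrix is hypothetical, and the helper `partial_corr_3` works only for three variables (each pair adjusted for the single remaining variable).

```python
import math


def kmo(corr, partial):
    """Overall KMO from a correlation matrix and a partial correlation matrix."""
    p = len(corr)
    pairs = [(i, j) for i in range(p) for j in range(p) if i != j]
    sum_r2 = sum(corr[i][j] ** 2 for i, j in pairs)
    sum_a2 = sum(partial[i][j] ** 2 for i, j in pairs)
    return sum_r2 / (sum_r2 + sum_a2)


def partial_corr_3(corr):
    """Partial correlations for exactly three variables."""
    partial = [[1.0] * 3 for _ in range(3)]
    for i in range(3):
        for j in range(3):
            if i != j:
                k = 3 - i - j  # index of the remaining variable
                numerator = corr[i][j] - corr[i][k] * corr[j][k]
                denominator = math.sqrt((1 - corr[i][k] ** 2) * (1 - corr[j][k] ** 2))
                partial[i][j] = numerator / denominator
    return partial


# Hypothetical correlations among three items.
corr = [
    [1.0, 0.6, 0.5],
    [0.6, 1.0, 0.4],
    [0.5, 0.4, 1.0],
]
partial = partial_corr_3(corr)
kmo_value = kmo(corr, partial)
print(f"KMO = {kmo_value:.3f}")
```

When the partial correlations are small relative to the observed correlations, the ratio approaches 1, which is the pattern expected if the correlations reflect shared underlying factors rather than isolated pairs of variables.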

Usually the scree plot slopes downward: the eigenvalue is highest for the first factor and moderate but decreasing for the next few factors before reaching small values for the last several factors.

Examine the Scree Plot on your output page…

You look for the point where the line drawn through the points changes slope. Unfortunately, the scree test is not exact; it involves judgment of where the discontinuity in eigenvalues occurs and researchers are not perfectly reliable judges (Tabachnick & Fidell, 2007).

In the example, a single straight line can comfortably fit the first four eigenvalues. After that, another line, with a noticeably different slope, best fits the remaining eight points. Therefore, there appear to be about four (4) factors in the data.

Once you have determined the number of factors by these criteria, it is important to look at the rotated loading matrix to determine the number of variables that load on each factor.

CREATING 4 FACTORS:

  1. Click Analyze, then Data Reduction.
  2. Click Factor.
  3. Click Reset.
  4. Highlight all 44 items and click them over to the Variable(s): box. Be sure not to include Subno and MAH_.
  5. Click Extraction.
  6. Change Method to Principal axis factoring.
  7. Under Extract, select Number of factors: and type in the number 4 (four).
  8. Click Continue.
  9. Click Rotation.
  10. Under Method, select Varimax.
  11. Click Continue.
  12. Click OK.

An output should be produced…

Parenthetically, we chose Varimax but it is acceptable and common to experiment with various extraction and rotation procedures before deciding upon the preferred solution (Tabachnick & Fidell, 2007).

Look at the Communalities table on your output. Under the Extraction heading, we want values to be greater than .20. Looking at the output, you can see that there are several variables below .20.

Having many communalities less than .20 indicates that the items are not loading properly on the factors. However, Tabachnick and Fidell (2007) explain that factorial purity was not a consideration in the development of the BSRI, which means that when the BSRI was developed there was no concern with items loading on certain factors.
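For an orthogonal solution, an item's communality is the sum of its squared loadings across the factors, so the .20 screen can be sketched as follows in Python. The loadings are hypothetical, and the item names other than helpful are placeholders.

```python
def communality(loadings):
    """Sum of squared loadings across factors (orthogonal solution)."""
    return sum(loading ** 2 for loading in loadings)


def low_communality_items(loading_matrix, cutoff=0.20):
    """Items whose communality falls below the cutoff used in the handout."""
    return [item for item, loadings in loading_matrix.items()
            if communality(loadings) < cutoff]


# Hypothetical loadings on four factors.
loading_matrix = {
    "helpful": [0.31, 0.52, 0.10, 0.05],
    "item_b": [0.40, 0.08, 0.12, 0.35],
    "item_c": [-0.20, 0.15, 0.10, 0.05],
}
low = low_communality_items(loading_matrix)
for item, loadings in loading_matrix.items():
    print(f"{item}: communality = {communality(loadings):.3f}")
print("below .20:", low)
```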

Next, examine the table labeled Total Variance Explained on your output.

Under Rotation Sums of Squared Loadings, you can see that the four factors have eigenvalues greater than two (2).

Finally, examine the Rotated Factor Matrix table on your output.

Factors are interpreted through their factor loadings. Tabachnick and Fidell (2007) decided to use a loading of .45 (20% variance overlap between variable and factor). Factors appear as columns and items appear as rows. Tabachnick and Fidell also recommend a minimum factor loading of .32.

The greater the loading, the more the variable is a pure measure of the factor. Comrey and Lee (1992) suggest that loadings in excess of

  • .71 (50% overlapping variance) are considered excellent,
  • .63 (40% overlapping variance) are considered very good,
  • .55 (30% overlapping variance) are considered good,
  • .45 (20% overlapping variance) are considered fair, and
  • .32 (10% overlapping variance) are considered poor.

Choice of the cutoff for size of loading to be interpreted is a matter of researcher preference (Tabachnick & Fidell, 2007).
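The percentages in the Comrey and Lee guidelines are the squared loadings (for example, .45 squared is about .20). A short Python sketch of that arithmetic and of the labels follows; the function names and the "not interpreted" label for loadings at or below .32 are assumptions made for the example.

```python
def overlapping_variance(loading):
    """Proportion of variance shared by the variable and the factor."""
    return loading ** 2


def rate_loading(loading):
    """Label a loading using the Comrey and Lee (1992) cutoffs."""
    size = abs(loading)  # the sign gives direction, not strength
    cutoffs = [(0.71, "excellent"), (0.63, "very good"), (0.55, "good"),
               (0.45, "fair"), (0.32, "poor")]
    for cutoff, label in cutoffs:
        if size > cutoff:
            return label
    return "not interpreted"


# Hypothetical loadings.
for loading in (0.75, 0.58, -0.50, 0.35, 0.10):
    print(f"{loading:+.2f}: {overlapping_variance(loading):.0%} overlap, "
          f"{rate_loading(loading)}")
```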

Look at the table below and you will see the output for the Rotated Factor Matrix. In each of the four factor columns, circle the values that exceed .45.

There should be twelve (12) items circled for Factor 1, six (6) under Factor 2, five (5) under Factor 3, and three (3) under Factor 4.

Examine the circled items and label the factors accordingly.

LABELING FACTORS:

One of the most important reasons for naming a factor is to communicate to others. The name should capsulize the substantive nature of the factor and enable others to grasp its meaning (Rummel, 1970).

The choice of factor names should be related to the basic purpose of the factor analysis. If the goal is to describe or simplify the complex interrelationships in the data, a descriptive factor label can be applied. The descriptive approach to factor naming involves selecting a label that best reflects the substance of the variables loaded highly and near zero on a factor. The factors are classificatory and names to define each category are sought (Rummel, 1970).

There are a number of considerations involved in descriptively naming factors:

  1. Those variables with zero or near-zero loadings are unrelated to the factor. In interpreting a factor, these unrelated variables should also be taken into consideration. The name should reflect what is as well as what is not involved in a factor (Rummel, 1970).
  2. The loading squared gives the variance of a variable explained by an orthogonal factor. Squaring the loadings on a factor helps determine the relative weight the variables should have in interpreting a factor (Rummel, 1970).
  3. The naming of the factors with high positive and high negative loadings should reflect this bipolarity. One term may be appropriate, as is “temperature” for a hot-cold bipolar factor. Additionally, each pole may be interpreted separately and the factor named by its opposite, e.g., hot versus cold (Rummel, 1970).
  4. Some factors that are difficult to name can be better interpreted by reversing the sign of some of the loadings. Reversing the sign for a variable has the effect of reversing the scaling (Rummel, 1970).

INTERNAL CONSISTENCY OF FACTORS

  1. Click Analyze, then Scale.
  2. Click Reliability Analysis.
  3. Click over the 44 items to the Items: box. Be sure not to include Subno and MAH_.
  4. For the Model: box, be sure that Alpha is selected.
  5. Click OK.

Interpret Cronbach’s Alpha by providing the information asked for below:

Cronbach’s Alpha For all 44 items

N of items
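For readers who want to see what the reliability procedure computes, Cronbach's alpha is k / (k − 1) times one minus the ratio of the summed item variances to the variance of the total score, where k is the number of items. The Python sketch below uses a small hypothetical data set (five respondents, three items); it illustrates the formula and is not a substitute for the SPSS output.

```python
from statistics import variance


def cronbach_alpha(item_scores):
    """item_scores: one list of item responses per respondent."""
    k = len(item_scores[0])                                  # number of items
    item_variances = [variance(column) for column in zip(*item_scores)]
    total_variance = variance([sum(row) for row in item_scores])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


# Hypothetical responses on the 1-7 scale.
scores = [
    [5, 6, 5],
    [3, 3, 4],
    [7, 6, 6],
    [2, 3, 2],
    [4, 4, 5],
]
alpha = cronbach_alpha(scores)
print(f"Cronbach's alpha = {alpha:.3f}, N of items = {len(scores[0])}")
```

The same function can be applied to the subset of items belonging to a single factor, which mirrors running the reliability analysis separately for each scale.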

FOR EACH FACTOR (SCALE)

Next, examine the internal consistency of the items which have high factor loadings on each of the four factors (i.e., > .45). These are the item loadings you circled for each of the four factors in the Rotated Factor Matrix.

  1. Click Analyze, then Scale.
  2. Click Reliability Analysis.
  3. Click Reset.
  4. Click over the items for that factor to the Items: box.
  5. For the Model: box, be sure that Alpha is selected.
  6. Click OK.

Cronbach’s Alpha For Factor 1

N of items

Do the same procedure for the next three factors and interpret Cronbach’s Alpha by providing the information asked for below:

Cronbach’s Alpha For Factor 2

N of items

Cronbach’s Alpha For Factor 3

N of items

Cronbach’s Alpha For Factor 4

N of items