The factor analysis method dates from the start of the 20th century (Spearman, 1904) and has undergone a number of developments, several calculation methods having been put forward. This method was initially used by psychometricians, but its field of application has little by little spread into many other areas, for example, geology, medicine and finance.
Today, there are two main types of factor analysis:
Exploratory factor analysis (or EFA)
Confirmatory factor analysis (or CFA)
It is EFA which will be described below and which is used by XLSTAT. It is a method which reveals the possible existence of underlying factors which give an overview of the information contained in a very large number of measured variables. The structure linking factors to variables is initially unknown and only the number of factors may be assumed.
CFA in its traditional guise uses a method identical to EFA, but the structure linking the underlying factors to the measured variables is assumed to be known. A more recent version of CFA is linked to structural equation models.
Going from p variables to k factors
Spearman's historical example, although it has been the subject of numerous criticisms and improvements, may still be used to understand the principle and use of the method. By analyzing correlations between the scores obtained by children in different subjects, Spearman hypothesized that the scores ultimately depended on a single factor, intelligence, with a residual part due to an individual, cultural or other effect.
Thus the score obtained by an individual (i) in subject (j) could be written as x(i,j) = µ + b(j)F + e(i,j), where µ is the average score in the sample studied, F the individual's level of intelligence (the underlying factor), and e(i,j) the residual.
Generalizing this structure to p subjects (the input variables) and to k underlying factors, we obtain the following model:
(1) x = µ + Lf + u
where x is a vector of dimension (p x 1), µ is the mean vector, L is the (p x k) matrix of factor loadings, and f and u are random vectors of dimensions (k x 1) and (p x 1) respectively, assumed to be independent. The elements of f are called common factors, and those of u specific factors.
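As an illustration, model (1) can be simulated numerically. The loadings below are invented for the example (p = 4 variables, k = 1 common factor) and are not taken from any real data set; the sample covariance of the simulated data should then approximate the decomposition given in equation (2):

```python
import numpy as np

rng = np.random.default_rng(0)
p, k, n = 4, 1, 10_000

mu = np.zeros(p)                                  # mean vector
L = np.array([[0.9], [0.8], [0.7], [0.6]])        # factor loadings (p x k), invented
psi = 1.0 - (L ** 2).sum(axis=1)                  # specific variances so Var(x_j) = 1

f = rng.standard_normal((n, k))                   # common factors, unit variance
u = rng.standard_normal((n, p)) * np.sqrt(psi)    # specific factors, independent of f
x = mu + f @ L.T + u                              # n observations of the p variables

# The sample covariance should approximate L L' + Psi.
S_model = L @ L.T + np.diag(psi)
S_sample = np.cov(x, rowvar=False)
print(np.abs(S_sample - S_model).max())           # small for large n
```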
If we assume that the common factors are standardized and uncorrelated (so that the covariance matrix of f is the identity), then the covariance matrix of the input variables obtained from expression (1) is written as:
(2) S = LL' + Ψ
where Ψ is the diagonal covariance matrix of the specific factors. Thus the variance of each variable can be divided into two parts: the communality, the part explained by the common factors (the sum of the squared loadings of the variable), and the specific variance (or unique variance), the part specific to the variable in question.
It can be shown that the method used to calculate the matrix L, an essential challenge in factor analysis, is independent of scale. It is therefore equivalent to work from the covariance matrix or from the correlation matrix.
The challenge of factor analysis is to find matrices L and Ψ such that equation (2) is, at least approximately, satisfied.
Note: factor analysis is sometimes grouped together with Principal Component Analysis (PCA), as PCA is a special case of factor analysis (where k, the number of factors, equals p, the number of variables). Nevertheless, these two methods are not generally used in the same context. PCA is first and foremost used to reduce the number of dimensions while preserving as much of the variability as possible, in order to obtain independent (uncorrelated) factors, or to visualize the data in a 2- or 3-dimensional space. Factor analysis, on the other hand, is used to identify a latent structure, and possibly afterwards to reduce the number of measured variables if they are redundant with respect to the latent factors.
Three methods of extracting latent factors are offered by XLSTAT:
Principal components: this method is also used in Principal Component Analysis (PCA). It is only offered here in order to allow a comparison between the results of the three methods, bearing in mind that the results from the module dedicated to PCA are more complete.
Principal factors: this method is probably the most widely used. It is an iterative method in which the communality estimates are gradually refined until they converge. The calculations stop when the maximum change in the communalities falls below a given threshold, or when a maximum number of iterations is reached. The initial communalities can be calculated by various methods.
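A minimal sketch of this iterative scheme (often called principal axis factoring) is shown below. It is an illustration, not XLSTAT's implementation; squared multiple correlations are used here as initial communalities, which is only one of the possible choices mentioned above:

```python
import numpy as np

def principal_factors(R, k, max_iter=500, tol=1e-8):
    """Iterated principal-factor extraction from a correlation matrix R.

    Sketch only: the diagonal of R is replaced by the current communality
    estimates, the k leading eigenvectors of this reduced matrix give new
    loadings, and iteration stops when the largest change in the
    communalities falls below `tol` (or after `max_iter` iterations)."""
    # Initial communalities: squared multiple correlations, 1 - 1/diag(R^-1).
    h2 = 1.0 - 1.0 / np.diag(np.linalg.inv(R))
    for _ in range(max_iter):
        Rr = R.copy()
        np.fill_diagonal(Rr, h2)                  # reduced correlation matrix
        vals, vecs = np.linalg.eigh(Rr)
        idx = np.argsort(vals)[::-1][:k]          # k largest eigenvalues
        L = vecs[:, idx] * np.sqrt(np.clip(vals[idx], 0.0, None))
        h2_new = (L ** 2).sum(axis=1)             # updated communalities
        if np.abs(h2_new - h2).max() < tol:
            return L, h2_new
        h2 = h2_new
    return L, h2

# Exact one-factor example: true communalities are 0.81, 0.64, 0.49, 0.36.
l = np.array([0.9, 0.8, 0.7, 0.6])
R = np.outer(l, l) + np.diag(1.0 - l ** 2)
L, h2 = principal_factors(R, k=1)
print(np.round(h2, 3))
```

On this exact one-factor correlation matrix, the iteration recovers the true communalities to within the tolerance.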
Maximum likelihood: this method was first put forward by Lawley (1940). The proposal to use the Newton-Raphson algorithm (an iterative method) dates from Jennrich (1969). It was subsequently improved and generalized by Jöreskog (1977). This method assumes that the input variables follow a normal distribution. The initial communalities are calculated according to the method proposed by Jöreskog (1977). As part of this method, a goodness-of-fit test is computed. The test statistic follows a Chi² distribution with (p-k)² / 2 – (p+k) / 2 degrees of freedom, where p is the number of variables and k the number of factors.
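The degrees of freedom quoted above can be packaged as a small helper (the function name is ours, for illustration only); note that the model is only testable when this quantity is positive:

```python
# df = (p - k)^2 / 2 - (p + k) / 2, as stated for the maximum likelihood test.
def lrt_degrees_of_freedom(p, k):
    return (p - k) ** 2 / 2 - (p + k) / 2

# e.g. p = 6 variables, k = 2 factors:
print(lrt_degrees_of_freedom(6, 2))  # 4.0
```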
Number of factors
Determining the number of factors to retain is one of the challenges of factor analysis. The "automatic" method offered by XLSTAT is based solely on the spectral decomposition of the correlation matrix and the detection of a threshold beyond which the contribution of information (in the sense of variability) is no longer significant.
The maximum likelihood method offers a goodness-of-fit test to help determine the appropriate number of factors. For the principal factors method, determining the number of factors is more difficult.
The Kaiser-Guttman rule suggests that only those factors with associated eigenvalues strictly greater than 1 should be kept. Another common approach is to plot the eigenvalues in decreasing order (the scree plot): the number of factors to keep then corresponds to the first turning point (elbow) found on the curve. Cross-validation methods have also been suggested for this purpose.
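The Kaiser-Guttman rule is easy to sketch. The toy correlation matrix below is an equicorrelation matrix with rho = 0.5, chosen for illustration because its eigenvalues are known exactly (1 + 3·rho = 2.5 once, and 1 − rho = 0.5 three times), so the rule retains a single factor:

```python
import numpy as np

p, rho = 4, 0.5
R = (1 - rho) * np.eye(p) + rho * np.ones((p, p))  # equicorrelation matrix

eigenvalues = np.sort(np.linalg.eigvalsh(R))[::-1]  # decreasing order
n_factors = int(np.sum(eigenvalues > 1.0))          # Kaiser-Guttman: keep lambda > 1

print(eigenvalues)  # [2.5 0.5 0.5 0.5]
print(n_factors)    # 1
```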
Anomalies (Heywood cases)
Communalities are by definition sums of squared factor loadings, and represent the share of a variable's variance explained by the common factors. They must therefore lie between 0 and 1. However, it may happen that the iterative algorithms (principal factors method or maximum likelihood method) produce solutions with communalities equal to 1 (Heywood cases) or greater than 1 (ultra-Heywood cases). These anomalies can have many causes (too many factors, too few factors, etc.). When this happens, XLSTAT sets the communalities to 1 and adjusts the elements of L accordingly.
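One way such a safeguard can be sketched is shown below: any row of loadings whose communality reaches or exceeds 1 is rescaled so that its squared loadings sum to exactly 1. This is an illustrative reading of the behaviour described above, not XLSTAT's actual code:

```python
import numpy as np

def cap_heywood(L):
    """Cap communalities at 1 by rescaling offending rows of L (sketch)."""
    L = np.asarray(L, dtype=float).copy()
    h2 = (L ** 2).sum(axis=1)                 # communalities per variable
    bad = h2 >= 1.0                           # Heywood / ultra-Heywood rows
    L[bad] /= np.sqrt(h2[bad])[:, None]       # rescale so squared sum is 1
    return L

L_fixed = cap_heywood([[0.8, 0.9],   # communality 1.45 -> ultra-Heywood case
                       [0.6, 0.5]])  # communality 0.61 -> left untouched
print((L_fixed ** 2).sum(axis=1))
```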
Rotations
Once the results have been obtained, they may be transformed in order to make them easier to interpret, for example by trying to arrange that the coordinates of the variables on the factors are either high (in absolute value) or near zero. There are two main families of rotations:
Orthogonal rotations can be used when the factors are not correlated (hence orthogonal). The methods offered by XLSTAT are Varimax, Quartimax, Equamax, Parsimax and Orthomax. Varimax rotation is the most widely used. It ensures that, for each factor, a few loadings are high while the others are close to zero. Interpretation is thus made easier since, in principle, each initial variable will be chiefly associated with one of the factors.
Oblique transformations can be used when the factors are correlated (hence oblique). The methods offered by XLSTAT are Quartimin and Oblimin.
The Promax method, also offered by XLSTAT, is a mixed procedure: it consists of an initial Varimax rotation followed by an oblique rotation, which preserves the pattern of high and low factor loadings while driving the low loadings even closer to zero.
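As an illustration of an orthogonal rotation, here is a minimal textbook sketch of Varimax (Kaiser's SVD-based algorithm); it is not XLSTAT's implementation, and the loadings used in the example are invented. Note that an orthogonal rotation leaves the communalities (row sums of squared loadings) unchanged:

```python
import numpy as np

def varimax(L, max_iter=100, tol=1e-8):
    """Minimal sketch of a Varimax rotation.

    Searches for an orthogonal matrix T that maximizes the variance of the
    squared loadings within each column of L @ T."""
    p, k = L.shape
    T = np.eye(k)
    d = 0.0
    for _ in range(max_iter):
        Lr = L @ T
        # Gradient-like update solved via SVD (Kaiser's criterion, gamma = 1).
        u, s, vt = np.linalg.svd(
            L.T @ (Lr ** 3 - Lr @ np.diag((Lr ** 2).sum(axis=0)) / p))
        T = u @ vt
        d_new = s.sum()
        if d_new < d * (1 + tol):   # criterion no longer improving
            break
        d = d_new
    return L @ T, T

L0 = np.array([[0.8, 0.3],   # invented two-factor loadings
               [0.7, 0.4],
               [0.2, 0.9],
               [0.3, 0.8]])
L_rot, T = varimax(L0)
print(np.round(L_rot, 2))
```

After rotation, each variable's loadings are pushed toward one dominant factor, which is exactly the interpretability property described above.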