Finger exercises on data – Multicollinearity and factor analysis
How multicollinearity affects a toy model under laboratory conditions, and what can be done about it
Multicollinearity is a topic that every first-year student comes across, even before multiple regression appears on the curriculum. However, as a small finger exercise, it is interesting to play with some toy models to understand exactly how regression models are affected, more interestingly, in which cases they are not affected, and what can be done about it.
Since its absence is one of the key requirements for linear regression analysis, most have in mind that regression coefficients become unreliable in the presence of multicollinearity. A fair understanding of it would be: when variables are correlated, or can be constructed as linear combinations of a subset of the independent variables, then attributing effects to individual predictors becomes difficult. However, this is not necessarily the case.
Let us make up a readable toy model. Assume that the longevity of humans is affected by five variables: their size, their weight, their weekly working hours, their income, and their tenure. The underlying idea is that obesity shortens longevity, as do long working hours, while higher income gives access to healthier food and tenure reduces stress. Let us assume all these variables affect longevity in the following way:
longevity = 50 + 0.1*size - 0.1*weight + 0.005*income - 0.5*hours + 0.05*tenure
Don’t get hung up on the details; it merely serves as a toy model.
Let us further assume that size and weight are linearly dependent, as are income, working hours and tenure:
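The original data-generation snippet is not shown here, so the following is a minimal sketch in Python with numpy; the sample size, noise levels and the exact form of the dependencies are assumptions:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 5000

# size drives weight (linear dependence plus noise)
size = rng.normal(175, 10, n)                      # height in cm
weight = 0.9 * size - 80 + rng.normal(0, 5, n)     # kg

# working hours drive both income and tenure (second dependent group)
hours = rng.normal(40, 5, n)                       # weekly working hours
income = 50 * hours + 500 + rng.normal(0, 100, n)
tenure = 0.5 * hours - 10 + rng.normal(0, 2, n)

# longevity follows the model stated above, plus observation noise
longevity = (50 + 0.1 * size - 0.1 * weight + 0.005 * income
             - 0.5 * hours + 0.05 * tenure + rng.normal(0, 1, n))
```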
The model is set up with all the linear dependencies between the variables. The regression result is then:
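The original regression call is not shown; a self-contained sketch with plain numpy least squares (the data generation repeats the assumed setup) recovers the coefficients:

```python
import numpy as np

# regenerate the assumed toy data
rng = np.random.default_rng(42)
n = 5000
size = rng.normal(175, 10, n)
weight = 0.9 * size - 80 + rng.normal(0, 5, n)
hours = rng.normal(40, 5, n)
income = 50 * hours + 500 + rng.normal(0, 100, n)
tenure = 0.5 * hours - 10 + rng.normal(0, 2, n)
longevity = (50 + 0.1 * size - 0.1 * weight + 0.005 * income
             - 0.5 * hours + 0.05 * tenure + rng.normal(0, 1, n))

# ordinary least squares with an intercept column
X = np.column_stack([np.ones(n), size, weight, income, hours, tenure])
coef, *_ = np.linalg.lstsq(X, longevity, rcond=None)
print(np.round(coef, 3))  # close to the true values [50, 0.1, -0.1, 0.005, -0.5, 0.05]
```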
The multicollinearity has not affected the outcome significantly! The regression coefficients are all within range of the original setup. The confidence intervals are also narrow, so that all coefficients are found to be significantly different from 0. However, this is to be expected:
“Multicollinearity does not reduce the predictive power or reliability of the model as a whole, at least within the sample data set; it only affects calculations regarding individual predictors.” — https://en.wikipedia.org/wiki/Multicollinearity
So what does this mean for the model? All variables above have been identified individually and correctly.
The problem becomes apparent when looking at only a part of the model, for instance when removing one variable. Keep in mind that this is a fully controlled toy model under laboratory conditions. In real cases, the “correct” set of variables might be unknown.
The model estimation without hours returns:
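The original call is not shown; in the sketch below the same assumed data are fitted without the hours column. With the assumed dependence income ≈ 50·hours + 500, income absorbs part of the (negative) effect of hours, so its coefficient shifts markedly away from the true 0.005:

```python
import numpy as np

# regenerate the assumed toy data
rng = np.random.default_rng(42)
n = 5000
size = rng.normal(175, 10, n)
weight = 0.9 * size - 80 + rng.normal(0, 5, n)
hours = rng.normal(40, 5, n)
income = 50 * hours + 500 + rng.normal(0, 100, n)
tenure = 0.5 * hours - 10 + rng.normal(0, 2, n)
longevity = (50 + 0.1 * size - 0.1 * weight + 0.005 * income
             - 0.5 * hours + 0.05 * tenure + rng.normal(0, 1, n))

# same regression, but with the hours column left out
X = np.column_stack([np.ones(n), size, weight, income, tenure])
coef, *_ = np.linalg.lstsq(X, longevity, rcond=None)
print(round(coef[3], 4))  # income coefficient, no longer close to 0.005
```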
The coefficient for income has now changed and is no longer significantly different from 0.
This would not happen in a properly setup regression model with orthogonal inputs. Consider for confirmation the following toy model:
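The original snippet is not shown; a minimal sketch with two independent inputs (concrete coefficients and noise level are assumptions) demonstrates the point:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 20000
x1 = rng.normal(0, 1, n)   # independent inputs
x2 = rng.normal(0, 1, n)
y = 1 + 2 * x1 + 3 * x2 + rng.normal(0, 0.5, n)

# fit with both inputs, then with x2 removed
full, *_ = np.linalg.lstsq(np.column_stack([np.ones(n), x1, x2]), y, rcond=None)
reduced, *_ = np.linalg.lstsq(np.column_stack([np.ones(n), x1]), y, rcond=None)
print(round(full[1], 2), round(reduced[1], 2))  # the x1 coefficient barely moves
```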
The variables x1 and x2 are drawn independently of each other. Thus the coefficient estimates do not change when one of them is removed.
Factor analysis
One remedy is to bundle the correlated variables into latent factors. The toy model above is already written in a way to arrive at meaningful factors. Let us see whether this is picked up by the algorithm:
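The original snippet is not shown; it presumably uses a standard factor-analysis routine (e.g. R’s factanal or Python’s factor_analyzer). As a self-contained sketch, a principal-component approximation of the loadings can be computed with numpy alone:

```python
import numpy as np

# regenerate the assumed toy data
rng = np.random.default_rng(42)
n = 5000
size = rng.normal(175, 10, n)
weight = 0.9 * size - 80 + rng.normal(0, 5, n)
hours = rng.normal(40, 5, n)
income = 50 * hours + 500 + rng.normal(0, 100, n)
tenure = 0.5 * hours - 10 + rng.normal(0, 2, n)

X = np.column_stack([hours, income, tenure, size, weight])
Z = (X - X.mean(axis=0)) / X.std(axis=0)       # standardize
corr = Z.T @ Z / n                             # correlation matrix

eigval, eigvec = np.linalg.eigh(corr)          # eigenvalues in ascending order
order = np.argsort(eigval)[::-1]
loadings = eigvec[:, order[:2]] * np.sqrt(eigval[order[:2]])
print(np.round(loadings, 2))   # rows: hours, income, tenure, size, weight
```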
The factor loadings show how the variance of the entire data set is distributed across the two factors. Hours, income and tenure are combined into one factor, size and weight into the second, exactly as it was designed.
Correct number of factors
In the toy model above, it is known by design how many latent variables are needed. In real-world data, this is usually not the case. One straightforward way to estimate the correct number of factors is to look at the eigenvalues of the correlation matrix. The Kaiser–Guttmann criterion states that only factors with an eigenvalue greater than one should be considered. In addition, a scree plot can be used to visualize the decaying contribution of the factors to the variance of the data set. The elbow in the scree plot also gives an indication of the maximum number of factors needed.
For the example above, the scree plot looks as follows:
The plot is created with:
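The original plotting code is not shown. The eigenvalues behind such a scree plot can be computed as below; the actual line plot would then be a single call to a plotting library such as matplotlib (an assumption about the original tooling):

```python
import numpy as np

# regenerate the assumed toy data
rng = np.random.default_rng(42)
n = 5000
size = rng.normal(175, 10, n)
weight = 0.9 * size - 80 + rng.normal(0, 5, n)
hours = rng.normal(40, 5, n)
income = 50 * hours + 500 + rng.normal(0, 100, n)
tenure = 0.5 * hours - 10 + rng.normal(0, 2, n)

Z = np.column_stack([hours, income, tenure, size, weight])
Z = (Z - Z.mean(axis=0)) / Z.std(axis=0)
eigval = np.sort(np.linalg.eigvalsh(Z.T @ Z / n))[::-1]   # descending
print(np.round(eigval, 2))
# scree plot, e.g.: plt.plot(range(1, 6), eigval, "o-")
n_factors = int((eigval > 1).sum())   # Kaiser-Guttmann criterion
print("factors with eigenvalue > 1:", n_factors)
```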
The elbow at the third factor is very clear; moreover, only two factors have eigenvalues > 1.
Multiple regression with factors
Let us continue with the regression. The factor scores are now the new independent variables. For readability, names for the new factors are introduced in the snippet below: career and bodyshape.
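The original snippet is not reproduced here; the following sketch computes the two factor scores (again via the principal-component approximation over the assumed toy data), labels them career and bodyshape, and regresses longevity on them:

```python
import numpy as np

# regenerate the assumed toy data
rng = np.random.default_rng(42)
n = 5000
size = rng.normal(175, 10, n)
weight = 0.9 * size - 80 + rng.normal(0, 5, n)
hours = rng.normal(40, 5, n)
income = 50 * hours + 500 + rng.normal(0, 100, n)
tenure = 0.5 * hours - 10 + rng.normal(0, 2, n)
longevity = (50 + 0.1 * size - 0.1 * weight + 0.005 * income
             - 0.5 * hours + 0.05 * tenure + rng.normal(0, 1, n))

# factor scores from the standardized data
Z = np.column_stack([hours, income, tenure, size, weight])
Z = (Z - Z.mean(axis=0)) / Z.std(axis=0)
eigval, eigvec = np.linalg.eigh(Z.T @ Z / n)
order = np.argsort(eigval)[::-1]
scores = Z @ eigvec[:, order[:2]]
career, bodyshape = scores[:, 0], scores[:, 1]

# regression of longevity on the two factor scores
X = np.column_stack([np.ones(n), career, bodyshape])
coef, res, *_ = np.linalg.lstsq(X, longevity, rcond=None)
r2 = 1 - res[0] / ((longevity - longevity.mean()) ** 2).sum()
print(np.round(coef, 3), round(r2, 2))
```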
The regression model is found to be highly significant, as expected.
Updating the factor model with new data
After the factors are identified and the model is fitted, the practitioner might need to update the model or predict new data with it.
Once everything is encoded in terms of latent variables, this is a bit tricky, since the latent variables are not directly observed. When new data (income, tenure, size, weight, hours) becomes available, it needs to be translated into the latent variables career and bodyshape before it can be fed into the regression equation.
The following function does this task:
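The original function is not reproduced here; the sketch below (function name and signature are made up) stores the training means, standard deviations and score weights, and applies them to new observations:

```python
import numpy as np

# regenerate the assumed toy training data
rng = np.random.default_rng(42)
n = 5000
size = rng.normal(175, 10, n)
weight = 0.9 * size - 80 + rng.normal(0, 5, n)
hours = rng.normal(40, 5, n)
income = 50 * hours + 500 + rng.normal(0, 100, n)
tenure = 0.5 * hours - 10 + rng.normal(0, 2, n)

X_train = np.column_stack([hours, income, tenure, size, weight])
mean, std = X_train.mean(axis=0), X_train.std(axis=0)
Z = (X_train - mean) / std
eigval, eigvec = np.linalg.eigh(Z.T @ Z / n)
weights = eigvec[:, np.argsort(eigval)[::-1][:2]]   # score weights, 2 factors

def factor_scores(new_X, mean, std, weights):
    """Translate raw observations (hours, income, tenure, size, weight)
    into the latent scores career and bodyshape, using the
    means/stds and weights estimated on the training data."""
    return (np.asarray(new_X) - mean) / std @ weights

# called on the training data, it reproduces the original scores
scores = factor_scores(X_train, mean, std, weights)
```

Replacing X_train in the last call with an array (or data frame) of new observations yields scores that can be fed directly into the regression equation.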
To check whether it works properly, a function call to
reproduces the factor scores as before. For real-world applications, simply replace the second argument with a data frame of new data, and the function returns scores that can be fed into the regression equation.