Finger exercises on data – Multicollinearity and factor analysis

How does multicollinearity affect a toy model under laboratory conditions, and what can be done about it?



Let us make up a readable toy model. Assume that human longevity is affected by five variables: size, weight, weekly working hours, income, and tenure. The underlying idea is that obesity shortens longevity, as do long working hours, while higher income gives access to healthier food and tenure reduces stress. Let’s assume these variables affect longevity in the following way:

longevity = 50 + 0.1*size - 0.1*weight + 0.005*income - 0.5*hours + 0.05*tenure

Don’t get hung up on the details; this merely serves as a toy model.

Let us further assume that size and weight are linearly related, as are income, working hours and tenure:
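A minimal sketch of how such data could be simulated. The function name `make_toy_data`, the sample size, and every slope and noise level are assumptions made for illustration; only the longevity equation comes from the text above.

```python
import numpy as np

def make_toy_data(n=5000, seed=42):
    """Simulate the toy data; all dependency strengths are assumed values."""
    rng = np.random.default_rng(seed)
    size = rng.normal(175, 10, n)                    # body height in cm
    weight = 0.5 * size - 15 + rng.normal(0, 3, n)   # linear in size plus noise
    hours = rng.normal(40, 5, n)                     # weekly working hours
    income = 20 * hours + rng.normal(0, 60, n)       # linear in hours plus noise
    tenure = 0.25 * hours + rng.normal(0, 1, n)      # linear in hours plus noise
    # Longevity equation from the text, with assumed unit-variance noise
    longevity = (50 + 0.1 * size - 0.1 * weight + 0.005 * income
                 - 0.5 * hours + 0.05 * tenure + rng.normal(0, 1, n))
    return {"size": size, "weight": weight, "income": income,
            "hours": hours, "tenure": tenure, "longevity": longevity}

data = make_toy_data()
names = ["size", "weight", "income", "hours", "tenure"]
corr = np.corrcoef([data[k] for k in names])  # rows and columns follow `names`
print(names)
print(np.round(corr, 2))
```

The printed correlation matrix should show two tightly correlated groups (size with weight; income, hours and tenure with each other) and near-zero correlation between the groups.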

The model is set up with all the linear dependencies between the variables. The regression result is then:
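As a hedged illustration, the full regression can be sketched with ordinary least squares in NumPy. The data-generating parameters are assumptions, so the numbers will not match the output of the original run; the point is only that all five slopes are recovered despite the correlated inputs.

```python
import numpy as np

# Simulated toy data; dependency slopes and noise levels are assumed values
rng = np.random.default_rng(42)
n = 5000
size = rng.normal(175, 10, n)
weight = 0.5 * size - 15 + rng.normal(0, 3, n)
hours = rng.normal(40, 5, n)
income = 20 * hours + rng.normal(0, 60, n)
tenure = 0.25 * hours + rng.normal(0, 1, n)
longevity = (50 + 0.1 * size - 0.1 * weight + 0.005 * income
             - 0.5 * hours + 0.05 * tenure + rng.normal(0, 1, n))

# Ordinary least squares with an intercept column
X = np.column_stack([np.ones(n), size, weight, income, hours, tenure])
coef, *_ = np.linalg.lstsq(X, longevity, rcond=None)

# Classical standard errors: sqrt(diag(sigma^2 * (X'X)^-1))
resid = longevity - X @ coef
sigma2 = resid @ resid / (n - X.shape[1])
se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))

labels = ["intercept", "size", "weight", "income", "hours", "tenure"]
for name, b, s in zip(labels, coef, se):
    print(f"{name:>9}: {b:9.4f}   95% CI [{b - 1.96 * s:9.4f}, {b + 1.96 * s:9.4f}]")
```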

The multicollinearity has not affected the outcome significantly! The regression coefficients are all close to the values of the original setup. The confidence intervals are also narrow enough that all coefficients are found to be significantly different from 0. This is to be expected, however:

“Multicollinearity does not reduce the predictive power or reliability of the model as a whole, at least within the sample data set; it only affects calculations regarding individual predictors.” —

So what does this mean for the model? Above, all variables were individually identified correctly.

The problem becomes apparent when looking at only a part of the model, for instance when removing one variable. Keep in mind that this is a fully controlled toy model under laboratory conditions. In real cases, the “correct” set of variables might be unknown.

The model estimation without hours returns:
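A sketch of this comparison under assumed data-generating parameters; the helper `ols` is hypothetical. With these particular assumptions the income coefficient moves far away from its true value of 0.005 once hours is dropped, because income now partly stands in for the omitted variable. Whether it ends up statistically insignificant depends on how strong the dependencies are.

```python
import numpy as np

# Simulated toy data; dependency slopes and noise levels are assumed values
rng = np.random.default_rng(42)
n = 5000
size = rng.normal(175, 10, n)
weight = 0.5 * size - 15 + rng.normal(0, 3, n)
hours = rng.normal(40, 5, n)
income = 20 * hours + rng.normal(0, 60, n)
tenure = 0.25 * hours + rng.normal(0, 1, n)
longevity = (50 + 0.1 * size - 0.1 * weight + 0.005 * income
             - 0.5 * hours + 0.05 * tenure + rng.normal(0, 1, n))

def ols(y, *cols):
    """Least-squares coefficients; index 0 is the intercept."""
    X = np.column_stack([np.ones(len(y)), *cols])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

full = ols(longevity, size, weight, income, hours, tenure)
reduced = ols(longevity, size, weight, income, tenure)  # hours omitted

income_full = full[3]        # position of income in both models
income_reduced = reduced[3]
print(f"income coefficient, full model:    {income_full:.4f}")
print(f"income coefficient, without hours: {income_reduced:.4f}")
```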

The coefficient for income has now changed and is no longer significantly different from 0.

This would not happen in a properly set up regression model with orthogonal inputs. For confirmation, consider the following toy model:
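One possible version of such a control model, with two independent inputs. The true coefficients (1, 2, 3) and the helper `ols` are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
x1 = rng.normal(0, 1, n)   # independent of x2 by construction
x2 = rng.normal(0, 1, n)
y = 1.0 + 2.0 * x1 + 3.0 * x2 + rng.normal(0, 1, n)  # assumed true model

def ols(y, *cols):
    """Least-squares coefficients; index 0 is the intercept."""
    X = np.column_stack([np.ones(len(y)), *cols])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

b_full = ols(y, x1, x2)
b_without_x2 = ols(y, x1)

print(f"x1 coefficient with x2:    {b_full[1]:.3f}")
print(f"x1 coefficient without x2: {b_without_x2[1]:.3f}")
```

Dropping x2 inflates the residual variance, but the estimate for x1 stays close to its true value because x1 carries no information about x2.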

The variables x1 and x2 are i.i.d. and therefore uncorrelated, so the coefficient estimate of one barely changes when the other is removed.

Factor analysis

The toy model above is already written in a way that leads to meaningful factors. Let us see whether the algorithm picks this up:
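The routine used originally is not shown here, so the following is a simplified stand-in: unrotated principal-component loadings taken from the correlation matrix. A dedicated factor-analysis routine with rotation may give somewhat different numbers. The data-generating parameters are again assumptions.

```python
import numpy as np

# Simulated toy data; dependency slopes and noise levels are assumed values
rng = np.random.default_rng(42)
n = 5000
size = rng.normal(175, 10, n)
weight = 0.5 * size - 15 + rng.normal(0, 3, n)
hours = rng.normal(40, 5, n)
income = 20 * hours + rng.normal(0, 60, n)
tenure = 0.25 * hours + rng.normal(0, 1, n)

names = ["hours", "income", "tenure", "size", "weight"]
X = np.column_stack([hours, income, tenure, size, weight])

# Eigen-decomposition of the correlation matrix (eigh returns ascending order)
R = np.corrcoef(X, rowvar=False)
eigval, eigvec = np.linalg.eigh(R)
top = np.argsort(eigval)[::-1][:2]          # indices of the two largest eigenvalues

# Loadings: eigenvectors scaled by the square root of their eigenvalues
loadings = eigvec[:, top] * np.sqrt(eigval[top])

for name, row in zip(names, loadings):
    print(f"{name:>7}: factor 1 = {row[0]: .2f}, factor 2 = {row[1]: .2f}")
```

The sign of each column is arbitrary, so only the magnitudes of the loadings should be compared.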

The factor loadings show how the variance of the data set is distributed across the two factors. Hours, income and tenure are combined into one factor, size and weight into the second, exactly as designed.

Correct number of factors

For the example above, the Scree plot looks as follows:

Scree plot of eigenvalues

The plot is created with:

The elbow at the third factor is very clear; in addition, only two factors have eigenvalues > 1.
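The numbers behind a scree plot are simply the eigenvalues of the correlation matrix in descending order. The sketch below computes them for simulated data (all data-generating parameters are assumptions) and applies the eigenvalue-greater-than-one rule; plotting the values against their rank with any plotting library gives the scree plot itself.

```python
import numpy as np

# Simulated toy data; dependency slopes and noise levels are assumed values
rng = np.random.default_rng(42)
n = 5000
size = rng.normal(175, 10, n)
weight = 0.5 * size - 15 + rng.normal(0, 3, n)
hours = rng.normal(40, 5, n)
income = 20 * hours + rng.normal(0, 60, n)
tenure = 0.25 * hours + rng.normal(0, 1, n)

X = np.column_stack([hours, income, tenure, size, weight])
R = np.corrcoef(X, rowvar=False)

# eigvalsh returns ascending eigenvalues of a symmetric matrix; reverse them
eigenvalues = np.linalg.eigvalsh(R)[::-1]
n_factors = int(np.sum(eigenvalues > 1.0))   # eigenvalue > 1 criterion

for rank, value in enumerate(eigenvalues, start=1):
    print(f"factor {rank}: eigenvalue = {value:.3f}")
print(f"factors with eigenvalue > 1: {n_factors}")
```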

Multiple regression with factors
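A hedged sketch of this step: the two factor scores replace the five original variables as regressors. The scores here are standardized principal-component scores, used as a simple stand-in for scores from a dedicated factor-analysis routine, and the data-generating parameters are assumptions.

```python
import numpy as np

# Simulated toy data; dependency slopes and noise levels are assumed values
rng = np.random.default_rng(42)
n = 5000
size = rng.normal(175, 10, n)
weight = 0.5 * size - 15 + rng.normal(0, 3, n)
hours = rng.normal(40, 5, n)
income = 20 * hours + rng.normal(0, 60, n)
tenure = 0.25 * hours + rng.normal(0, 1, n)
longevity = (50 + 0.1 * size - 0.1 * weight + 0.005 * income
             - 0.5 * hours + 0.05 * tenure + rng.normal(0, 1, n))

# Factor scores: standardized data projected onto the two leading components
X = np.column_stack([hours, income, tenure, size, weight])
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
eigval, eigvec = np.linalg.eigh(np.corrcoef(X, rowvar=False))
top = np.argsort(eigval)[::-1][:2]
scores = Z @ eigvec[:, top] / np.sqrt(eigval[top])   # unit-variance scores

# Regress longevity on the two scores plus an intercept
A = np.column_stack([np.ones(n), scores])
coef, *_ = np.linalg.lstsq(A, longevity, rcond=None)
fitted = A @ coef
ss_res = np.sum((longevity - fitted) ** 2)
ss_tot = np.sum((longevity - longevity.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

# Overall F statistic for k regressors
k = scores.shape[1]
f_stat = (r2 / k) / ((1 - r2) / (n - k - 1))
print(f"R^2 = {r2:.3f}, F = {f_stat:.1f} on {k} and {n - k - 1} degrees of freedom")
```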

The regression model is found highly significant as expected.

Updating the factor model with new data

Since the latent variables are not observed, this is a bit tricky once everything is encoded in terms of them. When new data (income, tenure, size, weight, hours) becomes available, it needs to be translated into the latent variables career and bodyshape before it can be fed into the regression equation.

The following function does this task:
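The original function is not reproduced here. As a sketch of the idea, the hypothetical helpers `fit_factor_model` and `factor_scores` below store the training means, standard deviations and score weights, then apply them to any new observations passed as the second argument. Scores are principal-component based, and the data-generating parameters and the example observation are assumptions.

```python
import numpy as np

def fit_factor_model(train, n_factors=2):
    """Learn scaling and score weights from training data (rows = observations)."""
    train = np.asarray(train, dtype=float)
    eigval, eigvec = np.linalg.eigh(np.corrcoef(train, rowvar=False))
    top = np.argsort(eigval)[::-1][:n_factors]
    return {"mean": train.mean(axis=0),
            "std": train.std(axis=0, ddof=1),
            "weights": eigvec[:, top] / np.sqrt(eigval[top])}

def factor_scores(model, new_data):
    """Translate observations into factor scores using the training scaling."""
    z = (np.asarray(new_data, dtype=float) - model["mean"]) / model["std"]
    return z @ model["weights"]

# Simulated training data; dependency slopes and noise levels are assumed values
rng = np.random.default_rng(42)
n = 5000
size = rng.normal(175, 10, n)
weight = 0.5 * size - 15 + rng.normal(0, 3, n)
hours = rng.normal(40, 5, n)
income = 20 * hours + rng.normal(0, 60, n)
tenure = 0.25 * hours + rng.normal(0, 1, n)
train = np.column_stack([hours, income, tenure, size, weight])

model = fit_factor_model(train)
train_scores = factor_scores(model, train)      # reproduces the training scores

# A hypothetical new observation: hours, income, tenure, size, weight
new_obs = np.array([[45.0, 950.0, 12.0, 180.0, 78.0]])
new_scores = factor_scores(model, new_obs)
print(np.round(new_scores, 3))
```

The important design choice is that new data is scaled with the training means and standard deviations, not with its own, so that new scores live on the same scale as the scores the regression was fitted on.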

To check whether it works properly, a function call with the original data as the second argument reproduces the factors as before. To use it in real-world applications, simply replace the second argument with a data frame of new data, and the function returns the scores that can be fed into the regression equation.
