Multicollinearity: Why It Occurs and How to Remove It


What is multicollinearity?

Multicollinearity is a statistical situation that occurs in a regression model when two or more predictor (also called explanatory or independent) variables are highly correlated with each other. In this situation, there exists a strong linear relationship among the independent variables.

For example, consider the multiple regression model Y = α + β1X1 + β2X2 + β3X3 + ε, where X1, X2, and X3 are the independent variables. If X1, X2, and X3 are correlated with each other, this situation is called multicollinearity.
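As a minimal sketch (my own simulated data, not from the article), the snippet below generates three predictors of this kind, where X2 and X3 are driven almost entirely by X1, and prints their correlation matrix; off-diagonal values near ±1 are the numerical signature of multicollinearity.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# X2 and X3 are nearly linear functions of X1, so the three predictors are
# strongly interrelated rather than independent.
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + rng.normal(scale=0.1, size=n)
x3 = -0.8 * x1 + rng.normal(scale=0.1, size=n)

X = np.column_stack([x1, x2, x3])
print(np.round(np.corrcoef(X, rowvar=False), 3))  # off-diagonal entries near ±1
```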

Why does multicollinearity occur?

We know the basic assumptions of linear regression, one of the most important of which is that the predictor (independent) variables are independent of each other. When this assumption is violated, multicollinearity occurs and the independent variables turn out to be highly correlated with each other.

[Figure: scatter plot illustrating multicollinearity]

More precisely,

The use and interpretation of a multiple regression model depend implicitly on the assumption that the explanatory variables are not strongly interrelated. In most regression applications the explanatory variables are not orthogonal. Usually, the lack of orthogonality is not serious enough to affect the analysis. However, in some situations, the explanatory variables are so strongly interrelated that the regression results are ambiguous. Typically, it is impossible to estimate the unique effects of individual variables in the regression equation. The estimated values of the coefficients are very sensitive to slight changes in the data and to the addition or deletion of variables in the equation. The regression coefficients have large sampling errors which affect both inference and forecasting that is based on the regression model. The condition of severe non-orthogonality is also referred to as the problem of multicollinearity.
The presence of multicollinearity has a number of potentially serious effects on the least-squares estimates of regression coefficients. Multicollinearity also tends to produce least squares estimates that are too large in absolute value.
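The sensitivity and large sampling errors described above are easy to reproduce. In the sketch below (the simulation setup is my own, not from the article), the same two-regressor least-squares model is refitted on many simulated samples, once with nearly collinear regressors and once with independent ones, and the spread of the estimated coefficient on x1 is compared.

```python
import numpy as np

rng = np.random.default_rng(1)

def coef_spread(collinear, reps=500, n=100):
    """Refit OLS on many simulated samples; return the std of the x1 coefficient."""
    estimates = []
    for _ in range(reps):
        x1 = rng.normal(size=n)
        if collinear:
            x2 = x1 + rng.normal(scale=0.05, size=n)   # nearly identical to x1
        else:
            x2 = rng.normal(size=n)                    # independent of x1
        y = 1.0 + 2.0 * x1 + 2.0 * x2 + rng.normal(size=n)
        X = np.column_stack([np.ones(n), x1, x2])
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)   # ordinary least squares
        estimates.append(beta[1])                      # coefficient on x1
    return np.std(estimates)

print("spread (std) of b1, collinear design: ", coef_spread(collinear=True))
print("spread (std) of b1, orthogonal design:", coef_spread(collinear=False))
```

The collinear design typically shows a coefficient spread many times larger than the orthogonal one, which is exactly the large sampling error that makes inference and forecasting unreliable.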

How to remove multicollinearity?

There are several remedial measures by which we can remove, or at least reduce, multicollinearity. The methods are as follows.


Remedial Measures

  • Collection of additional data: Collecting additional data has been suggested as one of the methods of combating multicollinearity. The additional data should be collected in a manner designed to break up the multicollinearity in the existing data.
  • Model respecification: Multicollinearity is often caused by the choice of model, such as when two highly correlated regressors are used in the regression equation. In these situations, some respecification of the regression equation may lessen the impact of multicollinearity. One approach to respecification is to redefine the regressors. For example, if x1, x2 and x3 are nearly linearly dependent, it may be possible to find some function such as x = (x1+x2)/x3 or x = x1x2x3 that preserves the information content in the original regressors but reduces the multicollinearity (a toy sketch of this idea appears after this list).
  • Ridge regression: When the method of least squares is used, the parameter estimates are unbiased, but under severe multicollinearity they can have very large variance. A number of procedures have therefore been developed for obtaining biased estimators of regression coefficients that tackle the problem of multicollinearity; one of these procedures is ridge regression. The ridge estimators are found by solving a slightly modified version of the normal equations: a small quantity is added to each diagonal element of the X'X matrix (a short numerical sketch appears after this list).
  • Simultaneous-equation methods: Various simultaneous-equation methods are also used as remedial measures, such as indirect least squares (ILS), two-stage least squares (2SLS), limited-information maximum likelihood (LIML), and full-information maximum likelihood (FIML).
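For the respecification item above, here is a toy sketch (my own construction; the "size factor" and variable roles are purely illustrative). Three regressors are driven by a common factor, so they are nearly linearly dependent, and they are then replaced by the single composite regressor x = x1*x2*x3 mentioned above, which keeps the shared information while leaving nothing for it to be collinear with.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200

scale = rng.uniform(1.0, 3.0, size=n)             # common "size" factor behind all three
x1 = scale * rng.uniform(0.9, 1.1, size=n)        # e.g. length
x2 = 2.0 * scale * rng.uniform(0.9, 1.1, size=n)  # e.g. width
x3 = 0.5 * scale * rng.uniform(0.9, 1.1, size=n)  # e.g. height

print(np.round(np.corrcoef([x1, x2, x3]), 3))     # strongly correlated regressors

# Replace the three nearly dependent regressors with one composite regressor.
x = x1 * x2 * x3
print(round(float(np.corrcoef(x, scale**3)[0, 1]), 3))  # composite still tracks the driver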
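For the ridge-regression item, the following is a minimal sketch of the modified normal equations, assuming the predictors have already been put on comparable scales; the penalty value k = 0.1 is chosen purely for illustration.

```python
import numpy as np

def ridge_coefficients(X, y, k=0.1):
    """Solve (X'X + kI) b = X'y, i.e. the slightly modified normal equations."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + k * np.eye(p), X.T @ y)

# Usage with two nearly collinear predictors.
rng = np.random.default_rng(3)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)          # nearly identical to x1
X = np.column_stack([x1, x2])
y = 3.0 * x1 + 3.0 * x2 + rng.normal(size=n)

print("least squares:", np.linalg.solve(X.T @ X, X.T @ y))  # unstable under collinearity
print("ridge, k=0.1 :", ridge_coefficients(X, y, k=0.1))    # shrunken, more stable
```

Compared with the ordinary least-squares solution, the ridge estimates are shrunk toward zero and are far less sensitive to the near-dependence between x1 and x2, at the cost of a small bias.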
