# Multicollinearity – A Marketing Researcher’s Curse Word

by Audrey Guinn, Ph.D.

##### What is Multicollinearity?

Multicollinearity (also known as collinearity) occurs when two or more variables are very highly correlated. Singularity, a more serious form of multicollinearity, occurs when two or more variables are redundant, where one variable is a linear combination of the others.

High correlations can occur in several ways. First, the two variables could be naturally highly correlated. Second, the researcher could have created the collinearity by repeating the same question in a survey with slightly different wording and then using those highly similar variables in the analysis. Third, the researcher could have created singularity through an error. One example of error-induced singularity is creating dummy-coded variables for all K categories (instead of K-1 categories) and using them all in the analysis. Another is creating a composite by combining variables and then using both the composite and the original variables in the analysis.
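The dummy-coding error above is easy to demonstrate. Here is a minimal sketch (the data and category labels are invented for illustration): with an intercept in the model, all K dummy columns sum to the intercept column, so the design matrix becomes singular.

```python
import numpy as np

# Hypothetical 3-category variable, dummy coded into ALL K = 3 columns.
categories = np.array([0, 1, 2, 1, 0, 2])
dummies = np.eye(3)[categories]                # one column per category
intercept = np.ones((len(categories), 1))

X_all = np.hstack([intercept, dummies])               # intercept + K dummies
X_k_minus_1 = np.hstack([intercept, dummies[:, 1:]])  # intercept + K-1 dummies

# X_all has 4 columns but only rank 3: the dummies sum to the intercept,
# so one column is a linear combination of the others and the matrix
# cannot be inverted. Dropping one dummy restores full column rank.
print(np.linalg.matrix_rank(X_all))            # rank-deficient
print(np.linalg.matrix_rank(X_k_minus_1))      # full column rank
```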

##### What Kind of Curse Word is Multicollinearity?

I would vote that multicollinearity is somewhere between PG-13 movie curse words and those that make old ladies gasp and parents wash their kid’s mouth out with soap. Listed below is the havoc created by multicollinearity.

1. Increased collinearity inflates standard errors and produces unstable b coefficients. A regression coefficient tells us how much the outcome measure changes for a 1-unit change in the predictor variable, when all other predictors are held constant. That interpretation assumes the predictors are independent of one another; when multicollinearity is present, they are not. Instead, as one predictor changes, one or more other predictors change with it, so the coefficients can shift dramatically depending on which predictors are in the analysis. A related phenomenon is the suppressor effect, which is most noticeable when a variable known to have a significant positive bivariate correlation with the outcome variable suddenly shows a negative or non-significant relationship with the outcome variable in the full model. Unfortunately, inflated error and unstable bs weaken the overall analysis. This means finding significance will be much more difficult, and trusting the results, even more so.

2. Matrix inversion (the matrix analogue of division) becomes unstable when multicollinearity is present and impossible when singularity is present.

3. The size of the multiple correlation coefficient R is limited when predictors are highly correlated, because overlapping predictors contribute little unique explained variance.

4. When two variables are very highly correlated, it is impossible to know which one is more important to predicting the outcome due to their shared variance.
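The coefficient instability in point 1 shows up readily in a small simulation. Everything below is made-up illustration (not data from the article): y depends only on x1, while x2 is a near-duplicate of x1.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)   # near-duplicate: r(x1, x2) ~ .999
y = 2.0 * x1 + rng.normal(scale=1.0, size=n)

X_alone = np.column_stack([np.ones(n), x1])      # intercept + x1
X_both = np.column_stack([np.ones(n), x1, x2])   # intercept + both

b_alone, *_ = np.linalg.lstsq(X_alone, y, rcond=None)
b_both, *_ = np.linalg.lstsq(X_both, y, rcond=None)

# Alone, the b for x1 lands near the true value of 2. With the
# near-duplicate included, the effect can be split arbitrarily
# between x1 and x2; only their sum is estimated stably.
print(b_alone[1])
print(b_both[1], b_both[2])
```

Rerunning with different random seeds makes the individual coefficients in the second model jump around while the first model stays put, which is exactly the "unstable bs" problem.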

##### Is There Multicollinearity in the Data?

There are several ways to check for collinearity in the data.

1. Examine the correlation coefficients of all predictor variables in the analysis. Are any of them above .80? While this first method is simple, bivariate correlations often miss multicollinearity (when there is collinearity between more than two predictors).

2. Select collinearity diagnostics when conducting a regression in SPSS or other statistical software. This will give VIF (Variance Inflation Factor), Tolerance statistics, and the Condition Index in the output. VIFs greater than 10 and Tolerance less than .10 indicate a problem with multicollinearity, whereas a Condition Index of 30 or more indicates multicollinearity.

3. If multicollinearity is indicated by the VIF, Tolerance, or Condition Index, follow up by investigating the Variance Proportions. Variables that load highly on the Variance Proportions of a dimension whose Condition Index is 30 or above are the problem variables.

4. Another way to find the offending variables is to conduct a factor analysis using the same predictor variables and investigate the communalities. Variables with an initial communality of .5 or greater indicate multicollinearity.

5. If your software throws errors or aborts the analysis when you examine relationships between variables, that is an indicator of a multicollinearity problem. Singularity is most easily identified this way.
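For readers without SPSS handy, check #2 can be reproduced by hand: each predictor's VIF is 1/(1 − R²), where R² comes from regressing that predictor on all the other predictors, and Tolerance is simply 1/VIF. The predictors below are invented for illustration.

```python
import numpy as np

def vif_and_tolerance(X):
    """X: (n, p) array of predictors. Returns (VIF, Tolerance) per column."""
    n, p = X.shape
    vifs = []
    for j in range(p):
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])    # intercept + other predictors
        coef, *_ = np.linalg.lstsq(A, X[:, j], rcond=None)
        resid = X[:, j] - A @ coef
        r2 = 1.0 - (resid @ resid) / ((X[:, j] - X[:, j].mean()) ** 2).sum()
        vifs.append(1.0 / (1.0 - r2))
    vifs = np.array(vifs)
    return vifs, 1.0 / vifs

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.1, size=100)   # nearly redundant with x1
x3 = rng.normal(size=100)                   # independent of the other two
X = np.column_stack([x1, x2, x3])

vif, tol = vif_and_tolerance(X)
# x1 and x2 should show VIF well above 10 and Tolerance below .10,
# flagging the collinear pair; x3 should look fine.
print(vif, tol)
```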

##### #%\$&!! There’s Multicollinearity, Now What?

Although it may be tempting to let the curse words fly, throw a few non-essential things, and scream into the paging system, hold fast and try these things first.

1. Get rid of any variables causing a linear combination (singularity). Make sure the analysis uses only the composite variable, or only the K-1 dummy variables, rather than all of the variables.

2. Combine highly correlated variables. This can be accomplished by creating a summed or averaged composite variable of the two. Use the composite instead of the highly correlated variables.

3. Delete the variable with the highest variance proportion.

4. Conduct a Principal Components analysis and use the components in the variables’ place.

5. The final option is to mean center the variables. Centering helps mainly when the collinearity comes from interaction or polynomial terms built from the same predictors.
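Remedies #2 and #4 can be sketched with the same kind of made-up data as before: replace a highly correlated pair with an averaged composite, or with its principal components, which are uncorrelated by construction.

```python
import numpy as np

rng = np.random.default_rng(1)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.1, size=100)   # highly correlated pair

# Remedy 2: replace the pair with a single averaged composite.
composite = (x1 + x2) / 2.0

# Remedy 4: principal components of the standardized pair,
# computed via the SVD of the standardized data matrix.
X = np.column_stack([x1, x2])
Z = (X - X.mean(axis=0)) / X.std(axis=0)
U, S, Vt = np.linalg.svd(Z, full_matrices=False)
scores = U * S                              # component scores, one column each

# The component scores are orthogonal, so using them in place of
# x1 and x2 removes the collinearity entirely.
corr = np.corrcoef(scores[:, 0], scores[:, 1])[0, 1]
print(corr)
```

Using the components does change interpretation, of course: each component is a blend of the original predictors rather than a single survey item.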

##### Conclusion

Multicollinearity can be a real curse word for researchers because it weakens analyses, making it more difficult to find significant effects. Or worse, in the case of singularity, it can completely crash the statistical analyses being conducted. Although multicollinearity can make a researcher feel like cursing and possibly throwing things, it doesn’t have to mean we pack up our bags and go home. Once the offending variables are located through a little investigating, several options exist to take care of the problem. As Theodore Isaac Rubin states, “The problem is not that there are problems. The problem is expecting otherwise and thinking that having problems is a problem.” It is important for researchers not to ignore multicollinearity, but to deal with it in a way that minimizes its impact on the statistical tests being examined.