Stop using bivariate correlations for variable selection

November 26, 2021•144 words

The real problem comes into play here: bivariate comparisons select the wrong variables because they over-emphasize the marginal relationship between each predictor and the outcome. Because the selected model is incomplete and important variables are omitted, the resulting parameter estimates are biased and inaccurate.

A bivariate comparison is a terrible way to select relevant variables for a high-dimensional model, because the function of interest relates all of the predictors jointly to the outcome. On its own, it can neither rule a predictor in nor rule it out. A better approach is to build the model from domain knowledge and a model of the expected data-generating process. If the process is to remain strictly data driven, methods like cross-validation or AIC/BIC provide better measures of model quality and predictor importance than the bivariate correlation coefficient.
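To make the point concrete, here is a minimal simulation sketch. Everything in it is an assumption made for illustration and is not taken from any real dataset: the variable names `x1`, `x2`, `x3`, the data-generating process, the sample size, and the use of a Gaussian AIC computed from ordinary least squares. By construction, `x1` truly affects the outcome but has essentially zero marginal correlation with it, while `x3` is only a noisy copy of `x2` and still correlates clearly with the outcome.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# Hypothetical data-generating process (an assumption for illustration):
# y depends on x1 and x2; x3 is only a noisy proxy for x2.
x1 = rng.normal(size=n)
x2 = -x1 + rng.normal(size=n)            # strongly (negatively) related to x1
x3 = x2 + rng.normal(scale=0.5, size=n)  # no effect on y once x2 is known
y = x1 + x2 + rng.normal(scale=0.5, size=n)

X = np.column_stack([x1, x2, x3])
names = ["x1", "x2", "x3"]

# Bivariate screening: correlation of each predictor with y, one at a time.
marginal_r = {name: np.corrcoef(X[:, j], y)[0, 1] for j, name in enumerate(names)}


def gaussian_aic(cols):
    """AIC (up to an additive constant) of an OLS fit on the given columns."""
    design = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    rss = np.sum((y - design @ beta) ** 2)
    k = design.shape[1] + 1  # regression coefficients plus the noise variance
    return n * np.log(rss / n) + 2 * k


# Score every subset of predictors, including the intercept-only model.
subsets = [
    cols
    for r in range(len(names) + 1)
    for cols in itertools.combinations(range(len(names)), r)
]
aic = {tuple(names[j] for j in cols): gaussian_aic(cols) for cols in subsets}
best = min(aic, key=aic.get)

print("marginal correlations:", marginal_r)
print("lowest-AIC subset:", best)
```

In this setup, a correlation screen would tend to discard `x1` and keep `x3`, which is backwards relative to how the data were generated. Comparing whole models by AIC keeps `x1` and `x2`, because dropping either one sharply worsens the fit. Exhaustive subset search is only practical for a handful of predictors; it is used here purely to keep the sketch short.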

More from Steve Harris
All posts

Boring machine learning is where it's at