Stop using bivariate correlations for variable selection
November 26, 2021•144 words
Stop using bivariate correlations for variable selection
The real problem comes into play here — the bivariate comparsions selects for the wrong variables by over-emphasizing the relationship between the marginal distributions. Because the selected model is incomplete and important variables are omitted, the resulting parameter estimates are biased and inaccurate.
The bivariate comparsion is a terrible way to select relevant variables for a highly dimensional model as the function of interest is relating the all of the predictors to the outcome. It can neither rule in nor rule out a predictor from a model. A better approach is to generate the model using domain knowledge and a model of the expected data generating process. If the process is to remain strictly data driven, methods like cross-validation or AIC/BIC provide better measures of model quality and predictor importance than the bivariate correlation coefficient.