Linear regression with multiple variables

I was working on an insurance dataset in Python that had both categorical and numeric variables. To fit a linear regression model, I converted the categorical variables to dummy variables, scaled the whole data frame, split it into training and testing sets, and fitted the model on the training data. I then found that the VIF of one variable was greater than 5. While looking for a way to reduce this high VIF, I came across the Recursive Feature Elimination (RFE) method, which separates essential variables from non-essential ones. The variable with the high VIF was classified as essential, so my question is: should I drop that variable because of its high VIF, or keep it as it is? Is a high VIF the only criterion for dropping a variable?
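
For reference, here is a minimal sketch of the VIF computation and the RFE step, run on toy data standing in for the insurance dataset (the column names `age`, `children`, `bmi`, the target name `charges`, and the choice of two features for RFE are all illustrative, not from my actual data):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

def compute_vif(X: pd.DataFrame) -> pd.DataFrame:
    """VIF of each predictor; a constant is added so a missing intercept does not inflate the values."""
    Xc = sm.add_constant(X)
    vifs = [variance_inflation_factor(Xc.values, i) for i in range(1, Xc.shape[1])]
    return pd.DataFrame({"feature": X.columns, "VIF": vifs}).sort_values("VIF", ascending=False)

# Toy stand-in for the scaled training data; the real X_train/y_train come from
# the dummy-encoding / scaling / train-test split described above.
rng = np.random.default_rng(0)
X_train = pd.DataFrame({"age": rng.normal(size=200), "children": rng.normal(size=200)})
X_train["bmi"] = 0.9 * X_train["age"] + 0.1 * rng.normal(size=200)  # deliberately collinear
y_train = (2.0 * X_train["bmi"] + rng.normal(size=200)).rename("charges")

print(compute_vif(X_train))  # "age" and "bmi" come out with VIF well above 5

# RFE recursively drops the weakest feature; n_features_to_select=2 is illustrative.
rfe = RFE(estimator=LinearRegression(), n_features_to_select=2).fit(X_train, y_train)
print(X_train.columns[rfe.support_].tolist())  # the features RFE deems essential
```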

1 comment

  1. If this particular variable is essential, it should be included in the model. You may go ahead and build the model with it. However, you should also check the correlation of this variable with the other predictor variables, and drop whichever variable is highly correlated with it, because highly correlated variables provide similar information and hence lead to multicollinearity (see the correlation sketch after this comment).

    Also, after building the model, check whether this particular variable is statistically significant, and take appropriate action in the next version of the model (see the OLS summary sketch after this comment).

    Additionally, try Lasso regression: it is an intrinsic feature-selection method, so you can check whether it keeps the variable in question (see the Lasso sketch after this comment).
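
A minimal sketch of the correlation check suggested above, continuing from the toy data in the question and using `bmi` as a hypothetical stand-in for the high-VIF variable:

```python
# Absolute correlation of the high-VIF variable ("bmi" here) with the other
# predictors; the most strongly correlated one is the natural candidate to drop.
corr = X_train.corr()["bmi"].drop("bmi").abs().sort_values(ascending=False)
print(corr)
```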
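
A sketch of the significance check, again on the toy data: statsmodels' OLS summary reports a p-value for each coefficient:

```python
import statsmodels.api as sm

# Per-coefficient p-values from the OLS summary; a common convention is to
# revisit variables with p >= 0.05 in the next version of the model.
ols = sm.OLS(y_train, sm.add_constant(X_train)).fit()
print(ols.summary())
```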
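
And a sketch of the Lasso suggestion, also on the toy data: `LassoCV` chooses the penalty strength by cross-validation, and any coefficient shrunk exactly to zero marks a feature Lasso has dropped:

```python
import pandas as pd
from sklearn.linear_model import LassoCV

# LassoCV picks the penalty strength by cross-validation; coefficients shrunk
# exactly to zero mark the features Lasso has effectively dropped.
lasso = LassoCV(cv=5, random_state=0).fit(X_train, y_train)
coefs = pd.Series(lasso.coef_, index=X_train.columns)
print("kept:   ", coefs[coefs != 0].index.tolist())
print("dropped:", coefs[coefs == 0].index.tolist())
```

Note that Lasso is sensitive to feature scale, so it benefits from the scaling step already described in the question.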
