## Scaling for numeric variables

## Suchita

When we scale the data before the train-test split, we cause indirect data leakage. The scaler computes its statistics on the full dataset: the global mean and standard deviation for standardisation, or the global minimum and maximum for normalisation. Some information about the hold-out sample is therefore captured in these summary statistics and made available to the model through the scaled training data.

Ideally, the transformation should be fit using the training dataset only, and the fitted transform should then be applied to both the train and test datasets. This avoids indirect data leakage and guards against over-optimistic performance estimates.

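For example:

    from sklearn.preprocessing import MinMaxScaler

    my_scaler = MinMaxScaler()
    # Learn the minimum and maximum from the training data only
    my_scaler.fit(X_Train)
    # Apply the same fitted scaling to both sets
    X_Train = my_scaler.transform(X_Train)
    X_Test = my_scaler.transform(X_Test)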
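To make the order of operations concrete, here is a minimal end-to-end sketch. It assumes scikit-learn and uses synthetic data and StandardScaler purely for illustration; the variable names are hypothetical and not part of the original answer.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data purely for illustration (not from the original answer).
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(200, 3))

# Split first, so the hold-out rows never influence the scaler.
X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # statistics learned from train only
X_test_scaled = scaler.transform(X_test)        # same statistics reused on test

# The stored mean equals the training mean rather than the global mean.
print(np.allclose(scaler.mean_, X_train.mean(axis=0)))
print(np.allclose(scaler.mean_, X.mean(axis=0)))
```

Because the test rows are transformed with training statistics, the scaled test columns will generally not have a mean of exactly 0, which is expected.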

## Linear regression with multiple variables

## Suchita

If this particular variable is essential, it should be included in the model, so you may go ahead and build the model with it. However, you should check its correlation with the other predictor variables and drop any predictor that is highly correlated with it, because highly correlated variables carry similar information and lead to multicollinearity.
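The correlation check might look like the sketch below. It assumes pandas, synthetic data, and a hypothetical column name `x_key` standing in for the essential variable; the 0.8 cut-off is an assumption, not a rule.

```python
import numpy as np
import pandas as pd

# Synthetic predictors for illustration; "x_key" stands in for the essential variable.
rng = np.random.default_rng(1)
n = 300
x_key = rng.normal(size=n)
df = pd.DataFrame({
    "x_key": x_key,
    "x_dup": x_key + rng.normal(scale=0.1, size=n),  # nearly a copy of x_key
    "x_other": rng.normal(size=n),                   # unrelated predictor
})

# Absolute correlation of every other predictor with the essential variable.
corr_with_key = df.corr()["x_key"].drop("x_key").abs()

threshold = 0.8  # assumed cut-off; choose one that suits the problem
to_drop = corr_with_key[corr_with_key > threshold].index.tolist()
df_reduced = df.drop(columns=to_drop)
print(to_drop)
```

Note that the essential variable itself is kept; only the predictors that largely duplicate it are removed.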

Also, after building the model, check whether this variable is statistically significant and take appropriate action in the next version of the model.

Additionally, try Lasso regression. It performs feature selection intrinsically by shrinking some coefficients to exactly zero, so you can see whether the fitted model has retained the variable in question.
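A rough sketch of the Lasso check follows, assuming scikit-learn and synthetic data. The feature names and the penalty strength `alpha=0.1` are illustrative assumptions; in practice alpha would usually be tuned, for example by cross-validation.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
n = 500
X = rng.normal(size=(n, 4))
# Only the first two columns drive y; the last two are pure noise.
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=n)

# Lasso penalises coefficient size, so predictors should share a common scale.
X_scaled = StandardScaler().fit_transform(X)
lasso = Lasso(alpha=0.1).fit(X_scaled, y)

feature_names = ["x0", "x1", "x2", "x3"]  # hypothetical names
kept = [name for name, coef in zip(feature_names, lasso.coef_) if coef != 0.0]
print(dict(zip(feature_names, lasso.coef_.round(3))))
print(kept)
```

If the variable in question appears in `kept`, Lasso has retained it; a coefficient of exactly zero means the penalty has excluded it at this value of alpha.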

## Intercept in linear regression model

## Suchita

The intercept is the predicted value of the target y when all the predictors X1, X2, X3, ..., Xn are 0. In 2-dimensional space, the y-intercept is the point where the regression line cuts the y-axis (x = 0 at this point).

For example, suppose we fit a regression line to predict the marks obtained in a test from the number of hours of study:

y = 20 + 0.6x, where y is the marks and x is the number of hours of study.

Here 20 is the y-intercept: a student who does not study at all (x = 0) is predicted to obtain 20 marks.

However, when zero lies outside the range of predictor values used to build the model, the y-intercept does not have a meaningful interpretation in the context of the problem.

By default, the OLS class in statsmodels (when given arrays rather than a formula) does not add an intercept, so the fitted line passes through the origin.

To include a y-intercept, we add a constant column to the predictors with the function add_constant().

## P-Value in Linear Regression

## Suchita

If the p-value for the F-statistic is less than 0.05, we reject the null hypothesis that all the beta coefficients are zero, which means that at least one beta coefficient is non-zero. We conclude that the overall model is significant.
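Written out, the test being described takes the standard form below. The symbols are assumptions introduced here for illustration: n is the number of predictors, N the number of observations, TSS the total sum of squares and RSS the residual sum of squares.

$$
H_0:\ \beta_1 = \beta_2 = \dots = \beta_n = 0
\qquad
H_1:\ \text{at least one } \beta_j \neq 0
$$

$$
F = \frac{(\mathrm{TSS} - \mathrm{RSS})/n}{\mathrm{RSS}/(N - n - 1)}
$$

A large F (equivalently, a small p-value) indicates that the predictors jointly explain more variation than would be expected by chance.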

Having concluded from the F-statistic that the overall model is statistically significant, we still need to identify whether any of the predictors included in the model are not related to the response variable.

We check the t-statistic of each predictor to see which predictor variables are NOT related to the response variable and which are statistically significant predictors of it.

Conversely, one may ask: if we are checking the t-statistics of the individual predictors anyway, and the model could be considered significant as soon as one predictor is significant, why do we check the F-statistic for the overall validity of the model?

That is because, at the 95% confidence level, about 5% of predictors that have no real relationship with the response will still appear significant by sheer chance. This matters especially for models with many predictor variables. The F-statistic does not suffer from this problem, as it adjusts for the number of predictor variables. Hence we confirm the overall validity of the model using the F-statistic.

## Model Overfitting or Underfitting

## Suchita

If the model performs well on the training dataset but does not perform well on the test set, it is an indication of overfitting, as the model is unable to generalise to unseen data.

If the model performs poorly on both the training and the test set, then the model is underfitting. This indicates that the model is neither able to capture the underlying patterns in the data nor able to generalise to unseen data.
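Both situations can be reproduced in a short sketch. It assumes scikit-learn and a synthetic noisy sine curve; the choice of a straight line as the too-simple model and an unpruned decision tree as the too-flexible one is purely illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic non-linear data: a sine curve with noise.
rng = np.random.default_rng(5)
X = rng.uniform(0, 6, size=(400, 1))
y = np.sin(2 * X[:, 0]) + rng.normal(scale=0.3, size=400)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

models = {
    "linear (too simple)": LinearRegression(),
    "deep tree (too flexible)": DecisionTreeRegressor(random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # score() returns R^2; compare it on the training set and the test set.
    scores[name] = (model.score(X_train, y_train), model.score(X_test, y_test))
    print(name, scores[name])
```

The straight line scores poorly on both sets (underfitting), while the unpruned tree fits the training set almost perfectly but scores noticeably lower on the test set (overfitting).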
