Scaling for numeric variables

Should scaling for numeric variables be done before or after splitting the data into train and test datasets?


  1. It is possible to scale the whole dataset before the train/test split, but our test data is meant to be unseen data. We use it only for later predictions, so we don't want to access it during the training stage. It is therefore preferred to fit the scaler after the train/test split.

  2. When we scale the data prior to the train-test split, we cause indirect data leakage. The scaler would use the global mean and standard deviation when standardising, or the global minimum and maximum when normalising. Some information about the hold-out sample is captured in these summary statistics and made available to the model through the training dataset.

    Ideally, the transformations should be fit using the training dataset only. The fitted transform should then be applied to both the train and test datasets. This avoids indirect data leakage and reduces overly optimistic results when evaluating on the test dataset.

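    For example:

    from sklearn.preprocessing import MinMaxScaler

    my_scaler = MinMaxScaler()

    X_Train = my_scaler.fit_transform(X_Train)  # fit on the training data only

    X_Test = my_scaler.transform(X_Test)  # reuse the training minimum and maximum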
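To make the leakage concrete, here is a minimal sketch in plain Python that mimics the arithmetic of min-max normalisation by hand. The helper names `fit_min_max` and `apply_min_max` and the data values are made up for illustration; they are not part of scikit-learn. It compares parameters learned from all the data with parameters learned from the training data only.

```python
# Hand-rolled min-max scaling. fit_min_max and apply_min_max are
# hypothetical helper names for illustration, not scikit-learn functions.

def fit_min_max(values):
    """Learn the scaling parameters (minimum and maximum) from the data."""
    return min(values), max(values)


def apply_min_max(values, lo, hi):
    """Scale values using previously learned parameters."""
    return [(v - lo) / (hi - lo) for v in values]


# Made-up data: the test set holds a value larger than anything in training.
x_train = [1.0, 2.0, 3.0, 5.0]
x_test = [4.0, 9.0]

# Leaky: parameters learned from train and test together.
leaky_lo, leaky_hi = fit_min_max(x_train + x_test)
leaky_train = apply_min_max(x_train, leaky_lo, leaky_hi)

# Preferred: parameters learned from the training data only,
# then reused unchanged for the test data.
lo, hi = fit_min_max(x_train)
clean_train = apply_min_max(x_train, lo, hi)
clean_test = apply_min_max(x_test, lo, hi)

print(leaky_hi, hi)
print(leaky_train)
print(clean_train)
print(clean_test)
```

In the leaky version the largest training value is scaled to 0.5 rather than 1.0, which tells the model that larger values exist in the hold-out data. With the training-only fit, the test value 9.0 lands outside the 0 to 1 range; that is expected, because the scaler never saw it.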
