Data Leakage

Data Leakage is a common issue in Machine Learning where information that would not be available at prediction time, such as data from outside the training dataset, is inadvertently used to create the model. The model then performs exceptionally well during training and validation but fails on new, unseen data, because it has “cheated” by relying on information it wouldn’t have in real-world predictions.

Why Data Leakage is a Problem

  • Leads to overly optimistic performance metrics
  • Produces models that do not generalize to real-world data
  • Can result in wrong business or scientific decisions

Common Causes of Data Leakage

  1. Including Future Data: Using data that would not be available at the time of prediction (e.g., using a future sales figure to predict current demand).
  2. Feature Leakage: Including features that are derived from, or act as a proxy for, the target variable (e.g., including an “approval date” column, which is only filled in once a loan is approved, when predicting loan approval).
  3. Improper Data Splitting: Failing to separate training and test sets properly, e.g., using test data to scale or normalize training data.
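
The improper-splitting case can be made concrete with a small sketch. The data here is synthetic and the 80/20 split is an arbitrary assumption for illustration; the point is only that the scaling statistics should come from the training rows alone.

```python
import numpy as np

# Synthetic features, generated purely for illustration.
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(100, 3))

# Split first: the last 20 rows play the role of the held-out test set.
X_train, X_test = X[:80], X[80:]

# Leaky: statistics computed on all rows, so the test rows influence the scaling.
leaky_mean, leaky_std = X.mean(axis=0), X.std(axis=0)

# Safer: statistics computed on training rows only, then reused for the test rows.
train_mean, train_std = X_train.mean(axis=0), X_train.std(axis=0)
X_train_scaled = (X_train - train_mean) / train_std
X_test_scaled = (X_test - train_mean) / train_std
```

The two sets of statistics differ, which is exactly the information about the test set that the leaky version lets into training.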

How to Prevent Data Leakage

  • Separate Data Before Preprocessing: Split your dataset into training, validation, and test sets before scaling, encoding, or feature engineering.
  • Carefully Review Features: Ensure no feature contains information from the future or is directly derived from the target.
  • Use Cross-Validation Correctly: Fit transformations like scaling or encoding inside the cross-validation loop, using only the training portion of each fold.
  • Audit Data Sources: Understand how each feature is collected and whether it could leak target information.
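
One way to follow the first and third points together is to split first and wrap the preprocessing and the model in a single pipeline, so the scaler is refit on the training portion of every fold. The following is a minimal sketch using scikit-learn on a synthetic dataset; the choice of StandardScaler, LogisticRegression, and five folds is an assumption for illustration, not a recommendation.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic classification data, for illustration only.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Split before any preprocessing so the test set stays untouched.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# The scaler lives inside the pipeline, so cross_val_score refits it
# on the training portion of each fold and never sees the validation fold.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
cv_scores = cross_val_score(model, X_train, y_train, cv=5)

# Final check on the held-out test set, after fitting on training data only.
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
```

Because the pipeline is treated as one estimator, the same object can be passed to a hyperparameter search without reintroducing the leak.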

Examples of Data Leakage

  • Using a column “total_payment” when predicting customer default, if “total_payment” includes post-default information.
  • Normalizing the entire dataset before splitting into training and test sets.
  • Text classification using features that are present in the collected dataset but would not be available in real deployment.
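
The “total_payment” example can be guarded against by computing the feature as of the prediction date. The sketch below uses hypothetical payment records and a hypothetical helper, total_payment_as_of; the record layout and the cutoff date are assumptions made up for illustration.

```python
from datetime import date

# Hypothetical payment records: (customer_id, payment_date, amount).
payments = [
    ("c1", date(2023, 1, 10), 100.0),
    ("c1", date(2023, 2, 10), 100.0),
    ("c1", date(2023, 6, 5), 50.0),   # recorded after the prediction date
    ("c2", date(2023, 1, 15), 200.0),
]


def total_payment_as_of(records, customer_id, cutoff):
    """Sum one customer's payments made strictly before the cutoff date."""
    return sum(
        amount
        for cid, paid_on, amount in records
        if cid == customer_id and paid_on < cutoff
    )


prediction_date = date(2023, 3, 1)

# Leaky: sums every payment, including one made after the prediction date.
leaky_total = sum(amount for cid, _, amount in payments if cid == "c1")

# Safer: only payments that were known at prediction time.
safe_total = total_payment_as_of(payments, "c1", prediction_date)
```

Here the leaky total for customer c1 is 250.0 while the cutoff-aware total is 200.0; the gap is the post-cutoff information the model would not have in deployment.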

Conclusion

Data Leakage can severely compromise a Machine Learning model’s reliability. Careful data handling, feature selection, and proper training-test separation are essential to prevent leakage and ensure models perform accurately on unseen, real-world data.
